2025-01-06 | disclaimer

PY0794: Advanced Quantitative Research Methods

Goals (today)

  • Markdown
  • R
  • Notebooks (Jupyter or R notebooks) OR
  • RStudio
  • Basic statistics.

Goals (course)

KU3 - Formulate balanced judgements with regard to complex, incomplete, ambiguous or sensitive data.

KU 4 - Contribute to the creation of new knowledge and practical applications within the discipline through a critical understanding of the processes through which knowledge is created

IPSA 2 - Use a variety of techniques, advanced research methods and technological skills applicable to psychological enquiry

PVA 1 - Apply relevant ethical, legal and professional practice frameworks (e.g., BPS), and maintain appropriate professional boundaries.

Housekeeping

Bring your laptop if you want to. ONLY class relevant stuff.

Course manual.

Reading list.

Attendance. / Be punctual. / Be engaged.

Exercise after each lecture. Keep up!

Appointment via: thomas.pollet@northumbria.ac.uk

Assignments

The bit you care about most: marks.

30% each (remaining 40% Qual. components)

Deadlines: see Turnitin briefs (1pm) / MRes. Handbook.

Graded via rubrics

Empty .rmd shell which you will turn into a .pdf

Screenshot + .pdf + .rmd

Assignments (bis)

Complete the exercises in class (+ any bonus), you can find them under ‘study skills’.

Questions via elearning environment, but only if you attempted the corresponding exercise.

Why oh, Why R?

Most of you are familiar with using Excel and SPSS?

Why change to R?

There are quite a few reasons:

  • It will make it easier to do repetitive tasks (e.g., formatting tables).
  • R is maintained by statisticians and computer scientists (“Experts”).
  • Some techniques are not available in standalone commercial packages (e.g., SEM in SPSS).
  • Free and open source! (not dependent on often expensive software licenses). RStudio desktop also open source. Unlike SPSS / SAS / Statistica, R will always be free.

Even more benefits

  • Forces you to think about what you are doing. If something doesn’t work the way you like, you can change it.
  • Active help community (e.g., Stack Overflow).
  • Many companies rely on R (e.g., Facebook, Google, Shell, Thomas Cook,… ). Some newspapers rely on R to make their graphics. The BBC makes their figures/infographics with it!

Downsides

  • R can be slower than ‘real’ programming languages (still it’ll beat SPSS comfortably at tasks such as bootstrapping).
  • There is no built-in point-and-click GUI (RStudio and Rcmdr fill part of that gap).

Markdown and RMarkdown

Before we move to R.

Let’s have a look at Markdown, which is a very basic language.

Familiarize yourself:

Work your way through this: www.markdowntutorial.com

In a duo, work through this tutorial.

Cheat sheet to be framed above your bed.

Cheatsheet

Tutorial Done. Woohoo!

Some things which are not (fully) covered in the tutorial.

Mathematical symbols.

sub/super-script: \(x^2\)

Greek symbols. For example alpha –> \(\alpha\) ; beta –> \(\beta\) ; etc.

–> Don’t worry too much about those for now. You can read up later here and here

RStudio.

  • Open RStudio
  • Should look something like this:

RStudio - New file.

Click File –> New File –> R Markdown –> Document –> HTML. (Many other options incl. presentations)

This will be the core in which you will complete your work.

RMarkdown can be rendered as .html / .docx (Word) / .pdf

RMarkdown

Press the knit button!

HTML

Congrats. You generated a webpage!

The bits between the ticks are R code. The text in between is Markdown.

Occasionally there is some HTML or LaTeX code interspersed.

You can make .pdf , which you’ll learn later, but .html is suitable for most purposes including your assignment.

If you want to make PDFs you’ll need a LaTeX distribution. On Windows, you need MiKTeX, installed here in the lab. On macOS, MacTeX. On Linux, TeX Live.

More info here.

From .html to .pdf

RStudio can also make .pdf files directly; more about that later.

For now: simply open your webpage in a browser and save it as .pdf via the print command.

First coding ever.

Delete what’s between the ticks. Enter:

  • Sys.Date() and Click “Run Current Chunk”

Should give you:

Sys.Date()
## [1] "2025-01-06"

Sys.time()

  • Sys.time() and Click “Run Current Chunk”

Should give you:

Sys.time()
## [1] "2025-01-06 11:16:16 GMT"

Install packages.

R is not really a programme but rather works based on packages. Some basic operations can be done in base R, but mostly we will need packages.

First we install some packages. This can be done via the install.packages command. In RStudio you also have a button to click.

Thomas shows Rstudio button

Try installing the ‘ggplot2’ package via the button.

Loading a package.

  • In the Packages tab: tick ggplot2
  • Or:
library(ggplot2) #loading ggplot2

‘#’ to write comments in your code
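
A minimal sketch of the install-then-load pattern, for scripts that should install a package only when it is missing. The helper name ensure_package is hypothetical (it is not part of R); requireNamespace(), install.packages() and library() are standard R functions. The demo uses 'stats', which ships with R, so nothing is downloaded.

```r
# ensure_package() is a hypothetical helper, not a built-in R function.
ensure_package <- function(pkg) {
  # requireNamespace() returns FALSE (instead of stopping) if pkg is missing
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  # character.only = TRUE lets library() accept the package name as a string
  library(pkg, character.only = TRUE)
}

ensure_package("stats")          # 'stats' ships with R, so no install happens
"package:stats" %in% search()    # is the package attached?
```

In your own work you would call it as ensure_package("ggplot2").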

Workspace

Most of these you might not need as you have RStudio!

In RStudio, the loaded objects are listed in the “Environment”-tab in the window in the top right corner.

ls() list objects in workspace

rm(…) remove objects from workspace

rm(list = ls()) remove all objects from workspace

save.image() saves workspace

load(".RData") loads saved workspace (save.image() writes to ".RData" by default)

history() view command history

loadhistory() load command history

savehistory() save command history

R as a calculator.

Use ; if you want several operations.

2+3; 5*7; 3-5
## [1] 5
## [1] 35
## [1] -2

Remember log / exp.

# The log function gives logs to the base e (e = 2.718282)
log(2.71828182845905) 
## [1] 1
exp(1) # the antilog function is exp
## [1] 2.718282
log10(10)
## [1] 1

Rounding

Rounding to the next integer is straightforward: floor() rounds down, ceiling() rounds up.

floor(6.9); ceiling (6.9); floor(-6.9); ceiling(-6.9)
## [1] 6
## [1] 7
## [1] -7
## [1] -6

trunc()

Stripping the decimal part is also straightforward. trunc()

trunc(9.75); trunc(-9.75)
## [1] 9
## [1] -9
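
Not covered above: rounding to a number of decimals. A short sketch with base R's round() and signif(); the value of x is just an illustration. Note that R rounds exact halves to the nearest even number, which may differ from what you learned at school.

```r
x <- 12.63907

round(x)        # nearest whole number: 13
round(x, 2)     # two decimal places: 12.64
signif(x, 3)    # three significant digits: 12.6

# Exact halves go to the nearest even number (IEC 60559 standard)
round(0.5)      # 0
round(1.5)      # 2
round(2.5)      # 2
```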

More mathematical functions.

Further mathematical functions are shown below (Crawley, 2013:17).

Notation of (larger) numbers.

1.3e3 means 1300 because the e3 means ‘move the decimal point 3 places to the right’;

1.5e-2 means 0.015 because the e-2 indicates ‘move the decimal point 2 places to the left’

(For those of you who had advanced maths: complex numbers. 3.6+4.2i is a complex number with real (3.6) and imaginary (4.2) parts, and i is \(\sqrt{-1}\).)

Let’s make a variable

We often want to store things on which we’ll do the calculations.

thomas_age<-35

IMPORTANT

Variable names in R are case sensitive, so Thomas is not the same as thomas.

Variable names should not begin with numbers (e.g. 2x) or symbols (e.g. %x or $x).

Variable names should not contain blank spaces (use body_weight or body.weight not body weight).
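
A small sketch of these naming rules in action. exists() checks whether a name is defined, and base R's make.names() shows how R itself would repair an invalid name.

```r
thomas_age <- 35

exists("thomas_age")   # TRUE
exists("Thomas_age")   # FALSE: different capitalisation, different name

# make.names() turns invalid names into valid ones
make.names(c("body weight", "2x", "body_weight"))
# "body.weight" "X2x" "body_weight"
```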

Make a variable, for one of the people in your duo

Terminology

Mostly for your reference: 1. Object modes 2. Object classes

Object modes (atomic structures)

integer whole numbers (15, 23, 8, 42, 4, 16; type 15L to store a value as an integer, because a plain 15 is stored as a double)

numeric real numbers (double precision: 3.14, 0.0002, 6.022E23)

character text string (“Hello World”, “ROFLMAO”, “DR Pollet”)

logical TRUE/FALSE or T/F
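
You can ask R which mode a value has with typeof(). A quick sketch:

```r
typeof(15)              # "double": plain numbers are stored as doubles
typeof(15L)             # "integer": the L suffix asks for an integer
typeof(3.14)            # "double"
typeof("Hello World")   # "character"
typeof(TRUE)            # "logical"

is.numeric(15L)         # TRUE: integers also count as numeric
```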

Object classes

vector object with atomic mode

factor vector object with discrete groups (ordered/unordered)

matrix 2-dimensional array

array like matrices but multiple dimensions

list vector of components

data.frame “matrix –like” list of variables of same # of rows –> This is the one you care most about.

Many of the errors you potentially run into have to do with objects being the wrong class. (For example, R is expecting a data.frame, but you are offering it a matrix).

Assignment, or how to label a vector (or variable).

<- assign, this is to assign a variable. At your own risk you can also use = . Why?

c(…) combine / concatenate

seq(x) generate a sequence.

[] denotes the position of an element.

Examples.

# Now let's do some very simple examples.
seq(1, 5) # print a sequence from 1 to 5
## [1] 1 2 3 4 5
thomas_height<-188.5
thomas_height # prints the value.
## [1] 188.5
# number of coffee breaks in a week
number_of_coffees_a_week<-c(1,2,0,0,1,4,5) 
number_of_coffees_a_week
## [1] 1 2 0 0 1 4 5
length(number_of_coffees_a_week) # how many elements
## [1] 7

Days of the week

days<-c("Mon","Tues","Wed","Thurs","Friday", "Sat", "Sun")
days
## [1] "Mon"    "Tues"   "Wed"    "Thurs"  "Friday" "Sat"    "Sun"
days[5] # print element number 5 -- Friday
## [1] "Friday"
days[c(1,2,3)] # print elements 1,2,3
## [1] "Mon"  "Tues" "Wed"

Replacing things.

days[5]<-"Fri" # replace Friday with Fri
days
## [1] "Mon"   "Tues"  "Wed"   "Thurs" "Fri"   "Sat"   "Sun"
days[c(6,7)] <- rep("Party time",2) # write Sat and Sun as Party time
days
## [1] "Mon"        "Tues"       "Wed"        "Thurs"      "Fri"       
## [6] "Party time" "Party time"
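
A few more ways to build sequences and to pick elements, as a sketch. The coffee vector is repeated here so the example runs on its own.

```r
seq(0, 10, by = 2.5)    # from 0 to 10 in steps of 2.5: 0 2.5 5 7.5 10
seq_len(5)              # 1 2 3 4 5

number_of_coffees_a_week <- c(1, 2, 0, 0, 1, 4, 5)

number_of_coffees_a_week[-1]                             # everything except element 1
number_of_coffees_a_week[number_of_coffees_a_week > 1]   # select by a logical test: 2 4 5
which(number_of_coffees_a_week == 0)                     # positions of the zeros: 3 4
```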

Try it yourself (in duos)

Use # to annotate your code.

  1. Make an atomic vector with your height. If you don’t know your metric height: ‘guess’.
  2. Make a vector for the months of the year.
  3. Print the 6th and 9th month
  4. Replace July and August with "vacation" in your vector.

Special Values

NULL object of zero length, test with is.null(x)

NA Not Available / missing value, test with is.na(x)

NaN Not a number, test with is.nan(x) (e.g. 0/0, log(-1))

Inf, -Inf Positive/negative infinity, test with is.infinite(x) (e.g. 1/0)
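
A short sketch of how these special values behave. The vector x is made up for illustration. The key point: one missing value makes a mean missing, unless you ask R to remove missings first.

```r
x <- c(1, NA, 3)

is.na(x)                # FALSE TRUE FALSE
mean(x)                 # NA: one missing value makes the mean missing
mean(x, na.rm = TRUE)   # 2: missings removed first

NA == NA                # NA, so test with is.na() rather than ==
is.nan(0 / 0)           # TRUE
is.infinite(1 / 0)      # TRUE
is.null(NULL)           # TRUE
length(NULL)            # 0
```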

is.numeric() / etc.

is.numeric(thomas_age)
## [1] TRUE
is.numeric(days)
## [1] FALSE
is.atomic(thomas_age)
## [1] TRUE
is.character(days)
## [1] TRUE

Checking for missings: is.na()

is.na(thomas_age)
## [1] FALSE
is.na(days)
## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE

Combining vectors into a matrix.

Combining vectors is easy, use c(vector1,vector2)

Combining column vectors into one matrix goes as follows.

cbind() column bind

rbind() row bind

Example with coffee data

coffee_data<-cbind(number_of_coffees_a_week,days)
coffee_data # this is what the matrix looks like.
##      number_of_coffees_a_week days        
## [1,] "1"                      "Mon"       
## [2,] "2"                      "Tues"      
## [3,] "0"                      "Wed"       
## [4,] "0"                      "Thurs"     
## [5,] "1"                      "Fri"       
## [6,] "4"                      "Party time"
## [7,] "5"                      "Party time"
coffee_data<-as.data.frame(coffee_data) # make it a dataframe.
is.data.frame(coffee_data)
## [1] TRUE
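
Notice the quotes around the numbers in the matrix above: a matrix holds one type only, so cbind() turned the coffee counts into text. A sketch of an alternative, building the data frame directly with data.frame(), which keeps each column's own type. The names coffee_matrix and coffee_df are just for illustration.

```r
number_of_coffees_a_week <- c(1, 2, 0, 0, 1, 4, 5)
days <- c("Mon", "Tues", "Wed", "Thurs", "Fri", "Party time", "Party time")

# cbind() builds a matrix: everything becomes character
coffee_matrix <- cbind(number_of_coffees_a_week, days)
typeof(coffee_matrix)                              # "character"

# data.frame() keeps the numbers numeric
coffee_df <- data.frame(number_of_coffees_a_week, days)
is.numeric(coffee_df$number_of_coffees_a_week)     # TRUE
mean(coffee_df$number_of_coffees_a_week)           # 13 / 7, about 1.86
```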

Try it yourself.

Together with your partner:

  1. combine the two vectors with your heights. (Remember the order!) (or make a new one!)
  2. make a vector with your ages (in the same order as 1.)
  3. make a dataframe called ‘team’ using cbind
  4. check that it is a dataframe.

Making a matrix from scratch.

# nrow: number of rows / ncol: number of columns
matrix(data=5, nrow=2, ncol=2)
##      [,1] [,2]
## [1,]    5    5
## [2,]    5    5
matrix(1:8, 2, 4)
##      [,1] [,2] [,3] [,4]
## [1,]    1    3    5    7
## [2,]    2    4    6    8
as.data.frame(matrix(1:8,2,4))
##   V1 V2 V3 V4
## 1  1  3  5  7
## 2  2  4  6  8
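
By default R fills a matrix column by column, as you can see above. A sketch of filling row by row instead with byrow = TRUE, and of picking out a single cell.

```r
m <- matrix(1:8, nrow = 2, ncol = 4, byrow = TRUE)
m          # row 1 holds 1 2 3 4, row 2 holds 5 6 7 8

dim(m)     # 2 4 (rows, columns)
m[2, 3]    # row 2, column 3: 7
t(m)       # transpose: 4 rows, 2 columns
```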

Setting a work directory.

Your files typically live in the directory where the .rmd lives.

Normally you would do this at the start of your session.

This is where you would read and write data.

setwd("~/Dropbox/Teaching_MRes_Northumbria/Lecture1") 
# the tilde (~) abbreviates your home directory
# mostly you would use setwd("C:/Documents/Rstudio/assignment1") 
# on Windows. Don't use \ (backslash), use / instead
# Linux: setwd("/usr/thomas/mydir")

Writing away data.

One of the most versatile formats is .csv

comma separated value file (readable in MS Excel)

write.csv(coffee_data, file= 'coffee_data.csv')
### no row names.
write.csv(coffee_data, file= 'coffee_data.csv', row.names=FALSE)
### ??write.csv to find out more
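
A sketch of a full round trip: write a data frame to .csv and read it back with read.csv(). It writes to R's temporary folder (tempdir()) so it does not clutter your work directory; the names coffee_df, csv_path and coffee_back are just for illustration.

```r
coffee_df <- data.frame(
  number_of_coffees_a_week = c(1, 2, 0, 0, 1, 4, 5),
  days = c("Mon", "Tues", "Wed", "Thurs", "Fri", "Sat", "Sun")
)

# file.path() glues folder and file name together with the right separator
csv_path <- file.path(tempdir(), "coffee_data.csv")

write.csv(coffee_df, file = csv_path, row.names = FALSE)
coffee_back <- read.csv(csv_path)

str(coffee_back)     # check that rows, columns and types look as expected
```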

SPSS (install ‘haven’ first!) , note the different notation!

require(haven)
write_sav(coffee_data, 'coffee_data.sav')

Try it yourself

Write away your datafiles as .csv and .sav

Open your datafiles with Excel and SPSS.

Find out more about the write_sav() function (?write_sav).

Read in data.

If it is in the same folder. I have reloaded the ‘haven’ package.

require(haven)
coffee_data_the_return<-read_sav('coffee_data.sav')
### use the same notation as with setwd to get the path

Even from (public) weblinks. Here in .dat format. head() shows you the first lines.

require(data.table)
mydat <- fread('https://www.cdc.gov/healthyyouth/data/yrbs/sadc_2019/Middle_School/sadc_ms_2019_state_a_m.dat')
head(mydat)
##        V1     V2     V3     V4    V5    V6    V7      V8    V9   V10    V11
##    <char> <char> <char> <char> <int> <int> <int>   <num> <int> <int>  <int>
## 1:     HI Hawaii   (HI)  State     2  1997     2 21.4499     2     1 117476
## 2:     HI Hawaii   (HI)  State     2  1997     2 20.9430    11     2 117477
## 3:     HI Hawaii   (HI)  State     2  1997     2 12.9590     4     1 117478
## 4:     HI Hawaii   (HI)  State     2  1997     2 12.9590     4     1 117479
## 5:     HI Hawaii   (HI)  State     2  1997     2 26.1130     9     2 117480
## 6:     HI Hawaii   (HI)  State     2  1997     2 18.7855     1     1 117481
##      V12   V13    V14   V15   V16   V17   V18   V19   V20   V21    V22    V23
##    <int> <int> <char> <int> <num> <int> <int> <int> <int> <int> <char> <char>
## 1:     4     1      .     4    22     2     2     2     1   331      1      .
## 2:     4     2      .     4    62     2     2     2     1   144      1      .
## 3:     3     1      .     4    61     5     2     2     1   133      .      .
## 4:     4     1      .     4    63     4     2     2     1   144      1      .
## 5:     4     1      .     4    22     4     2     2     1   151      1      .
## 6:     4     2      .     4    52     2     1     2     1   134      1      .
##      V24    V25    V26    V27    V28    V29   V30    V31   V32    V33    V34
##    <int> <char> <char> <char> <char> <char> <int> <char> <int> <char> <char>
## 1:     1      .      .      .      .      .     2      .     2      .      .
## 2:     1      .      .      .      .      .     2      .     2      .      .
## 3:     2      .      .      .      .      .     2      .     2      .      .
## 4:     2      .      .      .      .      .     2      .     2      .      .
## 5:     2      .      .      .      .      .     2      .     2      .      .
## 6:     1      .      .      .      .      .     1      .     2      .      .
##       V35    V36    V37    V38    V39    V40    V41    V42    V43    V44    V45
##    <char> <char> <char> <char> <char> <char> <char> <char> <char> <char> <char>
## 1:      .      .      .      .      .      .      .      .      .      .      .
## 2:      .      .      .      .      .      .      .      .      .      .      .
## 3:      .      .      .      .      .      .      .      .      .      .      .
## 4:      .      .      .      .      .      .      .      .      .      .      .
## 5:      .      .      .      .      .      .      .      .      .      .      .
## 6:      .      .      .      .      .      .      .      .      .      .      .
##       V46    V47    V48    V49    V50    V51    V52    V53   V54   V55    V56
##    <char> <char> <char> <char> <char> <char> <char> <char> <int> <int> <char>
## 1:      .      .      .      .      .      .      .      .     2     1      .
## 2:      .      .      .      .      .      .      .      .     1     2      .
## 3:      .      .      .      .      .      .      .      .     2     2      .
## 4:      .      .      .      .      .      .      .      .     1     2      .
## 5:      .      .      .      .      .      .      .      .     1     1      .
## 6:      .      .      .      .      .      .      .      .     2     2      .
##       V57    V58    V59    V60    V61    V62    V63    V64    V65    V66    V67
##    <char> <char> <char> <char> <char> <char> <char> <char> <char> <char> <char>
## 1:      .      .      .      .      .      .      .      .      .      .      .
## 2:      .      .      .      .      .      .      .      .      .      .      .
## 3:      .      .      .      .      .      .      .      .      .      .      .
## 4:      .      .      .      .      .      .      .      .      .      .      .
## 5:      .      .      .      .      .      .      .      .      .      .      .
## 6:      .      .      .      .      .      .      .      .      .      .      .
##       V68    V69    V70    V71    V72    V73    V74    V75    V76    V77    V78
##    <char> <char> <char> <char> <char> <char> <char> <char> <char> <char> <char>
## 1:      .      .      .      .      .      .      .      .      .      .      .
## 2:      .      .      .      .      .      .      .      .      .      .      .
## 3:      .      .      .      .      .      .      .      .      .      .      .
## 4:      .      .      .      .      .      .      .      .      .      .      .
## 5:      .      .      .      .      .      .      .      .      .      .      .
## 6:      .      .      .      .      .      .      .      .      .      .      .
##       V79   V80    V81    V82    V83    V84    V85    V86    V87    V88    V89
##    <char> <int> <char> <char> <char> <char> <char> <char> <char> <char> <char>
## 1:      .     1      .      .      .      .      .      .      .      .      .
## 2:      .     1      .      .      .      .      .      .      .      .      .
## 3:      .     1      .      .      .      .      .      .      .      .      .
## 4:      .     1      .      .      .      .      .      .      .      .      .
## 5:      .     1      .      .      .      .      .      .      .      .      .
## 6:      .     1      .      .      .      .      .      .      .      .      .
##       V90    V91    V92    V93   V94    V95    V96    V97    V98    V99   V100
##    <char> <char> <char> <char> <int> <char> <char> <char> <char> <char> <char>
## 1:      .      .      .      .     1      .      .      .      .      .      .
## 2:      .      .      .      .     1      .      .      .      .      .      .
## 3:      .      .      .      .     1      .      .      .      .      .      .
## 4:      .      .      .      .     1      .      .      .      .      .      .
## 5:      .      .      .      .     1      .      .      .      .      .      .
## 6:      .      .      .      .     1      .      .      .      .      .      .
##      V101   V102   V103   V104   V105   V106   V107   V108   V109   V110   V111
##    <char> <char> <char> <char> <char> <char> <char> <char> <char> <char> <char>
## 1:      .      .      .      .      .      .      .      .      .      .      .
## 2:      .      .      .      .      .      .      .      .      .      .      .
## 3:      .      .      .      .      .      .      .      .      .      .      .
## 4:      .      .      .      .      .      .      .      .      .      .      .
## 5:      .      .      .      .      .      .      .      .      .      .      .
## 6:      .      .      .      .      .      .      .      .      .      .      .
##      V112   V113
##    <char> <char>
## 1:      .      .
## 2:      .      .
## 3:      .      .
## 4:      .      .
## 5:      .      .
## 6:      .      .

Some basic data analyses / manipulations.

This follows Wickham & Grolemund (2017).

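From here on I’ll use library() instead of require(). Both load a package, but library() stops with an error if the package is not installed, whereas require() only gives a warning and returns FALSE.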

library(nycflights13)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ lubridate 1.9.3     ✔ tibble    3.2.1
## ✔ purrr     1.0.2     ✔ tidyr     1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::between()     masks data.table::between()
## ✖ dplyr::filter()      masks stats::filter()
## ✖ dplyr::first()       masks data.table::first()
## ✖ lubridate::hour()    masks data.table::hour()
## ✖ lubridate::isoweek() masks data.table::isoweek()
## ✖ dplyr::lag()         masks stats::lag()
## ✖ dplyr::last()        masks data.table::last()
## ✖ lubridate::mday()    masks data.table::mday()
## ✖ lubridate::minute()  masks data.table::minute()
## ✖ lubridate::month()   masks data.table::month()
## ✖ lubridate::quarter() masks data.table::quarter()
## ✖ lubridate::second()  masks data.table::second()
## ✖ purrr::transpose()   masks data.table::transpose()
## ✖ lubridate::wday()    masks data.table::wday()
## ✖ lubridate::week()    masks data.table::week()
## ✖ lubridate::yday()    masks data.table::yday()
## ✖ lubridate::year()    masks data.table::year()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Conflicts.

Take careful note of the conflicts message printed when loading the tidyverse.

It tells you that dplyr conflicts with some functions.

Some of these are from base R.

If you want to use the base version of these functions after loading dplyr, you’ll need to use their full names: stats::filter() and stats::lag()

NYC Flights

This data frame contains all 336,776 flights (!) that departed from New York City in 2013.

The data come from the US Bureau of Transportation Statistics and are documented in ?flights.

nycflights13::flights
## # A tibble: 336,776 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # ℹ 336,766 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>
# Lets make it available to our environment.
flights<-(nycflights13::flights)

Tibbles.

Tibbles are data frames. But with some tweaks to make life a little easier.

You can turn a dataframe into a tibble with as_tibble()

Notice anything in particular?

int stands for integers.

dbl stands for doubles, or real numbers.

chr stands for character vectors, or strings.

dttm stands for date-times (a date + a time).

But I want to see everything.

View()

View(flights)

‘dplyr’ basics.

Pick observations by their values: filter().

Reorder the rows: arrange().

Pick variables by their names select().

Create new variables with functions of existing variables mutate().

Collapse many values down to a single summary summarise().
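
A sketch of the five verbs on a tiny made-up data frame (the values in toy are hypothetical and not taken from nycflights13), so you can check each result by eye. It assumes the dplyr package is installed.

```r
library(dplyr)

# Toy data: hypothetical values for illustration only
toy <- data.frame(
  carrier   = c("UA", "DL", "UA", "DL"),
  dep_delay = c(10, -2, 30, 4),
  distance  = c(500, 800, 1200, 300)
)

filter(toy, dep_delay > 0)                     # rows with a positive delay
arrange(toy, dep_delay)                        # rows ordered by delay
select(toy, carrier, dep_delay)                # keep only these two columns
mutate(toy, distance_km = distance * 1.609)    # new column (assumes distance is in miles)
summarise(toy, mean_delay = mean(dep_delay))   # one summary value: 10.5
```

Each verb takes a data frame as its first argument and returns a new data frame; toy itself is not changed unless you assign the result.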

Data cleaning…

Let’s filter out some missings for departure delay (dep_delay)

Here we make a new dataset

filter()

# notice '!' for 'not'; is.na() is the proper test for missing values.
flights_no_miss<-filter(flights, !is.na(dep_delay))

Logical operations.

& is “and”, | is “or”, and ! is “not”

= vs. ==

When filtering you’ll need: the standard suite: >, >=, <, <=, != (not equal), and == (equal).

Common mistake: = instead of ==
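
A sketch of the comparison and logical operators on a single made-up value, to make the difference between assigning and comparing concrete.

```r
x <- 5            # assignment: x now holds 5

x == 5            # comparison: TRUE
x != 5            # FALSE
x > 3 & x < 10    # both conditions hold: TRUE
x < 3 | x > 4     # at least one condition holds: TRUE
!(x == 5)         # negation: FALSE
```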

Floating point numbers

Floating point numbers are a problem. Computers cannot store an infinite number of digits, so many results are only approximations.

sqrt(3) ^ 2 == 3
## [1] FALSE
1/98 * 98 == 1
## [1] FALSE

Solution: near()

near(sqrt(3) ^ 2,  3)
## [1] TRUE
near(1/98*98, 1)
## [1] TRUE
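
near() comes from dplyr. If dplyr is not loaded, base R can do a similar check. A sketch of two options: comparing the difference against a small tolerance by hand, or using all.equal().

```r
a <- sqrt(3)^2

a == 3                                    # FALSE: exact comparison fails
abs(a - 3) < sqrt(.Machine$double.eps)    # TRUE: difference is within a tiny tolerance
isTRUE(all.equal(a, 3))                   # TRUE: base R's "nearly equal" test
```

Wrap all.equal() in isTRUE(), because when the values differ it returns a description of the difference rather than FALSE.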

Basic statistics.

Let’s look at the delays with departure (dep_delay).

Note the dollar sign ($) for selecting the column

mean(flights_no_miss$dep_delay)
## [1] 12.63907
median(flights_no_miss$dep_delay)
## [1] -2

Measures of variation

Standard deviation and Standard error (of the mean).

sd(flights_no_miss$dep_delay)
## [1] 40.21006
var(flights_no_miss$dep_delay)
## [1] 1616.849
se<-sd(flights_no_miss$dep_delay)/sqrt(length(flights_no_miss$dep_delay)) 
se # standard error
## [1] 0.07015412

95% Confidence interval.

# 95 CI
UL<- (mean(flights_no_miss$dep_delay) + 1.96*se)
LL<- (mean(flights_no_miss$dep_delay) - 1.96*se)
UL
## [1] 12.77657
LL
## [1] 12.50157
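
The same steps wrapped in a function, as a sketch. The name mean_ci and the delays vector are hypothetical. Instead of typing 1.96, it asks qnorm() for the critical value, so other confidence levels work too. Like the calculation above, it uses the normal approximation.

```r
# mean_ci() is a hypothetical helper for illustration
mean_ci <- function(x, level = 0.95) {
  x  <- x[!is.na(x)]                   # drop missing values
  se <- sd(x) / sqrt(length(x))        # standard error of the mean
  z  <- qnorm(1 - (1 - level) / 2)     # about 1.96 when level = 0.95
  c(lower = mean(x) - z * se, mean = mean(x), upper = mean(x) + z * se)
}

delays <- c(2, 4, -1, -6, 11, NA, 30)  # made-up delays, one missing
mean_ci(delays)
mean_ci(delays, level = 0.99)          # wider interval
```

With small samples a t-based interval (see ?t.test) is the more usual choice.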

Five number summary.

minimum, first quartile (Q1), median, third quartile (Q3), maximum.

fivenum(flights_no_miss$dep_delay)
## [1]  -43   -5   -2   11 1301
summary(flights_no_miss$dep_delay)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -43.00   -5.00   -2.00   12.64   11.00 1301.00

Interquartile range

IQR: Q3 - Q1. Another measure of variation.

IQR(flights_no_miss$dep_delay)
## [1] 16
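
A sketch on made-up data showing where the IQR comes from. quantile() gives Q1 and Q3 directly. Note that fivenum() uses 'hinges', which can differ slightly from quantile()'s default quartiles in small samples.

```r
x <- c(1, 2, 3, 4, 5, 6, 7, 8)         # hypothetical toy data

quantile(x, probs = c(0.25, 0.75))     # Q1 = 2.75, Q3 = 6.25
IQR(x)                                 # 3.5, i.e. 6.25 - 2.75
fivenum(x)                             # 1.0 2.5 4.5 6.5 8.0 (hinges 2.5 and 6.5)
```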

Boxplot

boxplot(flights_no_miss$dep_delay)

Mode. (‘modeest’)

Mode= most common value.

Trickier. ??mlv to find out more

library(modeest)
mlv(flights_no_miss$dep_delay,  method='mfv')
## [1] -5
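
If you do not want an extra package, here is a base R sketch of the same idea: count each value with table() and keep the most frequent one(s). The name mode_value is hypothetical, and the function assumes numeric input.

```r
# mode_value() is a hypothetical helper for illustration
mode_value <- function(x) {
  counts <- table(x)                                 # how often each value occurs
  as.numeric(names(counts)[counts == max(counts)])   # value(s) with the highest count
}

mode_value(c(-5, -5, -2, 0, 3, -5, 3))   # -5
mode_value(c(1, 1, 2, 2))                # 1 2: ties return all modes
```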

Bickel’s modal skewness

Bonus: ‘skimr’

Sometimes you need to install a package which is under development. (First install ‘devtools’)

devtools::install_github("ropenscilabs/skimr")
## Skipping install of 'skimr' from a github remote, the SHA1 (d5126aa0) has not changed since last install.
##   Use `force = TRUE` to force installation
require(skimr)
## Loading required package: skimr

Try it yourself: skim

require(skimr)
skim(flights)
Data summary
Name                      flights
Number of rows            336776
Number of columns         19
_______________________
Column type frequency:
  character               4
  numeric                 14
  POSIXct                 1
________________________
Group variables           None

Variable type: character

skim_variable   n_missing  complete_rate  min  max  empty  n_unique  whitespace
carrier                 0           1.00    2    2      0        16           0
tailnum              2512           0.99    5    6      0      4043           0
origin                  0           1.00    3    3      0         3           0
dest                    0           1.00    3    3      0       105           0

Variable type: numeric

skim_variable   n_missing  complete_rate     mean       sd    p0   p25   p50   p75  p100  hist
year                    0           1.00  2013.00     0.00  2013  2013  2013  2013  2013  ▁▁▇▁▁
month                   0           1.00     6.55     3.41     1     4     7    10    12  ▇▆▆▆▇
day                     0           1.00    15.71     8.77     1     8    16    23    31  ▇▇▇▇▆
dep_time             8255           0.98  1349.11   488.28     1   907  1401  1744  2400  ▁▇▆▇▃
sched_dep_time          0           1.00  1344.25   467.34   106   906  1359  1729  2359  ▁▇▇▇▃
dep_delay            8255           0.98    12.64    40.21   -43    -5    -2    11  1301  ▇▁▁▁▁
arr_time             8713           0.97  1502.05   533.26     1  1104  1535  1940  2400  ▁▃▇▇▇
sched_arr_time          0           1.00  1536.38   497.46     1  1124  1556  1945  2359  ▁▃▇▇▇
arr_delay            9430           0.97     6.90    44.63   -86   -17    -5    14  1272  ▇▁▁▁▁
flight                  0           1.00  1971.92  1632.47     1   553  1496  3465  8500  ▇▃▃▁▁
air_time             9430           0.97   150.69    93.69    20    82   129   192   695  ▇▂▂▁▁
distance                0           1.00  1039.91   733.23    17   502   872  1389  4983  ▇▃▂▁▁
hour                    0           1.00    13.18     4.66     1     9    13    17    23  ▁▇▇▇▅
minute                  0           1.00    26.23    19.30     0     8    29    44    59  ▇▃▆▃▅

Variable type: POSIXct

skim_variable   n_missing  complete_rate  min                  max                  median               n_unique
time_hour               0              1  2013-01-01 05:00:00  2013-12-31 23:00:00  2013-07-03 10:00:00      6936

I miss SPSS…

Rcmdr

Try and install Rcmdr.

Toy with Rcmdr.

Exercise.

Load the flights dataset.

Calculate the mean delay in arrival for Delta Airlines (DL) (use filter())

Calculate the associated 95% confidence interval.

Do the same for United Airlines (UA) and compare the two. Do their confidence intervals overlap?

Calculate the mode for the delay in arrival from JFK airport (origin).

Save a dataset as .sav containing only the flights departing from JFK airport. –> Submit via elearning portal.

A note on Jupyter notebooks.

An alternative to RStudio is Jupyter notebooks. This would be especially handy if you need to combine Python with R.

My experience is that they are great but don’t play well with multiple versions of R and/or networked PCs.

Try them if you are up for a challenge. No refunds and at your own risk :).

References.

For next week.

  • Complete the exercises.
  • I strongly recommend you re-read these slides.
  • Work through some of the references!
  • Toy around. Have fun! Look at Rcmdr
  • Look at your assignment! You can already complete parts, after this first lecture!

Further Resources