2024-01-15 | disclaimer
KU 3 - Formulate balanced judgements with regard to complex, incomplete, ambiguous or sensitive data.
KU 4 - Contribute to the creation of new knowledge and practical applications within the discipline through a critical understanding of the processes through which knowledge is created
IPSA 2 - Use a variety of techniques, advanced research methods and technological skills applicable to psychological enquiry
PVA 1 - Apply relevant ethical, legal and professional practice frameworks (e.g., BPS), and maintain appropriate professional boundaries.
Bring your laptop if you want to, but use it ONLY for class-relevant work.
Course manual.
Reading list.
Attendance. / Be punctual. / Be engaged.
Exercise after each lecture. Keep up!
Appointment via: thomas.pollet@northumbria.ac.uk
30% each (remaining 40% Qual. components)
Deadlines: see Turnitin briefs (1pm) / MRes. Handbook.
Graded via rubrics
Empty .rmd shell which you will turn into a .pdf
Screenshot + .pdf + .rmd
Complete the exercises in class (+ any bonus), you can find them under ‘study skills’.
Questions via elearning environment, but only if you attempted the corresponding exercise.
Most of you are familiar with using Excel and SPSS?
Why change to R?
There are quite a few reasons:
Let’s have a look at Markdown, which is a very basic language.
Work your way through this: www.markdowntutorial.com
In a duo, work through this tutorial.
Mathematical symbols.
sub/super-script: \(x^2\)
Greek symbols. For example alpha –> \(\alpha\) ; beta –> \(\beta\) ; etc.
–> Don’t worry too much about those for now. You can read up later here and here
Click File –> New File –> R Markdown –> Document –> HTML. (Many other options incl. presentations)
This will be the core in which you will complete your work.
RMarkdown can be rendered to .html / Word (.docx) / .pdf
Press the knit button!
Congrats. You generated a webpage!
The bits between the ticks are R code. The text in between is Markdown.
Occasionally, .html or LaTeX code is interspersed.
You can make .pdf , which you’ll learn later, but .html is suitable for most purposes including your assignment.
If you want to make PDFs you’ll need a LaTeX distribution. On Windows, you need MiKTeX, installed here in the lab. On macOS, MacTeX. On Linux, TeX Live.
More info here.
RStudio can also make PDFs directly; more about that later.
For now: simply open your webpage and save it as .pdf via the print command.
Delete what’s between the ticks. Enter:
Sys.Date()
Should give you:
## [1] "2024-01-15"
Then enter:
Sys.time()
Should give you:
## [1] "2024-01-15 14:09:31 GMT"
R is not really a programme but rather works based on packages. Some basic operations can be done in base R, but mostly we will need packages.
First we install some packages. This can be done via the install.packages command. In RStudio you also have a button to click.
Thomas shows Rstudio button
Try installing the ‘ggplot2’ package via the button.
library(ggplot2) #loading ggplot2
‘#’ to write comments in your code
Most of these you might not need as you have RStudio!
In RStudio, the loaded objects are listed in the “Environment”-tab in the window in the top right corner.
ls() list objects in workspace
rm(…) remove objects from workspace
rm(list = ls()) remove all objects from workspace
save.image() saves workspace
load(“.rdata”) loads saved workspace
history() view command history
loadhistory() load command history
savehistory() save command history
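A minimal sketch of the workspace commands listed above, in case you are working outside RStudio. The object name coffee_count is made up for illustration.

```r
# Create an object, check that it is in the workspace, then remove it
coffee_count <- 3
"coffee_count" %in% ls()   # is it listed in the workspace?
rm(coffee_count)           # remove it again
exists("coffee_count")     # check whether it still exists
```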
Use ; if you want several operations.
2+3; 5*7; 3-5
## [1] 5
## [1] 35
## [1] -2
# The log function gives logs to the base e (e = 2.718282)
log(2.71828182845905)
## [1] 1
exp(1) # the antilog function is exp
## [1] 2.718282
log10(10)
## [1] 1
Rounding down or up to the nearest integer is straightforward: floor() and ceiling()
floor(6.9); ceiling(6.9); floor(-6.9); ceiling(-6.9)
## [1] 6
## [1] 7
## [1] -7
## [1] -6
Stripping the decimal part is also straightforward. trunc()
trunc(9.75); trunc(-9.75)
## [1] 9
## [1] -9
Further mathematical functions are shown below (Crawley, 2013:17).
1.3e3 means 1300 because the e3 means ‘move the decimal point 3 places to the right’;
1.5e-2 means 0.015 because the e-2 indicates ‘move the decimal point 2 places to the left’
(For those of you who had advanced maths: Complex numbers: 3.6+4.2i is a complex number with real (3.6) and imaginary (4.2) parts, and i is \(\sqrt{-1}\).)
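The notation above can be checked directly in the console. This is a small sketch using only base R; the variable name z is just for illustration.

```r
# Scientific notation: e3 shifts the decimal point right, e-2 shifts it left
1.3e3 == 1300
1.5e-2 == 0.015

# Complex numbers: Re() and Im() extract the real and imaginary parts
z <- 3.6 + 4.2i
Re(z)
Im(z)

# sqrt() only returns i if its input is already a complex number
root_minus_one <- sqrt(as.complex(-1))
root_minus_one
```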
We often want to store things on which we’ll do the calculations.
thomas_age<-35
IMPORTANT
Variable names in R are case sensitive, so Thomas is not the same as thomas.
Variable names should not begin with numbers (e.g. 2x) or symbols (e.g. %x or $x).
Variable names should not contain blank spaces (use body_weight or body.weight not body weight).
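A quick way to see these rules in action. The sketch below uses exists() to show case sensitivity and the base R helper make.names() to show what R itself considers a valid name; the name Thomas_Age is assumed not to exist in your session.

```r
thomas_age <- 35
exists("thomas_age")   # this name was just created
exists("Thomas_Age")   # different capitalisation, so a different name

# make.names() converts invalid names into syntactically valid ones
fixed_names <- make.names(c("2x", "body weight"))
fixed_names
```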
Make a variable, for one of the people in your duo
Mostly for your reference: 1. Object modes 2. Object classes
integer whole numbers (15, 23, 8, 42, 4, 16)
numeric real numbers (double precision: 3.14, 0.0002, 6.022E23)
character text string (“Hello World”, “ROFLMAO”, “DR Pollet”)
logical TRUE/FALSE or T/F
vector object with atomic mode
factor vector object with discrete groups (ordered/unordered)
matrix 2-dimensional array
array like matrices but multiple dimensions
list vector of components
data.frame “matrix-like” list of variables with the same number of rows –> This is the one you care most about.
Many of the errors you potentially run into have to do with objects being the wrong class. (For example, R is expecting a data.frame, but you are offering it a matrix).
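When you hit such an error, checking the class of your object is usually the first step. A minimal sketch with made-up objects m and d:

```r
m <- matrix(1:4, nrow = 2)
class(m)            # a matrix
is.data.frame(m)    # not a data.frame

d <- as.data.frame(m)   # convert it
class(d)
is.data.frame(d)
```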
<- assign, this is to assign a variable. At your own risk you can also use = . Why?
c(…) combine / concatenate
seq(x) generate a sequence.
[] denotes the position of an element.
# Now let's do some very simple examples.
seq(1, 5) # print a sequence from 1 to 5
## [1] 1 2 3 4 5
thomas_height<-188.5
thomas_height # prints the value.
## [1] 188.5
# number of coffee breaks in a week
number_of_coffees_a_week<-c(1,2,0,0,1,4,5)
number_of_coffees_a_week
## [1] 1 2 0 0 1 4 5
length(number_of_coffees_a_week) # how many elements
## [1] 7
days<-c("Mon","Tues","Wed","Thurs","Friday", "Sat", "Sun")
days
## [1] "Mon" "Tues" "Wed" "Thurs" "Friday" "Sat" "Sun"
days[5] # print element number 5 -- Friday
## [1] "Friday"
days[c(1,2,3)] # print elements 1,2,3
## [1] "Mon" "Tues" "Wed"
days[5]<-"Fri" # replace Friday with Fri
days
## [1] "Mon" "Tues" "Wed" "Thurs" "Fri" "Sat" "Sun"
days[c(6,7)] <- rep("Party time",2) # write Sat and Sun as Party time
days
## [1] "Mon"        "Tues"       "Wed"        "Thurs"      "Fri"
## [6] "Party time" "Party time"
Use # to annotate your code.
NULL object of zero length, test with is.null(x)
NA Not Available / missing value, test with is.na(x)
NaN Not a number, test with is.nan(x) (e.g. 0/0, log(-1))
Inf, -Inf Positive/negative infinity, test with is.infinite(x) (e.g. 1/0)
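The tests listed above can be tried out as follows. This is a small base R sketch; the last two lines show why missing values matter in practice.

```r
is.null(NULL)
is.na(c(1, NA, 3))
is.nan(0 / 0)
is.infinite(c(1 / 0, -1 / 0))
is.na(0 / 0)   # NaN also counts as missing

# A single NA makes the mean NA unless you ask R to remove it
mean(c(1, NA, 3))
mean(c(1, NA, 3), na.rm = TRUE)
```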
is.numeric(thomas_age)
## [1] TRUE
is.numeric(days)
## [1] FALSE
is.atomic(thomas_age)
## [1] TRUE
is.character(days)
## [1] TRUE
is.na(thomas_age)
## [1] FALSE
is.na(days)
## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
Combining vectors is easy, use c(vector1,vector2)
Combining column vectors into one matrix goes as follows.
cbind() column bind
rbind() row bind
coffee_data<-cbind(number_of_coffees_a_week,days)
coffee_data # this is what the matrix looks like.
##      number_of_coffees_a_week days
## [1,] "1"                      "Mon"
## [2,] "2"                      "Tues"
## [3,] "0"                      "Wed"
## [4,] "0"                      "Thurs"
## [5,] "1"                      "Fri"
## [6,] "4"                      "Party time"
## [7,] "5"                      "Party time"
coffee_data<-as.data.frame(coffee_data) # make it a dataframe.
is.data.frame(coffee_data)
## [1] TRUE
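Note that cbind() on a numeric and a character vector produces a character matrix (the numbers appear in quotes above), so the coffee counts are no longer numeric after the conversion. One possible alternative, sketched here with the object names coffee_df and via_cbind made up for illustration, is to build the data frame directly with data.frame(), which keeps each column's own type.

```r
number_of_coffees_a_week <- c(1, 2, 0, 0, 1, 4, 5)
days <- c("Mon", "Tues", "Wed", "Thurs", "Fri", "Sat", "Sun")

# data.frame() keeps the numeric column numeric
coffee_df <- data.frame(number_of_coffees_a_week, days)
str(coffee_df)
is.numeric(coffee_df$number_of_coffees_a_week)

# Going via cbind() first loses the numeric type
via_cbind <- as.data.frame(cbind(number_of_coffees_a_week, days))
is.numeric(via_cbind$number_of_coffees_a_week)
```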
Together with your partner:
# nr: nrow / nc: ncol
matrix(data=5, nr=2, nc=2)
##      [,1] [,2]
## [1,]    5    5
## [2,]    5    5
matrix(1:8, 2, 4)
##      [,1] [,2] [,3] [,4]
## [1,]    1    3    5    7
## [2,]    2    4    6    8
as.data.frame(matrix(1:8,2,4))
##   V1 V2 V3 V4
## 1  1  3  5  7
## 2  2  4  6  8
Your files typically live in the directory where the .rmd lives.
Normally you would do this at the start of your session.
This is where you would read and write data.
setwd("~/Dropbox/Teaching_MRes_Northumbria/Lecture1") # the tilde just abbreviates the bits before
# mostly you would use setwd("C:/Documents/Rstudio/assignment1") # for Windows. Don't use \
# Linux: setwd("/usr/thomas/mydir")
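To check where R is currently reading and writing, and to build paths without worrying about slashes, something like the following sketch can help. The folder names are hypothetical and only reuse the Windows example above.

```r
# Where am I right now?
getwd()

# file.path() joins the parts with forward slashes on every platform
project_dir <- file.path("C:", "Documents", "Rstudio", "assignment1")  # hypothetical folder
project_dir

# Check that a folder exists before calling setwd() on it
dir.exists(project_dir)
```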
One of the most versatile formats is .csv
comma separated value file (readable in MS Excel)
write.csv(coffee_data, file= 'coffee_data.csv')
### no row names.
write.csv(coffee_data, file= 'coffee_data.csv', row.names=FALSE)
### ??write.csv to find out more
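A quick way to convince yourself that the file was written correctly is to read it back in. The sketch below uses a made-up toy data frame and a temporary file so it does not touch your working directory.

```r
# Toy data, made up for illustration
toy <- data.frame(coffees = c(1, 2, 0), day = c("Mon", "Tues", "Wed"))

# Write to a temporary file without row names, then read it back
tmp <- tempfile(fileext = ".csv")
write.csv(toy, file = tmp, row.names = FALSE)
toy_back <- read.csv(tmp)
toy_back
```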
SPSS (install ‘haven’ first!) , note the different notation!
require(haven)
write_sav(coffee_data, 'coffee_data.sav')
Write your datafiles to disk as .csv and .sav
Open your datafiles with Excel and SPSS.
Find out more about the write_sav function.
If it is in the same folder. I have reloaded the ‘haven’ package.
require(haven)
coffee_data_the_return<-read_sav('coffee_data.sav')
### use the same notation as with setwd to get the path
You can even read data from (public) weblinks, here in .dat format. head() shows you the first lines.
require(data.table)
mydat <- fread('https://www.cdc.gov/healthyyouth/data/yrbs/sadc_2019/Middle_School/sadc_ms_2019_state_a_m.dat')
head(mydat)
## V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16
## 1: HI Hawaii (HI) State 2 1997 2 21.4499 2 1 117476 4 1 . 4 22
## 2: HI Hawaii (HI) State 2 1997 2 20.9430 11 2 117477 4 2 . 4 62
## 3: HI Hawaii (HI) State 2 1997 2 12.9590 4 1 117478 3 1 . 4 61
## 4: HI Hawaii (HI) State 2 1997 2 12.9590 4 1 117479 4 1 . 4 63
## 5: HI Hawaii (HI) State 2 1997 2 26.1130 9 2 117480 4 1 . 4 22
## 6: HI Hawaii (HI) State 2 1997 2 18.7855 1 1 117481 4 2 . 4 52
## (output for the remaining columns V17 to V113 not shown; most entries are missing, printed as '.')
This follows Wickham & Grolemund (2017).
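From here on I’ll switch to library() instead of require(). require() tries to load the package and only gives a warning (and returns FALSE) if that fails, whereas library() stops with an error.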
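The difference between the two only shows up when a package is missing. A minimal sketch, using a made-up package name that is assumed not to be installed:

```r
fake_pkg <- "notARealPackage123"  # hypothetical name, assumed not installed

# require() returns FALSE (with a warning) when the package cannot be loaded
loaded <- suppressWarnings(
  require(fake_pkg, character.only = TRUE, quietly = TRUE)
)
loaded

# library() throws an error instead, which we catch here to inspect it
result <- tryCatch(
  {
    library(fake_pkg, character.only = TRUE)
    "loaded"
  },
  error = function(e) "error"
)
result
```

Because library() fails loudly, a missing package is noticed at the top of your script rather than several lines later.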
library(nycflights13)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ lubridate 1.9.3     ✔ tibble    3.2.1
## ✔ purrr     1.0.2     ✔ tidyr     1.3.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::between()     masks data.table::between()
## ✖ dplyr::filter()      masks stats::filter()
## ✖ dplyr::first()       masks data.table::first()
## ✖ lubridate::hour()    masks data.table::hour()
## ✖ lubridate::isoweek() masks data.table::isoweek()
## ✖ dplyr::lag()         masks stats::lag()
## ✖ dplyr::last()        masks data.table::last()
## ✖ lubridate::mday()    masks data.table::mday()
## ✖ lubridate::minute()  masks data.table::minute()
## ✖ lubridate::month()   masks data.table::month()
## ✖ lubridate::quarter() masks data.table::quarter()
## ✖ lubridate::second()  masks data.table::second()
## ✖ purrr::transpose()   masks data.table::transpose()
## ✖ lubridate::wday()    masks data.table::wday()
## ✖ lubridate::week()    masks data.table::week()
## ✖ lubridate::yday()    masks data.table::yday()
## ✖ lubridate::year()    masks data.table::year()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Take careful note of the conflicts message printed when loading the tidyverse.
It tells you that dplyr (and other tidyverse packages) mask functions with the same name from packages loaded earlier.
Some of these are from base R (the stats package); the others here are from data.table, which we loaded before.
If you want to use the base version of these functions after loading dplyr, you’ll need to use their full names: stats::filter() and stats::lag()
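The package::function() notation always picks one specific version, whatever is loaded. As a small sketch, here is the base stats::filter(), which computes a moving average and has nothing to do with picking rows; the vector x is made up for illustration.

```r
x <- c(1, 2, 3, 4, 5)

# stats::filter() applies a linear filter: here a 3-point moving average
ma <- as.numeric(stats::filter(x, rep(1 / 3, 3)))
ma   # the first and last values are NA because the window does not fit there
```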
This data frame contains all 336,776 flights (!) that departed from New York City in 2013.
The data come from the US Bureau of Transportation Statistics and are documented in ?flights.
nycflights13::flights
## # A tibble: 336,776 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # ℹ 336,766 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>
# Let's make it available to our environment.
flights<-(nycflights13::flights)
Tibbles are data frames, but with some tweaks to make life a little easier.
You can turn a dataframe into a tibble with as_tibble()
int stands for integers.
dbl stands for doubles, or real numbers.
chr stands for character vectors, or strings.
dttm stands for date-times (a date + a time).
View()
View(flights)
Pick observations by their values: filter().
Reorder the rows: arrange().
Pick variables by their names: select().
Create new variables with functions of existing variables: mutate().
Collapse many values down to a single summary: summarise().
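The five verbs all work the same way: the first argument is a data frame and the result is a new data frame. Below is a minimal sketch on a made-up toy dataset (not the flights data), assuming dplyr is installed.

```r
library(dplyr)

# Toy data, made up for illustration only
toy <- data.frame(
  carrier   = c("AA", "DL", "UA", "DL"),
  dep_delay = c(10, -3, 25, 7),
  distance  = c(500, 1200, 800, 300)
)

toy_dl      <- filter(toy, carrier == "DL")                 # pick rows by value
toy_sorted  <- arrange(toy, dep_delay)                      # reorder rows
toy_small   <- select(toy, carrier, dep_delay)              # pick columns by name
toy_hours   <- mutate(toy, delay_hours = dep_delay / 60)    # add a new column
toy_summary <- summarise(toy, mean_delay = mean(dep_delay)) # collapse to one row

toy_summary
```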
Let’s filter out the missing values for departure delay (dep_delay).
Here we make a new dataset
# notice '!' for 'not'; is.na() tests for missing values.
flights_no_miss<-filter(flights, !is.na(dep_delay))
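Missing values need special care in comparisons: comparing anything with NA gives NA rather than TRUE or FALSE, which is why is.na() is the usual test. A small base R sketch with a made-up vector x:

```r
x <- c(5, NA, -2)

x == NA          # every comparison with NA is itself NA
is.na(x)         # the proper test for missing values
x[!is.na(x)]     # keep only the non-missing values
```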
& is “and”, | is “or”, and ! is “not”
When filtering you’ll need: the standard suite: >, >=, <, <=, != (not equal), and == (equal).
Common mistake: = instead of ==
Floating point numbers are a problem: computers cannot store an infinite number of digits, so results that should be exactly equal can differ by a tiny amount.
sqrt(3) ^ 2 == 3
## [1] FALSE
1/98 * 98 == 1
## [1] FALSE
near(sqrt(3) ^ 2, 3)
## [1] TRUE
near(1/98*98, 1)
## [1] TRUE
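near() comes from dplyr. If dplyr is not loaded, a similar check can be sketched in base R by comparing within a small tolerance; the tolerance of 1e-8 is an arbitrary choice for illustration.

```r
# The difference is tiny but not exactly zero
sqrt(3)^2 - 3

# Compare within a tolerance instead of using ==
abs(sqrt(3)^2 - 3) < 1e-8

# all.equal() does a tolerance-based comparison; wrap it in isTRUE()
isTRUE(all.equal(sqrt(3)^2, 3))
```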
Let’s look at the delays with departure (dep_delay).
Note the dollar sign ($) for selecting the column
mean(flights_no_miss$dep_delay)
## [1] 12.63907
median(flights_no_miss$dep_delay)
## [1] -2
Standard deviation and Standard error (of the mean).
sd(flights_no_miss$dep_delay)
## [1] 40.21006
var(flights_no_miss$dep_delay)
## [1] 1616.849
se<-sd(flights_no_miss$dep_delay)/sqrt(length(flights_no_miss$dep_delay))
se # standard error
## [1] 0.07015412
# 95% CI
UL<- (mean(flights_no_miss$dep_delay) + 1.96*se)
LL<- (mean(flights_no_miss$dep_delay) - 1.96*se)
UL
## [1] 12.77657
LL
## [1] 12.50157
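Since you will compute confidence intervals repeatedly, the steps above could be wrapped in a small function. This is only a sketch: the function name mean_ci is made up, it uses the normal approximation (qnorm(0.975) is about 1.96), and the toy values are the first ten departure delays shown in the flights printout.

```r
# Hypothetical helper: mean with a normal-approximation confidence interval
mean_ci <- function(x, level = 0.95) {
  x  <- x[!is.na(x)]                      # drop missing values
  se <- sd(x) / sqrt(length(x))           # standard error of the mean
  z  <- qnorm(1 - (1 - level) / 2)        # about 1.96 for a 95% interval
  c(lower = mean(x) - z * se, mean = mean(x), upper = mean(x) + z * se)
}

toy_delays <- c(2, 4, 2, -1, -6, -4, -5, -3, -3, -2)
ci <- mean_ci(toy_delays)
ci
```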
minimum, first quartile (Q1), median, third quartile (Q3), maximum.
fivenum(flights_no_miss$dep_delay)
## [1] -43 -5 -2 11 1301
summary(flights_no_miss$dep_delay)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
##   -43.00    -5.00    -2.00    12.64    11.00  1301.00
IQR: Q3 - Q1. Another measure of variation.
IQR(flights_no_miss$dep_delay)
## [1] 16
boxplot(flights_no_miss$dep_delay)
Mode= most common value.
Trickier. ??mlv to find out more
library(modeest)
mlv(flights_no_miss$dep_delay, method='mfv')
## [1] -5
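If you prefer not to install an extra package, the most frequent value can also be found with base R's table(). A minimal sketch; the function name mode_value and the toy vector are made up for illustration, and ties return all tied values.

```r
# Hypothetical helper: most frequent value(s) of a numeric vector
mode_value <- function(x) {
  counts <- table(x)                                  # frequency of each value
  as.numeric(names(counts)[counts == max(counts)])    # value(s) with the top count
}

toy_mode <- mode_value(c(-5, -5, -2, 11, -5, -2))
toy_mode
```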
Sometimes you need to install a package which is under development. (First install ‘devtools’)
devtools::install_github("ropenscilabs/skimr")
## Skipping install of 'skimr' from a github remote, the SHA1 (d5126aa0) has not changed since last install.
## Use `force = TRUE` to force installation
require(skimr)
## Loading required package: skimr
require(skimr)
skim(flights)
Name | flights |
Number of rows | 336776 |
Number of columns | 19 |
_______________________ | |
Column type frequency: | |
character | 4 |
numeric | 14 |
POSIXct | 1 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
carrier | 0 | 1.00 | 2 | 2 | 0 | 16 | 0 |
tailnum | 2512 | 0.99 | 5 | 6 | 0 | 4043 | 0 |
origin | 0 | 1.00 | 3 | 3 | 0 | 3 | 0 |
dest | 0 | 1.00 | 3 | 3 | 0 | 105 | 0 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
year | 0 | 1.00 | 2013.00 | 0.00 | 2013 | 2013 | 2013 | 2013 | 2013 | ▁▁▇▁▁ |
month | 0 | 1.00 | 6.55 | 3.41 | 1 | 4 | 7 | 10 | 12 | ▇▆▆▆▇ |
day | 0 | 1.00 | 15.71 | 8.77 | 1 | 8 | 16 | 23 | 31 | ▇▇▇▇▆ |
dep_time | 8255 | 0.98 | 1349.11 | 488.28 | 1 | 907 | 1401 | 1744 | 2400 | ▁▇▆▇▃ |
sched_dep_time | 0 | 1.00 | 1344.25 | 467.34 | 106 | 906 | 1359 | 1729 | 2359 | ▁▇▇▇▃ |
dep_delay | 8255 | 0.98 | 12.64 | 40.21 | -43 | -5 | -2 | 11 | 1301 | ▇▁▁▁▁ |
arr_time | 8713 | 0.97 | 1502.05 | 533.26 | 1 | 1104 | 1535 | 1940 | 2400 | ▁▃▇▇▇ |
sched_arr_time | 0 | 1.00 | 1536.38 | 497.46 | 1 | 1124 | 1556 | 1945 | 2359 | ▁▃▇▇▇ |
arr_delay | 9430 | 0.97 | 6.90 | 44.63 | -86 | -17 | -5 | 14 | 1272 | ▇▁▁▁▁ |
flight | 0 | 1.00 | 1971.92 | 1632.47 | 1 | 553 | 1496 | 3465 | 8500 | ▇▃▃▁▁ |
air_time | 9430 | 0.97 | 150.69 | 93.69 | 20 | 82 | 129 | 192 | 695 | ▇▂▂▁▁ |
distance | 0 | 1.00 | 1039.91 | 733.23 | 17 | 502 | 872 | 1389 | 4983 | ▇▃▂▁▁ |
hour | 0 | 1.00 | 13.18 | 4.66 | 1 | 9 | 13 | 17 | 23 | ▁▇▇▇▅ |
minute | 0 | 1.00 | 26.23 | 19.30 | 0 | 8 | 29 | 44 | 59 | ▇▃▆▃▅ |
Variable type: POSIXct
skim_variable | n_missing | complete_rate | min | max | median | n_unique |
---|---|---|---|---|---|---|
time_hour | 0 | 1 | 2013-01-01 05:00:00 | 2013-12-31 23:00:00 | 2013-07-03 10:00:00 | 6936 |
Rcmdr
Try and install Rcmdr.
Toy with Rcmdr.
Load the flights dataset.
Calculate the mean delay in arrival for Delta Airlines (DL) (use filter())
Calculate the associated 95% confidence interval.
Do the same for United Airlines (UA) and compare the two. Do their confidence intervals overlap?
Calculate the mode for the delay in arrival from JFK airport (origin).
Save a dataset as .sav containing only the flights departing from JFK airport. –> Submit via the elearning portal.
An alternative to RStudio is Jupyter notebooks. This would be especially handy if you need to combine Python with R.
My experience is that they are great but don’t play well with multiple versions of R and/or networked PCs.
Try them if you are up for a challenge. No refunds and at your own risk :).
A lot to soak in, but the best way to learn is by doing!