Lecture 1: PY 0794 - Advanced Quantitative Research Methods

## PY0794: Advanced Quantitative research methods ![Logo](https://www.rstudio.com/wp-content/uploads/2015/06/Rlogonew.png) ## Goals (today) - Markdown - R - Notebooks (jupyter or R notebooks) OR - RStudio - Basic statistics. ## Goals (course) {.build} KU3 - Formulate balanced judgements with regard to complex, incomplete, ambiguous or sensitive data. KU 4 - Contribute to the creation of new knowledge and practical applications within the discipline through a critical understanding of the processes through which knowledge is created IPSA 2 - Use a variety of techniques, advanced research methods and technological skills applicable to psychological enquiry PVA 1 - Apply relevant ethical, legal and professional practice frameworks (e.g., BPS), and maintain appropriate professional boundaries. ```{r, out.width = "90px", echo=FALSE} knitr::include_graphics("goal.jpeg") ``` ## Housekeeping {.build} ```{r, out.width = "300px", echo=FALSE} knitr::include_graphics("https://imageserve.babycenter.com/4/000/366/LDQDRyAQRh8cmTkhyi7PbAPh3JwqubPo_lg.jpg") ``` Bring your laptop if you want to. **ONLY** class relevant stuff. Course manual. Reading list. Attendance. / Be punctual. / Be engaged. Exercise after each lecture. Keep up! Appointment via: thomas.pollet@northumbria.ac.uk ## Assignments {.build} ### The bit you care about most: marks. {.build} 30% each (remaining 40% Qual. components) Deadlines: see Turnitin briefs (1pm) / MRes. Handbook. Graded via rubrics Empty .rmd shell which you will turn into a .pdf Screenshot + .pdf + .rmd ## Assignments (bis) Complete the exercises in class (+ any bonus), you can find them under 'study skills'. Questions via elearning environment, _but_ only if you attempted the corresponding exercise. ```{r, out.width = "300px", echo=FALSE} knitr::include_graphics("https://s-media-cache-ak0.pinimg.com/originals/25/d4/7c/25d47c93e8e3f3a8b15657162e00c069.jpg") ``` ## Why oh, Why R? {.build} Most of you are familiar with using Excel and SPSS? Why change to R? There are quite a few reasons: * It will make it easier to do repetitive tasks (e.g., formatting tables). * R is maintained by statisticians and computer scientists ("Experts"). * Some techniques are not available in standalone commercial packages (e.g., SEM in SPSS). * **Free** and open source! (not dependent on often expensive software licenses). RStudio desktop also open source. Unlike SPSS / SAS / Statistica, R will always be free. ## Even more benefits {.build} * Forces you to think about what you are doing. If something doesn't work the way you like, you can change it. * Active help community. [Stack overflow](http://stackoverflow.com/) * Many companies rely on R (e.g., Facebook, Google, Shell, Thomas Cook,... ). Some newspapers rely on R to make their graphics. The [BBC](https://bbc.github.io/rcookbook/) makes their figures/infographics with it! ## Downsides * R can be slower than 'real' programming languages (still it'll beat SPSS comfortably at tasks such as bootstrapping). * There is no GUI. ```{r, out.width = "300px", echo=FALSE} knitr::include_graphics("https://lovestats.files.wordpress.com/2012/07/technologically-impaired-duck-spss.jpg?w=620") ``` ## Markdown and RMarkdown ### Before we move to R. Let's have a look at Markdown, which is a very basic language. ### Familiarize yourself: [Work your way through this](http://www.markdowntutorial.com/): www.markdowntutorial.com In a duo, work through this tutorial. ### Cheat sheet to be framed above your bed. [Cheatsheet](https://github.com/adam-p/markdown-here/wiki/Markdown-Cheatsheet) ## Tutorial Done. Woohoo! {.build} ### Some things which are not (fully) covered in the tutorial. Mathematical symbols. sub/super-script: $x^2$ Greek symbols. For example alpha --> $\alpha$ ; beta --> $\beta$ ; etc. --> Don't worry too much about those for now. You can read up later [here](https://rstudio-pubs-static.s3.amazonaws.com/18858_0c289c260a574ea08c0f10b944abc883.html) and [here](http://www.stat.cmu.edu/~cshalizi/rmarkdown/) ```{r, out.width = "200px", echo=FALSE} knitr::include_graphics("https://2.bp.blogspot.com/-SzKpclTsOaw/U0UptxvUthI/AAAAAAAAjaY/tpOSAhiU19A/s1600/science-major-mouse-meme-generator-four-years-of-physics-knows-entire-greek-alphabet-upper-and-lower-case-2e411e.jpg") ``` ## RStudio. * Open RStudio * Should look something like this: ```{r, out.width = "700px", echo=FALSE} knitr::include_graphics("Rstudio.png") ``` ## RStudio - New file. {.build} Click File New --> R markdown. --> Document---> Html. (Many other options incl. presentations) This will be the core in which you will complete your work. RMarkdown can be rendered in .html / .word / .pdf ## RMarkdown Press the knit button! ```{r, out.width = "700px", echo=FALSE} knitr::include_graphics("samplermarkdown.png") ``` ## HTML {.build} Congrats. You generated a webpage! The bit between the ticks are R code. The text in between is Markdown. Occasionally .html or . latex code interspersed. You can make .pdf , which you'll learn later, but .html is suitable for most purposes including your assignment. If you want to make PDFs you'll need a latex distribution. On Windows, you need Miktex, installed here in the lab. On OSX, MacTeX. On Linux, TexLive. More info [here](https://www.latex-project.org/get/). ## From .html to .pdf {.build} RStudio can also make .PDFs more later about that. For now: Simply open your webpage save as .pdf --> via print command. ```{r, out.width = "300px", echo=FALSE} knitr::include_graphics("https://media.giphy.com/media/QbumCX9HFFDQA/giphy-downsized-large.gif") ``` ## First coding ever. ```{r, out.width = "200px", echo=FALSE} knitr::include_graphics("https://i.kym-cdn.com/entries/icons/original/000/021/807/ig9OoyenpxqdCQyABmOQBZDI0duHk2QZZmWg2Hxd4ro.jpg") ``` Delete what's between the ticks. Enter: - Sys.Date() and Click "Run Current Chunk" Should give you: ```{r} Sys.Date() ``` ## Sys.time() - Sys.time() and Click "Run Current Chunk" Should give you: ```{r} Sys.time() ``` ## Install packages. R is not really a programme but rather works based on packages. Some basic operations can be done in base R, but mostly we will need packages. First we install some packages. This can be done via the install.packages command. In RStudio you also have a button to click. **Thomas shows Rstudio button** Try installing the 'ggplot2' package via the button. ## Loading a package. * packages: and then tick ggplot2 * Or: ```{r} library(ggplot2) #loading ggplot2 ``` '#' to write comments in your code ## Workspace {.build} Most of these you might not need as you have RStudio! In RStudio, the loaded objects are listed in the "Environment"-tab in the window in the top right corner. **ls()** list objects in workspace **rm(…)** remove objects from workspace **rm(list = ls())** remove all objects from workspace **save.image()** saves workspace **load(".rdata")** loads saved workspace **history()** view command history **loadhistory()** load command history **savehistory()** save command history ## R as a calculator. {.build} Use ; if you want several operations. ```{r} 2+3; 5*7; 3-5 ``` ## Remember log / exp. {.build} ```{r} # The log function gives logs to the base e (e = 2.718282) log(2.71828182845905) exp(1) # the antilog function is exp log10(10) ``` ## Rounding {.build} Rounding to next integer is straightforward. floor() and ceiling() ```{r} floor(6.9); ceiling (6.9); floor(-6.9); ceiling(-6.9) ``` ## Trunc() {.build} Stripping the decimal part is also straightforward. trunc() ```{r} trunc(9.75); trunc(-9.75) ``` ## More mathematical functions. Further mathematical functions are shown below (Crawley, 2013:17). ```{r, out.width = "700px", echo=FALSE} knitr::include_graphics("Figure1.png") # from Crawley 2013 ``` ## Notation of (larger) numbers.{.build} 1.3e3 means 1300 because the e3 means ‘move the decimal point 3 places to the right’; 1.5e-2 means 0.015 because the e-2 indicates ‘move the decimal point 2 places to the left’ (For those of who you who had advanced maths: Complex numbers: 3.6+4.2i is a complex number with real (3.6) and imaginary (4.2) parts, and i is $\sqrt[]{-1}$.) ## Let's make a variable {.build} We often want to store things on which we'll do the calculations. ```{r} thomas_age<-35 ``` **IMPORTANT** Variable names in R are case sensitive, so Thomas is not the same as thomas. Variable names should not begin with numbers (e.g. 2x) or symbols (e.g. %x or $x). Variable names should not contain blank spaces (use body_weight or body.weight not body weight). **Make a variable, for one of the people in your duo** ## Terminology {.build} Mostly for your reference: 1. Object modes 2. Object classes ```{r, out.width = "200px", echo=FALSE} knitr::include_graphics("https://adventures.hartleybrody.com/assets/terms-acronyms-definition.jpg") ``` ## Object modes (atomic structures) {.build} **integer** whole numbers (15, 23, 8, 42, 4, 16) **numeric** real numbers (double precision: 3.14, 0.0002, 6.022E23) **character** text string (“Hello World”, “ROFLMAO”, “DR Pollet”) **logical** TRUE/FALSE or T/F ## Object classes {.build} **vector** object with atomic mode **factor** vector object with discrete groups (ordered/unordered) **matrix** 2-dimensional array **array** like matrices but multiple dimensions **list** vector of components **data.frame** "matrix –like" list of variables of same # of rows --> **This is the one you care most about.** Many of the errors you potentially run into have to do with objects being the wrong class. (For example, R is expecting a data.frame, but you are offering it a matrix). ## Assignment, or how to label a vector (or variable).{.build} **<-** assign, this is to assign a variable. At your own risk you can also use = . [Why?](http://blog.revolutionanalytics.com/2008/12/use-equals-or-arrow-for-assignment.html) **c(...)** combine / concatenate seq(x) generate a sequence. **[]** denotes the position of an element. ## Examples. {.build .smaller} ```{r} # Now let's do some very simple examples. seq(1:5) # print a sequence thomas_height<-188.5 thomas_height # prints the value. # number of coffee breaks in a week number_of_coffees_a_week<-c(1,2,0,0,1,4,5) number_of_coffees_a_week length(number_of_coffees_a_week) # how many elements ``` ## Days of the week {.build} ```{r} days<-c("Mon","Tues","Wed","Thurs","Friday", "Sat", "Sun") days days[5] # print element number 5 -- Friday days[c(1,2,3)] # print elements 1,2,3 ``` ## Replacing things. {.build} ```{r} days[5]<-"Fri" # replace Friday with Fri days days[c(6,7)] <- rep("Party time",2) # write Sat and Sun as Party time days ``` ## Try it yourself (in duos) {.build} Use # to annotate your code. 1. Make an atomic vector with your height. If you don't know your metric height: 'guess'. 2. Make a vector for the months of the year. 3. Print the 6th and 9th month 4. Replace the July/August with vacation in your vector. ## Special Values {.build} **NULL** object of zero length, test with is.null(x) **NA** Not Available / missing value, test with is.na(x) **NaN** Not a number, test with is.nan(x) (e.g. 0/0, log(-1)) **Inf, -Inf** Positive/negative infinity, test with is.infinite(x) (e.g. 1/0) ## Is.numeric / etc. {.build} ```{r} is.numeric(thomas_age) is.numeric(days) is.atomic(thomas_age) is.character(days)[1] ``` ## Checking for missings: is.na() {.build} ```{r} is.na(thomas_age) is.na(days) ``` ## Combining vectors into a matrix. {.build} Combining vectors is easy, use c(vector1,vector2) Combining column vectors into one matrix goes as follows. **cbind()** column bind **rbind()** row bind ## Example with coffee data {.build} ```{r} coffee_data<-cbind(number_of_coffees_a_week,days) coffee_data # this is what the matrix looks like. coffee_data<-as.data.frame(coffee_data) # make it a dataframe. is.data.frame(coffee_data) ``` ## Try it yourself. {.build} Together with your partner: 1. combine the two vectors with your heights. (Remember the order!) (or make a new one!) 2. make a vector with your ages (in the same order as 1.) 3. make a dataframe called 'team' using cbind 4. check that it is a dataframe. ## Making a matrix from scratch. {.build} ```{r} # nr: nrow / nc; ncol matrix(data=5, nr=2, nc=2) matrix(1:8, 2, 4) as.data.frame(matrix(1:8,2,4)) ``` ## Setting a work directory. {.build} Your files are typically living in the directory where the .rmd lives. Normally you would do this at the start of your session. This is where you would read and write data. ```{r} setwd("~/Dropbox/Teaching_MRes_Northumbria/Lecture1") # the tilde just abbreviates the bits before # mostly you would use setwd("C:/Documents/Rstudio/assignment1") # for windows. Dont use \ # Linux: setwd("/usr/thomas/mydir") ``` ## Writing away data. {.build} One of the most versatile formats is .csv comma separated value file (readable in MS Excel) ```{r} write.csv(coffee_data, file= 'coffee_data.csv') ### no row names. write.csv(coffee_data, file= 'coffee_data.csv', row.names=FALSE) ### ??write.csv to find out more ``` SPSS (install 'haven' first!) , note the **different** notation! ```{r warning=F, message=F} require(haven) write_sav(coffee_data, 'coffee_data.sav') ``` ## Try it yourself {.build} Write away your datafiles as .csv and .sav Open your datafiles with Excel and SPSS. Find out more about write_spss function. ## Read in data. {.build} If it is in the same folder. I have reloaded the 'haven' package. ```{r} require(haven) coffee_data_the_return<-read_sav('coffee_data.sav') ### use the same notation as with setwd to get the path ``` Even from (public) weblinks. Here in .dat format. head() shows you the first lines. ```{r message=F, warning=F} require(data.table) mydat <- fread('https://www.cdc.gov/healthyyouth/data/yrbs/sadc_2019/Middle_School/sadc_ms_2019_state_a_m.dat') head(mydat) ``` ## Some basic data analyses / manipulations. {.build} This follows Whickham & Grolemund (2017). [library](https://yihui.name/en/2014/07/library-vs-require/) (instead of require - require tries to load library). I'll switch. ```{r warning=F, message=F} library(nycflights13) ``` ```{r} library(tidyverse) ``` ## Conflicts.{.build} Take careful note of the conflicts message printed loading the tidyverse. It tells you that dplyr conflicts with some functions. Some of these are from base R. If you want to use the base version of these functions after loading dplyr, you’ll need to use their full names: stats::filter() and stats::lag() ```{r, out.width = "200px", echo=FALSE} knitr::include_graphics("http://s2.quickmeme.com/img/83/83f3188fb441ef8a727dac6715f50478910342242eb4f6e974675a3f75c044d8.jpg") ``` ## NYC Flights {.build} This data frame contains all 336,776 flights (**!**) that departed from New York City in 2013. From the US Bureau of Transportation Statistics, and is documented in ?flights. ```{r} nycflights13::flights # Lets make it available to our environment. flights<-(nycflights13::flights) ``` ## Tibbles. {.build} Tibbles are data frames. But with some tweaks to make life a little easier. You can turn a dataframe into a tibble with as_tibble() ## Notice anything in particular? {.build} **int** stands for integers. **dbl** stands for doubles, or real numbers. **chr** stands for character vectors, or strings. **dttm** stands for date-times (a date + a time). ## But I want to see everything. View() ```{r} View(flights) ``` ```{r, out.width = "300px", echo=FALSE} knitr::include_graphics("https://qph.cf2.quoracdn.net/main-qimg-fc65a968e2e26ed910c21d77aee71846-pjlq") ``` ## 'dplyr' basics. {.build} Pick observations by their values: **filter()**. Reorder the rows: **arrange()**. Pick variables by their names **select()**. Create new variables with functions of existing variables **mutate()**. Collapse many values down to a single summary **summarise()**. ## Data cleaning... {.build} Let's filter out some missings for departure delay (dep_delay) Here we make a _new_ dataset ```{r, out.width = "300px", echo=FALSE} knitr::include_graphics("https://i.pinimg.com/736x/30/66/ac/3066ac69ae68ac200ace0ca8fe3882c3--friday-memes-broken-city.jpg") ``` ## filter() {.build} ```{r} # notice '!' for 'not'. flights_no_miss<-filter(flights, dep_delay!='NA') ``` ```{r, out.width = "300px", echo=FALSE} knitr::include_graphics("https://i.imgflip.com/13ui7g.jpg") ``` ## Logical operations. {.build} & is “and”, | is “or”, and ! is “not” ```{r, out.width = "500px", echo=FALSE} knitr::include_graphics("transform-logical.png") # from Whickham & Grolemund 2017 ``` ## = vs. == When filtering you'll need: the standard suite: >, >=, <, <=, != (not equal), and == (equal). Common mistake: = instead of == ## Floating point numbers {.build} floating point numbers are a problem. Computers cannot store infinite numbers of digits. ```{r} sqrt(3) ^ 2 == 3 1/98 * 98 == 1 ``` ```{r, out.width = "175px", echo=FALSE} knitr::include_graphics("floating_cat.jpg") ``` ## Solution: near() ```{r} near(sqrt(3) ^ 2, 3) near(1/98*98, 1) ``` ## Basic statistics. {.build} Let's look at the delays with departure (dep_delay). Note the dollar sign ($) for selecting the column ```{r} mean(flights_no_miss$dep_delay) median(flights_no_miss$dep_delay) ``` ## Measures of variation {.build} Standard deviation and Standard error (of the mean). ```{r} sd(flights_no_miss$dep_delay) var(flights_no_miss$dep_delay) se<-sd(flights_no_miss$dep_delay)/sqrt(length(flights_no_miss$dep_delay)) se # standard error ``` ## 95% Confidence interval. ```{r} # 95 CI UL<- (mean(flights_no_miss$dep_delay) + 1.96*se) LL<- (mean(flights_no_miss$dep_delay) - 1.96*se) UL LL ``` ```{r, out.width = "150px", echo=FALSE} knitr::include_graphics("CI.jpg") ``` ## Five number summary. minimum, first quartile (Q1), median, third quartile (Q3), maximum. ```{r} fivenum(flights_no_miss$dep_delay) summary(flights_no_miss$dep_delay) ``` ## Interquartile range IQR: Q3 - Q1. Another measure of variation. ```{r} IQR(flights_no_miss$dep_delay) ``` ## Boxplot ```{r} boxplot(flights_no_miss$dep_delay) ``` ## Mode. ('modeest') {.build} Mode= most common value. Trickier. ??mlv to find out more ```{r} library(modeest) mlv(flights_no_miss$dep_delay, method='mfv') ``` [Bickel's modal skewness](https://cran.r-project.org/web/packages/modeest/modeest.pdf) ## Bonus: 'skimr' Sometimes you need to install a package which is under development. (First install 'devtools') ```{r} devtools::install_github("ropenscilabs/skimr") require(skimr) ``` ## Try it yourself: skim ```{r} require(skimr) skim(flights) ``` ## I miss SPSS... . Rcmdr Try and install Rcmdr. Toy with Rcmdr. ## Exercise. {.build} Load the flights dataset. Calculate the mean delay in arrival for Delta Airlines (DL) (use **filter()**) Calculate the associated 95% confidence interval. Do the same for United Airlines (UA) and compare the two. Do their confidence intervals overlap? Calculate the mode for the delay in arrival from JFK airport (origin). save a dataset as .sav with only departing flights from JFK airport. --> submit via elearning portal. ## A note on Jupyter notebooks. An alternative to RStudio is Jupyter notebooks. This would be especially handy if you need to combine Python with R. My experience is that they are great but don't play well with multiple versions of R and/or networked PC's. If you are up for a challenge. No refunds and at your own risk :). ## References. * Crawley, M. J. (2013). _The R book: Second edition._ New York, NY: John Wiley & Sons. * Paradis, E. (2002). _[R for Beginners](http://lib.stat.cmu.edu/R/CRAN/doc/contrib/Paradis-rdebuts_en.pdf)_. Montpellier (F): University of Montpellier. * Pruim, R., Kaplan, D., & Horton, N. (2012). [mosaic: project MOSAIC statistics and mathematics teaching utilities](https://cran.r-project.org/doc/contrib/Horton+Pruim+Kaplan_MOSAIC-StudentGuide.pdf). R package version 0.6-2. * Wickham, H., & Grolemund, G. (2017). _[R for data science](http://r4ds.had.co.nz/)_. Sebastopol, CA: O’Reilly. ## For next week. * Complete the exercises. * I strongly recommend you re-read these slides. * Work through some of the references! * Toy around. Have fun! Look at Rcmdr * Look at your assignment! You can already complete parts, after this first lecture! ## Further Resources A lot to soak in but the best way to learn is via doing! * [Free book](https://cran.r-project.org/doc/contrib/Horton+Pruim+Kaplan_MOSAIC-StudentGuide.pdf) * [Free course](https://www.datacamp.com/courses/free-introduction-to-r) * [Free tutorial](http://www.statmethods.net/r-tutorial/index.html) * [Other sources](http://www.rdatamining.com/resources/courses)