```{r setup, include=FALSE}
library(xaringanExtra)
options(htmltools.dir.version = FALSE)
knitr::opts_chunk$set(echo = TRUE)
```
```{r xaringan-tile-view, echo=FALSE}
xaringanExtra::use_tile_view()
```
## PY0794: Advanced Quantitative research methods.
* Last week: Introduction.
* Today all about visualization.
---
## Goals (today)
Data manipulation.
--
Data visualization.
--
Basic tests (_t_-tests, and a non-parametric alternative.)
---
## Assignment
After today you should be able to complete the following sections:
--
Descriptive statistics and data wrangling.
--
Histogram
--
Violin plot
--
_t_-test and non-parametric alternative
---
## Flights dataset
Again this is based on [Whickham & Grolemund (2017)](http://r4ds.had.co.nz/) and on work by [Gert Stulp](http://www.gertstulp.com/)
```{r}
setwd("~/Dropbox/Teaching_MRes_Northumbria/Lecture2")
library(nycflights13)
flights<-nycflights13::flights
```
---
## 'dplyr'
**Last week:** Pick observations by their values: **filter()**.
--
Reorder the rows: **arrange()**. (read about this [here](http://r4ds.had.co.nz/transform.html#arrange-rows-with-arrange))
--
_Pick variables by their names **select()**._
--
_Create new variables with functions of existing variables **mutate()**._
--
_Collapse many values down to a single summary **summarise()**._
---
## Load dplyr
Load dplyr:
```{r warning = F}
library(dplyr)
```
---
## select()
Sometimes you have dozens, even >100 or >1000 variables.
You can use select() to make a reduced data frame.
--
```{r}
flights_red<-select(flights, year, month, day)
### This overrides our earlier attempt
flights_red<-select(flights, year:day)
### all columns excluding those from year:day (inclusive)
flights_red<-select(flights, -(year:day))
```
---
## Try it yourself.
Team up with a partner.
Load the flights data.
Make a new dataframe with all variables starting from dep_time to arr_delay (inclusive).
---
## helper functions.
For more complex cases:
--
starts_with("abc"): matches names that begin with “abc”.
--
ends_with("xyz"): matches names that end with “xyz”.
--
contains("ijk"): matches names that contain “ijk”.
--
matches("(.)\\1"): selects variables that match a "regular expression". This one matches any variables that contain repeated characters. ??regex
--
num_range("x", 1:3): matches x1, x2 and x3.
---
## rename()
* select() can be used to rename variables, **but** it drops all of the variables not explicitly mentioned.
* Instead, use rename(), which is a variant of select() which keeps all the variables that are not explicitly mentioned.
* However, I found this to be [buggy](https://stackoverflow.com/questions/34275576/avoiding-error-when-using-rename-in-dplyr-and-column-doesnt-exist) so back to base R but feel free to use rename().
---
## names
```{r}
# Note the square brackets for indexing the columns.
names(flights_red)[1:3]=c("Year", "Month", "Day")
head(flights_red)
```
---
## mutate()
%>% : the pipe. Basically means apply function to.
--
mutate() allows you to make new variables
--
```{r, results='hide'}
flights <- flights %>% mutate(speed = distance / (air_time * 60), log_speed=log(speed))
```
--
transmute() if you want **only** the things you have calculated.
---
## helper functions() for mutate()
Again a whole load of helper functions exist. You can read more about them [here](http://r4ds.had.co.nz/transform.html#add-new-variables-with-mutate)
the key thing to remember is that you can apply all the functions we covered last week.
---
## Centering and standardizing.
**centering**: subtracting the mean. (Often used to make scores more interpretable and to avoid collinearity, which is a big [deal](https://link.springer.com/article/10.1007/s00265-010-1045-6)). More about this when we discuss OLS regression.
--
**standardizing**: Z-scores. We use the scale() function.
--
```{r, results='hide'}
flights<- flights %>% mutate(speed_cent = speed - mean(speed, na.rm = TRUE))
flights<- flights %>% mutate(speed_stand = scale(speed))
```
* now about that 'na.rm = TRUE' --> this removes any missings.
--
* Reminder about Z-scores.
```{r, out.width = "350px", echo=FALSE}
knitr::include_graphics("http://deafstudeerconsultant.nl/wp-content/uploads/2015/01/z-score-normale-verdeling.jpg")
```
---
## summarise()
Get one row with data for the dataframe.
```{r}
summarise(flights, mean_speed = mean(speed, na.rm = TRUE))
```
---
## group_by
more useful if combined with group_by , let's look at delays by [carrier](http://aspmhelp.faa.gov/index.php/ASQP:_Carrier_Codes_and_Names)
```{r}
by_carrier <- group_by(flights, carrier)
summarise(by_carrier, delay = mean(dep_delay, na.rm = TRUE))
```
---
## Try it yourself.
* calculate 'gain' which is arr_delay - dep_delay.
* calculate 'airtime_hours' which is airtime (in mins.) divided by the number of minutes in an hour.
* now calculate 'gain per hour'. (gain/airtime_hours)
* standardize 'gain per hour'.
* use group_by to calculate the means of gain per hour by carrier.
---
## Replacing missings.
Often we want to replace missings. For example, we want to replace the missings with the means or median.
```{r}
flights <- flights %>% mutate(dep_delay_no_miss= ifelse(is.na(dep_delay),
mean(dep_delay, na.rm = T), dep_delay))
mean(flights$dep_delay_no_miss)
```
--
Here we use an 'ifelse' function to replace the missings.
True can be abbreviated to 'T'.
---
## So far, so good.
Now that we know the basics of manipulating dataframes let's move to graphs.
Why is plotting your data important?
--
[Anscombe's quartet (1973)](http://www.jstor.org/stable/pdf/2682899.pdf?refreqid=excelsior%3A19d1f2fcfb1dacfb0ec692faa1512a37). _r_= .816
--
```{r, out.width = "350px", echo=FALSE}
knitr::include_graphics("https://upload.wikimedia.org/wikipedia/commons/thumb/e/ec/Anscombe%27s_quartet_3.svg/1280px-Anscombe%27s_quartet_3.svg.png")
```
Often: testing important assumptions.
---
## Datasaurus.
* An extension of Anscombe's quartet. Read more [here](https://www.autodeskresearch.com/publications/samestats)
```{r, fig.align="center", out.width = "775px", echo=FALSE}
knitr::include_graphics("AllDinosAnimatedSmaller.gif") # from https://www.autodeskresearch.com/publications/samestats
```
---
## Now let's make some graphs.
ggplot2
some 'helper' packages, 'scales' helps with axes, 'Rcolorbrewer' use, 'ggthemes' allows standardized templates, 'cow plot' allows panelling graphs.
```{r warning=F}
library(ggplot2)
library(scales) ### no longer needed in newer versions.
library(RColorBrewer)
library(ggthemes)
```
---
## Scatter plot
* Our data file is truly massive, so for this exercise we are just going to select all flights from JFK airport on christmas day.
```{r}
flights_jfk_new<-filter(flights, month == 12, day == 25, origin == 'JFK')
```
--
* Scatter plot
```{r, fig.align="center",fig.height=2.1, fig.width=2.3, fig.retina = 3}
ggplot(flights_jfk_new, aes(x=dep_delay, y=arr_delay)) + geom_point()
```
---
## What is that warning about?
* 2 missing cases.
* Let us replace those, in fact we already did that for dep_delay, but that was based on the **whole** set. For this exercise, we'll replace it again (so now it is replaced with the mean on X-mas day)
```{r, tidy.opts=list(width.cutoff=55)}
flights_jfk_new <- flights_jfk_new %>%
mutate(dep_delay_no_miss= ifelse(is.na(dep_delay),
mean(dep_delay, na.rm = T), dep_delay))
flights_jfk_new <- flights_jfk_new %>%
mutate(arr_delay_no_miss= ifelse(is.na(arr_delay),
mean(arr_delay, na.rm = T), arr_delay))
```
---
## Plot again
```{r, fig.align='center' ,fig.height=4, fig.width=4, fig.retina = 3, tidy.opts=list(width.cutoff=5)}
graph_delays<-ggplot(flights_jfk_new, aes(x=dep_delay_no_miss, y=arr_delay_no_miss))
graph_delays + geom_point()
```
---
## Colour by airline
Recap: ggplot(your_data, aes(x=x_variable, y=y_variable)) + geom_point()
Now let's colour it by carrier.
```{r, fig.align="center",fig.height=3.75, fig.width=3.75, fig.retina = 3, tidy.opts=list(width.cutoff=5)}
graph_colour <- ggplot(flights_jfk_new, aes(x=dep_delay_no_miss, y=arr_delay_no_miss,
colour=carrier))
graph_colour + geom_point()
```
---
## Flexibility.
* The beauty of ggplot is that using a different “geom”-function will create a different graph:
```{r, fig.align="center",fig.height=3.5, fig.width=3.5, fig.retina = 3}
graph_delays + geom_count()
```
---
## Let's try another one.
```{r, warning=F, message=F, fig.align="center",fig.height=4.5, fig.width=6, fig.retina = 3}
graph_delays + geom_count() + geom_smooth()
```
---
## Try it yourself.
* Make a similar set as we did but now: flights leaving LaGuardia (LGA) airport on X-mas day.
* Make a plot with departure delay on the X-axis and speed on the Y-axis.
* Colour the points by carrier.
```{r, out.width = "300px", echo=FALSE}
knitr::include_graphics("http://images.complex.com/complex/image/upload/c_limit,w_680/fl_lossy,pg_1,q_auto/tsb8pkwfrsx8cde2aqkz.jpg")
```
---
## You can tweak everything!
Example with Delta and United airlines.
```{r, fig.height=3, fig.width=6, tidy.opts=list(width.cutoff=35), fig.retina = 3}
flights_jfk_new_2_airlines<-filter(flights_jfk_new, carrier== "DL" | carrier== "UA" )
graph_beauty<-ggplot(flights_jfk_new_2_airlines,
aes(x=dep_delay_no_miss, y=arr_delay_no_miss,
colour=carrier)) +
geom_point(size=3, shape=13, alpha=0.5) + ### size/shape of points + transparency
geom_smooth(method="lm") + ### Add linear regression lines
scale_colour_brewer(palette = "Set1") + # Change the colours
labs(x="Delay departure", y="Delay arrival",
title="United vs. Delta - Showdown") +
# Change labels on axes and add title
theme_bw() # Remove the gray background
```
---
## Look at the pretty, pretty!
```{r, warning=F, message=F, out.width = "475px", fig.retina = 3}
graph_beauty
```
---
## Many types of graphs.
| # | geom | graph |
|----|:----------|:-----------------|
| 1 | [geom_density()](#anchor1) | Density plot (distribution) |
| 2 | [geom_histogram()](#anchor2) | Histogram (distribution) |
| 3 | [geom_violin()](#anchor3) | Violin pot (multiple distributions) |
| 4 | [geom_boxplot()](#anchor4) | Boxplot (multiple distributions) |
| 5 | [geom_bar()](#anchor5) | Bar chart (categorical + continuous/categorical) |
| 6 | [geom_point()](#anchor6) | Scatter (continuous + continuous) |
| 7 | [geom_count()](#anchor7) | Bubble plot (continuous + continuous/categorical) |
| 8 | [geom_jitter()](#anchor8) | Scatter (continuous + continuous/categorical) |
---
## Many types of graphs, continued
| # | geom | graph |
|----|:----------|:-----------------|
| 9 | [geom_smooth()](#anchor9) | Prediction line (continuous + continuous) |
| 10 | [geom_line()](#anchor10) | Line (continuous + continuous/categorical) |
| 11 | [geom_errorbar()](#anchor11) | Errorbar (continuous + continuous/categorical) |
--
We will not cover all of these today.
--
Need any help / reference? All the codes can be found [here](http://ggplot2.tidyverse.org/reference/)
---
## Histogram.
Let's make a panelled histogram comparing the delays in departure of two major airlines Jetblue (code= B6) and Delta airlines on X-mas day.
--
To help ourselves we'll make to separate sets. We'll also save them.
--
```{r}
### Note that I now went for new, short names.
### Ideally you want your naming to be consistent.
jfk_delta<-filter(flights_jfk_new, carrier== "DL")
jfk_jetblue<-filter(flights_jfk_new, carrier== "B6")
write.csv(jfk_delta,row.names = F, file= 'jfk_delta.csv')
write.csv(jfk_jetblue,row.names = F, file= 'jfk_jetblue.csv')
```
---
## Basics.
```{r, fig.height=5, fig.width=5, fig.pos='center', fig.retina = 3}
plot_dl<-ggplot(jfk_delta, aes(dep_delay))
plot_jetblue<-ggplot(jfk_jetblue, aes(dep_delay))
### Plot Delta.
plot_dl+geom_histogram(colour = "black", fill = "white")
```
---
## Binwidth?
* Defaults used. Binwidth is [important](http://www.stat.cmu.edu/~rnugent/PCMI2016/papers/WandBinWidth.pdf). Not too large/small.
```{r, fig.height=1.8, fig.width=2, fig.retina = 3}
plot_dl+geom_histogram(binwidth=1)
plot_dl+geom_histogram(binwidth=40)
```
---
## Settle on 5?
```{r, fig.height=1.8, fig.width=3, tidy.opts=list(width.cutoff=35), fig.retina = 3}
plot_dl<-plot_dl+geom_histogram(binwidth=5, colour = "black", fill = "white")
plot_jetblue<-plot_jetblue + geom_histogram(binwidth=5, colour = "black", fill = "white")
plot_dl
plot_jetblue
```
---
## Change labels.
```{r, fig.height=1.8, fig.width=2.2, fig.retina = 3}
plot_dl+ylab("Count") + xlab("Departure Delay - Delta")
plot_jetblue + ylab("Count") + xlab("Departure Delay - Jet Blue")
```
---
## Change the ticks / axes limits.
```{r, warning=F, message=F, fig.height=1.9, fig.width=3, tidy.opts=list(width.cutoff=35), fig.retina = 3}
plot_dl + scale_y_continuous(limits=c(0,50),breaks = pretty_breaks(6))
plot_dl + scale_x_continuous(limits=c(-20,300),breaks = pretty_breaks(6))
```
---
## Intermediate formatting lost.
* In the end we will need ***all*** the commands in one go! Or we need to store intermediate graphs and update those!
```{r, out.width = "500px", echo=FALSE, fig.pos='center', }
knitr::include_graphics("https://media.giphy.com/media/gKsJUddjnpPG0/giphy.gif")
```
---
## Themes.
* ggthemes. ??ggthemes for all options.
* google doc theme. Base size = fonts.
```{r, warning=F, message=F, fig.height=3.6, fig.width=5, fig.pos="center", fig.retina = 3}
plot_dl+theme_gdocs(base_size = 14)
```
---
## All in one!
```{r, warning=F, message=F, tidy=TRUE, fig.height=4, fig.width=4, tidy.opts=list(width.cutoff=12), fig.retina = 3}
plot_dl<-plot_dl+ geom_histogram(binwidth=5, colour = "black", fill = "white") + ylab("Count") + xlab("Departure Delay - Delta") + scale_y_continuous(limits=c(0,50),breaks = pretty_breaks(6)) + scale_x_continuous(limits=c(-15,250),breaks = pretty_breaks(6)) + theme_gdocs(base_size = 14)
plot_dl
```
---
## Try it yourself.
* Make a plot for jetblue, similar to the one for Delta Airlines
--
* Solution
```{r, tidy=TRUE, fig.height=3, fig.width=4, tidy.opts=list(width.cutoff=35), warning=F, message=F, fig.retina = 3}
plot_jetblue<-plot_jetblue+ geom_histogram(binwidth=5, colour = "black", fill = "white") + ylab("Count") + xlab("Departure Delay - Jet Blue")+ scale_y_continuous(limits=c(0,50),breaks = pretty_breaks(6)) + scale_x_continuous(limits=c(-15,250),breaks = pretty_breaks(6)) + theme_gdocs(base_size = 14)
plot_jetblue
```
---
## Saving it. ggsave??
```{r, tidy=TRUE, tidy.opts=list(width.cutoff=55), warning=F,message=F}
ggsave("plot_jetblue.pdf", width= 250, height=250, units="mm")
```
--
default is to save last plot displayed but you can explicitly name it (plot = )
--
You can change the output size to different units ("in")
--
dpi for resolution.
--
different formats (e.g., .png, .tiff)
--
It will be saved in your working directory. (change that with path = ).
---
## Combing multiple charts. plot_grid()
```{r, echo=F, message=F, warning=F, tidy.opts=list(width.cutoff=55), fig.retina = 3}
plot_jetblue<-plot_jetblue+ geom_histogram(binwidth=5, colour = "black", fill = "white") + ylab("Count") + xlab("Departure Delay - Jet Blue") + scale_y_continuous(limits=c(0,50),breaks = pretty_breaks(6)) + scale_x_continuous(limits=c(-15,250),breaks = pretty_breaks(6)) + theme_gdocs(base_size = 14)
plot_dl<-plot_dl+ geom_histogram(binwidth=5, colour = "black", fill = "white") + ylab("Count") + xlab("Departure Delay - Delta") + scale_y_continuous(limits=c(0,50),breaks = pretty_breaks(6)) + scale_x_continuous(limits=c(-15,250),breaks = pretty_breaks(6)) + theme_gdocs(base_size = 14)
```
```{r, tidy=TRUE, fig.height=4, fig.width=8, tidy.opts=list(width.cutoff=55), fig.retina = 3, warning=F, message=F}
library(cowplot)
plot_grid(plot_dl,plot_jetblue, labels=c('A', 'B'), label_size= 10, align="h")
ggsave("plot_panelled.pdf", width= 250, height=250, units="mm")
```
---
## Interpretation.
```{r, tidy=TRUE, fig.height=5, fig.width=8, tidy.opts=list(width.cutoff=55), warning=F, message=F, echo=F, fig.retina = 3}
library(cowplot)
plot_grid(plot_dl,plot_jetblue, labels=c('A', 'B'), label_size= 10, align="h")
ggsave("plot_panelled.pdf", width= 250, height=250, units="mm")
```
---
## What is good/bad about this chart?
* Quite clear.
--
* Only 2 groups? Overlay?
--
* Varying N. Sample size.
--
* White space.
--
* Units? (in minutes).
--
* (Suppressed some 'missing' values? - binning can obscure things - not always major issue)
---
## Violin plot (with box plot)
* Now let's approach this in a different way.
* Violin plots combined with a box plot (remember last week) give us good overview.
```{r, out.width = "500px", echo=FALSE, fig.align='center'}
knitr::include_graphics("https://media.giphy.com/media/x5JsFS1XtqCeA/giphy.gif") # From giphy
```
---
## Basic violin plot.
```{r,tidy=TRUE, fig.height=3.5, fig.width=8, tidy.opts=list(width.cutoff=50), fig.retina = 3}
### combine them into a single dataframe.
violinplot_data<-rbind.data.frame(jfk_delta,jfk_jetblue)
library(dplyr)
violinplot_data<- violinplot_data %>% mutate(carrier= recode(carrier, B6= "Jet Blue", DL="Delta Airlines"))
### make plot
violinplot<-ggplot(violinplot_data, aes(x = carrier, y = dep_delay)) + geom_violin() + geom_boxplot(width = 0.2)
violinplot
```
---
## Add some 'oomph'.
```{r,fig.height=4.5, fig.width=8,tidy=T, tidy.opts=list(width.cutoff=55), fig.retina = 3}
violinplot + xlab("Carrier")+ylab("Delay Departure (mins.)") + theme_stata(14)
```
---
## Can you describe the graph?
* What are the key points?
```{r, out.width = "400px", echo=FALSE}
knitr::include_graphics("https://media.giphy.com/media/bmcu39f2kh0WY/giphy.gif")
```
---
## Changing the order.
* Suppose we want Jet Blue on the left for the X-axis.
```{r,fig.height=2.1, fig.width=6,tidy=T, tidy.opts=list(width.cutoff=50), fig.retina = 3}
### recode.
library(haven) # needed for as_factor
violinplot_data$carrier_rec<-as_factor(violinplot_data$carrier)
ordered_levels<-c("Jet Blue", "Delta Airlines")
violinplot_data<- mutate(violinplot_data, carrier_rec = factor(carrier_rec, levels = ordered_levels))
### violinplot_2
violinplot_2<-ggplot(violinplot_data, aes(x = carrier_rec, y = dep_delay)) + geom_violin() + geom_boxplot(width = 0.2) + xlab("Carrier")+ylab("Delay Departure (mins.)") + theme_stata(14)
violinplot_2
```
---
## _t_-test: Are there more delays with Delta than with Jet Blue?
* Let's run an independent samples _t_-test
```{r}
t.test(dep_delay~carrier, data=violinplot_data)
```
---
## Let's break it down.
* dependent variable~independent variable
--
* Welch correction, this corrects for assumption of equality of variances. And should be the [default](https://www.rips-irsp.com/articles/10.5334/irsp.82/galley/42/download/), also check [this](https://academic.oup.com/beheco/article/17/4/688/215960/The-unequal-variance-t-test-is-an-underused).
---
## No SD?
* We need standard deviations for our report!
```{r}
sd(jfk_delta$dep_delay_no_miss)
sd(jfk_jetblue$dep_delay_no_miss)
```
---
## APA style write up.
You have done this before!
An independent-samples _t_-test was used to compare the delays in departure between Delta Airlines and Jet Blue in minutes. There was no significant difference between Delta airlines (_M_ = 17.30, _SD_ = 51.17) and Jet Blue (_M_ = 5.89, _SD_= 24.10) in average departure delay time, _t_(52.83)= 1.45, _p_=.15.
---
## Non-parametric tests.
* Do not make typical assumptions associated with parametric statistics (e.g., (roughly) normal distribution).
* Useful in our case, given the clear extreme values!
* Downside: information lost, often less powerful.
```{r, out.width = "350px", echo=FALSE, fig.align='center'}
knitr::include_graphics("https://media.giphy.com/media/l0MYwaJHfKsKwu4Ug/giphy.gif")
```
---
## Wilcoxon-Mann-Whitney U.
* Other alternatives exist (e.g., Mood's median test)
* Wilcoxon-Mann-Whitney U is common [alternative](http://vassarstats.net/textbook/ch11a.html) to _t_-test. (Sometimes, just called Mann-Whitney U).
```{r}
wilcox.test(dep_delay~carrier, data=violinplot_data)
```
---
## Kruskal-Wallis.
* We need as_factor. Read more [here](http://vassarstats.net/textbook/ch14a.html)
```{r}
kruskal.test(dep_delay~as_factor(carrier), data=violinplot_data)
```
---
## Medians?
* We need the medians for our report!
```{r}
median(jfk_delta$dep_delay_no_miss)
median(jfk_jetblue$dep_delay_no_miss)
```
---
## APA style write up.
* A Wilcoxon-Mann-Whitney test showed that there was a significant difference between Delta airlines (_Mdn_ = 1) and Jet Blue (_Mdn_= -1) in delays, _W_= 3375.5, _p_= .026.
* So: Delta airlines significantly later than Jet Blue.
```{r, out.width = "275px", echo=FALSE}
knitr::include_graphics("apa.png")
```
---
## Bootstrapping!
Bootstrapping is sampling ***with replacement***
It is a non-parametric technique.
Basically repeatedly drawing a sample and calculating a statistic.
Can be applied to nearly any technique.
---
## How to do it.
boot package. Find out more [here](http://people.tamu.edu/~alawing/materials/ESSM689/Btutorial.pdf).
--
We make a function with two arguments (data and indices). We will then feed our data to the function!
--
```{r,tidy=T, tidy.opts=list(width.cutoff=45)}
### Bootstrap 95% CI for t-test
library(boot)
### function to obtain p.values from t.test from the data
p_value <- function(data, indices) {
data_boot <- data[indices,] # allows boot to select sample
T_test_boot <- t.test(dep_delay~carrier, data=data_boot)
return(T_test_boot[[3]]) ### [[]] denotes the position in output - 3rd element
}
```
---
## Now let's run it.
```{r,tidy=T, tidy.opts=list(width.cutoff=55)}
### Bootstrapping is truly random.
### We need a 'seed' to make it replicable.
set.seed(2019)
### R is number of samples we want to draw?
### we store it in results.
results <- boot(data=violinplot_data, statistic=p_value,R=10000)
```
---
## Results.
```{r}
## view results
head(results)
```
---
## Plot Results.
```{r, out.width = "450px", fig.retina = 3}
plot(results)
```
---
## 95%CI for _p_.
* Various types. Read all about [here](http://www.tau.ac.il/~saharon/Boot/10.1.1.133.8405.pdf)
--
* One of the ['best'](https://faculty.washington.edu/heagerty/Courses/b572/public/GregImholte-1.pdf) methods is Bias-Corrected Accelerated.
--
```{r}
boot.ci(results, type="bca")
```
---
## What would you write in your report?
* The 95% Bias-Corrected Accelerated interval (using 10,000 bootstrap samples) for _p_ was [.006, .857].
```{r, out.width = "400px", echo=FALSE}
knitr::include_graphics("https://media.giphy.com/media/yNs2a0jRkYxy6191B2/giphy.gif")
```
---
## Test it yourself.
* Do the same as shown but now do it for delay in arrival.
---
## Exercise.
- Install and load the 'carData' package. load the professorial 'Salaries' dataset.
- This dataset has 397 cases and 6 variables. ??Salaries. Have a look (perhaps use ??skimr)
- Calculate the monthly salary (note that these are 9 month salaries)
- Make a 'beautiful' scatter plot with yrs.since.phd on X-axis and monthly salary and save it.
- Colour the scatter plot by gender.
- Make a panelled 'beautiful' histogram with the distribution of salaries by rank and save it.
---
## Exercise (cont'd).
- Make a 'beautiful' violin plot with gender on the X-axis and monthly salary on Y-axis.
- Test whether men earn more than women do. Use a parametric test and bootstrap it. Also use a non-parametric test. Report it as you would in an academic paper. Reflect and discuss: Why is this a good/bad testing approach?
- Bonus: calculate the mode of monthly income for each gender. calculate the (parametric) 95%CI for the mean of monthly income for each gender.
---
## References (and further reading.)
* Crawley, M. J. (2013). _The R book: Second edition._ New York, NY: John Wiley & Sons.
* Davison, A. C., & Hinkley, D. V. (1997). _Bootstrap methods and their application_. Cambridge, UK: Cambridge University Press.
* Pruim, R., Kaplan, D., & Horton, N. (2012). [mosaic: project MOSAIC statistics and mathematics teaching utilities](https://cran.r-project.org/doc/contrib/Horton+Pruim+Kaplan_MOSAIC-StudentGuide.pdf). R package version 0.6-2.
* Siegel, S. & Castellan, N.J. Jr. (1988). _Nonparametric statistics for the behavioral sciences. 2nd Edn._ McGraw-Hill, New York.
* Stulp, G. (2017). _[ggplotgui](https://github.com/gertstulp/ggplotgui)_
* Wickham, H., & Grolemund, G. (2017). _[R for data science](http://r4ds.had.co.nz/)_. Sebastopol, CA: O’Reilly.
* Willems, K. (2015) _[How to Make a Histogram with ggplot2](https://www.datacamp.com/community/tutorials/make-histogram-ggplot2#gs.MSjGyL8)_
---
## Acknowledgments.
[Gert Stulp](http://gertstulp.github.io) provided code for the ggplot2 section of this lecture.
These slides were built with [xaringan](https://github.com/yihui/xaringan).