Install and load the ‘carData’ package. load the professorial ‘Salaries’ dataset.
This dataset has 397 cases and 6 variables. ??Salaries. Have a look (perhaps use ??skimr)
Calculate the monthly salary (note that these are 9 month salaries)
Make a ‘beautiful’ scatter plot with yrs.since.phd on X-axis and monthly salary and save it.
Colour the scatter plot by gender.
Make a paneled ‘beautiful’ histogram with the distribution of salaries by rank and save it.
Make a ‘beautiful’ violin plot with gender on the X-axis and monthly salary on Y-axis.
Test whether men earn more than women do. Use a parametric test and bootstrap it. Also use a non-parametric test. Report it as you would in an academic paper. Reflect and discuss: Why is this a good/bad testing approach?
Bonus: calculate the mode of monthly income for each gender. calculate the (parametric) 95%CI for the mean of monthly income for each gender.
setwd("~/Dropbox/Teaching_MRes_Northumbria/Lecture2")
require(car)
## Loading required package: car
## Loading required package: carData
salaries<-carData::Salaries
require(skimr)
## Loading required package: skimr
skim(salaries)
Name | salaries |
Number of rows | 397 |
Number of columns | 6 |
_______________________ | |
Column type frequency: | |
factor | 3 |
numeric | 3 |
________________________ | |
Group variables | None |
Variable type: factor
skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
---|---|---|---|---|---|
rank | 0 | 1 | FALSE | 3 | Pro: 266, Ass: 67, Ass: 64 |
discipline | 0 | 1 | FALSE | 2 | B: 216, A: 181 |
sex | 0 | 1 | FALSE | 2 | Mal: 358, Fem: 39 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
yrs.since.phd | 0 | 1 | 22.31 | 12.89 | 1 | 12 | 21 | 32 | 56 | ▇▇▆▅▁ |
yrs.service | 0 | 1 | 17.61 | 13.01 | 0 | 7 | 16 | 27 | 60 | ▇▅▃▂▁ |
salary | 0 | 1 | 113706.46 | 30289.04 | 57800 | 91000 | 107300 | 134185 | 231545 | ▅▇▅▂▁ |
# Make 9 month salary
require(dplyr)
## Loading required package: dplyr
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:car':
##
## recode
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
salaries<- salaries %>% mutate(monthly_salary=salary/9)
Perhaps not the most beautiful but demonstrating the High Charts theme. Other themes: ??ggthemes. Toyed around with some settings
require(ggplot2)
## Loading required package: ggplot2
require(ggthemes)
## Loading required package: ggthemes
phd_plot<- ggplot(data = salaries, aes(yrs.since.phd, monthly_salary)) + geom_point(size=2.5, shape=16, alpha=0.5) + theme_hc(12) + labs(x="Years since Ph.D.", y="Monthly Salary ($)", title="Salary Post-Ph.D.")
phd_plot
ggsave("scatterplot_phd.pdf", width= 5, height=5, units="in")
Altered the colour. Note that you could still toy with shapes and note that I removed the transparency. I toyed with the X-axis. You might also want to adjust your Y-axis.
# rename sex to gender.
colnames(salaries)[5] <-"gender"
# make plot (if you are clever, you can recycle the previous plot)
require(scales)
## Loading required package: scales
gender_phd_plot<- ggplot(data = salaries, aes(yrs.since.phd, monthly_salary, colour= gender)) + scale_colour_brewer(palette = "Accent") + geom_point(size=2.5, shape=16) + theme_hc(12) + labs(x="Years since Ph.D.", y="Monthly Salary ($)", title="Salary Post-Ph.D.") + scale_x_continuous(limits=c(-2,60),breaks = pretty_breaks(4))
gender_phd_plot
ggsave("scatterplot_gender_phd.pdf", width= 5, height=5, units="in")
You can do it as in class but the following should save us some time! I used dplyr to make 3 separate plots.
require(dplyr)
require(ggplot2)
require(scales)
require(cowplot)
## Loading required package: cowplot
##
## Attaching package: 'cowplot'
## The following object is masked from 'package:ggthemes':
##
## theme_map
# note that I used coord_cartesian here to set the limits. You might need to peak a couple of times to get a sensible range.
plots_3 <- salaries %>% group_by(rank) %>% do(plots=ggplot(data=.) +
aes(x=monthly_salary) + geom_histogram() + ylab("Count") + xlab("Monthly Salary (US$)") + ggtitle(unique(.$rank)) + theme_gdocs(11) + coord_cartesian(xlim=c(5000,30000),ylim=c(0,30)) + scale_x_continuous(breaks = pretty_breaks(6)) + scale_y_continuous(breaks = pretty_breaks(4)))
plots_3$plots[1]
## [[1]]
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
plots_3$plots[2]
## [[1]]
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
plots_3$plots[3]
## [[1]]
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# plot them together, we need plotlist()
plot_grid(plotlist=c(plots_3$plots[1],plots_3$plots[2],plots_3$plots[3]), align="h")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggsave("histograms_salary.pdf", width= 7, height=7, units="in")
Note that graph might not be optimal as we have a lot of white space! So have a look at the example below using density and histogram.
require(dplyr)
require(ggplot2)
require(scales)
require(cowplot)
plot_density<-ggplot(salaries, aes(monthly_salary, fill=rank))+ geom_density(alpha=.4) + scale_fill_brewer(palette = "Pastel2") + geom_histogram(aes(y=..density..), alpha=0.5,position="identity") + ylab("Density") + xlab("Monthly Salary (US$)") + theme_fivethirtyeight(11)
plot_density
## Warning: The dot-dot notation (`..density..`) was deprecated in ggplot2 3.4.0.
## ℹ Please use `after_stat(density)` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggsave("density_salary.pdf", width= 7, height=7, units="in")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
It shows the differences quite neatly perhaps most striking is the very wide range of professorial salaries.
Could be even more beautiful but below is fairly clean, not?
violinplot_gender<-ggplot(salaries, aes(x = gender, y = monthly_salary)) + geom_violin() + geom_boxplot(width = 0.2) + ylab("Monthly Salary (US$)") + xlab("Gender") + theme_economist_white(12)
violinplot_gender
ggsave("violinplot_gender.pdf",width= 5, height=5, units="in")
Note some of the outliers in the male data and the ‘long tail’.
A t-test, first.
t.test(monthly_salary~gender, data=salaries)
##
## Welch Two Sample t-test
##
## data: monthly_salary by gender
## t = -3.1615, df = 50.122, p-value = 0.002664
## alternative hypothesis: true difference in means between group Female and group Male is not equal to 0
## 95 percent confidence interval:
## -2559.7684 -570.9002
## sample estimates:
## mean in group Female mean in group Male
## 11222.49 12787.82
require(dplyr)
gender_group_sd<- salaries %>% group_by(gender) %>% summarise(SD_monthly=sd(monthly_salary))
gender_group_sd
## # A tibble: 2 × 2
## gender SD_monthly
## <fct> <dbl>
## 1 Female 2884.
## 2 Male 3382.
An independent-samples t-test was used to compare the (mean) monthly salaries between men and women. There was a significant difference with men (M = $12788, SD = 3381.88) earning on average more than women do (M = $11222, SD= 2883.57), t(50.12)= 3.16, p=.003.
But: bear in mind the outliers, and also some other things (e.g., taking into account rank/yrs since Ph.D.?)
### Bootstrap 95% CI for t-test
require(boot)
## Loading required package: boot
##
## Attaching package: 'boot'
## The following object is masked from 'package:car':
##
## logit
### function to obtain p.values from t.test from the data
p_value <- function(data, indices) {
data_boot <- data[indices,] # allows boot to select sample
T_test_boot <- t.test(monthly_salary~gender, data=data_boot)
return(T_test_boot[[3]]) ### [[]] denotes the position in output - 3rd element
}
### Bootstrapping is truly random.
### We need a 'seed' to make it replicable.
set.seed(1981) # best year ever... .
### R is number of samples we want to draw?
### we store it in results.
results <- boot(data=salaries, statistic=p_value,R=10000)
plot(results)
boot.ci(results, type="bca")
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 10000 bootstrap replicates
##
## CALL :
## boot.ci(boot.out = results, type = "bca")
##
## Intervals :
## Level BCa
## 95% ( 0.0000, 0.3548 )
## Calculations and Intervals on Original Scale
boot.ci(results, conf=.90, type="bca")
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 10000 bootstrap replicates
##
## CALL :
## boot.ci(boot.out = results, conf = 0.9, type = "bca")
##
## Intervals :
## Level BCa
## 90% ( 0.0000, 0.1949 )
## Calculations and Intervals on Original Scale
The 95% bootstrapped confidence interval for p ranges from <.0001 to .3548. (But note if you had picked a smaller CI, 90%CI, it would be a narrower range, duh :) - still quite a broad range)
wilcox.test(monthly_salary~gender, data=salaries)
##
## Wilcoxon rank sum test with continuity correction
##
## data: monthly_salary by gender
## W = 5182.5, p-value = 0.008237
## alternative hypothesis: true location shift is not equal to 0
gender_group_med<- salaries %>% group_by(gender) %>% summarise(median_monthly=median(monthly_salary))
gender_group_med
## # A tibble: 2 × 2
## gender median_monthly
## <fct> <dbl>
## 1 Female 11528.
## 2 Male 12005.
A Wilcoxon-Mann-Whitney U test showed that men (Median=12005) earned significantly more than women did (Median=11528) W=5182.5, p= .008.
But note that the medians are actually not significantly different, if you use a Mood’s median test! (They are testing different things, medians vs. ranks.). So depending on how you look at the data, you might draw different conclusions here! Again, bear in mind that one would try and do multiple robustness checks to find out if any conclusion holds up. For example, by controlling for potential third variables. Have a think about what those could be.
require(RVAideMemoire)
## Loading required package: RVAideMemoire
## *** Package RVAideMemoire v 0.9-83-3 ***
mood.medtest(monthly_salary~gender, data=salaries)
##
## Mood's median test
##
## data: monthly_salary by gender
## X-squared = 1.7754, df = 1, p-value = 0.1827
library(modeest)
require(dplyr)
male_data<-filter(salaries, gender=="Male")
mlv(male_data$monthly_salary, method='mfv')
## [1] 8222.222 10222.222
female_data<-filter(salaries, gender=='Female')
mlv(female_data$monthly_salary, method='mfv')
## [1] 8055.556 8611.111
The modal monthly income for men was $9222 for women it was $8333. (Note that these are the averages of the two numbers reported).
There alternative ways of doing this but I did it via dplyr. Look back to the slides of week 1.
require(dplyr)
gender_group_95CI<- salaries %>% group_by(gender) %>% summarise(mean_monthly=mean(monthly_salary), SD_monthly=sd(monthly_salary), se_monthly=SD_monthly/sqrt(length(monthly_salary)), LL=(mean_monthly - 1.96*se_monthly), UL= (mean_monthly + 1.96*se_monthly))
gender_group_95CI
## # A tibble: 2 × 6
## gender mean_monthly SD_monthly se_monthly LL UL
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Female 11222. 2884. 462. 10317. 12128.
## 2 Male 12788. 3382. 179. 12437. 13138.
The (parametric) 95% CI for mean monthly income of men is [$12438, $13138], in contrast for women this is [$10317, $12128]. These do not overlap.
The end.
sessionInfo()
## R version 4.3.2 (2023-10-31)
## Platform: aarch64-apple-darwin20 (64-bit)
## Running under: macOS Ventura 13.4
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.11.0
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: Europe/London
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] modeest_2.4.0 RVAideMemoire_0.9-83-3 boot_1.3-28.1
## [4] cowplot_1.1.1 scales_1.2.1 ggthemes_4.2.4
## [7] ggplot2_3.4.4 dplyr_1.1.3 skimr_2.1.5
## [10] car_3.1-2 carData_3.0-5
##
## loaded via a namespace (and not attached):
## [1] sass_0.4.7 utf8_1.2.4 generics_0.1.3
## [4] tidyr_1.3.0 stringi_1.7.12 statip_0.2.3
## [7] digest_0.6.33 magrittr_2.0.3 evaluate_0.23
## [10] grid_4.3.2 RColorBrewer_1.1-3 fastmap_1.1.1
## [13] jsonlite_1.8.7 purrr_1.0.2 fansi_1.0.5
## [16] stabledist_0.7-1 textshaping_0.3.7 jquerylib_0.1.4
## [19] timeSeries_4031.107 abind_1.4-5 cli_3.6.1
## [22] crayon_1.5.2 rlang_1.1.1 munsell_0.5.0
## [25] base64enc_0.1-3 withr_2.5.2 repr_1.1.6
## [28] cachem_1.0.8 yaml_2.3.7 rmutil_1.1.10
## [31] tools_4.3.2 stable_1.1.6 colorspace_2.1-0
## [34] rpart_4.1.21 vctrs_0.6.4 R6_2.5.1
## [37] lifecycle_1.0.3 stringr_1.5.0 clue_0.3-65
## [40] cluster_2.1.4 ragg_1.2.6 pkgconfig_2.0.3
## [43] pillar_1.9.0 bslib_0.5.1 gtable_0.3.4
## [46] glue_1.6.2 systemfonts_1.0.5 xfun_0.41
## [49] tibble_3.2.1 tidyselect_1.2.0 highr_0.10
## [52] rstudioapi_0.15.0 knitr_1.45 spatial_7.3-17
## [55] farver_2.1.1 fBasics_4032.96 htmltools_0.5.7
## [58] rmarkdown_2.25 labeling_0.4.3 timeDate_4022.108
## [61] compiler_4.3.2