Exercise 2: instructions.

Install and load the ‘carData’ package. load the professorial ‘Salaries’ dataset.
This dataset has 397 cases and 6 variables. ??Salaries. Have a look (perhaps use ??skimr)
Calculate the monthly salary (note that these are 9 month salaries)
Make a ‘beautiful’ scatter plot with yrs.since.phd on X-axis and monthly salary and save it.
Colour the scatter plot by gender.
Make a paneled ‘beautiful’ histogram with the distribution of salaries by rank and save it.
Make a ‘beautiful’ violin plot with gender on the X-axis and monthly salary on Y-axis.
Test whether men earn more than women do. Use a parametric test and bootstrap it. Also use a non-parametric test. Report it as you would in an academic paper. Reflect and discuss: Why is this a good/bad testing approach?
Bonus: calculate the mode of monthly income for each gender. calculate the (parametric) 95%CI for the mean of monthly income for each gender.

Load salaries dataset.

setwd("~/Dropbox/Teaching_MRes_Northumbria/Lecture2")
require(car)

## Loading required package: car

## Loading required package: carData

salaries<-carData::Salaries
require(skimr)

## Loading required package: skimr

skim(salaries)

Data summary
Name	salaries
Number of rows	397
Number of columns	6
_______________________
Column type frequency:
factor	3
numeric	3
________________________
Group variables	None

Variable type: factor

skim_variable	complete_rate	ordered	n_unique	top_counts
rank	1	FALSE	3	Pro: 266, Ass: 67, Ass: 64
discipline	1	FALSE	2	B: 216, A: 181
sex	1	FALSE	2	Mal: 358, Fem: 39

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
yrs.since.phd	1	22.31	12.89	1	12	21	32	56	▇▇▆▅▁
yrs.service	1	17.61	13.01	0	7	16	27	60	▇▅▃▂▁
salary	1	113706.46	30289.04	57800	91000	107300	134185	231545	▅▇▅▂▁

# Make 9 month salary
require(dplyr)

## Loading required package: dplyr

## 
## Attaching package: 'dplyr'

## The following object is masked from 'package:car':
## 
##     recode

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

salaries<- salaries %>% mutate(monthly_salary=salary/9)

‘Beautiful’ scatter plot.

Perhaps not the most beautiful but demonstrating the High Charts theme. Other themes: ??ggthemes. Toyed around with some settings

require(ggplot2)

## Loading required package: ggplot2

require(ggthemes)

## Loading required package: ggthemes

phd_plot<- ggplot(data = salaries, aes(yrs.since.phd, monthly_salary)) + geom_point(size=2.5, shape=16, alpha=0.5) + theme_hc(12)  + labs(x="Years since Ph.D.", y="Monthly Salary ($)", title="Salary Post-Ph.D.") 
phd_plot

ggsave("scatterplot_phd.pdf", width= 5, height=5, units="in")

Colour scatter plot by gender.

Altered the colour. Note that you could still toy with shapes and note that I removed the transparency. I toyed with the X-axis. You might also want to adjust your Y-axis.

# rename sex to gender.
colnames(salaries)[5] <-"gender"
# make plot (if you are clever, you can recycle the previous plot)
require(scales)

## Loading required package: scales

gender_phd_plot<- ggplot(data = salaries, aes(yrs.since.phd, monthly_salary, colour= gender)) + scale_colour_brewer(palette = "Accent") + geom_point(size=2.5, shape=16) + theme_hc(12)  + labs(x="Years since Ph.D.", y="Monthly Salary ($)", title="Salary Post-Ph.D.") + scale_x_continuous(limits=c(-2,60),breaks = pretty_breaks(4))
gender_phd_plot

ggsave("scatterplot_gender_phd.pdf", width= 5, height=5, units="in")

Panelled histogram.

You can do it as in class but the following should save us some time! I used dplyr to make 3 separate plots.

require(dplyr)
require(ggplot2)
require(scales)
require(cowplot)

## Loading required package: cowplot

## 
## Attaching package: 'cowplot'

## The following object is masked from 'package:ggthemes':
## 
##     theme_map

# note that I used coord_cartesian here to set the limits. You might need to peak a couple of times to get a sensible range.
plots_3 <- salaries %>% group_by(rank) %>% do(plots=ggplot(data=.) +
         aes(x=monthly_salary) + geom_histogram() +  ylab("Count") + xlab("Monthly Salary (US$)") + ggtitle(unique(.$rank)) + theme_gdocs(11) + coord_cartesian(xlim=c(5000,30000),ylim=c(0,30)) + scale_x_continuous(breaks = pretty_breaks(6)) + scale_y_continuous(breaks = pretty_breaks(4)))
plots_3$plots[1]

## [[1]]

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

plots_3$plots[2]

## [[1]]

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

plots_3$plots[3]

## [[1]]

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# plot them together, we need plotlist()
plot_grid(plotlist=c(plots_3$plots[1],plots_3$plots[2],plots_3$plots[3]), align="h")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggsave("histograms_salary.pdf", width= 7, height=7, units="in")

Note that graph might not be optimal as we have a lot of white space! So have a look at the example below using density and histogram.

require(dplyr)
require(ggplot2)
require(scales)
require(cowplot)
plot_density<-ggplot(salaries, aes(monthly_salary, fill=rank))+ geom_density(alpha=.4) + scale_fill_brewer(palette = "Pastel2") + geom_histogram(aes(y=..density..), alpha=0.5,position="identity") + ylab("Density") + xlab("Monthly Salary (US$)") + theme_fivethirtyeight(11)
plot_density

## Warning: The dot-dot notation (`..density..`) was deprecated in ggplot2 3.4.0.
## ℹ Please use `after_stat(density)` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggsave("density_salary.pdf", width= 7, height=7, units="in")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

It shows the differences quite neatly perhaps most striking is the very wide range of professorial salaries.

Violin plot.

Could be even more beautiful but below is fairly clean, not?

violinplot_gender<-ggplot(salaries, aes(x = gender, y = monthly_salary)) + geom_violin() + geom_boxplot(width = 0.2) + ylab("Monthly Salary (US$)") + xlab("Gender") + theme_economist_white(12)
violinplot_gender

ggsave("violinplot_gender.pdf",width= 5, height=5, units="in")

Note some of the outliers in the male data and the ‘long tail’.

Tests.

t-test

A t-test, first.

t.test(monthly_salary~gender, data=salaries)

## 
##  Welch Two Sample t-test
## 
## data:  monthly_salary by gender
## t = -3.1615, df = 50.122, p-value = 0.002664
## alternative hypothesis: true difference in means between group Female and group Male is not equal to 0
## 95 percent confidence interval:
##  -2559.7684  -570.9002
## sample estimates:
## mean in group Female   mean in group Male 
##             11222.49             12787.82

require(dplyr)
gender_group_sd<- salaries %>% group_by(gender) %>% summarise(SD_monthly=sd(monthly_salary))
gender_group_sd

An independent-samples t-test was used to compare the (mean) monthly salaries between men and women. There was a significant difference with men (M = $12788, SD = 3381.88) earning on average more than women do (M = $11222, SD= 2883.57), t(50.12)= 3.16, p=.003.

But: bear in mind the outliers, and also some other things (e.g., taking into account rank/yrs since Ph.D.?)

Bootstrap it.

### Bootstrap 95% CI for t-test
require(boot)

## Loading required package: boot

## 
## Attaching package: 'boot'

## The following object is masked from 'package:car':
## 
##     logit

### function to obtain p.values from t.test from the data 
p_value <- function(data, indices) {
  data_boot <- data[indices,] # allows boot to select sample 
  T_test_boot <- t.test(monthly_salary~gender, data=data_boot)
  return(T_test_boot[[3]])  ### [[]] denotes the position in output - 3rd element
} 
### Bootstrapping is truly random.
### We need a 'seed' to make it replicable.
set.seed(1981) # best year ever... .
### R is number of samples we want to draw?
### we store it in results.
results <- boot(data=salaries, statistic=p_value,R=10000)
plot(results)

boot.ci(results, type="bca")

## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 10000 bootstrap replicates
## 
## CALL : 
## boot.ci(boot.out = results, type = "bca")
## 
## Intervals : 
## Level       BCa          
## 95%   ( 0.0000,  0.3548 )  
## Calculations and Intervals on Original Scale

boot.ci(results, conf=.90, type="bca")

## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 10000 bootstrap replicates
## 
## CALL : 
## boot.ci(boot.out = results, conf = 0.9, type = "bca")
## 
## Intervals : 
## Level       BCa          
## 90%   ( 0.0000,  0.1949 )  
## Calculations and Intervals on Original Scale

The 95% bootstrapped confidence interval for p ranges from <.0001 to .3548. (But note if you had picked a smaller CI, 90%CI, it would be a narrower range, duh :) - still quite a broad range)

Median test

wilcox.test(monthly_salary~gender, data=salaries)

## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  monthly_salary by gender
## W = 5182.5, p-value = 0.008237
## alternative hypothesis: true location shift is not equal to 0

gender_group_med<- salaries %>% group_by(gender) %>% summarise(median_monthly=median(monthly_salary))
gender_group_med

A Wilcoxon-Mann-Whitney U test showed that men (Median=12005) earned significantly more than women did (Median=11528) W=5182.5, p= .008.

But note that the medians are actually not significantly different, if you use a Mood’s median test! (They are testing different things, medians vs. ranks.). So depending on how you look at the data, you might draw different conclusions here! Again, bear in mind that one would try and do multiple robustness checks to find out if any conclusion holds up. For example, by controlling for potential third variables. Have a think about what those could be.

require(RVAideMemoire)

## Loading required package: RVAideMemoire

## *** Package RVAideMemoire v 0.9-83-7 ***

mood.medtest(monthly_salary~gender, data=salaries)

## 
##  Mood's median test
## 
## data:  monthly_salary by gender
## X-squared = 1.7754, df = 1, p-value = 0.1827

BONUS: Mode and 95%CI

Mode

library(modeest)
require(dplyr)
male_data<-filter(salaries, gender=="Male")
mlv(male_data$monthly_salary,  method='mfv')

## [1]  8222.222 10222.222

female_data<-filter(salaries, gender=='Female')
mlv(female_data$monthly_salary,  method='mfv')

## [1] 8055.556 8611.111

The modal monthly income for men was $9222 for women it was $8333. (Note that these are the averages of the two numbers reported).

95%CI

There alternative ways of doing this but I did it via dplyr. Look back to the slides of week 1.

require(dplyr)
gender_group_95CI<- salaries %>% group_by(gender) %>% summarise(mean_monthly=mean(monthly_salary), SD_monthly=sd(monthly_salary), se_monthly=SD_monthly/sqrt(length(monthly_salary)), LL=(mean_monthly - 1.96*se_monthly), UL= (mean_monthly + 1.96*se_monthly))
gender_group_95CI

The (parametric) 95% CI for mean monthly income of men is [$12438, $13138], in contrast for women this is [$10317, $12128]. These do not overlap.

SessonInfo

The end.

sessionInfo()

## R version 4.4.2 (2024-10-31)
## Platform: aarch64-apple-darwin20
## Running under: macOS Sequoia 15.1
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: Europe/London
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] modeest_2.4.0          RVAideMemoire_0.9-83-7 boot_1.3-31           
##  [4] cowplot_1.1.3          scales_1.3.0           ggthemes_5.1.0        
##  [7] ggplot2_3.5.1          dplyr_1.1.4            skimr_2.1.5           
## [10] car_3.1-3              carData_3.0-5         
## 
## loaded via a namespace (and not attached):
##  [1] gtable_0.3.6        xfun_0.49           bslib_0.8.0        
##  [4] vctrs_0.6.5         tools_4.4.2         generics_0.1.3     
##  [7] tibble_3.2.1        fansi_1.0.6         cluster_2.1.6      
## [10] pkgconfig_2.0.3     RColorBrewer_1.1-3  lifecycle_1.0.4    
## [13] compiler_4.4.2      farver_2.1.2        stringr_1.5.1      
## [16] textshaping_0.4.0   munsell_0.5.1       repr_1.1.7         
## [19] clue_0.3-65         htmltools_0.5.8.1   sass_0.4.9         
## [22] rmutil_1.1.10       yaml_2.3.10         Formula_1.2-5      
## [25] pillar_1.9.0        crayon_1.5.3        jquerylib_0.1.4    
## [28] tidyr_1.3.1         cachem_1.1.0        rpart_4.1.23       
## [31] abind_1.4-8         fBasics_4041.97     tidyselect_1.2.1   
## [34] digest_0.6.37       stringi_1.8.4       purrr_1.0.2        
## [37] labeling_0.4.3      fastmap_1.2.0       grid_4.4.2         
## [40] colorspace_2.1-1    cli_3.6.3           magrittr_2.0.3     
## [43] base64enc_0.1-3     utf8_1.2.4          stable_1.1.6       
## [46] withr_3.0.2         rmarkdown_2.28      timeDate_4041.110  
## [49] ragg_1.3.3          timeSeries_4041.111 statip_0.2.3       
## [52] evaluate_1.0.1      knitr_1.49          stabledist_0.7-2   
## [55] rlang_1.1.4         spatial_7.3-17      glue_1.8.0         
## [58] rstudioapi_0.17.1   jsonlite_1.8.9      R6_2.5.1           
## [61] systemfonts_1.1.0

Solution Lecture 2

Prof. Thomas Pollet, Northumbria University, UK (thomas.pollet@northumbria.ac.uk)

2025-01-06 | disclaimer