Exercise 3: instructions.

Using last week’s Salaries dataset, test the assumptions for an ANOVA with monthly salary as dependent variable and academic rank as independent variable.

Conduct an ANOVA with the appropriate post-hoc tests. What do you conclude?

Conduct an alternative non-parametric test to the ANOVA. What do you conclude?

Conduct an ANCOVA with gender as factor, and years since Ph.D. as covariate. What do you conclude?

Load the iris dataset, it is part of the datasets package.

It is a famous dataset with measurements of 3 species of iris flowers.

Test the assumptions for a MANOVA with Species as the between-subject factor and petal length and sepal length as dependent variables. (I previously did this for length)

Run the MANOVA. What do you conclude? Write up the result as you would do for a paper?

BONUS: Plot one of the results from your analyses on the salaries database.

BONUS: Conduct a follow up analysis or plot for the MANOVA.

Load the salaries database

setwd("~/Dropbox/Teaching_MRes_Northumbria/Lecture3")
require(carData)
## Loading required package: carData
salaries<-carData::Salaries
require(dplyr)
## Loading required package: dplyr
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
salaries<- salaries %>% mutate(monthly_salary=salary/9)

Check assumptions

Normality

We perform a visual check here. See the slides for (inferior?) alternatives

require(ggplot2)
## Loading required package: ggplot2
require(ggthemes)
## Loading required package: ggthemes
plot_hist <- ggplot(salaries, aes(x=monthly_salary)) 
plot_hist <- plot_hist+ geom_histogram(colour = "black", fill = "white") + facet_wrap(~rank) + theme_gdocs(12) + xlab("Monthly salary (in USD)") + ylab("Count")
plot_hist
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

More or less OK, perhaps some deviations from normality with the asst. prof group. There is perhaps one extreme case in the prof. group who earns more than 25k USD a month!

Homogeneity of variance

Many options exist but went with Levene’s test.

require(car)
## Loading required package: car
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
leveneTest(monthly_salary ~ rank, data=salaries)

The assumption for homogeneity of variances is violated (Levene’s test: F(2,394) = 38.71, p< .0001).

ANOVA

require(ez)
## Loading required package: ez
salaries$ID<- c(1:(NROW(salaries))) # This makes an ID variable
EZ_ANOVA1<-ezANOVA(salaries, dv=monthly_salary, wid=ID, between=rank, detailed=TRUE)
## Warning: Converting "ID" to factor for ANOVA.
## Warning: Data is unbalanced (unequal N per group). Make sure you specified a
## well-considered value for the type argument to ezANOVA().
## Coefficient covariances computed by hccm()
EZ_ANOVA1
## $ANOVA
##   Effect DFn DFd        SSn        SSd        F            p p<.05       ges
## 1   rank   2 394 1768293404 2716899714 128.2174 1.293048e-43     * 0.3942513
## 
## $`Levene's Test for Homogeneity of Variance`
##   DFn DFd       SSn       SSd        F            p p<.05
## 1   2 394 194134938 987948286 38.71112 4.476856e-16     *

A one-way ANOVA showed a significant effect of academic rank on monthly salary, F(2, 394) = 128.22, p < .0001, \(\eta^2_g\) = .39.

Alternative (non-parametric) test.

Many different options (see lecture slides). I went with a medians test.

require(WRS2)
## Loading required package: WRS2
# WRS, ANOVA for Medians. (note iter=)
med1way(monthly_salary ~ rank, data=salaries, iter=10000)
## Call:
## med1way(formula = monthly_salary ~ rank, data = salaries, iter = 10000)
## 
## Test statistic F: 96.3811 
## Critical value: 2.8435 
## p-value: 0
# analysis of Medians leads to same conclusion.
require(dplyr)
grouped<-group_by(salaries, rank)
grouped %>% summarise(median= median(monthly_salary))

A median test showed that the medians differed significantly by academic rank (Test= 96.38, p<.0001). The medians showed that Assistant professors ($8867) earned least, followed by Associate Professors ($10625), and full Professors earned most ($13,702).

ANCOVA

(Note: assumptions not tested here)

require(car)
require(compute.es)
## Loading required package: compute.es
Ancova<-lm(monthly_salary~sex+yrs.since.phd, contrasts=list(sex=contr.sum), data=salaries)
Anova(Ancova, type="III")

A one-way ANCOVA examined the effect of gender on monthly salary whilst controlling for the number of years since Ph.D. There was no statistically significant effect of gender, F(1,394)=2.86, p=.092. However, the effect of the covariate, years since Ph.D., was significant, F(1,394)=78.23, p<.0001.

MANOVA

Load the iris dataset / histograms.

I previously have made an error in the instructions and had previously run everything with width rather than length

require(datasets) # not super-necessary as datasets is part of base r
require(skimr)
## Loading required package: skimr
iris<-datasets::iris
skim(iris)
Data summary
Name iris
Number of rows 150
Number of columns 5
_______________________
Column type frequency:
factor 1
numeric 4
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
Species 0 1 FALSE 3 set: 50, ver: 50, vir: 50

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Sepal.Length 0 1 5.84 0.83 4.3 5.1 5.80 6.4 7.9 ▆▇▇▅▂
Sepal.Width 0 1 3.06 0.44 2.0 2.8 3.00 3.3 4.4 ▁▆▇▂▁
Petal.Length 0 1 3.76 1.77 1.0 1.6 4.35 5.1 6.9 ▇▁▆▇▂
Petal.Width 0 1 1.20 0.76 0.1 0.3 1.30 1.8 2.5 ▇▁▇▅▃
# summarytools also prints nice overview.
#??summarytools
require(summarytools)
## Loading required package: summarytools
dfSummary(iris)

histograms.

Visual inspection of normality. This time with the ‘few theme’.

require(ggplot2)
require(ggthemes)
plot_hist_sepal <- ggplot(iris, aes(x=Sepal.Width)) 
plot_hist_sepal <- plot_hist_sepal+ geom_histogram(colour = "black", fill = "white") + facet_wrap(~Species) + theme_few(12) + xlab("Sepal Width") + ylab("Count")
plot_hist_sepal
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

plot_hist_petal <- ggplot(iris, aes(x=Petal.Width)) 
plot_hist_petal <- plot_hist_petal+ geom_histogram(colour = "black", fill = "white") + facet_wrap(~Species) + theme_few(12) + xlab("Petal Width") + ylab("Count")
plot_hist_petal
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

This seems generally fine for sepal width but perhaps some issues for petal width. As a follow-up one could consider bootstrapping the results.

Here I do the test for length rather than width.

require(ggplot2)
require(ggthemes)
plot_hist_sepal_l <- ggplot(iris, aes(x=Sepal.Length)) 
plot_hist_sepal_l <- plot_hist_sepal_l+ geom_histogram(colour = "black", fill = "white") + facet_wrap(~Species) + theme_few(12) + xlab("Sepal Length") + ylab("Count")
plot_hist_sepal_l
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

plot_hist_petal_l <- ggplot(iris, aes(x=Petal.Length)) 
plot_hist_petal_l <- plot_hist_petal_l+ geom_histogram(colour = "black", fill = "white") + facet_wrap(~Species) + theme_few(12) + xlab("Petal Length") + ylab("Count")
plot_hist_petal_l
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

This seems generally fine for sepal width but perhaps some issues for petal width (especially in setosa ?). As a follow-up one could again consider bootstrapping the results.

Multivariate normality

mardiaTest is deprecated as a function(). So now updated with ‘mvn’

require(dplyr)
multivariatenorm<- iris %>% select(c(Sepal.Width,Petal.Width))
require(MVN)
## Loading required package: MVN
mvn(multivariatenorm)
## $multivariateNormality
##            Test       HZ      p value MVN
## 1 Henze-Zirkler 5.429793 1.119371e-11  NO
## 
## $univariateNormality
##               Test    Variable Statistic   p value Normality
## 1 Anderson-Darling Sepal.Width    0.9080  0.0202      NO    
## 2 Anderson-Darling Petal.Width    5.1057  <0.001      NO    
## 
## $Descriptives
##               n     Mean   Std.Dev Median Min Max 25th 75th       Skew
## Sepal.Width 150 3.057333 0.4358663    3.0 2.0 4.4  2.8  3.3  0.3126147
## Petal.Width 150 1.199333 0.7622377    1.3 0.1 2.5  0.3  1.8 -0.1009166
##               Kurtosis
## Sepal.Width  0.1387047
## Petal.Width -1.3581792

The assumption of multivariate normality is violated.

Let us look at length. (Now with a Q-Q plot, also note other options for making figures).

require(dplyr)
multivariatenorm2<- iris %>% select(c(Sepal.Length,Petal.Length))
require(MVN)
mvn(multivariatenorm2, multivariatePlot = "qq")

## $multivariateNormality
##            Test       HZ      p value MVN
## 1 Henze-Zirkler 5.450384 1.035771e-11  NO
## 
## $univariateNormality
##               Test     Variable Statistic   p value Normality
## 1 Anderson-Darling Sepal.Length    0.8892  0.0225      NO    
## 2 Anderson-Darling Petal.Length    7.6785  <0.001      NO    
## 
## $Descriptives
##                n     Mean   Std.Dev Median Min Max 25th 75th       Skew
## Sepal.Length 150 5.843333 0.8280661   5.80 4.3 7.9  5.1  6.4  0.3086407
## Petal.Length 150 3.758000 1.7652982   4.35 1.0 6.9  1.6  5.1 -0.2694109
##                Kurtosis
## Sepal.Length -0.6058125
## Petal.Length -1.4168574

Homogeneity of variance

require(car)
leveneTest(Petal.Width~Species, data= iris)
leveneTest(Sepal.Width~Species, data= iris)

Levene’s tests showed that the assumption of homogeneity of variances is not violated for sepal width, F(2,147) = 0.59, p= .56) but it is violated for petal width (F(2,147) = 19.89, p< .00001).

Now let’s do it for length.

require(car)
leveneTest(Petal.Length~Species, data= iris)
leveneTest(Sepal.Length~Species, data= iris)

Levene’s tests showed that the assumption of homogeneity of variances is violated for both sepal length, F(2,147) = 6.35, p= .002) and for petal length (F(2,147) = 19.48, p< .00001).

Box M test

require(heplots)
## Loading required package: heplots
## Loading required package: broom
boxM(multivariatenorm, iris$Species)
## 
##  Box's M-test for Homogeneity of Covariance Matrices
## 
## data:  multivariatenorm
## Chi-Sq (approx.) = 51.794, df = 6, p-value = 2.051e-09

The Box M test is significant (p<.00001). (Note that it is very easily significant, and you should be aware of this. Likely many researchers would still proceed.).

Let’s look at length.

require(heplots)
boxM(multivariatenorm2, iris$Species)
## 
##  Box's M-test for Homogeneity of Covariance Matrices
## 
## data:  multivariatenorm2
## Chi-Sq (approx.) = 58.376, df = 6, p-value = 9.617e-11

The Box M test is significant (p<.00001). (Note that it is very easily significant, and you should be aware of this. Likely many researchers would still proceed.)

MANOVA

Manovamodel <- manova(cbind(Sepal.Width,Petal.Width) ~ Species, data = iris)
summary(Manovamodel)
##            Df Pillai approx F num Df den Df    Pr(>F)    
## Species     2 1.1438   98.181      4    294 < 2.2e-16 ***
## Residuals 147                                            
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

A MANOVA was conducted with two dependent variables (Petal width and Sepal width) and Species as the between-subject factor. A statistically significant effect was found (Pillai’s Trace= 1.14, F(4,294)=98.18, p<.0001).

So, the petal width and sepal widths differ between the species of iris flowers.

Now again for length,… .

Manovamodel <- manova(cbind(Sepal.Length,Petal.Length) ~ Species, data = iris)
summary(Manovamodel)
##            Df Pillai approx F num Df den Df    Pr(>F)    
## Species     2 0.9885   71.829      4    294 < 2.2e-16 ***
## Residuals 147                                            
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

A MANOVA was conducted with two dependent variables (Petal Length and Sepal Length) and Species as the between-subject factor. A statistically significant effect was found (Pillai’s Trace= 0.99, F(4,294)=71.83, p<.0001).

BONUS: Plot one of the results from the analyses salaries database.

require(ggplot2)
require(ggthemes)
require(RColorBrewer)
## Loading required package: RColorBrewer
# get the model predictions
pred=predict(Ancova)
# rename sex to gender
colnames(salaries)[5] <-"gender"
## plot
ggplot(data = cbind(salaries, pred),
    aes(yrs.since.phd, monthly_salary, color=gender)) + geom_point() +
    facet_grid(. ~ gender) + geom_line(aes(y=pred)) + scale_colour_brewer(palette = "Dark2") + labs(x="Years since Ph.D.", y="Monthly Salary ($)",
       title="Show me the money...") +  
  theme_tufte(12) 

BONUS: Follow up analyses MANOVA

require(ez)
iris$ID<- c(1:(NROW(iris))) # This makes an ID variable
EZ_ANOVA_Petal<-ezANOVA(iris, dv=Petal.Width, wid=ID, between=Species, detailed=TRUE)
## Warning: Converting "ID" to factor for ANOVA.
## Coefficient covariances computed by hccm()
EZ_ANOVA_Petal
## $ANOVA
##    Effect DFn DFd      SSn    SSd        F            p p<.05       ges
## 1 Species   2 147 80.41333 6.1566 960.0071 4.169446e-85     * 0.9288829
## 
## $`Levene's Test for Homogeneity of Variance`
##   DFn DFd       SSn    SSd        F           p p<.05
## 1   2 147 0.6421333 2.3726 19.89244 2.26052e-08     *
EZ_ANOVA_Sepal<-ezANOVA(iris, dv=Sepal.Width, wid=ID, between=Species, detailed=TRUE)
## Warning: Converting "ID" to factor for ANOVA.
## Coefficient covariances computed by hccm()
EZ_ANOVA_Sepal
## $ANOVA
##    Effect DFn DFd      SSn    SSd        F            p p<.05       ges
## 1 Species   2 147 11.34493 16.962 49.16004 4.492017e-17     * 0.4007828
## 
## $`Levene's Test for Homogeneity of Variance`
##   DFn DFd        SSn  SSd         F         p p<.05
## 1   2 147 0.05693333 7.09 0.5902116 0.5555179
# Make a graph
require(ggplot2)
require(ggthemes)
require(cowplot)
## Loading required package: cowplot
## 
## Attaching package: 'cowplot'
## The following object is masked from 'package:ggthemes':
## 
##     theme_map
violinplot<-ggplot(iris, aes(x = Species, y = Sepal.Width)) + geom_violin() + geom_boxplot(width = 0.2) + ylab("Sepal Width") + theme_economist_white(12)
violinplot2<-ggplot(iris, aes(x = Species, y = Petal.Width)) + geom_violin() + geom_boxplot(width = 0.2) + ylab("Petal Width") + theme_economist_white(12)
plot_grid(violinplot,violinplot2, labels=c('A', 'B'), label_size= 10, align="h")

ggsave("violinplot_panelled.pdf", width= 250, height=250, units="mm")

The graph shows that setosa has narrower petals but wider sepals. You could run follow up tests to examine the significance between species in terms of width.

Now let’s do it for length.

require(ez)
iris$ID<- c(1:(NROW(iris))) # This makes an ID variable
EZ_ANOVA_Petal_l<-ezANOVA(iris, dv=Petal.Length, wid=ID, between=Species, detailed=TRUE)
## Warning: Converting "ID" to factor for ANOVA.
## Coefficient covariances computed by hccm()
EZ_ANOVA_Petal_l
## $ANOVA
##    Effect DFn DFd      SSn     SSd        F            p p<.05       ges
## 1 Species   2 147 437.1028 27.2226 1180.161 2.856777e-91     * 0.9413717
## 
## $`Levene's Test for Homogeneity of Variance`
##   DFn DFd      SSn     SSd        F            p p<.05
## 1   2 147 2.678533 10.1062 19.48034 3.128757e-08     *
EZ_ANOVA_Sepal_l<-ezANOVA(iris, dv=Sepal.Length, wid=ID, between=Species, detailed=TRUE)
## Warning: Converting "ID" to factor for ANOVA.
## Coefficient covariances computed by hccm()
EZ_ANOVA_Sepal_l
## $ANOVA
##    Effect DFn DFd      SSn     SSd        F            p p<.05       ges
## 1 Species   2 147 63.21213 38.9562 119.2645 1.669669e-31     * 0.6187057
## 
## $`Levene's Test for Homogeneity of Variance`
##   DFn DFd    SSn     SSd       F           p p<.05
## 1   2 147 1.2828 14.8418 6.35272 0.002258528     *
# Make a graph
require(ggplot2)
require(ggthemes)
require(cowplot)
violinplot_l<-ggplot(iris, aes(x = Species, y = Sepal.Length)) + geom_violin() + geom_boxplot(width = 0.2) + ylab("Sepal length") + theme_economist_white(12)
violinplot2_l<-ggplot(iris, aes(x = Species, y = Petal.Length)) + geom_violin() + geom_boxplot(width = 0.2) + ylab("Petal Length") + theme_economist_white(12)
plot_grid(violinplot_l,violinplot2_l, labels=c('A', 'B'), label_size= 10, align="h")

ggsave("violinplot_panelled_length.pdf", width= 250, height=250, units="mm")

The graph shows that setosa has smaller petals and smaller sepals than other species. The difference is more pronounced in petals than in sepals. You could run follow up tests to examine the significance between species in terms of Sepals.

A note on the iris dataset.

You can read more here. It is widely used also in Machine Learning, hence I include it in the curriculum. However, it was promoted by Ronald Fisher, and you can read more about Fisher’s Eugenics here.

The END

sessionInfo()
## R version 4.4.2 (2024-10-31)
## Platform: aarch64-apple-darwin20
## Running under: macOS Sequoia 15.1
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: Europe/London
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] cowplot_1.1.3      RColorBrewer_1.1-3 heplots_1.7.0      broom_1.0.7       
##  [5] MVN_5.9            summarytools_1.0.1 skimr_2.1.5        compute.es_0.2-5  
##  [9] WRS2_1.1-6         ez_4.4-0           car_3.1-3          ggthemes_5.1.0    
## [13] ggplot2_3.5.1      dplyr_1.1.4        carData_3.0-5     
## 
## loaded via a namespace (and not attached):
##  [1] tidyselect_1.2.1  psych_2.4.6.26    farver_2.1.2      fastmap_1.2.0    
##  [5] reshape_0.8.9     digest_0.6.37     timechange_0.3.0  lifecycle_1.0.4  
##  [9] magrittr_2.0.3    compiler_4.4.2    rlang_1.1.4       sass_0.4.9       
## [13] tools_4.4.2       utf8_1.2.4        yaml_2.3.10       knitr_1.49       
## [17] labeling_0.4.3    htmlwidgets_1.6.4 mnormt_2.1.1      plyr_1.8.9       
## [21] repr_1.1.7        abind_1.4-8       withr_3.0.2       purrr_1.0.2      
## [25] grid_4.4.2        rgl_1.3.12        fansi_1.0.6       colorspace_2.1-1 
## [29] extrafontdb_1.0   scales_1.3.0      MASS_7.3-61       cli_3.6.3        
## [33] rmarkdown_2.28    ragg_1.3.3        generics_0.1.3    rstudioapi_0.17.1
## [37] reshape2_1.4.4    minqa_1.2.8       energy_1.7-12     cachem_1.1.0     
## [41] pander_0.6.5      stringr_1.5.1     splines_4.4.2     parallel_4.4.2   
## [45] matrixStats_1.4.1 base64enc_0.1-3   vctrs_0.6.5       boot_1.3-31      
## [49] Matrix_1.7-1      jsonlite_1.8.9    rapportools_1.1   Formula_1.2-5    
## [53] systemfonts_1.1.0 magick_2.8.5      nortest_1.0-4     tidyr_1.3.1      
## [57] jquerylib_0.1.4   glue_1.8.0        nloptr_2.1.1      codetools_0.2-20 
## [61] lubridate_1.9.3   stringi_1.8.4     gtable_0.3.6      extrafont_0.19   
## [65] lme4_1.1-35.5     gsl_2.1-8         munsell_0.5.1     tibble_3.2.1     
## [69] pillar_1.9.0      htmltools_0.5.8.1 R6_2.5.1          textshaping_0.4.0
## [73] tcltk_4.4.2       evaluate_1.0.1    lattice_0.22-6    moments_0.14.1   
## [77] backports_1.5.0   pryr_0.1.6        bslib_0.8.0       Rcpp_1.0.13      
## [81] Rttf2pt1_1.3.12   nlme_3.1-166      checkmate_2.3.2   mgcv_1.9-1       
## [85] xfun_0.49         pkgconfig_2.0.3