Using last week’s Salaries dataset, test the assumptions for an ANOVA with monthly salary as dependent variable and academic rank as independent variable.
Conduct an ANOVA with the appropriate post-hoc tests. What do you conclude?
Conduct an alternative non-parametric test to the ANOVA. What do you conclude?
Conduct an ANCOVA with gender as factor, and years since Ph.D. as covariate. What do you conclude?
Load the iris dataset, it is part of the datasets package.
It is a famous dataset with measurements of 3 species of iris flowers.
Test the assumptions for a MANOVA with Species as the between-subject factor and petal length and sepal length as dependent variables. (I previously did this for length)
Run the MANOVA. What do you conclude? Write up the result as you would do for a paper?
BONUS: Plot one of the results from your analyses on the salaries database.
BONUS: Conduct a follow up analysis or plot for the MANOVA.
## Loading required package: carData
## Loading required package: dplyr
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
We perform a visual check here. See the slides for (inferior?) alternatives
## Loading required package: ggplot2
## Loading required package: ggthemes
plot_hist <- ggplot(salaries, aes(x=monthly_salary))
plot_hist <- plot_hist+ geom_histogram(colour = "black", fill = "white") + facet_wrap(~rank) + theme_gdocs(12) + xlab("Monthly salary (in USD)") + ylab("Count")
plot_hist
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
More or less OK, perhaps some deviations from normality with the asst. prof group. There is perhaps one extreme case in the prof. group who earns more than 25k USD a month!
Many options exist but went with Levene’s test.
## Loading required package: car
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
The assumption for homogeneity of variances is violated (Levene’s test: F(2,394) = 38.71, p< .0001).
## Loading required package: ez
salaries$ID<- c(1:(NROW(salaries))) # This makes an ID variable
EZ_ANOVA1<-ezANOVA(salaries, dv=monthly_salary, wid=ID, between=rank, detailed=TRUE)
## Warning: Converting "ID" to factor for ANOVA.
## Warning: Data is unbalanced (unequal N per group). Make sure you specified a
## well-considered value for the type argument to ezANOVA().
## Coefficient covariances computed by hccm()
## $ANOVA
## Effect DFn DFd SSn SSd F p p<.05 ges
## 1 rank 2 394 1768293404 2716899714 128.2174 1.293048e-43 * 0.3942513
##
## $`Levene's Test for Homogeneity of Variance`
## DFn DFd SSn SSd F p p<.05
## 1 2 394 194134938 987948286 38.71112 4.476856e-16 *
A one-way ANOVA showed a significant effect of academic rank on monthly salary, F(2, 394) = 128.22, p < .0001, \(\eta^2_g\) = .39.
Many different options (see lecture slides). I went with a medians test.
## Loading required package: WRS2
## Call:
## med1way(formula = monthly_salary ~ rank, data = salaries, iter = 10000)
##
## Test statistic F: 96.3811
## Critical value: 2.8435
## p-value: 0
# analysis of Medians leads to same conclusion.
require(dplyr)
grouped<-group_by(salaries, rank)
grouped %>% summarise(median= median(monthly_salary))
A median test showed that the medians differed significantly by academic rank (Test= 96.38, p<.0001). The medians showed that Assistant professors ($8867) earned least, followed by Associate Professors ($10625), and full Professors earned most ($13,702).
(Note: assumptions not tested here)
## Loading required package: compute.es
Ancova<-lm(monthly_salary~sex+yrs.since.phd, contrasts=list(sex=contr.sum), data=salaries)
Anova(Ancova, type="III")
A one-way ANCOVA examined the effect of gender on monthly salary whilst controlling for the number of years since Ph.D. There was no statistically significant effect of gender, F(1,394)=2.86, p=.092. However, the effect of the covariate, years since Ph.D., was significant, F(1,394)=78.23, p<.0001.
I previously have made an error in the instructions and had previously run everything with width rather than length
## Loading required package: skimr
Name | iris |
Number of rows | 150 |
Number of columns | 5 |
_______________________ | |
Column type frequency: | |
factor | 1 |
numeric | 4 |
________________________ | |
Group variables | None |
Variable type: factor
skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
---|---|---|---|---|---|
Species | 0 | 1 | FALSE | 3 | set: 50, ver: 50, vir: 50 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
Sepal.Length | 0 | 1 | 5.84 | 0.83 | 4.3 | 5.1 | 5.80 | 6.4 | 7.9 | ▆▇▇▅▂ |
Sepal.Width | 0 | 1 | 3.06 | 0.44 | 2.0 | 2.8 | 3.00 | 3.3 | 4.4 | ▁▆▇▂▁ |
Petal.Length | 0 | 1 | 3.76 | 1.77 | 1.0 | 1.6 | 4.35 | 5.1 | 6.9 | ▇▁▆▇▂ |
Petal.Width | 0 | 1 | 1.20 | 0.76 | 0.1 | 0.3 | 1.30 | 1.8 | 2.5 | ▇▁▇▅▃ |
## Loading required package: summarytools
Visual inspection of normality. This time with the ‘few theme’.
require(ggplot2)
require(ggthemes)
plot_hist_sepal <- ggplot(iris, aes(x=Sepal.Width))
plot_hist_sepal <- plot_hist_sepal+ geom_histogram(colour = "black", fill = "white") + facet_wrap(~Species) + theme_few(12) + xlab("Sepal Width") + ylab("Count")
plot_hist_sepal
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
plot_hist_petal <- ggplot(iris, aes(x=Petal.Width))
plot_hist_petal <- plot_hist_petal+ geom_histogram(colour = "black", fill = "white") + facet_wrap(~Species) + theme_few(12) + xlab("Petal Width") + ylab("Count")
plot_hist_petal
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
This seems generally fine for sepal width but perhaps some issues for petal width. As a follow-up one could consider bootstrapping the results.
Here I do the test for length rather than width.
require(ggplot2)
require(ggthemes)
plot_hist_sepal_l <- ggplot(iris, aes(x=Sepal.Length))
plot_hist_sepal_l <- plot_hist_sepal_l+ geom_histogram(colour = "black", fill = "white") + facet_wrap(~Species) + theme_few(12) + xlab("Sepal Length") + ylab("Count")
plot_hist_sepal_l
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
plot_hist_petal_l <- ggplot(iris, aes(x=Petal.Length))
plot_hist_petal_l <- plot_hist_petal_l+ geom_histogram(colour = "black", fill = "white") + facet_wrap(~Species) + theme_few(12) + xlab("Petal Length") + ylab("Count")
plot_hist_petal_l
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
This seems generally fine for sepal width but perhaps some issues for petal width (especially in setosa ?). As a follow-up one could again consider bootstrapping the results.
mardiaTest is deprecated as a function(). So now updated with ‘mvn’
## Loading required package: MVN
## $multivariateNormality
## Test HZ p value MVN
## 1 Henze-Zirkler 5.429793 1.119371e-11 NO
##
## $univariateNormality
## Test Variable Statistic p value Normality
## 1 Anderson-Darling Sepal.Width 0.9080 0.0202 NO
## 2 Anderson-Darling Petal.Width 5.1057 <0.001 NO
##
## $Descriptives
## n Mean Std.Dev Median Min Max 25th 75th Skew
## Sepal.Width 150 3.057333 0.4358663 3.0 2.0 4.4 2.8 3.3 0.3126147
## Petal.Width 150 1.199333 0.7622377 1.3 0.1 2.5 0.3 1.8 -0.1009166
## Kurtosis
## Sepal.Width 0.1387047
## Petal.Width -1.3581792
The assumption of multivariate normality is violated.
Let us look at length. (Now with a Q-Q plot, also note other options for making figures).
require(dplyr)
multivariatenorm2<- iris %>% select(c(Sepal.Length,Petal.Length))
require(MVN)
mvn(multivariatenorm2, multivariatePlot = "qq")
## $multivariateNormality
## Test HZ p value MVN
## 1 Henze-Zirkler 5.450384 1.035771e-11 NO
##
## $univariateNormality
## Test Variable Statistic p value Normality
## 1 Anderson-Darling Sepal.Length 0.8892 0.0225 NO
## 2 Anderson-Darling Petal.Length 7.6785 <0.001 NO
##
## $Descriptives
## n Mean Std.Dev Median Min Max 25th 75th Skew
## Sepal.Length 150 5.843333 0.8280661 5.80 4.3 7.9 5.1 6.4 0.3086407
## Petal.Length 150 3.758000 1.7652982 4.35 1.0 6.9 1.6 5.1 -0.2694109
## Kurtosis
## Sepal.Length -0.6058125
## Petal.Length -1.4168574
Levene’s tests showed that the assumption of homogeneity of variances is not violated for sepal width, F(2,147) = 0.59, p= .56) but it is violated for petal width (F(2,147) = 19.89, p< .00001).
Now let’s do it for length.
Levene’s tests showed that the assumption of homogeneity of variances is violated for both sepal length, F(2,147) = 6.35, p= .002) and for petal length (F(2,147) = 19.48, p< .00001).
## Loading required package: heplots
## Loading required package: broom
##
## Box's M-test for Homogeneity of Covariance Matrices
##
## data: multivariatenorm
## Chi-Sq (approx.) = 51.794, df = 6, p-value = 2.051e-09
The Box M test is significant (p<.00001). (Note that it is very easily significant, and you should be aware of this. Likely many researchers would still proceed.).
Let’s look at length.
##
## Box's M-test for Homogeneity of Covariance Matrices
##
## data: multivariatenorm2
## Chi-Sq (approx.) = 58.376, df = 6, p-value = 9.617e-11
The Box M test is significant (p<.00001). (Note that it is very easily significant, and you should be aware of this. Likely many researchers would still proceed.)
## Df Pillai approx F num Df den Df Pr(>F)
## Species 2 1.1438 98.181 4 294 < 2.2e-16 ***
## Residuals 147
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
A MANOVA was conducted with two dependent variables (Petal width and Sepal width) and Species as the between-subject factor. A statistically significant effect was found (Pillai’s Trace= 1.14, F(4,294)=98.18, p<.0001).
So, the petal width and sepal widths differ between the species of iris flowers.
Now again for length,… .
## Df Pillai approx F num Df den Df Pr(>F)
## Species 2 0.9885 71.829 4 294 < 2.2e-16 ***
## Residuals 147
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
A MANOVA was conducted with two dependent variables (Petal Length and Sepal Length) and Species as the between-subject factor. A statistically significant effect was found (Pillai’s Trace= 0.99, F(4,294)=71.83, p<.0001).
## Loading required package: RColorBrewer
# get the model predictions
pred=predict(Ancova)
# rename sex to gender
colnames(salaries)[5] <-"gender"
## plot
ggplot(data = cbind(salaries, pred),
aes(yrs.since.phd, monthly_salary, color=gender)) + geom_point() +
facet_grid(. ~ gender) + geom_line(aes(y=pred)) + scale_colour_brewer(palette = "Dark2") + labs(x="Years since Ph.D.", y="Monthly Salary ($)",
title="Show me the money...") +
theme_tufte(12)
require(ez)
iris$ID<- c(1:(NROW(iris))) # This makes an ID variable
EZ_ANOVA_Petal<-ezANOVA(iris, dv=Petal.Width, wid=ID, between=Species, detailed=TRUE)
## Warning: Converting "ID" to factor for ANOVA.
## Coefficient covariances computed by hccm()
## $ANOVA
## Effect DFn DFd SSn SSd F p p<.05 ges
## 1 Species 2 147 80.41333 6.1566 960.0071 4.169446e-85 * 0.9288829
##
## $`Levene's Test for Homogeneity of Variance`
## DFn DFd SSn SSd F p p<.05
## 1 2 147 0.6421333 2.3726 19.89244 2.26052e-08 *
## Warning: Converting "ID" to factor for ANOVA.
## Coefficient covariances computed by hccm()
## $ANOVA
## Effect DFn DFd SSn SSd F p p<.05 ges
## 1 Species 2 147 11.34493 16.962 49.16004 4.492017e-17 * 0.4007828
##
## $`Levene's Test for Homogeneity of Variance`
## DFn DFd SSn SSd F p p<.05
## 1 2 147 0.05693333 7.09 0.5902116 0.5555179
## Loading required package: cowplot
##
## Attaching package: 'cowplot'
## The following object is masked from 'package:ggthemes':
##
## theme_map
violinplot<-ggplot(iris, aes(x = Species, y = Sepal.Width)) + geom_violin() + geom_boxplot(width = 0.2) + ylab("Sepal Width") + theme_economist_white(12)
violinplot2<-ggplot(iris, aes(x = Species, y = Petal.Width)) + geom_violin() + geom_boxplot(width = 0.2) + ylab("Petal Width") + theme_economist_white(12)
plot_grid(violinplot,violinplot2, labels=c('A', 'B'), label_size= 10, align="h")
The graph shows that setosa has narrower petals but wider sepals. You could run follow up tests to examine the significance between species in terms of width.
Now let’s do it for length.
require(ez)
iris$ID<- c(1:(NROW(iris))) # This makes an ID variable
EZ_ANOVA_Petal_l<-ezANOVA(iris, dv=Petal.Length, wid=ID, between=Species, detailed=TRUE)
## Warning: Converting "ID" to factor for ANOVA.
## Coefficient covariances computed by hccm()
## $ANOVA
## Effect DFn DFd SSn SSd F p p<.05 ges
## 1 Species 2 147 437.1028 27.2226 1180.161 2.856777e-91 * 0.9413717
##
## $`Levene's Test for Homogeneity of Variance`
## DFn DFd SSn SSd F p p<.05
## 1 2 147 2.678533 10.1062 19.48034 3.128757e-08 *
## Warning: Converting "ID" to factor for ANOVA.
## Coefficient covariances computed by hccm()
## $ANOVA
## Effect DFn DFd SSn SSd F p p<.05 ges
## 1 Species 2 147 63.21213 38.9562 119.2645 1.669669e-31 * 0.6187057
##
## $`Levene's Test for Homogeneity of Variance`
## DFn DFd SSn SSd F p p<.05
## 1 2 147 1.2828 14.8418 6.35272 0.002258528 *
# Make a graph
require(ggplot2)
require(ggthemes)
require(cowplot)
violinplot_l<-ggplot(iris, aes(x = Species, y = Sepal.Length)) + geom_violin() + geom_boxplot(width = 0.2) + ylab("Sepal length") + theme_economist_white(12)
violinplot2_l<-ggplot(iris, aes(x = Species, y = Petal.Length)) + geom_violin() + geom_boxplot(width = 0.2) + ylab("Petal Length") + theme_economist_white(12)
plot_grid(violinplot_l,violinplot2_l, labels=c('A', 'B'), label_size= 10, align="h")
The graph shows that setosa has smaller petals and smaller sepals than other species. The difference is more pronounced in petals than in sepals. You could run follow up tests to examine the significance between species in terms of Sepals.
You can read more here. It is widely used also in Machine Learning, hence I include it in the curriculum. However, it was promoted by Ronald Fisher, and you can read more about Fisher’s Eugenics here.
## R version 4.4.2 (2024-10-31)
## Platform: aarch64-apple-darwin20
## Running under: macOS Sequoia 15.1
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: Europe/London
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] cowplot_1.1.3 RColorBrewer_1.1-3 heplots_1.7.0 broom_1.0.7
## [5] MVN_5.9 summarytools_1.0.1 skimr_2.1.5 compute.es_0.2-5
## [9] WRS2_1.1-6 ez_4.4-0 car_3.1-3 ggthemes_5.1.0
## [13] ggplot2_3.5.1 dplyr_1.1.4 carData_3.0-5
##
## loaded via a namespace (and not attached):
## [1] tidyselect_1.2.1 psych_2.4.6.26 farver_2.1.2 fastmap_1.2.0
## [5] reshape_0.8.9 digest_0.6.37 timechange_0.3.0 lifecycle_1.0.4
## [9] magrittr_2.0.3 compiler_4.4.2 rlang_1.1.4 sass_0.4.9
## [13] tools_4.4.2 utf8_1.2.4 yaml_2.3.10 knitr_1.49
## [17] labeling_0.4.3 htmlwidgets_1.6.4 mnormt_2.1.1 plyr_1.8.9
## [21] repr_1.1.7 abind_1.4-8 withr_3.0.2 purrr_1.0.2
## [25] grid_4.4.2 rgl_1.3.12 fansi_1.0.6 colorspace_2.1-1
## [29] extrafontdb_1.0 scales_1.3.0 MASS_7.3-61 cli_3.6.3
## [33] rmarkdown_2.28 ragg_1.3.3 generics_0.1.3 rstudioapi_0.17.1
## [37] reshape2_1.4.4 minqa_1.2.8 energy_1.7-12 cachem_1.1.0
## [41] pander_0.6.5 stringr_1.5.1 splines_4.4.2 parallel_4.4.2
## [45] matrixStats_1.4.1 base64enc_0.1-3 vctrs_0.6.5 boot_1.3-31
## [49] Matrix_1.7-1 jsonlite_1.8.9 rapportools_1.1 Formula_1.2-5
## [53] systemfonts_1.1.0 magick_2.8.5 nortest_1.0-4 tidyr_1.3.1
## [57] jquerylib_0.1.4 glue_1.8.0 nloptr_2.1.1 codetools_0.2-20
## [61] lubridate_1.9.3 stringi_1.8.4 gtable_0.3.6 extrafont_0.19
## [65] lme4_1.1-35.5 gsl_2.1-8 munsell_0.5.1 tibble_3.2.1
## [69] pillar_1.9.0 htmltools_0.5.8.1 R6_2.5.1 textshaping_0.4.0
## [73] tcltk_4.4.2 evaluate_1.0.1 lattice_0.22-6 moments_0.14.1
## [77] backports_1.5.0 pryr_0.1.6 bslib_0.8.0 Rcpp_1.0.13
## [81] Rttf2pt1_1.3.12 nlme_3.1-166 checkmate_2.3.2 mgcv_1.9-1
## [85] xfun_0.49 pkgconfig_2.0.3