- Last lecture: CFA via SEM
- Today: Mediation via SEM (+ dplyr exercise)
2018-11-28 | disclaimer
After today you should be able to complete the following sections for Assignment II:
Mediation via SEM. (getting close to the end, hurray!)
‘Causal’ paths approach.
Sequence of regressions.
We already did mediation in week 6 with a hypothetical dataset (and with the body image data, if you completed the exercise!).
Do you remember the methods you used?
Example: simulated data from here.
X = grades
Y = happiness
Proposed mediator (M) = self-esteem
# Long string.
D <- read.csv("http://static.lib.virginia.edu/statlab/materials/data/mediationData.csv")
Data_med <- D
setwd("~/Dropbox/Teaching_MRes_Northumbria/Lecture9")
require(lavaan)
mediationmodelSEM <- '
  # direct effect
  Y ~ c*X
  # mediator
  M ~ a*X
  Y ~ b*M
  # indirect effect (a*b)
  ab := a*b
  # total effect
  total := c + (a*b)
'
fit_mediationSEM <- sem(mediationmodelSEM, se = "bootstrap", data = Data_med)
sink(file = "SEM-mediation.txt")
summary(fit_mediationSEM, standardized = TRUE, fit.measures = T)
## lavaan 0.6-3 ended normally after 14 iterations
##
##   Optimization method                           NLMINB
##   Number of free parameters                          5
##
##   Number of observations                           100
##
##   Estimator                                         ML
##   Model Fit Test Statistic                       0.000
##   Degrees of freedom                                 0
##   Minimum Function Value               0.0000000000000
##
## Model test baseline model:
##
##   Minimum Function Test Statistic               77.413
##   Degrees of freedom                                 3
##   P-value                                        0.000
##
## User model versus baseline model:
##
##   Comparative Fit Index (CFI)                    1.000
##   Tucker-Lewis Index (TLI)                       1.000
##
## Loglikelihood and Information Criteria:
##
##   Loglikelihood user model (H0)               -379.612
##   Loglikelihood unrestricted model (H1)       -379.612
##
##   Number of free parameters                          5
##   Akaike (AIC)                                 769.225
##   Bayesian (BIC)                               782.250
##   Sample-size adjusted Bayesian (BIC)          766.459
##
## Root Mean Square Error of Approximation:
##
##   RMSEA                                          0.000
##   90 Percent Confidence Interval          0.000  0.000
##   P-value RMSEA <= 0.05                             NA
##
## Standardized Root Mean Square Residual:
##
##   SRMR                                           0.000
##
## Parameter Estimates:
##
##   Standard Errors                            Bootstrap
##   Number of requested bootstrap draws             1000
##   Number of successful bootstrap draws            1000
##
## Regressions:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##   Y ~
##     X        (c)      0.040    0.122    0.324    0.746    0.040    0.034
##   M ~
##     X        (a)      0.561    0.099    5.670    0.000    0.561    0.514
##   Y ~
##     M        (b)      0.635    0.101    6.280    0.000    0.635    0.593
##
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##    .Y                 2.581    0.335    7.698    0.000    2.581    0.627
##    .M                 2.633    0.348    7.570    0.000    2.633    0.735
##
## Defined Parameters:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##     ab                0.357    0.078    4.543    0.000    0.357    0.305
##     total             0.396    0.124    3.199    0.001    0.396    0.339
sink()
How does that compare to the other method?
require(mediation)
med.fit <- lm(M ~ X, data = Data_med)
out.fit <- lm(Y ~ X + M, data = Data_med)
# Robust SE is ignored for bootstrap; otherwise omit boot = TRUE.
set.seed(1984)
med.out <- mediate(med.fit, out.fit, treat = "X", mediator = "M",
                   boot = TRUE, sims = 10000)
summary(med.out)
##
## Causal Mediation Analysis
##
## Nonparametric Bootstrap Confidence Intervals with the Percentile Method
##
##                Estimate 95% CI Lower 95% CI Upper p-value
## ACME             0.3565       0.2141         0.53  <2e-16 ***
## ADE              0.0396      -0.1962         0.30  0.7482
## Total Effect     0.3961       0.1536         0.64  0.0008 ***
## Prop. Mediated   0.9000       0.4786         2.03  0.0008 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Sample Size Used: 100
##
##
## Simulations: 10000
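The two approaches should agree closely: lavaan's `ab` corresponds to the ACME, and `total` to the Total Effect (small differences arise because the bootstrap draws differ). A minimal sketch of pulling the defined parameters out of a lavaan fit for a side-by-side check (refitting here without bootstrap SEs for speed; the point estimates are identical):

```r
require(lavaan)
# same data and model as above, refit with default (non-bootstrap) SEs
D <- read.csv("http://static.lib.virginia.edu/statlab/materials/data/mediationData.csv")
fit <- sem('Y ~ c*X
            M ~ a*X
            Y ~ b*M
            ab := a*b
            total := c + (a*b)', data = D)
# the := rows are the defined parameters (indirect and total effect)
pe <- parameterEstimates(fit)
pe[pe$op == ":=", c("label", "est", "se", "ci.lower", "ci.upper")]
```

Compare the `ab` row to the ACME line from `mediate()` above.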
Comparison to other pre-specified models!
We can use model fit statistics to compare models
Also, easier if we have multiple mediators and even latent constructs.
require(lavaan)
directmodelSEM <- '
  # just direct effects
  Y ~ c*X + b*M
'
fit_directSEM <- sem(directmodelSEM, se = "bootstrap", data = Data_med)
sink(file = "SEM-direct.txt")
summary(fit_directSEM, standardized = TRUE, fit.measures = T)
## lavaan 0.6-3 ended normally after 11 iterations
##
##   Optimization method                           NLMINB
##   Number of free parameters                          3
##
##   Number of observations                           100
##
##   Estimator                                         ML
##   Model Fit Test Statistic                       0.000
##   Degrees of freedom                                 0
##
## Model test baseline model:
##
##   Minimum Function Test Statistic               46.681
##   Degrees of freedom                                 2
##   P-value                                        0.000
##
## User model versus baseline model:
##
##   Comparative Fit Index (CFI)                    1.000
##   Tucker-Lewis Index (TLI)                       1.000
##
## Loglikelihood and Information Criteria:
##
##   Loglikelihood user model (H0)               -189.311
##   Loglikelihood unrestricted model (H1)       -189.311
##
##   Number of free parameters                          3
##   Akaike (AIC)                                 384.622
##   Bayesian (BIC)                               392.437
##   Sample-size adjusted Bayesian (BIC)          382.963
##
## Root Mean Square Error of Approximation:
##
##   RMSEA                                          0.000
##   90 Percent Confidence Interval          0.000  0.000
##   P-value RMSEA <= 0.05                             NA
##
## Standardized Root Mean Square Residual:
##
##   SRMR                                           0.000
##
## Parameter Estimates:
##
##   Standard Errors                            Bootstrap
##   Number of requested bootstrap draws             1000
##   Number of successful bootstrap draws            1000
##
## Regressions:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##   Y ~
##     X        (c)      0.040    0.123    0.323    0.747    0.040    0.034
##     M        (b)      0.635    0.103    6.164    0.000    0.635    0.593
##
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##    .Y                 2.581    0.331    7.810    0.000    2.581    0.627
sink()
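To line the models up, you can pull the same fit indices from each fit into one table. A hedged sketch (refitting both models without bootstrap SEs for speed); note one caveat spelled out in the comments: AIC/BIC are only comparable when both models involve the same set of modelled variables, and here the direct model does not model M at all.

```r
require(lavaan)
D <- read.csv("http://static.lib.virginia.edu/statlab/materials/data/mediationData.csv")
fit_med <- sem('Y ~ c*X
                M ~ a*X
                Y ~ b*M', data = D)
fit_dir <- sem('Y ~ c*X + b*M', data = D)
# collect the same indices from each fitted model, side by side
indices <- c("chisq", "df", "cfi", "rmsea", "srmr", "aic", "bic")
m <- round(sapply(list(mediation = fit_med, direct = fit_dir),
                  fitMeasures, fit.measures = indices), 3)
m
# Caveat: AIC/BIC are computed over the modelled variables; the direct model
# leaves M out of the likelihood, so its AIC/BIC live on a different scale.
```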
Kline lists 50 (!) ways in which you can fool yourself in his SEM book.
Sample size: you need a large sample. Even a simple model already requires large amounts of data.
Poor data / making something out of nothing: if there are no baseline correlations, SEM won't help.
Naming a latent variable doesn’t mean it exists…
Ignoring warnings: some are safe to ignore, but often they tell you something meaningful (for example, that you are building something nonsensical).
Ignoring practical significance: one can build a model that fits the data closely but does not explain a whole lot of variance (also note: statistical significance is not the same as clinical or biological significance!).
lavaanPlot relies on DiagrammeR.
require(lavaanPlot)
lavaanPlot(model = fit_mediationSEM,
           node_options = list(shape = "box", fontname = "Helvetica"),
           edge_options = list(color = "grey"))
You can further improve on this; check DiagrammeR.
require(lavaanPlot)
labels <- list(X = "independent", M = "mediator", Y = "dependent")
lavaanPlot(model = fit_mediationSEM,
           graph_options = list(rankdir = "LR"),
           labels = labels,
           node_options = list(shape = "box", fontname = "Helvetica"),
           edge_options = list(color = "grey"))
require(semPlot)
labels2 <- list("Dependent", "Mediator", "Independent")  # need to reorganize.
semPaths(fit_mediationSEM, "std", edge.label.cex = 0.5, layout = "spring",
         whatLabels = "std", intercepts = FALSE, style = "ram",
         nodeLabels = labels2)
# Lisrel.
require(semPlot)
semPaths(fit_mediationSEM, "std", edge.label.cex = 0.5, layout = "spring",
         whatLabels = "std", intercepts = FALSE, style = "lisrel",
         curvePivot = TRUE, fade = FALSE, nodeLabels = labels2)
Check this package, but note that it requires rJava (which does not always play nice with a Mac).
Data wrangling might still be hard for some of you.
--> Back to dplyr?
Install the ‘tidyverse’ package and load the ‘mtcars’ dataset from the datasets package.
Calculate a summary table which has the mean/SD of the horse power variable organized by number of gears. (Bonus: export it to .html or Word.)
Make a new dataframe called my_cars which contains the mpg and hp columns, but let the column names be miles_per_gallon and horse_power respectively.
Create a new variable in the dataframe called km_per_litre using the mutate function. Note: 1 mpg = 0.425 km/l.
Look at the sample_frac() function. Use it to make a new dataframe with a random selection of half the data.
Look at the slice() function. From the original dataframe, select rows 10 to 35.
Look at the tibble package and the rownames_to_column function. Make a dataset with just the “Lotus Europa” model. What would be an alternative way of reaching the same goal?
When you are done you can evaluate your results here.
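One possible solution sketch for the first few exercises (assuming the tidyverse is installed; your grouping and naming choices may differ):

```r
require(dplyr)
data(mtcars)

# summary table: mean/SD of horse power by number of gears
hp_by_gear <- mtcars %>%
  group_by(gear) %>%
  summarise(mean_hp = mean(hp), sd_hp = sd(hp))
hp_by_gear

# select two columns and rename them in one step
my_cars <- mtcars %>%
  select(miles_per_gallon = mpg, horse_power = hp)

# new variable via mutate: 1 mpg = 0.425 km/l
my_cars <- my_cars %>%
  mutate(km_per_litre = miles_per_gallon * 0.425)
head(my_cars)
```

For the export bonus, packages such as knitr (`kable()`) can turn `hp_by_gear` into an .html table.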
Thus far: pretty basic.
We can build more complex models. Also note that SEM can be a useful alternative to multilevel models (they are actually quite similar).
Wheaton (1977) “used longitudinal data to develop a model of the stability of alienation from 1967 to 1971, accounting for socioeconomic status as a covariate. Each of the three factors have two indicator variables, SES in 1966 is measured by education and occupational status in 1966 and alienation in both years is measured by powerlessness and anomie (a feeling of being lost with regard to society). The structural component of the model hypothesizes that SES in 1966 influences both alienation in 1967 and 1971 and alienation in 1967 influences the same measure in 1971. We also let the disturbances correlate from one time point to the next.” from here.
We can do SEM based on a correlation or covariance matrix!
First make your covariance matrix.
lower <- '
 11.834,
  6.947,   9.364,
  6.819,   5.091,  12.532,
  4.783,   5.028,   7.495,   9.986,
 -3.839,  -3.889,  -3.841,  -3.625,   9.610,
-21.899, -18.831, -21.748, -18.775,  35.522, 450.288 '
# convert to a full symmetric covariance matrix with names
wheaton.cov <- getCov(lower, names = c("anomia67", "powerless67", "anomia71",
                                       "powerless71", "education", "sei"))
This model brings which 2 aspects of SEM together?
require(lavaan)
# the model
wheaton.model <- '
  # measurement model
  ses     =~ education + sei
  alien67 =~ anomia67 + powerless67
  alien71 =~ anomia71 + powerless71
  # structural model
  alien71 ~ aa*alien67 + ses
  alien67 ~ sa*ses
  # correlated residuals
  anomia67 ~~ anomia71
  powerless67 ~~ powerless71
  # indirect effect
  IndirectEffect := sa*aa
'
require(lavaan)
alienation <- sem(wheaton.model, sample.cov = wheaton.cov, sample.nobs = 932)
sink(file = "summary_alienation.txt")
summary(alienation, fit.measures = T)
## lavaan 0.6-3 ended normally after 84 iterations
##
##   Optimization method                           NLMINB
##   Number of free parameters                         17
##
##   Number of observations                           932
##
##   Estimator                                         ML
##   Model Fit Test Statistic                       4.735
##   Degrees of freedom                                 4
##   P-value (Chi-square)                           0.316
##
## Model test baseline model:
##
##   Minimum Function Test Statistic             2133.722
##   Degrees of freedom                                15
##   P-value                                        0.000
##
## User model versus baseline model:
##
##   Comparative Fit Index (CFI)                    1.000
##   Tucker-Lewis Index (TLI)                       0.999
##
## Loglikelihood and Information Criteria:
##
##   Loglikelihood user model (H0)             -15213.274
##   Loglikelihood unrestricted model (H1)     -15210.906
##
##   Number of free parameters                         17
##   Akaike (AIC)                               30460.548
##   Bayesian (BIC)                             30542.783
##   Sample-size adjusted Bayesian (BIC)        30488.792
##
## Root Mean Square Error of Approximation:
##
##   RMSEA                                          0.014
##   90 Percent Confidence Interval          0.000  0.053
##   P-value RMSEA <= 0.05                          0.930
##
## Standardized Root Mean Square Residual:
##
##   SRMR                                           0.007
##
## Parameter Estimates:
##
##   Information                                 Expected
##   Information saturated (h1) model          Structured
##   Standard Errors                             Standard
##
## Latent Variables:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   ses =~
##     education         1.000
##     sei               5.219    0.422   12.364    0.000
##   alien67 =~
##     anomia67          1.000
##     powerless67       0.979    0.062   15.895    0.000
##   alien71 =~
##     anomia71          1.000
##     powerless71       0.922    0.059   15.498    0.000
##
## Regressions:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   alien71 ~
##     alien67   (aa)    0.607    0.051   11.898    0.000
##     ses              -0.227    0.052   -4.334    0.000
##   alien67 ~
##     ses       (sa)   -0.575    0.056  -10.195    0.000
##
## Covariances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##  .anomia67 ~~
##    .anomia71          1.623    0.314    5.176    0.000
##  .powerless67 ~~
##    .powerless71       0.339    0.261    1.298    0.194
##
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##    .education         2.801    0.507    5.525    0.000
##    .sei             264.597   18.126   14.597    0.000
##    .anomia67          4.731    0.453   10.441    0.000
##    .powerless67       2.563    0.403    6.359    0.000
##    .anomia71          4.399    0.515    8.542    0.000
##    .powerless71       3.070    0.434    7.070    0.000
##     ses               6.798    0.649   10.475    0.000
##    .alien67           4.841    0.467   10.359    0.000
##    .alien71           4.083    0.404   10.104    0.000
##
## Defined Parameters:
##                    Estimate  Std.Err  z-value  P(>|z|)
##     IndirectEffect   -0.349    0.041   -8.538    0.000
sink()
require(semPlot)
require(qgraph)
semPaths(alienation, style = "lisrel", what = "std", curvePivot = TRUE)
What happens if you remove the arrow between “Alien67” and “Alien71”? (Compare the fit indices and make a plot)
What happens if you remove the correlated residuals?
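A hedged starting point for the first exercise: drop the relevant line from the model string, refit, and compare the nested models with `anova()` (a likelihood-ratio test). The same recipe works for the correlated residuals. Sketch (self-contained, so the covariance matrix is rebuilt here):

```r
require(lavaan)
lower <- '11.834, 6.947, 9.364, 6.819, 5.091, 12.532, 4.783, 5.028, 7.495,
          9.986, -3.839, -3.889, -3.841, -3.625, 9.610, -21.899, -18.831,
          -21.748, -18.775, 35.522, 450.288'
wheaton.cov <- getCov(lower, names = c("anomia67", "powerless67", "anomia71",
                                       "powerless71", "education", "sei"))
full <- '
  ses     =~ education + sei
  alien67 =~ anomia67 + powerless67
  alien71 =~ anomia71 + powerless71
  alien71 ~ aa*alien67 + ses
  alien67 ~ sa*ses
  anomia67 ~~ anomia71
  powerless67 ~~ powerless71
'
# constrained model: the alien67 -> alien71 arrow is removed
nopath <- '
  ses     =~ education + sei
  alien67 =~ anomia67 + powerless67
  alien71 =~ anomia71 + powerless71
  alien71 ~ ses
  alien67 ~ sa*ses
  anomia67 ~~ anomia71
  powerless67 ~~ powerless71
'
fit_full   <- sem(full,   sample.cov = wheaton.cov, sample.nobs = 932)
fit_nopath <- sem(nopath, sample.cov = wheaton.cov, sample.nobs = 932)
anova(fit_full, fit_nopath)  # does the constrained model fit significantly worse?
```

Then plot the constrained fit with semPaths() as before and compare the fit indices.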
Use the following code below to generate a correlation matrix to use (you’ll need to load lavaan).
The data come from this paper: a study of 131 students testing the Theory of Planned Behaviour as applied to graduate school applications.
require(lavaan)
data <- lav_matrix_lower2full(c(1,
                                .47, 1,
                                .66, .50, 1,
                                .77, .41, .46, 1,
                                .52, .38, .50, .50, 1))
rownames(data) <- colnames(data) <- c("att", "sn", "pbc", "int", "beh")  # a matrix.
Reconstruct the model from the article. (Note that there are no (!) latent variables)
Discuss the fit. Make a sem plot.
What happens when you remove the indirect path?
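If you get stuck, a hedged sketch of the classic TPB structure; the regressions below are the textbook TPB paths (attitude, subjective norm, and perceived behavioural control predict intention; intention and PBC predict behaviour), not necessarily the authors' exact model, so verify them against the article:

```r
require(lavaan)
tpb.cor <- lav_matrix_lower2full(c(1,
                                   .47, 1,
                                   .66, .50, 1,
                                   .77, .41, .46, 1,
                                   .52, .38, .50, .50, 1))
rownames(tpb.cor) <- colnames(tpb.cor) <- c("att", "sn", "pbc", "int", "beh")
tpb.model <- '  # textbook TPB paths (assumption -- check against the paper)
  int ~ att + sn + pbc
  beh ~ int + pbc
'
# path analysis: observed variables only, fit from the correlation matrix
tpb.fit <- sem(tpb.model, sample.cov = tpb.cor, sample.nobs = 131)
summary(tpb.fit, standardized = TRUE, fit.measures = TRUE)
```

Removing `int` from the `beh` equation (the indirect route) and refitting lets you compare the fit indices, as in the alienation exercise.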
Also check the reading list! (many more than listed here)
Clark, M. (2017). SEM, Graphical and Latent Variable Modeling, http://m-clark.github.io/docs/sem/
Rosseel, Y. (2017). Lavaan. http://lavaan.ugent.be/index.html
Tarr, G. (2017). Dplyr exercises. https://garthtarr.github.io/meatR/dplyr_ex1.html