- Last lecture: CFA via SEM
- Today: Mediation via SEM (+ dplyr exercise)
2018-11-28 | disclaimer
After today you should be able to complete the following sections for Assignment II:
Mediation via SEM. (getting close to the end, hurray!)
‘Causal’ paths approach.
Sequence of regressions.
We already did mediation in week 6 with a hypothetical dataset (and with the body image data, if you completed the exercise!).
Do you remember the methods you used?
Example: simulated data from here.
X = grades
Y = happiness
Proposed mediator (M) = self-esteem
# Long string.
D <- read.csv("http://static.lib.virginia.edu/statlab/materials/data/mediationData.csv")
Data_med <- D
setwd("~/Dropbox/Teaching_MRes_Northumbria/Lecture9")
require(lavaan)
mediationmodelSEM <- '
  # direct effect
  Y ~ c*X
  # mediator
  M ~ a*X
  Y ~ b*M
  # indirect effect (a*b)
  ab := a*b
  # total effect
  total := c + (a*b)
'
fit_mediationSEM <- sem(mediationmodelSEM, se = "bootstrap", data = Data_med)
sink(file = "SEM-mediation.txt")
summary(fit_mediationSEM, standardized = TRUE, fit.measures = T)
## lavaan 0.6-3 ended normally after 14 iterations
##
##   Optimization method                           NLMINB
##   Number of free parameters                          5
##
##   Number of observations                           100
##
##   Estimator                                         ML
##   Model Fit Test Statistic                       0.000
##   Degrees of freedom                                 0
##   Minimum Function Value               0.0000000000000
##
## Model test baseline model:
##
##   Minimum Function Test Statistic               77.413
##   Degrees of freedom                                 3
##   P-value                                        0.000
##
## User model versus baseline model:
##
##   Comparative Fit Index (CFI)                    1.000
##   Tucker-Lewis Index (TLI)                       1.000
##
## Loglikelihood and Information Criteria:
##
##   Loglikelihood user model (H0)               -379.612
##   Loglikelihood unrestricted model (H1)       -379.612
##
##   Number of free parameters                          5
##   Akaike (AIC)                                 769.225
##   Bayesian (BIC)                               782.250
##   Sample-size adjusted Bayesian (BIC)          766.459
##
## Root Mean Square Error of Approximation:
##
##   RMSEA                                          0.000
##   90 Percent Confidence Interval          0.000  0.000
##   P-value RMSEA <= 0.05                             NA
##
## Standardized Root Mean Square Residual:
##
##   SRMR                                           0.000
##
## Parameter Estimates:
##
##   Standard Errors                            Bootstrap
##   Number of requested bootstrap draws             1000
##   Number of successful bootstrap draws            1000
##
## Regressions:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##   Y ~
##     X        (c)      0.040    0.122    0.324    0.746    0.040    0.034
##   M ~
##     X        (a)      0.561    0.099    5.670    0.000    0.561    0.514
##   Y ~
##     M        (b)      0.635    0.101    6.280    0.000    0.635    0.593
##
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##    .Y                 2.581    0.335    7.698    0.000    2.581    0.627
##    .M                 2.633    0.348    7.570    0.000    2.633    0.735
##
## Defined Parameters:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##     ab                0.357    0.078    4.543    0.000    0.357    0.305
##     total             0.396    0.124    3.199    0.001    0.396    0.339
sink()
How does that compare to the other method?
require(mediation)
med.fit <- lm(M ~ X, data = Data_med)
out.fit <- lm(Y ~ X + M, data = Data_med)
# Robust SE is ignored for bootstrap; otherwise omit boot = TRUE.
set.seed(1984)
med.out <- mediate(med.fit, out.fit, treat = "X", mediator = "M",
                   boot = TRUE, sims = 10000)
summary(med.out)
##
## Causal Mediation Analysis
##
## Nonparametric Bootstrap Confidence Intervals with the Percentile Method
##
##                Estimate 95% CI Lower 95% CI Upper p-value
## ACME             0.3565       0.2141         0.53  <2e-16 ***
## ADE              0.0396      -0.1962         0.30  0.7482
## Total Effect     0.3961       0.1536         0.64  0.0008 ***
## Prop. Mediated   0.9000       0.4786         2.03  0.0008 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Sample Size Used: 100
##
##
## Simulations: 10000
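The two approaches should agree closely: lavaan's `ab` corresponds to the ACME, and `total` to the Total Effect (small differences arise because the bootstrap draws differ). A minimal sketch of pulling the defined parameters out of a lavaan fit for a side-by-side check (refitting here without bootstrap SEs for speed; the point estimates are identical):

```r
require(lavaan)
# same data and model as above, refit with default (non-bootstrap) SEs
D <- read.csv("http://static.lib.virginia.edu/statlab/materials/data/mediationData.csv")
fit <- sem('Y ~ c*X
            M ~ a*X
            Y ~ b*M
            ab := a*b
            total := c + (a*b)', data = D)
# the := rows are the defined parameters (indirect and total effect)
pe <- parameterEstimates(fit)
pe[pe$op == ":=", c("label", "est", "se", "ci.lower", "ci.upper")]
```

Compare the `ab` row to the ACME line from `mediate()` above.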
Comparison to other pre-specified models!
We can use model fit statistics to compare models
Also, easier if we have multiple mediators and even latent constructs.
require(lavaan)
directmodelSEM <- '
  # just direct effects
  Y ~ c*X + b*M
'
fit_directSEM <- sem(directmodelSEM, se = "bootstrap", data = Data_med)
sink(file = "SEM-direct.txt")
summary(fit_directSEM, standardized = TRUE, fit.measures = T)
## lavaan 0.6-3 ended normally after 11 iterations
##
##   Optimization method                           NLMINB
##   Number of free parameters                          3
##
##   Number of observations                           100
##
##   Estimator                                         ML
##   Model Fit Test Statistic                       0.000
##   Degrees of freedom                                 0
##
## Model test baseline model:
##
##   Minimum Function Test Statistic               46.681
##   Degrees of freedom                                 2
##   P-value                                        0.000
##
## User model versus baseline model:
##
##   Comparative Fit Index (CFI)                    1.000
##   Tucker-Lewis Index (TLI)                       1.000
##
## Loglikelihood and Information Criteria:
##
##   Loglikelihood user model (H0)               -189.311
##   Loglikelihood unrestricted model (H1)       -189.311
##
##   Number of free parameters                          3
##   Akaike (AIC)                                 384.622
##   Bayesian (BIC)                               392.437
##   Sample-size adjusted Bayesian (BIC)          382.963
##
## Root Mean Square Error of Approximation:
##
##   RMSEA                                          0.000
##   90 Percent Confidence Interval          0.000  0.000
##   P-value RMSEA <= 0.05                             NA
##
## Standardized Root Mean Square Residual:
##
##   SRMR                                           0.000
##
## Parameter Estimates:
##
##   Standard Errors                            Bootstrap
##   Number of requested bootstrap draws             1000
##   Number of successful bootstrap draws            1000
##
## Regressions:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##   Y ~
##     X        (c)      0.040    0.123    0.323    0.747    0.040    0.034
##     M        (b)      0.635    0.103    6.164    0.000    0.635    0.593
##
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##    .Y                 2.581    0.331    7.810    0.000    2.581    0.627
sink()
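To line the models up, you can pull the same fit indices from each fit into one table. A hedged sketch (refitting both models without bootstrap SEs for speed); note one caveat spelled out in the comments: AIC/BIC are only comparable when both models involve the same set of modelled variables, and here the direct model does not model M at all.

```r
require(lavaan)
D <- read.csv("http://static.lib.virginia.edu/statlab/materials/data/mediationData.csv")
fit_med <- sem('Y ~ c*X
                M ~ a*X
                Y ~ b*M', data = D)
fit_dir <- sem('Y ~ c*X + b*M', data = D)
# collect the same indices from each fitted model, side by side
indices <- c("chisq", "df", "cfi", "rmsea", "srmr", "aic", "bic")
m <- round(sapply(list(mediation = fit_med, direct = fit_dir),
                  fitMeasures, fit.measures = indices), 3)
m
# Caveat: AIC/BIC are computed over the modelled variables; the direct model
# leaves M out of the likelihood, so its AIC/BIC live on a different scale.
```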
Kline lists 50 (!) ways in which you can fool yourself in his SEM book.
Sample size: you need a large sample. Even a simple model already requires large amounts of data.
Poor data / making something out of nothing: if there are no baseline correlations, SEM won't help.
Naming a latent variable doesn’t mean it exists…
Ignoring warnings: some are safe to ignore, but often they tell you something meaningful (for example, that you are building something nonsensical).
Ignoring practical significance: one can build a model that fits the data closely but does not explain a whole lot of variance (also note: statistical significance is not the same as clinical or biological significance!).
lavaanPlot relies on DiagrammeR.
require(lavaanPlot)
lavaanPlot(model = fit_mediationSEM,
           node_options = list(shape = "box", fontname = "Helvetica"),
           edge_options = list(color = "grey"))
You can further improve on this; check DiagrammeR.
require(lavaanPlot)
labels <- list(X = "independent", M = "mediator", Y = "dependent")
lavaanPlot(model = fit_mediationSEM,
           graph_options = list(rankdir = "LR"),
           labels = labels,
           node_options = list(shape = "box", fontname = "Helvetica"),
           edge_options = list(color = "grey"))
require(semPlot)
labels2 <- list("Dependent", "Mediator", "Independent")  # need to reorganize.
semPaths(fit_mediationSEM, "std", edge.label.cex = 0.5, layout = "spring",
         whatLabels = "std", intercepts = FALSE, style = "ram",
         nodeLabels = labels2)
# Lisrel.
require(semPlot)
semPaths(fit_mediationSEM, "std", edge.label.cex = 0.5, layout = "spring",
         whatLabels = "std", intercepts = FALSE, style = "lisrel",
         curvePivot = TRUE, fade = FALSE, nodeLabels = labels2)
Check this package, but note that it requires rJava (which does not always play nice with a Mac).
Data wrangling might still be hard for some of you.
--> Back to dplyr?
Install the ‘tidyverse’ package and load the ‘mtcars’ dataset from the datasets package.
Calculate a summary table which has the mean/SD of the horse power variable organized by number of gears. (Bonus: export it to .html or Word.)
Make a new dataframe called my_cars which contains the mpg and hp columns, but let the column names be miles_per_gallon and horse_power respectively.
Create a new variable in the dataframe called km_per_litre using the mutate function. Note: 1 mpg = 0.425 km/l.
Look at the sample_frac() function. Use it to make a new dataframe with a random selection of half the data.
Look at the slice() function. From the original dataframe, select rows 10 to 35.
Look at the tibble package and the rownames_to_column function. Make a dataset with just the “Lotus Europa” model. What would be an alternative way of reaching the same goal?
When you are done you can evaluate your results here.
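One possible solution sketch for the first few exercises (assuming the tidyverse is installed; your grouping and naming choices may differ):

```r
require(dplyr)
data(mtcars)

# summary table: mean/SD of horse power by number of gears
hp_by_gear <- mtcars %>%
  group_by(gear) %>%
  summarise(mean_hp = mean(hp), sd_hp = sd(hp))
hp_by_gear

# select two columns and rename them in one step
my_cars <- mtcars %>%
  select(miles_per_gallon = mpg, horse_power = hp)

# new variable via mutate: 1 mpg = 0.425 km/l
my_cars <- my_cars %>%
  mutate(km_per_litre = miles_per_gallon * 0.425)
head(my_cars)
```

For the export bonus, packages such as knitr (`kable()`) can turn `hp_by_gear` into an .html table.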
Thus far: pretty basic.
We can build more complex models. Also note that SEM can be a useful alternative to multilevel models (they are actually quite similar).
Wheaton (1977) “used longitudinal data to develop a model of the stability of alienation from 1967 to 1971, accounting for socioeconomic status as a covariate. Each of the three factors have two indicator variables, SES in 1966 is measured by education and occupational status in 1966 and alienation in both years is measured by powerlessness and anomie (a feeling of being lost with regard to society). The structural component of the model hypothesizes that SES in 1966 influences both alienation in 1967 and 1971 and alienation in 1967 influences the same measure in 1971. We also let the disturbances correlate from one time point to the next.” from here.
We can do SEM based on a correlation or covariance matrix!
First make your covariance matrix.
lower <- '
 11.834,
  6.947,   9.364,
  6.819,   5.091,  12.532,
  4.783,   5.028,   7.495,   9.986,
 -3.839,  -3.889,  -3.841,  -3.625,   9.610,
-21.899, -18.831, -21.748, -18.775,  35.522, 450.288 '
# convert to a full symmetric covariance matrix with names
wheaton.cov <- getCov(lower, names = c("anomia67", "powerless67", "anomia71",
                                       "powerless71", "education", "sei"))
This model brings which 2 aspects of SEM together?
require(lavaan)
# the model
wheaton.model <- '
  # measurement model
  ses     =~ education + sei
  alien67 =~ anomia67 + powerless67
  alien71 =~ anomia71 + powerless71
  # structural model
  alien71 ~ aa*alien67 + ses
  alien67 ~ sa*ses
  # correlated residuals
  anomia67 ~~ anomia71
  powerless67 ~~ powerless71
  # indirect effect
  IndirectEffect := sa*aa
'
require(lavaan)
alienation <- sem(wheaton.model, sample.cov = wheaton.cov, sample.nobs = 932)
sink(file = "summary_alienation.txt")
summary(alienation, fit.measures = T)
## lavaan 0.6-3 ended normally after 84 iterations
##
##   Optimization method                           NLMINB
##   Number of free parameters                         17
##
##   Number of observations                           932
##
##   Estimator                                         ML
##   Model Fit Test Statistic                       4.735
##   Degrees of freedom                                 4
##   P-value (Chi-square)                           0.316
##
## Model test baseline model:
##
##   Minimum Function Test Statistic             2133.722
##   Degrees of freedom                                15
##   P-value                                        0.000
##
## User model versus baseline model:
##
##   Comparative Fit Index (CFI)                    1.000
##   Tucker-Lewis Index (TLI)                       0.999
##
## Loglikelihood and Information Criteria:
##
##   Loglikelihood user model (H0)             -15213.274
##   Loglikelihood unrestricted model (H1)     -15210.906
##
##   Number of free parameters                         17
##   Akaike (AIC)                               30460.548
##   Bayesian (BIC)                             30542.783
##   Sample-size adjusted Bayesian (BIC)        30488.792
##
## Root Mean Square Error of Approximation:
##
##   RMSEA                                          0.014
##   90 Percent Confidence Interval          0.000  0.053
##   P-value RMSEA <= 0.05                          0.930
##
## Standardized Root Mean Square Residual:
##
##   SRMR                                           0.007
##
## Parameter Estimates:
##
##   Information                                 Expected
##   Information saturated (h1) model          Structured
##   Standard Errors                             Standard
##
## Latent Variables:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   ses =~
##     education         1.000
##     sei               5.219    0.422   12.364    0.000
##   alien67 =~
##     anomia67          1.000
##     powerless67       0.979    0.062   15.895    0.000
##   alien71 =~
##     anomia71          1.000
##     powerless71       0.922    0.059   15.498    0.000
##
## Regressions:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   alien71 ~
##     alien67   (aa)    0.607    0.051   11.898    0.000
##     ses              -0.227    0.052   -4.334    0.000
##   alien67 ~
##     ses       (sa)   -0.575    0.056  -10.195    0.000
##
## Covariances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##  .anomia67 ~~
##    .anomia71          1.623    0.314    5.176    0.000
##  .powerless67 ~~
##    .powerless71       0.339    0.261    1.298    0.194
##
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##    .education         2.801    0.507    5.525    0.000
##    .sei             264.597   18.126   14.597    0.000
##    .anomia67          4.731    0.453   10.441    0.000
##    .powerless67       2.563    0.403    6.359    0.000
##    .anomia71          4.399    0.515    8.542    0.000
##    .powerless71       3.070    0.434    7.070    0.000
##     ses               6.798    0.649   10.475    0.000
##    .alien67           4.841    0.467   10.359    0.000
##    .alien71           4.083    0.404   10.104    0.000
##
## Defined Parameters:
##                    Estimate  Std.Err  z-value  P(>|z|)
##     IndirectEffect   -0.349    0.041   -8.538    0.000
sink()
require(semPlot)
require(qgraph)
semPaths(alienation, style = "lisrel", what = "std", curvePivot = TRUE)
What happens if you remove the arrow between “Alien67” and “Alien71”? (Compare the fit indices and make a plot)
What happens if you remove the correlated residuals?
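A hedged starting point for the first exercise: drop the relevant line from the model string, refit, and compare the nested models with `anova()` (a likelihood-ratio test). The same recipe works for the correlated residuals. Sketch (self-contained, so the covariance matrix is rebuilt here):

```r
require(lavaan)
lower <- '11.834, 6.947, 9.364, 6.819, 5.091, 12.532, 4.783, 5.028, 7.495,
          9.986, -3.839, -3.889, -3.841, -3.625, 9.610, -21.899, -18.831,
          -21.748, -18.775, 35.522, 450.288'
wheaton.cov <- getCov(lower, names = c("anomia67", "powerless67", "anomia71",
                                       "powerless71", "education", "sei"))
full <- '
  ses     =~ education + sei
  alien67 =~ anomia67 + powerless67
  alien71 =~ anomia71 + powerless71
  alien71 ~ aa*alien67 + ses
  alien67 ~ sa*ses
  anomia67 ~~ anomia71
  powerless67 ~~ powerless71
'
# constrained model: the alien67 -> alien71 arrow is removed
nopath <- '
  ses     =~ education + sei
  alien67 =~ anomia67 + powerless67
  alien71 =~ anomia71 + powerless71
  alien71 ~ ses
  alien67 ~ sa*ses
  anomia67 ~~ anomia71
  powerless67 ~~ powerless71
'
fit_full   <- sem(full,   sample.cov = wheaton.cov, sample.nobs = 932)
fit_nopath <- sem(nopath, sample.cov = wheaton.cov, sample.nobs = 932)
anova(fit_full, fit_nopath)  # does the constrained model fit significantly worse?
```

Then plot the constrained fit with semPaths() as before and compare the fit indices.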
Use the following code below to generate a correlation matrix to use (you’ll need to load lavaan).
The data come from this paper: a study of 131 students testing the Theory of Planned Behaviour as applied to graduate school applications.
require(lavaan)
data <- lav_matrix_lower2full(c(1,
                                .47, 1,
                                .66, .50, 1,
                                .77, .41, .46, 1,
                                .52, .38, .50, .50, 1))
rownames(data) <- colnames(data) <- c("att", "sn", "pbc", "int", "beh")  # a matrix.
Reconstruct the model from the article. (Note that there are no (!) latent variables)
Discuss the fit. Make a sem plot.
What happens when you remove the indirect path?
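If you get stuck, a hedged sketch of the classic TPB structure; the regressions below are the textbook TPB paths (attitude, subjective norm, and perceived behavioural control predict intention; intention and PBC predict behaviour), not necessarily the authors' exact model, so verify them against the article:

```r
require(lavaan)
tpb.cor <- lav_matrix_lower2full(c(1,
                                   .47, 1,
                                   .66, .50, 1,
                                   .77, .41, .46, 1,
                                   .52, .38, .50, .50, 1))
rownames(tpb.cor) <- colnames(tpb.cor) <- c("att", "sn", "pbc", "int", "beh")
tpb.model <- '  # textbook TPB paths (assumption -- check against the paper)
  int ~ att + sn + pbc
  beh ~ int + pbc
'
# path analysis: observed variables only, fit from the correlation matrix
tpb.fit <- sem(tpb.model, sample.cov = tpb.cor, sample.nobs = 131)
summary(tpb.fit, standardized = TRUE, fit.measures = TRUE)
```

Removing `int` from the `beh` equation (the indirect route) and refitting lets you compare the fit indices, as in the alienation exercise.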
Also check the reading list! (many more than listed here)
Clark, M. (2017). SEM, Graphical and Latent Variable Modeling, http://m-clark.github.io/docs/sem/
Rosseel, Y. (2017). Lavaan. http://lavaan.ugent.be/index.html
Tarr, G. (2017). Dplyr exercises. https://garthtarr.github.io/meatR/dplyr_ex1.html