
PY0782: Advanced Quantitative Research Methods.

  • Last week: ANOVA and its associated nightmares.
  • Today: correlation and regression.

Goals (today)

Correlation and ordinary least squares (OLS) regression

Assumptions underpinning regression

Logistic regression.

Assignment

After today you should be able to complete the following sections:

Correlation

OLS Regression / assumptions.

Plots.

Logistic regression

(Pearson) Correlation

A correlation is often the first step: you want to find out the strength of an association between two variables.

Formula (for your reference)

The sample correlation coefficient can be written in two equivalent ways.

\(r = \frac{\Sigma(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\Sigma(x_i - \bar{x})^2\Sigma(y_i - \bar{y})^2}}\)

is equivalent to

\(r = \frac{\mathrm{cov}(x,y)}{\sqrt{\mathrm{var}(x)\,\mathrm{var}(y)}}\)

Correlation.

Covariance divided by the geometric mean of variances.
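As a quick sanity check in R (a minimal sketch using the built-in iris data, which is introduced properly below), the two formulae give the same answer:

x <- iris$Sepal.Length
y <- iris$Sepal.Width
cov(x, y) / sqrt(var(x) * var(y))  # covariance over the geometric mean of the variances
cor(x, y)                          # same value: -0.1175698 (see the correlation slide below)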

Read back your notes or check here

Assumptions.

levels of measurement: continuous (alternative: ordinal correlations, e.g., Spearman or Kendall)

normality (outliers)

homoscedasticity

linearity (think back to Anscombe's quartet)

Population versus sample.

Note that if you have sampled the entire population, there is nothing left to infer: significance tests and p-values no longer apply!

A particular pet peeve of mine.

Remember.

Correlation is not causation (although some work in computer science attempts to infer causal direction from simple associations!)

Check this

Correlation.

setwd("~/Dropbox/Teaching_MRes_Northumbria/Lecture4")
require(datasets)
iris<-iris # not really necessary but typically good to store away
cor(iris[1],iris[2]) # Correlation.
##              Sepal.Width
## Sepal.Length  -0.1175698
cor(iris$Sepal.Length, iris$Sepal.Width)
## [1] -0.1175698
# you can also change the method: 'kendall' or 'spearman'.
cor(iris[1],iris[2], method='kendall')
##              Sepal.Width
## Sepal.Length -0.07699679

Significance test.

# custom confidence interval 90%
cor.test(iris$Sepal.Length, iris$Sepal.Width, conf.level = .90)
## 
##  Pearson's product-moment correlation
## 
## data:  iris$Sepal.Length and iris$Sepal.Width
## t = -1.4403, df = 148, p-value = 0.1519
## alternative hypothesis: true correlation is not equal to 0
## 90 percent confidence interval:
##  -0.24846981  0.01754741
## sample estimates:
##        cor 
## -0.1175698

Sample write up

The Pearson correlation between sepal length and sepal width was not significant (r(148)= -.12, p=.15).

Make a pretty chart.

require(ggplot2)  # note geom_jitter
require(ggthemes)
require(RColorBrewer)
gg <- ggplot(data = iris, aes(Sepal.Width, Sepal.Length, color = Species)) + 
    geom_jitter()
gg <- gg + facet_grid(. ~ Species) + geom_smooth(method = "lm")
gg <- gg + scale_colour_brewer(palette = "Dark2")
gg <- gg + labs(x = "Sepal Width", y = "Sepal Length", title = "Sepals") + 
    theme_tufte(12)
gg

Correlation matrix

require(apaTables)
require(dplyr)
cor_table<-dplyr::select(iris,-Species)
apa.cor.table(cor_table, filename = "correlations-iris.doc")
## 
## 
## Means, standard deviations, and correlations with confidence intervals
##  
## 
##   Variable        M    SD   1           2            3         
##   1. Sepal.Length 5.84 0.83                                    
##                                                                
##   2. Sepal.Width  3.06 0.44 -.12                               
##                             [-.27, .04]                        
##                                                                
##   3. Petal.Length 3.76 1.77 .87**       -.43**                 
##                             [.83, .91]  [-.55, -.29]           
##                                                                
##   4. Petal.Width  1.20 0.76 .82**       -.37**       .96**     
##                             [.76, .86]  [-.50, -.22] [.95, .97]
##                                                                
## 
## Note. M and SD are used to represent mean and standard deviation, respectively.
## Values in square brackets indicate the 95% confidence interval.
## The confidence interval is a plausible range of population correlations 
## that could have caused the sample correlation (Cumming, 2014).
## * indicates p < .05. ** indicates p < .01.
## 

ggcorrplot.

# ggcorrplot: you can further tweak the result, as it is a ggplot object.
require(ggcorrplot)
cormatrix <- cor(cor_table)
ggcorrplot(cormatrix, 
    hc.order = TRUE, 
    type = "lower", lab = TRUE)

Other packages exist.

For further info. see the tutorial referenced at the end by Willems.

??corrplot

New data: Prestige dataset

This is based on Fox (2015) and Fox (2002)

library(carData)
prestige<-carData::Prestige

What is it about?

education: The average number of years of education for occupational incumbents.

income: The average income of occupational incumbents, in dollars.

women: The percentage of women in the occupation.

prestige: The average prestige rating for the occupation.

census: The code of the occupation used in the survey.

type: Professional and managerial (prof), white collar (wc), blue collar (bc), or missing (NA).

Try it yourself.

Load the Prestige dataset

Make a correlation matrix with all the continuous variables (use select from dplyr to drop the non-numeric columns).

And: either export a .docx or make a correlation plot.

Partial correlations.

Use a partial correlation when you want to control for a third continuous variable (z).

require(ppcor)
pcor.test(x = prestige$education, y= prestige$prestige , z=prestige$income)
##    estimate      p.value statistic   n gp  Method
## 1 0.7660533 1.032202e-20  11.85813 102  1 pearson

Linear regression.

When would you use linear regression?

Run a basic one.

model_education<-lm(prestige~education, data=prestige)
summary(model_education)
## 
## Call:
## lm(formula = prestige ~ education, data = prestige)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -26.0397  -6.5228   0.6611   6.7430  18.1636 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -10.732      3.677  -2.919  0.00434 ** 
## education      5.361      0.332  16.148  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.103 on 100 degrees of freedom
## Multiple R-squared:  0.7228, Adjusted R-squared:   0.72 
## F-statistic: 260.8 on 1 and 100 DF,  p-value: < 2.2e-16

Assumptions.

Linearity: Think back to the datasaurus! (This also relates to multicollinearity, measurement error).

Normality: All errors normally distributed around 0.

Homoscedasticity: The variance of the errors should be constant across the (combinations of) predictor values (as with ANOVA).

Independence: Errors should be independent.

More here.

How to check?

Linearity: Scatterplot between X & Y

Normality: Plot of error terms

Homoscedasticity: Many options (e.g., the Breusch-Pagan test). Visual check: is there a larger spread of measurements around the regression line at one side of the scatterplot than at the other?

Independence: Durbin-Watson test / Check for autocorrelation (Ljung-Box test)

Solutions: linearity.

Good idea to plot before you start.

require(ggplot2)
require(scales)
require(ggthemes)
plot_edu <- ggplot(prestige, aes(education, prestige)) + geom_jitter()
plot_edu <- plot_edu + scale_y_continuous(limits = c(0, 100), breaks = pretty_breaks(6)) + 
    scale_x_continuous(limits = c(6, 18), breaks = pretty_breaks(6)) + ylab("Prestige") + 
    xlab("Education") + geom_smooth(method = "loess") + theme_tufte(12)

Plot: By and large linear?

plot_edu

plot()

4 diagnostic plots (but you still have to check independence of the errors). More here. Here is how you do it in ggplot2!

plot(model_education, which = 1)
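As a minimal sketch of the ggplot2 approach mentioned above, the first diagnostic (residuals versus fitted values) can be rebuilt from the model object like this:

require(ggplot2)
diag_df <- data.frame(fitted = fitted(model_education),
                      resid  = resid(model_education))
ggplot(diag_df, aes(fitted, resid)) +
    geom_point() +
    geom_smooth(method = "loess", se = FALSE) +  # smoother, as in plot.lm
    geom_hline(yintercept = 0, linetype = "dashed") +
    labs(x = "Fitted values", y = "Residuals")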

Q-Q plot.

plot(model_education, which = 2)

Scale location plot.

plot(model_education, which = 3)

Outliers (Cook's distance).

plot(model_education, which = 4)

Can't we just use scatter plots?

Perhaps for basic models, but it quickly becomes complex.

Some of these plots give you information which you could not get from just scatter plots.

Homoscedasticity: Breusch-Pagan test.

require(lmtest)
bptest(model_education)
## 
##  studentized Breusch-Pagan test
## 
## data:  model_education
## BP = 0.70577, df = 1, p-value = 0.4009

Durbin-Watson test (lmtest).

dwtest(model_education) #Durbin-Watson test.
## 
##  Durbin-Watson test
## 
## data:  model_education
## DW = 1.4392, p-value = 0.001598
## alternative hypothesis: true autocorrelation is greater than 0

Box test.

Also note that there is one other alternative typically used in econometrics (Breusch-Godfrey). Check here.

Box.test(residuals(model_education), type="Ljung-Box") #Box test
## 
##  Box-Ljung test
## 
## data:  residuals(model_education)
## X-squared = 7.9574, df = 1, p-value = 0.004789

Write up.

Visual inspection suggested that a linear fit was appropriate. Visual inspection also suggested that the errors from the OLS regression model were approximately normally distributed. A Breusch-Pagan test showed that the assumption of homoscedasticity could be upheld (\(\chi^2\)(1)= 0.71, p=.40). However, a Durbin-Watson test indicated that the assumption of independence of errors was violated (Durbin-Watson test= 1.44, p=.002).

Problems… and solutions.

Linearity: Transform the data (e.g., log-transform) / fit a curve.

Normality: Transform the data / fit a curve.

Heteroscedasticity: the sandwich package can calculate standard errors that adjust for this (see the sketch below).

Autocorrelation: the sandwich package can also handle this (e.g., Newey-West errors); alternatively, look into panel model packages (??plm).
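A minimal sketch of the sandwich approach, reusing model_education from above: vcovHC() gives heteroscedasticity-consistent standard errors, NeweyWest() gives errors that are also robust to autocorrelation.

require(sandwich)
require(lmtest)
# heteroscedasticity-consistent (robust) standard errors
coeftest(model_education, vcov = vcovHC(model_education, type = "HC3"))
# Newey-West standard errors (robust to heteroscedasticity and autocorrelation)
coeftest(model_education, vcov = NeweyWest(model_education))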

Hierarchical regression.

Usually we run more than one! And we build a table. (Normally you would check the assumptions of every model!)

lm_education_women<-lm(prestige~education+women, data=prestige)
summary(lm_education_women)
## 
## Call:
## lm(formula = prestige ~ education + women, data = prestige)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -28.010  -4.069   1.050   5.027  18.942 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -8.75416    3.54213  -2.471 0.015164 *  
## education    5.42780    0.31612  17.170  < 2e-16 ***
## women       -0.09305    0.02719  -3.422 0.000904 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.652 on 99 degrees of freedom
## Multiple R-squared:  0.7521, Adjusted R-squared:  0.7471 
## F-statistic: 150.2 on 2 and 99 DF,  p-value: < 2.2e-16

Stargazer!

require(stargazer)
stargazer(model_education, lm_education_women, dep.var.labels = c("Prestige"), 
    covariate.labels = c("Education (years)", "Women (%)"), style = "demography", 
    out = "hierarchical.htm", header = F)

Sample write up.

Table X describes a hierarchical regression on prestige. The first model showed a statistically significant, positive effect of years of education on prestige. In the second model, the effect of education was upheld while controlling for the percentage of women. The effect of the percentage of women was also statistically significant.

Interpreting B coefficients.

Remember centering (week 2)? We can use our model to make predictions (see the sketch after the code below).

require(dplyr)
# use :: just in case of namespace conflicts.
prestige <- dplyr::mutate(prestige, women_cent = women - mean(women), 
    education_cent = education - mean(education), prestige_cent = prestige - 
        mean(prestige))
centered_model <- lm(prestige_cent ~ education_cent + women_cent, 
    data = prestige)
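You can also let predict() do this arithmetic for you (a minimal sketch; the value should match the 10.86 calculated below):

# predicted shift in (centered) prestige for 2 extra years of education,
# holding the percentage of women at its mean (centered value 0)
predict(centered_model, newdata = data.frame(education_cent = 2, women_cent = 0))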

Centered model.

2 years of extra education –> increase of 10.86 in prestige scores (all else being equal!)

a 10 percentage point shift in the percentage of women –> decrease of around 1 point (0.93) in prestige.

summary(centered_model)
## 
## Call:
## lm(formula = prestige_cent ~ education_cent + women_cent, data = prestige)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -28.010  -4.069   1.050   5.027  18.942 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -1.865e-15  8.566e-01   0.000 1.000000    
## education_cent  5.428e+00  3.161e-01  17.170  < 2e-16 ***
## women_cent     -9.305e-02  2.719e-02  -3.422 0.000904 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.652 on 99 degrees of freedom
## Multiple R-squared:  0.7521, Adjusted R-squared:  0.7471 
## F-statistic: 150.2 on 2 and 99 DF,  p-value: < 2.2e-16
2*5.428
## [1] 10.856
#alternatively
2*coefficients(centered_model)[2]
## education_cent 
##       10.85559
# Shift of 10%
10*coefficients(centered_model)[3]
## women_cent 
## -0.9304622

from B to Beta…

Remember scale(): \(\beta\)'s are nothing more than standardized B's! (But beware: scale() returns a matrix; alternatively, look at z-score functions.)

Interpreted in shifts of standard deviations. A \(\beta\) of 1 implies that a shift of 1SD in X leads to a shift of 1SD in Y.

# -1 so we can drop the intercept -- standardized regression has no intercept.
model_education_stand<-lm(scale(prestige)~scale(education)-1, data=prestige)
summary(model_education_stand)
## 
## Call:
## lm(formula = scale(prestige) ~ scale(education) - 1, data = prestige)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.51354 -0.37914  0.03843  0.39193  1.05575 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## scale(education)  0.85018    0.05239   16.23   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5265 on 101 degrees of freedom
## Multiple R-squared:  0.7228, Adjusted R-squared:  0.7201 
## F-statistic: 263.4 on 1 and 101 DF,  p-value: < 2.2e-16

Alternative. QuantPsyc package.

Note: be wary with interaction terms! (Center them beforehand; the values from QuantPsyc might not be correct otherwise.) An alternative is the lm.beta package, which has a function of the same name.

require(QuantPsyc)
lm.beta(model_education)
## education 
## 0.8501769

Bootstrapping a B coefficient.

Again note that you can do this for any statistic.

# Bootstrap 95% CI for B coefficient.
library(boot)
# function which does bootstrapping.
coeff_B <- function(data, indices) {
data_boot <- data[indices, ] # allows boot to select sample
B_boot <- lm(prestige_cent ~ education_cent + women_cent, data = data_boot)
return(B_boot$coeff[2]) # education in position 2 / alter for different variables.
}
# bootstrapping with 10000 replications
set.seed(666)
results_coeff_B <- boot(data = prestige, statistic = coeff_B, R = 10000)

View results.

# view results
results_coeff_B
## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = prestige, statistic = coeff_B, R = 10000)
## 
## 
## Bootstrap Statistics :
##     original      bias    std. error
## t1* 5.427796 0.003518518    0.288098

95% BCa (bias-corrected and accelerated) confidence interval.

# get 95% confidence interval 
boot.ci(results_coeff_B, type="bca")
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 10000 bootstrap replicates
## 
## CALL : 
## boot.ci(boot.out = results_coeff_B, type = "bca")
## 
## Intervals : 
## Level       BCa          
## 95%   ( 4.854,  5.972 )  
## Calculations and Intervals on Original Scale

The 95% BCa confidence interval (based on 10,000 bootstrap resamples) for the (centered) coefficient of education is [4.85, 5.97].

Logistic Regression.

Sometimes we have a dichotomous outcome (for example: Yes/No, Alive/Dead)

We cannot use OLS regression, as a dichotomous outcome violates its assumptions (for example, normally distributed errors).

Clever trick…

We can turn proportions (probabilities) into odds, and then into log-odds (logits), which can be modelled as a linear function of the predictors (see the formulae below).

Read the references for full explanation and underpinning.
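In symbols (a reference sketch for a single predictor \(x\)): if \(p\) is the probability of the outcome, the odds are \(p/(1-p)\), and logistic regression models the log-odds as a linear function of the predictor:

\(\ln\left(\frac{p}{1-p}\right) = b_0 + b_1x\)

which can be inverted to give \(p = \frac{1}{1+e^{-(b_0 + b_1x)}}\).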

Assumptions logistic regression.

Some typical OLS/MANOVA assumptions do not apply (homoscedasticity, normally distributed error terms, …), but:

Correct coding of dependent variable (correctly measured).

All relevant variables included. No over/underfitting.

Independent observations. (Relatedly, no strong multicollinearity: the independent variables should not be strongly related to one another - a quick check is sketched after this list.)

Linearity between log-odds and independent variable.

Sufficient sample size for estimation (according to some, n = 10 per IV is OK; others recommend n > 30 per IV).
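As a minimal sketch of a multicollinearity screen (assuming the car package; here reusing lm_education_women from the OLS section above, but car::vif() also accepts glm objects):

require(car)
vif(lm_education_women)  # values well above roughly 5-10 are usually taken as a warning sign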

Assumption checks?

Some argue that logistic regression does not require assumption checks, and thus we would not check these.

Long story, but maximum likelihood estimation has better estimation properties than OLS in this setting.

Instead, some argue that you should think about sample size, distributions, et cetera at the design stage, before running the model.

Again, you can always bootstrap, which might make some of your conclusions more robust!

Pima Indian data.

9 variables and 768 cases, women aged 21 or older.

require(mlbench) # contains the data
require(skimr)
# Pima data first needs data() command.
data(PimaIndiansDiabetes2)
pima_data<-PimaIndiansDiabetes2
skim(pima_data)
## Skim summary statistics
##  n obs: 768 
##  n variables: 9 
## 
## ── Variable type:factor ───────────────────────────────────────────────────────────────────────────────
##  variable missing complete   n n_unique                top_counts ordered
##  diabetes       0      768 768        2 neg: 500, pos: 268, NA: 0   FALSE
## 
## ── Variable type:numeric ──────────────────────────────────────────────────────────────────────────────
##  variable missing complete   n   mean     sd     p0   p25    p50    p75
##       age       0      768 768  33.24  11.76 21     24     29     41   
##   glucose       5      763 768 121.69  30.54 44     99    117    141   
##   insulin     374      394 768 155.55 118.78 14     76.25 125    190   
##      mass      11      757 768  32.46   6.92 18.2   27.5   32.3   36.6 
##  pedigree       0      768 768   0.47   0.33  0.078  0.24   0.37   0.63
##  pregnant       0      768 768   3.85   3.37  0      1      3      6   
##  pressure      35      733 768  72.41  12.38 24     64     72     80   
##   triceps     227      541 768  29.15  10.48  7     22     29     36   
##    p100     hist
##   81    ▇▃▂▂▁▁▁▁
##  199    ▁▂▇▇▆▃▂▂
##  846    ▇▆▂▁▁▁▁▁
##   67.1  ▃▇▇▅▂▁▁▁
##    2.42 ▇▅▂▁▁▁▁▁
##   17    ▇▃▂▂▁▁▁▁
##  122    ▁▁▃▇▇▂▁▁
##   99    ▃▇▇▂▁▁▁▁

Logistic regression.

log1<-glm(diabetes~mass, data=pima_data, family=binomial)
log2<-glm(diabetes~mass+age, data=pima_data, family=binomial)
summary(log1)
## 
## Call:
## glm(formula = diabetes ~ mass, family = binomial, data = pima_data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.0094  -0.9184  -0.6598   1.2254   1.9107  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -3.99682    0.42885   -9.32  < 2e-16 ***
## mass         0.10250    0.01261    8.13 4.31e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 981.53  on 756  degrees of freedom
## Residual deviance: 904.89  on 755  degrees of freedom
##   (11 observations deleted due to missingness)
## AIC: 908.89
## 
## Number of Fisher Scoring iterations: 4

Summary of log2.

summary(log2)
## 
## Call:
## glm(formula = diabetes ~ mass + age, family = binomial, data = pima_data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.0306  -0.8907  -0.5692   1.1011   2.1555  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -5.740908   0.537331 -10.684  < 2e-16 ***
## mass         0.107810   0.013071   8.248  < 2e-16 ***
## age          0.045746   0.007017   6.519 7.07e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 981.53  on 756  degrees of freedom
## Residual deviance: 860.61  on 754  degrees of freedom
##   (11 observations deleted due to missingness)
## AIC: 866.61
## 
## Number of Fisher Scoring iterations: 4

Likelihood ratio tests.

summary() gives you parameter estimates and associated tests (Wald tests).

Likelihood ratio tests tend to perform better with smaller samples.

You should be concerned if the conclusions differ (likely suggests non-linearity).

Likelihood ratio tests.

anova(log1, test="Chisq")
## Analysis of Deviance Table
## 
## Model: binomial, link: logit
## 
## Response: diabetes
## 
## Terms added sequentially (first to last)
## 
## 
##      Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
## NULL                   756     981.53              
## mass  1   76.637       755     904.89 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Likelihood ratio tests, log2

anova(log2, test="Chisq")
## Analysis of Deviance Table
## 
## Model: binomial, link: logit
## 
## Response: diabetes
## 
## Terms added sequentially (first to last)
## 
## 
##      Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
## NULL                   756     981.53              
## mass  1   76.637       755     904.89 < 2.2e-16 ***
## age   1   44.279       754     860.61 2.847e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
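You can also compare the two nested models directly, rather than adding terms sequentially (a minimal sketch; log1 and log2 are both fitted on the same 757 complete cases):

# likelihood ratio test comparing the nested models log1 and log2
anova(log1, log2, test = "Chisq")
# equivalently, via lmtest
require(lmtest)
lrtest(log1, log2)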

Odds ratios.

The raw coefficients are hard to interpret directly, because they describe a linear relationship between the independent variables and the log-odds.

We can exponentiate them to obtain odds ratios (see the base-R sketch below)!
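A quick base-R alternative (a minimal sketch reusing log2 from above; confint() on a glm gives profile-likelihood intervals, so it may take a moment):

# odds ratios with profile-likelihood confidence intervals
exp(cbind(OR = coef(log2), confint(log2)))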

Odds ratios

incr = custom increments (not needed if you have factors). Note that you might want to center your data beforehand for ease of interpretation!

require(oddsratio)
## Loading required package: oddsratio
# shift of 5 in BMI.
or_glm(pima_data, incr= list(mass= 5), log1)
##   predictor oddsratio CI_low (2.5 %) CI_high (97.5 %) increment
## 1      mass     1.669          1.479            1.894         5
or_glm(pima_data, incr= list(mass= 5, age=1), log2)
##   predictor oddsratio CI_low (2.5 %) CI_high (97.5 %) increment
## 1      mass     1.714          1.512            1.955         5
## 2       age     1.047          1.033            1.061         1

Alternative way.

require(questionr)
odds.ratio(log1)
##                    OR     2.5 % 97.5 %         p    
## (Intercept) 0.0183740 0.0077693 0.0418 < 2.2e-16 ***
## mass        1.1079369 1.0814358 1.1363  4.31e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
odds.ratio(log2)
##                    OR     2.5 % 97.5 %         p    
## (Intercept) 0.0032119 0.0010872 0.0090 < 2.2e-16 ***
## mass        1.1138365 1.0862633 1.1434 < 2.2e-16 ***
## age         1.0468089 1.0326489 1.0615 7.072e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

What does that mean?

In model 2: the odds of having diabetes versus not having diabetes are 1.11 times higher (95% CI [1.09, 1.14]) for each one-unit increase in BMI (while controlling for age).

You can then use the model to calculate specific odds (see the sketch below).
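For example (a minimal sketch with hypothetical values, BMI = 35 and age = 40): predict() returns the fitted probability, from which the odds follow.

# predicted probability of diabetes for a hypothetical woman with BMI 35, aged 40
p <- predict(log2, newdata = data.frame(mass = 35, age = 40), type = "response")
p / (1 - p)  # the corresponding odds of diabetes versus no diabetes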

Where is my R squared?

It is complex: you can get a pseudo-\(R^2\), but perhaps you should not.

require(rcompanion)
nagelkerke(log1, restrictNobs=T)
## $Models
##                                                   
## Model: "glm, diabetes ~ mass, binomial, pima_data"
## Null:  "glm, diabetes ~ 1, binomial, fit$model"   
## 
## $Pseudo.R.squared.for.model.vs.null
##                              Pseudo.R.squared
## McFadden                            0.0780790
## Cox and Snell (ML)                  0.0962815
## Nagelkerke (Cragg and Uhler)        0.1325200
## 
## $Likelihood.ratio.test
##  Df.diff LogLik.diff  Chisq    p.value
##       -1     -38.318 76.637 2.0549e-18
## 
## $Number.of.observations
##           
## Model: 757
## Null:  757
## 
## $Messages
## [1] "Note: For models fit with REML, these statistics are based on refitting with ML"
## 
## $Warnings
## [1] "None"
nagelkerke(log2, restrictNobs=T)
## $Models
##                                                         
## Model: "glm, diabetes ~ mass + age, binomial, pima_data"
## Null:  "glm, diabetes ~ 1, binomial, fit$model"         
## 
## $Pseudo.R.squared.for.model.vs.null
##                              Pseudo.R.squared
## McFadden                             0.123192
## Cox and Snell (ML)                   0.147626
## Nagelkerke (Cragg and Uhler)         0.203191
## 
## $Likelihood.ratio.test
##  Df.diff LogLik.diff  Chisq    p.value
##       -2     -60.458 120.92 5.5392e-27
## 
## $Number.of.observations
##           
## Model: 757
## Null:  757
## 
## $Messages
## [1] "Note: For models fit with REML, these statistics are based on refitting with ML"
## 
## $Warnings
## [1] "None"

Stargazer!

We can get a nice table with odds ratios and model summaries. Or look into the broom package and its tidy() function (see last week).

require(stargazer)
stargazer(log1, log2, covariate.labels = c("BMI", "Age"), omit = c("Constant"), 
    apply.coef = exp, t.auto = F, p.auto = F, report = "vc*", style = "asr", 
    out = "logistic.htm", header = FALSE)

Sample write up.

Table X shows the key results for a hierarchical logistic regression analysis. In Model 1, there was a statistically significant effect of BMI on having diabetes (\(\chi^2(1)\)=76.64, p<.0001). Moreover, in Model 2 there was a significant effect of age (\(\chi^2(1)\)=44.28, p<.0001), after accounting for BMI (\(\chi^2(1)\)=76.64, p<.0001).

The \(\chi^2\) values correspond to the likelihood ratio test (the Deviance value) for the variables (you also can calculate for the model - check the rcompanion output).

You might want to add a Pseudo-\(R^2\).

You would then describe the odds ratios or illustrate the effect.

vcd package

Continuous predictors –> you can plot the odds ratios. Look at how to do that in ggplot2.

How do we plot our results for categorical variables? Mosaic plots! Also look at alternatives.

Look into the ggmosaic package if you want to do this in ggplot2.

How to do it?

First make some categorical variables.

require(dplyr)  # make some categorical variables
pima_data <- pima_data %>% mutate(obese = cut(mass, breaks = c(-Inf, 30, Inf), 
    labels = c("No", "Yes")))
pima_data <- pima_data %>% mutate(nulligravada = cut(pregnant, breaks = c(-Inf, 
    0.99, Inf), labels = c("never pregnant", "ever pregnant")))

Vcd

require(RColorBrewer)
# pick some colours.
colours <- c(color = brewer.pal(2, "Set3"))
fill_colours <- rep(colours, each = 4)  # 4 quadrants to fill
require(vcd)
mosaicplot1 <- mosaic(~diabetes + nulligravada + obese, zero_size = 0, data = pima_data, 
    gp = gpar(fill = fill_colours, col = 0), legend = F, labeling = labeling_values)

Exercise.

  • Load last week's Salaries dataset. If necessary, calculate monthly_salary again.

  • Calculate a Spearman and a Pearson correlation between years since Ph.D. and monthly salary. Conduct a significance test for one of them. Pick one and discuss the outcome.

  • Build a hierarchical regression model: in step 1, use monthly salary as dependent and years since Ph.D. as independent; in step 2, add gender (keep all variables from step 1); in step 3, add rank (keep all variables from step 2). Export the hierarchical regression table.

  • Test the assumptions for the final model, how would you combat any issues that arise in terms of violations of these assumptions?

Exercise. (cont'd)

  • Calculate the standardized coefficients for model 1 (\(\beta\)). Interpret this coefficient. Bootstrap this coefficient and report it as you would in a paper.

  • What does the final model predict in terms of the shift in monthly salary when someone is 3 more years post-Ph.D.?

  • Install and load the 'caret' package. Load the GermanCredit dataset. Use ??GermanCredit to find out more; also see here. It contains data on credit scores for 1,000 cases and has 62 variables. As done before, you might want to explore the descriptive statistics.

  • Make a corrplot() for up to 10 continuous variables in the GermanCredit dataset. Export a correlation matrix (up to 10 variables).

Exercise (cont'd)

  • Conduct a hierarchical logistic regression on 'Class' as dependent, in each step keeping the previous variables. In step 1 add Amount. In step 2 add CreditHistory.PaidDuly. In step 3 add Age. In step 4 add Telephone. Export a table with your key results.

  • Choose some categorical variables and make a (beautiful) mosaic plot with credit class as dependent.

  • Calculate all odds ratios for model 4. Calculate the odds ratio for a 1,000 DM (Deutsche Mark, pre-euro era) shift in Amount for model 4.

  • BONUS: Make a mosaic plot via ggmosaic() with credit class as dependent.

  • BONUS: Have a look at ROC curves and plot one for the credit data.

References (and further reading.)

Also check the reading list! (many more than listed here)

  • Crawley, M. J. (2013). The R book: Second edition. New York, NY: John Wiley & Sons.
  • Fox, J. (2015). Applied regression analysis and generalized linear models. London, UK: Sage.
  • Torres-Reyna, O. (2010). Regression 101.
  • Wickham, H., & Grolemund, G. (2017). R for data science. Sebastopol, CA: O’Reilly.
  • Willems, K. (2017). R correlation tutorial.
  • Zuur, A., Ieno, E.N., Walker, N., Saveliev, A.A., & Smith, G.M. (2009). Mixed effects models and extensions in ecology with R. New York, NY: Springer.