Introduction.

This is a worksheet for use with Lecture 4.

You have videos of me narrating these slides. Note that there may be minor discrepancies between the current set of slides and the set shown in the video. The slide numbers refer to the current set. I do not cover every single slide, but you can code along!

If you answer correctly, the colour of the box will change! (Don't worry about bonus questions, they are very much just that: a bonus!)

Slides

Correlation (Slide 7)

Which range of values can Pearson r take?

(No decimals)

Have a look at this tutorial on correlation; it holds the clue. By convention, give the smallest number first.

Assumptions (Slide 8)

Note that some of these are not assumptions about Pearson r per se, but rather assumptions that apply when we make statistical inferences (using a confidence interval based on the normal distribution, or a t-test).

Correlation (Slide 11)

What is the Spearman correlation between sepal width and sepal length (round to 3 decimals)?

My answer:
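A minimal sketch of how this correlation could be computed. It assumes the slide uses R's built-in iris data; the worksheet does not name the dataset, but the variables sepal width and sepal length suggest it.

```r
# Assumption: the built-in iris data set is the one used on the slide.
data(iris)

# Spearman rank correlation between sepal width and sepal length.
rho <- cor(iris$Sepal.Width, iris$Sepal.Length, method = "spearman")

# Round to 3 decimals, as the question asks.
round(rho, 3)
```

Spearman's coefficient is the Pearson correlation computed on the ranks of the two variables, which is why only the `method` argument changes.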

Make a pretty graph (Slide 14)

The graph does not fit fully on the slide. You can run the code in your R script to see it in full.

Suppose I wanted the Wall Street Journal theme at font size 14 rather than the tufte theme at font size 12. What would I substitute for theme_tufte(12)?

My answer:
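As a rough illustration of how themes are swapped, the sketch below applies two themes to the same plot. It assumes that theme_tufte() comes from the ggthemes package, and it uses the built-in iris data and theme_economist() purely as stand-ins; neither is taken from the slide.

```r
library(ggplot2)
library(ggthemes)  # assumption: the slide's theme_tufte() comes from ggthemes

# Stand-in plot; the data and aesthetics on the slide may differ.
p <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point()

# Tufte theme at base font size 12, as on the slide.
p_tufte <- p + theme_tufte(12)

# A different ggthemes theme at a different size follows the same pattern.
# theme_economist() is used here only as an illustration.
p_other <- p + theme_economist(base_size = 16)

p_other
```

The first argument of these theme functions is the base font size, so the theme name and the size can be changed independently.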

Try it yourself (Slide 19)

Note that the previous version said to use 'filter', but that should have been 'select'. We are filtering out variables, but filter() is not the command for that: it works on rows, whereas select() works on columns.

Remember that you can check the type of the variables in your dataframe in several ways: look in the environment, view the dataframe, read the description in the carData package, or run skim() from skimr first. This might help you identify which variables are continuous.

You might have to use dplyr::select rather than select if another loaded package also has a function with this name.
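The points above can be sketched as follows. The built-in iris data is a stand-in here (the exercise itself uses a dataset from carData), so the variable names are not those of the exercise.

```r
library(dplyr)

# Stand-in data: built-in iris (the exercise uses a carData data set instead).
# str() gives a base-R overview of variable types; skimr::skim(iris) gives a richer one.
str(iris)

# Keep only the numeric (continuous) columns.
# The dplyr:: prefix avoids clashes with select() from other packages, e.g. MASS.
iris_numeric <- iris %>%
  dplyr::select(where(is.numeric))

names(iris_numeric)
```

Using where(is.numeric) avoids typing each column name, but naming the columns explicitly inside dplyr::select() works just as well.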

Partial correlations (Slide 21)

What distribution does the 'statistic' in the output have?

t-distribution
F-distribution
z-distribution (standard normal)
None of the above

My answer:

Find the clue here: have a look at equations 2.8-2.9.
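For intuition about what a partial correlation measures, here is a minimal base-R sketch. It is not necessarily how the slide computes it, and the iris variables are stand-ins chosen only for illustration; partial_cor() is a hypothetical helper, not a function from any package.

```r
# Partial correlation of x and y controlling for z, computed as the
# correlation between the residuals of x ~ z and of y ~ z.
partial_cor <- function(x, y, z) {
  res_x <- resid(lm(x ~ z))
  res_y <- resid(lm(y ~ z))
  cor(res_x, res_y)
}

r_xy_z <- partial_cor(iris$Sepal.Length, iris$Sepal.Width, iris$Petal.Length)

# Cross-check against the formula based on the three pairwise correlations.
r_xy <- cor(iris$Sepal.Length, iris$Sepal.Width)
r_xz <- cor(iris$Sepal.Length, iris$Petal.Length)
r_yz <- cor(iris$Sepal.Width, iris$Petal.Length)
r_formula <- (r_xy - r_xz * r_yz) / sqrt((1 - r_xz^2) * (1 - r_yz^2))

all.equal(r_xy_z, r_formula)
```

Both routes give the same coefficient; a dedicated function additionally reports a test statistic and p-value.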

Bonus

What distribution would the 'statistic' in the output have if I had requested a Kendall correlation coefficient?

t-distribution
F-distribution
z-distribution (standard normal)
None of the above

My answer:

Find the clue here: have a look at equations 2.10-2.11.

Linear regression (Slide 22).

The following would be a scenario where I would use linear regression (True or False):

Testing the prediction that a respondent's self-reported hunger rating (a rating from 0 to 100) predicts the choice of an apple or a candy bar (dependent variable).

Look ahead to slide 24: are the data likely to be normal when the outcome variable can only take two values?

Plot: By and large linear? (Slide 27)

Note that it has printed a message: geom_smooth() using formula 'y ~ x'.
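The message is informational only. As a sketch, spelling out the formula yourself keeps the same fit and stops the message from being printed. The mtcars data and method = "lm" are assumptions for illustration; the slide uses its own data and may use a different smoothing method.

```r
library(ggplot2)

# Stand-in data (mtcars); the slide uses its own data set.
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  # Stating the formula explicitly matches the default and suppresses the message.
  geom_smooth(method = "lm", formula = y ~ x)

p
```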

What is the unit for the variable on the X-axis?

Educational score
Z-scores
years
None of the above

My answer:

Outliers (Slide 31)

Some have suggested Cook's distance of 1 as a cut-off. However, another common guideline is that Cook's distance should be smaller than 4/n.
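A minimal sketch of how both guidelines could be checked in R. The model below is a stand-in fitted to the built-in mtcars data, so its cutoff is not the one for the lecture dataset.

```r
# Stand-in model on built-in mtcars; replace with the model from the slides.
model <- lm(mpg ~ wt + hp, data = mtcars)

cooks_d <- cooks.distance(model)  # one value per observation
n <- nobs(model)                  # number of observations used in the fit

cutoff <- 4 / n                   # the 4/n guideline

# Observations flagged under each guideline.
flagged_4n <- names(cooks_d)[cooks_d > cutoff]
flagged_1  <- names(cooks_d)[cooks_d > 1]

round(cutoff, 3)
flagged_4n
flagged_1
```

Because 4/n is well below 1 for any reasonably sized sample, the 4/n rule is the stricter of the two and flags at least as many observations.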

Bonus

In our dataset that would be the following cutoff (round to three decimals): ... .

My answer:

Note how this relates to what is highlighted on the plot!

Box test (Slide 35)

Note that this test deals with autocorrelation. We came across a Box test before, but that one was called BoxM (Box's M), a test for the equality of covariance matrices. Do you recall which technique that test was associated with?

My answer:
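For the autocorrelation test itself, a sketch using Box.test() from base R's stats package could look like this. The model is a stand-in fitted to mtcars, and lag = 1 is an assumption; the slide may use different settings.

```r
# Stand-in model on built-in mtcars; replace with the model from the slides.
model <- lm(mpg ~ wt + hp, data = mtcars)

# Box-Pierce test (the default type) on the residuals, at lag 1.
bp <- Box.test(residuals(model), lag = 1, type = "Box-Pierce")

# Ljung-Box variant of the same test.
lb <- Box.test(residuals(model), lag = 1, type = "Ljung-Box")

bp$p.value
lb$p.value
```

The test only makes sense when the rows have a meaningful order (for example, time); mtcars is used here only to show the call.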

Interpreting B coefficients (Slide 41)

Centering rescales everything to a shift away from the ...

My answer:

Have a look at the formula -- what are we subtracting?

Centered model (Slide 42)

The following is hidden but described verbally:

2*5.428 = 10.86

alternatively: 2*coefficients(centered_model)[2]

A shift of 10% (or simply move the decimal point):

10*coefficients(centered_model)[3]

What do the square brackets in the above refer to?

the position within the centered model
the position within the coefficients object from the centered model
the position within the significance object from the centered model
the position within the t value object from the centered model

My answer:

Have a close look at the centered_model object. What does it contain? What does the formula above refer to?
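One way to take that close look is sketched below. The data and variable names are stand-ins (mtcars, with hypothetical centred predictors wt_c and hp_c), not the centered_model from the slides.

```r
# Stand-in data and variable names; the slides use their own centered_model.
dat <- mtcars
dat$wt_c <- dat$wt - mean(dat$wt)  # centred predictor
dat$hp_c <- dat$hp - mean(dat$hp)  # centred predictor

toy_centered_model <- lm(mpg ~ wt_c + hp_c, data = dat)

# Components stored in the fitted model object.
names(toy_centered_model)

# coefficients() returns a named numeric vector.
coefs <- coefficients(toy_centered_model)
coefs
names(coefs)
```

With mean-centred predictors, the intercept of an ordinary least squares model equals the mean of the outcome, which is a quick way to check that the centring worked.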

from B to Beta... (Slide 43)

Bonus

Is the following statement true or false?

\(\beta\)'s can be larger than 1 in a multiple regression model.

Note that I am talking here about models with more than one predictor (hence a multiple, rather than a univariate, regression model).

This is a major headache for many people, but \(\beta\)'s can be larger than 1! You can read more here or here in the context of factor analysis.
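A small numerical sketch shows how this can happen. The correlations below are hypothetical and chosen only for illustration; they are not from the lecture data. For standardized variables, the coefficients solve \(R_{xx}\beta = r_{xy}\), where \(R_{xx}\) holds the correlations among the predictors and \(r_{xy}\) the correlations of each predictor with the outcome.

```r
# Hypothetical correlations, for illustration only (not from the lecture data).
# The two predictors correlate 0.8 with each other;
# x1 correlates 0.5 with y and x2 correlates 0 with y.
R_xx <- matrix(c(1.0, 0.8,
                 0.8, 1.0), nrow = 2, byrow = TRUE)
r_xy <- c(0.5, 0.0)

# Standardized coefficients solve R_xx %*% beta = r_xy.
beta <- solve(R_xx, r_xy)
round(beta, 3)
# 1.389 -1.111

# The implied R-squared stays below 1, so these correlations are mutually consistent.
r_squared <- sum(beta * r_xy)
round(r_squared, 3)
# 0.694
```

Both coefficients exceed 1 in absolute value even though every correlation involved is well below 1; strongly correlated predictors are what make this possible.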

Logistic regression (Slide 48)

Bonus

True or False:

A logistic function is said to be a sigmoid function.

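This is pretty technical, but the logistic function is \(f(x) = \frac{L}{1 + e^{-k(x-x_0)}}\), where \(L\) is the maximum value of the curve, \(k\) its steepness and \(x_0\) its midpoint.

Wikipedia is helpful.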

You can read more here and here
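To see the shape for yourself, here is a minimal base-R sketch of the function above. The helper name logistic() and the chosen range of x are illustrative assumptions.

```r
# General logistic function: L = maximum value, k = steepness, x0 = midpoint.
logistic <- function(x, L = 1, k = 1, x0 = 0) {
  L / (1 + exp(-k * (x - x0)))
}

x <- seq(-6, 6, by = 0.5)
y <- logistic(x)

# At the midpoint the function is halfway between 0 and L.
logistic(0)
# 0.5

# With L = 1, k = 1 and x0 = 0 this matches base R's plogis().
all.equal(y, plogis(x))

# Plot the curve to inspect its shape.
plot(x, y, type = "l", xlab = "x", ylab = "f(x)")
```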

Pima Indian data (Slide 52)

Use skimr to find the answer!

What is the median for age (no decimals)?

What is the maximum glucose level reported (no decimals)?

How many women in the sample are reported to have diabetes?

What does that mean? (Slide 61)

If there is no statistically significant effect then the 95% confidence interval for the odds ratio will contain ... .

You can find the answer here.
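As a sketch of where such an interval comes from, the code below fits a stand-in logistic regression to the built-in mtcars data (not the Pima data) and converts the coefficients and their Wald 95% confidence limits to odds ratios. The slides may instead use profile-likelihood intervals via confint(), which can differ slightly.

```r
# Stand-in logistic regression on built-in mtcars (am is coded 0/1).
fit <- glm(am ~ wt, data = mtcars, family = binomial)

# Coefficients and Wald 95% confidence limits are on the log-odds scale;
# exponentiating converts them to odds ratios.
or_table <- exp(cbind(OR = coef(fit), confint.default(fit, level = 0.95)))

round(or_table, 3)
```

Each row gives the odds ratio with its lower and upper limit, so you can check directly which value the interval does or does not contain.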

Exercise (Slide 68)

Complete the exercise and submit via Blackboard!

Going further.

Have a look at resources made by Alex and Sarah.
Have a look at the resources listed at the end.

Session Info.

Thanks to Lisa DeBruine for the webexercises package. Please see general disclaimer.

sessionInfo()

## R version 4.4.2 (2024-10-31)
## Platform: aarch64-apple-darwin20
## Running under: macOS Sequoia 15.1
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: Europe/London
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] webexercises_1.1.0
## 
## loaded via a namespace (and not attached):
##  [1] digest_0.6.37     R6_2.5.1          fastmap_1.2.0     xfun_0.49        
##  [5] cachem_1.1.0      knitr_1.49        htmltools_0.5.8.1 rmarkdown_2.28   
##  [9] lifecycle_1.0.4   cli_3.6.3         sass_0.4.9        jquerylib_0.1.4  
## [13] compiler_4.4.2    rstudioapi_0.17.1 tools_4.4.2       evaluate_1.0.1   
## [17] bslib_0.8.0       yaml_2.3.10       jsonlite_1.8.9    rlang_1.1.4

PY0794: Lecture 4 - Worksheet

Dr. Thomas Pollet, Northumbria University, UK

2025-01-06

The end...