This is a worksheet for use with Lecture 4.
You have videos of me narrating these slides. Note that there may be minor discrepancies between the current set of slides and the one in the videos. The slide numbers refer to the current set. I do not cover every single slide, but you can code along!
If you answer correctly the colour of the box will change! (Don't worry about bonus questions, they are very much just that: a bonus!)
Which range of values can Pearson r take?
... to ...
(No decimals)
Have a look at this tutorial on correlation; it holds the clue. By convention, you start with the smallest number.
Note that some of these are not assumptions of Pearson r per se, but rather assumptions we rely on when making statistical inferences (using a confidence interval based on the normal distribution, or a t-test).
What is the Spearman correlation between sepal width and sepal length (round to 3 decimals)?
My answer:
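If you want to check your answer in R, here is a minimal sketch. I am assuming the sepal measurements come from R's built-in iris data, which this worksheet does not state explicitly.

```r
# Assumption: the sepal measurements are those in the built-in iris data.
data(iris)

# method = "spearman" requests the rank-based coefficient instead of Pearson's r.
rho <- cor(iris$Sepal.Width, iris$Sepal.Length, method = "spearman")

# Round to three decimals, as the question asks.
round(rho, 3)
```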
The graph is not fully fitted on the slide. You can run it in your R script.
Suppose I wanted the Wall Street Journal theme in font 14 rather than the tufte theme in font 12, what would I substitute theme_tufte(12) with?
My answer:
Note that the previous version said to use 'filter', but that should have been 'select' (we are filtering out variables, but filter is not the command for that).
Remember that you can check the types of the variables in your dataframe in several ways: in the environment pane, by viewing the dataframe, by reading the description in the carData package, or by running skim() from skimr first. This might help you identify which variables are continuous.
You might have to use dplyr::select rather than select if another loaded package also has a function called select.
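As a rough sketch of both points (checking variable types and selecting columns), here is an example. The iris data only stand in for the dataframe on the slides, and the dplyr part runs only if dplyr is installed.

```r
# Assumption: iris stands in for the dataframe used on the slides.
df <- iris

# Base R: list the class of every column.
sapply(df, class)

# Base R: keep only the numeric columns.
numeric_base <- df[sapply(df, is.numeric)]

# dplyr equivalent. The dplyr:: prefix avoids clashes with other packages
# (for example MASS) that also have a function called select().
if (requireNamespace("dplyr", quietly = TRUE)) {
  numeric_dplyr <- dplyr::select(df, Sepal.Length, Sepal.Width,
                                 Petal.Length, Petal.Width)
}

names(numeric_base)
```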
What distribution does the 'statistic' in the output have?
My answer:
Find the clue here; have a look at equations 2.8-2.9.
What distribution would the 'statistic' in the output have if I had requested a Kendall correlation coefficient?
My answer:
Find the clue here; have a look at equations 2.10-2.11.
The following would be a scenario where I would use linear regression (True or False)
Testing the prediction that a respondent's self-reported hunger rating (a rating from 0 to 100) predicts the choice of an apple or a candy bar (dependent variable).
Look ahead to slide 24: are the data likely to be normal when we have an outcome variable which can only take two values?
Note that it has printed a message: geom_smooth() using formula 'y ~ x'.
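The message is only informative. If you would rather not see it, one option is to state the formula yourself. Here is a minimal sketch, assuming ggplot2 is installed and using mtcars as a stand-in for the data on the slide.

```r
# Assumptions: ggplot2 is installed; mtcars stands in for the slide's data.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
    geom_point() +
    # Giving the formula explicitly means geom_smooth() has no default to report.
    geom_smooth(method = "lm", formula = y ~ x)
  print(p)
}
```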
What is the unit for the variable on the X-axis?
My answer:
Some have suggested Cook's distance of 1 as a cut-off. However, another common guideline is that Cook's distance should be smaller than 4/n.
In our dataset that would be the following cutoff (round to three decimals): ... .
My answer:
Note how this relates to what is highlighted on the plot!
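The 4/n guideline can be sketched as follows. The model below is fitted to mtcars purely as a stand-in, so substitute the model and data from the slides to get the cut-off for our dataset.

```r
# Assumption: a simple model on mtcars stands in for the model on the slides.
fit <- lm(mpg ~ wt, data = mtcars)

n <- nobs(fit)             # number of observations used in the fit
cutoff <- 4 / n            # the 4/n guideline
d <- cooks.distance(fit)   # one Cook's distance per observation

round(cutoff, 3)

# Observations whose Cook's distance exceeds the guideline.
names(d)[d > cutoff]
```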
Note that this test deals with autocorrelation. We came across a Box test before, but that one was called BoxM, a test for the equality of covariance matrices. Do you recall what technique that test was associated with?
My answer:
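The exact call from the slide is not reproduced in this worksheet. As one possible sketch of an autocorrelation test on regression residuals, base R offers Box.test(). The model, the data, and the choice of the Ljung-Box variant are all my assumptions here.

```r
# Assumption: a simple model on mtcars stands in for the model on the slides.
fit <- lm(mpg ~ wt, data = mtcars)

# Ljung-Box test on the residuals. The null hypothesis is that the residuals
# are independent (no autocorrelation at the requested lag).
bt <- Box.test(residuals(fit), lag = 1, type = "Ljung-Box")
bt$p.value
```

Bear in mind that the result depends on the order of the rows, so a test like this is only meaningful when the observations have a natural ordering (for example, time).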
Centering rescales everything to a shift away from the ...
My answer:
Have a look at the formula -- what are we subtracting?
The following is hidden but described verbally:
2*5.428 = 10.86
alternatively: 2*coefficients(centered_model)[2]
A shift of 10% (or simply move the decimal point):
10*coefficients(centered_model)[3]
What do the square brackets in the above refer to?
My answer:
Have a close look at the centered_model object. What does it contain? What does the formula above refer to?
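If you do not have centered_model to hand, you can experiment with a stand-in. In this sketch the data (mtcars), the predictors, and the name centered_demo are all hypothetical and are not the model from the slides.

```r
# Assumption: mtcars and these predictors stand in for the slide's data;
# centered_demo is a hypothetical stand-in for centered_model.
dat <- mtcars
dat$wt_c <- as.numeric(scale(dat$wt, center = TRUE, scale = FALSE))
dat$hp_c <- as.numeric(scale(dat$hp, center = TRUE, scale = FALSE))

centered_demo <- lm(mpg ~ wt_c + hp_c, data = dat)

# coefficients() returns a named numeric vector; print it and read the names.
coefficients(centered_demo)

# The same kind of arithmetic as above, applied to the stand-in model.
2 * coefficients(centered_demo)[2]
```

Comparing the printed names with the number in the square brackets should help with the question above.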
Is the following statement true or false?
\(\beta\)'s can be larger than 1 in a multiple regression model.
Note that I am talking here about a model with more than one predictor (hence a multiple rather than a univariate regression model).
This is a major headache for many people, but \(\beta\)'s can be larger than 1! You can read more here, or here in the context of factor analysis.
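A small simulation can make this concrete. The data below are made up purely for illustration (two nearly collinear predictors); nothing here comes from the lecture data.

```r
set.seed(1)                       # simulated data, purely illustrative
n  <- 500
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)     # x2 is almost a copy of x1
y  <- 3 * x1 - 2 * x2 + rnorm(n, sd = 0.5)

# Standardise all variables so that the slopes are standardised betas.
z <- data.frame(scale(cbind(y, x1, x2)))
std_beta <- coef(lm(y ~ x1 + x2, data = z))[-1]   # drop the intercept

std_beta
cor(x1, x2)   # the predictors are very strongly correlated
```

With a single predictor the standardised slope equals the correlation, so it cannot exceed 1 in absolute value; with strongly correlated predictors, as here, that restriction no longer holds.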
True or False:
A logistic function is said to be a sigmoid function.
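This is pretty technical, but the logistic function is \(f(x) = \frac{L}{1 + e^{-k(x-x_0)}}\), where \(L\) is the curve's maximum value, \(k\) its steepness, and \(x_0\) the x-value of its midpoint.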
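To see what this function looks like, here is a minimal sketch that codes the formula directly, with L, k, and x0 as arguments. The function name logistic and the default values are my own choices.

```r
# The formula above, written as an R function (defaults are my own choice).
logistic <- function(x, L = 1, k = 1, x0 = 0) {
  L / (1 + exp(-k * (x - x0)))
}

x <- seq(-6, 6, by = 0.1)

# With these defaults the function matches base R's plogis().
all.equal(logistic(x), plogis(x))

# Draw the curve to inspect its shape.
plot(x, logistic(x), type = "l", ylab = "f(x)")
```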
Use skimr to find the answer!
What is the median for age (no decimals)?
What is the maximum glucose level reported (no decimals)?
How many women in the sample are reported to have diabetes?
If there is no statistically significant effect then the 95% confidence interval for the odds ratio will contain ... .
You can find the answer here.
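As a sketch of how odds ratios and their 95% confidence intervals can be obtained from a logistic regression, see below. The model is fitted to mtcars only as a stand-in for the diabetes data, and the Wald intervals from confint.default() are one of several possible interval types.

```r
# Assumption: mtcars stands in for the diabetes data used on the slides.
fit <- glm(am ~ wt, data = mtcars, family = binomial)

# Coefficients are on the log-odds scale; exponentiate to get odds ratios.
# confint.default() gives Wald intervals based on the standard errors.
or_table <- exp(cbind(OR = coef(fit), confint.default(fit, level = 0.95)))
round(or_table, 3)
```

You can then check whether each interval contains the odds ratio that corresponds to "no effect".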
Complete the exercise and submit via Blackboard!
Thanks to Lisa DeBruine for the webexercises package. Please see general disclaimer.
sessionInfo()
## R version 4.3.2 (2023-10-31)
## Platform: aarch64-apple-darwin20 (64-bit)
## Running under: macOS Ventura 13.4
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.11.0
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: Europe/London
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] webexercises_1.1.0
##
## loaded via a namespace (and not attached):
## [1] digest_0.6.33 R6_2.5.1 fastmap_1.1.1 xfun_0.41
## [5] cachem_1.0.8 knitr_1.45 htmltools_0.5.7 rmarkdown_2.25
## [9] cli_3.6.1 sass_0.4.7 jquerylib_0.1.4 compiler_4.3.2
## [13] rstudioapi_0.15.0 tools_4.3.2 evaluate_0.23 bslib_0.5.1
## [17] yaml_2.3.7 jsonlite_1.8.7 rlang_1.1.3