Load last week’s Salaries dataset. If necessary, calculate monthly_salary again.
Calculate a Spearman and a Pearson correlation between years since Ph.D. and monthly salary. Conduct a significance test for one of them; pick one and discuss the outcome.
Build a hierarchical regression model: in step 1, use monthly salary as the dependent variable and years since Ph.D. as the independent variable; in step 2, add gender (keep all variables from step 1); in step 3, add rank (keep all variables from step 2). Export the hierarchical regression table.
Test the assumptions for the final model. How would you combat any issues that arise in terms of violations of these assumptions?
Calculate the standardized coefficient (\(\beta\)) for model 1. Interpret this coefficient. Bootstrap this coefficient and report it as you would in a paper.
What does the final model predict in terms of a monthly salary shift for a shift of 3 years post-Ph.D.?
Install and load the ‘caret’ package and load the GermanCredit dataset. Use ??GermanCredit to find out more. The dataset contains credit scores for 1,000 cases and has 62 variables. As done before, you might want to explore the descriptive statistics.
Make a corrplot() for up to 10 continuous variables in the GermanCredit dataset. Export a correlation matrix (up to 10 variables).
Conduct a hierarchical logistic regression with ‘Class’ as the dependent variable, keeping the previous variables at each step. In step 1, add Amount. In step 2, add CreditHistory.PaidDuly. In step 3, add Age. In step 4, add Telephone. Export a table with your key results.
Choose some categorical variables and make a (beautiful) mosaic plot with credit class as dependent.
Calculate all odds ratios for model 4. Calculate the odds ratio for a shift of 1,000 DM (Deutsche Mark, pre-euro era) on credit status for model 4.
BONUS: Make a mosaic plot via the ggmosaic package with credit class as dependent.
BONUS: Have a look at ROC curves and plot one for the credit data.
setwd("~/Dropbox/Teaching_MRes_Northumbria/Lecture4")
require(carData)
## Loading required package: carData
salaries<-carData::Salaries
require(dplyr)
## Loading required package: dplyr
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
salaries<- salaries %>% mutate(monthly_salary=salary/9)
require(skimr)
## Loading required package: skimr
skim(salaries)
Name | salaries |
Number of rows | 397 |
Number of columns | 7 |
_______________________ | |
Column type frequency: | |
factor | 3 |
numeric | 4 |
________________________ | |
Group variables | None |
Variable type: factor
skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
---|---|---|---|---|---|
rank | 0 | 1 | FALSE | 3 | Pro: 266, Ass: 67, Ass: 64 |
discipline | 0 | 1 | FALSE | 2 | B: 216, A: 181 |
sex | 0 | 1 | FALSE | 2 | Mal: 358, Fem: 39 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
yrs.since.phd | 0 | 1 | 22.31 | 12.89 | 1.00 | 12.00 | 21.00 | 32.00 | 56.00 | ▇▇▆▅▁ |
yrs.service | 0 | 1 | 17.61 | 13.01 | 0.00 | 7.00 | 16.00 | 27.00 | 60.00 | ▇▅▃▂▁ |
salary | 0 | 1 | 113706.46 | 30289.04 | 57800.00 | 91000.00 | 107300.00 | 134185.00 | 231545.00 | ▅▇▅▂▁ |
monthly_salary | 0 | 1 | 12634.05 | 3365.45 | 6422.22 | 10111.11 | 11922.22 | 14909.44 | 25727.22 | ▅▇▅▂▁ |
cor(salaries$yrs.since.phd, salaries$monthly_salary)
## [1] 0.4192311
cor(salaries$yrs.since.phd, salaries$monthly_salary, method='spearman')
## [1] 0.4755676
Here I have also calculated the 99% confidence interval.
cor.test(salaries$yrs.since.phd, salaries$monthly_salary, conf.level = .99)
##
## Pearson's product-moment correlation
##
## data: salaries$yrs.since.phd and salaries$monthly_salary
## t = 9.1775, df = 395, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 99 percent confidence interval:
## 0.3067831 0.5201368
## sample estimates:
## cor
## 0.4192311
cor.test(salaries$yrs.since.phd, salaries$monthly_salary, method = 'spearman', conf.level = .99)
## Warning in cor.test.default(salaries$yrs.since.phd, salaries$monthly_salary, :
## Cannot compute exact p-value with ties
##
## Spearman's rank correlation rho
##
## data: salaries$yrs.since.phd and salaries$monthly_salary
## S = 5468989, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.4755676
Sample write up: Years since Ph.D. and monthly salary were positively and moderately related (\(r_s\) = .476, p < .0001).
Build the models.
# rename sex to gender
colnames(salaries)[5] <-"gender"
# Linear models.
lm_1<-lm(monthly_salary ~ yrs.since.phd, data= salaries)
lm_2<-lm(monthly_salary ~ yrs.since.phd + gender, data= salaries)
lm_3<-lm(monthly_salary ~ yrs.since.phd + gender + rank, data= salaries)
summary(lm_1)
##
## Call:
## lm(formula = monthly_salary ~ yrs.since.phd, data = salaries)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9352.3 -2159.1 -317.6 1787.4 11375.9
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10190.97 307.31 33.162 <2e-16 ***
## yrs.since.phd 109.48 11.93 9.177 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3059 on 395 degrees of freedom
## Multiple R-squared: 0.1758, Adjusted R-squared: 0.1737
## F-statistic: 84.23 on 1 and 395 DF, p-value: < 2.2e-16
summary(lm_2)
##
## Call:
## lm(formula = monthly_salary ~ yrs.since.phd + gender, data = salaries)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9351.9 -2192.8 -283.5 1714.2 11336.9
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9464.65 527.59 17.939 <2e-16 ***
## yrs.since.phd 106.45 12.04 8.845 <2e-16 ***
## genderMale 880.40 520.45 1.692 0.0915 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3052 on 394 degrees of freedom
## Multiple R-squared: 0.1817, Adjusted R-squared: 0.1775
## F-statistic: 43.74 on 2 and 394 DF, p-value: < 2.2e-16
summary(lm_3)
##
## Call:
## lm(formula = monthly_salary ~ yrs.since.phd + gender + rank,
## data = salaries)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7470 -1704 -170 1351 11702
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8549.39 495.08 17.269 < 2e-16 ***
## yrs.since.phd -10.23 14.41 -0.710 0.47801
## genderMale 571.85 448.76 1.274 0.20332
## rankAssocProf 1556.98 482.53 3.227 0.00136 **
## rankProf 5292.88 490.30 10.795 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2626 on 392 degrees of freedom
## Multiple R-squared: 0.3973, Adjusted R-squared: 0.3912
## F-statistic: 64.61 on 4 and 392 DF, p-value: < 2.2e-16
Note that in the final model years since Ph.D. is no longer statistically significant. It is important to note that we have not yet looked into the assumptions or made graphs.
I chose the American Sociological Review (ASR) style.
require(stargazer)
## Loading required package: stargazer
##
## Please cite as:
## Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables.
## R package version 5.2.3. https://CRAN.R-project.org/package=stargazer
stargazer(lm_1,lm_2,lm_3, dep.var.labels=c("Monthly Salary (USD)"),covariate.labels=c("Years since Ph.D.","Gender: Male", "Rank: Associate Prof.", "Rank: Prof."), style="asr", out="hierarchical_exercise4_stargazer.htm", header=F)
##
## \begin{table}[!htbp] \centering
## \caption{}
## \label{}
## \begin{tabular}{@{\extracolsep{5pt}}lccc}
## \\[-1.8ex]\hline \\[-1.8ex]
## \\[-1.8ex] & \multicolumn{3}{c}{Monthly Salary (USD)} \\
## \\[-1.8ex] & (1) & (2) & (3)\\
## \hline \\[-1.8ex]
## Years since Ph.D. & 109.482$^{***}$ & 106.453$^{***}$ & $-$10.234 \\
## Gender: Male & & 880.403 & 571.845 \\
## Rank: Associate Prof. & & & 1,556.978$^{**}$ \\
## Rank: Prof. & & & 5,292.878$^{***}$ \\
## Constant & 10,190.970$^{***}$ & 9,464.646$^{***}$ & 8,549.387$^{***}$ \\
## \textit{N} & 397 & 397 & 397 \\
## R$^{2}$ & 0.176 & 0.182 & 0.397 \\
## Adjusted R$^{2}$ & 0.174 & 0.178 & 0.391 \\
## Residual Std. Error & 3,059.287 (df = 395) & 3,052.104 (df = 394) & 2,625.931 (df = 392) \\
## F Statistic & 84.226$^{***}$ (df = 1; 395) & 43.742$^{***}$ (df = 2; 394) & 64.613$^{***}$ (df = 4; 392) \\
## \hline \\[-1.8ex]
## \multicolumn{4}{l}{$^{*}$p $<$ .05; $^{**}$p $<$ .01; $^{***}$p $<$ .001} \\
## \end{tabular}
## \end{table}
You can also rely on the apaTables package, but I found it buggy when combining multiple models (issues with rbind()). (There is also apaStyle, but that does not play nice with rJava on Mac.) Here is an example.
require(apaTables)
## Loading required package: apaTables
apa.reg.table(lm_1,filename = 'APA_exercise4.doc',table.number = 1)
##
##
## Table 1
##
## Regression results using monthly_salary as the criterion
##
##
## Predictor b b_95%_CI beta beta_95%_CI sr2 sr2_95%_CI
## (Intercept) 10190.97** [9586.80, 10795.13]
## yrs.since.phd 109.48** [86.03, 132.94] 0.42 [0.33, 0.51] .18 [.11, .24]
##
##
##
## r Fit
##
## .42**
## R2 = .176**
## 95% CI[.11,.24]
##
##
## Note. A significant b-weight indicates the beta-weight and semi-partial correlation are also significant.
## b represents unstandardized regression weights. beta indicates the standardized regression weights.
## sr2 represents the semi-partial correlation squared. r represents the zero-order correlation.
## Square brackets are used to enclose the lower and upper limits of a confidence interval.
## * indicates p < .05. ** indicates p < .01.
##
Note that here we only checked the assumptions for the final model.
plot(lm_3, which = 1)
Plot 1: Not curvilinear, but also not very nicely distributed; not as linear as we would want. The data show clumping, likely because there are pay grades. There are no data between roughly 11,000 and 13,000 USD, yet the model interpolates linearly across that gap. That could be wrong: in reality people move up in steps, which our model does not account for. While one would lose information, a solution could be to explore ordinal methods (including tobit regression, which can handle censoring issues; a sketch follows below). There is a trade-off between keeping things ‘simple’ and easily interpretable (in reality most relationships are not wholly linear) versus describing the data more faithfully (while running the risk of overfitting). Within its remits we can still use our model, bearing in mind that it is a linear representation of the relationships between the variables.
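Not something you need to run, but if you wanted to explore the tobit route mentioned above, here is a minimal sketch. It assumes the AER package is installed, and the censoring point of 7,000 USD a month is purely hypothetical, chosen for illustration.
require(AER)
# hypothetical tobit model: treat monthly salaries at or below 7,000 USD as left-censored
tobit_3 <- tobit(monthly_salary ~ yrs.since.phd + gender + rank,
                 left = 7000, data = salaries)
summary(tobit_3)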
plot(lm_3, which=2)
Plot 2: Generally OK, but at the top end our model ‘derails’. There is also one extreme case (44); we could consider running the analysis with and without that case. Another potential solution is to log-transform our data. What happens then?
lm_3_log<-lm(log10(monthly_salary) ~ yrs.since.phd + gender + rank, data= salaries)
plot(lm_3_log, which=2)
This looks much nicer, does it not? But bear in mind that a shift in X now represents a shift in log10(Y), which makes the coefficients harder to interpret directly (though the distribution of the residuals is now less of a worry). A sketch of how to back-transform the coefficients follows below.
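If you do continue with the log model, you can back-transform: a coefficient b on the log10 scale means that a one-unit shift in the predictor multiplies the outcome by 10^b. A quick sketch, using the rank coefficient as an example:
# back-transform a log10 coefficient into a multiplicative effect on salary
10^coef(lm_3_log)["rankProf"]               # multiplicative effect of a full professorship
(10^coef(lm_3_log)["rankProf"] - 1) * 100   # the same effect, as a percentage change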
plot(lm_3, which=3)
Plot 3: Ideally, we would want a flat line here; that is not the case. Again, notice the issue with the gap in X. The ‘funnel’ widens as X increases, which points to issues with heteroscedasticity. The plot is slightly better for our log-transformed data. A solution is to work with heteroscedasticity-consistent standard errors (an example follows below).
plot(lm_3_log, which=3)
A Breusch-Pagan test shows that both models have issues with the assumption of homoscedasticity. (It also suggests that the problem is just as strong, if not stronger, in the ‘log’ model.)
require(lmtest)
## Loading required package: lmtest
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
bptest(lm_3)
##
## studentized Breusch-Pagan test
##
## data: lm_3
## BP = 59.457, df = 4, p-value = 3.772e-12
bptest(lm_3_log)
##
## studentized Breusch-Pagan test
##
## data: lm_3_log
## BP = 62.284, df = 4, p-value = 9.599e-13
Cook’s distance gives an indication of influential cases in the regression model. Some put the cut-off at 4/(N - k - 1) (check here), but it is best to also check the plots.
plot(lm_3, which=4)
Plot 4: The plot is largely fine. There are some influential cases (351, 283); perhaps one would ignore 331, as it is close to other peaks. One could run the analyses with and without those cases. Alternatively, one could bootstrap the relevant statistics if worried about robustness, or rely on ‘robust’ techniques (an example below).
Incidentally, in our transformed model 283 (and perhaps 351) are again potential ‘monkey wrenches’. You could again consider bootstrapping your models, or presenting models with and without those ‘outliers’.
plot(lm_3_log, which=4)
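If you prefer a numeric check over eyeballing the plot, here is a minimal sketch applying the 4/(N - k - 1) rule of thumb to the final model:
# flag cases whose Cook's distance exceeds the rule-of-thumb cut-off
cooks <- cooks.distance(lm_3)
cutoff <- 4 / (nobs(lm_3) - length(coef(lm_3)))  # equals 4/(N - k - 1) with k predictors
which(cooks > cutoff)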
Durbin-Watson tests suggest that both models exhibit autocorrelation.
require(lmtest)
dwtest(lm_3)
##
## Durbin-Watson test
##
## data: lm_3
## DW = 1.7452, p-value = 0.005363
## alternative hypothesis: true autocorrelation is greater than 0
dwtest(lm_3_log)
##
## Durbin-Watson test
##
## data: lm_3_log
## DW = 1.6538, p-value = 0.0002655
## alternative hypothesis: true autocorrelation is greater than 0
Not that I expect you to run this, but here is an example of a model that deals with this issue. Notice the difference it makes for the gender coefficient’s SE and significance level: whereas before we might have ruled out the existence of an effect, this model suggests a trend (still not strong evidence, though).
require(sandwich)
## Loading required package: sandwich
summary(lm_3)
##
## Call:
## lm(formula = monthly_salary ~ yrs.since.phd + gender + rank,
## data = salaries)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7470 -1704 -170 1351 11702
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8549.39 495.08 17.269 < 2e-16 ***
## yrs.since.phd -10.23 14.41 -0.710 0.47801
## genderMale 571.85 448.76 1.274 0.20332
## rankAssocProf 1556.98 482.53 3.227 0.00136 **
## rankProf 5292.88 490.30 10.795 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2626 on 392 degrees of freedom
## Multiple R-squared: 0.3973, Adjusted R-squared: 0.3912
## F-statistic: 64.61 on 4 and 392 DF, p-value: < 2.2e-16
vcv <- vcovHAC(lm_3) # variance covariance matrix (which we might want to store)
coeftest(lm_3, vcov = vcovHAC(lm_3))
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8549.387 302.314 28.2798 < 2.2e-16 ***
## yrs.since.phd -10.234 17.789 -0.5753 0.56543
## genderMale 571.845 321.820 1.7769 0.07636 .
## rankAssocProf 1556.978 270.749 5.7506 1.791e-08 ***
## rankProf 5292.878 404.665 13.0796 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Here we calculate the \(\beta\)’s for model 1. I used dplyr to z-score (and the mosaic package’s zscore() as an alternative, since scale() can run into problems).
require(dplyr)
require(mosaic)
## Loading required package: mosaic
## Registered S3 method overwritten by 'mosaic':
## method from
## fortify.SpatialPolygonsDataFrame ggplot2
##
## The 'mosaic' package masks several functions from core packages in order to add
## additional features. The original behavior of these functions should not be affected by this.
##
## Attaching package: 'mosaic'
## The following object is masked from 'package:Matrix':
##
## mean
## The following object is masked from 'package:ggplot2':
##
## stat
## The following object is masked from 'package:skimr':
##
## n_missing
## The following objects are masked from 'package:dplyr':
##
## count, do, tally
## The following objects are masked from 'package:stats':
##
## binom.test, cor, cor.test, cov, fivenum, IQR, median, prop.test,
## quantile, sd, t.test, var
## The following objects are masked from 'package:base':
##
## max, mean, min, prod, range, sample, sum
salaries <- salaries %>% mutate(monthly_salary_stand=zscore(monthly_salary), yrs.since.phd_stand=zscore(yrs.since.phd))
lm_stand1<- lm(monthly_salary_stand ~ yrs.since.phd_stand-1, data= salaries)
summary(lm_stand1)
##
## Call:
## lm(formula = monthly_salary_stand ~ yrs.since.phd_stand - 1,
## data = salaries)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.7789 -0.6415 -0.0944 0.5311 3.3802
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## yrs.since.phd_stand 0.41923 0.04562 9.189 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9079 on 396 degrees of freedom
## Multiple R-squared: 0.1758, Adjusted R-squared: 0.1737
## F-statistic: 84.44 on 1 and 396 DF, p-value: < 2.2e-16
# alternative
require(QuantPsyc)
## Loading required package: QuantPsyc
## Loading required package: boot
##
## Attaching package: 'boot'
## The following object is masked from 'package:mosaic':
##
## logit
## The following object is masked from 'package:lattice':
##
## melanoma
## Loading required package: purrr
##
## Attaching package: 'purrr'
## The following object is masked from 'package:mosaic':
##
## cross
## Loading required package: MASS
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
##
## Attaching package: 'QuantPsyc'
## The following object is masked from 'package:Matrix':
##
## norm
## The following object is masked from 'package:base':
##
## norm
lm.beta(lm_1)
## yrs.since.phd
## 0.4192311
The \(\beta\) is 0.42: a shift of one standard deviation in years since Ph.D. corresponds to a shift of .42 standard deviations in monthly salary. (Notice how this is the same value as your Pearson correlation coefficient, as it must be in a simple regression!)
# Bootstrap 95% CI for B coefficient.
library(boot)
# function to obtain the standardized coefficient (beta) from the data
coeff_B <- function(data, indices) {
data_boot <- data[indices, ] # allows boot to select sample
B_boot <- lm(monthly_salary_stand ~ yrs.since.phd_stand-1, data = data_boot)
return(B_boot$coeff[1]) # first (and only) coefficient: years since Ph.D.
}
# bootstrapping with 10000 replications
set.seed(1983) # slightly worse year.
results_coeff_B <- boot(data = salaries, statistic = coeff_B, R = 10000)
Now let us calculate the confidence interval.
plot(results_coeff_B)
# get 95% confidence interval
boot.ci(results_coeff_B, type='bca')
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 10000 bootstrap replicates
##
## CALL :
## boot.ci(boot.out = results_coeff_B, type = "bca")
##
## Intervals :
## Level BCa
## 95% ( 0.3085, 0.5205 )
## Calculations and Intervals on Original Scale
Sample write up: The 95% confidence interval for \(\beta\) using 10,000 samples and bias-corrected accelerated bootstrapping is [.31, .52]. (If you had done a bootstrap of the Pearson correlation coefficient you should get similar values).
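To check that claim, here is a minimal sketch that bootstraps the Pearson correlation with the same settings (reusing the boot package loaded above):
# bootstrap the Pearson correlation for comparison with the beta bootstrap
cor_pearson <- function(data, indices) {
  data_boot <- data[indices, ]
  cor(data_boot$yrs.since.phd, data_boot$monthly_salary)
}
set.seed(1983)
results_cor <- boot(data = salaries, statistic = cor_pearson, R = 10000)
boot.ci(results_cor, type = 'bca')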
Best to center everything (there is no point in centering categorical variables; note that we are calculating the shift relative to the reference categories: women and assistant professors). A shift of 3 years is associated with around $31 less a month, all else being equal. That is not very impressive, which is partly explained by the fact that academic rank does so much of the ‘heavy lifting’, and we are now looking at a shift away from the means (!). For a shift at the start of the career, you could use the initial model. A shift from assistant to associate professor implies another $1,557 a month (all else being equal). (For men, add another $572 a month (!))
require(dplyr)
# create centered variables
salaries<-dplyr::mutate(salaries, monthly_salary_cent=monthly_salary-mean(monthly_salary),yrs.since.phd_cent=yrs.since.phd-mean(yrs.since.phd))
# lm.
lm_3_cent<-lm(monthly_salary_cent ~ yrs.since.phd_cent + gender + rank, data=salaries)
summary(lm_3_cent)
##
## Call:
## lm(formula = monthly_salary_cent ~ yrs.since.phd_cent + gender +
## rank, data = salaries)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7470 -1704 -170 1351 11702
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4313.03 564.20 -7.645 1.63e-13 ***
## yrs.since.phd_cent -10.23 14.41 -0.710 0.47801
## genderMale 571.85 448.76 1.274 0.20332
## rankAssocProf 1556.98 482.53 3.227 0.00136 **
## rankProf 5292.88 490.30 10.795 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2626 on 392 degrees of freedom
## Multiple R-squared: 0.3973, Adjusted R-squared: 0.3912
## F-statistic: 64.61 on 4 and 392 DF, p-value: < 2.2e-16
# a shift of 3 applied to all coefficients; only the yrs.since.phd_cent entry is meaningful here
3*lm_3_cent$coefficients
## (Intercept) yrs.since.phd_cent genderMale rankAssocProf
## -12939.08800 -30.70138 1715.53631 4670.93252
## rankProf
## 15878.63255
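The same 3-year shift can also be read off via predict(), holding gender and rank at their reference categories. A sketch, assuming the level names from carData::Salaries (‘Female’ and ‘AsstProf’):
# predicted salary difference for a 3-year shift at the reference categories
newdata <- data.frame(yrs.since.phd_cent = c(0, 3),
                      gender = "Female", rank = "AsstProf")
diff(predict(lm_3_cent, newdata))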
require(caret)
## Loading required package: caret
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
## The following object is masked from 'package:mosaic':
##
## dotPlot
data(GermanCredit)
data_credit<-GermanCredit
require(skimr)
skim(data_credit)
Name | data_credit |
Number of rows | 1000 |
Number of columns | 62 |
_______________________ | |
Column type frequency: | |
factor | 1 |
numeric | 61 |
________________________ | |
Group variables | None |
Variable type: factor
skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
---|---|---|---|---|---|
Class | 0 | 1 | FALSE | 2 | Goo: 700, Bad: 300 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
Duration | 0 | 1 | 20.90 | 12.06 | 4 | 12.0 | 18.0 | 24.00 | 72 | ▇▇▂▁▁ |
Amount | 0 | 1 | 3271.26 | 2822.74 | 250 | 1365.5 | 2319.5 | 3972.25 | 18424 | ▇▂▁▁▁ |
InstallmentRatePercentage | 0 | 1 | 2.97 | 1.12 | 1 | 2.0 | 3.0 | 4.00 | 4 | ▂▃▁▂▇ |
ResidenceDuration | 0 | 1 | 2.85 | 1.10 | 1 | 2.0 | 3.0 | 4.00 | 4 | ▂▆▁▃▇ |
Age | 0 | 1 | 35.55 | 11.38 | 19 | 27.0 | 33.0 | 42.00 | 75 | ▇▆▃▁▁ |
NumberExistingCredits | 0 | 1 | 1.41 | 0.58 | 1 | 1.0 | 1.0 | 2.00 | 4 | ▇▅▁▁▁ |
NumberPeopleMaintenance | 0 | 1 | 1.16 | 0.36 | 1 | 1.0 | 1.0 | 1.00 | 2 | ▇▁▁▁▂ |
Telephone | 0 | 1 | 0.60 | 0.49 | 0 | 0.0 | 1.0 | 1.00 | 1 | ▆▁▁▁▇ |
ForeignWorker | 0 | 1 | 0.96 | 0.19 | 0 | 1.0 | 1.0 | 1.00 | 1 | ▁▁▁▁▇ |
CheckingAccountStatus.lt.0 | 0 | 1 | 0.27 | 0.45 | 0 | 0.0 | 0.0 | 1.00 | 1 | ▇▁▁▁▃ |
CheckingAccountStatus.0.to.200 | 0 | 1 | 0.27 | 0.44 | 0 | 0.0 | 0.0 | 1.00 | 1 | ▇▁▁▁▃ |
CheckingAccountStatus.gt.200 | 0 | 1 | 0.06 | 0.24 | 0 | 0.0 | 0.0 | 0.00 | 1 | ▇▁▁▁▁ |
CheckingAccountStatus.none | 0 | 1 | 0.39 | 0.49 | 0 | 0.0 | 0.0 | 1.00 | 1 | ▇▁▁▁▅ |
CreditHistory.NoCredit.AllPaid | 0 | 1 | 0.04 | 0.20 | 0 | 0.0 | 0.0 | 0.00 | 1 | ▇▁▁▁▁ |
CreditHistory.ThisBank.AllPaid | 0 | 1 | 0.05 | 0.22 | 0 | 0.0 | 0.0 | 0.00 | 1 | ▇▁▁▁▁ |
CreditHistory.PaidDuly | 0 | 1 | 0.53 | 0.50 | 0 | 0.0 | 1.0 | 1.00 | 1 | ▇▁▁▁▇ |
CreditHistory.Delay | 0 | 1 | 0.09 | 0.28 | 0 | 0.0 | 0.0 | 0.00 | 1 | ▇▁▁▁▁ |
CreditHistory.Critical | 0 | 1 | 0.29 | 0.46 | 0 | 0.0 | 0.0 | 1.00 | 1 | ▇▁▁▁▃ |
Purpose.NewCar | 0 | 1 | 0.23 | 0.42 | 0 | 0.0 | 0.0 | 0.00 | 1 | ▇▁▁▁▂ |
Purpose.UsedCar | 0 | 1 | 0.10 | 0.30 | 0 | 0.0 | 0.0 | 0.00 | 1 | ▇▁▁▁▁ |
Purpose.Furniture.Equipment | 0 | 1 | 0.18 | 0.39 | 0 | 0.0 | 0.0 | 0.00 | 1 | ▇▁▁▁▂ |
Purpose.Radio.Television | 0 | 1 | 0.28 | 0.45 | 0 | 0.0 | 0.0 | 1.00 | 1 | ▇▁▁▁▃ |
Purpose.DomesticAppliance | 0 | 1 | 0.01 | 0.11 | 0 | 0.0 | 0.0 | 0.00 | 1 | ▇▁▁▁▁ |
Purpose.Repairs | 0 | 1 | 0.02 | 0.15 | 0 | 0.0 | 0.0 | 0.00 | 1 | ▇▁▁▁▁ |
Purpose.Education | 0 | 1 | 0.05 | 0.22 | 0 | 0.0 | 0.0 | 0.00 | 1 | ▇▁▁▁▁ |
Purpose.Vacation | 0 | 1 | 0.00 | 0.00 | 0 | 0.0 | 0.0 | 0.00 | 0 | ▁▁▇▁▁ |
Purpose.Retraining | 0 | 1 | 0.01 | 0.09 | 0 | 0.0 | 0.0 | 0.00 | 1 | ▇▁▁▁▁ |
Purpose.Business | 0 | 1 | 0.10 | 0.30 | 0 | 0.0 | 0.0 | 0.00 | 1 | ▇▁▁▁▁ |
Purpose.Other | 0 | 1 | 0.01 | 0.11 | 0 | 0.0 | 0.0 | 0.00 | 1 | ▇▁▁▁▁ |
SavingsAccountBonds.lt.100 | 0 | 1 | 0.60 | 0.49 | 0 | 0.0 | 1.0 | 1.00 | 1 | ▅▁▁▁▇ |
SavingsAccountBonds.100.to.500 | 0 | 1 | 0.10 | 0.30 | 0 | 0.0 | 0.0 | 0.00 | 1 | ▇▁▁▁▁ |
SavingsAccountBonds.500.to.1000 | 0 | 1 | 0.06 | 0.24 | 0 | 0.0 | 0.0 | 0.00 | 1 | ▇▁▁▁▁ |
SavingsAccountBonds.gt.1000 | 0 | 1 | 0.05 | 0.21 | 0 | 0.0 | 0.0 | 0.00 | 1 | ▇▁▁▁▁ |
SavingsAccountBonds.Unknown | 0 | 1 | 0.18 | 0.39 | 0 | 0.0 | 0.0 | 0.00 | 1 | ▇▁▁▁▂ |
EmploymentDuration.lt.1 | 0 | 1 | 0.17 | 0.38 | 0 | 0.0 | 0.0 | 0.00 | 1 | ▇▁▁▁▂ |
EmploymentDuration.1.to.4 | 0 | 1 | 0.34 | 0.47 | 0 | 0.0 | 0.0 | 1.00 | 1 | ▇▁▁▁▅ |
EmploymentDuration.4.to.7 | 0 | 1 | 0.17 | 0.38 | 0 | 0.0 | 0.0 | 0.00 | 1 | ▇▁▁▁▂ |
EmploymentDuration.gt.7 | 0 | 1 | 0.25 | 0.43 | 0 | 0.0 | 0.0 | 1.00 | 1 | ▇▁▁▁▃ |
EmploymentDuration.Unemployed | 0 | 1 | 0.06 | 0.24 | 0 | 0.0 | 0.0 | 0.00 | 1 | ▇▁▁▁▁ |
Personal.Male.Divorced.Seperated | 0 | 1 | 0.05 | 0.22 | 0 | 0.0 | 0.0 | 0.00 | 1 | ▇▁▁▁▁ |
Personal.Female.NotSingle | 0 | 1 | 0.31 | 0.46 | 0 | 0.0 | 0.0 | 1.00 | 1 | ▇▁▁▁▃ |
Personal.Male.Single | 0 | 1 | 0.55 | 0.50 | 0 | 0.0 | 1.0 | 1.00 | 1 | ▆▁▁▁▇ |
Personal.Male.Married.Widowed | 0 | 1 | 0.09 | 0.29 | 0 | 0.0 | 0.0 | 0.00 | 1 | ▇▁▁▁▁ |
Personal.Female.Single | 0 | 1 | 0.00 | 0.00 | 0 | 0.0 | 0.0 | 0.00 | 0 | ▁▁▇▁▁ |
OtherDebtorsGuarantors.None | 0 | 1 | 0.91 | 0.29 | 0 | 1.0 | 1.0 | 1.00 | 1 | ▁▁▁▁▇ |
OtherDebtorsGuarantors.CoApplicant | 0 | 1 | 0.04 | 0.20 | 0 | 0.0 | 0.0 | 0.00 | 1 | ▇▁▁▁▁ |
OtherDebtorsGuarantors.Guarantor | 0 | 1 | 0.05 | 0.22 | 0 | 0.0 | 0.0 | 0.00 | 1 | ▇▁▁▁▁ |
Property.RealEstate | 0 | 1 | 0.28 | 0.45 | 0 | 0.0 | 0.0 | 1.00 | 1 | ▇▁▁▁▃ |
Property.Insurance | 0 | 1 | 0.23 | 0.42 | 0 | 0.0 | 0.0 | 0.00 | 1 | ▇▁▁▁▂ |
Property.CarOther | 0 | 1 | 0.33 | 0.47 | 0 | 0.0 | 0.0 | 1.00 | 1 | ▇▁▁▁▃ |
Property.Unknown | 0 | 1 | 0.15 | 0.36 | 0 | 0.0 | 0.0 | 0.00 | 1 | ▇▁▁▁▂ |
OtherInstallmentPlans.Bank | 0 | 1 | 0.14 | 0.35 | 0 | 0.0 | 0.0 | 0.00 | 1 | ▇▁▁▁▁ |
OtherInstallmentPlans.Stores | 0 | 1 | 0.05 | 0.21 | 0 | 0.0 | 0.0 | 0.00 | 1 | ▇▁▁▁▁ |
OtherInstallmentPlans.None | 0 | 1 | 0.81 | 0.39 | 0 | 1.0 | 1.0 | 1.00 | 1 | ▂▁▁▁▇ |
Housing.Rent | 0 | 1 | 0.18 | 0.38 | 0 | 0.0 | 0.0 | 0.00 | 1 | ▇▁▁▁▂ |
Housing.Own | 0 | 1 | 0.71 | 0.45 | 0 | 0.0 | 1.0 | 1.00 | 1 | ▃▁▁▁▇ |
Housing.ForFree | 0 | 1 | 0.11 | 0.31 | 0 | 0.0 | 0.0 | 0.00 | 1 | ▇▁▁▁▁ |
Job.UnemployedUnskilled | 0 | 1 | 0.02 | 0.15 | 0 | 0.0 | 0.0 | 0.00 | 1 | ▇▁▁▁▁ |
Job.UnskilledResident | 0 | 1 | 0.20 | 0.40 | 0 | 0.0 | 0.0 | 0.00 | 1 | ▇▁▁▁▂ |
Job.SkilledEmployee | 0 | 1 | 0.63 | 0.48 | 0 | 0.0 | 1.0 | 1.00 | 1 | ▅▁▁▁▇ |
Job.Management.SelfEmp.HighlyQualified | 0 | 1 | 0.15 | 0.36 | 0 | 0.0 | 0.0 | 0.00 | 1 | ▇▁▁▁▂ |
You could also use the ‘summarytools’ package.
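For instance, a one-line sketch (assuming summarytools is installed):
require(summarytools)
dfSummary(data_credit) # descriptive summary of the whole dataset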
You might have used different variables. Because they are all scored between 0 and 1, you could use them all in these plots.
require(apaTables)
require(dplyr)
cor_table<-dplyr::select(data_credit, c(Age,Amount,Telephone))
apa.cor.table(cor_table, filename = "correlations-credit.doc")
##
##
## Means, standard deviations, and correlations with confidence intervals
##
##
## Variable M SD 1 2
## 1. Age 35.55 11.38
##
## 2. Amount 3271.26 2822.74 .03
## [-.03, .09]
##
## 3. Telephone 0.60 0.49 -.15** -.28**
## [-.21, -.08] [-.33, -.22]
##
##
## Note. M and SD are used to represent mean and standard deviation, respectively.
## Values in square brackets indicate the 95% confidence interval.
## The confidence interval is a plausible range of population correlations
## that could have caused the sample correlation (Cumming, 2014).
## * indicates p < .05. ** indicates p < .01.
##
# ggcorrplot, you can then further tweak this, as it is a ggplot.
require(ggcorrplot)
## Loading required package: ggcorrplot
cormatrix<-cor(cor_table)
ggcorrplot(cormatrix, hc.order = TRUE, type = "lower", lab = TRUE)
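The exercise literally asks for corrplot(); here is a minimal sketch with the same correlation matrix, assuming the corrplot package is installed:
require(corrplot)
corrplot(cormatrix, method = "circle", type = "lower")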
We built 4 hierarchical models.
levels(data_credit$Class) # Bad is the reference category.
## [1] "Bad" "Good"
data_credit$Class_ref_good<-relevel(data_credit$Class, ref="Good") # recode, as this is perhaps more sensible if we want to know who will default.
log1<-glm(Class_ref_good~ Amount, data=data_credit, family = binomial)
log2<-glm(Class_ref_good~ Amount+CreditHistory.PaidDuly, data=data_credit, family = binomial)
log3<-glm(Class_ref_good~ Amount+CreditHistory.PaidDuly+Age, data=data_credit, family = binomial)
log4<-glm(Class_ref_good~ Amount+CreditHistory.PaidDuly+Age+Telephone, data=data_credit, family = binomial)
summary(log1)
##
## Call:
## glm(formula = Class_ref_good ~ Amount, family = binomial, data = data_credit)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.229e+00 1.083e-01 -11.348 < 2e-16 ***
## Amount 1.119e-04 2.355e-05 4.751 2.02e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1221.7 on 999 degrees of freedom
## Residual deviance: 1199.1 on 998 degrees of freedom
## AIC: 1203.1
##
## Number of Fisher Scoring iterations: 4
summary(log2)
##
## Call:
## glm(formula = Class_ref_good ~ Amount + CreditHistory.PaidDuly,
## family = binomial, data = data_credit)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.383e+00 1.390e-01 -9.949 < 2e-16 ***
## Amount 1.162e-04 2.376e-05 4.890 1.01e-06 ***
## CreditHistory.PaidDuly 2.579e-01 1.415e-01 1.823 0.0683 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1221.7 on 999 degrees of freedom
## Residual deviance: 1195.7 on 997 degrees of freedom
## AIC: 1201.7
##
## Number of Fisher Scoring iterations: 4
summary(log3)
##
## Call:
## glm(formula = Class_ref_good ~ Amount + CreditHistory.PaidDuly +
## Age, family = binomial, data = data_credit)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.000e-01 2.730e-01 -2.564 0.01035 *
## Amount 1.194e-04 2.394e-05 4.988 6.11e-07 ***
## CreditHistory.PaidDuly 1.975e-01 1.436e-01 1.375 0.16917
## Age -1.889e-02 6.623e-03 -2.852 0.00434 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1221.7 on 999 degrees of freedom
## Residual deviance: 1187.2 on 996 degrees of freedom
## AIC: 1195.2
##
## Number of Fisher Scoring iterations: 4
summary(log4)
##
## Call:
## glm(formula = Class_ref_good ~ Amount + CreditHistory.PaidDuly +
## Age + Telephone, family = binomial, data = data_credit)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.030e+00 3.123e-01 -3.298 0.000972 ***
## Amount 1.357e-04 2.518e-05 5.388 7.14e-08 ***
## CreditHistory.PaidDuly 1.942e-01 1.440e-01 1.349 0.177475
## Age -1.686e-02 6.659e-03 -2.531 0.011359 *
## Telephone 3.383e-01 1.543e-01 2.193 0.028315 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1221.7 on 999 degrees of freedom
## Residual deviance: 1182.3 on 995 degrees of freedom
## AIC: 1192.3
##
## Number of Fisher Scoring iterations: 4
This exports the results. Note that the odds ratio for Amount is close to 1. This is because the base unit is 1 DM (Deutsche Mark), which might not make a lot of sense! (Rescale, e.g. via log10 or standardizing; see the sketch below.)
require(stargazer)
stargazer(log1,log2,log3,log4,dep.var.labels=c("Credit status (reference= Good)"), covariate.labels=c("Amount (in DM)", "Credit History (Duly paid)","Age", "Telephone (yes/no)"), report = "vc*", omit=c("Constant"), apply.coef=exp, t.auto=F, p.auto=F, style="asr", out="logistic_base_exercise4.htm", header=FALSE)
##
## \begin{table}[!htbp] \centering
## \caption{}
## \label{}
## \begin{tabular}{@{\extracolsep{5pt}}lcccc}
## \\[-1.8ex]\hline \\[-1.8ex]
## \\[-1.8ex] & \multicolumn{4}{c}{Credit status (reference= Good)} \\
## \\[-1.8ex] & (1) & (2) & (3) & (4)\\
## \hline \\[-1.8ex]
## Amount (in DM) & 1.000$^{***}$ & 1.000$^{***}$ & 1.000$^{***}$ & 1.000$^{***}$ \\
## & & & & \\
## Credit History (Duly paid) & & 1.294 & 1.218 & 1.214 \\
## & & & & \\
## Age & & & 0.981$^{**}$ & 0.983$^{*}$ \\
## & & & & \\
## Telephone (yes/no) & & & & 1.402$^{*}$ \\
## & & & & \\
## \textit{N} & 1,000 & 1,000 & 1,000 & 1,000 \\
## Log Likelihood & $-$599.532 & $-$597.860 & $-$593.606 & $-$591.163 \\
## AIC & 1,203.064 & 1,201.721 & 1,195.213 & 1,192.326 \\
## \hline \\[-1.8ex]
## \multicolumn{5}{l}{$^{*}$p $<$ .05; $^{**}$p $<$ .01; $^{***}$p $<$ .001} \\
## \end{tabular}
## \end{table}
Note that a value <1 means that the odds become lower; for ease of interpretation, one would take the inverse of the odds ratio (see the sketch below).
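A sketch of both points (not part of the exercise): refit model 4 with Amount expressed in units of 1,000 DM so its odds ratio becomes readable, and invert the Age odds ratio so it reads per year younger.
# hypothetical refit: Amount rescaled to 1,000-DM units
log4_k <- glm(Class_ref_good ~ I(Amount/1000) + CreditHistory.PaidDuly + Age + Telephone,
              data = data_credit, family = binomial)
exp(coef(log4_k)) # odds ratios; the one for Amount is now per 1,000 DM
1/exp(coef(log4)["Age"]) # inverted odds ratio for Age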
Here are the Pseudo- \(R^2\) values.
require(rcompanion)
## Loading required package: rcompanion
nagelkerke(log1, restrictNobs=T)
## $Models
##
## Model: "glm, Class_ref_good ~ Amount, binomial, data_credit"
## Null: "glm, Class_ref_good ~ 1, binomial, fit$model"
##
## $Pseudo.R.squared.for.model.vs.null
## Pseudo.R.squared
## McFadden 0.0185513
## Cox and Snell (ML) 0.0224098
## Nagelkerke (Cragg and Uhler) 0.0317743
##
## $Likelihood.ratio.test
## Df.diff LogLik.diff Chisq p.value
## -1 -11.332 22.665 1.9288e-06
##
## $Number.of.observations
##
## Model: 1000
## Null: 1000
##
## $Messages
## [1] "Note: For models fit with REML, these statistics are based on refitting with ML"
##
## $Warnings
## [1] "None"
nagelkerke(log2, restrictNobs=T)
## $Models
##
## Model: "glm, Class_ref_good ~ Amount + CreditHistory.PaidDuly, binomial, data_credit"
## Null: "glm, Class_ref_good ~ 1, binomial, fit$model"
##
## $Pseudo.R.squared.for.model.vs.null
## Pseudo.R.squared
## McFadden 0.0212879
## Cox and Snell (ML) 0.0256727
## Nagelkerke (Cragg and Uhler) 0.0364008
##
## $Likelihood.ratio.test
## Df.diff LogLik.diff Chisq p.value
## -2 -13.004 26.008 2.2513e-06
##
## $Number.of.observations
##
## Model: 1000
## Null: 1000
##
## $Messages
## [1] "Note: For models fit with REML, these statistics are based on refitting with ML"
##
## $Warnings
## [1] "None"
nagelkerke(log3, restrictNobs=T)
## $Models
##
## Model: "glm, Class_ref_good ~ Amount + CreditHistory.PaidDuly + Age, binomial, data_credit"
## Null: "glm, Class_ref_good ~ 1, binomial, fit$model"
##
## $Pseudo.R.squared.for.model.vs.null
## Pseudo.R.squared
## McFadden 0.0282516
## Cox and Snell (ML) 0.0339269
## Nagelkerke (Cragg and Uhler) 0.0481042
##
## $Likelihood.ratio.test
## Df.diff LogLik.diff Chisq p.value
## -3 -17.258 34.516 1.5417e-07
##
## $Number.of.observations
##
## Model: 1000
## Null: 1000
##
## $Messages
## [1] "Note: For models fit with REML, these statistics are based on refitting with ML"
##
## $Warnings
## [1] "None"
nagelkerke(log4, restrictNobs=T)
## $Models
##
## Model: "glm, Class_ref_good ~ Amount + CreditHistory.PaidDuly + Age + Telephone, binomial, data_credit"
## Null: "glm, Class_ref_good ~ 1, binomial, fit$model"
##
## $Pseudo.R.squared.for.model.vs.null
## Pseudo.R.squared
## McFadden 0.0322516
## Cox and Snell (ML) 0.0386365
## Nagelkerke (Cragg and Uhler) 0.0547819
##
## $Likelihood.ratio.test
## Df.diff LogLik.diff Chisq p.value
## -4 -19.701 39.403 5.7518e-08
##
## $Number.of.observations
##
## Model: 1000
## Null: 1000
##
## $Messages
## [1] "Note: For models fit with REML, these statistics are based on refitting with ML"
##
## $Warnings
## [1] "None"
Here is an example of a mosaic plot. It becomes a bit more difficult with more variables added.
require(RColorBrewer)
## Loading required package: RColorBrewer
# pick some colours.
colours<-c(color = brewer.pal(2, "Set2"))
## Warning in brewer.pal(2, "Set2"): minimal value for n is 3, returning requested palette with 3 different levels
fill_colours<-rep(colours, each=2) # 4 quadrants to fill
require(vcd)
## Loading required package: vcd
## Loading required package: grid
##
## Attaching package: 'vcd'
## The following object is masked from 'package:mosaic':
##
## mplot
mosaicplot1<-mosaic(~ Class_ref_good + Telephone, zero_size = 0, data=data_credit, gp = gpar(fill = fill_colours, col = 0), legend=F, labeling = labeling_values)
Another example, now with rental status and owning a telephone. You can further tinker with settings and variables. You can even have tests included; explore the vcd package vignette for more!
require(RColorBrewer)
# pick some colours.
colours<-c(color = brewer.pal(2, "Set2"))
## Warning in brewer.pal(2, "Set2"): minimal value for n is 3, returning requested palette with 3 different levels
fill_colours<-rep(colours, each=4) # enough fills for the 2 x 2 x 2 table
require(vcd)
mosaicplot2<-mosaic(~ Class_ref_good + Housing.Rent + Telephone, zero_size = 0, data=data_credit, gp = gpar(fill = fill_colours, col = 0), legend=F, labeling = labeling_values)
A shift of 1,000 DM increases the odds of bad credit (rather than good credit) by a factor of 1.145, keeping everything else constant. (Note that this estimate applies to people of average age who do not own a phone and did not pay duly.)
require(oddsratio)
## Loading required package: oddsratio
require(dplyr)
require(questionr)
## Loading required package: questionr
##
## Attaching package: 'questionr'
## The following object is masked from 'package:mosaic':
##
## prop
data_credit<- data_credit %>% mutate(Age_cent=Age-mean(Age), Amount_cent= Amount-mean(Amount))
log_4_cent<- glm(Class_ref_good~ Amount_cent+CreditHistory.PaidDuly+Age_cent+Telephone, data=data_credit, family = binomial)
odds.ratio(log_4_cent)
## Waiting for profiling to be done...
## OR 2.5 % 97.5 % p
## (Intercept) 0.30559 0.22964 0.4027 < 2.2e-16 ***
## Amount_cent 1.00014 1.00009 1.0002 7.142e-08 ***
## CreditHistory.PaidDuly 1.21432 0.91630 1.6119 0.17747
## Age_cent 0.98328 0.97031 0.9960 0.01136 *
## Telephone 1.40250 1.03885 1.9027 0.02832 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# shift of 1,000 DM -- everything else put to 0
or_glm(data_credit, incr= list(Amount_cent=1000, CreditHistory.PaidDuly=0, Age_cent=0, Telephone=0), log_4_cent)
## predictor oddsratio ci_low (2.5) ci_high (97.5) increment
## 1 Amount_cent 1.145 1.091 1.204 1000
## 2 CreditHistory.PaidDuly 1.000 1.000 1.000 0
## 3 Age_cent 1.000 1.000 1.000 0
## 4 Telephone 1.000 1.000 1.000 0
# alternative, which gives the same core result (the other coefficients are not interpretable this way)
exp(log_4_cent$coefficients*1000)
## (Intercept) Amount_cent CreditHistory.PaidDuly
## 0.000000e+00 1.145310e+00 2.147984e+84
## Age_cent Telephone
## 4.774967e-08 7.982351e+146
Faceted by housing rental status. The plot could be further improved by relabeling the 0’s and 1’s.
You might run into errors (see this). The solution is to load the ‘devtools’ package and install the GitHub version of ggmosaic; this line will do the trick: install_github(‘cran/ggmosaic’).
require(ggmosaic) # install development version, as ggplot2 came with major updates breaking old ggmosaic.
## Loading required package: ggmosaic
##
## Attaching package: 'ggmosaic'
## The following objects are masked from 'package:vcd':
##
## mosaic, spine
require(ggthemes)
## Loading required package: ggthemes
##
## Attaching package: 'ggthemes'
## The following object is masked from 'package:mosaic':
##
## theme_map
ggplot(data=data_credit) +
geom_mosaic(aes(x=product(Class_ref_good,Telephone), fill=Class_ref_good)) + facet_grid(Housing.Rent~.) + labs(x='Telephone') + scale_fill_brewer(palette="Dark2") + theme_gdocs()
## Warning: `unite_()` was deprecated in tidyr 1.2.0.
## ℹ Please use `unite()` instead.
## ℹ The deprecated feature was likely used in the ggmosaic package.
## Please report the issue at <https://github.com/haleyjeppson/ggmosaic>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
ROC curves are super cool and are used across a broad range of sciences. You can read up on them in the Agresti reference from the reading list; you can also read more in the reading lists or here. Anyway, have a look at what’s out there and have a go at making one yourself.
require(pROC)
## Loading required package: pROC
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:mosaic':
##
## cov, var
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
resRoc <- roc(data_credit$Class_ref_good ~ log4$fitted)
## Setting levels: control = Good, case = Bad
## Setting direction: controls < cases
plot(resRoc, legacy.axes = TRUE)
auc(resRoc)
## Area under the curve: 0.616
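Two things you might add, sketched here with pROC functions: a confidence interval for the AUC, and the threshold that maximizes Youden’s J.
ci.auc(resRoc) # 95% CI for the AUC
coords(resRoc, "best", best.method = "youden") # optimal threshold by Youden's J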
sessionInfo()
## R version 4.3.2 (2023-10-31)
## Platform: aarch64-apple-darwin20 (64-bit)
## Running under: macOS Ventura 13.4
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.11.0
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: Europe/London
## tzcode source: internal
##
## attached base packages:
## [1] grid stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] pROC_1.18.5 ggthemes_4.2.4 ggmosaic_0.3.3 questionr_0.7.8
## [5] oddsratio_2.0.1 vcd_1.4-11 RColorBrewer_1.1-3 rcompanion_2.4.34
## [9] ggcorrplot_0.1.4.1 caret_6.0-94 QuantPsyc_1.6 MASS_7.3-60
## [13] purrr_1.0.2 boot_1.3-28.1 mosaic_1.9.0 mosaicData_0.20.4
## [17] ggformula_0.12.0 Matrix_1.6-1.1 ggplot2_3.4.4 lattice_0.21-9
## [21] sandwich_3.0-2 lmtest_0.9-40 zoo_1.8-12 apaTables_2.0.8
## [25] stargazer_5.2.3 skimr_2.1.5 dplyr_1.1.3 carData_3.0-5
##
## loaded via a namespace (and not attached):
## [1] libcoin_1.0-10 rstudioapi_0.15.0 jsonlite_1.8.7
## [4] magrittr_2.0.3 TH.data_1.1-2 modeltools_0.2-23
## [7] farver_2.1.1 rmarkdown_2.25 vctrs_0.6.4
## [10] base64enc_0.1-3 htmltools_0.5.7 forcats_1.0.0
## [13] haven_2.5.3 broom_1.0.5 cellranger_1.1.0
## [16] sass_0.4.7 parallelly_1.36.0 bslib_0.5.1
## [19] htmlwidgets_1.6.2 plyr_1.8.9 plotly_4.10.3
## [22] rootSolve_1.8.2.4 lubridate_1.9.3 cachem_1.0.8
## [25] mime_0.12 lifecycle_1.0.3 iterators_1.0.14
## [28] pkgconfig_2.0.3 R6_2.5.1 fastmap_1.1.1
## [31] shiny_1.7.5.1 future_1.33.0 digest_0.6.33
## [34] Exact_3.2 colorspace_2.1-0 productplots_0.1.1
## [37] labeling_0.4.3 fansi_1.0.5 timechange_0.2.0
## [40] httr_1.4.7 mgcv_1.9-0 compiler_4.3.2
## [43] proxy_0.4-27 withr_2.5.2 backports_1.4.1
## [46] highr_0.10 lava_1.7.3 gld_2.6.6
## [49] ModelMetrics_1.2.2.2 tools_4.3.2 httpuv_1.6.12
## [52] future.apply_1.11.1 nnet_7.3-19 glue_1.6.2
## [55] promises_1.2.1 nlme_3.1-163 reshape2_1.4.4
## [58] generics_0.1.3 recipes_1.0.9 gtable_0.3.4
## [61] MBESS_4.9.3 nortest_1.0-4 labelled_2.12.0
## [64] class_7.3-22 tidyr_1.3.0 data.table_1.14.8
## [67] lmom_3.0 hms_1.1.3 coin_1.4-3
## [70] utf8_1.2.4 ggrepel_0.9.4 foreach_1.5.2
## [73] pillar_1.9.0 stringr_1.5.0 later_1.3.1
## [76] splines_4.3.2 survival_3.5-7 tidyselect_1.2.0
## [79] miniUI_0.1.1.1 knitr_1.45 stats4_4.3.2
## [82] xfun_0.41 expm_0.999-9 hardhat_1.3.0
## [85] mosaicCore_0.9.4.0 timeDate_4022.108 matrixStats_1.0.0
## [88] stringi_1.7.12 lazyeval_0.2.2 yaml_2.3.7
## [91] evaluate_0.23 codetools_0.2-19 tibble_3.2.1
## [94] multcompView_0.1-9 cli_3.6.1 rpart_4.1.21
## [97] xtable_1.8-4 DescTools_0.99.52 repr_1.1.6
## [100] munsell_0.5.0 jquerylib_0.1.4 Rcpp_1.0.11
## [103] readxl_1.4.3 globals_0.16.2 parallel_4.3.2
## [106] ellipsis_0.3.2 gower_1.0.1 listenv_0.9.0
## [109] viridisLite_0.4.2 mvtnorm_1.2-3 ipred_0.9-14
## [112] scales_1.2.1 prodlim_2023.08.28 ggridges_0.5.5
## [115] e1071_1.7-13 crayon_1.5.2 rlang_1.1.1
## [118] multcomp_1.4-25