Chapter 4 Logistic Regression
4.1 TL;DR
- What it does
- Models the probability that an observation falls into one of two categories
- When to do it
- When you are trying to predict a categorical variable with two possible outcomes, like true/false, instead of something continuous like in linear regression
- How to do it
- With the glm() function, specifying family = "binomial"
- How to assess it
- Look for a significant \(p\)-value for the predictor, and assess the model's accuracy against held-out test data
4.2 What it does
Logistic regression is used to predict whether an observation falls into one of two categories. This is an alternative to linear regression, which predicts a continuous outcome. Logistic regression produces a model that gives the probability of Y given X.

Figure 4.1: Linear regression plot (source: StatQuest)
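To make "the probability of Y given X" concrete: under the hood, logistic regression passes a linear predictor through the logistic (inverse-logit) function to get a probability between 0 and 1. Below is a minimal sketch of that function in R; the coefficient values are made up here, chosen only to roughly match the fit shown later in this chapter:
# Inverse-logit: maps a linear predictor (log-odds) onto a probability in (0, 1)
inv_logit <- function(x) 1 / (1 + exp(-x))

# Hypothetical intercept and slope, for illustration only
b0 <- -15.8
b1 <- 29.4

inv_logit(b0 + b1 * 0.6)  # about 0.86 when the predictor is 0.6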
4.3 When to do it
As the TL;DR says: When you are trying to predict a categorical variable with two possible outcomes, like true/false, instead of something continuous.
4.4 How to do it
First, a note: logistic regression will give more trustworthy results when the model is fit on training data and then tested against held-out data, which helps guard against overfitting. This is not required, but it is recommended where possible.
Before splitting the data, fit a model on the full data set using glm(), making sure that we have a categorical response variable to predict:
library(MASS)    # for the Boston data set (also available in the ISLR2 package)
library(dplyr)   # for %>% and mutate()

data(Boston)

boston <- Boston %>%
  mutate(
    # Create the categorical crim_above_med response variable
    crim_above_med = as.factor(ifelse(crim > median(crim), "Yes", "No")),
    # Also make a numeric version of crim_above_med for the plot
    crim_above_med_num = ifelse(crim > median(crim), 1, 0)
  )

boston.fits <- glm(crim_above_med ~ nox, data = boston, family = binomial)

summary(boston.fits)
##
## Call:
## glm(formula = crim_above_med ~ nox, family = binomial, data = boston)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.27324 -0.37245 -0.06847 0.39620 2.53124
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -15.818 1.386 -11.41 <2e-16 ***
## nox 29.365 2.599 11.30 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 701.46 on 505 degrees of freedom
## Residual deviance: 320.39 on 504 degrees of freedom
## AIC: 324.39
##
## Number of Fisher Scoring iterations: 6
We can see that the \(p\)-value for the nox variable is nearly 0, suggesting a strong association between nox and crim_above_med.
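If you prefer to pull the \(p\)-values out programmatically rather than read them off the printed summary, coef() applied to the summary returns the coefficient table as a matrix (a small sketch using the fit above):
# The "Pr(>|z|)" column of the coefficient table holds the p-values
coef(summary(boston.fits))[, "Pr(>|z|)"]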
Now split the data into training and test sets, using, say, 80% of the data for training:
set.seed(1235)

boston.training <- rep(FALSE, nrow(boston))
boston.training[sample(nrow(boston), nrow(boston) * 0.8)] <- TRUE
boston.test <- !boston.training
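As a quick optional sanity check (an addition here, not part of the original walkthrough), the logical vectors can be summed to confirm the split is roughly 80/20; with 506 rows this should come out to 404 training and 102 test observations:
sum(boston.training)  # number of rows used for training
sum(boston.test)      # number of held-out test rows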
Fit the above model to the training data:
boston.fits <-
  glm(
    crim_above_med ~ nox,
    data = boston,
    subset = boston.training,
    family = binomial
  )
And use the fit to predict the test data with predict():
boston.probs <- predict(boston.fits, boston[boston.test, ], type = "response")
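Note that predict() on a binomial glm() returns log-odds (the link scale) by default; type = "response" is what converts them to probabilities. A small illustrative comparison using the objects above (the exact numbers depend on the random split):
# Log-odds (default link scale) vs. probabilities (response scale) for the same test rows
head(predict(boston.fits, boston[boston.test, ], type = "link"))
head(boston.probs)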
4.5 How to assess it
Assess the accuracy of the predictions by building a confusion matrix, using the same levels (“No” and “Yes” in this case) as the outcome variable we want to predict:
boston.pred <- rep("No", sum(boston.test))
boston.pred[boston.probs > 0.5] <- "Yes"

table(boston.pred, boston[boston.test, ]$crim_above_med)
##
## boston.pred No Yes
##         No  36  12
##         Yes  5  49
The above table shows that the model correctly predicted crim_above_med from nox 85 times out of 102, with 5 false positives and 12 false negatives. The accuracy of the model can be computed via the mean() of how often the predictions match the test data, a neat trick:
mean(boston.pred == boston[boston.test,]$crim_above_med)
## [1] 0.8333333
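As a rough benchmark (an addition here, not something the chapter computes), that accuracy can be compared against simply always predicting the more common class in the test set; the logistic model should beat this comfortably:
# Accuracy of always guessing the majority class (the "no-information rate")
max(table(boston[boston.test, ]$crim_above_med)) / sum(boston.test)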
The regression can be plotted with ggplot2 and the stat_smooth() function, using the glm method, like so:
library(ggplot2)

ggplot(boston, aes(x = nox, y = crim_above_med_num)) +
  geom_point(alpha = .5) +
  stat_smooth(
    method = "glm",
    formula = y ~ x,
    se = FALSE,
    method.args = list(family = binomial),
    col = "red",
    lty = 2
  )
This shows that an observation with a nox value of 0.6 would have around a 90% probability of crim_above_med being true, or 1, while an observation with a nox value of 0.5 would only have around a 25% probability of crim_above_med being true.
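Those read-off values can also be checked directly against the fitted model with predict() (the exact probabilities will vary a little, since boston.fits was fit on the random training split while the plot uses the full data):
# Predicted probabilities of crim_above_med = "Yes" at nox = 0.5 and nox = 0.6
predict(boston.fits, data.frame(nox = c(0.5, 0.6)), type = "response")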
4.6 Where to learn more
- Chapter 4 - 4.4 in James et al. (2021)
- StatQuest: Logistic Regression
4.7 Notes
As with linear regression, the “multiple” flavor of logistic regression works much the same way, with additional predictors specified in the formula, and so doesn’t really warrant a whole chapter of its own (a minimal sketch follows below). The selection of variables requires some effort, and is addressed under best subset selection.
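For reference, here is that sketch of the multiple flavor, using two extra Boston predictors (dis and rad) chosen arbitrarily for illustration rather than by any selection procedure:
# Same pattern as before, with additional predictors in the formula
boston.fits.multi <-
  glm(
    crim_above_med ~ nox + dis + rad,
    data = boston,
    subset = boston.training,
    family = binomial
  )

summary(boston.fits.multi)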