This article walks through how to plot a logistic regression curve in R. Logistic regression is a statistical model that uses the logistic function to model a binary dependent variable, and it is widely used in predictive analytics and machine learning.
Understanding the Data
For the purpose of this tutorial, we will be using a built-in R dataset called ‘mtcars’. This dataset is widely used for examples and teaching. It includes data on 32 models of car, with 11 different characteristics for each model. For logistic regression, we’ll predict whether a car has an automatic or manual transmission (column ‘am’) based on its horsepower (column ‘hp’).
Let’s load our data and take a quick look:
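The loading step can be sketched as follows, assuming a fresh R session. The `mtcars` dataset ships with base R, so no extra packages are needed; restricting the preview to the `am` and `hp` columns is just a convenience for this tutorial.

```r
# Load the built-in dataset (part of base R's 'datasets' package).
data(mtcars)

# Preview the two columns used in this tutorial.
head(mtcars[, c("am", "hp")])

# Count automatic (0) and manual (1) cars.
table(mtcars$am)
```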
With the data loaded, the next step is to prepare it for the logistic regression model.
In this case, the ‘am’ variable is already in binary format (0 for automatic, 1 for manual), which is suitable for logistic regression. For the predictor variable ‘hp’, we will rescale it to have a mean of zero and a standard deviation of one. This process is known as standardization or Z-score normalization. Standardizing is not required for logistic regression, but it means the model coefficient describes the effect of a one-standard-deviation change in horsepower.
You can standardize ‘hp’ using the following code:
mtcars$hp <- scale(mtcars$hp)[,1]
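To see what `scale()` is doing here, the sketch below computes the z-scores by hand on a fresh copy of the raw column and compares them with the `scale()` result. The names `hp_raw`, `hp_manual`, and `hp_scaled` are illustrative and not part of the tutorial's code.

```r
# Use the untouched raw column so this check is independent of earlier steps.
hp_raw <- datasets::mtcars$hp

# Z-score by hand: subtract the mean, divide by the sample standard deviation.
hp_manual <- (hp_raw - mean(hp_raw)) / sd(hp_raw)

# scale() returns a one-column matrix; [, 1] extracts it as a plain vector.
hp_scaled <- scale(hp_raw)[, 1]

all.equal(hp_manual, hp_scaled)  # the two approaches should agree
mean(hp_scaled)                  # approximately 0, up to floating-point error
sd(hp_scaled)                    # approximately 1
```

Extracting `[, 1]` matters because storing the matrix returned by `scale()` directly in a data frame column can cause surprises in later steps.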
Building the Logistic Regression Model
Now we will build our logistic regression model using the ‘glm’ function in R. We specify the binomial family to perform logistic regression.
logit_model <- glm(am ~ hp, data = mtcars, family = binomial)
summary(logit_model)
The summary function provides a detailed overview of the model. Pay particular attention to the coefficient of the predictor variable ‘hp’, which indicates how much a one-unit change in ‘hp’ changes the log-odds of the response variable ‘am’. Because ‘hp’ has been standardized, one unit here corresponds to one standard deviation of horsepower.
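Log-odds are hard to read directly, so it can help to exponentiate the coefficients into odds ratios. The sketch below refits the model on a fresh copy of the data so that it runs on its own; the names `cars` and `fit` are illustrative.

```r
# Refit on a fresh copy so this block is self-contained.
cars <- datasets::mtcars
cars$hp <- scale(cars$hp)[, 1]
fit <- glm(am ~ hp, data = cars, family = binomial)

# Coefficients on the log-odds scale.
coef(fit)

# Odds ratios: the multiplicative change in the odds of a manual
# transmission for a one-standard-deviation increase in horsepower.
exp(coef(fit))

# Coefficient table with standard errors and p-values.
coef(summary(fit))
```

An odds ratio below 1 means the odds of a manual transmission shrink as horsepower rises; above 1 means they grow.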
Plotting the Logistic Regression Curve
Now we will plot the logistic regression curve using ggplot2. The curve is a plot of the predicted probabilities of the logistic regression model.
First, we’ll create a new data frame for prediction. This data frame contains equally spaced values of ‘hp’ over its range.
newdata <- data.frame(hp = seq(min(mtcars$hp), max(mtcars$hp), length.out = 100))
Then, we’ll predict the probabilities using the logistic regression model.
newdata$am <- predict(logit_model, newdata = newdata, type = "response")
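Setting `type = "response"` asks `predict()` for probabilities rather than log-odds. As a rough illustration of what happens underneath, the sketch below computes the log-odds first and passes them through the logistic function `plogis()`. The names `cars`, `fit`, `grid`, `log_odds`, and `prob_manual` are illustrative.

```r
# Self-contained setup mirroring the tutorial's model.
cars <- datasets::mtcars
cars$hp <- scale(cars$hp)[, 1]
fit <- glm(am ~ hp, data = cars, family = binomial)
grid <- data.frame(hp = seq(min(cars$hp), max(cars$hp), length.out = 100))

# Linear predictor (log-odds) at each grid value.
log_odds <- predict(fit, newdata = grid, type = "link")

# plogis() is the logistic function 1 / (1 + exp(-x)).
prob_manual <- plogis(log_odds)

# Should match the probabilities returned by type = "response".
all.equal(prob_manual, predict(fit, newdata = grid, type = "response"))
```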
Next, we’ll plot the logistic regression curve:
library(ggplot2)
ggplot(mtcars, aes(x = hp, y = am)) +
  geom_point() +
  geom_line(data = newdata, aes(y = am), color = 'blue') +
  labs(x = 'Horsepower (standardized)', y = 'Probability of Manual Transmission') +
  theme_minimal()
The geom_point function adds the observed values (as points), and the geom_line function adds the predicted probabilities (as a line). We’ve also labeled the axes and applied a minimal theme to the plot.
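As an alternative to building the prediction data frame by hand, ggplot2 can fit and draw the curve itself with `geom_smooth()`. The sketch below is one possible variant, not a replacement for the approach above; it assumes ggplot2 is installed and refits the model internally. The names `cars` and `p` are illustrative.

```r
library(ggplot2)

# Fresh standardized copy so this block runs on its own.
cars <- datasets::mtcars
cars$hp <- scale(cars$hp)[, 1]

# geom_smooth() fits the binomial GLM internally and draws the fitted curve;
# se = TRUE adds a confidence ribbon around it.
p <- ggplot(cars, aes(x = hp, y = am)) +
  geom_point() +
  geom_smooth(method = "glm", formula = y ~ x,
              method.args = list(family = binomial), se = TRUE) +
  labs(x = "Horsepower (standardized)",
       y = "Probability of Manual Transmission") +
  theme_minimal()

print(p)
```

The manual `predict()` route gives more control over the prediction grid, while `geom_smooth()` is shorter and adds an uncertainty band without extra code.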
Interpreting the Logistic Regression Curve
In the plot, the x-axis represents the standardized horsepower and the y-axis represents the probability that a car has a manual transmission.
The blue line represents the fitted logistic regression curve. The logistic function is S-shaped overall, but only part of the S may be visible over the observed range of horsepower, so the line can look like a gentle curve rather than a full sigmoid. It shows how the probability of having a manual transmission (versus an automatic transmission) changes as horsepower increases.
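From the plot, you can see that the probability of having a manual transmission decreases as horsepower increases. This is consistent with the negative coefficient of ‘hp’ in the logistic regression model. With only 32 cars the estimate is imprecise, so check the standard error and p-value in the model summary before reading much into the trend.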
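The direction of the trend can also be checked numerically rather than read off the plot. The sketch below, which refits the model so that it runs on its own, prints the slope and compares the predicted probabilities at the lowest and highest observed horsepower. The names `cars`, `fit`, `ends`, and `p_ends` are illustrative.

```r
# Self-contained setup mirroring the tutorial's model.
cars <- datasets::mtcars
cars$hp <- scale(cars$hp)[, 1]
fit <- glm(am ~ hp, data = cars, family = binomial)

# Sign of the slope: negative means the probability of a manual
# transmission falls as horsepower rises.
coef(fit)[["hp"]]

# Predicted probability of a manual transmission at the lowest and
# highest observed (standardized) horsepower.
ends <- data.frame(hp = range(cars$hp))
p_ends <- predict(fit, newdata = ends, type = "response")
p_ends
```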
In this article, we’ve walked through how to plot a logistic regression curve in R using the ggplot2 package. We used a built-in R dataset ‘mtcars’ to demonstrate how to build a logistic regression model and predict probabilities. We also discussed how to interpret the logistic regression curve.
Logistic regression is a powerful tool for understanding the relationship between binary outcomes and predictor variables, and this type of visualization is a useful way to communicate these insights.