
Introduction
Linear regression is one of the fundamental statistical and machine learning models. It is a statistical approach that models the relationship between a dependent variable (also known as the ‘outcome’, ‘target’, or ‘response’ variable) and one or more independent variables (also known as ‘predictors’, ‘covariates’, or ‘features’).
If there is only one predictor, the model is called simple linear regression. If there is more than one predictor, the model is known as multiple linear regression. Despite its simplicity, linear regression is extremely useful both conceptually and practically.
In this article, we will explain how to implement both simple and multiple linear regression in Python using different libraries, such as NumPy, SciPy, statsmodels, and scikit-learn. We will also discuss how to evaluate the model and the assumptions underlying linear regression.
Libraries and Installation
To implement and perform linear regression in Python, you will primarily need the following libraries:
- NumPy: A library for the Python programming language, adding support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays.
- Pandas: A library providing high-performance, easy-to-use data structures and data analysis tools for Python.
- SciPy: A free and open-source Python library used for scientific computing and technical computing. SciPy contains modules for optimization, linear algebra, integration, interpolation, special functions, FFT, signal and image processing, ODE solvers, and other tasks common in science and engineering.
- Statsmodels: A Python module that provides classes and functions for the estimation of many different statistical models, as well as for conducting statistical tests and exploring the data.
- Scikit-Learn: One of the most widely used machine learning libraries in Python.
- Matplotlib: A plotting library for the Python programming language and its numerical mathematics extension NumPy.
- Seaborn: A Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.
You can install these libraries using pip:
pip install numpy pandas scipy statsmodels scikit-learn matplotlib seaborn
Simple Linear Regression
Simple linear regression is a statistical method that allows us to summarize and study relationships between two continuous (quantitative) variables:
- One variable, denoted x, is regarded as the predictor, explanatory, or independent variable.
- The other variable, denoted y, is regarded as the response, outcome, or dependent variable.
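In symbols, the simple linear regression model is y = β0 + β1·x + ε, where β0 is the intercept, β1 is the slope, and ε is a random error term; fitting the model means estimating β0 and β1 from the data.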
Simple Linear Regression with NumPy
You can implement a simple linear regression in Python using NumPy’s polyfit function. Let’s say you have the following data:
import numpy as np
x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 3, 5, 7, 11])
You can fit a linear regression model using np.polyfit(), specifying the degree of the polynomial as 1:
coefficients = np.polyfit(x, y, 1)
print(coefficients)
np.polyfit() returns an array with the coefficients of the polynomial, ordered from the highest degree to the lowest. For a linear fit, the array contains the slope and the intercept of the line, in that order.
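For instance, you can unpack the coefficients and compute fitted values directly; a minimal sketch using the arrays defined above:
slope, intercept = coefficients
# For this data, the slope is 2.2 and the intercept is -1.0
y_pred = slope * x + intercept  # predicted values of y
print(f"slope = {slope:.2f}, intercept = {intercept:.2f}")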
Simple Linear Regression with SciPy
SciPy also provides a way to fit a linear regression model, via the linregress() function:
import scipy.stats
slope, intercept, r_value, p_value, std_err = scipy.stats.linregress(x, y)
Here, linregress() returns the slope and intercept of the fitted line, the correlation coefficient r (its square is the R-squared value), the p-value for a hypothesis test whose null hypothesis is that the slope is zero, and the standard error of the estimated slope.
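For example, you can square r_value to report R-squared; a small sketch building on the call above:
r_squared = r_value ** 2  # coefficient of determination
print(f"y = {slope:.2f}x + {intercept:.2f}")
print(f"R-squared: {r_squared:.3f}, p-value: {p_value:.4f}")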
Simple Linear Regression with statsmodels
statsmodels is a powerful Python library for statistics and econometrics. It takes a statistician’s view of the model, reporting detailed inference output similar to what dedicated statistics packages provide:
import statsmodels.api as sm
# Add a constant (intercept term) to the predictors
X = sm.add_constant(x)
model = sm.OLS(y, X)
results = model.fit()
print(results.summary())
Here, sm.OLS() is used to create an ordinary least squares regression model, which is then fitted to the data using the fit() method. The summary() method prints a detailed summary of the results.
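Beyond summary(), the fitted results object exposes the individual quantities directly; a brief sketch:
print(results.params)        # intercept and slope, in that order
print(results.rsquared)      # R-squared of the fit
y_pred = results.predict(X)  # fitted values for the training data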
Simple Linear Regression with scikit-learn
Scikit-learn is one of the most widely used libraries for machine learning in Python. Here’s how you can fit a linear regression model with scikit-learn:
from sklearn.linear_model import LinearRegression
# scikit-learn expects the predictors as a 2D array of shape (n_samples, n_features)
X = x.reshape(-1, 1)
model = LinearRegression()
model.fit(X, y)
slope = model.coef_[0]       # coef_ holds one coefficient per feature
intercept = model.intercept_
Note that scikit-learn requires the predictors to be in a 2D array of shape (n_samples, n_features), so we reshape x if it is a 1D array; the target y can remain one-dimensional. LinearRegression() creates a linear regression model, and the fit() method fits it to the data.
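Once fitted, the model can produce predictions and a goodness-of-fit score; a minimal sketch:
y_pred = model.predict(X)      # fitted values for the training data
r_squared = model.score(X, y)  # R-squared on the training data
print(f"R-squared: {r_squared:.3f}")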
Multiple Linear Regression
Multiple linear regression is a generalization of simple linear regression to the case where the dependent variable is predicted by more than one independent variable.
Multiple Linear Regression with statsmodels
Here is how you can implement multiple linear regression with statsmodels. Assume that we have a DataFrame df with columns ‘A’, ‘B’, and ‘C’, and we want to predict ‘C’ based on ‘A’ and ‘B’.
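For a self-contained example, such a DataFrame could be built from made-up values (the numbers below are purely illustrative):
import pandas as pd
df = pd.DataFrame({
    'A': [1, 2, 3, 4, 5],
    'B': [2, 1, 4, 3, 5],
    'C': [3, 4, 8, 9, 13],
})
With a DataFrame like this in place, the model is specified and fitted as follows: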
import statsmodels.api as sm
X = df[['A', 'B']]
y = df['C']
# Add a constant (intercept term) to the predictors
X = sm.add_constant(X)
model = sm.OLS(y, X)
results = model.fit()
print(results.summary())
Multiple Linear Regression with scikit-learn
Here is how you can do the same with scikit-learn:
from sklearn.linear_model import LinearRegression
X = df[['A', 'B']]
y = df['C']
model = LinearRegression()
model.fit(X, y)
coefficients = model.coef_
intercept = model.intercept_
In both of these examples, the procedure is the same as for simple linear regression, but now X can be a DataFrame with more than one column.
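After fitting, you can predict ‘C’ for new observations; a minimal sketch (the new values are made up):
import pandas as pd
new_data = pd.DataFrame({'A': [6], 'B': [4]})
predicted_C = model.predict(new_data)
print(predicted_C)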
Evaluating the Model
To evaluate the performance of a linear regression model, we can use several metrics, including:
- R-squared: The proportion of the variance in the dependent variable that is predictable from the independent variables.
- Adjusted R-squared: The R-squared that has been adjusted for the number of predictors in the model.
- F-statistic: Tests whether the predictors, taken together, are related to the dependent variable (the null hypothesis is that all coefficients except the intercept are zero).
- AIC/BIC: Information criteria that can be used for model selection, with a lower value indicating a better fit.
- Confidence intervals: Indicate the range within which the coefficients are likely to fall, with a certain level of confidence.
You can get all these metrics (and more) from the summary() method of a fitted statsmodels OLS model, i.e., the results object returned by fit().
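These quantities are also available programmatically on the fitted results object; a brief sketch using the multiple regression example above:
print(results.rsquared)          # R-squared
print(results.rsquared_adj)      # adjusted R-squared
print(results.fvalue)            # F-statistic
print(results.aic, results.bic)  # information criteria
print(results.conf_int())        # 95% confidence intervals for the coefficients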
Assumptions of Linear Regression
There are several assumptions that underpin the linear regression model:
- Linearity: The relationship between the predictors and the response variable is linear.
- Independence: The residuals (i.e., the differences between the observed and predicted responses) are independent.
- Homoscedasticity: The variance of the residuals is constant across all levels of the predictors.
- Normality: The residuals are normally distributed.
If these assumptions are violated, the results of the linear regression analysis may be misleading or inaccurate. Therefore, it’s important to check these assumptions when fitting a linear regression model. This can be done by inspecting the residuals, performing a normality test, or using diagnostic plots such as the Q-Q plot.
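For example, a residuals-versus-fitted plot and a Q-Q plot can be produced from the statsmodels results above; a minimal sketch:
import matplotlib.pyplot as plt
import statsmodels.api as sm
residuals = results.resid
fitted = results.fittedvalues
# Residuals vs. fitted values: look for curvature (non-linearity) or a funnel shape (heteroscedasticity)
plt.scatter(fitted, residuals)
plt.axhline(0, color='red', linestyle='--')
plt.xlabel('Fitted values')
plt.ylabel('Residuals')
plt.show()
# Q-Q plot: points close to the line suggest approximately normal residuals
sm.qqplot(residuals, line='s')
plt.show()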
Conclusion
In this article, we have explained how to implement both simple and multiple linear regression in Python using different libraries. We have also discussed how to evaluate the model and the assumptions underlying it. While linear regression is a simple method, it is a fundamental concept in statistical learning and serves as the basis for many more complex methods. Therefore, understanding linear regression is crucial for anyone interested in data analysis or machine learning.