How to Create a Baseline Regression Model in Scikit-Learn


Problem –

You want a simple baseline regression model to compare your actual model against.

Solution –

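In Scikit-Learn, you can use DummyRegressor to create a simple baseline model. It ignores the input features and always predicts a simple statistic of the training targets, such as their mean.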

Let’s load a dataset.

import pandas as pd
from sklearn import datasets

housing = datasets.fetch_california_housing()
X = pd.DataFrame(housing.data, columns=housing.feature_names)
y = housing.target
X.head()
y
output - 
array([4.526, 3.585, 3.521, ..., 0.923, 0.847, 0.894])

Then split the dataset into a training and a test set.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

Now, create a baseline model using DummyRegressor.

from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error
import numpy as np

# create a dummy regressor
dummy_reg = DummyRegressor(strategy='mean')
# fit it on the training set
dummy_reg.fit(X_train, y_train)
# make predictions on the test set
y_pred = dummy_reg.predict(X_test)

# calculate root mean squared error
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
print("Dummy RMSE:", rmse)

output - 
Dummy RMSE: 1.1448563543099792
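
To see what the mean strategy does, here is a minimal sketch on a small synthetic dataset. The synthetic data is an assumption made so the snippet runs without downloading anything, and the names rng, X_demo, y_demo and demo_reg are illustrative only. It checks that every prediction equals the mean of the training targets.

```python
import numpy as np
from sklearn.dummy import DummyRegressor

# synthetic data, for illustration only (not the housing dataset)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = rng.normal(loc=2.0, size=100)

# fit a mean-strategy baseline on the synthetic data
demo_reg = DummyRegressor(strategy='mean')
demo_reg.fit(X_demo, y_demo)

# the features are ignored: every prediction is the training mean
demo_pred = demo_reg.predict(X_demo[:5])
print(np.allclose(demo_pred, y_demo.mean()))
```

Because the prediction is a single constant, the baseline RMSE tells you how far the targets are spread around that constant, which is the error any useful model should beat.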

Now we can train our actual model and compare it with the baseline.

from sklearn.linear_model import LinearRegression

# create a linear regression model
lin_reg = LinearRegression()
# fit on the training data
lin_reg.fit(X_train, y_train)
# make predictions on the test set
y_pred = lin_reg.predict(X_test)

# calculate root mean squared error
mse = mean_squared_error(y_test, y_pred)
lin_rmse = np.sqrt(mse)
print("Linear Regression RMSE:", lin_rmse)

output - 
Linear Regression RMSE: 0.7455813830127761
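
One way to read the two numbers together is as a relative reduction in error. The sketch below copies the two RMSE values printed above into plain variables; the name dummy_rmse is introduced here for illustration (the code above stores that value in rmse).

```python
# RMSE values copied from the outputs above
dummy_rmse = 1.1448563543099792
lin_rmse = 0.7455813830127761

# fraction by which the model reduces the baseline error
improvement = 1 - lin_rmse / dummy_rmse
print(f"RMSE reduction over baseline: {improvement:.1%}")
```

On this split the linear model cuts the baseline RMSE by roughly 35%, so it is learning something from the features beyond the average target value.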

You can change the strategy from mean (the default) to median, quantile, or constant. The quantile strategy requires the quantile parameter, and the constant strategy requires the constant parameter.

from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error
import numpy as np

# create a dummy regressor
dummy_reg = DummyRegressor(strategy='constant', constant=1)
# fit it on the training set
dummy_reg.fit(X_train, y_train)
# make predictions on the test set
y_pred = dummy_reg.predict(X_test)

# calculate root mean squared error
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
print("Dummy Constant RMSE:", rmse)

output - 
Dummy Constant RMSE: 1.5567403478625699
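
The quantile strategy works the same way. Here is a minimal sketch, again on synthetic data (an assumption so the snippet runs on its own; the names X_demo, y_demo, q_reg and median_reg are illustrative). It shows that quantile=0.5 gives the same constant prediction as the median strategy.

```python
import numpy as np
from sklearn.dummy import DummyRegressor

# synthetic data, for illustration only (not the housing dataset)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(101, 2))
y_demo = rng.exponential(size=101)

# strategy='quantile' requires the quantile argument (between 0.0 and 1.0)
q_reg = DummyRegressor(strategy='quantile', quantile=0.5)
q_reg.fit(X_demo, y_demo)

# median-strategy baseline for comparison
median_reg = DummyRegressor(strategy='median')
median_reg.fit(X_demo, y_demo)

# quantile=0.5 yields the same constant prediction as the median strategy
q_pred = q_reg.predict(X_demo[:3])
print(np.allclose(q_pred, median_reg.predict(X_demo[:3])))
```

Other quantile values, such as 0.9, can serve as a baseline when you care about over- or under-prediction asymmetrically.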
