How to Create a Baseline Classification Model in scikit-learn

Problem –

You want to create a simple baseline classification model so that you can compare it with your actual model.

Solution –

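In scikit-learn, you can use DummyClassifier to create a baseline classification model. It makes predictions without looking at the input features, so its score shows what you get with no learning at all.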

Let’s read a dataset to work with.

import pandas as pd
import numpy as np
from sklearn import datasets

cancer = datasets.load_breast_cancer()
X = pd.DataFrame(cancer.data, columns=cancer.feature_names)
y = cancer.target
X.head()

Now, split the data into a training and a test set.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

Now, create a baseline model using DummyClassifier.

from sklearn.dummy import DummyClassifier

# create dummy classifier
dummy_clf = DummyClassifier(strategy='uniform', random_state=42)
# train a model
dummy_clf.fit(X_train, y_train)
# get accuracy score
dummy_clf.score(X_test, y_test)

output - 
0.5964912280701754

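Here, we used strategy='uniform', which picks one of the classes uniformly at random for every sample. You can also use other strategies: most_frequent, prior, stratified and constant. Details can be found in the description of the strategy parameter in the scikit-learn DummyClassifier documentation.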
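To see how the strategies differ, here is a minimal sketch that fits one DummyClassifier per strategy on the same split as above and collects the test accuracy. The dictionary names (strategies, baseline_scores) and the choice of constant=1 are assumptions made for illustration, not part of the original example.

```python
from sklearn import datasets
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

cancer = datasets.load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, test_size=0.2, random_state=42
)

# strategy name -> extra keyword arguments that strategy needs
# (illustrative choices, not taken from the original post)
strategies = {
    "uniform": {"random_state": 42},
    "stratified": {"random_state": 42},
    "most_frequent": {},
    "prior": {},
    "constant": {"constant": 1},  # always predict class 1 (benign)
}

baseline_scores = {}
for name, kwargs in strategies.items():
    dummy = DummyClassifier(strategy=name, **kwargs)
    dummy.fit(X_train, y_train)
    baseline_scores[name] = dummy.score(X_test, y_test)

for name, score in baseline_scores.items():
    print(f"{name:>13}: {score:.3f}")
```

On this dataset, most_frequent, prior and constant=1 all predict the majority class (benign) for every sample, so they should give the same accuracy. That majority-class score is often a more demanding baseline than uniform when the classes are imbalanced.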

Now, create the actual model that you want to compare against the baseline.

from sklearn.linear_model import LogisticRegression

# create a logistic regression model
clf = LogisticRegression(max_iter=10000, random_state=42)
# train the model on training dataset
clf.fit(X_train, y_train)
# get accuracy score on test set
clf.score(X_test, y_test)

output - 
0.956140350877193

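The accuracy of this model (about 0.956) is far higher than that of the baseline model (about 0.596), which suggests that the logistic regression is really learning from the features. If you want, you can also try some other models in the same way, see which performs best, and choose that one.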
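One possible way to try several models at once is to put them in a dictionary and score each on the same test set. This is only a sketch: the decision tree and random forest are assumed candidates chosen for illustration, and the names models, test_scores and best_name are hypothetical.

```python
from sklearn import datasets
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

cancer = datasets.load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, test_size=0.2, random_state=42
)

# candidate models; the baseline is included so it appears in the same table
models = {
    "baseline (uniform)": DummyClassifier(strategy="uniform", random_state=42),
    "logistic regression": LogisticRegression(max_iter=10000, random_state=42),
    "decision tree": DecisionTreeClassifier(random_state=42),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

test_scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    test_scores[name] = model.score(X_test, y_test)

# print from best to worst test accuracy
for name, score in sorted(test_scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:>20}: {score:.3f}")

best_name = max(test_scores, key=test_scores.get)
print("best:", best_name)
```

Keep in mind that picking the winner by test accuracy can make the reported score slightly optimistic. If that matters, comparing models with cross-validation on the training set (for example with cross_val_score) is a common alternative.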
