A Brief Introduction to K-Fold Cross Validation in Machine Learning


Why do we need K-Fold Cross Validation?

Evaluating a supervised machine learning model might appear straightforward: train a model, then calculate how well it did using a performance metric such as accuracy or RMSE. However, this approach is fundamentally flawed. If we train a model on our data and then evaluate it on that same data, we are measuring the wrong thing. Our goal is not to evaluate how well the model does on the training data, but how well it does on data it has never seen before.

One strategy is to hold out some portion of the data for testing. This is called validation, or hold-out. In validation we split the data into two sets, traditionally called the training set and the test set. We train the model on the training set and evaluate it on the test set. However, this validation approach has two major weaknesses. First, the performance of the model can be highly dependent on which observations were selected for the test set. Second, the model is neither trained on all the available data nor evaluated on all the available data.
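To see the first weakness concretely, here is a small sketch using a synthetic dataset and a logistic regression model (both chosen only for illustration, not part of this post's main example). The same model gets a noticeably different hold-out accuracy depending on which random split we happen to pick:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# a small synthetic dataset (illustrative only)
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# the same model, evaluated on three different random hold-out splits
for seed in [0, 1, 2]:
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, random_state=seed
    )
    model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    print(seed, accuracy_score(y_te, model.predict(X_te)))
```

The printed accuracies vary from seed to seed even though nothing about the model changed, which is exactly the dependence on the chosen test set described above.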

A better strategy which overcomes these weaknesses is called k-fold cross validation.

What is K-Fold Cross Validation?

In k-fold cross validation, we split the data into k parts called folds. The model is trained on k - 1 folds combined into one training set, and the remaining fold is used as the test set. We repeat this process k times, each time using a different fold as the test set. The performance of the model on each of the k iterations is then averaged to produce an overall measurement.
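The procedure above can be sketched by hand with scikit-learn's KFold splitter (a minimal illustration on a synthetic dataset with a logistic regression model, both assumed here for brevity; later in this post we use a helper that does all of this in one call):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold

# a small synthetic dataset (illustrative only)
X, y = make_classification(n_samples=100, n_features=5, random_state=0)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
fold_scores = []
for train_idx, test_idx in kf.split(X):
    # train on k - 1 folds, evaluate on the held-out fold
    model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    fold_scores.append(accuracy_score(y[test_idx], model.predict(X[test_idx])))

# average the k fold scores into one overall measurement
print(np.mean(fold_scores))
```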

How to do K-Fold Cross Validation?

Read Data –

Let’s read a dataset to work with.

import pandas as pd
import numpy as np

# read data in pandas dataframe
url = "https://raw.githubusercontent.com/bprasad26/lwd/master/data/breast_cancer.csv"
df = pd.read_csv(url)

Our target column is the diagnosis column.

B – stands for Benign, which means the cell is non-cancerous. The patient is not sick.

M – stands for Malignant, which means the cell is cancerous. The patient is sick.

Let’s map Malignant and Benign to 1s and 0s to reduce the confusion. All the observations for sick patients (M) will be 1, our positive class, and all healthy patients (B) will be 0, our negative class.

values = {"B": 0, "M": 1}
df["diagnosis"] = df["diagnosis"].map(values)

To build the model we will use a random forest classifier, but if you want, you can use any other algorithm.

from sklearn.pipeline import make_pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# split the data into training and test set
X = df.drop("diagnosis", axis=1).copy()
y = df["diagnosis"].copy()

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=26
)

# initiate an rf classifier using a pipeline
clf = make_pipeline(
    SimpleImputer(strategy="mean"), RandomForestClassifier(random_state=26)
)

# train the classifier on training data
clf.fit(X_train, y_train)

# make predictions on test data
pred = clf.predict(X_test)

# measure accuracy
accuracy_score(y_test, pred)

output - 0.9707602339181286

Now, let’s apply k-fold cross validation with 5 folds. First we import cross_val_score from scikit-learn, which performs the cross validation for us. We pass it the pipeline, the features, the targets, the number of folds to use (cv) and the metric to evaluate the model with (scoring). For more information, see the documentation for cross_val_score.

from sklearn.model_selection import cross_val_score

scores = cross_val_score(clf, X_train, y_train, cv=5, scoring="accuracy")

# accuracy scores on the 5 folds
scores

output - array([0.9375    , 0.9875    , 0.9375    , 0.96202532, 0.92405063])

# mean accuracy score
scores.mean()

So the mean accuracy score with 5-fold cross validation is about 95%, compared to the 97% we got earlier. On average, we expect the model to be 95% accurate. This is a much better estimate of the accuracy than what we got earlier.
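Since cross validation gives one score per fold, it is also common to report the spread of the scores alongside the mean. A small sketch, using the five fold accuracies printed above:

```python
import numpy as np

# the five fold accuracies from 5-fold cross validation
scores = np.array([0.9375, 0.9875, 0.9375, 0.96202532, 0.92405063])

# mean summarises the estimate; standard deviation shows its variability
print(f"{scores.mean():.3f} +/- {scores.std():.3f}")  # prints 0.950 +/- 0.023
```

A large spread would suggest the model's performance is sensitive to the particular data it is trained on, which a single hold-out split would never reveal.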

