Hyperparameter Optimization with Grid Search in Machine Learning

Grid Search –

In machine learning, a learning algorithm fits the parameters of a model from data. Each learning algorithm also has hyperparameters that must be set before training a model. Finding good values for these hyperparameters is one of the most important parts of building a better model. This is often referred to as hyperparameter tuning, hyperparameter optimization, or model selection.

One option is to tune these hyperparameters manually, which is tedious and inefficient. Another option is to use scikit-learn's GridSearchCV.

GridSearchCV is a brute-force approach to model selection using cross-validation. We define a set of candidate values for one or more hyperparameters, and GridSearchCV trains and cross-validates a model for every combination of those values. The combination with the best mean cross-validation score is selected as the best model.
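
To make the brute-force idea concrete, here is a rough sketch of the same search written as an explicit loop. It is not how GridSearchCV is implemented internally. It assumes scikit-learn's bundled breast cancer data as a stand-in dataset, and it uses a small decision tree with made-up candidate values so that it runs quickly.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import ParameterGrid, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Assumption: the dataset bundled with scikit-learn stands in for your own data.
X, y = load_breast_cancer(return_X_y=True)

# Hypothetical candidate values, chosen only for illustration.
param_grid = {"max_depth": [2, 4], "min_samples_leaf": [1, 5]}

best_score, best_params = -1.0, None
# ParameterGrid yields one dict per combination of candidate values.
for params in ParameterGrid(param_grid):
    model = DecisionTreeClassifier(random_state=0, **params)
    # Mean accuracy over 5 cross-validation folds for this combination.
    score = cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    if score > best_score:
        best_score, best_params = score, params

print(best_params, round(best_score, 4))
```

GridSearchCV wraps this kind of loop and also stores the per-combination results for later inspection.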

How to do Hyperparameter Optimization with GridSearchCV in Scikit-Learn?

Let’s read a dataset to work with.

import pandas as pd
import numpy as np

url = 'https://raw.githubusercontent.com/bprasad26/lwd/master/data/breast_cancer.csv'
df = pd.read_csv(url)
df.head()

Divide the data into training and test sets.

from sklearn.model_selection import train_test_split

X = df.drop('diagnosis', axis=1).copy()
y = df['diagnosis']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Perform hyperparameter optimization with grid search.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# create a random forest classifier
rf_clf = RandomForestClassifier(random_state=42)

# create a dictionary of hyperparameters
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [5, 10, 15]
}

# create grid search 
grid_search = GridSearchCV(rf_clf, param_grid, cv=5, scoring='accuracy', return_train_score=True)

# fit grid search
grid_search.fit(X_train, y_train)

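The param_grid tells scikit-learn to evaluate 3 * 3 = 9 combinations of the n_estimators and max_depth hyperparameter values. Since we are using 5-fold cross-validation, 9 * 5 = 45 rounds of training are done in total. By default (refit=True), GridSearchCV then trains one more model with the best combination on the whole training set.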
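
The arithmetic above can be checked with a few lines of plain Python. This is only a counting sketch: it enumerates the combinations with itertools.product and does not train anything.

```python
from itertools import product

# Same grid and fold count as in the example above.
param_grid = {"n_estimators": [100, 200, 300], "max_depth": [5, 10, 15]}
cv_folds = 5

# Every combination of candidate values, one dict per combination.
combinations = [
    dict(zip(param_grid, values)) for values in product(*param_grid.values())
]

n_candidates = len(combinations)     # 3 * 3
n_cv_fits = n_candidates * cv_folds  # one fit per combination per fold

print(n_candidates, n_cv_fits)  # prints: 9 45
```

The count grows multiplicatively with every hyperparameter you add, which is why it is worth computing before starting a long search.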

Once the search is finished, we can get the best hyperparameters like this.

grid_search.best_params_
# output
{'max_depth': 10, 'n_estimators': 200}

You can also get the best estimator directly.

grid_search.best_estimator_
# output 
RandomForestClassifier(max_depth=10, n_estimators=200, random_state=42)
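
The test set is not used during the search, so it can give a less biased estimate of how the selected model performs on unseen data. The sketch below shows one way to do that. It assumes scikit-learn's bundled breast cancer data instead of the CSV above, and a deliberately smaller grid so that it runs quickly; the resulting numbers will therefore differ from the ones in this post.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Assumption: bundled dataset used as a stand-in for the CSV file.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Smaller, hypothetical grid to keep the example fast.
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    {"n_estimators": [10, 20], "max_depth": [3, 5]},
    cv=3,
    scoring="accuracy",
)
search.fit(X_train, y_train)

# best_estimator_ has already been refit on the whole training set.
best_model = search.best_estimator_
test_accuracy = best_model.score(X_test, y_test)

print(search.best_params_, round(test_accuracy, 4))
```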

You can also get the evaluation scores.

cvresult = grid_search.cv_results_
for mean_score, params in zip(cvresult['mean_test_score'], cvresult['params']):
    print(mean_score, params)
# output
0.9604395604395604 {'max_depth': 5, 'n_estimators': 100}
0.9582417582417582 {'max_depth': 5, 'n_estimators': 200}
0.956043956043956 {'max_depth': 5, 'n_estimators': 300}
0.956043956043956 {'max_depth': 10, 'n_estimators': 100}
0.9626373626373625 {'max_depth': 10, 'n_estimators': 200}
0.9626373626373625 {'max_depth': 10, 'n_estimators': 300}
0.956043956043956 {'max_depth': 15, 'n_estimators': 100}
0.9626373626373625 {'max_depth': 15, 'n_estimators': 200}
0.9626373626373625 {'max_depth': 15, 'n_estimators': 300}

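In this example we get the best model by setting max_depth to 10 and n_estimators to 200. Three other combinations reach the same mean score (0.9626); when scores tie, GridSearchCV reports the first tied combination in grid order.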
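
If you prefer a table to a printed loop, cv_results_ can be loaded into a pandas DataFrame and sorted by rank. The following is a minimal sketch, again assuming the bundled breast cancer data and a small hypothetical grid so that it runs quickly.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Assumption: bundled dataset and a reduced grid, for speed only.
X, y = load_breast_cancer(return_X_y=True)
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    {"n_estimators": [10, 20], "max_depth": [3, 5]},
    cv=3,
    scoring="accuracy",
)
search.fit(X, y)

# cv_results_ is a dict of arrays, one entry per candidate combination.
results = pd.DataFrame(search.cv_results_)
columns = ["params", "mean_test_score", "std_test_score", "rank_test_score"]
# A stable sort keeps tied candidates in their original grid order.
results = results[columns].sort_values("rank_test_score", kind="stable")

print(results.to_string(index=False))
```

The std_test_score column is useful when several candidates have nearly identical means, since it shows how much the score varies between folds.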

This example used a single algorithm, but you can also search over different learning algorithms and their hyperparameter values together.

from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

# create a pipeline
pipe = Pipeline([("classifier", RandomForestClassifier())])

param_grid = [
    {
        'classifier': [RandomForestClassifier(random_state=42)],
        'classifier__max_depth': [5, 10, 15],
        'classifier__n_estimators': [100, 200, 300],
    },
    {
        'classifier': [SVC()],
        'classifier__kernel': ['linear', 'poly', 'rbf'],
    },
]

# create grid search
grid_search = GridSearchCV(pipe, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)

Here we defined a search space that includes two algorithms: SVC and the random forest classifier. Each learning algorithm has its own hyperparameters, and we define their candidate values using the format classifier__[hyperparameter name] (note the double underscore).
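
The double-underscore names are the same ones the pipeline itself exposes through get_params() and accepts in set_params(). A short sketch, with no training involved, shows how a name such as classifier__max_depth reaches the step named "classifier":

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline

pipe = Pipeline([("classifier", RandomForestClassifier())])

# "<step name>__<hyperparameter>" addresses a parameter of that step.
pipe.set_params(classifier__max_depth=5)
print(pipe.named_steps["classifier"].max_depth)

# List every nested name that can be used as a key in param_grid.
nested_names = sorted(
    name for name in pipe.get_params() if name.startswith("classifier__")
)
print(nested_names[:5])
```

Checking get_params() this way is a quick guard against misspelled keys, which would otherwise make GridSearchCV raise an error when fitting.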

After the search is complete, we can use the best_estimator_ to view the best model’s learning algorithm and hyperparameters.

grid_search.best_estimator_.get_params()['classifier']
# output
RandomForestClassifier(max_depth=10, n_estimators=200, random_state=42)
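
It can also be useful to see how the best candidate of each algorithm compares, not only the overall winner. The sketch below groups the mean scores in cv_results_ by classifier type. It assumes the bundled breast cancer data and a reduced, hypothetical grid so that it finishes quickly.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Assumption: bundled dataset and a small grid, for speed only.
X, y = load_breast_cancer(return_X_y=True)

pipe = Pipeline([("classifier", RandomForestClassifier())])
param_grid = [
    {
        "classifier": [RandomForestClassifier(random_state=42)],
        "classifier__n_estimators": [10, 20],
    },
    {
        "classifier": [SVC()],
        "classifier__kernel": ["rbf", "poly"],
    },
]

search = GridSearchCV(pipe, param_grid, cv=3, scoring="accuracy")
search.fit(X, y)

# Keep the highest mean cross-validation score seen for each algorithm.
best_per_algorithm = {}
scores = search.cv_results_["mean_test_score"]
for score, params in zip(scores, search.cv_results_["params"]):
    name = type(params["classifier"]).__name__
    best_per_algorithm[name] = max(score, best_per_algorithm.get(name, 0.0))

for name, score in best_per_algorithm.items():
    print(name, round(score, 4))
```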

To learn more about how to use scikit-learn pipeline for grid search, please read this post –

How to Build Machine Learning Pipeline with Scikit-Learn?
