OneHotEncoder – How to do One Hot Encoding in sklearn.

In this post, you will learn how to do one-hot encoding in scikit-learn.

What is One Hot Encoding?

One-hot encoding is a method of converting categorical data to numeric data: for every unique value in a categorical column, we create a new numeric column that holds only 0s and 1s. Take, for example, a wine dataset with a categorical quality column.

When the quality of a wine is bad, the bad column gets a value of 1 and all the other columns get a value of 0. When the quality is medium, the medium column gets a value of 1 and all the other columns get 0. Because exactly one column is "hot" (set to 1) in each row, this is called one-hot encoding.
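The idea can be sketched in a few lines of plain Python before bringing in any library. The wine-quality labels below are hypothetical and chosen only to mirror the example above; this is an illustration of the concept, not how scikit-learn implements it.

```python
# Hypothetical wine-quality labels, used only for illustration.
quality = ["bad", "medium", "good", "bad"]

# One output column per unique category (sorted for a stable column order).
categories = sorted(set(quality))

# For each row, put 1 in the column of its category and 0 everywhere else.
encoded = [[1 if value == cat else 0 for cat in categories] for value in quality]

print(categories)
for value, row in zip(quality, encoded):
    print(value, row)
```

Each row contains a single 1, in the column that matches its category.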

How to do One-Hot Encoding in scikit-learn?

Let’s read a dataset to work with.

import pandas as pd
url = ""
df = pd.read_csv(url)
df = df[['Survived','Pclass','Sex','Age','Fare','Embarked']]

Now, let’s create a training and test set.

from sklearn.model_selection import train_test_split

# separate features and target
X = df.drop('Survived', axis=1)
y = df['Survived']
# split the data into training and test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

Now, to do one hot encoding in scikit-learn we use OneHotEncoder.

from sklearn.preprocessing import OneHotEncoder
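# sparse=False returns a dense NumPy array
# (from scikit-learn 1.2 onward this parameter is named sparse_output)
ohe = OneHotEncoder(sparse=False)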
titanic_1hot = ohe.fit_transform(X_train)

To get the feature names after one-hot encoding, you can call the fitted encoder's get_feature_names_out() method (available in scikit-learn 1.0 and later).


In the code above, sparse=False means that we want a dense NumPy array instead of a sparse matrix. If you run it, you will find that scikit-learn also applied one-hot encoding to the numeric columns, which we do not want, because every distinct number becomes its own column. We only want to transform the categorical columns. To handle this we can use ColumnTransformer, which applies different transformations to the numeric and categorical columns.
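To make the problem concrete, here is a small sketch on hypothetical toy data (not the post's dataset) with one numeric and one categorical column. It uses the encoder's default sparse output and converts it with toarray(), so it does not depend on the name of the sparse parameter.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: one numeric column and one categorical column.
toy = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0, 35.0],
    "Sex": ["male", "female", "female", "male"],
})

toy_ohe = OneHotEncoder()
# The default output is a sparse matrix; densify it for inspection.
encoded = toy_ohe.fit_transform(toy).toarray()

# Every distinct Age is treated as its own category:
# 4 Age columns + 2 Sex columns = 6 output columns.
print(encoded.shape)
print([len(cats) for cats in toy_ohe.categories_])
```

With many distinct numeric values the number of columns grows quickly, which is why the encoder should be restricted to the categorical columns.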

Let’s see how to do it.

# import libraries
from sklearn.pipeline import make_pipeline, Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# get the categorical and numeric column names
num_cols = X_train.select_dtypes(exclude=['object']).columns.tolist()
cat_cols = X_train.select_dtypes(include=['object']).columns.tolist()

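# pipeline for numerical columns
# (assumed steps: median imputation and scaling, matching the
# SimpleImputer and StandardScaler imports above)
num_pipe = make_pipeline(
    SimpleImputer(strategy='median'),
    StandardScaler()
)

# pipeline for categorical columns
cat_pipe = make_pipeline(
    SimpleImputer(strategy='constant', fill_value='N/A'),
    # from scikit-learn 1.2 onward, sparse is named sparse_output
    OneHotEncoder(handle_unknown='ignore', sparse=False)
)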

# combine both the pipelines
full_pipe = ColumnTransformer([
    ('num', num_pipe, num_cols),
    ('cat', cat_pipe, cat_cols)
])

# build the model
logreg = make_pipeline(
    full_pipe, LogisticRegression(max_iter=1000, random_state=42))

# train the model
logreg.fit(X_train, y_train)

# make predictions on the test set
y_pred = logreg.predict(X_test)

# measure accuracy
score = accuracy_score(y_test, y_pred)
print("Accuracy Score:", score)

Output:
Accuracy Score: 0.7947761194029851
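The pipeline above passes handle_unknown='ignore' to OneHotEncoder. The sketch below, on hypothetical toy data, shows what that setting does when the test set contains a category that was not seen during fitting: the row is encoded as all zeros instead of raising an error.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: "Q" appears in the test set but not in the training set.
train = pd.DataFrame({"Embarked": ["S", "C", "S"]})
test = pd.DataFrame({"Embarked": ["C", "Q"]})

enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(train)

# Learned categories are C and S; the unseen "Q" becomes a row of zeros.
encoded_test = enc.transform(test).toarray()
print(enc.categories_)
print(encoded_test)
```

Without handle_unknown='ignore', the default setting ('error') would make transform raise an error on the unseen category.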

If you are not familiar with scikit-learn pipelines, please read the related post below.

Related Posts –

1. How to Build Machine Learning Pipeline in Scikit-Learn

2. How to do One Hot Encoding in pandas
