get_dummies – How to do One Hot Encoding in Pandas

In this post, you will learn how to do One-Hot Encoding in pandas using the pd.get_dummies() method.

What is One-Hot Encoding?

One-Hot Encoding is a method of converting categorical data to numeric data: for every unique value in the categorical column, we create a new numeric column. For example, suppose a wine dataset has a categorical quality column.

When the quality of a wine is bad, the bad column gets a value of 1 and all the other columns get a value of 0. When the quality is medium, the medium column gets a value of 1 and all the other columns get a value of 0. In every row exactly one column is "hot" (set to 1), which is why it is called One-Hot Encoding.
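The idea can be sketched on a tiny, made-up table. The wine DataFrame and its quality values below are illustrative assumptions, not data from this post.

```python
import pandas as pd

# Hypothetical toy data: the 'quality' column and its values are made up.
wine = pd.DataFrame({"quality": ["bad", "medium", "good", "bad"]})

# dtype=int asks for 0/1 columns; recent pandas versions return True/False by default.
encoded = pd.get_dummies(wine["quality"], dtype=int)
print(encoded)

# Each row has exactly one column set to 1.
print(encoded.sum(axis=1).tolist())
```

Each unique value becomes its own column, and the column names are the sorted category labels.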

How to do One-Hot Encoding in Pandas –

Let’s read a dataset to work with.

import pandas as pd
url = ""
df = pd.read_csv(url)
df = df[['Survived','Pclass','Sex','Age','Fare','Embarked']]

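Now, to do One-Hot Encoding in Pandas we use the pd.get_dummies() method. When given a whole DataFrame, it encodes only the text (object) and categorical columns, here Sex and Embarked, and leaves the numeric columns unchanged.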

dummies_df = pd.get_dummies(df)
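To see what this call does to the columns, here is a minimal sketch on a small hand-made frame. The toy rows are assumptions that only mimic the shape of the dataset above.

```python
import pandas as pd

# Hypothetical rows that mimic a few of the columns used in this post.
toy = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0],
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "S"],
})

# Numeric columns pass through; text columns are expanded into
# one column per value, named <column>_<value>.
toy_dummies = pd.get_dummies(toy, dtype=int)
print(toy_dummies.columns.tolist())
```

The numeric Age column is kept as it is, while Sex and Embarked are replaced by prefixed indicator columns.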

Now, we can train a logistic regression model on this data.

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# drop rows that contain NaN
dummies_df = dummies_df.dropna()

# separate features and target
X = dummies_df.drop('Survived', axis=1)
y = dummies_df['Survived']

# split the data into training and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

# instantiate a logistic regression model
logreg = LogisticRegression(max_iter=1000, random_state=42)

# train the model
logreg.fit(X_train, y_train)

# make predictions on test set
y_pred = logreg.predict(X_test)

# measure accuracy
score = accuracy_score(y_test, y_pred)
print("Accuracy:", score)

output - 
Accuracy: 0.8186046511627907

Note –

If you are going to do One-Hot Encoding for a machine learning model, it is better to use scikit-learn's OneHotEncoder instead of pandas get_dummies. Sometimes the training set contains all the unique values in a column, but the test set contains only some of them. Because get_dummies creates columns only for the values it actually sees, the encoded test set then has fewer columns than the training set. The model will throw an error because it receives fewer features than it was trained on, and you won't be able to use it. OneHotEncoder learns the categories when it is fitted on the training set and produces the same columns every time it transforms new data, so it handles these kinds of problems automatically.
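A minimal sketch of the difference, using a made-up Embarked column where the test set is missing two of the training values. The tiny frames are assumptions for illustration. handle_unknown="ignore" is an optional setting, added here so that values never seen during fitting are encoded as all zeros instead of raising an error.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical split: the test set lacks the values "C" and "Q".
train = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})
test = pd.DataFrame({"Embarked": ["S", "S"]})

# get_dummies encodes each frame on its own, so the column counts differ.
print(pd.get_dummies(train).shape[1], pd.get_dummies(test).shape[1])

# OneHotEncoder learns the categories once, from the training data only.
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train)

# transform() returns a sparse matrix by default; toarray() makes it dense.
train_enc = encoder.transform(train).toarray()
test_enc = encoder.transform(test).toarray()
print(train_enc.shape, test_enc.shape)
```

Both encoded arrays end up with one column per category learned from the training set, so a model trained on train_enc can also predict on test_enc.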
