
In this post, you will learn how to do One-Hot Encoding in pandas using the pd.get_dummies() function.
What is One-Hot Encoding?
One-Hot Encoding is a method of converting categorical data to numeric data: for every unique value in the categorical column, we create a new numeric column. Let’s take a column that records the quality of wine as an example.

When the quality of the wine is bad, the bad column gets a value of 1 and all the other columns get a value of 0. When the quality is medium, the medium column gets a value of 1 and all the other columns get a value of 0. In every row exactly one column is "hot" (set to 1), which is why it is called One-Hot Encoding.
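The wine example can be reproduced with a few lines of pandas. This is a minimal sketch on made-up data: the `wine` DataFrame, the `quality` column name, and its values are assumptions chosen only to mirror the description above.

```python
import pandas as pd

# Hypothetical toy data: the column name and values are illustrative only
wine = pd.DataFrame({"quality": ["bad", "medium", "good", "medium"]})

# One new column is created per unique value in "quality".
# dtype=int requests 0/1 integers (recent pandas versions default to booleans).
encoded = pd.get_dummies(wine, columns=["quality"], dtype=int)
print(encoded)
```

The result has the columns `quality_bad`, `quality_good`, and `quality_medium`, and each row contains a single 1 in the column that matches its original value.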
How to do One-Hot Encoding in Pandas –
Let’s read a dataset to work with.
import pandas as pd
url = "https://raw.githubusercontent.com/bprasad26/lwd/master/data/titanic.csv"
df = pd.read_csv(url)
df = df[['Survived','Pclass','Sex','Age','Fare','Embarked']]
df.head()

Now, to do One-Hot Encoding in pandas, we use the pd.get_dummies() function.
dummies_df = pd.get_dummies(df)
dummies_df.head()
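When called on a whole DataFrame, pd.get_dummies() encodes only the string and categorical columns and leaves numeric columns such as Pclass unchanged. The sketch below illustrates this on a small hand-made frame; the `sample` DataFrame and its rows are hypothetical stand-ins shaped like the Titanic columns, not the real data.

```python
import pandas as pd

# Hypothetical rows shaped like the Titanic columns used above
sample = pd.DataFrame({
    "Pclass": [3, 1, 2],
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "Q"],
})

# Default behaviour: only string/categorical columns are encoded,
# so the integer-coded Pclass column stays as it is
default_encoded = pd.get_dummies(sample, dtype=int)
print(default_encoded.columns.tolist())

# Listing columns explicitly also encodes the integer-coded Pclass
all_encoded = pd.get_dummies(
    sample, columns=["Pclass", "Sex", "Embarked"], dtype=int
)
print(all_encoded.columns.tolist())
```

Whether a numeric code like Pclass should be one-hot encoded depends on the model; the `columns` argument simply makes the choice explicit.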

Now, we can train a logistic regression model on this data.
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
# drop rows that contain NaN
dummies_df.dropna(inplace=True)
# separate features and target
X = dummies_df.drop('Survived', axis=1)
y = dummies_df['Survived']
# split the data into training and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
# instantiate a logistic regression model
logreg = LogisticRegression(max_iter=1000, random_state=42)
# train the model
logreg.fit(X_train, y_train)
# make predictions on test set
y_pred = logreg.predict(X_test)
# measure accuracy
score = accuracy_score(y_test, y_pred)
print("Accuracy:", score)
Output (the exact value varies between runs because train_test_split shuffles the data randomly) –
Accuracy: 0.8186046511627907
Note –
If you are going to do One-Hot Encoding for a machine learning model, it is usually better to use scikit-learn's OneHotEncoder instead of pandas get_dummies. Sometimes the training set contains all the unique values in a column, but the test set contains only some of them. If you call get_dummies on each set separately, the test set ends up with fewer columns than the training set, and the model raises an error at prediction time because the number of features does not match what it was trained on. OneHotEncoder avoids this problem: it learns the categories from the training set when you call fit and then produces the same columns for any data you later pass to transform.
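The difference can be seen on a tiny example. This is a sketch under stated assumptions: the `train` and `test` frames are hypothetical, built so that the test set lacks one category, and `handle_unknown="ignore"` is an optional setting added here so that categories never seen during training would be encoded as all zeros rather than raising an error.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical train/test data in which the test set lacks the "Q" category
train = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})
test = pd.DataFrame({"Embarked": ["S", "C"]})

# get_dummies applied to each set separately: the column counts differ
print(pd.get_dummies(train).shape[1], pd.get_dummies(test).shape[1])

# OneHotEncoder learns the categories from the training set only
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train)

# transform returns a sparse matrix by default, so convert it for display
train_encoded = encoder.transform(train).toarray()
test_encoded = encoder.transform(test).toarray()

print(encoder.categories_)
print(train_encoded.shape, test_encoded.shape)
```

Both encoded arrays have three columns, one per category seen during fit, so a model trained on `train_encoded` can accept `test_encoded` without a feature mismatch.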