In this post, you will learn how to do One-Hot Encoding in pandas using the pd.get_dummies() function.
What is One-Hot Encoding?
One-Hot Encoding is a method of converting categorical data to numeric data: for every unique value in a categorical column, we create a new numeric column. Let’s take the quality of wine as an example.
When the quality of a wine is bad, the bad column gets a value of 1 and all the other columns get a value of 0. When the quality is medium, the medium column gets a value of 1 and all the other columns get a value of 0. In every row exactly one column is "hot" (set to 1), which is why it is called One-Hot Encoding.
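To make the description above concrete, here is a minimal sketch on a made-up wine quality column. The DataFrame, its values, and the variable names are assumptions for illustration only and are not part of the dataset used later in this post. Passing dtype=int asks for 0/1 values, since recent pandas versions return True/False by default.

```python
import pandas as pd

# Hypothetical toy data: wine quality labels (illustration only)
wine = pd.DataFrame({"quality": ["bad", "medium", "good", "medium"]})

# One new column per unique value; dtype=int gives 0/1 instead of True/False
encoded = pd.get_dummies(wine["quality"], dtype=int)
print(encoded)

# Every row has exactly one "hot" column
print(encoded.sum(axis=1).tolist())
```

Each row of the result has a 1 in the column that matches its original label and 0 everywhere else.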
How to do One-Hot Encoding in Pandas
Let’s read a dataset to work with.
    import pandas as pd

    url = "https://raw.githubusercontent.com/bprasad26/lwd/master/data/titanic.csv"
    df = pd.read_csv(url)
    df = df[['Survived', 'Pclass', 'Sex', 'Age', 'Fare', 'Embarked']]
    df.head()
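Now, to do One-Hot Encoding in Pandas we use the pd.get_dummies() function. When you pass it a whole DataFrame, it encodes the text columns (here Sex and Embarked) and leaves the numeric columns unchanged.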
    dummies_df = pd.get_dummies(df)
    dummies_df.head()
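By default, get_dummies only converts columns with an object, string, or category dtype, so a category stored as integers (such as Pclass) is passed through as a plain number. The sketch below shows this on a few made-up rows shaped like the Titanic columns used above. The rows and variable names are assumptions for illustration, and whether Pclass should be encoded at all is a modelling choice rather than a rule.

```python
import pandas as pd

# Hypothetical rows shaped like the Titanic columns used in this post
sample = pd.DataFrame({
    "Pclass": [3, 1, 2],
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "Q"],
})

# Default: only the text columns are encoded, Pclass stays numeric
default_dummies = pd.get_dummies(sample)
print(default_dummies.columns.tolist())

# List the columns explicitly to one-hot encode Pclass as well
all_dummies = pd.get_dummies(sample, columns=["Pclass", "Sex", "Embarked"])
print(all_dummies.columns.tolist())
```

The new columns are named with the original column name as a prefix, for example Sex_female and Embarked_S.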
Now, we can train a logistic regression model on this data.
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score

    # drop rows that contain NaN
    dummies_df.dropna(inplace=True)

    # separate features and target
    X = dummies_df.drop('Survived', axis=1)
    y = dummies_df['Survived']

    # split the data into training and test sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

    # instantiate a logistic regression model
    logreg = LogisticRegression(max_iter=1000, random_state=42)

    # train the model
    logreg.fit(X_train, y_train)

    # make predictions on the test set
    y_pred = logreg.predict(X_test)

    # measure accuracy
    score = accuracy_score(y_test, y_pred)
    print("Accuracy:", score)

Output:

    Accuracy: 0.8186046511627907

Because train_test_split is called without a random_state here, the exact accuracy will vary from run to run.
If you are going to use One-Hot Encoding to prepare data for a model, it is better to use scikit-learn's OneHotEncoder instead of pandas get_dummies. Sometimes the training set contains all the unique values of a column, but the test set (or new data at prediction time) contains only some of them. If you call get_dummies on each set separately, the test set ends up with fewer columns than the training set, and scikit-learn will throw an error because the model receives fewer features than it was trained on, so you won’t be able to make predictions. OneHotEncoder avoids this problem: it learns the categories from the training data when it is fitted and then produces the same columns for any data it transforms.