How to remove Highly Correlated Features from a dataset


One of the easiest ways to reduce the dimensionality of a dataset is to remove highly correlated features. The idea is that if two features are highly correlated, the information they contain is very similar, so including both is likely redundant. It is usually better to remove one of them from the feature set.
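To make the redundancy idea concrete, here is a minimal sketch on synthetic data. The column names (height_cm, height_in, age) and the noise levels are made up for illustration and are not part of any real dataset. The same height recorded in two units is almost perfectly correlated, while an unrelated column is not.

```python
import numpy as np
import pandas as pd

# Synthetic, illustrative data: the same quantity measured in two units
rng = np.random.default_rng(0)
height_cm = rng.normal(170, 10, size=500)

toy = pd.DataFrame({
    "height_cm": height_cm,
    # inches = cm / 2.54, plus a little measurement noise
    "height_in": height_cm / 2.54 + rng.normal(0, 0.1, size=500),
    # an unrelated feature drawn independently of height
    "age": rng.integers(18, 80, size=500).astype(float),
})

# Absolute pairwise Pearson correlations between all columns
toy_corr = toy.corr().abs()
print(toy_corr.round(2))
```

In this toy frame, keeping both height columns adds almost no information, which is the situation the approach in this post is meant to detect.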

How to remove highly correlated features?

# import libraries
import pandas as pd
import numpy as np
from sklearn import datasets

# load a dataset
housing = datasets.fetch_california_housing()
X = pd.DataFrame(housing.data, columns=housing.feature_names)
y = housing.target

# create the absolute correlation matrix
corr_matrix = X.corr().abs()

# select the upper triangle of the correlation matrix (excluding the diagonal)
# note: np.bool was removed from recent NumPy versions, so use the built-in bool
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))

# find the names of columns whose correlation with an earlier column is greater than 0.95
to_drop = [column for column in upper.columns if any(upper[column] > 0.95)]

# drop the columns (to_drop holds column names; drop returns a new DataFrame)
X = X.drop(columns=to_drop)
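
The steps above can be wrapped in a small reusable helper. This is only a sketch: the function name drop_highly_correlated, its threshold parameter, and the synthetic demo frame are assumptions made for illustration, not part of pandas or scikit-learn.

```python
import numpy as np
import pandas as pd

def drop_highly_correlated(df, threshold=0.95):
    """Hypothetical helper: drop one column from each highly correlated pair.

    Returns the reduced DataFrame and the list of dropped column names.
    """
    corr = df.corr().abs()
    # Boolean mask for the strict upper triangle (k=1 skips the diagonal)
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    upper = corr.where(mask)
    # A column is dropped if it correlates above the threshold
    # with any column that appears before it
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop), to_drop

# Synthetic demo: "a_copy" is a linear transform of "a", "b" is independent
rng = np.random.default_rng(42)
a = rng.normal(size=200)
demo = pd.DataFrame({"a": a, "a_copy": a * 2.0 + 1.0, "b": rng.normal(size=200)})

reduced, dropped = drop_highly_correlated(demo, threshold=0.95)
print(dropped)
print(list(reduced.columns))
```

Because only the upper triangle is inspected, the first column of each correlated pair is kept and the later one is dropped, so the result depends on column order. The 0.95 threshold is a choice rather than a rule, and a lower value removes more features.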
