One of the easiest ways to reduce the dimensionality of a dataset is to remove highly correlated features. If two features are highly correlated, they carry very similar information, so including both is likely redundant. In that case, it is better to remove one of them from the feature set.
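To make the redundancy idea concrete, here is a minimal sketch on made-up data. The frame `toy` and its columns `a`, `b`, and `c` are hypothetical and exist only for this illustration: `b` is a rescaled, slightly noisy copy of `a`, while `c` is generated independently.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

# Hypothetical toy data: "b" is a rescaled copy of "a" plus small noise,
# while "c" is drawn independently of both.
a = rng.normal(size=500)
toy = pd.DataFrame({
    "a": a,
    "b": 2.0 * a + rng.normal(scale=0.05, size=500),
    "c": rng.normal(size=500),
})

# Absolute Pearson correlation between every pair of columns
toy_corr = toy.corr().abs()
print(toy_corr.round(2))
```

The `a`/`b` entry should come out very close to 1, which suggests that keeping both columns adds little information, while the entries involving `c` should stay near 0.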
How to remove highly correlated features?
# import libraries
import pandas as pd
import numpy as np
from sklearn import datasets

# load a dataset
housing = datasets.fetch_california_housing()
X = pd.DataFrame(housing.data, columns=housing.feature_names)
y = housing.target

# create the absolute correlation matrix
corr_matrix = X.corr().abs()

# select the upper triangle of the correlation matrix
# (np.bool was removed from recent NumPy releases, so use the built-in bool)
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))

# find the columns with a correlation greater than 0.95
to_drop = [column for column in upper.columns if any(upper[column] > 0.95)]

# drop the columns (to_drop already holds column names, so pass it directly)
X = X.drop(columns=to_drop)
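The steps above can also be wrapped in a small reusable function. The following is only a sketch: the function name `drop_correlated_features`, its `threshold` parameter, and the synthetic `demo` frame are assumptions made for illustration and do not come from pandas or scikit-learn.

```python
import numpy as np
import pandas as pd


def drop_correlated_features(df, threshold=0.95):
    """Hypothetical helper: return a copy of df without the columns whose
    absolute correlation with an earlier column exceeds threshold,
    together with the list of dropped column names."""
    corr = df.corr().abs()
    # Keep only entries strictly above the diagonal so each pair is checked once.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    upper = corr.where(mask)
    # A column is dropped if it correlates too strongly with any earlier column.
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop), to_drop


# Synthetic data, made up for this example
rng = np.random.default_rng(42)
x1 = rng.normal(size=200)
demo = pd.DataFrame({
    "x1": x1,
    "x1_copy": 3.0 * x1 + 1.0,  # exact linear function of x1
    "x2": rng.normal(size=200),  # independent of x1
})

reduced, dropped = drop_correlated_features(demo)
print(dropped)  # ['x1_copy']
```

Because only the upper triangle is inspected, the later column of each correlated pair is the one that gets removed, so the result depends on column order. The 0.95 threshold is the value used above; a different cutoff may suit other datasets better.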