Feature Selection Using Variance Threshold in sklearn


What is Variance Thresholding?

The idea behind variance thresholding is that features with low variance are less likely to be useful than features with high variance. Variance thresholding first computes the variance of each feature and then drops every feature whose variance does not meet a chosen threshold.
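The idea is easy to see on a toy array. In this sketch (the data values are made up for illustration), the second column is nearly constant, so its variance falls below the threshold and the column is dropped:

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# toy data: the second column is nearly constant (low variance)
X = np.array([[1.0, 0.0],
              [2.0, 0.0],
              [3.0, 0.1],
              [4.0, 0.0]])

selector = VarianceThreshold(threshold=0.5)
X_selected = selector.fit_transform(X)

print(selector.variances_)  # per-feature variances: [1.25, ~0.002]
print(X_selected.shape)     # only the first column survives -> (4, 1)
```

Note that `VarianceThreshold` looks only at the features themselves, not the target, so it is an unsupervised selection step.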

How to Do Feature Selection Using Variance Threshold?

import pandas as pd
from sklearn import datasets
from sklearn.feature_selection import VarianceThreshold

# load a dataset
housing = datasets.fetch_california_housing()
X = pd.DataFrame(housing.data, columns=housing.feature_names)
y = housing.target

# create thresholder
thresholder = VarianceThreshold(threshold=0.5)

# keep only the high-variance features
X_high_variance = thresholder.fit_transform(X)

# show high variance features
X_high_variance

output - 
array([[   8.3252    ,   41.        ,    6.98412698, ...,    2.55555556,
          37.88      , -122.23      ],
       [   8.3014    ,   21.        ,    6.23813708, ...,    2.10984183,
          37.86      , -122.22      ],
       [   7.2574    ,   52.        ,    8.28813559, ...,    2.80225989,
          37.85      , -122.24      ],
       ...,
       [   1.7       ,   17.        ,    5.20554273, ...,    2.3256351 ,
          39.43      , -121.22      ],
       [   1.8672    ,   18.        ,    5.32951289, ...,    2.12320917,
          39.43      , -121.32      ],
       [   2.3886    ,   16.        ,    5.25471698, ...,    2.61698113,
          39.37      , -121.24      ]])
# shape after thresholding
X_high_variance.shape

output - 
(20640, 7)
# see the variances
thresholder.variances_

output - 
array([3.60914769e+00, 1.58388586e+02, 6.12123614e+00, 2.24580619e-01,
       1.28240832e+06, 1.07864799e+02, 4.56207160e+00, 4.01394488e+00])

Only the fourth feature, AveBedrms, has a variance (≈ 0.22) below the 0.5 threshold, which is why the eight original columns shrink to seven.
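Since `fit_transform` returns a bare NumPy array, the column names are lost. The thresholder's `get_support()` method returns a boolean mask that you can apply to the DataFrame's columns to see which features survived. A minimal sketch with a small synthetic DataFrame (the column names and values here are illustrative, not the housing data):

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# illustrative DataFrame: "noise" is nearly constant
df = pd.DataFrame({
    "MedInc":   [8.3, 8.3, 7.2, 1.7, 1.9],
    "HouseAge": [41, 21, 52, 17, 18],
    "noise":    [0.01, 0.02, 0.01, 0.02, 0.01],
})

thresholder = VarianceThreshold(threshold=0.5)
thresholder.fit(df)

# boolean mask of kept features -> column names
kept = df.columns[thresholder.get_support()]
print(list(kept))  # ['MedInc', 'HouseAge']
```

One caveat worth keeping in mind: variance depends on the feature's scale, so unscaled features with large units (like Population above, variance ≈ 1.28e+06) will almost always pass the threshold regardless of how informative they are.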
