
What is Variance Thresholding?
The idea behind variance thresholding is that features with low variance are less likely to be useful than features with high variance, because a feature that barely changes from one sample to the next carries little information for telling samples apart. In variance thresholding, we first calculate the variance of each feature and then drop every feature whose variance does not exceed a chosen threshold.
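To make the two steps concrete, here is a minimal sketch of the same idea written by hand with NumPy. The function name `variance_threshold_select` and the toy array are made up for illustration. The strict "greater than" comparison and the population variance (ddof=0) are assumptions meant to mirror how scikit-learn's VarianceThreshold behaves.

```python
import numpy as np

def variance_threshold_select(X, threshold=0.0):
    """Hypothetical helper: keep columns whose variance is strictly above threshold."""
    X = np.asarray(X, dtype=float)
    variances = X.var(axis=0)      # population variance (ddof=0) of each column
    keep = variances > threshold   # boolean mask, True for columns to keep
    return X[:, keep], variances, keep

# Toy data: column 0 barely changes, column 1 varies a lot, column 2 is constant
X_toy = np.array([
    [1.0, 10.0, 0.0],
    [1.1, 20.0, 0.0],
    [0.9, 30.0, 0.0],
    [1.0, 40.0, 0.0],
])

X_selected, variances, keep = variance_threshold_select(X_toy, threshold=0.5)
print(variances)         # per-column variances
print(keep)              # which columns pass the threshold
print(X_selected.shape)  # shape after dropping low variance columns
```

With a threshold of 0.5, only the middle column is kept here, since the other two have variances far below the threshold.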
How to Do Feature Selection Using a Variance Threshold?
import pandas as pd
from sklearn import datasets
from sklearn.feature_selection import VarianceThreshold
# load a dataset
housing = datasets.fetch_california_housing()
X = pd.DataFrame(housing.data, columns=housing.feature_names)
y = housing.target
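# create a thresholder that removes features whose variance is 0.5 or lower
thresholder = VarianceThreshold(threshold=0.5)
# fit on X and keep only the features whose variance exceeds the threshold
X_high_variance = thresholder.fit_transform(X)
# show the remaining high variance features (returned as a NumPy array)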
X_high_variance
output -
array([[ 8.3252 , 41. , 6.98412698, ..., 2.55555556,
         37.88 , -122.23 ],
       [ 8.3014 , 21. , 6.23813708, ..., 2.10984183,
         37.86 , -122.22 ],
       [ 7.2574 , 52. , 8.28813559, ..., 2.80225989,
         37.85 , -122.24 ],
       ...,
       [ 1.7 , 17. , 5.20554273, ..., 2.3256351 ,
         39.43 , -121.22 ],
       [ 1.8672 , 18. , 5.32951289, ..., 2.12320917,
         39.43 , -121.32 ],
       [ 2.3886 , 16. , 5.25471698, ..., 2.61698113,
         39.37 , -121.24 ]])
# shape after thresholding
X_high_variance.shape
output -
(20640, 7)
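The shape tells us that one of the eight original columns was removed, but not which one. A minimal sketch of how the surviving column names could be recovered with the selector's `get_support()` mask is shown below. It uses a small made-up DataFrame (the column names are hypothetical) so that it runs without downloading the housing data. The same pattern should apply to `X` and `thresholder` from the example above.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Hypothetical toy data, not the housing dataset
X_toy = pd.DataFrame({
    "nearly_constant": [1.0, 1.1, 0.9, 1.0],
    "spread_out": [10.0, 20.0, 30.0, 40.0],
    "constant": [0.0, 0.0, 0.0, 0.0],
})

toy_thresholder = VarianceThreshold(threshold=0.5)
toy_thresholder.fit(X_toy)

# get_support() returns a boolean mask with one entry per input column
mask = toy_thresholder.get_support()
kept = X_toy.columns[mask].tolist()
dropped = X_toy.columns[~mask].tolist()

# index the DataFrame with the mask to keep the column names
X_toy_selected = X_toy.loc[:, mask]

print("kept:", kept)
print("dropped:", dropped)
```

Selecting with the mask instead of using the array returned by `fit_transform` keeps the result as a DataFrame with readable column names.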
# see the variances
thresholder.variances_
output -
array([3.60914769e+00, 1.58388586e+02, 6.12123614e+00, 2.24580619e-01,
       1.28240832e+06, 1.07864799e+02, 4.56207160e+00, 4.01394488e+00])
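The variances are easier to read when paired with feature names. The sketch below copies the values printed above and labels them, assuming the columns are in the order that `fetch_california_housing` returns them (MedInc, HouseAge, AveRooms, AveBedrms, Population, AveOccup, Latitude, Longitude). It then checks which features fall at or below the 0.5 threshold.

```python
# Feature names, assumed to be in the order returned by fetch_california_housing
feature_names = ["MedInc", "HouseAge", "AveRooms", "AveBedrms",
                 "Population", "AveOccup", "Latitude", "Longitude"]

# Variances copied from the thresholder.variances_ output above
variances = [3.60914769e+00, 1.58388586e+02, 6.12123614e+00, 2.24580619e-01,
             1.28240832e+06, 1.07864799e+02, 4.56207160e+00, 4.01394488e+00]

threshold = 0.5

# Split the features by whether their variance exceeds the threshold
kept = [name for name, var in zip(feature_names, variances) if var > threshold]
dropped = [name for name, var in zip(feature_names, variances) if var <= threshold]

for name, var in zip(feature_names, variances):
    print(f"{name:<12}{var:>16.4f}")

print("dropped:", dropped)
```

Under that ordering assumption, the only variance below 0.5 belongs to AveBedrms, which matches the 7 remaining columns in the shape shown earlier.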