
Introduction
Preparing data for machine learning involves an essential step known as feature scaling, which ensures all features are on a comparable scale. This matters because a feature with large numeric values can dominate a feature with smaller values, skewing the model's learning process. Python's Scikit-Learn library is a handy tool for this task, providing a variety of scaling techniques. In this guide, we will use a sample dataset to illustrate how to rescale data for machine learning in Python using Scikit-Learn.
For our example, let’s assume we have a dataset related to house pricing, where we have two features: ‘Size’ (in square feet) and ‘Bedrooms’. We want to scale these features.
Preparing the Data
import pandas as pd

# create a simple dataframe with two features
data = {'Size': [750, 800, 850, 900, 950],
        'Bedrooms': [1, 2, 2, 3, 4]}
df = pd.DataFrame(data)
Standardization (Standard Scaler)
Standardization rescales each feature by subtracting its mean and dividing by its standard deviation, so that the result has a mean of zero and a standard deviation of one: z = (x - mean) / std.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
df_standardized = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)
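As a quick sanity check, the standardized columns should have zero mean and unit standard deviation. A minimal, self-contained sketch using the same sample data:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Same sample data as above
data = {'Size': [750, 800, 850, 900, 950],
        'Bedrooms': [1, 2, 2, 3, 4]}
df = pd.DataFrame(data)

df_standardized = pd.DataFrame(StandardScaler().fit_transform(df),
                               columns=df.columns)

# StandardScaler uses the population standard deviation (ddof=0),
# so we check against std(ddof=0) rather than pandas' default ddof=1
print(df_standardized.mean().round(6))
print(df_standardized.std(ddof=0).round(6))
```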
Normalization (Min-Max Scaler)
Normalization, also known as min-max scaling, adjusts the features to fit into a fixed range, typically 0 to 1, or -1 to 1 if the original data contains negative values.
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
df_normalized = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)
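Under min-max scaling each value becomes (x - min) / (max - min), so every column ends up spanning exactly [0, 1]. A self-contained check against that formula, using the same sample data:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

data = {'Size': [750, 800, 850, 900, 950],
        'Bedrooms': [1, 2, 2, 3, 4]}
df = pd.DataFrame(data)

df_normalized = pd.DataFrame(MinMaxScaler().fit_transform(df),
                             columns=df.columns)

# The manual min-max formula yields the same result
manual = (df - df.min()) / (df.max() - df.min())
print(df_normalized)
```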
Robust Scaler
Robust Scaler removes the median and scales by the interquartile range (IQR), so it is less susceptible to outliers than standardization, which uses the mean and standard deviation.
from sklearn.preprocessing import RobustScaler
scaler = RobustScaler()
df_robust = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)
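Per column, RobustScaler computes (x - median) / IQR, where the IQR is the gap between the 25th and 75th percentiles; because medians and quartiles barely move when extreme values appear, outliers have little influence. A self-contained check against the manual formula:

```python
import pandas as pd
from sklearn.preprocessing import RobustScaler

data = {'Size': [750, 800, 850, 900, 950],
        'Bedrooms': [1, 2, 2, 3, 4]}
df = pd.DataFrame(data)

df_robust = pd.DataFrame(RobustScaler().fit_transform(df),
                         columns=df.columns)

# Manual equivalent: subtract the median, divide by the interquartile range
manual = (df - df.median()) / (df.quantile(0.75) - df.quantile(0.25))
print(df_robust)
```

The median 'Size' (850 sq ft) maps to exactly 0, which makes the scaled values easy to interpret as distances from the typical house.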
Quantile Transformer Scaler
The Quantile Transformer Scaler transforms features to follow a uniform or a normal distribution. It spreads out the most frequent values and reduces the impact of outliers.
from sklearn.preprocessing import QuantileTransformer
scaler = QuantileTransformer(n_quantiles=5)  # n_quantiles must not exceed the number of samples
df_quantile = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)
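With the default uniform output, each value is mapped to its position in the empirical cumulative distribution, so every transformed column spans [0, 1] while preserving rank order. A self-contained sketch (note that on our tiny dataset, n_quantiles is capped at the number of samples, here 5, to avoid a warning):

```python
import pandas as pd
from sklearn.preprocessing import QuantileTransformer

data = {'Size': [750, 800, 850, 900, 950],
        'Bedrooms': [1, 2, 2, 3, 4]}
df = pd.DataFrame(data)

# n_quantiles may not exceed the number of samples
scaler = QuantileTransformer(n_quantiles=5)
df_quantile = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)

# The smallest 'Size' maps to 0, the largest to 1, ranks are preserved
print(df_quantile['Size'].tolist())
```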
Power Transformer Scaler
The Power Transformer is a family of parametric, monotonic transformations that aim to map data from any distribution to as close to a Gaussian distribution as possible. Scikit-Learn supports two methods: 'yeo-johnson' (the default, which handles zero and negative values) and 'box-cox' (which requires strictly positive data).
from sklearn.preprocessing import PowerTransformer
scaler = PowerTransformer()
df_power = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)
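By default PowerTransformer applies the Yeo-Johnson transform and then standardizes the result (standardize=True), so the output columns end up with zero mean and unit variance on top of the Gaussian-shaping. A self-contained check:

```python
import pandas as pd
from sklearn.preprocessing import PowerTransformer

data = {'Size': [750, 800, 850, 900, 950],
        'Bedrooms': [1, 2, 2, 3, 4]}
df = pd.DataFrame(data)

# method='yeo-johnson' and standardize=True are the defaults
scaler = PowerTransformer()
df_power = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)

print(df_power.mean().round(6))       # approximately 0 for each column
print(df_power.std(ddof=0).round(6))  # approximately 1 for each column
```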
Unit Vector Scaler (Normalization)
Unit Vector Scaling, implemented by Scikit-Learn's Normalizer, works row-wise: it rescales each sample (row) to have unit norm, rather than scaling each feature column. It is commonly used in text classification and clustering, where the direction of a feature vector matters more than its magnitude.
from sklearn.preprocessing import Normalizer
scaler = Normalizer()
df_unit = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)
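Because Normalizer operates on rows rather than columns, each sample ends up with an L2 norm of 1 (norm='l2' is the default). A self-contained check:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import Normalizer

data = {'Size': [750, 800, 850, 900, 950],
        'Bedrooms': [1, 2, 2, 3, 4]}
df = pd.DataFrame(data)

df_unit = pd.DataFrame(Normalizer().fit_transform(df), columns=df.columns)

# Every row now has unit length: sqrt(Size^2 + Bedrooms^2) == 1
print(np.linalg.norm(df_unit.values, axis=1))
```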
Conclusion
Rescaling data is a key preprocessing step in machine learning, ensuring all numerical features used by the model are on a similar scale, which is crucial for the performance of many algorithms. The Scikit-Learn library in Python provides several ways to rescale data, each of which may be best suited to different types of applications. One practical caveat: fit the scaler on the training set only, then apply the fitted transform to the test set, so information from the test data does not leak into training. By familiarizing yourself with these tools, you can make informed decisions about the best way to preprocess your data for machine learning.