Rescaling Data for Machine Learning in Python Using Scikit-Learn


Introduction

Preparing data for machine learning involves an essential step known as feature scaling, which puts all features on a comparable scale. This matters because features with large numeric ranges can dominate those with smaller ranges, skewing the model's learning process. Python's Scikit-Learn library provides a variety of scaling techniques for this task. In this guide, we will use a sample dataset to illustrate how to rescale data for machine learning in Python using Scikit-Learn.

For our example, let’s assume we have a dataset related to house pricing, where we have two features: ‘Size’ (in square feet) and ‘Bedrooms’. We want to scale these features.

Preparing the Data

import pandas as pd

# create a simple dataframe
data = {'Size': [750, 800, 850, 900, 950],
        'Bedrooms': [1, 2, 2, 3, 4]
        }

df = pd.DataFrame(data)

Standardization (Standard Scaler)

Standardization adjusts the values so that they have a mean of zero and a standard deviation of one.

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
df_standardized = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)
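As a quick sanity check on the same toy dataset, each standardized column should have a mean of (approximately) zero and, because StandardScaler uses the population standard deviation, a standard deviation of one. A minimal, self-contained sketch:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# same toy house-pricing data as above
df = pd.DataFrame({'Size': [750, 800, 850, 900, 950],
                   'Bedrooms': [1, 2, 2, 3, 4]})

scaler = StandardScaler()
df_standardized = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)

# column means are ~0 and population standard deviations are ~1
print(df_standardized.mean().abs().max() < 1e-9)    # True
print(np.allclose(df_standardized.std(ddof=0), 1))  # True
```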

Normalization (Min-Max Scaler)

Normalization, also known as min-max scaling, adjusts the features to fit into a fixed range, typically 0 to 1, or -1 to 1 if the original data contains negative values.

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
df_normalized = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)
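To verify, each column of the transformed frame should now span exactly [0, 1] on our toy data (a minimal sketch):

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# same toy house-pricing data as above
df = pd.DataFrame({'Size': [750, 800, 850, 900, 950],
                   'Bedrooms': [1, 2, 2, 3, 4]})

scaler = MinMaxScaler()
df_normalized = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)

# each column's minimum maps to 0 and its maximum maps to 1
print(df_normalized.min().tolist())  # → [0.0, 0.0]
print(df_normalized.max().tolist())  # → [1.0, 1.0]
```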

Robust Scaler

The Robust Scaler removes the median and scales the data according to the interquartile range (IQR). Because medians and quartiles are robust statistics, it is far less susceptible to outliers than standardization.

from sklearn.preprocessing import RobustScaler

scaler = RobustScaler()
df_robust = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)
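Because RobustScaler centers each column on its median, a quick check is that the scaled columns have a median of zero (a minimal sketch on the same toy data):

```python
import pandas as pd
from sklearn.preprocessing import RobustScaler

# same toy house-pricing data as above
df = pd.DataFrame({'Size': [750, 800, 850, 900, 950],
                   'Bedrooms': [1, 2, 2, 3, 4]})

scaler = RobustScaler()
df_robust = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)

# the median is subtracted out, so scaled medians are 0
print(df_robust.median().tolist())  # → [0.0, 0.0]
```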

Quantile Transformer Scaler

The Quantile Transformer Scaler transforms features to follow a uniform or a normal distribution. It spreads out the most frequent values and reduces the impact of outliers.

from sklearn.preprocessing import QuantileTransformer

# n_quantiles defaults to 1000 but cannot exceed the number of samples,
# so we set it explicitly to avoid a warning on this tiny dataset
scaler = QuantileTransformer(n_quantiles=5)
df_quantile = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)
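One practical caveat: n_quantiles defaults to 1000 and cannot exceed the number of samples, so on a tiny dataset like ours Scikit-Learn will warn and clip it unless you set it explicitly. A minimal sketch that does so and checks that the (default, uniform) output stays inside [0, 1]:

```python
import pandas as pd
from sklearn.preprocessing import QuantileTransformer

# same toy house-pricing data as above
df = pd.DataFrame({'Size': [750, 800, 850, 900, 950],
                   'Bedrooms': [1, 2, 2, 3, 4]})

# n_quantiles must not exceed n_samples; 'uniform' is the default output
scaler = QuantileTransformer(n_quantiles=5, output_distribution='uniform')
df_quantile = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)

# all transformed values fall within the [0, 1] range of the uniform distribution
print(df_quantile.values.min() >= 0.0 and df_quantile.values.max() <= 1.0)  # True
```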

Power Transformer Scaler

The Power Transformer is a family of parametric, monotonic transformations that aim to map data from any distribution to as close to a Gaussian distribution as possible.

from sklearn.preprocessing import PowerTransformer

scaler = PowerTransformer()
df_power = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)
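Because PowerTransformer applies zero-mean, unit-variance scaling by default (standardize=True), the transformed columns should come out standardized as well. A quick check on the same toy data:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import PowerTransformer

# same toy house-pricing data as above
df = pd.DataFrame({'Size': [750, 800, 850, 900, 950],
                   'Bedrooms': [1, 2, 2, 3, 4]})

scaler = PowerTransformer()  # method='yeo-johnson' and standardize=True by default
df_power = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)

# the Yeo-Johnson output is also standardized: means ~0, population stds ~1
print(np.allclose(df_power.mean(), 0))       # True
print(np.allclose(df_power.std(ddof=0), 1))  # True
```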

Unit Vector Scaler (Normalization)

Unlike the scalers above, which operate column by column, the Normalizer rescales each sample (row) to unit norm. It is commonly used in text classification and clustering, where the direction of a feature vector matters more than its magnitude.

from sklearn.preprocessing import Normalizer

scaler = Normalizer()
df_unit = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)
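Since the Normalizer works on each sample independently, a good check is that every row now has unit Euclidean length (a minimal sketch on the same toy data):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import Normalizer

# same toy house-pricing data as above
df = pd.DataFrame({'Size': [750, 800, 850, 900, 950],
                   'Bedrooms': [1, 2, 2, 3, 4]})

scaler = Normalizer()  # norm='l2' by default
df_unit = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)

# each row is now a unit vector: its L2 norm is 1
row_norms = np.sqrt((df_unit ** 2).sum(axis=1))
print(row_norms.round(6).tolist())  # → [1.0, 1.0, 1.0, 1.0, 1.0]
```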

Conclusion

Rescaling data is a key preprocessing step in machine learning, ensuring all numerical features used by the model are on a similar scale. This process is crucial for optimal performance of many machine learning algorithms. The Scikit-Learn library in Python provides several ways to rescale data, each of which may be best suited to different types of applications. By familiarizing yourself with these tools, you can make informed decisions about the best way to preprocess your data for machine learning.
