Min-Max Scaling (normalization) –
Many machine learning algorithms, such as distance-based methods like k-nearest neighbors and models trained with gradient descent, perform poorly when the features have very different scales. Several techniques bring features onto a common scale; one of them is Min-Max Scaling.
Min-Max Scaling uses the minimum and maximum value of a feature to rescale its values into a fixed range, typically 0 to 1 or -1 to 1. Scikit-Learn provides a MinMaxScaler class that does min-max scaling for us.
Formula for Min-Max Scaling –
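$$x' = \frac{x - \min(x)}{\max(x) - \min(x)}$$
where $x$ is an original value, $x'$ is the normalized value, and $\min(x)$ and $\max(x)$ are the minimum and maximum of that feature.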
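To make the idea concrete, here is a minimal sketch in plain Python: subtract the feature's minimum from each value, then divide by the feature's range (maximum minus minimum). The helper name min_max_scale and the sample values are hypothetical, and mapping a constant feature to 0.0 is an assumption made here to avoid dividing by zero.

```python
def min_max_scale(values):
    """Rescale a list of numbers to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant feature has no range, so map it to 0.0
        return [0.0 for _ in values]
    # Subtract the minimum, then divide by the range (max - min)
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical sample feature
ages = [10, 20, 30, 50]
scaled = min_max_scale(ages)
print(scaled)  # [0.0, 0.25, 0.5, 1.0]
```

The smallest value maps to 0, the largest to 1, and everything else lands proportionally in between.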
Let’s see how to do it.
    # import libraries
    import pandas as pd
    from sklearn import datasets

    # get features and target
    housing = datasets.fetch_california_housing()
    X = housing.data
    y = housing.target

    # create pandas dataframe
    X = pd.DataFrame(X, columns=housing.feature_names)
    X.head()
Here we have some housing data. Let’s now apply Min-Max scaling.
    from sklearn.preprocessing import MinMaxScaler

    # apply min-max scaling
    minmax_scaler = MinMaxScaler(feature_range=(0, 1))
    scaled_feature = minmax_scaler.fit_transform(X)
    scaled_feature[:3]

output -

    array([[0.53966842, 0.78431373, 0.0435123 , 0.02046866, 0.00894083,
            0.00149943, 0.5674814 , 0.21115538],
           [0.53802706, 0.39215686, 0.03822395, 0.01892926, 0.0672104 ,
            0.00114074, 0.565356  , 0.21215139],
           [0.46602805, 1.        , 0.05275646, 0.02194011, 0.01381765,
            0.00169796, 0.5642933 , 0.21015936]])
By default, MinMaxScaler scales each feature to between 0 and 1. If you need a different range, you can set it with the feature_range parameter.
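As a small illustration of feature_range, the sketch below rescales a tiny made-up column to the range -1 to 1 and then maps it back with inverse_transform. The array X_small is hypothetical sample data, not part of the housing dataset.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical single-feature data with min 1 and max 5
X_small = np.array([[1.0], [2.0], [3.0], [5.0]])

# Rescale to [-1, 1] instead of the default [0, 1]
scaler = MinMaxScaler(feature_range=(-1, 1))
X_scaled = scaler.fit_transform(X_small)
print(X_scaled.ravel())

# inverse_transform maps scaled values back to the original units
X_restored = scaler.inverse_transform(X_scaled)
print(X_restored.ravel())
```

The minimum of the column maps to -1 and the maximum to 1, so the scaled values here should be -1, -0.5, 0 and 1.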
Let’s see how to apply Min-Max Scaling in an end-to-end machine learning problem.
    # import libraries
    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import MinMaxScaler
    from sklearn.pipeline import make_pipeline
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_squared_error

    # split the data into training and test
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=42)

    # create a ml model using pipeline
    model = make_pipeline(MinMaxScaler(), LinearRegression())

    # fit the model on training data
    model.fit(X_train, y_train)

    # test the model on test set
    y_pred = model.predict(X_test)

    # measure error
    mse = mean_squared_error(y_test, y_pred)
    rmse = np.sqrt(mse)
    print("Root Mean Square Error:", rmse)

output -

    Root Mean Square Error: 0.7284008391515451
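One thing the pipeline handles for us is that the scaler learns its minimum and maximum from the training data only, and then reuses them on the test data. The sketch below shows roughly the same steps done by hand on tiny made-up arrays (X_train_demo and X_test_demo are hypothetical, not the housing data), which also shows that test values can fall outside the 0 to 1 range.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical training and test values for a single feature
X_train_demo = np.array([[0.0], [5.0], [10.0]])
X_test_demo = np.array([[-2.0], [4.0], [12.0]])

scaler = MinMaxScaler()

# Learn min (0) and max (10) from the training data only
scaler.fit(X_train_demo)

# Reuse the training min and max on the test data
X_test_scaled = scaler.transform(X_test_demo)
print(X_test_scaled.ravel())
```

Because -2 is below the training minimum and 12 is above the training maximum, their scaled values come out below 0 and above 1. That is expected behaviour and not a bug in the scaler.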