In this post, you will learn several ways to handle outliers in a dataset in Python.
1. Remove Outliers
One of the simplest ways to handle outliers is to remove them from the data. If you believe the outliers come from errors made during data collection, you should remove them or replace them with NaN.
Let’s read a dataset for illustration.
import pandas as pd
import numpy as np

url = "https://raw.githubusercontent.com/bprasad26/lwd/master/data/batting.csv"
df = pd.read_csv(url)
df.head()
# IQR method to remove outliers
q1, q3 = np.percentile(df['Runs'], [25, 75])
iqr = q3 - q1
lower_bound = q1 - (1.5 * iqr)
upper_bound = q3 + (1.5 * iqr)
# Keep only rows strictly inside the bounds; df itself stays intact for the later steps
df_no_outliers = df[(df['Runs'] > lower_bound) & (df['Runs'] < upper_bound)]
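The text above also mentions replacing outliers with NaN instead of dropping rows. Here is a minimal sketch of that option on a small made-up Series (the values and the name `runs` are assumptions for illustration, not taken from the batting dataset). It uses `Series.between`, which is inclusive on both ends by default, so values exactly on a bound are kept.

```python
import numpy as np
import pandas as pd

# Hypothetical sample values; 400 is the obvious outlier.
runs = pd.Series([12, 25, 31, 40, 44, 52, 58, 400], name="Runs")

# IQR bounds computed the same way as in the removal example.
q1, q3 = runs.quantile([0.25, 0.75])
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# Keep values inside the bounds; everything else becomes NaN.
runs_masked = runs.where(runs.between(lower_bound, upper_bound))
print(runs_masked)
```

Masking keeps the row count unchanged, which can be convenient when other columns in the same rows are still valid.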
2. Mark Them as Outliers and Use Them as a Feature
If you believe the outliers are genuine observations rather than errors, you should keep them and either mark them with an indicator column that can be used as a feature, or transform their values.
df['Outlier'] = np.where((df['Runs'] > upper_bound) | (df['Runs'] < lower_bound), 1, 0)
3. Transform the Outliers
To reduce the effect of outliers in a dataset, we can apply a log transformation, which compresses large values much more than small ones.
# Vectorized natural log of the Runs column (no Python-level loop needed)
df['log_of_runs'] = np.log(df['Runs'])
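One caveat: `np.log` returns `-inf` for zero and NaN for negative values, which matters if a column can contain zeros. Below is a hedged sketch on made-up values (the Series is an assumption, not the batting data) using `np.log1p`, which computes log(1 + x) and so stays finite at zero.

```python
import numpy as np
import pandas as pd

# Hypothetical values including a zero and a large outlier.
runs = pd.Series([0, 10, 50, 100, 10000])

# np.log1p(x) computes log(1 + x), so zero maps to 0.0 rather than -inf.
log1p_runs = np.log1p(runs)

# np.expm1 inverts the transform if the original units are needed again.
restored = np.expm1(log1p_runs)

print(log1p_runs.round(3).tolist())
print(np.allclose(restored, runs))
```

Whether `log` or `log1p` is the better fit depends on the data; `log1p` is only a suggestion for columns where zeros are possible.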