How to Handle Outliers in a dataset in Python


In this post, you will learn several ways to handle outliers in a dataset in Python.

1 . Remove Outliers –

One of the simplest ways to handle outliers is to remove them from the data. If you believe that the outliers in the dataset come from errors in the data collection process, you should either remove them or replace them with NaN.

Let’s read a dataset for illustration.

import pandas as pd
import numpy as np

url = "https://raw.githubusercontent.com/bprasad26/lwd/master/data/batting.csv"
df = pd.read_csv(url)
df.head()

To remove the outliers, we will use the IQR method. If you are not familiar with the IQR method, please read this post: How to Detect Outliers in a dataset in Python.

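# IQR method to remove outliers
# q1 and q3 are the 25th and 75th percentiles of the Runs column
q1, q3 = np.percentile(df['Runs'], [25, 75])
iqr = q3 - q1
# Values beyond 1.5 * IQR from the quartiles are treated as outliers
lower_bound = q1 - (1.5 * iqr)
upper_bound = q3 + (1.5 * iqr)

# Keep only the rows that lie within the bounds (inclusive)
df = df[(df['Runs'] >= lower_bound) & (df['Runs'] <= upper_bound)]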
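
This section also mentions replacing outliers with NaN instead of dropping the rows. Here is a minimal sketch of that option on a small made-up DataFrame. The `toy` values and the `Runs_clean` column name are assumptions for illustration and are not part of the batting dataset.

```python
import numpy as np
import pandas as pd

# Made-up scores for illustration; 400 is the planted outlier.
toy = pd.DataFrame({"Runs": [12, 25, 31, 40, 45, 52, 60, 400]})

q1, q3 = np.percentile(toy["Runs"], [25, 75])
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# Series.where keeps values where the condition is True and
# puts NaN everywhere else, so no rows are dropped.
inside = toy["Runs"].between(lower_bound, upper_bound)
toy["Runs_clean"] = toy["Runs"].where(inside)

print(toy)
```

Keeping the row and blanking only the suspicious value preserves the other columns of that record, and the NaN can later be imputed or skipped by functions that ignore missing values.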

2 . Mark them as Outliers and Use them as a Feature –

If, on the other hand, you believe that the outliers are genuine observations, you should keep them: either mark them with an indicator column that can be used as a feature, or transform their values.

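# Flag rows outside the IQR bounds (1 = outlier, 0 = not an outlier).
# Run this on the unfiltered DataFrame: once the outliers have been
# removed as in step 1, no remaining row would be flagged.
df['Outlier'] = np.where((df['Runs'] > upper_bound) | (df['Runs'] < lower_bound), 1, 0)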
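
The flagging step can also be wrapped in a small helper so that the bounds and the flag are always computed from the same, unfiltered data. This is only a sketch: the function name `iqr_outlier_flag`, its `k` parameter, and the `toy` data are hypothetical names and values introduced for illustration.

```python
import numpy as np
import pandas as pd

def iqr_outlier_flag(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return 1 for values outside [q1 - k*iqr, q3 + k*iqr], else 0."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    outside = (values < q1 - k * iqr) | (values > q3 + k * iqr)
    return outside.astype(int)

# Made-up scores for illustration; 400 is the planted outlier.
toy = pd.DataFrame({"Runs": [12, 25, 31, 40, 45, 52, 60, 400]})
toy["Outlier"] = iqr_outlier_flag(toy["Runs"])
print(toy)
```

On this toy data only the last row (400 runs) is flagged, and every row is kept, so a model can use the `Outlier` column as an extra feature.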

3 . Transform the outliers –

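To reduce the effect of outliers in a dataset, we can apply a log transformation, which compresses large values much more than small ones. Note that the logarithm is only defined for positive numbers, so zeros and negative values need special handling.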

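# np.log1p computes log(1 + x), which stays finite when Runs is 0;
# np.log(df['Runs']) also works if every value is strictly positive.
df['log_of_runs'] = np.log1p(df['Runs'])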
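
To see how much a log transformation pulls an extreme value back towards the rest of the data, here is a small sketch on made-up numbers. The `toy` values are assumptions for illustration, and `np.log1p` (log of 1 + x) is used here so that a score of 0 does not produce negative infinity.

```python
import numpy as np
import pandas as pd

# Made-up scores for illustration, including a 0 and one extreme value.
toy = pd.DataFrame({"Runs": [0, 12, 25, 31, 40, 45, 52, 60, 400]})

# log1p(x) = log(1 + x), so a score of 0 maps to 0 instead of -inf.
toy["log_of_runs"] = np.log1p(toy["Runs"])

# Ratio of the largest value to the median, before and after the transform.
raw_ratio = toy["Runs"].max() / toy["Runs"].median()
log_ratio = toy["log_of_runs"].max() / toy["log_of_runs"].median()
print(round(raw_ratio, 2), round(log_ratio, 2))
```

On the raw scale the largest score is 10 times the median, while on the log scale it is less than twice the median, so the extreme value has far less leverage on statistics such as the mean.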

Related Posts –

1 . How to detect outliers in a dataset in Python
