How to Detect Outliers in a dataset in Python?

In this post, you will learn several ways to detect outliers in a dataset.

What is an outlier in data?

In statistics, an outlier is a data point that differs significantly from the other observations. A dataset can contain outliers for genuine reasons, or because of errors during the data collection process. Let’s see how to find outliers in a dataset.

Finding Outliers in a dataset –

1. Detecting outliers using the 1.5*IQR rule –

A very common method of finding outliers is the 1.5*IQR rule. This rule says that any data point greater than Q3 + 1.5*IQR or less than Q1 - 1.5*IQR is an outlier. Q1 is the first quartile and Q3 is the third quartile: Q1 is the value below which 25% of the data lies, and Q3 is the value below which 75% of the data lies. The IQR (interquartile range) is the difference Q3 - Q1, which measures the spread of the middle 50% of the data.
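To make the rule concrete, here is a minimal worked sketch on a tiny array. The sample values are made up purely for illustration and are not taken from any real dataset; the quartiles use NumPy's default linear interpolation.

```python
import numpy as np

# Hypothetical sample values, chosen only to illustrate the rule
values = np.array([10, 12, 13, 15, 16, 18, 20, 100])

# Quartiles with NumPy's default (linear interpolation) method
q1, q3 = np.percentile(values, [25, 75])   # 12.75 and 18.5
iqr = q3 - q1                              # 5.75

# Fences: anything outside [lower_fence, upper_fence] counts as an outlier
lower_fence = q1 - 1.5 * iqr               # 4.125
upper_fence = q3 + 1.5 * iqr               # 27.125

outliers = values[(values < lower_fence) | (values > upper_fence)]
print(outliers)  # [100]
```

Only the value 100 falls outside the fences, so it is the single outlier in this sample.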

Let’s create a function to find the index of outliers in a dataset.

import pandas as pd
import numpy as np

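def detect_outliers(x):
    """Return the positions of values outside the 1.5*IQR fences."""
    # First and third quartiles of the data
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    lower_bound = q1 - (1.5 * iqr)
    upper_bound = q3 + (1.5 * iqr)
    # np.where returns a tuple holding one array of positions
    return np.where((x > upper_bound) | (x < lower_bound))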

Next, read a dataset to work with.

url = "https://raw.githubusercontent.com/bprasad26/lwd/master/data/batting.csv"
df = pd.read_csv(url)
df.head()

Now, let’s apply the function to get the index of the outliers.

detect_outliers(df['Runs'])

Output:
(array([  0, 100, 101, 102, 200, 300, 301, 302, 500, 600, 601, 700],
       dtype=int64),)
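The function returns a tuple holding one array of positions, so the usual next step is to look up or drop the matching rows. The sketch below shows one way to do that. It uses a small hypothetical DataFrame (the `Player` names and `Runs` values are invented) in place of the batting data, and repeats the function so the snippet runs on its own.

```python
import numpy as np
import pandas as pd

def detect_outliers(x):
    # Same 1.5*IQR rule as in the post
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    lower_bound = q1 - (1.5 * iqr)
    upper_bound = q3 + (1.5 * iqr)
    return np.where((x > upper_bound) | (x < lower_bound))

# Hypothetical stand-in for the batting data (values are made up)
scores = pd.DataFrame({
    "Player": ["A", "B", "C", "D", "E", "F", "G", "H"],
    "Runs": [45, 52, 48, 50, 47, 300, 49, 51],
})

# Take the first (and only) array out of the tuple
positions = detect_outliers(scores["Runs"])[0]

# Inspect the outlier rows by position
outlier_rows = scores.iloc[positions]
print(outlier_rows)

# Build a copy of the data without the outlier rows
cleaned = scores.drop(scores.index[positions])
print(len(cleaned))
```

Whether to drop outliers or keep them depends on why they occur: a data-entry error can usually be removed, but a genuine extreme value may be worth keeping.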

2. Detecting outliers using a box plot –

Another way to detect outliers in a dataset is to draw a box plot.

import plotly.express as px
fig = px.box(df, y='Runs')
fig.show()

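Every data point drawn as an individual marker beyond the whiskers is an outlier. By default, the whiskers reach out to the most extreme points within 1.5*IQR of the quartiles, so the plot applies the same 1.5*IQR idea visually.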
