
In this post, you will learn various ways to detect outliers in a dataset.
What is an outlier in data?
In statistics, an outlier is a data point that differs significantly from the other observations. A dataset can contain outliers for genuine reasons, or because of errors made during the data collection process. Let’s see how to find outliers in a dataset.
Finding outliers in a dataset
1. Detecting outliers using the 1.5*IQR rule
A very common method of finding outliers is the 1.5*IQR rule. This rule says that any data point greater than Q3 + 1.5*IQR or less than Q1 - 1.5*IQR is an outlier. Q1 is the first quartile and Q3 is the third quartile: Q1 is the value below which 25% of the data lies, and Q3 is the value below which 75% of the data lies. The IQR (interquartile range) is the difference Q3 - Q1, which is the range covered by the middle 50% of the data.
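To make the arithmetic concrete, here is a minimal sketch of the rule applied by hand to a tiny sample. The `scores` array is made up for illustration and is not taken from the batting dataset used in this post; the quartiles come from NumPy's default (linear interpolation) percentile method.

```python
import numpy as np

# Hypothetical sample, invented for this illustration only.
scores = np.array([10, 12, 14, 16, 18, 20, 22, 24, 90])

# First and third quartiles (NumPy's default linear interpolation).
q1, q3 = np.percentile(scores, [25, 75])
iqr = q3 - q1

# Bounds of the 1.5*IQR rule.
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# Values that fall outside the bounds.
outliers = scores[(scores < lower_bound) | (scores > upper_bound)]

print(f"Q1={q1}, Q3={q3}, IQR={iqr}")
print(f"lower bound={lower_bound}, upper bound={upper_bound}")
print("outliers:", outliers)
```

For this sample, Q1 is 14 and Q3 is 22, so the IQR is 8 and the bounds are 2 and 34. Only the value 90 falls outside them.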
Let’s create a function to find the indices of the outliers in a dataset.
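import pandas as pd
import numpy as np

def detect_outliers(x):
    # First and third quartiles of the data
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    # Bounds of the 1.5*IQR rule
    lower_bound = q1 - (1.5 * iqr)
    upper_bound = q3 + (1.5 * iqr)
    # Positions of the values that fall outside the bounds
    return np.where((x > upper_bound) | (x < lower_bound))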
Next, read a dataset to work with.
url = "https://raw.githubusercontent.com/bprasad26/lwd/master/data/batting.csv"
df = pd.read_csv(url)
df.head()

Now, let’s apply the function to get the indices of the outliers.
detect_outliers(df['Runs'])
output -
(array([ 0, 100, 101, 102, 200, 300, 301, 302, 500, 600, 601, 700],
dtype=int64),)
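The result is a one-element tuple holding an array of row positions, so a natural next step is to look up the rows themselves. The sketch below assumes a small stand-in DataFrame (the `Player` column and all values are hypothetical, not the real batting data) and repeats the function so that it runs on its own.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the batting data; names and values are made up.
df = pd.DataFrame({
    "Player": ["A", "B", "C", "D", "E", "F", "G", "H", "I"],
    "Runs": [10, 12, 14, 16, 18, 20, 22, 24, 90],
})

def detect_outliers(x):
    # Same 1.5*IQR logic as the function defined earlier in the post.
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    lower_bound = q1 - (1.5 * iqr)
    upper_bound = q3 + (1.5 * iqr)
    return np.where((x > upper_bound) | (x < lower_bound))

# np.where returns a one-element tuple; its first item holds the positions.
positions = detect_outliers(df["Runs"])[0]

# iloc selects rows by integer position, which is what np.where reports.
outlier_rows = df.iloc[positions]
print(outlier_rows)
```

Using `iloc` rather than `loc` matters when the DataFrame index is not the default 0, 1, 2, ... range, because `np.where` reports positions, not index labels.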
2. Detecting outliers using a box plot
Another way to detect outliers in a dataset is to use a box plot.
import plotly.express as px
fig = px.box(df, y='Runs')
fig.show()

Each data point drawn as a circle outside the whiskers is an outlier.
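To connect the plot back to the 1.5*IQR rule, here is a rough sketch of where the whiskers would end under the common convention that each whisker stops at the most extreme data point still inside the 1.5*IQR bounds. The `values` array is hypothetical, and the quartiles use NumPy's default method; a plotting library may interpolate quartiles slightly differently, so borderline points could be classified differently in the actual chart.

```python
import numpy as np

# Hypothetical sample with one low and one high extreme value.
values = np.array([1, 30, 32, 34, 36, 38, 40, 42, 44, 95])

q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

# Bounds of the 1.5*IQR rule (often called fences).
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers end at the most extreme points that are still inside the fences.
inside = values[(values >= lower_fence) & (values <= upper_fence)]
lower_whisker, upper_whisker = inside.min(), inside.max()

# Everything beyond the fences would be drawn as an individual point.
beyond = values[(values < lower_fence) | (values > upper_fence)]

print("whiskers:", lower_whisker, upper_whisker)
print("points beyond the whiskers:", beyond)
```

In this sample the fences are 19 and 55, the whiskers end at 30 and 44, and the values 1 and 95 lie beyond them.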