In this post, you will learn two ways to detect outliers in a dataset.
What is an outlier in data?
In statistics, an outlier is a data point that differs significantly from the other observations. A dataset can contain outliers for genuine reasons, or because of errors made during the data collection process. Let’s see how to find outliers in a dataset.
Finding outliers in a dataset
1. Detecting outliers using the 1.5*IQR rule
A very common method of finding outliers is the 1.5*IQR rule. This rule says that any data point greater than Q3 + 1.5*IQR or less than Q1 - 1.5*IQR is an outlier. Q1 is the first quartile, the value below which 25% of the data lies, and Q3 is the third quartile, the value below which 75% of the data lies. The IQR (interquartile range) is the difference Q3 - Q1, which is the range covered by the middle 50% of the data in a distribution.
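To make the rule concrete, here is a small worked sketch. The eight values in `data` are made up for illustration and are not taken from the batting dataset used later in this post. The quartiles come from `np.percentile` with its default linear interpolation, so other quartile conventions may give slightly different bounds.

```python
import numpy as np

# Hypothetical sample values, chosen only to illustrate the rule
data = np.array([10, 12, 13, 15, 16, 18, 20, 95])

# First and third quartiles (default linear interpolation)
q1, q3 = np.percentile(data, [25, 75])   # 12.75 and 18.5
iqr = q3 - q1                            # 5.75

# Bounds defined by the 1.5*IQR rule
lower_bound = q1 - 1.5 * iqr             # 4.125
upper_bound = q3 + 1.5 * iqr             # 27.125

# Keep only the values that fall outside the bounds
outliers = data[(data < lower_bound) | (data > upper_bound)]
print(outliers)                          # [95]
```

Only 95 lies above the upper bound, and no value lies below the lower bound, so 95 is the single outlier in this sample.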
Let’s create a function that returns the indices of the outliers in a dataset.
import pandas as pd
import numpy as np

def detect_outliers(x):
    # First and third quartiles of the data
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    # Bounds defined by the 1.5*IQR rule
    lower_bound = q1 - (1.5 * iqr)
    upper_bound = q3 + (1.5 * iqr)
    # Positions of the values that fall outside the bounds
    return np.where((x > upper_bound) | (x < lower_bound))
Next, read a dataset to work with.
url = "https://raw.githubusercontent.com/bprasad26/lwd/master/data/batting.csv"
df = pd.read_csv(url)
df.head()
Now, let’s use the function to get the indices of the outliers.
detect_outliers(df['Runs'])
Output:
(array([ 0, 100, 101, 102, 200, 300, 301, 302, 500, 600, 601, 700], dtype=int64),)
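The function returns a tuple containing one array of positions, because that is what `np.where` produces when it is called with a single condition. A minimal sketch of how those positions could be used to inspect or drop the outlying rows is shown below. The `toy` DataFrame and its `Player` and `Runs` values are invented for illustration, so that the example runs without downloading the batting data.

```python
import numpy as np
import pandas as pd

def detect_outliers(x):
    # Same 1.5*IQR rule as the function defined earlier in this post
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    lower_bound = q1 - (1.5 * iqr)
    upper_bound = q3 + (1.5 * iqr)
    return np.where((x > upper_bound) | (x < lower_bound))

# Hypothetical data, not the batting dataset
toy = pd.DataFrame({
    "Player": ["A", "B", "C", "D", "E", "F", "G", "H"],
    "Runs": [10, 12, 13, 15, 16, 18, 20, 95],
})

# np.where returns a tuple, so take its first element
outlier_idx = detect_outliers(toy["Runs"])[0]

# The indices are positions, so select with iloc rather than loc
outlier_rows = toy.iloc[outlier_idx]

# Drop the outlying rows to get a cleaned copy
cleaned = toy.drop(toy.index[outlier_idx])

print(outlier_rows)
print(len(cleaned))
```

Using `iloc` matters when the DataFrame index is not the default 0, 1, 2, ... range, since the returned values are row positions and not index labels.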
2. Detecting outliers using a box plot
Another way to detect outliers in a dataset is to use a box plot.
import plotly.express as px

fig = px.box(df, y='Runs')
fig.show()
Every data point drawn as a circle outside the whiskers is an outlier.