Pandas – drop_duplicates() – remove duplicate data in pandas.


In this post, you will learn –

1. Find duplicate data in pandas.

2. Drop duplicate rows in pandas.

3. Drop duplicate data based on a single column

4. Drop duplicate data based on multiple columns

5. Keep – Determining which row to keep and drop.

1. Find duplicate data in pandas.

Let’s read a dataset to work with.

import pandas as pd

df = pd.read_csv("https://raw.githubusercontent.com/bprasad26/lwd/master/data/clothing_store_sales.csv")
df = df[['Method of Payment','Gender','Marital Status']].head(10)
df

To find duplicate rows in a dataframe, we can use the duplicated method.

df.duplicated()

This returns a boolean Series. A row is marked True when it repeats an earlier row, and the first occurrence of each row is marked False. To see the actual rows, pass this Series to the loc indexer.

df.loc[df.duplicated()]

To count the rows that are duplicates of an earlier row, sum the boolean Series.

df.duplicated().sum()
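The three steps above can be tried on a small frame without downloading anything. This is a minimal sketch: the `sample` rows are made up for illustration and are not the rows from the CSV. It also shows `duplicated(keep=False)`, which flags every copy of a repeated row rather than only the later ones.

```python
import pandas as pd

# Hypothetical sample data, not the actual rows from the CSV above.
sample = pd.DataFrame({
    "Method of Payment": ["Card", "Cash", "Card", "Card", "Cash"],
    "Gender": ["Female", "Male", "Female", "Female", "Male"],
    "Marital Status": ["Married", "Single", "Married", "Single", "Single"],
})

# True marks a row that repeats an earlier row; first occurrences stay False.
mask = sample.duplicated()

# Boolean indexing with loc returns only the repeated rows.
repeated_rows = sample.loc[mask]

# True counts as 1, so the sum is the number of repeated rows.
n_repeated = int(mask.sum())

# keep=False flags every copy of a repeated row, including the first one.
all_copies = sample.loc[sample.duplicated(keep=False)]

print(repeated_rows)
print(n_repeated)
print(all_copies)
```

Here rows 2 and 4 repeat rows 0 and 1, so the default mask flags two rows while `keep=False` flags all four copies.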

2. Drop duplicate rows in pandas

To drop duplicate rows in pandas, use the drop_duplicates method. It returns a new dataframe in which only the first occurrence of each duplicated row is kept, and it leaves the original dataframe unchanged. If you want to modify the dataframe in place instead, use the inplace parameter like this: df.drop_duplicates(inplace=True)

df.drop_duplicates()
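The difference between the returned copy and the in-place version is easy to miss, so here is a small sketch of both. The `sample` frame is hypothetical. The `ignore_index=True` argument is an optional extra not covered in this post; it renumbers the result from 0 and is available in pandas 1.0 and later.

```python
import pandas as pd

# Hypothetical sample data for illustration only.
sample = pd.DataFrame({
    "Gender": ["Female", "Female", "Male", "Male"],
    "Marital Status": ["Married", "Married", "Single", "Single"],
})

# Returns a new DataFrame; the surviving rows keep their original labels (0 and 2).
deduped = sample.drop_duplicates()

# Same result, but the index is renumbered 0, 1.
renumbered = sample.drop_duplicates(ignore_index=True)

# The original frame still has all four rows at this point.
rows_before = len(sample)

# inplace=True modifies sample itself and returns None.
result = sample.drop_duplicates(inplace=True)
rows_after = len(sample)

print(deduped)
print(renumbered)
print(rows_before, rows_after, result)
```

Reassigning the result (`df = df.drop_duplicates()`) has the same effect as `inplace=True` and is often easier to read in a chain of operations.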

3. Drop duplicate data based on a single column

To drop duplicate data based on a particular column, you have to use the subset parameter.

df.drop_duplicates(subset='Marital Status')

Since every row in this column contains the same value, pandas drops all the duplicate rows and keeps only the first one.
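To make the effect of subset visible, the sketch below compares a whole-row comparison with a single-column one. The `sample` values are invented for the example. Only the listed column is used to decide what counts as a duplicate, but the rows that survive keep all of their columns.

```python
import pandas as pd

# Hypothetical sample data: every row has the same "Marital Status".
sample = pd.DataFrame({
    "Gender": ["Female", "Male", "Female"],
    "Marital Status": ["Married", "Married", "Married"],
})

# Comparing whole rows: only row 2 repeats an earlier row (row 0).
by_row = sample.drop_duplicates()

# Comparing "Marital Status" only: rows 1 and 2 both repeat row 0.
by_status = sample.drop_duplicates(subset="Marital Status")

print(by_row)
print(by_status)
```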

4. Drop duplicate data based on multiple columns

To delete duplicate rows based on multiple columns, pass the column names as a list to the subset parameter.

df.drop_duplicates(subset=['Gender','Marital Status'])
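With a list of columns, a row counts as a duplicate only when its values match an earlier row in every listed column. A short sketch on made-up data (the `sample` frame is hypothetical) shows rows that are distinct as whole rows but duplicated on the chosen pair of columns.

```python
import pandas as pd

# Hypothetical sample data for illustration only.
sample = pd.DataFrame({
    "Method of Payment": ["Card", "Cash", "Card", "Cash"],
    "Gender": ["Female", "Female", "Male", "Female"],
    "Marital Status": ["Married", "Married", "Married", "Single"],
})

# All four rows differ when every column is compared.
by_row = sample.drop_duplicates()

# Row 1 matches row 0 on both Gender and Marital Status, so it is dropped.
by_pair = sample.drop_duplicates(subset=["Gender", "Marital Status"])

print(by_row)
print(by_pair)
```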

5. Keep – Determining which row to keep and drop

The keep parameter lets you decide which of the duplicate rows, if any, is kept.

keep : {‘first’, ‘last’, False}, default ‘first’

first – the default. Drops every duplicate row except the first occurrence.

last – drops every duplicate row except the last occurrence.

False – drops every row that has a duplicate, so no copy is kept.

df.drop_duplicates(keep='last')
df.drop_duplicates(keep=False)
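The three keep options are easiest to compare by looking at which index labels survive. The following sketch uses a hypothetical `sample` frame in which rows 0 and 2 match, rows 1 and 3 match, and row 4 is unique.

```python
import pandas as pd

# Hypothetical sample data for illustration only.
sample = pd.DataFrame({
    "Gender": ["Female", "Male", "Female", "Male", "Female"],
    "Marital Status": ["Married", "Single", "Married", "Single", "Single"],
})

# Default: keep the first occurrence of each repeated row.
kept_first = sample.drop_duplicates(keep="first")

# Keep the last occurrence instead.
kept_last = sample.drop_duplicates(keep="last")

# Keep none of the repeated rows; only rows that were never duplicated remain.
kept_none = sample.drop_duplicates(keep=False)

print(kept_first.index.tolist())
print(kept_last.index.tolist())
print(kept_none.index.tolist())
```

The surviving rows stay in their original order in all three cases; only the choice of which copy survives changes.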
