How to Remove Duplicates from a Pandas DataFrame


Data duplication is a common problem in data analysis. Duplicate data can distort your results, leading to inaccuracies in your analyses and misleading conclusions. Fortunately, with the powerful Python library Pandas, it’s relatively easy to deal with duplicates.

This guide walks through several ways to identify and remove duplicate rows from a DataFrame using Pandas.

Creating a DataFrame

First, let’s create a DataFrame. For this guide, we will work with a simple DataFrame:

import pandas as pd

data = {
    'Name': ['John', 'Anna', 'John', 'Charles', 'Anna', 'Charles'],
    'Age': [28, 22, 28, 24, 22, 24],
    'City': ['New York', 'Los Angeles', 'New York', 'Chicago', 'Los Angeles', 'Chicago']
}

df = pd.DataFrame(data)

print(df)

Identifying Duplicates

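Pandas provides the duplicated() method, which returns a Boolean Series with one entry per row. By default, a row is marked True only if it repeats an earlier row, so the first occurrence of each duplicated row is marked False. Here’s how to use it: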

print(df.duplicated())

By default, duplicated() considers all columns.
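As a minimal sketch of what the Boolean Series looks like for the sample data above, the following rebuilds the DataFrame and uses the mask to count and display the duplicate rows. The variable name mask is just an illustrative choice.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Anna', 'John', 'Charles', 'Anna', 'Charles'],
    'Age': [28, 22, 28, 24, 22, 24],
    'City': ['New York', 'Los Angeles', 'New York', 'Chicago', 'Los Angeles', 'Chicago'],
})

# True for every row that repeats an earlier row (first occurrences stay False)
mask = df.duplicated()
print(mask.tolist())   # [False, False, True, False, True, True]

# Summing a Boolean Series counts the True values
print(mask.sum())      # 3

# Boolean indexing shows only the rows flagged as duplicates
print(df[mask])
```

Inspecting the flagged rows like this before deleting anything is a cheap way to confirm that they really are unwanted repeats.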

Removing Duplicates

To remove duplicates, we use the drop_duplicates() method. It returns a new DataFrame with the duplicate rows removed and leaves the original unchanged:

df_no_duplicates = df.drop_duplicates()

print(df_no_duplicates)

By default, drop_duplicates() considers all columns and keeps the first occurrence of each duplicate.
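One detail worth knowing is that the surviving rows keep their original index labels. The sketch below, using the same sample data, shows this and one way to renumber the result with the ignore_index parameter (available in reasonably recent Pandas versions; the variable names are illustrative).

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Anna', 'John', 'Charles', 'Anna', 'Charles'],
    'Age': [28, 22, 28, 24, 22, 24],
    'City': ['New York', 'Los Angeles', 'New York', 'Chicago', 'Los Angeles', 'Chicago'],
})

# The first occurrences survive and keep their original labels
deduped = df.drop_duplicates()
print(deduped.index.tolist())      # [0, 1, 3]

# ignore_index=True relabels the result 0, 1, 2, ...
renumbered = df.drop_duplicates(ignore_index=True)
print(renumbered.index.tolist())   # [0, 1, 2]

# The original DataFrame still has all six rows
print(len(df))                     # 6
```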

Keeping the Last Occurrence

If you want to keep the last occurrence of each duplicated row instead of the first, pass keep='last':

df_no_duplicates = df.drop_duplicates(keep='last')

print(df_no_duplicates)

In this example, df_no_duplicates contains only the last occurrence of each duplicated row, which for the sample data means the rows originally at positions 2, 4, and 5.
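The keep parameter also accepts False, which drops every row that has a duplicate rather than keeping one copy. A short sketch comparing the three settings on the sample data (the variable names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Anna', 'John', 'Charles', 'Anna', 'Charles'],
    'Age': [28, 22, 28, 24, 22, 24],
    'City': ['New York', 'Los Angeles', 'New York', 'Chicago', 'Los Angeles', 'Chicago'],
})

keep_first = df.drop_duplicates(keep='first')   # default behaviour
keep_last = df.drop_duplicates(keep='last')
keep_none = df.drop_duplicates(keep=False)      # drop all copies of duplicated rows

print(keep_first.index.tolist())   # [0, 1, 3]
print(keep_last.index.tolist())    # [2, 4, 5]
print(keep_none.index.tolist())    # []
```

Because every row in the sample data appears twice, keep=False leaves an empty DataFrame here. On data with some unique rows, only those unique rows would remain.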

Considering Certain Columns

By default, duplicated() and drop_duplicates() compare all columns. If you want to compare only certain columns, pass them as a list to the subset parameter:

print(df.duplicated(subset=['Name']))
df_no_duplicates = df.drop_duplicates(subset=['Name'])

print(df_no_duplicates)

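In this example, both methods compare only the ‘Name’ column, so a row counts as a duplicate whenever its name has already appeared, even if its other columns differ. Hence, only the first occurrence of each name is kept in df_no_duplicates.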
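In the sample data, rows with the same name are identical in every column, so subset does not change the result. The sketch below uses a small hypothetical DataFrame (not from the guide's data) in which two ‘John’ rows differ in Age and City, to show where subset makes a difference.

```python
import pandas as pd

# Hypothetical data: the two 'John' rows differ in Age and City
people = pd.DataFrame({
    'Name': ['John', 'Anna', 'John'],
    'Age': [28, 22, 35],
    'City': ['New York', 'Los Angeles', 'Boston'],
})

# Comparing all columns: no row is a full duplicate, so nothing is removed
all_columns = people.drop_duplicates()
print(len(all_columns))                  # 3

# Comparing only 'Name': the second 'John' is treated as a duplicate
by_name = people.drop_duplicates(subset=['Name'])
print(by_name['City'].tolist())          # ['New York', 'Los Angeles']

# subset can be combined with keep to retain the later 'John' instead
by_name_last = people.drop_duplicates(subset=['Name'], keep='last')
print(by_name_last['City'].tolist())     # ['Los Angeles', 'Boston']
```

Note that deduplicating on a subset discards whatever differed in the other columns, so it is worth deciding deliberately which occurrence to keep.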

Removing Duplicates Inplace

By default, drop_duplicates() returns a new DataFrame and does not modify the original. If you want to remove duplicates from the original DataFrame, you can use the inplace parameter:

df.drop_duplicates(inplace=True)

print(df)

In this example, the duplicates are removed from df itself.
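An equivalent pattern is to assign the result of drop_duplicates() back to the same variable. The sketch below runs both approaches on copies of the sample data and checks that they leave the same rows; which one to prefer is largely a matter of style, and the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Anna', 'John', 'Charles', 'Anna', 'Charles'],
    'Age': [28, 22, 28, 24, 22, 24],
    'City': ['New York', 'Los Angeles', 'New York', 'Chicago', 'Los Angeles', 'Chicago'],
})

# Approach 1: modify the DataFrame in place
in_place = df.copy()
in_place.drop_duplicates(inplace=True)

# Approach 2: reassign the returned DataFrame to the same name
reassigned = df.copy()
reassigned = reassigned.drop_duplicates()

print(len(in_place))                  # 3
print(in_place.equals(reassigned))    # True
```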

Conclusion

In this guide, we learned how to identify and remove duplicates from a DataFrame using Pandas. We covered how to flag duplicate rows with duplicated(), how to remove them with drop_duplicates(), how to control which occurrence is kept with the keep parameter, how to compare only certain columns with the subset parameter, and how to modify the original DataFrame with the inplace parameter.

Removing duplicates is a crucial step in data cleaning and preparation, and understanding how to handle them using Pandas will help ensure the accuracy of your data analysis.
