
Data duplication is a common problem in data analysis. Duplicate data can distort your results, leading to inaccuracies in your analyses and misleading conclusions. Fortunately, with the powerful Python library Pandas, it’s relatively easy to deal with duplicates.
In this comprehensive guide, we will walk you through different ways to identify and remove duplicates from a DataFrame using Pandas.
Creating a DataFrame
First, let’s create a DataFrame. For this guide, we will work with a simple DataFrame:
import pandas as pd
data = {
    'Name': ['John', 'Anna', 'John', 'Charles', 'Anna', 'Charles'],
    'Age': [28, 22, 28, 24, 22, 24],
    'City': ['New York', 'Los Angeles', 'New York', 'Chicago', 'Los Angeles', 'Chicago']
}
df = pd.DataFrame(data)
print(df)
Identifying Duplicates
Pandas provides the duplicated() method, which returns a Boolean Series that is True for each row that repeats an earlier row. Here’s how to use it:
print(df.duplicated())
By default, duplicated() considers all columns.
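Because the result is a Boolean Series, it can also be summed to count the duplicates or used as a mask to inspect them. The following is a minimal sketch that reuses the sample data from this guide; the variable names mask, n_duplicates, and duplicate_rows are illustrative and not part of the original example.

```python
import pandas as pd

# Same sample data as in the guide
df = pd.DataFrame({
    'Name': ['John', 'Anna', 'John', 'Charles', 'Anna', 'Charles'],
    'Age': [28, 22, 28, 24, 22, 24],
    'City': ['New York', 'Los Angeles', 'New York', 'Chicago', 'Los Angeles', 'Chicago'],
})

# True for every row that repeats an earlier row (the first occurrence stays False)
mask = df.duplicated()

# True counts as 1, so the sum is the number of duplicate rows
n_duplicates = int(mask.sum())

# Boolean indexing selects only the flagged rows, keeping their original index labels
duplicate_rows = df[mask]

print(n_duplicates)                   # 3
print(duplicate_rows.index.tolist())  # [2, 4, 5]
```

Checking the count before dropping anything is a cheap way to confirm that the data contains as many duplicates as you expect.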
Removing Duplicates
To remove duplicates, we use the drop_duplicates() method. It returns a new DataFrame with the duplicates removed:
df_no_duplicates = df.drop_duplicates()
print(df_no_duplicates)
By default, drop_duplicates() considers all columns and keeps the first occurrence of each duplicate.
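One detail worth knowing is that the surviving rows keep their original index labels, so the index can have gaps after deduplication. The sketch below shows this on the sample data and, assuming pandas 1.0 or later, uses the ignore_index parameter to renumber the rows. The names deduped and renumbered are illustrative.

```python
import pandas as pd

# Same sample data as in the guide
df = pd.DataFrame({
    'Name': ['John', 'Anna', 'John', 'Charles', 'Anna', 'Charles'],
    'Age': [28, 22, 28, 24, 22, 24],
    'City': ['New York', 'Los Angeles', 'New York', 'Chicago', 'Los Angeles', 'Chicago'],
})

# Default behavior: the remaining rows keep their original labels
deduped = df.drop_duplicates()
print(deduped.index.tolist())     # [0, 1, 3]

# ignore_index=True (assumes pandas >= 1.0) relabels the result 0..n-1
renumbered = df.drop_duplicates(ignore_index=True)
print(renumbered.index.tolist())  # [0, 1, 2]
```

On older pandas versions, calling reset_index(drop=True) on the result should give the same renumbering.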
Keeping the Last Occurrence
To keep the last occurrence of each duplicate instead of the first, set the keep parameter to 'last':
df_no_duplicates = df.drop_duplicates(keep='last')
print(df_no_duplicates)
In this example, the DataFrame df_no_duplicates contains the last occurrence of each duplicate.
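Besides 'first' and 'last', the keep parameter also accepts False, which treats every copy of a repeated row as a duplicate. The sketch below uses a small hypothetical DataFrame (not the one from this guide) that contains one unique row, so that the difference between the options is visible. The variable names are illustrative.

```python
import pandas as pd

# Hypothetical data: John appears twice, Anna and Maria once each
df = pd.DataFrame({
    'Name': ['John', 'Anna', 'John', 'Maria'],
    'Age': [28, 22, 28, 30],
    'City': ['New York', 'Los Angeles', 'New York', 'Boston'],
})

# keep='last' drops the earlier copy of John (row 0)
last_kept = df.drop_duplicates(keep='last')
print(last_kept.index.tolist())    # [1, 2, 3]

# keep=False drops every row that has a duplicate anywhere in the DataFrame
unique_only = df.drop_duplicates(keep=False)
print(unique_only.index.tolist())  # [1, 3]

# The same option on duplicated() flags all copies, which helps when reviewing them
all_copies = df[df.duplicated(keep=False)]
print(all_copies.index.tolist())   # [0, 2]
```

Note that with the sample data from this guide, keep=False would return an empty DataFrame, because every row there has a duplicate.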
Considering Certain Columns
By default, duplicated() and drop_duplicates() consider all columns. If you want to consider only certain columns, you can pass them as a list to the subset parameter:
print(df.duplicated(subset=['Name']))
df_no_duplicates = df.drop_duplicates(subset=['Name'])
print(df_no_duplicates)
In this example, both methods consider only the ‘Name’ column. Hence, only the first occurrence of each name is kept in df_no_duplicates.
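In the sample data, rows with the same name are identical in every column, so subset does not change the result. The sketch below uses a hypothetical DataFrame in which the same name appears with different ages and cities, to show how the choice of columns affects which rows count as duplicates. The data and variable names are illustrative.

```python
import pandas as pd

# Hypothetical data: names repeat, but the other columns differ
df = pd.DataFrame({
    'Name': ['John', 'Anna', 'John', 'Anna'],
    'Age': [28, 22, 35, 22],
    'City': ['New York', 'Los Angeles', 'Boston', 'Chicago'],
})

# Only 'Name' is compared: the second John and the second Anna are dropped
by_name = df.drop_duplicates(subset=['Name'])
print(by_name.index.tolist())      # [0, 1]

# 'Name' and 'Age' are compared together: the two Johns now differ (28 vs 35)
by_name_age = df.drop_duplicates(subset=['Name', 'Age'])
print(by_name_age.index.tolist())  # [0, 1, 2]

# All columns are compared: no two rows match completely, so nothing is dropped
all_columns = df.drop_duplicates()
print(all_columns.index.tolist())  # [0, 1, 2, 3]
```

When deduplicating on a subset, the values in the remaining columns come from whichever row is kept, so the keep parameter determines which age and city survive.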
Removing Duplicates Inplace
By default, drop_duplicates() returns a new DataFrame and does not modify the original. If you want to remove duplicates from the original DataFrame, you can use the inplace parameter:
df.drop_duplicates(inplace=True)
print(df)
In this example, the duplicates are removed from df itself, and the call returns None rather than a new DataFrame.
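A common slip is to assign the result of an in-place call back to a variable, which replaces the DataFrame with None. The sketch below shows the return value and an equivalent approach that reassigns the result of a regular call. The names result and df2 are illustrative.

```python
import pandas as pd

data = {
    'Name': ['John', 'Anna', 'John', 'Charles', 'Anna', 'Charles'],
    'Age': [28, 22, 28, 24, 22, 24],
    'City': ['New York', 'Los Angeles', 'New York', 'Chicago', 'Los Angeles', 'Chicago'],
}

# In-place removal: df is modified and the call itself returns None
df = pd.DataFrame(data)
result = df.drop_duplicates(inplace=True)
print(result)   # None
print(len(df))  # 3

# Equivalent alternative: reassign the returned DataFrame to the same name
df2 = pd.DataFrame(data)
df2 = df2.drop_duplicates()
print(len(df2))  # 3
```

Both approaches leave the same rows; reassignment has the advantage that the call can be chained with other methods.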
Conclusion
In this guide, we learned how to identify and remove duplicates from a DataFrame using Pandas. We covered how to identify duplicates using the duplicated() method, how to remove duplicates using the drop_duplicates() method, how to control which occurrences to keep using the keep parameter, how to consider only certain columns using the subset parameter, and how to remove duplicates in place using the inplace parameter.
Removing duplicates is a crucial step in data cleaning and preparation, and understanding how to handle them using Pandas will help ensure the accuracy of your data analysis.