There can be many reasons why data is missing, and it is common when working with real, messy data. Knowing how to deal with missing values is an important skill for any data professional. In this post, you will learn how to use the pandas DataFrame.dropna() method to handle missing values.
DataFrame.dropna() – Drop Missing Values –
One of the easiest ways to deal with missing data is to simply drop the affected features (columns) or samples (rows) from the dataset. We can do this with the pandas DataFrame.dropna() method.
First, let’s read a dataset to work with.
import pandas as pd

df = pd.read_csv(
    "https://raw.githubusercontent.com/bprasad26/lwd/master/data/fruit_prices.csv"
)
df
Here, we have a dataset of fruit prices, and you can see that prices are missing for certain fruits in certain months. Missing data in pandas is represented by NaN, which stands for Not a Number.
Check How Much Data is Missing –
Before we drop any missing values, we need to know how much data is missing. A simple way to do this in pandas is the following.
# check missing data
df.isnull().sum()
The df.isnull() method returns a DataFrame of Boolean values that indicate whether each cell contains a missing value. Calling sum() on that result counts the True values, which gives the number of missing values in each column.
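To see the two steps separately, here is a minimal sketch on a small made-up frame. The `toy` data and the variable names are assumptions for illustration, not the fruit_prices.csv file used in this post.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the real fruit price dataset.
toy = pd.DataFrame(
    {
        "Month": ["Jan", "Feb", "Mar", "Apr"],
        "Orange": [1.2, np.nan, 1.4, np.nan],
        "Banana": [0.5, 0.6, np.nan, np.nan],
    }
)

mask = toy.isnull()               # True where a cell is missing
missing_per_column = mask.sum()   # True counts as 1, so this counts NaNs per column
missing_share = mask.mean()       # fraction of missing cells per column

print(missing_per_column)
print(missing_share)
```

The share of missing values is often easier to judge than the raw count when columns have many rows.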
Dropping Missing Values from Rows –
To drop all rows that contain at least one missing value, we call df.dropna() with no arguments.
By default, the axis parameter of dropna is 0 (or 'index'), which drops every row that contains any missing value.
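As a small sketch of the default behaviour, assuming a made-up `toy` frame rather than the real fruit price data:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data for illustration only.
toy = pd.DataFrame(
    {
        "Month": ["Jan", "Feb", "Mar", "Apr"],
        "Orange": [1.2, np.nan, 1.4, 1.1],
        "Banana": [0.5, 0.6, np.nan, 0.7],
    }
)

# No arguments: axis=0, so rows containing any NaN are removed.
rows_dropped = toy.dropna()
print(rows_dropped)  # only the rows at index 0 and 3 remain

# Spelling out the axis gives the same result.
same_result = toy.dropna(axis="index")
```

Note that the surviving rows keep their original index labels.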
To drop columns with missing values, we use axis=1 or axis='columns'. This drops any column that has at least one missing value.
Since all of our columns contain at least one NaN value, all the columns get dropped.
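Here is a short sketch of column-wise dropping. The two small frames are made up for illustration: one has some complete columns, the other has a NaN in every column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: only the Orange column has a missing value.
toy = pd.DataFrame(
    {
        "Month": ["Jan", "Feb", "Mar"],
        "Orange": [1.2, np.nan, 1.4],
        "Banana": [0.5, 0.6, 0.7],
    }
)

cols_dropped = toy.dropna(axis=1)  # axis="columns" is equivalent
print(cols_dropped.columns.tolist())  # ['Month', 'Banana']

# When every column has at least one NaN, no columns survive.
every_col_missing = pd.DataFrame(
    {"Orange": [1.2, np.nan], "Banana": [np.nan, 0.5]}
)
nothing_left = every_col_missing.dropna(axis=1)
print(nothing_left.shape)  # (2, 0)
```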
The how parameter of dropna lets you determine how many missing values a row or column must have before it gets dropped.
By default, how='any', which drops any row or column that has at least one missing value. To drop only the rows or columns in which all the data is missing, we use how='all'. For columns, that is df.dropna(axis=1, how='all').
Since none of our columns has all of its data missing, pandas keeps all of the columns. To drop the rows where all of the data is missing, just remove the axis parameter or set axis=0.
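The difference is easier to see on a frame that actually has an empty row and an empty column. This is a minimal sketch with made-up data; the Kiwi column is a hypothetical addition.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: row 1 and the Kiwi column are entirely missing.
toy = pd.DataFrame(
    {
        "Orange": [1.2, np.nan, 1.4],
        "Banana": [np.nan, np.nan, 0.6],
        "Kiwi": [np.nan, np.nan, np.nan],
    }
)

# Rows: only row 1 is dropped, because every value in it is NaN.
no_empty_rows = toy.dropna(how="all")

# Columns: only Kiwi is dropped, because every value in it is NaN.
no_empty_cols = toy.dropna(axis=1, how="all")

print(no_empty_rows)
print(no_empty_cols)
```

Row 0 has two missing values but survives, since how='all' only removes rows with nothing in them.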
The df.dropna() method also has a thresh parameter. It lets you specify the minimum number of non-missing values a row or column must have in order to be kept.
Let’s say we set thresh=3. Pandas will now drop any row (or column, with axis=1) that has fewer than 3 non-missing values. In our case, the rows at index 1, 2, 6 and 8 are dropped because they have fewer than 3 non-missing values.
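A minimal sketch of thresh=3 on made-up data (the frame below is an assumption, not the real dataset), with the per-row count of non-missing values printed alongside so the cutoff is visible:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data for illustration only.
toy = pd.DataFrame(
    {
        "Month": ["Jan", "Feb", "Mar", "Apr"],
        "Apple": [1.0, np.nan, 1.1, np.nan],
        "Orange": [1.2, np.nan, np.nan, np.nan],
        "Banana": [0.5, 0.6, 0.7, np.nan],
    }
)

# Number of non-missing values in each row: 4, 2, 3, 1
non_missing = toy.notna().sum(axis=1)
print(non_missing)

# Keep only rows with at least 3 non-missing values.
kept = toy.dropna(thresh=3)
print(kept)  # rows at index 0 and 2
```

A row with exactly 3 non-missing values is kept, because thresh is a minimum to keep, not a maximum.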
The df.dropna() method also has a subset parameter. It lets you control which columns pandas checks for missing values when deciding which rows to drop.
Let’s say we want to drop any rows that contain missing values in the Orange and Banana columns. In our case, this drops the rows at index 1, 2, 6 and 8.
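Here is a sketch of subset on a small made-up frame; the Apple column and all values are assumptions for illustration. With the default how='any', a missing value in either listed column is enough to drop the row, while missing values in other columns are ignored.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data for illustration only.
toy = pd.DataFrame(
    {
        "Month": ["Jan", "Feb", "Mar", "Apr"],
        "Apple": [np.nan, 1.0, 1.1, 1.3],
        "Orange": [1.2, np.nan, 1.4, 1.5],
        "Banana": [0.5, 0.6, np.nan, 0.7],
    }
)

# Only Orange and Banana are checked; the NaN in Apple (row 0) is ignored.
checked = toy.dropna(subset=["Orange", "Banana"])
print(checked)  # rows at index 0 and 3

# Combined with how="all", a row is dropped only if both columns are missing.
both_missing = toy.dropna(subset=["Orange", "Banana"], how="all")
print(both_missing)  # all four rows remain
```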
Finally, df.dropna() has an inplace parameter. By default it is set to False, which means the original dataframe is not changed; instead, a new dataframe with the rows or columns dropped is returned, and you have to assign it to a variable like this:
new_df = df.dropna()
new_df
If you wish to change the original dataframe instead, set inplace=True.
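A minimal sketch of the in-place variant, again on a made-up `toy` frame. One thing to watch: with inplace=True the method returns None, so the result should not be assigned back to the variable.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data for illustration only.
toy = pd.DataFrame(
    {
        "Orange": [1.2, np.nan, 1.4],
        "Banana": [0.5, 0.6, np.nan],
    }
)

result = toy.dropna(inplace=True)  # modifies toy itself

print(result)  # None
print(toy)     # only the row at index 0 remains
```

Writing `toy = toy.dropna(inplace=True)` would therefore replace the dataframe with None, which is a common mistake.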