In our previous post, we talked about how to select rows and columns from a DataFrame using labels and indices. In this post, we will learn how to use boolean vectors to filter or select data from a DataFrame and a Series.
1 . Boolean Indexing on a DataFrame with single condition –
Let’s read the data to work with.
import pandas as pd
import numpy as np

df = pd.read_csv("https://raw.githubusercontent.com/bprasad26/lwd/master/data/gapminder.tsv", sep='\t')
df.head()
Suppose we want to find all the countries where life expectancy is less than 50 years. Selecting this data by label or position would be hard, because we don't know in advance which rows qualify. With boolean indexing it is easy.
To select data using boolean indexing, first we need to create a boolean mask like this.
# boolean mask
# all rows where lifeExp < 50 years
df['lifeExp'] < 50
Here, you can see that whenever a row has a lifeExp value below 50 years you get True, and otherwise you get False.
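To make the mask concrete, here is a minimal sketch on a tiny hand-made frame. The values are made up for illustration and are not taken from the gapminder file; only the `lifeExp` column name matches, and the name `sample` is hypothetical.

```python
import pandas as pd

# Tiny illustrative frame (made-up values, gapminder-style column name)
sample = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "lifeExp": [43.5, 71.2, 49.9, 50.0],
})

# Comparing a column with a scalar gives one True/False per row
mask = sample["lifeExp"] < 50

print(mask.dtype)     # bool
print(mask.tolist())  # [True, False, True, False]
print(mask.sum())     # True counts as 1, so this is the number of matching rows: 2
```

The mask has the same index as the frame, which is what lets pandas line it up with the rows when you use it for selection.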
Now, you can save this mask in a variable and then use it to select the required data like this:
# create a mask
life_mask = df['lifeExp'] < 50
# then select the data
df[life_mask]
Or, as a shortcut, you can pass the mask directly to the dataframe to select the data like this.
# applying directly
df[df['lifeExp'] < 50]
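You can use all kinds of comparison operators. Let’s say that you want to select all the data related to India. For that you compare the country column with == (df['country'] == 'India') and use the result as the mask.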
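A minimal sketch of the equality comparison, again on a small hand-made frame. The numbers are illustrative placeholders rather than real gapminder values, and `sample` and `india` are hypothetical names.

```python
import pandas as pd

# Small illustrative frame (made-up values, gapminder-style column names)
sample = pd.DataFrame({
    "country": ["India", "India", "Japan", "Kenya"],
    "year": [2002, 2007, 2007, 2007],
    "lifeExp": [62.0, 64.0, 82.0, 54.0],
})

# == builds the mask; indexing with it keeps only the rows where it is True
india = sample[sample["country"] == "India"]
print(india)
```

The same pattern should work on the full `df` by replacing `sample` with `df`.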
You can also combine the .loc indexer with boolean indexing.
# using loc with boolean index
df.loc[df['country']=='India', :]
If you want, you can also select just a few columns instead of all of them, by passing a list of column names in place of the colon.
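Selecting only some columns could look like the sketch below. It assumes a small made-up frame with gapminder-style column names; the choice of `year` and `lifeExp` as the columns to keep is only an example.

```python
import pandas as pd

# Small illustrative frame (made-up values, gapminder-style column names)
sample = pd.DataFrame({
    "country": ["India", "India", "Japan"],
    "continent": ["Asia", "Asia", "Asia"],
    "year": [2002, 2007, 2007],
    "lifeExp": [62.0, 64.0, 82.0],
})

# First argument: boolean mask for the rows.
# Second argument: list of column labels to keep.
subset = sample.loc[sample["country"] == "India", ["year", "lifeExp"]]
print(subset)
```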
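Boolean indexing with the iloc method does not work when you pass the boolean Series directly; if you try, you will get a NotImplementedError. Converting the mask to a plain NumPy array first, for example with .to_numpy(), does work.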
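The sketch below illustrates this behaviour on a small made-up frame with the default integer index. The exact exception can depend on the index type and the pandas version, so the example catches both NotImplementedError and ValueError; treat it as an illustration rather than a guarantee.

```python
import pandas as pd

# Small illustrative frame (made-up values) with the default integer index
sample = pd.DataFrame({
    "country": ["A", "B", "C"],
    "lifeExp": [45.0, 70.0, 48.0],
})
mask = sample["lifeExp"] < 50

# Passing the boolean Series itself to .iloc is rejected
try:
    sample.iloc[mask]
    iloc_error = None
except (NotImplementedError, ValueError) as exc:
    iloc_error = type(exc).__name__
print(iloc_error)

# A plain NumPy boolean array carries no index, and .iloc accepts it
selected = sample.iloc[mask.to_numpy()]
print(selected)
```

In practice, `.loc` or plain `df[mask]` is usually the simpler choice for boolean selection.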
2 . Boolean indexing on a DataFrame with Multiple Conditions –
You can also apply multiple conditions to select data. For "or" we use the pipe symbol (|), for "and" we use the ampersand (&), and for "not" we use the tilde (~).
Let’s say you want to find out all the data where the lifeExp is less than 50 years and the population is greater than 100 million.
df[(df['lifeExp'] < 50) & (df['pop'] > 100000000)]
One thing to notice is that when you combine multiple conditions, you need to wrap each condition in parentheses, as I did here. The & and | operators bind more tightly than comparison operators such as < and >, so if you leave the parentheses out you will get a ValueError saying that the truth value of a Series is ambiguous.
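The effect of the parentheses can be seen in a minimal sketch like the following, which uses a small made-up frame (the values are illustrative, and `sample`, `ok`, and `error_name` are hypothetical names).

```python
import pandas as pd

# Small illustrative frame (made-up values, gapminder-style column names)
sample = pd.DataFrame({
    "country": ["A", "B", "C"],
    "lifeExp": [45.0, 70.0, 48.0],
    "pop": [150_000_000, 200_000_000, 5_000_000],
})

# With parentheses: each comparison is evaluated first, then combined with &
ok = sample[(sample["lifeExp"] < 50) & (sample["pop"] > 100_000_000)]
print(ok)

# Without parentheses, Python evaluates 50 & sample["pop"] first and then
# treats the rest as a chained comparison, which asks for the truth value
# of a whole Series and makes pandas raise ValueError
try:
    sample[sample["lifeExp"] < 50 & sample["pop"] > 100_000_000]
    error_name = None
except ValueError as exc:
    error_name = type(exc).__name__
print(error_name)
```

Using the Python keywords `and` and `or` in place of `&` and `|` runs into the same ambiguity error, which is why the symbol operators are used here.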
Now suppose you want to select all the data for countries that are not in Asia and where the life expectancy is greater than 50 years.
df[~(df['continent']=='Asia') & (df['lifeExp'] > 50)]
3 . Boolean Indexing on a Series with single and multiple conditions –
You can also apply boolean indexing on a series. Let’s create a series from the dataframe.
s = df['year']
s
Now, let’s select all the rows where the year is greater than 1970.
s[s > 1970]
And for multiple conditions, say you want to select all the data where the year is less than 1960 or greater than 1980.
s[(s < 1960) | (s > 1980)]
That’s it for today, see you soon.