
where() and filter() Methods –
To select or filter rows from a DataFrame in PySpark, we use the where() and filter() methods. Both methods perform the same operation and accept the same argument types when used with DataFrames, so you can use either one, whichever you prefer.
1 . Create a PySpark DataFrame –
Let’s read a dataset to illustrate how the where() and filter() methods work.
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
# header reads the first row as column names; inferSchema loads numeric
# columns such as Net Sales as numbers instead of strings
df = spark.read.format('csv').option('header','true').option('inferSchema','true').load('../data/clothing_store_sales.csv')
df.show(5)

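You can verify the column types Spark inferred using printSchema() –
df.printSchema()
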
2 . Filter Rows Based on Single Conditions –
Let’s first see how to filter rows from a PySpark DataFrame based on a single condition. We will look at various comparison operators and see how to apply them to a DataFrame.
Equal to ( == ) Operator –
Let’s say we want to select all rows where Gender is Female.
df.where(df['Gender'] == 'Female').show(5)

You can do the same operation using the filter() method; the two are equivalent.
df.filter(df['Gender'] == 'Female').show(5)

For the rest of the post we will stick with where() because of its familiarity to SQL.
Not Equal to ( != ) Operator –
Select all rows where Gender is not Female.
df.where(df['Gender'] != 'Female').show(5)

You can also write this expression using the col() function –
from pyspark.sql.functions import col
df.where(col("Gender") != 'Female').show(5)
Or you could write it as a SQL expression string –
df.where("Gender != 'Female'").show(5)
Greater than ( > ) Operator –
Select all rows where Net Sales is greater than 100.
df.where(df['Net Sales'] > 100).show(5)

Less than ( < ) Operator –
Select all rows where Net Sales is less than 100.
df.where(df['Net Sales'] < 100).show(5)

Similarly, you can use the less than or equal to ( <= ) and greater than or equal to ( >= ) operators, as shown below.
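Here is a quick sketch using the same Net Sales column:
df.where(df['Net Sales'] <= 100).show(5)
df.where(df['Net Sales'] >= 100).show(5)

Let’s head over to multiple conditions.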
3 . Filter Rows Based on Multiple Conditions –
You can also filter rows from a PySpark DataFrame based on multiple conditions. Let’s see some examples.
AND Operation –
Select all rows where Method of Payment is Discover and Gender is Female. Note that each condition must be wrapped in parentheses, because & binds more tightly than == in Python.
df.where((df['Method of Payment'] == 'Discover') & (df['Gender'] == 'Female')).show(5)

OR Operation –
Select all rows where Type of Customer is Regular or Gender is Female.
df.where((df['Type of Customer']=='Regular') | (df['Gender']=='Female')).show(5)

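You can also mix & and | in a single expression, grouping each condition with parentheses. As an illustrative sketch using the same columns, select rows where Type of Customer is Regular and Net Sales is greater than 100, or where Gender is Female:
df.where(((df['Type of Customer'] == 'Regular') & (df['Net Sales'] > 100)) | (df['Gender'] == 'Female')).show(5)
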
4 . Filter Rows Based on List of Values –
To filter rows based on a list of values, you can use the isin() method.
list_of_val = ['Discover','MasterCard']
df.where(df['Method of Payment'].isin(list_of_val)).show(5)

You can also select rows whose values are not present in the list by negating the condition with the ~ operator –
df.where(~df['Method of Payment'].isin(list_of_val)).show(5)

5 . Filter Rows Based on Starts with, Ends with and Contains –
You can also filter rows from a PySpark DataFrame using the startswith(), endswith() and contains() methods.
startswith – Select all rows where Gender starts with 'M'.
df.where(df['Gender'].startswith('M')).show(5)

endswith – Select all rows where Method of Payment ends with 'a'.
df.where(df['Method of Payment'].endswith('a')).show(5)

contains – Select all rows where Marital Status contains 'S'.
df.where(df['Marital Status'].contains('S')).show(5)

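These methods return boolean column expressions, so you can negate or combine them just like the comparison operators above. For example, to select rows where Gender does not start with 'M' –
df.where(~df['Gender'].startswith('M')).show(5)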