## where() and filter() Methods –

To select or filter rows from a DataFrame in PySpark, we use the **where()** and **filter()** methods. Both methods perform the same operation and accept the same argument types when used with DataFrames, so you can use whichever you prefer.

### 1. Create a PySpark DataFrame –

Let’s read a dataset to illustrate how the **where()** and **filter()** methods work.

```
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
df = spark.read.format('csv').option('header','true').load('../data/clothing_store_sales.csv')
df.show(5)
```

### 2. Filter Rows Based on a Single Condition –

Let’s first see how to filter rows from a PySpark DataFrame based on a single condition. We will look at various comparison operators and see how to apply them to a DataFrame.

#### Equal to ( == ) Operator –

Let’s say we want to select all rows where Gender is Female.

`df.where(df['Gender'] == 'Female').show(5)`

You can also perform the same operation using the **filter()** method. The two are equivalent.

`df.filter(df['Gender'] == 'Female').show(5)`

For the rest of the post we will stick with **where()** because of its familiarity to SQL users.

#### Not Equal to ( != ) Operator –

Select all rows where Gender is not Female.

`df.where(df['Gender'] != 'Female').show(5)`

You can also write this expression using the **col()** function.

```
from pyspark.sql.functions import col
df.where(col("Gender") != 'Female').show(5)
```

Or you can pass a SQL-style string expression –

`df.where("Gender != 'Female'").show(5)`

#### Greater than ( > ) Operator –

Select all rows where Net Sales is greater than 100.

`df.where(df['Net Sales'] > 100).show(5)`

#### Less than ( < ) Operator –

Select all rows where the Net Sales is less than 100.

`df.where(df['Net Sales'] < 100).show(5)`

Similarly, you can use the less-than-or-equal-to ( <= ) and greater-than-or-equal-to ( >= ) operators. Let’s head over to multiple conditions.

### 3. Filter Rows Based on Multiple Conditions –

You can also filter rows from a PySpark DataFrame based on multiple conditions. Let’s see some examples.

#### AND operation –

Select all rows where Method of Payment is Discover and Gender is Female. Note that each condition must be wrapped in parentheses.

`df.where((df['Method of Payment'] == 'Discover') & (df['Gender'] == 'Female')).show(5)`

#### OR Operation –

Select all rows where Type of Customer is Regular or Gender is Female.

`df.where((df['Type of Customer']=='Regular') | (df['Gender']=='Female')).show(5)`

### 4. Filter Rows Based on a List of Values –

To filter rows based on a list of values you can use the **isin()** method.

```
list_of_val = ['Discover','MasterCard']
df.where(df['Method of Payment'].isin(list_of_val)).show(5)
```

You can also select rows whose values are not in the list by negating the condition with the ~ (NOT) operator –

`df.where(~df['Method of Payment'].isin(list_of_val)).show(5)`

### 5. Filter Rows Based on Starts With, Ends With and Contains –

You can also filter rows from a PySpark DataFrame using the **startswith()**, **endswith()** and **contains()** methods.

#### startswith –

`df.where(df['Gender'].startswith('M')).show(5)`

#### endswith –

`df.where(df['Method of Payment'].endswith('a')).show(5)`

#### contains –

`df.where(df['Marital Status'].contains('S')).show(5)`