How to Delete Columns from a DataFrame in PySpark?

In this post, you will learn how to delete one or more columns from a DataFrame in PySpark.

drop method –

The drop method in PySpark lets you delete one or more columns from a DataFrame. It returns a new DataFrame without those columns and leaves the original unchanged.

Let’s read a dataset to illustrate it. We will use the clothing store sales data.

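from pyspark.sql import SparkSession

# Reuse the active Spark session, or create one if none exists
spark = SparkSession.builder.getOrCreate()

# Read the CSV file, treating the first row as column names
df = spark.read.format('csv').option('header', 'true').load('../data/clothing_store_sales.csv')
df.show(5)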

Delete a single column from a DataFrame in PySpark –

To delete a single column from the DataFrame, pass the name of the column to the drop method.

Let’s say we want to drop the Age column.

df = df.drop("Age")
df.show(5)

Delete multiple columns from a DataFrame in PySpark –

We can also delete multiple columns by passing several column names as separate arguments to the drop method.

Let’s say we want to drop the Gender and Marital Status columns.

df = df.drop("Gender", "Marital Status")
df.show(5)
