In this post you will learn how to delete one or more columns from a dataframe in pyspark.
drop method –
The drop method in pyspark lets you delete one or more columns from a dataframe.
Let’s read a dataset to illustrate it. We will use the clothing store sales data.
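from pyspark.sql import SparkSession

# Reuse the active Spark session or start a new one
spark = SparkSession.builder.getOrCreate()

# Read the CSV, treating the first row as column names
df = spark.read.format('csv').option('header', 'true').load('../data/clothing_store_sales.csv')
df.show(5)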
Delete a single column from a dataframe in pyspark –
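To delete a single column from the dataframe, pass the name of the column to the drop method. drop returns a new dataframe and leaves the original unchanged, so assign the result to a variable to keep it.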
Let’s say we want to drop the Age column.
df = df.drop("Age")
df.show(5)
Delete multiple columns from a dataframe in pyspark –
We can also delete multiple columns by passing each column name as a separate argument to the drop method.
Let’s say we want to drop the Gender and Marital Status columns.
df = df.drop("Gender", "Marital Status")
df.show(5)