withColumn – Add a New Column to a PySpark DataFrame

In this post you will learn how to add a new column to a dataframe in PySpark.

withColumn –

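The withColumn method lets you add a new column to a PySpark DataFrame. It takes the name of the new column and a Column expression that defines its values, and it returns a new DataFrame. Let’s read a dataset to work with. We will use the Restaurant dataset.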

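from pyspark.sql import SparkSession

# Reuse the active Spark session, or start one if none exists
spark = SparkSession.builder.getOrCreate()

# Read the CSV, treating the first row as column names.
# Without the inferSchema option, every column is loaded as a string.
df = spark.read.format('csv').option('header','true').load('../data/Restaurant.csv')
df.show(5)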

Add a new column with a constant value –

To add a new column with a constant value, we can use the lit function, which wraps a Python literal in a Column expression. Let’s say that we want the tip percentage to be 20% for everyone.

from pyspark.sql.functions import lit

df = df.withColumn("tip_percentage", lit(0.2))
df.show(5)

Add a column based on other columns in the dataframe –

Let’s say we want to calculate the tip amount using the Meal Price ($) and tip_percentage columns. We can reference existing columns by name and combine them with ordinary arithmetic operators.

df = df.withColumn("tip_amount", df['Meal Price ($)'] * df['tip_percentage'])
df.show(5)
