
In this post you will learn how to add a new column to a dataframe in PySpark.
withColumn –
The withColumn method in PySpark lets you add a new column to a dataframe. Let's first read a dataset to work with. We will use the Restaurant dataset.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# header='true' uses the first row as column names; without inferSchema,
# every column is loaded as a string
df = spark.read.format('csv').option('header', 'true').load('../data/Restaurant.csv')
df.show(5)

Add a new column with a constant value –
To add a new column with a constant value, we can use the lit function. Let's say that we want the tip percentage to be 20% for everyone.
from pyspark.sql.functions import lit

# lit() wraps the Python constant 0.2 in a Column expression
df = df.withColumn("tip_percentage", lit(0.2))
df.show(5)

Add a column based on other columns in the dataframe –
Let’s say we want to calculate the tip amount using the Meal Price ($) and tip_percentage columns.
# Multiply the two columns row by row to get the tip amount
df = df.withColumn("tip_amount", df['Meal Price ($)'] * df['tip_percentage'])
df.show(5)
