select and selectExpr –
select and selectExpr allow you to do the DataFrame equivalent of SQL queries on a table of data.
# in SQL
SELECT * FROM dataFrameTable
SELECT columnName1, columnName2 FROM dataFrameTable
Let’s read a dataset to illustrate it.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.format('csv').option('header', 'true').load('../data/clothing_store_sales.csv')
df.show(5)
Now, let’s say you want to select the Gender column. You can do that by passing the column name as a string to the select method. And to select multiple columns, pass each column name as a separate string argument.
In PySpark you can refer to columns in a number of different ways.
from pyspark.sql.functions import expr, col, column

df.select(expr("Gender"), col("Gender"), column("Gender")).show(5)
Of the three, expr is the most flexible. It can refer to a plain column or to an expression built on a column, such as a string manipulation.
Let’s say I want to select a column but also want to change the name of the column like we do in SQL
# SQL
SELECT Gender AS male_or_female FROM Table1
We can do this using expr.
df.select(expr("Gender AS male_or_female")).show(5)
This changes the column name to male_or_female. You can further manipulate the result of your expression as another expression.
Let’s change the name of the column back to its original name.
df.select(expr("Gender AS male_or_female").alias("Gender")).show(5)
Because select followed by a series of expr calls is such a common pattern, Spark has an efficient shorthand for it – selectExpr.
df.selectExpr("Gender", "Age AS age_of_customer").show(5)
With selectExpr we can build complex expressions that create a new DataFrame.
df.selectExpr("COUNT(Gender) as total_customers", "AVG(Age) as avg_age").show(2)
Here, we calculated the total number of customers in the dataset and the average age of the customers.