select and selectExpr in PySpark Explained with Examples

Spread the love

select and selectExpr –

select and selectExpr allow you to do the DataFrame equivalent of SQL queries on a table of data.

# in SQL
SELECT * FROM dataFrameTable
SELECT columnName1, columnName2 FROM dataFrameTable

Let’s read a dataset to illustrate it.

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

df ='csv').option('header','true').load('../data/clothing_store_sales.csv')

Now, let’s say you want to select the Gender column. You can do that by passing the column name as string to the select method."Gender").show(5)

And to select multiple columns, just pass the name of the columns as a string."Gender","Marital Status").show(5)

In PySpark you can refer to columns in a number of different ways.

from pyspark.sql.functions import expr, col, column"Gender"), col("Gender"), column("Gender")).show(5)

Out of all expr is the most flexible one. It can refer to a plain column or a string manipulation of a column.

Let’s say I want to select a column but also want to change the name of the column like we do in SQL

SELECT Gender AS male_or_female 
FROM Table1

We can do this using expr."Gender AS male_or_female")).show(5)

This changes the column name to male_or_female. You can further manipulate the result of your expression as another expression.

Let’s change the name of the column back to it’s original name."Gender AS male_or_female").alias("Gender")).show(5)

Because select followed by a series of expr is such a common pattern, Spark has a shorthand for doing this efficiently – selectExpr

df.selectExpr("Gender","Age AS age_of_customer").show(5)

With selectExpr we can build complex expressions that creates a new dataframe.

df.selectExpr("COUNT(Gender) as total_customers", "AVG(Age) as avg_age").show(2)

Here, we calculated the total numbers of customers in the dataset and the average age of the customers.

Rating: 1 out of 5.

Leave a Reply