
Two frequently encountered operations when handling DataFrames in PySpark are Pivot and Unpivot. This article offers an in-depth look into these operations, their applications, and examples.
PySpark Pivoting
Pivoting is a method that reshapes data, transitioning it from a long format to a wide one. This process involves rotating the data, converting unique values from a particular column into new columns in the output DataFrame, and populating these columns with corresponding values.
In PySpark, the groupBy().pivot().agg() chain of functions is the common technique for pivoting. Here’s the syntax:
DataFrame.groupBy().pivot(pivot_col).agg()
- groupBy(): Specifies the columns to group by.
- pivot(pivot_col): Specifies the column to pivot on, with pivot_col being the column name.
- agg(): Specifies the aggregate function to apply.
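For instance, a minimal sketch of the full chain, assuming a hypothetical DataFrame salesDF with columns region, product, and amount:
from pyspark.sql import functions as F
# Hypothetical columns: group by region, turn each distinct product into
# a new column, and fill it with the summed amount
pivoted = salesDF.groupBy("region").pivot("product").agg(F.sum("amount"))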
PySpark Unpivoting
Unpivoting is essentially the reverse process of pivoting. It transforms data from a wide format back to a long format. PySpark only gained a built-in function for this operation in Spark 3.4 (DataFrame.unpivot); on earlier versions it can be achieved using select(), expr(), and union().
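On Spark 3.4 and later, the built-in method looks like this (a sketch, assuming a wide DataFrame wideDF shaped like the pivoted example further below):
# Spark 3.4+ only: each listed value column becomes a (fruit, sales) row
longDF = wideDF.unpivot(
    ids=["month"],
    values=["apple", "banana"],
    variableColumnName="fruit",
    valueColumnName="sales",
)
The rest of this article uses the version-independent approach.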
Detailed Examples of Pivoting and Unpivoting
Now, let’s demonstrate the use of these operations with a practical example.
Pivoting
Consider the following DataFrame, which represents the sales data for different types of fruit in various months:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("pivotExample").getOrCreate()
data = [("apple", "2023-01", 100),
("banana", "2023-01", 200),
("apple", "2023-02", 150),
("banana", "2023-02", 250)]
df = spark.createDataFrame(data, ["fruit", "month", "sales"])
df.show()
This would output:
+------+-------+-----+
| fruit| month|sales|
+------+-------+-----+
| apple|2023-01| 100|
|banana|2023-01| 200|
| apple|2023-02| 150|
|banana|2023-02| 250|
+------+-------+-----+
If we want to pivot the DataFrame on the ‘fruit’ column, creating a separate column for each fruit type, we can do this:
pivotDF = df.groupBy("month").pivot("fruit").sum("sales")
pivotDF.show()
In this example, we first group the data by the ‘month’ column using groupBy(). We then pivot on the ‘fruit’ column using pivot(), and calculate the total sales for each fruit in each month using sum("sales"). This results in the following:
+-------+-----+------+
| month|apple|banana|
+-------+-----+------+
|2023-01| 100| 200|
|2023-02| 150| 250|
+-------+-----+------+
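When the pivot values are known in advance, they can be passed explicitly; this spares Spark an extra job that it would otherwise run to collect the distinct values of the pivot column:
# Listing the expected values up front skips the distinct-value scan of 'fruit'
pivotDF = df.groupBy("month").pivot("fruit", ["apple", "banana"]).sum("sales")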
Unpivoting
Taking the pivoted DataFrame, we’ll now unpivot it back to the original structure:
from pyspark.sql.functions import expr
# stack(2, ...) emits two (fruit, sales) rows for every input row
unpivotExpr = "stack(2, 'apple', apple, 'banana', banana) as (fruit, sales)"
unpivotDF = pivotDF.select("month", expr(unpivotExpr)) \
    .where("sales is not null")
unpivotDF.show()
The stack() function inside expr() performs the unpivot. The first argument, 2, is the number of rows each input row is split into (one per pivoted column), and each subsequent pair supplies a fruit label and the column holding its sales value. The where() clause excludes any null sales values carried over from the pivoted table, resulting in:
+-------+------+-----+
| month| fruit|sales|
+-------+------+-----+
|2023-01| apple| 100|
|2023-01|banana| 200|
|2023-02| apple| 150|
|2023-02|banana| 250|
+-------+------+-----+
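The same reshaping can also be written with the select() and union() combination mentioned earlier: select one fruit column at a time as a long-format slice, then stack the slices. A sketch:
from pyspark.sql.functions import col, lit
# One long-format slice per pivoted column, stacked with union()
appleDF = pivotDF.select("month", lit("apple").alias("fruit"), col("apple").alias("sales"))
bananaDF = pivotDF.select("month", lit("banana").alias("fruit"), col("banana").alias("sales"))
unionDF = appleDF.union(bananaDF).where("sales is not null")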
Use Cases for Pivoting and Unpivoting
Pivoting and unpivoting are critical tools for data preprocessing and exploration. Here are some scenarios where they come in handy:
- Data Analysis: Pivoting can provide a summarised view of data, presenting it more intuitively or making it more suitable for presentation. For instance, pivoting sales data on the product category can provide insights into sales trends per category.
- Machine Learning: Some algorithms necessitate data in a specific format. For instance, wide-format data might be more suitable for certain time-series analyses or for algorithms that require one-hot encoding (see the sketch after this list).
- Data Normalisation: Unpivoting helps normalise data which might be in a denormalised form. Normalised data is typically easier to maintain and less redundant, which can be advantageous for certain operations and analyses.
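To illustrate the machine-learning case, pivoting with a count aggregate yields one-hot-style indicator columns. A minimal sketch with made-up event data:
from pyspark.sql import functions as F
events = spark.createDataFrame(
    [(1, "click"), (1, "view"), (2, "view")], ["user_id", "event"])
# Each distinct event value becomes an indicator-style column per user
features = events.groupBy("user_id").pivot("event").agg(F.count("event")).na.fill(0)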
Conclusion
Pivoting and unpivoting operations are foundational to effective data manipulation and analysis in PySpark. These operations provide the flexibility to reshape data to meet specific needs, be it data analysis, machine learning, or data normalisation. With the discussions and examples presented in this article, you should be well-equipped to navigate data pivoting and unpivoting tasks in PySpark.