PySpark — Optimize Pivot Data Frames like a PRO

Pivoting data is a very common scenario in data engineering pipelines. Spark provides the pivot() method out of the box to do the job. But did you know there is a performance trade-off when pivot() is not used properly on a Spark Data Frame?

Let's check that out in action.

Pivot Data Frames

First, we create our example Data Frame.

# Example Data Set
Example Data Frame

To measure the performance, we will create a simple Python decorator.

# Let's create a simple Python decorator - {get_time} - to get the execution timings
# If you don't know about Python decorators, check out:
https://www.geeksforgeeks.org/decorators-in-python/
import time
Python decorator to measure performance
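The decorator itself was shown as an image; here is a minimal sketch of what get_time likely looks like, timing the wrapped function with time.time(). The exact output format is an assumption.

```python
import time

def get_time(func):
    # Timing decorator: runs the wrapped function and prints how long it took
    def wrapper(*args, **kwargs):
        start = time.time()
        result = func(*args, **kwargs)
        print(f"Execution time for {func.__name__}: {time.time() - start:.4f} sec")
        return result
    return wrapper
```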

Method 1 — Pivoting the data without specifying the Pivot column names

# Pivot data without specifying the column names (values) and check the execution time
from pyspark.sql.functions import sum
Performance measure without specifying column names

Checking the data

# Let's check the data and schema
pivot_df_1 = df.groupBy("NAME").pivot("SUBJECT").agg(sum("MARKS"))
pivot_df_1.printSchema()
pivot_df_1.show(truncate=False)
Pivot Data Frame 1

Method 2 — Specifying the column names

First, we have to get the distinct column names from the SUBJECT column.

# Get the distinct list of Subjects
_subjects = df.select("SUBJECT").distinct().rdd.map(lambda x: x[0]).collect()
_subjects
Distinct column names

Now, let's use the distinct column names for the pivot.

# Pivot data specifying the column names (values) and check the execution time
from pyspark.sql.functions import sum
Performance measure with column names

Check the data

# Let's check the data and schema
pivot_df_2 = df.groupBy("NAME").pivot("SUBJECT", _subjects).agg(sum("MARKS"))
pivot_df_2.printSchema()
pivot_df_2.show(truncate=False)
Pivot Data Frame 2

As we can see, with the column names specified, the pivot() method ran much quicker the second time.

Conclusion: We can now conclude that execution is much quicker when the column names are specified, because Spark no longer has to scan the data to discover the distinct pivot values. But don't forget to account for the time required to collect the distinct column values as well.

So, if the column names are already known or pre-specified for a larger dataset, we should always try to specify them.

Checkout the iPython Notebook on Github — https://github.com/subhamkharwal/ease-with-apache-spark/blob/master/7_pivot_data_frame.ipynb

Checkout the PySpark Series on Medium — https://subhamkharwal.medium.com/learnbigdata101-spark-series-940160ff4d30

Wish to Buy me a Coffee: Buy Subham a Coffee
