PySpark — Optimize Pivot Data Frames like a PRO

Subham Khandelwal
Oct 9, 2022

Pivoting data is a very common scenario in almost every data engineering pipeline. Spark provides an out-of-the-box pivot() method to do the job. But did you know that pivot() on a Spark Data Frame comes with a performance trade-off if it is not used properly?
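The trade-off, in a nutshell: if you do not pass an explicit list of pivot values, Spark has to run an extra job just to compute the distinct values of the pivot column before it can pivot. A minimal sketch of the two variants (using the Data Frame we build below):

# Without explicit values: Spark first scans the data to discover
# the distinct values of SUBJECT - an extra internal job.
pivoted_slow = df.groupBy("NAME").pivot("SUBJECT").sum("MARKS")

# With explicit values: the discovery scan is skipped entirely.
pivoted_fast = df.groupBy("NAME") \
    .pivot("SUBJECT", ["PHY", "MATH", "CHEM", "BIO"]) \
    .sum("MARKS")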

Let's check that out in action.

Pivot Data Frames

First, we create our example Data Frame:

# Create a SparkSession (skip this if one already exists, e.g. in a notebook)
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("pivot-example").getOrCreate()

# Example Data Set
_data = [
    ["Ramesh", "PHY", 90],
    ["Ramesh", "MATH", 95],
    ["Ramesh", "CHEM", 100],
    ["Sangeeta", "PHY", 90],
    ["Sangeeta", "MATH", 100],
    ["Sangeeta", "CHEM", 83],
    ["Mohan", "BIO", 90],
    ["Mohan", "MATH", 70],
    ["Mohan", "CHEM", 76],
    ["Imran", "PHY", 96],
    ["Imran", "MATH", 87],
    ["Imran", "CHEM", 79],
    ["Imran", "BIO", 82]
]
_cols = ["NAME", "SUBJECT", "MARKS"]

# Generate Data Frame
df = spark.createDataFrame(data=_data, schema=_cols)
df.show(truncate=False)
Example Data Frame
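For reference, df.show(truncate=False) prints something like this (row order may vary):

+--------+-------+-----+
|NAME    |SUBJECT|MARKS|
+--------+-------+-----+
|Ramesh  |PHY    |90   |
|Ramesh  |MATH   |95   |
|Ramesh  |CHEM   |100  |
|Sangeeta|PHY    |90   |
|Sangeeta|MATH   |100  |
|Sangeeta|CHEM   |83   |
|Mohan   |BIO    |90   |
|Mohan   |MATH   |70   |
|Mohan   |CHEM   |76   |
|Imran   |PHY    |96   |
|Imran   |MATH   |87   |
|Imran   |CHEM   |79   |
|Imran   |BIO    |82   |
+--------+-------+-----+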

To measure the performance, we will create a simple Python decorator.

# Let's create a simple Python decorator - {get_time} - to get the execution timings
# If you don't know about Python decorators - check out
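
The decorator body itself was not captured above; here is a minimal sketch of what such a get_time decorator typically looks like (the exact implementation in the original may differ):

import time
from functools import wraps

def get_time(func):
    """Print the wall-clock execution time of the wrapped function."""
    # NOTE: sketch only - a simple wrapper that times one call of func
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.time()
        result = func(*args, **kwargs)
        print(f"Execution time for {func.__name__}: {time.time() - start:.4f} sec")
        return result
    return wrapper

It can then be applied as @get_time on top of any function whose runtime we want to measure.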
