PySpark — Optimize Pivot Data Frames like a PRO
Pivoting data is a very common scenario in data engineering pipelines. Spark provides an out-of-the-box pivot() method to do the job. But did you know there is a performance trade-off when pivot() is not used properly on a Spark Data Frame?
Let's check that out in action.
First, we create our example Data Frame
# Example Data Set
_data = [
["Ramesh", "PHY", 90],
["Ramesh", "MATH", 95],
["Ramesh", "CHEM", 100],
["Sangeeta", "PHY", 90],
["Sangeeta", "MATH", 100],
["Sangeeta", "CHEM", 83],
["Mohan", "BIO", 90],
["Mohan", "MATH", 70],
["Mohan", "CHEM", 76],
["Imran", "PHY", 96],
["Imran", "MATH", 87],
["Imran", "CHEM", 79],
["Imran", "BIO", 82]
]

_cols = ["NAME", "SUBJECT", "MARKS"]

# Generate Data Frame
df = spark.createDataFrame(data=_data, schema = _cols)
df.show(truncate = False)
To measure the performance, we will create a simple Python decorator.
# Let's create a simple Python decorator - {get_time} - to get the execution timings
# If you don't know about Python decorators - check out: https://www.geeksforgeeks.org/decorators-in-python/
import time

def get_time(func):
    def inner_get_time() -> str:
        start_time = time.time()
        func()
        end_time = time.time()
        return (f"Execution time: {(end_time - start_time)*1000} ms")
    print(inner_get_time())
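Note that get_time calls inner_get_time() immediately, so simply applying the decorator runs the wrapped function and prints its timing at definition time. A quick illustrative check (not part of the original notebook; warm_up is just a throwaway name):
# Illustrative only: decorating a function runs it right away and prints the timing
@get_time
def warm_up():
    time.sleep(0.1)

# Prints something like: Execution time: 100.xx ms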
Method 1 — Pivoting the data without specifying the Pivot column names
# Pivot data without specifying the column names(values) and checking the execution time
from pyspark.sql.functions import sum

@get_time
def x(): df.groupBy("NAME").pivot("SUBJECT").agg(sum("MARKS"))
Checking the data
# Lets check the data and schema
pivot_df_1 = df.groupBy("NAME").pivot("SUBJECT").agg(sum("MARKS"))
pivot_df_1.printSchema()
pivot_df_1.show(truncate = False)
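Why is Method 1 slower? When no values are supplied, pivot() has to work out the distinct values of the pivot column on its own, and it does this eagerly: Spark runs an extra job (bounded by spark.sql.pivotMaxValues, default 10000) just to infer the pivot values, before the aggregation even starts. A rough sketch to observe this with the get_time decorator (not part of the original notebook; the function names are only illustrative):
# The value inference happens inside pivot() itself, before any agg() or action
@get_time
def pivot_without_values():
    df.groupBy("NAME").pivot("SUBJECT")  # triggers an eager job to find distinct SUBJECTs

@get_time
def pivot_with_values():
    df.groupBy("NAME").pivot("SUBJECT", ["PHY", "MATH", "CHEM", "BIO"])  # no extra job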
Method 2 — Specifying the column names
First, we have to get the distinct values of the SUBJECT column (these will become the pivoted column names).
# Get the distinct list of Subjects
_subjects = df.select("SUBJECT").distinct().rdd.map(lambda x: x[0]).collect()
_subjects
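As a side note, the same list can be built without dropping down to the RDD API; this is just an alternative sketch, not what the original notebook does:
# Alternative: collect the distinct values using the DataFrame API directly
_subjects = [row["SUBJECT"] for row in df.select("SUBJECT").distinct().collect()]
_subjects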
Now we pass this distinct list of values to pivot():
# Pivot data specifying the column names(values) and checking the execution time
from pyspark.sql.functions import sum

@get_time
def x(): df.groupBy("NAME").pivot("SUBJECT", _subjects).agg(sum("MARKS"))
Check the data
# Lets check the data and schema
pivot_df_2 = df.groupBy("NAME").pivot("SUBJECT", _subjects).agg(sum("MARKS"))
pivot_df_2.printSchema()
pivot_df_2.show(truncate = False)
As we can see, the pivot() ran much quicker the second time, when the column names were specified.
Conclusion: if the column names (pivot values) are specified, the execution is much quicker. But don't forget to account for the time needed to compute the distinct values as well.
So, if the column names are already known or pre-specified for a larger dataset, we should always try to specify them.
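For instance, with the example data in this post the subject list is already known, so the values can be hardcoded directly and the distinct() pass skipped altogether (known_subjects and pivot_df_3 are just illustrative names):
# Hardcode the pivot values when they are known up front
known_subjects = ["PHY", "MATH", "CHEM", "BIO"]
pivot_df_3 = df.groupBy("NAME").pivot("SUBJECT", known_subjects).agg(sum("MARKS"))
pivot_df_3.show(truncate = False)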
Checkout the iPython Notebook on Github — https://github.com/subhamkharwal/ease-with-apache-spark/blob/master/7_pivot_data_frame.ipynb
Checkout the PySpark Series on Medium — https://subhamkharwal.medium.com/learnbigdata101-spark-series-940160ff4d30
Wish to buy me a coffee? Buy Subham a Coffee