PySpark — Optimize Pivot Data Frames like a PRO
Pivoting data is a very common scenario in data engineering pipelines. Spark provides an out-of-the-box pivot() method to do the job. But did you know there is a performance trade-off when using pivot() on a Spark DataFrame if it is not used properly?
Let's see that in action.
First, we create our example DataFrame:
from pyspark.sql import SparkSession

# Create a SparkSession (skip this if you are in a notebook
# where `spark` already exists)
spark = SparkSession.builder.appName("pivot-demo").getOrCreate()

# Example Data Set
_data = [
    ["Ramesh", "PHY", 90],
    ["Ramesh", "MATH", 95],
    ["Ramesh", "CHEM", 100],
    ["Sangeeta", "PHY", 90],
    ["Sangeeta", "MATH", 100],
    ["Sangeeta", "CHEM", 83],
    ["Mohan", "BIO", 90],
    ["Mohan", "MATH", 70],
    ["Mohan", "CHEM", 76],
    ["Imran", "PHY", 96],
    ["Imran", "MATH", 87],
    ["Imran", "CHEM", 79],
    ["Imran", "BIO", 82],
]

_cols = ["NAME", "SUBJECT", "MARKS"]

# Generate the DataFrame
df = spark.createDataFrame(data=_data, schema=_cols)
df.show(truncate=False)
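For context, here is what a basic pivot on this DataFrame looks like, turning each SUBJECT value into its own column. This is a minimal sketch (the aggregation choice is an assumption; with one mark per student and subject, any aggregate works), and it already hints at the trade-off we are about to measure:

from pyspark.sql import functions as F

# Pivot SUBJECT values into columns, one row per NAME.
# When no explicit values list is passed, Spark first runs an
# extra job to compute the distinct SUBJECT values before it
# can build the pivoted schema.
pivot_df = df.groupBy("NAME").pivot("SUBJECT").agg(F.first("MARKS"))
pivot_df.show(truncate=False)

Passing the column values explicitly, e.g. pivot("SUBJECT", ["PHY", "MATH", "CHEM", "BIO"]), lets Spark skip that extra distinct scan, which is exactly the kind of difference we want to time.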
To measure the performance, we will create a simple Python decorator.
# Let's create a simple Python decorator - {get_time} - to get the execution timings
# If you don't know about Python decorators - check out …
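The decorator body itself is not shown above; below is a minimal sketch of what a get_time decorator could look like (the name get_time comes from the comment; the implementation is an assumption):

import time
from functools import wraps

def get_time(func):
    """Print the wall-clock execution time of the decorated function."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()        # start the clock
        result = func(*args, **kwargs)     # run the actual work
        elapsed = time.perf_counter() - start
        print(f"Execution time for {func.__name__}: {elapsed:.4f} sec")
        return result
    return wrapper

With this in place, any function we want to benchmark just needs the @get_time annotation, and its runtime is printed each time it is called.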