PySpark — The Famous Salting Technique
Handle data skewness in Apache Spark efficiently
Out-of-memory errors are among the most common failures every data engineer encounters, and data skewness and improper shuffles are two of their biggest causes.
Before Spark 3 introduced Adaptive Query Execution (AQE), a well-known technique called “Salting” was used to avoid data skewness and distribute data evenly across partitions.
From Spark 3.2.0, AQE is enabled by default. Check out the documentation — https://spark.apache.org/docs/latest/sql-performance-tuning.html#adaptive-query-execution
Basically, the technique involves adding a salt to the keys on which the datasets are joined. Then, in place of the original key, we join on the salted key.
This might sound a bit odd right now, but the example below in action will clear up your doubts.
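To make the idea concrete before we get to Spark, here is a minimal pure-Python sketch of salting (plain lists and a dict stand in for Spark DataFrames and the shuffle join; `SALT_BUCKETS` and the toy data are assumptions for illustration): the skewed side gets a random salt appended to each key, and the other side is replicated once per salt value so every salted key still finds its match.

```python
import random

SALT_BUCKETS = 3  # assumed number of salt values; tune for your cluster

# Toy datasets: a large table heavily skewed on key "A",
# and a small table with one row per key.
large = [("A", i) for i in range(9)] + [("B", 0), ("C", 0)]
small = [("A", "apple"), ("B", "banana"), ("C", "cherry")]

# Salt the large table: append a random suffix to every key,
# so the hot key "A" becomes "A_0", "A_1" or "A_2".
salted_large = [(f"{k}_{random.randrange(SALT_BUCKETS)}", v) for k, v in large]

# Explode the small table: replicate each row once per salt value,
# so every possible salted key has a matching row.
salted_small = [(f"{k}_{s}", v) for k, v in small for s in range(SALT_BUCKETS)]

# Join on the salted key (a dict lookup stands in for Spark's shuffle join).
lookup = dict(salted_small)
joined = [(k.rsplit("_", 1)[0], v, lookup[k]) for k, v in salted_large]

# The result is identical to joining on the original keys, but rows for "A"
# are now spread across up to SALT_BUCKETS partitions instead of one.
```

In real PySpark the same salting is done with column expressions (e.g. `rand()` and `concat()`) instead of Python list comprehensions, which we will see shortly.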
Now, if you are new to Spark or PySpark, or want to learn more — I teach Big Data, Spark, Data Engineering & Data Warehousing on my YouTube Channel — Ease With Data