PySpark — The Magic of AQE Coalesce

With the introduction of Adaptive Query Execution (AQE) in Spark, there have been many changes in terms of performance improvements. Bad shuffles have always been trouble for data engineers, and they also trigger other problems such as skew and spill.

AQE coalesce is now out-of-the-box magic that coalesces contiguous shuffle partitions according to a target size (specified by spark.sql.adaptive.advisoryPartitionSizeInBytes), to avoid creating too many small tasks.

Check out the AQE documentation:

So, let's check this out in action. First, we check the default configuration.

Default Spark conf

Let's disable AQE and set the Spark shuffle partition count to an arbitrary value to understand the behaviour better.

Disable AQE

Let's read a sales dataset of approximately 431 MB and run a groupBy to trigger a shuffle.

Shuffle without AQE

Since AQE was disabled, the job created 289 partitions, exactly as we defined in the shuffle partitions setting above.

Now, let's do the same operation, this time with AQE enabled.

Enable AQE
AQE Enabled output

Since the output dataset was smaller than the 64 MB defined by spark.sql.adaptive.advisoryPartitionSizeInBytes, only a single shuffle partition was created.

Now, let's change the groupBy condition to generate more data.

AQE Enabled output

More partitions were created this time, but AQE automatically took care of the coalesce, reducing unwanted partitions and the number of tasks further down the pipeline.

Note: it is not mandatory for all partitions to be 64 MB in size; multiple other factors are involved as well.

The AQE coalesce feature, along with AQE itself, has been enabled by default since Spark 3.2.0.

Check out the iPython notebook on GitHub —

Check out the PySpark series on Medium —

Wish to Buy me a Coffee: Buy Subham a Coffee


