PySpark — Optimize Data Scanning exponentially

In a data warehouse, data is often partitioned on multiple levels, such as Year > Month > Day, and the number of partition folders multiplies with every level. Every read then takes Spark a long time, and the reason is the scanning of that data.

So, is there a way to minimize this scanning and increase the performance? Yes, there is a simple and very effective solution.


Let's consider our sales dataset, which is partitioned by trx_year, trx_month and trx_date.

Partitioned dataset
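To make the folder layout concrete, here is a minimal sketch that builds the path of one day's folder. It assumes Hive-style `column=value` folder names (what Spark produces with `partitionBy`) and that trx_year and trx_month are derived from trx_date. The base path `data/sales` and the sample date are hypothetical.

```python
from datetime import date


def partition_path(base: str, trx_date: date) -> str:
    """Build a Hive-style partition folder path for one transaction date.

    Assumes trx_year and trx_month are derived from trx_date; the real
    dataset may encode these values differently.
    """
    return (
        f"{base}/trx_year={trx_date.year}"
        f"/trx_month={trx_date.month}"
        f"/trx_date={trx_date.isoformat()}"
    )


# Hypothetical base path and date, for illustration only.
example = partition_path("data/sales", date(2023, 1, 15))
print(example)
```

With one folder per day, three years of data already means roughly a thousand leaf folders for Spark to list.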

Test 1: Read the data directly without specifying a schema, and note the timing.

Performance without Schema
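The original timing screenshot is not available, so the following is only a sketch of how such a test could be run. It assumes the data is stored as Parquet under a hypothetical path `data/sales` and that a SparkSession named `spark` already exists. `timed` is a small helper introduced here for illustration.

```python
import time


def timed(label, action):
    """Run action(), print the elapsed wall-clock time, return (result, seconds)."""
    start = time.perf_counter()
    result = action()
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed:.2f}s")
    return result, elapsed


def read_without_schema(spark, path="data/sales"):
    """Load the partitioned dataset, leaving partition discovery and
    schema inference to Spark.

    The path and the Parquet format are assumptions, not taken from
    the original dataset.
    """
    return spark.read.format("parquet").load(path)


# Usage, given a running SparkSession named `spark`:
#   df, seconds = timed("read without schema", lambda: read_without_schema(spark))
```

The call to `load` is the part worth timing, since that is where Spark lists the partition folders and works out the schema.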

Test 2: Read the data with the schema specified.

Performance with Schema
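A read with an explicit schema might look like the sketch below. The partition columns trx_year, trx_month and trx_date come from the dataset described above; the data columns `trx_id` and `amount`, all column types, the Parquet format and the path `data/sales` are hypothetical placeholders.

```python
# Partition column names come from the article; the data columns
# (trx_id, amount) and every type here are hypothetical placeholders.
SALES_SCHEMA_DDL = (
    "trx_id STRING, amount DOUBLE, "
    "trx_year INT, trx_month INT, trx_date DATE"
)


def read_with_schema(spark, path="data/sales", schema=SALES_SCHEMA_DDL):
    """Load the dataset with a user-supplied schema so Spark does not
    have to infer column types from the files."""
    # DataFrameReader.schema accepts a StructType or a DDL-formatted string.
    return spark.read.format("parquet").schema(schema).load(path)


# Usage, given a running SparkSession named `spark`:
#   df = read_with_schema(spark)
```

Spark still has to list the files under every partition folder in this case, which is consistent with the saving being only modest.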

So, with a schema, scanning takes a little less time.

Test 3: Register the dataset as a table once, and then read from that table.

Registering as table
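One way to register the folder as a table is a `CREATE TABLE ... USING ... LOCATION` statement, sketched below. The table name `sales`, the Parquet format and the path `data/sales` are assumptions; the statement in the original screenshot may have differed.

```python
def register_sales_table(spark, table="sales", location="data/sales"):
    """Register the partitioned folder as an external table in the catalog.

    No columns are declared, so Spark inspects the data at `location`
    to work out the schema while the table is being created.
    Table name, format and location are hypothetical.
    """
    statement = (
        f"CREATE TABLE IF NOT EXISTS {table} "
        f"USING parquet LOCATION '{location}'"
    )
    spark.sql(statement)
    return statement


# Usage, given a running SparkSession named `spark`:
#   register_sales_table(spark)
```

Depending on the catalog configuration, the partitions of such a table may need to be registered with MSCK REPAIR TABLE before any rows are returned.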

Registering the table took time because the data was scanned for all partitions and data types. This time can be reduced as well, by specifying the schema in the CREATE TABLE command.
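As a sketch of what specifying the schema in CREATE TABLE could look like, the helper below builds the statement with or without declared columns. The helper `create_table_sql`, the table name, the location and the data columns `trx_id` and `amount` are hypothetical; only the three partition column names come from the dataset above.

```python
def create_table_sql(table, location, columns_ddl=None, partition_cols=()):
    """Build a CREATE TABLE statement for an external Parquet table.

    Without columns_ddl, Spark infers the schema while creating the
    table; with it, the declared columns and partitioning are used.
    """
    parts = [f"CREATE TABLE IF NOT EXISTS {table}"]
    if columns_ddl:
        parts.append(f"({columns_ddl})")
    parts.append("USING parquet")
    if columns_ddl and partition_cols:
        parts.append(f"PARTITIONED BY ({', '.join(partition_cols)})")
    parts.append(f"LOCATION '{location}'")
    return " ".join(parts)


# Hypothetical data columns and types; partition names from the article.
explicit = create_table_sql(
    "sales",
    "data/sales",
    columns_ddl="trx_id STRING, amount DOUBLE, "
                "trx_year INT, trx_month INT, trx_date DATE",
    partition_cols=("trx_year", "trx_month", "trx_date"),
)
print(explicit)

# Usage, given a running SparkSession named `spark`:
#   spark.sql(explicit)
```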

Now, let's read the data.

Performance with registered table
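Reading from the registered table could be sketched as follows. The table name `sales` is an assumption, and the optional filter treats trx_year as an integer column, which is also an assumption.

```python
def read_from_table(spark, table="sales", year=None):
    """Read the registered table, optionally filtering on the
    trx_year partition column (assumed to be an integer)."""
    df = spark.table(table)
    if year is not None:
        # A filter on a partition column lets Spark skip other partitions.
        df = df.where(f"trx_year = {int(year)}")
    return df


# Usage, given a running SparkSession named `spark`:
#   df = read_from_table(spark, year=2023)
```

Because the schema and table location are already stored in the catalog, Spark does not need to rediscover them on each read.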

Did you see the time? It dropped dramatically, like magic. You can now read from the table any number of times, and the time spent on scanning will remain negligible.

If background tasks change the table's data and you need to read the updated data, the MSCK REPAIR TABLE command must be run to refresh the table.
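A refresh step might be wrapped as in the sketch below. The table name `sales` is hypothetical. `REFRESH TABLE` is an addition not mentioned in the article; it is included on the assumption that files inside existing partitions may also change.

```python
def refresh_statements(table="sales"):
    """Return the SQL statements used to bring the catalog up to date."""
    return [
        # Registers partition folders that exist on storage but are
        # missing from the catalog.
        f"MSCK REPAIR TABLE {table}",
        # Addition to the article: invalidates Spark's cached metadata
        # for the table so changed files are picked up.
        f"REFRESH TABLE {table}",
    ]


def refresh_table(spark, table="sales"):
    """Run the refresh statements against an existing SparkSession."""
    for statement in refresh_statements(table):
        spark.sql(statement)


# Usage, given a running SparkSession named `spark`:
#   refresh_table(spark)
```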

Conclusion: We can greatly reduce scanning time by registering the dataset as a table for reads. If that is not possible, always try to specify the schema beforehand.



