PySpark — Optimize Data Scanning Exponentially

Subham Khandelwal
3 min read · Oct 25, 2022

More often than not, data in a data warehouse is partitioned on multiple levels such as Year > Month > Day, and the number of partition folders grows very quickly. Every time Spark reads this data it takes a long time, and the reason is the scanning it has to do up front.

So, is there a way to minimize this scanning and improve performance? Yes, there is a simple and very effective solution.


Let's consider our partitioned sales dataset, which is partitioned by trx_year, trx_month and trx_date.

Partitioned dataset
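
For context, a layout like this is usually produced by writing with partitionBy. Below is a minimal sketch (not from the original post): the sample rows, extra columns and the /tmp output path are assumptions for illustration, while the partition columns trx_year, trx_month and trx_date follow the article.

# Minimal sketch: writing a dataset partitioned on trx_year > trx_month > trx_date
# (illustrative only; sample data and output path are assumptions)
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition-layout-sketch").getOrCreate()

df_sales = spark.createDataFrame(
    [(101, 1250.50, 2022, 10, 25)],
    "trx_id int, amount double, trx_year int, trx_month int, trx_date int",
)

df_sales.write \
    .format("parquet") \
    .partitionBy("trx_year", "trx_month", "trx_date") \
    .mode("overwrite") \
    .save("/tmp/sales_partitioned.parquet")

This produces nested folders like trx_year=2022/trx_month=10/trx_date=25/, which is exactly the directory structure Spark has to walk on every read.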

Test 1: Read the data directly without specifying any schema. Keep a note of the timings.

# Read without any Schema
df_1 = spark.read \
.format("parquet") \
.load("/user/hive/delta-warehouse/sales_partitioned.parquet")
Performance without Schema
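
The timing in the screenshot comes from the notebook cell execution. If you want to capture it in code, a simple wall-clock measurement like the sketch below works (my addition, assuming the same spark session as above).

# Sketch: measure how long the schema-less read takes
import time

start = time.perf_counter()
df_1 = spark.read \
    .format("parquet") \
    .load("/user/hive/delta-warehouse/sales_partitioned.parquet")
print(f"Read without schema took {time.perf_counter() - start:.2f}s")

Note that the cost shows up at load() itself, because that is when Spark lists the partition folders and infers the schema from the Parquet footers.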

Test 2: Read the data specifying the schema.

# Read with Schema
_schema = "transacted_at timestamp, trx_id int, retailer_id int, description string, amount decimal(38,18), city_id int, trx_year int, trx_month…
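
The schema string above is cut off in this copy of the post. A completed sketch of the schema-based read is shown below; the columns up to trx_year are verbatim from the article, while trx_month and trx_date are my assumption based on the partition layout (trx_year > trx_month > trx_date).

# Read with an explicit schema (completed sketch; trailing columns assumed
# from the partition layout described earlier)
_schema = ("transacted_at timestamp, trx_id int, retailer_id int, "
           "description string, amount decimal(38,18), city_id int, "
           "trx_year int, trx_month int, trx_date int")

df_2 = spark.read \
    .format("parquet") \
    .schema(_schema) \
    .load("/user/hive/delta-warehouse/sales_partitioned.parquet")

Because the column types are supplied up front, Spark can skip schema inference on load, which is what makes the second read noticeably faster.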
