PySpark — Optimize Data Scanning exponentially
In a data warehouse, data is often partitioned on multiple levels such as Year > Month > Day, so the number of partition folders grows very quickly. Every time Spark reads such a dataset it can take a long time, and the reason is the scanning Spark performs up front: listing the partition folders and inferring the schema from the files.
So, is there a way to minimize this scanning and improve performance? Yes, there is a simple and very effective solution.
Let's consider our sales dataset, which is partitioned by trx_year, trx_month and trx_date.
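For reference, a dataset like this could be produced with PySpark's partitionBy writer. The sketch below is only an assumption about how it was created (the DataFrame name df_sales is hypothetical); it is shown to make the folder layout concrete:
# Hypothetical sketch: writing a sales DataFrame partitioned on three levels
# (df_sales is an assumed, pre-existing DataFrame; the path matches the tests below)
df_sales.write \
    .format("parquet") \
    .partitionBy("trx_year", "trx_month", "trx_date") \
    .mode("overwrite") \
    .save("/user/hive/delta-warehouse/sales_partitioned.parquet")
Each level of partitionBy adds another layer of Hive-style folders (trx_year=…/trx_month=…/trx_date=…/), which is exactly why the folder count grows so fast.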
Test 1: Read the data directly without specifying any schema, and keep a note of the timing.
# Read without any Schema
df_1 = spark.read \
    .format("parquet") \
    .load("/user/hive/delta-warehouse/sales_partitioned.parquet")
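To note the timing, one option is to wrap the read in a simple timer. This is just a sketch, not necessarily how the original measurement was taken; notebook cell timings or the Spark UI work equally well:
import time

# For Parquet, load() without a schema triggers an up-front job that lists the
# partition folders and reads file footers to infer the schema
start = time.time()
df_1 = spark.read \
    .format("parquet") \
    .load("/user/hive/delta-warehouse/sales_partitioned.parquet")
print(f"Read without schema: {time.time() - start:.2f} s")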
Test 2: Read the data, this time specifying the schema.
# Read with Schema
_schema = "transacted_at timestamp, trx_id int, retailer_id int, description string, amount decimal(38,18), city_id int, trx_year int, trx_month int, trx_date int"