Data Lakehouse with PySpark — Batch Loading Strategy

Subham Khandelwal
3 min read · Feb 25, 2023

In this article we are going to design the Batch Loading Strategy used to load data into our Data Lakehouse for the series Data Lakehouse with PySpark. This is a well-known and reusable strategy that can be implemented as a part of any Data Warehouse or Data Lakehouse project.


If you are new to Data Lakehouse, check out — https://youtube.com/playlist?list=PL2IsFZBGM_IExqZ5nHg0wbTeiWVd8F06b

We broke down the complete architecture of the Data Lakehouse into three parts/layers — Landing, Staging and DW.

Architecture of Batch Loading Strategy

There are a few basic rules for data population across all three layers, let's check them out.

Landing Layer

The landing layer will hold the extracted source data. Data will always be inserted into the landing table in APPEND mode, with insert_dt and update_dt as audit columns.

The value for these audit columns is the current_timestamp at the moment the data is written to the layer. These columns let us read only the incremental data into the subsequent layers.
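As a minimal sketch of this step (assuming a Delta-backed lakehouse, a hypothetical CSV extract path and a landing table named landing.orders), the load could look like this:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("landing-batch-load").getOrCreate()

# Assumed source extract; path and format are placeholders for illustration
src_df = spark.read.option("header", True).csv("/source/extracts/orders")

# Stamp both audit columns with the load timestamp
landing_df = (
    src_df
    .withColumn("insert_dt", F.current_timestamp())
    .withColumn("update_dt", F.current_timestamp())
)

# Landing is always loaded in APPEND mode (table name is an assumption)
landing_df.write.format("delta").mode("append").saveAsTable("landing.orders")

# Downstream layers can then pick up only the new rows, for example by
# filtering on insert_dt greater than the last processed timestamp.
```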

Also, we will cast all columns to the string datatype in landing so that the data is stored exactly as received. There will be…
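A quick sketch of that casting step, assuming the same src_df as above, might be:

```python
from pyspark.sql import functions as F

# Cast every source column to string so landing keeps the data as-is
string_df = src_df.select([F.col(c).cast("string") for c in src_df.columns])
```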
