Data Lakehouse with PySpark — Batch Loading Strategy
In this article we are going to design the Batch Loading Strategy used to load data into our Data Lakehouse for the series Data Lakehouse with PySpark. This is a well-known, reusable strategy that can be implemented as part of any Data Warehouse or Data Lakehouse project.
If you are new to Data Lakehouses, check out — https://youtube.com/playlist?list=PL2IsFZBGM_IExqZ5nHg0wbTeiWVd8F06b
We broke the complete architecture of the Data Lakehouse down into three parts/layers — Landing, Staging, and DW.
There are a few basic rules for data population across all three layers; let's check them out.
Landing Layer
The landing layer will hold the extracted source data. The data is always inserted into the landing table in APPEND mode, with insert_dt and update_dt as audit columns.
The value for these audit columns is the current_timestamp at which the data is written to the layer. This allows us to read only the incremental data into the subsequent layers.
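As a rough sketch, a landing load in PySpark could look like the snippet below. The source path, the landing.orders table name, and the use of Delta Lake as the table format are assumptions for illustration only, not details from this series.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("landing_load").getOrCreate()

# Hypothetical extracted source data; the path is illustrative only.
source_df = spark.read.parquet("/data/extracts/orders")

landing_df = (
    source_df
    # Stamp both audit columns with the load timestamp so the next layer
    # can pick up only the records written since its last run.
    .withColumn("insert_dt", F.current_timestamp())
    .withColumn("update_dt", F.current_timestamp())
)

# The landing table is always loaded in APPEND mode.
(
    landing_df.write
    .format("delta")            # assumption: Delta Lake as the table format
    .mode("append")
    .saveAsTable("landing.orders")  # hypothetical landing table name
)
```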
Also, we will cast all data to the string datatype in landing so that the data is stored as-is. There will be…