PySpark — Upsert or SCD1 with Dynamic Overwrite

Upsert, also known as Incremental Update or Slowly Changing Dimension Type 1 (SCD1), is a data modelling concept that updates existing records and inserts new records based on identified keys from an incremental/delta feed.


To implement the same in PySpark on a partitioned dataset, we will take the help of Dynamic Partition Overwrite. Let's jump into action.

To know more about Dynamic Partition Overwrite, check out my previous article — http://urlit.me/75B6B

First, we will change the partition overwrite configuration to dynamic.

Configuration

Create Orders History Dataset

History Dataset

Let's write the Orders History data, partitioned on order_date, and validate it.

Partitioned History Dataset

Validate the Orders History dataset

Orders History data

Let's generate our Incremental/Delta feed with the following changes:

  1. Existing order ORD1002: qty is updated to 16 for the same order_date
  2. New order ORD1011 is created with order_date 01-20-2022

Delta feed for Orders

Now, to make things simpler, we will break the whole incremental update process into a few steps.

Step 1: Identify the impacted partitions in the History dataset using the Delta feed.

Unique Partitions from Delta feed
Impacted records from History Dataset

Step 2: Union the impacted history with the delta records, and select the latest record per key based on insert_date to create the final delta dataset.

Union and Rank based on Insert Date
Select the latest records only

Step 3: Write the final delta dataset with partition overwrite mode set to dynamic, after dropping the rank column.

Impacted Partitions are only Overwritten

As is evident, only the impacted partitions are overwritten with the final delta feed.

Let's validate the final data in the Orders History dataset.

Final History Dataset

Note: For the above use case we used only order_id as the key for the history dataset. In a typical production scenario there can be more than one key identifying a change, so the rank logic used to select the latest data will change accordingly.

Check out the iPython Notebook on GitHub — https://github.com/subhamkharwal/ease-with-apache-spark/blob/master/25_upsert_in_pyspark.ipynb

Check out my personal blog — https://urlit.me/blog/

Check out the PySpark Medium Series — https://subhamkharwal.medium.com/learnbigdata101-spark-series-940160ff4d30
