PySpark — Dynamic Partition Overwrite

INSERT OVERWRITE is a powerful concept: it overwrites only a few partitions rather than the whole partitioned output. We have seen this implemented in Hive, Impala, etc. But can we implement the same in Apache Spark?

Yes, we can implement the same functionality in Spark version 2.3.0 and above with a small configuration change, while keeping the write mode as “overwrite” itself.


As usual, we will follow along with an example to test this implementation. To implement this functionality, we will change the following Spark parameter:

spark.sql.sources.partitionOverwriteMode — check out the Spark documentation: https://spark.apache.org/docs/3.0.0/configuration.html

Test 1: Overwrite the partitions with the default value of partitionOverwriteMode, i.e. STATIC

Let's create our initial example dataset.

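Since the original post shows this step as a screenshot, here is a minimal sketch of what it might look like. The column names (order_id, order_date, amount), the values, and the four order dates are hypothetical stand-ins, not the exact data from the notebook:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition-overwrite-demo").getOrCreate()

# Hypothetical initial dataset covering four distinct order dates
initial_data = [
    (1, "2022-01-01", 100),
    (2, "2022-01-01", 150),
    (3, "2022-01-02", 200),
    (4, "2022-01-03", 300),
    (5, "2022-01-04", 400),
]
df_initial = spark.createDataFrame(initial_data, ["order_id", "order_date", "amount"])
df_initial.show()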

Out of the box, spark.sql.sources.partitionOverwriteMode is set to STATIC.
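We can confirm the current value from the running session; when nothing has been set explicitly, Spark returns the built-in default:

# Defaults to STATIC unless explicitly overridden
spark.conf.get("spark.sql.sources.partitionOverwriteMode")
# 'STATIC'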

Repartition and save the initial dataset, partitioned by order_date; with four distinct dates we get four partitions.

After writing, we read the data back to validate the initial dataset.
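A sketch of the write and the validation read, assuming a hypothetical output path dataset/orders_partitioned:

# Write the initial dataset partitioned by order_date
# (four distinct dates -> four partition directories)
df_initial.write \
    .format("parquet") \
    .partitionBy("order_date") \
    .mode("overwrite") \
    .save("dataset/orders_partitioned")

# Validate by reading the data back
spark.read.parquet("dataset/orders_partitioned").show()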

Generate the new delta dataset for the overwrite.

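Continuing the hypothetical example, the delta touches only two of the four order dates, which is exactly the situation where the two overwrite modes behave differently:

# Delta dataset: new rows for only two of the four order dates
delta_data = [
    (6, "2022-01-03", 350),
    (7, "2022-01-04", 450),
]
df_delta = spark.createDataFrame(delta_data, ["order_id", "order_date", "amount"])
df_delta.show()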

Now, we overwrite with the delta data under the default configuration.

Then we validate the final partitioned data.
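The overwrite uses exactly the same write call as before; under STATIC mode the result is drastic:

# Overwrite while partitionOverwriteMode is still STATIC
df_delta.write \
    .format("parquet") \
    .partitionBy("order_date") \
    .mode("overwrite") \
    .save("dataset/orders_partitioned")

# Validation: only the delta rows remain; even the untouched
# partitions for 2022-01-01 and 2022-01-02 are gone
spark.read.parquet("dataset/orders_partitioned").show()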

From the above test we conclude: under STATIC mode the initial data is completely overwritten, and all existing partitions are replaced by the delta dataset.

Test 2: Overwrite the partitions with partitionOverwriteMode set to DYNAMIC

We follow the same steps as above, but this time we set the configuration value to DYNAMIC.

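Setting the value on the running session is a one-liner:

# Switch partition overwrite behaviour from STATIC to DYNAMIC
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")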

We create the same initial dataset as in Test 1.

We write the partitioned initial dataset and validate it by reading it back, exactly as before.

We create the same delta dataset as in Test 1.

Now, we write the dataset with the same mode “overwrite”.

Then we validate the final partitioned dataset.
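The write call itself is unchanged; only the configuration differs. With the hypothetical data above, the expected outcome is:

# Same write, now under DYNAMIC partition overwrite
df_delta.write \
    .format("parquet") \
    .partitionBy("order_date") \
    .mode("overwrite") \
    .save("dataset/orders_partitioned")

# Validation: 2022-01-01 and 2022-01-02 keep their original rows;
# only the 2022-01-03 and 2022-01-04 partitions are replaced
spark.read.parquet("dataset/orders_partitioned").show()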

As is evident from the above test, only the partitions present in the delta dataset are overwritten, rather than the whole dataset.

Note: This only works with partitioned datasets.

Conclusion: We can achieve INSERT OVERWRITE functionality in Spark 2.3.0 and above by changing the partition overwrite mode configuration to dynamic.

Check out the IPython notebook on GitHub — https://github.com/subhamkharwal/ease-with-apache-spark/blob/master/24_Partition_Overwrite.ipynb

Check out my personal blog — https://urlit.me/blog/

Check out the PySpark Medium series — https://subhamkharwal.medium.com/learnbigdata101-spark-series-940160ff4d30
