PySpark — Dynamic Partition Overwrite
Overwrite only the affected partitions without touching the rest of the data
INSERT OVERWRITE is a powerful concept: it overwrites only selected partitions rather than the whole partitioned output. We have seen this implemented in Hive, Impala, etc. But can we implement the same in Apache Spark?
Yes, we can implement the same functionality in Spark 2.3.0 and above with a small configuration change, while still using the write mode “overwrite” itself.
As usual, we will follow along with an example to test this implementation. To enable this functionality we will change the following Spark parameter:
spark.sql.sources.partitionOverwriteMode — see the Spark documentation: https://spark.apache.org/docs/3.0.0/configuration.html
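We will first test the default static mode and then switch to dynamic. For reference, here is a minimal sketch of how the parameter is set on a session; the app name and local master are illustrative choices, not from the original article:
from pyspark.sql import SparkSession
# Build a local session (app name and master are illustrative).
spark = (
    SparkSession.builder
    .appName("dynamic-partition-overwrite-demo")
    .master("local[*]")
    .getOrCreate()
)
# The default is "static"; with "dynamic", mode("overwrite") replaces
# only the partitions present in the incoming DataFrame.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")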
Test 1: We overwrite the partitions with the default value of partitionOverwriteMode, i.e., static
Let's create our initial example dataset.
# Example dataset
from pyspark.sql.functions import col, to_date
_data = [
["ORD1001", "P003", 70, "01-21-2022"],
["ORD1004", "P033", 12, "01-24-2022"],
["ORD1005", "P036", 10, "01-20-2022"],
["ORD1002", "P016", 2, "01-10-2022"],
["ORD1003", "P012", 6, "01-10-2022"],
]
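As a sketch of the next step, the list can be turned into a DataFrame with a proper date column; the column names here are assumptions for illustration, since the original snippet does not show them:
# Hypothetical column names (not shown in the original snippet).
_cols = ["order_id", "product_id", "quantity", "order_date"]
df = spark.createDataFrame(_data, _cols)
# Parse the "MM-dd-yyyy" string into a DateType column, which will
# later serve as the partition column when writing the output.
df = df.withColumn("order_date", to_date(col("order_date"), "MM-dd-yyyy"))
df.show()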