Recently, a colleague asked if I had used the recently released PartitionOverwrite capability from Databricks for Delta tables. I haven't used it yet, but it appears very similar to the replaceWhere option. In this article, my goal is to compare the two to understand their use cases, similarities, and differences in implementation and behavior. The focus is on Delta files and Delta tables in Databricks.
What is PartitionOverwrite?
Before we talk about what PartitionOverwrite is, let's review what a partition is.
What is a Partition?
A partition is a subset of a large dataset that is manageable in size. Partitions are created based on certain criteria and are designed to improve the efficiency of data processing and analysis. In big data file formats like Parquet, when writing a substantial dataset (for example, terabytes of data) to distributed storage, it is advisable to partition the data on a key column with low cardinality (a small number of unique values). When the data is stored on disk, it is then divided by the partition key and written as manageable chunks into subfolders, one per partition value.
This is useful when you want to update or replace the contents of specific partitions without affecting the entire dataset.
Suppose you have a large dataset partitioned by date, and you receive new data for a specific date range. Instead of rewriting the entire dataset, you can use partitionOverwrite to update only the partitions corresponding to the new data.
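The scenario above can be sketched as follows. This assumes the Spark setting `spark.sql.sources.partitionOverwriteMode` set to `dynamic` (the mechanism behind dynamic partition overwrite); the table layout and values are made up for illustration.

```python
import tempfile

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[2]").appName("dyn-overwrite").getOrCreate()
out = tempfile.mkdtemp()

# Existing table with two date partitions.
base = spark.createDataFrame(
    [("2024-01-01", 10), ("2024-01-02", 20)], ["event_date", "amount"]
)
base.write.partitionBy("event_date").parquet(out, mode="overwrite")

# Dynamic mode: an overwrite only replaces the partitions that actually
# appear in the incoming data; all other partitions are left untouched.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

updates = spark.createDataFrame([("2024-01-02", 99)], ["event_date", "amount"])
updates.write.partitionBy("event_date").parquet(out, mode="overwrite")

result = {str(r["event_date"]): r["amount"] for r in spark.read.parquet(out).collect()}
# 2024-01-01 is untouched; only 2024-01-02 was rewritten.
print(result)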
Static VS Dynamic PartitionOverwrite
Static Option: Spark should delete all the partitions that match the partition specification regardless of whether there is data to be written to or not.
Note: This could be dangerous. If there is no incoming data for some partitions those partitions are deleted.
Example: Given target data set