Comparing Replacewhere to PartitionOverwrite in Databricks Delta Tables
Recently, a colleague asked if I had utilized the recently released PartitionOverwrite API by Databricks for Delta tables. Although I haven’t used it yet, it appears to be very similar to the Replacewhere function. In this article, my goal is to compare the two to understand their use cases, similarities, and differences in implementation or behavior. This article focuses on delta file or delta tables in Databricks
What is PartitionOverwrite?
Before we talk about what is PartitionOverwrite, let’s review what is a partition.
What is a Partition?
A partition refers to a subset of a large dataset that is manageable in size. These partitions are created based on certain criteria and are designed to improve the efficiency of data processing and analysis. In big data file formats like Parquet, when writing a substantial dataset (for example, terabytes of data) to your distributed storage, it is advisable to partition your data based on a key column with low cardinality (a low number of unique values). Therefore, when data is stored on the disk, it is divided based on the partition key and written in subfolders of manageable chunks, each based on the partition key.
PartitionOverwrite:
This is useful when you want to update or replace the contents of specific partitions without affecting the entire dataset.
Use case:
Suppose you have a large dataset partitioned by date, and you receive new data for a specific date range. Instead of rewriting the entire dataset, you can use partitionOverwrite
to update only the partitions corresponding to the new data.
Static VS Dynamic PartitionOverwrite
Static Option: Spark should delete all the partitions that match the partition specification regardless of whether there is data to be written to or not.
Note: This could be dangerous. If there is no incoming data for some partitions those partitions are deleted.
Example: Given target data set
Using the below subset of data to overwrite the partition in static mode
Results in the dataset below
As you can see the entire target data got replaced leaving only the new partitioned data subset
Dynamic PartitionOverwrite:
It will delete only those partitions that will have data written into it while the other partitions remains untouched.
Given target dataset
Using the below subset of data to do dynamic partitionoverwrite of the target delta file
This will result to the below.
As you can see only the partitions that exist in the input data overwrites the target delta table partition. All the 1991, 1992 and 1994 records where replaced completely by the incoming dataset
Partition creation: Whenever the input data set has a new partition not existing in the target delta table, using partition overwrite will create that new partition. This is another difference with Replacewhere
As you can see the input data containing a new partition ‘1993’ was created
ReplaceWhere
The replaceWhere option atomically replaces all records that match a given predicate. It validates that all rows match the predicate, and performs an atomic replacement using overwrite
semantics. If any values in the operation fall outside the constraint, this operation fails with an error by default.
By atomic it means it can operate at the record level. The atomic replacement functions similarly to an UPDATE command in SQL.
Use case
- It can function like the PartitionOverwrite, when we want to replace a specific partition
Given the original target dataset
If we want to overwrite the ‘1991’ partition with the below incoming data below. Then we can do this script
Result of target table
As you can see the 1991 dataset got replaced from Ken Jane to Mathew Mary
2. You can overwrite a subset of records within the partition based on the predicate using the input data. This is similar to update and more granular than the Partitionoverwrite.
Given the target data
If we decide to Replace the data for the ‘1991’ partition only for where the lastname is Jane
The result of the target delta table will be similar to an update for that record. The other records in the ‘1991’ partition remains untouched
As you can see Mathew Jane record got replaced to Keith Jane
3. Replacewhere can be used to overwrite multiple partitions of data based on the predicate. If the predicate encompasses multiple partitions, the input data will overwrite it. This can only be done with Replacewhere and not possible with Partitionoverwrite
Given a Source table with below data, with partition key of Birthdate
It has the below partition scheme in ADLS storage
Using the below input dataset, we can do a Replacewhere across multiple partitions. In this case I am replacing all the partitions that fell in the month of January 1991 with the below two records
The resulting target table will be
As you can see all the partitions that where in 1991 Jan got replaced
Conclusion
Similarity of ReplaceWhere and PartitionOverwrite
- Both need to operate on delta files that have a partition scheme
- Both can overwrite a partition based on the input dataset
- Both append new partitions to the target dataset based the existence of those partitions in the source dataset
Differences between ReplaceWhere and PartitionOverwrite
- PartitionOverwrite can add a new partition from the input dataset into the target if that partition does not exist without a predicate. Replacewhere needs the predicate to be satisfied for a replacement to happen. Though it will add new partitions from the input dataset.
- Replacewhere can replace only a subset of records in the partition mentioned in the predicate, giving you a more granular control. In Dynamic PartitionOverwrite there is no option to replace a subset of records in a partition.
- It appears that PartitionOverwrite will be faster than Replacewhere for instances where all the entire partition needs to be overwritten because it won’t need to prune files.
About Me
I am Nnaemezue Obi-eyisi, a Senior Azure Databricks Data Engineer at Capgemini and the founder of AfroInfoTech, an online coaching platform for Azure data engineers specializing in Databricks. I have a passion for learning and sharing knowledge. If you’re interested, join my waitlist for the upcoming Data Engineer bootcamp by signing up with this link: https://afroinfotech.ck.page/d8b6f6da0e. You can also visit my official Data Engineering coaching website at https://afroinfotech.teachable.com/. Follow me on other platforms via https://linktr.ee/nobieyisi.