In this article, I will give my unbiased assessment of Delta Lake's Change Data Feed feature in Databricks, analyzing its potential use cases, pros, and cons.
One of the features recently introduced by Databricks is the Delta table Change Data Feed. Databricks describes it as follows:
“Change data feed allows Databricks to track row-level changes between versions of a Delta table. When enabled on a Delta table, the runtime records change events for all the data written into the table. This includes the row data along with metadata indicating whether the specified row was inserted, deleted, or updated.” Reference
This means that we can now access an audit trail of the changes made to a Delta Lake table.
Before this feature, we would have needed to use the Delta Lake time travel feature to compare the previous version of a table with the current version.
How is Change Data Feed useful?
- Let’s say you want to implement incremental data extraction from your bronze Delta table into your silver and gold tables. You could identify the changed records with a watermark field, but Change Data Feed provides this information in a much cleaner and more robust way. We can use it to implement incremental data extraction across our Delta Lakehouse platform.
- We can use Change Data Feed as a streaming source to propagate changes to a target of our choice. This is useful when your Delta tables are sources for an event-driven application.
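To illustrate the first use case, here is a minimal sketch of an incremental load in Databricks SQL. The table names (bronze_orders, silver_orders), the join key, and the hard-coded starting version are hypothetical; table_changes and the _change_type column are part of the Change Data Feed API.

```sql
-- Hypothetical incremental load from bronze to silver using Change Data Feed.
-- Assumes CDF is enabled on bronze_orders and that the last processed commit
-- version is tracked externally (hard-coded here as 5 for illustration).
MERGE INTO silver_orders AS tgt
USING (
  SELECT * FROM table_changes('bronze_orders', 5)
  WHERE _change_type IN ('insert', 'update_postimage')
) AS src
ON tgt.order_id = src.order_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
```

Deleted rows surface with _change_type = 'delete', so they could be propagated with an additional WHEN MATCHED ... THEN DELETE clause; a production job would also need to deduplicate keys that changed more than once within the version range.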
Prerequisites for using Change Data Feed
- Create a Delta table registered in the Hive metastore or Unity Catalog
- Enable the Change Data Feed property for that table
ALTER TABLE myDeltaTable SET TBLPROPERTIES (delta.enableChangeDataFeed = true)
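Alternatively, you can enable the feature when the table is created; the table and columns below are illustrative, but delta.enableChangeDataFeed is the documented property name:

```sql
-- Enable Change Data Feed at creation time (table and columns are illustrative)
CREATE TABLE myDeltaTable (id INT, value STRING)
TBLPROPERTIES (delta.enableChangeDataFeed = true);
```

You can also set the Spark configuration spark.databricks.delta.properties.defaults.enableChangeDataFeed to true to enable the feature for all new tables created in that session.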
Querying Change Data Feed Delta Tables
My preferred method of querying the Change Data Feed is by version number, like below:
-- version as ints or longs e.g. changes from version 0 to 10
SELECT * FROM table_changes('tableName', 0, 10)
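table_changes also accepts timestamps, and the result set carries metadata columns (_change_type, _commit_version, _commit_timestamp) alongside the row data. A couple of illustrative variants (the table name and timestamps are hypothetical):

```sql
-- Changes between two timestamps
SELECT * FROM table_changes('tableName', '2021-04-21 05:45:46', '2021-05-21 12:00:00');

-- Only the rows deleted since version 0 (the end version is optional)
SELECT * FROM table_changes('tableName', 0)
WHERE _change_type = 'delete';
```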