Time Travel versus Slowly Changing Dimension Type 2
The motivation for writing this article is to help readers realize and recognize the opportunity to utilize the Delta Lake time travel feature when faced with a problem that involves analyzing historical versions of a record (slowly changing dimension type 2). Additionally, I hope to help readers understand when to stick to Slowly changing dimension type 2 implementation instead of time travel.
Many years ago, whenever data engineers had requirements that involve looking at historical versions of a dataset, it involved building out complex logic that implemented record versioning by introducing a couple of metadata fields. However, with the advent of Delta Lake in Databricks, we have an inbuilt solution, making it simpler to build use cases that leverage the functionality of looking back at a dataset’s previous versions.
Slowly Changing Dimension
For those that are not aware, Slowly changing dimensions is a data warehouse concept for managing entities like Address, Customer demographics information, location data etc. that change occasionally. In data warehousing we have two main types. I will focus my discussion on Type 2. In Type 2, we focus on maintaining record versioning in the table.
Type 2: Add new row, introduce Effective from and Effective To dates, and Current Flag
In this case let’s say over time in a Supplier Table, they changed the head quarter state twice. The table will look like the below.
To implement this, we would need to alter the table to create the 3 new columns Start_date, End-date and Current_flag
Overall ETL Steps
1. Create a staging table to store input data.
2. Alter the target table by adding the 3 metadata columns; Start_date, End_date, Current_flag
3. Load new data into Staging table.
4. Run the Merge statement to expire the old versions of the record based on the Primary key
5. Insert records from stage table into the Target table as new records
As you can see this is a lot of logic to implement and easy to make a mistake. This is the simplest version of the code. There are…