Beginner’s Guide: Extract, Transform, Load (ETL) Playbook - Incremental Load Design Pattern (Part 2)
The intended audience of this article is IT enthusiasts and beginner-level data engineers interested in understanding core data engineering principles.
In the previous post, I described the various incremental data extraction design considerations and requirements. In this article, I will continue from where I left off and explain a sample solution approach.
Example: Incremental Extraction from a Source Database Table to a Staging Table Using an ETL Tool
One of the best practices when designing an incremental extraction process from a source system that involves multiple tables is to use an ETL control table.
ETL Control Table
This is the configuration table that controls and drives the ETL process. It is normally at the granularity of each table that needs to be extracted. Below are some of the fields we typically capture in the control table:
a) Source Table Name: The name of the source table we extract data from.
b) Staging Table Name: The staging table where we temporarily write the incremental records extracted from the source table.
c) Source Select Column Extraction Query: The list of columns we want to extract from the specific source table. This gives the ETL pipeline the flexibility and efficiency to focus on only the data we need.
d) Source Filter (Where) Condition: An additional filter condition to further limit the data we extract.
e) Last Extraction Begin Date: The starting date from which to extract data from the source table during the current incremental load.
f) Extraction Type: Dictates whether the specific table should be a Full or Incremental load. Some tables have so few records that extracting the full table every time costs roughly the same as an incremental extraction.
g) ETL Status: The status of the ETL job for that table: in progress, completed, or error. This lets us avoid rerunning a job that is still in progress or ended in an error state.
h) ETL Completion Date: The date/time the ETL job completed for the specific table.
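To make the fields above concrete, here is a minimal sketch of what such a control table could look like, using Python's built-in sqlite3 module as a stand-in database. All table and column names here are hypothetical illustrations, not a prescribed schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the ETL metadata database

# Hypothetical control table; one row per source table the pipeline manages.
# Columns mirror the fields described above.
conn.execute("""
CREATE TABLE etl_control (
    source_table_name   TEXT PRIMARY KEY,
    staging_table_name  TEXT NOT NULL,
    select_columns      TEXT NOT NULL,   -- columns to extract
    filter_condition    TEXT,            -- optional extra WHERE clause
    last_extract_begin  TEXT,            -- start date for the next incremental pull
    extraction_type     TEXT CHECK (extraction_type IN ('FULL', 'INCREMENTAL')),
    etl_status          TEXT,            -- e.g. IN_PROGRESS / COMPLETED / ERROR
    etl_completion_date TEXT
)
""")

conn.execute(
    "INSERT INTO etl_control VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
    ("orders", "stg_orders", "order_id, customer_id, amount, updated_at",
     "order_status <> 'CANCELLED'", "2024-01-01", "INCREMENTAL", "COMPLETED", None),
)

# The ETL process reads this row to decide how to extract the 'orders' table.
row = conn.execute(
    "SELECT staging_table_name, extraction_type FROM etl_control WHERE source_table_name = ?",
    ("orders",),
).fetchone()
print(row)  # ('stg_orders', 'INCREMENTAL')
```

In practice this table would live in the warehouse's metadata schema, and the ETL tool would loop over its rows to drive each table's extraction.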
Staging Table
This is a table that is truncated (all records deleted) before each ETL run. You can think of it as temporary storage used to hold the data before it is inserted into or updated against the target table.
Below are some of the reasons why it is recommended to first copy data into a staging table before loading it into the target table. This is also known as the ELT (Extract, Load, Transform) approach.
1. Leveraging the strengths of SQL to implement incremental load:
a) It is easier to perform incremental logic in the database using SQL than in an ETL tool. SQL supports robust statements that can perform multiple operations (insert, update, delete) at once. For example, the SQL MERGE statement is commonly used to apply incremental inserts/updates to target tables.
b) Transformation rules written in SQL are easier to maintain and understand than those built in some ETL tools. Most people who work with data are proficient in SQL, but given the vast number of ETL tools available, many may not be familiar with any specific one.
c) ETL tools tend to fall out of favor as new or better tools become available, but SQL will always be present. When this happens, a lot of work is needed to port the code from the old ETL tool to the new one.
2. For large data volumes, incremental loading and transformation perform much better inside the database when we apply the incremental logic from the staging table to the target table.
3. The staging table helps isolate and modularize the ETL process for easy manageability and debugging. If there is an issue, we can easily analyze the records extracted into staging and compare them against the target.
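The extract-to-staging step described above can be sketched as a truncate-and-reload: clear the staging table, then copy only the source rows changed since the last extraction date recorded in the control table. Again using sqlite3, with hypothetical table names (`orders`, `stg_orders`) for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical source table and a staging table that mirrors its columns.
conn.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, amount REAL, updated_at TEXT)")
conn.execute("CREATE TABLE stg_orders (order_id INTEGER, amount REAL, updated_at TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", [
    (1, 10.0, "2024-01-05"),
    (2, 25.0, "2024-02-10"),
    (3, 40.0, "2024-03-01"),
])

# In a real pipeline this would be read from the ETL control table.
last_extract_begin = "2024-02-01"

# Truncate-and-reload: staging holds only the current incremental batch.
conn.execute("DELETE FROM stg_orders")
conn.execute(
    "INSERT INTO stg_orders "
    "SELECT order_id, amount, updated_at FROM orders WHERE updated_at >= ?",
    (last_extract_begin,),
)

batch = conn.execute("SELECT order_id FROM stg_orders ORDER BY order_id").fetchall()
print(batch)  # [(2,), (3,)] -- only rows changed since the last extraction
```

Because staging only ever holds one batch, comparing it against the target during debugging stays cheap and simple.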
In contrast, before the advent of powerful database engines, most data extraction followed a pure ETL approach: the data was extracted from the source table with an ETL tool, transformed on the fly, and loaded into the target table, all within the ETL tool itself.
Incremental Load Logic
This is the program we use to transform the data and implement the incremental load logic against the target table. In SQL, this logic is normally implemented with a MERGE statement, which compares the staging table against the target table using the primary keys, inserting records that have no primary-key match and updating those that do.
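Here is a small sketch of that staging-to-target step. SQLite (used here for a runnable example) does not support MERGE, so this uses its equivalent `INSERT ... ON CONFLICT DO UPDATE` upsert; in warehouses that support it (e.g. SQL Server, Oracle, Snowflake), a MERGE statement would express the same insert-or-update logic. Table names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders_target (order_id INTEGER PRIMARY KEY, amount REAL)")
conn.execute("CREATE TABLE stg_orders (order_id INTEGER PRIMARY KEY, amount REAL)")

conn.execute("INSERT INTO orders_target VALUES (1, 10.0), (2, 25.0)")
conn.execute("INSERT INTO stg_orders VALUES (2, 30.0), (3, 40.0)")  # 2 = update, 3 = insert

# MERGE-equivalent upsert: rows with no primary-key match are inserted,
# matching rows are updated from staging. ("WHERE true" disambiguates
# the SELECT from the ON CONFLICT clause for SQLite's parser.)
conn.execute("""
INSERT INTO orders_target (order_id, amount)
SELECT order_id, amount FROM stg_orders WHERE true
ON CONFLICT (order_id) DO UPDATE SET amount = excluded.amount
""")

result = conn.execute("SELECT * FROM orders_target ORDER BY order_id").fetchall()
print(result)  # [(1, 10.0), (2, 30.0), (3, 40.0)]
```

Note how the untouched row (1) survives, the matched row (2) picks up the staged amount, and the new row (3) is inserted, which is exactly the insert-when-unmatched / update-when-matched behavior described above.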
In the next article, I will explore the methods of inserting and updating data in the target table and discuss near real-time incremental data extraction using Change Data Capture (CDC) and streaming with Kafka.