Beginner’s Guide: Extract Transform Load (ETL) Playbook- Incremental load
--
The intended audience of this article are IT enthusiasts and beginner-level data engineers interested in understanding some data engineering principles
In the previous post (read here), I explored the rationale behind ETL and talked about how we can achieve full data extraction from a source system to a target system. In this post, I would focus our attention on incremental data extraction and some common design patterns.
We know that as we run analytics on our downstream systems, it needs to be on updated data. However, there are different ways of updating our downstream analytical system. One of them is performing a full data extraction (Full Load) like discussed in the previous post another method that will be discussed in detail in this post is called Incremental Load.
What is Incremental data load/extraction?
Incremental data extraction is the process of copying only changed or new data from source to target system based on a preset time interval. Some people would call it batch data load because you copy new or modified data after a certain time interval has elapsed. For example, in many organizations, incremental data extraction occurs hourly, daily, weekly, monthly, etc. These ETL jobs are designed to extract only the new or changed records from the last extraction date/time to the current date/time.
Why do we have to perform incremental data extraction?
Some of the major challenges with relying on full load is that it is time-consuming, resource-intensive, and inefficient as a mode of consistent data refresh. In doing an incremental load, we can limit the amount of time business users will wait for new data. The downstream analytical system will have low downtime and won’t be overburdened with the data load process since the data volume is smaller and more manageable.
Benefits of Incremental data extraction
Time: Generally incremental loads will take less time than full refresh because we are not copying all the records in the source. Additionally, the business users will wait for a lesser amount of time to get the latest data for their analytics. Time of data system unavailability is reduced
Efficiency: If we have data existing in the source that never changed since the last full load extraction, why should it be copied over again every time the ETL job…