Beginner’s Guide: Extract, Transform, Load (ETL) Playbook - Incremental Load

This article is intended for IT enthusiasts and beginner-level data engineers interested in understanding some data engineering principles.

In the previous post (read here), I explored the rationale behind ETL and discussed how we can achieve a full data extraction from a source system to a target system. In this post, I will focus our attention on incremental data extraction and some common design patterns.

We know that the analytics we run on our downstream systems need to be based on up-to-date data. However, there are different ways of refreshing a downstream analytical system. One of them is performing a full data extraction (full load), as discussed in the previous post; another, which this post covers in detail, is called an incremental load.

What is Incremental data load/extraction?

Incremental data extraction is the process of copying only new or changed data from a source system to a target system on a preset schedule. Some people call it batch data load because new or modified data is copied after a set time interval has elapsed. For example, in many organizations, incremental data extraction runs hourly, daily, weekly, or monthly. These ETL jobs are designed to extract only the records that are new or changed between the last extraction date/time and the current date/time.
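To make the idea concrete, here is a minimal sketch of watermark-based incremental extraction using SQLite. The table, column names, and the saved `last_extracted_at` value are all illustrative; in a real job the watermark would be persisted between runs (in a control table or state store).

```python
import sqlite3

# Illustrative source table with a modified-date column.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL, modified_at TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 10.0, "2024-01-01T00:00:00"),
     (2, 25.0, "2024-01-02T09:30:00"),
     (3, 40.0, "2024-01-03T12:00:00")],
)

# The watermark is the high-water mark saved by the previous ETL run.
last_extracted_at = "2024-01-01T23:59:59"

# Pull only records created or changed after the watermark
# (ISO-8601 timestamps compare correctly as strings).
rows = conn.execute(
    "SELECT id, amount, modified_at FROM orders WHERE modified_at > ?",
    (last_extracted_at,),
).fetchall()

# Advance the watermark so the next run starts where this one stopped.
new_watermark = max(r[2] for r in rows) if rows else last_extracted_at
```

Only records 2 and 3 are copied here; record 1 was already extracted by the previous run and is skipped.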

Why do we have to perform incremental data extraction?

The major challenge with relying on a full load is that it is time-consuming, resource-intensive, and inefficient as a means of regular data refresh. By performing an incremental load, we can limit the amount of time business users wait for new data. The downstream analytical system will have less downtime and won’t be overburdened by the data load process, since the data volume is smaller and more manageable.

Benefits of Incremental data extraction

Time: An incremental load generally takes less time than a full refresh because we are not copying every record in the source. Business users also wait less for the latest data for their analytics, and the window during which the data system is unavailable is reduced.

Efficiency: If data in the source has not changed since the last full extraction, why copy it over again every time the ETL job runs? Limiting the ETL process to only new and changed records is therefore more efficient.

Resource utilization: An incremental load requires less computation because of the reduced data volume, which in turn saves cost.

Maintenance: If a failure occurs during an incremental load, it is easier to restart and retry the extraction, which makes supporting the ETL jobs simpler. Restarting a failed full data extraction, by contrast, is overwhelming and harder to maintain.

Considerations for designing an incremental load data extraction process

In designing an incremental data load, one of the most important decisions is the frequency of the incremental data refresh. Figuring out an acceptable frequency shapes the design of your incremental load. For example, some source systems that store reference data change only once every 3–6 months. In that scenario, designing your ETL to perform a full or incremental load every 3–6 months is not such a big deal. In other cases, the business might want the latest data every 15 minutes. There, we have to think about leveraging change data capture technologies, data streaming technologies, or highly optimized micro-batch ETL jobs.

Below are some of the common questions raised when designing incremental ETL jobs:

  1. How frequently do we need to run this data extraction job?
  2. What is the acceptable data refresh latency from the business?
  3. What kind of source system are we dealing with? Is it a relational database, a flat file, etc.?
  4. What is the volume of records that we expect to pull on an average per run?
  5. Does our source system have a way for us to detect changed records?
  6. How frequently does the data change in the source system?
  7. Do we have delete transactions in the source system?
  8. What is the primary key in the source system? (A primary key is a column that uniquely identifies a record in a table)

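One way to keep the answers to these questions explicit is to record them alongside the job definition. The sketch below uses a hypothetical dataclass whose fields simply mirror the questions above; all names are illustrative and not from any real library.

```python
from dataclasses import dataclass

# Hypothetical design-checklist object; field names mirror questions 1-8.
@dataclass
class IncrementalJobSpec:
    schedule: str                 # Q1: how often the job runs, e.g. "hourly"
    max_latency_minutes: int      # Q2: acceptable data refresh latency
    source_kind: str              # Q3: "relational_db", "flat_file", ...
    avg_rows_per_run: int         # Q4: expected volume per run
    change_detection: str         # Q5: "modified_date", "cdc", "auto_increment"
    source_change_frequency: str  # Q6: how often the source data changes
    has_deletes: bool             # Q7: whether deletes occur in the source
    primary_key: str              # Q8: column uniquely identifying a record

spec = IncrementalJobSpec(
    schedule="hourly",
    max_latency_minutes=60,
    source_kind="relational_db",
    avg_rows_per_run=50_000,
    change_detection="modified_date",
    source_change_frequency="continuous",
    has_deletes=False,
    primary_key="order_id",
)
```

Capturing these decisions in one place makes them easy to review before any extraction code is written.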
Common incremental design patterns for structured database source systems

The overwhelming majority of incremental ETL jobs that I have designed involve extracting data from a database table into either another database table or file storage. The deep dive below focuses on this use case.

Considerations for extracting data from a database table source system

1. Identify your ETL tool of choice. There are many ETL tools on the market; whenever you have to build an ETL process, pick the tool that is most compatible with your source and target systems. Here, compatible means the ETL tool has efficient connectors to both the source and target system so that data can be moved efficiently.

2. Analyze the source table to identify the following:

a) Modified date field: This is an important field for identifying new and changed records. Using this field, we can filter for the incremental records within a date range. However, a modified date field will not help you identify deleted records.

b) Primary key field: This is an important field for identifying new and changed records, since it uniquely identifies every record in the table. Using this field, we can decide whether an incremental record should update an existing row in the target table or be inserted as a new one. This field is also useful for identifying deleted records.
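The update-or-insert decision the primary key enables is commonly called an upsert. Here is a minimal sketch using SQLite's `INSERT ... ON CONFLICT` syntax (table and column names are illustrative); most databases offer an equivalent `MERGE` or upsert statement.

```python
import sqlite3

# Illustrative target table that already holds one previously loaded row.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE target (id INTEGER PRIMARY KEY, amount REAL)")
conn.execute("INSERT INTO target VALUES (1, 10.0)")

# Incremental batch: id 1 changed in the source, id 2 is brand new.
batch = [(1, 15.0), (2, 25.0)]

# The primary key decides: conflict -> update the row, no conflict -> insert.
conn.executemany(
    "INSERT INTO target (id, amount) VALUES (?, ?) "
    "ON CONFLICT(id) DO UPDATE SET amount = excluded.amount",
    batch,
)

result = conn.execute("SELECT id, amount FROM target ORDER BY id").fetchall()
```

After the batch is applied, row 1 carries the updated amount and row 2 has been inserted, without ever reloading unchanged rows.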

c) Auto-increment key field: This is a useful, though not mandatory, field for identifying new records, especially in append-only log tables. Normally, the auto-increment field is numeric and is indexed or serves as the primary key. Using this field makes identifying new records very efficient. Note that it does not identify deleted records either.
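For an append-only log table, the incremental filter reduces to "give me everything above the last ID I saw." A minimal sketch, again with illustrative names and an in-memory SQLite table:

```python
import sqlite3

# Illustrative append-only log table with an auto-increment key.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE event_log "
    "(event_id INTEGER PRIMARY KEY AUTOINCREMENT, payload TEXT)"
)
conn.executemany("INSERT INTO event_log (payload) VALUES (?)",
                 [("a",), ("b",), ("c",)])

# The highest ID extracted by the previous run (persisted between runs).
last_seen_id = 1

# New records are simply those with a higher ID.
new_rows = conn.execute(
    "SELECT event_id, payload FROM event_log "
    "WHERE event_id > ? ORDER BY event_id",
    (last_seen_id,),
).fetchall()

# Save the new high-water mark for the next run.
last_seen_id = new_rows[-1][0] if new_rows else last_seen_id
```

Because the key is indexed and monotonically increasing, this range scan stays cheap even as the log table grows.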

Sample Incremental ETL Architecture

In the next article, I will take a deep dive into the sample incremental data extraction process shown above.

I am passionate about empowering and encouraging people of color in the data analytics career path.
