Data Engineering Beginner’s Guide: Modularity and Checkpointing
In this article, I will discuss a very important ETL programming concept called checkpointing. If you are reading my blog for the first time and are not familiar with ETL/ELT, please review my prior post about ETL. Checkpoints are nothing new in software engineering; a checkpoint literally means “a point where a check is performed”. In the real world, checkpoints are associated with security points where a traveler is either searched or identified.
The goal of writing this article is to help Data Engineers think and apply these concepts as they build data pipelines. I have purposely not gone into implementation details because there are so many ETL tools in the market and various ways to implement these concepts.
In the context of data engineering, where data travels from one point to another, a checkpoint is logic in our data pipeline that keeps track of the successfully completed steps in our ETL code. The purpose of a checkpoint is to record all (or the last) successfully completed step(s) in our data pipeline (ETL) so that, in the event of a failure, the ETL job can skip the completed steps and resume from the step that failed.
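To make this concrete, here is a minimal sketch of the idea in Python. It is not tied to any particular ETL tool: it records each completed step name to a small JSON file (the file name and step names here are illustrative), skips any step already recorded when the job is rerun, and clears the file once the whole pipeline succeeds.

```python
import json
import os

CHECKPOINT_FILE = "pipeline_checkpoint.json"  # hypothetical checkpoint location

def load_checkpoint():
    """Return the set of step names already completed in a prior run."""
    if os.path.exists(CHECKPOINT_FILE):
        with open(CHECKPOINT_FILE) as f:
            return set(json.load(f))
    return set()

def save_checkpoint(completed):
    """Persist the names of all completed steps."""
    with open(CHECKPOINT_FILE, "w") as f:
        json.dump(sorted(completed), f)

def run_pipeline(steps):
    """Run each (name, fn) step in order, skipping steps recorded as complete.

    If a step raises, the checkpoint file keeps the earlier steps recorded,
    so the next run resumes from the step that failed.
    """
    completed = load_checkpoint()
    for name, fn in steps:
        if name in completed:
            continue  # this step succeeded in a previous run
        fn()  # may raise; already-completed steps stay checkpointed
        completed.add(name)
        save_checkpoint(completed)
    # All steps succeeded: clear the checkpoint so the next run starts fresh.
    os.remove(CHECKPOINT_FILE)
```

The key design point is that the checkpoint is written *after* each step succeeds, never before, so a crash mid-step causes that step to be retried rather than skipped.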
Before we implement checkpoints, we first need to discuss ETL code modularity (steps). By modularity, I mean splitting our complex ETL code into concise, logical blocks (steps) of code. This makes our data pipeline easier to manage, maintain, decouple, unit test, debug, and checkpoint.
ETL Modularity Analogy
Imagine you are trying to learn a new skill, for example swimming. You would probably prefer your lessons to be broken down into logical steps. As you master each step, you are introduced to progressively more complex ones. Ideally, you would want to be taught the easiest, foundational steps first.
This same analogy applies to data pipeline (ETL) coding. Modularity means breaking the work down into concise, logical blocks, each aiming to achieve only one piece of functionality at a time.
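As a sketch of what "one piece of functionality at a time" looks like in code, here is a file-to-table pipeline split into three single-purpose functions. The function names, the CSV format, and the in-memory list standing in for a database table are all illustrative assumptions, not a prescription for any particular tool.

```python
# Each function does exactly one thing, so it can be managed, unit tested,
# and checkpointed independently of the others.

def extract(path):
    """Read a small CSV file into a list of dicts (one dict per row)."""
    with open(path) as f:
        header = f.readline().strip().split(",")
        return [dict(zip(header, line.strip().split(","))) for line in f]

def transform(rows):
    """Normalize one field; keep each step's scope deliberately small."""
    return [{**row, "name": row["name"].title()} for row in rows]

def load(rows, table):
    """Append transformed rows to a 'table' (here a plain list standing in
    for a real database sink); return the number of rows loaded."""
    table.extend(rows)
    return len(rows)
```

Because each block has a single responsibility, a checkpoint can be recorded after each one, and a failure in `load` does not force a rerun of `extract` or `transform`.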
ETL Job Checkpoint Sample Use case
Assume you have a new ETL job requirement to read data from a file and load it into a table. However, you…