My Guide to Becoming a “Market-Ready” Data Engineer With a $100 Investment in 3 Months

Nnaemezue Obi-Eyisi
Published in Geek Culture
11 min read · Jan 25, 2021

As of March 2024, this article remains relevant: the fundamental building blocks outlined here have stood the test of time. The building blocks essential for a career in data engineering fall into three categories:

Programming languages: Python, SQL, PySpark, Scala, Java, etc.

Systems and Tools Mastery: Relational Database Systems (SQL Server, Oracle), Distributed Systems (Spark, Kafka, HDFS, Databricks), ETL Tools (Data Factory, Fivetran, Qlik Replicate), Orchestration Tools (Airflow), etc.

Data Engineering Concepts: CI/CD, Normalization, Batch, Streaming, OLTP, OLAP, Star Schema, Data Modeling, Data Warehouse, Data Lake, Lakehouse, Cloud Fundamentals, Checkpointing, Optimizations, etc.

In this article, I focus on the most important ones to be aware of.

Perhaps you’re aware that the demand for Data Engineers has skyrocketed in the past two years, leading to more job openings than qualified candidates, especially for those authorized to work in the US. Some companies even offer referral bonuses as high as $1,500.

My aim with this post is to provide a simplified learning path for aspiring data engineers who are self-motivated beginners. I’ve outlined the essential foundational skills required to break into a Data Engineer role, at least at the junior level. The journey to becoming a data engineer can feel overwhelming, and it’s worth noting that no single person possesses all the skills listed in an ideal data engineer job description. For those curious about the comprehensive list of skills needed, I recommend checking out this data engineering roadmap here.

However, by mastering the building blocks outlined below, individuals can gain confidence in applying for roles and may even secure a job before completing their entire training. I encourage beginners to periodically assess their knowledge against job market expectations by applying for roles and participating in interviews. This approach helps maintain motivation and provides an honest evaluation of one’s skills.

You can accomplish anything as long as you set your mind to it.

After five years of experience as a data engineer across three different companies, I can attest to the enduring importance of fundamental skills in achieving success. Despite the ever-changing landscape of tools, with some rising and falling within months, a proficient data engineer with a solid grasp of foundational skills can quickly adapt to new tools and data services.

I wanted to highlight learnable skills within a realistic timeframe that can lead to job offers. Once employed, there’s ample opportunity for further learning and skill enhancement on the job. Certain data engineering skills are best acquired through real-world experience, such as knowledge of CI/CD DevOps pipelines, Agile/Scrum methodologies, and workflow scheduling.

In this post, I’ll cover vendor-agnostic foundational skills before delving into specific skills necessary for becoming an Azure data engineer, since it has been my focus for the past three years. I recommend beginners concentrate their learning on tools associated with a chosen cloud provider (AWS, Azure, GCP) for a more practical approach to securing a Data Engineering job quickly. Most companies tend to favor a particular cloud provider and utilize tools within its ecosystem.

Having scrutinized numerous job requirements and undergone countless interviews in data engineering, I’ve gained clarity on what truly matters to excel in most companies. I’ve outlined the top five foundational skills required for success and provided useful resources that I’ve personally reviewed and found sufficient. Additionally, I’ve allocated realistic timeframes for absorbing the material and estimated costs. While most of the courses I recommend are on Udemy, feel free to explore other platforms like YouTube, where valuable content can often be found. I don’t receive any referral payments from Udemy for these courses; my aim is simply to ensure you receive the best knowledge available.

It’s important to dedicate ample time to practice during self-learning, so I recommend budgeting twice the duration of the course videos. A pro tip: If you’re not seeing the discounted price for video courses on Udemy, try clearing your browser cookies or signing up with a new email address to potentially access the discount.

Foundational skills

1. Good Foundational Knowledge of any programming language, preferably Python

Why: Data engineering shares many similarities with software engineering, to the point where the boundaries between the two are becoming increasingly blurred in my experience. A proficient data engineer essentially embodies the expertise of a software engineer with a deep understanding of data.

Tasks such as crafting optimized data pipelines, implementing complex business rule transformations, and designing automated data flow systems all require the application of programming concepts. Even the tools utilized in data engineering are built upon programming languages such as Python, Scala, Java, or PowerShell.

While you may not frequently implement advanced algorithms like depth-first search, a solid grasp of programming concepts such as if-else statements and loops (for and while), along with at least some familiarity with more advanced algorithms, remains crucial for success as a data engineer. Analyzing the runtime and space requirements of your code not only makes you more effective but also saves computational costs for your company and ensures timely data delivery.
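To make the point about runtime and space concrete, here is a minimal sketch comparing two ways of finding duplicate values in a list. The function names and the `customer_ids` sample are made up for illustration; the idea is simply that the same task can cost very different amounts of compute depending on how it is written.

```python
from typing import Hashable, Iterable, List


def find_duplicates_quadratic(items: List[Hashable]) -> List[Hashable]:
    """Compare every pair of items, so the number of comparisons grows as O(n^2)."""
    duplicates = []
    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            if items[i] == items[j] and items[i] not in duplicates:
                duplicates.append(items[i])
    return duplicates


def find_duplicates_linear(items: Iterable[Hashable]) -> List[Hashable]:
    """Remember seen values in a set: one pass over the data, O(n) extra memory."""
    seen = set()
    reported = set()
    duplicates = []
    for item in items:
        if item in seen and item not in reported:
            duplicates.append(item)
            reported.add(item)
        seen.add(item)
    return duplicates


# Hypothetical sample data: IDs arriving from a source system.
customer_ids = [101, 205, 101, 333, 205, 101]
print(find_duplicates_quadratic(customer_ids))
print(find_duplicates_linear(customer_ids))
```

Both functions return the same answer, but the second trades a little extra memory for far fewer comparisons. On a few rows the difference is invisible; on millions of rows it can decide whether a pipeline finishes on time.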

Moreover, many Data Engineer job interviews assess candidates’ coding abilities, underscoring the importance of proficiency in a programming language.

Please note: If you already have a background in another programming language, you may choose to skip this section entirely or focus solely on learning the syntax of Python.

Courses

a) Beginner to Intermediate Python Course
Complete Python Bootcamp: From Zero to Hero in Python
This will give you a good grasp of some fundamentals of coding in Python and object-oriented programming.
https://www.udemy.com/course/complete-python-bootcamp/
Cost 15–20 dollars
Course Time: 24hrs
Learning Time: 1 month

b) Python Algorithms and Data Structures (for Mid to Senior Data Engineers)
Python for Data Structures, Algorithms, and Interviews
This course is crucial for understanding the fundamentals of software engineering. Please note that you need to be at an intermediate level before taking it. It is essential for getting through most coding interviews for mid or senior roles.
https://www.udemy.com/course/python-for-data-structures-algorithms-and-interviews/
Cost 12–15 dollars
Course Time: 17hrs
Learning Time: 1 month

Bonus: Python for Data Analysis: NumPy and Pandas DataFrames
These free YouTube videos are very comprehensive, going over the most popular Python libraries used in the real world for data analysis, such as Pandas and NumPy. Feel free to skip the 4 hr course and jump straight to Pandas if you don’t have time.
NumPy + Pandas 4 hr course
https://youtu.be/r-uOLxNrNk8
Pandas 1 hr course
https://youtu.be/vmEHCJofslg
Pandas Advanced concepts 1 hr course
https://youtu.be/P_t8LO-KgWM

2. Good Foundational Knowledge of SQL Programming (SQL Query writing) and Relational Database Systems

Why: SQL is the lingua franca among data technologists worldwide. Despite numerous claims by new tools that they will surpass SQL, it has not just remained relevant but has solidified its position as a standard among data professionals. It’s hard to claim proficiency in working with data without a strong command of SQL, and to excel as a data engineer today, one must be not merely proficient but truly expert in it.

Consider that as a data engineer, you’ll collaborate extensively with Data Analysts, Data Stewards, Business Analysts, and others who are already quite adept in SQL. They’ll often turn to you for assistance with their most challenging SQL queries. Moreover, SQL is pervasive throughout the data engineering stack, from data sources to ETL tools and reporting platforms — all rely on SQL to execute their operations.

Furthermore, a deep understanding of relational databases remains crucial for data engineers. SQL was originally conceived for relational databases, and you’ll find that modern data warehouse tools still draw heavily upon and replicate core functionality found in relational databases. Therefore, I strongly advise individuals to master SQL, as it remains a cornerstone skill in the field of data engineering.
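For a feel of what SQL and relational concepts look like in practice, here is a small sketch using Python's built-in `sqlite3` module, so nothing needs to be installed. The `customers` and `orders` tables and their contents are invented for illustration; the same ideas (primary keys, foreign keys, indexes, joins, aggregation) carry over to SQL Server, Oracle, and most data warehouses, though exact syntax varies.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces foreign keys when asked

# Hypothetical schema: each order must point at an existing customer.
conn.executescript(
    """
    CREATE TABLE customers (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL
    );
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(customer_id),
        amount      REAL NOT NULL
    );
    CREATE INDEX idx_orders_customer ON orders(customer_id);
    """
)

conn.executemany("INSERT INTO customers VALUES (?, ?)",
                 [(1, "Ada"), (2, "Grace"), (3, "Linus")])
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(1, 1, 120.0), (2, 1, 80.0), (3, 2, 50.0)])

# The foreign key rejects an order for a customer that does not exist.
try:
    conn.execute("INSERT INTO orders VALUES (4, 99, 10.0)")
    fk_rejected = False
except sqlite3.IntegrityError:
    fk_rejected = True

# LEFT JOIN keeps customers with no orders; GROUP BY aggregates per customer.
rows = conn.execute(
    """
    SELECT c.name,
           COUNT(o.order_id)          AS order_count,
           COALESCE(SUM(o.amount), 0) AS total_spent
    FROM customers AS c
    LEFT JOIN orders AS o ON o.customer_id = c.customer_id
    GROUP BY c.customer_id, c.name
    ORDER BY total_spent DESC, c.name
    """
).fetchall()

for row in rows:
    print(row)
print("Foreign key violation rejected:", fk_rejected)
```

Queries of this shape, joining a few tables and summarizing the result, come up constantly in both interviews and day-to-day pipeline work.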

Courses

a) SQL Database Bootcamp: Go from Zero to Mastery, on Udemy. I like this course because it covers database concepts like primary keys, foreign keys, data types, and indexing, which are very important to most data processing systems. It also covers SQL query writing.

https://www.udemy.com/course/complete-sql-databases-bootcamp-zero-to-mastery/

Cost: 20 dollars

Course time: 25 hrs

Learning time: 6 weeks

b) SQL — Beyond The Basics
This course focuses on advanced concepts that are crucial in getting through most interviews these days and having that expert level knowledge as a data engineer. It will go over the most efficient ways to write elegant queries that will optimize your ETL workloads.
https://www.udemy.com/course/sql-beyond-the-basics/
Cost 11–15 dollars
Course time: 5hrs
Learning Time: 1.5 weeks

Bonus

The Complete SQL Bootcamp 2020: Go from Zero to Hero, on Udemy
This is a good first step to get you from beginner to intermediate in SQL
https://www.udemy.com/course/the-complete-sql-bootcamp/
Cost: 11–15 dollars
Course time: 9 hrs
Learning Time: 3 weeks (Spending 10 hrs a week)

3. Good Foundational Knowledge of common Data Analytics concepts: ELT, Data Warehousing, and Data Modelling

Why ELT: Data engineers serve as the engine room crew, keeping all systems operational in data-dependent organizations. ETL (Extract, Transform, Load) has traditionally been the approach. However, the industry has shifted towards ELT (Extract, Load, Transform) because storage has become cheap and data volumes have become immense. In ELT, raw data is loaded first and transformed inside the target system, which brings the code to the data instead of transporting all the data to costly computation machines.
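A tiny sketch can make the ELT pattern easier to picture. Here `sqlite3` stands in for a data warehouse, and the `raw_sales` rows are a made-up export from a source system. The raw data is loaded untouched first, and the cleaning happens afterwards as SQL running inside the database.

```python
import sqlite3

# Hypothetical raw export from a source system: everything arrives as text.
raw_rows = [
    ("1001", "2024-03-01", " 19.99 ", "usd"),
    ("1002", "2024-03-01", "5.00", "USD"),
    ("1003", "2024-03-02", "n/a", "usd"),  # bad amount, still kept in the raw table
]

conn = sqlite3.connect(":memory:")

# Extract + Load: land the data as-is in a staging table.
conn.execute(
    "CREATE TABLE raw_sales (sale_id TEXT, sale_date TEXT, amount TEXT, currency TEXT)"
)
conn.executemany("INSERT INTO raw_sales VALUES (?, ?, ?, ?)", raw_rows)

# Transform: the cleaning logic is SQL executed by the database engine itself.
conn.execute(
    """
    CREATE TABLE clean_sales AS
    SELECT CAST(sale_id AS INTEGER)   AS sale_id,
           sale_date,
           CAST(TRIM(amount) AS REAL) AS amount,
           UPPER(currency)            AS currency
    FROM raw_sales
    WHERE TRIM(amount) GLOB '[0-9]*'  -- keep only rows whose amount starts with a digit
    """
)

raw_count = conn.execute("SELECT COUNT(*) FROM raw_sales").fetchone()[0]
clean = conn.execute(
    "SELECT sale_id, sale_date, amount, currency FROM clean_sales ORDER BY sale_id"
).fetchall()

print("raw rows:", raw_count)
for row in clean:
    print(row)
```

Because the raw table is preserved, the transformation can be rerun or corrected later without going back to the source system. In a classic ETL flow, by contrast, the cleaning would happen in a separate tool before anything reached the warehouse.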

In the role of a data engineer, it’s crucial to understand that many organizations rely on your expertise to ensure timely updates of their daily reports. Often, these reports necessitate the consolidation of data from diverse source systems, followed by transformation and modeling within a data warehouse. This facilitates easy consumption by business intelligence reports or AI/ML models. Thus, proficiency in ELT/ETL, data warehousing, and data modeling is indispensable. Without these skills, organizations would struggle to integrate disparate source systems and provide consolidated analytical reports to C-level executives.
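As a rough illustration of what "modeling within a data warehouse" can look like, here is a minimal star schema, again using `sqlite3` as a stand-in for a real warehouse. The table names (`fact_sales`, `dim_product`, `dim_date`) and the data are hypothetical; the pattern is one central fact table of measurements surrounded by descriptive dimension tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Dimension tables describe the "who / what / when" of each event.
    CREATE TABLE dim_product (
        product_key INTEGER PRIMARY KEY,
        name        TEXT,
        category    TEXT
    );
    CREATE TABLE dim_date (
        date_key  INTEGER PRIMARY KEY,
        full_date TEXT,
        month     TEXT
    );
    -- The fact table stores measurements plus keys pointing at the dimensions.
    CREATE TABLE fact_sales (
        date_key    INTEGER REFERENCES dim_date(date_key),
        product_key INTEGER REFERENCES dim_product(product_key),
        quantity    INTEGER,
        revenue     REAL
    );
    """
)

conn.executemany("INSERT INTO dim_product VALUES (?, ?, ?)", [
    (1, "Laptop", "Hardware"),
    (2, "Mouse", "Hardware"),
    (3, "Antivirus", "Software"),
])
conn.executemany("INSERT INTO dim_date VALUES (?, ?, ?)", [
    (20240301, "2024-03-01", "2024-03"),
    (20240401, "2024-04-01", "2024-04"),
])
conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?, ?)", [
    (20240301, 1, 1, 1200.0),
    (20240301, 2, 2, 50.0),
    (20240401, 3, 5, 250.0),
    (20240401, 2, 1, 25.0),
])

# A typical report query: join the fact table to its dimensions, then aggregate.
report = conn.execute(
    """
    SELECT d.month, p.category, SUM(f.revenue) AS revenue
    FROM fact_sales AS f
    JOIN dim_date    AS d ON d.date_key = f.date_key
    JOIN dim_product AS p ON p.product_key = f.product_key
    GROUP BY d.month, p.category
    ORDER BY d.month, p.category
    """
).fetchall()

for row in report:
    print(row)
```

A business intelligence tool would issue queries much like the last one, which is why the shape of the model matters so much for report speed and clarity.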

Please note that the following videos and learning materials merely scratch the surface of a much deeper topic. Personally, I gained my understanding of many of these concepts through hands-on work experience. However, as a data engineer, it’s essential to have at least a basic familiarity with the terminology and concepts to avoid feeling overwhelmed on the job.

a) Data Modelling Fundamentals
https://www.udemy.com/course/mastering-data-modeling-fundamentals/
Cost 13 dollars
Course Time: 3hrs

b) Data Warehousing Fundamentals
https://www.youtube.com/watch?v=J326LIUrZM8
Time: 1hr
https://youtu.be/lWPiSZf7-uQ
Time: 1hr

c) ETL for Data Warehouse
https://www.youtube.com/watch?v=7MOU1l30lXs
Time 1hr

d) Dimensional Modelling
https://www.youtube.com/watch?v=DspXXZrSVRk
https://www.youtube.com/watch?v=ajVfBJrTOxw
Time 2hr

My estimated learning time: 1 week

4. Knowledge of Distributed systems and computing architecture & Deep Understanding of Spark/Databricks

Get familiar with Big Data Tools and Concepts: Spark and Databricks

As a data engineer in today’s landscape, you’ll often deal with vast amounts of data or utilize systems, such as distributed systems, specifically designed for handling such data volumes. Hence, having a solid grasp of distributed systems architecture and computing for big data workloads is crucial for a data engineer’s success.

While many of these instructional videos may be lengthy, typically lasting an hour, they offer essential context and foundational knowledge on the principles underpinning the design and utilization of big data tools. Moreover, you’ll observe that many other tools share similar distributed architecture patterns, allowing you to understand how these principles apply across various platforms.

This knowledge serves as a vital foundation for our next topic, which focuses on Spark.

Focusing on mastering one of the most popular Big Data computing tools is highly advisable. By doing so, you not only deepen your understanding of big data but also acquire a highly sought-after and marketable skill. Amidst the hype surrounding Hadoop and its Ecosystem, numerous tools and projects emerged, yet only a select few have stood the test of time.

One such standout is Spark, renowned for its ability to distribute and compute big data using in-memory processing across a cluster of machines. Spark offers efficiency and versatility, supporting multiple languages including SQL, Python, Scala, and R. Databricks, a cloud-managed platform built on Spark, has also gained immense popularity owing to its cost-effectiveness. Therefore, knowledge of Spark’s architecture and optimization is indispensable for success in the job market.
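The core idea behind distributed processing can be sketched in plain Python without a cluster. To be clear, this is not Spark code and uses no Spark API: threads stand in for worker machines, and the function names are invented for illustration. It only shows the pattern of splitting data into partitions, processing each partition independently, and merging the partial results.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import List


def split_into_partitions(records: List[str], num_partitions: int) -> List[List[str]]:
    """Deal records round-robin into partitions, as a cluster spreads data across workers."""
    return [records[i::num_partitions] for i in range(num_partitions)]


def count_words(partition: List[str]) -> Counter:
    """The 'map' side: each worker counts words in its own partition only."""
    counts = Counter()
    for line in partition:
        counts.update(line.lower().split())
    return counts


def distributed_word_count(records: List[str], num_partitions: int = 3) -> Counter:
    partitions = split_into_partitions(records, num_partitions)
    # Threads stand in for cluster nodes here; this is illustrative, not real parallel speedup.
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        partial_counts = list(pool.map(count_words, partitions))
    # The 'reduce' side: merge the partial results into one answer.
    total = Counter()
    for partial in partial_counts:
        total.update(partial)
    return total


log_lines = [
    "spark reads data",
    "spark caches data in memory",
    "workers process partitions in parallel",
    "spark merges results",
]
word_counts = distributed_word_count(log_lines)
print(word_counts["spark"], word_counts["data"])
```

The result is the same no matter how many partitions are used, which is the property that lets a real engine scale the same job from one machine to hundreds.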

a) Apache Spark programming with Python: This covers the fundamentals of big data and Spark with Python.

https://www.udemy.com/course/apache-spark-programming-in-python-for-beginners/

Cost: 20 dollars
Course Time: 14 hrs
Course Learning Time: 4 weeks

b) Master Databricks Course: This excellent course covers Databricks, the premier Spark platform in the cloud. It goes over the architecture and concepts.

https://www.udemy.com/course/master-azure-databricks-for-data-engineers/

Cost: 20 dollars
Course Time: 20 hrs
Course Learning Time: 4 weeks

c) Databricks/Spark Optimization: This is important because a lot of interviews ask about it.
https://www.youtube.com/watch?v=daXEp4HmS-E&t=99s
Video time: 1 hr

Note that if you have good knowledge of SQL and Python, you can do a lot of work with Spark.

Optional Courses on YouTube

a) Hadoop Architecture and Ecosystem
https://www.youtube.com/watch?v=m9v9lky3zcE
Video time 1hr

b) Distributed Systems lecture
https://youtu.be/Y6Ev8GIlbxc
Video time 1hr

c) Distributed computing lecture
https://youtu.be/ajjOEltiZm4
Video time 15 mins

d) Big data File Format
https://youtu.be/jKfKmBdPuT4
Video Time 8 mins

e) Optional: Hive tutorial
https://youtu.be/nVI4xEH7yU8
Video Time: 2 hrs

f) Optional: Massive Parallel Processing Engines
https://youtu.be/NUGcAUyQY-k
Watch time 1 hr

5. Cloud Knowledge and Cloud Data Tools, Data Warehouses and Services

Now that we’ve covered the cloud vendor-agnostic skills, let’s dive into the specific marketable skills essential for launching your career as a data engineer. In this instance, I’ll concentrate solely on outlining the requirements to become an Azure Data Engineer, as this has been my area of focus for the past few years.

It’s worth noting that these skills can also be adapted to equivalent technologies in other cloud environments.

Becoming an Azure Data Engineer

a) Azure Cloud knowledge

Gain proficiency in Azure by enrolling in a certificate course similar to the one below. Transitioning to cloud computing represents a significant shift in mindset and technology stack for many organizations accustomed to on-premise operations. While the cloud offers numerous advantages, it also presents pitfalls and constraints. Acquiring a foundational understanding of Azure cloud infrastructure is crucial for comprehending the value it brings to an organization, particularly in the realm of data analytics. Through such courses, you’ll not only learn which tools to use in various scenarios but also best practices for ensuring data security, optimizing costs, and managing resources effectively.

I would start with
a) Azure AZ-900: Azure Fundamentals

https://docs.microsoft.com/en-us/learn/paths/azure-fundamentals/

b) Azure Data Solution Services
https://www.youtube.com/watch?v=ohya6zTa1Hg
Watch time: 1hr

c) Azure DP-203: Azure Data Engineer Certification
https://docs.microsoft.com/en-us/learn/certifications/azure-data-engineer

d) Learn a simple ETL tool in Azure — Azure Data Factory

Azure Data Factory is beloved by every data engineer in Azure for good reason. It’s remarkably user-friendly, powerful, and easy to learn. Not only is it one of the most intuitive tools in the Azure ecosystem, but it’s also extensively documented, making it even more accessible. With a plethora of resources available on the Azure website and YouTube, mastering Azure Data Factory is within reach for anyone willing to put in the effort. I strongly encourage you to dive into learning this tool and continue to grow with it. Doing so will streamline your ELT and data workflow orchestration and scheduling tasks, empowering you as a data engineer.

Azure Data Factory comprehensive overview playlist
https://www.youtube.com/watch?v=Mc9JAra8WZU&list=PLMWaZteqtEaLTJffbbBzVOv9C0otal1FO

Advanced Data Factory concepts (Parameterization)
https://youtu.be/K5Ak4IdtBCo

Worthy Mentions

Some of the readers of this post will be surprised that I have not mentioned skills like NoSQL, Streaming, Graph DB, Machine learning, etc. I am aware that they are important but I think that they are not fundamental for a beginner. Learning the above and being comfortable with it is hard enough and I wanted to ensure folks do not get overwhelmed.

For data engineers who want to learn more, try the following:

1. Spark streaming technologies and Apache Kafka

https://www.udemy.com/course/spark-streaming-using-python/

https://www.udemy.com/course/apache-kafka/

2. Snowflake Cloud Data warehouse

https://www.udemy.com/course/snowflake-masterclass/

3. NoSQL Databases

https://www.udemy.com/course/mongodb-the-complete-developers-guide/

About Me

I am Nnaemezue Obi-Eyisi, a Senior Azure Databricks Data Engineer at Capgemini and the founder of AfroInfoTech, an online coaching platform for Azure data engineers specializing in Databricks. I have a passion for learning and sharing knowledge. If you’re interested, join my waitlist for the upcoming Data Engineer bootcamp by signing up with this link: https://afroinfotech.ck.page/d8b6f6da0e. You can also visit my official Data Engineering coaching website at https://afroinfotech.teachable.com/. Follow me on other platforms via https://linktr.ee/nobieyisi.


I am passionate about empowering, educating, and encouraging individuals pursuing a career in data engineering. Currently a Senior Data Engineer at Capgemini