Should you be implementing the Data Lakehouse Architecture?
Everything I am about to say in this article is strictly my opinion. I am a big fan of Databricks and Delta Lake, and I appreciate the value they bring to modern data analytics. I am currently working with an oil & gas client that uses these technologies. However, I do have some reservations, which I will explain in this article.
The goal of this article is to raise doubts about the validity and efficiency of some of the data platform architectures involving data lakes, lakehouses, and Spark that I have seen implemented in Azure cloud environments.
As a data engineer working with various clients, I have been bewildered by some of the design solutions I have encountered. Below is a sample scenario from the industry that made me question the reasoning behind it.
Building a self-service data platform on Azure Data Lake Storage Gen2
In this project the goal was to build a self-service data lake filled with cleaned and prepared data that various business teams could consume for their analytical reporting. As a data engineer, I was tasked with creating ETL pipelines to extract data from SQL Server and Oracle source systems and load it into Azure Data Lake Storage as Parquet files. I would then create Databricks notebooks to transform the data into business-friendly datasets for downstream analytics.

At the time, we wanted Power BI to query Azure Data Lake directly. Since Power BI did not support Parquet files, we had to rewrite the files as CSV, and this introduced data quality issues. For example, some fields were parsed incorrectly (columns split apart, or single records split across multiple lines) because of commas, newlines, and other special characters embedded in the field values. These are common problems when working with CSV files.

Later, we introduced a CDC tool, Qlik Replicate, to sync data from source systems such as SAP and SQL Server into Azure Data Lake. However, it could only write the data out to the lake as CSV or JSON. Again, this introduced data quality issues.
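To make that CSV failure mode concrete, here is a small, self-contained Python sketch (an illustration, not code from the actual project) of how a field containing a comma and an embedded newline survives a spec-compliant CSV round trip but breaks any naive consumer that splits on newlines and commas:

```python
import csv
import io

# A record whose fields contain a comma and an embedded newline --
# exactly the kind of values that cause columns to split apart
# or one record to spill across multiple physical lines.
rows = [
    ["1001", "Acme, Inc.", "Line one\nLine two"],
]

# A proper CSV writer quotes the problem fields...
buf = io.StringIO()
csv.writer(buf).writerows(rows)
text = buf.getvalue()

# ...but a naive consumer that splits on newlines and then commas
# (as some downstream tools effectively do) mangles the record:
naive = [line.split(",") for line in text.splitlines()]
print(len(naive))      # 2 physical lines instead of 1 record
print(len(naive[0]))   # 4 "columns" instead of 3

# A spec-compliant reader recovers the original row intact.
parsed = list(csv.reader(io.StringIO(text)))
assert parsed == rows
```

This is why a typed, self-describing format like Parquet avoids the whole problem class: field boundaries are part of the file structure, not inferred from delimiter characters that can also appear inside the data.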
Doubts about the validity of the proposed and implemented data architecture