Mastering Data Organization: A Guide to Partitioning Databricks Delta Tables
This article was inspired by a real-world, underperforming Spark Streaming pipeline handling IoT data. One of my first investigative steps was to find the root cause of the performance bottlenecks during upserts and inserts into a Delta table. The problem became evident when I tried to read that Delta table and copy it to a test folder in my Azure Data Lake Storage container: copying 67 million records took six hours, even though the table had only 10 fields and none of them were lengthy text fields.
At that point, I realized that something was seriously amiss. Inspecting the Delta table in ADLS, I found that it had an excessive number of partitions, created without any careful consideration of how the data would be consumed downstream. Even the smallest partitions contained files of only 5 to 10 kilobytes. By contrast, Databricks recommends at least 1 gigabyte of data per partition, and partitioning should generally be reserved for tables of 1 terabyte or more.
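To see why those numbers matter, here is a rough back-of-the-envelope sketch. The 30 GB total size below is an illustrative assumption, not a measurement from the pipeline; the point is how the file count explodes when files shrink from the recommended gigabyte range down to kilobytes:

```python
# Rough illustration of the small-file problem: the same hypothetical ~30 GB
# of data split into a few large files vs. thousands of tiny ones.
def file_count(total_bytes: int, file_bytes: int) -> int:
    """Number of files needed if every file is file_bytes large (ceiling division)."""
    return -(-total_bytes // file_bytes)

GB = 1024 ** 3
KB = 1024

total = 30 * GB  # hypothetical total table size

few_large = file_count(total, 1 * GB)   # the ~1 GB guidance
many_tiny = file_count(total, 10 * KB)  # what over-partitioning produced

print(few_large)  # 30
print(many_tiny)  # 3145728
```

Every one of those millions of tiny files must be listed, opened, and scheduled as its own unit of work, which is where the hours of copy time went.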
Once I had copied the data into a new, non-partitioned Delta table, the resulting files were hundreds of megabytes each. Copying the data from this new non-partitioned table into yet another Delta table then took only one minute. That dramatic reduction, from six hours to one minute, came solely from fixing a poorly devised partitioning strategy.
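In Spark SQL, the fix amounts to recreating the table without a `PARTITIONED BY` clause and letting compaction produce healthy file sizes. The table names below are placeholders, not the actual pipeline's tables:

```sql
-- Hypothetical names; the key point is the absence of PARTITIONED BY.
CREATE TABLE iot_events_flat
USING DELTA
AS SELECT * FROM iot_events_overpartitioned;

-- Compact the output into large files (hundreds of megabytes to ~1 GB).
OPTIMIZE iot_events_flat;
```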
Furthermore, thanks to recent enhancements in Databricks, a feature known as “liquid clustering” is available to help you steer clear of both over-partitioning and under-partitioning your Delta tables. I recommend reviewing the official documentation for more in-depth information. This approach will help you avoid these performance issues; just be sure to select the most suitable column(s) as your clustering key.
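As a sketch of what that looks like, liquid clustering is declared with `CLUSTER BY` in place of `PARTITIONED BY`. The table and column names here are hypothetical examples, not from the pipeline in this story:

```sql
-- CLUSTER BY replaces PARTITIONED BY; choosing the right key column matters.
CREATE TABLE iot_events
USING DELTA
CLUSTER BY (device_id)  -- hypothetical clustering key
AS SELECT * FROM raw_iot_events;
```

Check the Databricks documentation for the current rules on enabling clustering for existing tables, since a partitioned table cannot simply be switched over in place.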
I encourage you to read the article below for further guidance, and make sure you only partition your data when you have a solid understanding of the basics. I have also attached reference links from Microsoft that discuss partitioning of Delta tables and the big-data small-file problem.
In the world of big data and data engineering, efficient data organization is the key to optimizing performance and ensuring that your data operations run smoothly. One essential technique that plays a…