Mastering Data Organization: A Guide to Partitioning Databricks Delta Tables
This article was inspired by a real-world underperforming Spark Streaming pipeline handling IoT data. One of my first investigative steps was to identify the root cause of the performance bottleneck during upserts and inserts into a Delta table. The scale of the problem became evident when I tried to read that Delta table and copy it to a test folder in my Azure Data Lake Storage container: copying 67 million records took six hours, even though the table had only 10 fields and, notably, no lengthy text fields.
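For context, the copy itself was nothing exotic: a plain read of the Delta table followed by a write to a test location. A minimal sketch of that operation, with hypothetical ADLS paths standing in for the real ones:

```python
# Hypothetical paths; the real source and test containers differ.
source_path = "abfss://prod@mystorage.dfs.core.windows.net/iot/events"
target_path = "abfss://test@mystorage.dfs.core.windows.net/iot/events_copy"

# Read the existing Delta table and write a copy to the test folder.
df = spark.read.format("delta").load(source_path)
df.write.format("delta").mode("overwrite").save(target_path)
```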
At that point, I realized something was seriously amiss. Inspecting the Delta table in ADLS, I saw that it had been partitioned much like a library organizes books: into an excessive number of partitions, chosen without any consideration of how the data would be consumed downstream. Even the smallest partitions contained files of only 5 to 10 kilobytes. By contrast, Databricks recommends a minimum of 1 gigabyte of data per partition, and partitioning in general should be reserved for tables of 1 terabyte or more.
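You don't have to browse the storage container by hand to spot this kind of fragmentation. Delta's `DESCRIBE DETAIL` command reports the file count and total size of a table, and dividing one by the other gives the average file size. A minimal sketch, reusing the hypothetical `source_path` from above:

```python
# DESCRIBE DETAIL returns a single row whose columns include
# numFiles and sizeInBytes for the whole table.
detail = spark.sql(f"DESCRIBE DETAIL delta.`{source_path}`").collect()[0]

avg_file_mb = detail["sizeInBytes"] / detail["numFiles"] / (1024 * 1024)
print(f"{detail['numFiles']} files, avg {avg_file_mb:.2f} MB per file")
```

A healthy table typically shows average file sizes in the hundreds of megabytes; an average in the kilobytes, as in my case, is a strong sign of over-partitioning.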
Once I copied the data into the new, non-partitioned Delta table, the resulting files were hundreds of megabytes each. Consequently, copying the data from this new non-partitioned table to another Delta table took only one minute. This dramatic reduction, from 6 hours to 1 minute, was attributable solely to fixing a poorly devised partitioning strategy.
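The fix was simply to rewrite the table without a `partitionBy` clause, letting the Delta writer pack rows into a small number of large files. Roughly what that looked like, again with hypothetical paths:

```python
# Hypothetical destination for the non-partitioned copy.
unpartitioned_path = "abfss://test@mystorage.dfs.core.windows.net/iot/events_unpartitioned"

# With no partitionBy, the writer produces normally sized files
# (hundreds of megabytes) instead of thousands of tiny ones.
(spark.read.format("delta").load(source_path)
      .write.format("delta").mode("overwrite").save(unpartitioned_path))

# Optionally compact any remaining small files with OPTIMIZE (bin-packing).
spark.sql(f"OPTIMIZE delta.`{unpartitioned_path}`")
```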