ANALYZE Your Databricks Delta Tables for Improved Performance

Nnaemezue Obi-Eyisi
3 min read · Mar 25, 2024


Most of the time when I try to optimize the performance of Delta tables, I focus on file size, organization, and structure. This is where concepts like compaction, partitioning, co-locating related data in files, and tuning file sizes come into play. To accomplish these optimizations, we leverage Delta features such as OPTIMIZE, Z-ORDER, and Liquid Clustering. However, I believe another significant aspect that we need to incorporate into our Delta table optimization is updating statistics.
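As a quick refresher, those file-layout optimizations are issued as SQL commands. Here is a minimal sketch; the table and column names are made up for illustration, and the Liquid Clustering variant assumes a Databricks Runtime recent enough to support it:

-- Compact small files and co-locate rows that share the same customer_id
OPTIMIZE sales_transactions ZORDER BY (customer_id);

-- Alternatively, on runtimes that support Liquid Clustering, cluster instead of Z-ORDER
ALTER TABLE sales_transactions CLUSTER BY (customer_id);
OPTIMIZE sales_transactions;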

It’s important to remember that as our data changes significantly due to updates, inserts, or deletes, the cost-based optimizer (CBO) might require updated statistics on the table’s columns to generate an optimal execution plan when executing analytical queries, such as joins or filters, against the table. To achieve this, we can use the ANALYZE command.
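As a sketch, the command can collect table-level statistics or column-level statistics for the columns most used in joins and filter predicates; the table and column names below are hypothetical:

-- Table-level statistics (row count, total size in bytes)
ANALYZE TABLE sales_transactions COMPUTE STATISTICS;

-- Column-level statistics for the columns used in joins and filters
ANALYZE TABLE sales_transactions COMPUTE STATISTICS FOR COLUMNS customer_id, order_date;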

Updating statistics is not a new concept; it has been a standard practice in most major database platforms. Typically, it is performed after the completion of a major ETL workload.

Note the difference between ANALYZE and Delta log statistics

You might be wondering: isn’t Databricks collecting stats already? Don’t confuse the ANALYZE command with the automatic collection of statistics for the first 32 columns of a Delta table, which powers the Data Skipping optimization feature. Those statistics are captured in the Delta transaction log and consist mostly of per-column min and max values, plus null counts.
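If you need that automatic data-skipping collection to cover more (or fewer) columns, Delta exposes a table property for it. A small sketch, again with a hypothetical table name:

-- Collect data-skipping statistics on the first 40 columns instead of the default 32
ALTER TABLE sales_transactions
SET TBLPROPERTIES ('delta.dataSkippingNumIndexedCols' = '40');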

Get the Best of ANALYZE when Unity Catalog is Enabled

It’s worth noting that the ANALYZE command is most useful in Unity Catalog-enabled workspaces, i.e., for tables whose metadata is maintained in Unity Catalog. If you have an external Delta table registered in the Hive Metastore, running the ANALYZE command has little impact.
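If you are unsure where a table’s metadata lives, one quick check (hypothetical table name) is the detailed table description, whose output should indicate the catalog the table belongs to (hive_metastore versus a Unity Catalog catalog):

-- Inspect the detailed table information, including its catalog and provider
DESCRIBE TABLE EXTENDED sales_transactions;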

Below are some key takeaways for running the ANALYZE command:

To properly leverage CBO optimizations, the ANALYZE TABLE command needs to be executed regularly (preferably once per day, or whenever more than 10% of the data has changed, whichever happens first).

When Delta tables are recreated or overwritten on a daily basis, the ANALYZE TABLE command should be executed immediately after the table is overwritten, as part of the same job or pipeline. This will have an impact on your pipeline’s overall SLA, so in cases like this there is a trade-off between better downstream performance and the current job’s execution time. If you don’t want this to affect the current job’s SLA, you can turn CBO off instead.

Otherwise, avoid running the ANALYZE TABLE command inline in your ETL job. It should be run as a separate job on a separate job cluster; for example, it can sit in the same nightly notebook that runs commands like OPTIMIZE, Z-ORDER, and VACUUM (a sketch of such a notebook follows these takeaways).

Spark’s Adaptive Query Execution (AQE), which re-optimizes the query plan on the fly at runtime, also takes advantage of the statistics calculated by the ANALYZE TABLE command. Therefore, it’s recommended to run the ANALYZE TABLE command regularly to keep the table statistics up to date.
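Here is a rough sketch of that nightly maintenance notebook. The table name and Z-ORDER column are hypothetical, and the retention policy is whatever your own VACUUM policy dictates:

-- Nightly maintenance job, run on its own job cluster after the ETL workload completes
OPTIMIZE sales_transactions ZORDER BY (customer_id);
VACUUM sales_transactions;  -- uses the default 7-day retention threshold
ANALYZE TABLE sales_transactions COMPUTE STATISTICS FOR ALL COLUMNS;

-- Separately, in a pipeline session where you would rather keep CBO out of play,
-- it can be disabled via the Spark SQL configuration
SET spark.sql.cbo.enabled = false;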

Reference

For further information on the ANALYZE command syntax, please refer to the official documentation: here

ANALYZE TABLE students COMPUTE STATISTICS 

Conclusion

As we can see, the ANALYZE TABLE statement collects statistics about a specific table, or about all tables in a specified schema. The query optimizer then uses these statistics to generate optimal query plans.

Follow me on: LinkedIn | All Platforms

To learn Azure Data Engineering with Databricks and join the waitlist: Click here

About Me

I am Nnaemezue Obi-Eyisi, a Senior Azure Databricks Data Engineer at Capgemini and the founder of AfroInfoTech, an online coaching platform for Azure data engineers specializing in Databricks.
