The motivation behind this article is to explore the potential cost savings of Databricks job clusters and to determine whether it is better to use them from Data Factory or from Databricks workflows. All-purpose (interactive) clusters tend to be more expensive than job clusters. The difference arises not only because interactive clusters keep running for a minimum of about 10 minutes before they shut off, but also because all-purpose compute costs more than twice as much per DBU as job compute ($0.40/DBU vs $0.15/DBU).
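To make the price gap concrete, here is a rough sketch of the DBU arithmetic. It uses only the two rates and the roughly 10-minute shutdown delay quoted above. The cluster size (`dbu_per_hour`) and run length are hypothetical inputs, VM charges are ignored, and both cluster types are assumed to finish the work in the same time, which is a simplification.

```python
JOB_RATE = 0.15          # $/DBU, job cluster rate quoted in this article
ALL_PURPOSE_RATE = 0.40  # $/DBU, all-purpose rate quoted in this article
IDLE_TAIL_MIN = 10       # approximate minimum shutdown delay quoted in this article


def dbu_cost(rate_per_dbu, dbu_per_hour, minutes):
    """Return the DBU charge only; VM cost is billed separately and not modeled."""
    return rate_per_dbu * dbu_per_hour * minutes / 60


def compare(dbu_per_hour, run_minutes):
    """Return (job_cost, all_purpose_cost) for the same hypothetical workload."""
    job = dbu_cost(JOB_RATE, dbu_per_hour, run_minutes)
    # The all-purpose cluster keeps billing during the idle tail before it shuts off.
    all_purpose = dbu_cost(ALL_PURPOSE_RATE, dbu_per_hour, run_minutes + IDLE_TAIL_MIN)
    return job, all_purpose


if __name__ == "__main__":
    # Hypothetical workload: a cluster consuming 8 DBU/hour for a 30-minute run.
    job, all_purpose = compare(dbu_per_hour=8, run_minutes=30)
    print(f"job cluster: ${job:.2f}, all-purpose cluster: ${all_purpose:.2f}")
```

Under these assumptions the rate alone accounts for a factor of about 2.7, and the idle tail widens the gap further for short runs.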
In this article, I aim to compare and contrast Data Factory orchestration with and without Databricks workflows. Additionally, I’ll discuss the drawbacks of workflows and compare the performance of job clusters and interactive clusters.
Data Factory with Job Clusters
When Azure Data Factory runs a Databricks notebook activity on a job cluster, cluster provisioning is a significant concern, especially for the first notebook activity, which typically takes around 3 to 5 minutes to kick off. For subsequent notebook activities the system reuses existing compute resources, but it still deallocates and reallocates the compute resource for each activity, which takes a significant amount of time.
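The effect on total pipeline time can be sketched with a simple model. The 4-minute default for the first activity is the midpoint of the 3 to 5 minutes mentioned above. The per-activity reallocation overhead of 1 minute is a purely hypothetical placeholder, not a measured value, and the function name is illustrative.

```python
def pipeline_minutes(run_minutes, initial_provision_min=4.0, per_activity_overhead_min=1.0):
    """Estimate wall-clock minutes for notebook activities run one after another.

    run_minutes: notebook execution time per activity, in minutes.
    initial_provision_min: cluster provisioning before the first activity
        (midpoint of the 3 to 5 minutes quoted in the article).
    per_activity_overhead_min: HYPOTHETICAL cost of deallocating and
        reallocating compute before each later activity.
    """
    if not run_minutes:
        return 0.0
    later_activities = len(run_minutes) - 1
    overhead = initial_provision_min + per_activity_overhead_min * later_activities
    return overhead + sum(run_minutes)


if __name__ == "__main__":
    # Three hypothetical notebooks taking 2, 3 and 5 minutes of actual work.
    print(pipeline_minutes([2, 3, 5]))
```

The overhead term grows with the number of activities, so pipelines made of many short notebooks are affected the most.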
Why Databricks Job Clusters in ADF Take More Time
It’s essential to note that even after excluding the initial (3–5 minutes) cluster provisioning time, the total pipeline run when using a job cluster still takes significantly more time than running the entire data pipeline using an interactive cluster. This is due to several factors:
- A warmed-up job cluster must be assigned to each individual Databricks notebook activity and then ‘logically terminated’ after the activity completes.
- If any packages or libraries require installation on the cluster, this setup must be repeated at the beginning of each notebook activity.
- Furthermore, when using Data Factory to orchestrate Databricks notebooks, additional time is incurred as the Databricks Run Job API…