Databricks on Microsoft Azure - Cost Calculation

Nagaraju Gajula
Better Data Platforms
3 min read · Aug 11, 2023

Introduction:

In the era of data-driven decision-making, businesses are leveraging advanced technologies to extract valuable insights from their data. Databricks, combined with Microsoft Azure Data Lake Storage (ADLS), offers a powerful solution for processing and analyzing data efficiently. In this example scenario, we’ll explore how a retail company can utilize Databricks on Azure to analyze sales data stored in ADLS. By considering assumptions such as cluster configuration, data storage, and egress costs, we’ll calculate an estimated cost for this data analysis endeavor.

This walkthrough shows how to estimate the cost of using Databricks on Microsoft Azure to process and analyze data stored in Azure Data Lake Storage (ADLS).

Example Scenario: You are a retail company that wants to analyze sales data to gain insights into customer behavior and optimize inventory management. The data is stored in ADLS, and you plan to use Databricks for data processing and analysis.

Assumptions:

  1. Databricks Workspace: You have set up a Databricks Workspace in Azure, and you have chosen the standard tier for Databricks.
  2. Data Storage: Your sales data is stored in ADLS, and the storage cost is $0.02 per GB per month.
  3. Cluster Configuration: You plan to use a Databricks cluster with the following configuration:
  • Cluster Type: Standard_DS3_v2
  • Number of Worker Nodes: 4
  • Worker Node VM Size: Standard_DS3_v2 (4 cores, 14 GB RAM per node)
  • Auto-scaling enabled with a maximum of 8 worker nodes during peak times.

4. Data Egress: You expect to transfer processed data from Databricks to an Azure SQL Database for reporting and visualization. The data egress cost is $0.10 per GB.

Cost Calculation:

  1. Databricks Cluster Cost:
  • Cost per Worker Node per Hour: $0.35 (an assumed blended rate for a Standard_DS3_v2 worker, covering the VM plus Databricks DBU charges)
  • Number of Worker Nodes: 4 (as specified in the cluster configuration)
  • Cluster Usage Hours: 160 hours in a month (assuming the cluster runs roughly 8 hours per day on about 20 working days; a cluster running 24/7 would instead accrue around 720 hours)
  • Total Cluster Cost: $0.35 * 4 * 160 = $224

2. Data Storage Cost:

  • Amount of Data Stored: 500 GB
  • Storage Cost per Month: $0.02 per GB
  • Total Storage Cost: $0.02 * 500 = $10

3. Data Egress Cost:

  • Amount of Data Egressed: 100 GB
  • Data Egress Cost per GB: $0.10
  • Total Data Egress Cost: $0.10 * 100 = $10

Total Cost: The total cost of using Databricks in this scenario would be the sum of the cluster cost, data storage cost, and data egress cost: Total Cost = $224 (Cluster) + $10 (Storage) + $10 (Data Egress) = $244
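The arithmetic above can be reproduced with a short script. All rates and volumes below are the illustrative assumptions from this scenario, not current Azure or Databricks prices:

```python
# Hypothetical monthly cost estimate using the assumed rates from the scenario.
# These figures are illustrative assumptions, not actual Azure/Databricks pricing.

worker_rate_per_hour = 0.35   # assumed blended $/hour per Standard_DS3_v2 worker
worker_nodes = 4              # baseline cluster size (before auto-scaling)
usage_hours = 160             # assumed cluster runtime per month

storage_gb = 500              # sales data volume stored in ADLS
storage_rate = 0.02           # assumed $/GB per month for ADLS storage

egress_gb = 100               # processed data moved to Azure SQL Database
egress_rate = 0.10            # assumed $/GB for data egress

cluster_cost = worker_rate_per_hour * worker_nodes * usage_hours
storage_cost = storage_rate * storage_gb
egress_cost = egress_rate * egress_gb
total = cluster_cost + storage_cost + egress_cost

print(f"Cluster: ${cluster_cost:.2f}, Storage: ${storage_cost:.2f}, "
      f"Egress: ${egress_cost:.2f}, Total: ${total:.2f}")
```

Keeping the estimate in a script like this makes it easy to re-run with updated rates from the Azure pricing pages, or to see how auto-scaling to 8 worker nodes during peak hours would change the cluster line item.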

Please note that this is a simplified example, and actual costs may vary based on specific usage patterns, discounts, and other factors. Always refer to the Azure pricing documentation and Databricks pricing details for the most accurate cost estimation. Additionally, remember to optimize cluster usage, shut down clusters when not in use, and consider other cost-saving measures to control expenses effectively.

Conclusion:

As businesses strive to gain competitive advantages through data analysis, the combination of Databricks and Azure Data Lake Storage provides a robust platform for achieving this goal. This scenario illustrated how a retail company could effectively process and analyze sales data stored in ADLS using a Databricks cluster. By calculating estimated costs, we’ve highlighted the financial aspects of this endeavor. However, it’s important to acknowledge that actual costs can vary due to usage patterns, discounts, and other factors. As organizations embark on similar journeys, it’s crucial to reference Azure’s pricing documentation and Databricks pricing details for precise cost assessments. To optimize costs, businesses should implement prudent usage practices, employ cost-saving strategies, and capitalize on the insights garnered to drive informed decision-making and operational efficiency.
