Getting Started with Databricks on Azure

Raghav Matta
3 min read · May 4, 2020

--

In this tutorial, you will learn how to use Azure Databricks to achieve the following objectives:

  • Create an Azure Databricks workspace using the Azure portal
  • Create and configure a Spark cluster in Azure Databricks

Requirements

A Microsoft Azure subscription (pay-as-you-go, MSDN, or trial)

What is Databricks?

Databricks on Azure provides a managed analytics platform. It breaks down the silos between data engineers, business users, and data scientists, letting them collaborate and deliver a single solution from ingestion to production.

How can Databricks help your organization?

Databricks can help organizations streamline the below tasks:

  • Enabling data teams to work with big data efficiently, from ingestion to production
  • Providing a single place to connect disparate data sources and build data pipelines
  • Data preparation and feature engineering for machine learning models
  • Data exploration and profiling using ML and BI applications
  • Extracting actionable insights from data for decision making

Exercise 1: Create an Azure Databricks workspace using the Azure portal

  • Navigate to https://portal.azure.com/ and sign in with the Microsoft account credentials associated with your Azure subscription
  • From your Azure dashboard, click + Create a resource, then Analytics, and select Azure Databricks
  • Under the Basics tab on the Azure Databricks Service creation page, enter the following details:
Subscription: your Azure subscription
Resource Group (Create new): databricks-rg
Workspace Name: <Your Name>dbc
Location: East US
Pricing Tier: Standard
  • Click Review + Create, and then click Create. It may take 5–10 minutes for your deployment to complete
  • After your deployment succeeds, click Go to resource; this opens your Azure Databricks service
  • Click Launch Workspace to open Azure Databricks, and sign in using the same Microsoft credentials associated with your Azure subscription
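The portal steps above can also be scripted. As a rough sketch (not part of the original walkthrough), Azure Resource Manager exposes a `Microsoft.Databricks/workspaces` PUT endpoint; the subscription ID, managed resource group name, and API version below are placeholders and assumptions, and you would need an Azure AD bearer token to actually send the request:

```python
# Hypothetical sketch: creating the Exercise 1 workspace via the Azure
# Resource Manager REST API instead of the portal.
import json

subscription_id = "00000000-0000-0000-0000-000000000000"  # placeholder
resource_group = "databricks-rg"
workspace_name = "yournamedbc"

# ARM endpoint for Azure Databricks workspaces (api-version is an assumption)
url = (
    "https://management.azure.com"
    f"/subscriptions/{subscription_id}"
    f"/resourceGroups/{resource_group}"
    f"/providers/Microsoft.Databricks/workspaces/{workspace_name}"
    "?api-version=2018-04-01"
)

body = {
    "location": "eastus",               # Location: East US
    "sku": {"name": "standard"},        # Pricing Tier: Standard
    "properties": {
        # Azure Databricks requires a managed resource group for its
        # backing resources; the name used here is an assumption.
        "managedResourceGroupId": (
            f"/subscriptions/{subscription_id}"
            "/resourceGroups/databricks-rg-managed"
        )
    },
}

# To actually deploy, PUT this body with an Azure AD bearer token, e.g.:
#   requests.put(url, headers={"Authorization": f"Bearer {token}"}, json=body)
print(json.dumps(body, indent=2))
```

The same deployment can of course be expressed as an ARM template or driven from the Azure CLI; the portal steps in this exercise remain the simplest path for a first workspace.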

Exercise 2: Create and configure a Spark cluster in Azure Databricks

  • On your Azure Databricks workspace homepage, click Clusters in the left-hand menu, and then click + Create Cluster
  • On the New Cluster page, enter the Cluster Name as spark-cluster
  • Cluster Mode can be set to Standard or High Concurrency. Standard is the default and is recommended for single users; it supports Python, R, SQL, and Scala. High Concurrency mode is optimized to run concurrent Python, R, and SQL workloads. Use Standard mode for this demo
  • A pool keeps a defined number of ready instances to reduce cluster startup time, which is useful in collaborative environments where users frequently create and release clusters. Choose the default option
  • Azure Databricks offers several runtime types, and several versions of each, in the Databricks Runtime Version drop-down when you create or edit a cluster. Choose the default runtime
  • Make sure autoscaling is enabled, and to save on cost, set the cluster to terminate after 30 minutes of inactivity
  • A cluster consists of one driver node and multiple worker nodes. You can pick separate cloud-provider instance types for the driver and worker nodes, although by default the driver uses the same instance type as the workers. Choose the default options for both
  • Verify the cluster configuration, and click Create Cluster
  • After the cluster is created, you can view it in the Clusters tab
  • Now you can create a notebook and start building your analytical solutions
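The cluster choices above map directly onto the Databricks Clusters REST API (POST /api/2.0/clusters/create). As a minimal sketch, here is the same cluster expressed as a request payload; the runtime version, node type, and autoscaling bounds are assumptions standing in for whatever defaults your workspace shows:

```python
# Hypothetical sketch: the Exercise 2 cluster as a Clusters API payload.
import json

cluster_spec = {
    "cluster_name": "spark-cluster",
    "spark_version": "6.4.x-scala2.11",  # Databricks Runtime (assumed default)
    "node_type_id": "Standard_DS3_v2",   # driver/worker instance type (assumed)
    # Autoscaling enabled: Databricks adds or removes workers in this range.
    "autoscale": {"min_workers": 2, "max_workers": 8},
    # Terminate after 30 minutes of inactivity to save on cost.
    "autotermination_minutes": 30,
}

# To create the cluster, POST the spec to your workspace with a personal
# access token, e.g.:
#   requests.post(f"https://{workspace_host}/api/2.0/clusters/create",
#                 headers={"Authorization": f"Bearer {token}"},
#                 json=cluster_spec)
print(json.dumps(cluster_spec, indent=2))
```

Scripting the cluster this way is handy once you need to recreate the same configuration across workspaces, but for this tutorial the UI steps above are all you need.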

--


Raghav Matta

Microsoft Certified Trainer with a passion for learning and sharing knowledge | Expertise in the Azure Data Platform | Databricks