Getting Started with Databricks on Azure

Raghav Matta
3 min read · May 4, 2020

--

In this tutorial, you will learn how to use Azure Databricks to achieve the following objectives:

  • Create an Azure Databricks workspace using the Azure portal
  • Create and configure a Spark cluster in Azure Databricks

Requirements

A Microsoft Azure subscription (pay-as-you-go, MSDN, or trial)

What is Databricks?

Databricks on Azure provides a managed analytics platform. It breaks down the silos between data engineers, business users, and data scientists, letting them collaborate and deliver a single solution from ingestion to production.

How can Databricks help your organization?

Databricks can help organizations streamline the below tasks:

  • Enabling data teams to work with big data efficiently, from ingestion to production
  • Providing a single place to connect disparate data sources and build data pipelines
  • Data preparation and feature engineering for machine learning models
  • Data exploration and profiling using ML and BI applications
  • Extracting actionable insights from data for decision making

Exercise 1: Create an Azure Databricks workspace using the Azure portal

  • Navigate to https://portal.azure.com/ and sign in with the Microsoft account credentials associated with your Azure subscription
  • From your Azure dashboard, click + Create a resource, then Analytics, and select Azure Databricks
  • Under the Basics tab on the Azure Databricks Service creation page, enter the following details:
Subscription: your Azure subscription
Resource Group (Create new): databricks-rg
Workspace Name: <Your Name>dbc
Location: East US
Pricing Tier: Standard
  • Click Review + Create, and then click Create. It may take 5–10 minutes for your deployment to complete
  • After your deployment succeeds, click Go to resource; this opens your Azure Databricks service
  • Click Launch Workspace to open Azure Databricks, and sign in using the same Microsoft credentials associated with your Azure subscription
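The portal steps above can also be scripted. As a rough sketch (not part of the original walkthrough), Azure Resource Manager exposes a `Microsoft.Databricks/workspaces` PUT endpoint; the subscription ID, managed resource group name, and API version below are placeholders and assumptions, and you would need an Azure AD bearer token to actually send the request:

```python
# Hypothetical sketch: creating the Exercise 1 workspace via the Azure
# Resource Manager REST API instead of the portal.
import json

subscription_id = "00000000-0000-0000-0000-000000000000"  # placeholder
resource_group = "databricks-rg"
workspace_name = "yournamedbc"

# ARM endpoint for Azure Databricks workspaces (api-version is an assumption)
url = (
    "https://management.azure.com"
    f"/subscriptions/{subscription_id}"
    f"/resourceGroups/{resource_group}"
    f"/providers/Microsoft.Databricks/workspaces/{workspace_name}"
    "?api-version=2018-04-01"
)

body = {
    "location": "eastus",               # Location: East US
    "sku": {"name": "standard"},        # Pricing Tier: Standard
    "properties": {
        # Azure Databricks requires a managed resource group for its
        # backing resources; the name used here is an assumption.
        "managedResourceGroupId": (
            f"/subscriptions/{subscription_id}"
            "/resourceGroups/databricks-rg-managed"
        )
    },
}

# To actually deploy, PUT this body with an Azure AD bearer token, e.g.:
#   requests.put(url, headers={"Authorization": f"Bearer {token}"}, json=body)
print(json.dumps(body, indent=2))
```

The same deployment can of course be expressed as an ARM template or driven from the Azure CLI; the portal steps in this exercise remain the simplest path for a first workspace.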

Exercise 2: Create and configure a Spark cluster in Azure Databricks

  • On your Azure Databricks workspace homepage, click Clusters in the left-hand menu, and then click + Create Cluster
  • On the New Cluster page, enter the Cluster Name as spark-cluster
  • Cluster Mode can be set to Standard or High Concurrency. Standard is the default and is recommended for single users; it supports Python, R, SQL, and Scala. High Concurrency mode is optimized to run concurrent Python, R, and SQL workloads. Use Standard mode for this demo
  • A pool keeps a defined number of ready instances to reduce cluster startup time, which is useful in collaborative environments where users frequently create and release clusters. Choose the default option
  • Azure Databricks offers several runtime types, and several versions of each, in the Databricks Runtime Version drop-down when you create or edit a cluster. Choose the default runtime
  • Make sure autoscaling is enabled, and to save on cost, set the cluster to terminate after 30 minutes of inactivity
  • A cluster consists of one driver node and multiple worker nodes. You can pick separate cloud-provider instance types for the driver and worker nodes, although by default the driver uses the same instance type as the workers. Choose the default options for both
  • Verify the cluster configuration, and click Create Cluster
  • After the cluster is created, you can view it in the Clusters tab
  • Now you can create a notebook and start building your analytical solutions
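The cluster choices above map directly onto the Databricks Clusters REST API (POST /api/2.0/clusters/create). As a minimal sketch, here is the same cluster expressed as a request payload; the runtime version, node type, and autoscaling bounds are assumptions standing in for whatever defaults your workspace shows:

```python
# Hypothetical sketch: the Exercise 2 cluster as a Clusters API payload.
import json

cluster_spec = {
    "cluster_name": "spark-cluster",
    "spark_version": "6.4.x-scala2.11",  # Databricks Runtime (assumed default)
    "node_type_id": "Standard_DS3_v2",   # driver/worker instance type (assumed)
    # Autoscaling enabled: Databricks adds or removes workers in this range.
    "autoscale": {"min_workers": 2, "max_workers": 8},
    # Terminate after 30 minutes of inactivity to save on cost.
    "autotermination_minutes": 30,
}

# To create the cluster, POST the spec to your workspace with a personal
# access token, e.g.:
#   requests.post(f"https://{workspace_host}/api/2.0/clusters/create",
#                 headers={"Authorization": f"Bearer {token}"},
#                 json=cluster_spec)
print(json.dumps(cluster_spec, indent=2))
```

Scripting the cluster this way is handy once you need to recreate the same configuration across workspaces, but for this tutorial the UI steps above are all you need.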

--


Raghav Matta

Microsoft Certified Trainer with a passion for learning and sharing knowledge | Expertise in the Azure Data Platform | Databricks