Azure Databricks — Part 1: Introduction

Alexandre Bergere · datalex · Mar 30, 2020

Over the last couple of years, we have been using Spark and Databricks in multiple AI and Big Data projects with different clients. Last year, we decided to participate in the Spark + AI Summit Europe, held in Amsterdam.

During the event, we met many inspiring people from all over the world, and we got the chance to take part in a couple of meetings with Adam Conway (SVP of Products, Data and AI at Databricks), Michael Armbrust (the creator of Delta Lake) and many more.

This series of events motivated us to dig a little deeper in order to understand every aspect of the platform.

Today we are kicking off this series of articles to walk you through the key concepts of Azure Databricks. Our goal is to take you from scratch all the way to deploying your solution in production with Databricks.

How are we going to proceed throughout this series?

To help you practice everything you will learn, we decided to end the series with the implementation of a concrete enterprise AI use case:

In our scenario, France operates four different power plants: a nuclear plant, a solar farm, a wind turbine and a coal-fired plant.

One of the biggest challenges faced by these facilities is the over-consumption caused by one of the stations.

To tackle this challenge, we simulate the behavior of these power plants through telemetry devices.

The aim of our solution is to predict over-consumption and redirect it. Whenever this scenario happens, our algorithm picks ahead of time the best candidate to absorb the marginal consumption, as sketched below.

The algorithm detects potential over-consumption and redirects it to the best candidate
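
To make this concrete, here is a minimal Python sketch of that selection logic. Every name and figure in it (the plants, their capacities, the predicted loads) is an assumption made for illustration; the real model is built step by step in the last article.

# Hypothetical sketch: pick the plant with the most spare capacity
# to absorb a predicted over-consumption. All figures are illustrative.

plants = {
    "nuclear":    {"capacity_mw": 900, "predicted_load_mw": 850},
    "solar":      {"capacity_mw": 300, "predicted_load_mw": 120},
    "wind":       {"capacity_mw": 250, "predicted_load_mw": 200},
    "coal_fired": {"capacity_mw": 600, "predicted_load_mw": 580},
}

def best_candidate(plants, over_consumption_mw):
    """Return the plant with enough headroom to absorb the excess, or None."""
    headroom = {
        name: p["capacity_mw"] - p["predicted_load_mw"]
        for name, p in plants.items()
    }
    name, spare = max(headroom.items(), key=lambda kv: kv[1])
    return name if spare >= over_consumption_mw else None

print(best_candidate(plants, over_consumption_mw=100))  # -> solar

With these toy numbers, the solar farm has the most headroom (180 MW) and is picked to absorb the excess.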

Our power plant data is simulated from weather history covering the last couple of years.
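
As an illustration, the snippet below sketches in PySpark how such telemetry could be derived from historical weather records. The column names (wind_speed_ms, irradiance_wm2) and the linear production curves are assumptions made for this example, not the actual simulator of the use case.

# Minimal sketch, assuming a weather history with hypothetical columns:
# derive a simulated power output per plant from the weather signal.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Two sample rows standing in for a couple of years of weather history.
weather = spark.createDataFrame(
    [("2019-06-01 12:00:00", 8.5, 640.0),
     ("2019-06-01 13:00:00", 11.2, 710.0)],
    ["timestamp", "wind_speed_ms", "irradiance_wm2"],
)

telemetry = (
    weather
    # Toy linear curves standing in for real production models.
    .withColumn("wind_output_mw", F.col("wind_speed_ms") * 12.0)
    .withColumn("solar_output_mw", F.col("irradiance_wm2") * 0.3)
)
telemetry.show()

On Databricks the spark session already exists in every notebook, so the builder line is only needed when running outside the platform.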

After several design workshops, we decided to implement the following architecture:

Use case’s architecture

Everything will be explained in more detail in the last article, and the code used to deploy the infrastructure will be available on GitHub.

But first, we will start by explaining the architecture behind the platform, along with the different security aspects you should know about. Then, we will move on to the industrialization of Databricks workflows: from the CI/CD pipeline, using Ansible and Terraform, to the orchestration of your different jobs (a small teaser of which is sketched below). We will also dedicate an article to the best practices to implement in your projects.
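
As that teaser, the sketch below schedules a notebook as a nightly job through the Databricks Jobs REST API. The workspace URL, token, notebook path and cluster settings are placeholders, not values from our project; parts 7.1 and 7.2 cover the real pipeline.

# Hedged sketch: create a scheduled job via the Databricks Jobs API (2.0).
# Workspace URL, token and notebook path are placeholders.
import requests

resp = requests.post(
    "https://<your-workspace>.azuredatabricks.net/api/2.0/jobs/create",
    headers={"Authorization": "Bearer <personal-access-token>"},
    json={
        "name": "nightly-telemetry-ingest",
        "new_cluster": {
            "spark_version": "6.4.x-scala2.11",
            "node_type_id": "Standard_DS3_v2",
            "num_workers": 2,
        },
        "notebook_task": {"notebook_path": "/Jobs/ingest_telemetry"},
        "schedule": {
            "quartz_cron_expression": "0 0 2 * * ?",
            "timezone_id": "Europe/Paris",
        },
    },
)
print(resp.json())  # {'job_id': ...} on success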

Finally, we will end the series by implementing our solution together, from scratch.

Plan:

To cover all of this, the series will be composed of the following articles:

  • Azure Databricks — Part 1: Introduction
  • Azure Databricks — Part 2.1: The architecture behind
  • Azure Databricks — Part 2.2: Getting familiar with Databricks UI
  • Azure Databricks — Part 3: Connect Azure storage to Databricks
  • Azure Databricks — Part 4.1: Data protection
  • Azure Databricks — Part 4.2: Authentication through Databricks
  • Azure Databricks — Part 4.3: Secure your network
  • Azure Databricks — Part 5: Monitor your platform
  • Azure Databricks — Part 6: Configure the development environment
  • Azure Databricks — Part 7.1: Integration of Databricks in your CI/CD pipeline
  • Azure Databricks — Part 7.2: Schedule your work
  • Azure Databricks — Part 8: Stay on top of Databricks best practices
  • Azure Databricks — Part 9: Using Databricks in a big company
  • Azure Databricks — Part 10: Use case — from ingestion to visualization
  • Azure Databricks — Bonus: Spark AI Summit 2019 Overview

P.S.: All the content used throughout this series will also be available on GitHub; you can find the repository here.

About the authors

Alexandre Bergere, editor for datalex: independent Data Architect & Solution Architect ☁️ Delta & openLineage lover.