Azure Databricks — Part 1: Introduction

Alexandre Bergere · datalex · Mar 30, 2020

Over the last couple of years, we have been using Spark and Databricks in multiple AI and Big Data projects with different clients. Last year, we decided to participate in the Spark + AI Summit Europe, held in Amsterdam.

During the event, we met many inspiring people from all over the world, and we got the chance to take part in a couple of meetings with Adam Conway (SVP of Products, Data and AI at Databricks), Michael Armbrust (the creator of Delta Lake) and many more.

This series of events motivated us to dig a little deeper in order to understand every aspect of the platform.

Today we are kicking off this series of articles to walk you through the key concepts of Azure Databricks. Our goal is to take you from scratch all the way to deploying your solution in production with Databricks.

How are we going to proceed throughout this series?

To help you practice everything you will learn, we decided to end the series with the implementation of a concrete enterprise AI use case:

In our scenario, France operates four different power plants: a nuclear plant, a solar farm, a wind turbine and a coal-fired plant.

One of the biggest challenges faced by these facilities is the over-consumption caused by one of the stations.

To tackle this challenge, we simulate the behavior of these power plants through telemetry devices.

The aim of our solution is to predict over-consumption and redirect it. Whenever this scenario happens, our algorithm picks ahead of time the best candidate to absorb the marginal consumption, as sketched below.

The algorithm detects potential over-consumption and redirects it to the best candidate
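
To make this concrete, here is a minimal Python sketch of that selection logic. Every name and figure in it (the plants, their capacities, the predicted loads) is an assumption made for illustration; the real model is built step by step in the last article.

# Hypothetical sketch: pick the plant with the most spare capacity
# to absorb a predicted over-consumption. All figures are illustrative.

plants = {
    "nuclear":    {"capacity_mw": 900, "predicted_load_mw": 850},
    "solar":      {"capacity_mw": 300, "predicted_load_mw": 120},
    "wind":       {"capacity_mw": 250, "predicted_load_mw": 200},
    "coal_fired": {"capacity_mw": 600, "predicted_load_mw": 580},
}

def best_candidate(plants, over_consumption_mw):
    """Return the plant with enough headroom to absorb the excess, or None."""
    headroom = {
        name: p["capacity_mw"] - p["predicted_load_mw"]
        for name, p in plants.items()
    }
    name, spare = max(headroom.items(), key=lambda kv: kv[1])
    return name if spare >= over_consumption_mw else None

print(best_candidate(plants, over_consumption_mw=100))  # -> solar

With these toy numbers, the solar farm has the most headroom (180 MW) and is picked to absorb the excess.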

Our power plant data is simulated from weather history covering the last couple of years.
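
As an illustration, the snippet below sketches in PySpark how such telemetry could be derived from historical weather records. The column names (wind_speed_ms, irradiance_wm2) and the linear production curves are assumptions made for this example, not the actual simulator of the use case.

# Minimal sketch, assuming a weather history with hypothetical columns:
# derive a simulated power output per plant from the weather signal.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Two sample rows standing in for a couple of years of weather history.
weather = spark.createDataFrame(
    [("2019-06-01 12:00:00", 8.5, 640.0),
     ("2019-06-01 13:00:00", 11.2, 710.0)],
    ["timestamp", "wind_speed_ms", "irradiance_wm2"],
)

telemetry = (
    weather
    # Toy linear curves standing in for real production models.
    .withColumn("wind_output_mw", F.col("wind_speed_ms") * 12.0)
    .withColumn("solar_output_mw", F.col("irradiance_wm2") * 0.3)
)
telemetry.show()

On Databricks the spark session already exists in every notebook, so the builder line is only needed when running outside the platform.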

After several design workshops, we decided to implement the following architecture:

Use case’s architecture

Everything will be explained in more detail in the last article, and the code used to deploy the infrastructure will be available on GitHub.

But first, we will start by explaining the architecture behind the platform, along with the different security aspects you should know about. Then, we will move on to the industrialization of Databricks workflows: from the CI/CD pipeline, using Ansible and Terraform, to the orchestration of your different jobs (a small teaser of which is sketched below). We will also dedicate an article to the best practices to implement in your projects.
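
As that teaser, the sketch below schedules a notebook as a nightly job through the Databricks Jobs REST API. The workspace URL, token, notebook path and cluster settings are placeholders, not values from our project; parts 7.1 and 7.2 cover the real pipeline.

# Hedged sketch: create a scheduled job via the Databricks Jobs API (2.0).
# Workspace URL, token and notebook path are placeholders.
import requests

resp = requests.post(
    "https://<your-workspace>.azuredatabricks.net/api/2.0/jobs/create",
    headers={"Authorization": "Bearer <personal-access-token>"},
    json={
        "name": "nightly-telemetry-ingest",
        "new_cluster": {
            "spark_version": "6.4.x-scala2.11",
            "node_type_id": "Standard_DS3_v2",
            "num_workers": 2,
        },
        "notebook_task": {"notebook_path": "/Jobs/ingest_telemetry"},
        "schedule": {
            "quartz_cron_expression": "0 0 2 * * ?",
            "timezone_id": "Europe/Paris",
        },
    },
)
print(resp.json())  # {'job_id': ...} on success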

Finally, we will end the series by implementing our solution together, from scratch.

Plan:

To cover all of this, the series will be composed of the following articles:

  • Azure Databricks — Part 1: Introduction
  • Azure Databricks — Part 2.1: The architecture behind
  • Azure Databricks — Part 2.2: Getting familiar with Databricks UI
  • Azure Databricks — Part 3: Connect Azure storage to Databricks
  • Azure Databricks — Part 4.1: Data protection
  • Azure Databricks — Part 4.2: Authentication through Databricks
  • Azure Databricks — Part 4.3: Secure your network
  • Azure Databricks — Part 5: Monitor your platform
  • Azure Databricks — Part 6: Configure the development environment
  • Azure Databricks — Part 7.1: Integration of Databricks in your CI/CD pipeline
  • Azure Databricks — Part 7.2: Schedule your work
  • Azure Databricks — Part 8: Stay on top of Databricks best practices
  • Azure Databricks — Part 9: Using Databricks in a big company
  • Azure Databricks — Part 10: Use case — from ingestion to visualization
  • Azure Databricks — Bonus: Spark AI Summit 2019 Overview

P.S.: All the content used throughout this series will also be available on GitHub; you can find the repository here.

About the authors

Alexandre Bergere, editor for datalex: independent Data Architect & Solution Architect ☁️ Delta & openLineage lover.