My development notebook got executed on production Databricks cluster — how did that happen?

Carson Yeung
4 min read · Mar 12, 2020


Today one of my customers asked me: “Hey Carson, I am surprised that a notebook on my production Databricks cluster was executed by a Data Factory pipeline in my development environment. How can that be? I have already secured my Databricks cluster with VNet injection!”

This can be a disaster for many data engineers: a test script accidentally executed against the production environment can modify or even delete production data.

So, what happened?

Environment Setup

To illustrate the idea, here is a diagram showing how the customer perceived his setup:

It is a simple setup: two environments, one for production and one for development.

In each environment, the Data Factory integration runtime calls Azure Databricks to execute a notebook on the cluster.

The two environments are created in two separate Azure subscriptions, so to my customer the environments should, in principle, be completely isolated.

Here is the question: how come the production Databricks notebook got executed by the development Data Factory?

Investigation into Data Factory

To investigate, we look into Data Factory and inspect how the linked service for Azure Databricks was set up:

If you look closely, you will notice that the only information entered is:

i) Domain / Region

ii) Access token

iii) Cluster name

So how does Data Factory know which cluster to connect to?

The answer is the “access token”, which is hidden in the screen above. The access token determines which Databricks workspace (control plane) the request is authenticated against, and the “cluster name” tells it which cluster should run the notebook.
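To make this concrete, here is a minimal sketch (my own illustration, not the customer's actual pipeline) of roughly what the notebook activity does behind the scenes: it calls the Databricks Jobs Runs Submit REST API with whatever token it was given. The host, token, cluster ID and notebook path below are all placeholder values.

```python
import requests

# Placeholder values for illustration only.
DATABRICKS_HOST = "https://westeurope.azuredatabricks.net"  # the "Domain / Region" field
ACCESS_TOKEN = "dapiXXXXXXXXXXXXXXXX"                       # the personal access token pasted into the linked service

# Whichever workspace issued this token is the workspace that answers the call.
response = requests.post(
    f"{DATABRICKS_HOST}/api/2.0/jobs/runs/submit",
    headers={"Authorization": f"Bearer {ACCESS_TOKEN}"},
    json={
        "run_name": "adf-triggered-notebook",
        "existing_cluster_id": "0123-456789-abcd123",       # the "cluster name" resolved to a cluster ID
        "notebook_task": {"notebook_path": "/Shared/my_notebook"},
    },
)
response.raise_for_status()
print(response.json())  # contains the run_id of the submitted notebook run
```

Notice that nothing in this call says “development” or “production”; the token alone decides which workspace the notebook runs in.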

So what went wrong in this particular setup?

The root causes: copy & paste + poor naming convention

Imagine one engineer working across ALL environments who has a habit of keeping the access tokens in a text file. There is a high chance that the WRONG token gets copied into the access token field.

What would be the implication of copying the wrong token?

Data Factory will simply connect to the Databricks cluster in whichever environment the token belongs to. VNet injection does not help here, because the command goes through the Databricks control plane rather than the protected virtual network.

What makes it worse? The name given to the Databricks cluster was simply “cluster”, a name that does not tell you which environment it belongs to.

This “double mistake” led to the unfortunate event of a development notebook being run in the production environment, and the issue was only discovered after quite some investigation.

The Mitigation

OK, now we know what happened. How do we mitigate this?

Here are at least 2 things:

1) Educate engineers not to keep multiple access tokens lying around in text files; this reduces the likelihood of human error

2) Cluster naming conventions also matter; a clear name alerts the engineer when the wrong environment has been selected (see the small sketch after this list)
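As a small illustration of why naming helps, here is a sketch (entirely my own, assuming a hypothetical “dev-” / “prod-” prefix convention) of a guard you could drop into a pipeline or deployment script so that an ambiguous cluster name fails fast:

```python
# Hypothetical convention: every cluster name starts with its environment, e.g. "dev-etl" or "prod-etl".
EXPECTED_ENV = "dev"

def assert_cluster_environment(cluster_name: str, expected_env: str = EXPECTED_ENV) -> None:
    """Refuse to proceed if the cluster name does not carry the expected environment prefix."""
    if not cluster_name.startswith(f"{expected_env}-"):
        raise ValueError(
            f"Cluster '{cluster_name}' does not look like a '{expected_env}' cluster; refusing to run."
        )

assert_cluster_environment("dev-etl-cluster")   # passes silently
# assert_cluster_environment("cluster")         # would raise: the name says nothing about the environment
```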

Obviously, this is still not ideal. The engineer might still be storing the secret somewhere on a laptop, which is not desirable from a data security point of view.

To improve it even further

If you take a closer look at the Data Factory connection screen, you will see that you can store the access token in Azure Key Vault instead, and set it up so that only the Data Factory managed identity can retrieve the token.

Here is the setup:

By introducing this additional layer of Key Vault protection, the chance of accessing the wrong Databricks environment is further reduced.

And you no longer have to worry about your developers storing the access token somewhere on their laptops! :)
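Under the hood, this is the same pattern you would use from any Azure-hosted service: a managed identity authenticates to Key Vault and pulls the secret at run time. Data Factory handles this for you once the linked service points at the Key Vault secret; the sketch below (with a made-up vault URL and secret name) just shows the moving parts, for example when you want to test the retrieval yourself with the Azure SDK.

```python
from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient

# Placeholder vault URL and secret name for illustration.
VAULT_URL = "https://my-prod-keyvault.vault.azure.net"
SECRET_NAME = "databricks-prod-access-token"

# DefaultAzureCredential resolves to a managed identity when running inside Azure,
# so the token never needs to live in a text file on anyone's laptop.
credential = DefaultAzureCredential()
client = SecretClient(vault_url=VAULT_URL, credential=credential)

databricks_token = client.get_secret(SECRET_NAME).value
print("Fetched a Databricks token of length:", len(databricks_token))
```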

Reference links:

- Store credentials in Azure Key Vault: https://docs.microsoft.com/en-us/azure/data-factory/store-credentials-in-key-vault

- Deploy Azure Databricks in your Azure Virtual Network (VNet injection): https://docs.microsoft.com/en-us/azure/databricks/administration-guide/cloud-configurations/azure/vnet-inject

- Databricks authentication: https://docs.databricks.com/dev-tools/api/latest/authentication.html
