Tracking AI model provenance with Dotscience and Databricks on Azure

Kai Davenport
Published in Dotscience
3 min read · Feb 22, 2019

Combining Databricks with Dotscience means you get version control and provenance tracking on data science projects using big data loaded from Spark.

What is the problem?

You have a large amount of data in your Databricks managed Spark cluster and want to use it to train AI models.

Once a model is deployed to production, you want to know exactly what dataset was used to train it.

Without tracking the provenance of the data used to train your models, you cannot confidently make changes to a dataset or know for sure that those changes will improve the performance of the model.

A solution…

Databricks excels at managing large-scale Spark clusters and providing access to big data stored in a variety of backend storage media.

Dotscience brings data provenance tracking capabilities to your project.

By combining these two tools, we get the best of both worlds: big data on a Databricks managed Spark cluster and the data provenance and model performance tracking of Dotscience.

How it works…

We will need the following things:

  • A Dotscience cloud account
  • A Databricks cloud account on Azure or AWS
  • A Dotscience runner with Jupyter
  • A Spark cluster with some data

We are going to:

  • Connect our Dotscience runner to our Spark cluster
  • Run an SQL query to extract data from our Spark cluster
  • Version control that data using the Dotscience API
  • Train a model using that data
  • Version control and annotate that model using the Dotscience API
  • Track the performance of the model using the Dotscience cloud visual interface

Demo…

Conclusion

As you have just seen, by combining these two powerful tools, you can use your Databricks Spark cluster as a data source for a Dotscience project. This is as simple as writing an SQL query to download the required data into your Dotscience workspace from your Jupyter notebook.

Then, using the Dotscience Python library, you can version control the input data. This means you can always get back to that exact version at any later stage. Even if the data in the Spark cluster changes, the data that was version controlled by Dotscience is guaranteed to stay the same.
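This is not the Dotscience implementation, but the guarantee can be illustrated with content addressing: if a snapshot of the data is identified by a hash of its contents, any later change to the source produces a different identity, so the pinned snapshot is immutable by construction.

```python
# Illustration only: why a pinned data version stays stable even when
# the upstream source changes.
import hashlib

def dataset_version(payload: bytes) -> str:
    """Content-address a snapshot of the data."""
    return hashlib.sha256(payload).hexdigest()[:12]

snapshot = b"id,feature\n1,23.4\n3,30.8\n"
pinned = dataset_version(snapshot)

# The data in the cluster changes afterwards...
changed = b"id,feature\n1,99.9\n3,30.8\n"

# ...but the pinned snapshot still resolves to the same version,
# and the changed data is visibly a different one.
assert dataset_version(snapshot) == pinned
assert dataset_version(changed) != pinned
```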

We can then train a model using that input data and track how it performed, writing these values as metadata to the Dotscience API. This lets you visually track the performance of the model and compare it to other models that may have used different input data and/or parameters.
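As a minimal stand-in (not the Dotscience API itself), the shape of the metadata each run records looks something like this: the input-data version, the parameters used, and the resulting summary stats, which is enough to compare runs side by side.

```python
# Stand-in sketch of per-run metadata: data version + parameters + summary.
from dataclasses import dataclass, field

@dataclass
class Run:
    data_version: str
    parameters: dict
    summary: dict = field(default_factory=dict)

runs = [
    Run("a1b2c3", {"learning_rate": 0.01}, {"accuracy": 0.91}),
    Run("a1b2c3", {"learning_rate": 0.10}, {"accuracy": 0.87}),
    Run("d4e5f6", {"learning_rate": 0.01}, {"accuracy": 0.94}),
]

# Comparing runs is then just sorting or maximising on a summary metric,
# with the data version telling you exactly which input produced it.
best = max(runs, key=lambda r: r.summary["accuracy"])
```

Because each run carries its data version alongside its metrics, the comparison distinguishes "better parameters" from "better input data".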

This is particularly useful in collaborative situations where two data scientists are tweaking the input data as well as the model parameters to try to get better performance out of the model.

Both the input datasets and the resultant model performance for each run are version controlled by Dotscience, so performance can be compared easily.

In this example, we showed you how you can version control data loaded from a Databricks Spark cluster into Dotscience, and then track the provenance of models created using that data as well as the metrics (parameters and summary stats).

Sign up for Dotscience today!
