The Modern Cloud Data Platform War — DataBricks (Part 4)


LAKSHMI VENKATESH
Data Arena
5 min readJul 31, 2021


This article is a part of a multi-part series Modern Cloud Data Platform War (parent article). Previous part — Modern Cloud Data Platform War — DataBricks (Part 3) — Data sharing.

Challenge 4: Machine Learning & Analytics

Many kinds of machine learning algorithms run on these massive data sets, from recommendation engines to fraud detection.

ML & outcome Analytics. Image created by the author

DataBricks provides a unified stack that enables your organization to build an efficient Lakehouse architecture: keep all your data in one place and share it, with permissions, across different users such as enterprise data service users, business users, IT users, data scientists, and enterprise data warehouse users.

Solution 1: MLflow Machine Learning

Managed MLflow

The general lifecycle of how Data goes to Machine Learning is depicted in the image below.

Image by the author

Now, how does Databricks Managed MLflow help, when Databricks can already help you build a Lakehouse architecture like the one above that fits different users and use cases? The short answer: Managed MLflow provides the end-to-end ML lifecycle. OK, but how?

How does Managed MLFlow work?

What is MLflow? MLflow is an open-source platform for managing the ML lifecycle: experimentation, reproducibility, deployment, and a central model registry.

What is Managed MLflow? Managed MLflow is built on top of MLflow as a managed service by Databricks, covering the complete machine learning lifecycle, including MLOps, with enterprise reliability, security, and scale. It supports several programming languages, such as Java, Python, and R.

Difference between MLflow and Managed MLflow:

Image source: Databricks

General ML processing flow

General Steps for ML — Image by the Author

Experiments:

Experiments are the core unit of work in MLflow. Each experiment is a group of MLflow runs; it lets you visualize, search, and compare runs, and download artifacts or metadata for analysis in other tools. Experiments are maintained in a Databricks-hosted MLflow tracking server. There are two types of experiments: (1) workspace and (2) notebook.

(1) Workspace Experiment: Can be created from the Databricks Machine Learning UI or the MLflow API. Workspace experiments are not associated with any notebook.

(2) Notebook Experiment: Associated with a specific notebook. Databricks automatically creates a notebook experiment if there is no active experiment when you call mlflow.start_run(). Refer to MLflow Experiment permissions.

Data Science & Engineering Workspace. Image source — Databricks.

Primary components include

Image by the author

MLflow Experiment Tracking:

You create models using MLflow, but how do you track them? MLflow Tracking is the answer: it provides extensive tracking of experiments and lets you compare parameters and results across runs.

MLflow Projects:

While an experiment is a group of runs, an MLflow Project is a format for packaging data science code in a reusable and reproducible way. The Projects component includes an API and command-line tools for running projects, making it possible to chain projects together into workflows. Any Git repository or local directory can be treated as an MLflow project. MLflow currently supports the Conda environment, Docker container environment, and system environment.
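An illustrative MLproject file, placed at the root of the repository or directory, is what makes it runnable as a project. All names below (project, script, parameter) are hypothetical:

```yaml
# MLproject — declares how to run this directory as an MLflow project
name: churn-model

conda_env: conda.yaml        # the Conda environment the project runs in

entry_points:
  main:
    parameters:
      alpha: {type: float, default: 0.5}
    command: "python train.py --alpha {alpha}"
```

With such a file in place, `mlflow run . -P alpha=0.3` (or pointing at a Git URL) would execute the entry point in its declared environment.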

MLflow Models:

Once you have trained a model, it can be consumed by a variety of downstream tools, for example as a REST API or through a batch interface on Apache Spark.

  • Model customization — the mlflow.pyfunc module provides utilities for building custom Python models and the ability to save and log them.
  • Built-in model flavors — several standard flavors such as Python and R functions, H2O.ai, Keras, MLeap, PyTorch, Scikit-learn, Spark MLlib, TensorFlow, and ONNX.

MLOps:

What is MLOPs?

MLOps automates the ML lifecycle, provides versioning of models, and helps you keep pace with new regulations and best practices. It covers:

  • Automation
  • Versioning of models
  • Reproducibility
  • Experiment tracking
  • Continuous integration and deployment
  • Testing
  • Monitoring

How does Databricks MLflow enable MLOps?

Databricks MLflow enables MLOps through the Model Registry and Model Serving.

General MLflow before MLOps inclusion

Image by the author

Implementing MLOps:

Image by the author

MLflow Model Registry:

The Model Registry is a central hub where teams can share the ML models created across the organization, so they can work together without reinventing the wheel. It provides integrated approval and governance workflows, and lets you monitor ML deployments and their performance.

MLflow Model Serving:

Simple model deployment as a REST endpoint for low-latency serving. It integrates with the Model Registry to manage staging and production versions of endpoints.

AutoML:

Reference Image: DataBricks

You can create baseline models and notebooks using Databricks AutoML, a low-code approach. Data scientists and ML experts can accelerate workflows by fast-forwarding through the trial-and-error phase and focusing on customization using domain knowledge.

Solution 2: Data Science

Integrated Databricks Data Science streamlines end-to-end workflows, from data preparation and modeling to sharing insights. The Lakehouse architecture makes this integrated environment possible and lets you focus on the problem and solution rather than on building infrastructure and constantly updating and upgrading the environment.

Image source: Databricks

Key advantages

  1. You can use the IDE of your choice, such as PyCharm, RStudio, or Jupyter.
  2. Share insights quickly.
  3. Focus on data science problems and solutions rather than infrastructure or upgrades.
  4. Collaborate across the entire data science workflow.
  5. Clean and catalog your data with Delta Lake, so it is ready for machine learning models, all in one place.

Summary:

Company X’s one-stop answer for all its machine learning, and for drawing intelligence from a massive and growing data set, is Databricks’ (1) MLflow and (2) Data Science. Both are enabled by Delta Lake.


I learn by writing: data, AI, cloud, and technology. All views expressed here are my own and do not represent the views of the firm I work for.