MLflow - Storing Artifacts in HDFS and in an SQLite DB

Mayukh Mazumder
6 min read · Jun 4, 2020


Introducing MLflow

MLflow is an open-source platform to manage the ML lifecycle, including experimentation, reproducibility, deployment, and a central model registry. MLflow currently offers four components:

MLflow Tracking is an API and UI for logging parameters, code versions, metrics, and artifacts when running your machine learning code, and for later visualizing the results (a minimal logging example follows this list of components).

MLflow Projects are a standard format for packaging reusable data science code. Each project is simply a directory with code or a Git repository, and uses a descriptor file or simply a convention to specify its dependencies and how to run the code.

MLflow Models offer a convention for packaging machine learning models in multiple flavors, and a variety of tools to help you deploy them. Each Model is saved as a directory containing arbitrary files and a descriptor file that lists several “flavors” the model can be used in.

MLflow Registry offers a centralized model store, set of APIs, and UI, to collaboratively manage the full lifecycle of an MLflow Model.
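
To make the Tracking component concrete, here is a minimal Python sketch of logging a parameter, a metric, and an artifact; the names and values are only illustrative:

import mlflow

# Start a run and log a parameter, a metric, and an artifact file
with mlflow.start_run():
    mlflow.log_param("alpha", 0.5)
    mlflow.log_metric("rmse", 0.79)
    with open("output.txt", "w") as f:
        f.write("hello from MLflow")
    mlflow.log_artifact("output.txt")

Everything logged this way shows up in the Tracking UI, and the artifact part is what we will later point at HDFS.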

Advantages of MLflow

Data is the key to obtaining good results in machine learning, so MLflow is designed to scale to large data sets, large output files (for example, models), and large numbers of experiments. Specifically, MLflow supports scaling in four dimensions:

  1. An individual MLflow run can execute on a distributed cluster, for example, using Apache Spark.
  2. MLflow supports launching multiple runs in parallel with different parameters, for example, for hyperparameter tuning.
  3. MLflow Projects can take input from, and write output to, distributed storage systems.
  4. MLflow Model Registry offers large organizations a central hub to collaboratively manage a complete model lifecycle.

In this blog I will show you how to leverage MLflow Tracking to store a model’s artifacts in HDFS, with the run records kept in a local SQLite DB.

MLflow Tracking Servers

An MLflow tracking server has two components for storage: a backend store and an artifact store.

The backend store is where MLflow Tracking Server stores experiment and run metadata as well as params, metrics, and tags for runs. MLflow supports two types of backend stores: file store and database-backed store.

Use --backend-store-uri to configure the type of backend store. You specify a file store backend as ./path_to_store or file:/path_to_store, and a database-backed store as a SQLAlchemy database URI. The database URI typically takes the format <dialect>+<driver>://<username>:<password>@<host>:<port>/<database>. MLflow supports the database dialects mysql, mssql, sqlite, and postgresql. Drivers are optional; if you do not specify a driver, SQLAlchemy uses the dialect’s default driver. For example, --backend-store-uri sqlite:///mlflow.db would use a local SQLite database.

mlflow server will fail against a database-backed store with an out-of-date database schema. To prevent this, upgrade your database schema to the latest supported version using mlflow db upgrade [db_uri].

By default --backend-store-uri is set to the local ./mlruns directory (the same as when running mlflow run locally), but when running a server, make sure that this points to a persistent (that is, non-ephemeral) file system location.

The artifact store is a location suitable for large data (such as an S3 bucket, a shared NFS file system, or, as in our use case, HDFS) and is where clients log their artifact output (for example, models). artifact_location is a property recorded on mlflow.entities.Experiment that gives the default location for storing artifacts for all runs in the experiment. Additionally, artifact_uri is a property on mlflow.entities.RunInfo that indicates the location where all artifacts for a particular run are stored.

Use --default-artifact-root (which defaults to the local ./mlruns directory) to configure the default location for the server’s artifact store. It will be used as the artifact location for newly created experiments that do not specify one. Once you create an experiment, --default-artifact-root is no longer relevant to that experiment.
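
As a quick illustration, here is a Python sketch showing where artifact_location and artifact_uri appear in the API; the experiment name and HDFS path are just examples:

import mlflow

# Create an experiment with an explicit artifact location (example path)
exp_id = mlflow.create_experiment("hdfs-demo", artifact_location="hdfs://localhost:9000/MlflowOutput")

with mlflow.start_run(experiment_id=exp_id) as run:
    # artifact_uri on RunInfo tells you where this run's artifacts will be stored
    print(run.info.artifact_uri)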

Steps to Achieve our Goal

Defining HDFS Driver

To store artifacts in HDFS, specify an hdfs: URI as the --default-artifact-root. It can contain a host and port: hdfs://<host>:<port>/<path>, or just the path: hdfs://<path>.

There are two ways to authenticate to HDFS:

  • Use current UNIX account authorization
  • Kerberos credentials using the following environment variables:
export MLFLOW_KERBEROS_TICKET_CACHE=/tmp/krb5cc_22222222
export MLFLOW_KERBEROS_USER=user_name_to_use

Most of the cluster configuration settings are read from hdfs-site.xml, which is accessed by the HDFS native driver via the CLASSPATH environment variable. Optionally, you can select a different version of the HDFS driver library using:

export MLFLOW_HDFS_DRIVER=libhdfs3

The default driver is libhdfs.
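
If you prefer to keep everything in the training script, the same settings can also be set from Python before any HDFS access happens (a sketch; the values are just the examples from above):

import os

# Set these before MLflow touches HDFS (values here are only examples)
os.environ["MLFLOW_KERBEROS_TICKET_CACHE"] = "/tmp/krb5cc_22222222"
os.environ["MLFLOW_KERBEROS_USER"] = "user_name_to_use"
os.environ["MLFLOW_HDFS_DRIVER"] = "libhdfs"  # or "libhdfs3"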

When using the HDFS artifact store, you may face some errors while training the model. To work around them, we have to edit the file hdfs_artifact_repo.py, which can be found at /venv_name/lib/python3.6/site-packages/mlflow/store/artifact/. We just have to delete line 174 of the file ( driver=driver, ).

Logging to a Tracking Server

MLflow runs can be recorded to local files, to an SQLAlchemy compatible database, or remotely to a tracking server. By default, the MLflow Python API logs runs locally to files in mlruns directory wherever you ran your program.

To log to a tracking server, set the MLFLOW_TRACKING_URI environment variable to the server’s URI or call mlflow.set_tracking_uri().

So, after installing SQLite on your system, create an SQLite DB using the command sqlite3 example.db. This command will create an SQLite DB named example. Once the DB is created, we add this line to our training script: mlflow.set_tracking_uri("sqlite:////root/example.db"). It tells MLflow to record the run data (params, metrics, and tags) of our runs in the local SQLite DB.

Once done, set the MLFLOW_TRACKING_URI environment variable so that MLflow picks up the same tracking URI outside the script as well. This can be done using the command: export MLFLOW_TRACKING_URI=sqlite:////root/example.db
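
Put together, the Python side of this setup is tiny (a sketch, assuming the same /root/example.db path):

import mlflow

# Point MLflow at the SQLite-backed tracking store
mlflow.set_tracking_uri("sqlite:////root/example.db")
print(mlflow.get_tracking_uri())  # sanity check: should print the sqlite URI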

Running MLflow Tracking Server

You run an MLflow tracking server using mlflow server. The Tracking UI lets you visualize, search and compare runs, as well as download run artifacts or metadata for analysis in other tools.

The MLflow tracking server serves the same UI and enables remote storage of run artifacts. In that case, you can view the UI using URL http://<ip address of your MLflow tracking server>:5000 in your browser from any machine, including any remote machine that can connect to your tracking server.

The UI contains the following key features:

  • Experiment-based run listing and comparison
  • Searching for runs by parameter or metric value
  • Visualizing run metrics
  • Downloading run results

After you have set the MLFLOW_TRACKING_URI environment variable, create a folder in HDFS (in my example it is MlflowOutput), and then start the tracking server (use your server’s IP for --host if you are not running it locally) using the command:

mlflow server \
--backend-store-uri sqlite:////root/example.db \
--default-artifact-root hdfs://localhost:9000/MlflowOutput \
--host 0.0.0.0 \
-p 5000

Training Models

Once the tracking server is up and running, you can train your model in a new window. Before that, you have to set the MLFLOW_TRACKING_URI environment variable in this window as well, using the command:
export MLFLOW_TRACKING_URI=sqlite:////root/example.db
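
For context, a train.py like the one run below typically looks roughly like this. This is only a sketch of the standard ElasticNet wine example; the dataset path and the exact values logged may differ in your setup:

import mlflow
import mlflow.sklearn
import numpy as np
import pandas as pd
from sklearn.linear_model import ElasticNet
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

mlflow.set_tracking_uri("sqlite:////root/example.db")

# Assumed local copy of the wine-quality dataset
data = pd.read_csv("winequality-red.csv", sep=";")
train, test = train_test_split(data)
train_x, test_x = train.drop("quality", axis=1), test.drop("quality", axis=1)
train_y, test_y = train[["quality"]], test[["quality"]]

alpha, l1_ratio = 0.5, 0.5
with mlflow.start_run():
    model = ElasticNet(alpha=alpha, l1_ratio=l1_ratio, random_state=42)
    model.fit(train_x, train_y)
    pred = model.predict(test_x)

    mlflow.log_param("alpha", alpha)
    mlflow.log_param("l1_ratio", l1_ratio)
    mlflow.log_metric("rmse", np.sqrt(mean_squared_error(test_y, pred)))
    mlflow.log_metric("mae", mean_absolute_error(test_y, pred))
    mlflow.log_metric("r2", r2_score(test_y, pred))

    # This model directory is what ends up under .../artifacts/model in HDFS
    mlflow.sklearn.log_model(model, "model")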

Now we can start training the model :

(base) root@moyukh:~/sklearn_elasticnet_wine# python train.py
MLflow Version: 1.8.0
MLflow Tracking URI: sqlite:////root/example.db
run_id: 4b2a2757c4834aebabe6b33969ecd7f6
experiment_id: 0
Elasticnet model (alpha=0.500000, l1_ratio=0.500000):
RMSE: 0.7931640229276851
MAE: 0.6271946374319586
R2: 0.10862644997792614
2020-06-04 10:22:29,794 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2020-06-04 10:22:30,712 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
2020-06-04 10:22:30,804 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
2020-06-04 10:22:30,817 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
2020-06-04 10:22:31,329 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
***** END OF TRAINING *****

When the training finishes, you can see the artifacts stored in your HDFS folder. Access port 9870 of your system (the HDFS NameNode web UI) to browse the HDFS folders.

host_ip:9870/explorer.html#/MlflowOutput/0/run_id/artifacts/model

As you can see, the artifacts are stored in HDFS, and we can also access them from the MLflow UI running at port 5000, as specified.

host_ip:5000/#/experiments/0/runs/run_id
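
If you prefer to verify from code instead of the browser, a short sketch using the tracking client works too (replace the run_id with your own):

from mlflow.tracking import MlflowClient

client = MlflowClient(tracking_uri="sqlite:////root/example.db")
run_id = "4b2a2757c4834aebabe6b33969ecd7f6"  # replace with your run_id

# List the artifacts stored for the run (should show the "model" directory in HDFS)
for artifact in client.list_artifacts(run_id):
    print(artifact.path)

# Optionally pull a copy down from HDFS to the local machine
local_path = client.download_artifacts(run_id, "model")
print("downloaded to", local_path)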

You can see that the full path of the model is in HDFS. So we have successfully stored our model artifacts in HDFS and the run records in an SQLite DB on the local system. In case you have any doubts, kindly reach out to me at moyuh@thirdeyedata.io
