Raman Singh
Published in ML Aide · May 24, 2021

Managing machine learning models and keeping experiments reproducible can become overwhelming over time. As described in a previous article, ML Aide is a tool that enables data scientists and engineers to track all data, parameters, and metrics of machine learning experiments.

In this blog post, we will train a simple regression model on the USA Housing Dataset. But before we start, let's have a look at how data scientists use ML Aide and which components it consists of.

The data scientist develops the machine learning application as usual. The machine learning application sends all of the training's parameters, metrics, and artifacts (including the trained model) to the ML Aide web server. The web server stores this data in MongoDB and/or in S3; ML Aide can use any S3-compliant storage (e.g. min.io) and is not limited to AWS S3. On top of this, ML Aide provides a web-based user interface that gives access to all recorded experiments and stored artifacts. Finally, ML Aide can integrate with any identity provider (like Active Directory, Google, Auth0, Keycloak, …), through which access can be restricted to particular users or teams.

Setup ML Aide

Install Docker

ML Aide is designed for cloud environments and runs on Docker and Docker Compose. Therefore, Docker must be installed. On Windows and macOS, Docker Compose is installed automatically with Docker. On Linux systems, Docker Compose must be installed separately.

Download and Run ML Aide on Docker

ML Aide ships a Docker Compose configuration. The configuration is designed for local test environments and contains the ML Aide web server, web UI, MongoDB, min.io S3, and Keycloak as an identity provider.

In a production setup, you would most likely use your company's existing identity provider. The identity provider must be accessible via a domain name (not localhost or an IP address); otherwise, ML Aide cannot execute an OAuth2 login. Therefore, we register an alias in the local hosts file. Add the following entry to your hosts file, located at /etc/hosts (Unix) or C:\Windows\System32\drivers\etc\hosts (Windows).

127.0.0.1 keycloak.mlaide

On Unix systems, you can use the following command to add the entry to your hosts file.

echo '127.0.0.1 keycloak.mlaide' | sudo tee -a /etc/hosts

The domain keycloak.mlaide will be resolved to 127.0.0.1. That means that we can now access a Keycloak Server running on localhost through keycloak.mlaide.
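To verify that the alias works, you can resolve it, for example with Python's standard library (pinging keycloak.mlaide works just as well):

import socket

# verify that the alias from the hosts file resolves to the loopback address
print(socket.gethostbyname('keycloak.mlaide'))  # expected output: 127.0.0.1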

Download and execute the ML Aide tutorial script from github.com to easily start ML Aide on your local Docker environment. The shell script downloads some configuration files to run MongoDB, min.io, and Keycloak. After downloading, the script starts all services, including ML Aide.

mkdir ~/mlaide-tutorial
cd ~/mlaide-tutorial
curl https://raw.githubusercontent.com/MLAide/MLAide/master/demo/run-mlaide.sh --output ./run-mlaide.sh
chmod +x ./run-mlaide.sh
./run-mlaide.sh

Now ML Aide should be running. You can access the web UI on http://localhost:8880. You can use one of the following logins:

  • username: adam, password: adam1
  • username: bob, password: bob1
  • username: eve, password: eve1

The web server is running on http://localhost:8881. You can also access Keycloak via http://localhost:8884 and add new users there using the admin account (username: admin, password: admin).
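If one of these URLs does not respond, a quick port check helps to narrow the problem down. The following is a minimal sketch using only the Python standard library; the ports are the defaults of the demo setup:

import socket

# check that the demo services are reachable on their default ports
# (8880 = web UI, 8881 = web server, 8884 = Keycloak)
for name, port in [('web UI', 8880), ('web server', 8881), ('Keycloak', 8884)]:
    with socket.socket() as sock:
        sock.settimeout(2)
        reachable = sock.connect_ex(('localhost', port)) == 0
        print(f'{name} (port {port}):', 'up' if reachable else 'not reachable')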

Install ML Aide Python client

Python apps should be installed in a new, clean environment. For this tutorial, we recommend an environment manager like virtualenv in combination with pyenv. Of course, you can also use any other environment manager, or none at all. In any case, you should install the pip dependencies listed below.

# Open a terminal and navigate to the project directory
cd ~/mlaide-tutorial
# Install Python 3.9 via pyenv if not already present
# (any 3.9.x release works; pyenv install without a version requires a .python-version file)
pyenv install 3.9.4
# Install virtualenv (if not already present)
pip install virtualenv
# Create a virtual environment
virtualenv .venv
# Activate the virtual environment
source .venv/bin/activate
# Install all dependencies (including mlaide)
pip install scikit-learn pandas numpy mlaide

Preparing the Tutorial

To train a model, we need to download the USA Housing Dataset. Store the file in a subdirectory called data.

mkdir data
curl https://raw.githubusercontent.com/MLAide/docs/master/docs/tutorial/housing.csv --output ./data/housing.csv

ML Aide structures all experiments within projects. A project contains experiments, runs, and artifacts. For this tutorial, create a new project with the name USA Housing; ML Aide will automatically set the project key to usa-housing.

As you learned in the previous step, ML Aide uses Keycloak to authenticate users. Logging in to the web UI is easy: just use one of the credentials above. To authenticate your Python machine learning application against ML Aide, you need an API key, which lets you access ML Aide without entering your credentials. To create an API key, execute the following steps:

  • In the upper right click on adam > Settings
  • Go to API Keys in the left navigation
  • Click on Add API Key
  • Enter any description and click on Create
  • Copy the shown API key and store it somewhere safe, for example in an environment variable as sketched below. The API key won't be shown again; if you lose it, you have to create a new one.
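Rather than hard-coding the key into the scripts below, you can read it from an environment variable. Here is a minimal sketch; the variable name MLAIDE_API_KEY is just a convention chosen for this tutorial, not something ML Aide prescribes:

import os

# read the API key from an environment variable so it never ends up in version control
# (MLAIDE_API_KEY is an arbitrary name chosen for this tutorial)
api_key = os.environ['MLAIDE_API_KEY']

Set it once in your shell (e.g. export MLAIDE_API_KEY='<your api key>') and pass api_key to the ConnectionOptions shown in the next sections.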

Train an ML Model using ML Aide

Data Preparation

Our data preparation will be implemented in data_preparation.py, so create a new file with this name. To create a connection to the ML Aide web server from Python, use mlaide.MLAideClient. An object of this class is the main entry point for all kinds of operations. Replace api_key with the API key that you created in the ML Aide web UI.

from mlaide import MLAideClient, ConnectionOptions
import pandas as pd

options = ConnectionOptions(
    server_url='http://localhost:8881/api/v1',  # the ML Aide demo server runs on port 8881 by default
    api_key='<your api key>'
)
mlaide_client = MLAideClient(project_key='usa-housing', options=options)

Before we read or process anything, we should start tracking all relevant information in ML Aide. In ML Aide, a run is the key concept for tracking parameters, metrics, artifacts, and models. Every run belongs to one or more experiments.

run_data_preparation = mlaide_client.start_new_run(experiment_key='linear-regression', run_name='data preparation')

Now we can read and process the dataset. We also register the dataset as an artifact in ML Aide. This gives us the ability to reproduce the following steps, even if the dataset is lost, deleted, or modified. The artifact can be used as an input in other runs, which helps to trace the lineage of a machine learning model back to its roots. At the end, don't forget to mark the run as completed.

housing_data = pd.read_csv('data/housing.csv')

# add the dataset as an artifact
artifact = run_data_preparation.create_artifact(name="USA housing dataset", artifact_type="dataset", metadata={})
run_data_preparation.add_artifact_file(artifact, 'data/housing.csv')
run_data_preparation.set_completed_status()

Run the script from your shell with python data_preparation.py. After the script has completed, check the web UI to see the created run and the artifact.

Model Training

The code for model training will be written in a new file named training.py. First, we add some imports and create a connection to the ML Aide web server.

from mlaide import MLAideClient, ConnectionOptions, ArtifactRef
import pandas as pd
import numpy as np
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression, Lasso
from sklearn import metrics

options = ConnectionOptions(
    server_url='http://localhost:8881/api/v1',  # the ML Aide demo server runs on port 8881 by default
    api_key='<your api key>'
)
mlaide_client = MLAideClient(project_key='usa-housing', options=options)

As you can see, we are implementing these steps in a different file than the data preparation, so we need to get our input data from somewhere. We could read the CSV again, or we could retrieve the file from ML Aide. In this case, we read the content of the file from ML Aide. In the previous step, we saved the file as an artifact with the name USA housing dataset. Omitting the version means that we retrieve the latest version of the artifact.

dataset_bytes = mlaide_client.get_artifact('USA housing dataset', version=None).load('data/housing.csv')
housing_data = pd.read_csv(dataset_bytes)

Next comes the train-test split. Usually, the split is done randomly; ML Aide helps you keep it reproducible by tracking the split parameters.

We start a new run to track the split and set the dataset as an input artifact.

artifact_ref = ArtifactRef(name="USA housing dataset", version=1)
run_pipeline_setup = mlaide_client.start_new_run(experiment_key='linear-regression', run_name='pipeline setup', used_artifacts=[artifact_ref])

Now we split our dataset and link all information related to the split to our run. In this case, we want to track all arguments (test_size and random_state) of the train_test_split() function.

X = housing_data[['Avg. Area Income', 'Avg. Area House Age', 'Avg. Area Number of Rooms', 'Avg. Area Number of Bedrooms', 'Area Population']]
y = housing_data['Price']

test_size = 0.3
random_state = 42
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=random_state)

run_pipeline_setup.log_parameter('test_size', test_size)
run_pipeline_setup.log_parameter('random_state', random_state)

If you have a close look at the data, you can see that all X values must be scaled before we can use them. We use sklearn's StandardScaler. The scaler that is fitted here must also be used later when predicting new values. ML Aide makes this easy: just store the scaler (or the whole pipeline) in ML Aide as an artifact. The artifact can be loaded later in a separate process for predicting.

pipeline = Pipeline([('std_scalar', StandardScaler())])
X_train = pipeline.fit_transform(X_train)
X_test = pipeline.transform(X_test)

run_pipeline_setup.log_model(pipeline, model_name="pipeline")
run_pipeline_setup.set_completed_status()

After the train-test-split, we can fit a linear regression model. We start a new run and link the dataset and the pipeline as input artifacts.

dataset_artifact_ref = ArtifactRef(name="USA housing dataset", version=1)
pipeline_artifact_ref = ArtifactRef(name="pipeline", version=1)
run_linear_regression = mlaide_client.start_new_run(experiment_key='linear-regression', run_name='linear regression', used_artifacts=[dataset_artifact_ref, pipeline_artifact_ref])

Now just fit your model as usual. After that, you can log the model with log_model() in ML Aide.

# note: the `normalize` parameter was removed in scikit-learn 1.2; on newer versions
# use plain LinearRegression(), as the features are already scaled by the pipeline
lin_reg = LinearRegression(normalize=True)
lin_reg.fit(X_train, y_train)
run_linear_regression.log_model(lin_reg, 'linear regression')

Finally, we calculate some model metrics. The metrics will also be tracked in ML Aide.

test_pred = lin_reg.predict(X_test)
train_pred = lin_reg.predict(X_train)
mae = metrics.mean_absolute_error(y_test, test_pred)
mse = metrics.mean_squared_error(y_test, test_pred)
rmse = np.sqrt(metrics.mean_squared_error(y_test, test_pred))
r2 = metrics.r2_score(y_test, test_pred)
cross_validation = cross_val_score(LinearRegression(), X, y, cv=10).mean()
run_linear_regression.log_metric('mae', mae)
run_linear_regression.log_metric('mse', mse)
run_linear_regression.log_metric('rmse', rmse)
run_linear_regression.log_metric('r2', r2)
run_linear_regression.log_metric('cross validation', cross_validation)
run_linear_regression.set_completed_status()

Until now, we created three runs (data preparation, pipeline setup, and linear regression). All of these runs belong to the experiment linear-regression.

Now we train another model type: a lasso regression model. We want to reuse the results of the data preparation and the pipeline setup. With ML Aide, this can be achieved simply by using a new experiment_key and providing the artifacts of the previous runs via used_artifacts.

dataset_artifact_ref = ArtifactRef(name="USA housing dataset", version=1)
pipeline_artifact_ref = ArtifactRef(name="pipeline", version=1)
run_lasso = mlaide_client.start_new_run(experiment_key='lasso-regression', run_name='lasso regression', used_artifacts=[dataset_artifact_ref, pipeline_artifact_ref])

We fit our model as usual.

alpha = 0.1
precompute = True
positive = True
selection = 'random'
random_state = 42
run_lasso.log_parameter('alpha', alpha)
run_lasso.log_parameter('precompute', precompute)
run_lasso.log_parameter('positive', positive)
run_lasso.log_parameter('selection', selection)
run_lasso.log_parameter('random state', random_state)

model = Lasso(alpha=alpha, precompute=precompute, positive=positive, selection=selection, random_state=random_state)
model.fit(X_train, y_train)
run_lasso.log_model(model, 'lasso')

And now we calculate some metrics for this model, too.

test_pred = model.predict(X_test)
train_pred = model.predict(X_train)
mae = metrics.mean_absolute_error(y_test, test_pred)
mse = metrics.mean_squared_error(y_test, test_pred)
rmse = np.sqrt(metrics.mean_squared_error(y_test, test_pred))
r2 = metrics.r2_score(y_test, test_pred)
cross_validation = cross_val_score(Lasso(), X, y, cv=10).mean()
run_lasso.log_metric('mae', mae)
run_lasso.log_metric('mse', mse)
run_lasso.log_metric('rmse', rmse)
run_lasso.log_metric('r2', r2)
run_lasso.log_metric('cross validation', cross_validation)
run_lasso.set_completed_status()

Run the script from your shell with python training.py. After the script has completed, check the web UI to see the created runs and artifacts.

Model Evaluation

Compare Runs

A key feature of ML Aide is comparing several runs with their parameters and metrics. To do so, open the web UI and select all runs that should be compared. In this case, we want to compare the linear regression and the lasso model. Select the two checkboxes and click the Compare button at the top of the table.

You will see all parameters and metrics of the two runs. All values that are not equal will be highlighted.

Visualize Lineage

Sometimes you want to know how a model was built or which runs (steps) were executed in a particular experiment. For this, you can use the lineage visualization of experiments.

Go to the experiment view and select an experiment. You will see all runs (blue color) and all input/output artifacts (red color).

The table below shows you all runs and artifacts. From the runs table, you can jump to the run details or to the run comparison.

Model Staging

In this very basic example, we can see that both models perform quite similarly. We can choose one of these models and tag it as 'production ready'. This helps to keep track of which models are used in production, which are still under development (or QA), and which are already deprecated.

ML Aide provides the following stages for models:

  • None
  • Staging
  • Production
  • Deprecated
  • Abandoned

Model Serving

So far, we have trained and evaluated two models. Now we will reload the linear regression model to make some predictions.

Our code will be written in a new file named serving.py. First, we create a connection to the ML Aide web server.

from mlaide import MLAideClient, ConnectionOptions
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
import numpy as np
options = ConnectionOptions(
    server_url='http://localhost:8881/api/v1',  # the ML Aide demo server runs on port 8881 by default
    api_key='<your api key>'
)
mlaide_client = MLAideClient(project_key='usa-housing', options=options)

For the predictions, we want to use the linear regression model. But before the model can predict values, we have to transform our input vectors with the sklearn pipeline. Since the pipeline was also stored in ML Aide, we can load both from there.

# read the model
lin_reg: LinearRegression = mlaide_client.load_model('linear regression')
# read the pipeline containing the standard scaler
pipeline: Pipeline = mlaide_client.load_model('pipeline')

Now we are ready to use our model. In this case, we will hard-code the input features for our prediction. In real-world scenarios, we would get the input from HTTP requests or something similar, as sketched after the following code.

# create some data for prediction
data = np.array([[80000, 6.32, 7.4, 4.24, 25000]])
# The values are
# - Avg. Area Income
# - Avg. Area House Age
# - Avg. Area Number of Rooms
# - Avg. Area Number of Bedrooms
# - Area Population
# predict the house price
data = pipeline.transform(data)
pred = lin_reg.predict(data)
print(pred) # output is: [1415072.9471789]
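In a real serving scenario, you would wrap the transform and predict steps in a single function that an HTTP handler can call. Here is a minimal sketch reusing the pipeline and lin_reg objects loaded above (the function name predict_price is just illustrative):

def predict_price(features: list) -> float:
    """Scale the raw input features and return the predicted house price."""
    # features must contain the five values in the order listed above
    scaled = pipeline.transform(np.array([features]))
    return float(lin_reg.predict(scaled)[0])

# usage example with the same values as above
print(predict_price([80000, 6.32, 7.4, 4.24, 25000]))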

Conclusion

In this tutorial, we trained two machine learning models with sklearn. We also created a very simple serving app to predict new values using a trained model. Throughout the whole workflow, we used ML Aide to track everything, including parameters, metrics, the dataset, and the models. The ML Aide web UI helps to inspect and investigate the recorded values. Altogether, ML Aide makes machine learning experiments reproducible, visualizes experiment lineage, supports run comparison, and helps data scientists to develop new models without having to manually document every step.
