Create reproducible Machine Learning experiments using Sacred

Max Leander · Published in Analytics Vidhya · Sep 9, 2020 · 5 min read

Every experiment is sacred
Every experiment is great
If an experiment is wasted
God gets quite irate

Photo by Katarzyna Pe on Unsplash

Sacred lets you configure, organize, log and reproduce experiments. It was designed for ML experiments specifically, but can actually be used for any kind of experiment.

To give an example of how to use this powerful framework, I am going to use the dataset from a Kaggle competition, Real or Not? NLP with Disaster Tweets. This competition is a binary classification problem where you are supposed to decide whether a tweet is describing an actual disaster or not. Here are two examples:

Real disaster tweet:

Forest fire near La Ronge Sask. Canada

Not a disaster tweet:

I love fruits

Let’s say that we want to run some experiments where we build a model to classify these tweets and measure the classifier’s F1-score using k-fold cross-validation. Most data scientists would probably fire up a Jupyter notebook and start to explore the data (which is indeed always the right thing to do, btw), run some ad-hoc experiments, and build and evaluate models. Sooner or later, the data scientist will notice that the performance of the models heavily depends on specific configurations and countless modifications of the data. This is where the power of reproducibility starts to pay off.

Why Sacred?

The following are the main features and advantages of using Sacred:

  • Easily define and encapsulate the configuration of each experiment
  • Automatically collect metadata of each run
  • Log custom metrics
  • Collect logs in various places using observers
  • Ensure deterministic runs with automatic seeding

How to set up a Sacred Experiment

We start off by creating a base experiment in Sacred as follows:

from sacred import Experiment

logreg_experiment = Experiment('logreg')

A Sacred experiment is defined by a configuration, so let’s create one:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

@logreg_experiment.config
def baseline_config():
    max_features = None
    classifier = Pipeline([
        ('tfidf', TfidfVectorizer(max_features=max_features)),
        ('clf', LogisticRegression())
    ])

Notice that the config attribute of the experiment object is used as a function decorator. This enables Sacred to automatically detect that the function should be used to configure the experiment.

This very simple config defines a scikit-learn pipeline with two steps: compute the TF-IDF representation of all tweets and then classify them using Logistic Regression. I added a variable for one of the hyperparameters, max_features, to showcase how easily you can create new experiments by modifying the config.
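
Sacred also supports named configs, which bundle a set of alternative values under a name that can be activated per run. The snippet below is just a hedged sketch, not part of the original setup; the name small_vocab and the value 1000 are purely illustrative:

@logreg_experiment.named_config
def small_vocab():
    # Alternative setting that can be switched on by name for a single run
    max_features = 1000

A named config can then be activated with, for example, logreg_experiment.run(named_configs=['small_vocab']).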

Now, before you can run this experiment, a main function must be defined:

from pathlib import Path

import pandas as pd
from sklearn.model_selection import cross_val_score

@logreg_experiment.automain
def main(classifier):
    datadir = Path('../data')
    train_df = pd.read_csv(datadir / 'train.csv')
    scores = cross_val_score(classifier, train_df['text'],
                             train_df['target'], cv=5, scoring='f1')
    mean_clf_score = scores.mean()
    logreg_experiment.log_scalar('f1_score', mean_clf_score)

As you can see, we once again use an attribute of the experiment object as a decorator, in this case automain. This lets the main function automatically access any variables defined within this experiment’s config. Here, we only ask for classifier, which will be evaluated with respect to how well it can classify the Twitter data using 5-fold cross-validation on the training set. In the last line of code, the metric that we want to measure is logged using the log_scalar method.
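
If you want more granular metrics, Sacred can also inject a special _run object into the main function; it exposes the same log_scalar method with an optional step argument. The following is only a sketch of an alternative main function (it would replace the one above, since an experiment has a single main function) that logs each fold's score separately; the imports are the same as in the snippet above:

@logreg_experiment.automain
def main(classifier, _run):
    datadir = Path('../data')
    train_df = pd.read_csv(datadir / 'train.csv')
    scores = cross_val_score(classifier, train_df['text'],
                             train_df['target'], cv=5, scoring='f1')
    # Log each fold's F1-score with the fold index as the step,
    # plus the mean as before
    for fold, score in enumerate(scores):
        _run.log_scalar('fold_f1_score', score, step=fold)
    _run.log_scalar('f1_score', scores.mean())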

Run the Experiment

To run the experiment, simply call its run() method. To run it with different parameter values, you can conveniently pass a dict config_updates specifying the exact configuration for this experiment run. Pretty neat!

# Run with default values
logreg_experiment.run()

# Run with config updates
logreg_experiment.run(config_updates={'max_features': 1000})
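
Because the automatically generated seed is itself just a config entry, you can also pin it to reproduce a particular run exactly. A minimal sketch (the value 42 is arbitrary):

# Fix Sacred's seed so the run is fully deterministic and repeatable
logreg_experiment.run(config_updates={'max_features': 1000, 'seed': 42})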

I usually put the experiments themselves in different files, and then have a separate script which runs all of the experiments at once.

Log your results

If you run the above, you will not see a lot of results. You first need to attach an observer to the experiment. The observer will then send the logs to some destination, usually a database. For local and non-production usage, you can use the FileStorageObserver to simply write to disk.

logreg_experiment.observers.append(FileStorageObserver('logreg'))

If you include this line in the runner script above and run it, a new folder logreg is created with one sub-folder per run: one for the default run and one for the run with the updated max_features value. Each run folder contains four files, with the following content:

  • config.json: The state of each object in the configuration, and the seed parameter which is automatically used in all non-deterministic functions to ensure reproducibility.
  • cout.txt: All standard output produced during the run.
  • metrics.json: Custom metrics that were logged during the run, e.g. the F1-score in our case.
  • run.json: Metadata e.g. about the source code (git repo, files, dependencies, etc.), the running host, start/stop time, etc.
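
For a more production-like setup, you could swap the FileStorageObserver for Sacred's MongoObserver, which sends the same information to a MongoDB database instead of writing it to disk. A hedged sketch, assuming a MongoDB instance is reachable locally (older Sacred versions use MongoObserver.create instead of the constructor):

from sacred.observers import MongoObserver

# Assumes MongoDB is running at localhost:27017; the database name is arbitrary
logreg_experiment.observers.append(
    MongoObserver(url='mongodb://localhost:27017', db_name='sacred'))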

Putting it all together

For the sake of completeness, I will create a final example to show how you can run multiple experiments from the same runner script:

from pathlib import Path

import pandas as pd
from sacred import Experiment
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

rand_forest_experiment = Experiment('randforest')

@rand_forest_experiment.config
def baseline_config():
    n_estimators = 100
    classifier = Pipeline([
        ('tfidf', TfidfVectorizer()),
        ('clf', RandomForestClassifier(n_estimators=n_estimators))
    ])

@rand_forest_experiment.automain
def main(classifier):
    datadir = Path('../data')
    train_df = pd.read_csv(datadir / 'train.csv')
    scores = cross_val_score(classifier, train_df['text'],
                             train_df['target'], cv=5, scoring='f1')
    mean_clf_score = scores.mean()
    rand_forest_experiment.log_scalar('f1_score', mean_clf_score)

Now, let’s run both experiments with some config updates…

from sacred.observers import FileStorageObserver

from experiments.logreg import logreg_experiment
from experiments.randforest import rand_forest_experiment

logreg_experiment.observers.append(FileStorageObserver('logreg'))
rand_forest_experiment.observers.append(FileStorageObserver('randforest'))

# Run with default values
logreg_experiment.run()

# Run with config updates
logreg_experiment.run(config_updates={'max_features': 1000})

# Run a different experiment
rand_forest_experiment.run()
rand_forest_experiment.run(config_updates={'n_estimators': 500})

By looking at the metrics.json file of each run, we can conclude that the default logistic regression model was the best performing, with an F1-score of ~0.66, while the random forest with 100 estimators was the worst one, with an F1-score of ~0.53.
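
If you want a quick programmatic comparison without a dedicated frontend, a few lines of Python are enough to collect the logged metrics from each run folder. A sketch, assuming the logreg and randforest folders created by the FileStorageObserver runs above:

import json
from pathlib import Path

# Print the metrics logged for every run under the two observer folders
for basedir in (Path('logreg'), Path('randforest')):
    for metrics_file in sorted(basedir.glob('*/metrics.json')):
        print(metrics_file.parent, json.loads(metrics_file.read_text()))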

Of course, all of that JSON-formatted output is not very appealing to look at, but there are several visualization tools you can use with Sacred. They are outside the scope of this article, but do have a look here: https://github.com/IDSIA/sacred#Frontends

Experiment safely!

This article is part of a series on best practices when building and designing machine learning systems. Read the first part here: https://medium.com/analytics-vidhya/how-to-get-data-science-to-truly-work-in-production-bed80e6bcfee
