Machine Learning Observability with Timber.io

Domenic Rosati · Published in manifoldco · Aug 28, 2018


Logging for machine learning is important. It provides observability into your data processing, model training, evaluation, and artifact persistence. But it’s fraught with questions: which logs are important, how do I log across environments and frameworks, should I log experimental research and production training code?

Enter Timber.io

By providing a dead simple logging interface for Python, Timber.io collects and manages your training logs to provide one central place for machine learning observability. Because Timber leverages Python's built-in logging, you can make use of Python's powerful logging configuration to increase the readability and usefulness of your own Timber log dashboard.
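
For example, because `TimberHandler` is a standard logging handler (its constructor is shown later in this post), it can be wired in through `logging.config.dictConfig` like any other handler. Here is a minimal sketch, assuming the `TIMBER_API_KEY` environment variable used throughout this post:

import logging
import logging.config
import os

# A minimal sketch: attach Timber to the root logger via dictConfig.
# Assumes the timber.TimberHandler constructor shown later in this post
# and a TIMBER_API_KEY environment variable.
logging.config.dictConfig({
    'version': 1,
    'disable_existing_loggers': False,
    'handlers': {
        'timber': {
            '()': 'timber.TimberHandler',            # custom handler factory
            'api_key': os.getenv('TIMBER_API_KEY'),  # credential from the environment
            'level': 'INFO',
        },
    },
    'root': {'level': 'INFO', 'handlers': ['timber']},
})

logging.getLogger(__name__).info('Logs now flow to the Timber dashboard')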

Observability in Machine Learning

Before looking at how we can put this in place, let's talk about observability in a machine learning context. Chances are your machine learning team has both a research or experimentation workflow and a production workflow. You are using multiple tools, such as notebooks and Docker containers, and frameworks such as TensorFlow and scikit-learn. You likely have a few members on your team interested in the results of both training and experimentation. As much as you'd like to have standardized processes for each team member and each project, you don't. The outcome is that observability into research, training, and testing is difficult, especially for members of your team who are not working with the same set of tools as you.

Observability for machine learning has the same goals as observability in software: increased accessibility and relevance of logging. The only difference is that in machine learning we care about continuously increasing the quality of our models as reflected in our experimentation, training, and evaluation results. Luckily we can accomplish this in the same way as we do in software: aggregate meaningful logs into one central source of truth. With our logs aggregated in one central dashboard, our team members can find the results of our local notebook-based experimentation and our continuous model training and evaluation. Let's start by looking at how to do this with Timber.io in two typical workflows: experimentation in notebooks and containerized training code. Follow along in the repo here.

Experiments

At Manifold, we have been really interested in applications of Machine Learning to DevOps for operations automation. One area of active research we have been pursuing is automating capacity planning. For the purposes of this example, I wrote up a small notebook experimenting with how we can model the relationship between memory utilization and network throughput. You can follow along with the notebook here. After some of the initial exploratory data analysis (Figure 1), I thought I’d try a couple of different approaches to modeling.

Figure 1: Network Throughput v. Memory Usage

You can see in the notebook that I tried a few regression procedures in order to emulate some typical approaches to modeling: high-level imperative models with scikit-learn, low-level neural network modeling with Keras, and high-level graph-type modeling with TensorFlow. I also chose these three frameworks because of their different approaches to logging. Scikit-learn has a low-to-no-logs approach, Keras is somewhere in between, providing details when necessary, and TensorFlow is notoriously verbose by default.
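
When that verbosity gets in the way, Python's standard logging lets you turn a framework down without touching your own logs. A small sketch (the experiment logger name here is just an illustration): TensorFlow emits its messages through a logger named "tensorflow", so its level can be raised independently.

import logging

# Raise the level of TensorFlow's own logger so framework chatter is filtered out...
logging.getLogger('tensorflow').setLevel(logging.WARNING)

# ...while your own experiment results stay at INFO and still reach the dashboard.
experiment_logger = logging.getLogger('capacity_planning_experiments')  # illustrative name
experiment_logger.setLevel(logging.INFO)
experiment_logger.info('Framework noise filtered; experiment results still logged')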

Typically after experimentation, I'd like to get this notebook up as a pull request to share my research results with my team as fast as possible and get feedback. True to the purpose of notebooks, markdown annotations and cell output make it really easy to keep the results of your research relevant (one of the goals of observability). What notebooks are not so great at is enabling centralized aggregation of experimental results. In order to share your results, you need to get somebody to open your notebook and read it. Let's solve this with Timber!

First, you will want to run your notebook with the Manifold CLI (quick start guide here). By running your notebook with Manifold you have access to a secure credentials management solution that empowers teams to provision third-party services such as Timber and manage their credentials. Once you have Manifold up and running, you can easily follow along with the accompanying repo with a free Timber service by running:

$ manifold create --team datasci --project aggregate-ml-logs --product timber-logging
$ manifold run -- jupyter notebook

All you need to do to implement Timber logging is add a `TimberHandler` from the `timber` Python library, available to install with pip. You can see that we are relying on Manifold to securely inject the API key into the notebook's local environment for access with `os.getenv`.
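
In the notebook this boils down to a couple of lines; here is a minimal sketch of what the setup cell can look like (the exact cell in the repo may differ slightly):

import logging
import os

import timber

# Attach Timber to the root logger so plain logging.info() calls in the
# notebook are shipped to the dashboard. The API key is injected into the
# environment by `manifold run`.
logging.basicConfig(level=logging.INFO)
timber_handler = timber.TimberHandler(api_key=os.getenv('TIMBER_API_KEY'),
                                      level=logging.INFO)
logging.getLogger().addHandler(timber_handler)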

Once you have added logging, you can log the training and evaluation results from your experiments using the Python logging code you already know:

score = regressor.score(X, y)
logging.info("Training RandomForestRegressor: Score was %f", score)

By wiring up your notebook in this way, instead of getting messy logs in between cells that will only be shared once you hand somebody your notebook, you can share the Timber dashboard provisioned for you by Manifold with your teammates, and they can see the results of your experimentation in real time.

Training

Chances are you use some sort of containerized approach to your production model training workflow, and maybe you use an orchestrator like Pachyderm or Airflow. Usually you will already have some centralized logging through Docker Compose or Kubernetes, so what advantage can our approach to centralized logging give you? The answer is improved relevancy.

Your machine learning logs probably look something like this:

INFO:Using TensorFlow backend.
INFO:60243/60243 [==============================] - 1s 13us/step
WARNING:tensorflow:Using temporary folder as model directory: /var/folders/zt/5_ksy7fn3lj3vh_9fkfgf9mc0000gn/T/tmp87raaoy2
INFO:tensorflow:Using config: {'_model_dir': '/var/folders/zt/5_ksy7fn3lj3vh_9fkfgf9mc0000gn/T/tmp87raaoy2', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': None, '_device_fn': None, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x1126809b0>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn
WARNING: DeprecationWarning: numpy.core.umath_tests is an internal NumPy module and should not be imported. It will be removed in a future NumPy release.
from numpy.core.umath_tests import inner1d.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 0 into /var/folders/zt/5_ksy7fn3lj3vh_9fkfgf9mc0000gn/T/tmp87raaoy2/model.ckpt.
INFO:tensorflow:loss = 558.5604, step = 1
INFO:tensorflow:global_step/sec: 568.315

While some of these logs might make sense in the context of debugging training jobs and monitoring loss at each epoch or step, it’s likely that beyond the scope of a notebook these logs are mostly just noise; the signal of training and evaluation results is getting lost.

Let’s solve this issue with Timber. Using Timber as a centralized logging solution for training code is just a matter of creating a logger with a few lines of code:

# ml_logs/logs.py
import logging
import os

import timber

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
timber_handler = timber.TimberHandler(api_key=os.getenv('TIMBER_API_KEY'),
                                      level=logging.INFO)
logger.addHandler(timber_handler)

After implementing a reusable logger as in the example above, you can use regular Python logging to output training and model evaluation results directly to Timber:

logger.info('Training: %s:%f', env.JOB, env.SAMPLE_SIZE)
logger.info('Random Forest Regressors score was: %f',
            regressor.score([[t] for t in traces['Memory usage [KB]']],
                            traces['Total network throughput [KB/s]'].values))
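
The `env` object above is not shown in the post; one minimal way to build it (the module path and defaults here are assumptions) is to wrap the JOB and SAMPLE_SIZE variables that the docker run command below passes into the container:

# e.g. ml_logs/env.py (hypothetical helper; the repo may organize this differently)
import os
from types import SimpleNamespace

# JOB and SAMPLE_SIZE are set by `docker run -e JOB=... -e SAMPLE_SIZE=...`
env = SimpleNamespace(
    JOB=os.getenv('JOB', 'unknown-job'),
    SAMPLE_SIZE=int(os.getenv('SAMPLE_SIZE', '100')),
)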

In the example repo, you can see that I have ported the experimental code into three different training jobs and created a container that can run different jobs with different amounts of training data. For brevity, I excluded the steps of downloading versioned datasets and uploading versioned models. I simulated continuous training with the following script (usually this would be taken care of by some orchestration tool):

for sample_size in {1..100..5}; do
  for job in keras_job.py scikit-learn_job.py tensorflow_job.py; do
    docker run -e TIMBER_API_KEY=$TIMBER_API_KEY -e JOB=$job -e SAMPLE_SIZE=$sample_size ml_logs:latest ./ml_logs/training/$job
  done
done

You can run the example in the repo yourself with the Manifold service:

$ manifold create --team datasci --project aggregate-ml-logs --product timber-logging
$ manifold run -- bash ./run-me.sh

The resulting logs in Timber are relevant training results that can be shared easily with my team:

Conclusion

In the two examples above, Timber helps us improve the visibility and accessibility of our notebook-based experiments and the relevance of the logs from our container-based training code. By implementing Timber we are able to share the results of our machine learning work and provide insight into what's going on as we try to continuously increase the quality of our model training. By using a centralized logging platform, we can log across different frameworks and workflows with ease. Hopefully, these examples are helpful on your own journey of making your machine learning results observable. If you have any questions or just want to chat about observability, you can contact me anytime at dom@manifold.co.

You can find the source code for the above examples at:

Try Timber and use the code TIMBER to receive $25 in Manifold credit.
