New horizons of MLOps

Marijan Smetko
Photomath Engineering
6 min read · Dec 12, 2022
An artist's impression of a space probe in flight around a planet, with a moon in the near distance and the Sun in the far distance
New Horizons, an interplanetary probe by NASA, was launched primarily to observe Pluto, but it now serves as a crow's nest of sorts for humanity, discovering many new things with far more powerful sensors than Voyager's. That is how we feel about ClearML: it showed us that ML can be a better experience for everyone.
Source: Flickr

From its inception, Photomath has been centered around image processing. Image data is the core of our application and product, and although the early versions were fairly primitive, it should not be surprising that we eventually migrated our recognition pipeline to deep learning methods, specifically computer vision. This delighted our users, who received a better product at the same price. But the engineers? Frankly, they had some challenges.

Deep learning, despite its conceptual origins in the mid-20th century, is a rather young field. It has pushed many other fields forward, and computer vision is one of them. Computer vision approaches built on deep learning outperform the "older" methods, but due to their immaturity they aren't very stable, in the sense that they mostly don't perform well across a wide range of tasks (an otherwise good method may not work for your use case at all). In addition, they change quite often: a good method ceases to be the best after a while. It is therefore critical to run as many experiments in as short a time as possible. To ride the wave of innovation, engineers should be relieved of any friction while iterating over possible solutions.

And this is where the challenges started.

A figure-eight loop labeled ML and Ops. Inside the loops are more words: Plan, Data, Model Build, Test, Packaging, Deploy, Predict Serving, Performance Monitor. The last is connected back to the first.
The machine learning loop is part of a much bigger product iteration loop. It is important to iterate on the ML part as quickly and as easily as possible, because this directly translates into added value.

Know your origins…

For reference, I'll describe the standard cycle of one of our AI experiments. It usually starts in one of two ways: an AI engineer either tries to improve on some problems and behaviors from the previous iteration, or tries out a completely new idea. After researching possible approaches, they write some code, which can come in many variations; all of them may perform better than before, or none of them may.

There is always a possibility of changing the data needed for the experiment: adding new data, modifying existing data, or removing some. This should be tracked to reduce duplicated work, but also to avoid dataset mistakes, which (as many ML engineers know) are very, very subtle… if they're even discovered in the first place!

The engineer then wants to run the experiment. For this, they need access to some DL hardware, either hosted somewhere in the cloud or, in our case, provided by Photomath on our own self-hosted GPU servers. The AI engineer is (luckily, if I may add) not alone: there's a whole team of engineers who want to run various experiments simultaneously. This implies there should be synchronization of sorts, since they all compete for the same resources.

At last, after the experiment has completed, the AI engineer has to both evaluate the model and compare the results with other experiments, which may be related to the starting idea or to some other ideas. It's also sometimes worth revisiting an old idea with new data, but that implies they need access to the old idea's code, data, and performance.

It was a hectic time for Photomath's AI team. We tracked experiment metadata and performance in Google Sheets. We sometimes killed each other's experiments through CUDA OOMs or other unpredictable causes. Quite a few times, the wrong data was used. There were several copies of the code with slight changes, and little or no tracking. Plenty of experiments could not even in theory be compared, and that's just the portion of the problems we knew about. We were inefficient, we wasted time and resources, and there was heavy friction between iterations.

It was time for a change.

… to know where you’re headed

At that point, we were almost ready to code up our own solution. We had the skills and an approximate idea of what we were aiming for:

  • A system allowing people to remotely execute experiments on our hardware, without manual ssh-ing
  • A system that allows AI engineers to track the progress of the experiment in terms of loss value, performance, and experiment pace
  • A system that enables us to compare previous experiments and make more informed decisions

We started searching the internet for more resources and inspiration, to see what people had already built that could help us in our endeavors. And then, one day, we stumbled upon ClearML, which did everything we wanted at the time.

A showcase GIF of ClearML. ClearML's web UI makes it easy to see projects and subprojects, experiment details and metadata, and experiment performance, and to compare several experiments
Source: ClearML

Our AI engineers identified 7 distinct things they like about ClearML:

Easy to set up

After we discovered ClearML on a list of MLOps tools and decided we wanted to try it out, we were wary of potential hardships that might slow us down. However, thanks to their well-written deployment steps, it was a piece of cake: after a slight networking overhead (DNS, and hiding the GCE instance from the public), it boils down to a single docker compose invocation. Neat!

Easy to integrate

Most of our code was already written as standalone training scripts, since we ran them on our bare-metal machines. It was really easy to write another small script that wraps our experiment-running code in a clearml.Task and attaches some metadata to it.
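
For illustration, here is a minimal sketch of such a wrapper; the project, task, and queue names are made up, and the remote hand-off assumes a clearml-agent listening on that queue:

    from clearml import Task

    # Register this run as a ClearML task (names are illustrative)
    task = Task.init(project_name="recognition", task_name="baseline-experiment")

    # Attach hyperparameters / metadata so they show up in the web UI
    config = {"learning_rate": 1e-3, "batch_size": 64, "epochs": 50}
    task.connect(config)

    # Hand the task off to a clearml-agent queue instead of running locally;
    # "default" is a hypothetical queue name
    task.execute_remotely(queue_name="default")

    # ... the original standalone training script continues unchanged ...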

Our single pain point was that ClearML scanned the local environment for requirements when starting an experiment; since we rarely ran experiments locally, we would rather have had it use requirements.txt. But this got resolved quickly.
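
If memory serves, the SDK also lets you bypass the environment scan and pin an explicit requirements file; a sketch under that assumption:

    from clearml import Task

    # Assumption: use the project's pinned requirements file instead of
    # scanning the local environment; must be called before Task.init()
    Task.force_requirements_env_freeze(requirements_file="requirements.txt")

    # Or force a single package version explicitly (also before Task.init())
    Task.add_requirements("torch", "1.13.0")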

Experiment tracking

In order to iterate quickly enough, it's important that experiments don't last too long. While some of them run for a week or more, the good ones usually perform well from the start. ClearML's experiment tracking allows us to abort the experiments that don't seem to have a chance of beating the best one, or to kill the ones progressing too slowly.
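
ClearML auto-logs the output of most common frameworks, but metrics can also be reported explicitly; a minimal sketch, with made-up metric values:

    from clearml import Task

    task = Task.init(project_name="recognition", task_name="tracking-demo")
    logger = task.get_logger()

    # Report the training loss at every step so progress is visible live
    # in the web UI and hopeless or slow runs can be aborted early
    for step in range(100):
        loss = 1.0 / (step + 1)  # placeholder for a real loss value
        logger.report_scalar(title="loss", series="train", value=loss, iteration=step)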

Comparing various experiments

One of the largest impacts ClearML has had on Photomath's AI team, in the opinion of the author, is its experiment comparison feature. We used Google Sheets for that in the past (we're not exactly proud of that), and updating the sheets with the relevant info was cumbersome, tedious, and error-prone. Now, several minutes of searching for the relevant sheet, updating it, and analyzing the data are replaced with two or three clicks in a browser, with various graphs and important numbers on top.

This is a view of one of our internal experiments related to the LUMEN Data Science competition held by the eSTUDENT student organization. One can easily see the model's train and test performance and progress. For instance, one can conclude that this example converges too slowly and needs to be changed

Easy access to experiment artifacts

ClearML can be configured to store all of the artifacts in the cloud. It's very easy to find a checkpoint from a particular epoch, and in particular the last and the best-performing ones. These models are then candidates for deployment.

Furthermore, this enables reusing artifacts in subsequent experiments, a practice known as a warm start. ClearML allows AI engineers to easily define a model checkpoint from which an experiment should continue, downloading and loading it automagically.
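
As a sketch of how this can look in code (the task ID, artifact name, and file name are all hypothetical):

    from clearml import Task

    task = Task.init(project_name="recognition", task_name="warm-start-demo")

    # Upload a checkpoint produced by this run as a named artifact
    task.upload_artifact(name="checkpoint-best", artifact_object="best_model.pt")

    # Warm start: fetch the best checkpoint of a previous experiment
    previous = Task.get_task(task_id="abc123")  # hypothetical task ID
    checkpoint_path = previous.artifacts["checkpoint-best"].get_local_copy()
    # ... load the checkpoint and continue training from it ...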

ClearML dataset management

ClearML has support for managing your data. It will download and run the experiment with the correct data version, and it's smart enough to cache versions and version diffs. This wasn't a feature when we first started using ClearML, but we found a use for it as soon as it was released. Although our in-house math problem dataset is way too large to host directly, we store dataset metadata with ClearML Datasets.
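
A sketch of the Dataset workflow, with illustrative project, dataset, and path names:

    from clearml import Dataset

    # Publish a new dataset version containing, say, dataset metadata files
    ds = Dataset.create(dataset_project="recognition", dataset_name="math-problems-meta")
    ds.add_files("metadata/")
    ds.upload()
    ds.finalize()

    # In an experiment: fetch (and cache) the right version locally
    local_path = Dataset.get(dataset_project="recognition",
                             dataset_name="math-problems-meta").get_local_copy()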

Attractive and functional UI

We've never had to ask for help with the UI, which means it does its job and does it well.

All in all, to quote one of the engineers:

ClearML has everything we need

Like what you’ve read? Learn more about #LifeAtPhotomath and check out our job postings: https://careers.photomath.com/
