Testing Automation — Making It Right #3: Test Machine Learning Based Applications

Yonatan Katz
4 min readApr 13, 2020

This is the third part of the series.
#1 — Theory
#2 — Test Types and Integration Tests

Why dedicate a post to testing machine learning applications? Isn’t it just like any other piece of code?

Well, in some respects it’s like any other app that needs to be tested, but in others it’s different.

(a) All machine learning algorithms are based on statistics. They don’t guarantee that a specific input will work, and that makes them hard to unit-test. Writing a test that you aren’t sure will stay “green” in the future is problematic.

(b) It’s hard to estimate the effect of a new ML model on the final product. Even if the overall accuracy increases, that doesn’t mean a specific customer won’t be badly affected.

(c) Running tests that check the ML statistics (accuracy, false positives, precision/recall, etc.) on a large input set may take a long time. We want an efficient dev environment, and we cannot afford every test run to take an hour.

In this blog post I suggest, from my experience, several things to consider. The goal is clear: we need one gateway that tells us whether we can deploy a new version or not.

1. Separate the model and the application

Since checking the ML statistics may be slow to run during development, or even as part of CI, the model should be a separate deliverable, and its tests should run in a separate CI pipeline. The statistics tests should be a gateway only for the ML model, not for the entire application.
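As a minimal sketch of that split, assuming pytest (the marker name, file names, and test body below are purely illustrative), the slow statistics tests can be tagged with a custom marker so each pipeline selects only what it needs:

```python
# conftest.py -- register a custom marker for the slow model-statistics tests
# (the marker name "model_stats" is an arbitrary choice for this sketch).
def pytest_configure(config):
    config.addinivalue_line(
        "markers", "model_stats: slow tests that evaluate model statistics"
    )
```

```python
# test_model_stats.py -- lives with the model deliverable, not the app.
import pytest

@pytest.mark.model_stats
def test_precision_on_holdout_set():
    ...  # heavy evaluation over the full holdout set goes here
```

The application pipeline can then run `pytest -m "not model_stats"`, while the model pipeline runs `pytest -m model_stats`.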

2. Define “improvement” of your statistics

You need to define what counts as an improvement of your model, and break the CI if you have a degradation. I am not defining “degradation” here, as it is a project-specific decision: in some projects the total error rate is the main number, in others false positives matter more than false negatives, and so on. First define it, then enforce it.
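A sketch of how such a gateway might look, assuming the metric names, thresholds, and baseline file are project-specific placeholders: keep the last released model’s metrics in a committed baseline file and fail the build on any disallowed drop.

```python
import json

def evaluate_model():
    """Placeholder: run the real evaluation and return a metrics dict."""
    return {"precision": 0.92, "recall": 0.88}

def test_no_metric_degradation():
    # Baseline metrics committed next to the model code
    # (file name and metric names are illustrative).
    with open("metrics_baseline.json") as f:
        baseline = json.load(f)

    current = evaluate_model()

    # Example policy: precision must not drop at all,
    # recall may drop by at most half a point.
    assert current["precision"] >= baseline["precision"]
    assert current["recall"] >= baseline["recall"] - 0.005
```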

3. Do your best to have consistent results

In many cases I have encountered, two successive training runs produce different results, even when the code and data have not changed. Obviously, this can make the tests “flicker”. One solution is to allow a degradation of X% from the previous run, but that’s not the best option: first, because after several runs the accumulated degradation can become large; second, because you may release an app version even though you don’t really know what caused the degradation or what its impact is.

The better option (although not always possible) is to make sure you get consistent results: if you have a data set, keep it sorted, so every run on every machine consumes the input in the same order; if you run over files, sort them by name; avoid iterating over unordered data structures (e.g. sets); make sure every random number generator uses a constant seed (even if you need to hack the ML library a bit; true story). You get my point.
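In Python, a deterministic setup might look like the sketch below (the seed value, file pattern, and the framework calls mentioned in the comment are assumptions):

```python
import glob
import os
import random

import numpy as np

def make_deterministic(seed=42):
    """Pin every random source we control to a constant seed."""
    random.seed(seed)
    np.random.seed(seed)
    # If your framework has its own RNG, seed it here as well,
    # e.g. torch.manual_seed(seed) or tf.random.set_seed(seed).

def training_files(data_dir):
    """Return the input files in a fixed order on every machine."""
    return sorted(glob.glob(os.path.join(data_dir, "*.csv")))

# Likewise, iterate sorted(some_set) instead of the set itself.
```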

4. Write “regular” tests for demos and sanity scenarios

When the salesperson goes to a customer, the demo must work. Excuses like “it’s an ML-based algorithm, that’s why it doesn’t work for this input; it will work for other inputs” look bad.

Test any input that is used in a demo. That’s good advice for every project, but especially for an ML project.
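For example, a plain test that pins the exact record shown in the demo to the answer the salesperson expects (the loader, model object, file path, and expected label are all hypothetical):

```python
from myapp.demo_utils import load_record   # hypothetical helper
from myapp.model import model              # hypothetical trained-model handle

def test_demo_record_still_flagged():
    demo_record = load_record("demos/acme_example.json")
    # The demo shows this exact record being flagged as fraud; keep it that way.
    assert model.predict(demo_record) == "fraud"
```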

5. Try to engage with your customers to integrate their data into your tests

ML training assumes that your data set represents the real distribution of the data. But what happens when a customer’s main data is not well represented in your data set? Delivering a new model can affect their flow dramatically, even though your accuracy is higher in the newer version.

Therefore, ask your customers to provide you with a “sanity data set”, and add it to your tests. Your customers will thank me.
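One way to wire such a sanity set into the suite, assuming each customer case is a small JSON file holding the input and the output the customer expects to stay stable (the directory layout and model API are assumptions):

```python
import glob
import json

import pytest

from myapp.model import model   # hypothetical trained-model handle

# One JSON file per customer-provided case; sorted for a stable test order.
SANITY_CASES = sorted(glob.glob("tests/customer_sanity/*.json"))

@pytest.mark.parametrize("case_path", SANITY_CASES)
def test_customer_sanity_case(case_path):
    with open(case_path) as f:
        case = json.load(f)
    assert model.predict(case["input"]) == case["expected"]
```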

6. Test the code around the ML logic

From time to time, when you have system tests, you skip writing integration tests because the scenarios are already covered. I can understand that, though I don’t agree with it.

In machine learning testing, don’t assume that if training passes and accuracy increased, the code doesn’t require further testing. Test every piece of code separately.

If you have feature extraction code, test it. If you have pre-processing, post-processing, or anything related, test it separately. Don’t assume the final ML statistics are enough.
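For instance, a unit test for a hypothetical extract_features function, checking behavior that the aggregate model statistics would never surface:

```python
from myapp.features import extract_features   # hypothetical module

def test_extract_features_handles_missing_age():
    raw_record = {"age": None, "country": "IL"}
    features = extract_features(raw_record)
    # A missing age should map to the sentinel value, not crash
    # or silently turn into 0 (which the model might quietly learn around).
    assert features["age"] == -1
    assert features["country_code"] == "IL"
```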

One of the reasons is that ML can hide bugs: a model can work reasonably well even with a bug, because the bug may be consistent. Yet the model could improve considerably if the code were correct.

If I had to summarize this post, the summary would be: make sure you’re comfortable shipping your new model when your tests give you a green light.

Afraid of a customer use case? Add it to the tests. Main flows can be affected? Test them.

Finally, test your stats.

Now, you’re ready for deployment!

