Start your next AI project with a failing test

Chris Schneider
9 min read · Feb 25, 2021


Many university courses, books and online trainings taught me either how to build software or how to train machine learning (ML) models, but few blended both worlds. Having worked on various AI projects in the IBM Garage, I want to share how using test-driven development (TDD) helped me build better AI-powered applications.

This blog post uses an AI-powered web app as an example of how to apply TDD in AI projects. In the web app, people can upload images taken inside and outside of houses. The website then displays whether a photo was taken outside or, if it was taken inside, in which room.

Image classification workflow. Image Sources: pexels.com

What is Test-Driven Development?

Software developers use test-driven development (TDD) to make sure that code works as expected. Instead of aiming for the perfect solution in the first pass, the code and tests are built together iteratively, one feature at a time. In TDD, developers first write a failing test and then just enough code to make that failing test pass.

Automated tests and TDD help developers in several ways:

  • Instant feedback: You instantly know whether the code fulfills the specification and edge cases (= tests).
  • Test-driven debugging: Detect problems early on and pin down which parts of the code are not working as expected.
  • Change with confidence: Other developers can implement new features or refactor without the fear of breaking the product.
  • Tackle challenges: Solving simpler cases first gives you confidence when tackling tougher challenges.
  • Little to no wasted effort: You only write the code needed to implement a requirement.

1. Start small

I regularly start projects with a simple test: people can visit the website in their browser, served by an app deployed to the cloud rather than running on my computer. This is the minimum functionality needed to interact with our AI application.

Why start small when there is a plethora of model code on GitHub published together with state-of-the-art papers? While that code usually lets you train a model and reproduce the metrics listed in the paper, making the model accessible to an application, often via an API endpoint, is rather difficult and time consuming. Many of these tools are grounded in research, and productionizing them or their artifacts is often not a priority. This simple first test therefore puts our focus on deployment early on and gives us the foundation to build on. Also consider that if you postpone deploying the model because “it takes a long time,” chances are deployment poses a higher risk than building the algorithm itself.

In pseudo code, this simple test and code could look like:
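
Here is a minimal sketch in Python, assuming pytest-style tests, the requests library and a placeholder URL standing in for the deployed app (none of these names come from the actual project):

    import requests

    APP_URL = "https://room-classifier.example.com"  # placeholder for the deployed app's URL

    def test_website_is_reachable():
        # Fails until an app is actually deployed and publicly reachable.
        response = requests.get(APP_URL, timeout=10)
        assert response.status_code == 200

Just enough code to make it pass could be a minimal Flask app pushed to the cloud:

    from flask import Flask

    app = Flask(__name__)

    @app.route("/")
    def index():
        # Placeholder page; the classifier comes later.
        return "Room classifier coming soon"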

2. Build the happy path

Next we build the happy path where data flows through our system: the image travels from our UI to an algorithm, and the prediction travels back from the algorithm to the UI. What is the simplest algorithm we could implement? Simply returning a constant or random value.

While this might seem trivially easy, it will help your overall project:

  • Alignment: “Show don’t tell” ensures that people are not talking past each other.
  • Data collection: You can collect data in the same way as people will use the product, reducing bias. For example, “in the wild” people will take photos with their old smartphones and subpar lighting. You should collect your testing data this way.
  • Collect feedback: Prove that the product is valuable to customers and get feedback to identify issues early on.
  • Integration: Integrate between the UI and ML sides of the project from the beginning. That way there is always a deployed version you can show.

Again, tests and implementation could look like:
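
For example, in Python (the predict() interface and the class names are illustrative placeholders, not the project's actual code):

    # The simplest possible "algorithm": always predict the most frequent class.
    def predict(image_path):
        return "outside"

    def test_happy_path_returns_a_known_label():
        # End-to-end check: an image goes in, a valid label comes out.
        label = predict("tests/fixtures/kitchen.jpg")
        assert label in {"outside", "kitchen", "living room", "bathroom", "bedroom"}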

With an unbalanced data distribution (e.g. ~60% of photos are taken outside), this algorithm is already better than a random coin flip.

Over 60% of images are taken outside. Image Sources: pexels.com

This will set the benchmark for all future algorithms:
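
For instance, a benchmark test could assert that accuracy stays above the constant baseline, assuming a hypothetical load_test_set() helper that returns labelled (image_path, label) pairs; the 0.6 threshold reflects the class distribution above:

    def accuracy(model_fn, test_set):
        # Share of samples for which the model predicts the correct label.
        correct = sum(1 for image_path, label in test_set if model_fn(image_path) == label)
        return correct / len(test_set)

    def test_beats_the_constant_baseline():
        test_set = load_test_set()  # hypothetical helper returning (image_path, label) pairs
        assert accuracy(predict, test_set) >= 0.6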

What to test for?

Writing this test raises the question of what to actually test for. While conventional code runs deterministically, AI algorithms are probabilistic. The success of machine learning models is often measured with algorithm-specific metrics such as accuracy, recall or mAP, and we can spend days or weeks optimizing them. Unfortunately, these metrics rarely predict the overall product’s commercial success. Therefore, we should primarily test for product-specific metrics that align with our business goals.

In the same way that TDD keeps developers from wasting time writing unneeded code, it can also keep us from wasting time unnecessarily optimizing algorithms. You should optimize, and write a failing test, for issues identified in testing with real end users, not for hypothetical edge cases that might never occur. Instead of aiming for perfection (100% accuracy or test coverage), aim to meet the business objectives.
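
As a sketch, a product-level test could look like the following, assuming we log whether users found a classification helpful and that the business target of 85% is a made-up figure:

    def test_users_rate_classifications_as_helpful():
        feedback = load_feedback_log()  # hypothetical helper returning user feedback entries
        helpful_share = sum(entry.was_helpful for entry in feedback) / len(feedback)
        assert helpful_share >= 0.85  # illustrative business target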

3. Improve (and learn) test by test

After making it work, we now make it better! We find the biggest bottlenecks in our happy path, create a list of possible improvements and test them. Here, TDD can facilitate the way research and data science work:

Instead of casually formulating a hypothesis and verifying it manually, why not make it explicit and automate it? Re-use your automated hypothesis tests to quickly conduct experiments, measure their impact and incrementally improve the algorithm.

A hypothesis to improve the algorithm could be: if I classify images with a high proportion of green as “outside” and the rest as “kitchen”, accuracy will improve.
We could implement this in pseudo code as:
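
For example, with Pillow and NumPy (the green-dominance measure and the 0.3 threshold are illustrative choices, not the original implementation):

    import numpy as np
    from PIL import Image

    def predict(image_path):
        pixels = np.asarray(Image.open(image_path).convert("RGB"), dtype=float)
        r, g, b = pixels[..., 0], pixels[..., 1], pixels[..., 2]
        green_share = np.mean((g > r) & (g > b))  # fraction of green-dominant pixels
        return "outside" if green_share > 0.3 else "kitchen"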

Outside images have a high proportion of green. Image Sources: pexels.com

As there is also an imbalance among the indoor pictures, this algorithm gets us to ~80% accuracy, without any deep learning. This raises the threshold for all future algorithms:
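
Re-using the accuracy helper and test set from the earlier sketch:

    def test_beats_the_green_heuristic_baseline():
        test_set = load_test_set()
        assert accuracy(predict, test_set) >= 0.8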

Be the algorithm

Many AI algorithms are inspired by the human brain. So why not reflect on how humans currently solve the problem? I found it very helpful to learn from domain experts. Be the algorithm yourself: test-drive which strategies and heuristics work and get a feeling for how the deep learning network might behave. This will help you later when optimizing training data and hyper-parameters.

Stop when it is “just enough…”

For some applications, 70% accuracy might already be good enough, and development could stop with a pretty simple and maintainable algorithm. To achieve higher accuracy, we could integrate a machine learning model into the application. In the TDD spirit of only writing enough code to pass a failing test, we should use high-level frameworks such as Keras, huggingface or NLTK whenever possible before writing custom algorithms in TensorFlow or PyTorch.

Improve hypothesis by hypothesis, test by test

We can formulate further hypotheses, write a failing test, implement the changes and test each hypothesis. For example, we could formulate (the first one is sketched after the list):

  1. If I use a pre-trained ResNet model with a new classifier, accuracy will improve.
  2. If I augment the training set with different exposures, accuracy will improve.
  3. If I add fine-tuning, accuracy will improve.
  4. If I add hyper-parameter optimization, accuracy will improve.
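
Hypothesis 1 could, for example, be test-driven with a Keras transfer-learning model behind the same interface. A sketch, assuming TensorFlow/Keras; the layer choices are illustrative and resnet_predict is a hypothetical wrapper around the trained model:

    import tensorflow as tf

    def build_resnet_classifier(num_classes):
        # Pre-trained ResNet50 backbone with a new classification head.
        backbone = tf.keras.applications.ResNet50(
            include_top=False, weights="imagenet", pooling="avg"
        )
        backbone.trainable = False  # train only the new head at first
        return tf.keras.Sequential([
            backbone,
            tf.keras.layers.Dense(num_classes, activation="softmax"),
        ])

    def test_pretrained_resnet_improves_on_the_heuristic():
        test_set = load_test_set()
        assert accuracy(resnet_predict, test_set) > 0.8  # current heuristic baseline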

Incorporate new requirements with a failing test

As we build out the product and get feedback from the people using it, requirements will certainly change. The same way we improve the algorithm, we can also extend it to meet new requirements. For example, if we also want to cover images taken at night, we would (see the sketch after this list):

  1. Add images taken at night to the tests.
  2. Run the tests to see whether our algorithm is still “good enough” or if we have a failing test.
  3. Improve the algorithm, for example by adding new samples or image augmentation logic, to make the failing test pass.
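
A sketch of steps 1 and 2, assuming a hypothetical load_labelled_images() helper and the existing benchmark:

    def test_accuracy_including_night_images():
        # Extend the test set with night-time samples and re-run the benchmark.
        test_set = load_test_set() + load_labelled_images("tests/fixtures/night/")
        assert accuracy(predict, test_set) >= 0.8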

4. Test-drive the ML pipeline

Sooner or later you might integrate a machine learning (ML) algorithm into your application. ML models automatically learn from data without being explicitly programmed, and we measure their performance with data they have not been trained on. Commonly we split our dataset into three subsets (a minimal split is sketched after the list):

  • Training set: Used to fit the model.
  • Validation set: Used to assess the model performance in the training phase and guide improvements.
  • Test set: Used to assess the likely future performance on unseen data; never used for optimization.
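
A minimal split sketch, assuming a list of (image_path, label) samples and scikit-learn's train_test_split; the helper name and the 15% ratios are illustrative:

    from sklearn.model_selection import train_test_split

    samples = load_labelled_images("data/")  # hypothetical helper returning (image_path, label) pairs
    labels = [label for _, label in samples]

    # Hold out 15% as the test set, stratified so class proportions are preserved.
    train_val, test_set = train_test_split(samples, test_size=0.15, stratify=labels, random_state=42)
    # Split the remainder again into training and validation sets.
    train_set, val_set = train_test_split(
        train_val, test_size=0.15, stratify=[label for _, label in train_val], random_state=42
    )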

In addition to measuring the algorithm’s performance, we should also test the training code used to build the algorithm, as wrong code leads to bad algorithms and could be one of the reasons why a model performs poorly. Properly testing your ML pipeline thus helps you pin down errors faster: Are my results bad because of issues with the training data? Is there simply a coding error? Are my metrics calculated wrong?

We can TDD the ML pipeline at different granularities:

Unit tests

Unit tests are the foundation of our tests. They make sure that a certain unit of our product or ML pipeline works. You can TDD each aspect of the ML pipeline, for example (a sketch follows the list):

  • Are pre-processing functions working as intended?
  • Is the training data formatted correctly?
  • Is there data leakage, e.g. duplicates, in the dataset?
  • Are metrics actually calculated correctly?
  • Is the training code correct? Is it working for a very simple problem?
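
For instance, a unit test for a hypothetical preprocess() function that should resize images to the model's input shape and scale pixel values to [0, 1] could look like this:

    import numpy as np

    def test_preprocess_resizes_and_scales():
        raw = np.random.randint(0, 256, size=(480, 640, 3), dtype=np.uint8)
        processed = preprocess(raw)  # function under test (hypothetical)
        assert processed.shape == (224, 224, 3)
        assert processed.min() >= 0.0 and processed.max() <= 1.0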

Validation tests

With validation tests we use algorithm-specific metrics to assess how the algorithm could be improved. In real-world applications this is often achieved by improving the quality of the training set, for example by adding more samples of edge cases or re-labelling incorrectly categorized images. Usually these tests are run in Jupyter notebooks or as part of an automated ML pipeline. In these tests you can assess metrics such as (a sketch follows the list):

  • Accuracy, precision, recall, F1 score…
  • Overfitting, underfitting
  • Performance within each class
  • False positives, false negatives
  • Worst performing images
  • Commonly confused images or classes
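
As a sketch, per-class recall could be inspected with scikit-learn's classification_report, assuming labels and predictions for the validation set (val_labels, val_predictions) are already available:

    from sklearn.metrics import classification_report

    report = classification_report(val_labels, val_predictions, output_dict=True)
    for class_name, metrics in report.items():
        # Skip aggregate entries such as "accuracy" that are not per-class dicts.
        if isinstance(metrics, dict) and "recall" in metrics:
            print(f"{class_name}: recall={metrics['recall']:.2f}")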

Acceptance tests

Finally, we also need to make sure that our product works correctly from a user’s perspective, not just from a technical one. These tests give us the biggest confidence in deciding whether our product is working or not. Acceptance tests span the whole product, should be automated and are based on the test set. They include (a sketch follows the list):

  • Happy paths: Need to work 100%. If they don’t work, people will shake their heads…
  • Generalization: Does the model fit unseen data from the test set?
  • Product-specific metrics (business KPIs).
  • User stories.
  • Bugs that were fixed.
  • Handling ML failures gracefully.
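
A sketch of an automated acceptance test against the deployed app, assuming a hypothetical /classify endpoint that accepts an image upload:

    import requests

    def test_happy_path_upload_returns_a_known_room():
        with open("tests/fixtures/kitchen.jpg", "rb") as f:
            response = requests.post(f"{APP_URL}/classify", files={"image": f}, timeout=30)
        assert response.status_code == 200
        assert response.json()["label"] in {"outside", "kitchen", "living room", "bathroom", "bedroom"}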

5. Test for real world usage

Be aware that people familiar with the project will test it differently than “unbiased” testers. Your colleagues know which shortcuts to take and how to confuse the system. It might be fun to upload a gingerbread house (and if your boss tests the system that way, you will probably need to cope with such images), but be aware that this will probably not reflect real-world usage. The tests you write should be based on real-world tests and the behavior of unbiased people, not your colleagues: What pictures do they upload? How do they use the app? When was the classification (embarrassingly) wrong?

Conclusion

TDD is only on the surface about writing unit tests. Rather it is about instantly validating success. This closely aligns with how AI algorithms are commonly improved—we change a parameter or the training data and validate whether the prediction improved or not. With TDD you can speed up this process and instantly validate the success of a change within both the AI app and ML pipeline.

Being able to instantly validate success is key to building maintainable, high-quality AI apps. While you can usually build a works-80%-of-the-time AI app quickly without TDD, you will spend most of your time improving the last 20%. And here things get rather challenging without automated tests. You will improve some parts of the system, only to find out later that you broke another part. Then you will either implement “a quick workaround” that makes the AI system even harder to maintain and debug, or you will spend a lot of time figuring out and fixing the root cause. Using TDD, however, you can instantly verify that all requirements are met with every change you make, so you know exactly which change caused a regression. In addition, test-driven debugging will further guide you in pinning down the root cause of the issue.

So: Start your next AI project with a failing test!

Learn more about IBM Garage at www.ibm.com/garage
