Carl Whalley
Aug 18 · 6 min read

I recently gave a talk on machine learning where I was very happy to find myself on the same billing as an old colleague I’d not clapped eyes on for well over a decade, Leigh Rathbone. I worked with Leigh back at Sony Ericsson, and he’s now a top honcho at Shop Direct, a huge UK ecommerce business. We were chatting afterwards when it struck us that there’s a whole new world of uncertainty lurking where our talks intersected: testing AI.


Code vs Data

Automated software testing has come a long way since it first appeared, blinking in the sunlight, back when it often seemed like more trouble than it was worth to set up. It had to. Software has grown so complex, and changes so often, that assuring it behaves as it should quickly outstripped any manual, human ability to keep up. Think about the richness of even single-user apps on desktops, browsers or smartphones. The free-flowing experience offered to the user, who can now select options and move through an app in crazy, unforeseen ways, is completely removed from the linear, fixed flows of old.

Most software teams now use some form of TDD, and the green light back from the CI server after all the tests on a commit have passed is always worth a proud high five. It’s ingrained into IDEs too, so maintaining the test source tree alongside production code is nothing like as painful as it used to be. Many development shops even enforce commit rules, such as no production code being checked in without a corresponding test. I’ve worked on systems which ran thousands of these tests overnight, all against the day’s work the team had done. In fact, the first thing to do each morning was check the output, and I know this is still how it works at many sites.

The interesting thing here is how these tests were implemented. Let’s get concrete: I once worked on a large scheduling system which had a Java back end and managed its SQL data via Spring and Hibernate. Each night, a clone of the day’s live data was taken and the tests run against it, plus another run against a standard set of test data for reference. The tests were coded for JUnit using the usual annotations. This is all standard stuff — the tests are code, so they are repeatable and can be developed and debugged just the same as production code.
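To make that concrete, here is a minimal sketch of the kind of repeatable, code-based test I mean. In the real system the repository was a Spring/Hibernate-backed DAO wired against the nightly clone of live data; here an in-memory stand-in keeps the example self-contained, and ScheduleRepository and Schedule are invented names purely for illustration.

```java
import static org.junit.Assert.assertEquals;

import java.util.HashMap;
import java.util.Map;

import org.junit.Before;
import org.junit.Test;

public class ScheduleRepositoryTest {

    // Invented domain class standing in for a Hibernate-mapped entity.
    static class Schedule {
        final String name;
        Schedule(String name) { this.name = name; }
    }

    // Invented DAO; in the real system this would be a Spring bean talking to SQL.
    static class ScheduleRepository {
        private final Map<String, Schedule> store = new HashMap<>();
        void save(Schedule s) { store.put(s.name, s); }
        Schedule findByName(String name) { return store.get(name); }
    }

    private ScheduleRepository repository;

    @Before
    public void setUp() {
        repository = new ScheduleRepository();
    }

    @Test
    public void savedScheduleCanBeReadBack() {
        repository.save(new Schedule("nightly-batch"));

        Schedule loaded = repository.findByName("nightly-batch");
        assertEquals("nightly-batch", loaded.name);
    }
}
```

The key property is that the test is deterministic: same code, same data, same result, every night.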

That’s fine, and for those kinds of systems it’s still the way to go. However, AI, and in particular its machine learning incarnation, isn’t like that. We now have a vast complex of nodes, weights, activations and all the mathematics operating on them representing our software. It’s a kind of sealed box: if we really, really wanted to, we could spend an inordinate amount of time stepping through each stage to work out why a certain output was arrived at, but that’s nothing like the old if … then … else logic we’re used to.

So if that system were to instead use machine learning, would the rigid database test techniques still be valid? The point is that you’re now testing both the code and the data, so when either changes, are the tests still valid?

Deep learning relies on data. Lots of it. And as things progress, the problem of its operation being too complex to know what’s going on inside is only set to get worse. Tommi Jaakkola, MIT Professor of Electrical Engineering and Computer Science, says:

Whether it’s an investment decision, a medical decision, or maybe a military decision, you don’t want to just rely on a ‘black box’ method.


The quality of data

Here’s an idea which illustrates the problem. Suppose a year ago a machine learning system was created using the tools available at the time. It was trained with an appropriate set of data, and produced predictions when presented with new data. As time went on, improvements were made to the software: speed increases from new formulas or tweaks to old ones, refactoring and so on. That software is in a much better state today. Now, if we ran the year-old version with the training data we had then, we’d get a certain result. You’d want to run today’s version with today’s data, and naturally expect it to give better results, but that’s no longer necessarily the case. At the extreme, if today’s data wasn’t as good a fit with the current version as last year’s data was with its version, you’d be in the strange situation of newer, “better” software performing worse than older versions. You might argue that’s always been possible — a lousy database today would hurt regular software too — but the difference is that there the data wasn’t such a fundamental, core component of the logic itself. You used to be able to improve things by changing the code alone; now you change the data.
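A toy sketch makes the point. Everything below is invented purely for illustration — the “models” are one-line functions and the “datasets” are a couple of points — but it shows how evaluating each version against its own data can leave the newer, “better” code scoring worse. A real pipeline would load versioned models and versioned datasets instead of these stubs.

```java
import java.util.List;
import java.util.function.Function;

public class ModelDataDrift {

    // In this sketch a "model" is just a function from input features to a predicted label.
    static double accuracy(Function<double[], Integer> model,
                           List<double[]> inputs, List<Integer> labels) {
        int correct = 0;
        for (int i = 0; i < inputs.size(); i++) {
            if (model.apply(inputs.get(i)).equals(labels.get(i))) {
                correct++;
            }
        }
        return (double) correct / inputs.size();
    }

    public static void main(String[] args) {
        // Hypothetical stand-ins for last year's and today's model.
        Function<double[], Integer> lastYearsModel = x -> x[0] > 0.5 ? 1 : 0;
        Function<double[], Integer> todaysModel    = x -> x[0] > 0.4 ? 1 : 0;

        // Hypothetical stand-ins for last year's and today's data.
        List<double[]> lastYearsData   = List.of(new double[]{0.6}, new double[]{0.3});
        List<Integer>  lastYearsLabels = List.of(1, 0);
        List<double[]> todaysData      = List.of(new double[]{0.45}, new double[]{0.42});
        List<Integer>  todaysLabels    = List.of(1, 0);

        // The "improved" code can still score worse if today's data fits it less well.
        System.out.println("old model, old data: "
                + accuracy(lastYearsModel, lastYearsData, lastYearsLabels));
        System.out.println("new model, new data: "
                + accuracy(todaysModel, todaysData, todaysLabels));
    }
}
```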

This brings us to the issue of testing such software. Clearly, a completely different approach is required from checking that a method to add numbers gives the correct result, and it’s most likely going to be something very specific to the AI in question. Say it’s face recognition, and you show it Elvis. What counts as a pass? More than 95% confidence? The exact same value it got before? Will it get all shook up if it sees a lookalike, as a human would?
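One plausible answer is to assert thresholds and tolerance bands rather than exact values. The sketch below assumes a hypothetical FaceRecognizer interface and made-up numbers; the stub always returns 0.97 just so the example runs, and the thresholds would in practice come from your own acceptance criteria.

```java
import static org.junit.Assert.assertTrue;

import org.junit.Test;

public class FaceRecognitionTest {

    // Hypothetical interface standing in for the trained model under test.
    interface FaceRecognizer {
        double confidenceItIs(String personLabel, byte[] image);
    }

    private final FaceRecognizer recognizer = (label, image) -> 0.97; // stub for the sketch
    private final byte[] elvisPhoto = new byte[0];                    // placeholder image

    @Test
    public void recognisesElvisWithAtLeast95PercentConfidence() {
        double confidence = recognizer.confidenceItIs("elvis", elvisPhoto);
        assertTrue("confidence was " + confidence, confidence >= 0.95);
    }

    @Test
    public void confidenceHasNotDriftedFarFromLastAcceptedRun() {
        double lastAccepted = 0.97;   // value recorded from the previously approved model
        double confidence = recognizer.confidenceItIs("elvis", elvisPhoto);
        assertTrue(Math.abs(confidence - lastAccepted) <= 0.03);
    }
}
```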


Using AI to test other software

Of equal interest is the fact that AI is a tool which can be used to drive the software testing process, and yes, that means it can test itself. There are many areas you could train an AI to be aware of before you even begin letting it loose on the specifics of your app. For example, security is a massive concern, so an AI could automatically attempt to inject code into any user-facing text field on a web page, perform buffer overflow attacks, or even stress test the system via DDoS. Also, since we’re talking machine learning here, each new project could get better because, quite literally, the lessons learnt from all the previous ones could be applied to the new ones.
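A crude, scripted version of that injection idea might look like the sketch below, using Selenium WebDriver. An ML-driven tester would go further, choosing and mutating payloads based on what it has learnt from earlier projects; here the payload list is fixed, the URL is a placeholder, and a local chromedriver is assumed to be available.

```java
import java.util.List;

import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.chrome.ChromeDriver;

public class InjectionProbe {

    private static final List<String> PAYLOADS = List.of(
            "' OR '1'='1",                        // classic SQL injection probe
            "<script>alert(1)</script>",          // reflected XSS probe
            "A".repeat(10_000)                    // oversized input / buffer abuse
    );

    public static void main(String[] args) {
        WebDriver driver = new ChromeDriver();
        try {
            driver.get("https://example.test/login");   // placeholder URL

            // Try each payload against every visible text field on the page.
            for (String payload : PAYLOADS) {
                for (WebElement field : driver.findElements(By.cssSelector("input[type='text']"))) {
                    field.clear();
                    field.sendKeys(payload);
                }
                // A real harness would submit the form and inspect the response here.
            }
        } finally {
            driver.quit();
        }
    }
}
```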

Smarter ways to perform the tests themselves may be just around the corner too. Imagine a non-trivial Java project, with multiple classes and packages, and no unit tests at all. Could an AI trawl through it and be smart enough to create them, with meaningful parameters?

How about automated user flows beyond the regular record/playback scenarios — ones which really did look at what the user was trying to accomplish? Just as some ML systems use “reward” techniques to learn to rack up high scores in video games, these would create flows where the goal is to find a bug.
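Here is a toy sketch of that “reward for finding a bug” idea: explore an app by taking actions and score the run by whether it triggered a failure. A real reinforcement learning setup would learn a policy over many episodes rather than acting at random, and AppUnderTest is a hypothetical stand-in that deliberately fails on one action so the explorer has something to find.

```java
import java.util.List;
import java.util.Random;

public class BugHuntingExplorer {

    // Hypothetical interface to the application being explored.
    interface AppUnderTest {
        void perform(String action);
    }

    public static void main(String[] args) {
        List<String> actions = List.of("tapLogin", "addToBasket", "checkout", "goBack");
        Random random = new Random(42);

        // Stub app that "crashes" on one particular action, to give the explorer a target.
        AppUnderTest app = action -> {
            if (action.equals("checkout")) {
                throw new IllegalStateException("basket was empty");
            }
        };

        double reward = 0.0;
        for (int step = 0; step < 20; step++) {
            String action = actions.get(random.nextInt(actions.size()));
            try {
                app.perform(action);
            } catch (RuntimeException bug) {
                reward += 1.0;   // finding a failure is the goal, so it earns the reward
                System.out.println("step " + step + ": " + action + " -> " + bug.getMessage());
            }
        }
        System.out.println("total reward for this run: " + reward);
    }
}
```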

Thinking back to the chat with Leigh, you could imagine this becoming more of an issue in ecommerce as AI becomes more prevalent. Areas already known to be of interest include recommendation engines, per-customer pricing, dynamic offer variations and so on. And that’s just his one field — what about the zillions of others, such as banking, medicine, process control and beyond?

We live in interesting times indeed.
