How It All Started
Ever heard that startups need to pivot? Yeah … not just once.
I was working at a software company as a student. At some point I was given a small change request for an existing project, which I finished on my own. Then the company wanted to make sure everything worked as expected. But for some reason testing was postponed and postponed until the very last minute. Then, one day before the release, my changes were tested. And for some very strange reason, the tester found bugs. So I fixed the bugs. He tested again. He found different bugs — things that had worked before were now broken. So I fixed those bugs. He tested again. Guess what.
This went on well past midnight. Which was especially aggravating, because that day was my testing counterpart’s birthday. So instead of celebrating with his wife and kids and friends … he was with me, testing my crap. And the most annoying part was that the problem wasn’t the new functionality … it was the functionality that used to work, and now didn’t. The side effects. The regressions.
And although this was my most intense encounter with regressions, it certainly wasn’t the last. Regressions haunted me in each and every project. And test automation, which (in theory) should solve this problem, didn’t. It actually made things somewhat worse.
Because the current state of GUI test automation is that it doesn’t work very well. Tests require a lot of effort to create. Once they are created, people tend to forget what they do and what they test. They are a nightmare to maintain and brittle as hell. If a test is red, you don’t know whether the test is broken or your application is. And the effort to find out is usually considerable. Often enough, the tests are flaky: they switch from red to green seemingly at random. So you don’t trust them anymore. But you also don’t do manual checking anymore, because you have test automation. And since these tests took a lot of effort to create, and take a lot of effort to maintain … it can’t all be a waste, right? Besides, everyone else is doing it too — so it must be the right thing to do, right?
But still, despite our best efforts, test automation didn’t work satisfactorily. And this is true for big companies as well. I was once on a project where I worked together with HP. They have a f* test automation tool in their own product portfolio. And still it didn’t work well. These highly paid people, with their own tool, could not get it to work satisfactorily. To me, that meant it was clearly broken.
At that time, I worked part-time because I was still a student. After my studies — because it was so much fun — I decided to go full-fledged crazy and try to get a PhD. As the topic for my thesis I suggested: “Find a way to predict the impact of software changes.” As is often the case in academia, I ended up doing something completely different. But anyway, at least that was my intent.
After finishing my PhD, I decided to try to found a company.
My thesis ended up having something to do with test generation at the unit level, so I wanted to found a company based on that. In Germany, you can get initial government funding for an innovative idea that stems from research. So I applied for that. And I was rejected, because my idea wouldn’t scale and wouldn’t work across many different projects — there are just too many unresolved real-world problems with that approach.
But I still wanted to found a company. I was somewhat flexible on the idea, though. A colleague of mine had a little research prototype that monkey-tested the GUI, using the same underlying test generation technique as my project. For those who don’t know: monkey testing means randomly using the GUI until the application crashes (broadly speaking). This approach sounded promising too. So I asked him whether I could found a company based on his approach. Since he was not interested in using his results, he agreed.
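Broadly, a monkey tester is just a loop that picks random GUI actions until something crashes. Here is a minimal, framework-agnostic sketch of that idea — all names (`visible_widgets`, `possible_actions`, `perform`) are hypothetical placeholders, not the actual prototype’s API:

```python
import random

def monkey_test(app, max_steps=1000, seed=42):
    """Randomly exercise `app` until it crashes or the step budget runs out.

    `app` is any object exposing the hypothetical methods used below.
    Returns (action_log, exception): the log lets you replay a crash,
    and the exception is None if the app survived the whole run.
    """
    rng = random.Random(seed)  # fixed seed makes a found crash reproducible
    actions = []               # log of (widget, action) pairs performed so far
    for _ in range(max_steps):
        widgets = app.visible_widgets()        # e.g. buttons, text fields
        if not widgets:
            break
        widget = rng.choice(widgets)
        action = rng.choice(widget.possible_actions())  # click, type, ...
        actions.append((widget, action))
        try:
            app.perform(widget, action)
        except Exception as exc:               # a crash: report log and error
            return actions, exc
    return actions, None                       # survived the budget
```

The seed matters in practice: a random crash you cannot reproduce is nearly worthless as a bug report.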
I took the prototype implementation of the monkey testing approach and made it work for the big ugly real-world project of my first customer.
Then, the customer told me that the tests were too random and often ended up testing scenarios that were hardly relevant to any of their existing real-world customers. Although the bugs crashed the application, they were still low priority. But on the other hand, the interesting, critical functionality was hardly ever triggered, because the random monkey didn’t know how to trigger it. Bummer.
Then we came up with another idea: let’s record tests, so that we can give them to the artificially intelligent monkey and train him on how to use the software, hoping he could then test the critical functionality and trigger high-priority bugs. Which I implemented.
Then the customer noticed: hey — recording and maintaining those tests is actual effort. And just finding some technical bugs doesn’t justify that effort, because technical bugs are only a small percentage of the overall number of bugs, and a small risk in the overall project. If we record and maintain the tests anyway, they should also work as actual executable test cases, like in traditional capture-and-replay tools, to justify the effort.
But I didn’t like that idea, because it would not integrate well with the monkey: the monkey wouldn’t know which assertions to create. It would essentially degrade into yet another capture-and-replay tool that could additionally monkey-test. So I rejected the idea of letting the user define assertions. I had had too many bad experiences with that. That was not a solution I could sell with a straight face.
I thought long and hard about that problem, and I realized that the monkey could define assertions; it just didn’t know whether they were correct. But they would not need to be correct; they would just need to check for changes and show the impact of those changes to the user.
After all these years, after straying so far from my initial goal of finding a way to measure and predict the impact of software changes, I had actually come full circle. Now the monkey would intelligently execute the application, based on recorded and trained real-world use cases, and would simply record everything he saw. Then, after a change to the code, the recording would highlight the differences, and the user could decide whether each one was an improvement or a regression. We had created a version control system for the behavior of the system under test, and thereby circumvented the oracle problem, which had kept our competitors from using AI in testing for decades. And I had finally solved the problem that had bugged me for so long.
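The core of this idea — record everything as a baseline, then report only what changed so a human can approve or reject each difference — is essentially golden-master (snapshot) testing. A minimal sketch, with all names hypothetical and the “recorded GUI state” simplified to a flat dictionary of properties:

```python
import json
from pathlib import Path

def check_against_baseline(name, observed, baseline_dir="golden"):
    """Compare an observed GUI state against a stored baseline.

    First run: record `observed` as the new baseline ("everything the
    monkey saw") and return no differences. Later runs: return every
    changed property as (key, old_value, new_value) so a human can
    decide whether each difference is a regression or an improvement.
    """
    path = Path(baseline_dir) / f"{name}.json"
    if not path.exists():
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(json.dumps(observed, indent=2, sort_keys=True))
        return []  # nothing to compare against yet
    baseline = json.loads(path.read_text())
    diffs = []
    for key in sorted(set(baseline) | set(observed)):
        old, new = baseline.get(key), observed.get(key)
        if old != new:
            diffs.append((key, old, new))
    return diffs  # empty list: behaves exactly as before
```

When the user approves a difference, the baseline file is simply overwritten with the new state — and since those files can live in an ordinary repository, you get version control for behavior almost for free.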
Then we realized that this approach solves so many of the usual problems of GUI test automation that it is really amazing. But that will be the story of another blog post.