Is fuzzing Python code worth it? Yes!
Unit testing and code review are the most common ways of vetting code, but they’re not perfect. The fundamental weakness of these methods is that they rely solely on the developers’ ability to identify and predict problems in the code. In other words, they’re limited by the failure modes the auditor is able to imagine.
One technique that can address this problem is fuzz testing, also known as “fuzzing.” It’s a simple idea: you feed semirandom data to a program in an automated fashion and monitor whether the program handles each input correctly.
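To make the idea concrete, here is a deliberately naive version of that loop in Python: random bytes are thrown at a hypothetical parse_record() function, and anything other than a clean rejection is treated as a finding.

```python
import random
import traceback

def parse_record(data: bytes) -> dict:
    # Hypothetical function under test; stands in for whatever code
    # you want to exercise.
    ...

def naive_fuzz(iterations: int = 10_000) -> None:
    for _ in range(iterations):
        # Build a semirandom input: random length, random byte values.
        data = bytes(random.randrange(256) for _ in range(random.randrange(64)))
        try:
            parse_record(data)
        except ValueError:
            pass  # Rejecting bad input cleanly is fine.
        except Exception:
            # Any other exception is a potential bug worth recording.
            print(f"crash on input {data!r}")
            traceback.print_exc()
```

Real fuzzers are much smarter than this, mutating known-good inputs and using coverage feedback to decide what to try next, but the core loop is the same.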
Fuzzing is considered one of the most cost-effective bug-hunting methods for memory-unsafe languages like C and C++, and it’s used to test many major C/C++ software projects as a supplement to code review and unit tests. Some of the main reasons for its adoption are its ease of use and its track record of finding weird bugs that manual auditors had missed, like some of the more than 16,000 bugs that Google has found through OSS-Fuzz.
There hasn’t been much research on applying fuzzing to languages like Python, which is what my team at Cognite uses. The most similar public projects right now are pythonfuzz from fuzzit.dev and python-afl from Jakub Wilk.
One of the issues with Python is that it’s a dynamic language; you have to run the code to find simple bugs such as import errors or undefined variables. Writing tests to cover every single scenario and code path that depends on runtime state would take forever. With this challenge in mind, we decided to try fuzzing our Python code to increase code quality and reduce the time spent debugging.
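For example, a function like this one imports cleanly and works for most inputs, yet the rarely taken branch contains a misspelled variable that only blows up at runtime (a contrived illustration):

```python
def describe(values: list) -> str:
    if not values:
        # Typo: "vaules" is undefined. The module imports fine, and nothing
        # complains until this branch actually runs with an empty list.
        return f"no entries in {vaules}"
    return f"{len(values)} entries"
```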
Our approach to fuzzing
We wrote a library that automatically parses Python type hints and finds the correct argument types for each function. It also finds all the functions and classes in each module it’s going to test. Purely random values tend to give poor results, so we allowed our solution to be seeded with potential real arguments for each function. Altogether, that means our fuzzer is able to generate fairly realistic inputs. We then use American Fuzzy Lop (AFL), a state-of-the-art fuzzer that uses code coverage to guide its mutations.
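The library itself isn’t shown in this post, but the core mechanism can be sketched with the standard typing module: read a function’s type hints and build arguments of the right type, mixing purely random values with seeded examples. The names and seed values below are illustrative, not our actual API.

```python
import random
import typing

# Seed values registered per type; realistic examples make the generated
# inputs far more useful than pure noise.
SEEDS = {
    str: ["2021-01-01", "pump-42"],
    int: [0, -1, 2**31],
}

def random_value(annotation):
    """Produce a value matching a simple type annotation."""
    if annotation in SEEDS and random.random() < 0.5:
        return random.choice(SEEDS[annotation])
    if annotation is int:
        return random.randrange(-10**6, 10**6)
    if annotation is str:
        return "".join(chr(random.randrange(32, 127)) for _ in range(random.randrange(20)))
    if annotation is bool:
        return random.random() < 0.5
    raise NotImplementedError(f"no generator for {annotation}")

def fuzz_once(func):
    """Call func once with arguments generated from its type hints."""
    hints = typing.get_type_hints(func)
    hints.pop("return", None)
    func(**{name: random_value(ann) for name, ann in hints.items()})
```

In our setup, the randomness ultimately comes from AFL’s mutated input stream rather than Python’s random module, which is what lets coverage feedback steer the generation.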
Adding support for a new library with our implementation is easy. If your library uses type hints, all you need to do is add example values to seed the fuzzer.
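The exact registration code isn’t shown here, so below is just an illustration of the spirit of that seeding: a mapping from the Tag constructor’s parameters to a handful of realistic example values (the field names are invented).

```python
# Illustrative seed values for the Tag constructor; the fuzzer mixes and
# mutates these instead of starting from pure noise.
TAG_EXAMPLE_VALUES = {
    "name": ["temperature", "pressure", ""],
    "value": ["23.4", "-1", "NaN"],
}
```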
Above we’ve defined some example values for the Tag class constructor. With a total of 30 lines added, we achieved good support for one of the libraries we were actively developing at the time.
Fuzzing the code revealed all kinds of bugs: out-of-bounds access, AttributeErrors from attribute access on None objects, calling max on empty lists, bad error handling leading to weird crashes, and bad refactoring leaving calls to functions that no longer existed. And the most surprising finding? This code had already passed code review.
What we discovered
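Here is a sketch of the pattern; the function names, regex, and allow list are invented for illustration.

```python
import re

ALLOWED_GROUPS = ["1", "2", "3"]

def function_with_problem(line: str):
    # re.match() returns None when the line doesn't match, so calling
    # .groups() on the result is not always safe.
    return re.match(r"(\d+)\.(\d+)\.(\d+)", line).groups()

def start_function(line: str):
    try:
        match_groups = function_with_problem(line)
        for group in match_groups:
            if group not in ALLOWED_GROUPS:
                raise ValueError(f"{group} is not in the allowed list")
    except Exception as exc:
        # Meant only to catch the ValueError above, but it also swallows
        # the AttributeError raised inside function_with_problem().
        print(f"bad input: {exc}")
    # If that AttributeError was swallowed, match_groups was never assigned,
    # so this return raises UnboundLocalError.
    return match_groups
```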
Above is a made-up, simplified example showing the kinds of bugs we encountered. There are two bugs here:
- First, function_with_problem() may raise an exception. If there’s no match, re.match() returns None, and calling .groups() on it raises an AttributeError complaining that a NoneType object has no attribute called groups.
- Second, start_function() assumes that function_with_problem() will always succeed. The try/except block is there only to catch the case where one of the matched groups is not in the pre-allowed list, so exceptions from function_with_problem() aren’t really dealt with; they’re just swallowed. This misunderstanding lets start_function() try to return match_groups before it has ever been assigned, which surfaces as an UnboundLocalError.
One common bug we encountered is an object that can be None where the developer assumes it can’t be. This is often caused by a lack of understanding of what certain APIs may return, sometimes due to bad documentation and sometimes due to not reading the documentation at all. It also shows up in code where a lot of exceptions are caught and raised, which creates situations where variables are either never assigned, or are assigned but in a different state than the developer expected.
While developing the fuzzer and triaging the bugs it found, we also reviewed the code by hand. One bug we found that had previously made it through code review and was caught by neither unit tests nor our fuzzer was this:
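The pattern, reconstructed with simplified and illustrative fields, was roughly this:

```python
class Tag:
    def __init__(self, name: str, value: str):
        self.name = name
        self.value = value

    def __eq__(self, other):
        if isinstance(other, Tag):
            return self.name == other.name and self.value == other.value
        return False

    def __ne__(self, other):
        # Bug: the branches are effectively swapped. Two Tags with different
        # fields are reported as "not unequal", while a non-Tag object gets
        # compared field by field.
        if isinstance(other, Tag):
            return False
        return self.name != other.name or self.value != other.value
```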
If you look at the not-equal implementation, you’ll notice a logic bug. You should not return False every time other is an instance of Tag, since two Tags with different fields are still unequal. Rather, the field comparison belongs in that branch, and __ne__() should return True whenever the other object is not a Tag.
Bugs like these can be hard to spot, so they may survive even a mature review process. Fuzzing won’t detect this one either: nothing crashes, the code simply returns the wrong answer.
In this case, one should flag any hand-written __ne__() as extra interesting, because it’s no longer required in Python 3: if you don’t define it, Python derives it by negating __eq__(). Look for a future blog post in which we’ll cover some other tips and techniques for when you’re auditing Python code.
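Concretely, the safer version of the illustrative class above simply drops __ne__() and keeps a single __eq__():

```python
class Tag:
    def __init__(self, name: str, value: str):
        self.name = name
        self.value = value

    def __eq__(self, other):
        if isinstance(other, Tag):
            return self.name == other.name and self.value == other.value
        # NotImplemented lets Python fall back to its default comparison
        # for non-Tag objects.
        return NotImplemented

    # No __ne__ needed: Python 3 derives it by negating __eq__.
```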
Final thoughts: Fuzzing is not a silver bullet
One problem we noticed when we deployed this fuzzer was that some of our code was not written with fuzzing in mind. A big portion of our codebases relies on communicating with internal and external APIs. Fuzzing that code directly without mocking adds a lot of overhead, and in many cases it’s impractical to mock such interfaces. One could decide to fuzz only the parts of the codebase without API dependencies, but that still leaves a lot of complicated behavior untested. It’s therefore important that all developers agree on separating logic from networking, so that most of the logic can be fuzzed. (In a follow-up blog post we’ll explore how we automatically test our API endpoints, both for differences between our API specification and the implementation, and for ways to trigger server errors.)
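As a sketch of what that separation can look like, keep the network call thin and put the parsing and decision logic in a pure function that the fuzzer can hit directly (the endpoint and field names here are invented):

```python
import json
from urllib.request import urlopen

def fetch_assets(limit: int) -> str:
    # Thin networking layer: hard to fuzz, easy to mock or skip entirely.
    with urlopen(f"https://api.example.com/assets?limit={limit}") as response:
        return response.read().decode()

def parse_assets(payload: str) -> dict:
    # Pure logic with no I/O: the fuzzer can feed it arbitrary strings.
    data = json.loads(payload)
    items = data.get("items") or []
    return {item["id"]: item.get("name", "") for item in items}
```

The fuzzer only ever needs to see parse_assets(); the HTTP call can be mocked out or ignored.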
The bugs described above often arise during large rewrites, when test coverage tends to drop. With that sheer volume of changes, it can become unmanageable for a reviewer to understand everything that’s going on.
Fuzzing can help — as an addition to code review and unit tests. It allows us to move faster, because we’re able to trust our code more. Fuzzing does have its limitations in terms of what bugs it can find and where it can be used effectively, and it should never be used as a replacement for code review and unit tests. As a safety net, however, fuzzing is proving to be helpful, and internally we’ve found that fuzzing improves our overall confidence in our product. It’s something we want to continue exploring.