Computing the Impossible: Static Analysis of Dynamically Typed Languages (Part 2)

Andrej Jurco
MANTA Engineering Blog
11 min read · Oct 14, 2022


In the first article, we discussed what we do at MANTA, why we decided to extend MANTA with support for programming languages, and which steps are necessary before we can get any data lineage out of an analyzed program. In this article, we follow up and discuss the concrete challenges and struggles we had to overcome to create our prototype data lineage scanner for the Python language.

Python and its Challenges

If you are a data scientist, you have probably used Python before. Together with R, it is one of the two most widely used languages for data science. Its human-speech-like syntax and ease of use, where the ‘dirty’ work is done by the interpreter in the background, make it ideal for working with data.

However, when it comes to analyzing source code written in Python, all of the language’s advantages suddenly become disadvantages. Ideally, you would like to work with what actually happens when the program runs, but all you get is code full of syntactic sugar — constructs that make it easier for users to write and read the code — hiding the actual implementation. Equally challenging is the fact that Python is not statically typed. While in Java or C#, you must declare the type of the object you are going to work with (making it clear which methods are available for the variable and which are not), in Python, the type of the object does not matter as long as it has the functions you want to use: if you want to throw() the object, it does not matter whether it is a ball or a plate, as long as you can throw it.

From now on, we are going to assume that you understand the general idea of how we analyze programming languages here at MANTA, and we will refer to the individual stages throughout the description of each issue and its solution. We will focus on the object typing and syntactic sugar issues of Python, since these two are unique when compared to our two more mature scanners for Java and C#.

Struggling with Syntactic Sugar

In Python, a lot of behavior is hidden in the background using the so-called dunder (double underscore) or magic methods. These methods, as the name suggests, are wrapped in double underscores: __add__(), __contains__() or __init__(). Thanks to them, all behavior can be overridden, so that the object you want to work with behaves exactly as you want. You can define the behavior of unary and binary operations (+, -, *=, >>), comparisons (<, >, ==), constructors (__init__()), or string representation (__str__()). The Python interpreter processes and runs your code, swapping all syntactic sugar constructs, which are easy to read, for their dunder method equivalents - all in the background without you knowing it.
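
To see how overriding works in practice, here is a small illustrative class (our own example, not from the article) that overrides two dunder methods so that + and print() do what we want:

# Illustrative only: overriding dunder methods in a user-defined class
class Money:
    def __init__(self, amount):
        self.amount = amount

    def __add__(self, other):   # invoked by the + operator
        return Money(self.amount + other.amount)

    def __str__(self):          # invoked by str() and print()
        return str(self.amount) + ' CZK'

print(Money(100) + Money(50))   # prints: 150 CZK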

To better understand what syntactic sugar is and how it complicates things, let’s look at an example showing how a command is written in a user-friendly way and what its dunder equivalent looks like. If we look at the string concatenation in Python 'this is ' + 'a string', it is pretty clear that you want to create a single string 'this is a string'. However, what actually happens in the Python interpreter is that the dunder (double underscore) function __add__() is invoked on the first string with the argument 'a string' to be concatenated: 'this is '.__add__('a string'). Both versions can be run in Python, but if you chained operations (e.g. 5 + 6 + 7 + 8) and wanted to rewrite them using dunder methods, you would have to use three __add__() methods. With the nice syntax, all it takes is seven characters.
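
For completeness, this is what that chained addition looks like when spelled out with dunder calls (our own illustration):

(5).__add__(6).__add__(7).__add__(8)   # three __add__() calls, evaluates to 26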

Dunder methods are, however, used in many more scenarios than the ones described above. In the background, they are used for loops (__iter__() and __next__()), item and attribute access (__setitem__(), __getitem__(), __getattr__()), or invocation (__call__()).

Now that you understand that we do not see what actually happens, we can proceed to the actual problem — how do we get the actual representation for our analysis? There are three main options:

1. Python ast library — using the ast library from the standard library, get and work with the AST (Abstract Syntax Tree)

2. Python bytecode — using Python’s disassembler to get the bytecode and work with that

3. Parse and process code ourselves — using a parser to parse the input program code and implement all transformations by ourselves

The first two options have a clear advantage in the fact that they are guaranteed to work, since the representation is created by Python’s standard library. Additionally, they require less implementation time since these libraries are ready to use. However, there are some issues as well — the Python interpreter would have to be present on the (customer’s) machine running the scanner (MANTA is written in Java and therefore does not require Python to be installed), backward version compatibility is not guaranteed, and bytecode instructions and AST objects can change between versions.
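
To illustrate the first option, here is what the standard ast module gives you for a single statement (our own illustration; the printed output is abridged, and the exact node classes differ between Python versions):

import ast

tree = ast.parse("x = 'this is ' + 'a string'")
print(ast.dump(tree))
# Module(body=[Assign(targets=[Name(id='x', ...)],
#                     value=BinOp(left=Constant(value='this is '), op=Add(),
#                                 right=Constant(value='a string')))], ...)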

In the end, we decided to go with parsing the input program code on our own. At MANTA, we have a lot of experience with parsing, and we were sure that it is quite possible to get the code into the representation we want in a reasonable time. ANTLR offers an official Python parser grammar that helped us a lot. With that and small improvements to match syntactic constructs introduced in the latest Python 3 versions, all we needed was to design the structures we would use to represent the input program code and write the transformer from the parse tree to our representation, removing all unwanted constructs. In the end, this took us only a few weeks, which is only a fraction of the time it took us to create the first working prototype.

Having full control over the representation proved to be very helpful countless times during the scanner implementation, saving us a lot of work. Whenever there is a construct that looks nothing like what actually happens, we simply transform it into its true form once, right after parsing, avoiding repeated detection and transformation of the problematic construct every time it is encountered during the symbolic analysis. To give you an idea of how many special cases there are in Python thanks to syntactic sugar: currently, our transformer handles around 80 different constructs commonly found in input program code. In the example below, you can see the original code and its equivalent transformed code. Note that whenever we create an artificial object (a variable or a function), we include the number sign # in its name - this is not allowed in Python, so we avoid naming collisions with actual variables or functions.

Original code:

def foo(value):
    return 'List length is 3: ' + str(value)

my_list = [i for i in range(2)]
my_list[1] = 'abc'
print(foo(len(my_list) == 3 if my_list is not None else (lambda: "False")()))

Our optimized version:

def foo(value):
    return 'List length is 3: '.__add__(str(value))

substitute_variable#1 = __builtins__.list()
for i in range(2).__iter__().__next__():
    substitute_variable#1.append(i)
my_list = substitute_variable#1
my_list.__setitem__(1, 'abc')

def lambda#1():
    return 'False'

if (my_list is None).__neg__():
    substitute_variable#2 = len(my_list).__eq__(3)
else:
    substitute_variable#2 = lambda#1()
print(foo(substitute_variable#2))

As you can see, the optimized code is less readable by humans, but in terms of machine processing and symbolic analysis, this is the preferred representation.

Once we had the code representation figured out, we were able to focus on the main challenge in Python: the language being dynamically typed.

Dealing with Dynamic Typing

As mentioned earlier, one of the main differences when you compare Python to Java or C# is that Python is not a statically typed language. This means that the methods an object defines matter more than its actual type. If you want to fly() in Python, it does not matter if it is a bird or a plane (or Superman) - it only needs to be able to fly. Another common name for this typing approach is duck typing: “If it walks like a duck, and it quacks like a duck, then it must be a duck.”

This, again, sounds great for users. However, if you are trying to perform semantic analysis, you have a problem. How do you determine which swim() function is being invoked? Is it the dolphin or the salmon that wants to swim? In the context of data lineage, this can be a difference as big as writing to a file versus writing to a database.
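
A small illustration of why this is hard (our own example, not from the article): both classes below define swim(), and nothing at the call site says which one is meant.

class Dolphin:
    def swim(self):
        return 'writes to a file'

class Salmon:
    def swim(self):
        return 'writes to a database'

def process(animal):
    # Statically, `animal` could be a Dolphin, a Salmon, or anything else
    # that happens to define swim() - the analysis must keep every
    # candidate it cannot rule out.
    return animal.swim()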

When we are tracking data flows across several modules, classes, and functions, we have a problem. How do we resolve the correct invocation target (a fancy name for the [potential] function being invoked) when you call __str__() on a variable? This method, defining the string representation of a class instance, is present in the majority of classes, yielding thousands of possible targets. The answer is probably disappointing: we cannot determine the correct target with 100% accuracy. What we can do is use certain predicates to exclude some options. However, in symbolic analysis, we must never exclude an option unless we are 100% sure it is incorrect - opting for over-approximation rather than under-approximation. So, which predicates can we use to increase the accuracy of our analysis?

First of all, we must understand how classes and functions in Python become available to a program. Python, similarly to other object-oriented programming languages, provides a well-designed import mechanism. If you want to use an object foo from module B, you must import it first into your Python script. Without that, nothing from module B can be used. Therefore, we only need to look for the invocation target in the script being executed or the scripts being imported. Sounds easy, right? Well, it’s not.

What we also need to do is consider several other aspects of imports — you can import an object (a class or a function) directly, or you can import a whole module. In the case of a module import, you must qualify the function name with the module name. You also shouldn’t forget about transitive imports (if module A imports B and B imports C, then objects in C are available in A when the correct qualified names are used). A trickier case is a conditional import: if a condition is fulfilled, an import is performed, and if it is not, the import is skipped or some other module is imported instead. What should we do? In this situation, we have no other option than to consider both outcomes and import both modules. If they share an import alias, then the alias must know that it can refer to more than one imported module. We could spend a whole day talking about imports and resolving them for the needs of our symbolic analysis, but if you understand the general idea of using imports to improve accuracy, we can move on.
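
A typical conditional import looks like this (our own illustration); the analysis has to assume that either branch may have happened, so both modules become candidates under the alias json:

try:
    import simplejson as json   # optional third-party library
except ImportError:
    import json                 # fall back to the standard library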

The next criterion for narrowing down the invocation target is its type. You might wonder what kind of types we are talking about. In general, you can distinguish five different function types:

· Constructors: class-nested methods named __init__

· Instance methods: class-nested methods whose first parameter is self; the invocation must provide a class instance, which the method works with

· Class methods: class-nested methods with a cls parameter, decorated with the @classmethod decorator, working with the class object (not an instance)

· Static methods: class-nested utility methods with no self or cls parameter, or decorated with @staticmethod, used for helper computations; they require neither an instance nor the class object

· Ordinary functions: functions not nested in a class
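
To make these categories concrete, here is one small illustrative class (names are our own) containing all five kinds:

class Reader:
    def __init__(self, path):          # constructor
        self.path = path

    def read(self):                    # instance method: first parameter is self
        return open(self.path).read()

    @classmethod
    def from_config(cls, config):      # class method: works with the class object
        return cls(config['path'])

    @staticmethod
    def normalize(path):               # static method: needs neither instance nor class
        return path.strip()

def read_file(path):                   # ordinary function: not nested in a class
    return Reader(path).read()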

If you are resolving the invocation target and can figure out the type of the function you are looking for, you can usually filter out at least a portion of the possible options.

Last but not least, you can look at the function parameters and try to match them with your analyzed invocation. If the call passes no arguments and the possible target requires at least one, you can be sure that this is not your correct invocation target. Parameter matching, like import resolution, seems simple — until you get to analyzing keyword arguments, asterisk and slash parameters, and their various combinations. In extreme cases, a function definition can be as weird as:

def f(pos1, pos2, /, pos_or_kwd, *, kwd1, kwd2, **kwargs)

or as long as this (actual code from the PySpark library):

def csv(self, path, schema=None, sep=None, encoding=None, quote=None, escape=None, comment=None, header=None, inferSchema=None, ignoreLeadingWhiteSpace=None, ignoreTrailingWhiteSpace=None, nullValue=None, nanValue=None, positiveInf=None, negativeInf=None, dateFormat=None, timestampFormat=None, maxColumns=None, maxCharsPerColumn=None, maxMalformedLogPerPartition=None, mode=None, columnNameOfCorruptRecord=None, multiLine=None, charToEscapeQuoteEscaping=None, samplingRatio=None, enforceSchema=None, emptyValue=None, locale=None, lineSep=None, pathGlobFilter=None, recursiveFileLookup=None, modifiedBefore=None, modifiedAfter=None, unescapedQuoteHandling=None)
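
For readers unfamiliar with the markers in the first signature: parameters before / are positional-only and parameters after * are keyword-only, while **kwargs collects any remaining keyword arguments. A few example calls (our own illustration):

def f(pos1, pos2, /, pos_or_kwd, *, kwd1, kwd2, **kwargs):
    return (pos1, pos2, pos_or_kwd, kwd1, kwd2, kwargs)

f(1, 2, 3, kwd1=4, kwd2=5)              # valid: pos_or_kwd passed positionally
f(1, 2, pos_or_kwd=3, kwd1=4, kwd2=5)   # valid: pos_or_kwd passed as a keyword
# f(pos1=1, pos2=2, pos_or_kwd=3, kwd1=4, kwd2=5)   # TypeError: pos1 and pos2 are positional-only
# f(1, 2, 3, 4, 5)                                  # TypeError: kwd1 and kwd2 are keyword-only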

In small and simple cases (the vast majority), the complexity of parameter matching is pretty low. However, when variable parameter counts (*args or **kwargs) are used, the complexity grows rapidly, and at some point you simply have to declare the parameter matching successful without even trying - a trade-off between computation time and accuracy.

To conclude, when we use the three above-mentioned approaches to guess the invocation target, we can get very close to the accurate state without knowing or tracking the actual runtime types of objects. There are, of course, some cases where we are unable to say anything (omnipresent dunder methods and/or methods with foo(*args, **kwargs) parameters, which accept basically everything), but in general, this is the best you can do in a reasonable time.

Speeding Up

If, after reading about these algorithms and approaches to increase the accuracy of the symbolic analysis, you think it must take ages to compute the result even for smaller inputs, you are not wrong. The first runs of the analysis on very small input applications (around 30 lines of code) took around 16 hours and used up a lot of memory. What we figured out after some analysis was that we spent a monstrous amount of time computing invocations of built-in functions with different contexts, such as __eq__ and other dunder methods that are defined for all standard data types (strings, floats, integers, etc.). This is especially wasteful since equality checking and similar operations do not modify data flows. Another thing we learned was that analyzing invocations of standard library (and any other library) functions takes ages and rarely provides anything useful.

To make the whole execution faster, we use manually implemented propagation modes. For library functions, we never simulate the method execution. Instead, two scenarios can happen: the function either has, or does not have, a defined way to propagate input data flows into the output. If it does, we use the function’s propagation mode(s) to make data-flow-relevant changes in the context. If it does not have its propagation defined, we claim that what comes in, comes out. This is the most accurate approximation we can make — unless we know what to do with the function, we assume that nothing happens inside it.
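
A minimal sketch of the idea (our own, not MANTA’s actual code, with hypothetical rule names): a rule table describes how data flows from the arguments of a library call into its result, and calls without a rule default to "what comes in, comes out".

# Purely illustrative sketch of propagation rules for library calls
PROPAGATION_RULES = {
    'str.upper': lambda flows: flows[0],             # result carries the receiver's flows
    'str.join':  lambda flows: flows[0] + flows[1],  # result merges the flows of both operands
    'len':       lambda flows: [],                   # result carries no data-flow-relevant content
}

def propagate(function_name, argument_flows):
    rule = PROPAGATION_RULES.get(function_name)
    if rule is not None:
        return rule(argument_flows)
    # no propagation defined: assume "what comes in, comes out"
    return [flow for flows in argument_flows for flow in flows]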

Of course, some data flows can be missed here if we do not support this or that function, but the fact is that most customers use the same libraries and the same functionality (also note that the majority of functions in any library have no impact on data flows). Moreover, because the Python interpreter’s reference implementation is written in C, most I/O operations, such as file reads or writes, are in fact implemented in C — the interpreter actually runs the C code, and the Python code is just a nice, readable set of instructions. Because of this, we would never see what is going on during a file read or write anyway, since we would need to analyze the C code to do that.

When we managed to implement ‘selective’ code analysis — execution simulation for the analyzed application and custom handling for all library invocations — we were able to cut the time needed to analyze small programs to 2–3 minutes. This is a huge improvement in the usability of the scanner, costing us virtually nothing in terms of accuracy, since we only skip the endless simulation of data-flow-irrelevant functions.

Conclusion

As you can see, analyzing Python source code is no easy task and requires solving countless smaller issues before you get even minimal results. In our case, we were lucky that we already had scanners for Java and C# and a mature platform when we started the work on the Python scanner. This allowed us to reuse many concepts and solutions to problems we had come across before at MANTA. There is still a lot of work and research left to be done in order to deliver a fully mature product to our customers, though.

However, in the future, we cannot expect that the complexity of systems and data solutions will decrease. This means that the need for analytical tools, such as the whole MANTA platform, is going to grow. Therefore, it makes sense to invest effort into new technologies and solutions, such as programming language data flow analyzers.
