Computing the Impossible: Static Analysis of Dynamically Typed Languages (Part 1)

Andrej Jurco
MANTA Engineering Blog
10 min read · Oct 7, 2022


Abstract

Is it possible to implement a scanner for the static analysis of a programming language without information about program variable types?

Scanners at MANTA

At MANTA, we have a solution for automatic data lineage analysis. It helps companies visualize their data pipelines, resulting in easier data governance, improved data quality, and faster data incident resolution. Because the whole process is automated, the cost is a fraction of what the same service costs when provided manually by consulting companies.

However, because every technology stores data in a different way and has a unique metadata structure, we have to produce a new scanner for every supported technology. Obviously, you cannot use the same approach for extracting and analyzing metadata from a DBMS, such as MS SQL or Oracle, as from a BI tool, such as PowerBI or Tableau.

Thanks to the well-designed MANTA platform, we can abstract the main concepts present in every technology and quickly develop new scanners according to market needs. MANTA currently supports automated data lineage analysis for over 40 main technologies in the data pipeline, including databases, data integration and modeling, and reporting and BI.

You might think that this is pretty much everything that the industry needs, but the fact is that to scan the whole customer environment, it is necessary to expand MANTA into new and sometimes unknown areas, such as the analysis of programming languages.

Because the whole topic is very complex, this article is split into two parts. In the first part, you will learn about our motivation to build data lineage scanners for programming languages and how we approach this problem. In the second part, you will learn about the Python scanner in-depth, including key challenges we faced during implementation and how a seemingly simple programming language can be not-so-easy to work with.

Why Programming Languages?

While uncommon, some technologies allow you to define behavior by writing source code instead of using a user interface. For example, you can write MS SQL procedures in C#, define Google BigQuery functions in JavaScript, or write Databricks notebooks and jobs in Python, R, or Scala.

As it turns out, there are many cases in which MANTA skips part of the data lineage analysis because it would require the analysis of source code. This may lead to certain inconsistencies in the resulting lineage graph. However, if we could allow our scanners to use programming language scanners to analyze the source code embedded in other technologies, we could offer our customers a useful and unique feature that our competitors cannot deliver.

The static analysis of programming languages is no easy job. Even though there are several solutions for static code analysis, there is no existing solution that focuses on data lineage. While parsing (lexical and syntactic analysis) is relatively easy for programming languages, performing the semantic analysis — understanding what the parsed code actually does and how it affects data — is a major problem. If you, for example, assign a value to a variable, such as my_variable = 5 + 5, it (usually) has no impact on data lineage and can be ignored. However, if you are loading data from table A and writing this data to table B, we need to be able to recognize that and visualize it in the resulting data lineage accordingly. For this reason, we are partnering with the Faculty of Mathematics and Physics of Charles University in Prague to conduct research in this area.
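To make the distinction concrete, below is a minimal sketch of the kind of code a lineage scanner must recognize; the table names and the sqlite3 database are made up for illustration. The first statement is irrelevant for lineage, while the remaining ones establish a data flow from table_a to table_b:

import sqlite3

my_variable = 5 + 5  # no effect on data lineage; can be ignored

# Read from table A and write the data into table B -- this is the
# flow that must appear in the resulting lineage graph.
conn = sqlite3.connect("example.db")
rows = conn.execute("SELECT id, name FROM table_a").fetchall()
conn.executemany("INSERT INTO table_b VALUES (?, ?)", rows)
conn.commit()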

Analyzing a Programming Language Source Code

Before we dive deep into what makes analyzing Python more difficult, it is important to understand the data lineage analysis process for programming languages.

Extract the Code

Before the analysis can start, you need to collect all the input program code in one location. Let's call this step extraction. Typically, you need two sets of input program code: the application and its libraries. Of course, we are interested in what our application does, and we do not really care about the implementation of the libraries. However, we need to know which library functions are used — if we know that a print() function is called, is it Python's native print to the console, or is it the print() function of some other library doing something completely different?

Process the Input

Once you have the input program code collected, you need to load it into memory. There are countless ways to parse and process source code — if you have Java code, you can compile it and process the compiled code, or parse the source code; in the case of Python, you can ask the interpreter to prepare an AST (abstract syntax tree) of the script and use that. The optimal solution is different for each programming language, so you must choose carefully to avoid as many potential problems as possible.
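For Python, this step is pleasantly short, because the parser ships with the interpreter in the standard ast module. A minimal sketch (the sample source is made up, and ast.dump(..., indent=...) requires Python 3.9 or newer):

import ast

source = """
my_variable = 5 + 5
print(my_variable)
"""

# The interpreter parses the script for us and returns the root of the AST.
tree = ast.parse(source)
print(ast.dump(tree, indent=2))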

Construct the Call Graph

The next step you must take before starting the analysis is computing the call graph. The call graph captures the dependencies between individual functions. If the analysis scope spans hundreds or thousands of files and functions, it is not always easy to determine which function is actually called. If we create a structure that can quickly tell us which functions can be invoked in the current context, we can save a lot of computation time. The call graph can be pre-computed at the beginning because function callers and callees do not change during the analysis.

The component constructing the call graph must be able to resolve imports, because the invoked functions very commonly live in other files. If imports are not resolved, an enormous amount of information is lost. A naive version of such a builder is sketched below.
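To illustrate the idea, here is a deliberately naive call graph builder on top of Python's ast module. It resolves only plain-name calls within a single file, so the hard parts mentioned above (imports, methods, calls through variables) are left out:

import ast
from collections import defaultdict

class CallGraphBuilder(ast.NodeVisitor):
    """Maps each function name to the names of the functions it calls."""

    def __init__(self):
        self.graph = defaultdict(set)
        self._current = "<module>"  # calls made at module level

    def visit_FunctionDef(self, node):
        previous, self._current = self._current, node.name
        self.generic_visit(node)  # visit the function body
        self._current = previous

    def visit_Call(self, node):
        if isinstance(node.func, ast.Name):  # plain calls only, e.g. f()
            self.graph[self._current].add(node.func.id)
        self.generic_visit(node)

source = """
def load():
    return fetch_rows()

def main():
    data = load()
    print(data)
"""

builder = CallGraphBuilder()
builder.visit(ast.parse(source))
print(dict(builder.graph))  # {'load': {'fetch_rows'}, 'main': {'load', 'print'}}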

Compute Aliases

Step number four is to traverse the input program code and compute aliases. If we take an example:

a = 5   # a aliases value 5
b = a   # variable b aliases value 5 as well

From the analysis of the assignment statements alone, without knowing the context, we can determine that using the variable b is the same as using the variable a. A traversal of the assignment statements in the input program code representation can help us compute aliases for different parts of the code: functions, classes, or modules. Again, aliases are context-insensitive (we do not store the exact variable value, we just remember that a is the same as b), so this data can be reused.
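As a sketch of that traversal, the following helper handles only the simplest b = a assignments between plain names; real code must also deal with attributes, tuples, and scoping:

import ast

def compute_aliases(source):
    """Group names into context-insensitive alias sets."""
    aliases = {}
    for node in ast.walk(ast.parse(source)):
        # Only track assignments of one plain name to another, e.g. b = a.
        if isinstance(node, ast.Assign) and isinstance(node.value, ast.Name):
            for target in node.targets:
                if isinstance(target, ast.Name):
                    group = aliases.setdefault(node.value.id, {node.value.id})
                    group.add(target.id)
                    aliases[target.id] = group  # b joins a's alias set
    return aliases

print(compute_aliases("a = 5\nb = a\nc = b"))
# all three names end up sharing one alias set: {'a', 'b', 'c'}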

Run the Analysis

When you have all the context-insensitive tasks done, you can start the actual analysis. Performing symbolic analysis allows you to analyze the code without actually running it. However, you will lose some precision in cases where a value is only known at runtime. A good example is control statements. Let's look at this if statement:

option = input('Choose output destination: ')
if option == 'file':
    ...  # write data into a file
else:
    ...  # write data into the database

If you were analyzing the program at runtime, you would know the value of the option variable. In static analysis, however, you can only guess. To avoid missing the correct option, you must assume that both options are possible, thus splitting the execution into two branches: one branch assumes that the option == 'file' condition was fulfilled, and the other one assumes that it was not.
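A minimal sketch of this forking is shown below; the abstract state here is just a list of statement kinds seen on a path, whereas a real analysis would track data flows instead:

import ast

def analyze(stmts, state):
    # Walk the statements; on an `if`, fork the abstract state and analyze
    # both branches, because the condition's runtime value is unknown.
    states = [state]
    for stmt in stmts:
        if isinstance(stmt, ast.If):
            forked = []
            for s in states:
                forked += analyze(stmt.body, list(s))    # condition held
                forked += analyze(stmt.orelse, list(s))  # condition failed
            states = forked
        else:
            for s in states:
                s.append(type(stmt).__name__)  # record the operation
    return states

tree = ast.parse(
    "option = input()\n"
    "if option == 'file':\n"
    "    write_file()\n"
    "else:\n"
    "    write_db()\n"
)
print(len(analyze(tree.body, [])))  # 2 -- one state per execution branch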

Typically, the execution of a program has some entry point — a function or a file that is executed (for example, the main() function in Java, or the Python module that is run by the interpreter). This is where the execution starts, and from there, all invocations, branching, and operations need to be computed.

An obvious question you might be asking now is: given that the number of options and possible execution branches is huge, can the analysis ever finish? Well, yes and no. To get a reasonably accurate output in a reasonable time, some approximations have to be used that do not reduce the correctness of the analysis, which I am going to describe later. If used correctly, the analysis finishes quickly.

The analysis runs in a loop as long as new data flows are being computed. The moment an iteration produces nothing new, you know that you have computed everything: the analysis terminates, and the results can be processed.
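This is a classic fixed-point computation. A minimal sketch, where step is a hypothetical function performing one pass of the analysis over the flows discovered so far:

def run_to_fixed_point(step):
    # Repeat while new data flows are being computed; stop the moment
    # an iteration produces nothing new.
    flows = set()
    while True:
        new = step(flows) - flows
        if not new:
            return flows
        flows |= new

# Toy step: two direct copy statements plus the transitive composition
# of everything discovered so far.
direct = {("table_a", "dataframe"), ("dataframe", "output.csv")}

def step(flows):
    return direct | {(a, d) for (a, b) in flows for (c, d) in flows if b == c}

print(run_to_fixed_point(step))
# additionally derives ('table_a', 'output.csv') before terminating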

Create the Resulting Graph

As mentioned earlier, the last step is to transform the output of the analysis into a standard graph visualization. During the analysis itself, it makes no sense to represent the data as a graph, because different contexts produce different nodes and edges that sometimes turn out to be worthless in terms of data lineage.

Instead of adding nodes and later removing them, we keep all context data in a separate data structure that is easy to work with. We only transform it into the graph at the end, adding nodes and edges only for the data flows that are interesting to us (e.g., file and database streams, console input/output, etc.).

Visualizing Data Flows

Compared to common MANTA graphs, programming language graphs are clearly much smaller. The reason is very simple — if we decided to visualize every step of the program, all the branching, and all context options, we would end up with an incomprehensible graph that would provide no value to users. The only way users could read it would be to contract the inner nodes and look at a smaller graph. This raised a question: do we even need the big graph, or do we just need to show the external points of the program?

And that is what we did: we visualize only the 'entry points' into and out of the program. For a user, it is not useful to see that a variable was initialized with the value 5 on line 427 if that variable does nothing. What is valuable in the world of data lineage is seeing that the Python program loads data from table xyz, processes it internally (without visualizing exactly how), and writes it into a CSV file - the user then knows that the source of the CSV file's data is table xyz. If the CSV file is later used to load data into a BI tool, the user will see the relation between the BI tool's data and table xyz, which would otherwise have remained unknown had we not used the Python scanner's lineage.

Below, you can find an example of what the data lineage looks like if we load data from a database and print it to the console. Note that in some cases, we do not know the column information during the analysis, since we do not run the code — in such cases, we may have to state that some columns are used without being able to enumerate them all (we only analyze the program code, not SQL queries or the content of read/written files). The green node is the output of our MS SQL scanner, and the yellow nodes are the Python scanner's output.

A visualization of a Python program data flows loading data from a database and printing different subsets of the SQL query result to the console.

A Happy Day Use Case

To help you understand the context better, let's look at an example use case for the scanner and how helpful it is when everything works well. Imagine that you have a data pipeline consisting of a few main components: a database and a transformation script that saves the transformed data into files. These files are then loaded into a BI tool (e.g., Qlik Sense or PowerBI).

Without the Python scanner, you would not know how the database and the files are related (or that there is any connection at all).

Everything works perfectly until somebody in your large company makes some changes without letting anyone know, and suddenly the visualizations in your BI tool start showing weird data or stop working completely. What would you do? If your BI tool combines tens of tables, finding the root of the problem can take several days. Fortunately, you scanned your data pipeline in MANTA before, and you have the data lineage from the time when everything worked.

A simple data lineage visualization of 4 technologies (left to right: MS SQL, Python, Filesystem, Qlik Sense)

With MANTA, all you need to do is run another scan of your pipeline and compare the graphs from before and after.

Visualization of two data lineage scans compared in MANTA. Red marks the nodes and edges that were removed; green marks the new ones.

That way, you can easily see the problem — somebody changed the database columns that you store in the files, which is why your BI tool suddenly made no sense. All you need to do is get the problematic Python script back to its original state. Because we know how the data in the files relates to the database, you were able to discover the issue in a matter of minutes instead of embarking on a very tedious manual analysis of the whole pipeline.

You have reached the end of part one. If you liked what you've read, you can continue to part two, where I discuss the challenges we faced during the development of the Python scanner.


Andrej Jurco
MANTA Engineering Blog

All about snakes, data flows and language processing. Python Data Lineage Scanner developer.