Introducing tree-hugger: Source Code Mining for Humans

Published in

CodistAI

5 min read · May 22, 2020

The Problem

At CodistAI, we are working hard to build an AI that can understand source code and its associated documentation. As developers ourselves, we have felt the pain of writing documentation and keeping it up to date while writing code at the same time (add to that the pressure of delivery and deadlines!). Finding that documentation again when we need it is also a huge problem. And we know that we are not the only ones suffering from it.

But to build such a system we needed data, and lots of it. We needed to mine a huge amount of code spanning different languages, and guess what: data sources are not plentiful when it comes to code as data. The main sources we could find were GitHub's CodeSearchNet challenge dataset, Google BigQuery's GitHub activity dataset, Py150, and a few others like these.

Mining code files in different languages and gathering important information from them is not a trivial job. We did not want to write new parsers, so parser generator frameworks such as ANTLR or lex/yacc, great as they are, were not an option for us. What we needed was a good, high-level library that exposes a simple, Pythonic API on top of some kind of universal code parser.

In the end, the choice came down to two options: Babelfish and tree-sitter. Babelfish was the newer kid on the block and came with some nice properties, but we did not really like its uAST (Universal AST), and its API was not that easy to use either. So tree-sitter was the natural choice (also, Babelfish is no longer maintained).

We were impressed by tree-sitter's clean design, speed, language coverage, and minimal dependencies. However, we were still struggling with the low-level interface that its Python binding provides. So we started writing some code to create higher-level abstractions on top of it.

Thus tree-hugger was born.

Introducing tree-hugger

tree-hugger is a light-weight, extendable, high level, universal code parser built on top of tree-sitter.

Let’s unpack those words one by one.

light-weight: tree-hugger aims to be a simple and easy-to-use framework. It gives a developer just enough tools to quickly start mining code data, while taking care of a lot of the boilerplate. It also aims to make life easier by providing some command-line utilities. To that end, it stays very light-weight itself, and it has few dependencies.
extendable: tree-hugger is extendable by design, mainly in two ways. First, queries live in an external source: we read the queries (s-expressions) from a yml file (an example can be found here), so we do not need to write them in the code and can iterate on them very easily. Second, it has a modular structure with the common boilerplate code already supplied for you, which means you can focus on the part of the code that actually matters to you.
high-level: tree-hugger hides the little details of running a query or walking the AST, as well as the tricky part of retrieving code from the query result, behind a clean, Pythonic API, so you are free to concentrate on the problem at hand.
universal: We actually leverage the amazing tree-sitter, so by default we are (almost) language agnostic :)
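
To make the idea of externalized queries concrete, here is a minimal sketch of keeping s-expression queries outside the code and looking them up by language and name. It is only an illustration under stated assumptions: tree-hugger reads YAML, while this sketch uses JSON so that it runs with the standard library alone, and the layout, the query names, and the helpers `load_queries` and `get_query` are hypothetical, not tree-hugger's actual file format or API.

```python
import json

# Illustrative query file contents (hypothetical layout and query names).
# tree-hugger itself uses a yml file; JSON is a stand-in so the sketch
# needs nothing beyond the standard library.
QUERY_FILE_TEXT = """
{
  "python": {
    "all_function_names": "(function_definition name: (identifier) @function.name)"
  },
  "javascript": {
    "all_function_names": "(function_declaration name: (identifier) @function.name)"
  }
}
"""


def load_queries(text):
    """Parse the query file into {language: {query_name: s_expression}}."""
    return json.loads(text)


def get_query(queries, language, name):
    """Look up one query, failing with a readable message if it is missing."""
    try:
        return queries[language][name]
    except KeyError:
        raise KeyError(
            f"no query {name!r} defined for language {language!r}"
        ) from None


queries = load_queries(QUERY_FILE_TEXT)
print(get_query(queries, "python", "all_function_names"))
```

Because the queries are plain data, changing one means editing the file rather than the parser code, and the same query name can map to a slightly different s-expression per language.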

Use-case : Mine Code and Comments

Let’s say that you want to treat code as data and fit Machine Learning models on it (if you want to know more about this, you can check out one of our earlier articles here).

The task at hand is to mine a lot of code (in different languages, such as Python, PHP, C, C++, JS, C#, Java, etc.) and to generate a dataset where each sample looks like this: (f, d), where f stands for the function body and d stands for the associated comment (docstring, if you prefer).
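
For a single language, the shape of such a dataset can be sketched with nothing but the standard library. The snippet below is an illustration, not how tree-hugger works: it uses Python's built-in `ast` module instead of tree-sitter, so it only handles Python sources, and the sample source and the helper name `function_docstring_pairs` are made up for the example.

```python
import ast
import textwrap

# A tiny, made-up source file: one documented function, one undocumented.
SOURCE = textwrap.dedent('''
    def parent():
        """This is the parent function."""
        return 1

    def no_doc():
        return 2
''')


def function_docstring_pairs(source):
    """Return (function_source, docstring) pairs for documented functions."""
    tree = ast.parse(source)
    pairs = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            doc = ast.get_docstring(node)
            if doc is not None:
                # get_source_segment (Python 3.8+) returns the exact source
                # text of the node, from "def" to the end of the body.
                pairs.append((ast.get_source_segment(source, node), doc))
    return pairs


pairs = function_docstring_pairs(SOURCE)
for f, d in pairs:
    print(repr(d))
    print(f)
```

This works because Python ships a parser for itself. The difficulty described next comes from needing the same thing for PHP, C, C++, JS, C#, Java, and others at once.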

Once you start working on this, you discover that you have a problem. You need something that lets you read through files in all of those languages and extract the data you are interested in. The data you retrieve should also be in a form that you can use easily. A framework and library like tree-sitter is useful here.

But…

Once you start using it, you will notice the pain points.

You need to write queries as s-expressions, and they are embedded in your code. This makes your code very hard to read and hard to debug.
Although tree-sitter is universal, the representation of each language differs internally when you get the s-expression back. So you need to write, manage, and keep track of similar queries, written slightly differently, for each language.
Once you start getting the data, you need to do some post-processing to make it usable in your modeling scheme.

All of this takes a huge amount of time, and there is no trivial solution out there.

Enter tree-hugger.

Let’s see an example. Imagine you have a Python file with several functions defined in it, some with docstrings and some without (check out an example here). Here are a few lines of code that read the file, parse it into a parse tree, run a query, and return a dict with function names as keys and their docstrings as values. (This assumes you have installed tree-hugger and set up the environment; our documentation explains how to install tree-hugger and how to build the .so files.)

from tree_hugger.core import PythonParser
pp = PythonParser()
pp.parse_file("tests/assets/file_with_different_functions.py")  # read the file and build the parse tree
pp.get_all_function_docstrings()  # dict: function name -> docstring

And here is the result.

{'parent': '"""This is the parent function\n    \n    There are other lines in the doc string\n    This is the third line\n\n    And this is the fourth\n    """',
 'first_child': "'''\n        This is first child\n        '''",
 'second_child': '"""\n        This is second child\n        """',
 'my_decorator': '"""\n    Outer decorator function\n    """',
 'say_whee': '"""\n    Hellooooooooo\n\n    This is a function with decorators\n    """'
}

That was easy!
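
The values in this dict are raw string literals, with the quotes and indentation still attached. If your modeling scheme wants plain text, one possible post-processing step is sketched below. It is our illustration using only the standard library (`ast.literal_eval` and `inspect.cleandoc`), not a tree-hugger function, and the helper name `clean_docstring` is hypothetical. The two inputs are copied from the result above.

```python
import ast
import inspect

# Two raw values copied from the result shown above.
raw_docstrings = {
    "my_decorator": '"""\n    Outer decorator function\n    """',
    "first_child": "'''\n        This is first child\n        '''",
}


def clean_docstring(raw):
    """Turn a raw docstring literal into plain, dedented text."""
    # literal_eval evaluates the string literal, which removes the
    # surrounding quotes (single, double, or triple) safely.
    text = ast.literal_eval(raw)
    # cleandoc strips the common indentation and blank edge lines.
    return inspect.cleandoc(text)


cleaned = {name: clean_docstring(raw) for name, raw in raw_docstrings.items()}
print(cleaned)
```

Using `ast.literal_eval` rather than slicing off a fixed number of characters means the same helper handles both `"""` and `'''` docstrings.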

And imagine being able to do that with the same API for all languages, without worrying about the underlying semantic differences between them, so that you can mine files in any language at scale and with minimum effort. This is what tree-hugger is about. Our aim is to be the standard for data mining on source code, with a clean, high-level, Pythonic API that sets you free from the lower-level details and lets you focus on the more novel problems at hand.

Final Words

Today, we are releasing the first version of tree-hugger. We have tried to provide comprehensive documentation, so please go through it. If you find something missing, have a suggestion for an improvement, or spot a bug, please open a GitHub issue so that we can discuss it. If you want to contribute, you are more than welcome.

Happy coding!


Written by Shubhadeep Roychowdhury