Machine Learning on Go Code

Francesc Campoy
Published in sourcedtech · Sep 25, 2018

This blog post is the written form of my recent talk at GopherCon 2018, Machine Learning on Go Code, which you can now watch directly on YouTube.

You can also follow the slides on SpeakerDeck.

Software is eating the world

I’m sure you’ve all heard the famous phrase coined by Marc Andreessen, “Software is eating the world” (Why Software Is Eating the World — Andreessen Horowitz). I had definitely heard it, but I didn’t realize how much this was the case until recently, when I found this article (Infographic: How Many Millions of Lines of Code Does It Take?), which shows the number of lines of code in popular pieces of software and their evolution over time.

I invite you to review the amazing graphics included in that article now, but just in case you’d rather continue here, the most impressive facts are that Windows NT 3.1 already had 4 to 5 million lines of code, that the latest version of Chrome has 18 million, and that a Ford pickup has 150 million lines of code. If this doesn’t send a shiver down your spine, you are definitely a brave individual.

The number of lines of code we manage keeps growing at explosive rates, while the evolution of our tools is barely incremental.

So the number of lines of code we manage keeps growing at explosive rates, while the evolution of our tools is barely incremental. This needs to change: we need better tools, not only to produce better code but to write it under better conditions. In comparison, our transportation tools have evolved greatly, to the point that our cars now understand their surroundings and are able to warn us in situations of danger, or even to take action to avoid imminent accidents.

This is why I argue that it is time to start applying Machine Learning to our code. The title of this talk is Machine Learning *on* Go code, not *in* Go code, and this is a very important distinction. I will not be discussing how we can implement ML algorithms in Go, but rather how we can use ML algorithms to improve the way we write, read, and maintain our Go codebases.

Machine Learning on Go Code

Let’s talk about how Machine Learning can change the way we write code, by focusing on the principles that can take us there, the current research fields that apply, and the use cases for this technology.

Machine Learning on Source Code (aka ML on Code) is Machine Learning that we apply on top of source code: rather than having an input consisting of images, videos, or natural text, we feed source code to our models in order to train them to predict interesting characteristics of codebases.

Extracting the meaning of programs is easy, extracting the intention of the developer is harder, and when those two differ is when bugs occur.

Machine Learning on Source Code is closely related to the following fields:

  • Data Mining techniques are useful since we are handling really large codebases; it’s easy to imagine training a model with trillions of lines of code.
  • Natural Language Processing techniques obviously apply to source code, even though natural language is a much more flexible target than valid programs in a specific programming language. Extracting the meaning of programs is easy, extracting the intention of the developer is harder, and when those two differ is when bugs occur.
  • Finally, source code is full of cross references, which we can naturally represent with graphs. This is why Graph Based Machine Learning techniques appear often in ML on Code research.

Let’s talk now about the four challenges of ML on Code.

Challenge #1: Data Retrieval

Retrieving source code seems like an easy task, but it is actually a common pain point for many researchers. This is why we decided to create a public dataset named Public Git Archive (https://pga.sourced.tech), which, based on GH Archive, downloads the full contents of every repository with 50 GitHub stars or more and makes them available in a convenient format.
The dataset contains over 4TB of source code spanning hundreds of programming languages.

Once we have all of these repositories, we need to identify what programming language each file is written in, in order to then parse it and extract the relevant pieces of information. This often needs to be done not only on the latest version of the files, but also over previous revisions, in order to understand the evolution of the source code.

For these tasks some open source tools are available (some of them by source{d}):

  • Language Classification: enry by source{d} and linguist by GitHub (see the sketch after this list)
  • File Parsing: Babelfish by source{d}, tree-sitter by GitHub, and other ad-hoc parsers
  • Token Extraction: XPath (libuast by source{d})
  • Reference Resolution: Kythe by Google
  • History Analysis: go-git by source{d}
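As a taste of what using these tools looks like, here is a minimal sketch of classifying a file with enry. The import path is the one the project used at the time of writing, and the snippet is mine rather than official documentation:

package main

import (
	"fmt"

	"gopkg.in/src-d/enry.v1"
)

func main() {
	content := []byte("package main\n\nfunc main() {}\n")

	// GetLanguage guesses the language from the file name and its contents.
	lang := enry.GetLanguage("main.go", content)
	fmt.Println(lang) // Go
}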

In an effort to make many of these tools available in an easy and unified way, and to provide a simple way to analyze source code repositories, we’re working on the source{d} engine: a simple server that provides a SQL interface to your git repositories.
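Since the engine speaks the MySQL wire protocol, querying it from Go can look roughly like the sketch below. The connection string, schema, table, and column names here are assumptions for illustration, not the engine’s documented interface:

package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/go-sql-driver/mysql" // standard MySQL driver
)

func main() {
	// The address and schema name are hypothetical.
	db, err := sql.Open("mysql", "root@tcp(localhost:3306)/gitbase")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// The table and column names are hypothetical too.
	rows, err := db.Query("SELECT repository_id FROM repositories")
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()

	for rows.Next() {
		var id string
		if err := rows.Scan(&id); err != nil {
			log.Fatal(err)
		}
		fmt.Println(id)
	}
}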

Challenge #2: Data Analysis

Once we’ve extracted the pieces of source code we care about, we need to start analyzing them in order to train our ML models. But what is source code?

Source code can be seen at, at least, four different levels of abstraction:

  • a sequence of bytes
  • a sequence of tokens
  • an Abstract Syntax Tree
  • a graph showing relationships across code snippets, such as Control Flow Graphs

Higher levels of abstraction provide more information to the model, therefore giving it a better chance to predict advanced concepts. But there’s a trade-off: for instance, an analysis by token will never predict new identifiers, since only identifiers we have already seen can be predicted, while analyzing the code as a sequence of characters can produce brand new identifiers never seen before.
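For Go, two of these views are available directly in the standard library. The sketch below scans a small program into its token sequence and then parses it into its Abstract Syntax Tree:

package main

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/scanner"
	"go/token"
)

func main() {
	src := []byte("package main\n\nfunc main() { println(42) }\n")
	fset := token.NewFileSet()

	// The token view: scan the source into a sequence of tokens.
	file := fset.AddFile("main.go", fset.Base(), len(src))
	var s scanner.Scanner
	s.Init(file, src, nil, 0)
	for {
		pos, tok, lit := s.Scan()
		if tok == token.EOF {
			break
		}
		fmt.Println(fset.Position(pos), tok, lit)
	}

	// The AST view: parse the same source into an Abstract Syntax Tree.
	f, err := parser.ParseFile(fset, "main.go", src, 0)
	if err != nil {
		panic(err)
	}
	ast.Print(fset, f)
}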

Challenge #3: Learning from source code

Once we have analyzed the source code, we need to decide how we’re going to learn from it. Depending on how we represent the source code, some models will be more appropriate than others, so let’s look at some of the most common ones.

Neural Networks

Neural Networks are statistical objects that can learn to predict specific outputs given some inputs and a really large number of examples. I like to think of them as puppies, but inside computers. Like puppies, you need to give them feedback on how they should behave until they eventually learn to do so.

The traditional example is MNIST, a dataset of images containing hand-written digits. We can give the 28x28 pixels of an image as inputs to the neural network and have ten outputs corresponding to the probabilities of the given digit being 0 to 9.

These neural networks can be used for ML on Code to predict a given character or token from the surrounding ones. I trained a neural network to predict the character that appears in the middle of a sequence, given the 10 characters to its left and the 10 to its right.

After lots of training, and achieving an accuracy close to 80%, we can use it to predict the missing character (represented with the ☐ character) in the following piece of code:

i := 0; i ☐ 10; i++ {

The resulting 5 highest probabilities are:

'<' 0.99858105
'=' 0.00053220615
' ' 0.00024154336
'd' 0.00019700211
'n' 8.995945e-05
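To make the setup concrete, here is a minimal sketch (not the original training code) of how such fixed-size windows could be extracted from source files: for every position, the 10 characters on each side form the input and the middle character is the label:

package main

import "fmt"

const window = 10 // characters of context on each side

// example pairs the 20 surrounding characters (the input) with the
// character in the middle (the label the network learns to predict).
type example struct {
	input []byte
	label byte
}

// windows slides over the source and emits one training example per
// position that has enough context on both sides.
func windows(src []byte) []example {
	var out []example
	for i := window; i < len(src)-window; i++ {
		in := make([]byte, 0, 2*window)
		in = append(in, src[i-window:i]...)
		in = append(in, src[i+1:i+1+window]...)
		out = append(out, example{input: in, label: src[i]})
	}
	return out
}

func main() {
	src := []byte("for i := 0; i < 10; i++ {")
	for _, e := range windows(src) {
		fmt.Printf("%q -> %q\n", e.input, e.label)
	}
}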

Recurrent neural networks

If you’d like to generate sequences of characters, or analyze sequences of variable length, the previous approach will fail. Instead we can use recurrent neural networks which, similarly to recursive functions, create a loop by feeding their output back as an input to the following step.
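The recurrence itself is simple; here is a toy sketch of it, where step stands in for the learned transition function and the weights are made up:

package main

import "fmt"

// step is a stand-in for the learned transition function of an RNN:
// it combines the previous hidden state with the current input and
// returns the new state, which doubles as the output here.
func step(state, input float64) float64 {
	return 0.5*state + 0.5*input // made-up weights, not learned ones
}

func main() {
	inputs := []float64{1, 0, 1, 1} // a variable-length sequence
	state := 0.0
	for _, in := range inputs {
		// The output of each step is fed back in as part of the next one.
		state = step(state, in)
		fmt.Println(state)
	}
}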

The traditional example for Recurrent Neural Networks is the charRNN, which you can train on long texts and then use to produce pieces of text that resemble the original in style. My ex-colleague Martin Gorner gave a great talk about how this could be used to generate Shakespearean text in his “Tensorflow without a PhD” talk at Google I/O.

I decided to do the same with all of the Go code in the standard library, to see whether I could generate code that looked like it could be found in there. The results are … interesting. After many tries I was able to reach around 61% accuracy which, although not great, can already power some useful tools.

Before training, the neural network generates code that is pretty far from anything resembling a Go program:

r t,
kp0t@pp kpktp 0p000 xS%%%?ttk?^@p0rk^@%ppp@ta#p^@ #pp}}%p^@?P%^@@k#%@P}}ta S?@}^@t%@% %%aNt i ^@SSt@@ pyikkp?%y ?t k L P0L t% ^@i%yy ^@p i? %L%LL tyLP?a ?L@Ly?tkk^@ @^@ykk^@i#P^@iL@??@%1tt%^@tPTta L ^@LL%% %i1::yyy^@^@t tP @?@a#Patt 1^@@ k^@k ? yt%L1^@tP%k1?k? % ^@i ^@ta1?1taktt1P?a^@^@Pkt?#^@t^@##1?## #^@t11#:^@%??t%1^@a 1?a at1P ^@^@Pt #%^@^@ ^@aaak^@#a#?P1Pa^@tt%?^@kt?#akP ?#^@i%%aa ^@1%t tt?a?% t^@k^@^@k^@ a : ^@1 P# % ^@^@#t% :% kkP ^@#?P: t^@a ?%##?kkPaP^@ #a k?t?? ai?i%PPk taP% P^@ k^@iiai#?^@# #t ?# P?P^@ i^@ttPt # 1%11 ti a^@k P^@k ^@kt %^@%y?#a a#% @? kt ^@t%k? ^@PtttkL tkLa1 ^@iaay?p1% Pta tt ik?ty k^@kpt%^@tktpkryyp^@?pP# %kt?ki? i @t^@k^@%#P} ?at}akP##Pa11%^@i% ^@?ia ia%##%tki % }i%%%}} a ay^@%yt }%t ^@tU%a% t}yi^@ ^@ @t yt%? aP @% ^@??^@%? ^@??k#% kk#%t?a: P}^@t :#^@#1t^@#: w^@P#%w:Pt t # t%aa%i@ak@@^@ka@^@a # y}^@# ^@? % tP i? ?tk ktPPt a tpprrpt? a^@ pP pt %p ? k? ^@^@ kP^@%%?tk a Pt^@# tP? P kkP1L1tP a%? t1P%PPti^@?%ytk %#%%t?@?^@ty^@iyk%1#^@@^@1#t a t@P^@^@ P@^@1P^@%%#@P:^@%^@ t 1:#P#@LtL#@L L1 %%dt??^@L ^@iBt yTk%p ^@i

But after training the neural network with all of the files in the standard library, we start to see code that looks much more familiar:

if testingValuesIntering() {
t.SetCaterCleen(time.SewsallSetrive(true)
if weq := nil {
t.Errorf("eshould: wont %v", touts anverals, prc.Strnared, error
}
t, err := ntr.Soare(cueper(err, err)
if err != nil {
t.Errorf("preveth dime il resetests:%d; want %#',", tr.test.Into
}
if err != nil {
return
}
if err == nel {
t.Errorf("LoconserrSathe foot %q::%q: %s;%want %d", d, err)
},
defarenContateFule(temt.Canses)
}
if err != nil {
return err
}
// Treters and restives of the sesconse stmpeletatareservet
// This no to the result digares wheckader. Constate bytes alleal

After more epochs we keep generating code that looks more and more like something we could find in the standard library, but unfortunately the resulting code rarely compiles.

This is one of the main differences between generating natural language and generating programs: while slightly wrong natural language can be considered artsy, slightly wrong programs are just wrong.

There are also many research papers worth reading in order to learn more about how ML on Code can learn from source code; I’d like to mention two of them.

Learning to Represent Programs with Graphs

This paper, by Miltiadis Allamanis, Marc Brockschmidt, and Mahmoud Khademi, introduces the VARMISUSE task, which consists in, given a program from which a single variable usage has been removed, predicting which variable should appear there.

For instance, can you predict what token should appear in lieu of ??? in the following code snippet?

from, err := os.Open("a.txt")
if err != nil {
	log.Fatal(err)
}
defer from.Close()
to, err := os.Create("b.txt")
if err != nil {
	log.Fatal(err)
}
defer ???.Close()
io.Copy(to, from)

This problem, while simple to explain, is very hard to solve, making it a great candidate to become a benchmark for the different algorithms applied to ML on Code.

code2vec: Learning Distributed Representations of Code

This paper, by Uri Alon, Meital Zilberstein, Omer Levy, and Eran Yahav, maps code snippets and function names into a shared embedding space, and provides a way to predict what name should be given to a function given its body.
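To give an intuition for the final step of such a model, here is a toy sketch of picking the candidate name whose embedding is closest to the embedding of a function body, by cosine similarity. The vectors and candidate names are made up for illustration:

package main

import (
	"fmt"
	"math"
)

// cosine computes the cosine similarity between two vectors.
func cosine(a, b []float64) float64 {
	var dot, na, nb float64
	for i := range a {
		dot += a[i] * b[i]
		na += a[i] * a[i]
		nb += b[i] * b[i]
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

func main() {
	// Made-up embedding of a function body and of two candidate names.
	body := []float64{0.9, 0.1, 0.3}
	names := map[string][]float64{
		"contains": {0.8, 0.2, 0.3},
		"reverse":  {0.1, 0.9, 0.5},
	}

	best, bestScore := "", -1.0
	for name, vec := range names {
		if s := cosine(body, vec); s > bestScore {
			best, bestScore = name, s
		}
	}
	fmt.Println(best) // contains
}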

More ML on Code research?

For more papers I recommend you visit our repository github.com/src-d/awesome-machine-learning-on-source-code.

Challenge #4: What can we build?

So, what can we build with all of these techniques? Is predicting tokens the best we can do? I don’t think so. If instead of predicting tokens we use the trained models to evaluate how “expected” a piece of code is, we can start to create very useful tools for our daily tasks.

Imagine you receive a very large pull request and are asked to review it. Using the models we saw before, which can predict characters or tokens, we could instead create a heat map of “expectedness”, highlighting the sections of the code that are least expected. This doesn’t mean those lines are wrong, but rather that they are interesting and should be reviewed in more detail.

Given this program below:

func sum() int {
	s := 0
	for i := 0; i > 10; i++ {
		s += i
	}
	return s
}

we would highlight the > in i > 10, since the probability for that character to be there is below 0.01%.
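The mechanics of such a heat map are simple once you have a trained model. Here is a hypothetical sketch, where modelProb stands in for the trained network and its probabilities are hard-coded for illustration:

package main

import "fmt"

// modelProb is hypothetical: it stands in for a trained model that
// returns the probability assigned to the character actually found
// at position i, given its surrounding characters.
func modelProb(src []byte, i int) float64 {
	if src[i] == '>' {
		return 0.0001 // the model finds '>' very unlikely in this context
	}
	return 0.9
}

func main() {
	src := []byte("for i := 0; i > 10; i++ {")
	const threshold = 0.001
	for i := range src {
		if modelProb(src, i) < threshold {
			fmt.Printf("position %d (%q) is unexpected\n", i, src[i])
		}
	}
}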

In a similar way, by using a model that solves the VARMISUSE problem introduced above, we could detect a mistake in the code below:

from, err := os.Open("a.txt")
if err != nil {
	log.Fatal(err)
}
defer from.Close()
to, err := os.Create("b.txt")
if err != nil {
	log.Fatal(err)
}
defer from.Close()

Can you see it? Clearly someone copy-pasted some code and missed a replacement. The second file, to, will never be closed, since the last defer statement closes from a second time instead.

Microsoft has added this kind of prediction to IntelliSense, and they successfully identified many mistakes in large C# codebases. See the original paper for more details.

Lastly, with code2vec we could identify when the name of a function is not adequate, minimizing the possibility of mistakes in code that later uses these functions based on a stated intention (their name) that does not match the actual implementation.

source{d} lookout

By integrating all of these analyzers powered by Machine Learning with existing linters based on more traditional techniques, we’ve developed source{d} lookout: a GitHub bot that reviews your PRs and helps you identify possible mistakes with higher accuracy.

This is what we call assisted code review, and it’s just the beginning of the many use cases we believe can benefit from ML on Code.
In the future we’d like to also predict bugs, enforce style guides automatically, and maybe one day we will be even able to generate code automatically from unit tests, specifications, or even natural language descriptions.

So does this mean developers will be obsolete?

No. Developers will be empowered.

In the same way architects did not disappear when CAD tools came to be, developers will simply become more efficient, and we hope this will empower them to create even better software that will, in turn, improve how the rest of society performs its own tasks.

Learn More about source{d} products
