DeepCode AI: Symbolic AI versus Machine Learning

Frank Fischer
Published in DeepCodeAI · 4 min read · Dec 4, 2019

Hey,

You might have noticed: we proudly carry the “AI powered” logo in our subtitle. In developer communities, this is a double-edged sword these days, because the term feels a bit overhyped (blockchain, anyone?). So let me explain what we mean by it, and how our approach actually differs from pure machine learning.

Picture Gerd Altmann / Freiburg / Germany / pixabay.com

Machine Learning vs Symbolic AI

I guess we first need to agree on what intelligence is. Let us stick to the definition given by DeepMind cofounder Shane Legg and AI scientist Marcus Hutter: “Intelligence measures an agent’s ability to achieve goals in a wide range of environments.” Back in the day (and I mean from the 1950s onwards), the main idea behind Artificial Intelligence was to build a knowledge base of facts and rules about the world. Based on these facts and rules, a machine could reason logically, discover more facts, and gain knowledge. This is called symbolic AI: we model the world using symbols and rules. We all see where the limits and problems are. How many facts are enough? Where do all these facts come from? And the rules? The world is, say, mostly logical at best, so how do we cope with illogical things? And so on.
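To make the “facts plus rules” idea concrete, here is a minimal sketch of forward chaining, one classic way a symbolic system derives new facts from known ones. The fact strings, the rule format, and the function name `forward_chain` are made up for illustration and are not taken from any particular system.

```python
def forward_chain(facts, rules):
    """Apply rules repeatedly until no new fact can be derived.

    Each rule is a pair (premises, conclusion): if every premise is a
    known fact, the conclusion becomes a known fact as well.
    """
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if conclusion not in known and all(p in known for p in premises):
                known.add(conclusion)
                changed = True
    return known


# Illustrative knowledge base (assumed, not from a real system).
facts = {"socrates is a man"}
rules = [
    (("socrates is a man",), "socrates is mortal"),
    (("socrates is mortal",), "socrates will die"),
]

derived = forward_chain(facts, rules)
```

The sketch also shows the weak spot mentioned above: the machine only ever knows what follows from the facts and rules somebody gave it.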

Fast forward to today. When we speak about AI, we are often presented with the latest advances in machine learning, typically in the form of convolutional neural networks (CNNs). The machines learn from massive amounts of data to find patterns and react to them. These patterns can sit on a higher abstraction layer, but what is missing is the underlying semantics, the meaning of things. So, while we can find cars in videos, how these cars interact with each other is of no interest to a CNN. This is where symbolic AI comes in.

Symbolic AI has recently seen a renaissance. Google uses its Knowledge Graph to answer roughly one third of all searches (see https://developers.google.com/knowledge-graph). According to Google, it has harvested 70+ billion facts from sources like the CIA World Factbook. When a search query comes in, Google traverses the graph to find facts and relationships and displays them in a box.
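As a rough illustration of what “traversing the graph” can mean, here is a toy knowledge graph stored as subject/predicate/object triples. This is a sketch under my own assumptions, not Google's implementation or API: the triples, the predicate names, and the helpers `build_graph` and `follow` are invented for this example.

```python
from collections import defaultdict


def build_graph(triples):
    """Index (subject, predicate, object) triples by subject and predicate."""
    graph = defaultdict(lambda: defaultdict(list))
    for subject, predicate, obj in triples:
        graph[subject][predicate].append(obj)
    return graph


def follow(graph, start, path):
    """Follow a chain of predicates from a start entity; return what is reached."""
    frontier = [start]
    for predicate in path:
        next_frontier = []
        for node in frontier:
            next_frontier.extend(graph.get(node, {}).get(predicate, []))
        frontier = next_frontier
    return frontier


# Illustrative facts (assumed schema).
TRIPLES = [
    ("Paris", "capital_of", "France"),
    ("France", "located_in", "Europe"),
    ("Berlin", "capital_of", "Germany"),
    ("Germany", "located_in", "Europe"),
]

graph = build_graph(TRIPLES)
continent = follow(graph, "Paris", ["capital_of", "located_in"])  # ['Europe']
```

The answer comes from following explicit relationships, so the system can also show which facts led to it.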

DeepCode’s AI

DeepCode uses a symbolic AI mechanism fed with facts obtained via machine learning. We have a knowledge base of programming facts and rules that we match against the analyzed source code. The rules are generated by observing the differences between versions of open source repositories. So, whenever an open source project that we observe makes a code change, our system tries to understand what happened and why, and may come up with a new rule. This candidate rule is then applied to all projects we observe to see whether it generalizes and can be added to our knowledge base.
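The following is a deliberately simplified sketch of that idea, not DeepCode's actual algorithm. It assumes a “rule” is just a pair of calls nested directly inside each other, and that a pattern which disappears in a fix is a candidate for being suspicious. The function names (`nesting_pairs`, `candidate_rules`, `violations`) and the sample snippets are hypothetical; only Python's standard `ast` module is used.

```python
import ast


def call_name(node):
    """Return the simple name of a call target, or None if it has none."""
    func = node.func
    if isinstance(func, ast.Name):
        return func.id
    if isinstance(func, ast.Attribute):
        return func.attr
    return None


def nesting_pairs(source):
    """Collect (outer, inner) pairs where a call is a direct argument of another call."""
    pairs = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            outer = call_name(node)
            for arg in node.args:
                if isinstance(arg, ast.Call):
                    inner = call_name(arg)
                    if outer and inner:
                        pairs.add((outer, inner))
    return pairs


def candidate_rules(before, after):
    """Patterns present before a change but gone afterwards are rule candidates."""
    return nesting_pairs(before) - nesting_pairs(after)


def violations(rules, source):
    """Return the candidate rules that match another piece of code."""
    return rules & nesting_pairs(source)


# A hypothetical fix commit: the raw input gets wrapped in escape().
before = "run_query(read_input())"
after = "run_query(escape(read_input()))"
rules = candidate_rules(before, after)

# Checking the candidate against other (hypothetical) code.
other_project = "rows = run_query(read_input())\nprint(rows)"
hits = violations(rules, other_project)
```

A real system would need far more context to decide why a change happened, but the loop is the same: observe a change, propose a rule, then test it against many other projects.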

Symbolic AI makes sense in this context because source code is a researcher’s dream: it follows a very strict grammar, and you should be able to reason about its effects. It also allows us to give detailed explanations of why something was reported. On the other hand, there are limits, because code obviously needs to react to external input, which most of the time cannot be predicted.

DeepCode works across a wide range of environments by transforming the source code into a tree form that abstracts away the details of the specific language. There are language-specific rules, as some languages have challenges that others do not (say, typing), but the method is applicable to all Turing-complete languages. The upside is that once the system has learned that data from a source of unsafe input needs to be sanitized before being used somewhere, it can apply this knowledge regardless of the language used.
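Here is a minimal sketch of what such a language-independent rule could look like. It assumes that code from any language has already been turned into a generic tree of calls. The `Call` node type, the function names in `SOURCES`, `SANITIZERS`, and `SINKS`, and the check itself are assumptions for illustration, not DeepCode's internal representation.

```python
from dataclasses import dataclass, field


@dataclass
class Call:
    """A generic call node; args hold nested Call nodes or plain literals."""
    name: str
    args: list = field(default_factory=list)


# Hypothetical knowledge base entries.
SOURCES = {"read_request_param"}   # return untrusted input
SANITIZERS = {"escape_sql"}        # make their argument safe
SINKS = {"db_execute"}             # must not receive untrusted data


def is_tainted(node):
    """True if the value of this node may carry unsanitized input."""
    if not isinstance(node, Call):
        return False  # literals are considered safe
    if node.name in SANITIZERS:
        return False
    if node.name in SOURCES:
        return True
    return any(is_tainted(arg) for arg in node.args)


def find_unsafe_sinks(node, found=None):
    """Return the names of all sinks that receive tainted data."""
    found = [] if found is None else found
    if isinstance(node, Call):
        if node.name in SINKS and any(is_tainted(arg) for arg in node.args):
            found.append(node.name)
        for arg in node.args:
            find_unsafe_sinks(arg, found)
    return found


unsafe_tree = Call("db_execute", [
    Call("concat", ["SELECT * FROM users WHERE id = ",
                    Call("read_request_param", ["id"])]),
])
safe_tree = Call("db_execute", [
    Call("concat", ["SELECT * FROM users WHERE id = ",
                    Call("escape_sql", [Call("read_request_param", ["id"])])]),
])

unsafe_report = find_unsafe_sinks(unsafe_tree)  # ['db_execute']
safe_report = find_unsafe_sinks(safe_tree)      # []
```

Nothing in the rule refers to the syntax of a particular language. Only the front end that builds the tree, and the lists of known sources, sanitizers, and sinks, would differ per language or library.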

The big benefits of symbolic AI here are: (1) by transforming the source code into an intermediate representation and then reasoning over it, our system keeps the semantics and does not merely reason over the probabilities of words; (2) our system can learn from a very small number of samples (in the extreme case, a single example) and generalize to various contexts.

In comparison, most alternative tools, especially those based on machine learning, treat code like text and disregard semantics and grammar. On top of that, these tools treat the ever-growing number of functions from external libraries and services all the same. But, as mentioned above, some are sources of direct user input (DANGEROUS!!) and some are vulnerable sinks (e.g., database access). Tools based on symbolic AI, in turn, are mostly limited to one language, and their rules are handcrafted, which makes the adoption of new libraries slow. DeepCode’s approach aims to go beyond both: it keeps the semantics of symbolic analysis while learning its rules from data.

In summary, DeepCode uses symbolic AI based on alternative representations of source code within its engine. We abstract the source code and apply to it a knowledge base of rules that we learned by observing open source projects.

Obviously, there is much more to see and understand, so hold on tight for the next articles. Also, try DeepCode on your own code and see the results: just go to DeepCode.ai and sign up for free.

CU

0xff
