DeepCode adds AI-based static code analysis support for C and C++

Frank Fischer
Published in DeepCodeAI
4 min read · Mar 18, 2020

It was back in 1972 (Neil Diamond's “Song Sung Blue” was leading the Swiss charts and the Apollo 17 manned moon mission returned safely) that the C language, designed by Dennis Ritchie at the famous Bell Labs, appeared in public. In 1985 (by then, Europe with “The Final Countdown” led the charts in Switzerland), Bjarne Stroustrup (also at Bell Labs) published an extension to C (hence C++). The impact of both languages on the software ecosystem cannot be overstated. It is fair to say that every modern computer, from tiny microcontrollers to cloud systems, runs C and C++ based software to some extent.

C and C++ are dominant in software that is close to the hardware, such as operating systems, and in software with high-performance or even real-time requirements. The two languages are preferred because they offer the large degree of freedom and control that these domains need. As these are specialized areas of software development, the margin for error is small and the possible impact of mistakes is huge.

Yet, with great freedom comes great responsibility. One very prominent problem is memory management, which works differently from languages such as Java or C#. In C/C++, developers are free to use libraries or to rely entirely on themselves. Getting it wrong leads to memory leaks (and therefore crashes) or buffer overflows (and therefore possible exploits). Also, the languages are not at a standstill. This year, we saw the finalization of the C++20 standard in February, adding new capabilities while maintaining backwards compatibility. As a result, all versions of C and C++ are found in software today.
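To make the memory leak case concrete, here is a minimal C++ sketch. It is an illustration only and not taken from DeepCode or any analyzed project; `Buffer`, `process_manual`, and `process_raii` are hypothetical names. The first function owns its object through a raw pointer and leaks it on an early return, while the second lets `std::unique_ptr` release the object on every exit path.

```cpp
#include <memory>

// Hypothetical resource that counts live instances so a leak is observable.
struct Buffer {
    static int live;
    Buffer() { ++live; }
    ~Buffer() { --live; }
    Buffer(const Buffer&) = delete;
    Buffer& operator=(const Buffer&) = delete;
};
int Buffer::live = 0;

// Manual ownership: the early return skips the delete, so the object leaks.
bool process_manual(bool fail_early) {
    Buffer* b = new Buffer();
    if (fail_early) {
        return false;  // leak: b is never deleted on this path
    }
    delete b;
    return true;
}

// RAII ownership: the unique_ptr destructor runs on every exit path.
bool process_raii(bool fail_early) {
    std::unique_ptr<Buffer> b(new Buffer());
    if (fail_early) {
        return false;  // b is destroyed automatically here
    }
    return true;
}
```

After a call to `process_manual(true)`, `Buffer::live` stays at 1 because nothing ever deletes the object. Both calls to `process_raii` leave it at 0.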

DeepCode adds support for C and C++

It took DeepCode roughly three months to add support for C and C++ because some aspects of the existing platform had to be extended. “We consider C++ to be a peculiar corner case of an imperative language with all its low-level features like memory management, references, pointers, etc. For any other ‘regular’ imperative programming language, we would not have to change anything in our points-to and type-state analysis. For any other language, we expect an implementation time of about one month,” said Jan Eberhardt, one of the leading engineers behind the new capability. Beyond the basic capability to parse the languages, the unique DeepCode AI system was trained using open-source codebases as well as private sources from multinational partners. “For the C++ dataset that was used in the first release, this resulted in over 300,000 repositories with a total of about 3.4 million files,” added Eberhardt.

Static code analysis for C and C++ is a challenging field, and numerous competitors have worked on it for decades. What makes DeepCode confident that it can be a leading player in this field? “Our unique approach provides two major advantages new to the industry: First, based on our proprietary implementation of the solver engine, we are able to use large open-source codebases as training data. We scan the change histories on these codebases and identify patterns. We are magnitudes faster than any comparable system. Second, we use an augmented AI approach. Our engineers work on an abstraction level higher than the competition and are supported by a semi-supervised learning system to produce symbolic AI constraint rules. The data used to train the system is only available at scale due to our performance advantage. With this, we were able to leapfrog the whole industry after just three months of focus,” said Boris Paskalev, CEO of DeepCode.

The performance of the DeepCode system can be experienced firsthand: Scanning the Chromium project (around 16 GB of source code) only takes minutes and scanning the Linux source code takes around 6 minutes. One of the findings in Chromium is depicted in the picture above: The content of an external environment variable is used in code to allocate memory and as the source for strcpy, a typical vector for buffer overflows. You can find the Chromium source code here (https://github.com/chromium/chromium). Go to deepcode.ai, load it into your dashboard, and have a look yourself.
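The kind of finding described above, externally controlled environment data flowing into strcpy, can be sketched as follows. This is not the Chromium code and not DeepCode's output; the variable name `APP_CONFIG_DIR` and the helpers `copy_bounded` and `read_env` are hypothetical. The risky pattern appears only as a comment, followed by two possible mitigations.

```cpp
#include <cstddef>
#include <cstdlib>
#include <cstring>
#include <string>

// Risky pattern (illustration only): the length of an environment variable
// is controlled by whoever launches the process, so copying it into a
// fixed-size buffer with strcpy can write past the end of that buffer.
//
//   char buf[64];
//   std::strcpy(buf, std::getenv("APP_CONFIG_DIR"));  // may overflow, may be null

// Copies src into dst only if it fits, including the terminating '\0'.
// Returns false (leaving dst as an empty string where possible) otherwise.
bool copy_bounded(const char* src, char* dst, std::size_t dst_size) {
    if (dst == nullptr || dst_size == 0) {
        return false;
    }
    dst[0] = '\0';
    if (src == nullptr) {
        return false;
    }
    const std::size_t len = std::strlen(src);
    if (len >= dst_size) {
        return false;  // would not fit: reject instead of truncating silently
    }
    std::memcpy(dst, src, len + 1);  // len + 1 copies the terminator too
    return true;
}

// Reads an environment variable into an owning std::string, which sizes its
// own storage. Returns false and leaves out untouched if the variable is unset.
bool read_env(const char* name, std::string& out) {
    const char* value = std::getenv(name);
    if (value == nullptr) {
        return false;
    }
    out = value;
    return true;
}
```

Rejecting oversized input rather than truncating it is one design choice among several. Whether truncation is acceptable depends on how the value is used afterwards.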

“Our mission is to enable all software developers in the world to take advantage of the knowledge buried in open source. To do so, we need to embed our service into their daily workflows,” added Paskalev. DeepCode provides plugins for Visual Studio Code, Atom, and Gitpod, as well as a command-line interface to call from within the development workflow. More integrations into typical developer tools are planned. The integrations are open source, so if you have ideas, let us know and jump in.

DeepCode offers free scanning for open source projects and small teams. In addition to C and C++, DeepCode supports Java, Python, JavaScript, and TypeScript. You can try it yourself at DeepCode.ai.

Update: Changed C++20 to reflect the correct naming and state of the process / thanks to Bartłomiej Filipek for pointing it out.
