Exploring Software Code to Understand Design Intentions
By Virgile Prevosto (CEA List) and Olivier Bouzereau (OW2)
In order to ease collaboration between programmers, testers, and system operators, the DECODER platform provides deep analysis of the source code repositories used in business applications or embedded systems.
The goal of DECODER is to improve the efficiency of development, testing, verification and validation through a centralized platform. This platform is a work in progress, as it consolidates various sources of information to keep the knowledge about source code repositories, libraries and components in sync. An original feature of the project is that it considers software code as a form of natural language. Therefore, it relies on NLP (Natural Language Processing) tools and semi-formal abstract models to understand the intentions and the properties of source code.
The DECODER project started in January 2019 and is funded by the European Commission under the Horizon 2020 programme for three years, until December 2021. Seven organisations are involved, including the French research institute CEA LIST as technical leader, the Polytechnic University of Valencia, Capgemini, Sysgo, Tree Technology, the OW2 open source association, and Technikon as project coordinator.
Assessed on several use cases, the DECODER open source platform aims at achieving the widest possible reuse. It should undergo a beta-test campaign by the end of 2020. From then on, multiple DevOps teams will be able to try it, along with several additional tools, to provide their feedback and to contribute to the project.
A Persistent Knowledge Monitor
DECODER is the acronym for DEveloper COmpanion for Documented and annotatEd code Reference.
A persistent database, dubbed PKM (Persistent Knowledge Monitor), acts as the central component of the DECODER architecture. Fed with data from multiple tools, this knowledge monitor is also responsible for answering requests from each member of the DevOps team, and from any active participant in the application lifecycle: an architect, a tester, a maintainer, etc.
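To give an idea of how a tool or a team member might interact with the PKM, here is a minimal Java sketch assuming a REST-style interface; the endpoint, port and query parameter are hypothetical illustrations, not the project's actual API:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class PkmQueryExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical PKM endpoint: ask for everything known about one source file.
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8080/pkm/artefacts?path=src/driver.c"))
                .header("Accept", "application/json")
                .GET()
                .build();
        // The response would be a JSON document aggregating the code, comments,
        // annotations and test results known for that file.
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());
    }
}
```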
Some tools in the project are developed from scratch, while others extend pre-existing tools. For instance, modeling is supported by the UPV Moskitt tool, which processes UML models. Query tools provide advanced queries to navigate in third-party APIs. On the verification side, code inspection goes through the Frama-C (C language) and OpenJML (Java) tools. The Testar tool generates interaction scenarios for the tests. Finally, the reporting capabilities will confirm whether the code exhibits, as expected, the properties declared during the modeling stage.
Several NLP tools are responsible for extracting information from informal documentation, including bug reports, code comments and developer forums. What matters here is the ability to link this unstructured natural-language information to more structured information at the code level, or even at the specification level, in particular to enable formal verification.
For instance, a specification summary is intended to generate formal annotations that help analysis tools like Frama-C perform their verification activity.
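As an illustration of what such generated annotations look like, here is a small contract in JML, the specification language checked by OpenJML (Frama-C plays the same role for C with the ACSL language); the class and the informal summary are invented for the example:

```java
public class Account {
    private int balance;

    // A JML contract of the kind that could be generated from an informal
    // summary such as "withdraw decreases the balance by the given amount,
    // which must be positive and must not exceed the current balance".
    //@ requires 0 < amount && amount <= balance;
    //@ ensures balance == \old(balance) - amount;
    public void withdraw(int amount) {
        balance = balance - amount;
    }
}
```

A verification tool can then check, statically, that the method body actually satisfies the contract, instead of relying on the prose comment alone.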
The project consortium is building the database and establishing a common format that all of these tools can understand, so that they can communicate easily with the PKM monitor, either to write data into it or to extract information from it.
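A tool's contribution to that common format might resemble the following sketch, which uses the Jackson library to build and serialise a JSON document; all field names here are hypothetical, not the actual PKM schema:

```java
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.databind.node.ObjectNode;

public class PkmDocumentExample {
    public static void main(String[] args) throws Exception {
        ObjectMapper mapper = new ObjectMapper();
        // Build a document describing one analysed function.
        // All field names are invented, for illustration only.
        ObjectNode doc = mapper.createObjectNode();
        doc.put("tool", "frama-c");
        doc.put("sourceFile", "src/driver.c");
        doc.put("function", "init_device");
        doc.put("comment", "Initialises the device before first use.");
        doc.putArray("annotations").add("requires \\valid(dev);");
        // Serialise to the JSON text that would be sent to the PKM server.
        System.out.println(
                mapper.writerWithDefaultPrettyPrinter().writeValueAsString(doc));
    }
}
```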
The DECODER tools have an impact during each phase of software development, including maintenance and code upgrades. They intervene from the definition of the model, then during its implementation thanks to an augmented IDE, and finally during verification and validation.
To support the maintenance and evolution of software, they will trace the parts of the code or specification that are affected by a recent change, so that developers can focus on the necessary updates. Here again, the PKM monitor plays a central role, since all the information available about the code is stored there.
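Conceptually, such impact tracing amounts to keeping a reverse index from code elements to the specifications and tests that depend on them. The toy lookup below (with invented names, not the PKM implementation) illustrates the idea:

```java
import java.util.List;
import java.util.Map;

public class ImpactTraceExample {
    public static void main(String[] args) {
        // Toy reverse index from code elements to the artefacts that depend
        // on them; in DECODER this information would live in the PKM.
        Map<String, List<String>> dependents = Map.of(
                "driver.c:init_device",
                List.of("spec: device initialisation", "test: boot sequence"),
                "driver.c:read_block",
                List.of("spec: block I/O"));
        // After a commit touches init_device, list what must be re-examined.
        String changed = "driver.c:init_device";
        System.out.println("Changed: " + changed);
        dependents.getOrDefault(changed, List.of())
                .forEach(a -> System.out.println("  re-check " + a));
    }
}
```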
Understanding the evolution of the code at each step
The gap between informal documents — written in English, French or any other language — and formal documents such as code and formal specifications, where they exist, remains to be bridged.
The DECODER project works in both directions: the team wants to extract information from natural-language sentences, matching them to the relevant parts of the code or specifications, and to generate formal specifications from informal information.
In addition, extracting information from code or formal specifications will help present it in a more informal or human-readable format, as a semi-automated document generator would do.
This is where the abstract semi-formal models (ASFM) come in. They provide a natural graphical language for describing, via a diagram, what a function is supposed to do. Faced with a complex data structure, it is often more efficient to reason using a diagram. The idea is to provide ASFM diagrams useful for debugging, each diagram representing what happens when a piece of code is executed. Each member of the DevOps team can thus view everything that changes after each evolution of the code.
Four use cases to improve the tools
Four use cases will evaluate the DECODER tools and demonstrate their usefulness for industry. The first concerns the analysis of Linux operating system drivers; the goal is to quickly understand whether or not drivers developed by third parties are eligible for integration into more or less critical embedded systems.
The well-known OpenCV computer vision library will be analysed in the context of a human-robot interface project, to facilitate understanding of its integration and onboarding of new developers.
MyThaiStar, a professional dynamic web application, focuses on user interface design and validation; this is the Capgemini use case in DECODER.
Finally, the Java use case analyses various projects from the OW2 open source code base. The idea is to evaluate all the tools and check their usability by people who are not directly involved in the DECODER project.
A PKM meta-model
A significant part of the work has focused on the design of the PKM monitor, with a meta-model described as a set of UML diagrams, to better understand all the information to be placed in the monitor and how its pieces relate to each other.
A first prototype of the PKM server has been designed and developed. The JSON (JavaScript Object Notation) open format is used as the main channel for exchanging information between the PKM server and the tools that communicate with it.
A first JSON schema has been developed; it is based on the already designed UML meta-model. Other JSON formats, such as SARIF, will make it possible to handle static analysis results. The members of the project wish to reuse several proven technologies to produce tools that are as interoperable as possible. At the back-end level, CEA LIST explored several document-oriented data managers before selecting MongoDB, although this choice is not as crucial as the design of the JSON schema itself, since all modern database engines can natively handle JSON objects. The UPV Testar tool provides its graphical model part but could also be used to store complete documents. More generally, the tools involved in the project have been extended to produce JSON documents following the PKM schemas. Moreover, beyond this low-level API, a high-level API and an accompanying user interface are currently under development.
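As a minimal sketch of what the back-end interaction could look like, the following example uses the official MongoDB Java driver to store and retrieve a JSON document; the database and collection names are invented:

```java
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

public class PkmStoreExample {
    public static void main(String[] args) {
        // Connect to a local MongoDB instance; names are hypothetical.
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> artefacts =
                    client.getDatabase("pkm").getCollection("artefacts");
            // MongoDB stores JSON documents natively, so a tool's JSON output
            // can be parsed and inserted as-is.
            Document doc = Document.parse(
                    "{ \"sourceFile\": \"src/driver.c\", \"tool\": \"frama-c\" }");
            artefacts.insertOne(doc);
            // Retrieve it back with a simple field query.
            System.out.println(artefacts.find(new Document("tool", "frama-c")).first());
        }
    }
}
```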
Information retrieval
In order to improve the accuracy of the tools, it will be necessary to collect a data set for the NLP part of the project. The DECODER team is using several data sets, including code and associated documents, already available via GitHub. In addition, a training corpus from the DeepAPI project could also be used to train the tools, in particular to establish correspondences between sentences in natural language and sequences of calls in Java code.
These collected data sets provide enough material to begin two-way experimentation: from code to natural language, to extract the main features of the code; and in the other direction, from natural language to code. The first experiment treats programming languages as a particular type of foreign language and performs a more or less standard translation using neural networks. The goal is to see how to link the semantics of the code to its natural-language counterpart, in order to get an accurate view of the similarities between them.
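The neural models themselves do not fit in a short snippet, but a toy baseline conveys the matching problem: scoring candidate code fragments against a natural-language query by token overlap. The bag-of-words Jaccard similarity below is deliberately simplistic and is not the project's neural approach:

```java
import java.util.*;

public class CodeMatchExample {
    // Split text or code into lowercase word tokens.
    static Set<String> tokens(String s) {
        return new HashSet<>(Arrays.asList(s.toLowerCase().split("\\W+")));
    }

    // Jaccard similarity between two token sets: |A inter B| / |A union B|.
    static double similarity(Set<String> a, Set<String> b) {
        Set<String> inter = new HashSet<>(a); inter.retainAll(b);
        Set<String> union = new HashSet<>(a); union.addAll(b);
        return union.isEmpty() ? 0.0 : (double) inter.size() / union.size();
    }

    public static void main(String[] args) {
        String query = "read a file into a string";
        List<String> candidates = List.of(
                "String readFileToString(File file)",
                "void writeStringToFile(File file, String data)",
                "int parseInt(String s)");
        // Rank candidate code fragments by similarity to the query.
        candidates.stream()
                .sorted(Comparator.comparingDouble(
                        (String c) -> -similarity(tokens(query), tokens(c))))
                .forEach(c -> System.out.printf("%.3f  %s%n",
                        similarity(tokens(query), tokens(c)), c));
    }
}
```

A real system would go much further, for example by splitting camelCase identifiers and learning semantic similarity rather than literal token overlap, which is precisely where the neural translation experiments come in.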
A first prototype of the PKM monitor
As a last highlight of the work accomplished so far, Capgemini organised several workshops to define what should be included in the PKM client, as well as the most interesting types of requests to send to the database. These workshops identified the main roles that users of the PKM monitor would fulfil in terms of software development, and sketched a first idea of the usage scenarios and of the type of controls to be given to the users of the client software. They are playing an important role in the definition of the high-level API for the PKM.
The implementation of the first prototype of the PKM monitor marked a milestone for the project. Notably, the schema is an important element to guarantee that each tool has sufficient information for its needs. Building on that, work will focus on refining the JSON schema and on facilitating information exchange, through the PKM monitor, between the various tools involved in the project.
For more information, please visit: https://www.decoder-project.eu