A deep dive into record-breaking accuracy in vulnerability detection

Anna Bacher
Jul 18

More than 80 percent of cyber-attacks target the application layer, and the majority of application-layer attacks are rooted in software vulnerabilities. The attack on First American Financial Corp. in May 2019 was made possible by a vulnerability in its website that exposed approximately 885 million documents — many of them containing Social Security and bank account numbers — going back at least 16 years.

The best way to prevent cyber-attacks is to find the vulnerabilities before the bad guys do. But current commercial tools don’t do a good enough job: they rely on rule-based algorithms that miss too many flaws, they’re expensive to maintain, and they can’t find zero-day vulnerabilities.

This post presents benchmarking results for Jaroona’s AI- and deep-learning-based vulnerability detection solution against the top commercial SAST tool (Checkmarx) and AI models from leading research universities. In the benchmark, Jaroona achieves a false positive rate of 3.2% and a false negative rate of 5.5% (a 94.5% true positive rate), with 96.1% accuracy and 91.2% precision. These results make our Static Application Security Testing (SAST) the best in class, beating the commercial average by 85% on false positives and 91% on false negatives.

Record-breaking Accuracy in Vulnerability Detection

Existing security audits are performed manually or with static or dynamic analysis tools (SAST, DAST). These tools are based on decade-old scientific research that was brought into production seven to ten years ago. Both static and dynamic analyzers are rule-based, and thus limited by their hand-engineered rules. They can’t find zero-day vulnerabilities, they are prone to human error, and they are resource- and cost-intensive.

There is significant current research into the use of machine learning for program analysis. To our knowledge, none of the recent research has been implemented and used in production. Jaroona is among the first security companies to use the new research findings and our own R&D to achieve significant performance improvements in security vulnerability detection. The improvements are meaningful in detection efficiency, speed, and scalability.

The table below provides performance results, showing how well different methods and tools perform against code they have never seen before.

Table 1. Performance Comparison of Vulnerability Detection Solutions
  1. FPR: False Positive Rate. The proportion of non-vulnerable samples incorrectly flagged as vulnerable. The lower the number, the better; a higher number means that developers and security officers will waste more time evaluating non-existent vulnerabilities.
  2. FNR: False Negative Rate. The proportion of truly vulnerable samples that the tool fails to flag. The lower the number, the better; a higher number means a greater chance that important vulnerabilities go undetected.
  3. A: Accuracy. The proportion of all samples that are classified correctly. The higher the percentage, the more accurate the model.
  4. P: Precision. The proportion of samples flagged as vulnerable that are truly vulnerable. The higher the percentage, the more precise the model.
  5. F1: F1 measure. The harmonic mean of precision and recall, capturing overall effectiveness. The higher the number, the more effective the model. (A short sketch showing how these metrics are computed follows below.)
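To make these definitions concrete, here is a minimal Python sketch of how the five metrics are derived from standard confusion-matrix counts; the counts in the example call are hypothetical, not taken from the benchmark.

```python
def metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Derive the five Table 1 metrics from confusion-matrix counts."""
    fpr = fp / (fp + tn)                        # false positive rate
    fnr = fn / (fn + tp)                        # false negative rate
    accuracy = (tp + tn) / (tp + fp + tn + fn)  # share classified correctly
    precision = tp / (tp + fp)                  # truly-vulnerable share of flags
    recall = tp / (tp + fn)                     # equals 1 - FNR
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return {"FPR": fpr, "FNR": fnr, "A": accuracy, "P": precision, "F1": f1}

# Hypothetical counts for illustration only:
print(metrics(tp=860, fp=82, tn=1950, fn=50))
```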

Digging into the results, it is immediately clear that the rule-based solutions have very high false negative rates (FNR): 56.8%, 70.4% and 85.3% for Checkmarx, Flawfinder and RATS respectively. Given the security risk associated with a high FNR, these tools (and RATS especially) are essentially unusable in enterprise or high-adversary, untrusted environments such as public blockchains or open “Internet Computer” paradigms.

Due to their high FNR, precision (P) and overall effectiveness (F1) are quite low for all rule-based solutions.

The AI-based solutions from research groups perform better. Starting with VUDDY (2017) and followed by VulDeePecker (2018), Draper & Boston University (2018) and SySeVR-BGRU (2018), it’s clear that AI-based detection is a significant step forward in detection efficiency. SySeVR achieves an FNR of 5.6%, compared to 56.8% for Checkmarx, the state-of-the-art rule-based product. SySeVR also significantly outperforms Checkmarx in accuracy, precision, and F1.

Our model — Jaroona-BGRU — closely matches the performance of the top AI-based research model, SySeVR-BGRU. Our R&D team is working to improve the results further, but our model already outperforms all rule-based solutions, including the top commercial product, Checkmarx.

Detection Model Benchmark Dataset

The comparison table published here is taken from the SySeVR research [2]. We extended it to include the Draper & Boston University model and the Jaroona-BGRU model. It is therefore a fair comparison, with no bias against the rule-based solutions (both open source and commercial) mentioned in the table.

We took the following steps to ensure transparency and a level playing field for the comparison:

  1. We reproduced Draper & Boston University’s model based on the description provided in their research paper [3].
  2. We used the dataset from the SySeVR team’s earlier model, VulDeePecker [1], as a basis and enhanced it with the latest patches from the National Vulnerability Database (NVD).
  3. We converted the dataset to represent individual functions, preserving all statements — the format required by the Draper & Boston University model.
  4. The Jaroona-BGRU model dataset contained only statements relevant to the inspected vulnerability while preserving as much of the function call stack as available. We observed that the function call stack is very important for vulnerability detection, because a vulnerability may be present or absent depending on the call context, not only on the called function itself.
  5. We trained both the Draper & Boston University model and the Jaroona-BGRU model using 80% of the constructed dataset, then validated and tested both models using 10% for the validation dataset and 10% for the test dataset (a minimal sketch of this split follows the list).
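As referenced in step 5, here is a minimal sketch of such an 80/10/10 split, assuming the dataset is held as a list of (sample, label) pairs; the function name and the fixed seed are illustrative choices, not our production code.

```python
import random

def split_dataset(samples, seed=42):
    """Shuffle and split into 80% train / 10% validation / 10% test."""
    samples = list(samples)
    random.Random(seed).shuffle(samples)  # fixed seed for reproducibility
    n = len(samples)
    n_train, n_val = int(0.8 * n), int(0.1 * n)
    train = samples[:n_train]
    val = samples[n_train:n_train + n_val]
    test = samples[n_train + n_val:]      # the remainder, roughly 10%
    return train, val, test
```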

The comparison dataset consists of C/C++ code only, as all of the research models and rule-based solutions were assessed using C/C++.

The latest patches from the National Vulnerability Database (NVD) were collected by parsing the NVD data feeds and extracting patch information and source code. A pre-patch version was considered vulnerable if the extracted part contained at least one changed line of code; the corresponding post-patch version was considered non-vulnerable.
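As a rough illustration of this extraction step (not Jaroona’s actual pipeline; the field names assume the legacy NVD JSON feed layout and should be treated as assumptions):

```python
import json

def patch_urls(feed_path: str):
    """Yield (CVE id, patch URL) pairs from an NVD JSON data feed."""
    with open(feed_path) as f:
        feed = json.load(f)
    for item in feed.get("CVE_Items", []):
        cve_id = item["cve"]["CVE_data_meta"]["ID"]
        for ref in item["cve"]["references"]["reference_data"]:
            # References tagged "Patch" usually point at a fixing commit.
            if "Patch" in ref.get("tags", []):
                yield cve_id, ref["url"]
```

Each patch URL (often a commit) can then be fetched and diffed: the pre-patch version of a changed function is labeled vulnerable, the post-patch version non-vulnerable.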

In total, we added 1945 additional training samples for CWE-119 (959 vulnerable samples and 986 non-vulnerable samples) and 5630 additional training samples for CWE-399 (2734 vulnerable samples and 2896 non-vulnerable samples).

Benchmark Results

Draper & Boston University CNN Model versus Jaroona-BGRU Model

We used Draper & Boston University’s CNN model as the comparison basis, as it outperforms their other models, including their RNN model. For the comparison, we took precision, recall, MCC and F1 from Table III of Draper & Boston University’s paper [3]. We intentionally used the results from Table III because they were achieved on real project source code from GitHub and Debian — similar to what we use for our comparison dataset. The best results from Draper & Boston University’s model are shown in Table IV of their paper, but those were achieved on a specially crafted dataset, built for evaluating static analyzers, with consistent style and structure. Such synthetic data is rarely found in real-world projects.

Our reproduction of Draper & Boston University’s model is very close to their original, as can be seen in our test results which closely track the results stated in Table III of their paper.

Fig. 1 shows Draper & Boston University’s model results on our comparison dataset (x-axis: number of training iterations/epochs; y-axis: MCC, F1, precision and recall, on a scale from 0 to 1).

Draper & Boston University’s best iteration — MCC: 0.5097, F1: 0.6434, precision: 0.5430, recall: 0.7893.

Fig. 2 shows Jaroona’s best iteration — MCC: 0.9019, F1: 0.9284, precision: 0.9128, recall: 0.9445.

Another important insight from Figures 1 and 2 is that Jaroona’s model shows no overfitting, while Draper & Boston University’s model begins to overfit within the first few hundred iterations: val_mcc and val_f1 stop increasing while mcc and f1 keep rising, and the gap between them widens. In Jaroona’s model, training and validation metrics grow together.
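To illustrate how such a train/validation gap can be monitored during training, here is a hedged Keras/scikit-learn sketch; it is illustrative code, not either team’s actual training loop.

```python
from sklearn.metrics import matthews_corrcoef, f1_score
from tensorflow import keras

class OverfitMonitor(keras.callbacks.Callback):
    """Logs the train/validation gap for MCC and F1 after each epoch.

    A training metric that keeps rising while its validation counterpart
    stalls (a widening gap) is the overfitting pattern described above.
    """
    def __init__(self, train_data, val_data):
        super().__init__()
        self.train_data = train_data  # (X_train, y_train)
        self.val_data = val_data      # (X_val, y_val)

    def _scores(self, X, y):
        # Threshold sigmoid outputs at 0.5 for binary labels.
        preds = (self.model.predict(X, verbose=0) > 0.5).astype(int).ravel()
        return matthews_corrcoef(y, preds), f1_score(y, preds)

    def on_epoch_end(self, epoch, logs=None):
        mcc, f1 = self._scores(*self.train_data)
        val_mcc, val_f1 = self._scores(*self.val_data)
        print(f"epoch {epoch}: mcc={mcc:.4f} val_mcc={val_mcc:.4f} "
              f"gap={mcc - val_mcc:.4f} | f1={f1:.4f} val_f1={val_f1:.4f}")
```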

Our experiments show that the Jaroona model outperforms Draper & Boston University’s model. More detailed analysis can be provided upon request.

Fig. 1 Draper & Boston University’s model results
Fig. 2 Jaroona-BGRU model results

SySeVR BGRU Model versus Jaroona-BGRU Model

SySeVR is an improvement on the VulDeePecker approach [1], as stated in the SySeVR paper [2]. In this section we therefore compare Jaroona’s approach with SySeVR, as it is the most advanced model created by Zhen Li and his team.

Based on multiple experiments with automatic detection and our many years of experience as ethical hackers, we have learned that semantic context is very important for vulnerability detection: a vulnerability may be present or absent depending on the call context, not only on the called function itself. Similar conclusions are stated in the SySeVR paper. We also found that our best-performing model so far is BGRU (as the SySeVR paper also reports).

Jaroona uses more layers than the SySeVR BGRU model (shown in Fig. 6 of the SySeVR paper) to achieve a number of improvements. We added one dilated convolution layer (with dilation rate = 2) before the BiGRUs to help track longer sequences. The output of this layer is concatenated with the embeddings output (not replacing it), so the BiGRU receives both the embeddings and the convolution outputs. We also added a second dense layer.

SySeVR doesn’t specify the parameters of their BGRU architecture, so there may be some other differences between the architectures.
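For concreteness, here is a minimal Keras-style sketch of the architecture described above. The vocabulary size, sequence length, and layer widths are illustrative assumptions, not Jaroona’s actual hyperparameters.

```python
from tensorflow import keras
from tensorflow.keras import layers

# Illustrative hyperparameters -- not Jaroona's actual values.
VOCAB_SIZE, EMBED_DIM, SEQ_LEN = 10_000, 128, 500

inputs = keras.Input(shape=(SEQ_LEN,))
emb = layers.Embedding(VOCAB_SIZE, EMBED_DIM)(inputs)

# Dilated convolution (dilation rate = 2) before the BiGRUs, to help
# track longer sequences; 'same' padding keeps the time dimension so
# the output can be concatenated with the embeddings.
conv = layers.Conv1D(64, kernel_size=3, dilation_rate=2,
                     padding="same", activation="relu")(emb)

# Concatenated, not replaced: the BiGRU sees both the embeddings and
# the convolution outputs.
x = layers.Concatenate()([emb, conv])

x = layers.Bidirectional(layers.GRU(128, return_sequences=True))(x)
x = layers.Bidirectional(layers.GRU(128))(x)

# Two dense layers; the second is the addition over SySeVR's BGRU.
x = layers.Dense(64, activation="relu")(x)
outputs = layers.Dense(1, activation="sigmoid")(x)

model = keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
```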

The comparison results are available in Table 1 above.

Moving towards Web 3.0: Why Detecting Vulnerabilities in Large Code Bases Matters

The emergence of Web 3.0 is resulting in faster and more flexible agile development and the movement from traditional monolithic web applications to modern applications that invoke many APIs and utilize microservices architectures.

With the new paradigm of Web 3.0, Cloud 3.0 and “Internet Computer” infrastructure, computer scientists, internet experts and developers believe that the Internet will behave like a decentralized operating system or peer-to-peer Internet on which the world can host the next generation of software, services and data.

An extremely large software codebase is being developed to enable Web 3.0 and the “Internet Computer” from the ground up. While scalability issues are being successfully resolved, security remains a concern. The sheer size of the new code, combined with the concept of an open Internet (meaning no centralized network perimeter guarded by good old firewalls), requires new methods that can detect vulnerabilities stemming from a wide range of causes in large code bases while minimizing reliance on human experts.

To understand the security implications of the millions of lines of code being written for Web 3.0 and the “Internet Computer,” it is instructive to look back several decades at old Windows software and see how many security patches have been issued. Since the first production release of Windows in 1985, tens of thousands of patches have been issued, with new vulnerabilities found and fixed every month, including top-10-severity vulnerabilities. In 2019 so far, 272 vulnerabilities have been found in Microsoft software (https://www.cvedetails.com/vulnerability-list/vendor_id-26/year-2019/Microsoft.html).

Conclusions

Increased automation is crucial to allow vulnerability discovery to scale to the large amount of code that must be secured for Web 3.0. The ability to statically identify vulnerabilities comprehensively, efficiently, and with few false positives or false negatives is an extremely important element of this mission.

At Jaroona, we have put considerable work and energy into achieving this goal, and today we are happy to demonstrate success. Jaroona is the first static analysis tool built on machine learning and deep learning models that detects vulnerabilities at a false positive rate of 3.2% and a false negative rate of 5.5%. These rates are considerably lower than those of other static analyzers, and can be attributed to providing deep syntactic and semantic context for each vulnerability used to train the deep learning network.

References

● [1] VulDeePecker: A Deep Learning-Based System for Vulnerability Detection, 5 Jan 2018 — Zhen Li et al. — https://arxiv.org/pdf/1801.01681.pdf

● [2] SySeVR: A Framework for Using Deep Learning to Detect Software Vulnerabilities, 21 Sep 2018 — Zhen Li et al. — https://arxiv.org/pdf/1807.06756.pdf

● [3] Automated Vulnerability Detection in Source Code Using Deep Representation Learning, 28 Nov 2018 — Rebecca L. Russell et al. (Draper & Boston University) — https://arxiv.org/pdf/1807.04320.pdf
