Patch-RAG: Using RAG to Generate Context-Aware Vulnerability Patching Step-by-Step Guides

Published in Labs Notebook · 8 min read · Jun 6, 2024

By: Eden Yavin, Gal Engelberg and Dan Klein

Keywords: SCA, LLM, RAG, Cybersecurity

Introduction

The Application Security (AppSec) domain grapples with the challenge of Software Composition Analysis (SCA), a process that aims to identify and address third-party components and their associated vulnerabilities within software applications. During an SCA scan, the software and its SBOM (Software Bill of Materials) are analyzed to break the application down into a list of integrated components originating from external sources. Subsequently, the scanner cross-references these components against a database of known vulnerabilities, mapping component versions to any documented security flaws.

As an illustrative example, consider CVE-2018-10055, which affects versions of the Python TensorFlow package prior to 1.7.1. In such cases where a vulnerability is identified, the widely adopted approach to mitigating the associated risk is patching, whereby the vulnerable component is updated to a secure, patched version. This patching process not only addresses the identified vulnerability but also ensures the ongoing security and stability of the software by incorporating the latest fixes and enhancements from the component maintainers. Given that this CVE dates back to 2018, well before GPT's knowledge cutoff, it should already be encompassed within the model's knowledge base, which lets us evaluate GPT's response with and without the grounding provided by our Patch-RAG approach.
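
The "versions prior to 1.7.1" check above is the core comparison an SCA scanner performs for each component. A minimal sketch, assuming plain dotted numeric version strings (real scanners need full PEP 440 / semver handling, pre-releases included):

```python
# Deciding whether an installed package version falls in a CVE's affected
# range ("all versions prior to the first fixed one"). Simplified sketch:
# handles only dotted numeric versions like "1.7.0".

def parse_version(v: str) -> tuple:
    """Turn '1.7.0' into (1, 7, 0) so tuples compare numerically."""
    return tuple(int(part) for part in v.split("."))

def is_affected(installed: str, first_fixed: str) -> bool:
    """True if `installed` predates the first patched release."""
    return parse_version(installed) < parse_version(first_fixed)

# CVE-2018-10055 affects TensorFlow prior to 1.7.1:
print(is_affected("1.7.0", "1.7.1"))  # True: vulnerable
print(is_affected("1.7.1", "1.7.1"))  # False: patched
```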

While current SCA scanners are adept at identifying vulnerable components and informing stakeholders of potential risks, the subsequent steps involved in actually patching and mitigating those vulnerabilities can be tedious and time-consuming. Once a vulnerability is detected, a stakeholder may then research and identify the appropriate patched version of the component that addresses the security flaw. Furthermore, they must fetch and follow vendor-specific documentation or guidelines on how to properly apply the patch within the context of their software environment. This manual process is not only labor-intensive but also prone to errors, potentially leaving systems exposed until the update is successfully implemented.
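
Once the stakeholder has found the right patched version, the fix itself often reduces to bumping a dependency pin. A toy illustration of that final step, using a hypothetical `bump_requirement` helper over a `requirements.txt`-style string (real projects would also regenerate lockfiles and rerun tests):

```python
import re

def bump_requirement(requirements_text: str, package: str,
                     fixed_version: str) -> str:
    """Rewrite a `package==x.y.z` pin to the patched version.
    Illustrative only: ignores extras, markers, and range specifiers."""
    pattern = re.compile(rf"^{re.escape(package)}==\S+", re.MULTILINE)
    return pattern.sub(f"{package}=={fixed_version}", requirements_text)

reqs = "numpy==1.24.0\ntensorflow==1.7.0\n"
print(bump_requirement(reqs, "tensorflow", "1.7.1"))
# numpy==1.24.0
# tensorflow==1.7.1
```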

With such a time-consuming and error-prone process in mind, we asked ourselves: how can it be accelerated into a frictionless, automated one?

LLM

Large language models (LLMs) have demonstrated remarkable capabilities in various domains, but their effectiveness can be limited when dealing with rapidly evolving software ecosystems and emerging vulnerabilities. LLMs have a finite knowledge cutoff date, meaning they may lack awareness of new libraries released after their training, as well as the latest patches and versions addressing recently disclosed vulnerabilities. This limitation exists even when LLMs have internet access, as they do not have the capability to automatically update their knowledge bases with the latest information. Furthermore, the ambiguity in package names (e.g., “pillow”) can cause LLMs to hallucinate or deviate from the intended task. Retrieval Augmented Generation (RAG) presents a promising solution by providing LLMs with structured data representations of package names, versions, and associated vulnerability information. This additional context aligns the LLM with the specific software component analysis task, mitigating the risks of hallucination and ensuring more accurate and up-to-date vulnerability management recommendations.

Important note: The common RAG pipeline applies a vector database and an embedding retriever to search for documents semantically similar to the query. In our case, we apply more concrete information retrievers: a REST API and a graph database.

Figure 1. The full instance of the Patch-RAG pipeline.

Figure 1 depicts our approach. The Package Retriever component fetches the affected software package by querying a knowledge base with the provided CVE-ID. Subsequently, the Patch Retriever component queries another knowledge base to obtain the appropriate patch version for the identified package, utilizing the CVE-ID and affected package information. This structured data retrieval approach ensures that the LLM component receives precise and contextual inputs, enabling it to generate accurate and comprehensive step-by-step guides for updating the vulnerable package to the patched version.
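
The data flow in Figure 1 can be sketched as a small orchestration function. The component and field names below are our own stand-ins for illustration (in the real system the Package Retriever queries BRON and the Patch Retriever queries OSV):

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class AffectedPackage:
    cve_id: str
    name: str
    ecosystem: str

@dataclass
class PatchContext:
    package: AffectedPackage
    fixed_version: str

def run_pipeline(
    cve_id: str,
    package_retriever: Callable[[str], AffectedPackage],
    patch_retriever: Callable[[AffectedPackage], str],
) -> PatchContext:
    pkg = package_retriever(cve_id)   # step 1: CVE-ID -> affected package
    fixed = patch_retriever(pkg)      # step 2: package -> fix version
    return PatchContext(package=pkg, fixed_version=fixed)  # LLM context

# Stubbed retrievers standing in for the BRON and OSV lookups:
ctx = run_pipeline(
    "CVE-2018-10055",
    lambda cve: AffectedPackage(cve, "tensorflow", "PyPI"),
    lambda pkg: "1.7.1",
)
print(ctx.fixed_version)  # 1.7.1
```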

You might think: “But can’t I just prompt the LLM to write me such a playbook without going through the RAG pipeline?”

Figure 2. GPT-4 generated patching playbook for CVE-2018-10055

If we take a look at Figure 2, we can see GPT-4's answer to the prompt of generating a step-by-step patching playbook for CVE-2018-10055, which affects the Python package TensorFlow. The correct solution, by the way, is to update the package to version 1.7.1 or above, but there is no way for the LLM to know that without being trained on such information.

If you are thinking about supplying the information to the LLM yourself, that may work, but it does not scale: code projects can have hundreds of vulnerable packages that need patching. This is where Patch-RAG comes into play, automatically fetching all the information the LLM will need.

BRON

Figure 3. BRON ontology example. Image taken from the BRON paper.

BRON is a knowledge graph (KG) that links public threat and mitigation data from different data sources. KGs organize information as a network of connected concepts, rather than as raw text or a tabular structure. As can be seen in Figure 3, BRON's KG includes data about the products that a vulnerability (CVE) affects. The products vary from SQL databases like PostgreSQL to different Python packages and their versions. Using BRON's knowledge, we can traverse the graph starting from a vulnerability node and collect all the products that are affected. The product data is stored in the Common Platform Enumeration (CPE) format, which includes not only the product's name, vendor, and version but also its type: Application, Operating System, or Hardware. Utilizing this knowledge, the Package Retriever can filter the returned products by type and pass only Application-type products to the Patch Retriever. The CPE representation can also serve to further refine our filtering, for example, to include only applications from specific vendors.
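
The type filtering described above follows directly from the CPE 2.3 string layout, `cpe:2.3:<part>:<vendor>:<product>:<version>:...`, where the part field is `a` for application, `o` for operating system, and `h` for hardware. A simplified sketch (real CPE strings may contain escaped colons, which a naive split mishandles):

```python
# Filter CPEs down to Application-type products, as the Package Retriever
# does before handing results to the Patch Retriever.

def parse_cpe(cpe: str) -> dict:
    """Pull the fields we care about out of a CPE 2.3 string."""
    fields = cpe.split(":")
    return {"part": fields[2], "vendor": fields[3],
            "product": fields[4], "version": fields[5]}

def only_applications(cpes):
    """Keep only part == 'a' (Application) entries."""
    return [parse_cpe(c) for c in cpes if parse_cpe(c)["part"] == "a"]

cpes = [
    "cpe:2.3:a:google:tensorflow:1.7.0:*:*:*:*:*:*:*",
    "cpe:2.3:o:canonical:ubuntu_linux:16.04:*:*:*:*:*:*:*",
]
apps = only_applications(cpes)
print(apps[0]["product"], apps[0]["version"])  # tensorflow 1.7.0
```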

Figure 4. BRON example: querying BRON for all the CPEs related to CVE-2018-10055

Let’s examine Figure 4: the selected node is the CPE node, and the other is the CVE node. Looking at the data the CPE node contains, we can see the affected TensorFlow package.

OSV

The Open Source Vulnerability (OSV) database is a comprehensive, machine-readable database of vulnerabilities affecting open-source software packages. For a deeper dive into its architecture and construction, we encourage readers to explore the OSV team’s insightful blog post. The Patch Retriever component queries the OSV database using the input CVE-ID and package name, retrieving structured information about the impacted package versions and associated metadata, and then extracts the fix version from that data. With accurate package and version information retrieved from this authoritative source, the LLM is able to generate precise and reliable update guides.
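
Extracting the fix version boils down to walking an OSV record's `affected[].ranges[].events[]` lists and collecting the `fixed` entries. A sketch over a sample record trimmed to the fields we use and shaped like the OSV schema (a live lookup would fetch the full record from the OSV API first; the `id` below is illustrative):

```python
# Sample record shaped like an OSV entry for CVE-2018-10055, trimmed to
# the fields this sketch reads.
osv_record = {
    "id": "OSV-EXAMPLE-ID",  # illustrative placeholder, not a real OSV id
    "aliases": ["CVE-2018-10055"],
    "affected": [{
        "package": {"name": "tensorflow", "ecosystem": "PyPI"},
        "ranges": [{
            "type": "ECOSYSTEM",
            "events": [{"introduced": "0"}, {"fixed": "1.7.1"}],
        }],
    }],
}

def extract_fixed_versions(record: dict) -> list:
    """Collect every 'fixed' event across the record's affected ranges."""
    fixed = []
    for affected in record.get("affected", []):
        for rng in affected.get("ranges", []):
            for event in rng.get("events", []):
                if "fixed" in event:
                    fixed.append(event["fixed"])
    return fixed

print(extract_fixed_versions(osv_record))  # ['1.7.1']
```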

Leveraging the OSV database in the Patch Retriever component ensures that our system stays up to date with the latest vulnerability information and supports a wide range of open-source software packages, enhancing the overall effectiveness and coverage of our vulnerability management solution.

Figure 5. Example of OSV data about TensorFlow CVE-2018-10055

In Figure 5, we can see that OSV has the exact information we were looking for — the safe version, i.e., 1.7.1. We can utilize this data as the context to our LLM, as seen in Figure 1.

Putting it All Together

If we look at Figure 1, it shows all the components discussed above put in place. Given a CVE-ID, the Package Retriever queries the knowledge graph for the affected CPEs. These are given as input to the Patch Retriever, which queries the OSV database for the fix version. The fix version, together with the product name (TensorFlow), ecosystem (Python), and CVE-ID, is forwarded to the LLM to generate the final guide.
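
The last step, handing the retrieved context to the LLM, is just prompt assembly. A sketch with our own illustrative template wording (not the exact prompt used in this work):

```python
# Fold the retrieved context (CVE-ID, package, ecosystem, fix version)
# into the prompt that asks the LLM for the patching playbook.

PROMPT_TEMPLATE = """\
You are an application-security assistant.
Write a step-by-step patching playbook for the vulnerability below.

CVE-ID: {cve_id}
Affected package: {package} ({ecosystem})
First patched version: {fixed_version}

The playbook must use {ecosystem}-specific tooling to upgrade
{package} to version {fixed_version} or later, then verify the upgrade."""

def build_prompt(cve_id: str, package: str, ecosystem: str,
                 fixed_version: str) -> str:
    return PROMPT_TEMPLATE.format(cve_id=cve_id, package=package,
                                  ecosystem=ecosystem,
                                  fixed_version=fixed_version)

prompt = build_prompt("CVE-2018-10055", "tensorflow", "Python", "1.7.1")
print(prompt)
```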

Figure 6. GPT-4 with Patch-RAG grounding for CVE-2018-10055

Using the same prompt as before but with the context retrieved from our pipeline, we can see the difference: as opposed to Figure 2, where the LLM was not aligned with our purpose in generating the playbook, in Figure 6 the LLM generates a far more concrete playbook for our stakeholders to follow. This is due to the context we gave it about the affected product and the safe version, which aligned the LLM to create a Python-specific step-by-step playbook, just as we wanted. We evaluated the accuracy of the generated code in producing the correct update statements and found that GPT-4 with Patch-RAG grounding significantly outperforms GPT-4 prompting without grounding, with a 40% improvement in accuracy.

Benefits of Patch-RAG

  • Up-to-Date Data: By leveraging structured databases like OSV (Open Source Vulnerability), Patch-RAG ensures that the vulnerability and package information used is always current and up-to-date, eliminating the risk of outdated or stale data.
  • Full Automation: The end-to-end process of mapping CVE-IDs to affected packages, retrieving patch versions, and generating step-by-step guides can be fully automated, reducing manual effort and increasing operational efficiency.
  • Modular Database Integration: The architecture allows for seamless replacement or integration of different knowledge bases and databases, providing flexibility and adaptability as new sources of vulnerability and patch information become available.
  • Versatile Output Formats: The final output, the step-by-step update guide, can be generated in various formats to suit different use cases, including machine-readable formats for integration with other tools and automation processes, as well as human-readable formats (e.g., Markdown) for manual validation and review.
  • Scalability: By offloading the retrieval of structured data to dedicated components and knowledge bases, Patch-RAG can scale effectively to handle a large volume of CVEs and software packages without overburdening the language model.

Summary

In this blog, we introduced Patch-RAG, a novel approach that combines the power of large language models (LLMs) with structured knowledge bases to automate the process of generating step-by-step guides for addressing software vulnerabilities. By leveraging authoritative sources like the Open Source Vulnerability (OSV) database and integrating them into a modular architecture, Patch-RAG ensures that the most up-to-date and accurate information is used to identify affected packages and their corresponding patched versions. This structured data retrieval approach provides the LLM with precise and contextual inputs, enabling it to generate reliable and comprehensive update guides tailored to the specific software environment.
