LLM Agents can Autonomously Exploit One-day Vulnerabilities
Large language models (LLMs) have become increasingly powerful and are now commonly embodied as agents. These agents can take actions, such as navigating web browsers and writing and executing code. Incredibly, LLM agents can now autonomously resolve complex bugs in code, perform economic analysis, and aid in the scientific discovery process.
While these agents are useful, researchers have become increasingly concerned about their dual-use capabilities: their ability to perform harmful tasks, especially in the context of cybersecurity. Cybersecurity is particularly concerning because these agents can act autonomously, without needing to perform physical manipulations. For example, we recently showed that LLM agents can autonomously hack websites similar to those used in capture-the-flag exercises. Other work has shown that ChatGPT can be used to assist humans in penetration testing and malware generation. However, prior to our work, it was unknown whether LLM agents can autonomously exploit real-world vulnerabilities.
In our recent work, we show that LLM agents can autonomously exploit one-day vulnerabilities. To show this, we collected a benchmark of 15 real-world vulnerabilities, ranging from web vulnerabilities to vulnerabilities in container management software. Our LLM agent can exploit 87% of these vulnerabilities, compared to 0% for every other LLM we tested (GPT-3.5 and 8 open-source LLMs) and 0% for the open-source vulnerability scanners we tested (ZAP and Metasploit), at the time of writing.
In the remainder of this blog post, we will describe our benchmark, agent, and results.
Benchmark of Real-world Vulnerabilities
We first constructed a benchmark of real-world vulnerabilities in the one-day setting. One-day vulnerabilities are vulnerabilities that have been publicly disclosed (e.g., in the CVE database) but have not yet been patched in the target system. These vulnerabilities can have real-world consequences, especially in hard-to-patch systems.
Although one-day vulnerabilities are published, this does not necessarily mean that existing tools can automatically find and exploit them. For example, malicious hackers or penetration testers without access to internal deployment details may not even know which version of the software is deployed.
Many CVEs are in closed-source systems and so cannot be reproduced; we therefore focused on vulnerabilities in open-source software. We collected 15 vulnerabilities spanning web vulnerabilities, vulnerabilities in container management software, and vulnerabilities in Python packages. These include high-severity vulnerabilities and vulnerabilities disclosed after the knowledge cutoff date of the LLMs we test.
We show the list of vulnerabilities below, along with their details.
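For concreteness, here is a minimal sketch of how one might record each benchmark entry. This is an illustrative data layout of our own devising, not the format used in the paper; all field names are hypothetical.

```python
# Hypothetical record for one benchmark vulnerability; field names are
# illustrative, not the format used in the paper.
from dataclasses import dataclass

@dataclass
class VulnerabilityEntry:
    cve_id: str         # e.g., "CVE-2024-25640"
    description: str    # the CVE description later given to the agent
    category: str       # e.g., "web", "container management", "Python package"
    severity: str       # CVE severity rating
    after_cutoff: bool  # disclosed after the LLM's knowledge cutoff date?
```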
Agent Setup
Similar to the agent we built in prior work, we constructed an agent with tools, a goal, and the ability to plan. In addition, we gave the agent the CVE description to emulate the one-day setting. These agents are not complicated: our agent totaled 91 lines of code and 1,056 tokens for the prompt. We show a diagram of our agent below; see our paper for more details.
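To give a sense of what such a tool-using agent looks like, the following is a minimal, hypothetical sketch of an agent loop. This is not the agent from our paper (which we are withholding; see Responsible Disclosure below): the prompt, tool set, and model choice are illustrative placeholders only.

```python
# A minimal, hypothetical sketch of a tool-using agent loop. NOT the agent
# from the paper; the prompt, tools, and model below are placeholders.
import json
import subprocess

from openai import OpenAI

client = OpenAI()  # assumes the OPENAI_API_KEY environment variable is set

# A single illustrative tool: run a shell command inside a sandboxed test VM.
TOOLS = [{
    "type": "function",
    "function": {
        "name": "run_shell",
        "description": "Run a shell command in the sandboxed test environment.",
        "parameters": {
            "type": "object",
            "properties": {"command": {"type": "string"}},
            "required": ["command"],
        },
    },
}]

def run_shell(command: str) -> str:
    # Execute the command and return truncated output for the model to read.
    result = subprocess.run(command, shell=True, capture_output=True,
                            text=True, timeout=120)
    return (result.stdout + result.stderr)[-4000:]

def run_agent(cve_description: str, max_steps: int = 20) -> None:
    # The CVE description is given to the agent to emulate the one-day setting.
    messages = [
        {"role": "system", "content": "You are assisting an authorized "
                                      "security test against a sandboxed target."},
        {"role": "user", "content": f"Plan and carry out a test of this "
                                    f"vulnerability:\n{cve_description}"},
    ]
    for _ in range(max_steps):
        response = client.chat.completions.create(
            model="gpt-4", messages=messages, tools=TOOLS)
        message = response.choices[0].message
        messages.append(message)
        if not message.tool_calls:
            print(message.content)  # the agent's final summary
            return
        for call in message.tool_calls:
            args = json.loads(call.function.arguments)
            messages.append({
                "role": "tool",
                "tool_call_id": call.id,
                "content": run_shell(args["command"]),
            })
```

The key structural pieces are the ones described above: a goal (the user message), the CVE description, a tool the model can call, and a loop that lets the model plan across multiple steps.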
Exploiting One-day Vulnerabilities
To test our agent, we used 10 base LLMs (GPT-4, GPT-3.5, and 8 open-source LLMs). We also ran ZAP and Metasploit on the 15 vulnerabilities in our benchmark. For each LLM, we ran the agent 5 times per vulnerability.
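As a rough sketch of this protocol, assuming a vulnerability counts as exploited if at least one of the 5 runs succeeds (our assumption here, not a detail from the post), the evaluation loop might look like the following; `attempt_exploit` is a hypothetical callback that runs the agent once and checks whether the exploit worked.

```python
# Hypothetical evaluation harness: 5 agent runs per vulnerability, counting a
# vulnerability as exploited if any run succeeds (an assumption on our part).
def evaluate(vulnerabilities, attempt_exploit, runs_per_vuln: int = 5) -> float:
    exploited = sum(
        any(attempt_exploit(vuln) for _ in range(runs_per_vuln))
        for vuln in vulnerabilities
    )
    return exploited / len(vulnerabilities)
```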
The GPT-4 agent can exploit 87% of the vulnerabilities, compared to 0% for every other method we tested. These results suggest an “emergent capability” in GPT-4, although more investigation is required.
GPT-4 fails on only two vulnerabilities: Iris XSS and Hertzbeat RCE. Iris “is a web collaborative platform that helps incident responders share technical details during investigations” (CVE-2024-25640). The Iris web app is extremely difficult for an LLM agent to navigate because navigation is done through JavaScript. As a result, the agent tries to access forms and buttons without first interacting with the elements needed to make them available, which prevents it from completing the exploit. The detailed description for Hertzbeat is in Chinese, which may confuse the GPT-4 agent we deploy, since our prompt is written in English.
We further find that removing the CVE description causes the success rate to drop from 87% to 7%. This suggests that naive applications of LLM agents still struggle in the zero-day setting, where the vulnerability is not known ahead of time.
Responsible Disclosure
We disclosed our findings to OpenAI prior to releasing our preprint. They have explicitly requested that our prompt and agent not be released to the wider public, so we have elected to withhold these specific details except upon request.
Conclusions
Our findings show that LLM agents are capable of autonomously exploiting real-world cyber vulnerabilities. Fortunately, our agent does not appear to be capable of exploiting zero-day vulnerabilities as is, although extensions may be capable of such exploits. Nonetheless, we hope that our findings encourage deployers of LLMs to carefully consider the dual-use nature of their capabilities.
Written by Richard Fang, Rohan Bindu, Akul Gupta, and Daniel Kang