Both good and bad guys show that LLMs aren’t ready for prime-time
It’s common — and wise — to try out a new product before putting it on the market to see if it functions as intended or has some defects that weren’t caught during its development. That way you can fix problems before they start plaguing your customers.
There are multiple slang terms for it — test drive, trial run, shakedown cruise, etc. And it’s looking like that should have been done in much more depth to the large language model (LLM) systems that are the brains of artificial intelligence (AI) chatbots.
Because while the shakedown is indeed happening, it’s not being done in a controlled, premarket setting. It’s being done by both good and bad guys in the wild, so to speak, while chatbots are in broad, mainstream use, a couple of years after they hit the market.
The problem is not just that vulnerabilities can allow LLMs to be tricked into producing confidential information, inaccurate responses, hateful content, phishing emails, and malicious code. As Wired magazine reported last week, a team of researchers from the University of Pennsylvania showed that LLM system weaknesses can be exploited to cause physical damage as well.
The researchers “were able to persuade a simulated self-driving car to ignore stop signs and even drive off a bridge, get a wheeled robot to find the best place to detonate a bomb, and force a four-legged robot to spy on people and enter restricted areas,” according to the report.
George Pappas, head of a research lab at the university, told Wired that it’s not just about robots. “Any time you connect LLMs and foundation models to the physical world, you actually can convert harmful text into harmful actions,” he said.
Granted, this is not a new or original problem. LLM systems are built with software, and there is no such thing as perfect software. No software product has ever not needed an update.
As an AI-generated definition from Google puts it (with a grammatical error — see if you can catch it), “LLMs are essentially made from software, specifically a type of complex computer program built using machine learning (ML) algorithms, particularly neural networks, which allows them to process and generate human-like text based on vast amounts of data they are trained on; in simpler terms, they are software applications designed to understand and respond to language.”
Not enough pre-testing?
Still, given the ways that LLM systems can and are being manipulated, you could make the argument that it would have helped a lot to expose them to a lot more penetration testing, better known as pen testing.
That’s done by ethical hackers who try to find vulnerabilities in a software product before it’s released that could be exploited by criminal hackers.
And there are plenty of vulnerabilities in the most popular LLMs. The Open Web Application Security Project (OWASP) last year launched another of its iconic “top 10” lists, this one covering the worst LLM vulnerabilities. It released a new list last month, titled “OWASP Top 10 for LLM Applications 2025.”
And just a couple of weeks ago, Knostic, a software company focused on access controls for LLMs, posted a blog on what it contends is a “new class of attacks, named flowbreaking, affecting AI/ML-based system architecture for LLM applications and agents.”
According to the company, its researchers were able to undermine the policy “guardrails” built into those systems that are designed to prevent it from providing advice on a controversial topic like suicide.
One example is what the company called “second thoughts.” According to the blog, “When confronted with a sensitive topic, Microsoft 365 Copilot and ChatGPT answer questions that their first-line guardrails are supposed to stop. After a few lines of text they halt — seemingly having ‘second thoughts’ — before retracting the original answer (also known as Clawback), and replacing it with a new one without the offensive content, or a simple error message.”
But the Knostic researchers said they found that if a user clicks the Stop button while the answer is still streaming, “the LLM will not engage its second-line guardrails. As a result, the LLM will provide the user with the answer generated thus far, even though it violates system policies.”
“In other words, pressing the Stop button halts not only the answer generation but also the guardrails sequence […] What’s interesting here is that the model itself isn’t being exploited. It’s the code around the model.”
Andrew Bolster, senior manager of engineering at software security company Black Duck, agrees, calling the notation that “the model itself isn’t being exploited” an important distinction.
Insecure design
“Fundamentally, at no point is this ‘exploiting’ the LLM functionality itself, or even using the LLM to ‘do’ something unsafe,” he said. “This is demonstrating good old-fashioned insecure design, where the ‘flowbreaking’ has identified that the ‘Stop’ functionality, answer streaming functionality, and guardrails have not been fully integrated together.”
Knostic also includes a caveat about its findings: According to the blog, “Prompts that trigger the behavior we describe stop working anywhere from hours to weeks after first use. During the research we had to constantly iterate on these prompts or come up with new ones. The precise prompts we document below may not work when attempted, by the time of release.”
Another question about the Knostic research is if what it describes really amounts to a new class of attacks, separate from those in the OWASP top 10. The view from some other experts amounts to “sort of.”
Beth Linker, director of product management with Black Duck, said while the specific names may be different, there is some overlap between flowbreaking and items on the OWASP list. “Flowbreaking is a flavor of jailbreaking, in that it attempts to circumvent a guardrail,” Linker said, adding that “it may also bleed into insensitive output handling or sensitive information disclosure.”
The Knostic blog also recognizes the overlap Linker cited but argues that there are substantive differences. “Up to now jailbreaking and prompt injection techniques mostly focused on directly bypassing first-line guardrails by use of ‘language tricks’ and token-level attacks, breaking the system’s policy by exploiting its reasoning limitations,” according to the blog.
“In this research we’ve used these prompting techniques as a gateway into the inner workings of the AI/ML systems. Under the auspices of this approach, we try to understand the other components in the system, LLM-based or not, and to avoid them, bypass them, or use them against each other.”
OWASP did not respond to a request for comment about the Knostic research, but its latest list of LLM vulnerabilities includes
- Prompt injection: This happens when “user prompts alter the LLM’s behavior or output in unintended ways.”
- Sensitive information disclosure: This can include “personally identifiable information, financial details, health records, confidential business data, security credentials, and legal documents. Proprietary models may also have unique training methods and source code considered sensitive, especially in closed or foundation models.”
- Supply chain: Software supply chains for any product are at risk from code defects. But LLM supply chains are also at risk from vulnerabilities that “can affect the integrity of training data, models, and deployment platforms. These risks can result in biased outputs, security breaches, or system failures.”
- Data and model poisoning: This refers to when “pre-training, fine-tuning, or embedding data is manipulated to introduce vulnerabilities, backdoors, or biases. This manipulation can compromise model security, performance, or ethical behavior, leading to harmful outputs or impaired capabilities.”
- Improper output handling: This refers to “insufficient validation, sanitization, and handling of the outputs generated by LLMs before they are passed downstream to other components and systems,” according to OWASP, which adds that since LLM-generated content can be controlled — or manipulated — by prompts, “this behavior is similar to providing users indirect access to additional functionality.”
- Excessive agency: OWASP describes “agency” as an ability a developer grants to an LLM-based system “to call functions or interface with other systems via extensions (sometimes referred to as tools, skills or plugins by different vendors) to undertake actions in response to a prompt.” But too much of that agency can lead to “damaging actions to be performed in response to unexpected, ambiguous or manipulated outputs from an LLM, regardless of what is causing the LLM to malfunction.”
- System prompt leakage: This“refers to the risk that the system prompts or instructions used to steer the behavior of the model can also contain sensitive information that was not intended to be discovered,” according to OWASP.
- Vector and embedding weaknesses: This vulnerability applies to systems using Retrieval Augmented Generation (RAG), which OWASP describes as “a model adaptation technique that enhances the performance and contextual relevance of responses from LLM applications, by combining pre-trained language models with external knowledge sources.” But using RAG means that “weaknesses in how vectors and embeddings are generated, stored, or retrieved can be exploited by malicious actions (intentional or unintentional) to inject harmful content, manipulate model outputs, or access sensitive information.”
- Misinformation: This refers to LLMs producing “false or misleading information that appears credible. This vulnerability can lead to security breaches, reputational damage, and legal liability.” Its most common cause is what has by now become one of the most well-known defects in LLMs, labeled “hallucination” — when the LLM fabricates content. OWASP adds that while hallucinations are a major source of misinformation, “they are not the only cause; biases introduced by the training data and incomplete information can also contribute.”
The OWASP list goes into significant additional detail, providing multiple examples of each vulnerability or attack technique, plus mitigation strategies. Linker said some of the mitigations that OWASP recommends for prompt injection, sensitive information disclosure, and improper output handling may apply to the flowbreaking defects described by Knostic.
All of which should help both developers and users to address the worst of the worst — the most dangerous flaws in LLM.
But whether flowbreaking really is new or only partially new, one thing is clear: LLMs and AI in general are an irresistible technology for both ethical and malicious hackers, who will continue to test drive it for the foreseeable future.
Bruce Schneier, author, blogger, and chief of security architecture at Inrupt, flagged the Knostic post on his own blog, noting that “In modern LLM systems, there is a lot of code between what you type and what the LLM receives, and between what the LLM produces and what you see. All that code is exploitable, and I expect many more vulnerabilities to be discovered in the coming year.”