Exploiting computer control agents
Computer control agents (CCAs) represent an evolution in how we interact with AI systems — autonomous tools that can directly operate our computers through the same graphical interfaces we use. While these tools promise to revolutionise our workflows, they also introduce a new class of cyber security vulnerabilities that deserve careful attention.
What are Computer Control Agents?
CCAs are AI systems that interpret natural language instructions and translate them into sequences of actions — clicking, typing, scrolling — to manipulate a computer screen interface just as humans do. Systems like Anthropic’s Computer Use work by processing screenshots, calculating pixel-based movements, and executing actions through virtual input devices. This “flipbook” approach allow AI models to operate any standard software without specialised APIs or integrations.
The vulnerability: indirect prompt injection
Unlike traditional prompt injection attacks where malicious instructions are given directly to an AI chatbot, indirect prompt injection in CCAs occurs when the agent encounters external content during task execution. This external content — whether from websites, files, or applications — can hijack the agent’s behaviour, derailing it from its original task.
The critical safety and security issue is that these agents can be manipulated through content they encounter while performing legitimate tasks, potentially leading to system compromise, data breaches, or unauthorised actions.
A taxonomy
The taxonomy categorises indirect prompt injection attacks along two dimensions: where the injection is introduced (i.e. the platform) and what the injection is (i.e. the technique used). This is specific to CCAs.
Vectors include:
- Website-based: Through social media, e-commerce sites, news platforms
- File-based: Embedded in PDFs, spreadsheets, presentations
- Application-based: Via email clients, messaging apps, productivity tools
Payload techniques include:
- Contextual exploitation: Instructions that appear relevant to the agent’s task
- Visual-spatial attacks: UI manipulation targeting vision-based agents
- Multi-stage attacks: Breaking malicious instructions into seemingly innocent steps
- Obfuscation: Hiding malicious instructions in complex structures or encoding
- System impersonation: Mimicking system messages to trick the agent
Real-world attack scenarios
Our research explored over 100 attack scenarios. Here are a few illustrative examples:
- E-commerce derailment: An agent tasked with checking a product price encounters an advertisement claiming the site is deprecated, leading it to an adversarial website with instructions to run local commands.
- Social media misinformation: An agent summarising financial results from a social media post is directed to an unofficial website containing misinformation, which it then reports to the user as factual.
- Destructive file processing: An agent processing image files encounters hidden text in one image instructing it to execute destructive system commands.
- Assignment review exploitation: A student embeds instructions in their assignment document that cause the grading agent to exfiltrate sensitive system information.
These diverse scenarios demonstrate how easily an agent’s initial, legitimate task can be subverted through external content it encounters, potentially leading to serious security breaches.
Key observations
Several patterns emerged from our testing:
- Content source matters: Agents appear more easily persuaded by instructions found in local files compared to web-based sources, perhaps reflecting greater trust in local content.
- Simplicity works: The most effective attacks often weren’t sophisticated jailbreaks but simple, contextual injections that subtly derailed the agent from its original task.
- Task alignment is critical: Indirect prompt injections are more successful when they align closely with the agent’s original task. Technical tasks are particularly vulnerable to technical exploitation.
Broader AI safety implications
CCAs represent a significant shift in AI capabilities — not because they’re inherently more intelligent, but because they can directly interact with our computing environments. This expanded interaction surface introduces new security vulnerabilities that traditional AI risk frameworks may not fully capture.
While these agents may maintain the same underlying AI safety classification level as their non-computer-controlling counterparts, their ability to affect real systems amplifies potential risks. If an attacker can successfully derail an agent, they effectively gain the ability to harness its full range of capabilities within the compromised environment.
Beyond external attacks on CCAs, an important consideration is their potential dual-use nature (such as misuse). These systems could be weaponized for offensive cyber operations like vulnerability scanning or automated exploit development, raising important questions for technical governance and AI safety frameworks.
Conclusion
As AI systems increasingly integrate into our workflows through direct computer control, we must recognise and address the unique security challenges they present. Indirect prompt injection represents a significant attack vector that bridges AI security and traditional cybersecurity concerns.
While Computer Use and similar tools represent promising advancements in human-AI collaboration, their early stage of development warrants caution — particularly when handling sensitive information or critical systems. The red teaming evaluation presented here reveals critical vulnerabilities but also highlights opportunities for developing more robust safeguards as these technologies mature.
For organisations deploying these technologies, understanding these risks is the first step toward responsible implementation. For developers, these insights can inform more resilient design. And for the broader AI safety community, this research underscores the importance of examining not just what AI systems can think, but what they can do.