AI Agents Should Be Treated as Untrusted Systems, Researchers Warn

Semi-realistic illustration of an AI agent in infrastructure, sandboxed with a shield and separated data/instructions.

Researchers released a paper arguing that AI agents should be treated as untrusted components because current models cannot reliably separate instructions from data. The warning challenges a core assumption behind autonomous-agent deployment, especially in systems where agents can call APIs, write files or move information across services.

The paper argues that model-level guardrails and stacked guard models are not enough to prevent prompt-injection attacks, data exfiltration or integrity failures. As agents gain operational authority, their risk profile moves beyond ordinary chatbot safety and into infrastructure, compliance and access-control design.

Prompt Guardrails Are Not Defense-in-Depth

The researchers identified a structural weakness in large language models: instructions and external data are processed within a single token stream. That makes it difficult for the model itself to reliably distinguish command intent from untrusted content, creating openings for malicious prompts to influence downstream actions.

Defensive prompting and semantic guardrails face the same problem. They rely on the same linguistic and statistical machinery as the primary agent, meaning attackers can often bypass them through rephrasing, indirect embedding or contextual manipulation.

Stacking additional guard models does not solve the issue, according to the paper. Guard layers can inherit similar biases and failure modes, leaving systems exposed even when multiple model-based filters appear to be in place.

The researchers described prompt hardening as fragile rather than foundational. If malicious content can cross from data into the control plane, the agent may authorize unsafe actions, leak sensitive information or corrupt system state.

Infrastructure Boundaries Become the Security Layer

The paper recommends a shift from trying to make agents inherently trustworthy to limiting what they can do. Least privilege should define every agent deployment, giving systems only the permissions necessary for the task at hand.

Sandboxing and runtime isolation are also central. Containment reduces blast radius when an agent misbehaves, preventing one compromised workflow from spreading across files, APIs or connected services.

The researchers also call for strict separation between instructions and data at the system boundary. External content should be treated as untrusted input, not as material that can redefine the agent’s operating rules.

Secure information flow is another key requirement. Infrastructure should enforce where sensitive data may travel, rather than allowing the agent to decide whether a destination is safe.

Tool design matters as well. Narrow, typed interfaces make validation more tractable, reducing ambiguity around what an agent is allowed to request, modify or transmit.

The governance implication is clear: autonomous agents create delegated authority that must be controlled, logged and tested. Product teams will need stronger identity, authorization, auditability and adversarial testing before placing agents into production workflows.

The broader takeaway is that agent safety is no longer only a model-quality problem. Treating agents as untrusted systems shifts investment toward containment, data-flow controls and accountable deployment architecture, raising the operational cost but reducing systemic risk.

Find Us on Socials

Join Our
Newsletter

Subscribe to get latest crypto news!

Latest News

You may also like

The Chain Observer
Privacy Overview

This website uses cookies so that we can provide you with the best user experience possible. Cookie information is stored in your browser and performs functions such as recognising you when you return to our website and helping our team to understand which sections of the website you find most interesting and useful.