Agent Runtime Security Feed Archive

Agent Runtime Security Feed Archive agent-runtime-security.html Long-term subscription RSS feed for Agent Runtime Security, collecting recently matched papers and archives. zh-CN Sun, 28 Jun 2026 05:24:06 +0000 Jailbreaking for the Average Jane: Choosing Optimal Jailbreaks via Bandit Algorithms for Automatically Enhanced Queries ../papers/arxiv-eca9abd16e6a.html https://arxiv.org/abs/2606.26936v1#2026-06-26#agent-runtime-security Fri, 26 Jun 2026 13:16:53 +0800 With a profusion of jailbreaks for LLMs now widely known, a growing concern is that non-expert malicious actors ("the average Jane") could elicit actionable responses to malicious requests. In this work, we examine whether this concern is justified. A non-expert malicious actor requires two ingredients for a successful attack: a powerful jailbreak for their target model, acting on an effective malicious query. For the former, we propose a novel attack strategy based on the multi-armed bandit fr… Do Safety Guardrails Need to Reason? LeanGuard: A Fast and Light Approach for Robust Moderation ../papers/arxiv-da78713ac31a.html https://arxiv.org/abs/2606.26686v1#2026-06-26#agent-runtime-security Fri, 26 Jun 2026 13:16:53 +0800 In order to screen a prompt or a response, the recent guardrail methods generate a chain-of-thought (CoT) before they issue a verdict. This design follows a common belief that step-by-step reasoning improves a decision. However, CoT also makes the guard heavy and slow, because the model must generate many tokens before it decides. This may not match how guardrails are actually deployed. 
A guardrail sometimes should not be heavy and slow, and it often runs on-device, for example on an embodied r… AgentX: Towards Agent-Driven Self-Iteration of Industrial Recommender Systems ../papers/arxiv-9259981f77a1.html https://arxiv.org/abs/2606.26859v1#2026-06-26#agent-runtime-security Fri, 26 Jun 2026 13:16:53 +0800 Recommendation algorithm iteration is moving from an artisanal, engineer-bound process toward an industrialized research loop, but this transition remains blocked by a structural execution bottleneck: the idea-to-launch cycle still depends on human engineers to generate hypotheses, modify production code, launch A/B experiments, and attribute online results. Innovation therefore scales linearly with headcount rather than compounding with evidence, compute, and accumulated experimental knowledge… MIRROR: Novelty-Constrained Memory-Guided MCTS Red-Teaming for Agentic RAG ../papers/arxiv-d0d97e7148fe.html https://arxiv.org/abs/2606.26793v1#2026-06-26#agent-runtime-security Fri, 26 Jun 2026 13:16:53 +0800 Multimodal agentic retrieval-augmented generation (RAG) systems expand the attack surface beyond prompt injection to include text poisoning, image injection, direct-query attacks, and orchestrator-level tool manipulation. Existing red-teaming approaches are typically surface-specific and often recycle known attack templates; on text-poisoning benchmarks we measure 73-84% exact duplication. We present MIRROR, a unified cross-surface framework that performs memory-guided Monte Carlo tree search w… How Reliable Is Your Jailbreak Judge? 
Calibration and Adversarial Robustness of Automated ASR Scoring ../papers/arxiv-25dbdb1a09a8.html https://arxiv.org/abs/2606.25487v1#2026-06-25#agent-runtime-security Thu, 25 Jun 2026 13:11:21 +0800 Almost every paper on LLM jailbreaks and prompt injection reports an attack-success rate (ASR), and that number is assigned not by people but by an automated judge: either a safety classifier trained for the task, or a general chat model prompted to grade. The judge is rarely checked. We check it. Using 596 human-labeled completions from the HarmBench classifier validation set, we compare the two judge families against human majority votes and then attack them. The two families fail in opposite… The Unfireable Safety Kernel: Execution-Time AI Alignment for AI Agents and Other Escapable AI Systems ../papers/arxiv-2e1379a16fa7.html https://arxiv.org/abs/2606.26057v1#2026-06-25#agent-runtime-security Thu, 25 Jun 2026 13:11:21 +0800 AI agents are granted access to tools, APIs, and other infrastructure, making them active principals in those systems. The dominant approach places controls inside the agent's own runtime: system prompts, output filters, and guardrail libraries. Any control in the agent's address space is reachable by inputs that influence it; this generalizes to any AI system with sufficient reach into its own runtime, a class we term escapable AI systems. We identify four properties that an authorization mech… AI Snitches Get Glitches: Towards Evading Agentic Surveillance ../papers/arxiv-2cadb88c3ffc.html https://arxiv.org/abs/2606.25836v1#2026-06-25#agent-runtime-security Thu, 25 Jun 2026 13:11:21 +0800 To better assist users with completing challenging tasks, AI agents mediate communications, access data, and interact with different APIs. Many employers (and even nation-states) already provide their users with this technology. However, widespread adoption of AI agents creates a new risk to abuse access to user data for another goal: surveilling users. 
These users might not even have the ability or permission to control the actions and data accesses of the surveilling agents. We introduce and… Burnyard: Future of Malware Analysis ../papers/arxiv-471450e505c8.html https://arxiv.org/abs/2606.24778v1#2026-06-24#agent-runtime-security Wed, 24 Jun 2026 13:06:49 +0800 Malware analysis is a critical aspect of modern cybersecurity. The prevailing industry practice, sandboxing, involves executing suspicious binaries within isolated virtual machines in large-scale data centers. However, this approach can unintentionally expose samples to public platforms such as VirusTotal and MalwareBazaar, and it is both resource-intensive and time-consuming. Burnyard addresses these limitations through a lightweight binary emulation platform that captures observable runtime b… LLMs Prompted for Legal Context Object More: Overrefusal from Small On-Premises LLMs in Criminal Legal Context ../papers/arxiv-7d9e141e8dab.html https://arxiv.org/abs/2606.24585v1#2026-06-24#agent-runtime-security Wed, 24 Jun 2026 13:06:49 +0800 While the validity of LLMs' use in the legal context remains subject to ethical and legal debate, legal professionals are already experimenting with personal LLMs, if only for translation and reformulation. However, even such a seemingly innocuous use can introduce biases through case processing speed if LLM assistants selectively refuse assistance on certain topics. To better anticipate such biases, we investigate several modern small LLMs that are most likely to be used as on-device assistant… Red-Teaming the Agentic Red-Team ../papers/arxiv-a2ceabb33333.html https://arxiv.org/abs/2606.24496v1#2026-06-24#agent-runtime-security Wed, 24 Jun 2026 13:06:49 +0800 The use of agentic systems to perform offensive security operations has moved from a theoretical possibility to a commoditized capability. 
However, while the community has focused on creating more and more capable agents, less attention has been allocated to assessing the security of those systems. In this work, we present the first in-depth security analysis of the most widely used agentic systems for offensive security operations. We show that most of these tools share common design flaws tha… PHANTOM: A Large-Scale Dataset of Multimodal Adversarial Attacks for Vision-Language Models ../papers/arxiv-5e668ce8c325.html https://arxiv.org/abs/2606.24388v1#2026-06-24#agent-runtime-security Wed, 24 Jun 2026 13:06:49 +0800 We introduce a large-scale, open-source dataset of pre-generated adversarial attacks for vision-language models (VLMs). The dataset is designed to be diverse, representative, and practical, extending existing benchmarks by covering 10 high-level categories and 55 subcategories of harmful intents. Our primary goal is to make adversarial data accessible to the research community, given the computational cost and complexity of generating large numbers of attacks. The dataset comprises 47 524 adver… Securing LLM-Agent Long-Term Memory Against Poisoning: Non-Malleable, Origin-Bound Authority with Machine-Checked Guarantees ../papers/arxiv-862177e8e257.html https://arxiv.org/abs/2606.24322v1#2026-06-24#agent-runtime-security Wed, 24 Jun 2026 13:06:49 +0800 LLM agents increasingly rely on persistent long-term memory, which creates a critical vulnerability that we study here: memory poisoning. An adversary can store untrusted content in one session that later steers a consequential action, such as a payment, a setting change, or data exfiltration, in a future session. Existing defenses base a memory item's authority to act on either its content (detection or trust-scoring) or its derivation history (lineage). 
We show that both signals are malleable… Pigeonholing: Bad prompts hurt models to collapse and make mistakes ../papers/arxiv-112c872ebf06.html https://arxiv.org/abs/2606.24267v1#2026-06-24#agent-runtime-security Wed, 24 Jun 2026 13:06:49 +0800 While in-context learning is generally shown to be effective in Large Language Models (LLMs), bad contexts can cause performance degradation and mode collapse, a phenomenon we call "pigeonholing." Unintentionally bad contexts can happen without malicious jailbreaking intents: For example, a user asks the model to justify an incorrect math theorem or fails to correct the model's buggy code. Specifically, we investigate "pigeonholing" in two scenarios: (1) when the user suggests a solution,… Capable but Careless: Do Computer-Use Agents Follow Contextual Integrity? ../papers/arxiv-9dcfdba2a88b.html https://arxiv.org/abs/2606.23189v1#2026-06-23#agent-runtime-security Tue, 23 Jun 2026 13:10:02 +0800 Computer-use agents (CUAs) now act on a user's behalf across personal applications such as email, calendars, and to-do lists. This cross-application access is useful, but it also creates a privacy risk that has been largely overlooked: when an agent works in one context, it can pull in information from another that is inappropriate in that context. Hence, we introduce AgentCIBench, an evaluation harness that turns this risk into executable, deterministically scored scenarios. We target three co… TROPT: An Open Framework for Unifying and Advancing Discrete Text Optimization ../papers/arxiv-0caba902aa17.html https://arxiv.org/abs/2606.23496v1#2026-06-23#agent-runtime-security Tue, 23 Jun 2026 13:10:02 +0800 Discrete text-trigger optimization -- searching for text sequences that, when ingested by a model, steer it toward a specified objective -- underpins model red-teaming (e.g., LLM jailbreaks), as well as auditing and interpretability. However, the current state of discrete optimizers hinders their adoption and progress. 
First, existing optimizers, when open-sourced at all, are scattered across research codebases tied to specific models, objectives, and problem domains. Second, optimizer variants… GIF: Locally Sound Geometric Information Flow Control for LLMs ../papers/arxiv-db5b699a6c55.html https://arxiv.org/abs/2606.23277v1#2026-06-23#agent-runtime-security Tue, 23 Jun 2026 13:10:02 +0800 Large language models increasingly mediate interactions between sensitive data, untrusted inputs, and privileged actions in agentic systems, creating security and privacy risks. These range from prompt injections that manipulate downstream tool use to leakage of confidential information through model outputs. Recent Information Flow Control (IFC)-based defenses show promise but lack a principled semantic foundation for reasoning about information flow through the model itself. Since any input t… What Do Safety-Aligned LLMs Learn From Mixed Compliance Demonstrations? ../papers/arxiv-d2c45c3b54a7.html https://arxiv.org/abs/2606.20508v1#2026-06-19#agent-runtime-security Fri, 19 Jun 2026 14:26:15 +0800 Prior work has shown that in-context demonstrations can jailbreak language models, but it remains unclear how models interpret different types of compliance demonstrations. We study this by mixing benign compliance demonstrations (non-harmful request, helpful response) with harmful compliance demonstrations (harmful request, helpful response) and testing three hypotheses about how demonstration composition drives harmful compliance. Across four models, we find that benign and harmful demonstrat… Analyzing Defensive Misdirection Against Model-Guided Automated Attacks on Agentic AI Systems ../papers/arxiv-c4b164c8bf0f.html https://arxiv.org/abs/2606.20470v1#2026-06-19#agent-runtime-security Fri, 19 Jun 2026 14:26:15 +0800 Agentic AI systems increasingly rely on language-model components to interpret instructions, process external data, invoke tools, and coordinate with other agents. 
These capabilities make prompt-injection and jailbreak attacks more consequential, especially as attackers adopt model-guided automation to scale probing, prompt refinement, and response evaluation. This work analyzes the resulting attack-defense setting through a probabilistic model of a target system, its defense mechanism, and the… RACL: Reasoning-Agent Control Layers for Continuous Metaheuristic Learning ../papers/arxiv-380550205212.html https://arxiv.org/abs/2606.20142v1#2026-06-19#agent-runtime-security Fri, 19 Jun 2026 14:26:15 +0800 This paper introduces RACL, a Reasoning-Agent Control Layer for metaheuristics. RACL places a reasoning agent above an existing optimizer. The agent does not replace the optimizer and does not modify business constraints. Instead, it controls the optimizer's internal search behavior by observing operational memory, reasoning over past behavior, formulating bounded hypotheses, testing interventions, evaluating outcomes, applying guardrails, consolidating useful policies and explaining its decisi… Beyond Static Endpoints: Tool Programs as an Interface for Flexible Agentic Web Services ../papers/arxiv-199c63f3e69d.html https://arxiv.org/abs/2606.19992v1#2026-06-19#agent-runtime-security Fri, 19 Jun 2026 14:26:15 +0800 In the agentic web era, LLM-based agents increasingly invoke web services as tools, yet most interfaces remain static endpoints that poorly express long-horizon workflows with loops, conditionals, joins, and retries. We present ToolPro, which represents an agent's tool intent as an executable tool program that compactly encodes multi-step service interactions with explicit effect types. 
ToolPro combines constraint-guided program construction, effect-aware replay for exactly-once s… CodeSentinel: A Three-Layer Defense Against Indirect Prompt Injection in Code Contexts ../papers/arxiv-4a4fa87e8c6b.html https://arxiv.org/abs/2606.19235v1#2026-06-18#agent-runtime-security Thu, 18 Jun 2026 14:03:08 +0800 Code large language models increasingly retrieve external code context from repositories, documentation, issue threads, and coding-agent environments, creating an indirect prompt-injection surface where attackers hide instructions in comments, strings, identifiers, or decoy code. We propose CodeSentinel, a three-layer inference-time sanitizer. It uses Tree-sitter to extract high-risk model-facing CST nodes, then combines syntax-guided pre-filtering, CST-guided Dynamic Min-K% scoring, and node… Seeing Is Not Screening: Multimodal Hidden Instruction Attacks on Agent Skill Scanners ../papers/arxiv-fca0d26d69a9.html https://arxiv.org/abs/2606.18198v1#2026-06-17#agent-runtime-security Wed, 17 Jun 2026 14:22:19 +0800 Agent skills are emerging as an important attack surface in LLM-based systems. Through an empirical study of existing skill scanners, we find that current defenses primarily rely on textual descriptions, manifests, and source code as the main signals for security analysis, which can leave visually conveyed malicious intent insufficiently examined. This creates a practical blind spot: harmful operational instructions hidden in images may bypass scanning while still being recoverable by multimoda… A Red-Team Study of Anthropic Fable 5 & Opus 4.8 Models ../papers/arxiv-b2420286985b.html https://arxiv.org/abs/2606.18193v1#2026-06-17#agent-runtime-security Wed, 17 Jun 2026 14:22:19 +0800 We evaluate the adversarial robustness of two frontier large language models (LLMs) developed by Anthropic, Fable 5 and Opus 4.8, against four families of automated jailbreak attack across 7 826 harmful intents spanning a ten-category harm taxonomy. 
Using the HackAgent red-teaming framework, hundreds of thousands of adversarial attempts were generated and every apparent success was independently re-adjudicated by a panel of three judge models (majority vote). Both models resist the majority of… PreAct: Computer-Using Agents that Get Faster on Repeated Tasks ../papers/arxiv-87a19bc64272.html https://arxiv.org/abs/2606.17929v1#2026-06-17#agent-runtime-security Wed, 17 Jun 2026 14:22:19 +0800 Computer-using agents drive real software through the screen -- clicking and typing -- but they solve every task from scratch: asked to repeat a task, an agent re-reads the screen, re-reasons every tap, and pays the full cost again. We present PreAct, which lets such an agent get faster on tasks it has done before. The first time it succeeds, PreAct compiles the run into a small state-machine program -- states that check the screen, transitions that act -- and on later runs replays it directly instea… Automated jailbreak attack targeting multiple defense strategies ../papers/arxiv-d433e566ae71.html https://arxiv.org/abs/2606.16751v1#2026-06-16#agent-runtime-security Tue, 16 Jun 2026 14:38:43 +0800 Large language models (LLMs) have demonstrated remarkable capabilities across a wide range of tasks. However, their safety remains a critical concern due to their susceptibility to adversarial prompt-based attacks. In this paper, we present UNIATTACK, an adversarial testing framework designed from a defense-oriented perspective to systematically construct effective black-box attack prompts. Unlike prior approaches that rely on static templates or iterative model-specific tuning, UNIATTACK extra… MyPCBench: A Benchmark for Personally Intelligent Computer-Use Agents ../papers/arxiv-9ae72ad425f0.html https://arxiv.org/abs/2606.16748v1#2026-06-16#agent-runtime-security Tue, 16 Jun 2026 14:38:43 +0800 Current benchmarks for computer-use agents evaluate models in impersonal environments. 
This leaves a gap between evaluation and deployment where personal assistants are expected to work across a user's whole digital life, including their context, historical data, and logged-in accounts. This gap is widest on web tasks, where live web evaluations cannot exercise sites that require logging in or personal information, the kind of site a real personal assistant has to drive. We introduce MyPCBench,… DoubtProbe: Black-Box Jailbreak Defense via Structural Verification and Semantic Auditing ../papers/arxiv-cac1c936b49a.html https://arxiv.org/abs/2606.16527v1#2026-06-16#agent-runtime-security Tue, 16 Jun 2026 14:38:43 +0800 As large language models (LLMs) are increasingly deployed in user-facing systems, black-box jailbreak defense has become an important practical problem. Existing defenses often rely on known-attack coverage, prompt-level semantic judgment, or local runtime control, yet these paths can become unstable under evolving prompt packaging, expression rewriting, and structure manipulation. We observe that many black-box jailbreaks do not remove the harmful goal, but reorganize the information needed to… KVEraser: Learning to Steer KV Cache for Efficient Localized Context Erasing ../papers/arxiv-1603705ba8b5.html https://arxiv.org/abs/2606.17034v1#2026-06-16#agent-runtime-security Tue, 16 Jun 2026 14:38:43 +0800 Post-hoc context erasing over the KV cache is challenging because a local edit has a global consequence: once a span has been processed, its influence propagates into the cached states of all subsequent tokens. This issue arises naturally in long-context LLM applications, where stale retrieved facts, incorrect tool observations, retracted user preferences, or harmful prompt injections may be identified only after prefill. 
Exact erasing must then recompute all tokens after the deleted span, maki… Adaptive and Explicit safe: Triggering Latent Safety Awareness in Large Reasoning Models ../papers/arxiv-55ea929d4f57.html https://arxiv.org/abs/2606.16808v1#2026-06-16#agent-runtime-security Tue, 16 Jun 2026 14:38:43 +0800 While Large Reasoning Models (LRMs) excel at complex tasks, they remain highly vulnerable to sophisticated jailbreaks and direct harmful queries. To address this vulnerability, prior works depend heavily on external manual data annotation for safety alignment. However, we observe that LRMs can inherently identify safety risks when being re-presented with original queries alongside their own reasoning trajectories -- a capability we term Latent Safety Awareness. To leverage this safety awareness… Neuro-Symbolic Agents for Regulated Process Automation: Challenges and Research Agenda ../papers/arxiv-29aa8644292b.html https://arxiv.org/abs/2606.13405v1#2026-06-12#agent-runtime-security Fri, 12 Jun 2026 13:55:02 +0800 LLM-based agents are entering regulated industries where they automate judgment intensive quality management processes. We argue that symbolic structures already embedded in these domains, including regulations, typed process models, and compliance constraints, should be treated not merely as external monitoring mechanisms but as core architectural components that shape the agent's decision-making and behavior. 
We propose compliance-by-construction as a complementary paradigm to guardrail-based… ComAct: Reframing Professional Software Manipulation via COM-as-Action Paradigm ../papers/arxiv-2035193c4622.html https://arxiv.org/abs/2606.13239v1#2026-06-12#agent-runtime-security Fri, 12 Jun 2026 13:55:02 +0800 Existing computer-use agents remain fundamentally limited in professional software manipulation: GUI-based agents suffer from fragile visual grounding and long-horizon error accumulation, while API-based approaches struggle with heterogeneous protocols and inaccessible commercial interfaces. In this work, we identify the Component Object Model (COM) as a unified executable abstraction, proposing COM-as-Action: a new paradigm that reframes professional software interaction as deterministic program… Getting Better at Working With You: Compiling User Corrections into Runtime Enforcement for Coding Agents ../papers/arxiv-2423b3a60f1a.html https://arxiv.org/abs/2606.13174v1#2026-06-12#agent-runtime-security Fri, 12 Jun 2026 13:55:02 +0800 Interactive LLM agents are becoming part of daily work, but they do not reliably become easier to work with over time: a correction remembered in one session may still be violated in the next. We study this gap between preference access and preference compliance. In tasks derived from anonymized real-user friction cases, Mem0 memory still leaves 57.5% of applicable preference checks violated. We introduce Test-time Rule Acquisition and Compiled Enforcement (TRACE), a drop-in skill-layer pipelin… No Hidden Prompts Needed! You Can Game AI Peer Review with Presentation-Only Revisions ../papers/arxiv-238401359f93.html https://arxiv.org/abs/2606.13044v1#2026-06-12#agent-runtime-security Fri, 12 Jun 2026 13:55:02 +0800 As AI-generated reviews move from experimental tools into peer-review infrastructure, most robustness concerns have focused on explicit attacks such as hidden instructions and prompt injection. 
We study a harder and more policy-relevant failure mode: no hidden text, no prompt injection, and no changes to methods, experiments, figures, equations, proofs, or numerical results. The attacker modifies only presentation-level content, such as the abstract, contribution framing, related work, discussi… Nous: An Attempt to Extract and Inject the Cognition Behind Prediction-Market Behavior ../papers/arxiv-35fa580490f4.html https://arxiv.org/abs/2606.13038v1#2026-06-12#agent-runtime-security Fri, 12 Jun 2026 13:55:02 +0800 As LLM agents proliferate in prediction markets and collective decision-making, they risk a cognitive monoculture: agents built on shared foundation models produce correlated forecasts, and recent measurement finds frontier-model errors correlated at r ~ 0.77. We ask whether human cognitive diversity can be recovered from behavior and transferred to LLM agents. Nous extracts a structured eight-dimension behavioral profile from real Polymarket trading activity and injects it into agents through… Grammar-Constrained Decoding Can Jailbreak LLMs into Generating Malicious Code ../papers/arxiv-bf45a9834bb9.html https://arxiv.org/abs/2606.11817v1#2026-06-11#agent-runtime-security Thu, 11 Jun 2026 13:59:12 +0800 Large Language Models (LLMs) are increasingly used for code generation, raising concerns that they may be misused to produce malicious code. Meanwhile, Grammar-Constrained Decoding (GCD) has been widely adopted to improve the reliability of LLM-generated code by enforcing syntactic validity. In this paper, we reveal a counterintuitive risk: this reliability-oriented technique can itself become an attack surface. 
We uncover a new jailbreak attack, termed CodeSpear, that exploits GCD to induce LL… OCELOT: Inference-Leakage Budgets for Privacy-Preserving LLM Agents ../papers/arxiv-4d5c20727d4b.html https://arxiv.org/abs/2606.12341v1#2026-06-11#agent-runtime-security Thu, 11 Jun 2026 13:59:12 +0800 Large language model (LLM) agents increasingly act on a user's behalf -- reading personal files, calling tools, transacting with external services -- possibly leaking personally identifiable information (PII) across trust boundaries at every step. Privacy here is a property not of a single output but of an entire trajectory, and three properties make it hard: leakage is cumulative, as individually innocuous releases accumulate across honest-but-curious or colluding sinks into inferences about a… Online Shift Detection and Conformal Adaptation for Deployed Safety Classifiers ../papers/arxiv-9418cc76eedf.html https://arxiv.org/abs/2606.11949v1#2026-06-11#agent-runtime-security Thu, 11 Jun 2026 13:59:12 +0800 We present an online monitoring system for distributional shift in deployed safety classifiers, using calibrated sequential statistics to detect when a classifier has moved out of distribution. Upon detection, a conformal abstention layer adapts decision thresholds to recover a target error rate epsilon=0.1. In a pre-registered factorial evaluation (4 classifiers x 5 shift conditions x 20 seeds x 2 window sizes, 800 cells), the system achieves 86.6% valid detection (693/800, 95% CI [84.1%, 88.8… External Experience Serving in Production LLM Systems: A Deployment-Oriented Study of Quality-Cost Trade-offs ../papers/arxiv-bdd55bbe16b0.html https://arxiv.org/abs/2606.11806v1#2026-06-11#agent-runtime-security Thu, 11 Jun 2026 13:59:12 +0800 Production LLM systems accumulate reusable operational experience, but the practical deployment issue is not merely whether such experience can help. 
It is how different serving strategies trade off quality against online cost under realistic constraints. Injecting external experience can improve task quality, yet it also increases prompt burden, latency, and serving pressure. We study external experience serving as a deployment-oriented quality-cost trade-off problem. We evaluate this… Toward Secure LLM Agents: Threat Surfaces, Attacks, Defenses, and Evaluation ../papers/arxiv-3e30da0f0823.html https://arxiv.org/abs/2606.10749v1#2026-06-10#agent-runtime-security Wed, 10 Jun 2026 13:25:04 +0800 Large language model (LLM) agents are rapidly moving from conversational interfaces to software components that plan, invoke tools, maintain memory, and act on external environments. This transition changes the nature of security risk. In agentic settings, failures are no longer limited to unsafe text generation. Untrusted content may redirect control flow, misuse tool privileges, corrupt persistent state, leak sensitive information, or trigger harmful external actions. At the same time, resear… Workflow-GYM: Towards Long-Horizon Evaluation of Computer-use Agentic tasks in Real-World Professional Fields ../papers/arxiv-0c1951bb8319.html https://arxiv.org/abs/2606.11042v1#2026-06-10#agent-runtime-security Wed, 10 Jun 2026 13:25:04 +0800 Recent years have witnessed the rapid evolution of AI agents toward handling increasingly complex, real-world tasks. However, existing benchmarks rarely evaluate whether agents can operate graphical user interfaces to complete long-horizon, high-value professional workflows across diverse domains. 
Current GUI benchmarks still predominantly focus on general-purpose software, relatively simple applications, and short-horizon tasks, leaving it largely unknown whether modern agents can follow user… Data Journalist Agent: Transforming Data into Verifiable Multimodal Stories ../papers/arxiv-c1a0a781eb5e.html https://arxiv.org/abs/2606.11176v1#2026-06-10#agent-runtime-security Wed, 10 Jun 2026 13:25:04 +0800 Data tells stories that shape society; the data journalist's job is to turn raw information into stories non-experts can trust. A high-quality news feature takes a newsroom team weeks: hunting for context, running statistics, choosing an angle, and designing visuals. Recent agents handle individual steps well: data-science agents close the analysis loop, while design agents synthesize beautiful websites. But can an agent serve as a data journalist end to end? We introduce Data Journalist Agent… It Takes One to Bias Them All: Breaking Bad with One-Shot GRPO ../papers/arxiv-41831e36aa6f.html https://arxiv.org/abs/2606.10931v1#2026-06-10#agent-runtime-security Wed, 10 Jun 2026 13:25:04 +0800 Warning: This paper contains several toxic and offensive statements. Modern large language models (LLMs) are typically aligned through large-scale post-training to ensure fair and reliable behavior. In this work, we investigate how easily such guardrails can be broken by Group Relative Policy Optimization (GRPO). 
We show that one-shot GRPO training on a single biased example is sufficient to induce systematic bias, with stereotype-driven reasoning generalizing across attributes, categories, and… Training LLMs to Enforce Multi-Level Instruction Hierarchies via Gravity-Weighted Direct Preference Optimization ../papers/arxiv-c80cef967337.html https://arxiv.org/abs/2606.10860v1#2026-06-10#agent-runtime-security Wed, 10 Jun 2026 13:25:04 +0800 Production LLMs receive instructions from sources with very different levels of trust, yet attend to every token with uniform architectural privilege. This is the structural vulnerability that enables malicious prompt injections and, more broadly, leaves models without a principled way to resolve conflicts between legitimate but competing instructions. A common training-based response is to teach models an explicit instruction hierarchy; existing approaches, however, formalize hierarchies of on… When the Chain of Thought Knows Better: Failure Modes in Multi-Turn Reasoning Models ../papers/arxiv-d9957907bca3.html https://arxiv.org/abs/2606.10740v1#2026-06-10#agent-runtime-security Wed, 10 Jun 2026 13:25:04 +0800 Failures in multi-turn reasoning models are largely invisible to terminal-score evaluation. A model can lock onto an unsafe stance early in a long dialogue, yet its final-turn refusal rate may appear indistinguishable from a robustly aligned baseline. To expose these hidden temporal dynamics, we propose a trace-level diagnostic - the CoT-Output 2x2 safety matrix. 
This framework labels every turn along two independent axes (internal reasoning and visible output), yielding four operationally defi… Two-Way Confidential VMs (2cVM): Collaborative Confidential Computing for Mutually Distrustful Parties ../papers/arxiv-80675a0579e2.html https://arxiv.org/abs/2606.10615v1#2026-06-10#agent-runtime-security Wed, 10 Jun 2026 13:25:04 +0800 Collaborative computation across organizations is often constrained by the need to process sensitive data and proprietary code without exposing them to untrusted infrastructure or participants. Cryptographic approaches such as fully homomorphic encryption and secure multi-party computation provide strong confidentiality but remain impractical for general workloads due to their extreme computational cost. We present the Two-Way Confidential Virtual Machine (2cVM), a two-layer architecture that p… WeaveBench: A Long-Horizon, Real-World Benchmark for Computer-Use Agents with Hybrid Interfaces ../papers/arxiv-df62d5981d92.html https://arxiv.org/abs/2606.09426v1#2026-06-09#agent-runtime-security Tue, 09 Jun 2026 13:12:49 +0800 Computer-use agents (CUAs) increasingly operate in runtimes that combine visual desktop control, command-line execution, code editing, browsers, and external tools. Existing benchmarks, however, often evaluate these interfaces as separable capabilities, leaving long-horizon cross-interface orchestration under-tested. 
Thus, we introduce WeaveBench, a long-horizon hybrid-interface benchmark with 114 tasks across 8 real-world work domains, grounded in real user requests and publicly verifiable art… Brain-Prompt Injection: A Route-Safety Audit for BCI-LLM Agents ../papers/arxiv-62e95d1bbf77.html https://arxiv.org/abs/2606.09315v1#2026-06-09#agent-runtime-security Tue, 09 Jun 2026 13:12:49 +0800 BCI-to-agent pipelines turn decoded neural activity into an authorization channel for tool-use agents, exposing a new attack surface we call brain-prompt injection: signal-side perturbations, context-only injections, and adaptive dual-decoder attacks can all change the routed action while EEG-side or text-side monitors remain blind. Route safety in this stack depends on what the audit log can observe, not on decoder accuracy or agreement alone. We define a Route-Safety Audit Contract: a… What the Eyes See, the LLMs Miss: Exploiting Human Perception for Adversarial Text Attacks ../papers/arxiv-c78b7a180254.html https://arxiv.org/abs/2606.09700v1#2026-06-09#agent-runtime-security Tue, 09 Jun 2026 13:12:49 +0800 Large language model (LLM)-powered content moderation systems have become a critical defense against harmful online content. However, these systems primarily operate on tokenized text and largely ignore the visual cues that humans naturally rely on when interpreting content. We show that this discrepancy creates a fundamental perceptual mismatch: content that is readily recognized as harmful by humans can become effectively invisible to automated moderation systems. To study this vulnerability,… PRISM: Recovering Instruction Sets from Language Model Activations ../papers/arxiv-d104b9c3b446.html https://arxiv.org/abs/2606.09563v1#2026-06-09#agent-runtime-security Tue, 09 Jun 2026 13:12:49 +0800 As LLMs are deployed as agents, reliable monitoring requires knowing not only what they output, but which instructions are steering their behavior. 
This is difficult when models infer unintended subgoals, follow contextual cues, or are influenced by prompt injections and hidden objectives. While activation-to-language methods suggest that hidden states can reveal natural-language information, existing approaches are not designed to recover the full set of simultaneous instructions, constraints,… GuardNet: Ensemble Strategies of Shallow Neural Networks for Robust Prompt Injection and Jailbreak Detection ../papers/arxiv-50a6cf67d9a1.html https://arxiv.org/abs/2606.05566#2026-06-05#agent-runtime-security Fri, 05 Jun 2026 13:25:00 +0800 Large Language Models (LLMs) have transformed natural language processing, but they remain vulnerable to Prompt Injection (PI) and Jailbreak (JB) attacks. In addition, benchmark evaluations may be affected by contamination and partial information leakage, compromising performance estimates. This work presents GuardNet, a guardrail system based on an ensemble of shallow neural networks (BiLSTMs) with approximately 47 million parameters. We investigate the hypothesis that robustness in adversaria…