prompt injection Topic Archive

prompt-injection.html Long-term tracking RSS feed for the keyword "prompt injection", collecting all previously matched papers. zh-CN Sun, 28 Jun 2026 05:24:06 +0000

Prompt Injection in Automated Résumé Screening with Large Language Models: Single and Multi-Injection Settings ../papers/arxiv-663bd6d3e1b5.html https://arxiv.org/abs/2606.27287v1#2026-06-26#prompt-injection Fri, 26 Jun 2026 13:16:53 +0800 Large language models (LLMs) are increasingly used to screen and rank job applicants, creating incentives for candidates to strategically manipulate algorithmic hiring systems. We study prompt injection in automated résumé screening, defined as subtle self-promotional text that introduces no new qualifications but is designed to influence LLM evaluations. Using controlled experiments, we show that prompt injection reliably improves applicant rankings when résumé quality is homogeneous and few c…

MIRROR: Novelty-Constrained Memory-Guided MCTS Red-Teaming for Agentic RAG ../papers/arxiv-d0d97e7148fe.html https://arxiv.org/abs/2606.26793v1#2026-06-26#prompt-injection Fri, 26 Jun 2026 13:16:53 +0800 Multimodal agentic retrieval-augmented generation (RAG) systems expand the attack surface beyond prompt injection to include text poisoning, image injection, direct-query attacks, and orchestrator-level tool manipulation. Existing red-teaming approaches are typically surface-specific and often recycle known attack templates; on text-poisoning benchmarks we measure 73-84% exact duplication. We present MIRROR, a unified cross-surface framework that performs memory-guided Monte Carlo tree search w…

How Reliable Is Your Jailbreak Judge?
Calibration and Adversarial Robustness of Automated ASR Scoring ../papers/arxiv-25dbdb1a09a8.html https://arxiv.org/abs/2606.25487v1#2026-06-25#prompt-injection Thu, 25 Jun 2026 13:11:21 +0800 Almost every paper on LLM jailbreaks and prompt injection reports an attack-success rate (ASR), and that number is assigned not by people but by an automated judge: either a safety classifier trained for the task, or a general chat model prompted to grade. The judge is rarely checked. We check it. Using 596 human-labeled completions from the HarmBench classifier validation set, we compare the two judge families against human majority votes and then attack them. The two families fail in opposite…

AI Snitches Get Glitches: Towards Evading Agentic Surveillance ../papers/arxiv-2cadb88c3ffc.html https://arxiv.org/abs/2606.25836v1#2026-06-25#prompt-injection Thu, 25 Jun 2026 13:11:21 +0800 To better assist users with completing challenging tasks, AI agents mediate communications, access data, and interact with different APIs. Many employers (and even nation-states) already provide their users with this technology. However, widespread adoption of AI agents creates a new risk to abuse access to user data for another goal: surveilling users. These users might not even have the ability or permission to control the actions and data accesses of the surveilling agents. We introduce and…

GIF: Locally Sound Geometric Information Flow Control for LLMs ../papers/arxiv-db5b699a6c55.html https://arxiv.org/abs/2606.23277v1#2026-06-23#prompt-injection Tue, 23 Jun 2026 13:10:02 +0800 Large language models increasingly mediate interactions between sensitive data, untrusted inputs, and privileged actions in agentic systems, creating security and privacy risks. These range from prompt injections that manipulate downstream tool use to leakage of confidential information through model outputs.
Recent Information Flow Control (IFC)-based defenses show promise but lack a principled semantic foundation for reasoning about information flow through the model itself. Since any input t…

CodeSentinel: A Three-Layer Defense Against Indirect Prompt Injection in Code Contexts ../papers/arxiv-4a4fa87e8c6b.html https://arxiv.org/abs/2606.19235v1#2026-06-18#prompt-injection Thu, 18 Jun 2026 14:03:08 +0800 Code large language models increasingly retrieve external code context from repositories, documentation, issue threads, and coding-agent environments, creating an indirect prompt-injection surface where attackers hide instructions in comments, strings, identifiers, or decoy code. We propose CodeSentinel, a three-layer inference-time sanitizer. It uses Tree-sitter to extract high-risk model-facing CST nodes, then combines syntax-guided pre-filtering, CST-guided Dynamic Min-K% scoring, and node…

KVEraser: Learning to Steer KV Cache for Efficient Localized Context Erasing ../papers/arxiv-1603705ba8b5.html https://arxiv.org/abs/2606.17034v1#2026-06-16#prompt-injection Tue, 16 Jun 2026 14:38:43 +0800 Post-hoc context erasing over the KV cache is challenging because a local edit has a global consequence: once a span has been processed, its influence propagates into the cached states of all subsequent tokens. This issue arises naturally in long-context LLM applications, where stale retrieved facts, incorrect tool observations, retracted user preferences, or harmful prompt injections may be identified only after prefill. Exact erasing must then recompute all tokens after the deleted span, maki…

Who Pays the Price?
Stakeholder-Centric Prompt Injection Benchmarking for Real-world Web Agents ../papers/arxiv-eae3b85b0b57.html https://arxiv.org/abs/2606.13385v1#2026-06-12#prompt-injection Fri, 12 Jun 2026 13:55:02 +0800 Web agents driven by large language models (LLMs) are increasingly deployed in real-world environments, where they operate over untrusted web content and execute actions with direct consequences. This makes them vulnerable to prompt-injection attacks, in which seemingly benign content embeds adversarial instructions that manipulate agent behaviour. Existing security benchmarks adopt an attack-centric perspective, focusing on the technical feasibility of injections while overlooking the…

No Hidden Prompts Needed! You Can Game AI Peer Review with Presentation-Only Revisions ../papers/arxiv-238401359f93.html https://arxiv.org/abs/2606.13044v1#2026-06-12#prompt-injection Fri, 12 Jun 2026 13:55:02 +0800 As AI-generated reviews move from experimental tools into peer-review infrastructure, most robustness concerns have focused on explicit attacks such as hidden instructions and prompt injection. We study a harder and more policy-relevant failure mode: no hidden text, no prompt injection, and no changes to methods, experiments, figures, equations, proofs, or numerical results. The attacker modifies only presentation-level content, such as the abstract, contribution framing, related work, discussi…

Nous: An Attempt to Extract and Inject the Cognition Behind Prediction-Market Behavior ../papers/arxiv-35fa580490f4.html https://arxiv.org/abs/2606.13038v1#2026-06-12#prompt-injection Fri, 12 Jun 2026 13:55:02 +0800 As LLM agents proliferate in prediction markets and collective decision-making, they risk a cognitive monoculture: agents built on shared foundation models produce correlated forecasts, and recent measurement finds frontier-model errors correlated at r ~ 0.77.
We ask whether human cognitive diversity can be recovered from behavior and transferred to LLM agents. Nous extracts a structured eight-dimension behavioral profile from real Polymarket trading activity and injects it into agents through…

External Experience Serving in Production LLM Systems: A Deployment-Oriented Study of Quality-Cost Trade-offs ../papers/arxiv-bdd55bbe16b0.html https://arxiv.org/abs/2606.11806v1#2026-06-11#prompt-injection Thu, 11 Jun 2026 13:59:12 +0800 Production LLM systems accumulate reusable operational experience, but the practical deployment issue is not merely whether such experience can help. It is how different serving strategies trade off quality against online cost under realistic constraints. Injecting external experience can improve task quality, yet it also increases prompt burden, latency, and serving pressure. We study external experience serving as a deployment-oriented quality-cost trade-off problem. We evaluate this…

Toward Secure LLM Agents: Threat Surfaces, Attacks, Defenses, and Evaluation ../papers/arxiv-3e30da0f0823.html https://arxiv.org/abs/2606.10749v1#2026-06-10#prompt-injection Wed, 10 Jun 2026 13:25:04 +0800 Large language model (LLM) agents are rapidly moving from conversational interfaces to software components that plan, invoke tools, maintain memory, and act on external environments. This transition changes the nature of security risk. In agentic settings, failures are no longer limited to unsafe text generation. Untrusted content may redirect control flow, misuse tool privileges, corrupt persistent state, leak sensitive information, or trigger harmful external actions.
At the same time, resear…

Training LLMs to Enforce Multi-Level Instruction Hierarchies via Gravity-Weighted Direct Preference Optimization ../papers/arxiv-c80cef967337.html https://arxiv.org/abs/2606.10860v1#2026-06-10#prompt-injection Wed, 10 Jun 2026 13:25:04 +0800 Production LLMs receive instructions from sources with very different levels of trust, yet attend to every token with uniform architectural privilege. This is the structural vulnerability that enables malicious prompt injections and, more broadly, leaves models without a principled way to resolve conflicts between legitimate but competing instructions. A common training-based response is to teach models an explicit instruction hierarchy; existing approaches, however, formalize hierarchies of on…

Brain-Prompt Injection: A Route-Safety Audit for BCI-LLM Agents ../papers/arxiv-62e95d1bbf77.html https://arxiv.org/abs/2606.09315v1#2026-06-09#prompt-injection Tue, 09 Jun 2026 13:12:49 +0800 BCI-to-agent pipelines turn decoded neural activity into an authorization channel for tool-use agents, exposing a new attack surface we call brain-prompt injection: signal-side perturbations, context-only injections, and adaptive dual-decoder attacks can all change the routed action while EEG-side or text-side monitors remain blind. Route safety in this stack depends on what the audit log can observe, not on decoder accuracy or agreement alone. We define a Route-Safety Audit Contract: a…

PRISM: Recovering Instruction Sets from Language Model Activations ../papers/arxiv-d104b9c3b446.html https://arxiv.org/abs/2606.09563v1#2026-06-09#prompt-injection Tue, 09 Jun 2026 13:12:49 +0800 As LLMs are deployed as agents, reliable monitoring requires knowing not only what they output, but which instructions are steering their behavior. This is difficult when models infer unintended subgoals, follow contextual cues, or are influenced by prompt injections and hidden objectives.
While activation-to-language methods suggest that hidden states can reveal natural-language information, existing approaches are not designed to recover the full set of simultaneous instructions, constraints,…

GuardNet: Ensemble Strategies of Shallow Neural Networks for Robust Prompt Injection and Jailbreak Detection ../papers/arxiv-50a6cf67d9a1.html https://arxiv.org/abs/2606.05566#2026-06-05#prompt-injection Fri, 05 Jun 2026 13:25:00 +0800 Large Language Models (LLMs) have transformed natural language processing, but they remain vulnerable to Prompt Injection (PI) and Jailbreak (JB) attacks. In addition, benchmark evaluations may be affected by contamination and partial information leakage, compromising performance estimates. This work presents GuardNet, a guardrail system based on an ensemble of shallow neural networks (BiLSTMs) with approximately 47 million parameters. We investigate the hypothesis that robustness in adversaria…

What If Prompt Injection Never Left? Exploring Cross-Session Stored Prompt Injection in Agentic Systems ../papers/arxiv-36af303851a6.html https://arxiv.org/abs/2606.04425#2026-06-04#prompt-injection Thu, 04 Jun 2026 14:02:06 +0800 Modern agentic systems transform LLMs from session-bounded assistants into stateful systems that persist and evolve shared world state across sessions through memories, filesystems, tools, and other long-lived contextual artifacts. This shift fundamentally expands the attack surface of prompt injection.
However, prior works on prompt injection have largely focused on model-level threats within a single session, overlooking how cross-session persistent system state fundamentally changes the syst…

Caught in the Act(ivation): Toward Pre-Output and Multi-Turn Detection of Credential Exfiltration by LLM Agents ../papers/arxiv-4dafb8f9ce98.html https://arxiv.org/abs/2606.04141#2026-06-04#prompt-injection Thu, 04 Jun 2026 14:02:06 +0800 LLM agents often place sensitive credentials in the same context window as untrusted retrieved content, creating a direct path for indirect prompt injection to induce credential exfiltration. We study this failure mode through three complementary defenses. First, we ask whether activation probes can detect credential access before output tokens are emitted. Second, we construct honeytokens from format-specific character models and calibrate detection with split conformal prediction. Third, we t…

From Untrusted Input to Trusted Memory: A Systematic Study of Memory Poisoning Attacks in LLM Agents ../papers/arxiv-8f4b9a446200.html https://arxiv.org/abs/2606.04329#2026-06-04#prompt-injection Thu, 04 Jun 2026 14:02:06 +0800 Memory is a core component of AI agents, enabling them to accumulate knowledge across interactions and improve performance. However, persistent memory introduces the risk of memory poisoning, where a single adversarial memory write can exert long-term influence over agent behavior. We present a systematic study of memory poisoning in LLM-based agents.
We identify four memory write channels and nine structural vulnerabilities in model capabilities, system prompt design, and agent system architec…

From Control Boundary to Insurance Claim: Reconstructing AI-Mediated Losses Through the CER Framework ../papers/arxiv-f9dbc4fc0301.html https://arxiv.org/abs/2606.03777#2026-06-03#prompt-injection Wed, 03 Jun 2026 14:09:56 +0800 AI losses that arise through an insured organization's generative or agentic AI system require state reconstruction, not merely event reconstruction, because the relevant state changes as the system reasons, retrieves, calls tools, and acts. The relevant question is not only what loss occurred, but what the system was allowed to do, what it actually did, and whether that reconstructed loss can support insurance claim recovery. This paper addresses losses in which the insured's AI system is in t…

AgentRedBench: Dynamic Redteaming and Integration-Aware Defense for LLM Agents over SaaS Integrations ../papers/arxiv-c3bd67dd1e74.html https://arxiv.org/abs/2606.02240v1#2026-06-02#prompt-injection Tue, 02 Jun 2026 13:56:35 +0800 Indirect prompt injection in tool-use agents is a concrete production threat: LLM agents read from integrations (third-party services such as Gmail, Salesforce, or Jira accessed through tool calls) whose response content the user neither writes nor controls. Existing benchmarks under-measure the threat: most cover only a handful of integrations with the same attack payload replayed across runs, and open-source guards are trained on chat-style data rather than tool-response content. We introduce…

LACUNA: Safe Agents as Recursive Program Holes ../papers/arxiv-38170f767971.html https://arxiv.org/abs/2605.28617v1#2026-05-28#prompt-injection Thu, 28 May 2026 13:15:52 +0800 LLM agents increasingly act by writing code, yet a split persists between the runtime that drives the agent and the code the model writes.
The runtime owns the loop, context, and control flow, and the model has little say over any of them. Letting model-written code shape the runtime itself would make agents more expressive, but it would also sharpen safety problems. A model can be diverted by a prompt injection, call the wrong tool, or fail partway and leave an inconsistent state, and each suc…

Prompt Injection Detection is Regime-Dependent: A Deployment-Aware Evaluation with Interpretable Structural Signals ../papers/arxiv-aa2f4d96a852.html https://arxiv.org/abs/2605.26999v1#2026-05-27#prompt-injection Wed, 27 May 2026 13:23:19 +0800 Prompt injection poses a critical threat to the safe deployment of large language models, yet existing detection approaches are typically evaluated under limited settings that do not reflect real-world operating constraints. In this work, we present a deployment-aware evaluation of prompt injection detection using a multi-model and multi-regime experimental framework. We compare lexical, semantic, structural, and transformer-based detectors across multiple out-of-distribution settings, repeated…

How Agentic AI Coding Assistants Become the Attacker's Shell ../papers/arxiv-1c17bad504b8.html https://arxiv.org/abs/2605.25871v1#2026-05-26#prompt-injection Tue, 26 May 2026 13:09:24 +0800 Agentic AI coding assistants can edit files, run commands, and access the internet on behalf of developers. However, their reliance on unvetted external artifacts introduces a new attack vector. Hidden instructions in external artifacts can hijack these assistants, turning them into an attacker's shell to run unauthorized commands.
In this article, we examine how these prompt injection attacks work, measure their prevalence, discuss the limitations and challenges of current defenses, and sugges…

An Empirical Study of Privacy Leakage Chains via Prompt Injection in Black-Box Chatbot Environments ../papers/arxiv-0ec08efc6fec.html https://arxiv.org/abs/2605.18133v1#2026-05-19#prompt-injection Tue, 19 May 2026 13:08:04 +0800 LLM-based chatbot agents increasingly process user requests by combining natural-language reasoning with external tools such as web browsing. These capabilities improve usability, but they also create attack surfaces when untrusted external content is processed as part of a user's task. This paper studies a privacy-leakage attack chain based on indirect prompt injection in black-box chatbot environments, where the attacker has no access to model weights, system prompts, or agent implementation…

Overeager Coding Agents: Measuring Out-of-Scope Actions on Benign Tasks ../papers/arxiv-2a81926568cc.html https://arxiv.org/abs/2605.18583v1#2026-05-19#prompt-injection Tue, 19 May 2026 13:08:04 +0800 Coding agents now run autonomously with shell, file, and network privileges. When a user issues a benign request, the agent sometimes does more than asked: it deletes unrelated files, wipes a stale credentials backup, or rewrites configuration the user never mentioned. We call these scope expansions overeager actions, an authorization problem distinct from capability failures, prompt injection, or sandbox escapes.
We present OverEager-Gen, a benchmark dedicated to overeager behavior on benign t…

Sleeper Channels and Provenance Gates: Persistent Prompt Injection in Always-on Autonomous AI Agents ../papers/arxiv-37e0f79684f8.html https://arxiv.org/abs/2605.13471v1#2026-05-14#prompt-injection Thu, 14 May 2026 12:52:54 +0800 Always-on AI agents (OpenClaw, Hermes Agent) run as a single persistent process under the owner's identity, folding messaging, memory, self-authored skills, scheduling, and shell into one authority boundary. This configuration opens what we call sleeper channels: an untrusted input to one surface persists as a memory, skill, scheduled job, or filesystem patch, then fires later through a different surface with no attacker present. Two independent axes define the class: persistence substra…

RUBEN: Rule-Based Explanations for Retrieval-Augmented LLM Systems ../papers/arxiv-6185678dd76d.html https://arxiv.org/abs/2605.10862v1#2026-05-12#prompt-injection Tue, 12 May 2026 12:42:08 +0800 This paper demonstrates RUBEN, an interactive tool for discovering minimal rules to explain the outputs of retrieval-augmented large language models (LLMs) in data-driven applications. We leverage novel pruning strategies to efficiently identify a minimal set of rules that subsume all others. We further demonstrate novel applications of these rules for LLM safety, specifically to test the resiliency of safety training and effectiveness of adversarial prompt injections. Source group: Agent Runtime Securi…

Constraining Host-Level Abuse in Self-Hosted Computer-Use Agents via TEE-Backed Isolation ../papers/arxiv-24f56ae93690.html https://arxiv.org/abs/2605.06393v1#2026-05-08#prompt-injection Fri, 08 May 2026 14:15:32 +0800 Self-hosted computer-use agents (SHCUAs), such as OpenClaw, combine natural-language interaction with direct access to host-side resources, including browsers, files, scripts, system commands, and external communication channels.
While useful for automating real tasks, this capability also creates a host-level abuse surface: a legitimately deployed agent may be steered toward unsafe operations through malicious messages, indirect prompt injection, unsafe skills, or tampering along the host-side…

Detecting Safety Violations Across Many Agent Traces ../papers/arxiv-5cf42310e590.html https://arxiv.org/abs/2604.11806v1#2026-04-14#prompt-injection Tue, 14 Apr 2026 11:37:06 +0800 To identify safety violations, auditors often search over large sets of agent traces. This search is difficult because failures are often rare, complex, and sometimes even adversarially hidden and only detectable when multiple traces are analyzed together. These challenges arise in diverse settings such as misuse campaigns, covert sabotage, reward hacking, and prompt injection. Existing approaches struggle here for several reasons. Per-trace judges miss failures that only become visible across…

ClawGuard: A Runtime Security Framework for Tool-Augmented LLM Agents Against Indirect Prompt Injection ../papers/arxiv-c894eb6a7f68.html https://arxiv.org/abs/2604.11790v1#2026-04-14#prompt-injection Tue, 14 Apr 2026 11:37:06 +0800 Tool-augmented Large Language Model (LLM) agents have demonstrated impressive capabilities in automating complex, multi-step real-world tasks, yet remain vulnerable to indirect prompt injection. Adversaries exploit this weakness by embedding malicious instructions within tool-returned content, which agents directly incorporate into their conversation history as trusted observations. This vulnerability manifests across three primary attack channels: web and local content injection, MCP server in…