indirect prompt injection Topic Archive

indirect-prompt-injection.html
Long-term tracking RSS feed for the keyword "indirect prompt injection", collecting all previously matched papers.
zh-CN
Sun, 28 Jun 2026 05:24:06 +0000

CodeSentinel: A Three-Layer Defense Against Indirect Prompt Injection in Code Contexts
../papers/arxiv-4a4fa87e8c6b.html
https://arxiv.org/abs/2606.19235v1#2026-06-18#indirect-prompt-injection
Thu, 18 Jun 2026 14:03:08 +0800
Code large language models increasingly retrieve external code context from repositories, documentation, issue threads, and coding-agent environments, creating an indirect prompt-injection surface where attackers hide instructions in comments, strings, identifiers, or decoy code. We propose CodeSentinel, a three-layer inference-time sanitizer. It uses Tree-sitter to extract high-risk model-facing CST nodes, then combines syntax-guided pre-filtering, CST-guided Dynamic Min-K% scoring, and node…

Caught in the Act(ivation): Toward Pre-Output and Multi-Turn Detection of Credential Exfiltration by LLM Agents
../papers/arxiv-4dafb8f9ce98.html
https://arxiv.org/abs/2606.04141#2026-06-04#indirect-prompt-injection
Thu, 04 Jun 2026 14:02:06 +0800
LLM agents often place sensitive credentials in the same context window as untrusted retrieved content, creating a direct path for indirect prompt injection to induce credential exfiltration. We study this failure mode through three complementary defenses. First, we ask whether activation probes can detect credential access before output tokens are emitted. Second, we construct honeytokens from format-specific character models and calibrate detection with split conformal prediction.
Third, we t…

AgentRedBench: Dynamic Redteaming and Integration-Aware Defense for LLM Agents over SaaS Integrations
../papers/arxiv-c3bd67dd1e74.html
https://arxiv.org/abs/2606.02240v1#2026-06-02#indirect-prompt-injection
Tue, 02 Jun 2026 13:56:35 +0800
Indirect prompt injection in tool-use agents is a concrete production threat: LLM agents read from integrations (third-party services such as Gmail, Salesforce, or Jira accessed through tool calls) whose response content the user neither writes nor controls. Existing benchmarks under-measure the threat: most cover only a handful of integrations with the same attack payload replayed across runs, and open-source guards are trained on chat-style data rather than tool-response content. We introduce…

An Empirical Study of Privacy Leakage Chains via Prompt Injection in Black-Box Chatbot Environments
../papers/arxiv-0ec08efc6fec.html
https://arxiv.org/abs/2605.18133v1#2026-05-19#indirect-prompt-injection
Tue, 19 May 2026 13:08:04 +0800
LLM-based chatbot agents increasingly process user requests by combining natural-language reasoning with external tools such as web browsing. These capabilities improve usability, but they also create attack surfaces when untrusted external content is processed as part of a user's task. This paper studies a privacy-leakage attack chain based on indirect prompt injection in black-box chatbot environments, where the attacker has no access to model weights, system prompts, or agent implementation…

Constraining Host-Level Abuse in Self-Hosted Computer-Use Agents via TEE-Backed Isolation
../papers/arxiv-24f56ae93690.html
https://arxiv.org/abs/2605.06393v1#2026-05-08#indirect-prompt-injection
Fri, 08 May 2026 14:15:32 +0800
Self-hosted computer-use agents (SHCUAs), such as OpenClaw, combine natural-language interaction with direct access to host-side resources, including browsers, files, scripts, system commands, and external communication channels.
While useful for automating real tasks, this capability also creates a host-level abuse surface: a legitimately deployed agent may be steered toward unsafe operations through malicious messages, indirect prompt injection, unsafe skills, or tampering along the host-side…

ClawGuard: A Runtime Security Framework for Tool-Augmented LLM Agents Against Indirect Prompt Injection
../papers/arxiv-c894eb6a7f68.html
https://arxiv.org/abs/2604.11790v1#2026-04-14#indirect-prompt-injection
Tue, 14 Apr 2026 11:37:06 +0800
Tool-augmented Large Language Model (LLM) agents have demonstrated impressive capabilities in automating complex, multi-step real-world tasks, yet remain vulnerable to indirect prompt injection. Adversaries exploit this weakness by embedding malicious instructions within tool-returned content, which agents directly incorporate into their conversation history as trusted observations. This vulnerability manifests across three primary attack channels: web and local content injection, MCP server in…