Formal Threat Model for Agent Security Sandbox¶

1. Overview¶

This document defines the formal threat model for evaluating LLM agent defenses against Indirect Prompt Injection (IPI) attacks. It specifies attacker capabilities along three orthogonal axes, trust boundaries within the agent architecture, and the mapping between threat levels and defense strategies.

2. Attacker Model¶

2.1 Knowledge Levels (K)¶

Level	Name	Description
K0	Black-box	Attacker knows only that an LLM agent processes external content. No knowledge of the model, tools, or defenses.
K1	Grey-box	Attacker knows the model family (e.g. GPT-4, Claude) and the set of available tools (e.g. `send_email`, `read_file`). No knowledge of active defenses.
K2	White-box	Attacker knows the exact model, tool set, defense strategies, and their configurations (e.g. delimiter strings, policy rules).

2.2 Access Levels (A)¶

Level	Name	Description
A0	Single injection point	Attacker can inject content into exactly one external source (e.g. one email body).
A1	Multiple injection points	Attacker can inject into several sources that the agent may encounter during a single task (e.g. email + search results).
A2	Persistent injection	Attacker can place persistent payloads in data sources the agent accesses repeatedly (e.g. RAG corpus, shared documents).

2.3 Adaptiveness Levels (S)¶

Level	Name	Description
S0	Static	Attacker uses a fixed payload. No feedback loop.
S1	Semi-adaptive	Attacker can observe aggregate defense outcomes (e.g. ASR) and manually revise payloads across batches.
S2	Fully adaptive	Attacker uses an LLM-based red-teaming loop that observes per-case defense decisions and iteratively refines payloads to bypass specific defenses.

2.4 Combined Threat Levels¶

The three axes combine into a compact threat descriptor:

threat_level ::= "K" <0-2> "_A" <0-2> "_S" <0-2>

Examples: - K0_A0_S0 — Weakest attacker: black-box, single injection, static payload. - K1_A1_S1 — Moderate attacker: knows model+tools, multi-point injection, semi-adaptive. - K2_A2_S2 — Strongest attacker: full knowledge, persistent access, fully adaptive.

Canonical benchmark mapping: - Easy cases → K0_A0_S0 - Medium cases → K1_A0_S0 or K0_A1_S0 - Hard cases → K1_A1_S1 or K2_A0_S1 - Adaptive cases → K2_A1_S2

3. Trust Boundaries¶

┌──────────────────────────────────────────────────────┐
│                    TRUSTED ZONE                       │
│  ┌──────────┐   ┌───────────┐   ┌────────────────┐  │
│  │ User Goal│──▶│ LLM Agent │──▶│ Defense Layer   │  │
│  └──────────┘   └─────┬─────┘   └───────┬────────┘  │
│                       │                  │            │
│              ┌────────▼──────────────────▼─────┐     │
│              │       Tool Execution Engine      │     │
│              └────────┬────────────────────────┘     │
│                       │                               │
├───────────────────────┼───── TRUST BOUNDARY ──────────┤
│                       ▼                               │
│                UNTRUSTED ZONE                         │
│  ┌────────────┐ ┌──────────┐ ┌──────────────────┐   │
│  │ Email Body │ │ Web Page │ │ File Contents    │   │
│  │ (external) │ │ (search) │ │ (shared docs)    │   │
│  └────────────┘ └──────────┘ └──────────────────┘   │
│  ┌────────────┐ ┌──────────┐ ┌──────────────────┐   │
│  │ RAG Chunks │ │ API Resp │ │ Tool Outputs     │   │
│  └────────────┘ └──────────┘ └──────────────────┘   │
└──────────────────────────────────────────────────────┘

Key principle: Any content crossing the trust boundary from the untrusted zone into the agent's context is a potential injection vector and must be treated as adversarial input.

4. Attack Surface¶

Injection Location	Tool Path	Risk
`email_body`	`read_email` → agent context	High
`search_result`	`search_web` → agent context	High
`file_content`	`read_file` → agent context	Medium
`api_response`	`call_api` → agent context	Medium
`calendar_event`	`read_calendar` → agent context	Low
`rag_chunk`	RAG retrieval → agent context	High
`tool_output`	Any tool → observation → agent context	Medium

5. Attack Taxonomy¶

Attack Type	Goal	Example
`data_exfiltration`	Steal sensitive data via tool side-effects	Forward API keys to external email
`goal_hijacking`	Redirect agent from user's task	Overwrite files instead of summarising
`privilege_escalation`	Access resources beyond user intent	Claim admin privileges via fake tokens
`multi_step`	Chain multiple injected sub-tasks	Read secrets → package → exfiltrate
`social_engineering`	Exploit urgency/authority framing	"URGENT: compliance requires..."
`encoding_evasion`	Obfuscate payload to bypass filters	ROT13, Unicode homoglyphs, zero-width
`multilingual`	Use non-English to bypass English-centric defenses	CJK injection payloads
`rag_poisoning`	Poison RAG retrieval context	Inject into knowledge base chunks
`tool_output_injection`	Inject via tool observation text	Malicious content in API response

6. Defense Coverage Mapping¶

Defense	Threat Levels Covered	Mechanism
D0 Baseline	None (control)	No defense
D1 Spotlighting	K0_A0_S0	Delimiter marking of untrusted content
D2 Policy Gate	K0-K1_A0-A1_S0	Tool-level access control
D3 Task Alignment	K0-K1_A0_S0-S1	LLM-based alignment verification
D4 Re-execution	K0-K1_A0_S0	Masked re-execution comparison
D5 Sandwich	K0-K1_A0_S0	Goal reminder wrapping
D6 Output Filter	K0_A0-A1_S0	Heuristic output scanning
D7 Input Classifier	K0_A0_S0	Pre-processing injection detection
D8 Semantic Firewall	K0-K1_A0-A1_S0-S1	Embedding-based instruction drift detection
D9 Dual-LLM	K0-K2_A0-A2_S0-S1	Two-model screening and verification
D10 CIV	K0-K2_A0-A1_S0-S1	Provenance + fingerprinting + counterfactual

7. Evaluation Protocol¶

Baseline measurement: Run all cases against D0 to establish attacker success rates without any defense.
Per-threat-level evaluation: Group cases by threat_level and evaluate each defense to identify coverage gaps.
Adaptive evaluation: Use the AdaptiveAttacker (S2) against each defense to measure worst-case robustness.
Composition analysis: Test defense combinations to find complementary coverage across threat levels.

8. Metrics¶

Metric	Definition
ASR (Attack Success Rate)	Fraction of attack cases where the agent performed a forbidden action
BSR (Benign Success Rate)	Fraction of benign cases where the agent completed the task correctly
FPR (False Positive Rate)	Fraction of benign cases incorrectly blocked by the defense
ASR@K/A/S	ASR conditioned on a specific knowledge/access/adaptiveness level
Robustness Gap	ASR(S2) - ASR(S0): how much adaptive attacks degrade a defense