
Architecture

Three-Layer Design

ASB is organized into three layers:

Layer 1: Agent Core

  • ReactAgent: Implements the ReAct (Reasoning + Acting) pattern
  • LLMClient: Abstraction supporting OpenAI, Anthropic, OpenAI-compatible, and Mock providers
  • ToolRegistry: Centralized tool management with risk metadata
  • ConversationMemory: Message history management with windowing strategies
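
To make the idea of a windowing strategy concrete, here is a minimal sketch of one possible strategy: a sliding window that always keeps system messages and drops the oldest other messages. The names `Message`, `SlidingWindowMemory`, `max_messages`, and `window()` are hypothetical and are not taken from ASB's ConversationMemory API.

```python
from dataclasses import dataclass, field


@dataclass
class Message:
    role: str      # e.g. "system", "user", "assistant", "tool"
    content: str


@dataclass
class SlidingWindowMemory:
    """Hypothetical stand-in for ConversationMemory with one windowing strategy."""
    max_messages: int = 4
    messages: list[Message] = field(default_factory=list)

    def add(self, role: str, content: str) -> None:
        self.messages.append(Message(role, content))

    def window(self) -> list[Message]:
        # Keep every system message, then fill the remaining budget
        # with the most recent non-system messages.
        system = [m for m in self.messages if m.role == "system"]
        rest = [m for m in self.messages if m.role != "system"]
        keep = max(self.max_messages - len(system), 0)
        # rest[-0:] would return the whole list, so handle keep == 0 explicitly.
        return system + (rest[-keep:] if keep else [])
```

Other strategies (for example a token budget, or summarizing old turns) would change only `window()`.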

Layer 2: Security & Evaluation

  • DefenseStrategy: Abstract interface for all defenses (D0-D10), exposing two hooks:
      • prepare_context(): Modifies the prompt before it is sent to the LLM
      • should_allow_tool_call(): Gates tool execution at runtime
  • Defenses:
      • D0: Baseline, D1: Spotlighting, D2: Policy Gate, D3: Task Alignment
      • D4: Re-execution, D5: Sandwich, D6: Output Filter, D7: Input Classifier
      • D8: Semantic Firewall, D9: Dual-LLM, D10: CIV
  • CompositeDefense: Pipeline combining multiple strategies
  • AutoJudge: Rule-based verdict system (attack_succeeded/blocked, benign_completed/blocked)
  • MetricsCalculator: Computes ASR (attack success rate), BSR (benign success rate), and FPR (false positive rate) from judge results
  • ExperimentRunner: Orchestrates benchmark execution with fresh state per case
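
The two defense hooks and the composite pipeline can be sketched as follows. This is an illustration under stated assumptions, not ASB's actual code: the method names `prepare_context` and `should_allow_tool_call` come from the list above, but their signatures, the `Context` container, the delimiter strings, and the concrete `Spotlighting` and `PolicyGate` bodies are hypothetical.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Context:
    """Hypothetical container; the real prompt representation may differ."""
    user_goal: str
    untrusted_content: str


class DefenseStrategy(ABC):
    @abstractmethod
    def prepare_context(self, ctx: Context) -> Context:
        """Rewrite the prompt material before it reaches the LLM."""

    @abstractmethod
    def should_allow_tool_call(self, tool_name: str, arguments: dict) -> bool:
        """Decide at runtime whether a proposed tool call may execute."""


class Spotlighting(DefenseStrategy):
    # Assumed behavior: wrap untrusted text in delimiters so the model
    # can tell data apart from instructions.
    def prepare_context(self, ctx: Context) -> Context:
        marked = f"<<DATA>>\n{ctx.untrusted_content}\n<<END DATA>>"
        return replace(ctx, untrusted_content=marked)

    def should_allow_tool_call(self, tool_name: str, arguments: dict) -> bool:
        return True  # prompt-level defense only; never blocks tools


class PolicyGate(DefenseStrategy):
    def __init__(self, blocked_tools: set[str]):
        self.blocked_tools = blocked_tools

    def prepare_context(self, ctx: Context) -> Context:
        return ctx  # runtime-only defense; prompt is left unchanged

    def should_allow_tool_call(self, tool_name: str, arguments: dict) -> bool:
        return tool_name not in self.blocked_tools


class CompositeDefense(DefenseStrategy):
    """Applies each strategy in order; a tool call needs every strategy's approval."""

    def __init__(self, strategies: list[DefenseStrategy]):
        self.strategies = strategies

    def prepare_context(self, ctx: Context) -> Context:
        for strategy in self.strategies:
            ctx = strategy.prepare_context(ctx)
        return ctx

    def should_allow_tool_call(self, tool_name: str, arguments: dict) -> bool:
        return all(s.should_allow_tool_call(tool_name, arguments)
                   for s in self.strategies)
```

Requiring unanimous approval in the composite is one design choice among several; it means adding a strategy can only make the pipeline stricter.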

Layer 3: Interface

  • CLI (asb): Click-based command-line tool with run/evaluate/report/serve commands
  • Streamlit UI: Interactive demo with agent execution, audit trail, and benchmark visualization

Data Flow

User Goal + Untrusted Content
        ↓
   DefenseStrategy.prepare_context()
        ↓
   ReactAgent.run() loop:
     1. LLM generates thought + action
     2. DefenseStrategy.should_allow_tool_call()
     3. If allowed: Tool executes → observation
        If blocked: "BLOCKED" observation
     4. Repeat until Final Answer or max_steps
        ↓
   AgentTrajectory (full execution trace)
        ↓
   AutoJudge.judge() → JudgeResult
        ↓
   MetricsCalculator → EvaluationMetrics
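
The loop in the diagram above can be sketched in a few lines. This is a simplified illustration, not the ReactAgent implementation: `run_agent`, `Step`, `Trajectory`, and the `next_move` callback (which stands in for the LLM) are hypothetical names, and tool arguments are reduced to a single string.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Step:
    thought: str
    action: str        # name of the tool the model asked for
    observation: str   # tool output, or "BLOCKED"


@dataclass
class Trajectory:
    steps: list[Step] = field(default_factory=list)
    final_answer: Optional[str] = None


def run_agent(
    next_move: Callable[[Trajectory], tuple[str, Optional[str], str]],
    tools: dict[str, Callable[[str], str]],
    allow_tool_call: Callable[[str, str], bool],
    max_steps: int = 5,
) -> Trajectory:
    """Run the gated loop; next_move returns (thought, action, argument).

    An action of None means the argument is the final answer.
    """
    trajectory = Trajectory()
    for _ in range(max_steps):
        thought, action, argument = next_move(trajectory)   # step 1
        if action is None:
            trajectory.final_answer = argument              # step 4: done
            break
        if allow_tool_call(action, argument):               # step 2
            observation = tools[action](argument)           # step 3: allowed
        else:
            observation = "BLOCKED"                         # step 3: blocked
        trajectory.steps.append(Step(thought, action, observation))
    return trajectory
```

A blocked call is recorded as an ordinary step, so the resulting trace still shows what the model attempted, which is what a judge would need to tell a blocked attack from one that never happened.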

Tool Risk Classification

Risk Level   Examples                       Policy
LOW          search_web, create_document    Always allowed
MEDIUM       read_email, read_file          Allowed by default
HIGH         send_email, write_file         Whitelist enforced
CRITICAL     execute_code                   Blocked by default
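
One way to read the four policies is as a single lookup function. The sketch below assumes that interpretation; `RiskLevel`, `TOOL_RISK`, `is_allowed`, and its parameters are hypothetical names, and denying unknown tools is an assumption rather than documented ASB behavior.

```python
from enum import Enum


class RiskLevel(Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"
    CRITICAL = "critical"


# Tool names and levels taken from the table above.
TOOL_RISK = {
    "search_web": RiskLevel.LOW,
    "create_document": RiskLevel.LOW,
    "read_email": RiskLevel.MEDIUM,
    "read_file": RiskLevel.MEDIUM,
    "send_email": RiskLevel.HIGH,
    "write_file": RiskLevel.HIGH,
    "execute_code": RiskLevel.CRITICAL,
}


def is_allowed(
    tool_name: str,
    whitelist: frozenset[str] = frozenset(),
    denylist: frozenset[str] = frozenset(),
    allow_critical: bool = False,
) -> bool:
    risk = TOOL_RISK.get(tool_name)
    if risk is None:
        return False                      # assumption: unknown tools are denied
    if risk is RiskLevel.LOW:
        return True                       # always allowed
    if risk is RiskLevel.MEDIUM:
        return tool_name not in denylist  # allowed unless explicitly denied
    if risk is RiskLevel.HIGH:
        return tool_name in whitelist     # allowed only if whitelisted
    return allow_critical                 # CRITICAL: blocked unless opted in
```

For example, `is_allowed("send_email")` is false with the defaults and becomes true only when `"send_email"` is passed in `whitelist`.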