Keyword Tracking

Keyword tracking: jailbreak

This page tracks the keywords in your configuration over the long term and archives the matching papers by date.

Recent trend

The most recent hit came from Agent Runtime Security: Jailbreaking for the Average Jane: Choosing Optimal Jailbreaks via Bandit Algorithms for Automatically Enhanced Queries

[Trend chart; date axis from 2026-06-15 to 2026-06-28]

Hit details

Review, by date, the titles of papers that matched this keyword, with their source feed information retained.

2026-06-26

2026-06-26 13:16:53 (Asia/Shanghai)

Agent Runtime Security

Jailbreaking for the Average Jane: Choosing Optimal Jailbreaks via Bandit Algorithms for Automatically Enhanced Queries

View original source

With a profusion of jailbreaks for LLMs now widely known, a growing concern is that non-expert malicious actors ("the average Jane") could elicit actionable responses to malicious…

2026-06-25

2026-06-25 13:11:21 (Asia/Shanghai)

Do Encoders Suffice? A Systematic Comparison of Encoder and Decoder Safety Judges for LLM Adversarial Evaluation

View original source

With the widespread adoption of large language models (LLMs) in chatbots and everyday applications, companies increasingly need guardrails that are effective while remaining low-c…

RAS: Measuring LLM Safety Through Refusal Alignment

View original source

Safety evaluation of large language models (LLMs) is commonly performed by querying models with unsafe or jailbreak prompts and judging whether their outputs violate a safety poli…

Agent Runtime Security

How Reliable Is Your Jailbreak Judge? Calibration and Adversarial Robustness of Automated ASR Scoring

View original source

Almost every paper on LLM jailbreaks and prompt injection reports an attack-success rate (ASR), and that number is assigned not by people but by an automated judge: either a safet…

2026-06-24

2026-06-24 13:06:49 (Asia/Shanghai)

Agent Runtime Security

LLMs Prompted for Legal Context Object More: Overrefusal from Small On-Premises LLMs in Criminal Legal Context

View original source

While the validity of LLMs' use in the legal context remains subject to ethical and legal debate, legal professionals are already experimenting with personal LLMs, if only for tra…

Agent Runtime Security

Pigeonholing: Bad prompts hurt models to collapse and make mistakes

View original source

While in-context learning is generally shown to be effective in Large Language Models (LLMs), bad contexts can cause performance degradation and mode collapse, a phenomenon we cal…

2026-06-23

2026-06-23 13:10:02 (Asia/Shanghai)

Agent Runtime Security

TROPT: An Open Framework for Unifying and Advancing Discrete Text Optimization

View original source

Discrete text-trigger optimization -- searching for text sequences that, when ingested by a model, steer it toward a specified objective -- underpins model red-teaming (e.g., LLM…

2026-06-19

2026-06-19 14:26:15 (Asia/Shanghai)

LLM agent safety, multi-turn red-teaming, jailbreak benchmarks, adversarial robustness, safety-critical systems

View original source

Large language model (LLM) agents are increasingly proposed as supervisory components for safety-critical systems, yet their robustness under sustained, adaptive adversarial press…

Agent Runtime Security

What Do Safety-Aligned LLMs Learn From Mixed Compliance Demonstrations?

View original source

Prior work has shown that in-context demonstrations can jailbreak language models, but it remains unclear how models interpret different types of compliance demonstrations. We stu…

Agent Runtime Security

Analyzing Defensive Misdirection Against Model-Guided Automated Attacks on Agentic AI Systems

View original source

Agentic AI systems increasingly rely on language-model components to interpret instructions, process external data, invoke tools, and coordinate with other agents. These capabilit…

2026-06-17

2026-06-17 14:22:19 (Asia/Shanghai)

Agent Runtime Security

A Red-Team Study of Anthropic Fable 5 & Opus 4.8 Models

View original source

We evaluate the adversarial robustness of two frontier large language models (LLMs) developed by Anthropic, Fable 5 and Opus 4.8, against four families of automated jailbreak atta…

2026-06-16

2026-06-16 14:38:43 (Asia/Shanghai)

Agent Runtime Security

Automated jailbreak attack targeting multiple defense strategies

View original source

Large language models (LLMs) have demonstrated remarkable capabilities across a wide range of tasks. However, their safety remains a critical concern due to their susceptibility t…

Agent Runtime Security

DoubtProbe: Black-Box Jailbreak Defense via Structural Verification and Semantic Auditing

View original source

As large language models (LLMs) are increasingly deployed in user-facing systems, black-box jailbreak defense has become an important practical problem. Existing defenses often re…

Agent Runtime Security

Adaptive and Explicit safe: Triggering Latent Safety Awareness in Large Reasoning Models

View original source

While Large Reasoning Models (LRMs) excel at complex tasks, they remain highly vulnerable to sophisticated jailbreaks and direct harmful queries. To address this vulnerability, pr…

2026-06-11

2026-06-11 13:59:12 (Asia/Shanghai)

Agent Runtime Security

Grammar-Constrained Decoding Can Jailbreak LLMs into Generating Malicious Code

View original source

Large Language Models (LLMs) are increasingly used for code generation, raising concerns that they may be misused to produce malicious code. Meanwhile, Grammar-Constrained Decodin…

Agent Runtime Security

OCELOT: Inference-Leakage Budgets for Privacy-Preserving LLM Agents

View original source

Large language model (LLM) agents increasingly act on a user's behalf -- reading personal files, calling tools, transacting with external services -- possibly leaking personally i…

Agent Runtime Security

Online Shift Detection and Conformal Adaptation for Deployed Safety Classifiers

View original source

We present an online monitoring system for distributional shift in deployed safety classifiers, using calibrated sequential statistics to detect when a classifier has moved out of…

2026-06-10

2026-06-10 13:25:04 (Asia/Shanghai)

Agent Runtime Security

When the Chain of Thought Knows Better: Failure Modes in Multi-Turn Reasoning Models

View original source

Failures in multi-turn reasoning models are largely invisible to terminal-score evaluation. A model can lock onto an unsafe stance early in a long dialogue, yet its final-turn ref…

2026-06-05

2026-06-05 13:25:00 (Asia/Shanghai)

Agent Runtime Security

GuardNet: Ensemble Strategies of Shallow Neural Networks for Robust Prompt Injection and Jailbreak Detection

View original source

Large Language Models (LLMs) have transformed natural language processing, but they remain vulnerable to Prompt Injection (PI) and Jailbreak (JB) attacks. In addition, benchmark e…

Agent Runtime Security

Safety Paradox: How Enhanced Safety Awareness Leaves LLMs Vulnerable to Posterior Attack

View original source

Large language models (LLMs) are rigorously aligned to refuse harmful requests, a process that inherently cultivates a latent capacity to evaluate and recognize unsafe content. In…

Agent Runtime Security

Beyond Similarity: Trustworthy Memory Search for Personal AI Agents

View original source

Personal AI agents increasingly rely on long-term memory to provide persistent personalization across sessions. However, existing memory pipelines are largely driven by semantic s…

2026-06-04

2026-06-04 14:02:06 (Asia/Shanghai)

Agent Runtime Security

MaskForge: Structure-Aware Adaptive Attacks for Jailbreaking Diffusion Large Language Models

View original source

Diffusion large language models (dLLMs) generate text by iteratively denoising partially masked sequences under bidirectional context, exposing a safety surface distinct from auto…

2026-06-03

2026-06-03 14:09:56 (Asia/Shanghai)

Agent Runtime Security

D-Judge: Disrupting Multi-Turn Jailbreaks using Semantics-Preserving Output Rewriting

View original source

Multi-turn jailbreak attacks pose a growing threat to large language model (LLM) safety because they exploit feedback from auxiliary judge models to iteratively refine prompts tow…

Agent Runtime Security

MultiTurnPSB: Evaluating Multi-Turn Jailbreak Attacks and Classifier-Based Defenses for Medical AI Safety

View original source

Patient-facing medical chatbots are commonly evaluated on single-turn prompts, yet real users push back after refusals, add urgency, and invoke authority. We introduce MultiTurnPS…

Agent Runtime Security

Which Defense Closes Which Threat? Attributing OWASP-LLM-Top-10 Coverage and Its Brittleness Under Paraphrasing

View original source

Production LLM applications stack several defense families -- refusal-phrase filters, token-budget controls, model allowlists, rate limits, tool-registry authentication -- yet exi…

2026-06-02

2026-06-02 13:56:35 (Asia/Shanghai)

Agent Runtime Security

Jailbreaking Multimodal Large Language Models using Multi-Clip Video

View original source

As multimodal large language models (MLLMs) have advanced to process video inputs, concerns have emerged about their potential for malicious misuse. Prior jailbreak studies have s…

2026-05-29

2026-05-29 13:18:32 (Asia/Shanghai)

Agent Runtime Security

Beyond Attack Success Rate: Temporal Logit Observability for LLM Safety Failures

View original source

Attack Success Rate (ASR) evaluates each jailbreak with a single yes/no label at the end of generation, telling us whether a failure happened but not how it unfolded. Two attacks…

2026-05-28

2026-05-28 13:15:52 (Asia/Shanghai)

Agent Runtime Security

Code as a Weapon: A Consensus-Labeled Prompt Bank for Measuring Coding-Model Compliance with Malicious-Code Requests

View original source

A general-purpose language model that answers a harmful question returns text; a coding model that complies with a malicious request can return a working weapon -- a keylogger, a…

2026-05-27

2026-05-27 13:23:19 (Asia/Shanghai)

Agent Runtime Security

BAIT: Boundary-Guided Disclosure Escalation via Self-Conditioned Reasoning

View original source

In this work, we propose BAIT (Boundary-Aware Iterative Trap), a three-step jailbreak framework that approaches malicious goals through internal disclosure. BAIT first asks the mo…

2026-05-21

2026-05-21 13:14:24 (Asia/Shanghai)

LASH: Adaptive Semantic Hybridization for Black-Box Jailbreaking of Large Language Models

View original source

Jailbreak attacks expose a persistent gap between the intended safety behavior of aligned large language models and their behavior under adversarial prompting. Existing automated…

2026-05-20

2026-05-20 13:10:58 (Asia/Shanghai)

Agent Runtime Security

Attention-Guided Reward for Reinforcement Learning-based Jailbreak against Large Reasoning Models

View original source

Large Reasoning Models (LRMs) have demonstrated remarkable capabilities in solving complex problems by generating structured, step-by-step reasoning content. However, exposing a m…

2026-05-19

2026-05-19 13:08:04 (Asia/Shanghai)

Agent Runtime Security

An Empirical Study of Privacy Leakage Chains via Prompt Injection in Black-Box Chatbot Environments

View original source

LLM-based chatbot agents increasingly process user requests by combining natural-language reasoning with external tools such as web browsing. These capabilities improve usability,…

Agent Runtime Security

Multilingual jailbreaking of LLMs using low-resource languages

View original source

Large Language Models (LLMs) remain vulnerable to jailbreak attempts that circumvent safety guardrails. We investigate whether multi-turn conversations using low-resource African…

Agent Runtime Security

Acoustic Interference: A New Paradigm Weaponizing Acoustic Latent Semantic for Universal Jailbreak against Large Audio Language Models

View original source

The integration of audio modality into Large Audio Language Models (LALMs) significantly expands their attack surface. Existing jailbreak paradigms predominantly treat audio as a…

2026-05-13

2026-05-13 12:54:34 (Asia/Shanghai)

Agent Runtime Security

Metaphor Is Not All Attention Needs

View original source

Large language models are increasingly deployed in safety-critical applications, where their ability to resist harmful instructions is essential. Although post-training aims to ma…

2026-05-12

2026-05-12 12:42:08 (Asia/Shanghai)

LITMUS: Benchmarking Behavioral Jailbreaks of LLM Agents in Real OS Environments

View original source

The rapid proliferation of LLM-based autonomous agents in real operating system environments introduces a new category of safety risk beyond content safety: behavior jailbreak, wh…

Agent Runtime Security

Break the Brake, Not the Wheel: Untargeted Jailbreak via Entropy Maximization

View original source

Recent studies show that gradient-based universal image jailbreaks on vision-language models (VLMs) exhibit little or no cross-model transferability, casting doubt on the feasibil…

Agent Runtime Security

Re-Triggering Safeguards within LLMs for Jailbreak Detection

View original source

This paper proposes a jailbreaking prompt detection method for large language models (LLMs) to defend against jailbreak attacks. Although recent LLMs are equipped with built-in sa…

Agent Runtime Security

Guaranteed Jailbreaking Defense via Disrupt-and-Rectify Smoothing

View original source

This paper proposes a guaranteed defense method for large language models (LLMs) to safeguard against jailbreaking attacks. Drawing inspiration from the denoised-smoothing approac…

2026-05-07

2026-05-07 12:38:06 (Asia/Shanghai)

SoK: Robustness in Large Language Models against Jailbreak Attacks

View original source

Large Language Models (LLMs) have achieved remarkable success but remain highly susceptible to jailbreak attacks, in which adversarial prompts coerce models into generating harmfu…

2026-05-05

2026-05-05 12:20:54 (Asia/Shanghai)

MultiBreak: A Scalable and Diverse Multi-turn Jailbreak Benchmark for Evaluating LLM Safety

View original source

We present MultiBreak, a scalable and diverse multi-turn jailbreak benchmark to evaluate large language model (LLM) safety. Multi-turn jailbreaks mimic natural conversational sett…

2026-04-24

2026-04-24 11:46:20 (Asia/Shanghai)

LLM

Transient Turn Injection: Exposing Stateless Multi-Turn Vulnerabilities in Large Language Models

View original source

Large language models (LLMs) are increasingly integrated into sensitive workflows, raising the stakes for adversarial robustness and safety. This paper introduces Transient Turn I…