jailbreak Topic Archive

jailbreak Topic Archive jailbreak.html Long-term tracking RSS feed for the keyword "jailbreak", aggregating all previously matched papers. zh-CN Sun, 28 Jun 2026 05:24:06 +0000 Jailbreaking for the Average Jane: Choosing Optimal Jailbreaks via Bandit Algorithms for Automatically Enhanced Queries ../papers/arxiv-eca9abd16e6a.html https://arxiv.org/abs/2606.26936v1#2026-06-26#jailbreak Fri, 26 Jun 2026 13:16:53 +0800 With a profusion of jailbreaks for LLMs now widely known, a growing concern is that non-expert malicious actors ("the average Jane") could elicit actionable responses to malicious requests. In this work, we examine whether this concern is justified. A non-expert malicious actor requires two ingredients for a successful attack: a powerful jailbreak for their target model, acting on an effective malicious query. For the former, we propose a novel attack strategy based on the multi-armed bandit fr… Do Encoders Suffice? A Systematic Comparison of Encoder and Decoder Safety Judges for LLM Adversarial Evaluation ../papers/arxiv-cca18893a109.html https://arxiv.org/abs/2606.25782v1#2026-06-25#jailbreak Thu, 25 Jun 2026 13:11:21 +0800 With the widespread adoption of large language models (LLMs) in chatbots and everyday applications, companies increasingly need guardrails that are effective while remaining low-cost and low-latency. Safety evaluation of LLM outputs has generally relied on LLM-based judges, which can be effective but are often slow and expensive to deploy at scale. In this paper, we evaluate whether fine-tuned modern encoder classifiers from the ModernBERT family, including ModernBERT and Ettin, can reliably id… RAS: Measuring LLM Safety Through Refusal Alignment ../papers/arxiv-27c960f270d2.html https://arxiv.org/abs/2606.25750v1#2026-06-25#jailbreak Thu, 25 Jun 2026 13:11:21 +0800 Safety evaluation of large language models (LLMs) is commonly performed by querying models with unsafe or jailbreak prompts and judging whether their outputs violate a safety policy. 
Although useful, output-level evaluation is expensive, sensitive to judge choice, and easily tied to fixed question banks. We propose **SafeVec**, a white-box evaluation procedure that measures safety from internal representations rather than generated answers. **SafeVec** first extracts layer-wise refusal directio… How Reliable Is Your Jailbreak Judge? Calibration and Adversarial Robustness of Automated ASR Scoring ../papers/arxiv-25dbdb1a09a8.html https://arxiv.org/abs/2606.25487v1#2026-06-25#jailbreak Thu, 25 Jun 2026 13:11:21 +0800 Almost every paper on LLM jailbreaks and prompt injection reports an attack-success rate (ASR), and that number is assigned not by people but by an automated judge: either a safety classifier trained for the task, or a general chat model prompted to grade. The judge is rarely checked. We check it. Using 596 human-labeled completions from the HarmBench classifier validation set, we compare the two judge families against human majority votes and then attack them. The two families fail in opposite… LLMs Prompted for Legal Context Object More: Overrefusal from Small On-Premises LLMs in Criminal Legal Context ../papers/arxiv-7d9e141e8dab.html https://arxiv.org/abs/2606.24585v1#2026-06-24#jailbreak Wed, 24 Jun 2026 13:06:49 +0800 While the validity of LLMs' use in the legal context remains subject to ethical and legal debate, legal professionals are already experimenting with personal LLMs, if only for translation and reformulation. However, even such a seemingly innocuous use can introduce biases through case processing speed if LLM assistants selectively refuse assistance on certain topics. 
To better anticipate such biases, we investigate several modern small LLMs that are most likely to be used as on-device assistant… Pigeonholing: Bad prompts hurt models to collapse and make mistakes ../papers/arxiv-112c872ebf06.html https://arxiv.org/abs/2606.24267v1#2026-06-24#jailbreak Wed, 24 Jun 2026 13:06:49 +0800 While in-context learning is generally shown to be effective in Large Language Models (LLMs), bad contexts can cause performance degradation and mode collapse, a phenomenon we call "pigeonholing." **Unintentionally bad** contexts can happen without malicious jailbreaking intents: For example, a user asks the model to justify an incorrect math theorem or fails to correct the model's buggy code. Specifically, we investigate "pigeonholing" in two scenarios: (1) when the user suggests a solution,… TROPT: An Open Framework for Unifying and Advancing Discrete Text Optimization ../papers/arxiv-0caba902aa17.html https://arxiv.org/abs/2606.23496v1#2026-06-23#jailbreak Tue, 23 Jun 2026 13:10:02 +0800 Discrete text-trigger optimization -- searching for text sequences that, when ingested by a model, steer it toward a specified objective -- underpins model red-teaming (e.g., LLM jailbreaks), as well as auditing and interpretability. However, the current state of discrete optimizers hinders their adoption and progress. First, existing optimizers, when open-sourced at all, are scattered across research codebases tied to specific models, objectives, and problem domains. Second, optimizer variants… LLM agent safety, multi-turn red-teaming, jailbreak benchmarks, adversarial robustness, safety-critical systems ../papers/arxiv-a473df34c717.html https://arxiv.org/abs/2606.20408v1#2026-06-19#jailbreak Fri, 19 Jun 2026 14:26:15 +0800 Large language model (LLM) agents are increasingly proposed as supervisory components for safety-critical systems, yet their robustness under sustained, adaptive adversarial pressure remains poorly characterized. 
We present NRT-Bench, a benchmark for multi-turn red-teaming of LLM agents acting as operators of a safety-critical system, instantiated in a simulated nuclear power plant control room. A five-role operator team, each backed by a configurable LLM, runs a plant governed by six critical… What Do Safety-Aligned LLMs Learn From Mixed Compliance Demonstrations? ../papers/arxiv-d2c45c3b54a7.html https://arxiv.org/abs/2606.20508v1#2026-06-19#jailbreak Fri, 19 Jun 2026 14:26:15 +0800 Prior work has shown that in-context demonstrations can jailbreak language models, but it remains unclear how models interpret different types of compliance demonstrations. We study this by mixing benign compliance demonstrations (non-harmful request, helpful response) with harmful compliance demonstrations (harmful request, helpful response) and testing three hypotheses about how demonstration composition drives harmful compliance. Across four models, we find that benign and harmful demonstrat… Analyzing Defensive Misdirection Against Model-Guided Automated Attacks on Agentic AI Systems ../papers/arxiv-c4b164c8bf0f.html https://arxiv.org/abs/2606.20470v1#2026-06-19#jailbreak Fri, 19 Jun 2026 14:26:15 +0800 Agentic AI systems increasingly rely on language-model components to interpret instructions, process external data, invoke tools, and coordinate with other agents. These capabilities make prompt-injection and jailbreak attacks more consequential, especially as attackers adopt model-guided automation to scale probing, prompt refinement, and response evaluation. 
This work analyzes the resulting attack-defense setting through a probabilistic model of a target system, its defense mechanism, and the… A Red-Team Study of Anthropic Fable 5 & Opus 4.8 Models ../papers/arxiv-b2420286985b.html https://arxiv.org/abs/2606.18193v1#2026-06-17#jailbreak Wed, 17 Jun 2026 14:22:19 +0800 We evaluate the adversarial robustness of two frontier large language models (LLMs) developed by Anthropic, Fable 5 and Opus 4.8, against four families of automated jailbreak attack across 7 826 harmful intents spanning a ten-category harm taxonomy. Using the HackAgent red-teaming framework, hundreds of thousands of adversarial attempts were generated and every apparent success was independently re-adjudicated by a panel of three judge models (majority vote). Both models resist the majority of… Automated jailbreak attack targeting multiple defense strategies ../papers/arxiv-d433e566ae71.html https://arxiv.org/abs/2606.16751v1#2026-06-16#jailbreak Tue, 16 Jun 2026 14:38:43 +0800 Large language models (LLMs) have demonstrated remarkable capabilities across a wide range of tasks. However, their safety remains a critical concern due to their susceptibility to adversarial prompt-based attacks. In this paper, we present UNIATTACK, an adversarial testing framework designed from a defense-oriented perspective to systematically construct effective black-box attack prompts. Unlike prior approaches that rely on static templates or iterative model-specific tuning, UNIATTACK extra… DoubtProbe: Black-Box Jailbreak Defense via Structural Verification and Semantic Auditing ../papers/arxiv-cac1c936b49a.html https://arxiv.org/abs/2606.16527v1#2026-06-16#jailbreak Tue, 16 Jun 2026 14:38:43 +0800 As large language models (LLMs) are increasingly deployed in user-facing systems, black-box jailbreak defense has become an important practical problem. 
Existing defenses often rely on known-attack coverage, prompt-level semantic judgment, or local runtime control, yet these paths can become unstable under evolving prompt packaging, expression rewriting, and structure manipulation. We observe that many black-box jailbreaks do not remove the harmful goal, but reorganize the information needed to… Adaptive and Explicit safe: Triggering Latent Safety Awareness in Large Reasoning Models ../papers/arxiv-55ea929d4f57.html https://arxiv.org/abs/2606.16808v1#2026-06-16#jailbreak Tue, 16 Jun 2026 14:38:43 +0800 While Large Reasoning Models (LRMs) excel at complex tasks, they remain highly vulnerable to sophisticated jailbreaks and direct harmful queries. To address this vulnerability, prior works depend heavily on external manual data annotation for safety alignment. However, we observe that LRMs can inherently identify safety risks when being re-presented with original queries alongside their own reasoning trajectories -- a capability we term Latent Safety Awareness. To leverage this safety awareness… Grammar-Constrained Decoding Can Jailbreak LLMs into Generating Malicious Code ../papers/arxiv-bf45a9834bb9.html https://arxiv.org/abs/2606.11817v1#2026-06-11#jailbreak Thu, 11 Jun 2026 13:59:12 +0800 Large Language Models (LLMs) are increasingly used for code generation, raising concerns that they may be misused to produce malicious code. Meanwhile, Grammar-Constrained Decoding (GCD) has been widely adopted to improve the reliability of LLM-generated code by enforcing syntactic validity. In this paper, we reveal a counterintuitive risk: this reliability-oriented technique can itself become an attack surface. 
We uncover a new jailbreak attack, termed CodeSpear, that exploits GCD to induce LL… OCELOT: Inference-Leakage Budgets for Privacy-Preserving LLM Agents ../papers/arxiv-4d5c20727d4b.html https://arxiv.org/abs/2606.12341v1#2026-06-11#jailbreak Thu, 11 Jun 2026 13:59:12 +0800 Large language model (LLM) agents increasingly act on a user's behalf -- reading personal files, calling tools, transacting with external services -- possibly leaking personally identifiable information (PII) across trust boundaries at every step. Privacy here is a property not of a single output but of an entire trajectory, and three properties make it hard: leakage is cumulative, as individually innocuous releases accumulate across honest-but-curious or colluding sinks into inferences about a… Online Shift Detection and Conformal Adaptation for Deployed Safety Classifiers ../papers/arxiv-9418cc76eedf.html https://arxiv.org/abs/2606.11949v1#2026-06-11#jailbreak Thu, 11 Jun 2026 13:59:12 +0800 We present an online monitoring system for distributional shift in deployed safety classifiers, using calibrated sequential statistics to detect when a classifier has moved out of distribution. Upon detection, a conformal abstention layer adapts decision thresholds to recover a target error rate epsilon=0.1. In a pre-registered factorial evaluation (4 classifiers x 5 shift conditions x 20 seeds x 2 window sizes, 800 cells), the system achieves 86.6% valid detection (693/800, 95% CI [84.1%, 88.8… When the Chain of Thought Knows Better: Failure Modes in Multi-Turn Reasoning Models ../papers/arxiv-d9957907bca3.html https://arxiv.org/abs/2606.10740v1#2026-06-10#jailbreak Wed, 10 Jun 2026 13:25:04 +0800 Failures in multi-turn reasoning models are largely invisible to terminal-score evaluation. A model can lock onto an unsafe stance early in a long dialogue, yet its final-turn refusal rate may appear indistinguishable from a robustly aligned baseline. 
To expose these hidden temporal dynamics, we propose a trace-level diagnostic - the CoT-Output 2x2 safety matrix. This framework labels every turn along two independent axes (internal reasoning and visible output), yielding four operationally defi… GuardNet: Ensemble Strategies of Shallow Neural Networks for Robust Prompt Injection and Jailbreak Detection ../papers/arxiv-50a6cf67d9a1.html https://arxiv.org/abs/2606.05566#2026-06-05#jailbreak Fri, 05 Jun 2026 13:25:00 +0800 Large Language Models (LLMs) have transformed natural language processing, but they remain vulnerable to Prompt Injection (PI) and Jailbreak (JB) attacks. In addition, benchmark evaluations may be affected by contamination and partial information leakage, compromising performance estimates. This work presents GuardNet, a guardrail system based on an ensemble of shallow neural networks (BiLSTMs) with approximately 47 million parameters. We investigate the hypothesis that robustness in adversaria… Safety Paradox: How Enhanced Safety Awareness Leaves LLMs Vulnerable to Posterior Attack ../papers/arxiv-088c75fca1b6.html https://arxiv.org/abs/2606.05614#2026-06-05#jailbreak Fri, 05 Jun 2026 13:25:00 +0800 Large language models (LLMs) are rigorously aligned to refuse harmful requests, a process that inherently cultivates a latent capacity to evaluate and recognize unsafe content. In this work, we reveal that this advanced safety awareness inadvertently introduces a fatal vulnerability. We introduce Posterior Attack, a single-query jailbreak that bypasses guardrails by prompting the model to generate the exact harmful response its internal classifier would normally flag as unsafe. Through extensiv… Beyond Similarity: Trustworthy Memory Search for Personal AI Agents ../papers/arxiv-5005be318521.html https://arxiv.org/abs/2606.06054#2026-06-05#jailbreak Fri, 05 Jun 2026 13:25:00 +0800 Personal AI agents increasingly rely on long-term memory to provide persistent personalization across sessions. 
However, existing memory pipelines are largely driven by semantic similarity: memory data close to the current query is retrieved and injected into the model context. This creates a critical trustworthiness gap, since a semantically related memory may still be contextually inappropriate, leading to threats such as cross-domain leakage, sycophancy, tool-call drift, or memory-induced ja… MaskForge: Structure-Aware Adaptive Attacks for Jailbreaking Diffusion Large Language Models ../papers/arxiv-cd6f58fb55d6.html https://arxiv.org/abs/2606.04027#2026-06-04#jailbreak Thu, 04 Jun 2026 14:02:06 +0800 Diffusion large language models (dLLMs) generate text by iteratively denoising partially masked sequences under bidirectional context, exposing a safety surface distinct from autoregressive LLMs. Because mask tokens are native inputs and tokens are committed by confidence rather than position, harmful content can be induced through infilling and outside the monitored prefix. Existing jailbreaks either miss this native infill capability or rely on low-diversity mask-bearing templates applied uni… D-Judge: Disrupting Multi-Turn Jailbreaks using Semantics-Preserving Output Rewriting ../papers/arxiv-c66df945b77b.html https://arxiv.org/abs/2606.02640#2026-06-03#jailbreak Wed, 03 Jun 2026 14:09:56 +0800 Multi-turn jailbreak attacks pose a growing threat to large language model (LLM) safety because they exploit feedback from auxiliary judge models to iteratively refine prompts toward harmful goals. Existing defenses largely detect or block unsafe content at individual turns or at the final response, leaving the judge-driven refinement loop intact and allowing attackers to extract informative feedback from intermediate interactions. 
We introduce D-Judge, a semantics-preserving output rewriting d… MultiTurnPSB: Evaluating Multi-Turn Jailbreak Attacks and Classifier-Based Defenses for Medical AI Safety ../papers/arxiv-ed5c28d51e28.html https://arxiv.org/abs/2606.02630#2026-06-03#jailbreak Wed, 03 Jun 2026 14:09:56 +0800 Patient-facing medical chatbots are commonly evaluated on single-turn prompts, yet real users push back after refusals, add urgency, and invoke authority. We introduce MultiTurnPSB, a four-turn adversarial extension of PatientSafetyBench, and evaluate GPT-4.1-mini under fixed template, template-adaptive, and live adversarial attacks. Unsafe responses rise from 35% to nearly 80% by Turn 4 under live attack. Under the same adversary, GPT-4.1-mini and Claude Sonnet 4.5 are statistically indistingu… Which Defense Closes Which Threat? Attributing OWASP-LLM-Top-10 Coverage and Its Brittleness Under Paraphrasing ../papers/arxiv-fb00761396ec.html https://arxiv.org/abs/2606.02822#2026-06-03#jailbreak Wed, 03 Jun 2026 14:09:56 +0800 Production LLM applications stack several defense families -- refusal-phrase filters, token-budget controls, model allowlists, rate limits, tool-registry authentication -- yet existing breach-and-attack-simulation (BAS) benchmarks report a single aggregate coverage number, hiding which family closes which threat. We measure attribution. We add four OWASP-LLM-Top-10-aware agents to a 21-agent baseline scanner and target a lattice of four synthetic LLM endpoints: $L_0$ (no defenses), $L_1$ (refus… Jailbreaking Multimodal Large Language Models using Multi-Clip Video ../papers/arxiv-b31844e10897.html https://arxiv.org/abs/2606.02111v1#2026-06-02#jailbreak Tue, 02 Jun 2026 13:56:35 +0800 As multimodal large language models (MLLMs) have advanced to process video inputs, concerns have emerged about their potential for malicious misuse. 
Prior jailbreak studies have shown that safety alignment in MLLMs can be bypassed through visual inputs, yet it remains unclear which properties of video inputs induce this vulnerability. To address this gap, we introduce Multi-Clip Video (MCV) SafetyBench, a dataset of 2,920 videos designed to evaluate how the diversity of video inputs affects the… Beyond Attack Success Rate: Temporal Logit Observability for LLM Safety Failures ../papers/arxiv-cf275d1c940d.html https://arxiv.org/abs/2605.29629#2026-05-29#jailbreak Fri, 29 May 2026 13:18:32 +0800 Attack Success Rate (ASR) evaluates each jailbreak with a single yes/no label at the end of generation, telling us whether a failure happened but not how it unfolded. Two attacks that produce equally harmful outputs may have followed completely different paths, and ASR cannot tell them apart. We make those hidden paths observable from logits alone. Temporal Logit Observability (TLO) is a training-free diagnostic that watches a compliance-refusal margin during decoding and places each model-atta… Code as a Weapon: A Consensus-Labeled Prompt Bank for Measuring Coding-Model Compliance with Malicious-Code Requests ../papers/arxiv-5b773c3bcb3e.html https://arxiv.org/abs/2605.28734v1#2026-05-28#jailbreak Thu, 28 May 2026 13:15:52 +0800 A general-purpose language model that answers a harmful question returns text; a coding model that complies with a malicious request can return a working weapon -- a keylogger, a ransomware stub, an exploit that runs as written. This asymmetry in the severity of a single act of compliance implies coding-specialized models should clear a higher refusal bar than general-purpose chat models, not a lower one, yet the field cannot presently tell whether they do. 
Refusal benchmarks for malicious code… BAIT: Boundary-Guided Disclosure Escalation via Self-Conditioned Reasoning ../papers/arxiv-353d045afa65.html https://arxiv.org/abs/2605.27110v1#2026-05-27#jailbreak Wed, 27 May 2026 13:23:19 +0800 In this work, we propose BAIT (Boundary-Aware Iterative Trap), a three-step jailbreak framework that approaches malicious goals through internal disclosure. BAIT first asks the model to identify the protection boundary, then requires it to refine that boundary, and finally requests a detailed example. By expanding each step upon the model's previous responses, BAIT turns the model's own reasoning and consistency tendency into a disclosure pathway. Experiments on AdvBench, JailbreakBench, AIR-Be… LASH: Adaptive Semantic Hybridization for Black-Box Jailbreaking of Large Language Models ../papers/arxiv-c2f22cc79ee3.html https://arxiv.org/abs/2605.21362v1#2026-05-21#jailbreak Thu, 21 May 2026 13:14:24 +0800 Jailbreak attacks expose a persistent gap between the intended safety behavior of aligned large language models and their behavior under adversarial prompting. Existing automated methods are increasingly effective but each commits to a single attack family (e.g., one refinement loop, one tree search, one mutation space, or one strategy library) and no single family dominates: the best-performing method shifts across target models and harm categories, suggesting complementary strengths that per-… Attention-Guided Reward for Reinforcement Learning-based Jailbreak against Large Reasoning Models ../papers/arxiv-db4d27560487.html https://arxiv.org/abs/2605.19485#2026-05-20#jailbreak Wed, 20 May 2026 13:10:58 +0800 Large Reasoning Models (LRMs) have demonstrated remarkable capabilities in solving complex problems by generating structured, step-by-step reasoning content. 
However, exposing a model's internal reasoning process introduces additional safety risks; for example, recent studies show that LRMs are more vulnerable to jailbreak attacks than standard LLMs. In this paper, we investigate jailbreak attacks on LRMs and reveal that the attack success rate (ASR) is closely correlated with LRMs' attention p… An Empirical Study of Privacy Leakage Chains via Prompt Injection in Black-Box Chatbot Environments ../papers/arxiv-0ec08efc6fec.html https://arxiv.org/abs/2605.18133v1#2026-05-19#jailbreak Tue, 19 May 2026 13:08:04 +0800 LLM-based chatbot agents increasingly process user requests by combining natural-language reasoning with external tools such as web browsing. These capabilities improve usability, but they also create attack surfaces when untrusted external content is processed as part of a user's task. This paper studies a privacy-leakage attack chain based on indirect prompt injection in black-box chatbot environments, where the attacker has no access to model weights, system prompts, or agent implementation… Multilingual jailbreaking of LLMs using low-resource languages ../papers/arxiv-e831b0db2fc4.html https://arxiv.org/abs/2605.18239v1#2026-05-19#jailbreak Tue, 19 May 2026 13:08:04 +0800 Large Language Models (LLMs) remain vulnerable to jailbreak attempts that circumvent safety guardrails. We investigate whether multi-turn conversations using low-resource African languages (Afrikaans, Kiswahili, isiXhosa, and isiZulu) can bypass safety mechanisms across commercial LLMs. We translated prompts from existing datasets and evaluated ChatGPT, Claude, DeepSeek, Gemini, and Grok through automated testing and human red-teaming with native speakers. 
Single-turn translation attacks proved… Acoustic Interference: A New Paradigm Weaponizing Acoustic Latent Semantic for Universal Jailbreak against Large Audio Language Models ../papers/arxiv-4a2f8ff0611d.html https://arxiv.org/abs/2605.18168v1#2026-05-19#jailbreak Tue, 19 May 2026 13:08:04 +0800 The integration of audio modality into Large Audio Language Models (LALMs) significantly expands their attack surface. Existing jailbreak paradigms predominantly treat audio as a carrier for malicious payloads, relying on semantic optimization, acoustic parameter control, or additive perturbation to embed harmful content into the audio signal. In this work, we challenge this necessity and propose a new paradigm in which the role of audio shifts from content injection to safety alignment interfe… Metaphor Is Not All Attention Needs ../papers/arxiv-7ab1d8457fd2.html https://arxiv.org/abs/2605.12128v1#2026-05-13#jailbreak Wed, 13 May 2026 12:54:34 +0800 Large language models are increasingly deployed in safety-critical applications, where their ability to resist harmful instructions is essential. Although post-training aims to make models robust against many jailbreak strategies, recent evidence shows that stylistic reformulations, such as poetic transformation, can still bypass safety mechanisms with alarming effectiveness. This raises a central question: why do literary jailbreaks succeed? In this work, we investigate whether their effective… LITMUS: Benchmarking Behavioral Jailbreaks of LLM Agents in Real OS Environments ../papers/arxiv-8c7a92cf216a.html https://arxiv.org/abs/2605.10779v1#2026-05-12#jailbreak Tue, 12 May 2026 12:42:08 +0800 The rapid proliferation of LLM-based autonomous agents in real operating system environments introduces a new category of safety risk beyond content safety: behavior jailbreak, where an adversary induces an agent to execute dangerous OS-level operations with irreversible consequences. 
Existing benchmarks either evaluate safety at the semantic layer alone, missing physical-layer harms, or fail to isolate test cases, letting earlier runs contaminate later ones. We present LITMUS (LLM-agents In-OS… Break the Brake, Not the Wheel: Untargeted Jailbreak via Entropy Maximization ../papers/arxiv-f79ef4805ce1.html https://arxiv.org/abs/2605.10764v1#2026-05-12#jailbreak Tue, 12 May 2026 12:42:08 +0800 Recent studies show that gradient-based universal image jailbreaks on vision-language models (VLMs) exhibit little or no cross-model transferability, casting doubt on the feasibility of transferable multimodal jailbreaks. We revisit this conclusion under a strictly untargeted threat model without enforcing a fixed prefix or response pattern. Our preliminary experiment reveals that refusal behavior concentrates at high-entropy tokens during autoregressive decoding, and non-refusal tokens already… Re-Triggering Safeguards within LLMs for Jailbreak Detection ../papers/arxiv-14d1baeefb41.html https://arxiv.org/abs/2605.10611v1#2026-05-12#jailbreak Tue, 12 May 2026 12:42:08 +0800 This paper proposes a jailbreaking prompt detection method for large language models (LLMs) to defend against jailbreak attacks. Although recent LLMs are equipped with built-in safeguards, it remains possible to craft jailbreaking prompts that bypass them. We argue that such jailbreaking prompts are inherently fragile, and thus introduce an embedding disruption method to re-activate the safeguards within LLMs. Unlike previous defense methods that aim to serve as standalone solutions, our approa… Guaranteed Jailbreaking Defense via Disrupt-and-Rectify Smoothing ../papers/arxiv-d0bd36433ccb.html https://arxiv.org/abs/2605.10582v1#2026-05-12#jailbreak Tue, 12 May 2026 12:42:08 +0800 This paper proposes a guaranteed defense method for large language models (LLMs) to safeguard against jailbreaking attacks. 
Drawing inspiration from the denoised-smoothing approach in the adversarial defense domain, we propose a novel smoothing-based defense method, termed Disrupt-and-Rectify Smoothing (DR-Smoothing). Specifically, we integrate a two-stage prompt processing scheme (first disrupting the input prompt, then rectifying it) into the conventional smoothing defense framework. This disru… SoK: Robustness in Large Language Models against Jailbreak Attacks ../papers/arxiv-3ce6e1f63aa8.html https://arxiv.org/abs/2605.05058v1#2026-05-07#jailbreak Thu, 07 May 2026 12:38:06 +0800 Large Language Models (LLMs) have achieved remarkable success but remain highly susceptible to jailbreak attacks, in which adversarial prompts coerce models into generating harmful, unethical, or policy-violating outputs. Such attacks pose real-world risks, eroding safety, trust, and regulatory compliance in high-stakes applications. Although a variety of attack and defense methods have been proposed, existing evaluation practices are inadequate, often relying on narrow metrics like attack succ… MultiBreak: A Scalable and Diverse Multi-turn Jailbreak Benchmark for Evaluating LLM Safety ../papers/arxiv-51601cee2f95.html https://arxiv.org/abs/2605.01687#2026-05-05#jailbreak Tue, 05 May 2026 12:20:54 +0800 We present MultiBreak, a scalable and diverse multi-turn jailbreak benchmark to evaluate large language model (LLM) safety. Multi-turn jailbreaks mimic natural conversational settings, making it easier for them to bypass safety-aligned LLMs than for single-turn jailbreaks. Existing multi-turn benchmarks are limited in size or rely heavily on templates, which restrict their diversity. 
To address this gap, we unify a wide range of harmful jailbreak intents, and introduce an active learning pipeline for expan… Transient Turn Injection: Exposing Stateless Multi-Turn Vulnerabilities in Large Language Models ../papers/arxiv-558f38b55608.html https://arxiv.org/abs/2604.21860v1#2026-04-24#jailbreak Fri, 24 Apr 2026 11:46:20 +0800 Large language models (LLMs) are increasingly integrated into sensitive workflows, raising the stakes for adversarial robustness and safety. This paper introduces Transient Turn Injection (TTI), a new multi-turn attack technique that systematically exploits stateless moderation by distributing adversarial intent across isolated interactions. TTI leverages automated attacker agents powered by large language models to iteratively test and evade policy enforcement in both commercial and open-source…