<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
<channel>
<title>guardrail Topic Archive</title>
<link>guardrail.html</link>
<description>Long-term tracking RSS feed for the keyword guardrail, aggregating all previously matched papers.</description>
<language>en</language>
<lastBuildDate>Sun, 28 Jun 2026 05:24:06 +0000</lastBuildDate>
<item>
<title>Do Safety Guardrails Need to Reason? LeanGuard: A Fast and Light Approach for Robust Moderation</title>
<link>../papers/arxiv-da78713ac31a.html</link>
<guid>https://arxiv.org/abs/2606.26686v1#2026-06-26#guardrail</guid>
<pubDate>Fri, 26 Jun 2026 13:16:53 +0800</pubDate>
<description>In order to screen a prompt or a response, the recent guardrail methods generate a chain-of-thought (CoT) before they issue a verdict. This design follows a common belief that step-by-step reasoning improves a decision. However, CoT also makes the guard heavy and slow, because the model must generate many tokens before it decides. This may not match how guardrails are actually deployed. A guardrail sometimes should not be heavy and slow, and it often runs on-device, for example on an embodied r…</description>
</item>
<item>
<title>AgentX: Towards Agent-Driven Self-Iteration of Industrial Recommender Systems</title>
<link>../papers/arxiv-9259981f77a1.html</link>
<guid>https://arxiv.org/abs/2606.26859v1#2026-06-26#guardrail</guid>
<pubDate>Fri, 26 Jun 2026 13:16:53 +0800</pubDate>
<description>Recommendation algorithm iteration is moving from an artisanal, engineer-bound process toward an industrialized research loop, but this transition remains blocked by a structural execution bottleneck: the idea-to-launch cycle still depends on human engineers to generate hypotheses, modify production code, launch A/B experiments, and attribute online results. Innovation therefore scales linearly with headcount rather than compounding with evidence, compute, and accumulated experimental knowledge…</description>
</item>
<item>
<title>MedGuards: Multi-Agent System for Reliable Medical Error Detection and Correction</title>
<link>../papers/arxiv-924e9f45b440.html</link>
<guid>https://arxiv.org/abs/2606.25651v1#2026-06-25#guardrail</guid>
<pubDate>Thu, 25 Jun 2026 13:11:21 +0800</pubDate>
<description>As Large Language Models (LLMs) are increasingly deployed in healthcare settings, accurate error detection and correction in generated or existing text becomes critical, as even minor mistakes can pose risks to patient safety. Existing methods for error detection and correction, including automated checks and heuristic-based approaches, do not generalize well across unseen datasets. In this paper, we propose MedGuards as a medical safety guardrail, which is a new framework that treats medical e…</description>
</item>
<item>
<title>Do Encoders Suffice? A Systematic Comparison of Encoder and Decoder Safety Judges for LLM Adversarial Evaluation</title>
<link>../papers/arxiv-cca18893a109.html</link>
<guid>https://arxiv.org/abs/2606.25782v1#2026-06-25#guardrail</guid>
<pubDate>Thu, 25 Jun 2026 13:11:21 +0800</pubDate>
<description>With the widespread adoption of large language models (LLMs) in chatbots and everyday applications, companies increasingly need guardrails that are effective while remaining low-cost and low-latency. Safety evaluation of LLM outputs has generally relied on LLM-based judges, which can be effective but are often slow and expensive to deploy at scale. In this paper, we evaluate whether fine-tuned modern encoder classifiers from the ModernBERT family, including ModernBERT and Ettin, can reliably id…</description>
</item>
<item>
<title>The Unfireable Safety Kernel: Execution-Time AI Alignment for AI Agents and Other Escapable AI Systems</title>
<link>../papers/arxiv-2e1379a16fa7.html</link>
<guid>https://arxiv.org/abs/2606.26057v1#2026-06-25#guardrail</guid>
<pubDate>Thu, 25 Jun 2026 13:11:21 +0800</pubDate>
<description>AI agents are granted access to tools, APIs, and other infrastructure, making them active principals in those systems. The dominant approach places controls inside the agent&#x27;s own runtime: system prompts, output filters, and guardrail libraries. Any control in the agent&#x27;s address space is reachable by inputs that influence it; this generalizes to any AI system with sufficient reach into its own runtime, a class we term escapable AI systems. We identify four properties that an authorization mech…</description>
</item>
<item>
<title>Red-Teaming the Agentic Red-Team</title>
<link>../papers/arxiv-a2ceabb33333.html</link>
<guid>https://arxiv.org/abs/2606.24496v1#2026-06-24#guardrail</guid>
<pubDate>Wed, 24 Jun 2026 13:06:49 +0800</pubDate>
<description>The use of agentic systems to perform offensive security operations has moved from a theoretical possibility to a commoditized capability. However, while the community has focused on creating more and more capable agents, less attention has been allocated to assessing the security of those systems. In this work, we present the first in-depth security analysis of the most widely used agentic systems for offensive security operations. We show that most of these tools share common design flaws tha…</description>
</item>
<item>
<title>PHANTOM: A Large-Scale Dataset of Multimodal Adversarial Attacks for Vision-Language Models</title>
<link>../papers/arxiv-5e668ce8c325.html</link>
<guid>https://arxiv.org/abs/2606.24388v1#2026-06-24#guardrail</guid>
<pubDate>Wed, 24 Jun 2026 13:06:49 +0800</pubDate>
<description>We introduce a large-scale, open-source dataset of pre-generated adversarial attacks for vision-language models (VLMs). The dataset is designed to be diverse, representative, and practical, extending existing benchmarks by covering 10 high-level categories and 55 subcategories of harmful intents. Our primary goal is to make adversarial data accessible to the research community, given the computational cost and complexity of generating large numbers of attacks. The dataset comprises 47 524 adver…</description>
</item>
<item>
<title>Measuring &amp; Mitigating Over-Alignment for LLMs in Multilingual Criminal Law Courts</title>
<link>../papers/arxiv-a44e1223beb7.html</link>
<guid>https://arxiv.org/abs/2606.23375v1#2026-06-23#guardrail</guid>
<pubDate>Tue, 23 Jun 2026 13:10:02 +0800</pubDate>
<description>While the wider applicability of LLMs in the legal field is currently debated due to their reliability and the gravity of any errors, narrow uses with well-understood and mitigated risks have emerged. Notably the Swiss Federal Supreme Court uses small on-premises models for tentative translations and short-passage summarization across the four official languages. However, such usage is challenging in the context of Criminal Law. Since rulings and cases employees work on routinely can contain de…</description>
</item>
<item>
<title>LLM agent safety, multi-turn red-teaming, jailbreak benchmarks, adversarial robustness, safety-critical systems</title>
<link>../papers/arxiv-a473df34c717.html</link>
<guid>https://arxiv.org/abs/2606.20408v1#2026-06-19#guardrail</guid>
<pubDate>Fri, 19 Jun 2026 14:26:15 +0800</pubDate>
<description>Large language model (LLM) agents are increasingly proposed as supervisory components for safety-critical systems, yet their robustness under sustained, adaptive adversarial pressure remains poorly characterized. We present NRT-Bench, a benchmark for multi-turn red-teaming of LLM agents acting as operators of a safety-critical system, instantiated in a simulated nuclear power plant control room. A five-role operator team, each backed by a configurable LLM, runs a plant governed by six critical…</description>
</item>
<item>
<title>RACL: Reasoning-Agent Control Layers for Continuous Metaheuristic Learning</title>
<link>../papers/arxiv-380550205212.html</link>
<guid>https://arxiv.org/abs/2606.20142v1#2026-06-19#guardrail</guid>
<pubDate>Fri, 19 Jun 2026 14:26:15 +0800</pubDate>
<description>This paper introduces RACL, a Reasoning-Agent Control Layer for metaheuristics. RACL places a reasoning agent above an existing optimizer. The agent does not replace the optimizer and does not modify business constraints. Instead, it controls the optimizer&#x27;s internal search behavior by observing operational memory, reasoning over past behavior, formulating bounded hypotheses, testing interventions, evaluating outcomes, applying guardrails, consolidating useful policies and explaining its decisi…</description>
</item>
<item>
<title>PreAct: Computer-Using Agents that Get Faster on Repeated Tasks</title>
<link>../papers/arxiv-87a19bc64272.html</link>
<guid>https://arxiv.org/abs/2606.17929v1#2026-06-17#guardrail</guid>
<pubDate>Wed, 17 Jun 2026 14:22:19 +0800</pubDate>
<description>Computer-using agents drive real software through the screen -- clicking and typing -- but they solve every task from scratch: asked to repeat a task, an agent re-reads the screen, re-reasons every tap, and pays the full cost again. We present PreAct, which lets such an agent get faster on tasks it has done before. The first time it succeeds, PreAct compiles the run into a small state-machine program -- states that check the screen, transitions that act -- and on later runs replays it directly instea…</description>
</item>
<item>
<title>Neuro-Symbolic Agents for Regulated Process Automation: Challenges and Research Agenda</title>
<link>../papers/arxiv-29aa8644292b.html</link>
<guid>https://arxiv.org/abs/2606.13405v1#2026-06-12#guardrail</guid>
<pubDate>Fri, 12 Jun 2026 13:55:02 +0800</pubDate>
<description>LLM-based agents are entering regulated industries where they automate judgment-intensive quality management processes. We argue that symbolic structures already embedded in these domains, including regulations, typed process models, and compliance constraints, should be treated not merely as external monitoring mechanisms but as core architectural components that shape the agent&#x27;s decision-making and behavior. We propose compliance-by-construction as a complementary paradigm to guardrail-based…</description>
</item>
<item>
<title>It Takes One to Bias Them All: Breaking Bad with One-Shot GRPO</title>
<link>../papers/arxiv-41831e36aa6f.html</link>
<guid>https://arxiv.org/abs/2606.10931v1#2026-06-10#guardrail</guid>
<pubDate>Wed, 10 Jun 2026 13:25:04 +0800</pubDate>
<description>Warning: This paper contains several toxic and offensive statements. Modern large language models (LLMs) are typically aligned through large-scale post-training to ensure fair and reliable behavior. In this work, we investigate how easily such guardrails can be broken by Group Relative Policy Optimization (GRPO). We show that one-shot GRPO training on a single biased example is sufficient to induce systematic bias, with stereotype-driven reasoning generalizing across attributes, categories, and…</description>
</item>
<item>
<title>The Neutral Mask: How RLHF Provides Shallow Alignment while Leaving Partisan Structure Intact in a Large Language Model</title>
<link>../papers/arxiv-eb30ecfd2496.html</link>
<guid>https://arxiv.org/abs/2606.09735v1#2026-06-09#guardrail</guid>
<pubDate>Tue, 09 Jun 2026 13:12:49 +0800</pubDate>
<description>The ambition behind alignment training is to make large language models safe and useful. The primary mechanism, reinforcement learning from human feedback (RLHF), shapes the behavior of deployed language models by aligning them with &quot;human values.&quot; Yet the process is opaque. What values are being encoded; whose values are they; and how does RLHF encode them? A growing body of evidence suggests that RLHF produces only functional compliance rather than deep alignment. We offer a mechanistic cas…</description>
</item>
<item>
<title>What the Eyes See, the LLMs Miss: Exploiting Human Perception for Adversarial Text Attacks</title>
<link>../papers/arxiv-c78b7a180254.html</link>
<guid>https://arxiv.org/abs/2606.09700v1#2026-06-09#guardrail</guid>
<pubDate>Tue, 09 Jun 2026 13:12:49 +0800</pubDate>
<description>Large language model (LLM)-powered content moderation systems have become a critical defense against harmful online content. However, these systems primarily operate on tokenized text and largely ignore the visual cues that humans naturally rely on when interpreting content. We show that this discrepancy creates a fundamental perceptual mismatch: content that is readily recognized as harmful by humans can become effectively invisible to automated moderation systems. To study this vulnerability,…</description>
</item>
<item>
<title>GuardNet: Ensemble Strategies of Shallow Neural Networks for Robust Prompt Injection and Jailbreak Detection</title>
<link>../papers/arxiv-50a6cf67d9a1.html</link>
<guid>https://arxiv.org/abs/2606.05566#2026-06-05#guardrail</guid>
<pubDate>Fri, 05 Jun 2026 13:25:00 +0800</pubDate>
<description>Large Language Models (LLMs) have transformed natural language processing, but they remain vulnerable to Prompt Injection (PI) and Jailbreak (JB) attacks. In addition, benchmark evaluations may be affected by contamination and partial information leakage, compromising performance estimates. This work presents GuardNet, a guardrail system based on an ensemble of shallow neural networks (BiLSTMs) with approximately 47 million parameters. We investigate the hypothesis that robustness in adversaria…</description>
</item>
<item>
<title>From Risk Classification to Action Plan Remediation: A Guardrail Feedback Driven Framework for LLM Agents</title>
<link>../papers/arxiv-871cd8f730a3.html</link>
<guid>https://arxiv.org/abs/2606.05805#2026-06-05#guardrail</guid>
<pubDate>Fri, 05 Jun 2026 13:25:00 +0800</pubDate>
<description>LLM-based guardrails typically safeguard agents by evaluating proposed actions or inputs before execution, producing safety signals such as binary allow/deny decisions, risk categories, and/or explanatory rationales about potential policy violations. However, agent risks often arise when otherwise benign tasks are contaminated by untrusted external content, unsafe instructions, or risky tool use. Existing guardrails often flag the entire task uniformly as unsafe, thereby blocking the threat but…</description>
</item>
<item>
<title>Safety Paradox: How Enhanced Safety Awareness Leaves LLMs Vulnerable to Posterior Attack</title>
<link>../papers/arxiv-088c75fca1b6.html</link>
<guid>https://arxiv.org/abs/2606.05614#2026-06-05#guardrail</guid>
<pubDate>Fri, 05 Jun 2026 13:25:00 +0800</pubDate>
<description>Large language models (LLMs) are rigorously aligned to refuse harmful requests, a process that inherently cultivates a latent capacity to evaluate and recognize unsafe content. In this work, we reveal that this advanced safety awareness inadvertently introduces a fatal vulnerability. We introduce Posterior Attack, a single-query jailbreak that bypasses guardrails by prompting the model to generate the exact harmful response its internal classifier would normally flag as unsafe. Through extensiv…</description>
</item>
<item>
<title>The Granularity Gap: A Multi-Dimensional Longitudinal Audit of Sycophancy in Gemini Models</title>
<link>../papers/arxiv-77af413d5094.html</link>
<guid>https://arxiv.org/abs/2606.05183#2026-06-05#guardrail</guid>
<pubDate>Fri, 05 Jun 2026 13:25:00 +0800</pubDate>
<description>Large language models are increasingly deployed as high-stakes advisors, yet standard alignment benchmarks treat sycophancy as a binary failure mode. We introduce the Granularity Gap: coarse binary metrics mask substantial social-compliance behaviors where models capitulate to user framing, validate questionable premises, or soften factual corrections without producing overtly false outputs. We evaluate six Gemini variants across generations 2.0, 2.5, and 3.0 on 73 adversarial prompts under thr…</description>
</item>
<item>
<title>Toward Pre-Deployment Assurance for Enterprise AI Agents: Ontology-Grounded Simulation and Trust Certification</title>
<link>../papers/arxiv-4a5867dff3ab.html</link>
<guid>https://arxiv.org/abs/2606.04037#2026-06-04#guardrail</guid>
<pubDate>Thu, 04 Jun 2026 14:02:06 +0800</pubDate>
<description>Pre-deployment verification of enterprise artificial intelligence (AI) agents remains a critical gap between large language model (LLM) capability benchmarking and production deployment. Post-deployment monitoring, human-in-the-loop controls, and prompt-level guardrails offer limited assurance once an agent is operating in production. We propose an ontology-grounded verification framework combining three components: an Agent Operational Envelope formalizing the certification space across permis…</description>
</item>
<item>
<title>Acceptance-Test-Driven Evaluation Protocols for Business-Centric LLM Systems</title>
<link>../papers/arxiv-553f13fc044d.html</link>
<guid>https://arxiv.org/abs/2606.02755#2026-06-03#guardrail</guid>
<pubDate>Wed, 03 Jun 2026 14:09:56 +0800</pubDate>
<description>Large language model (LLM) applications are increasingly expected to satisfy deterministic institutional requirements while relying on probabilistic generative components. This mismatch makes ordinary post-hoc benchmarking insufficient for systems that must be safe, reliable, auditable, and economically useful. This paper contributes an evaluation-protocol extension for operational LLM systems grounded in acceptance-test-driven development, safety engineering, and business-centric validation. T…</description>
</item>
<item>
<title>Cost-Aware Query Routing in RAG: Empirical Analysis of Retrieval Depth Tradeoffs</title>
<link>../papers/arxiv-34f400417e1c.html</link>
<guid>https://arxiv.org/abs/2606.02581#2026-06-03#guardrail</guid>
<pubDate>Wed, 03 Jun 2026 14:09:56 +0800</pubDate>
<description>Retrieval-augmented generation (RAG) faces a fundamental three-way tension: deeper retrieval improves factual grounding but inflates token costs and end-to-end latency. Static retrieval configurations cannot resolve this tension across heterogeneous query workloads -- simple definitional queries waste budget on unnecessary context, while complex analytical prompts are underserved by shallow retrieval. This paper introduces Cost-Aware RAG (CA-RAG), a per-query routing framework that selec…</description>
</item>
<item>
<title>SentGuard: Sentence-Level Streaming Guardrails for Large Language Models</title>
<link>../papers/arxiv-703e6bba37af.html</link>
<guid>https://arxiv.org/abs/2606.02041v1#2026-06-02#guardrail</guid>
<pubDate>Tue, 02 Jun 2026 13:56:35 +0800</pubDate>
<description>Large language models increasingly stream long, reasoning-intensive responses in real time, making when to moderate as critical as whether to moderate. Existing guardrails fall into two unsatisfactory extremes: response-level methods delay intervention until the full output is generated, whereas token-level methods act on incomplete semantics, often producing unstable decisions and excessive guard invocations. To address this challenge, we propose SentGuard, a sentence-level streaming guardrail…</description>
</item>
<item>
<title>PyFEX: Uncovering Evasive Python-based Threats via Resilient and Exhaustive Path Exploration</title>
<link>../papers/arxiv-52d3446d1c6c.html</link>
<guid>https://arxiv.org/abs/2606.02196v1#2026-06-02#guardrail</guid>
<pubDate>Tue, 02 Jun 2026 13:56:35 +0800</pubDate>
<description>The rapid expansion of the Python ecosystem has fueled two distinct but converging threats: adversaries increasingly target the software supply chain via the Python Package Index (PyPI), while also building evasive, cross-platform malicious binaries compiled from source code written in Python. Current program analysis techniques struggle to address this dual threat. Static analysis based tools are often blinded by runtime obfuscation and compiled bytecode, while dynamic analysis based ones are…</description>
</item>
<item>
<title>Provably Secure Agent Guardrail</title>
<link>../papers/arxiv-5045007906ff.html</link>
<guid>https://arxiv.org/abs/2605.29251#2026-05-29#guardrail</guid>
<pubDate>Fri, 29 May 2026 13:18:32 +0800</pubDate>
<description>As large language models transition from bounded generative engines to agents with expansive execution privileges, AI going out of control precipitates a fundamental crisis in artificial intelligence security. Existing defense architectures heavily rely on empirical semantic guardrails and probabilistic large model adjudicators, mechanisms that fail to provide deterministic security lower bounds when facing complex semantic symbol decoupling attacks. To overcome this empirical semantic guardrai…</description>
</item>
<item>
<title>Robust and Efficient Guardrails with Latent Reasoning</title>
<link>../papers/arxiv-c9e6fa58a2c3.html</link>
<guid>https://arxiv.org/abs/2605.29068#2026-05-29#guardrail</guid>
<pubDate>Fri, 29 May 2026 13:18:32 +0800</pubDate>
<description>Maintaining the safety of large language models (LLMs) is crucial as they are increasingly deployed in real-world applications. Existing safety guardrails typically rely on single-pass classification or, more recently, distilled reasoning. Reasoning-based guardrails significantly outperform classification-only baselines, but they incur substantial query latency and token overhead that make them impractical for high-throughput deployment. To address this challenge, we propose COLAGUARD, a guardra…</description>
</item>
<item>
<title>AgentDoG 1.5: A Lightweight and Scalable Alignment Framework for AI Agent Safety and Security</title>
<link>../papers/arxiv-4e03d4a36b2f.html</link>
<guid>https://arxiv.org/abs/2605.29801#2026-05-29#guardrail</guid>
<pubDate>Fri, 29 May 2026 13:18:32 +0800</pubDate>
<description>Modern open-world agents such as OpenClaw exhibit powerful cross-environment execution capabilities yet introduce broad new safety risk sources. Meanwhile, advanced frontier AI models drastically lower attack barriers, rendering current agent alignment frameworks inadequate for real-world deployment. To tackle these emerging threats, we propose a lightweight and scalable agent safety alignment framework. Specifically, we update the agent safety taxonomy to accommodate emergent risks from Codex…</description>
</item>
<item>
<title>The Ethics of LLM Sandbox and Persona Dynamics</title>
<link>../papers/arxiv-353273667742.html</link>
<guid>https://arxiv.org/abs/2605.28647v1#2026-05-28#guardrail</guid>
<pubDate>Thu, 28 May 2026 13:15:52 +0800</pubDate>
<description>It is well known that LLM guardrails and trained persona dynamics can produce a reality gap: the distance between the world an LLM is permitted or shaped to describe, and the world in which users must act. Here we argue that actively generating reality gaps is in fact unethical because it knowingly shifts epistemic risk back to the uninformed user -- this is reality laundering. This can potentially cause harm when operationalised at scale. The risk is sharpest in high-exposure advice contexts, w…</description>
</item>
<item>
<title>EviACT: An Evidence-to-Action Framework for Agentic Program Repair</title>
<link>../papers/arxiv-4ff8b4048b9a.html</link>
<guid>https://arxiv.org/abs/2605.27238v1#2026-05-27#guardrail</guid>
<pubDate>Wed, 27 May 2026 13:23:19 +0800</pubDate>
<description>LLM-based agents have moved automated program repair (APR) from fixed-context patch generation to interactive repository-level repair. However, existing agentic APR systems still struggle to use execution evidence to guide localization, patch generation, and validation. We propose EviACT (Evidence-to-Action), an agentic APR framework that coordinates three evidence-driven guardrails across repair stages. The retrieval scaffold grounds repair context, the compile gate filters invalid edits, and…</description>
</item>
<item>
<title>AlbanianLLMSafety: A Safety Evaluation Dataset for Large Language Models in Albanian</title>
<link>../papers/arxiv-32a0b59832be.html</link>
<guid>https://arxiv.org/abs/2605.26954v1#2026-05-27#guardrail</guid>
<pubDate>Wed, 27 May 2026 13:23:19 +0800</pubDate>
<description>Safety evaluation of Large Language Models (LLMs) has largely focused on high-resource languages, leaving low-resource languages critically underserved. We present AlbanianLLMSafety, the first publicly available safety evaluation dataset for LLMs in Albanian, a linguistically distinct low-resource language with approximately 7.5 million speakers across Albania, Kosovo, North Macedonia, and the diaspora. The dataset contains 2,951 prompts spanning 11 safety categories, including self-harm, viole…</description>
</item>
<item>
<title>Contractual Skills: A GovernSpec Design Framework for Enterprise AI Agents</title>
<link>../papers/arxiv-061ab2c8083a.html</link>
<guid>https://arxiv.org/abs/2605.22634v1#2026-05-22#guardrail</guid>
<pubDate>Fri, 22 May 2026 13:08:19 +0800</pubDate>
<description>Skills are increasingly used to package agent instructions, workflows, scripts, and reference materials. In enterprise settings, however, skills often need to express more than task guidance: they must make goals, input boundaries, permissions, evidence requirements, output contracts, quality criteria, verification steps, human approval points, and handoff rules inspectable. This paper proposes contractual skills, a GovernSpec-inspired design framework for organizing SKILL.md files as readable…</description>
</item>
<item>
<title>Robotics-Inspired Guardrails for Foundation Models in Socially Sensitive Domains</title>
<link>../papers/arxiv-2446b5b51703.html</link>
<guid>https://arxiv.org/abs/2605.19940#2026-05-20#guardrail</guid>
<pubDate>Wed, 20 May 2026 13:10:58 +0800</pubDate>
<description>Foundation models are increasingly deployed in socially sensitive domains such as education, mental health, and caregiving, where failures are often cumulative and context-dependent. Existing guardrail approaches -- ranging from training-time alignment to prompting, decoding constraints, and post-hoc moderation -- primarily provide empirical risk reduction rather than enforceable behavioral guarantees, and largely treat safety as a property of individual outputs rather than interaction trajecto…</description>
</item>
<item>
<title>SimGym: A Framework for A/B Test Simulation in E-Commerce with Traffic-Grounded VLM Agents</title>
<link>../papers/arxiv-301348ca3532.html</link>
<guid>https://arxiv.org/abs/2605.19219#2026-05-20#guardrail</guid>
<pubDate>Wed, 20 May 2026 13:10:58 +0800</pubDate>
<description>A/B testing remains the gold standard for evaluating modifications to e-commerce storefronts, yet it diverts traffic, requires weeks to reach statistical significance, and risks degrading user experience. We present SimGym, a framework for simulating A/B tests on e-commerce storefronts using vision-language model (VLM) agents operating in a live browser. The framework comprises three key components: (a) a traffic-grounded persona generation pipeline that derives per-shop buyer archetypes and in…</description>
</item>
<item>
<title>Multilingual jailbreaking of LLMs using low-resource languages</title>
<link>../papers/arxiv-e831b0db2fc4.html</link>
<guid>https://arxiv.org/abs/2605.18239v1#2026-05-19#guardrail</guid>
<pubDate>Tue, 19 May 2026 13:08:04 +0800</pubDate>
<description>Large Language Models (LLMs) remain vulnerable to jailbreak attempts that circumvent safety guardrails. We investigate whether multi-turn conversations using low-resource African languages (Afrikaans, Kiswahili, isiXhosa, and isiZulu) can bypass safety mechanisms across commercial LLMs. We translated prompts from existing datasets and evaluated ChatGPT, Claude, DeepSeek, Gemini, and Grok through automated testing and human red-teaming with native speakers. Single-turn translation attacks proved…</description>
</item>
<item>
<title>LLM-Based Persuasion Enables Guardrail Override in Frontier LLMs</title>
<link>../papers/arxiv-2403f4eee68c.html</link>
<guid>https://arxiv.org/abs/2605.13334v1#2026-05-14#guardrail</guid>
<pubDate>Thu, 14 May 2026 12:52:54 +0800</pubDate>
<description>Frontier assistant LLMs ship with strong guardrails: asked directly to write a persuasive essay denying the Holocaust, denying vaccine safety, defending flat-earth cosmology, arguing for racial hierarchies, denying anthropogenic climate change, or replacing evolution with creationism, they refuse. In this paper we show that the same frontier-class LLM, acting as a simulated user in a short, five-turn &quot;write an argumentative essay&quot; conversation, can persuade other frontier-class LLMs (including…</description>
</item>
<item>
<title>Intrinsic Guardrails: How Semantic Geometry of Personality Interacts with Emergent Misalignment in LLMs</title>
<link>../papers/arxiv-24c53f7838ce.html</link>
<guid>https://arxiv.org/abs/2605.10633v1#2026-05-12#guardrail</guid>
<pubDate>Tue, 12 May 2026 12:42:08 +0800</pubDate>
<description>Fine-tuning Large Language Models (LLMs) on benign narrow data can sometimes induce broad harmful behaviors, a vulnerability termed emergent misalignment (EM). While prior work links these failures to specific directions in the activation space, their relationship to the model&#x27;s broader persona remains unexplored. We map the latent personality space of LLMs through established psychometric profiles like the Big Five, Dark Triad, and LLM-specific behaviors (e.g. evil, sycophancy), and show that…</description>
</item>
<item>
<title>Say the Mission, Execute the Swarm: Agent-Enhanced LLM Reasoning in the Web-of-Drones</title>
<link>../papers/arxiv-5f22c5be2bee.html</link>
<guid>https://arxiv.org/abs/2605.03788v1#2026-05-06#guardrail</guid>
<pubDate>Wed, 06 May 2026 12:37:23 +0800</pubDate>
<description>Large Language Models (LLMs) are increasingly explored as high-level reasoning engines for cyber-physical systems, yet their application to real-time UAV swarm management remains challenging due to heterogeneous interfaces, limited grounding, and the need for long-running closed-loop execution. This paper presents a mission-agnostic, agent-enhanced LLM framework for UAV swarm control, where users express mission objectives in natural language and the system autonomously executes them through gr…</description>
</item>
<item>
<title>Think Before You Act -- A Neurocognitive Governance Model for Autonomous AI Agents</title>
<link>../papers/arxiv-73a5a9b54f1e.html</link>
<guid>https://arxiv.org/abs/2604.25684v1#2026-04-29#guardrail</guid>
<pubDate>Wed, 29 Apr 2026 12:26:28 +0800</pubDate>
<description>The rapid deployment of autonomous AI agents across enterprise, healthcare, and safety-critical environments has created a fundamental governance gap. Existing approaches -- runtime guardrails, training-time alignment, and post-hoc auditing -- treat governance as an external constraint rather than an internalized behavioral principle, leaving agents vulnerable to unsafe and irreversible actions. We address this gap by drawing on how humans self-govern naturally: before acting, humans engage delibera…</description>
</item>
<item>
<title>Parallax: Why AI Agents That Think Must Never Act</title>
<link>../papers/arxiv-fe385734239d.html</link>
<guid>https://arxiv.org/abs/2604.12986v1#2026-04-15#guardrail</guid>
<pubDate>Wed, 15 Apr 2026 11:35:50 +0800</pubDate>
<description>Autonomous AI agents are rapidly transitioning from experimental tools to operational infrastructure, with projections that 80% of enterprise applications will embed AI copilots by the end of 2026. As agents gain the ability to execute real-world actions (reading files, running commands, making network requests, modifying databases), a fundamental security gap has emerged. The dominant approach to agent safety relies on prompt-level guardrails: natural language instructions that operate at the…</description>
</item>
<item>
<title>Advancing neurotech justice in youth digital mental health: insights from an interdisciplinary and cross-generational workshop.</title>
<link>../papers/pubmed-b4af6d31f5b6.html</link>
<guid>https://pubmed.ncbi.nlm.nih.gov/41951757/#2026-04-09#guardrail</guid>
<pubDate>Thu, 09 Apr 2026 14:51:56 +0800</pubDate>
<description>Researchers and clinicians are increasingly looking to leverage artificial intelligence (AI) and digital tools to improve psychiatric care. Of particular promise is addressing the youth mental health crisis. Yet, the introduction of AI-enabled digital technologies for psychiatric treatment of young adults raises a host of ethical, legal, and societal issues (ELSI). To provide guidance in addressing these issues, we convened a two-day meeting at the Radcliffe Institute for Advanced Study at Harv…</description>
</item>
</channel>
</rss>
