Keyword Tracking

Keyword tracking: guardrail

This page tracks the keywords you care about in your configuration over time and archives the matching papers by date.

Recent trend

The most recent hit came from Agent Runtime Security: Do Safety Guardrails Need to Reason? LeanGuard: A Fast and Light Approach for Robust Moderation

[Trend chart; x-axis: dates from 2026-06-15 to 2026-06-28]

Hit details

Review the titles of papers that matched this keyword, grouped by date, with the source feed information retained.

2026-06-26

2026-06-26 13:16:53 (Asia/Shanghai)

Agent Runtime Security

Do Safety Guardrails Need to Reason? LeanGuard: A Fast and Light Approach for Robust Moderation

In order to screen a prompt or a response, the recent guardrail methods generate a chain-of-thought (CoT) before they issue a verdict. This design follows a common belief that ste…

Agent Runtime Security

AgentX: Towards Agent-Driven Self-Iteration of Industrial Recommender Systems

Recommendation algorithm iteration is moving from an artisanal, engineer-bound process toward an industrialized research loop, but this transition remains blocked by a structural…

2026-06-25

2026-06-25 13:11:21 (Asia/Shanghai)

MedGuards: Multi-Agent System for Reliable Medical Error Detection and Correction

As Large Language Models (LLMs) are increasingly deployed in healthcare settings, accurate error detection and correction in generated or existing text becomes critical, as even m…

Do Encoders Suffice? A Systematic Comparison of Encoder and Decoder Safety Judges for LLM Adversarial Evaluation

With the widespread adoption of large language models (LLMs) in chatbots and everyday applications, companies increasingly need guardrails that are effective while remaining low-c…

Agent Runtime Security

The Unfireable Safety Kernel: Execution-Time AI Alignment for AI Agents and Other Escapable AI Systems

AI agents are granted access to tools, APIs, and other infrastructure, making them active principals in those systems. The dominant approach places controls inside the agent's own…

2026-06-24

2026-06-24 13:06:49 (Asia/Shanghai)

Agent Runtime Security

Red-Teaming the Agentic Red-Team

The use of agentic systems to perform offensive security operations has moved from a theoretical possibility to a commoditized capability. However, while the community has focused…

Agent Runtime Security

PHANTOM: A Large-Scale Dataset of Multimodal Adversarial Attacks for Vision-Language Models

We introduce a large-scale, open-source dataset of pre-generated adversarial attacks for vision-language models (VLMs). The dataset is designed to be diverse, representative, and…

2026-06-23

2026-06-23 13:10:02 (Asia/Shanghai)

Measuring & Mitigating Over-Alignment for LLMs in Multilingual Criminal Law Courts

While the wider applicability of LLMs in the legal field is currently debated due to their reliability and the gravity of any errors, narrow uses with well-understood and mitigate…

2026-06-19

2026-06-19 14:26:15 (Asia/Shanghai)

LLM agent safety, multi-turn red-teaming, jailbreak benchmarks, adversarial robustness, safety-critical systems

Large language model (LLM) agents are increasingly proposed as supervisory components for safety-critical systems, yet their robustness under sustained, adaptive adversarial press…

Agent Runtime Security

RACL: Reasoning-Agent Control Layers for Continuous Metaheuristic Learning

This paper introduces RACL, a Reasoning-Agent Control Layer for metaheuristics. RACL places a reasoning agent above an existing optimizer. The agent does not replace the optimizer…

2026-06-17

2026-06-17 14:22:19 (Asia/Shanghai)

Agent Runtime Security

PreAct: Computer-Using Agents that Get Faster on Repeated Tasks

Computer-using agents drive real software through the screen -- clicking and typing -- but they solve every task from scratch: asked to repeat a task, an agent re-reads the screen…

2026-06-12

2026-06-12 13:55:02 (Asia/Shanghai)

Agent Runtime Security

Neuro-Symbolic Agents for Regulated Process Automation: Challenges and Research Agenda

LLM-based agents are entering regulated industries where they automate judgment intensive quality management processes. We argue that symbolic structures already embedded in these…

2026-06-10

2026-06-10 13:25:04 (Asia/Shanghai)

Agent Runtime Security

It Takes One to Bias Them All: Breaking Bad with One-Shot GRPO

Warning: This paper contains several toxic and offensive statements. Modern large language models (LLMs) are typically aligned through large-scale post-training to ensure fair and…

2026-06-09

2026-06-09 13:12:49 (Asia/Shanghai)

The Neutral Mask: How RLHF Provides Shallow Alignment while Leaving Partisan Structure Intact in a Large Language Model

The ambition behind alignment training is to make large language models safe and useful. The primary mechanism, reinforcement learning from human feedback (RLHF), shapes the behav…

Agent Runtime Security

What the Eyes See, the LLMs Miss: Exploiting Human Perception for Adversarial Text Attacks

Large language model (LLM)-powered content moderation systems have become a critical defense against harmful online content. However, these systems primarily operate on tokenized…

2026-06-05

2026-06-05 13:25:00 (Asia/Shanghai)

Agent Runtime Security

GuardNet: Ensemble Strategies of Shallow Neural Networks for Robust Prompt Injection and Jailbreak Detection

Large Language Models (LLMs) have transformed natural language processing, but they remain vulnerable to Prompt Injection (PI) and Jailbreak (JB) attacks. In addition, benchmark e…

Agent Runtime Security

From Risk Classification to Action Plan Remediation: A Guardrail Feedback Driven Framework for LLM Agents

LLM-based guardrails typically safeguard agents by evaluating proposed actions or inputs before execution, producing safety signals such as binary allow/deny decisions, risk categ…

Agent Runtime Security

Safety Paradox: How Enhanced Safety Awareness Leaves LLMs Vulnerable to Posterior Attack

Large language models (LLMs) are rigorously aligned to refuse harmful requests, a process that inherently cultivates a latent capacity to evaluate and recognize unsafe content. In…

Agent Runtime Security

The Granularity Gap: A Multi-Dimensional Longitudinal Audit of Sycophancy in Gemini Models

Large language models are increasingly deployed as high-stakes advisors, yet standard alignment benchmarks treat sycophancy as a binary failure mode. We introduce the Granularity…

2026-06-04

2026-06-04 14:02:06 (Asia/Shanghai)

Agent Runtime Security

Toward Pre-Deployment Assurance for Enterprise AI Agents: Ontology-Grounded Simulation and Trust Certification

Pre-deployment verification of enterprise artificial intelligence (AI) agents remains a critical gap between large language model (LLM) capability benchmarking and production depl…

2026-06-03

2026-06-03 14:09:56 (Asia/Shanghai)

Agent Runtime Security

Acceptance-Test-Driven Evaluation Protocols for Business-Centric LLM Systems

Large language model (LLM) applications are increasingly expected to satisfy deterministic institutional requirements while relying on probabilistic generative components. This mi…

Agent Runtime Security

Cost-Aware Query Routing in RAG: Empirical Analysis of Retrieval Depth Tradeoffs

Retrieval-augmented generation (RAG) faces a fundamental three-way tension: deeper retrieval improves factual grounding but inflates token costs and end-to-end latency. Static ret…

2026-06-02

2026-06-02 13:56:35 (Asia/Shanghai)

Agent Runtime Security

SentGuard: Sentence-Level Streaming Guardrails for Large Language Models

Large language models increasingly stream long, reasoning-intensive responses in real time, making when to moderate as critical as whether to moderate. Existing guardrails fall in…

Agent Runtime Security

PyFEX: Uncovering Evasive Python-based Threats via Resilient and Exhaustive Path Exploration

The rapid expansion of the Python ecosystem has fueled two distinct but converging threats: adversaries increasingly target the software supply chain via the Python Package Index…

2026-05-29

2026-05-29 13:18:32 (Asia/Shanghai)

Agent Runtime Security

Provably Secure Agent Guardrail

As large language models transition from bounded generative engines to agents with expansive execution privileges, AI going out of control precipitates a fundamental crisis in art…

Agent Runtime Security

Robust and Efficient Guardrails with Latent Reasoning

Maintaining the safety of large language models (LLMs) is crucial as they are increasingly deployed in real-world applications. Existing safety guardrails typically rely on single…

Agent Runtime Security

AgentDoG 1.5: A Lightweight and Scalable Alignment Framework for AI Agent Safety and Security

Modern open-world agents such as OpenClaw exhibit powerful cross-environment execution capabilities yet introduce broad new safety risk sources. Meanwhile, advanced frontier AI mo…

2026-05-28

2026-05-28 13:15:52 (Asia/Shanghai)

Agent Runtime Security

The Ethics of LLM Sandbox and Persona Dynamics

It is well known that LLM guardrails and trained persona dynamics can produce a reality gap: the distance between the world an LLM is permitted or shaped to describe, and the world…

2026-05-27

2026-05-27 13:23:19 (Asia/Shanghai)

Agent Runtime Security

EviACT: An Evidence-to-Action Framework for Agentic Program Repair

LLM-based agents have moved automated program repair (APR) from fixed-context patch generation to interactive repository-level repair. However, existing agentic APR systems still…

Agent Runtime Security

AlbanianLLMSafety: A Safety Evaluation Dataset for Large Language Models in Albanian

Safety evaluation of Large Language Models (LLMs) has largely focused on high-resource languages, leaving low-resource languages critically underserved. We present AlbanianLLMSafe…

2026-05-22

2026-05-22 13:08:19 (Asia/Shanghai)

Agent Runtime Security

Contractual Skills: A GovernSpec Design Framework for Enterprise AI Agents

Skills are increasingly used to package agent instructions, workflows, scripts, and reference materials. In enterprise settings, however, skills often need to express more than ta…

2026-05-20

2026-05-20 13:10:58 (Asia/Shanghai)

Agent Runtime Security

Robotics-Inspired Guardrails for Foundation Models in Socially Sensitive Domains

Foundation models are increasingly deployed in socially sensitive domains such as education, mental health, and caregiving, where failures are often cumulative and context-depende…

Agent Runtime Security

SimGym: A Framework for A/B Test Simulation in E-Commerce with Traffic-Grounded VLM Agents

A/B testing remains the gold standard for evaluating modifications to e-commerce storefronts, yet it diverts traffic, requires weeks to reach statistical significance, and risks d…

2026-05-19

2026-05-19 13:08:04 (Asia/Shanghai)

Agent Runtime Security

Multilingual jailbreaking of LLMs using low-resource languages

Large Language Models (LLMs) remain vulnerable to jailbreak attempts that circumvent safety guardrails. We investigate whether multi-turn conversations using low-resource African…

2026-05-14

2026-05-14 12:52:54 (Asia/Shanghai)

Agent Runtime Security

LLM-Based Persuasion Enables Guardrail Override in Frontier LLMs

Frontier assistant LLMs ship with strong guardrails: asked directly to write a persuasive essay denying the Holocaust, denying vaccine safety, defending flat-earth cosmology, argu…

2026-05-12

2026-05-12 12:42:08 (Asia/Shanghai)

Agent Runtime Security

Intrinsic Guardrails: How Semantic Geometry of Personality Interacts with Emergent Misalignment in LLMs

Fine-tuning Large Language Models (LLMs) on benign narrow data can sometimes induce broad harmful behaviors, a vulnerability termed emergent misalignment (EM). While prior work li…

2026-05-06

2026-05-06 12:37:23 (Asia/Shanghai)

Say the Mission, Execute the Swarm: Agent-Enhanced LLM Reasoning in the Web-of-Drones

Large Language Models (LLMs) are increasingly explored as high-level reasoning engines for cyber-physical systems, yet their application to real-time UAV swarm management remains…

2026-04-29

2026-04-29 12:26:28 (Asia/Shanghai)

Think Before You Act -- A Neurocognitive Governance Model for Autonomous AI Agents

The rapid deployment of autonomous AI agents across enterprise, healthcare, and safety-critical environments has created a fundamental governance gap. Existing approaches, runtime…

2026-04-15

2026-04-15 11:35:50 (Asia/Shanghai)

LLM

Parallax: Why AI Agents That Think Must Never Act

Autonomous AI agents are rapidly transitioning from experimental tools to operational infrastructure, with projections that 80% of enterprise applications will embed AI copilots b…

2026-04-09

2026-04-09 14:51:56 (Asia/Shanghai)

PubMed AI

Advancing neurotech justice in youth digital mental health: insights from an interdisciplinary and cross-generational workshop.

Researchers and clinicians are increasingly looking to leverage artificial intelligence (AI) and digital tools to improve psychiatric care. Of particular promise is addressing the…