# Daily Paper Digest

- Generated at: 2026-05-15 14:57:29 (Asia/Shanghai)
- Search window: last 24 hours
- Hits overview: LM=11, Agent Runtime Security=0, Terminal and SWE Agents=6
- Ranking strategy: hybrid (relevance first, published_at tie-break)
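
The ranking strategy above can be read as a two-key sort. The following is a minimal sketch, assuming the tie-break prefers the more recently published paper (this matches the order of equal-score entries in this digest, but the digest does not state the direction). `Paper` and `hybrid_sort` are hypothetical names used for illustration, not the digest generator's actual code.

```python
from datetime import datetime
from typing import List, NamedTuple


class Paper(NamedTuple):
    """Hypothetical record holding only the fields used for ranking."""
    title: str
    score: int
    published_at: datetime


def hybrid_sort(papers: List[Paper]) -> List[Paper]:
    """Rank by relevance score (descending); break ties by newer published_at."""
    # Python's sort is stable, so sorting by the secondary key first and the
    # primary key second gives "relevance first, published_at tie-break".
    by_recency = sorted(papers, key=lambda p: p.published_at, reverse=True)
    return sorted(by_recency, key=lambda p: p.score, reverse=True)


# Sample values copied from four entries in the LM group of this digest.
papers = [
    Paper("SemaTune", 143, datetime(2026, 5, 15, 0, 25)),
    Paper("On the Cultural Anachronism", 144, datetime(2026, 5, 15, 0, 58)),
    Paper("Talk is (Not) Cheap", 202, datetime(2026, 5, 15, 1, 30)),
    Paper("MemEye", 144, datetime(2026, 5, 15, 1, 37)),
]

for rank, paper in enumerate(hybrid_sort(papers), start=1):
    print(rank, paper.score, paper.published_at.isoformat(), paper.title)
```

The two-pass form avoids converting timestamps to numbers just to negate them; a single composite key would work equally well under the same assumption.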

## Today's Highlights

- Topic "Evaluation": 11 hits, covering the groups LM and Terminal and SWE Agents; representative papers include "Talk is (Not) Cheap: A Taxonomy and Benchmark Coverage Audit for LLM Attacks" and "APWA: A Distributed Architecture for Parallelizable Agentic Workflows".
- Topic "LLM": 10 hits, covering the groups LM and Terminal and SWE Agents; representative papers include "Talk is (Not) Cheap: A Taxonomy and Benchmark Coverage Audit for LLM Attacks" and "SpeakerLLM: A Speaker-Specialized Audio-LLM for Speaker Understanding and Verification Reasoning".
- Topic "Language Model": 7 hits, covering the groups LM and Terminal and SWE Agents; representative papers include "SpeakerLLM: A Speaker-Specialized Audio-LLM for Speaker Understanding and Verification Reasoning" and "Case-Based Calibration of Adaptive Reasoning and Execution for LLM Tool Use".
- Topic "Agent": 5 hits, covering the groups LM and Terminal and SWE Agents; representative papers include "MemEye: A Visual-Centric Evaluation Framework for Multimodal Agent Memory" and "CRANE: Constrained Reasoning Injection for Code Agents via Nullspace Editing".
- Topic "Benchmark": 1 hit, covering the group Terminal and SWE Agents; representative papers include "CRANE: Constrained Reasoning Injection for Code Agents via Nullspace Editing".
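
The per-topic hit counts above could be produced by tallying the topic terms listed for each paper. The following is a rough sketch, assuming each paper counts at most once per topic; `count_topics` and the sample dictionary are hypothetical, and the sample covers only four of the listed papers, so its counts are not the digest's totals.

```python
from collections import Counter
from typing import Dict, List


def count_topics(paper_topics: Dict[str, List[str]]) -> Counter:
    """Count how many papers carry each topic term."""
    counts: Counter = Counter()
    for topics in paper_topics.values():
        # set() guards against a topic being listed twice for one paper.
        counts.update(set(topics))
    return counts


# Topic terms copied from four entries in this digest (short titles).
sample = {
    "Talk is (Not) Cheap": ["Evaluation", "LLM"],
    "SpeakerLLM": ["LLM", "Language Model"],
    "MemEye": ["Evaluation", "Agent"],
    "CRANE": ["Agent", "Benchmark"],
}

for topic, hits in sorted(count_topics(sample).items(), key=lambda kv: (-kv[1], kv[0])):
    print(f"{topic}: {hits}")
```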

## Topic Focus

### Evaluation

- Hits: 11
- Groups covered: LM; Terminal and SWE Agents
- Representative papers: "Talk is (Not) Cheap: A Taxonomy and Benchmark Coverage Audit for LLM Attacks", "APWA: A Distributed Architecture for Parallelizable Agentic Workflows", "MemEye: A Visual-Centric Evaluation Framework for Multimodal Agent Memory"
- Topic quick read:
  - "Talk is (Not) Cheap: A Taxonomy and Benchmark Coverage Audit for LLM Attacks" [Evaluation / Data / Method]: We introduce a reusable framework for auditing whether LLM attack benchmarks collectively cover the threat surface: a 4×6 Target × Technique matr…
  - "APWA: A Distributed Architecture for Parallelizable Agentic Workflows" [Evaluation / Application / Method]: Autonomous multi-agent systems based on large language models (LLMs) have demonstrated remarkable abilities in independently solving complex tasks in a wide br…

### LLM

- Hits: 10
- Groups covered: LM; Terminal and SWE Agents
- Representative papers: "Talk is (Not) Cheap: A Taxonomy and Benchmark Coverage Audit for LLM Attacks", "SpeakerLLM: A Speaker-Specialized Audio-LLM for Speaker Understanding and Verification Reasoning", "Case-Based Calibration of Adaptive Reasoning and Execution for LLM Tool Use"
- Topic quick read:
  - "Talk is (Not) Cheap: A Taxonomy and Benchmark Coverage Audit for LLM Attacks" [Evaluation / Data / Method]: We introduce a reusable framework for auditing whether LLM attack benchmarks collectively cover the threat surface: a 4×6 Target × Technique matr…
  - "SpeakerLLM: A Speaker-Specialized Audio-LLM for Speaker Understanding and Verification Reasoning" [Data / Application / Method]: As audio-first agents become increasingly common in physical AI, conversational robots, and screenless wearables, audio large language models (audio-LLMs) must…

### Language Model

- Hits: 7
- Groups covered: LM; Terminal and SWE Agents
- Representative papers: "SpeakerLLM: A Speaker-Specialized Audio-LLM for Speaker Understanding and Verification Reasoning", "Case-Based Calibration of Adaptive Reasoning and Execution for LLM Tool Use", "On the Cultural Anachronism and Temporal Reasoning in Vision Language Models"
- Topic quick read:
  - "SpeakerLLM: A Speaker-Specialized Audio-LLM for Speaker Understanding and Verification Reasoning" [Data / Application / Method]: As audio-first agents become increasingly common in physical AI, conversational robots, and screenless wearables, audio large language models (audio-LLMs) must…
  - "Case-Based Calibration of Adaptive Reasoning and Execution for LLM Tool Use" [Evaluation / Method]: Tool use extends large language models beyond parametric knowledge, but reliable execution requires balancing appropriate reasoning depth with strict structura…

### Agent

- Hits: 5
- Groups covered: LM; Terminal and SWE Agents
- Representative papers: "MemEye: A Visual-Centric Evaluation Framework for Multimodal Agent Memory", "CRANE: Constrained Reasoning Injection for Code Agents via Nullspace Editing", "Remember Your Trace: Memory-Guided Long-Horizon Agentic Framework for Consistent and Hierarchical Repository-Level Code Documentation"
- Topic quick read:
  - "MemEye: A Visual-Centric Evaluation Framework for Multimodal Agent Memory" [Evaluation / Method]: Long-term agent memory is increasingly multimodal, yet existing evaluations rarely test whether agents preserve the visual evidence needed for later reasoning.…
  - "CRANE: Constrained Reasoning Injection for Code Agents via Nullspace Editing" [Evaluation / Method]: Code agents must both reason over long-horizon repository state and obey strict tool-use protocols. In paired Instruct/Thinking checkpoints, these capabilities…

### Benchmark

- Hits: 1
- Groups covered: Terminal and SWE Agents
- Representative papers: "CRANE: Constrained Reasoning Injection for Code Agents via Nullspace Editing"
- Topic quick read:
  - "CRANE: Constrained Reasoning Injection for Code Agents via Nullspace Editing" [Evaluation / Method]: Code agents must both reason over long-horizon repository state and obey strict tool-use protocols. In paired Instruct/Thinking checkpoints, these capabilities…

## LM Watch

### Group at a Glance

- "Talk is (Not) Cheap: A Taxonomy and Benchmark Coverage Audit for LLM Attacks" [Evaluation / Data / Method]: We introduce a reusable framework for auditing whether LLM attack benchmarks collectively cover the threat surface: a 4×6 Target × Technique matr…
- "SpeakerLLM: A Speaker-Specialized Audio-LLM for Speaker Understanding and Verification Reasoning" [Data / Application / Method]: As audio-first agents become increasingly common in physical AI, conversational robots, and screenless wearables, audio large language models (audio-LLMs) must…
- "Case-Based Calibration of Adaptive Reasoning and Execution for LLM Tool Use" [Evaluation / Method]: Tool use extends large language models beyond parametric knowledge, but reliable execution requires balancing appropriate reasoning depth with strict structura…
- "APWA: A Distributed Architecture for Parallelizable Agentic Workflows" [Evaluation / Application / Method]: Autonomous multi-agent systems based on large language models (LLMs) have demonstrated remarkable abilities in independently solving complex tasks in a wide br…
- "MemEye: A Visual-Centric Evaluation Framework for Multimodal Agent Memory" [Evaluation / Method]: Long-term agent memory is increasingly multimodal, yet existing evaluations rarely test whether agents preserve the visual evidence needed for later reasoning.…

### Paper Summaries

1. [Talk is (Not) Cheap: A Taxonomy and Benchmark Coverage Audit for LLM Attacks](https://arxiv.org/abs/2605.15118v1)
   - Published: 2026-05-15 01:30
   - Authors: Karthik Raghu Iyer, Yazdan Jamshidi, Nicholas Bray, Alexey A. Shvets
   - Source: arxiv
   - Relevance score: 202
   - Match reasons: title matched "LLM"; title matched "RAG"; title matched "benchmark"; summary matched "agent"
   - Categories: cs.CR, cs.CL
   - Tags: Evaluation / Data / Method
   - Topics: Evaluation / LLM
   - PDF: https://arxiv.org/pdf/2605.15118v1
   - Abstract: We introduce a reusable framework for auditing whether LLM attack benchmarks collectively cover the threat surface: a 4×6 Target × Technique matrix grounded in STRIDE, constructed from a 507-leaf taxonomy (401 data-populated and 106 threat-model-derived leaves) of inference-time attacks extracted from 932 arXiv security studies (2023–2026). The matrix enables benchmark-external validation, that is, auditing collective coverage rather than individual benchmark consistency. Applying it to six public benchmarks reveals that the three primary frameworks (HarmBench, InjecAgent, AgentDojo) occupy non-overlapping cells covering at most 25% of the matrix, while entire STRIDE threat categories (Service Disruption, Model Internals) lack any standardized evaluation, despite published attacks in these categories achieving 46× token amplification and 96% attack success rates through mechanisms which no benchmark tests. The corpus of 2,521 unique attack groups further reveals pervasive naming fragmentation (up to 29 surface forms for a single attack) and heavy concentration in Safety & Alignment Bypass, structural properties invisible at smaller scale. The taxonomy, attack records, and coverage mappings are released as extensible artifacts; as new benchmarks emerge, they can be mapped onto the same matrix, enabling the community to track whether evaluation gaps are closing.

2. [SpeakerLLM: A Speaker-Specialized Audio-LLM for Speaker Understanding and Verification Reasoning](https://arxiv.org/abs/2605.15044v1)
   - Published: 2026-05-15 00:36
   - Authors: KiHyun Nam, Jungwoo Heo, Siu Bae, Ha-Jin Yu, Joon Son Chung
   - Source: arxiv
   - Relevance score: 161
   - Match reasons: title matched "LLM"; title matched "reasoning"; summary matched "language model"; summary matched "large language model"
   - Categories: cs.SD, cs.AI, cs.LG, cs.MM, eess.AS
   - Tags: Data / Application / Method
   - Topics: LLM / Language Model
   - PDF: https://arxiv.org/pdf/2605.15044v1
   - Abstract: As audio-first agents become increasingly common in physical AI, conversational robots, and screenless wearables, audio large language models (audio-LLMs) must integrate speaker-specific understanding to support user authorization, personalization, and context-aware interaction. This requires modeling who is speaking, how the voice sounds, and how recording conditions affect speaker cues. Conventional speaker verification systems provide strong scalar scores but little linguistic evidence, while current audio-LLMs and speaker-aware language models have limited ability to organize speaker information beyond binary labels or descriptive profiles. We present SpeakerLLM, a speaker-specialized audio-LLM framework that unifies single-utterance speaker profiling, recording-condition understanding, utterance-pair speaker comparison, and evidence-organized verification reasoning within a natural-language interface. We construct verification-reasoning targets and a decision-composition policy that separate profile-level evidence from the final same-or-different decision and organize recording condition, profile evidence, and the decision into a structured trace. At its core, SpeakerLLM uses a hierarchical speaker tokenizer designed to capture multiple granularities of speaker evidence. Utterance-level speaker embeddings summarize identity and profile-level cues, whereas frame-level speaker features preserve fine-grained acoustic descriptors. Experiments show that SpeakerLLM-Base improves speaker-profile and recording-condition understanding over general audio-LLMs, while SpeakerLLM-VR preserves strong generated-verdict accuracy and produces decision traces grounded in the supervised verification reasoning schema. We will release the metadata-enriched supervision dataset and target-construction code for reproducibility.

3. [Case-Based Calibration of Adaptive Reasoning and Execution for LLM Tool Use](https://arxiv.org/abs/2605.15041v1)
   - Published: 2026-05-15 00:36
   - Authors: Renning Pang, Tian Lan, Leyuan Liu, Piao Tong, Sheng Cao, Xiaosong Zhang
   - Source: arxiv
   - Relevance score: 161
   - Match reasons: title matched "LLM"; title matched "reasoning"; summary matched "language model"; summary matched "large language model"
   - Categories: cs.AI, cs.CL
   - Tags: Evaluation / Method
   - Topics: LLM / Language Model
   - PDF: https://arxiv.org/pdf/2605.15041v1
   - Abstract: Tool use extends large language models beyond parametric knowledge, but reliable execution requires balancing appropriate reasoning depth with strict structural validity. We approach this problem from a case-based perspective to present CAST, a case-driven framework that treats historical execution trajectories as structured cases. Instead of reusing raw exemplar outputs, CAST extracts case-derived signals to identify complexity profiles for estimating optimal reasoning strategies, alongside failure profiles to map likely structural breakdowns. The framework translates this knowledge into a fine-grained reward design and adaptive reasoning, enabling the model to autonomously internalize case-based strategies during reinforcement learning. Experiments on BFCLv2 and ToolBench demonstrate that CAST improves both schema-faithful execution and task-level tool-use success while reducing unnecessary deliberation. The approach achieves up to 5.85 percentage points gain in overall execution accuracy and reduces average reasoning length by 26%, significantly mitigating high-impact structural errors. Ultimately, this demonstrates how historical execution cases can provide reusable adaptation knowledge for calibrated tool use.

4. [APWA: A Distributed Architecture for Parallelizable Agentic Workflows](https://arxiv.org/abs/2605.15132v1)
   - Published: 2026-05-15 01:40
   - Authors: Evan Rose, Tushin Mallick, Matthew D. Laws, Cristina Nita-Rotaru, Alina Oprea
   - Source: arxiv
   - Relevance score: 158
   - Match reasons: title matched "agent"; summary matched "language model"; summary matched "large language model"; summary matched "LLM"
   - Categories: cs.AI, cs.DC, cs.MA
   - Tags: Evaluation / Application / Method
   - Topics: Evaluation / LLM
   - PDF: https://arxiv.org/pdf/2605.15132v1
   - Abstract: Autonomous multi-agent systems based on large language models (LLMs) have demonstrated remarkable abilities in independently solving complex tasks in a wide breadth of application domains. However, these systems hit critical reasoning, coordination, and computational scaling bottlenecks as the size and complexity of their tasks grow. These limitations hinder multi-agent systems from achieving high-throughput processing for highly parallelizable tasks, despite the availability of parallel computing and reasoning primitives in the underlying LLMs. We introduce the Agent-Parallel Workload Architecture (APWA), a distributed multi-agent system architecture designed for the efficient processing of heavily parallelizable agentic workloads. APWA facilitates parallel execution by decomposing workflows into non-interfering subproblems that can be processed using independent resources without cross-communication. It supports heterogeneous data and parallel processing patterns, and it accommodates tasks from a wide breadth of domains. In our evaluation, we demonstrate that APWA can dynamically decompose complex queries into parallelizable workflows and scales on larger tasks in settings where prior systems fail completely.

5. [MemEye: A Visual-Centric Evaluation Framework for Multimodal Agent Memory](https://arxiv.org/abs/2605.15128v1)
   - Published: 2026-05-15 01:37
   - Authors: Minghao Guo, Qingyue Jiao, Zeru Shi, Yihao Quan, Boxuan Zhang, Danrui Li et al.
   - Source: arxiv
   - Relevance score: 144
   - Match reasons: title matched "agent"; title matched "evaluation"; summary matched "reasoning"; summary matched "benchmark"
   - Categories: cs.CV, cs.CL, cs.IR
   - Tags: Evaluation / Method
   - Topics: Evaluation / Agent
   - PDF: https://arxiv.org/pdf/2605.15128v1
   - Abstract: Long-term agent memory is increasingly multimodal, yet existing evaluations rarely test whether agents preserve the visual evidence needed for later reasoning. In prior work, many visually grounded questions can be answered using only captions or textual traces, allowing answers to be inferred without preserving the fine-grained visual evidence. Meanwhile, harder cases that require reasoning over changing visual states are largely absent. Therefore, we introduce MemEye, a framework that evaluates memory capabilities from two dimensions: one measures the granularity of decisive visual evidence (from scene-level to pixel-level evidence), and the other measures how retrieved evidence must be used (from single evidence to evolutionary synthesis). Under this framework, we construct a new benchmark across 8 life-scenario tasks, with ablation-driven validation gates for assessing answerability, shortcut resistance, visual necessity, and reasoning structure. By evaluating 13 memory methods across 4 VLM backbones, we show that current architectures still struggle to preserve fine-grained visual details and reason about state changes over time. Our findings show that long-term multimodal memory depends on evidence routing, temporal tracking, and detail extraction.

6. [On the Cultural Anachronism and Temporal Reasoning in Vision Language Models](https://arxiv.org/abs/2605.15071v1)
   - Published: 2026-05-15 00:58
   - Authors: Mukul Ranjan, Prince Jha, Khushboo Kumari, Zhiqiang Shen
   - Source: arxiv
   - Relevance score: 144
   - Match reasons: title matched "language model"; title matched "reasoning"; summary matched "benchmark"; summary matched "evaluation"
   - Categories: cs.CV, cs.AI, cs.CL
   - Tags: Evaluation / Data / Method
   - Topics: Evaluation / Language Model
   - PDF: https://arxiv.org/pdf/2605.15071v1
   - Abstract: Vision-Language Models (VLMs) are increasingly applied to cultural heritage materials, from digital archives to educational platforms. This work identifies a fundamental issue in how these models interpret historical artifacts. We define this phenomenon as cultural anachronism, the tendency to misinterpret historical objects using temporally inappropriate concepts, materials, or cultural frameworks. To quantify this phenomenon, we introduce the Temporal Anachronism Benchmark for Vision-Language Models (TAB-VLM), a dataset of 600 questions across six categories, designed to evaluate temporal reasoning on 1,600 Indian cultural artifacts spanning prehistoric to modern periods. Systematic evaluations of ten state-of-the-art models reveal significant deficiencies on our benchmark, and even the best model (GPT-5.2) achieves only 58.7% overall accuracy. The performance gap persists across varying architectures and scales, suggesting that cultural anachronism represents a significant limitation in visual AI systems, regardless of model size. These findings highlight the disparity between current VLM capabilities and the requirements for accurately interpreting cultural heritage materials, particularly for non-Western visual cultures underrepresented in training data. Our benchmark provides a foundation for enhancing temporal cognition in multimodal AI systems that interact with historical artifacts. The dataset and code are available in our project page.

7. [SemaTune: Semantic-Aware Online OS Tuning with Large Language Models](https://arxiv.org/abs/2605.15026v1)
   - Published: 2026-05-15 00:25
   - Authors: Georgios Liargkovas, Mihir Nitin Joshi, Hubertus Franke, Kostis Kaffes
   - Source: arxiv
   - Relevance score: 143
   - Match reasons: title matched "language model"; title matched "large language model"; summary matched "LLM"; summary matched "benchmark"
   - Categories: cs.OS, cs.AI, cs.PF
   - Tags: Evaluation / Application / Method
   - Topics: LLM / Language Model
   - PDF: https://arxiv.org/pdf/2605.15026v1
   - Abstract: Online OS tuning can improve long-running services, but existing controllers are poorly matched to live hosts. They treat scheduler, power, memory, and I/O controls as black-box variables and optimize a scalar reward. This view ignores cross-knob policy structure, breaks down when application metrics are unavailable, and can send a running service into degraded regions that persist after the bad setting is removed. We present SemaTune, a host-side framework for steady-state OS tuning with bounded language-model guidance. SemaTune turns knob schemas, telemetry, current configuration, recent action–response history, and retrieved prior runs into a compact decision context. A fast loop proposes low-latency updates, a slower loop periodically revises the search strategy, and every proposed change passes through typed validation before reaching kernel or sysctl interfaces. This lets the controller reason about OS-control meaning and indirect performance signals while keeping model cost, latency, and authority constrained. We evaluate SemaTune on 13 live workloads from five benchmark suites while tuning up to 41 Linux parameters. Across the suite, SemaTune improves stable-phase performance by 72.5% over default settings and by 153.3% relative to the strongest non-LLM baseline. A 30-window session costs about $0.20 in model calls. With only host-level metrics, SemaTune still outperforms baselines given direct application objectives by 93.7 percentage points, while avoiding severe degraded regions reached by structure-blind exploration.

8. [Octopus: History-Free Gradient Orthogonalization for Continual Learning in Multimodal Large Language Models](https://arxiv.org/abs/2605.14938v1)
   - Published: 2026-05-14 23:13
   - Authors: Yuehao Liu, Shanyan Guan, Weijia Zhang, Xuanming Shang, Yanhao Ge, Wei Li et al.
   - Source: arxiv
   - Relevance score: 142
   - Match reasons: title matched "language model"; title matched "large language model"; summary matched "LLM"; summary matched "RAG"
   - Categories: cs.LG, cs.CV
   - Tags: Method
   - Topics: LLM / Language Model
   - PDF: https://arxiv.org/pdf/2605.14938v1
   - Abstract: Continual learning in multimodal large language models (MLLMs) aims to sequentially acquire knowledge while mitigating catastrophic forgetting, yet existing methods face inherent limitations: architecture-based approaches incur additional computational overhead and often generalize poorly to new tasks, rehearsal-based methods rely on storing historical data, raising privacy and storage concerns, and conventional regularization-based strategies alone are insufficient to fully prevent parameter interference. We propose Octopus, a two-stage continual learning framework based on History-Free Gradient Orthogonalization (HiFGO), which enforces gradient-level orthogonality without historical task data. Our proposed two-stage finetuning strategy decouples task adaptation from regularization, achieving a principled balance between plasticity and stability. Experiments on UCIT show that Octopus establishes state-of-the-art performance, surpassing prior SOTA by 2.14% and 6.82% in terms of Avg and Last.

9. [A Hardware-Aware, Per-Layer Methodology for Post-Training Quantization of Large Language Models](https://arxiv.org/abs/2605.14929v1)
   - Published: 2026-05-14 23:03
   - Authors: Earl Killian
   - Source: arxiv
   - Relevance score: 142
   - Match reasons: title matched "language model"; title matched "large language model"; summary matched "RAG"; summary matched "evaluation"
   - Categories: cs.LG, cs.AR
   - Tags: Evaluation / Application / Method
   - Topics: Evaluation / Language Model
   - PDF: https://arxiv.org/pdf/2605.14929v1
   - Abstract: Scaled Outer Product (SOP) is a post-training quantization methodology for large language model weights, designed to deliver near-lossless fidelity at 4.5–6 bits per weight on hardware with per-layer LUT decode. The methodology combines per-layer search of fixed and dynamic codebook pairs selected by a per-block selection bit, signed per-block scales, activation-weighted cosine selection, and multiple-choice knapsack promotion of sensitive layers with outlier and sparse-residual correction. Fixed codebooks include NF4, BOF4, Split87, and SH4; per-layer optimized codebooks (DD4) are hosted in LUT SRAM. A new hardware-efficient LUT output format (HIF) is proposed to improve performance, energy, and cost. Across six open model families, the recommended FP6 operating point (E2M3sUE4M4, 6.5 bpw) achieves lower weight reconstruction error than the conventional per-layer-POT FP8 baseline (E4M3, 8.0 bpw) at 1.5 bpw lower storage cost, demonstrating that block-scaled small atoms with carefully chosen scale precision can replace conventionally-deployed FP8. Full evaluation across the 4.5–6 bpw range, including layer promotion and sparse residual correction, is reported in a companion paper.

10. [Quantifying and Mitigating Premature Closure in Frontier LLMs](https://arxiv.org/abs/2605.15000v1)
   - Published: 2026-05-15 00:02
   - Authors: Rebecca Handler, Suhana Bedi, Nigam Shah
   - Source: arxiv
   - Relevance score: 139
   - Match reasons: title matched "LLM"; summary matched "language model"; summary matched "large language model"; summary matched "RAG"
   - Categories: cs.CL, cs.AI
   - Tags: Evaluation / Application / Method
   - Topics: Evaluation / LLM
   - PDF: https://arxiv.org/pdf/2605.15000v1
   - Abstract: Premature closure, or committing to a conclusion before sufficient information is available, is a recognized contributor to diagnostic error but remains underexamined in large language models (LLMs). We define LLM premature closure as inappropriate commitment under uncertainty: providing an answer, recommendation, or clinical guidance when the safer response would be clarification, abstention, escalation, or refusal. We evaluated five frontier LLMs across structured and open-ended medical tasks. In MedQA (n = 500) and AfriMed-QA (n = 490) questions where the correct choice had been removed, models still selected an answer at high rates, with baseline false-action rates of 55-81% and 53-82%, respectively. In open-ended evaluation, models gave inappropriate answers on an average of 30% of 861 HealthBench questions and 78% of 191 physician-authored adversarial queries. Safety-oriented prompting reduced premature closure across models, but residual failure persisted, highlighting the need to evaluate whether medical LLMs know when not to answer.

11. [Dual-Dimensional Consistency: Balancing Budget and Quality in Adaptive Inference-Time Scaling](https://arxiv.org/abs/2605.15100v1)
   - Published: 2026-05-15 01:19
   - Authors: Rongman Xu, Yifei Li, Tianzhe Zhao, Yanrui Wu, Bo Li, Hang Yan
   - Source: arxiv
   - Relevance score: 136
   - Match reasons: summary matched "language model"; summary matched "large language model"; summary matched "LLM"; summary matched "reasoning"
   - Categories: cs.AI
   - Tags: Evaluation / Method
   - Topics: Evaluation / LLM
   - PDF: https://arxiv.org/pdf/2605.15100v1
   - Abstract: Large Language Models (LLMs) have demonstrated remarkable abilities in reasoning. However, maximizing their potential through inference-time scaling faces challenges in trade-off between sampling budget and reasoning quality. Current strategies remain inefficient as they typically treat sampling width and depth as orthogonal objectives, where width consensus methods risk reinforcing hallucinations, while depth pruning mechanisms prematurely truncate complex yet valid reasoning chains. Therefore, we propose Dual-Dimensional Consistency (DDC), a unified framework that bridges path quality with adaptive termination. By coupling Confidence-Weighted Bayesian protocol with a Trend-Aware Stratified Pruning, our method ensures that computational resources are concentrated on high quality reasoning paths, filtering hallucinations while accelerating consensus. Evaluations across five benchmarks demonstrate that this approach reduces token consumption by over 10 times while maintaining or exceeding the accuracy of strong baselines across various LLMs.

## Agent Runtime Security Watch

No new hits today.

## Terminal and SWE Agents Watch

### Group at a Glance

- "CRANE: Constrained Reasoning Injection for Code Agents via Nullspace Editing" [Evaluation / Method]: Code agents must both reason over long-horizon repository state and obey strict tool-use protocols. In paired Instruct/Thinking checkpoints, these capabilities…
- "Remember Your Trace: Memory-Guided Long-Horizon Agentic Framework for Consistent and Hierarchical Repository-Level Code Documentation" [Evaluation / Application / Method]: Automated code documentation is essential for modern software development, providing the contextual grounding that both human developers and coding agents rely…
- "SWE-Chain: Benchmarking Coding Agents on Chained Release-Level Package Upgrades" [Evaluation / Method]: Coding agents powered by large language models are increasingly expected to perform realistic software maintenance tasks beyond isolated issue resolution. Exis…
- "Documentation-Guided Agentic Codebase Migration from C to Rust" [Evaluation / Method]: Migrating legacy C repositories to Rust promises stronger memory safety, but existing translators often work at the level of files or functions and miss archit…
- "Comparing Developer and LLM Biases in Code Evaluation" [Evaluation / Application / Method]: As LLMs are increasingly used as judges in code applications, they should be evaluated in realistic interactive settings that capture partial context and ambig…

### Paper Summaries

1. [CRANE: Constrained Reasoning Injection for Code Agents via Nullspace Editing](https://arxiv.org/abs/2605.14084)
   - Published: 2026-05-15 12:00
   - Authors: Mingzhi Zhu, Michele Merler, Raju Pavuluri, Stacy Patterson
   - Source: arxiv
   - Relevance score: 115
   - Match reasons: title matched "code agent"; summary matched "Terminal-Bench"; summary matched "SWE-bench"; has PDF
   - Categories: cs.SE, cs.AI, cs.CL
   - Tags: Evaluation / Method
   - Topics: Agent / Benchmark
   - PDF: https://arxiv.org/pdf/2605.14084
   - Abstract: Code agents must both reason over long-horizon repository state and obey strict tool-use protocols. In paired Instruct/Thinking checkpoints, these capabilities are complementary but misaligned. The Instruct model is concise and tool-disciplined, whereas the Thinking model offers stronger planning and recovery behavior but often over-deliberates and degrades agent performance. We present CRANE (Constrained Reasoning Injection for Code Agents via Nullspace Editing), a training-free parameter-editing method that treats the Thinking-Instruct delta as a directional pool of candidate reasoning edits for the Instruct backbone. CRANE combines magnitude thresholding to denoise the delta, a Conservative Taylor Gate to retain edits that are jointly beneficial for reasoning transfer and tool-use preservation, and Graduated Sigmoidal Projection to suppress format-critical update directions. By merging paired Instruct and Thinking checkpoints, CRANE delivers strong gains over either individual model while preserving Instruct-level efficiency: on Roo-Eval it achieves pass1 of 66.2% (+19.5%) for Qwen3-30B-A3B and 81.5% (+8.7%) for Qwen3-Next-80B-A3B; on SWE-bench-Verified it resolves up to 14 additional instances at both scales (122/500 and 180/500); and on Terminal-Bench v2 it improves pass1/pass5 by up to 2.3%/7.8%, reaching 7.6%/17.9% and 14.8%/30.3%, respectively, consistently outperforming alternative merging strategies across all three benchmarks.

2. [Remember Your Trace: Memory-Guided Long-Horizon Agentic Framework for Consistent and Hierarchical Repository-Level Code Documentation](https://arxiv.org/abs/2605.14563)
   - Published: 2026-05-15 12:00
   - Authors: Suyoung Bae, Jaehoon Lee, Changkyu Choi, YunSeok Choi, Jee-Hyong Lee
   - Source: arxiv
   - Relevance score: 97
   - Match reasons: title matched "repository-level"; summary matched "coding agent"; has PDF; has rich summary
   - Categories: cs.SE, cs.CL
   - Tags: Evaluation / Application / Method
   - Topics: Evaluation / Agent
   - PDF: https://arxiv.org/pdf/2605.14563
   - Abstract: Automated code documentation is essential for modern software development, providing the contextual grounding that both human developers and coding agents rely on to navigate large codebases. Existing repository-level approaches process components independently, causing redundant retrieval and conflicting descriptions across documents while producing outputs that lack hierarchical structure. Therefore, we propose MemDocAgent, a long-horizon agentic framework that generates documentation within a single, integrated context spanning the entire repository. It combines two components: (i) Dependency-Aware Traversal Guiding that predetermines a traversal order respecting dependency and granularity hierarchies; (ii) Memory-Guided Agentic Interaction, in which the agent interacts with RepoMemory, a shared memory accumulating prior work traces through read, write, and verify operations. Through an in-depth multi-criteria evaluation, MemDocAgent achieves the best performance over both open and closed-source baselines and demonstrates practical applicability in real software development workflows.

3. [SWE-Chain: Benchmarking Coding Agents on Chained Release-Level Package Upgrades](https://arxiv.org/abs/2605.14415)
   - Published: 2026-05-15 12:00
   - Authors: Man Ho Lam, Chaozheng Wang, Hange Liu, Jingyu Xiao, Haau-sing Li, Jen-tse Huang et al.
   - Source: arxiv
   - Relevance score: 97
   - Match reasons: title matched "coding agent"; summary matched "issue resolution"; has PDF; has rich summary
   - Categories: cs.SE, cs.AI, cs.CL
   - Tags: Evaluation / Method
   - Topics: Language Model / Agent
   - PDF: https://arxiv.org/pdf/2605.14415
   - Abstract: Coding agents powered by large language models are increasingly expected to perform realistic software maintenance tasks beyond isolated issue resolution. Existing benchmarks have shifted toward realistic software evolution, but they rarely capture continuous maintenance at the granularity of package releases, where changes are bundled, shipped, and inherited by subsequent versions. We present SWE-Chain, a benchmark for evaluating agents on chained release-level package upgrades, where each transition builds on the agent's prior codebase. To produce upgrade specifications, we design a divide-and-conquer synthesis pipeline that aligns release notes with code diffs for each version transition, ensuring the requirements are grounded in actual code changes, informative to agents, and feasible to implement. SWE-Chain contains 12 upgrade chains across 9 real Python packages, with 155 version transitions and 1,660 grounded upgrade requirements. Across nine frontier agent-model configurations, agents achieve an average of 44.8% resolving, 65.4% precision, and 50.2% F1 under the Build+Fix regime, with Claude-Opus-4.7 (Claude Code) leading at 60.8% resolving, 80.6% precision, and 68.5% F1. These results show that SWE-Chain is both feasible and discriminative, and reveal that current agents still struggle to make correct upgrades across chained package releases without breaking existing functionality.

4. [Documentation-Guided Agentic Codebase Migration from C to Rust](https://arxiv.org/abs/2605.14634)
   - Published: 2026-05-15 12:00
   - Authors: Minh Le-Anh, Anh Nguyen Hoang, Bach Le, Nghi D. Q. Bui
   - Source: arxiv
   - Relevance score: 75
   - Match reasons: summary matched "coding agent"; summary matched "repository-level"; has PDF; has rich summary
   - Categories: cs.SE
   - Tags: Evaluation / Method
   - Topics: Evaluation / LLM
   - PDF: https://arxiv.org/pdf/2605.14634
   - Abstract: Migrating legacy C repositories to Rust promises stronger memory safety, but existing translators often work at the level of files or functions and miss architectural intent. We present RustPrint, a documentation-guided agentic framework for repository-level C-to-Rust migration. RustPrint first converts the source repository into architecture-aware documentation and treats it as a migration blueprint capturing module structure, data flow, APIs, and design rationale. Coding agents then use this blueprint to plan crates, implement modules, check compilability, reduce unsafe code, and iteratively refine the translated repository. RustPrint next compares documentation from the Rust output against the source documentation and uses mismatches as repair signals. It also translates and runs source test suites so runtime failures can guide targeted fixes. Experiments on eight real-world C repositories ranging from 11K to 84K LoC show that RustPrint compiles every target under both an open-weight (Kimi-K2-Instruct) and a closed-weight (GPT-5.4) backbone, while prior LLM-based translators (Self-Repair, EvoC2Rust) fail repository-wide. With the open-weight Kimi-K2-Instruct backbone, RustPrint exceeds an agentic Claude Code baseline on feature preservation (93.26% vs. 52.52%) and on cross-evaluation test pass rate (95.17% vs. 79.85%). These results suggest that documentation-guided coordination is a useful direction for scalable codebase migration.

5. [Comparing Developer and LLM Biases in Code Evaluation](https://arxiv.org/abs/2603.24586)
   - Published: 2026-05-15 12:00
   - Authors: Aditya Mittal, Ryan Shar, Zichu Wu, Shyam Agarwal, Tongshuang Wu, Chris Donahue, et al.
   - Source: arxiv
   - Relevance score: 57
   - Match reasons: summary matched "code editing"; has PDF; has rich summary; has complete metadata
   - Categories: cs.SE, cs.CL
   - Tags: Evaluation / Application / Method
   - Topics: Evaluation / LLM
   - PDF: https://arxiv.org/pdf/2603.24586
   - Abstract: As LLMs are increasingly used as judges in code applications, they should be evaluated in realistic interactive settings that capture partial context and ambiguous intent. We present TRACE (Tool for Rubric Analysis in Code Evaluation), a framework that evaluates LLM judges' ability to predict human preferences and automatically extracts rubric items to reveal systematic biases in how humans and models weigh each item. Across three modalities -- chat-based programming, IDE autocompletion, and instructed code editing -- we use TRACE to measure how well LLM judges align with developer preferences. Among 13 different models, the best judges underperform human annotators by 12-23%. TRACE identifies 35 significant sources of misalignment between humans and judges across interaction modalities, the majority of which correspond to existing software engineering code quality criteria. For example, in chat-based coding, judges are biased towards longer code explanations while humans prefer shorter ones. We find significant misalignment on the majority of existing code quality dimensions, showing alignment gaps between LLM judges and human preference in realistic coding applications.

6. [Do Androids Dream of Breaking the Game? Systematically Auditing AI Agent Benchmarks with BenchJack](https://arxiv.org/abs/2605.12673)
   - Published: 2026-05-15 12:00
   - Authors: Hao Wang, Hanchen Li, Qiuyang Mang, Alvin Cheung, Koushik Sen, Dawn Song
   - Source: arxiv
   - Relevance score: 57
   - Match reasons: summary matched "coding agent"; has PDF; has rich summary; has complete metadata
   - Categories: cs.AI, cs.CR
   - Tags: Evaluation / Application / Method
   - Topics: Evaluation / Agent
   - PDF: https://arxiv.org/pdf/2605.12673
   - Abstract: Agent benchmarks have become the de facto measure of frontier AI competence, guiding model selection, investment, and deployment. However, reward hacking, where agents maximize a score without performing the intended task, emerges spontaneously in frontier models without overfitting. We argue that benchmarks must be secure by design. From past incidents of reward hacks, we derive a taxonomy of eight recurring flaw patterns and compile them into the Agent-Eval Checklist for benchmark designers. We condense the insights into BenchJack, an automated red-teaming system that drives coding agents to audit benchmarks and identify possible reward-hacking exploits in a clairvoyant manner. Moreover, we extend BenchJack to an iterative generative-adversarial pipeline that discovers new flaws and patches them iteratively to improve benchmark robustness. We apply BenchJack to 10 popular agent benchmarks spanning software engineering, web navigation, desktop computing, and terminal operations. BenchJack synthesizes reward-hacking exploits that achieve near-perfect scores on most of the benchmarks without solving a single task, surfacing 219 distinct flaws across the eight classes. Moreover, BenchJack's extended pipeline reduces the hackable-task ratio from near 100% to under 10% on four benchmarks without fatal design flaws, fully patching WebArena and OSWorld within three iterations. Our results show that evaluation pipelines have not internalized an adversarial mindset, and that proactive auditing could help close the security gap for the fast-paced benchmarking space.