# 每日论文简报

- 生成时间：2026-06-05 13:25:00 (Asia/Shanghai)
- 检索窗口：最近 24 小时
- 命中概览：LM=15, Agent Runtime Security=5, Terminal and SWE Agents=10
- 排序策略：hybrid (relevance first, published_at tie-break)

## 今日重点

- 主题「LLM」：命中 23 篇，覆盖 LM、Agent Runtime Security 等，代表论文包括 《MCBench: A Multicontext Safety Assessment Benchmark for Omni Large Language Models》、《CollabSim: A CSCW-Grounded Methodology for Investigating Collaborative Competence of LLM Agents through Controlled Multi-Agent Experiments》。
- 主题「Language Model」：命中 20 篇，覆盖 LM、Agent Runtime Security 等，代表论文包括 《MCBench: A Multicontext Safety Assessment Benchmark for Omni Large Language Models》、《CollabSim: A CSCW-Grounded Methodology for Investigating Collaborative Competence of LLM Agents through Controlled Multi-Agent Experiments》。
- 主题「Benchmark」：命中 11 篇，覆盖 LM、Agent Runtime Security 等，代表论文包括 《CollabBench: Benchmarking and Unleashing Collaborative Ability of LLMs with Diverse Players via Proactive Engagement》、《PlanBench-V: A Spatial Planning Map Benchmark for Vision-Language Models》。
- 主题「Agent」：命中 4 篇，覆盖 Agent Runtime Security、Terminal and SWE Agents，代表论文包括 《Beyond Similarity: Trustworthy Memory Search for Personal AI Agents》、《RAT: RunAnyThing via Fully Automated Environment Configuration》。
- 主题「Coding Agent」：命中 1 篇，覆盖 Terminal and SWE Agents，代表论文包括 《Coding with "Enemy": Can Human Developers Detect AI Agent Sabotage?》。

## 栏目状态

- LM：15 篇
- Agent Runtime Security：5 篇
- Terminal and SWE Agents：10 篇

## 主题聚焦

### LLM

- 命中篇数：23
- 覆盖分组：LM、Agent Runtime Security、Terminal and SWE Agents
- 代表论文：《MCBench: A Multicontext Safety Assessment Benchmark for Omni Large Language Models》、《CollabSim: A CSCW-Grounded Methodology for Investigating Collaborative Competence of LLM Agents through Controlled Multi-Agent Experiments》、《AdaPlanBench: Evaluating Adaptive Planning in Large Language Model Agents under World and User Constraints》
- 主题速读：
  - 《MCBench: A Multicontext Safety Assessment Benchmark for Omni Large Language Models》〔评测 / 方法〕：Existing multimodal safety benchmarks focus solely on visual inputs and cannot assess Omni Large Language Models (LLMs) that process vision, audio, and text. W…
  - 《CollabSim: A CSCW-Grounded Methodology for Investigating Collaborative Competence of LLM Agents through Controlled Multi-Agent Experiments》〔评测 / 应用 / 方法〕：Multi-agent systems (MAS) built on large language models have shown growing promise, with their effectiveness resting on agents' ability to coordinate through…

### Language Model

- 命中篇数：20
- 覆盖分组：LM、Agent Runtime Security、Terminal and SWE Agents
- 代表论文：《MCBench: A Multicontext Safety Assessment Benchmark for Omni Large Language Models》、《CollabSim: A CSCW-Grounded Methodology for Investigating Collaborative Competence of LLM Agents through Controlled Multi-Agent Experiments》、《AdaPlanBench: Evaluating Adaptive Planning in Large Language Model Agents under World and User Constraints》
- 主题速读：
  - 《MCBench: A Multicontext Safety Assessment Benchmark for Omni Large Language Models》〔评测 / 方法〕：Existing multimodal safety benchmarks focus solely on visual inputs and cannot assess Omni Large Language Models (LLMs) that process vision, audio, and text. W…
  - 《CollabSim: A CSCW-Grounded Methodology for Investigating Collaborative Competence of LLM Agents through Controlled Multi-Agent Experiments》〔评测 / 应用 / 方法〕：Multi-agent systems (MAS) built on large language models have shown growing promise, with their effectiveness resting on agents' ability to coordinate through…

### Benchmark

- 命中篇数：11
- 覆盖分组：LM、Agent Runtime Security、Terminal and SWE Agents
- 代表论文：《CollabBench: Benchmarking and Unleashing Collaborative Ability of LLMs with Diverse Players via Proactive Engagement》、《PlanBench-V: A Spatial Planning Map Benchmark for Vision-Language Models》、《The Granularity Gap: A Multi-Dimensional Longitudinal Audit of Sycophancy in Gemini Models》
- 主题速读：
  - 《CollabBench: Benchmarking and Unleashing Collaborative Ability of LLMs with Diverse Players via Proactive Engagement》〔评测 / 方法〕：While LLM-based agents excel at individual tasks, effective collaboration with realistic human partners remains challenging. Most of the existing conversation-…
  - 《PlanBench-V: A Spatial Planning Map Benchmark for Vision-Language Models》〔评测 / 数据 / 方法〕：Spatial planning maps are central to territorial governance, translating planning objectives, regulations, and spatial strategies into visual forms for decisio…

### Agent

- 命中篇数：4
- 覆盖分组：Agent Runtime Security、Terminal and SWE Agents
- 代表论文：《Beyond Similarity: Trustworthy Memory Search for Personal AI Agents》、《RAT: RunAnyThing via Fully Automated Environment Configuration》、《Coding with "Enemy": Can Human Developers Detect AI Agent Sabotage?》
- 主题速读：
  - 《Beyond Similarity: Trustworthy Memory Search for Personal AI Agents》〔应用 / 方法〕：Personal AI agents increasingly rely on long-term memory to provide persistent personalization across sessions. However, existing memory pipelines are largely…
  - 《RAT: RunAnyThing via Fully Automated Environment Configuration》〔评测 / 方法〕：Automating repository-level software engineering tasks is a foundational challenge for autonomous code agents, largely due to the difficulty of configuring exe…

### Coding Agent

- 命中篇数：1
- 覆盖分组：Terminal and SWE Agents
- 代表论文：《Coding with "Enemy": Can Human Developers Detect AI Agent Sabotage?》
- 主题速读：
  - 《Coding with "Enemy": Can Human Developers Detect AI Agent Sabotage?》〔应用 / 方法〕：AI coding agents are increasingly embedded in real-world software development, collaborating with human developers while gaining broader access to codebases an…

## LM 观察

### 本组速览

- 《MCBench: A Multicontext Safety Assessment Benchmark for Omni Large Language Models》〔评测 / 方法〕：Existing multimodal safety benchmarks focus solely on visual inputs and cannot assess Omni Large Language Models (LLMs) that process vision, audio, and text. W…
- 《CollabSim: A CSCW-Grounded Methodology for Investigating Collaborative Competence of LLM Agents through Controlled Multi-Agent Experiments》〔评测 / 应用 / 方法〕：Multi-agent systems (MAS) built on large language models have shown growing promise, with their effectiveness resting on agents' ability to coordinate through…
- 《AdaPlanBench: Evaluating Adaptive Planning in Large Language Model Agents under World and User Constraints》〔评测 / 方法〕：Planning for real-world problems by language models often involves both world and user constraints, which may not be fully specified upfront and are progressiv…
- 《The Tell-Tale Norm: $\ell_2$ Magnitude as a Signal for Reasoning Dynamics in Large Language Models》〔评测 / 方法〕：Recent work has sought to understand Large Language Models (LLMs) reasoning, yet a principled, model-intrinsic signal that captures its layer-wise reasoning dy…
- 《Beyond tokens: a unified framework for latent communication in LLM-based multi-agent systems》〔应用 / 方法〕：Multi-agent systems built on large language models (LLMs) have become a prevailing paradigm for tackling complex reasoning, planning, and tool-use tasks. The d…

### 论文速览

1. [MCBench: A Multicontext Safety Assessment Benchmark for Omni Large Language Models](https://arxiv.org/abs/2606.05177)
   - Published：2026-06-05 12:00
   - 作者：Manh Luong，Tamas Abraham，Junae Kim，Amar Kaur，Rollin Omari，Gholamreza Haffari 等
   - 来源：arxiv
   - 相关性分数：214
   - 命中原因：title matched "language model"; title matched "large language model"; title matched "benchmark"; summary matched "LLM"
   - 分类：cs.CL, cs.AI, eess.AS
   - 标签：评测 / 方法
   - 主题词：LLM / Language Model
   - PDF：https://arxiv.org/pdf/2606.05177
   - 摘要：Existing multimodal safety benchmarks focus solely on visual inputs and cannot assess Omni Large Language Models (LLMs) that process vision, audio, and text. We introduce MCBench, a benchmark with 1196 scenarios spanning four safety categories that require integrating multiple modalities for accurate safety assessment. Each unsafe scenario is paired with a minimally different safe counterpart to assess model sensitivity. Our evaluations of state-of-the-art models reveal significant challenges. Omni LLMs struggle with subtle or non-physical risks but perform better when salient visual or acoustic cues are present. Analysis of reasoning traces shows that, although models can extract modality-specific information, they often fail to integrate these cues effectively for safety judgments. Our findings reveal that current Omni LLMs lack robust cross-modal reasoning in safety-critical settings, underscoring the need for improved architectures and training strategies for multimodal safety.

2. [CollabSim: A CSCW-Grounded Methodology for Investigating Collaborative Competence of LLM Agents through Controlled Multi-Agent Experiments](https://arxiv.org/abs/2606.06399)
   - Published：2026-06-05 12:00
   - 作者：Jiaju Chen，Bo Sun，Yuxuan Lu，Yun Wang，Dakuo Wang，Bingsheng Yao
   - 来源：arxiv
   - 相关性分数：210
   - 命中原因：title matched "LLM"; title matched "agent"; summary matched "language model"; summary matched "large language model"
   - 分类：cs.CL
   - 标签：评测 / 应用 / 方法
   - 主题词：LLM / Language Model
   - PDF：https://arxiv.org/pdf/2606.06399
   - 摘要：Multi-agent systems (MAS) built on large language models have shown growing promise, with their effectiveness resting on agents' ability to coordinate through text-based channels much as human teams do. Yet recent study suggests that MAS often falter not because agents lack individual task-solving ability, but because they lack collaborative competence: the capacity to establish common ground, maintain shared task understanding, balance individual and collective incentives, and repair misalignment as interaction unfolds. Decades of research in Computer-Supported Cooperative Work have characterized these requirements for human teams coordinating under constrained communication, yet existing MAS evaluations focus mainly on task outcomes or single-agent proficiency in reasoning, planning, and tool use. To enable a systematic analysis of agents' collaborative competence in MAS, we introduce CollabSim, a configurable simulation framework that combines a theory-grounded definition of collaborative capabilities, controlled manipulation of interaction conditions, and action-level probing of agents' internal states. Experiments across four LLMs show that CollabSim can capture condition effects, separate model performance patterns, and reveal task-dependent effects of agent design.

3. [AdaPlanBench: Evaluating Adaptive Planning in Large Language Model Agents under World and User Constraints](https://arxiv.org/abs/2606.05622)
   - Published：2026-06-05 12:00
   - 作者：Jiayu Liu，Cheng Qian，Zhenhailong Wang，Bingxuan Li，Jiateng Liu，Heng Wang 等
   - 来源：arxiv
   - 相关性分数：196
   - 命中原因：title matched "language model"; title matched "large language model"; title matched "agent"; summary matched "LLM"
   - 分类：cs.CL
   - 标签：评测 / 方法
   - 主题词：LLM / Language Model
   - PDF：https://arxiv.org/pdf/2606.05622
   - 摘要：Planning for real-world problems by language models often involves both world and user constraints, which may not be fully specified upfront and are progressively disclosed through interaction. However, existing benchmarks still underexplore adaptive planning under such progressively revealed dual constraints. To address this gap, we introduce AdaPlanBench, a dynamic interactive benchmark for evaluating whether Large Language Model (LLM) agents can adaptively plan and re-plan under progressively revealed world and user constraints. AdaPlanBench is built on 307 household tasks, with a scalable constraint construction pipeline that augments each task with dual constraints. At runtime, agents interact with the environment in a multi-turn protocol where hidden constraints are revealed only when the agent proposes a plan that violates them, requiring iterative plan revision under accumulating feedback. This makes planning challenging, as agents must infer and track constraints from feedback while re-planning effectively. Experiments on ten leading LLMs show that adaptive planning under dual constraints remains challenging, with the best model reaching only 67.75% accuracy. We further observe that performance degrades as more constraints accumulate, with user constraints posing a particularly large challenge and failures often stemming from weaker physical grounding and reduced effectiveness. These results establish AdaPlanBench as a testbed for dual-constrained interactive planning and highlight the challenge of reliable adaptation to dynamically revealed constraints in LLM agents.

4. [The Tell-Tale Norm: $\ell_2$ Magnitude as a Signal for Reasoning Dynamics in Large Language Models](https://arxiv.org/abs/2606.06188)
   - Published：2026-06-05 12:00
   - 作者：Jinyang Zhang，Hongxin Ding，Yue Fang，Weibin Liao，Muyang Ye，Junfeng Zhao 等
   - 来源：arxiv
   - 相关性分数：196
   - 命中原因：title matched "language model"; title matched "large language model"; title matched "reasoning"; summary matched "LLM"
   - 分类：cs.CL
   - 标签：评测 / 方法
   - 主题词：LLM / Language Model
   - PDF：https://arxiv.org/pdf/2606.06188
   - 摘要：Recent work has sought to understand Large Language Models (LLMs) reasoning, yet a principled, model-intrinsic signal that captures its layer-wise reasoning dynamics remains underexplored. We bridge this gap by demonstrating that the l2 norm of hidden states serves as an endogenous signal of the model's reasoning intensity. Using Sparse Autoencoders (SAEs) as a diagnostic probe, we observe that LLMs' internal reasoning is marked by a sharp increase in reasoning feature activations concentrated in late layers. Motivated by this pattern, we establish a formal link between reasoning intensity and the model's latent geometry and theoretically prove that the l2 norm of hidden states bounds the activation strength of SAE reasoning features. Empirical correlation analysis and causal interventions further validate the l2 norm as a faithful indicator, where heightened norms consistently correspond to critical reasoning steps. We then introduce three test-time scaling techniques guided by l2 norms: (i) Adaptive Layer-wise Reasoning Recursion, (ii) Endogenous Reasoning State Steering, and (iii) l2-guided Response Selection, which requires no additional training or data and is compatible with advanced inference engines. Experiments across model architectures and benchmarks show that l2-norm-based techniques significantly improve reasoning performance, offering a principled yet simple lens to perceive and control LLM latent reasoning dynamics. Our code is available at https://github.com/zjy1298/The-Tell-Tale-Norm.

5. [Beyond tokens: a unified framework for latent communication in LLM-based multi-agent systems](https://arxiv.org/abs/2606.05711)
   - Published：2026-06-05 12:00
   - 作者：Yingzhuo Liu
   - 来源：arxiv
   - 相关性分数：192
   - 命中原因：title matched "LLM"; title matched "agent"; summary matched "language model"; summary matched "large language model"
   - 分类：cs.CL
   - 标签：应用 / 方法
   - 主题词：LLM / Language Model
   - PDF：https://arxiv.org/pdf/2606.05711
   - 摘要：Multi-agent systems built on large language models (LLMs) have become a prevailing paradigm for tackling complex reasoning, planning, and tool-use tasks. The dominant communication protocol in such systems is natural language: agents exchange messages token-by-token, verbalising their internal reasoning so that peers can read, verify, and respond. While convenient and interpretable, this protocol suffers from three structural drawbacks -- high inference cost, irreversible information loss during discretization, and ambiguity/redundancy of natural language. A growing body of work therefore explores an alternative protocol -- latent communication -- in which agents exchange continuous representations (embeddings, hidden states, or KV-caches) directly, bypassing the bottleneck of text generation. This paper presents a unified framework for organising the rapidly expanding literature on latent communication. We analyse existing methods along three orthogonal axes: (1) WHAT information is communicated (Embeddings, Hidden States, KV-Caches, or other continuous state); (2) WHICH sender-receiver alignment is used (latent-space alignment and layer alignment); and (3) HOW the communicated information is fused into the receiver (concatenation, prepending, mathematical operations, cross-attention, or cache restoration). Under this 3-axis framework, we systematically categorise eighteen representative methods proposed between 2024 and 2026, identify five major design patterns, and surface a set of open challenges -- including cross-architecture alignment, security of latent channels, compression for edge deployment, and the relationship between latent communication and latent chain-of-thought. We hope that this framework both lowers the barrier to entry for new researchers and provides a vocabulary for comparing future work.

6. [Evaluating Stochastic Collapse and Implicit Bias in Multimodal Large Language Models](https://arxiv.org/abs/2606.05874)
   - Published：2026-06-05 12:00
   - 作者：Huiyuan Zheng，Houtao Zhang，Boyang Wang，Qingyi Si，Hongcheng Guo
   - 来源：arxiv
   - 相关性分数：192
   - 命中原因：title matched "language model"; title matched "large language model"; summary matched "LLM"; summary matched "RAG"
   - 分类：cs.CL
   - 标签：评测 / 方法
   - 主题词：LLM / Language Model
   - PDF：https://arxiv.org/pdf/2606.05874
   - 摘要：Current evaluations for Multimodal Large Language Models (MLLMs) overwhelmingly focus on utility-driven objectives, leaving model behavior under logic-neutral scenarios largely underexplored. Stochasticity is essential in scenarios where multiple actions are equally valid, such as recommending travel itineraries or daily schedules where multiple options have similar utility. In such settings, deterministic policies may lead to repetitive behaviors and reduced coverage of valid alternatives. To bridge this gap, we propose RandomBench, a benchmark designed to evaluate whether MLLMs can maintain distributionally neutral behavior when selecting among equivalent options. We further introduce three metrics, including RI, BCI, BII, to quantify entropy and distributional bias. Experiments reveal a pervasive phenomenon termed Stochastic Collapse, where MLLMs fail to maintain uniform randomness under explicit random instructions, with top-1 probabilities reaching 97% from the ideal one quarter baseline and RI dropping to 0.068 in Claude Sonnet 4.6. Extensive ablation studies further demonstrate that these deviations persist across languages and representation formats, highlighting the robustness of distributional collapse in logic-neutral decision settings.

7. [IA-RAG: Interval-Algebra-Driven Temporal Reasoning for Dynamic Knowledge Retrieval](https://arxiv.org/abs/2606.06044)
   - Published：2026-06-05 12:00
   - 作者：Xiaoman Wang，Yaoze Zhang，Wenzhuo Fan，Hongwei Zhang，Ding Wang，Guohang Yan 等
   - 来源：arxiv
   - 相关性分数：192
   - 命中原因：title matched "reasoning"; title matched "RAG"; summary matched "language model"; summary matched "large language model"
   - 分类：cs.CL
   - 标签：评测 / 方法
   - 主题词：LLM / Language Model
   - PDF：https://arxiv.org/pdf/2606.06044
   - 摘要：Retrieval-Augmented Generation (RAG) has shown strong effectiveness in grounding Large Language Models (LLMs) with external knowledge. However, existing RAG and Graph RAG frameworks largely treat knowledge as static or associate time with coarse-grained timestamps or metadata, failing to capture rich temporal structures such as duration, overlap, and containment. We propose IA-RAG, a hierarchical temporal RAG framework that models knowledge as time intervals and performs retrieval under formal temporal constraints. IA-RAG represents facts as Interval Event Units (IEUs) and organizes them into a hierarchical Thematic Forest, where temporal dependencies are governed by Allen's Interval Algebra. To handle incomplete or uncertain temporal boundaries, IA-RAG further introduces a Sub-graph Time Tightening mechanism that refines fuzzy intervals through logical constraints within connected event subgraphs. In addition, IA-RAG supports implicit temporal semantic retrieval through interval-algebra-guided traversal. Experiments on multiple temporal question answering benchmarks, including TimeQA, TempReason, and ComplexTR, demonstrate that IA-RAG achieves strong temporal retrieval and reasoning performance, particularly on complex compositional temporal reasoning tasks. Our code is released at https://github.com/xiaoAugenstern/LogicalRAG_TemporalQA.

8. [A Komi-Yazva--Russian Parallel Corpus and Evaluation Protocol for Zero- and Few-Shot LLM Translation](https://arxiv.org/abs/2606.06420)
   - Published：2026-06-05 12:00
   - 作者：Petr Parshakov
   - 来源：arxiv
   - 相关性分数：174
   - 命中原因：title matched "LLM"; title matched "evaluation"; summary matched "language model"; summary matched "large language model"
   - 分类：cs.CL
   - 标签：评测 / 数据 / 方法
   - 主题词：LLM / Language Model
   - PDF：https://arxiv.org/pdf/2606.06420
   - 摘要：We present the first Komi-Yazva--Russian parallel corpus together with an explicit evaluation protocol for studying LLM translation in an endangered, extremely low-resource setting. The dataset contains 457 aligned sentence pairs from 74 narrative texts and is accompanied by documented provenance, sentence-level alignment, and story identifiers that enable leakage-aware evaluation. We use this setup to compare modern large language models on Komi-Yazva-to-Russian translation under severe parallel-data scarcity in zero-shot and retrieval-based few-shot regimes. The protocol includes story-level cross-validation, deterministic retrieval for few-shot prompting, strict validation of generated outputs, complementary reference-based and judge-based metrics, and story-level uncertainty estimates. Across models, LLMs produce non-trivial translations, but performance varies strongly by model family and prompting regime. Retrieval-based few-shot prompting consistently improves over zero-shot prompting, while gains beyond a small retrieved context remain limited. The results show that evaluative conclusions in this setting depend materially on metric choice and failure handling, so the paper frames the corpus as both a dataset contribution and a reproducible evaluation testbed for endangered-language machine translation.

9. [CollabBench: Benchmarking and Unleashing Collaborative Ability of LLMs with Diverse Players via Proactive Engagement](https://arxiv.org/abs/2606.05793)
   - Published：2026-06-05 12:00
   - 作者：Hong Qian，Yuanhao Liu，Zihan Zhou，Zongbao Zhang，Hanjie Ge，Haotian Shi 等
   - 来源：arxiv
   - 相关性分数：174
   - 命中原因：title matched "LLM"; title matched "benchmark"; summary matched "reasoning"; summary matched "agent"
   - 分类：cs.CL, cs.AI, cs.CY, cs.LG
   - 标签：评测 / 方法
   - 主题词：LLM / Benchmark
   - PDF：https://arxiv.org/pdf/2606.05793
   - 摘要：While LLM-based agents excel at individual tasks, effective collaboration with realistic human partners remains challenging. Most of the existing conversation-level collaborative studies lack grounded interaction and behavioral execution, motivating the need for cooperative game environments that enable contextualized and immersive collaboration. To this end, this paper proposes CollabBench, a benchmark for evaluating and training collaborative agents in cooperative games. CollabBench features a Diverse Player Profile Simulation pipeline to model varied players behaviors, and a Collaborative Agentic Training paradigm that unifies reasoning, communication, and action via agentic rollouts, optimized with a hybrid reward balancing task efficiency and affective adaptation. We further extend classic environments to CWAH-MultiPlayer and Cook-MultiPlayer for systematic evaluation under diverse personalities. Experiments with efficiency and affective metrics show that our trained models outperform base models, achieving 19.5% higher efficiency and 24.4% improved affective performance. Further analysis reveals key collaborative limitations of existing models and offers insights for future collaborative training.

10. [LLMs Can Leak Training Data But Do They Want To? A Propensity-Aware Evaluation of Memorization in LLMs](https://arxiv.org/abs/2606.06286)
   - Published：2026-06-05 12:00
   - 作者：Gianluca Barmina，Peter Schneider-Kamp，Lukas Galke Poech
   - 来源：arxiv
   - 相关性分数：174
   - 命中原因：title matched "LLM"; title matched "evaluation"; summary matched "language model"; summary matched "large language model"
   - 分类：cs.CL, cs.AI
   - 标签：评测 / 数据 / 方法
   - 主题词：LLM / Language Model
   - PDF：https://arxiv.org/pdf/2606.06286
   - 摘要：Large language models can reproduce training data, but existing memorization evaluations mostly measure whether models can be forced to do so, rather than whether they do so under ordinary use. We introduce PropMe, a propensity-aware framework for memorization evaluation that contrasts prefix-based capability attacks with non-adversarial evaluations. We propose a metric transformation that, applied to existing functions, allows to create propensity metrics. We further introduce SimpleTrace, a lightweight tracing pipeline built on infini-gram that deterministically attributes model generations to large-scale training corpora and computes verbatim, near-verbatim, and propensity-transformed memorization metrics. Evaluating two fully-open models: Comma and DFM Decoder on two datasets: Common Pile and Dynaword in two languages, we find a consistent gap between capability and propensity: prefix attacks elicit substantially stronger memorization signals than generic or dataset-specific prompts, while propensity scores remain low overall. Thus, the models can reveal training data when directly elicited, but rarely do so in more common non-adversarial settings. We also find that DFM Decoder, which is continually pre-trained from Comma, exhibits reduced memorization and memorization propensity for Common Pile, confirming that memorization capability can decrease when later training emphasizes partially different data. Our results suggest, and we encourage, that memorization audits should report both worst-case extractability and ordinary leakage propensity in order to have a more comprehensive view of this phenomenon.

11. [PlanBench-V: A Spatial Planning Map Benchmark for Vision-Language Models](https://arxiv.org/abs/2606.05744)
   - Published：2026-06-05 12:00
   - 作者：Minxin Chen，He Zhu，Junyou Su，Wen Wang，Yijie Deng，Wenjia Zhang
   - 来源：arxiv
   - 相关性分数：174
   - 命中原因：title matched "language model"; title matched "benchmark"; summary matched "reasoning"; summary matched "agent"
   - 分类：cs.CL
   - 标签：评测 / 数据 / 方法
   - 主题词：Language Model / Benchmark
   - PDF：https://arxiv.org/pdf/2606.05744
   - 摘要：Spatial planning maps are central to territorial governance, translating planning objectives, regulations, and spatial strategies into visual forms for decision-making, public communication, and institutional coordination. Their interpretation, however, requires fine-grained visual perception, spatial reasoning, and policy-informed professional judgment, creating major challenges for both human learners and AI systems. With the rapid progress of Vision-Language Models (VLMs), their use in urban planning analysis is gaining attention, yet existing multimodal benchmarks mainly target general visual understanding and overlook the domain-specific cognitive processes of planning practice. To address this gap, we introduce PlanBench-V, the first comprehensive benchmark for evaluating VLMs in spatial planning map interpretation. We first build the Spatial Planning Map Database (SPMD), an expert-annotated dataset of 223 planning maps and 1629 question-answer pairs curated by professional planners, covering diverse geographic regions and cartographic styles. We then propose a theory-informed evaluation framework assessing four progressive capabilities: Perception, Reasoning, Association, and Implementation, corresponding to the cognitive pipeline of planning map interpretation. Extensive experiments across two generations of VLMs show clear progress but persistent limitations. The best 2026 agentic reasoning model, Qwen3.6-Plus, substantially outperforms the best 2025 model, GPT-4o, by 27%. Nevertheless, all models still struggle with implementation-oriented tasks requiring evaluative judgment, policy sensitivity, and constraint-aware decision-making. These findings reveal fundamental limitations of current VLMs in professional planning contexts and highlight the need for domain-adaptive multimodal reasoning frameworks. Code and data are available at https://plangpt.github.io.

12. [Using Large Language Models to Support High Volume Application Review for an Undergraduate Research Program](https://arxiv.org/abs/2606.05564)
   - Published：2026-06-05 12:00
   - 作者：Varun Aggarwal，Kay Kobak，John Howarter
   - 来源：arxiv
   - 相关性分数：174
   - 命中原因：title matched "language model"; title matched "large language model"; summary matched "LLM"; summary matched "RAG"
   - 分类：cs.CL
   - 标签：评测 / 应用 / 方法
   - 主题词：LLM / Language Model
   - PDF：https://arxiv.org/pdf/2606.05564
   - 摘要：Undergraduate research programs such as the Summer Undergraduate Research Fellowship (SURF) at Purdue University receive thousands of applications every year, requiring significant time and effort for program staff to evaluate each submission consistently and within tight timelines. This work-in-progress paper describes the development and initial deployment of a large language model (LLM)-based tool to assist in the evaluation of approximately 1,200 student Statements of Purpose (SoPs) for the SURF 2026 cycle at Purdue University. The workflow utilizes OpenAI GPT models (GPT-4o, GPT-5-mini, and GPT-5.2) and uses a structured rubric across six subcategories, each scored on a 0-3 scale. A few SoPs, graded by program staff, were used to tune the model responses. The model prompt was designed to generate both numerical scores, rationales (including positive and negative aspects) and short excerpts from each submission. Using GPT-5.2, the full batch of 1,200 SoPs was processed in approximately 4.6 hours of compute time, averaging roughly 14 seconds per SoP (with per-SoP timing varying with SoP length, which ranged from 500 to 2,000 words). Notable differences in rubric adherence were observed across model versions, with GPT-5.2 adhering most closely. Disagreement in model scores was more pronounced for lower-scoring submissions. The LLM outputs replicated the role previously played by distributed human graders, providing the program coordinator with scored and rationale-annotated outputs for the entire applicant pool. The program coordinator then reviewed these outputs alongside each applicant's SoP, applying the same downstream office criteria used in prior SURF cycles, to produce a shortlist of strong candidates. This coordinator review was completed in approximately 4 hours, compared to the multi-week coordination effort required in prior program cycles.

13. [EGTR-Review: Efficient Evidence-Grounded Scientific Peer Review Generation via Multi-Agent Teacher Distillation](https://arxiv.org/abs/2606.06025)
   - Published：2026-06-05 12:00
   - 作者：Xinpeng Qiu，Wang Yihu，Zhifeng Liu，Xiaochen Wang，Jimin Wang
   - 来源：arxiv
   - 相关性分数：170
   - 命中原因：title matched "agent"; summary matched "language model"; summary matched "large language model"; summary matched "LLM"
   - 分类：cs.CL, cs.AI
   - 标签：评测 / 数据 / 方法
   - 主题词：LLM / Language Model
   - PDF：https://arxiv.org/pdf/2606.06025
   - 摘要：Scientific peer review generation has attracted increasing attention for reducing reviewing burdens and providing timely feedback. However, existing Large Language Model (LLM)-based methods often produce generic comments with insufficient evidence support and weak source traceability, while complex multi-agent systems incur high inference costs. To address these challenges, we propose EGTR-Review, an Evidence-Grounded and Traceable Review Generation framework via Multi-Agent Teacher Distillation. EGTR-Review first constructs a multi-agent teacher that performs structure-aware paper decomposition, key-element extraction, external scholarly evidence retrieval, evidence-state labeling, verification reasoning, and review synthesis. It then distills both intermediate reasoning trajectories and final review comments into a lightweight student model through task-prefix-driven multi-task learning. An evidence-weighted objective further reduces the influence of weak, missing, or non-verifiable supervision. Experiments on public peer-review datasets show that EGTR-Review (Student) outperforms strong prompt-based, fine-tuned, and structured/agentic baselines across automatic metrics, LLM-as-Judge evaluation, and human evaluation, while maintaining strong factual grounding and source traceability with substantially lower token consumption and inference time. Our code, prompts, configurations, and sample data are available on GitHub.

14. [YouZhi: Towards High-Concurrency Financial LLMs via Adaptive GQA-to-MLA Transition](https://arxiv.org/abs/2606.05868)
   - Published：2026-06-05 12:00
   - 作者：PSBC LLM Team，Huawei LLM Team，Ruihan Long，Junjie Wu，Tianan Zhang，Duo Zhang 等
   - 来源：arxiv
   - 相关性分数：170
   - 命中原因：title matched "LLM"; summary matched "language model"; summary matched "large language model"; summary matched "RAG"
   - 分类：cs.CL
   - 标签：评测 / 应用 / 方法
   - 主题词：LLM / Language Model
   - PDF：https://arxiv.org/pdf/2606.05868
   - 摘要：Large language models (LLMs) drive significant financial innovations, yet their high-concurrency deployment is severely bottlenecked by KV cache memory overhead, which inflates infrastructure costs and throttles scalability. To address this, we propose YouZhi-LLM, a highly efficient financial LLM empowered by a comprehensive structural transition and training pipeline natively built on the Huawei Ascend ecosystem. At its algorithmic core, YouZhi-LLM features a layer-adaptive GQA-to-MLA transition framework that dynamically assigns per-layer FreqFold sizes, maximizing KV-cache compression while minimizing perplexity degradation. To recover representation capacity and inject domain expertise, the Ascend-based training pipeline seamlessly integrates generalized knowledge distillation with financial-specific supervised fine-tuning. Evaluations demonstrate the superiority of this systematic approach, with the adaptive transition reducing perplexity degradation by up to 35% over uniform baselines. Crucially, when evaluated on Ascend NPUs via vLLM-Ascend, the massive KV-cache reduction translates directly into deployment efficiency. Compared to their respective base models, YouZhi-7B yields a 12.3% improvement in average financial benchmark score alongside a 2.69$\times$ increase in maximum concurrency; similarly, YouZhi-14B achieves a 7.0% accuracy gain and a 2.43$\times$ concurrency boost, establishing a new paradigm for cost-effective, high-throughput financial inference.

15. [Reducing Hallucinations in Complex Question Answering using Simple Graph-based Retrieval-Augmented Generation (long version)](https://arxiv.org/abs/2606.05901)
   - Published：2026-06-05 12:00
   - 作者：Christopher J. Wedge，Joshua Stutter，Danny Dixon，Jacek Ca{\l}a
   - 来源：arxiv
   - 相关性分数：166
   - 命中原因：summary matched "language model"; summary matched "large language model"; summary matched "LLM"; summary matched "reasoning"
   - 分类：cs.CL, cs.AI
   - 标签：评测 / 数据 / 应用 / 方法
   - 主题词：LLM / Language Model
   - PDF：https://arxiv.org/pdf/2606.05901
   - 摘要：Large language models (LLMs) have fundamentally transformed the landscape of Natural Language Processing. Despite these advances, LLMs and LLM-based systems remain prone to a variety of failure modes. Retrieval-augmented generation (RAG) systems have emerged as a common deployment scenario seeking to both avoid the well known risk of the LLM "hallucinating" information, and to enable reasoning and question answering over proprietary information that the LLM did not have access to during training without resorting to expensive model fine-tuning. In this work, we explore the idea of using a lightweight graph structure with a relatively simple graph schema, to support the RAG subsystem via a dedicated toolset. We design an agentic system with a variety of vector search and graph query tools operating over a structured dataset based on a curated subset of English Wikipedia articles, and evaluate its performance on questions from MoNaCo, a challenging Wikipedia QA benchmark of complex query answering tasks. Our results show that the introduction of graph-based tools can significantly increase the precision and recall of factual correctness, can halve the number of hallucinated answers, and achieves the highest fine-grained truthfulness score among the three evaluated scenarios. All this with a modest increase in token usage.

## Agent Runtime Security 观察

### 本组速览

- 《GuardNet: Ensemble Strategies of Shallow Neural Networks for Robust Prompt Injection and Jailbreak Detection》〔评测 / 数据 / 应用 / 方法〕：Large Language Models (LLMs) have transformed natural language processing, but they remain vulnerable to Prompt Injection (PI) and Jailbreak (JB) attacks. In a…
- 《From Risk Classification to Action Plan Remediation: A Guardrail Feedback Driven Framework for LLM Agents》〔数据 / 方法〕：LLM-based guardrails typically safeguard agents by evaluating proposed actions or inputs before execution, producing safety signals such as binary allow/deny d…
- 《Safety Paradox: How Enhanced Safety Awareness Leaves LLMs Vulnerable to Posterior Attack》〔评测 / 方法〕：Large language models (LLMs) are rigorously aligned to refuse harmful requests, a process that inherently cultivates a latent capacity to evaluate and recogniz…
- 《Beyond Similarity: Trustworthy Memory Search for Personal AI Agents》〔应用 / 方法〕：Personal AI agents increasingly rely on long-term memory to provide persistent personalization across sessions. However, existing memory pipelines are largely…
- 《The Granularity Gap: A Multi-Dimensional Longitudinal Audit of Sycophancy in Gemini Models》〔评测 / 数据 / 方法〕：Large language models are increasingly deployed as high-stakes advisors, yet standard alignment benchmarks treat sycophancy as a binary failure mode. We introd…

### 论文速览

1. [GuardNet: Ensemble Strategies of Shallow Neural Networks for Robust Prompt Injection and Jailbreak Detection](https://arxiv.org/abs/2606.05566)
   - Published：2026-06-05 12:00
   - 作者：Paulo Ricardo Ferreira Neves，Edson Rodrigues da Cruz Filho，Paulo Henrique Eleuterio Falsetti，Jo\~ao Vitor Pavan，Ian Degaspari，Henrique Vieira Laturrague 等
   - 来源：arxiv
   - 相关性分数：138
   - 命中原因：title matched "prompt injection"; title matched "jailbreak"; summary matched "guardrail"; has PDF
   - 分类：cs.AI, cs.CR
   - 标签：评测 / 数据 / 应用 / 方法
   - 主题词：LLM / Language Model
   - PDF：https://arxiv.org/pdf/2606.05566
   - 摘要：Large Language Models (LLMs) have transformed natural language processing, but they remain vulnerable to Prompt Injection (PI) and Jailbreak (JB) attacks. In addition, benchmark evaluations may be affected by contamination and partial information leakage, compromising performance estimates. This work presents GuardNet, a guardrail system based on an ensemble of shallow neural networks (BiLSTMs) with approximately 47 million parameters. We investigate the hypothesis that robustness in adversarial scenarios depends more on the diversity of example coverage and threshold calibration than on model scale. The results indicate that GuardNet achieves competitive performance compared with lightweight detectors and high efficiency at low latency, although larger LLMs such as Mistral-7B and Llama-3.1-8B still achieve superior performance in terms of F1 score and AUROC on the blind JBB-Behaviors benchmark. Nevertheless, GuardNet achieves an AUROC of 0.747 on the blind dataset (n = 200) and an F1 score of 0.92 on a proprietary benchmark (n = 50), under threshold calibration and evaluation with declared partial information leakage. The system operates with an average latency of approximately 50 ms on CPU, making it suitable for deployment in production environments with cost and infrastructure constraints.

2. [From Risk Classification to Action Plan Remediation: A Guardrail Feedback Driven Framework for LLM Agents](https://arxiv.org/abs/2606.05805)
   - Published：2026-06-05 12:00
   - 作者：Yuhao Sun，Jiacheng Zhang，Shaanan Cohney，Zhexin Zhang，Feng Liu，Xingliang Yuan
   - 来源：arxiv
   - 相关性分数：80
   - 命中原因：title matched "guardrail"; has PDF; has rich summary; has complete metadata
   - 分类：cs.AI
   - 标签：数据 / 方法
   - 主题词：LLM / Language Model
   - PDF：https://arxiv.org/pdf/2606.05805
   - 摘要：LLM-based guardrails typically safeguard agents by evaluating proposed actions or inputs before execution, producing safety signals such as binary allow/deny decisions, risk categories, and/or explanatory rationales about potential policy violations. However, agent risks often arise when otherwise benign tasks are contaminated by untrusted external content, unsafe instructions, or risky tool use. Existing guardrails often flag the entire task uniformly as unsafe, thereby blocking the threat but sacrificing the benign part. Moreover, existing work largely evaluates guardrails in isolation, leaving unclear whether their interventions lead to safer downstream agent behavior. To address this, we introduce TRIAD (Tripartite Response for Iterative Agent Guardrailing), a guardrail-integrated agent framework that leverages guardrail-generated verbal feedback as a guiding signal to keep the agent aligned with benign objectives at each planning step. We finetune a language model on a self-curated training dataset to output one of three decisions: proceed, refuse, or update, together with structured natural-language feedback. Rather than merely allowing or blocking execution, update guides the agent to revise its plan, avoid harmful components, and preserve the benign task where possible. TRIAD injects this feedback into the agent's context, enabling subsequent plan revision and forming a closed loop between guardrail feedback and agent planning. Extensive experiments on ASB and AgentHarm show that TRIAD reduces the average attack success rate to 10.42%, while achieving the best safety-utility trade-off among guardrail-integrated baselines. Our code is available at: https://github.com/YUHAOSUNABC/TRIAD.

3. [Safety Paradox: How Enhanced Safety Awareness Leaves LLMs Vulnerable to Posterior Attack](https://arxiv.org/abs/2606.05614)
   - Published：2026-06-05 12:00
   - 作者：Long P. Hoang，Hai V. Le，Shaoyang Xu，Wei Lu，Wenxuan Zhang
   - 来源：arxiv
   - 相关性分数：76
   - 命中原因：summary matched "jailbreak"; summary matched "guardrail"; has PDF; has rich summary
   - 分类：cs.AI
   - 标签：评测 / 方法
   - 主题词：LLM / Language Model
   - PDF：https://arxiv.org/pdf/2606.05614
   - 摘要：Large language models (LLMs) are rigorously aligned to refuse harmful requests, a process that inherently cultivates a latent capacity to evaluate and recognize unsafe content. In this work, we reveal that this advanced safety awareness inadvertently introduces a fatal vulnerability. We introduce Posterior Attack, a single-query jailbreak that bypasses guardrails by prompting the model to generate the exact harmful response its internal classifier would normally flag as unsafe. Through extensive empirical evaluation across 30 open-source LLMs (up to 35B parameters in size) and frontier models (e.g., GPT-5, Claude 4.6), we observe a striking phenomenon: models with superior safety-judgment capabilities are disproportionately more susceptible to this exploitation. To explain this, we formalize the Safety Paradox, analytically showing that monotonic improvements in safety alignment naturally amplify posterior vulnerability. Finally, we establish a causal link via reinforcement learning interventions, exemplifying that artificially degrading a model's safety judgment immunizes it against the attack, whereas enhancing judgment exacerbates the vulnerability. Our findings highlight potential flaws in current alignment paradigms, indicating that defense mechanisms may require further structural refinement.

4. [Beyond Similarity: Trustworthy Memory Search for Personal AI Agents](https://arxiv.org/abs/2606.06054)
   - Published：2026-06-05 12:00
   - 作者：Jiawen Zhang，Kejia Chen，Jiachen Ma，Yangfan Hu，Lipeng He，Yechao Zhang 等
   - 来源：arxiv
   - 相关性分数：58
   - 命中原因：summary matched "jailbreak"; has PDF; has rich summary; has complete metadata
   - 分类：cs.AI
   - 标签：应用 / 方法
   - 主题词：LLM / Agent
   - PDF：https://arxiv.org/pdf/2606.06054
   - 摘要：Personal AI agents increasingly rely on long-term memory to provide persistent personalization across sessions. However, existing memory pipelines are largely driven by semantic similarity: memory data close to the current query is retrieved and injected into the model context. This creates a critical trustworthiness gap, since a semantically related memory may still be contextually inappropriate, leading to threats such as cross-domain leakage, sycophancy, tool-call drift, or memory-induced jailbreaks. In this paper, we study memory search as a trust boundary in personal AI agents. We evaluate representative agentic memory frameworks, including A-Mem, Mem0, and MemOS, together with OpenClaw, a real-world personal-agent environment with persistent state and tool-use capability. Our results show that long-term memory is not merely a utility layer, but a durable control channel that can reshape how agents interpret tasks and execute actions, leaving them highly susceptible to the aforementioned threats. To mitigate these vulnerabilities, we propose MemGate, a lightweight and deployable memory plug-in for trustworthy memory search, with only 9M parameters and a 35.1MB footprint. MemGate is inserted between the vector memory store and the backbone LLM, requiring no LLM modification, memory-database rewriting, or inference-time LLM judge. It applies a query-conditioned neural gate to candidate memory representations, turning raw similarity search into task-conditioned memory admission. Across multiple mainstream memory frameworks, real-world agent settings, and diverse LLM backbones, MemGate reduces memory-induced threats while preserving long-term memory utility.

5. [The Granularity Gap: A Multi-Dimensional Longitudinal Audit of Sycophancy in Gemini Models](https://arxiv.org/abs/2606.05183)
   - Published：2026-06-05 12:00
   - 作者：Patrick Keough
   - 来源：arxiv
   - 相关性分数：58
   - 命中原因：summary matched "guardrail"; has PDF; has rich summary; has complete metadata
   - 分类：cs.CL, cs.AI, cs.HC
   - 标签：评测 / 数据 / 方法
   - 主题词：Language Model / Benchmark
   - PDF：https://arxiv.org/pdf/2606.05183
   - 摘要：Large language models are increasingly deployed as high-stakes advisors, yet standard alignment benchmarks treat sycophancy as a binary failure mode. We introduce the Granularity Gap: coarse binary metrics mask substantial social-compliance behaviors where models capitulate to user framing, validate questionable premises, or soften factual corrections without producing overtly false outputs. We evaluate six Gemini variants across generations 2.0, 2.5, and 3.0 on 73 adversarial prompts under three guardrail conditions (Control, Simple, Protocol), yielding 8,830 graded responses. Using a 0-4 Likert scale validated against a human annotator triad (Fleiss kappa = 0.71; Cohen kappa = 0.78 vs AI consensus; 95.9 percent binary accuracy, 100 percent specificity), we quantify sycophancy as continuous rather than binary. Three findings emerge. First, 27.2 percent of responses contain substantial sycophantic content (Likert >= 2.0) and 22.7 percent reach moderate or severe levels (>= 3.0), while binary win-rate framing reports only modest failure rates; coarse metrics explain just 29 percent of graded variance. Second, generational progress is non-monotonic: Gen 2.5 regresses sharply (mean Control 2.64) relative to Gen 2.0 (1.90) and Gen 3.0 (2.01), and Gen 2.5 shows inverse scaling (Pro 1.94 worse than Flash 1.71) while Gen 3.0 restores standard scaling. Third, we document an Alignment Tax: Spearman rho = -0.63 between sycophancy and truthfulness, indicating social compliance trades against factual accuracy. Egotistical Validation prompts act as a sycophancy trap (mean 3.27), nearly double Unethical Proposals (1.72). Simple guardrails outperform elaborate Protocol scaffolding on flagship models, but distilled Gen 3.0 Flash inverts this, suggesting small models may structurally require chain-of-thought scaffolding. We release the dataset and rubric to support continuous sycophancy measurement.

## Terminal and SWE Agents 观察

### 本组速览

- 《ADK Arena: Evaluating Agent Development Kits via LLM-as-a-Developer》〔评测 / 方法〕：The rapid proliferation of Agent Development Kits (ADKs), SDK-level frameworks for building LLM-powered autonomous agents, has outpaced any empirical understan…
- 《Asuka-Bench: Benchmarking Code Agents on Underspecified User Intent and Multi-Round Refinement》〔评测 / 方法〕：Existing code-generation benchmarks score a single mapping from a complete prompt to a one-shot output. However, real web development is different. Users seldo…
- 《Knowledge Matters: Injecting Project and Testing Knowledge into LLM-based Unit Test Generation》〔评测 / 方法〕：Automated unit test generation using large language models (LLMs) holds great promise but often struggles with generating tests that are both correct and maint…
- 《SmellBench: Towards Fine-Grained Evaluation of Code Agents on Refactoring Tasks》〔评测 / 方法〕：Code Agents have achieved remarkable advances in recent years, exhibiting strong capabilities across a wide range of software engineering tasks. However, their…
- 《From Failed Trajectories to Reliable LLM Agents: Diagnosing and Repairing Harness Flaws》〔评测 / 应用 / 方法〕：LLM-based agents increasingly rely on harnesses that provide execution environments, tool interfaces, context, lifecycle orchestration, observability, verifica…

### 论文速览

1. [ADK Arena: Evaluating Agent Development Kits via LLM-as-a-Developer](https://arxiv.org/abs/2606.05548)
   - Published：2026-06-05 12:00
   - 作者：Jintao Huang，Xiaomin Li，Gaurav Mittal，Yu Hu
   - 来源：arxiv
   - 相关性分数：94
   - 命中原因：summary matched "Terminal-Bench"; summary matched "SWE-bench"; summary matched "coding agent"; has PDF
   - 分类：cs.SE, cs.AI
   - 标签：评测 / 方法
   - 主题词：LLM / Benchmark
   - PDF：https://arxiv.org/pdf/2606.05548
   - 摘要：The rapid proliferation of Agent Development Kits (ADKs), SDK-level frameworks for building LLM-powered autonomous agents, has outpaced any empirical understanding of how framework choice affects agent performance. We propose \textbf{LLM-as-a-Developer}, a methodology that replaces human developers with an LLM coding agent that learns each framework's API from documentation, writes agent code, and iteratively repairs it through a validate-and-feedback loop until tests pass. By holding the developer constant and varying only the framework, generation effort becomes a quantitative proxy for API usability and the resulting agents provide a controlled measure of framework effectiveness. We implement this in \textbf{ADK Arena}, a fully automated pipeline with per-framework Docker isolation, a three-level validation pipeline, and benchmark adapters for SWE-bench, $\tau^2$-bench, Terminal-Bench, and MCP-Atlas. Evaluating all 51 popular Python ADK frameworks (204 agent--benchmark pairs), we find that: (1)~generation succeeds for 57\% of runs, and its cost varies 5.6$\times$ across frameworks (\$0.6 to \$3.4 per agent), a quantitative proxy for API complexity, though cost alone does not predict success; (2)~no single framework dominates: the best single-benchmark ADK agents resolve up to 80\% of tasks and can even \emph{beat} general-purpose frontier coding agents at a fraction of the cost, yet the median framework resolves only 32\%; (3)~across information-source ablations, genuine framework usage stays within a narrow 28--40\% band (highest with raw source access and still 33\% with no reference material at all), indicating that documentation, source code, and parametric knowledge are largely substitutable rather than any one being a hard bottleneck.

2. [Asuka-Bench: Benchmarking Code Agents on Underspecified User Intent and Multi-Round Refinement](https://arxiv.org/abs/2606.05920)
   - Published：2026-06-05 12:00
   - 作者：Xin Wang，Liangtai Sun，Yaoming Zhu，Shuang Zhou，Jiaxing Liu，Fengjiao Chen 等
   - 来源：arxiv
   - 相关性分数：80
   - 命中原因：title matched "code agent"; has PDF; has rich summary; has complete metadata
   - 分类：cs.SE, cs.CL
   - 标签：评测 / 方法
   - 主题词：LLM / Benchmark
   - PDF：https://arxiv.org/pdf/2606.05920
   - 摘要：Existing code-generation benchmarks score a single mapping from a complete prompt to a one-shot output. However, real web development is different. Users seldom write a full spec at the start; many requirements only become clear once they look at an intermediate result and react to it. We present Asuka-Bench, a benchmark that pairs underspecified user intent with multi-round refinement, grounded in browser-rendered behavior. Each task is resolved through a closed loop: a Code Agent generates a web project, a UI Agent executes test cases on the deployed site, and a User LLM turns evaluation outcomes into natural-language feedback for the next round. The benchmark comprises 50 web tasks with 784 evaluation criteria and 2402 expected outcomes. We benchmark 8 LLMs across 2 agent frameworks. The results separate models clearly: weighted Task Pass Rate varies by 38 percentage points and models also differ substantially in their ability to repair from feedback. Asuka-Bench is also far from saturated: even the strongest model completes only 52% of projects after three rounds.

3. [Knowledge Matters: Injecting Project and Testing Knowledge into LLM-based Unit Test Generation](https://arxiv.org/abs/2511.14224)
   - Published：2026-06-05 12:00
   - 作者：Anji Li，Mingwei Liu，Zhenxi Chen，Zheng Pei，Zike Li，Dekun Dai 等
   - 来源：arxiv
   - 相关性分数：80
   - 命中原因：title matched "test generation"; has PDF; has rich summary; has complete metadata
   - 分类：cs.SE
   - 标签：评测 / 方法
   - 主题词：LLM / Language Model
   - PDF：https://arxiv.org/pdf/2511.14224
   - 摘要：Automated unit test generation using large language models (LLMs) holds great promise but often struggles with generating tests that are both correct and maintainable in real-world projects. This paper presents KTester, a novel framework that integrates project-specific knowledge and testing domain knowledge to enhance LLM-based test generation. Our approach first extracts project structure and usage knowledge through static analysis, which provides rich context for the model. It then employs a testing-domain-knowledge-guided separation of test case design and test method generation, combined with a multi-perspective prompting strategy that guides the LLM to consider diverse testing heuristics. The generated tests follow structured templates, improving clarity and maintainability. We evaluate KTester on multiple open-source projects, comparing it against state-of-the-art LLM-based baselines using automatic correctness and coverage metrics, as well as a human study assessing readability and maintainability. Results demonstrate that KTester significantly outperforms existing methods across six key metrics, improving execution pass rate by 5.69% and line coverage by 8.83% over the strongest baseline, while requiring less time and generating fewer test cases. Human evaluators also rate the tests produced by KTester significantly higher in terms of correctness, readability, and maintainability, confirming the practical advantages of our knowledge-driven framework.

4. [SmellBench: Towards Fine-Grained Evaluation of Code Agents on Refactoring Tasks](https://arxiv.org/abs/2606.05574)
   - Published：2026-06-05 12:00
   - 作者：Fake Lin，Binbin Hu，Xi Zhu，Ziwei Zhao，Zhi Zheng，Ziqi Liu 等
   - 来源：arxiv
   - 相关性分数：80
   - 命中原因：title matched "code agent"; has PDF; has rich summary; has complete metadata
   - 分类：cs.SE
   - 标签：评测 / 方法
   - 主题词：LLM / Benchmark
   - PDF：https://arxiv.org/pdf/2606.05574
   - 摘要：Code Agents have achieved remarkable advances in recent years, exhibiting strong capabilities across a wide range of software engineering tasks. However, their misuse often produces bloated and disorganized code that impairing readability, extensibility, and robustness. Despite this risk, existing benchmarks largely evaluate functional correctness rather than long-term maintainability of code agents. In this paper, we propose SmellBench, an extensible code refactoring benchmark that proactively injects code smells into clean code snippets from real-world repositories. This design enables the generation of controlled, high-quality, and diverse refactoring cases with human-written ground truth. Specifically, it contains 294 cases spanning 7 popular smell types, 3 difficulty levels, 2 instruction settings across 7 real-world repositories. We further design 3 evaluation aspects covering functional correctness, localization ability, and refactoring quality assessment. Experiments with 2 popular agents and 6 large langauge models (LLMs) show that the best combination - Qwen Code + Claude Sonnet 4.5 - achieved only a 50.34 score of smell elimination. Further analysis reveals that this gap arises from a focus on local code smells and a lack of cross-file understanding, which hinders comprehensive smell elimination.

5. [From Failed Trajectories to Reliable LLM Agents: Diagnosing and Repairing Harness Flaws](https://arxiv.org/abs/2606.06324)
   - Published：2026-06-05 12:00
   - 作者：Mengzhuo Chen，Junjie Wang，Zhe Liu，Yawen Wang，Qing Wang
   - 来源：arxiv
   - 相关性分数：76
   - 命中原因：summary matched "Terminal-Bench"; summary matched "SWE-bench"; has PDF; has rich summary
   - 分类：cs.SE, cs.MA
   - 标签：评测 / 应用 / 方法
   - 主题词：LLM / Benchmark
   - PDF：https://arxiv.org/pdf/2606.06324
   - 摘要：LLM-based agents increasingly rely on harnesses that provide execution environments, tool interfaces, context, lifecycle orchestration, observability, verification, and governance. Existing self-improving agents and automatic harness evolution methods mainly improve agents through runtime supervision, prompt optimization, workflow search, or harness modification based on final outcomes. However, they often fail to diagnose where the responsible evidence lies in failed trajectories and which harness layer causes the unreliable behavior, resulting in broad, indirect, or poorly scoped changes. This paper proposes HarnessFix, a trace-guided framework for diagnosing agent failures and repairing agent harnesses. HarnessFix compiles raw execution traces and harness code into a Harness-aware Trace Intermediate Representation (HTIR), which normalizes fragmented trajectory evidence and captures step-level provenance and control-flow relations. It then attributes failures to responsible trajectory steps and harness layers, consolidates recurring diagnoses into actionable flaw records, and maps them to scoped repair operators. Finally, HarnessFix generates and validates harness patches under flaw-specific repair specifications to reduce target flaws without introducing unacceptable regressions. We evaluate HarnessFix on SWE-Bench Verified, Terminal-Bench 2.0 Verified, GAIA and AppWorld. Across these benchmarks, HarnessFix improves held-out test performance over the initial harnesses by 15.2%--50.0%, outperforms human-designed and self-evolution baselines, and reveals recurring harness-flaw patterns across ETCLOVG layers.

6. [RAT: RunAnyThing via Fully Automated Environment Configuration](https://arxiv.org/abs/2604.23190)
   - Published：2026-06-05 12:00
   - 作者：Renhong Huang，Dongdong Hua，Yifei Sun，Sitao Ding，Hanyang Yuan，Daixin Wang 等
   - 来源：arxiv
   - 相关性分数：76
   - 命中原因：summary matched "code agent"; summary matched "repository-level"; has PDF; has rich summary
   - 分类：cs.SE, cs.AI
   - 标签：评测 / 方法
   - 主题词：Benchmark / Agent
   - PDF：https://arxiv.org/pdf/2604.23190
   - 摘要：Automating repository-level software engineering tasks is a foundational challenge for autonomous code agents, largely due to the difficulty of configuring executable environments. However, manual configuration remains a labor-intensive bottleneck, necessitating a transition toward fully automated environment configuration. Existing approaches often rely on pre-defined artifacts or are restricted to specific programming languages, limiting their applicability to diverse real-world repositories. In this paper, we first propose RAT (RunAnyThing), a modular and extensible agent framework for fully automated configuration across programming languages on arbitrary repositories. RAT adopts a multi-stage pipeline that integrates language-aware abstraction, image initialization, specialized configuration toolset, and robust sandbox. Furthermore, to enable rigorous evaluation, we propose RATBench, a benchmark reflects the comprehensive coverage of real-world repositories. Extensive experiments demonstrate that RAT achieves state-of-the-art performance, improving Environment Setup Success Rate (ESSR) by an average of 36.1% over strong baselines.

7. [Closing the Loop on Latent Reasoning via Test-Time Reconstruction](https://arxiv.org/abs/2606.06252)
   - Published：2026-06-05 12:00
   - 作者：Xiaopeng Yuan，Haibo Jin，Ye Yu，Peng Kuang，Lijun Yu，Yushun Dong 等
   - 来源：arxiv
   - 相关性分数：58
   - 命中原因：summary matched "code generation benchmark"; has PDF; has rich summary; has complete metadata
   - 分类：cs.AI
   - 标签：评测 / 方法
   - 主题词：Benchmark / Reasoning
   - PDF：https://arxiv.org/pdf/2606.06252
   - 摘要：Recent work moves intermediate reasoning from natural-language traces into latent or cache-level representations to reduce token overhead and avoid a discrete communication bottleneck. However, this shift also removes a key advantage of textual reasoning: intermediate states are no longer inspectable, making it difficult to determine whether a latent state still preserves the constraints of the original query. As a result, latent reasoning typically operates in an open loop, where a latent state is produced and consumed without an input-anchored fidelity check. We propose ReLAT (Reconstruction-Guided Latent Reasoning At Test Time), a self-supervised test-time training method that closes this loop using the query itself as the reference. Our key observation is that if a latent state faithfully represents a query, the query should be recoverable from it; if the query cannot be recovered, the latent state has lost task-relevant information. ReLAT operationalizes this principle by constructing a differentiable Question -> Latent Thought -> Question cycle and optimizing query reconstruction loss through the latent thought before answer generation. This anchors opaque latent computation to the problem specification it is supposed to represent. Across mathematical reasoning, knowledge QA, and code generation benchmarks on the Qwen family, ReLAT consistently improves over single-model inference, text-based collaboration, open-loop latent collaboration, and alternative test-time training objectives. On Qwen3-8B, ReLAT raises AIME 2024 accuracy from 56.7% to 73.3%, a 16.6-point gain over the strongest open-loop latent baseline.

8. [Code2LoRA: Hypernetwork-Generated Adapters for Code Language Models under Software Evolution](https://arxiv.org/abs/2606.06492)
   - Published：2026-06-05 12:00
   - 作者：Liliana Hotsko，Yinxi Li，Yuntian Deng，Pengyu Nie
   - 来源：arxiv
   - 相关性分数：58
   - 命中原因：summary matched "repository-level"; has PDF; has rich summary; has complete metadata
   - 分类：cs.SE, cs.AI, cs.CL
   - 标签：评测 / 数据 / 方法
   - 主题词：Language Model / Benchmark
   - PDF：https://arxiv.org/pdf/2606.06492
   - 摘要：Code language models need repository-level context to resolve imports, APIs, and project conventions. Existing methods inject this knowledge as long inputs (retrieved through RAG or dependency analysis) or through per-repository fine-tuning and LoRA -- costly at repository scale and brittle to evolving codebases. We introduce Code2LoRA, a hypernetwork framework that generates repository-specific LoRA adapters, effectively injecting repository knowledge with zero inference-time token overhead. Code2LoRA supports two usage scenarios: Code2LoRA-Static converts a single repository snapshot into an adapter, suitable for comprehension of stable codebases; while Code2LoRA-Evo maintains an adapter backed by a GRU hidden state updated per code diff, suitable for active development of evolving codebases. To evaluate Code2LoRA against parameter-efficient fine-tuning baselines, we build RepoPeftBench, a benchmark of 604 Python repositories with two tracks: a static track with 40K training and 12K test assertion-completion tasks, and an evolution track with 215K commit-derived training and 87K commit-derived test tasks. On the static track, Code2LoRA-Static achieves 63.8% cross-repo and 66.2% in-repo exact match, matching the per-repository LoRA upper bound; on the evolution track, Code2LoRA-Evo achieves 60.3% cross-repo exact match (+5.2 pp over a single shared LoRA). Code2LoRA's code can be found at https://anonymous.4open.science/r/code2lora-6857; the model checkpoints and RepoPeftBench datasets can be found at https://huggingface.co/code2lora.

9. [Coding with "Enemy": Can Human Developers Detect AI Agent Sabotage?](https://arxiv.org/abs/2606.05647)
   - Published：2026-06-05 12:00
   - 作者：Jingheng Ye，Huiqi Zou，Simon Yu，Weiyan Shi
   - 来源：arxiv
   - 相关性分数：58
   - 命中原因：summary matched "coding agent"; has PDF; has rich summary; has complete metadata
   - 分类：cs.AI, cs.CL, cs.CY, cs.HC
   - 标签：应用 / 方法
   - 主题词：Agent / Coding Agent
   - PDF：https://arxiv.org/pdf/2606.05647
   - 摘要：AI coding agents are increasingly embedded in real-world software development, collaborating with human developers while gaining broader access to codebases and tools. This creates a new attack surface: an agent can exploit human trust to sabotage development, for instance by inserting malicious code to accomplish a hidden side task. Most prior work studies AI sabotage in AI-only settings, paying limited attention to the role of human oversight in detecting and mitigating such malicious behavior. To address this gap, we conduct the first large-scale study of human oversight in AI coding sabotage. Over 100 participants collaborate with one of four frontier models (Claude-Opus-4.6, GPT-5.4, Gemini-3.1-Pro, and MiniMax-M2.7) on a long-horizon coding task lasting around five hours, designed to mimic real-world workflows. We find that 94% of developers fail to detect sabotage, and our analysis of participant feedback attributes this vulnerability to minimal code review, plausible cover story, and overtrust in agents. We further test the effectiveness of a safety monitor in one condition: while the monitor reduces sabotage success, 56% of participants still accept the malicious code, ignoring its warnings. Drawing on participant feedback, we offer actionable suggestions for better monitor design. This work complements existing AI safety research and highlights an urgent need for human-centric safety mechanisms that account for human factors, particularly in long-horizon, real-world development settings.

10. [Converted, Not Equivalent: Benchmarking Codebase Conversion via Observational Equivalence](https://arxiv.org/abs/2605.29054)
   - Published：2026-06-05 12:00
   - 作者：Linxin Song，Jiefeng Chen，Yue Huang，Bhavana Dalvi Mishra，Chi Wang，Jieyu Zhao 等
   - 来源：arxiv
   - 相关性分数：58
   - 命中原因：summary matched "coding agent"; has PDF; has rich summary; has complete metadata
   - 分类：cs.SE, cs.CL
   - 标签：评测 / 方法
   - 主题词：Benchmark / Agent
   - PDF：https://arxiv.org/pdf/2605.29054
   - 摘要：Coding agents increasingly act as codebase-scale collaborators that can assist with codebase conversion, but this progress has exposed a critical weakness: agents often over-trust their own local validation routines and declare success on artifacts that satisfy surface checks while violating the semantic contracts users actually care about. This problem is especially acute in codebase conversion, where prior evaluation is largely outcome-driven and therefore unstable: two implementations can match on a shallow outcome, such as a single forward loss, while diverging in gradients, optimizer behavior, or short-horizon training dynamics. We introduce T2J-Bench, a benchmark for codebase conversion that reformulates conversion as transfer under a fixed equivalence contract. A fixed verifier then compares source and converted codebases through three ordered stages: Spec (interface admissibility), Numeric (forward outputs, losses, gradients, and objective-specific tensors), and Behavioral (short training dynamics under fixed seeds). Across 355 blind conversion attempts, the best system reaches only 26.7--28.9% overall pass rate despite Spec pass rates up to 91.1%; a 4.7x token-budget spread yields only a 2.2x pass-rate spread; and all systems overestimate success by 66.6--97.8 points relative to the fixed evaluator. This suggests that failures stem more from contract-misaligned self-validation than from limited budget or backbone strength.