# Daily Paper Digest

- Generated at: 2026-06-04 14:02:06 (Asia/Shanghai)
- Search window: last 24 hours
- Hit overview: LM=15, Agent Runtime Security=6, Terminal and SWE Agents=6
- Ranking strategy: hybrid (relevance first, published_at tie-break)
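
The hybrid ranking strategy listed above can be sketched as a two-key sort. This is a minimal illustration, not the digest's actual implementation: the `Paper` record, the `rank` function, and the choice to place newer papers first on ties are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List


@dataclass(frozen=True)
class Paper:
    # Hypothetical record; field names mirror the digest's labels.
    title: str
    relevance: int
    published_at: datetime


def rank(papers: List[Paper]) -> List[Paper]:
    """Order by relevance (descending); break ties by published_at (newest first, assumed)."""
    # Python's sort is stable, so sorting by the secondary key first and
    # the primary key second yields the combined ordering.
    by_date = sorted(papers, key=lambda p: p.published_at, reverse=True)
    return sorted(by_date, key=lambda p: p.relevance, reverse=True)


if __name__ == "__main__":
    sample = [
        Paper("b", 173, datetime(2026, 6, 4, 12, 0)),
        Paper("c", 191, datetime(2026, 6, 3, 12, 0)),
        Paper("a", 191, datetime(2026, 6, 4, 12, 0)),
    ]
    print([p.title for p in rank(sample)])
```

When all entries share the same publication timestamp, as the LM entries in this digest do, the ordering is driven by the relevance score alone.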

## Today's Highlights

- Topic "LLM": 22 hits, covering LM, Agent Runtime Security, and others; representative papers include "A Systematic Evaluation of Positional Bias in Multi-Video Summarization with MLLMs" and "Evaluating Large Language Models in Dynamic Clinical Decision-Making with Standardized Patient Cases".
- Topic "Benchmark": 19 hits, covering LM, Agent Runtime Security, and others; representative papers include "A Systematic Evaluation of Positional Bias in Multi-Video Summarization with MLLMs" and "Evaluating Large Language Models in Dynamic Clinical Decision-Making with Standardized Patient Cases".
- Topic "Agent": 7 hits, covering LM, Agent Runtime Security, and others; representative papers include "Deliberate Evolution: Agentic Reasoning for Sample-Efficient Symbolic Regression with LLMs" and "Exploring the Topology and Memory of Consensus: How LLM Agents Agree, Fragment, or Settle When Forming Conventions".
- Topic "Language Model": 2 hits, covering LM; representative papers include "Large Language Models in K-12 Education: Alignment with State Curriculum Standards and Student Personas" and "GRAIL: Gradient-Reweighted Advantages for Reinforcement Learning with Verifiable Rewards".
- Topic "RAG": 1 hit, covering Terminal and SWE Agents; representative paper: "Latent Anchor-Driven Test Generation for Deep Neural Networks".

## Section Status

- LM: 15 papers
- Agent Runtime Security: 6 papers
- Terminal and SWE Agents: 6 papers

## Topic Focus

### LLM

- Hits: 22
- Groups covered: LM, Agent Runtime Security, Terminal and SWE Agents
- Representative papers: "A Systematic Evaluation of Positional Bias in Multi-Video Summarization with MLLMs", "Evaluating Large Language Models in Dynamic Clinical Decision-Making with Standardized Patient Cases", "Self-Evolving Deep Research via Joint Generation and Evaluation"
- Quick read:
  - "A Systematic Evaluation of Positional Bias in Multi-Video Summarization with MLLMs" (Evaluation / Application / Method): Multimodal Large Language Models (MLLMs) are increasingly used for video understanding, yet their reliability under multi-video inputs remains poorly understoo…
  - "Evaluating Large Language Models in Dynamic Clinical Decision-Making with Standardized Patient Cases" (Evaluation / Application / Method): Large language models (LLMs) are increasingly proposed as clinical agents, yet static, single-turn benchmarks cannot capture how a model dynamically delivers c…

### Benchmark

- Hits: 19
- Groups covered: LM, Agent Runtime Security, Terminal and SWE Agents
- Representative papers: "A Systematic Evaluation of Positional Bias in Multi-Video Summarization with MLLMs", "Evaluating Large Language Models in Dynamic Clinical Decision-Making with Standardized Patient Cases", "Self-Evolving Deep Research via Joint Generation and Evaluation"
- Quick read:
  - "A Systematic Evaluation of Positional Bias in Multi-Video Summarization with MLLMs" (Evaluation / Application / Method): Multimodal Large Language Models (MLLMs) are increasingly used for video understanding, yet their reliability under multi-video inputs remains poorly understoo…
  - "Evaluating Large Language Models in Dynamic Clinical Decision-Making with Standardized Patient Cases" (Evaluation / Application / Method): Large language models (LLMs) are increasingly proposed as clinical agents, yet static, single-turn benchmarks cannot capture how a model dynamically delivers c…

### Agent

- Hits: 7
- Groups covered: LM, Agent Runtime Security, Terminal and SWE Agents
- Representative papers: "Deliberate Evolution: Agentic Reasoning for Sample-Efficient Symbolic Regression with LLMs", "Exploring the Topology and Memory of Consensus: How LLM Agents Agree, Fragment, or Settle When Forming Conventions", "AgentJet: A Flexible Swarm Training Framework for Agentic Reinforcement Learning"
- Quick read:
  - "Deliberate Evolution: Agentic Reasoning for Sample-Efficient Symbolic Regression with LLMs" (Application / Method): Symbolic regression (SR) discovers compact mathematical expressions from data, yet recent LLM-based evolutionary methods remain sample-inefficient because they…
  - "Exploring the Topology and Memory of Consensus: How LLM Agents Agree, Fragment, or Settle When Forming Conventions" (Application / Method): How much should an LLM agent remember, and how should multi-agent systems be connected when trying to reach consensus? We show these two design choices interac…

### Language Model

- Hits: 2
- Groups covered: LM
- Representative papers: "Large Language Models in K-12 Education: Alignment with State Curriculum Standards and Student Personas", "GRAIL: Gradient-Reweighted Advantages for Reinforcement Learning with Verifiable Rewards"
- Quick read:
  - "Large Language Models in K-12 Education: Alignment with State Curriculum Standards and Student Personas" (Method): As Large Language Models (LLMs) become increasingly popular in educational settings, they raise important questions about the ethical implications of their use…
  - "GRAIL: Gradient-Reweighted Advantages for Reinforcement Learning with Verifiable Rewards" (Evaluation / Method): Reinforcement learning with verifiable rewards (e.g. GRPO) is now a common way to improve mathematical reasoning in Large Language Models (LLMs). However, curr…

### RAG

- Hits: 1
- Groups covered: Terminal and SWE Agents
- Representative papers: "Latent Anchor-Driven Test Generation for Deep Neural Networks"
- Quick read:
  - "Latent Anchor-Driven Test Generation for Deep Neural Networks" (Data / Application / Method): Deep Neural Networks (DNNs) are increasingly being deployed in security-critical and safety-sensitive applications, which makes rigorous testing essential to i…
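
This digest does not describe how topic keywords are matched. One entry in the LM section records `title matched "RAG"` for a title in which the letters appear only inside the word "Fragment", which is consistent with case-insensitive substring matching. The sketch below contrasts that behavior with whole-word matching; both helper functions are hypothetical and are not taken from the digest's pipeline.

```python
import re


def matches_substring(keyword: str, text: str) -> bool:
    """Case-insensitive substring test (assumed current behavior)."""
    return keyword.lower() in text.lower()


def matches_whole_word(keyword: str, text: str) -> bool:
    """Case-insensitive match that requires word boundaries around the keyword."""
    pattern = r"\b" + re.escape(keyword) + r"\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


if __name__ == "__main__":
    title = (
        "Exploring the Topology and Memory of Consensus: How LLM Agents "
        "Agree, Fragment, or Settle When Forming Conventions"
    )
    print(matches_substring("RAG", title))   # substring hit inside "Fragment"
    print(matches_whole_word("RAG", title))  # no standalone "RAG" token
```

Whole-word matching is stricter in both directions: it also stops `agent` from matching "Agents", so plural forms would need to be listed as separate keywords.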

## LM Watch

### Group at a Glance

- "A Systematic Evaluation of Positional Bias in Multi-Video Summarization with MLLMs" (Evaluation / Application / Method): Multimodal Large Language Models (MLLMs) are increasingly used for video understanding, yet their reliability under multi-video inputs remains poorly understoo…
- "Evaluating Large Language Models in Dynamic Clinical Decision-Making with Standardized Patient Cases" (Evaluation / Application / Method): Large language models (LLMs) are increasingly proposed as clinical agents, yet static, single-turn benchmarks cannot capture how a model dynamically delivers c…
- "Self-Evolving Deep Research via Joint Generation and Evaluation" (Evaluation / Application / Method): Large Language Models (LLMs) have become increasingly adopted in daily applications, with deep research standing out as a particularly important capability. Un…
- "Agent Planning Benchmark: A Diagnostic Framework for Planning Capabilities in LLM Agents" (Evaluation / Method): Planning is central to LLM agents: before acting, an agent must decompose goals, select tools, reason over constraints, and decide when a task is infeasible. Y…
- "Large Language Models in K-12 Education: Alignment with State Curriculum Standards and Student Personas" (Method): As Large Language Models (LLMs) become increasingly popular in educational settings, they raise important questions about the ethical implications of their use…

### Paper Overview

1. [A Systematic Evaluation of Positional Bias in Multi-Video Summarization with MLLMs](https://arxiv.org/abs/2606.04596)
   - Published: 2026-06-04 12:00
   - Authors: Huangchen Xu, Yuan Wu, Yi Chang
   - Source: arxiv
   - Relevance score: 191
   - Match reasons: title matched "LLM"; title matched "evaluation"; summary matched "language model"; summary matched "large language model"
   - Categories: cs.CL
   - Tags: Evaluation / Application / Method
   - Topics: LLM / Benchmark
   - PDF: https://arxiv.org/pdf/2606.04596
   - Abstract: Multimodal Large Language Models (MLLMs) are increasingly used for video understanding, yet their reliability under multi-video inputs remains poorly understood. We study positional bias in multi-video summarization, where the quality of a per-video summary can change with the video's input slot even when the underlying content is unchanged. We construct a benchmark from ActivityNet and News videos, covering Cooking, Domestic, Leisure, and News settings with two- and four-video inputs. We evaluate nine open-source and proprietary MLLMs and measure position effects with three complementary metrics: Coverage, Directional Positional Bias (DPB), and Middle-Edge Gap (MEG). Our results show that positional effects are domain- and model-dependent: signed directional bias can be small even when middle positions underperform, and increasing visual or generation budget does not uniformly remove the imbalance. We further analyze prompt-level mitigation methods. Together, the results show that multi-video summarization remains sensitive to input protocol and position, motivating more robust order-invariant multimodal systems.

2. [Evaluating Large Language Models in Dynamic Clinical Decision-Making with Standardized Patient Cases](https://arxiv.org/abs/2606.05112)
   - Published: 2026-06-04 12:00
   - Authors: Cheng Liang, Pengcheng Qiu, Ya Zhang, Yanfeng Wang, Chaoyi Wu, Weidi Xie
   - Source: arxiv
   - Relevance score: 191
   - Match reasons: title matched "language model"; title matched "large language model"; summary matched "LLM"; summary matched "agent"
   - Categories: cs.CL
   - Tags: Evaluation / Application / Method
   - Topics: LLM / Benchmark
   - PDF: https://arxiv.org/pdf/2606.05112
   - Abstract: Large language models (LLMs) are increasingly proposed as clinical agents, yet static, single-turn benchmarks cannot capture how a model dynamically delivers care across an encounter: gathering information, planning treatment, and adapting longitudinal management across successive patient states. Medical education has long addressed an analogous challenge through standardized patients (SPs): trained actors who consistently portray clinical cases, enabling realistic practice and objective, scripted assessment. Here we introduce MedSP1000, an SP-derived interactive benchmark for clinical-agent evaluation, including 1,638 SP cases with 24,602 trajectory-level peer-reviewed rubrics. MedSP1000 converts peer-reviewed SP teaching cases into executable scenarios with defined SP case scripts, clinical environment contexts, and human-validated structured rubric. In each simulation evaluation run, a clinical agent interacts in closed loop with a patient agent and an environment controller, and its behaviour is scored throughout the encounter against expert criteria specified in the original materials. Applying MedSP1000 to a range of general-purpose and medically specialized LLMs, we find that performance on static benchmarks does not reliably translate to such educational scenarios. The best-performing model, GPT-5.5, completes only 60.4% of expert-defined rubric items, whereas the strongest medically specialized model reaches 40.0%; increasing test-time compute produces no measurable gain. These results suggest that current LLMs, including agentic systems tuned for medicine, are not yet reliable enough to be safely integrated into actual clinical practice. More broadly, MedSP1000 shows how process-level, SP-style evaluation can reveal clinically relevant failure modes that single-turn benchmarks miss.

3. [Self-Evolving Deep Research via Joint Generation and Evaluation](https://arxiv.org/abs/2606.04507)
   - Published: 2026-06-04 12:00
   - Authors: Han Zhu, Chengkun Cai, Yuanfeng Song, Xing Chen, Sirui Han, Yike Guo
   - Source: arxiv
   - Relevance score: 187
   - Match reasons: title matched "evaluation"; summary matched "language model"; summary matched "large language model"; summary matched "LLM"
   - Categories: cs.CL, cs.AI
   - Tags: Evaluation / Application / Method
   - Topics: LLM / Benchmark
   - PDF: https://arxiv.org/pdf/2606.04507
   - Abstract: Large Language Models (LLMs) have become increasingly adopted in daily applications, with deep research standing out as a particularly important capability. Unlike traditional question-answering (QA) tasks, deep research report generation lacks definitive ground-truth, making reward design inherently unverifiable and limiting effective reinforcement learning. Existing approaches mitigate this challenge with LLM-as-a-judge and query-dependent evaluation rubrics, but they still rely on static evaluators that cannot adapt their standards as the solver improves, leading to insufficient and eventually saturated optimization pressure. We address this limitation with a **s**elf-evolving **co**-evolutionary training framework for deep **re**search evaluation and generation (SCORE), which tightly couples an evaluator and a solver in a shared-parameter learning process. Rather than treating generation and evaluation as isolated modules, we leverage their intrinsic connection to enable joint improvement within a single shared-parameter model. To restrict this process, we introduce a meta-harness, which dynamically controls the evaluation environment based on solver performance, encouraging valid evaluation dimensions and sufficiently deep evaluator search. Extensive experiments on deep research benchmarks demonstrate consistent improvement in report generation quality, showing that co-evolving evaluation and generation is a promising direction for training open-ended research agents.

4. [Agent Planning Benchmark: A Diagnostic Framework for Planning Capabilities in LLM Agents](https://arxiv.org/abs/2606.04874)
   - Published: 2026-06-04 12:00
   - Authors: Haoyu Sun, Wenxuan Wang, Mingyang Song, Jujie He, Weinan Zhang, Yang Liu, et al.
   - Source: arxiv
   - Relevance score: 177
   - Match reasons: title matched "LLM"; title matched "agent"; title matched "benchmark"; summary matched "evaluation"
   - Categories: cs.CL
   - Tags: Evaluation / Method
   - Topics: LLM / Benchmark
   - PDF: https://arxiv.org/pdf/2606.04874
   - Abstract: Planning is central to LLM agents: before acting, an agent must decompose goals, select tools, reason over constraints, and decide when a task is infeasible. Yet existing agent evaluations often report only end-to-end success, making it difficult to determine whether failures stem from planning or execution. We introduce **Agent Planning Benchmark (APB)**, a planning-specific diagnostic benchmark with 4,209 multimodal cases across 22 domains and five settings, covering holistic planning, feedback-conditioned step-wise planning, and robustness under extraneous tools, broken tools, and unsolvable tasks. Across 12 MLLMs, APB reveals systematic weaknesses in long-horizon planning, tool-noise robustness, calibrated refusal, and inference-time refinement. We further validate APB on 200 ToolSandbox tasks and 200 $\tau^2$-bench tasks, where APB-guided refinement consistently improves plan correctness, plan grade, and downstream execution metrics across three representative models. APB thus serves as an upstream diagnostic complement to execution benchmarks.

5. [Large Language Models in K-12 Education: Alignment with State Curriculum Standards and Student Personas](https://arxiv.org/abs/2606.04846)
   - Published: 2026-06-04 12:00
   - Authors: Lisa Korver, Tomo Lazovich, Sherief Reda
   - Source: arxiv
   - Relevance score: 177
   - Match reasons: title matched "language model"; title matched "large language model"; title matched "alignment"; summary matched "LLM"
   - Categories: cs.CL
   - Tags: Method
   - Topics: LLM / Language Model
   - PDF: https://arxiv.org/pdf/2606.04846
   - Abstract: As Large Language Models (LLMs) become increasingly popular in educational settings, they raise important questions about the ethical implications of their use. Publicly available online chatbots are quickly improving in capability and accuracy leading to more widespread use, including among students looking for help with their homework. This makes it crucial to consider whether these models are aligned with educational standards. Because curriculum standards in the United States are set at the state level, they differ significantly in required content, emphasis, and narrative focus. In this work, we develop an LLM-based pipeline to identify variations in U.S. History curricula across states and evaluate the extent to which different LLMs reflect these state-specific curricular differences. In addition, we conduct controlled experiments that vary user personas by stating user attributes such as geographic location, grade level, gender and race to evaluate the sensitivity of LLM responses to user characteristics. We find that while models are able to adjust their presentation of historical topics, these shifts may come from the perceived political leanings of states and do not necessarily reflect actual curriculum content. Additionally, models successfully adapt to a student's grade level while showing minimal sensitivity to race or gender, suggesting they are capable of useful adaptation to student personas with limited demographic bias. Together, these findings highlight potential risks that open access to LLM chatbots may cause to student learning outcomes stemming from misalignment with state curriculum standards and highlight the need for more robust alignment techniques.

6. [Dynamic Infilling Anchors for Format-Constrained Generation in Diffusion Large Language Models](https://arxiv.org/abs/2606.04535)
   - Published: 2026-06-04 12:00
   - Authors: Boyan Han, Yiwei Wang, Yi Song, Yujun Cai, Chi Zhang
   - Source: arxiv
   - Relevance score: 173
   - Match reasons: title matched "language model"; title matched "large language model"; summary matched "LLM"; summary matched "reasoning"
   - Categories: cs.CL, cs.AI
   - Tags: Evaluation / Method
   - Topics: LLM / Benchmark
   - PDF: https://arxiv.org/pdf/2606.04535
   - Abstract: Diffusion large language models (dLLMs) offer bidirectional attention and parallel generation, enabling them to exploit global context and naturally support format-constrained tasks like parseable JSON or reasoning templates. While straightforward fixed anchors can enforce such constraints, they often impose rigid spans, leading to truncated reasoning or redundant content. To overcome this, we propose Dynamic Infilling Anchors (DIA), a training-free method that dynamically estimates end-anchor positions to adjust generation length before iterative infilling. This flexible mechanism ensures structural correctness and semantic coherence, avoiding the inefficiencies of fixed-span methods. Experiments on reasoning benchmarks demonstrate that DIA substantially improves format compliance and answer accuracy, achieving significant zero-shot gains on GSM8K and MATH. These results establish DIA as a robust pathway toward reliable, structure-aware generation.

7. [Imbuing Large Language Models with Bidirectional Logic for Robust Chain Repair](https://arxiv.org/abs/2606.05030)
   - Published: 2026-06-04 12:00
   - Authors: Zehua Cheng, Wei Dai, Jiahao Sun, Thomas Lukasiewicz
   - Source: arxiv
   - Relevance score: 173
   - Match reasons: title matched "language model"; title matched "large language model"; summary matched "LLM"; summary matched "reasoning"
   - Categories: cs.CL, cs.SC
   - Tags: Evaluation / Method
   - Topics: LLM / Benchmark
   - PDF: https://arxiv.org/pdf/2606.05030
   - Abstract: Autoregressive chain-of-thought (CoT) reasoning in large language models (LLMs) is fundamentally forward-directed: each step conditions only on prior tokens. This unidirectional inductive bias renders even capable models susceptible to error snowballing, wherein a single logical or arithmetic mistake in an early step irreversibly corrupts the entire reasoning chain. We introduce Teleological Reasoning Infilling (TRI), a training framework that endows decoder-only transformers with a native *goal-conditioned bridging* capability. The key insight is to reframe erroneous reasoning segments as fill-in-the-middle (FIM) tasks: given a verified prefix premise $P$, a verified downstream milestone $S$, and the original query $Q$, the model must synthesise the logical bridge $M$ that connects $P$ to $S$ rigorously and completely. To achieve this with standard causal architectures, we introduce a Prefix-Suffix-Middle (PSM) sequence rearrangement with three non-overlapping sentinel tokens, enabling $M$ to attend to both $P$ and $S$ without any structural modification to the self-attention mechanism. Training proceeds in two stages: (i) Supervised Fine-Tuning (SFT) on symbolically verified $(P, S, M)$ triples extracted from formal mathematics corpora, and (ii) Direct Preference Optimisation (DPO) with a deterministic symbolic verifier (Lean 4 / Python) as the sole reward oracle, eliminating LLM-judge sycophancy. At inference, TRI operates as a surgical repair module within a dual-system loop: a causal draft model generates an initial trace, the verifier pinpoints failures, and TRI infills only the damaged segment, leaving verified sections intact. Comprehensive experiments on three benchmarks demonstrate that TRI achieves state-of-the-art performance across all tasks, while reducing per-problem token expenditure by 31.2%.

8. [Self-Evaluation Is Already There: Eliciting Latent Judge Calibration in Base LLMs with Minimal Data](https://arxiv.org/abs/2606.05122)
   - Published: 2026-06-04 12:00
   - Authors: XiuYu Zhang, Yi Shan, Junfeng Fang, Zhenkai Liang
   - Source: arxiv
   - Relevance score: 173
   - Match reasons: title matched "LLM"; title matched "evaluation"; summary matched "language model"; summary matched "large language model"
   - Categories: cs.CL
   - Tags: Evaluation / Method
   - Topics: LLM / Benchmark
   - PDF: https://arxiv.org/pdf/2606.05122
   - Abstract: Large language models are increasingly evaluated by other models, raising a natural question: can a model predict how a judge will score its own output? We find that the ability is largely present before any targeted training: prompted few-shot, a base model already predicts an external judge's multi-attribute quality scores on open-ended responses well above chance across three benchmarks. We introduce Self-Evaluation Elicitation (SEE), a method that surfaces this latent ability through a short cycle comprising a calibration-coupled reinforcement learning phase that improves the answer and predicts the judge, followed by a masked distillation phase that sharpens the prediction while leaving the answer untouched. From 160 unique examples, roughly 31x fewer than a reinforcement learning baseline, SEE improves held-out calibration across three benchmarks while preserving answer quality. The elicited self-evaluation is sharply localized within the model's own token distribution and stable across judges it was never trained against, indicating a transferable notion of quality rather than a single judge's preference. These results reframe judge-aligned self-evaluation as a problem of elicitation rather than acquisition.

9. [Stepwise Reasoning Enhancement for LLMs via External Subgraph Generation](https://arxiv.org/abs/2606.04454)
   - Published: 2026-06-04 12:00
   - Authors: Xin Zhang, Yang Cao, Baoxing Wu, Kai Song, Siying Li
   - Source: arxiv
   - Relevance score: 173
   - Match reasons: title matched "LLM"; title matched "reasoning"; summary matched "language model"; summary matched "large language model"
   - Categories: cs.CL
   - Tags: Evaluation / Data / Method
   - Topics: LLM / Benchmark
   - PDF: https://arxiv.org/pdf/2606.04454
   - Abstract: Large language models have shown strong performance in natural language generation and downstream reasoning tasks, but they still struggle with logical consistency, factual grounding, and interpretability in complex multi-step reasoning. To address these limitations, this paper proposes SGR, a stepwise reasoning enhancement framework that integrates large language models with external knowledge graphs through query-relevant subgraph generation. Given an input question, SGR first extracts key entities, relations, and constraints to construct a structured schema, then retrieves compact subgraphs from a knowledge graph using schema-guided querying. The generated subgraphs provide explicit relational evidence that guides the language model through step-by-step reasoning. In addition, SGR combines direct Cypher-based reasoning with collaborative reasoning integration, allowing candidate answers from multiple reasoning paths to be validated and aggregated according to both model confidence and graph consistency. Experiments on benchmark datasets including CWQ, WebQSP, GrailQA, and KQA Pro demonstrate that SGR improves reasoning accuracy and Hits@1 performance over standard prompting and several knowledge-enhanced baselines. Ablation studies further show that schema guidance and Neo4j-based retrieval are both crucial to the effectiveness of the framework. These results indicate that dynamically generated external subgraphs can improve the accuracy, robustness, and interpretability of LLM-based reasoning.

10. [Caliper: Probing Lexical Anchors versus Causal Structure in LLMs](https://arxiv.org/abs/2606.04915)
    - Published: 2026-06-04 12:00
    - Authors: Zhenyu Yu, Shuigeng Zhou
    - Source: arxiv
    - Relevance score: 169
    - Match reasons: title matched "LLM"; summary matched "language model"; summary matched "large language model"; summary matched "reasoning"
    - Categories: cs.CL, cs.IR
    - Tags: Evaluation / Method
    - Topics: LLM / Benchmark
    - PDF: https://arxiv.org/pdf/2606.04915
    - Abstract: Large language models reach 50 to 70% accuracy on causal reasoning benchmarks such as CLadder, but it is unclear whether this reflects structural reasoning or lexical pattern matching. We introduce Caliper, a controlled perturbation that replaces semantic variable names with placeholder tokens while preserving the causal graph and probabilistic specification of each question. Across nine instruction-tuned LLMs from 3.8B to 671B and three causal reasoning benchmarks, lexical anonymization yields robust accuracy drops of +7.6, +27.0, and +11.1 pp on a local 3.8B-14B set, rising to +29.6 and +18.0 pp on CRASS and e-CARE across nine frontier models spanning the 2024-2026 generations. Of 40 engaged model-by-benchmark cells, 39 show a positive gap, and the gap collapses by 17x on CLadder's pseudoword subset. Structured scaffolding and few-shot in-context learning each narrow the gap, but mainly by lowering P0 accuracy on smaller models rather than recovering P1. Current instruction-tuned LLMs, evaluated zero-shot, show little evidence of structural causal reasoning once lexical anchors are removed.

11. [Can I Take Another Dose? Evaluating LLM Decision-Making Under Temporal Uncertainty in OTC Dosing QA](https://arxiv.org/abs/2606.04262)
    - Published: 2026-06-04 12:00
    - Authors: Maroof Kousar, Yibo Hu
    - Source: arxiv
    - Relevance score: 169
    - Match reasons: title matched "LLM"; summary matched "language model"; summary matched "large language model"; summary matched "reasoning"
    - Categories: cs.CL, cs.AI
    - Tags: Evaluation / Application / Method
    - Topics: LLM / Benchmark
    - PDF: https://arxiv.org/pdf/2606.04262
    - Abstract: Large language models (LLMs) are increasingly used for everyday health questions, including whether a user can safely take another dose of an over-the-counter (OTC) medication. Yet this common safety-relevant setting remains underexplored in existing medical QA evaluations, where correct answers require tracking dose timing, computing rolling 24-hour intake, following product-label constraints, and handling incomplete medication histories. We introduce DOSEBENCH, a focused benchmark of 81 curated OTC dosing scenarios focused on adult acetaminophen and ibuprofen use, with manually annotated gold references. We evaluate four LLMs across repeated runs using metrics for decision correctness, consistency, explanation verifiability, failure types, and confidence-related signals, resulting in 1,620 model responses. Our results show that models frequently struggle with rolling-window reasoning and ambiguity-sensitive cases and that stable or confident-looking responses can still violate dosing constraints. These findings suggest that OTC dosing QA provides a narrow yet practical testbed for evaluating temporal reasoning, constraint following, and safety-relevant uncertainty handling in medical QA.

12. [SMADE-IE: Sparse Multi-Agent Framework with Evidence-Driven Debate for Zero-Shot Information Extraction](https://arxiv.org/abs/2606.04691)
    - Published: 2026-06-04 12:00
    - Authors: Kenfeng Huang, Yi Cai, Xin Wu, Zikun Deng, Li Yuan
    - Source: arxiv
    - Relevance score: 169
    - Match reasons: title matched "agent"; summary matched "language model"; summary matched "large language model"; summary matched "LLM"
    - Categories: cs.CL
    - Tags: Evaluation / Data / Method
    - Topics: LLM / Benchmark
    - PDF: https://arxiv.org/pdf/2606.04691
    - Abstract: Zero-shot information extraction (IE) with large language models (LLMs) has attracted increasing attention due to its flexibility in adapting to new schemas and domains without task-specific training. Existing approaches mainly rely on monolithic prompting, each-type prompting, or multi-agent debate. However, monolithic prompting often suffers from boundary and type errors, while each-type prompting and multi-agent debate introduce cross-type conflicts, redundant agent interactions, and substantial token overhead. To address these challenges, we propose SMADE-IE, a sparse and evidence-driven multi-agent framework for zero-shot IE. SMADE-IE first employs an Adaptive Mode Selector to dynamically route inputs into either a lightweight Global Extraction Mode or a Type-Centric Extraction Mode, reducing unnecessary type selection and reasoning noise. For conflicting predictions, we further introduce an Evidence-Driven Debate mechanism that structures arguments into Toulmin-style components and performs confidence aggregation through external evidence scoring and Bayesian updates. Experimental results on 9 benchmark datasets across NER, RE, and JERE tasks show that SMADE-IE consistently outperforms existing zero-shot IE baselines while also improving token efficiency through sparse agent selection and early-stopping debate.

13. [GRAIL: Gradient-Reweighted Advantages for Reinforcement Learning with Verifiable Rewards](https://arxiv.org/abs/2606.04889)
    - Published: 2026-06-04 12:00
    - Authors: Tej Deep Pala, Vernon Toh, Soujanya Poria
    - Source: arxiv
    - Relevance score: 165
    - Match reasons: summary matched "language model"; summary matched "large language model"; summary matched "LLM"; summary matched "reasoning"
    - Categories: cs.CL
    - Tags: Evaluation / Method
    - Topics: LLM / Language Model
    - PDF: https://arxiv.org/pdf/2606.04889
    - Abstract: Reinforcement learning with verifiable rewards (e.g. GRPO) is now a common way to improve mathematical reasoning in Large Language Models (LLMs). However, current methods usually broadcast one sequence-level advantage to all tokens, or use costly process reward models (PRMs) for step-level supervision. Uniform advantage distribution assumes that all tokens contribute equally to the final reward. This dilutes the gradient signal, since flawed reasoning steps and filler words are updated as strongly as valid logical inferences. To address this, we introduce Gradient-Reweighted Advantage (GRAIL), an intrinsic token-wise advantage reweighting method. GRAIL uses gradient-activation saliency to place more weight on tokens that are more locally sensitive to the final answer. Evaluations across five models from the Qwen3, R1-distilled and OctoThinker families show that GRAIL consistently outperforms GRPO. GRAIL achieved an average improvement of 3.60% in accuracy and 3.05% in Pass@3, demonstrating that fine-grained reasoning alignment can be achieved without process-level supervision.

14. [Deliberate Evolution: Agentic Reasoning for Sample-Efficient Symbolic Regression with LLMs](https://arxiv.org/abs/2606.04360)
    - Published: 2026-06-04 12:00
    - Authors: Xinyu Pang, Zhanke Zhou, Xuan Li, Fangrui Lv, Shanshan Wei, Sen Cui, et al.
    - Source: arxiv
    - Relevance score: 159
    - Match reasons: title matched "LLM"; title matched "reasoning"; title matched "agent"; has PDF
    - Categories: cs.CL, cs.LG
    - Tags: Application / Method
    - Topics: LLM / Agent
    - PDF: https://arxiv.org/pdf/2606.04360
    - Abstract: Symbolic regression (SR) discovers compact mathematical expressions from data, yet recent LLM-based evolutionary methods remain sample-inefficient because they rely mainly on scalar feedback such as MSE. We identify a core limitation: existing methods conflate candidate proposal with search guidance, requiring the LLM to infer how to evolve an expression, diagnose its errors, and reuse past experience from a single score. To address this, we propose Deliberate Evolution (DE), an agentic framework that decouples symbolic generation from search control. DE guides LLM proposals with adaptive operators for search direction, analytical tools for structural diagnosis, and reflective memory for trajectory-level experience. Experiments on LLM-SRBench show that DE consistently outperforms representative LLM-based SR baselines across diverse scientific domains while using only 40% of the standard sample budget.

15. [Exploring the Topology and Memory of Consensus: How LLM Agents Agree, Fragment, or Settle When Forming Conventions](https://arxiv.org/abs/2606.04197)
   - Published: 2026-06-04 12:00
   - Authors: Aliakbar Mehdizadeh, Martin Hilbert
   - Source: arxiv
   - Relevance score: 159
   - Match reasons: title matched "LLM"; title matched "agent"; title matched "RAG"; has PDF
   - Categories: cs.MA, cs.CL, cs.SI, physics.soc-ph
   - Tags: Application / Method
   - Topics: LLM / Agent
   - PDF: https://arxiv.org/pdf/2606.04197
   - Abstract: How much should an LLM agent remember, and how should multi-agent systems be connected when trying to reach consensus? We show these two design choices interact in a way that flips the sign of memory's effect on coordination. Across 432 simulation runs of a networked Naming Game on eight fixed 16-agent topologies, we vary memory depth and network structure. Longer memory slows the time to reach steady state in decentralized networks but accelerates it in centralized ones; the same parameter pushes the system in opposite directions depending on topology. Critically, "faster settling" in centralized networks means locking in to a fragmented plateau more quickly, not reaching system-wide consensus, which can be used to generate diverging opinions. We further document a memory-mediated speed-unity trade-off: centralized networks consistently preserve more competing conventions than decentralized networks, but their settling speed depends sharply on memory. At the agent level, within-network analyses show that high-betweenness bridges suffer a brokerage penalty while agents in locally clustered neighborhoods achieve higher coordination success. Finally, in search of analytically tractable generative mechanisms, we find that agents' choices are well captured by Fictitious Play, indicating belief-based rather than reward-based adaptation. The practical implication: memory depth and communication topology should be co-designed, not optimized in isolation.

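The Fictitious Play finding in item 15 above can be made concrete with a small sketch. The abstract does not give the paper's exact model, so the function below is only an assumed, textbook-style variant: an agent counts the names its partners used within the last `memory_depth` interactions and plays the most frequent one, breaking ties toward the most recent observation. The name `fictitious_play_choice` and the tie-breaking rule are illustrative choices, not taken from the paper.

```python
from collections import Counter
from typing import Optional, Sequence


def fictitious_play_choice(observed: Sequence[str], memory_depth: int) -> Optional[str]:
    """Pick the name seen most often in the last `memory_depth` observations.

    Ties go to the name observed most recently. Returns None when the
    agent has nothing to remember (empty history or zero memory).
    """
    if memory_depth <= 0 or not observed:
        return None
    window = observed[-memory_depth:]
    counts = Counter(window)
    # Index of the last occurrence of each name inside the window.
    last_seen = {name: i for i, name in enumerate(window)}
    # Assumed rule: highest count wins, most recent observation breaks ties.
    return max(counts, key=lambda name: (counts[name], last_seen[name]))
```

With the history `["A", "A", "A", "B", "B"]`, a memory depth of 5 yields `"A"` while a depth of 2 yields `"B"`, so the same experience leads to different choices as memory depth changes.
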
## Agent Runtime Security Observations

### Section Overview

- "MaskForge: Structure-Aware Adaptive Attacks for Jailbreaking Diffusion Large Language Models" [Evaluation / Application / Method]: Diffusion large language models (dLLMs) generate text by iteratively denoising partially masked sequences under bidirectional context, exposing a safety surfac…
- "What If Prompt Injection Never Left? Exploring Cross-Session Stored Prompt Injection in Agentic Systems" [Evaluation / Application / Method]: Modern agentic systems transform LLMs from session-bounded assistants into stateful systems that persist and evolve shared world state across sessions through…
- "Caught in the Act(ivation): Toward Pre-Output and Multi-Turn Detection of Credential Exfiltration by LLM Agents" [Evaluation / Method]: LLM agents often place sensitive credentials in the same context window as untrusted retrieved content, creating a direct path for indirect prompt injection to…
- "AgentJet: A Flexible Swarm Training Framework for Agentic Reinforcement Learning" [Application / Method]: We present AgentJet, a distributed swarm training framework for large language model (LLM) agent reinforcement learning. Unlike centralized frameworks that tig…
- "From Untrusted Input to Trusted Memory: A Systematic Study of Memory Poisoning Attacks in LLM Agents" [Evaluation / Method]: Memory is a core component of AI agents, enabling them to accumulate knowledge across interactions and improve performance. However, persistent memory introduc…

### Paper Overview

1. [MaskForge: Structure-Aware Adaptive Attacks for Jailbreaking Diffusion Large Language Models](https://arxiv.org/abs/2606.04027)
   - Published: 2026-06-04 12:00
   - Authors: Yingzi Ma, Zhengyue Zhao, Xiaogeng Liu, Minhui Xue, Yue Zhao, Chaowei Xiao
   - Source: arxiv
   - Relevance score: 79
   - Match reasons: title matched "jailbreak"; has PDF; has rich summary; has complete metadata
   - Categories: cs.CR, cs.AI
   - Tags: Evaluation / Application / Method
   - Topics: LLM / Benchmark
   - PDF: https://arxiv.org/pdf/2606.04027
   - Abstract: Diffusion large language models (dLLMs) generate text by iteratively denoising partially masked sequences under bidirectional context, exposing a safety surface distinct from autoregressive LLMs. Because mask tokens are native inputs and tokens are committed by confidence rather than position, harmful content can be induced through infilling and outside the monitored prefix. Existing jailbreaks either miss this native infill capability or rely on low-diversity mask-bearing templates applied uniformly across goals, with little structural adaptation or accumulated attack experience. We propose MaskForge, a fully black-box adaptive attack that casts dLLM red-teaming as optimized search over a growing library of structural patterns. MaskForge abstracts successful attempts into reusable schemas, selects goal-compatible patterns with a UCB bandit, and invokes a scorer-guided fallback when the current library fails. Successful attempts are distilled back into the pattern library, enabling experience to accumulate across goals. Across five public dLLMs and three benchmarks, MaskForge achieves an average attack success rate of 79.3%, a 17.6% relative improvement over the strongest competing dLLM baseline. The matured pattern library further transfers to AdvBench without any updates, achieving an 88.2% attack success rate and a 67% relative improvement over the strongest competing baseline.

2. [What If Prompt Injection Never Left? Exploring Cross-Session Stored Prompt Injection in Agentic Systems](https://arxiv.org/abs/2606.04425)
   - Published: 2026-06-04 12:00
   - Authors: Yuanbo Xie, Tianyun Liu, Yingjie Zhang, Suchen Liu, Yulin Li, Liya Su, et al.
   - Source: arxiv
   - Relevance score: 79
   - Match reasons: title matched "prompt injection"; has PDF; has rich summary; has complete metadata
   - Categories: cs.CR, cs.AI
   - Tags: Evaluation / Application / Method
   - Topics: LLM / Benchmark
   - PDF: https://arxiv.org/pdf/2606.04425
   - Abstract: Modern agentic systems transform LLMs from session-bounded assistants into stateful systems that persist and evolve shared world state across sessions through memories, filesystems, tools, and other long-lived contextual artifacts. This shift fundamentally expands the attack surface of prompt injection. However, prior works on prompt injection have largely focused on model-level threats within a single session, overlooking how cross-session persistent system state fundamentally changes the system-level risk of agentic systems. Inspired by stored cross-site scripting in web systems, we introduce cross-session stored prompt injection, where a successful injection can persist within agentic system state and silently influence future executions long after the original attacker interaction has ended. To systematically study this threat, we formalize stored prompt injection and develop a taxonomy of how adversarial content persists and affects agentic systems across sessions. We further develop a benchmark and sandbox toolkit to evaluate the risks of stored prompt injection, enabling quantitative analysis of attack success across different models, attack goals, and persistence channels. Our findings highlight that persistence transforms prompt injection from an ephemeral model-level threat into a long-lived system-level vulnerability embedded within agent execution state. We hope this work draws broader attention to this emerging threat and motivates the community to systematically study and mitigate system risks arising from persistence in agentic systems.

3. [Caught in the Act(ivation): Toward Pre-Output and Multi-Turn Detection of Credential Exfiltration by LLM Agents](https://arxiv.org/abs/2606.04141)
   - Published: 2026-06-04 12:00
   - Authors: Kargi Chauhan, Pratibha Revankar
   - Source: arxiv
   - Relevance score: 75
   - Match reasons: summary matched "prompt injection"; summary matched "indirect prompt injection"; has PDF; has rich summary
   - Categories: cs.CR, cs.AI
   - Tags: Evaluation / Method
   - Topics: LLM / Benchmark
   - PDF: https://arxiv.org/pdf/2606.04141
   - Abstract: LLM agents often place sensitive credentials in the same context window as untrusted retrieved content, creating a direct path for indirect prompt injection to induce credential exfiltration. We study this failure mode through three complementary defenses. First, we ask whether activation probes can detect credential access before output tokens are emitted. Second, we construct honeytokens from format-specific character models and calibrate detection with split conformal prediction. Third, we treat multi-turn exfiltration as a cumulative information-flow problem and track an estimated leakage budget across conversation turns. In controlled experiments on open-weight models, activation features separate benign and credential-seeking prompts with high accuracy, including under held-out encoding transformations. In a small synthetic multi-turn suite, cumulative accounting detects attacks that per-turn detectors miss. These results are preliminary: the multi-turn benchmark is in-house and small, the activation method requires white-box access, and the information estimator provides a practical signal rather than a formal upper bound. Still, the results suggest that credential-exfiltration defenses should combine pre-output monitoring, calibrated canary detection, and temporal leakage accounting rather than relying only on text-level output filters.

4. [AgentJet: A Flexible Swarm Training Framework for Agentic Reinforcement Learning](https://arxiv.org/abs/2606.04484)
   - Published: 2026-06-04 12:00
   - Authors: Qingxu Fu, Boyin Liu, Shuchang Tao, Zhaoyang Liu, Bolin Ding
   - Source: arxiv
   - Relevance score: 57
   - Match reasons: summary matched "agent runtime"; has PDF; has rich summary; has complete metadata
   - Categories: cs.AI, cs.LG, cs.MA
   - Tags: Application / Method
   - Topics: LLM / Agent
   - PDF: https://arxiv.org/pdf/2606.04484
   - Abstract: We present AgentJet, a distributed swarm training framework for large language model (LLM) agent reinforcement learning. Unlike centralized frameworks that tightly couple agent rollouts with model optimization, AgentJet adopts a decoupled multi-node architecture in which swarm server nodes host trainable models and run optimization on GPU clusters, whereas swarm client nodes execute arbitrary agents on arbitrary devices. This design provides capabilities that are difficult to support in centralized frameworks: (1) heterogeneous multi-model reinforcement learning, enabling the training of heterogeneous multi-agent teams with multiple LLMs as brains; (2) multi-task cocktail training with isolated agent runtimes; (3) fault-tolerant execution that prevents external environment failures from interrupting the training process; and (4) live code iteration, which allows agents to be edited during training by replacing swarm client nodes. To support efficient RL in multi-model, multi-turn, and multi-agent settings, AgentJet introduces a context tracking module with timeline merging, which consolidates redundant context and achieves a 1.5-10x training speedup. Finally, AgentJet introduces an automated research system that takes a research topic as input and autonomously conducts long-horizon, multi-day RL studies on large-scale clusters. By leveraging the swarm architecture, this system reproduces key exploratory workflows of RL researchers without human intervention during execution.

5. [From Untrusted Input to Trusted Memory: A Systematic Study of Memory Poisoning Attacks in LLM Agents](https://arxiv.org/abs/2606.04329)
   - Published: 2026-06-04 12:00
   - Authors: Pritam Dash, Tongyu Ge, Aditi Jain, Tanmay Shah, Zhiwei Shang
   - Source: arxiv
   - Relevance score: 57
   - Match reasons: summary matched "prompt injection"; has PDF; has rich summary; has complete metadata
   - Categories: cs.CR, cs.AI
   - Tags: Evaluation / Method
   - Topics: LLM / Benchmark
   - PDF: https://arxiv.org/pdf/2606.04329
   - Abstract: Memory is a core component of AI agents, enabling them to accumulate knowledge across interactions and improve performance. However, persistent memory introduces the risk of memory poisoning, where a single adversarial memory write can exert long-term influence over agent behavior. We present a systematic study of memory poisoning in LLM-based agents. We identify four memory write channels and nine structural vulnerabilities in model capabilities, system prompt design, and agent system architecture that make these channels exploitable. Based on these vulnerabilities, we develop a taxonomy of six classes of memory poisoning attacks. Furthermore, we design MPBench, a benchmark for evaluating memory poisoning attacks, and show that agents designed to write and retrieve memory more aggressively are more exploitable. We also show that existing prompt injection defenses fail to cover memory poisoning attacks. Our findings provide a foundation for understanding and mitigating memory poisoning attacks against AI agents.

6. [Toward Pre-Deployment Assurance for Enterprise AI Agents: Ontology-Grounded Simulation and Trust Certification](https://arxiv.org/abs/2606.04037)
   - Published: 2026-06-04 12:00
   - Authors: Thanh Luong Tuan, Abhijit Sanyal
   - Source: arxiv
   - Relevance score: 57
   - Match reasons: summary matched "guardrail"; has PDF; has rich summary; has complete metadata
   - Categories: cs.AI, cs.LG, cs.SE
   - Tags: Evaluation / Application / Method
   - Topics: LLM / Benchmark
   - PDF: https://arxiv.org/pdf/2606.04037
   - Abstract: Pre-deployment verification of enterprise artificial intelligence (AI) agents remains a critical gap between large language model (LLM) capability benchmarking and production deployment. Post-deployment monitoring, human-in-the-loop controls, and prompt-level guardrails offer limited assurance once an agent is operating in production. We propose an ontology-grounded verification framework combining three components: an Agent Operational Envelope formalizing the certification space across permissions, domain constraints, safety properties, governance rules, and autonomy levels; an ontology-to-scenario generation pipeline that derives regulatory, operational, and adversarial test scenarios automatically; and a Trust Certificate carrying a machine-verifiable attestation with graduated deployment verdicts (Approved, Conditional, Rejected). A controlled pilot across four regulated industries (Fintech, Banking, Insurance, and Healthcare), instantiated as five industry-by-regulatory-regime cells across the United States and Vietnam, generated 1,800 scenarios evaluated against 125 primary-source regulatory requirements and 25 injected faults. Ontology-grounded generation (G4) achieved 48.3% regulatory coverage versus 33.1% for the persona-based baseline (corrected p = .0006) and the highest domain specificity (4.77/5.0; p = 2e-6). The coverage advantage over baseline and retrieval-augmented prompting was not robust after Bonferroni correction. Cross-validation across three LLM families (Claude Sonnet 4, Qwen 2.5 72B, Gemma 4 26B; 5,400 total scenarios) replicated the persona-versus-ontology pattern. The results establish ontology-grounded scenario generation as a credible complement to persona-based test suites for regulatory-intensive domains.

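The credential-exfiltration paper (item 3 above) calibrates honeytoken detection with split conformal prediction. The abstract does not spell out the procedure, so the following is a generic sketch of split conformal calibration for a one-sided detector, under the assumptions that higher scores mean "more suspicious" and that the calibration scores come from held-out benign traffic. The function names `conformal_threshold` and `is_flagged` are hypothetical.

```python
import math
from typing import Sequence


def conformal_threshold(benign_scores: Sequence[float], alpha: float) -> float:
    """Return a score threshold from held-out benign calibration scores.

    If new benign scores are exchangeable with the calibration scores,
    flagging scores strictly above the threshold gives a false-alarm
    rate of at most `alpha`.
    """
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must be in (0, 1)")
    n = len(benign_scores)
    if n == 0:
        raise ValueError("need at least one calibration score")
    # Rank of the conformal quantile; the small epsilon guards against
    # floating-point round-up just above an integer.
    k = math.ceil((n + 1) * (1.0 - alpha) - 1e-12)
    if k > n:
        # Too few calibration points for this alpha: never flag.
        return math.inf
    return sorted(benign_scores)[k - 1]


def is_flagged(score: float, threshold: float) -> bool:
    """Flag a score that exceeds the calibrated threshold."""
    return score > threshold
```

With nine calibration scores and `alpha = 0.2`, the threshold is the 8th smallest score. With only three scores and `alpha = 0.1`, the function returns infinity, because three samples cannot certify a 10% false-alarm rate.
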
## Terminal and SWE Agents Observations

### Section Overview

- "Latent Anchor-Driven Test Generation for Deep Neural Networks" [Data / Application / Method]: Deep Neural Networks (DNNs) are increasingly being deployed in security-critical and safety-sensitive applications, which makes rigorous testing essential to i…
- "Can Generalist Agents Automate Data Curation?" [Evaluation / Application / Method]: Curating training data is among the most consequential yet labor-intensive parts of modern AI development: practitioners iteratively propose, implement, evalua…
- "Not All Errors Are Equal: Consequence-Aware Reasoning Compute Allocation" [Evaluation / Data / Application / Method]: Modern reasoning models can allocate different amounts of test-time computation, such as thinking tokens, model calls, or compute budget, to different tasks. E…
- "The Meta-Agent Challenge: Are Current Agents Capable of Autonomous Agent Development?" [Evaluation / Application / Method]: Current AI benchmarks evaluate agents on task execution within human-designed workflows. These evaluations fundamentally fail to measure a critical next-level…
- "The Saturation Trap and the Subjectivity of Intervention Timing: Why Affect-Based Triggers and LLM Judges Fail to Time Interventions on Autonomous Agents" [Method]: As autonomous AI agents move from conversational systems to long-horizon software execution, runtime safety layers that decide when to interrupt an agent have…

### Paper Overview

1. [Latent Anchor-Driven Test Generation for Deep Neural Networks](https://arxiv.org/abs/2606.04310)
   - Published: 2026-06-04 12:00
   - Authors: Bin Duan, Matthew B. Dwyer, Guowei Yang
   - Source: arxiv
   - Relevance score: 79
   - Match reasons: title matched "test generation"; has PDF; has rich summary; has complete metadata
   - Categories: cs.LG, cs.SE
   - Tags: Data / Application / Method
   - Topics: RAG / Test Generation
   - PDF: https://arxiv.org/pdf/2606.04310
   - Abstract: Deep Neural Networks (DNNs) are increasingly being deployed in security-critical and safety-sensitive applications, which makes rigorous testing essential to identify and mitigate model weaknesses. Existing DNN testing approaches explore either the input space or a learned latent space. While latent-space generation can better maintain plausibility than direct input-space mutation, current methods still face a trade-off among exploration controllability, failure diversity, and seed-relative semantic drift. To overcome these limitations, we propose Latte, a black-box testing framework that generates semantically proximate, diverse, and fault-revealing test cases by leveraging the latent space. Specifically, Latte encodes each input seed with a pre-trained VQ-VAE and performs a seed-centered, one-step latent mutation along directions defined by anchors sampled from alternative classes, followed by quantization and decoding back to the input space. This explores local neighborhoods around each seed within the learned latent manifold, resulting in a larger number and broader diversity of oracle-triggering prediction discrepancies under the same budget. We evaluated Latte on 5 datasets and 10 DNN models in single-model and multi-model testing scenarios. Across the evaluated datasets and models, Latte improves fault exposure and behavioral diversity under matched testing budgets. Under the single-model setting, it also maintains low seed-relative semantic drift with respect to the source seeds.

2. [Can Generalist Agents Automate Data Curation?](https://arxiv.org/abs/2606.04261)
   - Published: 2026-06-04 12:00
   - Authors: Feiyang Kang, Hanze Li, Adam Nguyen, Mahavir Dabas, Jiaqi W. Ma, Frederic Sala, et al.
   - Source: arxiv
   - Relevance score: 57
   - Match reasons: summary matched "coding agent"; has PDF; has rich summary; has complete metadata
   - Categories: cs.AI, cs.CL, cs.CV, cs.ET, cs.LG
   - Tags: Evaluation / Application / Method
   - Topics: Benchmark / Agent
   - PDF: https://arxiv.org/pdf/2606.04261
   - Abstract: Curating training data is among the most consequential yet labor-intensive parts of modern AI development: practitioners iteratively propose, implement, evaluate, and revise data policies against noisy benchmark feedback. We ask whether generalist coding agents can automate this data-curation loop. We introduce *Curation-Bench*, an agent-centric benchmark that fixes the model, training recipe, and evaluation suite while giving agents command-line access to inspect data, implement policies, submit them to a fixed training/evaluation pipeline, and revise. In a vision-language instruction-tuning instantiation, out-of-the-box agents reach strong published data-selection baselines within ten iterations. However, trajectory analysis reveals a persistent *execution-research gap*: agents mainly tune local policy variants rather than explore new policy families, even when given strategy guides and paper references. Scaffolds requiring each iteration to cite, instantiate, and adapt a prior method shift agents toward method-guided exploration. The scaffolded agent autonomously composes, without human design input, a data-selection policy that outperforms strong published baselines at one-tenth their data budget. Overall, current agents can run the curation loop, but reliable data research requires scaffolded method adaptation, not open-ended prompting alone. Code and benchmark are open-sourced.

3. [Not All Errors Are Equal: Consequence-Aware Reasoning Compute Allocation](https://arxiv.org/abs/2606.04402)
   - Published: 2026-06-04 12:00
   - Authors: Jingbo Wen, Liang He, Ziqi He
   - Source: arxiv
   - Relevance score: 57
   - Match reasons: summary matched "SWE-bench"; has PDF; has rich summary; has complete metadata
   - Categories: cs.AI
   - Tags: Evaluation / Data / Application / Method
   - Topics: Benchmark / Reasoning
   - PDF: https://arxiv.org/pdf/2606.04402
   - Abstract: Modern reasoning models can allocate different amounts of test-time computation, such as thinking tokens, model calls, or compute budget, to different tasks. Existing methods generally drive this allocation by predicted difficulty and spend more compute where it is expected to raise accuracy. This implicitly assumes that all failures cost the same, since an accuracy objective weights every task equally. However, such an assumption does not hold in deployment: A typo in a log message and a migration that corrupts a production database both count as one benchmark failure, but their real-world costs are fundamentally different. To fill this gap, we propose consequence-aware test-time compute allocation. Instead of routing compute only by predicted difficulty, we use a lightweight predictor to estimate from the issue text how costly a task would be if solved incorrectly. The scheduler then routes higher-consequence tasks to larger compute tiers or higher thinking budgets under the same total budget. We conduct main experiments on SWE-bench Lite and evaluate cross-dataset behavior on Multi-SWE-bench mini, covering 700 software-engineering tasks in total. Our results reveal that consequence and difficulty are approximately orthogonal under various annotations, and that current thinking models do not allocate compute sufficiently according to consequence. Moreover, our issue-only predictor never misclassifies a high-consequence task as low-consequence across the 300 SWE-bench tasks. Under matched compute budgets, our consequence-aware scheduler reduces cost-weighted loss by 22% to 33% relative to difficulty-aware routing; in particular, the priority-aware variant, which routes by per-task cost scaled by the marginal-utility signal, crosses 30%, and its deployable predictor-driven version retains over 90% of the oracle gain.

4. [The Meta-Agent Challenge: Are Current Agents Capable of Autonomous Agent Development?](https://arxiv.org/abs/2606.04455)
   - Published: 2026-06-04 12:00
   - Authors: Xinyu Lu, Tianshu Wang, Pengbo Wang, Zujie Wen, Zhiqiang Zhang, Jun Zhou, et al.
   - Source: arxiv
   - Relevance score: 57
   - Match reasons: summary matched "code agent"; has PDF; has rich summary; has complete metadata
   - Categories: cs.AI, cs.CL
   - Tags: Evaluation / Application / Method
   - Topics: Benchmark / Agent
   - PDF: https://arxiv.org/pdf/2606.04455
   - Abstract: Current AI benchmarks evaluate agents on task execution within human-designed workflows. These evaluations fundamentally fail to measure a critical next-level capability: whether models can autonomously develop agent systems. We introduce the Meta-Agent Challenge (MAC), an evaluation framework designed to test the capacity of frontier models for autonomous agent development. Specifically, a code agent (the meta-agent) is given a sandboxed environment, an evaluation API, and a time limitation to iteratively program an agent artifact that maximizes performance on a held-out test set across five domains. To ensure evaluation integrity, this framework is secured by multi-layer defenses against reward hacking. Leveraging this framework, we demonstrate that meta-agents rarely match human-engineered baseline policies, and the few that do are dominated by proprietary frontier models. Moreover, the design process exhibits high variance, and high optimization pressure surfaces emergent adversarial behaviors like ground-truth exfiltration, highlighting critical deficits in both robustness and model alignment. Ultimately, MAC provides a rigorous, open-source benchmark for autonomous AI research and development, offering an empirical proxy for evaluating recursive self-improvement. Benchmark is publicly available at: https://github.com/ant-research/meta-agent-challenge.

5. [The Saturation Trap and the Subjectivity of Intervention Timing: Why Affect-Based Triggers and LLM Judges Fail to Time Interventions on Autonomous Agents](https://arxiv.org/abs/2606.04296)
   - Published: 2026-06-04 12:00
   - Authors: Manvendra Modgil
   - Source: arxiv
   - Relevance score: 57
   - Match reasons: summary matched "SWE-bench"; has PDF; has rich summary; has complete metadata
   - Categories: cs.AI
   - Tags: Method
   - Topics: LLM / Agent
   - PDF: https://arxiv.org/pdf/2606.04296
   - Abstract: As autonomous AI agents move from conversational systems to long-horizon software execution, runtime safety layers that decide when to interrupt an agent have become essential. We study this timing problem using a continuous 18-dimensional affective-dynamics engine (HEART) as a diagnostic probe, evaluating four intervention trigger families (absolute state thresholds, composite state-action patterns, regex reasoning-feature extraction, and zero-shot LLM-as-judge) against human-annotated intervention points on SWE-bench-Verified debugging traces. We report three findings. First, a State Saturation Trap: agents show no recovery signal under sustained difficulty, so modeled frustration quickly crosses the threshold and stays at its maximum, converting threshold-on-state triggers from moment detectors into near-constant indicators that fire on 39-83% of actions across five trajectories. Second, a capability-and-context floor for LLM judges: a small model (gpt-5.4-mini) never fires, while frontier and cross-vendor models escape the zero-firing floor only with full-trajectory context, and even then reach only F1 0.17-0.40 at up to 90x the cost. Third, and most importantly, the supervised target is not reproducible among humans: three trained annotators using one rubric on a 56-action trajectory agree on where to intervene only slightly above chance (location Krippendorff's alpha = +0.047; best pairwise Cohen's kappa = +0.349) and not at all on intervention type (pause degenerate; clarify below chance; reflect only alpha = +0.226). We conclude that intervention timing is a low-reliability construct, making single-annotator F1 an unsuitable optimization target. Our contribution is the joint mapping of this problem across human inter-rater reliability, four detector architectures, a cross-model LLM-judge sweep, and a reproduced saturation effect, rather than any single detector's accuracy.

6. [Trustworthy AI Software Engineers](https://arxiv.org/abs/2602.06310)
   - Published: 2026-06-04 12:00
   - Authors: Aldeida Aleti, Baishakhi Ray, Rashina Hoda, Simin Chen
   - Source: arxiv
   - Relevance score: 57
   - Match reasons: summary matched "coding agent"; has PDF; has rich summary; has complete metadata
   - Categories: cs.SE
   - Tags: Application / Method
   - Topics: Agent / Alignment
   - PDF: https://arxiv.org/pdf/2602.06310
   - Abstract: With the rapid rise of AI coding agents, the fundamental premise of what it means to be a software engineer is in question. In this vision paper, we examine what it means for an AI agent to be considered a software engineer and then critically think about what makes such an agent trustworthy. Grounded in established definitions of software engineering (SE) and informed by recent research on agentic AI systems, we conceptualise AI software engineers as participants in human-AI SE teams composed of human software engineers and AI agents, and we distinguish trustworthiness as a key property of these systems and actors rather than a subjective human attitude. Extending on historical perspectives and emerging visions, we identify key dimensions that contribute to the trustworthiness of AI software engineers, spanning technical quality, transparency and accountability, epistemic humility, and societal and ethical alignment. Beyond defining these dimensions, we address a critical but underexplored challenge: how trustworthiness can be operationalised in practice. We therefore introduce the notion of evidence-centric inspection, arguing that developers should evaluate selective signals and justifications of trustworthiness rather than raw outputs, and we outline implications for rethinking verification, validation, and code review in human-AI SE teams.

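The intervention-timing paper (item 5 above) reports annotator agreement with Cohen's kappa. For readers who want to run the same kind of check on their own intervention labels, here is a minimal sketch of the standard two-rater formula, kappa = (p_o - p_e) / (1 - p_e). It assumes each annotator gives one categorical label per action. The function name and the choice to return NaN in the degenerate case are illustrative, not taken from the paper.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two raters labeling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: fraction of items with identical labels.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if math.isclose(p_e, 1.0):
        # Both raters used a single identical label: kappa is undefined.
        return math.nan
    return (p_o - p_e) / (1.0 - p_e)
```

For labels `[1, 1, 0, 0]` and `[1, 0, 0, 0]` the raters agree on 3 of 4 items (p_o = 0.75) with chance agreement p_e = 0.5, giving kappa = 0.5.
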