# Daily Paper Digest

- Generated: 2026-06-24 13:06:49 (Asia/Shanghai)
- Search window: last 24 hours
- Hit overview: LM=15, Agent Runtime Security=6, Terminal and SWE Agents=5
- Ranking strategy: hybrid (relevance first, published_at tie-break)
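
The hybrid ordering named in the header (relevance first, `published_at` as tie-break) can be sketched as below. This is a minimal illustration, not the digest generator's actual code: the `Paper` type and the `hybrid_rank` function are hypothetical names, and the newest-first tie-break direction is an assumption, although it is consistent with the order of equal-score entries in this digest. The scores and timestamps are taken from entries listed in this digest, with titles shortened.

```python
from datetime import datetime
from typing import NamedTuple


class Paper(NamedTuple):
    # Hypothetical record type; field names are assumptions for illustration.
    title: str
    relevance: int
    published_at: datetime


def hybrid_rank(papers):
    """Sort by relevance (descending), then by publication time (newest first)."""
    # Tuples compare element by element, so published_at only matters
    # when two papers share the same relevance score.
    return sorted(papers, key=lambda p: (p.relevance, p.published_at), reverse=True)


papers = [
    Paper("Agent-Native Memory", 177, datetime(2026, 6, 24, 0, 34)),
    Paper("AGORA", 199, datetime(2026, 6, 23, 20, 57)),
    Paper("EG-VQA", 177, datetime(2026, 6, 24, 0, 49)),
]

ranked_titles = [p.title for p in hybrid_rank(papers)]
print(ranked_titles)
```

A single composite key keeps the rule in one place; if the tie-break should instead favor older papers, the key would need separate handling for each field rather than a shared `reverse=True`.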

## Today's Highlights

- Topic "LLM": 17 hits across LM, Agent Runtime Security, and other groups; representative papers include "AI-PAVE-Br: Leveraging Large Language Models for Enhanced Product Attribute Value Extraction through a Golden Set Approach" and "A specialized reasoning large language model for accelerating rare disease diagnosis: a randomized AI physician assistance trial".
- Topic "Language Model": 16 hits across LM, Agent Runtime Security, and other groups; representative papers include "AGORA: An Archive-Grounded Benchmark for Agentic Workplace Document Reasoning" and "AI-PAVE-Br: Leveraging Large Language Models for Enhanced Product Attribute Value Extraction through a Golden Set Approach".
- Topic "Agent": 12 hits across LM, Agent Runtime Security, and other groups; representative papers include "AGORA: An Archive-Grounded Benchmark for Agentic Workplace Document Reasoning" and "Privacy-Preserving RAG via Multi-Agent Semantic Rewriting: Achieving Confidentiality Without Compromising Contextual Fidelity".
- Topic "Benchmark": 3 hits across LM, Agent Runtime Security, and other groups; representative papers include "CineCap: Structured Reasoning with Spatio-Temporal Anchors for Cinematographic Video Captioning" and "PHANTOM: A Large-Scale Dataset of Multimodal Adversarial Attacks for Vision-Language Models".
- Topic "Coding Agent": 1 hit in Terminal and SWE Agents; representative paper: "Detecting AI Coding Agents in Open Source: A Validated Multi-Method Census of 180 Million Repositories".

## Section Status

- LM: 15 papers
- Agent Runtime Security: 6 papers
- Terminal and SWE Agents: 5 papers

## Topic Focus

### LLM

- Hits: 17
- Groups covered: LM, Agent Runtime Security, Terminal and SWE Agents
- Representative papers: "AI-PAVE-Br: Leveraging Large Language Models for Enhanced Product Attribute Value Extraction through a Golden Set Approach", "A specialized reasoning large language model for accelerating rare disease diagnosis: a randomized AI physician assistance trial", "EG-VQA: Benchmarking Verifiable Video Question Answering with Grounded Temporal Evidence"
- Quick read:
  - "AI-PAVE-Br: Leveraging Large Language Models for Enhanced Product Attribute Value Extraction through a Golden Set Approach" [Evaluation / Data / Application / Method]: The explosive growth and complexity of product data within the dynamic Brazilian e-commerce landscape demand robust and specialized methods for structured info…
  - "A specialized reasoning large language model for accelerating rare disease diagnosis: a randomized AI physician assistance trial" [Evaluation / Application / Method]: Rare diseases affect millions of individuals worldwide, yet timely diagnosis remains a major public health challenge due to scarcity of specialized clinical ex…

### Language Model

- Hits: 16
- Groups covered: LM, Agent Runtime Security, Terminal and SWE Agents
- Representative papers: "AGORA: An Archive-Grounded Benchmark for Agentic Workplace Document Reasoning", "AI-PAVE-Br: Leveraging Large Language Models for Enhanced Product Attribute Value Extraction through a Golden Set Approach", "A specialized reasoning large language model for accelerating rare disease diagnosis: a randomized AI physician assistance trial"
- Quick read:
  - "AGORA: An Archive-Grounded Benchmark for Agentic Workplace Document Reasoning" [Evaluation / Method]: Large language models are increasingly deployed as agents that reason over documents rather than answer from parametric knowledge. We study archive-grounded re…
  - "AI-PAVE-Br: Leveraging Large Language Models for Enhanced Product Attribute Value Extraction through a Golden Set Approach" [Evaluation / Data / Application / Method]: The explosive growth and complexity of product data within the dynamic Brazilian e-commerce landscape demand robust and specialized methods for structured info…

### Agent

- Hits: 12
- Groups covered: LM, Agent Runtime Security, Terminal and SWE Agents
- Representative papers: "AGORA: An Archive-Grounded Benchmark for Agentic Workplace Document Reasoning", "Privacy-Preserving RAG via Multi-Agent Semantic Rewriting: Achieving Confidentiality Without Compromising Contextual Fidelity", "Accuracy and Satisfaction in Multi-Turn LLM Dialogues for NFR Assessment"
- Quick read:
  - "AGORA: An Archive-Grounded Benchmark for Agentic Workplace Document Reasoning" [Evaluation / Method]: Large language models are increasingly deployed as agents that reason over documents rather than answer from parametric knowledge. We study archive-grounded re…
  - "Privacy-Preserving RAG via Multi-Agent Semantic Rewriting: Achieving Confidentiality Without Compromising Contextual Fidelity" [Data / Method]: Retrieval-Augmented Generation enhances large language models by incorporating external knowledge, but deploying it in sensitive scenarios risks privacy leakag…

### Benchmark

- Hits: 3
- Groups covered: LM, Agent Runtime Security, Terminal and SWE Agents
- Representative papers: "CineCap: Structured Reasoning with Spatio-Temporal Anchors for Cinematographic Video Captioning", "PHANTOM: A Large-Scale Dataset of Multimodal Adversarial Attacks for Vision-Language Models", "NatureBench: Can Coding Agents Match the Published SOTA of Nature-Family Papers?"
- Quick read:
  - "CineCap: Structured Reasoning with Spatio-Temporal Anchors for Cinematographic Video Captioning" [Evaluation / Application / Method]: Cinematographic captioning aims to describe how a video is filmed using professional film-language concepts such as camera movement, shot size, depth of field,…
  - "PHANTOM: A Large-Scale Dataset of Multimodal Adversarial Attacks for Vision-Language Models" [Evaluation / Data / Application / Method]: We introduce a large-scale, open-source dataset of pre-generated adversarial attacks for vision-language models (VLMs). The dataset is designed to be diverse,…

### Coding Agent

- Hits: 1
- Groups covered: Terminal and SWE Agents
- Representative papers: "Detecting AI Coding Agents in Open Source: A Validated Multi-Method Census of 180 Million Repositories"
- Quick read:
  - "Detecting AI Coding Agents in Open Source: A Validated Multi-Method Census of 180 Million Repositories" [Application / Method]: Generative AI coding agents are entering the open-source supply chain, yet their diverse and often invisible traces leave their prevalence poorly understood. W…

## LM Watch

### Group at a Glance

- "AGORA: An Archive-Grounded Benchmark for Agentic Workplace Document Reasoning" [Evaluation / Method]: Large language models are increasingly deployed as agents that reason over documents rather than answer from parametric knowledge. We study archive-grounded re…
- "AI-PAVE-Br: Leveraging Large Language Models for Enhanced Product Attribute Value Extraction through a Golden Set Approach" [Evaluation / Data / Application / Method]: The explosive growth and complexity of product data within the dynamic Brazilian e-commerce landscape demand robust and specialized methods for structured info…
- "A specialized reasoning large language model for accelerating rare disease diagnosis: a randomized AI physician assistance trial" [Evaluation / Application / Method]: Rare diseases affect millions of individuals worldwide, yet timely diagnosis remains a major public health challenge due to scarcity of specialized clinical ex…
- "EG-VQA: Benchmarking Verifiable Video Question Answering with Grounded Temporal Evidence" [Evaluation / Data / Application / Method]: Recent advances in Video Large Language Models (Video-LLMs) have yielded promising performance on video question answering (VideoQA). Nevertheless, existing be…
- "Are We Ready For An Agent-Native Memory System?" [Evaluation / Data / Method]: Memory for large language model (LLM) agents has rapidly evolved from simple retrieval-augmented mechanisms into a data management system that supports persist…

### Paper Overview

1. [AGORA: An Archive-Grounded Benchmark for Agentic Workplace Document Reasoning](https://arxiv.org/abs/2606.24526v1)
   - Published: 2026-06-23 20:57
   - Authors: Honglin Guo, Qi Zhang, Yu Zhang, Weijie Li, Rui Zheng, Zhikai Lei, et al.
   - Source: arXiv
   - Relevance score: 199
   - Match reasons: title matched "reasoning"; title matched "agent"; title matched "benchmark"; summary matched "language model"
   - Categories: cs.CL
   - Tags: Evaluation / Method
   - Topics: Language Model / Agent
   - PDF: https://arxiv.org/pdf/2606.24526v1
   - Abstract: Large language models are increasingly deployed as agents that reason over documents rather than answer from parametric knowledge. We study archive-grounded reasoning: locating sparse evidence across a large, messy collection of workplace files, reconciling inconsistent terminology, units, and time conventions, and computing an answer. Existing benchmarks address only parts of this setting and none jointly stresses archive-groundedness, agentic exploration, and cross-domain coverage. We introduce Agora, a benchmark pairing 362 questions with eight domain collections of 9,664 authentic documents and 372M tokens, far exceeding any model's context window, so agents must explore deliberately rather than scan exhaustively. Agora is built by an agentic pipeline combining cross-document task synthesis, leakage-preventing obfuscation, and difficulty filtering. Evaluating eight models, we find the task far from solved: even the strongest reaches only 59.4% accuracy, with notable variation across domains.

2. [AI-PAVE-Br: Leveraging Large Language Models for Enhanced Product Attribute Value Extraction through a Golden Set Approach](https://arxiv.org/abs/2606.24655v1)
   - Published: 2026-06-23 22:48
   - Authors: Murilo Gazzola, Hugo Gobato Souto, Samuel Silva, Júlia Schubert Peixoto, Felipe Siqueira, André Luis Pedroso de Morais, et al.
   - Source: arXiv
   - Relevance score: 195
   - Match reasons: title matched "language model"; title matched "large language model"; title matched "RAG"; summary matched "LLM"
   - Categories: cs.CL, cs.AI, cs.LG, cs.PF
   - Tags: Evaluation / Data / Application / Method
   - Topics: LLM / Language Model
   - PDF: https://arxiv.org/pdf/2606.24655v1
   - Abstract: The explosive growth and complexity of product data within the dynamic Brazilian e-commerce landscape demand robust and specialized methods for structured information extraction. Traditional approaches to Product Attribute Value Extraction (PAVE) often struggle with the linguistic nuances and sheer diversity of product descriptions in Portuguese. To address this critical gap, this paper introduces two major contributions. First, we present AI-PAVE-Br, a specialized system engineered with Large Language Models (LLMs) to perform high-accuracy PAVE specifically for Brazilian e-commerce catalogs. Second, to facilitate reproducible research and provide a definitive benchmark, we introduce and share the Golden Set, a new, meticulously curated, and manually annotated dataset for PAVE in Portuguese. We detail the creation process and structure (Entity, Category, Subcategories) of this high-quality reference set. Our experiments conclusively show that AI-PAVE-Br, leveraging targeted prompt engineering, dramatically outperforms conventional Named Entity Recognition (NER) baselines. This work not only delivers a superior, scalable solution for a major non-English market but also enriches the NLP community with a valuable, publicly available resource for future PAVE research.

3. [A specialized reasoning large language model for accelerating rare disease diagnosis: a randomized AI physician assistance trial](https://arxiv.org/abs/2606.24510v1)
   - Published: 2026-06-23 20:42
   - Authors: Haichao Chen, Songchi Zhou, Zhengyun Zhao, Shikai Hu, Xianghong Jin, Hongwei Ji, et al.
   - Source: arXiv
   - Relevance score: 181
   - Match reasons: title matched "language model"; title matched "large language model"; title matched "reasoning"; summary matched "LLM"
   - Categories: cs.AI, cs.CL
   - Tags: Evaluation / Application / Method
   - Topics: LLM / Language Model
   - PDF: https://arxiv.org/pdf/2606.24510v1
   - Abstract: Rare diseases affect millions of individuals worldwide, yet timely diagnosis remains a major public health challenge due to scarcity of specialized clinical expertise. While large language models (LLMs) show promise to support rare disease diagnosis, current models are constrained by insufficient clinical deployability, limited clinically grounded evidence, and scarcity of training data. Here we present RaDaR (Rare Disease navigatoR), an open-source, compact reasoning LLM (32B parameters) for rare disease diagnosis. RaDaR was trained with 49,170 publicly available free-text cases and 104,666 synthetic cases with reasoning-enhanced training. RaDaR showed the strongest performance among evaluated open-source models, including the 671B DeepSeek-R1, across public benchmarks and four external validation centers. In a retrospective cohort, RaDaR prioritized the final diagnosis before documented clinical suspicion in 61.06 percent of cases, corresponding to a potential lead time of 1.87 months and 50.18 percent of the within-center interval. In a randomized physician-assistance trial, RaDaR assistance improved physicians' rare-disease diagnostic accuracy by 21.44 percentage points compared with internet search alone. Synthetic-data ablations suggested that phenotype-anchored narratives provide useful training signal for long-tail rare diseases, with a monotonic scaling trend within the tested data range. Together, RaDaR and its development and validation framework provide a deployable rare-disease reasoning model and a reproducible development framework for diagnostic AI under data scarcity.

4. [EG-VQA: Benchmarking Verifiable Video Question Answering with Grounded Temporal Evidence](https://arxiv.org/abs/2606.24797v1)
   - Published: 2026-06-24 00:49
   - Authors: Linpeng Huang, Weixing Chen, Zexin Chen, Yang Liu, Liang Lin
   - Source: arXiv
   - Relevance score: 177
   - Match reasons: title matched "benchmark"; summary matched "language model"; summary matched "large language model"; summary matched "LLM"
   - Categories: cs.CV, cs.AI
   - Tags: Evaluation / Data / Application / Method
   - Topics: LLM / Language Model
   - PDF: https://arxiv.org/pdf/2606.24797v1
   - Abstract: Recent advances in Video Large Language Models (Video-LLMs) have yielded promising performance on video question answering (VideoQA). Nevertheless, existing benchmarks are predominantly evaluated through answer correctness, while the grounding of predictions in relevant video evidence remains largely unexamined. This disconnect between answer generation and evidence understanding motivates the construction of the Evidence-Grounded Video Question Answering Benchmark (EG-VQA), an open-ended evaluation protocol in which each QA pair is explicitly annotated with supporting temporal evidence, thereby requiring joint reasoning and precise evidence localization. EG-VQA comprises 2,067 videos and 11,838 QA pairs with fine-grained evidence annotations. To evaluate predicted evidence, Evidence-Grounded F1 (EG-F1) is introduced as a unified metric in which temporal alignment and semantic consistency against ground-truth evidence are jointly measured. Experimental evaluation reveals that even strong proprietary models struggle to accurately ground their predictions, exposing a fundamental discrepancy between answer correctness and faithful evidence localization. To bridge this gap, EG-Reasoner, an evidence-grounded reasoning model trained with explicit supervision, is proposed. State-of-the-art performance is achieved among open-source models, with results competitive against proprietary systems; particularly pronounced gains are observed on reasoning-intensive tasks such as counterfactual questions. These findings demonstrate that scaling alone is insufficient for robust video understanding and that structured evidence supervision is essential for the development of more reliable and interpretable VideoQA systems.

5. [Are We Ready For An Agent-Native Memory System?](https://arxiv.org/abs/2606.24775v1)
   - Published: 2026-06-24 00:34
   - Authors: Wei Zhou, Xuanhe Zhou, Shaokun Han, Hongming Xu, Guoliang Li, Zhiyu Li, et al.
   - Source: arXiv
   - Relevance score: 177
   - Match reasons: title matched "agent"; summary matched "language model"; summary matched "large language model"; summary matched "LLM"
   - Categories: cs.CL, cs.DB, cs.IR
   - Tags: Evaluation / Data / Method
   - Topics: LLM / Language Model
   - PDF: https://arxiv.org/pdf/2606.24775v1
   - Abstract: Memory for large language model (LLM) agents has rapidly evolved from simple retrieval-augmented mechanisms into a data management system that supports persistent information storage, retrieval, update, consolidation, and dynamic lifecycle governance throughout agent execution. Despite this evolution, existing evaluations still benchmark agent memory mainly through end-to-end task success metrics (e.g., F1, BLEU), while treating the underlying system as a monolithic black box. As a result, critical system-level concerns, including operational costs, architectural trade-offs across memory modules, and robustness under dynamic knowledge updates, remain insufficiently explored. In this paper, we present a systematic experimental study of agent memory from a data management perspective. We propose an analytical framework that decomposes agent memory into four core modules: memory representation and storage, extraction, retrieval and routing, and maintenance. Under this framework, we evaluate 12 representative memory systems and two reference baselines across five benchmark workloads spanning 11 datasets. Our extensive end-to-end evaluation shows that no single architecture dominates across all scenarios; instead, effectiveness depends heavily on how well the memory structure aligns with the workload bottleneck. Furthermore, through fine-grained ablation studies, we quantify their individual effects on representation fidelity, retrieval precision, update correctness, and long-horizon stability. Finally, we reveal cost-performance trade-offs under realistic workloads, showing localized maintenance is more cost-efficient than global reorganization. Based on these findings, we identify promising directions towards building truly agent-native memory systems. The code is publicly available at https://github.com/OpenDataBox/MemoryData.

6. [CineCap: Structured Reasoning with Spatio-Temporal Anchors for Cinematographic Video Captioning](https://arxiv.org/abs/2606.24636v1)
   - Published: 2026-06-23 22:29
   - Authors: Xinyu Mao, Yuhui Zeng, Xiaokun Liu, Wenyu Qin, Meng Wang, Xin Tao, et al.
   - Source: arXiv
   - Relevance score: 157
   - Match reasons: title matched "reasoning"; summary matched "language model"; summary matched "large language model"; summary matched "RAG"
   - Categories: cs.AI
   - Tags: Evaluation / Application / Method
   - Topics: Language Model / Benchmark
   - PDF: https://arxiv.org/pdf/2606.24636v1
   - Abstract: Cinematographic captioning aims to describe how a video is filmed using professional film-language concepts such as camera movement, shot size, depth of field, composition, and shooting angle. This capability is important for fine-grained video understanding and controllable movie-quality video generation, yet remains underexplored in existing multimodal large language models. Unlike question-answering-based evaluation of cinematic understanding, cinematographic captioning requires a unified open-form description over multiple cinematographic dimensions. This task is challenging for two main reasons: the model must infer professional cinematographic concepts from subtle visual evidence, and it must generate captions that are both comprehensive and accurate. Accordingly, we propose CineCap, a framework that combines structured reasoning with spatio-temporal anchors and reinforcement learning with comprehensiveness, accuracy, and gated coverage rewards. The former grounds professional cinematographic descriptions in explicit visual evidence and organizes them into compact atomic reasoning for supervised fine-tuning, while the latter improves the balance between descriptive completeness and factual correctness. In addition, we construct CineCap Bench, a benchmark of 472 manually annotated video-caption pairs for systematic evaluation. Extensive experiments show that CineCap consistently outperforms strong proprietary and open-source baselines, establishing a new state of the art for cinematographic captioning. The code, model checkpoint, and benchmark are publicly available at https://github.com/Hectormxy/CineCap.git.

7. [AdversaBench: Automated LLM Red-Teaming with Multi-Judge Confirmation and Cross-Model Transferability](https://arxiv.org/abs/2606.24589v1)
   - Published: 2026-06-23 21:50
   - Authors: Khanak Khandelwal
   - Source: arXiv
   - Relevance score: 156
   - Match reasons: title matched "LLM"; summary matched "language model"; summary matched "large language model"; summary matched "reasoning"
   - Categories: cs.AI, cs.CL
   - Tags: Evaluation / Data / Method
   - Topics: LLM / Language Model
   - PDF: https://arxiv.org/pdf/2606.24589v1
   - Abstract: Scaling adversarial evaluation of large language models requires both a method for generating hard inputs and a reliable way to confirm that resulting failures are real. We present AdversaBench, an end-to-end red-teaming pipeline that mutates seed prompts with five structured operators, queries a target model, and confirms failures through a three-judge panel with a meta-judge tiebreaker. We report experiments on 45 seeds across three categories: reasoning, instruction-following, and tool use. Every seed produced a confirmed failure. Four findings stand out. First, operator effectiveness varies sharply by category: inject_distractor scores 0.00 mean reward on instruction-following seeds but 0.80-0.83 on reasoning and tool-use. Second, binary failure rate hides difficulty: instruction-following seeds required 2.4 attacker iterations on average versus 1.1 for other categories, a gap visible in survival curves. Third, pairwise judge agreement of 80-87% coexists with near-zero Cohen's kappa due to label skew; category-level disagreement rates are more informative. Fourth, adversarial prompts generated against Llama 3.1 8B transfer zero-shot to Llama 3.3 70B, suggesting the mutations exploit general behavioral patterns rather than model-specific weaknesses. Code, dataset, and analysis scripts are available at https://github.com/khanak0509/AdversaBench.

8. [Privacy-Preserving RAG via Multi-Agent Semantic Rewriting: Achieving Confidentiality Without Compromising Contextual Fidelity](https://arxiv.org/abs/2606.24623v1)
   - Published: 2026-06-23 22:21
   - Authors: Yuanhe Zhao, Tianyu Zhang, Huafei Xing, Derek F. Wong, Jianbin Li, Tao Fang
   - Source: arXiv
   - Relevance score: 143
   - Match reasons: title matched "agent"; title matched "RAG"; summary matched "language model"; summary matched "large language model"
   - Categories: cs.CL, cs.AI
   - Tags: Data / Method
   - Topics: Language Model / Agent
   - PDF: https://arxiv.org/pdf/2606.24623v1
   - Abstract: Retrieval-Augmented Generation enhances large language models by incorporating external knowledge, but deploying it in sensitive scenarios risks privacy leakage via malicious prompts. To address this, we propose a multi-agent framework that sanitizes retrieved content through semantic rewriting. By employing three specialized agents for privacy extraction, semantic analysis, and reconstruction, our approach collaboratively removes sensitive identifiers while preserving the semantic core. We evaluate the framework on the ChatDoctor and Wiki-PII datasets across six large language models. Experimental results demonstrate a significant reduction in privacy leakage under targeted attacks. For instance, we reduced targeted information exposure in LLaMA-3-8B from 144 instances in the baseline to just 1. Furthermore, we maintain strong contextual fidelity with a BLEU-1 score of 0.122, outperforming the existing SAGE method's 0.117. Finally, the framework operates as an asynchronous preprocessing module, introducing no additional latency to online inference, as all rewriting is executed as a one-time offline preprocessing step. To promote reproducibility, the source code of this work is publicly available at https://github.com/foursoils/Privacy-Preserving-RAG.

9. [Same Lesson, Different Story: Cross-Lingual Reconstruction of Cultural Narratives in Large Language Models](https://arxiv.org/abs/2606.24610v1)
   - Published: 2026-06-23 22:10
   - Authors: Jory Alshaalan, Haya Albaker, Abeer Aldayel, Aljawharah Alabdullatif, Rehab Alahmadi
   - Source: arXiv
   - Relevance score: 143
   - Match reasons: title matched "language model"; title matched "large language model"; summary matched "LLM"; summary matched "evaluation"
   - Categories: cs.CL
   - Tags: Evaluation / Method
   - Topics: LLM / Language Model
   - PDF: https://arxiv.org/pdf/2606.24610v1
   - Abstract: The evaluation of cultural grounding context becomes complex when multiple cultures convey the same moral lesson. This challenge is particularly relevant to large language models (LLMs), which produce narratives across a wide range of languages and cultural contexts. However, it remains uncertain whether these models preserve culturally grounded meaning when equivalent moral lessons are conveyed through distinct cultural forms. This study introduces a multilingual evaluation narrative framework that integrates a cross-linguistic collection of 414 proverbs spanning 15 languages and uses four LLMs to generate 13k narratives. By employing semantically equivalent proverbs as culturally grounded prompts, the analysis assesses whether models preserve meaning across languages, how cross-lingual conditioning influences narrative realization, and whether different model families converge on similar interpretations. Results indicate that cross-lingual prompting largely preserves proverb-level semantic meaning while systematically redistributing agency, social positioning, and narrative structure. Additionally, strong inter-model convergence is observed in both monolingual and cross-lingual settings, suggesting that multilingual LLMs rely on shared semantic abstractions despite architectural and linguistic differences. These findings shed light on the need for more comprehensive evaluations of cultural grounding. Relying exclusively on semantic similarity in multilingual narrative assessments may overestimate cultural preservation by neglecting culturally meaningful variations in narrative expression.

10. [Accuracy and Satisfaction in Multi-Turn LLM Dialogues for NFR Assessment](https://arxiv.org/abs/2606.24834v1)
    - Published: 2026-06-24 01:15
    - Authors: Ali Pourghasemi Fatideh, Wilder Baldwin, Maria Dhakal, Collin McMillan, Sepideh Ghanavati
    - Source: arXiv
    - Relevance score: 142
    - Match reasons: title matched "LLM"; summary matched "reasoning"; summary matched "agent"; summary matched "benchmark"
    - Categories: cs.AI
    - Tags: Evaluation / Application / Method
    - Topics: LLM / Agent
    - PDF: https://arxiv.org/pdf/2606.24834v1
    - Abstract: LLM-based dialogue assistants have become mainstream tools for software developers, yet current evaluation benchmarks focus exclusively on functional correctness. This leaves a critical gap in assessing the quality and accuracy of these conversations when handling Non-Functional Requirements (NFRs), which are inherently vague, context-dependent, and involve many parts of a program. Evaluating how well these systems support collaborative reasoning about NFRs requires methods that go beyond single-turn accuracy to capture both the correctness of the system's outputs and the quality of the multi-turn interaction. In this paper, we investigate the accuracy and quality of multi-turn conversations between developers and an LLM-based agent in the domain of Health Insurance Portability and Accountability Act (HIPAA) regulatory compliance. We hired 49 programmers to interact with GitHub Copilot to assess 148 HIPAA-derived NFRs against the iTrust codebase, a system designed to comply with HIPAA regulations, across three dimensions: requirement satisfaction level, reasoning, and code localization. We find that developers tend to agree with LLM assessments, but accuracy against expert ground truth is low. We model user satisfaction and find that longer system responses and more information-providing turns negatively affect user satisfaction, whereas proactive interactions positively affect it. Our findings provide insights for designing LLM-based dialogue systems that support NFR assessment.

11. [ScaleToT: Generalizing Structured LLM Reasoning for Billion-Scale Low-Activity User Modeling](https://arxiv.org/abs/2606.24605v1)
    - Published: 2026-06-23 22:05
    - Authors: Tianbao Ma, Chang Xi, Yichuan Zou, Chengen Li, Linxun Chen, Zilong Lu, et al.
    - Source: arXiv
    - Relevance score: 142
    - Match reasons: title matched "LLM"; title matched "reasoning"; summary matched "language model"; summary matched "large language model"
    - Categories: cs.AI
    - Tags: Application / Method
    - Topics: LLM / Language Model
    - PDF: https://arxiv.org/pdf/2606.24605v1
    - Abstract: Accurate user modeling often depends on rich interaction histories, which are unavailable for billions of low-activity users. Large Language Models (LLMs) can infer latent user states from static profiles, but this reasoning becomes unreliable when profiles are sparse, and applying an LLM to billions of users is prohibitively expensive. We present ScaleToT, which learns structured reasoning from a small LLM-processed subset and extends it to the broader low-activity user population. To improve reasoning reliability, ScaleToT constructs typed user-state chains with a bounded entropy-guided Tree-of-Thought (ToT) refinement procedure. To make this structured reasoning usable from sparse profiles, the teacher-curated chains are used to train a student model on static profiles through supervised fine-tuning (SFT) and Outcome-Driven Segment-Aware Implicit Reward Policy Optimization (OSIPO). ScaleToT then transfers the student's reasoning representations to a lightweight profile encoder, providing shared reasoning signals for the remaining users without LLM inference. We evaluate ScaleToT on lifetime value (LTV) prediction in a billion-scale advertising deployment. A randomized online A/B test increased LT30 by 6.738%, while offline reasoning covered only 7.32% of the potential population, greatly reducing compute cost compared with full-population reasoning.

12. [Scaling Laws for Task-Specific LLM Distillation](https://arxiv.org/abs/2606.24747v1)
    - Published: 2026-06-24 00:09
    - Authors: Lavinia Ghita, Dhruv Desai, Ioana Boier
    - Source: arXiv
    - Relevance score: 141
    - Match reasons: title matched "LLM"; summary matched "language model"; summary matched "large language model"; summary matched "reasoning"
    - Categories: cs.AI, cs.CE
    - Tags: Evaluation / Data / Application / Method
    - Topics: LLM / Language Model
    - PDF: https://arxiv.org/pdf/2606.24747v1
    - Abstract: Large Language Models (LLMs) achieve strong performance across a growing range of domains, yet their scale poses deployment challenges in applications where latency and cost constraints are critical. This paper derives empirical scaling laws for domain-specific LLM compression, quantifying how in-domain and general knowledge performance scale with dataset size, compression ratio, supervision format, and iterative pruning schedule. Using quantitative finance as our application domain, we compare logit-based and LoRA-based distillation under iterative structural pruning, introducing a blended chain-of-thought supervision loss that stabilizes KL-divergence distillation over reasoning traces. In-domain task quality degrades predictably under compression while general-knowledge benchmarks collapse well before the same point; supervision format is the key driver of this tradeoff, with chain-of-thought supervision actively recovering general knowledge that pruning erases. We release the headline dataset FinHeadlineMix, scaling law results, and practical recommendations to provide a reusable framework for domain-specific compression decisions.

13. [Reinforcement Learning for Computer-Use Agents with Autonomous Evaluation](https://arxiv.org/abs/2606.24515v1)
    - Published: 2026-06-23 20:46
    - Authors: Marta Sumyk, Oleksandr Kosovan
    - Source: arXiv
    - Relevance score: 141
    - Match reasons: title matched "agent"; title matched "evaluation"; summary matched "language model"; summary matched "RAG"
    - Categories: cs.AI, cs.HC
    - Tags: Evaluation / Method
    - Topics: Language Model / Agent
    - PDF: https://arxiv.org/pdf/2606.24515v1
    - Abstract: Computer-Use Agents (CUAs) execute high-level user goals by perceiving and acting directly within graphical user interfaces. However, reinforcement learning for CUAs remains difficult because open-ended desktop environments rarely provide scalable, machine-readable reward signals: task success is often visually grounded and hard to specify with handcrafted reward functions or dense manual labels. We propose an RL fine-tuning framework that uses autonomous vision-language evaluation as a scalable supervision signal for GUI agents. Given a final screenshot and the original instruction, a Vision-Language Model judges task completion and provides terminal feedback without task-specific heuristics or manual labels during policy optimization. Because autonomous evaluators are imperfect, we model their feedback as a noisy binary reward channel and derive a noise-corrected reward estimator for Proximal Policy Optimization. Experiments across macOSWorld, Windows Agent Arena, and OSWorld show that corrected evaluator rewards outperform both zero-shot baselines and raw evaluator rewards, improving success rates by an average of 12.6 percentage points over zero-shot performance and 5.1 points over raw evaluator fine-tuning. These results suggest that autonomous evaluation can serve as a practical reward signal for RL in GUI environments when evaluator noise is explicitly modeled and corrected.

14. [ReM-MoA: Reasoning Memory Sustains Mixture-of-Agents Scaling](https://arxiv.org/abs/2606.24437v1)
   - Published: 2026-06-23 19:14
   - Authors: Heng Ping, Arijit Bhattacharjee, Peiyu Zhang, Shixuan Li, Wei Yang, Ali Jannesari et al.
   - Source: arxiv
   - Relevance score: 140
   - Match reasons: title matched "reasoning"; title matched "agent"; summary matched "LLM"; summary matched "benchmark"
   - Categories: cs.AI
   - Tags: Evaluation / Method
   - Topics: LLM / Agent
   - PDF: https://arxiv.org/pdf/2606.24437v1
   - Abstract: Mixture-of-Agents (MoA) architectures improve inference-time scaling by organizing multiple LLM agents into layered reasoning pipelines. However, existing MoA variants fail to sustain gains as depth increases, exhibiting degradation, early plateauing, or saturation. We propose ReM-MoA, a memory-augmented MoA framework that sustains scaling through two mechanisms: (1) a Ranked Reasoning Memory that persistently stores and ranks reasoning traces from all layers using a comparative Reviewer Agent, and (2) a Curated Diversified Memory Routing scheme that exposes different agents to distinct combinations of successful and failed traces, preserving exploration diversity while propagating high-quality reasoning. We further introduce an optional multi-domain Reviewer distillation pipeline that improves ranking quality through frontier-model supervision. Across five reasoning benchmarks spanning math, formal logic, code, knowledge, and commonsense, ReM-MoA consistently outperforms prior MoA variants across both depth and width scaling, and its advantage widens with depth, establishing structured cross-layer reasoning memory as a key missing mechanism for scalable multi-agent inference.

15. [Qwen-AgentWorld: Language World Models for General Agents](https://arxiv.org/abs/2606.24597v1)
   - Published: 2026-06-23 21:53
   - Authors: Yuxin Zuo, Zikai Xiao, Li Sheng, Fei Huang, Jianhong Tu, Yuxuan Liu et al.
   - Source: arxiv
   - Relevance score: 138
   - Match reasons: title matched "agent"; summary matched "language model"; summary matched "reasoning"; summary matched "RAG"
   - Categories: cs.CL
   - Tags: Evaluation / Method
   - Topics: Language Model / Agent
   - PDF: https://arxiv.org/pdf/2606.24597v1
   - Abstract: A world model predicts environment dynamics based on current observations and actions, serving as a core cognitive mechanism for reasoning and planning. In this work, we investigate how world modeling based on language models can further push the boundaries of general agents. (i) We first focus on building foundation models for agentic environment simulation. We introduce Qwen-AgentWorld-35B-A3B and Qwen-AgentWorld-397B-A17B, the first language world models capable of simulating agentic environments covering 7 domains via long chain-of-thought reasoning. Leveraging more than 10M environment interaction trajectories of 7 domains in real-world environments, we develop Qwen-AgentWorld through a three-stage training pipeline: CPT injects general-purpose world modeling capabilities from the state transition dynamics and augmented professional corpora, SFT activates next-state-prediction reasoning, and RL sharpens simulation fidelity through a tailored framework with hybrid rubric-and-rule rewards. To evaluate language world models, we present AgentWorldBench, a comprehensive benchmark constructed from real-world interactions of 5 frontier models on 9 established benchmarks. Empirical results demonstrate that Qwen-AgentWorld significantly outperforms existing frontier models. (ii) Beyond foundation models, we further investigate two complementary paradigms through which world modeling enhances general agents. First, as a decoupled environment simulator, Qwen-AgentWorld supports scalable and controllable simulation of thousands of real-world environments for agentic RL, yielding gains that surpass real-environment training alone. Second, as a unified agent foundation model, world-model training acts as a highly effective warm-up that improves downstream performance across 7 agentic benchmarks. Code: https://github.com/QwenLM/Qwen-AgentWorld
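
The noise-corrected reward idea in item 13 (Reinforcement Learning for Computer-Use Agents with Autonomous Evaluation) can be illustrated with a standard unbiased estimator for a binary label seen through a noisy channel. This is a minimal sketch and not the paper's estimator: the function names, the error rates `fp_rate` and `fn_rate`, and the assumption that both rates are known and constant are all hypothetical.

```python
import random


def corrected_reward(observed: int, fp_rate: float, fn_rate: float) -> float:
    """Unbiased estimate of the true binary reward from a noisy evaluator verdict.

    Assumes P(observed=1 | true=0) = fp_rate and P(observed=0 | true=1) = fn_rate.
    Then E[observed | true] = fp_rate + true * (1 - fp_rate - fn_rate),
    and inverting that affine map gives the estimator below.
    """
    denom = 1.0 - fp_rate - fn_rate
    if denom <= 0:
        raise ValueError("evaluator must beat chance: fp_rate + fn_rate < 1")
    return (observed - fp_rate) / denom


def noisy_verdict(true_reward: int, fp_rate: float, fn_rate: float,
                  rng: random.Random) -> int:
    """Simulate an evaluator that flips the true outcome at the given rates."""
    if true_reward == 1:
        return 0 if rng.random() < fn_rate else 1
    return 1 if rng.random() < fp_rate else 0


def mean_corrected(true_reward: int, fp_rate: float, fn_rate: float,
                   n: int = 50_000, seed: int = 0) -> float:
    """Average the corrected reward over n simulated evaluator verdicts."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        verdict = noisy_verdict(true_reward, fp_rate, fn_rate, rng)
        total += corrected_reward(verdict, fp_rate, fn_rate)
    return total / n
```

Under these assumptions the corrected value averages to the true reward, at the price of higher variance per sample. In practice the error rates would have to be estimated, for example from a small hand-labeled set of evaluator verdicts.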

## Agent Runtime Security Observations

### Group at a Glance

- "Burnyard: Future of Malware Analysis" [Method]: Malware analysis is a critical aspect of modern cybersecurity. The prevailing industry practice, sandboxing, involves executing suspicious binaries within isol…
- "LLMs Prompted for Legal Context Object More: Overrefusal from Small On-Premises LLMs in Criminal Legal Context" [Application / Method]: While the validity of LLMs' use in the legal context remains subject to ethical and legal debate, legal professionals are already experimenting with personal L…
- "Red-Teaming the Agentic Red-Team" [Method]: The use of agentic systems to perform offensive security operations has moved from a theoretical possibility to a commoditized capability. However, while the c…
- "PHANTOM: A Large-Scale Dataset of Multimodal Adversarial Attacks for Vision-Language Models" [Evaluation / Data / Application / Method]: We introduce a large-scale, open-source dataset of pre-generated adversarial attacks for vision-language models (VLMs). The dataset is designed to be diverse,…
- "Securing LLM-Agent Long-Term Memory Against Poisoning: Non-Malleable, Origin-Bound Authority with Machine-Checked Guarantees" [Evaluation / Method]: LLM agents increasingly rely on persistent long-term memory, which creates a critical vulnerability that we study here: memory poisoning. An adversary can stor…

### Papers at a Glance

1. [Burnyard: Future of Malware Analysis](https://arxiv.org/abs/2606.24778v1)
   - Published: 2026-06-24 00:36
   - Authors: Rama Ramana Sharma Parnandi, Carter Yagemann
   - Source: arxiv
   - Relevance score: 47
   - Match reasons: summary matched "sandboxing"; has PDF; has rich summary; has complete metadata
   - Categories: cs.CR
   - Tags: Method
   - Topics: Sandboxing
   - PDF: https://arxiv.org/pdf/2606.24778v1
   - Abstract: Malware analysis is a critical aspect of modern cybersecurity. The prevailing industry practice, sandboxing, involves executing suspicious binaries within isolated virtual machines in large-scale data centers. However, this approach can unintentionally expose samples to public platforms such as VirusTotal and MalwareBazaar, and it is both resource-intensive and time-consuming. Burnyard addresses these limitations through a lightweight binary emulation platform that captures observable runtime behavior and records it as structured CSV event traces.

2. [LLMs Prompted for Legal Context Object More: Overrefusal from Small On-Premises LLMs in Criminal Legal Context](https://arxiv.org/abs/2606.24585v1)
   - Published: 2026-06-23 21:47
   - Authors: Anastasiia Kucherenko, François Brouchoud, Dimitri Percia David, Andrei Kucharavy
   - Source: arxiv
   - Relevance score: 44
   - Match reasons: summary matched "jailbreak"; has PDF; has rich summary; has complete metadata
   - Categories: cs.AI
   - Tags: Application / Method
   - Topics: LLM / Jailbreak
   - PDF: https://arxiv.org/pdf/2606.24585v1
   - Abstract: While the validity of LLMs' use in the legal context remains subject to ethical and legal debate, legal professionals are already experimenting with personal LLMs, if only for translation and reformulation. However, even such a seemingly innocuous use can introduce biases through case processing speed if LLM assistants selectively refuse assistance on certain topics. To better anticipate such biases, we investigate several modern small LLMs that are most likely to be used as on-device assistants, to assess the impact of overrefusal on legal prompts. Surprisingly, we find that authority-style prefixes ("you are acting as an assistant of the national supreme court", "[...] defense lawyer") systematically increase refusal rates by 2-20x over the no-prefix baseline, while a known role-play jailbreak prefix shows mixed effects, sharply increasing refusals in some models and barely shifting them in others. The finding suggests that small on-prem deployable LLMs are unstable under contextual framings that a real institutional user might naturally introduce, and further investigation is essential to minimize opportunities for bias.

3. [Red-Teaming the Agentic Red-Team](https://arxiv.org/abs/2606.24496v1)
   - Published: 2026-06-23 20:27
   - Authors: Dario Pasquini, Michal Bazyli, Taras Fedynyshyn, Artem Sorokin
   - Source: arxiv
   - Relevance score: 43
   - Match reasons: summary matched "guardrail"; has PDF; has rich summary; has complete metadata
   - Categories: cs.CR, cs.AI
   - Tags: Method
   - Topics: LLM / Agent
   - PDF: https://arxiv.org/pdf/2606.24496v1
   - Abstract: The use of agentic systems to perform offensive security operations has moved from a theoretical possibility to a commoditized capability. However, while the community has focused on creating more and more capable agents, less attention has been allocated to assessing the security of those systems. In this work, we present the first in-depth security analysis of the most widely used agentic systems for offensive security operations. We show that most of these tools share common design flaws that enable an active adversary to exfiltrate API keys, establish persistent footholds, and fully compromise the operator's machine, even when the agent operates inside a sandboxed container. To support our analysis, we introduce a full cyber kill chain for such agentic systems, capturing the progression from initial LLM manipulation to lateral movement, persistence, guardrail bypass, and sandbox escape. Building on our security analysis, we derive a robust architecture for agentic offensive-security tools and propose actionable, broadly applicable design principles that mitigate the disclosed attack paths at the architectural level.

4. [PHANTOM: A Large-Scale Dataset of Multimodal Adversarial Attacks for Vision-Language Models](https://arxiv.org/abs/2606.24388v1)
   - Published: 2026-06-23 18:20
   - Authors: Simone Gallivanone, Hossein Khodadadi, Mauro Dore, Mauro Medda, Nicola Franco
   - Source: arxiv
   - Relevance score: 41
   - Match reasons: summary matched "guardrail"; has PDF; has rich summary; has complete metadata
   - Categories: cs.AI, cs.LG
   - Tags: Evaluation / Data / Application / Method
   - Topics: Language Model / Benchmark
   - PDF: https://arxiv.org/pdf/2606.24388v1
   - Abstract: We introduce a large-scale, open-source dataset of pre-generated adversarial attacks for vision-language models (VLMs). The dataset is designed to be diverse, representative, and practical, extending existing benchmarks by covering 10 high-level categories and 55 subcategories of harmful intents. Our primary goal is to make adversarial data accessible to the research community, given the computational cost and complexity of generating large numbers of attacks. The dataset comprises 47,524 adversarial samples, generated using state-of-the-art attack strategies from recent literature. Our work complements existing efforts by consolidating and extending prior benchmarks from multiple established sources, resulting in 7,826 intents, and introduces an additional category to broaden coverage. This provides realistic evaluation resources for studying model robustness and alignment. Our dataset intends to enable researchers and practitioners to systematically evaluate the robustness and safety of VLMs, fine-tune attack-generation models, and develop or stress-test defensive guardrails under diverse adversarial conditions. By releasing this resource, we aim to lower the barrier to adversarial research and foster more reproducible, comprehensive, and comparable evaluations of VLM safety.

5. [Securing LLM-Agent Long-Term Memory Against Poisoning: Non-Malleable, Origin-Bound Authority with Machine-Checked Guarantees](https://arxiv.org/abs/2606.24322v1)
   - Published: 2026-06-23 16:57
   - Authors: Yedidel Louck
   - Source: arxiv
   - Relevance score: 39
   - Match reasons: summary matched "data exfiltration"; has PDF; has rich summary; has complete metadata
   - Categories: cs.CR
   - Tags: Evaluation / Method
   - Topics: LLM / Agent
   - PDF: https://arxiv.org/pdf/2606.24322v1
   - Abstract: LLM agents increasingly rely on persistent long-term memory, which creates a critical vulnerability that we study here: memory poisoning. An adversary can store untrusted content in one session that later steers a consequential action, such as a payment, a setting change, or data exfiltration, in a future session. Existing defenses base a memory item's authority to act on either its content (detection or trust-scoring) or its derivation history (lineage). We show that both signals are malleable. An attacker can launder an untrusted origin through three channels specific to LLM agents: the agent's own summarization, a trusted-tool echo, and manufactured corroboration. Each makes the content look benign and breaks or flips its derivation edge to "trusted." We formalize malleability for the memory write-retrieve-act pipeline and prove a machine-checked separation theorem. No content- or lineage-based defense is sound under laundering (T1), write-time origin binding is necessary (T2), and non-malleable origin-bound authority with Sybil-resistant corroboration-gated elevation is sufficient (T3). Our construction, TMA-NM (Tamper-evident Memory Authority, Non-Malleable), instantiates non-malleable information-flow control (IFC) for LLM-agent memory. A cross-defense, cross-attack, and cross-model benchmark over eight frontier models shows that existing defenses fail exactly where the theory predicts (up to 68% laundering attack-success), while TMA-NM reaches 0% attack success on both direct and laundering attacks across all models and channels, at full legitimate utility. We release the benchmark, harness, and machine-checked TLA+ models to support reproducibility.

6. [Pigeonholing: Bad prompts hurt models to collapse and make mistakes](https://arxiv.org/abs/2606.24267v1)
   - Published: 2026-06-23 15:52
   - Authors: Hyunji Nam, Keertana Chidambaram, Dorottya Demszky, Natasha Jaques
   - Source: arxiv
   - Relevance score: 38
   - Match reasons: summary matched "jailbreak"; has PDF; has rich summary; has complete metadata
   - Categories: cs.CL, cs.AI
   - Tags: Application / Method
   - Topics: LLM / Language Model
   - PDF: https://arxiv.org/pdf/2606.24267v1
   - Abstract: While in-context learning is generally shown to be effective in Large Language Models (LLMs), bad contexts can cause performance degradation and mode collapse, a phenomenon we call "pigeonholing." **Unintentionally bad** contexts can happen without malicious jailbreaking intents: For example, a user asks the model to justify an incorrect math theorem or fails to correct the model's buggy code. Specifically, we investigate "pigeonholing" in two scenarios: (1) when the user suggests a solution, and (2) when the conversation context includes the assistant's previous (incorrect) responses. Our experiments across 10 verifiable and open-ended tasks with 10 different models show that pigeonholing manifests in several ways: (1) repeating the incorrect answers from context (leading to 38-40% performance drop), (2) converging on a narrow set of answers in coding and text generation without exploring alternatives, and (3) flipping stance on controversial topics to align with the user or the assistant's previous claims. We find that pigeonholing worsens almost monotonically with the number of conversation turns (performance drops by an additional 14+% as repeated mistakes increase from 1 to 5), and pigeonholing-induced mode collapse can happen even when the provided example is correct. As a step toward mitigation, we propose RLVR with synthetic errors, which improves models by 43-60% under bad contexts compared to vanilla RLVR baselines.
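
The write-time origin binding described in item 5 (Securing LLM-Agent Long-Term Memory Against Poisoning) can be pictured as label propagation in the style of information-flow control. The sketch below is a deliberately simplified illustration and not the TMA-NM construction: the `Origin` levels, `MemoryItem`, `derive`, and `may_authorize` are hypothetical names, there is no cryptographic tamper evidence, and corroboration-gated elevation is left out.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Iterable


class Origin(IntEnum):
    UNTRUSTED = 0  # e.g. web content or third-party tool output
    USER = 1       # entered by the authenticated user


@dataclass(frozen=True)
class MemoryItem:
    text: str
    origin: Origin  # bound when the item is written; frozen blocks reassignment


def write(text: str, origin: Origin) -> MemoryItem:
    """Store new content together with the origin it arrived from."""
    return MemoryItem(text, origin)


def derive(text: str, sources: Iterable[MemoryItem]) -> MemoryItem:
    """A summary or tool echo inherits the least-trusted origin of its sources.

    Raises ValueError if no sources are given, so nothing derived can
    appear without an origin.
    """
    return MemoryItem(text, min(s.origin for s in sources))


def may_authorize(item: MemoryItem) -> bool:
    """Gate consequential actions on origin, not on how benign the text looks."""
    return item.origin >= Origin.USER
```

Because a derived item can never carry a higher origin than its weakest source, rewording untrusted content through summarization does not raise its authority in this toy model.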

## Terminal and SWE Agents Observations

### Group at a Glance

- "SHERLOC: Structured Diagnostic Localization for Code Repair Agents" [Method]: LLM agents solve repository-level coding tasks through multi-turn tool use, but utilize half their budget on locating faults before editing. Dedicated localiza…
- "NatureBench: Can Coding Agents Match the Published SOTA of Nature-Family Papers?" [Evaluation / Application / Method]: We introduce NatureBench, a cross-discipline benchmark of 90 tasks distilled from peer-reviewed Nature-family publications, designed to evaluate whether AI cod…
- "Bayesian control for coding agents" [Evaluation / Method]: Modern coding agents pair LLM generators with various tools, including cheap diagnostics and expensive verifiers. The tool-use decisions are typically governed…
- "Detecting AI Coding Agents in Open Source: A Validated Multi-Method Census of 180 Million Repositories" [Application / Method]: Generative AI coding agents are entering the open-source supply chain, yet their diverse and often invisible traces leave their prevalence poorly understood. W…
- "LemonHarness Technical Report" [Method]: As large language model (LLM) agents are applied to longer tasks, they increasingly modify workspace state across multiple rounds of iteration. However, agents…

### Papers at a Glance

1. [SHERLOC: Structured Diagnostic Localization for Code Repair Agents](https://arxiv.org/abs/2606.24820v1)
   - Published: 2026-06-24 01:05
   - Authors: Hovhannes Tamoyan, Sean Narenthiran, Erik Arakelyan, Mira Mezini, Boris Ginsburg
   - Source: arxiv
   - Relevance score: 105
   - Match reasons: title matched "code repair"; summary matched "SWE-bench"; summary matched "repository-level"; has PDF
   - Categories: cs.CL
   - Tags: Method
   - Topics: LLM / Agent
   - PDF: https://arxiv.org/pdf/2606.24820v1
   - Abstract: LLM agents solve repository-level coding tasks through multi-turn tool use, but utilize half their budget on locating faults before editing. Dedicated localization frameworks have emerged, yet are still evaluated as file retrieval rather than actionable diagnosis, producing locations without the diagnostic context a repair agent needs. We introduce SHERLOC (Structured Hypothesis-driven Exploration and Reasoning for Localization), a training-free framework pairing a reasoning LLM with compact repository tools and self-recovery, without fine-tuning or multi-agent orchestration. SHERLOC reaches state-of-the-art localization across model scales: 84.33% accuracy@1 on SWE-Bench Lite and 81.27% recall@1 on SWE-Bench Verified; at ~30B parameters, it matches or outperforms other agentic methods. Injecting our locations and diagnostic findings into repair agents yields, on average, +5.95 pp resolve rate on SWE-Bench Verified while cutting localization and total tokens by 36.7% and 23.1%.

2. [NatureBench: Can Coding Agents Match the Published SOTA of Nature-Family Papers?](https://arxiv.org/abs/2606.24530v1)
   - Published: 2026-06-23 20:58
   - Authors: Yuru Wang, Lejun Cheng, Yuxin Zuo, Sihang Zeng, Bingxiang He, Che Jiang et al.
   - Source: arxiv
   - Relevance score: 65
   - Match reasons: title matched "coding agent"; has PDF; has rich summary; has complete metadata
   - Categories: cs.CL
   - Tags: Evaluation / Application / Method
   - Topics: Agent / Benchmark
   - PDF: https://arxiv.org/pdf/2606.24530v1
   - Abstract: We introduce NatureBench, a cross-discipline benchmark of 90 tasks distilled from peer-reviewed Nature-family publications, designed to evaluate whether AI coding agents can move beyond reproduction toward discovery on real scientific problems. NatureBench is built on NatureGym, an automated pipeline that constructs a standardized, per-task containerized environment from a source paper, addressing the environment-fragmentation problem that has limited the credibility of prior agent-on-research benchmarks. Evaluating ten frontier agent configurations under a strict web-search-disabled protocol, we find that the strongest model surpasses SOTA on only 17.8% of tasks under the g>0.1 criterion. Analysis of method pathways reveals that agents succeed primarily through methodological translation, converting scientific tasks into familiar supervised prediction problems, rather than through genuine scientific invention. Failures are dominated by wrong method choice and insufficient compute budget, not by task misunderstanding. We release the benchmark, the NatureGym pipeline, and a public leaderboard with maintainer-side reproduction. Code: https://github.com/FrontisAI/NatureBench

3. [Bayesian control for coding agents](https://arxiv.org/abs/2606.24453v1)
   - Published: 2026-06-23 19:41
   - Authors: Theodore Papamarkou, Vladislav Smirnov, Viktor Mazanov, Artem Vazhentsev, Preslav Nakov, Timothy Baldwin et al.
   - Source: arxiv
   - Relevance score: 64
   - Match reasons: title matched "coding agent"; has PDF; has rich summary; has complete metadata
   - Categories: cs.AI, cs.CL
   - Tags: Evaluation / Method
   - Topics: LLM / Agent
   - PDF: https://arxiv.org/pdf/2606.24453v1
   - Abstract: Modern coding agents pair LLM generators with various tools, including cheap diagnostics and expensive verifiers. The tool-use decisions are typically governed by orchestrators that often use fixed rules and ignore uncertainty. We formulate orchestration as cost-sensitive sequential hypothesis testing: a Bayesian controller maintains a belief over candidate correctness and dynamically decides whether to gather more evidence, refine the candidate, verify it, or stop. Across six generators and nine coding benchmarks, Bayesian control proves to be most valuable when verification is costly and critics are informative but imperfect. Beyond control, the belief state yields an interpretable correctness score that outperforms token-probability and raw tool-success baselines for uncertainty quantification.

4. [Detecting AI Coding Agents in Open Source: A Validated Multi-Method Census of 180 Million Repositories](https://arxiv.org/abs/2606.24429v1)
   - Published: 2026-06-23 19:05
   - Authors: Arsham Khosravani, Audris Mockus
   - Source: arxiv
   - Relevance score: 63
   - Match reasons: title matched "coding agent"; has PDF; has rich summary; has complete metadata
   - Categories: cs.SE, cs.AI
   - Tags: Application / Method
   - Topics: Agent / Coding Agent
   - PDF: https://arxiv.org/pdf/2606.24429v1
   - Abstract: Generative AI coding agents are entering the open-source supply chain, yet their diverse and often invisible traces leave their prevalence poorly understood. We introduce a multi-layered detection framework that integrates configuration-file scanning, commit-message analysis, author-identity matching, and bot-signature lookup across World of Code (180M+ Git repositories), classifying agent traces into four behavioral types. No single method captures more than a fraction of activity: multi-method detection identifies 850,157 Claude Code commits in one snapshot, of which bot-account lookup (the signal most adoption studies rely on) recovers only 28,154 (3.3%), a 30x relative-recall gap, so single-signal prevalence estimates are biased low by at least this factor. Every detection pattern is hand-validated (495 labels) with per-cell precision and Wilson confidence intervals. Across snapshots from December 2024 to April 2026, commit-attributed agents generate over 320,000 commits per month; Claude Code leads (886,122 commits across 17,295 projects) and dominates silent, configuration-file-only adoption (21,078 projects). Compared against an independent pull-request census (AIDev), the two channels capture nearly disjoint agent populations (a PR census misses 79% of commit-detected Claude Code adopters and essentially all Codex adopters) and different kinds of work: PR-deployed cloud agents (Codex, Cursor) surface as feature work, while commit-deployed in-editor agents (Claude Code, OpenHands, Aider) surface as maintenance. The observed work profile follows deployment and detection mode rather than the tool itself, so no single channel is representative.

5. [LemonHarness Technical Report](https://arxiv.org/abs/2606.24311v1)
   - Published: 2026-06-23 16:44
   - Authors: Kailong Ren, Fubo Sun, Jiachen Liu, Liu Yang, Zimo Yin, Jiaying Li et al.
   - Source: arxiv
   - Relevance score: 39
   - Match reasons: summary matched "Terminal-Bench"; has PDF; has rich summary; has complete metadata
   - Categories: cs.AI
   - Tags: Method
   - Topics: LLM / Language Model
   - PDF: https://arxiv.org/pdf/2606.24311v1
   - Abstract: As large language model (LLM) agents are applied to longer tasks, they increasingly modify workspace state across multiple rounds of iteration. However, agents typically observe only tool outputs and log fragments, while the actual state changes occur in the file system. Without explicit workspace boundaries, state-changing operations such as file writes and temporary artifact generation may scatter changes across paths. Over time, these weakly constrained changes accumulate, making states such as modified files difficult to track. This paper presents LemonHarness, an integrated execution framework for long-horizon agents. LemonHarness establishes an explicit execution boundary by constraining state-changing operations within a clearly defined workspace and bringing model invocation, tool execution, and rule knowledge within a single controlled boundary. State-changing operations, including file writes, dependency installation, and temporary artifact creation, are executed through structured tool interfaces, with execution feedback recorded as observations available to subsequent model decisions. The system also introduces a reusable rule knowledge base, which turns recurring execution rules and acceptance criteria into runtime knowledge. LemonHarness further adds a time-aware execution mechanism that exposes elapsed and remaining budget to the model, so it can rebalance exploration, implementation, and validation effort as time pressure shifts and avoid timeouts from long waits or excessive verification. On Terminal-Bench 2.0, LemonHarness with GPT-5.3-CodeX reached 84.49% accuracy over 445 trials; pairing the same framework with the stronger GPT-5.5 backbone raised the average accuracy to 86.52% across five jobs. The results suggest that a unified runtime boundary, callable rule knowledge, and time-aware execution can improve the stability of long-horizon agent execution.
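
The belief-driven orchestration described in item 3 (Bayesian control for coding agents) can be sketched as a Bayes update over candidate correctness followed by a threshold policy. This is a minimal illustration under stated assumptions, not the paper's controller: the critic rates `tpr` and `fpr`, the thresholds, and all function names are hypothetical, and fixed thresholds stand in for an explicit cost model.

```python
from typing import List, Sequence, Tuple


def update_belief(prior: float, critic_passed: bool, tpr: float, fpr: float) -> float:
    """Bayes update of P(candidate is correct) after one cheap critic signal.

    tpr = P(pass | correct), fpr = P(pass | incorrect); both assumed in (0, 1).
    """
    like_correct = tpr if critic_passed else 1.0 - tpr
    like_wrong = fpr if critic_passed else 1.0 - fpr
    evidence = prior * like_correct + (1.0 - prior) * like_wrong
    return prior * like_correct / evidence


def choose_action(belief: float, accept_at: float = 0.95,
                  verify_at: float = 0.85, refine_below: float = 0.2) -> str:
    """Map the current belief to one of the four orchestration decisions."""
    if belief >= accept_at:
        return "stop"
    if belief >= verify_at:
        return "verify"           # worth paying for the expensive verifier
    if belief <= refine_below:
        return "refine"           # candidate is probably wrong, regenerate
    return "gather_evidence"      # run another cheap diagnostic


def run_controller(prior: float, signals: Sequence[bool],
                   tpr: float = 0.9, fpr: float = 0.3) -> Tuple[str, float, List[float]]:
    """Consume cheap critic signals until the policy leaves the evidence band."""
    belief = prior
    trace: List[float] = []
    for passed in signals:
        if choose_action(belief) != "gather_evidence":
            break
        belief = update_belief(belief, passed, tpr, fpr)
        trace.append(belief)
    return choose_action(belief), belief, trace
```

For example, `run_controller(0.5, [True, True, True])` stops consuming signals once the belief crosses `verify_at`, while a single failing signal from the same prior pushes the belief low enough to trigger a refine. The final belief doubles as a correctness score for the candidate.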