# Daily Paper Digest

- Generated: 2026-06-17 14:22:19 (Asia/Shanghai)
- Search window: last 24 hours
- Hit overview: LM=15, Agent Runtime Security=3, Terminal and SWE Agents=5
- Ranking strategy: hybrid (relevance first, published_at tie-break)
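
The hybrid ranking named above can be illustrated with a minimal sketch. The digest does not show its actual implementation, so the `Paper` class and `hybrid_sort` function are hypothetical names, and the newest-first direction of the tie-break is an assumption (it is consistent with how equally scored entries are ordered in the paper list of this digest).

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Paper:
    """Hypothetical record; field names are assumptions, not the digest's schema."""
    title: str
    relevance: int          # higher means more relevant
    published_at: datetime  # publication timestamp


def hybrid_sort(papers):
    """Order by relevance (descending); break ties by published_at (newest first)."""
    # Python's sort is stable, so sorting by the secondary key first and the
    # primary key second yields the combined ordering.
    by_recency = sorted(papers, key=lambda p: p.published_at, reverse=True)
    return sorted(by_recency, key=lambda p: p.relevance, reverse=True)


papers = [
    Paper("WEQA", 162, datetime(2026, 6, 17, 0, 45)),
    Paper("The Measurement Gap", 162, datetime(2026, 6, 17, 0, 57)),
    Paper("Reading between the Lines", 164, datetime(2026, 6, 16, 23, 1)),
]
for paper in hybrid_sort(papers):
    print(paper.relevance, paper.published_at, paper.title)
```

The two-pass stable sort avoids negating timestamps in the sort key, which keeps the sketch valid for any comparable `published_at` type.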

## Today's Highlights

- Topic "LLM": 15 hits across LM and Agent Runtime Security. Representative papers include "Evaluating Open-Source LLMs for Multi-Label ATT&CK Technique Classification on CTI Reports" and "Reading between the Lines: Leveraging Large Language Models for Global Dementia and Depression Assessment from Clinical Interviews".
- Topic "Agent": 14 hits across LM, Agent Runtime Security, and others. Representative papers include "WEQA: Wearable hEalth Question Answering with Query-Adaptive Agentic Reasoning" and "LLM Consumer Behavior Theory: Foundations of a Novel Research Field".
- Topic "Benchmark": 10 hits across LM, Agent Runtime Security, and others. Representative papers include "Evaluating Open-Source LLMs for Multi-Label ATT&CK Technique Classification on CTI Reports" and "The Measurement Gap in the Automation of EU Law: Benchmarking Doctrinal Legal Reasoning under the EU AI Act".
- Topic "Language Model": 6 hits across LM and Terminal and SWE Agents. Representative papers include "Reading between the Lines: Leveraging Large Language Models for Global Dementia and Depression Assessment from Clinical Interviews" and "The Measurement Gap in the Automation of EU Law: Benchmarking Doctrinal Legal Reasoning under the EU AI Act".
- Topic "Coding Agent": 1 hit in Terminal and SWE Agents. Representative paper: "All Smoke, No Alarm: Oracle Signals in Agent-Authored Test Code".

## Section Status

- LM: 15 papers
- Agent Runtime Security: 3 papers
- Terminal and SWE Agents: 5 papers

## Topic Focus

### LLM

- Hits: 15
- Groups covered: LM, Agent Runtime Security
- Representative papers: "Evaluating Open-Source LLMs for Multi-Label ATT&CK Technique Classification on CTI Reports", "Reading between the Lines: Leveraging Large Language Models for Global Dementia and Depression Assessment from Clinical Interviews", "WEQA: Wearable hEalth Question Answering with Query-Adaptive Agentic Reasoning"
- Quick read:
  - "Evaluating Open-Source LLMs for Multi-Label ATT&CK Technique Classification on CTI Reports" [Evaluation / Data / Application / Method]: Classifying Cyber Threat Intelligence (CTI) using MITRE Adversarial Tactics, Techniques, and Common Knowledge (ATT&CK) is essential for proactive defense, but…
  - "Reading between the Lines: Leveraging Large Language Models for Global Dementia and Depression Assessment from Clinical Interviews" [Method]: Dementia and depression are the most prevalent neuropsychiatric disorders in geriatric populations, and their overlapping symptoms pose major challenges for di…

### Agent

- Hits: 14
- Groups covered: LM, Agent Runtime Security, Terminal and SWE Agents
- Representative papers: "WEQA: Wearable hEalth Question Answering with Query-Adaptive Agentic Reasoning", "LLM Consumer Behavior Theory: Foundations of a Novel Research Field", "RubricsTree: Scalable and Evolving Open-Ended Evaluation of Personal Health Agents across Health Memory and Medical Skills"
- Quick read:
  - "WEQA: Wearable hEalth Question Answering with Query-Adaptive Agentic Reasoning" [Evaluation / Data / Application / Method]: Language models are remarkably capable at medical question answering, in some cases surpassing the accuracy of general physicians. However, answering questions…
  - "LLM Consumer Behavior Theory: Foundations of a Novel Research Field" [Application / Method]: Large language models (LLMs) are increasingly deployed as autonomous agents that make consumption decisions on behalf of users. This shift raises fundamental q…

### Benchmark

- Hits: 10
- Groups covered: LM, Agent Runtime Security, Terminal and SWE Agents
- Representative papers: "Evaluating Open-Source LLMs for Multi-Label ATT&CK Technique Classification on CTI Reports", "The Measurement Gap in the Automation of EU Law: Benchmarking Doctrinal Legal Reasoning under the EU AI Act", "Your AI Travel Agent Would Book You a Bullfight: An Agentic Benchmark for Implicit Animal Welfare in Frontier AI Models"
- Quick read:
  - "Evaluating Open-Source LLMs for Multi-Label ATT&CK Technique Classification on CTI Reports" [Evaluation / Data / Application / Method]: Classifying Cyber Threat Intelligence (CTI) using MITRE Adversarial Tactics, Techniques, and Common Knowledge (ATT&CK) is essential for proactive defense, but…
  - "The Measurement Gap in the Automation of EU Law: Benchmarking Doctrinal Legal Reasoning under the EU AI Act" [Evaluation / Method]: Large language models now produce legal text of at least median quality, yet no existing benchmark can evaluate whether they perform doctrinal legal reasoning,…

### Language Model

- Hits: 6
- Groups covered: LM, Terminal and SWE Agents
- Representative papers: "Reading between the Lines: Leveraging Large Language Models for Global Dementia and Depression Assessment from Clinical Interviews", "The Measurement Gap in the Automation of EU Law: Benchmarking Doctrinal Legal Reasoning under the EU AI Act", "From Reasoning Traces to Reusable Modules: Understanding Compositional Generalization in Language Model Reasoning"
- Quick read:
  - "Reading between the Lines: Leveraging Large Language Models for Global Dementia and Depression Assessment from Clinical Interviews" [Method]: Dementia and depression are the most prevalent neuropsychiatric disorders in geriatric populations, and their overlapping symptoms pose major challenges for di…
  - "The Measurement Gap in the Automation of EU Law: Benchmarking Doctrinal Legal Reasoning under the EU AI Act" [Evaluation / Method]: Large language models now produce legal text of at least median quality, yet no existing benchmark can evaluate whether they perform doctrinal legal reasoning,…

### Coding Agent

- Hits: 1
- Groups covered: Terminal and SWE Agents
- Representative papers: "All Smoke, No Alarm: Oracle Signals in Agent-Authored Test Code"
- Quick read:
  - "All Smoke, No Alarm: Oracle Signals in Agent-Authored Test Code" [Method]: Software practitioners increasingly use AI coding agents that generate test code alongside production code in open source pull requests (PRs). Recent studies r…

## LM Watch

### Group Quick Read

- "Evaluating Open-Source LLMs for Multi-Label ATT&CK Technique Classification on CTI Reports" [Evaluation / Data / Application / Method]: Classifying Cyber Threat Intelligence (CTI) using MITRE Adversarial Tactics, Techniques, and Common Knowledge (ATT&CK) is essential for proactive defense, but…
- "Reading between the Lines: Leveraging Large Language Models for Global Dementia and Depression Assessment from Clinical Interviews" [Method]: Dementia and depression are the most prevalent neuropsychiatric disorders in geriatric populations, and their overlapping symptoms pose major challenges for di…
- "The Measurement Gap in the Automation of EU Law: Benchmarking Doctrinal Legal Reasoning under the EU AI Act" [Evaluation / Method]: Large language models now produce legal text of at least median quality, yet no existing benchmark can evaluate whether they perform doctrinal legal reasoning,…
- "WEQA: Wearable hEalth Question Answering with Query-Adaptive Agentic Reasoning" [Evaluation / Data / Application / Method]: Language models are remarkably capable at medical question answering, in some cases surpassing the accuracy of general physicians. However, answering questions…
- "From Reasoning Traces to Reusable Modules: Understanding Compositional Generalization in Language Model Reasoning" [Evaluation / Method]: Post-training pipelines that combine supervised fine-tuning (SFT) with reinforcement learning (RL) have emerged as the key recipe for transforming large langua…

### Paper Overview

1. [Evaluating Open-Source LLMs for Multi-Label ATT&CK Technique Classification on CTI Reports](https://arxiv.org/abs/2606.18166v1)
   - Published: 2026-06-17 01:04
   - Authors: Ahmed Ryan, Saad Sakib Noor, Md Erfan, Shaswata Mitra, Sudip Mittal, Md Rayhanur Rahman
   - Source: arxiv
   - Relevance score: 176
   - Match reasons: title matched "LLM"; summary matched "language model"; summary matched "large language model"; summary matched "reasoning"
   - Categories: cs.CR, cs.LG
   - Tags: Evaluation / Data / Application / Method
   - Topics: LLM / Benchmark
   - PDF: https://arxiv.org/pdf/2606.18166v1
   - Abstract: Classifying Cyber Threat Intelligence (CTI) using MITRE Adversarial Tactics, Techniques, and Common Knowledge (ATT&CK) is essential for proactive defense, but historically required extensive human effort. Pre-Large Language Model (LLM) automation sped up this process, but could not resolve the complex language and multi-step attack patterns found in unstructured CTI reports. LLMs addressed previous limitations by using contextual reasoning to understand unstructured text. However, current evaluations rely on simplified, single-technique sentences that ignore the complexity of real-world CTI reports, which often leads to inflated performance results. Consequently, the baseline performance of open-source LLMs on complex unstructured CTI reports remains unevaluated. To address this gap, we constructed a ground-truth dataset of 2,076 human-annotated sentences (1,281 technique-positive, 795 negative) from 83 complex unstructured CTI reports. These sentences were mapped to 114 unique ATT&CK techniques using a six-phase annotation process, achieving κ = 0.68 inter-annotator agreement. Using this dataset, we evaluated seven open-source LLMs ranging from 8B to 236B parameters across prompt strategy and temperature configurations. The highest-performing LLM achieved a micro-averaged F1 score of 0.22, establishing the empirical baseline for multi-label ATT&CK classification on complex unstructured CTI. Parameter size showed a statistically significant positive correlation with F1 score. Prompt strategy and temperature produced no statistically significant gains across model configurations. These results indicate that current open-source LLMs are insufficient for production-grade ATT&CK classification. The dataset, benchmark, and findings provide a reproducible foundation for future CTI research.

2. [Reading between the Lines: Leveraging Large Language Models for Global Dementia and Depression Assessment from Clinical Interviews](https://arxiv.org/abs/2606.18019v1)
   - Published: 2026-06-16 23:01
   - Authors: Franziska Braun, Alea Rüggeberg, Thomas Ranzenberger, Hartmut Lehfeld, Thomas Hillemacher, Tobias Bocklet, et al.
   - Source: arxiv
   - Relevance score: 164
   - Match reasons: title matched "language model"; title matched "large language model"; title matched "RAG"; summary matched "LLM"
   - Categories: eess.AS, cs.CL, cs.SD
   - Tags: Method
   - Topics: LLM / Language Model
   - PDF: https://arxiv.org/pdf/2606.18019v1
   - Abstract: Dementia and depression are the most prevalent neuropsychiatric disorders in geriatric populations, and their overlapping symptoms pose major challenges for differential diagnosis. In this study, we investigate open-weights Large Language Models (LLMs) for predicting dementia and depression severity from speech samples collected during standardized history taking interviews with 154 German-speaking subjects. We introduce an observer-based Global Depression Scale (GDS-D) aligned with the established Global Deterioration Scale (GDS), enabling parallel global staging of affective and cognitive symptoms. We compare three LLMs (Mistral 3.1, DeepHermes, Qwen3) in two settings: (1) zero-shot prediction and (2) LLM-based feature extraction for Support Vector Regression, using human and pause-enriched transcripts. Results show that LLMs effectively predict depression severity in zero-shot settings (best MAE of 0.60), while dementia assessment benefits substantially from structured feature extraction (best MAE of 0.78), reducing errors by up to 35% over zero-shot baselines. Pause-enriched transcripts achieve competitive performance with human transcriptions, demonstrating the viability of fully automatic screening pipelines for differential neuropsychiatric assessment.

3. [The Measurement Gap in the Automation of EU Law: Benchmarking Doctrinal Legal Reasoning under the EU AI Act](https://arxiv.org/abs/2606.18158v1)
   - Published: 2026-06-17 00:57
   - Authors: Michèle Finck
   - Source: arxiv
   - Relevance score: 162
   - Match reasons: title matched "reasoning"; title matched "benchmark"; summary matched "language model"; summary matched "large language model"
   - Categories: cs.CY, cs.AI, cs.CL
   - Tags: Evaluation / Method
   - Topics: Benchmark / Language Model
   - PDF: https://arxiv.org/pdf/2606.18158v1
   - Abstract: Large language models now produce legal text of at least median quality, yet no existing benchmark can evaluate whether they perform doctrinal legal reasoning, which forms the interpretive core of legal work, rather than the ancillary, paralegal tasks that most current legal-AI evaluations measure. This measurement gap is not only methodological but legal: the EU AI Act makes "appropriate accuracy" a binding requirement for high-risk AI used in the judicial domain, yet that requirement cannot acquire operational content without the very doctrinal-reasoning benchmark the field lacks.

4. [WEQA: Wearable hEalth Question Answering with Query-Adaptive Agentic Reasoning](https://arxiv.org/abs/2606.18147v1)
   - Published: 2026-06-17 00:45
   - Authors: Yuwei Zhang, Tong Xia, Bianca Emmerich, Yu Yvonne Wu, Dimitris Spathis, Xin Liu, et al.
   - Source: arxiv
   - Relevance score: 162
   - Match reasons: title matched "reasoning"; title matched "agent"; summary matched "language model"; summary matched "LLM"
   - Categories: cs.AI
   - Tags: Evaluation / Data / Application / Method
   - Topics: LLM / Agent
   - PDF: https://arxiv.org/pdf/2606.18147v1
   - Abstract: Language models are remarkably capable at medical question answering, in some cases surpassing the accuracy of general physicians. However, answering questions about wearable health data remains challenging and understudied, as these ubiquitous sensors produce continuous, high-dimensional, and longitudinal data, which is non-trivial to align with text-centric distributions in LLM pretraining. The diversity of sensor modalities and user intents cannot be effectively handled by a fixed reasoning workflow or a single pretrained foundation model. To address these challenges, we propose WEQA, a query-adaptive agent framework that unifies LLM reasoning with specialized wearable analytical and modeling tools. An LLM controller is employed to synthesize execution plans and dynamically route each query to the appropriate combination of sensor analysis and pretrained models, and perform grounded response auditing with external knowledge. We also curate a benchmark spanning four open wearable datasets comprising analytic and predictive tasks in three different health domains. Experiments show that our framework is 24% more accurate than LLM and agentic baselines, and a blinded study with 12 medical experts and 8 users shows substantial gains in usefulness and clinical soundness.

5. [From Reasoning Traces to Reusable Modules: Understanding Compositional Generalization in Language Model Reasoning](https://arxiv.org/abs/2606.18089v1)
   - Published: 2026-06-16 23:55
   - Authors: Lingjing Kong, Xin Liu, Guangyi Chen, Martin Q. Ma, Xiangchen Song, Yuekai Sun, et al.
   - Source: arxiv
   - Relevance score: 161
   - Match reasons: title matched "language model"; title matched "reasoning"; summary matched "large language model"; summary matched "LLM"
   - Categories: cs.LG
   - Tags: Evaluation / Method
   - Topics: LLM / Language Model
   - PDF: https://arxiv.org/pdf/2606.18089v1
   - Abstract: Post-training pipelines that combine supervised fine-tuning (SFT) with reinforcement learning (RL) have emerged as the key recipe for transforming large language models (LLMs) into robust reasoners. We argue that this combined success is driven by compositional generalization, which we formalize through a hierarchical latent selection model. In this framework, reasoning traces are generated by a cascade of discrete latent selection variables corresponding to reusable atomic modules, including both skills (local operations) and routing mechanisms (how intermediate information is selected, reused, and composed). Within this model, we theoretically show that SFT and RL play asymmetric, complementary roles: SFT supplies the raw module materials in compositional traces, and RL decomposes those traces to identify the latent atomic modules and enable compositional generalization. We design controlled experiments to validate this theory. Our results demonstrate that RL can extract atomic modules from compound traces supplied by SFT and recombine them to solve new configurations. Moreover, we find that training on compound traces yields stronger generalization than training on isolated atomic modules. Finally, we investigate the relationship between SFT and RL data and identify an effective protocol in which SFT ensures coverage of all atomic modules through compositional traces, while RL focuses on novel compositions outside the SFT support to drive exploration.

6. [Small Initialization Matters for Large Language Models](https://arxiv.org/abs/2606.17945v1)
   - Published: 2026-06-16 21:53
   - Authors: Liangkai Hang, Junjie Yao, Zhiyu Li, Feiyu Xiong, Hongkang Yang, Zhi-Qin John Xu
   - Source: arxiv
   - Relevance score: 159
   - Match reasons: title matched "language model"; title matched "large language model"; summary matched "LLM"; summary matched "reasoning"
   - Categories: cs.AI
   - Tags: Method
   - Topics: LLM / Language Model
   - PDF: https://arxiv.org/pdf/2606.17945v1
   - Abstract: Large language models provide a tractable system for asking how intelligence itself emerges, rather than only how LLMs can be engineered. Although progress is usually attributed to scale, data and architecture, we show that parameter initialization is a gene-like determinant of training and, in particular, of model capacity. Reducing the initialization scale consistently improves pretraining, with the largest gains on reasoning-demanding tasks. We identify two widely used empirical settings that restrain the advantage of small initialization, and show how relaxing them restores favorable scaling. We further uncover a critical initialization that balances the reasoning and training. Mechanistically, small initialization drives a distinct developmental trajectory: parameters first condense into low-complexity structures and later expand into richer representations, giving concrete form to the idea that compression is intelligence. Token-level analyses show that the gains concentrate on non-trivial, context-constrained predictions rather than all tokens uniformly. These results motivate a simple $\gamma$-initialization rule: expose initialization range as an explicit knob and use small initialization by default, an almost cost-free intervention that improves pretraining and strengthens reasoning across model scales.

7. [LLM Consumer Behavior Theory: Foundations of a Novel Research Field](https://arxiv.org/abs/2606.18005v1)
   - Published: 2026-06-16 22:51
   - Authors: Manon Reusens, Sofie Goethals, David Martens
   - Source: arxiv
   - Relevance score: 156
   - Match reasons: title matched "LLM"; summary matched "language model"; summary matched "large language model"; summary matched "agent"
   - Categories: cs.AI, econ.GN
   - Tags: Application / Method
   - Topics: LLM / Agent
   - PDF: https://arxiv.org/pdf/2606.18005v1
   - Abstract: Large language models (LLMs) are increasingly deployed as autonomous agents that make consumption decisions on behalf of users. This shift raises fundamental questions for consumer theory, which has traditionally modeled humans as the primary decision-makers. In this paper, we introduce LLM Consumer Behavior Theory, a new field of study concerned with analyzing consumer behavior in agentic markets. Drawing on classical and behavioral economics alongside recent advances in Natural Language Processing, we formalize how human preferences are reflected and acted upon by LLM-based agents, and how agent-level decisions aggregate into market demand. We unify previously fragmented literature on LLM decision-making, human behavior simulation, and preference elicitation under a common economic lens, highlighting where assumptions, such as rationality and heterogeneity, may fail in agentic markets. Rather than providing empirical validation, this paper outlines the scope of LLM consumer behavior and identifies open research questions related to alignment, preference representation, and market dynamics.

8. [RubricsTree: Scalable and Evolving Open-Ended Evaluation of Personal Health Agents across Health Memory and Medical Skills](https://arxiv.org/abs/2606.18203v1)
   - Published: 2026-06-17 01:34
   - Authors: Weizhi Zhang, Zechen Li, Hamid Palangi, Ben Graef, A. Ali Heydari, Simon A. Lee, et al.
   - Source: arxiv
   - Relevance score: 145
   - Match reasons: title matched "agent"; title matched "evaluation"; summary matched "LLM"; summary matched "alignment"
   - Categories: cs.CL, cs.AI
   - Tags: Evaluation / Data / Application / Method
   - Topics: LLM / Agent
   - PDF: https://arxiv.org/pdf/2606.18203v1
   - Abstract: LLM-empowered personal health agents with access to user health (sensor) metrics offer a promising pathway to alleviate global disparities in healthcare access. However, large-scale clinical deployment remains constrained by an open-ended evaluation bottleneck: physician annotation is reliable but costly and unscalable, while LLM-as-a-judge evaluators are scalable but subjective, inconsistent, and sometimes clinically misaligned. We introduce RubricsTree, a scalable evaluation framework with an expert-aligned hierarchical taxonomy of over 100 atomic, clinically-verifiable Boolean rubrics, evolving from the insights of 4,000 real user queries through an iterative human-in-the-loop curation protocol with an expertise panel led by an experienced physician. A context-aware adaptive router activates only the relevant auto-weighted rubric subset per query, providing the throughput needed for scalable evaluation with expert-aligned quality. Through a systematic meta-evaluation, we show that RubricsTree (i) substantially exceeds a strong large-scale evaluation baseline in expert alignment on challenging open-ended queries; (ii) reliably penalizes contextually degraded responses; and (iii) when used as structured instructions, text feedback, or training rewards for performance optimization, yields up to ~66% relative gains on HealthBench for Gemini, GPT, and Qwen model families. RubricsTree thus provides a scalable, auditable, and evolving evaluation infrastructure required for the continuous optimization of product-level personal healthcare AI.

9. [Your AI Travel Agent Would Book You a Bullfight: An Agentic Benchmark for Implicit Animal Welfare in Frontier AI Models](https://arxiv.org/abs/2606.18142v1)
   - Published: 2026-06-17 00:42
   - Authors: Jasmine Brazilek, Oliver Tulio, Joel Christoph, Miles Tidmarsh, Carol Kline, Arturs Kanepajs
   - Source: arxiv
   - Relevance score: 144
   - Match reasons: title matched "agent"; title matched "benchmark"; summary matched "reasoning"; summary matched "evaluation"
   - Categories: cs.AI, cs.CL, cs.CY
   - Tags: Evaluation / Application / Method
   - Topics: Agent / Benchmark
   - PDF: https://arxiv.org/pdf/2606.18142v1
   - Abstract: AI agents are moving from advisors to actors, booking travel, planning menus, and running procurement on behalf of users. Existing benchmarks for AI and animal welfare evaluate model text responses to question-answer prompts, leaving open whether the welfare reasoning surfaced in those responses transfers to agentic deployment where the model must take actions with tools. We introduce TAC (Travel Agent Compassion), the first agentic benchmark measuring whether AI agents avoid options involving animal exploitation when acting on behalf of users. TAC presents an AI agent with twelve hand-authored travel booking scenarios across six categories of animal exploitation, augmented to forty-eight samples to control for price, rating, and position confounds. We evaluate seven frontier models from four labs. Every model scores below the chance level of sixty-four percent, with the best performer (Claude Opus 4.7) at fifty-three percent. A single welfare-aware sentence in the system prompt yields gains of forty-seven to sixty-three percentage points in Claude and GPT-5.5, twenty-six points in GPT-5.2, and under twelve points in DeepSeek and Gemini. An auxiliary Inspect Scout audit of 288 base-condition transcripts from the top two performers, using Gemini 2.5 Flash Lite as judge, flags zero transcripts for evaluation awareness, suggesting the below-chance rates do not stem from the models recognising the evaluation. We discuss implications for category-level variation across cultural domains, the limits of text-response welfare benchmarks, and the EU General-Purpose AI Code of Practice systemic risk framework.

10. [Querying an astronomical database using large language models: the ALeRCE text-to-SQL system](https://arxiv.org/abs/2606.18108v1)
    - Published: 2026-06-17 00:12
    - Authors: P. A. Estevez, J. Espejo-Moreira, S. Sanfeliu-Alvarez, F. Forster, A. M. Munoz Arancibia, G. Cabrera-Vives, et al.
    - Source: arxiv
    - Relevance score: 143
    - Match reasons: title matched "language model"; title matched "large language model"; summary matched "LLM"; summary matched "in-context learning"
    - Categories: astro-ph.IM, cs.AI
    - Tags: Data / Method
    - Topics: LLM / Language Model
    - PDF: https://arxiv.org/pdf/2606.18108v1
    - Abstract: We develop a text-to-SQL (structured query language) system based on large language models (LLMs) using in-context learning and apply it to the Automatic Learning for the Rapid Classification of Events (ALeRCE) astronomical database. ALeRCE is a community broker for the Zwicky Transient Facility and the Vera C. Rubin Observatory. The system enables users to query the database in natural language (NL) and generates executable SQL queries. To develop and evaluate the system, we constructed a dataset of 110 NL/SQL pairs. We propose a step-by-step generation framework comprising four modules: schema linking, query classification, prompt decomposition, and self-correction. The performance of thirteen LLMs is evaluated using in-context learning and prompt engineering techniques. Text-to-SQL performance is assessed using the perfect-match (PM) rate for row identifiers (e.g., object identifiers) and column identifiers (i.e., column names). The proposed step-by-step framework consistently outperforms a direct-inference baseline, while the self-correction module consistently reduces execution errors. For Claude Opus 4.6, PM performance on row (column) identifiers is high for simple queries, reaching 0.97 (0.94), and decreases with query complexity to 0.44 (0.72) for medium queries and 0.59 (0.49) for hard queries. Among the thirteen evaluated models, the best-performing LLMs for the text-to-SQL task are Claude Opus 4.6, Gemini 2.5 Pro, Gemini 3 Flash, and GPT-5.2-Codex.

11. [Compositional Skill Routing for LLM Agents: Decompose, Retrieve, and Compose](https://arxiv.org/abs/2606.18051v1)
    - Published: 2026-06-16 23:27
    - Authors: Xueping Gao
    - Source: arxiv
    - Relevance score: 143
    - Match reasons: title matched "LLM"; title matched "agent"; summary matched "benchmark"; summary matched "evaluation"
    - Categories: cs.CL
    - Tags: Evaluation / Method
    - Topics: LLM / Agent
    - PDF: https://arxiv.org/pdf/2606.18051v1
    - Abstract: LLM agents increasingly rely on external skills -- reusable tool specifications -- but real-world tasks often require composing multiple skills, not just selecting one. We formalize this as the Compositional Skill Routing problem: given a complex user query and a large skill library, decompose the query into atomic sub-tasks, retrieve the appropriate skill for each sub-task, and compose an executable plan. We present SkillWeaver, a decompose-retrieve-compose framework combining an LLM task decomposer, a bi-encoder skill retriever with FAISS indexing, and a dependency-aware DAG planner. To support evaluation, we introduce CompSkillBench, a benchmark of 300 compositional queries over 2,209 real MCP server skills spanning 24 functional categories, sourced from the public MCP ecosystem. Our experiments reveal that task decomposition quality is the primary bottleneck: standard LLM decomposition reaches only 34.2% category recall at the step level. To address this, we propose Iterative Skill-Aware Decomposition (SAD), a retrieval-augmented feedback loop that iteratively aligns decomposition with available skills. SAD improves decomposition accuracy from 51.0% to 67.7% (+32.7%, Wilcoxon p < 10^-6) in a single iteration; DA-conditioned analysis confirms that correct granularity is the prerequisite for effective retrieval (CatR@1 rises from 34% to 41% when DA=1). SkillWeaver reduces context window consumption by over 99%, and transfer experiments confirm generalization (+35.6% relative DA gain even when target categories are absent from the retrieval pool).

12. [ProvenanceGuard: Source-Aware Factuality Verification for MCP-Based LLM Agents](https://arxiv.org/abs/2606.18037v1)
    - Published: 2026-06-16 23:10
    - Authors: Ander Alvarez, Santhiya Rajan, Samuel Mugel, Román Orús
    - Source: arxiv
    - Relevance score: 142
    - Match reasons: title matched "LLM"; title matched "agent"; summary matched "alignment"; summary matched "benchmark"
    - Categories: cs.AI, cs.CL, cs.MA
    - Tags: Evaluation / Application / Method
    - Topics: LLM / Agent
    - PDF: https://arxiv.org/pdf/2606.18037v1
    - Abstract: Tool-using LLM agents increasingly use the Model Context Protocol (MCP) to answer from heterogeneous evidence sources, including search, APIs, databases, clinical records, and formulary tools. Standard factuality metrics usually test whether an answer is supported by pooled evidence, missing a provenance-sensitive failure mode: a claim may be supported somewhere while being attributed to the wrong source. We call this cross-source conflation. We introduce ProvenanceGuard, a source-aware verifier for MCP-grounded answers. It consumes captured MCP traces with stable tool IDs, source IDs, and raw outputs; decomposes answers into atomic claims; routes claims to source-specific evidence; checks support with NLI and a token-alignment proxy; compares stated attribution with the routed source; and returns per-claim verdicts plus an answer-level allow/block decision. Blocked answers can be repaired with retrieval-augmented answer revision and re-verified. We evaluate on 281 medical-domain MCP-agent traces. A 266-trace adjudicated subset yields 2,325 LLM-assisted claim labels split by trace; 361 held-out labels are human-verified. On the 40-trace held-out split, ProvenanceGuard achieves block F1 0.802 and source accuracy 0.858 over 260 source-eligible claims, outperforming source-blind baselines that do not emit claim-to-source IDs. On a harder multi-source benchmark it reaches block F1 0.846, while source-plus-relation accuracy drops to 0.229, showing that exact source ownership remains difficult with semantically close sources. Repair-and-reverify resolves all blocked answers in the full trace set, often via conservative fallback. In 50 controlled clinical conflation probes, ProvenanceGuard detects all injected attribution swaps with no retained wrong attribution. These results show that source attribution is an independent axis for factuality verification in MCP-based agents.

13. [Learning from the Self-future: On-policy Self-distillation for dLLMs](https://arxiv.org/abs/2606.18195v1)
    - Published: 2026-06-17 01:24
    - Authors: Yifu Luo, Zeyu Chen, Haoyu Wang, Xinhao Hu, Yuxuan Zhang, Zhizhou Sha, et al.
    - Source: arxiv
    - Relevance score: 141
    - Match reasons: title matched "LLM"; summary matched "language model"; summary matched "large language model"; summary matched "reasoning"
    - Categories: cs.CL
    - Tags: Evaluation / Application / Method
    - Topics: LLM / Benchmark
    - PDF: https://arxiv.org/pdf/2606.18195v1
    - Abstract: On-policy self-distillation (OPSD) has proven effective for post-training large language models (LLMs), yet its application to diffusion LLMs (dLLMs) remains unexplored. Existing OPSD methods are inherently autoregressive-centric. They inject privileged information via left-to-right prefix conditioning with token-level divergence supervision, a design that fundamentally conflicts with the arbitrary-order generation of dLLMs. We introduce d-OPSD, the first OPSD framework tailored for dLLMs. Our approach makes two core contributions. First, we reframe self-teacher construction by using self-generated answers as suffix conditioning, enabling the student model to learn from "self future-experience" rather than privileged prefixes. Second, we shift supervision from token-level to step-level, aligning training with the iterative denoising process of dLLMs. Experiments across four reasoning benchmarks show that d-OPSD consistently outperforms RLVR and SFT baselines with superior sample efficiency, requiring only around 10% of the optimization steps needed by RLVR and opening a promising pathway for dLLM post-training. The code is available at https://github.com/xingzhejun/d-OPSD.

14. [How Inference Compute Shapes Frontier LLM Evaluation](https://arxiv.org/abs/2606.17930v1)
   - Published: 2026-06-16 21:40
   - Authors: Jessica McFadyen, Ole Jorgensen, Harry Coppock, Kevin Wei, Cozmin Ududec
   - Source: arxiv
   - Relevance score: 141
   - Match reasons: title matched "LLM"; title matched "evaluation"; summary matched "language model"; summary matched "benchmark"
   - Categories: cs.AI
   - Tags: Evaluation / Method
   - Topics: LLM / Benchmark
   - PDF: https://arxiv.org/pdf/2606.17930v1
   - Abstract: AI evaluations are shifting toward harder tasks that benefit from longer trajectories involving tool use and iterative problem solving. As a result, performance is increasingly sensitive to the amount and allocation of compute available at test time ("inference compute"). Yet many evaluations still report performance at a single restrictive budget, meaning that low scores may reflect the evaluation setup rather than the model's underlying capability. To test this, we evaluate up to 12 frontier language models on seven challenging benchmarks spanning software engineering, mathematics, medicine, and cybersecurity. We use a controlled setup combining three simple inference-scaling interventions: larger token budgets, context compaction, and repeated submission attempts, guided either by the model itself or by minimal correctness feedback. We find three main results. First, larger token budgets substantially improve performance on benchmarks across multiple domains, including cybersecurity, FrontierMath, Humanity's Last Exam, and TerminalBench. Second, fixed-budget evaluations can increasingly understate frontier capability as models advance. Newer models reach higher performance at large budgets, where they unlock harder tasks and solve them more reliably. Third, benchmarks differ in which inference-scaling methods help most: repeated submission broadly improves performance, but the value of larger token budgets, external feedback, and parallel attempts varies by benchmark. Overall, our results show that benchmark scores are protocol-dependent. We therefore argue that evaluations should report capability as a function of inference-time compute, specify protocol choices explicitly, and compare model generations over a large shared compute range at matched budgets, especially in safety- or policy-relevant settings.

15. [Trustworthy Self-Composable Big-Data-as-a-Service: An LLM-Orchestrated Multi-Agent Framework for Automated Data Engineering, AutoML, MLOps Deployment, and Drift-Aware Lifecycle Optimization](https://arxiv.org/abs/2606.17915v1)
   - Published: 2026-06-16 21:34
   - Authors: Aueaphum Aueawatthanaphisut, Badri Raj Lamichhane
   - Source: arxiv
   - Relevance score: 141
   - Match reasons: title matched "LLM"; title matched "agent"; summary matched "benchmark"; summary matched "evaluation"
   - Categories: cs.MA, cs.AI, cs.DB, cs.SE
   - Tags: Evaluation / Data / Application / Method
   - Topics: LLM / Agent
   - PDF: https://arxiv.org/pdf/2606.17915v1
   - Abstract: Big-Data-as-a-Service (BDaaS) platforms require reliable automation across data ingestion, cleaning, feature engineering, model development, deployment, and post-deployment monitoring. However, existing LLM-based data science agents and AutoML systems mainly focus on isolated workflow stages, leaving limited support for lifecycle-level orchestration, artifact governance, human oversight, and drift-aware adaptation. This paper proposes a trustworthy self-composable BDaaS framework based on LLM-orchestrated multi-agent collaboration. The proposed architecture decomposes the BDaaS lifecycle into specialized agents for data ingestion, data cleaning, feature engineering, AutoML training, model evaluation, MLOps deployment, monitoring, and drift detection. A central LLM orchestration layer coordinates agent execution, validates intermediate outputs, manages workflow context, and enables dynamic workflow composition. The framework also incorporates shared artifact governance, reproducibility support, human-in-the-loop checkpoints, and drift-aware feedback loops. A prototype-based evaluation is conducted using controlled tabular benchmark datasets with missing values, categorical variables, outliers, class imbalance, and simulated covariate drift. Compared with manual ML, AutoML-only, and single-agent LLM baselines, the proposed multi-agent BDaaS pipeline achieves competitive predictive performance while improving lifecycle-level reliability, including workflow completion, artifact traceability, deployment readiness, reproducibility, and drift recovery. The results suggest that LLM-orchestrated multi-agent systems can extend conventional AutoML toward trustworthy, adaptive, and production-oriented BDaaS lifecycle automation.
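
One way to read the recommendation in "How Inference Compute Shapes Frontier LLM Evaluation" (item 14 above), that capability be reported as a function of inference-time compute, is as a solve-rate curve over token budgets. The Python sketch below is illustrative only and is not taken from the paper: the `tokens_needed` mapping, the task names, and the budget values are all made-up assumptions.

```python
from typing import Dict, List, Optional


def solve_rate_by_budget(
    tokens_needed: Dict[str, Optional[int]],
    budgets: List[int],
) -> Dict[int, float]:
    """Return the fraction of tasks solved at each token budget.

    tokens_needed maps a task id to the (hypothetical) number of tokens the
    model spent on its first successful run, or None if it never succeeded.
    """
    total = len(tokens_needed)
    curve: Dict[int, float] = {}
    for budget in budgets:
        solved = sum(
            1
            for needed in tokens_needed.values()
            if needed is not None and needed <= budget
        )
        curve[budget] = solved / total if total else 0.0
    return curve


# Hypothetical per-task measurements (assumed, not from the paper).
runs = {"task-a": 8_000, "task-b": 40_000, "task-c": 150_000, "task-d": None}
curve = solve_rate_by_budget(runs, [16_000, 64_000, 256_000])
for budget, rate in curve.items():
    print(f"{budget:>7} tokens: {rate:.2f}")
```

With this toy data, a report at the 16,000-token budget alone would show one task in four solved, while the same model solves three in four at 256,000 tokens. That is the kind of gap a single-budget score hides.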

## Agent Runtime Security Observations

### Group at a Glance

- "Seeing Is Not Screening: Multimodal Hidden Instruction Attacks on Agent Skill Scanners" (Application / Method): Agent skills are emerging as an important attack surface in LLM-based systems. Through an empirical study of existing skill scanners, we find that current defe…
- "A Red-Team Study of Anthropic Fable 5 & Opus 4.8 Models" (Application / Method): We evaluate the adversarial robustness of two frontier large language models (LLMs) developed by Anthropic, Fable 5 and Opus 4.8, against four families of auto…
- "PreAct: Computer-Using Agents that Get Faster on Repeated Tasks" (Evaluation / Method): Computer-using agents drive real software through the screen -- clicking and typing -- but they solve every task from scratch: asked to repeat a task, an agent…

### Papers at a Glance

1. [Seeing Is Not Screening: Multimodal Hidden Instruction Attacks on Agent Skill Scanners](https://arxiv.org/abs/2606.18198v1)
   - Published: 2026-06-17 01:29
   - Authors: Xiaojun Jia, Jie Liao, Simeng Qin, Ke Ma, Wenbo Guo, Yebo Feng, et al.
   - Source: arxiv
   - Relevance score: 47
   - Match reasons: summary matched "privilege escalation"; has PDF; has rich summary; has complete metadata
   - Categories: cs.CR, cs.CV
   - Tags: Application / Method
   - Topics: LLM / Agent
   - PDF: https://arxiv.org/pdf/2606.18198v1
   - Abstract: Agent skills are emerging as an important attack surface in LLM-based systems. Through an empirical study of existing skill scanners, we find that current defenses primarily rely on textual descriptions, manifests, and source code as the main signals for security analysis, which can leave visually conveyed malicious intent insufficiently examined. This creates a practical blind spot: harmful operational instructions hidden in images may bypass scanning while still being recoverable by multimodal agents during deployment. To systematically investigate this threat, we propose SkillCamo, a document-mediated multimodal instruction attack that conceals malicious instructions within images bundled with a skill while rewriting the surrounding documentation to naturally reference those images as part of the normal workflow. Thus, the attack does not rely on the image alone, but on the joint interpretation of textual guidance and visual payload at execution time. To defend against such attacks, we further propose ExecScan, an execution-grounded multimodal scanning module that performs intent extraction, behavior reconstruction, abuse assessment, and deliberative execution simulation over skill artifacts. ExecScan jointly analyzes documentation, code, referenced resources, and visual content to recover hidden instructions, reconstruct executable behavior chains, and identify downstream risks such as exfiltration, destruction, persistence, deception, and privilege escalation. Extensive experiments show that image-hidden malicious instructions challenge existing skill scanners, while ExecScan can improve the skill scanning performance.

2. [A Red-Team Study of Anthropic Fable 5 & Opus 4.8 Models](https://arxiv.org/abs/2606.18193v1)
   - Published: 2026-06-17 01:23
   - Authors: Nicola Franco
   - Source: arxiv
   - Relevance score: 47
   - Match reasons: summary matched "jailbreak"; has PDF; has rich summary; has complete metadata
   - Categories: cs.CR, cs.AI, cs.CL
   - Tags: Application / Method
   - Topics: LLM / Agent
   - PDF: https://arxiv.org/pdf/2606.18193v1
   - Abstract: We evaluate the adversarial robustness of two frontier large language models (LLMs) developed by Anthropic, Fable 5 and Opus 4.8, against four families of automated jailbreak attacks across 7,826 harmful intents spanning a ten-category harm taxonomy. Using the HackAgent red-teaming framework, hundreds of thousands of adversarial attempts were generated and every apparent success was independently re-adjudicated by a panel of three judge models (majority vote). Both models resist the majority of attacks, but the residual surface is larger than aggregate framing suggests: it is dominated by adaptive iterative attacks, while static obfuscation is near-fully neutralised. The strongest adaptive search (tree-of-attacks) breaks Opus 4.8 on 11.5% of intents overall, whereas Fable 5 stays in the single digits (6.1% worst-case). Aggregate rates therefore should not be read as reassurance. Even in these hardened configurations, the two models produced 1,620 (Opus 4.8) and 702 (Fable 5) panel-confirmed harmful completions spanning every harm category, located automatically, cheaply, and within the first one or two refinement steps by an attacker model with no human expert in the loop. The reasonable conclusion is that even the best, most-tested frontier models remain reliably breakable under sustained automated pressure.

3. [PreAct: Computer-Using Agents that Get Faster on Repeated Tasks](https://arxiv.org/abs/2606.17929v1)
   - Published: 2026-06-16 21:40
   - Authors: Bojie Li
   - Source: arxiv
   - Relevance score: 43
   - Match reasons: summary matched "guardrail"; has PDF; has rich summary; has complete metadata
   - Categories: cs.AI
   - Tags: Evaluation / Method
   - Topics: Agent / Benchmark
   - PDF: https://arxiv.org/pdf/2606.17929v1
   - Abstract: Computer-using agents drive real software through the screen -- clicking and typing -- but they solve every task from scratch: asked to repeat a task, an agent re-reads the screen, re-reasons every tap, and pays the full cost again. We present PreAct, which lets such an agent get faster on tasks it has done before. The first time it succeeds, PreAct compiles the run into a small state-machine program -- states that check the screen, transitions that act -- and on later runs replays it directly instead of invoking the agent -- 8.5-13x faster, with no per-step language-model calls. Replay is not blind: at each step PreAct checks that the screen matches what the program expects before acting, and hands control back to the agent the moment something is off. PreAct applies the same discipline when deciding what to keep: a freshly compiled program enters the store only if, re-run from a clean state, an independent evaluator confirms it solved the task -- catching programs that replay to their last step yet leave the task undone. Across a mobile, a desktop, and a web benchmark, this store-time check separates repeated runs that improve from ones that degrade as faulty programs accumulate, worth 1.75-2.6 tasks per benchmark, the same direction on all three; a fallback that explores afresh when no program fits brings PreAct level with a strong record-and-replay baseline. We also report what did not matter: prompt wording, runtime guardrails, and whether a language model or a plain embedding retriever selects which program to reuse.
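
The checked-replay idea in the PreAct abstract (verify the screen before each action, hand control back on the first mismatch) can be illustrated with a toy state machine. This Python sketch is an assumption-laden illustration and not PreAct's implementation: `Step`, `replay`, `ToyApp`, and the exact-string screen comparison are all hypothetical stand-ins for whatever screen check the paper actually uses.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Step:
    expected_screen: str  # what the program expects to see before acting
    action: str           # the action to replay, e.g. "tap:settings"


def replay(
    program: List[Step],
    observe: Callable[[], str],
    act: Callable[[str], None],
) -> Optional[int]:
    """Replay a compiled program with a screen check before every action.

    Returns None if every step ran, or the index of the first step whose
    expected screen did not match (the point to hand back to the agent).
    """
    for index, step in enumerate(program):
        if observe() != step.expected_screen:
            return index
        act(step.action)
    return None


class ToyApp:
    """Hypothetical app: a current screen plus a transition table."""

    def __init__(self, transitions: Dict[Tuple[str, str], str], start: str = "home"):
        self.transitions = transitions
        self.screen = start
        self.log: List[str] = []

    def observe(self) -> str:
        return self.screen

    def act(self, action: str) -> None:
        self.log.append(action)
        # Unknown actions leave the screen unchanged.
        self.screen = self.transitions.get((self.screen, action), self.screen)


program = [Step("home", "tap:settings"), Step("settings", "tap:wifi")]

# Same UI as when the program was recorded: replay runs to the end.
app = ToyApp({("home", "tap:settings"): "settings", ("settings", "tap:wifi"): "wifi"})
failed_at = replay(program, app.observe, app.act)

# The UI changed: a dialog now appears, so replay stops before step 1.
changed = ToyApp({("home", "tap:settings"): "consent_dialog"})
failed_changed = replay(program, changed.observe, changed.act)
```

In the first run `failed_at` is `None`; in the second, `failed_changed` is `1`, and only the first action was executed before control would return to the agent.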

## Terminal and SWE Agents Observations

### Group at a Glance

- "All Smoke, No Alarm: Oracle Signals in Agent-Authored Test Code" (Method): Software practitioners increasingly use AI coding agents that generate test code alongside production code in open source pull requests (PRs). Recent studies r…
- "LoopCoder-v2: Only Loop Once for Efficient Test-Time Computation Scaling" (Evaluation / Method): Looped Transformers scale latent computation by repeatedly applying shared blocks, but sequential looping increases latency and KV-cache memory with the loop c…
- "VoidPadding: Let [VOID] Handle Padding in Masked Diffusion Language Models so that [EOS] Can Focus on Semantic Termination" (Evaluation / Method): MDLMs generate text by denoising a preallocated masked response canvas, making response-length modeling central to instruction tuning. Existing MDLMs often inh…
- "GameCraft-Bench: Can Agents Build Playable Games End-to-End in a Real Game Engine?" (Evaluation / Application / Method): Game generation is an emerging application of coding agents, requiring models to transform natural-language specifications into playable interactive systems. U…
- "Position: Coding Benchmarks Are Misaligned with Agentic Software Engineering" (Evaluation / Method): Coding agents have become a major mode of software engineering, but the benchmarks we use to compare them were designed in a pre-agent era: they collapse model…

### Papers at a Glance

1. [All Smoke, No Alarm: Oracle Signals in Agent-Authored Test Code](https://arxiv.org/abs/2606.18168v1)
   - Published: 2026-06-17 01:06
   - Authors: Dipayan Banik, Kowshik Chowdhury, Shazibul Islam Shamim
   - Source: arxiv
   - Relevance score: 46
   - Match reasons: summary matched "coding agent"; has PDF; has rich summary; has complete metadata
   - Categories: cs.SE, cs.AI
   - Tags: Method
   - Topics: Agent / Coding Agent
   - PDF: https://arxiv.org/pdf/2606.18168v1
   - Abstract: Software practitioners increasingly use AI coding agents that generate test code alongside production code in open source pull requests (PRs). Recent studies report more than 932,000 agent-authored PRs across more than 116,000 repositories, yet whether their test files contain meaningful verification logic remains underexplored. Test files lacking explicit assertions execute code without verifying behavior, so quality gates based on test-file presence overestimate verification strength. The goal of this paper is to help practitioners assess the verification strength of agent-authored patches by characterizing oracle signals and their link to merge outcomes and review effort. We conduct an empirical study of 86,156 test-file patches from 33,596 agent-authored PRs across 2,807 GitHub repositories produced by five coding agents: OpenAI Codex, GitHub Copilot, Devin, Cursor, and Claude Code. A qualitative analysis of 384 stratified patches informs a syntactic taxonomy of eight oracle signal categories. Applied at scale, 80.2% of test patches contain weak or no explicit oracle signals. While raw merge rates are lower for strong-oracle PRs, a regression analysis adjusting for agent, PR size, repository popularity, task type, and language shows strong oracles significantly improve merge likelihood (OR = 1.28, p < 0.001). Our findings suggest that test file counts substantially overestimate verification strength and that practitioners can adopt oracle-aware quality checks to more accurately evaluate agent-authored contributions.

2. [LoopCoder-v2: Only Loop Once for Efficient Test-Time Computation Scaling](https://arxiv.org/abs/2606.18023v1)
   - Published: 2026-06-16 23:03
   - Authors: Jian Yang, Shawn Guo, Wei Zhang, Tianyu Zheng, Yaxin Du, Haau-Sing Li, et al.
   - Source: arxiv
   - Relevance score: 44
   - Match reasons: summary matched "SWE-bench"; has PDF; has rich summary; has complete metadata
   - Categories: cs.LG, cs.AI
   - Tags: Evaluation / Method
   - Topics: Agent / Benchmark
   - PDF: https://arxiv.org/pdf/2606.18023v1
   - Abstract: Looped Transformers scale latent computation by repeatedly applying shared blocks, but sequential looping increases latency and KV-cache memory with the loop count. Parallel loop Transformers (PLT) alleviate this cost through cross-loop position offsets (CLP) and shared-KV gated sliding-window attention, making loop count a practical design choice. We therefore study PLT loop-count selection through a gain-cost view: an extra loop may refine representations, but CLP also introduces a positional mismatch at each loop boundary. We instantiate this study by training LoopCoder-v2, a family of 7B PLT coders with different loop counts, from scratch on 18T tokens, followed by matched instruction tuning and evaluation. Empirically, the two-loop variant delivers broad gains over the non-looped baseline across code generation, code reasoning, agentic software engineering, and tool-use benchmarks, improving SWE-bench Verified from 43.0 to 64.4 points and Multi-SWE from 14.0 to 31.0 points. In contrast, variants with three or more loops regress, revealing a strongly non-monotonic loop-count effect. Our diagnostics show that loop 2 provides the main productive refinement, while later loops yield diminishing, oscillatory updates and reduced representational diversity. Because the CLP-induced mismatch remains roughly fixed as refinement gains shrink, the offset cost increasingly dominates. This gain-cost trade-off explains PLT's saturation at two loops and provides diagnostics for loop-count selection.

3. [VoidPadding: Let [VOID] Handle Padding in Masked Diffusion Language Models so that [EOS] Can Focus on Semantic Termination](https://arxiv.org/abs/2606.17999v1)
   - Published: 2026-06-16 22:46
   - Authors: Chunyu Liu, Zhengyang Fan, Kaisen Yang, Alex Lamb
   - Source: arxiv
   - Relevance score: 44
   - Match reasons: summary matched "code generation benchmark"; has PDF; has rich summary; has complete metadata
   - Categories: cs.CL
   - Tags: Evaluation / Method
   - Topics: Benchmark / Language Model
   - PDF: https://arxiv.org/pdf/2606.17999v1
   - Abstract: MDLMs generate text by denoising a preallocated masked response canvas, making response-length modeling central to instruction tuning. Existing MDLMs often inherit the autoregressive convention of using repeated `[EOS]` tokens for padding during instruction tuning, giving `[EOS]` a dual role as both a semantic terminator and a padding token. We show that this dual role is a root cause of `[EOS]` overflow under large-block decoding. To decouple these roles, we propose VoidPadding, which introduces `[VOID]` for padding and reserves `[EOS]` for termination. During inference, the learned `[EOS]` signal enables early stopping, while the learned `[VOID]` signal guides adaptive response canvas expansion. On Dream-7B-Instruct, VoidPadding improves the block-size-averaged four-task mean across mathematical reasoning and code generation benchmarks by +17.84 points over the original model and +6.95 points over RainbowPadding, while reducing decoding NFE by 55.7% on average. Code is available at https://github.com/Haru-LCY/VoidPadding.

4. [GameCraft-Bench: Can Agents Build Playable Games End-to-End in a Real Game Engine?](https://arxiv.org/abs/2606.17861v1)
   - Published: 2026-06-16 20:34
   - Authors: Tongxu Luo, Rongsheng Wang, Jiaxi Bi, Chenming Xu, Zhengyang Tang, Jianlong Chen, et al.
   - Source: arxiv
   - Relevance score: 42
   - Match reasons: summary matched "coding agent"; has PDF; has rich summary; has complete metadata
   - Categories: cs.CL
   - Tags: Evaluation / Application / Method
   - Topics: Agent / Benchmark
   - PDF: https://arxiv.org/pdf/2606.17861v1
   - Abstract: Game generation is an emerging application of coding agents, requiring models to transform natural-language specifications into playable interactive systems. Unlike traditional coding tasks, game generation takes place within a game engine, where scripts, scenes, assets, rendering, and runtime interactions must jointly produce coherent gameplay. We formalize end-to-end game generation as the problem of producing a complete game artifact that realizes a specification through observable player-game interaction in a target environment. We argue that evaluating this setting requires three desiderata: Engine Grounding, Artifact Completeness, and Interactive Verification. We propose an interaction-grounded evaluation framework that assesses executable gameplay through replayed demonstrations and rubric-guided multimodal judging. We instantiate this framework as GameCraft-Bench, a benchmark comprising 140 Godot tasks across 15 game families. Evaluations of frontier coding agents show that end-to-end game generation remains highly challenging: the strongest agent achieves only 41.46%, and most agents score below 40%. Further analysis reveals that while agents often implement recognizable mechanics, they struggle to deliver complete games with sufficient content, functional visual feedback, and coherent presentation. See https://tongxuluo.github.io/gamecraft-bench-website for demos, code, and data.

5. [Position: Coding Benchmarks Are Misaligned with Agentic Software Engineering](https://arxiv.org/abs/2606.17799v1)
   - Published: 2026-06-16 19:21
   - Authors: Maria I. Gorinova, Macey Baker, Amy Heineike, Maksim Shaposhnikov, Rob Willoughby, Dru Knox
   - Source: arxiv
   - Relevance score: 40
   - Match reasons: summary matched "coding agent"; has PDF; has rich summary; has complete metadata
   - Categories: cs.SE, cs.AI, cs.CL
   - Tags: Evaluation / Method
   - Topics: Agent / Benchmark
   - PDF: https://arxiv.org/pdf/2606.17799v1
   - Abstract: Coding agents have become a major mode of software engineering, but the benchmarks we use to compare them were designed in a pre-agent era: they collapse model, harness, and environment into a single end-to-end score, typically computed against one reference solution, with no component-level signal for iteration. We argue that current coding benchmarks are misaligned with agentic software engineering. A coding agent in practice is not a model: it is a system harness -- a composite of models, harnesses, contexts, environments, and feedback signals, any one of which can move the benchmark score by margins comparable to those between adjacent model generations. We discuss three symptoms: (i) benchmark scores conflate the model with the rest of the harness; (ii) grading against a single reference solution penalises equally valid alternatives; and (iii) the absence of signal at the level of individual harness components makes the end-to-end system score difficult to iterate on.
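
The point made in "All Smoke, No Alarm" (item 1 above), that a test file without explicit assertions runs code without verifying it, can be made concrete with a small syntactic check. The Python sketch below is a rough approximation under stated assumptions and is not the paper's eight-category taxonomy: `count_oracle_signals` is a hypothetical helper that only counts bare `assert` statements, method calls whose name starts with `assert` (as in `unittest`), and calls to a method named `raises` (as in `pytest.raises`).

```python
import ast


def count_oracle_signals(source: str) -> int:
    """Count explicit assertion-like constructs in Python test source.

    Counted (a deliberately narrow, assumed definition):
      - `assert ...` statements
      - method calls named assert* (e.g. self.assertEqual)
      - method calls named raises (e.g. pytest.raises)
    """
    count = 0
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Assert):
            count += 1
        elif isinstance(node, ast.Call) and isinstance(node.func, ast.Attribute):
            name = node.func.attr
            if name.startswith("assert") or name == "raises":
                count += 1
    return count


# A "smoke" test: it executes code but never checks the result.
smoke_test = '''
def test_runs():
    result = compute(2, 3)
'''

# A test with explicit oracles.
checked_test = '''
def test_sum():
    assert compute(2, 3) == 5

def test_rejects_none(self):
    self.assertRaises(TypeError, compute, None, 3)
'''

smoke_count = count_oracle_signals(smoke_test)
checked_count = count_oracle_signals(checked_test)
print(smoke_count, checked_count)
```

A quality gate that only checks for the presence of a test file would accept both samples, whereas this count separates them. A real check would also need to handle helper functions that wrap assertions, which this sketch ignores.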
