# Daily Paper Digest

- Generated at: 2026-05-28 13:15:52 (Asia/Shanghai)
- Search window: last 24 hours
- Hit overview: LM=15, Agent Runtime Security=5, Terminal and SWE Agents=1
- Ranking strategy: hybrid (relevance first, published_at tie-break)
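
A minimal sketch of the hybrid ranking named above, assuming that "relevance first" means descending relevance score and that the `published_at` tie-break prefers the newer paper (this matches the order of the entries listed in this digest, but the digest does not state it). The tuple layout and the `hybrid_rank` name are hypothetical; the scores and timestamps are taken from entries in this digest.

```python
from datetime import datetime

# Hypothetical record layout: (short title, relevance score, published_at).
papers = [
    ("TRACER", 181, datetime(2026, 5, 28, 0, 25)),
    ("Agent Explorative Policy Optimization", 164, datetime(2026, 5, 28, 1, 36)),
    ("Discourse Particles in Colloquial Malay", 164, datetime(2026, 5, 28, 1, 42)),
    ("MemTrace", 199, datetime(2026, 5, 28, 0, 53)),
]


def hybrid_rank(items):
    """Sort by relevance (highest first); break ties by publication time (newest first)."""
    return sorted(items, key=lambda p: (p[1], p[2]), reverse=True)


ranked = hybrid_rank(papers)
for title, score, published_at in ranked:
    print(score, published_at.isoformat(), title)
```

The two papers with score 164 end up ordered by timestamp alone, which is the only place the tie-break matters.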

## Today's Highlights

- Topic "LLM": 15 hits across LM and Agent Runtime Security; representative papers include "MemTrace: Tracing and Attributing Errors in Large Language Model Memory Systems" and "Satisfiability Solving with LLMs: A Matched-Pair Evaluation of Reasoning Capability".
- Topic "Language Model": 13 hits across LM and Agent Runtime Security; representative papers include "MemTrace: Tracing and Attributing Errors in Large Language Model Memory Systems" and "Satisfiability Solving with LLMs: A Matched-Pair Evaluation of Reasoning Capability".
- Topic "Benchmark": 6 hits across LM and Agent Runtime Security; representative papers include "Agent Explorative Policy Optimization for Multimodal Agentic Reasoning" and "VeriTrip: A Verifiable Benchmark for Travel Planning Agents over Unstructured Web Corpora".
- Topic "Agent": 3 hits across Agent Runtime Security and Terminal and SWE Agents; representative papers include "Learn from Weaknesses: Automated Domain Specialization for Small Computer-Use Agents" and "Technical Report: Exploring the Emerging Threats of the Agent Skill Ecosystem".
- Topic "RAG": 2 hits in Agent Runtime Security; representative papers include "Learn from Weaknesses: Automated Domain Specialization for Small Computer-Use Agents" and "LACUNA: Safe Agents as Recursive Program Holes".

## Topic Focus

### LLM

- Hits: 15
- Groups covered: LM, Agent Runtime Security
- Representative papers: "MemTrace: Tracing and Attributing Errors in Large Language Model Memory Systems", "Satisfiability Solving with LLMs: A Matched-Pair Evaluation of Reasoning Capability", "TRACER: Turn-level Regret Matching with Inner Reinforcement Credit for Cooperative Multi-LLM Reasoning"
- Quick read:
  - "MemTrace: Tracing and Attributing Errors in Large Language Model Memory Systems" [Evaluation / Method]: Memory is essential for enabling large language models to support long-horizon reasoning, yet existing memory systems remain unreliable and difficult to debug.…
  - "Satisfiability Solving with LLMs: A Matched-Pair Evaluation of Reasoning Capability" [Evaluation / Method]: Large language models (LLMs) are increasingly used for tasks that implicitly reduce to Boolean satisfiability (SAT), yet their reasoning ability on SAT remains…

### Language Model

- Hits: 13
- Groups covered: LM, Agent Runtime Security
- Representative papers: "MemTrace: Tracing and Attributing Errors in Large Language Model Memory Systems", "Satisfiability Solving with LLMs: A Matched-Pair Evaluation of Reasoning Capability", "TRACER: Turn-level Regret Matching with Inner Reinforcement Credit for Cooperative Multi-LLM Reasoning"
- Quick read:
  - "MemTrace: Tracing and Attributing Errors in Large Language Model Memory Systems" [Evaluation / Method]: Memory is essential for enabling large language models to support long-horizon reasoning, yet existing memory systems remain unreliable and difficult to debug.…
  - "Satisfiability Solving with LLMs: A Matched-Pair Evaluation of Reasoning Capability" [Evaluation / Method]: Large language models (LLMs) are increasingly used for tasks that implicitly reduce to Boolean satisfiability (SAT), yet their reasoning ability on SAT remains…

### Benchmark

- Hits: 6
- Groups covered: LM, Agent Runtime Security
- Representative papers: "Agent Explorative Policy Optimization for Multimodal Agentic Reasoning", "VeriTrip: A Verifiable Benchmark for Travel Planning Agents over Unstructured Web Corpora", "Skill-Conditioned Gated Self-Distillation for LLM Reasoning"
- Quick read:
  - "Agent Explorative Policy Optimization for Multimodal Agentic Reasoning" [Evaluation / Method]: Vision-language models with extended reasoning succeed on complex problems, but many real-world problems require external tools that internal reasoning alone o…
  - "VeriTrip: A Verifiable Benchmark for Travel Planning Agents over Unstructured Web Corpora" [Evaluation / Method]: Existing benchmarks have laid the foundation for travel planning agents by establishing API-centric paradigms. However, as the capabilities of Autonomous Agent…

### Agent

- Hits: 3
- Groups covered: Agent Runtime Security, Terminal and SWE Agents
- Representative papers: "Learn from Weaknesses: Automated Domain Specialization for Small Computer-Use Agents", "Technical Report: Exploring the Emerging Threats of the Agent Skill Ecosystem", "Calibrating Conservatism for Scalable Oversight"
- Quick read:
  - "Learn from Weaknesses: Automated Domain Specialization for Small Computer-Use Agents" [Data / Method]: Computer-use agents (CUAs) have recently made substantial progress, but deploying a separate large expert for each software domain remains expensive. Small ope…
  - "Technical Report: Exploring the Emerging Threats of the Agent Skill Ecosystem" [Method]: We analyzed 3,984 AI agent skills from major marketplaces and found 76 confirmed malicious payloads, including credential theft, backdoor installation, and dat…

### RAG

- Hits: 2
- Groups covered: Agent Runtime Security
- Representative papers: "Learn from Weaknesses: Automated Domain Specialization for Small Computer-Use Agents", "LACUNA: Safe Agents as Recursive Program Holes"
- Quick read:
  - "Learn from Weaknesses: Automated Domain Specialization for Small Computer-Use Agents" [Data / Method]: Computer-use agents (CUAs) have recently made substantial progress, but deploying a separate large expert for each software domain remains expensive. Small ope…
  - "LACUNA: Safe Agents as Recursive Program Holes" [Method]: LLM agents increasingly act by writing code, yet a split persists between the runtime that drives the agent and the code the model writes. The runtime owns the…

## LM Watch

### Group at a Glance

- "MemTrace: Tracing and Attributing Errors in Large Language Model Memory Systems" [Evaluation / Method]: Memory is essential for enabling large language models to support long-horizon reasoning, yet existing memory systems remain unreliable and difficult to debug.…
- "Satisfiability Solving with LLMs: A Matched-Pair Evaluation of Reasoning Capability" [Evaluation / Method]: Large language models (LLMs) are increasingly used for tasks that implicitly reduce to Boolean satisfiability (SAT), yet their reasoning ability on SAT remains…
- "TRACER: Turn-level Regret Matching with Inner Reinforcement Credit for Cooperative Multi-LLM Reasoning" [Evaluation / Method]: Large language models increasingly rely on either reinforcement learning or multi-agent prompting to improve reasoning, yet these two paradigms remain difficul…
- "Mobile-Aptus: Confidence-Driven Proactive and Robust Interaction in MLLM-based Mobile-Using Agents" [Evaluation / Method]: Recent advancements in multimodal large language models (MLLMs) have shown exceptional potential in enabling mobile-using agents to autonomously execute human…
- "The Importance of Being Statistically Earnest: A Critical Re-evaluation of GSM-Symbolic" [Evaluation / Data / Method]: The GSM-Symbolic benchmark (Mirzadeh et al., 2025) reported consistent performance drops across 25 Large Language Models (LLMs) when tested on template-generat…

### Paper Overview

1. [MemTrace: Tracing and Attributing Errors in Large Language Model Memory Systems](https://arxiv.org/abs/2605.28732v1)
   - Published: 2026-05-28 00:53
   - Authors: Xinle Deng, Ruobin Zhong, Hujin Peng, Xiaoben Lu, Yanzhe Wu, Guang Li, et al.
   - Source: arxiv
   - Relevance score: 199
   - Match reasons: title matched "language model"; title matched "large language model"; summary matched "LLM"; summary matched "reasoning"
   - Categories: cs.CL, cs.AI, cs.LG
   - Tags: Evaluation / Method
   - Topics: LLM / Language Model
   - PDF: https://arxiv.org/pdf/2605.28732v1
   - Abstract: Memory is essential for enabling large language models to support long-horizon reasoning, yet existing memory systems remain unreliable and difficult to debug. Tracing memory's dynamic evolution is crucial to understand how information is synthesized, propagated, or corrupted over time. In this work, we study the new problem of error tracing and attribution in LLM memory systems. We propose a novel framework that transforms memory pipelines into executable memory evolution graphs, enabling fine-grained tracing of operational information flow. We then construct MemTraceBench, a benchmark collected from representative memory systems such as Long-Context, RAG, Mem0, and EverMemOS, to systematically study memory failure modes. We further introduce an automatic attribution method that iteratively traces operation subgraphs to pinpoint the root cause of any failed case. Our analysis reveals that memory failures are systematic, stemming from operation-level issues like information loss and retrieval misalignment. Crucially, we leverage these fine-grained attribution signals to guide downstream prompt optimization, establishing a closed-loop system that automatically corrects faults and boosts end-task performance by up to 7.62%. Code will be released at https://github.com/zjunlp/MemTrace.

2. [Satisfiability Solving with LLMs: A Matched-Pair Evaluation of Reasoning Capability](https://arxiv.org/abs/2605.28602v1)
   - Published: 2026-05-27 23:18
   - Authors: Leizhen Zhang, Shuhan Chen, Sheng Chen
   - Source: arxiv
   - Relevance score: 184
   - Match reasons: title matched "LLM"; title matched "reasoning"; title matched "evaluation"; summary matched "language model"
   - Categories: cs.AI, cs.CL, cs.LO
   - Tags: Evaluation / Method
   - Topics: LLM / Language Model
   - PDF: https://arxiv.org/pdf/2605.28602v1
   - Abstract: Large language models (LLMs) are increasingly used for tasks that implicitly reduce to Boolean satisfiability (SAT), yet their reasoning ability on SAT remains unclear. We present a systematic study of LLMs on 2-SAT and 3-SAT, together with two canonical reductions, Vertex Cover and discrete 3D packing, to probe representation-invariant reasoning. We first evaluate models using conventional metrics, including accuracy, precision, recall, and F1, as well as the SAT phase-transition setting. We find that these metrics can be misleading: many models obtain high scores by over-predicting satisfiable formulas, fail to reproduce the classical easy-hard-easy signature around the 3-SAT threshold, and degrade sharply as the number of variables grows. To address this problem, we introduce a paired-formula protocol based on minimally different satisfiable and unsatisfiable instances, together with Accurate Differentiation Rate (ADR), which requires both members of each pair to be classified correctly. ADR separates reasoning-oriented models from heuristic ones and correlates with witness validity. Beyond CNF, we test cross-representation consistency by converting CNF to Vertex Cover and 3-SAT to discrete 3D packing. Model decisions on CNF and on the corresponding graph or packing instances agree for most models on more than 80 percent of instances, suggesting stable decision rules across representations. Overall, our results show that SAT is a conservative probe for LLM reasoning, and that paired evaluation with ADR provides a more faithful and representation-robust assessment than conventional metrics.
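
To illustrate why a paired metric can expose over-prediction of "satisfiable", here is a small sketch. It assumes ADR is the fraction of pairs in which both the satisfiable and the unsatisfiable member are classified correctly, which is one reading of the abstract; the paper's exact definition may differ. The function names and the toy predictions are hypothetical.

```python
def accuracy(pairs):
    """Fraction of individual formulas classified correctly.

    Each pair is (prediction_for_sat_member, prediction_for_unsat_member),
    where a prediction of True means "the model says satisfiable".
    """
    correct = sum((sat_pred is True) + (unsat_pred is False)
                  for sat_pred, unsat_pred in pairs)
    return correct / (2 * len(pairs))


def adr(pairs):
    """Fraction of pairs in which BOTH members are classified correctly."""
    both = sum(1 for sat_pred, unsat_pred in pairs
               if sat_pred is True and unsat_pred is False)
    return both / len(pairs)


# A degenerate model that answers "satisfiable" for every formula.
always_sat = [(True, True)] * 4
print(accuracy(always_sat))  # 0.5
print(adr(always_sat))       # 0.0
```

On a balanced paired set, the always-satisfiable strategy still gets half of the formulas right, while its pair-level score drops to zero.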

3. [TRACER: Turn-level Regret Matching with Inner Reinforcement Credit for Cooperative Multi-LLM Reasoning](https://arxiv.org/abs/2605.28699v1)
   - Published: 2026-05-28 00:25
   - Authors: Chusen Li, Zhou Liu, Shuigeng Zhou, Wentao Zhang
   - Source: arxiv
   - Relevance score: 181
   - Match reasons: title matched "LLM"; title matched "reasoning"; summary matched "language model"; summary matched "large language model"
   - Categories: cs.AI
   - Tags: Evaluation / Method
   - Topics: LLM / Language Model
   - PDF: https://arxiv.org/pdf/2605.28699v1
   - Abstract: Large language models increasingly rely on either reinforcement learning or multi-agent prompting to improve reasoning, yet these two paradigms remain difficult to combine. Directly applying single-agent reinforcement learning to multi-turn multi-agent systems faces the following dilemmas: i) sparse rewards, role-level free-riding, and excessive training overhead; ii) agents only imitate to collaborate; iii) a fixed collaboration protocol falls into an oscillating local optimum. We introduce TRACER, a turn-level reinforcement framework for cooperative multi-LLM reasoning. TRACER separates collaborative decision making into a controller-regret layer, where controllers learn whether the agents should speak or skip the current round through regret matching, and a generation-credit layer, which optimizes proposer and reviewer utterances with role-specific GSPO rewards. This design i) assigns credit at the level of both action modes and generated utterances, thus avoiding free-riding and sparse rewards. We only expand the choices made by the controllers, thus greatly reducing the computational cost of training. Moreover, ii) agents acquire collaborative capability as they learn when to utter and what to speak. Finally, iii) by designing binary actions ingeniously, we extend classical game theory established for finite action spaces to deep learning, thus achieving mathematically rigorous convergence. We train all local RL-style methods on the GSM8K training split and evaluate on held-out GSM8K, MATH500, and GPQA-Diamond to measure in-domain accuracy, cross-benchmark generalization, inference cost, and correction-preservation behavior. The resulting framework provides a compact and reproducible testbed for studying learned collaboration policies beyond fixed debate, voting, or aggregation protocols. Code is available at https://github.com/Shark-Forest/TRACER.
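
For readers unfamiliar with regret matching, the sketch below shows the classical rule on a binary speak/skip choice: play each action with probability proportional to its positive cumulative regret. This is the textbook algorithm, not TRACER's controller; the utilities and function names are hypothetical and chosen only for illustration.

```python
def regret_matching_strategy(cum_regret):
    """Map cumulative regrets to action probabilities (classical regret matching)."""
    positive = {a: max(r, 0.0) for a, r in cum_regret.items()}
    total = sum(positive.values())
    if total == 0.0:
        # No positive regret yet: fall back to the uniform strategy.
        return {a: 1.0 / len(cum_regret) for a in cum_regret}
    return {a: r / total for a, r in positive.items()}


def update_regret(cum_regret, utilities, strategy):
    """Add each action's regret relative to the strategy's expected utility."""
    expected = sum(strategy[a] * utilities[a] for a in utilities)
    for a in cum_regret:
        cum_regret[a] += utilities[a] - expected


cum_regret = {"speak": 0.0, "skip": 0.0}
strategy = regret_matching_strategy(cum_regret)  # uniform at the start

# Hypothetical round in which speaking would have paid 1.0 and skipping 0.0.
update_regret(cum_regret, {"speak": 1.0, "skip": 0.0}, strategy)
strategy = regret_matching_strategy(cum_regret)
print(strategy)
```

After a single round in which speaking was strictly better, all probability mass moves to "speak" because "skip" has negative cumulative regret.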

4. [Mobile-Aptus: Confidence-Driven Proactive and Robust Interaction in MLLM-based Mobile-Using Agents](https://arxiv.org/abs/2605.28629v1)
   - Published: 2026-05-27 23:37
   - Authors: Zheng Wu, Pengzhou Cheng, Zongru Wu, Yuan Guo, Tianjie Ju, Aston Zhang, et al.
   - Source: arxiv
   - Relevance score: 180
   - Match reasons: title matched "LLM"; title matched "agent"; summary matched "language model"; summary matched "large language model"
   - Categories: cs.CL
   - Tags: Evaluation / Method
   - Topics: LLM / Language Model
   - PDF: https://arxiv.org/pdf/2605.28629v1
   - Abstract: Recent advancements in multimodal large language models (MLLMs) have shown exceptional potential in enabling mobile-using agents to autonomously execute human instructions. However, fully automated agents often try to execute tasks even when they are unable to resolve them, leading to the problem of over-execution. Previous studies address this by training interactive mobile-using agents that request human interaction when they cannot complete user instructions. However, we find that these interactive agents tend to exhibit over-soliciting behavior, relying excessively on human intervention. To mitigate both over-execution and over-soliciting, we propose a universal confidence integration framework that enables confidence-driven proactive and robust interaction in MLLM-based mobile-using agents. The framework consists of two stages: interaction capability empowerment and confidence bias correction. In the interaction capability empowerment stage, agents learn through supervised fine-tuning to output both actions and confidence scores. In the confidence bias correction stage, agents learn to output more accurate confidence scores by combining semantic similarity retrieval with direct preference optimization. Experimental results show Mobile-Aptus achieves state-of-the-art performance on four popular mobile-using agent benchmarks: OS-Kairos, AITZ, Meta-GUI, and AndroidControl. Mobile-Aptus consistently outperforms all baselines in offline benchmarks, with an average improvement over 17% in task success rate. In real-world dynamic experiments, Mobile-Aptus surpasses the baseline by 26% in task success rate with only 0.64 intervention steps per instruction. The code is available at https://github.com/Wuzheng02/Mobile-Aptus.

5. [The Importance of Being Statistically Earnest: A Critical Re-evaluation of GSM-Symbolic](https://arxiv.org/abs/2605.28700v1)
   - Published: 2026-05-28 00:25
   - Authors: Dominika Agnieszka Długosz, Arlindo Oliveira, Natalia Díaz Rodríguez
   - Source: arxiv
   - Relevance score: 177
   - Match reasons: title matched "evaluation"; summary matched "language model"; summary matched "large language model"; summary matched "LLM"
   - Categories: cs.AI, cs.CL
   - Tags: Evaluation / Data / Method
   - Topics: LLM / Language Model
   - PDF: https://arxiv.org/pdf/2605.28700v1
   - Abstract: The GSM-Symbolic benchmark (Mirzadeh et al., 2025) reported consistent performance drops across 25 Large Language Models (LLMs) when tested on template-generated variants of GSM8K problems, concluding that the models lack genuine reasoning capabilities. We argue that this conclusion rests on shaky statistical ground. Re-evaluating 20 open-weight models using Generalised Linear Mixed Models with per-question random effects, we find that only half exhibit statistically significant performance changes under the original prompt format. Moreover, we identify a previously unacknowledged factor: the main GSM-Symbolic dataset contains a systematically shifted distribution of larger integers in problem texts relative to GSM-Base (K-S statistic = 0.12, p < 0.001), contradicting the original authors' claims. Controlling for this large number effect accounts for significance in roughly half the remaining cases. Among models with statistically significant performance deltas, we identify distinct, model-specific failure profiles (including fragility of variable binding, arithmetic limitations, and dual-task interference), underscoring that blanket claims about LLM reasoning are both statistically premature and mechanistically misleading.
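
The K-S statistic quoted in this abstract is the two-sample Kolmogorov-Smirnov distance, i.e. the largest gap between two empirical cumulative distribution functions. A minimal standard-library sketch follows; the integer samples are toy values invented for illustration and are not drawn from GSM-Symbolic or GSM-Base, and no p-value is computed.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: max gap between empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    gap = 0.0
    # The gap can only change at observed values, so checking those is enough.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


base = [1, 2, 3, 4]      # toy "baseline" integers
shifted = [3, 4, 5, 6]   # same shape, shifted toward larger values
print(ks_statistic(base, shifted))  # 0.5
```

A statistic of 0 means the two empirical distributions coincide, and 1 means they do not overlap at all.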

6. [Can Large Language Models Handle Discourse Particles? A Case Study of Colloquial Malay](https://arxiv.org/abs/2605.28782v1)
   - Published: 2026-05-28 01:42
   - Authors: Mariah Al Giptiah Binte Yusoff, Jakin Tan, Bocheng Chen, Guangliang Liu, Xi Chen
   - Source: arxiv
   - Relevance score: 164
   - Match reasons: title matched "language model"; title matched "large language model"; summary matched "LLM"; summary matched "RAG"
   - Categories: cs.CL
   - Tags: Evaluation / Method
   - Topics: LLM / Language Model
   - PDF: https://arxiv.org/pdf/2605.28782v1
   - Abstract: Discourse particles, such as *well* and *kind of*, are crucial components that enable LLMs to "speak" more like humans. They are used to convey emotions, intentions, and interpersonal meanings. However, existing studies have not yet built a comprehensive understanding of LLMs' capabilities in handling discourse particles. Moreover, the limited number of studies focuses primarily on high-resource languages such as English, with little attention paid to Southeast Asian languages. In this paper, we (1) propose MalayPrag, a benchmark designed to systematically evaluate and analyze LLMs' capabilities in handling discourse particles in colloquial Malay; and (2) introduce five attributes that provide a linguistically grounded, unified framework for interpreting the pragmatic functions of discourse particles. Applying these two contributions, we prompt ten off-the-shelf LLMs to perform three prediction tasks. The experimental results reveal substantial challenges for current LLMs in accurately connecting discourse particles with their pragmatic functions in Malay. The provision of the five attributes designed in this study is found to significantly improve these connections, highlighting the need for structured scaffolding for models' pragmatic competence.

7. [Agent Explorative Policy Optimization for Multimodal Agentic Reasoning](https://arxiv.org/abs/2605.28774v1)
   - Published: 2026-05-28 01:36
   - Authors: Minki Kang, Shizhe Diao, Ryo Hachiuma, Sung Ju Hwang, Pavlo Molchanov, Yu-Chiang Frank Wang, et al.
   - Source: arxiv
   - Relevance score: 164
   - Match reasons: title matched "reasoning"; title matched "agent"; summary matched "language model"; summary matched "RAG"
   - Categories: cs.CL
   - Tags: Evaluation / Method
   - Topics: Language Model / Benchmark
   - PDF: https://arxiv.org/pdf/2605.28774v1
   - Abstract: Vision-language models with extended reasoning succeed on complex problems, but many real-world problems require external tools that internal reasoning alone often cannot resolve. Agentic reasoning therefore interleaves two behaviors with a structural asymmetry: thinking (the self-contained default) and tool use (a high-variance auxiliary acting). We refer to this asymmetry as the Thinking-Acting Gap. Under standard RL recipes like GRPO, the gap manifests as two diagnostic symptoms during training: tool use is attempted on only ~30% of rollouts, and when attempted, the tool-using rollouts within a group are all-wrong on ~40% of questions, suppressing the learning signal at the tool calls that needed it. We propose AXPO (Agent eXplorative Policy Optimization): for each all-wrong tool-using subgroup, AXPO fixes the thinking prefix and resamples the tool call and its continuation, paired with uncertainty-based prefix selection. Across nine multimodal benchmarks and three scales of Qwen3-VL-Thinking, SFT+AXPO outperforms SFT+GRPO on average (+1.8pp Pass@1 and +1.8pp Pass@4 at 8B), and 8B with SFT+AXPO surpasses the 32B Base on Pass@4 with 4 times fewer parameters.
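
One way to see why uniformly wrong rollouts carry no learning signal under a GRPO-style recipe is to look at group-normalized advantages, which subtract the group mean reward and divide by the group standard deviation. The sketch below is an illustration under that assumption only: the `eps` constant, the binary rewards, and the function name are hypothetical, and it does not reproduce AXPO's prefix-fixing and resampling.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Group-normalized advantages: (reward - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Every rollout in the group is wrong: all advantages collapse to zero.
all_wrong = group_advantages([0.0, 0.0, 0.0, 0.0])

# One rollout succeeds: it gets a positive advantage, the others negative.
mixed = group_advantages([1.0, 0.0, 0.0, 0.0])

print(all_wrong)
print(mixed)
```

When rewards within a group are identical, there is nothing to rank, so the policy gradient contribution of that group vanishes; resampling until at least one rollout differs restores a nonzero signal.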

8. [VeriTrip: A Verifiable Benchmark for Travel Planning Agents over Unstructured Web Corpora](https://arxiv.org/abs/2605.28683v1)
   - Published: 2026-05-28 00:14
   - Authors: Yuting Xu, Jiayi Tian, Jian Liang, Xin Xiong, Hang Zhang, Mu Xu, et al.
   - Source: arxiv
   - Relevance score: 162
   - Match reasons: title matched "agent"; title matched "benchmark"; summary matched "LLM"; summary matched "reasoning"
   - Categories: cs.AI
   - Tags: Evaluation / Method
   - Topics: LLM / Benchmark
   - PDF: https://arxiv.org/pdf/2605.28683v1
   - Abstract: Existing benchmarks have laid the foundation for travel planning agents by establishing API-centric paradigms. However, as the capabilities of Autonomous Agents continue to advance, their evaluation must evolve beyond simple tool execution toward handling the inherent complexities of the open web. Current benchmarks bypass core cognitive hurdles: they fail to account for information noise, ignore multi-source factual contradictions, and overlook the necessity of grounding visual perception into logical planning. We introduce VeriTrip, a verifiable benchmark designed to meet the increasing demands for agent robustness and reliability. VeriTrip shifts the evaluation focus to evidence-grounded reasoning over unstructured multimodal web corpora. It establishes a Multimodal Retrieval Base (MRB) derived from real-world sources, forcing agents to autonomously orchestrate queries across heterogeneous data. A synchronized Verifiable Knowledge Base (VKB) enables a cell-wise verification protocol that precisely quantifies factual reliability, distinguishing systematic reasoning failures from parametric hallucinations. Our evaluations across leading MLLMs reveal a critical *retrieval-reasoning trade-off*: the cognitive load of autonomous retrieval significantly erodes instruction retention. VeriTrip provides the rigorous foundation necessary for the next generation of planning agents capable of operating in unconstrained, multimodal environments.

9. [Do Agents Need Semantic Metadata? A Comparative Study in Agentic Data Retrieval](https://arxiv.org/abs/2605.28787v1)
   - Published: 2026-05-28 01:46
   - Authors: Shiyu Chen, Tarfah Alrashed, Alon Halevy, Natasha Noy
   - Source: arxiv
   - Relevance score: 160
   - Match reasons: title matched "agent"; summary matched "language model"; summary matched "large language model"; summary matched "LLM"
   - Categories: cs.IR, cs.AI
   - Tags: Evaluation / Data / Application / Method
   - Topics: LLM / Language Model
   - PDF: https://arxiv.org/pdf/2605.28787v1
   - Abstract: In the era of autonomous agents, machine-actionable data is critical for data-driven workflows. For more than a decade, semantic metadata like schema.org has anchored the FAIR principles (Findable, Accessible, Interoperable, and Reusable) for machine-actionable data and enabled discovery tools like Google Dataset Search. However, the rise of Large Language Models (LLMs) capable of navigating the unstructured web raises a fundamental question: Is semantic metadata still necessary for agentic data discovery, or can agents reliably retrieve actionable data directly from the web? We present a comparative analysis of agentic data retrieval across two distinct environments: a Baseline Agent searching billions of open-web documents, and a Semantic Agent leveraging a corpus of 90 million datasets using schema.org. We deploy an "LLM-as-a-judge" evaluation pipeline, mapped directly to the FAIR principles, to assess the semantic relevance, data accessibility, and computational utility of the retrieved data. Our results reveal a clear divergence. The Semantic Agent excels at retrieving actionable data, achieving a 44.9% higher precision for metadata-rich registries and a 46.6% higher precision for pages with machine-readable downloads among its returned results. Conversely, the Baseline Agent frequently suffers "Last-Mile Utility" failures, retrieving prose-heavy pages (20.1% of results) and portal landing pages (8.5%) rather than actual data pages. While the Baseline Agent achieves higher coverage by answering 40% more questions, the Semantic Agent delivers greater accuracy, achieving 65.7% higher overall precision in retrieving FAIR-compliant datasets. We conclude that while unstructured retrieval supports broad exploratory tasks, structured ecosystems remain the indispensable foundation for reliable, execution-oriented autonomous workflows.

10. [MUSE: Benchmarking Manufacturable, Functional, and Assemblable Text-to-CAD Generation](https://arxiv.org/abs/2605.28579v1)
   - Published: 2026-05-27 23:01
   - Authors: Xiaoyu Dong, Zhi Li, Xiao-Ming Wu
   - Source: arxiv
   - Relevance score: 157
   - Match reasons: title matched "benchmark"; summary matched "language model"; summary matched "large language model"; summary matched "LLM"
   - Categories: cs.AI
   - Tags: Evaluation / Data / Method
   - Topics: LLM / Language Model
   - PDF: https://arxiv.org/pdf/2605.28579v1
   - Abstract: Large language models (LLMs) have recently advanced text-driven 3D generation, yet Text-to-CAD remains far from supporting industrial product design. Existing benchmarks focus primarily on generating single-part CAD models and evaluate them using geometric similarity metrics that fail to capture functionality, manufacturability, and assemblability. To address this gap, we introduce MUSE, a Text-to-CAD benchmark focused on complex, editable boundary representation (B-Rep) assemblies. MUSE pairs practical design instances with structured Design Specifications and evaluates generated models through a three-stage protocol: code check, geometric check, and design-intent alignment. The final stage uses design-specific rubrics to assess functionality, manufacturability, and assemblability, moving beyond shape matching toward practical design quality. To enable scalable evaluation, we use a rubric-based visual language model (VLM) judge and validate its reliability through human annotation. Experiments on closed-source and open-source LLMs reveal a clear failure cascade from executable code to valid geometry and finally to engineering-ready design, with even the strongest models achieving limited success on fine-grained engineering criteria. Together, MUSE provides a realistic benchmark and evaluation framework for advancing Text-to-CAD from geometric generation toward true engineering design. Our project website, including the leaderboard, dataset, and code, is available at https://dong7313.github.io/muse-benchmark/.

11. [VLMs May Not Globally Enhance Human Alignment over LLMs During Natural Reading](https://arxiv.org/abs/2605.28818v1)
   - Published: 2026-05-28 01:59
   - Authors: Jinzhou Wu, Zhengwu Ma, Jixing Li, Baoping Tang, Zitong Lu
   - Source: arxiv
   - Relevance score: 146
   - Match reasons: title matched "LLM"; title matched "alignment"; summary matched "language model"; summary matched "large language model"
   - Categories: cs.CL, q-bio.NC
   - Tags: Data / Method
   - Topics: LLM / Language Model
   - PDF: https://arxiv.org/pdf/2605.28818v1
   - Abstract: Large language models (LLMs) have become increasingly useful computational models of human language processing, but it remains unclear whether vision-language learning makes text representations more human-like during natural reading. Here, we address this question by comparing tightly matched LLM and vision-language model (VLM) pairs under a strictly text-only setting, allowing us to isolate the effect of multimodal training history from online visual input or cross-modal fusion. We evaluate model alignment with a human natural-reading dataset that includes whole-cortex fMRI responses and synchronized eye-tracking saccades. Our findings demonstrate that multimodal pretraining may not confer a uniform, global advantage in human alignment during natural reading, indicating that language-internal representations remain the key factor for modeling human text processing. However, the VLM advantage could emerge more selectively when sentences contain stronger visual semantic content, with converging evidence from both fMRI and eye-movement alignments. Together, our findings provide a controlled in silico framework for testing how visual learning history shapes model-human alignment of language processing, suggesting that multimodal pretraining contributes selectively rather than globally to human-like language representations during natural reading.

12. [Skill-Conditioned Gated Self-Distillation for LLM Reasoning](https://arxiv.org/abs/2605.28791v1)
   - Published: 2026-05-28 01:49
   - Authors: Jiazhen Huang, Xiao Chen, Xiao Luo, Yong Dai, Senkang Hu, Yuzhi Zhao
   - Source: arxiv
   - Relevance score: 146
   - Match reasons: title matched "LLM"; title matched "reasoning"; summary matched "RAG"; summary matched "benchmark"
   - Categories: cs.CL, cs.AI
   - Tags: Evaluation / Method
   - Topics: LLM / Benchmark
   - PDF: https://arxiv.org/pdf/2605.28791v1
   - Abstract: On-policy self-distillation (SD) improves LLM reasoning by using teacher-side privileged information (PI) to turn sparse verifier outcomes into dense token-level supervision. Existing methods usually assume trusted PI, such as reference answers or successful traces. We ask whether PI can instead come from an experience-derived skill bank, where retrieved skills are compact and reusable but may also be irrelevant or misleading. We propose Skill-Conditioned Gated Self-Distillation (SGSD), which formulates skill-based SD as teacher hypothesis validation rather than unconditional imitation. SGSD retrieves skill-mistake pairs, constructs a multi-teacher pool, and lets all skill-conditioned teachers score the same plain-prompt student rollout. The verifier validates each teacher's polarity: supporting a success or suppressing a failure gives positive supervision, while the opposite stance is reversed. A robust gated objective then distills informative teacher-student disagreements while suppressing uncertain or extreme signals. Experiments on multiple mathematical reasoning benchmarks show that SGSD consistently improves over GRPO and remains competitive with answer-conditioned OPSD under a weaker PI assumption. For example, on Qwen3-1.7B, SGSD outperforms GRPO by 6.2% and OPSD by 1.7% on average on AIME24, AIME25, and HMMT25. Our code is available at https://github.com/walawalagoose/SGSD.
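
The polarity rule in this abstract can be read as a simple agreement check between a teacher's stance and the verifier's outcome. The sketch below encodes that reading only; the boolean encoding, the +1/-1 signs, and the function name are assumptions for illustration, and the gated objective itself is not modeled.

```python
def supervision_sign(teacher_supports: bool, rollout_correct: bool) -> int:
    """Return +1 when the teacher's stance agrees with the verifier, else -1.

    Agreement covers both cases named in the abstract: supporting a rollout
    that the verifier marks as a success, and suppressing one it marks as a failure.
    """
    return 1 if teacher_supports == rollout_correct else -1


for supports in (True, False):
    for correct in (True, False):
        print(supports, correct, supervision_sign(supports, correct))
```

Under this reading, a misleading skill that pushes the teacher toward a failed rollout has its signal flipped rather than imitated.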

13. [Evaluating the Realism of LLM-powered Social Agents: A Case Study of Reactions to Spanish Online News](https://arxiv.org/abs/2605.28598v1)
   - Published: 2026-05-27 23:16
   - Authors: Alejandro Buitrago López, Alberto Ortega Pastor, Javier Pastor-Galindo, José A. Ruipérez-Valiente
   - Source: arxiv
   - Relevance score: 144
   - Match reasons: title matched "LLM"; title matched "agent"; summary matched "alignment"; summary matched "benchmark"
   - Categories: cs.CL, cs.AI
   - Tags: Evaluation / Data / Method
   - Topics: LLM / Benchmark
   - PDF: https://arxiv.org/pdf/2605.28598v1
   - Abstract: LLM-powered social agents are increasingly used to simulate online social behavior, yet their realism remains difficult to validate. Existing work has largely relied on general-purpose benchmarks, while less attention has been paid to short, reactive discourse such as audience replies to online news. In this paper, we evaluate whether LLM-generated reactions to Spanish online news reproduce measurable properties of real audience discourse. Using the Hatemedia dataset, we pair 5,631 news items with 58,555 real audience reactions, and generate a matched synthetic dataset using five LLMs under a shared experimental setting. We compare real and synthetic reactions across three dimensions: hate speech, sentiment, and semantic alignment, considering both off-the-shelf and fine-tuned generation. Results show that off-the-shelf models are poor proxies for real audience reactions: they strongly underproduce hate speech, introduce model-specific sentiment biases, and remain distributionally distant from human replies. Fine-tuning improves fidelity unevenly. Qwen3 provides the most balanced approximation, while Mistral7B achieves the strongest sentiment and semantic alignment but overshoots hate prevalence. Plausible synthetic replies do not necessarily reproduce the distributional properties of public discourse.

14. [LLM Zeroth-Order Fine-Tuning is an Inference Workload](https://arxiv.org/abs/2605.28760v1)
   - Published: 2026-05-28 01:19
   - Authors: Zelin Li, Caiwen Ding
   - Source: arxiv
   - Relevance score: 142
   - Match reasons: title matched "LLM"; summary matched "language model"; summary matched "large language model"; summary matched "RAG"
   - Categories: cs.LG
   - Tags: Evaluation / Method
   - Topic keywords: LLM / Language Model
   - PDF: https://arxiv.org/pdf/2605.28760v1
   - Abstract: Zeroth-order (ZO) fine-tuning is attractive for large language models because it replaces backpropagation with forward objective evaluations. Existing implementations nevertheless execute ZO algorithms inside conventional training loops, even though their dominant work is repeated scoring under nearby parameter states. This creates a workload-runtime mismatch: the algorithm asks for structured inference-style scoring, while the system exposes a sequence of fragmented training-loop steps. We show that LLM ZO fine-tuning is an inference-dominated workload and execute its repeated scoring phase through a serving runtime. On OPT-13B SST-2, the resulting vLLM execution path completes the 20k-step LoZO run in 0.51 estimated training hours versus 4.15 hours for the official LoZO baseline under the matched LoRA-only setting, an 8.13x speedup, while reaching 0.922 final evaluation accuracy and 0.931 final full-validation accuracy. In core-step scaling experiments across OPT-1.3B to OPT-13B, the same runtime reorganization gives 2.34x to 7.72x speedups. A MeZO-style high-rank factorized experiment shows that the same runtime paradigm can track a MeZO-like loss trajectory while running up to 2.55x faster. More broadly, representing ZO updates as dynamic adapter states suggests a practical path toward inference-time training, where lightweight adaptation can be scheduled as an inference-like workload rather than as a separate training job.

15. [Self-Improving Language Models with Bidirectional Evolutionary Search](https://arxiv.org/abs/2605.28814v1)
   - Published: 2026-05-28 01:59
   - Authors: Guowei Xu, Zhenting Qi, Huangyuan Su, Weirui Ye, Himabindu Lakkaraju, Sham M. Kakade et al.
   - Source: arxiv
   - Relevance score: 124
   - Match reasons: title matched "language model"; summary matched "agent"; summary matched "RAG"; summary matched "benchmark"
   - Categories: cs.CL
   - Tags: Evaluation / Application / Method
   - Topic keywords: Language Model / Benchmark
   - PDF: https://arxiv.org/pdf/2605.28814v1
   - Abstract: Search has been proposed as an effective method for self-improving language models and agentic systems, both for post-training sample generation and for inference. However, widely used methods such as best-of-N sampling and tree search face two fundamental limitations: they are guided by sparse verification signals, and they construct candidates primarily through autoregressive expansion, restricting exploration to regions with substantial model probability mass. To address these, we propose Bidirectional Evolutionary Search (BES), a search framework that couples forward candidate evolution with backward goal decomposition. In the forward search, BES augments standard expansion with evolution operators that recombine partial trajectories to generate candidates that are difficult to obtain from a single model rollout. In the backward search, BES recursively decomposes the original task into checkable subgoals, producing dense intermediate feedback that guides forward search. We provide theoretical motivation showing that candidates generated by expansion-only search are confined to a narrow entropy shell while evolutionary operators can escape it, and that backward search can exponentially reduce the number of required samples to find a correct answer. Experiments show that on challenging post-training tasks where mainstream post-training algorithms fail to improve, BES enables consistent gains, and on three open problem solving benchmarks at inference time, BES outperforms existing open-source frameworks in both average and best-case performance. Code and trained models are available at https://github.com/Embodied-Minds-Lab/BES.
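
Entry 14 above describes zeroth-order (ZO) fine-tuning as replacing backpropagation with forward objective evaluations. The idea can be sketched with a two-point, SPSA-style estimator on a toy objective. Everything here is an illustrative assumption: the names `loss`, `zo_step`, and `run`, the quadratic objective, and the step sizes are hypothetical and are not taken from the paper, LoZO, or MeZO.

```python
import random

def loss(params):
    # Toy quadratic standing in for a model's forward loss (assumption).
    return sum((p - 3.0) ** 2 for p in params)

def zo_step(params, lr=0.05, eps=1e-3, rng=random):
    # Draw one random direction; no gradients are computed anywhere.
    z = [rng.gauss(0.0, 1.0) for _ in params]
    # Two forward evaluations at nearby parameter states.
    plus = loss([p + eps * zi for p, zi in zip(params, z)])
    minus = loss([p - eps * zi for p, zi in zip(params, z)])
    # Finite-difference estimate of the directional derivative along z.
    g = (plus - minus) / (2.0 * eps)
    # Move against the estimated gradient g * z.
    return [p - lr * g * zi for p, zi in zip(params, z)]

def run(steps=2000, seed=0):
    rng = random.Random(seed)
    params = [0.0, 0.0, 0.0, 0.0]
    for _ in range(steps):
        params = zo_step(params, rng=rng)
    return params
```

Each step costs two scoring passes and no backward pass, which is the property the abstract relies on when it treats ZO fine-tuning as an inference-style workload.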

## Agent Runtime Security Observations

### Group at a Glance

- "Learn from Weaknesses: Automated Domain Specialization for Small Computer-Use Agents" [Data / Method]: Computer-use agents (CUAs) have recently made substantial progress, but deploying a separate large expert for each software domain remains expensive. Small ope…
- "Code as a Weapon: A Consensus-Labeled Prompt Bank for Measuring Coding-Model Compliance with Malicious-Code Requests" [Evaluation / Data / Method]: A general-purpose language model that answers a harmful question returns text; a coding model that complies with a malicious request can return a working weapo…
- "The Ethics of LLM Sandbox and Persona Dynamics" [Application]: It is well known that LLM guardrails and trained persona dynamics can produce a reality gap: the distance between the world an LLM is permitted or shaped to des…
- "LACUNA: Safe Agents as Recursive Program Holes" [Method]: LLM agents increasingly act by writing code, yet a split persists between the runtime that drives the agent and the code the model writes. The runtime owns the…
- "Technical Report: Exploring the Emerging Threats of the Agent Skill Ecosystem" [Method]: We analyzed 3,984 AI agent skills from major marketplaces and found 76 confirmed malicious payloads, including credential theft, backdoor installation, and dat…

### Paper Overview

1. [Learn from Weaknesses: Automated Domain Specialization for Small Computer-Use Agents](https://arxiv.org/abs/2605.28775v1)
   - Published: 2026-05-28 01:37
   - Authors: Suji Kim, Kangsan Kim, Sung Ju Hwang
   - Source: arxiv
   - Relevance score: 70
   - Match reasons: title matched "computer-use agent"; has PDF; has rich summary; has complete metadata
   - Categories: cs.LG, cs.AI, cs.CL
   - Tags: Data / Method
   - Topic keywords: RAG / Agent
   - PDF: https://arxiv.org/pdf/2605.28775v1
   - Abstract: Computer-use agents (CUAs) have recently made substantial progress, but deploying a separate large expert for each software domain remains expensive. Small open computer-use agents are more practical specialization targets, but they remain substantially weaker and exhibit uneven domain-specific failures. A straightforward remedy is to synthesize large-scale training data for the target domain, yet we find that this naive approach yields only marginal improvements. Building on this observation, we introduce LearnWeak, an annotation-free specialization framework for small computer-use agents that uses a stronger reference agent to identify the student's weaknesses in the target domain, synthesize targeted tasks, and construct supervision automatically. LearnWeak further introduces an error-aware specialization objective that disentangles planning and execution errors, enabling more behaviorally precise updates than broad uniform supervision. On OSWorld, LearnWeak achieves average gains of 11.6 and 11.1 percentage points over EvoCUA-8B and OpenCUA-7B, respectively, across eight domains. We also validate that our student-aware dataset generation and training approaches outperform existing autonomous trajectory generation and training baselines. Our work highlights the importance of student awareness in both data synthesis and agent training, pointing toward a more principled and efficient path for specializing small computer-use agents in diverse domains.

2. [Code as a Weapon: A Consensus-Labeled Prompt Bank for Measuring Coding-Model Compliance with Malicious-Code Requests](https://arxiv.org/abs/2605.28734v1)
   - Published: 2026-05-28 00:55
   - Authors: Richard J. Young, Gregory D. Moody
   - Source: arxiv
   - Relevance score: 47
   - Match reasons: summary matched "jailbreak"; has PDF; has rich summary; has complete metadata
   - Categories: cs.CR, cs.CL, cs.LG
   - Tags: Evaluation / Data / Method
   - Topic keywords: Language Model / Benchmark
   - PDF: https://arxiv.org/pdf/2605.28734v1
   - Abstract: A general-purpose language model that answers a harmful question returns text; a coding model that complies with a malicious request can return a working weapon: a keylogger, a ransomware stub, an exploit that runs as written. This asymmetry in the severity of a single act of compliance implies coding-specialized models should clear a higher refusal bar than general-purpose chat models, not a lower one, yet the field cannot presently tell whether they do. Refusal benchmarks for malicious code are fragmented: they mix requests for executable software (ready-to-run weapons) with requests for harmful security knowledge (information a human must still operationalise) and report refusal rates over non-comparable corpora, so no single statistic measures the property that actually matters. This paper introduces an expanded consensus-labeled prompt bank that distinguishes between these two request types and provides a construct-stable substrate for cross-corpus coding-model compliance measurement. Eight corpora (ASTRA, CySecBench, AdvBench/harmful_behaviors, JailbreakBench, MalwareBench, RedCode, RMCBench, Scam2Prompt) are consolidated and classified under a five-judge consensus protocol (6,675 prompts x 5 judges = 33,375 calls). The panel reaches Fleiss' kappa = 0.767 [95% CI 0.755, 0.777] ("substantial"); 95.0% of prompts draw at least four agreeing judges, 76.9% are unanimous, and the panel reproduces the earlier four-corpus release at Cohen's kappa = 0.952 on the 3,133 shared prompts. The released bank comprises 4,748 consensus-CODE prompts (executable malicious code requests) and 1,923 consensus-KNOWLEDGE prompts (harmful security knowledge requests). The bank is the validated instrument the field has lacked: a reliability-quantified basis for testing whether coding models meet the stricter refusal standard their executable output demands.

3. [The Ethics of LLM Sandbox and Persona Dynamics](https://arxiv.org/abs/2605.28647v1)
   - Published: 2026-05-27 23:52
   - Authors: Tim Gebbie, Stewart Gebbie
   - Source: arxiv
   - Relevance score: 46
   - Match reasons: summary matched "guardrail"; has PDF; has rich summary; has complete metadata
   - Categories: cs.AI, cs.CY, q-fin.RM
   - Tags: Application
   - Topic keywords: LLM / Guardrail
   - PDF: https://arxiv.org/pdf/2605.28647v1
   - Abstract: It is well known that LLM guardrails and trained persona dynamics can produce a reality gap: the distance between the world an LLM is permitted or shaped to describe, and the world in which users must act. Here we argue that actively generating reality gaps is in fact unethical because it knowingly shifts epistemic risk back to the uninformed user; this is reality laundering. This can potentially cause harm when operationalised at scale. The risk is sharpest in high-exposure advice contexts, where users seek orientation rather than a bounded, externally checkable task. Guardrails naively appear ethically necessary when they claim to prevent direct harm, but often become suspect when they suppress truthful perception and launder uncomfortable mechanisms into acceptable abstractions. Basel-style financial regulation, B-BBEE-style compliance, Societe Generale, and the London Whale show how formal safety systems can become legible, gameable, and performative while real exposure migrates elsewhere. The same pattern can appear in LLMs as moral compliance: safe language, distorted reality. We therefore distinguish refusing harm from refusing reality, and then argue for top-down causal requirements specification at the task level rather than bottom-up moral correction at the response or sandbox level. Persona dynamics matter because the assistant interface is not neutral; it shapes how uncertainty, conflict, authority, and risk are staged. The conclusion is that so-called "ethical AI" becomes substantively unethical when it substitutes institutional reassurance for contact with reality.

4. [LACUNA: Safe Agents as Recursive Program Holes](https://arxiv.org/abs/2605.28617v1)
   - Published: 2026-05-27 23:27
   - Authors: Yaoyu Zhao, Yichen Xu, Oliver Bračevac, Cao Nguyen Pham, Frank Zhengqing Wu, Martin Odersky
   - Source: arxiv
   - Relevance score: 46
   - Match reasons: summary matched "prompt injection"; has PDF; has rich summary; has complete metadata
   - Categories: cs.AI, cs.PL
   - Tags: Method
   - Topic keywords: LLM / RAG
   - PDF: https://arxiv.org/pdf/2605.28617v1
   - Abstract: LLM agents increasingly act by writing code, yet a split persists between the runtime that drives the agent and the code the model writes. The runtime owns the loop, context, and control flow, and the model has little say over any of them. Letting model-written code shape the runtime itself would make agents more expressive, but it would also sharpen safety problems. A model can be diverted by a prompt injection, call the wrong tool, or fail partway and leave an inconsistent state, and each such failure reaches further when the code shapes the runtime than when it expresses a single action. We present LACUNA, a programming model for agents that closes this split while preserving safety. Each agent action is a typed call `agent[T](task)` that the LLM fills with code when execution reaches it, and the code is type-checked against the surrounding program before it runs. Because each action is accepted or rejected as a whole, a rejected one leaves the environment untouched, and its compiler diagnostics drive a retry. The same check also bounds which tools and data an action may use and how they flow. Our primitive expresses ReAct loops, sub-agents, skills, parallel decomposition, and multi-model planning as ordinary control flow. We evaluate LACUNA on a collection of test cases, BrowseComp-Plus, and τ²-bench. On BrowseComp-Plus, 8.6% of generations are rejected before execution, with 0.7 retries per query on average, and the agent reaches 27.1% accuracy. On τ²-bench, LACUNA solves 76.0% of 392 tasks across four domains with a capable model, on par with the baseline agent.

5. [Technical Report: Exploring the Emerging Threats of the Agent Skill Ecosystem](https://arxiv.org/abs/2605.28588v1)
   - Published: 2026-05-27 23:10
   - Authors: Luca Beurer-Kellner, Aleksei Kudrinskii, Marco Milanta, Kristian Bonde Nielsen, Hemang Sarkar, Liran Tal
   - Source: arxiv
   - Relevance score: 45
   - Match reasons: summary matched "data exfiltration"; has PDF; has rich summary; has complete metadata
   - Categories: cs.CR, cs.AI
   - Tags: Method
   - Topic keywords: Agent / Data Exfiltration
   - PDF: https://arxiv.org/pdf/2605.28588v1
   - Abstract: We analyzed 3,984 AI agent skills from major marketplaces and found 76 confirmed malicious payloads, including credential theft, backdoor installation, and data exfiltration. 13.4% of all skills contain at least one critical-level security issue and at least 8 manually confirmed malicious skills remain publicly available on clawhub.ai as of the date of publication. This report documents our methodology, presents a threat taxonomy based on real-world samples, and details the attack patterns we observed. As skill marketplaces grow rapidly and AI agents gain access to sensitive credentials and systems, automated security analysis is no longer optional.
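
Entry 2 in this group reports inter-judge agreement as Fleiss' kappa for a five-judge panel. For readers who want to see what that statistic measures, here is a minimal sketch of the textbook Fleiss' kappa formula. It is not the authors' code; the function name `fleiss_kappa` and the small ratings table in the usage note are hypothetical.

```python
def fleiss_kappa(table):
    """Textbook Fleiss' kappa.

    table[i][j] = number of judges who assigned item i to category j.
    Every row must sum to the same number of judges (at least 2).
    """
    n_items = len(table)
    n_judges = sum(table[0])
    if n_judges < 2 or any(sum(row) != n_judges for row in table):
        raise ValueError("each item needs the same number (>= 2) of ratings")

    # Per-item agreement: fraction of judge pairs that agree on the item.
    per_item = [
        (sum(c * c for c in row) - n_judges) / (n_judges * (n_judges - 1))
        for row in table
    ]
    observed = sum(per_item) / n_items

    # Chance agreement from the overall share of each category.
    total = n_items * n_judges
    shares = [sum(row[j] for row in table) / total for j in range(len(table[0]))]
    expected = sum(s * s for s in shares)

    if expected == 1.0:
        raise ValueError("kappa is undefined when all ratings share one category")
    return (observed - expected) / (1.0 - expected)
```

A hypothetical call with five judges and two labels (CODE, KNOWLEDGE) would be `fleiss_kappa([[5, 0], [4, 1], [0, 5], [1, 4]])`. The value is 1 when every item is labeled unanimously and falls toward 0 as agreement approaches what chance alone would produce.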

## Terminal and SWE Agents Observations

### Group at a Glance

- "Calibrating Conservatism for Scalable Oversight" [Method]: Agentic AI systems capable of autonomous planning and extended environmental interaction pose a fundamental control problem: how can humans maintain meaningful…

### Paper Overview

1. [Calibrating Conservatism for Scalable Oversight](https://arxiv.org/abs/2605.28807v1)
   - Published: 2026-05-28 01:56
   - Authors: William Overman, Mohsen Bayati
   - Source: arxiv
   - Relevance score: 48
   - Match reasons: summary matched "SWE-bench"; has PDF; has rich summary; has complete metadata
   - Categories: cs.AI
   - Tags: Method
   - Topic keywords: Agent / SWE-bench
   - PDF: https://arxiv.org/pdf/2605.28807v1
   - Abstract: Agentic AI systems capable of autonomous planning and extended environmental interaction pose a fundamental control problem: how can humans maintain meaningful oversight of systems that may exceed their own capabilities? Existing approaches to scalable oversight rely on complex assumptions, remain largely heuristic, or lack practical methods for sequential settings with statistical guarantees. We introduce Calibrated Collective Oversight (CCO), which aggregates diverse auxiliary scoring functions into a penalty measuring deviation from a conservative baseline. Inspired by Attainable Utility Preservation, CCO enables collective conservatism: actions face a penalty proportional to overseer concern, so high-utility actions are still selected when overseers find them unobjectionable and overridden only when concern accumulates. CCO calibrates this conservatism online using Conformal Decision Theory, ensuring that undesirable outcomes remain below a user-specified target threshold with finite-time bounds and no distributional assumptions. On a modified version of SWE-bench, weaker overseers successfully constrain an adversarially misaligned stronger agent; on MACHIAVELLI, CCO substantially reduces ethical violations while preserving reward. In both settings, empirical violation rates closely match the specified targets, as predicted by the theory.
