# 每日论文简报

- 生成时间：2026-06-09 13:12:49 (Asia/Shanghai)
- 检索窗口：最近 24 小时
- 命中概览：LM=15, Agent Runtime Security=4, Terminal and SWE Agents=3
- 排序策略：hybrid (relevance first, published_at tie-break)

## 今日重点

- 主题「LLM」：命中 18 篇，覆盖 LM、Agent Runtime Security 等，代表论文包括 《SpatialWorld: Benchmarking Interactive Spatial Reasoning of Multimodal Agents in Real-World Tasks》、《Where Does the Answer Come From? Benchmarking View-Level Visual Evidence Identification in Multi-View MLLMs for Autonomous Driving》。
- 主题「Language Model」：命中 16 篇，覆盖 LM、Agent Runtime Security，代表论文包括 《SpatialWorld: Benchmarking Interactive Spatial Reasoning of Multimodal Agents in Real-World Tasks》、《Where Does the Answer Come From? Benchmarking View-Level Visual Evidence Identification in Multi-View MLLMs for Autonomous Driving》。
- 主题「Agent」：命中 6 篇，覆盖 LM、Agent Runtime Security 等，代表论文包括 《AI Scientists Are Only as Good as Their Evidence: A Stratified Ablation of Proprietary Data and Reasoning Skills in Drug-Asset Valuation》、《WeaveBench: A Long-Horizon, Real-World Benchmark for Computer-Use Agents with Hybrid Interfaces》。
- 主题「Large Language Model」：命中 1 篇，覆盖 LM，代表论文包括 《The Neutral Mask: How RLHF Provides Shallow Alignment while Leaving Partisan Structure Intact in a Large Language Model》。
- 主题「Benchmark」：命中 1 篇，覆盖 Agent Runtime Security，代表论文包括 《WeaveBench: A Long-Horizon, Real-World Benchmark for Computer-Use Agents with Hybrid Interfaces》。

## 栏目状态

- LM：15 篇
- Agent Runtime Security：4 篇
- Terminal and SWE Agents：3 篇

## 主题聚焦

### LLM

- 命中篇数：18
- 覆盖分组：LM、Agent Runtime Security、Terminal and SWE Agents
- 代表论文：《SpatialWorld: Benchmarking Interactive Spatial Reasoning of Multimodal Agents in Real-World Tasks》、《Where Does the Answer Come From? Benchmarking View-Level Visual Evidence Identification in Multi-View MLLMs for Autonomous Driving》、《TABVERSE: Benchmarking Cross-Format Table Understanding in LLMs and VLMs》
- 主题速读：
  - 《SpatialWorld: Benchmarking Interactive Spatial Reasoning of Multimodal Agents in Real-World Tasks》〔评测 / 方法〕：Spatial reasoning is a foundational capability for multimodal large language models (MLLMs) to perceive and operate within the physical world. However, existin…
  - 《Where Does the Answer Come From? Benchmarking View-Level Visual Evidence Identification in Multi-View MLLMs for Autonomous Driving》〔评测 / 方法〕：Multimodal large language models (MLLMs) achieve strong results on visual reasoning benchmarks, but answer accuracy alone does not indicate whether a model rel…

### Language Model

- 命中篇数：16
- 覆盖分组：LM、Agent Runtime Security
- 代表论文：《SpatialWorld: Benchmarking Interactive Spatial Reasoning of Multimodal Agents in Real-World Tasks》、《Where Does the Answer Come From? Benchmarking View-Level Visual Evidence Identification in Multi-View MLLMs for Autonomous Driving》、《TABVERSE: Benchmarking Cross-Format Table Understanding in LLMs and VLMs》
- 主题速读：
  - 《SpatialWorld: Benchmarking Interactive Spatial Reasoning of Multimodal Agents in Real-World Tasks》〔评测 / 方法〕：Spatial reasoning is a foundational capability for multimodal large language models (MLLMs) to perceive and operate within the physical world. However, existin…
  - 《Where Does the Answer Come From? Benchmarking View-Level Visual Evidence Identification in Multi-View MLLMs for Autonomous Driving》〔评测 / 方法〕：Multimodal large language models (MLLMs) achieve strong results on visual reasoning benchmarks, but answer accuracy alone does not indicate whether a model rel…

### Agent

- 命中篇数：6
- 覆盖分组：LM、Agent Runtime Security、Terminal and SWE Agents
- 代表论文：《AI Scientists Are Only as Good as Their Evidence: A Stratified Ablation of Proprietary Data and Reasoning Skills in Drug-Asset Valuation》、《WeaveBench: A Long-Horizon, Real-World Benchmark for Computer-Use Agents with Hybrid Interfaces》、《Brain-Prompt Injection: A Route-Safety Audit for BCI-LLM Agents》
- 主题速读：
  - 《AI Scientists Are Only as Good as Their Evidence: A Stratified Ablation of Proprietary Data and Reasoning Skills in Drug-Asset Valuation》〔评测 / 数据 / 方法〕：AI Scientist agents are often evaluated as if capability were mainly a function of model quality, prompting, or reasoning scaffolds. We test a different hypoth…
  - 《WeaveBench: A Long-Horizon, Real-World Benchmark for Computer-Use Agents with Hybrid Interfaces》〔评测 / 方法〕：Computer-use agents (CUAs) increasingly operate in runtimes that combine visual desktop control, command-line execution, code editing, browsers, and external t…

### Large Language Model

- 命中篇数：1
- 覆盖分组：LM
- 代表论文：《The Neutral Mask: How RLHF Provides Shallow Alignment while Leaving Partisan Structure Intact in a Large Language Model》
- 主题速读：
  - 《The Neutral Mask: How RLHF Provides Shallow Alignment while Leaving Partisan Structure Intact in a Large Language Model》〔方法〕：The ambition behind alignment training is to make large language models safe and useful. The primary mechanism, reinforcement learning from human feedback (RLH…

### Benchmark

- 命中篇数：1
- 覆盖分组：Agent Runtime Security
- 代表论文：《WeaveBench: A Long-Horizon, Real-World Benchmark for Computer-Use Agents with Hybrid Interfaces》
- 主题速读：
  - 《WeaveBench: A Long-Horizon, Real-World Benchmark for Computer-Use Agents with Hybrid Interfaces》〔评测 / 方法〕：Computer-use agents (CUAs) increasingly operate in runtimes that combine visual desktop control, command-line execution, code editing, browsers, and external t…

## LM 观察

### 本组速览

- 《SpatialWorld: Benchmarking Interactive Spatial Reasoning of Multimodal Agents in Real-World Tasks》〔评测 / 方法〕：Spatial reasoning is a foundational capability for multimodal large language models (MLLMs) to perceive and operate within the physical world. However, existin…
- 《Where Does the Answer Come From? Benchmarking View-Level Visual Evidence Identification in Multi-View MLLMs for Autonomous Driving》〔评测 / 方法〕：Multimodal large language models (MLLMs) achieve strong results on visual reasoning benchmarks, but answer accuracy alone does not indicate whether a model rel…
- 《TABVERSE: Benchmarking Cross-Format Table Understanding in LLMs and VLMs》〔评测 / 方法〕：Large Language Models (LLMs) and Vision-Language Models (VLMs) are increasingly evaluated on table reasoning tasks, but the role of table representation remain…
- 《Gradient-Guided Reward Optimization for Inference-time Alignment》〔评测 / 应用 / 方法〕：Ensuring the reliability of Large Language Models (LLMs) under distribution drift requires inference-time adaptation. While inference-time alignment methods su…
- 《IS-CoT: Breaking the Long-form Generation Collapse via Interleaved Structural Thinking》〔评测 / 数据 / 应用 / 方法〕：Generating coherent and controllable long-form content remains a persistent challenge for Large Language Models (LLMs). While reasoning-enhanced models have de…

### 论文速览

1. [SpatialWorld: Benchmarking Interactive Spatial Reasoning of Multimodal Agents in Real-World Tasks](https://arxiv.org/abs/2606.09669v1)
   - Published：2026-06-08 23:51
   - 作者：Hongcheng Gao，Hailong Qu，Jingyi Tang，Jiahao Wang，Zihao Huang，Hengkang Qiao 等
   - 来源：arxiv
   - 相关性分数：238
   - 命中原因：title matched "reasoning"; title matched "agent"; title matched "benchmark"; summary matched "language model"
   - 分类：cs.AI, cs.CL
   - 标签：评测 / 方法
   - 主题词：LLM / Language Model
   - PDF：https://arxiv.org/pdf/2606.09669v1
   - 摘要：Spatial reasoning is a foundational capability for multimodal large language models (MLLMs) to perceive and operate within the physical world. However, existing benchmarks predominantly rely on passive evaluation (e.g., static VQA) or simulator-specific pipelines, failing to assess general interactive spatial understanding. We introduce SpatialWorld, a unified benchmark designed specifically for evaluating the interactive spatial understanding of multimodal agents in complex real-world tasks. Integrating eight heterogeneous simulation backends under a shared, simulator-agnostic protocol, SpatialWorld features 760 human-annotated tasks across diverse domains (e.g., household routines, travel, social collaboration). Agents must solve tasks under vision-only partial observability, actively gathering egocentric visual evidence and expressing decisions via a unified, text-based action interface native to MLLMs. For reliable evaluation, each task includes a human-validated initial state, a reference trajectory, and a terminal-state verifier. Evaluating 15 advanced agents reveals that robust spatial task solving remains challenging: the strongest model, GPT-5, achieves an average task success rate (TSR) of only 17.4%, while the leading open-source model, Qwen-3.5, reaches 14.1%. Further analysis exposes a clear mismatch between task success and execution efficiency, alongside substantial domain-specific performance variations. These bottlenecks in active exploration and long-horizon planning position SpatialWorld as a rigorous testbed for future spatial agents.

2. [Where Does the Answer Come From? Benchmarking View-Level Visual Evidence Identification in Multi-View MLLMs for Autonomous Driving](https://arxiv.org/abs/2606.09644v1)
   - Published：2026-06-08 23:39
   - 作者：Yimu Wang，Yee Man Choi，Barry Zhang，Mozhgan Nasr Azadani，Sean Sedwards，Krzysztof Czarnecki
   - 来源：arxiv
   - 相关性分数：180
   - 命中原因：title matched "LLM"; title matched "benchmark"; summary matched "language model"; summary matched "large language model"
   - 分类：cs.CL, cs.CV
   - 标签：评测 / 方法
   - 主题词：LLM / Language Model
   - PDF：https://arxiv.org/pdf/2606.09644v1
   - 摘要：Multimodal large language models (MLLMs) achieve strong results on visual reasoning benchmarks, but answer accuracy alone does not indicate whether a model relied on the correct visual evidence. This gap is particularly important in multi-view driving scenes used for autonomous driving, where a model can produce a plausible answer while grounding it in the wrong camera view. We introduce a multi-view visual question answering benchmark for evaluating evidence-source identification: given six synchronized NuScenes views and a question, the model must identify the supporting camera view and answer the question. The benchmark contains 122 conflict-centric question-answer pairs from 73 scenes, spanning causality, counterfactual reasoning, and intent prediction. View labels are proposed by an automatic conflict-mining pipeline and manually verified by annotators. We evaluate three settings: camera-view selection, oracle QA given the golden view, and joint prediction in which the model selects a view and answers in one pass. Answers are evaluated in both multiple-choice and free-form formats, using exact match for structured predictions and an LLM judge for free-form responses. By explicitly separating visual-source identification from answer correctness, the benchmark exposes grounding failures that answer-only evaluation misses.

3. [TABVERSE: Benchmarking Cross-Format Table Understanding in LLMs and VLMs](https://arxiv.org/abs/2606.09578v1)
   - Published：2026-06-08 22:52
   - 作者：Momina Ahsan，Sarfraz Ahmad，Ming Shan Hee，Roy Ka-Wei Lee，Preslav Nakov
   - 来源：arxiv
   - 相关性分数：179
   - 命中原因：title matched "LLM"; title matched "benchmark"; summary matched "language model"; summary matched "large language model"
   - 分类：cs.AI, cs.CL, cs.IR
   - 标签：评测 / 方法
   - 主题词：LLM / Language Model
   - PDF：https://arxiv.org/pdf/2606.09578v1
   - 摘要：Large Language Models (LLMs) and Vision-Language Models (VLMs) are increasingly evaluated on table reasoning tasks, but the role of table representation remains under-explored. In practice, the same table content may appear in different structural formats, such as HTML, Markdown, and LaTeX, or as rendered images. However, existing evaluations often let content, format, layout, and modality vary together, making it difficult to isolate representation effects. We introduce TABVERSE, a controlled multimodal table benchmark that aligns the same table content across multiple structural formats and rendered images, with question category and difficulty tags. This design enables systematic evaluation of representation effects while holding table content fixed. We evaluate LLMs and VLMs across three tasks: Question Answering (QA), Structural Understanding Capability (SUC), and Structure Reconstruction (SR). Our results show that representation choice substantially affects table understanding. Models generally perform better with structured text than with rendered images, but the size of this gap depends on the task, model, and format. HTML is often the most robust text format, while row-sensitive structural tasks and syntactically usable LaTeX reconstruction remain challenging. These findings show that table representation is a key factor in reliable table evaluation.

4. [Gradient-Guided Reward Optimization for Inference-time Alignment](https://arxiv.org/abs/2606.09635v1)
   - Published：2026-06-08 23:33
   - 作者：Hankun Lin，Ruqi Zhang
   - 来源：arxiv
   - 相关性分数：176
   - 命中原因：title matched "alignment"; summary matched "language model"; summary matched "large language model"; summary matched "LLM"
   - 分类：cs.CL
   - 标签：评测 / 应用 / 方法
   - 主题词：LLM / Language Model
   - PDF：https://arxiv.org/pdf/2606.09635v1
   - 摘要：Ensuring the reliability of Large Language Models (LLMs) under distribution drift requires inference-time adaptation. While inference-time alignment methods such as Best-of-$N$ and rejection sampling are widely used, they frame the task as a sampling-intensive, reward-guided search, leading to two key limitations: their performance is bounded by the base model's generation quality, and their reliance on imperfect reward models makes them vulnerable to reward hacking. To address these challenges, we introduce Gradient-Guided Reward Optimization (GGRO), a lightweight inference-time method that performs targeted, minimal intervention during decoding via gradient guidance. Specifically, GGRO monitors token-level entropy to identify high-uncertainty regions indicative of drift or misalignment. Upon detection, it responds by injecting nudging tokens, generated using gradient signals from an off-the-shelf reward model, to steer the generation trajectory rather than merely re-ranking samples. Experiments show that GGRO consistently improves inference-time alignment across safety, helpfulness, and reasoning benchmarks. It also increases coverage of high-quality responses and robustness to reward hacking, with minimal computational overhead. Code is available at https://github.com/lhk2004/GGRO.

5. [IS-CoT: Breaking the Long-form Generation Collapse via Interleaved Structural Thinking](https://arxiv.org/abs/2606.09709v1)
   - Published：2026-06-09 00:31
   - 作者：Zechen Sun，Yuyang Sun，Zecheng Tang，Juntao Li，Wenpeng Hu，Wenliang Chen 等
   - 来源：arxiv
   - 相关性分数：173
   - 命中原因：summary matched "language model"; summary matched "large language model"; summary matched "LLM"; summary matched "reasoning"
   - 分类：cs.CL
   - 标签：评测 / 数据 / 应用 / 方法
   - 主题词：LLM / Language Model
   - PDF：https://arxiv.org/pdf/2606.09709v1
   - 摘要：Generating coherent and controllable long-form content remains a persistent challenge for Large Language Models (LLMs). While reasoning-enhanced models have demonstrated success in logic-intensive domains, our evaluation reveals that they suffer from a severe length collapse in open-ended writing, where performance degrades sharply as target lengths exceed 2,000 words. We attribute this failure to the limitation of static hierarchical planning, which struggles to provide dynamic guidance over extended contexts. To bridge this gap, we introduce the Interleaved Structural Chain-of-Thought (IS-CoT) framework. Unlike external agentic workflows, IS-CoT embeds a dynamic Plan-Write-Reflect cycle into the generation process, enabling continuous strategy adaptation and global alignment without additional assistance. Based on this framework, we construct a high-quality dataset of interleaved reasoning traces via a multi-teacher pipeline and train IS-Writer-8B. Experiments demonstrate that IS-Writer-8B achieves state-of-the-art performance on challenging long-form benchmarks (e.g., +3.08 vs. DeepSeek-V3.2 on LongBench-Write), exhibiting robust length compliance and coherence competitive with significantly larger proprietary models.

6. [The Neutral Mask: How RLHF Provides Shallow Alignment while Leaving Partisan Structure Intact in a Large Language Model](https://arxiv.org/abs/2606.09735v1)
   - Published：2026-06-09 01:00
   - 作者：Wendy K. Tam
   - 来源：arxiv
   - 相关性分数：167
   - 命中原因：title matched "language model"; title matched "large language model"; title matched "alignment"; summary matched "RAG"
   - 分类：cs.CL
   - 标签：方法
   - 主题词：Language Model / Large Language Model
   - PDF：https://arxiv.org/pdf/2606.09735v1
   - 摘要：The ambition behind alignment training is to make large language models safe and useful. The primary mechanism, reinforcement learning from human feedback (RLHF), shapes the behavior of deployed language models by aligning them with ``human values.'' Yet the process is opaque. What values are being encoded; whose values are they; and how does RLHF encode them? A growing body of evidence suggests that RLHF produces only functional compliance rather than deep alignment. We offer a mechanistic case study of this phenomenon for partisan political orientation with a comparison of the internal representations of Llama 3.1 8B before and after RLHF. We show that RLHF does not remove the structured partisan direction in the base model. Instead, it compresses the variance of the partisan signal to generate consistently balanced and non-partisan output. Sparse autoencoder decomposition reveals that policy-encoding features, which activate sporadically in the base model, are completely inactive in the Instruct model. Feature-level steering experiments confirm the causal disconnect. RLHF thus encodes a norm of political neutrality, not by erasing the model's knowledge of partisanship, but by severing the causal pathway from partisan geometry to output generation. Importantly, this neutrality is functional, not structural so that the underlying geometry that enables partisan steering remains intact. The mechanisms that bypass RLHF's guardrails, such as inferring and amplifying a user's partisan identity, reactivate partisan generation. If RLHF operates by disconnecting rather than removing value-laden structure, then the same pattern may hold for other value domains, and the aligned model's behavior may be more fragile than its outputs suggest.

7. [Optical Reasoning: Rethinking Images as an Expressive Reasoning Medium Beyond Text](https://arxiv.org/abs/2606.09585v1)
   - Published：2026-06-08 22:58
   - 作者：Yutong Bian，Dongjie Cheng，Heming Xia，Yongqi Li，Wenjie Li
   - 来源：arxiv
   - 相关性分数：157
   - 命中原因：title matched "reasoning"; summary matched "language model"; summary matched "large language model"; summary matched "LLM"
   - 分类：cs.AI
   - 标签：评测 / 方法
   - 主题词：LLM / Language Model
   - PDF：https://arxiv.org/pdf/2606.09585v1
   - 摘要：Chain-of-Thought (CoT) improves the performance of Large Language Models (LLMs) and has been extended to Multimodal Large Language Models (MLLMs). More recent work further moves from text-based multimodal reasoning toward interleaved-modal reasoning, where intermediate steps can incorporate both textual rationales and visual evidence. In this work, we propose a bolder and more ambitious idea: could images alone serve as the reasoning medium for both language and multimodal tasks? To explore this, we propose optical reasoning, which treats images as a standalone reasoning medium. We instantiate this concept with two variants: typographic-based optical reasoning, which optimizes visual layouts for compact rationale rendering, and graphical-based optical reasoning, which composes text and graphical elements into structured visual rationales. Across mathematical, scientific, and interleaved-modal reasoning benchmarks, optical reasoning can match or even exceed traditional text reasoning while reducing reasoning tokens by an average of 28.57% on language tasks and 16% on multimodal tasks, achieving 1.96 times the token efficiency of text reasoning. These results show that images can effectively and efficiently encode rationales while providing a unified visual canvas for reasoning.

8. [OmniGameArena: A Unified UE5 Benchmark for VLM Game Agents with Improvement Dynamics](https://arxiv.org/abs/2606.09826v1)
   - Published：2026-06-09 01:59
   - 作者：Mingxian Lin，Shengju Qian，Yuqi Liu，Yi-Hua Huang，Yiyu Wang，Wei Huang 等
   - 来源：arxiv
   - 相关性分数：146
   - 命中原因：title matched "agent"; title matched "benchmark"; summary matched "language model"; summary matched "LLM"
   - 分类：cs.CV, cs.AI
   - 标签：评测 / 方法
   - 主题词：LLM / Language Model
   - PDF：https://arxiv.org/pdf/2606.09826v1
   - 摘要：Vision-language model (VLM) agents are increasingly deployed in interactive game environments. Yet game benchmarks for VLM agents typically report a single first-attempt score per (agent, game) pair, focus on single-agent Solo play, and lack unified protocols for evaluating heterogeneous agent classes (commercial VLMs, open-weight VLMs, and specialized game policies) on the same footing. We address these gaps with OmniGameArena, a real-time benchmark of twelve newly built Unreal Engine 5 games spanning Solo (7), PvP (3), and Coop (2) with unified action interfaces, and the Improvement Dynamics Curve (IDC), an agentic-reflection harness in which a tool-using reflector LLM autonomously refines a bounded skill prompt across multiple rounds. Beyond cold-start leaderboard scores, IDC exposes two additional observables for each (agent, game) pair: how the score evolves across reflection rounds, and how the learned skill behaves on held-out task variants. We report these observables for twelve VLM agents on the cold-start leaderboard and four top agents under IDC.

9. [SearchSwarm: Towards Delegation Intelligence in Agentic LLMs for Long-Horizon Deep Research](https://arxiv.org/abs/2606.09730v1)
   - Published：2026-06-09 00:52
   - 作者：Pu Ning，Quan Chen，Kun Tao，Xinyu Tang，Tianshu Wang，Qianggang Cao 等
   - 来源：arxiv
   - 相关性分数：145
   - 命中原因：title matched "LLM"; title matched "agent"; summary matched "language model"; summary matched "large language model"
   - 分类：cs.AI
   - 标签：应用 / 方法
   - 主题词：LLM / Language Model
   - PDF：https://arxiv.org/pdf/2606.09730v1
   - 摘要：Large language models are increasingly expected to handle complex, long-horizon real-world tasks whose context demands can grow without bound, yet model context windows remain inherently finite. Recent work explores a paradigm where a main agent decomposes tasks and dispatches subtasks to subagents, which execute and return only summarized results, conserving the main agent's context budget. However, performing this well requires delegation intelligence: the ability to decompose complex tasks, determine when and what to delegate, and integrate returned results into the ongoing workflow. Training data for this capability is scarce in naturally occurring text, and to our knowledge, how to synthesize such data and train models to acquire this capability remains largely unexplored in the open-source community. To bridge this gap, we present a preliminary exploration targeting deep research, a representative long-horizon agent task. Specifically, we design a harness that guides the model toward high-quality task decomposition and delegation, while constraining subagents to return results properly to support the main agent's workflow. The harness-guided trajectories naturally encode correct delegation decisions, which we use as supervised fine-tuning data to internalize delegation intelligence into model weights. Our resulting model, SearchSwarm-30B-A3B, achieves 68.1 on BrowseComp and 73.3 on BrowseComp-ZH, the best results among all models of comparable scale. We will release our harness, model weights, and training data to facilitate future research.

10. [PsychoSafe: Eliciting Psychologically-Informed Refusals in Large Language Models](https://arxiv.org/abs/2606.09697v1)
   - Published：2026-06-09 00:19
   - 作者：Gianluca Barmina，Federico Torrielli，Sven Harms，Jacob Nielsen，Felix Mächtle，Stine Lyngsø Beltoft 等
   - 来源：arxiv
   - 相关性分数：145
   - 命中原因：title matched "language model"; title matched "large language model"; summary matched "LLM"; summary matched "evaluation"
   - 分类：cs.CL
   - 标签：评测 / 数据 / 方法
   - 主题词：LLM / Language Model
   - PDF：https://arxiv.org/pdf/2606.09697v1
   - 摘要：Large language models (LLMs) routinely face requests that should be refused, creating a trade-off between helpfulness and harm prevention. However, refusals themselves can be helpful. In high-risk interactions involving crisis, coercion, or escalating intent, blunt non-compliance may prevent direct harm while still failing to support the needs of the person behind the request. We present PsychoSafe, a psychologically-informed refusal framework that reframes refusal as structured supportive communication grounded in evidence-based intervention strategies. To develop PsychoSafe, we construct a corpus of 8019 prompt-response pairs spanning five psychologically salient risk domains and apply prompting and parameter-efficient fine-tuning to Qwen 3.5 27B. On a balanced validation set of 500 prompts, evaluated with an LLM judge and validated through human ratings, PsychoSafe prompting improves overall refusal quality by 28.1% over a generic baseline, with particularly strong gains in external resource referral (+46.8%) and psychological grounding (+34.8%), while preserving downstream performance on non-refusal tasks. Fine-tuning achieves near-perfect refusal and resource-referral rates but reduces response relevance. Additional evaluations on SORRY-Bench and XSTest show strong in-domain robustness but limited out-of-domain generalization, suggesting that future work should diversify fine-tuning data to help models apply interventions selectively rather than schematically.

11. [Civil Court Simulation with Large Language Models](https://arxiv.org/abs/2606.09632v1)
   - Published：2026-06-08 23:30
   - 作者：Yifan Chen，Haitao Li，Kaiyuan Zhang，Yueyue Wu，Qingyao Ai，Yiqun Liu
   - 来源：arxiv
   - 相关性分数：144
   - 命中原因：title matched "language model"; title matched "large language model"; summary matched "LLM"; summary matched "agent"
   - 分类：cs.CL
   - 标签：数据 / 应用 / 方法
   - 主题词：LLM / Language Model
   - PDF：https://arxiv.org/pdf/2606.09632v1
   - 摘要：Court simulation bridges legal education and judicial practice, yet human-based simulations are costly and difficult to scale. Large language models (LLMs) offer a scalable alternative, but existing court-simulation research mainly focuses on criminal cases. Civil litigation is more common in practice and harder to simulate because its claims, liability, and remedies are more flexible. We present a multi-agent court simulation framework for Chinese civil cases. The framework organizes role-based interaction through a five-stage civil trial procedure and integrates memory module and statute retrieval to support long-process adjudication. Experiments show that the framework produces reliable civil judgments, with clear strengths in liability allocation and multi-item adjudication. Further experiments show that memory quality substantially affects downstream simulation quality. Through a five-layer factor framework, we analyze how legal grounding, information conditions, judicial capability and role orientation, organizational pressure, and social context affect the framework's reliability and behavior. These results support the effectiveness of the proposed framework for civil court simulation. The dataset and code are available at: https://github.com/foggpoy/Civil-Court.

12. [SecureClaw: Clawing Back Control of LLM Agents](https://arxiv.org/abs/2606.09549v1)
   - Published：2026-06-08 22:29
   - 作者：Yuhan Ma，Stefan Schmid
   - 来源：arxiv
   - 相关性分数：143
   - 命中原因：title matched "LLM"; title matched "agent"; summary matched "language model"; summary matched "large language model"
   - 分类：cs.CR, cs.AI
   - 标签：应用 / 方法
   - 主题词：LLM / Language Model
   - PDF：https://arxiv.org/pdf/2606.09549v1
   - 摘要：Tool-using large language model (LLM) agents face two distinct security failures: unauthorized external actions and exposure of sensitive plaintext inside the runtime before any final output check can intervene. Existing defenses usually protect one boundary, either the planner/runtime or the action sink, and therefore do not by themselves secure both surfaces. We present SecureClaw, a dual-boundary architecture that places authorization at the effect sink and plaintext confinement at the read boundary. Sensitive reads pass through a trusted gateway that replaces raw values with opaque handles and, in the evaluated deployment, bounded summaries as an explicit declassification interface. Writes that change external state follow a PREVIEW$\rightarrow$COMMIT protocol in which only a trusted executor may commit the exact canonical request authorized by policy. The runtime can still plan over summaries and symbolic references, but cannot directly dereference secrets or perform side effects. Across AgentDojo, AgentLeak, and Agent Security Bench (ASB), SecureClaw is the only defense we evaluate in a common harness that simultaneously retains usable task utility and achieves 0\% attack success rate (ASR) on ASB, 0.64\% ASR on AgentDojo, and 3.23\% overall leak on AgentLeak's attacked parity lane, which measures final-output and internal-relay leakage.

13. [Streaming Interventions: Can Video Large Language Models Correct Mistakes as They Occur?](https://arxiv.org/abs/2606.09547v1)
   - Published：2026-06-08 22:27
   - 作者：Apratim Bhattacharyya，Shweta Mahajan，Sanjay Haresh，Rajeev Yasarla，Reza Pourreza，Litian Liu 等
   - 来源：arxiv
   - 相关性分数：143
   - 命中原因：title matched "language model"; title matched "large language model"; summary matched "LLM"; summary matched "benchmark"
   - 分类：cs.CV, cs.LG
   - 标签：评测 / 数据 / 应用 / 方法
   - 主题词：LLM / Language Model
   - PDF：https://arxiv.org/pdf/2606.09547v1
   - 摘要：Learning everyday skills, like cooking a dish, relies increasingly on instructional media such as online videos. This opens the door to the use of video (and multimodal) large language models (LLMs) as task guidance assistants. A crucial capability for the real-world success of a prospective task guidance assistant is it's ability to intervene proactively as soon as a mistake is apparent in order to guide the user. To evaluate this crucial capability, we introduce Ego-MC-Bench (Mistake Corrections), a benchmark for evaluating reactive, step-by-step task guidance in realistic cooking scenarios. Extensive experiments show that Ego-MC-Bench is highly challenging for state-of-the-art video LLMs. We argue that a key reason is the limited availability of training data for fine-tuning models on this task. Although there exists a wide range of cooking video datasets, existing datasets lack examples of mistakes along with appropriately timed interventions. To help address this data limitation, we also introduce Ego-CoMist, a counterfactual synthetic dataset created by transforming non -interactive cooking videos into supervised training examples showing proactive interventions. We show that fine-tuning on Ego-CoMist yields performance gains especially for smaller and more efficient video LLMs that are well suited for delivering assistance on edge devices.

14. [UXBench: Benchmarking User Experience in AI Assistants](https://arxiv.org/abs/2606.09570v1)
   - Published：2026-06-08 22:44
   - 作者：Mengze Hong，Xia Zeng，Zeyang Lei，Sheng Wang，Chen Jason Zhang，Di Jiang 等
   - 来源：arxiv
   - 相关性分数：139
   - 命中原因：title matched "benchmark"; summary matched "language model"; summary matched "LLM"; summary matched "alignment"
   - 分类：cs.CL, cs.HC
   - 标签：评测 / 数据 / 应用 / 方法
   - 主题词：LLM / Language Model
   - PDF：https://arxiv.org/pdf/2606.09570v1
   - 摘要：As AI assistants serve millions of users daily, evaluating user experience (UX) beyond general model capability has become increasingly important. We present UXBench, the first user-centric benchmark grounded in real user feedback signals for evaluating preference alignment and dialogue generation. The benchmark consists of three interconnected tasks, UX Judge, UX Eval, and UX Recovery, with 7,400 test instances extracted from over 70K interaction logs of a mainstream Chinese AI assistant. The dataset closely reflects real user distributions, covering 8 scenarios, 83 domains, and diverse failure patterns that pose severe challenges. Extensive experiments on 26 frontier language models provide novel insights into how well models perceive user experience and how improvements in model capability contribute to better dialogue engagement. Through comprehensive analysis of model behavior and performance gaps, we show that user feedback prediction is a learnable capability, where a reward model trained from in-the-wild feedback signals can achieve well-calibrated accuracy. We further document the systematic biases of LLM-as-a-judge evaluation protocols and compare typical response strategies that directly affect user experience. UXBench establishes a new evaluation landscape and calls for greater attention to tailored UX optimization, contributing to a user-centric scaling law that shapes the success of AI assistants.

15. [AI Scientists Are Only as Good as Their Evidence: A Stratified Ablation of Proprietary Data and Reasoning Skills in Drug-Asset Valuation](https://arxiv.org/abs/2606.09556v1)
   - Published：2026-06-08 22:31
   - 作者：Yinan Wang
   - 来源：arxiv
   - 相关性分数：139
   - 命中原因：title matched "reasoning"; summary matched "LLM"; summary matched "agent"; summary matched "RAG"
   - 分类：cs.AI
   - 标签：评测 / 数据 / 方法
   - 主题词：LLM / Agent
   - PDF：https://arxiv.org/pdf/2606.09556v1
   - 摘要：AI Scientist agents are often evaluated as if capability were mainly a function of model quality, prompting, or reasoning scaffolds. We test a different hypothesis in drug-asset valuation: for knowledge-intensive scientific decisions, the limiting factor is often the evidence substrate the agent can access. We run a controlled three-arm ablation on a production valuation agent: A is a plain web-only LLM analyst, B adds public structured tools plus a 14-dimension valuation playbook, verifier, objectivity policy and red-team, and C adds the proprietary Noah AI corpus of curated pipeline, trial and deal intelligence. Across a 13-asset stratified benchmark, B improves calibration and audit discipline: tier-in-range accuracy rises from 0.80 to 0.89 and objectivity from 3.16 to 3.30. But B does not remove the factual ceiling. Under capability-superset accounting, A and B recover only 0.25 and 0.38 of the curated gold competitive record, while C recovers 0.96; on the curated long-tail subset, C reaches 0.93 vs. 0.26/0.30. Raw blind-panel decision quality is similar for A and B (7.01 vs. 6.96), so we introduce completeness-aware decision utility: informed decision-quality = decision-quality x gold-coverage. On this metric, C reaches 7.43 vs. 1.76/2.57 for A/B. Even a perfect non-proprietary-data report would be capped at 3.83 by B's coverage. The result is not that reasoning scaffolds are unimportant; they improve calibration and discipline. Rather, proprietary evidence sets the upper bound of what the AI Scientist can know and therefore decide.

## Agent Runtime Security 观察

### 本组速览

- 《WeaveBench: A Long-Horizon, Real-World Benchmark for Computer-Use Agents with Hybrid Interfaces》〔评测 / 方法〕：Computer-use agents (CUAs) increasingly operate in runtimes that combine visual desktop control, command-line execution, code editing, browsers, and external t…
- 《Brain-Prompt Injection: A Route-Safety Audit for BCI-LLM Agents》〔方法〕：BCI-to-agent pipelines turn decoded neural activity into an authorization channel for tool-use agents, exposing a new attack surface we call \emph{brain-prompt…
- 《What the Eyes See, the LLMs Miss: Exploiting Human Perception for Adversarial Text Attacks》〔数据 / 方法〕：Large language model (LLM)-powered content moderation systems have become a critical defense against harmful online content. However, these systems primarily o…
- 《PRISM: Recovering Instruction Sets from Language Model Activations》〔方法〕：As LLMs are deployed as agents, reliable monitoring requires knowing not only what they output, but which instructions are steering their behavior. This is dif…

### 论文速览

1. [WeaveBench: A Long-Horizon, Real-World Benchmark for Computer-Use Agents with Hybrid Interfaces](https://arxiv.org/abs/2606.09426v1)
   - Published：2026-06-08 20:39
   - 作者：Wanli Li，Bowen Zhou，Yunyao Yu，Zhou Xu，Yifan Yang，Dongsheng Li 等
   - 来源：arxiv
   - 相关性分数：83
   - 命中原因：title matched "computer-use agent"; summary matched "agent runtime"; has PDF; has rich summary
   - 分类：cs.AI
   - 标签：评测 / 方法
   - 主题词：Agent / Benchmark
   - PDF：https://arxiv.org/pdf/2606.09426v1
   - 摘要：Computer-use agents (CUAs) increasingly operate in runtimes that combine visual desktop control, command-line execution, code editing, browsers, and external tools. Existing benchmarks, however, often evaluate these interfaces as separable capabilities, leaving long-horizon cross-interface orchestration under-tested. Thus, we introduce WeaveBench, a long-horizon hybrid-interface benchmark with 114 tasks across 8 real-world work domains, grounded in real user requests and publicly verifiable artifacts. Each task requires agents to combine GUI observations/actions with CLI/code operations within a single trajectory. We evaluate these tasks on a real Ubuntu desktop inside deployed CLI-agent runtimes, augmented with a minimal desktop-control plugin. We also propose a companion trajectory-aware judge that inspects deliverables, files, screenshots, logs, and action traces, while detecting shortcut behaviors such as fabricated visual evidence or hard-coded metrics. Across frontier model-runtime pairings, the best PassRate reaches only 41.2%, showing the benchmark remains far from saturated. The trajectory-aware judge further reveals that outcome-only grading substantially overestimates agent performance. Overall, WeaveBench exposes a critical gap in CUA evaluation and provides an effective testbed to measure whether agents can orchestrate GUI, CLI, and code operations across long-horizon real-world tasks.

2. [Brain-Prompt Injection: A Route-Safety Audit for BCI-LLM Agents](https://arxiv.org/abs/2606.09315v1)
   - Published：2026-06-08 18:19
   - 作者：Jianwei Tai
   - 来源：arxiv
   - 相关性分数：63
   - 命中原因：title matched "prompt injection"; has PDF; has rich summary; has complete metadata
   - 分类：cs.CR, cs.AI
   - 标签：方法
   - 主题词：LLM / Agent
   - PDF：https://arxiv.org/pdf/2606.09315v1
   - 摘要：BCI-to-agent pipelines turn decoded neural activity into an authorization channel for tool-use agents, exposing a new attack surface we call \emph{brain-prompt injection}: signal-side perturbations, context-only injections, and adaptive dual-decoder attacks can all change the routed action while EEG-side or text-side monitors remain blind. Route safety in this stack depends on what the audit log can observe, not on decoder accuracy or agreement alone. We define a Route-Safety Audit Contract: a minimal log schema, denominator hierarchy, and endpoint specification, and prove an audit-schema separation theorem together with a C3 attacked-dependence decomposition; clean agreement and marginal robustness do not identify the joint term that controls C3 routing. As a calibration layer on top of the contract, we apply split-conformal calibration to a non-oracle EEG confirmation channel and report the resulting false-accept frontier under an explicit threat-archetype matrix. We instantiate the contract on EEGMMI native left/right command-control over 5{,}400 events, harmless tool stubs, and seed/case denominators. Provenance blocks C2 routes ($0.000$); agreement-plus-provenance routes C3 flips ($1.000$); confirmation-plus-provenance routes them ($0.000$). The conformal frontier reaches FAR $0.000$ at clean utility $0.150$ for $α=.005$ and FAR $0.119$ at clean utility $0.452$ for $α=.10$ under acquisition isolation; an attacker-controllable confirmation channel breaks the bound to $\approx\!1$. Subject-cluster bootstrap confirms these intervals on $60$ subjects; cross-architecture (TinyEEGNet, EEGNetV4) and capacity-sweep results show within-regime saturation. Mediation and confirmation reduce risk; they are not intent certificates.

3. [What the Eyes See, the LLMs Miss: Exploiting Human Perception for Adversarial Text Attacks](https://arxiv.org/abs/2606.09700v1)
   - Published：2026-06-09 00:21
   - 作者：Qin Yang，Lu Malloy，Joshua Lee，Xiaohan Chang，Meisam Mohammady，Doowon Kim 等
   - 来源：arxiv
   - 相关性分数：47
   - 命中原因：summary matched "guardrail"; has PDF; has rich summary; has complete metadata
   - 分类：cs.CR, cs.HC, cs.LG
   - 标签：数据 / 方法
   - 主题词：LLM / Language Model
   - PDF：https://arxiv.org/pdf/2606.09700v1
   - 摘要：Large language model (LLM)-powered content moderation systems have become a critical defense against harmful online content. However, these systems primarily operate on tokenized text and largely ignore the visual cues that humans naturally rely on when interpreting content. We show that this discrepancy creates a fundamental perceptual mismatch: content that is readily recognized as harmful by humans can become effectively invisible to automated moderation systems. To study this vulnerability, we introduce a class of Human-Perceptible Adversarial Attacks (HPAA), in which harmful expressions are embedded into otherwise benign text through visually salient typographic manipulations. Our key insight is that typographic features, including spacing, visual emphasis, and spatial arrangement, can be strategically combined to preserve human recognition of harmful content while substantially reducing machine detectability. Operating in black-box settings with only a small query budget, our attack automatically generates evasive content without requiring model access or gradient information. We evaluate the attack across multiple datasets and ten deployed moderation systems, including commercial APIs and state-of-the-art open-source guardrails. Results reveal a striking gap between human and machine perception: with only three detector queries, generated attacks achieve over 86\% human recognition while maintaining detection rates below 1\% across the evaluated systems. We further conduct ablation studies to identify the typographic factors driving successful evasion, analyze why current moderation architectures fail to capture these signals, and discuss practical defenses. Our findings expose a fundamental blind spot in today's LLM-based moderation ecosystem and highlight need for moderation systems that reason about content in a manner more consistent with human perceptual understanding.

4. [PRISM: Recovering Instruction Sets from Language Model Activations](https://arxiv.org/abs/2606.09563v1)
   - Published：2026-06-08 22:37
   - 作者：Gilad Gressel，Rahul Pankajakshan，Julia Diament，Efim Hudis，Krishnashree Achuthan，Yisroel Mirsky
   - 来源：arxiv
   - 相关性分数：45
   - 命中原因：summary matched "prompt injection"; has PDF; has rich summary; has complete metadata
   - 分类：cs.AI, cs.LG
   - 标签：方法
   - 主题词：LLM / Language Model
   - PDF：https://arxiv.org/pdf/2606.09563v1
   - 摘要：As LLMs are deployed as agents, reliable monitoring requires knowing not only what they output, but which instructions are steering their behavior. This is difficult when models infer unintended subgoals, follow contextual cues, or are influenced by prompt injections and hidden objectives. While activation-to-language methods suggest that hidden states can reveal natural-language information, existing approaches are not designed to recover the full set of simultaneous instructions, constraints, prohibitions, and subgoals active in agentic settings. We formalize this problem as instruction set retrieval and introduce PRISM, an activation-conditioned interpreter that decodes hidden states from a frozen target model into a faithful bullet list of active instructions. Unlike prior activation-to-language methods, PRISM is trained to recover instruction sets directly, using judge-guided GRPO to reward covered instructions and penalize unsupported ones. Across benign, constrained, prompt-injection, and hidden-objective settings, PRISM outperforms activation-to-language baselines, especially on security-relevant objectives.

## Terminal and SWE Agents 观察

### 本组速览

- 《SIGA: Self-Evolving Coding-Agent Adapters for Scientific Simulation》〔方法〕：Advanced scientific simulators expose specialized input languages that turn simulation goals into executable configurations, but learning them can cost domain…
- 《From 0-to-1 to 1-to-N: Reproducible Engineering Evidence for MetaAI Recursive Self-Design》〔方法〕：Recursive self-design refers to AI-assisted modification of the mechanisms by which an AI system is built, evaluated, and improved. This paper treats MetaAI no…
- 《Self-Harness: Harnesses That Improve Themselves》〔方法〕：The performance of LLM-based agents is jointly shaped by their base models and the harnesses that mediate their interaction with the environment. Because diffe…

### 论文速览

1. [SIGA: Self-Evolving Coding-Agent Adapters for Scientific Simulation](https://arxiv.org/abs/2606.09774v1)
   - Published：2026-06-09 01:35
   - 作者：Matthew Ho，Brian Liu，Jixuan Chen，Audrey Wang，Lianhui Qin
   - 来源：arxiv
   - 相关性分数：48
   - 命中原因：summary matched "coding agent"; has PDF; has rich summary; has complete metadata
   - 分类：cs.AI, cs.CL
   - 标签：方法
   - 主题词：Agent / Coding Agent
   - PDF：https://arxiv.org/pdf/2606.09774v1
   - 摘要：Advanced scientific simulators expose specialized input languages that turn simulation goals into executable configurations, but learning them can cost domain scientists hours to days. We study simulator setup as a problem of agent-tool interface grounding: what minimal simulator-specific adaptations are needed for an off-the-shelf coding agent to operate real scientific software? Our intuition is that coding agents already know how to navigate files, edit code, run commands, and repair outputs, but they lack the simulator's executable contract: its vocabulary, structural constraints, validation rules, and termination conditions. We introduce SIGA, a Simulator-Interface Grounding Adapter that supplies this contract through retrieval, procedural memory, in-trajectory validation, and validation-enforced termination. We primarily evaluate SIGA on GEOS, an open-source multiphysics simulator used in subsurface science. SIGA produces a complete GEOS deck in about five minutes with TreeSim above 0.90, matching an extended-budget human expert who took about three hours, a roughly 36x wall-clock speedup. On a harder held-out set, grounding raises TreeSim from 0.720 to 0.789, a roughly 10% relative gain over the bare agent, and can reduce the across-seed standard deviation by 16x. Self-evolution further improves SIGA by rewriting adapter contents from prior trajectories, yielding the highest held-out GEOS mean and matching or outperforming the strongest hand-designed configuration. Transfers to OpenFOAM and LAMMPS show that the dominant mechanism shifts by interface: validation matters most when structural completeness is the bottleneck, while memory and retrieval matter most when domain correctness is the bottleneck. These results suggest that lightweight, self-improvable grounding layers can turn general coding agents into practical operators of scientific software.

2. [From 0-to-1 to 1-to-N: Reproducible Engineering Evidence for MetaAI Recursive Self-Design](https://arxiv.org/abs/2606.09663v1)
   - Published：2026-06-08 23:45
   - 作者：Dun Li，Jiatao Li，Hongzhi Li
   - 来源：arxiv
   - 相关性分数：46
   - 命中原因：summary matched "SWE-bench"; has PDF; has rich summary; has complete metadata
   - 分类：cs.AI
   - 标签：方法
   - 主题词：Agent / SWE Bench
   - PDF：https://arxiv.org/pdf/2606.09663v1
   - 摘要：Recursive self-design refers to AI-assisted modification of the mechanisms by which an AI system is built, evaluated, and improved. This paper treats MetaAI not as a mature paradigm, but as a working term for a human-seeded, AI-expanded development pattern in which the design space itself becomes a target of modification. We propose an operational evidence framework with four criteria: inspectable target system, meta-level modifier, feedback-directed selection, and recursive continuation. We then map public systems, including Darwin Goedel Machine (DGM), STOP, Goedel Agent, and ShinkaEvolve, against these criteria. DGM provides the most direct currently reported evidence: its published results show improvement from 20% to 50% on SWE-bench Verified and from 14.2% to 30.7% on full Polyglot after 80 iterations, with ablations suggesting that both open-ended exploration and self-improvement contribute. Finally, we provide MetaAI-Mini, a reproducible HumanEval-based protocol and codebase. Because no completed model run is included in this build, MetaAI-Mini is reported as a protocol rather than as an experimental result.

3. [Self-Harness: Harnesses That Improve Themselves](https://arxiv.org/abs/2606.09498v1)
   - Published：2026-06-08 21:50
   - 作者：Hangfan Zhang，Shao Zhang，Kangcong Li，Chen Zhang，Yang Chen，Yiqun Zhang 等
   - 来源：arxiv
   - 相关性分数：44
   - 命中原因：summary matched "Terminal-Bench"; has PDF; has rich summary; has complete metadata
   - 分类：cs.CL
   - 标签：方法
   - 主题词：LLM / Agent
   - PDF：https://arxiv.org/pdf/2606.09498v1
   - 摘要：The performance of LLM-based agents is jointly shaped by their base models and the harnesses that mediate their interaction with the environment. Because different models exhibit distinct behaviors, effective harness design is inherently model-specific. Yet agent harnesses are still largely engineered by human experts, a paradigm that scales poorly as modern LLMs become increasingly diverse and rapidly evolving. In this paper, we introduce Self-Harness, a new paradigm in which an LLM-based agent improves its own operating harness, without relying on human engineers or stronger external agents. We operationalize Self-Harness as an iterative loop with three stages: Weakness Mining, which identifies model-specific failure patterns from execution traces; Harness Proposal, which generates diverse yet minimal harness modifications tied to these failures; and Proposal Validation, which accepts candidate edits only after regression testing. We instantiate Self-Harness on Terminal-Bench-2.0 using a minimal initial harness and three base models from diverse families: MiniMax M2.5, Qwen3.5-35B-A3B, and GLM-5. Across all three models, Self-Harness consistently improves performance, with held-out pass rates increasing from 40.5% to 61.9%, 23.8% to 38.1%, and 42.9% to 57.1%, respectively. Qualitative analyses further show that Self-Harness does not simply add generic instructions, but effectively turns model-specific weaknesses into concrete, executable harness changes. These results suggest a path toward LLM-based agents that are not merely shaped by their harnesses, but can also participate in reshaping them.
