# Daily Paper Briefing

- Generated at: 2026-04-14 11:37:06 (Asia/Shanghai)
- Retrieval window: last 24 hours
- Hits by group: LLM=15, Vision=10, PubMed AI=5, OpenAlex AI=1
- Sort strategy: hybrid (relevance first, published_at tie-break)
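The hybrid ordering above can be sketched as a two-key sort: highest relevance score first, with equal scores falling back to the newer publication time. This is a minimal sketch, assuming the field names (`score`, `published_at`) and record shape, which are hypothetical, and that `published_at` is an ISO-8601 string (which sorts correctly as text):

```python
# Hypothetical paper records mirroring the entries below: a relevance
# score plus an ISO-8601 publication timestamp.
papers = [
    {"title": "PAC-BENCH", "score": 104, "published_at": "2026-04-13T22:26"},
    {"title": "General365", "score": 130, "published_at": "2026-04-14T01:44"},
    {"title": "UniToolCall", "score": 145, "published_at": "2026-04-13T22:43"},
    {"title": "RoMem", "score": 104, "published_at": "2026-04-13T22:35"},
]

# Hybrid ordering: sort descending on (score, published_at), so relevance
# dominates and newer papers win ties (ISO strings compare chronologically).
ranked = sorted(
    papers,
    key=lambda p: (p["score"], p["published_at"]),
    reverse=True,
)
```

With these records, `ranked` lists UniToolCall and General365 first, then the two score-104 papers newest-first, matching the tie-break seen between entries 11 and 12 below.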

## Today's Highlights

- Topic "Benchmark": 20 papers matched, spanning LLM, Vision, and other groups; representative papers include "UniToolCall: Unifying Tool-Use Representation, Data, and Evaluation for LLM Agents" and "General365: Benchmarking General Reasoning in Large Language Models Across Diverse and Challenging Tasks".
- Topic "Agent": 15 papers matched, spanning LLM, PubMed AI, and other groups; representative papers include "UniToolCall: Unifying Tool-Use Representation, Data, and Evaluation for LLM Agents" and "Collaborative Multi-Agent Scripts Generation for Enhancing Imperfect-Information Reasoning in Murder Mystery Games".
- Topic "Evaluation": 10 papers matched, spanning LLM, Vision, and other groups; representative papers include "General365: Benchmarking General Reasoning in Large Language Models Across Diverse and Challenging Tasks" and "FM-Agent: Scaling Formal Methods to Large Systems via LLM-Based Hoare-Style Reasoning".

## Topic Focus

### Benchmark

- Papers matched: 20
- Groups covered: LLM, Vision, PubMed AI
- Representative papers: "UniToolCall: Unifying Tool-Use Representation, Data, and Evaluation for LLM Agents", "General365: Benchmarking General Reasoning in Large Language Models Across Diverse and Challenging Tasks", "Collaborative Multi-Agent Scripts Generation for Enhancing Imperfect-Information Reasoning in Murder Mystery Games"
- Quick reads:
  - "UniToolCall: Unifying Tool-Use Representation, Data, and Evaluation for LLM Agents" [Evaluation / Data / Application / Method]: Tool-use capability is a fundamental component of LLM agents, enabling them to interact with external systems through structured function calls. However, exist…
  - "General365: Benchmarking General Reasoning in Large Language Models Across Diverse and Challenging Tasks" [Evaluation / Data / Application / Method]: Contemporary large language models (LLMs) have demonstrated remarkable reasoning capabilities, particularly in specialized domains like mathematics and physics…

### Agent

- Papers matched: 15
- Groups covered: LLM, PubMed AI, OpenAlex AI
- Representative papers: "UniToolCall: Unifying Tool-Use Representation, Data, and Evaluation for LLM Agents", "Collaborative Multi-Agent Scripts Generation for Enhancing Imperfect-Information Reasoning in Murder Mystery Games", "FM-Agent: Scaling Formal Methods to Large Systems via LLM-Based Hoare-Style Reasoning"
- Quick reads:
  - "UniToolCall: Unifying Tool-Use Representation, Data, and Evaluation for LLM Agents" [Evaluation / Data / Application / Method]: Tool-use capability is a fundamental component of LLM agents, enabling them to interact with external systems through structured function calls. However, exist…
  - "Collaborative Multi-Agent Scripts Generation for Enhancing Imperfect-Information Reasoning in Murder Mystery Games" [Evaluation / Data / Method]: Vision-language models (VLMs) have shown impressive capabilities in perceptual tasks, yet they degrade in complex multi-hop reasoning under multiplayer game se…

### Evaluation

- Papers matched: 10
- Groups covered: LLM, Vision, PubMed AI
- Representative papers: "General365: Benchmarking General Reasoning in Large Language Models Across Diverse and Challenging Tasks", "FM-Agent: Scaling Formal Methods to Large Systems via LLM-Based Hoare-Style Reasoning", "RPA-Check: A Multi-Stage Automated Framework for Evaluating Dynamic LLM-based Role-Playing Agents"
- Quick reads:
  - "General365: Benchmarking General Reasoning in Large Language Models Across Diverse and Challenging Tasks" [Evaluation / Data / Application / Method]: Contemporary large language models (LLMs) have demonstrated remarkable reasoning capabilities, particularly in specialized domains like mathematics and physics…
  - "FM-Agent: Scaling Formal Methods to Large Systems via LLM-Based Hoare-Style Reasoning" [Evaluation / Method]: LLM-assisted software development has become increasingly prevalent, and can generate large-scale systems, such as compilers. It becomes crucial to strengthen…

### Language Model

- Papers matched: 3
- Groups covered: LLM, Vision, PubMed AI
- Representative papers: "ClawGuard: A Runtime Security Framework for Tool-Augmented LLM Agents Against Indirect Prompt Injection", "Anthropogenic Regional Adaptation in Multimodal Vision-Language Model", "Text4Seg++: Advancing Image Segmentation via Generative Language Modeling."
- Quick reads:
  - "ClawGuard: A Runtime Security Framework for Tool-Augmented LLM Agents Against Indirect Prompt Injection" [Method]: Tool-augmented Large Language Model (LLM) agents have demonstrated impressive capabilities in automating complex, multi-step real-world tasks, yet remain vulne…
  - "Anthropogenic Regional Adaptation in Multimodal Vision-Language Model" [Evaluation / Method]: While the field of vision-language (VL) has achieved remarkable success in integrating visual and textual information across multiple languages and domains, th…

### Multimodal

- Papers matched: 3
- Groups covered: Vision
- Representative papers: "GeomPrompt: Geometric Prompt Learning for RGB-D Semantic Segmentation Under Missing and Degraded Depth", "Anthropogenic Regional Adaptation in Multimodal Vision-Language Model", "GazeVaLM: A Multi-Observer Eye-Tracking Benchmark for Evaluating Clinical Realism in AI-Generated X-Rays"
- Quick reads:
  - "GeomPrompt: Geometric Prompt Learning for RGB-D Semantic Segmentation Under Missing and Degraded Depth" [Evaluation / Application / Method]: Multimodal perception systems for robotics and embodied AI often assume reliable RGB-D sensing, but in practice, depth is frequently missing, noisy, or corrupt…
  - "Anthropogenic Regional Adaptation in Multimodal Vision-Language Model" [Evaluation / Method]: While the field of vision-language (VL) has achieved remarkable success in integrating visual and textual information across multiple languages and domains, th…

## LLM Watch

### Group Overview

- "UniToolCall: Unifying Tool-Use Representation, Data, and Evaluation for LLM Agents" [Evaluation / Data / Application / Method]: Tool-use capability is a fundamental component of LLM agents, enabling them to interact with external systems through structured function calls. However, exist…
- "General365: Benchmarking General Reasoning in Large Language Models Across Diverse and Challenging Tasks" [Evaluation / Data / Application / Method]: Contemporary large language models (LLMs) have demonstrated remarkable reasoning capabilities, particularly in specialized domains like mathematics and physics…
- "Collaborative Multi-Agent Scripts Generation for Enhancing Imperfect-Information Reasoning in Murder Mystery Games" [Evaluation / Data / Method]: Vision-language models (VLMs) have shown impressive capabilities in perceptual tasks, yet they degrade in complex multi-hop reasoning under multiplayer game se…

### Paper Summaries

1. [UniToolCall: Unifying Tool-Use Representation, Data, and Evaluation for LLM Agents](https://arxiv.org/abs/2604.11557v1)
   - Published: 2026-04-13 22:43
   - Authors: Yijuan Liang, Xinghao Chen, Yifan Ge, Ziyi Wu, Hao Wu, Changyu Zeng, et al.
   - Source: arxiv
   - Relevance score: 145
   - Match reasons: title matched "agent"; title matched "evaluation"; summary matched "reasoning"; summary matched "benchmark"
   - Categories: cs.AI
   - Tags: Evaluation / Data / Application / Method
   - Topics: Benchmark / Agent
   - PDF: https://arxiv.org/pdf/2604.11557v1
   - Abstract: Tool-use capability is a fundamental component of LLM agents, enabling them to interact with external systems through structured function calls. However, existing research exhibits inconsistent interaction representations, largely overlooks the structural distribution of tool-use trajectories, and relies on incompatible evaluation benchmarks. We present UniToolCall, a unified framework for tool learning that standardizes the entire pipeline from toolset construction and dataset generation to evaluation. The framework curates a large tool pool of 22k+ tools and constructs a hybrid training corpus of 390k+ instances by combining 10 standardized public datasets with structurally controlled synthetic trajectories. It explicitly models diverse interaction patterns, including single-hop vs. multi-hop and single-turn vs. multi-turn, while capturing both serial and parallel execution structures. To support coherent multi-turn reasoning, we further introduce an Anchor Linkage mechanism that enforces cross-turn dependencies. Furthermore, we convert 7 public benchmarks into a unified Query-Action-Observation-Answer (QAOA) representation with fine-grained evaluation at the function-call, turn, and conversation levels. Experiments show that fine-tuning Qwen3-8B on our dataset substantially improves tool-use performance. Under the distractor-heavy Hybrid-20 setting, the fine-tuned model achieves 93.0% single-turn Strict Precision, outperforming commercial models including GPT, Gemini, and Claude.

2. [General365: Benchmarking General Reasoning in Large Language Models Across Diverse and Challenging Tasks](https://arxiv.org/abs/2604.11778v1)
   - Published: 2026-04-14 01:44
   - Authors: Junlin Liu, Shengnan An, Shuang Zhou, Dan Ma, Shixiong Luo, Ying Xie, et al.
   - Source: arxiv
   - Relevance score: 130
   - Match reasons: title matched "reasoning"; title matched "benchmark"; summary matched "evaluation"; has PDF
   - Categories: cs.CL, cs.AI
   - Tags: Evaluation / Data / Application / Method
   - Topics: Benchmark / Evaluation
   - PDF: https://arxiv.org/pdf/2604.11778v1
   - Abstract: Contemporary large language models (LLMs) have demonstrated remarkable reasoning capabilities, particularly in specialized domains like mathematics and physics. However, their ability to generalize these reasoning skills to more general and broader contexts, often termed general reasoning, remains under-explored. Unlike domain-specific reasoning, general reasoning relies less on expert knowledge but still presents formidable reasoning challenges, such as complex constraints, nested logical branches, and semantic interference. To address this gap, we introduce General365, a benchmark specifically designed to assess general reasoning in LLMs. By restricting background knowledge to a K-12 level, General365 explicitly decouples reasoning from specialized expertise. The benchmark comprises 365 seed problems and 1,095 variant problems across eight categories, ensuring both high difficulty and diversity. Evaluations across 26 leading LLMs reveal that even the top-performing model achieves only 62.8% accuracy, in stark contrast to the near-perfect performances of LLMs in math and physics benchmarks. These results suggest that the reasoning abilities of current LLMs are heavily domain-dependent, leaving significant room for improvement in broader applications. We envision General365 as a catalyst for advancing LLM reasoning beyond domain-specific tasks toward robust, general-purpose real-world scenarios. Code, Dataset, and Leaderboard: https://general365.github.io

3. [Collaborative Multi-Agent Scripts Generation for Enhancing Imperfect-Information Reasoning in Murder Mystery Games](https://arxiv.org/abs/2604.11741v1)
   - Published: 2026-04-14 01:16
   - Authors: Keyang Zhong, Junlin Xie, Hefeng Wu, Haofeng Li, Guanbin Li
   - Source: arxiv
   - Relevance score: 129
   - Match reasons: title matched "agent"; title matched "reasoning"; summary matched "benchmark"; has PDF
   - Categories: cs.AI
   - Tags: Evaluation / Data / Method
   - Topics: Benchmark / Agent
   - PDF: https://arxiv.org/pdf/2604.11741v1
   - Abstract: Vision-language models (VLMs) have shown impressive capabilities in perceptual tasks, yet they degrade in complex multi-hop reasoning under multiplayer game settings with imperfect and deceptive information. In this paper, we study a representative multiplayer task, Murder Mystery Games, which require inferring hidden truths based on partial clues provided by roles with different intentions. To address this challenge, we propose a collaborative multi-agent framework for evaluating and synthesizing high-quality, role-driven multiplayer game scripts, enabling fine-grained interaction patterns tailored to character identities (i.e., murderer vs. innocent). Our system generates rich multimodal contexts, including character backstories, visual and textual clues, and multi-hop reasoning chains, through coordinated agent interactions. We design a two-stage agent-monitored training strategy to enhance the reasoning ability of VLMs: (1) chain-of-thought based fine-tuning on curated and synthetic datasets that model uncertainty and deception; (2) GRPO-based reinforcement learning with agent-monitored reward shaping, encouraging the model to develop character-specific reasoning behaviors and effective multimodal multi-hop inference. Extensive experiments demonstrate that our method significantly boosts the performance of VLMs in narrative reasoning, hidden fact extraction, and deception-resilient understanding. Our contributions offer a scalable solution for training and evaluating VLMs under uncertain, adversarial, and socially complex conditions, laying the groundwork for future benchmarks in multimodal multi-hop reasoning under imperfect information.

4. [FM-Agent: Scaling Formal Methods to Large Systems via LLM-Based Hoare-Style Reasoning](https://arxiv.org/abs/2604.11556v1)
   - Published: 2026-04-13 22:42
   - Authors: Haoran Ding, Zhaoguo Wang, Haibo Chen
   - Source: arxiv
   - Relevance score: 127
   - Match reasons: title matched "agent"; title matched "reasoning"; summary matched "evaluation"; has PDF
   - Categories: cs.SE, cs.AI
   - Tags: Evaluation / Method
   - Topics: Agent / Evaluation
   - PDF: https://arxiv.org/pdf/2604.11556v1
   - Abstract: LLM-assisted software development has become increasingly prevalent, and can generate large-scale systems, such as compilers. It becomes crucial to strengthen the correctness of the generated code. However, automated reasoning for large-scale systems remains challenging due to code complexity. Hoare logic offers an approach to decomposing a large system into smaller components and reasoning about them separately (i.e., compositional reasoning). However, existing works still struggle to scale, because Hoare logic requires writing formal specifications for each function, imposing a heavy human burden. The problem is exacerbated when code is generated by LLMs, as developers lack a deep understanding of each function's expected behavior. This paper presents FM-Agent, the first framework that realizes automated compositional reasoning for large-scale systems. Leveraging LLMs, FM-Agent introduces a top-down paradigm to automatically generate function-level specifications. Specifically, FM-Agent derives the specification of a function from how its callers expect the function to behave, so the generated specifications can reflect the developer's intent of a function even if the implementation is buggy. Developers' intent is usually expressed in natural language, while existing verifiers only support formulas. Therefore, FM-Agent generalizes Hoare-style inference to reason about functions against natural-language specifications. Finally, to confirm bug existence and explain bug causes, FM-Agent automatically generates test cases to trigger potential bugs. In our evaluation, FM-Agent successfully reasons about large-scale systems within 2 days, each of which has up to 143k LoC. These systems have already been tested by their developers, but FM-Agent still finds 522 newly discovered bugs. These bugs can cause serious consequences, including system crashes and incorrect execution results.

5. [From Translation to Superset: Benchmark-Driven Evolution of a Production AI Agent from Rust to Python](https://arxiv.org/abs/2604.11518v1)
   - Published: 2026-04-13 22:21
   - Authors: Jinhua Wang, Biswa Sengupta
   - Source: arxiv
   - Relevance score: 126
   - Match reasons: title matched "agent"; title matched "benchmark"; summary matched "evaluation"; has PDF
   - Categories: cs.SE, cs.AI
   - Tags: Evaluation / Method
   - Topics: Benchmark / Agent
   - PDF: https://arxiv.org/pdf/2604.11518v1
   - Abstract: Cross-language migration of large software systems is a persistent engineering challenge, particularly when the source codebase evolves rapidly. We present a methodology for LLM-assisted continuous code translation in which a large language model translates a production Rust codebase (648K LOC, 65 crates) into Python (41K LOC, 28 modules), with public agent benchmarks as the objective function driving iterative refinement. Our subject system is Codex CLI, a production AI coding agent. We demonstrate that: (1) the Python port resolves 59/80 SWE-bench Verified tasks (73.8%) versus Rust's 56/80 (70.0%), and achieves 42.5% on Terminal-Bench versus Rust's 47.5%, confirming near-parity on real-world agentic tasks; (2) benchmark-driven debugging, revealing API protocol mismatches, environment pollution, a silent WebSocket failure mode, and an API 400 crash, is more effective than static testing alone; (3) the architecture supports continuous upstream synchronisation via an LLM-assisted diff-translate-test loop; and (4) the Python port has evolved into a capability superset with 30 feature-flagged extensions (multi-agent orchestration, semantic memory, guardian safety, cost tracking) absent from Rust, while preserving strict parity mode for comparison. Our evaluation shows that for LLM-based agents where API latency dominates, Python's expressiveness yields a 15.9x code reduction with negligible performance cost, while the benchmark-as-objective-function methodology provides a principled framework for growing a cross-language port from parity into an extended platform.

6. [RPA-Check: A Multi-Stage Automated Framework for Evaluating Dynamic LLM-based Role-Playing Agents](https://arxiv.org/abs/2604.11655v1)
   - Published: 2026-04-14 00:08
   - Authors: Riccardo Rosati, Edoardo Colucci, Massimiliano Bolognini, Adriano Mancini, Paolo Sernani
   - Source: arxiv
   - Relevance score: 124
   - Match reasons: title matched "agent"; summary matched "reasoning"; summary matched "alignment"; summary matched "evaluation"
   - Categories: cs.CL, cs.AI, cs.MA
   - Tags: Evaluation / Application / Method
   - Topics: Agent / Evaluation
   - PDF: https://arxiv.org/pdf/2604.11655v1
   - Abstract: The rapid adoption of Large Language Models (LLMs) in interactive systems has enabled the creation of dynamic, open-ended Role-Playing Agents (RPAs). However, evaluating these agents remains a significant challenge, as standard NLP metrics fail to capture the nuances of role adherence, logical consistency, and long-term narrative stability. This paper introduces RPA-Check, a multi-stage automated evaluation framework designed to objectively assess the performance of LLM-based RPAs in complex, constraint-heavy environments. Our methodology is based on a four-step pipeline: (1) Dimension Definition, establishing high-level qualitative behavioral criteria; (2) Augmentation, where these requirements are expanded into granular boolean checklist indicators; (3) Semantic Filtering, to ensure indicator objectivity, non-redundancy, and agent isolation; and (4) LLM-as-a-Judge Evaluation, which employs chain-of-thought verification to score agent fidelity. We validate this framework by applying it to LLM Court, a serious game for forensic training involving several quantized local models. Experimental results across five distinct legal scenarios demonstrate the framework's ability to identify subtle trade-offs between model size, reasoning depth, and operational stability. Notably, the findings reveal an inverse relationship between parametric scale and procedural consistency, showing that smaller, adequately instruction-tuned models (8-9B) can outperform larger architectures prone to user-alignment bias or sycophancy. RPA-Check thus provides a standardized and reproducible metric for future research in generative agent evaluation within specialized domains.

7. [SWE-AGILE: A Software Agent Framework for Efficiently Managing Dynamic Reasoning Context](https://arxiv.org/abs/2604.11716v1)
   - Published: 2026-04-14 00:52
   - Authors: Shuquan Lian, Juncheng Liu, Yazhe Chen, Yuhong Chen, Hui Li
   - Source: arxiv
   - Relevance score: 111
   - Match reasons: title matched "agent"; title matched "reasoning"; has PDF; has rich summary
   - Categories: cs.AI, cs.CL
   - Tags: Method
   - Topics: Agent / Reasoning
   - PDF: https://arxiv.org/pdf/2604.11716v1
   - Abstract: Prior representative ReAct-style approaches in autonomous Software Engineering (SWE) typically lack the explicit System-2 reasoning required for deep analysis and handling complex edge cases. While recent reasoning models demonstrate the potential of extended Chain-of-Thought (CoT), applying them to the multi-turn SWE task creates a fundamental dilemma: retaining full reasoning history leads to context explosion and "Lost-in-the-Middle" degradation, while discarding it would force the agent to redundantly re-reason at every step. To address these challenges, we propose SWE-AGILE, a novel software agent framework designed to bridge the gap between reasoning depth, efficiency, and context constraints. SWE-AGILE introduces a Dynamic Reasoning Context strategy, maintaining a "sliding window" of detailed reasoning for immediate continuity to prevent redundant re-analyzing, while compressing historical reasoning content into concise Reasoning Digests. Empirically, SWE-AGILE sets a new standard for 7B-8B models on SWE-Bench-Verified using only 2.2k trajectories and 896 tasks. Code is available at https://github.com/KDEGroup/SWE-AGILE.

8. [Detecting Safety Violations Across Many Agent Traces](https://arxiv.org/abs/2604.11806v1)
   - Published: 2026-04-14 01:59
   - Authors: Adam Stein, Davis Brown, Hamed Hassani, Mayur Naik, Eric Wong
   - Source: arxiv
   - Relevance score: 108
   - Match reasons: title matched "agent"; summary matched "alignment"; summary matched "benchmark"; has PDF
   - Categories: cs.AI, cs.CL
   - Tags: Evaluation / Application / Method
   - Topics: Benchmark / Agent
   - PDF: https://arxiv.org/pdf/2604.11806v1
   - Abstract: To identify safety violations, auditors often search over large sets of agent traces. This search is difficult because failures are often rare, complex, and sometimes even adversarially hidden and only detectable when multiple traces are analyzed together. These challenges arise in diverse settings such as misuse campaigns, covert sabotage, reward hacking, and prompt injection. Existing approaches struggle here for several reasons. Per-trace judges miss failures that only become visible across traces, naive agentic auditing does not scale to large trace collections, and fixed monitors are brittle to unanticipated behaviors. We introduce Meerkat, which combines clustering with agentic search to uncover violations specified in natural language. Through structured search and adaptive investigation of promising regions, Meerkat finds sparse failures without relying on seed scenarios, fixed workflows, or exhaustive enumeration. Across misuse, misalignment, and task gaming settings, Meerkat significantly improves detection of safety violations over baseline monitors, discovers widespread developer cheating on a top agent benchmark, and finds nearly 4x more examples of reward hacking on CyBench than previous audits.

9. [ClawGUI: A Unified Framework for Training, Evaluating, and Deploying GUI Agents](https://arxiv.org/abs/2604.11784v1)
   - Published: 2026-04-14 01:52
   - Authors: Fei Tang, Zhiqiong Lu, Boxuan Zhang, Weiming Lu, Jun Xiao, Yueting Zhuang, et al.
   - Source: arxiv
   - Relevance score: 108
   - Match reasons: title matched "agent"; summary matched "benchmark"; summary matched "evaluation"; has PDF
   - Categories: cs.LG, cs.AI, cs.CL, cs.CV
   - Tags: Evaluation / Application / Method
   - Topics: Benchmark / Agent
   - PDF: https://arxiv.org/pdf/2604.11784v1
   - Abstract: GUI agents drive applications through their visual interfaces instead of programmatic APIs, interacting with arbitrary software via taps, swipes, and keystrokes, reaching a long tail of applications that CLI-based agents cannot. Yet progress in this area is bottlenecked less by modeling capacity than by the absence of a coherent full-stack infrastructure: online RL training suffers from environment instability and closed pipelines, evaluation protocols drift silently across works, and trained agents rarely reach real users on real devices. We present **ClawGUI**, an open-source framework addressing these three gaps within a single harness. **ClawGUI-RL** provides the first open-source GUI agent RL infrastructure with validated support for both parallel virtual environments and real physical devices, integrating GiGPO with a Process Reward Model for dense step-level supervision. **ClawGUI-Eval** enforces a fully standardized evaluation pipeline across 6 benchmarks and 11+ models, achieving 95.8% reproduction against official baselines. **ClawGUI-Agent** brings trained agents to Android, HarmonyOS, and iOS through 12+ chat platforms with hybrid CLI-GUI control and persistent personalized memory. Trained end to end within this pipeline, **ClawGUI-2B** achieves 17.1% Success Rate on MobileWorld GUI-Only, outperforming the same-scale MAI-UI-2B baseline by 6.0%.

10. [Agentic Aggregation for Parallel Scaling of Long-Horizon Agentic Tasks](https://arxiv.org/abs/2604.11753v1)
   - Published: 2026-04-14 01:26
   - Authors: Yoonsang Lee, Howard Yen, Xi Ye, Danqi Chen
   - Source: arxiv
   - Relevance score: 107
   - Match reasons: title matched "agent"; summary matched "reasoning"; summary matched "benchmark"; has PDF
   - Categories: cs.CL
   - Tags: Evaluation / Application / Method
   - Topics: Benchmark / Agent
   - PDF: https://arxiv.org/pdf/2604.11753v1
   - Abstract: We study parallel test-time scaling for long-horizon agentic tasks such as agentic search and deep research, where multiple rollouts are generated in parallel and aggregated into a final response. While such scaling has proven effective for chain-of-thought reasoning, agentic tasks pose unique challenges: trajectories are long, multi-turn, and tool-augmented, and outputs are often open-ended. Aggregating only final answers discards rich information from trajectories, while concatenating all trajectories exceeds the model's context window. To address this, we propose AggAgent, an aggregation agent that treats parallel trajectories as an environment. We equip it with lightweight tools to inspect candidate solutions and search across trajectories, enabling it to navigate and synthesize information on demand. Across six benchmarks and three model families (GLM-4.7, Qwen3.5, MiniMax-M2.5), AggAgent outperforms all existing aggregation methods, by up to 5.3% absolute on average and 10.3% on two deep research tasks, while adding minimal overhead, as the aggregation cost remains bounded by a single agentic rollout. Our findings establish agentic aggregation as an effective and cost-efficient approach to parallel test-time scaling.

11. [Time is Not a Label: Continuous Phase Rotation for Temporal Knowledge Graphs and Agentic Memory](https://arxiv.org/abs/2604.11544v1)
   - Published: 2026-04-13 22:35
   - Authors: Weixian Waylon Li, Jiaxin Zhang, Xianan Jim Yang, Tiejun Ma, Yiwen Guo
   - Source: arxiv
   - Relevance score: 104
   - Match reasons: title matched "agent"; summary matched "reasoning"; summary matched "benchmark"; has PDF
   - Categories: cs.CL, cs.AI
   - Tags: Evaluation / Method
   - Topics: Benchmark / Agent
   - PDF: https://arxiv.org/pdf/2604.11544v1
   - Abstract: Structured memory representations such as knowledge graphs are central to autonomous agents and other long-lived systems. However, most existing approaches model time as discrete metadata, either sorting by recency (burying old-yet-permanent knowledge), simply overwriting outdated facts, or requiring an expensive LLM call at every ingestion step, leaving them unable to distinguish persistent facts from evolving ones. To address this, we introduce RoMem, a drop-in temporal knowledge graph module for structured memory systems, applicable to agentic memory and beyond. A pretrained Semantic Speed Gate maps each relation's text embedding to a volatility score, learning from data that evolving relations (e.g., "president of") should rotate fast while persistent ones (e.g., "born in") should remain stable. Combined with continuous phase rotation, this enables geometric shadowing: obsolete facts are rotated out of phase in complex vector space, so temporally correct facts naturally outrank contradictions without deletion. On temporal knowledge graph completion, RoMem achieves state-of-the-art results on ICEWS05-15 (72.6 MRR). Applied to agentic memory, it delivers 2-3x MRR and answer accuracy on temporal reasoning (MultiTQ), dominates the hybrid benchmark (LoCoMo), preserves static memory with zero degradation (DMR-MSC), and generalises zero-shot to unseen financial domains (FinTMMBench).

12. [PAC-BENCH: Evaluating Multi-Agent Collaboration under Privacy Constraints](https://arxiv.org/abs/2604.11523v1)
   - Published: 2026-04-13 22:26
   - Authors: Minjun Park, Donghyun Kim, Hyeonjong Ju, Seungwon Lim, Dongwook Choi, Taeyoon Kwon, et al.
   - Source: arxiv
   - Relevance score: 104
   - Match reasons: title matched "agent"; summary matched "benchmark"; summary matched "evaluation"; has PDF
   - Categories: cs.AI, cs.MA
   - Tags: Evaluation / Method
   - Topics: Benchmark / Agent
   - PDF: https://arxiv.org/pdf/2604.11523v1
   - Abstract: We are entering an era in which individuals and organizations increasingly deploy dedicated AI agents that interact and collaborate with other agents. However, the dynamics of multi-agent collaboration under privacy constraints remain poorly understood. In this work, we present PAC-Bench, a benchmark for systematic evaluation of multi-agent collaboration under privacy constraints. Experiments on PAC-Bench show that privacy constraints substantially degrade collaboration performance and make outcomes depend more on the initiating agent than the partner. Further analysis reveals that this degradation is driven by recurring coordination breakdowns, including early-stage privacy violations, overly conservative abstraction, and privacy-induced hallucinations. Together, our findings identify privacy-aware multi-agent collaboration as a distinct and unresolved challenge that requires new coordination mechanisms beyond existing agent capabilities.

13. [METER: Evaluating Multi-Level Contextual Causal Reasoning in Large Language Models](https://arxiv.org/abs/2604.11502v1)
   - Published: 2026-04-13 22:07
   - Authors: Pengfeng Li, Chen Huang, Chaoqun Hao, Hongyao Chen, Xiao-Yong Wei, Wenqiang Lei, et al.
   - Source: arxiv
   - Relevance score: 104
   - Match reasons: title matched "reasoning"; summary matched "benchmark"; summary matched "evaluation"; has PDF
   - Categories: cs.CL, cs.AI
   - Tags: Evaluation / Data / Application / Method
   - Topics: Benchmark / Evaluation
   - PDF: https://arxiv.org/pdf/2604.11502v1
   - Abstract: Contextual causal reasoning is a critical yet challenging capability for Large Language Models (LLMs). Existing benchmarks, however, often evaluate this skill in fragmented settings, failing to ensure context consistency or cover the full causal hierarchy. To address this, we pioneer METER to systematically benchmark LLMs across all three levels of the causal ladder under a unified context setting. Our extensive evaluation of various LLMs reveals a significant decline in proficiency as tasks ascend the causal hierarchy. To diagnose this degradation, we conduct a deep mechanistic analysis via both error pattern identification and internal information flow tracing. Our analysis reveals two primary failure modes: (1) LLMs are susceptible to distraction by causally irrelevant but factually correct information at lower levels of causality; and (2) as tasks ascend the causal hierarchy, faithfulness to the provided context degrades, leading to reduced performance. We believe our work advances our understanding of the mechanisms behind LLM contextual causal reasoning and establishes a critical foundation for future research. Our code and dataset are available at https://github.com/SCUNLP/METER.

14. [ClawGuard: A Runtime Security Framework for Tool-Augmented LLM Agents Against Indirect Prompt Injection](https://arxiv.org/abs/2604.11790v1)
   - Published: 2026-04-14 01:55
   - Authors: Wei Zhao, Zhe Li, Peixin Zhang, Jun Sun
   - Source: arxiv
   - Relevance score: 90
   - Match reasons: title matched "agent"; summary matched "alignment"; has PDF; has rich summary
   - Categories: cs.CR, cs.AI
   - Tags: Method
   - Topics: Agent / Language Model
   - PDF: https://arxiv.org/pdf/2604.11790v1
   - Abstract: Tool-augmented Large Language Model (LLM) agents have demonstrated impressive capabilities in automating complex, multi-step real-world tasks, yet remain vulnerable to indirect prompt injection. Adversaries exploit this weakness by embedding malicious instructions within tool-returned content, which agents directly incorporate into their conversation history as trusted observations. This vulnerability manifests across three primary attack channels: web and local content injection, MCP server injection, and skill file injection. To address these vulnerabilities, we introduce ClawGuard, a novel runtime security framework that enforces a user-confirmed rule set at every tool-call boundary, transforming unreliable alignment-dependent defense into a deterministic, auditable mechanism that intercepts adversarial tool calls before any real-world effect is produced. By automatically deriving task-specific access constraints from the user's stated objective prior to any external tool invocation, ClawGuard blocks all three injection pathways without model modification or infrastructure change. Experiments across five state-of-the-art language models on AgentDojo, SkillInject, and MCPSafeBench demonstrate that ClawGuard achieves robust protection against indirect prompt injection without compromising agent utility. This work establishes deterministic tool-call boundary enforcement as an effective defense mechanism for secure agentic AI systems, requiring neither safety-specific fine-tuning nor architectural modification. Code is publicly available at https://github.com/Claw-Guard/ClawGuard.

15. [Multi-ORFT: Stable Online Reinforcement Fine-Tuning for Multi-Agent Diffusion Planning in Cooperative Driving](https://arxiv.org/abs/2604.11734v1)
   - Published：2026-04-14 01:13
   - 作者：Haojie Bai，Aimin Li，Ruoyu Yao，Xiongwei Zhao，Tingting Zhang，Xing Zhang 等
   - 来源：arxiv
   - 相关性分数：89
   - 命中原因：title matched "agent"; summary matched "benchmark"; has PDF; has rich summary
   - 分类：cs.RO, cs.AI
   - 标签：评测 / 方法
   - 主题词：Benchmark / Agent
   - PDF：https://arxiv.org/pdf/2604.11734v1
   - 摘要：Closed-loop cooperative driving requires planners that generate realistic multimodal multi-agent trajectories while improving safety and traffic efficiency. Existing diffusion planners can model multimodal behaviors from demonstrations, but they often exhibit weak scene consistency and remain poorly aligned with closed-loop objectives; meanwhile, stable online post-training in reactive multi-agent environments remains difficult. We present Multi-ORFT, which couples scene-conditioned diffusion pre-training with stable online reinforcement post-training. In pre-training, the planner uses inter-agent self-attention, cross-attention, and AdaLN-Zero-based scene conditioning to improve scene consistency and road adherence of joint trajectories. In post-training, we formulate a two-level MDP that exposes step-wise reverse-kernel likelihoods for online optimization, and combine dense trajectory-level rewards with variance-gated group-relative policy optimization (VG-GRPO) to stabilize training. On the WOMD closed-loop benchmark, Multi-ORFT reduces collision rate from 2.04% to 1.89% and off-road rate from 1.68% to 1.36%, while increasing average speed from 8.36 to 8.61 m/s relative to the pre-trained planner, and it outperforms strong open-source baselines including SMART-large, SMART-tiny-CLSFT, and VBD on the primary safety and efficiency metrics. These results show that coupling scene-consistent denoising with stable online diffusion-policy optimization improves the reliability of closed-loop cooperative driving.

## Vision 观察

### 本组速览

- 《OmniShow: Unifying Multimodal Conditions for Human-Object Interaction Video Generation》〔评测 / 数据 / 应用 / 方法〕：In this work, we study Human-Object Interaction Video Generation (HOIVG), which aims to synthesize high-quality human-object interaction videos conditioned on…
- 《LMMs Meet Object-Centric Vision: Understanding, Segmentation, Editing and Generation》〔评测 / 方法〕：Large Multimodal Models (LMMs) have achieved remarkable progress in general-purpose vision--language understanding, yet they remain limited in tasks requiring…
- 《GeomPrompt: Geometric Prompt Learning for RGB-D Semantic Segmentation Under Missing and Degraded Depth》〔评测 / 应用 / 方法〕：Multimodal perception systems for robotics and embodied AI often assume reliable RGB-D sensing, but in practice, depth is frequently missing, noisy, or corrupt…

### 论文速览

1. [OmniShow: Unifying Multimodal Conditions for Human-Object Interaction Video Generation](https://arxiv.org/abs/2604.11804v1)
   - Published：2026-04-14 01:59
   - 作者：Donghao Zhou，Guisheng Liu，Hao Yang，Jiatong Li，Jingyu Lin，Xiaohu Huang 等
   - 来源：arxiv
   - 相关性分数：112
   - 命中原因：title matched "video generation"; title matched "multimodal"; has PDF; has rich summary
   - 分类：cs.CV
   - 标签：评测 / 数据 / 应用 / 方法
   - 主题词：Benchmark / Evaluation
   - PDF：https://arxiv.org/pdf/2604.11804v1
   - 摘要：In this work, we study Human-Object Interaction Video Generation (HOIVG), which aims to synthesize high-quality human-object interaction videos conditioned on text, reference images, audio, and pose. This task holds significant practical value for automating content creation in real-world applications, such as e-commerce demonstrations, short video production, and interactive entertainment. However, existing approaches fail to accommodate all these requisite conditions. We present OmniShow, an end-to-end framework tailored for this practical yet challenging task, capable of harmonizing multimodal conditions and delivering industry-grade performance. To overcome the trade-off between controllability and quality, we introduce Unified Channel-wise Conditioning for efficient image and pose injection, and Gated Local-Context Attention to ensure precise audio-visual synchronization. To effectively address data scarcity, we develop a Decoupled-Then-Joint Training strategy that leverages a multi-stage training process with model merging to efficiently harness heterogeneous sub-task datasets. Furthermore, to fill the evaluation gap in this field, we establish HOIVG-Bench, a dedicated and comprehensive benchmark for HOIVG. Extensive experiments demonstrate that OmniShow achieves overall state-of-the-art performance across various multimodal conditioning settings, setting a solid standard for the emerging HOIVG task.

2. [LMMs Meet Object-Centric Vision: Understanding, Segmentation, Editing and Generation](https://arxiv.org/abs/2604.11789v1)
   - Published：2026-04-14 01:55
   - 作者：Yuqian Yuan，Wenqiao Zhang，Juekai Lin，Yu Zhong，Mingjian Gao，Binhe Yu 等
   - 来源：arxiv
   - 相关性分数：90
   - 命中原因：title matched "segmentation"; summary matched "multimodal"; has PDF; has rich summary
   - 分类：cs.CV
   - 标签：评测 / 方法
   - 主题词：Benchmark / Evaluation
   - PDF：https://arxiv.org/pdf/2604.11789v1
   - 摘要：Large Multimodal Models (LMMs) have achieved remarkable progress in general-purpose vision-language understanding, yet they remain limited in tasks requiring precise object-level grounding, fine-grained spatial reasoning, and controllable visual manipulation. In particular, existing systems often struggle to identify the correct instance, preserve object identity across interactions, and localize or modify designated regions with high precision. Object-centric vision provides a principled framework for addressing these challenges by promoting explicit representations and operations over visual entities, thereby extending multimodal systems from global scene understanding to object-level understanding, segmentation, editing, and generation. This paper presents a comprehensive review of recent advances at the convergence of LMMs and object-centric vision. We organize the literature into four major themes: object-centric visual understanding, object-centric referring segmentation, object-centric visual editing, and object-centric visual generation. We further summarize the key modeling paradigms, learning strategies, and evaluation protocols that support these capabilities. Finally, we discuss open challenges and future directions, including robust instance permanence, fine-grained spatial control, consistent multi-step interaction, unified cross-task modeling, and reliable benchmarking under distribution shift. We hope this paper provides a structured perspective on the development of scalable, precise, and trustworthy object-centric multimodal systems.

3. [GeomPrompt: Geometric Prompt Learning for RGB-D Semantic Segmentation Under Missing and Degraded Depth](https://arxiv.org/abs/2604.11585v1)
   - Published：2026-04-13 23:01
   - 作者：Krishna Jaganathan，Patricio Vela
   - 来源：arxiv
   - 相关性分数：87
   - 命中原因：title matched "segmentation"; summary matched "multimodal"; has PDF; has rich summary
   - 分类：cs.CV, cs.RO
   - 标签：评测 / 应用 / 方法
   - 主题词：Multimodal / Segmentation
   - PDF：https://arxiv.org/pdf/2604.11585v1
   - 摘要：Multimodal perception systems for robotics and embodied AI often assume reliable RGB-D sensing, but in practice, depth is frequently missing, noisy, or corrupted. We thus present GeomPrompt, a lightweight cross-modal adaptation module that synthesizes a task-driven geometric prompt from RGB alone for the fourth channel of a frozen RGB-D semantic segmentation model, without depth supervision. We further introduce GeomPrompt-Recovery, an adaptation module that compensates for degraded depth by predicting the fourth channel correction relevant for the frozen segmenter. Both modules are trained solely with downstream segmentation supervision, enabling recovery of the geometric prior useful for segmentation, rather than estimating depth signals. On SUN RGB-D, GeomPrompt improves over RGB-only inference by +6.1 mIoU on DFormer and +3.0 mIoU on GeminiFusion, while remaining competitive with strong monocular depth estimators. For degraded depth, GeomPrompt-Recovery consistently improves robustness, yielding gains up to +3.6 mIoU under severe depth corruptions. GeomPrompt is also substantially more efficient than monocular depth baselines, reaching 7.8 ms latency versus 38.3 ms and 71.9 ms. These results suggest that task-driven geometric prompting is an efficient mechanism for cross-modal compensation under missing and degraded depth inputs in RGB-D perception.

4. [Anthropogenic Regional Adaptation in Multimodal Vision-Language Model](https://arxiv.org/abs/2604.11490v1)
   - Published：2026-04-13 21:56
   - 作者：Samuel Cahyawijaya，Peerat Limkonchotiwat，Tack Hwa Wong，Hitesh Laxmichand Patel，Amit Agarwal，Manuel Antonio Rufino 等
   - 来源：arxiv
   - 相关性分数：86
   - 命中原因：title matched "multimodal"; summary matched "diffusion"; has PDF; has rich summary
   - 分类：cs.AI, cs.CL, cs.CV
   - 标签：评测 / 方法
   - 主题词：Language Model / Multimodal
   - PDF：https://arxiv.org/pdf/2604.11490v1
   - 摘要：While the field of vision-language (VL) has achieved remarkable success in integrating visual and textual information across multiple languages and domains, there is still no dedicated framework for assessing human-centric alignment in vision-language systems. We offer two contributions to address this gap. First, we introduce Anthropogenic Regional Adaptation: a novel paradigm that aims to optimize model relevance to specific regional contexts while ensuring the retention of global generalization capabilities. Second, we present a simple, but effective adaptation method named Geographical-generalization-made-easy (GG-EZ), which utilizes regional data filtering and model merging. Through comprehensive experiments on 3 VL architectures: large vision-language models, text-to-image diffusion models, and vision-language embedding models, and a case study in Southeast Asia (SEA) regional adaptation, we demonstrate the importance of Anthropogenic Regional Adaptation and the effectiveness of GG-EZ, showing 5-15% gains in cultural relevance metrics across SEA while maintaining over 98% of global performance and even occasionally surpassing it. Our findings establish Anthropogenic Regional Alignment as a foundational paradigm towards applicability of multimodal vision-language models in diverse regions and demonstrate a simple-yet-effective baseline method that optimizes regional value alignment while preserving global generalization.

5. [GazeVaLM: A Multi-Observer Eye-Tracking Benchmark for Evaluating Clinical Realism in AI-Generated X-Rays](https://arxiv.org/abs/2604.11653v1)
   - Published：2026-04-14 00:05
   - 作者：David Wong，Zeynep Isik，Bin Wang，Marouane Tliba，Gorkem Durak，Elif Keles 等
   - 来源：arxiv
   - 相关性分数：78
   - 命中原因：summary matched "diffusion"; summary matched "multimodal"; has DOI; has PDF
   - 分类：cs.CV
   - 标签：评测 / 数据 / 应用 / 方法
   - 主题词：Benchmark / Multimodal
   - PDF：https://arxiv.org/pdf/2604.11653v1
   - 摘要：We introduce GazeVaLM, a public eye-tracking dataset for studying clinical perception during chest radiograph authenticity assessment. The dataset comprises 960 gaze recordings from 16 expert radiologists interpreting 30 real and 30 synthetic chest X-rays (generated by diffusion based generative AI) under two conditions: diagnostic assessment and real-fake classification (Visual Turing test). For each image-observer pair, we provide raw gaze samples, fixation maps, scanpaths, saliency density maps, structured diagnostic labels, and authenticity judgments. We extend the protocol to 6 state-of-the-art multimodal LLMs, releasing their predicted diagnoses, authenticity labels, and confidence scores under matched conditions - enabling direct human-AI comparison at both decision and uncertainty levels. We further provide analyses of gaze agreement, inter-observer consistency, and benchmarking of radiologists versus LLMs in diagnostic accuracy and authenticity detection. GazeVaLM supports research in gaze modeling, clinical decision-making, human-AI comparison, generative image realism assessment, and uncertainty quantification. By jointly releasing visual attention data, clinical labels, and model predictions, we aim to facilitate reproducible research on how experts and AI systems perceive, interpret, and evaluate medical images. The dataset is available at https://huggingface.co/datasets/davidcwong/GazeVaLM.

6. [Budget-Aware Uncertainty for Radiotherapy Segmentation QA Using nnU-Net](https://arxiv.org/abs/2604.11798v1)
   - Published：2026-04-14 01:58
   - 作者：Ricardo Coimbra Brioso，Lorenzo Mondo，Damiano Dei，Nicola Lambri，Pietro Mancosu，Marta Scorsetti 等
   - 来源：arxiv
   - 相关性分数：72
   - 命中原因：title matched "segmentation"; has PDF; has rich summary; has complete metadata
   - 分类：cs.CV, cs.AI
   - 标签：评测 / 应用 / 方法
   - 主题词：Clinical / Alignment
   - PDF：https://arxiv.org/pdf/2604.11798v1
   - 摘要：Accurate delineation of the Clinical Target Volume (CTV) is essential for radiotherapy planning, yet remains time-consuming and difficult to assess, especially for complex treatments such as Total Marrow and Lymph Node Irradiation (TMLI). While deep learning-based auto-segmentation can reduce workload, safe clinical deployment requires reliable cues indicating where models may be wrong. In this work, we propose a budget-aware uncertainty-driven quality assurance (QA) framework built on nnU-Net, combining uncertainty quantification and post-hoc calibration to produce voxel-wise uncertainty maps (based on predictive entropy) that can guide targeted manual review. We compare temperature scaling (TS), deep ensembles (DE), checkpoint ensembles (CE), and test-time augmentation (TTA), evaluated both individually and in combination on TMLI as a representative use case. Reliability is assessed through ROI-masked calibration metrics and uncertainty-error alignment under realistic revision constraints, summarized as AUC over the top 0-5% most uncertain voxels. Across configurations, segmentation accuracy remains stable, whereas TS substantially improves calibration. Uncertainty-error alignment improves most with calibrated checkpoint-based inference, leading to uncertainty maps that highlight more consistently regions requiring manual edits. Overall, integrating calibration with efficient ensembling seems a promising strategy to implement a budget-aware QA workflow for radiotherapy segmentation.

7. [HDR Video Generation via Latent Alignment with Logarithmic Encoding](https://arxiv.org/abs/2604.11788v1)
   - Published：2026-04-14 01:55
   - 作者：Naomi Ken Korem，Mohamed Oumoumad，Harel Cain，Matan Ben Yosef，Urska Jelercic，Ofir Bibi 等
   - 来源：arxiv
   - 相关性分数：72
   - 命中原因：title matched "video generation"; has PDF; has rich summary; has complete metadata
   - 分类：cs.CV
   - 标签：应用 / 方法
   - 主题词：Alignment / Video Generation
   - PDF：https://arxiv.org/pdf/2604.11788v1
   - 摘要：High dynamic range (HDR) imagery offers a rich and faithful representation of scene radiance, but remains challenging for generative models due to its mismatch with the bounded, perceptually compressed data on which these models are trained. A natural solution is to learn new representations for HDR, which introduces additional complexity and data requirements. In this work, we show that HDR generation can be achieved in a much simpler way by leveraging the strong visual priors already captured by pretrained generative models. We observe that a logarithmic encoding widely used in cinematic pipelines maps HDR imagery into a distribution that is naturally aligned with the latent space of these models, enabling direct adaptation via lightweight fine-tuning without retraining an encoder. To recover details that are not directly observable in the input, we further introduce a training strategy based on camera-mimicking degradations that encourages the model to infer missing high dynamic range content from its learned priors. Combining these insights, we demonstrate high-quality HDR video generation using a pretrained video model with minimal adaptation, achieving strong results across diverse scenes and challenging lighting conditions. Our results indicate that HDR, despite representing a fundamentally different image formation regime, can be handled effectively without redesigning generative models, provided that the representation is chosen to align with their learned priors.

8. [Efficient KernelSHAP Explanations for Patch-based 3D Medical Image Segmentation](https://arxiv.org/abs/2604.11775v1)
   - Published：2026-04-14 01:43
   - 作者：Ricardo Coimbra Brioso，Giulio Sichili，Damiano Dei，Nicola Lambri，Pietro Mancosu，Marta Scorsetti 等
   - 来源：arxiv
   - 相关性分数：72
   - 命中原因：title matched "segmentation"; has PDF; has rich summary; has complete metadata
   - 分类：cs.CV, cs.AI
   - 标签：评测 / 应用 / 方法
   - 主题词：Evaluation / Clinical
   - PDF：https://arxiv.org/pdf/2604.11775v1
   - 摘要：Perturbation-based explainability methods such as KernelSHAP provide model-agnostic attributions but are typically impractical for patch-based 3D medical image segmentation due to the large number of coalition evaluations and the high cost of sliding-window inference. We present an efficient KernelSHAP framework for volumetric CT segmentation that restricts computation to a user-defined region of interest and its receptive-field support, and accelerates inference via patch logit caching, reusing baseline predictions for unaffected patches while preserving nnU-Net's fusion scheme. To enable clinically meaningful attributions, we compare three automatically generated feature abstractions within the receptive-field crop: whole-organ units, regular FCC supervoxels, and hybrid organ-aware supervoxels, and we study multiple aggregation/value functions targeting stabilizing evidence (TP/Dice/Soft Dice) or false-positive behavior. Experiments on whole-body CT segmentations show that caching substantially reduces redundant computation (with computational savings ranging from 15% to 30%) and that faithfulness and interpretability exhibit clear trade-offs: regular supervoxels often maximize perturbation-based metrics but lack anatomical alignment, whereas organ-aware units yield more clinically interpretable explanations and are particularly effective for highlighting false-positive drivers under normalized metrics.
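摘要提到的「patch logit 缓存」（仅重算受扰动影响的 patch，其余复用基线预测）可用如下纯 Python 草图示意（数据结构为假设性占位，与 nnU-Net 实际的滑窗融合实现无关）：

```python
# 假设性草图：扰动式归因中的 patch logit 缓存。
# 仅与扰动区域重叠的 patch 重新前向，其余直接复用基线输出，
# 从而省去大量冗余的滑窗推理。patch 此处用整数占位。

def segment_with_cache(patches, run_patch, cache, perturbed_ids):
    logits = {}
    for pid, patch in patches.items():
        if pid in perturbed_ids:
            logits[pid] = run_patch(patch)  # 受影响：重新计算
        else:
            logits[pid] = cache[pid]        # 未受影响：复用基线
    return logits

calls = []
def run_patch(patch):
    calls.append(patch)
    return patch * 2  # 占位：代表一次昂贵的模型前向

patches = {0: 10, 1: 20, 2: 30}
cache = {pid: run_patch(p) for pid, p in patches.items()}  # 基线一次性计算
calls.clear()

out = segment_with_cache(patches, run_patch, cache, perturbed_ids={1})
assert out == {0: 20, 1: 40, 2: 60}
assert calls == [20]  # 只有被扰动的 patch 被重算
```

KernelSHAP 需要大量联盟（coalition）评估，缓存命中率越高、节省越可观，与摘要报告的 15%-30% 计算节省方向一致。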

9. [Seeing Through the Tool: A Controlled Benchmark for Occlusion Robustness in Foundation Segmentation Models](https://arxiv.org/abs/2604.11711v1)
   - Published：2026-04-14 00:46
   - 作者：Nhan Ho，Luu Le，Thanh-Huy Nguyen，Thien Nguyen，Xiaofeng Liu，Ulas Bagci
   - 来源：arxiv
   - 相关性分数：71
   - 命中原因：title matched "segmentation"; has PDF; has rich summary; has complete metadata
   - 分类：cs.CV
   - 标签：评测 / 数据 / 方法
   - 主题词：Benchmark / Evaluation
   - PDF：https://arxiv.org/pdf/2604.11711v1
   - 摘要：Occlusion, where target structures are partially hidden by surgical instruments or overlapping tissues, remains a critical yet underexplored challenge for foundation segmentation models in clinical endoscopy. We introduce OccSAM-Bench, a benchmark designed to systematically evaluate SAM-family models under controlled, synthesized surgical occlusion. Our framework simulates two occlusion types (i.e., surgical tool overlay and cutout) across three calibrated severity levels on three public polyp datasets. We propose a novel three-region evaluation protocol that decomposes segmentation performance into full, visible-only, and invisible targets. This metric exposes behaviors that standard amodal evaluation obscures, revealing two distinct model archetypes: Occluder-Aware models (SAM, SAM 2, SAM 3, MedSAM3), which prioritize visible tissue delineation and reject instruments, and Occluder-Agnostic models (MedSAM, MedSAM2), which confidently predict into occluded regions. SAM-Med2D aligns with neither and underperforms across all conditions. Ultimately, our results demonstrate that occlusion robustness is not uniform across architectures, and model selection must be driven by specific clinical intent: whether prioritizing conservative visible-tissue segmentation or the amodal inference of hidden anatomy.
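摘要中的「三区域评测协议」（full / visible-only / invisible）可以这样理解：把 Dice 分别限制在可见与被遮挡区域上计算。下面是一个假设性草图（掩码以 0/1 展平列表表示，具体协议以论文为准）：

```python
# 假设性草图：按 OccSAM-Bench 思路把 Dice 分解到三个区域。
# full = 整个目标；visible = 目标中未被遮挡部分；invisible = 目标中被遮挡部分。

def dice(pred, gt):
    inter = sum(p & g for p, g in zip(pred, gt))
    denom = sum(pred) + sum(gt)
    return 2 * inter / denom if denom else 1.0

def three_region_dice(pred, gt, occluder):
    visible_gt   = [g & (1 - o) for g, o in zip(gt, occluder)]
    invisible_gt = [g & o       for g, o in zip(gt, occluder)]
    return {
        "full":      dice(pred, gt),
        "visible":   dice([p & (1 - o) for p, o in zip(pred, occluder)], visible_gt),
        "invisible": dice([p & o       for p, o in zip(pred, occluder)], invisible_gt),
    }

# 一个"Occluder-Aware"式预测：只分割可见组织，不猜被遮挡区域。
res = three_region_dice(pred=[1, 1, 0, 0, 0],
                        gt=[1, 1, 1, 1, 0],
                        occluder=[0, 0, 1, 1, 0])
assert res["visible"] == 1.0
assert res["invisible"] == 0.0
```

这种分解能区分出标准 amodal 评测掩盖的行为：上例中模型在可见区得满分、在被遮挡区得零分，正对应摘要描述的 Occluder-Aware 原型。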

10. [Progressively Texture-Aware Diffusion for Contrast-Enhanced Sparse-View CT](https://arxiv.org/abs/2604.11559v1)
   - Published：2026-04-13 22:45
   - 作者：Tianqi Wang，Wenchao Du，Hongyu Yang
   - 来源：arxiv
   - 相关性分数：69
   - 命中原因：title matched "diffusion"; has PDF; has rich summary; has complete metadata
   - 分类：cs.CV, physics.med-ph
   - 标签：方法
   - 主题词：Diffusion
   - PDF：https://arxiv.org/pdf/2604.11559v1
   - 摘要：Diffusion-based sparse-view CT (SVCT) imaging has achieved remarkable advancements in recent years, thanks to its more stable generative capability. However, recovering reliable image content and visually consistent textures is still a crucial challenge. In this paper, we present a Progressively Texture-aware Diffusion (PTD) model, a coarse-to-fine learning framework tailored for SVCT. Specifically, PTD comprises a basic reconstructive module PTD_rec and a conditional diffusion module PTD_diff. PTD_rec first learns a deterministic mapping to recover the majority of the underlying low-frequency signals (i.e., coarse content with smoothed textures), which serves as the initial estimation to enable fidelity. Moreover, PTD_diff aims to reconstruct high-fidelity details for coarse prediction, which explores a dual-domain guided conditional diffusion to generate reliable and consistent textures. Extensive experiments on sparse-view CT reconstruction demonstrate that our PTD achieves superior performance in terms of structure similarity and visual appeal with only a few sampling steps, which mitigates the randomness inherent in general diffusion models and enables a better trade-off between visual quality and fidelity of high-frequency details.

## PubMed AI 观察

### 本组速览

- 《Comparing Large Language Models and Traditional Machine Translation Tools for Translating Medical Consultation Summaries: Quantitative Pilot Feasibility Study.》〔评测 / 应用 / 方法〕：BACKGROUND: Translation of medical consultation summaries is essential for equitable health care communication in culturally and linguistically diverse populat…
- 《Toward Sustainable Clinical Analysis: Benchmarking Plastic Use in LC-MS Sample Preparation - Exemplified by Ketamine Analogues in Whole Blood.》〔评测 / 方法〕：The aim of this study was to assess and benchmark plastic consumption in sample preparation for forensic analysis, alongside the development of an LC-MS method…
- 《Text4Seg++: Advancing Image Segmentation via Generative Language Modeling.》〔评测 / 数据 / 方法〕：Multimodal Large Language Models (MLLMs) have shown exceptional capabilities in vision-language tasks. However, effectively integrating image segmentation into…

### 论文速览

1. [Comparing Large Language Models and Traditional Machine Translation Tools for Translating Medical Consultation Summaries: Quantitative Pilot Feasibility Study.](https://pubmed.ncbi.nlm.nih.gov/41973653/)
   - Entered：2026-04-13 21:24
   - 作者：Andy Li，Wei Zhou，Rashina Hoda，Chris Bain，Peter Poon
   - 来源：pubmed
   - 相关性分数：107
   - 命中原因：title matched "language model"; summary matched "benchmark"; summary matched "clinical"; has DOI
   - 分类：Journal Article, Comparative Study
   - 标签：评测 / 应用 / 方法
   - 主题词：Benchmark / Evaluation
   - 摘要：BACKGROUND: Translation of medical consultation summaries is essential for equitable health care communication in culturally and linguistically diverse populations. While machine translation (MT) tools and large language models (LLMs) are widely accessible, their feasibility and safety for health care contexts remain underexplored. OBJECTIVE: This pilot study investigates the feasibility and limitations of using LLMs and traditional MT tools to translate medical consultation summaries from English into the most common languages other than English spoken in Australia: Arabic, Chinese (simplified written form), and Vietnamese. METHODS: Two simulated summaries (a simple patient-facing summary and a complex clinician-oriented interprofessional letter) were translated using 3 LLMs (GPT-4o, Llama-3.1, and Gemma-2) and 3 MT tools (Google Translate, Microsoft Bing Translator, and DeepL). Translations were benchmarked against professional third-party interpreter translations using Bilingual Evaluation Understudy, Character-level F-score, and Metric for Evaluation of Translation with Explicit Ordering metrics. RESULTS: The translation performance varied across languages, tools, and summary complexity when assessed using automatic evaluation metrics. Traditional MT tools outperformed LLMs on surface-level metrics, while LLMs showed relative strengths in semantic similarity for Vietnamese and Chinese. Arabic translations improved with complex input, suggesting morphological advantages. The metric-based evaluation highlighted feasibility but also risks, particularly in Chinese clinical contexts. CONCLUSIONS: This pilot study provides formative evidence of opportunities and limitations in applying artificial intelligence translation for health care communication. Findings underscore the importance of human oversight; domain-specific evaluation metrics; and further formative and clinical research to guide the safe, equitable use of artificial intelligence translation tools.
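摘要所用的 Character-level F-score（ChrF）一类指标，核心是字符 n-gram 的精确率与召回率调和。下面是一个简化草图（单一 n、未含官方 sacreBLEU 实现的空白处理与多阶平均等细节，仅作原理说明）：

```python
# 简化草图：字符 n-gram F-score（ChrF 风格，beta=2 偏重召回率）。
# 以 Counter 多重集交集统计假设译文与参考译文的 n-gram 重叠。
from collections import Counter

def char_fscore(hyp, ref, n=3, beta=2.0):
    h = Counter(hyp[i:i + n] for i in range(len(hyp) - n + 1))
    r = Counter(ref[i:i + n] for i in range(len(ref) - n + 1))
    overlap = sum((h & r).values())  # 多重集交集：逐 n-gram 取最小计数
    if not overlap:
        return 0.0
    prec = overlap / sum(h.values())
    rec = overlap / sum(r.values())
    return (1 + beta**2) * prec * rec / (beta**2 * prec + rec)

assert char_fscore("the patient", "the patient") == 1.0  # 完全一致
assert char_fscore("abc", "xyz") == 0.0                  # 毫无重叠
```

字符级指标对形态变化丰富的语言（如摘要提到的阿拉伯语）通常比词级 BLEU 更稳健，这与研究观察到的形态学优势方向相符。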

2. [Toward Sustainable Clinical Analysis: Benchmarking Plastic Use in LC-MS Sample Preparation - Exemplified by Ketamine Analogues in Whole Blood.](https://pubmed.ncbi.nlm.nih.gov/41972595/)
   - Entered：2026-04-13 16:53
   - 作者：Line Noreng，Åse Marit Leere Øiestad，Frederik André Hansen，Hanne Røberg-Larsen，Steven Ray Wilson，Elisabeth Leere Øiestad
   - 来源：pubmed
   - 相关性分数：107
   - 命中原因：title matched "benchmark"; title matched "clinical"; has DOI; has rich summary
   - 分类：Journal Article
   - 标签：评测 / 方法
   - 主题词：Benchmark / Agent
   - 摘要：The aim of this study was to assess and benchmark plastic consumption in sample preparation for forensic analysis, alongside the development of an LC-MS method for ketamine analogues in whole blood, with various sustainability-related scores and parameters examined throughout. Ketamine analogues are emerging psychoactive substances associated with intoxication and fatalities globally. An analytical method was developed for determining ketamine and eight of its analogues in human whole blood. Focus was placed on traditional analytical parameters but also the consumption of plastics and reagents, an overlooked aspect of clinical and forensic chemistry. We evaluated three sample preparation techniques: protein precipitation (PPT), liquid-liquid extraction (LLE), and electromembrane extraction (EME). While PPT and LLE are well-established techniques used in clinical settings, EME (i.e., electrophoresis across an oil membrane) is a less established but highly promising approach. All three sample preparation approaches demonstrated similar performance with recoveries above 85% and matrix effects averaging approximately 100%. The EME approach was subsequently refined using a Box-Behnken-based design of experiments and was validated according to the American Academy of Forensic Sciences guidelines. All validated parameters were within the limits, suggesting that the tunable EME approach is a potentially valuable tool in clinical chemistry. The amount of plastic consumables per 100 samples was calculated as being 511 g for PPT, 864 g for LLE, and 303 g for EME. Correspondingly, the amount of organic solvent consumption per sample was 470, 1570, and 210 μL, respectively. The AGREEprep scores (ranges from 0 to 1, 1 is best) were 0.48 ± 0.04, 0.36 ± 0.04, and 0.55 ± 0.05, respectively. The rounded value of PPT (arguably the most used approach for related samples) is 0.5 kg per 100 samples, and we propose that this number be used as a benchmark for plastic consumption in today's sample preparation. This may serve as a practical reference when developing and evaluating future sample preparation strategies. For example, employing EME here allowed for a 40% reduction in plastics compared to the benchmark, illustrating a significant improvement. However, the investigated approaches have a significant use of single-use consumables, inviting sample preparation approaches that can reduce the plastic footprint.

3. [Text4Seg++: Advancing Image Segmentation via Generative Language Modeling.](https://pubmed.ncbi.nlm.nih.gov/41973591/)
   - Entered：2026-04-13 20:44
   - 作者：Mengcheng Lan，Chaofeng Chen，Jiaxing Xu，Zongrui Li，Yiping Ke，Xudong Jiang 等
   - 来源：pubmed
   - 相关性分数：89
   - 命中原因：title matched "language model"; summary matched "benchmark"; has DOI; has rich summary
   - 分类：Journal Article
   - 标签：评测 / 数据 / 方法
   - 主题词：Benchmark / Language Model
   - 摘要：Multimodal Large Language Models (MLLMs) have shown exceptional capabilities in vision-language tasks. However, effectively integrating image segmentation into these models remains a significant challenge. In this work, we propose a novel text-as-mask paradigm that casts image segmentation as a text generation problem, eliminating the need for additional decoders and significantly simplifying the segmentation process. Our key innovation is semantic descriptors, a new textual representation of segmentation masks where each image patch is mapped to its corresponding text label. We first introduce image-wise semantic descriptors, a patch-aligned textual representation of segmentation masks that integrates naturally into the language modeling pipeline. To enhance efficiency, we introduce the Row-wise Run-Length Encoding (R-RLE), which compresses redundant text sequences, reducing the length of semantic descriptors by 74% and accelerating inference by 3×, without compromising performance. Building upon this, our initial framework Text4Seg achieves strong segmentation performance across a wide range of vision tasks. To further improve granularity and compactness, we propose box-wise semantic descriptors, which localizes regions of interest using bounding boxes and represents region masks via structured mask tokens called semantic bricks. This leads to our refined model, Text4Seg++, which formulates segmentation as a next-brick prediction task, combining precision, scalability, and generative efficiency. Comprehensive experiments on natural and remote sensing datasets show that Text4Seg++ consistently outperforms state-of-the-art models across diverse benchmarks without any task-specific fine-tuning, while remaining compatible with existing MLLM backbones. Our work highlights the effectiveness, scalability, and generalizability of text-driven image segmentation within the MLLM framework.
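摘要中的 Row-wise Run-Length Encoding（R-RLE）思想是按行对重复的 patch 标签做游程压缩。下面用一个玩具编解码器示意（标签词表与编码格式均为示例，论文实际的 token 方案可能不同）：

```python
# 玩具示意：按行的游程编码（R-RLE 思路）。
# 每行 patch 标签压缩为 (label, run) 对；大片同质区域只占少量 token。

def rrle_encode(rows):
    encoded = []
    for row in rows:
        runs, prev, count = [], row[0], 1
        for label in row[1:]:
            if label == prev:
                count += 1
            else:
                runs.append((prev, count))
                prev, count = label, 1
        runs.append((prev, count))
        encoded.append(runs)
    return encoded

def rrle_decode(encoded):
    return [[label for label, n in runs for _ in range(n)] for runs in encoded]

grid = [["sky"] * 6, ["sky", "sky", "cat", "cat", "cat", "grass"]]
enc = rrle_encode(grid)
assert enc == [[("sky", 6)], [("sky", 2), ("cat", 3), ("grass", 1)]]
assert rrle_decode(enc) == grid  # 无损往返
```

对以背景为主的真实分割图，这种按行压缩能显著缩短文本序列，与摘要报告的 74% 长度缩减方向一致。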

4. [Diversity in clinical Trials: The example of systemic lupus erythematosus.](https://pubmed.ncbi.nlm.nih.gov/41969623/)
   - Entered：2026-04-13 14:27
   - 作者：Andrew Bevan，Nora Carroll，Garrick Wallstrom，Sudhakar Sridharan
   - 来源：pubmed
   - 相关性分数：82
   - 命中原因：title matched "clinical"; summary matched "benchmark"; has DOI; has rich summary
   - 分类：Journal Article
   - 标签：评测 / 方法
   - 主题词：Benchmark / Clinical
   - 摘要：OBJECTIVE: The FDA requires clinical trials to reflect real-world diversity. Systemic lupus erythematosus (SLE) is a disease that disproportionately affects individuals of Black African descent that has not been assessed for diversity in clinical trials to date. This study compared demographics from two real-world data (RWD) sources and proposes parameters for representative trial populations. METHODS: Demographics of United States (US) SLE patients were extracted from electronic health records (EHR) and registry data. These were used to model statistically representative hypothetical trial cohorts and compared with completed SLE trials using statistical tests. RESULTS: Compared with both EHR and registry-derived populations, male participants were significantly underrepresented in US SLE trials (median z-score -1.04 and -0.82; P = .005; r = 0.89 for both). Asian participants were significantly underrepresented relative to registry estimates (median z-score -1.45; P = .03811; r = 0.85) but not EHR data, while Black or African American (BAA) representation was significantly higher than EHR-derived estimates (median z-score 0.97; P = .038; r = 0.69) and not significantly different from registry data. No significant differences were observed for White or Hispanic/Latino(a) populations. A SLE trial of 100 subjects would require 5-17 males, 45-65 White, 16-33 BAA, 0-7 Asian, and 4-15 Hispanic or Latino/a (HL) subjects per EHR data; or 4-16 males, 50-69 White, 22-40 BAA, and 13-29 HL subjects per registry data. Larger trials would require proportionally narrower ranges. CONCLUSION: This study demonstrates demographic disparities in SLE trials and offers actionable benchmarks for diversity planning per FDA guidance.

5. [Comparative Performance of Gemini 3 Pro and GPT-5 Family Models on Ophthalmology Board-Style Questions.](https://pubmed.ncbi.nlm.nih.gov/41970036/)
   - Entered：2026-04-13 14:30
   - 作者：Ryan S Shean，Jayanth Kumar Mallapu，Tathya Shah，Haroon Adam Rasheed，David N Younessi，Yih Chung Tham 等
   - 来源：pubmed
   - 相关性分数：78
   - 命中原因：summary matched "language model"; summary matched "benchmark"; summary matched "clinical"; has DOI
   - 分类：Journal Article
   - 标签：评测 / 数据 / 方法
   - 主题词：Benchmark / Evaluation
   - 摘要：OBJECTIVE: To compare the performance of state-of-the-art Gemini and GPT models on ophthalmology board-style questions and examine variation by subspecialty, cognitive complexity, and question type. DESIGN: A cross-sectional evaluation of 12 distinct large language model (LLM) configurations using a standardized ophthalmology question set. SUBJECTS: Five hundred multiple-choice questions (250 from the American Academy of Ophthalmology's Basic and Clinical Science Course [BCSC]; 250 from StatPearls). METHODS: Twelve configurations of the following LLMs: Gemini 3 Pro, Gemini 2.5 Pro, GPT-5.1 Pro, GPT-5 Pro, GPT-5.2, GPT-5.1, and GPT-5, interpreted the questions using standardized prompting procedures. Questions were categorized by subspecialty, multimodal content (image vs. text-only), and cognitive complexity (first, second, or third order). Accuracy, paired discordance (McNemar tests), and one-way analysis of variance with Tukey correction were used to compare performance. Human benchmarking used BCSC percent-correct data. MAIN OUTCOME MEASURES: Overall accuracy, subspecialty accuracy, image vs. nonimage accuracy, cognitive-complexity accuracy, and paired model-level discordance. RESULTS: Model accuracy ranged from 81.4% to 94.0%. Gemini 3 Pro High Reasoning achieved the highest accuracy (94.0%), followed by Gemini 3 Pro Low Reasoning (92.4%). GPT-5.1 Pro led the GPT family (90.4%), whereas GPT-5.2 Base Model performed lowest (81.4%). Analysis of variance showed significant heterogeneity (P < 0.001), but most Tukey-corrected pairwise differences were nonsignificant. McNemar tests demonstrated significantly more correct paired responses for Gemini 3 Pro High Reasoning than for GPT-5.2 and all GPT-5/5.1 variants. Models performed markedly better on BCSC (mean 94.4%) than StatPearls (81.9%); human BCSC mean accuracy was 64.5%. Image-based items produced a 10- to 22-point accuracy decrement across all systems. Accuracy declined with increasing cognitive complexity, with the clearest separation on third-order management questions. CONCLUSIONS: Gemini 3 Pro had the best general-purpose LLM performance on ophthalmology board-style questions, providing near-perfect accuracy, while outperforming all GPT-5 family variants across domains and complexity levels. Significant deficits on image-based and third-order questions highlight persistent multimodal limitations and the need for ongoing benchmarking using challenging, clinically grounded datasets. FINANCIAL DISCLOSURES: Proprietary or commercial disclosure may be found in the Footnotes and Disclosures at the end of this article.
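
该研究用 McNemar 检验比较两款模型在同一批题目上的成对正误差异。下面是一个最小示意（假设使用精确二项式版本的 McNemar 检验，只依赖两模型答题结果不一致的题目计数 b、c；此为示意性草稿，并非论文原始代码）：

```python
from math import comb

def mcnemar_exact(b: int, c: int) -> float:
    """精确（二项式）McNemar 检验的双侧 p 值。

    b -- 模型 A 答对而模型 B 答错的题数
    c -- 模型 A 答错而模型 B 答对的题数
    仅结果不一致的题目对参与检验；p 值越小，两模型差异越显著。
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    # 双侧 p 值 = 2 * P(X <= min(b, c))，其中 X ~ Binomial(n, 0.5)，上限截断为 1
    p = 2 * sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(p, 1.0)

# 示意：500 题中两模型有 30 题结果不一致，A 对 B 错 22 题、A 错 B 对 8 题
print(round(mcnemar_exact(22, 8), 4))
```

只统计不一致题目对是 McNemar 检验的要点：两模型同对或同错的题目不提供区分信息，这也是成对比较比直接比较总体准确率更敏感的原因。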

## OpenAlex AI 观察

### 本组速览

- 《ECO-Charge: Multi-Agent Smart-Charging for Electric Vehicles》〔方法〕：摘要缺失（来源仅提供 HAL 占位文本 "International audience"）

### 论文速览

1. [ECO-Charge: Multi-Agent Smart-Charging for Electric Vehicles](https://uphf.hal.science/hal-05491692)
   - Published：2026-04-14 08:00
   - 作者：Mathis Crinchon，Alaa Daoud，Emmanuel Adam，René Mandiau
   - 来源：openalex
   - 相关性分数：64
   - 命中原因：title matched "agent"; has complete metadata
   - 分类：Article, Engineering, Electrical and Electronic Engineering, Electric Vehicles and Infrastructure, Electric and Hybrid Vehicle Technologies
   - 标签：方法
   - 主题词：Agent
   - 摘要：缺失（来源仅提供 HAL 占位文本 "International audience"，非正式摘要）
