# Daily Paper Digest

- Generated: 2026-04-21 11:40:46 (Asia/Shanghai)
- Retrieval window: last 24 hours
- Hit overview: LLM=15, Vision=10, PubMed AI=5, OpenAlex AI=0
- Ranking strategy: hybrid (relevance first, published_at tie-break)
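The "hybrid (relevance first, published_at tie-break)" ordering amounts to a two-key sort. A minimal sketch, assuming the feed items carry `relevance` and `published_at` fields (the field names are illustrative):

```python
from datetime import datetime

# Hybrid ranking sketch: primary key is relevance (descending);
# ties are broken by published_at (newer first).
papers = [
    {"title": "A", "relevance": 107, "published_at": datetime(2026, 4, 21, 1, 36)},
    {"title": "B", "relevance": 112, "published_at": datetime(2026, 4, 21, 1, 59)},
    {"title": "C", "relevance": 107, "published_at": datetime(2026, 4, 21, 1, 0)},
]

# Negate both keys so a single ascending sort yields "highest first".
ranked = sorted(papers, key=lambda p: (-p["relevance"], -p["published_at"].timestamp()))
```

With the toy data above, B (highest relevance) leads, and A beats C on recency despite equal relevance.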

## Today's Highlights

- Topic "Benchmark": 18 hits spanning LLM, Vision, and other groups; representative papers include "MathNet: a Global Multimodal Benchmark for Mathematical Reasoning and Retrieval" and "Benchmarking System Dynamics AI Assistants: Cloud Versus Local LLMs on CLD Extraction and Discussion".
- Topic "Language Model": 12 hits spanning LLM, Vision, and other groups; representative papers include "Benchmarking System Dynamics AI Assistants: Cloud Versus Local LLMs on CLD Extraction and Discussion" and "MASS-RAG: Multi-Agent Synthesis Retrieval-Augmented Generation".
- Topic "Reasoning": 7 hits spanning LLM and Vision; representative papers include "MathNet: a Global Multimodal Benchmark for Mathematical Reasoning and Retrieval" and "OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation".

## What to Handle This Week

1. [Paper Circle: An Open-source Multi-agent Research Discovery and Analysis Framework](https://arxiv.org/abs/2604.06170v1)
   - Feedback status: starred
   - Due by: 2026-04-18 (3 days overdue)
   - Next step: compare planner design with newer agent benchmarks
   - Note: Anchor paper for the multi-agent discovery workflow; compare its planner design with newer agent benchmarks.
   - Why now: overdue by at least 3 days; a next step is set but not yet started
   - Groups covered: LLM
   - Coverage span: 1 day / 1 feed / 1 hit
   - History window: 2026-04-08 17:10 -> 2026-04-08 17:10
   - Source: arXiv
   - Abstract: The rapid growth of scientific literature has made it increasingly difficult for researchers to efficiently discover, evaluate, and synthesize relevant work. Recent advances in multi-agent large language models (LLMs) have demonstrated strong potential for understanding user intent and are being trained to utilize various tools. In this paper, we introduce Paper Circle, a multi-agent research discovery and analysis system designed to reduce the effort required to find, assess, organize, and understand academic literature. The system comprises two complementary pipelines: (1) a Discovery Pipeline that integrates offline and online retrieval from multiple sources, multi-criteria scoring, diversity-aware ranking, and structured outputs; and (2) an Analysis Pipeline that transforms individual papers into structured knowledge graphs with typed nodes such as concepts, methods, experiments, and figures, enabling graph-aware question answering and coverage verification. Both pipelines are implemented within a coder LLM-based multi-agent orchestration framework and produce fully reproducible, synchronized outputs including JSON, CSV, BibTeX, Markdown, and HTML at each agent step. This paper describes the system architecture, agent roles, retrieval and scoring methods, knowledge graph schema, and evaluation interfaces that together form the Paper Circle research workflow. We benchmark Paper Circle on both paper retrieval and paper review generation, reporting hit rate, MRR, and Recall at K. Results show consistent improvements with stronger agent models. We have publicly released the website at https://papercircle.vercel.app/ and the code at https://github.com/MAXNORM8650/papercircle.
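The abstract's "diversity-aware ranking" step is not specified further. One common scheme for this is maximal marginal relevance (MMR), sketched here purely as an assumption; the `mmr_rank` helper and toy scores are illustrative, not Paper Circle's code:

```python
# MMR sketch: greedily pick items that balance relevance against
# similarity to items already selected. lam trades the two off.
def mmr_rank(candidates, relevance, similarity, lam=0.7, k=3):
    selected = []
    pool = list(candidates)
    while pool and len(selected) < k:
        def score(c):
            max_sim = max((similarity(c, s) for s in selected), default=0.0)
            return lam * relevance[c] - (1 - lam) * max_sim
        best = max(pool, key=score)
        selected.append(best)
        pool.remove(best)
    return selected

# Toy data: "a" and "b" are near-duplicates; "c" is distinct but less relevant.
rel = {"a": 1.0, "b": 0.95, "c": 0.6}
sim = lambda x, y: 0.9 if {x, y} == {"a", "b"} else 0.1
picked = mmr_rank(["a", "b", "c"], rel, sim, lam=0.5, k=2)
```

Here the diversity penalty suppresses the near-duplicate "b", so the second slot goes to the distinct item "c".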


## Topic Focus

### Benchmark

- Hits: 18
- Groups covered: LLM, Vision, PubMed AI
- Representative papers: "MathNet: a Global Multimodal Benchmark for Mathematical Reasoning and Retrieval", "Benchmarking System Dynamics AI Assistants: Cloud Versus Local LLMs on CLD Extraction and Discussion", "ClawEnvKit: Automatic Environment Generation for Claw-Like Agents"
- Quick reads:
  - "MathNet: a Global Multimodal Benchmark for Mathematical Reasoning and Retrieval" [evaluation / data / methods]: Mathematical problem solving remains a challenging test of reasoning for large language and multimodal models, yet existing benchmarks are limited in size, lan…
  - "Benchmarking System Dynamics AI Assistants: Cloud Versus Local LLMs on CLD Extraction and Discussion" [evaluation / application / methods]: We present a systematic evaluation of large language model families -- spanning both proprietary cloud APIs and locally-hosted open-source models -- on two pur…

### Language Model

- Hits: 12
- Groups covered: LLM, Vision, PubMed AI
- Representative papers: "Benchmarking System Dynamics AI Assistants: Cloud Versus Local LLMs on CLD Extraction and Discussion", "MASS-RAG: Multi-Agent Synthesis Retrieval-Augmented Generation", "StepPO: Step-Aligned Policy Optimization for Agentic Reinforcement Learning"
- Quick reads:
  - "Benchmarking System Dynamics AI Assistants: Cloud Versus Local LLMs on CLD Extraction and Discussion" [evaluation / application / methods]: We present a systematic evaluation of large language model families -- spanning both proprietary cloud APIs and locally-hosted open-source models -- on two pur…
  - "MASS-RAG: Multi-Agent Synthesis Retrieval-Augmented Generation" [evaluation / methods]: Large language models (LLMs) are widely used in retrieval-augmented generation (RAG) to incorporate external knowledge at inference time. However, when retriev…

### Reasoning

- Hits: 7
- Groups covered: LLM, Vision
- Representative papers: "MathNet: a Global Multimodal Benchmark for Mathematical Reasoning and Retrieval", "OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation", "StepPO: Step-Aligned Policy Optimization for Agentic Reinforcement Learning"
- Quick reads:
  - "MathNet: a Global Multimodal Benchmark for Mathematical Reasoning and Retrieval" [evaluation / data / methods]: Mathematical problem solving remains a challenging test of reasoning for large language and multimodal models, yet existing benchmarks are limited in size, lan…
  - "OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation" [evaluation / application / methods]: Chain-of-Thought (CoT) reasoning has become a powerful driver of trajectory prediction in VLA-based autonomous driving, yet its autoregressive nature imposes a…

### Agent

- Hits: 4
- Groups covered: LLM
- Representative papers: "ClawEnvKit: Automatic Environment Generation for Claw-Like Agents", "ComPASS: Towards Personalized Agentic Social Support via Tool-Augmented Companionship", "Agentic Forecasting using Sequential Bayesian Updating of Linguistic Beliefs"
- Quick reads:
  - "ClawEnvKit: Automatic Environment Generation for Claw-Like Agents" [evaluation / data / methods]: Constructing environments for training and evaluating claw-like agents remains a manual, human-intensive process that does not scale. We argue that what is nee…
  - "ComPASS: Towards Personalized Agentic Social Support via Tool-Augmented Companionship" [evaluation / application / methods]: Developing compassionate interactive systems requires agents to not only understand user emotions but also provide diverse, substantive support. While recent w…

### Alignment

- Hits: 4
- Groups covered: LLM, Vision
- Representative papers: "IceBreaker for Conversational Agents: Breaking the First-Message Barrier with Personalized Starters", "Weakly-Supervised Referring Video Object Segmentation through Text Supervision", "AnchorSeg: Language Grounded Query Banks for Reasoning Segmentation"
- Quick reads:
  - "IceBreaker for Conversational Agents: Breaking the First-Message Barrier with Personalized Starters" [application / methods]: Conversational agents, such as ChatGPT and Doubao, have become essential daily assistants for billions of users. To further enhance engagement, these systems a…
  - "Weakly-Supervised Referring Video Object Segmentation through Text Supervision" [data / application / methods]: Referring video object segmentation (RVOS) aims to segment the target instance in a video, referred by a text expression. Conventional approaches are mostly su…

## LLM Observations

### Group at a Glance

- "MathNet: a Global Multimodal Benchmark for Mathematical Reasoning and Retrieval" [evaluation / data / methods]: Mathematical problem solving remains a challenging test of reasoning for large language and multimodal models, yet existing benchmarks are limited in size, lan…
- "Benchmarking System Dynamics AI Assistants: Cloud Versus Local LLMs on CLD Extraction and Discussion" [evaluation / application / methods]: We present a systematic evaluation of large language model families -- spanning both proprietary cloud APIs and locally-hosted open-source models -- on two pur…
- "ClawEnvKit: Automatic Environment Generation for Claw-Like Agents" [evaluation / data / methods]: Constructing environments for training and evaluating claw-like agents remains a manual, human-intensive process that does not scale. We argue that what is nee…

### Papers at a Glance

1. [MathNet: a Global Multimodal Benchmark for Mathematical Reasoning and Retrieval](https://arxiv.org/abs/2604.18584v1)
   - Published: 2026-04-21 01:59
   - Authors: Shaden Alshammari, Kevin Wen, Abrar Zainal, Mark Hamilton, Navid Safaei, Sultan Albarakati, et al.
   - Source: arXiv
   - Relevance score: 112
   - Match reasons: title matched "reasoning"; title matched "benchmark"; has PDF; has rich summary
   - Categories: cs.AI, cs.DL, cs.IR, cs.LG
   - Tags: evaluation / data / methods
   - Topics: Benchmark / Reasoning
   - PDF: https://arxiv.org/pdf/2604.18584v1
   - Abstract: Mathematical problem solving remains a challenging test of reasoning for large language and multimodal models, yet existing benchmarks are limited in size, language coverage, and task diversity. We introduce MathNet, a high-quality, large-scale, multimodal, and multilingual dataset of Olympiad-level math problems together with a benchmark for evaluating mathematical reasoning in generative models and mathematical retrieval in embedding-based systems. MathNet spans 47 countries, 17 languages, and two decades of competitions, comprising 30,676 expert-authored problems with solutions across diverse domains. In addition to the core dataset, we construct a retrieval benchmark consisting of mathematically equivalent and structurally similar problem pairs curated by human experts. MathNet supports three tasks: (i) Problem Solving, (ii) Math-Aware Retrieval, and (iii) Retrieval-Augmented Problem Solving. Experimental results show that even state-of-the-art reasoning models (78.4% for Gemini-3.1-Pro and 69.3% for GPT-5) remain challenged, while embedding models struggle to retrieve equivalent problems. We further show that retrieval-augmented generation performance is highly sensitive to retrieval quality; for example, DeepSeek-V3.2-Speciale achieves gains of up to 12%, obtaining the highest scores on the benchmark. MathNet provides the largest high-quality Olympiad dataset together with the first benchmark for evaluating mathematical problem retrieval, and we publicly release both the dataset and benchmark at https://mathnet.mit.edu.

2. [Benchmarking System Dynamics AI Assistants: Cloud Versus Local LLMs on CLD Extraction and Discussion](https://arxiv.org/abs/2604.18566v1)
   - Published: 2026-04-21 01:53
   - Authors: Terry Leitch
   - Source: arXiv
   - Relevance score: 108
   - Match reasons: title matched "benchmark"; summary matched "reasoning"; summary matched "evaluation"; has PDF
   - Categories: cs.AI, cs.HC, cs.LG
   - Tags: evaluation / application / methods
   - Topics: Benchmark / Language Model
   - PDF: https://arxiv.org/pdf/2604.18566v1
   - Abstract: We present a systematic evaluation of large language model families -- spanning both proprietary cloud APIs and locally-hosted open-source models -- on two purpose-built benchmarks for System Dynamics AI assistance: the **CLD Leaderboard** (53 tests, structured causal loop diagram extraction) and the **Discussion Leaderboard** (interactive model discussion, feedback explanation, and model building coaching). On CLD extraction, cloud models achieve 77-89% overall pass rates; the best local model reaches 77% (Kimi K2.5 GGUF Q3, zero-shot engine), matching mid-tier cloud performance. On Discussion, the best local models achieve 50-100% on model building steps and 47-75% on feedback explanation, but only 0-50% on error fixing -- a category dominated by long-context prompts that expose memory limits in local deployments. A central contribution of this paper is a systematic analysis of *model type effects* on performance: we compare reasoning vs. instruction-tuned architectures, GGUF (llama.cpp) vs. MLX (mlx_lm) backends, and quantization levels (Q3 / Q4_K_M / MLX-3bit / MLX-4bit / MLX-6bit) across the same underlying model families. We find that backend choice has larger practical impact than quantization level: mlx_lm does not enforce JSON schema constraints, requiring explicit prompt-level JSON instructions, while llama.cpp grammar-constrained sampling handles JSON reliably but causes indefinite generation on long-context prompts for dense models. We document the full parameter sweep (t, p, k) for all local models, cleaned timing data (stuck requests excluded), and a practitioner guide for running 671B-123B parameter models on Apple Silicon.

3. [ClawEnvKit: Automatic Environment Generation for Claw-Like Agents](https://arxiv.org/abs/2604.18543v1)
   - Published: 2026-04-21 01:36
   - Authors: Xirui Li, Ming Li, Derry Xu, Wei-Lin Chiang, Ion Stoica, Cho-Jui Hsieh, et al.
   - Source: arXiv
   - Relevance score: 107
   - Match reasons: title matched "agent"; summary matched "benchmark"; summary matched "evaluation"; has PDF
   - Categories: cs.AI, cs.CL
   - Tags: evaluation / data / methods
   - Topics: Benchmark / Agent
   - PDF: https://arxiv.org/pdf/2604.18543v1
   - Abstract: Constructing environments for training and evaluating claw-like agents remains a manual, human-intensive process that does not scale. We argue that what is needed is not just a dataset, but an automated pipeline capable of generating diverse, verified environments on demand. To this end, we introduce ClawEnvKit, an autonomous generation pipeline that instantiates this formalism from natural language descriptions. The pipeline comprises three modules: (1) a parser that extracts structured generation parameters from natural language input; (2) a generator that produces the task specification, tool interface, and scoring configuration; and (3) a validator that enforces feasibility, diversity, structural validity, and internal consistency across the generated environments. Using ClawEnvKit, we construct Auto-ClawEval, the first large-scale benchmark for claw-like agents, comprising 1,040 environments across 24 categories. Empirically, Auto-ClawEval matches or exceeds human-curated environments on coherence and clarity at 13,800x lower cost. Evaluated across 4 model families and 8 agent harness frameworks, we find that harness engineering boosts performance by up to 15.7 percentage points over a bare ReAct baseline, completion remains the primary axis of variation with no model saturating the benchmark, and automated generation enables evaluation at a scale previously infeasible. Beyond static benchmarking, ClawEnvKit enables live evaluation: users describe a desired capability in natural language and obtain a verified environment on demand, turning evaluation into a continuous, user-driven process. The same mechanism serves as an on-demand training environment generator, producing task distributions that adapt to an agent's current weaknesses rather than being bounded by existing user logs.

4. [MASS-RAG: Multi-Agent Synthesis Retrieval-Augmented Generation](https://arxiv.org/abs/2604.18509v1)
   - Published: 2026-04-21 01:00
   - Authors: Xingchen Xiao, Heyan Huang, Runheng Liu, Jincheng Xie
   - Source: arXiv
   - Relevance score: 107
   - Match reasons: title matched "agent"; summary matched "reasoning"; summary matched "benchmark"; has PDF
   - Categories: cs.CL
   - Tags: evaluation / methods
   - Topics: Benchmark / Language Model
   - PDF: https://arxiv.org/pdf/2604.18509v1
   - Abstract: Large language models (LLMs) are widely used in retrieval-augmented generation (RAG) to incorporate external knowledge at inference time. However, when retrieved contexts are noisy, incomplete, or heterogeneous, a single generation process often struggles to reconcile evidence effectively. We propose **MASS-RAG**, a multi-agent synthesis approach to retrieval-augmented generation that structures evidence processing into multiple role-specialized agents. MASS-RAG applies distinct agents for evidence summarization, evidence extraction, and reasoning over retrieved documents, and combines their outputs through a dedicated synthesis stage to produce the final answer. This design exposes multiple intermediate evidence views, allowing the model to compare and integrate complementary information before answer generation. Experiments on four benchmarks show that MASS-RAG consistently improves performance over strong RAG baselines, particularly in settings where relevant evidence is distributed across retrieved contexts.
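The role split MASS-RAG describes (summarization, extraction, reasoning, then a dedicated synthesis stage) can be shown structurally. `call_llm` and the role prompts below are placeholders, not the paper's implementation:

```python
# Structural sketch of a multi-agent synthesis pipeline. call_llm is a
# stand-in for a real LLM call; it returns a tagged string for illustration.
def call_llm(role_prompt, payload):
    return f"[{role_prompt}] {payload}"

def mass_rag_sketch(question, documents):
    docs = "\n".join(documents)
    summary = call_llm("summarize the evidence", docs)   # summarization agent
    facts = call_llm("extract key facts", docs)          # extraction agent
    chain = call_llm("reason over the documents", docs)  # reasoning agent
    # The synthesis stage sees all intermediate evidence views at once.
    return call_llm(
        "synthesize a final answer",
        f"Q: {question}\nSummary: {summary}\nFacts: {facts}\nReasoning: {chain}",
    )

answer = mass_rag_sketch("What is X?", ["doc one", "doc two"])
```

The point of the structure is that the final call compares complementary views rather than a single pass over raw retrieved text.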

5. [OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation](https://arxiv.org/abs/2604.18486v1)
   - Published: 2026-04-21 00:37
   - Authors: Jinghui Lu, Jiayi Guan, Zhijian Huang, Jinlong Li, Guang Li, Lingdong Kong, et al.
   - Source: arXiv
   - Relevance score: 106
   - Match reasons: title matched "reasoning"; summary matched "agent"; summary matched "benchmark"; has PDF
   - Categories: cs.CV, cs.CL, cs.RO
   - Tags: evaluation / application / methods
   - Topics: Benchmark / Reasoning
   - PDF: https://arxiv.org/pdf/2604.18486v1
   - Abstract: Chain-of-Thought (CoT) reasoning has become a powerful driver of trajectory prediction in VLA-based autonomous driving, yet its autoregressive nature imposes a latency cost that is prohibitive for real-time deployment. Latent CoT methods attempt to close this gap by compressing reasoning into continuous hidden states, but consistently fall short of their explicit counterparts. We suggest that this is due to purely linguistic latent representations compressing a symbolic abstraction of the world, rather than the causal dynamics that actually govern driving. Thus, we present OneVL (One-step latent reasoning and planning with Vision-Language explanations), a unified VLA and World Model framework that routes reasoning through compact latent tokens supervised by dual auxiliary decoders. Alongside a language decoder that reconstructs text CoT, we introduce a visual world model decoder that predicts future-frame tokens, forcing the latent space to internalize the causal dynamics of road geometry, agent motion, and environmental change. A three-stage training pipeline progressively aligns these latents with trajectory, language, and visual objectives, ensuring stable joint optimization. At inference, the auxiliary decoders are discarded and all latent tokens are prefilled in a single parallel pass, matching the speed of answer-only prediction. Across four benchmarks, OneVL becomes the first latent CoT method to surpass explicit CoT, delivering state-of-the-art accuracy at answer-only latency, and providing direct evidence that tighter compression, when guided in both language and world-model supervision, produces more generalizable representations than verbose token-by-token reasoning. Project Page: https://xiaomi-embodied-intelligence.github.io/OneVL

6. [StepPO: Step-Aligned Policy Optimization for Agentic Reinforcement Learning](https://arxiv.org/abs/2604.18401v1)
   - Published: 2026-04-20 23:22
   - Authors: Daoyu Wang, Qingchuan Li, Mingyue Cheng, Jie Ouyang, Shuo Yu, Qi Liu, et al.
   - Source: arXiv
   - Relevance score: 105
   - Match reasons: title matched "agent"; summary matched "reasoning"; summary matched "alignment"; has PDF
   - Categories: cs.CL
   - Tags: application / methods
   - Topics: Language Model / Reasoning
   - PDF: https://arxiv.org/pdf/2604.18401v1
   - Abstract: General agents have given rise to phenomenal applications such as OpenClaw and Claude Code. As these agent systems (a.k.a. Harnesses) strive for bolder goals, they demand increasingly stronger agentic capabilities from foundation Large Language Models (LLMs). Agentic Reinforcement Learning (RL) is emerging as a central post-training paradigm for empowering LLMs with these capabilities and is playing an increasingly pivotal role in agent training. Unlike single-turn token-level alignment or reasoning enhancement, as in RLHF and RLVR, Agentic RL targets multi-turn interactive settings, where the goal is to optimize core agentic capabilities such as decision making and tool use while addressing new challenges including delayed and sparse rewards, as well as long and variable context. As a result, the token-centric modeling and optimization paradigm inherited from traditional LLM RL is becoming increasingly inadequate for capturing real LLM agent behavior. In this paper, we present StepPO as a position on step-level Agentic RL. We argue that the conventional token-level Markov Decision Process (MDP) should be advanced to a step-level MDP formulation, and that the step, rather than the token, should be regarded as the proper action representation for LLM agents. We then propose step-level credit assignment as the natural optimization counterpart of this formulation, thereby aligning policy optimization and reward propagation with the granularity of agent decisions. Finally, we discuss the key systems designs required to realize step-level Agentic RL in practice and preliminary experiments provide initial evidence for the effectiveness of this perspective. We hope that the step-aligned, step-level paradigm embodied in StepPO offers the Agentic RL community a useful lens for understanding agent behavior and helps advance LLMs toward stronger general-agent capabilities.
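The step-level credit assignment the abstract argues for can be illustrated with a GRPO-style group-relative advantage computed once per trajectory and broadcast to *steps* rather than tokens. This is a sketch under that assumption; StepPO's actual estimator may differ:

```python
# Step-level credit assignment sketch: one scalar reward per trajectory
# becomes one advantage per step (not per token).
def step_advantages(group_rewards, steps_per_traj):
    mean = sum(group_rewards) / len(group_rewards)
    var = sum((r - mean) ** 2 for r in group_rewards) / len(group_rewards)
    std = var ** 0.5 or 1.0  # guard against a zero-variance group
    out = []
    for r, n_steps in zip(group_rewards, steps_per_traj):
        adv = (r - mean) / std       # group-relative, GRPO-style normalization
        out.append([adv] * n_steps)  # same credit for every step in the trajectory
    return out

# Two rollouts in one group: a 3-step success and a 2-step failure.
advs = step_advantages([1.0, 0.0], [3, 2])
```

Every step of the successful rollout receives positive credit and every step of the failed one negative credit, at the granularity of agent decisions rather than tokens.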

7. [ComPASS: Towards Personalized Agentic Social Support via Tool-Augmented Companionship](https://arxiv.org/abs/2604.18356v1)
   - Published: 2026-04-20 22:49
   - Authors: Zhaopei Huang, Yanfeng Jia, Jiayi Zhao, Xinjie Zhang, Wenxuan Wang, Qin Jin
   - Source: arXiv
   - Relevance score: 105
   - Match reasons: title matched "agent"; summary matched "benchmark"; summary matched "evaluation"; has PDF
   - Categories: cs.CL
   - Tags: evaluation / application / methods
   - Topics: Benchmark / Agent
   - PDF: https://arxiv.org/pdf/2604.18356v1
   - Abstract: Developing compassionate interactive systems requires agents to not only understand user emotions but also provide diverse, substantive support. While recent works explore empathetic dialogue generation, they remain limited in response form and content, struggling to satisfy diverse needs across users and contexts. To address this, we explore empowering agents with external tools to execute diverse actions. Grounded in the psychological concept of "social support", this paradigm delivers substantive, human-like companionship. Specifically, we first design a dozen user-centric tools simulating various multimedia applications, which can cover different types of social support behaviors in human-agent interaction scenarios. We then construct ComPASS-Bench, the first personalized social support benchmark for LLM-based agents, via multi-step automated synthesis and manual refinement. Based on ComPASS-Bench, we further synthesize tool use records to fine-tune the Qwen3-8B model, yielding a task-specific ComPASS-Qwen. Comprehensive evaluations across two settings reveal that while the evaluated LLMs can generate valid tool-calling requests with high success rates, significant gaps remain in final response quality. Moreover, tool-augmented responses achieve better overall performance than directly producing conversational empathy. Notably, our trained ComPASS-Qwen demonstrates substantial improvements over its base model, achieving comparable performance to several large-scale models. Our code and data are available at https://github.com/hzp3517/ComPASS.

8. [HiGMem: A Hierarchical and LLM-Guided Memory System for Long-Term Conversational Agents](https://arxiv.org/abs/2604.18349v1)
   - Published: 2026-04-20 22:44
   - Authors: Shuqi Cao, Jingyi He, Fei Tan
   - Source: arXiv
   - Relevance score: 105
   - Match reasons: title matched "agent"; summary matched "reasoning"; summary matched "benchmark"; has PDF
   - Categories: cs.CL
   - Tags: evaluation / methods
   - Topics: Benchmark / Language Model
   - PDF: https://arxiv.org/pdf/2604.18349v1
   - Abstract: Long-term conversational large language model (LLM) agents require memory systems that can recover relevant evidence from historical interactions without overwhelming the answer stage with irrelevant context. However, existing memory systems, including hierarchical ones, still often rely solely on vector similarity for retrieval. It tends to produce bloated evidence sets: adding many superficially similar dialogue turns yields little additional recall, but lowers retrieval precision, increases answer-stage context cost, and makes retrieved memories harder to inspect and manage. To address this, we propose HiGMem (Hierarchical and LLM-Guided Memory System), a two-level event-turn memory system that allows LLMs to use event summaries as semantic anchors to predict which related turns are worth reading. This allows the model to inspect high-level event summaries first and then focus on a smaller set of potentially useful turns, providing a concise and reliable evidence set through reasoning, while avoiding the retrieval overhead that would be excessively high compared to vector retrieval. On the LoCoMo10 benchmark, HiGMem achieves the best F1 on four of five question categories and improves adversarial F1 from 0.54 to 0.78 over A-Mem, while retrieving an order of magnitude fewer turns. Code is publicly available at https://github.com/ZeroLoss-Lab/HiGMem.
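The event-then-turn retrieval order can be sketched as two-level filtering. In HiGMem an LLM judges which turns are worth reading; a keyword-overlap score stands in for that judgment here, purely for illustration:

```python
# Two-level memory retrieval sketch: rank event summaries first, then
# inspect only the turns under the selected events.
def overlap(query, text):
    q, t = set(query.lower().split()), set(text.lower().split())
    return len(q & t)

def retrieve(query, events, top_events=1, top_turns=2):
    # Level 1: event summaries act as semantic anchors.
    ranked = sorted(events, key=lambda e: overlap(query, e["summary"]), reverse=True)
    evidence = []
    # Level 2: read only turns belonging to the chosen events.
    for event in ranked[:top_events]:
        turns = sorted(event["turns"], key=lambda t: overlap(query, t), reverse=True)
        evidence.extend(turns[:top_turns])
    return evidence

events = [
    {"summary": "trip to Paris in May", "turns": ["booked Paris hotel", "packed bags"]},
    {"summary": "work project deadline", "turns": ["sent report", "met manager"]},
]
hits = retrieve("when was the Paris trip", events)
```

The unrelated event's turns never enter the evidence set, which is the precision-versus-bloat argument the abstract makes.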

9. [Agentic Forecasting using Sequential Bayesian Updating of Linguistic Beliefs](https://arxiv.org/abs/2604.18576v1)
   - Published: 2026-04-21 01:57
   - Authors: Kevin Murphy
   - Source: arXiv
   - Relevance score: 90
   - Match reasons: title matched "agent"; summary matched "benchmark"; has PDF; has rich summary
   - Categories: cs.AI
   - Tags: evaluation / application / methods
   - Topics: Benchmark / Agent
   - PDF: https://arxiv.org/pdf/2604.18576v1
   - Abstract: We present BLF (Bayesian Linguistic Forecaster), an agentic system for binary forecasting that achieves state-of-the-art performance on the ForecastBench benchmark. The system is built on three ideas. (1) A Bayesian linguistic belief state: a semi-structured representation combining numerical probability estimates with natural-language evidence summaries, updated by the LLM at each step of an iterative tool-use loop. This contrasts with the common approach of appending all retrieved evidence to an ever-growing context. (2) Hierarchical multi-trial aggregation: running K independent trials and combining them using logit-space shrinkage with a data-dependent prior. (3) Hierarchical calibration: Platt scaling with a hierarchical prior, which avoids over-shrinking extreme predictions for sources with skewed base rates. On 400 backtesting questions from the ForecastBench leaderboard, BLF outperforms all the top public methods, including Cassi, GPT-5, Grok 4.20, and Foresight-32B. Ablation studies show that the structured belief state is as impactful as web search access, and that shrinkage aggregation and hierarchical calibration each provide significant additional gains. In addition, we develop a robust back-testing framework with a leakage rate below 1.5%, and use rigorous statistical methodology to compare different methods while controlling for various sources of noise.
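The "logit-space shrinkage" aggregation of K trials can be sketched as averaging trial logits and shrinking toward a prior. The fixed `shrink` weight below is an assumption for illustration; BLF uses a data-dependent prior whose form the abstract does not give:

```python
import math

def logit(p):
    return math.log(p / (1.0 - p))

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def aggregate(probs, prior=0.5, shrink=0.2):
    # Average the K trial probabilities in logit space, then pull the
    # result part-way toward the prior's logit before mapping back.
    mean_logit = sum(logit(p) for p in probs) / len(probs)
    shrunk = (1 - shrink) * mean_logit + shrink * logit(prior)
    return sigmoid(shrunk)

p_hat = aggregate([0.9, 0.8, 0.95])
```

Averaging in logit space keeps extreme trials from dominating the way raw-probability averaging would, while the shrinkage term tempers overconfidence.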

10. [Too Correct to Learn: Reinforcement Learning on Saturated Reasoning Data](https://arxiv.org/abs/2604.18493v1)
   - Published: 2026-04-21 00:43
   - Authors: Zhenwen Liang, Yujun Zhou, Sidi Lu, Xiangliang Zhang, Haitao Mi, Dong Yu
   - Source: arXiv
   - Relevance score: 89
   - Match reasons: title matched "reasoning"; summary matched "benchmark"; has PDF; has rich summary
   - Categories: cs.LG
   - Tags: evaluation / methods
   - Topics: Benchmark / Reasoning
   - PDF: https://arxiv.org/pdf/2604.18493v1
   - Abstract: Reinforcement Learning (RL) enhances LLM reasoning, yet a paradox emerges as models scale: strong base models saturate standard benchmarks (e.g., MATH), yielding correct but homogeneous solutions. In such environments, the lack of failure cases causes the advantage signal in group-relative algorithms (e.g., GRPO) to vanish, driving policies into mode collapse. To address this, we propose Constrained Uniform Top-K Sampling (CUTS), a parameter-free decoding strategy enforcing structure-preserving exploration. Unlike standard sampling that follows model biases, CUTS flattens the local optimization landscape by sampling uniformly from constrained high-confidence candidates. We integrate this into Mixed-CUTS, a training framework synergizing exploitative and exploratory rollouts to amplify intra-group advantage variance. Experiments on Qwen3 models demonstrate that our approach prevents policy degeneration and significantly boosts out-of-domain generalization. Notably, Mixed-CUTS improves Pass@1 accuracy on the challenging AIME25 benchmark by up to 15.1% over standard GRPO, validating that maintaining diversity within the semantic manifold is critical for rigorous reasoning.
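The core CUTS move, sampling uniformly from constrained high-confidence candidates rather than from the model distribution, admits a direct sketch. The `k` and `min_prob` constraints below are illustrative parameters, not the paper's exact constraint set:

```python
import random

def cuts_sample(token_probs, k=5, min_prob=0.05, rng=random):
    # Keep the top-k tokens by model probability, then drop any that
    # fall below the confidence floor.
    top = sorted(token_probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    candidates = [tok for tok, p in top if p >= min_prob]
    # Uniform choice among the survivors flattens the local landscape,
    # instead of following the model's own biases.
    return rng.choice(candidates)

probs = {"the": 0.55, "a": 0.30, "cat": 0.10, "dog": 0.04, "zzz": 0.01}
token = cuts_sample(probs, k=3, min_prob=0.05)
```

With these toy probabilities, "the", "a", and "cat" are each drawn with probability 1/3, while low-confidence tokens are excluded entirely.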

11. [Semantic Step Prediction: Multi-Step Latent Forecasting in LLM Reasoning Trajectories via Step Sampling](https://arxiv.org/abs/2604.18464v1)
   - Published: 2026-04-21 00:19
   - Authors: Yidi Yuan
   - Source: arXiv
   - Relevance score: 88
   - Match reasons: title matched "reasoning"; summary matched "evaluation"; has PDF; has rich summary
   - Categories: cs.LG
   - Tags: evaluation / methods
   - Topics: Language Model / Reasoning
   - PDF: https://arxiv.org/pdf/2604.18464v1
   - Abstract: Semantic Tube Prediction (STP) leverages representation geometry to regularize LLM hidden-state trajectories toward locally linear geodesics during fine-tuning, thereby greatly improving data efficiency. The original STP recipe samples random token sub-spans, which is compatible with the base large language model (LLM) training architecture. Inspired by STP, we investigate whether the sampling position can further enhance the semantic structure of multi-step reasoning, and hence affect its geometric impact. We applied STP at consecutive semantic reasoning step boundaries and achieved 168x more accurate multi-step latent prediction than frozen baselines on ProcessBench (3,400 samples), compared to only 4x for the random-token STP. Probing the latent manifold with a learned non-linear predictor reveals that STP-shaped trajectories are smooth curves, not straight lines: a 3-layer MLP reduces prediction error by a further 3-12x over linear extrapolation on step-boundary models. Removing the language modeling loss yields trajectories that are 2x more MLP-predictable than the combined loss, revealing a tradeoff between generation quality and geometric purity. Our results identify sampling position as the critical variable in geometric regularization and establish multi-step latent prediction MSE as a new evaluation metric for this class of methods.
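The linear-extrapolation baseline the abstract compares against can be written out directly. Toy vectors are used here; the real probe operates on LLM hidden states:

```python
# Linear extrapolation baseline: continue the last step's direction,
# i.e. predict h_next = h_curr + (h_curr - h_prev).
def extrapolate(h_prev, h_curr):
    return [2 * c - p for p, c in zip(h_prev, h_curr)]

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

# On a straight-line trajectory the baseline is exact ...
straight = extrapolate([0.0, 0.0], [1.0, 1.0])
# ... but on a curved trajectory it overshoots, which is the gap a
# learned non-linear predictor (e.g. a small MLP) can close further.
curved_true = [1.5, 1.9]
err = mse(extrapolate([0.0, 0.0], [1.0, 1.0]), curved_true)
```

The nonzero `err` on the curved case is the abstract's point that STP-shaped trajectories are smooth curves rather than straight lines.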

12. [IceBreaker for Conversational Agents: Breaking the First-Message Barrier with Personalized Starters](https://arxiv.org/abs/2604.18375v1)
   - Published: 2026-04-20 23:02
   - Authors: Hongwei Zheng, Weiqi Wu, Zhengjia Wang, Guanyu Jiang, Haoming Li, Tianyu Wu, et al.
   - Source: arXiv
   - Relevance score: 87
   - Match reasons: title matched "agent"; summary matched "alignment"; has PDF; has rich summary
   - Categories: cs.CL, cs.AI
   - Tags: application / methods
   - Topics: Agent / Alignment
   - PDF: https://arxiv.org/pdf/2604.18375v1
   - Abstract: Conversational agents, such as ChatGPT and Doubao, have become essential daily assistants for billions of users. To further enhance engagement, these systems are evolving from passive responders to proactive companions. However, existing efforts focus on activation within ongoing dialogues, while overlooking a key real-world bottleneck. In the conversation initiation stage, users may have a vague need but no explicit query intent, creating a first-message barrier where the conversation holds before it begins. To overcome this, we introduce Conversation Starter Generation: generating personalized starters to guide users into conversation. However, unlike in-conversation stages where immediate context guides the response, initiation must operate in a cold-start moment without explicit user intent. To pioneer in this direction, we present IceBreaker that frames human ice-breaking as a two-step handshake: (i) evoke resonance via Resonance-Aware Interest Distillation from session summaries to capture trigger interests, and (ii) stimulate interaction via Interaction-Oriented Starter Generation, optimized with personalized preference alignment and a self-reinforced loop to maximize engagement. Online A/B tests on one of the world's largest conversational agent products show that IceBreaker improves user active days by +0.184% and click-through rate by +9.425%, and has been deployed in production.

13. [Training and Agentic Inference Strategies for LLM-based Manim Animation Generation](https://arxiv.org/abs/2604.18364v1)
   - Published: 2026-04-20 22:54
   - Authors: Ravidu Suien Rammuni Silva, Ahmad Lotfi, Isibor Kennedy Ihianle, Golnaz Shahtahmassebi, Jordan J. Bird
   - Source: arXiv
   - Relevance score: 87
   - Match reasons: title matched "agent"; summary matched "reasoning"; has PDF; has rich summary
   - Categories: cs.AI, cs.GR, cs.MA
   - Tags: evaluation / application / methods
   - Topics: Language Model / Reasoning
   - PDF: https://arxiv.org/pdf/2604.18364v1
   - Abstract: Generating programmatic animation using libraries such as Manim presents unique challenges for Large Language Models (LLMs), requiring spatial reasoning, temporal sequencing, and familiarity with domain-specific APIs that are underrepresented in general pre-training data. A systematic study of how training and inference strategies interact in this setting is lacking in current research. This study introduces ManimTrainer, a training pipeline that combines Supervised Fine-tuning (SFT) with Reinforcement Learning (RL) based Group Relative Policy Optimisation (GRPO) using a unified reward signal that fuses code and visual assessment signals, and ManimAgent, an inference pipeline featuring Renderer-in-the-loop (RITL) and API documentation-augmented RITL (RITL-DOC) strategies. Using these techniques, this study presents the first unified training and inference study for text-to-code-to-video transformation with Manim. It evaluates 17 open-source sub-30B LLMs across nine combinations of training and inference strategies using ManimBench. Results show that SFT generally improves code quality, while GRPO enhances visual outputs and increases the models' responsiveness to extrinsic signals during self-correction at inference time. The Qwen 3 Coder 30B model with GRPO and RITL-DOC achieved the highest overall performance, with a 94% Render Success Rate (RSR) and 85.7% Visual Similarity (VS) to reference videos, surpassing the baseline GPT-4.1 model by +3 percentage points in VS. Additionally, the analysis shows that the correlation between code and visual metrics strengthens with SFT and GRPO but weakens with inference-time enhancements, highlighting the complementary roles of training and agentic inference strategies in Manim animation generation.

14. [Multilingual Training and Evaluation Resources for Vision-Language Models](https://arxiv.org/abs/2604.18347v1)
   - Published：2026-04-20 22:42
   - 作者：Daniela Baiamonte，Elena Fano，Matteo Gabburo，Stefano Simonazzi，Leonardo Rigutini，Andrea Zugarini
   - 来源：arxiv
   - 相关性分数：87
   - 命中原因：title matched "evaluation"; summary matched "benchmark"; has PDF; has rich summary
   - 分类：cs.CL, cs.AI
   - 标签：评测 / 数据 / 方法
   - 主题词：Benchmark / Language Model
   - PDF：https://arxiv.org/pdf/2604.18347v1
   - 摘要：Vision Language Models (VLMs) have achieved rapid progress in recent years. However, despite this growth, VLM development is heavily grounded in English, leading to two main limitations: (i) the lack of multilingual and multimodal datasets for training, and (ii) the scarcity of comprehensive evaluation benchmarks across languages. In this work, we address these gaps by introducing a new comprehensive suite of resources for VLM training and evaluation spanning five European languages (English, French, German, Italian, and Spanish). We adopt a regeneration-translation paradigm that produces high-quality cross-lingual resources by combining curated synthetic generation and manual annotation. Specifically, we build Multi-PixMo, a training corpus obtained by regenerating examples from pre-existing datasets with permissively licensed models: PixMo-Cap, PixMo-AskModelAnything, and CoSyn-400k. On the evaluation side, we construct a set of multilingual benchmarks derived by translating widely used English datasets (MMBench, ScienceQA, MME, POPE, AI2D). We assess the quality of these resources through qualitative and quantitative human analyses, measuring inter-annotator agreement. Additionally, we perform ablation studies to demonstrate the impact of multilingual data, relative to English-only data, on VLM training. Experiments comprising three different models show that using multilingual, multimodal examples for training VLMs is consistently beneficial on non-English benchmarks, with positive transfer to English as well.

15. [On the Importance and Evaluation of Narrativity in Natural Language AI Explanations](https://arxiv.org/abs/2604.18311v1)
   - Published：2026-04-20 22:17
   - 作者：Mateusz Cedro，David Martens
   - 来源：arxiv
   - 相关性分数：86
   - 命中原因：title matched "evaluation"; summary matched "benchmark"; has PDF; has rich summary
   - 分类：cs.CL, cs.AI
   - 标签：评测 / 数据 / 方法
   - 主题词：Benchmark / Evaluation
   - PDF：https://arxiv.org/pdf/2604.18311v1
   - 摘要：Explainable AI (XAI) aims to make the behaviour of machine learning models interpretable, yet many explanation methods remain difficult to understand. The integration of Natural Language Generation into XAI aims to deliver explanations in textual form, making them more accessible to practitioners. Current approaches, however, largely yield static lists of feature importances. Although such explanations indicate what influences the prediction, they do not explain why the prediction occurs. In this study, we draw on insights from social sciences and linguistics, and argue that XAI explanations should be presented in the form of narratives. Narrative explanations support human understanding through four defining properties: continuous structure, cause-effect mechanisms, linguistic fluency, and lexical diversity. We show that standard Natural Language Processing (NLP) metrics based solely on token probability or word frequency fail to capture these properties and can be matched or exceeded by tautological text that conveys no explanatory content. To address this issue, we propose seven automatic metrics that quantify the narrative quality of explanations along the four identified dimensions. We benchmark current state-of-the-art explanation generation methods on six datasets and show that the proposed metrics separate descriptive from narrative explanations more reliably than standard NLP metrics. Finally, to further advance the field, we propose a set of problem-agnostic XAI Narrative generation rules for producing natural language XAI explanations, so that the resulting XAI Narratives exhibit stronger narrative properties and align with the findings from the linguistic and social science literature.

## Vision 观察

### 本组速览

- 《AdaCluster: Adaptive Query-Key Clustering for Sparse Attention in Video Generation》〔应用 / 方法〕：Video diffusion transformers (DiTs) suffer from prohibitive inference latency due to quadratic attention complexity. Existing sparse attention methods either o…
- 《DiffuSAM: Diffusion Guided Zero-Shot Object Grounding for Remote Sensing Imagery》〔方法〕：Diffusion models have emerged as powerful tools for a wide range of vision tasks, including text-guided image generation and editing. In this work, we explore…
- 《Weakly-Supervised Referring Video Object Segmentation through Text Supervision》〔数据 / 应用 / 方法〕：Referring video object segmentation (RVOS) aims to segment the target instance in a video, referred by a text expression. Conventional approaches are mostly su…

### 论文速览

1. [AdaCluster: Adaptive Query-Key Clustering for Sparse Attention in Video Generation](https://arxiv.org/abs/2604.18348v1)
   - Published：2026-04-20 22:43
   - 作者：Haoyue Tan，Shengnan Wang，Yulin Qiao，Juncheng Zhang，Youhui Bai，Ping Gong 等
   - 来源：arxiv
   - 相关性分数：87
   - 命中原因：title matched "video generation"; summary matched "diffusion"; has PDF; has rich summary
   - 分类：cs.CV, cs.AI
   - 标签：应用 / 方法
   - 主题词：Diffusion / Video Generation
   - PDF：https://arxiv.org/pdf/2604.18348v1
   - 摘要：Video diffusion transformers (DiTs) suffer from prohibitive inference latency due to quadratic attention complexity. Existing sparse attention methods either overlook semantic similarity or fail to adapt to heterogeneous token distributions across layers, leading to model performance degradation. We propose AdaCluster, a training-free adaptive clustering framework that accelerates the generation of DiTs while preserving accuracy. AdaCluster applies an angle-similarity-preserving clustering method to query vectors for higher compression, and designs a Euclidean-similarity-preserving clustering method for keys, covering cluster number assignment, threshold-wise adaptive clustering, and efficient critical cluster selection. Experiments on CogVideoX-2B, HunyuanVideo, and Wan-2.1 on one A40 GPU demonstrate up to 1.67-4.31x speedup with negligible quality degradation.
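   上文摘要提到 AdaCluster 对 query 向量做「角度相似度保持」的聚类。下面用余弦相似度做一个贪心阈值聚类的最小示意（阈值 0.9 与数据均为假设值，并非论文的自适应聚类算法本身）：

   ```python
   import math

   def cosine(u, v):
       # 余弦相似度：方向相近的向量取值接近 1
       dot = sum(a * b for a, b in zip(u, v))
       nu = math.sqrt(sum(a * a for a in u))
       nv = math.sqrt(sum(b * b for b in v))
       return dot / (nu * nv)

   def angle_cluster(vectors, threshold=0.9):
       # 贪心聚类：与已有簇中心余弦相似度 >= threshold 则并入该簇，
       # 否则新开一簇（仅为示意，省略论文中的簇数分配与关键簇选择）
       centers, labels = [], []
       for v in vectors:
           for i, c in enumerate(centers):
               if cosine(v, c) >= threshold:
                   labels.append(i)
                   break
           else:
               centers.append(v)
               labels.append(len(centers) - 1)
       return labels

   queries = [(1.0, 0.0), (0.99, 0.1), (0.0, 1.0), (-1.0, 0.05)]
   print(angle_cluster(queries))  # → [0, 0, 1, 2]：方向相近的前两个向量同簇
   ```

   角度聚类后只需在簇级别做注意力计算，这是此类稀疏注意力方法降低二次复杂度的基本思路。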

2. [DiffuSAM: Diffusion Guided Zero-Shot Object Grounding for Remote Sensing Imagery](https://arxiv.org/abs/2604.18201v1)
   - Published：2026-04-20 20:50
   - 作者：Geet Sethi，Panav Shah，Ashutosh Gandhe，Soumitra Darshan Nayak
   - 来源：arxiv
   - 相关性分数：85
   - 命中原因：title matched "diffusion"; summary matched "segmentation"; has PDF; has rich summary
   - 分类：cs.CV, cs.LG
   - 标签：方法
   - 主题词：Diffusion / Segmentation
   - PDF：https://arxiv.org/pdf/2604.18201v1
   - 摘要：Diffusion models have emerged as powerful tools for a wide range of vision tasks, including text-guided image generation and editing. In this work, we explore their potential for object grounding in remote sensing imagery. We propose a hybrid pipeline that integrates diffusion-based localization cues with state-of-the-art segmentation models such as RemoteSAM and SAM3 to obtain more accurate bounding boxes. By leveraging the complementary strengths of generative diffusion models and foundational segmentation models, our approach enables robust and adaptive object localization across complex scenes. Experiments demonstrate that our pipeline significantly improves localization performance, achieving over a 14% increase in Acc@0.5 compared to existing state-of-the-art methods.
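   摘要中的 Acc@0.5 指预测框与真值框交并比（IoU）不低于 0.5 的样本比例，是目标定位的通用指标。以下为该指标的通用计算示意（坐标数据为虚构示例，非论文代码）：

   ```python
   def iou(box_a, box_b):
       # 框格式 (x1, y1, x2, y2)，返回交并比 IoU
       ax1, ay1, ax2, ay2 = box_a
       bx1, by1, bx2, by2 = box_b
       ix1, iy1 = max(ax1, bx1), max(ay1, by1)
       ix2, iy2 = min(ax2, bx2), min(ay2, by2)
       inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
       area_a = (ax2 - ax1) * (ay2 - ay1)
       area_b = (bx2 - bx1) * (by2 - by1)
       return inter / (area_a + area_b - inter)

   def acc_at_05(preds, gts):
       # Acc@0.5：预测框与真值框 IoU >= 0.5 的比例
       hits = sum(iou(p, g) >= 0.5 for p, g in zip(preds, gts))
       return hits / len(gts)

   preds = [(0, 0, 10, 10), (0, 0, 2, 2)]
   gts = [(1, 1, 11, 11), (5, 5, 9, 9)]
   print(acc_at_05(preds, gts))  # → 0.5：第一对命中，第二对无重叠
   ```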

3. [Weakly-Supervised Referring Video Object Segmentation through Text Supervision](https://arxiv.org/abs/2604.17797v1)
   - Published：2026-04-20 12:38
   - 作者：Miaojing Shi，Jun Huang，Zijie Yue，Hanli Wang
   - 来源：arxiv
   - 相关性分数：76
   - 命中原因：title matched "segmentation"; summary matched "multimodal"; has PDF; has rich summary
   - 分类：cs.CV
   - 标签：数据 / 应用 / 方法
   - 主题词：Language Model / Alignment
   - PDF：https://arxiv.org/pdf/2604.17797v1
   - 摘要：Referring video object segmentation (RVOS) aims to segment the target instance in a video, referred to by a text expression. Conventional approaches are mostly based on supervised learning, requiring expensive pixel-level mask annotations. To tackle this, weakly-supervised RVOS has recently been proposed to replace mask annotations with bounding boxes or points, which are however still costly and labor-intensive. In this paper, we design a novel weakly-supervised RVOS method, namely WSRVOS, to train the model with only text expressions. Given an input video and the referring expression, we first design a contrastive referring expression augmentation scheme that leverages the captioning capabilities of a multimodal large language model to generate both positive and negative expressions. We extract visual and linguistic features from the input video and the generated expressions, then perform bi-directional vision-language feature selection and interaction to enable fine-grained multimodal alignment. Next, we propose an instance-aware expression classification scheme to optimize the model in distinguishing positive from negative expressions. Also, we introduce a positive-prediction fusion strategy to generate high-quality pseudo-masks, which serve as additional supervision for the model. Last, we design a temporal segment ranking constraint such that the overlaps between mask predictions of temporally neighboring frames are required to conform to specific orders. Extensive experiments on four publicly available RVOS datasets, including A2D Sentences, J-HMDB Sentences, Ref-YouTube-VOS, and Ref-DAVIS17, demonstrate the superiority of our method. Code is available at https://github.com/viscom-tongji/WSRVOS.

4. [AnchorSeg: Language Grounded Query Banks for Reasoning Segmentation](https://arxiv.org/abs/2604.18562v1)
   - Published：2026-04-21 01:49
   - 作者：Rui Qian，Chuanhang Deng，Qiang Huang，Jian Xiong，Mingxuan Li，Yingbo Zhou 等
   - 来源：arxiv
   - 相关性分数：72
   - 命中原因：title matched "segmentation"; has PDF; has rich summary; has complete metadata
   - 分类：cs.CV
   - 标签：方法
   - 主题词：Reasoning / Alignment
   - PDF：https://arxiv.org/pdf/2604.18562v1
   - 摘要：Reasoning segmentation requires models to ground complex, implicit textual queries into precise pixel-level masks. Existing approaches rely on a single segmentation token `<SEG>`, whose hidden state implicitly encodes both semantic reasoning and spatial localization, limiting the model's ability to explicitly disentangle what to segment from where to segment. We introduce AnchorSeg, which reformulates reasoning segmentation as a structured conditional generation process over image tokens, conditioned on language-grounded query banks. Instead of compressing all semantic reasoning and spatial localization into a single embedding, AnchorSeg constructs an ordered sequence of query banks: latent reasoning tokens that capture intermediate semantic states, and a segmentation anchor token that provides explicit spatial grounding. We model spatial conditioning as a factorized distribution over image tokens, where the anchor query determines localization signals while contextual queries provide semantic modulation. To bridge token-level predictions and pixel-level supervision, we propose Token-Mask Cycle Consistency (TMCC), a bidirectional training objective that enforces alignment across resolutions. By explicitly decoupling spatial grounding from semantic reasoning through structured language-grounded query banks, AnchorSeg achieves state-of-the-art results on the ReasonSeg test set (67.7% gIoU and 68.1% cIoU). All code and models are publicly available at https://github.com/rui-qian/AnchorSeg.

5. [UDM-GRPO: Stable and Efficient Group Relative Policy Optimization for Uniform Discrete Diffusion Models](https://arxiv.org/abs/2604.18518v1)
   - Published：2026-04-21 01:16
   - 作者：Jiaqi Wang，Haoge Deng，Ting Pan，Yang Liu，Chengyuan Wang，Fan Zhang 等
   - 来源：arxiv
   - 相关性分数：71
   - 命中原因：title matched "diffusion"; has PDF; has rich summary; has complete metadata
   - 分类：cs.CV, cs.LG
   - 标签：评测 / 方法
   - 主题词：Benchmark / Diffusion
   - PDF：https://arxiv.org/pdf/2604.18518v1
   - 摘要：Uniform Discrete Diffusion Model (UDM) has recently emerged as a promising paradigm for discrete generative modeling; however, its integration with reinforcement learning remains largely unexplored. We observe that naively applying GRPO to UDM leads to training instability and marginal performance gains. To address this, we propose UDM-GRPO, the first framework to integrate UDM with RL. Our method is guided by two key insights: (i) treating the final clean sample as the action provides more accurate and stable optimization signals; and (ii) reconstructing trajectories via the diffusion forward process better aligns probability paths with the pretraining distribution. Additionally, we introduce two strategies, Reduced-Step and CFG-Free, to further improve training efficiency. UDM-GRPO significantly improves base model performance across multiple T2I tasks. Notably, GenEval accuracy improves from 69% to 96% and PickScore increases from 20.46 to 23.81, achieving state-of-the-art performance in both continuous and discrete settings. On the OCR benchmark, accuracy rises from 8% to 57%, further validating the generalization ability of our method. Code is available at https://github.com/Yovecent/UDM-GRPO.
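   摘要中 GRPO（Group Relative Policy Optimization）的核心步骤，是对同一条件下一组采样的奖励做组内标准化，得到相对优势。以下为该标准化步骤的最小示意（省略策略比率裁剪、KL 约束等细节，非论文实现）：

   ```python
   import math

   def group_relative_advantages(rewards):
       # GRPO 的组内标准化：优势 = (r - 组均值) / 组标准差
       mean = sum(rewards) / len(rewards)
       var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
       std = math.sqrt(var) or 1.0  # 全组奖励相同时标准差为 0，用 1.0 防止除零
       return [(r - mean) / std for r in rewards]

   advs = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
   print(advs)  # → [1.0, -1.0, 1.0, -1.0]：高于组均值的样本优势为正
   ```

   这样得到的优势无需单独的价值网络，直接用组内比较代替基线估计，这正是 GRPO 相对 PPO 的简化之处。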

6. [Revisiting Change VQA in Remote Sensing with Structured and Native Multimodal Qwen Models](https://arxiv.org/abs/2604.18429v1)
   - Published：2026-04-20 23:47
   - 作者：Yakoub Bazi，Mohamad M. Al Rahhal，Mansour Zuair，Faroun Mohamed
   - 来源：arxiv
   - 相关性分数：70
   - 命中原因：title matched "multimodal"; has PDF; has rich summary; has complete metadata
   - 分类：cs.CV, cs.AI
   - 标签：评测 / 方法
   - 主题词：Benchmark / Language Model
   - PDF：https://arxiv.org/pdf/2604.18429v1
   - 摘要：Change visual question answering (Change VQA) addresses the problem of answering natural-language questions about semantic changes between bi-temporal remote sensing (RS) images. Although vision-language models (VLMs) have recently been studied for temporal RS image understanding, Change VQA remains underexplored in the context of modern multimodal models. In this letter, we revisit the CDVQA benchmark using recent Qwen models under a unified low-rank adaptation (LoRA) setting. We compare Qwen3-VL, which follows a structured vision-language pipeline with multi-depth visual conditioning and a full-attention decoder, with Qwen3.5, a native multimodal model that combines a single-stage alignment with a hybrid decoder backbone. Experimental results on the official CDVQA test splits show that recent VLMs improve over earlier specialized baselines. They further show that performance does not scale monotonically with model size, and that native multimodal models are more effective than structured vision-language pipelines for this task. These findings indicate that tightly integrated multimodal backbones contribute more to performance than scale or explicit multi-depth visual conditioning for language-driven semantic change reasoning in RS imagery.

7. [One-Step Diffusion with Inverse Residual Fields for Unsupervised Industrial Anomaly Detection](https://arxiv.org/abs/2604.18393v1)
   - Published：2026-04-20 23:16
   - 作者：Boan Zhang，Wen Li，Guanhua Yu，Xiyang Liu，Wenchao Chen，Long Tian
   - 来源：arxiv
   - 相关性分数：69
   - 命中原因：title matched "diffusion"; has PDF; has rich summary; has complete metadata
   - 分类：cs.CV
   - 标签：评测 / 方法
   - 主题词：Benchmark / Diffusion
   - PDF：https://arxiv.org/pdf/2604.18393v1
   - 摘要：Diffusion models have achieved outstanding performance in unsupervised industrial anomaly detection (uIAD) by learning a manifold of normal data under the common assumption that off-manifold anomalies are harder to generate, resulting in larger reconstruction errors in data space or lower probability densities in the tractable latent space. However, their iterative denoising and noising nature leads to slow inference. In this paper, we propose OSD-IRF, a novel one-step diffusion with inverse residual fields, to address this limitation for the uIAD task. We first train a denoising diffusion probabilistic model (DDPM) on normal data without any conditioning. Then, for a test sample, we predict its inverse residual fields (IRF) based on the noise estimated by the well-trained parametric noise function of the DDPM. Finally, uIAD is performed by evaluating the probability density of the IRF under a Gaussian distribution and comparing it with a threshold. Our key observation is that anomalies become distinguishable in this IRF space, a finding that has seldom been reported in prior works. Moreover, OSD-IRF requires only single-step diffusion for uIAD, thanks to the property that the IRF holds for any neighboring time step in the denoising process. Extensive experiments on three widely used uIAD benchmarks show that our model achieves SOTA or competitive performance across six metrics, along with roughly a 2x inference speedup without distillation.
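   摘要描述的判别步骤是「在高斯分布下评估 IRF 的概率密度并与阈值比较」。以下用一元高斯的对数密度给出最简化示意（均值、方差与阈值均为假设值，实际的 IRF 是高维场，这里只取其一个分量作说明）：

   ```python
   import math

   def gaussian_logpdf(x, mu=0.0, sigma=1.0):
       # 一元高斯分布的对数概率密度
       return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)

   def is_anomaly(residual, mu=0.0, sigma=1.0, log_threshold=-5.0):
       # 残差在「正常」高斯分布下的对数密度低于阈值则判为异常
       # （mu/sigma/阈值均为假设值，应由正常样本的 IRF 统计量确定）
       return gaussian_logpdf(residual, mu, sigma) < log_threshold

   print(is_anomaly(0.1))  # → False：接近正常分布中心
   print(is_anomaly(4.0))  # → True：远离中心，密度极低
   ```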

8. [DSA-CycleGAN: A Domain Shift Aware CycleGAN for Robust Multi-Stain Glomeruli Segmentation](https://arxiv.org/abs/2604.18368v1)
   - Published：2026-04-20 22:57
   - 作者：Zeeshan Nisar，Friedrich Feuerhake，Thomas Lampert
   - 来源：arxiv
   - 相关性分数：69
   - 命中原因：title matched "segmentation"; has PDF; has rich summary; has complete metadata
   - 分类：cs.CV
   - 标签：方法
   - 主题词：Segmentation
   - PDF：https://arxiv.org/pdf/2604.18368v1
   - 摘要：A key challenge in segmentation for digital histopathology is inter- and intra-stain variation, which reduces model performance. Labelling each stain is expensive and time-consuming, so methods using stain transfer via CycleGAN have been developed for training multi-stain segmentation models using labels from a single stain. Nevertheless, CycleGAN tends to introduce noise during translation because of the one-to-many nature of some stain pairs, which conflicts with its cycle consistency loss. To address this, we propose the Domain Shift Aware CycleGAN (DSA-CycleGAN), which reduces the presence of such noise. Furthermore, we evaluate several advances from the field of machine learning aimed at resolving similar problems and compare their effectiveness against DSA-CycleGAN in the context of multi-stain glomeruli segmentation. Experiments demonstrate that DSA-CycleGAN not only improves glomeruli segmentation performance but also outperforms other methods in reducing noise. This is particularly evident when translating between biologically distinct stains. The code is publicly available at https://github.com/zeeshannisar/DSA-CycleGAN.

9. [OmniHuman: A Large-scale Dataset and Benchmark for Human-Centric Video Generation](https://arxiv.org/abs/2604.18326v1)
   - Published：2026-04-20 22:28
   - 作者：Lei Zhu，Xing Cai，Yingjie Chen，Yiheng Li，Binxin Yang，Hao Liu 等
   - 来源：arxiv
   - 相关性分数：68
   - 命中原因：title matched "video generation"; has PDF; has rich summary; has complete metadata
   - 分类：cs.CV
   - 标签：评测 / 数据 / 应用 / 方法
   - 主题词：Benchmark / Evaluation
   - PDF：https://arxiv.org/pdf/2604.18326v1
   - 摘要：Recent advancements in audio-video joint generation models have demonstrated impressive capabilities in content creation. However, generating high-fidelity human-centric videos in complex, real-world physical scenes remains a significant challenge. We identify that the root cause lies in the structural deficiencies of existing datasets across three dimensions: limited global scene and camera diversity, sparse interaction modeling (both person-person and person-object), and insufficient individual attribute alignment. To bridge these gaps, we present OmniHuman, a large-scale, multi-scene dataset designed for fine-grained human modeling. OmniHuman provides a hierarchical annotation covering video-level scenes, frame-level interactions, and individual-level attributes. To facilitate this, we develop a fully automated pipeline for high-quality data collection and multi-modal annotation. Complementary to the dataset, we establish the OmniHuman Benchmark (OHBench), a three-level evaluation system that provides a scientific diagnosis for human-centric audio-video synthesis. Crucially, OHBench introduces metrics that are highly consistent with human perception, filling the gaps in existing benchmarks by providing a comprehensive diagnosis across global scenes, relational interactions, and individual attributes.

10. [Denoise and Align: Diffusion-Driven Foreground Knowledge Prompting for Open-Vocabulary Temporal Action Detection](https://arxiv.org/abs/2604.18313v1)
   - Published：2026-04-20 22:18
   - 作者：Sa Zhu，Wanqian Zhang，Lin Wang，Jinchao Zhang，Cong Wang，Bo Li
   - 来源：arxiv
   - 相关性分数：68
   - 命中原因：title matched "diffusion"; has PDF; has rich summary; has complete metadata
   - 分类：cs.CV
   - 标签：评测 / 应用 / 方法
   - 主题词：Benchmark / Alignment
   - PDF：https://arxiv.org/pdf/2604.18313v1
   - 摘要：Open-Vocabulary Temporal Action Detection (OV-TAD) aims to localize and classify action segments of unseen categories in untrimmed videos, where effective alignment between action semantics and video representations is critical for accurate detection. However, existing methods struggle to mitigate the semantic imbalance between concise, abstract action labels and rich, complex video contents, inevitably introducing semantic noise and misleading cross-modal alignment. To address this challenge, we propose DFAlign, the first framework that leverages diffusion-based denoising to generate foreground knowledge for the guidance of action-video alignment. Following a 'conditioning, denoising, and aligning' paradigm, we first introduce the Semantic-Unify Conditioning (SUC) module, which unifies action-shared and action-specific semantics as conditions for diffusion denoising. Then, the Background-Suppress Denoising (BSD) module generates foreground knowledge by progressively removing background redundancy from videos through the denoising process. This foreground knowledge serves as an effective intermediate semantic anchor between video and text representations, mitigating the semantic gap and enhancing the discriminability of action-relevant segments. Furthermore, we introduce the Foreground-Prompt Alignment (FPA) module to inject extracted foreground knowledge as prompt tokens into text representations, guiding the model's attention towards action-relevant segments and enabling precise cross-modal alignment. Extensive experiments demonstrate that our method achieves state-of-the-art performance on two OV-TAD benchmarks. The code is available at https://anonymous.4open.science/r/Code-2114/.

## PubMed AI 观察

### 本组速览

- 《Transforming oncology clinical trial matching through neuro-symbolic, multi-agent AI and an oncology-specific knowledge graph: a prospective evaluation in 3804 patients.》〔评测 / 数据 / 应用 / 方法〕：BACKGROUND: Clinical trial enrollment in oncology remains critically low, with fewer than 5% of eligible adults participating, in large part due to the complex…
- 《Developing and evaluating definitions of real-world clinical endpoints for patients with early-stage triple-negative breast cancer using a United States of America secondary database.》〔评测 / 应用 / 方法〕：BACKGROUND: The KEYNOTE-522 trial showed that neoadjuvant chemotherapy (NAC) plus adjuvant pembrolizumab improved overall survival, event-free survival (EFS),…
- 《Investigating fine-tuning versus zero-shot learning for general large language models when predicting cancer survival from initial oncology consultation documents.》〔评测 / 数据 / 应用 / 方法〕：BACKGROUND: Unstructured oncology consultation notes contain rich clinical information that may support survival prediction. Open-weight large language models…

### 论文速览

1. [Transforming oncology clinical trial matching through neuro-symbolic, multi-agent AI and an oncology-specific knowledge graph: a prospective evaluation in 3804 patients.](https://pubmed.ncbi.nlm.nih.gov/42004487/)
   - Entered：2026-04-20 15:29
   - 作者：A Loaiza-Bonilla，C Yost，S Kurnaz，E Tuysuz，N G Thaker，D Giritlioglu 等
   - 来源：pubmed
   - 相关性分数：101
   - 命中原因：title matched "clinical"; summary matched "language model"; summary matched "benchmark"; has DOI
   - 分类：Journal Article
   - 标签：评测 / 数据 / 应用 / 方法
   - 主题词：Benchmark / Language Model
   - 摘要：BACKGROUND: Clinical trial enrollment in oncology remains critically low, with fewer than 5% of eligible adults participating, in large part due to the complexity and labor intensity of eligibility screening. We prospectively evaluated a neuro-symbolic, multi-agent artificial intelligence (AI) platform integrating domain-specific large language model (LLM) agents, an oncology-specific knowledge graph, a real-time recommendation engine, and human-in-the-loop review to determine whether automated extraction and reasoning can safely improve trial identification, efficiency, and equity at scale. METHODS: Consecutive patients [N = 3804; Eastern Cooperative Oncology Group (ECOG) 0-2] balanced for cancer type incidence with metastatic or progressive malignancies were screened across a 12-month period. A multi-agent architecture (OncoAgents, LLM-based extraction and reasoning agents; OncoGraph, an oncology knowledge graph; OncoRecommend, a prioritization engine; and OncoSet, an expert-curated corpus) carried out automated data extraction, harmonization, and trial matching over 157 367 clinical pages (∼86.5 M tokens). Dual oncologists produced a gold standard of trial eligibility labels (Cohen's κ = 0.92). The primary unit of analysis was the patient-trial pair. Baselines included manual screening, GPT-4 zero-shot prompting, GPT-4 chain-of-thought, and frontier GPT-4o extraction/matching benchmarks. Outcomes included sensitivity, specificity, precision, F1 score, calibration of eligibility confidence scores, time-to-recommendation, fairness across demographic subgroups, and operational burden. RESULTS: The multi-agent neuro-symbolic system achieved an F1 score of 0.82 (95% confidence interval 0.81-0.83). In comparison, the GPT-4 zero-shot baseline achieved an F1 of 0.47, and the GPT-4 chain-of-thought baseline achieved an F1 of 0.67. Per-patient screening time decreased from a median of 120 min (manual review) to ∼30 min total (15 min automated processing + 15 min clinical review). Across the cohort, the system processed 157 000 pages, screened 23 912 candidate patient-trial pairs, and produced 17 912 oncologist-confirmed matches, with median time-to-recommendation <7 days. No demographic subgroup exceeded a 10-percentage point F1 gap; the largest observed difference was ∼7 points between white and black/African American patients. Ablation experiments showed that both knowledge graph grounding and multi-agent decomposition contributed materially to performance and efficiency. Eligibility confidence scores exhibited reasonable calibration in the clinically relevant operating range. CONCLUSIONS: A neuro-symbolic, multi-agent architecture that couples LLM-based extraction with ontology-grounded, deterministic eligibility reasoning improved the accuracy, throughput, and timeliness of oncology clinical trial matching versus LLM-only baselines, while preserving clinician oversight and maintaining modest subgroup performance gaps. These results support scalable, equity-aware deployment of AI-assisted trial screening in routine oncology practice.
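   摘要报告了 sensitivity、specificity、precision 与 F1 等指标。这些都可由混淆矩阵计数按通用公式计算，示意如下（计数为虚构示例，非论文数据）：

   ```python
   def classification_metrics(tp, fp, fn, tn):
       # 由混淆矩阵计数计算常用二分类指标（通用公式，非论文专有实现）
       sensitivity = tp / (tp + fn)   # 灵敏度 / 召回率
       specificity = tn / (tn + fp)
       precision = tp / (tp + fp)
       f1 = 2 * precision * sensitivity / (precision + sensitivity)
       return {"sensitivity": sensitivity, "specificity": specificity,
               "precision": precision, "f1": f1}

   m = classification_metrics(tp=80, fp=20, fn=20, tn=880)
   print(round(m["f1"], 2))  # → 0.8
   ```

   F1 是 precision 与 sensitivity 的调和平均，对两者失衡较敏感，因此常与 specificity 一同报告。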

2. [Developing and evaluating definitions of real-world clinical endpoints for patients with early-stage triple-negative breast cancer using a United States of America secondary database.](https://pubmed.ncbi.nlm.nih.gov/42004488/)
   - Entered：2026-04-20 15:29
   - 作者：X Hu，J R Earla，G I Cruz，H Mohammed，T Privette，A Hernandez 等
   - 来源：pubmed
   - 相关性分数：83
   - 命中原因：title matched "clinical"; summary matched "benchmark"; has DOI; has rich summary
   - 分类：Journal Article
   - 标签：评测 / 应用 / 方法
   - 主题词：Benchmark / Clinical
   - 摘要：BACKGROUND: The KEYNOTE-522 trial showed that neoadjuvant chemotherapy (NAC) plus adjuvant pembrolizumab improved overall survival, event-free survival (EFS), and pathological complete response (pCR) in high-risk early-stage triple-negative breast cancer. As treatments evolve, evaluating real-world (RW) effectiveness is key to understanding trial generalizability. This study benchmarked RW efficacy endpoints in early-stage triple-negative breast cancer patients treated with NAC. MATERIALS AND METHODS: This retrospective study used RW data from United States community practices abstracted by oncology data specialists. Eligible patients received NAC regimens similar to the KEYNOTE-522 control arm. Control arm patients received no adjuvant therapy, while the RW cohort could receive adjuvant capecitabine. Two RW endpoints were assessed: rwpCR (ypT0/Tis ypN0 or clinical pCR with in situ disease) and rwEFS (time from NAC start to first event). Outcomes were compared with those from the KEYNOTE-522 control arm. RESULTS: In the RW cohort (n = 128), rwpCR was 37.5%, compared with 51.2% pCR in the KEYNOTE-522 control arm (n = 390). rwEFS over 36 months was comparable: 75.0% (95% confidence interval 67.1% to 83.8%) in the RW cohort versus 76.8% (95% confidence interval 72.2% to 80.7%) in the KEYNOTE-522 control arm (log-rank P = 0.97). The incidence rates of first events were also similar (22.0% RW versus 23.8% trial). CONCLUSIONS: Although pCR rates were higher in the KEYNOTE-522 control arm compared with the RW cohort, rwEFS was comparable with EFS in the KEYNOTE-522 control arm. This study highlights the value of combining structured data with custom abstraction to assess RW endpoints and support future research.

3. [Investigating fine-tuning versus zero-shot learning for general large language models when predicting cancer survival from initial oncology consultation documents.](https://pubmed.ncbi.nlm.nih.gov/42004490/)
   - Entered: 2026-04-20 15:29
   - Authors: T Phaterpekar, Z Zeng, Y Mali, B Leung, C Ho, R T Ng et al.
   - Source: pubmed
   - Relevance score: 83
   - Match reasons: title matched "language model"; summary matched "clinical"; has DOI; has rich summary
   - Category: Journal Article
   - Tags: evaluation / data / application / method
   - Topics: Language Model / Clinical
   - Abstract: BACKGROUND: Unstructured oncology consultation notes contain rich clinical information that may support survival prediction. Open-weight large language models (LLMs) can utilize these notes with zero-shot inference or fine-tuning, but their relative value for this setting remains unclear. The objective of this study is to evaluate open-weight LLMs for predicting 60-month survival from initial oncology consultation notes, comparing (i) zero-shot performance, (ii) performance after fine-tuning, and (iii) smaller natural language processing models trained on the same dataset in prior work. MATERIALS AND METHODS: We used Meta's Llama models to predict patients' 60-month survival using oncology consultation notes from a dataset of 59,800 patients. We tested both zero-shot and fine-tuning approaches. Metrics included balanced accuracy (BA) and weighted F1. RESULTS: Zero-shot performance was limited. Llama-2-13B performed best among the zero-shot configurations (average performance across prompts: BA 0.596, weighted F1 0.644; performance on Prompt 4: BA 0.766, weighted F1 0.802). Fine-tuning improved performance across models: Llama-2-13B achieved BA 0.842, weighted F1 0.846, area under the receiver operating characteristic curve (AUC) 0.905; Llama-2-7B achieved BA 0.840, weighted F1 0.843, AUC 0.911; Llama-3.1-8B achieved BA 0.829, weighted F1 0.829, AUC 0.881. Performance was numerically similar to smaller models trained on the same task and data. CONCLUSIONS: For predicting 60-month survival from initial oncology consultation documents, fine-tuning open-weight LLMs meaningfully improves performance compared with zero-shot use, but does not consistently outperform smaller language models. This may suggest that both fine-tuned LLMs and smaller models merit continued investigation, with the most appropriate approach likely to depend on the outcome of interest, clinical context, and practical considerations such as hardware, privacy, and deployment feasibility.
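   The two headline metrics above are easy to recompute from first principles; a minimal pure-Python sketch of balanced accuracy (mean per-class recall) and support-weighted F1, shown on hypothetical labels rather than the paper's data:

   ```python
   def balanced_accuracy(y_true, y_pred):
       """Unweighted mean of per-class recall."""
       classes = sorted(set(y_true))
       recalls = []
       for c in classes:
           tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
           support = sum(1 for t in y_true if t == c)
           recalls.append(tp / support)
       return sum(recalls) / len(classes)

   def weighted_f1(y_true, y_pred):
       """Per-class F1, averaged with class-support weights."""
       classes = sorted(set(y_true))
       total, score = len(y_true), 0.0
       for c in classes:
           tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
           fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
           fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
           precision = tp / (tp + fp) if tp + fp else 0.0
           recall = tp / (tp + fn) if tp + fn else 0.0
           f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
           score += f1 * sum(1 for t in y_true if t == c) / total
       return score
   ```

   Balanced accuracy is the natural headline metric here because 60-month survival labels are class-imbalanced, which plain accuracy would mask.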

4. [A Comparative Evaluation of Three Large Language Models for Parent-Centered Questions About Anorexia Nervosa.](https://pubmed.ncbi.nlm.nih.gov/42003757/)
   - Entered: 2026-04-20 14:53
   - Authors: Celal Yeşilkaya, Hande Kırışman Keleş, Esra Rabia Taşpolat, Caner Mutlu, Serkan Turan
   - Source: pubmed
   - Relevance score: 83
   - Match reasons: title matched "language model"; summary matched "clinical"; has DOI; has rich summary
   - Category: Journal Article
   - Tags: evaluation / method
   - Topics: Language Model / Evaluation
   - Abstract: BACKGROUND: Large language models (LLMs) are increasingly used to obtain health information, including guidance on child and adolescent mental health. In anorexia nervosa (AN), where early recognition and timely intervention are critical, the accuracy of AI-generated information available to parents may have important clinical implications. This study evaluated the performance of LLMs in responding to parent-oriented questions about AN. METHODS: A comparative model evaluation was conducted using three conversational AI systems: ChatGPT (GPT-4o), Google Gemini, and DeepSeek. Twenty questions representative of those frequently asked by parents of adolescents with AN were identified through online content exploration and expert review. Each question was submitted using standardized prompts in separate chat sessions. Responses were anonymized and independently evaluated by two board-certified child and adolescent psychiatrists across three dimensions: quality, usefulness, and reliability. Reproducibility was assessed through repeated queries in separate sessions. RESULTS: All three models demonstrated generally high levels of reliability and overall informational performance. ChatGPT achieved the highest overall accuracy (≈92%) and reproducibility (≈90%), followed by Gemini (≈88% accuracy; ≈85% reproducibility) and DeepSeek (≈86% accuracy; ≈83% reproducibility). Domain-level analysis showed lower accuracy across models for diagnosis and clinical assessment questions. Qualitative error analysis indicated that the omission of clinically relevant information was the most common limitation, while DeepSeek produced more factual inaccuracies and Gemini more generalized guidance. CONCLUSIONS: LLMs may provide broadly accurate preliminary information for parents seeking guidance about AN. However, persistent omissions and domain-specific variability highlight important limitations. AI-generated information should therefore be regarded as a complementary resource rather than a replacement for professional clinical guidance.

5. [Impacts of Multidisciplinary Lung Cancer Meeting Presentation in a Clinical Quality Registry.](https://pubmed.ncbi.nlm.nih.gov/42006279/)
   - Entered: 2026-04-20 15:46
   - Authors: Rob G Stirling, Sanuki Tissera, Jessie Zeng, Mike Lloyd, Krupa Krishnaprasad, Lisa Briggs et al.
   - Source: pubmed
   - Relevance score: 66
   - Match reasons: title matched "clinical"; has DOI; has rich summary; has complete metadata
   - Category: Journal Article
   - Tags: method
   - Topics: Clinical
   - Abstract: BACKGROUND: Lung cancer is a heterogeneous and complex disease requiring multidisciplinary input for optimal management planning, with guidelines recommending that all patients be discussed in a multidisciplinary setting. Multidisciplinary meeting (MDM) discussion aims to enhance evidence-based management, improve treatment access, and optimize complex management plans. METHODS: We aimed to assess the extent and impacts of MDM discussion in patients with lung cancer described by the Victorian Lung Cancer Registry from 2011 to 2023 in Victoria, Australia. We identified MDM-presented and nonpresented patients and assessed the impacts of MDM presentation. OR and survival hazard ratios were assessed using Cox proportional regression analysis. Survival analysis was determined using the Kaplan-Meier product-limit method. Sensitivity analyses were conducted using landmark analysis and propensity score matching methods. RESULTS: A total of 18,597 patients were included, of whom 67% had evidence of presentation to a lung cancer MDM, with MDM presentation increasing from 59.1% to 80.6% during the study period. MDM presentation was associated with higher levels of provision of guideline-concordant treatment in NSCLC (56.2% versus 44.5%, p < 0.001), and lower levels of no treatment (10.0% versus 21.4%, p < 0.001). Modifiable factors that could increase MDM presentation include referral of patients of older age, stage IV disease, SCLC, and diagnosis at a private or regional hospital. Propensity-matched survival analysis in NSCLC revealed a median survival of 1.1 years for MDM-presented versus 0.86 years for nonpresented individuals, providing a 12% reduction in mortality hazard (hazard ratio 0.88 [0.82-0.95], p < 0.001). CONCLUSION: During the period of activity of the Victorian Lung Cancer Registry, MDM presentation increased from 59.1% to 80.6%. Management outcomes in MDM-presented patients identified multiple underserved cohorts and revealed considerable increases in treatment modalities and guideline-concordant treatment in NSCLC in this observational study, with an associated 12% improvement in survival advantage overall.
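   The survival analysis above relies on the Kaplan-Meier product-limit method; a minimal pure-Python sketch of that estimator on hypothetical follow-up data (the registry data itself is not available here):

   ```python
   def kaplan_meier(times, events):
       """Kaplan-Meier product-limit estimator.

       times:  follow-up time for each patient
       events: 1 if the event (e.g. death) was observed, 0 if censored
       Returns a list of (event_time, survival_probability) step points.
       """
       order = sorted(range(len(times)), key=lambda i: times[i])
       at_risk = len(times)
       surv, curve = 1.0, []
       i = 0
       while i < len(order):
           t = times[order[i]]
           deaths = removed = 0
           # Group all patients sharing this follow-up time; censored ties
           # are conventionally still at risk for events at the same time.
           while i < len(order) and times[order[i]] == t:
               deaths += events[order[i]]
               removed += 1
               i += 1
           if deaths:
               surv *= 1 - deaths / at_risk   # product-limit update
               curve.append((t, surv))
           at_risk -= removed
       return curve
   ```

   In practice the abstract's full analysis (Cox regression, log-rank comparison, propensity matching) would typically lean on a library such as lifelines rather than a hand-rolled estimator; the sketch only shows the product-limit step the abstract names.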

## OpenAlex AI Watch

No new matching papers today.
