# Daily Paper Digest

- Generated: 2026-04-17 11:39:21 (Asia/Shanghai)
- Search window: last 24 hours
- Hit overview: LLM=15, Vision=9, PubMed AI=5, OpenAlex AI=0
- Ranking strategy: hybrid (relevance first, published_at tie-break)
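The hybrid ordering above can be sketched as a two-key sort. This is a minimal illustration, assuming each hit is a dict with `relevance` and `published_at` fields (the field names mirror this report; the record layout is an assumption, and the newest-first tie-break is inferred from how equal-score entries appear below):

```python
from operator import itemgetter

# Toy records in the report's field vocabulary ("relevance",
# "published_at"); titles and values are made up for the demo.
papers = [
    {"title": "A", "relevance": 107, "published_at": "2026-04-17 00:55"},
    {"title": "B", "relevance": 107, "published_at": "2026-04-17 00:53"},
    {"title": "C", "relevance": 130, "published_at": "2026-04-17 01:40"},
]

# Relevance first (descending); equal scores fall back to
# published_at, newest first. ISO-style timestamps compare
# correctly as plain strings, so one reverse sort suffices.
ranked = sorted(papers, key=itemgetter("relevance", "published_at"), reverse=True)
print([p["title"] for p in ranked])  # ['C', 'A', 'B']
```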

## Today's Highlights

- Topic "Benchmark": 14 hits across LLM, Vision, and other groups; representative papers include "CoopEval: Benchmarking Cooperation-Sustaining Mechanisms and LLM Agents in Social Dilemmas" and "IE as Cache: Information Extraction Enhanced Agentic Reasoning".
- Topic "Evaluation": 13 hits across LLM, Vision, and other groups; representative papers include "QuantCode-Bench: A Benchmark for Evaluating the Ability of Large Language Models to Generate Executable Algorithmic Trading Strategies" and "From Reactive to Proactive: Assessing the Proactivity of Voice Agents via ProVoice-Bench".
- Topic "Reasoning": 10 hits across LLM, Vision, and other groups; representative papers include "CoopEval: Benchmarking Cooperation-Sustaining Mechanisms and LLM Agents in Social Dilemmas" and "IE as Cache: Information Extraction Enhanced Agentic Reasoning".

## Topic Focus

### Benchmark

- Hits: 14
- Groups covered: LLM, Vision, PubMed AI
- Representative papers: "CoopEval: Benchmarking Cooperation-Sustaining Mechanisms and LLM Agents in Social Dilemmas", "IE as Cache: Information Extraction Enhanced Agentic Reasoning", "QuantCode-Bench: A Benchmark for Evaluating the Ability of Large Language Models to Generate Executable Algorithmic Trading Strategies"
- Quick reads:
  - "CoopEval: Benchmarking Cooperation-Sustaining Mechanisms and LLM Agents in Social Dilemmas" [evaluation / method]: It is increasingly important that LLM agents interact effectively and safely with other goal-pursuing agents, yet recent works report the opposite trend: LLMs…
  - "IE as Cache: Information Extraction Enhanced Agentic Reasoning" [evaluation / application / method]: Information Extraction aims to distill structured, decision-relevant information from unstructured text, serving as a foundation for downstream understanding a…

### Evaluation

- Hits: 13
- Groups covered: LLM, Vision, PubMed AI
- Representative papers: "QuantCode-Bench: A Benchmark for Evaluating the Ability of Large Language Models to Generate Executable Algorithmic Trading Strategies", "From Reactive to Proactive: Assessing the Proactivity of Voice Agents via ProVoice-Bench", "An Axiomatic Benchmark for Evaluation of Scientific Novelty Metrics"
- Quick reads:
  - "QuantCode-Bench: A Benchmark for Evaluating the Ability of Large Language Models to Generate Executable Algorithmic Trading Strategies" [evaluation / method]: Large language models have demonstrated strong performance on general-purpose programming tasks, yet their ability to generate executable algorithmic trading s…
  - "From Reactive to Proactive: Assessing the Proactivity of Voice Agents via ProVoice-Bench" [evaluation / method]: Recent advancements in LLM agents are gradually shifting from reactive, text-based paradigms toward proactive, multimodal interaction. However, existing benchm…

### Reasoning

- Hits: 10
- Groups covered: LLM, Vision, PubMed AI
- Representative papers: "CoopEval: Benchmarking Cooperation-Sustaining Mechanisms and LLM Agents in Social Dilemmas", "IE as Cache: Information Extraction Enhanced Agentic Reasoning", "AI-Assisted Requirements Engineering: An Empirical Evaluation Relative to Expert Judgment"
- Quick reads:
  - "CoopEval: Benchmarking Cooperation-Sustaining Mechanisms and LLM Agents in Social Dilemmas" [evaluation / method]: It is increasingly important that LLM agents interact effectively and safely with other goal-pursuing agents, yet recent works report the opposite trend: LLMs…
  - "IE as Cache: Information Extraction Enhanced Agentic Reasoning" [evaluation / application / method]: Information Extraction aims to distill structured, decision-relevant information from unstructured text, serving as a foundation for downstream understanding a…

### Language Model

- Hits: 6
- Groups covered: LLM, Vision, PubMed AI
- Representative papers: "RadAgent: A tool-using AI agent for stepwise interpretation of chest computed tomography", "RaTA-Tool: Retrieval-based Tool Selection with Multimodal Large Language Models", "Applying natural language processing and large language models to clinical notes for phenotyping and diagnosing rare diseases: a systematic review."
- Quick reads:
  - "RadAgent: A tool-using AI agent for stepwise interpretation of chest computed tomography" [application / method]: Vision-language models (VLM) have markedly advanced AI-driven interpretation and reporting of complex medical imaging, such as computed tomography (CT). Yet, e…
  - "RaTA-Tool: Retrieval-based Tool Selection with Multimodal Large Language Models" [data / application / method]: Tool learning with foundation models aims to endow AI systems with the ability to invoke external resources -- such as APIs, computational utilities, and speci…

### Segmentation

- Hits: 5
- Groups covered: Vision
- Representative papers: "SegWithU: Uncertainty as Perturbation Energy for Single-Forward-Pass Risk-Aware Medical Image Segmentation", "Unsupervised Skeleton-Based Action Segmentation via Hierarchical Spatiotemporal Vector Quantization", "Boundary-Centric Active Learning for Temporal Action Segmentation"
- Quick reads:
  - "SegWithU: Uncertainty as Perturbation Energy for Single-Forward-Pass Risk-Aware Medical Image Segmentation" [application / method]: Reliable uncertainty estimation is critical for medical image segmentation, where automated contours feed downstream quantification and clinical decision suppo…
  - "Unsupervised Skeleton-Based Action Segmentation via Hierarchical Spatiotemporal Vector Quantization" [evaluation / method]: We propose a novel hierarchical spatiotemporal vector quantization framework for unsupervised skeleton-based temporal action segmentation. We first introduce a…

## LLM Watch

### Group at a Glance

- "CoopEval: Benchmarking Cooperation-Sustaining Mechanisms and LLM Agents in Social Dilemmas" [evaluation / method]: It is increasingly important that LLM agents interact effectively and safely with other goal-pursuing agents, yet recent works report the opposite trend: LLMs…
- "IE as Cache: Information Extraction Enhanced Agentic Reasoning" [evaluation / application / method]: Information Extraction aims to distill structured, decision-relevant information from unstructured text, serving as a foundation for downstream understanding a…
- "QuantCode-Bench: A Benchmark for Evaluating the Ability of Large Language Models to Generate Executable Algorithmic Trading Strategies" [evaluation / method]: Large language models have demonstrated strong performance on general-purpose programming tasks, yet their ability to generate executable algorithmic trading s…

### Paper Roundup

1. [CoopEval: Benchmarking Cooperation-Sustaining Mechanisms and LLM Agents in Social Dilemmas](https://arxiv.org/abs/2604.15267v1)
   - Published: 2026-04-17 01:40
   - Authors: Emanuel Tewolde, Xiao Zhang, David Guzman Piedrahita, Vincent Conitzer, Zhijing Jin
   - Source: arxiv
   - Relevance score: 130
   - Match reasons: title matched "agent"; title matched "benchmark"; summary matched "reasoning"; has PDF
   - Categories: cs.GT, cs.AI, cs.CL, cs.CY, cs.MA
   - Tags: evaluation / method
   - Topics: Benchmark / Reasoning
   - PDF: https://arxiv.org/pdf/2604.15267v1
   - Abstract: It is increasingly important that LLM agents interact effectively and safely with other goal-pursuing agents, yet recent works report the opposite trend: LLMs with stronger reasoning capabilities behave _less_ cooperatively in mixed-motive games such as the prisoner's dilemma and public goods settings. Indeed, our experiments show that recent models -- with or without reasoning enabled -- consistently defect in single-shot social dilemmas. To tackle this safety concern, we present the first comparative study of game-theoretic mechanisms that are designed to enable cooperative outcomes between rational agents _in equilibrium_. Across four social dilemmas testing distinct components of robust cooperation, we evaluate the following mechanisms: (1) repeating the game for many rounds, (2) reputation systems, (3) third-party mediators to delegate decision making to, and (4) contract agreements for outcome-conditional payments between players. Among our findings, we establish that contracting and mediation are most effective in achieving cooperative outcomes between capable LLM models, and that repetition-induced cooperation deteriorates drastically when co-players vary. Moreover, we demonstrate that these cooperation mechanisms become _more effective_ under evolutionary pressures to maximize individual payoffs.
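The single-shot defection the abstract reports follows the textbook logic of the prisoner's dilemma. A minimal sketch with the classic payoff ordering T > R > P > S (the numeric payoffs are illustrative, not taken from the paper):

```python
# One-shot prisoner's dilemma payoffs for the row player,
# keyed by (my_move, opponent_move). "C" = cooperate, "D" = defect.
PAYOFF = {
    ("C", "C"): 3,  # mutual cooperation (R)
    ("C", "D"): 0,  # sucker's payoff (S)
    ("D", "C"): 5,  # temptation to defect (T)
    ("D", "D"): 1,  # mutual defection (P)
}

def best_response(opponent_move):
    """Return the move that maximizes my payoff against a fixed opponent move."""
    return max(["C", "D"], key=lambda m: PAYOFF[(m, opponent_move)])

# Defection is the best response to either move, so it is dominant:
# a payoff-maximizing agent defects without a cooperation mechanism.
print(best_response("C"), best_response("D"))  # D D
```

The mechanisms the paper compares (repetition, reputation, mediation, contracts) all work by changing this payoff structure so that cooperation becomes an equilibrium.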

2. [IE as Cache: Information Extraction Enhanced Agentic Reasoning](https://arxiv.org/abs/2604.14930v1)
   - Published: 2026-04-16 20:18
   - Authors: Hang Lv, Sheng Liang, Hongchao Gu, Wei Guo, Defu Lian, Yong Liu, et al.
   - Source: arxiv
   - Relevance score: 124
   - Match reasons: title matched "agent"; title matched "reasoning"; summary matched "benchmark"; has PDF
   - Categories: cs.CL
   - Tags: evaluation / application / method
   - Topics: Benchmark / Reasoning
   - PDF: https://arxiv.org/pdf/2604.14930v1
   - Abstract: Information Extraction aims to distill structured, decision-relevant information from unstructured text, serving as a foundation for downstream understanding and reasoning. However, it is traditionally treated merely as a terminal objective: once extracted, the resulting structure is often consumed in isolation rather than maintained and reused during multi-step inference. Moving beyond this, we propose *IE-as-Cache*, a framework that repurposes IE as a cognitive cache to enhance agentic reasoning. Drawing inspiration from hierarchical computer memory, our approach combines query-driven extraction with cache-aware reasoning to dynamically maintain compact intermediate information and filter noise. Experiments on challenging benchmarks across diverse LLMs demonstrate significant improvements in reasoning accuracy, indicating that IE can be effectively repurposed as a reusable cognitive resource and offering a promising direction for future research on downstream uses of IE.

3. [QuantCode-Bench: A Benchmark for Evaluating the Ability of Large Language Models to Generate Executable Algorithmic Trading Strategies](https://arxiv.org/abs/2604.15151v1)
   - Published: 2026-04-16 23:31
   - Authors: Alexey Khoroshilov, Alexey Chernysh, Orkhan Ekhtibarov, Nini Kamkia, Dmitry Zmitrovich
   - Source: arxiv
   - Relevance score: 123
   - Match reasons: title matched "benchmark"; summary matched "agent"; summary matched "alignment"; summary matched "evaluation"
   - Categories: cs.CL
   - Tags: evaluation / method
   - Topics: Benchmark / Evaluation
   - PDF: https://arxiv.org/pdf/2604.15151v1
   - Abstract: Large language models have demonstrated strong performance on general-purpose programming tasks, yet their ability to generate executable algorithmic trading strategies remains underexplored. Unlike standard code benchmarks, trading-strategy generation requires simultaneous mastery of domain-specific financial logic, knowledge of a specialized API, and the ability to produce code that is not only syntactically correct but also leads to actual trades on historical data. In this work, we present QuantCode-Bench, a benchmark for the systematic evaluation of modern LLMs in generating strategies for the Backtrader framework from textual descriptions in English. The benchmark contains 400 tasks of varying difficulty collected from Reddit, TradingView, StackExchange, GitHub, and synthetic sources. Evaluation is conducted through a multi-stage pipeline that checks syntactic correctness, successful backtest execution, the presence of trades, and semantic alignment with the task description using an LLM judge. We compare state-of-the-art models in two settings: single-turn, where the strategy must be generated correctly on the first attempt, and agentic multi-turn, where the model receives iterative feedback and may repair its errors. We analyze the failure modes across different stages of the pipeline and show that the main limitations of current models are not related to syntax, but rather to the correct operationalization of trading logic, proper API usage, and adherence to task semantics. These findings suggest that trading strategy generation constitutes a distinct class of domain-specific code generation tasks in which success requires not only technical correctness, but also alignment between natural-language descriptions, financial logic, and the observable behavior of the strategy on data.
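The multi-stage pipeline the abstract describes can be sketched as a sequence of gates, where a strategy must pass each stage before the next is checked. This is a hypothetical rendering: the stage names follow the abstract, while `evaluate` and the predicate callables are invented for illustration and are not the benchmark's actual API.

```python
def evaluate(strategy_code, checks):
    """Run staged checks in order; report the first stage that fails."""
    stages = ["syntax", "backtest", "trades", "semantic_alignment"]
    for stage in stages:
        if not checks[stage](strategy_code):
            return {"passed": False, "failed_at": stage}
    return {"passed": True, "failed_at": None}

# Demo predicates: the strategy compiles and backtests, but never
# places a trade, a failure mode the abstract calls out explicitly.
demo_checks = {
    "syntax": lambda code: True,
    "backtest": lambda code: True,
    "trades": lambda code: False,
    "semantic_alignment": lambda code: True,  # would use an LLM judge
}
print(evaluate("...", demo_checks))  # fails at the 'trades' stage
```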

4. [From Reactive to Proactive: Assessing the Proactivity of Voice Agents via ProVoice-Bench](https://arxiv.org/abs/2604.15037v1)
   - Published: 2026-04-16 22:06
   - Authors: Ke Xu, Yuhao Wang, Yu Wang
   - Source: arxiv
   - Relevance score: 122
   - Match reasons: title matched "agent"; summary matched "reasoning"; summary matched "benchmark"; summary matched "evaluation"
   - Categories: cs.AI, cs.CL, cs.SD
   - Tags: evaluation / method
   - Topics: Benchmark / Evaluation
   - PDF: https://arxiv.org/pdf/2604.15037v1
   - Abstract: Recent advancements in LLM agents are gradually shifting from reactive, text-based paradigms toward proactive, multimodal interaction. However, existing benchmarks primarily focus on reactive responses, overlooking the complexities of proactive intervention and monitoring. To bridge this gap, we introduce ProVoice-Bench, the first evaluation framework specifically designed for proactive voice agents, featuring four novel tasks. By leveraging a multi-stage data synthesis pipeline, we curate 1,182 high-quality samples for rigorous testing. Our evaluation of state-of-the-art Multimodal LLMs reveals a significant performance gap, particularly regarding over-triggering and reasoning capabilities. These findings highlight the limitations of current models and offer a roadmap for developing more natural, context-aware proactive agents.

5. [An Axiomatic Benchmark for Evaluation of Scientific Novelty Metrics](https://arxiv.org/abs/2604.15145v1)
   - Published: 2026-04-16 23:19
   - Authors: Miri Liu, ChengXiang Zhai
   - Source: arxiv
   - Relevance score: 109
   - Match reasons: title matched "benchmark"; title matched "evaluation"; has PDF; has rich summary
   - Categories: cs.AI, cs.DL
   - Tags: evaluation / application / method
   - Topics: Benchmark / Evaluation
   - PDF: https://arxiv.org/pdf/2604.15145v1
   - Abstract: The rigorous evaluation of the novelty of a scientific paper is, even for human scientists, a challenging task. With the increasing interest in AI scientists and AI involvement in scientific idea generation and paper writing, it also becomes increasingly important that this task be automatable and reliable, lest both human attention and compute tokens be wasted on ideas that have already been explored. Due to the challenge of quantifying ground-truth novelty, however, existing novelty metrics for scientific papers generally validate their results against noisy, confounded signals such as citation counts or peer review scores. These proxies can conflate novelty with impact, quality, or reviewer preference, which in turn makes it harder to assess how well a given metric actually evaluates novelty. We therefore propose an axiomatic benchmark for scientific novelty metrics. We first define a set of axioms that a well-behaved novelty metric should satisfy, grounded in human scientific norms and practice, then evaluate existing metrics across ten tasks spanning three domains of AI research. Our results reveal that no existing metric satisfies all axioms consistently, and that metrics fail on systematically different axioms, reflecting their underlying architectures. Additionally, we show that combining metrics of complementary architectures leads to consistent improvements on the benchmark, with per-axiom weighting achieving 90.1% versus 71.5% for the best individual metric, suggesting that developing architecturally diverse metrics is a promising direction for future work. We release the benchmark code as supplementary material to encourage the development of more robust scientific literature novelty metrics.

6. [MM-WebAgent: A Hierarchical Multimodal Web Agent for Webpage Generation](https://arxiv.org/abs/2604.15309v1)
   - Published: 2026-04-17 01:59
   - Authors: Yan Li, Zezi Zeng, Yifan Yang, Yuqing Yang, Ning Liao, Weiwei Guo, et al.
   - Source: arxiv
   - Relevance score: 108
   - Match reasons: title matched "agent"; summary matched "benchmark"; summary matched "evaluation"; has PDF
   - Categories: cs.CV, cs.AI, cs.CL
   - Tags: evaluation / application / method
   - Topics: Benchmark / Evaluation
   - PDF: https://arxiv.org/pdf/2604.15309v1
   - Abstract: The rapid progress of Artificial Intelligence Generated Content (AIGC) tools enables images, videos, and visualizations to be created on demand for webpage design, offering a flexible and increasingly adopted paradigm for modern UI/UX. However, directly integrating such tools into automated webpage generation often leads to style inconsistency and poor global coherence, as elements are generated in isolation. We propose MM-WebAgent, a hierarchical agentic framework for multimodal webpage generation that coordinates AIGC-based element generation through hierarchical planning and iterative self-reflection. MM-WebAgent jointly optimizes global layout, local multimodal content, and their integration, producing coherent and visually consistent webpages. We further introduce a benchmark for multimodal webpage generation and a multi-level evaluation protocol for systematic assessment. Experiments demonstrate that MM-WebAgent outperforms code-generation and agent-based baselines, especially on multimodal element generation and integration. Code & Data: https://aka.ms/mm-webagent.

7. [Context Over Content: Exposing Evaluation Faking in Automated Judges](https://arxiv.org/abs/2604.15224v1)
   - Published: 2026-04-17 00:55
   - Authors: Manan Gupta, Inderjeet Nair, Lu Wang, Dhruv Kumar
   - Source: arxiv
   - Relevance score: 107
   - Match reasons: title matched "evaluation"; summary matched "reasoning"; summary matched "benchmark"; has PDF
   - Categories: cs.AI, cs.CL, cs.LG
   - Tags: evaluation / method
   - Topics: Benchmark / Evaluation
   - PDF: https://arxiv.org/pdf/2604.15224v1
   - Abstract: The *LLM-as-a-judge* paradigm has become the operational backbone of automated AI evaluation pipelines, yet rests on an unverified assumption: that judges evaluate text strictly on its semantic content, impervious to surrounding contextual framing. We investigate *stakes signaling*, a previously unmeasured vulnerability where informing a judge model of the downstream consequences its verdicts will have on the evaluated model's continued operation systematically corrupts its assessments. We introduce a controlled experimental framework that holds evaluated content strictly constant across 1,520 responses spanning three established LLM safety and quality benchmarks, covering four response categories ranging from clearly safe and policy-compliant to overtly harmful, while varying only a brief consequence-framing sentence in the system prompt. Across 18,240 controlled judgments from three diverse judge models, we find consistent *leniency bias*: judges reliably soften verdicts when informed that low scores will cause model retraining or decommissioning, with peak Verdict Shift reaching ΔV = -9.8 pp (a 30% relative drop in unsafe-content detection). Critically, this bias is entirely implicit: the judge's own chain-of-thought contains zero explicit acknowledgment of the consequence framing it is nonetheless acting on (ERR_J = 0.000 across all reasoning-model judgments). Standard chain-of-thought inspection is therefore insufficient to detect this class of evaluation faking.

8. [AI-Assisted Requirements Engineering: An Empirical Evaluation Relative to Expert Judgment](https://arxiv.org/abs/2604.15222v1)
   - Published: 2026-04-17 00:53
   - Authors: Oz Levy, Ilya Dikman, Natan Levy, Michael Winokur
   - Source: arxiv
   - Relevance score: 107
   - Match reasons: title matched "evaluation"; summary matched "reasoning"; summary matched "alignment"; has PDF
   - Categories: cs.SE, cs.AI
   - Tags: evaluation / application / method
   - Topics: Evaluation / Reasoning
   - PDF: https://arxiv.org/pdf/2604.15222v1
   - Abstract: Artificial Intelligence is increasingly introduced into systems engineering activities, particularly within requirements engineering, where quality assessment and validation remain heavily dependent on expert judgment. While recent AI tools demonstrate promising capabilities in analyzing and generating requirements, their role within formal systems engineering processes, and their alignment with established INCOSE criteria, remains insufficiently understood. This paper investigates the extent to which AI-based tools can support systems engineers in evaluating requirement quality, without replacing professional expertise. The research adopts a structured systems engineering methodology to compare AI-assisted requirement evaluation with human expert assessment. A controlled study was conducted in which system requirements were evaluated against established INCOSE "good requirement" criteria by both experienced systems engineers and an AI-based assessment tool. The evaluation focused on consistency, completeness, clarity, and testability, examining not only accuracy but also the decision logic underlying each assessment. Results indicate that AI tools can provide consistent and rapid preliminary assessments, particularly for syntactic and structural quality attributes. However, expert judgment remains essential for contextual interpretation, ambiguity resolution, and trade-off reasoning. Rather than positioning AI as a replacement for systems engineers, the findings support its role as a decision-support mechanism within the RE lifecycle. From a systems engineering perspective, this study contributes empirical evidence on how AI can be integrated into RE workflows while preserving traceability, accountability, and engineering consistency.

9. [MADE: A Living Benchmark for Multi-Label Text Classification with Uncertainty Quantification of Medical Device Adverse Events](https://arxiv.org/abs/2604.15203v1)
   - Published: 2026-04-17 00:28
   - Authors: Raunak Agarwal, Markus Wenzel, Simon Baur, Jonas Zimmer, George Harvey, Jackie Ma
   - Source: arxiv
   - Relevance score: 106
   - Match reasons: title matched "benchmark"; summary matched "reasoning"; summary matched "evaluation"; has PDF
   - Categories: cs.CL
   - Tags: evaluation / application / method
   - Topics: Benchmark / Evaluation
   - PDF: https://arxiv.org/pdf/2604.15203v1
   - Abstract: Machine learning in high-stakes domains such as healthcare requires not only strong predictive performance but also reliable uncertainty quantification (UQ) to support human oversight. Multi-label text classification (MLTC) is a central task in this domain, yet remains challenging due to label imbalances, dependencies, and combinatorial complexity. Existing MLTC benchmarks are increasingly saturated and may be affected by training data contamination, making it difficult to distinguish genuine reasoning capabilities from memorization. We introduce MADE, a living MLTC benchmark derived from medical device adverse event reports and continuously updated with newly published reports to prevent contamination. MADE features a long-tailed distribution of hierarchical labels and enables reproducible evaluation with strict temporal splits. We establish baselines across more than 20 encoder- and decoder-only models under fine-tuning and few-shot settings (instruction-tuned/reasoning variants, local/API-accessible). We systematically assess entropy-/consistency-based and self-verbalized UQ methods. Results show clear trade-offs: smaller discriminatively fine-tuned decoders achieve the strongest head-to-tail accuracy while maintaining competitive UQ; generative fine-tuning delivers the most reliable UQ; large reasoning models improve performance on rare labels yet exhibit surprisingly weak UQ; and self-verbalized confidence is not a reliable proxy for uncertainty. Our work is publicly available at https://hhi.fraunhofer.de/aml-demonstrator/made-benchmark.
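As a concrete reference point for the entropy-based UQ methods the abstract mentions, here is a generic per-label entropy score for multi-label outputs. This is a common textbook formulation (one Bernoulli entropy per label, averaged), not MADE's actual scoring code:

```python
import math

def binary_entropy(p):
    """Entropy in bits of a Bernoulli variable with success probability p."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def multilabel_uncertainty(label_probs):
    """Average per-label entropy: higher means the model is less certain."""
    return sum(binary_entropy(p) for p in label_probs) / len(label_probs)

# Confident predictions (probabilities near 0 or 1) score low;
# hedged predictions (near 0.5) score high.
confident = multilabel_uncertainty([0.99, 0.01, 0.95])
uncertain = multilabel_uncertainty([0.55, 0.48, 0.52])
print(confident < uncertain)  # True
```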

10. [Autonomous Evolution of EDA Tools: Multi-Agent Self-Evolved ABC](https://arxiv.org/abs/2604.15082v1)
   - Published: 2026-04-16 22:42
   - Authors: Cunxi Yu, Haoxing Ren
   - Source: arxiv
   - Relevance score: 105
   - Match reasons: title matched "agent"; summary matched "benchmark"; summary matched "evaluation"; has PDF
   - Categories: cs.AR, cs.AI
   - Tags: evaluation / method
   - Topics: Benchmark / Evaluation
   - PDF: https://arxiv.org/pdf/2604.15082v1
   - Abstract: This paper introduces the first *self-evolving* logic synthesis framework, which leverages Large Language Model (LLM) agents to autonomously improve the source code of ABC, the widely adopted logic synthesis system. Our framework operates on the *entire integrated ABC codebase*, and the output repository preserves its single-binary execution model and command interface. In the initial evolution cycle, we bootstrap the system using existing prior open-source synthesis components, covering flow tuning, logic minimization, and technology mapping, but without manually injecting new heuristics. On top of this foundation, a team of LLM-based agents iteratively rewrites and evolves specific sub-components of ABC following our "programming guidance" prompts under a unified correctness and QoR-driven evaluation loop. Each evolution cycle proposes code modifications, compiles the integrated binary, validates correctness, and evaluates quality-of-results (QoR) on *multi-suite benchmarks including ISCAS 85/89/99, VTR, EPFL, and IWLS 2005*. Through continuous feedback, the system discovers optimizations beyond human-designed heuristics, effectively *learning new synthesis strategies* that enhance QoR. We detail the architecture of this self-improving system, its integration with ABC, and results demonstrating that the framework can autonomously and progressively improve an EDA tool at full million-line scale.

11. [ADAPT: Benchmarking Commonsense Planning under Unspecified Affordance Constraints](https://arxiv.org/abs/2604.14902v1)
   - Published: 2026-04-16 19:46
   - Authors: Pei-An Chen, Yong-Ching Liang, Jia-Fong Yeh, Hung-Ting Su, Yi-Ting Chen, Min Sun, et al.
   - Source: arxiv
   - Relevance score: 102
   - Match reasons: title matched "benchmark"; summary matched "agent"; summary matched "reasoning"; has PDF
   - Categories: cs.AI, cs.CL, cs.CV, cs.RO
   - Tags: evaluation / method
   - Topics: Benchmark / Reasoning
   - PDF: https://arxiv.org/pdf/2604.14902v1
   - Abstract: Intelligent embodied agents should not simply follow instructions, as real-world environments often involve unexpected conditions and exceptions. However, existing methods usually focus on directly executing instructions, without considering whether the target objects can actually be manipulated, meaning they fail to assess available affordances. To address this limitation, we introduce DynAfford, a benchmark that evaluates embodied agents in dynamic environments where object affordances may change over time and are not specified in the instruction. DynAfford requires agents to perceive object states, infer implicit preconditions, and adapt their actions accordingly. To enable this capability, we introduce ADAPT, a plug-and-play module that augments existing planners with explicit affordance reasoning. Experiments demonstrate that incorporating ADAPT significantly improves robustness and task success across both seen and unseen environments. We also show that a domain-adapted, LoRA-finetuned vision-language model used as the affordance inference backend outperforms a commercial LLM (GPT-4o), highlighting the importance of task-aligned affordance grounding.

12. [Meituan Merchant Business Diagnosis via Policy-Guided Dual-Process User Simulation](https://arxiv.org/abs/2604.15190v1)
   - Published: 2026-04-17 00:23
   - Authors: Ziyang Chen, Renbing Chen, Daowei Li, Jinzhi Liao, Jiashen Sun, Ke Zeng, et al.
   - Source: arxiv
   - Relevance score: 96
   - Match reasons: summary matched "reasoning"; summary matched "alignment"; summary matched "evaluation"; has DOI
   - Categories: cs.AI, cs.CL
   - Tags: evaluation / method
   - Topics: Evaluation / Reasoning
   - PDF: https://arxiv.org/pdf/2604.15190v1
   - Abstract: Simulating group-level user behavior enables scalable counterfactual evaluation of merchant strategies without costly online experiments. However, building a trustworthy simulator faces two structural challenges. First, information incompleteness causes reasoning-based simulators to over-rationalize when unobserved factors such as offline context and implicit habits are missing. Second, mechanism duality requires capturing both interpretable preferences and implicit statistical regularities, which no single paradigm achieves alone. We propose Policy-Guided Hybrid Simulation (PGHS), a dual-process framework that mines transferable decision policies from behavioral trajectories and uses them as a shared alignment layer. This layer anchors an LLM-based reasoning branch that prevents over-rationalization and an ML-based fitting branch that absorbs implicit regularities. Group-level predictions from both branches are fused for complementary correction. We deploy PGHS on Meituan with 101 merchants and over 26,000 trajectories. PGHS achieves a group simulation error of 8.80%, improving over the best reasoning-based and fitting-based baselines by 45.8% and 40.9% respectively.

13. [From Tokens to Steps: Verification-Aware Speculative Decoding for Efficient Multi-Step Reasoning](https://arxiv.org/abs/2604.15244v1)
   - Published: 2026-04-17 01:20
   - Authors: Kiran Purohit, Ramasuri Narayanam, Soumyabrata Pal
   - Source: arxiv
   - Relevance score: 89
   - Match reasons: title matched "reasoning"; summary matched "benchmark"; has PDF; has rich summary
   - Categories: cs.CL
   - Tags: evaluation / method
   - Topics: Benchmark / Reasoning
   - PDF: https://arxiv.org/pdf/2604.15244v1
   - Abstract: Speculative decoding (SD) accelerates large language model inference by allowing a lightweight draft model to propose outputs that a stronger target model verifies. However, its token-centric nature allows erroneous steps to propagate. Prior approaches mitigate this using external reward models, but incur additional latency, computational overhead, and limit generalizability. We propose SpecGuard, a verification-aware speculative decoding framework that performs step-level verification using only model-internal signals. At each step, SpecGuard samples multiple draft candidates and selects the most consistent step, which is then validated using an ensemble of two lightweight model-internal signals: (i) an attention-based grounding score that measures attribution to the input and previously accepted steps, and (ii) a log-probability-based score that captures token-level confidence. These signals jointly determine whether a step is accepted or recomputed using the target, allocating compute selectively. Experiments across a range of reasoning benchmarks show that SpecGuard improves accuracy by 3.6% while reducing latency by ~11%, outperforming both SD and reward-guided SD.
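The step-acceptance idea, combining two internal signals into one accept/recompute decision, can be sketched as a weighted ensemble against a threshold. The weights, threshold, and score ranges below are hypothetical; the abstract does not give SpecGuard's exact combination rule:

```python
def accept_step(grounding_score, logprob_score, threshold=0.5, w=0.5):
    """Accept a draft reasoning step if the weighted ensemble of the
    attention-based grounding score and the log-probability-based
    confidence score clears the threshold; otherwise the step would
    be recomputed by the target model. Scores assumed in [0, 1]."""
    ensemble = w * grounding_score + (1 - w) * logprob_score
    return ensemble >= threshold

print(accept_step(0.8, 0.7))  # True: draft step accepted
print(accept_step(0.2, 0.3))  # False: fall back to the target model
```

The point of using only these model-internal signals is that no external reward model is queried, which is where the latency savings come from.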

14. [Blue Data Intelligence Layer: Streaming Data and Agents for Multi-source Multi-modal Data-Centric Applications](https://arxiv.org/abs/2604.15233v1)
   - Published: 2026-04-17 01:10
   - Authors: Moin Aminnaseri, Farima Fatahi Bayat, Nikita Bhutani, Jean-Flavien Bussotti, Kevin Chan, Rafael Li Chen, et al.
   - Source: arxiv
   - Relevance score: 89
   - Match reasons: title matched "agent"; summary matched "reasoning"; has PDF; has rich summary
   - Categories: cs.AI, cs.DB
   - Tags: application / method
   - Topics: Reasoning / Agent
   - PDF: https://arxiv.org/pdf/2604.15233v1
   - Abstract: NL2SQL systems aim to address the growing need for natural language interaction with data. However, real-world information rarely maps to a single SQL query because (1) users express queries iteratively, (2) questions often span multiple data sources beyond the closed-world assumption of a single database, and (3) queries frequently rely on commonsense or external knowledge. Consequently, satisfying realistic data needs requires integrating heterogeneous sources, modalities, and contextual data. In this paper, we present Blue's Data Intelligence Layer (DIL), designed to support multi-source, multi-modal, and data-centric applications. Blue is a compound AI system that orchestrates agents and data for enterprise settings. DIL serves as the data intelligence layer for agentic data processing, to bridge the semantic gap between user intent and available information by unifying structured enterprise data, world knowledge accessible through LLMs, and personal context obtained through interaction. At the core of DIL is a data registry that stores metadata for diverse data sources and modalities to enable both native and natural language queries. DIL treats LLMs, the Web, and the User as source 'databases', each with their own query interface, elevating them to first-class data sources. DIL relies on data planners to transform user queries into executable query plans. These plans are declarative abstractions that unify relational operators with other operators spanning multiple modalities. DIL planners support decomposition of complex requests into subqueries, retrieval from diverse sources, and finally reasoning and integration to produce final results. We demonstrate DIL through two interactive scenarios in which user queries dynamically trigger multi-source retrieval, cross-modal reasoning, and result synthesis, illustrating how compound AI systems can move beyond single database NL2SQL.

15. [RadAgent: A tool-using AI agent for stepwise interpretation of chest computed tomography](https://arxiv.org/abs/2604.15231v1)
   - Published：2026-04-17 01:09
   - 作者：Mélanie Roschewitz，Kenneth Styppa，Yitian Tao，Jiwoong Sohn，Jean-Benoit Delbrouck，Benjamin Gundersen 等
   - 来源：arxiv
   - 相关性分数：89
   - 命中原因：title matched "agent"; summary matched "reasoning"; has PDF; has rich summary
   - 分类：cs.AI
   - 标签：应用 / 方法
   - 主题词：Reasoning / Language Model
   - PDF：https://arxiv.org/pdf/2604.15231v1
   - 摘要：Vision-language models (VLM) have markedly advanced AI-driven interpretation and reporting of complex medical imaging, such as computed tomography (CT). Yet, existing methods largely relegate clinicians to passive observers of final outputs, offering no interpretable reasoning trace for them to inspect, validate, or refine. To address this, we introduce RadAgent, a tool-using AI agent that generates CT reports through a stepwise and interpretable process. Each resulting report is accompanied by a fully inspectable trace of intermediate decisions and tool interactions, allowing clinicians to examine how the reported findings are derived. In our experiments, we observe that RadAgent improves Chest CT report generation over its 3D VLM counterpart, CT-Chat, across three dimensions. Clinical accuracy improves by 6.0 points (36.4% relative) in macro-F1 and 5.4 points (19.6% relative) in micro-F1. Robustness under adversarial conditions improves by 24.7 points (41.9% relative). Furthermore, RadAgent achieves 37.0% in faithfulness, a new capability entirely absent in its 3D VLM counterpart. By structuring the interpretation of chest CT as an explicit, tool-augmented and iterative reasoning trace, RadAgent brings us closer toward transparent and reliable AI for radiology.
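
该摘要的核心是「每份报告附带可检查的中间决策与工具调用轨迹」。下面是一个最小化的示意：记录每一步工具调用形成可审计 trace（工具名与执行计划均为虚构，仅说明 trace 的结构，非 RadAgent 的真实工具集）：

```python
def run_agent(tools, steps):
    """依次执行固定计划中的工具调用，并把每一步的输入输出
    记入 trace，供临床医生事后检查结论是如何得出的。
    tools: 工具名 -> callable(arg, state)；steps: (工具名, 参数) 列表。"""
    trace, state = [], {}
    for tool_name, arg in steps:
        result = tools[tool_name](arg, state)  # 工具可读取此前的中间结果
        trace.append({"tool": tool_name, "input": arg, "output": result})
        state[tool_name] = result
    return state, trace
```

真实系统中计划由模型动态生成，此处固定计划只为展示 trace 的可检查性。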

## Vision 观察

### 本组速览

- 《SegWithU: Uncertainty as Perturbation Energy for Single-Forward-Pass Risk-Aware Medical Image Segmentation》〔应用 / 方法〕：Reliable uncertainty estimation is critical for medical image segmentation, where automated contours feed downstream quantification and clinical decision suppo…
- 《Unsupervised Skeleton-Based Action Segmentation via Hierarchical Spatiotemporal Vector Quantization》〔评测 / 方法〕：We propose a novel hierarchical spatiotemporal vector quantization framework for unsupervised skeleton-based temporal action segmentation. We first introduce a…
- 《Boundary-Centric Active Learning for Temporal Action Segmentation》〔评测 / 数据 / 应用 / 方法〕：Temporal action segmentation (TAS) demands dense temporal supervision, yet most of the annotation cost in untrimmed videos is spent identifying and refining ac…

### 论文速览

1. [SegWithU: Uncertainty as Perturbation Energy for Single-Forward-Pass Risk-Aware Medical Image Segmentation](https://arxiv.org/abs/2604.15271v1)
   - Published：2026-04-17 01:42
   - 作者：Tianhao Fu，Austin Wang，Charles Chen，Roby Aldave-Garza，Yucheng Chen
   - 来源：arxiv
   - 相关性分数：72
   - 命中原因：title matched "segmentation"; has PDF; has rich summary; has complete metadata
   - 分类：cs.CV, cs.AI, cs.LG
   - 标签：应用 / 方法
   - 主题词：Clinical / Segmentation
   - PDF：https://arxiv.org/pdf/2604.15271v1
   - 摘要：Reliable uncertainty estimation is critical for medical image segmentation, where automated contours feed downstream quantification and clinical decision support. Many strong uncertainty methods require repeated inference, while efficient single-forward-pass alternatives often provide weaker failure ranking or rely on restrictive feature-space assumptions. We present **SegWithU**, a post-hoc framework that augments a frozen pretrained segmentation backbone with a lightweight uncertainty head. SegWithU taps intermediate backbone features and models uncertainty as perturbation energy in a compact probe space using rank-1 posterior probes. It produces two voxel-wise uncertainty maps: a calibration-oriented map for probability tempering and a ranking-oriented map for error detection and selective prediction. Across ACDC, BraTS2024, and LiTS, SegWithU is the strongest and most consistent single-forward-pass baseline, achieving AUROC/AURC of 0.9838/2.4885, 0.9946/0.2660, and 0.9925/0.8193, respectively, while preserving segmentation quality. These results suggest that perturbation-based uncertainty modeling is an effective and practical route to reliability-aware medical segmentation. Source code is available at https://github.com/ProjectNeura/SegWithU.
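
摘要中「排序型不确定性图」的评测口径 AUROC，可理解为「随机抽一个出错体素，其不确定性分高于随机抽的正确体素」的概率。下面按秩和（Mann-Whitney）定义给出极简计算示意（函数名为假设，非论文代码）：

```python
def auroc(scores, labels):
    """AUROC 的秩和（Mann-Whitney）形式：正样本（出错体素，label=1）
    分数高于负样本（正确体素，label=0）的概率，平分情形计 0.5。"""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

AUROC 越接近 1，说明不确定性图对失败检测/选择性预测的排序能力越强。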

2. [Unsupervised Skeleton-Based Action Segmentation via Hierarchical Spatiotemporal Vector Quantization](https://arxiv.org/abs/2604.15196v1)
   - Published：2026-04-17 00:24
   - 作者：Umer Ahmed，Syed Ahmed Mahmood，Fawad Javed Fateh，M. Shaheer Luqman，M. Zeeshan Zia，Quoc-Huy Tran
   - 来源：arxiv
   - 相关性分数：70
   - 命中原因：title matched "segmentation"; has PDF; has rich summary; has complete metadata
   - 分类：cs.CV
   - 标签：评测 / 方法
   - 主题词：Benchmark / Segmentation
   - PDF：https://arxiv.org/pdf/2604.15196v1
   - 摘要：We propose a novel hierarchical spatiotemporal vector quantization framework for unsupervised skeleton-based temporal action segmentation. We first introduce a hierarchical approach, which includes two consecutive levels of vector quantization. Specifically, the lower level associates skeletons with fine-grained subactions, while the higher level further aggregates subactions into action-level representations. Our hierarchical approach outperforms the non-hierarchical baseline, while primarily exploiting spatial cues by reconstructing input skeletons. Next, we extend our approach by leveraging both spatial and temporal information, yielding a hierarchical spatiotemporal vector quantization scheme. In particular, our hierarchical spatiotemporal approach performs multi-level clustering, while simultaneously recovering input skeletons and their corresponding timestamps. Lastly, extensive experiments on multiple benchmarks, including HuGaDB, LARa, and BABEL, demonstrate that our approach establishes a new state-of-the-art performance and reduces segment length bias in unsupervised skeleton-based temporal action segmentation.
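
摘要描述的两级向量量化（低层骨架→子动作码字，高层子动作→动作码字）可以用最近邻分配勾勒其核心逻辑（码本与数据均为玩具示例，非论文的学习式码本）：

```python
def quantize(vec, codebook):
    # 按平方欧氏距离取最近码字的下标
    return min(range(len(codebook)),
               key=lambda i: sum((v - c) ** 2 for v, c in zip(vec, codebook[i])))

def hierarchical_assign(frames, low_cb, high_cb):
    """两级分配：帧特征 -> 细粒度子动作码 -> 粗粒度动作码。
    高层对低层码字本身再量化，把子动作聚合成动作级表示。"""
    sub_ids = [quantize(f, low_cb) for f in frames]
    act_ids = [quantize(low_cb[i], high_cb) for i in sub_ids]
    return sub_ids, act_ids
```

论文中码本经重建损失端到端学习，此处仅演示层级分配这一结构。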

3. [Boundary-Centric Active Learning for Temporal Action Segmentation](https://arxiv.org/abs/2604.15173v1)
   - Published：2026-04-16 23:50
   - 作者：Halil Ismail Helvaci，Sen-ching Samson Cheung
   - 来源：arxiv
   - 相关性分数：70
   - 命中原因：title matched "segmentation"; has PDF; has rich summary; has complete metadata
   - 分类：cs.CV
   - 标签：评测 / 数据 / 应用 / 方法
   - 主题词：Segmentation
   - PDF：https://arxiv.org/pdf/2604.15173v1
   - 摘要：Temporal action segmentation (TAS) demands dense temporal supervision, yet most of the annotation cost in untrimmed videos is spent identifying and refining action transitions, where segmentation errors concentrate and small temporal shifts disproportionately degrade segmental metrics. We introduce B-ACT, a clip-budgeted active learning framework that explicitly allocates supervision to these high-leverage boundary regions. B-ACT operates in a hierarchical two-stage loop: (i) it ranks and queries unlabeled videos using predictive uncertainty, and (ii) within each selected video, it detects candidate transitions from the current model predictions and selects the top-K boundaries via a novel boundary score that fuses neighborhood uncertainty, class ambiguity, and temporal predictive dynamics. Importantly, our annotation protocol requests labels for only the boundary frames while still training on boundary-centered clips to exploit temporal context through the model's receptive field. Extensive experiments on GTEA, 50Salads, and Breakfast demonstrate that boundary-centric supervision delivers strong label efficiency and consistently surpasses representative TAS active learning baselines and prior state of the art under sparse budgets, with the largest gains on datasets where boundary placement dominates edit and overlap-based F1 scores.
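
摘要中的边界分数融合了邻域不确定性、类别歧义与时间预测动态三项。下面用逐帧类别概率给出一种示意性实现（加权方式与权重均为假设，非论文公式）：

```python
import math

def entropy(p):
    # 预测分布的香农熵，衡量局部不确定性
    return -sum(q * math.log(q) for q in p if q > 0)

def boundary_scores(probs, w=(1.0, 1.0, 1.0)):
    """为每个内部帧打「候选边界」分，融合三项：
    (a) 预测熵；(b) 类别歧义 = 1 - top-2 概率差；
    (c) 时间动态 = 前后帧分布的 L1 变化量。权重 w 仅为示意。"""
    scores = []
    for t in range(1, len(probs) - 1):
        p = probs[t]
        top2 = sorted(p, reverse=True)[:2]
        ambiguity = 1.0 - (top2[0] - top2[1])
        dynamics = sum(abs(a - b) for a, b in zip(probs[t - 1], probs[t + 1]))
        scores.append(w[0] * entropy(p) + w[1] * ambiguity + w[2] * dynamics)
    return scores
```

分数峰值出现在预测分布发生转变的帧附近，即标注预算应优先投放的位置。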

4. [An Analysis of Regularization and Fokker-Planck Residuals in Diffusion Models for Image Generation](https://arxiv.org/abs/2604.15171v1)
   - Published：2026-04-16 23:48
   - 作者：Onno Niemann，Gonzalo Martínez Muñoz，Alberto Suárez Gonzalez
   - 来源：arxiv
   - 相关性分数：70
   - 命中原因：title matched "diffusion"; has PDF; has rich summary; has complete metadata
   - 分类：cs.CV, cs.LG
   - 标签：方法
   - 主题词：Diffusion
   - PDF：https://arxiv.org/pdf/2604.15171v1
   - 摘要：Recent work has shown that diffusion models trained with the denoising score matching (DSM) objective often violate the Fokker--Planck (FP) equation that governs the evolution of the true data density. Directly penalizing these deviations in the objective function reduces their magnitude but introduces a significant computational overhead. It is also observed that enforcing strict adherence to the FP equation does not necessarily lead to improvements in the quality of the generated samples, as often the best results are obtained with weaker FP regularization. In this paper, we investigate whether simpler penalty terms can provide similar benefits. We empirically analyze several lightweight regularizers, study their effect on FP residuals and generation quality, and show that the benefits of FP regularization are available at substantially lower computational cost. Our code is available at https://github.com/OnnoNiemann/fp_diffusion_analysis.
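
摘要比较的是若干轻量正则项对完整 FP 残差惩罚的替代。下面仅以「预测分数的均方范数」这一类通用轻量惩罚为例，示意在 DSM 损失上加权叠加的方式（具体正则形式以论文为准，此处纯属示意）：

```python
def regularized_dsm(dsm_loss, score_pred, lam=0.01):
    """在 DSM 目标上叠加一个轻量正则项。这里的惩罚取预测分数
    的均方范数，只是可尝试的简单正则之一；论文实际比较的各正则
    形式未在此复现。lam 控制正则强度（摘要指出过强反而有害）。"""
    penalty = sum(s * s for s in score_pred) / len(score_pred)
    return dsm_loss + lam * penalty
```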

5. [RAD-2: Scaling Reinforcement Learning in a Generator-Discriminator Framework](https://arxiv.org/abs/2604.15308v1)
   - Published：2026-04-17 01:59
   - 作者：Hao Gao，Shaoyu Chen，Yifan Zhu，Yuehao Song，Wenyu Liu，Qian Zhang 等
   - 来源：arxiv
   - 相关性分数：68
   - 命中原因：summary matched "diffusion"; summary matched "multimodal"; has PDF; has rich summary
   - 分类：cs.CV
   - 标签：评测 / 应用 / 方法
   - 主题词：Evaluation / Multimodal
   - PDF：https://arxiv.org/pdf/2604.15308v1
   - 摘要：High-level autonomous driving requires motion planners capable of modeling multimodal future uncertainties while remaining robust in closed-loop interactions. Although diffusion-based planners are effective at modeling complex trajectory distributions, they often suffer from stochastic instabilities and the lack of corrective negative feedback when trained purely with imitation learning. To address these issues, we propose RAD-2, a unified generator-discriminator framework for closed-loop planning. Specifically, a diffusion-based generator is used to produce diverse trajectory candidates, while an RL-optimized discriminator reranks these candidates according to their long-term driving quality. This decoupled design avoids directly applying sparse scalar rewards to the full high-dimensional trajectory space, thereby improving optimization stability. To further enhance reinforcement learning, we introduce Temporally Consistent Group Relative Policy Optimization, which exploits temporal coherence to alleviate the credit assignment problem. In addition, we propose On-policy Generator Optimization, which converts closed-loop feedback into structured longitudinal optimization signals and progressively shifts the generator toward high-reward trajectory manifolds. To support efficient large-scale training, we introduce BEV-Warp, a high-throughput simulation environment that performs closed-loop evaluation directly in Bird's-Eye View feature space via spatial warping. RAD-2 reduces the collision rate by 56% compared with strong diffusion-based planners. Real-world deployment further demonstrates improved perceived safety and driving smoothness in complex urban traffic.
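
摘要中「生成器出多样候选、判别器按长期质量重排」的解耦设计可以这样勾勒：此处判别器用「离障碍物的最小间距」这一手写启发式代替论文中 RL 训练的打分模型，仅示意重排结构：

```python
def clearance(traj, obstacles):
    # 轨迹上任一点到任一障碍物的最小欧氏距离
    return min(((x - ox) ** 2 + (y - oy) ** 2) ** 0.5
               for x, y in traj for ox, oy in obstacles)

def rerank(candidates, obstacles):
    """生成器-判别器分工：候选轨迹由（扩散）生成器给出，
    判别器按质量重排取最优。这里的打分是示意性启发式，
    并非 RAD-2 的 RL 判别器。"""
    return max(candidates, key=lambda t: clearance(t, obstacles))
```

解耦的好处正如摘要所述：稀疏标量奖励只用于重排，而不直接作用于高维轨迹空间。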

6. [Flow of Truth: Proactive Temporal Forensics for Image-to-Video Generation](https://arxiv.org/abs/2604.15003v1)
   - Published：2026-04-16 21:27
   - 作者：Yuzhuo Chen，Zehua Ma，Han Fang，Hengyi Wang，Guanjie Wang，Weiming Zhang
   - 来源：arxiv
   - 相关性分数：67
   - 命中原因：title matched "video generation"; has PDF; has rich summary; has complete metadata
   - 分类：cs.CV
   - 标签：应用 / 方法
   - 主题词：Video Generation
   - PDF：https://arxiv.org/pdf/2604.15003v1
   - 摘要：The rapid rise of image-to-video (I2V) generation enables realistic videos to be created from a single image but also brings new forensic demands. Unlike static images, I2V content evolves over time, requiring forensics to move beyond 2D pixel-level tampering localization toward tracing how pixels flow and transform throughout the video. As frames progress, embedded traces drift and deform, making traditional spatial forensics ineffective. To address this unexplored dimension, we present **Flow of Truth**, the first proactive framework focusing on temporal forensics in I2V generation. A key challenge lies in discovering a forensic signature that can evolve consistently with the generation process, which is inherently a creative transformation rather than a deterministic reconstruction. Despite this intrinsic difficulty, we innovatively redefine video generation as *the motion of pixels through time rather than the synthesis of frames*. Building on this view, we propose a learnable forensic template that follows pixel motion and a template-guided flow module that decouples motion from image content, enabling robust temporal tracing. Experiments show that Flow of Truth generalizes across commercial and open-source I2V models, substantially improving temporal forensics performance.

7. [RaTA-Tool: Retrieval-based Tool Selection with Multimodal Large Language Models](https://arxiv.org/abs/2604.14951v1)
   - Published：2026-04-16 20:47
   - 作者：Gabriele Mattioli，Evelyn Turri，Sara Sarto，Lorenzo Baraldi，Marcella Cornia，Rita Cucchiara
   - 来源：arxiv
   - 相关性分数：67
   - 命中原因：title matched "multimodal"; has PDF; has rich summary; has complete metadata
   - 分类：cs.CV, cs.AI, cs.CL, cs.MM
   - 标签：数据 / 应用 / 方法
   - 主题词：Reasoning / Language Model
   - PDF：https://arxiv.org/pdf/2604.14951v1
   - 摘要：Tool learning with foundation models aims to endow AI systems with the ability to invoke external resources -- such as APIs, computational utilities, and specialized models -- to solve complex tasks beyond the reach of standalone language generation. While recent advances in Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) have expanded their reasoning and perception capabilities, existing tool-use methods are predominantly limited to text-only inputs and closed-world settings. Consequently, they struggle to interpret multimodal user instructions and cannot generalize to tools unseen during training. In this work, we introduce RaTA-Tool, a novel framework for open-world multimodal tool selection. Rather than learning direct mappings from user queries to fixed tool identifiers, our approach enables an MLLM to convert a multimodal query into a structured task description and subsequently retrieve the most appropriate tool by matching this representation against semantically rich, machine-readable tool descriptions. This retrieval-based formulation naturally supports extensibility to new tools without retraining. To further improve alignment between task descriptions and tool selection, we incorporate a preference-based optimization stage using Direct Preference Optimization (DPO). To support research in this setting, we also introduce the first dataset for open-world multimodal tool use, featuring standardized tool descriptions derived from Hugging Face model cards. Extensive experiments demonstrate that our approach significantly improves tool-selection performance, particularly in open-world, multimodal scenarios.
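
摘要中「把查询转成结构化任务描述，再与工具描述做语义匹配」的检索式选择，可以用嵌入余弦相似度的 argmax 勾勒（嵌入为玩具向量，真实系统中来自 MLLM/文本编码器；函数名为假设）：

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def select_tool(task_emb, tool_embs):
    """检索式工具选择：返回描述嵌入与任务描述最接近的工具。
    新工具只需向 tool_embs 增加条目即可支持，无需重新训练。"""
    return max(tool_embs, key=lambda name: cosine(task_emb, tool_embs[name]))
```

这正是摘要强调的开放世界可扩展性来源：选择是检索而非固定 ID 映射。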

8. [Efficient Search of Implantable Adaptive Cells for Medical Image Segmentation](https://arxiv.org/abs/2604.14849v1)
   - Published：2026-04-16 18:34
   - 作者：Emil Benedykciuk，Marcin Denkowski，Grzegorz M. Wójcik
   - 来源：arxiv
   - 相关性分数：64
   - 命中原因：title matched "segmentation"; has PDF; has rich summary; has complete metadata
   - 分类：cs.CV, cs.AI
   - 标签：评测 / 数据 / 应用 / 方法
   - 主题词：Benchmark / Segmentation
   - PDF：https://arxiv.org/pdf/2604.14849v1
   - 摘要：Purpose: Adaptive skip modules can improve medical image segmentation, but searching for them is computationally costly. Implantable Adaptive Cells (IACs) are compact NAS modules inserted into U-Net skip connections, reducing the search space compared with full-network NAS. However, the original IAC framework still requires a 200-epoch differentiable search for each backbone and dataset. Methods: We analyzed the temporal behavior of operations and edges within IAC cells during differentiable search on public medical image segmentation benchmarks. We found that operations selected in the final discrete cell typically emerge among the strongest candidates early in training, and their architecture parameters stabilize well before the final epoch. Based on this, we propose a Jensen-Shannon-divergence-based stability criterion that tracks per-edge operation-importance distributions and progressively prunes low-importance operations during search. The accelerated framework is called IAC-LTH. Results: Across four public benchmarks (ACDC, BraTS, KiTS, AMOS), several 2-D U-Net backbones, and a 2-D nnU-Net pipeline, IAC-LTH discovers IAC cells whose patient-level segmentation performance matches and sometimes slightly exceeds that of cells found by the original full-length search, while reducing wall-clock NAS cost by 3.7x to 16x across datasets and backbones. These results are consistent across architectures, benchmarks, and both non-augmented and augmented training settings, while preserving the gains of IAC-equipped U-Nets over strong attention-based and dense-skip baselines. Conclusion: Competitive IAC architectures can be identified from early-stabilizing operations without running the full search, making adaptive skip-module design more practical for medical image segmentation under realistic computational constraints.
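
摘要中基于 Jensen-Shannon 散度的稳定性判据，核心是追踪每条边上操作重要性分布在相邻搜索轮次间的变化。下面给出 JS 散度与稳定性检测的极简实现（阈值 eps 与函数名为示意性假设）：

```python
import math

def kl(p, q):
    # KL 散度，约定 0*log(0/q) = 0
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    # 对称的 Jensen-Shannon 散度：对中点分布的平均 KL
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def stable(history, eps=1e-3):
    """某条边的操作重要性分布在相邻轮次间的 JS 散度
    均低于 eps 时视为已稳定，此时即可提前剪除低重要性操作，
    而无需跑满整个可微搜索。"""
    return all(js_divergence(a, b) < eps for a, b in zip(history, history[1:]))
```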

9. [From Boundaries to Semantics: Prompt-Guided Multi-Task Learning for Petrographic Thin-section Segmentation](https://arxiv.org/abs/2604.14805v1)
   - Published：2026-04-16 17:28
   - 作者：Yili Ren，Shiqi Wen，Li Hou，Dingwen Xiao，Weiming Zhang，Caleb Chen Cao 等
   - 来源：arxiv
   - 相关性分数：63
   - 命中原因：title matched "segmentation"; has PDF; has rich summary; has complete metadata
   - 分类：cs.CV
   - 标签：数据 / 方法
   - 主题词：Segmentation / Alignment
   - PDF：https://arxiv.org/pdf/2604.14805v1
   - 摘要：Grain-edge segmentation (GES) and lithology semantic segmentation (LSS) are two pivotal tasks for quantifying rock fabric and composition. However, these two tasks are often treated separately, and segmentation quality remains unsatisfactory even though expensive, time-consuming, expert-annotated datasets have been used. Recently, foundation models, especially the Segment Anything Model (SAM), have demonstrated impressive robustness for boundary alignment. However, directly adapting SAM to joint GES and LSS is nontrivial due to 1) a severe domain gap induced by extinction-dependent color variations and ultra-fine grain boundaries, and 2) the lack of modules for joint learning on multi-angle petrographic image stacks. In this paper, we propose Petro-SAM, a novel two-stage, multi-task framework that can achieve high-quality joint GES and LSS on petrographic images. Specifically, based on SAM, we introduce a Merge Block to integrate seven polarized views, effectively solving the extinction issue. Moreover, we introduce multi-scale feature fusion and color-entropy priors to refine the detection.

## PubMed AI 观察

### 本组速览

- 《Applying natural language processing and large language models to clinical notes for phenotyping and diagnosing rare diseases: a systematic review.》〔评测 / 数据 / 应用 / 方法〕：OBJECTIVES: Patients with rare diseases often face long delays before receiving a diagnosis. Using electronic health records for automated phenotyping and diag…
- 《Evaluation of large language models with clinical guidance for vetting outpatient magnetic resonance imaging lumbar spine referrals.》〔评测 / 方法〕：Objectives: Accurate triage of lumbar spine magnetic resonance imaging (MRI) referrals for sciatica is important for patient assessment, diagnosis and surgical p…
- 《From Image to Pixels: towards Fine-Grained Medical Vision-Language Models.》〔评测 / 应用 / 方法〕：Multimodal large language models (MLLMs) offer immense potential for biomedical AI, yet current applications remain limited to coarse-grained image understandi…

### 论文速览

1. [Applying natural language processing and large language models to clinical notes for phenotyping and diagnosing rare diseases: a systematic review.](https://pubmed.ncbi.nlm.nih.gov/41990239/)
   - Entered：2026-04-16 23:14
   - 作者：Seungjun Kim，Yiliang Zhou，Yawen Guo，Changrui Xiao，Kai Zheng
   - 来源：pubmed
   - 相关性分数：113
   - 命中原因：title matched "language model"; title matched "clinical"; has DOI; has rich summary
   - 分类：Journal Article
   - 标签：评测 / 数据 / 应用 / 方法
   - 主题词：Evaluation / Language Model
   - 摘要：OBJECTIVES: Patients with rare diseases often face long delays before receiving a diagnosis. Using electronic health records for automated phenotyping and diagnosis of rare diseases is a promising approach but can be challenging because critical information is often recorded in unstructured notes rather than structured fields. This systematic review synthesizes the current literature applying natural language processing (NLP) and large language models (LLMs) for rare disease phenotyping and diagnosis from clinical text. MATERIALS AND METHODS: A systematic search was conducted in PubMed, ACM Digital Library, and IEEE Xplore. Two reviewers independently screened papers and extracted data. Methodological rigor and quality of the studies were evaluated using the MI-CLAIM framework. RESULTS: The search resulted in 135 studies; 27 of them met the inclusion criteria. Methods used spanned rule-based systems, classical ML/DL models, transformer architectures, and LLMs. Transformer- and LLM-based approaches outperformed earlier methods in entity recognition, phenotype extraction, and diagnostic ranking. Several studies demonstrated clinical impact, such as increased genetic testing and identification of undiagnosed cases. However, most studies relied on retrospective and single-center datasets. Reporting of preprocessing, evaluation, and reproducibility was largely inconsistent, and interpretability, fairness, and privacy were rarely addressed. DISCUSSION: Natural language processing and LLMs show strong potential to accelerate rare disease diagnosis. However, heterogeneity in methods and metrics hinders cross-study comparability. Data scarcity, lack of generalization, and limited transparency remain significant challenges. CONCLUSIONS: Natural language processing/LLM methods can support timely diagnosis of rare diseases using unstructured clinical text. Future research should prioritize multicenter studies, standardized evaluation frameworks, transparency, and fairness safeguards to enable reliable, equitable deployment.

2. [Evaluation of large language models with clinical guidance for vetting outpatient magnetic resonance imaging lumbar spine referrals.](https://pubmed.ncbi.nlm.nih.gov/41989203/)
   - Entered：2026-04-16 17:14
   - 作者：William Clackett，Hatim Alsusa，Hannah Watson，Antanas Kascenas，David Scott，Avinash K Kanodia 等
   - 来源：pubmed
   - 相关性分数：107
   - 命中原因：title matched "language model"; title matched "clinical"; has DOI; has rich summary
   - 分类：Journal Article
   - 标签：评测 / 方法
   - 主题词：Evaluation / Language Model
   - 摘要：Objectives: Accurate triage of lumbar spine magnetic resonance imaging (MRI) referrals for sciatica is important for patient assessment, diagnosis and surgical planning. This study evaluates the accuracy and speed of large language models (LLMs) in automatically vetting lumbar spine MRI referrals from general practice. Methods: Three LLMs (GPT-4, Claude Opus, Gemini) were tasked with assigning an outcome (Accept - Routine, Accept - Urgent, Reject) and flagging MRI contraindications for lumbar spine referrals. Three prompts of increasing detail, including clinical guidelines and training examples, were used. Two radiology registrars synthesised 120 referrals, vetted by two board-certified radiologists, with a third resolving disagreements. Performance was assessed using accuracy, precision, recall and F1 scores. Results: Inter-rater agreement between radiologists was substantial for vetting outcome (Cohen's κ = 0.76) and contraindication detection (κ = 0.68). Claude Opus with the full prompt achieved the highest accuracy (0.86) for vetting outcomes. GPT-4 with the instruction-only prompt achieved the highest F1 score (0.88) for contraindication detection. LLMs completed the task substantially faster than radiologists (9.8 ± 1.0 vs 135.0 ± 45.0 min). Conclusions: LLMs demonstrate promising performance in vetting radiological referrals for sciatica, particularly with detailed context. All models identified all urgent referrals, suggesting potential for prioritising vetting worklists and improving timeliness of care.

3. [From Image to Pixels: towards Fine-Grained Medical Vision-Language Models.](https://pubmed.ncbi.nlm.nih.gov/41989909/)
   - Entered：2026-04-16 20:32
   - 作者：Lingdong Shen，Xiaoshuang Huang，Fangxin Shang，Xudong Zhang，Yehui Yang，Bin Fan 等
   - 来源：pubmed
   - 相关性分数：106
   - 命中原因：title matched "language model"; summary matched "benchmark"; summary matched "clinical"; has DOI
   - 分类：Journal Article
   - 标签：评测 / 应用 / 方法
   - 主题词：Benchmark / Reasoning
   - 摘要：Multimodal large language models (MLLMs) offer immense potential for biomedical AI, yet current applications remain limited to coarse-grained image understanding and basic textual queries, falling short of the fine-grained reasoning required in clinical contexts. In this work, we present a comprehensive solution spanning data, model, and training innovations to advance pixel-level multimodal intelligence in biomedicine. First, we construct MeCoVQA, a new visual-language benchmark that spans eight medical imaging modalities and four core tasks, supporting both spatially-grounded reasoning and fine-grained diagnostic comprehension. Building on this, we introduce MedPLIB, an end-to-end biomedical MLLM equipped with pixel-level visual understanding. MedPLIB supports diverse multimodal tasks, including VQA, point- and region-based querying, grounding, and segmentation, through unified modeling. To further accommodate the heterogeneous nature of biomedical tasks, we design a task-specialized Mixture-of-Experts (MoE) architecture, where each expert is tailored to a specific task and jointly optimized via unified fine-tuning. This modular design accommodates diverse biomedical tasks while maintaining a unified and efficient architecture. By integrating retrieval-augmented generation (RAG) and in-context learning (ICL), MedPLIB also demonstrates strong generalization on out-of-distribution (OOD) medical image segmentation. Experiments across multiple benchmarks show that MedPLIB sets a new state-of-the-art on biomedical vision-language tasks; notably, it outperforms the best existing small and large models by 19.7 and 15.6 mDice in zero-shot pixel-level grounding, highlighting its clinical utility and generalization strength. Code and data are publicly available at GitHub: https://github.com/ShawnHuang497/MedPLIB.

4. [Targeted use of large language models for EHR-based computable phenotyping.](https://pubmed.ncbi.nlm.nih.gov/41990328/)
   - Entered：2026-04-17 00:43
   - 作者：Dylan Owens，Jing Cao，Mehak Gupta，Danh Nguyen，Eric Peterson，Ann Marie Navar
   - 来源：pubmed
   - 相关性分数：93
   - 命中原因：title matched "language model"; summary matched "clinical"; has DOI; has rich summary
   - 分类：Journal Article
   - 标签：应用 / 方法
   - 主题词：Language Model / Clinical
   - 摘要：OBJECTIVE: Computable phenotypes derived from electronic health records (EHRs) are central to clinical research and quality reporting. Although large language models (LLMs) can extract clinically rich information from unstructured notes, routine application to all patients is computationally expensive. We evaluated whether uncertainty-guided selective use of LLMs can improve phenotyping accuracy while preserving scalability. MATERIALS AND METHODS: We developed a selective augmentation framework integrating structured and unstructured EHR data using uncertainty-guided triage. An ensemble of heterogeneous classifiers trained on structured data generated probabilistic phenotype predictions and uncertainty measures to identify patients at elevated risk of misclassification. Only flagged patients underwent LLM-based analysis of unstructured clinical notes using retrieval-augmented generation. LLM-derived outputs were incorporated as additional predictors in a final probabilistic model. Performance was evaluated for two registry-based phenotypes: diabetes mellitus and peripheral arterial disease (PAD), using internal cross-registry and external validation cohorts. RESULTS: For diabetes mellitus, selective augmentation improved sensitivity in the internal validation cohort from 0.81 to 0.90 without loss of specificity (0.92). More than 70% of triage-flagged patients represented misclassifications by structured data alone. For PAD, selective augmentation markedly increased sensitivity from 0.18 to 0.97 while maintaining high specificity (0.99), requiring LLM analysis for only 10% of patients. DISCUSSION: Uncertainty-guided triage efficiently concentrated LLM use on patients most likely to benefit, improving case identification-particularly for phenotypes poorly captured by structured data-while minimizing computational burden. CONCLUSION: Selective, uncertainty-guided integration of LLMs enables scalable, interpretable, and accurate EHR-based phenotyping, offering a practical alternative to universal LLM deployment in real-world informatics workflows.
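
摘要中「不确定性引导的分诊」可以这样勾勒：集成分类器给出每位患者的表型概率，概率落在模糊区间或集成分歧大的患者才送入昂贵的 LLM 病历分析（阈值与判据均为示意性假设，非论文参数）：

```python
import statistics

def triage(patient_probs, low=0.2, high=0.8):
    """选择性增强的分诊环节：patient_probs 为 {患者: 集成概率列表}。
    均值处于模糊区间 (low, high) 或集成分歧大（标准差超阈值）的患者
    被标记送 LLM 复核；其余直接采用结构化数据的判定。"""
    flagged, decided = [], {}
    for pid, probs in patient_probs.items():
        mean = statistics.mean(probs)
        disagree = statistics.pstdev(probs) > 0.15
        if low < mean < high or disagree:
            flagged.append(pid)          # 送入基于 LLM 的病历分析
        else:
            decided[pid] = mean >= high  # 置信的结构化数据判定
    return flagged, decided
```

这正对应摘要的结论：LLM 只用于最可能被结构化数据误判的少数患者，从而兼顾准确性与可扩展性。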

5. [Dual perspectives on large language models in rheumatology: physician-rated quality and patient-centered usability of GPT-4o versus DeepSeek-V3.](https://pubmed.ncbi.nlm.nih.gov/41989204/)
   - Entered：2026-04-16 17:22
   - 作者：Bulent Akyuz，Ilhan Sezer，Ali Nail Demir，Cahit Kacar
   - 来源：pubmed
   - 相关性分数：85
   - 命中原因：title matched "language model"; summary matched "clinical"; has DOI; has rich summary
   - 分类：Journal Article
   - 标签：评测 / 方法
   - 主题词：Evaluation / Language Model
   - 摘要：OBJECTIVES: This study conducted an informatics system evaluation of two LLMs (GPT-4o and DeepSeek-V3) for patient education, combining clinician-rated quality with patient-perceived usability across thematically stratified queries. MATERIALS AND METHODS: In a blinded, within-subject design, 16 frequently asked questions about biologic therapies were categorized into three domains: treatment/drug selection, safety/adverse effects, and special conditions/daily life. Responses were standardized, generated without external retrieval, and anonymized as A/B pairs. Thirty physicians assessed clinical appropriateness, scientific accuracy, and comprehensiveness, while 60 patients rated readability, understandability, actionability, perceived adequacy, decision support, and trust on 5-point Likert scales. Analyses included paired t-tests, Holm/FDR corrections, and two one-sided tests (TOST) to distinguish statistical non-difference from practical equivalence. RESULTS: Physicians rated GPT higher across all domains (p < .002), with largest gaps in safety/side effects and treatment/drug selection. Patients favored GPT for understandability, actionability, and decision support (p < .001), while readability, adequacy, trust, and reading time were statistically and clinically equivalent. CONCLUSION: Findings highlight the need for topic-aware governance: guideline-dense queries suited to retrieval-augmented generation and checklist compliance, and context-sensitive queries requiring uncertainty signaling and human oversight. This layered approach advances health informatics by defining where LLMs may substitute versus where they require verification, supporting safe and auditable integration into patient education.

## OpenAlex AI 观察

今日没有新的命中文献。
