# Daily Paper Digest

- Generated: 2026-04-23 11:42:13 (Asia/Shanghai)
- Search window: last 24 hours
- Hit overview: LLM=15, Vision=9, PubMed AI=5, OpenAlex AI=0
- Ranking strategy: hybrid (relevance first, published_at tie-break)
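
The hybrid ordering can be sketched as a two-key sort. This is a minimal illustration, assuming each paper is a dictionary with hypothetical `title`, `score`, and `published_at` keys; the digest's actual data model is not shown here, and "newest first" among equal scores is inferred from the order of the entries in the paper list rather than stated anywhere.

```python
from datetime import datetime


def rank_papers(papers):
    """Order papers by relevance score, newest first among equal scores."""
    # Both keys sort descending: higher score first, then later timestamp first.
    return sorted(papers, key=lambda p: (p["score"], p["published_at"]), reverse=True)


# Hypothetical records built from scores and timestamps listed in this digest.
papers = [
    {"title": "SWE-chat", "score": 89, "published_at": datetime(2026, 4, 23, 1, 8)},
    {"title": "SpeechParaling-Bench", "score": 90, "published_at": datetime(2026, 4, 23, 1, 59)},
    {"title": "Coverage, Not Averages", "score": 89, "published_at": datetime(2026, 4, 23, 0, 49)},
    {"title": "Can AI Be a Doctor", "score": 89, "published_at": datetime(2026, 4, 23, 1, 17)},
]

ranked_titles = [p["title"] for p in rank_papers(papers)]
print(ranked_titles)
```

Sorting on a single tuple key keeps the tie-break explicit: the timestamp only matters when two scores are equal.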

## Today's Highlights

- Topic "Benchmark": 18 hits spanning LLM, Vision, and PubMed AI; representative papers include "OMIBench: Benchmarking Olympiad-Level Multi-Image Reasoning in Large Vision-Language Model" and "V-tableR1: Process-Supervised Multimodal Table Reasoning with Critic-Guided Policy Optimization".
- Topic "Language Model": 13 hits spanning LLM, Vision, and PubMed AI; representative papers include "OMIBench: Benchmarking Olympiad-Level Multi-Image Reasoning in Large Vision-Language Model" and "V-tableR1: Process-Supervised Multimodal Table Reasoning with Critic-Guided Policy Optimization".
- Topic "Reasoning": 4 hits spanning LLM and Vision; representative papers include "ONOTE: Benchmarking Omnimodal Notation Processing for Expert-level Music Intelligence" and "The Expense of Seeing: Attaining Trustworthy Multimodal Reasoning Within the Monolithic Paradigm".

## Topic Focus

### Benchmark

- Hits: 18
- Groups covered: LLM, Vision, PubMed AI
- Representative papers: "OMIBench: Benchmarking Olympiad-Level Multi-Image Reasoning in Large Vision-Language Model"; "V-tableR1: Process-Supervised Multimodal Table Reasoning with Critic-Guided Policy Optimization"; "ONOTE: Benchmarking Omnimodal Notation Processing for Expert-level Music Intelligence"
- Quick read:
  - "OMIBench: Benchmarking Olympiad-Level Multi-Image Reasoning in Large Vision-Language Model" (Evaluation / Method): Large vision-language models (LVLMs) have made substantial advances in reasoning tasks at the Olympiad level. Nevertheless, current Olympiad-level multimodal r…
  - "V-tableR1: Process-Supervised Multimodal Table Reasoning with Critic-Guided Policy Optimization" (Evaluation / Method): We introduce V-tableR1, a process-supervised reinforcement learning framework that elicits rigorous, verifiable reasoning from multimodal large language models…

### Language Model

- Hits: 13
- Groups covered: LLM, Vision, PubMed AI
- Representative papers: "OMIBench: Benchmarking Olympiad-Level Multi-Image Reasoning in Large Vision-Language Model"; "V-tableR1: Process-Supervised Multimodal Table Reasoning with Critic-Guided Policy Optimization"; "Where and What: Reasoning Dynamic and Implicit Preferences in Situated Conversational Recommendation"
- Quick read:
  - "OMIBench: Benchmarking Olympiad-Level Multi-Image Reasoning in Large Vision-Language Model" (Evaluation / Method): Large vision-language models (LVLMs) have made substantial advances in reasoning tasks at the Olympiad level. Nevertheless, current Olympiad-level multimodal r…
  - "V-tableR1: Process-Supervised Multimodal Table Reasoning with Critic-Guided Policy Optimization" (Evaluation / Method): We introduce V-tableR1, a process-supervised reinforcement learning framework that elicits rigorous, verifiable reasoning from multimodal large language models…

### Reasoning

- Hits: 4
- Groups covered: LLM, Vision
- Representative papers: "ONOTE: Benchmarking Omnimodal Notation Processing for Expert-level Music Intelligence"; "The Expense of Seeing: Attaining Trustworthy Multimodal Reasoning Within the Monolithic Paradigm"; "LLaDA2.0-Uni: Unifying Multimodal Understanding and Generation with Diffusion Large Language Model"
- Quick read:
  - "ONOTE: Benchmarking Omnimodal Notation Processing for Expert-level Music Intelligence" (Evaluation / Application / Method): Omnimodal Notation Processing (ONP) represents a unique frontier for omnimodal AI due to the rigorous, multi-dimensional alignment required across auditory, vi…
  - "The Expense of Seeing: Attaining Trustworthy Multimodal Reasoning Within the Monolithic Paradigm" (Evaluation / Data / Method): The rapid proliferation of Vision-Language Models (VLMs) is widely celebrated as the dawn of unified multimodal knowledge discovery but its foundation operates…

### Clinical

- Hits: 4
- Groups covered: PubMed AI
- Representative papers: "Immune checkpoint inhibitors in POLE/POLD1 proofreading-deficient CRC: from molecular basis to clinical practice and future directions"; "Establishing procedure-specific minimal clinically important difference and patient acceptable symptom state thresholds after anterior combined latissimus dorsi and teres major tendon transfer for irreparable anterosuperior cuff tears: minimum 5-year outcomes"; "Diagnostic Modalities and Nodal Staging in High-Risk Cutaneous Squamous Cell Carcinoma"
- Quick read:
  - "Immune checkpoint inhibitors in POLE/POLD1 proofreading-deficient CRC: from molecular basis to clinical practice and future directions" (Evaluation / Application / Method): Immune checkpoint inhibitors (ICIs) have fundamentally reshaped the therapeutic paradigm for metastatic colorectal cancer (mCRC). Beyond the established dMMR/M…
  - "Establishing procedure-specific minimal clinically important difference and patient acceptable symptom state thresholds after anterior combined latissimus dorsi and teres major tendon transfer for irreparable anterosuperior cuff tears: minimum 5-year outcomes" (Evaluation / Application / Method): BACKGROUND: To date, no prior study has established procedure-specific minimal clinically important difference (MCID) and patient acceptable symptom state (PAS…

### Diffusion

- Hits: 4
- Groups covered: Vision
- Representative papers: "Hallucination Early Detection in Diffusion Models"; "ParetoSlider: Diffusion Models Post-Training for Continuous Reward Control"; "GeoRelight: Learning Joint Geometrical Relighting and Reconstruction with Flexible Multi-Modal Diffusion Transformers"
- Quick read:
  - "Hallucination Early Detection in Diffusion Models" (Data / Method): Text-to-Image generation has seen significant advancements in output realism with the advent of diffusion models. However, diffusion models encounter difficult…
  - "ParetoSlider: Diffusion Models Post-Training for Continuous Reward Control" (Method): Reinforcement Learning (RL) post-training has become the standard for aligning generative models with human preferences, yet most methods rely on a single scal…

## LLM Observations

### Group at a Glance

- "OMIBench: Benchmarking Olympiad-Level Multi-Image Reasoning in Large Vision-Language Model" (Evaluation / Method): Large vision-language models (LVLMs) have made substantial advances in reasoning tasks at the Olympiad level. Nevertheless, current Olympiad-level multimodal r…
- "V-tableR1: Process-Supervised Multimodal Table Reasoning with Critic-Guided Policy Optimization" (Evaluation / Method): We introduce V-tableR1, a process-supervised reinforcement learning framework that elicits rigorous, verifiable reasoning from multimodal large language models…
- "ONOTE: Benchmarking Omnimodal Notation Processing for Expert-level Music Intelligence" (Evaluation / Application / Method): Omnimodal Notation Processing (ONP) represents a unique frontier for omnimodal AI due to the rigorous, multi-dimensional alignment required across auditory, vi…

### Papers at a Glance

1. [OMIBench: Benchmarking Olympiad-Level Multi-Image Reasoning in Large Vision-Language Model](https://arxiv.org/abs/2604.20806v1)
   - Published: 2026-04-23 01:37
   - Authors: Qiguang Chen, Chengyu Luan, Jiajun Wu, Qiming Yu, Yi Yang, Yizhuo Li, et al.
   - Source: arXiv
   - Relevance score: 129
   - Match reasons: title matched "reasoning"; title matched "benchmark"; summary matched "evaluation"; has PDF
   - Categories: cs.CV, cs.AI, cs.CL
   - Tags: Evaluation / Method
   - Topics: Benchmark / Language Model
   - PDF: https://arxiv.org/pdf/2604.20806v1
   - Abstract: Large vision-language models (LVLMs) have made substantial advances in reasoning tasks at the Olympiad level. Nevertheless, current Olympiad-level multimodal reasoning benchmarks for these models often emphasize single-image analysis and fail to exploit contextual information across multiple images. We present OMIBench, a benchmark designed to evaluate Olympiad-level reasoning when the required evidence is distributed over multiple images. It contains problems from biology, chemistry, mathematics, and physics Olympiads, together with manually annotated rationales and evaluation protocols for both exact and semantic answer matching. Across extensive experiments on OMIBench, we observe meaningful performance gaps in existing models. Even the strongest LVLMs, such as Gemini-3-Pro, attain only about 50% on the benchmark. These results position OMIBench as a focused resource for studying and improving multi-image reasoning in LVLMs.

2. [V-tableR1: Process-Supervised Multimodal Table Reasoning with Critic-Guided Policy Optimization](https://arxiv.org/abs/2604.20755v1)
   - Published: 2026-04-23 00:44
   - Authors: Yubo Jiang, Yitong An, Xin Yang, Abudukelimu Wuerkaixi, Xuxin Cheng, Fengying Xie, et al.
   - Source: arXiv
   - Relevance score: 125
   - Match reasons: title matched "reasoning"; summary matched "alignment"; summary matched "benchmark"; summary matched "evaluation"
   - Categories: cs.AI, cs.LG
   - Tags: Evaluation / Method
   - Topics: Benchmark / Language Model
   - PDF: https://arxiv.org/pdf/2604.20755v1
   - Abstract: We introduce V-tableR1, a process-supervised reinforcement learning framework that elicits rigorous, verifiable reasoning from multimodal large language models (MLLMs). Current MLLMs trained solely on final outcomes often treat visual reasoning as a black box, relying on superficial pattern matching rather than performing rigorous multi-step inference. While Reinforcement Learning with Verifiable Rewards could enforce transparent reasoning trajectories, extending it to visual domains remains severely hindered by the ambiguity of grounding abstract logic into continuous pixel space. We solve this by leveraging the deterministic grid structure of tables as an ideal visual testbed. V-tableR1 employs a specialized critic VLM to provide dense, step-level feedback on the explicit visual chain-of-thought generated by a policy VLM. To optimize this system, we propose Process-Guided Direct Alignment Policy Optimization (PGPO), a novel RL algorithm integrating process rewards, decoupled policy constraints, and length-aware dynamic sampling. Extensive evaluations demonstrate that V-tableR1 explicitly penalizes visual hallucinations and shortcut guessing. By fundamentally shifting multimodal inference from black-box pattern matching to verifiable logical derivation, V-tableR1 4B establishes state-of-the-art accuracy among open-source models on complex tabular benchmarks, outperforming models up to 18x its size and improving over its SFT baseline…

3. [ONOTE: Benchmarking Omnimodal Notation Processing for Expert-level Music Intelligence](https://arxiv.org/abs/2604.20719v1)
   - Published: 2026-04-23 00:06
   - Authors: Menghe Ma, Siqing Wei, Yuecheng Xing, Yaheng Wang, Fanhong Meng, Peijun Han, et al.
   - Source: arXiv
   - Relevance score: 124
   - Match reasons: title matched "benchmark"; summary matched "reasoning"; summary matched "alignment"; summary matched "evaluation"
   - Categories: cs.SD, cs.AI, cs.MM, eess.AS
   - Tags: Evaluation / Application / Method
   - Topics: Benchmark / Reasoning
   - PDF: https://arxiv.org/pdf/2604.20719v1
   - Abstract: Omnimodal Notation Processing (ONP) represents a unique frontier for omnimodal AI due to the rigorous, multi-dimensional alignment required across auditory, visual, and symbolic domains. Current research remains fragmented, focusing on isolated transcription tasks that fail to bridge the gap between superficial pattern recognition and the underlying musical logic. This landscape is further complicated by severe notation biases toward Western staff and the inherent unreliability of "LLM-as-a-judge" metrics, which often mask structural reasoning failures with systemic hallucinations. To establish a more rigorous standard, we introduce ONOTE, a multi-format benchmark that utilizes a deterministic pipeline--grounded in canonical pitch projection--to eliminate subjective scoring biases across diverse notation systems. Our evaluation of leading omnimodal models exposes a fundamental disconnect between perceptual accuracy and music-theoretic comprehension, providing a necessary framework for diagnosing reasoning vulnerabilities in complex, rule-constrained domains.

4. [Where and What: Reasoning Dynamic and Implicit Preferences in Situated Conversational Recommendation](https://arxiv.org/abs/2604.20749v1)
   - Published: 2026-04-23 00:39
   - Authors: Dongding Lin, Jian Wang, Yongqi Li, Wenjie Li
   - Source: arXiv
   - Relevance score: 106
   - Match reasons: title matched "reasoning"; summary matched "alignment"; summary matched "benchmark"; has PDF
   - Categories: cs.AI
   - Tags: Evaluation / Application / Method
   - Topics: Benchmark / Language Model
   - PDF: https://arxiv.org/pdf/2604.20749v1
   - Abstract: Situated conversational recommendation (SCR), which utilizes visual scenes grounded in specific environments and natural language dialogue to deliver contextually appropriate recommendations, has emerged as a promising research direction due to its close alignment with real-world scenarios. Compared to traditional recommendations, SCR requires a deeper understanding of dynamic and implicit user preferences, as the surrounding scene often influences users' underlying interests, while both may evolve across conversations. This complexity significantly impacts the timing and relevance of recommendations. To address this, we propose situated preference reasoning (SiPeR), a novel framework that integrates two core mechanisms: (1) Scene transition estimation, which estimates whether the current scene satisfies user needs, and guides the user toward a more suitable scene when necessary; and (2) Bayesian inverse inference, which leverages the likelihood of multimodal large language models (MLLMs) to predict user preferences about candidate items within the scene. Extensive experiments on two representative benchmarks demonstrate SiPeR's superiority in both recommendation accuracy and response generation quality. The code and data are available at https://github.com/DongdingLin/SiPeR.

5. [Cooperative Profiles Predict Multi-Agent LLM Team Performance in AI for Science Workflows](https://arxiv.org/abs/2604.20658v1)
   - Published: 2026-04-22 23:07
   - Authors: Shivani Kumar, Adarsh Bharathwaj, David Jurgens
   - Source: arXiv
   - Relevance score: 105
   - Match reasons: title matched "agent"; summary matched "reasoning"; summary matched "benchmark"; has PDF
   - Categories: cs.CL
   - Tags: Evaluation / Application / Method
   - Topics: Benchmark / Language Model
   - PDF: https://arxiv.org/pdf/2604.20658v1
   - Abstract: Multi-agent systems built from teams of large language models (LLMs) are increasingly deployed for collaborative scientific reasoning and problem-solving. These systems require agents to coordinate under shared constraints, such as GPUs or credit balances, where cooperative behavior matters. Behavioral economics provides a rich toolkit of games that isolate distinct cooperation mechanisms, yet it remains unknown whether a model's behavior in these stylized settings predicts its performance in realistic collaborative tasks. Here, we benchmark 35 open-weight LLMs across six behavioral economics games and show that game-derived cooperative profiles robustly predict downstream performance in AI-for-Science tasks, where teams of LLM agents collaboratively analyze data, build models, and produce scientific reports under shared budget constraints. Models that effectively coordinate games and invest in multiplicative team production (rather than greedy strategies) produce better scientific reports across three outcomes: accuracy, quality, and completion. These associations hold after controlling for multiple factors, indicating that cooperative disposition is a distinct, measurable property of LLMs not reducible to general ability. Our behavioral games framework thus offers a fast and inexpensive diagnostic for screening cooperative fitness before costly multi-agent deployment.

6. [Relative Principals, Pluralistic Alignment, and the Structural Value Alignment Problem](https://arxiv.org/abs/2604.20805v1)
   - Published: 2026-04-23 01:36
   - Authors: Travis LaCroix
   - Source: arXiv
   - Relevance score: 101
   - Match reasons: title matched "alignment"; summary matched "agent"; has DOI; has PDF
   - Categories: cs.CY, cs.AI, cs.MA
   - Tags: Method
   - Topics: Agent / Alignment
   - PDF: https://arxiv.org/pdf/2604.20805v1
   - Abstract: The value alignment problem for artificial intelligence (AI) is often framed as a purely technical or normative challenge, sometimes focused on hypothetical future systems. I argue that the problem is better understood as a structural question about governance: not whether an AI system is aligned in the abstract, but whether it is aligned enough, for whom, and at what cost. Drawing on the principal-agent framework from economics, this paper reconceptualises misalignment as arising along three interacting axes: objectives, information, and principals. The three-axis framework provides a systematic way of diagnosing why misalignment arises in real-world systems and clarifies that alignment cannot be treated as a single technical property of models but is instead an outcome shaped by how objectives are specified, how information is distributed, and whose interests count in practice. The core contribution of this paper is to show that the three-axis decomposition implies that alignment is fundamentally a problem of governance rather than engineering alone. From this perspective, alignment is inherently pluralistic and context-dependent, and resolving misalignment involves trade-offs among competing values. Because misalignment can occur along each axis -- and affect stakeholders differently -- the structural description shows that alignment cannot be "solved" through technical design alone, but must be managed through ongoing institutional processes that determine how objectives are set, how systems are evaluated, and how affected communities can contest or reshape those decisions.

7. [Automatic Ontology Construction Using LLMs as an External Layer of Memory, Verification, and Planning for Hybrid Intelligent Systems](https://arxiv.org/abs/2604.20795v1)
   - Published: 2026-04-23 01:19
   - Authors: Pavel Salovskii, Iuliia Gorshkova
   - Source: arXiv
   - Relevance score: 97
   - Match reasons: summary matched "agent"; summary matched "reasoning"; summary matched "benchmark"; has DOI
   - Categories: cs.AI
   - Tags: Evaluation / Application / Method
   - Topics: Benchmark / Language Model
   - PDF: https://arxiv.org/pdf/2604.20795v1
   - Abstract: This paper presents a hybrid architecture for intelligent systems in which large language models (LLMs) are extended with an external ontological memory layer. Instead of relying solely on parametric knowledge and vector-based retrieval (RAG), the proposed approach constructs and maintains a structured knowledge graph using RDF/OWL representations, enabling persistent, verifiable, and semantically grounded reasoning. The core contribution is an automated pipeline for ontology construction from heterogeneous data sources, including documents, APIs, and dialogue logs. The system performs entity recognition, relation extraction, normalization, and triple generation, followed by validation using SHACL and OWL constraints, and continuous graph updates. During inference, LLMs operate over a combined context that integrates vector-based retrieval with graph-based reasoning and external tool interaction. Experimental observations on planning tasks, including the Tower of Hanoi benchmark, indicate that ontology augmentation improves performance in multi-step reasoning scenarios compared to baseline LLM systems. In addition, the ontology layer enables formal validation of generated outputs, transforming the system into a generation-verification-correction pipeline. The proposed architecture addresses key limitations of current LLM-based systems, including lack of long-term memory, weak structural understanding, and limited reasoning capabilities. It provides a foundation for building agent-based systems, robotics applications, and enterprise AI solutions that require persistent knowledge, explainability, and reliable decision-making.

8. [SpeechParaling-Bench: A Comprehensive Benchmark for Paralinguistic-Aware Speech Generation](https://arxiv.org/abs/2604.20842v1)
   - Published: 2026-04-23 01:59
   - Authors: Ruohan Liu, Shukang Yin, Tao Wang, Dong Zhang, Weiji Zhuang, Shuhuai Ren, et al.
   - Source: arXiv
   - Relevance score: 90
   - Match reasons: title matched "benchmark"; summary matched "evaluation"; has PDF; has rich summary
   - Categories: cs.CL, cs.AI, cs.SD
   - Tags: Evaluation / Data / Application / Method
   - Topics: Benchmark / Language Model
   - PDF: https://arxiv.org/pdf/2604.20842v1
   - Abstract: Paralinguistic cues are essential for natural human-computer interaction, yet their evaluation in Large Audio-Language Models (LALMs) remains limited by coarse feature coverage and the inherent subjectivity of assessment. To address these challenges, we introduce SpeechParaling-Bench, a comprehensive benchmark for paralinguistic-aware speech generation. It expands existing coverage from fewer than 50 to over 100 fine-grained features, supported by more than 1,000 English-Chinese parallel speech queries, and is organized into three progressively challenging tasks: fine-grained control, intra-utterance variation, and context-aware adaptation. To enable reliable evaluation, we further develop a pairwise comparison pipeline, in which candidate responses are evaluated against a fixed baseline by an LALM-based judge. By framing evaluation as relative preference rather than absolute scoring, this approach mitigates subjectivity and yields more stable and scalable assessments without costly human annotation. Extensive experiments reveal substantial limitations in current LALMs. Even leading proprietary models struggle with comprehensive static control and dynamic modulation of paralinguistic features, while failure to correctly interpret paralinguistic cues accounts for 43.3% of errors in situational dialogue. These findings underscore the need for more robust paralinguistic modeling toward human-aligned voice assistants.

9. [Can "AI" Be a Doctor? A Study of Empathy, Readability, and Alignment in Clinical LLMs](https://arxiv.org/abs/2604.20791v1)
   - Published: 2026-04-23 01:17
   - Authors: Mariano Barone, Francesco Di Serio, Roberto Moio, Marco Postiglione, Giuseppe Riccio, Antonio Romano, et al.
   - Source: arXiv
   - Relevance score: 89
   - Match reasons: title matched "alignment"; summary matched "evaluation"; has PDF; has rich summary
   - Categories: cs.CL, cs.AI
   - Tags: Evaluation / Application / Method
   - Topics: Language Model / Evaluation
   - PDF: https://arxiv.org/pdf/2604.20791v1
   - Abstract: Large Language Models (LLMs) are increasingly deployed in healthcare, yet their communicative alignment with clinical standards remains insufficiently quantified. We conduct a multidimensional evaluation of general-purpose and domain-specialized LLMs across structured medical explanations and real-world physician-patient interactions, analyzing semantic fidelity, readability, and affective resonance. Baseline models amplify affective polarity relative to physicians (Very Negative: 43.14-45.10% vs. 37.25%) and, in larger architectures such as GPT-5 and Claude, produce substantially higher linguistic complexity (FKGL up to 16.91-17.60 vs. 11.47-12.50 in physician-authored responses). Empathy-oriented prompting reduces extreme negativity and lowers grade-level complexity (up to -6.87 FKGL points for GPT-5) but does not significantly increase semantic fidelity. Collaborative rewriting yields the strongest overall alignment. Rephrase configurations achieve the highest semantic similarity to physician answers (up to mean = 0.93) while consistently improving readability and reducing affective extremity. Dual stakeholder evaluation shows that no model surpasses physicians on epistemic criteria, whereas patients consistently prefer rewritten variants for clarity and emotional tone. These findings suggest that LLMs function most effectively as collaborative communication enhancers rather than replacements for clinical expertise.

10. [SWE-chat: Coding Agent Interactions From Real Users in the Wild](https://arxiv.org/abs/2604.20779v1)
    - Published: 2026-04-23 01:08
    - Authors: Joachim Baumann, Vishakh Padmakumar, Xiang Li, John Yang, Diyi Yang, Sanmi Koyejo
    - Source: arXiv
    - Relevance score: 89
    - Match reasons: title matched "agent"; summary matched "benchmark"; has PDF; has rich summary
    - Categories: cs.AI, cs.CY, cs.SE
    - Tags: Evaluation / Data / Application / Method
    - Topics: Benchmark / Agent
    - PDF: https://arxiv.org/pdf/2604.20779v1
    - Abstract: AI coding agents are being adopted at scale, yet we lack empirical evidence on how people actually use them and how much of their output is useful in practice. We present SWE-chat, the first large-scale dataset of real coding agent sessions collected from open-source developers in the wild. The dataset currently contains 6,000 sessions, comprising more than 63,000 user prompts and 355,000 agent tool calls. SWE-chat is a living dataset; our collection pipeline automatically and continually discovers and processes sessions from public repositories. Leveraging SWE-chat, we provide an initial empirical characterization of real-world coding agent usage and failure modes. We find that coding patterns are bimodal: in 41% of sessions, agents author virtually all committed code ("vibe coding"), while in 23%, humans write all code themselves. Despite rapidly improving capabilities, coding agents remain inefficient in natural settings. Just 44% of all agent-produced code survives into user commits, and agent-written code introduces more security vulnerabilities than code authored by humans. Furthermore, users push back against agent outputs -- through corrections, failure reports, and interruptions -- in 44% of all turns. By capturing complete interaction traces with human vs. agent code authorship attribution, SWE-chat provides an empirical foundation for moving beyond curated benchmarks towards an evidence-based understanding of how AI agents perform in real developer workflows.

11. [Coverage, Not Averages: Semantic Stratification for Trustworthy Retrieval Evaluation](https://arxiv.org/abs/2604.20763v1)
    - Published: 2026-04-23 00:49
    - Authors: Andrew Klearman, Radu Revutchi, Rohin Garg, Rishav Chakravarti, Samuel Marc Denton, Yuan Xue
    - Source: arXiv
    - Relevance score: 89
    - Match reasons: title matched "evaluation"; summary matched "benchmark"; has PDF; has rich summary
    - Categories: cs.IR, cs.AI, cs.LG
    - Tags: Evaluation / Data / Method
    - Topics: Benchmark / Evaluation
    - PDF: https://arxiv.org/pdf/2604.20763v1
    - Abstract: Retrieval quality is the primary bottleneck for accuracy and robustness in retrieval-augmented generation (RAG). Current evaluation relies on heuristically constructed query sets, which introduce a hidden intrinsic bias. We formalize retrieval evaluation as a statistical estimation problem, showing that metric reliability is fundamentally limited by the evaluation-set construction. We further introduce *semantic stratification*, which grounds evaluation in corpus structure by organizing documents into an interpretable global space of entity-based clusters and systematically generating queries for missing strata. This yields (1) formal semantic coverage guarantees across retrieval regimes and (2) interpretable visibility into retrieval failure modes. Experiments across multiple benchmarks and retrieval methods validate our framework. The results expose systematic coverage gaps, identify structural signals that explain variance in retrieval performance, and show that stratified evaluation yields more stable and transparent assessments while supporting more trustworthy decision-making than aggregate metrics.

12. [RespondeoQA: a Benchmark for Bilingual Latin-English Question Answering](https://arxiv.org/abs/2604.20738v1)
    - Published: 2026-04-23 00:24
    - Authors: Marisa Hudspeth, Patrick J. Burns, Brendan O'Connor
    - Source: arXiv
    - Relevance score: 88
    - Match reasons: title matched "benchmark"; summary matched "reasoning"; has PDF; has rich summary
    - Categories: cs.CL
    - Tags: Evaluation / Data / Method
    - Topics: Benchmark / Language Model
    - PDF: https://arxiv.org/pdf/2604.20738v1
    - Abstract: We introduce a benchmark dataset for question answering and translation in bilingual Latin and English settings, containing about 7,800 question-answer pairs. The questions are drawn from Latin pedagogical sources, including exams, quizbowl-style trivia, and textbooks ranging from the 1800s to the present. After automated extraction, cleaning, and manual review, the dataset covers a diverse range of question types: knowledge- and skill-based, multihop reasoning, constrained translation, and mixed language pairs. To our knowledge, this is the first QA benchmark centered on Latin. As a case study, we evaluate three large language models -- LLaMa 3, Qwen QwQ, and OpenAI's o3-mini -- finding that all perform worse on skill-oriented questions. Although the reasoning models perform better on scansion and literary-device tasks, they offer limited improvement overall. QwQ performs slightly better on questions asked in Latin, but LLaMa 3 and o3-mini are more task dependent. This dataset provides a new resource for assessing model capabilities in a specialized linguistic and cultural domain, and the creation process can be easily adapted for other languages. The dataset is available at: https://github.com/slanglab/RespondeoQA

13. [Learning to Evolve: A Self-Improving Framework for Multi-Agent Systems via Textual Parameter Graph Optimization](https://arxiv.org/abs/2604.20714v1)
    - Published: 2026-04-23 00:00
    - Authors: Shan He, Runze Wang, Zhuoyun Du, Huiyu Bai, Zouying Cao, Yu Cheng, et al.
    - Source: arXiv
    - Relevance score: 88
    - Match reasons: title matched "agent"; summary matched "benchmark"; has PDF; has rich summary
    - Categories: cs.AI
    - Tags: Evaluation / Application / Method
    - Topics: Benchmark / Agent
    - PDF: https://arxiv.org/pdf/2604.20714v1
    - Abstract: Designing and optimizing multi-agent systems (MAS) is a complex, labor-intensive process of "Agent Engineering." Existing automatic optimization methods, primarily focused on flat prompt tuning, lack the structural awareness to debug the intricate web of interactions in MAS. More critically, these optimizers are static; they do not learn from experience to improve their own optimization strategies. To address these gaps, we introduce Textual Parameter Graph Optimization (TPGO), a framework that enables a multi-agent system to learn to evolve. TPGO first models the MAS as a Textual Parameter Graph (TPG), where agents, tools, and workflows are modular, optimizable nodes. To guide evolution, we derive "textual gradients," structured natural language feedback from execution traces, to pinpoint failures and suggest granular modifications. The core of our framework is Group Relative Agent Optimization (GRAO), a novel meta-learning strategy that learns from historical optimization experiences. By analyzing past successes and failures, GRAO becomes progressively better at proposing effective updates, allowing the system to learn how to optimize itself. Extensive experiments on complex benchmarks like GAIA and MCP-Universe show that TPGO significantly enhances the performance of state-of-the-art agent frameworks, achieving higher success rates through automated, self-improving optimization.

14. [The Expense of Seeing: Attaining Trustworthy Multimodal Reasoning Within the Monolithic Paradigm](https://arxiv.org/abs/2604.20665v1)
    - Published: 2026-04-22 23:15
    - Authors: Karan Goyal, Dikshant Kukreja
    - Source: arXiv
    - Relevance score: 87
    - Match reasons: title matched "reasoning"; summary matched "evaluation"; has PDF; has rich summary
    - Categories: cs.CV, cs.AI
    - Tags: Evaluation / Data / Method
    - Topics: Language Model / Reasoning
    - PDF: https://arxiv.org/pdf/2604.20665v1
    - Abstract: The rapid proliferation of Vision-Language Models (VLMs) is widely celebrated as the dawn of unified multimodal knowledge discovery but its foundation operates on a dangerous, unquestioned axiom: that current VLMs faithfully synthesise multimodal data. We argue they do not. Instead, a profound crisis of trustworthiness underlies the dominant Vision Encoder-Projector-LLM paradigm. Rather than extracting grounded knowledge from visual inputs, state-of-the-art models frequently exhibit functional blindness, i.e., exploiting strong language priors to bypass severe visual representation bottlenecks. In this work, we challenge the conventional methodology of multimodal evaluation, which relies on data ablation or new dataset creation and therefore fatally conflates dataset biases with architectural incapacity. We propose a radical, information-theoretic departure: the Modality Translation Protocol, designed to quantifiably unmask the Expense of Seeing. By translating semantic payloads rather than ablating them, we formulate three novel metrics -- the Toll (ToS), Curse (CoS), and Fallacy (FoS) of Seeing -- culminating in the Semantic Sufficiency Criterion (SSC). Furthermore, we posit a provocative Divergence Law of Multimodal Scaling, hypothesising that as the underlying language engines scale to unprecedented reasoning capabilities, the mathematical penalty of the visual knowledge bottleneck paradoxically increases. We challenge the KDD community to abandon the illusory pursuit of "multimodal gain". By elevating the SSC from a passive diagnostic constraint to an active architectural blueprint, we provide the rigorous, trustworthy foundation required to force the next generation of AI systems to truly see the data, achieving true multimodal reasoning.

15. [GRPO-VPS: Enhancing Group Relative Policy Optimization with Verifiable Process Supervision for Effective Reasoning](https://arxiv.org/abs/2604.20659v1)
   - Published: 2026-04-22 23:08
   - Authors: Jingyi Wang, Lei Zhu, Tengjin Weng, Song-Li Wu, Haochen Tan, Jierun Chen, et al.
   - Source: arxiv
   - Relevance score: 87
   - Match reasons: title matched "reasoning"; summary matched "benchmark"; has PDF; has rich summary
   - Categories: cs.LG, cs.AI
   - Tags: Evaluation / Methods
   - Topics: Benchmark / Language Model
   - PDF: https://arxiv.org/pdf/2604.20659v1
   - Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has advanced the reasoning capabilities of Large Language Models (LLMs) by leveraging direct outcome verification instead of learned reward models. Building on this paradigm, Group Relative Policy Optimization (GRPO) eliminates the need for critic models but suffers from indiscriminate credit assignment for intermediate steps, which limits its ability to identify effective reasoning strategies and incurs overthinking. In this work, we introduce a model-free and verifiable process supervision via probing the model's belief in the correct answer throughout its reasoning trajectory. By segmenting the generation into discrete steps and tracking the conditional probability of the correct answer appended at each segment boundary, we efficiently compute interpretable segment-wise progress measurements to refine GRPO's trajectory-level feedback. This approach enables more targeted and sample-efficient policy updates, while avoiding the need for intermediate supervision derived from costly Monte Carlo rollouts or auxiliary models. Experiments on mathematical and general-domain benchmarks show consistent gains over GRPO across diverse models: up to 2.6-point accuracy improvements and 13.7% reasoning-length reductions on math tasks, and up to 2.4 points and 4% on general-domain tasks, demonstrating strong generalization.
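
The GRPO-VPS abstract above describes two quantities: a trajectory-level signal that GRPO normalizes within a sampled group, and a segment-wise progress measure derived from the probability of the correct answer at each segment boundary. The sketch below is one plausible reading of those two ideas, not the authors' implementation. The function names, the use of first differences as "progress", and the toy numbers are all assumptions, and the abstract does not say how the two signals are combined, so the sketch leaves them separate.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style normalization: (reward - group mean) / group std.

    Uses the population standard deviation; real implementations differ
    on this detail, so treat it as an assumption.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def segment_progress(belief):
    """First differences of the probed correct-answer probability.

    belief[k] is a hypothetical probability the model assigns to the
    correct answer when it is appended after segment k; belief[0] is the
    value before any reasoning segment. A positive entry means that
    segment raised the model's belief in the correct answer.
    """
    return [after - before for before, after in zip(belief, belief[1:])]


if __name__ == "__main__":
    # Four sampled answers to one prompt, scored by a verifiable 0/1 reward.
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
    # Toy belief trace over three reasoning segments.
    print(segment_progress([0.2, 0.5, 0.4, 0.9]))
```

In the toy trace the second segment lowers the probed probability, which is the kind of step a trajectory-level reward alone cannot single out.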

## Vision Watch

### Group Highlights

- "LLaDA2.0-Uni: Unifying Multimodal Understanding and Generation with Diffusion Large Language Model" [Methods]: We present LLaDA2.0-Uni, a unified discrete diffusion large language model (dLLM) that supports multimodal understanding and generation within a natively integ…
- "Hallucination Early Detection in Diffusion Models" [Data / Methods]: Text-to-Image generation has seen significant advancements in output realism with the advent of diffusion models. However, diffusion models encounter difficult…
- "ParetoSlider: Diffusion Models Post-Training for Continuous Reward Control" [Methods]: Reinforcement Learning (RL) post-training has become the standard for aligning generative models with human preferences, yet most methods rely on a single scal…

### Paper Summaries

1. [LLaDA2.0-Uni: Unifying Multimodal Understanding and Generation with Diffusion Large Language Model](https://arxiv.org/abs/2604.20796v1)
   - Published: 2026-04-23 01:20
   - Authors: Inclusion AI, Tiwei Bie, Haoxing Chen, Tieyuan Chen, Zhenglin Cheng, Long Cui, et al.
   - Source: arxiv
   - Relevance score: 111
   - Match reasons: title matched "diffusion"; title matched "multimodal"; has PDF; has rich summary
   - Categories: cs.CV
   - Tags: Methods
   - Topics: Language Model / Reasoning
   - PDF: https://arxiv.org/pdf/2604.20796v1
   - Abstract: We present LLaDA2.0-Uni, a unified discrete diffusion large language model (dLLM) that supports multimodal understanding and generation within a natively integrated framework. Its architecture combines a fully semantic discrete tokenizer, a MoE-based dLLM backbone, and a diffusion decoder. By discretizing continuous visual inputs via SigLIP-VQ, the model enables block-level masked diffusion for both text and vision inputs within the backbone, while the decoder reconstructs visual tokens into high-fidelity images. Inference efficiency is enhanced beyond parallel decoding through prefix-aware optimizations in the backbone and few-step distillation in the decoder. Supported by carefully curated large-scale data and a tailored multi-stage training pipeline, LLaDA2.0-Uni matches specialized VLMs in multimodal understanding while delivering strong performance in image generation and editing. Its native support for interleaved generation and reasoning establishes a promising and scalable paradigm for next-generation unified foundation models. Codes and models are available at https://github.com/inclusionAI/LLaDA2.0-Uni.

2. [Hallucination Early Detection in Diffusion Models](https://arxiv.org/abs/2604.20354v1)
   - Published: 2026-04-22 16:57
   - Authors: Federico Betti, Lorenzo Baraldi, Rita Cucchiara, Nicu Sebe
   - Source: arxiv
   - Relevance score: 75
   - Match reasons: title matched "diffusion"; has DOI; has PDF; has rich summary
   - Categories: cs.CV
   - Tags: Data / Methods
   - Topics: Diffusion
   - PDF: https://arxiv.org/pdf/2604.20354v1
   - Abstract: Text-to-Image generation has seen significant advancements in output realism with the advent of diffusion models. However, diffusion models encounter difficulties when tasked with generating multiple objects, frequently resulting in hallucinations where certain entities are omitted. While existing solutions typically focus on optimizing latent representations within diffusion models, the relevance of the initial generation seed is typically underestimated. While using various seeds in multiple iterations can improve results, this method also significantly increases time and energy costs. To address this challenge, we introduce HEaD+ (Hallucination Early Detection +), a novel approach designed to identify incorrect generations early in the diffusion process. The HEaD+ framework integrates cross-attention maps and textual information with a novel input, the Predicted Final Image. The objective is to assess whether to proceed with the current generation or restart it with a different seed, thereby exploring multiple-generation seeds while conserving time. HEaD+ is trained on the newly created InsideGen dataset of 45,000 generated images, each containing prompts with up to seven objects. Our findings demonstrate a 6-8% increase in the likelihood of achieving a complete generation (i.e., an image accurately representing all specified subjects) with four objects when applying HEaD+ alongside existing models. Additionally, HEaD+ reduces generation times by up to 32% when aiming for a complete image, enhancing the efficiency of generating complete and accurate object representations relative to leading models. Moreover, we propose an integrated localization module that predicts object centroid positions and verifies pairwise spatial relations (if requested by the users) at an intermediate timestep, gating generation together with object presence to further improve relation-consistent outcomes.

3. [ParetoSlider: Diffusion Models Post-Training for Continuous Reward Control](https://arxiv.org/abs/2604.20816v1)
   - Published: 2026-04-23 01:44
   - Authors: Shelly Golan, Michael Finkelson, Ariel Bereslavsky, Yotam Nitzan, Or Patashnik
   - Source: arxiv
   - Relevance score: 72
   - Match reasons: title matched "diffusion"; has PDF; has rich summary; has complete metadata
   - Categories: cs.LG, cs.CV
   - Tags: Methods
   - Topics: Diffusion
   - PDF: https://arxiv.org/pdf/2604.20816v1
   - Abstract: Reinforcement Learning (RL) post-training has become the standard for aligning generative models with human preferences, yet most methods rely on a single scalar reward. When multiple criteria matter, the prevailing practice of "early scalarization" collapses rewards into a fixed weighted sum. This commits the model to a single trade-off point at training time, providing no inference-time control over inherently conflicting goals -- such as prompt adherence versus source fidelity in image editing. We introduce ParetoSlider, a multi-objective RL (MORL) framework that trains a single diffusion model to approximate the entire Pareto front. By training the model with continuously varying preference weights as a conditioning signal, we enable users to navigate optimal trade-offs at inference time without retraining or maintaining multiple checkpoints. We evaluate ParetoSlider across three state-of-the-art flow-matching backbones: SD3.5, FluxKontext, and LTX-2. Our single preference-conditioned model matches or exceeds the performance of baselines trained separately for fixed reward trade-offs, while uniquely providing fine-grained control over competing generative goals.

4. [Amodal SAM: A Unified Amodal Segmentation Framework with Generalization](https://arxiv.org/abs/2604.20748v1)
   - Published: 2026-04-23 00:39
   - Authors: Bo Zhang, Zhuotao Tian, Xin Tao, Songlin Tang, Jun Yu, Wenjie Pei
   - Source: arxiv
   - Relevance score: 70
   - Match reasons: title matched "segmentation"; has PDF; has rich summary; has complete metadata
   - Categories: cs.CV
   - Tags: Evaluation / Data / Applications / Methods
   - Topics: Benchmark / Segmentation
   - PDF: https://arxiv.org/pdf/2604.20748v1
   - Abstract: Amodal segmentation is a challenging task that aims to predict the complete geometric shape of objects, including their occluded regions. Although existing methods primarily focus on amodal segmentation within the training domain, these approaches often lack the generalization capacity to extend effectively to novel object categories and unseen contexts. This paper introduces Amodal SAM, a unified framework that leverages SAM (Segment Anything Model) for both amodal image and amodal video segmentation. Amodal SAM preserves the powerful generalization ability of SAM while extending its inherent capabilities to the amodal segmentation task. The improvements lie in three aspects: (1) a lightweight Spatial Completion Adapter that enables occluded region reconstruction, (2) a Target-Aware Occlusion Synthesis (TAOS) pipeline that addresses the scarcity of amodal annotations by generating diverse synthetic training data, and (3) novel learning objectives that enforce regional consistency and topological regularization. Extensive experiments demonstrate that Amodal SAM achieves state-of-the-art performance on standard benchmarks, while simultaneously exhibiting robust generalization to novel scenarios. We anticipate that this research will advance the field toward practical amodal segmentation systems capable of operating effectively in unconstrained real-world environments.

5. [GeoRelight: Learning Joint Geometrical Relighting and Reconstruction with Flexible Multi-Modal Diffusion Transformers](https://arxiv.org/abs/2604.20715v1)
   - Published: 2026-04-23 00:01
   - Authors: Yuxuan Xue, Ruofan Liang, Egor Zakharov, Timur Bagautdinov, Chen Cao, Giljoo Nam, et al.
   - Source: arxiv
   - Relevance score: 70
   - Match reasons: title matched "diffusion"; has PDF; has rich summary; has complete metadata
   - Categories: cs.CV
   - Tags: Evaluation / Methods
   - Topics: Diffusion
   - PDF: https://arxiv.org/pdf/2604.20715v1
   - Abstract: Relighting a person from a single photo is an attractive but ill-posed task, as a 2D image ambiguously entangles 3D geometry, intrinsic appearance, and illumination. Current methods either use sequential pipelines that suffer from error accumulation, or they do not explicitly leverage 3D geometry during relighting, which limits physical consistency. Since relighting and estimation of 3D geometry are mutually beneficial tasks, we propose a unified Multi-Modal Diffusion Transformer (DiT) that jointly solves for both: GeoRelight. We make this possible through two key technical contributions: isotropic NDC-Orthographic Depth (iNOD), a distortion-free 3D representation compatible with latent diffusion models; and a strategic mixed-data training method that combines synthetic and auto-labeled real data. By solving geometry and relighting jointly, GeoRelight achieves better performance than both sequential models and previous systems that ignored geometry.

6. [SSL-R1: Self-Supervised Visual Reinforcement Post-Training for Multimodal Large Language Models](https://arxiv.org/abs/2604.20705v1)
   - Published: 2026-04-22 23:46
   - Authors: Jiahao Xie, Alessio Tonioni, Nathalie Rauschmayr, Federico Tombari, Bernt Schiele
   - Source: arxiv
   - Relevance score: 70
   - Match reasons: title matched "multimodal"; has PDF; has rich summary; has complete metadata
   - Categories: cs.CV
   - Tags: Evaluation / Data / Methods
   - Topics: Benchmark / Language Model
   - PDF: https://arxiv.org/pdf/2604.20705v1
   - Abstract: Reinforcement learning (RL) with verifiable rewards (RLVR) has demonstrated the great potential of enhancing the reasoning abilities in multimodal large language models (MLLMs). However, the reliance on language-centric priors and expensive manual annotations prevents MLLMs' intrinsic visual understanding and scalable reward designs. In this work, we introduce SSL-R1, a generic self-supervised RL framework that derives verifiable rewards directly from images. To this end, we revisit self-supervised learning (SSL) in visual domains and reformulate widely-used SSL tasks into a set of verifiable visual puzzles for RL post-training, requiring neither human nor external model supervision. Training MLLMs on these tasks substantially improves their performance on multimodal understanding and reasoning benchmarks, highlighting the potential of leveraging vision-centric self-supervised tasks for MLLM post-training. We think this work will provide useful experience in devising effective self-supervised verifiable rewards to enable RL at scale. Project page: https://github.com/Jiahao000/SSL-R1.

7. [Physics-Informed Conditional Diffusion for Motion-Robust Retinal Temporal Laser Speckle Contrast Imaging](https://arxiv.org/abs/2604.20594v1)
   - Published: 2026-04-22 22:11
   - Authors: Qian Chen, Yuehao Chen, Qiang Wang, Lei Zhu, Yanye Lu, Qiushi Ren
   - Source: arxiv
   - Relevance score: 68
   - Match reasons: title matched "diffusion"; has PDF; has rich summary; has complete metadata
   - Categories: cs.CV
   - Tags: Methods
   - Topics: Alignment / Diffusion
   - PDF: https://arxiv.org/pdf/2604.20594v1
   - Abstract: Retinal laser speckle contrast imaging (LSCI) is a noninvasive optical modality for monitoring retinal blood flow dynamics. However, conventional temporal LSCI (tLSCI) reconstruction relies on sufficiently long speckle sequences to obtain stable temporal statistics, which makes it vulnerable to acquisition disturbances and limits effective temporal resolution. A physically informed reconstruction framework, termed RetinaDiff (Retinal Diffusion Model), is proposed for retinal tLSCI that is robust to motion and requires only a few frames. In RetinaDiff, registration based on phase correlation is first applied to stabilize the raw speckle sequence before contrast computation, reducing interframe misalignment so that fluctuations at each pixel primarily reflect true flow dynamics. This step provides a physics prior corrected for motion and a high quality multiframe tLSCI reference. Next, guided by the physics prior, a conditional diffusion model performs inverse reconstruction by jointly conditioning on the registered speckle sequence and the corrected prior. Experiments on data acquired with a retinal LSCI system developed in house show improved structural continuity and statistical stability compared with direct reconstruction from few frames and representative baselines. The framework also remains effective in a small number of extremely challenging cases, where both the direct 5-frame input and the conventional multiframe reconstruction are severely degraded. Overall, this work provides a practical and physically grounded route for reliable retinal tLSCI reconstruction from extremely limited frames. The source code and model weights will be publicly available at https://github.com/QianChen113/RetinaDiff.

8. [On the Impact of Face Segmentation-Based Background Removal on Recognition and Morphing Attack Detection](https://arxiv.org/abs/2604.20585v1)
   - Published: 2026-04-22 22:02
   - Authors: Eduarda Caldeira, Guray Ozgur, Fadi Boutros, Naser Damer
   - Source: arxiv
   - Relevance score: 68
   - Match reasons: title matched "segmentation"; has PDF; has rich summary; has complete metadata
   - Categories: cs.CV
   - Tags: Evaluation / Methods
   - Topics: Segmentation
   - PDF: https://arxiv.org/pdf/2604.20585v1
   - Abstract: This study investigates the impact of face image background correction through segmentation on face recognition and morphing attack detection performance in realistic, unconstrained image capture scenarios. The motivation is driven by operational biometric systems such as the European Entry/Exit System (EES), which require facial enrolment at airports and other border crossing points where controlled backgrounds usually required for such captures cannot always be guaranteed, as well as by accessibility needs that may necessitate image capture outside traditional office environments. By analyzing how such preprocessing steps influence both recognition accuracy and security mechanisms, this work addresses a critical gap between usability-driven image normalization and the reliability requirements of large-scale biometric identification systems. Our study evaluates a comprehensive range of segmentation techniques, three families of morphing attack detection methods, and four distinct face recognition models, using databases that include both controlled and in-the-wild image captures. The results reveal consistent patterns linking segmentation to both recognition performance and face image quality. Additionally, segmentation is shown to systematically influence morphing attack detection performance. These findings highlight the need for careful consideration when deploying such preprocessing techniques in operational biometric systems.

9. [ProMMSearchAgent: A Generalizable Multimodal Search Agent Trained with Process-Oriented Rewards](https://arxiv.org/abs/2604.20486v1)
   - Published: 2026-04-22 20:20
   - Authors: Wentao Yan, Shengqin Wang, Huichi Zhou, Yihang Chen, Kun Shao, Yuan Xie, et al.
   - Source: arxiv
   - Relevance score: 66
   - Match reasons: title matched "multimodal"; has PDF; has rich summary; has complete metadata
   - Categories: cs.CV
   - Tags: Evaluation / Applications / Methods
   - Topics: Reasoning / Multimodal
   - PDF: https://arxiv.org/pdf/2604.20486v1
   - Abstract: Training multimodal agents via reinforcement learning for knowledge-intensive visual reasoning is fundamentally hindered by the extreme sparsity of outcome-based supervision and the unpredictability of live web environments. To resolve these algorithmic and environmental bottlenecks, we introduce ProMMSearchAgent, establishing a novel Sim-to-Real training paradigm for multimodal search. We decouple policy learning into a deterministic, local static sandbox. Crucially, to learn effectively within this constrained environment, we propose an introspective process-oriented reward. By probing the agent's own parametric knowledge boundaries, we generate dense behavioral metadata that explicitly rewards the correct cognitive decision, initiating a multimodal or text search only when visually or factually uncertain. Extensive experiments demonstrate that our locally-trained policy transfers zero-shot to the live Google Search API. ProMMSearchAgent achieves new SOTA performance, outperforming MMSearch-R1 by +5.1% on FVQA-test, +6.3% on InfoSeek, and +11.3% on MMSearch.
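
The ParetoSlider abstract in this group contrasts collapsing several rewards into one fixed weighted sum with training under continuously varying preference weights. The sketch below illustrates only that contrast in plain Python. The function names, the two example rewards, and the choice to sample weights uniformly from the probability simplex are assumptions made for illustration; the abstract does not specify the paper's sampling scheme or how the weights condition the model.

```python
import random


def scalarize(rewards, weights):
    """Collapse per-objective rewards into one scalar via a weighted sum."""
    if len(rewards) != len(weights):
        raise ValueError("rewards and weights must have the same length")
    return sum(w * r for w, r in zip(weights, rewards))


def sample_preference(n_objectives, rng):
    """Draw non-negative weights that sum to 1.

    Normalizing i.i.d. exponential draws yields a uniform sample on the
    probability simplex. This sampling choice is an assumption.
    """
    draws = [rng.expovariate(1.0) for _ in range(n_objectives)]
    total = sum(draws)
    return [d / total for d in draws]


if __name__ == "__main__":
    # Hypothetical rewards: [prompt adherence, source fidelity].
    rewards = [0.8, 0.2]

    # Fixed weights: one trade-off point chosen before training.
    print(scalarize(rewards, [0.5, 0.5]))

    # Varying weights: a fresh preference vector per training sample,
    # which would also be passed to the model as a conditioning signal.
    rng = random.Random(0)
    for _ in range(3):
        weights = sample_preference(2, rng)
        print(weights, scalarize(rewards, weights))
```

With fixed weights the same reward pair always maps to the same scalar; with sampled weights the scalar depends on the preference vector, which is what lets a single model cover many trade-offs.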

## PubMed AI Watch

### Group Highlights

- "Comparative evaluation of large language models for generating CAD-RADS 2.0-compliant diagnostic conclusions in cardiac CT reports." [Evaluation / Applications / Methods]: OBJECTIVES: Coronary computed tomography angiography (CCTA) has become a cornerstone in non-invasive CAD diagnosis and risk stratification. To standardize repo…
- "Immune checkpoint inhibitors in POLE/POLD1 proofreading-deficient CRC: from molecular basis to clinical practice and future directions." [Evaluation / Applications / Methods]: Immune checkpoint inhibitors (ICIs) have fundamentally reshaped the therapeutic paradigm for metastatic colorectal cancer (mCRC). Beyond the established dMMR/M…
- "Establishing procedure-specific minimal clinically important difference and patient acceptable symptom state thresholds after anterior combined latissimus dorsi and teres major tendon transfer for irreparable anterosuperior cuff tears: minimum 5-year outcomes." [Evaluation / Applications / Methods]: BACKGROUND: To date, no prior study has established procedure-specific minimal clinically important difference (MCID) and patient acceptable symptom state (PAS…

### Paper Summaries

1. [Comparative evaluation of large language models for generating CAD-RADS 2.0-compliant diagnostic conclusions in cardiac CT reports.](https://pubmed.ncbi.nlm.nih.gov/42018072/)
   - Entered: 2026-04-22 19:06
   - Authors: Giovanni Lorusso, Giorgio Ruscino, Alessia Spitaleri, Chiara Morelli, Sara Greco, Ilaria Villanova, et al.
   - Source: pubmed
   - Relevance score: 87
   - Match reasons: title matched "language model"; summary matched "clinical"; has DOI; has rich summary
   - Categories: Journal Article
   - Tags: Evaluation / Applications / Methods
   - Topics: Language Model / Evaluation
   - Abstract: OBJECTIVES: Coronary computed tomography angiography (CCTA) has become a cornerstone in non-invasive CAD diagnosis and risk stratification. To standardize reporting and improve clinical decision-making, the CAD-RADS 2.0 system was introduced. This study evaluates the performance of four LLMs, GPT-4o, Gemini 2.0 Flash, DeepSeek V, and Copilot in generating CAD-RADS 2.0-compliant conclusions from standardized CCTA reports. MATERIALS AND METHODS: A total of 196 anonymized CCTA reports were retrospectively analyzed. Each LLM was prompted to provide CAD-RADS 2.0 classifications and follow-up recommendations. Ground truth labels were assigned by a senior radiologist. Performance metrics (accuracy, precision, recall, F1-score), execution times, and agreement (Cohen's kappa) with expert interpretation were computed. Interobserver agreement between junior and senior radiologists was also assessed. RESULTS: LLMs demonstrated good-to-excellent agreement with expert classifications: DeepSeek V (κ = 0.771), Copilot (κ = 0.761), GPT-4o (κ = 0.759), and Gemini 2.0 Flash (κ = 0.634). DeepSeek V achieved the highest accuracy (91.83%). Intra-model consistency was perfect (κ = 1). However, LLMs failed to assign CAD-RADS modifiers. ChatGPT-4o provided the most accurate follow-up recommendations (71.94%). All LLMs outperformed radiologists in execution time (3-9 s vs. 15-20 s; p < 0.05). CONCLUSIONS: Generic LLMs demonstrate promising performance in automating CAD-RADS 2.0 classification from CCTA reports. However, limitations in modifier assignment and recommendation accuracy highlight areas for refinement before clinical integration. CRITICAL RELEVANCE STATEMENT: This study explores the potential of large language models to facilitate standardized CAD-RADS 2.0 reporting from coronary CT angiography, highlighting a possible avenue to support workflow efficiency and clinical decision-making in non-invasive coronary artery disease evaluation. KEY POINTS: LLMs demonstrated strong potential in automating CAD-RADS 2.0-compliant structured reporting for CCTA. LLMs could significantly enhance efficiency in radiological reporting. LLMs need further optimization before clinical integration.

2. [Immune checkpoint inhibitors in POLE/POLD1 proofreading-deficient CRC: from molecular basis to clinical practice and future directions.](https://pubmed.ncbi.nlm.nih.gov/42017297/)
   - Entered: 2026-04-22 13:32
   - Authors: Lei Jiang, Zhongxia Yang, Xiaojun Liu
   - Source: pubmed
   - Relevance score: 81
   - Match reasons: title matched "clinical"; summary matched "benchmark"; has DOI; has rich summary
   - Categories: Journal Article, Review
   - Tags: Evaluation / Applications / Methods
   - Topics: Benchmark / Clinical
   - Abstract: Immune checkpoint inhibitors (ICIs) have fundamentally reshaped the therapeutic paradigm for metastatic colorectal cancer (mCRC). Beyond the established dMMR/MSI-H population, a molecularly distinct, hyper-immunogenic subset, governed by pathogenic aberrations in the exonuclease domains of POLE/POLD1, has emerged as a pivotal clinical entity. Characterized by an ultra-hypermutated phenotype, these tumors harbor a mutational load that typically dwarfs the benchmarks established by dMMR/MSI-H malignancies. In this review, we synthesize the molecular underpinnings of POLE/POLD1 deficiency, emphasizing a "threshold effect" where extreme neoantigen density triggers a self-reinforcing inflammatory loop, fundamentally reshaping the tumor immune microenvironment (TIME). To ensure a robust synthesis of the field, a systematic literature search was conducted using the PubMed and Web of Science databases until December 2025, with additional manual screening of reference lists from key studies. Our analysis underscores superior, often durable, responses in this subgroup, while addressing a formidable obstacle: the interpretation of Variants of Uncertain Significance (VUS). We highlight the critical need to distinguish passenger mutations from true proofreading defects, as therapeutic benefit is strictly tethered to functional pathogenicity. Finally, we propose an integrated biomarker framework that moves beyond binary genomic screening toward a functional hierarchy of polymerase variants, providing a definitive roadmap for the next generation of precision immunotherapy in colorectal cancer.

3. [Establishing procedure-specific minimal clinically important difference and patient acceptable symptom state thresholds after anterior combined latissimus dorsi and teres major tendon transfer for irreparable anterosuperior cuff tears: minimum 5-year outcomes.](https://pubmed.ncbi.nlm.nih.gov/42017018/)
   - Entered: 2026-04-22 12:56
   - Authors: Chang Hee Baek, Jung Gon Kim, Bo Taek Kim, Chaemoon Lim, Seung Jin Kim
   - Source: pubmed
   - Relevance score: 81
   - Match reasons: title matched "clinical"; summary matched "benchmark"; has DOI; has rich summary
   - Categories: Journal Article
   - Tags: Evaluation / Applications / Methods
   - Topics: Benchmark / Clinical
   - Abstract: BACKGROUND: To date, no prior study has established procedure-specific minimal clinically important difference (MCID) and patient acceptable symptom state (PASS) thresholds for anterior combined latissimus dorsi and teres major (LDTM) tendon transfer in irreparable anterosuperior rotator cuff tears (IASRCTs). This study aimed to establish these patient-centered benchmarks in a cohort with a minimum 5-year follow-up. METHODS: We retrospectively reviewed 31 patients (33 shoulders) who underwent a single-stage anterior LDTM transfer for IASRCTs and completed a minimum 5-year follow-up. Patient-reported outcome measures (PROMs) included the American Shoulder and Elbow Surgeons (ASES) score, visual analog scale (VAS) for pain, Constant score, and activities of daily living requiring internal rotation (ADLIR) score. The MCID was calculated as one-half of the standard deviation of the change score for each PROM. PASS thresholds were derived from receiver operating characteristic analysis, using postoperative satisfaction as the external anchor. RESULTS: At a mean follow-up of 83.0 ± 7.4 months, all PROMs improved significantly (P < .001). Distribution-based MCID thresholds were 10.5 (ASES), 0.9 (VAS), 10.5 (Constant), and 8.6 (ADLIR). Corresponding MCID achievement rates were 77.4%, 87.1%, 74.2%, and 87.1%, respectively. Anchor-based PASS thresholds were ASES ≥75, VAS ≤2, Constant ≥60, and ADLIR ≥78; these were achieved by 64.5%, 80.6%, 77.4%, and 71.0% of patients, respectively. Age showed a significant negative correlation with ASES MCID (r_pb = -0.53, P = .002) and ADLIR MCID (r_pb = -0.41, P = .021). Male sex correlated positively with ASES PASS attainment (φ = 0.46, P = .010). No other baseline variables were significantly associated with MCID or PASS (all P > .05). CONCLUSION: This study is the first to establish clinically meaningful MCID and PASS thresholds for anterior LDTM transfer in patients with IASRCTs at a minimum 5-year follow-up. Most patients achieved substantial improvements that were deemed acceptable by the patients. These procedure-specific benchmarks provide practical targets for clinical assessment and patient counseling and serve as reference values for future outcome research.

4. [Diagnostic Modalities and Nodal Staging in High-Risk Cutaneous Squamous Cell Carcinoma.](https://pubmed.ncbi.nlm.nih.gov/42018290/)
   - Entered: 2026-04-22 19:32
   - Authors: Carla Ferrándiz-Pulido, Álvaro Gómez-Tomás, Sahyly Siurana, Carles Tortajada, Rafael Salido-Vallejo, Rafael S Aguayo-Ortiz, et al.
   - Source: pubmed
   - Relevance score: 65
   - Match reasons: summary matched "benchmark"; summary matched "clinical"; has DOI; has rich summary
   - Categories: Journal Article
   - Tags: Evaluation
   - Topics: Benchmark / Clinical
   - Abstract: IMPORTANCE: Early detection of nodal metastases in high-risk cutaneous squamous cell carcinoma (cSCC) is crucial, yet the optimal baseline staging approach remains uncertain. OBJECTIVE: To compare the diagnostic performance of physical examination, ultrasonography, and contrast-enhanced computed tomography (CT) in detecting nodal metastases at baseline staging of high-risk cSCC, both overall and stratified by patients' immune status. DESIGN, SETTING, AND PARTICIPANTS: This was a prospective, multicenter, paired diagnostic study conducted from January 2022 to April 2025 across 13 tertiary dermato-oncology centers in Spain. The study included patients with histologically confirmed high-risk cSCC (stage T2b/T3 or T2a with additional high-risk features). Data were analyzed from July to September 2025. MAIN OUTCOMES AND MEASURES: Sensitivity, specificity, predictive values, and area under the receiver operating characteristic curve (AUROC) of each diagnostic modality, benchmarked against histology or short-term clinical follow-up as reference standard. RESULTS: The analysis included 155 patients (median [IQR] age, 80.3 [74.4-85.5] years; 34 [21.9%] female and 121 [78.1%] male; 64 [41.3%] immunosuppressed), of whom 12 patients (7.7%; 95% CI, 4.3%-13.4%) developed nodal metastases within 3 months after surgery. Ultrasonography results showed the highest overall sensitivity (63.6%; 95% CI, 30.8%-89.1%), followed by CT (54.5%; 95% CI, 23.4%-83.3%) and physical examination (8.3%; 95% CI, 0.2%-38.5%). Specificities were 95.6% (95% CI, 90.6%-98.4%), 95.0% (95% CI, 90.0%-98.0%), and 99.3% (95% CI, 96.2%-100%), respectively. Ultrasonography and CT demonstrated almost perfect agreement (κ = 0.87; 95% CI, 0.72-1.00), whereas concordance with physical examination was poor. Subgroup analysis by immune status revealed marked disparities in diagnostic performance. In patients with immunocompetence, both ultrasonography and CT achieved 100% sensitivity (95% CI, 54.1%-100% and 47.8%-100%, respectively) and excellent AUROC (0.98; 95% CI, 0.96-1.00 for both). In contrast, sensitivity declined markedly among patients who were immunosuppressed (20.0% [95% CI, 0.5%-71.6%] for ultrasonography and 16.7% [95% CI, 0.4%-64.1%] for CT; AUROCs, 0.57 [95% CI, 0.37-0.77] and 0.55 [95% CI, 0.38-0.72], respectively), with metastases often emerging abruptly during follow-up despite negative baseline staging. CONCLUSIONS AND RELEVANCE: This diagnostic study found that ultrasonography and CT significantly outperformed physical examination for detecting baseline nodal metastases in high-risk cSCC and can be used interchangeably depending on clinical context and resource availability. However, their poor performance in patients with immunosuppression reveals a need for tailored recommendations in future clinical practice guidelines and emphasizes the importance of close clinical follow-up in this subgroup.

5. [Defining the learning curve of multi-vessel MIDCAB using CUSUM analysis.](https://pubmed.ncbi.nlm.nih.gov/42017544/)
   - Entered: 2026-04-22 15:53
   - Authors: Barış Timur, Zinar Apaydın, Batuhan Yazıcı, Alper Selim Kocaoğlu, Mehmet Emin Öner, Fatime Üçdağ, et al.
   - Source: pubmed
   - Relevance score: 62
   - Match reasons: summary matched "benchmark"; summary matched "clinical"; has DOI; has rich summary
   - Categories: Journal Article
   - Tags: Evaluation / Methods
   - Topics: Benchmark / Clinical
   - Abstract: BACKGROUND: Minimally invasive direct coronary artery bypass (MIDCAB) has emerged as an alternative to conventional coronary artery bypass grafting; however, its adoption remains limited due to technical complexity and a steep learning curve, particularly in patients requiring multi-vessel revascularization. Objective data defining the learning curve of multi-vessel MIDCAB are scarce. METHODS: This retrospective study included consecutive patients who underwent multi-vessel MIDCAB between January 2020 and December 2025. Patients requiring single-vessel revascularization were intentionally excluded to ensure procedural homogeneity. The learning curve was evaluated using cumulative sum (CUSUM) analysis, and cases were stratified into three phases based on CUSUM inflection points. Perioperative and postoperative outcomes were compared across learning curve phases. RESULTS: A total of 169 patients were analyzed. CUSUM analysis identified three distinct learning phases: an initial learning phase (cases 1-48), a transition phase (cases 49-107), and a proficiency phase (cases 108-169). With increasing surgical experience, cardiopulmonary bypass time, aortic cross-clamp time, and total operative duration decreased significantly. The rate of conversion to open surgery declined markedly across learning phases, whereas in-hospital mortality and major postoperative complications remained low and comparable. These findings indicate improved procedural efficiency without compromising early clinical outcomes. CONCLUSIONS: Multi-vessel MIDCAB is associated with a substantial learning curve that can be objectively characterized using CUSUM analysis. Surgical proficiency is achieved only after a considerable number of cases, emphasizing the importance of adequate case volume and structured performance monitoring. These results provide a practical benchmark for centers aiming to adopt or expand multi-vessel MIDCAB programs.
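
Two of the abstracts in this group rest on simple statistics that are easy to misread in prose: the CUSUM learning curve in the MIDCAB study and the distribution-based MCID (one-half of the standard deviation of the change score) in the LDTM transfer study. The sketch below shows a common form of each on made-up numbers. The function names, the choice of the series mean as the CUSUM reference, and the use of the sample standard deviation are assumptions; neither abstract states those details, and the values are not taken from either paper.

```python
from itertools import accumulate
from statistics import fmean, stdev


def cusum(values, target=None):
    """Running sum of deviations from a reference value.

    The reference defaults to the series mean (an assumption). The curve
    rises while cases are above the reference and falls once they drop
    below it.
    """
    ref = fmean(values) if target is None else target
    return list(accumulate(v - ref for v in values))


def peak_case(values, target=None):
    """1-based case number at which the CUSUM curve peaks."""
    curve = cusum(values, target)
    return max(range(len(curve)), key=curve.__getitem__) + 1


def half_sd_mcid(change_scores):
    """Distribution-based MCID: half the SD of the change scores.

    Uses the sample standard deviation (n - 1 denominator), which is an
    assumption about the convention.
    """
    return 0.5 * stdev(change_scores)


if __name__ == "__main__":
    # Hypothetical operative times (minutes) for four consecutive cases.
    times = [10, 8, 6, 4]
    print(cusum(times), peak_case(times))

    # Hypothetical pre-to-post changes in a patient-reported score.
    print(half_sd_mcid([10, 20, 30]))
```

On the toy operative times the curve is `[3.0, 4.0, 3.0, 0.0]`, so it peaks at case 2: later cases run below the mean, which is the kind of turning point a learning-curve analysis uses to separate phases.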

## OpenAlex AI Watch

No new matching papers today.