{
  "generated_at": "2026-04-23T11:42:13.017905+08:00",
  "timezone": "Asia/Shanghai",
  "lookback_hours": 24,
  "sorting": {
    "default_sort_by": "hybrid",
    "summary": "hybrid (relevance first, published_at tie-break)",
    "weights": {
      "title_match_weight": 40,
      "summary_match_weight": 18,
      "doi_weight": 12,
      "pdf_weight": 8,
      "rich_summary_weight": 6,
      "metadata_weight": 4,
      "multi_source_weight": 10,
      "freshness_weight_cap": 24
    },
    "feeds": [
      {
        "name": "LLM",
        "sort_by": "hybrid"
      },
      {
        "name": "Vision",
        "sort_by": "hybrid"
      },
      {
        "name": "PubMed AI",
        "sort_by": "hybrid"
      },
      {
        "name": "OpenAlex AI",
        "sort_by": "hybrid"
      }
    ]
  },
  "highlights": [
    "Topic \"Benchmark\": 18 papers matched, spanning LLM, Vision, and other feeds; representative papers include \"OMIBench: Benchmarking Olympiad-Level Multi-Image Reasoning in Large Vision-Language Model\" and \"V-tableR1: Process-Supervised Multimodal Table Reasoning with Critic-Guided Policy Optimization\".",
    "Topic \"Language Model\": 13 papers matched, spanning LLM, Vision, and other feeds; representative papers include \"OMIBench: Benchmarking Olympiad-Level Multi-Image Reasoning in Large Vision-Language Model\" and \"V-tableR1: Process-Supervised Multimodal Table Reasoning with Critic-Guided Policy Optimization\".",
    "Topic \"Reasoning\": 4 papers matched, spanning the LLM and Vision feeds; representative papers include \"ONOTE: Benchmarking Omnimodal Notation Processing for Expert-level Music Intelligence\" and \"The Expense of Seeing: Attaining Trustworthy Multimodal Reasoning Within the Monolithic Paradigm\"."
  ],
  "focus_items": [],
  "action_items": [],
  "topic_sections": [
    {
      "name": "Benchmark",
      "paper_count": 18,
      "feed_names": [
        "LLM",
        "Vision",
        "PubMed AI"
      ],
      "paper_titles": [
        "OMIBench: Benchmarking Olympiad-Level Multi-Image Reasoning in Large Vision-Language Model",
        "V-tableR1: Process-Supervised Multimodal Table Reasoning with Critic-Guided Policy Optimization",
        "ONOTE: Benchmarking Omnimodal Notation Processing for Expert-level Music Intelligence",
        "Where and What: Reasoning Dynamic and Implicit Preferences in Situated Conversational Recommendation",
        "Cooperative Profiles Predict Multi-Agent LLM Team Performance in AI for Science Workflows",
        "Automatic Ontology Construction Using LLMs as an External Layer of Memory, Verification, and Planning for Hybrid Intelligent Systems",
        "SpeechParaling-Bench: A Comprehensive Benchmark for Paralinguistic-Aware Speech Generation",
        "SWE-chat: Coding Agent Interactions From Real Users in the Wild",
        "Coverage, Not Averages: Semantic Stratification for Trustworthy Retrieval Evaluation",
        "RespondeoQA: a Benchmark for Bilingual Latin-English Question Answering",
        "Learning to Evolve: A Self-Improving Framework for Multi-Agent Systems via Textual Parameter Graph Optimization",
        "GRPO-VPS: Enhancing Group Relative Policy Optimization with Verifiable Process Supervision for Effective Reasoning",
        "Amodal SAM: A Unified Amodal Segmentation Framework with Generalization",
        "SSL-R1: Self-Supervised Visual Reinforcement Post-Training for Multimodal Large Language Models",
        "Immune checkpoint inhibitors in POLE/POLD1 proofreading-deficient CRC: from molecular basis to clinical practice and future directions.",
        "Establishing procedure-specific minimal clinically important difference and patient acceptable symptom state thresholds after anterior combined latissimus dorsi and teres major tendon transfer for irreparable anterosuperior cuff tears: minimum 5-year outcomes.",
        "Diagnostic Modalities and Nodal Staging in High-Risk Cutaneous Squamous Cell Carcinoma.",
        "Defining the learning curve of multi-vessel MIDCAB using CUSUM analysis."
      ],
      "key_points": [
        "\"OMIBench: Benchmarking Olympiad-Level Multi-Image Reasoning in Large Vision-Language Model\" [Evaluation / Method]: Large vision-language models (LVLMs) have made substantial advances in reasoning tasks at the Olympiad level. Nevertheless, current Olympiad-level multimodal r…",
        "\"V-tableR1: Process-Supervised Multimodal Table Reasoning with Critic-Guided Policy Optimization\" [Evaluation / Method]: We introduce V-tableR1, a process-supervised reinforcement learning framework that elicits rigorous, verifiable reasoning from multimodal large language models…"
      ]
    },
    {
      "name": "Language Model",
      "paper_count": 13,
      "feed_names": [
        "LLM",
        "Vision",
        "PubMed AI"
      ],
      "paper_titles": [
        "OMIBench: Benchmarking Olympiad-Level Multi-Image Reasoning in Large Vision-Language Model",
        "V-tableR1: Process-Supervised Multimodal Table Reasoning with Critic-Guided Policy Optimization",
        "Where and What: Reasoning Dynamic and Implicit Preferences in Situated Conversational Recommendation",
        "Cooperative Profiles Predict Multi-Agent LLM Team Performance in AI for Science Workflows",
        "Automatic Ontology Construction Using LLMs as an External Layer of Memory, Verification, and Planning for Hybrid Intelligent Systems",
        "SpeechParaling-Bench: A Comprehensive Benchmark for Paralinguistic-Aware Speech Generation",
        "Can \"AI\" Be a Doctor? A Study of Empathy, Readability, and Alignment in Clinical LLMs",
        "RespondeoQA: a Benchmark for Bilingual Latin-English Question Answering",
        "The Expense of Seeing: Attaining Trustworthy Multimodal Reasoning Within the Monolithic Paradigm",
        "GRPO-VPS: Enhancing Group Relative Policy Optimization with Verifiable Process Supervision for Effective Reasoning",
        "LLaDA2.0-Uni: Unifying Multimodal Understanding and Generation with Diffusion Large Language Model",
        "SSL-R1: Self-Supervised Visual Reinforcement Post-Training for Multimodal Large Language Models",
        "Comparative evaluation of large language models for generating CAD-RADS 2.0-compliant diagnostic conclusions in cardiac CT reports."
      ],
      "key_points": [
        "\"OMIBench: Benchmarking Olympiad-Level Multi-Image Reasoning in Large Vision-Language Model\" [Evaluation / Method]: Large vision-language models (LVLMs) have made substantial advances in reasoning tasks at the Olympiad level. Nevertheless, current Olympiad-level multimodal r…",
        "\"V-tableR1: Process-Supervised Multimodal Table Reasoning with Critic-Guided Policy Optimization\" [Evaluation / Method]: We introduce V-tableR1, a process-supervised reinforcement learning framework that elicits rigorous, verifiable reasoning from multimodal large language models…"
      ]
    },
    {
      "name": "Reasoning",
      "paper_count": 4,
      "feed_names": [
        "LLM",
        "Vision"
      ],
      "paper_titles": [
        "ONOTE: Benchmarking Omnimodal Notation Processing for Expert-level Music Intelligence",
        "The Expense of Seeing: Attaining Trustworthy Multimodal Reasoning Within the Monolithic Paradigm",
        "LLaDA2.0-Uni: Unifying Multimodal Understanding and Generation with Diffusion Large Language Model",
        "ProMMSearchAgent: A Generalizable Multimodal Search Agent Trained with Process-Oriented Rewards"
      ],
      "key_points": [
        "\"ONOTE: Benchmarking Omnimodal Notation Processing for Expert-level Music Intelligence\" [Evaluation / Application / Method]: Omnimodal Notation Processing (ONP) represents a unique frontier for omnimodal AI due to the rigorous, multi-dimensional alignment required across auditory, vi…",
        "\"The Expense of Seeing: Attaining Trustworthy Multimodal Reasoning Within the Monolithic Paradigm\" [Evaluation / Data / Method]: The rapid proliferation of Vision-Language Models (VLMs) is widely celebrated as the dawn of unified multimodal knowledge discovery but its foundation operates…"
      ]
    },
    {
      "name": "Clinical",
      "paper_count": 4,
      "feed_names": [
        "PubMed AI"
      ],
      "paper_titles": [
        "Immune checkpoint inhibitors in POLE/POLD1 proofreading-deficient CRC: from molecular basis to clinical practice and future directions.",
        "Establishing procedure-specific minimal clinically important difference and patient acceptable symptom state thresholds after anterior combined latissimus dorsi and teres major tendon transfer for irreparable anterosuperior cuff tears: minimum 5-year outcomes.",
        "Diagnostic Modalities and Nodal Staging in High-Risk Cutaneous Squamous Cell Carcinoma.",
        "Defining the learning curve of multi-vessel MIDCAB using CUSUM analysis."
      ],
      "key_points": [
        "\"Immune checkpoint inhibitors in POLE/POLD1 proofreading-deficient CRC: from molecular basis to clinical practice and future directions.\" [Evaluation / Application / Method]: Immune checkpoint inhibitors (ICIs) have fundamentally reshaped the therapeutic paradigm for metastatic colorectal cancer (mCRC). Beyond the established dMMR/M…",
        "\"Establishing procedure-specific minimal clinically important difference and patient acceptable symptom state thresholds after anterior combined latissimus dorsi and teres major tendon transfer for irreparable anterosuperior cuff tears: minimum 5-year outcomes.\" [Evaluation / Application / Method]: BACKGROUND: To date, no prior study has established procedure-specific minimal clinically important difference (MCID) and patient acceptable symptom state (PAS…"
      ]
    },
    {
      "name": "Diffusion",
      "paper_count": 4,
      "feed_names": [
        "Vision"
      ],
      "paper_titles": [
        "Hallucination Early Detection in Diffusion Models",
        "ParetoSlider: Diffusion Models Post-Training for Continuous Reward Control",
        "GeoRelight: Learning Joint Geometrical Relighting and Reconstruction with Flexible Multi-Modal Diffusion Transformers",
        "Physics-Informed Conditional Diffusion for Motion-Robust Retinal Temporal Laser Speckle Contrast Imaging"
      ],
      "key_points": [
        "\"Hallucination Early Detection in Diffusion Models\" [Data / Method]: Text-to-Image generation has seen significant advancements in output realism with the advent of diffusion models. However, diffusion models encounter difficult…",
        "\"ParetoSlider: Diffusion Models Post-Training for Continuous Reward Control\" [Method]: Reinforcement Learning (RL) post-training has become the standard for aligning generative models with human preferences, yet most methods rely on a single scal…"
      ]
    }
  ],
  "template": "zh_daily_brief",
  "feeds": [
    {
      "name": "LLM",
      "key_points": [
        "\"OMIBench: Benchmarking Olympiad-Level Multi-Image Reasoning in Large Vision-Language Model\" [Evaluation / Method]: Large vision-language models (LVLMs) have made substantial advances in reasoning tasks at the Olympiad level. Nevertheless, current Olympiad-level multimodal r…",
        "\"V-tableR1: Process-Supervised Multimodal Table Reasoning with Critic-Guided Policy Optimization\" [Evaluation / Method]: We introduce V-tableR1, a process-supervised reinforcement learning framework that elicits rigorous, verifiable reasoning from multimodal large language models…",
        "\"ONOTE: Benchmarking Omnimodal Notation Processing for Expert-level Music Intelligence\" [Evaluation / Application / Method]: Omnimodal Notation Processing (ONP) represents a unique frontier for omnimodal AI due to the rigorous, multi-dimensional alignment required across auditory, vi…"
      ],
      "sort_by": "hybrid",
      "papers": [
        {
          "title": "OMIBench: Benchmarking Olympiad-Level Multi-Image Reasoning in Large Vision-Language Model",
          "summary": "Large vision-language models (LVLMs) have made substantial advances in reasoning tasks at the Olympiad level. Nevertheless, current Olympiad-level multimodal reasoning benchmarks for these models often emphasize single-image analysis and fail to exploit contextual information across multiple images. We present OMIBench, a benchmark designed to evaluate Olympiad-level reasoning when the required evidence is distributed over multiple images. It contains problems from biology, chemistry, mathematics, and physics Olympiads, together with manually annotated rationales and evaluation protocols for both exact and semantic answer matching. Across extensive experiments on OMIBench, we observe meaningful performance gaps in existing models. Even the strongest LVLMs, such as Gemini-3-Pro, attain only about 50% on the benchmark. These results position OMIBench as a focused resource for studying and improving multi-image reasoning in LVLMs.",
          "authors": [
            "Qiguang Chen",
            "Chengyu Luan",
            "Jiajun Wu",
            "Qiming Yu",
            "Yi Yang",
            "Yizhuo Li",
            "Jingqi Tong",
            "Xiachong Feng",
            "Libo Qin",
            "Wanxiang Che"
          ],
          "categories": [
            "cs.CV",
            "cs.AI",
            "cs.CL"
          ],
          "paper_id": "http://arxiv.org/abs/2604.20806v1",
          "abstract_url": "https://arxiv.org/abs/2604.20806v1",
          "pdf_url": "https://arxiv.org/pdf/2604.20806v1",
          "published_at": "2026-04-22T17:37:40+00:00",
          "updated_at": "2026-04-22T17:37:40+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "方法"
          ],
          "topics": [
            "Benchmark",
            "Language Model"
          ],
          "doi": null,
          "arxiv_id": "2604.20806",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2604.20806v1"
          },
          "relevance_score": 129,
          "match_reasons": [
            "title matched \"reasoning\"",
            "title matched \"benchmark\"",
            "summary matched \"evaluation\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2604.20806"
        },
        {
          "title": "V-tableR1: Process-Supervised Multimodal Table Reasoning with Critic-Guided Policy Optimization",
          "summary": "We introduce V-tableR1, a process-supervised reinforcement learning framework that elicits rigorous, verifiable reasoning from multimodal large language models (MLLMs). Current MLLMs trained solely on final outcomes often treat visual reasoning as a black box, relying on superficial pattern matching rather than performing rigorous multi-step inference. While Reinforcement Learning with Verifiable Rewards could enforce transparent reasoning trajectories, extending it to visual domains remains severely hindered by the ambiguity of grounding abstract logic into continuous pixel space. We solve this by leveraging the deterministic grid structure of tables as an ideal visual testbed. V-tableR1 employs a specialized critic VLM to provide dense, step-level feedback on the explicit visual chain-of-thought generated by a policy VLM. To optimize this system, we propose Process-Guided Direct Alignment Policy Optimization (PGPO), a novel RL algorithm integrating process rewards, decoupled policy constraints, and length-aware dynamic sampling. Extensive evaluations demonstrate that V-tableR1 explicitly penalizes visual hallucinations and shortcut guessing. By fundamentally shifting multimodal inference from black-box pattern matching to verifiable logical derivation, V-tableR1 4B establishes state-of-the-art accuracy among open-source models on complex tabular benchmarks, outperforming models up to 18x its size and improving over its SFT baseline",
          "authors": [
            "Yubo Jiang",
            "Yitong An",
            "Xin Yang",
            "Abudukelimu Wuerkaixi",
            "Xuxin Cheng",
            "Fengying Xie",
            "Zhiguo Jiang",
            "Cao Liu",
            "Ke Zeng",
            "Haopeng Zhang"
          ],
          "categories": [
            "cs.AI",
            "cs.LG"
          ],
          "paper_id": "http://arxiv.org/abs/2604.20755v1",
          "abstract_url": "https://arxiv.org/abs/2604.20755v1",
          "pdf_url": "https://arxiv.org/pdf/2604.20755v1",
          "published_at": "2026-04-22T16:44:33+00:00",
          "updated_at": "2026-04-22T16:44:33+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "方法"
          ],
          "topics": [
            "Benchmark",
            "Language Model"
          ],
          "doi": null,
          "arxiv_id": "2604.20755",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2604.20755v1"
          },
          "relevance_score": 125,
          "match_reasons": [
            "title matched \"reasoning\"",
            "summary matched \"alignment\"",
            "summary matched \"benchmark\"",
            "summary matched \"evaluation\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2604.20755"
        },
        {
          "title": "ONOTE: Benchmarking Omnimodal Notation Processing for Expert-level Music Intelligence",
          "summary": "Omnimodal Notation Processing (ONP) represents a unique frontier for omnimodal AI due to the rigorous, multi-dimensional alignment required across auditory, visual, and symbolic domains. Current research remains fragmented, focusing on isolated transcription tasks that fail to bridge the gap between superficial pattern recognition and the underlying musical logic. This landscape is further complicated by severe notation biases toward Western staff and the inherent unreliability of \"LLM-as-a-judge\" metrics, which often mask structural reasoning failures with systemic hallucinations. To establish a more rigorous standard, we introduce ONOTE, a multi-format benchmark that utilizes a deterministic pipeline--grounded in canonical pitch projection--to eliminate subjective scoring biases across diverse notation systems. Our evaluation of leading omnimodal models exposes a fundamental disconnect between perceptual accuracy and music-theoretic comprehension, providing a necessary framework for diagnosing reasoning vulnerabilities in complex, rule-constrained domains.",
          "authors": [
            "Menghe Ma",
            "Siqing Wei",
            "Yuecheng Xing",
            "Yaheng Wang",
            "Fanhong Meng",
            "Peijun Han",
            "Luu Anh Tuan",
            "Haoran Luo"
          ],
          "categories": [
            "cs.SD",
            "cs.AI",
            "cs.MM",
            "eess.AS"
          ],
          "paper_id": "http://arxiv.org/abs/2604.20719v1",
          "abstract_url": "https://arxiv.org/abs/2604.20719v1",
          "pdf_url": "https://arxiv.org/pdf/2604.20719v1",
          "published_at": "2026-04-22T16:06:48+00:00",
          "updated_at": "2026-04-22T16:06:48+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "应用",
            "方法"
          ],
          "topics": [
            "Benchmark",
            "Reasoning"
          ],
          "doi": null,
          "arxiv_id": "2604.20719",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2604.20719v1"
          },
          "relevance_score": 124,
          "match_reasons": [
            "title matched \"benchmark\"",
            "summary matched \"reasoning\"",
            "summary matched \"alignment\"",
            "summary matched \"evaluation\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2604.20719"
        },
        {
          "title": "Where and What: Reasoning Dynamic and Implicit Preferences in Situated Conversational Recommendation",
          "summary": "Situated conversational recommendation (SCR), which utilizes visual scenes grounded in specific environments and natural language dialogue to deliver contextually appropriate recommendations, has emerged as a promising research direction due to its close alignment with real-world scenarios. Compared to traditional recommendations, SCR requires a deeper understanding of dynamic and implicit user preferences, as the surrounding scene often influences users' underlying interests, while both may evolve across conversations. This complexity significantly impacts the timing and relevance of recommendations. To address this, we propose situated preference reasoning (SiPeR), a novel framework that integrates two core mechanisms: (1) Scene transition estimation, which estimates whether the current scene satisfies user needs, and guides the user toward a more suitable scene when necessary; and (2) Bayesian inverse inference, which leverages the likelihood of multimodal large language models (MLLMs) to predict user preferences about candidate items within the scene. Extensive experiments on two representative benchmarks demonstrate SiPeR's superiority in both recommendation accuracy and response generation quality. The code and data are available at https://github.com/DongdingLin/SiPeR.",
          "authors": [
            "Dongding Lin",
            "Jian Wang",
            "Yongqi Li",
            "Wenjie Li"
          ],
          "categories": [
            "cs.AI"
          ],
          "paper_id": "http://arxiv.org/abs/2604.20749v1",
          "abstract_url": "https://arxiv.org/abs/2604.20749v1",
          "pdf_url": "https://arxiv.org/pdf/2604.20749v1",
          "published_at": "2026-04-22T16:39:52+00:00",
          "updated_at": "2026-04-22T16:39:52+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "应用",
            "方法"
          ],
          "topics": [
            "Benchmark",
            "Language Model"
          ],
          "doi": null,
          "arxiv_id": "2604.20749",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2604.20749v1"
          },
          "relevance_score": 106,
          "match_reasons": [
            "title matched \"reasoning\"",
            "summary matched \"alignment\"",
            "summary matched \"benchmark\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2604.20749"
        },
        {
          "title": "Cooperative Profiles Predict Multi-Agent LLM Team Performance in AI for Science Workflows",
          "summary": "Multi-agent systems built from teams of large language models (LLMs) are increasingly deployed for collaborative scientific reasoning and problem-solving. These systems require agents to coordinate under shared constraints, such as GPUs or credit balances, where cooperative behavior matters. Behavioral economics provides a rich toolkit of games that isolate distinct cooperation mechanisms, yet it remains unknown whether a model's behavior in these stylized settings predicts its performance in realistic collaborative tasks. Here, we benchmark 35 open-weight LLMs across six behavioral economics games and show that game-derived cooperative profiles robustly predict downstream performance in AI-for-Science tasks, where teams of LLM agents collaboratively analyze data, build models, and produce scientific reports under shared budget constraints. Models that effectively coordinate games and invest in multiplicative team production (rather than greedy strategies) produce better scientific reports across three outcomes, accuracy, quality, and completion. These associations hold after controlling for multiple factors, indicating that cooperative disposition is a distinct, measurable property of LLMs not reducible to general ability. Our behavioral games framework thus offers a fast and inexpensive diagnostic for screening cooperative fitness before costly multi-agent deployment.",
          "authors": [
            "Shivani Kumar",
            "Adarsh Bharathwaj",
            "David Jurgens"
          ],
          "categories": [
            "cs.CL"
          ],
          "paper_id": "http://arxiv.org/abs/2604.20658v1",
          "abstract_url": "https://arxiv.org/abs/2604.20658v1",
          "pdf_url": "https://arxiv.org/pdf/2604.20658v1",
          "published_at": "2026-04-22T15:07:54+00:00",
          "updated_at": "2026-04-22T15:07:54+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "应用",
            "方法"
          ],
          "topics": [
            "Benchmark",
            "Language Model"
          ],
          "doi": null,
          "arxiv_id": "2604.20658",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2604.20658v1"
          },
          "relevance_score": 105,
          "match_reasons": [
            "title matched \"agent\"",
            "summary matched \"reasoning\"",
            "summary matched \"benchmark\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2604.20658"
        },
        {
          "title": "Relative Principals, Pluralistic Alignment, and the Structural Value Alignment Problem",
          "summary": "The value alignment problem for artificial intelligence (AI) is often framed as a purely technical or normative challenge, sometimes focused on hypothetical future systems. I argue that the problem is better understood as a structural question about governance: not whether an AI system is aligned in the abstract, but whether it is aligned enough, for whom, and at what cost. Drawing on the principal-agent framework from economics, this paper reconceptualises misalignment as arising along three interacting axes: objectives, information, and principals. The three-axis framework provides a systematic way of diagnosing why misalignment arises in real-world systems and clarifies that alignment cannot be treated as a single technical property of models but an outcome shaped by how objectives are specified, how information is distributed, and whose interests count in practice. The core contribution of this paper is to show that the three-axis decomposition implies that alignment is fundamentally a problem of governance rather than engineering alone. From this perspective, alignment is inherently pluralistic and context-dependent, and resolving misalignment involves trade-offs among competing values. Because misalignment can occur along each axis -- and affect stakeholders differently -- the structural description shows that alignment cannot be \"solved\" through technical design alone, but must be managed through ongoing institutional processes that determine how objectives are set, how systems are evaluated, and how affected communities can contest or reshape those decisions.",
          "authors": [
            "Travis LaCroix"
          ],
          "categories": [
            "cs.CY",
            "cs.AI",
            "cs.MA"
          ],
          "paper_id": "http://arxiv.org/abs/2604.20805v1",
          "abstract_url": "https://arxiv.org/abs/2604.20805v1",
          "pdf_url": "https://arxiv.org/pdf/2604.20805v1",
          "published_at": "2026-04-22T17:36:52+00:00",
          "updated_at": "2026-04-22T17:36:52+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "方法"
          ],
          "topics": [
            "Agent",
            "Alignment"
          ],
          "doi": "10.1145/3805689.3812420",
          "arxiv_id": "2604.20805",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2604.20805v1",
            "doi": "https://doi.org/10.1145/3805689.3812420"
          },
          "relevance_score": 101,
          "match_reasons": [
            "title matched \"alignment\"",
            "summary matched \"agent\"",
            "has DOI",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "doi:10.1145/3805689.3812420"
        },
        {
          "title": "Automatic Ontology Construction Using LLMs as an External Layer of Memory, Verification, and Planning for Hybrid Intelligent Systems",
          "summary": "This paper presents a hybrid architecture for intelligent systems in which large language models (LLMs) are extended with an external ontological memory layer. Instead of relying solely on parametric knowledge and vector-based retrieval (RAG), the proposed approach constructs and maintains a structured knowledge graph using RDF/OWL representations, enabling persistent, verifiable, and semantically grounded reasoning. The core contribution is an automated pipeline for ontology construction from heterogeneous data sources, including documents, APIs, and dialogue logs. The system performs entity recognition, relation extraction, normalization, and triple generation, followed by validation using SHACL and OWL constraints, and continuous graph updates. During inference, LLMs operate over a combined context that integrates vector-based retrieval with graph-based reasoning and external tool interaction. Experimental observations on planning tasks, including the Tower of Hanoi benchmark, indicate that ontology augmentation improves performance in multi-step reasoning scenarios compared to baseline LLM systems. In addition, the ontology layer enables formal validation of generated outputs, transforming the system into a generation-verification-correction pipeline. The proposed architecture addresses key limitations of current LLM-based systems, including lack of long-term memory, weak structural understanding, and limited reasoning capabilities. It provides a foundation for building agent-based systems, robotics applications, and enterprise AI solutions that require persistent knowledge, explainability, and reliable decision-making.",
          "authors": [
            "Pavel Salovskii",
            "Iuliia Gorshkova"
          ],
          "categories": [
            "cs.AI"
          ],
          "paper_id": "http://arxiv.org/abs/2604.20795v1",
          "abstract_url": "https://arxiv.org/abs/2604.20795v1",
          "pdf_url": "https://arxiv.org/pdf/2604.20795v1",
          "published_at": "2026-04-22T17:19:43+00:00",
          "updated_at": "2026-04-22T17:19:43+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "应用",
            "方法"
          ],
          "topics": [
            "Benchmark",
            "Language Model"
          ],
          "doi": "10.5281/zenodo.19696042",
          "arxiv_id": "2604.20795",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2604.20795v1",
            "doi": "https://doi.org/10.5281/zenodo.19696042"
          },
          "relevance_score": 97,
          "match_reasons": [
            "summary matched \"agent\"",
            "summary matched \"reasoning\"",
            "summary matched \"benchmark\"",
            "has DOI",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "doi:10.5281/zenodo.19696042"
        },
        {
          "title": "SpeechParaling-Bench: A Comprehensive Benchmark for Paralinguistic-Aware Speech Generation",
          "summary": "Paralinguistic cues are essential for natural human-computer interaction, yet their evaluation in Large Audio-Language Models (LALMs) remains limited by coarse feature coverage and the inherent subjectivity of assessment. To address these challenges, we introduce SpeechParaling-Bench, a comprehensive benchmark for paralinguistic-aware speech generation. It expands existing coverage from fewer than 50 to over 100 fine-grained features, supported by more than 1,000 English-Chinese parallel speech queries, and is organized into three progressively challenging tasks: fine-grained control, intra-utterance variation, and context-aware adaptation. To enable reliable evaluation, we further develop a pairwise comparison pipeline, in which candidate responses are evaluated against a fixed baseline by an LALM-based judge. By framing evaluation as relative preference rather than absolute scoring, this approach mitigates subjectivity and yields more stable and scalable assessments without costly human annotation. Extensive experiments reveal substantial limitations in current LALMs. Even leading proprietary models struggle with comprehensive static control and dynamic modulation of paralinguistic features, while failure to correctly interpret paralinguistic cues accounts for 43.3% of errors in situational dialogue. These findings underscore the need for more robust paralinguistic modeling toward human-aligned voice assistants.",
          "authors": [
            "Ruohan Liu",
            "Shukang Yin",
            "Tao Wang",
            "Dong Zhang",
            "Weiji Zhuang",
            "Shuhuai Ren",
            "Ran He",
            "Caifeng Shan",
            "Chaoyou Fu"
          ],
          "categories": [
            "cs.CL",
            "cs.AI",
            "cs.SD"
          ],
          "paper_id": "http://arxiv.org/abs/2604.20842v1",
          "abstract_url": "https://arxiv.org/abs/2604.20842v1",
          "pdf_url": "https://arxiv.org/pdf/2604.20842v1",
          "published_at": "2026-04-22T17:59:58+00:00",
          "updated_at": "2026-04-22T17:59:58+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "数据",
            "应用",
            "方法"
          ],
          "topics": [
            "Benchmark",
            "Language Model"
          ],
          "doi": null,
          "arxiv_id": "2604.20842",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2604.20842v1"
          },
          "relevance_score": 90,
          "match_reasons": [
            "title matched \"benchmark\"",
            "summary matched \"evaluation\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2604.20842"
        },
        {
          "title": "Can \"AI\" Be a Doctor? A Study of Empathy, Readability, and Alignment in Clinical LLMs",
          "summary": "Large Language Models (LLMs) are increasingly deployed in healthcare, yet their communicative alignment with clinical standards remains insufficiently quantified. We conduct a multidimensional evaluation of general-purpose and domain-specialized LLMs across structured medical explanations and real-world physician-patient interactions, analyzing semantic fidelity, readability, and affective resonance. Baseline models amplify affective polarity relative to physicians (Very Negative: 43.14-45.10% vs. 37.25%) and, in larger architectures such as GPT-5 and Claude, produce substantially higher linguistic complexity (FKGL up to 16.91-17.60 vs. 11.47-12.50 in physician-authored responses). Empathy-oriented prompting reduces extreme negativity and lowers grade-level complexity (up to -6.87 FKGL points for GPT-5) but does not significantly increase semantic fidelity. Collaborative rewriting yields the strongest overall alignment. Rephrase configurations achieve the highest semantic similarity to physician answers (up to mean = 0.93) while consistently improving readability and reducing affective extremity. Dual stakeholder evaluation shows that no model surpasses physicians on epistemic criteria, whereas patients consistently prefer rewritten variants for clarity and emotional tone. These findings suggest that LLMs function most effectively as collaborative communication enhancers rather than replacements for clinical expertise.",
          "authors": [
            "Mariano Barone",
            "Francesco Di Serio",
            "Roberto Moio",
            "Marco Postiglione",
            "Giuseppe Riccio",
            "Antonio Romano",
            "Vincenzo Moscato"
          ],
          "categories": [
            "cs.CL",
            "cs.AI"
          ],
          "paper_id": "http://arxiv.org/abs/2604.20791v1",
          "abstract_url": "https://arxiv.org/abs/2604.20791v1",
          "pdf_url": "https://arxiv.org/pdf/2604.20791v1",
          "published_at": "2026-04-22T17:17:27+00:00",
          "updated_at": "2026-04-22T17:17:27+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "应用",
            "方法"
          ],
          "topics": [
            "Language Model",
            "Evaluation"
          ],
          "doi": null,
          "arxiv_id": "2604.20791",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2604.20791v1"
          },
          "relevance_score": 89,
          "match_reasons": [
            "title matched \"alignment\"",
            "summary matched \"evaluation\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2604.20791"
        },
        {
          "title": "SWE-chat: Coding Agent Interactions From Real Users in the Wild",
          "summary": "AI coding agents are being adopted at scale, yet we lack empirical evidence on how people actually use them and how much of their output is useful in practice. We present SWE-chat, the first large-scale dataset of real coding agent sessions collected from open-source developers in the wild. The dataset currently contains 6,000 sessions, comprising more than 63,000 user prompts and 355,000 agent tool calls. SWE-chat is a living dataset; our collection pipeline automatically and continually discovers and processes sessions from public repositories. Leveraging SWE-chat, we provide an initial empirical characterization of real-world coding agent usage and failure modes. We find that coding patterns are bimodal: in 41% of sessions, agents author virtually all committed code (\"vibe coding\"), while in 23%, humans write all code themselves. Despite rapidly improving capabilities, coding agents remain inefficient in natural settings. Just 44% of all agent-produced code survives into user commits, and agent-written code introduces more security vulnerabilities than code authored by humans. Furthermore, users push back against agent outputs -- through corrections, failure reports, and interruptions -- in 44% of all turns. By capturing complete interaction traces with human vs. agent code authorship attribution, SWE-chat provides an empirical foundation for moving beyond curated benchmarks towards an evidence-based understanding of how AI agents perform in real developer workflows.",
          "authors": [
            "Joachim Baumann",
            "Vishakh Padmakumar",
            "Xiang Li",
            "John Yang",
            "Diyi Yang",
            "Sanmi Koyejo"
          ],
          "categories": [
            "cs.AI",
            "cs.CY",
            "cs.SE"
          ],
          "paper_id": "http://arxiv.org/abs/2604.20779v1",
          "abstract_url": "https://arxiv.org/abs/2604.20779v1",
          "pdf_url": "https://arxiv.org/pdf/2604.20779v1",
          "published_at": "2026-04-22T17:08:19+00:00",
          "updated_at": "2026-04-22T17:08:19+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "数据",
            "应用",
            "方法"
          ],
          "topics": [
            "Benchmark",
            "Agent"
          ],
          "doi": null,
          "arxiv_id": "2604.20779",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2604.20779v1"
          },
          "relevance_score": 89,
          "match_reasons": [
            "title matched \"agent\"",
            "summary matched \"benchmark\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2604.20779"
        },
        {
          "title": "Coverage, Not Averages: Semantic Stratification for Trustworthy Retrieval Evaluation",
          "summary": "Retrieval quality is the primary bottleneck for accuracy and robustness in retrieval-augmented generation (RAG). Current evaluation relies on heuristically constructed query sets, which introduce a hidden intrinsic bias. We formalize retrieval evaluation as a statistical estimation problem, showing that metric reliability is fundamentally limited by the evaluation-set construction. We further introduce \\emph{semantic stratification}, which grounds evaluation in corpus structure by organizing documents into an interpretable global space of entity-based clusters and systematically generating queries for missing strata. This yields (1) formal semantic coverage guarantees across retrieval regimes and (2) interpretable visibility into retrieval failure modes. Experiments across multiple benchmarks and retrieval methods validate our framework. The results expose systematic coverage gaps, identify structural signals that explain variance in retrieval performance, and show that stratified evaluation yields more stable and transparent assessments while supporting more trustworthy decision-making than aggregate metrics.",
          "authors": [
            "Andrew Klearman",
            "Radu Revutchi",
            "Rohin Garg",
            "Rishav Chakravarti",
            "Samuel Marc Denton",
            "Yuan Xue"
          ],
          "categories": [
            "cs.IR",
            "cs.AI",
            "cs.LG"
          ],
          "paper_id": "http://arxiv.org/abs/2604.20763v1",
          "abstract_url": "https://arxiv.org/abs/2604.20763v1",
          "pdf_url": "https://arxiv.org/pdf/2604.20763v1",
          "published_at": "2026-04-22T16:49:30+00:00",
          "updated_at": "2026-04-22T16:49:30+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "数据",
            "方法"
          ],
          "topics": [
            "Benchmark",
            "Evaluation"
          ],
          "doi": null,
          "arxiv_id": "2604.20763",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2604.20763v1"
          },
          "relevance_score": 89,
          "match_reasons": [
            "title matched \"evaluation\"",
            "summary matched \"benchmark\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2604.20763"
        },
        {
          "title": "RespondeoQA: a Benchmark for Bilingual Latin-English Question Answering",
          "summary": "We introduce a benchmark dataset for question answering and translation in bilingual Latin and English settings, containing about 7,800 question-answer pairs. The questions are drawn from Latin pedagogical sources, including exams, quizbowl-style trivia, and textbooks ranging from the 1800s to the present. After automated extraction, cleaning, and manual review, the dataset covers a diverse range of question types: knowledge- and skill-based, multihop reasoning, constrained translation, and mixed language pairs. To our knowledge, this is the first QA benchmark centered on Latin. As a case study, we evaluate three large language models -- LLaMa 3, Qwen QwQ, and OpenAI's o3-mini -- finding that all perform worse on skill-oriented questions. Although the reasoning models perform better on scansion and literary-device tasks, they offer limited improvement overall. QwQ performs slightly better on questions asked in Latin, but LLaMa3 and o3-mini are more task dependent. This dataset provides a new resource for assessing model capabilities in a specialized linguistic and cultural domain, and the creation process can be easily adapted for other languages. The dataset is available at: https://github.com/slanglab/RespondeoQA",
          "authors": [
            "Marisa Hudspeth",
            "Patrick J. Burns",
            "Brendan O'Connor"
          ],
          "categories": [
            "cs.CL"
          ],
          "paper_id": "http://arxiv.org/abs/2604.20738v1",
          "abstract_url": "https://arxiv.org/abs/2604.20738v1",
          "pdf_url": "https://arxiv.org/pdf/2604.20738v1",
          "published_at": "2026-04-22T16:24:46+00:00",
          "updated_at": "2026-04-22T16:24:46+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "数据",
            "方法"
          ],
          "topics": [
            "Benchmark",
            "Language Model"
          ],
          "doi": null,
          "arxiv_id": "2604.20738",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2604.20738v1"
          },
          "relevance_score": 88,
          "match_reasons": [
            "title matched \"benchmark\"",
            "summary matched \"reasoning\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2604.20738"
        },
        {
          "title": "Learning to Evolve: A Self-Improving Framework for Multi-Agent Systems via Textual Parameter Graph Optimization",
          "summary": "Designing and optimizing multi-agent systems (MAS) is a complex, labor-intensive process of \"Agent Engineering.\" Existing automatic optimization methods, primarily focused on flat prompt tuning, lack the structural awareness to debug the intricate web of interactions in MAS. More critically, these optimizers are static; they do not learn from experience to improve their own optimization strategies. To address these gaps, we introduce Textual Parameter Graph Optimization (TPGO), a framework that enables a multi-agent system to learn to evolve. TPGO first models the MAS as a Textual Parameter Graph (TPG), where agents, tools, and workflows are modular, optimizable nodes. To guide evolution, we derive \"textual gradients,\" structured natural language feedback from execution traces, to pinpoint failures and suggest granular modifications. The core of our framework is Group Relative Agent Optimization (GRAO), a novel meta-learning strategy that learns from historical optimization experiences. By analyzing past successes and failures, GRAO becomes progressively better at proposing effective updates, allowing the system to learn how to optimize itself. Extensive experiments on complex benchmarks like GAIA and MCP-Universe show that TPGO significantly enhances the performance of state-of-the-art agent frameworks, achieving higher success rates through automated, self-improving optimization.",
          "authors": [
            "Shan He",
            "Runze Wang",
            "Zhuoyun Du",
            "Huiyu Bai",
            "Zouying Cao",
            "Yu Cheng",
            "Bo Zheng"
          ],
          "categories": [
            "cs.AI"
          ],
          "paper_id": "http://arxiv.org/abs/2604.20714v1",
          "abstract_url": "https://arxiv.org/abs/2604.20714v1",
          "pdf_url": "https://arxiv.org/pdf/2604.20714v1",
          "published_at": "2026-04-22T16:00:46+00:00",
          "updated_at": "2026-04-22T16:00:46+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "应用",
            "方法"
          ],
          "topics": [
            "Benchmark",
            "Agent"
          ],
          "doi": null,
          "arxiv_id": "2604.20714",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2604.20714v1"
          },
          "relevance_score": 88,
          "match_reasons": [
            "title matched \"agent\"",
            "summary matched \"benchmark\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2604.20714"
        },
        {
          "title": "The Expense of Seeing: Attaining Trustworthy Multimodal Reasoning Within the Monolithic Paradigm",
          "summary": "The rapid proliferation of Vision-Language Models (VLMs) is widely celebrated as the dawn of unified multimodal knowledge discovery but its foundation operates on a dangerous, unquestioned axiom: that current VLMs faithfully synthesise multimodal data. We argue they do not. Instead, a profound crisis of trustworthiness underlies the dominant Vision Encoder-Projector-LLM paradigm. Rather than extracting grounded knowledge from visual inputs, state-of-the-art models frequently exhibit functional blindness, i.e., exploiting strong language priors to bypass severe visual representation bottlenecks. In this work, we challenge the conventional methodology of multimodal evaluation, which relies on data ablation or new dataset creation and therefore fatally conflates dataset biases with architectural incapacity. We propose a radical, information-theoretic departure: the Modality Translation Protocol, designed to quantifiably unmask the Expense of Seeing. By translating semantic payloads rather than ablating them, we formulate three novel metrics -- the Toll (ToS), Curse (CoS), and Fallacy (FoS) of Seeing -- culminating in the Semantic Sufficiency Criterion (SSC). Furthermore, we posit a provocative Divergence Law of Multimodal Scaling, hypothesising that as the underlying language engines scale to unprecedented reasoning capabilities, the mathematical penalty of the visual knowledge bottleneck paradoxically increases. We challenge the KDD community to abandon the illusory pursuit of \"multimodal gain\". By elevating the SSC from a passive diagnostic constraint to an active architectural blueprint, we provide the rigorous, trustworthy foundation required to force the next generation of AI systems to truly see the data, achieving true multimodal reasoning.",
          "authors": [
            "Karan Goyal",
            "Dikshant Kukreja"
          ],
          "categories": [
            "cs.CV",
            "cs.AI"
          ],
          "paper_id": "http://arxiv.org/abs/2604.20665v1",
          "abstract_url": "https://arxiv.org/abs/2604.20665v1",
          "pdf_url": "https://arxiv.org/pdf/2604.20665v1",
          "published_at": "2026-04-22T15:15:32+00:00",
          "updated_at": "2026-04-22T15:15:32+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "数据",
            "方法"
          ],
          "topics": [
            "Language Model",
            "Reasoning"
          ],
          "doi": null,
          "arxiv_id": "2604.20665",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2604.20665v1"
          },
          "relevance_score": 87,
          "match_reasons": [
            "title matched \"reasoning\"",
            "summary matched \"evaluation\"",
            "has PDF",
            "has rich summary",
            "has complete metadata",
            "title matched \"multimodal\""
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2604.20665"
        },
        {
          "title": "GRPO-VPS: Enhancing Group Relative Policy Optimization with Verifiable Process Supervision for Effective Reasoning",
          "summary": "Reinforcement Learning with Verifiable Rewards (RLVR) has advanced the reasoning capabilities of Large Language Models (LLMs) by leveraging direct outcome verification instead of learned reward models. Building on this paradigm, Group Relative Policy Optimization (GRPO) eliminates the need for critic models but suffers from indiscriminate credit assignment for intermediate steps, which limits its ability to identify effective reasoning strategies and incurs overthinking. In this work, we introduce a model-free and verifiable process supervision via probing the model's belief in the correct answer throughout its reasoning trajectory. By segmenting the generation into discrete steps and tracking the conditional probability of the correct answer appended at each segment boundary, we efficiently compute interpretable segment-wise progress measurements to refine GRPO's trajectory-level feedback. This approach enables more targeted and sample-efficient policy updates, while avoiding the need for intermediate supervision derived from costly Monte Carlo rollouts or auxiliary models. Experiments on mathematical and general-domain benchmarks show consistent gains over GRPO across diverse models: up to 2.6-point accuracy improvements and 13.7% reasoning-length reductions on math tasks, and up to 2.4 points and 4% on general-domain tasks, demonstrating strong generalization.",
          "authors": [
            "Jingyi Wang",
            "Lei Zhu",
            "Tengjin Weng",
            "Song-Li Wu",
            "Haochen Tan",
            "Jierun Chen",
            "Chaofan Tao",
            "Haoli Bai",
            "Lu Hou",
            "Lifeng Shang",
            "Xiao-Ping Zhang"
          ],
          "categories": [
            "cs.LG",
            "cs.AI"
          ],
          "paper_id": "http://arxiv.org/abs/2604.20659v1",
          "abstract_url": "https://arxiv.org/abs/2604.20659v1",
          "pdf_url": "https://arxiv.org/pdf/2604.20659v1",
          "published_at": "2026-04-22T15:08:58+00:00",
          "updated_at": "2026-04-22T15:08:58+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "方法"
          ],
          "topics": [
            "Benchmark",
            "Language Model"
          ],
          "doi": null,
          "arxiv_id": "2604.20659",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2604.20659v1"
          },
          "relevance_score": 87,
          "match_reasons": [
            "title matched \"reasoning\"",
            "summary matched \"benchmark\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2604.20659"
        }
      ]
    },
    {
      "name": "Vision",
      "key_points": [
        "《LLaDA2.0-Uni: Unifying Multimodal Understanding and Generation with Diffusion Large Language Model》〔方法〕：We present LLaDA2.0-Uni, a unified discrete diffusion large language model (dLLM) that supports multimodal understanding and generation within a natively integ…",
        "《Hallucination Early Detection in Diffusion Models》〔数据 / 方法〕：Text-to-Image generation has seen significant advancements in output realism with the advent of diffusion models. However, diffusion models encounter difficult…",
        "《ParetoSlider: Diffusion Models Post-Training for Continuous Reward Control》〔方法〕：Reinforcement Learning (RL) post-training has become the standard for aligning generative models with human preferences, yet most methods rely on a single scal…"
      ],
      "sort_by": "hybrid",
      "papers": [
        {
          "title": "LLaDA2.0-Uni: Unifying Multimodal Understanding and Generation with Diffusion Large Language Model",
          "summary": "We present LLaDA2.0-Uni, a unified discrete diffusion large language model (dLLM) that supports multimodal understanding and generation within a natively integrated framework. Its architecture combines a fully semantic discrete tokenizer, a MoE-based dLLM backbone, and a diffusion decoder. By discretizing continuous visual inputs via SigLIP-VQ, the model enables block-level masked diffusion for both text and vision inputs within the backbone, while the decoder reconstructs visual tokens into high-fidelity images. Inference efficiency is enhanced beyond parallel decoding through prefix-aware optimizations in the backbone and few-step distillation in the decoder. Supported by carefully curated large-scale data and a tailored multi-stage training pipeline, LLaDA2.0-Uni matches specialized VLMs in multimodal understanding while delivering strong performance in image generation and editing. Its native support for interleaved generation and reasoning establishes a promising and scalable paradigm for next-generation unified foundation models. Codes and models are available at https://github.com/inclusionAI/LLaDA2.0-Uni.",
          "authors": [
            "Inclusion AI",
            "Tiwei Bie",
            "Haoxing Chen",
            "Tieyuan Chen",
            "Zhenglin Cheng",
            "Long Cui",
            "Kai Gan",
            "Zhicheng Huang",
            "Zhenzhong Lan",
            "Haoquan Li",
            "Jianguo Li",
            "Tao Lin",
            "Qi Qin",
            "Hongjun Wang",
            "Xiaomei Wang",
            "Haoyuan Wu",
            "Yi Xin",
            "Junbo Zhao"
          ],
          "categories": [
            "cs.CV"
          ],
          "paper_id": "http://arxiv.org/abs/2604.20796v1",
          "abstract_url": "https://arxiv.org/abs/2604.20796v1",
          "pdf_url": "https://arxiv.org/pdf/2604.20796v1",
          "published_at": "2026-04-22T17:20:42+00:00",
          "updated_at": "2026-04-22T17:20:42+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "方法"
          ],
          "topics": [
            "Language Model",
            "Reasoning"
          ],
          "doi": null,
          "arxiv_id": "2604.20796",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2604.20796v1"
          },
          "relevance_score": 111,
          "match_reasons": [
            "title matched \"diffusion\"",
            "title matched \"multimodal\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2604.20796"
        },
        {
          "title": "Hallucination Early Detection in Diffusion Models",
          "summary": "Text-to-Image generation has seen significant advancements in output realism with the advent of diffusion models. However, diffusion models encounter difficulties when tasked with generating multiple objects, frequently resulting in hallucinations where certain entities are omitted. While existing solutions typically focus on optimizing latent representations within diffusion models, the relevance of the initial generation seed is typically underestimated. While using various seeds in multiple iterations can improve results, this method also significantly increases time and energy costs. To address this challenge, we introduce HEaD+ (Hallucination Early Detection +), a novel approach designed to identify incorrect generations early in the diffusion process. The HEaD+ framework integrates cross-attention maps and textual information with a novel input, the Predicted Final Image. The objective is to assess whether to proceed with the current generation or restart it with a different seed, thereby exploring multiple-generation seeds while conserving time. HEaD+ is trained on the newly created InsideGen dataset of 45,000 generated images, each containing prompts with up to seven objects. Our findings demonstrate a 6-8% increase in the likelihood of achieving a complete generation (i.e., an image accurately representing all specified subjects) with four objects when applying HEaD+ alongside existing models. Additionally, HEaD+ reduces generation times by up to 32% when aiming for a complete image, enhancing the efficiency of generating complete and accurate object representations relative to leading models. Moreover, we propose an integrated localization module that predicts object centroid positions and verifies pairwise spatial relations (if requested by the users) at an intermediate timestep, gating generation together with object presence to further improve relation-consistent outcomes.",
          "authors": [
            "Federico Betti",
            "Lorenzo Baraldi",
            "Rita Cucchiara",
            "Nicu Sebe"
          ],
          "categories": [
            "cs.CV"
          ],
          "paper_id": "http://arxiv.org/abs/2604.20354v1",
          "abstract_url": "https://arxiv.org/abs/2604.20354v1",
          "pdf_url": "https://arxiv.org/pdf/2604.20354v1",
          "published_at": "2026-04-22T08:57:19+00:00",
          "updated_at": "2026-04-22T08:57:19+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "数据",
            "方法"
          ],
          "topics": [
            "Diffusion"
          ],
          "doi": "10.1007/s11263-025-02622-0",
          "arxiv_id": "2604.20354",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2604.20354v1",
            "doi": "https://doi.org/10.1007/s11263-025-02622-0"
          },
          "relevance_score": 75,
          "match_reasons": [
            "title matched \"diffusion\"",
            "has DOI",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "doi:10.1007/s11263-025-02622-0"
        },
        {
          "title": "ParetoSlider: Diffusion Models Post-Training for Continuous Reward Control",
          "summary": "Reinforcement Learning (RL) post-training has become the standard for aligning generative models with human preferences, yet most methods rely on a single scalar reward. When multiple criteria matter, the prevailing practice of ``early scalarization'' collapses rewards into a fixed weighted sum. This commits the model to a single trade-off point at training time, providing no inference-time control over inherently conflicting goals -- such as prompt adherence versus source fidelity in image editing. We introduce ParetoSlider, a multi-objective RL (MORL) framework that trains a single diffusion model to approximate the entire Pareto front. By training the model with continuously varying preference weights as a conditioning signal, we enable users to navigate optimal trade-offs at inference time without retraining or maintaining multiple checkpoints. We evaluate ParetoSlider across three state-of-the-art flow-matching backbones: SD3.5, FluxKontext, and LTX-2. Our single preference-conditioned model matches or exceeds the performance of baselines trained separately for fixed reward trade-offs, while uniquely providing fine-grained control over competing generative goals.",
          "authors": [
            "Shelly Golan",
            "Michael Finkelson",
            "Ariel Bereslavsky",
            "Yotam Nitzan",
            "Or Patashnik"
          ],
          "categories": [
            "cs.LG",
            "cs.CV"
          ],
          "paper_id": "http://arxiv.org/abs/2604.20816v1",
          "abstract_url": "https://arxiv.org/abs/2604.20816v1",
          "pdf_url": "https://arxiv.org/pdf/2604.20816v1",
          "published_at": "2026-04-22T17:44:56+00:00",
          "updated_at": "2026-04-22T17:44:56+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "方法"
          ],
          "topics": [
            "Diffusion"
          ],
          "doi": null,
          "arxiv_id": "2604.20816",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2604.20816v1"
          },
          "relevance_score": 72,
          "match_reasons": [
            "title matched \"diffusion\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2604.20816"
        },
        {
          "title": "Amodal SAM: A Unified Amodal Segmentation Framework with Generalization",
          "summary": "Amodal segmentation is a challenging task that aims to predict the complete geometric shape of objects, including their occluded regions. Although existing methods primarily focus on amodal segmentation within the training domain, these approaches often lack the generalization capacity to extend effectively to novel object categories and unseen contexts. This paper introduces Amodal SAM, a unified framework that leverages SAM (Segment Anything Model) for both amodal image and amodal video segmentation. Amodal SAM preserves the powerful generalization ability of SAM while extending its inherent capabilities to the amodal segmentation task. The improvements lie in three aspects: (1) a lightweight Spatial Completion Adapter that enables occluded region reconstruction, (2) a Target-Aware Occlusion Synthesis (TAOS) pipeline that addresses the scarcity of amodal annotations by generating diverse synthetic training data, and (3) novel learning objectives that enforce regional consistency and topological regularization. Extensive experiments demonstrate that Amodal SAM achieves state-of-the-art performance on standard benchmarks, while simultaneously exhibiting robust generalization to novel scenarios. We anticipate that this research will advance the field toward practical amodal segmentation systems capable of operating effectively in unconstrained real-world environments.",
          "authors": [
            "Bo Zhang",
            "Zhuotao Tian",
            "Xin Tao",
            "Songlin Tang",
            "Jun Yu",
            "Wenjie Pei"
          ],
          "categories": [
            "cs.CV"
          ],
          "paper_id": "http://arxiv.org/abs/2604.20748v1",
          "abstract_url": "https://arxiv.org/abs/2604.20748v1",
          "pdf_url": "https://arxiv.org/pdf/2604.20748v1",
          "published_at": "2026-04-22T16:39:44+00:00",
          "updated_at": "2026-04-22T16:39:44+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "数据",
            "应用",
            "方法"
          ],
          "topics": [
            "Benchmark",
            "Segmentation"
          ],
          "doi": null,
          "arxiv_id": "2604.20748",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2604.20748v1"
          },
          "relevance_score": 70,
          "match_reasons": [
            "title matched \"segmentation\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2604.20748"
        },
        {
          "title": "GeoRelight: Learning Joint Geometrical Relighting and Reconstruction with Flexible Multi-Modal Diffusion Transformers",
          "summary": "Relighting a person from a single photo is an attractive but ill-posed task, as a 2D image ambiguously entangles 3D geometry, intrinsic appearance, and illumination. Current methods either use sequential pipelines that suffer from error accumulation, or they do not explicitly leverage 3D geometry during relighting, which limits physical consistency. Since relighting and estimation of 3D geometry are mutually beneficial tasks, we propose a unified Multi-Modal Diffusion Transformer (DiT) that jointly solves for both: GeoRelight. We make this possible through two key technical contributions: isotropic NDC-Orthographic Depth (iNOD), a distortion-free 3D representation compatible with latent diffusion models; and a strategic mixed-data training method that combines synthetic and auto-labeled real data. By solving geometry and relighting jointly, GeoRelight achieves better performance than both sequential models and previous systems that ignored geometry.",
          "authors": [
            "Yuxuan Xue",
            "Ruofan Liang",
            "Egor Zakharov",
            "Timur Bagautdinov",
            "Chen Cao",
            "Giljoo Nam",
            "Shunsuke Saito",
            "Gerard Pons-Moll",
            "Javier Romero"
          ],
          "categories": [
            "cs.CV"
          ],
          "paper_id": "http://arxiv.org/abs/2604.20715v1",
          "abstract_url": "https://arxiv.org/abs/2604.20715v1",
          "pdf_url": "https://arxiv.org/pdf/2604.20715v1",
          "published_at": "2026-04-22T16:01:04+00:00",
          "updated_at": "2026-04-22T16:01:04+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "方法"
          ],
          "topics": [
            "Diffusion"
          ],
          "doi": null,
          "arxiv_id": "2604.20715",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2604.20715v1"
          },
          "relevance_score": 70,
          "match_reasons": [
            "title matched \"diffusion\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2604.20715"
        },
        {
          "title": "SSL-R1: Self-Supervised Visual Reinforcement Post-Training for Multimodal Large Language Models",
          "summary": "Reinforcement learning (RL) with verifiable rewards (RLVR) has demonstrated the great potential of enhancing the reasoning abilities in multimodal large language models (MLLMs). However, the reliance on language-centric priors and expensive manual annotations prevents MLLMs' intrinsic visual understanding and scalable reward designs. In this work, we introduce SSL-R1, a generic self-supervised RL framework that derives verifiable rewards directly from images. To this end, we revisit self-supervised learning (SSL) in visual domains and reformulate widely-used SSL tasks into a set of verifiable visual puzzles for RL post-training, requiring neither human nor external model supervision. Training MLLMs on these tasks substantially improves their performance on multimodal understanding and reasoning benchmarks, highlighting the potential of leveraging vision-centric self-supervised tasks for MLLM post-training. We think this work will provide useful experience in devising effective self-supervised verifiable rewards to enable RL at scale. Project page: https://github.com/Jiahao000/SSL-R1.",
          "authors": [
            "Jiahao Xie",
            "Alessio Tonioni",
            "Nathalie Rauschmayr",
            "Federico Tombari",
            "Bernt Schiele"
          ],
          "categories": [
            "cs.CV"
          ],
          "paper_id": "http://arxiv.org/abs/2604.20705v1",
          "abstract_url": "https://arxiv.org/abs/2604.20705v1",
          "pdf_url": "https://arxiv.org/pdf/2604.20705v1",
          "published_at": "2026-04-22T15:46:42+00:00",
          "updated_at": "2026-04-22T15:46:42+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "数据",
            "方法"
          ],
          "topics": [
            "Benchmark",
            "Language Model"
          ],
          "doi": null,
          "arxiv_id": "2604.20705",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2604.20705v1"
          },
          "relevance_score": 70,
          "match_reasons": [
            "title matched \"multimodal\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2604.20705"
        },
        {
          "title": "Physics-Informed Conditional Diffusion for Motion-Robust Retinal Temporal Laser Speckle Contrast Imaging",
          "summary": "Retinal laser speckle contrast imaging (LSCI) is a noninvasive optical modality for monitoring retinal blood flow dynamics. However, conventional temporal LSCI (tLSCI) reconstruction relies on sufficiently long speckle sequences to obtain stable temporal statistics, which makes it vulnerable to acquisition disturbances and limits effective temporal resolution. A physically informed reconstruction framework, termed RetinaDiff (Retinal Diffusion Model), is proposed for retinal tLSCI that is robust to motion and requires only a few frames. In RetinaDiff, registration based on phase correlation is first applied to stabilize the raw speckle sequence before contrast computation, reducing interframe misalignment so that fluctuations at each pixel primarily reflect true flow dynamics. This step provides a physics prior corrected for motion and a high quality multiframe tLSCI reference. Next, guided by the physics prior, a conditional diffusion model performs inverse reconstruction by jointly conditioning on the registered speckle sequence and the corrected prior. Experiments on data acquired with a retinal LSCI system developed in house show improved structural continuity and statistical stability compared with direct reconstruction from few frames and representative baselines. The framework also remains effective in a small number of extremely challenging cases, where both the direct 5-frame input and the conventional multiframe reconstruction are severely degraded. Overall, this work provides a practical and physically grounded route for reliable retinal tLSCI reconstruction from extremely limited frames. The source code and model weights will be publicly available at https://github.com/QianChen113/RetinaDiff.",
          "authors": [
            "Qian Chen",
            "Yuehao Chen",
            "Qiang Wang",
            "Lei Zhu",
            "Yanye Lu",
            "Qiushi Ren"
          ],
          "categories": [
            "cs.CV"
          ],
          "paper_id": "http://arxiv.org/abs/2604.20594v1",
          "abstract_url": "https://arxiv.org/abs/2604.20594v1",
          "pdf_url": "https://arxiv.org/pdf/2604.20594v1",
          "published_at": "2026-04-22T14:11:27+00:00",
          "updated_at": "2026-04-22T14:11:27+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "方法"
          ],
          "topics": [
            "Alignment",
            "Diffusion"
          ],
          "doi": null,
          "arxiv_id": "2604.20594",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2604.20594v1"
          },
          "relevance_score": 68,
          "match_reasons": [
            "title matched \"diffusion\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2604.20594"
        },
        {
          "title": "On the Impact of Face Segmentation-Based Background Removal on Recognition and Morphing Attack Detection",
          "summary": "This study investigates the impact of face image background correction through segmentation on face recognition and morphing attack detection performance in realistic, unconstrained image capture scenarios. The motivation is driven by operational biometric systems such as the European Entry/Exit System (EES), which require facial enrolment at airports and other border crossing points where controlled backgrounds usually required for such captures cannot always be guaranteed, as well as by accessibility needs that may necessitate image capture outside traditional office environments. By analyzing how such preprocessing steps influence both recognition accuracy and security mechanisms, this work addresses a critical gap between usability-driven image normalization and the reliability requirements of large-scale biometric identification systems. Our study evaluates a comprehensive range of segmentation techniques, three families of morphing attack detection methods, and four distinct face recognition models, using databases that include both controlled and in-the-wild image captures. The results reveal consistent patterns linking segmentation to both recognition performance and face image quality. Additionally, segmentation is shown to systematically influence morphing attack detection performance. These findings highlight the need for careful consideration when deploying such preprocessing techniques in operational biometric systems.",
          "authors": [
            "Eduarda Caldeira",
            "Guray Ozgur",
            "Fadi Boutros",
            "Naser Damer"
          ],
          "categories": [
            "cs.CV"
          ],
          "paper_id": "http://arxiv.org/abs/2604.20585v1",
          "abstract_url": "https://arxiv.org/abs/2604.20585v1",
          "pdf_url": "https://arxiv.org/pdf/2604.20585v1",
          "published_at": "2026-04-22T14:02:30+00:00",
          "updated_at": "2026-04-22T14:02:30+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "方法"
          ],
          "topics": [
            "Segmentation"
          ],
          "doi": null,
          "arxiv_id": "2604.20585",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2604.20585v1"
          },
          "relevance_score": 68,
          "match_reasons": [
            "title matched \"segmentation\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2604.20585"
        },
        {
          "title": "ProMMSearchAgent: A Generalizable Multimodal Search Agent Trained with Process-Oriented Rewards",
          "summary": "Training multimodal agents via reinforcement learning for knowledge-intensive visual reasoning is fundamentally hindered by the extreme sparsity of outcome-based supervision and the unpredictability of live web environments. To resolve these algorithmic and environmental bottlenecks, we introduce ProMMSearchAgent, establishing a novel Sim-to-Real training paradigm for multimodal search. We decouple policy learning into a deterministic, local static sandbox. Crucially, to learn effectively within this constrained environment, we propose an introspective process-oriented reward. By probing the agent's own parametric knowledge boundaries, we generate dense behavioral metadata that explicitly rewards the correct cognitive decision, initiating a multimodal or text search only when visually or factually uncertain. Extensive experiments demonstrate that our locally-trained policy transfers zero-shot to the live Google Search API. ProMMSearchAgent achieves new SOTA performance, outperforming MMSearch-R1 by +5.1% on FVQA-test, +6.3% on InfoSeek, and +11.3% on MMSearch.",
          "authors": [
            "Wentao Yan",
            "Shengqin Wang",
            "Huichi Zhou",
            "Yihang Chen",
            "Kun Shao",
            "Yuan Xie",
            "Zhizhong Zhang"
          ],
          "categories": [
            "cs.CV"
          ],
          "paper_id": "http://arxiv.org/abs/2604.20486v1",
          "abstract_url": "https://arxiv.org/abs/2604.20486v1",
          "pdf_url": "https://arxiv.org/pdf/2604.20486v1",
          "published_at": "2026-04-22T12:20:46+00:00",
          "updated_at": "2026-04-22T12:20:46+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "应用",
            "方法"
          ],
          "topics": [
            "Reasoning",
            "Multimodal"
          ],
          "doi": null,
          "arxiv_id": "2604.20486",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2604.20486v1"
          },
          "relevance_score": 66,
          "match_reasons": [
            "title matched \"multimodal\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2604.20486"
        }
      ]
    },
    {
      "name": "PubMed AI",
      "key_points": [
        "《Comparative evaluation of large language models for generating CAD-RADS 2.0-compliant diagnostic conclusions in cardiac CT reports.》〔评测 / 应用 / 方法〕：OBJECTIVES: Coronary computed tomography angiography (CCTA) has become a cornerstone in non-invasive CAD diagnosis and risk stratification. To standardize repo…",
        "《Immune checkpoint inhibitors in POLE/POLD1 proofreading-deficient CRC: from molecular basis to clinical practice and future directions.》〔评测 / 应用 / 方法〕：Immune checkpoint inhibitors (ICIs) have fundamentally reshaped the therapeutic paradigm for metastatic colorectal cancer (mCRC). Beyond the established dMMR/M…",
        "《Establishing procedure-specific minimal clinically important difference and patient acceptable symptom state thresholds after anterior combined latissimus dorsi and teres major tendon transfer for irreparable anterosuperior cuff tears: minimum 5-year outcomes.》〔评测 / 应用 / 方法〕：BACKGROUND: To date, no prior study has established procedure-specific minimal clinically important difference (MCID) and patient acceptable symptom state (PAS…"
      ],
      "sort_by": "hybrid",
      "papers": [
        {
          "title": "Comparative evaluation of large language models for generating CAD-RADS 2.0-compliant diagnostic conclusions in cardiac CT reports.",
          "summary": "OBJECTIVES: Coronary computed tomography angiography (CCTA) has become a cornerstone in non-invasive CAD diagnosis and risk stratification. To standardize reporting and improve clinical decision-making, the CAD-RADS 2.0 system was introduced. This study evaluates the performance of four LLMs, GPT-4o, Gemini 2.0 Flash, DeepSeek V, and Copilot in generating CAD-RADS 2.0-compliant conclusions from standardized CCTA reports. MATERIALS AND METHODS: A total of 196 anonymized CCTA reports were retrospectively analyzed. Each LLM was prompted to provide CAD-RADS 2.0 classifications and follow-up recommendations. Ground truth labels were assigned by a senior radiologist. Performance metrics (accuracy, precision, recall, F1-score), execution times, and agreement (Cohen's kappa) with expert interpretation were computed. Interobserver agreement between junior and senior radiologists was also assessed. RESULTS: LLMs demonstrated good-to-excellent agreement with expert classifications: DeepSeek V (κ = 0.771), Copilot (κ = 0.761), GPT-4o (κ = 0.759), and Gemini 2.0 Flash (κ = 0.634). DeepSeek V achieved the highest accuracy (91.83%). Intra-model consistency was perfect (κ = 1). However, LLMs failed to assign CAD-RADS modifiers. ChatGPT-4o provided the most accurate follow-up recommendations (71.94%). All LLMs outperformed radiologists in execution time (3-9 s vs. 15-20 s; p < 0.05). CONCLUSIONS: Generic LLMs demonstrate promising performance in automating CAD-RADS 2.0 classification from CCTA reports. However, limitations in modifier assignment and recommendation accuracy highlight areas for refinement before clinical integration. CRITICAL RELEVANCE STATEMENT: This study explores the potential of large language models to facilitate standardized CAD-RADS 2.0 reporting from coronary CT angiography, highlighting a possible avenue to support workflow efficiency and clinical decision-making in non-invasive coronary artery disease evaluation. 
KEY POINTS: LLMs demonstrated strong potential in automating CAD-RADS 2.0-compliant structured reporting for CCTA. LLMs could significantly enhance efficiency in radiological reporting. LLMs need further optimization before clinical integration.",
          "authors": [
            "Giovanni Lorusso",
            "Giorgio Ruscino",
            "Alessia Spitaleri",
            "Chiara Morelli",
            "Sara Greco",
            "Ilaria Villanova",
            "Nicola Maria Lucarelli",
            "Michele Mariano",
            "Amato Antonio Stabile Ianora",
            "Nicola Maggialetti"
          ],
          "categories": [
            "Journal Article"
          ],
          "paper_id": "pubmed:42018072",
          "abstract_url": "https://pubmed.ncbi.nlm.nih.gov/42018072/",
          "pdf_url": null,
          "published_at": "2026-04-22T11:06:00+00:00",
          "updated_at": "2026-04-22T11:06:00+00:00",
          "source": "pubmed",
          "date_label": "Entered",
          "analysis": null,
          "tags": [
            "评测",
            "应用",
            "方法"
          ],
          "topics": [
            "Language Model",
            "Evaluation"
          ],
          "doi": "10.1186/s13244-026-02285-6",
          "arxiv_id": null,
          "source_variants": [
            "pubmed"
          ],
          "source_urls": {
            "pubmed": "https://pubmed.ncbi.nlm.nih.gov/42018072/",
            "doi": "https://doi.org/10.1186/s13244-026-02285-6"
          },
          "relevance_score": 87,
          "match_reasons": [
            "title matched \"language model\"",
            "summary matched \"clinical\"",
            "has DOI",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "doi:10.1186/s13244-026-02285-6"
        },
        {
          "title": "Immune checkpoint inhibitors in POLE/POLD1 proofreading-deficient CRC: from molecular basis to clinical practice and future directions.",
          "summary": "Immune checkpoint inhibitors (ICIs) have fundamentally reshaped the therapeutic paradigm for metastatic colorectal cancer (mCRC). Beyond the established dMMR/MSI-H population, a molecularly distinct, hyper-immunogenic subset-governed by pathogenic aberrations in the exonuclease domains of POLE/POLD1 -has emerged as a pivotal clinical entity. Characterized by an ultra-hypermutated phenotype, these tumors harbor a mutational load that typically dwarfs the benchmarks established by dMMR/MSI-H malignancies. In this review, we synthesize the molecular underpinnings of POLE/POLD1 deficiency, emphasizing a \"threshold effect\" where extreme neoantigen density triggers a self-reinforcing inflammatory loop, fundamentally reshaping the tumor immune microenvironment (TIME). To ensure a robust synthesis of the field, a systematic literature search was conducted using the PubMed and Web of Science databases until December 2025, with additional manual screening of reference lists from key studies. Our analysis underscores superior, often durable, responses in this subgroup, while addressing a formidable obstacle: the interpretation of Variants of Uncertain Significance (VUS). We highlight the critical need to distinguish passenger mutations from true proofreading defects, as therapeutic benefit is strictly tethered to functional pathogenicity. Finally, we propose an integrated biomarker framework that moves beyond binary genomic screening toward a functional hierarchy of polymerase variants, providing a definitive roadmap for the next generation of precision immunotherapy in colorectal cancer.",
          "authors": [
            "Lei Jiang",
            "Zhongxia Yang",
            "Xiaojun Liu"
          ],
          "categories": [
            "Journal Article",
            "Review"
          ],
          "paper_id": "pubmed:42017297",
          "abstract_url": "https://pubmed.ncbi.nlm.nih.gov/42017297/",
          "pdf_url": null,
          "published_at": "2026-04-22T05:32:00+00:00",
          "updated_at": "2026-04-22T05:32:00+00:00",
          "source": "pubmed",
          "date_label": "Entered",
          "analysis": null,
          "tags": [
            "评测",
            "应用",
            "方法"
          ],
          "topics": [
            "Benchmark",
            "Clinical"
          ],
          "doi": "10.1080/1750743x.2026.2662823",
          "arxiv_id": null,
          "source_variants": [
            "pubmed"
          ],
          "source_urls": {
            "pubmed": "https://pubmed.ncbi.nlm.nih.gov/42017297/",
            "doi": "https://doi.org/10.1080/1750743X.2026.2662823"
          },
          "relevance_score": 81,
          "match_reasons": [
            "title matched \"clinical\"",
            "summary matched \"benchmark\"",
            "has DOI",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "doi:10.1080/1750743x.2026.2662823"
        },
        {
          "title": "Establishing procedure-specific minimal clinically important difference and patient acceptable symptom state thresholds after anterior combined latissimus dorsi and teres major tendon transfer for irreparable anterosuperior cuff tears: minimum 5-year outcomes.",
          "summary": "BACKGROUND: To date, no prior study has established procedure-specific minimal clinically important difference (MCID) and patient acceptable symptom state (PASS) thresholds for anterior combined latissimus dorsi and teres major (LDTM) tendon transfer in irreparable anterosuperior rotator cuff tears (IASRCTs). This study aimed to establish these patient-centered benchmarks in a cohort with a minimum 5-year follow-up. METHODS: We retrospectively reviewed 31 patients (33 shoulders) who underwent a single-stage anterior LDTM transfer for IASRCTs and completed a minimum 5-year follow-up. Patient-reported outcome measures (PROMs) included the American Shoulder and Elbow Surgeons (ASES) score, visual analog scale (VAS) for pain, Constant score, and activities of daily living requiring internal rotation (ADLIR) score. The MCID was calculated as one-half of the standard deviation of the change score for each PROMs. PASS thresholds were derived from receiver operating characteristic analysis, using postoperative satisfaction as the external anchor. RESULTS: At a mean follow-up of 83.0 ± 7.4 months, all PROMs improved significantly ( P < .001). Distribution-based MCID thresholds were 10.5 (ASES), 0.9 (VAS), 10.5 (Constant), and 8.6 (ADLIR). Corresponding MCID achievement rates were 77.4%, 87.1%, 74.2%, and 87.1%, respectively. Anchor-based PASS thresholds were ASES ≥75, VAS ≤2, Constant ≥60, and ADLIR ≥78; these were achieved by 64.5%, 80.6%, 77.4%, and 71.0% of patients, respectively. Age showed a significant negative correlation with ASES MCID (r_pb = -0.53, P = .002) and ADLIR MCID (r_pb = -0.41, P = .021). Male sex correlated positively with ASES PASS attainment (φ = 0.46, P = .010). No other baseline variables were significantly associated with MCID or PASS (all P > .05). CONCLUSION: This study is the first to establish clinically meaningful MCID and PASS thresholds for anterior LDTM transfer in patients with IASRCTs at a minimum 5-year follow-up. 
Most patients achieved substantial improvements that were deemed acceptable by the patients. These procedure-specific benchmarks provide practical targets for clinical assessment and patient counseling and serve as reference values for future outcome research.",
          "authors": [
            "Chang Hee Baek",
            "Jung Gon Kim",
            "Bo Taek Kim",
            "Chaemoon Lim",
            "Seung Jin Kim"
          ],
          "categories": [
            "Journal Article"
          ],
          "paper_id": "pubmed:42017018",
          "abstract_url": "https://pubmed.ncbi.nlm.nih.gov/42017018/",
          "pdf_url": null,
          "published_at": "2026-04-22T04:56:00+00:00",
          "updated_at": "2026-04-22T04:56:00+00:00",
          "source": "pubmed",
          "date_label": "Entered",
          "analysis": null,
          "tags": [
            "评测",
            "应用",
            "方法"
          ],
          "topics": [
            "Benchmark",
            "Clinical"
          ],
          "doi": "10.1016/j.jseint.2026.101635",
          "arxiv_id": null,
          "source_variants": [
            "pubmed"
          ],
          "source_urls": {
            "pubmed": "https://pubmed.ncbi.nlm.nih.gov/42017018/",
            "doi": "https://doi.org/10.1016/j.jseint.2026.101635"
          },
          "relevance_score": 81,
          "match_reasons": [
            "title matched \"clinical\"",
            "summary matched \"benchmark\"",
            "has DOI",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "doi:10.1016/j.jseint.2026.101635"
        },
        {
          "title": "Diagnostic Modalities and Nodal Staging in High-Risk Cutaneous Squamous Cell Carcinoma.",
          "summary": "IMPORTANCE: Early detection of nodal metastases in high-risk cutaneous squamous cell carcinoma (cSCC) is crucial, yet the optimal baseline staging approach remains uncertain. OBJECTIVE: To compare the diagnostic performance of physical examination, ultrasonography, and contrast-enhanced computed tomography (CT) in detecting nodal metastases at baseline staging of high-risk cSCC, both overall and stratified by patients' immune status. DESIGN, SETTING, AND PARTICIPANTS: This was a prospective, multicenter, paired diagnostic study conducted from January 2022 to April 2025 across 13 tertiary dermato-oncology centers in Spain. The study included patients with histologically confirmed high-risk cSCC (stage T2b/T3 or T2a with additional high-risk features). Data were analyzed from July to September 2025. MAIN OUTCOMES AND MEASURES: Sensitivity, specificity, predictive values, and area under the receiver operating characteristic curve (AUROC) of each diagnostic modality, benchmarked against histology or short-term clinical follow-up as reference standard. RESULTS: The analysis included 155 patients (median [IQR] age, 80.3 [74.4-85.5] years; 34 [21.9%] female and 121 [78.1%] male; 64 [41.3%] immunosuppressed), of whom 12 patients (7.7%; 95% CI, 4.3%-13.4%) developed nodal metastases within 3 months after surgery. Ultrasonography results showed the highest overall sensitivity (63.6%; 95% CI, 30.8%-89.1%), followed by CT (54.5%; 95% CI, 23.4%-83.3%) and physical examination (8.3%; 95% CI, 0.2%-38.5%). Specificities were 95.6% (95% CI, 90.6%-98.4%), 95.0% (95% CI, 90.0%-98.0%), and 99.3% (95% CI, 96.2%-100%), respectively. Ultrasonography and CT demonstrated almost perfect agreement (κ = 0.87; 95% CI, 0.72-1.00), whereas concordance with physical examination was poor. Subgroup analysis by immune status revealed marked disparities in diagnostic performance. 
In patients with immunocompetence, both ultrasonography and CT achieved 100% sensitivity (95% CI, 54.1%-100% and 47.8%-100%, respectively) and excellent AUROC (0.98; 95% CI, 0.96-1.00 for both). In contrast, sensitivity declined markedly among patients who were immunosuppressed (20.0% [95% CI, 0.5%-71.6%] for ultrasonography and 16.7% [95% CI, 0.4%-64.1%] for CT; AUROCs, 0.57 ([95% CI, 0.37-0.77] and 0.55 [95% CI, 0.38-0.72], respectively), with metastases often emerging abruptly during follow-up despite negative baseline staging. CONCLUSIONS AND RELEVANCE: This diagnostic study found that ultrasonography and CT significantly outperformed physical examination for detecting baseline nodal metastases in high-risk cSCC and can be used interchangeably depending on clinical context and resource availability. However, their poor performance in patients with immunosuppression reveals a need for tailored recommendations in future clinical practice guidelines and emphasizes the importance of close clinical follow-up in this subgroup.",
          "authors": [
            "Carla Ferrándiz-Pulido",
            "Álvaro Gómez-Tomás",
            "Sahyly Siurana",
            "Carles Tortajada",
            "Rafael Salido-Vallejo",
            "Rafael S Aguayo-Ortiz",
            "Iolanda Ribes Amorós",
            "Lucía Turrión-Merino",
            "Beatriz Brea Álvarez",
            "Íñigo Pérez González",
            "Ignasi Martí-Marti",
            "Santiago Medrano-Martorell",
            "Sebastian Podlipnik",
            "Jordi Mollet",
            "Emili Masferrer",
            "Daniel Lopez-Castillo",
            "Mireia Yébenes",
            "Marc Corbacho-Monné",
            "Lorena Leal",
            "Alberto Solano-López",
            "Verónica Ruiz-Salas",
            "Esther Granell-Moreno",
            "Sheila Alfonso",
            "Maria-Dolores Mendoza",
            "Álvaro Martínez-Domenech",
            "Cecilia Juárez-Dobjanschi",
            "Elia Samaniego González",
            "Paula Díaz",
            "Agustí Toll"
          ],
          "categories": [
            "Journal Article"
          ],
          "paper_id": "pubmed:42018290",
          "abstract_url": "https://pubmed.ncbi.nlm.nih.gov/42018290/",
          "pdf_url": null,
          "published_at": "2026-04-22T11:32:00+00:00",
          "updated_at": "2026-04-22T11:32:00+00:00",
          "source": "pubmed",
          "date_label": "Entered",
          "analysis": null,
          "tags": [
            "评测"
          ],
          "topics": [
            "Benchmark",
            "Clinical"
          ],
          "doi": "10.1001/jamadermatol.2026.0803",
          "arxiv_id": null,
          "source_variants": [
            "pubmed"
          ],
          "source_urls": {
            "pubmed": "https://pubmed.ncbi.nlm.nih.gov/42018290/",
            "doi": "https://doi.org/10.1001/jamadermatol.2026.0803"
          },
          "relevance_score": 65,
          "match_reasons": [
            "summary matched \"benchmark\"",
            "summary matched \"clinical\"",
            "has DOI",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "doi:10.1001/jamadermatol.2026.0803"
        },
        {
          "title": "Defining the learning curve of multi-vessel MIDCAB using CUSUM analysis.",
          "summary": "BackgroundMinimally invasive direct coronary artery bypass (MIDCAB) has emerged as an alternative to conventional coronary artery bypass grafting; however, its adoption remains limited due to technical complexity and a steep learning curve, particularly in patients requiring multi-vessel revascularization. Objective data defining the learning curve of multi-vessel MIDCAB are scarce.MethodsThis retrospective study included consecutive patients who underwent multi-vessel MIDCAB between January 2020 and December 2025. Patients requiring single-vessel revascularization were intentionally excluded to ensure procedural homogeneity. The learning curve was evaluated using cumulative sum (CUSUM) analysis, and cases were stratified into three phases based on CUSUM inflection points. Perioperative and postoperative outcomes were compared across learning curve phases.ResultsA total of 169 patients were analyzed. CUSUM analysis identified three distinct learning phases: an initial learning phase (cases 1-48), a transition phase (cases 49-107), and a proficiency phase (cases 108-169). With increasing surgical experience, cardiopulmonary bypass time, aortic cross-clamp time, and total operative duration decreased significantly. The rate of conversion to open surgery declined markedly across learning phases, whereas in-hospital mortality and major postoperative complications remained low and comparable. These findings indicate improved procedural efficiency without compromising early clinical outcomes.ConclusionsMulti-vessel MIDCAB is associated with a substantial learning curve that can be objectively characterized using CUSUM analysis. Surgical proficiency is achieved only after a considerable number of cases, emphasizing the importance of adequate case volume and structured performance monitoring. These results provide a practical benchmark for centers aiming to adopt or expand multi-vessel MIDCAB programs.",
          "authors": [
            "Barış Timur",
            "Zinar Apaydın",
            "Batuhan Yazıcı",
            "Alper Selim Kocaoğlu",
            "Mehmet Emin Öner",
            "Fatime Üçdağ",
            "Zihni Mert Duman"
          ],
          "categories": [
            "Journal Article"
          ],
          "paper_id": "pubmed:42017544",
          "abstract_url": "https://pubmed.ncbi.nlm.nih.gov/42017544/",
          "pdf_url": null,
          "published_at": "2026-04-22T07:53:00+00:00",
          "updated_at": "2026-04-22T07:53:00+00:00",
          "source": "pubmed",
          "date_label": "Entered",
          "analysis": null,
          "tags": [
            "评测",
            "方法"
          ],
          "topics": [
            "Benchmark",
            "Clinical"
          ],
          "doi": "10.1177/02676591261446466",
          "arxiv_id": null,
          "source_variants": [
            "pubmed"
          ],
          "source_urls": {
            "pubmed": "https://pubmed.ncbi.nlm.nih.gov/42017544/",
            "doi": "https://doi.org/10.1177/02676591261446466"
          },
          "relevance_score": 62,
          "match_reasons": [
            "summary matched \"benchmark\"",
            "summary matched \"clinical\"",
            "has DOI",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "doi:10.1177/02676591261446466"
        }
      ]
    },
    {
      "name": "OpenAlex AI",
      "key_points": [],
      "sort_by": "hybrid",
      "papers": []
    }
  ]
}