{
  "generated_at": "2026-04-22T11:37:03.160861+08:00",
  "timezone": "Asia/Shanghai",
  "lookback_hours": 24,
  "sorting": {
    "default_sort_by": "hybrid",
    "summary": "hybrid (relevance first, published_at tie-break)",
    "weights": {
      "title_match_weight": 40,
      "summary_match_weight": 18,
      "doi_weight": 12,
      "pdf_weight": 8,
      "rich_summary_weight": 6,
      "metadata_weight": 4,
      "multi_source_weight": 10,
      "freshness_weight_cap": 24
    },
    "feeds": [
      {
        "name": "LLM",
        "sort_by": "hybrid"
      },
      {
        "name": "Vision",
        "sort_by": "hybrid"
      },
      {
        "name": "PubMed AI",
        "sort_by": "hybrid"
      },
      {
        "name": "OpenAlex AI",
        "sort_by": "hybrid"
      }
    ]
  },
  "highlights": [
    "主题「Benchmark」：命中 18 篇，覆盖 LLM、Vision 等，代表论文包括 《Four-Axis Decision Alignment for Long-Horizon Enterprise AI Agents》、《Cyber Defense Benchmark: Agentic Threat Hunting Evaluation for LLMs in SecOps》。",
    "主题「Evaluation」：命中 15 篇，覆盖 LLM、Vision 等，代表论文包括 《Four-Axis Decision Alignment for Long-Horizon Enterprise AI Agents》、《Cyber Defense Benchmark: Agentic Threat Hunting Evaluation for LLMs in SecOps》。",
    "主题「Language Model」：命中 8 篇，覆盖 LLM、Vision 等，代表论文包括 《Taming Actor-Observer Asymmetry in Agents via Dialectical Alignment》、《Discovering a Shared Logical Subspace: Steering LLM Logical Reasoning via Alignment of Natural-Language and Symbolic Views》。"
  ],
  "focus_items": [],
  "action_items": [],
  "topic_sections": [
    {
      "name": "Benchmark",
      "paper_count": 18,
      "feed_names": [
        "LLM",
        "Vision",
        "PubMed AI"
      ],
      "paper_titles": [
        "Four-Axis Decision Alignment for Long-Horizon Enterprise AI Agents",
        "Cyber Defense Benchmark: Agentic Threat Hunting Evaluation for LLMs in SecOps",
        "Taming Actor-Observer Asymmetry in Agents via Dialectical Alignment",
        "Discovering a Shared Logical Subspace: Steering LLM Logical Reasoning via Alignment of Natural-Language and Symbolic Views",
        "Beyond Rating: A Comprehensive Evaluation and Benchmark for AI Reviews",
        "A-MAR: Agent-based Multimodal Art Retrieval for Fine-Grained Artwork Understanding",
        "Multi-modal Reasoning with LLMs for Visual Semantic Arithmetic",
        "A Self-Evolving Framework for Efficient Terminal Agents via Observational Context Compression",
        "From Experience to Skill: Multi-Agent Generative Engine Optimization via Reusable Strategy Learning",
        "SafetyALFRED: Evaluating Safety-Conscious Planning of Multimodal Large Language Models",
        "Time Series Augmented Generation for Financial Applications",
        "From Top-1 to Top-K: A Reproducibility Study and Benchmarking of Counterfactual Explanations for Recommender Systems",
        "Lost in Translation: Do LVLM Judges Generalize Across Languages?",
        "Chat2Workflow: A Benchmark for Generating Executable Visual Workflows with Natural Language",
        "Diff-SBSR: Learning Multimodal Feature-Enhanced Diffusion Models for Zero-Shot Sketch-Based 3D Shape Retrieval",
        "How Far Are Video Models from True Multimodal Reasoning?",
        "Classifying American Society of Anesthesiologists Physical Status With a Low-Rank-Adapted Large Language Model: Development and Validation Study.",
        "Comparing Clinical Outcomes in Cardiac Surgical Patients Who Receive Sugammadex Versus Placebo: A Prospective Randomized Blinded Controlled Trial."
      ],
      "key_points": [
        "《Four-Axis Decision Alignment for Long-Horizon Enterprise AI Agents》〔评测 / 应用 / 方法〕：Long-horizon enterprise agents make high-stakes decisions (loan underwriting, claims adjudication, clinical review, prior authorization) under lossy memory, mu…",
        "《Cyber Defense Benchmark: Agentic Threat Hunting Evaluation for LLMs in SecOps》〔评测 / 数据 / 应用 / 方法〕：We introduce the Cyber Defense Benchmark, a benchmark for measuring how well large language model (LLM) agents perform the core SOC analyst task of threat hunt…"
      ]
    },
    {
      "name": "Evaluation",
      "paper_count": 15,
      "feed_names": [
        "LLM",
        "Vision",
        "PubMed AI"
      ],
      "paper_titles": [
        "Four-Axis Decision Alignment for Long-Horizon Enterprise AI Agents",
        "Cyber Defense Benchmark: Agentic Threat Hunting Evaluation for LLMs in SecOps",
        "Revac: A Social Deduction Reasoning Agent",
        "Beyond Rating: A Comprehensive Evaluation and Benchmark for AI Reviews",
        "A-MAR: Agent-based Multimodal Art Retrieval for Fine-Grained Artwork Understanding",
        "Multi-modal Reasoning with LLMs for Visual Semantic Arithmetic",
        "From Experience to Skill: Multi-Agent Generative Engine Optimization via Reusable Strategy Learning",
        "SafetyALFRED: Evaluating Safety-Conscious Planning of Multimodal Large Language Models",
        "Time Series Augmented Generation for Financial Applications",
        "From Top-1 to Top-K: A Reproducibility Study and Benchmarking of Counterfactual Explanations for Recommender Systems",
        "Lost in Translation: Do LVLM Judges Generalize Across Languages?",
        "How Far Are Video Models from True Multimodal Reasoning?",
        "EgoMotion: Hierarchical Reasoning and Diffusion for Egocentric Vision-Language Motion Generation",
        "Enhancing large language model clinical support information with machine learning risk and explainability: a feasibility study.",
        "APSevLM: Acute Pancreatitis Severity Language Model."
      ],
      "key_points": [
        "《Four-Axis Decision Alignment for Long-Horizon Enterprise AI Agents》〔评测 / 应用 / 方法〕：Long-horizon enterprise agents make high-stakes decisions (loan underwriting, claims adjudication, clinical review, prior authorization) under lossy memory, mu…",
        "《Cyber Defense Benchmark: Agentic Threat Hunting Evaluation for LLMs in SecOps》〔评测 / 数据 / 应用 / 方法〕：We introduce the Cyber Defense Benchmark, a benchmark for measuring how well large language model (LLM) agents perform the core SOC analyst task of threat hunt…"
      ]
    },
    {
      "name": "Language Model",
      "paper_count": 8,
      "feed_names": [
        "LLM",
        "Vision",
        "PubMed AI"
      ],
      "paper_titles": [
        "Taming Actor-Observer Asymmetry in Agents via Dialectical Alignment",
        "Discovering a Shared Logical Subspace: Steering LLM Logical Reasoning via Alignment of Natural-Language and Symbolic Views",
        "Chat2Workflow: A Benchmark for Generating Executable Visual Workflows with Natural Language",
        "EgoMotion: Hierarchical Reasoning and Diffusion for Egocentric Vision-Language Motion Generation",
        "Classifying American Society of Anesthesiologists Physical Status With a Low-Rank-Adapted Large Language Model: Development and Validation Study.",
        "Enhancing large language model clinical support information with machine learning risk and explainability: a feasibility study.",
        "Clinical Model Autophagy: The Risk of Interpretative Drift in Recursive Medical AI.",
        "APSevLM: Acute Pancreatitis Severity Language Model."
      ],
      "key_points": [
        "《Taming Actor-Observer Asymmetry in Agents via Dialectical Alignment》〔评测 / 应用 / 方法〕：Large Language Model agents have rapidly evolved from static text generators into dynamic systems capable of executing complex autonomous workflows. To enhance…",
        "《Discovering a Shared Logical Subspace: Steering LLM Logical Reasoning via Alignment of Natural-Language and Symbolic Views》〔评测 / 方法〕：Large Language Models (LLMs) still struggle with multi-step logical reasoning. Existing approaches either purely refine the reasoning chain in natural language…"
      ]
    },
    {
      "name": "Diffusion",
      "paper_count": 4,
      "feed_names": [
        "Vision"
      ],
      "paper_titles": [
        "ReImagine: Rethinking Controllable High-Quality Human Video Generation via Image-First Synthesis",
        "MMControl: Unified Multi-Modal Control for Joint Audio-Video Generation",
        "MedFlowSeg: Flow Matching for Medical Image Segmentation with Frequency-Aware Attention",
        "AnyRecon: Arbitrary-View 3D Reconstruction with Video Diffusion Model"
      ],
      "key_points": [
        "《ReImagine: Rethinking Controllable High-Quality Human Video Generation via Image-First Synthesis》〔数据 / 应用 / 方法〕：Human video generation remains challenging due to the difficulty of jointly modeling human appearance, motion, and camera viewpoint under limited multi-view da…",
        "《MMControl: Unified Multi-Modal Control for Joint Audio-Video Generation》〔应用 / 方法〕：Recent advances in Diffusion Transformers (DiTs) have enabled high-quality joint audio-video generation, producing videos with synchronized audio within a sing…"
      ]
    },
    {
      "name": "Reasoning",
      "paper_count": 3,
      "feed_names": [
        "LLM",
        "Vision"
      ],
      "paper_titles": [
        "Revac: A Social Deduction Reasoning Agent",
        "A Self-Evolving Framework for Efficient Terminal Agents via Observational Context Compression",
        "RF-HiT: Rectified Flow Hierarchical Transformer for General Medical Image Segmentation"
      ],
      "key_points": [
        "《Revac: A Social Deduction Reasoning Agent》〔评测 / 应用 / 方法〕：Social deduction games such as Mafia present a unique AI challenge: players must reason under uncertainty, interpret incomplete and intentionally misleading in…",
        "《A Self-Evolving Framework for Efficient Terminal Agents via Observational Context Compression》〔评测 / 应用 / 方法〕：As model capabilities advance, research has increasingly shifted toward long-horizon, multi-turn terminal-centric agentic tasks, where raw environment feedback…"
      ]
    }
  ],
  "template": "zh_daily_brief",
  "feeds": [
    {
      "name": "LLM",
      "key_points": [
        "《Four-Axis Decision Alignment for Long-Horizon Enterprise AI Agents》〔评测 / 应用 / 方法〕：Long-horizon enterprise agents make high-stakes decisions (loan underwriting, claims adjudication, clinical review, prior authorization) under lossy memory, mu…",
        "《Cyber Defense Benchmark: Agentic Threat Hunting Evaluation for LLMs in SecOps》〔评测 / 数据 / 应用 / 方法〕：We introduce the Cyber Defense Benchmark, a benchmark for measuring how well large language model (LLM) agents perform the core SOC analyst task of threat hunt…",
        "《Taming Actor-Observer Asymmetry in Agents via Dialectical Alignment》〔评测 / 应用 / 方法〕：Large Language Model agents have rapidly evolved from static text generators into dynamic systems capable of executing complex autonomous workflows. To enhance…"
      ],
      "sort_by": "hybrid",
      "papers": [
        {
          "title": "Four-Axis Decision Alignment for Long-Horizon Enterprise AI Agents",
          "summary": "Long-horizon enterprise agents make high-stakes decisions (loan underwriting, claims adjudication, clinical review, prior authorization) under lossy memory, multi-step reasoning, and binding regulatory constraints. Current evaluation reports a single task-success scalar that conflates distinct failure modes and hides whether an agent is aligned with the standards its deployment environment requires. We propose that long-horizon decision behavior decomposes into four orthogonal alignment axes, each independently measurable and failable: factual precision (FRP), reasoning coherence (RCS), compliance reconstruction (CRR), and calibrated abstention (CAR). CRR is a novel regulatory-grounded axis; CAR is a measurement axis separating coverage from accuracy. We exercise the decomposition on a controlled benchmark (LongHorizon-Bench) covering loan qualification and insurance claims adjudication with deterministic ground-truth construction. Running six memory architectures, we find structure aggregate accuracy cannot see: retrieval collapses on factual precision; schema-anchored architectures pay a scaffolding tax; plain summarization under a fact-preservation prompt is a strong baseline on FRP, RCS, EDA, and CRR; and all six architectures commit on every case, exposing a decisional-alignment axis the field has not targeted. The decomposition also surfaced a pre-registered prediction of our own, that summarization would fail factual recall, which the data reversed at large magnitude, an axis-level reversal aggregate accuracy would have hidden. Institutional alignment (regulatory reconstruction) and decisional alignment (calibrated abstention) are under-represented in the alignment literature and become load-bearing once decisions leave the laboratory. The framework transfers to any regulated decisioning domain via two steps: build a fact schema, and calibrate the CRR auditor prompt.",
          "authors": [
            "Vasundra Srininvasan"
          ],
          "categories": [
            "cs.AI"
          ],
          "paper_id": "http://arxiv.org/abs/2604.19457v1",
          "abstract_url": "https://arxiv.org/abs/2604.19457v1",
          "pdf_url": "https://arxiv.org/pdf/2604.19457v1",
          "published_at": "2026-04-21T13:37:19+00:00",
          "updated_at": "2026-04-21T13:37:19+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "应用",
            "方法"
          ],
          "topics": [
            "Benchmark",
            "Evaluation"
          ],
          "doi": null,
          "arxiv_id": "2604.19457",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2604.19457v1"
          },
          "relevance_score": 162,
          "match_reasons": [
            "title matched \"agent\"",
            "title matched \"alignment\"",
            "summary matched \"reasoning\"",
            "summary matched \"benchmark\"",
            "summary matched \"evaluation\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2604.19457"
        },
        {
          "title": "Cyber Defense Benchmark: Agentic Threat Hunting Evaluation for LLMs in SecOps",
          "summary": "We introduce the Cyber Defense Benchmark, a benchmark for measuring how well large language model (LLM) agents perform the core SOC analyst task of threat hunting: given a database of raw Windows event logs with no guided questions or hints, identify the exact timestamps of malicious events. The benchmark wraps 106 real attack procedures from the OTRF Security-Datasets corpus - spanning 86 MITRE ATT&CK sub-techniques across 12 tactics - into a Gymnasium reinforcement-learning environment. Each episode presents the agent with an in-memory SQLite database of 75,000-135,000 log records produced by a deterministic campaign simulator that time-shifts and entity-obfuscates the raw recordings. The agent must iteratively submit SQL queries to discover malicious event timestamps and explicitly flag them, scored CTF-style against Sigma-rule-derived ground truth. Evaluating five frontier models - Claude Opus 4.6, GPT-5, Gemini 3.1 Pro, Kimi K2.5, and Gemini 3 Flash - on 26 campaigns covering 105 of 106 procedures, we find that all models fail dramatically: the best model (Claude Opus 4.6) submits correct flags for only 3.8% of malicious events on average, and no run across any model ever finds all flags. We define a passing score as >= 50% recall on every ATT&CK tactic - the minimum bar for unsupervised SOC deployment. No model passes: the leader clears this bar on 5 of 13 tactics and the remaining four on zero. These results suggest that current LLMs are poorly suited for open-ended, evidence-driven threat hunting despite strong performance on curated Q&A security benchmarks.",
          "authors": [
            "Alankrit Chona",
            "Igor Kozlov",
            "Ambuj Kumar"
          ],
          "categories": [
            "cs.CR",
            "cs.AI"
          ],
          "paper_id": "http://arxiv.org/abs/2604.19533v1",
          "abstract_url": "https://arxiv.org/abs/2604.19533v1",
          "pdf_url": "https://arxiv.org/pdf/2604.19533v1",
          "published_at": "2026-04-21T14:53:23+00:00",
          "updated_at": "2026-04-21T14:53:23+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "数据",
            "应用",
            "方法"
          ],
          "topics": [
            "Benchmark",
            "Evaluation"
          ],
          "doi": null,
          "arxiv_id": "2604.19533",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2604.19533v1"
          },
          "relevance_score": 149,
          "match_reasons": [
            "title matched \"agent\"",
            "title matched \"benchmark\"",
            "title matched \"evaluation\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2604.19533"
        },
        {
          "title": "Taming Actor-Observer Asymmetry in Agents via Dialectical Alignment",
          "summary": "Large Language Model agents have rapidly evolved from static text generators into dynamic systems capable of executing complex autonomous workflows. To enhance reliability, multi-agent frameworks assigning specialized roles are increasingly adopted to enable self-reflection and mutual auditing. While such role-playing effectively leverages domain expert knowledge, we find it simultaneously induces a human-like cognitive bias known as Actor-Observer Asymmetry (AOA). Specifically, an agent acting as an actor (during self-reflection) tends to attribute failures to external factors, whereas an observer (during mutual auditing) attributes the same errors to internal faults. We quantify this using our new Ambiguous Failure Benchmark, which reveals that simply swapping perspectives triggers the AOA effect in over 20% of cases for most models. To tame this bias, we introduce ReTAS (Reasoning via Thesis-Antithesis-Synthesis), a model trained through dialectical alignment to enforce perspective-invariant reasoning. By integrating dialectical chain-of-thought with Group Relative Policy Optimization, ReTAS guides agents to synthesize conflicting viewpoints into an objective consensus. Experiments demonstrate that ReTAS effectively mitigates attribution inconsistency and significantly improves fault resolution rates in ambiguous scenarios.",
          "authors": [
            "Bobo Li",
            "Rui Wu",
            "Zibo Ji",
            "Meishan Zhang",
            "Hao Fei",
            "Min Zhang",
            "Mong-Li Lee",
            "Wynne Hsu"
          ],
          "categories": [
            "cs.CL",
            "cs.AI",
            "cs.CY"
          ],
          "paper_id": "http://arxiv.org/abs/2604.19548v1",
          "abstract_url": "https://arxiv.org/abs/2604.19548v1",
          "pdf_url": "https://arxiv.org/pdf/2604.19548v1",
          "published_at": "2026-04-21T15:05:58+00:00",
          "updated_at": "2026-04-21T15:05:58+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "应用",
            "方法"
          ],
          "topics": [
            "Benchmark",
            "Language Model"
          ],
          "doi": null,
          "arxiv_id": "2604.19548",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2604.19548v1"
          },
          "relevance_score": 145,
          "match_reasons": [
            "title matched \"agent\"",
            "title matched \"alignment\"",
            "summary matched \"reasoning\"",
            "summary matched \"benchmark\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2604.19548"
        },
        {
          "title": "Discovering a Shared Logical Subspace: Steering LLM Logical Reasoning via Alignment of Natural-Language and Symbolic Views",
          "summary": "Large Language Models (LLMs) still struggle with multi-step logical reasoning. Existing approaches either purely refine the reasoning chain in natural language form or attach a symbolic solver as an external module. In this work, we instead ask whether LLMs contain a shared internal logical subspace that simultaneously aligns natural-language and symbolic-language views of the reasoning process. Our hypothesis is that this logical subspace captures logical reasoning capabilities in LLMs that are shared across views while remaining independent of surface forms. To verify this, we employ Canonical Correlation Analysis on the paired residual activations from natural-language and symbolic-language reasoning chains, learning a low-dimensional subspace with maximum cross-view correlation. Furthermore, we design a training-free approach that steers LLMs reasoning chain along this logical subspace, thereby leveraging the complementary reasoning signals from both views. Experiments on four logical reasoning benchmarks demonstrate the effectiveness of our approach, improving accuracy by up to 11 percentage points and generalizing well on out-of-domain problems.",
          "authors": [
            "Feihao Fang",
            "My T. Thai",
            "Yuanyuan Lei"
          ],
          "categories": [
            "cs.CL"
          ],
          "paper_id": "http://arxiv.org/abs/2604.19716v1",
          "abstract_url": "https://arxiv.org/abs/2604.19716v1",
          "pdf_url": "https://arxiv.org/pdf/2604.19716v1",
          "published_at": "2026-04-21T17:42:54+00:00",
          "updated_at": "2026-04-21T17:42:54+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "方法"
          ],
          "topics": [
            "Benchmark",
            "Language Model"
          ],
          "doi": null,
          "arxiv_id": "2604.19716",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2604.19716v1"
          },
          "relevance_score": 130,
          "match_reasons": [
            "title matched \"reasoning\"",
            "title matched \"alignment\"",
            "summary matched \"benchmark\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2604.19716"
        },
        {
          "title": "Revac: A Social Deduction Reasoning Agent",
          "summary": "Social deduction games such as Mafia present a unique AI challenge: players must reason under uncertainty, interpret incomplete and intentionally misleading information, evaluate human-like communication, and make strategic elimination decisions. Unlike deterministic board games, success in Mafia depends not on perfect information or brute-force search, but on inference, memory, and adaptability in the presence of deception. This work presents the design and evaluation of Revac-8, an AI agent developed for the Social Deduction track of the MindGames Arena competition, where it achieved first place. The final agent evolved from a simple two-stage reasoning system into a multi-module architecture that integrates memory-based player profiling, social-graph analysis of accusations and defenses, and dynamic tone selection for communication. These results highlight the importance of structured memory and adaptive communication for achieving strong performance in high-stakes social environments.",
          "authors": [
            "Mihir Shriniwas Arya",
            "Avinash Anish",
            "Aditya Ranjan"
          ],
          "categories": [
            "cs.AI"
          ],
          "paper_id": "http://arxiv.org/abs/2604.19523v1",
          "abstract_url": "https://arxiv.org/abs/2604.19523v1",
          "pdf_url": "https://arxiv.org/pdf/2604.19523v1",
          "published_at": "2026-04-21T14:45:10+00:00",
          "updated_at": "2026-04-21T14:45:10+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "应用",
            "方法"
          ],
          "topics": [
            "Evaluation",
            "Reasoning"
          ],
          "doi": null,
          "arxiv_id": "2604.19523",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2604.19523v1"
          },
          "relevance_score": 127,
          "match_reasons": [
            "title matched \"agent\"",
            "title matched \"reasoning\"",
            "summary matched \"evaluation\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2604.19523"
        },
        {
          "title": "Beyond Rating: A Comprehensive Evaluation and Benchmark for AI Reviews",
          "summary": "The rapid adoption of Large Language Models (LLMs) has spurred interest in automated peer review; however, progress is currently stifled by benchmarks that treat reviewing primarily as a rating prediction task. We argue that the utility of a review lies in its textual justification--its arguments, questions, and critique--rather than a scalar score. To address this, we introduce Beyond Rating, a holistic evaluation framework that assesses AI reviewers across five dimensions: Content Faithfulness, Argumentative Alignment, Focus Consistency, Question Constructiveness, and AI-Likelihood. Notably, we propose a Max-Recall strategy to accommodate valid expert disagreement and introduce a curated dataset of paper with high-confidence reviews, rigorously filtered to remove procedural noise. Extensive experiments demonstrate that while traditional n-gram metrics fail to reflect human preferences, our proposed text-centric metrics--particularly the recall of weakness arguments--correlate strongly with rating accuracy. These findings establish that aligning AI critique focus with human experts is a prerequisite for reliable automated scoring, offering a robust standard for future research.",
          "authors": [
            "Bowen Li",
            "Haochen Ma",
            "Yuxin Wang",
            "Jie Yang",
            "Xinchi Chen",
            "Xuanjing Huang",
            "Yining Zheng",
            "Xipeng Qiu"
          ],
          "categories": [
            "cs.CL"
          ],
          "paper_id": "http://arxiv.org/abs/2604.19502v1",
          "abstract_url": "https://arxiv.org/abs/2604.19502v1",
          "pdf_url": "https://arxiv.org/pdf/2604.19502v1",
          "published_at": "2026-04-21T14:21:15+00:00",
          "updated_at": "2026-04-21T14:21:15+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "数据",
            "应用",
            "方法"
          ],
          "topics": [
            "Benchmark",
            "Evaluation"
          ],
          "doi": null,
          "arxiv_id": "2604.19502",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2604.19502v1"
          },
          "relevance_score": 126,
          "match_reasons": [
            "title matched \"benchmark\"",
            "title matched \"evaluation\"",
            "summary matched \"alignment\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2604.19502"
        },
        {
          "title": "A-MAR: Agent-based Multimodal Art Retrieval for Fine-Grained Artwork Understanding",
          "summary": "Understanding artworks requires multi-step reasoning over visual content and cultural, historical, and stylistic context. While recent multimodal large language models show promise in artwork explanation, they rely on implicit reasoning and internalized knowl- edge, limiting interpretability and explicit evidence grounding. We propose A-MAR, an Agent-based Multimodal Art Retrieval framework that explicitly conditions retrieval on structured reasoning plans. Given an artwork and a user query, A-MAR first decomposes the task into a structured reasoning plan that specifies the goals and evidence requirements for each step. Retrieval is then conditionedon this plan, enabling targeted evidence selection and supporting step-wise, grounded explanations. To evaluate agent-based multi- modal reasoning within the art domain, we introduce ArtCoT-QA. This diagnostic benchmark features multi-step reasoning chains for diverse art-related queries, enabling a granular analysis that extends beyond simple final answer accuracy. Experiments on SemArt and Artpedia show that A-MAR consistently outperforms static, non planned retrieval and strong MLLM baselines in final explanation quality, while evaluations on ArtCoT-QA further demonstrate its advantages in evidence grounding and multi-step reasoning ability. These results highlight the importance of reasoning-conditioned retrieval for knowledge-intensive multimodal understanding and position A-MAR as a step toward interpretable, goal-driven AI systems, with particular relevance to cultural industries. The code and data are available at: https://github.com/ShuaiWang97/A-MAR.",
          "authors": [
            "Shuai Wang",
            "Hongyi Zhu",
            "Jia-Hong Huang",
            "Yixian Shen",
            "Chengxi Zeng",
            "Stevan Rudinac",
            "Monika Kackovic",
            "Nachoem Wijnberg",
            "Marcel Worring"
          ],
          "categories": [
            "cs.AI"
          ],
          "paper_id": "http://arxiv.org/abs/2604.19689v1",
          "abstract_url": "https://arxiv.org/abs/2604.19689v1",
          "pdf_url": "https://arxiv.org/pdf/2604.19689v1",
          "published_at": "2026-04-21T17:11:48+00:00",
          "updated_at": "2026-04-21T17:11:48+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "方法"
          ],
          "topics": [
            "Benchmark",
            "Evaluation"
          ],
          "doi": null,
          "arxiv_id": "2604.19689",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2604.19689v1"
          },
          "relevance_score": 125,
          "match_reasons": [
            "title matched \"agent\"",
            "summary matched \"reasoning\"",
            "summary matched \"benchmark\"",
            "summary matched \"evaluation\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2604.19689"
        },
        {
          "title": "Multi-modal Reasoning with LLMs for Visual Semantic Arithmetic",
          "summary": "Reinforcement learning (RL) as post-training is crucial for enhancing the reasoning ability of large language models (LLMs) in coding and math. However, their capacity for visual semantic arithmetic, inferring relationships from images, remains underexplored. The classic text analogy \"king\"-\"man\"+\"woman\" = \"queen\" illustrates relational reasoning, yet replacing text with images of \"king\" and \"man\" significantly reduces performance because it requires commonsense knowledge and the extraction of concise concepts from irrelevant visual details. This capability is important for service and domestic robotics in unstructured environments, where robots must infer semantic relationships among objects, agents, and actions. In a kitchen, recognizing from images that \"powder\" and \"cake\" are related by \"is made of\" grounds symbolic relations in perception, enabling tool substitution, task generalization, and improved semantic reasoning. Prior work approaches semantic arithmetic by decoding image features after vector arithmetic, but suffers from modality gaps and lacks systematic evaluation. In this paper, we formulate two novel tasks, two-term subtraction and three-term operations, and construct the Image-Relation-Pair Dataset (IRPD) for benchmarking. We further propose Semantic Arithmetic Reinforcement Fine-Tuning (SAri-RFT), which post-trains large vision-language models (LVLMs) using a verifiable function and Group Relative Policy Optimization (GRPO). Our method achieves state-of-the-art results on IRPD and the real-world Visual7W-Telling dataset. By equipping LVLMs with robust cross-modal relational reasoning, this work advances domestic robots' ability to ground symbolic reasoning in perception, enhancing decision-making, tool adaptability, and human-robot interaction in complex environments. Datasets and source code are provided in the supplementary material.",
          "authors": [
            "Chuou Xu",
            "Liya Ji",
            "Qifeng Chen"
          ],
          "categories": [
            "cs.AI"
          ],
          "paper_id": "http://arxiv.org/abs/2604.19567v1",
          "abstract_url": "https://arxiv.org/abs/2604.19567v1",
          "pdf_url": "https://arxiv.org/pdf/2604.19567v1",
          "published_at": "2026-04-21T15:19:49+00:00",
          "updated_at": "2026-04-21T15:19:49+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "数据",
            "应用",
            "方法"
          ],
          "topics": [
            "Benchmark",
            "Evaluation"
          ],
          "doi": null,
          "arxiv_id": "2604.19567",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2604.19567v1"
          },
          "relevance_score": 123,
          "match_reasons": [
            "title matched \"reasoning\"",
            "summary matched \"agent\"",
            "summary matched \"benchmark\"",
            "summary matched \"evaluation\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2604.19567"
        },
        {
          "title": "A Self-Evolving Framework for Efficient Terminal Agents via Observational Context Compression",
          "summary": "As model capabilities advance, research has increasingly shifted toward long-horizon, multi-turn terminal-centric agentic tasks, where raw environment feedback is often preserved in the interaction history to support future decisions. However, repeatedly retaining such feedback introduces substantial redundancy and causes cumulative token cost to grow quadratically with the number of steps, hindering long-horizon reasoning. Although observation compression can mitigate this issue, the heterogeneity of terminal environments makes heuristic-based or fixed-prompt methods difficult to generalize. We propose TACO, a plug-and-play, self-evolving Terminal Agent Compression framework that automatically discovers and refines compression rules from interaction trajectories for existing terminal agents. Experiments on TerminalBench (TB 1.0 and TB 2.0) and four additional terminal-related benchmarks (i.e., SWE-Bench Lite, CompileBench, DevEval, and CRUST-Bench) show that TACO consistently improves performance across mainstream agent frameworks and strong backbone models. With MiniMax-2.5, it improves performance on most benchmarks while reducing token overhead by around 10%. On TerminalBench, it brings consistent gains of 1%-4% across strong agentic models, and further improves accuracy by around 2%-3% under the same token budget. These results demonstrate the effectiveness and generalization of self-evolving, task-aware compression for terminal agents.",
          "authors": [
            "Jincheng Ren",
            "Siwei Wu",
            "Yizhi Li",
            "Kang Zhu",
            "Shu Xu",
            "Boyu Feng",
            "Ruibin Yuan",
            "Wei Zhang",
            "Riza Batista-Navarro",
            "Jian Yang",
            "Chenghua Lin"
          ],
          "categories": [
            "cs.CL"
          ],
          "paper_id": "http://arxiv.org/abs/2604.19572v1",
          "abstract_url": "https://arxiv.org/abs/2604.19572v1",
          "pdf_url": "https://arxiv.org/pdf/2604.19572v1",
          "published_at": "2026-04-21T15:25:54+00:00",
          "updated_at": "2026-04-21T15:25:54+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "应用",
            "方法"
          ],
          "topics": [
            "Benchmark",
            "Reasoning"
          ],
          "doi": null,
          "arxiv_id": "2604.19572",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2604.19572v1"
          },
          "relevance_score": 105,
          "match_reasons": [
            "title matched \"agent\"",
            "summary matched \"reasoning\"",
            "summary matched \"benchmark\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2604.19572"
        },
        {
          "title": "From Experience to Skill: Multi-Agent Generative Engine Optimization via Reusable Strategy Learning",
          "summary": "Generative engines (GEs) are reshaping information access by replacing ranked links with citation-grounded answers, yet current Generative Engine Optimization (GEO) methods optimize each instance in isolation, unable to accumulate or transfer effective strategies across tasks and engines. We reframe GEO as a strategy learning problem and propose MAGEO, a multi-agent framework in which coordinated planning, editing, and fidelity-aware evaluation serve as the execution layer, while validated editing patterns are progressively distilled into reusable, engine-specific optimization skills. To enable controlled assessment, we introduce a Twin Branch Evaluation Protocol for causal attribution of content edits and DSV-CF, a dual-axis metric that unifies semantic visibility with attribution accuracy. We further release MSME-GEO-Bench, a multi-scenario, multi-engine benchmark grounded in real-world queries. Experiments on three mainstream engines show that MAGEO substantially outperforms heuristic baselines in both visibility and citation fidelity, with ablations confirming that engine-specific preference modeling and strategy reuse are central to these gains, suggesting a scalable learning-driven paradigm for trustworthy GEO. Code is available at https://github.com/Wu-beining/MAGEO",
          "authors": [
            "Beining Wu",
            "Fuyou Mao",
            "Jiong Lin",
            "Cheng Yang",
            "Jiaxuan Lu",
            "Yifu Guo",
            "Siyu Zhang",
            "Yifan Wu",
            "Ying Huang",
            "Fu Li"
          ],
          "categories": [
            "cs.AI"
          ],
          "paper_id": "http://arxiv.org/abs/2604.19516v1",
          "abstract_url": "https://arxiv.org/abs/2604.19516v1",
          "pdf_url": "https://arxiv.org/pdf/2604.19516v1",
          "published_at": "2026-04-21T14:39:24+00:00",
          "updated_at": "2026-04-21T14:39:24+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "方法"
          ],
          "topics": [
            "Benchmark",
            "Evaluation"
          ],
          "doi": null,
          "arxiv_id": "2604.19516",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2604.19516v1"
          },
          "relevance_score": 105,
          "match_reasons": [
            "title matched \"agent\"",
            "summary matched \"benchmark\"",
            "summary matched \"evaluation\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2604.19516"
        },
        {
          "title": "SafetyALFRED: Evaluating Safety-Conscious Planning of Multimodal Large Language Models",
          "summary": "Multimodal Large Language Models are increasingly adopted as autonomous agents in interactive environments, yet their ability to proactively address safety hazards remains insufficient. We introduce SafetyALFRED, built upon the embodied agent benchmark ALFRED, augmented with six categories of real-world kitchen hazards. While existing safety evaluations focus on hazard recognition through disembodied question answering (QA) settings, we evaluate eleven state-of-the-art models from the Qwen, Gemma, and Gemini families on not only hazard recognition, but also active risk mitigation through embodied planning. Our experimental results reveal a significant alignment gap: while models can accurately recognize hazards in QA settings, average mitigation success rates for these hazards are low in comparison. Our findings demonstrate that static evaluations through QA are insufficient for physical safety, thus we advocate for a paradigm shift toward benchmarks that prioritize corrective actions in embodied contexts. We open-source our code and dataset under https://github.com/sled-group/SafetyALFRED.git",
          "authors": [
            "Josue Torres-Fonseca",
            "Naihao Deng",
            "Yinpei Dai",
            "Shane Storks",
            "Yichi Zhang",
            "Rada Mihalcea",
            "Casey Kennington",
            "Joyce Chai"
          ],
          "categories": [
            "cs.AI",
            "cs.CL",
            "cs.RO"
          ],
          "paper_id": "http://arxiv.org/abs/2604.19638v1",
          "abstract_url": "https://arxiv.org/abs/2604.19638v1",
          "pdf_url": "https://arxiv.org/pdf/2604.19638v1",
          "published_at": "2026-04-21T16:27:20+00:00",
          "updated_at": "2026-04-21T16:27:20+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "数据",
            "方法"
          ],
          "topics": [
            "Benchmark",
            "Evaluation"
          ],
          "doi": null,
          "arxiv_id": "2604.19638",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2604.19638v1"
          },
          "relevance_score": 102,
          "match_reasons": [
            "summary matched \"agent\"",
            "summary matched \"alignment\"",
            "summary matched \"benchmark\"",
            "summary matched \"evaluation\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2604.19638"
        },
        {
          "title": "Time Series Augmented Generation for Financial Applications",
          "summary": "Evaluating the reasoning capabilities of Large Language Models (LLMs) for complex, quantitative financial tasks is a critical and unsolved challenge. Standard benchmarks often fail to isolate an agent's core ability to parse queries and orchestrate computations. To address this, we introduce a novel evaluation methodology and benchmark designed to rigorously measure an LLM agent's reasoning for financial time-series analysis. We apply this methodology in a large-scale empirical study using our framework, Time Series Augmented Generation (TSAG), where an LLM agent delegates quantitative tasks to verifiable, external tools. Our benchmark, consisting of 100 financial questions, is used to compare multiple SOTA agents (e.g., GPT-4o, Llama 3, Qwen2) on metrics assessing tool selection accuracy, faithfulness, and hallucination. The results demonstrate that capable agents can achieve near-perfect tool-use accuracy with minimal hallucination, validating the tool-augmented paradigm. Our primary contribution is this evaluation framework and the corresponding empirical insights into agent performance, which we release publicly to foster standardized research on reliable financial AI.",
          "authors": [
            "Anton Kolonin",
            "Alexey Glushchenko",
            "Evgeny Bochkov",
            "Abhishek Saxena"
          ],
          "categories": [
            "cs.AI",
            "cs.CE"
          ],
          "paper_id": "http://arxiv.org/abs/2604.19633v1",
          "abstract_url": "https://arxiv.org/abs/2604.19633v1",
          "pdf_url": "https://arxiv.org/pdf/2604.19633v1",
          "published_at": "2026-04-21T16:20:59+00:00",
          "updated_at": "2026-04-21T16:20:59+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "应用",
            "方法"
          ],
          "topics": [
            "Benchmark",
            "Evaluation"
          ],
          "doi": null,
          "arxiv_id": "2604.19633",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2604.19633v1"
          },
          "relevance_score": 102,
          "match_reasons": [
            "summary matched \"agent\"",
            "summary matched \"reasoning\"",
            "summary matched \"benchmark\"",
            "summary matched \"evaluation\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2604.19633"
        },
        {
          "title": "From Top-1 to Top-K: A Reproducibility Study and Benchmarking of Counterfactual Explanations for Recommender Systems",
          "summary": "Counterfactual explanations (CEs) provide an intuitive way to understand recommender systems by identifying minimal modifications to user-item interactions that alter recommendation outcomes. Existing CE methods for recommender systems, however, have been evaluated under heterogeneous protocols, using different datasets, recommenders, metrics, and even explanation formats, which hampers reproducibility and fair comparison. Our paper systematically reproduces, re-implement, and re-evaluate eleven state-of-the-art CE methods for recommender systems, covering both native explainers (e.g., LIME-RS, SHAP, PRINCE, ACCENT, LXR, GREASE) and specific graph-based explainers originally proposed for GNNs. Here, a unified benchmarking framework is proposed to assess explainers along three dimensions: explanation format (implicit vs. explicit), evaluation level (item-level vs. list-level), and perturbation scope (user interaction vectors vs. user-item interaction graphs). Our evaluation protocol includes effectiveness, sparsity, and computational complexity metrics, and extends existing item-level assessments to top-K list-level explanations. Through extensive experiments on three real-world datasets and six representative recommender models, we analyze how well previously reported strengths of CE methods generalize across diverse setups. We observe that the trade-off between effectiveness and sparsity depends strongly on the specific method and evaluation setting, particularly under the explicit format; in addition, explainer performance remains largely consistent across item level and list level evaluations, and several graph-based explainers exhibit notable scalability limitations on large recommender graphs. Our results refine and challenge earlier conclusions about the robustness and practicality of CE generation methods in recommender systems: https://github.com/L2R-UET/CFExpRec.",
          "authors": [
            "Quang-Huy Nguyen",
            "Thanh-Hai Nguyen",
            "Khac-Manh Thai",
            "Duc-Hoang Pham",
            "Huy-Son Nguyen",
            "Cam-Van Thi Nguyen",
            "Masoud Mansoury",
            "Duc-Trong Le",
            "Hoang-Quynh Le"
          ],
          "categories": [
            "cs.IR",
            "cs.LG"
          ],
          "paper_id": "http://arxiv.org/abs/2604.19663v1",
          "abstract_url": "https://arxiv.org/abs/2604.19663v1",
          "pdf_url": "https://arxiv.org/pdf/2604.19663v1",
          "published_at": "2026-04-21T16:48:13+00:00",
          "updated_at": "2026-04-21T16:48:13+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "数据",
            "方法"
          ],
          "topics": [
            "Benchmark",
            "Evaluation"
          ],
          "doi": "10.1145/3805712.3808574",
          "arxiv_id": "2604.19663",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2604.19663v1",
            "doi": "https://doi.org/10.1145/3805712.3808574"
          },
          "relevance_score": 101,
          "match_reasons": [
            "title matched \"benchmark\"",
            "summary matched \"evaluation\"",
            "has DOI",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "doi:10.1145/3805712.3808574"
        },
        {
          "title": "Lost in Translation: Do LVLM Judges Generalize Across Languages?",
          "summary": "Automatic evaluators such as reward models play a central role in the alignment and evaluation of large vision-language models (LVLMs). Despite their growing importance, these evaluators are almost exclusively assessed on English-centric benchmarks, leaving open the question of how well these evaluators generalize across languages. To answer this question, we introduce MM-JudgeBench, the first large-scale benchmark for multilingual and multimodal judge model evaluation, which includes over 60K pairwise preference instances spanning 25 typologically diverse languages. MM-JudgeBench integrates two complementary subsets: a general vision-language preference evaluation subset extending VL-RewardBench, and a chart-centric visual-text reasoning subset derived from OpenCQA, enabling systematic analysis of reward models (i.e., LVLM judges) across diverse settings. We additionally release a multilingual training set derived from MM-RewardBench, disjoint from our evaluation data, to support domain adaptation. By evaluating 22 LVLMs (15 open-source, 7 proprietary), we uncover substantial cross-lingual performance variance in our proposed benchmark. Our analysis further shows that model size and architecture are poor predictors of multilingual robustness, and that even state-of-the-art LVLM judges exhibit inconsistent behavior across languages. Together, these findings expose fundamental limitations of current reward modeling and underscore the necessity of multilingual, multimodal benchmarks for developing reliable automated evaluators.",
          "authors": [
            "Md Tahmid Rahman Laskar",
            "Mohammed Saidul Islam",
            "Mir Tafseer Nayeem",
            "Amran Bhuiyan",
            "Mizanur Rahman",
            "Shafiq Joty",
            "Enamul Hoque",
            "Jimmy Huang"
          ],
          "categories": [
            "cs.CL"
          ],
          "paper_id": "http://arxiv.org/abs/2604.19405v1",
          "abstract_url": "https://arxiv.org/abs/2604.19405v1",
          "pdf_url": "https://arxiv.org/pdf/2604.19405v1",
          "published_at": "2026-04-21T12:29:10+00:00",
          "updated_at": "2026-04-21T12:29:10+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "方法"
          ],
          "topics": [
            "Benchmark",
            "Evaluation"
          ],
          "doi": null,
          "arxiv_id": "2604.19405",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2604.19405v1"
          },
          "relevance_score": 98,
          "match_reasons": [
            "summary matched \"reasoning\"",
            "summary matched \"alignment\"",
            "summary matched \"benchmark\"",
            "summary matched \"evaluation\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2604.19405"
        },
        {
          "title": "Chat2Workflow: A Benchmark for Generating Executable Visual Workflows with Natural Language",
          "summary": "At present, executable visual workflows have emerged as a mainstream paradigm in real-world industrial deployments, offering strong reliability and controllability. However, in current practice, such workflows are almost entirely constructed through manual engineering: developers must carefully design workflows, write prompts for each step, and repeatedly revise the logic as requirements evolve-making development costly, time-consuming, and error-prone. To study whether large language models can automate this multi-round interaction process, we introduce Chat2Workflow, a benchmark for generating executable visual workflows directly from natural language, and propose a robust agentic framework to mitigate recurrent execution errors. Chat2Workflow is built from a large collection of real-world business workflows, with each instance designed so that the generated workflow can be transformed and directly deployed to practical workflow platforms such as Dify and Coze. Experimental results show that while state-of-the-art language models can often capture high-level intent, they struggle to generate correct, stable, and executable workflows, especially under complex or changing requirements. Although our agentic framework yields up to 5.34% resolve rate gains, the remaining real-world gap positions Chat2Workflow as a foundation for advancing industrial-grade automation. Code is available at https://github.com/zjunlp/Chat2Workflow.",
          "authors": [
            "Yi Zhong",
            "Buqiang Xu",
            "Yijun Wang",
            "Zifei Shan",
            "Shuofei Qiao",
            "Guozhou Zheng",
            "Ningyu Zhang"
          ],
          "categories": [
            "cs.CL",
            "cs.AI",
            "cs.CV",
            "cs.LG",
            "cs.MA"
          ],
          "paper_id": "http://arxiv.org/abs/2604.19667v1",
          "abstract_url": "https://arxiv.org/abs/2604.19667v1",
          "pdf_url": "https://arxiv.org/pdf/2604.19667v1",
          "published_at": "2026-04-21T16:49:11+00:00",
          "updated_at": "2026-04-21T16:49:11+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "应用",
            "方法"
          ],
          "topics": [
            "Benchmark",
            "Language Model"
          ],
          "doi": null,
          "arxiv_id": "2604.19667",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2604.19667v1"
          },
          "relevance_score": 89,
          "match_reasons": [
            "title matched \"benchmark\"",
            "summary matched \"agent\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2604.19667"
        }
      ]
    },
    {
      "name": "Vision",
      "key_points": [
        "《PanDA: Unsupervised Domain Adaptation for Multimodal 3D Panoptic Segmentation in Autonomous Driving》〔评测 / 方法〕：This paper presents the first study on Unsupervised Domain Adaptation (UDA) for multimodal 3D panoptic segmentation (mm-3DPS), aiming to improve generalization…",
        "《Diff-SBSR: Learning Multimodal Feature-Enhanced Diffusion Models for Zero-Shot Sketch-Based 3D Shape Retrieval》〔评测 / 方法〕：This paper presents the first exploration of text-to-image diffusion models for zero-shot sketch-based 3D shape retrieval (ZS-SBSR). Existing sketch-based 3D s…",
        "《ReImagine: Rethinking Controllable High-Quality Human Video Generation via Image-First Synthesis》〔数据 / 应用 / 方法〕：Human video generation remains challenging due to the difficulty of jointly modeling human appearance, motion, and camera viewpoint under limited multi-view da…"
      ],
      "sort_by": "hybrid",
      "papers": [
        {
          "title": "PanDA: Unsupervised Domain Adaptation for Multimodal 3D Panoptic Segmentation in Autonomous Driving",
          "summary": "This paper presents the first study on Unsupervised Domain Adaptation (UDA) for multimodal 3D panoptic segmentation (mm-3DPS), aiming to improve generalization under domain shifts commonly encountered in real-world autonomous driving. A straightforward solution is to employ a pseudo-labeling strategy, which is widely used in UDA to generate supervision for unlabeled target data, combined with an mm-3DPS backbone. However, existing supervised mm-3DPS methods rely heavily on strong cross-modal complementarity between LiDAR and RGB inputs, making them fragile under domain shifts where one modality degrades (e.g., poor lighting or adverse weather). Moreover, conventional pseudo-labeling typically retains only high-confidence regions, leading to fragmented masks and incomplete object supervision, which are issues particularly detrimental to panoptic segmentation. To address these challenges, we propose PanDA, the first UDA framework specifically designed for multimodal 3D panoptic segmentation. To improve robustness against single-sensor degradation, we introduce an asymmetric multimodal augmentation that selectively drops regions to simulate domain shifts and improve robust representation learning. To enhance pseudo-label completeness and reliability, we further develop a dual-expert pseudo-label refinement module that extracts domain-invariant priors from both 2D and 3D modalities. Extensive experiments across diverse domain shifts, spanning time, weather, location, and sensor variations, significantly surpass state-of-the-art UDA baselines for 3D semantic segmentation.",
          "authors": [
            "Yining Pan",
            "Shijie Li",
            "Yuchen Wu",
            "Xulei Yang",
            "Na Zhao"
          ],
          "categories": [
            "cs.CV"
          ],
          "paper_id": "http://arxiv.org/abs/2604.19379v1",
          "abstract_url": "https://arxiv.org/abs/2604.19379v1",
          "pdf_url": "https://arxiv.org/pdf/2604.19379v1",
          "published_at": "2026-04-21T12:01:29+00:00",
          "updated_at": "2026-04-21T12:01:29+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "方法"
          ],
          "topics": [
            "Multimodal",
            "Segmentation"
          ],
          "doi": null,
          "arxiv_id": "2604.19379",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2604.19379v1"
          },
          "relevance_score": 106,
          "match_reasons": [
            "title matched \"multimodal\"",
            "title matched \"segmentation\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2604.19379"
        },
        {
          "title": "Diff-SBSR: Learning Multimodal Feature-Enhanced Diffusion Models for Zero-Shot Sketch-Based 3D Shape Retrieval",
          "summary": "This paper presents the first exploration of text-to-image diffusion models for zero-shot sketch-based 3D shape retrieval (ZS-SBSR). Existing sketch-based 3D shape retrieval methods struggle in zero-shot settings due to the absence of category supervision and the extreme sparsity of sketch inputs. Our key insight is that large-scale pretrained diffusion models inherently exhibit open-vocabulary capability and strong shape bias, making them well suited for zero-shot visual retrieval. We leverage a frozen Stable Diffusion backbone to extract and aggregate discriminative representations from intermediate U-Net layers for both sketches and rendered 3D views. Diffusion models struggle with sketches due to their extreme abstraction and sparsity, compounded by a significant domain gap from natural images. To address this limitation without costly retraining, we introduce a multimodal feature-enhanced strategy that conditions the frozen diffusion backbone with complementary visual and textual cues from CLIP, explicitly enhancing the ability of semantic context capture and concentrating on sketch contours. Specifically, we inject global and local visual features derived from a pretrained CLIP visual encoder, and incorporate enriched textual guidance by combining learnable soft prompts with hard textual descriptions generated by BLIP. Furthermore, we employ the Circle-T loss to dynamically strengthen positive-pair attraction once negative samples are sufficiently separated, thereby adapting to sketch noise and enabling more effective sketch-3D alignment. Extensive experiments on two public benchmarks demonstrate that our method consistently outperforms state-of-the-art approaches in ZS-SBSR.",
          "authors": [
            "Hang Cheng",
            "Fanhe Dong",
            "Long Zeng"
          ],
          "categories": [
            "cs.CV"
          ],
          "paper_id": "http://arxiv.org/abs/2604.19135v1",
          "abstract_url": "https://arxiv.org/abs/2604.19135v1",
          "pdf_url": "https://arxiv.org/pdf/2604.19135v1",
          "published_at": "2026-04-21T06:32:34+00:00",
          "updated_at": "2026-04-21T06:32:34+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "方法"
          ],
          "topics": [
            "Benchmark",
            "Alignment"
          ],
          "doi": null,
          "arxiv_id": "2604.19135",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2604.19135v1"
          },
          "relevance_score": 100,
          "match_reasons": [
            "title matched \"diffusion\"",
            "title matched \"multimodal\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2604.19135"
        },
        {
          "title": "ReImagine: Rethinking Controllable High-Quality Human Video Generation via Image-First Synthesis",
          "summary": "Human video generation remains challenging due to the difficulty of jointly modeling human appearance, motion, and camera viewpoint under limited multi-view data. Existing methods often address these factors separately, resulting in limited controllability or reduced visual quality. We revisit this problem from an image-first perspective, where high-quality human appearance is learned via image generation and used as a prior for video synthesis, decoupling appearance modeling from temporal consistency. We propose a pose- and viewpoint-controllable pipeline that combines a pretrained image backbone with SMPL-X-based motion guidance, together with a training-free temporal refinement stage based on a pretrained video diffusion model. Our method produces high-quality, temporally consistent videos under diverse poses and viewpoints. We also release a canonical human dataset and an auxiliary model for compositional human image synthesis. Code and data are publicly available at https://github.com/Taited/ReImagine.",
          "authors": [
            "Zhengwentai Sun",
            "Keru Zheng",
            "Chenghong Li",
            "Hongjie Liao",
            "Xihe Yang",
            "Heyuan Li",
            "Yihao Zhi",
            "Shuliang Ning",
            "Shuguang Cui",
            "Xiaoguang Han"
          ],
          "categories": [
            "cs.CV"
          ],
          "paper_id": "http://arxiv.org/abs/2604.19720v1",
          "abstract_url": "https://arxiv.org/abs/2604.19720v1",
          "pdf_url": "https://arxiv.org/pdf/2604.19720v1",
          "published_at": "2026-04-21T17:47:26+00:00",
          "updated_at": "2026-04-21T17:47:26+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "数据",
            "应用",
            "方法"
          ],
          "topics": [
            "Diffusion",
            "Video Generation"
          ],
          "doi": null,
          "arxiv_id": "2604.19720",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2604.19720v1"
          },
          "relevance_score": 90,
          "match_reasons": [
            "title matched \"video generation\"",
            "summary matched \"diffusion\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2604.19720"
        },
        {
          "title": "MMControl: Unified Multi-Modal Control for Joint Audio-Video Generation",
          "summary": "Recent advances in Diffusion Transformers (DiTs) have enabled high-quality joint audio-video generation, producing videos with synchronized audio within a single model. However, existing controllable generation frameworks are typically restricted to video-only control. This restricts comprehensive controllability and often leads to suboptimal cross-modal alignment. To bridge this gap, we present MMControl, which enables users to perform Multi-Modal Control in joint audio-video generation. MMControl introduces a dual-stream conditional injection mechanism. It incorporates both visual and acoustic control signals, including reference images, reference audio, depth maps, and pose sequences, into a joint generation process. These conditions are injected through bypass branches into a joint audio-video Diffusion Transformer, enabling the model to simultaneously generate identity-consistent video and timbre-consistent audio under structural constraints. Furthermore, we introduce modality-specific guidance scaling, which allows users to independently and dynamically adjust the influence strength of each visual and acoustic condition at inference time. Extensive experiments demonstrate that MMControl achieves fine-grained, composable control over character identity, voice timbre, body pose, and scene layout in joint audio-video generation.",
          "authors": [
            "Liyang Li",
            "Wen Wang",
            "Canyu Zhao",
            "Tianjian Feng",
            "Zhiyue Zhao",
            "Hao Chen",
            "Chunhua Shen"
          ],
          "categories": [
            "cs.CV"
          ],
          "paper_id": "http://arxiv.org/abs/2604.19679v1",
          "abstract_url": "https://arxiv.org/abs/2604.19679v1",
          "pdf_url": "https://arxiv.org/pdf/2604.19679v1",
          "published_at": "2026-04-21T16:57:23+00:00",
          "updated_at": "2026-04-21T16:57:23+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "应用",
            "方法"
          ],
          "topics": [
            "Alignment",
            "Diffusion"
          ],
          "doi": null,
          "arxiv_id": "2604.19679",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2604.19679v1"
          },
          "relevance_score": 89,
          "match_reasons": [
            "title matched \"video generation\"",
            "summary matched \"diffusion\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2604.19679"
        },
        {
          "title": "MedFlowSeg: Flow Matching for Medical Image Segmentation with Frequency-Aware Attention",
          "summary": "Flow matching has recently emerged as a principled framework for learning continuous-time transport maps, enabling efficient deterministic generation without relying on stochastic diffusion processes. While generative modeling has shown promise for medical image segmentation, particularly in capturing uncertainty and complex anatomical variability, existing approaches are predominantly built upon diffusion models, which incur substantial computational overhead due to iterative sampling and are often constrained by UNet-based parameterizations. In this work, we introduce MedFlowSeg, a conditional flow matching framework that formulates medical image segmentation as learning a time-dependent vector field that transports a simple prior distribution to the target segmentation distribution. This formulation enables one-step deterministic inference while preserving the expressiveness of generative modeling. We further develop a dual-conditioning mechanism to incorporate structured priors into the learned flow. Specifically, we propose a Dual-Branch Spatial Attention module that injects multi-scale structural information into the flow field, and a Frequency-Aware Attention module that models cross-domain interactions between spatial and spectral representations via discrepancy-aware fusion and time-dependent modulation. Together, these components provide an effective parameterization of conditional flows that capture both global anatomical structure and fine-grained boundary details. We provide extensive empirical validation across multiple medical imaging modalities, demonstrating that MedFlowSeg achieves state-of-the-art performance while significantly reducing computational cost compared to diffusion-based methods. Our results highlight the potential of flow matching as a theoretically grounded and computationally efficient alternative for generative medical image segmentation.",
          "authors": [
            "Zhi Chen",
            "Runze Hu",
            "Le Zhang"
          ],
          "categories": [
            "cs.CV"
          ],
          "paper_id": "http://arxiv.org/abs/2604.19675v1",
          "abstract_url": "https://arxiv.org/abs/2604.19675v1",
          "pdf_url": "https://arxiv.org/pdf/2604.19675v1",
          "published_at": "2026-04-21T16:54:43+00:00",
          "updated_at": "2026-04-21T16:54:43+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "应用",
            "方法"
          ],
          "topics": [
            "Diffusion",
            "Segmentation"
          ],
          "doi": null,
          "arxiv_id": "2604.19675",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2604.19675v1"
          },
          "relevance_score": 89,
          "match_reasons": [
            "title matched \"segmentation\"",
            "summary matched \"diffusion\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2604.19675"
        },
        {
          "title": "RF-HiT: Rectified Flow Hierarchical Transformer for General Medical Image Segmentation",
          "summary": "Accurate medical image segmentation requires both long-range contextual reasoning and precise boundary delineation, a task where existing transformer- and diffusion-based paradigms are frequently bottlenecked by quadratic computational complexity and prohibitive inference latency. We propose RF-HiT, a Rectified Flow Hierarchical Transformer that integrates an hourglass transformer backbone with a multi-scale hierarchical encoder for anatomically guided feature conditioning. Unlike prior diffusion-based approaches, RF-HiT leverages rectified flow with efficient transformer blocks to achieve linear complexity while requiring only a few discretization steps. The model further fuses conditioning features across resolutions via learnable interpolation, enabling effective multi-scale representation with minimal computational overhead. As a result, RF-HiT achieves a strong efficiency-performance trade-off, requiring only 10.14 GFLOPs, 13.6M parameters, and inference in as few as three steps. Despite its compact design, RF-HiT attains 91.27% mean Dice on ACDC and 87.40% on BraTS 2021, achieving performance comparable to or exceeding that of significantly more intensive architectures. This demonstrates its strong potential as a robust, computationally efficient foundation for real-time clinical segmentation.",
          "authors": [
            "Ahmed Marouane Djouama",
            "Abir Belaala",
            "Abdellah Zakaria Sellam",
            "Salah Eddine Bekhouche",
            "Cosimo Distante",
            "Abdenour Hadid"
          ],
          "categories": [
            "cs.CV"
          ],
          "paper_id": "http://arxiv.org/abs/2604.19570v1",
          "abstract_url": "https://arxiv.org/abs/2604.19570v1",
          "pdf_url": "https://arxiv.org/pdf/2604.19570v1",
          "published_at": "2026-04-21T15:24:39+00:00",
          "updated_at": "2026-04-21T15:24:39+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "应用",
            "方法"
          ],
          "topics": [
            "Reasoning",
            "Clinical"
          ],
          "doi": null,
          "arxiv_id": "2604.19570",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2604.19570v1"
          },
          "relevance_score": 87,
          "match_reasons": [
            "title matched \"segmentation\"",
            "summary matched \"diffusion\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2604.19570"
        },
        {
          "title": "How Far Are Video Models from True Multimodal Reasoning?",
          "summary": "Despite remarkable progress toward general-purpose video models, a critical question remains unanswered: how far are these models from achieving true multimodal reasoning? Existing benchmarks fail to address this question rigorously, as they remain constrained by straightforward task designs and fragmented evaluation metrics that neglect complex multimodal reasoning. To bridge this gap, we introduce CLVG-Bench, an evaluation framework designed to probe video models' zero-shot reasoning capabilities via Context Learning in Video Generation. CLVG-Bench comprises more than 1,000 high-quality, manually annotated metadata across 6 categories and 47 subcategories, covering complex scenarios including physical simulation, logical reasoning, and interactive contexts. To enable rigorous and scalable assessment, we further propose an Adaptive Video Evaluator (AVE) that aligns with human expert perception using minimal annotations, delivering interpretable textual feedback across diverse video context tasks. Extensive experiments reveal a striking answer to our central question: while state-of-the-art (SOTA) video models, such as Seedance 2.0, demonstrate competence on certain understanding and reasoning subtasks, they fall substantially short with logically grounded and interactive generation tasks (achieving success rates <25% and ~0%, respectively), exposing multimodal reasoning and physical grounding as critical bottlenecks. By systematically quantifying these limitations, the proposed method provides actionable feedbacks and a clear roadmap toward truly robust, general-purpose video models. CLVG-Bench and code are released here.",
          "authors": [
            "Xiaotian Zhang",
            "Jianhui Wei",
            "Yuan Wang",
            "Jie Tan",
            "Yichen Li",
            "Yan Zhang",
            "Ziyi Chen",
            "Daoan Zhang",
            "Dezhi YU",
            "Wei Xu",
            "Songtao Jiang",
            "Zuozhu Liu"
          ],
          "categories": [
            "cs.CV"
          ],
          "paper_id": "http://arxiv.org/abs/2604.19193v1",
          "abstract_url": "https://arxiv.org/abs/2604.19193v1",
          "pdf_url": "https://arxiv.org/pdf/2604.19193v1",
          "published_at": "2026-04-21T08:04:02+00:00",
          "updated_at": "2026-04-21T08:04:02+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "数据",
            "应用",
            "方法"
          ],
          "topics": [
            "Benchmark",
            "Evaluation"
          ],
          "doi": null,
          "arxiv_id": "2604.19193",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2604.19193v1"
          },
          "relevance_score": 80,
          "match_reasons": [
            "title matched \"multimodal\"",
            "summary matched \"video generation\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2604.19193"
        },
        {
          "title": "EgoMotion: Hierarchical Reasoning and Diffusion for Egocentric Vision-Language Motion Generation",
          "summary": "Faithfully modeling human behavior in dynamic environments is a foundational challenge for embodied intelligence. While conditional motion synthesis has achieved significant advances, egocentric motion generation remains largely underexplored due to the inherent complexity of first-person perception. In this work, we investigate Egocentric Vision-Language (Ego-VL) motion generation. This task requires synthesizing 3D human motion conditioned jointly on first-person visual observations and natural language instructions. We identify a critical \\textit{reasoning-generation entanglement} challenge: the simultaneous optimization of semantic reasoning and kinematic modeling introduces gradient conflicts. These conflicts systematically degrade the fidelity of multimodal grounding and motion quality. To address this challenge, we propose a hierarchical generative framework \\textbf{EgoMotion}. Inspired by the biological decoupling of cognitive reasoning and motor control, EgoMotion operates in two stages. In the Cognitive Reasoning stage, A vision-language model (VLM) projects multimodal inputs into a structured space of discrete motion primitives. This forces the VLM to acquire goal-consistent representations, effectively bridging the semantic gap between high-level perceptual understanding and low-level action execution. In the Motion Generation stage, these learned representations serve as expressive conditioning signals for a diffusion-based motion generator. By performing iterative denoising within a continuous latent space, the generator synthesizes physically plausible and temporally coherent trajectories. Extensive evaluations demonstrate that EgoMotion achieves state-of-the-art performance, and produces motion sequences that are both semantically grounded and kinematically superior to existing approaches.",
          "authors": [
            "Ruibing Hou",
            "Mingyue Zhou",
            "Yuwei Gui",
            "Mingshuang Luo",
            "Bingpeng Ma",
            "Hong Chang",
            "Shiguang Shan",
            "Xilin Chen"
          ],
          "categories": [
            "cs.CV"
          ],
          "paper_id": "http://arxiv.org/abs/2604.19105v1",
          "abstract_url": "https://arxiv.org/abs/2604.19105v1",
          "pdf_url": "https://arxiv.org/pdf/2604.19105v1",
          "published_at": "2026-04-21T05:31:06+00:00",
          "updated_at": "2026-04-21T05:31:06+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "方法"
          ],
          "topics": [
            "Evaluation",
            "Language Model"
          ],
          "doi": null,
          "arxiv_id": "2604.19105",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2604.19105v1"
          },
          "relevance_score": 77,
          "match_reasons": [
            "title matched \"diffusion\"",
            "summary matched \"multimodal\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2604.19105"
        },
        {
          "title": "AnyRecon: Arbitrary-View 3D Reconstruction with Video Diffusion Model",
          "summary": "Sparse-view 3D reconstruction is essential for modeling scenes from casual captures, but remain challenging for non-generative reconstruction. Existing diffusion-based approaches mitigates this issues by synthesizing novel views, but they often condition on only one or two capture frames, which restricts geometric consistency and limits scalability to large or diverse scenes. We propose AnyRecon, a scalable framework for reconstruction from arbitrary and unordered sparse inputs that preserves explicit geometric control while supporting flexible conditioning cardinality. To support long-range conditioning, our method constructs a persistent global scene memory via a prepended capture view cache, and removes temporal compression to maintain frame-level correspondence under large viewpoint changes. Beyond better generative model, we also find that the interplay between generation and reconstruction is crucial for large-scale 3D scenes. Thus, we introduce a geometry-aware conditioning strategy that couples generation and reconstruction through an explicit 3D geometric memory and geometry-driven capture-view retrieval. To ensure efficiency, we combine 4-step diffusion distillation with context-window sparse attention to reduce quadratic complexity. Extensive experiments demonstrate robust and scalable reconstruction across irregular inputs, large viewpoint gaps, and long trajectories.",
          "authors": [
            "Yutian Chen",
            "Shi Guo",
            "Renbiao Jin",
            "Tianshuo Yang",
            "Xin Cai",
            "Yawen Luo",
            "Mingxin Yang",
            "Mulin Yu",
            "Linning Xu",
            "Tianfan Xue"
          ],
          "categories": [
            "cs.CV"
          ],
          "paper_id": "http://arxiv.org/abs/2604.19747v1",
          "abstract_url": "https://arxiv.org/abs/2604.19747v1",
          "pdf_url": "https://arxiv.org/pdf/2604.19747v1",
          "published_at": "2026-04-21T17:59:47+00:00",
          "updated_at": "2026-04-21T17:59:47+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "应用",
            "方法"
          ],
          "topics": [
            "Diffusion"
          ],
          "doi": null,
          "arxiv_id": "2604.19747",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2604.19747v1"
          },
          "relevance_score": 72,
          "match_reasons": [
            "title matched \"diffusion\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2604.19747"
        },
        {
          "title": "CityRAG: Stepping Into a City via Spatially-Grounded Video Generation",
          "summary": "We address the problem of generating a 3D-consistent, navigable environment that is spatially grounded: a simulation of a real location. Existing video generative models can produce a plausible sequence that is consistent with a text (T2V) or image (I2V) prompt. However, the capability to reconstruct the real world under arbitrary weather conditions and dynamic object configurations is essential for downstream applications including autonomous driving and robotics simulation. To this end, we present CityRAG, a video generative model that leverages large corpora of geo-registered data as context to ground generation to the physical scene, while maintaining learned priors for complex motion and appearance changes. CityRAG relies on temporally unaligned training data, which teaches the model to semantically disentangle the underlying scene from its transient attributes. Our experiments demonstrate that CityRAG can generate coherent minutes-long, physically grounded video sequences, maintain weather and lighting conditions over thousands of frames, achieve loop closure, and navigate complex trajectories to reconstruct real-world geography.",
          "authors": [
            "Gene Chou",
            "Charles Herrmann",
            "Kyle Genova",
            "Boyang Deng",
            "Songyou Peng",
            "Bharath Hariharan",
            "Jason Y. Zhang",
            "Noah Snavely",
            "Philipp Henzler"
          ],
          "categories": [
            "cs.CV"
          ],
          "paper_id": "http://arxiv.org/abs/2604.19741v1",
          "abstract_url": "https://arxiv.org/abs/2604.19741v1",
          "pdf_url": "https://arxiv.org/pdf/2604.19741v1",
          "published_at": "2026-04-21T17:59:03+00:00",
          "updated_at": "2026-04-21T17:59:03+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "应用",
            "方法"
          ],
          "topics": [
            "Video Generation"
          ],
          "doi": null,
          "arxiv_id": "2604.19741",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2604.19741v1"
          },
          "relevance_score": 72,
          "match_reasons": [
            "title matched \"video generation\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2604.19741"
        }
      ]
    },
    {
      "name": "PubMed AI",
      "key_points": [
        "《Classifying American Society of Anesthesiologists Physical Status With a Low-Rank-Adapted Large Language Model: Development and Validation Study.》〔评测 / 应用 / 方法〕：BACKGROUND: The American Society of Anesthesiologists Physical Status (ASA-PS) classification is integral to preoperative risk assessment; yet, assignment rema…",
        "《Enhancing large language model clinical support information with machine learning risk and explainability: a feasibility study.》〔评测 / 方法〕：BACKGROUND: Current machine learning (ML) prediction models offer limited guidance for individualized actionable management. Large language models (LLMs) can t…",
        "《Clinical Model Autophagy: The Risk of Interpretative Drift in Recursive Medical AI.》〔数据 / 应用 / 方法〕：The rapid integration of large language models into electronic medical record systems introduces a critical theoretical vulnerability. Drawing on foundational…"
      ],
      "sort_by": "hybrid",
      "papers": [
        {
          "title": "Classifying American Society of Anesthesiologists Physical Status With a Low-Rank-Adapted Large Language Model: Development and Validation Study.",
          "summary": "BACKGROUND: The American Society of Anesthesiologists Physical Status (ASA-PS) classification is integral to preoperative risk assessment; yet, assignment remains subjective and labor-intensive. Recent large language models (LLMs) process free-text electronic health records (EHRs), but few studies have evaluated parameter-efficient adaptations that both predict ASA-PS and provide clinician-readable rationales. Low-rank adaptation (LoRA) is a parameter-efficient technique that updates only a small set of add-on parameters rather than the entire model, enabling efficient fine-tuning on modest data and hardware. A lightweight, instruction-tuned LLM with these capabilities could streamline workflow and broaden access to explainable decision support. OBJECTIVE: This study aimed to develop and evaluate a LoRA-fine-tuned large language model meta-AI (LLaMA-3) for ASA-PS classification from preoperative clinical narratives and benchmark it against traditional machine learning classifiers and domain-specific LLMs. METHODS: Preoperative anesthesia notes and discharge summaries were extracted from the EHR and reformatted into an Alpaca-style instruction-response prompt, requesting ASA-PS class labels (I-V) annotated by anesthesiologists. The LoRA-enhanced LLaMA-3 model was fine-tuned with mixed-precision training and evaluated on a hold-out test set. Baselines included random forest classifier, Extreme Gradient Boosting (XGBoost) classifier, support vector machine, fastText, BioBERT, ClinicalBERT, and untuned LLaMA-3. Performance was assessed with micro- and macroaveraged F 1 -score and Matthews correlation coefficient (MCC), each reported with 95% bootstrap CIs. Pairwise model error rates were compared using McNemar test. RESULTS: The LoRA-LLaMA-3 model achieved a micro-F 1 -score of 0.780 (95% CI 0.769-0.792) and an MCC of 0.533 (95% CI 0.518-0.546), outperforming other LLM baselines. After fine-tuning, BioBERT reached a micro-F 1 -score of 0.762 and an MCC of 0.508, whereas ClinicalBERT achieved a micro-F 1 -score of 0.757 and an MCC of 0.515. fastText yielded a micro-F 1 -score of 0.762 and an MCC of 0.536. The untuned LLaMA-3 performed poorly (micro-F 1 -score of 0.073; MCC of 0.002). However, macro-F 1 -score of LoRA-LLaMA-3 (0.316) was lower than that of other language models (0.349-0.372). Among all models, XGBoost obtained the highest scores (micro-F 1 -score of 0.815, 95% CI 0.804-0.826; macro-F 1 -score of 0.348, 95% CI 0.334-0.361; MCC 0.613, 95% CI 0.599-0.626). Ablation experiments identified dropout = 0.3, learning rate = 3×10 -5 , temperature = 0.1, and top-P= 0.1 as the optimal hyperparameter settings. The LoRA model also produced rationales that highlighted medically pertinent terms. CONCLUSIONS: LoRA fine-tuning improved LLaMA-3 from near-random performance into an ASA-PS classifier with higher micro-F 1 -score and significantly lower misclassification rates than other language model baselines. However, macroaveraged performance was lower, indicating limited discrimination for minority ASA classes. Traditional machine learning models demonstrated higher predictive performance. Beyond predictive performance, LoRA-LLaMA-3 generated clinician-oriented explanations that enhance decision transparency. By reformatting routine EHR narratives into instruction-response pairs and relying on lightweight parameter adaptation, this approach offers a practical, resource-efficient framework for introducing explainable LLMs to clinical classification tasks.",
          "authors": [
            "Min-Chia Chen",
            "Shanq-Jang Ruan",
            "Jo-Hsin Wu",
            "Pei-Fu Chen"
          ],
          "categories": [
            "Journal Article"
          ],
          "paper_id": "pubmed:42013456",
          "abstract_url": "https://pubmed.ncbi.nlm.nih.gov/42013456/",
          "pdf_url": null,
          "published_at": "2026-04-21T16:52:00+00:00",
          "updated_at": "2026-04-21T16:52:00+00:00",
          "source": "pubmed",
          "date_label": "Entered",
          "analysis": null,
          "tags": [
            "评测",
            "应用",
            "方法"
          ],
          "topics": [
            "Benchmark",
            "Language Model"
          ],
          "doi": "10.2196/89540",
          "arxiv_id": null,
          "source_variants": [
            "pubmed"
          ],
          "source_urls": {
            "pubmed": "https://pubmed.ncbi.nlm.nih.gov/42013456/",
            "doi": "https://doi.org/10.2196/89540"
          },
          "relevance_score": 111,
          "match_reasons": [
            "title matched \"language model\"",
            "summary matched \"benchmark\"",
            "summary matched \"clinical\"",
            "has DOI",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "doi:10.2196/89540"
        },
        {
          "title": "Enhancing large language model clinical support information with machine learning risk and explainability: a feasibility study.",
          "summary": "BACKGROUND: Current machine learning (ML) prediction models offer limited guidance for individualized actionable management. Large language models (LLMs) can transform ML model-predicted risk estimates with Shapley Additive Explanations (SHAP) into clinically meaningful support information, yet the added value of incorporating ML-derived data and the relative performance of different LLMs remain uncertain. To address these gaps, we used our previously developed IMPACT framework to evaluate the quality of LLM-generated outputs. METHODS: In this retrospective analysis of MIMIC-IV v3.1 intensive care unit (ICU) admissions, we applied a previously developed XGBoost model to estimate ICU mortality risk and derive corresponding SHAP values. GPT-4o transformed the predicted mortality risk, clinical predictors, and their SHAP values into risk interpretation, recommended examinations and management. The primary analysis examined whether augmenting LLM inputs with predicted mortality risk and SHAP values improved clinical response quality, as assessed by the IMPACT framework. We further compared GPT-4o with seven contemporary LLMs; all eight models generated clinical support responses that were scored by Claude 3.7 Sonnet to assess performance differences. RESULTS: Claude 3.7 Sonnet showed excellent agreement with human IMPACT ratings (intraclass correlation coefficient [ICC] 0.979, 95% CI 0.973-0.984) and o3-mini (ICC 0.971, 95% CI 0.964-0.980). In the primary analysis, adding predicted ICU mortality risk and SHAP values significantly increased GPT-4o IMPACT scores across prompting strategies. GPT-5 mini (96.0) and gpt-oss-120B (93.4) outperformed GPT-4o (90.4; both p < 0.001) for interpretability and quality. CONCLUSIONS: Combining ML-derived risk, SHAP explanations and LLMs may modestly improve ICU clinical support information, while LLM-based evaluators demonstrated feasibility for scalable evaluation of generated clinical content.",
          "authors": [
            "Yu-Chang Yeh",
            "Hsin-Yu Yang",
            "Ching-Tang Chiu",
            "Anne Chao",
            "Yu-Chen Chuang",
            "Wing-Sum Chan"
          ],
          "categories": [
            "Journal Article"
          ],
          "paper_id": "pubmed:42012584",
          "abstract_url": "https://pubmed.ncbi.nlm.nih.gov/42012584/",
          "pdf_url": null,
          "published_at": "2026-04-21T11:08:00+00:00",
          "updated_at": "2026-04-21T11:08:00+00:00",
          "source": "pubmed",
          "date_label": "Entered",
          "analysis": null,
          "tags": [
            "评测",
            "方法"
          ],
          "topics": [
            "Evaluation",
            "Language Model"
          ],
          "doi": "10.1186/s40635-026-00900-w",
          "arxiv_id": null,
          "source_variants": [
            "pubmed"
          ],
          "source_urls": {
            "pubmed": "https://pubmed.ncbi.nlm.nih.gov/42012584/",
            "doi": "https://doi.org/10.1186/s40635-026-00900-w"
          },
          "relevance_score": 109,
          "match_reasons": [
            "title matched \"language model\"",
            "title matched \"clinical\"",
            "has DOI",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "doi:10.1186/s40635-026-00900-w"
        },
        {
          "title": "Clinical Model Autophagy: The Risk of Interpretative Drift in Recursive Medical AI.",
          "summary": "The rapid integration of large language models into electronic medical record systems introduces a critical theoretical vulnerability. Drawing on foundational computer science proofs of \"model collapse,\" this viewpoint introduces the concept of \"Clinical Model Autophagy\"-a systemic degradation of diagnostic integrity that occurs when clinical artificial intelligence (AI) models are recursively trained on unverified, AI-generated synthetic data. As these recursive models may progressively regress toward statistical means, they undergo \"Interpretative Drift,\" a clinically concerning phenomenon where rare pathological variances are systematically erased and complex diseases are homogenized into benign averages. To prevent the irreversible contamination of health care data ecosystems, the author urgently proposes the Data Purity Standard (DPS). The DPS mandates the cryptographic watermarking of all AI-assisted clinical entries for provenance tracking, alongside the establishment of \"Human Vaults.\" These physically segregated repositories of physician-verified heritage data will serve as immutable biological anchors to safely guide future AI training, ensuring the long-term reliability of digital health infrastructure.",
          "authors": [
            "Pei Fan Shih"
          ],
          "categories": [
            "Journal Article"
          ],
          "paper_id": "pubmed:42013455",
          "abstract_url": "https://pubmed.ncbi.nlm.nih.gov/42013455/",
          "pdf_url": null,
          "published_at": "2026-04-21T16:52:00+00:00",
          "updated_at": "2026-04-21T16:52:00+00:00",
          "source": "pubmed",
          "date_label": "Entered",
          "analysis": null,
          "tags": [
            "数据",
            "应用",
            "方法"
          ],
          "topics": [
            "Language Model",
            "Clinical"
          ],
          "doi": "10.2196/94813",
          "arxiv_id": null,
          "source_variants": [
            "pubmed"
          ],
          "source_urls": {
            "pubmed": "https://pubmed.ncbi.nlm.nih.gov/42013455/",
            "doi": "https://doi.org/10.2196/94813"
          },
          "relevance_score": 93,
          "match_reasons": [
            "title matched \"clinical\"",
            "summary matched \"language model\"",
            "has DOI",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "doi:10.2196/94813"
        },
        {
          "title": "APSevLM: Acute Pancreatitis Severity Language Model.",
          "summary": "Approximately one-fifth of patients with acute pancreatitis (AP) develop severe forms, which are associated with high mortality rates, making early prediction of severity crucial for effective patient management. In this study, we present APSevLM (Acute Pancreatitis Severity Language Model), a large language model (LLM)-based approach that integrates admission-time clinical data, imaging reports, and expert knowledge to predict AP severity at an early stage. Through a comprehensive evaluation using data from over five hundred patients, APSevLM outperforms traditional scoring systems (BISAP and MCTSI), conventional machine learning algorithms, and state-of-the-art deep learning models, achieving an AUC of 0.857. Attention visualizations of the model explain complex mechanisms that dynamically weigh different information modalities based on case severity. Furthermore, a systematic feature importance analysis identifies key predictive factors, particularly hematological parameters and cardiac markers, offering valuable insights for clinical practice. Our study positions APSevLM as an accurate predictive model and highlights potential biomarkers for the early diagnosis of severe AP.",
          "authors": [
            "Leqi Zheng",
            "Jiajun Fang",
            "Hongyi Chen",
            "Naiqing Li",
            "Yunyuan Huang",
            "Qiulin Ge",
            "Yang Gu",
            "Tao Yu",
            "Chang-Dong Wang",
            "Peng Wang"
          ],
          "categories": [
            "Journal Article"
          ],
          "paper_id": "pubmed:42013267",
          "abstract_url": "https://pubmed.ncbi.nlm.nih.gov/42013267/",
          "pdf_url": null,
          "published_at": "2026-04-21T14:33:00+00:00",
          "updated_at": "2026-04-21T14:33:00+00:00",
          "source": "pubmed",
          "date_label": "Entered",
          "analysis": null,
          "tags": [
            "评测",
            "方法"
          ],
          "topics": [
            "Evaluation",
            "Language Model"
          ],
          "doi": "10.1109/jbhi.2026.3686007",
          "arxiv_id": null,
          "source_variants": [
            "pubmed"
          ],
          "source_urls": {
            "pubmed": "https://pubmed.ncbi.nlm.nih.gov/42013267/",
            "doi": "https://doi.org/10.1109/JBHI.2026.3686007"
          },
          "relevance_score": 90,
          "match_reasons": [
            "title matched \"language model\"",
            "summary matched \"clinical\"",
            "has DOI",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "doi:10.1109/jbhi.2026.3686007"
        },
        {
          "title": "Comparing Clinical Outcomes in Cardiac Surgical Patients Who Receive Sugammadex Versus Placebo: A Prospective Randomized Blinded Controlled Trial.",
          "summary": "OBJECTIVES: To compare the difference in the number of cardiopulmonary bypass surgical patients who receive sugammadex vs. placebo and who meet the Society of Thoracic Surgery early extubation quality benchmark. DESIGN: Single-center, randomized, double-blind, placebo-controlled trial. SETTING: Participants were enrolled at a single U.S. hospital between August 2023 and July 2025. PATIENTS: Seventy-four eligible cardiac surgery patients undergoing cardiopulmonary bypass with anticipated institutional fast-track extubation were enrolled; 64 were included in the analysis. INTERVENTIONS: Administration of either sugammadex or placebo 15 minutes after arrival to the ICU following cardiac surgery. MAIN OUTCOMES AND MEASURES: The primary outcome was the number of patients meeting the Society of Thoracic Surgery quality benchmark of early extubation in the sugammadex vs. placebo groups. Secondary outcomes encompassed specifics related to clinical outcomes, medication requirements, and nursing perception. MEASUREMENTS AND MAIN RESULTS: Surgery duration (p = 0.0004), bypass duration (p = 0.0177), and intraoperative blood products (p = 0.0003) were all increased in the sugammadex group. No difference was observed in the primary outcome and 96.9% of patients in both groups were extubated within 6 hours after surgery. However, all patients in the sugammadex group achieved a train of four greater than or equal to 0.9 before extubation compared with only 37.5% of the placebo group (p < 0.0001). The intraoperative dose of rocuronium (mean p = 0.0119; median p = 0.0047) was significantly increased in the sugammadex group. All additional demographics and secondary outcomes were comparable between groups. CONCLUSIONS: This trial found no difference in the number of patients who achieved the early extubation benchmark in the sugammadex vs. placebo groups; however, a significant number of patients in the placebo group had residual neuromuscular weakness as defined by quantitative neuromuscular monitoring. Further studies are required to investigate the implications of the high incidence of quantitative monitoring related weakness in this population without the use of sugammadex.",
          "authors": [
            "Steven B Greenberg",
            "Noah Ben-Isvy",
            "Andrew R Locke",
            "Nikola Dobrilovic",
            "Rebecca Shamberg",
            "Andrew Bochenek",
            "Daneel Patoli",
            "Chi Wang",
            "Mohammed Minhaj"
          ],
          "categories": [
            "Journal Article"
          ],
          "paper_id": "pubmed:42012852",
          "abstract_url": "https://pubmed.ncbi.nlm.nih.gov/42012852/",
          "pdf_url": null,
          "published_at": "2026-04-21T11:43:00+00:00",
          "updated_at": "2026-04-21T11:43:00+00:00",
          "source": "pubmed",
          "date_label": "Entered",
          "analysis": null,
          "tags": [
            "评测"
          ],
          "topics": [
            "Benchmark",
            "Clinical"
          ],
          "doi": "10.1097/cce.0000000000001406",
          "arxiv_id": null,
          "source_variants": [
            "pubmed"
          ],
          "source_urls": {
            "pubmed": "https://pubmed.ncbi.nlm.nih.gov/42012852/",
            "doi": "https://doi.org/10.1097/CCE.0000000000001406"
          },
          "relevance_score": 88,
          "match_reasons": [
            "title matched \"clinical\"",
            "summary matched \"benchmark\"",
            "has DOI",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "doi:10.1097/cce.0000000000001406"
        }
      ]
    },
    {
      "name": "OpenAlex AI",
      "key_points": [],
      "sort_by": "hybrid",
      "papers": []
    }
  ]
}