{
  "generated_at": "2026-04-24T11:46:20.505287+08:00",
  "timezone": "Asia/Shanghai",
  "lookback_hours": 24,
  "sorting": {
    "default_sort_by": "hybrid",
    "summary": "hybrid (relevance first, published_at tie-break)",
    "weights": {
      "title_match_weight": 40,
      "summary_match_weight": 18,
      "doi_weight": 12,
      "pdf_weight": 8,
      "rich_summary_weight": 6,
      "metadata_weight": 4,
      "multi_source_weight": 10,
      "freshness_weight_cap": 24
    },
    "feeds": [
      {
        "name": "LLM",
        "sort_by": "hybrid"
      },
      {
        "name": "Vision",
        "sort_by": "hybrid"
      },
      {
        "name": "PubMed AI",
        "sort_by": "hybrid"
      },
      {
        "name": "OpenAlex AI",
        "sort_by": "hybrid"
      }
    ]
  },
  "highlights": [
    "主题「Benchmark」：命中 13 篇，覆盖 LLM、Vision 等，代表论文包括 《Tool Attention Is All You Need: Dynamic Tool Gating and Lazy Schema Loading for Eliminating the MCP/Tools Tax in Scalable Agentic Workflows》、《Learning to Communicate: Toward End-to-End Optimization of Multi-Agent Language Systems》。",
    "主题「Language Model」：命中 12 篇，覆盖 LLM、PubMed AI，代表论文包括 《Tool Attention Is All You Need: Dynamic Tool Gating and Lazy Schema Loading for Eliminating the MCP/Tools Tax in Scalable Agentic Workflows》、《Learning to Communicate: Toward End-to-End Optimization of Multi-Agent Language Systems》。",
    "主题「Evaluation」：命中 10 篇，覆盖 LLM、Vision 等，代表论文包括 《Temporal Taskification in Streaming Continual Learning: A Source of Evaluation Instability》、《Who Defines \"Best\"? Towards Interactive, User-Defined Evaluation of LLM Leaderboards》。"
  ],
  "focus_items": [],
  "action_items": [],
  "topic_sections": [
    {
      "name": "Benchmark",
      "paper_count": 13,
      "feed_names": [
        "LLM",
        "Vision",
        "PubMed AI"
      ],
      "paper_titles": [
        "Tool Attention Is All You Need: Dynamic Tool Gating and Lazy Schema Loading for Eliminating the MCP/Tools Tax in Scalable Agentic Workflows",
        "Learning to Communicate: Toward End-to-End Optimization of Multi-Agent Language Systems",
        "AgenticQwen: Training Small Agentic Language Models with Dual Data Flywheels for Industrial-Scale Tool Use",
        "Temporal Taskification in Streaming Continual Learning: A Source of Evaluation Instability",
        "Who Defines \"Best\"? Towards Interactive, User-Defined Evaluation of LLM Leaderboards",
        "AEL: Agent Evolving Learning for Open-Ended Environments",
        "Language as a Latent Variable for Reasoning Optimization",
        "A-IC3: Learning-Guided Adaptive Inductive Generalization for Hardware Model Checking",
        "DryRUN: On the Role of Public Tests in LLM-Driven Code Generation",
        "Deep kernel video approximation for unsupervised action segmentation",
        "S1-VL: Scientific Multimodal Reasoning Model with Thinking-with-Images",
        "SparseGF: A Height-Aware Sparse Segmentation Framework with Context Compression for Robust Ground Filtering Across Urban to Natural Scenes",
        "Clinical Knowledge-Guided PET/CT Lesion Segmentation with Interpretable Fusion of Metabolic and Structural Cues."
      ],
      "key_points": [
        "《Tool Attention Is All You Need: Dynamic Tool Gating and Lazy Schema Loading for Eliminating the MCP/Tools Tax in Scalable Agentic Workflows》〔评测 / 应用 / 方法〕：The Model Context Protocol (MCP) has become a common interface for connecting large language model (LLM) agents to external tools, but its reliance on stateles…",
        "《Learning to Communicate: Toward End-to-End Optimization of Multi-Agent Language Systems》〔评测 / 方法〕：Multi-agent systems built on large language models have shown strong performance on complex reasoning tasks, yet most work focuses on agent roles and orchestra…"
      ]
    },
    {
      "name": "Language Model",
      "paper_count": 12,
      "feed_names": [
        "LLM",
        "PubMed AI"
      ],
      "paper_titles": [
        "Tool Attention Is All You Need: Dynamic Tool Gating and Lazy Schema Loading for Eliminating the MCP/Tools Tax in Scalable Agentic Workflows",
        "Learning to Communicate: Toward End-to-End Optimization of Multi-Agent Language Systems",
        "AgenticQwen: Training Small Agentic Language Models with Dual Data Flywheels for Industrial-Scale Tool Use",
        "Nemobot Games: Crafting Strategic AI Gaming Agents for Interactive Learning with Large Language Models",
        "Transient Turn Injection: Exposing Stateless Multi-Turn Vulnerabilities in Large Language Models",
        "CoFEE: Reasoning Control for LLM-Based Feature Discovery",
        "DryRUN: On the Role of Public Tests in LLM-Driven Code Generation",
        "Evaluation of Automatic Speech Recognition Using Generative Large Language Models",
        "Biomed-DPT: Dual Modality Prompt Tuning for Biomedical Vision-Language Models.",
        "Accelerating real-world data collection using large language models in rare neoplasms: a bone sarcoma example.",
        "GATE: Graph and Text Exchange for Zero-Shot ECG Classification with LLM Prompts.",
        "Learning from Prototypes: Contrastive Learning with Prior-Aware Multi-Label Chest X-ray Classification."
      ],
      "key_points": [
        "《Tool Attention Is All You Need: Dynamic Tool Gating and Lazy Schema Loading for Eliminating the MCP/Tools Tax in Scalable Agentic Workflows》〔评测 / 应用 / 方法〕：The Model Context Protocol (MCP) has become a common interface for connecting large language model (LLM) agents to external tools, but its reliance on stateles…",
        "《Learning to Communicate: Toward End-to-End Optimization of Multi-Agent Language Systems》〔评测 / 方法〕：Multi-agent systems built on large language models have shown strong performance on complex reasoning tasks, yet most work focuses on agent roles and orchestra…"
      ]
    },
    {
      "name": "Evaluation",
      "paper_count": 10,
      "feed_names": [
        "LLM",
        "Vision",
        "PubMed AI"
      ],
      "paper_titles": [
        "Temporal Taskification in Streaming Continual Learning: A Source of Evaluation Instability",
        "Who Defines \"Best\"? Towards Interactive, User-Defined Evaluation of LLM Leaderboards",
        "Transient Turn Injection: Exposing Stateless Multi-Turn Vulnerabilities in Large Language Models",
        "Inferring High-Level Events from Timestamped Data: Complexity and Medical Applications",
        "CoFEE: Reasoning Control for LLM-Based Feature Discovery",
        "A-IC3: Learning-Guided Adaptive Inductive Generalization for Hardware Model Checking",
        "Evaluation of Automatic Speech Recognition Using Generative Large Language Models",
        "SparseGF: A Height-Aware Sparse Segmentation Framework with Context Compression for Robust Ground Filtering Across Urban to Natural Scenes",
        "Clinical Knowledge-Guided PET/CT Lesion Segmentation with Interpretable Fusion of Metabolic and Structural Cues.",
        "Learning from Prototypes: Contrastive Learning with Prior-Aware Multi-Label Chest X-ray Classification."
      ],
      "key_points": [
        "《Temporal Taskification in Streaming Continual Learning: A Source of Evaluation Instability》〔评测 / 方法〕：Streaming Continual Learning (CL) typically converts a continuous stream into a sequence of discrete tasks through temporal partitioning. We argue that this te…",
        "《Who Defines \"Best\"? Towards Interactive, User-Defined Evaluation of LLM Leaderboards》〔评测 / 数据 / 应用 / 方法〕：LLM leaderboards are widely used to compare models and guide deployment decisions. However, leaderboard rankings are shaped by evaluation priorities set by ben…"
      ]
    },
    {
      "name": "Diffusion",
      "paper_count": 5,
      "feed_names": [
        "Vision"
      ],
      "paper_titles": [
        "Pre-process for segmentation task with nonlinear diffusion filters",
        "Exploring the Role of Synthetic Data Augmentation in Controllable Human-Centric Video Generation",
        "DCMorph: Face Morphing via Dual-Stream Cross-Attention Diffusion",
        "Sculpt4D: Generating 4D Shapes via Sparse-Attention Diffusion Transformers",
        "DiffNR: Diffusion-Enhanced Neural Representation Optimization for Sparse-View 3D Tomographic Reconstruction"
      ],
      "key_points": [
        "《Pre-process for segmentation task with nonlinear diffusion filters》〔方法〕：This paper deals with the case of using nonlinear diffusion filters to obtain piecewise constant images as a previous process for segmentation techniques. We f…",
        "《Exploring the Role of Synthetic Data Augmentation in Controllable Human-Centric Video Generation》〔数据 / 应用 / 方法〕：Controllable human video generation aims to produce realistic videos of humans with explicitly guided motions and appearances,serving as a foundation for digit…"
      ]
    },
    {
      "name": "Reasoning",
      "paper_count": 4,
      "feed_names": [
        "LLM",
        "Vision"
      ],
      "paper_titles": [
        "Inferring High-Level Events from Timestamped Data: Complexity and Medical Applications",
        "Language as a Latent Variable for Reasoning Optimization",
        "Seeing Fast and Slow: Learning the Flow of Time in Videos",
        "S1-VL: Scientific Multimodal Reasoning Model with Thinking-with-Images"
      ],
      "key_points": [
        "《Inferring High-Level Events from Timestamped Data: Complexity and Medical Applications》〔评测 / 应用 / 方法〕：In this paper, we develop a novel logic-based approach to detecting high-level temporally extended events from timestamped data and background knowledge. Our f…",
        "《Language as a Latent Variable for Reasoning Optimization》〔评测 / 数据 / 方法〕：As LLMs reduce English-centric bias, a surprising trend emerges: non-English responses sometimes outperform English on reasoning tasks. We hypothesize that lan…"
      ]
    }
  ],
  "template": "zh_daily_brief",
  "feeds": [
    {
      "name": "LLM",
      "key_points": [
        "《Tool Attention Is All You Need: Dynamic Tool Gating and Lazy Schema Loading for Eliminating the MCP/Tools Tax in Scalable Agentic Workflows》〔评测 / 应用 / 方法〕：The Model Context Protocol (MCP) has become a common interface for connecting large language model (LLM) agents to external tools, but its reliance on stateles…",
        "《Learning to Communicate: Toward End-to-End Optimization of Multi-Agent Language Systems》〔评测 / 方法〕：Multi-agent systems built on large language models have shown strong performance on complex reasoning tasks, yet most work focuses on agent roles and orchestra…",
        "《AgenticQwen: Training Small Agentic Language Models with Dual Data Flywheels for Industrial-Scale Tool Use》〔评测 / 数据 / 应用 / 方法〕：Modern industrial applications increasingly demand language models that act as agents, capable of multi-step reasoning and tool use in real-world settings. The…"
      ],
      "sort_by": "hybrid",
      "papers": [
        {
          "title": "Tool Attention Is All You Need: Dynamic Tool Gating and Lazy Schema Loading for Eliminating the MCP/Tools Tax in Scalable Agentic Workflows",
          "summary": "The Model Context Protocol (MCP) has become a common interface for connecting large language model (LLM) agents to external tools, but its reliance on stateless, eager schema injection imposes a hidden per-turn overhead the MCP Tax or Tools Tax that practitioner reports place between roughly 10k and 60k tokens in typical multi-server deployments. This payload inflates the key-value cache, is associated with reasoning degradation as context utilization approaches published fracture points around 70%, and turns token budgets into a recurring operational cost. We introduce Tool Attention, a middleware-layer mechanism that generalizes the \"Attention Is All You Need\" paradigm from self-attention over tokens to gated attention over tools. Tool Attention combines (i) an Intent Schema Overlap (ISO) score from sentence embeddings, (ii) a state-aware gating function enforcing preconditions and access scopes, and (iii) a two-phase lazy schema loader that keeps a compact summary pool in context and promotes full JSON schemas only for top-k gated tools. We evaluate on a simulated 120-tool, six-server benchmark whose per-server token counts are calibrated to public audits of real MCP deployments. In this simulation, Tool Attention directly reduces measured per-turn tool tokens by 95.0% (47.3k -> 2.4k) and raises effective context utilization (a token-ratio quantity) from 24% to 91%. End-to-end figures for task success, latency, cost, and reasoning quality are reported as projections derived from the measured token counts combined with published deployment telemetry; they are not measured on live LLM agents, and we mark projected values explicitly throughout. Taken together, the results support a simple thesis: protocol-level efficiency, not raw context length, is a binding constraint on scalable gentic systems. The code for this work is accessible at https://github.com/asadani/tool-attention",
          "authors": [
            "Anuj Sadani",
            "Deepak Kumar"
          ],
          "categories": [
            "cs.AI"
          ],
          "paper_id": "http://arxiv.org/abs/2604.21816v1",
          "abstract_url": "https://arxiv.org/abs/2604.21816v1",
          "pdf_url": "https://arxiv.org/pdf/2604.21816v1",
          "published_at": "2026-04-23T16:10:00+00:00",
          "updated_at": "2026-04-23T16:10:00+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "应用",
            "方法"
          ],
          "topics": [
            "Benchmark",
            "Language Model"
          ],
          "doi": null,
          "arxiv_id": "2604.21816",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2604.21816v1"
          },
          "relevance_score": 106,
          "match_reasons": [
            "title matched \"agent\"",
            "summary matched \"reasoning\"",
            "summary matched \"benchmark\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2604.21816"
        },
        {
          "title": "Learning to Communicate: Toward End-to-End Optimization of Multi-Agent Language Systems",
          "summary": "Multi-agent systems built on large language models have shown strong performance on complex reasoning tasks, yet most work focuses on agent roles and orchestration while treating inter-agent communication as a fixed interface. Latent communication through internal representations such as key-value caches offers a promising alternative to text-based protocols, but existing approaches do not jointly optimize communication with multi-agent reasoning. Therefore we propose DiffMAS, a training framework that treats latent communication as a learnable component of multi-agent systems. DiffMAS performs parameter-efficient supervised training over multi-agent latent trajectories, enabling agents to jointly learn how information should be encoded and interpreted across interactions. Experiments on mathematical reasoning, scientific QA, code generation, and commonsense benchmarks show that DiffMAS consistently improves reasoning accuracy and decoding stability over single-agent inference, text-based multi-agent systems, and prior latent communication methods, achieving 26.7% on AIME24, 20.2% on GPQA-Diamond, and consistent gains across reasoning benchmarks.",
          "authors": [
            "Ye Yu",
            "Heming Liu",
            "Haibo Jin",
            "Xiaopeng Yuan",
            "Peng Kuang",
            "Haohan Wang"
          ],
          "categories": [
            "cs.AI",
            "cs.CL",
            "cs.MA"
          ],
          "paper_id": "http://arxiv.org/abs/2604.21794v1",
          "abstract_url": "https://arxiv.org/abs/2604.21794v1",
          "pdf_url": "https://arxiv.org/pdf/2604.21794v1",
          "published_at": "2026-04-23T15:53:25+00:00",
          "updated_at": "2026-04-23T15:53:25+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "方法"
          ],
          "topics": [
            "Benchmark",
            "Language Model"
          ],
          "doi": null,
          "arxiv_id": "2604.21794",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2604.21794v1"
          },
          "relevance_score": 106,
          "match_reasons": [
            "title matched \"agent\"",
            "summary matched \"reasoning\"",
            "summary matched \"benchmark\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2604.21794"
        },
        {
          "title": "AgenticQwen: Training Small Agentic Language Models with Dual Data Flywheels for Industrial-Scale Tool Use",
          "summary": "Modern industrial applications increasingly demand language models that act as agents, capable of multi-step reasoning and tool use in real-world settings. These tasks are typically performed under strict cost and latency constraints, making small agentic models highly desirable. In this paper, we introduce the AgenticQwen family of models, trained via multi-round reinforcement learning (RL) on synthetic data and a limited amount of open-source data. Our training framework combines reasoning RL and agentic RL with dual data flywheels that automatically generate increasingly challenging tasks. The reasoning flywheel increases task difficulty by learning from errors, while the agentic flywheel expands linear workflows into multi-branch behavior trees that better reflect the decision complexity of real-world applications. We validate AgenticQwen on public benchmarks and in an industrial agent system. The models achieve strong performance on multiple agentic benchmarks, and in our industrial agent system, close the gap with much larger models on search and data analysis tasks. Model checkpoints and part of the synthetic data: https://huggingface.co/collections/alibaba-pai/agenticqwen. Data synthesis and RL training code: https://github.com/haruhi-sudo/data_synth_and_rl. The data synthesis pipeline is also integrated into EasyDistill: https://github.com/modelscope/easydistill.",
          "authors": [
            "Yuanjie Lyu",
            "Chengyu Wang",
            "Haonan Zheng",
            "Yuanhao Yue",
            "Junbing Yan",
            "Ming Wang",
            "Jun Huang"
          ],
          "categories": [
            "cs.CL"
          ],
          "paper_id": "http://arxiv.org/abs/2604.21590v1",
          "abstract_url": "https://arxiv.org/abs/2604.21590v1",
          "pdf_url": "https://arxiv.org/pdf/2604.21590v1",
          "published_at": "2026-04-23T12:14:52+00:00",
          "updated_at": "2026-04-23T12:14:52+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "数据",
            "应用",
            "方法"
          ],
          "topics": [
            "Benchmark",
            "Language Model"
          ],
          "doi": null,
          "arxiv_id": "2604.21590",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2604.21590v1"
          },
          "relevance_score": 102,
          "match_reasons": [
            "title matched \"agent\"",
            "summary matched \"reasoning\"",
            "summary matched \"benchmark\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2604.21590"
        },
        {
          "title": "Temporal Taskification in Streaming Continual Learning: A Source of Evaluation Instability",
          "summary": "Streaming Continual Learning (CL) typically converts a continuous stream into a sequence of discrete tasks through temporal partitioning. We argue that this temporal taskification step is not a neutral preprocessing choice, but a structural component of evaluation: different valid splits of the same stream can induce different CL regimes and therefore different benchmark conclusions. To study this effect, we introduce a taskification-level framework based on plasticity and stability profiles, a profile distance between taskifications, and Boundary-Profile Sensitivity (BPS), which diagnoses how strongly small boundary perturbations alter the induced regime before any CL model is trained. We evaluate continual finetuning, Experience Replay, Elastic Weight Consolidation, and Learning without Forgetting on network traffic forecasting with CESNET-Timeseries24, keeping the stream, model, and training budget fixed while varying only the temporal taskification. Across 9-, 30-, and 44-day splits, we observe substantial changes in forecasting error, forgetting, and backward transfer, showing that taskification alone can materially affect CL evaluation. We further find that shorter taskifications induce noisier distribution-level patterns, larger structural distances, and higher BPS, indicating greater sensitivity to boundary perturbations. These results show that benchmark conclusions in streaming CL depend not only on the learner and the data stream, but also on how that stream is taskified, motivating temporal taskification as a first-class evaluation variable.",
          "authors": [
            "Nicolae Filat",
            "Ahmed Hussain",
            "Konstantinos Kalogiannis",
            "Elena Burceanu"
          ],
          "categories": [
            "cs.LG"
          ],
          "paper_id": "http://arxiv.org/abs/2604.21930v1",
          "abstract_url": "https://arxiv.org/abs/2604.21930v1",
          "pdf_url": "https://arxiv.org/pdf/2604.21930v1",
          "published_at": "2026-04-23T17:59:54+00:00",
          "updated_at": "2026-04-23T17:59:54+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "方法"
          ],
          "topics": [
            "Benchmark",
            "Evaluation"
          ],
          "doi": null,
          "arxiv_id": "2604.21930",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2604.21930v1"
          },
          "relevance_score": 90,
          "match_reasons": [
            "title matched \"evaluation\"",
            "summary matched \"benchmark\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2604.21930"
        },
        {
          "title": "Nemobot Games: Crafting Strategic AI Gaming Agents for Interactive Learning with Large Language Models",
          "summary": "This paper introduces a new paradigm for AI game programming, leveraging large language models (LLMs) to extend and operationalize Claude Shannon's taxonomy of game-playing machines. Central to this paradigm is Nemobot, an interactive agentic engineering environment that enables users to create, customize, and deploy LLM-powered game agents while actively engaging with AI-driven strategies. The LLM-based chatbot, integrated within Nemobot, demonstrates its capabilities across four distinct classes of games. For dictionary-based games, it compresses state-action mappings into efficient, generalized models for rapid adaptability. In rigorously solvable games, it employs mathematical reasoning to compute optimal strategies and generates human-readable explanations for its decisions. For heuristic-based games, it synthesizes strategies by combining insights from classical minimax algorithms (see, e.g., shannon1950chess) with crowd-sourced data. Finally, in learning-based games, it utilizes reinforcement learning with human feedback and self-critique to iteratively refine strategies through trial-and-error and imitation learning. Nemobot amplifies this framework by offering a programmable environment where users can experiment with tool-augmented generation and fine-tuning of strategic game agents. From strategic games to role-playing games, Nemobot demonstrates how AI agents can achieve a form of self-programming by integrating crowdsourced learning and human creativity to iteratively refine their own logic. This represents a step toward the long-term goal of self-programming AI.",
          "authors": [
            "Chee Wei Tan",
            "Yuchen Wang",
            "Shangxin Guo"
          ],
          "categories": [
            "cs.AI"
          ],
          "paper_id": "http://arxiv.org/abs/2604.21896v1",
          "abstract_url": "https://arxiv.org/abs/2604.21896v1",
          "pdf_url": "https://arxiv.org/pdf/2604.21896v1",
          "published_at": "2026-04-23T17:46:29+00:00",
          "updated_at": "2026-04-23T17:46:29+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "方法"
          ],
          "topics": [
            "Language Model",
            "Agent"
          ],
          "doi": null,
          "arxiv_id": "2604.21896",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2604.21896v1"
          },
          "relevance_score": 90,
          "match_reasons": [
            "title matched \"agent\"",
            "summary matched \"reasoning\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2604.21896"
        },
        {
          "title": "Who Defines \"Best\"? Towards Interactive, User-Defined Evaluation of LLM Leaderboards",
          "summary": "LLM leaderboards are widely used to compare models and guide deployment decisions. However, leaderboard rankings are shaped by evaluation priorities set by benchmark designers, rather than by the diverse goals and constraints of actual users and organizations. A single aggregate score often obscures how models behave across different prompt types and compositions. In this work, we conduct an in-depth analysis of the dataset used in the LMArena (formerly Chatbot Arena) benchmark and investigate this evaluation challenge by designing an interactive visualization interface as a design probe. Our analysis reveals that the dataset is heavily skewed toward certain topics, that model rankings vary across prompt slices, and that preference-based judgments are used in ways that blur their intended scope. Building on this analysis, we introduce a visualization interface that allows users to define their own evaluation priorities by selecting and weighting prompt slices and to explore how rankings change accordingly. A qualitative study suggests that this interactive approach improves transparency and supports more context-specific model evaluation, pointing toward alternative ways to design and use LLM leaderboards.",
          "authors": [
            "Minji Jung",
            "Minjae Lee",
            "Yejin Kim",
            "Sarang Choi",
            "Minsuk Kahng"
          ],
          "categories": [
            "cs.AI",
            "cs.CY",
            "cs.HC"
          ],
          "paper_id": "http://arxiv.org/abs/2604.21769v1",
          "abstract_url": "https://arxiv.org/abs/2604.21769v1",
          "pdf_url": "https://arxiv.org/pdf/2604.21769v1",
          "published_at": "2026-04-23T15:28:32+00:00",
          "updated_at": "2026-04-23T15:28:32+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "数据",
            "应用",
            "方法"
          ],
          "topics": [
            "Benchmark",
            "Evaluation"
          ],
          "doi": null,
          "arxiv_id": "2604.21769",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2604.21769v1"
          },
          "relevance_score": 87,
          "match_reasons": [
            "title matched \"evaluation\"",
            "summary matched \"benchmark\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2604.21769"
        },
        {
          "title": "AEL: Agent Evolving Learning for Open-Ended Environments",
          "summary": "LLM agents increasingly operate in open-ended environments spanning hundreds of sequential episodes, yet they remain largely stateless: each task is solved from scratch without converting past experience into better future behavior. The central obstacle is not \\emph{what} to remember but \\emph{how to use} what has been remembered, including which retrieval policy to apply, how to interpret prior outcomes, and when the current strategy itself must change. We introduce \\emph{Agent Evolving Learning} (\\ael{}), a two-timescale framework that addresses this obstacle. At the fast timescale, a Thompson Sampling bandit learns which memory retrieval policy to apply at each episode; at the slow timescale, LLM-driven reflection diagnoses failure patterns and injects causal insights into the agent's decision prompt, giving it an interpretive frame for the evidence it retrieves. On a sequential portfolio benchmark (10 sector-diverse tickers, 208 episodes, 5 random seeds), \\ael{} achieves a Sharpe ratio of 2.13$\\pm$0.47, outperforming five published self-improving methods and all non-LLM baselines while maintaining the lowest variance among all LLM-based approaches. A nine-variant ablation reveals a ``less is more'' pattern: memory and reflection together produce a 58\\% cumulative improvement over the stateless baseline, yet every additional mechanism we test (planner evolution, per-tool selection, cold-start initialization, skill extraction, and three credit assignment methods) \\emph{degrades} performance. This demonstrates that the bottleneck in agent self-improvement is \\emph{self-diagnosing how to use} experience rather than adding architectural complexity. Code and data: https://github.com/WujiangXu/AEL.",
          "authors": [
            "Wujiang Xu",
            "Jiaojiao Han",
            "Minghao Guo",
            "Kai Mei",
            "Xi Zhu",
            "Han Zhang",
            "Dimitris N. Metaxas"
          ],
          "categories": [
            "cs.CL",
            "cs.AI",
            "cs.CE"
          ],
          "paper_id": "http://arxiv.org/abs/2604.21725v1",
          "abstract_url": "https://arxiv.org/abs/2604.21725v1",
          "pdf_url": "https://arxiv.org/pdf/2604.21725v1",
          "published_at": "2026-04-23T14:29:25+00:00",
          "updated_at": "2026-04-23T14:29:25+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "方法"
          ],
          "topics": [
            "Benchmark",
            "Agent"
          ],
          "doi": null,
          "arxiv_id": "2604.21725",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2604.21725v1"
          },
          "relevance_score": 86,
          "match_reasons": [
            "title matched \"agent\"",
            "summary matched \"benchmark\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2604.21725"
        },
        {
          "title": "Transient Turn Injection: Exposing Stateless Multi-Turn Vulnerabilities in Large Language Models",
          "summary": "Large language models (LLMs) are increasingly integrated into sensitive workflows, raising the stakes for adversarial robustness and safety. This paper introduces Transient Turn Injection (TTI), a new multi-turn attack technique that systematically exploits stateless moderation by distributing adversarial intent across isolated interactions. TTI leverages automated attacker agents powered by large language models to iteratively test and evade policy enforcement in both commercial and open-source LLMs, marking a departure from conventional jailbreak approaches that typically depend on maintaining persistent conversational context. Our extensive evaluation across state-of-the-art models, including those from OpenAI, Anthropic, Google Gemini, Meta, and prominent open-source alternatives, uncovers significant variations in resilience to TTI attacks, with only select architectures exhibiting substantial inherent robustness. Our automated black-box evaluation framework also uncovers previously unknown model-specific vulnerabilities and attack surface patterns, especially within medical and high-stakes domains. We further compare TTI against established adversarial prompting methods and detail practical mitigation strategies, such as session-level context aggregation and deep alignment approaches. Our study underscores the urgent need for holistic, context-aware defenses and continuous adversarial testing to future-proof LLM deployments against evolving multi-turn threats.",
          "authors": [
            "Naheed Rayhan",
            "Sohely Jahan"
          ],
          "categories": [
            "cs.CR",
            "cs.AI"
          ],
          "paper_id": "http://arxiv.org/abs/2604.21860v1",
          "abstract_url": "https://arxiv.org/abs/2604.21860v1",
          "pdf_url": "https://arxiv.org/pdf/2604.21860v1",
          "published_at": "2026-04-23T16:56:14+00:00",
          "updated_at": "2026-04-23T16:56:14+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "应用",
            "方法"
          ],
          "topics": [
            "Language Model",
            "Evaluation"
          ],
          "doi": null,
          "arxiv_id": "2604.21860",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2604.21860v1"
          },
          "relevance_score": 85,
          "match_reasons": [
            "summary matched \"agent\"",
            "summary matched \"alignment\"",
            "summary matched \"evaluation\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2604.21860"
        },
        {
          "title": "Inferring High-Level Events from Timestamped Data: Complexity and Medical Applications",
          "summary": "In this paper, we develop a novel logic-based approach to detecting high-level temporally extended events from timestamped data and background knowledge. Our framework employs logical rules to capture existence and termination conditions for simple temporal events and to combine these into meta-events. In the medical domain, for example, disease episodes and therapies are inferred from timestamped clinical observations, such as diagnoses and drug administrations stored in patient records, and can be further combined into higher-level disease events. As some incorrect events might be inferred, we use constraints to identify incompatible combinations of events and propose a repair mechanism to select preferred consistent sets of events. While reasoning in the full framework is intractable, we identify relevant restrictions that ensure polynomial-time data complexity. Our prototype system implements core components of the approach using answer set programming. An evaluation on a lung cancer use case supports the interest of the approach, both in terms of computational feasibility and positive alignment of our results with medical expert opinions. While strongly motivated by the needs of the healthcare domain, our framework is purposely generic, enabling its reuse in other areas.",
          "authors": [
            "Yvon K. Awuklu",
            "Meghyn Bienvenu",
            "Katsumi Inoue",
            "Vianney Jouhet",
            "Fleur Mougin"
          ],
          "categories": [
            "cs.AI"
          ],
          "paper_id": "http://arxiv.org/abs/2604.21793v1",
          "abstract_url": "https://arxiv.org/abs/2604.21793v1",
          "pdf_url": "https://arxiv.org/pdf/2604.21793v1",
          "published_at": "2026-04-23T15:53:13+00:00",
          "updated_at": "2026-04-23T15:53:13+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "应用",
            "方法"
          ],
          "topics": [
            "Evaluation",
            "Reasoning"
          ],
          "doi": null,
          "arxiv_id": "2604.21793",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2604.21793v1"
          },
          "relevance_score": 84,
          "match_reasons": [
            "summary matched \"reasoning\"",
            "summary matched \"alignment\"",
            "summary matched \"evaluation\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2604.21793"
        },
        {
          "title": "Language as a Latent Variable for Reasoning Optimization",
          "summary": "As LLMs reduce English-centric bias, a surprising trend emerges: non-English responses sometimes outperform English on reasoning tasks. We hypothesize that language functions as a latent variable that structurally modulates the model's internal inference pathways, rather than merely serving as an output medium. To test this, we conducted a Polyglot Thinking Experiment, in which models were prompted to solve identical problems under language-constrained and language-unconstrained conditions. Results show that non-English responses often achieve higher accuracy, and the best performance frequently occurs when language is unconstrained, suggesting that multilinguality broadens the model's latent reasoning space. Based on this insight, we propose polyGRPO (Polyglot Group Relative Policy Optimization), an RL framework that treats language variation as an implicit exploration signal. It generates polyglot preference data online under language-constrained and unconstrained conditions, optimizing the policy with respect to both answer accuracy and reasoning structure. Trained on only 18.1K multilingual math problems without chain-of-thought annotations, polyGRPO improves the base model (Qwen2.5-7B-Instruct) by 6.72% absolute accuracy on four English reasoning test sets and 6.89% on their multilingual benchmark. Remarkably, it is the only method that surpasses the base LLM on an English commonsense reasoning task (4.9%), despite being trained solely on math data, highlighting its strong cross-task generalization. Further analysis reveals that treating language as a latent variable expands the model's latent reasoning space, yielding consistent and generalizable improvements in reasoning performance.",
          "authors": [
            "Linjuan Wu",
            "Haoran Wei",
            "Jialong Tang",
            "Shuang Luo",
            "Baosong Yang",
            "Yongliang Shen",
            "Weiming Lu"
          ],
          "categories": [
            "cs.CL"
          ],
          "paper_id": "http://arxiv.org/abs/2604.21593v1",
          "abstract_url": "https://arxiv.org/abs/2604.21593v1",
          "pdf_url": "https://arxiv.org/pdf/2604.21593v1",
          "published_at": "2026-04-23T12:19:14+00:00",
          "updated_at": "2026-04-23T12:19:14+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "数据",
            "方法"
          ],
          "topics": [
            "Benchmark",
            "Reasoning"
          ],
          "doi": null,
          "arxiv_id": "2604.21593",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2604.21593v1"
          },
          "relevance_score": 84,
          "match_reasons": [
            "title matched \"reasoning\"",
            "summary matched \"benchmark\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2604.21593"
        },
        {
          "title": "CoFEE: Reasoning Control for LLM-Based Feature Discovery",
          "summary": "Feature discovery from complex unstructured data is fundamentally a reasoning problem: it requires identifying abstractions that are predictive of a target outcome while avoiding leakage, proxies, and post-outcome signals. With the introduction of ever-improving Large Language Models (LLMs), our method provides a structured method for addressing this challenge. LLMs are well suited for this task by being able to process large amounts of information, but unconstrained feature generation can lead to weak features. In this work, we study reasoning control in LLMs by inducing cognitive behaviors for improving feature discovery. We introduce CoFEE (Cognitive Feature Engineering Engine), a reasoning control framework that enforces cognitive behaviors in how the LLM reasons during feature discovery. From a machine learning perspective, these cognitive behaviors act as structured inductive biases over the space of candidate features generated by the model. These behaviors have been exploited with success in ML models, and include backward chaining from outcomes, subgoal decomposition, verification against observability and leakage criteria, and explicit backtracking of rejected reasoning paths. In a controlled comparison, we show that enforcing cognitive behaviors yields features with higher empirical predictability than those under unconstrained vanilla LLM prompts. CoFEE achieves an average Success Rate Score that is 15.2% higher than the vanilla approach, while generating 29% fewer features and reducing costs by 53.3%. Using held-out feature evaluation, we assess whether cognitively induced features generalize beyond the data used for discovery. Our results indicate that, in our evaluated setting, reasoning control is associated with improvements in quality and efficiency of LLM-based feature discovery.",
          "authors": [
            "Maximilian Westermann",
            "Ben Griffin",
            "Aaron Ontoyin Yin",
            "Zakari Salifu",
            "Yagiz Ihlamur",
            "Kelvin Amoaba",
            "Joseph Ternasky",
            "Fuat Alican",
            "Yigit Ihlamur"
          ],
          "categories": [
            "cs.AI",
            "cs.CE",
            "cs.LG"
          ],
          "paper_id": "http://arxiv.org/abs/2604.21584v1",
          "abstract_url": "https://arxiv.org/abs/2604.21584v1",
          "pdf_url": "https://arxiv.org/pdf/2604.21584v1",
          "published_at": "2026-04-23T12:05:38+00:00",
          "updated_at": "2026-04-23T12:05:38+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "方法"
          ],
          "topics": [
            "Language Model",
            "Evaluation"
          ],
          "doi": null,
          "arxiv_id": "2604.21584",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2604.21584v1"
          },
          "relevance_score": 84,
          "match_reasons": [
            "title matched \"reasoning\"",
            "summary matched \"evaluation\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2604.21584"
        },
        {
          "title": "A-IC3: Learning-Guided Adaptive Inductive Generalization for Hardware Model Checking",
          "summary": "The IC3 algorithm represents the state-of-the-art (SOTA) hardware model checking technique, owing to its robust performance and scalability. A significant body of research has focused on enhancing the solving efficiency of the IC3 algorithm, with particular attention to the inductive generalization process: a critical phase wherein the algorithm seeks to generalize a counterexample to inductiveness (CTI), which typically is a state leading to a bad state, into a broader set of states. This inductive generalization is a primary source of clauses in IC3 and thus plays a pivotal role in determining the overall effectiveness of the algorithm. Despite its importance, existing approaches often rely on fixed inductive generalization strategies, overlooking the dynamic and context-sensitive nature of the verification environment in which spurious counterexamples arise. This rigidity can limit the quality of generated clauses and, consequently, the performance of IC3. To address this limitation, we propose a lightweight machine-learning-based framework that dynamically selects appropriate inductive generalization strategies in response to the evolving verification context. Specifically, we employ a multi-armed bandit (MAB) algorithm to adaptively choose inductive generalization strategies based on real-time feedback from the verification process. The agent is updated by evaluating the quality of generalization outcomes, thereby refining its strategy selection over time. Empirical evaluation on a benchmark suite comprising 914 instances, primarily drawn from the latest HWMCC collection, demonstrates the efficacy of our approach. When implemented on the state-of-the-art model checker rIC3, our method solves 26 to 50 more cases than the baselines and improves the PAR-2 score by 194.72 to 389.29.",
          "authors": [
            "Xiaofeng Zhou",
            "Guangyu Hu",
            "Hongce Zhang",
            "Wei Zhang"
          ],
          "categories": [
            "cs.LO",
            "cs.LG"
          ],
          "paper_id": "http://arxiv.org/abs/2604.21688v1",
          "abstract_url": "https://arxiv.org/abs/2604.21688v1",
          "pdf_url": "https://arxiv.org/pdf/2604.21688v1",
          "published_at": "2026-04-23T13:53:06+00:00",
          "updated_at": "2026-04-23T13:53:06+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "应用",
            "方法"
          ],
          "topics": [
            "Benchmark",
            "Evaluation"
          ],
          "doi": null,
          "arxiv_id": "2604.21688",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2604.21688v1"
          },
          "relevance_score": 82,
          "match_reasons": [
            "summary matched \"agent\"",
            "summary matched \"benchmark\"",
            "summary matched \"evaluation\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2604.21688"
        },
        {
          "title": "DryRUN: On the Role of Public Tests in LLM-Driven Code Generation",
          "summary": "Multi-agent frameworks are widely used in autonomous code generation and have applications in complex algorithmic problem-solving. Recent work has addressed the challenge of generating functionally correct code by incorporating simulation-driven planning and debugging, where language models trace execution steps to verify logic. However, these approaches depend on human-provided public test cases to ground the debugging and simulation loop. Manually authoring comprehensive input-output examples is a labor-intensive bottleneck in the software development lifecycle. Because ground-truth input-output examples are rarely available prior to implementation in real-world software engineering, this dependency restricts methods to curated competitive programming benchmarks. Furthermore, we identify that reliance on these public tests induces an \"overconfidence gap,\" causing frameworks to overfit to simplistic examples and fail on hidden evaluations. In contrast, we observe that external sample inputs are not strictly necessary for code generation. We demonstrate that large language models can autonomously generate valid inputs and simulate execution traces to self-correct. Consequently, we develop DryRUN, a framework that eliminates the need for ground-truth samples by allowing the LLM to iteratively plan, autonomously generate its own inputs, and simulate execution, mitigating algorithmic overconfidence. Evaluations on the LiveCodeBench v6 dataset (post-March 2025) demonstrate that DryRUN matches the performance of CodeSIM, a state-of-the-art and public-test-dependent framework, while operating entirely without public test cases or external execution feedback and reducing output token consumption.",
          "authors": [
            "Kaushitha Silva",
            "Srinath Perera"
          ],
          "categories": [
            "cs.SE",
            "cs.AI"
          ],
          "paper_id": "http://arxiv.org/abs/2604.21598v1",
          "abstract_url": "https://arxiv.org/abs/2604.21598v1",
          "pdf_url": "https://arxiv.org/pdf/2604.21598v1",
          "published_at": "2026-04-23T12:21:03+00:00",
          "updated_at": "2026-04-23T12:21:03+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "数据",
            "应用",
            "方法"
          ],
          "topics": [
            "Benchmark",
            "Language Model"
          ],
          "doi": null,
          "arxiv_id": "2604.21598",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2604.21598v1"
          },
          "relevance_score": 80,
          "match_reasons": [
            "summary matched \"agent\"",
            "summary matched \"benchmark\"",
            "summary matched \"evaluation\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2604.21598"
        },
        {
          "title": "Evaluation of Automatic Speech Recognition Using Generative Large Language Models",
          "summary": "Automatic Speech Recognition (ASR) is traditionally evaluated using Word Error Rate (WER), a metric that is insensitive to meaning. Embedding-based semantic metrics are better correlated with human perception, but decoder-based Large Language Models (LLMs) remain underexplored for this task. This paper evaluates their relevance through three approaches: (1) selecting the best hypothesis between two candidates, (2) computing semantic distance using generative embeddings, and (3) qualitative classification of errors. On the HATS dataset, the best LLMs achieve 92-94% agreement with human annotators for hypothesis selection, compared to 63% for WER, also outperforming semantic metrics. Embeddings from decoder-based LLMs show performance comparable to encoder models. Finally, LLMs offer a promising direction for interpretable and semantic ASR evaluation.",
          "authors": [
            "Thibault Bañeras-Roux",
            "Shashi Kumar",
            "Driss Khalil",
            "Sergio Burdisso",
            "Petr Motlicek",
            "Shiran Liu",
            "Mickael Rouvier",
            "Jane Wottawa",
            "Richard Dufour"
          ],
          "categories": [
            "cs.CL"
          ],
          "paper_id": "http://arxiv.org/abs/2604.21928v1",
          "abstract_url": "https://arxiv.org/abs/2604.21928v1",
          "pdf_url": "https://arxiv.org/pdf/2604.21928v1",
          "published_at": "2026-04-23T17:59:47+00:00",
          "updated_at": "2026-04-23T17:59:47+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "数据",
            "方法"
          ],
          "topics": [
            "Language Model",
            "Evaluation"
          ],
          "doi": null,
          "arxiv_id": "2604.21928",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2604.21928v1"
          },
          "relevance_score": 72,
          "match_reasons": [
            "title matched \"evaluation\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2604.21928"
        },
        {
          "title": "From Research Question to Scientific Workflow: Leveraging Agentic AI for Science Automation",
          "summary": "Scientific workflow systems automate execution (scheduling, fault tolerance, resource management) but not the semantic translation that precedes it. Scientists still manually convert research questions into workflow specifications, a task requiring both domain knowledge and infrastructure expertise. We propose an agentic architecture that closes this gap through three layers: an LLM interprets natural language into structured intents (semantic layer); validated generators produce reproducible workflow DAGs (deterministic layer); and domain experts author \"Skills\": markdown documents encoding vocabulary mappings, parameter constraints, and optimization strategies (knowledge layer). This decomposition confines LLM non-determinism to intent extraction: identical intents always yield identical workflows. We implement and evaluate the architecture on the 1000 Genomes population genetics workflow and Hyperflow WMS running on Kubernetes. In an ablation study on 150 queries, Skills raise full-match intent accuracy from 44% to 83%; skill-driven deferred workflow generation reduces data transfer by 92%; and the end-to-end pipeline completes queries on Kubernetes with LLM overhead below 15 seconds and cost under $0.001 per query.",
          "authors": [
            "Bartosz Balis",
            "Michal Orzechowski",
            "Piotr Kica",
            "Michal Dygas",
            "Michal Kuszewski"
          ],
          "categories": [
            "cs.AI"
          ],
          "paper_id": "http://arxiv.org/abs/2604.21910v1",
          "abstract_url": "https://arxiv.org/abs/2604.21910v1",
          "pdf_url": "https://arxiv.org/pdf/2604.21910v1",
          "published_at": "2026-04-23T17:52:52+00:00",
          "updated_at": "2026-04-23T17:52:52+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "应用",
            "方法"
          ],
          "topics": [
            "Agent"
          ],
          "doi": null,
          "arxiv_id": "2604.21910",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2604.21910v1"
          },
          "relevance_score": 72,
          "match_reasons": [
            "title matched \"agent\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2604.21910"
        }
      ]
    },
    {
      "name": "Vision",
      "key_points": [
        "《Pre-process for segmentation task with nonlinear diffusion filters》〔方法〕：This paper deals with the case of using nonlinear diffusion filters to obtain piecewise constant images as a preprocessing step for segmentation techniques. We f…",
        "《KD-CVG: A Knowledge-Driven Approach for Creative Video Generation》〔数据 / 应用 / 方法〕：Creative Generation (CG) leverages generative models to automatically produce advertising content that highlights product features, and it has been a significa…",
        "《Exploring the Role of Synthetic Data Augmentation in Controllable Human-Centric Video Generation》〔数据 / 应用 / 方法〕：Controllable human video generation aims to produce realistic videos of humans with explicitly guided motions and appearances, serving as a foundation for digit…"
      ],
      "sort_by": "hybrid",
      "papers": [
        {
          "title": "Pre-process for segmentation task with nonlinear diffusion filters",
          "summary": "This paper deals with the case of using nonlinear diffusion filters to obtain piecewise constant images as a preprocessing step for segmentation techniques. We first show an intrinsic formulation for the nonlinear diffusion equation to provide some design conditions on the diffusion filters. According to this theoretical framework, we propose a new family of diffusivities; they are obtained from nonlinear diffusion techniques and are related to backward diffusion. Their goal is to split the image into closed contours with a homogenized grey intensity inside and with no blurred edges. We also prove that our filters satisfy the semi-discrete and fully discrete well-posedness and scale-space requirements. This shows that by using semi-implicit schemes, a forward nonlinear diffusion equation is solved, instead of a backward nonlinear diffusion equation, connecting with an edge-preserving process. Under the conditions established for the diffusivity and using a stopping criterion for the diffusion time, we get piecewise constant images with a low computational effort. Finally, we test our filter with real images and we illustrate the effects of our diffusivity function as a method to get piecewise constant images. The code is available at https://github.com/cplatero/NonlinearDiffusion.",
          "authors": [
            "Javier Sanguino",
            "Carlos Platero",
            "Olga Velasco"
          ],
          "categories": [
            "cs.CV"
          ],
          "paper_id": "http://arxiv.org/abs/2604.21422v1",
          "abstract_url": "https://arxiv.org/abs/2604.21422v1",
          "pdf_url": "https://arxiv.org/pdf/2604.21422v1",
          "published_at": "2026-04-23T08:38:45+00:00",
          "updated_at": "2026-04-23T08:38:45+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "方法"
          ],
          "topics": [
            "Diffusion",
            "Segmentation"
          ],
          "doi": null,
          "arxiv_id": "2604.21422",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2604.21422v1"
          },
          "relevance_score": 102,
          "match_reasons": [
            "title matched \"diffusion\"",
            "title matched \"segmentation\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2604.21422"
        },
        {
          "title": "KD-CVG: A Knowledge-Driven Approach for Creative Video Generation",
          "summary": "Creative Generation (CG) leverages generative models to automatically produce advertising content that highlights product features, and it has been a significant focus of recent research. However, while CG has advanced considerably, most efforts have concentrated on generating advertising text and images, leaving Creative Video Generation (CVG) relatively underexplored. This gap is largely due to two major challenges faced by Text-to-Video (T2V) models: (a) ambiguous semantic alignment, where models struggle to accurately correlate product selling points with creative video content, and (b) inadequate motion adaptability, resulting in unrealistic movements and distortions. To address these challenges, we develop a comprehensive Advertising Creative Knowledge Base (ACKB) as a foundational resource and propose a knowledge-driven approach (KD-CVG) to overcome the knowledge limitations of existing models. KD-CVG consists of two primary modules: Semantic-Aware Retrieval (SAR) and Multimodal Knowledge Reference (MKR). SAR utilizes the semantic awareness of graph attention networks and reinforcement learning feedback to enhance the model's comprehension of the connections between selling points and creative videos. Building on this, MKR incorporates semantic and motion priors into the T2V model to address existing knowledge gaps. Extensive experiments have demonstrated KD-CVG's superior performance in achieving semantic alignment and motion adaptability, validating its effectiveness over other state-of-the-art methods. The code and dataset will be open-sourced at https://kdcvg.github.io/KDCVG/.",
          "authors": [
            "Linkai Liu",
            "Wei Feng",
            "Xi Zhao",
            "Shen Zhang",
            "Xingye Chen",
            "Zheng Zhang",
            "Jingjing Lv",
            "Junjie Shen",
            "Ching Law",
            "Yuchen Zhou",
            "Zipeng Guo",
            "Chao Gou"
          ],
          "categories": [
            "cs.CV"
          ],
          "paper_id": "http://arxiv.org/abs/2604.21362v1",
          "abstract_url": "https://arxiv.org/abs/2604.21362v1",
          "pdf_url": "https://arxiv.org/pdf/2604.21362v1",
          "published_at": "2026-04-23T07:24:15+00:00",
          "updated_at": "2026-04-23T07:24:15+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "数据",
            "应用",
            "方法"
          ],
          "topics": [
            "Alignment",
            "Multimodal"
          ],
          "doi": null,
          "arxiv_id": "2604.21362",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2604.21362v1"
          },
          "relevance_score": 79,
          "match_reasons": [
            "title matched \"video generation\"",
            "summary matched \"multimodal\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2604.21362"
        },
        {
          "title": "Exploring the Role of Synthetic Data Augmentation in Controllable Human-Centric Video Generation",
          "summary": "Controllable human video generation aims to produce realistic videos of humans with explicitly guided motions and appearances,serving as a foundation for digital humans, animation, and embodied AI.However, the scarcity of largescale, diverse, and privacy safe human video datasets poses a major bottleneck, especially for rare identities and complex actions.Synthetic data provides a scalable and controllable alternative,yet its actual contribution to generative modeling remains underexplored due to the persistent Sim2Real gap.In this work,we systematically investigate the impact of synthetic data on controllable human video generation. We propose a diffusion-based framework that enables fine-grained control over appearance and motion while providing a unfied testbed to analyze how synthetic data interacts with real world data during training. Through extensive experiments, we reveal the complementary roles of synthetic and real data and demonstrate possible methods for efficiently selecting synthetic samples to enhance motion realism,temporal consistency,and identity preservation.Our study offers the first comprehensive exploration of synthetic data's role in human-centric video synthesis and provides practical insights for building data-efficient and generalizable generative models.",
          "authors": [
            "Yuanchen Fei",
            "Yude Zou",
            "Zejian Kang",
            "Ming Li",
            "Jiaying Zhou",
            "Xiangru Huang"
          ],
          "categories": [
            "cs.CV",
            "cs.AI"
          ],
          "paper_id": "http://arxiv.org/abs/2604.21291v1",
          "abstract_url": "https://arxiv.org/abs/2604.21291v1",
          "pdf_url": "https://arxiv.org/pdf/2604.21291v1",
          "published_at": "2026-04-23T05:10:15+00:00",
          "updated_at": "2026-04-23T05:10:15+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "数据",
            "应用",
            "方法"
          ],
          "topics": [
            "Diffusion",
            "Video Generation"
          ],
          "doi": null,
          "arxiv_id": "2604.21291",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2604.21291v1"
          },
          "relevance_score": 77,
          "match_reasons": [
            "title matched \"video generation\"",
            "summary matched \"diffusion\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2604.21291"
        },
        {
          "title": "Seeing Fast and Slow: Learning the Flow of Time in Videos",
          "summary": "How can we tell whether a video has been sped up or slowed down? How can we generate videos at different speeds? Although videos have been central to modern computer vision research, little attention has been paid to perceiving and controlling the passage of time. In this paper, we study time as a learnable visual concept and develop models for reasoning about and manipulating the flow of time in videos. We first exploit the multimodal cues and temporal structure naturally present in videos to learn, in a self-supervised manner, to detect speed changes and estimate playback speed. We then show that these learned temporal reasoning models enable us to curate the largest slow-motion video dataset to date from noisy in-the-wild sources. Such slow-motion footage, typically filmed by high-speed cameras, contains substantially richer temporal detail than standard videos. Using this data, we further develop models capable of temporal control, including speed-conditioned video generation, which produces motion at specified playback speed, and temporal super-resolution, which tranforms low-FPS, blurry videos into high-FPS sequences with fine-grained temporal details. Our findings highlight time as a manipulable, perceptual dimension in video learning, opening doors to temporally controllable video generation, temporal forensics detection, and potentially richer world-models that understand how events unfold over time.",
          "authors": [
            "Yen-Siang Wu",
            "Rundong Luo",
            "Jingsen Zhu",
            "Tao Tu",
            "Ali Farhadi",
            "Matthew Wallingford",
            "Yu-Chiang Frank Wang",
            "Steve Marschner",
            "Wei-Chiu Ma"
          ],
          "categories": [
            "cs.CV",
            "cs.AI",
            "cs.GR"
          ],
          "paper_id": "http://arxiv.org/abs/2604.21931v1",
          "abstract_url": "https://arxiv.org/abs/2604.21931v1",
          "pdf_url": "https://arxiv.org/pdf/2604.21931v1",
          "published_at": "2026-04-23T17:59:57+00:00",
          "updated_at": "2026-04-23T17:59:57+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "数据",
            "应用",
            "方法"
          ],
          "topics": [
            "Reasoning",
            "Multimodal"
          ],
          "doi": null,
          "arxiv_id": "2604.21931",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2604.21931v1"
          },
          "relevance_score": 68,
          "match_reasons": [
            "summary matched \"video generation\"",
            "summary matched \"multimodal\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2604.21931"
        },
        {
          "title": "DCMorph: Face Morphing via Dual-Stream Cross-Attention Diffusion",
          "summary": "Advancing face morphing attack techniques is crucial to anticipate evolving threats and develop robust defensive mechanisms for identity verification systems. This work introduces DCMorph, a dual-stream diffusion-based morphing framework that simultaneously operates at both identity conditioning and latent space levels. Unlike image-level methods suffering from blending artifacts or GAN-based approaches with limited reconstruction fidelity, DCMorph leverages identity-conditioned latent diffusion models through two mechanisms: (1) decoupled cross-attention interpolation that injects identity-specific features from both source faces into the denoising process, enabling explicit dual-identity conditioning absent in existing diffusion-based methods, and (2) DDIM inversion with spherical interpolation between inverted latent representations from both source faces, providing geometrically consistent initial latent representation that preserves structural attributes. Vulnerability analyses across four state-of-the-art face recognition systems demonstrate that DCMorph achieves the highest attack success rates compared to existing methods at both operational thresholds, while remaining challenging to detect by current morphing attack detection solutions.",
          "authors": [
            "Tahar Chettaoui",
            "Eduarda Caldeira",
            "Guray Ozgur",
            "Raghavendra Ramachandra",
            "Fadi Boutros",
            "Naser Damer"
          ],
          "categories": [
            "cs.CV"
          ],
          "paper_id": "http://arxiv.org/abs/2604.21627v1",
          "abstract_url": "https://arxiv.org/abs/2604.21627v1",
          "pdf_url": "https://arxiv.org/pdf/2604.21627v1",
          "published_at": "2026-04-23T12:46:07+00:00",
          "updated_at": "2026-04-23T12:46:07+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "方法"
          ],
          "topics": [
            "Diffusion"
          ],
          "doi": null,
          "arxiv_id": "2604.21627",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2604.21627v1"
          },
          "relevance_score": 66,
          "match_reasons": [
            "title matched \"diffusion\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2604.21627"
        },
        {
          "title": "Sculpt4D: Generating 4D Shapes via Sparse-Attention Diffusion Transformers",
          "summary": "Recent breakthroughs in 3D generative modeling have yielded remarkable progress in static shape synthesis, yet high-fidelity dynamic 4D generation remains elusive, hindered by temporal artifacts and prohibitive computational demand. We present Sculpt4D, a native 4D generative framework that seamlessly integrates efficient temporal modeling into a pretrained 3D Diffusion Transformer (Hunyuan3D 2.1), thereby mitigating the scarcity of 4D training data. At its core lies a Block Sparse Attention mechanism that preserves object identity by anchoring to the initial frame while capturing rich motion dynamics via a time-decaying sparse mask. This design faithfully models complex spatiotemporal dependencies with high fidelity, while sidestepping the quadratic overhead of full attention and reducing network total computation by 56%. Consequently, Sculpt4D establishes a new state-of-the-art in temporally coherent 4D synthesis and charts a path toward efficient and scalable 4D generation.",
          "authors": [
            "Minghao Yin",
            "Wenbo Hu",
            "Jiale Xu",
            "Ying Shan",
            "Kai Han"
          ],
          "categories": [
            "cs.CV"
          ],
          "paper_id": "http://arxiv.org/abs/2604.21592v1",
          "abstract_url": "https://arxiv.org/abs/2604.21592v1",
          "pdf_url": "https://arxiv.org/pdf/2604.21592v1",
          "published_at": "2026-04-23T12:18:55+00:00",
          "updated_at": "2026-04-23T12:18:55+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "方法"
          ],
          "topics": [
            "Diffusion"
          ],
          "doi": null,
          "arxiv_id": "2604.21592",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2604.21592v1"
          },
          "relevance_score": 66,
          "match_reasons": [
            "title matched \"diffusion\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2604.21592"
        },
        {
          "title": "Deep kernel video approximation for unsupervised action segmentation",
          "summary": "This work focuses on per-video unsupervised action segmentation, which is of interest to applications where storing large datasets is either not possible, or nor permitted. We propose to segment videos by learning in deep kernel space, to approximate the underlying frame distribution, as closely as possible. To define this closeness metric between the original video distribution and its approximation, we rely on maximum mean discrepancy (MMD) which is a geometry-preserving metric in distribution space, and thus gives more reliable estimates. Moreover, unlike the commonly used optimal transport metric, MMD is both easier to optimize, and faster. We choose to use neural tangent kernels (NTKs) to define the kernel space where MMD operates, because of their improved descriptive power as opposed to fixed kernels. And, also, because NTKs sidestep the trivial solution, when jointly learning the inputs (video approximation) and the kernel function. Finally, we show competitive results when compared to state-of-the-art per-video methods, on six standard benchmarks. Additionally, our method has higher F1 scores than prior agglomerative work, when the number of segments is unknown.",
          "authors": [
            "Silvia L. Pintea",
            "Jouke Dijkstra"
          ],
          "categories": [
            "cs.CV"
          ],
          "paper_id": "http://arxiv.org/abs/2604.21572v1",
          "abstract_url": "https://arxiv.org/abs/2604.21572v1",
          "pdf_url": "https://arxiv.org/pdf/2604.21572v1",
          "published_at": "2026-04-23T11:52:56+00:00",
          "updated_at": "2026-04-23T11:52:56+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "数据",
            "应用",
            "方法"
          ],
          "topics": [
            "Benchmark",
            "Segmentation"
          ],
          "doi": null,
          "arxiv_id": "2604.21572",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2604.21572v1"
          },
          "relevance_score": 66,
          "match_reasons": [
            "title matched \"segmentation\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2604.21572"
        },
        {
          "title": "DiffNR: Diffusion-Enhanced Neural Representation Optimization for Sparse-View 3D Tomographic Reconstruction",
          "summary": "Neural representations (NRs), such as neural fields and 3D Gaussians, effectively model volumetric data in computed tomography (CT) but suffer from severe artifacts under sparse-view settings. To address this, we propose DiffNR, a novel framework that enhances NR optimization with diffusion priors. At its core is SliceFixer, a single-step diffusion model designed to correct artifacts in degraded slices. We integrate specialized conditioning layers into the network and develop tailored data curation strategies to support model finetuning. During reconstruction, SliceFixer periodically generates pseudo-reference volumes, providing auxiliary 3D perceptual supervision to fix underconstrained regions. Compared to prior methods that embed CT solvers into time-consuming iterative denoising, our repair-and-augment strategy avoids frequent diffusion model queries, leading to better runtime performance. Extensive experiments show that DiffNR improves PSNR by 3.99 dB on average, generalizes well across domains, and maintains efficient optimization.",
          "authors": [
            "Shiyan Su",
            "Ruyi Zha",
            "Danli Shi",
            "Hongdong Li",
            "Xuelian Cheng"
          ],
          "categories": [
            "eess.IV",
            "cs.CV"
          ],
          "paper_id": "http://arxiv.org/abs/2604.21518v1",
          "abstract_url": "https://arxiv.org/abs/2604.21518v1",
          "pdf_url": "https://arxiv.org/pdf/2604.21518v1",
          "published_at": "2026-04-23T10:27:54+00:00",
          "updated_at": "2026-04-23T10:27:54+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "方法"
          ],
          "topics": [
            "Diffusion"
          ],
          "doi": null,
          "arxiv_id": "2604.21518",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2604.21518v1"
          },
          "relevance_score": 64,
          "match_reasons": [
            "title matched \"diffusion\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2604.21518"
        },
        {
          "title": "S1-VL: Scientific Multimodal Reasoning Model with Thinking-with-Images",
          "summary": "We present S1-VL, a multimodal reasoning model for scientific domains that natively supports two complementary reasoning paradigms: Scientific Reasoning, which relies on structured chain-of-thought, and Thinking-with-Images, which enables the model to actively manipulate images through Python code execution during reasoning. In the Thinking-with-Images mode, the model generates and executes image-processing code in a sandbox environment, obtains intermediate visual results, and continues reasoning in a multi-turn iterative manner. This design is particularly effective for challenging scenarios such as high-resolution scientific chart interpretation, microscopic image understanding, and geometry-assisted reasoning. To construct the training data, we collect scientific multimodal datasets spanning six disciplines: mathematics, physics, chemistry, astronomy, geography, and biology. We further develop a six-dimensional quality filtering framework for reasoning trajectories. To mitigate redundant, ineffective, and erroneous visual operations commonly found in existing datasets, we propose a multi-stage filtering pipeline together with an adaptive data routing strategy. This strategy converts samples with low visual information gain into pure Reasoning-mode data, enabling the model to learn when image operations are truly necessary. S1-VL is trained through a four-stage progressive pipeline: scientific multimodal SFT, Thinking-with-Images cold-start SFT, and two stages of reinforcement learning with SAPO. We build S1-VL-32B on top of Qwen3-VL-32B-Thinking and evaluate it on 13 benchmarks. Experimental results show that S1-VL-32B achieves state-of-the-art performance on all five Thinking-with-Images benchmarks, including HRBench-4K, HRBench-8K, MME-RealWorld-CN, MME-RealWorld-Lite, and V*, and outperforms compared systems on scientific reasoning benchmarks such as Physics and VRSBench.",
          "authors": [
            "Qingxiao Li",
            "Lifeng Xu",
            "QingLi Wang",
            "Yudong Bai",
            "Mingwei Ou",
            "Shu Hu",
            "Nan Xu"
          ],
          "categories": [
            "cs.CV"
          ],
          "paper_id": "http://arxiv.org/abs/2604.21409v1",
          "abstract_url": "https://arxiv.org/abs/2604.21409v1",
          "pdf_url": "https://arxiv.org/pdf/2604.21409v1",
          "published_at": "2026-04-23T08:23:25+00:00",
          "updated_at": "2026-04-23T08:23:25+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "数据",
            "方法"
          ],
          "topics": [
            "Benchmark",
            "Reasoning"
          ],
          "doi": null,
          "arxiv_id": "2604.21409",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2604.21409v1"
          },
          "relevance_score": 62,
          "match_reasons": [
            "title matched \"multimodal\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2604.21409"
        },
        {
          "title": "SparseGF: A Height-Aware Sparse Segmentation Framework with Context Compression for Robust Ground Filtering Across Urban to Natural Scenes",
          "summary": "High-quality digital terrain models derived from airborne laser scanning (ALS) data are essential for a wide range of geospatial analyses, and their generation typically relies on robust ground filtering (GF) to separate point clouds across diverse landscapes into ground and non-ground parts. Although current deep-learning-based GF methods have demonstrated impressive performance, especially in specific challenging terrains, their cross-scene generalization remains limited by two persistent issues: the context-detail dilemma in large-scale processing due to limited computational resources, and the random misclassification of tall objects arising from classification-only optimization. To overcome these limitations, we propose SparseGF, a height-aware sparse segmentation framework enhanced with context compression. It is built upon three key innovations: (1) a convex-mirror-inspired context compression module that condenses expansive contexts into compact representations while preserving central details; (2) a hybrid sparse voxel-point network architecture that effectively interprets compressed representations while mitigating compression-induced geometric distortion; and (3) a height-aware loss function that explicitly enforces topographic elevation priors during training to suppress random misclassification of tall objects. Extensive evaluations on two large-scale ALS benchmark datasets demonstrate that SparseGF delivers robust GF across urban to natural terrains, achieving leading performance in complex urban scenes, competitive results on mixed terrains, and moderate yet non-catastrophic accuracy in densely forested steep areas. This work offers new insights into deep-learning-based GF research and encourages further exploration toward truly cross-scene generalization for large-scale environmental monitoring.",
          "authors": [
            "Nannan Qin",
            "Pengjie Tao",
            "Haiyan Guan",
            "Zhizhong Kang",
            "Lingfei Ma",
            "Xiangyun Hu",
            "Jonathan Li"
          ],
          "categories": [
            "cs.CV"
          ],
          "paper_id": "http://arxiv.org/abs/2604.21356v1",
          "abstract_url": "https://arxiv.org/abs/2604.21356v1",
          "pdf_url": "https://arxiv.org/pdf/2604.21356v1",
          "published_at": "2026-04-23T07:15:12+00:00",
          "updated_at": "2026-04-23T07:15:12+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "数据",
            "应用",
            "方法"
          ],
          "topics": [
            "Benchmark",
            "Evaluation"
          ],
          "doi": null,
          "arxiv_id": "2604.21356",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2604.21356v1"
          },
          "relevance_score": 61,
          "match_reasons": [
            "title matched \"segmentation\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2604.21356"
        }
      ]
    },
    {
      "name": "PubMed AI",
      "key_points": [
        "《Biomed-DPT: Dual Modality Prompt Tuning for Biomedical Vision-Language Models.》〔数据 / 应用 / 方法〕：Prompt learning has emerged as one of the most effective paradigms for adapting pre-trained vision language models (VLMs) to biomedical image classification ta…",
        "《Clinical Knowledge-Guided PET/CT Lesion Segmentation with Interpretable Fusion of Metabolic and Structural Cues.》〔评测 / 数据 / 应用 / 方法〕：Automated whole-body lesion segmentation in 18 F-FDG PET/CT images marks a pivotal breakthrough in oncological diagnostics, substantially improving the accurac…",
        "《Accelerating real-world data collection using large language models in rare neoplasms: a bone sarcoma example.》〔应用 / 方法〕：BACKGROUND: Real-world data collection in oncology remains a challenge due to the complex and unstructured format of medical notes. Recently, large language mo…"
      ],
      "sort_by": "hybrid",
      "papers": [
        {
          "title": "Biomed-DPT: Dual Modality Prompt Tuning for Biomedical Vision-Language Models.",
          "summary": "Prompt learning has emerged as one of the most effective paradigms for adapting pre-trained vision language models (VLMs) to biomedical image classification tasks in few-shot scenarios. However, most existing prompt learning methods rely on a single textual prompt, often ignoring the particular visual structures (e.g., the complex anatomical structures and subtle pathological features) in biomedical images. In this work, we propose Biomed DPT, a knowledge-enhanced dual-modality prompt tuning framework. For text prompts, Biomed-DPT constructs a dual prompt including template-driven ensemble clinical prompts and large language model (LLM)-driven expert domain adapted prompts. These prompts are systematically ranked and their optimal combination is searched for using a neural network. A semantic regularization loss is then applied to extract clinical knowledge while mitigating semantic discrepancies. For visual prompts, Biomed-DPT introduces zero vectors as soft prompts to leverage attention re-weighting so that the focus on non-diagnostic regions and the recognition of non-critical pathological features are avoided. Biomed DPT achieves an average classification accuracy of 66.28% across 11 biomedical image datasets covering 9 modalities and 10 organs, with performance reaching 79.54% in base classes and 76.91% in novel classes. Our code is available at: https://github.com/pengwei222/Biomed-DPT.",
          "authors": [
            "Wei Peng",
            "Jianchen Hu",
            "Kang Liu",
            "Meng Zhang"
          ],
          "categories": [
            "Journal Article"
          ],
          "paper_id": "pubmed:42024940",
          "abstract_url": "https://pubmed.ncbi.nlm.nih.gov/42024940/",
          "pdf_url": null,
          "published_at": "2026-04-23T17:22:00+00:00",
          "updated_at": "2026-04-23T17:22:00+00:00",
          "source": "pubmed",
          "date_label": "Entered",
          "analysis": null,
          "tags": [
            "数据",
            "应用",
            "方法"
          ],
          "topics": [
            "Language Model",
            "Clinical"
          ],
          "doi": "10.1109/jbhi.2026.3686818",
          "arxiv_id": null,
          "source_variants": [
            "pubmed"
          ],
          "source_urls": {
            "pubmed": "https://pubmed.ncbi.nlm.nih.gov/42024940/",
            "doi": "https://doi.org/10.1109/JBHI.2026.3686818"
          },
          "relevance_score": 93,
          "match_reasons": [
            "title matched \"language model\"",
            "summary matched \"clinical\"",
            "has DOI",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "doi:10.1109/jbhi.2026.3686818"
        },
        {
          "title": "Clinical Knowledge-Guided PET/CT Lesion Segmentation with Interpretable Fusion of Metabolic and Structural Cues.",
          "summary": "Automated whole-body lesion segmentation in 18 F-FDG PET/CT images marks a pivotal breakthrough in oncological diagnostics, substantially improving the accuracy and efficiency of tumor burden assessment. Manual segmentation is often plagued by significant interobserver variability, underscoring the necessity for automated solutions. The synergistic combination of PET's exceptional sensitivity for detecting metabolic activity with CT's anatomical precision renders accurate segmentation crucial for achieving quantitative and reproducible clinical workflows. However, current methodologies frequently grapple with challenges such as over-segmentation or under-segmentation, inadvertently delineating normal tissues with elevated uptake or neglecting lesions characterized by subtle intensity variations, primarily due to a lack of integrated metabolic and anatomical insights. To address these limitations, we present a novel framework that adeptly integrates clinical expertise regarding anatomical and metabolic cues to refine PET/CT lesion segmentation. Our innovative mixture-of-experts (MoE) based interpretable fusion module skillfully merges complementary modality information while explicitly elucidating the pixel-level contributions of each modality to the final segmentation outcome. Rigorous evaluations across three in-domain benchmarks and two external datasets demonstrate our model's superior segmentation performance and generalizability. Furthermore, our visualizations provide compelling insights into the pivotal role each modality plays in the decision-making process, highlighting our approach's transformative potential in enhancing PET/CT lesion segmentation. Building on this foundation, we further validated the prognostic significance of the features extracted from our proposed framework in the context of PET/CT-based prognosis predictions.",
          "authors": [
            "Song Zhang",
            "Jiajin Zhang",
            "Liheng Qiu",
            "Wei Liu",
            "Dakai Jin",
            "Wenpei Jiao",
            "Le Lu",
            "Tzu-Chen Yen",
            "Shenmiao Yang",
            "Ke Yan"
          ],
          "categories": [
            "Journal Article"
          ],
          "paper_id": "pubmed:42024951",
          "abstract_url": "https://pubmed.ncbi.nlm.nih.gov/42024951/",
          "pdf_url": null,
          "published_at": "2026-04-23T17:22:00+00:00",
          "updated_at": "2026-04-23T17:22:00+00:00",
          "source": "pubmed",
          "date_label": "Entered",
          "analysis": null,
          "tags": [
            "评测",
            "数据",
            "应用",
            "方法"
          ],
          "topics": [
            "Benchmark",
            "Evaluation"
          ],
          "doi": "10.1109/tmi.2026.3686884",
          "arxiv_id": null,
          "source_variants": [
            "pubmed"
          ],
          "source_urls": {
            "pubmed": "https://pubmed.ncbi.nlm.nih.gov/42024951/",
            "doi": "https://doi.org/10.1109/TMI.2026.3686884"
          },
          "relevance_score": 93,
          "match_reasons": [
            "title matched \"clinical\"",
            "summary matched \"benchmark\"",
            "has DOI",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "doi:10.1109/tmi.2026.3686884"
        },
        {
          "title": "Accelerating real-world data collection using large language models in rare neoplasms: a bone sarcoma example.",
          "summary": "BACKGROUND: Real-world data collection in oncology remains a challenge due to the complex and unstructured format of medical notes. Recently, large language models (LLMs) have demonstrated success in extracting information from free-text data across various domains. This study evaluates the performance of multiple small LLMs as information extractors on Polish medical notes. MATERIALS AND METHODS: Electronic health records (EHRs) of 302 bone sarcoma patients treated in a reference center between 2016 and 2022 were selected. Five variables (pathology type, tumor size, localization, grade, and primary resection) were annotated by an experienced oncologist. Multiple prompting techniques and four LLMs were used to query the models with the task of returning the value for each variable using an XML tag. Additionally, among non-concordant values we distinguished valid results, i.e. of expected format and containing a key word/phrase from a per-variable, expert-devised list. An ensemble voting approach was applied, selecting values appearing in the majority of valid outputs. RESULTS: Single-model accuracy was modest (17.5%-30.3%) and highly prompt-dependent. The tumor localization values turned out to be the easiest to assess with an accuracy of up to 36.2%. The majority of non-concordant values were non-valid. The voting strategy improved performance significantly, with 83.6% overall accuracy, peaking at 90.0% for the resection type variable. CONCLUSIONS: Our study highlights the potential of using lightweight LLMs in the automation of data extraction from medical notes, which could significantly accelerate clinical research. A singular small LLM is not yet sufficient for real use cases in non-English settings; however, prompt engineering and ensemble methods can greatly improve performance.",
          "authors": [
            "P Teterycz",
            "S Rynkun",
            "B Szostakowski",
            "M Wągrodzki",
            "P Rutkowski",
            "M Rosińska"
          ],
          "categories": [
            "Journal Article"
          ],
          "paper_id": "pubmed:42021926",
          "abstract_url": "https://pubmed.ncbi.nlm.nih.gov/42021926/",
          "pdf_url": null,
          "published_at": "2026-04-23T05:06:00+00:00",
          "updated_at": "2026-04-23T05:06:00+00:00",
          "source": "pubmed",
          "date_label": "Entered",
          "analysis": null,
          "tags": [
            "应用",
            "方法"
          ],
          "topics": [
            "Language Model",
            "Clinical"
          ],
          "doi": "10.1016/j.esmorw.2026.100705",
          "arxiv_id": null,
          "source_variants": [
            "pubmed"
          ],
          "source_urls": {
            "pubmed": "https://pubmed.ncbi.nlm.nih.gov/42021926/",
            "doi": "https://doi.org/10.1016/j.esmorw.2026.100705"
          },
          "relevance_score": 81,
          "match_reasons": [
            "title matched \"language model\"",
            "summary matched \"clinical\"",
            "has DOI",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "doi:10.1016/j.esmorw.2026.100705"
        },
        {
          "title": "GATE: Graph and Text Exchange for Zero-Shot ECG Classification with LLM Prompts.",
          "summary": "Electrocardiography (ECG) is a fundamental tool for diagnosing cardiovascular diseases, yet the scarcity of large-scale annotated data limits the applicability of supervised learning approaches. While self-supervised learning (SSL) has shown promise for ECG representation learning, existing methods often suffer from semantic distortion, insufficient spatial modeling, and a lack of integration with medical knowledge. To address these challenges, we propose GATE (Graph-And-Text Exchange), a novel multimodal SSL framework that enhances the quality of the representation of ECG through cross-modal exchange between graph-structured data and clinical ECG reports. GATE employs a spatiotemporal graph encoder to capture fine-grained intra- and inter-lead dependencies, and introduces a lexical knowledge-embedded codebook to enhance the semantic representation of clinical reports, facilitating effective graph-text alignment. During inference, GATE integrates a large language model with a domain-specific knowledge base to generate semantically enriched disease descriptions, enabling robust zero-shot classification. Extensive experiments on three real-world ECG datasets demonstrate that GATE outperforms state-of-the-art self-supervised and multimodal baselines under both low-resource and zero-shot settings. Notably, GATE achieves competitive performance even when trained on only 1% of labeled data, highlighting its strong generalization and clinical potential.",
          "authors": [
            "Ying An",
            "Shiyu Tang",
            "Xianlai Chen",
            "Lin Guo"
          ],
          "categories": [
            "Journal Article"
          ],
          "paper_id": "pubmed:42024946",
          "abstract_url": "https://pubmed.ncbi.nlm.nih.gov/42024946/",
          "pdf_url": null,
          "published_at": "2026-04-23T17:22:00+00:00",
          "updated_at": "2026-04-23T17:22:00+00:00",
          "source": "pubmed",
          "date_label": "Entered",
          "analysis": null,
          "tags": [
            "数据",
            "应用",
            "方法"
          ],
          "topics": [
            "Language Model",
            "Clinical"
          ],
          "doi": "10.1109/jbhi.2026.3686890",
          "arxiv_id": null,
          "source_variants": [
            "pubmed"
          ],
          "source_urls": {
            "pubmed": "https://pubmed.ncbi.nlm.nih.gov/42024946/",
            "doi": "https://doi.org/10.1109/JBHI.2026.3686890"
          },
          "relevance_score": 71,
          "match_reasons": [
            "summary matched \"language model\"",
            "summary matched \"clinical\"",
            "has DOI",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "doi:10.1109/jbhi.2026.3686890"
        },
        {
          "title": "Learning from Prototypes: Contrastive Learning with Prior-Aware Multi-Label Chest X-ray Classification.",
          "summary": "Multi-label Chest X-ray (CXR) classification faces significant challenges from the inherently imperfect nature of clinical data, particularly the complex interplay of co-occurring pathologies, training data with a long-tailed distribution, and high visual similarity between distinct diseases. To address these challenges, we propose a novel framework that synergizes medical prior knowledge with prototype-driven contrastive learning, enabling disentangled and discriminative per-pathology representation learning. In particular, our approach integrates a co-occurrence modulated Label Graph Attention (LGA) module, which leverages semantic prior knowledge from a pre-trained large language model (LLM) and statistical co-occurrence patterns from training data to model inter-pathology relationships. Subsequently, a Label-Aware Decoupling (LAD) decoder is proposed to isolate pathology-specific visual features and mitigate feature suppression by dominant classes. Furthermore, we introduce an Adaptive Prototype Contrastive Learning (APCL) mechanism to enhance the discriminability of visually similar pathologies. Extensive experiments on the NIH ChestX-ray14 and CheXpert datasets demonstrate the framework's superiority, achieving state-of-the-art mean AUCs of 0.834 and 0.840, respectively. Furthermore, cross-dataset evaluations on the external MIMIC-CXR dataset validate the framework's exceptional zero-shot and few-shot generalization capabilities, highlighting its strong robustness and potential for real-world clinical deployment. The implementation is available at https://github.com/ZengXHYX/Learning-from-Prototypes.",
          "authors": [
            "Xuhao Zeng",
            "Haoming Ye",
            "Nanlan Yu",
            "Feng Ding",
            "Keping Yu",
            "Haijun Li",
            "Zhijiang Wan"
          ],
          "categories": [
            "Journal Article"
          ],
          "paper_id": "pubmed:42024945",
          "abstract_url": "https://pubmed.ncbi.nlm.nih.gov/42024945/",
          "pdf_url": null,
          "published_at": "2026-04-23T17:22:00+00:00",
          "updated_at": "2026-04-23T17:22:00+00:00",
          "source": "pubmed",
          "date_label": "Entered",
          "analysis": null,
          "tags": [
            "评测",
            "数据",
            "应用",
            "方法"
          ],
          "topics": [
            "Language Model",
            "Evaluation"
          ],
          "doi": "10.1109/jbhi.2026.3687164",
          "arxiv_id": null,
          "source_variants": [
            "pubmed"
          ],
          "source_urls": {
            "pubmed": "https://pubmed.ncbi.nlm.nih.gov/42024945/",
            "doi": "https://doi.org/10.1109/JBHI.2026.3687164"
          },
          "relevance_score": 71,
          "match_reasons": [
            "summary matched \"language model\"",
            "summary matched \"clinical\"",
            "has DOI",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "doi:10.1109/jbhi.2026.3687164"
        }
      ]
    },
    {
      "name": "OpenAlex AI",
      "key_points": [],
      "sort_by": "hybrid",
      "papers": []
    }
  ]
}