<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
<channel>
<title>reasoning Topic Archive</title>
<link>reasoning.html</link>
<description>Long-term tracking RSS feed for the keyword "reasoning", aggregating all historically matched publications.</description>
<language>zh-CN</language>
<lastBuildDate>Wed, 22 Apr 2026 03:37:20 +0000</lastBuildDate>
<item>
<title>Four-Axis Decision Alignment for Long-Horizon Enterprise AI Agents</title>
<link>../papers/arxiv-d363006cb185.html</link>
<guid>https://arxiv.org/abs/2604.19457v1#2026-04-22#reasoning</guid>
<pubDate>Wed, 22 Apr 2026 11:37:03 +0800</pubDate>
<description>Long-horizon enterprise agents make high-stakes decisions (loan underwriting, claims adjudication, clinical review, prior authorization) under lossy memory, multi-step reasoning, and binding regulatory constraints. Current evaluation reports a single task-success scalar that conflates distinct failure modes and hides whether an agent is aligned with the standards its deployment environment requires. We propose that long-horizon decision behavior decomposes into four orthogonal alignment axes, e…</description>
</item>
<item>
<title>Taming Actor-Observer Asymmetry in Agents via Dialectical Alignment</title>
<link>../papers/arxiv-3ca660d54bb4.html</link>
<guid>https://arxiv.org/abs/2604.19548v1#2026-04-22#reasoning</guid>
<pubDate>Wed, 22 Apr 2026 11:37:03 +0800</pubDate>
<description>Large Language Model agents have rapidly evolved from static text generators into dynamic systems capable of executing complex autonomous workflows. To enhance reliability, multi-agent frameworks assigning specialized roles are increasingly adopted to enable self-reflection and mutual auditing. While such role-playing effectively leverages domain expert knowledge, we find it simultaneously induces a human-like cognitive bias known as Actor-Observer Asymmetry (AOA). Specifically, an agent acting…</description>
</item>
<item>
<title>Discovering a Shared Logical Subspace: Steering LLM Logical Reasoning via Alignment of Natural-Language and Symbolic Views</title>
<link>../papers/arxiv-0d90d26515bd.html</link>
<guid>https://arxiv.org/abs/2604.19716v1#2026-04-22#reasoning</guid>
<pubDate>Wed, 22 Apr 2026 11:37:03 +0800</pubDate>
<description>Large Language Models (LLMs) still struggle with multi-step logical reasoning. Existing approaches either purely refine the reasoning chain in natural language form or attach a symbolic solver as an external module. In this work, we instead ask whether LLMs contain a shared internal logical subspace that simultaneously aligns natural-language and symbolic-language views of the reasoning process. Our hypothesis is that this logical subspace captures logical reasoning capabilities in LLMs that ar…</description>
</item>
<item>
<title>Revac: A Social Deduction Reasoning Agent</title>
<link>../papers/arxiv-49c0fe8adf77.html</link>
<guid>https://arxiv.org/abs/2604.19523v1#2026-04-22#reasoning</guid>
<pubDate>Wed, 22 Apr 2026 11:37:03 +0800</pubDate>
<description>Social deduction games such as Mafia present a unique AI challenge: players must reason under uncertainty, interpret incomplete and intentionally misleading information, evaluate human-like communication, and make strategic elimination decisions. Unlike deterministic board games, success in Mafia depends not on perfect information or brute-force search, but on inference, memory, and adaptability in the presence of deception. This work presents the design and evaluation of Revac-8, an AI agent d…</description>
</item>
<item>
<title>A-MAR: Agent-based Multimodal Art Retrieval for Fine-Grained Artwork Understanding</title>
<link>../papers/arxiv-5fe8f705aa06.html</link>
<guid>https://arxiv.org/abs/2604.19689v1#2026-04-22#reasoning</guid>
<pubDate>Wed, 22 Apr 2026 11:37:03 +0800</pubDate>
<description>Understanding artworks requires multi-step reasoning over visual content and cultural, historical, and stylistic context. While recent multimodal large language models show promise in artwork explanation, they rely on implicit reasoning and internalized knowledge, limiting interpretability and explicit evidence grounding. We propose A-MAR, an Agent-based Multimodal Art Retrieval framework that explicitly conditions retrieval on structured reasoning plans. Given an artwork and a user query, A-…</description>
</item>
<item>
<title>Multi-modal Reasoning with LLMs for Visual Semantic Arithmetic</title>
<link>../papers/arxiv-424f40f3b425.html</link>
<guid>https://arxiv.org/abs/2604.19567v1#2026-04-22#reasoning</guid>
<pubDate>Wed, 22 Apr 2026 11:37:03 +0800</pubDate>
<description>Reinforcement learning (RL) as post-training is crucial for enhancing the reasoning ability of large language models (LLMs) in coding and math. However, their capacity for visual semantic arithmetic, inferring relationships from images, remains underexplored. The classic text analogy &quot;king&quot;-&quot;man&quot;+&quot;woman&quot; = &quot;queen&quot; illustrates relational reasoning, yet replacing text with images of &quot;king&quot; and &quot;man&quot; significantly reduces performance because it requires commonsense knowledge and the extraction of…</description>
</item>
<item>
<title>A Self-Evolving Framework for Efficient Terminal Agents via Observational Context Compression</title>
<link>../papers/arxiv-cedced42e5cf.html</link>
<guid>https://arxiv.org/abs/2604.19572v1#2026-04-22#reasoning</guid>
<pubDate>Wed, 22 Apr 2026 11:37:03 +0800</pubDate>
<description>As model capabilities advance, research has increasingly shifted toward long-horizon, multi-turn terminal-centric agentic tasks, where raw environment feedback is often preserved in the interaction history to support future decisions. However, repeatedly retaining such feedback introduces substantial redundancy and causes cumulative token cost to grow quadratically with the number of steps, hindering long-horizon reasoning. Although observation compression can mitigate this issue, the heterogen…</description>
</item>
<item>
<title>Time Series Augmented Generation for Financial Applications</title>
<link>../papers/arxiv-a14f6e5fa3da.html</link>
<guid>https://arxiv.org/abs/2604.19633v1#2026-04-22#reasoning</guid>
<pubDate>Wed, 22 Apr 2026 11:37:03 +0800</pubDate>
<description>Evaluating the reasoning capabilities of Large Language Models (LLMs) for complex, quantitative financial tasks is a critical and unsolved challenge. Standard benchmarks often fail to isolate an agent&#x27;s core ability to parse queries and orchestrate computations. To address this, we introduce a novel evaluation methodology and benchmark designed to rigorously measure an LLM agent&#x27;s reasoning for financial time-series analysis. We apply this methodology in a large-scale empirical study using our…</description>
</item>
<item>
<title>Lost in Translation: Do LVLM Judges Generalize Across Languages?</title>
<link>../papers/arxiv-542a2e2a02e6.html</link>
<guid>https://arxiv.org/abs/2604.19405v1#2026-04-22#reasoning</guid>
<pubDate>Wed, 22 Apr 2026 11:37:03 +0800</pubDate>
<description>Automatic evaluators such as reward models play a central role in the alignment and evaluation of large vision-language models (LVLMs). Despite their growing importance, these evaluators are almost exclusively assessed on English-centric benchmarks, leaving open the question of how well these evaluators generalize across languages. To answer this question, we introduce MM-JudgeBench, the first large-scale benchmark for multilingual and multimodal judge model evaluation, which includes over 60K…</description>
</item>
<item>
<title>RF-HiT: Rectified Flow Hierarchical Transformer for General Medical Image Segmentation</title>
<link>../papers/arxiv-ab2855fef1f1.html</link>
<guid>https://arxiv.org/abs/2604.19570v1#2026-04-22#reasoning</guid>
<pubDate>Wed, 22 Apr 2026 11:37:03 +0800</pubDate>
<description>Accurate medical image segmentation requires both long-range contextual reasoning and precise boundary delineation, a task where existing transformer- and diffusion-based paradigms are frequently bottlenecked by quadratic computational complexity and prohibitive inference latency. We propose RF-HiT, a Rectified Flow Hierarchical Transformer that integrates an hourglass transformer backbone with a multi-scale hierarchical encoder for anatomically guided feature conditioning. Unlike prior diffusi…</description>
</item>
<item>
<title>How Far Are Video Models from True Multimodal Reasoning?</title>
<link>../papers/arxiv-f1cd701c6156.html</link>
<guid>https://arxiv.org/abs/2604.19193v1#2026-04-22#reasoning</guid>
<pubDate>Wed, 22 Apr 2026 11:37:03 +0800</pubDate>
<description>Despite remarkable progress toward general-purpose video models, a critical question remains unanswered: how far are these models from achieving true multimodal reasoning? Existing benchmarks fail to address this question rigorously, as they remain constrained by straightforward task designs and fragmented evaluation metrics that neglect complex multimodal reasoning. To bridge this gap, we introduce CLVG-Bench, an evaluation framework designed to probe video models&#x27; zero-shot reasoning capabili…</description>
</item>
<item>
<title>EgoMotion: Hierarchical Reasoning and Diffusion for Egocentric Vision-Language Motion Generation</title>
<link>../papers/arxiv-3a3bbc2e6e3a.html</link>
<guid>https://arxiv.org/abs/2604.19105v1#2026-04-22#reasoning</guid>
<pubDate>Wed, 22 Apr 2026 11:37:03 +0800</pubDate>
<description>Faithfully modeling human behavior in dynamic environments is a foundational challenge for embodied intelligence. While conditional motion synthesis has achieved significant advances, egocentric motion generation remains largely underexplored due to the inherent complexity of first-person perception. In this work, we investigate Egocentric Vision-Language (Ego-VL) motion generation. This task requires synthesizing 3D human motion conditioned jointly on first-person visual observations and natur…</description>
</item>
<item>
<title>MathNet: a Global Multimodal Benchmark for Mathematical Reasoning and Retrieval</title>
<link>../papers/arxiv-520299161763.html</link>
<guid>https://arxiv.org/abs/2604.18584v1#2026-04-21#reasoning</guid>
<pubDate>Tue, 21 Apr 2026 11:40:46 +0800</pubDate>
<description>Mathematical problem solving remains a challenging test of reasoning for large language and multimodal models, yet existing benchmarks are limited in size, language coverage, and task diversity. We introduce MathNet, a high-quality, large-scale, multimodal, and multilingual dataset of Olympiad-level math problems together with a benchmark for evaluating mathematical reasoning in generative models and mathematical retrieval in embedding-based systems. MathNet spans 47 countries, 17 languages, an…</description>
</item>
<item>
<title>Benchmarking System Dynamics AI Assistants: Cloud Versus Local LLMs on CLD Extraction and Discussion</title>
<link>../papers/arxiv-df421e6da9eb.html</link>
<guid>https://arxiv.org/abs/2604.18566v1#2026-04-21#reasoning</guid>
<pubDate>Tue, 21 Apr 2026 11:40:46 +0800</pubDate>
<description>We present a systematic evaluation of large language model families -- spanning both proprietary cloud APIs and locally-hosted open-source models -- on two purpose-built benchmarks for System Dynamics AI assistance: the \textbf{CLD Leaderboard} (53 tests, structured causal loop diagram extraction) and the \textbf{Discussion Leaderboard} (interactive model discussion, feedback explanation, and model building coaching). On CLD extraction, cloud models achieve 77--89\% overall pass rates; the best…</description>
</item>
<item>
<title>MASS-RAG: Multi-Agent Synthesis Retrieval-Augmented Generation</title>
<link>../papers/arxiv-d54216ff47bf.html</link>
<guid>https://arxiv.org/abs/2604.18509v1#2026-04-21#reasoning</guid>
<pubDate>Tue, 21 Apr 2026 11:40:46 +0800</pubDate>
<description>Large language models (LLMs) are widely used in retrieval-augmented generation (RAG) to incorporate external knowledge at inference time. However, when retrieved contexts are noisy, incomplete, or heterogeneous, a single generation process often struggles to reconcile evidence effectively. We propose \textbf{MASS-RAG}, a multi-agent synthesis approach to retrieval-augmented generation that structures evidence processing into multiple role-specialized agents. MASS-RAG applies distinct agents for…</description>
</item>
<item>
<title>OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation</title>
<link>../papers/arxiv-4fb01ed67d37.html</link>
<guid>https://arxiv.org/abs/2604.18486v1#2026-04-21#reasoning</guid>
<pubDate>Tue, 21 Apr 2026 11:40:46 +0800</pubDate>
<description>Chain-of-Thought (CoT) reasoning has become a powerful driver of trajectory prediction in VLA-based autonomous driving, yet its autoregressive nature imposes a latency cost that is prohibitive for real-time deployment. Latent CoT methods attempt to close this gap by compressing reasoning into continuous hidden states, but consistently fall short of their explicit counterparts. We suggest that this is due to purely linguistic latent representations compressing a symbolic abstraction of the world…</description>
</item>
<item>
<title>StepPO: Step-Aligned Policy Optimization for Agentic Reinforcement Learning</title>
<link>../papers/arxiv-4d1ad4b081bb.html</link>
<guid>https://arxiv.org/abs/2604.18401v1#2026-04-21#reasoning</guid>
<pubDate>Tue, 21 Apr 2026 11:40:46 +0800</pubDate>
<description>General agents have given rise to phenomenal applications such as OpenClaw and Claude Code. As these agent systems (a.k.a. Harnesses) strive for bolder goals, they demand increasingly stronger agentic capabilities from foundation Large Language Models (LLMs). Agentic Reinforcement Learning (RL) is emerging as a central post-training paradigm for empowering LLMs with these capabilities and is playing an increasingly pivotal role in agent training. Unlike single-turn token-level alignment or reas…</description>
</item>
<item>
<title>HiGMem: A Hierarchical and LLM-Guided Memory System for Long-Term Conversational Agents</title>
<link>../papers/arxiv-5cc1d83ffee2.html</link>
<guid>https://arxiv.org/abs/2604.18349v1#2026-04-21#reasoning</guid>
<pubDate>Tue, 21 Apr 2026 11:40:46 +0800</pubDate>
<description>Long-term conversational large language model (LLM) agents require memory systems that can recover relevant evidence from historical interactions without overwhelming the answer stage with irrelevant context. However, existing memory systems, including hierarchical ones, still often rely solely on vector similarity for retrieval. It tends to produce bloated evidence sets: adding many superficially similar dialogue turns yields little additional recall, but lowers retrieval precision, increases…</description>
</item>
<item>
<title>Too Correct to Learn: Reinforcement Learning on Saturated Reasoning Data</title>
<link>../papers/arxiv-b69c81bbad3f.html</link>
<guid>https://arxiv.org/abs/2604.18493v1#2026-04-21#reasoning</guid>
<pubDate>Tue, 21 Apr 2026 11:40:46 +0800</pubDate>
<description>Reinforcement Learning (RL) enhances LLM reasoning, yet a paradox emerges as models scale: strong base models saturate standard benchmarks (e.g., MATH), yielding correct but homogeneous solutions. In such environments, the lack of failure cases causes the advantage signal in group-relative algorithms (e.g., GRPO) to vanish, driving policies into mode collapse. To address this, we propose Constrained Uniform Top-K Sampling (CUTS), a parameter-free decoding strategy enforcing structure-preserving…</description>
</item>
<item>
<title>Semantic Step Prediction: Multi-Step Latent Forecasting in LLM Reasoning Trajectories via Step Sampling</title>
<link>../papers/arxiv-4797a7249e58.html</link>
<guid>https://arxiv.org/abs/2604.18464v1#2026-04-21#reasoning</guid>
<pubDate>Tue, 21 Apr 2026 11:40:46 +0800</pubDate>
<description>Semantic Tube Prediction (STP) leverages representation geometry to regularize LLM hidden-state trajectories toward locally linear geodesics during fine-tuning, thereby greatly improving data efficiency. The original STP recipe samples random token sub-spans, which is compatible with the base large language model (LLM) training architecture. Inspired by STP, we investigate whether the sampling position can further enhance the semantic structure of multi-step reasoning, and he…</description>
</item>
<item>
<title>Training and Agentic Inference Strategies for LLM-based Manim Animation Generation</title>
<link>../papers/arxiv-993f63372808.html</link>
<guid>https://arxiv.org/abs/2604.18364v1#2026-04-21#reasoning</guid>
<pubDate>Tue, 21 Apr 2026 11:40:46 +0800</pubDate>
<description>Generating programmatic animation using libraries such as Manim presents unique challenges for Large Language Models (LLMs), requiring spatial reasoning, temporal sequencing, and familiarity with domain-specific APIs that are underrepresented in general pre-training data. A systematic study of how training and inference strategies interact in this setting is lacking in current research. This study introduces ManimTrainer, a training pipeline that combines Supervised Fine-tuning (SFT) with Reinf…</description>
</item>
<item>
<title>AnchorSeg: Language Grounded Query Banks for Reasoning Segmentation</title>
<link>../papers/arxiv-28c2e2bc1523.html</link>
<guid>https://arxiv.org/abs/2604.18562v1#2026-04-21#reasoning</guid>
<pubDate>Tue, 21 Apr 2026 11:40:46 +0800</pubDate>
<description>Reasoning segmentation requires models to ground complex, implicit textual queries into precise pixel-level masks. Existing approaches rely on a single segmentation token $\texttt{&lt;SEG&gt;}$, whose hidden state implicitly encodes both semantic reasoning and spatial localization, limiting the model&#x27;s ability to explicitly disentangle what to segment from where to segment. We introduce AnchorSeg, which reformulates reasoning segmentation as a structured conditional generation process over image toke…</description>
</item>
<item>
<title>Revisiting Change VQA in Remote Sensing with Structured and Native Multimodal Qwen Models</title>
<link>../papers/arxiv-c7fa0d917c8c.html</link>
<guid>https://arxiv.org/abs/2604.18429v1#2026-04-21#reasoning</guid>
<pubDate>Tue, 21 Apr 2026 11:40:46 +0800</pubDate>
<description>Change visual question answering (Change VQA) addresses the problem of answering natural-language questions about semantic changes between bi-temporal remote sensing (RS) images. Although vision-language models (VLMs) have recently been studied for temporal RS image understanding, Change VQA remains underexplored in the context of modern multimodal models. In this letter, we revisit the CDVQA benchmark using recent Qwen models under a unified low-rank adaptation (LoRA) setting. We compare Qwen3…</description>
</item>
<item>
<title>Transforming oncology clinical trial matching through neuro-symbolic, multi-agent AI and an oncology-specific knowledge graph: a prospective evaluation in 3804 patients.</title>
<link>../papers/doi-a39ecce65f3a.html</link>
<guid>https://pubmed.ncbi.nlm.nih.gov/42004487/#2026-04-21#reasoning</guid>
<pubDate>Tue, 21 Apr 2026 11:40:46 +0800</pubDate>
<description>BACKGROUND: Clinical trial enrollment in oncology remains critically low, with fewer than 5% of eligible adults participating, in large part due to the complexity and labor intensity of eligibility screening. We prospectively evaluated a neuro-symbolic, multi-agent artificial intelligence (AI) platform integrating domain-specific large language model (LLM) agents, an oncology-specific knowledge graph, a real-time recommendation engine, and human-in-the-loop review to determine whether automated…</description>
</item>
<item>
<title>CoopEval: Benchmarking Cooperation-Sustaining Mechanisms and LLM Agents in Social Dilemmas</title>
<link>../papers/arxiv-5e024cdf605d.html</link>
<guid>https://arxiv.org/abs/2604.15267v1#2026-04-17#reasoning</guid>
<pubDate>Fri, 17 Apr 2026 11:39:21 +0800</pubDate>
<description>It is increasingly important that LLM agents interact effectively and safely with other goal-pursuing agents, yet, recent works report the opposite trend: LLMs with stronger reasoning capabilities behave _less_ cooperatively in mixed-motive games such as the prisoner&#x27;s dilemma and public goods settings. Indeed, our experiments show that recent models -- with or without reasoning enabled -- consistently defect in single-shot social dilemmas. To tackle this safety concern, we present the first co…</description>
</item>
<item>
<title>IE as Cache: Information Extraction Enhanced Agentic Reasoning</title>
<link>../papers/arxiv-b9668967d0c4.html</link>
<guid>https://arxiv.org/abs/2604.14930v1#2026-04-17#reasoning</guid>
<pubDate>Fri, 17 Apr 2026 11:39:21 +0800</pubDate>
<description>Information Extraction aims to distill structured, decision-relevant information from unstructured text, serving as a foundation for downstream understanding and reasoning. However, it is traditionally treated merely as a terminal objective: once extracted, the resulting structure is often consumed in isolation rather than maintained and reused during multi-step inference. Moving beyond this, we propose \textit{IE-as-Cache}, a framework that repurposes IE as a cognitive cache to enhance agentic…</description>
</item>
<item>
<title>From Reactive to Proactive: Assessing the Proactivity of Voice Agents via ProVoice-Bench</title>
<link>../papers/arxiv-913915b00c96.html</link>
<guid>https://arxiv.org/abs/2604.15037v1#2026-04-17#reasoning</guid>
<pubDate>Fri, 17 Apr 2026 11:39:21 +0800</pubDate>
<description>Recent advancements in LLM agents are gradually shifting from reactive, text-based paradigms toward proactive, multimodal interaction. However, existing benchmarks primarily focus on reactive responses, overlooking the complexities of proactive intervention and monitoring. To bridge this gap, we introduce ProVoice-Bench, the first evaluation framework specifically designed for proactive voice agents, featuring four novel tasks. By leveraging a multi-stage data synthesis pipeline, we curate 1,18…</description>
</item>
<item>
<title>Context Over Content: Exposing Evaluation Faking in Automated Judges</title>
<link>../papers/arxiv-0bc9230c8b6d.html</link>
<guid>https://arxiv.org/abs/2604.15224v1#2026-04-17#reasoning</guid>
<pubDate>Fri, 17 Apr 2026 11:39:21 +0800</pubDate>
<description>The $\textit{LLM-as-a-judge}$ paradigm has become the operational backbone of automated AI evaluation pipelines, yet rests on an unverified assumption: that judges evaluate text strictly on its semantic content, impervious to surrounding contextual framing. We investigate $\textit{stakes signaling}$, a previously unmeasured vulnerability where informing a judge model of the downstream consequences its verdicts will have on the evaluated model&#x27;s continued operation systematically corrupts its as…</description>
</item>
<item>
<title>AI-Assisted Requirements Engineering: An Empirical Evaluation Relative to Expert Judgment</title>
<link>../papers/arxiv-400f736db53f.html</link>
<guid>https://arxiv.org/abs/2604.15222v1#2026-04-17#reasoning</guid>
<pubDate>Fri, 17 Apr 2026 11:39:21 +0800</pubDate>
<description>Artificial Intelligence is increasingly introduced into systems engineering activities, particularly within requirements engineering, where quality assessment and validation remain heavily dependent on expert judgment. While recent AI tools demonstrate promising capabilities in analyzing and generating requirements, their role within formal systems engineering processes-and their alignment with established INCOSE criteria-remains insufficiently understood. This paper investigates the extent to…</description>
</item>
<item>
<title>MADE: A Living Benchmark for Multi-Label Text Classification with Uncertainty Quantification of Medical Device Adverse Events</title>
<link>../papers/arxiv-4416c06e91a3.html</link>
<guid>https://arxiv.org/abs/2604.15203v1#2026-04-17#reasoning</guid>
<pubDate>Fri, 17 Apr 2026 11:39:21 +0800</pubDate>
<description>Machine learning in high-stakes domains such as healthcare requires not only strong predictive performance but also reliable uncertainty quantification (UQ) to support human oversight. Multi-label text classification (MLTC) is a central task in this domain, yet remains challenging due to label imbalances, dependencies, and combinatorial complexity. Existing MLTC benchmarks are increasingly saturated and may be affected by training data contamination, making it difficult to distinguish genuine r…</description>
</item>
<item>
<title>ADAPT: Benchmarking Commonsense Planning under Unspecified Affordance Constraints</title>
<link>../papers/arxiv-148ce4c33832.html</link>
<guid>https://arxiv.org/abs/2604.14902v1#2026-04-17#reasoning</guid>
<pubDate>Fri, 17 Apr 2026 11:39:21 +0800</pubDate>
<description>Intelligent embodied agents should not simply follow instructions, as real-world environments often involve unexpected conditions and exceptions. However, existing methods usually focus on directly executing instructions, without considering whether the target objects can actually be manipulated, meaning they fail to assess available affordances. To address this limitation, we introduce DynAfford, a benchmark that evaluates embodied agents in dynamic environments where object affordances may ch…</description>
</item>
<item>
<title>Meituan Merchant Business Diagnosis via Policy-Guided Dual-Process User Simulation</title>
<link>../papers/doi-fda96f4fa371.html</link>
<guid>https://arxiv.org/abs/2604.15190v1#2026-04-17#reasoning</guid>
<pubDate>Fri, 17 Apr 2026 11:39:21 +0800</pubDate>
<description>Simulating group-level user behavior enables scalable counterfactual evaluation of merchant strategies without costly online experiments. However, building a trustworthy simulator faces two structural challenges. First, information incompleteness causes reasoning-based simulators to over-rationalize when unobserved factors such as offline context and implicit habits are missing. Second, mechanism duality requires capturing both interpretable preferences and implicit statistical regularities, wh…</description>
</item>
<item>
<title>From Tokens to Steps: Verification-Aware Speculative Decoding for Efficient Multi-Step Reasoning</title>
<link>../papers/arxiv-671db056cec2.html</link>
<guid>https://arxiv.org/abs/2604.15244v1#2026-04-17#reasoning</guid>
<pubDate>Fri, 17 Apr 2026 11:39:21 +0800</pubDate>
<description>Speculative decoding (SD) accelerates large language model inference by allowing a lightweight draft model to propose outputs that a stronger target model verifies. However, its token-centric nature allows erroneous steps to propagate. Prior approaches mitigate this using external reward models, but incur additional latency, computational overhead, and limit generalizability. We propose SpecGuard, a verification-aware speculative decoding framework that performs step-level verification using on…</description>
</item>
<item>
<title>Blue Data Intelligence Layer: Streaming Data and Agents for Multi-source Multi-modal Data-Centric Applications</title>
<link>../papers/arxiv-57fc3ce735ba.html</link>
<guid>https://arxiv.org/abs/2604.15233v1#2026-04-17#reasoning</guid>
<pubDate>Fri, 17 Apr 2026 11:39:21 +0800</pubDate>
<description>NL2SQL systems aim to address the growing need for natural language interaction with data. However, real-world information rarely maps to a single SQL query because (1) users express queries iteratively (2) questions often span multiple data sources beyond the closed-world assumption of a single database, and (3) queries frequently rely on commonsense or external knowledge. Consequently, satisfying realistic data needs require integrating heterogeneous sources, modalities, and contextual data.…</description>
</item>
<item>
<title>RadAgent: A tool-using AI agent for stepwise interpretation of chest computed tomography</title>
<link>../papers/arxiv-d12df90e00da.html</link>
<guid>https://arxiv.org/abs/2604.15231v1#2026-04-17#reasoning</guid>
<pubDate>Fri, 17 Apr 2026 11:39:21 +0800</pubDate>
<description>Vision-language models (VLM) have markedly advanced AI-driven interpretation and reporting of complex medical imaging, such as computed tomography (CT). Yet, existing methods largely relegate clinicians to passive observers of final outputs, offering no interpretable reasoning trace for them to inspect, validate, or refine. To address this, we introduce RadAgent, a tool-using AI agent that generates CT reports through a stepwise and interpretable process. Each resulting report is accompanied by…</description>
</item>
<item>
<title>RaTA-Tool: Retrieval-based Tool Selection with Multimodal Large Language Models</title>
<link>../papers/arxiv-4a4068542625.html</link>
<guid>https://arxiv.org/abs/2604.14951v1#2026-04-17#reasoning</guid>
<pubDate>Fri, 17 Apr 2026 11:39:21 +0800</pubDate>
<description>Tool learning with foundation models aims to endow AI systems with the ability to invoke external resources -- such as APIs, computational utilities, and specialized models -- to solve complex tasks beyond the reach of standalone language generation. While recent advances in Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) have expanded their reasoning and perception capabilities, existing tool-use methods are predominantly limited to text-only inputs and closed-world s…</description>
</item>
<item>
<title>From Image to Pixels: towards Fine-Grained Medical Vision-Language Models.</title>
<link>../papers/doi-71303bb82f13.html</link>
<guid>https://pubmed.ncbi.nlm.nih.gov/41989909/#2026-04-17#reasoning</guid>
<pubDate>Fri, 17 Apr 2026 11:39:21 +0800</pubDate>
<description>Multimodal large language models (MLLMs) offer immense potential for biomedical AI, yet current applications remain limited to coarse-grained image understanding and basic textual queries-falling short of the fine-grained reasoning required in clinical contexts. In this work, we present a comprehensive solution spanning data, model, and training innovations to advance pixel-level multimodal intelligence in biomedicine. First, we construct MeCoVQA, a new visual-language benchmark that spans eigh…</description>
</item>
<item>
<title>GeoAgentBench: A Dynamic Execution Benchmark for Tool-Augmented Agents in Spatial Analysis</title>
<link>../papers/arxiv-283874153373.html</link>
<guid>https://arxiv.org/abs/2604.13888v1#2026-04-16#reasoning</guid>
<pubDate>Thu, 16 Apr 2026 11:43:00 +0800</pubDate>
<description>The integration of Large Language Models (LLMs) into Geographic Information Systems (GIS) marks a paradigm shift toward autonomous spatial analysis. However, evaluating these LLM-based agents remains challenging due to the complex, multi-step nature of geospatial workflows. Existing benchmarks primarily rely on static text or code matching, neglecting dynamic runtime feedback and the multimodal nature of spatial outputs. To address this gap, we introduce GeoAgentBench (GABench), a dynamic and i…</description>
</item>
<item>
<title>Character Beyond Speech: Leveraging Role-Playing Evaluation in Audio Large Language Models via Reinforcement Learning</title>
<link>../papers/arxiv-82411c54ef00.html</link>
<guid>https://arxiv.org/abs/2604.13804v1#2026-04-16#reasoning</guid>
<pubDate>Thu, 16 Apr 2026 11:43:00 +0800</pubDate>
<description>The rapid evolution of multimodal large models has revolutionized the simulation of diverse characters in speech dialogue systems, enabling a novel interactive paradigm. Character attributes are manifested not only in textual responses but also through vocal features, as speech conveys rich paralinguistic information that is challenging to quantify. This poses significant difficulties in evaluating the character alignment of role-playing agents. To address these challenges, we present RoleJudge…</description>
</item>
<item>
<title>LongCoT: Benchmarking Long-Horizon Chain-of-Thought Reasoning</title>
<link>../papers/arxiv-c517a8dff3b8.html</link>
<guid>https://arxiv.org/abs/2604.14140v1#2026-04-16#reasoning</guid>
<pubDate>Thu, 16 Apr 2026 11:43:00 +0800</pubDate>
<description>As language models are increasingly deployed for complex autonomous tasks, their ability to reason accurately over longer horizons becomes critical. An essential component of this ability is planning and managing a long, complex chain-of-thought (CoT). We introduce LongCoT, a scalable benchmark of 2,500 expert-designed problems spanning chemistry, mathematics, computer science, chess, and logic to isolate and directly measure the long-horizon CoT reasoning capabilities of frontier models. Probl…</description>
</item>
<item>
<title>Correct Prediction, Wrong Steps? Consensus Reasoning Knowledge Graph for Robust Chain-of-Thought Synthesis</title>
<link>../papers/arxiv-da2e5b9f5c3e.html</link>
<guid>https://arxiv.org/abs/2604.14121v1#2026-04-16#reasoning</guid>
<pubDate>Thu, 16 Apr 2026 11:43:00 +0800</pubDate>
<description>LLM reasoning traces suffer from complex flaws -- *Step Internal Flaws* (logical errors, hallucinations, etc.) and *Step-wise Flaws* (overthinking, underthinking), which vary by sample. A natural approach would be to provide ground-truth labels to guide LLMs&#x27; reasoning. Contrary to intuition, we show that this yields no improvement in reasoning ability. We then propose CRAFT, a unified framework that mitigates both types of Step flaws, which builds a Reasoning Knowledge Graph (RKG) based on the…</description>
</item>
<item>
<title>The cognitive companion: a lightweight parallel monitoring architecture for detecting and recovering from reasoning degradation in LLM agents</title>
<link>../papers/arxiv-7da57578b1cc.html</link>
<guid>https://arxiv.org/abs/2604.13759v1#2026-04-16#reasoning</guid>
<pubDate>Thu, 16 Apr 2026 11:43:00 +0800</pubDate>
<description>Large language model (LLM) agents on multi-step tasks suffer reasoning degradation (looping, drift, stuck states) at rates up to 30% on hard tasks. Current solutions include hard step limits (abrupt) or LLM-as-judge monitoring (10-15% overhead per step). This paper introduces the Cognitive Companion, a parallel monitoring architecture with two implementations: an LLM-based Companion and a novel zero-overhead Probe-based Companion. We report a three-batch feasibility study centered on Gemma 4 E4…</description>
</item>
<item>
<title>MUSE: Multi-Domain Chinese User Simulation via Self-Evolving Profiles and Rubric-Guided Alignment</title>
<link>../papers/arxiv-7021296a66d5.html</link>
<guid>https://arxiv.org/abs/2604.13828v1#2026-04-16#reasoning</guid>
<pubDate>Thu, 16 Apr 2026 11:43:00 +0800</pubDate>
<description>User simulators are essential for the scalable training and evaluation of interactive AI systems. However, existing approaches often rely on shallow user profiling, struggle to maintain persona consistency over long interactions, and are largely limited to English or single-domain settings. We present MUSE, a multi-domain Chinese user simulation framework designed to generate human-like, controllable, and behaviorally consistent responses. First, we propose Iterative Profile Self-Evolution (IPS…</description>
</item>
<item>
<title>MAny: Merge Anything for Multimodal Continual Instruction Tuning</title>
<link>../papers/arxiv-b488936a3be9.html</link>
<guid>https://arxiv.org/abs/2604.14016v1#2026-04-16#reasoning</guid>
<pubDate>Thu, 16 Apr 2026 11:43:00 +0800</pubDate>
<description>Multimodal Continual Instruction Tuning (MCIT) is essential for sequential task adaptation of Multimodal Large Language Models (MLLMs) but is severely restricted by catastrophic forgetting. While existing literature focuses on the reasoning language backbone, in this work, we expose a critical yet neglected dual-forgetting phenomenon across both perception drift in Cross-modal Projection Space and reasoning collapse in Low-rank Parameter Space. To resolve this, we present \textbf{MAny} (\textbf…</description>
</item>
<item>
<title>MedRCube: A Multidimensional Framework for Fine-Grained and In-Depth Evaluation of MLLMs in Medical Imaging</title>
<link>../papers/arxiv-309351a1c9e5.html</link>
<guid>https://arxiv.org/abs/2604.13756v1#2026-04-16#reasoning</guid>
<pubDate>Thu, 16 Apr 2026 11:43:00 +0800</pubDate>
<description>The potential of Multimodal Large Language Models (MLLMs) in the domain of medical imaging raises the demand for systematic and rigorous evaluation frameworks aligned with real-world medical imaging practice. Existing practices that report single or coarse-grained metrics lack the granularity required for specialized clinical support and fail to assess the reliability of reasoning mechanisms. To address this, we propose a paradigm shift toward multidimensional, fine-grained and in-d…</description>
</item>
<item>
<title>Doc-V*: Coarse-to-Fine Interactive Visual Reasoning for Multi-Page Document VQA</title>
<link>../papers/arxiv-08a10c45b30f.html</link>
<guid>https://arxiv.org/abs/2604.13731v1#2026-04-16#reasoning</guid>
<pubDate>Thu, 16 Apr 2026 11:43:00 +0800</pubDate>
<description>Multi-page Document Visual Question Answering requires reasoning over semantics, layouts, and visual elements in long, visually dense documents. Existing OCR-free methods face a trade-off between capacity and precision: end-to-end models scale poorly with document length, while visual retrieval-based pipelines are brittle and passive. We propose Doc-$V^*$, an \textbf{OCR-free agentic} framework that casts multi-page DocVQA as sequential evidence aggregation. Doc-$V^*$ begins with a thumbnail ov…</description>
</item>
<item>
<title>Reward Design for Physical Reasoning in Vision-Language Models</title>
<link>../papers/arxiv-3bc807e502bd.html</link>
<guid>https://arxiv.org/abs/2604.13993v1#2026-04-16#reasoning</guid>
<pubDate>Thu, 16 Apr 2026 11:43:00 +0800</pubDate>
<description>Physical reasoning over visual inputs demands tight integration of visual perception, domain knowledge, and multi-step symbolic inference. Yet even state-of-the-art Vision Language Models (VLMs) fall far short of human performance on physics benchmarks. While post-training algorithms such as Supervised Fine-Tuning (SFT) and Group Relative Policy Optimization (GRPO) have demonstrated strong reasoning gains in language models, how reward design shapes VLM physical reasoning behavior remains poorl…</description>
</item>
<item>
<title>ToolOmni: Enabling Open-World Tool Use via Agentic Learning with Proactive Retrieval and Grounded Execution</title>
<link>../papers/arxiv-228d06064606.html</link>
<guid>https://arxiv.org/abs/2604.13787v1#2026-04-16#reasoning</guid>
<pubDate>Thu, 16 Apr 2026 11:43:00 +0800</pubDate>
<description>Large Language Models (LLMs) enhance their problem-solving capability by utilizing external tools. However, in open-world scenarios with massive and evolving tool repositories, existing methods that rely on static embedding retrieval or on parameter memorization of tools struggle, respectively, to align user intent with tool semantics or to generalize to unseen tools, leading to suboptimal accuracy in open-world tool retrieval and execution. To address these issues, we present ToolOmni, a unified agentic frame…</description>
</item>
<item>
<title>Decoding the Delta: Unifying Remote Sensing Change Detection and Understanding with Multimodal Large Language Models</title>
<link>../papers/arxiv-edb7485d7898.html</link>
<guid>https://arxiv.org/abs/2604.14044v1#2026-04-16#reasoning</guid>
<pubDate>Thu, 16 Apr 2026 11:43:00 +0800</pubDate>
<description>While Multimodal Large Language Models (MLLMs) excel in general vision-language tasks, their application to remote sensing change understanding is hindered by a fundamental &quot;temporal blindness&quot;. Existing architectures lack intrinsic mechanisms for multi-temporal contrastive reasoning and struggle with precise spatial grounding. To address this, we first introduce Delta-QA, a comprehensive benchmark comprising 180k visual question-answering samples. Delta-QA unifies pixel-level segmentation and…</description>
</item>
<item>
<title>Free Lunch for Unified Multimodal Models: Enhancing Generation via Reflective Rectification with Inherent Understanding</title>
<link>../papers/arxiv-60b6a5d36d13.html</link>
<guid>https://arxiv.org/abs/2604.13540v1#2026-04-16#reasoning</guid>
<pubDate>Thu, 16 Apr 2026 11:43:00 +0800</pubDate>
<description>Unified Multimodal Models (UMMs) aim to integrate visual understanding and generation within a single structure. However, these models exhibit a notable capability mismatch, where their understanding capability significantly outperforms their generation capability. This mismatch indicates that the model&#x27;s rich internal knowledge, while effective for understanding tasks, remains underactivated during generation. To address this, we draw inspiration from the human ``Thinking-While-Drawing&#x27;&#x27; paradigm, where…</description>
</item>
</channel>
</rss>
