Keyword Tracking

Keyword tracking: reasoning

This page tracks the keywords in your configuration over the long term and archives the matching papers by date.

Recent trend

Most recent hit, from the LLM feed: Four-Axis Decision Alignment for Long-Horizon Enterprise AI Agents

[Trend chart: date axis from 2026-04-09 to 2026-04-22]

Hit details

Review, by date, the titles of papers that matched this keyword, with their source feed retained.

2026-04-22

2026-04-22 11:37:03 (Asia/Shanghai)

LLM

Four-Axis Decision Alignment for Long-Horizon Enterprise AI Agents

Long-horizon enterprise agents make high-stakes decisions (loan underwriting, claims adjudication, clinical review, prior authorization) under lossy memory, multi-step reasoning,…

LLM

Taming Actor-Observer Asymmetry in Agents via Dialectical Alignment

Large Language Model agents have rapidly evolved from static text generators into dynamic systems capable of executing complex autonomous workflows. To enhance reliability, multi-…

LLM

Discovering a Shared Logical Subspace: Steering LLM Logical Reasoning via Alignment of Natural-Language and Symbolic Views

Large Language Models (LLMs) still struggle with multi-step logical reasoning. Existing approaches either purely refine the reasoning chain in natural language form or attach a sy…

LLM

Revac: A Social Deduction Reasoning Agent

Social deduction games such as Mafia present a unique AI challenge: players must reason under uncertainty, interpret incomplete and intentionally misleading information, evaluate…

LLM

A-MAR: Agent-based Multimodal Art Retrieval for Fine-Grained Artwork Understanding

Understanding artworks requires multi-step reasoning over visual content and cultural, historical, and stylistic context. While recent multimodal large language models show promis…

LLM

Multi-modal Reasoning with LLMs for Visual Semantic Arithmetic

Reinforcement learning (RL) as post-training is crucial for enhancing the reasoning ability of large language models (LLMs) in coding and math. However, their capacity for visual…

LLM

A Self-Evolving Framework for Efficient Terminal Agents via Observational Context Compression

As model capabilities advance, research has increasingly shifted toward long-horizon, multi-turn terminal-centric agentic tasks, where raw environment feedback is often preserved…

LLM

Time Series Augmented Generation for Financial Applications

Evaluating the reasoning capabilities of Large Language Models (LLMs) for complex, quantitative financial tasks is a critical and unsolved challenge. Standard benchmarks often fai…

LLM

Lost in Translation: Do LVLM Judges Generalize Across Languages?

Automatic evaluators such as reward models play a central role in the alignment and evaluation of large vision-language models (LVLMs). Despite their growing importance, these eva…

Vision

RF-HiT: Rectified Flow Hierarchical Transformer for General Medical Image Segmentation

Accurate medical image segmentation requires both long-range contextual reasoning and precise boundary delineation, a task where existing transformer- and diffusion-based paradigm…

Vision

How Far Are Video Models from True Multimodal Reasoning?

Despite remarkable progress toward general-purpose video models, a critical question remains unanswered: how far are these models from achieving true multimodal reasoning? Existin…

Vision

EgoMotion: Hierarchical Reasoning and Diffusion for Egocentric Vision-Language Motion Generation

Faithfully modeling human behavior in dynamic environments is a foundational challenge for embodied intelligence. While conditional motion synthesis has achieved significant advan…

2026-04-21

2026-04-21 11:40:46 (Asia/Shanghai)

LLM

MathNet: a Global Multimodal Benchmark for Mathematical Reasoning and Retrieval

Mathematical problem solving remains a challenging test of reasoning for large language and multimodal models, yet existing benchmarks are limited in size, language coverage, and…

LLM

Benchmarking System Dynamics AI Assistants: Cloud Versus Local LLMs on CLD Extraction and Discussion

We present a systematic evaluation of large language model families -- spanning both proprietary cloud APIs and locally-hosted open-source models -- on two purpose-built benchmark…

LLM

MASS-RAG: Multi-Agent Synthesis Retrieval-Augmented Generation

Large language models (LLMs) are widely used in retrieval-augmented generation (RAG) to incorporate external knowledge at inference time. However, when retrieved contexts are nois…

LLM

OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation

Chain-of-Thought (CoT) reasoning has become a powerful driver of trajectory prediction in VLA-based autonomous driving, yet its autoregressive nature imposes a latency cost that i…

LLM

StepPO: Step-Aligned Policy Optimization for Agentic Reinforcement Learning

General agents have given rise to phenomenal applications such as OpenClaw and Claude Code. As these agent systems (a.k.a. Harnesses) strive for bolder goals, they demand increasi…

LLM

HiGMem: A Hierarchical and LLM-Guided Memory System for Long-Term Conversational Agents

Long-term conversational large language model (LLM) agents require memory systems that can recover relevant evidence from historical interactions without overwhelming the answer s…

LLM

Too Correct to Learn: Reinforcement Learning on Saturated Reasoning Data

Reinforcement Learning (RL) enhances LLM reasoning, yet a paradox emerges as models scale: strong base models saturate standard benchmarks (e.g., MATH), yielding correct but homog…

LLM

Semantic Step Prediction: Multi-Step Latent Forecasting in LLM Reasoning Trajectories via Step Sampling

Semantic Tube Prediction (STP) leverages representation geometric to regularize LLM hidden-state trajectories toward locally linear geodesics during fine-tuning, thereby greatly i…

LLM

Training and Agentic Inference Strategies for LLM-based Manim Animation Generation

Generating programmatic animation using libraries such as Manim presents unique challenges for Large Language Models (LLMs), requiring spatial reasoning, temporal sequencing, and…

Vision

AnchorSeg: Language Grounded Query Banks for Reasoning Segmentation

Reasoning segmentation requires models to ground complex, implicit textual queries into precise pixel-level masks. Existing approaches rely on a single segmentation token $\texttt…

Vision

Revisiting Change VQA in Remote Sensing with Structured and Native Multimodal Qwen Models

Change visual question answering (Change VQA) addresses the problem of answering natural-language questions about semantic changes between bi-temporal remote sensing (RS) images.…

PubMed AI

Transforming oncology clinical trial matching through neuro-symbolic, multi-agent AI and an oncology-specific knowledge graph: a prospective evaluation in 3804 patients.

BACKGROUND: Clinical trial enrollment in oncology remains critically low, with fewer than 5% of eligible adults participating, in large part due to the complexity and labor intens…

2026-04-17

2026-04-17 11:39:21 (Asia/Shanghai)

LLM

CoopEval: Benchmarking Cooperation-Sustaining Mechanisms and LLM Agents in Social Dilemmas

It is increasingly important that LLM agents interact effectively and safely with other goal-pursuing agents, yet, recent works report the opposite trend: LLMs with stronger reaso…

LLM

IE as Cache: Information Extraction Enhanced Agentic Reasoning

Information Extraction aims to distill structured, decision-relevant information from unstructured text, serving as a foundation for downstream understanding and reasoning. Howeve…

LLM

From Reactive to Proactive: Assessing the Proactivity of Voice Agents via ProVoice-Bench

Recent advancements in LLM agents are gradually shifting from reactive, text-based paradigms toward proactive, multimodal interaction. However, existing benchmarks primarily focus…

LLM

Context Over Content: Exposing Evaluation Faking in Automated Judges

The $\textit{LLM-as-a-judge}$ paradigm has become the operational backbone of automated AI evaluation pipelines, yet rests on an unverified assumption: that judges evaluate text s…

LLM

AI-Assisted Requirements Engineering: An Empirical Evaluation Relative to Expert Judgment

Artificial Intelligence is increasingly introduced into systems engineering activities, particularly within requirements engineering, where quality assessment and validation remai…

LLM

MADE: A Living Benchmark for Multi-Label Text Classification with Uncertainty Quantification of Medical Device Adverse Events

Machine learning in high-stakes domains such as healthcare requires not only strong predictive performance but also reliable uncertainty quantification (UQ) to support human overs…

LLM

ADAPT: Benchmarking Commonsense Planning under Unspecified Affordance Constraints

Intelligent embodied agents should not simply follow instructions, as real-world environments often involve unexpected conditions and exceptions. However, existing methods usually…

LLM

Meituan Merchant Business Diagnosis via Policy-Guided Dual-Process User Simulation

Simulating group-level user behavior enables scalable counterfactual evaluation of merchant strategies without costly online experiments. However, building a trustworthy simulator…

LLM

From Tokens to Steps: Verification-Aware Speculative Decoding for Efficient Multi-Step Reasoning

Speculative decoding (SD) accelerates large language model inference by allowing a lightweight draft model to propose outputs that a stronger target model verifies. However, its t…

LLM

Blue Data Intelligence Layer: Streaming Data and Agents for Multi-source Multi-modal Data-Centric Applications

NL2SQL systems aim to address the growing need for natural language interaction with data. However, real-world information rarely maps to a single SQL query because (1) users expr…

LLM

RadAgent: A tool-using AI agent for stepwise interpretation of chest computed tomography

Vision-language models (VLM) have markedly advanced AI-driven interpretation and reporting of complex medical imaging, such as computed tomography (CT). Yet, existing methods larg…

Vision

RaTA-Tool: Retrieval-based Tool Selection with Multimodal Large Language Models

Tool learning with foundation models aims to endow AI systems with the ability to invoke external resources -- such as APIs, computational utilities, and specialized models -- to…

PubMed AI

From Image to Pixels: towards Fine-Grained Medical Vision-Language Models.

Multimodal large language models (MLLMs) offer immense potential for biomedical AI, yet current applications remain limited to coarse-grained image understanding and basic textual…

2026-04-16

2026-04-16 11:43:00 (Asia/Shanghai)

LLM

GeoAgentBench: A Dynamic Execution Benchmark for Tool-Augmented Agents in Spatial Analysis

The integration of Large Language Models (LLMs) into Geographic Information Systems (GIS) marks a paradigm shift toward autonomous spatial analysis. However, evaluating these LLM-…

LLM

Character Beyond Speech: Leveraging Role-Playing Evaluation in Audio Large Language Models via Reinforcement Learning

The rapid evolution of multimodal large models has revolutionized the simulation of diverse characters in speech dialogue systems, enabling a novel interactive paradigm. Character…

LLM

LongCoT: Benchmarking Long-Horizon Chain-of-Thought Reasoning

As language models are increasingly deployed for complex autonomous tasks, their ability to reason accurately over longer horizons becomes critical. An essential component of this…

LLM

Correct Prediction, Wrong Steps? Consensus Reasoning Knowledge Graph for Robust Chain-of-Thought Synthesis

LLM reasoning traces suffer from complex flaws -- *Step Internal Flaws* (logical errors, hallucinations, etc.) and *Step-wise Flaws* (overthinking, underthinking), which vary by s…

LLM

The cognitive companion: a lightweight parallel monitoring architecture for detecting and recovering from reasoning degradation in LLM agents

Large language model (LLM) agents on multi-step tasks suffer reasoning degradation, looping, drift, stuck states, at rates up to 30% on hard tasks. Current solutions include hard…

LLM

MUSE: Multi-Domain Chinese User Simulation via Self-Evolving Profiles and Rubric-Guided Alignment

User simulators are essential for the scalable training and evaluation of interactive AI systems. However, existing approaches often rely on shallow user profiling, struggle to ma…

LLM

MAny: Merge Anything for Multimodal Continual Instruction Tuning

Multimodal Continual Instruction Tuning (MCIT) is essential for sequential task adaptation of Multimodal Large Language Models (MLLMs) but is severely restricted by catastrophic f…

LLM

MedRCube: A Multidimensional Framework for Fine-Grained and In-Depth Evaluation of MLLMs in Medical Imaging

The potential of Multimodal Large Language Models (MLLMs) in domain of medical imaging raise the demands of systematic and rigorous evaluation frameworks that are aligned with the…

LLM

Doc-V*:Coarse-to-Fine Interactive Visual Reasoning for Multi-Page Document VQA

Multi-page Document Visual Question Answering requires reasoning over semantics, layouts, and visual elements in long, visually dense documents. Existing OCR-free methods face a t…

LLM

Reward Design for Physical Reasoning in Vision-Language Models

Physical reasoning over visual inputs demands tight integration of visual perception, domain knowledge, and multi-step symbolic inference. Yet even state-of-the-art Vision Languag…

LLM

ToolOmni: Enabling Open-World Tool Use via Agentic learning with Proactive Retrieval and Grounded Execution

Large Language Models (LLMs) enhance their problem-solving capability by utilizing external tools. However, in open-world scenarios with massive and evolving tool repositories, ex…

Vision

Decoding the Delta: Unifying Remote Sensing Change Detection and Understanding with Multimodal Large Language Models

While Multimodal Large Language Models (MLLMs) excel in general vision-language tasks, their application to remote sensing change understanding is hindered by a fundamental "tempo…

Vision

Free Lunch for Unified Multimodal Models: Enhancing Generation via Reflective Rectification with Inherent Understanding

Unified Multimodal Models (UMMs) aim to integrate visual understanding and generation within a single structure. However, these models exhibit a notable capability mismatch, where…

Vision

POINTS-Seeker: Towards Training a Multimodal Agentic Search Model from Scratch

While Large Multimodal Models (LMMs) demonstrate impressive visual perception, they remain epistemically constrained by their static parametric knowledge. To transcend these bound…

PubMed AI

Augmenting Large Language Model With Prompt Engineering and Supervised Fine-Tuning in Non-Small Cell Lung Cancer Tumor-Node-Metastasis Staging: Framework Development and Validation.

BACKGROUND: Accurate tumor node metastasis (TNM) staging is fundamental for treatment planning and prognosis in non-small cell lung cancer (NSCLC). However, its complexity poses s…

PubMed AI

PKFAR: psychiatry knowledge-fused augmented reasoning with large language models.

PURPOSE: Psychiatric diagnosis faces significant challenges due to subjective symptom reporting and complex diagnostic criteria. While Large Language Models (LLMs) offer potential…

PubMed AI

A Multi-AI Agent Framework for Interactive Neurosurgical Education and Evaluation: From Vignettes to Virtual Conversations.

BACKGROUND AND OBJECTIVES: Traditional medical board examinations present clinical information in static vignettes with multiple-choices (MC), fundamentally different from how phy…

2026-04-15

2026-04-15 11:35:50 (Asia/Shanghai)

LLM

Parallax: Why AI Agents That Think Must Never Act

Autonomous AI agents are rapidly transitioning from experimental tools to operational infrastructure, with projections that 80% of enterprise applications will embed AI copilots b…

LLM

Drawing on Memory: Dual-Trace Encoding Improves Cross-Session Recall in LLM Agents

LLM agents with persistent memory store information as flat factual records, providing little context for temporal reasoning, change tracking, or cross-session aggregation. Inspir…

LLM

Round-Trip Translation Reveals What Frontier Multilingual Benchmarks Miss

Multilingual benchmarks guide the development of frontier models. Yet multilingual evaluations reported by frontier models are structured similar to popular reasoning and knowledg…

LLM

Towards Long-horizon Agentic Multimodal Search

Multimodal deep search agents have shown great potential in solving complex tasks by iteratively collecting textual and visual evidence. However, managing the heterogeneous inform…

LLM

ARGOS: Who, Where, and When in Agentic Multi-Camera Person Search

We introduce ARGOS, the first benchmark and framework that reformulates multi-camera person search as an interactive reasoning problem requiring an agent to plan, question, and el…

LLM

Transferable Expertise for Autonomous Agents via Real-World Case-Based Learning

LLM-based autonomous agents perform well on general reasoning tasks but still struggle to reliably use task structure, key constraints, and prior experience in complex real-world…

LLM

Do VLMs Truly "Read" Candlesticks? A Multi-Scale Benchmark for Visual Stock Price Forecasting

Vision-language models(VLMs) are increasingly applied to visual stock price forecasting, yet existing benchmarks inadequately evaluate their understanding of stock price in candle…

LLM

Don't Show Pixels, Show Cues: Unlocking Visual Tool Reasoning in Language Models via Perception Programs

Multimodal language models (MLLMs) are increasingly paired with vision tools (e.g., depth, flow, correspondence) to enhance visual reasoning. However, despite access to these tool…

LLM

LIFE -- an energy efficient advanced continual learning agentic AI framework for frontier systems

The rapid advancement of AI has changed the character of HPC usage such as dimensioning, provisioning, and execution. Not only has energy demand been amplified, but existing rudim…

LLM

Toward Autonomous Long-Horizon Engineering for ML Research

Autonomous AI research has advanced rapidly, but long-horizon ML research engineering remains difficult: agents must sustain coherent progress across task comprehension, environme…

Vision

All in One: A Unified Synthetic Data Pipeline for Multimodal Video Understanding

Training multimodal large language models (MLLMs) for video understanding requires large-scale annotated data spanning diverse tasks such as object counting, question answering, a…

Vision

Challenging Vision-Language Models with Physically Deployable Multimodal Semantic Lighting Attacks

Vision-Language Models (VLMs) have shown remarkable performance, yet their security remains insufficiently understood. Existing adversarial studies focus almost exclusively on the…

PubMed AI

Comparison of AI-based Chatbot Performance in Analyzing Clinical Scenarios versus Medical Residents: A Novel Approach in Chest Diseases Education.

OBJECTIVE: Rapid advancements in artificial intelligence (AI) technologies offer new opportunities in medical education. The aim of this study is to compare the performance of lar…

2026-04-14

2026-04-14 11:37:06 (Asia/Shanghai)

LLM

UniToolCall: Unifying Tool-Use Representation, Data, and Evaluation for LLM Agents

Tool-use capability is a fundamental component of LLM agents, enabling them to interact with external systems through structured function calls. However, existing research exhibit…

LLM

General365: Benchmarking General Reasoning in Large Language Models Across Diverse and Challenging Tasks

Contemporary large language models (LLMs) have demonstrated remarkable reasoning capabilities, particularly in specialized domains like mathematics and physics. However, their abi…

LLM

Collaborative Multi-Agent Scripts Generation for Enhancing Imperfect-Information Reasoning in Murder Mystery Games

Vision-language models (VLMs) have shown impressive capabilities in perceptual tasks, yet they degrade in complex multi-hop reasoning under multiplayer game settings with imperfec…

LLM

FM-Agent: Scaling Formal Methods to Large Systems via LLM-Based Hoare-Style Reasoning

LLM-assisted software development has become increasingly prevalent, and can generate large-scale systems, such as compilers. It becomes crucial to strengthen the correctness of t…

LLM

RPA-Check: A Multi-Stage Automated Framework for Evaluating Dynamic LLM-based Role-Playing Agents

The rapid adoption of Large Language Models (LLMs) in interactive systems has enabled the creation of dynamic, open-ended Role-Playing Agents (RPAs). However, evaluating these age…

LLM

SWE-AGILE: A Software Agent Framework for Efficiently Managing Dynamic Reasoning Context

Prior representative ReAct-style approaches in autonomous Software Engineering (SWE) typically lack the explicit System-2 reasoning required for deep analysis and handling complex…

LLM

Agentic Aggregation for Parallel Scaling of Long-Horizon Agentic Tasks

We study parallel test-time scaling for long-horizon agentic tasks such as agentic search and deep research, where multiple rollouts are generated in parallel and aggregated into…

LLM

Time is Not a Label: Continuous Phase Rotation for Temporal Knowledge Graphs and Agentic Memory

Structured memory representations such as knowledge graphs are central to autonomous agents and other long-lived systems. However, most existing approaches model time as discrete…

LLM

METER: Evaluating Multi-Level Contextual Causal Reasoning in Large Language Models

Contextual causal reasoning is a critical yet challenging capability for Large Language Models (LLMs). Existing benchmarks, however, often evaluate this skill in fragmented settin…

Vision

LMMs Meet Object-Centric Vision: Understanding, Segmentation, Editing and Generation

Large Multimodal Models (LMMs) have achieved remarkable progress in general-purpose vision--language understanding, yet they remain limited in tasks requiring precise object-level…

PubMed AI

Comparative Performance of Gemini 3 Pro and GPT-5 Family Models on Ophthalmology Board-Style Questions.

OBJECTIVE: To compare the performance of state-of-the-art Gemini and GPT models on ophthalmology board-style questions and examine variation by subspecialty, cognitive complexity,…

2026-04-09

2026-04-09 14:51:56 (Asia/Shanghai)

PubMed AI

ClinicRealm: Re-evaluating large language models with conventional machine learning for non-generative clinical prediction tasks.

Large Language Models (LLMs) are increasingly deployed in medicine. However, their utility for non-generative clinical prediction is under-evaluated, and they are often assumed to…

2026-04-08

2026-04-08 17:10:24 (Asia/Shanghai)

LLM

MMEmb-R1: Reasoning-Enhanced Multimodal Embedding with Pair-Aware Selection and Adaptive Control

MLLMs have been successfully applied to multimodal embedding tasks, yet their generative reasoning capabilities remain underutilized. Directly incorporating chain-of-thought reaso…

LLM

Generating Synthetic Doctor-Patient Conversations for Long-form Audio Summarization

Long-context audio reasoning is underserved in both training data and evaluation. Existing benchmarks target short-context tasks, and the open-ended generation tasks most relevant…

LLM

ACE-Bench: Agent Configurable Evaluation with Scalable Horizons and Controllable Difficulty under Lightweight Environments

Existing Agent benchmarks suffer from two critical limitations: high environment interaction overhead (up to 41\% of total evaluation time) and imbalanced task horizon and difficu…

Vision

MMEmb-R1: Reasoning-Enhanced Multimodal Embedding with Pair-Aware Selection and Adaptive Control

MLLMs have been successfully applied to multimodal embedding tasks, yet their generative reasoning capabilities remain underutilized. Directly incorporating chain-of-thought reaso…