LLM Feed Archive

LLM Feed Archive llm.html Long-term RSS subscription for LLM, collecting recently matched papers and archives. zh-CN Wed, 22 Apr 2026 03:37:20 +0000 Four-Axis Decision Alignment for Long-Horizon Enterprise AI Agents ../papers/arxiv-d363006cb185.html https://arxiv.org/abs/2604.19457v1#2026-04-22#llm Wed, 22 Apr 2026 11:37:03 +0800 Long-horizon enterprise agents make high-stakes decisions (loan underwriting, claims adjudication, clinical review, prior authorization) under lossy memory, multi-step reasoning, and binding regulatory constraints. Current evaluation reports a single task-success scalar that conflates distinct failure modes and hides whether an agent is aligned with the standards its deployment environment requires. We propose that long-horizon decision behavior decomposes into four orthogonal alignment axes, e… Cyber Defense Benchmark: Agentic Threat Hunting Evaluation for LLMs in SecOps ../papers/arxiv-66f4fae6bbd8.html https://arxiv.org/abs/2604.19533v1#2026-04-22#llm Wed, 22 Apr 2026 11:37:03 +0800 We introduce the Cyber Defense Benchmark, a benchmark for measuring how well large language model (LLM) agents perform the core SOC analyst task of threat hunting: given a database of raw Windows event logs with no guided questions or hints, identify the exact timestamps of malicious events. The benchmark wraps 106 real attack procedures from the OTRF Security-Datasets corpus - spanning 86 MITRE ATT&CK sub-techniques across 12 tactics - into a Gymnasium reinforcement-learning environment. Each… Taming Actor-Observer Asymmetry in Agents via Dialectical Alignment ../papers/arxiv-3ca660d54bb4.html https://arxiv.org/abs/2604.19548v1#2026-04-22#llm Wed, 22 Apr 2026 11:37:03 +0800 Large Language Model agents have rapidly evolved from static text generators into dynamic systems capable of executing complex autonomous workflows. To enhance reliability, multi-agent frameworks assigning specialized roles are increasingly adopted to enable self-reflection and mutual auditing. 
While such role-playing effectively leverages domain expert knowledge, we find it simultaneously induces a human-like cognitive bias known as Actor-Observer Asymmetry (AOA). Specifically, an agent acting… Discovering a Shared Logical Subspace: Steering LLM Logical Reasoning via Alignment of Natural-Language and Symbolic Views ../papers/arxiv-0d90d26515bd.html https://arxiv.org/abs/2604.19716v1#2026-04-22#llm Wed, 22 Apr 2026 11:37:03 +0800 Large Language Models (LLMs) still struggle with multi-step logical reasoning. Existing approaches either purely refine the reasoning chain in natural language form or attach a symbolic solver as an external module. In this work, we instead ask whether LLMs contain a shared internal logical subspace that simultaneously aligns natural-language and symbolic-language views of the reasoning process. Our hypothesis is that this logical subspace captures logical reasoning capabilities in LLMs that ar… Revac: A Social Deduction Reasoning Agent ../papers/arxiv-49c0fe8adf77.html https://arxiv.org/abs/2604.19523v1#2026-04-22#llm Wed, 22 Apr 2026 11:37:03 +0800 Social deduction games such as Mafia present a unique AI challenge: players must reason under uncertainty, interpret incomplete and intentionally misleading information, evaluate human-like communication, and make strategic elimination decisions. Unlike deterministic board games, success in Mafia depends not on perfect information or brute-force search, but on inference, memory, and adaptability in the presence of deception. 
This work presents the design and evaluation of Revac-8, an AI agent d… Beyond Rating: A Comprehensive Evaluation and Benchmark for AI Reviews ../papers/arxiv-dcac53916c57.html https://arxiv.org/abs/2604.19502v1#2026-04-22#llm Wed, 22 Apr 2026 11:37:03 +0800 The rapid adoption of Large Language Models (LLMs) has spurred interest in automated peer review; however, progress is currently stifled by benchmarks that treat reviewing primarily as a rating prediction task. We argue that the utility of a review lies in its textual justification (its arguments, questions, and critique) rather than a scalar score. To address this, we introduce Beyond Rating, a holistic evaluation framework that assesses AI reviewers across five dimensions: Content Faithfulnes… A-MAR: Agent-based Multimodal Art Retrieval for Fine-Grained Artwork Understanding ../papers/arxiv-5fe8f705aa06.html https://arxiv.org/abs/2604.19689v1#2026-04-22#llm Wed, 22 Apr 2026 11:37:03 +0800 Understanding artworks requires multi-step reasoning over visual content and cultural, historical, and stylistic context. While recent multimodal large language models show promise in artwork explanation, they rely on implicit reasoning and internalized knowledge, limiting interpretability and explicit evidence grounding. We propose A-MAR, an Agent-based Multimodal Art Retrieval framework that explicitly conditions retrieval on structured reasoning plans. Given an artwork and a user query, A-… Multi-modal Reasoning with LLMs for Visual Semantic Arithmetic ../papers/arxiv-424f40f3b425.html https://arxiv.org/abs/2604.19567v1#2026-04-22#llm Wed, 22 Apr 2026 11:37:03 +0800 Reinforcement learning (RL) as post-training is crucial for enhancing the reasoning ability of large language models (LLMs) in coding and math. However, their capacity for visual semantic arithmetic, inferring relationships from images, remains underexplored. 
The classic text analogy "king"-"man"+"woman" = "queen" illustrates relational reasoning, yet replacing text with images of "king" and "man" significantly reduces performance because it requires commonsense knowledge and the extraction of… A Self-Evolving Framework for Efficient Terminal Agents via Observational Context Compression ../papers/arxiv-cedced42e5cf.html https://arxiv.org/abs/2604.19572v1#2026-04-22#llm Wed, 22 Apr 2026 11:37:03 +0800 As model capabilities advance, research has increasingly shifted toward long-horizon, multi-turn terminal-centric agentic tasks, where raw environment feedback is often preserved in the interaction history to support future decisions. However, repeatedly retaining such feedback introduces substantial redundancy and causes cumulative token cost to grow quadratically with the number of steps, hindering long-horizon reasoning. Although observation compression can mitigate this issue, the heterogen… From Experience to Skill: Multi-Agent Generative Engine Optimization via Reusable Strategy Learning ../papers/arxiv-f8c71869303c.html https://arxiv.org/abs/2604.19516v1#2026-04-22#llm Wed, 22 Apr 2026 11:37:03 +0800 Generative engines (GEs) are reshaping information access by replacing ranked links with citation-grounded answers, yet current Generative Engine Optimization (GEO) methods optimize each instance in isolation, unable to accumulate or transfer effective strategies across tasks and engines. 
We reframe GEO as a strategy learning problem and propose MAGEO, a multi-agent framework in which coordinated planning, editing, and fidelity-aware evaluation serve as the execution layer, while validated edit… SafetyALFRED: Evaluating Safety-Conscious Planning of Multimodal Large Language Models ../papers/arxiv-6f4a587095d1.html https://arxiv.org/abs/2604.19638v1#2026-04-22#llm Wed, 22 Apr 2026 11:37:03 +0800 Multimodal Large Language Models are increasingly adopted as autonomous agents in interactive environments, yet their ability to proactively address safety hazards remains insufficient. We introduce SafetyALFRED, built upon the embodied agent benchmark ALFRED, augmented with six categories of real-world kitchen hazards. While existing safety evaluations focus on hazard recognition through disembodied question answering (QA) settings, we evaluate eleven state-of-the-art models from the Qwen, Gem… Time Series Augmented Generation for Financial Applications ../papers/arxiv-a14f6e5fa3da.html https://arxiv.org/abs/2604.19633v1#2026-04-22#llm Wed, 22 Apr 2026 11:37:03 +0800 Evaluating the reasoning capabilities of Large Language Models (LLMs) for complex, quantitative financial tasks is a critical and unsolved challenge. Standard benchmarks often fail to isolate an agent's core ability to parse queries and orchestrate computations. To address this, we introduce a novel evaluation methodology and benchmark designed to rigorously measure an LLM agent's reasoning for financial time-series analysis. 
We apply this methodology in a large-scale empirical study using our… From Top-1 to Top-K: A Reproducibility Study and Benchmarking of Counterfactual Explanations for Recommender Systems ../papers/doi-feed310756b2.html https://arxiv.org/abs/2604.19663v1#2026-04-22#llm Wed, 22 Apr 2026 11:37:03 +0800 Counterfactual explanations (CEs) provide an intuitive way to understand recommender systems by identifying minimal modifications to user-item interactions that alter recommendation outcomes. Existing CE methods for recommender systems, however, have been evaluated under heterogeneous protocols, using different datasets, recommenders, metrics, and even explanation formats, which hampers reproducibility and fair comparison. Our paper systematically reproduces, re-implements, and re-evaluates eleve… Lost in Translation: Do LVLM Judges Generalize Across Languages? ../papers/arxiv-542a2e2a02e6.html https://arxiv.org/abs/2604.19405v1#2026-04-22#llm Wed, 22 Apr 2026 11:37:03 +0800 Automatic evaluators such as reward models play a central role in the alignment and evaluation of large vision-language models (LVLMs). Despite their growing importance, these evaluators are almost exclusively assessed on English-centric benchmarks, leaving open the question of how well these evaluators generalize across languages. To answer this question, we introduce MM-JudgeBench, the first large-scale benchmark for multilingual and multimodal judge model evaluation, which includes over 60K… Chat2Workflow: A Benchmark for Generating Executable Visual Workflows with Natural Language ../papers/arxiv-db59ef9531cc.html https://arxiv.org/abs/2604.19667v1#2026-04-22#llm Wed, 22 Apr 2026 11:37:03 +0800 At present, executable visual workflows have emerged as a mainstream paradigm in real-world industrial deployments, offering strong reliability and controllability. 
However, in current practice, such workflows are almost entirely constructed through manual engineering: developers must carefully design workflows, write prompts for each step, and repeatedly revise the logic as requirements evolve, making development costly, time-consuming, and error-prone. To study whether large language models ca… MathNet: a Global Multimodal Benchmark for Mathematical Reasoning and Retrieval ../papers/arxiv-520299161763.html https://arxiv.org/abs/2604.18584v1#2026-04-21#llm Tue, 21 Apr 2026 11:40:46 +0800 Mathematical problem solving remains a challenging test of reasoning for large language and multimodal models, yet existing benchmarks are limited in size, language coverage, and task diversity. We introduce MathNet, a high-quality, large-scale, multimodal, and multilingual dataset of Olympiad-level math problems together with a benchmark for evaluating mathematical reasoning in generative models and mathematical retrieval in embedding-based systems. MathNet spans 47 countries, 17 languages, an… Benchmarking System Dynamics AI Assistants: Cloud Versus Local LLMs on CLD Extraction and Discussion ../papers/arxiv-df421e6da9eb.html https://arxiv.org/abs/2604.18566v1#2026-04-21#llm Tue, 21 Apr 2026 11:40:46 +0800 We present a systematic evaluation of large language model families, spanning both proprietary cloud APIs and locally-hosted open-source models, on two purpose-built benchmarks for System Dynamics AI assistance: the CLD Leaderboard (53 tests, structured causal loop diagram extraction) and the Discussion Leaderboard (interactive model discussion, feedback explanation, and model building coaching). 
On CLD extraction, cloud models achieve 77-89% overall pass rates; the best… ClawEnvKit: Automatic Environment Generation for Claw-Like Agents ../papers/arxiv-f83cd96fcc3e.html https://arxiv.org/abs/2604.18543v1#2026-04-21#llm Tue, 21 Apr 2026 11:40:46 +0800 Constructing environments for training and evaluating claw-like agents remains a manual, human-intensive process that does not scale. We argue that what is needed is not just a dataset, but an automated pipeline capable of generating diverse, verified environments on demand. To this end, we introduce ClawEnvKit, an autonomous generation pipeline that instantiates this formalism from natural language descriptions. The pipeline comprises three modules: (1) a parser that extracts structured genera… MASS-RAG: Multi-Agent Synthesis Retrieval-Augmented Generation ../papers/arxiv-d54216ff47bf.html https://arxiv.org/abs/2604.18509v1#2026-04-21#llm Tue, 21 Apr 2026 11:40:46 +0800 Large language models (LLMs) are widely used in retrieval-augmented generation (RAG) to incorporate external knowledge at inference time. However, when retrieved contexts are noisy, incomplete, or heterogeneous, a single generation process often struggles to reconcile evidence effectively. We propose MASS-RAG, a multi-agent synthesis approach to retrieval-augmented generation that structures evidence processing into multiple role-specialized agents. MASS-RAG applies distinct agents for… OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation ../papers/arxiv-4fb01ed67d37.html https://arxiv.org/abs/2604.18486v1#2026-04-21#llm Tue, 21 Apr 2026 11:40:46 +0800 Chain-of-Thought (CoT) reasoning has become a powerful driver of trajectory prediction in VLA-based autonomous driving, yet its autoregressive nature imposes a latency cost that is prohibitive for real-time deployment. 
Latent CoT methods attempt to close this gap by compressing reasoning into continuous hidden states, but consistently fall short of their explicit counterparts. We suggest that this is due to purely linguistic latent representations compressing a symbolic abstraction of the world… StepPO: Step-Aligned Policy Optimization for Agentic Reinforcement Learning ../papers/arxiv-4d1ad4b081bb.html https://arxiv.org/abs/2604.18401v1#2026-04-21#llm Tue, 21 Apr 2026 11:40:46 +0800 General agents have given rise to phenomenal applications such as OpenClaw and Claude Code. As these agent systems (a.k.a. Harnesses) strive for bolder goals, they demand increasingly stronger agentic capabilities from foundation Large Language Models (LLMs). Agentic Reinforcement Learning (RL) is emerging as a central post-training paradigm for empowering LLMs with these capabilities and is playing an increasingly pivotal role in agent training. Unlike single-turn token-level alignment or reas… ComPASS: Towards Personalized Agentic Social Support via Tool-Augmented Companionship ../papers/arxiv-7ffafd0c2863.html https://arxiv.org/abs/2604.18356v1#2026-04-21#llm Tue, 21 Apr 2026 11:40:46 +0800 Developing compassionate interactive systems requires agents to not only understand user emotions but also provide diverse, substantive support. While recent works explore empathetic dialogue generation, they remain limited in response form and content, struggling to satisfy diverse needs across users and contexts. To address this, we explore empowering agents with external tools to execute diverse actions. 
Grounded in the psychological concept of "social support", this paradigm delivers substa… HiGMem: A Hierarchical and LLM-Guided Memory System for Long-Term Conversational Agents ../papers/arxiv-5cc1d83ffee2.html https://arxiv.org/abs/2604.18349v1#2026-04-21#llm Tue, 21 Apr 2026 11:40:46 +0800 Long-term conversational large language model (LLM) agents require memory systems that can recover relevant evidence from historical interactions without overwhelming the answer stage with irrelevant context. However, existing memory systems, including hierarchical ones, still often rely solely on vector similarity for retrieval. It tends to produce bloated evidence sets: adding many superficially similar dialogue turns yields little additional recall, but lowers retrieval precision, increases… Agentic Forecasting using Sequential Bayesian Updating of Linguistic Beliefs ../papers/arxiv-237dd6d25d41.html https://arxiv.org/abs/2604.18576v1#2026-04-21#llm Tue, 21 Apr 2026 11:40:46 +0800 We present BLF (Bayesian Linguistic Forecaster), an agentic system for binary forecasting that achieves state-of-the-art performance on the ForecastBench benchmark. The system is built on three ideas. (1) A Bayesian linguistic belief state: a semi-structured representation combining numerical probability estimates with natural-language evidence summaries, updated by the LLM at each step of an iterative tool-use loop. This contrasts with the common approach of appending all retrieved evidence to… Too Correct to Learn: Reinforcement Learning on Saturated Reasoning Data ../papers/arxiv-b69c81bbad3f.html https://arxiv.org/abs/2604.18493v1#2026-04-21#llm Tue, 21 Apr 2026 11:40:46 +0800 Reinforcement Learning (RL) enhances LLM reasoning, yet a paradox emerges as models scale: strong base models saturate standard benchmarks (e.g., MATH), yielding correct but homogeneous solutions. 
In such environments, the lack of failure cases causes the advantage signal in group-relative algorithms (e.g., GRPO) to vanish, driving policies into mode collapse. To address this, we propose Constrained Uniform Top-K Sampling (CUTS), a parameter-free decoding strategy enforcing structure-preserving… Semantic Step Prediction: Multi-Step Latent Forecasting in LLM Reasoning Trajectories via Step Sampling ../papers/arxiv-4797a7249e58.html https://arxiv.org/abs/2604.18464v1#2026-04-21#llm Tue, 21 Apr 2026 11:40:46 +0800 Semantic Tube Prediction (STP) leverages representation geometry to regularize LLM hidden-state trajectories toward locally linear geodesics during fine-tuning, thereby greatly improving data efficiency. The original STP recipe samples random token sub-spans, which is compatible with the base large language model (LLM) training architecture. Inspired by STP, we are interested in investigating whether the sampling position can further enhance the semantic structure of multi-step reasoning, and he… IceBreaker for Conversational Agents: Breaking the First-Message Barrier with Personalized Starters ../papers/arxiv-5f77807f2720.html https://arxiv.org/abs/2604.18375v1#2026-04-21#llm Tue, 21 Apr 2026 11:40:46 +0800 Conversational agents, such as ChatGPT and Doubao, have become essential daily assistants for billions of users. To further enhance engagement, these systems are evolving from passive responders to proactive companions. However, existing efforts focus on activation within ongoing dialogues, while overlooking a key real-world bottleneck. 
In the conversation initiation stage, users may have a vague need but no explicit query intent, creating a first-message barrier where the conversation holds be… Training and Agentic Inference Strategies for LLM-based Manim Animation Generation ../papers/arxiv-993f63372808.html https://arxiv.org/abs/2604.18364v1#2026-04-21#llm Tue, 21 Apr 2026 11:40:46 +0800 Generating programmatic animation using libraries such as Manim presents unique challenges for Large Language Models (LLMs), requiring spatial reasoning, temporal sequencing, and familiarity with domain-specific APIs that are underrepresented in general pre-training data. A systematic study of how training and inference strategies interact in this setting is lacking in current research. This study introduces ManimTrainer, a training pipeline that combines Supervised Fine-tuning (SFT) with Reinf… Multilingual Training and Evaluation Resources for Vision-Language Models ../papers/arxiv-bb0f0a1b4a2e.html https://arxiv.org/abs/2604.18347v1#2026-04-21#llm Tue, 21 Apr 2026 11:40:46 +0800 Vision Language Models (VLMs) have achieved rapid progress in recent years. However, despite their growth, VLM development is heavily grounded in English, leading to two main limitations: (i) the lack of multilingual and multimodal datasets for training, and (ii) the scarcity of comprehensive evaluation benchmarks across languages. In this work, we address these gaps by introducing a new comprehensive suite of resources for VLM training and evaluation spanning five European languages (English,… On the Importance and Evaluation of Narrativity in Natural Language AI Explanations ../papers/arxiv-6eff757730ed.html https://arxiv.org/abs/2604.18311v1#2026-04-21#llm Tue, 21 Apr 2026 11:40:46 +0800 Explainable AI (XAI) aims to make the behaviour of machine learning models interpretable, yet many explanation methods remain difficult to understand. 
The integration of Natural Language Generation into XAI aims to deliver explanations in textual form, making them more accessible to practitioners. Current approaches, however, largely yield static lists of feature importances. Although such explanations indicate what influences the prediction, they do not explain why the prediction occurs. In th… CoopEval: Benchmarking Cooperation-Sustaining Mechanisms and LLM Agents in Social Dilemmas ../papers/arxiv-5e024cdf605d.html https://arxiv.org/abs/2604.15267v1#2026-04-17#llm Fri, 17 Apr 2026 11:39:21 +0800 It is increasingly important that LLM agents interact effectively and safely with other goal-pursuing agents, yet, recent works report the opposite trend: LLMs with stronger reasoning capabilities behave _less_ cooperatively in mixed-motive games such as the prisoner's dilemma and public goods settings. Indeed, our experiments show that recent models -- with or without reasoning enabled -- consistently defect in single-shot social dilemmas. To tackle this safety concern, we present the first co… IE as Cache: Information Extraction Enhanced Agentic Reasoning ../papers/arxiv-b9668967d0c4.html https://arxiv.org/abs/2604.14930v1#2026-04-17#llm Fri, 17 Apr 2026 11:39:21 +0800 Information Extraction aims to distill structured, decision-relevant information from unstructured text, serving as a foundation for downstream understanding and reasoning. However, it is traditionally treated merely as a terminal objective: once extracted, the resulting structure is often consumed in isolation rather than maintained and reused during multi-step inference. 
Moving beyond this, we propose IE-as-Cache, a framework that repurposes IE as a cognitive cache to enhance agentic… QuantCode-Bench: A Benchmark for Evaluating the Ability of Large Language Models to Generate Executable Algorithmic Trading Strategies ../papers/arxiv-60286bc4afdd.html https://arxiv.org/abs/2604.15151v1#2026-04-17#llm Fri, 17 Apr 2026 11:39:21 +0800 Large language models have demonstrated strong performance on general-purpose programming tasks, yet their ability to generate executable algorithmic trading strategies remains underexplored. Unlike standard code benchmarks, trading-strategy generation requires simultaneous mastery of domain-specific financial logic, knowledge of a specialized API, and the ability to produce code that is not only syntactically correct but also leads to actual trades on historical data. In this work, we present… From Reactive to Proactive: Assessing the Proactivity of Voice Agents via ProVoice-Bench ../papers/arxiv-913915b00c96.html https://arxiv.org/abs/2604.15037v1#2026-04-17#llm Fri, 17 Apr 2026 11:39:21 +0800 Recent advancements in LLM agents are gradually shifting from reactive, text-based paradigms toward proactive, multimodal interaction. However, existing benchmarks primarily focus on reactive responses, overlooking the complexities of proactive intervention and monitoring. To bridge this gap, we introduce ProVoice-Bench, the first evaluation framework specifically designed for proactive voice agents, featuring four novel tasks. By leveraging a multi-stage data synthesis pipeline, we curate 1,18… An Axiomatic Benchmark for Evaluation of Scientific Novelty Metrics ../papers/arxiv-1b80284b2f1e.html https://arxiv.org/abs/2604.15145v1#2026-04-17#llm Fri, 17 Apr 2026 11:39:21 +0800 The rigorous evaluation of the novelty of a scientific paper is, even for human scientists, a challenging task. 
With the increasing interest in AI scientists and AI involvement in scientific idea generation and paper writing, it also becomes increasingly important that this task be automatable and reliable, lest both human attention and compute tokens be wasted on ideas that have already been explored. Due to the challenge of quantifying ground-truth novelty, however, existing novelty metrics f… MM-WebAgent: A Hierarchical Multimodal Web Agent for Webpage Generation ../papers/arxiv-9f9995d5a903.html https://arxiv.org/abs/2604.15309v1#2026-04-17#llm Fri, 17 Apr 2026 11:39:21 +0800 The rapid progress of Artificial Intelligence Generated Content (AIGC) tools enables images, videos, and visualizations to be created on demand for webpage design, offering a flexible and increasingly adopted paradigm for modern UI/UX. However, directly integrating such tools into automated webpage generation often leads to style inconsistency and poor global coherence, as elements are generated in isolation. We propose MM-WebAgent, a hierarchical agentic framework for multimodal webpage genera… Context Over Content: Exposing Evaluation Faking in Automated Judges ../papers/arxiv-0bc9230c8b6d.html https://arxiv.org/abs/2604.15224v1#2026-04-17#llm Fri, 17 Apr 2026 11:39:21 +0800 The LLM-as-a-judge paradigm has become the operational backbone of automated AI evaluation pipelines, yet rests on an unverified assumption: that judges evaluate text strictly on its semantic content, impervious to surrounding contextual framing. 
We investigate stakes signaling, a previously unmeasured vulnerability where informing a judge model of the downstream consequences its verdicts will have on the evaluated model's continued operation systematically corrupts its as… AI-Assisted Requirements Engineering: An Empirical Evaluation Relative to Expert Judgment ../papers/arxiv-400f736db53f.html https://arxiv.org/abs/2604.15222v1#2026-04-17#llm Fri, 17 Apr 2026 11:39:21 +0800 Artificial Intelligence is increasingly introduced into systems engineering activities, particularly within requirements engineering, where quality assessment and validation remain heavily dependent on expert judgment. While recent AI tools demonstrate promising capabilities in analyzing and generating requirements, their role within formal systems engineering processes, and their alignment with established INCOSE criteria, remains insufficiently understood. This paper investigates the extent to… MADE: A Living Benchmark for Multi-Label Text Classification with Uncertainty Quantification of Medical Device Adverse Events ../papers/arxiv-4416c06e91a3.html https://arxiv.org/abs/2604.15203v1#2026-04-17#llm Fri, 17 Apr 2026 11:39:21 +0800 Machine learning in high-stakes domains such as healthcare requires not only strong predictive performance but also reliable uncertainty quantification (UQ) to support human oversight. Multi-label text classification (MLTC) is a central task in this domain, yet remains challenging due to label imbalances, dependencies, and combinatorial complexity. 
Existing MLTC benchmarks are increasingly saturated and may be affected by training data contamination, making it difficult to distinguish genuine r… Autonomous Evolution of EDA Tools: Multi-Agent Self-Evolved ABC ../papers/arxiv-c894f3778ac6.html https://arxiv.org/abs/2604.15082v1#2026-04-17#llm Fri, 17 Apr 2026 11:39:21 +0800 This paper introduces the first self-evolving logic synthesis framework, which leverages Large Language Model (LLM) agents to autonomously improve the source code of ABC, the widely adopted logic synthesis system. Our framework operates on the entire integrated ABC codebase, and the output repository preserves its single-binary execution model and command interface. In the initial evolution cycle, we bootstrap the system using existing prior open-source synthesis componen… ADAPT: Benchmarking Commonsense Planning under Unspecified Affordance Constraints ../papers/arxiv-148ce4c33832.html https://arxiv.org/abs/2604.14902v1#2026-04-17#llm Fri, 17 Apr 2026 11:39:21 +0800 Intelligent embodied agents should not simply follow instructions, as real-world environments often involve unexpected conditions and exceptions. However, existing methods usually focus on directly executing instructions, without considering whether the target objects can actually be manipulated, meaning they fail to assess available affordances. To address this limitation, we introduce DynAfford, a benchmark that evaluates embodied agents in dynamic environments where object affordances may ch… Meituan Merchant Business Diagnosis via Policy-Guided Dual-Process User Simulation ../papers/doi-fda96f4fa371.html https://arxiv.org/abs/2604.15190v1#2026-04-17#llm Fri, 17 Apr 2026 11:39:21 +0800 Simulating group-level user behavior enables scalable counterfactual evaluation of merchant strategies without costly online experiments. However, building a trustworthy simulator faces two structural challenges. 
First, information incompleteness causes reasoning-based simulators to over-rationalize when unobserved factors such as offline context and implicit habits are missing. Second, mechanism duality requires capturing both interpretable preferences and implicit statistical regularities, wh… From Tokens to Steps: Verification-Aware Speculative Decoding for Efficient Multi-Step Reasoning ../papers/arxiv-671db056cec2.html https://arxiv.org/abs/2604.15244v1#2026-04-17#llm Fri, 17 Apr 2026 11:39:21 +0800 Speculative decoding (SD) accelerates large language model inference by allowing a lightweight draft model to propose outputs that a stronger target model verifies. However, its token-centric nature allows erroneous steps to propagate. Prior approaches mitigate this using external reward models, but incur additional latency, computational overhead, and limit generalizability. We propose SpecGuard, a verification-aware speculative decoding framework that performs step-level verification using on… Blue Data Intelligence Layer: Streaming Data and Agents for Multi-source Multi-modal Data-Centric Applications ../papers/arxiv-57fc3ce735ba.html https://arxiv.org/abs/2604.15233v1#2026-04-17#llm Fri, 17 Apr 2026 11:39:21 +0800 NL2SQL systems aim to address the growing need for natural language interaction with data. However, real-world information rarely maps to a single SQL query because (1) users express queries iteratively (2) questions often span multiple data sources beyond the closed-world assumption of a single database, and (3) queries frequently rely on commonsense or external knowledge. 
Consequently, satisfying realistic data needs requires integrating heterogeneous sources, modalities, and contextual data.… RadAgent: A tool-using AI agent for stepwise interpretation of chest computed tomography ../papers/arxiv-d12df90e00da.html https://arxiv.org/abs/2604.15231v1#2026-04-17#llm Fri, 17 Apr 2026 11:39:21 +0800 Vision-language models (VLM) have markedly advanced AI-driven interpretation and reporting of complex medical imaging, such as computed tomography (CT). Yet, existing methods largely relegate clinicians to passive observers of final outputs, offering no interpretable reasoning trace for them to inspect, validate, or refine. To address this, we introduce RadAgent, a tool-using AI agent that generates CT reports through a stepwise and interpretable process. Each resulting report is accompanied by… GeoAgentBench: A Dynamic Execution Benchmark for Tool-Augmented Agents in Spatial Analysis ../papers/arxiv-283874153373.html https://arxiv.org/abs/2604.13888v1#2026-04-16#llm Thu, 16 Apr 2026 11:43:00 +0800 The integration of Large Language Models (LLMs) into Geographic Information Systems (GIS) marks a paradigm shift toward autonomous spatial analysis. However, evaluating these LLM-based agents remains challenging due to the complex, multi-step nature of geospatial workflows. Existing benchmarks primarily rely on static text or code matching, neglecting dynamic runtime feedback and the multimodal nature of spatial outputs. To address this gap, we introduce GeoAgentBench (GABench), a dynamic and i… HINTBench: Horizon-agent Intrinsic Non-attack Trajectory Benchmark ../papers/arxiv-f6718acdd1da.html https://arxiv.org/abs/2604.13954v1#2026-04-16#llm Thu, 16 Apr 2026 11:43:00 +0800 Existing agent-safety evaluation has focused mainly on externally induced risks. Yet agents may still enter unsafe trajectories under benign conditions. 
We study this complementary but underexplored setting through the lens of intrinsic risk, where intrinsic failures remain latent, propagate across long-horizon execution, and eventually lead to high-consequence outcomes. To evaluate this setting, we introduce non-attack intrinsic risk auditing and present HINTBench, a ben… Character Beyond Speech: Leveraging Role-Playing Evaluation in Audio Large Language Models via Reinforcement Learning ../papers/arxiv-82411c54ef00.html https://arxiv.org/abs/2604.13804v1#2026-04-16#llm Thu, 16 Apr 2026 11:43:00 +0800 The rapid evolution of multimodal large models has revolutionized the simulation of diverse characters in speech dialogue systems, enabling a novel interactive paradigm. Character attributes are manifested not only in textual responses but also through vocal features, as speech conveys rich paralinguistic information that is challenging to quantify. This poses significant difficulties in evaluating the character alignment of role-playing agents. To address these challenges, we present RoleJudge… LongCoT: Benchmarking Long-Horizon Chain-of-Thought Reasoning ../papers/arxiv-c517a8dff3b8.html https://arxiv.org/abs/2604.14140v1#2026-04-16#llm Thu, 16 Apr 2026 11:43:00 +0800 As language models are increasingly deployed for complex autonomous tasks, their ability to reason accurately over longer horizons becomes critical. An essential component of this ability is planning and managing a long, complex chain-of-thought (CoT). We introduce LongCoT, a scalable benchmark of 2,500 expert-designed problems spanning chemistry, mathematics, computer science, chess, and logic to isolate and directly measure the long-horizon CoT reasoning capabilities of frontier models. Probl… Correct Prediction, Wrong Steps? 
Consensus Reasoning Knowledge Graph for Robust Chain-of-Thought Synthesis ../papers/arxiv-da2e5b9f5c3e.html https://arxiv.org/abs/2604.14121v1#2026-04-16#llm Thu, 16 Apr 2026 11:43:00 +0800 LLM reasoning traces suffer from complex flaws -- *Step Internal Flaws* (logical errors, hallucinations, etc.) and *Step-wise Flaws* (overthinking, underthinking), which vary by sample. A natural approach would be to provide ground-truth labels to guide LLMs' reasoning. Contrary to intuition, we show that this yields no improvement in reasoning ability. We then propose CRAFT, a unified framework that mitigates both types of Step flaws, which builds a Reasoning Knowledge Graph (RKG) based on the…