<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
<channel>
<title>agent Topic Archive</title>
<link>agent.html</link>
<description>Long-term tracking RSS feed for the keyword "agent", aggregating all historically matched publications.</description>
<language>en</language>
<lastBuildDate>Wed, 22 Apr 2026 03:37:20 +0000</lastBuildDate>
<item>
<title>Four-Axis Decision Alignment for Long-Horizon Enterprise AI Agents</title>
<link>../papers/arxiv-d363006cb185.html</link>
<guid>https://arxiv.org/abs/2604.19457v1#2026-04-22#agent</guid>
<pubDate>Wed, 22 Apr 2026 11:37:03 +0800</pubDate>
<description>Long-horizon enterprise agents make high-stakes decisions (loan underwriting, claims adjudication, clinical review, prior authorization) under lossy memory, multi-step reasoning, and binding regulatory constraints. Current evaluation reports a single task-success scalar that conflates distinct failure modes and hides whether an agent is aligned with the standards its deployment environment requires. We propose that long-horizon decision behavior decomposes into four orthogonal alignment axes, e…</description>
</item>
<item>
<title>Cyber Defense Benchmark: Agentic Threat Hunting Evaluation for LLMs in SecOps</title>
<link>../papers/arxiv-66f4fae6bbd8.html</link>
<guid>https://arxiv.org/abs/2604.19533v1#2026-04-22#agent</guid>
<pubDate>Wed, 22 Apr 2026 11:37:03 +0800</pubDate>
<description>We introduce the Cyber Defense Benchmark, a benchmark for measuring how well large language model (LLM) agents perform the core SOC analyst task of threat hunting: given a database of raw Windows event logs with no guided questions or hints, identify the exact timestamps of malicious events. The benchmark wraps 106 real attack procedures from the OTRF Security-Datasets corpus - spanning 86 MITRE ATT&amp;CK sub-techniques across 12 tactics - into a Gymnasium reinforcement-learning environment. Each…</description>
</item>
<item>
<title>Taming Actor-Observer Asymmetry in Agents via Dialectical Alignment</title>
<link>../papers/arxiv-3ca660d54bb4.html</link>
<guid>https://arxiv.org/abs/2604.19548v1#2026-04-22#agent</guid>
<pubDate>Wed, 22 Apr 2026 11:37:03 +0800</pubDate>
<description>Large Language Model agents have rapidly evolved from static text generators into dynamic systems capable of executing complex autonomous workflows. To enhance reliability, multi-agent frameworks assigning specialized roles are increasingly adopted to enable self-reflection and mutual auditing. While such role-playing effectively leverages domain expert knowledge, we find it simultaneously induces a human-like cognitive bias known as Actor-Observer Asymmetry (AOA). Specifically, an agent acting…</description>
</item>
<item>
<title>Revac: A Social Deduction Reasoning Agent</title>
<link>../papers/arxiv-49c0fe8adf77.html</link>
<guid>https://arxiv.org/abs/2604.19523v1#2026-04-22#agent</guid>
<pubDate>Wed, 22 Apr 2026 11:37:03 +0800</pubDate>
<description>Social deduction games such as Mafia present a unique AI challenge: players must reason under uncertainty, interpret incomplete and intentionally misleading information, evaluate human-like communication, and make strategic elimination decisions. Unlike deterministic board games, success in Mafia depends not on perfect information or brute-force search, but on inference, memory, and adaptability in the presence of deception. This work presents the design and evaluation of Revac-8, an AI agent d…</description>
</item>
<item>
<title>A-MAR: Agent-based Multimodal Art Retrieval for Fine-Grained Artwork Understanding</title>
<link>../papers/arxiv-5fe8f705aa06.html</link>
<guid>https://arxiv.org/abs/2604.19689v1#2026-04-22#agent</guid>
<pubDate>Wed, 22 Apr 2026 11:37:03 +0800</pubDate>
<description>Understanding artworks requires multi-step reasoning over visual content and cultural, historical, and stylistic context. While recent multimodal large language models show promise in artwork explanation, they rely on implicit reasoning and internalized knowledge, limiting interpretability and explicit evidence grounding. We propose A-MAR, an Agent-based Multimodal Art Retrieval framework that explicitly conditions retrieval on structured reasoning plans. Given an artwork and a user query, A-…</description>
</item>
<item>
<title>Multi-modal Reasoning with LLMs for Visual Semantic Arithmetic</title>
<link>../papers/arxiv-424f40f3b425.html</link>
<guid>https://arxiv.org/abs/2604.19567v1#2026-04-22#agent</guid>
<pubDate>Wed, 22 Apr 2026 11:37:03 +0800</pubDate>
<description>Reinforcement learning (RL) as post-training is crucial for enhancing the reasoning ability of large language models (LLMs) in coding and math. However, their capacity for visual semantic arithmetic, inferring relationships from images, remains underexplored. The classic text analogy &quot;king&quot;-&quot;man&quot;+&quot;woman&quot; = &quot;queen&quot; illustrates relational reasoning, yet replacing text with images of &quot;king&quot; and &quot;man&quot; significantly reduces performance because it requires commonsense knowledge and the extraction of…</description>
</item>
<item>
<title>A Self-Evolving Framework for Efficient Terminal Agents via Observational Context Compression</title>
<link>../papers/arxiv-cedced42e5cf.html</link>
<guid>https://arxiv.org/abs/2604.19572v1#2026-04-22#agent</guid>
<pubDate>Wed, 22 Apr 2026 11:37:03 +0800</pubDate>
<description>As model capabilities advance, research has increasingly shifted toward long-horizon, multi-turn terminal-centric agentic tasks, where raw environment feedback is often preserved in the interaction history to support future decisions. However, repeatedly retaining such feedback introduces substantial redundancy and causes cumulative token cost to grow quadratically with the number of steps, hindering long-horizon reasoning. Although observation compression can mitigate this issue, the heterogen…</description>
</item>
<item>
<title>From Experience to Skill: Multi-Agent Generative Engine Optimization via Reusable Strategy Learning</title>
<link>../papers/arxiv-f8c71869303c.html</link>
<guid>https://arxiv.org/abs/2604.19516v1#2026-04-22#agent</guid>
<pubDate>Wed, 22 Apr 2026 11:37:03 +0800</pubDate>
<description>Generative engines (GEs) are reshaping information access by replacing ranked links with citation-grounded answers, yet current Generative Engine Optimization (GEO) methods optimize each instance in isolation, unable to accumulate or transfer effective strategies across tasks and engines. We reframe GEO as a strategy learning problem and propose MAGEO, a multi-agent framework in which coordinated planning, editing, and fidelity-aware evaluation serve as the execution layer, while validated edit…</description>
</item>
<item>
<title>SafetyALFRED: Evaluating Safety-Conscious Planning of Multimodal Large Language Models</title>
<link>../papers/arxiv-6f4a587095d1.html</link>
<guid>https://arxiv.org/abs/2604.19638v1#2026-04-22#agent</guid>
<pubDate>Wed, 22 Apr 2026 11:37:03 +0800</pubDate>
<description>Multimodal Large Language Models are increasingly adopted as autonomous agents in interactive environments, yet their ability to proactively address safety hazards remains insufficient. We introduce SafetyALFRED, built upon the embodied agent benchmark ALFRED, augmented with six categories of real-world kitchen hazards. While existing safety evaluations focus on hazard recognition through disembodied question answering (QA) settings, we evaluate eleven state-of-the-art models from the Qwen, Gem…</description>
</item>
<item>
<title>Time Series Augmented Generation for Financial Applications</title>
<link>../papers/arxiv-a14f6e5fa3da.html</link>
<guid>https://arxiv.org/abs/2604.19633v1#2026-04-22#agent</guid>
<pubDate>Wed, 22 Apr 2026 11:37:03 +0800</pubDate>
<description>Evaluating the reasoning capabilities of Large Language Models (LLMs) for complex, quantitative financial tasks is a critical and unsolved challenge. Standard benchmarks often fail to isolate an agent&#x27;s core ability to parse queries and orchestrate computations. To address this, we introduce a novel evaluation methodology and benchmark designed to rigorously measure an LLM agent&#x27;s reasoning for financial time-series analysis. We apply this methodology in a large-scale empirical study using our…</description>
</item>
<item>
<title>Chat2Workflow: A Benchmark for Generating Executable Visual Workflows with Natural Language</title>
<link>../papers/arxiv-db59ef9531cc.html</link>
<guid>https://arxiv.org/abs/2604.19667v1#2026-04-22#agent</guid>
<pubDate>Wed, 22 Apr 2026 11:37:03 +0800</pubDate>
<description>At present, executable visual workflows have emerged as a mainstream paradigm in real-world industrial deployments, offering strong reliability and controllability. However, in current practice, such workflows are almost entirely constructed through manual engineering: developers must carefully design workflows, write prompts for each step, and repeatedly revise the logic as requirements evolve, making development costly, time-consuming, and error-prone. To study whether large language models ca…</description>
</item>
<item>
<title>ClawEnvKit: Automatic Environment Generation for Claw-Like Agents</title>
<link>../papers/arxiv-f83cd96fcc3e.html</link>
<guid>https://arxiv.org/abs/2604.18543v1#2026-04-21#agent</guid>
<pubDate>Tue, 21 Apr 2026 11:40:46 +0800</pubDate>
<description>Constructing environments for training and evaluating claw-like agents remains a manual, human-intensive process that does not scale. We argue that what is needed is not just a dataset, but an automated pipeline capable of generating diverse, verified environments on demand. To this end, we introduce ClawEnvKit, an autonomous generation pipeline that instantiates this formalism from natural language descriptions. The pipeline comprises three modules: (1) a parser that extracts structured genera…</description>
</item>
<item>
<title>MASS-RAG: Multi-Agent Synthesis Retrieval-Augmented Generation</title>
<link>../papers/arxiv-d54216ff47bf.html</link>
<guid>https://arxiv.org/abs/2604.18509v1#2026-04-21#agent</guid>
<pubDate>Tue, 21 Apr 2026 11:40:46 +0800</pubDate>
<description>Large language models (LLMs) are widely used in retrieval-augmented generation (RAG) to incorporate external knowledge at inference time. However, when retrieved contexts are noisy, incomplete, or heterogeneous, a single generation process often struggles to reconcile evidence effectively. We propose MASS-RAG, a multi-agent synthesis approach to retrieval-augmented generation that structures evidence processing into multiple role-specialized agents. MASS-RAG applies distinct agents for…</description>
</item>
<item>
<title>OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation</title>
<link>../papers/arxiv-4fb01ed67d37.html</link>
<guid>https://arxiv.org/abs/2604.18486v1#2026-04-21#agent</guid>
<pubDate>Tue, 21 Apr 2026 11:40:46 +0800</pubDate>
<description>Chain-of-Thought (CoT) reasoning has become a powerful driver of trajectory prediction in VLA-based autonomous driving, yet its autoregressive nature imposes a latency cost that is prohibitive for real-time deployment. Latent CoT methods attempt to close this gap by compressing reasoning into continuous hidden states, but consistently fall short of their explicit counterparts. We suggest that this is due to purely linguistic latent representations compressing a symbolic abstraction of the world…</description>
</item>
<item>
<title>StepPO: Step-Aligned Policy Optimization for Agentic Reinforcement Learning</title>
<link>../papers/arxiv-4d1ad4b081bb.html</link>
<guid>https://arxiv.org/abs/2604.18401v1#2026-04-21#agent</guid>
<pubDate>Tue, 21 Apr 2026 11:40:46 +0800</pubDate>
<description>General agents have given rise to phenomenal applications such as OpenClaw and Claude Code. As these agent systems (a.k.a. Harnesses) strive for bolder goals, they demand increasingly stronger agentic capabilities from foundation Large Language Models (LLMs). Agentic Reinforcement Learning (RL) is emerging as a central post-training paradigm for empowering LLMs with these capabilities and is playing an increasingly pivotal role in agent training. Unlike single-turn token-level alignment or reas…</description>
</item>
<item>
<title>ComPASS: Towards Personalized Agentic Social Support via Tool-Augmented Companionship</title>
<link>../papers/arxiv-7ffafd0c2863.html</link>
<guid>https://arxiv.org/abs/2604.18356v1#2026-04-21#agent</guid>
<pubDate>Tue, 21 Apr 2026 11:40:46 +0800</pubDate>
<description>Developing compassionate interactive systems requires agents to not only understand user emotions but also provide diverse, substantive support. While recent works explore empathetic dialogue generation, they remain limited in response form and content, struggling to satisfy diverse needs across users and contexts. To address this, we explore empowering agents with external tools to execute diverse actions. Grounded in the psychological concept of &quot;social support&quot;, this paradigm delivers substa…</description>
</item>
<item>
<title>HiGMem: A Hierarchical and LLM-Guided Memory System for Long-Term Conversational Agents</title>
<link>../papers/arxiv-5cc1d83ffee2.html</link>
<guid>https://arxiv.org/abs/2604.18349v1#2026-04-21#agent</guid>
<pubDate>Tue, 21 Apr 2026 11:40:46 +0800</pubDate>
<description>Long-term conversational large language model (LLM) agents require memory systems that can recover relevant evidence from historical interactions without overwhelming the answer stage with irrelevant context. However, existing memory systems, including hierarchical ones, still often rely solely on vector similarity for retrieval. It tends to produce bloated evidence sets: adding many superficially similar dialogue turns yields little additional recall, but lowers retrieval precision, increases…</description>
</item>
<item>
<title>Agentic Forecasting using Sequential Bayesian Updating of Linguistic Beliefs</title>
<link>../papers/arxiv-237dd6d25d41.html</link>
<guid>https://arxiv.org/abs/2604.18576v1#2026-04-21#agent</guid>
<pubDate>Tue, 21 Apr 2026 11:40:46 +0800</pubDate>
<description>We present BLF (Bayesian Linguistic Forecaster), an agentic system for binary forecasting that achieves state-of-the-art performance on the ForecastBench benchmark. The system is built on three ideas. (1) A Bayesian linguistic belief state: a semi-structured representation combining numerical probability estimates with natural-language evidence summaries, updated by the LLM at each step of an iterative tool-use loop. This contrasts with the common approach of appending all retrieved evidence to…</description>
</item>
<item>
<title>IceBreaker for Conversational Agents: Breaking the First-Message Barrier with Personalized Starters</title>
<link>../papers/arxiv-5f77807f2720.html</link>
<guid>https://arxiv.org/abs/2604.18375v1#2026-04-21#agent</guid>
<pubDate>Tue, 21 Apr 2026 11:40:46 +0800</pubDate>
<description>Conversational agents, such as ChatGPT and Doubao, have become essential daily assistants for billions of users. To further enhance engagement, these systems are evolving from passive responders to proactive companions. However, existing efforts focus on activation within ongoing dialogues, while overlooking a key real-world bottleneck. In the conversation initiation stage, users may have a vague need but no explicit query intent, creating a first-message barrier where the conversation holds be…</description>
</item>
<item>
<title>Training and Agentic Inference Strategies for LLM-based Manim Animation Generation</title>
<link>../papers/arxiv-993f63372808.html</link>
<guid>https://arxiv.org/abs/2604.18364v1#2026-04-21#agent</guid>
<pubDate>Tue, 21 Apr 2026 11:40:46 +0800</pubDate>
<description>Generating programmatic animation using libraries such as Manim presents unique challenges for Large Language Models (LLMs), requiring spatial reasoning, temporal sequencing, and familiarity with domain-specific APIs that are underrepresented in general pre-training data. A systematic study of how training and inference strategies interact in this setting is lacking in current research. This study introduces ManimTrainer, a training pipeline that combines Supervised Fine-tuning (SFT) with Reinf…</description>
</item>
<item>
<title>Transforming oncology clinical trial matching through neuro-symbolic, multi-agent AI and an oncology-specific knowledge graph: a prospective evaluation in 3804 patients.</title>
<link>../papers/doi-a39ecce65f3a.html</link>
<guid>https://pubmed.ncbi.nlm.nih.gov/42004487/#2026-04-21#agent</guid>
<pubDate>Tue, 21 Apr 2026 11:40:46 +0800</pubDate>
<description>BACKGROUND: Clinical trial enrollment in oncology remains critically low, with fewer than 5% of eligible adults participating, in large part due to the complexity and labor intensity of eligibility screening. We prospectively evaluated a neuro-symbolic, multi-agent artificial intelligence (AI) platform integrating domain-specific large language model (LLM) agents, an oncology-specific knowledge graph, a real-time recommendation engine, and human-in-the-loop review to determine whether automated…</description>
</item>
<item>
<title>CoopEval: Benchmarking Cooperation-Sustaining Mechanisms and LLM Agents in Social Dilemmas</title>
<link>../papers/arxiv-5e024cdf605d.html</link>
<guid>https://arxiv.org/abs/2604.15267v1#2026-04-17#agent</guid>
<pubDate>Fri, 17 Apr 2026 11:39:21 +0800</pubDate>
<description>It is increasingly important that LLM agents interact effectively and safely with other goal-pursuing agents, yet, recent works report the opposite trend: LLMs with stronger reasoning capabilities behave less cooperatively in mixed-motive games such as the prisoner&#x27;s dilemma and public goods settings. Indeed, our experiments show that recent models -- with or without reasoning enabled -- consistently defect in single-shot social dilemmas. To tackle this safety concern, we present the first co…</description>
</item>
<item>
<title>IE as Cache: Information Extraction Enhanced Agentic Reasoning</title>
<link>../papers/arxiv-b9668967d0c4.html</link>
<guid>https://arxiv.org/abs/2604.14930v1#2026-04-17#agent</guid>
<pubDate>Fri, 17 Apr 2026 11:39:21 +0800</pubDate>
<description>Information Extraction aims to distill structured, decision-relevant information from unstructured text, serving as a foundation for downstream understanding and reasoning. However, it is traditionally treated merely as a terminal objective: once extracted, the resulting structure is often consumed in isolation rather than maintained and reused during multi-step inference. Moving beyond this, we propose IE-as-Cache, a framework that repurposes IE as a cognitive cache to enhance agentic…</description>
</item>
<item>
<title>QuantCode-Bench: A Benchmark for Evaluating the Ability of Large Language Models to Generate Executable Algorithmic Trading Strategies</title>
<link>../papers/arxiv-60286bc4afdd.html</link>
<guid>https://arxiv.org/abs/2604.15151v1#2026-04-17#agent</guid>
<pubDate>Fri, 17 Apr 2026 11:39:21 +0800</pubDate>
<description>Large language models have demonstrated strong performance on general-purpose programming tasks, yet their ability to generate executable algorithmic trading strategies remains underexplored. Unlike standard code benchmarks, trading-strategy generation requires simultaneous mastery of domain-specific financial logic, knowledge of a specialized API, and the ability to produce code that is not only syntactically correct but also leads to actual trades on historical data. In this work, we present…</description>
</item>
<item>
<title>From Reactive to Proactive: Assessing the Proactivity of Voice Agents via ProVoice-Bench</title>
<link>../papers/arxiv-913915b00c96.html</link>
<guid>https://arxiv.org/abs/2604.15037v1#2026-04-17#agent</guid>
<pubDate>Fri, 17 Apr 2026 11:39:21 +0800</pubDate>
<description>Recent advancements in LLM agents are gradually shifting from reactive, text-based paradigms toward proactive, multimodal interaction. However, existing benchmarks primarily focus on reactive responses, overlooking the complexities of proactive intervention and monitoring. To bridge this gap, we introduce ProVoice-Bench, the first evaluation framework specifically designed for proactive voice agents, featuring four novel tasks. By leveraging a multi-stage data synthesis pipeline, we curate 1,18…</description>
</item>
<item>
<title>MM-WebAgent: A Hierarchical Multimodal Web Agent for Webpage Generation</title>
<link>../papers/arxiv-9f9995d5a903.html</link>
<guid>https://arxiv.org/abs/2604.15309v1#2026-04-17#agent</guid>
<pubDate>Fri, 17 Apr 2026 11:39:21 +0800</pubDate>
<description>The rapid progress of Artificial Intelligence Generated Content (AIGC) tools enables images, videos, and visualizations to be created on demand for webpage design, offering a flexible and increasingly adopted paradigm for modern UI/UX. However, directly integrating such tools into automated webpage generation often leads to style inconsistency and poor global coherence, as elements are generated in isolation. We propose MM-WebAgent, a hierarchical agentic framework for multimodal webpage genera…</description>
</item>
<item>
<title>Autonomous Evolution of EDA Tools: Multi-Agent Self-Evolved ABC</title>
<link>../papers/arxiv-c894f3778ac6.html</link>
<guid>https://arxiv.org/abs/2604.15082v1#2026-04-17#agent</guid>
<pubDate>Fri, 17 Apr 2026 11:39:21 +0800</pubDate>
<description>This paper introduces the first self-evolving logic synthesis framework, which leverages Large Language Model (LLM) agents to autonomously improve the source code of ABC, the widely adopted logic synthesis system. Our framework operates on the entire integrated ABC codebase, and the output repository preserves its single-binary execution model and command interface. In the initial evolution cycle, we bootstrap the system using existing prior open-source synthesis componen…</description>
</item>
<item>
<title>ADAPT: Benchmarking Commonsense Planning under Unspecified Affordance Constraints</title>
<link>../papers/arxiv-148ce4c33832.html</link>
<guid>https://arxiv.org/abs/2604.14902v1#2026-04-17#agent</guid>
<pubDate>Fri, 17 Apr 2026 11:39:21 +0800</pubDate>
<description>Intelligent embodied agents should not simply follow instructions, as real-world environments often involve unexpected conditions and exceptions. However, existing methods usually focus on directly executing instructions, without considering whether the target objects can actually be manipulated, meaning they fail to assess available affordances. To address this limitation, we introduce DynAfford, a benchmark that evaluates embodied agents in dynamic environments where object affordances may ch…</description>
</item>
<item>
<title>Blue Data Intelligence Layer: Streaming Data and Agents for Multi-source Multi-modal Data-Centric Applications</title>
<link>../papers/arxiv-57fc3ce735ba.html</link>
<guid>https://arxiv.org/abs/2604.15233v1#2026-04-17#agent</guid>
<pubDate>Fri, 17 Apr 2026 11:39:21 +0800</pubDate>
<description>NL2SQL systems aim to address the growing need for natural language interaction with data. However, real-world information rarely maps to a single SQL query because (1) users express queries iteratively, (2) questions often span multiple data sources beyond the closed-world assumption of a single database, and (3) queries frequently rely on commonsense or external knowledge. Consequently, satisfying realistic data needs requires integrating heterogeneous sources, modalities, and contextual data.…</description>
</item>
<item>
<title>RadAgent: A tool-using AI agent for stepwise interpretation of chest computed tomography</title>
<link>../papers/arxiv-d12df90e00da.html</link>
<guid>https://arxiv.org/abs/2604.15231v1#2026-04-17#agent</guid>
<pubDate>Fri, 17 Apr 2026 11:39:21 +0800</pubDate>
<description>Vision-language models (VLM) have markedly advanced AI-driven interpretation and reporting of complex medical imaging, such as computed tomography (CT). Yet, existing methods largely relegate clinicians to passive observers of final outputs, offering no interpretable reasoning trace for them to inspect, validate, or refine. To address this, we introduce RadAgent, a tool-using AI agent that generates CT reports through a stepwise and interpretable process. Each resulting report is accompanied by…</description>
</item>
<item>
<title>GeoAgentBench: A Dynamic Execution Benchmark for Tool-Augmented Agents in Spatial Analysis</title>
<link>../papers/arxiv-283874153373.html</link>
<guid>https://arxiv.org/abs/2604.13888v1#2026-04-16#agent</guid>
<pubDate>Thu, 16 Apr 2026 11:43:00 +0800</pubDate>
<description>The integration of Large Language Models (LLMs) into Geographic Information Systems (GIS) marks a paradigm shift toward autonomous spatial analysis. However, evaluating these LLM-based agents remains challenging due to the complex, multi-step nature of geospatial workflows. Existing benchmarks primarily rely on static text or code matching, neglecting dynamic runtime feedback and the multimodal nature of spatial outputs. To address this gap, we introduce GeoAgentBench (GABench), a dynamic and i…</description>
</item>
<item>
<title>HINTBench: Horizon-agent Intrinsic Non-attack Trajectory Benchmark</title>
<link>../papers/arxiv-f6718acdd1da.html</link>
<guid>https://arxiv.org/abs/2604.13954v1#2026-04-16#agent</guid>
<pubDate>Thu, 16 Apr 2026 11:43:00 +0800</pubDate>
<description>Existing agent-safety evaluation has focused mainly on externally induced risks. Yet agents may still enter unsafe trajectories under benign conditions. We study this complementary but underexplored setting through the lens of intrinsic risk, where intrinsic failures remain latent, propagate across long-horizon execution, and eventually lead to high-consequence outcomes. To evaluate this setting, we introduce non-attack intrinsic risk auditing and present HINTBench, a ben…</description>
</item>
<item>
<title>Character Beyond Speech: Leveraging Role-Playing Evaluation in Audio Large Language Models via Reinforcement Learning</title>
<link>../papers/arxiv-82411c54ef00.html</link>
<guid>https://arxiv.org/abs/2604.13804v1#2026-04-16#agent</guid>
<pubDate>Thu, 16 Apr 2026 11:43:00 +0800</pubDate>
<description>The rapid evolution of multimodal large models has revolutionized the simulation of diverse characters in speech dialogue systems, enabling a novel interactive paradigm. Character attributes are manifested not only in textual responses but also through vocal features, as speech conveys rich paralinguistic information that is challenging to quantify. This poses significant difficulties in evaluating the character alignment of role-playing agents. To address these challenges, we present RoleJudge…</description>
</item>
<item>
<title>TREX: Automating LLM Fine-tuning via Agent-Driven Tree-based Exploration</title>
<link>../papers/arxiv-7436d1e41f94.html</link>
<guid>https://arxiv.org/abs/2604.14116v1#2026-04-16#agent</guid>
<pubDate>Thu, 16 Apr 2026 11:43:00 +0800</pubDate>
<description>While Large Language Models (LLMs) have empowered AI research agents to perform isolated scientific tasks, automating complex, real-world workflows, such as LLM training, remains a significant challenge. In this paper, we introduce TREX, a multi-agent system that automates the entire LLM training life-cycle. By orchestrating collaboration between two core modules-the Researcher and the Executor-the system seamlessly performs requirement analysis, open-domain literature and data research, formul…</description>
</item>
<item>
<title>The cognitive companion: a lightweight parallel monitoring architecture for detecting and recovering from reasoning degradation in LLM agents</title>
<link>../papers/arxiv-7da57578b1cc.html</link>
<guid>https://arxiv.org/abs/2604.13759v1#2026-04-16#agent</guid>
<pubDate>Thu, 16 Apr 2026 11:43:00 +0800</pubDate>
<description>Large language model (LLM) agents on multi-step tasks suffer reasoning degradation (looping, drift, stuck states) at rates up to 30% on hard tasks. Current solutions include hard step limits (abrupt) or LLM-as-judge monitoring (10-15% overhead per step). This paper introduces the Cognitive Companion, a parallel monitoring architecture with two implementations: an LLM-based Companion and a novel zero-overhead Probe-based Companion. We report a three-batch feasibility study centered on Gemma 4 E4…</description>
</item>
<item>
<title>Doc-V*: Coarse-to-Fine Interactive Visual Reasoning for Multi-Page Document VQA</title>
<link>../papers/arxiv-08a10c45b30f.html</link>
<guid>https://arxiv.org/abs/2604.13731v1#2026-04-16#agent</guid>
<pubDate>Thu, 16 Apr 2026 11:43:00 +0800</pubDate>
<description>Multi-page Document Visual Question Answering requires reasoning over semantics, layouts, and visual elements in long, visually dense documents. Existing OCR-free methods face a trade-off between capacity and precision: end-to-end models scale poorly with document length, while visual retrieval-based pipelines are brittle and passive. We propose Doc-V*, an OCR-free agentic framework that casts multi-page DocVQA as sequential evidence aggregation. Doc-V* begins with a thumbnail ov…</description>
</item>
<item>
<title>Memory Transfer Learning: How Memories are Transferred Across Domains in Coding Agents</title>
<link>../papers/arxiv-d93ad6a1678e.html</link>
<guid>https://arxiv.org/abs/2604.14004v1#2026-04-16#agent</guid>
<pubDate>Thu, 16 Apr 2026 11:43:00 +0800</pubDate>
<description>Memory-based self-evolution has emerged as a promising paradigm for coding agents. However, existing approaches typically restrict memory utilization to homogeneous task domains, failing to leverage the shared infrastructural foundations, such as runtime environments and programming languages, that exist across diverse real-world coding problems. To address this limitation, we investigate Memory Transfer Learning (MTL) by harnessing a unified memory pool from heterogeneous domains. We…</description>
</item>
<item>
<title>ToolOmni: Enabling Open-World Tool Use via Agentic Learning with Proactive Retrieval and Grounded Execution</title>
<link>../papers/arxiv-228d06064606.html</link>
<guid>https://arxiv.org/abs/2604.13787v1#2026-04-16#agent</guid>
<pubDate>Thu, 16 Apr 2026 11:43:00 +0800</pubDate>
<description>Large Language Models (LLMs) enhance their problem-solving capability by utilizing external tools. However, in open-world scenarios with massive and evolving tool repositories, existing methods relying on static embedding retrieval or parameter memorization of tools struggle to align user intent with tool semantics or generalize to unseen tools, respectively, leading to suboptimal accuracy of open-world tool retrieval and execution. To address these, we present ToolOmni, a unified agentic frame…</description>
</item>
<item>
<title>POINTS-Seeker: Towards Training a Multimodal Agentic Search Model from Scratch</title>
<link>../papers/arxiv-cb64e4ef9ea1.html</link>
<guid>https://arxiv.org/abs/2604.14029v1#2026-04-16#agent</guid>
<pubDate>Thu, 16 Apr 2026 11:43:00 +0800</pubDate>
<description>While Large Multimodal Models (LMMs) demonstrate impressive visual perception, they remain epistemically constrained by their static parametric knowledge. To transcend these boundaries, multimodal search models have been adopted to actively interact with the external environment for evidence retrieval. Diverging from prevailing paradigms that merely retrofit general LMMs with search tools as modular extensions, we explore the potential of building a multimodal agentic search model from scratch.…</description>
</item>
<item>
<title>A Multi-AI Agent Framework for Interactive Neurosurgical Education and Evaluation: From Vignettes to Virtual Conversations.</title>
<link>../papers/doi-1c2530337309.html</link>
<guid>https://pubmed.ncbi.nlm.nih.gov/41982325/#2026-04-16#agent</guid>
<pubDate>Thu, 16 Apr 2026 11:43:00 +0800</pubDate>
<description>BACKGROUND AND OBJECTIVES: Traditional medical board examinations present clinical information in static vignettes with multiple-choice (MC) questions, fundamentally different from how physicians gather and integrate data in practice. Recent advances in large language models (LLMs) offer promising approaches to creating more realistic clinical interactive conversations. However, these approaches are limited in neurosurgery, where patient communication capacity varies significantly and diagnosis heavily…</description>
</item>
<item>
<title>Parallax: Why AI Agents That Think Must Never Act</title>
<link>../papers/arxiv-fe385734239d.html</link>
<guid>https://arxiv.org/abs/2604.12986v1#2026-04-15#agent</guid>
<pubDate>Wed, 15 Apr 2026 11:35:50 +0800</pubDate>
<description>Autonomous AI agents are rapidly transitioning from experimental tools to operational infrastructure, with projections that 80% of enterprise applications will embed AI copilots by the end of 2026. As agents gain the ability to execute real-world actions (reading files, running commands, making network requests, modifying databases), a fundamental security gap has emerged. The dominant approach to agent safety relies on prompt-level guardrails: natural language instructions that operate at the…</description>
</item>
<item>
<title>Drawing on Memory: Dual-Trace Encoding Improves Cross-Session Recall in LLM Agents</title>
<link>../papers/arxiv-7666c1ca118e.html</link>
<guid>https://arxiv.org/abs/2604.12948v1#2026-04-15#agent</guid>
<pubDate>Wed, 15 Apr 2026 11:35:50 +0800</pubDate>
<description>LLM agents with persistent memory store information as flat factual records, providing little context for temporal reasoning, change tracking, or cross-session aggregation. Inspired by the drawing effect [3], we introduce dual-trace memory encoding. In this method, each stored fact is paired with a concrete scene trace, a narrative reconstruction of the moment and context in which the information was learned. The agent is forced to commit to specific contextual details during encoding, creating…</description>
</item>
<item>
<title>Towards Long-horizon Agentic Multimodal Search</title>
<link>../papers/arxiv-c584099374f8.html</link>
<guid>https://arxiv.org/abs/2604.12890v1#2026-04-15#agent</guid>
<pubDate>Wed, 15 Apr 2026 11:35:50 +0800</pubDate>
<description>Multimodal deep search agents have shown great potential in solving complex tasks by iteratively collecting textual and visual evidence. However, managing the heterogeneous information and high token costs associated with multimodal inputs over long horizons remains a critical challenge, as existing methods often suffer from context explosion or the loss of crucial visual signals. To address this, we propose a novel Long-horizon MultiModal deep search framework, named LMM-Searcher, centered on…</description>
</item>
<item>
<title>QuarkMedSearch: A Long-Horizon Deep Search Agent for Exploring Medical Intelligence</title>
<link>../papers/arxiv-b54bdcbc3afb.html</link>
<guid>https://arxiv.org/abs/2604.12867v1#2026-04-15#agent</guid>
<pubDate>Wed, 15 Apr 2026 11:35:50 +0800</pubDate>
<description>As agentic foundation models continue to evolve, how to further improve their performance in vertical domains has become an important challenge. To this end, building upon Tongyi DeepResearch, a powerful agentic foundation model, we focus on the Chinese medical deep search scenario and propose QuarkMedSearch, systematically exploring a full-pipeline approach spanning medical multi-hop data construction, training strategies, and evaluation benchmarks to further push and assess its performance up…</description>
</item>
<item>
<title>ARGOS: Who, Where, and When in Agentic Multi-Camera Person Search</title>
<link>../papers/arxiv-7083910d7cb6.html</link>
<guid>https://arxiv.org/abs/2604.12762v1#2026-04-15#agent</guid>
<pubDate>Wed, 15 Apr 2026 11:35:50 +0800</pubDate>
<description>We introduce ARGOS, the first benchmark and framework that reformulates multi-camera person search as an interactive reasoning problem requiring an agent to plan, question, and eliminate candidates under information asymmetry. An ARGOS agent receives a vague witness statement and must decide what to ask, when to invoke spatial or temporal tools, and how to interpret ambiguous responses, all within a limited turn budget. Reasoning is grounded in a Spatio-Temporal Topology Graph (STTG) encoding c…</description>
</item>
<item>
<title>Transferable Expertise for Autonomous Agents via Real-World Case-Based Learning</title>
<link>../papers/arxiv-101a9d38d4bd.html</link>
<guid>https://arxiv.org/abs/2604.12717v1#2026-04-15#agent</guid>
<pubDate>Wed, 15 Apr 2026 11:35:50 +0800</pubDate>
<description>LLM-based autonomous agents perform well on general reasoning tasks but still struggle to reliably use task structure, key constraints, and prior experience in complex real-world settings. We propose a case-based learning framework that converts experience from past tasks into reusable knowledge assets, allowing agents to transfer prior case experience to new tasks and perform more structured analysis. Unlike methods based mainly on pretrained knowledge or static prompts, our framework emphasiz…</description>
</item>
<item>
<title>Cycle-Consistent Search: Question Reconstructability as a Proxy Reward for Search Agent Training</title>
<link>../papers/arxiv-6b1d979a5981.html</link>
<guid>https://arxiv.org/abs/2604.12967v1#2026-04-15#agent</guid>
<pubDate>Wed, 15 Apr 2026 11:35:50 +0800</pubDate>
<description>Reinforcement Learning (RL) has shown strong potential for optimizing search agents in complex information retrieval tasks. However, existing approaches predominantly rely on gold supervision, such as ground-truth answers, which is difficult to scale. To address this limitation, we propose Cycle-Consistent Search (CCS), a gold-supervision-free framework for training search agents, inspired by cycle-consistency techniques from unsupervised machine translation and image-to-image translation. Our…</description>
</item>
<item>
<title>Don&#x27;t Show Pixels, Show Cues: Unlocking Visual Tool Reasoning in Language Models via Perception Programs</title>
<link>../papers/arxiv-d7c2dcff959d.html</link>
<guid>https://arxiv.org/abs/2604.12896v1#2026-04-15#agent</guid>
<pubDate>Wed, 15 Apr 2026 11:35:50 +0800</pubDate>
<description>Multimodal language models (MLLMs) are increasingly paired with vision tools (e.g., depth, flow, correspondence) to enhance visual reasoning. However, despite access to these tool-generated visual cues, MLLMs often fail to benefit from them. Existing approaches typically feed raw tool outputs into the model, but these dense, pixel-level representations are misaligned with the language-native reasoning strengths of LLMs, leading to weak perception and reliance on language priors. We argue that,…</description>
</item>
<item>
<title>LIFE -- an energy efficient advanced continual learning agentic AI framework for frontier systems</title>
<link>../papers/arxiv-e3e44107b49f.html</link>
<guid>https://arxiv.org/abs/2604.12874v1#2026-04-15#agent</guid>
<pubDate>Wed, 15 Apr 2026 11:35:50 +0800</pubDate>
<description>The rapid advancement of AI has changed the character of HPC usage, such as dimensioning, provisioning, and execution. Not only has energy demand been amplified, but existing rudimentary continual learning capabilities limit the ability of AI to effectively manage HPC systems. This paper reviews emerging directions beyond monolithic transformers, emphasizing agentic AI and brain-inspired architectures as complementary paths toward sustainable, adaptive systems. We propose LIFE, a reasoning and Learning fra…</description>
</item>
<item>
<title>Toward Autonomous Long-Horizon Engineering for ML Research</title>
<link>../papers/arxiv-56f9e6c49a6d.html</link>
<guid>https://arxiv.org/abs/2604.13018v1#2026-04-15#agent</guid>
<pubDate>Wed, 15 Apr 2026 11:35:50 +0800</pubDate>
<description>Autonomous AI research has advanced rapidly, but long-horizon ML research engineering remains difficult: agents must sustain coherent progress across task comprehension, environment setup, implementation, experimentation, and debugging over hours or days. We introduce AiScientist, a system for autonomous long-horizon engineering for ML research built on a simple principle: strong long-horizon performance requires both structured orchestration and durable state continuity. To this end, AiScienti…</description>
</item>
</channel>
</rss>
