<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
<channel>
<title>evaluation Topic Archive</title>
<link>evaluation.html</link>
<description>Long-term tracking RSS feed for the keyword "evaluation", aggregating all historically matched papers.</description>
<language>zh-CN</language>
<lastBuildDate>Wed, 22 Apr 2026 03:37:20 +0000</lastBuildDate>
<item>
<title>Four-Axis Decision Alignment for Long-Horizon Enterprise AI Agents</title>
<link>../papers/arxiv-d363006cb185.html</link>
<guid>https://arxiv.org/abs/2604.19457v1#2026-04-22#evaluation</guid>
<pubDate>Wed, 22 Apr 2026 11:37:03 +0800</pubDate>
<description>Long-horizon enterprise agents make high-stakes decisions (loan underwriting, claims adjudication, clinical review, prior authorization) under lossy memory, multi-step reasoning, and binding regulatory constraints. Current evaluation reports a single task-success scalar that conflates distinct failure modes and hides whether an agent is aligned with the standards its deployment environment requires. We propose that long-horizon decision behavior decomposes into four orthogonal alignment axes, e…</description>
</item>
<item>
<title>Cyber Defense Benchmark: Agentic Threat Hunting Evaluation for LLMs in SecOps</title>
<link>../papers/arxiv-66f4fae6bbd8.html</link>
<guid>https://arxiv.org/abs/2604.19533v1#2026-04-22#evaluation</guid>
<pubDate>Wed, 22 Apr 2026 11:37:03 +0800</pubDate>
<description>We introduce the Cyber Defense Benchmark, a benchmark for measuring how well large language model (LLM) agents perform the core SOC analyst task of threat hunting: given a database of raw Windows event logs with no guided questions or hints, identify the exact timestamps of malicious events. The benchmark wraps 106 real attack procedures from the OTRF Security-Datasets corpus - spanning 86 MITRE ATT&amp;CK sub-techniques across 12 tactics - into a Gymnasium reinforcement-learning environment. Each…</description>
</item>
<item>
<title>Revac: A Social Deduction Reasoning Agent</title>
<link>../papers/arxiv-49c0fe8adf77.html</link>
<guid>https://arxiv.org/abs/2604.19523v1#2026-04-22#evaluation</guid>
<pubDate>Wed, 22 Apr 2026 11:37:03 +0800</pubDate>
<description>Social deduction games such as Mafia present a unique AI challenge: players must reason under uncertainty, interpret incomplete and intentionally misleading information, evaluate human-like communication, and make strategic elimination decisions. Unlike deterministic board games, success in Mafia depends not on perfect information or brute-force search, but on inference, memory, and adaptability in the presence of deception. This work presents the design and evaluation of Revac-8, an AI agent d…</description>
</item>
<item>
<title>Beyond Rating: A Comprehensive Evaluation and Benchmark for AI Reviews</title>
<link>../papers/arxiv-dcac53916c57.html</link>
<guid>https://arxiv.org/abs/2604.19502v1#2026-04-22#evaluation</guid>
<pubDate>Wed, 22 Apr 2026 11:37:03 +0800</pubDate>
<description>The rapid adoption of Large Language Models (LLMs) has spurred interest in automated peer review; however, progress is currently stifled by benchmarks that treat reviewing primarily as a rating prediction task. We argue that the utility of a review lies in its textual justification--its arguments, questions, and critique--rather than a scalar score. To address this, we introduce Beyond Rating, a holistic evaluation framework that assesses AI reviewers across five dimensions: Content Faithfulnes…</description>
</item>
<item>
<title>A-MAR: Agent-based Multimodal Art Retrieval for Fine-Grained Artwork Understanding</title>
<link>../papers/arxiv-5fe8f705aa06.html</link>
<guid>https://arxiv.org/abs/2604.19689v1#2026-04-22#evaluation</guid>
<pubDate>Wed, 22 Apr 2026 11:37:03 +0800</pubDate>
<description>Understanding artworks requires multi-step reasoning over visual content and cultural, historical, and stylistic context. While recent multimodal large language models show promise in artwork explanation, they rely on implicit reasoning and internalized knowledge, limiting interpretability and explicit evidence grounding. We propose A-MAR, an Agent-based Multimodal Art Retrieval framework that explicitly conditions retrieval on structured reasoning plans. Given an artwork and a user query, A-…</description>
</item>
<item>
<title>Multi-modal Reasoning with LLMs for Visual Semantic Arithmetic</title>
<link>../papers/arxiv-424f40f3b425.html</link>
<guid>https://arxiv.org/abs/2604.19567v1#2026-04-22#evaluation</guid>
<pubDate>Wed, 22 Apr 2026 11:37:03 +0800</pubDate>
<description>Reinforcement learning (RL) as post-training is crucial for enhancing the reasoning ability of large language models (LLMs) in coding and math. However, their capacity for visual semantic arithmetic, inferring relationships from images, remains underexplored. The classic text analogy &quot;king&quot;-&quot;man&quot;+&quot;woman&quot; = &quot;queen&quot; illustrates relational reasoning, yet replacing text with images of &quot;king&quot; and &quot;man&quot; significantly reduces performance because it requires commonsense knowledge and the extraction of…</description>
</item>
<item>
<title>From Experience to Skill: Multi-Agent Generative Engine Optimization via Reusable Strategy Learning</title>
<link>../papers/arxiv-f8c71869303c.html</link>
<guid>https://arxiv.org/abs/2604.19516v1#2026-04-22#evaluation</guid>
<pubDate>Wed, 22 Apr 2026 11:37:03 +0800</pubDate>
<description>Generative engines (GEs) are reshaping information access by replacing ranked links with citation-grounded answers, yet current Generative Engine Optimization (GEO) methods optimize each instance in isolation, unable to accumulate or transfer effective strategies across tasks and engines. We reframe GEO as a strategy learning problem and propose MAGEO, a multi-agent framework in which coordinated planning, editing, and fidelity-aware evaluation serve as the execution layer, while validated edit…</description>
</item>
<item>
<title>SafetyALFRED: Evaluating Safety-Conscious Planning of Multimodal Large Language Models</title>
<link>../papers/arxiv-6f4a587095d1.html</link>
<guid>https://arxiv.org/abs/2604.19638v1#2026-04-22#evaluation</guid>
<pubDate>Wed, 22 Apr 2026 11:37:03 +0800</pubDate>
<description>Multimodal Large Language Models are increasingly adopted as autonomous agents in interactive environments, yet their ability to proactively address safety hazards remains insufficient. We introduce SafetyALFRED, built upon the embodied agent benchmark ALFRED, augmented with six categories of real-world kitchen hazards. While existing safety evaluations focus on hazard recognition through disembodied question answering (QA) settings, we evaluate eleven state-of-the-art models from the Qwen, Gem…</description>
</item>
<item>
<title>Time Series Augmented Generation for Financial Applications</title>
<link>../papers/arxiv-a14f6e5fa3da.html</link>
<guid>https://arxiv.org/abs/2604.19633v1#2026-04-22#evaluation</guid>
<pubDate>Wed, 22 Apr 2026 11:37:03 +0800</pubDate>
<description>Evaluating the reasoning capabilities of Large Language Models (LLMs) for complex, quantitative financial tasks is a critical and unsolved challenge. Standard benchmarks often fail to isolate an agent&#x27;s core ability to parse queries and orchestrate computations. To address this, we introduce a novel evaluation methodology and benchmark designed to rigorously measure an LLM agent&#x27;s reasoning for financial time-series analysis. We apply this methodology in a large-scale empirical study using our…</description>
</item>
<item>
<title>From Top-1 to Top-K: A Reproducibility Study and Benchmarking of Counterfactual Explanations for Recommender Systems</title>
<link>../papers/doi-feed310756b2.html</link>
<guid>https://arxiv.org/abs/2604.19663v1#2026-04-22#evaluation</guid>
<pubDate>Wed, 22 Apr 2026 11:37:03 +0800</pubDate>
<description>Counterfactual explanations (CEs) provide an intuitive way to understand recommender systems by identifying minimal modifications to user-item interactions that alter recommendation outcomes. Existing CE methods for recommender systems, however, have been evaluated under heterogeneous protocols, using different datasets, recommenders, metrics, and even explanation formats, which hampers reproducibility and fair comparison. Our paper systematically reproduces, re-implements, and re-evaluates eleve…</description>
</item>
<item>
<title>Lost in Translation: Do LVLM Judges Generalize Across Languages?</title>
<link>../papers/arxiv-542a2e2a02e6.html</link>
<guid>https://arxiv.org/abs/2604.19405v1#2026-04-22#evaluation</guid>
<pubDate>Wed, 22 Apr 2026 11:37:03 +0800</pubDate>
<description>Automatic evaluators such as reward models play a central role in the alignment and evaluation of large vision-language models (LVLMs). Despite their growing importance, these evaluators are almost exclusively assessed on English-centric benchmarks, leaving open the question of how well these evaluators generalize across languages. To answer this question, we introduce MM-JudgeBench, the first large-scale benchmark for multilingual and multimodal judge model evaluation, which includes over 60K…</description>
</item>
<item>
<title>How Far Are Video Models from True Multimodal Reasoning?</title>
<link>../papers/arxiv-f1cd701c6156.html</link>
<guid>https://arxiv.org/abs/2604.19193v1#2026-04-22#evaluation</guid>
<pubDate>Wed, 22 Apr 2026 11:37:03 +0800</pubDate>
<description>Despite remarkable progress toward general-purpose video models, a critical question remains unanswered: how far are these models from achieving true multimodal reasoning? Existing benchmarks fail to address this question rigorously, as they remain constrained by straightforward task designs and fragmented evaluation metrics that neglect complex multimodal reasoning. To bridge this gap, we introduce CLVG-Bench, an evaluation framework designed to probe video models&#x27; zero-shot reasoning capabili…</description>
</item>
<item>
<title>EgoMotion: Hierarchical Reasoning and Diffusion for Egocentric Vision-Language Motion Generation</title>
<link>../papers/arxiv-3a3bbc2e6e3a.html</link>
<guid>https://arxiv.org/abs/2604.19105v1#2026-04-22#evaluation</guid>
<pubDate>Wed, 22 Apr 2026 11:37:03 +0800</pubDate>
<description>Faithfully modeling human behavior in dynamic environments is a foundational challenge for embodied intelligence. While conditional motion synthesis has achieved significant advances, egocentric motion generation remains largely underexplored due to the inherent complexity of first-person perception. In this work, we investigate Egocentric Vision-Language (Ego-VL) motion generation. This task requires synthesizing 3D human motion conditioned jointly on first-person visual observations and natur…</description>
</item>
<item>
<title>Enhancing large language model clinical support information with machine learning risk and explainability: a feasibility study.</title>
<link>../papers/doi-eefd4e77621d.html</link>
<guid>https://pubmed.ncbi.nlm.nih.gov/42012584/#2026-04-22#evaluation</guid>
<pubDate>Wed, 22 Apr 2026 11:37:03 +0800</pubDate>
<description>BACKGROUND: Current machine learning (ML) prediction models offer limited guidance for individualized actionable management. Large language models (LLMs) can transform ML model-predicted risk estimates with Shapley Additive Explanations (SHAP) into clinically meaningful support information, yet the added value of incorporating ML-derived data and the relative performance of different LLMs remain uncertain. To address these gaps, we used our previously developed IMPACT framework to evaluate the…</description>
</item>
<item>
<title>APSevLM: Acute Pancreatitis Severity Language Model.</title>
<link>../papers/doi-e00fc28ccec0.html</link>
<guid>https://pubmed.ncbi.nlm.nih.gov/42013267/#2026-04-22#evaluation</guid>
<pubDate>Wed, 22 Apr 2026 11:37:03 +0800</pubDate>
<description>Approximately one-fifth of patients with acute pancreatitis (AP) develop severe forms, which are associated with high mortality rates, making early prediction of severity crucial for effective patient management. In this study, we present APSevLM (Acute Pancreatitis Severity Language Model), a large language model (LLM)-based approach that integrates admission-time clinical data, imaging reports, and expert knowledge to predict AP severity at an early stage. Through a comprehensive evaluation u…</description>
</item>
<item>
<title>Benchmarking System Dynamics AI Assistants: Cloud Versus Local LLMs on CLD Extraction and Discussion</title>
<link>../papers/arxiv-df421e6da9eb.html</link>
<guid>https://arxiv.org/abs/2604.18566v1#2026-04-21#evaluation</guid>
<pubDate>Tue, 21 Apr 2026 11:40:46 +0800</pubDate>
<description>We present a systematic evaluation of large language model families -- spanning both proprietary cloud APIs and locally-hosted open-source models -- on two purpose-built benchmarks for System Dynamics AI assistance: the CLD Leaderboard (53 tests, structured causal loop diagram extraction) and the Discussion Leaderboard (interactive model discussion, feedback explanation, and model building coaching). On CLD extraction, cloud models achieve 77-89% overall pass rates; the best…</description>
</item>
<item>
<title>ClawEnvKit: Automatic Environment Generation for Claw-Like Agents</title>
<link>../papers/arxiv-f83cd96fcc3e.html</link>
<guid>https://arxiv.org/abs/2604.18543v1#2026-04-21#evaluation</guid>
<pubDate>Tue, 21 Apr 2026 11:40:46 +0800</pubDate>
<description>Constructing environments for training and evaluating claw-like agents remains a manual, human-intensive process that does not scale. We argue that what is needed is not just a dataset, but an automated pipeline capable of generating diverse, verified environments on demand. To this end, we introduce ClawEnvKit, an autonomous generation pipeline that instantiates this formalism from natural language descriptions. The pipeline comprises three modules: (1) a parser that extracts structured genera…</description>
</item>
<item>
<title>ComPASS: Towards Personalized Agentic Social Support via Tool-Augmented Companionship</title>
<link>../papers/arxiv-7ffafd0c2863.html</link>
<guid>https://arxiv.org/abs/2604.18356v1#2026-04-21#evaluation</guid>
<pubDate>Tue, 21 Apr 2026 11:40:46 +0800</pubDate>
<description>Developing compassionate interactive systems requires agents to not only understand user emotions but also provide diverse, substantive support. While recent works explore empathetic dialogue generation, they remain limited in response form and content, struggling to satisfy diverse needs across users and contexts. To address this, we explore empowering agents with external tools to execute diverse actions. Grounded in the psychological concept of &quot;social support&quot;, this paradigm delivers substa…</description>
</item>
<item>
<title>Semantic Step Prediction: Multi-Step Latent Forecasting in LLM Reasoning Trajectories via Step Sampling</title>
<link>../papers/arxiv-4797a7249e58.html</link>
<guid>https://arxiv.org/abs/2604.18464v1#2026-04-21#evaluation</guid>
<pubDate>Tue, 21 Apr 2026 11:40:46 +0800</pubDate>
<description>Semantic Tube Prediction (STP) leverages representation geometry to regularize LLM hidden-state trajectories toward locally linear geodesics during fine-tuning, thereby greatly improving data efficiency. The original STP recipe samples random token sub-spans, which is compatible with the base large language model (LLM) training architecture. Inspired by STP, we investigate whether the sampling position can further enhance the semantic structure of multi-step reasoning, and he…</description>
</item>
<item>
<title>Multilingual Training and Evaluation Resources for Vision-Language Models</title>
<link>../papers/arxiv-bb0f0a1b4a2e.html</link>
<guid>https://arxiv.org/abs/2604.18347v1#2026-04-21#evaluation</guid>
<pubDate>Tue, 21 Apr 2026 11:40:46 +0800</pubDate>
<description>Vision Language Models (VLMs) have achieved rapid progress in recent years. However, despite this growth, VLM development is heavily grounded in English, leading to two main limitations: (i) the lack of multilingual and multimodal datasets for training, and (ii) the scarcity of comprehensive evaluation benchmarks across languages. In this work, we address these gaps by introducing a new comprehensive suite of resources for VLM training and evaluation spanning five European languages (English,…</description>
</item>
<item>
<title>On the Importance and Evaluation of Narrativity in Natural Language AI Explanations</title>
<link>../papers/arxiv-6eff757730ed.html</link>
<guid>https://arxiv.org/abs/2604.18311v1#2026-04-21#evaluation</guid>
<pubDate>Tue, 21 Apr 2026 11:40:46 +0800</pubDate>
<description>Explainable AI (XAI) aims to make the behaviour of machine learning models interpretable, yet many explanation methods remain difficult to understand. The integration of Natural Language Generation into XAI aims to deliver explanations in textual form, making them more accessible to practitioners. Current approaches, however, largely yield static lists of feature importances. Although such explanations indicate what influences the prediction, they do not explain why the prediction occurs. In th…</description>
</item>
<item>
<title>OmniHuman: A Large-scale Dataset and Benchmark for Human-Centric Video Generation</title>
<link>../papers/arxiv-1f905a979d75.html</link>
<guid>https://arxiv.org/abs/2604.18326v1#2026-04-21#evaluation</guid>
<pubDate>Tue, 21 Apr 2026 11:40:46 +0800</pubDate>
<description>Recent advancements in audio-video joint generation models have demonstrated impressive capabilities in content creation. However, generating high-fidelity human-centric videos in complex, real-world physical scenes remains a significant challenge. We identify that the root cause lies in the structural deficiencies of existing datasets across three dimensions: limited global scene and camera diversity, sparse interaction modeling (both person-person and person-object), and insufficient individu…</description>
</item>
<item>
<title>Transforming oncology clinical trial matching through neuro-symbolic, multi-agent AI and an oncology-specific knowledge graph: a prospective evaluation in 3804 patients.</title>
<link>../papers/doi-a39ecce65f3a.html</link>
<guid>https://pubmed.ncbi.nlm.nih.gov/42004487/#2026-04-21#evaluation</guid>
<pubDate>Tue, 21 Apr 2026 11:40:46 +0800</pubDate>
<description>BACKGROUND: Clinical trial enrollment in oncology remains critically low, with fewer than 5% of eligible adults participating, in large part due to the complexity and labor intensity of eligibility screening. We prospectively evaluated a neuro-symbolic, multi-agent artificial intelligence (AI) platform integrating domain-specific large language model (LLM) agents, an oncology-specific knowledge graph, a real-time recommendation engine, and human-in-the-loop review to determine whether automated…</description>
</item>
<item>
<title>A Comparative Evaluation of Three Large Language Models for Parent-Centered Questions About Anorexia Nervosa.</title>
<link>../papers/doi-db4a2a7daf35.html</link>
<guid>https://pubmed.ncbi.nlm.nih.gov/42003757/#2026-04-21#evaluation</guid>
<pubDate>Tue, 21 Apr 2026 11:40:46 +0800</pubDate>
<description>BACKGROUND: Large language models (LLMs) are increasingly used to obtain health information, including guidance on child and adolescent mental health. In anorexia nervosa (AN), where early recognition and timely intervention are critical, the accuracy of AI-generated information available to parents may have important clinical implications. This study evaluated the performance of LLMs in responding to parent-oriented questions about AN. METHODS: A comparative model evaluation was conducted usin…</description>
</item>
<item>
<title>An explainable multi-head attention network for healthcare IoT threat detection based on the MedDefender-MHAN framework.</title>
<link>../papers/doi-ff821e86a727.html</link>
<guid>https://pubmed.ncbi.nlm.nih.gov/41996403/#2026-04-18#evaluation</guid>
<pubDate>Sat, 18 Apr 2026 11:26:55 +0800</pubDate>
<description>The rapid proliferation of Internet of Medical Things (IoMT) devices in healthcare environments has created critical cybersecurity vulnerabilities that demand both accurate and interpretable intrusion detection solutions. Existing deep learning-based intrusion detection systems (IDS) achieve high detection accuracy but lack inherent explainability, limiting their clinical adoption under regulatory frameworks such as GDPR and FDA guidelines. This paper presents MedDefender-MHAN, an explainable m…</description>
</item>
<item>
<title>QuantCode-Bench: A Benchmark for Evaluating the Ability of Large Language Models to Generate Executable Algorithmic Trading Strategies</title>
<link>../papers/arxiv-60286bc4afdd.html</link>
<guid>https://arxiv.org/abs/2604.15151v1#2026-04-17#evaluation</guid>
<pubDate>Fri, 17 Apr 2026 11:39:21 +0800</pubDate>
<description>Large language models have demonstrated strong performance on general-purpose programming tasks, yet their ability to generate executable algorithmic trading strategies remains underexplored. Unlike standard code benchmarks, trading-strategy generation requires simultaneous mastery of domain-specific financial logic, knowledge of a specialized API, and the ability to produce code that is not only syntactically correct but also leads to actual trades on historical data. In this work, we present…</description>
</item>
<item>
<title>From Reactive to Proactive: Assessing the Proactivity of Voice Agents via ProVoice-Bench</title>
<link>../papers/arxiv-913915b00c96.html</link>
<guid>https://arxiv.org/abs/2604.15037v1#2026-04-17#evaluation</guid>
<pubDate>Fri, 17 Apr 2026 11:39:21 +0800</pubDate>
<description>Recent advancements in LLM agents are gradually shifting from reactive, text-based paradigms toward proactive, multimodal interaction. However, existing benchmarks primarily focus on reactive responses, overlooking the complexities of proactive intervention and monitoring. To bridge this gap, we introduce ProVoice-Bench, the first evaluation framework specifically designed for proactive voice agents, featuring four novel tasks. By leveraging a multi-stage data synthesis pipeline, we curate 1,18…</description>
</item>
<item>
<title>An Axiomatic Benchmark for Evaluation of Scientific Novelty Metrics</title>
<link>../papers/arxiv-1b80284b2f1e.html</link>
<guid>https://arxiv.org/abs/2604.15145v1#2026-04-17#evaluation</guid>
<pubDate>Fri, 17 Apr 2026 11:39:21 +0800</pubDate>
<description>The rigorous evaluation of the novelty of a scientific paper is, even for human scientists, a challenging task. With the increasing interest in AI scientists and AI involvement in scientific idea generation and paper writing, it also becomes increasingly important that this task be automatable and reliable, lest both human attention and compute tokens be wasted on ideas that have already been explored. Due to the challenge of quantifying ground-truth novelty, however, existing novelty metrics f…</description>
</item>
<item>
<title>MM-WebAgent: A Hierarchical Multimodal Web Agent for Webpage Generation</title>
<link>../papers/arxiv-9f9995d5a903.html</link>
<guid>https://arxiv.org/abs/2604.15309v1#2026-04-17#evaluation</guid>
<pubDate>Fri, 17 Apr 2026 11:39:21 +0800</pubDate>
<description>The rapid progress of Artificial Intelligence Generated Content (AIGC) tools enables images, videos, and visualizations to be created on demand for webpage design, offering a flexible and increasingly adopted paradigm for modern UI/UX. However, directly integrating such tools into automated webpage generation often leads to style inconsistency and poor global coherence, as elements are generated in isolation. We propose MM-WebAgent, a hierarchical agentic framework for multimodal webpage genera…</description>
</item>
<item>
<title>Context Over Content: Exposing Evaluation Faking in Automated Judges</title>
<link>../papers/arxiv-0bc9230c8b6d.html</link>
<guid>https://arxiv.org/abs/2604.15224v1#2026-04-17#evaluation</guid>
<pubDate>Fri, 17 Apr 2026 11:39:21 +0800</pubDate>
<description>The LLM-as-a-judge paradigm has become the operational backbone of automated AI evaluation pipelines, yet rests on an unverified assumption: that judges evaluate text strictly on its semantic content, impervious to surrounding contextual framing. We investigate stakes signaling, a previously unmeasured vulnerability where informing a judge model of the downstream consequences its verdicts will have on the evaluated model&#x27;s continued operation systematically corrupts its as…</description>
</item>
<item>
<title>AI-Assisted Requirements Engineering: An Empirical Evaluation Relative to Expert Judgment</title>
<link>../papers/arxiv-400f736db53f.html</link>
<guid>https://arxiv.org/abs/2604.15222v1#2026-04-17#evaluation</guid>
<pubDate>Fri, 17 Apr 2026 11:39:21 +0800</pubDate>
<description>Artificial Intelligence is increasingly introduced into systems engineering activities, particularly within requirements engineering, where quality assessment and validation remain heavily dependent on expert judgment. While recent AI tools demonstrate promising capabilities in analyzing and generating requirements, their role within formal systems engineering processes, and their alignment with established INCOSE criteria, remains insufficiently understood. This paper investigates the extent to…</description>
</item>
<item>
<title>MADE: A Living Benchmark for Multi-Label Text Classification with Uncertainty Quantification of Medical Device Adverse Events</title>
<link>../papers/arxiv-4416c06e91a3.html</link>
<guid>https://arxiv.org/abs/2604.15203v1#2026-04-17#evaluation</guid>
<pubDate>Fri, 17 Apr 2026 11:39:21 +0800</pubDate>
<description>Machine learning in high-stakes domains such as healthcare requires not only strong predictive performance but also reliable uncertainty quantification (UQ) to support human oversight. Multi-label text classification (MLTC) is a central task in this domain, yet remains challenging due to label imbalances, dependencies, and combinatorial complexity. Existing MLTC benchmarks are increasingly saturated and may be affected by training data contamination, making it difficult to distinguish genuine r…</description>
</item>
<item>
<title>Autonomous Evolution of EDA Tools: Multi-Agent Self-Evolved ABC</title>
<link>../papers/arxiv-c894f3778ac6.html</link>
<guid>https://arxiv.org/abs/2604.15082v1#2026-04-17#evaluation</guid>
<pubDate>Fri, 17 Apr 2026 11:39:21 +0800</pubDate>
<description>This paper introduces the first self-evolving logic synthesis framework, which leverages Large Language Model (LLM) agents to autonomously improve the source code of ABC, the widely adopted logic synthesis system. Our framework operates on the entire integrated ABC codebase, and the output repository preserves its single-binary execution model and command interface. In the initial evolution cycle, we bootstrap the system using existing prior open-source synthesis componen…</description>
</item>
<item>
<title>Meituan Merchant Business Diagnosis via Policy-Guided Dual-Process User Simulation</title>
<link>../papers/doi-fda96f4fa371.html</link>
<guid>https://arxiv.org/abs/2604.15190v1#2026-04-17#evaluation</guid>
<pubDate>Fri, 17 Apr 2026 11:39:21 +0800</pubDate>
<description>Simulating group-level user behavior enables scalable counterfactual evaluation of merchant strategies without costly online experiments. However, building a trustworthy simulator faces two structural challenges. First, information incompleteness causes reasoning-based simulators to over-rationalize when unobserved factors such as offline context and implicit habits are missing. Second, mechanism duality requires capturing both interpretable preferences and implicit statistical regularities, wh…</description>
</item>
<item>
<title>RAD-2: Scaling Reinforcement Learning in a Generator-Discriminator Framework</title>
<link>../papers/arxiv-27b12af34ce1.html</link>
<guid>https://arxiv.org/abs/2604.15308v1#2026-04-17#evaluation</guid>
<pubDate>Fri, 17 Apr 2026 11:39:21 +0800</pubDate>
<description>High-level autonomous driving requires motion planners capable of modeling multimodal future uncertainties while remaining robust in closed-loop interactions. Although diffusion-based planners are effective at modeling complex trajectory distributions, they often suffer from stochastic instabilities and the lack of corrective negative feedback when trained purely with imitation learning. To address these issues, we propose RAD-2, a unified generator-discriminator framework for closed-loop plann…</description>
</item>
<item>
<title>Applying natural language processing and large language models to clinical notes for phenotyping and diagnosing rare diseases: a systematic review.</title>
<link>../papers/doi-caeec9f876b5.html</link>
<guid>https://pubmed.ncbi.nlm.nih.gov/41990239/#2026-04-17#evaluation</guid>
<pubDate>Fri, 17 Apr 2026 11:39:21 +0800</pubDate>
<description>OBJECTIVES: Patients with rare diseases often face long delays before receiving a diagnosis. Using electronic health records for automated phenotyping and diagnosis of rare diseases is a promising approach but can be challenging because critical information is often recorded in unstructured notes rather than structured fields. This systematic review synthesizes the current literature applying natural language processing (NLP) and large language models (LLMs) for rare disease phenotyping and dia…</description>
</item>
<item>
<title>Evaluation of large language models with clinical guidance for vetting outpatient magnetic resonance imaging lumbar spine referrals.</title>
<link>../papers/doi-2fe134b4d7bc.html</link>
<guid>https://pubmed.ncbi.nlm.nih.gov/41989203/#2026-04-17#evaluation</guid>
<pubDate>Fri, 17 Apr 2026 11:39:21 +0800</pubDate>
<description>OBJECTIVES: Accurate triage of lumbar spine magnetic resonance imaging (MRI) referrals for sciatica is important for patient assessment, diagnosis and surgical planning. This study evaluates the accuracy and speed of large language models (LLMs) in automatically vetting lumbar spine MRI referrals from general practice. METHODS: Three LLMs (GPT-4, Claude Opus, Gemini) were tasked with assigning an outcome (Accept - Routine, Accept - Urgent, Reject) and flagging MRI contraindications for lumbar spine…</description>
</item>
<item>
<title>Dual perspectives on large language models in rheumatology: physician-rated quality and patient-centered usability of GPT-4o versus DeepSeek-V3.</title>
<link>../papers/doi-fa629176d611.html</link>
<guid>https://pubmed.ncbi.nlm.nih.gov/41989204/#2026-04-17#evaluation</guid>
<pubDate>Fri, 17 Apr 2026 11:39:21 +0800</pubDate>
<description>OBJECTIVES: This study conducted an informatics system evaluation of two LLMs (GPT-4o and DeepSeek-V3) for patient education, combining clinician-rated quality with patient-perceived usability across thematically stratified queries. MATERIALS AND METHODS: In a blinded, within-subject design, 16 frequently asked questions about biologic therapies were categorized into three domains: treatment/drug selection, safety/adverse effects, and special conditions/daily life. Responses were standardized,…</description>
</item>
<item>
<title>GeoAgentBench: A Dynamic Execution Benchmark for Tool-Augmented Agents in Spatial Analysis</title>
<link>../papers/arxiv-283874153373.html</link>
<guid>https://arxiv.org/abs/2604.13888v1#2026-04-16#evaluation</guid>
<pubDate>Thu, 16 Apr 2026 11:43:00 +0800</pubDate>
<description>The integration of Large Language Models (LLMs) into Geographic Information Systems (GIS) marks a paradigm shift toward autonomous spatial analysis. However, evaluating these LLM-based agents remains challenging due to the complex, multi-step nature of geospatial workflows. Existing benchmarks primarily rely on static text or code matching, neglecting dynamic runtime feedback and the multimodal nature of spatial outputs. To address this gap, we introduce GeoAgentBench (GABench), a dynamic and i…</description>
</item>
<item>
<title>HINTBench: Horizon-agent Intrinsic Non-attack Trajectory Benchmark</title>
<link>../papers/arxiv-f6718acdd1da.html</link>
<guid>https://arxiv.org/abs/2604.13954v1#2026-04-16#evaluation</guid>
<pubDate>Thu, 16 Apr 2026 11:43:00 +0800</pubDate>
<description>Existing agent-safety evaluation has focused mainly on externally induced risks. Yet agents may still enter unsafe trajectories under benign conditions. We study this complementary but underexplored setting through the lens of intrinsic risk, where intrinsic failures remain latent, propagate across long-horizon execution, and eventually lead to high-consequence outcomes. To evaluate this setting, we introduce non-attack intrinsic risk auditing and present HINTBench, a ben…</description>
</item>
<item>
<title>Character Beyond Speech: Leveraging Role-Playing Evaluation in Audio Large Language Models via Reinforcement Learning</title>
<link>../papers/arxiv-82411c54ef00.html</link>
<guid>https://arxiv.org/abs/2604.13804v1#2026-04-16#evaluation</guid>
<pubDate>Thu, 16 Apr 2026 11:43:00 +0800</pubDate>
<description>The rapid evolution of multimodal large models has revolutionized the simulation of diverse characters in speech dialogue systems, enabling a novel interactive paradigm. Character attributes are manifested not only in textual responses but also through vocal features, as speech conveys rich paralinguistic information that is challenging to quantify. This poses significant difficulties in evaluating the character alignment of role-playing agents. To address these challenges, we present RoleJudge…</description>
</item>
<item>
<title>Correct Prediction, Wrong Steps? Consensus Reasoning Knowledge Graph for Robust Chain-of-Thought Synthesis</title>
<link>../papers/arxiv-da2e5b9f5c3e.html</link>
<guid>https://arxiv.org/abs/2604.14121v1#2026-04-16#evaluation</guid>
<pubDate>Thu, 16 Apr 2026 11:43:00 +0800</pubDate>
<description>LLM reasoning traces suffer from complex flaws -- *Step Internal Flaws* (logical errors, hallucinations, etc.) and *Step-wise Flaws* (overthinking, underthinking), which vary by sample. A natural approach would be to provide ground-truth labels to guide LLMs&#x27; reasoning. Contrary to intuition, we show that this yields no improvement in reasoning ability. We then propose CRAFT, a unified framework that mitigates both types of Step flaws, which builds a Reasoning Knowledge Graph (RKG) based on the…</description>
</item>
<item>
<title>TREX: Automating LLM Fine-tuning via Agent-Driven Tree-based Exploration</title>
<link>../papers/arxiv-7436d1e41f94.html</link>
<guid>https://arxiv.org/abs/2604.14116v1#2026-04-16#evaluation</guid>
<pubDate>Thu, 16 Apr 2026 11:43:00 +0800</pubDate>
<description>While Large Language Models (LLMs) have empowered AI research agents to perform isolated scientific tasks, automating complex, real-world workflows, such as LLM training, remains a significant challenge. In this paper, we introduce TREX, a multi-agent system that automates the entire LLM training life-cycle. By orchestrating collaboration between two core modules-the Researcher and the Executor-the system seamlessly performs requirement analysis, open-domain literature and data research, formul…</description>
</item>
<item>
<title>MUSE: Multi-Domain Chinese User Simulation via Self-Evolving Profiles and Rubric-Guided Alignment</title>
<link>../papers/arxiv-7021296a66d5.html</link>
<guid>https://arxiv.org/abs/2604.13828v1#2026-04-16#evaluation</guid>
<pubDate>Thu, 16 Apr 2026 11:43:00 +0800</pubDate>
<description>User simulators are essential for the scalable training and evaluation of interactive AI systems. However, existing approaches often rely on shallow user profiling, struggle to maintain persona consistency over long interactions, and are largely limited to English or single-domain settings. We present MUSE, a multi-domain Chinese user simulation framework designed to generate human-like, controllable, and behaviorally consistent responses. First, we propose Iterative Profile Self-Evolution (IPS…</description>
</item>
<item>
<title>MAny: Merge Anything for Multimodal Continual Instruction Tuning</title>
<link>../papers/arxiv-b488936a3be9.html</link>
<guid>https://arxiv.org/abs/2604.14016v1#2026-04-16#evaluation</guid>
<pubDate>Thu, 16 Apr 2026 11:43:00 +0800</pubDate>
<description>Multimodal Continual Instruction Tuning (MCIT) is essential for sequential task adaptation of Multimodal Large Language Models (MLLMs) but is severely restricted by catastrophic forgetting. While existing literature focuses on the reasoning language backbone, in this work, we expose a critical yet neglected dual-forgetting phenomenon across both perception drift in Cross-modal Projection Space and reasoning collapse in Low-rank Parameter Space. To resolve this, we present MAny (…</description>
</item>
<item>
<title>MedRCube: A Multidimensional Framework for Fine-Grained and In-Depth Evaluation of MLLMs in Medical Imaging</title>
<link>../papers/arxiv-309351a1c9e5.html</link>
<guid>https://arxiv.org/abs/2604.13756v1#2026-04-16#evaluation</guid>
<pubDate>Thu, 16 Apr 2026 11:43:00 +0800</pubDate>
<description>The potential of Multimodal Large Language Models (MLLMs) in the domain of medical imaging raises the demand for systematic and rigorous evaluation frameworks aligned with real-world medical imaging practice. Existing practices that report single or coarse-grained metrics lack the granularity required for specialized clinical support and fail to assess the reliability of reasoning mechanisms. To address this, we propose a paradigm shift toward multidimensional, fine-grained and in-d…</description>
</item>
<item>
<title>Who Gets Flagged? The Pluralistic Evaluation Gap in AI Content Watermarking</title>
<link>../papers/arxiv-5a504d65a980.html</link>
<guid>https://arxiv.org/abs/2604.13776v1#2026-04-16#evaluation</guid>
<pubDate>Thu, 16 Apr 2026 11:43:00 +0800</pubDate>
<description>Watermarking is becoming the default mechanism for AI content authentication, with governance policies and frameworks referencing it as infrastructure for content provenance. Yet across text, image, and audio modalities, watermark signal strength, detectability, and robustness depend on statistical properties of the content itself, properties that vary systematically across languages, cultural visual traditions, and demographic groups. We examine how this content dependence creates modality-spe…</description>
</item>
<item>
<title>ROSE: Retrieval-Oriented Segmentation Enhancement</title>
<link>../papers/arxiv-e008501b0fb5.html</link>
<guid>https://arxiv.org/abs/2604.14147v1#2026-04-16#evaluation</guid>
<pubDate>Thu, 16 Apr 2026 11:43:00 +0800</pubDate>
<description>Existing segmentation models based on multimodal large language models (MLLMs), such as LISA, often struggle with novel or emerging entities due to their inability to incorporate up-to-date knowledge. To address this challenge, we introduce the Novel Emerging Segmentation Task (NEST), which focuses on segmenting (i) novel entities that MLLMs fail to recognize due to their absence from training data, and (ii) emerging entities that exist within the model&#x27;s knowledge but demand up-to-date externa…</description>
</item>
<item>
<title>Seedance 2.0: Advancing Video Generation for World Complexity</title>
<link>../papers/arxiv-fb1144e05ca6.html</link>
<guid>https://arxiv.org/abs/2604.14148v1#2026-04-16#evaluation</guid>
<pubDate>Thu, 16 Apr 2026 11:43:00 +0800</pubDate>
<description>Seedance 2.0 is a new native multi-modal audio-video generation model, officially released in China in early February 2026. Compared with its predecessors, Seedance 1.0 and 1.5 Pro, Seedance 2.0 adopts a unified, highly efficient, and large-scale architecture for multi-modal audio-video joint generation. This allows it to support four input modalities: text, image, audio, and video, by integrating one of the most comprehensive suites of multi-modal content reference and editing capabilities ava…</description>
</item>
<item>
<title>Fact-Checking Large Language Model Responses to a Health Care Prompt: Comparative Study.</title>
<link>../papers/doi-442942d6cd6f.html</link>
<guid>https://pubmed.ncbi.nlm.nih.gov/41985066/#2026-04-16#evaluation</guid>
<pubDate>Thu, 16 Apr 2026 11:43:00 +0800</pubDate>
<description>BACKGROUND: Large language models use machine learning to produce natural language. These models have a range of potential applications in health care, such as patient education and diagnosis. However, evaluations of large language models in health care are still scarce. OBJECTIVE: This study aimed to (1) evaluate the accuracy and efficiency of automated fact-checking by 2 large language models and (2) illustrate a process through which a large language model might support a patient in redrafti…</description>
</item>
</channel>
</rss>
