<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
<channel>
<title>benchmark Topic Archive</title>
<link>benchmark.html</link>
<description>Long-term tracking RSS feed for the keyword "benchmark", aggregating historically matched papers.</description>
<language>zh-CN</language>
<lastBuildDate>Wed, 22 Apr 2026 03:37:20 +0000</lastBuildDate>
<item>
<title>Four-Axis Decision Alignment for Long-Horizon Enterprise AI Agents</title>
<link>../papers/arxiv-d363006cb185.html</link>
<guid>https://arxiv.org/abs/2604.19457v1#2026-04-22#benchmark</guid>
<pubDate>Wed, 22 Apr 2026 11:37:03 +0800</pubDate>
<description>Long-horizon enterprise agents make high-stakes decisions (loan underwriting, claims adjudication, clinical review, prior authorization) under lossy memory, multi-step reasoning, and binding regulatory constraints. Current evaluation reports a single task-success scalar that conflates distinct failure modes and hides whether an agent is aligned with the standards its deployment environment requires. We propose that long-horizon decision behavior decomposes into four orthogonal alignment axes, e…</description>
</item>
<item>
<title>Cyber Defense Benchmark: Agentic Threat Hunting Evaluation for LLMs in SecOps</title>
<link>../papers/arxiv-66f4fae6bbd8.html</link>
<guid>https://arxiv.org/abs/2604.19533v1#2026-04-22#benchmark</guid>
<pubDate>Wed, 22 Apr 2026 11:37:03 +0800</pubDate>
<description>We introduce the Cyber Defense Benchmark, a benchmark for measuring how well large language model (LLM) agents perform the core SOC analyst task of threat hunting: given a database of raw Windows event logs with no guided questions or hints, identify the exact timestamps of malicious events. The benchmark wraps 106 real attack procedures from the OTRF Security-Datasets corpus - spanning 86 MITRE ATT&amp;CK sub-techniques across 12 tactics - into a Gymnasium reinforcement-learning environment. Each…</description>
</item>
<item>
<title>Taming Actor-Observer Asymmetry in Agents via Dialectical Alignment</title>
<link>../papers/arxiv-3ca660d54bb4.html</link>
<guid>https://arxiv.org/abs/2604.19548v1#2026-04-22#benchmark</guid>
<pubDate>Wed, 22 Apr 2026 11:37:03 +0800</pubDate>
<description>Large Language Model agents have rapidly evolved from static text generators into dynamic systems capable of executing complex autonomous workflows. To enhance reliability, multi-agent frameworks assigning specialized roles are increasingly adopted to enable self-reflection and mutual auditing. While such role-playing effectively leverages domain expert knowledge, we find it simultaneously induces a human-like cognitive bias known as Actor-Observer Asymmetry (AOA). Specifically, an agent acting…</description>
</item>
<item>
<title>Discovering a Shared Logical Subspace: Steering LLM Logical Reasoning via Alignment of Natural-Language and Symbolic Views</title>
<link>../papers/arxiv-0d90d26515bd.html</link>
<guid>https://arxiv.org/abs/2604.19716v1#2026-04-22#benchmark</guid>
<pubDate>Wed, 22 Apr 2026 11:37:03 +0800</pubDate>
<description>Large Language Models (LLMs) still struggle with multi-step logical reasoning. Existing approaches either purely refine the reasoning chain in natural language form or attach a symbolic solver as an external module. In this work, we instead ask whether LLMs contain a shared internal logical subspace that simultaneously aligns natural-language and symbolic-language views of the reasoning process. Our hypothesis is that this logical subspace captures logical reasoning capabilities in LLMs that ar…</description>
</item>
<item>
<title>Beyond Rating: A Comprehensive Evaluation and Benchmark for AI Reviews</title>
<link>../papers/arxiv-dcac53916c57.html</link>
<guid>https://arxiv.org/abs/2604.19502v1#2026-04-22#benchmark</guid>
<pubDate>Wed, 22 Apr 2026 11:37:03 +0800</pubDate>
<description>The rapid adoption of Large Language Models (LLMs) has spurred interest in automated peer review; however, progress is currently stifled by benchmarks that treat reviewing primarily as a rating prediction task. We argue that the utility of a review lies in its textual justification (its arguments, questions, and critique) rather than a scalar score. To address this, we introduce Beyond Rating, a holistic evaluation framework that assesses AI reviewers across five dimensions: Content Faithfulnes…</description>
</item>
<item>
<title>A-MAR: Agent-based Multimodal Art Retrieval for Fine-Grained Artwork Understanding</title>
<link>../papers/arxiv-5fe8f705aa06.html</link>
<guid>https://arxiv.org/abs/2604.19689v1#2026-04-22#benchmark</guid>
<pubDate>Wed, 22 Apr 2026 11:37:03 +0800</pubDate>
<description>Understanding artworks requires multi-step reasoning over visual content and cultural, historical, and stylistic context. While recent multimodal large language models show promise in artwork explanation, they rely on implicit reasoning and internalized knowledge, limiting interpretability and explicit evidence grounding. We propose A-MAR, an Agent-based Multimodal Art Retrieval framework that explicitly conditions retrieval on structured reasoning plans. Given an artwork and a user query, A-…</description>
</item>
<item>
<title>Multi-modal Reasoning with LLMs for Visual Semantic Arithmetic</title>
<link>../papers/arxiv-424f40f3b425.html</link>
<guid>https://arxiv.org/abs/2604.19567v1#2026-04-22#benchmark</guid>
<pubDate>Wed, 22 Apr 2026 11:37:03 +0800</pubDate>
<description>Reinforcement learning (RL) as post-training is crucial for enhancing the reasoning ability of large language models (LLMs) in coding and math. However, their capacity for visual semantic arithmetic, inferring relationships from images, remains underexplored. The classic text analogy &quot;king&quot;-&quot;man&quot;+&quot;woman&quot; = &quot;queen&quot; illustrates relational reasoning, yet replacing text with images of &quot;king&quot; and &quot;man&quot; significantly reduces performance because it requires commonsense knowledge and the extraction of…</description>
</item>
<item>
<title>A Self-Evolving Framework for Efficient Terminal Agents via Observational Context Compression</title>
<link>../papers/arxiv-cedced42e5cf.html</link>
<guid>https://arxiv.org/abs/2604.19572v1#2026-04-22#benchmark</guid>
<pubDate>Wed, 22 Apr 2026 11:37:03 +0800</pubDate>
<description>As model capabilities advance, research has increasingly shifted toward long-horizon, multi-turn terminal-centric agentic tasks, where raw environment feedback is often preserved in the interaction history to support future decisions. However, repeatedly retaining such feedback introduces substantial redundancy and causes cumulative token cost to grow quadratically with the number of steps, hindering long-horizon reasoning. Although observation compression can mitigate this issue, the heterogen…</description>
</item>
<item>
<title>From Experience to Skill: Multi-Agent Generative Engine Optimization via Reusable Strategy Learning</title>
<link>../papers/arxiv-f8c71869303c.html</link>
<guid>https://arxiv.org/abs/2604.19516v1#2026-04-22#benchmark</guid>
<pubDate>Wed, 22 Apr 2026 11:37:03 +0800</pubDate>
<description>Generative engines (GEs) are reshaping information access by replacing ranked links with citation-grounded answers, yet current Generative Engine Optimization (GEO) methods optimize each instance in isolation, unable to accumulate or transfer effective strategies across tasks and engines. We reframe GEO as a strategy learning problem and propose MAGEO, a multi-agent framework in which coordinated planning, editing, and fidelity-aware evaluation serve as the execution layer, while validated edit…</description>
</item>
<item>
<title>SafetyALFRED: Evaluating Safety-Conscious Planning of Multimodal Large Language Models</title>
<link>../papers/arxiv-6f4a587095d1.html</link>
<guid>https://arxiv.org/abs/2604.19638v1#2026-04-22#benchmark</guid>
<pubDate>Wed, 22 Apr 2026 11:37:03 +0800</pubDate>
<description>Multimodal Large Language Models are increasingly adopted as autonomous agents in interactive environments, yet their ability to proactively address safety hazards remains insufficient. We introduce SafetyALFRED, built upon the embodied agent benchmark ALFRED, augmented with six categories of real-world kitchen hazards. While existing safety evaluations focus on hazard recognition through disembodied question answering (QA) settings, we evaluate eleven state-of-the-art models from the Qwen, Gem…</description>
</item>
<item>
<title>Time Series Augmented Generation for Financial Applications</title>
<link>../papers/arxiv-a14f6e5fa3da.html</link>
<guid>https://arxiv.org/abs/2604.19633v1#2026-04-22#benchmark</guid>
<pubDate>Wed, 22 Apr 2026 11:37:03 +0800</pubDate>
<description>Evaluating the reasoning capabilities of Large Language Models (LLMs) for complex, quantitative financial tasks is a critical and unsolved challenge. Standard benchmarks often fail to isolate an agent&#x27;s core ability to parse queries and orchestrate computations. To address this, we introduce a novel evaluation methodology and benchmark designed to rigorously measure an LLM agent&#x27;s reasoning for financial time-series analysis. We apply this methodology in a large-scale empirical study using our…</description>
</item>
<item>
<title>From Top-1 to Top-K: A Reproducibility Study and Benchmarking of Counterfactual Explanations for Recommender Systems</title>
<link>../papers/doi-feed310756b2.html</link>
<guid>https://arxiv.org/abs/2604.19663v1#2026-04-22#benchmark</guid>
<pubDate>Wed, 22 Apr 2026 11:37:03 +0800</pubDate>
<description>Counterfactual explanations (CEs) provide an intuitive way to understand recommender systems by identifying minimal modifications to user-item interactions that alter recommendation outcomes. Existing CE methods for recommender systems, however, have been evaluated under heterogeneous protocols, using different datasets, recommenders, metrics, and even explanation formats, which hampers reproducibility and fair comparison. Our paper systematically reproduces, re-implements, and re-evaluates eleve…</description>
</item>
<item>
<title>Lost in Translation: Do LVLM Judges Generalize Across Languages?</title>
<link>../papers/arxiv-542a2e2a02e6.html</link>
<guid>https://arxiv.org/abs/2604.19405v1#2026-04-22#benchmark</guid>
<pubDate>Wed, 22 Apr 2026 11:37:03 +0800</pubDate>
<description>Automatic evaluators such as reward models play a central role in the alignment and evaluation of large vision-language models (LVLMs). Despite their growing importance, these evaluators are almost exclusively assessed on English-centric benchmarks, leaving open the question of how well these evaluators generalize across languages. To answer this question, we introduce MM-JudgeBench, the first large-scale benchmark for multilingual and multimodal judge model evaluation, which includes over 60K…</description>
</item>
<item>
<title>Chat2Workflow: A Benchmark for Generating Executable Visual Workflows with Natural Language</title>
<link>../papers/arxiv-db59ef9531cc.html</link>
<guid>https://arxiv.org/abs/2604.19667v1#2026-04-22#benchmark</guid>
<pubDate>Wed, 22 Apr 2026 11:37:03 +0800</pubDate>
<description>At present, executable visual workflows have emerged as a mainstream paradigm in real-world industrial deployments, offering strong reliability and controllability. However, in current practice, such workflows are almost entirely constructed through manual engineering: developers must carefully design workflows, write prompts for each step, and repeatedly revise the logic as requirements evolve, making development costly, time-consuming, and error-prone. To study whether large language models ca…</description>
</item>
<item>
<title>Diff-SBSR: Learning Multimodal Feature-Enhanced Diffusion Models for Zero-Shot Sketch-Based 3D Shape Retrieval</title>
<link>../papers/arxiv-39272031a7a0.html</link>
<guid>https://arxiv.org/abs/2604.19135v1#2026-04-22#benchmark</guid>
<pubDate>Wed, 22 Apr 2026 11:37:03 +0800</pubDate>
<description>This paper presents the first exploration of text-to-image diffusion models for zero-shot sketch-based 3D shape retrieval (ZS-SBSR). Existing sketch-based 3D shape retrieval methods struggle in zero-shot settings due to the absence of category supervision and the extreme sparsity of sketch inputs. Our key insight is that large-scale pretrained diffusion models inherently exhibit open-vocabulary capability and strong shape bias, making them well suited for zero-shot visual retrieval. We leverage…</description>
</item>
<item>
<title>How Far Are Video Models from True Multimodal Reasoning?</title>
<link>../papers/arxiv-f1cd701c6156.html</link>
<guid>https://arxiv.org/abs/2604.19193v1#2026-04-22#benchmark</guid>
<pubDate>Wed, 22 Apr 2026 11:37:03 +0800</pubDate>
<description>Despite remarkable progress toward general-purpose video models, a critical question remains unanswered: how far are these models from achieving true multimodal reasoning? Existing benchmarks fail to address this question rigorously, as they remain constrained by straightforward task designs and fragmented evaluation metrics that neglect complex multimodal reasoning. To bridge this gap, we introduce CLVG-Bench, an evaluation framework designed to probe video models&#x27; zero-shot reasoning capabili…</description>
</item>
<item>
<title>Classifying American Society of Anesthesiologists Physical Status With a Low-Rank-Adapted Large Language Model: Development and Validation Study.</title>
<link>../papers/doi-8b199115e87e.html</link>
<guid>https://pubmed.ncbi.nlm.nih.gov/42013456/#2026-04-22#benchmark</guid>
<pubDate>Wed, 22 Apr 2026 11:37:03 +0800</pubDate>
<description>BACKGROUND: The American Society of Anesthesiologists Physical Status (ASA-PS) classification is integral to preoperative risk assessment; yet, assignment remains subjective and labor-intensive. Recent large language models (LLMs) process free-text electronic health records (EHRs), but few studies have evaluated parameter-efficient adaptations that both predict ASA-PS and provide clinician-readable rationales. Low-rank adaptation (LoRA) is a parameter-efficient technique that updates only a sma…</description>
</item>
<item>
<title>Comparing Clinical Outcomes in Cardiac Surgical Patients Who Receive Sugammadex Versus Placebo: A Prospective Randomized Blinded Controlled Trial.</title>
<link>../papers/doi-ec10f242cbed.html</link>
<guid>https://pubmed.ncbi.nlm.nih.gov/42012852/#2026-04-22#benchmark</guid>
<pubDate>Wed, 22 Apr 2026 11:37:03 +0800</pubDate>
<description>OBJECTIVES: To compare the difference in the number of cardiopulmonary bypass surgical patients who receive sugammadex vs. placebo and who meet the Society of Thoracic Surgery early extubation quality benchmark. DESIGN: Single-center, randomized, double-blind, placebo-controlled trial. SETTING: Participants were enrolled at a single U.S. hospital between August 2023 and July 2025. PATIENTS: Seventy-four eligible cardiac surgery patients undergoing cardiopulmonary bypass with anticipated institu…</description>
</item>
<item>
<title>MathNet: a Global Multimodal Benchmark for Mathematical Reasoning and Retrieval</title>
<link>../papers/arxiv-520299161763.html</link>
<guid>https://arxiv.org/abs/2604.18584v1#2026-04-21#benchmark</guid>
<pubDate>Tue, 21 Apr 2026 11:40:46 +0800</pubDate>
<description>Mathematical problem solving remains a challenging test of reasoning for large language and multimodal models, yet existing benchmarks are limited in size, language coverage, and task diversity. We introduce MathNet, a high-quality, large-scale, multimodal, and multilingual dataset of Olympiad-level math problems together with a benchmark for evaluating mathematical reasoning in generative models and mathematical retrieval in embedding-based systems. MathNet spans 47 countries, 17 languages, an…</description>
</item>
<item>
<title>Benchmarking System Dynamics AI Assistants: Cloud Versus Local LLMs on CLD Extraction and Discussion</title>
<link>../papers/arxiv-df421e6da9eb.html</link>
<guid>https://arxiv.org/abs/2604.18566v1#2026-04-21#benchmark</guid>
<pubDate>Tue, 21 Apr 2026 11:40:46 +0800</pubDate>
<description>We present a systematic evaluation of large language model families -- spanning both proprietary cloud APIs and locally-hosted open-source models -- on two purpose-built benchmarks for System Dynamics AI assistance: the CLD Leaderboard (53 tests, structured causal loop diagram extraction) and the Discussion Leaderboard (interactive model discussion, feedback explanation, and model building coaching). On CLD extraction, cloud models achieve 77-89% overall pass rates; the best…</description>
</item>
<item>
<title>ClawEnvKit: Automatic Environment Generation for Claw-Like Agents</title>
<link>../papers/arxiv-f83cd96fcc3e.html</link>
<guid>https://arxiv.org/abs/2604.18543v1#2026-04-21#benchmark</guid>
<pubDate>Tue, 21 Apr 2026 11:40:46 +0800</pubDate>
<description>Constructing environments for training and evaluating claw-like agents remains a manual, human-intensive process that does not scale. We argue that what is needed is not just a dataset, but an automated pipeline capable of generating diverse, verified environments on demand. To this end, we introduce ClawEnvKit, an autonomous generation pipeline that instantiates this formalism from natural language descriptions. The pipeline comprises three modules: (1) a parser that extracts structured genera…</description>
</item>
<item>
<title>MASS-RAG: Multi-Agent Synthesis Retrieval-Augmented Generation</title>
<link>../papers/arxiv-d54216ff47bf.html</link>
<guid>https://arxiv.org/abs/2604.18509v1#2026-04-21#benchmark</guid>
<pubDate>Tue, 21 Apr 2026 11:40:46 +0800</pubDate>
<description>Large language models (LLMs) are widely used in retrieval-augmented generation (RAG) to incorporate external knowledge at inference time. However, when retrieved contexts are noisy, incomplete, or heterogeneous, a single generation process often struggles to reconcile evidence effectively. We propose MASS-RAG, a multi-agent synthesis approach to retrieval-augmented generation that structures evidence processing into multiple role-specialized agents. MASS-RAG applies distinct agents for…</description>
</item>
<item>
<title>OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation</title>
<link>../papers/arxiv-4fb01ed67d37.html</link>
<guid>https://arxiv.org/abs/2604.18486v1#2026-04-21#benchmark</guid>
<pubDate>Tue, 21 Apr 2026 11:40:46 +0800</pubDate>
<description>Chain-of-Thought (CoT) reasoning has become a powerful driver of trajectory prediction in VLA-based autonomous driving, yet its autoregressive nature imposes a latency cost that is prohibitive for real-time deployment. Latent CoT methods attempt to close this gap by compressing reasoning into continuous hidden states, but consistently fall short of their explicit counterparts. We suggest that this is due to purely linguistic latent representations compressing a symbolic abstraction of the world…</description>
</item>
<item>
<title>ComPASS: Towards Personalized Agentic Social Support via Tool-Augmented Companionship</title>
<link>../papers/arxiv-7ffafd0c2863.html</link>
<guid>https://arxiv.org/abs/2604.18356v1#2026-04-21#benchmark</guid>
<pubDate>Tue, 21 Apr 2026 11:40:46 +0800</pubDate>
<description>Developing compassionate interactive systems requires agents to not only understand user emotions but also provide diverse, substantive support. While recent works explore empathetic dialogue generation, they remain limited in response form and content, struggling to satisfy diverse needs across users and contexts. To address this, we explore empowering agents with external tools to execute diverse actions. Grounded in the psychological concept of &quot;social support&quot;, this paradigm delivers substa…</description>
</item>
<item>
<title>HiGMem: A Hierarchical and LLM-Guided Memory System for Long-Term Conversational Agents</title>
<link>../papers/arxiv-5cc1d83ffee2.html</link>
<guid>https://arxiv.org/abs/2604.18349v1#2026-04-21#benchmark</guid>
<pubDate>Tue, 21 Apr 2026 11:40:46 +0800</pubDate>
<description>Long-term conversational large language model (LLM) agents require memory systems that can recover relevant evidence from historical interactions without overwhelming the answer stage with irrelevant context. However, existing memory systems, including hierarchical ones, still often rely solely on vector similarity for retrieval. It tends to produce bloated evidence sets: adding many superficially similar dialogue turns yields little additional recall, but lowers retrieval precision, increases…</description>
</item>
<item>
<title>Agentic Forecasting using Sequential Bayesian Updating of Linguistic Beliefs</title>
<link>../papers/arxiv-237dd6d25d41.html</link>
<guid>https://arxiv.org/abs/2604.18576v1#2026-04-21#benchmark</guid>
<pubDate>Tue, 21 Apr 2026 11:40:46 +0800</pubDate>
<description>We present BLF (Bayesian Linguistic Forecaster), an agentic system for binary forecasting that achieves state-of-the-art performance on the ForecastBench benchmark. The system is built on three ideas. (1) A Bayesian linguistic belief state: a semi-structured representation combining numerical probability estimates with natural-language evidence summaries, updated by the LLM at each step of an iterative tool-use loop. This contrasts with the common approach of appending all retrieved evidence to…</description>
</item>
<item>
<title>Too Correct to Learn: Reinforcement Learning on Saturated Reasoning Data</title>
<link>../papers/arxiv-b69c81bbad3f.html</link>
<guid>https://arxiv.org/abs/2604.18493v1#2026-04-21#benchmark</guid>
<pubDate>Tue, 21 Apr 2026 11:40:46 +0800</pubDate>
<description>Reinforcement Learning (RL) enhances LLM reasoning, yet a paradox emerges as models scale: strong base models saturate standard benchmarks (e.g., MATH), yielding correct but homogeneous solutions. In such environments, the lack of failure cases causes the advantage signal in group-relative algorithms (e.g., GRPO) to vanish, driving policies into mode collapse. To address this, we propose Constrained Uniform Top-K Sampling (CUTS), a parameter-free decoding strategy enforcing structure-preserving…</description>
</item>
<item>
<title>Multilingual Training and Evaluation Resources for Vision-Language Models</title>
<link>../papers/arxiv-bb0f0a1b4a2e.html</link>
<guid>https://arxiv.org/abs/2604.18347v1#2026-04-21#benchmark</guid>
<pubDate>Tue, 21 Apr 2026 11:40:46 +0800</pubDate>
<description>Vision-Language Models (VLMs) have achieved rapid progress in recent years. However, despite this growth, VLM development remains heavily grounded in English, leading to two main limitations: (i) the lack of multilingual and multimodal datasets for training, and (ii) the scarcity of comprehensive evaluation benchmarks across languages. In this work, we address these gaps by introducing a new comprehensive suite of resources for VLM training and evaluation spanning five European languages (English,…</description>
</item>
<item>
<title>On the Importance and Evaluation of Narrativity in Natural Language AI Explanations</title>
<link>../papers/arxiv-6eff757730ed.html</link>
<guid>https://arxiv.org/abs/2604.18311v1#2026-04-21#benchmark</guid>
<pubDate>Tue, 21 Apr 2026 11:40:46 +0800</pubDate>
<description>Explainable AI (XAI) aims to make the behaviour of machine learning models interpretable, yet many explanation methods remain difficult to understand. The integration of Natural Language Generation into XAI aims to deliver explanations in textual form, making them more accessible to practitioners. Current approaches, however, largely yield static lists of feature importances. Although such explanations indicate what influences the prediction, they do not explain why the prediction occurs. In th…</description>
</item>
<item>
<title>UDM-GRPO: Stable and Efficient Group Relative Policy Optimization for Uniform Discrete Diffusion Models</title>
<link>../papers/arxiv-e5254900d751.html</link>
<guid>https://arxiv.org/abs/2604.18518v1#2026-04-21#benchmark</guid>
<pubDate>Tue, 21 Apr 2026 11:40:46 +0800</pubDate>
<description>Uniform Discrete Diffusion Model (UDM) has recently emerged as a promising paradigm for discrete generative modeling; however, its integration with reinforcement learning remains largely unexplored. We observe that naively applying GRPO to UDM leads to training instability and marginal performance gains. To address this, we propose UDM-GRPO, the first framework to integrate UDM with RL. Our method is guided by two key insights: (i) treating the final clean sample as the action provides more accura…</description>
</item>
<item>
<title>Revisiting Change VQA in Remote Sensing with Structured and Native Multimodal Qwen Models</title>
<link>../papers/arxiv-c7fa0d917c8c.html</link>
<guid>https://arxiv.org/abs/2604.18429v1#2026-04-21#benchmark</guid>
<pubDate>Tue, 21 Apr 2026 11:40:46 +0800</pubDate>
<description>Change visual question answering (Change VQA) addresses the problem of answering natural-language questions about semantic changes between bi-temporal remote sensing (RS) images. Although vision-language models (VLMs) have recently been studied for temporal RS image understanding, Change VQA remains underexplored in the context of modern multimodal models. In this letter, we revisit the CDVQA benchmark using recent Qwen models under a unified low-rank adaptation (LoRA) setting. We compare Qwen3…</description>
</item>
<item>
<title>One-Step Diffusion with Inverse Residual Fields for Unsupervised Industrial Anomaly Detection</title>
<link>../papers/arxiv-51a89c3cb173.html</link>
<guid>https://arxiv.org/abs/2604.18393v1#2026-04-21#benchmark</guid>
<pubDate>Tue, 21 Apr 2026 11:40:46 +0800</pubDate>
<description>Diffusion models have achieved outstanding performance in unsupervised industrial anomaly detection (uIAD) by learning a manifold of normal data under the common assumption that off-manifold anomalies are harder to generate, resulting in larger reconstruction errors in data space or lower probability densities in the tractable latent space. However, their iterative denoising and noising nature leads to slow inference. In this paper, we propose OSD-IRF, a novel one-step diffusion with inverse re…</description>
</item>
<item>
<title>OmniHuman: A Large-scale Dataset and Benchmark for Human-Centric Video Generation</title>
<link>../papers/arxiv-1f905a979d75.html</link>
<guid>https://arxiv.org/abs/2604.18326v1#2026-04-21#benchmark</guid>
<pubDate>Tue, 21 Apr 2026 11:40:46 +0800</pubDate>
<description>Recent advancements in audio-video joint generation models have demonstrated impressive capabilities in content creation. However, generating high-fidelity human-centric videos in complex, real-world physical scenes remains a significant challenge. We identify that the root cause lies in the structural deficiencies of existing datasets across three dimensions: limited global scene and camera diversity, sparse interaction modeling (both person-person and person-object), and insufficient individu…</description>
</item>
<item>
<title>Denoise and Align: Diffusion-Driven Foreground Knowledge Prompting for Open-Vocabulary Temporal Action Detection</title>
<link>../papers/arxiv-d700189c3374.html</link>
<guid>https://arxiv.org/abs/2604.18313v1#2026-04-21#benchmark</guid>
<pubDate>Tue, 21 Apr 2026 11:40:46 +0800</pubDate>
<description>Open-Vocabulary Temporal Action Detection (OV-TAD) aims to localize and classify action segments of unseen categories in untrimmed videos, where effective alignment between action semantics and video representations is critical for accurate detection. However, existing methods struggle to mitigate the semantic imbalance between concise, abstract action labels and rich, complex video contents, inevitably introducing semantic noise and misleading cross-modal alignment. To address this challenge,…</description>
</item>
<item>
<title>Transforming oncology clinical trial matching through neuro-symbolic, multi-agent AI and an oncology-specific knowledge graph: a prospective evaluation in 3804 patients.</title>
<link>../papers/doi-a39ecce65f3a.html</link>
<guid>https://pubmed.ncbi.nlm.nih.gov/42004487/#2026-04-21#benchmark</guid>
<pubDate>Tue, 21 Apr 2026 11:40:46 +0800</pubDate>
<description>BACKGROUND: Clinical trial enrollment in oncology remains critically low, with fewer than 5% of eligible adults participating, in large part due to the complexity and labor intensity of eligibility screening. We prospectively evaluated a neuro-symbolic, multi-agent artificial intelligence (AI) platform integrating domain-specific large language model (LLM) agents, an oncology-specific knowledge graph, a real-time recommendation engine, and human-in-the-loop review to determine whether automated…</description>
</item>
<item>
<title>Developing and evaluating definitions of real-world clinical endpoints for patients with early-stage triple-negative breast cancer using a United States of America secondary database.</title>
<link>../papers/doi-481edb543c43.html</link>
<guid>https://pubmed.ncbi.nlm.nih.gov/42004488/#2026-04-21#benchmark</guid>
<pubDate>Tue, 21 Apr 2026 11:40:46 +0800</pubDate>
<description>BACKGROUND: The KEYNOTE-522 trial showed that neoadjuvant chemotherapy (NAC) plus adjuvant pembrolizumab improved overall survival, event-free survival (EFS), and pathological complete response (pCR) in high-risk early-stage triple-negative breast cancer. As treatments evolve, evaluating real-world (RW) effectiveness is key to understanding trial generalizability. This study benchmarked RW efficacy endpoints in early-stage triple-negative breast cancer patients treated with NAC. MATERIALS AND M…</description>
</item>
<item>
<title>Medic Training at Military-Civilian Partnerships-A Narrative Review.</title>
<link>../papers/doi-00657ec6b105.html</link>
<guid>https://pubmed.ncbi.nlm.nih.gov/42001305/#2026-04-20#benchmark</guid>
<pubDate>Mon, 20 Apr 2026 11:48:52 +0800</pubDate>
<description>INTRODUCTION: Military-Civilian Partnerships (MCP) were developed to mitigate degradation of combat medical readiness during peacetime. Although these programs have historically focused on sustaining surgical readiness and training military physicians, MCP increasingly augment training for Army Combat Medics, Navy Hospital Corpsmen, Air Force Aerospace Service Specialist, and other non-physician military medical personnel. The effectiveness, scalability, and alignment of MCP along with evolving…</description>
</item>
<item>
<title>Pretraining effective T5 generative models for clinical and biomedical applications.</title>
<link>../papers/doi-d4977a45ef49.html</link>
<guid>https://pubmed.ncbi.nlm.nih.gov/41996418/#2026-04-18#benchmark</guid>
<pubDate>Sat, 18 Apr 2026 11:26:55 +0800</pubDate>
<description>This paper presents a study of the impact of corpus selection and vocabulary design on the performance of T5-based language models in clinical and biomedical domains. We introduce five different T5-EHR models, each pretrained from scratch using different combinations of clinical and biomedical corpora alongside domain-specific vocabularies. We evaluated these models across a variety of clinical and biomedical tasks to quantify the impact of pretraining data and vocabulary tokenization choices o…</description>
</item>
<item>
<title>MILU: a consensus ensemble benchmark for multimodal medical imaging lecture understanding.</title>
<link>../papers/doi-04f076dfee40.html</link>
<guid>https://pubmed.ncbi.nlm.nih.gov/41994492/#2026-04-18#benchmark</guid>
<pubDate>Sat, 18 Apr 2026 11:26:55 +0800</pubDate>
<description>PURPOSE: Vision-language models (VLMs) are increasingly used to interpret multimodal educational materials, yet their reliability on diagram-, equation-, and text-dense scientific lecture slides remains poorly understood. This work introduces Medical Imaging Lecture Understanding (MILU), a large-scale benchmark designed to characterize cross-model variability in structured understanding of real medical imaging lectures. APPROACH: MILU includes 23 lecture sets with 1117 slides. LLaVA-OneVision,…</description>
</item>
<item>
<title>Weakly Supervised Composed Object Re-Identification With Large Models.</title>
<link>../papers/doi-4950fa4bce35.html</link>
<guid>https://pubmed.ncbi.nlm.nih.gov/41996440/#2026-04-18#benchmark</guid>
<pubDate>Sat, 18 Apr 2026 11:26:55 +0800</pubDate>
<description>Existing object re-identification (re-ID) and composed image retrieval (CIR) methods capture different aspects of real-world retrieval requirements; re-ID preserves identity but cannot specify desired appearance changes, whereas CIR supports attribute-guided retrieval but does not enforce identity consistency. To bridge this gap, we introduce composed object re-identification (CORI), a new task that requires the retrieved target to simultaneously satisfy identity preservation and text-guided at…</description>
</item>
<item>
<title>An explainable multi-head attention network for healthcare IoT threat detection based on the MedDefender-MHAN framework.</title>
<link>../papers/doi-ff821e86a727.html</link>
<guid>https://pubmed.ncbi.nlm.nih.gov/41996403/#2026-04-18#benchmark</guid>
<pubDate>Sat, 18 Apr 2026 11:26:55 +0800</pubDate>
<description>The rapid proliferation of Internet of Medical Things (IoMT) devices in healthcare environments has created critical cybersecurity vulnerabilities that demand both accurate and interpretable intrusion detection solutions. Existing deep learning-based intrusion detection systems (IDS) achieve high detection accuracy but lack inherent explainability, limiting their clinical adoption under regulatory frameworks such as GDPR and FDA guidelines. This paper presents MedDefender-MHAN, an explainable m…</description>
</item>
<item>
<title>CoopEval: Benchmarking Cooperation-Sustaining Mechanisms and LLM Agents in Social Dilemmas</title>
<link>../papers/arxiv-5e024cdf605d.html</link>
<guid>https://arxiv.org/abs/2604.15267v1#2026-04-17#benchmark</guid>
<pubDate>Fri, 17 Apr 2026 11:39:21 +0800</pubDate>
<description>It is increasingly important that LLM agents interact effectively and safely with other goal-pursuing agents, yet, recent works report the opposite trend: LLMs with stronger reasoning capabilities behave less cooperatively in mixed-motive games such as the prisoner&#x27;s dilemma and public goods settings. Indeed, our experiments show that recent models -- with or without reasoning enabled -- consistently defect in single-shot social dilemmas. To tackle this safety concern, we present the first co…</description>
</item>
<item>
<title>IE as Cache: Information Extraction Enhanced Agentic Reasoning</title>
<link>../papers/arxiv-b9668967d0c4.html</link>
<guid>https://arxiv.org/abs/2604.14930v1#2026-04-17#benchmark</guid>
<pubDate>Fri, 17 Apr 2026 11:39:21 +0800</pubDate>
<description>Information Extraction aims to distill structured, decision-relevant information from unstructured text, serving as a foundation for downstream understanding and reasoning. However, it is traditionally treated merely as a terminal objective: once extracted, the resulting structure is often consumed in isolation rather than maintained and reused during multi-step inference. Moving beyond this, we propose IE-as-Cache, a framework that repurposes IE as a cognitive cache to enhance agentic…</description>
</item>
<item>
<title>QuantCode-Bench: A Benchmark for Evaluating the Ability of Large Language Models to Generate Executable Algorithmic Trading Strategies</title>
<link>../papers/arxiv-60286bc4afdd.html</link>
<guid>https://arxiv.org/abs/2604.15151v1#2026-04-17#benchmark</guid>
<pubDate>Fri, 17 Apr 2026 11:39:21 +0800</pubDate>
<description>Large language models have demonstrated strong performance on general-purpose programming tasks, yet their ability to generate executable algorithmic trading strategies remains underexplored. Unlike standard code benchmarks, trading-strategy generation requires simultaneous mastery of domain-specific financial logic, knowledge of a specialized API, and the ability to produce code that is not only syntactically correct but also leads to actual trades on historical data. In this work, we present…</description>
</item>
<item>
<title>From Reactive to Proactive: Assessing the Proactivity of Voice Agents via ProVoice-Bench</title>
<link>../papers/arxiv-913915b00c96.html</link>
<guid>https://arxiv.org/abs/2604.15037v1#2026-04-17#benchmark</guid>
<pubDate>Fri, 17 Apr 2026 11:39:21 +0800</pubDate>
<description>Recent advancements in LLM agents are gradually shifting from reactive, text-based paradigms toward proactive, multimodal interaction. However, existing benchmarks primarily focus on reactive responses, overlooking the complexities of proactive intervention and monitoring. To bridge this gap, we introduce ProVoice-Bench, the first evaluation framework specifically designed for proactive voice agents, featuring four novel tasks. By leveraging a multi-stage data synthesis pipeline, we curate 1,18…</description>
</item>
<item>
<title>An Axiomatic Benchmark for Evaluation of Scientific Novelty Metrics</title>
<link>../papers/arxiv-1b80284b2f1e.html</link>
<guid>https://arxiv.org/abs/2604.15145v1#2026-04-17#benchmark</guid>
<pubDate>Fri, 17 Apr 2026 11:39:21 +0800</pubDate>
<description>The rigorous evaluation of the novelty of a scientific paper is, even for human scientists, a challenging task. With the increasing interest in AI scientists and AI involvement in scientific idea generation and paper writing, it also becomes increasingly important that this task be automatable and reliable, lest both human attention and compute tokens be wasted on ideas that have already been explored. Due to the challenge of quantifying ground-truth novelty, however, existing novelty metrics f…</description>
</item>
<item>
<title>MM-WebAgent: A Hierarchical Multimodal Web Agent for Webpage Generation</title>
<link>../papers/arxiv-9f9995d5a903.html</link>
<guid>https://arxiv.org/abs/2604.15309v1#2026-04-17#benchmark</guid>
<pubDate>Fri, 17 Apr 2026 11:39:21 +0800</pubDate>
<description>The rapid progress of Artificial Intelligence Generated Content (AIGC) tools enables images, videos, and visualizations to be created on demand for webpage design, offering a flexible and increasingly adopted paradigm for modern UI/UX. However, directly integrating such tools into automated webpage generation often leads to style inconsistency and poor global coherence, as elements are generated in isolation. We propose MM-WebAgent, a hierarchical agentic framework for multimodal webpage genera…</description>
</item>
<item>
<title>Context Over Content: Exposing Evaluation Faking in Automated Judges</title>
<link>../papers/arxiv-0bc9230c8b6d.html</link>
<guid>https://arxiv.org/abs/2604.15224v1#2026-04-17#benchmark</guid>
<pubDate>Fri, 17 Apr 2026 11:39:21 +0800</pubDate>
<description>The LLM-as-a-judge paradigm has become the operational backbone of automated AI evaluation pipelines, yet rests on an unverified assumption: that judges evaluate text strictly on its semantic content, impervious to surrounding contextual framing. We investigate stakes signaling, a previously unmeasured vulnerability where informing a judge model of the downstream consequences its verdicts will have on the evaluated model&#x27;s continued operation systematically corrupts its as…</description>
</item>
<item>
<title>MADE: A Living Benchmark for Multi-Label Text Classification with Uncertainty Quantification of Medical Device Adverse Events</title>
<link>../papers/arxiv-4416c06e91a3.html</link>
<guid>https://arxiv.org/abs/2604.15203v1#2026-04-17#benchmark</guid>
<pubDate>Fri, 17 Apr 2026 11:39:21 +0800</pubDate>
<description>Machine learning in high-stakes domains such as healthcare requires not only strong predictive performance but also reliable uncertainty quantification (UQ) to support human oversight. Multi-label text classification (MLTC) is a central task in this domain, yet remains challenging due to label imbalances, dependencies, and combinatorial complexity. Existing MLTC benchmarks are increasingly saturated and may be affected by training data contamination, making it difficult to distinguish genuine r…</description>
</item>
<item>
<title>Autonomous Evolution of EDA Tools: Multi-Agent Self-Evolved ABC</title>
<link>../papers/arxiv-c894f3778ac6.html</link>
<guid>https://arxiv.org/abs/2604.15082v1#2026-04-17#benchmark</guid>
<pubDate>Fri, 17 Apr 2026 11:39:21 +0800</pubDate>
<description>This paper introduces the first self-evolving logic synthesis framework, which leverages Large Language Model (LLM) agents to autonomously improve the source code of ABC, the widely adopted logic synthesis system. Our framework operates on the entire integrated ABC codebase, and the output repository preserves its single-binary execution model and command interface. In the initial evolution cycle, we bootstrap the system using existing prior open-source synthesis componen…</description>
</item>
</channel>
</rss>
