Keyword Tracking

Keyword tracking: benchmark

This page keeps long-term track of the keywords you care about in your configuration and archives the matching papers by date.

Recent trend

The most recent hit comes from LLM: Four-Axis Decision Alignment for Long-Horizon Enterprise AI Agents

[Trend chart; date axis from 2026-04-09 to 2026-04-22]

Hit details

Review, by date, the titles of papers that matched this keyword, with the source feed retained.
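
The grouping described above can be sketched roughly as follows. This is only an illustration: the page does not say how matching is actually done, so the `Paper` record, the `group_hits` helper, and the case-insensitive substring match on title and abstract are all assumptions.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Paper:
    # Hypothetical record; field names are assumptions, not the site's schema.
    date: str      # ISO date string, e.g. "2026-04-22"
    feed: str      # source feed label, e.g. "LLM", "Vision", "PubMed AI"
    title: str
    abstract: str


def group_hits(papers, keyword):
    """Group papers that mention `keyword` by date, newest date first."""
    needle = keyword.lower()
    hits = defaultdict(list)
    for paper in papers:
        # Assumed rule: case-insensitive substring match on title or abstract.
        if needle in paper.title.lower() or needle in paper.abstract.lower():
            hits[paper.date].append(paper)
    # ISO date strings sort chronologically, so a reverse sort puts the newest first.
    return dict(sorted(hits.items(), key=lambda item: item[0], reverse=True))
```

Each `Paper` keeps its `feed` field, so the source feed stays attached to every hit after grouping.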

2026-04-22

2026-04-22 11:37:03 (Asia/Shanghai)

LLM

Four-Axis Decision Alignment for Long-Horizon Enterprise AI Agents

Long-horizon enterprise agents make high-stakes decisions (loan underwriting, claims adjudication, clinical review, prior authorization) under lossy memory, multi-step reasoning,…

LLM

Cyber Defense Benchmark: Agentic Threat Hunting Evaluation for LLMs in SecOps

We introduce the Cyber Defense Benchmark, a benchmark for measuring how well large language model (LLM) agents perform the core SOC analyst task of threat hunting: given a databas…

LLM

Taming Actor-Observer Asymmetry in Agents via Dialectical Alignment

Large Language Model agents have rapidly evolved from static text generators into dynamic systems capable of executing complex autonomous workflows. To enhance reliability, multi-…

LLM

Discovering a Shared Logical Subspace: Steering LLM Logical Reasoning via Alignment of Natural-Language and Symbolic Views

Large Language Models (LLMs) still struggle with multi-step logical reasoning. Existing approaches either purely refine the reasoning chain in natural language form or attach a sy…

LLM

Beyond Rating: A Comprehensive Evaluation and Benchmark for AI Reviews

The rapid adoption of Large Language Models (LLMs) has spurred interest in automated peer review; however, progress is currently stifled by benchmarks that treat reviewing primari…

LLM

A-MAR: Agent-based Multimodal Art Retrieval for Fine-Grained Artwork Understanding

Understanding artworks requires multi-step reasoning over visual content and cultural, historical, and stylistic context. While recent multimodal large language models show promis…

LLM

Multi-modal Reasoning with LLMs for Visual Semantic Arithmetic

Reinforcement learning (RL) as post-training is crucial for enhancing the reasoning ability of large language models (LLMs) in coding and math. However, their capacity for visual…

LLM

A Self-Evolving Framework for Efficient Terminal Agents via Observational Context Compression

As model capabilities advance, research has increasingly shifted toward long-horizon, multi-turn terminal-centric agentic tasks, where raw environment feedback is often preserved…

LLM

From Experience to Skill: Multi-Agent Generative Engine Optimization via Reusable Strategy Learning

Generative engines (GEs) are reshaping information access by replacing ranked links with citation-grounded answers, yet current Generative Engine Optimization (GEO) methods optimi…

LLM

SafetyALFRED: Evaluating Safety-Conscious Planning of Multimodal Large Language Models

Multimodal Large Language Models are increasingly adopted as autonomous agents in interactive environments, yet their ability to proactively address safety hazards remains insuffi…

LLM

Time Series Augmented Generation for Financial Applications

Evaluating the reasoning capabilities of Large Language Models (LLMs) for complex, quantitative financial tasks is a critical and unsolved challenge. Standard benchmarks often fai…

LLM

From Top-1 to Top-K: A Reproducibility Study and Benchmarking of Counterfactual Explanations for Recommender Systems

Counterfactual explanations (CEs) provide an intuitive way to understand recommender systems by identifying minimal modifications to user-item interactions that alter recommendati…

LLM

Lost in Translation: Do LVLM Judges Generalize Across Languages?

Automatic evaluators such as reward models play a central role in the alignment and evaluation of large vision-language models (LVLMs). Despite their growing importance, these eva…

LLM

Chat2Workflow: A Benchmark for Generating Executable Visual Workflows with Natural Language

At present, executable visual workflows have emerged as a mainstream paradigm in real-world industrial deployments, offering strong reliability and controllability. However, in cu…

Vision

Diff-SBSR: Learning Multimodal Feature-Enhanced Diffusion Models for Zero-Shot Sketch-Based 3D Shape Retrieval

This paper presents the first exploration of text-to-image diffusion models for zero-shot sketch-based 3D shape retrieval (ZS-SBSR). Existing sketch-based 3D shape retrieval metho…

Vision

How Far Are Video Models from True Multimodal Reasoning?

Despite remarkable progress toward general-purpose video models, a critical question remains unanswered: how far are these models from achieving true multimodal reasoning? Existin…

PubMed AI

Classifying American Society of Anesthesiologists Physical Status With a Low-Rank-Adapted Large Language Model: Development and Validation Study.

BACKGROUND: The American Society of Anesthesiologists Physical Status (ASA-PS) classification is integral to preoperative risk assessment; yet, assignment remains subjective and l…

PubMed AI

Comparing Clinical Outcomes in Cardiac Surgical Patients Who Receive Sugammadex Versus Placebo: A Prospective Randomized Blinded Controlled Trial.

OBJECTIVES: To compare the difference in the number of cardiopulmonary bypass surgical patients who receive sugammadex vs. placebo and who meet the Society of Thoracic Surgery ear…

2026-04-21

2026-04-21 11:40:46 (Asia/Shanghai)

LLM

MathNet: a Global Multimodal Benchmark for Mathematical Reasoning and Retrieval

Mathematical problem solving remains a challenging test of reasoning for large language and multimodal models, yet existing benchmarks are limited in size, language coverage, and…

LLM

Benchmarking System Dynamics AI Assistants: Cloud Versus Local LLMs on CLD Extraction and Discussion

We present a systematic evaluation of large language model families -- spanning both proprietary cloud APIs and locally-hosted open-source models -- on two purpose-built benchmark…

LLM

ClawEnvKit: Automatic Environment Generation for Claw-Like Agents

Constructing environments for training and evaluating claw-like agents remains a manual, human-intensive process that does not scale. We argue that what is needed is not just a da…

LLM

MASS-RAG: Multi-Agent Synthesis Retrieval-Augmented Generation

Large language models (LLMs) are widely used in retrieval-augmented generation (RAG) to incorporate external knowledge at inference time. However, when retrieved contexts are nois…

LLM

OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation

Chain-of-Thought (CoT) reasoning has become a powerful driver of trajectory prediction in VLA-based autonomous driving, yet its autoregressive nature imposes a latency cost that i…

LLM

ComPASS: Towards Personalized Agentic Social Support via Tool-Augmented Companionship

Developing compassionate interactive systems requires agents to not only understand user emotions but also provide diverse, substantive support. While recent works explore empathe…

LLM

HiGMem: A Hierarchical and LLM-Guided Memory System for Long-Term Conversational Agents

Long-term conversational large language model (LLM) agents require memory systems that can recover relevant evidence from historical interactions without overwhelming the answer s…

LLM

Agentic Forecasting using Sequential Bayesian Updating of Linguistic Beliefs

We present BLF (Bayesian Linguistic Forecaster), an agentic system for binary forecasting that achieves state-of-the-art performance on the ForecastBench benchmark. The system is…

LLM

Too Correct to Learn: Reinforcement Learning on Saturated Reasoning Data

Reinforcement Learning (RL) enhances LLM reasoning, yet a paradox emerges as models scale: strong base models saturate standard benchmarks (e.g., MATH), yielding correct but homog…

LLM

Multilingual Training and Evaluation Resources for Vision-Language Models

Vision Language Models (VLMs) achieved rapid progress in the recent years. However, despite their growth, VLMs development is heavily grounded on English, leading to two main limi…

LLM

On the Importance and Evaluation of Narrativity in Natural Language AI Explanations

Explainable AI (XAI) aims to make the behaviour of machine learning models interpretable, yet many explanation methods remain difficult to understand. The integration of Natural L…

Vision

UDM-GRPO: Stable and Efficient Group Relative Policy Optimization for Uniform Discrete Diffusion Models

Uniform Discrete Diffusion Model (UDM) has recently emerged as a promising paradigm for discrete generative modeling; however, its integration with reinforcement learning remains…

Vision

Revisiting Change VQA in Remote Sensing with Structured and Native Multimodal Qwen Models

Change visual question answering (Change VQA) addresses the problem of answering natural-language questions about semantic changes between bi-temporal remote sensing (RS) images.…

Vision

One-Step Diffusion with Inverse Residual Fields for Unsupervised Industrial Anomaly Detection

Diffusion models have achieved outstanding performance in unsupervised industrial anomaly detection (uIAD) by learning a manifold of normal data under the common assumption that o…

Vision

OmniHuman: A Large-scale Dataset and Benchmark for Human-Centric Video Generation

Recent advancements in audio-video joint generation models have demonstrated impressive capabilities in content creation. However, generating high-fidelity human-centric videos in…

Vision

Denoise and Align: Diffusion-Driven Foreground Knowledge Prompting for Open-Vocabulary Temporal Action Detection

Open-Vocabulary Temporal Action Detection (OV-TAD) aims to localize and classify action segments of unseen categories in untrimmed videos, where effective alignment between action…

PubMed AI

Transforming oncology clinical trial matching through neuro-symbolic, multi-agent AI and an oncology-specific knowledge graph: a prospective evaluation in 3804 patients.

BACKGROUND: Clinical trial enrollment in oncology remains critically low, with fewer than 5% of eligible adults participating, in large part due to the complexity and labor intens…

PubMed AI

Developing and evaluating definitions of real-world clinical endpoints for patients with early-stage triple-negative breast cancer using a United States of America secondary database.

BACKGROUND: The KEYNOTE-522 trial showed that neoadjuvant chemotherapy (NAC) plus adjuvant pembrolizumab improved overall survival, event-free survival (EFS), and pathological com…

2026-04-20

2026-04-20 11:48:52 (Asia/Shanghai)

PubMed AI

Medic Training at Military-Civilian Partnerships-A Narrative Review.

INTRODUCTION: Military-Civilian Partnerships (MCP) were developed to mitigate degradation of combat medical readiness during peacetime. Although these programs have historically f…

2026-04-18

2026-04-18 11:26:55 (Asia/Shanghai)

PubMed AI

Pretraining effective T5 generative models for clinical and biomedical applications.

This paper presents a study of the impact of corpus selection and vocabulary design on the performance of T5-based language models in clinical and biomedical domains. We introduce…

PubMed AI

MILU: a consensus ensemble benchmark for multimodal medical imaging lecture understanding.

PURPOSE: Vision-language models (VLMs) are increasingly used to interpret multimodal educational materials, yet their reliability on diagram-, equation-, and text-dense scientific…

PubMed AI

Weakly Supervised Composed Object Re-Identification With Large Models.

Existing object re-identification (re-ID) and composed image retrieval (CIR) methods capture different aspects of real-world retrieval requirements; re-ID preserves identity but c…

PubMed AI

An explainable multi-head attention network for healthcare IoT threat detection based on the MedDefender-MHAN framework.

The rapid proliferation of Internet of Medical Things (IoMT) devices in healthcare environments has created critical cybersecurity vulnerabilities that demand both accurate and in…

2026-04-17

2026-04-17 11:39:21 (Asia/Shanghai)

LLM

CoopEval: Benchmarking Cooperation-Sustaining Mechanisms and LLM Agents in Social Dilemmas

It is increasingly important that LLM agents interact effectively and safely with other goal-pursuing agents, yet, recent works report the opposite trend: LLMs with stronger reaso…

LLM

IE as Cache: Information Extraction Enhanced Agentic Reasoning

Information Extraction aims to distill structured, decision-relevant information from unstructured text, serving as a foundation for downstream understanding and reasoning. Howeve…

LLM

QuantCode-Bench: A Benchmark for Evaluating the Ability of Large Language Models to Generate Executable Algorithmic Trading Strategies

Large language models have demonstrated strong performance on general-purpose programming tasks, yet their ability to generate executable algorithmic trading strategies remains un…

LLM

From Reactive to Proactive: Assessing the Proactivity of Voice Agents via ProVoice-Bench

Recent advancements in LLM agents are gradually shifting from reactive, text-based paradigms toward proactive, multimodal interaction. However, existing benchmarks primarily focus…

LLM

An Axiomatic Benchmark for Evaluation of Scientific Novelty Metrics

The rigorous evaluation of the novelty of a scientific paper is, even for human scientists, a challenging task. With the increasing interest in AI scientists and AI involvement in…

LLM

MM-WebAgent: A Hierarchical Multimodal Web Agent for Webpage Generation

The rapid progress of Artificial Intelligence Generated Content (AIGC) tools enables images, videos, and visualizations to be created on demand for webpage design, offering a flex…

LLM

Context Over Content: Exposing Evaluation Faking in Automated Judges

The LLM-as-a-judge paradigm has become the operational backbone of automated AI evaluation pipelines, yet rests on an unverified assumption: that judges evaluate text s…

LLM

MADE: A Living Benchmark for Multi-Label Text Classification with Uncertainty Quantification of Medical Device Adverse Events

Machine learning in high-stakes domains such as healthcare requires not only strong predictive performance but also reliable uncertainty quantification (UQ) to support human overs…

LLM

Autonomous Evolution of EDA Tools: Multi-Agent Self-Evolved ABC

This paper introduces the first self-evolving logic synthesis framework, which leverages Large Language Model (LLM) agents to autonomously improve the source code of \texts…

LLM

ADAPT: Benchmarking Commonsense Planning under Unspecified Affordance Constraints

Intelligent embodied agents should not simply follow instructions, as real-world environments often involve unexpected conditions and exceptions. However, existing methods usually…

LLM

From Tokens to Steps: Verification-Aware Speculative Decoding for Efficient Multi-Step Reasoning

Speculative decoding (SD) accelerates large language model inference by allowing a lightweight draft model to propose outputs that a stronger target model verifies. However, its t…

Vision

Unsupervised Skeleton-Based Action Segmentation via Hierarchical Spatiotemporal Vector Quantization

We propose a novel hierarchical spatiotemporal vector quantization framework for unsupervised skeleton-based temporal action segmentation. We first introduce a hierarchical approa…

Vision

Efficient Search of Implantable Adaptive Cells for Medical Image Segmentation

Purpose: Adaptive skip modules can improve medical image segmentation, but searching for them is computationally costly. Implantable Adaptive Cells (IACs) are compact NAS modules…

PubMed AI

From Image to Pixels: towards Fine-Grained Medical Vision-Language Models.

Multimodal large language models (MLLMs) offer immense potential for biomedical AI, yet current applications remain limited to coarse-grained image understanding and basic textual…

2026-04-16

2026-04-16 11:43:00 (Asia/Shanghai)

LLM

GeoAgentBench: A Dynamic Execution Benchmark for Tool-Augmented Agents in Spatial Analysis

The integration of Large Language Models (LLMs) into Geographic Information Systems (GIS) marks a paradigm shift toward autonomous spatial analysis. However, evaluating these LLM-…

LLM

HINTBench: Horizon-agent Intrinsic Non-attack Trajectory Benchmark

Existing agent-safety evaluation has focused mainly on externally induced risks. Yet agents may still enter unsafe trajectories under benign conditions. We study this complementar…

LLM

LongCoT: Benchmarking Long-Horizon Chain-of-Thought Reasoning

As language models are increasingly deployed for complex autonomous tasks, their ability to reason accurately over longer horizons becomes critical. An essential component of this…

LLM

Correct Prediction, Wrong Steps? Consensus Reasoning Knowledge Graph for Robust Chain-of-Thought Synthesis

LLM reasoning traces suffer from complex flaws -- *Step Internal Flaws* (logical errors, hallucinations, etc.) and *Step-wise Flaws* (overthinking, underthinking), which vary by s…

LLM

TREX: Automating LLM Fine-tuning via Agent-Driven Tree-based Exploration

While Large Language Models (LLMs) have empowered AI research agents to perform isolated scientific tasks, automating complex, real-world workflows, such as LLM training, remains…

LLM

MAny: Merge Anything for Multimodal Continual Instruction Tuning

Multimodal Continual Instruction Tuning (MCIT) is essential for sequential task adaptation of Multimodal Large Language Models (MLLMs) but is severely restricted by catastrophic f…

LLM

MedRCube: A Multidimensional Framework for Fine-Grained and In-Depth Evaluation of MLLMs in Medical Imaging

The potential of Multimodal Large Language Models (MLLMs) in domain of medical imaging raise the demands of systematic and rigorous evaluation frameworks that are aligned with the…

LLM

Doc-V*:Coarse-to-Fine Interactive Visual Reasoning for Multi-Page Document VQA

Multi-page Document Visual Question Answering requires reasoning over semantics, layouts, and visual elements in long, visually dense documents. Existing OCR-free methods face a t…

LLM

Memory Transfer Learning: How Memories are Transferred Across Domains in Coding Agents

Memory-based self-evolution has emerged as a promising paradigm for coding agents. However, existing approaches typically restrict memory utilization to homogeneous task domains,…

LLM

Reward Design for Physical Reasoning in Vision-Language Models

Physical reasoning over visual inputs demands tight integration of visual perception, domain knowledge, and multi-step symbolic inference. Yet even state-of-the-art Vision Languag…

LLM

Who Gets Flagged? The Pluralistic Evaluation Gap in AI Content Watermarking

Watermarking is becoming the default mechanism for AI content authentication, with governance policies and frameworks referencing it as infrastructure for content provenance. Yet…

Vision

ROSE: Retrieval-Oriented Segmentation Enhancement

Existing segmentation models based on multimodal large language models (MLLMs), such as LISA, often struggle with novel or emerging entities due to their inability to incorporate…

Vision

Decoding the Delta: Unifying Remote Sensing Change Detection and Understanding with Multimodal Large Language Models

While Multimodal Large Language Models (MLLMs) excel in general vision-language tasks, their application to remote sensing change understanding is hindered by a fundamental "tempo…

Vision

POINTS-Seeker: Towards Training a Multimodal Agentic Search Model from Scratch

While Large Multimodal Models (LMMs) demonstrate impressive visual perception, they remain epistemically constrained by their static parametric knowledge. To transcend these bound…

Vision

PBE-UNet: A light weight Progressive Boundary-Enhanced U-Net with Scale-Aware Aggregation for Ultrasound Image Segmentation

Accurate lesion segmentation in ultrasound images is essential for preventive screening and clinical diagnosis, yet remains challenging due to low contrast, blurry boundaries, and…

PubMed AI

Augmenting Large Language Model With Prompt Engineering and Supervised Fine-Tuning in Non-Small Cell Lung Cancer Tumor-Node-Metastasis Staging: Framework Development and Validation.

BACKGROUND: Accurate tumor node metastasis (TNM) staging is fundamental for treatment planning and prognosis in non-small cell lung cancer (NSCLC). However, its complexity poses s…

PubMed AI

PKFAR: psychiatry knowledge-fused augmented reasoning with large language models.

PURPOSE: Psychiatric diagnosis faces significant challenges due to subjective symptom reporting and complex diagnostic criteria. While Large Language Models (LLMs) offer potential…

PubMed AI

A Multi-AI Agent Framework for Interactive Neurosurgical Education and Evaluation: From Vignettes to Virtual Conversations.

BACKGROUND AND OBJECTIVES: Traditional medical board examinations present clinical information in static vignettes with multiple-choices (MC), fundamentally different from how phy…

2026-04-15

2026-04-15 11:35:50 (Asia/Shanghai)

LLM

Drawing on Memory: Dual-Trace Encoding Improves Cross-Session Recall in LLM Agents

LLM agents with persistent memory store information as flat factual records, providing little context for temporal reasoning, change tracking, or cross-session aggregation. Inspir…

LLM

Round-Trip Translation Reveals What Frontier Multilingual Benchmarks Miss

Multilingual benchmarks guide the development of frontier models. Yet multilingual evaluations reported by frontier models are structured similar to popular reasoning and knowledg…

LLM

Towards Long-horizon Agentic Multimodal Search

Multimodal deep search agents have shown great potential in solving complex tasks by iteratively collecting textual and visual evidence. However, managing the heterogeneous inform…

LLM

QuarkMedSearch: A Long-Horizon Deep Search Agent for Exploring Medical Intelligence

As agentic foundation models continue to evolve, how to further improve their performance in vertical domains has become an important challenge. To this end, building upon Tongyi…

LLM

ARGOS: Who, Where, and When in Agentic Multi-Camera Person Search

We introduce ARGOS, the first benchmark and framework that reformulates multi-camera person search as an interactive reasoning problem requiring an agent to plan, question, and el…

LLM

Transferable Expertise for Autonomous Agents via Real-World Case-Based Learning

LLM-based autonomous agents perform well on general reasoning tasks but still struggle to reliably use task structure, key constraints, and prior experience in complex real-world…

LLM

Do VLMs Truly "Read" Candlesticks? A Multi-Scale Benchmark for Visual Stock Price Forecasting

Vision-language models(VLMs) are increasingly applied to visual stock price forecasting, yet existing benchmarks inadequately evaluate their understanding of stock price in candle…

LLM

Cycle-Consistent Search: Question Reconstructability as a Proxy Reward for Search Agent Training

Reinforcement Learning (RL) has shown strong potential for optimizing search agents in complex information retrieval tasks. However, existing approaches predominantly rely on gold…

LLM

AISafetyBenchExplorer: A Metric-Aware Catalogue of AI Safety Benchmarks Reveals Fragmented Measurement and Weak Benchmark Governance

The rapid expansion of large language model (LLM) safety evaluation has produced a substantial benchmark ecosystem, but not a correspondingly coherent measurement ecosystem. We pr…

LLM

Growing Pains: Extensible and Efficient LLM Benchmarking Via Fixed Parameter Calibration

The rapid release of both language models and benchmarks makes it increasingly costly to evaluate every model on every dataset. In practice, models are often evaluated on differen…

LLM

Toward Autonomous Long-Horizon Engineering for ML Research

Autonomous AI research has advanced rapidly, but long-horizon ML research engineering remains difficult: agents must sustain coherent progress across task comprehension, environme…

Vision

RSGMamba: Reliability-Aware Self-Gated State Space Model for Multimodal Semantic Segmentation

Multimodal semantic segmentation has emerged as a powerful paradigm for enhancing scene understanding by leveraging complementary information from multiple sensing modalities (e.g…

Vision

Generative Refinement Networks for Visual Synthesis

While diffusion models dominate the field of visual generation, they are computationally inefficient, applying a uniform computational effort regardless of different complexity. I…

PubMed AI

VLBiasBench: A Comprehensive Benchmark for Evaluating Bias in Large Vision-Language Model.

The emergence of Large Vision-Language Models (LVLMs) marks significant strides towards achieving general artificial intelligence. However, these advancements are accompanied by c…

PubMed AI

Bridging the Modality Gap in Medical Vision-Language Models: A Hybrid Contrastive-Optimal Transport Framework for Enhanced Cross-Modal Alignment.

Vision-language models in healthcare face a critical limitation, i.e., the modality gap, where image and text embeddings occupy distantly separated regions in shared representatio…

2026-04-14

2026-04-14 11:37:06 (Asia/Shanghai)

LLM

UniToolCall: Unifying Tool-Use Representation, Data, and Evaluation for LLM Agents

Tool-use capability is a fundamental component of LLM agents, enabling them to interact with external systems through structured function calls. However, existing research exhibit…

LLM

General365: Benchmarking General Reasoning in Large Language Models Across Diverse and Challenging Tasks

Contemporary large language models (LLMs) have demonstrated remarkable reasoning capabilities, particularly in specialized domains like mathematics and physics. However, their abi…

LLM

Collaborative Multi-Agent Scripts Generation for Enhancing Imperfect-Information Reasoning in Murder Mystery Games

Vision-language models (VLMs) have shown impressive capabilities in perceptual tasks, yet they degrade in complex multi-hop reasoning under multiplayer game settings with imperfec…

LLM

From Translation to Superset: Benchmark-Driven Evolution of a Production AI Agent from Rust to Python

Cross-language migration of large software systems is a persistent engineering challenge, particularly when the source codebase evolves rapidly. We present a methodology for LLM-a…

LLM

Detecting Safety Violations Across Many Agent Traces

To identify safety violations, auditors often search over large sets of agent traces. This search is difficult because failures are often rare, complex, and sometimes even adversa…

LLM

ClawGUI: A Unified Framework for Training, Evaluating, and Deploying GUI Agents

GUI agents drive applications through their visual interfaces instead of programmatic APIs, interacting with arbitrary software via taps, swipes, and keystrokes, reaching a long t…

LLM

Agentic Aggregation for Parallel Scaling of Long-Horizon Agentic Tasks

We study parallel test-time scaling for long-horizon agentic tasks such as agentic search and deep research, where multiple rollouts are generated in parallel and aggregated into…

LLM

Time is Not a Label: Continuous Phase Rotation for Temporal Knowledge Graphs and Agentic Memory

Structured memory representations such as knowledge graphs are central to autonomous agents and other long-lived systems. However, most existing approaches model time as discrete…

LLM

PAC-BENCH: Evaluating Multi-Agent Collaboration under Privacy Constraints

We are entering an era in which individuals and organizations increasingly deploy dedicated AI agents that interact and collaborate with other agents. However, the dynamics of mul…

LLM

METER: Evaluating Multi-Level Contextual Causal Reasoning in Large Language Models

Contextual causal reasoning is a critical yet challenging capability for Large Language Models (LLMs). Existing benchmarks, however, often evaluate this skill in fragmented settin…

LLM

Multi-ORFT: Stable Online Reinforcement Fine-Tuning for Multi-Agent Diffusion Planning in Cooperative Driving

Closed-loop cooperative driving requires planners that generate realistic multimodal multi-agent trajectories while improving safety and traffic efficiency. Existing diffusion pla…

Vision

OmniShow: Unifying Multimodal Conditions for Human-Object Interaction Video Generation

In this work, we study Human-Object Interaction Video Generation (HOIVG), which aims to synthesize high-quality human-object interaction videos conditioned on text, reference imag…

Vision

LMMs Meet Object-Centric Vision: Understanding, Segmentation, Editing and Generation

Large Multimodal Models (LMMs) have achieved remarkable progress in general-purpose vision-language understanding, yet they remain limited in tasks requiring precise object-level…

Vision

GazeVaLM: A Multi-Observer Eye-Tracking Benchmark for Evaluating Clinical Realism in AI-Generated X-Rays

We introduce GazeVaLM, a public eye-tracking dataset for studying clinical perception during chest radiograph authenticity assessment. The dataset comprises 960 gaze recordings fr…

Vision

Seeing Through the Tool: A Controlled Benchmark for Occlusion Robustness in Foundation Segmentation Models

Occlusion, where target structures are partially hidden by surgical instruments or overlapping tissues, remains a critical yet underexplored challenge for foundation segmentation…

PubMed AI

Comparing Large Language Models and Traditional Machine Translation Tools for Translating Medical Consultation Summaries: Quantitative Pilot Feasibility Study.

BACKGROUND: Translation of medical consultation summaries is essential for equitable health care communication in culturally and linguistically diverse populations. While machine…

PubMed AI

Toward Sustainable Clinical Analysis: Benchmarking Plastic Use in LC-MS Sample Preparation - Exemplified by Ketamine Analogues in Whole Blood.

The aim of this study was to assess and benchmark plastic consumption in sample preparation for forensic analysis, alongside the development of an LC-MS method for ketamine analog…

PubMed AI

Text4Seg++: Advancing Image Segmentation via Generative Language Modeling.

Multimodal Large Language Models (MLLMs) have shown exceptional capabilities in vision-language tasks. However, effectively integrating image segmentation into these models remain…

PubMed AI

Diversity in clinical Trials: The example of systemic lupus erythematosus.

OBJECTIVE: The FDA requires clinical trials to reflect real-world diversity. Systemic lupus erythematosus (SLE) is a disease that disproportionately affects individuals of Black A…

PubMed AI

Comparative Performance of Gemini 3 Pro and GPT-5 Family Models on Ophthalmology Board-Style Questions.

OBJECTIVE: To compare the performance of state-of-the-art Gemini and GPT models on ophthalmology board-style questions and examine variation by subspecialty, cognitive complexity,…

2026-04-09

2026-04-09 14:51:56 (Asia/Shanghai)

PubMed AI

ClinicRealm: Re-evaluating large language models with conventional machine learning for non-generative clinical prediction tasks.

View original source

Large Language Models (LLMs) are increasingly deployed in medicine. However, their utility for non-generative clinical prediction is under-evaluated, and they are often assumed to…

PubMed AI

A guide to using embedded ethics in human stem-cell-based embryo model research.

View original source

Human stem-cell-based embryo models (hSCBEMs) offer unprecedented opportunities for basic and translational research. However, the rapid pace of scientific developments in the fie…

2026-04-08

2026-04-08 17:10:24 (Asia/Shanghai)

LLM

Paper Circle: An Open-source Multi-agent Research Discovery and Analysis Framework

View original source

The rapid growth of scientific literature has made it increasingly difficult for researchers to efficiently discover, evaluate, and synthesize relevant work. Recent advances in mu…

LLM

MMEmb-R1: Reasoning-Enhanced Multimodal Embedding with Pair-Aware Selection and Adaptive Control

View original source

MLLMs have been successfully applied to multimodal embedding tasks, yet their generative reasoning capabilities remain underutilized. Directly incorporating chain-of-thought reaso…

LLM

Generating Synthetic Doctor-Patient Conversations for Long-form Audio Summarization

View original source

Long-context audio reasoning is underserved in both training data and evaluation. Existing benchmarks target short-context tasks, and the open-ended generation tasks most relevant…

LLM

Shot-Based Quantum Encoding: A Data-Loading Paradigm for Quantum Neural Networks

View original source

Efficient data loading remains a bottleneck for near-term quantum machine-learning. Existing schemes (angle, amplitude, and basis encoding) either underuse the exponential Hilbert…

LLM

Claw-Eval: Toward Trustworthy Evaluation of Autonomous Agents

View original source

Large language models are increasingly deployed as autonomous agents executing multi-step workflows in real-world software environments. However, existing agent benchmarks suffer…

LLM

Gym-Anything: Turn any Software into an Agent Environment

View original source

Computer-use agents hold the promise of assisting in a wide range of digital economic activities. However, current research has largely focused on short-horizon tasks over a limit…

LLM

Lightweight Multimodal Adaptation of Vision Language Models for Species Recognition and Habitat Context Interpretation in Drone Thermal Imagery

View original source

This study proposes a lightweight multimodal adaptation framework to bridge the representation gap between RGB-pretrained VLMs and thermal infrared imagery, and demonstrates its p…

LLM

A Large-Scale Empirical Comparison of Meta-Learners and Causal Forests for Heterogeneous Treatment Effect Estimation in Marketing Uplift Modeling

View original source

Estimating Conditional Average Treatment Effects (CATE) at the individual level is central to precision marketing, yet systematic benchmarking of uplift modeling methods at indust…

LLM

ACE-Bench: Agent Configurable Evaluation with Scalable Horizons and Controllable Difficulty under Lightweight Environments

View original source

Existing Agent benchmarks suffer from two critical limitations: high environment interaction overhead (up to 41% of total evaluation time) and imbalanced task horizon and difficu…

LLM

JUÁ - A Benchmark for Information Retrieval in Brazilian Legal Text Collections

View original source

Legal information retrieval in Portuguese remains difficult to evaluate systematically because available datasets differ widely in document type, query style, and relevance defini…

Vision

MMEmb-R1: Reasoning-Enhanced Multimodal Embedding with Pair-Aware Selection and Adaptive Control

View original source

MLLMs have been successfully applied to multimodal embedding tasks, yet their generative reasoning capabilities remain underutilized. Directly incorporating chain-of-thought reaso…

Vision

Lightweight Multimodal Adaptation of Vision Language Models for Species Recognition and Habitat Context Interpretation in Drone Thermal Imagery

View original source

This study proposes a lightweight multimodal adaptation framework to bridge the representation gap between RGB-pretrained VLMs and thermal infrared imagery, and demonstrates its p…

Vision

Scientific Graphics Program Synthesis via Dual Self-Consistency Reinforcement Learning

View original source

Graphics Program Synthesis is pivotal for interpreting and editing visual data, effectively facilitating the reverse-engineering of static visuals into editable TikZ code. While T…