Keyword Tracking

Keyword tracking: evaluation

This page tracks the keywords listed in your configuration over the long term and archives the matching papers by date.
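
The tracking described above can be pictured with a minimal sketch. It assumes a case-insensitive substring match over each paper's title and abstract, with hits grouped by date; the `Paper` record and the `track_keyword` function are hypothetical names for illustration, not this site's actual implementation.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Paper:
    # Hypothetical record shape; the field names are assumptions.
    title: str
    abstract: str
    feed: str  # source feed label, e.g. "PubMed AI"
    date: str  # ISO date string, e.g. "2026-06-23"


def track_keyword(papers, keyword):
    """Group papers whose title or abstract contains `keyword`, keyed by date."""
    needle = keyword.casefold()
    hits = defaultdict(list)
    for paper in papers:
        text = f"{paper.title} {paper.abstract}".casefold()
        if needle in text:
            hits[paper.date].append(paper)
    # ISO dates sort lexicographically, so reverse order puts the newest first.
    return dict(sorted(hits.items(), reverse=True))
```

Each matched `Paper` keeps its `feed` label, so the source feed stays available alongside the title when the hits are listed by date.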

Recent trend

Most recent hit, from LM: NuclearQAv2: A Structured Benchmark for Evaluating Domain-Science Competence in Large Language Models

[Trend chart: daily hits, 2026-06-15 to 2026-06-28]

Hit details

Browse by date the titles of papers that matched this keyword, with the source feed retained for each.

Language Models (LLMs) are powerful tools and have been increasingly adopted for complex software engineering tasks. As the number of parameters increases, results can often be imp…

Software performance optimization is a notoriously complex and manual task. Despite the growing use of Large Language Models (LLMs) for code refinement, we still lack benchmarks t…

LLM-based dialogue assistants have become mainstream tools for software developers, yet current evaluation benchmarks focus exclusively on functional correctness. This leaves a cr…

Reinforcement Learning for Computer-Use Agents with Autonomous Evaluation

Computer-Use Agents (CUAs) execute high-level user goals by perceiving and acting directly within graphical user interfaces. However, reinforcement learning for CUAs remains diffi…

Agent Runtime Security

PHANTOM: A Large-Scale Dataset of Multimodal Adversarial Attacks for Vision-Language Models

We introduce a large-scale, open-source dataset of pre-generated adversarial attacks for vision-language models (VLMs). The dataset is designed to be diverse, representative, and…

2026-06-23

2026-06-23 13:10:02 (Asia/Shanghai)

AIR: Adaptive Interleaved Reasoning with Code in MLLMs

Following the paradigm shift initiated by OpenAI o3, interleaved reasoning with code to enhance multimodal large language models (MLLMs) has become a pivotal research frontier. Th…

TriggerBench: Investigating Prospective Memory for Large Language Models

While Large Language Models (LLMs) are increasingly deployed in long interactions, existing evaluations focus predominantly on retrospective memory (RM) via explicit queries. Pros…

Evaluation Awareness Is Not One Capability: Evidence from Open Language Models

Safety benchmarks assume that test-condition behavior predicts deployment behavior, an assumption that fails if models detect evaluation cues and adapt. This opens a gap between b…

EnterpriseClawBench: Benchmarking Agents from Real Workplace Sessions

Enterprise agents increasingly operate inside workspaces: they read heterogeneous files, invoke tools, and deliver business artifacts. We introduce EnterpriseClawBench, an enterpr…

Agent Runtime Security

Capable but Careless: Do Computer-Use Agents Follow Contextual Integrity?

Computer-use agents (CUAs) now act on a user's behalf across personal applications such as email, calendars, and to-do lists. This cross-application access is useful, but it also…

We present Phoenix, a multi-agent LLM system that resolves GitHub issues from triage through pull-request creation, combining seven layered safety controls with a baseline-aware t…

Automated vulnerability discovery in large codebases remains challenging: traditional static analysis produces high false-positive rates, while dynamic approaches such as fuzzing…

Game generation is an emerging application of coding agents, requiring models to transform natural-language specifications into playable interactive systems. Unlike traditional co…

In software engineering research, the primary outcome is frequently a tool. However, for practitioners and academics alike, it is hard to tell which tools are maintained and do th…

Detecting LLM reasoning failures at inference time without ground-truth labels has motivated a wide range of confidence baselines, including self-consistency, semantic entropy, an…

AgentBeats: Agentifying Agent Assessment for Openness, Standardization, and Reproducibility

Agent systems are advancing quickly across domains, but their evaluation remains fragmented. Most benchmarks rely on fixed, LLM-centric harnesses that require heavy integration, c…

Terminal and SWE Agents

Recursive Agent Harnesses

Recursive language models (RLMs) showed that recursion over model calls is an effective strategy for long-context reasoning, and production coding agents have begun to write code…

Software engineering tools increasingly rely on LLM-based agents to localize files to change to resolve a software issue. Most AI agents explore repositories linearly, that is, vi…

Collaborative computation across organizations is often constrained by the need to process sensitive data and proprietary code without exposing them to untrusted infrastructure or…

As AI assistants serve millions of users daily, evaluating user experience (UX) beyond general model capability has become increasingly important. We present UXBench, the first us…

Agent Runtime Security

WeaveBench: A Long-Horizon, Real-World Benchmark for Computer-Use Agents with Hybrid Interfaces

Computer-use agents (CUAs) increasingly operate in runtimes that combine visual desktop control, command-line execution, code editing, browsers, and external tools. Existing bench…

Coding agents increasingly act as codebase-scale collaborators that can assist with codebase conversion, but this progress has exposed a critical weakness: agents often over-trust…

Current AI benchmarks evaluate agents on task execution within human-designed workflows. These evaluations fundamentally fail to measure a critical next-level capability: whether…

2026-06-03

2026-06-03 14:09:56 (Asia/Shanghai)

Can Factual Opinions Be Edited (Manipulated) in Large Language Models?

Large Language Models (LLMs) are increasingly integrated into various domains, making knowledge editing techniques crucial yet potentially hazardous. Current editing methods prima…

Fully Automated Identification of Lexical Alignment and Preference-Stage Shifts in Large Language Models

The language used by digital chat assistants such as ChatGPT can diverge from human expectations (misalignment). Research, mostly on Scientific English, has described both what di…

Evaluating LLMs' Effectiveness on Real-World Consumer Device Repair Questions

Consumer device repair is an important but underexplored testbed for large language models (LLMs). Repair tasks require reasoning over incomplete problem descriptions, hardware-sp…

Autonomous LLM training is often framed as recipe search, which leaves the training harness largely static. This limitation sharpens in agentic RL, where shifting bottlenecks and…

2026-06-02

2026-06-02 13:56:35 (Asia/Shanghai)

POIROT: Interrogating Agents for Failure Detection in Multi-Agent Systems

Orchestrating Large Language Models into Multi-Agent Systems (LLM-MAS) has unlocked remarkable reasoning capabilities, yet emergent failures and hallucinations that resist charact…

Mitigating Perceptual Judgment Bias in Multimodal LLM-as-a-Judge via Perceptual Perturbation and Reward Modeling

Recent multimodal large language models have demonstrated strong reasoning ability, yet their reliability as automated evaluators remains limited by a critical weakness: when visu…

K-BrowseComp: A Web Browsing Agent Benchmark Grounded in Korean Contexts

Frontier model evaluations are shifting from foundational capabilities (e.g., instruction following and reasoning) toward compositional, agentic ones, but Korean agentic benchmark…

AGENTCL: Toward Rigorous Evaluation of Continual Learning in Language Agents

Language agents spend substantial inference time solving individual tasks, yet the experience acquired in one episode is often underutilized in future episodes. Continual learning…

Food Noise & False Safety: A Systematic Evaluation of How LLMs Fail to Adapt to Eating Disorder Queries with Clinician Feedback

Recent evidence shows that people with eating disorders (EDs) are increasingly seeking guidance, advice, and emotional support from Large Language Model (LLM)-based chat systems.…

Agent Runtime Security

SeClaw: Spec-Driven Security Task Synthesis for Evaluating Autonomous Agents

Autonomous LLM agents increasingly operate in stateful environments where they access tools, files, memory, and external services. While such capabilities enable complex real-worl…

We study two-level autoresearch for cooperation: an outer-loop AI agent autonomously redesigns the inner-loop pipeline of an LLM policy-synthesis system for multi-agent Sequential…

Zeroth-order (ZO) fine-tuning is attractive for large language models because it replaces backpropagation with forward objective evaluations. Existing implementations nevertheless…

Background: Large language models are typically evaluated as models, benchmarks, or short conversational episodes. Less is known about what happens when an agent is embedded persi…

This paper investigates large language model (LLM) abstention learning, specifically using ternary reward, which incentivizes truthfulness in large language models. This paper exte…

Merge-Bench: Resolve Merge Conflicts with Large Language Models

This paper applies machine learning to the difficult and important task of version control merging. (1) We constructed a dataset, Merge-Bench, of 7938 real-world merge conflict hu…

Agent Runtime Security

AgentHijack: Benchmarking Computer Use Agent Robustness to Common Environment Corruptions

Autonomous computer use agents that are powered by multimodal large language models (MLLMs) are emerging as capable assistants for completing complex digital workflows. However, real-…

AI coding agents increasingly submit pull requests (Agentic-PRs) to open-source repositories, yet their performance is commonly assessed using merge and rejection outcomes alone.…

Coding agents are increasingly deployed in real software development, where a single version iteration requires months of coordinated work across many files. However, most existin…

Legacy systems concentrate business rules, architectural decisions, and operational exceptions that often remain implicit in code, data, configuration, and maintenance practices.…

Large language model personalization typically adapts outputs to user preferences and style but does not account for differences in user evaluation capacity across domains of expe…

Agent benchmarks have become the de facto measure of frontier AI competence, guiding model selection, investment, and deployment. However, reward hacking, where agents maximize a…

Large Language Models (LLMs) are widely applied in educational practices, such as for generating children's stories. However, the generated stories are often too difficult for chi…

Neurosymbolic Auditing of Natural-Language Software Requirements

Natural-language software requirements are often ambiguous, inconsistent, and underspecified; in safety-critical domains, these defects propagate into formal models that verify th…

Agent Runtime Security

Sleeper Channels and Provenance Gates: Persistent Prompt Injection in Always-on Autonomous AI Agents

Always-on AI agents (OpenClaw, Hermes Agent) run as a single persistent process under the owner's identity, folding messaging, memory, self-authored skills, scheduling, and shell…

Agent Runtime Security

Constraining Host-Level Abuse in Self-Hosted Computer-Use Agents via TEE-Backed Isolation

Self-hosted computer-use agents (SHCUAs), such as OpenClaw, combine natural-language interaction with direct access to host-side resources, including browsers, files, scripts, sys…

We present an automated, contrastive evaluation pipeline for auditing the behavioral impact of interventions on large language models. Given a base model $M_1$ and an intervention…

Large language models (LLMs) are increasingly integrated into legal workflows. However, existing benchmarks primarily address proxy tasks, such as bar examination performance or c…

Text-to-SQL (T2SQL) evaluation in production environments poses fundamental challenges that existing benchmarks do not address. Current evaluation methodologies whether rule-based…

Automated whole-body lesion segmentation in 18F-FDG PET/CT images marks a pivotal breakthrough in oncological diagnostics, substantially improving the accuracy and efficiency of…

PubMed AI

Learning from Prototypes: Contrastive Learning with Prior-Aware Multi-Label Chest X-ray Classification.

Multi-label Chest X-ray (CXR) classification faces significant challenges from the inherently imperfect nature of clinical data, particularly the complex interplay of co-occurring…

OBJECTIVES: Coronary computed tomography angiography (CCTA) has become a cornerstone in non-invasive CAD diagnosis and risk stratification. To standardize reporting and improve cl…

Approximately one-fifth of patients with acute pancreatitis (AP) develop severe forms, which are associated with high mortality rates, making early prediction of severity crucial…

BACKGROUND: Clinical trial enrollment in oncology remains critically low, with fewer than 5% of eligible adults participating, in large part due to the complexity and labor intens…

PubMed AI

A Comparative Evaluation of Three Large Language Models for Parent-Centered Questions About Anorexia Nervosa.

BACKGROUND: Large language models (LLMs) are increasingly used to obtain health information, including guidance on child and adolescent mental health. In anorexia nervosa (AN), wh…

2026-04-18

2026-04-18 11:26:55 (Asia/Shanghai)

PubMed AI

An explainable multi-head attention network for healthcare IoT threat detection based on the MedDefender-MHAN framework.

The rapid proliferation of Internet of Medical Things (IoMT) devices in healthcare environments has created critical cybersecurity vulnerabilities that demand both accurate and in…

Objectives: Accurate triage of lumbar spine magnetic resonance imaging (MRI) referrals for sciatica is important for patient assessment, diagnosis and surgical planning. This study…

PubMed AI

Dual perspectives on large language models in rheumatology: physician-rated quality and patient-centered usability of GPT-4o versus DeepSeek-V3.

OBJECTIVES: This study conducted an informatics system evaluation of two LLMs (GPT-4o and DeepSeek-V3) for patient education, combining clinician-rated quality with patient-percei…

BACKGROUND AND OBJECTIVES: Traditional medical board examinations present clinical information in static vignettes with multiple-choices (MC), fundamentally different from how phy…

BACKGROUND: Artificial intelligence-powered conversational agents (ie, chatbots) are increasingly popular outlets for users seeking psychological support, yet little is known abou…

OBJECTIVE: To compare the performance of state-of-the-art Gemini and GPT models on ophthalmology board-style questions and examine variation by subspecialty, cognitive complexity,…

2026-04-11

2026-04-11 23:09:08 (Asia/Shanghai)

PubMed AI

Evaluating the clinical decision-making performance of large language models in clinically oriented thoracic anatomy scenarios: a comparative evaluation study.

Scientific Graphics Program Synthesis via Dual Self-Consistency Reinforcement Learning

Graphics Program Synthesis is pivotal for interpreting and editing visual data, effectively facilitating the reverse-engineering of static visuals into editable TikZ code. While T…