Keyword Tracking

Keyword tracking: RAG

This page tracks the keywords in your configuration over the long term and archives the matching papers by date.

Recent trend

The most recent hit came from LM: NuclearQAv2: A Structured Benchmark for Evaluating Domain-Science Competence in Large Language Models

[Trend chart: date axis from 2026-06-15 to 2026-06-28]

Hit details

Review, by date, the titles of papers that matched this keyword, with the source feed information retained.

LLM-based agents for program repair are increasingly built on a "generate-run-revise" paradigm, iteratively executing tests to evaluate and refine patches. This execution-based ap…

2026-06-25

2026-06-25 13:11:21 (Asia/Shanghai)

How Robust is OCR-Reasoning? Evaluating OCR-Reasoning Robustness of Vision-Language Models under Visual Perturbations

Vision-language models (VLMs) have achieved strong performance on OCR-based benchmarks and increasingly focused on text-rich understanding, but their robustness under controlled v…

Probabilistic Agents in Deterministic Audits: Evaluating Multi-Agent Systems for Automated Audits Based on the German IT-Grundschutz

The NIS-2 Directive mandates robust Risk Management from thousands of small and medium enterprises. To ensure compliance, companies rely on established standards such as the Germa…

Uncertainty Quantification for Computer-Use Agents: A Benchmark across Vision-Language Models and GUI Grounding Datasets

Computer-use agents turn vision-language model (VLM) predictions into executable GUI clicks, so reliable uncertainty estimates are essential for rejection, calibration, miss-sever…

Resolving issues with ambiguous and incomplete descriptions, particularly concerning complex bugs, requires a sophisticated, long-horizon workflow. Agents must navigate codebases…

As large language model (LLM) agents are applied to longer tasks, they increasingly modify workspace state across multiple rounds of iteration. However, agents typically observe o…

2026-06-23

2026-06-23 13:10:02 (Asia/Shanghai)

AIR: Adaptive Interleaved Reasoning with Code in MLLMs

Following the paradigm shift initiated by OpenAI o3, interleaved reasoning with code to enhance multimodal large language models (MLLMs) has become a pivotal research frontier. Th…

TriggerBench: Investigating Prospective Memory for Large Language Models

While Large Language Models (LLMs) are increasingly deployed in long interactions, existing evaluations focus predominantly on retrospective memory (RM) via explicit queries. Pros…

Can LLMs Reliably Self-Report Adversarial Prefills, and How?

Prior work shows that large language models (LLMs) exhibit introspective capability on benign tasks. We extend the question to safety contexts and examine how reliably a model can…

Discrete text-trigger optimization -- searching for text sequences that, when ingested by a model, steer it toward a specified objective -- underpins model red-teaming (e.g., LLM…

LLM-based coding agents need higher-level operational knowledge about a repository (which files house which subsystems, how to run the test suite, which workflows have historicall…

2026-06-18

2026-06-18 14:03:08 (Asia/Shanghai)

A Technical Taxonomy of LLM Agent Communication Protocols

As large language models (LLMs) advance and multi-agent systems aim to overcome the limits of standalone agents, robust communication protocols are becoming essential infrastructu…

X+Slides: Benchmarking Audience-Conditioned Slide Generation

Automatically generating slide decks from source documents is an important application of large language models (LLMs). Existing benchmarks primarily assess slide completeness and…

Beyond Tokenization: Direct Timestep Embedding and Contrastive Alignment for Time-Series Question Answering

Recent advances in large language models (LLMs) have given rise to time-series question answering (TSQA), which formulates time-series analysis as natural-language question answer…

Diffusion-Proof: Recipe for Formal Theorem Proving Beyond Auto-Regressive Generation

Enhancing the formal math reasoning capabilities of Large Language Models (LLMs) has become a key focus in both mathematical and computer science communities in recent years. Whil…

Agent Runtime Security

CodeSentinel: A Three-Layer Defense Against Indirect Prompt Injection in Code Contexts

Code large language models increasingly retrieve external code context from repositories, documentation, issue threads, and coding-agent environments, creating an indirect prompt-…

2026-06-17

2026-06-17 14:22:19 (Asia/Shanghai)

Evaluating Open-Source LLMs for Multi-Label ATT&CK Technique Classification on CTI Reports

Classifying Cyber Threat Intelligence (CTI) using MITRE Adversarial Tactics, Techniques, and Common Knowledge (ATT&CK) is essential for proactive defense, but historically require…

Reading between the Lines: Leveraging Large Language Models for Global Dementia and Depression Assessment from Clinical Interviews

Dementia and depression are the most prevalent neuropsychiatric disorders in geriatric populations, and their overlapping symptoms pose major challenges for differential diagnosis…

From Reasoning Traces to Reusable Modules: Understanding Compositional Generalization in Language Model Reasoning

Post-training pipelines that combine supervised fine-tuning (SFT) with reinforcement learning (RL) have emerged as the key recipe for transforming large language models (LLMs) int…

Small Initialization Matters for Large Language Models

Large language models provide a tractable system for asking how intelligence itself emerges, rather than only how LLMs can be engineered. Although progress is usually attributed t…

LLM Consumer Behavior Theory: Foundations of a Novel Research Field

Large language models (LLMs) are increasingly deployed as autonomous agents that make consumption decisions on behalf of users. This shift raises fundamental questions for consume…

Terminal and SWE Agents

VoidPadding: Let [VOID] Handle Padding in Masked Diffusion Language Models so that [EOS] Can Focus on Semantic Termination

MDLMs generate text by denoising a preallocated masked response canvas, making response-length modeling central to instruction tuning. Existing MDLMs often inherit the autoregress…

While Large Reasoning Models (LRMs) excel at complex tasks, they remain highly vulnerable to sophisticated jailbreaks and direct harmful queries. To address this vulnerability, pr…

Skill self-evolution methods for LLM agents aim to turn execution trajectories into reusable skill documents, but current pipelines typically learn from one trajectory per task, m…

Agent Runtime Security

ComAct: Reframing Professional Software Manipulation via COM-as-Action Paradigm

Existing computer-use agents remain fundamentally limited in professional software manipulation: GUI-based agents suffer from fragile visual grounding and long-horizon error accum…

2026-06-11

2026-06-11 13:59:12 (Asia/Shanghai)

Measuring Epistemic Resilience of LLMs Under Misleading Medical Context

Large language models (LLMs) now reach expert-level scores on medical licensing exams, encouraging the assumption that high scores imply safe medical judgment while patients incre…

OpenMedReason: Scientific Reasoning Supervision for Medical Vision-Language Models

High-stakes clinical use of large vision-language models (LVLMs) requires reasoning that is grounded in visual evidence and clinical knowledge, not just correct final answers. We…

Verifiable Environments Are LEGO Bricks: Recursive Composition for Reasoning Generalization

Reinforcement Learning (RL) with verifiable environments has emerged as a powerful approach for enhancing the reasoning capabilities of Large Language Models (LLMs). While prior r…

AI coding assistants now support a growing share of software work, from quick scripts to production applications. Yet these agents remain largely stateless: each new session re-re…

Data tells stories that shape society; the data journalist's job is to turn raw information into stories non-experts can trust. A high-quality news feature takes a newsroom team w…

Code language models need repository-level context to resolve imports, APIs, and project conventions. Existing methods inject this knowledge as long inputs (retrieved through RAG…

2026-06-04

2026-06-04 14:02:06 (Asia/Shanghai)

A Systematic Evaluation of Positional Bias in Multi-Video Summarization with MLLMs

Multimodal Large Language Models (MLLMs) are increasingly used for video understanding, yet their reliability under multi-video inputs remains poorly understood. We study position…

Self-Evolving Deep Research via Joint Generation and Evaluation

Large Language Models (LLMs) have become increasingly adopted in daily applications, with deep research standing out as a particularly important capability. Unlike traditional que…

Current AI benchmarks evaluate agents on task execution within human-designed workflows. These evaluations fundamentally fail to measure a critical next-level capability: whether…

The development of complex software systems, e.g., cyber-physical systems (CPSs), involves continuous evolution of both system implementations and their requirements. These two ar…

2026-06-02

2026-06-02 13:56:35 (Asia/Shanghai)

POIROT: Interrogating Agents for Failure Detection in Multi-Agent Systems

Orchestrating Large Language Models into Multi-Agent Systems (LLM-MAS) has unlocked remarkable reasoning capabilities, yet emergent failures and hallucinations that resist charact…

Auditing Asset-Specific Preferences in Financial Large Language Models: Evidence from Bitcoin Representations and Portfolio Allocation

Large language models now power robo-advisors and trading agents, yet whether they carry built-in biases toward specific assets is largely untested. We ask three questions: do LLM…

The rapid expansion of the Python ecosystem has fueled two distinct but converging threats: adversaries increasingly target the software supply chain via the Python Package Index…

LLM agents increasingly act by writing code, yet a split persists between the runtime that drives the agent and the code the model writes. The runtime owns the loop, context, and…

Understanding how events evolve over time is essential for search engines handling queries about trending news. We present QDET (Query-Driven Event Timeline Summarization), a prod…

GraphReview: Scientific Paper Evaluation via LLM-Based Graph Message Passing

Scientific paper evaluation often involves not only assessing a manuscript itself, but also relating it to contemporaneous research and prior literature. However, existing LLM-bas…

Agent Runtime Security

AlbanianLLMSafety: A Safety Evaluation Dataset for Large Language Models in Albanian

Safety evaluation of Large Language Models (LLMs) has largely focused on high-resource languages, leaving low-resource languages critically underserved. We present AlbanianLLMSafe…

Despite the strong reasoning capabilities of large language models (LLMs), optimizing the execution efficiency of tensor programs remains challenging due to the need for precise,…

TIAR: Trajectory-Informed Advantage Reweighting for LLM Abstention Learning

This paper investigates large language model (LLM) abstention learning, specifically using ternary reward, which incentivizes truthfulness in large language models. This paper exte…

Agent Runtime Security

AgentHijack: Benchmarking Computer Use Agent Robustness to Common Environment Corruptions

Autonomous computer use agents powered by multimodal large language models (MLLMs) are emerging as capable assistants for completing complex digital workflows. However, real-…

Reinforcement learning from verifiable rewards (RLVR) has emerged as a central technique for improving the reasoning capabilities of large language models. Despite its effectivene…

LLM routing matters most in long-horizon applications such as coding agents, deep research systems, and computer-use agents, where a single user request triggers many model calls.…

2026-05-19

2026-05-19 13:08:04 (Asia/Shanghai)

MA$^{2}$P: A Meta-Cognitive Autonomous Intelligent Agents Framework for Complex Persuasion

Persuasive dialogue generation plays a vital role in decision-making, negotiation, counseling, and behavior change, yet it remains a challenging problem. In complex persuasion whe…

LongMINT: Evaluating Memory under Multi-Target Interference in Long-Horizon Agent Systems

Real-world agents operate over long and evolving horizons, where information is repeatedly updated and may interfere across memories, requiring accurate recall and aggregated reas…

Implicit Hierarchical GRPO: Decoupling Tool Invocation from Execution for Tool-Integrated Mathematical Reasoning

Large language models (LLMs) have increasingly leveraged tool invocation to enhance their reasoning capabilities. However, existing approaches typically tightly couple tool invoca…

Legacy systems concentrate business rules, architectural decisions, and operational exceptions that often remain implicit in code, data, configuration, and maintenance practices.…

Large language model (LLM) agents require long-term memory to leverage information from past interactions. However, existing memory systems often face a fidelity--efficiency trade…

2026-05-15

2026-05-15 14:57:29 (Asia/Shanghai)

Talk is (Not) Cheap: A Taxonomy and Benchmark Coverage Audit for LLM Attacks

We introduce a reusable framework for auditing whether LLM attack benchmarks collectively cover the threat surface: a 4$\times$6 Target $\times$ Technique matrix grounded in STRID…

Case-Based Calibration of Adaptive Reasoning and Execution for LLM Tool Use

Tool use extends large language models beyond parametric knowledge, but reliable execution requires balancing appropriate reasoning depth with strict structural validity. We appro…

Octopus: History-Free Gradient Orthogonalization for Continual Learning in Multimodal Large Language Models

Continual learning in multimodal large language models (MLLMs) aims to sequentially acquire knowledge while mitigating catastrophic forgetting, yet existing methods face inherent…

A Hardware-Aware, Per-Layer Methodology for Post-Training Quantization of Large Language Models

Scaled Outer Product (SOP) is a post-training quantization methodology for large language model weights, designed to deliver near-lossless fidelity at 4.5--6 bits per weight on ha…

Quantifying and Mitigating Premature Closure in Frontier LLMs

Premature closure, or committing to a conclusion before sufficient information is available, is a recognized contributor to diagnostic error but remains underexamined in large lan…

Terminal and SWE Agents

SWE-Chain: Benchmarking Coding Agents on Chained Release-Level Package Upgrades

Coding agents powered by large language models are increasingly expected to perform realistic software maintenance tasks beyond isolated issue resolution. Existing benchmarks have…

2026-05-14

2026-05-14 12:52:54 (Asia/Shanghai)

MILM: Large Language Models for Multimodal Irregular Time Series with Informative Sampling

Multimodal irregular time series (MITS) consist of asynchronous and irregularly sampled observations from heterogeneous numerical and textual channels. In healthcare, for example,…

OpenAaaS: An Open Agent-as-a-Service Framework for Distributed Materials-Informatics Research

The Materials Genome Initiative catalyzed the proliferation of centralized platforms--SaaS, PaaS, and IaaS--that aggregate computational and experimental resources for accelerated…

Improving Reproducibility in Evaluation through Multi-Level Annotator Modeling

As generative AI models such as large language models (LLMs) become more pervasive, ensuring the safety, robustness, and overall trustworthiness of these systems is paramount. How…

GHGbench: A Unified Multi-Entity, Multi-Task Benchmark for Carbon Emission Prediction

Open datasets and benchmarks for entity-level carbon-emission prediction remain fragmented across access, scale, granularity, and evaluation. We introduce GHGbench, an open datase…

Agent Runtime Security

LLM-Based Persuasion Enables Guardrail Override in Frontier LLMs

Frontier assistant LLMs ship with strong guardrails: asked directly to write a persuasive essay denying the Holocaust, denying vaccine safety, defending flat-earth cosmology, argu…

LLM-based agents increasingly operate in persistent environments where they must store, update, and reason over information across many sessions. While prior benchmarks evaluate o…

Formalize, Don't Optimize: The Heuristic Trap in LLM-Generated Combinatorial Solvers

Large Language Models (LLMs) struggle to solve complex combinatorial problems through direct reasoning, so recent neuro-symbolic systems increasingly use them to synthesize execut…

Agent Runtime Security

A microservices-based endpoint monitoring platform with predictive NLP models for real-time security and hate-speech risk alerting

Organizations increasingly depend on endpoint devices and corporate communication channels, yet they still face critical risks such as sensitive data leakage, suspicious user beha…

LLMs are increasingly deployed as autonomous agents with access to tools, databases, and external services, yet practitioners (across different sectors) lack systematic methods to…

Long-horizon LLM agents depend on intermediate information-gathering turns, yet training feedback is usually observed only at the final answer, because process-level rewards requi…

Security analysts are overwhelmed by the volume of alerts and the low context provided by many detection systems. Early-stage investigations typically require manual correlation a…

RESTestBench: A Benchmark for Evaluating the Effectiveness of LLM-Generated REST API Test Cases from NL Requirements

Existing REST API testing tools are typically evaluated using code coverage and crash-based fault metrics. However, recent LLM-based approaches increasingly generate tests from NL…

Recursive Multi-Agent Systems

Recursive or looped language models have recently emerged as a new scaling axis by iteratively refining the same model computation over latent states to deepen reasoning. We exten…

CGU-ILALab at FoodBench-QA 2026: Comparing Traditional and LLM-based Approaches for Recipe Nutrient Estimation

Accurate nutrient estimation from unstructured recipe text is an important yet challenging problem in dietary monitoring, due to ambiguous ingredient terminology and highly variab…

Multi-label Chest X-ray (CXR) classification faces significant challenges from the inherently imperfect nature of clinical data, particularly the complex interplay of co-occurring…

Relighting a person from a single photo is an attractive but ill-posed task, as a 2D image ambiguously entangles 3D geometry, intrinsic appearance, and illumination. Current metho…

Vision

SSL-R1: Self-Supervised Visual Reinforcement Post-Training for Multimodal Large Language Models

Reinforcement learning (RL) with verifiable rewards (RLVR) has demonstrated the great potential of enhancing the reasoning abilities in multimodal large language models (MLLMs). H…

The rapid integration of large language models into electronic medical record systems introduces a critical theoretical vulnerability. Drawing on foundational computer science pro…

Open-Vocabulary Temporal Action Detection (OV-TAD) aims to localize and classify action segments of unseen categories in untrimmed videos, where effective alignment between action…

PubMed AI

Investigating fine-tuning versus zero-shot learning for general large language models when predicting cancer survival from initial oncology consultation documents.

BACKGROUND: Unstructured oncology consultation notes contain rich clinical information that may support survival prediction. Open-weight large language models (LLMs) can utilize t…

Autonomous Evolution of EDA Tools: Multi-Agent Self-Evolved ABC

This paper introduces the first self-evolving logic synthesis framework, which leverages Large Language Model (LLM) agents to autonomously improve the source code of \texts…

Vision

Unsupervised Skeleton-Based Action Segmentation via Hierarchical Spatiotemporal Vector Quantization

We propose a novel hierarchical spatiotemporal vector quantization framework for unsupervised skeleton-based temporal action segmentation. We first introduce a hierarchical approa…

Vision

Boundary-Centric Active Learning for Temporal Action Segmentation

Temporal action segmentation (TAS) demands dense temporal supervision, yet most of the annotation cost in untrimmed videos is spent identifying and refining action transitions, wh…

PubMed AI

From Image to Pixels: towards Fine-Grained Medical Vision-Language Models.

Multimodal large language models (MLLMs) offer immense potential for biomedical AI, yet current applications remain limited to coarse-grained image understanding and basic textual…

BACKGROUND: Accurate tumor node metastasis (TNM) staging is fundamental for treatment planning and prognosis in non-small cell lung cancer (NSCLC). However, its complexity poses s…

Multimodal federated learning enables privacy-preserving collaborative model training across healthcare institutions. However, a fundamental challenge arises from modality heterog…

Vision

Fragile Reconstruction: Adversarial Vulnerability of Reconstruction-Based Detectors for Diffusion-Generated Images

Recently, detecting AI-generated images produced by diffusion-based models has attracted increasing attention due to their potential threat to safety. Among existing approaches, r…

The aim of this study was to assess and benchmark plastic consumption in sample preparation for forensic analysis, alongside the development of an LC-MS method for ketamine analog…

2026-04-11

2026-04-11 23:09:08 (Asia/Shanghai)

OpenAlex AI

Coalition Formation Events: How Multi-Agent Systems Create Temporary Actors

Multi-Modal Landslide Detection from Sentinel-1 SAR and Sentinel-2 Optical Imagery Using Multi-Encoder Vision Transformers and Ensemble Learning

Landslides represent a major geohazard with severe impacts on human life, infrastructure, and ecosystems, underscoring the need for accurate and timely detection approaches to sup…