<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
<channel>
<title>RAG Topic Archive</title>
<link>rag.html</link>
<description>Long-running RSS feed tracking the keyword RAG, aggregating all previously matched papers.</description>
<language>zh-CN</language>
<lastBuildDate>Sun, 28 Jun 2026 05:24:06 +0000</lastBuildDate>
<item>
<title>NuclearQAv2: A Structured Benchmark for Evaluating Domain-Science Competence in Large Language Models</title>
<link>../papers/arxiv-91c0ed0f09c2.html</link>
<guid>https://arxiv.org/abs/2606.27047v1#2026-06-26#rag</guid>
<pubDate>Fri, 26 Jun 2026 13:16:53 +0800</pubDate>
<description>Large language models (LLMs) have demonstrated strong performance across a wide range of tasks, but ensuring their reliability in highly technical domains remains a significant challenge. In nuclear engineering, problem solving often requires not only factual knowledge but also quantitative reasoning and conceptual understanding. To address the need for systematic evaluation in this domain, we introduce NuclearQAv2, a benchmark for assessing LLMs on nuclear engineering knowledge. The benchmark…</description>
</item>
<item>
<title>Prompt Injection in Automated Résumé Screening with Large Language Models: Single and Multi-Injection Settings</title>
<link>../papers/arxiv-663bd6d3e1b5.html</link>
<guid>https://arxiv.org/abs/2606.27287v1#2026-06-26#rag</guid>
<pubDate>Fri, 26 Jun 2026 13:16:53 +0800</pubDate>
<description>Large language models (LLMs) are increasingly used to screen and rank job applicants, creating incentives for candidates to strategically manipulate algorithmic hiring systems. We study prompt injection in automated résumé screening, defined as subtle self-promotional text that introduces no new qualifications but is designed to influence LLM evaluations. Using controlled experiments, we show that prompt injection reliably improves applicant rankings when résumé quality is homogeneous and few c…</description>
</item>
<item>
<title>TOPS: First-Principles Visual Token Pruning via Constructing Token Optimal Preservation Sets for Efficient MLLM Inference</title>
<link>../papers/arxiv-4739852a0036.html</link>
<guid>https://arxiv.org/abs/2606.27161v1#2026-06-26#rag</guid>
<pubDate>Fri, 26 Jun 2026 13:16:53 +0800</pubDate>
<description>Multimodal large language models (MLLMs) have achieved strong multimodal reasoning capabilities, but their efficiency is limited by the large number of visual tokens, which introduces substantial computational overhead. Visual token pruning offers a natural solution, yet existing methods are imperfect: attention-based criteria tend to retain redundant tokens, while diversity-based criteria are often agnostic to user instructions. Even methods that combine multiple criteria still lack a principl…</description>
</item>
<item>
<title>When Does Combining Language Models Help? A Co-Failure Ceiling on Routing, Voting, and Mixture-of-Agents Across 67 Frontier Models</title>
<link>../papers/arxiv-89d85e485ed3.html</link>
<guid>https://arxiv.org/abs/2606.27288v1#2026-06-26#rag</guid>
<pubDate>Fri, 26 Jun 2026 13:16:53 +0800</pubDate>
<description>Multi-model LLM systems such as routing, voting, cascades, fusion, and mixture-of-agents are used to beat single-model accuracy. We show that their gain is capped by a quantity the field rarely reports. For any policy whose output is one member model answer, accuracy cannot exceed one minus beta, where beta is the rate at which every model is wrong on the same query. In contrast, the usual diagnostic, average pairwise error correlation rho, cannot identify beta: error laws with identical margin…</description>
</item>
<item>
<title>Improving General Role-Playing Agents via Psychology-Grounded Reasoning and Role-Aware Policy Optimization</title>
<link>../papers/arxiv-de6c8b129e13.html</link>
<guid>https://arxiv.org/abs/2606.27025v1#2026-06-26#rag</guid>
<pubDate>Fri, 26 Jun 2026 13:16:53 +0800</pubDate>
<description>Building general-purpose role-playing agents that faithfully portray any character from a natural-language profile remains challenging. The dominant paradigm -- supervised fine-tuning -- encourages behavioral mimicry without deep, human-like internal thought processes, resulting in poor out-of-distribution generalization. Therefore, we propose Psy-CoT, a psychology-grounded chain-of-thought framework that decomposes pre-response reasoning into three role-specific steps -- Interac…</description>
</item>
<item>
<title>HarmVideoBench: Benchmarking Harmful Video Understanding in Large Multimodal Models</title>
<link>../papers/arxiv-e4f42bdcbdde.html</link>
<guid>https://arxiv.org/abs/2606.27187v1#2026-06-26#rag</guid>
<pubDate>Fri, 26 Jun 2026 13:16:53 +0800</pubDate>
<description>Large vision-language models (LVLMs) have recently shown immense potential in automated content moderation, sparking growing interest in developing harmful-video benchmarks. However, we identify two primary limitations in existing works: 1) The multi-layered characteristics of harmful videos are overlooked. Existing benchmarks predominantly formulate evaluation as a binary classification task, failing to capture implicit or deep contextual harms. 2) Explanatory rationales are completely absent.…</description>
</item>
<item>
<title>OpenRCA 2.0: From Outcome Labels to Causal Process Supervision</title>
<link>../papers/arxiv-20b18a2e996b.html</link>
<guid>https://arxiv.org/abs/2606.27154v1#2026-06-26#rag</guid>
<pubDate>Fri, 26 Jun 2026 13:16:53 +0800</pubDate>
<description>Root cause analysis (RCA) poses a holistic test of LLM agentic capabilities, such as long-context understanding, multi-step reasoning, and tool use. However, existing datasets suffer from a fundamental gap: they label only the root cause, not the propagation path connecting it to the observed symptom, which largely simplifies the task to naive pattern matching. To support rigorous evaluation, we introduce PAVE, a step-wise labeling protocol that leverages known interventions from fault injectio…</description>
</item>
<item>
<title>Just how sure are you? Improving Verbalized Uncertainty Calibration in Medical VQA</title>
<link>../papers/arxiv-5c94d679d076.html</link>
<guid>https://arxiv.org/abs/2606.27023v1#2026-06-26#rag</guid>
<pubDate>Fri, 26 Jun 2026 13:16:53 +0800</pubDate>
<description>Multimodal large language models (MLLMs) applied to Medical Visual Question Answering (VQA) tend to produce overconfident outputs regardless of actual correctness, and existing verbalized confidence calibration methods, developed primarily for text-only LLMs, do not account for the multimodal nature of medical image understanding. This work proposes a training-based framework that finetunes MLLMs to improve their calibration using a composite loss function combining a Brier-style calibration te…</description>
</item>
<item>
<title>Jailbreaking for the Average Jane: Choosing Optimal Jailbreaks via Bandit Algorithms for Automatically Enhanced Queries</title>
<link>../papers/arxiv-eca9abd16e6a.html</link>
<guid>https://arxiv.org/abs/2606.26936v1#2026-06-26#rag</guid>
<pubDate>Fri, 26 Jun 2026 13:16:53 +0800</pubDate>
<description>With a profusion of jailbreaks for LLMs now widely known, a growing concern is that non-expert malicious actors (&quot;the average Jane&quot;) could elicit actionable responses to malicious requests. In this work, we examine whether this concern is justified. A non-expert malicious actor requires two ingredients for a successful attack: a powerful jailbreak for their target model, acting on an effective malicious query. For the former, we propose a novel attack strategy based on the multi-armed bandit fr…</description>
</item>
<item>
<title>Do Safety Guardrails Need to Reason? LeanGuard: A Fast and Light Approach for Robust Moderation</title>
<link>../papers/arxiv-da78713ac31a.html</link>
<guid>https://arxiv.org/abs/2606.26686v1#2026-06-26#rag</guid>
<pubDate>Fri, 26 Jun 2026 13:16:53 +0800</pubDate>
<description>In order to screen a prompt or a response, the recent guardrail methods generate a chain-of-thought (CoT) before they issue a verdict. This design follows a common belief that step-by-step reasoning improves a decision. However, CoT also makes the guard heavy and slow, because the model must generate many tokens before it decides. This may not match how guardrails are actually deployed. A guardrail sometimes should not be heavy and slow, and it often runs on-device, for example on an embodied r…</description>
</item>
<item>
<title>MIRROR: Novelty-Constrained Memory-Guided MCTS Red-Teaming for Agentic RAG</title>
<link>../papers/arxiv-d0d97e7148fe.html</link>
<guid>https://arxiv.org/abs/2606.26793v1#2026-06-26#rag</guid>
<pubDate>Fri, 26 Jun 2026 13:16:53 +0800</pubDate>
<description>Multimodal agentic retrieval-augmented generation (RAG) systems expand the attack surface beyond prompt injection to include text poisoning, image injection, direct-query attacks, and orchestrator-level tool manipulation. Existing red-teaming approaches are typically surface-specific and often recycle known attack templates; on text-poisoning benchmarks we measure 73-84% exact duplication. We present MIRROR, a unified cross-surface framework that performs memory-guided Monte Carlo tree search w…</description>
</item>
<item>
<title>To Run or Not to Run: Analyzing the Cost-Effectiveness of Code Execution in LLM-Based Program Repair</title>
<link>../papers/arxiv-f2d059b8979b.html</link>
<guid>https://arxiv.org/abs/2606.26978v1#2026-06-26#rag</guid>
<pubDate>Fri, 26 Jun 2026 13:16:53 +0800</pubDate>
<description>LLM-based agents for program repair are increasingly built on a &quot;generate-run-revise&quot; paradigm, iteratively executing tests to evaluate and refine patches. This execution-based approach has become standard practice in state-of-the-art systems. However, executions can be time-consuming and expensive, yet their impact on these agents remains underexplored. In this paper, we conduct a two-stage empirical study over execution behavior in LLM-based program repair. To characterize execution behavior…</description>
</item>
<item>
<title>How Robust is OCR-Reasoning? Evaluating OCR-Reasoning Robustness of Vision-Language Models under Visual Perturbations</title>
<link>../papers/arxiv-a993d18a7754.html</link>
<guid>https://arxiv.org/abs/2606.26041v1#2026-06-25#rag</guid>
<pubDate>Thu, 25 Jun 2026 13:11:21 +0800</pubDate>
<description>Vision-language models (VLMs) have achieved strong performance on OCR-based benchmarks and are increasingly focused on text-rich understanding, but their robustness under controlled visual degradation remains insufficiently understood. This gap is critical for OCR reasoning, where visual corruption can induce OCR errors and structural distortions, thereby introducing uncertainty into the reasoning task. To systematically study this problem, we introduce OCR-Robust, a benchmark designed for evaluati…</description>
</item>
<item>
<title>Probabilistic Agents in Deterministic Audits: Evaluating Multi-Agent Systems for Automated Audits Based on the German IT-Grundschutz</title>
<link>../papers/doi-a2a209515fcf.html</link>
<guid>https://arxiv.org/abs/2606.25622v1#2026-06-25#rag</guid>
<pubDate>Thu, 25 Jun 2026 13:11:21 +0800</pubDate>
<description>The NIS-2 Directive mandates robust Risk Management from thousands of small and medium enterprises. To ensure compliance, companies rely on established standards such as the German IT-Grundschutz (IT-GS) of the Federal Office for Information Security. However, IT-GS certification is resource-intensive and requires a high level of manual effort for documentation, validation, and revision, making scalable implementation difficult and expensive. Building upon our previous conceptual framework, thi…</description>
</item>
<item>
<title>Uncertainty Quantification for Computer-Use Agents: A Benchmark across Vision-Language Models and GUI Grounding Datasets</title>
<link>../papers/arxiv-4357e9fa7bf7.html</link>
<guid>https://arxiv.org/abs/2606.25760v1#2026-06-25#rag</guid>
<pubDate>Thu, 25 Jun 2026 13:11:21 +0800</pubDate>
<description>Computer-use agents turn vision-language model (VLM) predictions into executable GUI clicks, so reliable uncertainty estimates are essential for rejection, calibration, miss-severity ranking, and spatial safety regions. Yet evidence on post-hoc uncertainty quantification (UQ) for these agents is fragmented across isolated model and dataset pairs, leaving it unclear whether UQ rankings stay stable when the agent, benchmark, or observable interface changes. We present Argus, a cross-regime benchm…</description>
</item>
<item>
<title>MiniOpt: Reasoning to Model and Solve General Optimization Problems with Limited Resources</title>
<link>../papers/arxiv-c18831bd2d45.html</link>
<guid>https://arxiv.org/abs/2606.25832v1#2026-06-25#rag</guid>
<pubDate>Thu, 25 Jun 2026 13:11:21 +0800</pubDate>
<description>Achieving strong optimization generalization across diverse optimization problems while requiring limited training resources remains a challenging problem for optimization-oriented large language models (LLMs). Existing approaches typically rely on large-scale supervised datasets, costly reasoning annotations, and expensive intermediate step verification, resulting in substantial training overhead. To address these challenges, we propose MiniOpt, a reinforcement learning framework that learns t…</description>
</item>
<item>
<title>SARA: Unlocking Multilingual Knowledge in Mixture-of-Experts via Semantically Anchored Routing Alignment</title>
<link>../papers/arxiv-26ee80b3fcf6.html</link>
<guid>https://arxiv.org/abs/2606.25821v1#2026-06-25#rag</guid>
<pubDate>Thu, 25 Jun 2026 13:11:21 +0800</pubDate>
<description>Sparse Mixture-of-Experts (MoE) architectures have emerged as an increasingly influential paradigm as they offer a strategic balance between parameter scalability and computational efficiency. However, low-resource languages, which suffer from a scarcity of high-quality training data, often have their tokens routed to different experts than those predominantly activated by high-resource inputs, which limits cross-lingual expert sharing. This cross-lingual routing divergence consequently hinders…</description>
</item>
<item>
<title>How Reliable Is Your Jailbreak Judge? Calibration and Adversarial Robustness of Automated ASR Scoring</title>
<link>../papers/arxiv-25dbdb1a09a8.html</link>
<guid>https://arxiv.org/abs/2606.25487v1#2026-06-25#rag</guid>
<pubDate>Thu, 25 Jun 2026 13:11:21 +0800</pubDate>
<description>Almost every paper on LLM jailbreaks and prompt injection reports an attack-success rate (ASR), and that number is assigned not by people but by an automated judge: either a safety classifier trained for the task, or a general chat model prompted to grade. The judge is rarely checked. We check it. Using 596 human-labeled completions from the HarmBench classifier validation set, we compare the two judge families against human majority votes and then attack them. The two families fail in opposite…</description>
</item>
<item>
<title>Unlocking Model Potentials Through Adaptive Multi-Agent Scaffolding for Efficient Issue Resolution</title>
<link>../papers/arxiv-f7ba1bc50aef.html</link>
<guid>https://arxiv.org/abs/2606.25514v1#2026-06-25#rag</guid>
<pubDate>Thu, 25 Jun 2026 13:11:21 +0800</pubDate>
<description>Resolving issues with ambiguous and incomplete descriptions, particularly concerning complex bugs, requires a sophisticated, long-horizon workflow. Agents must navigate codebases to locate the root cause, reproduce the failure, implement a fix, and validate the resulting patch. Inefficient context management can therefore lead to rapid context degradation and context poisoning, preventing successful resolution. We propose icat-agent, a decentralized, multi-agent scaffolding that replaces shared…</description>
</item>
<item>
<title>AGORA: An Archive-Grounded Benchmark for Agentic Workplace Document Reasoning</title>
<link>../papers/arxiv-7fb19b10d271.html</link>
<guid>https://arxiv.org/abs/2606.24526v1#2026-06-24#rag</guid>
<pubDate>Wed, 24 Jun 2026 13:06:49 +0800</pubDate>
<description>Large language models are increasingly deployed as agents that reason over documents rather than answer from parametric knowledge. We study archive-grounded reasoning: locating sparse evidence across a large, messy collection of workplace files, reconciling inconsistent terminology, units, and time conventions, and computing an answer. Existing benchmarks address only parts of this setting and none jointly stresses archive-groundedness, agentic exploration, and cross-domain coverage. We introdu…</description>
</item>
<item>
<title>AI-PAVE-Br: Leveraging Large Language Models for Enhanced Product Attribute Value Extraction through a Golden Set Approach</title>
<link>../papers/doi-d4dcf6e219ed.html</link>
<guid>https://arxiv.org/abs/2606.24655v1#2026-06-24#rag</guid>
<pubDate>Wed, 24 Jun 2026 13:06:49 +0800</pubDate>
<description>The explosive growth and complexity of product data within the dynamic Brazilian e-commerce landscape demand robust and specialized methods for structured information extraction. Traditional approaches to Product Attribute Value Extraction (PAVE) often struggle with the linguistic nuances and sheer diversity of product descriptions in Portuguese. To address this critical gap, this paper introduces two major contributions. First, we present AI-PAVE-Br, a specialized system engineered with Large L…</description>
</item>
<item>
<title>Are We Ready For An Agent-Native Memory System?</title>
<link>../papers/arxiv-09ad880f1f66.html</link>
<guid>https://arxiv.org/abs/2606.24775v1#2026-06-24#rag</guid>
<pubDate>Wed, 24 Jun 2026 13:06:49 +0800</pubDate>
<description>Memory for large language model (LLM) agents has rapidly evolved from simple retrieval-augmented mechanisms into a data management system that supports persistent information storage, retrieval, update, consolidation, and dynamic lifecycle governance throughout agent execution. Despite this evolution, existing evaluations still benchmark agent memory mainly through end-to-end task success metrics (e.g., F1, BLEU), while treating the underlying system as a monolithic black box. As a result, crit…</description>
</item>
<item>
<title>CineCap: Structured Reasoning with Spatio-Temporal Anchors for Cinematographic Video Captioning</title>
<link>../papers/arxiv-521120a059b4.html</link>
<guid>https://arxiv.org/abs/2606.24636v1#2026-06-24#rag</guid>
<pubDate>Wed, 24 Jun 2026 13:06:49 +0800</pubDate>
<description>Cinematographic captioning aims to describe how a video is filmed using professional film-language concepts such as camera movement, shot size, depth of field, composition, and shooting angle. This capability is important for fine-grained video understanding and controllable movie-quality video generation, yet remains underexplored in existing multimodal large language models. Unlike question-answering-based evaluation of cinematic understanding, cinematographic captioning requires a unified op…</description>
</item>
<item>
<title>AdversaBench: Automated LLM Red-Teaming with Multi-Judge Confirmation and Cross-Model Transferability</title>
<link>../papers/arxiv-22621133f739.html</link>
<guid>https://arxiv.org/abs/2606.24589v1#2026-06-24#rag</guid>
<pubDate>Wed, 24 Jun 2026 13:06:49 +0800</pubDate>
<description>Scaling adversarial evaluation of large language models requires both a method for generating hard inputs and a reliable way to confirm that resulting failures are real. We present AdversaBench, an end-to-end red-teaming pipeline that mutates seed prompts with five structured operators, queries a target model, and confirms failures through a three-judge panel with a meta-judge tiebreaker. We report experiments on 45 seeds across three categories: reasoning, instruction-following, and tool use.…</description>
</item>
<item>
<title>Privacy-Preserving RAG via Multi-Agent Semantic Rewriting: Achieving Confidentiality Without Compromising Contextual Fidelity</title>
<link>../papers/arxiv-80e1786313ab.html</link>
<guid>https://arxiv.org/abs/2606.24623v1#2026-06-24#rag</guid>
<pubDate>Wed, 24 Jun 2026 13:06:49 +0800</pubDate>
<description>Retrieval-Augmented Generation enhances large language models by incorporating external knowledge, but deploying it in sensitive scenarios risks privacy leakage via malicious prompts. To address this, we propose a multi-agent framework that sanitizes retrieved content through semantic rewriting. By employing three specialized agents for privacy extraction, semantic analysis, and reconstruction, our approach collaboratively removes sensitive identifiers while preserving the semantic core. We eva…</description>
</item>
<item>
<title>Reinforcement Learning for Computer-Use Agents with Autonomous Evaluation</title>
<link>../papers/arxiv-640fe613ba1c.html</link>
<guid>https://arxiv.org/abs/2606.24515v1#2026-06-24#rag</guid>
<pubDate>Wed, 24 Jun 2026 13:06:49 +0800</pubDate>
<description>Computer-Use Agents (CUAs) execute high-level user goals by perceiving and acting directly within graphical user interfaces. However, reinforcement learning for CUAs remains difficult because open-ended desktop environments rarely provide scalable, machine-readable reward signals: task success is often visually grounded and hard to specify with handcrafted reward functions or dense manual labels. We propose an RL fine-tuning framework that uses autonomous vision-language evaluation as a scalabl…</description>
</item>
<item>
<title>Qwen-AgentWorld: Language World Models for General Agents</title>
<link>../papers/arxiv-dbbe3714f257.html</link>
<guid>https://arxiv.org/abs/2606.24597v1#2026-06-24#rag</guid>
<pubDate>Wed, 24 Jun 2026 13:06:49 +0800</pubDate>
<description>A world model predicts environment dynamics based on current observations and actions, serving as a core cognitive mechanism for reasoning and planning. In this work, we investigate how world modeling based on language models can further push the boundaries of general agents. (i) We first focus on building foundation models for agentic environment simulation. We introduce Qwen-AgentWorld-35B-A3B and Qwen-AgentWorld-397B-A17B, the first language world models capable of simulating agentic environ…</description>
</item>
<item>
<title>PHANTOM: A Large-Scale Dataset of Multimodal Adversarial Attacks for Vision-Language Models</title>
<link>../papers/arxiv-5e668ce8c325.html</link>
<guid>https://arxiv.org/abs/2606.24388v1#2026-06-24#rag</guid>
<pubDate>Wed, 24 Jun 2026 13:06:49 +0800</pubDate>
<description>We introduce a large-scale, open-source dataset of pre-generated adversarial attacks for vision-language models (VLMs). The dataset is designed to be diverse, representative, and practical, extending existing benchmarks by covering 10 high-level categories and 55 subcategories of harmful intents. Our primary goal is to make adversarial data accessible to the research community, given the computational cost and complexity of generating large numbers of attacks. The dataset comprises 47 524 adver…</description>
</item>
<item>
<title>SHERLOC: Structured Diagnostic Localization for Code Repair Agents</title>
<link>../papers/arxiv-b868687e026f.html</link>
<guid>https://arxiv.org/abs/2606.24820v1#2026-06-24#rag</guid>
<pubDate>Wed, 24 Jun 2026 13:06:49 +0800</pubDate>
<description>LLM agents solve repository-level coding tasks through multi-turn tool use, but utilize half their budget on locating faults before editing. Dedicated localization frameworks have emerged, yet are still evaluated as file retrieval rather than actionable diagnosis, producing locations without the diagnostic context a repair agent needs. We introduce SHERLOC (Structured Hypothesis-driven Exploration and Reasoning for Localization), a training-free framework pairing a reasoning LLM with compact re…</description>
</item>
<item>
<title>NatureBench: Can Coding Agents Match the Published SOTA of Nature-Family Papers?</title>
<link>../papers/arxiv-934bd8b79fae.html</link>
<guid>https://arxiv.org/abs/2606.24530v1#2026-06-24#rag</guid>
<pubDate>Wed, 24 Jun 2026 13:06:49 +0800</pubDate>
<description>We introduce NatureBench, a cross-discipline benchmark of 90 tasks distilled from peer-reviewed Nature-family publications, designed to evaluate whether AI coding agents can move beyond reproduction toward discovery on real scientific problems. NatureBench is built on NatureGym, an automated pipeline that constructs a standardized, per-task containerized environment from a source paper, addressing the environment-fragmentation problem that has limited the credibility of prior agent-on-research…</description>
</item>
<item>
<title>LemonHarness Technical Report</title>
<link>../papers/arxiv-a79c559da3e4.html</link>
<guid>https://arxiv.org/abs/2606.24311v1#2026-06-24#rag</guid>
<pubDate>Wed, 24 Jun 2026 13:06:49 +0800</pubDate>
<description>As large language model (LLM) agents are applied to longer tasks, they increasingly modify workspace state across multiple rounds of iteration. However, agents typically observe only tool outputs and log fragments, while the actual state changes occur in the file system. Without explicit workspace boundaries, state-changing operations such as file writes and temporary artifact generation may scatter changes across paths. Over time, these weakly constrained changes accumulate, making states such…</description>
</item>
<item>
<title>AIR: Adaptive Interleaved Reasoning with Code in MLLMs</title>
<link>../papers/arxiv-3b598225b45f.html</link>
<guid>https://arxiv.org/abs/2606.23678v1#2026-06-23#rag</guid>
<pubDate>Tue, 23 Jun 2026 13:10:02 +0800</pubDate>
<description>Following the paradigm shift initiated by OpenAI o3, interleaved reasoning with code to enhance multimodal large language models (MLLMs) has become a pivotal research frontier. The existing literature focuses primarily on tool-use within vision-perception tasks. However, such approaches typically rely on predefined heuristics for visual manipulation and are inherently incapable of addressing numerical computation problems due to their exclusive focus on visual operations. This paper empowers ML…</description>
</item>
<item>
<title>TriggerBench: Investigating Prospective Memory for Large Language Models</title>
<link>../papers/arxiv-6e7f90d682c8.html</link>
<guid>https://arxiv.org/abs/2606.23459v1#2026-06-23#rag</guid>
<pubDate>Tue, 23 Jun 2026 13:10:02 +0800</pubDate>
<description>While Large Language Models (LLMs) are increasingly deployed in long interactions, existing evaluations focus predominantly on retrospective memory (RM) via explicit queries. Prospective memory (PM), the critical ability to spontaneously recall and act on latent constraints without direct prompts, remains largely unevaluated. We introduce TriggerBench, a comprehensive PM benchmark spanning five dimensions across both daily assistants and professional workflows. TriggerBench pairs scenarios with…</description>
</item>
<item>
<title>Can LLMs Reliably Self-Report Adversarial Prefills, and How?</title>
<link>../papers/arxiv-1052b107e838.html</link>
<guid>https://arxiv.org/abs/2606.23671v1#2026-06-23#rag</guid>
<pubDate>Tue, 23 Jun 2026 13:10:02 +0800</pubDate>
<description>Prior work shows that large language models (LLMs) exhibit introspective capability on benign tasks. We extend the question to safety contexts and examine how reliably a model can recognize that its own prior response was elicited by an adversarial prefill attack. Across ten open-weight instruction-tuned LLMs (3B to 70B) and four safety benchmarks, no model reliably recognizes its own compromised outputs, with models claiming intent on prefilled responses at an average rate of 27.3%. Introsp…</description>
</item>
<item>
<title>POTracker: Optimizing Large Language Models for Standard-Compliant Power Outage Report Generation</title>
<link>../papers/arxiv-6142e52062f6.html</link>
<guid>https://arxiv.org/abs/2606.23533v1#2026-06-23#rag</guid>
<pubDate>Tue, 23 Jun 2026 13:10:02 +0800</pubDate>
<description>Recent large language models (LLMs) are good at general text generation, but it is still hard to use them for domain-specific data generation because the output must follow strict formatting and structural rules. Unlike open-ended tasks such as question answering or translation, domain-specific generation must be both semantically correct and compliant with existing guidelines and standards. In this work, we study the nationwide interoperability problem of utility power outage reports in the Un…</description>
</item>
<item>
<title>Teaching LLMs String Matching, Backtracking, and Error Recovery to Deduce Bases and Truth Tables for the Combinatorially Exploding Bit Manipulation Puzzles</title>
<link>../papers/arxiv-cfea83337d7a.html</link>
<guid>https://arxiv.org/abs/2606.23672v1#2026-06-23#rag</guid>
<pubDate>Tue, 23 Jun 2026 13:10:02 +0800</pubDate>
<description>This paper presents our algorithmic innovations for the NVIDIA Nemotron Model Reasoning Challenge, focusing on Bit Manipulation Puzzles. In this task, the objective is to discover a hidden logical rule transforming input binary strings to outputs, then apply it to unseen inputs. Large Language Models (LLMs) notoriously struggle here; traditional methods force them to simulate complex boolean logic and arithmetic, leading to hallucinations. Furthermore, the search space of bitwise operations (co…</description>
</item>
<item>
<title>Capable but Careless: Do Computer-Use Agents Follow Contextual Integrity?</title>
<link>../papers/arxiv-9dcfdba2a88b.html</link>
<guid>https://arxiv.org/abs/2606.23189v1#2026-06-23#rag</guid>
<pubDate>Tue, 23 Jun 2026 13:10:02 +0800</pubDate>
<description>Computer-use agents (CUAs) now act on a user&#x27;s behalf across personal applications such as email, calendars, and to-do lists. This cross-application access is useful, but it also creates a privacy risk that has been largely overlooked: when an agent works in one context, it can pull in information from another that is inappropriate in that context. Hence, we introduce AgentCIBench, an evaluation harness that turns this risk into executable, deterministically scored scenarios. We target three co…</description>
</item>
<item>
<title>TROPT: An Open Framework for Unifying and Advancing Discrete Text Optimization</title>
<link>../papers/arxiv-0caba902aa17.html</link>
<guid>https://arxiv.org/abs/2606.23496v1#2026-06-23#rag</guid>
<pubDate>Tue, 23 Jun 2026 13:10:02 +0800</pubDate>
<description>Discrete text-trigger optimization -- searching for text sequences that, when ingested by a model, steer it toward a specified objective -- underpins model red-teaming (e.g., LLM jailbreaks), as well as auditing and interpretability. However, the current state of discrete optimizers hinders their adoption and progress. First, existing optimizers, when open-sourced at all, are scattered across research codebases tied to specific models, objectives, and problem domains. Second, optimizer variants…</description>
</item>
<item>
<title>Navigating Unreliable Parametric and Contextual Knowledge: Explicit Knowledge Conflict Resolution for LLM Inference</title>
<link>../papers/arxiv-1b4902e41aec.html</link>
<guid>https://arxiv.org/abs/2606.20245v1#2026-06-19#rag</guid>
<pubDate>Fri, 19 Jun 2026 14:26:15 +0800</pubDate>
<description>Large language models (LLMs) have achieved strong performance across a wide range of language-based tasks by leveraging both extensive parametric knowledge and in-context learning ability, enabling them to incorporate external information provided in the input prompt. However, the integration of external knowledge can introduce conflicts, not only between the model&#x27;s internal parametric knowledge and the external information, but also among multiple pieces of external contexts. Existing approac…</description>
</item>
<item>
<title>Your Mouse and Eyes Secretly Leak Your Preference: LLM Alignment using Implicit Feedback from Users</title>
<link>../papers/arxiv-87d47984a1d0.html</link>
<guid>https://arxiv.org/abs/2606.20482v1#2026-06-19#rag</guid>
<pubDate>Fri, 19 Jun 2026 14:26:15 +0800</pubDate>
<description>To align a Large Language Model (LLM), most existing methods collect explicit human feedback and train a reward model to predict the human preference based on the response text. These existing methods have two key limitations. First, the users rarely provide explicit feedback for LLM responses, which makes the high-quality preference annotation expensive to collect. Second, the methods do not leverage implicit human feedback, which has proven vital to the economic moats of Internet giants. To q…</description>
</item>
<item>
<title>Multi-View Decompilation for LLM-Based Malware Classification</title>
<link>../papers/arxiv-3a55d1872ba7.html</link>
<guid>https://arxiv.org/abs/2606.20436v1#2026-06-19#rag</guid>
<pubDate>Fri, 19 Jun 2026 14:26:15 +0800</pubDate>
<description>Malware analysts often inspect compiled binaries through decompiled pseudo-C when source code is unavailable. Recent work suggests that large language models (LLMs) can assist this process by classifying decompiled code as benign or malicious, but existing pipelines typically rely on a single decompiler view. We argue that this assumption is fragile: decompilers are lossy heuristic tools, and different decompilers can expose different artefacts of the same binary. We curate a benchmark of beni…</description>
</item>
<item>
<title>PsyScore: A Psychometrically-Aware Framework for Trait-Adaptive Essay Scoring and ZPD-Scaffolded Feedback</title>
<link>../papers/arxiv-3c36b58c9dfd.html</link>
<guid>https://arxiv.org/abs/2606.20287v1#2026-06-19#rag</guid>
<pubDate>Fri, 19 Jun 2026 14:26:15 +0800</pubDate>
<description>Effective Automated Essay Scoring (AES) systems are expected to support both reliable assessment and actionable instructional feedback. However, existing approaches often treat scoring and feedback as separate components: neural scoring models provide limited interpretability, while Large Language Model (LLM)-based feedback is typically insensitive to learners&#x27; proficiency levels. To address this fragmentation, this work proposes PsyScore, a psychometrically-aware framework that integrates diagnostic as…</description>
</item>
<item>
<title>ELVA: Exploring Ranking-Driven Universal Multimodal Retrieval</title>
<link>../papers/arxiv-ff6f7db531ba.html</link>
<guid>https://arxiv.org/abs/2606.20280v1#2026-06-19#rag</guid>
<pubDate>Fri, 19 Jun 2026 14:26:15 +0800</pubDate>
<description>Leveraging Multimodal Large Language Models (MLLMs) via contrastive learning has become a mainstream paradigm for improving the performance of Universal Multimodal Retrieval (UMR). However, previous works have ignored the grain blindness when adapting the contrastive paradigm into retrieval tasks. Grain blindness refers to the tendency of the model to overlook grain-level information contained in the query, which is crucial for effectively handling complex queries. This stems from contrastive l…</description>
</item>
<item>
<title>LedgerAgent: Structured State for Policy-Adherent Tool-Calling Agents</title>
<link>../papers/arxiv-863f5636db25.html</link>
<guid>https://arxiv.org/abs/2606.20529v1#2026-06-19#rag</guid>
<pubDate>Fri, 19 Jun 2026 14:26:15 +0800</pubDate>
<description>Policy-adherent tool-calling agents in customer-service domains must maintain task states across turns while calling tools and obeying domain policies. Task states consist of relevant facts, identifiers, constraints, and conditions observed through user interaction and tool calls. In standard agents, task states are not represented separately. Observations, tool returns, and policy instructions are placed in the prompt, leaving agents to reconstruct the relevant states from the prompt each time…</description>
</item>
<item>
<title>RACL: Reasoning-Agent Control Layers for Continuous Metaheuristic Learning</title>
<link>../papers/arxiv-380550205212.html</link>
<guid>https://arxiv.org/abs/2606.20142v1#2026-06-19#rag</guid>
<pubDate>Fri, 19 Jun 2026 14:26:15 +0800</pubDate>
<description>This paper introduces RACL, a Reasoning-Agent Control Layer for metaheuristics. RACL places a reasoning agent above an existing optimizer. The agent does not replace the optimizer and does not modify business constraints. Instead, it controls the optimizer&#x27;s internal search behavior by observing operational memory, reasoning over past behavior, formulating bounded hypotheses, testing interventions, evaluating outcomes, applying guardrails, consolidating useful policies and explaining its decisi…</description>
</item>
<item>
<title>Probe-and-Refine Tuning of Repository Guidance for Coding Agents</title>
<link>../papers/arxiv-b849fd15a901.html</link>
<guid>https://arxiv.org/abs/2606.20512v1#2026-06-19#rag</guid>
<pubDate>Fri, 19 Jun 2026 14:26:15 +0800</pubDate>
<description>LLM-based coding agents need higher-level operational knowledge about a repository (which files house which subsystems, how to run the test suite, which workflows have historically led to wrong fixes) that does not exist in the code itself. Engineers typically maintain AGENTS.md files to supply this context as instructions for coding agents, but whether they help is contested: recent studies disagree on whether LLM-generated guidance improves or harms agent performance. In this paper w…</description>
</item>
<item>
<title>A Technical Taxonomy of LLM Agent Communication Protocols</title>
<link>../papers/arxiv-a3308e8fb0ba.html</link>
<guid>https://arxiv.org/abs/2606.19135v1#2026-06-18#rag</guid>
<pubDate>Thu, 18 Jun 2026 14:03:08 +0800</pubDate>
<description>As large language models (LLMs) advance and multi-agent systems aim to overcome the limits of standalone agents, robust communication protocols are becoming essential infrastructure for distributed agent networks. Nonetheless, the fragmented protocol landscape presents a significant interoperability challenge. This study develops a technical taxonomy to classify and analyze LLM agent communication protocols. Following an established iterative method, we defined the taxonomy&#x27;s purpose, meta-char…</description>
</item>
<item>
<title>X+Slides: Benchmarking Audience-Conditioned Slide Generation</title>
<link>../papers/arxiv-9c43ebfc69d7.html</link>
<guid>https://arxiv.org/abs/2606.19256v1#2026-06-18#rag</guid>
<pubDate>Thu, 18 Jun 2026 14:03:08 +0800</pubDate>
<description>Automatically generating slide decks from source documents is an important application of large language models (LLMs). Existing benchmarks primarily assess slide completeness and technical depth, while overlooking the target audience as a critical real-world factor. For instance, specialists demand rigorous proofs, whereas decision-makers prioritize actionable conclusions. To bridge this gap, we introduce X+Slides, a benchmark specifically designed for audience-conditioned slide generation. Bu…</description>
</item>
<item>
<title>Beyond Tokenization: Direct Timestep Embedding and Contrastive Alignment for Time-Series Question Answering</title>
<link>../papers/arxiv-84bd52c521fa.html</link>
<guid>https://arxiv.org/abs/2606.18986v1#2026-06-18#rag</guid>
<pubDate>Thu, 18 Jun 2026 14:03:08 +0800</pubDate>
<description>Recent advances in large language models (LLMs) have given rise to time-series question answering (TSQA), which formulates time-series analysis as natural-language question answering. However, directly feeding raw numerical series into LLMs suffers from a tokenization bottleneck: Byte Pair Encoding fragments continuous values into unstable tokens whose embeddings lack meaningful metric structure, resulting in the loss of magnitude, scale, and trend information. Prior methods use patch-based enc…</description>
</item>
<item>
<title>Diffusion-Proof: Recipe for Formal Theorem Proving Beyond Auto-Regressive Generation</title>
<link>../papers/arxiv-d8657d87faac.html</link>
<guid>https://arxiv.org/abs/2606.19315v1#2026-06-18#rag</guid>
<pubDate>Thu, 18 Jun 2026 14:03:08 +0800</pubDate>
<description>Enhancing the formal math reasoning capabilities of Large Language Models (LLMs) has become a key focus in both mathematical and computer science communities in recent years. While significant progress has been made in using state-of-the-art Auto-Regressive (AR) LLMs for formal theorem proving, these models suffer from inherent limitations. Their next-token prediction generation methods may yield suboptimal performance due to the challenges of long-range coherence and the compounding of errors…</description>
</item>
</channel>
</rss>
