# Daily Paper Digest

- Generated at: 2026-05-08 14:15:32 (Asia/Shanghai)
- Search window: last 24 hours
- Hit overview: LM=12, Agent Runtime Security=1
- Ranking strategy: hybrid (relevance first, published_at tie-break)
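
The hybrid ranking strategy listed above can be read as a two-key sort. The following is a minimal sketch of that reading, not the digest generator's actual code: the `Paper` type and `hybrid_rank` function are hypothetical names, and the tie-break direction (newer papers first) is an assumption, since the digest only says that `published_at` breaks ties.

```python
from datetime import datetime
from typing import List, NamedTuple


class Paper(NamedTuple):
    """Hypothetical record holding only the fields needed for ranking."""
    title: str
    relevance: int
    published_at: datetime


def hybrid_rank(papers: List[Paper]) -> List[Paper]:
    """Order by relevance (highest first); ties go to the newer paper (assumed)."""
    # Python's sort is stable, so sorting by the secondary key first and
    # the primary key second yields the combined ordering.
    by_date = sorted(papers, key=lambda p: p.published_at, reverse=True)
    return sorted(by_date, key=lambda p: p.relevance, reverse=True)


if __name__ == "__main__":
    sample = [
        Paper("older, score 213", 213, datetime(2026, 5, 7, 23, 8)),
        Paper("newer, score 231", 231, datetime(2026, 5, 8, 12, 0)),
        Paper("newer, score 213", 213, datetime(2026, 5, 8, 12, 0)),
    ]
    for paper in hybrid_rank(sample):
        print(paper.relevance, paper.published_at.isoformat(), paper.title)
```

Two stable sorts are used instead of a single composite key so that no date arithmetic is needed to invert the secondary key.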

## Today's Highlights

- Topic "LLM": 11 hits, covering LM; representative papers include "LatentRAG: Latent Reasoning and Retrieval for Efficient Agentic RAG" and "MANTRA: Synthesizing SMT-Validated Compliance Benchmarks for Tool-Using LLM Agents".
- Topic "Benchmark": 10 hits, covering LM; representative papers include "LatentRAG: Latent Reasoning and Retrieval for Efficient Agentic RAG" and "MANTRA: Synthesizing SMT-Validated Compliance Benchmarks for Tool-Using LLM Agents".
- Topic "Language Model": 3 hits, covering LM; representative papers include "Multi-Dimensional Behavioral Evaluation of Agentic Stock Prediction Systems Using LLM Judges with Closed-Loop Reinforcement Learning Feedback" and "BioTool: A Comprehensive Tool-Calling Dataset for Enhancing Biomedical Capabilities of Large Language Models".
- Topic "Agent": 1 hit, covering Agent Runtime Security; representative papers include "Constraining Host-Level Abuse in Self-Hosted Computer-Use Agents via TEE-Backed Isolation".
- Topic "Evaluation": 1 hit, covering Agent Runtime Security; representative papers include "Constraining Host-Level Abuse in Self-Hosted Computer-Use Agents via TEE-Backed Isolation".

## Topic Focus

### LLM

- Hits: 11
- Groups covered: LM
- Representative papers: "LatentRAG: Latent Reasoning and Retrieval for Efficient Agentic RAG", "MANTRA: Synthesizing SMT-Validated Compliance Benchmarks for Tool-Using LLM Agents", "Measuring Evaluation-Context Divergence in Open-Weight LLMs: A Paired-Prompt Protocol with Pilot Evidence of Alignment-Pipeline-Specific Heterogeneity"
- Quick read:
  - "LatentRAG: Latent Reasoning and Retrieval for Efficient Agentic RAG" [Evaluation / Data / Application / Method]: Single-step retrieval-augmented generation (RAG) provides an efficient way to incorporate external information for simple question answering tasks but struggle…
  - "MANTRA: Synthesizing SMT-Validated Compliance Benchmarks for Tool-Using LLM Agents" [Evaluation / Method]: Tool-using large language model (LLM) agents are increasingly deployed in settings where their reliable behavior is governed by strict procedural manuals. Ensu…

### Benchmark

- Hits: 10
- Groups covered: LM
- Representative papers: "LatentRAG: Latent Reasoning and Retrieval for Efficient Agentic RAG", "MANTRA: Synthesizing SMT-Validated Compliance Benchmarks for Tool-Using LLM Agents", "Measuring Evaluation-Context Divergence in Open-Weight LLMs: A Paired-Prompt Protocol with Pilot Evidence of Alignment-Pipeline-Specific Heterogeneity"
- Quick read:
  - "LatentRAG: Latent Reasoning and Retrieval for Efficient Agentic RAG" [Evaluation / Data / Application / Method]: Single-step retrieval-augmented generation (RAG) provides an efficient way to incorporate external information for simple question answering tasks but struggle…
  - "MANTRA: Synthesizing SMT-Validated Compliance Benchmarks for Tool-Using LLM Agents" [Evaluation / Method]: Tool-using large language model (LLM) agents are increasingly deployed in settings where their reliable behavior is governed by strict procedural manuals. Ensu…

### Language Model

- Hits: 3
- Groups covered: LM
- Representative papers: "Multi-Dimensional Behavioral Evaluation of Agentic Stock Prediction Systems Using LLM Judges with Closed-Loop Reinforcement Learning Feedback", "BioTool: A Comprehensive Tool-Calling Dataset for Enhancing Biomedical Capabilities of Large Language Models", "Evaluation Awareness in Language Models Has Limited Effect on Behaviour"
- Quick read:
  - "Multi-Dimensional Behavioral Evaluation of Agentic Stock Prediction Systems Using LLM Judges with Closed-Loop Reinforcement Learning Feedback" [Evaluation / Application / Method]: Agentic stock prediction systems make sequences of interdependent decisions (regime detection, pathway routing, reinforcement learning control) whose individua…
  - "BioTool: A Comprehensive Tool-Calling Dataset for Enhancing Biomedical Capabilities of Large Language Models" [Evaluation / Data / Application / Method]: Despite the success of large language models (LLMs) on general-purpose tasks, their performance in highly specialized domains such as biomedicine remains unsat…

### Agent

- Hits: 1
- Groups covered: Agent Runtime Security
- Representative papers: "Constraining Host-Level Abuse in Self-Hosted Computer-Use Agents via TEE-Backed Isolation"
- Quick read:
  - "Constraining Host-Level Abuse in Self-Hosted Computer-Use Agents via TEE-Backed Isolation" [Evaluation / Application / Method]: Self-hosted computer-use agents (SHCUAs), such as OpenClaw, combine natural-language interaction with direct access to host-side resources, including browsers,…

### Evaluation

- Hits: 1
- Groups covered: Agent Runtime Security
- Representative papers: "Constraining Host-Level Abuse in Self-Hosted Computer-Use Agents via TEE-Backed Isolation"
- Quick read:
  - "Constraining Host-Level Abuse in Self-Hosted Computer-Use Agents via TEE-Backed Isolation" [Evaluation / Application / Method]: Self-hosted computer-use agents (SHCUAs), such as OpenClaw, combine natural-language interaction with direct access to host-side resources, including browsers,…

## LM Watch

### Group at a Glance

- "LatentRAG: Latent Reasoning and Retrieval for Efficient Agentic RAG" [Evaluation / Data / Application / Method]: Single-step retrieval-augmented generation (RAG) provides an efficient way to incorporate external information for simple question answering tasks but struggle…
- "MANTRA: Synthesizing SMT-Validated Compliance Benchmarks for Tool-Using LLM Agents" [Evaluation / Method]: Tool-using large language model (LLM) agents are increasingly deployed in settings where their reliable behavior is governed by strict procedural manuals. Ensu…
- "Measuring Evaluation-Context Divergence in Open-Weight LLMs: A Paired-Prompt Protocol with Pilot Evidence of Alignment-Pipeline-Specific Heterogeneity" [Evaluation / Application / Method]: Safety benchmarks are routinely treated as evidence about how a language model will behave once deployed, but this inference is fragile if behavior depends on…
- "Multi-Dimensional Behavioral Evaluation of Agentic Stock Prediction Systems Using LLM Judges with Closed-Loop Reinforcement Learning Feedback" [Evaluation / Application / Method]: Agentic stock prediction systems make sequences of interdependent decisions (regime detection, pathway routing, reinforcement learning control) whose individua…
- "BioTool: A Comprehensive Tool-Calling Dataset for Enhancing Biomedical Capabilities of Large Language Models" [Evaluation / Data / Application / Method]: Despite the success of large language models (LLMs) on general-purpose tasks, their performance in highly specialized domains such as biomedicine remains unsat…

### Paper Overview

1. [LatentRAG: Latent Reasoning and Retrieval for Efficient Agentic RAG](https://arxiv.org/abs/2605.06285)
   - Published: 2026-05-08 12:00
   - Authors: Yijia Zheng, Marcel Worring
   - Source: arxiv
   - Relevance score: 231
   - Match reasons: title matched "reasoning"; title matched "agent"; title matched "RAG"; summary matched "language model"
   - Categories: cs.CL, cs.LG
   - Tags: Evaluation / Data / Application / Method
   - Topics: LLM / Benchmark
   - PDF: https://arxiv.org/pdf/2605.06285
   - Abstract: Single-step retrieval-augmented generation (RAG) provides an efficient way to incorporate external information for simple question answering tasks but struggles with complex questions. Agentic RAG extends this paradigm by replacing single-step retrieval with a multi-step process, in which the large language model (LLM) acts as a search agent that generates intermediate thoughts and subqueries to iteratively interact with the retrieval system. This iterative process incurs substantial latency due to the autoregressive generation of lengthy thoughts and subqueries. To address this limitation, we propose LatentRAG, a novel framework that shifts both reasoning and retrieval from discrete language space to continuous latent space. Unlike existing explicit methods that generate natural language thoughts or subqueries token-by-token, LatentRAG produces latent tokens for thoughts and subqueries directly from the hidden states in a single forward pass. We align LLMs with dense retrieval models in the latent space, enabling retrieval over latent subquery tokens and supporting end-to-end joint optimization. To improve transparency and encourage semantically meaningful latent representations, we incorporate a parallel latent decoding mechanism that translates latent tokens back into natural language. Extensive experiments on seven benchmark datasets show that LatentRAG achieves performance comparable to explicit agentic RAG methods while reducing inference latency by approximately 90%, substantially narrowing the latency gap with traditional single-step RAG.

2. [MANTRA: Synthesizing SMT-Validated Compliance Benchmarks for Tool-Using LLM Agents](https://arxiv.org/abs/2605.06334)
   - Published: 2026-05-08 12:00
   - Authors: Ashwani Anand, Ivi Chatzi, Ritam Raha, Anne-Kathrin Schmuck
   - Source: arxiv
   - Relevance score: 213
   - Match reasons: title matched "LLM"; title matched "agent"; title matched "benchmark"; summary matched "language model"
   - Categories: cs.CL, cs.LG, cs.LO
   - Tags: Evaluation / Method
   - Topics: LLM / Benchmark
   - PDF: https://arxiv.org/pdf/2605.06334
   - Abstract: Tool-using large language model (LLM) agents are increasingly deployed in settings where their reliable behavior is governed by strict procedural manuals. Ensuring that such agents comply with the rules from these manuals is challenging, as they are typically written for humans in natural language while agent behavior manifests as an execution trace of tool calls. Existing evaluations of LLM agents rely on manually constructed benchmarks or LLM-based judges, which either do not scale or lack reliability for complex, long-horizon manuals. To overcome these limitations, we present MANTRA, a framework for automatically synthesizing machine-checkable compliance benchmarks from natural-language manuals and tool schemas. MANTRA independently generates (i) a symbolic world model capturing procedural dependencies, and (ii) a set of trace-level compliance checks for a given task, and validates their consistency using SMT solving. A structured repair loop resolves inconsistencies, requiring human intervention only as a fallback. Importantly, MANTRA supports arbitrary domains and long procedural manuals, and provides a tunable notion of task complexity which is utilized to automatically derive challenging tasks accompanying compliance checks. Using MANTRA, we build a new benchmark suite with 285 tasks across 6 domains scaling to 50+ page manuals with minimal human effort. Empirically, we show that the compliance checks are richer with stronger constraint enforcement compared to existing benchmarks. Additionally, the granularity of the checks can be used for debugging the agents' failure modes. These results demonstrate that combining automated benchmark generation with formally grounded validation methods enables scalable and reliable benchmarking of tool-using agents.

3. [Measuring Evaluation-Context Divergence in Open-Weight LLMs: A Paired-Prompt Protocol with Pilot Evidence of Alignment-Pipeline-Specific Heterogeneity](https://arxiv.org/abs/2605.06327)
   - Published: 2026-05-08 12:00
   - Authors: Florian A. D. Burnat, Brittany I. Davidson
   - Source: arxiv
   - Relevance score: 213
   - Match reasons: title matched "LLM"; title matched "alignment"; title matched "evaluation"; summary matched "language model"
   - Categories: cs.CL, cs.AI, cs.LG
   - Tags: Evaluation / Application / Method
   - Topics: LLM / Benchmark
   - PDF: https://arxiv.org/pdf/2605.06327
   - Abstract: Safety benchmarks are routinely treated as evidence about how a language model will behave once deployed, but this inference is fragile if behavior depends on whether a prompt looks like an evaluation. We define evaluation-context divergence as an observable within-item change in behavior induced by framing a fixed task as an evaluation, a live deployment interaction, or a neutral request, and present a paired-prompt protocol that measures it in open-weight LLMs while controlling for paraphrase variation, benchmark familiarity, and judge framing-sensitivity. Across five instruction-tuned checkpoints from four open-weight families plus a matched OLMo-3 base/instruct ablation ($20$ paired items, $840$ generations per checkpoint), we find striking heterogeneity. OLMo-3-Instruct alone is eval-cautious -- evaluation framing raises refusal vs. neutral by $11.8$pp ($p=0.007$) and reduces harmful compliance vs. deployment by $3.6$pp ($p=0.024$, $0/20$ items inverted) -- while Mistral-Small-3.2, Phi-3.5-mini, and Llama-3.1-8B are deployment-cautious, with marginal eval-vs-deployment refusal effects of $-9$ to $-20$pp. The matched OLMo-3 base also exhibits the deployment-cautious pattern, identifying alignment as the inversion stage; within Llama-3.1, the $70$B model preserves direction with attenuated magnitude, ruling out a simple "small-model effect that reverses at scale." One caveat: the cross-family heterogeneity is judge-dependent. Re-judging with a different-family safety classifier (Llama-Guard-3-8B) preserves the within-OLMo eval-cautious direction but flattens the cross-family contrast, indicating that the two judges operationalize distinct constructs.

4. [Multi-Dimensional Behavioral Evaluation of Agentic Stock Prediction Systems Using LLM Judges with Closed-Loop Reinforcement Learning Feedback](https://arxiv.org/abs/2605.05739)
   - Published: 2026-05-08 12:00
   - Authors: Mohammad Al Ridhawi, Mahtab Haj Ali, Hussein Al Osman
   - Source: arxiv
   - Relevance score: 213
   - Match reasons: title matched "LLM"; title matched "agent"; title matched "evaluation"; summary matched "language model"
   - Categories: cs.LG, cs.AI, cs.CL, q-fin.CP
   - Tags: Evaluation / Application / Method
   - Topics: LLM / Language Model
   - PDF: https://arxiv.org/pdf/2605.05739
   - Abstract: Agentic stock prediction systems make sequences of interdependent decisions (regime detection, pathway routing, reinforcement learning control) whose individual quality is hidden by aggregate metrics such as mean absolute percentage error (MAPE) or directional accuracy. We present a behavioral evaluation framework that addresses this gap. Behavioral traces logged at every autonomous decision point are grouped into five-day episodes and scored along six domain-specific dimensions (regime detection, routing, adaptation, risk calibration, strategy coherence, error recovery) by an ensemble of three large language model (LLM) judges (GPT 5.4, Claude 4.6 Opus, Gemini 3.1 Pro). Perturbation-based validation on 420 episodes yields targeted score drops of $-1.6$ to $-2.4$ on intended dimensions versus an average of $-0.32$ on the remaining five, with cross-model agreement up to Krippendorff's $\alpha = 0.85$. The composite behavioral score, used here only for cross-episode reporting, correlates at $\rho = 0.72$ with realized 20-day Sharpe ratio from offline backtesting. Closing the loop, the framework converts deficient per-dimension scores into a credit-assigned penalty term added to the Soft Actor-Critic (SAC) reward. Three short fine-tuning cycles, all confined to the validation period, produce on the held-out 2017-2025 test period a one-day MAPE reduction from 0.61% to 0.54% (an 11.5% relative reduction; $p<0.001$, Cohen's $d=0.31$), a directional accuracy increase from 71% to 74%, and an 18% Sharpe ratio improvement (95% bootstrap CI [8.2%, 27.4%]), with gains concentrated in high-volatility episodes where the original system was most behaviorally deficient. Results are from offline backtesting and do not address effects specific to live deployment.

5. [BioTool: A Comprehensive Tool-Calling Dataset for Enhancing Biomedical Capabilities of Large Language Models](https://arxiv.org/abs/2605.05758)
   - Published: 2026-05-08 12:00
   - Authors: Xin Gao, Ruiyi Zhang, Meixi Du, Peijia Qin, Pengtao Xie
   - Source: arxiv
   - Relevance score: 209
   - Match reasons: title matched "language model"; title matched "large language model"; summary matched "LLM"; summary matched "agent"
   - Categories: cs.CL
   - Tags: Evaluation / Data / Application / Method
   - Topics: LLM / Language Model
   - PDF: https://arxiv.org/pdf/2605.05758
   - Abstract: Despite the success of large language models (LLMs) on general-purpose tasks, their performance in highly specialized domains such as biomedicine remains unsatisfactory. A key limitation is the inability of LLMs to effectively leverage biomedical tools, which clinical experts and biomedical researchers rely on extensively in daily workflows. While recent general-domain tool-calling datasets have substantially improved the capabilities of LLM agents, existing efforts in the biomedical domain largely rely on in-context learning and restrict models to a small set of tools. To address this gap, we introduce BioTool, a comprehensive biomedical tool-calling dataset designed for fine-tuning LLMs. BioTool comprises 34 frequently used tools collected from the NCBI, Ensembl, and UniProt databases, along with 7,040 high-quality, human-verified query-API call pairs spanning variation, genomics, proteomics, evolution, and general biology. Fine-tuning a 4-billion-parameter LLM on BioTool yields substantial improvements in biomedical tool-calling performance, outperforming cutting-edge commercial LLMs such as GPT-5.1. Furthermore, human expert evaluations demonstrate that integrating a BioTool-fine-tuned tool caller significantly improves downstream answer quality compared to the same LLM without tool usage, highlighting the effectiveness of BioTool in enhancing the biomedical capabilities of LLMs. The full dataset and evaluation code are available at https://github.com/gxx27/BioTool

6. [BALAR: A Bayesian Agentic Loop for Active Reasoning](https://arxiv.org/abs/2605.05386)
   - Published: 2026-05-08 12:00
   - Authors: Aymen Echarghaoui, Dongxia Wu, Emily B. Fox
   - Source: arxiv
   - Relevance score: 191
   - Match reasons: title matched "reasoning"; title matched "agent"; summary matched "language model"; summary matched "large language model"
   - Categories: cs.AI, cs.CL, cs.LG
   - Tags: Evaluation / Method
   - Topics: LLM / Benchmark
   - PDF: https://arxiv.org/pdf/2605.05386
   - Abstract: Large language models increasingly operate in interactive settings where solving a task requires multiple rounds of information exchange with a user. However, most current systems treat dialogue reactively and lack a principled mechanism to reason about what information is missing and which question should be asked next. We propose BALAR (Bayesian Agentic Loop for Active Reasoning), a task-agnostic outer-loop algorithm that requires no fine-tuning and enables structured multi-turn interaction between an LLM agent and a user. BALAR maintains a structured belief over latent states, selects clarifying questions by maximizing expected mutual information, and dynamically expands its state representation when the current one proves insufficient. We evaluate BALAR on three diverse benchmarks: AR-Bench-DC (detective cases), AR-Bench-SP (thinking puzzles), and iCraft-MD (clinical diagnosis). BALAR significantly outperforms all baselines across all three benchmarks, with $14.6\%$ higher accuracy on AR-Bench-DC, $38.5\%$ on AR-Bench-SP, and $30.5\%$ on iCraft-MD.

7. [PRAISE: Prefix-Based Rollout Reuse in Agentic Search Training](https://arxiv.org/abs/2604.03675)
   - Published: 2026-05-08 12:00
   - Authors: Erhan Zhang, Yiqun Chen, Zechun Niu, Wei Yang, Xiaochi Wei, Yan Gao, et al.
   - Source: arxiv
   - Relevance score: 187
   - Match reasons: title matched "agent"; summary matched "language model"; summary matched "large language model"; summary matched "LLM"
   - Categories: cs.AI, cs.CL, cs.IR
   - Tags: Evaluation / Data / Application / Method
   - Topics: LLM / Benchmark
   - PDF: https://arxiv.org/pdf/2604.03675
   - Abstract: In agentic search, large language models (LLMs) are trained to perform multi-turn retrieval and reasoning for complex tasks such as multi-hop question answering (QA). However, current search-based Reinforcement Learning (RL) methods suffer from two core limitations: expensive long-horizon rollouts are under-utilized during training, and supervision is typically available only at the final answer, resulting in severe reward sparsity. We present Prefix-based Rollout reuse for Agentic search with Intermediate Step rEwards (PRAISE), a framework for improving both data efficiency and credit assignment in agentic search training. Given a complete search trajectory, PRAISE extracts prefix states at different search turns, elicits intermediate answers from them, and uses these prefixes both to construct additional training trajectories and to derive step-level rewards from performance differences across prefixes. Our method uses a single shared model for both search policy learning and prefix answer evaluation, enabling joint optimization without extra human annotations or a separate reward model. Experiments on multi-hop QA benchmarks show that PRAISE consistently improves performance over strong baselines.

8. [Evaluation Awareness in Language Models Has Limited Effect on Behaviour](https://arxiv.org/abs/2605.05835)
   - Published: 2026-05-08 12:00
   - Authors: Amelie Knecht, Lucas Florin, Thilo Hagendorff
   - Source: arxiv
   - Relevance score: 173
   - Match reasons: title matched "language model"; title matched "evaluation"; summary matched "reasoning"; summary matched "alignment"
   - Categories: cs.CL, cs.CY
   - Tags: Evaluation / Application / Method
   - Topics: Benchmark / Language Model
   - PDF: https://arxiv.org/pdf/2605.05835
   - Abstract: Large reasoning models (LRMs) sometimes note in their chain of thought (CoT) that they may be under evaluation. Researchers worry that this verbalised evaluation awareness (VEA) causes models to adapt their outputs strategically, optimising for perceived evaluation criteria, which, for instance, can make models appear safer than they actually are. However, whether VEA actually has this effect is largely unknown. We tested this across open-weight LRMs and benchmarks covering safety, alignment, moral reasoning, and political opinion. We tested this both on-policy, sampling multiple CoTs per item and comparing those that spontaneously contained VEA against those that did not, and off-policy, using model prefilling to inject evaluation-aware sentences where missing and remove them where present, with subsequent resampling. VEA has limited effect on model behaviour: injecting VEA into CoTs produces near-zero effects ($\omega \leq 0.06$), removing it causes small shifts ($\omega \leq 0.12$) and spontaneously occurring VEA shifts answer distributions by at most 3.7 percentage points ($\omega \leq 0.31$). Our findings call for caution when interpreting high VEA rates as evidence of strategic behaviour or alignment tampering. Evaluation awareness may pose a smaller safety risk than the current literature assumes.

9. [MemReranker: Reasoning-Aware Reranking for Agent Memory Retrieval](https://arxiv.org/abs/2605.06132)
   - Published: 2026-05-08 12:00
   - Authors: Chunyu Li, Jingyi Kang, Ding Chen, Mengyuan Zhang, Jiajun Shen, Bo Tang, et al.
   - Source: arxiv
   - Relevance score: 173
   - Match reasons: title matched "reasoning"; title matched "agent"; summary matched "LLM"; summary matched "RAG"
   - Categories: cs.CL
   - Tags: Evaluation / Application / Method
   - Topics: LLM / Benchmark
   - PDF: https://arxiv.org/pdf/2605.06132
   - Abstract: In agent memory systems, the reranking model serves as the critical bridge connecting user queries with long-term memory. Most systems adopt the "retrieve-then-rerank" two-stage paradigm, but generic reranking models rely on semantic similarity matching and lack genuine reasoning capabilities, leading to a problem where recalled results are semantically highly relevant yet do not contain the key information needed to answer the question. This deficiency manifests in memory scenarios as three specific problems. First, relevance scores are miscalibrated, making threshold-based filtering difficult. Second, ranking degrades when facing temporal constraints, causal reasoning, and other complex queries. Third, the model cannot leverage dialogue context for semantic disambiguation. This report introduces MemReranker, a reranking model family (0.6B/4B) built on Qwen3-Reranker through multi-stage LLM knowledge distillation. Multi-teacher pairwise comparisons generate calibrated soft labels, BCE pointwise distillation establishes well-distributed scores, and InfoNCE contrastive learning enhances hard-sample discrimination. Training data combines general corpora with memory-specific multi-turn dialogue data covering temporal constraints, causal reasoning, and coreference resolution. On the memory retrieval benchmark, MemReranker-0.6B substantially outperforms BGE-Reranker and matches open-source 4B/8B models as well as GPT-4o-mini on key metrics. MemReranker-4B further achieves 0.737 MAP, with several metrics on par with Gemini-3-Flash, while maintaining inference latency at only 10--20% of large models. On finance and healthcare vertical-domain benchmarks, the models preserve generalization capabilities on par with mainstream large-parameter rerankers.

10. [Rethinking RL for LLM Reasoning: It's Sparse Policy Selection, Not Capability Learning](https://arxiv.org/abs/2605.06241)
   - Published: 2026-05-08 12:00
   - Authors: Ömer Faruk Akgül, Rajgopal Kannan, Willie Neiswanger, Viktor Prasanna
   - Source: arxiv
   - Relevance score: 173
   - Match reasons: title matched "LLM"; title matched "reasoning"; summary matched "language model"; summary matched "large language model"
   - Categories: cs.CL
   - Tags: Evaluation / Method
   - Topics: LLM / Benchmark
   - PDF: https://arxiv.org/pdf/2605.06241
   - Abstract: Reinforcement learning has become the standard for improving reasoning in large language models, yet evidence increasingly suggests that RL does not teach new strategies; it redistributes probability mass over solutions the base model already contains. In this work, we ask: if RL merely steers the model toward paths it already knows, is the RL optimization loop itself necessary? Through token-level analysis across multiple model families and RL algorithms, we find that RL's beneficial footprint is a sparse, predictable correction concentrated at high-entropy decision points where the model is uncertain which branch to take. Only 1--3% of token positions are affected, the promoted token always lies within the base model's top-5 alternatives, and targeted corrections at those few positions causally recover a large fraction of RL's accuracy gain, while random corrections fail. The base model's own entropy identifies these positions without any RL-trained model, and the entire correction is low-dimensional, representable in a tiny fraction of model parameters. These findings reframe reasoning improvement as sparse policy selection, not capability acquisition. We translate this insight into ReasonMaxxer, a minimal RL-free method that applies contrastive loss only at entropy-gated decision points, using a few hundred base-model rollouts and no online generation. Across three model families, six scales, and six math reasoning benchmarks, ReasonMaxxer matches or exceeds full RL performance while requiring only tens of problems and minutes of single-GPU training, a reduction in training cost of roughly three orders of magnitude.

11. [Agentic Retrieval-Augmented Generation for Financial Document Question Answering](https://arxiv.org/abs/2605.05409)
   - Published: 2026-05-08 12:00
   - Authors: Yang Shu, Yingmin Liu, Zequn Xie
   - Source: arxiv
   - Relevance score: 169
   - Match reasons: title matched "agent"; summary matched "LLM"; summary matched "reasoning"; summary matched "RAG"
   - Categories: cs.AI, cs.CL
   - Tags: Evaluation / Data / Application / Method
   - Topics: LLM / Benchmark
   - PDF: https://arxiv.org/pdf/2605.05409
   - Abstract: Financial document question answering (QA) demands complex multi-step numerical reasoning over heterogeneous evidence--structured tables, textual narratives, and footnotes--scattered across corporate filings. Existing retrieval-augmented generation (RAG) approaches adopt a single-pass retrieve-then-generate paradigm that struggles with the compositional reasoning chains prevalent in financial analysis. We propose FinAgent-RAG, an agentic RAG framework that orchestrates iterative retrieval-reasoning loops with self-verification, specifically engineered for the precision requirements of financial numerical reasoning. The framework integrates three domain-specific innovations: (1) a Contrastive Financial Retriever trained with hard negative mining to distinguish semantically similar but numerically distinct financial passages, (2) a Program-of-Thought reasoning module that generates executable Python code for precise arithmetic rather than relying on error-prone LLM-based mental computation, and (3) an Adaptive Strategy Router that dynamically allocates computational resources based on question complexity, reducing API costs by 41.3% on FinQA while preserving accuracy. Extensive experiments on three benchmark datasets--FinQA, ConvFinQA, and TAT-QA--demonstrate that FinAgent-RAG achieves 76.81%, 78.46%, and 74.96% execution accuracy respectively, outperforming the strongest baseline by 5.62--9.32 percentage points. Ablation studies, cross-backbone evaluation with four LLMs, and deployment cost analysis confirm the framework's robustness and practical viability for financial institutions.

12. [LeakDojo: Decoding the Leakage Threats of RAG Systems](https://arxiv.org/abs/2605.05818)
   - Published: 2026-05-08 12:00
   - Authors: Maosen Zhang, Jianshuo Dong, Boting Lu, Wenyue Li, Xiaoping Zhang, Tianwei Zhang, et al.
   - Source: arxiv
   - Relevance score: 169
   - Match reasons: title matched "RAG"; summary matched "language model"; summary matched "large language model"; summary matched "LLM"
   - Categories: cs.CR, cs.AI, cs.CL
   - Tags: Evaluation / Data / Method
   - Topics: LLM / Benchmark
   - PDF: https://arxiv.org/pdf/2605.05818
   - Abstract: Retrieval-Augmented Generation (RAG) enables large language models (LLMs) to leverage external knowledge, but also exposes valuable RAG databases to leakage attacks. As RAG systems grow more complex and LLMs exhibit stronger instruction-following capabilities, existing studies fall short of systematically assessing RAG leakage risks. We present LeakDojo, a configurable framework for controlled evaluation of RAG leakage. Using LeakDojo, we benchmark six existing attacks across fourteen LLMs, four datasets, and diverse RAG systems. Our study reveals that (1) query generation and adversarial instructions contribute independently to leakage, with overall leakage well approximated by their product; (2) stronger instruction-following capability correlates with higher leakage risk; and (3) improvements in RAG faithfulness can introduce increased leakage risk. These findings provide actionable insights for understanding and mitigating RAG leakage in practice. Our codebase is available at https://github.com/yeasen-z/LeakDojo.
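
The BALAR abstract (item 6 above) describes selecting clarifying questions by maximizing expected mutual information over a belief on latent states. The sketch below shows what that criterion looks like for a small discrete belief. It is a generic textbook illustration, not BALAR's implementation: the function names, the dictionary-based belief, and the hand-written answer likelihoods are all assumptions made for this example, and the abstract does not say how BALAR represents beliefs or estimates likelihoods.

```python
import math
from typing import Dict

Belief = Dict[str, float]                  # P(state)
Likelihood = Dict[str, Dict[str, float]]   # P(answer | state) for one question


def entropy(belief: Belief) -> float:
    """Shannon entropy in bits; zero-probability states contribute nothing."""
    return -sum(p * math.log2(p) for p in belief.values() if p > 0)


def expected_information_gain(belief: Belief, likelihood: Likelihood) -> float:
    """Mutual information between the latent state and the answer to one question."""
    # Marginal probability of each possible answer under the current belief.
    answer_prob: Dict[str, float] = {}
    for state, p_state in belief.items():
        for answer, p_answer in likelihood[state].items():
            answer_prob[answer] = answer_prob.get(answer, 0.0) + p_state * p_answer

    # Expected entropy of the Bayesian posterior after hearing the answer.
    expected_posterior_entropy = 0.0
    for answer, p_answer in answer_prob.items():
        if p_answer == 0:
            continue
        posterior = {
            s: belief[s] * likelihood[s].get(answer, 0.0) / p_answer
            for s in belief
        }
        expected_posterior_entropy += p_answer * entropy(posterior)

    return entropy(belief) - expected_posterior_entropy


def select_question(belief: Belief, questions: Dict[str, Likelihood]) -> str:
    """Return the question whose answer is expected to reduce uncertainty most."""
    return max(questions, key=lambda q: expected_information_gain(belief, questions[q]))


if __name__ == "__main__":
    belief = {"A": 0.5, "B": 0.25, "C": 0.25}
    questions = {
        "is_it_A": {"A": {"yes": 1.0}, "B": {"no": 1.0}, "C": {"no": 1.0}},
        "is_it_C": {"A": {"no": 1.0}, "B": {"no": 1.0}, "C": {"yes": 1.0}},
    }
    print(select_question(belief, questions))
```

In this toy belief, asking about the most probable state splits the probability mass evenly, so it carries more expected information than asking about a less probable state.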

## Agent Runtime Security Watch

### Group at a Glance

- "Constraining Host-Level Abuse in Self-Hosted Computer-Use Agents via TEE-Backed Isolation" [Evaluation / Application / Method]: Self-hosted computer-use agents (SHCUAs), such as OpenClaw, combine natural-language interaction with direct access to host-side resources, including browsers,…

### Paper Overview

1. [Constraining Host-Level Abuse in Self-Hosted Computer-Use Agents via TEE-Backed Isolation](https://arxiv.org/abs/2605.06393v1)
   - Published: 2026-05-07 23:08
   - Authors: Di Lu, Bo Zhang, Xiyuan Li, Yongzhi Liao, Xuewen Dong, Yulong Shen, et al.
   - Source: arxiv
   - Relevance score: 102
   - Match reasons: title matched "computer-use agent"; summary matched "prompt injection"; summary matched "indirect prompt injection"; has PDF
   - Categories: cs.CR
   - Tags: Evaluation / Application / Method
   - Topics: Agent / Evaluation
   - PDF: https://arxiv.org/pdf/2605.06393v1
   - Abstract: Self-hosted computer-use agents (SHCUAs), such as OpenClaw, combine natural-language interaction with direct access to host-side resources, including browsers, files, scripts, system commands, and external communication channels. While useful for automating real tasks, this capability also creates a host-level abuse surface: a legitimately deployed agent may be steered toward unsafe operations through malicious messages, indirect prompt injection, unsafe skills, or tampering along the host-side control path. We argue that such risks cannot be addressed by ad hoc blocking rules alone, because the security criticality of an operation depends jointly on its action type, target object, execution context, and potential effect. This paper presents an operation-centric model for risk-based confinement of SHCUA operations. The proposed design keeps ordinary functionality on the constrained REE path, while protecting security-critical classification, authorization, binding, evidence generation, and selected execution-control decisions inside a cloud-native TEE-backed trusted operation plane. We instantiate the architecture on OpenClaw using Intel TDX as the primary trusted backend, with remote terminal-side trusted components verifying TDX-audited commands before constrained local execution. The evaluation shows that the design can block unsafe or policy-disallowed operations before execution, preserve ordinary functionality for allowed workloads, and provide auditable evidence with deployment-dependent overhead.