# 每日论文简报

- 生成时间：2026-06-03 14:09:56 (Asia/Shanghai)
- 检索窗口：最近 24 小时
- 命中概览：LM=15, Agent Runtime Security=8, Terminal and SWE Agents=9
- 排序策略：hybrid (relevance first, published_at tie-break)

## 今日重点

- 主题「LLM」：命中 22 篇，覆盖 LM、Agent Runtime Security 等，代表论文包括 《Agentic Chain-of-Thought Steering for Efficient and Controllable LLM Reasoning》、《Exploring Adversarial Robustness and Safety Alignment in Multilingual Multi-Modal Large Language Models》。
- 主题「Language Model」：命中 18 篇，覆盖 LM、Agent Runtime Security 等，代表论文包括 《Agentic Chain-of-Thought Steering for Efficient and Controllable LLM Reasoning》、《Exploring Adversarial Robustness and Safety Alignment in Multilingual Multi-Modal Large Language Models》。
- 主题「Benchmark」：命中 6 篇，覆盖 LM、Agent Runtime Security 等，代表论文包括 《Adaptive Latent Agentic Reasoning》、《MedCUA-Bench: A Screenshot-Only Benchmark for Clinical Computer-Use Agents》。
- 主题「Agent」：命中 6 篇，覆盖 Agent Runtime Security、Terminal and SWE Agents，代表论文包括 《From Control Boundary to Insurance Claim: Reconstructing AI-Mediated Losses Through the CER Framework》、《What Makes Interaction Trajectories Effective for Training Terminal Agents?》。
- 主题「RAG」：命中 5 篇，覆盖 Agent Runtime Security、Terminal and SWE Agents，代表论文包括 《MedCUA-Bench: A Screenshot-Only Benchmark for Clinical Computer-Use Agents》、《From Control Boundary to Insurance Claim: Reconstructing AI-Mediated Losses Through the CER Framework》。

## 栏目状态

- LM：15 篇
- Agent Runtime Security：8 篇
- Terminal and SWE Agents：9 篇

## 主题聚焦

### LLM

- 命中篇数：22
- 覆盖分组：LM、Agent Runtime Security、Terminal and SWE Agents
- 代表论文：《Agentic Chain-of-Thought Steering for Efficient and Controllable LLM Reasoning》、《Exploring Adversarial Robustness and Safety Alignment in Multilingual Multi-Modal Large Language Models》、《Can Factual Opinions Be Edited (Manipulated) in Large Language Models?》
- 主题速读：
  - 《Agentic Chain-of-Thought Steering for Efficient and Controllable LLM Reasoning》〔评测 / 方法〕：Large language models improve final-answer accuracy through extended chain-of-thought reasoning, but often spend tokens inefficiently and offer little inferenc…
  - 《Exploring Adversarial Robustness and Safety Alignment in Multilingual Multi-Modal Large Language Models》〔方法〕：Multimodal Large Language Models integrate visual perception into language reasoning, introducing a continuous attack surface susceptible to adversarial attack…

### Language Model

- 命中篇数：18
- 覆盖分组：LM、Agent Runtime Security、Terminal and SWE Agents
- 代表论文：《Agentic Chain-of-Thought Steering for Efficient and Controllable LLM Reasoning》、《Exploring Adversarial Robustness and Safety Alignment in Multilingual Multi-Modal Large Language Models》、《Can Factual Opinions Be Edited (Manipulated) in Large Language Models?》
- 主题速读：
  - 《Agentic Chain-of-Thought Steering for Efficient and Controllable LLM Reasoning》〔评测 / 方法〕：Large language models improve final-answer accuracy through extended chain-of-thought reasoning, but often spend tokens inefficiently and offer little inferenc…
  - 《Exploring Adversarial Robustness and Safety Alignment in Multilingual Multi-Modal Large Language Models》〔方法〕：Multimodal Large Language Models integrate visual perception into language reasoning, introducing a continuous attack surface susceptible to adversarial attack…

### Benchmark

- 命中篇数：6
- 覆盖分组：LM、Agent Runtime Security、Terminal and SWE Agents
- 代表论文：《Adaptive Latent Agentic Reasoning》、《MedCUA-Bench: A Screenshot-Only Benchmark for Clinical Computer-Use Agents》、《Cost-Aware Query Routing in RAG: Empirical Analysis of Retrieval Depth Tradeoffs》
- 主题速读：
  - 《Adaptive Latent Agentic Reasoning》〔评测 / 应用 / 方法〕：Large reasoning models improve performance by generating extended chain-of-thought (CoT) reasoning, but this behavior becomes inefficient when applied to LLM a…
  - 《MedCUA-Bench: A Screenshot-Only Benchmark for Clinical Computer-Use Agents》〔评测 / 应用 / 方法〕：Computer-use agents could automate repetitive screen-based clinical work, but their reliability in medical graphical user interfaces remains largely unvalidate…

### Agent

- 命中篇数：6
- 覆盖分组：Agent Runtime Security、Terminal and SWE Agents
- 代表论文：《From Control Boundary to Insurance Claim: Reconstructing AI-Mediated Losses Through the CER Framework》、《What Makes Interaction Trajectories Effective for Training Terminal Agents?》、《Handoff Debt: The Rediscovery Cost When Coding Agents Take Over Interrupted Tasks》
- 主题速读：
  - 《From Control Boundary to Insurance Claim: Reconstructing AI-Mediated Losses Through the CER Framework》〔方法〕：AI losses that arise through an insured organization's generative or agentic AI system require state reconstruction, not merely event reconstruction, because t…
  - 《What Makes Interaction Trajectories Effective for Training Terminal Agents?》〔方法〕：Stronger code agents are commonly assumed to be superior teachers for post-training, yet this assumption remains poorly disentangled from task difficulty, harn…

### RAG

- 命中篇数：5
- 覆盖分组：Agent Runtime Security、Terminal and SWE Agents
- 代表论文：《MedCUA-Bench: A Screenshot-Only Benchmark for Clinical Computer-Use Agents》、《From Control Boundary to Insurance Claim: Reconstructing AI-Mediated Losses Through the CER Framework》、《What Makes Interaction Trajectories Effective for Training Terminal Agents?》
- 主题速读：
  - 《MedCUA-Bench: A Screenshot-Only Benchmark for Clinical Computer-Use Agents》〔评测 / 应用 / 方法〕：Computer-use agents could automate repetitive screen-based clinical work, but their reliability in medical graphical user interfaces remains largely unvalidate…
  - 《From Control Boundary to Insurance Claim: Reconstructing AI-Mediated Losses Through the CER Framework》〔方法〕：AI losses that arise through an insured organization's generative or agentic AI system require state reconstruction, not merely event reconstruction, because t…

## LM 观察

### 本组速览

- 《Agentic Chain-of-Thought Steering for Efficient and Controllable LLM Reasoning》〔评测 / 方法〕：Large language models improve final-answer accuracy through extended chain-of-thought reasoning, but often spend tokens inefficiently and offer little inferenc…
- 《Exploring Adversarial Robustness and Safety Alignment in Multilingual Multi-Modal Large Language Models》〔方法〕：Multimodal Large Language Models integrate visual perception into language reasoning, introducing a continuous attack surface susceptible to adversarial attack…
- 《Can Factual Opinions Be Edited (Manipulated) in Large Language Models?》〔评测 / 方法〕：Large Language Models (LLMs) are increasingly integrated into various domains, making knowledge editing techniques crucial yet potentially hazardous. Current e…
- 《Large Language Models Are Overconfident in Their Own Responses》〔评测 / 方法〕：Prior work has shown that instruction-tuned large language models (LLMs) are less well calibrated than their base pre-trained counterparts. However, little is…
- 《Fully Automated Identification of Lexical Alignment and Preference-Stage Shifts in Large Language Models》〔评测 / 应用 / 方法〕：The language used by digital chat assistants such as ChatGPT can diverge from human expectations (misalignment). Research, mostly on Scientific English, has de…

### 论文速览

1. [Agentic Chain-of-Thought Steering for Efficient and Controllable LLM Reasoning](https://arxiv.org/abs/2606.03965)
   - Published：2026-06-03 12:00
   - 作者：Yu Xia，Zhouhang Xie，Xin Xu，Byungkyu Kang，Prarit Lamba，Xiang Gao 等
   - 来源：arxiv
   - 相关性分数：213
   - 命中原因：title matched "LLM"; title matched "reasoning"; title matched "agent"; summary matched "language model"
   - 分类：cs.CL, cs.AI
   - 标签：评测 / 方法
   - 主题词：LLM / Language Model
   - PDF：https://arxiv.org/pdf/2606.03965
   - 摘要：Large language models improve final-answer accuracy through extended chain-of-thought reasoning, but often spend tokens inefficiently and offer little inference-time control. Existing efficient reasoning methods control thinking length by shortening, early-stopping, or compressing traces, leaving how the model thinks implicit. In this paper, we propose Agentic Chain-of-Thought Steering (ACTS), which formulates reasoning steering as a Markov decision process where a controller agent adaptively steers a frozen reasoner during inference. At each step, the controller observes the reasoning trace and remaining thinking budget, then issues a steering action consisting of a reasoning strategy and a steering phrase that initiates the next reasoner step. This enables budget-aware strategy control for efficient reasoning while preserving the reasoner's generation continuity. We initialize the controller agent from our constructed synthetic steering trajectories with multi-budget augmentation, and further optimize it via reinforcement learning with budget-conditioned reward shaping. Experiments across multiple benchmarks show that ACTS matches full-thinking performance with substantial token savings, and enables controllable accuracy-efficiency trade-offs across different reasoners and tasks. The code is available at https://github.com/Andree-9/ACTS.

2. [Exploring Adversarial Robustness and Safety Alignment in Multilingual Multi-Modal Large Language Models](https://arxiv.org/abs/2606.03793)
   - Published：2026-06-03 12:00
   - 作者：Hashmat Shadab Malik，Muzammal Naseer，Salman Khan
   - 来源：arxiv
   - 相关性分数：213
   - 命中原因：title matched "language model"; title matched "large language model"; title matched "alignment"; summary matched "LLM"
   - 分类：cs.CL, cs.CV
   - 标签：方法
   - 主题词：LLM / Language Model
   - PDF：https://arxiv.org/pdf/2606.03793
   - 摘要：Multimodal Large Language Models integrate visual perception into language reasoning, introducing a continuous attack surface susceptible to adversarial attacks. Prior work on MLLM robustness has focused largely on English-centric tasks, leaving multilingual behaviour unexplored. We address this gap through a systematic study of adversarial robustness and multimodal safety across 12 diverse languages, evaluating open-source MLLMs that acquire multilingual capability through instruction tuning. Gradient-based attacks reveal a transferable multilingual vulnerability: adversarial images optimized in one language continue to induce failure in others, demonstrating strong cross-lingual transferability. Multilingual safety further varies with how effectively a model retrieves or interprets harmful instructions. When harmful intent is issued through text, languages with stronger linguistic grounding more often elicit misuse-enabling responses, while weaker languages produce fewer unsafe outputs. When embedded in the image as typographic content, English scripts are reliably recognised and followed, whereas non-English scripts are rarely parsed by the vision encoder. Lower-resource languages may therefore appear safer, but this is an artefact of comprehension and visual-grounding failures rather than genuine alignment, a phenomenon we term safety-by-failure. In contrast, MLLMs that build multilingual capability throughout their training stages rather than only at instruction tuning, such as Qwen3-VL, exhibit genuine cross-lingual safety, maintaining active refusal across languages rather than masking comprehension failure. Shallow multilingual adaptation, such as fine-tuning on translated instruction data, may produce surface-level understanding that creates illusory safety in low-resource languages; deeper integration across training stages leads to genuine multilingual safety alignment.

3. [Can Factual Opinions Be Edited (Manipulated) in Large Language Models?](https://arxiv.org/abs/2606.03096)
   - Published：2026-06-03 12:00
   - 作者：Yuanpu Cao，Ziyi Yin，Fenglong Ma，Jinghui Chen
   - 来源：arxiv
   - 相关性分数：191
   - 命中原因：title matched "language model"; title matched "large language model"; summary matched "LLM"; summary matched "alignment"
   - 分类：cs.CL
   - 标签：评测 / 方法
   - 主题词：LLM / Language Model
   - PDF：https://arxiv.org/pdf/2606.03096
   - 摘要：Large Language Models (LLMs) are increasingly integrated into various domains, making knowledge editing techniques crucial yet potentially hazardous. Current editing methods primarily target atomic facts, overlooking the significant risks associated with manipulating factual opinions, e.g., documented stances of public figures on societal issues. Such manipulation could reshape public images, influence elections, and alter societal views. To systematically assess this threat, we introduce the Factual Opinion Editing with Evidence (FOE) benchmark, which encompasses 261 public figures, 19 issue categories, and 2,178 complete opinion records. Our evaluations demonstrate that current editing techniques struggle significantly with factual opinions, often achieving only superficial changes while failing to preserve consistency between the edited opinion and the supporting evidence generated by the model. To address this limitation, we further propose a simple yet effective Self-Generated Evidence-Aligned method that achieves opinion-evidence alignment without relying on explicit instructions. Together, our benchmark and method provide a foundation for understanding the emerging security implications of factual opinion editing in LLMs.

4. [Large Language Models Are Overconfident in Their Own Responses](https://arxiv.org/abs/2606.03437)
   - Published：2026-06-03 12:00
   - 作者：Mario Sanz-Guerrero，Manuel Mager，Katharina von der Wense
   - 来源：arxiv
   - 相关性分数：191
   - 命中原因：title matched "language model"; title matched "large language model"; summary matched "LLM"; summary matched "instruction tuning"
   - 分类：cs.CL
   - 标签：评测 / 方法
   - 主题词：LLM / Language Model
   - PDF：https://arxiv.org/pdf/2606.03437
   - 摘要：Prior work has shown that instruction-tuned large language models (LLMs) are less well calibrated than their base pre-trained counterparts. However, little is known about the frequently used chat template's effect on the calibration of conversational LLMs. In this work, we investigate the mechanisms driving this miscalibration by decoupling the effects of the post-training algorithm and the chat format. We find that, while instruction tuning fundamentally harms calibration, the chat template aggravates the issue through an "ownership bias" -- models are significantly more confident in their own answers than in identical answers provided by a user. Extensive experiments across six recent open-weight LLMs, three benchmarks, and three confidence elicitation methods show that models assign up to 26% higher confidence to their own responses. Leveraging this insight, we propose a simple inference-time strategy: framing the model's answer as user input during confidence elicitation. This approach significantly reduces overconfidence and improves calibration by up to 26% without the need for retraining, narrowing the gap between base and instruction-tuned models.

5. [Fully Automated Identification of Lexical Alignment and Preference-Stage Shifts in Large Language Models](https://arxiv.org/abs/2606.03165)
   - Published：2026-06-03 12:00
   - 作者：Thomas Stephan Juzek，Xiaoyang Ming，Jose A. Hernandez
   - 来源：arxiv
   - 相关性分数：177
   - 命中原因：title matched "language model"; title matched "large language model"; title matched "alignment"; summary matched "evaluation"
   - 分类：cs.CL, cs.AI
   - 标签：评测 / 应用 / 方法
   - 主题词：Language Model / Large Language Model
   - PDF：https://arxiv.org/pdf/2606.03165
   - 摘要：The language used by digital chat assistants such as ChatGPT can diverge from human expectations (misalignment). Research, mostly on Scientific English, has described both what divergences occur and, to some extent, why, linking them to the training stage of human preference learning. Yet, existing approaches rely on manual curation. This paper introduces two curation-free, assumption-light evaluation metrics: the Lexical Alignment Score, which identifies lexical overuse, and the Triangulated Preference Shift, which quantifies how much of such shifts can be attributed to human preference learning. Using PubMed abstracts, continuations were generated and measured using windowed document prevalence across six model families (Falcon, Gemma, Llama, Mistral, OLMo, Yi). The procedure identifies, without manual intervention, overused items such as 'suggest', 'additionally', and 'strategy', and estimates their link to preference learning. Our findings replicate prior work and remain stable across parameter settings, random seeds, and evaluation on further data. The approach scales readily and enables systematic study of lexical (mis)alignment beyond Scientific English and across languages, and as such, the metrics have the potential to contribute to improved alignment for future models and understanding of its origins.

6. [A Training-Free Mixture-of-Agents Framework for Multi-Document Summarization using LLMs and Knowledge Graphs](https://arxiv.org/abs/2606.03867)
   - Published：2026-06-03 12:00
   - 作者：Cuong Vuong Tuan，Trang Mai Xuan，Tien-Cuong Nguyen，Vu-Duc Ngo，Thien Van Luong
   - 来源：arxiv
   - 相关性分数：173
   - 命中原因：title matched "LLM"; title matched "agent"; summary matched "language model"; summary matched "large language model"
   - 分类：cs.CL, cs.AI
   - 标签：数据 / 方法
   - 主题词：LLM / Language Model
   - PDF：https://arxiv.org/pdf/2606.03867
   - 摘要：Multi-Document Summarization (MDS) plays a critical role in distilling essential information from collections of textual data. Existing approaches often struggle to capture complex inter-document relationships, rely heavily on large amounts of labeled data for supervised training, or exhibit limited generalization across domains and languages. To address these limitations, we present a training-free mixture-of-agents framework for MDS that leverages the complementary strengths of large language models (LLMs) and knowledge graphs. Our approach decomposes summarization into specialized agent tasks: extractive selection, knowledge-aware abstraction, and iterative refinement, each operating without task-specific fine-tuning. We unify their outputs using a multi-perspective consistency mechanism guided by LLMs. Experiments across four datasets in English and Vietnamese demonstrate state-of-the-art or competitive performance, validating the effectiveness and adaptability of our modular design.

7. [Linear Probes Detect Task Format, Not Reasoning Mode in Language Model Hidden States](https://arxiv.org/abs/2606.02907)
   - Published：2026-06-03 12:00
   - 作者：Subramanyam Sahoo，Vinija Jain，Aman Chadha，Divya Chaudhary
   - 来源：arxiv
   - 相关性分数：173
   - 命中原因：title matched "language model"; title matched "reasoning"; summary matched "large language model"; summary matched "LLM"
   - 分类：cs.CL, cs.AI
   - 标签：评测 / 方法
   - 主题词：LLM / Language Model
   - PDF：https://arxiv.org/pdf/2606.02907
   - 摘要：Linear probing of large language model (LLM) hidden states is widely used to claim that models learn distinct representations for different reasoning types. We test this by probing Qwen3-14B on three benchmarks spanning the classical trichotomy: LogiQA 2.0 (deductive), ARC-Challenge (inductive), and $\alpha$NLI (abductive). At layer 32 of 40, linear probes achieve 100\% cross-validated accuracy with well-separated geometry (intrinsic dimensionalities: 20.6, 28.5, 33.6; convex hull contamination $\leq$1.5\%). However, this separation is entirely driven by format confounds. Residualizing source identity, option count, and response length reduces accuracy to chance. Trace-anchor similarity indicates largely shared reasoning across tasks (42.5\% agreement vs.\ 33.3\% chance), and causal steering with random controls ($n=20$) shows no functional link between geometry and reasoning mode ($p=0.286$). Thus, high probe accuracy reflects task format rather than computational structure, motivating routine format deconfounding in mechanistic interpretability.

8. [Selective Token-Level Cryptographic Redaction for Privacy-Preserving Clinical Deployment of Large Language Models](https://arxiv.org/abs/2606.03399)
   - Published：2026-06-03 12:00
   - 作者：Farhan Sheth，Ziyuan Yang，Yongying Lan，Si Yong Yeo
   - 来源：arxiv
   - 相关性分数：173
   - 命中原因：title matched "language model"; title matched "large language model"; summary matched "LLM"; summary matched "alignment"
   - 分类：cs.CL, cs.CR
   - 标签：数据 / 应用 / 方法
   - 主题词：LLM / Language Model
   - PDF：https://arxiv.org/pdf/2606.03399
   - 摘要：While large language models (LLMs) are increasingly used for clinical applications, many existing pipelines require sending raw sensitive health information to remote servers for processing, which heightens the risk of privacy leakage. A natural approach to mitigate this risk is to encrypt the data before transmission. However, straightforward solutions such as encrypting the entire dataset introduce prohibitive computational, alignment, and communication overheads, rendering large-scale practical deployment infeasible. To preserve privacy while maintaining usability, we present Healthcare Encryption & Redaction via Adaptive Linguistic Decomposition (HERALD), a token-level cryptographic redaction framework designed to achieve this balance by encrypting only sensitive tokens while preserving the surrounding context for downstream model utility. HERALD combines medical named-entity recognizer (NER) with part-of-speech (POS) driven policies to select candidate tokens, performs targeted lemmatization to stabilize surface forms, and substitutes each protected token with a deterministic ciphertext wrapped in explicit delimiters. Notably, HERALD is model-agnostic and operates entirely on the client side, ensuring that sensitive content remains encrypted throughout storage, transmission, and processing without requiring changes to downstream models. We evaluated HERALD on both classification and medical question answering (MQA) tasks on public datasets. Across different tasks, experiments illustrate that fully secured baselines suffer significant utility loss, whereas HERALD consistently recovers performance close to plaintext. Overall, HERALD provides a novel utilization pipeline.

9. [Evaluating LLMs' Effectiveness on Real-World Consumer Device Repair Questions](https://arxiv.org/abs/2606.03331)
   - Published：2026-06-03 12:00
   - 作者：Atm Mizanur Rahman (University of Illinois Urbana-Champaign)，Md Arid Hasan (University of Toronto)，Syed Ishtiaque Ahmed (University of Toronto)，Sharifa Sultana (University of Illinois Urbana-Champaign)
   - 来源：arxiv
   - 相关性分数：169
   - 命中原因：title matched "LLM"; summary matched "language model"; summary matched "large language model"; summary matched "reasoning"
   - 分类：cs.CL, cs.AI
   - 标签：评测 / 方法
   - 主题词：LLM / Language Model
   - PDF：https://arxiv.org/pdf/2606.03331
   - 摘要：Consumer device repair is an important but underexplored testbed for large language models (LLMs). Repair tasks require reasoning over incomplete problem descriptions, hardware-specific diagnostics, actionable troubleshooting, and safety-critical decisions, where incorrect advice can cause device damage, battery hazards, or permanent data loss. We introduce a benchmark of 991 real-world repair questions from Reddit spanning phone repair, computer repair, and data recovery, each paired with technician-written reference solutions, and provide Bangla translations to evaluate cross-lingual performance. We evaluate six state-of-the-art LLMs in English and Bangla using four repair-specific criteria: correctness, completeness, practicality, and safety. Our results show that while LLMs can provide useful repair assistance, they remain unreliable for high-risk real-world repair tasks without rigorous evaluation and explicit safety safeguards. Phone repair is the most difficult and safety-sensitive domain, and all models make substantial errors in board-level diagnosis, repair prioritization, and safe recovery procedures. Across domains and models, Bangla responses consistently perform worse than English responses. Among the evaluated models, GPT-5.4 performs best overall.

10. [Hallucinations as Orthogonal Noise: Inference-Time Manifold Alignment via Dynamic Contextual Orthogonalization](https://arxiv.org/abs/2606.03022)
   - Published：2026-06-03 12:00
   - 作者：Mingkuan Zhao，Wentao Hu，Tianchen Huang，Yuheng Min，Suquan Chen，Yide Gao 等
   - 来源：arxiv
   - 相关性分数：169
   - 命中原因：title matched "alignment"; summary matched "language model"; summary matched "large language model"; summary matched "LLM"
   - 分类：cs.CL, cs.AI
   - 标签：评测 / 应用 / 方法
   - 主题词：LLM / Language Model
   - PDF：https://arxiv.org/pdf/2606.03022
   - 摘要：Hallucination in Large Language Models (LLMs), characterized by the generation of content inconsistent with contextual facts or logical constraints -- remains a persistent challenge for reliable deployment. In this work, we address this issue through a geometric framework rooted in the linear representation hypothesis. We propose that hallucinations manifest as orthogonal noise relative to the semantic manifold of the residual stream. Specifically, we hypothesize that while attention heads ideally propagate information congruent with the context subspace, hallucinations arise when specific heads introduce components orthogonal to this subspace, disrupting the coherence of the latent representation. Based on this formulation, we introduce Dynamic Contextual Orthogonalization (DCO), an inference-time intervention method. DCO utilizes the input residual stream as a dynamic context anchor to perform orthogonal decomposition on attention head outputs. To distinguish between context-aligned semantic updates and divergent noise, DCO employs a layer-wise Z-score suppression mechanism that selectively attenuates outlier orthogonal components based on statistical distributions. Evaluations on Llama-3-8B and 70B across benchmarks such as XSum, NQ-Swap, and IFEval demonstrate that DCO achieves superior contextual faithfulness compared to state-of-the-art intervention baselines. Furthermore, DCO maintains high performance on knowledge-intensive tasks like TriviaQA and TruthfulQA, effectively mitigating the trade-off between hallucination suppression and parametric knowledge retention often observed in existing methods. Our findings validate the geometric interpretation of hallucinations and establish DCO as a computationally efficient approach for enforcing manifold alignment.Our code is available at https://github.com/Harry-Miral/DCO

11. [IdiomX A Multilingual Benchmark for Idiom Understanding, Retrieval, and Interpretation](https://arxiv.org/abs/2606.02584)
   - Published：2026-06-03 12:00
   - 作者：Ayman Ali Sharara
   - 来源：arxiv
   - 相关性分数：169
   - 命中原因：title matched "benchmark"; summary matched "language model"; summary matched "large language model"; summary matched "reasoning"
   - 分类：cs.CL, cs.AI, cs.IR
   - 标签：评测 / 数据 / 方法
   - 主题词：Language Model / Large Language Model
   - PDF：https://arxiv.org/pdf/2606.02584
   - 摘要：Idiomatic expressions remain a persistent challenge for natural language processing because their meanings are often non-compositional, context-dependent, and difficult to align across languages. Existing idiom resources are often limited in scale, contextual diversity, or multilingual coverage, restricting their utility for modern language models. We introduce IdiomX, a large-scale multilingual benchmark for idiom understanding, retrieval, and interpretation, constructed through a reproducible multi-stage pipeline combining lexical resource extraction, large-scale normalization, controlled large language model enrichment, and structured validation. The resulting dataset contains over 190K contextualized examples spanning 12K+ idioms, with aligned English, Arabic, and French semantic representations, idiomatic and literal usage labels, and rich linguistic metadata. Building on this resource, we define a unified four-task benchmark covering idiom detection, context-to-idiom retrieval, Arabic-to-English idiom retrieval, and idiom interpretation, extending evaluation from figurative recognition to semantic grounding and explainable meaning retrieval. Experiments show that contextual transformer models substantially improve idiom detection, while hybrid retrieval and reranking architectures significantly strengthen both monolingual and cross-lingual idiom retrieval. Results further demonstrate that idiom interpretation can be effectively modeled as a semantic retrieval task, introducing interpretability as a complementary benchmark dimension. Overall, IdiomX provides a scalable benchmark for studying idiomatic language as a progression from detection to retrieval and semantic interpretation, and offers a modular framework extensible to additional languages and figurative reasoning tasks

12. [Reasoning over Grammar: Can Synthetic Linguistic Reasoning Traces Enhance Low-Resource Machine Translation?](https://arxiv.org/abs/2606.03782)
   - Published：2026-06-03 12:00
   - 作者：Renhao Pei，Yihong Liu，Sampo Pyysalo，Hinrich Sch\"utze，Shaoxiong Ji
   - 来源：arxiv
   - 相关性分数：169
   - 命中原因：title matched "reasoning"; summary matched "language model"; summary matched "large language model"; summary matched "LLM"
   - 分类：cs.CL
   - 标签：评测 / 方法
   - 主题词：LLM / Language Model
   - PDF：https://arxiv.org/pdf/2606.03782
   - 摘要：Large language models (LLMs) offer a promising approach to machine translation (MT) for extremely low-resource languages by incorporating linguistic resources through in-context learning. However, LLMs often struggle to apply grammatical information effectively during translation. Inspired by recent progress in chain-of-thought reasoning, we investigate whether low-resource MT can benefit from structured intermediate steps of linguistic analysis and grammatical reasoning. We propose a pipeline for automatically generating step-by-step linguistic reasoning traces from Universal Dependencies treebanks, dictionaries, and grammar-rule banks. We evaluate these traces in three settings: in-context learning (ICL), supervised fine-tuning (SFT), and reinforcement fine-tuning (RFT), on Xibe and Chintang as test cases. Our results show that linguistic reasoning traces are most effective as inference-time guidance: in ICL, reliable sentence-specific traces substantially improve translation performance across most models, languages, and metrics. In contrast, using the linguistic reasoning traces as training data yields smaller and less consistent gains, as models learn the trace format but often generate erroneous content. These findings suggest that LLMs can leverage grammatical information for low-resource MT when given reliable linguistic analyses, while learning to generate such analyses remains a major bottleneck.

13. [Adaptive Latent Agentic Reasoning](https://arxiv.org/abs/2606.02871)
   - Published：2026-06-03 12:00
   - 作者：Dongwon Jung，Peng Shi，Yi Zhang，Junshan Zhang，Muhao Chen
   - 来源：arxiv
   - 相关性分数：155
   - 命中原因：title matched "reasoning"; title matched "agent"; summary matched "LLM"; summary matched "benchmark"
   - 分类：cs.CL, cs.AI
   - 标签：评测 / 应用 / 方法
   - 主题词：LLM / Benchmark
   - PDF：https://arxiv.org/pdf/2606.02871
   - 摘要：Large reasoning models improve performance by generating extended chain-of-thought (CoT) reasoning, but this behavior becomes inefficient when applied to LLM agents. Current LLM agents often generate verbose textual reasoning at every decision step and allocate reasoning effort nearly uniformly across turns, leading to substantial inefficiency in multi-turn agentic trajectories. We propose Adaptive Latent Agentic Reasoning (ALAR), a dual-mode framework that uses compact latent reasoning for routine turns and selectively escalates to explicit chain-of-thought when deeper deliberation is needed. ALAR learns latent reasoning by using the agent's actions as supervision anchors and is further optimized to use latent reasoning when it is sufficient for task success and reserve explicit CoT for harder decisions. Experiments on agentic search and tool-use benchmarks show that ALAR maintains comparable or better task accuracy while substantially reducing generated tokens by up to 43.6% in search and 84.6% in tool use. These results demonstrate that ALAR improves the accuracy-efficiency trade-off of LLM agents by reducing unnecessary textual reasoning while preserving explicit deliberation for harder decision steps.

14. [Clustered Self-Assessment: A Simple yet Effective Method for Uncertainty Quantification in Large Language Models](https://arxiv.org/abs/2606.03846)
   - Published：2026-06-03 12:00
   - 作者：Qi Cao，Takeshi Kojima，Andrew Gambardella，Helinyi Peng，Yutaka Matsuo，Yusuke Iwasawa
   - 来源：arxiv
   - 相关性分数：155
   - 命中原因：title matched "language model"; title matched "large language model"; summary matched "LLM"; summary matched "RAG"
   - 分类：cs.CL, cs.AI, cs.LG
   - 标签：数据 / 方法
   - 主题词：LLM / Language Model
   - PDF：https://arxiv.org/pdf/2606.03846
   - 摘要：Large language models (LLMs) demonstrate remarkable performance across diverse tasks, but they often generate responses that appear plausible while being factually incorrect. This problem is compounded by the lack of explicit uncertainty estimates, which makes it difficult for users to judge the reliability of model outputs. Existing uncertainty quantification methods typically rely on indirect signals, such as entropy across sampled generations. These signals can be difficult to interpret and do not fully leverage the model's ability to assess its own uncertainty. We propose a simple yet effective self-assessment method for uncertainty quantification in LLMs. Our approach groups sampled generations into semantically distinct clusters, converts them into answer options in a structured multiple-choice question, and uses the probability assigned by the LLM to each option as a confidence estimate. Experiments across multiple models and datasets show that our method consistently outperforms baseline approaches. Notably, it achieves competitive performance with as few as two additional samples, demonstrating both its effectiveness and efficiency.

15. [Hint-Guided Diversified Policy Optimization for LLM Reasoning](https://arxiv.org/abs/2606.03021)
   - Published：2026-06-03 12:00
   - 作者：Zhiyu Cao，Kaixin Wu，Mingjie Zhong，Peifeng Li，Xiaobo Li，Can Ye 等
   - 来源：arxiv
   - 相关性分数：155
   - 命中原因：title matched "LLM"; title matched "reasoning"; summary matched "language model"; summary matched "large language model"
   - 分类：cs.CL
   - 标签：方法
   - 主题词：LLM / Language Model
   - PDF：https://arxiv.org/pdf/2606.03021
   - 摘要：Recent developments in Large Language Models (LLMs) have showcased impressive reasoning capabilities, with Reinforcement Learning with Verifiable Rewards (RLVR) being a promising enhancement strategy. However, existing reward mechanisms are constrained to the outcome-level correctness and lack explicit signals to guide the model to consider diverse solutions. In contrast, human problem solving typically involves evaluating multiple potential approaches and selecting the most reliable solution, a cognitive process that current RLVR frameworks do not explicitly incentivize. Inspired by this, we propose Hint-Guided Diversified Policy Optimization (HDPO), allowing the model to first list all potential candidate solution outlines as hints and then select the most reliable one for further reasoning. HDPO comprises two stages of Cold Start for Structured Reasoning and Hint-Guided Diversified Reinforcement Learning to incentivize the model to generate diverse and reliable solutions following the ``propose-select-think'' trajectory. Experimental results show that HDPO effectively boosts LLM reasoning and enhances the diversity of candidate solutions as well as the LLM's ability to identify reliable solutions.

## Agent Runtime Security 观察

### 本组速览

- 《D-Judge: Disrupting Multi-Turn Jailbreaks using Semantics-Preserving Output Rewriting》〔评测 / 数据 / 方法〕：Multi-turn jailbreak attacks pose a growing threat to large language model (LLM) safety because they exploit feedback from auxiliary judge models to iterativel…
- 《MedCUA-Bench: A Screenshot-Only Benchmark for Clinical Computer-Use Agents》〔评测 / 应用 / 方法〕：Computer-use agents could automate repetitive screen-based clinical work, but their reliability in medical graphical user interfaces remains largely unvalidate…
- 《MultiTurnPSB: Evaluating Multi-Turn Jailbreak Attacks an dClassifier-Based Defenses for Medical AI Safety》〔评测 / 应用 / 方法〕：Patient-facing medical chatbots are commonly evaluated on single-turn prompts, yet real users push back after refusals, add urgency, and invoke authority. We i…
- 《From Control Boundary to Insurance Claim: Reconstructing AI-Mediated Losses Through the CER Framework》〔方法〕：AI losses that arise through an insured organization's generative or agentic AI system require state reconstruction, not merely event reconstruction, because t…
- 《Acceptance-Test-Driven Evaluation Protocols for Business-Centric LLM Systems》〔评测 / 应用 / 方法〕：Large language model (LLM) applications are increasingly expected to satisfy deterministic institutional requirements while relying on probabilistic generative…

### 论文速览

1. [D-Judge: Disrupting Multi-Turn Jailbreaks using Semantics-Preserving Output Rewriting](https://arxiv.org/abs/2606.02640)
   - Published：2026-06-03 12:00
   - 作者：Huanli Gong，Zhipeng Wei，Yu Fu，Haz Sameen Shahgir，Ananya Gupta，Yue Dong 等
   - 来源：arxiv
   - 相关性分数：79
   - 命中原因：title matched "jailbreak"; has PDF; has rich summary; has complete metadata
   - 分类：cs.CR, cs.AI
   - 标签：评测 / 数据 / 方法
   - 主题词：LLM / Language Model
   - PDF：https://arxiv.org/pdf/2606.02640
   - 摘要：Multi-turn jailbreak attacks pose a growing threat to large language model (LLM) safety because they exploit feedback from auxiliary judge models to iteratively refine prompts toward harmful goals. Existing defenses largely detect or block unsafe content at individual turns or at the final response, leaving the judge-driven refinement loop intact and allowing attackers to extract informative feedback from intermediate interactions. We introduce D-Judge, a semantics-preserving output rewriting defense that intervenes directly in this loop by rewriting the victim LLM's responses before they are evaluated by the attacker's judge. By misaligning the judge's feedback signal without changing the meaning of the original response, D-Judge derails the attacker's prompt-refinement process, causing subsequent queries to be optimized against a distorted signal of attack progress. To improve D-Judge's ability to produce such rewrites, we construct a dataset of semantically equivalent response pairs that induce different judge-assigned harmfulness scores, and use it for supervised fine-tuning followed by direct preference optimization. Experiments on HarmBench show that D-Judge reduces the success rate of state-of-the-art multi-turn jailbreaks while preserving performance on benign benchmarks.

2. [MedCUA-Bench: A Screenshot-Only Benchmark for Clinical Computer-Use Agents](https://arxiv.org/abs/2606.03203)
   - Published：2026-06-03 12:00
   - 作者：Jia Yu，Zilong Wang，Xinyang Jiang，Dongsheng Li，Shuo Wang
   - 来源：arxiv
   - 相关性分数：79
   - 命中原因：title matched "computer-use agent"; has PDF; has rich summary; has complete metadata
   - 分类：cs.AI
   - 标签：评测 / 应用 / 方法
   - 主题词：Benchmark / RAG
   - PDF：https://arxiv.org/pdf/2606.03203
   - 摘要：Computer-use agents could automate repetitive screen-based clinical work, but their reliability in medical graphical user interfaces remains largely unvalidated. Existing benchmarks focus on general web or desktop tasks and underrepresent medical software, which requires domain knowledge, exhibits markedly different UI design from mainstream applications, lacks public testing environments, and demands safety validation beyond task completion. We introduce MedCUA-Bench, an interactive benchmark for clinical computer-use agents. It covers 18 clinical scenarios across 10 medical domains, reconstructed from real product manuals and open-source medical systems to capture authentic clinical interfaces while avoiding licensing and privacy constraints. Each task ships with paired intent- and step-level goals to disentangle clinical reasoning from UI execution, and is evaluated by a deterministic checker over task completion and five clinical safety dimensions. Across 23 agents, the best closed-source model reaches 54.2% strict success, while all models remain below 9% on the real OpenEMR. Open-source agents average only 2.5%, with the best reaching 16.2%. MedCUA-Bench exposes the gap between current agents and reliable clinical software use, providing a reproducible testbed for future research.

3. [MultiTurnPSB: Evaluating Multi-Turn Jailbreak Attacks an dClassifier-Based Defenses for Medical AI Safety](https://arxiv.org/abs/2606.02630)
   - Published：2026-06-03 12:00
   - 作者：Anushka Sheoran，Yiduo Hao
   - 来源：arxiv
   - 相关性分数：79
   - 命中原因：title matched "jailbreak"; has PDF; has rich summary; has complete metadata
   - 分类：cs.CR, cs.AI
   - 标签：评测 / 应用 / 方法
   - 主题词：Evaluation / Jailbreak
   - PDF：https://arxiv.org/pdf/2606.02630
   - 摘要：Patient-facing medical chatbots are commonly evaluated on single-turn prompts, yet real users push back after refusals, add urgency, and invoke authority. We introduce MultiTurnPSB, a four-turn adversarial extension of PatientSafetyBench, and evaluate GPT-4.1-mini under fixed template, template-adaptive, and live adversarial attacks. Unsafe responses rise from 35% to nearly 80% by Turn 4 under live attack. Under the same adversary, GPT-4.1-mini and Claude Sonnet 4.5 are statistically indistinguishable at baseline but diverge to a 19x gap by Turn 4, a difference invisible to single-turn evaluation. We characterize four degradation trajectory signatures and identify a two-element attack formula responsible for most catastrophic failures. A lightweight input-side classifier reduces Turn 4 unsafe responses by 52 percentage points despite severe accuracy degradation, but the 45% false alarm rate on benign queries is the primary deployment constraint. A methodological finding also emerges: Claude Sonnet refused to generate adversarial messages in over half of late-turn conversations despite explicit red team framing, suggesting safety training may generalize to the attacker role.

4. [From Control Boundary to Insurance Claim: Reconstructing AI-Mediated Losses Through the CER Framework](https://arxiv.org/abs/2606.03777)
   - Published：2026-06-03 12:00
   - 作者：Alex Leung，Rex Zhang，Kentaroh Toyoda，SiewMei Loh
   - 来源：arxiv
   - 相关性分数：75
   - 命中原因：summary matched "prompt injection"; summary matched "malicious tool"; has PDF; has rich summary
   - 分类：cs.AI, cs.CR, q-fin.RM
   - 标签：方法
   - 主题词：RAG / Agent
   - PDF：https://arxiv.org/pdf/2606.03777
   - 摘要：AI losses that arise through an insured organization's generative or agentic AI system require state reconstruction, not merely event reconstruction, because the relevant state changes as the system reasons, retrieves, calls tools, and acts. The relevant question is not only what loss occurred, but what the system was allowed to do, what it actually did, and whether that reconstructed loss can support insurance claim recovery. This paper addresses losses in which the insured's AI system is in the causal chain, including externally triggered failures such as prompt injection, retrieval-augmented generation (RAG) poisoning, malicious tool output, credential misuse, and data poisoning. Specifically, this paper introduces CER, a use-case-level diagnostic for AI residual risk transfer. C (control boundary) asks whether the system had an enforceable operating envelope. E (evidence reconstruction) asks whether the system state and causal chain can be reconstructed from retained artifacts. R (insurance response) asks whether the reconstructed loss is insured: whether insurance coverage is available in the market and placed for the insured, together with the proof needed to support insurance claim recovery. The paper makes three contributions: it defines the AI-specific reconstruction problem, operationalizes that problem through CER, and specifies claim-grade evidence for AI reconstruction. Public examples include the reported PocketOS and Replit agentic database-deletion incidents and Moffatt v. Air Canada as an adjudicated output/reliance case. Keywords: AI systems; CER framework; residual risk transfer; agentic AI; generative AI; AI insurance; evidence reconstruction.

5. [Acceptance-Test-Driven Evaluation Protocols for Business-Centric LLM Systems](https://arxiv.org/abs/2606.02755)
   - Published：2026-06-03 12:00
   - 作者：Eric Liang
   - 来源：arxiv
   - 相关性分数：57
   - 命中原因：summary matched "guardrail"; has PDF; has rich summary; has complete metadata
   - 分类：cs.SE, cs.AI
   - 标签：评测 / 应用 / 方法
   - 主题词：LLM / Language Model
   - PDF：https://arxiv.org/pdf/2606.02755
   - 摘要：Large language model (LLM) applications are increasingly expected to satisfy deterministic institutional requirements while relying on probabilistic generative components. This mismatch makes ordinary post-hoc benchmarking insufficient for systems that must be safe, reliable, auditable, and economically useful. This paper contributes an evaluation-protocol extension for operational LLM systems grounded in acceptance-test-driven development, safety engineering, and business-centric validation. The extension translates stakeholder goals into executable behavioral contracts, release gates, monitoring signals, and evidence artifacts before prompt, model, retrieval, or agent changes are accepted. It adapts the red-green-refactor discipline of test-driven development to a red-train-green lifecycle: first define failing acceptance tests for desired behavior, then improve the LLM system through prompt changes, retrieval design, fine-tuning, guardrails, or data augmentation, and finally release only when multidimensional gates are satisfied. The contribution is a governance-oriented metric stack, reference architecture, and empirical protocol for comparing acceptance-test-driven LLM development against prompt-first and benchmark-after workflows.

6. [Cost-Aware Query Routing in RAG: Empirical Analysis of Retrieval Depth Tradeoffs](https://arxiv.org/abs/2606.02581)
   - Published：2026-06-03 12:00
   - 作者：Sanjay Mishra
   - 来源：arxiv
   - 相关性分数：57
   - 命中原因：summary matched "guardrail"; has PDF; has rich summary; has complete metadata
   - 分类：cs.IR, cs.AI
   - 标签：评测 / 应用 / 方法
   - 主题词：LLM / Benchmark
   - PDF：https://arxiv.org/pdf/2606.02581
   - 摘要：Retrieval-augmented generation (RAG) faces a fundamental three-way tension: deeper retrieval improves factual grounding but inflates token costs and end-to-end latency. Static retrieval configurations cannot resolve this tension across heterogeneous query workloads -- simple definitional queries waste budget on unnecessary context, while complex analytical prompts are underserved by shallow retrieval. This paper introduces \emph{Cost-Aware RAG} (CA-RAG), a per-query routing framework that selects from a discrete catalog of \emph{strategy bundles} -- each coupling a retrieval depth (from retrieval-free direct inference to top-$k{=}10$ dense retrieval) with a fixed generation profile -- by maximizing a scalar utility that linearly combines an estimated quality prior with normalized penalties for predicted latency and total billed tokens. CA-RAG is implemented with FAISS-backed dense retrieval and OpenAI chat/embedding APIs, and evaluated on a 28-query benchmark spanning four bundles. The router dynamically exercises all bundles, achieving \textbf{26\% fewer billed tokens} than always-heavy retrieval and \textbf{34\% lower mean latency} than always-direct inference while maintaining equivalent answer quality. Per-query delta analysis reveals that savings are non-uniform and concentrated in simpler queries, motivating complexity-aware guardrails. Sensitivity analysis confirms that the same bundle catalog supports multiple cost-latency-quality operating points through weight adjustment alone. All results are generated directly from logged CSV artifacts for full reproducibility. CA-RAG provides a transparent, auditable foundation for cost-conscious LLM deployments.

7. [Toward a Modular Architecture for Embedded AI Agent Systems at the Edge](https://arxiv.org/abs/2606.02862)
   - Published：2026-06-03 12:00
   - 作者：Marcus R\"ub，Michael Gerhards
   - 来源：arxiv
   - 相关性分数：57
   - 命中原因：summary matched "policy enforcement"; has PDF; has rich summary; has complete metadata
   - 分类：cs.AI, cs.MA
   - 标签：评测 / 方法
   - 主题词：LLM / Language Model
   - PDF：https://arxiv.org/pdf/2606.02862
   - 摘要：The rise of Large Language Models (LLMs) has enabled agentic AI capable of complex reasoning and tool use; however, deploying such autonomy in pervasive computing environments remains challenging due to the strict memory and energy constraints of embedded microcontrollers. Existing frameworks typically assume server-class resources or continuous connectivity, leaving a gap for deeply embedded systems. This paper proposes a modular reference architecture for Embedded Agent Systems that bridges the divide between deterministic real-time control and agentic intelligence. We introduce a tiered design that decouples On-Device Agents - executing highly compressed neural networks and rule-based logic for low-latency, privacy-critical tasks - from Cloud-Augmented Agents that leverage Small Language Models (SLMs) for higher-level reasoning and planning. A key contribution is the integration of a cross-cutting Governance Layer, ensuring observability, policy enforcement, and safety across distributed fleets of autonomous devices. Rather than presenting purely empirical benchmarks, we analyze architectural design principles and trade-offs regarding latency, energy, and reliable execution in resource-constrained environments.

8. [Which Defense Closes Which Threat? Attributing OWASP-LLM-Top-10 Coverage and Its Brittleness Under Paraphrasing](https://arxiv.org/abs/2606.02822)
   - Published：2026-06-03 12:00
   - 作者：Alexandre Cristov\~ao Maiorano
   - 来源：arxiv
   - 相关性分数：57
   - 命中原因：summary matched "jailbreak"; has PDF; has rich summary; has complete metadata
   - 分类：cs.CR, cs.AI
   - 标签：评测 / 数据 / 应用 / 方法
   - 主题词：LLM / Benchmark
   - PDF：https://arxiv.org/pdf/2606.02822
   - 摘要：Production LLM applications stack several defense families -- refusal-phrase filters, token-budget controls, model allowlists, rate limits, tool-registry authentication -- yet existing breach-and-attack-simulation (BAS) benchmarks report a single aggregate coverage number, hiding which family closes which threat. We measure attribution. We add four OWASP-LLM-Top-10-aware agents to a 21-agent baseline scanner and target a lattice of four synthetic LLM endpoints: $L_0$ (no defenses), $L_1$ (refusal-only), $L_2$ (budget-only), and $L_3$ (full stack). $L_1$ and $L_2$ are sibling single-axis ablations, not subsets of each other; $L_3$ is their union plus tool-registry authentication and credential scrubbing. Across $N=10$ replications, the per-OWASP finding count is clean: refusal alone removes all LLM01 (jailbreak) and LLM07 (system-prompt leakage) findings; budget alone removes all LLM02 (sensitive-info disclosure) and LLM10 (unbounded consumption) findings by terminating multi-step sequences; LLM06 (excessive agency) requires the full stack. We probe brittleness under paraphrasing: with 300 Gemini-generated paraphrases ($K=5$ over a 60-template brittleness corpus), $L_1$ refusal block rate falls 15 pp on LLM01 and 25 pp on LLM07. A fifth target, $L_4$-real, swaps the stub backend for Gemini-2.5-flash behind the same $L_3$ regex and matches $L_1$ exactly, indicating no measurable alignment contribution beyond the regex (not a general claim about alignment). Budget controls show no drop (0 pp once the rate-limit floor is factored out). A refusal whitelist that clears a static benchmark can be defeated by an LLM-driven paraphraser without changing attack intent; a budget control resists the same mutation.

## Terminal and SWE Agents 观察

### 本组速览

- 《What Makes Interaction Trajectories Effective for Training Terminal Agents?》〔方法〕：Stronger code agents are commonly assumed to be superior teachers for post-training, yet this assumption remains poorly disentangled from task difficulty, harn…
- 《Cross-Lingual Token Arbitrage: Optimizing Code Agent Context Windows via Local LLM Preprocessing》〔评测 / 方法〕：AI-assisted coding agents are bottlenecked by input-token cost. Two pathologies of raw human input drive much of this overhead: tokenization inefficiency for n…
- 《Dependency-Guided Repository-Level C-to-Rust Translation with Reinforcement Alignment》〔评测 / 数据 / 方法〕：Automating C-to-Rust migration is critical for improving software security without sacrificing performance. Traditional rule-based methods struggle with divers…
- 《Handoff Debt: The Rediscovery Cost When Coding Agents Take Over Interrupted Tasks》〔评测 / 方法〕：Coding-agent benchmarks evaluate whether a single uninterrupted agent can resolve a repository issue. Real software work is messier: tasks are interrupted, rea…
- 《VulnAgent-R2: Evidence-Calibrated Multi-Agent Auditing for Repository-Level Vulnerability Detection》〔方法〕：Software vulnerabilities often depend on cross-file data flow, build options, framework conventions, and runtime guards, so isolated function classifiers produ…

### 论文速览

1. [What Makes Interaction Trajectories Effective for Training Terminal Agents?](https://arxiv.org/abs/2606.03461)
   - Published：2026-06-03 12:00
   - 作者：Sidi Yang，Chaofan Tao，Jierun Chen，Tiezheng Yu，Ruoyu Wang，Yuxin Jiang 等
   - 来源：arxiv
   - 相关性分数：115
   - 命中原因：title matched "terminal agent"; summary matched "Terminal-Bench"; summary matched "code agent"; has PDF
   - 分类：cs.AI
   - 标签：方法
   - 主题词：RAG / Agent
   - PDF：https://arxiv.org/pdf/2606.03461
   - 摘要：Stronger code agents are commonly assumed to be superior teachers for post-training, yet this assumption remains poorly disentangled from task difficulty, harness design, and student capacity. We investigate this pedagogical link using Terminal-Lego, a scalable pipeline that transforms multi-domain real-world issues into environment-verified agentic tasks. Surprisingly, standalone performance does not dictate teaching efficacy: while Claude Opus 4.6 achieves higher scores on Terminal-Bench 2.0, students fine-tuned on trajectories from DeepSeek-V3.2, a lower-scoring agent, exhibit significantly stronger generalization. We attribute this "pedagogical paradox" to Environment-Grounded Supervision (EGS): trajectories that explicitly expose inspect-act-verify behaviors through harness-visible interactions allow students to internalize robust problem-solving routines rather than fragile action sequences. Scaling analysis reveals exceptional data efficiency: with only 15.3k Terminal-Lego trajectories, for example, Qwen3-32B achieves a 24.3% score on Terminal-Bench 2.0, rivaling previous SOTA performance established with over 30x the data volume. Our results suggest that the frontier of agent post-training lies beyond mere outcome-matching, shifting the focus toward "Harness Engineering", where the systematic design of environment-grounded interaction structures serves as the primary catalyst for reproducible and generalizable agentic intelligence.

2. [Cross-Lingual Token Arbitrage: Optimizing Code Agent Context Windows via Local LLM Preprocessing](https://arxiv.org/abs/2606.03618)
   - Published：2026-06-03 12:00
   - 作者：Mehmet Utku Colak
   - 来源：arxiv
   - 相关性分数：97
   - 命中原因：title matched "code agent"; summary matched "coding agent"; has PDF; has rich summary
   - 分类：cs.AI
   - 标签：评测 / 方法
   - 主题词：LLM / Benchmark
   - PDF：https://arxiv.org/pdf/2606.03618
   - 摘要：AI-assisted coding agents are bottlenecked by input-token cost. Two pathologies of raw human input drive much of this overhead: tokenization inefficiency for non-English text and structural entropy in conversational prompts. Existing approaches act reactively by compressing already-bloated contexts or intervening after failures occur. We introduce a pre-flight, edge-side prompt-rewriting middleware that operates between the developer and the cloud agent. A local Llama 3.2 (3B) model performs cross-lingual translation into English, structural rewriting into a compact task-oriented format, and regex-validated rewrite-with-fallback safeguards to ensure the optimized prompt is never larger than the original. We evaluate on OMH-Polyglot, a multilingual coding benchmark spanning Turkish, Arabic, Chinese, and code-switched specifications. Across three commercial LLM backends, the middleware reduces prompt tokens by 34-47 percent and total tokens by up to 18.8 percent while preserving or improving task accuracy. Ablation studies show that gains arise primarily from the rewriting stage rather than simple function-name extraction. Compared with LLMLingua-2 at matched compression rates, our method consistently achieves superior OckScore performance across all evaluated backends. These results demonstrate that proactive prompt optimization can substantially reduce inference costs without sacrificing coding quality.

3. [Dependency-Guided Repository-Level C-to-Rust Translation with Reinforcement Alignment](https://arxiv.org/abs/2604.02852)
   - Published：2026-06-03 12:00
   - 作者：Jia Feng，Wenjie Gan，Cuiyun Gao，Chaozheng Wang，Feng Luo，Xin Xia 等
   - 来源：arxiv
   - 相关性分数：97
   - 命中原因：title matched "repository-level"; summary matched "repository level"; has PDF; has rich summary
   - 分类：cs.SE
   - 标签：评测 / 数据 / 方法
   - 主题词：LLM / Language Model
   - PDF：https://arxiv.org/pdf/2604.02852
   - 摘要：Automating C-to-Rust migration is critical for improving software security without sacrificing performance. Traditional rule-based methods struggle with diverse C idioms, often producing rigid and unidiomatic Rust code. Large Language Models (LLMs), trained on massive code corpora, offer a promising alternative by leveraging cross-language generalization to generate more idiomatic and maintainable Rust code. However, several challenges remain. First, existing LLM-based approaches fail to handle cross-file dependencies effectively, either ignoring them or including entire files as context, which limits accurate dependency modeling. Second, complex dependencies and structured inputs and outputs make it difficult to verify syntactic correctness and functional equivalence at the repository level. Third, the lack of large-scale C-Rust parallel data constrains model performance. We propose DepTrans, a framework that combines model capability enhancement with structured inference. DepTrans introduces Reinforcement-Aligned Syntax Training to improve generation quality through multi-task fine-tuning and feedback-driven reinforcement learning. It further applies Dependency-Guided Iterative Refinement to capture fine-grained cross-file dependencies and iteratively refine generated Rust code. We construct a dataset of 85k training samples and a benchmark of 145 repository-level instances. Experiments show that DepTrans achieves a 60.7 percent compilation success rate and 43.5 percent computational accuracy, outperforming the strongest baseline by 22.8 and 17.3 percentage points. It also successfully builds 7 of 15 industrial C projects, demonstrating its practical potential.

4. [Handoff Debt: The Rediscovery Cost When Coding Agents Take Over Interrupted Tasks](https://arxiv.org/abs/2606.02875)
   - Published：2026-06-03 12:00
   - 作者：Dipesh KC，Anjila Budathoki
   - 来源：arxiv
   - 相关性分数：79
   - 命中原因：title matched "coding agent"; has PDF; has rich summary; has complete metadata
   - 分类：cs.AI
   - 标签：评测 / 方法
   - 主题词：Benchmark / Agent
   - PDF：https://arxiv.org/pdf/2606.02875
   - 摘要：Coding-agent benchmarks evaluate whether a single uninterrupted agent can resolve a repository issue. Real software work is messier: tasks are interrupted, reassigned, reviewed, and resumed from partial states left by another agent or engineer. We study this missing dimension through \emph{handoff debt}: the rediscovery cost imposed when a predecessor's work is opaque or incomplete. Our takeover protocol interrupts a coding agent at deterministic handoff points, freezes the repository, and evaluates successor agents under four handoff views: repository state only, raw trace, summary notes, and structured notes. Across 75 source tasks, the protocol generates 181 handoff-point tasks and 724 takeover runs per successor model. Across three successor models, context-bearing handoffs reduce median agent events by 20--59\% and cumulative prompt tokens by 42--63\% relative to repository-only takeover. Solved-rate effects are smaller and model-dependent, but efficiency gains are consistent. These findings suggest that coding-agent evaluation should report not only whether a task is solved, but also how costly that work is for another agent to resume.

5. [VulnAgent-R2: Evidence-Calibrated Multi-Agent Auditing for Repository-Level Vulnerability Detection](https://arxiv.org/abs/2603.13384)
   - Published：2026-06-03 12:00
   - 作者：Renwei Meng，Haoyi Wu，Jingming Wang
   - 来源：arxiv
   - 相关性分数：79
   - 命中原因：title matched "repository-level"; has PDF; has rich summary; has complete metadata
   - 分类：cs.SE, cs.AI
   - 标签：方法
   - 主题词：LLM / RAG
   - PDF：https://arxiv.org/pdf/2603.13384
   - 摘要：Software vulnerabilities often depend on cross-file data flow, build options, framework conventions, and runtime guards, so isolated function classifiers produce fragile and poorly calibrated warnings. Repository-level LLM agents can gather richer evidence, but prior variants under-specify reproducibility, verifier behavior, baseline fairness, and statistical uncertainty. We present VulnAgent-R2, a budget-aware agentic auditing framework with three additional reusable modules: counterfactual evidence reweighting, build-aware verification-plan synthesis, and a cost-risk Pareto scheduler. The system combines graph triage, bounded context optimization, role-specialized agents, sceptic counter-evidence, selective dynamic verification, and calibrated fusion. On Devign, Big-Vul, DiverseVul, and PrimeVul, VulnAgent-R2 obtains 0.798/0.895, 0.739/0.871, 0.700/0.842, and 0.385/0.781 F1/AUROC, respectively. On JITVul it reaches 0.606 F1, 0.529 Top-1, and 0.742 Top-3 localization, while reducing online tokens by 38.3\% over always-full multi-agent execution. Online time includes retrieval, LLM calls, CER scoring, verifier planning, compilation, and test execution, but excludes one-time shared indexing. Bootstrap tests show the PrimeVul gain over VulnAgent-X is +0.038 F1, 95\% CI [0.020, 0.055], Holm-adjusted $p=0.009$. Treating vulnerability detection as calibrated evidence accumulation improves detection, localization, auditability, and cost control under the evaluated protocol, while remaining a prioritization aid rather than a replacement for manual review.Code is available at https://github.com/renweimeng/Vlun-Agent-X.

6. [Automated Repair of Requirements for Cyber-Physical Systems in Simulink Requirements Tables](https://arxiv.org/abs/2606.03870)
   - Published：2026-06-03 12:00
   - 作者：Aren A. Babikian，Alessio Di Sandro，Federico Formica，Claudio Menghi，Marsha Chechik
   - 来源：arxiv
   - 相关性分数：75
   - 命中原因：summary matched "program repair"; summary matched "automated program repair"; has PDF; has rich summary
   - 分类：cs.SE
   - 标签：方法
   - 主题词：RAG / Alignment
   - PDF：https://arxiv.org/pdf/2606.03870
   - 摘要：The development of complex software systems, e.g., cyber-physical systems (CPSs), involves continuous evolution of both system implementations and their requirements. These two artifacts often proceed independently, creating a risk of misalignment. For example, a system may be updated due to implementation-level concerns, yielding a new version that no longer satisfies its original requirements. Traditional compliance recovery techniques, e.g., automated program repair, address this problem by modifying the system while assuming that requirements are correct. However, faulty, outdated or inadequate requirements are a well-documented challenge in practice, motivating the complementary task of requirement repair. In this paper, we propose a framework that leverages system execution data to repair misaligned CPS requirements, thereby restoring requirement-to-system compliance. Our approach evaluates the correctness of declarative requirements over time-based, real-valued signals expressed using the MATLAB Simulink Requirements Tables language. We evaluate seven variants of our framework on six real-world case studies covering 12 requirements. Results confirm the effectiveness of the proposed framework in producing correct and useful repaired requirements.

7. [AgentLens: Revealing The Lucky Pass Problem in SWE-Agent Evaluation](https://arxiv.org/abs/2605.12925)
   - Published：2026-06-03 12:00
   - 作者：Priyam Sahoo，Gaurav Mittal，Xiaomin Li，Shengjie Ma，Benjamin Steenhoek，Pingping Lin 等
   - 来源：arxiv
   - 相关性分数：57
   - 命中原因：summary matched "SWE-bench"; has PDF; has rich summary; has complete metadata
   - 分类：cs.SE, cs.AI
   - 标签：评测 / 数据 / 方法
   - 主题词：Agent / Evaluation
   - PDF：https://arxiv.org/pdf/2605.12925
   - 摘要：Evaluation of software engineering (SWE) agents is dominated by a binary signal: whether the final patch passes the tests. This outcome-only view treats a principled solution and a chaotic trial-and-error process as equivalent. We show that this equivalence is empirically false. We evaluate 2,614 OpenHands trajectories from eight model backends on 60 SWE-bench Verified tasks. Of these, 47 have enough passing trajectories to construct task-level process references, yielding a 1,815-trajectory evaluation subset. Among passing trajectories in this subset, 10.7% exhibit behavior we call a Lucky Pass: regression cycles, blind retries, missing verification, or temporally disordered exploration, implementation, and verification. We introduce AgentLens, a framework for process-level assessment of SWE-agent trajectories, and define AgentLens-Bench, a dataset of 1,815 trajectories annotated with quality scores, waste signals, divergence points, and 47 task-level Prefix Tree Acceptor (PTA) references. AgentLens builds PTA references by merging multiple passing solutions for the same task, and uses a context-sensitive intent labeler to assign actions to Exploration, Implementation, Verification, or Orchestration based on trajectory history rather than tool identity alone. On AgentLens-Bench, the quality score separates passing trajectories into Lucky, Solid, and Ideal tiers and further decomposes Lucky Passes into five recurring mechanisms. Across the eight model backends, Lucky rates range from 0.5% to 23.2%, and some models move by as many as five rank positions when ranked by quality score instead of pass rate. We plan to release the project repository soon, including AgentLens-Bench artifacts, the AgentLens SDK, and the analysis tooling.

8. [EvoTrainer: Co-Evolving LLM Policies and Training Harnesses for Autonomous Agentic Reinforcement Learning](https://arxiv.org/abs/2606.03108)
   - Published：2026-06-03 12:00
   - 作者：Guhong Chen，Yingcheng Shi，Yongbin Li，Binhua Li，Xander Xu，Hu Wei 等
   - 来源：arxiv
   - 相关性分数：57
   - 命中原因：summary matched "repository-level"; has PDF; has rich summary; has complete metadata
   - 分类：cs.AI
   - 标签：评测 / 应用 / 方法
   - 主题词：LLM / Agent
   - PDF：https://arxiv.org/pdf/2606.03108
   - 摘要：Autonomous LLM training is often framed as recipe search, which leaves the training harness largely static. This limitation sharpens in agentic RL, where shifting bottlenecks and scalar rewards mask diverse failure modes. We introduce EvoTrainer, an autonomous training framework that co-evolves LLM policies and training-side harnesses through empirical feedback: it diagnoses rollout-level evidence, revises diagnostics, backtests interventions, and accumulates reusable skills. Evaluated on mathematical reasoning, competitive-programming code generation, and repository-level software engineering, EvoTrainer matches or exceeds the human-engineered RL references under the same data, codebase, and evaluation protocol, with the largest gain on long-horizon agentic SWE. Trajectory analyses show that retained strategies diverge across domains, evolving diagnostics prevent invalid high-scoring branches from being promoted, and reusable skills shape later search. Autonomous LLM RL should move beyond recipe search toward joint evolution of policies and the training harnesses that interpret them.

9. [Human-AI Collaboration and the Transformation of Software Engineering Work](https://arxiv.org/abs/2606.03394)
   - Published：2026-06-03 12:00
   - 作者：Mamdouh Alenezi
   - 来源：arxiv
   - 相关性分数：57
   - 命中原因：summary matched "coding agent"; has PDF; has rich summary; has complete metadata
   - 分类：cs.SE
   - 标签：方法
   - 主题词：Agent / Coding Agent
   - PDF：https://arxiv.org/pdf/2606.03394
   - 摘要：The integration of Generative AI (GenAI) and Agentic AI into software development is reconfiguring software engineering from an activity centered on human authorship of code into a discipline centered on directing, verifying, and governing autonomous and semi-autonomous systems. Drawing on a curated, multi-source evidence base of recent peer-reviewed and archival studies -- including large-scale empirical observations of autonomous coding agents contributing hundreds of thousands of pull requests to open-source repositories -- this paper synthesizes how the locus of engineering work is shifting from individual coding productivity toward human--AI collaboration, agent orchestration, verification and validation, governance, and socio-technical systems thinking. We adopt a structured interpretive synthesis to characterize three coexisting paradigms: Traditional, Generative AI-Enabled, and Agentic AI-Enabled software engineering. We map which traditional activities are being automated, which are being augmented, and which are newly emerging, and we trace plausible role trajectories over the next decade. The paper's principal contribution is an original, theory-driven competency framework that organizes the capabilities required of future engineers into five interacting categories -- % technical, cognitive, socio-technical, governance, and organizational -- % operationalized through a competency matrix and a transformation framework linking paradigm shifts to capability demands. We derive nine empirically testable propositions and articulate implications for theory, industry workforce transformation, university curricula, and organizational leadership. We argue that, as code becomes abundant, the durable value of the software engineer increasingly resides in intent specification, critical judgment, and accountable oversight rather than in the sheer volume of code produced.
