# Daily Paper Digest

- Generated at: 2026-06-11 13:59:12 (Asia/Shanghai)
- Search window: last 24 hours
- Hit overview: LM=15, Agent Runtime Security=4, Terminal and SWE Agents=3
- Ranking strategy: hybrid (relevance first, published_at tie-break)
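
The ranking strategy listed above can be read as a two-key sort. The following is a minimal sketch, not the digest's actual implementation: the `Paper` class, the `relevance` field name, and the `rank` function are hypothetical, and it assumes ties are broken newest first, which is consistent with the order of the entries in this digest. The sample scores and timestamps are taken from three entries in the LM section.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Paper:
    title: str
    relevance: int          # hypothetical field name for the relevance score
    published_at: datetime  # publication timestamp used as the tie-break


def rank(papers):
    """Order by relevance (highest first); break ties by newer published_at."""
    # Tuples compare element by element, so reverse=True sorts both keys
    # in descending order: relevance first, then publication time.
    return sorted(papers, key=lambda p: (p.relevance, p.published_at), reverse=True)


papers = [
    Paper("Reassessing High-Performing LLMs on Polish Medical Exams", 157,
          datetime(2026, 6, 10, 23, 52)),
    Paper("ALIGNBEAM", 159, datetime(2026, 6, 11, 1, 15)),
    Paper("Verifiable Environments Are LEGO Bricks", 159,
          datetime(2026, 6, 11, 1, 39)),
]

for p in rank(papers):
    print(p.relevance, p.published_at.isoformat(), p.title)
```

The two papers tied at 159 come out with the later one first, and the 157-point paper follows, matching positions 4 to 6 of the LM list.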

## Today's Highlights

- Topic "LLM": 16 hits, covering LM, Agent Runtime Security, and others; representative papers include "Measuring Epistemic Resilience of LLMs Under Misleading Medical Context" and "Soft-Prompt Tuning for Fair and Efficient LLM Benchmark Evaluation".
- Topic "Language Model": 14 hits, covering LM, Agent Runtime Security, and others; representative papers include "Measuring Epistemic Resilience of LLMs Under Misleading Medical Context" and "Soft-Prompt Tuning for Fair and Efficient LLM Benchmark Evaluation".
- Topic "Agent": 5 hits, covering LM and Terminal and SWE Agents; representative papers include "Claw-SWE-Bench: A Benchmark for Evaluating OpenClaw-style Agent Harnesses on Coding Tasks" and "Which Models Are Our Models Built On? Auditing Invisible Dependencies in Modern LLMs".
- Topic "Benchmark": 4 hits, covering LM; representative papers include "OpenMedReason: Scientific Reasoning Supervision for Medical Vision-Language Models" and "Claw-SWE-Bench: A Benchmark for Evaluating OpenClaw-style Agent Harnesses on Coding Tasks".
- Topic "RAG": 2 hits, covering Agent Runtime Security and Terminal and SWE Agents; representative papers include "Online Shift Detection and Conformal Adaptation for Deployed Safety Classifiers" and "PROJECTMEM: A Local-First, Event-Sourced Memory and Judgment Layer for AI Coding Agents".

## Section Status

- LM: 15 papers
- Agent Runtime Security: 4 papers
- Terminal and SWE Agents: 3 papers

## Topic Focus

### LLM

- Hits: 16
- Groups covered: LM, Agent Runtime Security, Terminal and SWE Agents
- Representative papers: "Measuring Epistemic Resilience of LLMs Under Misleading Medical Context", "Soft-Prompt Tuning for Fair and Efficient LLM Benchmark Evaluation", "Verifiable Environments Are LEGO Bricks: Recursive Composition for Reasoning Generalization"
- Quick read:
  - "Measuring Epistemic Resilience of LLMs Under Misleading Medical Context" (Evaluation / Application / Method): Large language models (LLMs) now reach expert-level scores on medical licensing exams, encouraging the assumption that high scores imply safe medical judgment…
  - "Soft-Prompt Tuning for Fair and Efficient LLM Benchmark Evaluation" (Evaluation / Data / Method): Benchmark scores often misrepresent a large language model's (LLM's) knowledge, because they rely, e.g., on the model's ability to follow specific formatting r…

### Language Model

- Hits: 14
- Groups covered: LM, Agent Runtime Security, Terminal and SWE Agents
- Representative papers: "Measuring Epistemic Resilience of LLMs Under Misleading Medical Context", "Soft-Prompt Tuning for Fair and Efficient LLM Benchmark Evaluation", "OpenMedReason: Scientific Reasoning Supervision for Medical Vision-Language Models"
- Quick read:
  - "Measuring Epistemic Resilience of LLMs Under Misleading Medical Context" (Evaluation / Application / Method): Large language models (LLMs) now reach expert-level scores on medical licensing exams, encouraging the assumption that high scores imply safe medical judgment…
  - "Soft-Prompt Tuning for Fair and Efficient LLM Benchmark Evaluation" (Evaluation / Data / Method): Benchmark scores often misrepresent a large language model's (LLM's) knowledge, because they rely, e.g., on the model's ability to follow specific formatting r…

### Agent

- Hits: 5
- Groups covered: LM, Terminal and SWE Agents
- Representative papers: "Claw-SWE-Bench: A Benchmark for Evaluating OpenClaw-style Agent Harnesses on Coding Tasks", "Which Models Are Our Models Built On? Auditing Invisible Dependencies in Modern LLMs", "Fourier Features Let Agents Learn High Precision Policies with Imitation Learning"
- Quick read:
  - "Claw-SWE-Bench: A Benchmark for Evaluating OpenClaw-style Agent Harnesses on Coding Tasks" (Evaluation / Data / Method): General-purpose agents such as OpenClaw are increasingly used as autonomous tool users, but their coding ability is difficult to measure under SWE-bench: a gen…
  - "Which Models Are Our Models Built On? Auditing Invisible Dependencies in Modern LLMs" (Evaluation / Method): Modern LLM training pipelines increasingly rely on other models to generate data, filter corpora, judge outputs, and guide development decisions. These depende…

### Benchmark

- Hits: 4
- Groups covered: LM
- Representative papers: "OpenMedReason: Scientific Reasoning Supervision for Medical Vision-Language Models", "Claw-SWE-Bench: A Benchmark for Evaluating OpenClaw-style Agent Harnesses on Coding Tasks", "APPO: Agentic Procedural Policy Optimization"
- Quick read:
  - "OpenMedReason: Scientific Reasoning Supervision for Medical Vision-Language Models" (Evaluation / Data / Application / Method): High-stakes clinical use of large vision-language models (LVLMs) requires reasoning that is grounded in visual evidence and clinical knowledge, not just correc…
  - "Claw-SWE-Bench: A Benchmark for Evaluating OpenClaw-style Agent Harnesses on Coding Tasks" (Evaluation / Data / Method): General-purpose agents such as OpenClaw are increasingly used as autonomous tool users, but their coding ability is difficult to measure under SWE-bench: a gen…

### RAG

- Hits: 2
- Groups covered: Agent Runtime Security, Terminal and SWE Agents
- Representative papers: "Online Shift Detection and Conformal Adaptation for Deployed Safety Classifiers", "PROJECTMEM: A Local-First, Event-Sourced Memory and Judgment Layer for AI Coding Agents"
- Quick read:
  - "Online Shift Detection and Conformal Adaptation for Deployed Safety Classifiers" (Evaluation): We present an online monitoring system for distributional shift in deployed safety classifiers, using calibrated sequential statistics to detect when a classif…
  - "PROJECTMEM: A Local-First, Event-Sourced Memory and Judgment Layer for AI Coding Agents" (Application / Method): AI coding assistants now support a growing share of software work, from quick scripts to production applications. Yet these agents remain largely stateless: ea…

## LM Watch

### Group Quick Look

- "Measuring Epistemic Resilience of LLMs Under Misleading Medical Context" (Evaluation / Application / Method): Large language models (LLMs) now reach expert-level scores on medical licensing exams, encouraging the assumption that high scores imply safe medical judgment…
- "Soft-Prompt Tuning for Fair and Efficient LLM Benchmark Evaluation" (Evaluation / Data / Method): Benchmark scores often misrepresent a large language model's (LLM's) knowledge, because they rely, e.g., on the model's ability to follow specific formatting r…
- "OpenMedReason: Scientific Reasoning Supervision for Medical Vision-Language Models" (Evaluation / Data / Application / Method): High-stakes clinical use of large vision-language models (LVLMs) requires reasoning that is grounded in visual evidence and clinical knowledge, not just correc…
- "Verifiable Environments Are LEGO Bricks: Recursive Composition for Reasoning Generalization" (Evaluation / Application / Method): Reinforcement Learning (RL) with verifiable environments has emerged as a powerful approach for enhancing the reasoning capabilities of Large Language Models (…
- "ALIGNBEAM: Inference-Time Alignment Transfer via Cross-Vocabulary Logit Mixing" (Evaluation / Application / Method): Domain fine-tuning degrades the safety of large language models: fine-tuned specialists readily comply with harmful prompts framed in domain language. Existing…

### Paper Overview

1. [Measuring Epistemic Resilience of LLMs Under Misleading Medical Context](https://arxiv.org/abs/2606.12291v1)
   - Published: 2026-06-11 00:27
   - Authors: Hongjian Zhou, Xinyu Zou, Jinge Wu, Sean Wu, Junchi Yu, Bradley Max Segal, et al.
   - Source: arXiv
   - Relevance score: 194
   - Match reasons: title matched "LLM"; summary matched "language model"; summary matched "large language model"; summary matched "reasoning"
   - Categories: cs.CL
   - Tags: Evaluation / Application / Method
   - Topic keywords: LLM / Language Model
   - PDF: https://arxiv.org/pdf/2606.12291v1
   - Abstract: Large language models (LLMs) now reach expert-level scores on medical licensing exams, encouraging the assumption that high scores imply safe medical judgment while patients increasingly use them for health advice. We show this assumption is fragile: when misleading context is injected into questions that LLMs originally answer correctly, they abandon the correct answer. We call the ability to maintain correct judgment under adversarial context epistemic resilience, and introduce MedMisBench to measure it. MedMisBench contains 10,932 medical question items and 48,889 misleading context-option pairs spanning medical reasoning, agentic capability, and patient-journey evaluation. Across 11 model configurations, mean accuracy falls from 71.1% on original questions to 38.0% under focused misleading context, with 51.5% attack success. The most damaging injections are formal, rule-like fabrications: authority-framed falsehoods reach 69.5% attack success and exception-poisoning claims reach 64.1%. A 14-member clinical panel from 7 countries identified serious potential harm in 38.2% of reviewed cases. MedMisBench exposes a structural blind spot in LLM evaluation in medical settings: existing benchmarks measure what models know, but not whether they preserve correct medical judgment under misleading context.

2. [Soft-Prompt Tuning for Fair and Efficient LLM Benchmark Evaluation](https://arxiv.org/abs/2606.12117v1)
   - Published: 2026-06-10 22:12
   - Authors: Selen Erkan, Bastian Boll, Kristian Kersting, Björn Deiseroth, Letitia Parcalabescu
   - Source: arXiv
   - Relevance score: 182
   - Match reasons: title matched "LLM"; title matched "benchmark"; title matched "evaluation"; summary matched "language model"
   - Categories: cs.CL, cs.AI
   - Tags: Evaluation / Data / Method
   - Topic keywords: LLM / Language Model
   - PDF: https://arxiv.org/pdf/2606.12117v1
   - Abstract: Benchmark scores often misrepresent a large language model's (LLM's) knowledge, because they rely, e.g., on the model's ability to follow specific formatting requirements. This especially penalizes base models that may know the correct answers but lack the ability -- typically introduced in post-training -- to structure them as instructed. To overcome this, we propose soft-prompt tuning, an efficient, fair, and architecture-agnostic model evaluation. By optimizing only 10 soft-prompt vectors (roughly 0.0006% parameters for a 7B model) over a short tuning period, we adapt models to specific benchmark formats, closing gaps in format-following and ensuring that underlying knowledge is accurately reflected in benchmark scores. This allows one to fairly compare different base models -- trained with various pre-training recipes -- on benchmarks without the need for full post-training. We evaluated soft-prompt tuning across 7 models and 7 datasets. The results show that (a) soft-prompt tuning saturates format-following within 80 steps (~640 samples) making it highly efficient, (b) soft-prompt tuning significantly outperforms zero- and few-shot prompting, surfacing base model knowledge that standard prompting misses, that (c) even post-trained models can benefit from soft-prompts to maximize format compliance, and that (d) soft-prompted base model performance predicts post-trained model rankings more reliably than zero- and few-shot baselines, offering a low-cost proxy for downstream model quality. Our contributions include (1) metrics which disentangle format-following and knowledge accuracy, (2) a fairer benchmarking protocol of LLM knowledge, and (3) a cost- and memory-effective recipe to identify optimal pre-training strategies early in LLM development.

3. [OpenMedReason: Scientific Reasoning Supervision for Medical Vision-Language Models](https://arxiv.org/abs/2606.12169v1)
   - Published: 2026-06-10 22:56
   - Authors: Negin Baghbanzadeh, Pritam Sarkar, Michael Colacci, Abeer Badawi, Adibvafa Fallahpour, Arash Afkanpour, et al.
   - Source: arXiv
   - Relevance score: 178
   - Match reasons: title matched "language model"; title matched "reasoning"; summary matched "alignment"; summary matched "RAG"
   - Categories: cs.CV, cs.AI, cs.CL, cs.LG
   - Tags: Evaluation / Data / Application / Method
   - Topic keywords: Language Model / Benchmark
   - PDF: https://arxiv.org/pdf/2606.12169v1
   - Abstract: High-stakes clinical use of large vision-language models (LVLMs) requires reasoning that is grounded in visual evidence and clinical knowledge, not just correct final answers. We introduce OpenMedReason, a large-scale, open multimodal medical reasoning corpus comprising approximately 450K image-question-answer instances whose reasoning traces are primarily derived from curated biomedical, human-authored scientific articles. OpenMedReason provides high-fidelity supervision beyond synthetic chains of thought, covering diverse medical domain vision modalities such as radiological scans, microscopic images, visible light photographs, charts, and others. We complement it with OpenMedReason-Bench, a held-out benchmark that allows fine-grained evaluation of LVLMs along three complementary axes of capability, including perception, medical knowledge, and rationale, enabling diagnostic evaluation beyond final-answer accuracy. OpenMedReason is a rich training resource that exhibits its effectiveness in both supervised fine-tuning (SFT) and reinforcement-based alignment. Training with OpenMedReason yields a 20% average improvement in VQA accuracy over the base model and achieves performance within 4.2% of the strongest comparable-scale medical LVLMs. Fine-grained performance analysis confirms that the gains are not concentrated in any single axis: OpenMedReason improves perception, medical knowledge, and rationale jointly, and its reasoning traces are preferred over those of the base model in 86.1% of pairwise comparisons. We release the code and dataset at huggingface.co/datasets/neginb/OpenMedReason.

4. [Verifiable Environments Are LEGO Bricks: Recursive Composition for Reasoning Generalization](https://arxiv.org/abs/2606.12373v1)
   - Published: 2026-06-11 01:39
   - Authors: Hao Xiang, Qiaoyu Tang, Le Yu, Yaojie Lu, Xianpei Han, Ben He, et al.
   - Source: arXiv
   - Relevance score: 159
   - Match reasons: title matched "reasoning"; summary matched "language model"; summary matched "large language model"; summary matched "LLM"
   - Categories: cs.CL
   - Tags: Evaluation / Application / Method
   - Topic keywords: LLM / Language Model
   - PDF: https://arxiv.org/pdf/2606.12373v1
   - Abstract: Reinforcement Learning (RL) with verifiable environments has emerged as a powerful approach for enhancing the reasoning capabilities of Large Language Models (LLMs). While prior research demonstrates that scaling environment quantity improves RL performance, existing manual or individual construction methods suffer from linear scaling limits, thereby hindering scalable reasoning generalization. This paper introduces RACES (Recursive Automated Composition for Environment Scaling), a framework that conceptualizes verifiable environments as composable building blocks that can be recursively assembled. The key insight is that when the codomain (output type) of one environment matches the domain (input type) of another, they can be automatically fused into a new verifiable environment, enabling recursive composition. RACES is implemented with 300 individual environments and defines a set of composition operators (SEQUENTIAL, PARALLEL, SORT, and SELECT) that induce diverse reasoning patterns. Extensive experiments show that RL training on these composite environments consistently enhances reasoning generalization. Specifically, RACES improves DeepSeek-R1-Distill-Qwen-14B by an average of 3.1 points (from 48.2 to 51.3) and boosts Qwen3-14B performance from 58.8 to 61.1 on six benchmarks, which are unseen during the construction of training environments. Moreover, RACES achieves performance comparable to training on 300 individual environments using only 50 base environments, demonstrating significant efficiency in environment utilization.

5. [ALIGNBEAM: Inference-Time Alignment Transfer via Cross-Vocabulary Logit Mixing](https://arxiv.org/abs/2606.12342v1)
   - Published: 2026-06-11 01:15
   - Authors: Chirag Chawla, Pratinav Seth, Vinay Kumar Sankarapu
   - Source: arXiv
   - Relevance score: 159
   - Match reasons: title matched "alignment"; summary matched "language model"; summary matched "large language model"; summary matched "LLM"
   - Categories: cs.CL, cs.AI, cs.ET, cs.LG
   - Tags: Evaluation / Application / Method
   - Topic keywords: LLM / Language Model
   - PDF: https://arxiv.org/pdf/2606.12342v1
   - Abstract: Domain fine-tuning degrades the safety of large language models: fine-tuned specialists readily comply with harmful prompts framed in domain language. Existing inference-time defenses that mix logits from a safe anchor model require both models to share a vocabulary, which rules them out for the cross-family specialists where safety is most degraded. We present ALIGNBEAM, a training-free method that lifts this restriction by translating anchor logits into the target model's vocabulary token-by-token at each decoding step; a small LLM judge then selects the safest among K candidate continuations. No weights are changed, and the safety-utility trade-off can be tuned at deployment without retraining. Across both cross-vocabulary and same-vocabulary evaluation pairs, ALIGNBEAM substantially raises refusal on adversarial benchmarks while keeping task accuracy and inference overhead within practical bounds. The results show that safety alignment can be transferred between model families at inference time, without touching either model's weights.

6. [Reassessing High-Performing LLMs on Polish Medical Exams: True Competence or Bias-Driven Performance?](https://arxiv.org/abs/2606.12250v1)
   - Published: 2026-06-10 23:52
   - Authors: Antoni Lasik, Jakub Pokrywka, Łukasz Grzybowski, Jeremi Ignacy Kaczmarek, Gabriela Korzańska, Janusz Świeczkowski-Feiz, et al.
   - Source: arXiv
   - Relevance score: 157
   - Match reasons: title matched "LLM"; summary matched "language model"; summary matched "large language model"; summary matched "reasoning"
   - Categories: cs.CL
   - Tags: Evaluation / Application / Method
   - Topic keywords: LLM / Language Model
   - PDF: https://arxiv.org/pdf/2606.12250v1
   - Abstract: Large language models (LLMs) in medicine are mainly evaluated using multiple-choice question answering (MCQA), which can overestimate real clinical ability due to guessing strategies and answer biases. To address these limitations, we introduce an expanded and more challenging benchmark based on Polish medical exams, adding over 15,000 questions, two new domains, and four structural modifications that reduce MCQA-specific artifacts and better test reasoning. We evaluate 21 LLMs and show that evaluation design strongly affects results. Under our harder setup, the best model (Qwen3.5-122B) drops by 28.4 and 31 pp on English and Polish exams, respectively. Despite low evidence of data contamination, standard MCQA scores do not reliably reflect true medical competence. To facilitate further research, we make our benchmark publicly available.

7. [Beyond Fully Random Masking: Attention-Guided Denoising and Optimization for Diffusion Language Models](https://arxiv.org/abs/2606.12273v1)
   - Published: 2026-06-11 00:14
   - Authors: Jia Deng, Junyi Li, Wayne Xin Zhao, Jinpeng Wang, Hongyu Lu, Ji-Rong Wen
   - Source: arXiv
   - Relevance score: 140
   - Match reasons: title matched "language model"; summary matched "large language model"; summary matched "LLM"; summary matched "reasoning"
   - Categories: cs.CL
   - Tags: Evaluation / Method
   - Topic keywords: LLM / Language Model
   - PDF: https://arxiv.org/pdf/2606.12273v1
   - Abstract: Diffusion large language models (dLLMs) offer an efficient alternative to autoregressive models through parallel decoding, yet existing post-training methods largely rely on random masking strategies that overlook intrinsic token dependencies. In this work, we present an empirical analysis of attention in dLLMs and show that tokens attending more strongly to unmasked context exhibit greater generation stability and play a critical role in reasoning. Motivated by these findings, we propose AGDO, an attention-guided denoising and optimization framework that aligns both training and optimization with attention-derived dependencies. AGDO determines the denoising order based on attention structure and emphasizes attention-critical tokens during supervised fine-tuning and reinforcement learning. Experiments on mathematical and coding benchmarks demonstrate that AGDO consistently improves reasoning performance, outperforming state-of-the-art post-training methods for dLLMs.

8. [A Lightweight Multi-Agent Framework for Automated Concrete Barrier Design](https://arxiv.org/abs/2606.12040v1)
   - Published: 2026-06-10 21:06
   - Authors: Wanting Wang, Xiye Ma, Yuyang He, Minghui Cheng, Ran Cao
   - Source: arXiv
   - Relevance score: 137
   - Match reasons: title matched "agent"; summary matched "language model"; summary matched "large language model"; summary matched "LLM"
   - Categories: cs.AI, cs.GR
   - Tags: Evaluation / Application / Method
   - Topic keywords: LLM / Language Model
   - PDF: https://arxiv.org/pdf/2606.12040v1
   - Abstract: The design of reinforced concrete highway barriers is a safety-critical process that requires strict compliance with regulatory provisions such as the AASHTO-LRFD bridge design guidelines. Current engineering practice relies heavily on manual, iterative, and heuristic calculations to satisfy complex nonlinear material and mechanics constraints. Although Large Language Models (LLMs) demonstrate strong generative capabilities, their direct application to structural engineering remains limited by hallucination risks and insufficient physical grounding. To address these challenges, this study proposes a novel "generation-evaluation-optimization" closed-loop framework for automated concrete barrier design using the multi-agent orchestration capabilities of AutoGen. Experimental results demonstrate that the proposed agentic framework achieves over 98% design accuracy, significantly outperforming standalone general-purpose LLMs. More importantly, the study reveals that design performance is not necessarily correlated with model scale, where an 8B-parameter lightweight model could outperform unconstrained 631B-parameter flagship models. This finding highlights the potential to substantially reduce computational costs while improving the accessibility of AI-assisted engineering tools for industry applications. The source code for the proposed multi-agent design framework is available at the project GitHub repository: https://github.com/MXY820/barrier-design. Keywords: Structural Engineering; Multi-Agent Systems; Large Language Models; Concrete Barrier Design; AutoGen; Design Automation.

9. [Claw-SWE-Bench: A Benchmark for Evaluating OpenClaw-style Agent Harnesses on Coding Tasks](https://arxiv.org/abs/2606.12344v1)
   - Published: 2026-06-11 01:16
   - Authors: Mengyu Zheng, Kai Han, Boxun Li, Haiyang Xu, Yuchuan Tian, Wei He, et al.
   - Source: arXiv
   - Relevance score: 127
   - Match reasons: title matched "agent"; title matched "benchmark"; summary matched "evaluation"; has PDF
   - Categories: cs.LG, cs.CL
   - Tags: Evaluation / Data / Method
   - Topic keywords: Benchmark / Agent
   - PDF: https://arxiv.org/pdf/2606.12344v1
   - Abstract: General-purpose agents such as OpenClaw are increasingly used as autonomous tool users, but their coding ability is difficult to measure under SWE-bench: a generic agent does not by itself satisfy the clean Docker workspace, patch, and prediction contract required for scoring. We introduce Claw-SWE-Bench, a multilingual SWE-bench-style benchmark and adapter protocol that makes heterogeneous agent harnesses, or claws, comparable under fair settings including a fixed prompt, runtime budget, workspace contract, patch extraction procedure, and evaluator. The full benchmark contains 350 GitHub issue-resolution instances across 8 languages and 43 repositories, drawn from SWE-bench-Multilingual and SWE-bench-Verified-Mini after future-commit cleanup. We also release Claw-SWE-Bench Lite for faster validation, which is an 80-instance subset selected by a cost-aware, rank-aware procedure over 17 calibration columns. On the full benchmark, OpenClaw with a minimal direct-diff adapter scores only 19.1% Pass@1, whereas the full adapter reaches 73.4% with the same GLM 5.1 backbone, showing that adapter design is essential for enabling OpenClaw-style harnesses to perform coding tasks effectively. Across an OpenClaw × nine-model sweep and a five-claw × two-model sweep, model choice changes Pass@1 by 29.4 pp and harness choice by 27.4 pp under fixed models; systems with similar accuracy can differ substantially in total API cost. Claw-SWE-Bench therefore treats harness and cost accounting as first-class axes of SWE-style coding-agent evaluation, providing both a full benchmark and a low-cost reference set for reproducible comparison. The data is available at https://github.com/opensquilla/claw-swe-bench and https://huggingface.co/datasets/TokenRhythm/Claw-SWE-Bench.

10. [Adaptive Multi-Resolution Procedural Knowledge Compression for Large Language Models](https://arxiv.org/abs/2606.12203v1)
   - Published: 2026-06-10 23:21
   - Authors: Changyue Wang, Weihang Su, Qingyao Ai, Yichen Tang, Runzhong Qiao, Xuancheng Li, et al.
   - Source: arXiv
   - Relevance score: 125
   - Match reasons: title matched "language model"; title matched "large language model"; summary matched "LLM"; has PDF
   - Categories: cs.CL
   - Tags: Application / Method
   - Topic keywords: LLM / Language Model
   - PDF: https://arxiv.org/pdf/2606.12203v1
   - Abstract: Large language models (LLMs) are widely used to tackle complex tasks with autonomous workflows. Recently, reusable natural language skills have emerged as a popular paradigm to inject procedural knowledge into LLM applications. Since popular skills are often invoked repeatedly, placing their full text in every context significantly increases prefill cost and latency. While text compression techniques have the potential to solve this problem, most existing methods are designed to compress factual knowledge in documents instead of procedural knowledge, making them insufficient for skill compression. In this paper, we argue that an effective skill compression method should: 1) preserve logical dependencies among workflows and tool protocols, 2) enable lightweight, offline compression for frequently updated community skills, and 3) be adaptable to varying complexities across skills. To address this, we present SKIM (SKIll coMpression), an adaptive multi-resolution soft token compression framework for procedural skills. Depending on the complexity of each skill, SKIM creates different numbers of soft tokens that not only improve the efficiency of LLM inference, but also preserve the effectiveness of skill usage. Experiments indicate that SKIM compresses skills to 30 to 60 percent of their original token length while preserving task performance better than existing compression methods. We have released our code at https://github.com/bebr2/SKIM.

11. [Which Speech Representation Better Matches Text-Native Reasoning? A Study of Speech-Text Alignment on Frame Rate and Representation](https://arxiv.org/abs/2606.12199v1)
   - Published: 2026-06-10 23:19
   - Authors: Zhen Ye, Xu Tan, Yiming Li, Guangyan Zhang, Chimin Chan, Haohe Liu, et al.
   - Source: arXiv
   - Relevance score: 125
   - Match reasons: title matched "reasoning"; title matched "alignment"; summary matched "LLM"; has PDF
   - Categories: eess.AS, cs.CL, cs.SD
   - Tags: Method
   - Topic keywords: LLM / Reasoning
   - PDF: https://arxiv.org/pdf/2606.12199v1
   - Abstract: Spoken dialogue models typically start from text LLM backbones, yet reasoning often degrades when conditioning on speech instead of text. We attribute part of this modality gap to a temporal-granularity mismatch: speech tokens are temporally redundant and far longer than text under matched semantics, diluting per-token semantic density and weakening text-native reasoning dynamics. We study speech token design as a representation selection problem and sweep frame rates under a frozen LLM backbone with a fixed information rate. To make low frame rates feasible, we introduce factorized FSQ and a lightweight non-autoregressive audio LM head, scaling capacity to nearly 300 bits/frame without sacrificing efficient prediction. With the bottleneck removed, we sweep frame rates (50 → 2.08 Hz) and alignment depth, and observe a consistent best regime for speech QA at 4.17 Hz with intermediate-layer representation alignment.

12. [Detecting Sensitive Personal Information in Japanese Pre-Training Corpora for Large Language Models](https://arxiv.org/abs/2606.12114v1)
   - Published: 2026-06-10 22:07
   - Authors: Rei Minamoto, Yusuke Oda, Daisuke Kawahara
   - Source: arXiv
   - Relevance score: 124
   - Match reasons: title matched "language model"; title matched "large language model"; summary matched "LLM"; has PDF
   - Categories: cs.CL
   - Tags: Data / Application / Method
   - Topic keywords: LLM / Language Model
   - PDF: https://arxiv.org/pdf/2606.12114v1
   - Abstract: Sensitive personal information can appear in large-scale pre-training corpora for large language models (LLMs). Detecting and filtering such information is therefore essential to ensure compliance with privacy regulations and prevent unintended information leakage. However, in contrast to English and other languages, research into sensitive personal information has been limited in the Japanese language. In this study, we focus on sensitive personal data defined as special care-required personal information (SCPI) under Japan's Act on the Protection of Personal Information (APPI). We construct an SCPI dataset using LLM-based annotation and train machine learning models to rapidly detect SCPI in text. As a result, our SCPI classifier can effectively identify information related to SCPI. This study is the first to explore SCPI detection in Japanese text corpora, highlighting the challenges of accurate detection.

13. [Which Models Are Our Models Built On? Auditing Invisible Dependencies in Modern LLMs](https://arxiv.org/abs/2606.12385v1)
   - Published: 2026-06-11 01:47
   - Authors: Sanjay Adhikesaven, Haoxiang Sun, Sewon Min
   - Source: arXiv
   - Relevance score: 123
   - Match reasons: title matched "LLM"; summary matched "agent"; summary matched "RAG"; summary matched "evaluation"
   - Categories: cs.CL
   - Tags: Evaluation / Method
   - Topic keywords: LLM / Agent
   - PDF: https://arxiv.org/pdf/2606.12385v1
   - Abstract: Modern LLM training pipelines increasingly rely on other models to generate data, filter corpora, judge outputs, and guide development decisions. These dependencies are recursive: a model may depend on an upstream artifact whose own dependencies are documented only in separate releases and artifacts. As a result, the full dependency structure is fragmented across heterogeneous public artifacts, with complexity and recursive depth far outpacing humans' ability to trace. We introduce ModSleuth, an agentic system that recursively reconstructs LLM dependency graphs from public artifacts with source-grounded evidence. We find that the primary challenge is no longer information extraction, but defining what constitutes a dependency and reconciling artifact references across inconsistent documentation. We address these challenges through a formalization that distinguishes direct and indirect dependencies, represents heterogeneous pipeline roles through operation-centered relationships, and resolves artifact identities across names, versions, and repositories. Applying ModSleuth to four public-artifact-rich LLM releases, we recover 1,060 source-verified dependencies and construct large-scale dependency graphs of modern LLM development. These graphs reveal multi-hop license obligations, train-evaluation coupling, discrepancies between released and training-time artifacts, and documentation inconsistencies that would otherwise be difficult to uncover. We release ModSleuth and the resulting dependency graphs to support transparent analysis of the increasingly complex ecosystems underlying modern LLMs.

14. [APPO: Agentic Procedural Policy Optimization](https://arxiv.org/abs/2606.12384v1)
   - Published: 2026-06-11 01:47
   - Authors: Xucong Wang, Ziyu Ma, Yong Wang, Yuxiang Ji, Shidong Yang, Guanhua Chen, et al.
   - Source: arXiv
   - Relevance score: 123
   - Match reasons: title matched "agent"; summary matched "language model"; summary matched "large language model"; summary matched "benchmark"
   - Categories: cs.LG, cs.AI
   - Tags: Evaluation / Application / Method
   - Topic keywords: Language Model / Benchmark
   - PDF: https://arxiv.org/pdf/2606.12384v1
   - Abstract: Recent advances in agentic Reinforcement Learning (RL) have substantially improved the multi-turn tool-use capabilities of large language model agents. However, most existing methods assign credit over coarse heuristic units, such as tool-call boundaries or fixed workflows, making it difficult to identify which intermediate decisions influence downstream outcomes. In this work, we study agentic RL from two perspectives: *where to branch and how to assign credit after branching*. Our pilot analysis shows that influential decision points are broadly distributed throughout the generated sequence rather than concentrated at tool calls, while token entropy alone does not reliably reflect their impact on final outcomes. Motivated by these observations, we propose **Agentic Procedural Policy Optimization (APPO)**, which shifts branching and credit assignment from coarse interaction units to fine-grained decision points in the sequence. APPO selects branching locations using a Branching Score that combines token uncertainty with policy-induced likelihood gains of subsequent continuations, enabling more targeted exploration while filtering out spurious high-entropy positions. It further introduces procedure-level advantage scaling to better distribute credit across branched rollouts. Experiments on 13 benchmarks show that APPO consistently improves strong agentic RL baselines by nearly 4 points, while keeping efficient tool-calls and maintaining behavior interpretability.

15. [Fourier Features Let Agents Learn High Precision Policies with Imitation Learning](https://arxiv.org/abs/2606.12334v1)
   - Published: 2026-06-11 01:05
   - Authors: Balázs Gyenes, Emiliyan Gospodinov, Jan Frieling, Enrico Krohmer, Nicolas Schreiber, Xiaogang Jia, et al.
   - Source: arxiv
   - Relevance score: 123
   - Match reasons: title matched "agent"; summary matched "reasoning"; summary matched "RAG"; summary matched "benchmark"
   - Categories: cs.LG, cs.RO
   - Tags: Evaluation / Application / Method
   - Topics: Benchmark / Agent
   - PDF: https://arxiv.org/pdf/2606.12334v1
   - Abstract: High-precision robotic manipulation requires fine-grained spatial reasoning that is often difficult to achieve with RGB-only policies due to depth ambiguity and perspective scale issues. Policies that leverage 3D information directly, such as those based on point clouds, offer a stronger geometric prior over purely image-based ones, yet their performance remains highly task-dependent. We hypothesize that this discrepancy may be due to the spectral bias of neural networks towards learning low frequency functions, which especially affects architectures conditioned on slow-moving Cartesian features. We thus propose to map point clouds from Cartesian space into high-dimensional Fourier space, effectively equipping the point cloud encoder with direct access to high-frequency features. We experimentally validate the use of Fourier features on challenging manipulation tasks from the RoboCasa and ManiSkill3 benchmarks and on a real robot setup. Despite their simplicity, we find that Fourier features provide significant benefits across diverse encoder architectures and benchmarks and are robust across hyperparameters. Our results indicate that Fourier features let policies leverage geometric details more effectively than Cartesian features, showing their potential as a general-purpose tool for point cloud-based imitation learning. We provide source code and videos on our project page: https://fourier-il.github.io/fourier-il
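
The idea in item 15 (mapping point-cloud coordinates from Cartesian space into a higher-dimensional Fourier space) can be illustrated with a small sketch. The abstract does not state the exact encoding, so the octave-spaced sin/cos scheme below, the function name `fourier_features`, and the `num_bands` parameter are all assumptions chosen for illustration, not the authors' implementation.

```python
import math

def fourier_features(point, num_bands=4):
    """Map one 3D point to sin/cos features at octave-spaced frequencies.

    Assumed encoding (not taken from the paper): for each band k and each
    coordinate c, emit sin(2^k * pi * c) and cos(2^k * pi * c).
    """
    feats = []
    for k in range(num_bands):
        freq = (2.0 ** k) * math.pi
        for coord in point:
            feats.append(math.sin(freq * coord))
            feats.append(math.cos(freq * coord))
    return feats

def euclidean(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.dist(a, b)

# Two points that differ by 1 cm along x (coordinates in metres, hypothetical).
p = (0.10, 0.0, 0.0)
q = (0.11, 0.0, 0.0)

print("Cartesian distance:", euclidean(p, q))
print("Fourier distance:  ", euclidean(fourier_features(p), fourier_features(q)))
```

With these settings each 3D point becomes a 24-dimensional vector, and the small displacement between `p` and `q` is spread over several frequency bands, so it shows up as a larger difference in the encoder's input than the raw coordinates would give.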

## Agent Runtime Security Observations

### Group at a Glance

- "Grammar-Constrained Decoding Can Jailbreak LLMs into Generating Malicious Code" [Evaluation / Method]: Large Language Models (LLMs) are increasingly used for code generation, raising concerns that they may be misused to produce malicious code. Meanwhile, Grammar…
- "OCELOT: Inference-Leakage Budgets for Privacy-Preserving LLM Agents" [Evaluation / Method]: Large language model (LLM) agents increasingly act on a user's behalf -- reading personal files, calling tools, transacting with external services -- possibly…
- "Online Shift Detection and Conformal Adaptation for Deployed Safety Classifiers" [Evaluation]: We present an online monitoring system for distributional shift in deployed safety classifiers, using calibrated sequential statistics to detect when a classif…
- "External Experience Serving in Production LLM Systems: A Deployment-Oriented Study of Quality-Cost Trade-offs" [Application]: Production LLM systems accumulate reusable operational experience, but the practical deployment issue is not merely whether such experience can help. It is how…

### Paper Overview

1. [Grammar-Constrained Decoding Can Jailbreak LLMs into Generating Malicious Code](https://arxiv.org/abs/2606.11817v1)
   - Published: 2026-06-10 16:50
   - Authors: Yitong Zhang, Shiteng Lu, Jia Li
   - Source: arxiv
   - Relevance score: 60
   - Match reasons: title matched "jailbreak"; has PDF; has rich summary; has complete metadata
   - Categories: cs.CR, cs.AI, cs.CL, cs.SE
   - Tags: Evaluation / Method
   - Topics: LLM / Language Model
   - PDF: https://arxiv.org/pdf/2606.11817v1
   - Abstract: Large Language Models (LLMs) are increasingly used for code generation, raising concerns that they may be misused to produce malicious code. Meanwhile, Grammar-Constrained Decoding (GCD) has been widely adopted to improve the reliability of LLM-generated code by enforcing syntactic validity. In this paper, we reveal a counterintuitive risk: this reliability-oriented technique can itself become an attack surface. We uncover a new jailbreak attack, termed CodeSpear, that exploits GCD to induce LLMs into generating malicious code. Our experiments show that simply applying a benign code grammar constraint can effectively jailbreak LLMs. To address this vulnerability, we propose CodeShield, a safety alignment approach that robustly preserves safe behavior even under attacker-controlled grammar constraints. CodeShield aligns the model in the code modality by teaching it to generate honeypot code under GCD. Such code is semantically harmless, so it does not implement the malicious request, and structurally diverse, so it is difficult to suppress through grammar tightening. At the same time, CodeShield still preserves natural-language refusals when natural language is available. Experiments on 10 popular LLMs across 4 benchmarks show that CodeSpear outperforms representative jailbreak baselines and increases the attack success rate by more than 30 percentage points on average. CodeShield also restores safety under CodeSpear while preserving benign utility. Our findings reveal a fundamental risk of GCD and call for greater attention to its potential security implications.

2. [OCELOT: Inference-Leakage Budgets for Privacy-Preserving LLM Agents](https://arxiv.org/abs/2606.12341v1)
   - Published: 2026-06-11 01:13
   - Authors: Jin Xie, Songze Li
   - Source: arxiv
   - Relevance score: 47
   - Match reasons: summary matched "jailbreak"; has PDF; has rich summary; has complete metadata
   - Categories: cs.CR
   - Tags: Evaluation / Method
   - Topics: LLM / Language Model
   - PDF: https://arxiv.org/pdf/2606.12341v1
   - Abstract: Large language model (LLM) agents increasingly act on a user's behalf -- reading personal files, calling tools, transacting with external services -- possibly leaking personally identifiable information (PII) across trust boundaries at every step. Privacy here is a property not of a single output but of an entire trajectory, and three properties make it hard: leakage is cumulative, as individually innocuous releases accumulate across honest-but-curious or colluding sinks into inferences about a protected secret; bidirectional, as a malicious observation can inject instructions that turn the agent's own reasoning model against the user; and task-dependent, as the same field is necessary for one recipient yet gratuitous for another. Per-release contextual-integrity filters, information-flow controls, and posterior-leakage monitors each address part of this but none controls cumulative, inference-based leakage at runtime. We recast agent privacy as *posterior-risk control* and present OCELOT, a runtime mediator that budgets how much an adversary's belief about a secret may improve across a trajectory, rather than filtering outputs. Its mechanism, *Witness-Verified Declassification*, separates judgment from trust: an untrusted, locally fine-tuned defender model inspects each candidate release and emits structured evidence -- labeled atoms and proposed declassification operators -- which a deterministic verifier audits, charging a certified min-entropy cost for the chosen variant and authorizing the least-disclosing useful release under a sink-trust-weighted budget recorded on a tamper-evident ledger. Across diverse agent benchmarks and recent defenses, OCELOT attains significantly lower leakage at higher task utility, resists adaptive injection, jailbreak, cumulative inference, and sink collusion, and adds only modest overhead.

3. [Online Shift Detection and Conformal Adaptation for Deployed Safety Classifiers](https://arxiv.org/abs/2606.11949v1)
   - Published: 2026-06-10 19:24
   - Authors: Jun Wen Leong
   - Source: arxiv
   - Relevance score: 41
   - Match reasons: summary matched "jailbreak"; has PDF; has rich summary; has complete metadata
   - Categories: cs.LG, cs.CR, stat.ML
   - Tags: Evaluation
   - Topics: Evaluation / RAG
   - PDF: https://arxiv.org/pdf/2606.11949v1
   - Abstract: We present an online monitoring system for distributional shift in deployed safety classifiers, using calibrated sequential statistics to detect when a classifier has moved out of distribution. Upon detection, a conformal abstention layer adapts decision thresholds to recover a target error rate epsilon=0.1. In a pre-registered factorial evaluation (4 classifiers x 5 shift conditions x 20 seeds x 2 window sizes, 800 cells), the system achieves 86.6% valid detection (693/800, 95% CI [84.1%, 88.8%]) with mean latency of 39.5 steps. Detection holds across three ground-truth regimes: synthetic onset (86.6%), real temporal jailbreaks (85%, 17/20), and GCG adversarial attacks. Weighted conformal prediction recovers up to 39 pp of lost coverage for DeBERTa (ESS=46/300) but collapses for all other classifiers (ESS~300): logistic density ratio estimation achieves perfect source/target separability in high-dimensional embedding spaces, clipping all importance weights to the floor. DeBERTa shows a gradient from effective correction (paraphrase, ESS=46) to near-total collapse (adversarial suffix, ESS=206). PCA to 32 dimensions breaks the collapse, recovering 33 pp for Llama Guard and 21 pp for ShieldGemma. Variance decomposition reveals classifier (eta^2=0.243), shift type (eta^2=0.237), and their interaction (eta^2=0.185) all contribute substantially to detection latency variance (all p<0.001), indicating per-classifier monitoring profiles are necessary.

4. [External Experience Serving in Production LLM Systems: A Deployment-Oriented Study of Quality-Cost Trade-offs](https://arxiv.org/abs/2606.11806v1)
   - Published: 2026-06-10 16:38
   - Authors: Lin Sun, Heming Zhang, Xiangzheng Zhang
   - Source: arxiv
   - Relevance score: 38
   - Match reasons: summary matched "prompt injection"; has PDF; has rich summary; has complete metadata
   - Categories: cs.CL
   - Tags: Application
   - Topics: LLM / Prompt Injection
   - PDF: https://arxiv.org/pdf/2606.11806v1
   - Abstract: Production LLM systems accumulate reusable operational experience, but the practical deployment issue is not merely whether such experience can help. It is how different serving strategies trade off quality against online cost under realistic constraints. Injecting external experience can improve task quality, yet it also increases prompt burden, latency, and serving pressure. We study *external experience serving* as a deployment-oriented quality-cost trade-off problem. We evaluate this question in a real production moderation setting, with tool-use and GPQA as supporting contrast tasks that expose different output-cost regimes. We compare no-experience baselines, random experience controls, global prompt injection, and retrieval-based selective injection, and analyze both task quality and serving cost. The results show that, once experience becomes case-dependent, selective retrieval provides a stronger operating point than unconditional global injection. They further show that retrieval quality matters more than simply increasing Top-K, and that the same serving policy can exhibit substantially different cost-benefit profiles across short-output and decode-heavy regimes. These findings suggest that external experience is best treated as a selective, cost-aware serving decision rather than as a universal add-on. Overall, in the settings studied here, external experience pays off only when both the serving interface and the task-specific cost structure make its quality gains worth the online cost.
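
The headline statistic in item 3 above (693/800 valid detections with a 95% CI of [84.1%, 88.8%]) can be sanity-checked with a few lines of arithmetic. The abstract does not say which interval method was used; the sketch below assumes a Wilson score interval with z = 1.96, which happens to reproduce the reported bounds. The function name `wilson_interval` is illustrative.

```python
import math

def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return center - half, center + half

# Counts taken from the abstract: 693 valid detections out of 800 cells.
low, high = wilson_interval(693, 800)
print(f"{693 / 800:.1%} [{low:.1%}, {high:.1%}]")  # 86.6% [84.1%, 88.8%]
```

A plain normal-approximation interval on the same counts gives slightly different bounds, so the match here is an observation about the arithmetic rather than a statement about the author's procedure.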

## Terminal and SWE Agents Observations

### Group at a Glance

- "PROJECTMEM: A Local-First, Event-Sourced Memory and Judgment Layer for AI Coding Agents" [Application / Method]: AI coding assistants now support a growing share of software work, from quick scripts to production applications. Yet these agents remain largely stateless: ea…
- "Exploration Structure in LLM Agents for Multi-File Change Localization" [Evaluation / Method]: Software engineering tools increasingly rely on LLM-based agents to localize files to change to resolve a software issue. Most AI agents explore repositories l…
- "Agents All the Way Down; A Methodology for Building Custom AI Agents from Substrate to Production" [Application / Method]: Custom AI agents are agents that live inside their own application, talk to their own data and tools, enforce their own security boundaries, and carry their own…

### Paper Overview

1. [PROJECTMEM: A Local-First, Event-Sourced Memory and Judgment Layer for AI Coding Agents](https://arxiv.org/abs/2606.12329v1)
   - Published: 2026-06-11 01:02
   - Authors: Ripon Chandra Malo, Tong Qiu
   - Source: arxiv
   - Relevance score: 69
   - Match reasons: title matched "coding agent"; has PDF; has rich summary; has complete metadata
   - Categories: cs.AI
   - Tags: Application / Method
   - Topics: Agent / RAG
   - PDF: https://arxiv.org/pdf/2606.12329v1
   - Abstract: AI coding assistants now support a growing share of software work, from quick scripts to production applications. Yet these agents remain largely stateless: each new session re-reads project files, re-derives prior decisions, and - most costly - may repeat debugging attempts that already failed. Reconstructing this context can consume an estimated 5,000-20,000 tokens per session; the bottleneck is often not model capability but missing project memory. We present projectmem, an open-source, local-first memory and judgment layer for AI coding agents. projectmem records development as an append-only, plain-text event log of typed events - issues, attempts, fixes, decisions, and notes - and deterministically projects that log into compact, AI-readable summaries served through the Model Context Protocol (MCP). Beyond storage, projectmem adds a deterministic pre-action gate that warns an agent before it repeats a previously failed fix or edits a known-fragile file. We frame this as Memory-as-Governance: memory that does not merely answer the agent but acts on its next action. The system runs fully offline with no telemetry; its immutable log also serves as a provenance trail for reproducible, auditable AI-assisted development. projectmem ships as a three-dependency Python package (14 MCP tools, 19 CLI commands, 37 automated tests) and is evaluated through a two-month self-study across 10 projects comprising 207 logged events. Source code: https://github.com/riponcm/projectmem.

2. [Exploration Structure in LLM Agents for Multi-File Change Localization](https://arxiv.org/abs/2606.11976v1)
   - Published: 2026-06-10 19:54
   - Authors: Akeela Darryl Fattha, Kia Ying Chua, Lingxiao Jiang, Laura Wynter
   - Source: arxiv
   - Relevance score: 59
   - Match reasons: summary matched "SWE-bench"; summary matched "SWE bench"; has PDF; has rich summary
   - Categories: cs.SE, cs.AI
   - Tags: Evaluation / Method
   - Topics: LLM / Language Model
   - PDF: https://arxiv.org/pdf/2606.11976v1
   - Abstract: Software engineering tools increasingly rely on LLM-based agents to localize files to change to resolve a software issue. Most AI agents explore repositories linearly, that is, visiting one directory or file per step. We postulate that this is a structural mismatch for changes that span several subsystems. We compare linear sequential exploration against non-linear, domain-scoped parallel agentic exploration. Using SWE Bench Pro as initial benchmark, we focus on ansible as an exemplar. We construct an approach for persistent-session evaluation of GitHub issues anchored at a single base commit. We compare our non-linear domain-agent file traversal system against a base LLM without direct repository access, a single agent Recursive Language Model (RLM) baseline with a persistent Python REPL and an external CLI baseline using Codex 5.5 High. Domain scoped parallel agent spawning with a small Haiku-class model achieves the highest micro F1 among Haiku class models by a large margin. Domain-agents is the second highest behind only the much larger Codex 5.5 High on our own expanded benchmark including over more recent PRs from 2025 and 2026. On the original, curated, 2020 SWE-bench Pro benchmark, a larger Sonnet plain LLM baseline attains higher micro F1 by predicting few files, leading to higher precision, but at significantly lower all gold recall. We also present three additional findings. First, documentation evolution is a latent dependency unresolved by any approach. Second, naive file system access can degrade localization driven by test-file over prediction. Lastly, forced multi-agent consultation does not measurably help and raises token cost substantially.

3. [Agents All the Way Down; A Methodology for Building Custom AI Agents from Substrate to Production](https://arxiv.org/abs/2606.11869v1)
   - Published: 2026-06-10 17:44
   - Authors: Marc Alier Forment, Juanan Pereira, Francisco José García-Peñalvo, María José Casañ Guerrero
   - Source: arxiv
   - Relevance score: 39
   - Match reasons: summary matched "code agent"; has PDF; has rich summary; has complete metadata
   - Categories: cs.SE, cs.AI
   - Tags: Application / Method
   - Topics: LLM / Agent
   - PDF: https://arxiv.org/pdf/2606.11869v1
   - Abstract: Custom AI agents are agents that live inside their own application, talk to their own data and tools, enforce their own security boundaries, and carry their own brand and audit trail. What separates them from the general-purpose tier is fit, not capability: each is built for one job, by the engineer who will maintain it. No published practice sets out how to build one end to end. The pieces are everywhere (function-calling APIs, the Model Context Protocol, code agents to pair with), but the practice that chains them lives in podcasts, blogs, and leaked system prompts. This paper writes that practice down as a methodology, Agents All the Way Down: two preconditions crossed once and kept, then three practices repeated for the agent's life. The preconditions are (P1) Substrate, the LLM as a software component, framed as tools, then system, then messages under prompt-caching; and (P2) Building blocks: function calling, MCP, CLI orchestration, the liteshell pattern, the agent loop, skills, characters, hooks, and scaffolding. The practices are (P3) prototype with a general-purpose agent; (P4) harvest, fold, and ship the result as a CLI, the Turtle pattern; and (P5) agent-tests-agent, in which a general-purpose agent drives it through behavioural scenarios, a complement to classical testing, not a replacement. The working loop is P3 to P4 to P5 and back, and one corollary falls out for free: multi-agent orchestration is just CLI composition. The methodology is framework-free by construction. It was distilled from the AAC, a custom agent for the open-source LAMB platform, built in about ten days by one developer with an AI pair-programmer and in production. We present it as a transferable practice, independent of any language or framework.
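
Item 1 above (PROJECTMEM) describes an append-only log of typed events, a deterministic projection of that log, and a pre-action gate that warns before a failed fix is repeated. A minimal sketch of that pattern is shown here. Every name in it (`append_event`, `failed_fixes`, `pre_action_gate`, and the event fields `issue`, `change`, `outcome`) is hypothetical and chosen for illustration; none of it is taken from the projectmem package itself.

```python
import json

def append_event(log_lines, kind, **fields):
    """Append one typed event as a JSON line; existing lines are never edited."""
    log_lines.append(json.dumps({"kind": kind, **fields}, sort_keys=True))

def failed_fixes(log_lines):
    """Deterministically project the log into {issue: [changes that failed]}."""
    failed = {}
    for line in log_lines:
        event = json.loads(line)
        if event["kind"] == "attempt" and event.get("outcome") == "failed":
            failed.setdefault(event["issue"], []).append(event["change"])
    return failed

def pre_action_gate(log_lines, issue, proposed_change):
    """Return a warning string if the proposed change already failed, else None."""
    if proposed_change in failed_fixes(log_lines).get(issue, []):
        return f"warning: '{proposed_change}' already failed for {issue}"
    return None

# Hypothetical session: one issue, one failed attempt.
log = []
append_event(log, "issue", issue="ISSUE-1", title="flaky integration test")
append_event(log, "attempt", issue="ISSUE-1", change="increase timeout", outcome="failed")

print(pre_action_gate(log, "ISSUE-1", "increase timeout"))
print(pre_action_gate(log, "ISSUE-1", "mock the clock"))
```

Because the projection is a pure function of the log, replaying the same lines always yields the same warnings, which is the property that makes such a log usable as an audit trail.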