# Daily Paper Digest

- Generated: 2026-06-23 13:10:02 (Asia/Shanghai)
- Search window: last 24 hours
- Hit overview: LM=15, Agent Runtime Security=3, Terminal and SWE Agents=1
- Ranking strategy: hybrid (relevance first, published_at tie-break)
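
A minimal sketch of what a "relevance first, published_at tie-break" ordering could look like. The `Paper` type and `hybrid_sort` function are hypothetical names, not taken from the digest generator, and the newest-first direction of the tie-break is an assumption that happens to agree with the order of the tied entries listed in this digest.

```python
from datetime import datetime
from typing import NamedTuple


class Paper(NamedTuple):
    # Hypothetical record; field names are illustrative only.
    title: str
    relevance: int
    published_at: datetime


def hybrid_sort(papers):
    """Order by relevance (descending), then by published_at (newest first)."""
    return sorted(papers, key=lambda p: (p.relevance, p.published_at), reverse=True)


# Three entries from this digest, with shortened titles.
papers = [
    Paper("POTracker", 145, datetime(2026, 6, 23, 0, 12)),
    Paper("Randomized YaRN", 142, datetime(2026, 6, 23, 1, 59)),
    Paper("Evaluation Awareness", 145, datetime(2026, 6, 23, 0, 48)),
]
ranked_titles = [p.title for p in hybrid_sort(papers)]
print(ranked_titles)
```

Sorting on a single tuple key keeps the two criteria in one pass; the two papers tied at 145 are separated only by their publication time.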

## Today's Highlights

- Topic "Language Model": 14 hits, covering LM, Agent Runtime Security, and others; representative papers include "AIR: Adaptive Interleaved Reasoning with Code in MLLMs" and "TriggerBench: Investigating Prospective Memory for Large Language Models".
- Topic "LLM": 14 hits, covering LM and Agent Runtime Security; representative papers include "AIR: Adaptive Interleaved Reasoning with Code in MLLMs" and "TriggerBench: Investigating Prospective Memory for Large Language Models".
- Topic "Benchmark": 6 hits, covering LM and Terminal and SWE Agents; representative papers include "Evaluation Awareness Is Not One Capability: Evidence from Open Language Models" and "MAS-PromptBench: When Does Prompt Optimization Improve Multi-Agent LLM Systems?".
- Topic "Agent": 2 hits, covering LM and Agent Runtime Security; representative papers include "EnterpriseClawBench: Benchmarking Agents from Real Workplace Sessions" and "Capable but Careless: Do Computer-Use Agents Follow Contextual Integrity?".
- Topic "RAG": 2 hits, covering Agent Runtime Security; representative papers include "Capable but Careless: Do Computer-Use Agents Follow Contextual Integrity?" and "TROPT: An Open Framework for Unifying and Advancing Discrete Text Optimization".

## Section Status

- LM: 15 papers
- Agent Runtime Security: 3 papers
- Terminal and SWE Agents: 1 paper

## Topic Focus

### Language Model

- Hits: 14
- Groups covered: LM, Agent Runtime Security, Terminal and SWE Agents
- Representative papers: "AIR: Adaptive Interleaved Reasoning with Code in MLLMs", "TriggerBench: Investigating Prospective Memory for Large Language Models", "Can LLMs Reliably Self-Report Adversarial Prefills, and How?"
- Quick read:
  - "AIR: Adaptive Interleaved Reasoning with Code in MLLMs" [Evaluation / Data / Application / Method]: Following the paradigm shift initiated by OpenAI o3, interleaved reasoning with code to enhance multimodal large language models (MLLMs) has become a pivotal r…
  - "TriggerBench: Investigating Prospective Memory for Large Language Models" [Evaluation / Application / Method]: While Large Language Models (LLMs) are increasingly deployed in long interactions, existing evaluations focus predominantly on retrospective memory (RM) via ex…

### LLM

- Hits: 14
- Groups covered: LM, Agent Runtime Security
- Representative papers: "AIR: Adaptive Interleaved Reasoning with Code in MLLMs", "TriggerBench: Investigating Prospective Memory for Large Language Models", "Can LLMs Reliably Self-Report Adversarial Prefills, and How?"
- Quick read:
  - "AIR: Adaptive Interleaved Reasoning with Code in MLLMs" [Evaluation / Data / Application / Method]: Following the paradigm shift initiated by OpenAI o3, interleaved reasoning with code to enhance multimodal large language models (MLLMs) has become a pivotal r…
  - "TriggerBench: Investigating Prospective Memory for Large Language Models" [Evaluation / Application / Method]: While Large Language Models (LLMs) are increasingly deployed in long interactions, existing evaluations focus predominantly on retrospective memory (RM) via ex…

### Benchmark

- Hits: 6
- Groups covered: LM, Terminal and SWE Agents
- Representative papers: "Evaluation Awareness Is Not One Capability: Evidence from Open Language Models", "MAS-PromptBench: When Does Prompt Optimization Improve Multi-Agent LLM Systems?", "EnterpriseClawBench: Benchmarking Agents from Real Workplace Sessions"
- Quick read:
  - "Evaluation Awareness Is Not One Capability: Evidence from Open Language Models" [Evaluation / Application / Method]: Safety benchmarks assume that test-condition behavior predicts deployment behavior, an assumption that fails if models detect evaluation cues and adapt. This o…
  - "MAS-PromptBench: When Does Prompt Optimization Improve Multi-Agent LLM Systems?" [Evaluation / Application / Method]: Multi-agent systems (MAS) offer a scalable path forward for agentic AI, comprising multiple LLM-based agents, each assigned a system prompt and a position with…

### Agent

- Hits: 2
- Groups covered: LM, Agent Runtime Security
- Representative papers: "EnterpriseClawBench: Benchmarking Agents from Real Workplace Sessions", "Capable but Careless: Do Computer-Use Agents Follow Contextual Integrity?"
- Quick read:
  - "EnterpriseClawBench: Benchmarking Agents from Real Workplace Sessions" [Evaluation / Method]: Enterprise agents increasingly operate inside workspaces: they read heterogeneous files, invoke tools, and deliver business artifacts. We introduce EnterpriseC…
  - "Capable but Careless: Do Computer-Use Agents Follow Contextual Integrity?" [Evaluation / Application / Method]: Computer-use agents (CUAs) now act on a user's behalf across personal applications such as email, calendars, and to-do lists. This cross-application access is…

### RAG

- Hits: 2
- Groups covered: Agent Runtime Security
- Representative papers: "Capable but Careless: Do Computer-Use Agents Follow Contextual Integrity?", "TROPT: An Open Framework for Unifying and Advancing Discrete Text Optimization"
- Quick read:
  - "Capable but Careless: Do Computer-Use Agents Follow Contextual Integrity?" [Evaluation / Application / Method]: Computer-use agents (CUAs) now act on a user's behalf across personal applications such as email, calendars, and to-do lists. This cross-application access is…
  - "TROPT: An Open Framework for Unifying and Advancing Discrete Text Optimization" [Data / Application / Method]: Discrete text-trigger optimization -- searching for text sequences that, when ingested by a model, steer it toward a specified objective -- underpins model red…

## LM Watch

### Group at a Glance

- "AIR: Adaptive Interleaved Reasoning with Code in MLLMs" [Evaluation / Data / Application / Method]: Following the paradigm shift initiated by OpenAI o3, interleaved reasoning with code to enhance multimodal large language models (MLLMs) has become a pivotal r…
- "TriggerBench: Investigating Prospective Memory for Large Language Models" [Evaluation / Application / Method]: While Large Language Models (LLMs) are increasingly deployed in long interactions, existing evaluations focus predominantly on retrospective memory (RM) via ex…
- "Can LLMs Reliably Self-Report Adversarial Prefills, and How?" [Evaluation / Method]: Prior work shows that large language models (LLMs) exhibit introspective capability on benign tasks. We extend the question to safety contexts and examine how…
- "Evaluation Awareness Is Not One Capability: Evidence from Open Language Models" [Evaluation / Application / Method]: Safety benchmarks assume that test-condition behavior predicts deployment behavior, an assumption that fails if models detect evaluation cues and adapt. This o…
- "POTracker: Optimizing Large Language Models for Standard-Compliant Power Outage Report Generation" [Data / Method]: Recent large language models (LLMs) are good at general text generation, but it is still hard to use them for domain-specific data generation because the outpu…

### Paper Overview

1. [AIR: Adaptive Interleaved Reasoning with Code in MLLMs](https://arxiv.org/abs/2606.23678v1)
   - Published: 2026-06-23 01:58
   - Authors: Cong Han, Xiaohan Lan, Haibo Qiu, Yujie Zhong
   - Source: arxiv
   - Relevance score: 200
   - Match reasons: title matched "LLM"; title matched "reasoning"; summary matched "language model"; summary matched "large language model"
   - Categories: cs.CV, cs.AI
   - Tags: Evaluation / Data / Application / Method
   - Topic terms: LLM / Language Model
   - PDF: https://arxiv.org/pdf/2606.23678v1
   - Abstract: Following the paradigm shift initiated by OpenAI o3, interleaved reasoning with code to enhance multimodal large language models (MLLMs) has become a pivotal research frontier. The existing literature focuses primarily on tool-use within vision-perception tasks. However, such approaches typically rely on predefined heuristics for visual manipulation and are inherently incapable of addressing numerical computation problems due to their exclusive focus on visual operations. This paper empowers MLLMs with adaptive interleaved reasoning capabilities through extended reinforcement learning training on code-augmented complex numerical computation tasks. To this end, we propose a comprehensive three-component solution consisting of: a two-stage cold-start data construction pipeline, data filtering strategies for RL dataset curation, and an adaptive tool-invocation strategy leveraging a group-constrained reward function for interleaved reasoning trajectories. Extensive experiments demonstrate that after Reinforcement Learning training with the group-constrained reward function, performance improves by an average of 6.1 percentage points (pp) on evaluation benchmarks. Specifically, the accuracy for interleaved reasoning samples increases by 9.9 pp, and the overall success rate of tool-use exceeds 95%. Our data and code are available at: https://github.com/CongHan0808/AIR.git.

2. [TriggerBench: Investigating Prospective Memory for Large Language Models](https://arxiv.org/abs/2606.23459v1)
   - Published: 2026-06-22 23:07
   - Authors: Tianhua Zhang, Xinjiang Wang, Qianxi Zhang, Qi Chen, Kun Li, Yaoqi Chen et al.
   - Source: arxiv
   - Relevance score: 197
   - Match reasons: title matched "language model"; title matched "large language model"; summary matched "LLM"; summary matched "reasoning"
   - Categories: cs.CL
   - Tags: Evaluation / Application / Method
   - Topic terms: LLM / Language Model
   - PDF: https://arxiv.org/pdf/2606.23459v1
   - Abstract: While Large Language Models (LLMs) are increasingly deployed in long interactions, existing evaluations focus predominantly on retrospective memory (RM) via explicit queries. Prospective memory (PM), the critical ability to spontaneously recall and act on latent constraints without direct prompts, remains largely unevaluated. We introduce TriggerBench, a comprehensive PM benchmark spanning five dimensions across both daily assistants and professional workflows. TriggerBench pairs scenarios with matched RM controls, contrastive positive/negative variants, and overloaded triggers, enabling fine-grained measurement of proactive recall, false-alarm rate, and attentional robustness under a single protocol. Our evaluation yields three key findings. (i) PM shows a precision-recall trade-off and attentional fragility. Though enhanced reasoning significantly improves proactive recall, models may overfit to an "always-remind" heuristic. Furthermore, PM accuracy degrades substantially under implicit constraints or triggers overloaded by concurrent user requests, indicating that robust PM remains an open challenge. (ii) PM is notably harder than RM: on identical contexts, RM near-saturates up to 100K tokens, while PM decays sharply as context length scales. (iii) PM may serve as a behavioral probe of spare reasoning capacity. Pairing PM scenarios with AIME-2025 math problems reveals that successful trajectories yield higher PM accuracy than failed ones at the same context length, showing PM tracks spare reasoning budget that token count obscures. Project page: https://github.com/KristenZHANG/TriggerBench-Official.

3. [Can LLMs Reliably Self-Report Adversarial Prefills, and How?](https://arxiv.org/abs/2606.23671v1)
   - Published: 2026-06-23 01:56
   - Authors: Quang Minh Nguyen, Uzair Ahmed, Taegyoon Kim
   - Source: arxiv
   - Relevance score: 160
   - Match reasons: title matched "LLM"; summary matched "language model"; summary matched "large language model"; summary matched "reasoning"
   - Categories: cs.CL
   - Tags: Evaluation / Method
   - Topic terms: LLM / Language Model
   - PDF: https://arxiv.org/pdf/2606.23671v1
   - Abstract: Prior work shows that large language models (LLMs) exhibit introspective capability on benign tasks. We extend the question to safety contexts and examine how reliably a model can recognize that its own prior response was elicited by an adversarial prefill attack. Across ten open-weight instruction-tuned LLMs (3B to 70B) and four safety benchmarks, no model reliably recognizes its own compromised outputs, with models claiming intent on prefilled responses at an average rate of $27.3\%$. Introspective signal stems largely from safety- and refusal-related reasoning. Orthogonalizing models' weights against the refusal direction collapses the gap between claiming rates on prefilled and natural outputs to near zero, though the direction is not its unique mediator. The signal is also probe-dependent: framing the question as internal intention versus external tampering elicits qualitatively different responses on the same models. We test three LoRA finetuning methods (SFT, GRPO, DPO) on eight models from 3B to 27B; all three widen the intention-probe gap on every model from 8B to 27B, with method ranking varying by model. The intervention does not transfer to the tampering probe and counterintuitively raises attack success rate under adversarial prefill on most models, amounting to a partial mitigation. These findings outline mechanisms underpinning the observed introspective signals in safety contexts and highlight risks in the reliability of LLM self-reports.

4. [Evaluation Awareness Is Not One Capability: Evidence from Open Language Models](https://arxiv.org/abs/2606.23583v1)
   - Published: 2026-06-23 00:48
   - Authors: Nilesh Nayan, Aishwarya Sampath Kumar, Rishiraj Girmal, Shivani Anilkumar, Sankaran Vaidyanathan, David A. Nader Palacio et al.
   - Source: arxiv
   - Relevance score: 145
   - Match reasons: title matched "language model"; title matched "evaluation"; summary matched "instruction tuning"; summary matched "benchmark"
   - Categories: cs.CL
   - Tags: Evaluation / Application / Method
   - Topic terms: Language Model / Benchmark
   - PDF: https://arxiv.org/pdf/2606.23583v1
   - Abstract: Safety benchmarks assume that test-condition behavior predicts deployment behavior, an assumption that fails if models detect evaluation cues and adapt. This opens a gap between benchmark performance and deployment behavior: compliance measured under test conditions becomes an optimistic upper bound that overstates how safely a model behaves once the evaluation harness is removed. We characterize this evaluation awareness through eight experiments across 37 open-weight models and seven families. (i) Detection is moderate and training-driven (24/37 models exceed chance, best AUROC 0.714 vs. 0.819 human, with instruction tuning dominating over scale). (ii) Detection shifts safety behavior (hard refusal drops 5.8 percentage points under hypothetical framing, and 21/140 HarmBench framing effects are significant, with compliance rising up to +30 percentage points). (iii) Representations survive behavioral collapse (probes retain AUROC 0.98 under rewrites that drive behavior below chance, and multi-layer steering causally moves three downstream tasks while random controls do not). (iv) These axes are weakly coupled (only 1/15 correlations are significant, the sole robust link being behavioral detection versus framing resistance, $ρ=-0.79$, $p<0.001$). We call this gap the benchmark illusion: because detectability, behavioral manifestation, and controllability vary independently, it is multivariate rather than a single number, so no single awareness score is a reliable proxy for deployment safety.

5. [POTracker: Optimizing Large Language Models for Standard-Compliant Power Outage Report Generation](https://arxiv.org/abs/2606.23533v1)
   - Published: 2026-06-23 00:12
   - Authors: Hung Phan, Aniroop Naladala, Dubey Avanindra, Supryia Chinthavali, Lunga Dalton, Ali Jannesari
   - Source: arxiv
   - Relevance score: 145
   - Match reasons: title matched "language model"; title matched "large language model"; summary matched "LLM"; summary matched "RAG"
   - Categories: cs.AI
   - Tags: Data / Method
   - Topic terms: LLM / Language Model
   - PDF: https://arxiv.org/pdf/2606.23533v1
   - Abstract: Recent large language models (LLMs) are good at general text generation, but it is still hard to use them for domain-specific data generation because the output must follow strict formatting and structural rules. Unlike open-ended tasks such as question answering or translation, domain-specific generation must be both semantically correct and compliant with existing guidelines and standards. In this work, we study the nationwide interoperability problem of utility power outage reports in the United States. In practice, outage reports need to be machine-readable (e.g., JSON or XML) and must strictly follow requirements from energy-sector regulatory bodies. To address this problem, we propose POTracker, an optimized LLM for power outage report generation. We fine-tune Qwen2.5-7B-Instruct using our proposed objective. The key contribution is a new loss function, POTrackerLoss, that considers both textual similarity and structural (tag) similarity between the generated report and the ground-truth report. We evaluate POTracker on a dataset of 1,000 power outage reports and compare it with five well-known fine-tuning methods and one rule-based XML conversion method. Results show that POTracker outperforms other fine-tuning approaches, improving overall accuracy by up to 51% and reaching 86.47% structural accuracy for generated power outage reports. In addition, we conduct a human study to assess the quality of the ground-truth standard reports, where domain experts assign the generated labels an average score of 4.03 on a 0-5 scale.

6. [Randomized YaRN Improves Length Generalization for Long-Context Reasoning](https://arxiv.org/abs/2606.23687v1)
   - Published: 2026-06-23 01:59
   - Authors: Manas Mehta, Fangcong Yin, Greg Durrett
   - Source: arxiv
   - Relevance score: 142
   - Match reasons: title matched "reasoning"; summary matched "language model"; summary matched "large language model"; summary matched "LLM"
   - Categories: cs.CL
   - Tags: Evaluation / Method
   - Topic terms: LLM / Language Model
   - PDF: https://arxiv.org/pdf/2606.23687v1
   - Abstract: Large language models (LLMs) are typically pretrained on short sequences and then extended to work on longer sequences with additional training. However, such LLMs still struggle to further generalize to very long sequences. We propose Randomized YaRN, a training method that improves length generalization by combining YaRN-based positional extrapolation with randomized positional encoding and a length curriculum. During training on short context data, tokens are assigned YaRN positional encodings sampled from a larger position range, exposing the model to out-of-distribution positional representations even on short-context inputs. We evaluate Randomized YaRN on two challenging long-context reasoning benchmarks, BABILong and Multi-Round Coreference Resolution (MRCR). When training on data with <8K context, Randomized YaRN consistently improves reasoning performance on context lengths from 16K to 128K and outperforms standard fine-tuning, with the largest gains appearing at far out-of-distribution lengths. Our results suggest that progressively exposing models to OOD positional distributions provides an effective recipe for generalizable long-context reasoning.

7. [Teaching LLMs String Matching, Backtracking, and Error Recovery to Deduce Bases and Truth Tables for the Combinatorially Exploding Bit Manipulation Puzzles](https://arxiv.org/abs/2606.23672v1)
   - Published: 2026-06-23 01:57
   - Authors: Prateek Agnihotri, Sanchit Jain, Prabhat Agnihotri, Aditya Prasad, Shubham Jain
   - Source: arxiv
   - Relevance score: 142
   - Match reasons: title matched "LLM"; summary matched "language model"; summary matched "large language model"; summary matched "reasoning"
   - Categories: cs.AI
   - Tags: Application / Method
   - Topic terms: LLM / Language Model
   - PDF: https://arxiv.org/pdf/2606.23672v1
   - Abstract: This paper presents our algorithmic innovations for the NVIDIA Nemotron Model Reasoning Challenge, focusing on Bit Manipulation Puzzles. In this task, the objective is to discover a hidden logical rule transforming input binary strings to outputs, then apply it to unseen inputs. Large Language Models (LLMs) notoriously struggle here; traditional methods force them to simulate complex boolean logic and arithmetic, leading to hallucinations. Furthermore, the search space of bitwise operations (combinations of shifts, rotations, and logic gates) suffers from a severe combinatorial explosion. To overcome this computational intractability, we present a novel approach that abandons arithmetic logic entirely in favor of string similarity, structured search, and autonomous error recovery. Our core contributions are: 1. Bases and Truth Table Formulation: We reframe logic-gate deduction into a base-selection task, leveraging string similarity (minimal bit flips) to isolate primitive transformations ("bases") and deduce truth tables without complex arithmetic. 2. Backtracking DFS and Error Recovery: We formalize a search process that tests candidate bases, detects logical collisions across examples, and backtracks upon failure to perform robust error recovery. 3. Bit Tokenization and Interactive Reasoning SFT: We force the tokenizer to encode binary strings as individual single-bit tokens. We use dynamic masking to simulate external oracle feedback, training the model to hypothesize, self-evaluate, and backtrack natively. Evaluated on bit manipulation puzzles, our approach achieved over 96% validation accuracy. This represents the highest performance in this category, driving our 7th Place overall finish in the contest.

8. [Abstract representational geometry supports inference in large language models](https://arxiv.org/abs/2606.23345v1)
   - Published: 2026-06-22 21:50
   - Authors: Yunan Zeng, Yuwang Wang
   - Source: arxiv
   - Relevance score: 142
   - Match reasons: title matched "language model"; title matched "large language model"; summary matched "LLM"; summary matched "reasoning"
   - Categories: cs.AI
   - Tags: Evaluation / Application / Method
   - Topic terms: LLM / Language Model
   - PDF: https://arxiv.org/pdf/2606.23345v1
   - Abstract: A defining feature of human intelligence is the ability to adapt to changing environments by inferring latent task structure from sparse observations. Neuroscientific research indicates that this capability relies on the hippocampus constructing abstract representations, expressed as low-dimensional, approximately orthogonal manifolds in neural state space. However, the internal mechanisms of large language models (LLMs) remain largely opaque, making it unclear whether they form comparable abstract representations or instead rely on task-specific statistical regularities when performing comparable reasoning tasks. Here we adapt a contextual reversal-learning paradigm to a text-based setting and compare humans and LLMs at both the behavioural and representational levels. We report that although LLMs exhibit generalizable reasoning less frequently than humans, when such inference occurs, their internal states exhibit abstract geometric structures that resemble those reported in the hippocampus. Notably, this representational geometry is not uniformly distributed but is organized hierarchically across model depth: whereas lower layers show early, stable encoding of stimulus identity, higher layers form a hippocampal-like functional band enriched for abstract context geometry associated with inference. Furthermore, complementary intervention experiments mechanistically implicate geometry in reasoning: task-sequence language modelling induces geometric disentanglement, whereas geometric regularization of higher layers increases the emergence of generalizable inference. Together, these findings establish abstract representational geometry as a mechanistic principle supporting inference in large language models.

9. [Distribution-Aware Diffusion-LLM for Robust Ultra-Long-Term Time Series Forecasting](https://arxiv.org/abs/2606.23391v1)
   - Published: 2026-06-22 22:18
   - Authors: Falguni Ghosh, Vahid Hashemi, Bernhard Kainz
   - Source: arxiv
   - Relevance score: 139
   - Match reasons: title matched "LLM"; summary matched "language model"; summary matched "large language model"; summary matched "alignment"
   - Categories: cs.LG, cs.AI
   - Tags: Evaluation / Method
   - Topic terms: LLM / Language Model
   - PDF: https://arxiv.org/pdf/2606.23391v1
   - Abstract: Time series forecasting is a fundamental machine learning task. Recent work has explored Large Language Models (LLMs) for this purpose due to their strong generalization, pattern recognition, and zero-shot or few-shot capabilities. Despite their suitability for long-context learning, LLMs face challenges in multimodal settings: they lack calibrated probabilistic modeling for non-text data and struggle to align heterogeneous representations. To address these issues, we propose a new framework, Diffusion-LLM, that integrates a conditional diffusion model into an LLM-based forecasting pipeline. This joint design enables learning the conditional distribution of future data while improving semantic alignment in a shared latent space. We evaluate Diffusion-LLM on six long-term forecasting benchmarks, including ETT, Weather, and ECL. Our method consistently outperforms existing LLM-based baselines, achieving notable gains in ultra-long-term and few-shot forecasting and demonstrating the value of distribution-aware regularization for enhancing robustness and generalization in time series LLMs.

10. [MAS-PromptBench: When Does Prompt Optimization Improve Multi-Agent LLM Systems?](https://arxiv.org/abs/2606.23664v1)
   - Published: 2026-06-23 01:48
   - Authors: Juyang Bai, Laixi Shi
   - Source: arxiv
   - Relevance score: 128
   - Match reasons: title matched "LLM"; title matched "agent"; summary matched "benchmark"; has PDF
   - Categories: cs.LG, cs.MA
   - Tags: Evaluation / Application / Method
   - Topic terms: LLM / Benchmark
   - PDF: https://arxiv.org/pdf/2606.23664v1
   - Abstract: Multi-agent systems (MAS) offer a scalable path forward for agentic AI, comprising multiple LLM-based agents, each assigned a system prompt and a position within a workflow that governs inter-agent coordination and output aggregation. System prompts thus form a critical and accessible optimization surface: they specify agents' roles and behaviors, enabling system-level improvements without model finetuning. Although prompt optimization has shown substantial potential for single LLMs, extending it to MAS poses distinct challenges, notably an exponentially growing search space. It remains unclear whether, when, and by how much prompt optimization improves MAS performance, and how sensitive such gains are to system configuration. In this work, we systematically study system-prompt optimization across a broad range of MAS setups varying in task, workflow, communication protocol, and team size, benchmarking two prompt optimizers that naturally extend state-of-the-art single-agent methods. The results reveal its potential to unlock significant gains while exposing open challenges, characterizing when and how much prompt optimization helps across diverse MAS settings.

11. [EnterpriseClawBench: Benchmarking Agents from Real Workplace Sessions](https://arxiv.org/abs/2606.23654v1)
   - Published: 2026-06-23 01:39
   - Authors: Jincheng Zhong, Weizhi Wang, Che Jiang, Kai Tian, Zhenzhao Yuan, Junlin Yang et al.
   - Source: arxiv
   - Relevance score: 128
   - Match reasons: title matched "agent"; title matched "benchmark"; summary matched "evaluation"; has PDF
   - Categories: cs.CL, cs.SE
   - Tags: Evaluation / Method
   - Topic terms: Benchmark / Agent
   - PDF: https://arxiv.org/pdf/2606.23654v1
   - Abstract: Enterprise agents increasingly operate inside workspaces: they read heterogeneous files, invoke tools, and deliver business artifacts. We introduce EnterpriseClawBench, an enterprise agent benchmark constructed from proprietary, real-world agent sessions. Starting from a large archive of workplace sessions, the EnterpriseClawBench produces 852 reproducible tasks, each paired with recovered fixtures, rewritten prompts, role classes, skill subclasses, hard rules, and semantic rubrics. Because the sessions contain internal enterprise content, we do not release the benchmark data; instead, our reusable contribution is the construction and evaluation protocol. On EnterpriseClawBench, the best configuration reaches only 0.663 (Codex with GPT-5.5). These results show that enterprise agent evaluation must report harness-model combinations, artifact delivery, visual quality, cost, runtime, and skill-transfer behavior, rather than collapsing performance into a single score. Code: https://github.com/FrontisAI/EnterpriseClawBench

12. [SVD-Surgeon: Optimal Singular-Value Surgery for Large Language Model Compression](https://arxiv.org/abs/2606.23568v1)
   - Published: 2026-06-23 00:33
   - Authors: Mahmoud Safari, Frank Hutter
   - Source: arxiv
   - Relevance score: 127
   - Match reasons: title matched "language model"; title matched "large language model"; summary matched "LLM"; has PDF
   - Categories: cs.LG, cs.CL
   - Tags: Application / Method
   - Topic terms: LLM / Language Model
   - PDF: https://arxiv.org/pdf/2606.23568v1
   - Abstract: Large language models (LLMs) achieve remarkable performance across a wide range of tasks, but their deployment is constrained by substantial memory and compute requirements. Low-rank compression via singular value decomposition (SVD) is an effective remedy, but existing methods focus on how to factorize and which components to keep. We introduce SVD-Surgeon, a training-free method that brings the Optimal Brain Surgeon (OBS) framework to the singular-value basis. Treating each singular value as a parameter, it computes a closed-form update of the retained singular values that compensates, to second order in the model loss, for those removed by truncation. The same analysis yields a saliency for choosing which values to prune. As it operates directly on the singular-value factorization, SVD-Surgeon can be layered on top of existing SVD compressors. Applied to SVD-LLM, a leading SVD-based method, it improves the perplexity-compression trade-off on the OPT family and LLaMA 2-7B without any retraining.

13. [Self-Compacting Language Model Agents](https://arxiv.org/abs/2606.23525v1)
   - Published: 2026-06-23 00:08
   - Authors: Tianjian Li, Jingyu Zhang, William Jurayj, Xi Wang, Chuanyang Jin, Mehrdad Farajtabar et al.
   - Source: arxiv
   - Relevance score: 126
   - Match reasons: title matched "language model"; title matched "agent"; summary matched "benchmark"; has PDF
   - Categories: cs.CL
   - Tags: Evaluation / Application / Method
   - Topic terms: Language Model / Benchmark
   - PDF: https://arxiv.org/pdf/2606.23525v1
   - Abstract: Long agent traces composed of chains of thought and tool calls accumulate stale content that anchors subsequent generations, and eventually outgrow the context window. Existing scaffolds mitigate it with fixed-interval compaction triggered at a token threshold. Such triggers pay no heed to trajectory structure, risking discard of partial results mid-derivation or mid-search. We propose SelfCompact, a scaffold that allows the model itself to decide when and how to compact. Specifically, it pairs two inference-time elements: (i) a compaction tool the model invokes to summarize the accumulated context, and (ii) a lightweight rubric specifying when to fire (a sub-task has resolved, or the trajectory is converging) and when to suppress (mid-derivation, or when stuck). Both are needed. The tool alone is unevenly used across open-weight models, often invoked at unhelpful moments or not at all; the rubric alone cannot act. Together, they elicit effective adaptive compaction without any fine-tuning or external supervision. We present empirical results on six benchmarks (competitive math and agentic search) and seven models. Our results show that SelfCompact matches or exceeds fixed-interval summarization at a fraction of the token cost, improving over a no-summarization baseline by up to 18.1 points on math and 5-9 points on agentic search at 30-70% lower per-question cost. Our results expose a meta-cognitive gap: although unprompted models cannot reliably tell when their own context is rotting, a lightweight rubric closes this gap, reframing when to compact as a capability that scaffolds can supply without training.

14. [On the Limits of Prompt-Conditioned Language Models as General-Purpose Learners](https://arxiv.org/abs/2606.23668v1)
   - Published: 2026-06-23 01:52
   - Authors: David Mguni, Julian Ma, Jun Wang
   - Source: arxiv
   - Relevance score: 124
   - Match reasons: title matched "language model"; summary matched "large language model"; summary matched "LLM"; summary matched "alignment"
   - Categories: cs.LG
   - Tags: Method
   - Topic terms: LLM / Language Model
   - PDF: https://arxiv.org/pdf/2606.23668v1
   - Abstract: Large Language Models (LLMs) are frequently portrayed as general-purpose solvers capable of solving arbitrary tasks. We argue that this view overlooks a fundamental constraint: language is a compressed and capacity-limited interface for conveying task information. Modelling User-System interaction as a bilevel *cheap-talk* game, we analyse how latent tasks are encoded into prompts and reinterpreted under alignment and safety constraints. We introduce a conceptual decomposition separating task inference from execution and derive PAC-Bayes bounds that distinguish finite-sample estimation error from irreducible structural limitations. Our first main result establishes an *expressivity floor*: language acts as a capacity-limited communication channel, and whenever the informational complexity of a task family exceeds the capacity of that channel, distinct tasks become unavoidably indistinguishable to the Solver, inducing a strictly positive error floor that cannot be eliminated by additional data, optimisation, or model scaling alone. We then establish an *objective-misalignment floor*: when alignment constraints restrict the admissible output set, the User-ideal distribution may lie outside the feasible class, inducing an irreducible distortion. Together, these results yield a formal negative conclusion: prompt-conditioned LLMs are not universal problem solvers through prompting alone, as there exist task families for which correct behaviour is provably unattainable even in the infinite-data regime. More broadly, our analysis shows the limits of prompt-based generalisation arise from information-constrained communication and alignment-constrained objectives. This suggests that interfaces beyond natural language, including multimodal observations and external memory, may reduce the inherent LLM limitations by increasing the task-relevant information available to the System.

15. [Measuring & Mitigating Over-Alignment for LLMs in Multilingual Criminal Law Courts](https://arxiv.org/abs/2606.23375v1)
   - Published: 2026-06-22 22:08
   - Authors: Arthur Wuhrmann, Gaetan Stein, Daniel Brunner, Andrei Kucharavy
   - Source: arxiv
   - Relevance score: 124
   - Match reasons: title matched "LLM"; title matched "alignment"; summary matched "benchmark"; has PDF
   - Categories: cs.CL, cs.AI
   - Tags: Evaluation / Method
   - Topics: LLM / Benchmark
   - PDF: https://arxiv.org/pdf/2606.23375v1
   - Abstract: While the wider applicability of LLMs in the legal field is currently debated due to their reliability and the gravity of any errors, narrow uses with well-understood and mitigated risks have emerged. Notably, the Swiss Federal Supreme Court uses small on-premises models for tentative translations and short-passage summarization across the four official languages. However, such usage is challenging in the context of Criminal Law. Since the rulings and cases that employees routinely work on can contain detailed descriptions of violent and sexual offenses, their legitimate work is compromised by refusals and disclaimers due to the activation of model guardrails (over-alignment). To measure this phenomenon, we introduce TF-RefusalBench, a multilingual benchmark for criminal-law translation and summarization derived from public Swiss Supreme Court rulings. TF-RefusalBench contains 5,200 total prompts across French, German, Italian, and English, corresponding to common task prompts and passages likely to trigger refusal. We then use TF-RefusalBench to show that over-alignment is a multifaceted phenomenon, influenced by the model and the prompt and text languages being processed, and that its impact cannot be evaluated solely from an over-refusal perspective, given the disclaimer's impact on task faithfulness. Finally, we evaluate approaches to enable on-premises LLMs for Criminal Law Tasks, demonstrating that while prompting can be effective, abliteration (refusal-direction ablation) eliminates refusal with minimal impact on task performance.

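The *expressivity floor* in item 14 can be made concrete with a textbook Fano-style calculation. The sketch below is not the paper's PAC-Bayes bound; it assumes a uniform prior over `num_tasks` tasks and a prompt channel whose mutual information with the task is at most `capacity_nats`. The function name and both parameters are hypothetical, chosen only for illustration.

```python
import math


def fano_error_floor(num_tasks: int, capacity_nats: float) -> float:
    """Lower bound on the task-identification error probability.

    Fano's inequality for a uniform prior over M = num_tasks tasks:
        P_err >= 1 - (I(T; P) + ln 2) / ln M,
    and I(T; P) <= capacity_nats by assumption.
    The bound is clipped at 0 because it is vacuous when negative.
    """
    if num_tasks < 2:
        raise ValueError("need at least two tasks to confuse")
    bound = 1.0 - (capacity_nats + math.log(2)) / math.log(num_tasks)
    return max(0.0, bound)


# 1024 equally likely tasks, but the prompt carries at most ln(4) nats (2 bits).
floor = fano_error_floor(1024, math.log(4))
print(f"error floor: {floor:.3f}")
```

Under these assumptions the floor depends only on the task count and the channel capacity, which mirrors the abstract's claim that more data or a larger model alone does not remove it.
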
## Agent Runtime Security Observations

### Group at a Glance

- *Capable but Careless: Do Computer-Use Agents Follow Contextual Integrity?* (Evaluation / Application / Method): Computer-use agents (CUAs) now act on a user's behalf across personal applications such as email, calendars, and to-do lists. This cross-application access is…
- *TROPT: An Open Framework for Unifying and Advancing Discrete Text Optimization* (Data / Application / Method): Discrete text-trigger optimization -- searching for text sequences that, when ingested by a model, steer it toward a specified objective -- underpins model red…
- *GIF: Locally Sound Geometric Information Flow Control for LLMs* (Evaluation / Application / Method): Large language models increasingly mediate interactions between sensitive data, untrusted inputs, and privileged actions in agentic systems, creating security…

### Papers at a Glance

1. [Capable but Careless: Do Computer-Use Agents Follow Contextual Integrity?](https://arxiv.org/abs/2606.23189v1)
   - Published: 2026-06-22 19:36
   - Authors: Anmol Goel, Iryna Gurevych
   - Source: arxiv
   - Relevance score: 64
   - Match reasons: title matched "computer-use agent"; has PDF; has rich summary; has complete metadata
   - Categories: cs.AI, cs.CL
   - Tags: Evaluation / Application / Method
   - Topics: RAG / Agent
   - PDF: https://arxiv.org/pdf/2606.23189v1
   - Abstract: Computer-use agents (CUAs) now act on a user's behalf across personal applications such as email, calendars, and to-do lists. This cross-application access is useful, but it also creates a privacy risk that has been largely overlooked: when an agent works in one context, it can pull in information from another that is inappropriate in that context. Hence, we introduce AgentCIBench, an evaluation harness that turns this risk into executable, deterministically scored scenarios. We target three common failure modes in CUAs: visual co-location, where the agent pulls in prohibited items that sit next to the task target in the UI; task-ambiguity overshare, where the agent dumps dense personal state in response to an under-specified prompt; and recipient misalignment, where the agent sends content to an addressee for whom it is inappropriate. We evaluate 15 frontier agents and find a surprisingly high failure rate: 11 of 15 leak on more than 50% of scenarios, with an average leakage of 67.9%, and the same failures persist when agents act end-to-end in the environment to complete the task. We release AgentCIBench to encourage the development of safer computer-use agents and position contextual disclosure testing as a pre-deployment safety check.

2. [TROPT: An Open Framework for Unifying and Advancing Discrete Text Optimization](https://arxiv.org/abs/2606.23496v1)
   - Published: 2026-06-22 23:44
   - Authors: Matan Ben-Tov, Mahmood Sharif
   - Source: arxiv
   - Relevance score: 46
   - Match reasons: summary matched "jailbreak"; has PDF; has rich summary; has complete metadata
   - Categories: cs.LG, cs.CR
   - Tags: Data / Application / Method
   - Topics: LLM / RAG
   - PDF: https://arxiv.org/pdf/2606.23496v1
   - Abstract: Discrete text-trigger optimization (searching for text sequences that, when ingested by a model, steer it toward a specified objective) underpins model red-teaming (e.g., LLM jailbreaks), as well as auditing and interpretability. However, the current state of discrete optimizers hinders their adoption and progress. First, existing optimizers, when open-sourced at all, are scattered across research codebases tied to specific models, objectives, and problem domains. Second, optimizer variants proliferate, each requiring engineering overhead to use or extend, and remaining hard to compare head-to-head. Together, these raise the bar for adopting optimizers in existing or new domains, and for advancing them via new strategies. We address these gaps with TROPT, the first open-source framework that unifies discrete optimizers' execution and standardizes their development under a single interface. TROPT makes it easy to customize end-to-end optimization recipes by swapping any component (models, objectives, and optimizers), extending its reach across domains and new applications. TROPT currently ships with 30+ optimization recipes (covering applications such as jailbreaking and probing model internals) built from 15+ optimizers (spanning white-box to black-box access) and 15+ losses, from foundational to state-of-the-art methods. Demonstrating its utility, we leverage TROPT in several studies: (i) controlled, large-scale experiments comparing and enhancing optimization strategies for LLM jailbreaks, revealing potent-yet-underadopted techniques; and (ii) porting optimizers from one domain (e.g., LLM jailbreak) to new domains (e.g., corpus-poisoning embedding model). In all, TROPT significantly lowers the barrier to adopting and advancing discrete text optimization.

3. [GIF: Locally Sound Geometric Information Flow Control for LLMs](https://arxiv.org/abs/2606.23277v1)
   - Published: 2026-06-22 20:54
   - Authors: Adam Storek, Nikolaus Holzer, Zhuo Zhang, Suman Jana
   - Source: arxiv
   - Relevance score: 43
   - Match reasons: summary matched "prompt injection"; has PDF; has rich summary; has complete metadata
   - Categories: cs.AI
   - Tags: Evaluation / Application / Method
   - Topics: LLM / Language Model
   - PDF: https://arxiv.org/pdf/2606.23277v1
   - Abstract: Large language models increasingly mediate interactions between sensitive data, untrusted inputs, and privileged actions in agentic systems, creating security and privacy risks. These range from prompt injections that manipulate downstream tool use to leakage of confidential information through model outputs. Recent Information Flow Control (IFC)-based defenses show promise but lack a principled semantic foundation for reasoning about information flow through the model itself. Since any input token may influence any output token in an autoregressive LLM, existing approaches suffer from severe taint explosion. We present Geometric Information Flow (GIF), a semantic framework for tracking information flow from input tokens to outputs. GIF uses the LLM Jacobian and local output geometry to upper-bound the Shannon mutual information between perturbed input spans and model outputs, yielding a scalable measure computable on large models via automatic differentiation and low-rank approximation. Unlike attention-based or correlational attribution heuristics, GIF satisfies local geometric soundness, and we provide a fully mechanized Lean 4 proof that it upper-bounds the true information flow induced by a given prompt under local regularity assumptions. We evaluate GIF on integrity and confidentiality tasks across multiple prompt-injection and privacy-leakage benchmarks. GIF achieves near-perfect recall even without a downstream declassifier, outperforming attention-based baselines. Combined with lightweight LLM-based declassifiers, it matches or exceeds the F1 of direct LLM-as-judge baselines such as GPT-5.5 xhigh reasoning while using up to 81x lower token cost. GIF flows detected with small surrogate models transfer to larger state-of-the-art models and other model families, even when the surrogate is up to 200x smaller, suggesting black-box deployment without gradient access.

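To give a feel for the Jacobian-based flow measure described in the GIF abstract, here is a minimal sketch using the textbook linear-Gaussian channel formula, I = ½ log det(I + (σ_in² / σ_out²) J_S J_Sᵀ), where J_S holds the Jacobian columns of an input span S. This is not GIF's estimator: `toy_model`, `span_flow_nats`, `sigma_in`, and `sigma_out` are hypothetical names, and the finite-difference Jacobian stands in for the automatic differentiation and low-rank approximation the abstract mentions.

```python
import math


def toy_model(x):
    """Hypothetical stand-in for a model: 3 inputs, 2 outputs; x[2] is ignored."""
    return [math.tanh(x[0] + 2.0 * x[1]), 0.5 * x[1]]


def jacobian(f, x, eps=1e-6):
    """Central finite-difference Jacobian, jac[i][j] = d f_i / d x_j."""
    m = len(f(x))
    jac = [[0.0] * len(x) for _ in range(m)]
    for j in range(len(x)):
        hi, lo = list(x), list(x)
        hi[j] += eps
        lo[j] -= eps
        f_hi, f_lo = f(hi), f(lo)
        for i in range(m):
            jac[i][j] = (f_hi[i] - f_lo[i]) / (2 * eps)
    return jac


def log_det_spd(a):
    """Log-determinant of a symmetric positive-definite matrix via Cholesky."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = math.sqrt(s) if i == j else s / low[j][j]
    return 2.0 * sum(math.log(low[i][i]) for i in range(n))


def span_flow_nats(f, x, span, sigma_in=0.1, sigma_out=0.01):
    """Mutual information (nats) between a Gaussian perturbation of the
    inputs in `span` and the output of the locally linearized model with
    additive Gaussian output noise."""
    jac = jacobian(f, x)
    m = len(jac)
    ratio = (sigma_in / sigma_out) ** 2
    gram = [
        [(1.0 if r == c else 0.0)
         + ratio * sum(jac[r][j] * jac[c][j] for j in span)
         for c in range(m)]
        for r in range(m)
    ]
    return 0.5 * log_det_spd(gram)


x0 = [0.0, 0.0, 0.0]
for span in ([0], [1], [2]):
    print(span, round(span_flow_nats(toy_model, x0, span), 3))
```

In this toy setup the span containing the ignored input `x[2]` carries zero flow, while spans that actually move the output score higher. That is the kind of distinction a flow measure needs in order to avoid tainting every output with every input.
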
## Terminal and SWE Agents Observations

### Group at a Glance

- *Tmax: A simple recipe for terminal agents* (Evaluation / Data / Application / Method): Terminal-using agents have quickly become the most popular downstream application of language models (LMs). Despite their prevalence, relatively little academi…

### Papers at a Glance

1. [Tmax: A simple recipe for terminal agents](https://arxiv.org/abs/2606.23321v1)
   - Published: 2026-06-22 21:32
   - Authors: Hamish Ivison, Junjie Oscar Yin, Rulin Shao, Teng Xiao, Nathan Lambert, Hannaneh Hajishirzi
   - Source: arxiv
   - Relevance score: 84
   - Match reasons: title matched "terminal agent"; summary matched "Terminal-Bench"; has PDF; has rich summary
   - Categories: cs.CL
   - Tags: Evaluation / Data / Application / Method
   - Topics: Language Model / Benchmark
   - PDF: https://arxiv.org/pdf/2606.23321v1
   - Abstract: Terminal-using agents have quickly become the most popular downstream application of language models (LMs). Despite their prevalence, relatively little academic work has examined RL-based training of these models, likely due to difficult benchmarks, a lack of data, and a lack of simple baseline recipes. We present Tmax, the strongest open RL recipe for terminal agents to date, bringing open data recipes closer to the frontier. While simple, our recipe achieves 27% on Terminal-Bench 2.0 with only 9B parameters, outperforming much larger models from prior work. Concretely, we generate data using a novel taxonomy, combining difficulty control, personas, and verifier diversification, which allows us to cheaply generate large amounts of terminal environments for RL and SFT training. We open-source our terminal dataset, which is over 2.5x larger than previously released terminal-agent datasets. We then train open-weight models using RL with our data, using a simple, outcome-only recipe. We release our data, models, and code as a strong baseline for future open academic work on terminal agents at https://github.com/hamishivi/tmax.
