# Daily Paper Digest

- Generated: 2026-06-16 14:38:43 (Asia/Shanghai)
- Search window: last 24 hours
- Hit overview: LM=15, Agent Runtime Security=5, Terminal and SWE Agents=3
- Ranking strategy: hybrid (relevance first, published_at tie-break)
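
The hybrid ranking named above (relevance first, `published_at` as tie-break) can be sketched as two stable sorts. This is a minimal illustration, not the digest's actual implementation: the `relevance` field name is hypothetical, and the newest-first tie-break direction is an assumption inferred from how equally scored entries are ordered in this digest.

```python
from datetime import datetime


def hybrid_sort(papers):
    """Order papers by relevance (descending); ties go to the newer paper."""
    # Python's sort is stable, so sorting by the tie-break key first and the
    # primary key second yields "relevance first, published_at tie-break".
    newest_first = sorted(papers, key=lambda p: p["published_at"], reverse=True)
    return sorted(newest_first, key=lambda p: p["relevance"], reverse=True)


# Scores and timestamps taken from three entries of this digest (titles shortened).
papers = [
    {"title": "P3B3", "relevance": 159, "published_at": datetime(2026, 6, 15, 22, 10)},
    {"title": "LabOSBench", "relevance": 160, "published_at": datetime(2026, 6, 15, 22, 42)},
    {"title": "SearchGEO", "relevance": 160, "published_at": datetime(2026, 6, 15, 23, 5)},
]

ranked = [p["title"] for p in hybrid_sort(papers)]
print(ranked)
```

Two stable passes keep the tie-break explicit and avoid converting timestamps to numbers for a single composite key.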

## Today's Highlights

- Topic "LLM": 18 hits across LM, Agent Runtime Security, and other groups. Representative papers include "OpenClaw-Skill: Collective Skill Tree Search for Agentic Large Language Models" and "Context-Aware RL for Agentic and Multimodal LLMs".
- Topic "Language Model": 16 hits across LM, Agent Runtime Security, and other groups. Representative papers include "OpenClaw-Skill: Collective Skill Tree Search for Agentic Large Language Models" and "Context-Aware RL for Agentic and Multimodal LLMs".
- Topic "Benchmark": 7 hits across LM, Agent Runtime Security, and other groups. Representative papers include "Benchmarking LLM Agents on Meta-Analysis Articles from Nature Portfolio" and "LabOSBench: Benchmarking Computer Use Agents for Scientific Instrument Control".
- Topic "Agent": 3 hits across Agent Runtime Security and Terminal and SWE Agents. Representative papers include "MyPCBench: A Benchmark for Personally Intelligent Computer-Use Agents" and "Agent trajectories as programs: fingerprinting and programming coding-agent behavior".
- Topic "Prompt Injection": 1 hit in Agent Runtime Security. Representative paper: "KVEraser: Learning to Steer KV Cache for Efficient Localized Context Erasing".

## Section Status

- LM: 15 papers
- Agent Runtime Security: 5 papers
- Terminal and SWE Agents: 3 papers

## Topic Focus

### LLM

- Hits: 18
- Groups covered: LM, Agent Runtime Security, Terminal and SWE Agents
- Representative papers: "OpenClaw-Skill: Collective Skill Tree Search for Agentic Large Language Models", "Context-Aware RL for Agentic and Multimodal LLMs", "Benchmarking LLM Agents on Meta-Analysis Articles from Nature Portfolio"
- Quick read:
  - "OpenClaw-Skill: Collective Skill Tree Search for Agentic Large Language Models" (Evaluation / Application / Method): Equipping Large Language Model (LLM) agents with effective skills is crucial for solving complex tasks in real-world systems like OpenClaw. In this work, we ai…
  - "Context-Aware RL for Agentic and Multimodal LLMs" (Evaluation / Application / Method): Large language models (LLMs) often fail when answering requires identifying a small but decisive piece of evidence within a long or complex context, such as a…

### Language Model

- Hits: 16
- Groups covered: LM, Agent Runtime Security, Terminal and SWE Agents
- Representative papers: "OpenClaw-Skill: Collective Skill Tree Search for Agentic Large Language Models", "Context-Aware RL for Agentic and Multimodal LLMs", "Consensus-based Agentic Large Language Model Framework for Harmonized Tariff Schedule Code Classification"
- Quick read:
  - "OpenClaw-Skill: Collective Skill Tree Search for Agentic Large Language Models" (Evaluation / Application / Method): Equipping Large Language Model (LLM) agents with effective skills is crucial for solving complex tasks in real-world systems like OpenClaw. In this work, we ai…
  - "Context-Aware RL for Agentic and Multimodal LLMs" (Evaluation / Application / Method): Large language models (LLMs) often fail when answering requires identifying a small but decisive piece of evidence within a long or complex context, such as a…

### Benchmark

- Hits: 7
- Groups covered: LM, Agent Runtime Security, Terminal and SWE Agents
- Representative papers: "Benchmarking LLM Agents on Meta-Analysis Articles from Nature Portfolio", "LabOSBench: Benchmarking Computer Use Agents for Scientific Instrument Control", "DEEPRUBRIC: Evidence-Tree Rubric Supervision for Efficient Reinforcement Learning of Deep Research Agents"
- Quick read:
  - "Benchmarking LLM Agents on Meta-Analysis Articles from Nature Portfolio" (Evaluation / Data / Application / Method): Meta-analysis is a demanding form of evidence synthesis that combines literature retrieval, PI/ECO-guided study selection, and statistical aggregation. Its str…
  - "LabOSBench: Benchmarking Computer Use Agents for Scientific Instrument Control" (Evaluation / Application / Method): Current computer-use benchmarks primarily focus on software operation tasks in virtualized systems, whereas scientific instrumentation scenarios require coordi…

### Agent

- Hits: 3
- Groups covered: Agent Runtime Security, Terminal and SWE Agents
- Representative papers: "MyPCBench: A Benchmark for Personally Intelligent Computer-Use Agents", "Agent trajectories as programs: fingerprinting and programming coding-agent behavior", "Towards LLM Accelerated Rapid Reviews for Software Tool Discovery -- Case for Log Anomaly Detection"
- Quick read:
  - "MyPCBench: A Benchmark for Personally Intelligent Computer-Use Agents" (Evaluation / Application / Method): Current benchmarks for computer-use agents evaluate models in impersonal environments. This leaves a gap between evaluation and deployment where personal assis…
  - "Agent trajectories as programs: fingerprinting and programming coding-agent behavior" (Evaluation / Data / Application / Method): Benchmark scores tell you what an agent got right; they do not tell you how it got there. In this work, we introduce methods for comparing agents procedurally…

### Prompt Injection

- Hits: 1
- Groups covered: Agent Runtime Security
- Representative papers: "KVEraser: Learning to Steer KV Cache for Efficient Localized Context Erasing"
- Quick read:
  - "KVEraser: Learning to Steer KV Cache for Efficient Localized Context Erasing" (Application / Method): Post-hoc context erasing over the KV cache is challenging because a local edit has a global consequence: once a span has been processed, its influence propagat…

## LM Watch

### Group at a Glance

- "OpenClaw-Skill: Collective Skill Tree Search for Agentic Large Language Models" (Evaluation / Application / Method): Equipping Large Language Model (LLM) agents with effective skills is crucial for solving complex tasks in real-world systems like OpenClaw. In this work, we ai…
- "Context-Aware RL for Agentic and Multimodal LLMs" (Evaluation / Application / Method): Large language models (LLMs) often fail when answering requires identifying a small but decisive piece of evidence within a long or complex context, such as a…
- "Benchmarking LLM Agents on Meta-Analysis Articles from Nature Portfolio" (Evaluation / Data / Application / Method): Meta-analysis is a demanding form of evidence synthesis that combines literature retrieval, PI/ECO-guided study selection, and statistical aggregation. Its str…
- "Consensus-based Agentic Large Language Model Framework for Harmonized Tariff Schedule Code Classification" (Data / Application / Method): Accurate Harmonized Tariff Schedule (HTS) code classification is essential for customs clearance, duty assessment, trade statistics, and regulatory compliance…
- "Scalable Circuit Learning for Interpreting Large Language Models" (Evaluation / Application / Method): A prominent research direction in mechanistic interpretability is learning sparse circuits over LLM components to reveal how they jointly produce model behavio…

### Paper Summaries

1. [OpenClaw-Skill: Collective Skill Tree Search for Agentic Large Language Models](https://arxiv.org/abs/2606.16774v1)
   - Published: 2026-06-15 22:20
   - Authors: Tianyi Lin, Chuanyu Sun, Jingyi Zhang, Changxu Wei, Huanjin Yao, Shunyu Liu, et al.
   - Source: arXiv
   - Relevance score: 235
   - Match reasons: title matched "language model"; title matched "large language model"; title matched "agent"; summary matched "LLM"
   - Categories: cs.AI, cs.CL
   - Tags: Evaluation / Application / Method
   - Topics: LLM / Language Model
   - PDF: https://arxiv.org/pdf/2606.16774v1
   - Abstract: Equipping Large Language Model (LLM) agents with effective skills is crucial for solving complex tasks in real-world systems like OpenClaw. In this work, we aim to develop a framework that automatically constructs such reusable skills to enhance LLMs in tool use, multi-step reasoning, and dynamic environment interaction. To this end, we propose Collective Skill Tree Search (CSTS), a novel tree-search-based skill construction framework that constructs structured, diverse and generalizable tree of skills. The core idea of CSTS is to leverage collective intelligence to jointly search, identify and compose effective skills via two iterative phases: Collective Skill Node Generation (CSN-Gen) and Collective Skill Node Assessment (CSN-Assess). CSN-Gen exploits collective knowledge from multiple models to explore diverse candidate skills for each subtask, enabling comprehensive skill exploration. CSN-Assess employs multiple models as judges to evaluate and select skill nodes with two scoring mechanisms: (1) collective quality scoring that aggregates independent evaluations to produce a robust estimate of skill effectiveness, and (2) collective transferability scoring that explicitly verifies whether a skill generalizes well across different models. With CSTS, we construct a set of comprehensive tree of skills along with skill-augmented training data, enabling models to effectively learn and utilize skills. Besides, we introduce Collective Skill Reinforcement Learning, which actively selects multiple relevant skills from the tree to broaden solution-space exploration, avoid being trapped by a single skill and its resulting homogeneous or suboptimal solutions. As a result, our trained model, OpenClaw-Skill, exhibits outstanding agentic capabilities in long-horizon planning, tool use and generalization over challenging benchmarks.

2. [Context-Aware RL for Agentic and Multimodal LLMs](https://arxiv.org/abs/2606.17053v1)
   - Published: 2026-06-16 01:59
   - Authors: Peiyang Xu, Bangzheng Li, Sijia Liu, Karthik R. Narasimhan, Pramod Viswanath, Prateek Mittal, et al.
   - Source: arXiv
   - Relevance score: 199
   - Match reasons: title matched "LLM"; title matched "agent"; summary matched "language model"; summary matched "large language model"
   - Categories: cs.CL, cs.CV
   - Tags: Evaluation / Application / Method
   - Topics: LLM / Language Model
   - PDF: https://arxiv.org/pdf/2606.17053v1
   - Abstract: Large language models (LLMs) often fail when answering requires identifying a small but decisive piece of evidence within a long or complex context, such as a single line in a tool trace or a subtle detail in an image. We propose ContextRL, a context-aware reinforcement learning (RL) method that improves long-horizon reasoning and multimodal performance through an *indirect* auxiliary objective. Instead of supervising only the final answer, ContextRL presents the model with a query, an answer, and two highly similar contexts, and rewards it for selecting the context that supports the query-answer pair, thereby encouraging fine-grained grounding. We construct contrastive context data in two domains: for coding agents, trajectories serve as contexts, yielding 1k pairs built via condition filtering; for multimodal reasoning, images serve as contexts, yielding 7K pairs built via generative editing and similarity search. ContextRL achieves average gains of +2.2% over standard GRPO on 5 long-horizon benchmarks, and +1.8% across 12 diverse visual question answering benchmarks. To disentangle the effect of the proposed objective from that of additional data, we compare against data-augmentation baselines that repurpose the same contrastive contexts as standard query-context-answer examples. These baselines provide little to no improvement, showing that the gains arise from the proposed context-selection objective rather than from the contrastive data alone.

3. [Benchmarking LLM Agents on Meta-Analysis Articles from Nature Portfolio](https://arxiv.org/abs/2606.17041v1)
   - Published: 2026-06-16 01:56
   - Authors: Anzhe Xie, Weihang Su, Yujia Zhou, Yiqun Liu, Qingyao Ai
   - Source: arXiv
   - Relevance score: 185
   - Match reasons: title matched "LLM"; title matched "agent"; title matched "benchmark"; summary matched "reasoning"
   - Categories: cs.CL, cs.IR
   - Tags: Evaluation / Data / Application / Method
   - Topics: LLM / Benchmark
   - PDF: https://arxiv.org/pdf/2606.17041v1
   - Abstract: Meta-analysis is a demanding form of evidence synthesis that combines literature retrieval, PI/ECO-guided study selection, and statistical aggregation. Its structured, verifiable workflow makes it an ideal substrate for evaluating systematic scientific reasoning, yet existing benchmarks lack ground truth across the full retrieval-screening-synthesis pipeline. We introduce MetaSyn, a dataset of 442 expert-curated meta-analyses from Nature Portfolio journals. Each entry pairs a research question with PI/ECO criteria, a retrieval corpus of 140k PubMed articles, verified positive studies, hard negatives that are topically similar but PI/ECO-ineligible, and complete search strategies and date bounds. Benchmarking twelve pipeline configurations (nine RAG variants and a protocol-driven agent) reveals a critical screening bottleneck: despite a retrieval ceiling of 90.9% recall at K=200, no system recovers more than 52.7% of ground-truth included literature. Current LLMs fail to reliably separate eligible studies from PI/ECO-failing distractors in pools of comparable topical relevance. Stage-attributed metrics capture where systems succeed and fail; a single end-to-end score does not.

4. [Consensus-based Agentic Large Language Model Framework for Harmonized Tariff Schedule Code Classification](https://arxiv.org/abs/2606.16987v1)
   - Published: 2026-06-16 01:24
   - Authors: Truong Thanh Hung Nguyen, Khanh Van Quynh Nguyen, Hoang-Loc Cao, Tri Duong, Phuc Ho, Van Pham, et al.
   - Source: arXiv
   - Relevance score: 184
   - Match reasons: title matched "language model"; title matched "large language model"; title matched "agent"; summary matched "LLM"
   - Categories: cs.AI
   - Tags: Data / Application / Method
   - Topics: LLM / Language Model
   - PDF: https://arxiv.org/pdf/2606.16987v1
   - Abstract: Accurate Harmonized Tariff Schedule (HTS) code classification is essential for customs clearance, duty assessment, trade statistics, and regulatory compliance in maritime logistics. However, exact HTS classification remains challenging because product descriptions are often short, incomplete, or ambiguous, while correct classification depends on hierarchical tariff structures, legal notes, and jurisdiction-specific rules. This paper proposes an agentic large language model (LLM) framework for Canadian 10-digit HTS code classification in smart-port and maritime logistics environments. The framework integrates multi-agent information retrieval, semantic retrieval over official tariff documents, evidence-grounded reasoning, consensus-based validation, element-wise voting across hierarchical code components, confidence estimation, and human-in-the-loop escalation. We evaluate the framework on a private dataset of 3,300 domain-expert-labeled product records collected from logistics and delivery contexts. Experimental results show that exact 10-digit classification remains difficult even for advanced LLMs, with performance decreasing from coarse chapter-level prediction to fine-grained tariff and statistical suffix assignment. These findings demonstrate the need for evidence-grounded, uncertainty-aware, and human-centered classification workflows rather than fully autonomous single-step prediction. The proposed framework supports more interpretable, accountable, and compliance-oriented HTS classification for maritime logistics and smart-port operations. Our code is available at https://github.com/Analytics-Everywhere-Lab/hts.

5. [Scalable Circuit Learning for Interpreting Large Language Models](https://arxiv.org/abs/2606.16939v1)
   - Published: 2026-06-16 00:40
   - Authors: Naiyu Yin, Dennis Wei, Tian Gao, Amit Dhurandhar, Karthikeyan Natesan Ramamurthy, Yue Yu
   - Source: arXiv
   - Relevance score: 162
   - Match reasons: title matched "language model"; title matched "large language model"; summary matched "LLM"; summary matched "RAG"
   - Categories: cs.LG, cs.AI
   - Tags: Evaluation / Application / Method
   - Topics: LLM / Language Model
   - PDF: https://arxiv.org/pdf/2606.16939v1
   - Abstract: A prominent research direction in mechanistic interpretability is learning sparse circuits over LLM components to reveal how they jointly produce model behavior. However, raw neurons are polysemantic, making learned circuits hard to interpret. Sparse autoencoder (SAE) features alleviate this, but their high dimensionality makes existing intervention-based circuit learning methods computationally prohibitive. We propose CircuitLasso, a scalable circuit-learning approach based on sparse linear regression. CircuitLasso recovers circuits whose structural accuracy matches that of state-of-the-art intervention-based methods on the benchmark data, at a fraction of the computational cost. For interpretability, CircuitLasso efficiently uncovers relationships among SAE features, showing how human-interpretable semantic features propagate through the model and influence its predictions. Finally, we validate the utility of our learned circuits by leveraging their insights to achieve comparable performance at substantially lower cost on a domain-generalization task.

6. [How Much Can We Trust LLM Search Agents? Measuring Endorsement Vulnerability to Web Content Manipulation](https://arxiv.org/abs/2606.16821v1)
   - Published: 2026-06-15 23:05
   - Authors: Yimeng Chen, Zhe Ren, Firas Laakom, Yu Li, Dandan Guo, Jürgen Schmidhuber
   - Source: arXiv
   - Relevance score: 160
   - Match reasons: title matched "LLM"; title matched "agent"; summary matched "language model"; summary matched "large language model"
   - Categories: cs.CL, cs.CR, cs.CY, cs.IR
   - Tags: Evaluation / Application / Method
   - Topics: LLM / Language Model
   - PDF: https://arxiv.org/pdf/2606.16821v1
   - Abstract: Large language model (LLM)-based search agents synthesize open-web content into actionable recommendations on behalf of users, creating a risk that attacker-published pages are transformed into endorsed claims. We introduce SearchGEO, a controlled evaluation framework for measuring endorsement corruption in LLM-based web-search agents, combining a web-evidence manipulation pipeline, a five-mode attack taxonomy, and multiple output-level metrics. We evaluate 13 LLM backends on 308 cases each. Results show that vulnerability patterns vary across backends: overall attack success rate (ASR) ranges from 0.0% on Claude-Sonnet-4.6 to 31.4% on Gemini-3-Flash, the strongest attack mode differs by model family, and the same deployment scaffold could amplify or decrease ASR on different backends. An auxiliary agent-skill probe, where endorsement becomes an install command, exposes a sharp split among otherwise robust backends: Claude over-rejects while GPT over-trusts. These findings argue for treating recommendation reliability under adversarial search content as a first-class dimension of backend safety evaluation.

7. [LabOSBench: Benchmarking Computer Use Agents for Scientific Instrument Control](https://arxiv.org/abs/2606.16802v1)
   - Published: 2026-06-15 22:42
   - Authors: Anqi Zou, Han Deng, Chengyu Zhang, Junquan Hu, Yu Wang, Yuxiang Xing, et al.
   - Source: arXiv
   - Relevance score: 160
   - Match reasons: title matched "agent"; title matched "benchmark"; summary matched "language model"; summary matched "alignment"
   - Categories: cs.AI
   - Tags: Evaluation / Application / Method
   - Topics: Language Model / Benchmark
   - PDF: https://arxiv.org/pdf/2606.16802v1
   - Abstract: Current computer-use benchmarks primarily focus on software operation tasks in virtualized systems, whereas scientific instrumentation scenarios require coordinated control over complex interfaces, and feedback-driven parameter adjustment. However, directly evaluating agents on physical high-precision instruments is impractical due to high cost, safety risks, limited accessibility, and difficulty in ensuring reproducible evaluation. This motivates the need for a simulated yet realistic testbed that preserves the operational challenges of scientific instruments while enabling scalable and safe benchmarking. To this end, we introduce LabOSBench, a challenging benchmark for multimodal GUI agents built on a suite of web-based scientific-instrument simulators. Operating directly via a browser, LabOSBench avoids resource-heavy OS virtualization while supporting flexible task configuration and execution-based evaluation. Specifically, LabOSBench constructs 96 subtasks across eight instrument simulators, covering workflows from sample loading, alignment, parameter tuning, and data acquisition to result inspection. We evaluate general-purpose vision-language models, specialized GUI agent models, and advanced agentic frameworks at both subtask and end-to-end levels. Our experiments reveal that while existing agents can complete many structured GUI subtasks, they still struggle with feedback-driven operations and long-horizon workflow execution. Overall, LabOSBench provides a reproducible, low-cost testbed for advancing computer-using agents toward scientific-instrument control.

8. [P3B3: A Multi-Turn Conversational Benchmark for Measuring European and Brazilian Portuguese Variety Bias in LLMs](https://arxiv.org/abs/2606.16753v1)
   - Published: 2026-06-15 22:10
   - Authors: Rafael Ferreira, Inês Vieira, Inês Calvo, James Furtado, Iago Paulo, Diogo Tavares, et al.
   - Source: arXiv
   - Relevance score: 159
   - Match reasons: title matched "LLM"; title matched "benchmark"; summary matched "language model"; summary matched "large language model"
   - Categories: cs.CL, cs.AI, cs.LG
   - Tags: Evaluation / Method
   - Topics: LLM / Language Model
   - PDF: https://arxiv.org/pdf/2606.16753v1
   - Abstract: As Large Language Models (LLMs) become embedded in everyday communication, capturing regional linguistic variation is essential for reliable and equitable language use. In Portuguese, European (pt-PT) and Brazilian (pt-BR) varieties remain unevenly represented, with pt-BR dominating in data quantity, while LLM preference for Portuguese variants remains underexplored. To address this gap, we introduce P3B3, an expert-curated language variety agnostic benchmark of conversational prompts, along with an evaluation framework for measuring variety bias and controllability. Experiments on several models show that most LLMs exhibit a strong bias toward pt-BR, with variation in controllability across models. These results highlight the need for more balanced multilingual representation across language varieties.

9. [LESS Is More: Mutual-Stability Sampling for Diffusion Language Models](https://arxiv.org/abs/2606.16908v1)
   - Published: 2026-06-16 00:15
   - Authors: Amr Mohamed, Guokan Shang, Michalis Vazirgiannis
   - Source: arXiv
   - Relevance score: 157
   - Match reasons: title matched "language model"; summary matched "large language model"; summary matched "LLM"; summary matched "RAG"
   - Categories: cs.CL
   - Tags: Evaluation / Method
   - Topics: LLM / Language Model
   - PDF: https://arxiv.org/pdf/2606.16908v1
   - Abstract: Diffusion large language models (dLLMs) offer a promising alternative to autoregressive decoding by iteratively refining masked sequences, enabling parallel token updates and bidirectional conditioning. Their practical efficiency, however, is limited by sampling procedures that execute a fixed number of reverse denoising steps selected before decoding, spending computation on already-stable positions and sometimes committing unstable ones too early. We present LESS, a training-free, model-agnostic adaptive sampler that treats token commitment as an online stopping problem. LESS implements mutual-stability sampling through a joint stability rule that makes a masked position eligible for unmasking only when its top-1 prediction has high confidence, its top-1 token persists across recent reverse steps, and its predictive distribution is stable under top-K inter-step Jensen-Shannon divergence. We evaluate LESS on Dream-7B, LLaDA-8B, and LLaDA-1.5-8B, covering full-sequence diffusion and semi-autoregressive blockwise sampling regimes, across seven benchmarks spanning general knowledge, math, and code. LESS improves average accuracy over strong training-free adaptive samplers while using 72.1% fewer reverse steps than fixed-budget decoding. Since each reverse step requires a Transformer forward pass, these step-count reductions translate into fewer forward evaluations, lower measured wall-clock latency, and lower estimated inference compute.

10. [Contrastive-Difference CKA Reveals Concept-Specific Structural Alignment Across Language Model Architectures](https://arxiv.org/abs/2606.16897v1)
    - Published: 2026-06-16 00:07
    - Authors: Xueping Gao
    - Source: arXiv
    - Relevance score: 143
    - Match reasons: title matched "language model"; title matched "alignment"; summary matched "LLM"; summary matched "reasoning"
    - Categories: cs.CL
    - Tags: Evaluation / Method
    - Topics: LLM / Language Model
    - PDF: https://arxiv.org/pdf/2606.16897v1
    - Abstract: Do different LLM architectures encode high-level concepts in structurally compatible ways? We systematically characterize a geometric-functional universality dissociation: across multiple concept domains and architectural families, moderate geometric convergence coexists with near-perfect functional transfer. Using contrastive-difference CKA (CKA_Delta), a training-free diagnostic that computes kernel alignment on per-sample contrastive differences, we isolate concept-specific convergence from generic similarity, achieving significant discrimination where standard CKA cannot. The dissociation replicates across all six concept domains we test (five with p <= 0.017 geometric discrimination and safety as a converging-functional trend, p = 0.08), including two non-instruction concepts (code-vs-NL, reasoning-vs-recall) validated without system prompts; a single 70B-70B pair provides an observational note that universality may strengthen with scale, requiring replication with additional >=70B models. We position CKA_Delta as a practical regime classifier and architectural outlier detector (Gemma: d = 1.08, AUC = 0.79) rather than an absolute transfer-accuracy predictor, providing a training-free diagnostic for cross-architecture concept monitoring.

11. [Scaling LLM Reasoning from Minimal Labels: A Semi-Supervised Framework with a Lightweight Verifier](https://arxiv.org/abs/2606.16811v1)
    - Published: 2026-06-15 22:55
    - Authors: Keizo Kato, Chenhui Chu, Yugo Murawaki, Sado Kurohashi
    - Source: arXiv
    - Relevance score: 142
    - Match reasons: title matched "LLM"; title matched "reasoning"; summary matched "language model"; summary matched "large language model"
    - Categories: cs.AI, cs.CL
    - Tags: Method
    - Topics: LLM / Language Model
    - PDF: https://arxiv.org/pdf/2606.16811v1
    - Abstract: For the development of Large language models (LLMs), recent approaches to generating pseudo intermediate reasoning have shown remarkable progress. But they typically rely on large numbers of correctly annotated answers to assess reasoning quality. This paper presents a semi-supervised framework that scales reasoning learning from minimal supervision, turning reasoning verification itself into a data creation mechanism. We train a lightweight reasoning-correctness classifier on only a few labeled samples, which judges whether intermediate reasoning traces generated by an LLM are valid. Furthermore, an entropy-based confidence threshold filters out unreliable samples, and the remaining high-confidence reasoning traces are used to fine-tune the model. Experiments on Verifiable Math Problems (Orca-Math subset) and Question Answering on Image Scene Graphs (GQA) with Visual Programming show that our method achieves accuracy comparable to using 10-15x more labeled data. Ablation analyses confirm that both the classifier and entropy filtering are essential for scalable and noise-resistant pseudo-labeling. By replacing expensive answer-level supervision with lightweight reasoning verification, our method provides a practical path toward constructing large-scale reasoning resources and paves the way for future autonomous reasoning systems that learn from minimal human input.

12. [DEEPRUBRIC: Evidence-Tree Rubric Supervision for Efficient Reinforcement Learning of Deep Research Agents](https://arxiv.org/abs/2606.17029v1)
    - Published: 2026-06-16 01:52
    - Authors: Minghang Zhu, Chuyang Wei, Junhao Xu, Yilin Cheng, Zhumin Chen, Jiyan He
    - Source: arXiv
    - Relevance score: 141
    - Match reasons: title matched "agent"; summary matched "LLM"; summary matched "reasoning"; summary matched "benchmark"
    - Categories: cs.CL
    - Tags: Evaluation / Application / Method
    - Topics: LLM / Benchmark
    - PDF: https://arxiv.org/pdf/2606.17029v1
    - Abstract: Deep research agents synthesize long-form reports by searching and reasoning over retrieved evidence. Reinforcement learning with rubric-based rewards improves these agents by optimizing them against checkable criteria that translate report quality into reward signals, but its efficiency depends on whether those criteria reliably capture the task scope and evidence needs. Most existing studies ask an LLM to generate rubrics for a given query, but when the model fails to infer the underlying information needs, the generated rubrics may be incomplete and reduce RL efficiency. To obtain more reliable query-rubric supervision, we introduce DeepRubric, a data construction framework that reverses this process: instead of inferring evaluation criteria for a given query, it first determines what an evidence-backed report should be evaluated on and then synthesizes aligned query-rubric pairs from those evaluation targets. Starting from a sampled seed topic, DeepRubric builds an evidence tree by recursively expanding evidence-backed sub-questions, whose leaves serve as atomic and verifiable evaluation targets. It then uses the evidence tree to synthesize the training query and rubrics, ensuring that the reward evaluates exactly the information requested by the query. Using DeepRubric, we construct 9K query-rubric supervision examples and train DeepRubric-8B with rubric-based GRPO, achieving comparable performance to prior open state-of-the-art deep research models across three benchmarks with roughly 13x fewer RL GPU-hours.

13. [Exploring Extrinsic and Intrinsic Properties for Effective Reasoning with Code Interpreter](https://arxiv.org/abs/2606.16934v1)
    - Published: 2026-06-16 00:34
    - Authors: Patomporn Payoungkhamdee, Napat Laosaengpha, Jenta Wonglertsakul, Pittawat Taveekitworachai, Pume Tuchinda, Panjapong Poobanchuen, et al.
    - Source: arXiv
    - Relevance score: 139
    - Match reasons: title matched "reasoning"; summary matched "language model"; summary matched "large language model"; summary matched "LLM"
    - Categories: cs.CL, cs.LG
    - Tags: Method
    - Topics: LLM / Language Model
    - PDF: https://arxiv.org/pdf/2606.16934v1
    - Abstract: Reasoning with a Code Interpreter (CI) has emerged as an effective paradigm for enhancing the reasoning capabilities of large language models (LLMs) through executable computation and iterative verification. Despite its growing adoption, the behavioral properties underlying effective code reasoning remain largely underexplored. In this work, we investigate code reasoning from two distinct perspectives inspired by prior studies of natural language reasoning: extrinsic properties, represented by crucial tokens, and intrinsic properties, represented by code-specific cognitive behaviors. Across multiple LLMs, we find that stronger CI reasoning models consistently exhibit a higher prevalence of crucial tokens and cognitive behaviors, particularly verification, backtracking, and backward chaining. Building on these observations, we examine how these properties can be leveraged during both inference and training. At inference time, appending code-specific crucial tokens improves performance on several reasoning capabilities, including mathematical, ordering, and optimization, while yielding limited benefits elsewhere. At training time, augmenting a state-of-the-art framework with code-specific cognitive behaviors improves supervised fine-tuning and reinforcement learning performance in two of three evaluated models. Further analysis shows that these behaviors reduce overthinking in incorrect responses and improve token efficiency, while also revealing factors that limit gains in a certain model. Our findings provide the first systematic characterization of effective reasoning with CI and demonstrate both the potential and limitations of leveraging key properties to improve CI-based reasoning.

14. [Binary Tracking for Spatial QA and Navigation with Open Vision-Language Models](https://arxiv.org/abs/2606.16902v1)
    - Published: 2026-06-16 00:10
    - Authors: Dongbin Na, Chanwoo Kim, Soonbin Rho, Giyun Choi, Gangbok Lee, Dooyoung Hong
    - Source: arXiv
    - Relevance score: 139
    - Match reasons: title matched "language model"; summary matched "reasoning"; summary matched "agent"; summary matched "RAG"
    - Categories: cs.RO, cs.AI
    - Tags: Evaluation / Data / Application / Method
    - Topics: Language Model / Benchmark
    - PDF: https://arxiv.org/pdf/2606.16902v1
    - Abstract: This work addresses spatial question answering for service robots traversing long egocentric routes. Given a query such as "where can I find a dry cleaner on the way back home?", the system returns a metric coordinate that downstream navigation components can act on. Prior Spatial Question Answering approaches leverage retrieval-augmented agents built on closed-source models such as GPT-4o for path exploration. However, robots operating in the real world often cannot reliably depend on online closed-source models due to network instability, communication latency, and deployment cost. It creates a need for open-source based Spatial Question Answering approaches that can run onboard the robot, yet prior research in this direction remains limited. This work proposes BinTrack, a simple yet effective, fully open-source spatial-localization agent that leverages the temporal ordering of a robot's trajectory. BinTrack performs a binary search over the trajectory segments between two anchor landmarks identified from a query. It improves overall accuracy by up to 22.8% over other open-source implementations and even matches the reported closed-source model result on the global category of the SpaceLocQA benchmark, the most challenging setting that has so far required strong reasoning agents such as GPT-4o. Furthermore, its optimized inference strategy consistently yields more than a 1.5x inference speedup over previous approaches. Finally, this work releases GangnamLoop, a novel and practical multi-trip outdoor benchmark collected by deploying a real quadruped robot on public streets with the anonymization policy. It revisits the same locations under different outdoor conditions and pairs the robot's low viewpoint with the human owner's. The source codes and datasets are publicly available at https://github.com/ndb796/BinaryTracking

15. [Follow the Latent Roadmap: Navigating Revocable Decoding for Diffusion LLMs with Anchor Tokens](https://arxiv.org/abs/2606.16847v1)
   - Published: 2026-06-15 23:23
   - Authors: Yizhen Yao, Qinglin Zhu, Runcong Zhao, Xiangxiang Dai, Yanzheng Xiang, Yulan He, et al.
   - Source: arXiv
   - Relevance score: 138
   - Match reasons: title matched "LLM"; summary matched "language model"; summary matched "large language model"; summary matched "RAG"
   - Categories: cs.CL, cs.AI
   - Tags: Evaluation / Method
   - Topics: LLM / Language Model
   - PDF: https://arxiv.org/pdf/2606.16847v1
   - Abstract: Diffusion Large Language Models (dLLMs) offer a promising avenue for parallel generation but face a trade-off between decoding speed and quality. While revocable decoding strategies attempt to mitigate errors by verifying and remasking tokens, they typically operate within a mixed-quality context. This leads to two critical failures: *Error Propagation*, where new tokens absorb toxic information from erroneous context, and *Local Error Reinforcement*, where errors mutually reinforce each other to evade detection. To alleviate these challenges, we propose ASRD (Anchor Supervised Revocable Decoding), a training-free framework that operates within the embedding space. ASRD explicitly decouples the decoding context into trusted *Anchor Tokens*, which are identified via temporal consistency, and uncertain candidates. Leveraging a dynamic Anchor Tokens Cache, we introduce two complementary mechanisms: (1) Anchor-Guided Generation, which injects entropy-weighted anchor signals into masked positions to implicitly rectify attention toward the reliable global skeleton; and (2) Anchor-Perturbed Verification, which applies orthogonal perturbations to uncertain candidate tokens, destabilizing and remasking errors driven by fragile local consensus. Extensive experiments on math and coding benchmarks demonstrate that ASRD outperforms recent remasking baselines, achieving accuracy improvements of up to 6.4% while accelerating inference throughput by up to 7.2×.
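
The ASRD abstract says anchor tokens are "identified via temporal consistency" but does not give the rule. As a rough illustration only, the sketch below assumes one plausible reading: a position counts as an anchor when its predicted token id has not changed over the last few decoding steps. The function name `find_anchor_positions`, the `window` parameter, and the stability rule itself are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence


def find_anchor_positions(history: Sequence[Sequence[int]], window: int = 3) -> List[int]:
    """Return positions whose predicted token id is unchanged over the last `window` steps.

    history[t][i] is the token id predicted at position i on decoding step t.
    The stability rule and `window` are illustrative assumptions, not ASRD's definition.
    """
    if window < 1 or len(history) < window:
        return []  # not enough observed steps to call any position stable
    recent = history[-window:]
    first = recent[0]
    return [
        i
        for i in range(len(first))
        if all(step[i] == first[i] for step in recent)
    ]


# Toy decoding history: 4 steps over 3 positions.
history = [
    [1, 2, 3],
    [1, 5, 3],
    [1, 5, 4],
    [1, 5, 4],
]
anchors = find_anchor_positions(history, window=3)
```

With this toy history, positions 0 and 1 are stable over the last three steps while position 2 changed, so only the first two would be treated as trusted context under this assumed rule.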

## Agent Runtime Security Observations

### Group at a Glance

- "Automated jailbreak attack targeting multiple defense strategies" [Evaluation / Method]: Large language models (LLMs) have demonstrated remarkable capabilities across a wide range of tasks. However, their safety remains a critical concern due to th…
- "MyPCBench: A Benchmark for Personally Intelligent Computer-Use Agents" [Evaluation / Application / Method]: Current benchmarks for computer-use agents evaluate models in impersonal environments. This leaves a gap between evaluation and deployment where personal assis…
- "DoubtProbe: Black-Box Jailbreak Defense via Structural Verification and Semantic Auditing" [Evaluation / Method]: As large language models (LLMs) are increasingly deployed in user-facing systems, black-box jailbreak defense has become an important practical problem. Existi…
- "KVEraser: Learning to Steer KV Cache for Efficient Localized Context Erasing" [Application / Method]: Post-hoc context erasing over the KV cache is challenging because a local edit has a global consequence: once a span has been processed, its influence propagat…
- "Adaptive and Explicit safe: Triggering Latent Safety Awareness in Large Reasoning Models" [Evaluation / Data / Method]: While Large Reasoning Models (LRMs) excel at complex tasks, they remain highly vulnerable to sophisticated jailbreaks and direct harmful queries. To address th…

### Paper Overview

1. [Automated jailbreak attack targeting multiple defense strategies](https://arxiv.org/abs/2606.16751v1)
   - Published: 2026-06-15 22:09
   - Authors: Qi Wang, Chengcheng Wan, Weijia He, Yanqing Li, Hanqi Sun, Xiaodong Gu, et al.
   - Source: arXiv
   - Relevance score: 65
   - Match reasons: title matched "jailbreak"; has PDF; has rich summary; has complete metadata
   - Categories: cs.CR, cs.AI
   - Tags: Evaluation / Method
   - Topics: LLM / Language Model
   - PDF: https://arxiv.org/pdf/2606.16751v1
   - Abstract: Large language models (LLMs) have demonstrated remarkable capabilities across a wide range of tasks. However, their safety remains a critical concern due to their susceptibility to adversarial prompt-based attacks. In this paper, we present UNIATTACK, an adversarial testing framework designed from a defense-oriented perspective to systematically construct effective black-box attack prompts. Unlike prior approaches that rely on static templates or iterative model-specific tuning, UNIATTACK extracts minimal but high-impact attack features from diverse existing attacks, optimizes them via a specialized attacker LLM, and composes them into flexible templates through an automated refinement process. This feature-centric construction enables one-shot attacks that generalize across multiple models and safety categories, providing a practical tool for assessing LLM robustness. Our evaluation results show that, compared to the baselines, UNIATTACK achieves an average attack success rate (ASR) improvement of 64.63%-248.82% on models deployed with multi-layered defense mechanisms, and it incurs only 0.03%-4.96% of the baselines' cost. The UNIATTACK artifact is available at https://anonymous.4open.science/r/UniAttack-Artifact-30F1.

2. [MyPCBench: A Benchmark for Personally Intelligent Computer-Use Agents](https://arxiv.org/abs/2606.16748v1)
   - Published: 2026-06-15 22:08
   - Authors: Lawrence Keunho Jang, Andrew Keunwoo Jang, Jing Yu Koh, Ruslan Salakhutdinov
   - Source: arXiv
   - Relevance score: 65
   - Match reasons: title matched "computer-use agent"; has PDF; has rich summary; has complete metadata
   - Categories: cs.LG, cs.CL
   - Tags: Evaluation / Application / Method
   - Topics: Benchmark / Agent
   - PDF: https://arxiv.org/pdf/2606.16748v1
   - Abstract: Current benchmarks for computer-use agents evaluate models in impersonal environments. This leaves a gap between evaluation and deployment where personal assistants are expected to work across a user's whole digital life, including their context, historical data, and logged-in accounts. This gap is widest on web tasks, where live web evaluations cannot exercise sites that require logging in or personal information, the kind of site a real personal assistant has to drive. We introduce MyPCBench, which tests computer-use agents as personal assistants on a Linux desktop populated with 17 simulated real-world web applications and a full desktop stack, all seeded for one canonical persona, Michael Scott from The Office. We define 184 tasks in this environment, each inspired by a real request drawn from the OpenClaw community, and benchmark six closed and open-weight models with a uniform computer+bash tool surface. We find that the best model, Claude Opus 4.6, fully solves 55.4% of the tasks, the only model above 50%. Model failures cluster on tasks that span many applications and on long trajectories, where personalization stresses an assistant the most. We release the environment, task set, and agent harness at https://mypcbench.com.

3. [DoubtProbe: Black-Box Jailbreak Defense via Structural Verification and Semantic Auditing](https://arxiv.org/abs/2606.16527v1)
   - Published: 2026-06-15 18:30
   - Authors: Xuanyu Yin, Yilin Jiang, Jun Zhou, Kai Chen, Zhengfu Cao, Xiaolei Dong
   - Source: arXiv
   - Relevance score: 61
   - Match reasons: title matched "jailbreak"; has PDF; has rich summary; has complete metadata
   - Categories: cs.CR, cs.CL
   - Tags: Evaluation / Method
   - Topics: LLM / Language Model
   - PDF: https://arxiv.org/pdf/2606.16527v1
   - Abstract: As large language models (LLMs) are increasingly deployed in user-facing systems, black-box jailbreak defense has become an important practical problem. Existing defenses often rely on known-attack coverage, prompt-level semantic judgment, or local runtime control, yet these paths can become unstable under evolving prompt packaging, expression rewriting, and structure manipulation. We observe that many black-box jailbreaks do not remove the harmful goal, but reorganize the information needed to express and execute it, thereby evading safety alignment while remaining recoverable during generation. Motivated by this observation, we propose DoubtProbe, a dual-branch inference-time defense framework that combines structural verification with semantic auditing and formulates black-box jailbreak defense as consistency checking under controlled transformation. The structural branch extracts a structured representation from the original request, reconstructs the request under representation constraints, and detects information-preservation failures between the original and reconstructed requests; the semantic branch audits the original prompt directly. We evaluate DoubtProbe against representative black-box defenses on jailbreak and benign-request benchmarks, and further test backbone transfer from Qwen2.5-72B to Llama-3.1-70B. Results show that DoubtProbe achieves a stronger and more stable defense-utility trade-off: on Qwen2.5-72B, it reduces the JBB attack success rate from 0.293 to 0.100 and the CodeAttack attack success rate from 0.152 to 0.001, while maintaining false positive rates of 0.022 and 0.016 on AlpacaEval and OR-Bench; the same pattern remains stable on Llama-3.1-70B. These findings show that structural inconsistency signals provide a practical and generalizable basis for black-box jailbreak defense, especially when combined with semantic auditing.

4. [KVEraser: Learning to Steer KV Cache for Efficient Localized Context Erasing](https://arxiv.org/abs/2606.17034v1)
   - Published: 2026-06-16 01:53
   - Authors: Mufei Li, Shikun Liu, Dongqi Fu, Haoyu Wang, Yinglong Xia, Hong Li, et al.
   - Source: arXiv
   - Relevance score: 47
   - Match reasons: summary matched "prompt injection"; has PDF; has rich summary; has complete metadata
   - Categories: cs.CL, cs.LG
   - Tags: Application / Method
   - Topics: LLM / Prompt Injection
   - PDF: https://arxiv.org/pdf/2606.17034v1
   - Abstract: Post-hoc context erasing over the KV cache is challenging because a local edit has a global consequence: once a span has been processed, its influence propagates into the cached states of all subsequent tokens. This issue arises naturally in long-context LLM applications, where stale retrieved facts, incorrect tool observations, retracted user preferences, or harmful prompt injections may be identified only after prefill. Exact erasing must then recompute all tokens after the deleted span, making its computational cost depend on suffix length rather than erased-span length. We introduce KVEraser, a learned KV-cache editing method for efficient localized context erasing. Given a processed context and a span to remove, KVEraser replaces only the KV states of the erased interval with learned steering states while reusing the remaining cache unchanged. To learn a transferable erasing mechanism, we build a two-stage training pipeline: generic span-neighbor pre-training teaches the eraser to suppress the influence of the erased span, while task-specific fine-tuning adapts this capability to downstream scenarios. Experiments show that KVEraser nearly matches full recomputation in post-erasure performance on in-domain tasks across 1K-32K context lengths, while its latency increases by only 24% compared with a 17.6x increase for full recomputation. KVEraser also generalizes to unseen long-document QA tasks with harmful factual distractors, achieving the best performance among approximate baselines with a 3-4x speedup over full recomputation.

5. [Adaptive and Explicit safe: Triggering Latent Safety Awareness in Large Reasoning Models](https://arxiv.org/abs/2606.16808v1)
   - Published: 2026-06-15 22:51
   - Authors: Ke Miao, Jiaxin Li, Hongliang Chen, Yuke Hu, Zhan Qin
   - Source: arXiv
   - Relevance score: 44
   - Match reasons: summary matched "jailbreak"; has PDF; has rich summary; has complete metadata
   - Categories: cs.AI
   - Tags: Evaluation / Data / Method
   - Topics: Benchmark / RAG
   - PDF: https://arxiv.org/pdf/2606.16808v1
   - Abstract: While Large Reasoning Models (LRMs) excel at complex tasks, they remain highly vulnerable to sophisticated jailbreaks and direct harmful queries. To address this vulnerability, prior works depend heavily on external manual data annotation for safety alignment. However, we observe that LRMs can inherently identify safety risks when re-presented with original queries alongside their own reasoning trajectories, a capability we term Latent Safety Awareness. To leverage this safety awareness, we first employ Supervised Fine-Tuning (SFT) to explicitly induce safe tags to trigger safety analysis and guidance following the initial reasoning content for unsafe queries, while preserving standard responses for general queries to ensure adaptive triggering. Subsequently, we apply Direct Preference Optimization (DPO) to further enhance the correctness and stability of the safety analysis and guidance. Notably, responses required for both training stages are entirely generated by the models being optimized. With (Safe Trigger) SFT and DPO, experimental results demonstrate significant safety enhancement. For example, the Attack Success Rate (ASR) of DeepSeek-R1-Distill-Llama-8B, on average, drops 24.65% and 36.72% on harmful and jailbreak benchmarks, respectively. Finally, our Safe Trigger method exerts almost no negative impact on general performance or user experience.
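
The KVEraser abstract (item 4 above) contrasts exact erasing, whose cost follows the suffix length, with localized editing, which touches only the erased interval. The toy count below makes that contrast concrete. It counts cached positions whose states must be rewritten and ignores the per-position attention cost entirely; the function name `states_to_update` and the half-open span convention are assumptions made for this sketch, not part of the paper.

```python
from typing import Tuple


def states_to_update(context_len: int, span_start: int, span_end: int) -> Tuple[int, int]:
    """Count cached positions rewritten when erasing the half-open span [span_start, span_end).

    Returns (exact, localized):
      exact     - positions after the span, which full recomputation must redo
      localized - positions inside the span, which a localized edit replaces
    This is a position count only, not a latency model.
    """
    if not 0 <= span_start <= span_end <= context_len:
        raise ValueError("span must lie inside the context")
    exact = context_len - span_end
    localized = span_end - span_start
    return exact, localized


# Hypothetical example: a 200-token span erased early in a 32,000-token context.
exact, localized = states_to_update(32_000, 1_000, 1_200)
```

In this hypothetical case the exact path rewrites 30,800 positions while the localized path rewrites 200, which is the asymmetry the abstract describes as cost depending on suffix length rather than erased-span length.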

## Terminal and SWE Agents Observations

### Group at a Glance

- "Agent trajectories as programs: fingerprinting and programming coding-agent behavior" [Evaluation / Data / Application / Method]: Benchmark scores tell you what an agent got right; they do not tell you how it got there. In this work, we introduce methods for comparing agents procedurally…
- "Towards LLM Accelerated Rapid Reviews for Software Tool Discovery -- Case for Log Anomaly Detection" [Evaluation / Application / Method]: In software engineering research, the primary outcome is frequently a tool. However, for practitioners and academics alike, it is hard to tell which tools are…
- "No Resource, No Benchmarks, No Problem? Evaluating and Improving LLMs for Code Generation in No-Resource Languages" [Evaluation / Application / Method]: Large Language Models (LLMs) have significantly advanced the automation of software engineering tasks. One prominent example is code generation, where an LLM p…

### Paper Overview

1. [Agent trajectories as programs: fingerprinting and programming coding-agent behavior](https://arxiv.org/abs/2606.16988v1)
   - Published: 2026-06-16 01:28
   - Authors: Hamidah Oderinwale
   - Source: arXiv
   - Relevance score: 64
   - Match reasons: summary matched "SWE-bench"; summary matched "coding agent"; has PDF; has rich summary
   - Categories: cs.SE, cs.LG
   - Tags: Evaluation / Data / Application / Method
   - Topics: Benchmark / Agent
   - PDF: https://arxiv.org/pdf/2606.16988v1
   - Abstract: Benchmark scores tell you what an agent got right; they do not tell you how it got there. In this work, we introduce methods for comparing agents procedurally in different contexts, where the model, tasks, and approaches vary. We compare ten agents and find that they are identifiable by their behavioral habits, which we define as fingerprints: a probe over these procedural signatures attributes an unseen trajectory to the correct agent at 85.7% accuracy, controlling for leakage across tasks. We develop procedural representations for agent problem-solving procedures with an emergent vocabulary induction technique that is meant to be maximally compressive to avoid surface-level variation while being expressive enough to unveil the quirks of the models' patterns. We apply our framework to the software engineering evaluation dataset SWE-Bench to study the structural distinctness of agent trajectories and find that behavior is most similar between models from similar release periods and those that are distilled from one another (e.g., a distilled student model and its teacher have a Jensen-Shannon divergence of 0.25, about half the distance between other model pairs). As more models saturate evaluations, we believe that it will be important to probe model behavior along more holistic dimensions than success rates alone. We introduce ProcGrep, a library for auditing and evaluating agents for how they approach tasks at a procedural level given their traces in a top-down fashion. We believe this work has a range of applications to help developers work with and program coding agents, such as task-aware model routing, agent monitoring, and finer-grained cost analysis.

2. [Towards LLM Accelerated Rapid Reviews for Software Tool Discovery -- Case for Log Anomaly Detection](https://arxiv.org/abs/2606.16839v1)
   - Published: 2026-06-15 23:17
   - Authors: Jesse Nyyssölä, Hamza Bin Mazhar, Alexander Bakhtin, Matteo Esposito, Nana Reinikainen, Yuqing Wang, et al.
   - Source: arXiv
   - Relevance score: 44
   - Match reasons: summary matched "coding agent"; has PDF; has rich summary; has complete metadata
   - Categories: cs.SE
   - Tags: Evaluation / Application / Method
   - Topics: LLM / Agent
   - PDF: https://arxiv.org/pdf/2606.16839v1
   - Abstract: In software engineering research, the primary outcome is frequently a tool. However, for practitioners and academics alike, it is hard to tell which tools are maintained and whether they work out of the box. In this paper, we propose a pipeline to identify relevant studies with LLM screening, extract the tools presented in them, and run them with an LLM-based coding agent. To evaluate the feasibility of our approach, we focus on software log anomaly detection tools. We begin the study by designing a broad search string that yields 3233 hits from Scopus. We request two LLMs to provide an inclusion probability for each title-abstract pair according to the inclusion and exclusion criteria. From the 3233 exported abstracts, this screening reduced the number of included papers to 569, out of which we could download 470. These papers included 206 unique links, and after manual evaluation we determined 83 to be tools. Finally, we ran the LLM-based coding agent on these 83 links and got 24 successfully running tools. As replicating our approach would require roughly only 4 hours of human effort, of which 3 hours were manual PDF downloading, and 12 hours of LLM running time, this demonstrates promising efficiency when utilizing LLMs in rapid reviews. Because practitioner-built tools often lack academic papers, in the future we aim to expand our analysis to tool-hosting platforms such as GitHub and PyPI. In the future, we plan to formalize our workflow as LLM Agent Skills to make our approach easier to adopt.

3. [No Resource, No Benchmarks, No Problem? Evaluating and Improving LLMs for Code Generation in No-Resource Languages](https://arxiv.org/abs/2606.16827v1)
   - Published: 2026-06-15 23:08
   - Authors: Alessandro Giagnorio, Alberto Martin-Lopez, Gabriele Bavota
   - Source: arXiv
   - Relevance score: 44
   - Match reasons: summary matched "code generation benchmark"; has PDF; has rich summary; has complete metadata
   - Categories: cs.SE
   - Tags: Evaluation / Application / Method
   - Topics: LLM / Language Model
   - PDF: https://arxiv.org/pdf/2606.16827v1
   - Abstract: Large Language Models (LLMs) have significantly advanced the automation of software engineering tasks. One prominent example is code generation, where an LLM produces code in a specified programming language based on a natural language description. Most research in this area has focused on high-resource languages, such as Python or Java, which benefit from abundant training data. A smaller body of work has explored low-resource languages, which are underrepresented in training corpora. In contrast, no-resource languages, for which LLMs have seen virtually no training data, remain largely unstudied. These languages often emerge in industry, where organizations develop proprietary or domain-specific languages unsupported by commercial tools like GitHub Copilot. This results in the need for companies to deploy their own in-house code recommenders. To investigate possible solutions in this context, we build and release three code generation benchmarks for no-resource languages, based on two recently proposed programming languages for which very little training data is available. Using these benchmarks, we experiment with several solutions to teach LLMs about no-resource languages, including prompt-based techniques as well as pre-training and fine-tuning exploiting the little data available. While further pre-training gives the largest performance gains for no-resource languages, applying it directly to instruction-tuned models harms their ability to follow instructions. To address this, we start from a base model, further pre-train it on the target language, and then inject instruction-following capabilities via weight diff transfer from an instruction model. Such an approach significantly improves code generation capabilities in no-resource settings, allowing companies to cheaply deploy a specialized instruct model without dealing with the computational cost of instruction fine-tuning.
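
The last abstract above mentions "weight diff transfer from an instruction model" without spelling out the arithmetic. A common reading, assumed here and not confirmed by the abstract, is an elementwise update: take the difference between the instruction-tuned weights and the original base weights, then add it to the base model that was further pre-trained on the target language. The sketch uses plain Python lists in place of real tensors, and `transfer_weight_diff` and the parameter names are hypothetical.

```python
from typing import Dict, List

Weights = Dict[str, List[float]]  # parameter name -> flattened values (toy stand-in for tensors)


def transfer_weight_diff(base: Weights, instruct: Weights, adapted: Weights) -> Weights:
    """Return adapted + (instruct - base), computed elementwise per parameter.

    base     - original base model weights
    instruct - instruction-tuned weights derived from `base`
    adapted  - `base` after further pre-training on the target language
    All three are assumed to share parameter names and shapes.
    """
    merged: Weights = {}
    for name, adapted_values in adapted.items():
        base_values = base[name]
        instruct_values = instruct[name]
        if not (len(adapted_values) == len(base_values) == len(instruct_values)):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        merged[name] = [
            a + (i - b)
            for a, i, b in zip(adapted_values, instruct_values, base_values)
        ]
    return merged


# Toy weights chosen so the arithmetic is exact in floating point.
base = {"w": [1.0, 2.0]}
instruct = {"w": [1.5, 2.0]}
adapted = {"w": [3.0, 1.0]}
merged = transfer_weight_diff(base, instruct, adapted)
```

Here the instruction diff is `[0.5, 0.0]`, so the merged parameter is `[3.5, 1.0]`. The appeal of this kind of update, as the abstract notes, is that it avoids a separate instruction fine-tuning run on the adapted model.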
