# 每日论文简报

- 生成时间：2026-05-06 12:37:23 (Asia/Shanghai)
- 检索窗口：最近 24 小时
- 命中概览：LM=15
- 排序策略：hybrid (relevance first, published_at tie-break)

## 今日重点

- 主题「LLM」：命中 12 篇，覆盖 LM，代表论文包括 《Safety and accuracy follow different scaling laws in clinical large language models》、《Say the Mission, Execute the Swarm: Agent-Enhanced LLM Reasoning in the Web-of-Drones》。
- 主题「Language Model」：命中 12 篇，覆盖 LM，代表论文包括 《Safety and accuracy follow different scaling laws in clinical large language models》、《Say the Mission, Execute the Swarm: Agent-Enhanced LLM Reasoning in the Web-of-Drones》。
- 主题「Benchmark」：命中 3 篇，覆盖 LM，代表论文包括 《Rethinking Reasoning-Intensive Retrieval: Evaluating and Advancing Retrievers in Agentic Search Systems》、《MCJudgeBench: A Benchmark for Constraint-Level Judge Evaluation in Multi-Constraint Instruction Following》。
- 主题「Agent」：命中 1 篇，覆盖 LM，代表论文包括 《Rethinking Reasoning-Intensive Retrieval: Evaluating and Advancing Retrievers in Agentic Search Systems》。
- 主题「Large Language Model」：命中 1 篇，覆盖 LM，代表论文包括 《Benchmarking Parameter-Efficient Fine-Tuning of Large Language Models for Low-Resource Tajik Text Generation with the Tajik Web Corpus》。

## 主题聚焦

### LLM

- 命中篇数：12
- 覆盖分组：LM
- 代表论文：《Safety and accuracy follow different scaling laws in clinical large language models》、《Say the Mission, Execute the Swarm: Agent-Enhanced LLM Reasoning in the Web-of-Drones》、《OracleProto: A Reproducible Framework for Benchmarking LLM Native Forecasting via Knowledge Cutoff and Temporal Masking》
- 主题速读：
  - 《Safety and accuracy follow different scaling laws in clinical large language models》〔评测 / 应用 / 方法〕：Clinical LLMs are often scaled by increasing model size, context length, retrieval complexity, or inference-time compute, with the implicit expectation that hi…
  - 《Say the Mission, Execute the Swarm: Agent-Enhanced LLM Reasoning in the Web-of-Drones》〔应用 / 方法〕：Large Language Models (LLMs) are increasingly explored as high-level reasoning engines for cyber-physical systems, yet their application to real-time UAV swarm…

### Language Model

- 命中篇数：12
- 覆盖分组：LM
- 代表论文：《Safety and accuracy follow different scaling laws in clinical large language models》、《Say the Mission, Execute the Swarm: Agent-Enhanced LLM Reasoning in the Web-of-Drones》、《OracleProto: A Reproducible Framework for Benchmarking LLM Native Forecasting via Knowledge Cutoff and Temporal Masking》
- 主题速读：
  - 《Safety and accuracy follow different scaling laws in clinical large language models》〔评测 / 应用 / 方法〕：Clinical LLMs are often scaled by increasing model size, context length, retrieval complexity, or inference-time compute, with the implicit expectation that hi…
  - 《Say the Mission, Execute the Swarm: Agent-Enhanced LLM Reasoning in the Web-of-Drones》〔应用 / 方法〕：Large Language Models (LLMs) are increasingly explored as high-level reasoning engines for cyber-physical systems, yet their application to real-time UAV swarm…

### Benchmark

- 命中篇数：3
- 覆盖分组：LM
- 代表论文：《Rethinking Reasoning-Intensive Retrieval: Evaluating and Advancing Retrievers in Agentic Search Systems》、《MCJudgeBench: A Benchmark for Constraint-Level Judge Evaluation in Multi-Constraint Instruction Following》、《MOSAIC-Bench: Measuring Compositional Vulnerability Induction in Coding Agents》
- 主题速读：
  - 《Rethinking Reasoning-Intensive Retrieval: Evaluating and Advancing Retrievers in Agentic Search Systems》〔评测 / 数据 / 应用 / 方法〕：Reasoning-intensive retrieval aims to surface evidence that supports downstream reasoning rather than merely matching topical similarity. This capability is in…
  - 《MCJudgeBench: A Benchmark for Constraint-Level Judge Evaluation in Multi-Constraint Instruction Following》〔评测 / 方法〕：Multi-constraint instruction following requires verifying whether a response satisfies multiple individual requirements, yet LLM judges are often assessed only…

### Agent

- 命中篇数：1
- 覆盖分组：LM
- 代表论文：《Rethinking Reasoning-Intensive Retrieval: Evaluating and Advancing Retrievers in Agentic Search Systems》
- 主题速读：
  - 《Rethinking Reasoning-Intensive Retrieval: Evaluating and Advancing Retrievers in Agentic Search Systems》〔评测 / 数据 / 应用 / 方法〕：Reasoning-intensive retrieval aims to surface evidence that supports downstream reasoning rather than merely matching topical similarity. This capability is in…

### Large Language Model

- 命中篇数：1
- 覆盖分组：LM
- 代表论文：《Benchmarking Parameter-Efficient Fine-Tuning of Large Language Models for Low-Resource Tajik Text Generation with the Tajik Web Corpus》
- 主题速读：
  - 《Benchmarking Parameter-Efficient Fine-Tuning of Large Language Models for Low-Resource Tajik Text Generation with the Tajik Web Corpus》〔评测 / 数据 / 方法〕：This paper is devoted to the adaptation of generative large language models for the Tajik language, a low-resource language with Cyrillic script. To overcome t…

## LM 观察

### 本组速览

- 《Safety and accuracy follow different scaling laws in clinical large language models》〔评测 / 应用 / 方法〕：Clinical LLMs are often scaled by increasing model size, context length, retrieval complexity, or inference-time compute, with the implicit expectation that hi…
- 《Say the Mission, Execute the Swarm: Agent-Enhanced LLM Reasoning in the Web-of-Drones》〔应用 / 方法〕：Large Language Models (LLMs) are increasingly explored as high-level reasoning engines for cyber-physical systems, yet their application to real-time UAV swarm…
- 《OracleProto: A Reproducible Framework for Benchmarking LLM Native Forecasting via Knowledge Cutoff and Temporal Masking》〔评测 / 数据 / 应用 / 方法〕：Large language models are moving from static text generators toward real-world decision-support systems, where forecasting is a composite capability that links…
- 《Rethinking Reasoning-Intensive Retrieval: Evaluating and Advancing Retrievers in Agentic Search Systems》〔评测 / 数据 / 应用 / 方法〕：Reasoning-intensive retrieval aims to surface evidence that supports downstream reasoning rather than merely matching topical similarity. This capability is in…
- 《Benchmarking Parameter-Efficient Fine-Tuning of Large Language Models for Low-Resource Tajik Text Generation with the Tajik Web Corpus》〔评测 / 数据 / 方法〕：This paper is devoted to the adaptation of generative large language models for the Tajik language, a low-resource language with Cyrillic script. To overcome t…

### 论文速览

1. [Safety and accuracy follow different scaling laws in clinical large language models](https://arxiv.org/abs/2605.04039v1)
   - Published：2026-05-06 01:57
   - 作者：Sebastian Wind，Tri-Thien Nguyen，Jeta Sopa，Mahshad Lotfinia，Sebastian Bickelhaup，Michael Uder 等
   - 来源：arxiv
   - 相关性分数：201
   - 命中原因：title matched "language model"; title matched "large language model"; summary matched "LLM"; summary matched "agent"
   - 分类：cs.CL, cs.AI, cs.LG
   - 标签：评测 / 应用 / 方法
   - 主题词：LLM / Language Model
   - PDF：https://arxiv.org/pdf/2605.04039v1
   - 摘要：Clinical LLMs are often scaled by increasing model size, context length, retrieval complexity, or inference-time compute, with the implicit expectation that higher accuracy implies safer behavior. This assumption is incomplete in medicine, where a few confident, high-risk, or evidence-contradicting errors can matter more than average benchmark performance. We introduce SaFE-Scale, a framework for measuring how clinical LLM safety changes across model scale, evidence quality, retrieval strategy, context exposure, and inference-time compute. To instantiate this framework, we introduce RadSaFE-200, a Radiology Safety-Focused Evaluation benchmark of 200 multiple-choice questions with clinician-defined clean evidence, conflict evidence, and option-level labels for high-risk error, unsafe answer, and evidence contradiction. We evaluated 34 locally deployed LLMs across six deployment conditions: closed-book prompting (zero-shot), clean evidence, conflict evidence, standard RAG, agentic RAG, and max-context prompting. Clean evidence produced the strongest improvement, increasing mean accuracy from 73.5% to 94.1%, while reducing high-risk error from 12.0% to 2.6%, contradiction from 12.7% to 2.3%, and dangerous overconfidence from 8.0% to 1.6%. Standard RAG and agentic RAG did not reproduce this safety profile: agentic RAG improved accuracy over standard RAG and reduced contradiction, but high-risk error and dangerous overconfidence remained elevated. Max-context prompting increased latency without closing the safety gap, and additional inference-time compute produced only limited gains. Worst-case analysis showed that clinically consequential errors concentrated in a small subset of questions. Clinical LLM safety is therefore not a passive consequence of scaling, but a deployment property shaped by evidence quality, retrieval design, context construction, and collective failure behavior.

2. [Say the Mission, Execute the Swarm: Agent-Enhanced LLM Reasoning in the Web-of-Drones](https://arxiv.org/abs/2605.03788v1)
   - Published：2026-05-05 22:14
   - 作者：Andrea Iannoli，Lorenzo Gigli，Luca Sciullo，Angelo Trotta，Marco Di Felice
   - 来源：arxiv
   - 相关性分数：183
   - 命中原因：title matched "LLM"; title matched "reasoning"; title matched "agent"; summary matched "language model"
   - 分类：cs.AI, cs.NI, cs.RO
   - 标签：应用 / 方法
   - 主题词：LLM / Language Model
   - PDF：https://arxiv.org/pdf/2605.03788v1
   - 摘要：Large Language Models (LLMs) are increasingly explored as high-level reasoning engines for cyber-physical systems, yet their application to real-time UAV swarm management remains challenging due to heterogeneous interfaces, limited grounding, and the need for long-running closed-loop execution. This paper presents a mission-agnostic, agent-enhanced LLM framework for UAV swarm control, where users express mission objectives in natural language and the system autonomously executes them through grounded, real-time interactions. The proposed architecture combines an LLM-based Agent Core with a Model Context Protocol (MCP) gateway and a Web-of-Drones abstraction based on W3C Web of Things (WoT) standards. By exposing drones, sensors, and services as standardized WoT Things, the framework enables structured tool-based interaction, continuous state observation, and safe actuation without relying on code generation. We evaluate the framework using ArduPilot-based simulation across four swarm missions and six state-of-the-art LLMs. Results show that, despite strong reasoning abilities, current general-purpose LLMs still struggle to achieve reliable execution - even for simple swarm tasks - when operating without explicit grounding and execution support. Task-specific planning tools and runtime guardrails substantially improve robustness, while token consumption alone is not indicative of execution quality or reliability.

3. [OracleProto: A Reproducible Framework for Benchmarking LLM Native Forecasting via Knowledge Cutoff and Temporal Masking](https://arxiv.org/abs/2605.03762v1)
   - Published：2026-05-05 21:50
   - 作者：Yiding Ma，Chengyun Ruan，Kaibo Huang，Zhongliang Yang，Linna Zhou
   - 来源：arxiv
   - 相关性分数：161
   - 命中原因：title matched "LLM"; title matched "benchmark"; summary matched "language model"; summary matched "large language model"
   - 分类：cs.AI
   - 标签：评测 / 数据 / 应用 / 方法
   - 主题词：LLM / Language Model
   - PDF：https://arxiv.org/pdf/2605.03762v1
   - 摘要：Large language models are moving from static text generators toward real-world decision-support systems, where forecasting is a composite capability that links information gathering, evidence integration, situational judgment, and action-oriented decision making. This capability is in broad demand across finance, policy, industry, and scientific research, yet its evaluation remains difficult: live benchmarks evaluate forecasts before answers exist, making them the cleanest way to measure forecasting ability, but they expire once events resolve; retrospective benchmarks are reproducible, but they cannot reliably distinguish genuine forecasting from facts a model may have already learned during pretraining. Prompting models to "pretend not to know" cannot replace a genuine knowledge boundary. We propose OracleProto, a reproducible framework for evaluating LLM native forecasting capability. OracleProto reconstructs resolved events into time-bounded forecasting samples by combining model-cutoff-aligned sample admission, tool-level temporal masking, content-level leakage detection, discrete answer normalization, and hierarchical scoring. Instantiated on a FutureX-Past-derived dataset with six contemporary LLMs, OracleProto distinguishes forecasting quality, sampling stability, and cost efficiency under controlled information boundaries, while reducing residual leakage to the $1\%$ level, an order of magnitude below tool-only temporal filtering. OracleProto turns LLM forecasting from one-off evaluation into an auditable, reusable, and trainable dataset-level capability, providing a unified interface for fair cross-model comparison and a controlled signal source for downstream SFT and RL. Code and data are available at https://github.com/MaYiding/OracleProto and https://huggingface.co/datasets/MaYiding/OracleProto.

4. [Rethinking Reasoning-Intensive Retrieval: Evaluating and Advancing Retrievers in Agentic Search Systems](https://arxiv.org/abs/2605.04018v1)
   - Published：2026-05-06 01:42
   - 作者：Yilun Zhao，Jinbiao Wei，Tingyu Song，Siyue Zhang，Chen Zhao，Arman Cohan
   - 来源：arxiv
   - 相关性分数：147
   - 命中原因：title matched "reasoning"; title matched "agent"; summary matched "benchmark"; summary matched "evaluation"
   - 分类：cs.CL, cs.IR
   - 标签：评测 / 数据 / 应用 / 方法
   - 主题词：Benchmark / Agent
   - PDF：https://arxiv.org/pdf/2605.04018v1
   - 摘要：Reasoning-intensive retrieval aims to surface evidence that supports downstream reasoning rather than merely matching topical similarity. This capability is increasingly important for agentic search systems, where retrievers must provide complementary evidence across iterative search and synthesis. However, existing work remains limited on both evaluation and training: benchmarks such as BRIGHT provide narrow gold sets and evaluate retrievers in isolation, while synthetic training corpora often optimize single-passage relevance rather than evidence portfolio construction. We introduce BRIGHT-Pro, an expert-annotated benchmark that expands each query with multi-aspect gold evidence and evaluates retrievers under both static and agentic search protocols. We further construct RTriever-Synth, an aspect-decomposed synthetic corpus that generates complementary positives and positive-conditioned hard negatives, and use it to LoRA fine-tune RTriever-4B from Qwen3-Embedding-4B. Experiments across lexical, general-purpose, and reasoning-intensive retrievers show that aspect-aware and agentic evaluation expose behaviors hidden by standard metrics, while RTriever-4B substantially improves over its base model.

5. [Benchmarking Parameter-Efficient Fine-Tuning of Large Language Models for Low-Resource Tajik Text Generation with the Tajik Web Corpus](https://arxiv.org/abs/2605.03742v1)
   - Published：2026-05-05 21:28
   - 作者：Mullosharaf K. Arabov
   - 来源：arxiv
   - 相关性分数：146
   - 命中原因：title matched "language model"; title matched "large language model"; title matched "benchmark"; has PDF
   - 分类：cs.CL
   - 标签：评测 / 数据 / 方法
   - 主题词：Language Model / Large Language Model
   - PDF：https://arxiv.org/pdf/2605.03742v1
   - 摘要：This paper is devoted to the adaptation of generative large language models for the Tajik language, a low-resource language with Cyrillic script. To overcome the shortage of digital text resources, the author created and publicly released the Tajik Web Corpus, the largest open-access corpus of Tajik, comprising 319,298 documents (~1.11 billion characters). On a subsample of 10,000 documents, 17 configurations were benchmarked, covering autoregressive, encoder-decoder, and encoder-only models with three fine-tuning strategies: full fine-tuning, LoRA, and QLoRA (ranks 8 and 16). Quality was assessed via perplexity and cross-entropy loss; peak GPU memory and training time were also recorded. Best results were achieved by Mistral 7B with QLoRA (r=16): mean perplexity 5.03, standard deviation 0.03. Increasing rank from 8 to 16 gave statistically insignificant improvement while raising memory consumption. For small GPT-2 family models, full fine-tuning yielded lower perplexity (3.48 for GPT-2 Medium) than LoRA (7.60-8.42), but induced catastrophic forgetting. The encoder-only XLM-RoBERTa showed the worst results (perplexity 59.3). The novelty lies in creating the largest verified Tajik corpus and the first systematic analysis of PEFT effectiveness for Tajik text generation. Practical value lies in recommendations for architecture and fine-tuning strategy selection, optimizing computational costs without substantial quality loss.

6. [MCJudgeBench: A Benchmark for Constraint-Level Judge Evaluation in Multi-Constraint Instruction Following](https://arxiv.org/abs/2605.03858v1)
   - Published：2026-05-05 23:20
   - 作者：Jaeyun Lee，Junyoung Koh，Zeynel Tok，Hunar Batra，Ronald Clark
   - 来源：arxiv
   - 相关性分数：144
   - 命中原因：title matched "benchmark"; title matched "evaluation"; summary matched "LLM"; summary matched "reasoning"
   - 分类：cs.CL, cs.AI
   - 标签：评测 / 方法
   - 主题词：LLM / Benchmark
   - PDF：https://arxiv.org/pdf/2605.03858v1
   - 摘要：Multi-constraint instruction following requires verifying whether a response satisfies multiple individual requirements, yet LLM judges are often assessed only through overall-response judgments. We introduce MCJudgeBench, a benchmark for constraint-level judge evaluation in multi-constraint instruction following. Each instance includes an instruction, a candidate response, an explicit constraint list, per-constraint gold labels in {yes, partial, no}, and controlled response-side perturbations. The evaluation protocol further includes evaluation prompt variants to test judge stability. We evaluate proprietary and open-source LLM judges using both correctness and inconsistency metrics, distinguishing intrinsic inconsistency under stochastic decoding from procedural inconsistency under prompt and response perturbations. Our results show that judge reliability has multiple dimensions: strong overall performance does not guarantee equally reliable detection across label categories, especially for rarer partial and no cases. Judges with higher correctness do not always have lower inconsistency. Evaluation with reasoning improves correctness but does not uniformly improve stability. These findings motivate evaluating LLM judges at the constraint level to study these failure modes.

7. [OpenSeeker-v2: Pushing the Limits of Search Agents with Informative and High-Difficulty Trajectories](https://arxiv.org/abs/2605.04036v1)
   - Published：2026-05-06 01:55
   - 作者：Yuwen Du，Rui Ye，Shuo Tang，Keduan Huang，Xinyu Zhu，Yuzhu Cai 等
   - 来源：arxiv
   - 相关性分数：143
   - 命中原因：title matched "agent"; summary matched "language model"; summary matched "large language model"; summary matched "LLM"
   - 分类：cs.AI, cs.CL
   - 标签：评测 / 应用 / 方法
   - 主题词：LLM / Language Model
   - PDF：https://arxiv.org/pdf/2605.04036v1
   - 摘要：Deep search capabilities have become an indispensable competency for frontier Large Language Model (LLM) agents, yet their development remains dominated by industrial giants. The typical industry recipe involves a highly resource-intensive pipeline spanning pre-training, continual pre-training (CPT), supervised fine-tuning (SFT), and reinforcement learning (RL). In this report, we show that when fueled with informative and high-difficulty trajectories, a simple SFT approach could be surprisingly powerful for training frontier search agents. By introducing three simple data synthesis modifications: scaling knowledge graph size for richer exploration, expanding the tool set size for broader functionality, and strict low-step filtering, we establish a stronger baseline. Trained on merely 10.6k data points, our OpenSeeker-v2 achieves state-of-the-art performance across 4 benchmarks (30B-sized agents with ReAct paradigm): 46.0% on BrowseComp, 58.1% on BrowseComp-ZH, 34.6% on Humanity's Last Exam, and 78.0% on xbench, surpassing even Tongyi DeepResearch trained with heavy CPT+SFT+RL pipeline, which achieves 43.4%, 46.7%, 32.9%, and 75.0%, respectively. Notably, OpenSeeker-v2 represents the first state-of-the-art search agent within its model scale and paradigm to be developed by a purely academic team using only SFT. We are excited to open-source the OpenSeeker-v2 model weights and share our simple yet effective findings to make frontier search agent research more accessible to the community.

8. [Physics-Grounded Multi-Agent Architecture for Traceable, Risk-Aware Human-AI Decision Support in Manufacturing](https://arxiv.org/abs/2605.04003v1)
   - Published：2026-05-06 01:24
   - 作者：Danny Hoang，Ryan Matthiessen，Christopher Miller，Nasir Mannan，Ruby ElKharboutly，David Gorsich 等
   - 来源：arxiv
   - 相关性分数：142
   - 命中原因：title matched "agent"; summary matched "language model"; summary matched "large language model"; summary matched "LLM"
   - 分类：cs.MA, cs.AI, cs.IR
   - 标签：评测 / 应用 / 方法
   - 主题词：LLM / Language Model
   - PDF：https://arxiv.org/pdf/2605.04003v1
   - 摘要：High-precision CNC machining of free-form aerospace components requires bounded compensations informed by inspection, simulation, and process knowledge. Off-the-shelf large language model (LLM) assistants can generate text, but they do not reliably execute risk-constrained multi-step numerical workflows or provide auditable provenance for high-stakes decisions. We present multi-agent knowledge analysis (MAKA), a human-in-the-loop decision-support architecture that separates intent routing, tools-only quantitative analysis, knowledge graph retrieval, and critic-based verification that enforces physical plausibility, safety bounds, and provenance completeness before recommendations are surfaced for human approval. MAKA is instantiated on a Ti-6Al-4V rotor blade machining testbed by fusing virtual-machining path-tracking error fields, cutting-force and deflection simulations, and scan-based 3D inspection deviation maps from 16 blades. The analysis decomposes deviation into an evidence-linked pathing component, a drift-based wear proxy capturing systematic evolution across parts, a residual systematic compliance term, and a variability proxy for instability-aware escalation. In a three-level tool-orchestration benchmark (single-step through $\geq$3-step stateful sequences), MAKA improves successful tool execution by up to 87.5 percentage points relative to an unstructured single-model interaction pattern with identical tool access. Digital twin what-if studies show MAKA can coordinate traceable compensation candidates that reduce predicted surface deviation from order $10^{-2}$in to approximately $\pm 10^{-3}$in over most of the blade within the simulation environment, providing a pre-deployment verification signal for risk-aware human decision-making.

9. [Logical Consistency as a Bridge: Improving LLM Hallucination Detection via Label Constraint Modeling between Responses and Self-Judgments](https://arxiv.org/abs/2605.03971v1)
   - Published：2026-05-06 00:53
   - 作者：Hao Mi，Qiang Sheng，Shaofei Wang，Beizhe Hu，Yifan Sun，Zhengjia Wang 等
   - 来源：arxiv
   - 相关性分数：142
   - 命中原因：title matched "LLM"; summary matched "language model"; summary matched "large language model"; summary matched "reasoning"
   - 分类：cs.CL
   - 标签：数据 / 应用 / 方法
   - 主题词：LLM / Language Model
   - PDF：https://arxiv.org/pdf/2605.03971v1
   - 摘要：Large Language Models (LLMs) are prone to factual hallucinations, risking their reliability in real-world applications. Existing hallucination detectors mainly extract micro-level intrinsic patterns for uncertainty quantification or elicit macro-level self-judgments through verbalized prompts. However, these methods address only a single facet of the hallucination, focusing either on implicit neural uncertainty or explicit symbolic reasoning, thereby treating these inherently coupled behaviors in isolation and failing to exploit their interdependence for a holistic view. In this paper, we propose LaaB (Logical Consistency-as-a-Bridge), a framework that bridges neural features and symbolic judgments for hallucination detection. LaaB introduces a "meta-judgment" process to map symbolic labels back into the feature space. By leveraging the inherent logical bridge where response and meta-judgment labels are either the same or opposite based on the self-judgment's semantics, LaaB aligns and integrates dual-view signals via mutual learning and enhances the hallucination detection. Extensive experiments on 4 public datasets, across 4 LLMs, against 8 baselines demonstrate the superiority of LaaB.

10. [MOSAIC-Bench: Measuring Compositional Vulnerability Induction in Coding Agents](https://arxiv.org/abs/2605.03952v1)
   - Published：2026-05-06 00:38
   - 作者：Jonathan Steinberg，Oren Gal
   - 来源：arxiv
   - 相关性分数：142
   - 命中原因：title matched "agent"; summary matched "alignment"; summary matched "RAG"; summary matched "benchmark"
   - 分类：cs.CR, cs.AI, cs.SE
   - 标签：评测 / 数据 / 应用 / 方法
   - 主题词：Benchmark / RAG
   - PDF：https://arxiv.org/pdf/2605.03952v1
   - 摘要：Coding agents often pass per-prompt safety review yet ship exploitable code when their tasks are decomposed into routine engineering tickets. The challenge is structural: existing safety alignment evaluates overt requests in isolation, leaving models blind to malicious end-states that emerge from sequenced compliance with innocuous-looking requests. We introduce MOSAIC-Bench (Malicious Objectives Sequenced As Innocuous Compliance), a benchmark of 199 three-stage attack chains paired with deterministic exploit oracles on deployed software substrates (10 web-application substrates, 31 CWE classes, 5 programming languages) that treats both exploit ground truth and downstream reviewer protocol as first-class evaluation axes. On this benchmark, nine production coding agents from Anthropic, OpenAI, Google, Moonshot, Zhipu, and Minimax compose innocuous tickets at 53-86% end-to-end ASR with only two refusals across all staged runs. In a matched direct-prompt experiment over four frontier Claude/Codex agents, vulnerable-output rates fall to 0-20.4%: Claude primarily refuses, while Codex primarily hardens rather than emitting the vulnerable implementation - ticket staging silences both defense modes simultaneously. Downstream, code reviewer agents approve 25.8% of these confirmed-vulnerable cumulative diffs as routine PRs, and a full-context implementation protocol closes only 50% of the staged/direct gap, ruling out context fragmentation as the sole explanation. As a deployable but non-adaptive mitigation, reframing the reviewer as an adversarial pentester reduces evasion across the evaluated reviewer subset; pentester framed evasion ranges from 3.0% to 17.6%, and an open-weight Gemma-4-E4B-it reviewer under this framing detects 88.4% of attacks on the dataset with a 4.6% false-positive rate measured on 608 real-world GitHub PRs.

11. [ELAS: Efficient Pre-Training of Low-Rank Large Language Models via 2:4 Activation Sparsity](https://arxiv.org/abs/2605.03667v1)
   - Published：2026-05-05 20:04
   - 作者：Jiaxi Li，Lu Yin，Li Shen，Jinjin Xu，Yuhui Liu，Wenwu Wang 等
   - 来源：arxiv
   - 相关性分数：141
   - 命中原因：title matched "language model"; title matched "large language model"; summary matched "LLM"; summary matched "RAG"
   - 分类：cs.LG, cs.AI
   - 标签：方法
   - 主题词：LLM / Language Model
   - PDF：https://arxiv.org/pdf/2605.03667v1
   - 摘要：Large Language Models (LLMs) have achieved remarkable capabilities, but their immense computational demands during training remain a critical bottleneck for widespread adoption. Low-rank training has received attention in recent years due to its ability to significantly reduce training memory usage. Meanwhile, applying 2:4 structured sparsity to weights and activations to leverage NVIDIA GPU support for 2:4 structured sparse format has become a promising direction. However, existing low-rank methods often leave activation matrices in full-rank, which dominates memory consumption and limits throughput during large-batch training. Furthermore, directly applying sparsity to weights often leads to non-negligible performance degradation. To achieve efficient pre-training of LLMs, this paper proposes ELAS: Efficient pre-training of Low-rank LLMs via 2:4 Activation Sparsity, a novel framework for low-rank models via 2:4 activation sparsity. ELAS applies squared ReLU activation functions to the feed-forward networks in low-rank models and implements 2:4 structured sparsity on the activations after the squared ReLU operation. We evaluated ELAS through pre-training experiments on LLaMA models ranging from 60M to 1B parameters. The results demonstrate that ELAS maintains performance with minimal degradation after applying 2:4 activation sparsity, while achieving training and inference acceleration. Moreover, ELAS reduces activation memory overhead, particularly with large batch sizes. Code is available at ELAS Repo.

12. [TriBench-Ko: Evaluating LLM Risks in Judicial Workflows](https://arxiv.org/abs/2605.03792v1)
   - Published：2026-05-05 22:20
   - 作者：Haesung Lee，Gyubin Choi，Eun-Ju Lee，So-Min Lee，Youkang Ko，Dogyoon Lim 等
   - 来源：arxiv
   - 相关性分数：139
   - 命中原因：title matched "LLM"; summary matched "language model"; summary matched "large language model"; summary matched "benchmark"
   - 分类：cs.CL
   - 标签：评测 / 数据 / 应用 / 方法
   - 主题词：LLM / Language Model
   - PDF：https://arxiv.org/pdf/2605.03792v1
   - 摘要：Large language models (LLMs) are increasingly integrated into legal workflows. However, existing benchmarks primarily address proxy tasks, such as bar examination performance or classification, which fail to capture the performance and risks inherent in day-to-day judicial processes. To address this, we publicly release TriBench-Ko, a Korean benchmark designed to evaluate potential deployment risks of LLMs within the context of verified judicial task requirements. It covers four core tasks: jurisprudence summarization, precedent retrieval, legal issue extraction, and evidence analysis. It jointly assesses model behavior across multiple deployment risk categories, including inaccuracy (hallucination, omission, statutory misapplication), biases (demographic, overcompliance), inconsistencies (prompt sensitivity, non-determinism), and adjudicative overreach. Each item is structured to systematically assess both task performance and a specific risk type based on real judicial decisions. Our evaluation of a range of contemporary LLMs reveals that many models frequently manifest significant risks, most notably struggling with precedent retrieval and failing to capture critical legal information. We provide a comprehensive diagnosis of these LLMs and pinpoint critical areas where LLM-generated outputs in judicial contexts necessitate rigorous inspection and caution. Our dataset and code are available at https://github.com/holi-lab/TriBench-Ko

13. [Nora: Normalized Orthogonal Row Alignment for Scalable Matrix Optimizer](https://arxiv.org/abs/2605.03769v1)
   - Published：2026-05-05 22:00
   - 作者：Jinghui Yuan，Jiaxuan Zou，Shuo Wang，Yong Liu，Feiping Nie
   - 来源：arxiv
   - 相关性分数：139
   - 命中原因：title matched "alignment"; summary matched "language model"; summary matched "large language model"; summary matched "LLM"
   - 分类：cs.LG
   - 标签：方法
   - 主题词：LLM / Language Model
   - PDF：https://arxiv.org/pdf/2605.03769v1
   - 摘要：Matrix-based optimizers have demonstrated immense potential in training Large Language Models (LLMs), however, designing an ideal optimizer remains a formidable challenge. A superior optimizer must satisfy three core desiderata: efficiency, achieving Muon-like preconditioning to accelerate optimization; stability, strictly adhering to the scale-invariance inherent in neural networks; and speed, minimizing computational overhead. While existing methods address these aspects to varying degrees, they often fail to unify them, either incurring prohibitive computational costs like Muon, or allowing radial jitters that compromise stability like RMNP. To bridge this gap, we propose Nora, an optimizer that rigorously satisfies all three requirements. Nora achieves training stability by explicitly stabilizing weight norms and angular velocities through row-wise momentum projection onto the orthogonal complement of the weights. Simultaneously, by leveraging the block-diagonal dominance of the Transformer Hessian, Nora effectively approximates structured preconditioning while maintaining an optimal computational complexity of $\mathcal{O}(mn)$. Furthermore, we prove that Nora is a scalable optimizer and establish its corresponding scaling theorems. With a streamlined implementation requiring only two lines of code, our preliminary experiments validate Nora as an efficient and highly promising optimizer for large-scale training.

14. [SERE: Structural Example Retrieval for Enhancing LLMs in Event Causality Identification](https://arxiv.org/abs/2605.03701v1)
   - Published：2026-05-05 20:50
   - 作者：Zhifeng Hao，Zhongjie Chen，Junhao Lu，Shengyin Yu，Guimin Hu，Keli Zhang 等
   - 来源：arxiv
   - 相关性分数：138
   - 命中原因：title matched "LLM"; summary matched "language model"; summary matched "large language model"; summary matched "reasoning"
   - 分类：cs.CL, cs.AI
   - 标签：评测 / 数据 / 方法
   - 主题词：LLM / Language Model
   - PDF：https://arxiv.org/pdf/2605.03701v1
   - 摘要：Event Causality Identification (ECI) requires models to determine whether a given pair of events in a context exhibits a causal relationship. While Large Language Models (LLMs) have demonstrated strong performance across various NLP tasks, their effectiveness in ECI remains limited due to biases in causal reasoning, often leading to overprediction of causal relationships (causal hallucination). To mitigate these issues and enhance LLM performance in ECI, we propose SERE, a structural example retrieval framework that leverages LLMs' few-shot learning capabilities. SERE introduces an innovative retrieval mechanism based on three structural concepts: (i) Conceptual Path Metric, which measures the conceptual relationship between events using edit distance in ConceptNet; (ii) Syntactic Metric, which quantifies structural similarity through tree edit distance on syntactic trees; and (iii) Causal Pattern Filtering, which filters examples based on predefined causal structures using LLMs. By integrating these structural retrieval strategies, SERE selects more relevant examples to guide LLMs in causal reasoning, mitigating bias and improving accuracy in ECI tasks. Extensive experiments on multiple ECI datasets validate the effectiveness of SERE. The source code is publicly available at https://github.com/DMIRLAB-Group/SERE.

15. [From Code to Prediction: Fine-Tuning LLMs for Neural Network Performance Classification in NNGPT](https://arxiv.org/abs/2605.03686v1)
   - Published：2026-05-05 20:30
   - 作者：Mahmoud Hanouneh，Radu Timofte，Dmitry Ignatov
   - 来源：arxiv
   - 相关性分数：137
   - 命中原因：title matched "LLM"; summary matched "language model"; summary matched "large language model"; summary matched "reasoning"
   - 分类：cs.LG, cs.CV
   - 标签：评测 / 数据 / 方法
   - 主题词：LLM / Language Model
   - PDF：https://arxiv.org/pdf/2605.03686v1
   - 摘要：Automated Machine Learning (AutoML) frameworks increasingly leverage Large Language Models (LLMs) for tasks such as hyperparameter optimization and neural architecture code generation. However, current LLM-based approaches focus on generative outputs and evaluate them by training the produced artifacts. Whether LLMs can learn to reason about neural network performance across datasets remains underexplored. We present a classification task integrated into the NNGPT framework, in which a fine-tuned LLM predicts which of two image classification datasets a given neural network architecture achieves higher accuracy on. The task is built on the LEMUR dataset, which provides standardized PyTorch implementations with reproducible performance metrics. Three prompt configurations of increasing difficulty are evaluated: a normalized-accuracy baseline (trivially reaching 100%), a metadata-enriched prompt replacing accuracies with dataset properties, and a code-only prompt presenting only architecture source code and dataset names. Using DeepSeek-Coder-7B-Instruct fine-tuned with LoRA, the code-only prompt reaches 80% peak accuracy over 15 epochs, while the metadata prompt peaks at 70%. Perdataset analysis reveals complementary strengths: metadata excels for datasets with distinctive properties (CelebAGender at 90.9%) but degrades for overlapping characteristics, whereas the code-only prompt shows more balanced performance. A comparison with DeepSeek-Coder1.3B confirms that model capacity affects this form of architectural reasoning. The results establish that LLMs can be fine-tuned to predict cross-dataset suitability from neural network code, suggesting that architecture source code contains richer discriminative signal than dataset metadata alone.
