# Daily Paper Digest

- Generated at: 2026-05-05 12:20:54 (Asia/Shanghai)
- Search window: last 24 hours
- Hits overview: LM=15
- Ranking strategy: hybrid (relevance first, published_at tie-break)
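
The hybrid ranking named above (relevance first, `published_at` as tie-break) can be sketched as a two-pass stable sort. This is a minimal sketch, not the digest generator's actual code: the field names `score` and `published_at` are hypothetical, and breaking ties in favor of the newer paper is an assumption.

```python
from datetime import datetime


def hybrid_sort(papers):
    """Order papers by relevance score, highest first.

    Ties on score are broken by publication time, newest first
    (assumed direction; the digest does not state it).
    """
    # Pass 1: secondary key. Newest papers come first.
    by_date = sorted(papers, key=lambda p: p["published_at"], reverse=True)
    # Pass 2: primary key. Python's sort is stable, so papers with equal
    # scores keep the date order established in pass 1.
    return sorted(by_date, key=lambda p: p["score"], reverse=True)


if __name__ == "__main__":
    # Hypothetical records for illustration only.
    sample = [
        {"id": "a", "score": 197, "published_at": datetime(2026, 5, 4, 9, 0)},
        {"id": "b", "score": 219, "published_at": datetime(2026, 5, 3, 9, 0)},
        {"id": "c", "score": 197, "published_at": datetime(2026, 5, 5, 12, 0)},
    ]
    print([p["id"] for p in hybrid_sort(sample)])
```

Because every entry in this digest carries the same `Published` timestamp, the tie-break has no visible effect here; entries with equal scores simply keep their input order.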

## Today's Highlights

- Topic "Language Model": 15 hits, covering group LM. Representative papers include "StressEval: Failure-Driven Dynamic Benchmarking for Knowledge-Intensive Reasoning in Large Language Models" and "Medmarks: A Comprehensive Open-Source LLM Benchmark Suite for Medical Tasks".
- Topic "Large Language Model": 15 hits, covering group LM. Representative papers include "StressEval: Failure-Driven Dynamic Benchmarking for Knowledge-Intensive Reasoning in Large Language Models" and "Medmarks: A Comprehensive Open-Source LLM Benchmark Suite for Medical Tasks".

## Topic Focus

### Language Model

- Hits: 15
- Groups covered: LM
- Representative papers: "StressEval: Failure-Driven Dynamic Benchmarking for Knowledge-Intensive Reasoning in Large Language Models"; "Medmarks: A Comprehensive Open-Source LLM Benchmark Suite for Medical Tasks"; "Maistros: A Greek Large Language Model Adapted Through Knowledge Distillation From Large Reasoning Models"
- Quick read:
  - "StressEval: Failure-Driven Dynamic Benchmarking for Knowledge-Intensive Reasoning in Large Language Models" [Evaluation / Data / Method]: Static benchmarks for LLMs are increasingly compromised by contamination and overfitting, especially on knowledge-intensive reasoning tasks. While recent dynamic…
  - "Medmarks: A Comprehensive Open-Source LLM Benchmark Suite for Medical Tasks" [Evaluation / Data / Application / Method]: Evaluating large language models (LLMs) for medical applications remains challenging due to benchmark saturation, limited data accessibility, and insufficient…

### Large Language Model

- Hits: 15
- Groups covered: LM
- Representative papers: "StressEval: Failure-Driven Dynamic Benchmarking for Knowledge-Intensive Reasoning in Large Language Models"; "Medmarks: A Comprehensive Open-Source LLM Benchmark Suite for Medical Tasks"; "Maistros: A Greek Large Language Model Adapted Through Knowledge Distillation From Large Reasoning Models"
- Quick read:
  - "StressEval: Failure-Driven Dynamic Benchmarking for Knowledge-Intensive Reasoning in Large Language Models" [Evaluation / Data / Method]: Static benchmarks for LLMs are increasingly compromised by contamination and overfitting, especially on knowledge-intensive reasoning tasks. While recent dynamic…
  - "Medmarks: A Comprehensive Open-Source LLM Benchmark Suite for Medical Tasks" [Evaluation / Data / Application / Method]: Evaluating large language models (LLMs) for medical applications remains challenging due to benchmark saturation, limited data accessibility, and insufficient…

## LM Watch

### Group at a Glance

- "StressEval: Failure-Driven Dynamic Benchmarking for Knowledge-Intensive Reasoning in Large Language Models" [Evaluation / Data / Method]: Static benchmarks for LLMs are increasingly compromised by contamination and overfitting, especially on knowledge-intensive reasoning tasks. While recent dynamic…
- "Medmarks: A Comprehensive Open-Source LLM Benchmark Suite for Medical Tasks" [Evaluation / Data / Application / Method]: Evaluating large language models (LLMs) for medical applications remains challenging due to benchmark saturation, limited data accessibility, and insufficient…
- "Maistros: A Greek Large Language Model Adapted Through Knowledge Distillation From Large Reasoning Models" [Evaluation / Data / Application / Method]: Large Language Models (LLMs) have substantially advanced the field of Natural Language Processing (NLP), achieving state-of-the-art performance across a wide r…
- "MTA: Multi-Granular Trajectory Alignment for Large Language Model Distillation" [Evaluation / Method]: Knowledge distillation is a key technique for compressing large language models (LLMs), but most existing methods align representations at fixed layers or toke…
- "OralMLLM-Bench: Evaluating Cognitive Capabilities of Multimodal Large Language Models in Dental Practice" [Evaluation / Data / Application / Method]: Multimodal large language models (MLLMs) have emerged as a promising paradigm for dental image analysis. However, their ability to capture the multi-level cogn…

### Paper Summaries

1. [StressEval: Failure-Driven Dynamic Benchmarking for Knowledge-Intensive Reasoning in Large Language Models](https://arxiv.org/abs/2605.01939)
   - Published: 2026-05-05 12:00
   - Authors: Yongrui Chen, Yangyang Ma, Xiaoying Huang, Shenyu Zhang, Huajun Chen, Haofen Wang, et al.
   - Source: arxiv
   - Relevance score: 219
   - Match reasons: title matched "language model"; title matched "large language model"; title matched "reasoning"; title matched "benchmark"
   - Categories: cs.CL
   - Tags: Evaluation / Data / Method
   - Topics: Language Model / Large Language Model
   - PDF: https://arxiv.org/pdf/2605.01939
   - Abstract: Static benchmarks for LLMs are increasingly compromised by contamination and overfitting, especially on knowledge-intensive reasoning tasks. While recent dynamic benchmarks can alleviate staleness, they often increase difficulty at the expense of answerability and controllability. In this paper, we propose StressEval, a failure-driven data synthesis framework that turns observed model failures into dynamic, challenging, and controllable test instances. StressEval consists of three stages: first, it constructs a semi-structured difficulty card that identifies the failed reasoning step and its root cause; second, it applies a dual-perspective instance synthesis method that targets both knowledge gaps and reasoning breakdowns while preserving the underlying difficulty factors; and third, it applies a gating mechanism to retain only grounded, unambiguous instances. Seeding from multiple knowledge-intensive reasoning datasets, we employ StressEval to build Dynamic OneEval, a focused suite of challenging dynamic benchmarks. Across several state-of-the-art LLMs, Dynamic OneEval yields substantially larger performance drops than the original benchmarks while retaining explicit difficulty factors, enabling more actionable iteration.

2. [Medmarks: A Comprehensive Open-Source LLM Benchmark Suite for Medical Tasks](https://arxiv.org/abs/2605.01417)
   - Published: 2026-05-05 12:00
   - Authors: Benjamin Warner, Ratna Sagari Grandhi, Max Kieffer, Aymane Ouraq, Saurav Panigrahi, Geetu Ambwani, et al.
   - Source: arxiv
   - Relevance score: 211
   - Match reasons: title matched "LLM"; title matched "benchmark"; summary matched "language model"; summary matched "large language model"
   - Categories: cs.CL, cs.AI
   - Tags: Evaluation / Data / Application / Method
   - Topics: Language Model / Large Language Model
   - PDF: https://arxiv.org/pdf/2605.01417
   - Abstract: Evaluating large language models (LLMs) for medical applications remains challenging due to benchmark saturation, limited data accessibility, and insufficient coverage of relevant tasks. Existing suites have either saturated, heavily depend on restricted datasets, or lack comprehensive model coverage. We introduce Medmarks, a fully open-source evaluation suite with 30 benchmarks spanning question answering, information extraction, medical calculations, and open-ended clinical reasoning. We perform a systematic evaluation of 61 models across 71 configurations using verifiable metrics and LLM-as-a-Judge. Our results show that frontier reasoning models (Gemini 3 Pro Preview, GPT-5.1, & GPT-5.2) achieve the highest performance across both benchmarks, most frontier proprietary models are significantly more token efficient than open-weight alternatives, medically fine-tuned models outperform their generalist counterparts, and that models are susceptible to answer-order bias (particularly smaller models and Grok 4). A subset of our evals (Medmarks-T) can be directly used as reinforcement learning environments to post-train LLMs for medical reasoning. Code is available at https://github.com/MedARC-AI/Medmarks

3. [Maistros: A Greek Large Language Model Adapted Through Knowledge Distillation From Large Reasoning Models](https://arxiv.org/abs/2605.01870)
   - Published: 2026-05-05 12:00
   - Authors: Nikolaos Giarelis, Charalampos Mastrokostas, Nikos Karacapilidis
   - Source: arxiv
   - Relevance score: 197
   - Match reasons: title matched "language model"; title matched "large language model"; title matched "reasoning"; summary matched "LLM"
   - Categories: cs.CL
   - Tags: Evaluation / Data / Application / Method
   - Topics: Language Model / Large Language Model
   - PDF: https://arxiv.org/pdf/2605.01870
   - Abstract: Large Language Models (LLMs) have substantially advanced the field of Natural Language Processing (NLP), achieving state-of-the-art performance across a wide range of tasks. These improvements have been attributed, in part, to their emerging reasoning capabilities, which are enabled by large-scale training and increased model capacity. However, existing LLMs can generate erroneous responses when addressing complex queries that fall outside their training distribution, due to limited internal knowledge or the need for multi-step reasoning. To address these limitations, recent work has introduced large reasoning models (LRMs), which incorporate explicit internal reasoning processes to improve response accuracy. Additionally, state-of-the-art LRMs often comprise hundreds of billions of parameters and require several seconds per inference, even on advanced multi-GPU systems. These characteristics limit their practicality for deployment in conventional computing environments. Meanwhile, NLP research on multilingual LLMs continues to prioritize high-resource languages. However, these models exhibit limited performance in under-resourced languages, primarily due to insufficient language- and culture-specific training data. In this paper, we focus on Modern Greek, for which only a limited number of question answering (QA) datasets have been proposed, most of which are intended for model evaluation. To address this research gap in Greek QA, we make the following contributions: (i) CulturaQA, a high-quality LRM-generated and human-curated dataset, for Greek LLM training and evaluation; (ii) a memory-efficient LLM evaluation framework adaptable to diverse languages and QA tasks; (iii) Maistros 8B, a state-of-the-art open-weights Greek LLM developed via knowledge distillation and fine-tuning on CulturaQA; and (iv) a comprehensive evaluation of nine LLMs across nine human-curated Greek QA datasets.

4. [MTA: Multi-Granular Trajectory Alignment for Large Language Model Distillation](https://arxiv.org/abs/2605.01374)
   - Published: 2026-05-05 12:00
   - Authors: Pham Khanh Chi, Quoc Phong Dao, Thuat Nguyen, Linh Ngo Van, Trung Le, Thanh Hong Nguyen
   - Source: arxiv
   - Relevance score: 197
   - Match reasons: title matched "language model"; title matched "large language model"; title matched "alignment"; summary matched "LLM"
   - Categories: cs.CL
   - Tags: Evaluation / Method
   - Topics: Language Model / Large Language Model
   - PDF: https://arxiv.org/pdf/2605.01374
   - Abstract: Knowledge distillation is a key technique for compressing large language models (LLMs), but most existing methods align representations at fixed layers or token-level outputs, ignoring how representations evolve across depth. As a result, the student is only weakly guided to capture the teacher's internal relational structure during distillation, which limits knowledge transfer. To address this limitation, we propose Multi-Granular Trajectory Alignment (MTA), a framework that aligns teacher and student representations along their layer-wise transformation trajectory. MTA adopts a layer-adaptive strategy: lower layers are aligned at the word level to preserve lexical information, while higher layers operate on phrase-level spans (e.g., noun and verb phrases) to capture compositional semantics. We instantiate this idea through a Dynamic Structural Alignment loss that matches the relative geometry among semantic units within each layer. This design is motivated by empirical findings that Transformer representations become increasingly abstract with depth, and is also consistent with linguistic views in which higher-level meaning emerges through the composition of lower-level lexical units. We further incorporate a Hidden Representation Alignment loss to directly align selected teacher-student layers. Experiments show that MTA consistently outperforms state-of-the-art baselines on standard benchmarks, with ablations confirming the contribution of each component.

5. [OralMLLM-Bench: Evaluating Cognitive Capabilities of Multimodal Large Language Models in Dental Practice](https://arxiv.org/abs/2605.01333)
   - Published: 2026-05-05 12:00
   - Authors: Rongyang Wang, Shuang Zhou, Jiashuo Wang, Wenya Xie, Xiaoxia Che
   - Source: arxiv
   - Relevance score: 197
   - Match reasons: title matched "language model"; title matched "large language model"; title matched "LLM"; summary matched "benchmark"
   - Categories: cs.CL
   - Tags: Evaluation / Data / Application / Method
   - Topics: Language Model / Large Language Model
   - PDF: https://arxiv.org/pdf/2605.01333
   - Abstract: Multimodal large language models (MLLMs) have emerged as a promising paradigm for dental image analysis. However, their ability to capture the multi-level cognitive processes required for radiographic analysis remains unclear. Here, we present a comprehensive benchmark to evaluate the cognitive capabilities of MLLMs in dental radiographic analysis. It spans three critical imaging modalities, i.e., periapical, panoramic, and lateral cephalometric radiographs, and defines four cognitive categories: perception, comprehension, prediction, and decision-making. The benchmark comprises 27 clinically grounded tasks derived from public datasets, with manually curated annotations and 3,820 clinician assessments for evaluation. Six frontier MLLMs, including GPT-5.2 and GLM-4.6, are evaluated. We demonstrate the performance gap between MLLMs and clinicians in dental practice, delineate model strengths and limitations, characterize failure patterns, and provide recommendations for improvement. This data resource will facilitate the development of next-generation artificial intelligence systems aligned with clinical cognition, safety requirements, and workflow complexity in dental practice.

6. [Spatiotemporal Hidden-State Dynamics as a Signature of Internal Reasoning in Large Language Models](https://arxiv.org/abs/2605.01853)
   - Published: 2026-05-05 12:00
   - Authors: Kotaro Furuya, Takahito Tanimura
   - Source: arxiv
   - Relevance score: 197
   - Match reasons: title matched "language model"; title matched "large language model"; title matched "reasoning"; summary matched "benchmark"
   - Categories: cs.CL, cs.AI
   - Tags: Evaluation / Method
   - Topics: Language Model / Large Language Model
   - PDF: https://arxiv.org/pdf/2605.01853
   - Abstract: Large reasoning models (LRMs) generate extended solutions, yet it remains unclear whether these traces reflect substantive internal computation or merely verbosity and overthinking. Although recent hidden-state analyses suggest that internal representations carry correctness-related signals, their coarse aggregations may obscure the token and layer structure underlying reasoning computation. We investigate hidden-state transitions across decoding steps and layers, and identify a distinct spatiotemporal pattern in LRMs: successful trajectories exhibit broad temporal dynamics with localized layer-wise concentration, while this structure is weaker in non-reasoning models and knowledge-heavy domains. We formalize this characteristic as Spatiotemporal Amplitude of Latent Transition (StALT), a training-free trajectory statistic that summarizes temporal changes between adjacent tokens weighted by within-token layer saliency. Across diverse models and benchmarks, StALT reliably separates correct from incorrect trajectories in reasoning-intensive regimes, providing a competitive label-free correctness signal alongside strong output-space and length-based baselines. Intervention analyses further show that this spatiotemporal amplitude responds systematically to manipulations that increase or reduce the demand for internal reasoning, supporting its association with latent reasoning dynamics in LRMs. These findings provide empirical evidence that LRMs exhibit measurable hidden-state dynamics and offer a practical probe for understanding internal computation beyond output-based evaluation.

7. [FT-RAG: A Fine-grained Retrieval-Augmented Generation Framework for Complex Table Reasoning](https://arxiv.org/abs/2605.01495)
   - Published: 2026-05-05 12:00
   - Authors: Zebin Guo, Weidong Geng, Ruichen Mao
   - Source: arxiv
   - Relevance score: 193
   - Match reasons: title matched "reasoning"; title matched "RAG"; summary matched "language model"; summary matched "large language model"
   - Categories: cs.CL, cs.AI
   - Tags: Evaluation / Data / Method
   - Topics: Language Model / Large Language Model
   - PDF: https://arxiv.org/pdf/2605.01495
   - Abstract: Retrieval-Augmented Generation (RAG) enhances Large Language Models (LLMs) by grounding responses in external knowledge during inference. However, conventional RAG systems under-perform on structured tabular data, largely due to coarse retrieval granularity and insufficient table semantic comprehension. To address these limitations, we introduce FT-RAG, a fine-grained framework that employs knowledge association by decomposing tables into entry-level semantic units to construct a structured graph. FT-RAG employs a structural neighbor expansion mechanism to find semantically connected entities during graph retrieval, followed by multi-modal fusion to consolidate the context of table retrieval results. Further, to address the scarcity of specialized datasets in this domain, we introduce Multi-Table-RAG-Lib, a benchmark comprising 9870 QA pairs with high complexity and difficulty, curated to demand multi-table integration and text-table information fusion for reasoning. FT-RAG surpasses top-performing baselines across all metrics, achieving a 23.5% and 59.2% improvement in table-level and cell-level Hit Rates, respectively. Generation performance also sees a remarkable 62.2% increase in exact value accuracy recall. These metrics verify the framework's effectiveness in factual grounding across both pure tabular and heterogeneous table-text contexts. Therefore, our method establishes a new state-of-the-art performance for complex reasoning over mixed-modality documents.

8. [MemORAI: Memory Organization and Retrieval via Adaptive Graph Intelligence for LLM Conversational Agents](https://arxiv.org/abs/2605.01386)
   - Published: 2026-05-05 12:00
   - Authors: Hung Pham Van, Nguyen Manh Hieu, Khang Pham Tran Tuan, Nam Le Hai, Linh Ngo Van, Nguyen Thi Ngoc Diep, et al.
   - Source: arxiv
   - Relevance score: 193
   - Match reasons: title matched "LLM"; title matched "agent"; summary matched "language model"; summary matched "large language model"
   - Categories: cs.CL
   - Tags: Evaluation / Method
   - Topics: Language Model / Large Language Model
   - PDF: https://arxiv.org/pdf/2605.01386
   - Abstract: Large Language Models (LLMs) lack persistent memory for long-term personalized conversations. Existing graph-based memory systems suffer from information dilution, absent provenance tracking, and uniform retrieval that ignores query context. We introduce MemORAI (Memory Organization and Retrieval via Adaptive Graph Intelligence), a framework that integrates three innovations: selective memory filtering with dual-layer compression to retain user-persona-relevant content, a provenance-enriched multi-relational graph tracking factual origins at the turn level, and query-adaptive subgraph retrieval with Dynamic Weighted PageRank that applies query-conditioned edge weighting. Evaluated on LOCOMO and LongMemEval benchmarks, MemORAI achieves state-of-the-art performance in memory retrieval and personalized response generation, demonstrating that selective storage, enriched representation, and adaptive retrieval are essential for coherent, personalized LLM agents.

9. [Verbal-R3: Verbal Reranker as the Missing Bridge between Retrieval and Reasoning](https://arxiv.org/abs/2605.01399)
   - Published: 2026-05-05 12:00
   - Authors: Sangkwon Park, Donghun Kang, Jisoo Mok, Sungroh Yoon
   - Source: arxiv
   - Relevance score: 189
   - Match reasons: title matched "reasoning"; summary matched "language model"; summary matched "large language model"; summary matched "LLM"
   - Categories: cs.CL, cs.AI, cs.IR
   - Tags: Evaluation / Data / Application / Method
   - Topics: Language Model / Large Language Model
   - PDF: https://arxiv.org/pdf/2605.01399
   - Abstract: The conventional Retrieval-Augmented Generation (RAG) paradigm of injecting raw retrieved texts into the Large Language Model (LLM)'s context often results in suboptimal integration of retrieved information. This paper proposes to bridge retrieval results and the LLM's reasoning ability through Verbal Annotations, analytic narratives that explicitly articulate the logical connection between a search query and retrieved contexts. Our empirical investigation reveals the potential of Verbal Annotations to substantially enhance the LLM's ability to generate accurate, contextually-grounded responses. Motivated by this finding, we introduce Verbal-R3, a novel agentic RAG framework that consists of a Generator and a Verbal Reranker. The Generator performs iterative retrieval and reasoning, while the Verbal Reranker returns relevance scores and Verbal Annotations to guide the reasoning and answering process of the Generator. The inference process of Verbal-R3 is further refined through relevance-guided test-time scaling, which efficiently allocates test-time compute for effective trajectory expansion. Verbal-R3 achieves state-of-the-art performance on complex Question Answering benchmarks, validating the effectiveness of the proposed framework.

10. [Auditing demographic bias in AI-based emergency police dispatch: a cross-lingual evaluation of eleven large language models](https://arxiv.org/abs/2605.01451)
   - Published: 2026-05-05 12:00
   - Authors: William Guey, Wei Zhang, Pierrick Bougault, Yi Wang, Bertan Ucar, Vitor D. de Moura, et al.
   - Source: arxiv
   - Relevance score: 179
   - Match reasons: title matched "language model"; title matched "large language model"; title matched "evaluation"; summary matched "LLM"
   - Categories: cs.CL
   - Tags: Evaluation / Method
   - Topics: Language Model / Large Language Model
   - PDF: https://arxiv.org/pdf/2605.01451
   - Abstract: Large language models (LLMs) are rapidly being integrated into high-stakes public safety systems, including emergency call triage and dispatch decision support, yet their demographic fairness in this context remains largely untested. Here we introduce a cross-lingual audit framework that operationalizes the Police Priority Dispatch System as a five-level ordinal classification task and applies a controlled minimal-pair design to isolate the effect of demographic cues. Across 19,800 model outputs spanning 11 frontier models, 15 scenario pairs, three demographic categories (religious appearance, gender, and race), and two languages (English and Mandarin Chinese), we find that demographic bias emerges systematically when incident severity is ambiguous but largely disappears when the operational priority is clearly determined by call content. Bias magnitude varies by demographic axis, with the largest effects observed for religious appearance, followed by gender and race. Critically, bias does not transfer consistently across languages: gender bias is substantially amplified in Mandarin Chinese, whereas race bias is more pronounced in English, revealing cross-lingual asymmetries that aggregate analyses obscure. In several scenarios, demographic cues produce counter-directional effects, challenging simple stereotype-amplification accounts of model behavior. These findings suggest that bias in LLM-based dispatch is not a fixed property of models alone, but arises from the interaction between demographic signals, contextual ambiguity, and language. Beyond these empirical results, the proposed framework provides a scalable audit infrastructure that enables deploying agencies to evaluate candidate models on jurisdiction-relevant scenarios prior to real-world adoption.

11. [SRA: Span Representation Alignment for Large Language Model Distillation](https://arxiv.org/abs/2605.01205)
   - Published: 2026-05-05 12:00
   - Authors: Quoc Phong Dao, Hoang Son Nguyen, Pham Khanh Chi, Tung Nguyen, Linh Ngo Van, Nguyen Thi Ngoc Diep, et al.
   - Source: arxiv
   - Relevance score: 179
   - Match reasons: title matched "language model"; title matched "large language model"; title matched "alignment"; summary matched "RAG"
   - Categories: cs.CL
   - Tags: Evaluation / Method
   - Topics: Language Model / Large Language Model
   - PDF: https://arxiv.org/pdf/2605.01205
   - Abstract: Cross-Tokenizer Knowledge Distillation (CTKD) enables knowledge transfer between a large language model and a smaller student, even when they employ different tokenizers. While existing approaches mainly focus on token-level alignment strategies, which are often brittle and sensitive to discrepancies between tokenizers, we argue that the method of aggregating tokens into more robust representations before distillation is of equal importance. In this paper, we introduce SRA (Span Representation Alignment for Large Language Model Distillation), a novel framework that reframes CTKD through the physical lens of Multi-Particle Dynamical Systems. SRA shifts the fundamental unit of alignment from tokens to robust, tokenizer-agnostic spans. We model each span as a cluster of particles and represent its state by its Center of Mass (CoM) - an attention-weighted average that captures rich semantic information. We leverage the concept of span centers of mass with attention-derived weighting to prioritize the most salient spans. In addition, we employ a geometric regularizer to preserve the structural integrity of the representation space and introduce aligned span logit distillation to enhance knowledge transfer across models. In challenging cross-architecture distillation experiments, SRA consistently and significantly outperforms state-of-the-art CTKD baselines, validating our physically-grounded approach.

12. [Enhanced LLM Reasoning by Optimizing Reward Functions with Search-Driven Reinforcement Learning](https://arxiv.org/abs/2605.02073)
   - Published: 2026-05-05 12:00
   - Authors: Arash Ahmadi, Sarah Sharif, Yaser (Mike) Banad
   - Source: arxiv
   - Relevance score: 175
   - Match reasons: title matched "LLM"; title matched "reasoning"; summary matched "language model"; summary matched "large language model"
   - Categories: cs.CL
   - Tags: Evaluation / Application / Method
   - Topics: Language Model / Large Language Model
   - PDF: https://arxiv.org/pdf/2605.02073
   - Abstract: Mathematical reasoning is a key benchmark for large language models. Reinforcement learning is a standard post-training mechanism for improving the reasoning capabilities of large language models, yet performance remains sensitive to the design of the reward function that drives policy optimization. This paper introduces a search-driven framework that treats the reward specification itself as an object of optimization. The setting of interest is one in which the base model is held fixed and the reward specification is the primary remaining design lever. Candidate reward functions are generated by a frontier language model, validated automatically, screened through 500-step Group Relative Policy Optimization (GRPO) training runs on a Llama-3.2-3B-Instruct base model with Low-Rank Adaptation (LoRA), and ranked by F1 on the GSM8K test set. Ranked summaries from prior rounds are then fed back into the next round of generation. Over five rounds, the search produces 50 candidate rewards. The mean F1 rises from 0.596 in Round 1 to 0.632 in Round 5, and the top individual reward reaches F1 = 0.787. Seven ensemble configurations of top-ranked rewards are evaluated. The best ensemble achieves F1 = 0.795 (95% bootstrap CI [0.756, 0.832]) and accuracy 0.660 [0.635, 0.686], a 0.19 absolute F1 gain over a base-rewards-only GRPO baseline (F1 = 0.609). Pairwise McNemar tests with Bonferroni correction show all five-or-more-reward configurations are statistically indistinguishable at α = 0.05/21. A three-seed re-training of the best ensemble yields F1 of 0.785. A randomly drawn 5-reward control collapses to F1 = 0.047, which shows that the ranked-feedback loop, not the additive signal of having more rewards, drives the gain.

13. [Focus on the Core: Empowering Diffusion Large Language Models by Self-Contrast](https://arxiv.org/abs/2605.01373)
   - Published: 2026-05-05 12:00
   - Authors: Jinyuan Feng, Xin Yu, Yiqun Chen, Xiaochi Wei, Yan Gao, Yi Wu, et al.
   - Source: arxiv
   - Relevance score: 175
   - Match reasons: title matched "language model"; title matched "large language model"; summary matched "reasoning"; summary matched "RAG"
   - Categories: cs.CL, cs.AI
   - Tags: Evaluation / Method
   - Topics: Language Model / Large Language Model
   - PDF: https://arxiv.org/pdf/2605.01373
   - Abstract: The iterative denoising paradigm of Diffusion Large Language Models (DLMs) endows them with a distinct advantage in global context modeling. However, current decoding strategies fail to leverage this capability, typically exhibiting a local preference that overlooks the heterogeneous information density within the context, ultimately degrading generation quality. To address this limitation, we systematically investigate high-information-density (HD) tokens and present two key findings: (1) explicitly conditioning on HD tokens substantially improves output quality; and (2) HD tokens exhibit an early-decoding tendency, converging earlier than surrounding tokens. Motivated by these findings, we propose Focus on the Core (FoCore), a training-free decoding strategy that utilizes HD tokens in a self-contrast manner, wherein HD tokens are temporarily remasked as negative samples, to guide generation. We further introduce FoCore_Accelerate (FoCore_A), an efficient variant that, upon detecting HD token convergence, performs parallel decoding over stable candidates within a local context window, substantially accelerating generation. Extensive experiments on math, code and logical reasoning benchmarks demonstrate that FoCore consistently improves generation quality and efficiency across both LLaDA and Dream backbones. For instance, on HumanEval, FoCore improves pass@1 from 39.02 to 42.68 over standard Classifier-Free Guidance, while FoCore_A reduces the number of decoding steps by 2.07x and per-sample latency from 20.76s to 8.64s (-58.4%).

14. [MultiBreak: A Scalable and Diverse Multi-turn Jailbreak Benchmark for Evaluating LLM Safety](https://arxiv.org/abs/2605.01687)
   - Published: 2026-05-05 12:00
   - Authors: Jialin Song, Xiaodong Liu, Weiwei Yang, Wuyang Chen, Mingqian Feng, Xuekai Zhu, et al.
   - Source: arxiv
   - Relevance score: 175
   - Match reasons: title matched "LLM"; title matched "benchmark"; summary matched "language model"; summary matched "large language model"
   - Categories: cs.CL
   - Tags: Evaluation / Data / Method
   - Topics: Language Model / Large Language Model
   - PDF: https://arxiv.org/pdf/2605.01687
   - Abstract: We present MultiBreak, a scalable and diverse multi-turn jailbreak benchmark to evaluate large language model (LLM) safety. Multi-turn jailbreaks mimic natural conversational settings, making it easier for them to bypass safety-aligned LLMs than for single-turn jailbreaks. Existing multi-turn benchmarks are limited in size or rely heavily on templates, which restrict their diversity. To address this gap, we unify a wide range of harmful jailbreak intents, and introduce an active learning pipeline for expanding high-quality multi-turn adversarial prompts, where a generator is iteratively fine-tuned to produce stronger attack candidates, guided by uncertainty-based refinement. Our MultiBreak includes 10,389 multi-turn adversarial prompts, spans 2,665 distinct harmful intents, and covers the most diverse set of topics to date. Empirical evaluation shows that our benchmark achieves up to a 54.0 and 34.6 higher attack success rate (ASR) than the second-best dataset on DeepSeek-R1-7B and GPT-4.1-mini, respectively. More importantly, safety evaluations suggest that diverse attack categories uncover fine-grained LLM vulnerabilities, and categories that appear benign under single-turn can exhibit substantially higher adversarial effectiveness in multi-turn scenarios. These findings highlight persistent vulnerabilities of LLMs under realistic adversarial settings and establish MultiBreak as a scalable resource for advancing LLM safety.

15. [Benchmarking Retrieval Strategies for Biomedical Retrieval-Augmented Generation: A Controlled Empirical Study](https://arxiv.org/abs/2605.02520)
   - Published: 2026-05-05 12:00
   - Authors: Devi Prasad Bal, Subhashree Puhan
   - Source: arxiv
   - Relevance score: 171
   - Match reasons: title matched "benchmark"; summary matched "language model"; summary matched "large language model"; summary matched "LLM"
   - Categories: cs.CL, cs.AI, cs.IR
   - Tags: Evaluation / Application / Method
   - Topics: Language Model / Large Language Model
   - PDF: https://arxiv.org/pdf/2605.02520
   - Abstract: Retrieval-Augmented Generation (RAG) offers a well-established path to grounding large language model (LLM) outputs in external knowledge, yet the question of which retrieval strategy works best in a high-stakes domain such as biomedicine has not received the controlled, multi-metric treatment it deserves. This paper presents a systematic empirical comparison of five retrieval strategies -- Dense Vector Search, Hybrid BM25 + Dense retrieval, Cross-Encoder Reranking, Multi-Query Expansion, and Maximal Marginal Relevance (MMR) -- within a biomedical question-answering RAG pipeline. All strategies share a fixed generation model (GPT-4o-mini), a common vector store (ChromaDB), and OpenAI's text-embedding-3-small embeddings, ensuring that observed differences are attributable to retrieval alone. Evaluation is conducted on 250 question-answer pairs drawn from a preprocessed subset of the BioASQ benchmark (rag-mini-bioasq) using four DeepEval metrics: contextual precision, contextual recall, faithfulness, and answer relevancy, each reported with 95% confidence intervals. A no-context ablation is included as a lower bound. Cross-Encoder Reranking achieves the best composite score (0.827) and highest contextual precision (0.852), confirming that query-document interaction yields measurable retrieval gains. Multi-Query Expansion, despite its recall-oriented design, produces the weakest contextual precision (0.671), suggesting naive query diversification introduces retrieval noise. MMR sacrifices answer relevancy for diversity, while the Dense baseline (composite 0.822) falls within 0.005 points of the top strategy. All RAG conditions dramatically outperform the no-context ablation on answer relevancy (0.658-0.701 vs. 0.287), confirming the practical value of retrieval. The full pipeline, hyperparameters, and evaluation code are publicly available.
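
For readers unfamiliar with Maximal Marginal Relevance (MMR), one of the retrieval strategies compared in item 15, the trade-off between relevance and diversity can be illustrated with the textbook greedy formulation. This is a minimal sketch and not the paper's pipeline: the function names, the cosine similarity choice, and the default weight `lam=0.5` are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def mmr(query_vec, doc_vecs, k, lam=0.5):
    """Greedy MMR selection; returns the indices of up to k documents.

    Each step picks the candidate maximizing
        lam * sim(query, doc) - (1 - lam) * max sim(doc, already selected).
    lam=1.0 reduces to plain relevance ranking; lower values favor diversity.
    """
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:

        def score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            # Redundancy is the similarity to the closest already-selected doc.
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected),
                default=0.0,
            )
            return lam * relevance - (1 - lam) * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


if __name__ == "__main__":
    # Toy 2-D embeddings (hypothetical): docs 0 and 1 are near-duplicates.
    query = [1.0, 1.0]
    docs = [[1.0, 0.9], [1.0, 0.8], [0.0, 1.0]]
    print(mmr(query, docs, k=2, lam=0.5))
    print(mmr(query, docs, k=2, lam=1.0))
```

With the toy vectors above, the diversity-weighted call skips the near-duplicate document in favor of the less similar one, which is the behavior the abstract describes as trading answer relevancy for diversity.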
