# Daily Paper Digest

- Generated at: 2026-05-07 12:38:06 (Asia/Shanghai)
- Search window: last 24 hours
- Hit overview: LM=15
- Ranking strategy: hybrid (relevance first, published_at tie-break)
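
The hybrid ranking strategy can be illustrated with a small sketch. It assumes that "relevance first" means descending relevance score and that the `published_at` tie-break prefers the newer paper. The digest does not state the tie-break direction, so that part is an assumption, although the entries that share a score of 140 in the paper list are ordered newest first, which is consistent with it. `Paper` and `hybrid_sort` are hypothetical names, not the API of the tool that produced this digest.

```python
from datetime import datetime
from typing import NamedTuple


class Paper(NamedTuple):
    """Hypothetical record holding only the fields needed for ranking."""
    title: str
    relevance: int
    published_at: datetime


def hybrid_sort(papers):
    # Sort by relevance (descending); on equal relevance, newer papers
    # come first. The newest-first direction is an assumption.
    return sorted(papers, key=lambda p: (p.relevance, p.published_at), reverse=True)


# Scores and timestamps are taken from three entries in this digest.
papers = [
    Paper("Self-Induced Outcome Potential", 140, datetime(2026, 5, 6, 22, 38)),
    Paper("Uno-Orchestra", 140, datetime(2026, 5, 6, 23, 7)),
    Paper("Misaligned by Reward", 194, datetime(2026, 5, 6, 23, 4)),
]

ranked = hybrid_sort(papers)
for paper in ranked:
    print(paper.relevance, paper.published_at.isoformat(), paper.title)
```

Sorting on a single tuple key keeps the two criteria in one pass. If the tie-break were oldest first instead, the key could be `(-p.relevance, p.published_at)` without `reverse=True`.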

## Today's Highlights

- Topic "Language Model": 11 hits, covering LM; representative papers include "Misaligned by Reward: Socially Undesirable Preferences in LLMs" and "SoK: Robustness in Large Language Models against Jailbreak Attacks".
- Topic "Large Language Model": 11 hits, covering LM; representative papers include "Misaligned by Reward: Socially Undesirable Preferences in LLMs" and "SoK: Robustness in Large Language Models against Jailbreak Attacks".
- Topic "Benchmark": 4 hits, covering LM; representative papers include "KernelBench-X: A Comprehensive Benchmark for Evaluating LLM-Generated GPU Kernels" and "Self-Induced Outcome Potential: Turn-Level Credit Assignment for Agents without Verifiers".
- Topic "LLM": 3 hits, covering LM; representative papers include "KernelBench-X: A Comprehensive Benchmark for Evaluating LLM-Generated GPU Kernels" and "Self-Induced Outcome Potential: Turn-Level Credit Assignment for Agents without Verifiers".
- Topic "Reasoning": 1 hit, covering LM; representative paper: "Agentic Vulnerability Reasoning on Windows COM Binaries".

## Topic Focus

### Language Model

- Hits: 11
- Groups covered: LM
- Representative papers: "Misaligned by Reward: Socially Undesirable Preferences in LLMs", "SoK: Robustness in Large Language Models against Jailbreak Attacks", "Why Expert Alignment Is Hard: Evidence from Subjective Evaluation"
- Quick read:
  - "Misaligned by Reward: Socially Undesirable Preferences in LLMs" (Evaluation / Data / Methods): Reward models are a key component of large language model alignment, serving as proxies for human preferences during training. However, existing evaluations fo…
  - "SoK: Robustness in Large Language Models against Jailbreak Attacks" (Evaluation / Applications / Methods): Large Language Models (LLMs) have achieved remarkable success but remain highly susceptible to jailbreak attacks, in which adversarial prompts coerce models in…

### Large Language Model

- Hits: 11
- Groups covered: LM
- Representative papers: "Misaligned by Reward: Socially Undesirable Preferences in LLMs", "SoK: Robustness in Large Language Models against Jailbreak Attacks", "Why Expert Alignment Is Hard: Evidence from Subjective Evaluation"
- Quick read:
  - "Misaligned by Reward: Socially Undesirable Preferences in LLMs" (Evaluation / Data / Methods): Reward models are a key component of large language model alignment, serving as proxies for human preferences during training. However, existing evaluations fo…
  - "SoK: Robustness in Large Language Models against Jailbreak Attacks" (Evaluation / Applications / Methods): Large Language Models (LLMs) have achieved remarkable success but remain highly susceptible to jailbreak attacks, in which adversarial prompts coerce models in…

### Benchmark

- Hits: 4
- Groups covered: LM
- Representative papers: "KernelBench-X: A Comprehensive Benchmark for Evaluating LLM-Generated GPU Kernels", "Self-Induced Outcome Potential: Turn-Level Credit Assignment for Agents without Verifiers", "Agentic Vulnerability Reasoning on Windows COM Binaries"
- Quick read:
  - "KernelBench-X: A Comprehensive Benchmark for Evaluating LLM-Generated GPU Kernels" (Evaluation / Methods): LLM-based Triton kernel generation has attracted significant interest, yet a fundamental empirical question remains unanswered: where does this capability brea…
  - "Self-Induced Outcome Potential: Turn-Level Credit Assignment for Agents without Verifiers" (Evaluation / Data / Applications / Methods): Long-horizon LLM agents depend on intermediate information-gathering turns, yet training feedback is usually observed only at the final answer, because process…

### LLM

- Hits: 3
- Groups covered: LM
- Representative papers: "KernelBench-X: A Comprehensive Benchmark for Evaluating LLM-Generated GPU Kernels", "Self-Induced Outcome Potential: Turn-Level Credit Assignment for Agents without Verifiers", "MRI-Eval: A Tiered Benchmark for Evaluating LLM Performance on MRI Physics and GE Scanner Operations Knowledge"
- Quick read:
  - "KernelBench-X: A Comprehensive Benchmark for Evaluating LLM-Generated GPU Kernels" (Evaluation / Methods): LLM-based Triton kernel generation has attracted significant interest, yet a fundamental empirical question remains unanswered: where does this capability brea…
  - "Self-Induced Outcome Potential: Turn-Level Credit Assignment for Agents without Verifiers" (Evaluation / Data / Applications / Methods): Long-horizon LLM agents depend on intermediate information-gathering turns, yet training feedback is usually observed only at the final answer, because process…

### Reasoning

- Hits: 1
- Groups covered: LM
- Representative papers: "Agentic Vulnerability Reasoning on Windows COM Binaries"
- Quick read:
  - "Agentic Vulnerability Reasoning on Windows COM Binaries" (Evaluation / Methods): Windows Component Object Model (COM) services run with elevated privileges and are widely accessible to authenticated users, making race conditions in these bi…

## LM Observations

### Group at a Glance

- "Misaligned by Reward: Socially Undesirable Preferences in LLMs" (Evaluation / Data / Methods): Reward models are a key component of large language model alignment, serving as proxies for human preferences during training. However, existing evaluations fo…
- "SoK: Robustness in Large Language Models against Jailbreak Attacks" (Evaluation / Applications / Methods): Large Language Models (LLMs) have achieved remarkable success but remain highly susceptible to jailbreak attacks, in which adversarial prompts coerce models in…
- "Why Expert Alignment Is Hard: Evidence from Subjective Evaluation" (Evaluation / Methods): Aligning large language models with expert judgment is especially difficult in subjective evaluation tasks, where experts may disagree, rely on tacit criteria,…
- "KernelBench-X: A Comprehensive Benchmark for Evaluating LLM-Generated GPU Kernels" (Evaluation / Methods): LLM-based Triton kernel generation has attracted significant interest, yet a fundamental empirical question remains unanswered: where does this capability brea…
- "Low-Cost Black-Box Detection of LLM Hallucinations via Dynamical System Prediction" (Evaluation / Methods): Large Language Models (LLMs) frequently generate plausible but non-factual content, a phenomenon known as hallucination. While existing detection methods typic…

### Paper Overview

1. [Misaligned by Reward: Socially Undesirable Preferences in LLMs](https://arxiv.org/abs/2605.05003v1)
   - Published: 2026-05-06 23:04
   - Authors: Gayane Ghazaryan, Esra Dönmez
   - Source: arXiv
   - Relevance score: 194
   - Match reasons: title matched "LLM"; summary matched "language model"; summary matched "large language model"; summary matched "reasoning"
   - Categories: cs.CL, cs.AI, cs.CY
   - Tags: Evaluation / Data / Methods
   - Topics: Language Model / Large Language Model
   - PDF: https://arxiv.org/pdf/2605.05003v1
   - Abstract: Reward models are a key component of large language model alignment, serving as proxies for human preferences during training. However, existing evaluations focus primarily on broad instruction-following benchmarks, providing limited insight into whether these models capture socially desirable preferences. As a result, important failures in social alignment can remain hidden. We extend reward-model benchmarking to four socially consequential domains: bias, safety, morality, and ethical reasoning. We introduce a framework that converts social evaluation datasets into pairwise preference data, leveraging gold labels where available and directional bias indicators otherwise. This enables us to test whether reward models prefer socially undesirable responses, and whether their preferences produce systematically biased distributions over selected outputs. Across five publicly available reward models and two instruction-tuned models used as reward proxies, we find substantial variation across domains, with no single model performing best overall. The models fall well short of strong social intelligence: they often prefer socially undesirable options, and their preferences produce systematically biased distributions. Moreover, stronger bias avoidance can reduce sensitivity to context, revealing a key alignment trade-off between avoiding biased outcomes and preserving contextual faithfulness. These findings show that standard reward benchmarks are insufficient for assessing social alignment and highlight the need for evaluations that directly measure the social preferences encoded in reward models.

2. [SoK: Robustness in Large Language Models against Jailbreak Attacks](https://arxiv.org/abs/2605.05058v1)
   - Published: 2026-05-06 23:53
   - Authors: Feiyue Xu, Hongsheng Hu, Chaoxiang He, Sheng Hang, Hanqing Hu, Xiuming Liu, et al.
   - Source: arXiv
   - Relevance score: 181
   - Match reasons: title matched "language model"; title matched "large language model"; summary matched "LLM"; summary matched "RAG"
   - Categories: cs.CR, cs.AI
   - Tags: Evaluation / Applications / Methods
   - Topics: Language Model / Large Language Model
   - PDF: https://arxiv.org/pdf/2605.05058v1
   - Abstract: Large Language Models (LLMs) have achieved remarkable success but remain highly susceptible to jailbreak attacks, in which adversarial prompts coerce models into generating harmful, unethical, or policy-violating outputs. Such attacks pose real-world risks, eroding safety, trust, and regulatory compliance in high-stakes applications. Although a variety of attack and defense methods have been proposed, existing evaluation practices are inadequate, often relying on narrow metrics like attack success rate that fail to capture the multidimensional nature of LLM security. In this paper, we present a systematic taxonomy of jailbreak attacks and defenses and introduce Security Cube, a unified, multi-dimensional framework for comprehensive evaluation of these techniques. We provide detailed comparison tables of existing attacks and defenses, highlighting key insights and open challenges across the literature. Leveraging Security Cube, we conduct benchmark studies on 13 representative attacks and 5 defenses, establishing a clear view of the current landscape encompassing jailbreak attacks, defenses, automated judges, and LLM vulnerabilities. Based on these evaluations, we distill critical findings, identify unresolved problems, and outline promising research directions for enhancing LLM robustness against jailbreak attacks. Our analysis aims to pave the way towards more robust, interpretable, and trustworthy LLM systems. Our code is available at Code.

3. [Why Expert Alignment Is Hard: Evidence from Subjective Evaluation](https://arxiv.org/abs/2605.04972v1)
   - Published: 2026-05-06 22:28
   - Authors: Tzu-Mi Lin, Wataru Hirota, Tatsuya Ishigaki, Lung-Hao Lee, Chung-Chi Chen
   - Source: arXiv
   - Relevance score: 161
   - Match reasons: title matched "alignment"; title matched "evaluation"; summary matched "language model"; summary matched "large language model"
   - Categories: cs.CL
   - Tags: Evaluation / Methods
   - Topics: Language Model / Large Language Model
   - PDF: https://arxiv.org/pdf/2605.04972v1
   - Abstract: Aligning large language models with expert judgment is especially difficult in subjective evaluation tasks, where experts may disagree, rely on tacit criteria, and change their judgments over time. In this paper, we study expert alignment as a way to understand this difficulty. Using expert evaluations and follow-up questionnaires, we examine how different forms of expert information affect alignment and what this reveals about subjective judgment. Our findings show four consistent patterns. First, alignment difficulty varies substantially across experts, suggesting that expert evaluation styles differ widely in their distance from a model's prior behavior. Second, explicit criteria and reasoning do not always improve alignment, indicating that expert judgment is not fully captured by verbalized rules. Third, editing is sensitive to both the number and the identity of examples, with small numbers of edits providing useful but unstable gains. Fourth, alignment difficulty differs across evaluation dimensions: dimensions grounded more directly in proposal content are easier to align, while dimensions requiring external knowledge or value-based judgment remain harder. Taken together, these results suggest that expert alignment is difficult not only because of model limitations, but also because subjective evaluation is inherently heterogeneous, partly tacit, dimension-dependent, and temporally unstable.

4. [KernelBench-X: A Comprehensive Benchmark for Evaluating LLM-Generated GPU Kernels](https://arxiv.org/abs/2605.04956v1)
   - Published: 2026-05-06 22:18
   - Authors: Han Wang, Jintao Zhang, Kai Jiang, Haoxu Wang, Jianfei Chen, Jun Zhu
   - Source: arXiv
   - Relevance score: 143
   - Match reasons: title matched "LLM"; title matched "benchmark"; summary matched "RAG"; summary matched "evaluation"
   - Categories: cs.LG, cs.PF
   - Tags: Evaluation / Methods
   - Topics: Benchmark / LLM
   - PDF: https://arxiv.org/pdf/2605.04956v1
   - Abstract: LLM-based Triton kernel generation has attracted significant interest, yet a fundamental empirical question remains unanswered: where does this capability break down, and why? We present KernelBench-X, a benchmark designed to answer this question through category-aware evaluation of correctness and hardware efficiency across 176 tasks in 15 categories. Our systematic comparison of five representative methods yields three main findings. First, task structure determines correctness more than method design. Category explains nearly three times more variance in semantic correctness than method (9.4% vs 3.3% explained deviance), and 72% of Fusion tasks fail across all five methods while Math tasks are solved consistently. Second, iterative refinement improves correctness, but not performance. Across GEAK iterations, compile rate rises from 52.3% to 68.8% while average speedup declines from 1.58× to 1.44×; newly rescued kernels consistently underperform persistently correct ones (1.16× vs 1.58× speedup in round 0→1). Third, correctness does not imply efficiency. 46.6% of correct kernels are slower than the PyTorch eager baseline, and cross-hardware speedup variance reaches 21.4×. Besides, quantization remains completely unsolved (0/30 successes) despite non-trivial compilation rates, revealing systematic misunderstanding of numerical computation contracts rather than surface-level syntax errors. These findings suggest that future progress depends on handling global coordination, explicitly modeling numerical precision, and incorporating hardware efficiency into generation. The code is available at https://github.com/BonnieW05/KernelBenchX

5. [Low-Cost Black-Box Detection of LLM Hallucinations via Dynamical System Prediction](https://arxiv.org/abs/2605.05134v1)
   - Published: 2026-05-07 01:07
   - Authors: Dan Wilson, Mohamed Akrout
   - Source: arXiv
   - Relevance score: 142
   - Match reasons: title matched "LLM"; summary matched "language model"; summary matched "large language model"; summary matched "RAG"
   - Categories: cs.LG, math.DS
   - Tags: Evaluation / Methods
   - Topics: Language Model / Large Language Model
   - PDF: https://arxiv.org/pdf/2605.05134v1
   - Abstract: Large Language Models (LLMs) frequently generate plausible but non-factual content, a phenomenon known as hallucination. While existing detection methods typically rely on computationally expensive sampling-based consistency checks or external knowledge retrieval, we propose a new method that treats the LLM as a black-box dynamical system. By projecting LLM responses into a high-dimensional manifold via an embedding model, we characterize the resulting vector sequences as observable realizations of the model's latent state-space dynamics. Leveraging Koopman operator theory, we fit the transition operators for both factual and hallucinated regimes and define a differential residual score based on their respective prediction errors. To accommodate varying user requirements and domain-specific sensitivities, we introduce a preference-aware calibration mechanism that optimizes the classification threshold based on a small set of demonstrations. This approach enables low-cost hallucination detection in a single-sample pass, avoiding the need for secondary sampling or external grounding. Extensive testing across three data benchmarks demonstrates that our method achieves state-of-the-art performance with reduced resource overhead.

6. [Joint Treatment Effect Estimation from Incomplete Healthcare Data: Temporal Causal Normalizing Flows with LLM-driven Evolutionary MNAR Imputation](https://arxiv.org/abs/2605.05125v1)
   - Published: 2026-05-07 00:53
   - Authors: Olivia Jullian Parra, Sara Zoccheddu, David Catalan Cerezo, Tom Forzy, Franziska Ulrich, William Sutcliffe, et al.
   - Source: arXiv
   - Relevance score: 142
   - Match reasons: title matched "LLM"; summary matched "language model"; summary matched "large language model"; summary matched "RAG"
   - Categories: cs.LG, cs.AI
   - Tags: Evaluation / Applications / Methods
   - Topics: Language Model / Large Language Model
   - PDF: https://arxiv.org/pdf/2605.05125v1
   - Abstract: Target trial emulation (TTE) enables causal questions to be studied with observational data when randomized controlled trials (RCTs) are infeasible. Yet treatment-effect methods often address causal estimation, missingness, and temporal structure separately, limiting their robustness in electronic health records (EHRs), where time-varying confounding and missing-not-at-random (MNAR) biomarkers can reach 50%–80%. We propose a two-stage pipeline for treatment effect estimation from incomplete longitudinal EHRs. First, CausalFlow-T, a directed acyclic graph (DAG)-constrained normalizing flow with long short-term memory (LSTM)-encoded patient history, performs exact invertible counterfactual inference, avoiding approximation errors from variational inference and separating confounding through explicit causal structure. Ablations on four synthetic and one semi-synthetic benchmark with known counterfactuals show that DAG constraints and exact inference address distinct failure modes: neither compensates for the other. Second, because CausalFlow-T requires completed inputs, we introduce an LLM-driven evolutionary imputer that proposes executable imputation operators rather than individual entries, and evaluate it with three large language model (LLM) backends, including two open-source models. Across 30%–80% MNAR missingness, this imputer achieves the best pooled rank over biomarker and causal metrics, leading in point-wise accuracy and temporal extrapolation while preserving average treatment effect (ATE) recovery as statistical baselines degrade. On Swiss primary-care EHRs from adults with type 2 diabetes initiating a GLP-1 receptor agonist or SGLT-2 inhibitor, the pipeline estimates a per-protocol weight-loss difference of -0.98 kg [95% CI -1.01, -0.96] favoring GLP-1 receptor agonists, consistent with randomized evidence and obtained from realistically incomplete real-world EHRs.

7. [Uno-Orchestra: Parsimonious Agent Routing via Selective Delegation](https://arxiv.org/abs/2605.05007v1)
   - Published: 2026-05-06 23:07
   - Authors: Zhiqing Cui, Haotong Xie, Jiahao Yuan, Cheng Yang, Hanqing Wang, Yuxin Wu, et al.
   - Source: arXiv
   - Relevance score: 140
   - Match reasons: title matched "agent"; summary matched "language model"; summary matched "large language model"; summary matched "LLM"
   - Categories: cs.AI
   - Tags: Evaluation / Applications / Methods
   - Topics: Language Model / Large Language Model
   - PDF: https://arxiv.org/pdf/2605.05007v1
   - Abstract: Large language model (LLM) multi-agent systems typically rely on rigid orchestration, committing either to flat per-query routing or to hand-engineered task decomposition, so decomposition depth, worker choice, and inference budget are not jointly optimized under one objective. We introduce Uno-Orchestra, a unified orchestration policy that selectively decomposes a task and dispatches each subtask to an admissible (model, primitive) pair, with both decisions learned together from curated RL trajectories grounded in real worker interactions. Against 22 baselines on a 13-benchmark suite spanning math, code, knowledge, long-context, and agentic tool-use, Uno-Orchestra reaches 77.0% macro pass@1, roughly 16% above the strongest workflow baseline, at roughly an order of magnitude lower per-query cost, advancing the accuracy-efficiency frontier of selective delegation.

8. [Self-Induced Outcome Potential: Turn-Level Credit Assignment for Agents without Verifiers](https://arxiv.org/abs/2605.04984v1)
   - Published: 2026-05-06 22:38
   - Authors: Senkang Hu, Yong Dai, Xudong Han, Zhengru Fang, Yuzhi Zhao, Sam Tak Wu Kwong, et al.
   - Source: arXiv
   - Relevance score: 140
   - Match reasons: title matched "agent"; summary matched "LLM"; summary matched "reasoning"; summary matched "RAG"
   - Categories: cs.LG, cs.CL
   - Tags: Evaluation / Data / Applications / Methods
   - Topics: Benchmark / LLM
   - PDF: https://arxiv.org/pdf/2605.04984v1
   - Abstract: Long-horizon LLM agents depend on intermediate information-gathering turns, yet training feedback is usually observed only at the final answer, because process-level rewards require high-quality human annotation. Existing turn-level shaping methods reward turns that increase the likelihood of a gold answer, but they require answer supervision or stable task-specific verifiers. Conversely, label-free RL methods extract self-signals from output distributions, but mainly at the answer or trajectory level and therefore cannot assign credit to intermediate turns. We propose Self-Induced Outcome Potential (SIOP), which treats semantic clusters of final answers as latent future outcome states for potential-based turn-level credit assignment. For each query, SIOP samples multiple rollouts, clusters final answers into semantic outcome modes, and builds a reliability-aware target distribution over these states. It then rewards turns for increasing posterior support for reliable future states using a tractable cluster-level approximation. The objective generalizes information-potential shaping from gold-answer supervision to settings without task-specific gold verifiers while avoiding the broadcasted rollout-level advantages used by standard GRPO. We formalize the framework, characterize its supervised gold-answer limit, and show that SIOP improves average performance over verifier-free outcome-level baselines on seven search-augmented agentic reasoning benchmarks while approaching a gold-supervised outcome baseline. Code is available at https://github.com/dl-m9/SIOP.git.

9. [Agentic Vulnerability Reasoning on Windows COM Binaries](https://arxiv.org/abs/2605.05000v1)
   - Published: 2026-05-06 23:00
   - Authors: Hwiwon Lee, Jongseong Kim, Lingming Zhang
   - Source: arXiv
   - Relevance score: 126
   - Match reasons: title matched "reasoning"; title matched "agent"; summary matched "benchmark"; has PDF
   - Categories: cs.CR, cs.LG
   - Tags: Evaluation / Methods
   - Topics: Benchmark / Reasoning
   - PDF: https://arxiv.org/pdf/2605.05000v1
   - Abstract: Windows Component Object Model (COM) services run with elevated privileges and are widely accessible to authenticated users, making race conditions in these binaries a critical surface for local privilege escalation. We present SLYP, an end-to-end agentic pipeline that discovers race condition vulnerabilities in COM binaries and generates debugger-verified proof-of-concept (PoC) code. SLYP exposes binary exploration, COM inspection, and dynamic debugging as reusable tool interfaces, giving agents the static context, COM activation metadata, and debugger feedback needed to move from vulnerability discovery to verified PoC generation. On a benchmark of 20 COM objects covering 40 vulnerability cases, SLYP achieves 0.973 F1, outperforming production coding agents by up to 0.208 F1 and the state-of-the-art static analyzer by 3.3x in bug discovery. For PoC generation, production coding agents in their default setup (without our COM inspection and dynamic debugging tools) verify essentially no cases on either frontier model, whereas SLYP's interactive toolsets enable it to autonomously synthesize working PoCs for 67.5% of cases on the strongest configuration. Deployed on production Windows services, SLYP discovers 28 previously unknown vulnerabilities across nine COM services, all confirmed by the Microsoft Security Response Center (MSRC) with 16 CVEs assigned and $140,000 in bounties. Furthermore, SLYP is designed with generalizable binary analysis and debugging interfaces, making it readily applicable to other commercial off-the-shelf (COTS) binaries beyond Windows COM services.

10. [Adapting Large Language Models to a Low-Resource Agglutinative Language: A Comparative Study of LoRA and QLoRA for Bashkir](https://arxiv.org/abs/2605.04948v1)
   - Published: 2026-05-06 22:14
   - Authors: Mullosharaf K. Arabov, Svetlana S. Khaybullina
   - Source: arXiv
   - Relevance score: 125
   - Match reasons: title matched "language model"; title matched "large language model"; summary matched "evaluation"; has PDF
   - Categories: cs.CL
   - Tags: Evaluation / Data / Methods
   - Topics: Language Model / Large Language Model
   - PDF: https://arxiv.org/pdf/2605.04948v1
   - Abstract: This paper presents a comparative study of parameter-efficient fine-tuning (PEFT) methods, including LoRA and QLoRA, applied to the task of adapting large language models to the Bashkir language, a low-resource agglutinative language of the Turkic family. Experimental evaluation is conducted on a Bashkir text corpus of 71k documents (46.9M tokens) using models of various architectures: DistilGPT2, GPT-2 (base, medium), Phi-2, Qwen2.5-7B, DeepSeek-7B, and Mistral-7B. To improve the reliability of results, each configuration was trained with three different random seeds. The lowest perplexity on the test set was obtained for GPT-2 medium with full fine-tuning (3.34). Meanwhile, QLoRA applied to Mistral-7B (3.79) and Phi-2 (3.81) achieved comparable quality with over 40 times fewer trainable parameters. However, we also observed cases of significant quality degradation when using PEFT for certain architectures (e.g., DeepSeek-7B with rank 8, perplexity = 129.55), indicating that the outcome depends critically on the choice of the base model and its tokenizer. Additionally, a qualitative analysis of generated texts based on Bashkir prompts revealed that models with the best perplexity do not necessarily produce the most coherent outputs: QLoRA-tuned models generated monolingual Bashkir continuations, whereas the fully fine-tuned model with the lowest perplexity frequently switched to English. The results suggest that QLoRA on 7B-scale models offers an effective compromise between quality and computational cost for Bashkir. To ensure reproducibility, open data, code, and trained adapters will be released upon acceptance.

11. [Beyond Semantics: An Evidential Reasoning-Aware Multi-View Learning Framework for Trustworthy Mental Health Prediction](https://arxiv.org/abs/2605.05121v1)
   - Published: 2026-05-07 00:49
   - Authors: Yucheng Ruan, Ling Huang, Qika Lin, Kai He, Mengling Feng
   - Source: arXiv
   - Relevance score: 124
   - Match reasons: title matched "reasoning"; summary matched "language model"; summary matched "large language model"; summary matched "benchmark"
   - Categories: cs.CL
   - Tags: Evaluation / Data / Applications / Methods
   - Topics: Language Model / Large Language Model
   - PDF: https://arxiv.org/pdf/2605.05121v1
   - Abstract: Automated mental health prediction using textual data has shown promising results with deep learning and large language models. However, deploying these models in high-stakes real-world settings remains challenging, as existing approaches largely rely on semantic representations and often produce overconfident predictions under ambiguous, noisy, or shifted data. Moreover, most methods lack reliable uncertainty estimation, undermining trust in risk-sensitive mental health applications. To address these limitations, we formulate the task as a multi-view learning problem that integrates semantic information from encoder-only models with higher-level reasoning information from decoder-only models, where reasoning-aware representations and uncertainty modeling are obtained in a trustworthy manner. To ensure reliable fusion, we adopt an evidential learning framework based on Subjective Logic to explicitly model uncertainty and introduce an evidential fusion strategy that balances complementary views while discounting unreliable evidence. Benchmarking on three real-world datasets, Dreaddit, SDCNL, and DepSeverity, reports accuracies of 0.835, 0.731, and 0.751, respectively, demonstrating its potential for reliable mental health prediction. Additional experiments on robustness to noise and case studies for interpretability confirm that our proposed framework not only improves predictive performance but also provides trustworthy uncertainty estimates and human-understandable reasoning signals, making it suitable for risk-sensitive applications in mental health assessment.

12. [Automatically Finding and Validating Unexpected Side-Effects of Interventions on Language Models](https://arxiv.org/abs/2605.05090v1)
   - Published: 2026-05-07 00:27
   - Authors: Quintin Pope, Ajay Hayagreeve Balaji, Jacques Thibodeau, Xiaoli Fern
   - Source: arXiv
   - Relevance score: 123
   - Match reasons: title matched "language model"; summary matched "large language model"; summary matched "reasoning"; summary matched "evaluation"
   - Categories: cs.CL, cs.AI
   - Tags: Evaluation / Methods
   - Topics: Language Model / Large Language Model
   - PDF: https://arxiv.org/pdf/2605.05090v1
   - Abstract: We present an automated, contrastive evaluation pipeline for auditing the behavioral impact of interventions on large language models. Given a base model $M_1$ and an intervention model $M_2$, our method compares their free-form, multi-token generations across aligned prompt contexts and produces human-readable, statistically validated natural-language hypotheses describing how the models differ, along with recurring themes that summarize patterns across validated hypotheses. We evaluate the approach in a synthetic setting by injecting known behavioral changes and showing that the pipeline reliably recovers them. We then apply it to three real-world interventions, reasoning distillation, knowledge editing and unlearning, demonstrating that the method surfaces both intended and unexpected behavioral shifts, distinguishes large from subtle interventions, and does not hallucinate differences when effects are absent or misaligned with the prompt bank. Overall, the pipeline provides a statistically grounded and interpretable tool for post-hoc auditing of intervention-induced changes in model behavior.

13. [UFAL-CUNI at SemEval-2026 Task 11: An Efficient Modular Neuro-symbolic Method for Syllogistic Reasoning](https://arxiv.org/abs/2605.04941v1)
   - Published: 2026-05-06 22:10
   - Authors: Ivan Kartáč, Kristýna Onderková, Jan Bronec, Zdeněk Kasner, Mateusz Lango, Ondřej Dušek
   - Source: arXiv
   - Relevance score: 121
   - Match reasons: title matched "reasoning"; summary matched "language model"; summary matched "large language model"; summary matched "LLM"
   - Categories: cs.CL
   - Tags: Evaluation / Methods
   - Topics: Language Model / Large Language Model
   - PDF: https://arxiv.org/pdf/2605.04941v1
   - Abstract: This paper describes our system submitted to SemEval-2026 Task 11: Disentangling Content and Formal Reasoning in Large Language Models. We present an efficient modular neuro-symbolic approach, combining a symbolic prover with small reasoning LLMs (4B parameters). The system consists of an LLM-based parser that translates natural language syllogisms to a first-order logic (FOL) representation, an automated theorem prover, and two optional modules: machine translation for multilingual inputs and a symbolic retrieval component for the identification of relevant premises. The system achieves competitive accuracy and relatively low content effect on most subtasks. Our ablations show that this approach outperforms LLM-based zero-shot baselines in this parameter size range, but also reveal limited multilingual capabilities of small LLMs. Finally, we include a discussion of the task's main ranking metric and analyze its limitations.

14. [MRI-Eval: A Tiered Benchmark for Evaluating LLM Performance on MRI Physics and GE Scanner Operations Knowledge](https://arxiv.org/abs/2605.05175v1)
   - Published: 2026-05-07 01:42
   - Authors: Perry E. Radau
   - Source: arXiv
   - Relevance score: 111
   - Match reasons: title matched "LLM"; title matched "benchmark"; has PDF; has rich summary
   - Categories: eess.IV, cs.CL, physics.med-ph
   - Tags: Evaluation / Applications / Methods
   - Topics: Benchmark / LLM
   - PDF: https://arxiv.org/pdf/2605.05175v1
   - Abstract: Background: Existing MRI LLM benchmarks rely mainly on review-book multiple-choice questions, where top proprietary models already score highly, limiting discrimination. No systematic benchmark has evaluated vendor-specific scanner operational knowledge central to research MRI practice. Purpose: We developed MRI-Eval, a tiered benchmark for relative model comparison on MRI physics and GE scanner operations knowledge using primary multiple-choice questions (MCQ), with stem-only and primed diagnostic conditions as complementary analyses. Methods: MRI-Eval includes 1365 scored items across nine categories and three difficulty tiers from textbooks, GE scanner manuals, programming course materials, and expert-generated questions. Five model families were evaluated (GPT-5.4, Claude Opus 4.6, Claude Sonnet 4.6, Gemini 2.5 Pro, Llama 3.3 70B). MCQ was primary; stem-only removed options and used an independent LLM judge; primed stem-only tested responses to incorrect user claims. Results: Overall MCQ accuracy was 93.2% to 97.1%. GE scanner operations was the lowest category for every model (88.2% to 94.6%). In stem-only, frontier-model accuracy fell to 58.4% to 61.1%, and Llama 3.3 70B fell to 37.1%; GE scanner operations stem-only accuracy was 13.8% to 29.8%. Conclusion: High MCQ performance can mask weak free-text recall, especially for vendor-specific operational knowledge. MRI-Eval is most informative as a relative comparison benchmark rather than an absolute competency measure and supports caution in using raw LLM outputs for GE-specific protocol guidance.

15. [Detecting Hallucinations in Large Language Models via Internal Attention Divergence Signals](https://arxiv.org/abs/2605.05025v1)
   - Published: 2026-05-06 23:21
   - Authors: Gijs van Dijk
   - Source: arXiv
   - Relevance score: 108
   - Match reasons: title matched "language model"; title matched "large language model"; has PDF; has rich summary
   - Categories: cs.CL
   - Tags: Data / Methods
   - Topics: Language Model / Large Language Model
   - PDF: https://arxiv.org/pdf/2605.05025v1
   - Abstract: We propose a lightweight and single-pass uncertainty quantification method for detecting hallucinations in Large Language Models. The method uses attention matrices to estimate uncertainty without requiring repeated sampling or external models. Specifically, we measure the Kullback-Leibler divergence between each attention head's distribution and a uniform reference distribution, and use these features in a logistic regression probe. Across multiple datasets, task types, and model families, attention divergence is highly predictive of answer correctness and performs competitively with existing uncertainty estimation methods. We find that this signal is concentrated in middle layers and on factual tokens such as named entities and numbers, suggesting that attention dynamics provides an efficient and interpretable white-box signal of model uncertainty.
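
The attention-divergence feature described in the abstract of entry 15 can be illustrated with a minimal sketch. It computes the Kullback-Leibler divergence from a single attention distribution to a uniform reference over the same positions. The abstract does not give the direction of the divergence, the token positions used, or how per-head values are aggregated, so the direction KL(p || u) and the function name `kl_to_uniform` are assumptions for illustration, not the paper's implementation. The logistic regression probe is not shown.

```python
import math


def kl_to_uniform(attn_row):
    """KL(p || u) for attention weights p and a uniform u over the same positions.

    Hypothetical helper: the direction of the divergence is an assumption.
    """
    n = len(attn_row)
    total = sum(attn_row)
    kl = 0.0
    for weight in attn_row:
        p = weight / total  # renormalize in case the weights do not sum to 1
        if p > 0.0:  # terms with p == 0 contribute nothing
            kl += p * math.log(p * n)  # p * log(p / (1/n))
    return kl


# Toy attention rows over four key positions (illustrative values only).
uniform_head = [0.25, 0.25, 0.25, 0.25]
peaked_head = [1.0, 0.0, 0.0, 0.0]

print(kl_to_uniform(uniform_head))  # no divergence from the uniform reference
print(kl_to_uniform(peaked_head))   # maximal divergence for four positions, log(4)
```

Under this definition the value equals `log(n)` minus the entropy of the attention distribution, so a head that attends evenly scores near zero and a sharply focused head scores near `log(n)`.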
