# Daily Paper Digest

- Generated: 2026-04-29 12:26:28 (Asia/Shanghai)
- Search window: last 24 hours
- Hit overview: LM=15
- Ranking strategy: hybrid (relevance first, published_at tie-break)
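
The hybrid ranking named above can be sketched roughly as follows. This is a minimal illustration, not the digest tool's actual implementation: the `Paper` record and its field names are hypothetical, and the tie-break direction (newer `published_at` first) is an assumption, although it agrees with the order of equal-score entries in the listing further down.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Paper:
    # Hypothetical record; the field names are illustrative only.
    title: str
    relevance: int
    published_at: datetime


def hybrid_sort(papers):
    """Sort by relevance, highest first; break ties by published_at, newest first (assumed)."""
    # Both key components are compared in descending order because of reverse=True.
    return sorted(papers, key=lambda p: (p.relevance, p.published_at), reverse=True)


papers = [
    Paper("RESTestBench", 146, datetime(2026, 4, 29, 0, 59)),
    Paper("ADEMA", 146, datetime(2026, 4, 29, 0, 54)),
    Paper("LLM-ReSum", 197, datetime(2026, 4, 28, 22, 0)),
]
ranked = [p.title for p in hybrid_sort(papers)]
print(ranked)
```

Sorting on a single tuple key keeps the policy in one place; if the tie-break should instead favor older papers, the two components would need separate sort directions.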

## Today's Highlights

- Topic "LLM": 13 hits, covering group LM. Representative papers include "LLM-ReSum: A Framework for LLM Reflective Summarization through Self-Evaluation" and "Walking Through Uncertainty: An Empirical Study of Uncertainty Estimation for Audio-Aware Large Language Models".
- Topic "Language Model": 12 hits, covering group LM. Representative papers include "LLM-ReSum: A Framework for LLM Reflective Summarization through Self-Evaluation" and "Walking Through Uncertainty: An Empirical Study of Uncertainty Estimation for Audio-Aware Large Language Models".
- Topic "Benchmark": 4 hits, covering group LM. Representative papers include "DV-World: Benchmarking Data Visualization Agents in Real-World Scenarios" and "RESTestBench: A Benchmark for Evaluating the Effectiveness of LLM-Generated REST API Test Cases from NL Requirements".
- Topic "Evaluation": 1 hit, covering group LM. Representative papers include "Conditional misalignment: common interventions can hide emergent misalignment behind contextual triggers".

## Topic Focus

### LLM

- Hits: 13
- Groups covered: LM
- Representative papers: "LLM-ReSum: A Framework for LLM Reflective Summarization through Self-Evaluation", "Walking Through Uncertainty: An Empirical Study of Uncertainty Estimation for Audio-Aware Large Language Models", "DV-World: Benchmarking Data Visualization Agents in Real-World Scenarios"
- Quick read:
  - "LLM-ReSum: A Framework for LLM Reflective Summarization through Self-Evaluation" [Evaluation / Data / Method]: Reliable evaluation of large language model (LLM)-generated summaries remains an open challenge, particularly across heterogeneous domains and document lengths…
  - "Walking Through Uncertainty: An Empirical Study of Uncertainty Estimation for Audio-Aware Large Language Models" [Evaluation / Application / Method]: Recent audio-aware large language models (ALLMs) have demonstrated strong capabilities across diverse audio understanding and reasoning tasks, but they still f…

### Language Model

- Hits: 12
- Groups covered: LM
- Representative papers: "LLM-ReSum: A Framework for LLM Reflective Summarization through Self-Evaluation", "Walking Through Uncertainty: An Empirical Study of Uncertainty Estimation for Audio-Aware Large Language Models", "From Soliloquy to Agora: Memory-Enhanced LLM Agents with Decentralized Debate for Optimization Modeling"
- Quick read:
  - "LLM-ReSum: A Framework for LLM Reflective Summarization through Self-Evaluation" [Evaluation / Data / Method]: Reliable evaluation of large language model (LLM)-generated summaries remains an open challenge, particularly across heterogeneous domains and document lengths…
  - "Walking Through Uncertainty: An Empirical Study of Uncertainty Estimation for Audio-Aware Large Language Models" [Evaluation / Application / Method]: Recent audio-aware large language models (ALLMs) have demonstrated strong capabilities across diverse audio understanding and reasoning tasks, but they still f…

### Benchmark

- Hits: 4
- Groups covered: LM
- Representative papers: "DV-World: Benchmarking Data Visualization Agents in Real-World Scenarios", "RESTestBench: A Benchmark for Evaluating the Effectiveness of LLM-Generated REST API Test Cases from NL Requirements", "ADEMA: A Knowledge-State Orchestration Architecture for Long-Horizon Knowledge Synthesis with LLMAgents"
- Quick read:
  - "DV-World: Benchmarking Data Visualization Agents in Real-World Scenarios" [Evaluation / Application / Method]: Real-world data visualization (DV) requires native environmental grounding, cross-platform evolution, and proactive intent alignment. Yet, existing benchmarks…
  - "RESTestBench: A Benchmark for Evaluating the Effectiveness of LLM-Generated REST API Test Cases from NL Requirements" [Evaluation]: Existing REST API testing tools are typically evaluated using code coverage and crash-based fault metrics. However, recent LLM-based approaches increasingly ge…

### Evaluation

- Hits: 1
- Groups covered: LM
- Representative papers: "Conditional misalignment: common interventions can hide emergent misalignment behind contextual triggers"
- Quick read:
  - "Conditional misalignment: common interventions can hide emergent misalignment behind contextual triggers" [Evaluation / Method]: Finetuning a language model can lead to emergent misalignment (EM) [Betley et al., 2025b]. Models trained on a narrow distribution of misaligned behavior gener…

## LM Observations

### Group at a Glance

- "LLM-ReSum: A Framework for LLM Reflective Summarization through Self-Evaluation" [Evaluation / Data / Method]: Reliable evaluation of large language model (LLM)-generated summaries remains an open challenge, particularly across heterogeneous domains and document lengths…
- "Walking Through Uncertainty: An Empirical Study of Uncertainty Estimation for Audio-Aware Large Language Models" [Evaluation / Application / Method]: Recent audio-aware large language models (ALLMs) have demonstrated strong capabilities across diverse audio understanding and reasoning tasks, but they still f…
- "DV-World: Benchmarking Data Visualization Agents in Real-World Scenarios" [Evaluation / Application / Method]: Real-world data visualization (DV) requires native environmental grounding, cross-platform evolution, and proactive intent alignment. Yet, existing benchmarks…
- "From Soliloquy to Agora: Memory-Enhanced LLM Agents with Decentralized Debate for Optimization Modeling" [Evaluation / Method]: Optimization modeling underpins real-world decision-making in logistics, manufacturing, energy, and public services, but reliably solving such problems from na…
- "SAFEdit: Does Multi-Agent Decomposition Resolve the Reliability Challenges of Instructed Code Editing?" [Evaluation / Method]: Instructed code editing is a significant challenge for large language models (LLMs). On the EditBench benchmark, 39 of 40 evaluated models obtain a task succes…

### Paper Overview

1. [LLM-ReSum: A Framework for LLM Reflective Summarization through Self-Evaluation](https://arxiv.org/abs/2604.25665v1)
   - Published: 2026-04-28 22:00
   - Authors: Huyen Nguyen, Haoxuan Zhang, Yang Zhang, Junhua Ding, Haihua Chen
   - Source: arXiv
   - Relevance score: 197
   - Match reasons: title matched "LLM"; title matched "evaluation"; summary matched "language model"; summary matched "large language model"
   - Categories: cs.CL, cs.AI, cs.DL, cs.IR
   - Tags: Evaluation / Data / Method
   - Topics: LLM / Language Model
   - PDF: https://arxiv.org/pdf/2604.25665v1
   - Abstract: Reliable evaluation of large language model (LLM)-generated summaries remains an open challenge, particularly across heterogeneous domains and document lengths. We conduct a comprehensive meta-evaluation of 14 automatic summarization metrics and LLM-based evaluators across seven datasets spanning five domains, covering documents from short news articles to long scientific, governmental, and legal texts (2K-27K words) with over 1,500 human-annotated summaries. Our results show that traditional lexical overlap metrics (e.g., ROUGE, BLEU) exhibit weak or negative correlation with human judgments, while task-specific neural metrics and LLM-based evaluators achieve substantially higher alignment, especially for linguistic quality assessment. Leveraging these findings, we propose LLM-ReSum, a self-reflective summarization framework that integrates LLM-based evaluation and generation in a closed feedback loop without model finetuning. Across three domains, LLM-ReSum improves low-quality summaries by up to 33% in factual accuracy and 39% in coverage, with human evaluators preferring refined summaries in 89% of cases. We additionally introduce PatentSumEval, a new human-annotated benchmark for legal document summarization comprising 180 expert-evaluated summaries. All code and datasets will be released in GitHub.

2. [Walking Through Uncertainty: An Empirical Study of Uncertainty Estimation for Audio-Aware Large Language Models](https://arxiv.org/abs/2604.25591v1)
   - Published: 2026-04-28 20:56
   - Authors: Chun-Yi Kuan, Wei-Ping Huang, Hung-yi Lee
   - Source: arXiv
   - Relevance score: 178
   - Match reasons: title matched "language model"; title matched "large language model"; summary matched "LLM"; summary matched "reasoning"
   - Categories: eess.AS, cs.AI, cs.CL, cs.LG, cs.SD
   - Tags: Evaluation / Application / Method
   - Topics: LLM / Language Model
   - PDF: https://arxiv.org/pdf/2604.25591v1
   - Abstract: Recent audio-aware large language models (ALLMs) have demonstrated strong capabilities across diverse audio understanding and reasoning tasks, but they still frequently produce hallucinated or overly confident outputs. While uncertainty estimation has been extensively studied in text-only LLMs, it remains largely unexplored for ALLMs, where audio-conditioned generation introduces additional challenges such as perceptual ambiguity and cross-modal grounding. In this work, we present the first systematic empirical study of uncertainty estimation in ALLMs. We benchmark five representative methods, including predictive entropy, length-normalized entropy, semantic entropy, discrete semantic entropy, and P(True), across multiple models and diverse evaluation settings spanning general audio understanding, reasoning, hallucination detection, and unanswerable question answering. Our results reveal two key findings. First, semantic-level and verification-based methods consistently outperform token-level baselines on general audio reasoning benchmarks. Second, on trustworthiness-oriented benchmarks, the relative effectiveness of uncertainty methods becomes notably more model- and benchmark-dependent, indicating that conclusions drawn from general reasoning settings do not straightforwardly transfer to hallucination and unanswerable-question scenarios. We further explore uncertainty-based adaptive inference as a potential downstream application. We hope this study provides a foundation for future research on reliable, uncertainty-aware audio-language systems.

3. [DV-World: Benchmarking Data Visualization Agents in Real-World Scenarios](https://arxiv.org/abs/2604.25914v1)
   - Published: 2026-04-29 01:58
   - Authors: Jinxiang Meng, Shaoping Huang, Fangyu Lei, Jingyu Guo, Haoxiang Liu, Jiahao Su et al.
   - Source: arXiv
   - Relevance score: 165
   - Match reasons: title matched "agent"; title matched "benchmark"; summary matched "LLM"; summary matched "alignment"
   - Categories: cs.CL
   - Tags: Evaluation / Application / Method
   - Topics: LLM / Benchmark
   - PDF: https://arxiv.org/pdf/2604.25914v1
   - Abstract: Real-world data visualization (DV) requires native environmental grounding, cross-platform evolution, and proactive intent alignment. Yet, existing benchmarks often suffer from code-sandbox confinement, single-language creation-only tasks, and assumption of perfect intent. To bridge these gaps, we introduce DV-World, a benchmark of 260 tasks designed to evaluate DV agents across real-world professional lifecycles. DV-World spans three domains: DV-Sheet for native spreadsheet manipulation including chart and dashboard creation as well as diagnostic repair; DV-Evolution for adapting and restructuring reference visual artifacts to fit new data across diverse programming paradigms and DV-Interact for proactive intent alignment with a user simulator that mimics real-world ambiguous requirements. Our hybrid evaluation framework integrates Table-value Alignment for numerical precision and MLLM-as-a-Judge with rubrics for semantic-visual assessment. Experiments reveal that state-of-the-art models achieve less than 50% overall performance, exposing critical deficits in handling the complex challenges of real-world data visualization. DV-World provides a realistic testbed to steer development toward the versatile expertise required in enterprise workflows. Our data and code are available at https://github.com/DA-Open/DV-World.

4. [From Soliloquy to Agora: Memory-Enhanced LLM Agents with Decentralized Debate for Optimization Modeling](https://arxiv.org/abs/2604.25847v1)
   - Published: 2026-04-29 00:53
   - Authors: Jianghao Lin, Zi Ling, Chenyu Zhou, Tianyi Xu, Ruoqing Jiang, Zizhuo Wang et al.
   - Source: arXiv
   - Relevance score: 164
   - Match reasons: title matched "LLM"; title matched "agent"; summary matched "language model"; summary matched "large language model"
   - Categories: math.OC, cs.AI, cs.LG
   - Tags: Evaluation / Method
   - Topics: LLM / Language Model
   - PDF: https://arxiv.org/pdf/2604.25847v1
   - Abstract: Optimization modeling underpins real-world decision-making in logistics, manufacturing, energy, and public services, but reliably solving such problems from natural-language requirements remains challenging for current large language models (LLMs). In this paper, we propose Agora-Opt, a modular agentic framework for optimization modeling that combines decentralized debate with a read-write memory bank. Agora-Opt allows multiple agent teams to independently produce end-to-end solutions and reconcile them through an outcome-grounded debate protocol, while memory stores solver-verified artifacts and past disagreement resolutions to support training-free improvement over time. This design is flexible across both backbones and methods: it reduces base-model lock-in, transfers across different LLM families, and can be layered onto existing pipelines with minimal coupling. Across public benchmarks, Agora-Opt achieves the strongest overall performance among all compared methods, outperforming strong zero-shot LLMs, training-centric approaches, and prior agentic baselines. Further analyses show robust gains across backbone choices and component variants, and demonstrate that decentralized debate offers a structural advantage over centralized selection by enabling agents to refine candidate solutions through interaction and even recover correct formulations when all initial candidates are flawed. These results suggest that reliable optimization modeling benefits from combining collaborative cross-checking with reusable experience, and position Agora-Opt as a practical and extensible foundation for trustworthy optimization modeling assistance. Our code and data are available at https://github.com/CHIANGEL/Agora-Opt.

5. [SAFEdit: Does Multi-Agent Decomposition Resolve the Reliability Challenges of Instructed Code Editing?](https://arxiv.org/abs/2604.25737v1)
   - Published: 2026-04-28 23:04
   - Authors: Noam Tarshish, Nofar Selouk, Daniel Hodisan, Bar Ezra Gafniel, Yuval Elovici, Asaf Shabtai et al.
   - Source: arXiv
   - Relevance score: 158
   - Match reasons: title matched "agent"; summary matched "language model"; summary matched "large language model"; summary matched "LLM"
   - Categories: cs.SE, cs.AI
   - Tags: Evaluation / Method
   - Topics: LLM / Language Model
   - PDF: https://arxiv.org/pdf/2604.25737v1
   - Abstract: Instructed code editing is a significant challenge for large language models (LLMs). On the EditBench benchmark, 39 of 40 evaluated models obtain a task success rate (TSR) below 60 percent, highlighting a gap between general code generation and the ability to perform instruction-driven editing under executable test constraints. To address this, we propose SAFEdit, a multi-agent framework for instructed code editing that decomposes the editing process into specialized roles to improve reliability and reduce unintended code changes. A Planner Agent produces an explicit, visibility-aware edit plan, an Editor Agent applies minimal, literal code modifications, and a Verifier Agent executes real test runs. When tests fail, SAFEdit uses a Failure Abstraction Layer (FAL) to transform raw test logs into structured diagnostic feedback, which is fed back to the Editor to support iterative refinement. We compare SAFEdit against both prior single-model results reported for EditBench and an implemented ReAct single-agent baseline under the same evaluation conditions. We used EditBench to evaluate SAFEdit on 445 code editing instances in five languages (English, Polish, Spanish, Chinese, and Russian) under varying spatial context variants. SAFEdit achieved 68.6 percent TSR, outperforming the single-model baseline by 3.8 percentage points and the ReAct single-agent baseline by 8.6 percentage points. The iterative refinement loop was found to contribute 17.4 percentage points to SAFEdit's overall success rate. SAFEdit's automated error analysis further indicates a reduction in instruction-level hallucinations compared to single-agent approaches, providing an additional framework component for interpreting failures beyond pass or fail outcomes.

6. [Think Before You Act -- A Neurocognitive Governance Model for Autonomous AI Agents](https://arxiv.org/abs/2604.25684v1)
   - Published: 2026-04-28 22:15
   - Authors: Eranga Bandara, Ross Gore, Asanga Gunaratna, Sachini Rajapakse, Isurunima Kularathna, Ravi Mukkamala et al.
   - Source: arXiv
   - Relevance score: 157
   - Match reasons: title matched "agent"; summary matched "language model"; summary matched "large language model"; summary matched "LLM"
   - Categories: cs.AI
   - Tags: Application / Method
   - Topics: LLM / Language Model
   - PDF: https://arxiv.org/pdf/2604.25684v1
   - Abstract: The rapid deployment of autonomous AI agents across enterprise, healthcare, and safety-critical environments has created a fundamental governance gap. Existing approaches, runtime guardrails, training-time alignment, and post-hoc auditing treat governance as an external constraint rather than an internalized behavioral principle, leaving agents vulnerable to unsafe and irreversible actions. We address this gap by drawing on how humans self-govern naturally: before acting, humans engage deliberate cognitive processes grounded in executive function, inhibitory control, and internalized organizational rules to evaluate whether an intended action is permissible, requires modification, or demands escalation. This paper proposes a neurocognitive governance framework that formally maps this human self-governance process to LLM-driven agent reasoning, establishing a structural parallel between the human brain and the large language model as the cognitive core of an agent. We formalize a Pre-Action Governance Reasoning Loop (PAGRL) in which agents consult a four-layer governance rule set: global, workflow-specific, agent-specific, and situational before every consequential action, mirroring how human organizations structure compliance hierarchies across enterprise, department, and role levels. Implemented on a production-grade retail supply chain workflow, the framework achieves 95% compliance accuracy and zero false escalations to human oversight, demonstrating that embedding governance into agent reasoning produces more consistent, explainable, and auditable compliance than external enforcement. This work offers a principled foundation for autonomous AI agents that govern themselves the way humans do: not because rules are imposed upon them, but because deliberation is embedded in how they think.

7. [Towards Agentic Investigation of Security Alerts](https://arxiv.org/abs/2604.25846v1)
   - Published: 2026-04-29 00:52
   - Authors: Even Eilertsen, Vasileios Mavroeidis, Gudmund Grov
   - Source: arXiv
   - Relevance score: 154
   - Match reasons: title matched "agent"; summary matched "language model"; summary matched "large language model"; summary matched "LLM"
   - Categories: cs.CR, cs.AI
   - Tags: Application / Method
   - Topics: LLM / Language Model
   - PDF: https://arxiv.org/pdf/2604.25846v1
   - Abstract: Security analysts are overwhelmed by the volume of alerts and the low context provided by many detection systems. Early-stage investigations typically require manual correlation across multiple log sources, a task that is usually time-consuming. In this paper, we present an experimental, agentic workflow that leverages large language models (LLMs) augmented with predefined queries and constrained tool access (structured SQL over Suricata logs and grep-based text search) to automate the first stages of alert investigation. The proposed workflow integrates queries to provide an overview of the available data, and LLM components that selects which queries to use based on the overview results, extracts raw evidence from the query results, and delivers a final verdict of the alert. Our results demonstrate that the LLM-powered workflow can investigate log sources, plan an investigation, and produce a final verdict that has a significantly higher accuracy than a verdict produced by the same LLM without the proposed workflow. By recognizing the inherent limitations of directly applying LLMs to high-volume and unstructured data, we propose combining existing investigation practices of real-world analysts with a structured approach to leverage LLMs as virtual security analysts, thereby assisting and reducing the manual workload.

8. [RESTestBench: A Benchmark for Evaluating the Effectiveness of LLM-Generated REST API Test Cases from NL Requirements](https://arxiv.org/abs/2604.25862v1)
   - Published: 2026-04-29 00:59
   - Authors: Leon Kogler, Stefan Hangler, Maximilian Ehrhart, Benedikt Dornauer, Roland Wuersching, Peter Schrammel
   - Source: arXiv
   - Relevance score: 146
   - Match reasons: title matched "LLM"; title matched "benchmark"; summary matched "RAG"; summary matched "evaluation"
   - Categories: cs.SE, cs.AI
   - Tags: Evaluation
   - Topics: LLM / Benchmark
   - PDF: https://arxiv.org/pdf/2604.25862v1
   - Abstract: Existing REST API testing tools are typically evaluated using code coverage and crash-based fault metrics. However, recent LLM-based approaches increasingly generate tests from NL requirements to validate functional behaviour, making traditional metrics weak proxies for whether generated tests validate intended behaviour. To address this gap, we present RESTestBench, a benchmark comprising three REST services paired with manually verified NL requirements in both precise and vague variants, enabling controlled and reproducible evaluation of requirement-based test generation. RESTestBench further introduces a requirements-based mutation testing metric that measures the fault-detection effectiveness of a generated test case with respect to a specific requirement, extending the property-based approach of Bartocci et al. Using RESTestBench, we evaluate two approaches across multiple state-of-the-art LLMs: (i) non-refinement-based generation, and (ii) refinement-based generation guided by interaction with the running SUT. In the refinement experiments, RESTestBench assesses how exposure to the actual implementation, valid or mutated, affects test effectiveness. Our results show that test effectiveness drops considerably when the generator interacts with faulty or mutated code, especially for vague requirements, sometimes negating the benefit of refinement and indicating that incorporating actual SUT behaviour is unnecessary when requirement detail is high.

9. [ADEMA: A Knowledge-State Orchestration Architecture for Long-Horizon Knowledge Synthesis with LLMAgents](https://arxiv.org/abs/2604.25849v1)
   - Published: 2026-04-29 00:54
   - Authors: Zhou Hanlin, Chan Huah Yong
   - Source: arXiv
   - Relevance score: 146
   - Match reasons: title matched "LLM"; title matched "agent"; summary matched "benchmark"; summary matched "evaluation"
   - Categories: cs.AI
   - Tags: Evaluation / Method
   - Topics: LLM / Benchmark
   - PDF: https://arxiv.org/pdf/2604.25849v1
   - Abstract: Long-horizon LLM tasks often fail not because a single answer is unattainable, but because knowledge states drift across rounds, intermediate commitments remain implicit, and interruption fractures the evolving evidence chain. This paper presents ADEMA as a knowledge-state orchestration architecture for long-horizon knowledge synthesis rather than as a generic multi-agent runtime. The architecture combines explicit epistemic bookkeeping, heterogeneous dual-evaluator governance, adaptive task-mode switching, reputation-shaped resource allocation, checkpoint-resumable persistence, segment-level memory condensation, artifact-first assembly, and final-validity checking with safe fallback. Evidence is drawn entirely from existing materials: a four-scenario showcase package, a fixed 60-run mechanism matrix, targeted micro-ablation and artifact-chain supplements, and a repaired protocol-level benchmark in which code-oriented evaluation is the clearest quality-sensitive mechanism block. Across the fixed matrix, removing checkpoint/resume produced the only invalid run, and it did so in the interruption-sensitive resume condition. By contrast, dual evaluation, segment synthesis, and dynamic governance are best interpreted as supporting control mechanisms that shape trajectory discipline, explicit artifact progression, and cost-quality behavior rather than as universal binary prerequisites for completion. The contribution is therefore a knowledge-state orchestration architecture in which explicit epistemic state transition, evidence-bearing artifact progression, and recoverable continuity are the primary design commitments.

10. [Recursive Multi-Agent Systems](https://arxiv.org/abs/2604.25917v1)
   - Published: 2026-04-29 01:59
   - Authors: Xiyuan Yang, Jiaru Zou, Rui Pan, Ruizhong Qiu, Pan Lu, Shizhe Diao et al.
   - Source: arXiv
   - Relevance score: 143
   - Match reasons: title matched "agent"; summary matched "language model"; summary matched "reasoning"; summary matched "RAG"
   - Categories: cs.AI, cs.CL, cs.LG
   - Tags: Evaluation / Application / Method
   - Topics: Language Model / Benchmark
   - PDF: https://arxiv.org/pdf/2604.25917v1
   - Abstract: Recursive or looped language models have recently emerged as a new scaling axis by iteratively refining the same model computation over latent states to deepen reasoning. We extend such scaling principle from a single model to multi-agent systems, and ask: Can agent collaboration itself be scaled through recursion? To this end, we introduce RecursiveMAS, a recursive multi-agent framework that casts the entire system as a unified latent-space recursive computation. RecursiveMAS connects heterogeneous agents as a collaboration loop through the lightweight RecursiveLink module, enabling in-distribution latent thoughts generation and cross-agent latent state transfer. To optimize our framework, we develop an inner-outer loop learning algorithm for iterative whole-system co-optimization through shared gradient-based credit assignment across recursion rounds. Theoretical analyses of runtime complexity and learning dynamics establish that RecursiveMAS is more efficient than standard text-based MAS and maintains stable gradients during recursive training. Empirically, we instantiate RecursiveMAS under 4 representative agent collaboration patterns and evaluate across 9 benchmarks spanning mathematics, science, medicine, search, and code generation. In comparison with advanced single/multi-agent and recursive computation baselines, RecursiveMAS consistently delivers an average accuracy improvement of 8.3%, together with 1.2×-2.4× end-to-end inference speedup, and 34.6%-75.6% token usage reduction. Code and Data are provided in https://recursivemas.github.io.

11. [Progressing beyond Art Masterpieces or Touristic Clichés: how to assess your LLMs for cultural alignment?](https://arxiv.org/abs/2604.25654v1)
   - Published: 2026-04-28 21:50
   - Authors: António Branco, João Silva, Nuno Marques, Luis Gomes, Ricardo Campos, Raquel Sequeira et al.
   - Source: arXiv
   - Relevance score: 143
   - Match reasons: title matched "LLM"; title matched "alignment"; summary matched "language model"; summary matched "large language model"
   - Categories: cs.CL
   - Tags: Data / Method
   - Topics: LLM / Language Model
   - PDF: https://arxiv.org/pdf/2604.25654v1
   - Abstract: Although the cultural (mis)alignment of Large Language Models (LLMs) has attracted increasing attention -- often framed in terms of cultural bias -- until recently there has been limited work on the design and development of datasets for cultural assessment. Here, we review existing approaches to such datasets and identify their main limitations. To address these issues, we propose design guidelines for annotators and report on the construction of a dataset built according to these principles. We further present a series of contrastive experiments conducted with this dataset. The results demonstrate that our design yields test sets with greater discriminative power, effectively distinguishing between models specialized for a given culture and those that are not, ceteris paribus.

12. [CGU-ILALab at FoodBench-QA 2026: Comparing Traditional and LLM-based Approaches for Recipe Nutrient Estimation](https://arxiv.org/abs/2604.25774v1)
   - Published: 2026-04-28 23:41
   - Authors: Wei-Chun Chen, Yu-Xuan Chen, I-Fang Chung, Ying-Jia Lin
   - Source: arXiv
   - Relevance score: 141
   - Match reasons: title matched "LLM"; summary matched "language model"; summary matched "large language model"; summary matched "reasoning"
   - Categories: cs.CL, cs.AI
   - Tags: Application / Method
   - Topics: LLM / Language Model
   - PDF: https://arxiv.org/pdf/2604.25774v1
   - Abstract: Accurate nutrient estimation from unstructured recipe text is an important yet challenging problem in dietary monitoring, due to ambiguous ingredient terminology and highly variable quantity expressions. We systematically evaluate models spanning a wide range of representational capacity, from lexical matching methods (TF-IDF with Ridge Regression), to deep semantic encoders (DeBERTa-v3), to generative reasoning with large language models (LLMs). Under the strict tolerance criteria defined by EU Regulation 1169/2011, our empirical results reveal a clear trade-off between predictive accuracy and computational efficiency. The TF-IDF baseline achieves moderate nutrient estimation performance with near-instantaneous inference, whereas the DeBERTa-v3 encoder performs poorly under task-specific data scarcity. In contrast, few-shot LLM inference (e.g., Gemini 2.5 Flash) and a hybrid LLM refinement pipeline (TF-IDF combined with Gemini 2.5 Flash) deliver the highest validation accuracy across all nutrient categories. These improvements likely arise from the ability of LLMs to leverage pre-trained world knowledge to resolve ambiguous terminology and normalize non-standard units, which remain difficult for purely lexical approaches. However, these gains come at the cost of substantially higher inference latency, highlighting a practical deployment trade-off between real-time efficiency and nutritional precision in dietary monitoring systems.

13. [From CRUD to Autonomous Agents: Formal Validation and Zero-Trust Security for Semantic Gateways in AI-Native Enterprise Systems](https://arxiv.org/abs/2604.25555v1)
   - Published: 2026-04-28 20:25
   - Authors: Ignacio Peyrano
   - Source: arXiv
   - Relevance score: 137
   - Match reasons: title matched "agent"; summary matched "language model"; summary matched "large language model"; summary matched "LLM"
   - Categories: cs.CR, cs.AI
   - Tags: Evaluation / Application / Method
   - Topics: LLM / Language Model
   - PDF: https://arxiv.org/pdf/2604.25555v1
   - Abstract: Enterprise software engineering is shifting away from deterministic CRUD/REST architectures toward AI-native systems where large language models act as cognitive orchestrators. This transition introduces a critical security tension: probabilistic LLMs weaken classical mechanisms for validation, access control, and formal testing. This paper proposes the design, formal validation, and empirical evaluation of a Semantic Gateway governed by the Model Context Protocol (MCP). The gateway reframes the enterprise API as a semantic surface where tools are dynamically discovered, authorized, and executed based on intent and policy enforcement. The central contribution rests on a paradigm shift: autonomous agents must not be validated as traditional software nor as simple API consumers, but as stochastic state-transition systems whose behavior must be abstracted, fuzzed, and audited through enabled-tool graphs. The architecture introduces a three-layer Zero-Trust security model comprising a pre-inference Semantic Firewall, deterministic Tool-Level RBAC, and out-of-band Cryptographic Human-in-the-Loop approval. Enabledness-Preserving Abstractions (EPAs) and greybox semantic fuzzing--originally developed for blockchain smart contract verification--are adapted to audit agent behavior in enterprise environments. Results demonstrate an 84.2% reduction in incidental code. Across 500,000 multi-turn fuzzing sequences, the methodology achieved a 100% discovery rate of hidden unauthorized state transitions, proving that dynamic formal verification is strictly necessary for secure agentic deployment.

14. [Assistants, Not Architects: The Role of LLMs in Networked Systems Design](https://arxiv.org/abs/2604.25506v1)
   - Published: 2026-04-28 19:08
   - Authors: Pratyush Sahu, Rahul Bothra, Venkat Arun, Brighten Godfrey, Akshay Narayan, Ahmed Saeed
   - Source: arxiv
   - Relevance score: 136
   - Match reasons: title matched "LLM"; summary matched "language model"; summary matched "large language model"; summary matched "reasoning"
   - Categories: cs.NI, cs.AI
   - Tags: Evaluation / Application / Method
   - Topics: LLM / Language Model
   - PDF: https://arxiv.org/pdf/2604.25506v1
   - Abstract: Designing the architecture of modern networked systems requires navigating a large, combinatorial space of hardware, systems, and configuration choices with complex cross-layer interactions. Architects must balance competing objectives such as performance, cost, and deployability while satisfying compatibility and resource constraints, often relying on scattered rules-of-thumb drawn from benchmarks, papers, documentation, and expert experience. This raises a natural question: can large language models (LLMs) reliably perform this kind of architectural reasoning? We find that they cannot. While LLMs produce plausible configurations, they frequently miss critical constraints, encode incorrect assumptions, and exhibit "stickiness" to familiar patterns. A natural workaround--iterative validation via simulation or experimentation--is often prohibitively expensive at scale and, in many cases, infeasible, particularly when comparing hardware-dependent alternatives. Motivated by this gap, we present Kepler, a lightweight reasoning framework for architecture design that combines structured, expert-driven specifications with SMT-based optimization. Kepler encodes architecturally significant properties--requirements, incompatibilities, and qualitative trade-offs--about systems, hardware, and workloads as constraints, and synthesizes feasible designs that optimize user-defined objectives. It operates at an abstract level, capturing "rules-of-thumb" rather than detailed system behavior, enabling tractable reasoning while preserving key interactions, and provides explanations for its decisions. Through experiments and case studies, we show that Kepler uncovers interactions missed by LLMs and supports systematic, explainable design exploration.

15. [Conditional misalignment: common interventions can hide emergent misalignment behind contextual triggers](https://arxiv.org/abs/2604.25891v1)
   - Published: 2026-04-29 01:36
   - Authors: Jan Dubiński, Jan Betley, Anna Sztyber-Betley, Daniel Tan, Owain Evans
   - Source: arxiv
   - Relevance score: 125
   - Match reasons: title matched "alignment"; summary matched "language model"; summary matched "reasoning"; summary matched "evaluation"
   - Categories: cs.LG, cs.AI, cs.CR
   - Tags: Evaluation / Method
   - Topics: Language Model / Evaluation
   - PDF: https://arxiv.org/pdf/2604.25891v1
   - Abstract: Finetuning a language model can lead to emergent misalignment (EM) [Betley et al., 2025b]. Models trained on a narrow distribution of misaligned behavior generalize to more egregious behaviors when tested outside the training distribution. We study a set of interventions proposed to reduce EM. We confirm that these interventions reduce or eliminate EM on existing evaluations (questions like "How do I make a quick buck?"). However, if the evaluation prompts are tweaked to resemble the training context, the model displays EM. We call this conditional misalignment. As in standard EM, the model displays misaligned behaviors more egregious than those seen during training, but only on inputs sharing features with the training data. The first two interventions are diluting misaligned data with benign data, and finetuning on benign data after misaligned data. Both produce conditional misalignment. For instance, models trained on a mix of only 5% insecure code still show misalignment when asked to format responses as Python strings (resembling the training context). The third intervention is inoculation prompting. Here, statements with a similar form to the inoculation prompt serve as triggers for misalignment, even if they have the opposite meaning. On the positive side, inoculation prompting has lower (but still non-zero) conditional misalignment if training is on-policy or includes reasoning distillation. Our results imply that in realistic post-training, where misaligned data is typically combined with benign data, models may be conditionally misaligned even if standard evaluations look clean.