# Daily Paper Digest

- Generated at: 2026-05-27 13:23:19 (Asia/Shanghai)
- Search window: last 24 hours
- Hit overview: LM=15, Agent Runtime Security=7, Terminal and SWE Agents=0
- Ranking strategy: hybrid (relevance first, published_at tie-break)
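
The hybrid ranking can be sketched as a two-key sort. This is a minimal illustration, not the digest generator's actual code: the record field names (`score`, `published_at`) and the `hybrid_sort` helper are assumptions, and the "newer first" tie-break direction is inferred from the order of the two entries that share a relevance score of 141 in this digest.

```python
from datetime import datetime

# Hypothetical records: scores and timestamps are taken from this digest,
# but the field names are assumptions made for illustration.
papers = [
    {"title": "It's Not Always Sycophancy", "score": 141,
     "published_at": datetime(2026, 5, 27, 1, 4)},
    {"title": "Probing Cultural Awareness in LLMs", "score": 141,
     "published_at": datetime(2026, 5, 27, 1, 8)},
    {"title": "Traceable Knowledge Graph Reasoning", "score": 214,
     "published_at": datetime(2026, 5, 26, 22, 21)},
]


def hybrid_sort(items):
    """Order by relevance score (descending), then by publication time (newest first)."""
    return sorted(items, key=lambda p: (p["score"], p["published_at"]), reverse=True)


ranked = hybrid_sort(papers)
for paper in ranked:
    print(paper["score"], paper["published_at"].isoformat(), paper["title"])
```

Using a tuple key keeps both criteria in one pass; the timestamp only matters when two scores are equal.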

## Today's Highlights

- Topic "Language Model": 16 hits, covering LM and Agent Runtime Security; representative papers include "Traceable Knowledge Graph Reasoning Enables LLM-Assisted Decision Support for Industrial VOCs in the Steel Industry" and "Scaling, Benchmarking, and Reasoning of Vision-Language Agents for Mobile GUI Navigation".
- Topic "LLM": 15 hits, covering LM and Agent Runtime Security; representative papers include "Traceable Knowledge Graph Reasoning Enables LLM-Assisted Decision Support for Industrial VOCs in the Steel Industry" and "Learning to Act under Noise: Enhancing Agent Robustness via Noisy Environments".
- Topic "Evaluation": 5 hits, covering LM and Agent Runtime Security; representative papers include "Scaling, Benchmarking, and Reasoning of Vision-Language Agents for Mobile GUI Navigation" and "Qiskit QuantumKatas: Adapting Microsoft's Quantum Computing exercises for LLM evaluation".
- Topic "Agent": 4 hits, covering LM and Agent Runtime Security; representative papers include "ENPMR-Bench: Benchmarking Proactive Memory Retrieval for Emotional Support Agents" and "EviACT: An Evidence-to-Action Framework for Agentic Program Repair".
- Topic "Large Language Model": 4 hits, covering LM and Agent Runtime Security; representative papers include "Large Language Model-Powered Query-Driven Event Timeline Summarization in Industrial Search" and "Prompt Injection Detection is Regime-Dependent: A Deployment-Aware Evaluation with Interpretable Structural Signals".

## Topic Focus

### Language Model

- Hits: 16
- Groups covered: LM, Agent Runtime Security
- Representative papers: "Traceable Knowledge Graph Reasoning Enables LLM-Assisted Decision Support for Industrial VOCs in the Steel Industry"; "Scaling, Benchmarking, and Reasoning of Vision-Language Agents for Mobile GUI Navigation"; "Learning to Act under Noise: Enhancing Agent Robustness via Noisy Environments"
- Quick read:
  - "Traceable Knowledge Graph Reasoning Enables LLM-Assisted Decision Support for Industrial VOCs in the Steel Industry" [Evaluation / Data / Application / Method]: Key knowledge for steel-industry volatile organic compounds (VOCs) governance is scattered across unstructured scientific literature, making it difficult to in…
  - "Scaling, Benchmarking, and Reasoning of Vision-Language Agents for Mobile GUI Navigation" [Evaluation / Data / Application / Method]: Vision-Language Models (VLMs) have shown rapid progress in mobile GUI navigation. This paper presents a systematic study of data scaling, benchmarking, and rea…

### LLM

- Hits: 15
- Groups covered: LM, Agent Runtime Security
- Representative papers: "Traceable Knowledge Graph Reasoning Enables LLM-Assisted Decision Support for Industrial VOCs in the Steel Industry"; "Learning to Act under Noise: Enhancing Agent Robustness via Noisy Environments"; "VitaBench 2.0: Evaluating Personalized and Proactive Agents in Long-Term User Interactions"
- Quick read:
  - "Traceable Knowledge Graph Reasoning Enables LLM-Assisted Decision Support for Industrial VOCs in the Steel Industry" [Evaluation / Data / Application / Method]: Key knowledge for steel-industry volatile organic compounds (VOCs) governance is scattered across unstructured scientific literature, making it difficult to in…
  - "Learning to Act under Noise: Enhancing Agent Robustness via Noisy Environments" [Evaluation / Application / Method]: Recent advances in large language models (LLMs) have facilitated the widespread deployment of LLMs as interactive agents capable of reasoning, planning, and to…

### Evaluation

- Hits: 5
- Groups covered: LM, Agent Runtime Security
- Representative papers: "Scaling, Benchmarking, and Reasoning of Vision-Language Agents for Mobile GUI Navigation"; "Qiskit QuantumKatas: Adapting Microsoft's Quantum Computing exercises for LLM evaluation"; "GraphReview: Scientific Paper Evaluation via LLM-Based Graph Message Passing"
- Quick read:
  - "Scaling, Benchmarking, and Reasoning of Vision-Language Agents for Mobile GUI Navigation" [Evaluation / Data / Application / Method]: Vision-Language Models (VLMs) have shown rapid progress in mobile GUI navigation. This paper presents a systematic study of data scaling, benchmarking, and rea…
  - "Qiskit QuantumKatas: Adapting Microsoft's Quantum Computing exercises for LLM evaluation" [Evaluation / Application / Method]: We adapt Microsoft's QuantumKatas -- a well-established quantum computing curriculum -- from Q# to Qiskit, the most widely-adopted quantum computing framework,…

### Agent

- Hits: 4
- Groups covered: LM, Agent Runtime Security
- Representative papers: "ENPMR-Bench: Benchmarking Proactive Memory Retrieval for Emotional Support Agents"; "EviACT: An Evidence-to-Action Framework for Agentic Program Repair"; "Governed Evolution of Agent Runtimes through Executable Operational Cognition"
- Quick read:
  - "ENPMR-Bench: Benchmarking Proactive Memory Retrieval for Emotional Support Agents" [Evaluation / Application / Method]: Memory-augmented language agents are increasingly deployed in affective applications such as emotional support, where understanding and responding to users' la…
  - "EviACT: An Evidence-to-Action Framework for Agentic Program Repair" [Evaluation / Method]: LLM-based agents have moved automated program repair (APR) from fixed-context patch generation to interactive repository-level repair. However, existing agenti…

### Large Language Model

- Hits: 4
- Groups covered: LM, Agent Runtime Security
- Representative papers: "Large Language Model-Powered Query-Driven Event Timeline Summarization in Industrial Search"; "Prompt Injection Detection is Regime-Dependent: A Deployment-Aware Evaluation with Interpretable Structural Signals"; "BAIT: Boundary-Guided Disclosure Escalation via Self-Conditioned Reasoning"
- Quick read:
  - "Large Language Model-Powered Query-Driven Event Timeline Summarization in Industrial Search" [Application / Method]: Understanding how events evolve over time is essential for search engines handling queries about trending news. We present QDET (Query-Driven Event Timeline Su…
  - "Prompt Injection Detection is Regime-Dependent: A Deployment-Aware Evaluation with Interpretable Structural Signals" [Evaluation / Application / Method]: Prompt injection poses a critical threat to the safe deployment of large language models, yet existing detection approaches are typically evaluated under limit…

## LM Watch

### Group at a Glance

- "Traceable Knowledge Graph Reasoning Enables LLM-Assisted Decision Support for Industrial VOCs in the Steel Industry" [Evaluation / Data / Application / Method]: Key knowledge for steel-industry volatile organic compounds (VOCs) governance is scattered across unstructured scientific literature, making it difficult to in…
- "Scaling, Benchmarking, and Reasoning of Vision-Language Agents for Mobile GUI Navigation" [Evaluation / Data / Application / Method]: Vision-Language Models (VLMs) have shown rapid progress in mobile GUI navigation. This paper presents a systematic study of data scaling, benchmarking, and rea…
- "Learning to Act under Noise: Enhancing Agent Robustness via Noisy Environments" [Evaluation / Application / Method]: Recent advances in large language models (LLMs) have facilitated the widespread deployment of LLMs as interactive agents capable of reasoning, planning, and to…
- "VitaBench 2.0: Evaluating Personalized and Proactive Agents in Long-Term User Interactions" [Evaluation / Method]: Large language models (LLMs) have evolved into interactive agents that collaborate with users in real-world tasks. Effective collaboration in such settings inc…
- "MUSE-Autoskill: Self-Evolving Agents via Skill Creation, Memory, Management, and Evaluation" [Evaluation / Method]: Large language model (LLM) agents rely on reusable skills to solve complex tasks. However, existing skill creation approaches treat skills as isolated and stat…

### Paper Overview

1. [Traceable Knowledge Graph Reasoning Enables LLM-Assisted Decision Support for Industrial VOCs in the Steel Industry](https://arxiv.org/abs/2605.27071v1)
   - Published: 2026-05-26 22:21
   - Authors: Changqing Su, Yu Ding, Zuhong Lin, Hongyu Liu, Xi He, Zheng Zeng, et al.
   - Source: arXiv
   - Relevance score: 214
   - Match reasons: title matched "LLM"; title matched "reasoning"; summary matched "language model"; summary matched "large language model"
   - Categories: cs.AI
   - Tags: Evaluation / Data / Application / Method
   - Topics: Language Model / LLM
   - PDF: https://arxiv.org/pdf/2605.27071v1
   - Abstract: Key knowledge for steel-industry volatile organic compounds (VOCs) governance is scattered across unstructured scientific literature, making it difficult to integrate process, pollutant, and control-technology evidence and increasing the risk of hallucination when general large language models (LLMs) answer low-frequency industrial questions. Here we developed Chat-ISV, a knowledge graph (KG) enhanced multi-agent Q&A system that parses a curated steel-industry VOCs literature corpus, constructs a Neo4j KG with 27180 nodes and 81779 semantic edges, and combines prompt-constrained extraction, chunk-centered topology optimization, multi-agent routing, source-backtracking retrieval, local literature retrieval, open-domain knowledge access, and interactive subgraph visualization. Benchmark tests and 400 expert blind evaluations showed that topology optimization reduced isolated nodes from 57% to 4.08% and that Chat-ISV achieved high factual reliability, with 96.93% precision, 72.63% recall, an F1-score of 0.830, and a mean score of 1.69/2.00. By converting fragmented environmental-engineering literature into traceable, queryable, and decision-support-oriented knowledge, Chat-ISV establishes a scalable environmental-informatics paradigm for reliable LLM deployment and intelligent pollution-control decision support in specialized industrial domains.
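
As a quick consistency check on the figures quoted in the abstract above, the F1-score can be recomputed from the reported precision and recall using the standard harmonic-mean definition. The helper name `f1_score` is ours; the inputs are the only values taken from the abstract.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (standard F1 definition)."""
    return 2 * precision * recall / (precision + recall)


# Reported values from the abstract: 96.93% precision, 72.63% recall.
f1 = f1_score(0.9693, 0.7263)
print(f"F1 = {f1:.3f}")
```

The recomputed value rounds to the 0.830 reported in the abstract, so the three numbers are mutually consistent.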

2. [Scaling, Benchmarking, and Reasoning of Vision-Language Agents for Mobile GUI Navigation](https://arxiv.org/abs/2605.27134v1)
   - Published: 2026-05-26 23:03
   - Authors: Heng Qu, Yike Liu, Renren Jin, Wenzong Zhang, Pengzhi Gao, Wei Liu, et al.
   - Source: arXiv
   - Relevance score: 201
   - Match reasons: title matched "reasoning"; title matched "agent"; title matched "benchmark"; summary matched "language model"
   - Categories: cs.AI
   - Tags: Evaluation / Data / Application / Method
   - Topics: Language Model / Evaluation
   - PDF: https://arxiv.org/pdf/2605.27134v1
   - Abstract: Vision-Language Models (VLMs) have shown rapid progress in mobile GUI navigation. This paper presents a systematic study of data scaling, benchmarking, and reasoning for VLM-based agents in this domain. To facilitate rigorous evaluation, we introduce HyperTrack, a large-scale dataset with over 16000 real-world tasks across more than 650 Chinese mobile applications, along with GUIEvalKit, an open-source toolkit for unified benchmarking of VLMs on offline GUI navigation tasks. Using HyperTrack, we analyze the effects of training data scale on both supervised and reinforcement-based finetuning. Our results show that reinforcement-based finetuning consistently outperforms supervised finetuning, particularly in out-of-domain settings, highlighting the synergy between data scaling and reinforcement learning. Leveraging GUIEvalKit, we further benchmark state-of-the-art (SOTA) VLMs and analyze how interaction history and reasoning capabilities influence task completion. Together, HyperTrack and GUIEvalKit provide a comprehensive platform for developing and evaluating VLM agents in mobile GUI navigation tasks.

3. [Learning to Act under Noise: Enhancing Agent Robustness via Noisy Environments](https://arxiv.org/abs/2605.27209v1)
   - Published: 2026-05-27 00:02
   - Authors: Yuxin Chen, Xiaodong Cai, Junfeng Fang, Zhuowen Han, Yu Wang, Yaorui Shi, et al.
   - Source: arXiv
   - Relevance score: 176
   - Match reasons: title matched "agent"; summary matched "language model"; summary matched "large language model"; summary matched "LLM"
   - Categories: cs.AI
   - Tags: Evaluation / Application / Method
   - Topics: Language Model / LLM
   - PDF: https://arxiv.org/pdf/2605.27209v1
   - Abstract: Recent advances in large language models (LLMs) have facilitated the widespread deployment of LLMs as interactive agents capable of reasoning, planning, and tool use. Despite strong performance on existing benchmarks, such agents often exhibit notable degradation when deployed in real-world settings, where environments are inherently stochastic and imperfect. We argue that this discrepancy arises from a fundamental mismatch between idealized training settings and real-world interaction dynamics, where current paradigms rely on carefully curated task instructions and stable, well-controlled environments. To address this gap, we propose NoisyAgent, an agentic training framework that explicitly incorporates environmental imperfections into the agent learning process. We identify two major sources of interaction noise in real-world scenarios: user noise, which captures ambiguity and variability in user interaction, and tool noise, which reflects failures and anomalies in tool execution. We introduce such perturbations into the training pipeline by modifying user interaction patterns and simulating tool execution results within the training environment. To stabilize training while encouraging agents to handle increasingly challenging imperfections, noise is applied to only a subset of rollouts and progressively increased in difficulty as the model adapts to the current noise level. Extensive experiments demonstrate that our approach consistently improves agent robustness under noisy and dynamic environments. Our analysis reveals that training under noise conditions also yields performance gains on idealized benchmarks, suggesting that controlled exposure to environmental noise promotes more generalizable reasoning and decision-making behaviors. Our findings highlight the importance of modeling interaction imperfections for bridging the gap between agent training and real-world deployment.
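
The scheduling idea in the abstract above (noise on only a subset of rollouts, with difficulty raised as the model adapts) can be sketched as follows. This is an illustrative toy, not NoisyAgent's implementation: the function names, the 25% noise fraction, the 0.7 success threshold, and the three-level cap are all assumptions, since the abstract does not give these details.

```python
import random


def choose_noisy_rollouts(num_rollouts, noise_fraction, rng):
    """Pick the subset of rollout indices that will receive injected noise."""
    k = int(num_rollouts * noise_fraction)
    return set(rng.sample(range(num_rollouts), k))


def next_noise_level(level, success_rate, threshold=0.7, max_level=3):
    """Raise the difficulty by one step once the agent copes with the current level."""
    if success_rate >= threshold and level < max_level:
        return level + 1
    return level


rng = random.Random(0)  # fixed seed so the sketch is reproducible
noisy = choose_noisy_rollouts(num_rollouts=8, noise_fraction=0.25, rng=rng)
level = next_noise_level(level=1, success_rate=0.8)
print(f"noisy rollouts: {sorted(noisy)}, next noise level: {level}")
```

Keeping most rollouts clean gives the policy a stable learning signal, while the threshold rule delays harder perturbations until the current ones are handled.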

4. [VitaBench 2.0: Evaluating Personalized and Proactive Agents in Long-Term User Interactions](https://arxiv.org/abs/2605.27141v1)
   - Published: 2026-05-26 23:07
   - Authors: Yuxin Chen, Yi Zhang, Zhengzhou Cai, Yaorui Shi, Zhiyuan Yao, Chenhang Cui, et al.
   - Source: arXiv
   - Relevance score: 175
   - Match reasons: title matched "agent"; summary matched "language model"; summary matched "large language model"; summary matched "LLM"
   - Categories: cs.AI
   - Tags: Evaluation / Method
   - Topics: Language Model / LLM
   - PDF: https://arxiv.org/pdf/2605.27141v1
   - Abstract: Large language models (LLMs) have evolved into interactive agents that collaborate with users in real-world tasks. Effective collaboration in such settings increasingly depends on understanding the user beyond what is explicitly stated, as user intent is often reflected in fragmented daily interactions and requires both personalized modeling and proactive interaction. However, existing agent benchmarks primarily evaluate reasoning and tool use, largely overlooking the challenges of inferring and leveraging user preferences in realistic scenarios. To address this gap, we introduce VitaBench 2.0, a benchmark for evaluating personalized and proactive agent behavior in long-term user interactions. In VitaBench 2.0, tasks are organized as temporally ordered sequences for individual users, where preferences are embedded in fragmented and heterogeneous interactions. Successful completion of tasks requires the agent to continuously extract, utilize, and update user preferences from these interactions. We further evaluate proactiveness through tasks that require agents to recognize missing information and actively acquire it from users or environments before making decisions. To support systematic analysis, we provide an extensible memory interface that enables controlled comparison across different memory architectures. We benchmark a diverse set of frontier proprietary and open-source LLMs. Results show that real-world personalization remains highly challenging even for state-of-the-art models, revealing a substantial gap between current capabilities and practical requirements. Extensive analysis further reveals the failure modes and capability bottlenecks of current agents in real-world personalized decision-making, providing insights for future model improvements.

5. [MUSE-Autoskill: Self-Evolving Agents via Skill Creation, Memory, Management, and Evaluation](https://arxiv.org/abs/2605.27366v1)
   - Published: 2026-05-27 01:59
   - Authors: Huawei Lin, Peng Li, Jie Song, Fuxin Jiang, Tieying Zhang
   - Source: arXiv
   - Relevance score: 164
   - Match reasons: title matched "agent"; title matched "evaluation"; summary matched "language model"; summary matched "large language model"
   - Categories: cs.AI, cs.CL, cs.LG, cs.MA
   - Tags: Evaluation / Method
   - Topics: Language Model / LLM
   - PDF: https://arxiv.org/pdf/2605.27366v1
   - Abstract: Large language model (LLM) agents rely on reusable skills to solve complex tasks. However, existing skill creation approaches treat skills as isolated and static artifacts, limiting their reusability, reliability, and long-term improvement. We propose MUSE-Autoskill Agent (Memory-Utilizing Skill Evolution), a skill-centric agent framework that lets agents continuously improve their task-solving capability by creating, reusing, and refining skills under a unified lifecycle (creation, memory, management, evaluation, and refinement). Our framework enables agents to create skills on demand, store and reuse them across tasks, organize and select them efficiently, and evaluate them through unit tests and runtime feedback for continuous refinement. We further introduce skill-level memory that accumulates experience for each skill across tasks, enabling more effective reuse and adaptation over time. Experiments on SkillsBench provide initial evidence that lifecycle-managed skills can improve task success, efficiency, reuse, and cross-agent transfer, highlighting the importance of treating skills as long-lived, experience-aware, and testable assets.

6. [BASIS: Batchwise Advantage Estimation from Single-Rollout Information Sharing for LLM Reasoning](https://arxiv.org/abs/2605.27293v1)
   - Published: 2026-05-27 01:06
   - Authors: Shijin Gong, Erhan Xu, Kai Ye, Francesco Quinzan, Giulia Livieri, Chengchun Shi
   - Source: arXiv
   - Relevance score: 163
   - Match reasons: title matched "LLM"; title matched "reasoning"; summary matched "language model"; summary matched "large language model"
   - Categories: cs.LG, stat.ML
   - Tags: Method
   - Topics: Language Model / LLM
   - PDF: https://arxiv.org/pdf/2605.27293v1
   - Abstract: Reinforcement learning with verifiable rewards has become a standard recipe for improving the reasoning abilities of large language models. Existing algorithms face a tradeoff between computational efficiency and sample efficiency in value estimation and policy learning. We introduce BASIS, a critic-free post-training algorithm designed to address this tradeoff. At each online training step, BASIS samples only one rollout per prompt, but leverages rich information across prompts in the entire batch to improve value function estimation. Our experiments demonstrate that BASIS reduces MSE in value function estimation by 69% compared to REINFORCE++, a representative single-rollout baseline, and achieves lower MSE with one rollout than group mean estimators with 8 rollouts. This improvement in value estimation translates to better policy optimization: using substantially less training time, BASIS achieves performance close to multi-rollout GRPO-type baselines and often outperforms single-rollout REINFORCE-type baselines.

7. [Qiskit QuantumKatas: Adapting Microsoft's Quantum Computing exercises for LLM evaluation](https://arxiv.org/abs/2605.27210v1)
   - Published: 2026-05-27 00:02
   - Authors: Juan Cruz-Benito, Ismael Faro
   - Source: arXiv
   - Relevance score: 162
   - Match reasons: title matched "LLM"; title matched "evaluation"; summary matched "reasoning"; summary matched "RAG"
   - Categories: quant-ph, cs.AI
   - Tags: Evaluation / Application / Method
   - Topics: LLM / Evaluation
   - PDF: https://arxiv.org/pdf/2605.27210v1
   - Abstract: We adapt Microsoft's QuantumKatas -- a well-established quantum computing curriculum -- from Q# to Qiskit, the most widely-adopted quantum computing framework, and package it with an evaluation framework for systematic LLM assessment. The resulting benchmark comprises 350 tasks across 26 categories, spanning fundamental gates through advanced algorithms (Grover's, Simon's, Deutsch-Jozsa), error correction, key distribution, and quantum games. Each task includes a natural language prompt, canonical solution, and deterministic test verification via classical circuit simulation. By building on the QuantumKatas' proven pedagogical design rather than creating tasks from scratch, we inherit a principled difficulty progression and comprehensive concept coverage while contributing the framework adaptation, evaluation infrastructure, and empirical analysis. We evaluate 16 LLMs across 7 prompting configurations -- a total of 39,200 model runs -- to demonstrate the benchmark's utility. Three key findings emerge: (1) the benchmark effectively differentiates model capabilities, with best-configuration pass rates ranging from 32.3% to 83.1% and a 26.1 pp average gap between frontier and open-source models; (2) models perform well at implementing known algorithms (SimonsAlgorithm 82.1%, BasicGates 81.6%) but struggle with problem encoding (SolveSATWithGrover 34.4%, DistinguishUnitaries 40.0%); and (3) chain-of-thought prompting shows a modestly bimodal effect -- it is the best strategy for three models (two of them explicitly reasoning-tuned per vendor documentation) but degrades performance for the rest, leaving it mid-pack in aggregate (56.3% mean) behind few-shot-5 (57.8%). We release the benchmark, evaluation framework, and baseline results to support research on LLM capabilities in quantum computing.
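
The run count quoted in the abstract above can be reproduced with simple arithmetic, assuming (our reading, not stated explicitly) that every model attempts every task once under each prompting configuration.

```python
# Figures taken from the abstract; the "one run per (model, config, task)"
# reading is an assumption used to reconcile them.
num_models = 16
num_configs = 7
num_tasks = 350

total_runs = num_models * num_configs * num_tasks
print(f"{total_runs:,} model runs")
```

Under that assumption the product matches the 39,200 model runs reported by the authors.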

8. [MATCHA: Matching Text via Contrastive Semantic Alignment](https://arxiv.org/abs/2605.27345v1)
   - Published: 2026-05-27 01:47
   - Authors: Siran Li, Ece Sena Etoglu, Carsten Eickhoff, Seyed Ali Bahrainian
   - Source: arXiv
   - Relevance score: 160
   - Match reasons: title matched "alignment"; summary matched "language model"; summary matched "large language model"; summary matched "LLM"
   - Categories: cs.CL
   - Tags: Evaluation / Data / Method
   - Topics: Language Model / LLM
   - PDF: https://arxiv.org/pdf/2605.27345v1
   - Abstract: Reliable evaluation is essential for understanding large language model (LLM) performance, yet today's go-to metrics, namely token-overlap scores (e.g., ROUGE) and embedding-based measures (e.g., BERTScore), often misjudge semantic similarity of documents. Our study shows that both token-overlap metrics and embedding-based metrics routinely assign nearly identical scores to texts that directly contradict each other, thereby potentially masking fundamental errors. We introduce MATCHA, an automatic metric that jointly rewards semantic agreement with a reference and penalizes contradictions. MATCHA employs a dual-view perspective that measures (i) proximity to the gold text and (ii) distance from an adversarially generated counterfactual contradiction. In eight public benchmarks, MATCHA outperforms popular metrics, compared with human annotations on question-answering, image caption generation, natural language inference, summarization, and semantic textual similarity tasks. On the TruthfulQA dataset (i.e., a dataset without a training set, so no embedding-based metric could be trained on it locally), this improvement in terms of matching texts with a reference reaches 18.38% over ROUGE-L and 20.82% over BERTScore. Both quantitative comparison and qualitative human assessments confirm the efficacy and validity of MATCHA and uncover fundamental weaknesses in pre-existing metrics. Compared with 23 embedding models, including top state-of-the-art ones, used as a metric similar to BERTScore, MATCHA remains the most accurate in distinguishing correct from incorrect statements solely based on a reference. Our code and metric are publicly available (https://github.com/Siran-Li/MATCHA).
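
To make the dual-view idea in the abstract above concrete, here is a toy sketch. It is not MATCHA: the bag-of-words cosine similarity, the subtraction used to combine the two views, and the hand-written contradiction are all assumptions chosen for illustration, since the abstract does not specify the metric's actual formula or how counterfactuals are generated.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two bag-of-words vectors (toy stand-in for an embedding)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0


def dual_view_score(candidate, reference, contradiction):
    """Assumed combination rule: reward closeness to the reference,
    penalize closeness to the contradicting counterfactual."""
    return cosine(candidate, reference) - cosine(candidate, contradiction)


reference = "the drug is safe"
contradiction = "the drug is not safe"  # hand-written here, not adversarially generated

# A pure overlap measure rates the contradicting sentence as highly similar.
overlap_only = cosine("the drug is not safe", reference)

score_agree = dual_view_score("the drug is safe to use", reference, contradiction)
score_contra = dual_view_score("the drug is not safe", reference, contradiction)
print(f"overlap_only={overlap_only:.3f} agree={score_agree:.3f} contra={score_contra:.3f}")
```

In this toy setup the overlap-only score stays high for the contradicting sentence, while the dual-view score separates the agreeing candidate (positive) from the contradicting one (negative).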

9. [QUACK: Questioning, Understanding, and Auditing Communicated Knowledge in Multimodal Social Deduction Agents](https://arxiv.org/abs/2605.27068v1)
   - Published: 2026-05-26 22:19
   - Authors: Ye Yuan, Rui Song, Weien Li, Zeyu Li, Haochen Liu, Xiangyu Kong, et al.
   - Source: arXiv
   - Relevance score: 156
   - Match reasons: title matched "agent"; summary matched "language model"; summary matched "large language model"; summary matched "LLM"
   - Categories: cs.CL, cs.AI, cs.MA
   - Tags: Evaluation / Method
   - Topics: Language Model / LLM
   - PDF: https://arxiv.org/pdf/2605.27068v1
   - Abstract: Social deduction games have become a popular testbed for probing reasoning, deception, coordination, and belief modeling in Large Language Model (LLM) agents. However, most environments are scored only by game outcomes such as win rates and largely remain limited to text-only interaction, making it difficult to tell whether an agent's language is actually grounded in what it perceived and did, or to identify the failure modes underlying its behavior. To address this gap, we introduce QUACK, an open-source environment and evaluation framework for auditing the grounding of agent language in multimodal social reasoning. QUACK evaluates agents at three levels: game outcomes, behavioral trajectories, and utterance-level consistency. Its core Statement Verification Pipeline reconstructs each agent's ground-truth trajectory from engine logs and checks every discussion claim against it, automatically flagging spatial hallucination, unsupported accusation, deception collapse, and language-action inconsistency. Evaluating three frontier VLMs in both homogeneous and cross-model adversarial settings, we find that even the strongest agent hallucinates 15.1% of its verifiable spatial claims and makes over half of its accusations without grounded evidence. We release the full engine, evaluation framework, toolkit, and logs at https://github.com/AAAAA-Academia-Attractions/QUACK.

10. [Lost in Sampling: Assessing Lexical Reachability in LLMs via the Word Coverage Score (WCS)](https://arxiv.org/abs/2605.27268v1)
   - Published: 2026-05-27 00:44
   - Authors: Samer Awad, Javier Conde, Carlos Arriaga, Tairan Fu, Javier Coronado-Blázquez, Pedro Reviriego
   - Source: arXiv
   - Relevance score: 145
   - Match reasons: title matched "LLM"; title matched "RAG"; summary matched "language model"; summary matched "large language model"
   - Categories: cs.CL, cs.AI
   - Tags: Evaluation / Data / Application / Method
   - Topics: Language Model / LLM
   - PDF: https://arxiv.org/pdf/2605.27268v1
   - Abstract: Modern Large Language Models (LLMs) are often criticized for producing repetitive and homogeneous text, despite possessing vast latent vocabularies. While previous research has focused on model knowledge and training data, we investigate the role of decoding mechanics in suppressing linguistic diversity. We introduce the Word Coverage Score (WCS), a metric that quantifies the extent to which contextually appropriate human vocabulary is mathematically pruned by standard sampling filters (e.g., Top-$p$, Top-$k$, and Min-$p$). Rather than assessing static knowledge, the WCS measures the lexical survival rate of low-frequency, high-information human words as a function of sampling parameters. By auditing open-weight models on human-authored corpus fragments, we identify which logical lexical choices are rendered unreachable by the decoder, even when they reside within the probability space. Our results provide quantitative evidence that industry-standard sampling defaults act as unintended censorship mechanisms, smoothing the unique textures of human expression into a homogenized discourse. The WCS offers a rigorous framework for optimizing the trade-off between text coherence and lexical richness, providing a diagnostic tool for preserving the diversity of human language in generative models.
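
The pruning effect described in the abstract above can be illustrated with the three sampling filters in their commonly used forms. This is a sketch on a made-up four-word distribution; it shows whether a word survives each filter and does not reproduce the paper's WCS formula, which the abstract does not spell out. The probabilities, the example words, and the parameter values are assumptions.

```python
def top_k_filter(probs, k):
    """Keep the k most probable tokens."""
    return set(sorted(probs, key=probs.get, reverse=True)[:k])


def top_p_filter(probs, p):
    """Keep the smallest set of most probable tokens whose cumulative mass reaches p."""
    kept, total = set(), 0.0
    for tok in sorted(probs, key=probs.get, reverse=True):
        kept.add(tok)
        total += probs[tok]
        if total >= p:
            break
    return kept


def min_p_filter(probs, min_p):
    """Keep tokens whose probability is at least min_p times the top probability."""
    threshold = min_p * max(probs.values())
    return {tok for tok, pr in probs.items() if pr >= threshold}


# Hypothetical next-word distribution; "expostulated" plays the role of the
# low-frequency word a human author actually wrote.
probs = {"said": 0.50, "replied": 0.30, "muttered": 0.15, "expostulated": 0.05}
human_word = "expostulated"

survivors = {
    "top_k=2": top_k_filter(probs, 2),
    "top_p=0.9": top_p_filter(probs, 0.9),
    "min_p=0.2": min_p_filter(probs, 0.2),
}
for name, kept in survivors.items():
    print(name, sorted(kept), "reachable" if human_word in kept else "pruned")
```

Although the human-chosen word has nonzero probability, all three filters remove it at these settings, which is the kind of unreachability the abstract describes.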

11. [ENPMR-Bench: Benchmarking Proactive Memory Retrieval for Emotional Support Agents](https://arxiv.org/abs/2605.27240v1)
   - Published: 2026-05-27 00:22
   - Authors: Xing Fu, Yulin Hu, Mengtong Ji, Haozhen Li, Yixin Sun, Weixiang Zhao, et al.
   - Source: arXiv
   - Relevance score: 144
   - Match reasons: title matched "agent"; title matched "benchmark"; summary matched "LLM"; summary matched "alignment"
   - Categories: cs.CL
   - Tags: Evaluation / Application / Method
   - Topics: LLM / Agent
   - PDF: https://arxiv.org/pdf/2605.27240v1
   - Abstract: Memory-augmented language agents are increasingly deployed in affective applications such as emotional support, where understanding and responding to users' latent emotional needs is critical. However, existing research often treats memory as a tool for factual retrieval, overlooking its role in shaping users' emotional experiences. In this work, we introduce ENPMR-Bench, a benchmark for evaluating Emotional Need-aware Proactive Memory Retrieval (ENPMR), a core capability that enables agents to infer users' latent emotional needs and proactively retrieve appropriate memories to support empathetic interaction. Grounded in Maslow's hierarchy of needs, ENPMR-Bench includes over 1,800 memory-augmented dialogues and defines structured mappings between emotional needs and supportive memory types. Experimental results demonstrate that current retrieval paradigms, including both embedding-based and LLM-driven approaches, exhibit substantial deficiencies, with empathy scores significantly lagging behind golden memory conditions. While chain-of-thought prompting improves the alignment between inferred emotional needs and retrieved memories to some extent, a notable performance gap remains. Together, these findings reveal critical limitations in current agents and outline directions for advancing personalized emotional support through need-sensitive memory retrieval.

12. [Probing Cultural Awareness in LLMs: A Case Study of Cross-Culture Aesthetic Stylistics](https://arxiv.org/abs/2605.27296v1)
   - Published: 2026-05-27 01:08
   - Authors: Jiashuo Wang, Fenggang Yu, Jian Wang, Chak Tou Leong, Xiaoyu Shen, Chunpu Xu, et al.
   - Source: arXiv
   - Relevance score: 141
   - Match reasons: title matched "LLM"; summary matched "language model"; summary matched "large language model"; summary matched "benchmark"
   - Categories: cs.CL
   - Tags: Evaluation / Method
   - Topics: Language Model / LLM
   - PDF: https://arxiv.org/pdf/2605.27296v1
   - Abstract: Large Language Models (LLMs) are increasingly deployed in diverse cultural contexts, yet their ability to master aesthetic stylistics, i.e., the strategic use of language to evoke cultural resonance, remains underexplored. We curate C4STYLI, a benchmark of highly stylized translated movie titles and advertising slogans from Hong Kong and the Chinese Mainland, to evaluate LLMs via the lens of behavioral recognition and productive competence. Extensive evaluations show that LLMs differ from humans in stylistic recognition, and this recognition ability varies across text domains. In addition, stylistic recognition and generation performance in LLMs are not consistently aligned. To further examine whether LLMs genuinely capture stylistic information in stylistic recognition, we conduct structural ablation with logistic regression probes. We find that, in the Hong Kong setting, stylistic recognition in LLMs relies primarily on surface-level linguistic information rather than stylistic structure. This suggests limited sensitivity to Hong Kong-specific stylistic structure.

13. [It's Not Always Sycophancy: Measuring LLM Conformity as a Function of Epistemic Uncertainty](https://arxiv.org/abs/2605.27288v1)
   - Published: 2026-05-27 01:04
   - Authors: Kevin H. Guo, Chao Yan, Avinash Baidya, Katherine Brown, Xiang Gao, Juming Xiong, et al.
   - Source: arXiv
   - Relevance score: 141
   - Match reasons: title matched "LLM"; summary matched "language model"; summary matched "large language model"; summary matched "alignment"
   - Categories: cs.CL, cs.AI, cs.LG
   - Tags: Evaluation / Application / Method
   - Topics: Language Model / LLM
   - PDF: https://arxiv.org/pdf/2605.27288v1
   - Abstract: Large language models (LLMs) are known to abandon their initial stance to conform to user pushback. While prior research largely attributes this behavior to sycophancy learned during reinforcement learning from human feedback, we hypothesize that conformity is also driven by a model's epistemic uncertainty at inference time. In this paper, we introduce MUSE, a two-stage evaluation framework to disentangle the mechanisms driving LLM conformity. Specifically, MUSE maps a model's epistemic uncertainty in responding to a query against its likelihood to yield to user pushback in a subsequent turn. We demonstrate that the mechanisms driving conformity extend beyond sycophancy alone. Specifically, we characterize two distinct factors that jointly drive conformity: sycophantic conformity, where a model aligns with user pushback even with absolute certainty in its initial response, and uncertainty-driven conformity, where a model's likelihood for conformity increases alongside its uncertainty. Furthermore, we conduct ablation studies to demonstrate that both sycophantic conformity and uncertainty-driven conformity grow with 1) the LLM's perceived expertise of the user and 2) the plausibility of the user's suggestions. More broadly, MUSE informs more targeted intervention strategies by distinguishing alignment-induced sycophancy and training-corpora-driven uncertainty.

14. [Large Language Model-Powered Query-Driven Event Timeline Summarization in Industrial Search](https://arxiv.org/abs/2605.27066v1)
   - Published: 2026-05-26 22:16
   - Authors: Mingyue Wang, Xingyu Xie, Hang Yang, Li Gao, Lixin Su, Ge Chen, et al.
   - Source: arXiv
   - Relevance score: 136
   - Match reasons: title matched "language model"; title matched "large language model"; summary matched "RAG"; has DOI
   - Categories: cs.CL, cs.IR
   - Tags: Application / Method
   - Topics: Language Model / Large Language Model
   - PDF: https://arxiv.org/pdf/2605.27066v1
   - Abstract: Understanding how events evolve over time is essential for search engines handling queries about trending news. We present QDET (Query-Driven Event Timeline Summarization), a production system deployed on Baidu Search that constructs focused event timelines to explain specific query events. Unlike traditional topic-centric approaches that aim for comprehensive coverage, QDET identifies and organizes sub-events closely relevant to the query from noisy candidate sets formed by millions of documents retrieved daily. QDET incorporates two key innovations: (1) multi-task supervised fine-tuning with three auxiliary tasks (temporal ordering, causal judgment, and timeline completion) that enable compact models to match the performance of much larger general-purpose models in specialized domains; (2) reinforcement learning-based event concise summarization that enforces strict length constraints while maintaining semantic quality, achieving 88.2% length compliance and outperforming 671B-scale models by 7.7 points in constraint satisfaction. Our fine-tuned 7B parameter model achieves 76.2% F1 score on timeline summarization, slightly surpassing the zero-shot performance of DeepSeek-R1-671B (76.1% F1) while using only 1% of its parameters, demonstrating that domain-specific optimization enables production-ready models with comparable quality at drastically reduced computational costs. Online A/B tests on Baidu Search validate real-world effectiveness, showing 5.5% CTR improvement, 4.6% longer dwell time, and 4.4% deeper exploration compared to single-task baselines. We further demonstrate that timeline understanding transfers to heat prediction, confirming effective knowledge transfer to downstream tasks.

15. [GraphReview: Scientific Paper Evaluation via LLM-Based Graph Message Passing](https://arxiv.org/abs/2605.27204v1)
   - Published: 2026-05-26 23:58
   - Authors: Pujun Zheng, Wanying Ren, Jiacheng Yao, Guoxiu He, Star X. Zhao
   - Source: arXiv
   - Relevance score: 126
   - Match reasons: title matched "LLM"; title matched "evaluation"; summary matched "RAG"; has PDF
   - Categories: cs.CL, cs.IR
   - Tags: Evaluation / Application / Method
   - Topics: LLM / Evaluation
   - PDF: https://arxiv.org/pdf/2605.27204v1
   - Abstract: Scientific paper evaluation often involves not only assessing a manuscript itself, but also relating it to contemporaneous research and prior literature. However, existing LLM-based methods typically model these signals separately and lack a unified mechanism for propagating review evidence across papers. We propose **GraphReview**, a graph-based LLM framework that formulates paper evaluation as review-signal message passing over a semantic paper graph. The graph jointly captures intrinsic quality, synchronic links among contemporaneous papers, and diachronic links to prior work. LLMs are used to estimate node-level quality priors and generate edge-level comparative evidence through pairwise paper comparisons, while Personalized PageRank integrates review signals for quality ranking, decision prediction, and review generation. To produce higher-quality graph evidence, we propose reward-induced maximum likelihood objectives for training the LLM backbones. Experiments show that GraphReview consistently outperforms the strongest baseline, achieving average improvements of 29.7% on decision and ranking metrics, including gains of 23.7% in Accuracy and 57.6% in Spearman's ρ. It also produces higher-quality review texts and generalizes effectively across time periods and conference venues. The code is available at https://github.com/ECNU-Text-Computing/GraphReview.
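
The GraphReview abstract (item 15 above) says that Personalized PageRank integrates node-level quality priors with edge-level comparative evidence. As a rough illustration of that one step, and not the authors' implementation, the sketch below runs a plain power-iteration Personalized PageRank over a toy paper graph. The function name `personalized_pagerank`, the damping factor `alpha=0.85`, the edge direction (an edge points from the weaker paper to the paper the comparison favours), and all toy scores are assumptions made for illustration.

```python
def personalized_pagerank(edges, prior, alpha=0.85, iters=200, tol=1e-12):
    """Power-iteration Personalized PageRank.

    edges: dict mapping source node -> {destination node: non-negative weight}.
           Every destination must also appear as a key of `prior`.
    prior: dict mapping node -> non-negative prior score (hypothetical quality prior).
    Returns a dict mapping node -> rank; the ranks sum to 1.
    """
    nodes = sorted(prior)
    total = sum(prior.values())
    # The normalized prior acts as the restart (personalization) distribution.
    restart = {n: prior[n] / total for n in nodes}
    rank = dict(restart)
    for _ in range(iters):
        # Restart mass: with probability (1 - alpha) jump back to the prior.
        nxt = {n: (1 - alpha) * restart[n] for n in nodes}
        for src in nodes:
            out = edges.get(src, {})
            out_sum = sum(out.values())
            if out_sum == 0:
                # Dangling node: redistribute its mass according to the prior.
                for n in nodes:
                    nxt[n] += alpha * rank[src] * restart[n]
            else:
                for dst, w in out.items():
                    nxt[dst] += alpha * rank[src] * w / out_sum
        delta = sum(abs(nxt[n] - rank[n]) for n in nodes)
        rank = nxt
        if delta < tol:
            break
    return rank


# Toy graph (made-up numbers): papers B and C both "lose" a comparison to A.
toy_prior = {"A": 0.5, "B": 0.3, "C": 0.2}
toy_edges = {"B": {"A": 1.0}, "C": {"A": 1.0}}
toy_rank = personalized_pagerank(toy_edges, toy_prior)
for paper in sorted(toy_rank, key=toy_rank.get, reverse=True):
    print(paper, round(toy_rank[paper], 4))
```

In this toy setup the prior seeds the ranking and the comparison edges then move mass toward the favoured paper, which is one plausible reading of "integrating review signals"; the paper itself should be consulted for the actual graph construction and weighting.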

## Agent Runtime Security Observations

### Group at a Glance

- "EviACT: An Evidence-to-Action Framework for Agentic Program Repair" (Evaluation / Method): LLM-based agents have moved automated program repair (APR) from fixed-context patch generation to interactive repository-level repair. However, existing agenti…
- "Governed Evolution of Agent Runtimes through Executable Operational Cognition" (Evaluation / Method): Recent advances in agentic systems increasingly treat code as an executable operational substrate rather than as a disposable output artifact. Prior work such…
- "Prompt Injection Detection is Regime-Dependent: A Deployment-Aware Evaluation with Interpretable Structural Signals" (Evaluation / Application / Method): Prompt injection poses a critical threat to the safe deployment of large language models, yet existing detection approaches are typically evaluated under limit…
- "BAIT: Boundary-Guided Disclosure Escalation via Self-Conditioned Reasoning" (Method): In this work, we propose BAIT (Boundary-Aware Iterative Trap), a three-step jailbreak framework that approaches malicious goals through internal disclosure. BA…
- "AlbanianLLMSafety: A Safety Evaluation Dataset for Large Language Models in Albanian" (Evaluation / Data / Method): Safety evaluation of Large Language Models (LLMs) has largely focused on high-resource languages, leaving low-resource languages critically underserved. We pre…

### Paper Overview

1. [EviACT: An Evidence-to-Action Framework for Agentic Program Repair](https://arxiv.org/abs/2605.27238v1)
   - Published: 2026-05-27 00:17
   - Authors: Qianru Meng, Xiao Zhang, Zhaochun Ren, Joost Visser
   - Source: arXiv
   - Relevance score: 122
   - Match reasons: summary matched "guardrail"; has PDF; has rich summary; has complete metadata
   - Categories: cs.SE
   - Tags: Evaluation / Method
   - Topics: LLM / Agent
   - PDF: https://arxiv.org/pdf/2605.27238v1
   - Abstract: LLM-based agents have moved automated program repair (APR) from fixed-context patch generation to interactive repository-level repair. However, existing agentic APR systems still struggle to use execution evidence to guide localization, patch generation, and validation. We propose EviACT (Evidence-to-Action), an agentic APR framework that coordinates three evidence-driven guardrails across repair stages. The retrieval scaffold grounds repair context, the compile gate filters invalid edits, and the test-driven gate checks target-test recovery before full regression. Across four benchmarks, EviACT improves resolve rate over the strongest reported comparable baselines by 1.6-6.0 percentage points and shows 70.1-88.6% lower reported per-bug API cost where baseline costs are available. Ablations and diagnostics suggest that these gains are associated with the coordinated evidence-to-action chain, making agentic APR more effective and efficient.

2. [Governed Evolution of Agent Runtimes through Executable Operational Cognition](https://arxiv.org/abs/2605.27328v1)
   - Published: 2026-05-27 01:36
   - Authors: Mariano Garralda-Barrio
   - Source: arXiv
   - Relevance score: 70
   - Match reasons: title matched "agent runtime"; has PDF; has rich summary; has complete metadata
   - Categories: cs.SE, cs.AI, cs.MA
   - Tags: Evaluation / Method
   - Topics: Evaluation / Agent
   - PDF: https://arxiv.org/pdf/2605.27328v1
   - Abstract: Recent advances in agentic systems increasingly treat code as an executable operational substrate rather than as a disposable output artifact. Prior work such as *Code as Agent Harness* frames validated agent-generated artifacts as runtime entities that can be created, executed, revised, persisted, and reused within long-running cognitive loops. However, the governance, lifecycle management, and operational evolution of such artifacts remain under-specified. This paper proposes a framework for governed runtime evolution in multi-agent systems through executable operational cognition. We formalize agent-generated artifacts as persistent runtime capabilities that progressively become part of the operational substrate rather than transient intermediate outputs. Building on this perspective, we introduce *HarnessMutation* as a governed mechanism for lifecycle-aware runtime adaptation operating under explicit validation, traceability, evaluation, and rollback constraints. Rather than treating runtime adaptation as unrestricted self-modification, the proposed framework models evolution as a bounded and observable process over persistent operational memory. It further shows how these ideas can be operationalized over modern agent runtimes and governance-oriented orchestration systems, providing a conceptual foundation for adaptive infrastructures whose evolution remains explicit, auditable, and constrained.

3. [Prompt Injection Detection is Regime-Dependent: A Deployment-Aware Evaluation with Interpretable Structural Signals](https://arxiv.org/abs/2605.26999v1)
   - Published: 2026-05-26 21:19
   - Authors: Akindoyin Akinrele, Shreyank N Gowda
   - Source: arXiv
   - Relevance score: 65
   - Match reasons: title matched "prompt injection"; has PDF; has rich summary; has complete metadata
   - Categories: cs.CL, cs.CR
   - Tags: Evaluation / Application / Method
   - Topics: Language Model / Large Language Model
   - PDF: https://arxiv.org/pdf/2605.26999v1
   - Abstract: Prompt injection poses a critical threat to the safe deployment of large language models, yet existing detection approaches are typically evaluated under limited settings that do not reflect real-world operating constraints. In this work, we present a deployment-aware evaluation of prompt injection detection using a multi-model and multi-regime experimental framework. We compare lexical, semantic, structural, and transformer-based detectors across multiple out-of-distribution settings, repeated data splits, and both ranking and thresholded deployment metrics. We introduce interpretable structural signals that capture hierarchy overrides, system prompt spoofing, role redefinition, and evasion patterns, and assess their contribution both within sparse models and in combination with strong encoder baselines. Our results show that detection performance is highly regime-dependent and sensitive to threshold selection, with no single model dominating across all settings. Transformer-based models achieve the strongest overall performance, while structural signals provide modest but consistent gains in certain regimes and improve low false positive rate behaviour in harder scenarios. These findings highlight the gap between ranking performance and deployment effectiveness and underscore the importance of evaluating prompt injection defences under realistic operational constraints. Code will be released.

4. [BAIT: Boundary-Guided Disclosure Escalation via Self-Conditioned Reasoning](https://arxiv.org/abs/2605.27110v1)
   - Published: 2026-05-26 22:51
   - Authors: Xuan Luo, Yue Wang, Geng Tu, Jing Li, Ruifeng Xu
   - Source: arXiv
   - Relevance score: 45
   - Match reasons: summary matched "jailbreak"; has PDF; has rich summary; has complete metadata
   - Categories: cs.CR, cs.CL
   - Tags: Method
   - Topics: Language Model / Large Language Model
   - PDF: https://arxiv.org/pdf/2605.27110v1
   - Abstract: In this work, we propose BAIT (Boundary-Aware Iterative Trap), a three-step jailbreak framework that approaches malicious goals through internal disclosure. BAIT first asks the model to identify the protection boundary, then requires it to refine that boundary, and finally requests a detailed example. By expanding each step upon the model's previous responses, BAIT turns the model's own reasoning and consistency tendency into a disclosure pathway. Experiments on AdvBench, JailbreakBench, AIR-Bench, and SORRY-Bench demonstrate that BAIT consistently achieves strong attack success rates across top-tier large language models, significantly advancing conventional jailbreak baselines. Further analysis reveals that: 1) prevention-oriented framing significantly outperforms direct knowledge request; 2) the refinement step plays a critical role in disclosure escalation; and 3) the first two steps have a certain chance of eliciting harmful content while triggering little filtering.

5. [AlbanianLLMSafety: A Safety Evaluation Dataset for Large Language Models in Albanian](https://arxiv.org/abs/2605.26954v1)
   - Published: 2026-05-26 20:43
   - Authors: Wajdi Zaghouani, Kholoud K. Aldous, Isra Fejzullaj
   - Source: arXiv
   - Relevance score: 43
   - Match reasons: summary matched "guardrail"; has PDF; has rich summary; has complete metadata
   - Categories: cs.CL
   - Tags: Evaluation / Data / Method
   - Topics: Language Model / LLM
   - PDF: https://arxiv.org/pdf/2605.26954v1
   - Abstract: Safety evaluation of Large Language Models (LLMs) has largely focused on high-resource languages, leaving low-resource languages critically underserved. We present AlbanianLLMSafety, the first publicly available safety evaluation dataset for LLMs in Albanian, a linguistically distinct low-resource language with approximately 7.5 million speakers across Albania, Kosovo, North Macedonia, and the diaspora. The dataset contains 2,951 prompts spanning 11 safety categories, including self-harm, violence, racist content, child exploitation, and radicalization, with an average of 268 prompts per category. Each prompt is provided in Albanian with an English reference translation and a detailed category label. This resource addresses a significant gap in safety evaluation infrastructure for low-resource languages and provides an essential benchmark for developing safer, more inclusive LLMs. The dataset will be provided upon request to support safety evaluation, fine-tuning, red-teaming, and guardrail development for Albanian-speaking communities.

6. [Secure UAV Swarms in Low-Altitude Wireless Networks: Challenges and Solutions](https://arxiv.org/abs/2605.26876v1)
   - Published: 2026-05-26 19:33
   - Authors: Yuntao Wang, Haojia Yang, Han Liu, Jianle Ba, Zhou Su
   - Source: arXiv
   - Relevance score: 42
   - Match reasons: summary matched "agent attack"; has PDF; has rich summary; has complete metadata
   - Categories: cs.CR
   - Tags: Evaluation / Application / Method
   - Topics: Evaluation / Agent
   - PDF: https://arxiv.org/pdf/2605.26876v1
   - Abstract: Unmanned aerial vehicle (UAV) swarms are increasingly deployed in vast low-altitude applications, owing to their capabilities in distributed sensing, flexible communication, and autonomous coordination. Nevertheless, the open and highly dynamic operating environment of UAV swarms introduces serious security risks, including GPS spoofing, insider threats, and multi-hop intrusion. These threats are aggravated by limited on-board resources, frequently changing network topology, and the presence of intelligent adversaries. To tackle these issues, this paper proposes a cloud-edge-end collaborative defense framework for UAV swarms. Based on this framework, three complementary mechanisms are developed. First, a cooperative perception scheme is designed to resist GPS spoofing via interactive attack-defense game modeling. Second, a behavior-driven authentication method with trust evaluation is developed to mitigate insider threats. Third, a multi-agent attack forensics framework is devised to intelligently trace the propagation paths of multi-hop attacks in UAV networks. Experimental results validate the effectiveness of the proposed approaches. Finally, several open research directions are outlined.

7. [Persistent AI Agents in Academic Research: A Single-Investigator Implementation Case Study](https://arxiv.org/abs/2605.26870v1)
   - Published: 2026-05-26 19:28
   - Authors: Anas H. Alzahrani
   - Source: arXiv
   - Relevance score: 42
   - Match reasons: summary matched "agent runtime"; has PDF; has rich summary; has complete metadata
   - Categories: cs.MA, cs.AI, cs.HC
   - Tags: Evaluation / Application / Method
   - Topics: Language Model / Large Language Model
   - PDF: https://arxiv.org/pdf/2605.26870v1
   - Abstract: Background: Large language models are typically evaluated as models, benchmarks, or short conversational episodes. Less is known about what happens when an agent is embedded persistently in a real academic research environment with durable memory, local files, external tools, scheduled routines, delegated roles, and explicit safety protocols. Methods: A structured self-observed implementation case study was conducted from January 31 to May 25, 2026. The unit of analysis was the persistent human-agent environment: researcher, agent runtime, memory layer, tools, repositories, scheduled jobs, specialized agent roles, and governance rules. Outcomes were organized using PARE-M (Persistent Agentic Research Environment Measurement), a measurement framework covering architecture, utilization, artifact production, resource use, reproducibility, and governance. Results: Recoverable main-agent telemetry contained 75,671 de-duplicated records across 96 active days, with 8,059 user-role and 23,710 assistant-role messages. The workspace included 502 memory-related files, 17 configured agent directories, and 57 skill files. Active system time was 579.7 hours (30-minute capped-gap estimate). Memory-derived records identified 482 output-proxy events and 889 failure, verification, correction, or protocol-proxy events. A strict May 2026 trajectory subset captured 627 model-completed events and 73.95 million recorded tokens, of which 82.9% were cache reads. Conclusions: The workflow was cache-dominant, suggesting that persistent agentic environments may shift the economic unit from cost per token to cost per completed artifact. Future evaluations should use artifact-level denominators, reproducible parsing rules, correction taxonomies, and independent coding of governance events.
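
The prompt-injection evaluation (item 3 above) contrasts ranking metrics with thresholded deployment metrics and highlights behaviour at low false positive rates. The abstract does not name the exact metrics, so the sketch below assumes two common choices: AUROC for ranking quality, and the true positive rate achievable under a capped false positive rate for deployment. The function names `auroc` and `tpr_at_fpr`, the two detectors, and all scores are hypothetical and made up purely to show how equal ranking quality can hide different low-FPR behaviour.

```python
def auroc(scores, labels):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a win. labels use 1 for injection and 0 for benign.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def tpr_at_fpr(scores, labels, max_fpr):
    """Best true positive rate over thresholds whose false positive rate <= max_fpr.

    An input is flagged when its score is >= the threshold. Returns (tpr, threshold);
    the threshold is infinity when no candidate threshold satisfies the FPR cap.
    """
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative example")
    best_tpr, best_thr = 0.0, float("inf")
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 0)
        if fp / n_neg <= max_fpr and tp / n_pos > best_tpr:
            best_tpr, best_thr = tp / n_pos, thr
    return best_tpr, best_thr


# Made-up scores: four injections followed by four benign inputs.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
detector_a = [0.9, 0.8, 0.7, 0.05, 0.6, 0.5, 0.4, 0.1]   # misses one injection badly
detector_b = [0.9, 0.8, 0.7, 0.6, 0.95, 0.5, 0.4, 0.1]   # one benign input scores highest

for name, scores in (("A", detector_a), ("B", detector_b)):
    tpr, thr = tpr_at_fpr(scores, labels, max_fpr=0.0)
    print(name, "AUROC:", auroc(scores, labels), "TPR at FPR=0:", tpr, "threshold:", thr)
```

With these toy scores both detectors have the same AUROC, yet only detector A catches any injections when no false positives are allowed, which illustrates why a ranking metric alone can be misleading for a deployment decision.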

## Terminal and SWE Agents Observations

No new matching papers today.