LM Feed Archive

LM Feed Archive lm.html Long-term RSS subscription for LM, aggregating recently matched papers and archives. zh-CN Sun, 28 Jun 2026 05:24:06 +0000 NuclearQAv2: A Structured Benchmark for Evaluating Domain-Science Competence in Large Language Models ../papers/arxiv-91c0ed0f09c2.html https://arxiv.org/abs/2606.27047v1#2026-06-26#lm Fri, 26 Jun 2026 13:16:53 +0800 Large language models (LLMs) have demonstrated strong performance across a wide range of tasks, but ensuring their reliability in highly technical domains remains a significant challenge. In nuclear engineering, problem solving often requires not only factual knowledge but also quantitative reasoning and conceptual understanding. To address the need for systematic evaluation in this domain, we introduce NuclearQAv2, a benchmark for assessing LLMs on nuclear engineering knowledge. The benchmark… Joint Learning of Experiential Rules and Policies for Large Language Model Agents ../papers/arxiv-48c067a92ef9.html https://arxiv.org/abs/2606.27136v1#2026-06-26#lm Fri, 26 Jun 2026 13:16:53 +0800 For LLM agents in multi-step interactive environments, a key challenge is to make effective use of accumulated interaction experience. Existing work has typically separated two uses of such experience: keeping it outside the model as natural-language rules for later prompting, or using trajectories and feedback to update the model parameters. The former is easy to interpret but can fall out of sync with the evolving policy; the latter improves the policy more broadly but provides only limited c… The Riddle Riddle: Testing Flexible Reasoning in Large Language Models and Humans ../papers/arxiv-852671b09eb4.html https://arxiv.org/abs/2606.27103v1#2026-06-26#lm Fri, 26 Jun 2026 13:16:53 +0800 Humans flexibly adapt their reasoning strategies to the requirements of a given problem. Large language models (LLMs) have performed well on many cognitive tasks, however, it is unclear whether this accuracy is a result of pattern matching from training data or flexible reasoning. 
Here, we introduce a novel paradigm to test this question: the riddle riddle paradigm. Riddle riddles are word problems written to mimic popular riddles, but altered so their answers only require literal interpretatio… Prompt Injection in Automated Résumé Screening with Large Language Models: Single and Multi-Injection Settings ../papers/arxiv-663bd6d3e1b5.html https://arxiv.org/abs/2606.27287v1#2026-06-26#lm Fri, 26 Jun 2026 13:16:53 +0800 Large language models (LLMs) are increasingly used to screen and rank job applicants, creating incentives for candidates to strategically manipulate algorithmic hiring systems. We study prompt injection in automated résumé screening, defined as subtle self-promotional text that introduces no new qualifications but is designed to influence LLM evaluations. Using controlled experiments, we show that prompt injection reliably improves applicant rankings when résumé quality is homogeneous and few c… Semantic Early-Stopping for Iterative LLM Agent Loops ../papers/arxiv-232f944cff9f.html https://arxiv.org/abs/2606.27009v1#2026-06-26#lm Fri, 26 Jun 2026 13:16:53 +0800 Multi-agent large language model (LLM) loops, for example a Writer that drafts and a Critic that revises, are almost always terminated by a fixed iteration cap (max_iterations). This is a syntactic kill-switch: it is blind to whether the answer is still improving, so it over-spends tokens on easy inputs and truncates hard ones. 
We study semantic early-stopping: the loop halts when consecutive draft embeddings stop changing in meaning (cosine distance with a patience window) and the answer's mea… TOPS: First-Principles Visual Token Pruning via Constructing Token Optimal Preservation Sets for Efficient MLLM Inference ../papers/arxiv-4739852a0036.html https://arxiv.org/abs/2606.27161v1#2026-06-26#lm Fri, 26 Jun 2026 13:16:53 +0800 Multimodal large language models (MLLMs) have achieved strong multimodal reasoning capabilities, but their efficiency is limited by the large number of visual tokens, which introduces substantial computational overhead. Visual token pruning offers a natural solution, yet existing methods are imperfect: attention-based criteria tend to retain redundant tokens, while diversity-based criteria are often agnostic to user instructions. Even methods that combine multiple criteria still lack a principl… When Does Combining Language Models Help? A Co-Failure Ceiling on Routing, Voting, and Mixture-of-Agents Across 67 Frontier Models ../papers/arxiv-89d85e485ed3.html https://arxiv.org/abs/2606.27288v1#2026-06-26#lm Fri, 26 Jun 2026 13:16:53 +0800 Multi-model LLM systems such as routing, voting, cascades, fusion, and mixture-of-agents are used to beat single-model accuracy. We show that their gain is capped by a quantity the field rarely reports. For any policy whose output is one member model answer, accuracy cannot exceed one minus beta, where beta is the rate at which every model is wrong on the same query. 
In contrast, the usual diagnostic, average pairwise error correlation rho, cannot identify beta: error laws with identical margin… Improving General Role-Playing Agents via Psychology-Grounded Reasoning and Role-Aware Policy Optimization ../papers/arxiv-de6c8b129e13.html https://arxiv.org/abs/2606.27025v1#2026-06-26#lm Fri, 26 Jun 2026 13:16:53 +0800 Building general-purpose role-playing agents that faithfully portray any character from a natural-language profile remains challenging. The dominant paradigm -- supervised fine-tuning -- encourages behavioral mimicry without deep, human-like internal thought processes, resulting in poor out-of-distribution generalization. Therefore, we propose Psy-CoT, a psychology-grounded chain-of-thought framework that decomposes pre-response reasoning into three role-specific steps -- Interac… HarmVideoBench: Benchmarking Harmful Video Understanding in Large Multimodal Models ../papers/arxiv-e4f42bdcbdde.html https://arxiv.org/abs/2606.27187v1#2026-06-26#lm Fri, 26 Jun 2026 13:16:53 +0800 Large vision-language models (LVLMs) have recently shown immense potential in automated content moderation, sparking growing interest in developing harmful-video benchmarks. However, we identify two primary limitations in existing works: 1) The multi-layered characteristics of harmful videos are overlooked. Existing benchmarks predominantly formulate evaluation as a binary classification task, failing to capture implicit or deep contextual harms. 
2) Explanatory rationales are completely absent.… RolloutPipe: Overlapping Pipelined Rollout and Training in Disaggregated On-Policy LLM Reinforcement Learning ../papers/arxiv-ceeaa87c79d1.html https://arxiv.org/abs/2606.26997v1#2026-06-26#lm Fri, 26 Jun 2026 13:16:53 +0800 Large language model (LLM) post-training for reasoning increasingly relies on reinforcement learning with verifiable rewards (RLVR), where models learn from ground-truth feedback on mathematical, logical, and scientific tasks. To enable flexible resource allocation and support heterogeneous training setups, modern RLVR systems adopt disaggregated architectures that decouple rollout generation and policy training across independent GPU pools. However, existing synchronous on-policy GRPO (Group R… In-Context Model Predictive Generation: Open-Vocabulary Motion Synthesis from Language Models to Physics ../papers/arxiv-03dd67b86fdf.html https://arxiv.org/abs/2606.26981v1#2026-06-26#lm Fri, 26 Jun 2026 13:16:53 +0800 Synthesizing human motion from textual descriptions is essential for immersive digital applications, yet existing methods face a persistent trade-off between semantic fidelity and physical realism. Large language model (LLM)-based approaches can interpret diverse open-vocabulary instructions and compose high-level action plans, but they often generate motions that violate physical constraints. Physics-aware models improve realism through simulation or control, but they struggle with semantic co… OpenRCA 2.0: From Outcome Labels to Causal Process Supervision ../papers/arxiv-20b18a2e996b.html https://arxiv.org/abs/2606.27154v1#2026-06-26#lm Fri, 26 Jun 2026 13:16:53 +0800 Root cause analysis (RCA) poses a holistic test of LLM agentic capabilities, such as long-context understanding, multi-step reasoning, and tool use. 
However, existing datasets suffer from a fundamental gap: they label only the root cause, not the propagation path connecting it to the observed symptom, which largely simplifies the task to naive pattern matching. To support rigorous evaluation, we introduce PAVE, a step-wise labeling protocol that leverages known interventions from fault injectio… Just how sure are you? Improving Verbalized Uncertainty Calibration in Medical VQA ../papers/arxiv-5c94d679d076.html https://arxiv.org/abs/2606.27023v1#2026-06-26#lm Fri, 26 Jun 2026 13:16:53 +0800 Multimodal large language models (MLLMs) applied to Medical Visual Question Answering (VQA) tend to produce overconfident outputs regardless of actual correctness, and existing verbalized confidence calibration methods, developed primarily for text only LLMs, do not account for the multimodal nature of medical image understanding. This work proposes a training based framework that finetunes MLLMs to improve their calibration using a composite loss function combining a Brier style calibration te… Ask, Don't Judge: Binary Questions for Interpretable LLM Evaluation and Self-Improvement ../papers/arxiv-7dba85bc00e8.html https://arxiv.org/abs/2606.27226v1#2026-06-26#lm Fri, 26 Jun 2026 13:16:53 +0800 Evaluating LLM outputs remains a major bottleneck in NLP: human evaluation is expensive and slow, lexical metrics correlate poorly with human judgments on open-ended generation, and holistic LLM judges often produce opaque scores that are hard to debug. We propose BINEVAL, a framework that decomposes evaluation criteria into atomic binary questions and aggregates the resulting verdicts into interpretable, multi-dimensional scores. Given a task prompt, a meta-prompt generates fine-grained evalua… When are likely answers right? 
On Sequence Probability and Correctness in LLMs ../papers/arxiv-cb621cfe5b86.html https://arxiv.org/abs/2606.27359v1#2026-06-26#lm Fri, 26 Jun 2026 13:16:53 +0800 Many decoding methods for large language models can be understood as shifting probability mass toward outputs that are more likely under the model, either locally at the token level or globally at the sequence level. Therefore, their success depends on a fundamental question: when does sequence probability, that is, the conditional probability of a continuation given a prompt, actually align with correctness? In this paper, we set out to quantify this relationship across decoding methods, model… InvestPhilBench: A Multi-Layer Dynamic Benchmark for Evaluating Large Language Model Procedural Reasoning in Expert Investment Philosophy ../papers/arxiv-072adfbe1cb9.html https://arxiv.org/abs/2606.25984v1#2026-06-25#lm Thu, 25 Jun 2026 13:11:21 +0800 Large language models are increasingly deployed as investment research assistants, yet no benchmark tests whether they can accurately reconstruct and apply the specific procedural decision frameworks of expert investors. We introduce InvestPhilBench, a multi-layer dynamic benchmark spanning eight cognitive tiers, from principle identification (L1) to novel framework extrapolation (L8). The v0.6 release comprises 118 primary-source-verified investment principle cards, 25 decision framework cards… Same Evidence, Different Answer: Auditing Order Sensitivity in Multimodal Large Language Models ../papers/arxiv-117d92e125e9.html https://arxiv.org/abs/2606.26079v1#2026-06-25#lm Thu, 25 Jun 2026 13:11:21 +0800 Standard benchmarks for multimodal large language models (MLLMs) score each item on one canonical ordering and miss whether order-irrelevant shuffling changes the answer, a baseline reliability property called for by emerging AI evaluation guidelines. 
We introduce Facet-Probe, a five-facet audit (option, evidence-chunk, document-rank, image-set, and mixed-modality ordering) of 18 frontier and open-weight MLLMs. A Bayesian item-response model separates ordering noise from per-facet bias, and a s… How Robust is OCR-Reasoning? Evaluating OCR-Reasoning Robustness of Vision-Language Models under Visual Perturbations ../papers/arxiv-a993d18a7754.html https://arxiv.org/abs/2606.26041v1#2026-06-25#lm Thu, 25 Jun 2026 13:11:21 +0800 Vision-language models (VLMs) have achieved strong performance on OCR-based benchmarks and increasingly focused on text-rich understanding, but their robustness under controlled visual degradation remains insufficiently understood. This gap is critical for OCR reasoning, where visual corruption can induce OCR errors and structural distortions, thereby introducing uncertainty into the reasoning task. To systematically study this problem, we introduce OCR-Robust, a benchmark designed for evaluati… MedGuards: Multi-Agent System for Reliable Medical Error Detection and Correction ../papers/arxiv-924e9f45b440.html https://arxiv.org/abs/2606.25651v1#2026-06-25#lm Thu, 25 Jun 2026 13:11:21 +0800 As Large Language Models (LLMs) are increasingly deployed in healthcare settings, accurate error detection and correction in generated or existing text becomes critical, as even minor mistakes can pose risks to patient safety. Existing methods for error detection and correction, including automated checks and heuristic-based approaches, do not generalize well across unseen datasets. 
In this paper, we propose MedGuards as a medical safety guardrail, which is a new framework that treats medical e… Probabilistic Agents in Deterministic Audits: Evaluating Multi-Agent Systems for Automated Audits Based on the German IT-Grundschutz ../papers/doi-a2a209515fcf.html https://arxiv.org/abs/2606.25622v1#2026-06-25#lm Thu, 25 Jun 2026 13:11:21 +0800 The NIS-2 Directive mandates robust Risk Management from thousands of small and medium enterprises. To ensure compliance, companies rely on established standards such as the German IT-Grundschutz (IT-GS) of the Federal Office for Information Security. However, IT-GS certification is resource-intensive and requires a high level of manual effort for documentation, validation, and revision, making scalable implementation difficult and expensive. Building upon our previous conceptual framework, thi… TriViewBench: Controlled Complexity Scaling for Multi-View Structural Reasoning in MLLMs ../papers/arxiv-18419ba4812f.html https://arxiv.org/abs/2606.26029v1#2026-06-25#lm Thu, 25 Jun 2026 13:11:21 +0800 Multimodal Large Language Models (MLLMs) demonstrate strong performance on standard visual question answering benchmarks, yet their scalability under controlled structural complexity remains poorly understood. We introduce TriViewBench, a controlled three-view visual reasoning benchmark constructed from synthetic 3D scenes with explicitly parameterized object count and occlusion. 
The benchmark contains 1,923 scenes and over 14K Question-Answer (QA) pairs organized into four complexity levels an… Uncertainty Quantification for Computer-Use Agents: A Benchmark across Vision-Language Models and GUI Grounding Datasets ../papers/arxiv-4357e9fa7bf7.html https://arxiv.org/abs/2606.25760v1#2026-06-25#lm Thu, 25 Jun 2026 13:11:21 +0800 Computer-use agents turn vision-language model (VLM) predictions into executable GUI clicks, so reliable uncertainty estimates are essential for rejection, calibration, miss-severity ranking, and spatial safety regions. Yet evidence on post-hoc uncertainty quantification (UQ) for these agents is fragmented across isolated model and dataset pairs, leaving it unclear whether UQ rankings stay stable when the agent, benchmark, or observable interface changes. We present Argus, a cross-regime benchm… Beyond Function Calling: Benchmarking Tool-Using Agents under Tool-Environment Unreliability ../papers/arxiv-dd5092a23d14.html https://arxiv.org/abs/2606.25819v1#2026-06-25#lm Thu, 25 Jun 2026 13:11:21 +0800 Large language models are increasingly deployed as agents that solve tasks by interacting with external tool environments. Although recent tool-use benchmarks increasingly cover complex task settings, they still largely assume clean, stable, and trustworthy tool environments, leaving tool-environment unreliability insufficiently examined. We introduce ToolBench-X, a benchmark for evaluating agents under recoverable reliability hazards. ToolBench-X contains executable multi-step tasks across div… Do Encoders Suffice? A Systematic Comparison of Encoder and Decoder Safety Judges for LLM Adversarial Evaluation ../papers/arxiv-cca18893a109.html https://arxiv.org/abs/2606.25782v1#2026-06-25#lm Thu, 25 Jun 2026 13:11:21 +0800 With the widespread adoption of large language models (LLMs) in chatbots and everyday applications, companies increasingly need guardrails that are effective while remaining low-cost and low-latency. 
Safety evaluation of LLM outputs has generally relied on LLM-based judges, which can be effective but are often slow and expensive to deploy at scale. In this paper, we evaluate whether fine-tuned modern encoder classifiers from the ModernBERT family, including ModernBERT and Ettin, can reliably id… RAS: Measuring LLM Safety Through Refusal Alignment ../papers/arxiv-27c960f270d2.html https://arxiv.org/abs/2606.25750v1#2026-06-25#lm Thu, 25 Jun 2026 13:11:21 +0800 Safety evaluation of large language models (LLMs) is commonly performed by querying models with unsafe or jailbreak prompts and judging whether their outputs violate a safety policy. Although useful, output-level evaluation is expensive, sensitive to judge choice, and easily tied to fixed question banks. We propose SafeVec, a white-box evaluation procedure that measures safety from internal representations rather than generated answers. SafeVec first extracts layer-wise refusal directio… Neglected Free Lunch from Post-training: Progress Advantage for LLM Agents ../papers/arxiv-6a39b00c84b9.html https://arxiv.org/abs/2606.26080v1#2026-06-25#lm Thu, 25 Jun 2026 13:11:21 +0800 Process reward models enable fine-grained, step-level evaluation of LLMs, yet building them for agentic settings remains prohibitively difficult: long-horizon interactions, irreversible actions, and stochastic environment feedback make both human annotation and Monte Carlo estimation infeasible at scale. 
In this work, we show that reinforcement learning (RL) post-training already provides the ingredients for effective step-level scoring, eliminating the need for dedicated reward model training… Explainable Control Framework (XCF) based on Fuzzy Model-Agnostic Explanation and LLM Agent-Supported Interface ../papers/arxiv-ff6028825464.html https://arxiv.org/abs/2606.25941v1#2026-06-25#lm Thu, 25 Jun 2026 13:11:21 +0800 Increasing demand for precise and reliable control in complex scenarios has led to the development of increasingly sophisticated controllers, including data-driven approaches employing closed box models and mathematically rigorous yet complex designs. This complexity highlights the needs for explainable control that can provide human-understandable insights into controller behavior. In this paper, an explainable control framework (XCF) along with supporting algorithms and user interface are pro… SpeechEQ: Benchmarking Emotional Intelligence Quotient in Socially Aware Voice Conversational Models ../papers/arxiv-e29b27f28510.html https://arxiv.org/abs/2606.25990v1#2026-06-25#lm Thu, 25 Jun 2026 13:11:21 +0800 As multimodal conversational systems increasingly engage in spoken interaction, their ability to navigate paralinguistic social cues has become a critical bottleneck for natural human-AI communication. However, existing evaluations of machine emotional intelligence assess reasoning exclusively through isolated text or passive acoustic perception, overlooking the complex cross-modal reasoning required for active, multi-turn dialogue. 
We introduce SpeechEQ, a comprehensive framework desi… MiniOpt: Reasoning to Model and Solve General Optimization Problems with Limited Resources ../papers/arxiv-c18831bd2d45.html https://arxiv.org/abs/2606.25832v1#2026-06-25#lm Thu, 25 Jun 2026 13:11:21 +0800 Achieving strong optimization generalization across diverse optimization problems while requiring limited training resources remains a challenging problem for optimization-oriented large language models (LLMs). Existing approaches typically rely on large-scale supervised datasets, costly reasoning annotations, and expensive intermediate step verification, resulting in substantial training overhead. To address these challenges, we propose MiniOpt, a reinforcement learning framework that learns t… SARA: Unlocking Multilingual Knowledge in Mixture-of-Experts via Semantically Anchored Routing Alignment ../papers/arxiv-26ee80b3fcf6.html https://arxiv.org/abs/2606.25821v1#2026-06-25#lm Thu, 25 Jun 2026 13:11:21 +0800 Sparse Mixture-of-Experts (MoE) architectures have emerged as an increasingly influential paradigm as they offer a strategic balance between parameter scalability and computational efficiency. However, low-resource languages, which suffer from a scarcity of high-quality training data, often have their tokens routed to different experts than those predominantly activated by high-resource inputs, which limits cross-lingual expert sharing. This cross-lingual routing divergence consequently hinders… AGORA: An Archive-Grounded Benchmark for Agentic Workplace Document Reasoning ../papers/arxiv-7fb19b10d271.html https://arxiv.org/abs/2606.24526v1#2026-06-24#lm Wed, 24 Jun 2026 13:06:49 +0800 Large language models are increasingly deployed as agents that reason over documents rather than answer from parametric knowledge. 
We study archive-grounded reasoning: locating sparse evidence across a large, messy collection of workplace files, reconciling inconsistent terminology, units, and time conventions, and computing an answer. Existing benchmarks address only parts of this setting and none jointly stresses archive-groundedness, agentic exploration, and cross-domain coverage. We introdu… AI-PAVE-Br: Leveraging Large Language Models for Enhanced Product Attribute Value Extraction through a Golden Set Approach ../papers/doi-d4dcf6e219ed.html https://arxiv.org/abs/2606.24655v1#2026-06-24#lm Wed, 24 Jun 2026 13:06:49 +0800 The explosive growth and complexity of product data within the dynamic Brazilian e-commerce landscape demand robust and specialized methods for structured information extraction. Traditional approaches to Product Attribute Value Extraction (PAVE) often struggle with the linguistic nuances and sheer diversity of product descriptions in Portuguese. To address this critical gap, this paper introduces two major contributions. First, we present AI-PAVEBr, a specialized system engineered with Large L… A specialized reasoning large language model for accelerating rare disease diagnosis: a randomized AI physician assistance trial ../papers/arxiv-15784a9d3bc2.html https://arxiv.org/abs/2606.24510v1#2026-06-24#lm Wed, 24 Jun 2026 13:06:49 +0800 Rare diseases affect millions of individuals worldwide, yet timely diagnosis remains a major public health challenge due to scarcity of specialized clinical expertise. While large language models (LLMs) show promise to support rare disease diagnosis, current models are constrained by insufficient clinical deployability, limited clinically grounded evidence, and scarcity of training data. 
Here we present RaDaR (Rare Disease navigatoR), an open-source, compact reasoning LLM (32B parameters) for r… EG-VQA: Benchmarking Verifiable Video Question Answering with Grounded Temporal Evidence ../papers/arxiv-a950eeb96676.html https://arxiv.org/abs/2606.24797v1#2026-06-24#lm Wed, 24 Jun 2026 13:06:49 +0800 Recent advances in Video Large Language Models (Video-LLMs) have yielded promising performance on video question answering (VideoQA). Nevertheless, existing benchmarks are predominantly evaluated through answer correctness, while the grounding of predictions in relevant video evidence remains largely unexamined. This disconnect between answer generation and evidence understanding motivates the construction of the Evidence-Grounded Video Question Answering Benchmark (EG-VQA), an open-ended evalu… Are We Ready For An Agent-Native Memory System? ../papers/arxiv-09ad880f1f66.html https://arxiv.org/abs/2606.24775v1#2026-06-24#lm Wed, 24 Jun 2026 13:06:49 +0800 Memory for large language model (LLM) agents has rapidly evolved from simple retrieval-augmented mechanisms into a data management system that supports persistent information storage, retrieval, update, consolidation, and dynamic lifecycle governance throughout agent execution. Despite this evolution, existing evaluations still benchmark agent memory mainly through end-to-end task success metrics (e.g., F1, BLEU), while treating the underlying system as a monolithic black box. As a result, crit… CineCap: Structured Reasoning with Spatio-Temporal Anchors for Cinematographic Video Captioning ../papers/arxiv-521120a059b4.html https://arxiv.org/abs/2606.24636v1#2026-06-24#lm Wed, 24 Jun 2026 13:06:49 +0800 Cinematographic captioning aims to describe how a video is filmed using professional film-language concepts such as camera movement, shot size, depth of field, composition, and shooting angle. 
This capability is important for fine-grained video understanding and controllable movie-quality video generation, yet remains underexplored in existing multimodal large language models. Unlike question-answering-based evaluation of cinematic understanding, cinematographic captioning requires a unified op… AdversaBench: Automated LLM Red-Teaming with Multi-Judge Confirmation and Cross-Model Transferability ../papers/arxiv-22621133f739.html https://arxiv.org/abs/2606.24589v1#2026-06-24#lm Wed, 24 Jun 2026 13:06:49 +0800 Scaling adversarial evaluation of large language models requires both a method for generating hard inputs and a reliable way to confirm that resulting failures are real. We present AdversaBench, an end-to-end red-teaming pipeline that mutates seed prompts with five structured operators, queries a target model, and confirms failures through a three-judge panel with a meta-judge tiebreaker. We report experiments on 45 seeds across three categories: reasoning, instruction-following, and tool use.… Privacy-Preserving RAG via Multi-Agent Semantic Rewriting: Achieving Confidentiality Without Compromising Contextual Fidelity ../papers/arxiv-80e1786313ab.html https://arxiv.org/abs/2606.24623v1#2026-06-24#lm Wed, 24 Jun 2026 13:06:49 +0800 Retrieval-Augmented Generation enhances large language models by incorporating external knowledge, but deploying it in sensitive scenarios risks privacy leakage via malicious prompts. To address this, we propose a multi-agent framework that sanitizes retrieved content through semantic rewriting. By employing three specialized agents for privacy extraction, semantic analysis, and reconstruction, our approach collaboratively removes sensitive identifiers while preserving the semantic core. 
We eva… Same Lesson, Different Story: Cross-Lingual Reconstruction of Cultural Narratives in Large Language Models ../papers/arxiv-f1627fb5a350.html https://arxiv.org/abs/2606.24610v1#2026-06-24#lm Wed, 24 Jun 2026 13:06:49 +0800 The evaluation of cultural grounding context becomes complex when multiple cultures convey the same moral lesson. This challenge is particularly relevant to large language models (LLMs), which produce narratives across a wide range of languages and cultural contexts. However, it remains uncertain whether these models preserve culturally grounded meaning when equivalent moral lessons are conveyed through distinct cultural forms. This study introduces a multilingual evaluation narrative framework… Accuracy and Satisfaction in Multi-Turn LLM Dialogues for NFR Assessment ../papers/arxiv-ecde3387ca11.html https://arxiv.org/abs/2606.24834v1#2026-06-24#lm Wed, 24 Jun 2026 13:06:49 +0800 LLM-based dialogue assistants have become mainstream tools for software developers, yet current evaluation benchmarks focus exclusively on functional correctness. This leaves a critical gap in assessing the quality and accuracy of these conversations when handling Non-Functional Requirements (NFRs), which are inherently vague, context-dependent, and involve many parts of a program. Evaluating how well these systems support collaborative reasoning about NFRs requires methods that go beyond singl… ScaleToT: Generalizing Structured LLM Reasoning for Billion-Scale Low-Activity User Modeling ../papers/arxiv-b83ab398c080.html https://arxiv.org/abs/2606.24605v1#2026-06-24#lm Wed, 24 Jun 2026 13:06:49 +0800 Accurate user modeling often depends on rich interaction histories, which are unavailable for billions of low-activity users. Large Language Models (LLMs) can infer latent user states from static profiles, but this reasoning becomes unreliable when profiles are sparse, and applying an LLM to billions of users is prohibitively expensive. 
We present ScaleToT, which learns structured reasoning from a small LLM-processed subset and extends it to the broader low-activity user population. To improve… Scaling Laws for Task-Specific LLM Distillation ../papers/arxiv-f1ab970e444f.html https://arxiv.org/abs/2606.24747v1#2026-06-24#lm Wed, 24 Jun 2026 13:06:49 +0800 Large Language Models (LLMs) achieve strong performance across a growing range of domains, yet their scale poses deployment challenges in applications where latency and cost constraints are critical. This paper derives empirical scaling laws for domain-specific LLM compression, quantifying how in-domain and general knowledge performance scale with dataset size, compression ratio, supervision format, and iterative pruning schedule. Using quantitative finance as our application domain, we compare… Reinforcement Learning for Computer-Use Agents with Autonomous Evaluation ../papers/arxiv-640fe613ba1c.html https://arxiv.org/abs/2606.24515v1#2026-06-24#lm Wed, 24 Jun 2026 13:06:49 +0800 Computer-Use Agents (CUAs) execute high-level user goals by perceiving and acting directly within graphical user interfaces. However, reinforcement learning for CUAs remains difficult because open-ended desktop environments rarely provide scalable, machine-readable reward signals: task success is often visually grounded and hard to specify with handcrafted reward functions or dense manual labels. We propose an RL fine-tuning framework that uses autonomous vision-language evaluation as a scalabl… ReM-MoA: Reasoning Memory Sustains Mixture-of-Agents Scaling ../papers/arxiv-945d1e05b320.html https://arxiv.org/abs/2606.24437v1#2026-06-24#lm Wed, 24 Jun 2026 13:06:49 +0800 Mixture-of-Agents (MoA) architectures improve inference-time scaling by organizing multiple LLM agents into layered reasoning pipelines. However, existing MoA variants fail to sustain gains as depth increases, exhibiting degradation, early plateauing, or saturation. 
We propose ReM-MoA, a memory-augmented MoA framework that sustains scaling through two mechanisms: (1) a Ranked Reasoning Memory that persistently stores and ranks reasoning traces from all layers using a comparative Reviewer Agent,… Qwen-AgentWorld: Language World Models for General Agents ../papers/arxiv-dbbe3714f257.html https://arxiv.org/abs/2606.24597v1#2026-06-24#lm Wed, 24 Jun 2026 13:06:49 +0800 A world model predicts environment dynamics based on current observations and actions, serving as a core cognitive mechanism for reasoning and planning. In this work, we investigate how world modeling based on language models can further push the boundaries of general agents. (i) We first focus on building foundation models for agentic environment simulation. We introduce Qwen-AgentWorld-35B-A3B and Qwen-AgentWorld-397B-A17B, the first language world models capable of simulating agentic environ… AIR: Adaptive Interleaved Reasoning with Code in MLLMs ../papers/arxiv-3b598225b45f.html https://arxiv.org/abs/2606.23678v1#2026-06-23#lm Tue, 23 Jun 2026 13:10:02 +0800 Following the paradigm shift initiated by OpenAI o3, interleaved reasoning with code to enhance multimodal large language models (MLLMs) has become a pivotal research frontier. The existing literature focuses primarily on tool-use within vision-perception tasks. However, such approaches typically rely on predefined heuristics for visual manipulation and are inherently incapable of addressing numerical computation problems due to their exclusive focus on visual operations. This paper empowers ML… TriggerBench: Investigating Prospective Memory for Large Language Models ../papers/arxiv-6e7f90d682c8.html https://arxiv.org/abs/2606.23459v1#2026-06-23#lm Tue, 23 Jun 2026 13:10:02 +0800 While Large Language Models (LLMs) are increasingly deployed in long interactions, existing evaluations focus predominantly on retrospective memory (RM) via explicit queries. 
Prospective memory (PM), the critical ability to spontaneously recall and act on latent constraints without direct prompts, remains largely unevaluated. We introduce TriggerBench, a comprehensive PM benchmark spanning five dimensions across both daily assistants and professional workflows. TriggerBench pairs scenarios with… Can LLMs Reliably Self-Report Adversarial Prefills, and How? ../papers/arxiv-1052b107e838.html https://arxiv.org/abs/2606.23671v1#2026-06-23#lm Tue, 23 Jun 2026 13:10:02 +0800 Prior work shows that large language models (LLMs) exhibit introspective capability on benign tasks. We extend the question to safety contexts and examine how reliably a model can recognize that its own prior response was elicited by an adversarial prefill attack. Across ten open-weight instruction-tuned LLMs (3B to 70B) and four safety benchmarks, no model reliably recognizes its own compromised outputs, with models claiming intent on prefilled responses at an average rate of 27.3%. Introsp… Evaluation Awareness Is Not One Capability: Evidence from Open Language Models ../papers/arxiv-ee26aecb66ba.html https://arxiv.org/abs/2606.23583v1#2026-06-23#lm Tue, 23 Jun 2026 13:10:02 +0800 Safety benchmarks assume that test-condition behavior predicts deployment behavior, an assumption that fails if models detect evaluation cues and adapt. This opens a gap between benchmark performance and deployment behavior: compliance measured under test conditions becomes an optimistic upper bound that overstates how safely a model behaves once the evaluation harness is removed. 
We characterize this evaluation awareness through eight experiments across 37 open-weight models and seven families… POTracker: Optimizing Large Language Models for Standard-Compliant Power Outage Report Generation ../papers/arxiv-6142e52062f6.html https://arxiv.org/abs/2606.23533v1#2026-06-23#lm Tue, 23 Jun 2026 13:10:02 +0800 Recent large language models (LLMs) are good at general text generation, but it is still hard to use them for domain-specific data generation because the output must follow strict formatting and structural rules. Unlike open-ended tasks such as question answering or translation, domain-specific generation must be both semantically correct and compliant with existing guidelines and standards. In this work, we study the nationwide interoperability problem of utility power outage reports in the Un…