LLM Topic Archive

llm.html Long-term RSS tracker for the keyword "LLM", aggregating all previously matched papers. zh-CN Sun, 28 Jun 2026 05:24:06 +0000 NuclearQAv2: A Structured Benchmark for Evaluating Domain-Science Competence in Large Language Models ../papers/arxiv-91c0ed0f09c2.html https://arxiv.org/abs/2606.27047v1#2026-06-26#llm Fri, 26 Jun 2026 13:16:53 +0800 Large language models (LLMs) have demonstrated strong performance across a wide range of tasks, but ensuring their reliability in highly technical domains remains a significant challenge. In nuclear engineering, problem solving often requires not only factual knowledge but also quantitative reasoning and conceptual understanding. To address the need for systematic evaluation in this domain, we introduce NuclearQAv2, a benchmark for assessing LLMs on nuclear engineering knowledge. The benchmark… Joint Learning of Experiential Rules and Policies for Large Language Model Agents ../papers/arxiv-48c067a92ef9.html https://arxiv.org/abs/2606.27136v1#2026-06-26#llm Fri, 26 Jun 2026 13:16:53 +0800 For LLM agents in multi-step interactive environments, a key challenge is to make effective use of accumulated interaction experience. Existing work has typically separated two uses of such experience: keeping it outside the model as natural-language rules for later prompting, or using trajectories and feedback to update the model parameters. The former is easy to interpret but can fall out of sync with the evolving policy; the latter improves the policy more broadly but provides only limited c… The Riddle Riddle: Testing Flexible Reasoning in Large Language Models and Humans ../papers/arxiv-852671b09eb4.html https://arxiv.org/abs/2606.27103v1#2026-06-26#llm Fri, 26 Jun 2026 13:16:53 +0800 Humans flexibly adapt their reasoning strategies to the requirements of a given problem. Large language models (LLMs) have performed well on many cognitive tasks; however, it is unclear whether this accuracy is a result of pattern matching from training data or flexible reasoning. 
Here, we introduce a novel paradigm to test this question: the riddle riddle paradigm. Riddle riddles are word problems written to mimic popular riddles, but altered so their answers only require literal interpretatio… Prompt Injection in Automated Résumé Screening with Large Language Models: Single and Multi-Injection Settings ../papers/arxiv-663bd6d3e1b5.html https://arxiv.org/abs/2606.27287v1#2026-06-26#llm Fri, 26 Jun 2026 13:16:53 +0800 Large language models (LLMs) are increasingly used to screen and rank job applicants, creating incentives for candidates to strategically manipulate algorithmic hiring systems. We study prompt injection in automated résumé screening, defined as subtle self-promotional text that introduces no new qualifications but is designed to influence LLM evaluations. Using controlled experiments, we show that prompt injection reliably improves applicant rankings when résumé quality is homogeneous and few c… Semantic Early-Stopping for Iterative LLM Agent Loops ../papers/arxiv-232f944cff9f.html https://arxiv.org/abs/2606.27009v1#2026-06-26#llm Fri, 26 Jun 2026 13:16:53 +0800 Multi-agent large language model (LLM) loops, for example a Writer that drafts and a Critic that revises, are almost always terminated by a fixed iteration cap (max_iterations). This is a syntactic kill-switch: it is blind to whether the answer is still improving, so it over-spends tokens on easy inputs and truncates hard ones. 
We study semantic early-stopping: the loop halts when consecutive draft embeddings stop changing in meaning (cosine distance with a patience window) and the answer's mea… TOPS: First-Principles Visual Token Pruning via Constructing Token Optimal Preservation Sets for Efficient MLLM Inference ../papers/arxiv-4739852a0036.html https://arxiv.org/abs/2606.27161v1#2026-06-26#llm Fri, 26 Jun 2026 13:16:53 +0800 Multimodal large language models (MLLMs) have achieved strong multimodal reasoning capabilities, but their efficiency is limited by the large number of visual tokens, which introduces substantial computational overhead. Visual token pruning offers a natural solution, yet existing methods are imperfect: attention-based criteria tend to retain redundant tokens, while diversity-based criteria are often agnostic to user instructions. Even methods that combine multiple criteria still lack a principl… When Does Combining Language Models Help? A Co-Failure Ceiling on Routing, Voting, and Mixture-of-Agents Across 67 Frontier Models ../papers/arxiv-89d85e485ed3.html https://arxiv.org/abs/2606.27288v1#2026-06-26#llm Fri, 26 Jun 2026 13:16:53 +0800 Multi-model LLM systems such as routing, voting, cascades, fusion, and mixture-of-agents are used to beat single-model accuracy. We show that their gain is capped by a quantity the field rarely reports. For any policy whose output is one member model answer, accuracy cannot exceed one minus beta, where beta is the rate at which every model is wrong on the same query. 
In contrast, the usual diagnostic, average pairwise error correlation rho, cannot identify beta: error laws with identical margin… Improving General Role-Playing Agents via Psychology-Grounded Reasoning and Role-Aware Policy Optimization ../papers/arxiv-de6c8b129e13.html https://arxiv.org/abs/2606.27025v1#2026-06-26#llm Fri, 26 Jun 2026 13:16:53 +0800 Building general-purpose role-playing agents that faithfully portray any character from a natural-language profile remains challenging. The dominant paradigm -- supervised fine-tuning -- encourages behavioral mimicry without deep, human-like internal thought processes, resulting in poor out-of-distribution generalization. Therefore, we propose Psy-CoT, a psychology-grounded chain-of-thought framework that decomposes pre-response reasoning into three role-specific steps -- Interac… RolloutPipe: Overlapping Pipelined Rollout and Training in Disaggregated On-Policy LLM Reinforcement Learning ../papers/arxiv-ceeaa87c79d1.html https://arxiv.org/abs/2606.26997v1#2026-06-26#llm Fri, 26 Jun 2026 13:16:53 +0800 Large language model (LLM) post-training for reasoning increasingly relies on reinforcement learning with verifiable rewards (RLVR), where models learn from ground-truth feedback on mathematical, logical, and scientific tasks. To enable flexible resource allocation and support heterogeneous training setups, modern RLVR systems adopt disaggregated architectures that decouple rollout generation and policy training across independent GPU pools. 
However, existing synchronous on-policy GRPO (Group R… In-Context Model Predictive Generation: Open-Vocabulary Motion Synthesis from Language Models to Physics ../papers/arxiv-03dd67b86fdf.html https://arxiv.org/abs/2606.26981v1#2026-06-26#llm Fri, 26 Jun 2026 13:16:53 +0800 Synthesizing human motion from textual descriptions is essential for immersive digital applications, yet existing methods face a persistent trade-off between semantic fidelity and physical realism. Large language model (LLM)-based approaches can interpret diverse open-vocabulary instructions and compose high-level action plans, but they often generate motions that violate physical constraints. Physics-aware models improve realism through simulation or control, but they struggle with semantic co… OpenRCA 2.0: From Outcome Labels to Causal Process Supervision ../papers/arxiv-20b18a2e996b.html https://arxiv.org/abs/2606.27154v1#2026-06-26#llm Fri, 26 Jun 2026 13:16:53 +0800 Root cause analysis (RCA) poses a holistic test of LLM agentic capabilities, such as long-context understanding, multi-step reasoning, and tool use. However, existing datasets suffer from a fundamental gap: they label only the root cause, not the propagation path connecting it to the observed symptom, which largely simplifies the task to naive pattern matching. To support rigorous evaluation, we introduce PAVE, a step-wise labeling protocol that leverages known interventions from fault injectio… Just how sure are you? Improving Verbalized Uncertainty Calibration in Medical VQA ../papers/arxiv-5c94d679d076.html https://arxiv.org/abs/2606.27023v1#2026-06-26#llm Fri, 26 Jun 2026 13:16:53 +0800 Multimodal large language models (MLLMs) applied to Medical Visual Question Answering (VQA) tend to produce overconfident outputs regardless of actual correctness, and existing verbalized confidence calibration methods, developed primarily for text only LLMs, do not account for the multimodal nature of medical image understanding. 
This work proposes a training based framework that finetunes MLLMs to improve their calibration using a composite loss function combining a Brier style calibration te… Ask, Don't Judge: Binary Questions for Interpretable LLM Evaluation and Self-Improvement ../papers/arxiv-7dba85bc00e8.html https://arxiv.org/abs/2606.27226v1#2026-06-26#llm Fri, 26 Jun 2026 13:16:53 +0800 Evaluating LLM outputs remains a major bottleneck in NLP: human evaluation is expensive and slow, lexical metrics correlate poorly with human judgments on open-ended generation, and holistic LLM judges often produce opaque scores that are hard to debug. We propose BINEVAL, a framework that decomposes evaluation criteria into atomic binary questions and aggregates the resulting verdicts into interpretable, multi-dimensional scores. Given a task prompt, a meta-prompt generates fine-grained evalua… When are likely answers right? On Sequence Probability and Correctness in LLMs ../papers/arxiv-cb621cfe5b86.html https://arxiv.org/abs/2606.27359v1#2026-06-26#llm Fri, 26 Jun 2026 13:16:53 +0800 Many decoding methods for large language models can be understood as shifting probability mass toward outputs that are more likely under the model, either locally at the token level or globally at the sequence level. Therefore, their success depends on a fundamental question: when does sequence probability, that is, the conditional probability of a continuation given a prompt, actually align with correctness? In this paper, we set out to quantify this relationship across decoding methods, model… Jailbreaking for the Average Jane: Choosing Optimal Jailbreaks via Bandit Algorithms for Automatically Enhanced Queries ../papers/arxiv-eca9abd16e6a.html https://arxiv.org/abs/2606.26936v1#2026-06-26#llm Fri, 26 Jun 2026 13:16:53 +0800 With a profusion of jailbreaks for LLMs now widely known, a growing concern is that non-expert malicious actors ("the average Jane") could elicit actionable responses to malicious requests. 
In this work, we examine whether this concern is justified. A non-expert malicious actor requires two ingredients for a successful attack: a powerful jailbreak for their target model, acting on an effective malicious query. For the former, we propose a novel attack strategy based on the multi-armed bandit fr… Smaller Models, Unexpected Costs: Trade-offs in LLM Quantization for Automated Program Repair ../papers/arxiv-b661505d2f8b.html https://arxiv.org/abs/2606.27205v1#2026-06-26#llm Fri, 26 Jun 2026 13:16:53 +0800 Language Models (LLMs) are powerful tools and have been increasingly adopted for complex software engineering tasks. As the number of parameters increases, results can often be improved, but this also imposes substantial memory requirements. While quantization effectively reduces the memory footprint, its overall impact is often summarized only by benchmark scores, which mask changes in model behavior and non-functional overheads. In this work, we conduct an empirical evaluation of LLM quantization u… To Run or Not to Run: Analyzing the Cost-Effectiveness of Code Execution in LLM-Based Program Repair ../papers/arxiv-f2d059b8979b.html https://arxiv.org/abs/2606.26978v1#2026-06-26#llm Fri, 26 Jun 2026 13:16:53 +0800 LLM-based agents for program repair are increasingly built on a "generate-run-revise" paradigm, iteratively executing tests to evaluate and refine patches. This execution-based approach has become standard practice in state-of-the-art systems. However, executions can be time-consuming and expensive, yet their impact on these agents remains underexplored. In this paper, we conduct a two-stage empirical study over execution behavior in LLM-based program repair. To characterize execution behavior… How Much Static Structure Do Code Agents Need? 
A Study of Deterministic Anchoring ../papers/arxiv-b4fa44958d89.html https://arxiv.org/abs/2606.26979v1#2026-06-26#llm Fri, 26 Jun 2026 13:16:53 +0800 LLM-based code agents navigate repositories through keyword search but miss the structural relationships, such as call graphs, inheritance hierarchies, and configuration dependencies, that define how software actually works. This makes agent navigation stochastic and difficult to reproduce across runs. We investigate whether lightweight static analysis can provide deterministic anchors for these agents: stable structural facts injected as plain-text comments that constrain probabilistic explora… A Deterministic Control Plane for LLM Coding Agents ../papers/arxiv-21cf1b4bd2a8.html https://arxiv.org/abs/2606.26924v1#2026-06-26#llm Fri, 26 Jun 2026 13:16:53 +0800 LLM coding harnesses grant agents broad file and shell access, yet the configuration layer that steers them -- rules files, agent definitions, IDE-specific markdown -- is largely unmanaged. A prevalence study of 10,008 public GitHub repositories (n=6,145 agent config files) finds that agent configurations propagate as undeclared shared components: 10.1% of tracked paths are SHA-256 exact duplicates across independent repositories (fork-adjusted, threshold-independent), with 75.5% of clone pairs… NOVA: A Verification-Aware Agent Harness for Architecture Evolution in Industrial Recommender Systems ../papers/arxiv-49487ea267ee.html https://arxiv.org/abs/2606.27243v1#2026-06-26#llm Fri, 26 Jun 2026 13:16:53 +0800 Industrial advertising recommender models are continuously improved through architecture evolution. Upgrades such as RankMixer, TokenMixer-Large, and MixFormer show that better structures remain a key source of quality and business gains. Yet developing such upgrades in production is expert-intensive and difficult to scale. 
Existing automation is insufficient: AutoML mainly tunes hyper-parameters, while effective gains often require cross-module changes under strict constraints; generic LLM cod… Same Evidence, Different Answer: Auditing Order Sensitivity in Multimodal Large Language Models ../papers/arxiv-117d92e125e9.html https://arxiv.org/abs/2606.26079v1#2026-06-25#llm Thu, 25 Jun 2026 13:11:21 +0800 Standard benchmarks for multimodal large language models (MLLMs) score each item on one canonical ordering and miss whether order-irrelevant shuffling changes the answer, a baseline reliability property called for by emerging AI evaluation guidelines. We introduce Facet-Probe, a five-facet audit (option, evidence-chunk, document-rank, image-set, and mixed-modality ordering) of 18 frontier and open-weight MLLMs. A Bayesian item-response model separates ordering noise from per-facet bias, and a s… How Robust is OCR-Reasoning? Evaluating OCR-Reasoning Robustness of Vision-Language Models under Visual Perturbations ../papers/arxiv-a993d18a7754.html https://arxiv.org/abs/2606.26041v1#2026-06-25#llm Thu, 25 Jun 2026 13:11:21 +0800 Vision-language models (VLMs) have achieved strong performance on OCR-based benchmarks and increasingly focused on text-rich understanding, but their robustness under controlled visual degradation remains insufficiently understood. This gap is critical for OCR reasoning, where visual corruption can induce OCR errors and structural distortions, thereby introducing uncertainty into the reasoning task. 
To systematically study this problem, we introduce OCR-Robust, a benchmark designed for evaluati… MedGuards: Multi-Agent System for Reliable Medical Error Detection and Correction ../papers/arxiv-924e9f45b440.html https://arxiv.org/abs/2606.25651v1#2026-06-25#llm Thu, 25 Jun 2026 13:11:21 +0800 As Large Language Models (LLMs) are increasingly deployed in healthcare settings, accurate error detection and correction in generated or existing text becomes critical, as even minor mistakes can pose risks to patient safety. Existing methods for error detection and correction, including automated checks and heuristic-based approaches, do not generalize well across unseen datasets. In this paper, we propose MedGuards as a medical safety guardrail, which is a new framework that treats medical e… Probabilistic Agents in Deterministic Audits: Evaluating Multi-Agent Systems for Automated Audits Based on the German IT-Grundschutz ../papers/doi-a2a209515fcf.html https://arxiv.org/abs/2606.25622v1#2026-06-25#llm Thu, 25 Jun 2026 13:11:21 +0800 The NIS-2 Directive mandates robust Risk Management from thousands of small and medium enterprises. To ensure compliance, companies rely on established standards such as the German IT-Grundschutz (IT-GS) of the Federal Office for Information Security. However, IT-GS certification is resource-intensive and requires a high level of manual effort for documentation, validation, and revision, making scalable implementation difficult and expensive. Building upon our previous conceptual framework, thi… TriViewBench: Controlled Complexity Scaling for Multi-View Structural Reasoning in MLLMs ../papers/arxiv-18419ba4812f.html https://arxiv.org/abs/2606.26029v1#2026-06-25#llm Thu, 25 Jun 2026 13:11:21 +0800 Multimodal Large Language Models (MLLMs) demonstrate strong performance on standard visual question answering benchmarks, yet their scalability under controlled structural complexity remains poorly understood. 
We introduce TriViewBench, a controlled three-view visual reasoning benchmark constructed from synthetic 3D scenes with explicitly parameterized object count and occlusion. The benchmark contains 1,923 scenes and over 14K Question-Answer (QA) pairs organized into four complexity levels an… Do Encoders Suffice? A Systematic Comparison of Encoder and Decoder Safety Judges for LLM Adversarial Evaluation ../papers/arxiv-cca18893a109.html https://arxiv.org/abs/2606.25782v1#2026-06-25#llm Thu, 25 Jun 2026 13:11:21 +0800 With the widespread adoption of large language models (LLMs) in chatbots and everyday applications, companies increasingly need guardrails that are effective while remaining low-cost and low-latency. Safety evaluation of LLM outputs has generally relied on LLM-based judges, which can be effective but are often slow and expensive to deploy at scale. In this paper, we evaluate whether fine-tuned modern encoder classifiers from the ModernBERT family, including ModernBERT and Ettin, can reliably id… RAS: Measuring LLM Safety Through Refusal Alignment ../papers/arxiv-27c960f270d2.html https://arxiv.org/abs/2606.25750v1#2026-06-25#llm Thu, 25 Jun 2026 13:11:21 +0800 Safety evaluation of large language models (LLMs) is commonly performed by querying models with unsafe or jailbreak prompts and judging whether their outputs violate a safety policy. Although useful, output-level evaluation is expensive, sensitive to judge choice, and easily tied to fixed question banks. We propose SafeVec, a white-box evaluation procedure that measures safety from internal representations rather than generated answers. 
SafeVec first extracts layer-wise refusal directio… Neglected Free Lunch from Post-training: Progress Advantage for LLM Agents ../papers/arxiv-6a39b00c84b9.html https://arxiv.org/abs/2606.26080v1#2026-06-25#llm Thu, 25 Jun 2026 13:11:21 +0800 Process reward models enable fine-grained, step-level evaluation of LLMs, yet building them for agentic settings remains prohibitively difficult: long-horizon interactions, irreversible actions, and stochastic environment feedback make both human annotation and Monte Carlo estimation infeasible at scale. In this work, we show that reinforcement learning (RL) post-training already provides the ingredients for effective step-level scoring, eliminating the need for dedicated reward model training… Explainable Control Framework (XCF) based on Fuzzy Model-Agnostic Explanation and LLM Agent-Supported Interface ../papers/arxiv-ff6028825464.html https://arxiv.org/abs/2606.25941v1#2026-06-25#llm Thu, 25 Jun 2026 13:11:21 +0800 Increasing demand for precise and reliable control in complex scenarios has led to the development of increasingly sophisticated controllers, including data-driven approaches employing closed box models and mathematically rigorous yet complex designs. This complexity highlights the needs for explainable control that can provide human-understandable insights into controller behavior. In this paper, an explainable control framework (XCF) along with supporting algorithms and user interface are pro… MiniOpt: Reasoning to Model and Solve General Optimization Problems with Limited Resources ../papers/arxiv-c18831bd2d45.html https://arxiv.org/abs/2606.25832v1#2026-06-25#llm Thu, 25 Jun 2026 13:11:21 +0800 Achieving strong optimization generalization across diverse optimization problems while requiring limited training resources remains a challenging problem for optimization-oriented large language models (LLMs). 
Existing approaches typically rely on large-scale supervised datasets, costly reasoning annotations, and expensive intermediate step verification, resulting in substantial training overhead. To address these challenges, we propose MiniOpt, a reinforcement learning framework that learns t… SARA: Unlocking Multilingual Knowledge in Mixture-of-Experts via Semantically Anchored Routing Alignment ../papers/arxiv-26ee80b3fcf6.html https://arxiv.org/abs/2606.25821v1#2026-06-25#llm Thu, 25 Jun 2026 13:11:21 +0800 Sparse Mixture-of-Experts (MoE) architectures have emerged as an increasingly influential paradigm as they offer a strategic balance between parameter scalability and computational efficiency. However, low-resource languages, which suffer from a scarcity of high-quality training data, often have their tokens routed to different experts than those predominantly activated by high-resource inputs, which limits cross-lingual expert sharing. This cross-lingual routing divergence consequently hinders… How Reliable Is Your Jailbreak Judge? Calibration and Adversarial Robustness of Automated ASR Scoring ../papers/arxiv-25dbdb1a09a8.html https://arxiv.org/abs/2606.25487v1#2026-06-25#llm Thu, 25 Jun 2026 13:11:21 +0800 Almost every paper on LLM jailbreaks and prompt injection reports an attack-success rate (ASR), and that number is assigned not by people but by an automated judge: either a safety classifier trained for the task, or a general chat model prompted to grade. The judge is rarely checked. We check it. Using 596 human-labeled completions from the HarmBench classifier validation set, we compare the two judge families against human majority votes and then attack them. The two families fail in opposite… Evaluating LLMs on Real-World Software Performance Optimization ../papers/arxiv-28c9e56c593c.html https://arxiv.org/abs/2606.25530v1#2026-06-25#llm Thu, 25 Jun 2026 13:11:21 +0800 Software performance optimization is a notoriously complex and manual task. 
Despite the growing use of Large Language Models (LLMs) for code refinement, we still lack benchmarks that capture how optimization actually happens in real-world codebases. Existing frameworks often oversimplify the problem by focusing on isolated functions or a single performance metric, missing the critical trade-offs between execution time and memory footprint, the inherent noise of the measurement environment, and… AI-PAVE-Br: Leveraging Large Language Models for Enhanced Product Attribute Value Extraction through a Golden Set Approach ../papers/doi-d4dcf6e219ed.html https://arxiv.org/abs/2606.24655v1#2026-06-24#llm Wed, 24 Jun 2026 13:06:49 +0800 The explosive growth and complexity of product data within the dynamic Brazilian e-commerce landscape demand robust and specialized methods for structured information extraction. Traditional approaches to Product Attribute Value Extraction (PAVE) often struggle with the linguistic nuances and sheer diversity of product descriptions in Portuguese. To address this critical gap, this paper introduces two major contributions. First, we present AI-PAVEBr, a specialized system engineered with Large L… A specialized reasoning large language model for accelerating rare disease diagnosis: a randomized AI physician assistance trial ../papers/arxiv-15784a9d3bc2.html https://arxiv.org/abs/2606.24510v1#2026-06-24#llm Wed, 24 Jun 2026 13:06:49 +0800 Rare diseases affect millions of individuals worldwide, yet timely diagnosis remains a major public health challenge due to scarcity of specialized clinical expertise. While large language models (LLMs) show promise to support rare disease diagnosis, current models are constrained by insufficient clinical deployability, limited clinically grounded evidence, and scarcity of training data. 
Here we present RaDaR (Rare Disease navigatoR), an open-source, compact reasoning LLM (32B parameters) for r… EG-VQA: Benchmarking Verifiable Video Question Answering with Grounded Temporal Evidence ../papers/arxiv-a950eeb96676.html https://arxiv.org/abs/2606.24797v1#2026-06-24#llm Wed, 24 Jun 2026 13:06:49 +0800 Recent advances in Video Large Language Models (Video-LLMs) have yielded promising performance on video question answering (VideoQA). Nevertheless, existing benchmarks are predominantly evaluated through answer correctness, while the grounding of predictions in relevant video evidence remains largely unexamined. This disconnect between answer generation and evidence understanding motivates the construction of the Evidence-Grounded Video Question Answering Benchmark (EG-VQA), an open-ended evalu… Are We Ready For An Agent-Native Memory System? ../papers/arxiv-09ad880f1f66.html https://arxiv.org/abs/2606.24775v1#2026-06-24#llm Wed, 24 Jun 2026 13:06:49 +0800 Memory for large language model (LLM) agents has rapidly evolved from simple retrieval-augmented mechanisms into a data management system that supports persistent information storage, retrieval, update, consolidation, and dynamic lifecycle governance throughout agent execution. Despite this evolution, existing evaluations still benchmark agent memory mainly through end-to-end task success metrics (e.g., F1, BLEU), while treating the underlying system as a monolithic black box. As a result, crit… AdversaBench: Automated LLM Red-Teaming with Multi-Judge Confirmation and Cross-Model Transferability ../papers/arxiv-22621133f739.html https://arxiv.org/abs/2606.24589v1#2026-06-24#llm Wed, 24 Jun 2026 13:06:49 +0800 Scaling adversarial evaluation of large language models requires both a method for generating hard inputs and a reliable way to confirm that resulting failures are real. 
We present AdversaBench, an end-to-end red-teaming pipeline that mutates seed prompts with five structured operators, queries a target model, and confirms failures through a three-judge panel with a meta-judge tiebreaker. We report experiments on 45 seeds across three categories: reasoning, instruction-following, and tool use.… Same Lesson, Different Story: Cross-Lingual Reconstruction of Cultural Narratives in Large Language Models ../papers/arxiv-f1627fb5a350.html https://arxiv.org/abs/2606.24610v1#2026-06-24#llm Wed, 24 Jun 2026 13:06:49 +0800 The evaluation of cultural grounding context becomes complex when multiple cultures convey the same moral lesson. This challenge is particularly relevant to large language models (LLMs), which produce narratives across a wide range of languages and cultural contexts. However, it remains uncertain whether these models preserve culturally grounded meaning when equivalent moral lessons are conveyed through distinct cultural forms. This study introduces a multilingual evaluation narrative framework… Accuracy and Satisfaction in Multi-Turn LLM Dialogues for NFR Assessment ../papers/arxiv-ecde3387ca11.html https://arxiv.org/abs/2606.24834v1#2026-06-24#llm Wed, 24 Jun 2026 13:06:49 +0800 LLM-based dialogue assistants have become mainstream tools for software developers, yet current evaluation benchmarks focus exclusively on functional correctness. This leaves a critical gap in assessing the quality and accuracy of these conversations when handling Non-Functional Requirements (NFRs), which are inherently vague, context-dependent, and involve many parts of a program. 
Evaluating how well these systems support collaborative reasoning about NFRs requires methods that go beyond singl… ScaleToT: Generalizing Structured LLM Reasoning for Billion-Scale Low-Activity User Modeling ../papers/arxiv-b83ab398c080.html https://arxiv.org/abs/2606.24605v1#2026-06-24#llm Wed, 24 Jun 2026 13:06:49 +0800 Accurate user modeling often depends on rich interaction histories, which are unavailable for billions of low-activity users. Large Language Models (LLMs) can infer latent user states from static profiles, but this reasoning becomes unreliable when profiles are sparse, and applying an LLM to billions of users is prohibitively expensive. We present ScaleToT, which learns structured reasoning from a small LLM-processed subset and extends it to the broader low-activity user population. To improve… Scaling Laws for Task-Specific LLM Distillation ../papers/arxiv-f1ab970e444f.html https://arxiv.org/abs/2606.24747v1#2026-06-24#llm Wed, 24 Jun 2026 13:06:49 +0800 Large Language Models (LLMs) achieve strong performance across a growing range of domains, yet their scale poses deployment challenges in applications where latency and cost constraints are critical. This paper derives empirical scaling laws for domain-specific LLM compression, quantifying how in-domain and general knowledge performance scale with dataset size, compression ratio, supervision format, and iterative pruning schedule. Using quantitative finance as our application domain, we compare… ReM-MoA: Reasoning Memory Sustains Mixture-of-Agents Scaling ../papers/arxiv-945d1e05b320.html https://arxiv.org/abs/2606.24437v1#2026-06-24#llm Wed, 24 Jun 2026 13:06:49 +0800 Mixture-of-Agents (MoA) architectures improve inference-time scaling by organizing multiple LLM agents into layered reasoning pipelines. However, existing MoA variants fail to sustain gains as depth increases, exhibiting degradation, early plateauing, or saturation. 
We propose ReM-MoA, a memory-augmented MoA framework that sustains scaling through two mechanisms: (1) a Ranked Reasoning Memory that persistently stores and ranks reasoning traces from all layers using a comparative Reviewer Agent,… LLMs Prompted for Legal Context Object More: Overrefusal from Small On-Premises LLMs in Criminal Legal Context ../papers/arxiv-7d9e141e8dab.html https://arxiv.org/abs/2606.24585v1#2026-06-24#llm Wed, 24 Jun 2026 13:06:49 +0800 While the validity of LLMs' use in the legal context remains subject to ethical and legal debate, legal professionals are already experimenting with personal LLMs, if only for translation and reformulation. However, even such a seemingly innocuous use can introduce biases through case processing speed if LLM assistants selectively refuse assistance on certain topics. To better anticipate such biases, we investigate several modern small LLMs that are most likely to be used as on-device assistant… Red-Teaming the Agentic Red-Team ../papers/arxiv-a2ceabb33333.html https://arxiv.org/abs/2606.24496v1#2026-06-24#llm Wed, 24 Jun 2026 13:06:49 +0800 The use of agentic systems to perform offensive security operations has moved from a theoretical possibility to a commoditized capability. However, while the community has focused on creating more and more capable agents, less attention has been allocated to assessing the security of those systems. In this work, we present the first in-depth security analysis of the most widely used agentic systems for offensive security operations. We show that most of these tools share common design flaws tha… Securing LLM-Agent Long-Term Memory Against Poisoning: Non-Malleable, Origin-Bound Authority with Machine-Checked Guarantees ../papers/arxiv-862177e8e257.html https://arxiv.org/abs/2606.24322v1#2026-06-24#llm Wed, 24 Jun 2026 13:06:49 +0800 LLM agents increasingly rely on persistent long-term memory, which creates a critical vulnerability that we study here: memory poisoning. 
An adversary can store untrusted content in one session that later steers a consequential action, such as a payment, a setting change, or data exfiltration, in a future session. Existing defenses base a memory item's authority to act on either its content (detection or trust-scoring) or its derivation history (lineage). We show that both signals are malleable… Pigeonholing: Bad prompts hurt models to collapse and make mistakes ../papers/arxiv-112c872ebf06.html https://arxiv.org/abs/2606.24267v1#2026-06-24#llm Wed, 24 Jun 2026 13:06:49 +0800 While in-context learning is generally shown to be effective in Large Language Models (LLMs), bad contexts can cause performance degradation and mode collapse, a phenomenon we call "pigeonholing." Unintentionally bad contexts can happen without malicious jailbreaking intents: For example, a user asks the model to justify an incorrect math theorem or fails to correct the model's buggy code. Specifically, we investigate "pigeonholing" in two scenarios: (1) when the user suggests a solution,… SHERLOC: Structured Diagnostic Localization for Code Repair Agents ../papers/arxiv-b868687e026f.html https://arxiv.org/abs/2606.24820v1#2026-06-24#llm Wed, 24 Jun 2026 13:06:49 +0800 LLM agents solve repository-level coding tasks through multi-turn tool use, but utilize half their budget on locating faults before editing. Dedicated localization frameworks have emerged, yet are still evaluated as file retrieval rather than actionable diagnosis, producing locations without the diagnostic context a repair agent needs. We introduce SHERLOC (Structured Hypothesis-driven Exploration and Reasoning for Localization), a training-free framework pairing a reasoning LLM with compact re… Bayesian control for coding agents ../papers/arxiv-1bf783e8f09b.html https://arxiv.org/abs/2606.24453v1#2026-06-24#llm Wed, 24 Jun 2026 13:06:49 +0800 Modern coding agents pair LLM generators with various tools, including cheap diagnostics and expensive verifiers. 
The tool-use decisions are typically governed by orchestrators that often use fixed rules and ignore uncertainty. We formulate orchestration as cost-sensitive sequential hypothesis testing: a Bayesian controller maintains a belief over candidate correctness and dynamically decides whether to gather more evidence, refine the candidate, verify it, or stop. Across six generators and ni… LemonHarness Technical Report ../papers/arxiv-a79c559da3e4.html https://arxiv.org/abs/2606.24311v1#2026-06-24#llm Wed, 24 Jun 2026 13:06:49 +0800 As large language model (LLM) agents are applied to longer tasks, they increasingly modify workspace state across multiple rounds of iteration. However, agents typically observe only tool outputs and log fragments, while the actual state changes occur in the file system. Without explicit workspace boundaries, state-changing operations such as file writes and temporary artifact generation may scatter changes across paths. Over time, these weakly constrained changes accumulate, making states such…