large language model Topic Archive

large-language-model.html Long-term tracking RSS feed for the keyword "large language model", collecting all previously matched papers. zh-CN Sun, 28 Jun 2026 05:24:06 +0000 NuclearQAv2: A Structured Benchmark for Evaluating Domain-Science Competence in Large Language Models ../papers/arxiv-91c0ed0f09c2.html https://arxiv.org/abs/2606.27047v1#2026-06-26#large-language-model Fri, 26 Jun 2026 13:16:53 +0800 Large language models (LLMs) have demonstrated strong performance across a wide range of tasks, but ensuring their reliability in highly technical domains remains a significant challenge. In nuclear engineering, problem solving often requires not only factual knowledge but also quantitative reasoning and conceptual understanding. To address the need for systematic evaluation in this domain, we introduce NuclearQAv2, a benchmark for assessing LLMs on nuclear engineering knowledge. The benchmark… Joint Learning of Experiential Rules and Policies for Large Language Model Agents ../papers/arxiv-48c067a92ef9.html https://arxiv.org/abs/2606.27136v1#2026-06-26#large-language-model Fri, 26 Jun 2026 13:16:53 +0800 For LLM agents in multi-step interactive environments, a key challenge is to make effective use of accumulated interaction experience. Existing work has typically separated two uses of such experience: keeping it outside the model as natural-language rules for later prompting, or using trajectories and feedback to update the model parameters. The former is easy to interpret but can fall out of sync with the evolving policy; the latter improves the policy more broadly but provides only limited c… The Riddle Riddle: Testing Flexible Reasoning in Large Language Models and Humans ../papers/arxiv-852671b09eb4.html https://arxiv.org/abs/2606.27103v1#2026-06-26#large-language-model Fri, 26 Jun 2026 13:16:53 +0800 Humans flexibly adapt their reasoning strategies to the requirements of a given problem. 
Large language models (LLMs) have performed well on many cognitive tasks, however, it is unclear whether this accuracy is a result of pattern matching from training data or flexible reasoning. Here, we introduce a novel paradigm to test this question: the riddle riddle paradigm. Riddle riddles are word problems written to mimic popular riddles, but altered so their answers only require literal interpretatio… Prompt Injection in Automated Résumé Screening with Large Language Models: Single and Multi-Injection Settings ../papers/arxiv-663bd6d3e1b5.html https://arxiv.org/abs/2606.27287v1#2026-06-26#large-language-model Fri, 26 Jun 2026 13:16:53 +0800 Large language models (LLMs) are increasingly used to screen and rank job applicants, creating incentives for candidates to strategically manipulate algorithmic hiring systems. We study prompt injection in automated résumé screening, defined as subtle self-promotional text that introduces no new qualifications but is designed to influence LLM evaluations. Using controlled experiments, we show that prompt injection reliably improves applicant rankings when résumé quality is homogeneous and few c… Semantic Early-Stopping for Iterative LLM Agent Loops ../papers/arxiv-232f944cff9f.html https://arxiv.org/abs/2606.27009v1#2026-06-26#large-language-model Fri, 26 Jun 2026 13:16:53 +0800 Multi-agent large language model (LLM) loops, for example a Writer that drafts and a Critic that revises, are almost always terminated by a fixed iteration cap (max_iterations). This is a syntactic kill-switch: it is blind to whether the answer is still improving, so it over-spends tokens on easy inputs and truncates hard ones. 
We study semantic early-stopping: the loop halts when consecutive draft embeddings stop changing in meaning (cosine distance with a patience window) and the answer's mea… TOPS: First-Principles Visual Token Pruning via Constructing Token Optimal Preservation Sets for Efficient MLLM Inference ../papers/arxiv-4739852a0036.html https://arxiv.org/abs/2606.27161v1#2026-06-26#large-language-model Fri, 26 Jun 2026 13:16:53 +0800 Multimodal large language models (MLLMs) have achieved strong multimodal reasoning capabilities, but their efficiency is limited by the large number of visual tokens, which introduces substantial computational overhead. Visual token pruning offers a natural solution, yet existing methods are imperfect: attention-based criteria tend to retain redundant tokens, while diversity-based criteria are often agnostic to user instructions. Even methods that combine multiple criteria still lack a principl… RolloutPipe: Overlapping Pipelined Rollout and Training in Disaggregated On-Policy LLM Reinforcement Learning ../papers/arxiv-ceeaa87c79d1.html https://arxiv.org/abs/2606.26997v1#2026-06-26#large-language-model Fri, 26 Jun 2026 13:16:53 +0800 Large language model (LLM) post-training for reasoning increasingly relies on reinforcement learning with verifiable rewards (RLVR), where models learn from ground-truth feedback on mathematical, logical, and scientific tasks. To enable flexible resource allocation and support heterogeneous training setups, modern RLVR systems adopt disaggregated architectures that decouple rollout generation and policy training across independent GPU pools. 
However, existing synchronous on-policy GRPO (Group R… In-Context Model Predictive Generation: Open-Vocabulary Motion Synthesis from Language Models to Physics ../papers/arxiv-03dd67b86fdf.html https://arxiv.org/abs/2606.26981v1#2026-06-26#large-language-model Fri, 26 Jun 2026 13:16:53 +0800 Synthesizing human motion from textual descriptions is essential for immersive digital applications, yet existing methods face a persistent trade-off between semantic fidelity and physical realism. Large language model (LLM)-based approaches can interpret diverse open-vocabulary instructions and compose high-level action plans, but they often generate motions that violate physical constraints. Physics-aware models improve realism through simulation or control, but they struggle with semantic co… Just how sure are you? Improving Verbalized Uncertainty Calibration in Medical VQA ../papers/arxiv-5c94d679d076.html https://arxiv.org/abs/2606.27023v1#2026-06-26#large-language-model Fri, 26 Jun 2026 13:16:53 +0800 Multimodal large language models (MLLMs) applied to Medical Visual Question Answering (VQA) tend to produce overconfident outputs regardless of actual correctness, and existing verbalized confidence calibration methods, developed primarily for text only LLMs, do not account for the multimodal nature of medical image understanding. This work proposes a training based framework that finetunes MLLMs to improve their calibration using a composite loss function combining a Brier style calibration te… When are likely answers right? On Sequence Probability and Correctness in LLMs ../papers/arxiv-cb621cfe5b86.html https://arxiv.org/abs/2606.27359v1#2026-06-26#large-language-model Fri, 26 Jun 2026 13:16:53 +0800 Many decoding methods for large language models can be understood as shifting probability mass toward outputs that are more likely under the model, either locally at the token level or globally at the sequence level. 
Therefore, their success depends on a fundamental question: when does sequence probability, that is, the conditional probability of a continuation given a prompt, actually align with correctness? In this paper, we set out to quantify this relationship across decoding methods, model… InvestPhilBench: A Multi-Layer Dynamic Benchmark for Evaluating Large Language Model Procedural Reasoning in Expert Investment Philosophy ../papers/arxiv-072adfbe1cb9.html https://arxiv.org/abs/2606.25984v1#2026-06-25#large-language-model Thu, 25 Jun 2026 13:11:21 +0800 Large language models are increasingly deployed as investment research assistants, yet no benchmark tests whether they can accurately reconstruct and apply the specific procedural decision frameworks of expert investors. We introduce InvestPhilBench, a multi-layer dynamic benchmark spanning eight cognitive tiers, from principle identification (L1) to novel framework extrapolation (L8). The v0.6 release comprises 118 primary-source-verified investment principle cards, 25 decision framework cards… Same Evidence, Different Answer: Auditing Order Sensitivity in Multimodal Large Language Models ../papers/arxiv-117d92e125e9.html https://arxiv.org/abs/2606.26079v1#2026-06-25#large-language-model Thu, 25 Jun 2026 13:11:21 +0800 Standard benchmarks for multimodal large language models (MLLMs) score each item on one canonical ordering and miss whether order-irrelevant shuffling changes the answer, a baseline reliability property called for by emerging AI evaluation guidelines. We introduce Facet-Probe, a five-facet audit (option, evidence-chunk, document-rank, image-set, and mixed-modality ordering) of 18 frontier and open-weight MLLMs. 
A Bayesian item-response model separates ordering noise from per-facet bias, and a s… MedGuards: Multi-Agent System for Reliable Medical Error Detection and Correction ../papers/arxiv-924e9f45b440.html https://arxiv.org/abs/2606.25651v1#2026-06-25#large-language-model Thu, 25 Jun 2026 13:11:21 +0800 As Large Language Models (LLMs) are increasingly deployed in healthcare settings, accurate error detection and correction in generated or existing text becomes critical, as even minor mistakes can pose risks to patient safety. Existing methods for error detection and correction, including automated checks and heuristic-based approaches, do not generalize well across unseen datasets. In this paper, we propose MedGuards as a medical safety guardrail, which is a new framework that treats medical e… TriViewBench: Controlled Complexity Scaling for Multi-View Structural Reasoning in MLLMs ../papers/arxiv-18419ba4812f.html https://arxiv.org/abs/2606.26029v1#2026-06-25#large-language-model Thu, 25 Jun 2026 13:11:21 +0800 Multimodal Large Language Models (MLLMs) demonstrate strong performance on standard visual question answering benchmarks, yet their scalability under controlled structural complexity remains poorly understood. We introduce TriViewBench, a controlled three-view visual reasoning benchmark constructed from synthetic 3D scenes with explicitly parameterized object count and occlusion. The benchmark contains 1,923 scenes and over 14K Question-Answer (QA) pairs organized into four complexity levels an… Beyond Function Calling: Benchmarking Tool-Using Agents under Tool-Environment Unreliability ../papers/arxiv-dd5092a23d14.html https://arxiv.org/abs/2606.25819v1#2026-06-25#large-language-model Thu, 25 Jun 2026 13:11:21 +0800 Large language models are increasingly deployed as agents that solve tasks by interacting with external tool environments. 
Although recent tool-use benchmarks increasingly cover complex task settings, they still largely assume clean, stable, and trustworthy tool environments, leaving tool-environment unreliability insufficiently examined. We introduce ToolBench-X, a benchmark for evaluating agents under recoverable reliability hazards. ToolBench-X contains executable multi-step tasks across div… Do Encoders Suffice? A Systematic Comparison of Encoder and Decoder Safety Judges for LLM Adversarial Evaluation ../papers/arxiv-cca18893a109.html https://arxiv.org/abs/2606.25782v1#2026-06-25#large-language-model Thu, 25 Jun 2026 13:11:21 +0800 With the widespread adoption of large language models (LLMs) in chatbots and everyday applications, companies increasingly need guardrails that are effective while remaining low-cost and low-latency. Safety evaluation of LLM outputs has generally relied on LLM-based judges, which can be effective but are often slow and expensive to deploy at scale. In this paper, we evaluate whether fine-tuned modern encoder classifiers from the ModernBERT family, including ModernBERT and Ettin, can reliably id… RAS: Measuring LLM Safety Through Refusal Alignment ../papers/arxiv-27c960f270d2.html https://arxiv.org/abs/2606.25750v1#2026-06-25#large-language-model Thu, 25 Jun 2026 13:11:21 +0800 Safety evaluation of large language models (LLMs) is commonly performed by querying models with unsafe or jailbreak prompts and judging whether their outputs violate a safety policy. Although useful, output-level evaluation is expensive, sensitive to judge choice, and easily tied to fixed question banks. We propose **SafeVec**, a white-box evaluation procedure that measures safety from internal representations rather than generated answers. 
**SafeVec** first extracts layer-wise refusal directio… Explainable Control Framework (XCF) based on Fuzzy Model-Agnostic Explanation and LLM Agent-Supported Interface ../papers/arxiv-ff6028825464.html https://arxiv.org/abs/2606.25941v1#2026-06-25#large-language-model Thu, 25 Jun 2026 13:11:21 +0800 Increasing demand for precise and reliable control in complex scenarios has led to the development of increasingly sophisticated controllers, including data-driven approaches employing closed box models and mathematically rigorous yet complex designs. This complexity highlights the needs for explainable control that can provide human-understandable insights into controller behavior. In this paper, an explainable control framework (XCF) along with supporting algorithms and user interface are pro… MiniOpt: Reasoning to Model and Solve General Optimization Problems with Limited Resources ../papers/arxiv-c18831bd2d45.html https://arxiv.org/abs/2606.25832v1#2026-06-25#large-language-model Thu, 25 Jun 2026 13:11:21 +0800 Achieving strong optimization generalization across diverse optimization problems while requiring limited training resources remains a challenging problem for optimization-oriented large language models (LLMs). Existing approaches typically rely on large-scale supervised datasets, costly reasoning annotations, and expensive intermediate step verification, resulting in substantial training overhead. To address these challenges, we propose MiniOpt, a reinforcement learning framework that learns t… Evaluating LLMs on Real-World Software Performance Optimization ../papers/arxiv-28c9e56c593c.html https://arxiv.org/abs/2606.25530v1#2026-06-25#large-language-model Thu, 25 Jun 2026 13:11:21 +0800 Software performance optimization is a notoriously complex and manual task. Despite the growing use of Large Language Models (LLMs) for code refinement, we still lack benchmarks that capture how optimization actually happens in real-world codebases. 
Existing frameworks often oversimplify the problem by focusing on isolated functions or a single performance metric, missing the critical trade-offs between execution time and memory footprint, the inherent noise of the measurement environment, and… AGORA: An Archive-Grounded Benchmark for Agentic Workplace Document Reasoning ../papers/arxiv-7fb19b10d271.html https://arxiv.org/abs/2606.24526v1#2026-06-24#large-language-model Wed, 24 Jun 2026 13:06:49 +0800 Large language models are increasingly deployed as agents that reason over documents rather than answer from parametric knowledge. We study archive-grounded reasoning: locating sparse evidence across a large, messy collection of workplace files, reconciling inconsistent terminology, units, and time conventions, and computing an answer. Existing benchmarks address only parts of this setting and none jointly stresses archive-groundedness, agentic exploration, and cross-domain coverage. We introdu… AI-PAVE-Br: Leveraging Large Language Models for Enhanced Product Attribute Value Extraction through a Golden Set Approach ../papers/doi-d4dcf6e219ed.html https://arxiv.org/abs/2606.24655v1#2026-06-24#large-language-model Wed, 24 Jun 2026 13:06:49 +0800 The explosive growth and complexity of product data within the dynamic Brazilian e-commerce landscape demand robust and specialized methods for structured information extraction. Traditional approaches to Product Attribute Value Extraction (PAVE) often struggle with the linguistic nuances and sheer diversity of product descriptions in Portuguese. To address this critical gap, this paper introduces two major contributions. 
First, we present AI-PAVEBr, a specialized system engineered with Large L… A specialized reasoning large language model for accelerating rare disease diagnosis: a randomized AI physician assistance trial ../papers/arxiv-15784a9d3bc2.html https://arxiv.org/abs/2606.24510v1#2026-06-24#large-language-model Wed, 24 Jun 2026 13:06:49 +0800 Rare diseases affect millions of individuals worldwide, yet timely diagnosis remains a major public health challenge due to scarcity of specialized clinical expertise. While large language models (LLMs) show promise to support rare disease diagnosis, current models are constrained by insufficient clinical deployability, limited clinically grounded evidence, and scarcity of training data. Here we present RaDaR (Rare Disease navigatoR), an open-source, compact reasoning LLM (32B parameters) for r… EG-VQA: Benchmarking Verifiable Video Question Answering with Grounded Temporal Evidence ../papers/arxiv-a950eeb96676.html https://arxiv.org/abs/2606.24797v1#2026-06-24#large-language-model Wed, 24 Jun 2026 13:06:49 +0800 Recent advances in Video Large Language Models (Video-LLMs) have yielded promising performance on video question answering (VideoQA). Nevertheless, existing benchmarks are predominantly evaluated through answer correctness, while the grounding of predictions in relevant video evidence remains largely unexamined. This disconnect between answer generation and evidence understanding motivates the construction of the Evidence-Grounded Video Question Answering Benchmark (EG-VQA), an open-ended evalu… Are We Ready For An Agent-Native Memory System? 
../papers/arxiv-09ad880f1f66.html https://arxiv.org/abs/2606.24775v1#2026-06-24#large-language-model Wed, 24 Jun 2026 13:06:49 +0800 Memory for large language model (LLM) agents has rapidly evolved from simple retrieval-augmented mechanisms into a data management system that supports persistent information storage, retrieval, update, consolidation, and dynamic lifecycle governance throughout agent execution. Despite this evolution, existing evaluations still benchmark agent memory mainly through end-to-end task success metrics (e.g., F1, BLEU), while treating the underlying system as a monolithic black box. As a result, crit… CineCap: Structured Reasoning with Spatio-Temporal Anchors for Cinematographic Video Captioning ../papers/arxiv-521120a059b4.html https://arxiv.org/abs/2606.24636v1#2026-06-24#large-language-model Wed, 24 Jun 2026 13:06:49 +0800 Cinematographic captioning aims to describe how a video is filmed using professional film-language concepts such as camera movement, shot size, depth of field, composition, and shooting angle. This capability is important for fine-grained video understanding and controllable movie-quality video generation, yet remains underexplored in existing multimodal large language models. Unlike question-answering-based evaluation of cinematic understanding, cinematographic captioning requires a unified op… AdversaBench: Automated LLM Red-Teaming with Multi-Judge Confirmation and Cross-Model Transferability ../papers/arxiv-22621133f739.html https://arxiv.org/abs/2606.24589v1#2026-06-24#large-language-model Wed, 24 Jun 2026 13:06:49 +0800 Scaling adversarial evaluation of large language models requires both a method for generating hard inputs and a reliable way to confirm that resulting failures are real. We present AdversaBench, an end-to-end red-teaming pipeline that mutates seed prompts with five structured operators, queries a target model, and confirms failures through a three-judge panel with a meta-judge tiebreaker. 
We report experiments on 45 seeds across three categories: reasoning, instruction-following, and tool use.… Privacy-Preserving RAG via Multi-Agent Semantic Rewriting: Achieving Confidentiality Without Compromising Contextual Fidelity ../papers/arxiv-80e1786313ab.html https://arxiv.org/abs/2606.24623v1#2026-06-24#large-language-model Wed, 24 Jun 2026 13:06:49 +0800 Retrieval-Augmented Generation enhances large language models by incorporating external knowledge, but deploying it in sensitive scenarios risks privacy leakage via malicious prompts. To address this, we propose a multi-agent framework that sanitizes retrieved content through semantic rewriting. By employing three specialized agents for privacy extraction, semantic analysis, and reconstruction, our approach collaboratively removes sensitive identifiers while preserving the semantic core. We eva… Same Lesson, Different Story: Cross-Lingual Reconstruction of Cultural Narratives in Large Language Models ../papers/arxiv-f1627fb5a350.html https://arxiv.org/abs/2606.24610v1#2026-06-24#large-language-model Wed, 24 Jun 2026 13:06:49 +0800 The evaluation of cultural grounding context becomes complex when multiple cultures convey the same moral lesson. This challenge is particularly relevant to large language models (LLMs), which produce narratives across a wide range of languages and cultural contexts. However, it remains uncertain whether these models preserve culturally grounded meaning when equivalent moral lessons are conveyed through distinct cultural forms. This study introduces a multilingual evaluation narrative framework… ScaleToT: Generalizing Structured LLM Reasoning for Billion-Scale Low-Activity User Modeling ../papers/arxiv-b83ab398c080.html https://arxiv.org/abs/2606.24605v1#2026-06-24#large-language-model Wed, 24 Jun 2026 13:06:49 +0800 Accurate user modeling often depends on rich interaction histories, which are unavailable for billions of low-activity users. 
Large Language Models (LLMs) can infer latent user states from static profiles, but this reasoning becomes unreliable when profiles are sparse, and applying an LLM to billions of users is prohibitively expensive. We present ScaleToT, which learns structured reasoning from a small LLM-processed subset and extends it to the broader low-activity user population. To improve… Scaling Laws for Task-Specific LLM Distillation ../papers/arxiv-f1ab970e444f.html https://arxiv.org/abs/2606.24747v1#2026-06-24#large-language-model Wed, 24 Jun 2026 13:06:49 +0800 Large Language Models (LLMs) achieve strong performance across a growing range of domains, yet their scale poses deployment challenges in applications where latency and cost constraints are critical. This paper derives empirical scaling laws for domain-specific LLM compression, quantifying how in-domain and general knowledge performance scale with dataset size, compression ratio, supervision format, and iterative pruning schedule. Using quantitative finance as our application domain, we compare… Pigeonholing: Bad prompts hurt models to collapse and make mistakes ../papers/arxiv-112c872ebf06.html https://arxiv.org/abs/2606.24267v1#2026-06-24#large-language-model Wed, 24 Jun 2026 13:06:49 +0800 While in-context learning is generally shown to be effective in Large Language Models (LLMs), bad contexts can cause performance degradation and mode collapse, a phenomenon we call "pigeonholing." **Unintentionally bad** contexts can happen without malicious jailbreaking intents: For example, a user asks the model to justify an incorrect math theorem or fails to correct the model's buggy code. 
Specifically, we investigate "pigeonholing" in two scenarios: (1) when the user suggests a solution,… LemonHarness Technical Report ../papers/arxiv-a79c559da3e4.html https://arxiv.org/abs/2606.24311v1#2026-06-24#large-language-model Wed, 24 Jun 2026 13:06:49 +0800 As large language model (LLM) agents are applied to longer tasks, they increasingly modify workspace state across multiple rounds of iteration. However, agents typically observe only tool outputs and log fragments, while the actual state changes occur in the file system. Without explicit workspace boundaries, state-changing operations such as file writes and temporary artifact generation may scatter changes across paths. Over time, these weakly constrained changes accumulate, making states such… AIR: Adaptive Interleaved Reasoning with Code in MLLMs ../papers/arxiv-3b598225b45f.html https://arxiv.org/abs/2606.23678v1#2026-06-23#large-language-model Tue, 23 Jun 2026 13:10:02 +0800 Following the paradigm shift initiated by OpenAI o3, interleaved reasoning with code to enhance multimodal large language models (MLLMs) has become a pivotal research frontier. The existing literature focuses primarily on tool-use within vision-perception tasks. However, such approaches typically rely on predefined heuristics for visual manipulation and are inherently incapable of addressing numerical computation problems due to their exclusive focus on visual operations. This paper empowers ML… TriggerBench: Investigating Prospective Memory for Large Language Models ../papers/arxiv-6e7f90d682c8.html https://arxiv.org/abs/2606.23459v1#2026-06-23#large-language-model Tue, 23 Jun 2026 13:10:02 +0800 While Large Language Models (LLMs) are increasingly deployed in long interactions, existing evaluations focus predominantly on retrospective memory (RM) via explicit queries. Prospective memory (PM), the critical ability to spontaneously recall and act on latent constraints without direct prompts, remains largely unevaluated. 
We introduce TriggerBench, a comprehensive PM benchmark spanning five dimensions across both daily assistants and professional workflows. TriggerBench pairs scenarios with… Can LLMs Reliably Self-Report Adversarial Prefills, and How? ../papers/arxiv-1052b107e838.html https://arxiv.org/abs/2606.23671v1#2026-06-23#large-language-model Tue, 23 Jun 2026 13:10:02 +0800 Prior work shows that large language models (LLMs) exhibit introspective capability on benign tasks. We extend the question to safety contexts and examine how reliably a model can recognize that its own prior response was elicited by an adversarial prefill attack. Across ten open-weight instruction-tuned LLMs (3B to 70B) and four safety benchmarks, no model reliably recognizes its own compromised outputs, with models claiming intent on prefilled responses at an average rate of 27.3%. Introsp… POTracker: Optimizing Large Language Models for Standard-Compliant Power Outage Report Generation ../papers/arxiv-6142e52062f6.html https://arxiv.org/abs/2606.23533v1#2026-06-23#large-language-model Tue, 23 Jun 2026 13:10:02 +0800 Recent large language models (LLMs) are good at general text generation, but it is still hard to use them for domain-specific data generation because the output must follow strict formatting and structural rules. Unlike open-ended tasks such as question answering or translation, domain-specific generation must be both semantically correct and compliant with existing guidelines and standards. In this work, we study the nationwide interoperability problem of utility power outage reports in the Un… Randomized YaRN Improves Length Generalization for Long-Context Reasoning ../papers/arxiv-165e566523c5.html https://arxiv.org/abs/2606.23687v1#2026-06-23#large-language-model Tue, 23 Jun 2026 13:10:02 +0800 Large language models (LLMs) are typically pretrained on short sequences and then extended to work on longer sequences with additional training. 
However, such LLMs still struggle to further generalize to very long sequences. We propose Randomized YaRN, a training method that improves length generalization by combining YaRN-based positional extrapolation with randomized positional encoding and a length curriculum. During training on short context data, tokens are assigned YaRN positional encodin… Teaching LLMs String Matching, Backtracking, and Error Recovery to Deduce Bases and Truth Tables for the Combinatorially Exploding Bit Manipulation Puzzles ../papers/arxiv-cfea83337d7a.html https://arxiv.org/abs/2606.23672v1#2026-06-23#large-language-model Tue, 23 Jun 2026 13:10:02 +0800 This paper presents our algorithmic innovations for the NVIDIA Nemotron Model Reasoning Challenge, focusing on Bit Manipulation Puzzles. In this task, the objective is to discover a hidden logical rule transforming input binary strings to outputs, then apply it to unseen inputs. Large Language Models (LLMs) notoriously struggle here; traditional methods force them to simulate complex boolean logic and arithmetic, leading to hallucinations. Furthermore, the search space of bitwise operations (co… Abstract representational geometry supports inference in large language models ../papers/arxiv-b1d30293937d.html https://arxiv.org/abs/2606.23345v1#2026-06-23#large-language-model Tue, 23 Jun 2026 13:10:02 +0800 A defining feature of human intelligence is the ability to adapt to changing environments by inferring latent task structure from sparse observations. Neuroscientific research indicates that this capability relies on the hippocampus constructing abstract representations, expressed as low-dimensional, approximately orthogonal manifolds in neural state space. 
However, the internal mechanisms of large language models (LLMs) remain largely opaque, making it unclear whether they form comparable abst… Distribution-Aware Diffusion-LLM for Robust Ultra-Long-Term Time Series Forecasting ../papers/arxiv-f5552ea6c703.html https://arxiv.org/abs/2606.23391v1#2026-06-23#large-language-model Tue, 23 Jun 2026 13:10:02 +0800 Time series forecasting is a fundamental machine learning task. Recent work has explored Large Language Models (LLMs) for this purpose due to their strong generalization, pattern recognition, and zero-shot or few-shot capabilities. Despite their suitability for long-context learning, LLMs face challenges in multimodal settings: they lack calibrated probabilistic modeling for non-text data and struggle to align heterogeneous representations. To address these issues, we propose a new framework Di… SVD-Surgeon: Optimal Singular-Value Surgery for Large Language Model Compression ../papers/arxiv-30f938520c9e.html https://arxiv.org/abs/2606.23568v1#2026-06-23#large-language-model Tue, 23 Jun 2026 13:10:02 +0800 Large language models (LLMs) achieve remarkable performance across a wide range of tasks, but their deployment is constrained by substantial memory and compute requirements. Low-rank compression via singular value decomposition (SVD) is an effective remedy, but existing methods focus on how to factorize and which components to keep. We introduce SVD-Surgeon, a training-free method that brings the Optimal Brain Surgeon (OBS) framework to the singular-value basis. Treating each singular value as… On the Limits of Prompt-Conditioned Language Models as General-Purpose Learners ../papers/arxiv-49b6fac506a0.html https://arxiv.org/abs/2606.23668v1#2026-06-23#large-language-model Tue, 23 Jun 2026 13:10:02 +0800 Large Language Models (LLMs) are frequently portrayed as general-purpose solvers capable of solving arbitrary tasks. 
We argue that this view overlooks a fundamental constraint: language is a compressed and capacity-limited interface for conveying task information. Modelling User-System interaction as a bilevel cheap-talk game, we analyse how latent tasks are encoded into prompts and reinterpreted under alignment and safety constraints. We introduce a conceptual decomposition separating… GIF: Locally Sound Geometric Information Flow Control for LLMs ../papers/arxiv-db5b699a6c55.html https://arxiv.org/abs/2606.23277v1#2026-06-23#large-language-model Tue, 23 Jun 2026 13:10:02 +0800 Large language models increasingly mediate interactions between sensitive data, untrusted inputs, and privileged actions in agentic systems, creating security and privacy risks. These range from prompt injections that manipulate downstream tool use to leakage of confidential information through model outputs. Recent Information Flow Control (IFC)-based defenses show promise but lack a principled semantic foundation for reasoning about information flow through the model itself. Since any input t… QMFOL: Benchmarking Large Language Model Reasoning via Quantifiable Monadic First-Order Logic Test Case Generation ../papers/arxiv-40a83630bd98.html https://arxiv.org/abs/2606.20227v1#2026-06-19#large-language-model Fri, 19 Jun 2026 14:26:15 +0800 Large Language Models (LLMs) have made significant progress in reasoning, particularly in deductive reasoning, which is crucial for high-stakes decision-making. As models improve, evaluation benchmarks should evolve to keep pace. However, existing benchmarks lack fine-grained control over logical complexity and struggle to balance semantic diversity with logical consistency. 
To address these issues, we propose QMFOL, an automated framework for generating monadic first-order logic reasoning task… LLM agent safety, multi-turn red-teaming, jailbreak benchmarks, adversarial robustness, safety-critical systems ../papers/arxiv-a473df34c717.html https://arxiv.org/abs/2606.20408v1#2026-06-19#large-language-model Fri, 19 Jun 2026 14:26:15 +0800 Large language model (LLM) agents are increasingly proposed as supervisory components for safety-critical systems, yet their robustness under sustained, adaptive adversarial pressure remains poorly characterized. We present NRT-Bench, a benchmark for multi-turn red-teaming of LLM agents acting as operators of a safety-critical system, instantiated in a simulated nuclear power plant control room. A five-role operator team, each backed by a configurable LLM, runs a plant governed by six critical… Navigating Unreliable Parametric and Contextual Knowledge: Explicit Knowledge Conflict Resolution for LLM Inference ../papers/arxiv-1b4902e41aec.html https://arxiv.org/abs/2606.20245v1#2026-06-19#large-language-model Fri, 19 Jun 2026 14:26:15 +0800 Large language models (LLMs) have achieved strong performance across a wide range of language-based tasks by leveraging both extensive parametric knowledge and in-context learning ability, enabling them to incorporate external information provided in the input prompt. However, the integration of external knowledge can introduce conflicts, not only between the model's internal parametric knowledge and the external information, but also among multiple pieces of external contexts. Existing approac… Contagion Networks: Evaluator Bias Propagation in Multi-Agent LLM Systems ../papers/arxiv-07b0ffc5775c.html https://arxiv.org/abs/2606.20493v1#2026-06-19#large-language-model Fri, 19 Jun 2026 14:26:15 +0800 When large language models serve as evaluators in multi-agent systems, their systematic evaluation biases propagate through the agent network. 
We introduce Contagion Networks, a formal framework for measuring how evaluator biases spread across interacting LLM agents. In a controlled 3-agent experiment using DeepSeek-chat with three distinct evaluator bias profiles (structured, balanced, evidence-based), we measure the Cross-Agent Contagion Matrix Gamma_3 and find that evaluator biases consisten… Your Mouse and Eyes Secretly Leak Your Preference: LLM Alignment using Implicit Feedback from Users ../papers/arxiv-87d47984a1d0.html https://arxiv.org/abs/2606.20482v1#2026-06-19#large-language-model Fri, 19 Jun 2026 14:26:15 +0800 To align a Large Language Model (LLM), most existing methods collect explicit human feedback and train a reward model to predict the human preference based on the response text. These existing methods have two key limitations. First, the users rarely provide explicit feedback for LLM responses, which makes the high-quality preference annotation expensive to collect. Second, the methods do not leverage implicit human feedback, which has proven vital to the economic moats of Internet giants. To q… AutoPass: Evidence-Guided LLM Agents for Compiler Performance Tuning ../papers/arxiv-f82503f71de6.html https://arxiv.org/abs/2606.20373v1#2026-06-19#large-language-model Fri, 19 Jun 2026 14:26:15 +0800 Large Language Models (LLMs) show promise for code compilation tasks, but applying them to runtime performance tuning is difficult due to complex microarchitectural effects and noisy runtime measurements. We present AutoPass, a multi-agent framework for compiler performance tuning that uses compiler and runtime evidence to guide LLM-generated optimization decisions. Rather than treating the compiler as a black box like prior auto-tuning schemes, AutoPass opens up the compiler to the LLM, enabli…