{
  "generated_at": "2026-06-26T13:16:53.645581+08:00",
  "timezone": "Asia/Shanghai",
  "lookback_hours": 24,
  "sorting": {
    "default_sort_by": "hybrid",
    "summary": "hybrid (relevance first, published_at tie-break)",
    "weights": {
      "title_match_weight": 40,
      "summary_match_weight": 18,
      "doi_weight": 12,
      "pdf_weight": 8,
      "rich_summary_weight": 6,
      "metadata_weight": 4,
      "multi_source_weight": 10,
      "freshness_weight_cap": 24
    },
    "feeds": [
      {
        "name": "LM",
        "sort_by": "hybrid"
      },
      {
        "name": "Agent Runtime Security",
        "sort_by": "hybrid"
      },
      {
        "name": "Terminal and SWE Agents",
        "sort_by": "hybrid"
      }
    ]
  },
  "highlights": [
    "Topic \"LLM\": 20 papers matched across LM, Agent Runtime Security, and other feeds; representative papers include \"NuclearQAv2: A Structured Benchmark for Evaluating Domain-Science Competence in Large Language Models\" and \"Joint Learning of Experiential Rules and Policies for Large Language Model Agents\".",
    "Topic \"Agent\": 13 papers matched across LM, Agent Runtime Security, and other feeds; representative papers include \"Joint Learning of Experiential Rules and Policies for Large Language Model Agents\" and \"Semantic Early-Stopping for Iterative LLM Agent Loops\".",
    "Topic \"Benchmark\": 12 papers matched across LM, Agent Runtime Security, and other feeds; representative papers include \"NuclearQAv2: A Structured Benchmark for Evaluating Domain-Science Competence in Large Language Models\" and \"TOPS: First-Principles Visual Token Pruning via Constructing Token Optimal Preservation Sets for Efficient MLLM Inference\".",
    "Topic \"Language Model\": 3 papers matched in the LM feed; representative papers include \"The Riddle Riddle: Testing Flexible Reasoning in Large Language Models and Humans\" and \"Prompt Injection in Automated Résumé Screening with Large Language Models: Single and Multi-Injection Settings\".",
    "Topic \"Coding Agent\": 2 papers matched in the Terminal and SWE Agents feed; representative papers include \"Mostly Automatic Translation of Language Interpreters from C to Safe Rust\" and \"The Spec Growth Engine: Spec-Anchored, Code-Coupled, Drift-Enforced Architecture for AI-Assisted Software Development\"."
  ],
  "focus_items": [],
  "action_items": [],
  "topic_sections": [
    {
      "name": "LLM",
      "paper_count": 20,
      "feed_names": [
        "LM",
        "Agent Runtime Security",
        "Terminal and SWE Agents"
      ],
      "paper_titles": [
        "NuclearQAv2: A Structured Benchmark for Evaluating Domain-Science Competence in Large Language Models",
        "Joint Learning of Experiential Rules and Policies for Large Language Model Agents",
        "The Riddle Riddle: Testing Flexible Reasoning in Large Language Models and Humans",
        "Prompt Injection in Automated Résumé Screening with Large Language Models: Single and Multi-Injection Settings",
        "Semantic Early-Stopping for Iterative LLM Agent Loops",
        "TOPS: First-Principles Visual Token Pruning via Constructing Token Optimal Preservation Sets for Efficient MLLM Inference",
        "When Does Combining Language Models Help? A Co-Failure Ceiling on Routing, Voting, and Mixture-of-Agents Across 67 Frontier Models",
        "Improving General Role-Playing Agents via Psychology-Grounded Reasoning and Role-Aware Policy Optimization",
        "RolloutPipe: Overlapping Pipelined Rollout and Training in Disaggregated On-Policy LLM Reinforcement Learning",
        "In-Context Model Predictive Generation: Open-Vocabulary Motion Synthesis from Language Models to Physics",
        "OpenRCA 2.0: From Outcome Labels to Causal Process Supervision",
        "Just how sure are you? Improving Verbalized Uncertainty Calibration in Medical VQA",
        "Ask, Don't Judge: Binary Questions for Interpretable LLM Evaluation and Self-Improvement",
        "When are likely answers right? On Sequence Probability and Correctness in LLMs",
        "Jailbreaking for the Average Jane: Choosing Optimal Jailbreaks via Bandit Algorithms for Automatically Enhanced Queries",
        "Smaller Models, Unexpected Costs: Trade-offs in LLM Quantization for Automated Program Repair",
        "To Run or Not to Run: Analyzing the Cost-Effectiveness of Code Execution in LLM-Based Program Repair",
        "How Much Static Structure Do Code Agents Need? A Study of Deterministic Anchoring",
        "A Deterministic Control Plane for LLM Coding Agents",
        "NOVA: A Verification-Aware Agent Harness for Architecture Evolution in Industrial Recommender Systems"
      ],
      "key_points": [
        "\"NuclearQAv2: A Structured Benchmark for Evaluating Domain-Science Competence in Large Language Models\" [Evaluation / Data / Method]: Large language models (LLMs) have demonstrated strong performance across a wide range of tasks, but ensuring their reliability in highly technical domains rema…",
        "\"Joint Learning of Experiential Rules and Policies for Large Language Model Agents\" [Method]: For LLM agents in multi-step interactive environments, a key challenge is to make effective use of accumulated interaction experience. Existing work has typica…"
      ]
    },
    {
      "name": "Agent",
      "paper_count": 13,
      "feed_names": [
        "LM",
        "Agent Runtime Security",
        "Terminal and SWE Agents"
      ],
      "paper_titles": [
        "Joint Learning of Experiential Rules and Policies for Large Language Model Agents",
        "Semantic Early-Stopping for Iterative LLM Agent Loops",
        "When Does Combining Language Models Help? A Co-Failure Ceiling on Routing, Voting, and Mixture-of-Agents Across 67 Frontier Models",
        "Improving General Role-Playing Agents via Psychology-Grounded Reasoning and Role-Aware Policy Optimization",
        "OpenRCA 2.0: From Outcome Labels to Causal Process Supervision",
        "AgentX: Towards Agent-Driven Self-Iteration of Industrial Recommender Systems",
        "MIRROR: Novelty-Constrained Memory-Guided MCTS Red-Teaming for Agentic RAG",
        "To Run or Not to Run: Analyzing the Cost-Effectiveness of Code Execution in LLM-Based Program Repair",
        "How Much Static Structure Do Code Agents Need? A Study of Deterministic Anchoring",
        "A Deterministic Control Plane for LLM Coding Agents",
        "NOVA: A Verification-Aware Agent Harness for Architecture Evolution in Industrial Recommender Systems",
        "Mostly Automatic Translation of Language Interpreters from C to Safe Rust",
        "The Spec Growth Engine: Spec-Anchored, Code-Coupled, Drift-Enforced Architecture for AI-Assisted Software Development"
      ],
      "key_points": [
        "\"Joint Learning of Experiential Rules and Policies for Large Language Model Agents\" [Method]: For LLM agents in multi-step interactive environments, a key challenge is to make effective use of accumulated interaction experience. Existing work has typica…",
        "\"Semantic Early-Stopping for Iterative LLM Agent Loops\" [Evaluation / Method]: Multi-agent large language model (LLM) loops, for example a Writer that drafts and a Critic that revises, are almost always terminated by a fixed iteration cap…"
      ]
    },
    {
      "name": "Benchmark",
      "paper_count": 12,
      "feed_names": [
        "LM",
        "Agent Runtime Security",
        "Terminal and SWE Agents"
      ],
      "paper_titles": [
        "NuclearQAv2: A Structured Benchmark for Evaluating Domain-Science Competence in Large Language Models",
        "TOPS: First-Principles Visual Token Pruning via Constructing Token Optimal Preservation Sets for Efficient MLLM Inference",
        "HarmVideoBench: Benchmarking Harmful Video Understanding in Large Multimodal Models",
        "RolloutPipe: Overlapping Pipelined Rollout and Training in Disaggregated On-Policy LLM Reinforcement Learning",
        "In-Context Model Predictive Generation: Open-Vocabulary Motion Synthesis from Language Models to Physics",
        "Just how sure are you? Improving Verbalized Uncertainty Calibration in Medical VQA",
        "Ask, Don't Judge: Binary Questions for Interpretable LLM Evaluation and Self-Improvement",
        "When are likely answers right? On Sequence Probability and Correctness in LLMs",
        "Jailbreaking for the Average Jane: Choosing Optimal Jailbreaks via Bandit Algorithms for Automatically Enhanced Queries",
        "Do Safety Guardrails Need to Reason? LeanGuard: A Fast and Light Approach for Robust Moderation",
        "MIRROR: Novelty-Constrained Memory-Guided MCTS Red-Teaming for Agentic RAG",
        "Smaller Models, Unexpected Costs: Trade-offs in LLM Quantization for Automated Program Repair"
      ],
      "key_points": [
        "\"NuclearQAv2: A Structured Benchmark for Evaluating Domain-Science Competence in Large Language Models\" [Evaluation / Data / Method]: Large language models (LLMs) have demonstrated strong performance across a wide range of tasks, but ensuring their reliability in highly technical domains rema…",
        "\"TOPS: First-Principles Visual Token Pruning via Constructing Token Optimal Preservation Sets for Efficient MLLM Inference\" [Evaluation / Method]: Multimodal large language models (MLLMs) have achieved strong multimodal reasoning capabilities, but their efficiency is limited by the large number of visual…"
      ]
    },
    {
      "name": "Language Model",
      "paper_count": 3,
      "feed_names": [
        "LM"
      ],
      "paper_titles": [
        "The Riddle Riddle: Testing Flexible Reasoning in Large Language Models and Humans",
        "Prompt Injection in Automated Résumé Screening with Large Language Models: Single and Multi-Injection Settings",
        "HarmVideoBench: Benchmarking Harmful Video Understanding in Large Multimodal Models"
      ],
      "key_points": [
        "\"The Riddle Riddle: Testing Flexible Reasoning in Large Language Models and Humans\" [Method]: Humans flexibly adapt their reasoning strategies to the requirements of a given problem. Large language models (LLMs) have performed well on many cognitive tas…",
        "\"Prompt Injection in Automated Résumé Screening with Large Language Models: Single and Multi-Injection Settings\" [Evaluation / Method]: Large language models (LLMs) are increasingly used to screen and rank job applicants, creating incentives for candidates to strategically manipulate algorithmi…"
      ]
    },
    {
      "name": "Coding Agent",
      "paper_count": 2,
      "feed_names": [
        "Terminal and SWE Agents"
      ],
      "paper_titles": [
        "Mostly Automatic Translation of Language Interpreters from C to Safe Rust",
        "The Spec Growth Engine: Spec-Anchored, Code-Coupled, Drift-Enforced Architecture for AI-Assisted Software Development"
      ],
      "key_points": [
        "\"Mostly Automatic Translation of Language Interpreters from C to Safe Rust\" [Application / Method]: Translating C programs to safe Rust is challenging owing to significant differences in typing constraints, ownership, and borrowing rules. Interpreter programs…",
        "\"The Spec Growth Engine: Spec-Anchored, Code-Coupled, Drift-Enforced Architecture for AI-Assisted Software Development\" [Method]: AI coding agents dramatically accelerate implementation speed but introduce two structural failure modes that existing spec-driven approaches do not fully solv…"
      ]
    }
  ],
  "template": "zh_daily_brief",
  "feeds": [
    {
      "name": "LM",
      "key_points": [
        "\"NuclearQAv2: A Structured Benchmark for Evaluating Domain-Science Competence in Large Language Models\" [Evaluation / Data / Method]: Large language models (LLMs) have demonstrated strong performance across a wide range of tasks, but ensuring their reliability in highly technical domains rema…",
        "\"Joint Learning of Experiential Rules and Policies for Large Language Model Agents\" [Method]: For LLM agents in multi-step interactive environments, a key challenge is to make effective use of accumulated interaction experience. Existing work has typica…",
        "\"The Riddle Riddle: Testing Flexible Reasoning in Large Language Models and Humans\" [Method]: Humans flexibly adapt their reasoning strategies to the requirements of a given problem. Large language models (LLMs) have performed well on many cognitive tas…",
        "\"Prompt Injection in Automated Résumé Screening with Large Language Models: Single and Multi-Injection Settings\" [Evaluation / Method]: Large language models (LLMs) are increasingly used to screen and rank job applicants, creating incentives for candidates to strategically manipulate algorithmi…",
        "\"Semantic Early-Stopping for Iterative LLM Agent Loops\" [Evaluation / Method]: Multi-agent large language model (LLM) loops, for example a Writer that drafts and a Critic that revises, are almost always terminated by a fixed iteration cap…"
      ],
      "sort_by": "hybrid",
      "papers": [
        {
          "title": "NuclearQAv2: A Structured Benchmark for Evaluating Domain-Science Competence in Large Language Models",
          "summary": "Large language models (LLMs) have demonstrated strong performance across a wide range of tasks, but ensuring their reliability in highly technical domains remains a significant challenge. In nuclear engineering, problem solving often requires not only factual knowledge but also quantitative reasoning and conceptual understanding. To address the need for systematic evaluation in this domain, we introduce NuclearQAv2, a benchmark for assessing LLMs on nuclear engineering knowledge. The benchmark comprises approximately 1,240 question-answer pairs spanning three categories: boolean, numeric, and verbal. NuclearQAv2 is constructed using a hybrid pipeline that combines expert-authored questions, existing datasets, and LLM-assisted generation from domain-specific technical corpora. By leveraging structured prompting for both automated question generation and response evaluation, the proposed framework enables scalable benchmark construction and evaluation. We evaluate a diverse set of LLMs using NuclearQAv2 and observe substantial performance differences across task types. While the models generally perform well on factual questions, quantitative reasoning and conceptual understanding remain considerably more challenging. These results highlight the importance of multi-faceted evaluation frameworks and establish NuclearQAv2 as a scalable benchmark for assessing LLM capabilities in technical domains.",
          "authors": [
            "Henry Shaowu Yuchi",
            "Michal Kucer",
            "Benjamin H. Sims",
            "Selma Peterson",
            "Emily Taylor"
          ],
          "categories": [
            "cs.CL",
            "cs.AI"
          ],
          "paper_id": "http://arxiv.org/abs/2606.27047v1",
          "abstract_url": "https://arxiv.org/abs/2606.27047v1",
          "pdf_url": "https://arxiv.org/pdf/2606.27047v1",
          "published_at": "2026-06-25T13:52:16+00:00",
          "updated_at": "2026-06-25T13:52:16+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "Evaluation",
            "Data",
            "Method"
          ],
          "topics": [
            "LLM",
            "Benchmark"
          ],
          "doi": null,
          "arxiv_id": "2606.27047",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2606.27047v1"
          },
          "relevance_score": 218,
          "match_reasons": [
            "title matched \"language model\"",
            "title matched \"large language model\"",
            "title matched \"benchmark\"",
            "summary matched \"LLM\"",
            "summary matched \"reasoning\"",
            "summary matched \"RAG\"",
            "summary matched \"evaluation\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2606.27047"
        },
        {
          "title": "Joint Learning of Experiential Rules and Policies for Large Language Model Agents",
          "summary": "For LLM agents in multi-step interactive environments, a key challenge is to make effective use of accumulated interaction experience. Existing work has typically separated two uses of such experience: keeping it outside the model as natural-language rules for later prompting, or using trajectories and feedback to update the model parameters. The former is easy to interpret but can fall out of sync with the evolving policy; the latter improves the policy more broadly but provides only limited correction for local mistakes in sparse-reward settings. We present Joint Learning of Experiential Rules and Policies for LLM Agents (JERP), which updates a long-term experiential-rule pool and the policy from the same interaction trajectories. At decision time, JERP retrieves task-relevant rules and conditions the agent on them together with the interaction history. After each episode, it uses the collected trajectories both to optimize the policy and to revise the rule pool by comparing current rollouts with reference successful trajectories. This coupling keeps the rule pool aligned with the evolving policy while allowing stable and effective behaviors to be gradually absorbed into the model itself. Experiments on AlfWorld and WebShop show that JERP yields consistent gains in decision performance for complex interactive tasks.",
          "authors": [
            "Shicheng Ye",
            "Chao Yu"
          ],
          "categories": [
            "cs.AI"
          ],
          "paper_id": "http://arxiv.org/abs/2606.27136v1",
          "abstract_url": "https://arxiv.org/abs/2606.27136v1",
          "pdf_url": "https://arxiv.org/pdf/2606.27136v1",
          "published_at": "2026-06-25T15:11:02+00:00",
          "updated_at": "2026-06-25T15:11:02+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "Method"
          ],
          "topics": [
            "LLM",
            "Agent"
          ],
          "doi": null,
          "arxiv_id": "2606.27136",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2606.27136v1"
          },
          "relevance_score": 165,
          "match_reasons": [
            "title matched \"language model\"",
            "title matched \"large language model\"",
            "title matched \"agent\"",
            "summary matched \"LLM\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2606.27136"
        },
        {
          "title": "The Riddle Riddle: Testing Flexible Reasoning in Large Language Models and Humans",
          "summary": "Humans flexibly adapt their reasoning strategies to the requirements of a given problem. Large language models (LLMs) have performed well on many cognitive tasks, however, it is unclear whether this accuracy is a result of pattern matching from training data or flexible reasoning. Here, we introduce a novel paradigm to test this question: the riddle riddle paradigm. Riddle riddles are word problems written to mimic popular riddles, but altered so their answers only require literal interpretations. Identifying correct answers requires looking past the structure of each question and flexibly apply different reasoning strategies based on the content. If LLMs respond to surface features, such as form, a riddle-like structure should cause models to use an inventive reasoning strategy even when a literal interpretation suffices. Alternatively, if LLMs reason based on content, they should flexibly switch strategies when appropriate. Across two experiments with nine state-of-the-art LLMs and 100 human participants, we show humans and LLMs fail on this paradigm in opposite directions. LLMs were far more accurate on genuine riddles than on riddle riddles (84.9% vs. 50.7%); whereas humans showed the reverse effect (50.5% vs. 80.5%). Error analysis shows that 90.8% of LLM errors on riddle riddles (the condition where they show diminished performance) were due to inappropriate use of inventive reasoning while only 57.6% of human errors on genuine riddles were due to overextending literal reasoning. Thus, while both groups make mistakes, reasoning mistakes are made more often by LLMs than by humans. Overall, LLMs' strong performance on genuine riddles may reflect memory retrieval rather than flexible strategy selection, and without stimuli designed to elicit this contrast, it becomes easy to conflate LLM-generated outputs that look like reasoning with genuine reasoning.",
          "authors": [
            "Bella Fascendini",
            "Kathryn McGregor",
            "Max D. Gupta",
            "Thomas L. Griffiths"
          ],
          "categories": [
            "cs.CL"
          ],
          "paper_id": "http://arxiv.org/abs/2606.27103v1",
          "abstract_url": "https://arxiv.org/abs/2606.27103v1",
          "pdf_url": "https://arxiv.org/pdf/2606.27103v1",
          "published_at": "2026-06-25T14:41:12+00:00",
          "updated_at": "2026-06-25T14:41:12+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "Method"
          ],
          "topics": [
            "LLM",
            "Language Model"
          ],
          "doi": null,
          "arxiv_id": "2606.27103",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2606.27103v1"
          },
          "relevance_score": 165,
          "match_reasons": [
            "title matched \"language model\"",
            "title matched \"large language model\"",
            "title matched \"reasoning\"",
            "summary matched \"LLM\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2606.27103"
        },
        {
          "title": "Prompt Injection in Automated Résumé Screening with Large Language Models: Single and Multi-Injection Settings",
          "summary": "Large language models (LLMs) are increasingly used to screen and rank job applicants, creating incentives for candidates to strategically manipulate algorithmic hiring systems. We study prompt injection in automated résumé screening, defined as subtle self-promotional text that introduces no new qualifications but is designed to influence LLM evaluations. Using controlled experiments, we show that prompt injection reliably improves applicant rankings when résumé quality is homogeneous and few candidates inject. However, its effectiveness rapidly diminishes as more candidates inject, collapsing when manipulation becomes widespread. When candidate quality is heterogeneous, prompt injection is less effective on average, but can occasionally allow lower-quality candidates to outrank higher-quality ones, raising fairness concerns. Overall, LLM-based screening is most vulnerable when manipulation is rare and candidate quality differences are small. Code and resources are publicly available at: https://github.com/preetb1199/Prompt_Injection_ACL26",
          "authors": [
            "Preet Baxi",
            "Jiannan Xu",
            "Jane Yi Jiang",
            "Stefanus Jasin"
          ],
          "categories": [
            "cs.AI"
          ],
          "paper_id": "http://arxiv.org/abs/2606.27287v1",
          "abstract_url": "https://arxiv.org/abs/2606.27287v1",
          "pdf_url": "https://arxiv.org/pdf/2606.27287v1",
          "published_at": "2026-06-25T17:04:51+00:00",
          "updated_at": "2026-06-25T17:04:51+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "Evaluation",
            "Method"
          ],
          "topics": [
            "LLM",
            "Language Model"
          ],
          "doi": null,
          "arxiv_id": "2606.27287",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2606.27287v1"
          },
          "relevance_score": 163,
          "match_reasons": [
            "title matched \"language model\"",
            "title matched \"large language model\"",
            "summary matched \"LLM\"",
            "summary matched \"RAG\"",
            "summary matched \"evaluation\"",
            "has PDF",
            "has rich summary",
            "has complete metadata",
            "title matched \"prompt injection\""
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2606.27287"
        },
        {
          "title": "Semantic Early-Stopping for Iterative LLM Agent Loops",
          "summary": "Multi-agent large language model (LLM) loops, for example a Writer that drafts and a Critic that revises, are almost always terminated by a fixed iteration cap (max_iterations). This is a syntactic kill-switch: it is blind to whether the answer is still improving, so it over-spends tokens on easy inputs and truncates hard ones. We study semantic early-stopping: the loop halts when consecutive draft embeddings stop changing in meaning (cosine distance with a patience window) and the answer's measured quality stops improving. Our work makes three contributions. First, an honest theoretical footing: we prove deterministic termination and well-definedness and machine-check these claims, while treating the convergence of the distance sequence as an empirically tested conjecture rather than a (previously over-claimed) Banach contraction. Second, a judge-efficient evaluation protocol: we generate each question's full trajectory once, replay every stopping policy over the identical drafts, and cache every LLM-judge call, yielding a strictly paired efficiency-versus-quality comparison at low cost; we further separate operational tokens (charged to a policy) from evaluation tokens (a measurement instrument). Third, an empirical study on multi-hop retrieval-augmented question answering (HotpotQA). On the 60-question test split, a judge-free semantic stopper reduces operational tokens by 38% relative to max_iterations at parity quality (Delta-IS = -0.004, p = 0.81), whereas the full quality-gated variant is counter-productive because its per-round judging dominates cost. An oracle that selects the best round attains +0.115 Information Score over every practical policy (p ~ 4e-11), reframing the problem from \"when to stop\" (easy) to \"which round is best\" (open).",
          "authors": [
            "Sahil Shrivastava"
          ],
          "categories": [
            "cs.AI",
            "cs.LG",
            "cs.MA"
          ],
          "paper_id": "http://arxiv.org/abs/2606.27009v1",
          "abstract_url": "https://arxiv.org/abs/2606.27009v1",
          "pdf_url": "https://arxiv.org/pdf/2606.27009v1",
          "published_at": "2026-06-25T13:24:21+00:00",
          "updated_at": "2026-06-25T13:24:21+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "Evaluation",
            "Method"
          ],
          "topics": [
            "LLM",
            "Agent"
          ],
          "doi": null,
          "arxiv_id": "2606.27009",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2606.27009v1"
          },
          "relevance_score": 160,
          "match_reasons": [
            "title matched \"LLM\"",
            "title matched \"agent\"",
            "summary matched \"language model\"",
            "summary matched \"large language model\"",
            "summary matched \"evaluation\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2606.27009"
        },
        {
          "title": "TOPS: First-Principles Visual Token Pruning via Constructing Token Optimal Preservation Sets for Efficient MLLM Inference",
          "summary": "Multimodal large language models (MLLMs) have achieved strong multimodal reasoning capabilities, but their efficiency is limited by the large number of visual tokens, which introduces substantial computational overhead. Visual token pruning offers a natural solution, yet existing methods are imperfect: attention-based criteria tend to retain redundant tokens, while diversity-based criteria are often agnostic to user instructions. Even methods that combine multiple criteria still lack a principled formulation of the intrinsic objective of token pruning. In this paper, we revisit visual token pruning from a first-principles perspective and formulate it as constructing Token Optimal Preservation Sets. Through a top-down information-theoretic analysis, we identify three fundamental principles for effective token selection: Task Relevance, Information Coverage, and Semantic Diversity. Based on these principles, we propose TOPS, a training-free and model-agnostic pruning module that can be applied to various MLLMs. Extensive experiments on 7 MLLM backbones and 14 benchmarks demonstrate that TOPS outperforms prior methods under diverse pruning settings. Notably, on LLaVA-NeXT, TOPS removes 77.8% of visual tokens while preserving 100.0% and 100.6% performance on its 7B and 13B models, respectively, suggesting that pruning redundant visual tokens can sometimes mitigate hallucination and inspire future lightweight MLLM design.",
          "authors": [
            "Tinghao Wang",
            "Yichen Guo",
            "Rui Huang",
            "Zheng Lu",
            "Qizhe Zhang",
            "Chenxi Li",
            "Yuan Zhang",
            "Jiajun Cao",
            "Zhirong Shen",
            "Yaosong Du",
            "Guangyan Gan",
            "Wenya Wang",
            "Lin William Cong",
            "Shanghang Zhang"
          ],
          "categories": [
            "cs.AI"
          ],
          "paper_id": "http://arxiv.org/abs/2606.27161v1",
          "abstract_url": "https://arxiv.org/abs/2606.27161v1",
          "pdf_url": "https://arxiv.org/pdf/2606.27161v1",
          "published_at": "2026-06-25T15:29:37+00:00",
          "updated_at": "2026-06-25T15:29:37+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "Evaluation",
            "Method"
          ],
          "topics": [
            "LLM",
            "Benchmark"
          ],
          "doi": null,
          "arxiv_id": "2606.27161",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2606.27161v1"
          },
          "relevance_score": 158,
          "match_reasons": [
            "title matched \"LLM\"",
            "summary matched \"language model\"",
            "summary matched \"large language model\"",
            "summary matched \"reasoning\"",
            "summary matched \"RAG\"",
            "summary matched \"benchmark\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2606.27161"
        },
        {
          "title": "When Does Combining Language Models Help? A Co-Failure Ceiling on Routing, Voting, and Mixture-of-Agents Across 67 Frontier Models",
          "summary": "Multi-model LLM systems such as routing, voting, cascades, fusion, and mixture-of-agents are used to beat single-model accuracy. We show that their gain is capped by a quantity the field rarely reports. For any policy whose output is one member model answer, accuracy cannot exceed one minus beta, where beta is the rate at which every model is wrong on the same query. In contrast, the usual diagnostic, average pairwise error correlation rho, cannot identify beta: error laws with identical marginals and pairwise correlations can have different all-wrong rates. A Clopper-Pearson bound on beta gives a finite-sample certificate on the largest gain any router, vote, or cascade could deliver before training a router. Across 67 models from 21 providers, a tetrachoric-calibrated single-factor model still underprices the all-wrong tail: on open-ended mathematics, observed beta is 0.052 versus 0.023 under the full 67-model Gaussian copula, about 2.5 times underpricing, with 90 percent CI 1.7 to 3.4 and k equals 17. The effect recurs on execution-graded code, where beta is 0.079. Re-asking the same GPQA-Diamond questions in free-response rather than multiple-choice form reopens the tail, with beta 0.127 and a five-judge panel with kappa 0.73 to 0.92, locating co-failure in answer format rather than subject. At matched quality, low-rho heterogeneous ensembles beat high-rho Self-MoA, but on checkable tasks in our pool, combining models rarely beats the single best model without a strong query-level routing signal. Gains come from models failing on different questions, not from adding more models.",
          "authors": [
            "Josef Chen"
          ],
          "categories": [
            "cs.AI",
            "cs.LG"
          ],
          "paper_id": "http://arxiv.org/abs/2606.27288v1",
          "abstract_url": "https://arxiv.org/abs/2606.27288v1",
          "pdf_url": "https://arxiv.org/pdf/2606.27288v1",
          "published_at": "2026-06-25T17:06:06+00:00",
          "updated_at": "2026-06-25T17:06:06+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "方法"
          ],
          "topics": [
            "LLM",
            "Agent"
          ],
          "doi": null,
          "arxiv_id": "2606.27288",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2606.27288v1"
          },
          "relevance_score": 145,
          "match_reasons": [
            "title matched \"language model\"",
            "title matched \"agent\"",
            "summary matched \"LLM\"",
            "summary matched \"RAG\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2606.27288"
        },
        {
          "title": "Improving General Role-Playing Agents via Psychology-Grounded Reasoning and Role-Aware Policy Optimization",
          "summary": "Building general-purpose role-playing agents that faithfully portray any character from a natural-language profile remains challenging. The dominant paradigm -- supervised fine-tuning -- encourages behavioral mimicry without deep, human-like internal thought processes, resulting in poor out-of-distribution generalization. Therefore, we propose Psy-CoT, a psychology-grounded chain-of-thought framework that decomposes pre-response reasoning into three role-specific steps -- Interaction Perception, Psychological Empathy, and Logical Construction -- so that the model thinks dynamically from the profile rather than merely mimicking surface patterns. While structured reasoning provides a foundation, it alone is insufficient; reinforcement learning is essential to further align the model with character fidelity. However, we observe that under LLM-based reward models, both generic phrases that hack the reward model and genuinely role-specific phrases receive identical gradient signals -- this hacking accumulates over training, misleading the model into treating both as equally optimal choices. To address this, we propose Role-Aware Policy Optimization (RAPO), which uses profile-token mutual information to weight gradients asymmetrically -- amplifying role-specific tokens under positive advantage while attenuating them under negative advantage. Experiments on CoSER, CharacterBench, and CharacterEval demonstrate that Psy-CoT outperforms existing role-playing CoT methods, and RAPO consistently surpasses GRPO across multiple model scales.",
          "authors": [
            "Zhenhua Xu",
            "Dongsheng Chen",
            "Jian Li",
            "Yitong Lin",
            "Zhebo Wang",
            "Jiafu Wu",
            "Yizhang Jin",
            "Chengjie Wang",
            "Meng Han",
            "Yabiao Wang"
          ],
          "categories": [
            "cs.CL"
          ],
          "paper_id": "http://arxiv.org/abs/2606.27025v1",
          "abstract_url": "https://arxiv.org/abs/2606.27025v1",
          "pdf_url": "https://arxiv.org/pdf/2606.27025v1",
          "published_at": "2026-06-25T13:34:47+00:00",
          "updated_at": "2026-06-25T13:34:47+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "方法"
          ],
          "topics": [
            "LLM",
            "Agent"
          ],
          "doi": null,
          "arxiv_id": "2606.27025",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2606.27025v1"
          },
          "relevance_score": 142,
          "match_reasons": [
            "title matched \"reasoning\"",
            "title matched \"agent\"",
            "summary matched \"LLM\"",
            "summary matched \"RAG\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2606.27025"
        },
        {
          "title": "HarmVideoBench: Benchmarking Harmful Video Understanding in Large Multimodal Models",
          "summary": "Large vision-language models (LVLMs) have recently shown immense potential in automated content moderation, sparking growing interest in developing harmful-video benchmarks. However, we identify two primary limitations in existing works: 1) The multi-layered characteristics of harmful videos are overlooked. Existing benchmarks predominantly formulate evaluation as a binary classification task, failing to capture implicit or deep contextual harms. 2) Explanatory rationales are completely absent. Current frameworks measure exclusively whether a model flags a video correctly rather than explaining why, turning evaluation into a black box where models can succeed through superficial shortcuts. To address these problems, we present HarmVideoBench, a multi-layered diagnostic benchmark comprising 1,379 videos paired with 4,137 multiple-choice questions. HarmVideoBench benchmarks three hierarchical dimensions: Observable Evidence, Clip-Internal Meaning, and Beyond-Clip Reasoning, aiming to evaluate models' deep understanding beyond surface cues with carefully balanced and curated samples. We evaluate 19 leading models on HarmVideoBench to assess their multidimensional understanding of harmful videos. Moreover, we introduce BCR, a benchmark-aligned method that predicts reasoning boundaries and dynamically retrieves context only when needed. Experimental results show that BCR substantially improves the base model's performance in harmful video understanding, raising the macro average from 61.7 percent to a state-of-the-art 84.4 percent.",
          "authors": [
            "Jiajun Wu",
            "Haoyu Kang",
            "Yining Sun",
            "Jiacheng Hou",
            "Heng Zhang",
            "Danyang Zhang",
            "Zhenjun Zhao",
            "Haochi Zhang",
            "Leixin Sun",
            "Eric Hanchen Jiang",
            "Yushan Li",
            "Ruiyu Li",
            "Mengkai Huang",
            "Yan Gao",
            "Xu Zhang",
            "Guancheng Wan"
          ],
          "categories": [
            "cs.CV",
            "cs.CL"
          ],
          "paper_id": "http://arxiv.org/abs/2606.27187v1",
          "abstract_url": "https://arxiv.org/abs/2606.27187v1",
          "pdf_url": "https://arxiv.org/pdf/2606.27187v1",
          "published_at": "2026-06-25T15:50:33+00:00",
          "updated_at": "2026-06-25T15:50:33+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "应用",
            "方法"
          ],
          "topics": [
            "Benchmark",
            "Language Model"
          ],
          "doi": null,
          "arxiv_id": "2606.27187",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2606.27187v1"
          },
          "relevance_score": 140,
          "match_reasons": [
            "title matched \"benchmark\"",
            "summary matched \"language model\"",
            "summary matched \"reasoning\"",
            "summary matched \"RAG\"",
            "summary matched \"evaluation\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2606.27187"
        },
        {
          "title": "RolloutPipe: Overlapping Pipelined Rollout and Training in Disaggregated On-Policy LLM Reinforcement Learning",
          "summary": "Large language model (LLM) post-training for reasoning increasingly relies on reinforcement learning with verifiable rewards (RLVR), where models learn from ground-truth feedback on mathematical, logical, and scientific tasks. To enable flexible resource allocation and support heterogeneous training setups, modern RLVR systems adopt disaggregated architectures that decouple rollout generation and policy training across independent GPU pools. However, existing synchronous on-policy GRPO (Group Relative Policy Optimization) RLVR systems finish an entire rollout before starting training, leaving the trainer GPU pool idle while rollout is still ongoing. Asynchronous RL pipelines overlap the two stages, but at the cost of training on stale data. To address these challenges, we propose RolloutPipe, a post-training framework for disaggregated RLVR systems, which turns the fixed-weight rollout into a complete-group pipeline where trainable groups move to the trainer while later groups are still being generated. RolloutPipe achieves this through two techniques including complete-group pipelining (CGP) and frontier-group dispatch (FGD). CGP dispatches each trainable complete group to the trainer FIFO as soon as group materialization finishes, and FGD is an admission policy on the Rollout node that first admits requests for the frontier groups needed to form the next training batch, so that trainer-ready groups arrive earlier and more steadily. The design starts training before the rollout completes while maintaining on-policy correctness. Evaluated on Qwen3-1.7B across four reasoning and science benchmarks and twelve rollout settings, RolloutPipe shortens the rollout-to-train-end time by 30.7%-42.3%, and lowers the trainer waiting ratio by 37%-76% compared to Slime, a state-of-the-art rollout and training system.",
          "authors": [
            "Rongjian Chen",
            "Jianmin Hu",
            "Kejiang Ye",
            "Minxian Xu"
          ],
          "categories": [
            "cs.DC",
            "cs.LG"
          ],
          "paper_id": "http://arxiv.org/abs/2606.26997v1",
          "abstract_url": "https://arxiv.org/abs/2606.26997v1",
          "pdf_url": "https://arxiv.org/pdf/2606.26997v1",
          "published_at": "2026-06-25T13:14:14+00:00",
          "updated_at": "2026-06-25T13:14:14+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "方法"
          ],
          "topics": [
            "LLM",
            "Benchmark"
          ],
          "doi": null,
          "arxiv_id": "2606.26997",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2606.26997v1"
          },
          "relevance_score": 137,
          "match_reasons": [
            "title matched \"LLM\"",
            "summary matched \"language model\"",
            "summary matched \"large language model\"",
            "summary matched \"reasoning\"",
            "summary matched \"benchmark\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2606.26997"
        },
        {
          "title": "In-Context Model Predictive Generation: Open-Vocabulary Motion Synthesis from Language Models to Physics",
          "summary": "Synthesizing human motion from textual descriptions is essential for immersive digital applications, yet existing methods face a persistent trade-off between semantic fidelity and physical realism. Large language model (LLM)-based approaches can interpret diverse open-vocabulary instructions and compose high-level action plans, but they often generate motions that violate physical constraints. Physics-aware models improve realism through simulation or control, but they struggle with semantic complexity, fine-grained instructions, and novel concepts. To address this gap, we propose In-Context Model Predictive Generation (ICMPG), a framework that integrates language-model planning with inference-time physical feedback. ICMPG reformulates motion synthesis as a Model Predictive Control (MPC)-like process with two modules. The Context-Aware Motion Generation (CAMG) module uses an LLM as a planner to decompose textual commands and generate candidate motion sequences from motion tokens. The Model Predictive Generation (MPG) module evaluates these candidates through physical simulation and semantic alignment, estimates a composite reward, and selects the best sequence to guide subsequent generation steps. Unlike open-loop generation, this closed-loop refinement enables ICMPG to adapt motions to both the input semantics and the simulated physical environment without task-specific policy retraining. Extensive experiments across standard and zero-shot open-vocabulary settings show that ICMPG generalizes robustly to diverse commands and produces motions that are more physically plausible and semantically faithful than representative baselines on the evaluated benchmarks. The framework bridges semantic interpretation and physical simulation while remaining flexible enough to incorporate different LLM backbones, enabling more versatile and controllable text-driven motion synthesis.",
          "authors": [
            "Xiaomeng Fu",
            "Junfan Lin",
            "Yang Liu",
            "Yaowei Wang",
            "Guanbin Li",
            "Liang Lin",
            "Ziliang Chen"
          ],
          "categories": [
            "cs.RO",
            "cs.AI"
          ],
          "paper_id": "http://arxiv.org/abs/2606.26981v1",
          "abstract_url": "https://arxiv.org/abs/2606.26981v1",
          "pdf_url": "https://arxiv.org/pdf/2606.26981v1",
          "published_at": "2026-06-25T12:50:33+00:00",
          "updated_at": "2026-06-25T12:50:33+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "应用",
            "方法"
          ],
          "topics": [
            "LLM",
            "Benchmark"
          ],
          "doi": null,
          "arxiv_id": "2606.26981",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2606.26981v1"
          },
          "relevance_score": 137,
          "match_reasons": [
            "title matched \"language model\"",
            "summary matched \"large language model\"",
            "summary matched \"LLM\"",
            "summary matched \"alignment\"",
            "summary matched \"benchmark\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2606.26981"
        },
        {
          "title": "OpenRCA 2.0: From Outcome Labels to Causal Process Supervision",
          "summary": "Root cause analysis (RCA) poses a holistic test of LLM agentic capabilities, such as long-context understanding, multi-step reasoning, and tool use. However, existing datasets suffer from a fundamental gap: they label only the root cause, not the propagation path connecting it to the observed symptom, which largely simplifies the task to naive pattern matching. To support rigorous evaluation, we introduce PAVE, a step-wise labeling protocol that leverages known interventions from fault injection to reconstruct causal propagation paths. The mechanism is forward verification: reasoning from cause to effect rather than inferring backward from symptoms. Applying PAVE yields OpenRCA 2.0 (500 instances), the first cross-system RCA benchmark with step-wise causal annotations for LLM agents. Across 11 frontier LLMs, recovering the exact root-cause set succeeds in only 20.7% of cases on average. To locate where this difficulty lies, we relax the criterion and find what we call the ungrounded diagnosis: agents identify at least one correct root-cause service in 76.0% of cases, but ground that service in a verified causal propagation path to the observed symptom in only 61.5%. Outcome-only evaluation hides this failure mode; step-wise causal ground truth is the missing piece for trustworthy LLM-based RCA agents.",
          "authors": [
            "Aoyang Fang",
            "Yifan Yang",
            "Jin'ao Shang",
            "Qisheng Lu",
            "Junjielung Xu",
            "Rui Wang",
            "Songhan Zhang",
            "Yuzhong Zhang",
            "Boxi Yu",
            "Pinjia He"
          ],
          "categories": [
            "cs.AI"
          ],
          "paper_id": "http://arxiv.org/abs/2606.27154v1",
          "abstract_url": "https://arxiv.org/abs/2606.27154v1",
          "pdf_url": "https://arxiv.org/pdf/2606.27154v1",
          "published_at": "2026-06-25T15:24:23+00:00",
          "updated_at": "2026-06-25T15:24:23+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "数据",
            "方法"
          ],
          "topics": [
            "LLM",
            "Agent"
          ],
          "doi": null,
          "arxiv_id": "2606.27154",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2606.27154v1"
          },
          "relevance_score": 136,
          "match_reasons": [
            "summary matched \"LLM\"",
            "summary matched \"reasoning\"",
            "summary matched \"agent\"",
            "summary matched \"RAG\"",
            "summary matched \"benchmark\"",
            "summary matched \"evaluation\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2606.27154"
        },
        {
          "title": "Just how sure are you? Improving Verbalized Uncertainty Calibration in Medical VQA",
          "summary": "Multimodal large language models (MLLMs) applied to Medical Visual Question Answering (VQA) tend to produce overconfident outputs regardless of actual correctness, and existing verbalized confidence calibration methods, developed primarily for text-only LLMs, do not account for the multimodal nature of medical image understanding. This work proposes a training-based framework that finetunes MLLMs to improve their calibration using a composite loss function combining a Brier-style calibration term, an anchor regularizer that prevents confidence collapse toward extreme values, a contrastive image-text alignment term, and a KL-based model stabilization term. The alignment signal is derived from a 2×2 factorial perturbation design that crosses image presence with text integrity, probing the reliance of the model on visual modality input versus language priors. Finally, a top-K KL divergence regularizer is used to protect the answering ability of the model during finetuning. Across three Medical VQA benchmarks and two architectures (MedGemma 4B IT and Qwen2 VL 7B Instruct), our method reduces calibration error by 60% or more, and improves discrimination by 26% or more, while preserving predictive accuracy. On average across benchmarks, the technique outperforms prompting-based, sampling-based, and training-based approaches, and ablation experiments confirm that each component of the loss function is necessary for improving the calibration. All code for the experiments is publicly available.",
          "authors": [
            "Eren Senoglu",
            "Federico Toschi",
            "Nicolo Brunello",
            "Andrea Sassella",
            "Mark James Carman"
          ],
          "categories": [
            "cs.LG",
            "cs.CL",
            "cs.CV"
          ],
          "paper_id": "http://arxiv.org/abs/2606.27023v1",
          "abstract_url": "https://arxiv.org/abs/2606.27023v1",
          "pdf_url": "https://arxiv.org/pdf/2606.27023v1",
          "published_at": "2026-06-25T13:33:58+00:00",
          "updated_at": "2026-06-25T13:33:58+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "应用",
            "方法"
          ],
          "topics": [
            "LLM",
            "Benchmark"
          ],
          "doi": null,
          "arxiv_id": "2606.27023",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2606.27023v1"
          },
          "relevance_score": 134,
          "match_reasons": [
            "summary matched \"language model\"",
            "summary matched \"large language model\"",
            "summary matched \"LLM\"",
            "summary matched \"alignment\"",
            "summary matched \"RAG\"",
            "summary matched \"benchmark\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2606.27023"
        },
        {
          "title": "Ask, Don't Judge: Binary Questions for Interpretable LLM Evaluation and Self-Improvement",
          "summary": "Evaluating LLM outputs remains a major bottleneck in NLP: human evaluation is expensive and slow, lexical metrics correlate poorly with human judgments on open-ended generation, and holistic LLM judges often produce opaque scores that are hard to debug. We propose BINEVAL, a framework that decomposes evaluation criteria into atomic binary questions and aggregates the resulting verdicts into interpretable, multi-dimensional scores. Given a task prompt, a meta-prompt generates fine-grained evaluation questions, and an LLM answers them independently for each output, yielding transparent question-level feedback together with calibrated overall scores. This decomposition makes evaluation easier to inspect, easier to diagnose, and directly usable for prompt improvement. Across SummEval, Topical-Chat, and QAGS, BINEVAL matches or outperforms strong baselines including UniEval and G-Eval, with especially strong results on factual consistency benchmarks such as QAGS. Beyond competitive correlation with human judgments, BINEVAL better matches human score distributions and avoids the ceiling effects common in prior LLM judges, leading to better discrimination between borderline and clearly flawed outputs. We further show that the same question-level feedback supports iterative prompt optimization, improving evaluator prompts on summarization and generation prompts on IFBench under both self-update and cross-model update settings. Overall, BINEVAL provides a task-agnostic, training-free, and interpretable evaluation framework that combines strong empirical performance with practical diagnostic and optimization value.",
          "authors": [
            "Sangwoo Cho",
            "Kushal Chawla",
            "Pengshan Cai",
            "Zefang Liu",
            "Chenyang Zhu",
            "Shi-Xiong Zhang",
            "Sambit Sahu"
          ],
          "categories": [
            "cs.AI",
            "cs.CL"
          ],
          "paper_id": "http://arxiv.org/abs/2606.27226v1",
          "abstract_url": "https://arxiv.org/abs/2606.27226v1",
          "pdf_url": "https://arxiv.org/pdf/2606.27226v1",
          "published_at": "2026-06-25T16:14:50+00:00",
          "updated_at": "2026-06-25T16:14:50+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "方法"
          ],
          "topics": [
            "LLM",
            "Benchmark"
          ],
          "doi": null,
          "arxiv_id": "2606.27226",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2606.27226v1"
          },
          "relevance_score": 126,
          "match_reasons": [
            "title matched \"LLM\"",
            "title matched \"evaluation\"",
            "summary matched \"benchmark\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2606.27226"
        },
        {
          "title": "When are likely answers right? On Sequence Probability and Correctness in LLMs",
          "summary": "Many decoding methods for large language models can be understood as shifting probability mass toward outputs that are more likely under the model, either locally at the token level or globally at the sequence level. Therefore, their success depends on a fundamental question: when does sequence probability, that is, the conditional probability of a continuation given a prompt, actually align with correctness? In this paper, we set out to quantify this relationship across decoding methods, models, and benchmarks at four levels: across decoding methods, across hyperparameters within a method, across prompt-answer pairs within a dataset, and across repeated responses to the same prompt. We find that higher sequence probability is often predictive of correctness across prompt-answer pairs within a fixed dataset. However, this relationship does not generally transfer to decoding decisions: increasing sequence probability by changing hyperparameters or methods does not reliably improve accuracy. Further, sequence probability is not a good indicator of correctness for responses to the same prompt. These findings clarify when decoding can and cannot be expected to improve correctness, and provide practical guidance for decoding, self-consistency, and verifier-free self-improvement.",
          "authors": [
            "Johannes Zenn",
            "Jonas Geiping"
          ],
          "categories": [
            "stat.ML",
            "cs.LG"
          ],
          "paper_id": "http://arxiv.org/abs/2606.27359v1",
          "abstract_url": "https://arxiv.org/abs/2606.27359v1",
          "pdf_url": "https://arxiv.org/pdf/2606.27359v1",
          "published_at": "2026-06-25T17:58:02+00:00",
          "updated_at": "2026-06-25T17:58:02+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "数据",
            "方法"
          ],
          "topics": [
            "LLM",
            "Benchmark"
          ],
          "doi": null,
          "arxiv_id": "2606.27359",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2606.27359v1"
          },
          "relevance_score": 124,
          "match_reasons": [
            "title matched \"LLM\"",
            "summary matched \"language model\"",
            "summary matched \"large language model\"",
            "summary matched \"benchmark\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2606.27359"
        }
      ]
    },
    {
      "name": "Agent Runtime Security",
      "key_points": [
        "《Jailbreaking for the Average Jane: Choosing Optimal Jailbreaks via Bandit Algorithms for Automatically Enhanced Queries》〔评测 / 应用 / 方法〕：With a profusion of jailbreaks for LLMs now widely known, a growing concern is that non-expert malicious actors (\"the average Jane\") could elicit actionable re…",
        "《Do Safety Guardrails Need to Reason? LeanGuard: A Fast and Light Approach for Robust Moderation》〔评测 / 数据 / 应用 / 方法〕：In order to screen a prompt or a response, the recent guardrail methods generate a chain-of-thought (CoT) before they issue a verdict. This design follows a co…",
        "《AgentX: Towards Agent-Driven Self-Iteration of Industrial Recommender Systems》〔评测 / 应用 / 方法〕：Recommendation algorithm iteration is moving from an artisanal, engineer-bound process toward an industrialized research loop, but this transition remains bloc…",
        "《MIRROR: Novelty-Constrained Memory-Guided MCTS Red-Teaming for Agentic RAG》〔评测 / 应用 / 方法〕：Multimodal agentic retrieval-augmented generation (RAG) systems expand the attack surface beyond prompt injection to include text poisoning, image injection, d…"
      ],
      "sort_by": "hybrid",
      "papers": [
        {
          "title": "Jailbreaking for the Average Jane: Choosing Optimal Jailbreaks via Bandit Algorithms for Automatically Enhanced Queries",
          "summary": "With a profusion of jailbreaks for LLMs now widely known, a growing concern is that non-expert malicious actors (\"the average Jane\") could elicit actionable responses to malicious requests. In this work, we examine whether this concern is justified. A non-expert malicious actor requires two ingredients for a successful attack: a powerful jailbreak for their target model, acting on an effective malicious query. For the former, we propose a novel attack strategy based on the multi-armed bandit framework. This allows efficient online learning of the optimal jailbreak from a large choice set via noisy exploration on a small number of queries, with subsequent application of the learnt policy on an exploitation set. For the latter, we curate $\\mathrm{FrankensteinBench}$, a safety benchmark of $11,279$ malicious queries drawn from manual curation over $7$ existing benchmarks, along with automated enhancement and generation. Each query is categorized as simple or complex by the technical expertise required to craft it. Our findings confirm the concern. Our bandit-based attack achieves success rates as high as $97\\%$ on average over $15$ SoTA open-weight LLMs. Moreover, adding complexity to queries raises the attack success rate by up to $26\\%$ on average across models -- making it an effective, automatable prompting strategy.",
          "authors": [
            "Prarabdh Shukla",
            "Ritik",
            "Suhas Rao",
            "Arpit Agarwal",
            "Arjun Bhagoji"
          ],
          "categories": [
            "cs.CR",
            "cs.CL",
            "cs.LG"
          ],
          "paper_id": "http://arxiv.org/abs/2606.26936v1",
          "abstract_url": "https://arxiv.org/abs/2606.26936v1",
          "pdf_url": "https://arxiv.org/pdf/2606.26936v1",
          "published_at": "2026-06-25T12:11:28+00:00",
          "updated_at": "2026-06-25T12:11:28+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "应用",
            "方法"
          ],
          "topics": [
            "LLM",
            "Benchmark"
          ],
          "doi": null,
          "arxiv_id": "2606.26936",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2606.26936v1"
          },
          "relevance_score": 64,
          "match_reasons": [
            "title matched \"jailbreak\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2606.26936"
        },
        {
          "title": "Do Safety Guardrails Need to Reason? LeanGuard: A Fast and Light Approach for Robust Moderation",
          "summary": "In order to screen a prompt or a response, the recent guardrail methods generate a chain-of-thought (CoT) before they issue a verdict. This design follows a common belief that step-by-step reasoning improves a decision. However, CoT also makes the guard heavy and slow, because the model must generate many tokens before it decides. This may not match how guardrails are actually deployed. A guardrail sometimes should not be heavy and slow, and it often runs on-device, for example on an embodied robot. In this paper, we pose a question whether a safety guardrail really needs to reason. To answer this question, we train a lightweight bidirectional encoder and a reasoning guard on the same corpus, and we then remove only the reasoning while we keep everything else fixed. With this controlled same-base comparison, we show that the chain does not improve moderation accuracy. We name the resulting guard LeanGuard. A 395M label-only encoder reaches an average F1 of 82.90 $\\pm$ 0.26 over public benchmarks. It matches a reasoning guard that is built on a much larger decoder, while it uses only a single forward pass over an input of at most 512 tokens. This is about a ~100x reduction in inference compute. We further show that this label-only encoder stays robust under training-label noise and retains far more recall at a strict false-positive rate than the reasoning guard, so a heavier reasoning guard is not the more robust choice either. Our finding suggests that the current guardrail benchmarks may not be hard enough to reward reasoning, and that the necessity of CoT for moderation is still not proven. We release all source codes and models including LeanGuard at https://github.com/ndb796/LeanGuard.",
          "authors": [
            "Dongbin Na"
          ],
          "categories": [
            "cs.AI",
            "cs.CL"
          ],
          "paper_id": "http://arxiv.org/abs/2606.26686v1",
          "abstract_url": "https://arxiv.org/abs/2606.26686v1",
          "pdf_url": "https://arxiv.org/pdf/2606.26686v1",
          "published_at": "2026-06-25T07:15:33+00:00",
          "updated_at": "2026-06-25T07:15:33+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "数据",
            "应用",
            "方法"
          ],
          "topics": [
            "Benchmark",
            "RAG"
          ],
          "doi": null,
          "arxiv_id": "2606.26686",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2606.26686v1"
          },
          "relevance_score": 59,
          "match_reasons": [
            "title matched \"guardrail\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2606.26686"
        },
        {
          "title": "AgentX: Towards Agent-Driven Self-Iteration of Industrial Recommender Systems",
          "summary": "Recommendation algorithm iteration is moving from an artisanal, engineer-bound process toward an industrialized research loop, but this transition remains blocked by a structural execution bottleneck: the idea-to-launch cycle still depends on human engineers to generate hypotheses, modify production code, launch A/B experiments, and attribute online results. Innovation therefore scales linearly with headcount rather than compounding with evidence, compute, and accumulated experimental knowledge. We present AgentX, a production-deployed multi-agent system that fundamentally restructures this production function. AgentX operates as a self-evolving development engine: it autonomously generates, implements, evaluates, and learns from recommendation experiments at a scale and pace that no manual workflow can sustain. The system orchestrates four tightly coupled stages in a closed loop. A Brainstorm Agent synthesizes evidence from historical experiments, system architecture, data analysis, and external research into ranked, executable proposals. A Developing Agent translates each proposal into production-ready code through repository-grounded generation and multi-dimensional reliability verification. An Evaluation Agent conducts safe online rollout with guardrail-vetoed A/B judgment, converting both successes and failures into structured knowledge assets. A Harness Evolution layer (SGPO) then distills execution trajectories into semantic-gradient updates that continuously sharpen the agents themselves -- making the system not merely automated, but self-improving.",
          "authors": [
            "Changxin Lao",
            "Fei Pan",
            "Guozhuang Ma",
            "Han Li",
            "Huihuang Lin",
            "Jijun Shi",
            "Kangzhi Zhao",
            "Kun Gai",
            "Mo Zhou",
            "Qinqin Zhou",
            "Quan Chen",
            "Ruochen Yang",
            "Shifu Bie",
            "Shuang Yang",
            "Shuo Yang",
            "Wenhao Li",
            "Wentao Xie",
            "Xiao Lv",
            "Xuming Wang",
            "Yijun Wang",
            "Yiming Chen",
            "Yusheng Huang",
            "Zhongyuan Wang",
            "Zibo Zhao",
            "Zijie Zhuang",
            "Baoning Xia",
            "Chao Liu",
            "Chaoyi Ma",
            "Chubo He",
            "Dawei Cong",
            "Feng Jiang",
            "Gang Wang",
            "Guilin Xia",
            "Hanwen Xu",
            "Jiahong Xie",
            "Jiahui Qiao",
            "Jian Liang",
            "Jiangfan Yue",
            "Jing Wang",
            "Jinghan Yang",
            "Jinghui Jia",
            "Kan Qin",
            "Lei Wang",
            "Ming Li",
            "Peilin Song",
            "Pengbo Xu",
            "Qiang Luo",
            "Ruiming Tang",
            "Shiyang Liu",
            "Shuxian Jin",
            "Tao Wang",
            "Tao Zhang",
            "Xiang Gao",
            "Xianghan Li",
            "Yingsong Luo",
            "Yiwen Ning",
            "Yongcheng Liu",
            "Yuan Guo",
            "Zhaojie Liu",
            "Zhenkai Cui"
          ],
          "categories": [
            "cs.AI",
            "cs.CL",
            "cs.IR"
          ],
          "paper_id": "http://arxiv.org/abs/2606.26859v1",
          "abstract_url": "https://arxiv.org/abs/2606.26859v1",
          "pdf_url": "https://arxiv.org/pdf/2606.26859v1",
          "published_at": "2026-06-25T10:42:28+00:00",
          "updated_at": "2026-06-25T10:42:28+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "应用",
            "方法"
          ],
          "topics": [
            "Agent",
            "Evaluation"
          ],
          "doi": null,
          "arxiv_id": "2606.26859",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2606.26859v1"
          },
          "relevance_score": 41,
          "match_reasons": [
            "summary matched \"guardrail\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2606.26859"
        },
        {
          "title": "MIRROR: Novelty-Constrained Memory-Guided MCTS Red-Teaming for Agentic RAG",
          "summary": "Multimodal agentic retrieval-augmented generation (RAG) systems expand the attack surface beyond prompt injection to include text poisoning, image injection, direct-query attacks, and orchestrator-level tool manipulation. Existing red-teaming approaches are typically surface-specific and often recycle known attack templates; on text-poisoning benchmarks we measure 73-84% exact duplication. We present MIRROR, a unified cross-surface framework that performs memory-guided Monte Carlo tree search while conditioning candidate generation on retrieved context under an explicit novelty constraint. A deterministic Novelty Gate rejects any candidate matching the retrieval set under normalized comparison, allowing retrieval to inform search priors without enabling prompt copying. Across four attack surfaces on a multimodal agentic RAG target, MIRROR attains 76% ASR on image poisoning compared with 52% for baselines, 97% ASR on orchestrator attacks at half the query cost, and the lowest cross-surface variance (coefficient of variation 0.47). In contrast, specialized baselines collapse across surfaces: suffix optimization reaches 79% ASR on text poisoning but 1% on direct queries. We release ART-SafeBench with 41,815 in-package records and runtime adapters yielding 41,991+ total records across four surfaces.",
          "authors": [
            "Inderjeet Singh",
            "Andrés Murillo",
            "Motoyoshi Sekiya",
            "Yuki Unno",
            "Junichi Suga"
          ],
          "categories": [
            "cs.CR",
            "cs.AI",
            "cs.LG"
          ],
          "paper_id": "http://arxiv.org/abs/2606.26793v1",
          "abstract_url": "https://arxiv.org/abs/2606.26793v1",
          "pdf_url": "https://arxiv.org/pdf/2606.26793v1",
          "published_at": "2026-06-25T09:26:49+00:00",
          "updated_at": "2026-06-25T09:26:49+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "应用",
            "方法"
          ],
          "topics": [
            "Agent",
            "Benchmark"
          ],
          "doi": null,
          "arxiv_id": "2606.26793",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2606.26793v1"
          },
          "relevance_score": 40,
          "match_reasons": [
            "summary matched \"prompt injection\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2606.26793"
        }
      ]
    },
    {
      "name": "Terminal and SWE Agents",
      "key_points": [
        "《Smaller Models, Unexpected Costs: Trade-offs in LLM Quantization for Automated Program Repair》〔评测 / 方法〕：Language Models (LLMs) are powerful tools and have been increasingly adopted for complex software engineering tasks. As the number of parameters increases, resu…",
        "《To Run or Not to Run: Analyzing the Cost-Effectiveness of Code Execution in LLM-Based Program Repair》〔评测 / 方法〕：LLM-based agents for program repair are increasingly built on a \"generate-run-revise\" paradigm, iteratively executing tests to evaluate and refine patches. Thi…",
        "《How Much Static Structure Do Code Agents Need? A Study of Deterministic Anchoring》〔数据 / 应用 / 方法〕：LLM-based code agents navigate repositories through keyword search but miss the structural relationships, such as call graphs, inheritance hierarchies, and con…",
        "《A Deterministic Control Plane for LLM Coding Agents》〔应用 / 方法〕：LLM coding harnesses grant agents broad file and shell access, yet the configuration layer that steers them -- rules files, agent definitions, IDE-specific mar…",
        "《NOVA: A Verification-Aware Agent Harness for Architecture Evolution in Industrial Recommender Systems》〔评测 / 方法〕：Industrial advertising recommender models are continuously improved through architecture evolution. Upgrades such as RankMixer, TokenMixer-Large, and MixFormer…"
      ],
      "sort_by": "hybrid",
      "papers": [
        {
          "title": "Smaller Models, Unexpected Costs: Trade-offs in LLM Quantization for Automated Program Repair",
          "summary": "Language Models (LLMs) are powerful tools and have been increasingly adopted for complex software engineering tasks. As the number of parameters increases, results can often be improved, but this also imposes substantial memory requirements. While quantization effectively reduces the memory footprint, its overall impact is often summarized only by benchmark scores, which mask changes in model behavior and non-functional overheads. In this work, we conduct an empirical evaluation of LLM quantization using Automated Program Repair (APR), a complex task in software engineering. We analyze 13 quantization configurations spanning different bit-widths, methods, and target components (weights and KV cache) across six representative LLMs, evaluated on two APR benchmarks (HumanEval-Java and Defects4J). Our findings reveal that base and quantized models can provide different sets of repaired problems with little overlap, while retaining a comparable number of repaired problems. Although quantization successfully reduces memory footprints by up to 85%, it increases both inference time and energy consumption, which we attribute to suboptimal hardware utilization. Our Pareto trade-off analysis shows that 48% of the configurations evaluated are strictly dominated by alternatives. Rather than identifying a superior quantization method, our findings highlight that the trade-offs between effectiveness, memory footprint, and energy efficiency are sensitive to the underlying model architecture and the complexity of the task.",
          "authors": [
            "Fernando Vallecillos-Ruiz",
            "Giordano d'Aloisio",
            "Max Hort",
            "Luca Traini",
            "Antinisca Di Marco",
            "Leon Moonen"
          ],
          "categories": [
            "cs.SE"
          ],
          "paper_id": "http://arxiv.org/abs/2606.27205v1",
          "abstract_url": "https://arxiv.org/abs/2606.27205v1",
          "pdf_url": "https://arxiv.org/pdf/2606.27205v1",
          "published_at": "2026-06-25T16:02:05+00:00",
          "updated_at": "2026-06-25T16:02:05+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "方法"
          ],
          "topics": [
            "LLM",
            "Benchmark"
          ],
          "doi": null,
          "arxiv_id": "2606.27205",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2606.27205v1"
          },
          "relevance_score": 108,
          "match_reasons": [
            "title matched \"program repair\"",
            "title matched \"automated program repair\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2606.27205"
        },
        {
          "title": "To Run or Not to Run: Analyzing the Cost-Effectiveness of Code Execution in LLM-Based Program Repair",
          "summary": "LLM-based agents for program repair are increasingly built on a \"generate-run-revise\" paradigm, iteratively executing tests to evaluate and refine patches. This execution-based approach has become standard practice in state-of-the-art systems. However, executions can be time-consuming and expensive, yet their impact on these agents remains underexplored. In this paper, we conduct a two-stage empirical study over execution behavior in LLM-based program repair. To characterize execution behavior at scale, we first analyze 7,745 agent traces from SWE-bench leaderboard submissions. Second, we evaluate 3,000 end-to-end repair attempts across 200 SWE-bench instances and three agents (Claude Code, Codex, and the open-source OpenCode) under four execution paradigms, which allows for a fine-grained comparison of performance and cost. Our analysis reveals three key observations: (1) Code execution is used across all agents and models analyzed, with an average of 8.8 test runs per task. Execution behavior varies substantially across agents and models, with frequency ranging from 2 to 19 per task, and late-stage executions consistently achieve higher success rates than early-stage ones. (2) Execution restrictions have little effect on repair success: on commercial agents with SOTA models the resolve-rate gap between Prohibited and Unrestricted is only 1.25 percentage points and not statistically significant, while Prohibited saves substantial token and wall-clock cost. (3) Execution benefit is concentrated rather than uniform. These patterns suggest that current agents apply execution indiscriminately, paying its cost on instances where it provides little benefit. Execution, therefore, should be treated as a resource with an explicit cost-benefit tradeoff, not a default capability.",
          "authors": [
            "Zhihao Lin",
            "Junhua Zhu",
            "Mingyi Zhou",
            "Xin Wang",
            "Zhensu Sun",
            "Renyu Yang",
            "David Lo",
            "Li Li"
          ],
          "categories": [
            "cs.SE"
          ],
          "paper_id": "http://arxiv.org/abs/2606.26978v1",
          "abstract_url": "https://arxiv.org/abs/2606.26978v1",
          "pdf_url": "https://arxiv.org/pdf/2606.26978v1",
          "published_at": "2026-06-25T12:49:59+00:00",
          "updated_at": "2026-06-25T12:49:59+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "方法"
          ],
          "topics": [
            "LLM",
            "Agent"
          ],
          "doi": null,
          "arxiv_id": "2606.26978",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2606.26978v1"
          },
          "relevance_score": 83,
          "match_reasons": [
            "title matched \"program repair\"",
            "summary matched \"SWE-bench\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2606.26978"
        },
        {
          "title": "How Much Static Structure Do Code Agents Need? A Study of Deterministic Anchoring",
          "summary": "LLM-based code agents navigate repositories through keyword search but miss the structural relationships, such as call graphs, inheritance hierarchies, and configuration dependencies, that define how software actually works. This makes agent navigation stochastic and difficult to reproduce across runs. We investigate whether lightweight static analysis can provide deterministic anchors for these agents: stable structural facts injected as plain-text comments that constrain probabilistic exploration and make navigation more predictable. Starting from a strong baseline, Codex from OpenAI, we systematically inject varying granularities of structural annotations and measure their effects on localization, trajectory behavior, and run-to-run stability. Our study identifies what we call the deterministic anchoring effect: static structure helps less by making agents \"smarter\" and more by making their navigation disciplined and reproducible. Three observations support this finding: (1) Anchoring works: lightweight call/inheritance topology improves function-level localization (+2.2pp Func@5) and shortens trajectories (-1.6 interaction rounds); (2) Anchoring is scale-sensitive: the optimal granularity and directionality depend on repository characteristics, where denser semantics show diminishing returns and hub-heavy projects benefit from inverse-only links that expose \"who-calls-me\" without forward edges; (3) Anchoring stabilizes: tags raise link-following rate from 0.15-0.18 to 0.21-0.24, roughly halve run-to-run variance, and improve single-run reliability (Pass@1 +3.4 pp) on medium-scale repositories, at the cost of roughly 10% more input tokens. These observations suggest practical guidelines: default to lightweight topology on medium projects, prune forward edges in large repositories, and reserve dense tags for implicit-dependency cases.",
          "authors": [
            "Zhihao Lin",
            "Mingyi Zhou",
            "Yizhuo Yang",
            "Li Li"
          ],
          "categories": [
            "cs.SE"
          ],
          "paper_id": "http://arxiv.org/abs/2606.26979v1",
          "abstract_url": "https://arxiv.org/abs/2606.26979v1",
          "pdf_url": "https://arxiv.org/pdf/2606.26979v1",
          "published_at": "2026-06-25T12:50:01+00:00",
          "updated_at": "2026-06-25T12:50:01+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "数据",
            "应用",
            "方法"
          ],
          "topics": [
            "LLM",
            "Agent"
          ],
          "doi": null,
          "arxiv_id": "2606.26979",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2606.26979v1"
          },
          "relevance_score": 65,
          "match_reasons": [
            "title matched \"code agent\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2606.26979"
        },
        {
          "title": "A Deterministic Control Plane for LLM Coding Agents",
          "summary": "LLM coding harnesses grant agents broad file and shell access, yet the configuration layer that steers them -- rules files, agent definitions, IDE-specific markdown -- is largely unmanaged. A prevalence study of 10,008 public GitHub repositories (n=6,145 agent config files) finds that agent configurations propagate as undeclared shared components: 10.1% of tracked paths are SHA-256 exact duplicates across independent repositories (fork-adjusted, threshold-independent), with 75.5% of clone pairs crossing organisational boundaries. Two further patterns are indicative: configurations are rarely revised (58% single-commit; 0.4 vs 0.6 commits/month age-normalised against CI/CD workflows), and rarely declare permission boundaries (<1% of agent configs vs 33% of Actions workflows, n=31 true positives). We propose a deterministic control plane above the harness that maps one-to-one to these gaps. Rel(AI)Build treats agent definitions as a managed supply chain (SHA-256 content addressing, HMAC-stamped lockfiles, hash-chained audit logs); enforces tiered permissions and attack-derived blocklists before LLM invocation; gates feature work through a phase state machine with requirement-to-file-to-test traceability; compiles a single canonical definition to seven IDE targets; and detects prompt drift via Jaccard similarity. Conformance tests on injected violations confirm each mechanism enforces its stated invariant; developer outcomes remain future work. Governance of this layer must be deterministic and tool-agnostic -- not delegated to further LLM orchestration.",
          "authors": [
            "Padmaraj Madatha"
          ],
          "categories": [
            "cs.SE",
            "cs.AI",
            "cs.CR"
          ],
          "paper_id": "http://arxiv.org/abs/2606.26924v1",
          "abstract_url": "https://arxiv.org/abs/2606.26924v1",
          "pdf_url": "https://arxiv.org/pdf/2606.26924v1",
          "published_at": "2026-06-25T12:02:18+00:00",
          "updated_at": "2026-06-25T12:02:18+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "应用",
            "方法"
          ],
          "topics": [
            "LLM",
            "Agent"
          ],
          "doi": null,
          "arxiv_id": "2606.26924",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2606.26924v1"
          },
          "relevance_score": 64,
          "match_reasons": [
            "title matched \"coding agent\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2606.26924"
        },
        {
          "title": "NOVA: A Verification-Aware Agent Harness for Architecture Evolution in Industrial Recommender Systems",
          "summary": "Industrial advertising recommender models are continuously improved through architecture evolution. Upgrades such as RankMixer, TokenMixer-Large, and MixFormer show that better structures remain a key source of quality and business gains. Yet developing such upgrades in production is expert-intensive and difficult to scale. Existing automation is insufficient: AutoML mainly tunes hyper-parameters, while effective gains often require cross-module changes under strict constraints; generic LLM coding agents optimize for runnable code, but runnable code does not imply a valid recommender architecture. Candidates may pass local tests while causing silent failures that degrade performance. We present NOVA, a level-aware agent harness for verification-aware architecture evolution. NOVA uses an architecture gradient, an SGD-inspired, non-differentiable update signal that aggregates prior modifications, verification diagnostics, metric feedback, and trajectory memory to guide the next modification. A verification cascade checks structure semantics, local executability, offline effectiveness, and online impact; invalid candidates are blocked early, with failure patterns recorded as forbidden directions. L1--L4 task-level control matches automation to task complexity and risk, routing high-risk tasks to Copilot for human oversight. Deployed in an industrial advertising system, NOVA achieves the highest effective pass rate on L2 ScaleUp and L3 Literature-to-Production tasks (54.5% and 60.0%), reduces silent failures compared with coding-agent baselines, and shortens one literature-to-production cycle by over 13x in human-attended time. In online A/B testing, the selected L3 candidate improves GMV on three pCVR objectives by +1.25%, +1.70%, and +2.02%, while reducing pCVR bias by 58.8%, 66.7%, and 37.3%.",
          "authors": [
            "Shaohua Liu",
            "Liang Fang",
            "Yilong Sun",
            "Shudong Huang",
            "Qingsong Luo",
            "Xiaoyang Chen",
            "Dongqiang Liu",
            "Chuangang Ma",
            "Zhenzhen Chai",
            "Henghuan Wang",
            "Shijie Quan",
            "Changyuan Cui",
            "Zhangbin Zhu",
            "Peng Chen",
            "Wei Xu",
            "Lei Xiao",
            "Haijie Gu",
            "Jie Jiang"
          ],
          "categories": [
            "cs.IR",
            "cs.SE"
          ],
          "paper_id": "http://arxiv.org/abs/2606.27243v1",
          "abstract_url": "https://arxiv.org/abs/2606.27243v1",
          "pdf_url": "https://arxiv.org/pdf/2606.27243v1",
          "published_at": "2026-06-25T16:30:39+00:00",
          "updated_at": "2026-06-25T16:30:39+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "方法"
          ],
          "topics": [
            "LLM",
            "Agent"
          ],
          "doi": null,
          "arxiv_id": "2606.27243",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2606.27243v1"
          },
          "relevance_score": 47,
          "match_reasons": [
            "summary matched \"coding agent\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2606.27243"
        },
        {
          "title": "Mostly Automatic Translation of Language Interpreters from C to Safe Rust",
          "summary": "Translating C programs to safe Rust is challenging owing to significant differences in typing constraints, ownership, and borrowing rules. Interpreter programs are particularly important targets for such translation, as they often handle untrusted inputs and suffer from memory-related vulnerabilities. We present Reboot, a mostly-automatic technique that translates real-world interpreter programs from C to safe Rust. Using Reboot, we have translated six interpreters ranging from 6k to 23k lines of C code to safe Rust, with each translation requiring only 1 to 11 brief user interventions. All translations pass 100% of the provided test suites, and achieve 62%--92% pass rates on separately created validation tests that were never exposed to the system. A security case study on mujs shows that memory vulnerabilities such as heap buffer overflows and use-after-free present in C are eliminated in the safe Rust translation. Two ideas underpin Reboot. First, feature reduction decomposes the translation by program features, creating a sequence of milestones where each is a complete, testable program; the translation starts from the simplest version and incrementally restores features, with each milestone validated before proceeding. Second, a multi-agent architecture orchestrates inherently unreliable coding agents through automated validation and feedback, keeping long-running translation workflows on track with minimal human involvement. An ablation study confirms that feature reduction improves translation correctness compared to using multi-agent translation alone, with 6%--20% improvements in pass rates on validation test suites.",
          "authors": [
            "Bo Wang",
            "Brandon Paulsen",
            "Joey Dodds",
            "Daniel Kroening",
            "Umang Mathur",
            "Prateek Saxena"
          ],
          "categories": [
            "cs.PL",
            "cs.MA",
            "cs.SE"
          ],
          "paper_id": "http://arxiv.org/abs/2606.27122v1",
          "abstract_url": "https://arxiv.org/abs/2606.27122v1",
          "pdf_url": "https://arxiv.org/pdf/2606.27122v1",
          "published_at": "2026-06-25T14:59:08+00:00",
          "updated_at": "2026-06-25T14:59:08+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "应用",
            "方法"
          ],
          "topics": [
            "Agent",
            "Coding Agent"
          ],
          "doi": null,
          "arxiv_id": "2606.27122",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2606.27122v1"
          },
          "relevance_score": 45,
          "match_reasons": [
            "summary matched \"coding agent\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2606.27122"
        },
        {
          "title": "The Spec Growth Engine: Spec-Anchored, Code-Coupled, Drift-Enforced Architecture for AI-Assisted Software Development",
          "summary": "AI coding agents dramatically accelerate implementation speed but introduce two structural failure modes that existing spec-driven approaches do not fully solve: (1) context explosion -- the agent must reason over an entire repository at once, degrading output quality as the context window fills; and (2) silent spec-code drift -- code evolves, the specification does not, and the divergence becomes invisible until it is costly to repair. We present the Spec Growth Engine, a lightweight framework that addresses both failure modes through a machine-readable spec graph whose nodes carry explicit contract/design separation, a Spine context assembler that scopes agent context to an ownership path, a vertical-slice growth protocol that enforces hardest-first ordering, and a drift gate that makes spec-code divergence a blocking merge condition. The design synthesises well-established software engineering principles (Parnas information hiding, C4, ADRs, Walking Skeleton, Reflexion Models, Fitness Functions) into a lean, code-coupled, machine-enforced whole -- without the overhead of heavy-weight frameworks such as RUP or MDA.",
          "authors": [
            "Hartwig Grabowski"
          ],
          "categories": [
            "cs.SE",
            "cs.AI"
          ],
          "paper_id": "http://arxiv.org/abs/2606.27045v1",
          "abstract_url": "https://arxiv.org/abs/2606.27045v1",
          "pdf_url": "https://arxiv.org/pdf/2606.27045v1",
          "published_at": "2026-06-25T13:51:22+00:00",
          "updated_at": "2026-06-25T13:51:22+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "方法"
          ],
          "topics": [
            "Agent",
            "Coding Agent"
          ],
          "doi": null,
          "arxiv_id": "2606.27045",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2606.27045v1"
          },
          "relevance_score": 44,
          "match_reasons": [
            "summary matched \"coding agent\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2606.27045"
        }
      ]
    }
  ]
}