coding agent Topic Archive

coding agent Topic Archive coding-agent.html Long-term tracking RSS feed for the keyword "coding agent", collecting all previously matched papers. zh-CN Sun, 28 Jun 2026 05:24:06 +0000 A Deterministic Control Plane for LLM Coding Agents ../papers/arxiv-21cf1b4bd2a8.html https://arxiv.org/abs/2606.26924v1#2026-06-26#coding-agent Fri, 26 Jun 2026 13:16:53 +0800 LLM coding harnesses grant agents broad file and shell access, yet the configuration layer that steers them -- rules files, agent definitions, IDE-specific markdown -- is largely unmanaged. A prevalence study of 10,008 public GitHub repositories (n=6,145 agent config files) finds that agent configurations propagate as undeclared shared components: 10.1% of tracked paths are SHA-256 exact duplicates across independent repositories (fork-adjusted, threshold-independent), with 75.5% of clone pairs… NOVA: A Verification-Aware Agent Harness for Architecture Evolution in Industrial Recommender Systems ../papers/arxiv-49487ea267ee.html https://arxiv.org/abs/2606.27243v1#2026-06-26#coding-agent Fri, 26 Jun 2026 13:16:53 +0800 Industrial advertising recommender models are continuously improved through architecture evolution. Upgrades such as RankMixer, TokenMixer-Large, and MixFormer show that better structures remain a key source of quality and business gains. Yet developing such upgrades in production is expert-intensive and difficult to scale. Existing automation is insufficient: AutoML mainly tunes hyper-parameters, while effective gains often require cross-module changes under strict constraints; generic LLM cod… Mostly Automatic Translation of Language Interpreters from C to Safe Rust ../papers/arxiv-6dab3f980ca1.html https://arxiv.org/abs/2606.27122v1#2026-06-26#coding-agent Fri, 26 Jun 2026 13:16:53 +0800 Translating C programs to safe Rust is challenging owing to significant differences in typing constraints, ownership, and borrowing rules. 
Interpreter programs are particularly important targets for such translation, as they often handle untrusted inputs and suffer from memory-related vulnerabilities. We present Reboot, a mostly-automatic technique that translates real-world interpreter programs from C to safe Rust. Using Reboot, we have translated six interpreters ranging from 6k to 23k lines… The Spec Growth Engine: Spec-Anchored, Code-Coupled, Drift-Enforced Architecture for AI-Assisted Software Development ../papers/arxiv-74797965448d.html https://arxiv.org/abs/2606.27045v1#2026-06-26#coding-agent Fri, 26 Jun 2026 13:16:53 +0800 AI coding agents dramatically accelerate implementation speed but introduce two structural failure modes that existing spec-driven approaches do not fully solve: (1) context explosion -- the agent must reason over an entire repository at once, degrading output quality as the context window fills; and (2) silent spec-code drift -- code evolves, the specification does not, and the divergence becomes invisible until it is costly to repair. We present the Spec Growth Engine, a lightweight framework… NatureBench: Can Coding Agents Match the Published SOTA of Nature-Family Papers? ../papers/arxiv-934bd8b79fae.html https://arxiv.org/abs/2606.24530v1#2026-06-24#coding-agent Wed, 24 Jun 2026 13:06:49 +0800 We introduce NatureBench, a cross-discipline benchmark of 90 tasks distilled from peer-reviewed Nature-family publications, designed to evaluate whether AI coding agents can move beyond reproduction toward discovery on real scientific problems. 
NatureBench is built on NatureGym, an automated pipeline that constructs a standardized, per-task containerized environment from a source paper, addressing the environment-fragmentation problem that has limited the credibility of prior agent-on-research… Bayesian control for coding agents ../papers/arxiv-1bf783e8f09b.html https://arxiv.org/abs/2606.24453v1#2026-06-24#coding-agent Wed, 24 Jun 2026 13:06:49 +0800 Modern coding agents pair LLM generators with various tools, including cheap diagnostics and expensive verifiers. The tool-use decisions are typically governed by orchestrators that often use fixed rules and ignore uncertainty. We formulate orchestration as cost-sensitive sequential hypothesis testing: a Bayesian controller maintains a belief over candidate correctness and dynamically decides whether to gather more evidence, refine the candidate, verify it, or stop. Across six generators and ni… Detecting AI Coding Agents in Open Source: A Validated Multi-Method Census of 180 Million Repositories ../papers/arxiv-1037309e2b3d.html https://arxiv.org/abs/2606.24429v1#2026-06-24#coding-agent Wed, 24 Jun 2026 13:06:49 +0800 Generative AI coding agents are entering the open-source supply chain, yet their diverse and often invisible traces leave their prevalence poorly understood. We introduce a multi-layered detection framework that integrates configuration-file scanning, commit-message analysis, author-identity matching, and bot-signature lookup across World of Code (180M+ Git repositories), classifying agent traces into four behavioral types. 
No single method captures more than a fraction of activity: multi-metho… Probe-and-Refine Tuning of Repository Guidance for Coding Agents ../papers/arxiv-b849fd15a901.html https://arxiv.org/abs/2606.20512v1#2026-06-19#coding-agent Fri, 19 Jun 2026 14:26:15 +0800 LLM-based coding agents need higher-level operational knowledge about a repository (which files house which subsystems, how to run the test suite, which workflows have historically led to wrong fixes) that does not exist in the code itself. Engineers typically maintain AGENTS.md files to supply this context as instructions for coding agents, but whether they help is contested: recent studies disagree on whether LLM-generated guidance improves or harms agent performance. In this paper w… N-Version Programming with Coding Agents ../papers/arxiv-9d86ae01851b.html https://arxiv.org/abs/2606.20158v1#2026-06-19#coding-agent Fri, 19 Jun 2026 14:26:15 +0800 This paper revisits the classical concept of N-version programming in the setting of contemporary AI coding agents. Revisiting the seminal Knight-Leveson experiment, we study whether diversity across agent systems, models, and implementation languages creates diverse failure modes. Using the Knight-Leveson Launch Interceptor Program Specification, we evaluate 48 agent-generated implementations on a shared oracle and a campaign of 1,000,000 randomized test inputs. The results show substantial… Data Intelligence Agents: Interpreting, Modeling, and Querying Enterprise Data via Autonomous Coding Agents ../papers/arxiv-c9634e5a40df.html https://arxiv.org/abs/2606.19319v1#2026-06-18#coding-agent Thu, 18 Jun 2026 14:03:08 +0800 Production data integration is bottlenecked by repeated, lossy handoffs between data owners, engineers, and analysts who must collaboratively discover, structure, and query enterprise data. 
We present Data Intelligence Agents (DIA), a system of three agents (Data Interpreter, Schema Creator, and Query Generator) that compresses this workflow by treating autonomous coding agents (ACAs) as a first-class abstraction: rather than emitting text, the agents generate, execute, validate, and repair con… All Smoke, No Alarm: Oracle Signals in Agent-Authored Test Code ../papers/arxiv-982cb530f375.html https://arxiv.org/abs/2606.18168v1#2026-06-17#coding-agent Wed, 17 Jun 2026 14:22:19 +0800 Software practitioners increasingly use AI coding agents that generate test code alongside production code in open source pull requests (PRs). Recent studies report more than 932,000 agent-authored PRs across more than 116,000 repositories, yet whether their test files contain meaningful verification logic remains underexplored. Test files lacking explicit assertions execute code without verifying behavior, so quality gates based on test-file presence overestimate verification strength. The goa… GameCraft-Bench: Can Agents Build Playable Games End-to-End in a Real Game Engine? ../papers/arxiv-a08b927cb2ff.html https://arxiv.org/abs/2606.17861v1#2026-06-17#coding-agent Wed, 17 Jun 2026 14:22:19 +0800 Game generation is an emerging application of coding agents, requiring models to transform natural-language specifications into playable interactive systems. Unlike traditional coding tasks, game generation takes place within a game engine, where scripts, scenes, assets, rendering, and runtime interactions must jointly produce coherent gameplay. 
We formalize end-to-end game generation as the problem of producing a complete game artifact that realizes a specification through observable player-ga… Position: Coding Benchmarks Are Misaligned with Agentic Software Engineering ../papers/arxiv-e2056f1b32cd.html https://arxiv.org/abs/2606.17799v1#2026-06-17#coding-agent Wed, 17 Jun 2026 14:22:19 +0800 Coding agents have become a major mode of software engineering, but the benchmarks we use to compare them were designed in a pre-agent era: they collapse model, harness, and environment into a single end-to-end score, typically computed against one reference solution, with no component-level signal for iteration. We argue that current coding benchmarks are misaligned with agentic software engineering. A coding agent in practice is not a model: it is a system harness -- a composite of models, ha… Context-Aware RL for Agentic and Multimodal LLMs ../papers/arxiv-7fc03b48799d.html https://arxiv.org/abs/2606.17053v1#2026-06-16#coding-agent Tue, 16 Jun 2026 14:38:43 +0800 Large language models (LLMs) often fail when answering requires identifying a small but decisive piece of evidence within a long or complex context, such as a single line in a tool trace or a subtle detail in an image. We propose ContextRL, a context-aware reinforcement learning (RL) method that improves long-horizon reasoning and multimodal performance through an indirect auxiliary objective. Instead of supervising only the final answer, ContextRL presents the model with a query, an ans… Agent trajectories as programs: fingerprinting and programming coding-agent behavior ../papers/arxiv-1006c0328948.html https://arxiv.org/abs/2606.16988v1#2026-06-16#coding-agent Tue, 16 Jun 2026 14:38:43 +0800 Benchmark scores tell you what an agent got right; they do not tell you how it got there. In this work, we introduce methods for comparing agents procedurally in different contexts, where the model, tasks, and approaches vary. 
We compare ten agents and find that they are identifiable by their behavioral habits, which we define as fingerprints: a probe over these procedural signatures attributes an unseen trajectory to the correct agent at 85.7% accuracy, controlling for leakage across tasks. We… Towards LLM Accelerated Rapid Reviews for Software Tool Discovery -- Case for Log Anomaly Detection ../papers/arxiv-894d6a7c7127.html https://arxiv.org/abs/2606.16839v1#2026-06-16#coding-agent Tue, 16 Jun 2026 14:38:43 +0800 In software engineering research, the primary outcome is frequently a tool. However, for practitioners and academics alike, it is hard to tell which tools are maintained and whether they work out of the box. In this paper, we propose a pipeline to identify relevant studies with LLM screening, extract the tools presented in them, and run them with an LLM-based coding agent. To evaluate the feasibility of our approach, we focus on software log anomaly detection tools. We begin the study by designing a bro… AgentBeats: Agentifying Agent Assessment for Openness, Standardization, and Reproducibility ../papers/arxiv-46edd32748c1.html https://arxiv.org/abs/2606.13608v1#2026-06-12#coding-agent Fri, 12 Jun 2026 13:55:02 +0800 Agent systems are advancing quickly across domains, but their evaluation remains fragmented. Most benchmarks rely on fixed, LLM-centric harnesses that require heavy integration, create test-production mismatch, and limit fair comparison across diverse agent designs. The root problem is the lack of an open, agent-agnostic assessment interface. 
We advocate Agentified Agent Assessment (AAA), where evaluation is performed by judge agents and all participants interact through standardized protocols:… Getting Better at Working With You: Compiling User Corrections into Runtime Enforcement for Coding Agents ../papers/arxiv-2423b3a60f1a.html https://arxiv.org/abs/2606.13174v1#2026-06-12#coding-agent Fri, 12 Jun 2026 13:55:02 +0800 Interactive LLM agents are becoming part of daily work, but they do not reliably become easier to work with over time: a correction remembered in one session may still be violated in the next. We study this gap between preference access and preference compliance. In tasks derived from anonymized real-user friction cases, Mem0 memory still leaves 57.5% of applicable preference checks violated. We introduce Test-time Rule Acquisition and Compiled Enforcement (TRACE), a drop-in skill-layer pipelin… Understanding the Rejection of Fixes Generated by Agentic Pull Requests -- Insights from the AIDev Dataset ../papers/doi-55309c0d716b.html https://arxiv.org/abs/2606.13468v1#2026-06-12#coding-agent Fri, 12 Jun 2026 13:55:02 +0800 AI coding agents are increasingly used to generate pull requests (PRs) that propose code fixes in software projects. From a first exploration of the AIDev dataset, we find that 46.41% of the fixes proposed by the agents Copilot, Devin, Cursor, and Claude are rejected. This represents a significant amount of wasted resources that require human reviews, verifications, and running tests and validations for fixes that are merely discarded. 
Our goal in this paper is to understand the failure modes… Recursive Agent Harnesses ../papers/arxiv-de2b00c4f07a.html https://arxiv.org/abs/2606.13643v1#2026-06-12#coding-agent Fri, 12 Jun 2026 13:55:02 +0800 Recursive language models (RLMs) showed that recursion over model calls is an effective strategy for long-context reasoning, and production coding agents have begun to write code that spawns subagents at scale, most recently in Anthropic's dynamic workflows. We name and study the pattern between these two lines of work, where the recursive unit is a full agent harness with filesystem tools, code execution, and planning rather than a model call with no tools. We call this the Recursive Agent Har… PROJECTMEM: A Local-First, Event-Sourced Memory and Judgment Layer for AI Coding Agents ../papers/arxiv-02d4273f4e6d.html https://arxiv.org/abs/2606.12329v1#2026-06-11#coding-agent Thu, 11 Jun 2026 13:59:12 +0800 AI coding assistants now support a growing share of software work, from quick scripts to production applications. Yet these agents remain largely stateless: each new session re-reads project files, re-derives prior decisions, and - most costly - may repeat debugging attempts that already failed. Reconstructing this context can consume an estimated 5,000-20,000 tokens per session; the bottleneck is often not model capability but missing project memory. We present projectmem, an open-source, loca… Frontier Coding Agents Use Metaprogramming to Adapt to Unfamiliar Programming Languages ../papers/arxiv-9b497ddc554e.html https://arxiv.org/abs/2606.10933v1#2026-06-10#coding-agent Wed, 10 Jun 2026 13:25:04 +0800 LLM-based coding agents are usually evaluated in familiar software settings: mainstream languages, common libraries, and public repositories. These benchmarks remain important, but they can hide how agents behave when the language itself is unfamiliar. 
We evaluate six contemporary coding agents on four esoteric programming languages using a sequential setup with file editing, local execution, and hidden-test grading. Our protocol exposes capability differences between these agents that mainstre… AutoPDE: Reliable Agentic PDE Solving via Explicitly Represented Solver Strategies ../papers/arxiv-847112fe86c5.html https://arxiv.org/abs/2606.10752v1#2026-06-10#coding-agent Wed, 10 Jun 2026 13:25:04 +0800 Numerical solvers for partial differential equations (PDEs) are core computational tools in science and engineering. Building reliable PDE solvers requires not only executable code, but a numerical solver strategy, a set of decisions about discretization, stabilization, solver configuration, and resolution control, that matches the PDE structure. Recent LLM-based coding agents have begun to reduce the programming burden by generating and debugging solver implementations. However, they typically… SIGA: Self-Evolving Coding-Agent Adapters for Scientific Simulation ../papers/arxiv-39d71777386c.html https://arxiv.org/abs/2606.09774v1#2026-06-09#coding-agent Tue, 09 Jun 2026 13:12:49 +0800 Advanced scientific simulators expose specialized input languages that turn simulation goals into executable configurations, but learning them can cost domain scientists hours to days. We study simulator setup as a problem of agent-tool interface grounding: what minimal simulator-specific adaptations are needed for an off-the-shelf coding agent to operate real scientific software? 
Our intuition is that coding agents already know how to navigate files, edit code, run commands, and repair outputs… ADK Arena: Evaluating Agent Development Kits via LLM-as-a-Developer ../papers/arxiv-9604b8229fbc.html https://arxiv.org/abs/2606.05548#2026-06-05#coding-agent Fri, 05 Jun 2026 13:25:00 +0800 The rapid proliferation of Agent Development Kits (ADKs), SDK-level frameworks for building LLM-powered autonomous agents, has outpaced any empirical understanding of how framework choice affects agent performance. We propose LLM-as-a-Developer, a methodology that replaces human developers with an LLM coding agent that learns each framework's API from documentation, writes agent code, and iteratively repairs it through a validate-and-feedback loop until tests pass. By holding the devel… Coding with "Enemy": Can Human Developers Detect AI Agent Sabotage? ../papers/arxiv-9fdd4a7953d9.html https://arxiv.org/abs/2606.05647#2026-06-05#coding-agent Fri, 05 Jun 2026 13:25:00 +0800 AI coding agents are increasingly embedded in real-world software development, collaborating with human developers while gaining broader access to codebases and tools. This creates a new attack surface: an agent can exploit human trust to sabotage development, for instance by inserting malicious code to accomplish a hidden side task. 
Most prior work studies AI sabotage in AI-only settings, paying limited attention to the role of human oversight in detecting and mitigating such malicious behavio… Converted, Not Equivalent: Benchmarking Codebase Conversion via Observational Equivalence ../papers/arxiv-1abfb9f6a50a.html https://arxiv.org/abs/2605.29054#2026-06-05#coding-agent Fri, 05 Jun 2026 13:25:00 +0800 Coding agents increasingly act as codebase-scale collaborators that can assist with codebase conversion, but this progress has exposed a critical weakness: agents often over-trust their own local validation routines and declare success on artifacts that satisfy surface checks while violating the semantic contracts users actually care about. This problem is especially acute in codebase conversion, where prior evaluation is largely outcome-driven and therefore unstable: two implementations can ma… Can Generalist Agents Automate Data Curation? ../papers/arxiv-4de315772446.html https://arxiv.org/abs/2606.04261#2026-06-04#coding-agent Thu, 04 Jun 2026 14:02:06 +0800 Curating training data is among the most consequential yet labor-intensive parts of modern AI development: practitioners iteratively propose, implement, evaluate, and revise data policies against noisy benchmark feedback. We ask whether generalist coding agents can automate this data-curation loop. We introduce Curation-Bench, an agent-centric benchmark that fixes the model, training recipe, and evaluation suite while giving agents command-line access to inspect data, implement policies, subm… Trustworthy AI Software Engineers ../papers/arxiv-bf1111d1de08.html https://arxiv.org/abs/2602.06310#2026-06-04#coding-agent Thu, 04 Jun 2026 14:02:06 +0800 With the rapid rise of AI coding agents, the fundamental premise of what it means to be a software engineer is in question. In this vision paper, we examine what it means for an AI agent to be considered a software engineer and then critically think about what makes such an agent trustworthy. 
Grounded in established definitions of software engineering (SE) and informed by recent research on agentic AI systems, we conceptualise AI software engineers as participants in human-AI SE teams composed of human software… Cross-Lingual Token Arbitrage: Optimizing Code Agent Context Windows via Local LLM Preprocessing ../papers/arxiv-7b2eb833fcd5.html https://arxiv.org/abs/2606.03618#2026-06-03#coding-agent Wed, 03 Jun 2026 14:09:56 +0800 AI-assisted coding agents are bottlenecked by input-token cost. Two pathologies of raw human input drive much of this overhead: tokenization inefficiency for non-English text and structural entropy in conversational prompts. Existing approaches act reactively by compressing already-bloated contexts or intervening after failures occur. We introduce a pre-flight, edge-side prompt-rewriting middleware that operates between the developer and the cloud agent. A local Llama 3.2 (3B) model performs cr… Handoff Debt: The Rediscovery Cost When Coding Agents Take Over Interrupted Tasks ../papers/arxiv-7a2f0bc6383f.html https://arxiv.org/abs/2606.02875#2026-06-03#coding-agent Wed, 03 Jun 2026 14:09:56 +0800 Coding-agent benchmarks evaluate whether a single uninterrupted agent can resolve a repository issue. Real software work is messier: tasks are interrupted, reassigned, reviewed, and resumed from partial states left by another agent or engineer. We study this missing dimension through handoff debt: the rediscovery cost imposed when a predecessor's work is opaque or incomplete. 
Our takeover protocol interrupts a coding agent at deterministic handoff points, freezes the repository, and eval… Human-AI Collaboration and the Transformation of Software Engineering Work ../papers/arxiv-65d779a0d670.html https://arxiv.org/abs/2606.03394#2026-06-03#coding-agent Wed, 03 Jun 2026 14:09:56 +0800 The integration of Generative AI (GenAI) and Agentic AI into software development is reconfiguring software engineering from an activity centered on human authorship of code into a discipline centered on directing, verifying, and governing autonomous and semi-autonomous systems. Drawing on a curated, multi-source evidence base of recent peer-reviewed and archival studies -- including large-scale empirical observations of autonomous coding agents contributing hundreds of thousands of pull reques… SkillHarm: Lifecycle-Aware Skill-Based Attacks via Automated Construction ../papers/arxiv-e23cf323686f.html https://arxiv.org/abs/2606.02540v1#2026-06-02#coding-agent Tue, 02 Jun 2026 13:56:35 +0800 Agent skills occupy a privileged position in the agent workflow, as agents are expected to implicitly follow and execute them, rendering third-party skills a vulnerable attack surface. Existing studies have revealed unsafe agent behaviors induced by skill-based attacks, but they primarily evaluate poisoned skills within a single task execution and enumerate harms through ad-hoc risk lists. To bridge these gaps, we introduce SkillHarm, a benchmark of skill-based attacks across the skill-use life… Physics Is All You Need? A Case Study in Physicist-Supervised AI Development of Scientific Software ../papers/arxiv-1f630a213c57.html https://arxiv.org/abs/2605.30353v1#2026-05-29#coding-agent Fri, 29 May 2026 13:18:32 +0800 Are AI agents tools, co-authors, or researchers? 
We present a quantified case study (N=1): a physicist supervising an AI coding agent (Claude Code, Sonnet and Opus models) over 12 work days and 57 sessions to build CLAX-PT, a differentiable one-loop perturbation theory module in JAX. We documented and classified 15 supervision events by intervention level. The agent resolved ten autonomously by iterating against oracle tests. Two more were resolved through the physicist's domain knowledge. The three it could no… Discovering Cooperative Pipelines: Autoresearch for Sequential Social Dilemmas ../papers/arxiv-954fe75779c9.html https://arxiv.org/abs/2605.30003v1#2026-05-29#coding-agent Fri, 29 May 2026 13:18:32 +0800 We study two-level autoresearch for cooperation: an outer-loop AI agent autonomously redesigns the inner-loop pipeline of an LLM policy-synthesis system for multi-agent Sequential Social Dilemmas (SSDs). A researcher agent R (run as a coding agent) reads the inner-loop source code, edits system prompts, feedback functions, helper libraries, and iteration logic, runs evaluations, and decides what to keep, following the autoresearch paradigm. Across two games (Cleanup and Gathering),… "Refactoring Runaway": Understanding and Mitigating Tangled Refactorings in Coding Agents for Issue Resolution ../papers/arxiv-9dc3da77bc41.html https://arxiv.org/abs/2605.22526v1#2026-05-22#coding-agent Fri, 22 May 2026 13:08:19 +0800 Recent advances in coding agents have shown remarkable progress in software issue resolution. In practice, real-world issues are typically bug fixes or feature requests in which human developers naturally incorporate refactoring as part of the resolution process, resulting in tangled refactoring. Since LLMs are trained on large-scale open-source repositories, coding agents may inherit such behaviors. In this paper, we conduct an empirical study on Multi-SWE-bench, analyzing 3,691 valid patches… Why Are Agentic Pull Requests Merged or Rejected? 
An Empirical Study ../papers/arxiv-8ae2bd5f0bc1.html https://arxiv.org/abs/2605.22534v1#2026-05-22#coding-agent Fri, 22 May 2026 13:08:19 +0800 AI coding agents increasingly submit pull requests (Agentic-PRs) to open-source repositories, yet their performance is commonly assessed using merge and rejection outcomes alone. We hypothesized that these outcome labels do not reliably reflect agent capability without considering review interactions. To test this, we conducted a decision-oriented analysis of 11,048 closed Agentic Pull Requests, refined to 9,799 human-reviewed PRs, and manually inspected 717 representative cases to recover deci… Insights Generator: Systematic Corpus-Level Trace Diagnostics for LLM Agents ../papers/arxiv-841e3c0b15b6.html https://arxiv.org/abs/2605.21347v1#2026-05-21#coding-agent Thu, 21 May 2026 13:14:24 +0800 Diagnosing failures in LLM agents remains largely manual. Practitioners inspect a small subset of execution traces, form ad-hoc hypotheses, and iterate. This process misses patterns that only emerge across trace populations and does not scale to production corpora where individual traces span tens of thousands of tokens. We formalize the problem of corpus-level trace diagnostics. Given a corpus of execution traces, the goal is to produce grounded natural-language insights that characterize syst… SpecBench: Measuring Reward Hacking in Long-Horizon Coding Agents ../papers/arxiv-7c3df22fc78d.html https://arxiv.org/abs/2605.21384v1#2026-05-21#coding-agent Thu, 21 May 2026 13:14:24 +0800 As long-horizon coding agents produce more code than any developer can review, oversight collapses onto a single surface: the automated test suite. Reward hacking naturally arises in this setup, as the agent optimizes for passing tests while deviating from the user's true goal. 
We study this reward hacking phenomenon by decomposing software engineering tasks into three parts: (i) a natural language description of the specification, (ii) visible validation tests that exercise specified features in i… TwinRouterBench: Fast Static and Live Dynamic Evaluation for Realistic Agentic LLM Routing ../papers/arxiv-205818b0a17d.html https://arxiv.org/abs/2605.18859#2026-05-20#coding-agent Wed, 20 May 2026 13:10:58 +0800 LLM routing matters most in long-horizon applications such as coding agents, deep research systems, and computer-use agents, where a single user request triggers many model calls. Routing each call to the cheapest sufficient model can cut costs without sacrificing quality, yet existing router benchmarks evaluate routers only on one-shot prompts. They never expose the router-visible prefix at an intermediate agent step, never test whether a cheaper replacement preserves downstream task success,… Does Code Cleanliness Affect Coding Agents? A Controlled Minimal-Pair Study ../papers/arxiv-c2eea6eb77ee.html https://arxiv.org/abs/2605.20049#2026-05-20#coding-agent Wed, 20 May 2026 13:10:58 +0800 As autonomous coding agents see rapid adoption, their evaluation has primarily focused on task completion rates, holding the target codebase fixed. This leaves a critical question unanswered: does the structural and stylistic quality, or "cleanliness", of the underlying code affect an agent's ability to navigate and modify it? To isolate the effect of code cleanliness from agent capability, we introduce an evaluation protocol built around minimal pairs: repositories that match on architecture,… PEEK: Context Map as an Orientation Cache for Long-Context LLM Agents ../papers/arxiv-3fb429b5c523.html https://arxiv.org/abs/2605.19932#2026-05-20#coding-agent Wed, 20 May 2026 13:10:58 +0800 Large language model (LLM) agents increasingly operate over long and recurring external contexts, like document corpora and code repositories. 
Across invocations, existing approaches preserve either the agent's trajectory, passive access to raw material, or task-level strategies. None of them preserves what we argue is most needed for repeated same-context workloads: reusable orientation knowledge (e.g., what the context contains, how it is organized, and which entities, constants, and schemas… RoadmapBench: Evaluating Long-Horizon Agentic Software Development Across Version Upgrades ../papers/arxiv-65c153689210.html https://arxiv.org/abs/2605.15846#2026-05-20#coding-agent Wed, 20 May 2026 13:10:58 +0800 Coding agents are increasingly deployed in real software development, where a single version iteration requires months of coordinated work across many files. However, most existing benchmarks focus predominantly on single-issue bug fixes from Python repositories, with coarse pass/fail evaluation outcomes, and thus fail to capture long-horizon, multi-target development at real engineering scale. To address this gap, we present RoadmapBench, a benchmark of 115 long-horizon coding tasks grounded i… Overeager Coding Agents: Measuring Out-of-Scope Actions on Benign Tasks ../papers/arxiv-2a81926568cc.html https://arxiv.org/abs/2605.18583v1#2026-05-19#coding-agent Tue, 19 May 2026 13:08:04 +0800 Coding agents now run autonomously with shell, file, and network privileges. When a user issues a benign request, the agent sometimes does more than asked: it deletes unrelated files, wipes a stale credentials backup, or rewrites configuration the user never mentioned. We call these scope expansions overeager actions, an authorization problem distinct from capability failures, prompt injection, or sandbox escapes. 
We present OverEager-Gen, a benchmark dedicated to overeager behavior on benign t… Reversa: A Reverse Documentation Engineering Framework for Converting Legacy Software into Operational Specifications for AI Agents ../papers/arxiv-5fedf8f72baf.html https://arxiv.org/abs/2605.18684v1#2026-05-19#coding-agent Tue, 19 May 2026 13:08:04 +0800 Legacy systems concentrate business rules, architectural decisions, and operational exceptions that often remain implicit in code, data, configuration, and maintenance practices. At the same time, language-model-based coding agents depend on reliable context, correctness criteria, and behavioral contracts to modify real systems with lower risk. This paper presents Reversa, a reverse documentation engineering framework for converting legacy software into traceable operational specifications for… Remember Your Trace: Memory-Guided Long-Horizon Agentic Framework for Consistent and Hierarchical Repository-Level Code Documentation ../papers/arxiv-4b40491562b5.html https://arxiv.org/abs/2605.14563#2026-05-15#coding-agent Fri, 15 May 2026 14:57:29 +0800 Automated code documentation is essential for modern software development, providing the contextual grounding that both human developers and coding agents rely on to navigate large codebases. Existing repository-level approaches process components independently, causing redundant retrieval and conflicting descriptions across documents while producing outputs that lack hierarchical structure. Therefore, we propose MemDocAgent, a long-horizon agentic framework that generates documentation within… SWE-Chain: Benchmarking Coding Agents on Chained Release-Level Package Upgrades ../papers/arxiv-ff63d48f53a7.html https://arxiv.org/abs/2605.14415#2026-05-15#coding-agent Fri, 15 May 2026 14:57:29 +0800 Coding agents powered by large language models are increasingly expected to perform realistic software maintenance tasks beyond isolated issue resolution. 
Existing benchmarks have shifted toward realistic software evolution, but they rarely capture continuous maintenance at the granularity of package releases, where changes are bundled, shipped, and inherited by subsequent versions. We present SWE-Chain, a benchmark for evaluating agents on chained release-level package upgrades, where each tra… Documentation-Guided Agentic Codebase Migration from C to Rust ../papers/arxiv-e59063d0f41d.html https://arxiv.org/abs/2605.14634#2026-05-15#coding-agent Fri, 15 May 2026 14:57:29 +0800 Migrating legacy C repositories to Rust promises stronger memory safety, but existing translators often work at the level of files or functions and miss architectural intent. We present RustPrint, a documentation-guided agentic framework for repository-level C-to-Rust migration. RustPrint first converts the source repository into architecture-aware documentation and treats it as a migration blueprint capturing module structure, data flow, APIs, and design rationale. Coding agents then use this… Do Androids Dream of Breaking the Game? Systematically Auditing AI Agent Benchmarks with BenchJack ../papers/arxiv-291af5e89807.html https://arxiv.org/abs/2605.12673#2026-05-15#coding-agent Fri, 15 May 2026 14:57:29 +0800 Agent benchmarks have become the de facto measure of frontier AI competence, guiding model selection, investment, and deployment. However, reward hacking, where agents maximize a score without performing the intended task, emerges spontaneously in frontier models without overfitting. We argue that benchmarks must be secure by design. 
From past incidents of reward hacks, we derive a taxonomy of eight recurring flaw patterns and compile them into the Agent-Eval Checklist for benchmark designers.… Agentic Vulnerability Reasoning on Windows COM Binaries ../papers/arxiv-fc8295aa4188.html https://arxiv.org/abs/2605.05000v1#2026-05-07#coding-agent Thu, 07 May 2026 12:38:06 +0800 Windows Component Object Model (COM) services run with elevated privileges and are widely accessible to authenticated users, making race conditions in these binaries a critical surface for local privilege escalation. We present SLYP, an end-to-end agentic pipeline that discovers race condition vulnerabilities in COM binaries and generates debugger-verified proof-of-concept (PoC) code. SLYP exposes binary exploration, COM inspection, and dynamic debugging as reusable tool interfaces, giving agen…