Terminal and SWE Agents Feed Archive

terminal-and-swe-agents.html Long-term RSS subscription for Terminal and SWE Agents, collecting recently matched papers and archives. zh-CN Sun, 28 Jun 2026 05:24:06 +0000 Smaller Models, Unexpected Costs: Trade-offs in LLM Quantization for Automated Program Repair ../papers/arxiv-b661505d2f8b.html https://arxiv.org/abs/2606.27205v1#2026-06-26#terminal-and-swe-agents Fri, 26 Jun 2026 13:16:53 +0800 Large Language Models (LLMs) are powerful tools and have been increasingly adopted for complex software engineering tasks. As the number of parameters increases, results can often be improved, but this also imposes substantial memory requirements. While quantization effectively reduces the memory footprint, its overall impact is often summarized only by benchmark scores, which mask changes in model behavior and non-functional overheads. In this work, we conduct an empirical evaluation of LLM quantization u… To Run or Not to Run: Analyzing the Cost-Effectiveness of Code Execution in LLM-Based Program Repair ../papers/arxiv-f2d059b8979b.html https://arxiv.org/abs/2606.26978v1#2026-06-26#terminal-and-swe-agents Fri, 26 Jun 2026 13:16:53 +0800 LLM-based agents for program repair are increasingly built on a "generate-run-revise" paradigm, iteratively executing tests to evaluate and refine patches. This execution-based approach has become standard practice in state-of-the-art systems. However, executions can be time-consuming and expensive, yet their impact on these agents remains underexplored. In this paper, we conduct a two-stage empirical study over execution behavior in LLM-based program repair. To characterize execution behavior… How Much Static Structure Do Code Agents Need? 
A Study of Deterministic Anchoring ../papers/arxiv-b4fa44958d89.html https://arxiv.org/abs/2606.26979v1#2026-06-26#terminal-and-swe-agents Fri, 26 Jun 2026 13:16:53 +0800 LLM-based code agents navigate repositories through keyword search but miss the structural relationships, such as call graphs, inheritance hierarchies, and configuration dependencies, that define how software actually works. This makes agent navigation stochastic and difficult to reproduce across runs. We investigate whether lightweight static analysis can provide deterministic anchors for these agents: stable structural facts injected as plain-text comments that constrain probabilistic explora… A Deterministic Control Plane for LLM Coding Agents ../papers/arxiv-21cf1b4bd2a8.html https://arxiv.org/abs/2606.26924v1#2026-06-26#terminal-and-swe-agents Fri, 26 Jun 2026 13:16:53 +0800 LLM coding harnesses grant agents broad file and shell access, yet the configuration layer that steers them -- rules files, agent definitions, IDE-specific markdown -- is largely unmanaged. A prevalence study of 10,008 public GitHub repositories (n=6,145 agent config files) finds that agent configurations propagate as undeclared shared components: 10.1% of tracked paths are SHA-256 exact duplicates across independent repositories (fork-adjusted, threshold-independent), with 75.5% of clone pairs… NOVA: A Verification-Aware Agent Harness for Architecture Evolution in Industrial Recommender Systems ../papers/arxiv-49487ea267ee.html https://arxiv.org/abs/2606.27243v1#2026-06-26#terminal-and-swe-agents Fri, 26 Jun 2026 13:16:53 +0800 Industrial advertising recommender models are continuously improved through architecture evolution. Upgrades such as RankMixer, TokenMixer-Large, and MixFormer show that better structures remain a key source of quality and business gains. Yet developing such upgrades in production is expert-intensive and difficult to scale. 
Existing automation is insufficient: AutoML mainly tunes hyper-parameters, while effective gains often require cross-module changes under strict constraints; generic LLM cod… Mostly Automatic Translation of Language Interpreters from C to Safe Rust ../papers/arxiv-6dab3f980ca1.html https://arxiv.org/abs/2606.27122v1#2026-06-26#terminal-and-swe-agents Fri, 26 Jun 2026 13:16:53 +0800 Translating C programs to safe Rust is challenging owing to significant differences in typing constraints, ownership, and borrowing rules. Interpreter programs are particularly important targets for such translation, as they often handle untrusted inputs and suffer from memory-related vulnerabilities. We present Reboot, a mostly-automatic technique that translates real-world interpreter programs from C to safe Rust. Using Reboot, we have translated six interpreters ranging from 6k to 23k lines… The Spec Growth Engine: Spec-Anchored, Code-Coupled, Drift-Enforced Architecture for AI-Assisted Software Development ../papers/arxiv-74797965448d.html https://arxiv.org/abs/2606.27045v1#2026-06-26#terminal-and-swe-agents Fri, 26 Jun 2026 13:16:53 +0800 AI coding agents dramatically accelerate implementation speed but introduce two structural failure modes that existing spec-driven approaches do not fully solve: (1) context explosion -- the agent must reason over an entire repository at once, degrading output quality as the context window fills; and (2) silent spec-code drift -- code evolves, the specification does not, and the divergence becomes invisible until it is costly to repair. 
We present the Spec Growth Engine, a lightweight framework… Unlocking Model Potentials Through Adaptive Multi-Agent Scaffolding for Efficient Issue Resolution ../papers/arxiv-f7ba1bc50aef.html https://arxiv.org/abs/2606.25514v1#2026-06-25#terminal-and-swe-agents Thu, 25 Jun 2026 13:11:21 +0800 Resolving issues with ambiguous and incomplete descriptions, particularly concerning complex bugs, requires a sophisticated, long-horizon workflow. Agents must navigate codebases to locate the root cause, reproduce the failure, implement a fix, and validate the resulting patch. Inefficient context management, thereby, can lead to rapid context degradation and context poisoning, preventing successful resolution. We propose icat-agent, a decentralized, multi-agent scaffolding that replaces shared… Evaluating LLMs on Real-World Software Performance Optimization ../papers/arxiv-28c9e56c593c.html https://arxiv.org/abs/2606.25530v1#2026-06-25#terminal-and-swe-agents Thu, 25 Jun 2026 13:11:21 +0800 Software performance optimization is a notoriously complex and manual task. Despite the growing use of Large Language Models (LLMs) for code refinement, we still lack benchmarks that capture how optimization actually happens in real-world codebases. Existing frameworks often oversimplify the problem by focusing on isolated functions or a single performance metric, missing the critical trade-offs between execution time and memory footprint, the inherent noise of the measurement environment, and… SHERLOC: Structured Diagnostic Localization for Code Repair Agents ../papers/arxiv-b868687e026f.html https://arxiv.org/abs/2606.24820v1#2026-06-24#terminal-and-swe-agents Wed, 24 Jun 2026 13:06:49 +0800 LLM agents solve repository-level coding tasks through multi-turn tool use, but utilize half their budget on locating faults before editing. 
Dedicated localization frameworks have emerged, yet are still evaluated as file retrieval rather than actionable diagnosis, producing locations without the diagnostic context a repair agent needs. We introduce SHERLOC (Structured Hypothesis-driven Exploration and Reasoning for Localization), a training-free framework pairing a reasoning LLM with compact re… NatureBench: Can Coding Agents Match the Published SOTA of Nature-Family Papers? ../papers/arxiv-934bd8b79fae.html https://arxiv.org/abs/2606.24530v1#2026-06-24#terminal-and-swe-agents Wed, 24 Jun 2026 13:06:49 +0800 We introduce NatureBench, a cross-discipline benchmark of 90 tasks distilled from peer-reviewed Nature-family publications, designed to evaluate whether AI coding agents can move beyond reproduction toward discovery on real scientific problems. NatureBench is built on NatureGym, an automated pipeline that constructs a standardized, per-task containerized environment from a source paper, addressing the environment-fragmentation problem that has limited the credibility of prior agent-on-research… Bayesian control for coding agents ../papers/arxiv-1bf783e8f09b.html https://arxiv.org/abs/2606.24453v1#2026-06-24#terminal-and-swe-agents Wed, 24 Jun 2026 13:06:49 +0800 Modern coding agents pair LLM generators with various tools, including cheap diagnostics and expensive verifiers. The tool-use decisions are typically governed by orchestrators that often use fixed rules and ignore uncertainty. We formulate orchestration as cost-sensitive sequential hypothesis testing: a Bayesian controller maintains a belief over candidate correctness and dynamically decides whether to gather more evidence, refine the candidate, verify it, or stop. 
Across six generators and ni… Detecting AI Coding Agents in Open Source: A Validated Multi-Method Census of 180 Million Repositories ../papers/arxiv-1037309e2b3d.html https://arxiv.org/abs/2606.24429v1#2026-06-24#terminal-and-swe-agents Wed, 24 Jun 2026 13:06:49 +0800 Generative AI coding agents are entering the open-source supply chain, yet their diverse and often invisible traces leave their prevalence poorly understood. We introduce a multi-layered detection framework that integrates configuration-file scanning, commit-message analysis, author-identity matching, and bot-signature lookup across World of Code (180M+ Git repositories), classifying agent traces into four behavioral types. No single method captures more than a fraction of activity: multi-metho… LemonHarness Technical Report ../papers/arxiv-a79c559da3e4.html https://arxiv.org/abs/2606.24311v1#2026-06-24#terminal-and-swe-agents Wed, 24 Jun 2026 13:06:49 +0800 As large language model (LLM) agents are applied to longer tasks, they increasingly modify workspace state across multiple rounds of iteration. However, agents typically observe only tool outputs and log fragments, while the actual state changes occur in the file system. Without explicit workspace boundaries, state-changing operations such as file writes and temporary artifact generation may scatter changes across paths. Over time, these weakly constrained changes accumulate, making states such… Tmax: A simple recipe for terminal agents ../papers/arxiv-5b7989f3a4e0.html https://arxiv.org/abs/2606.23321v1#2026-06-23#terminal-and-swe-agents Tue, 23 Jun 2026 13:10:02 +0800 Terminal-using agents have quickly become the most popular downstream application of language models (LMs). Despite their prevalence, relatively little academic work has examined RL-based training of these models, likely due to difficult benchmarks, a lack of data, and a lack of simple baseline recipes. 
We present Tmax, the strongest open RL recipe for terminal agents to date, bringing open data recipes closer to the frontier. While simple, our recipe achieves 27% on Terminal-Bench 2.0 with on… Probe-and-Refine Tuning of Repository Guidance for Coding Agents ../papers/arxiv-b849fd15a901.html https://arxiv.org/abs/2606.20512v1#2026-06-19#terminal-and-swe-agents Fri, 19 Jun 2026 14:26:15 +0800 LLM-based coding agents need higher-level operational knowledge about a repository (which files house which subsystems, how to run the test suite, which workflows have historically led to wrong fixes) that does not exist in the code itself. Engineers typically maintain AGENTS.md files to supply this context as instructions for coding agents, but whether they help is contested: recent studies disagree on whether LLM-generated guidance improves or harms agent performance. In this paper w… Phoenix: Safe GitHub Issue Resolution via Multi-Agent LLMs ../papers/arxiv-59528d739abc.html https://arxiv.org/abs/2606.20243v1#2026-06-19#terminal-and-swe-agents Fri, 19 Jun 2026 14:26:15 +0800 We present Phoenix, a multi-agent LLM system that resolves GitHub issues from triage through pull-request creation, combining seven layered safety controls with a baseline-aware test evaluation strategy. Phoenix decomposes the work across six specialized agents (planner, reproducer, coder, tester, failure analyst, and pull request (PR) agent), all coordinated by a label-based GitHub webhook state machine. Every change is checked against a baseline test run before a pull request is opened. On a 24… N-Version Programming with Coding Agents ../papers/arxiv-9d86ae01851b.html https://arxiv.org/abs/2606.20158v1#2026-06-19#terminal-and-swe-agents Fri, 19 Jun 2026 14:26:15 +0800 This paper revisits the classical concept of N-version programming in the setting of contemporary AI coding agents. 
Revisiting the seminal Knight-Leveson experiment, we study whether diversity across agent systems, models, and implementation languages creates diverse failure modes. Using the Knight-Leveson Launch Interceptor Program Specification, we evaluate 48 agent-generated implementations on a shared oracle and a campaign of 1,000,000 randomized test inputs. The results show substantial… Data Intelligence Agents: Interpreting, Modeling, and Querying Enterprise Data via Autonomous Coding Agents ../papers/arxiv-c9634e5a40df.html https://arxiv.org/abs/2606.19319v1#2026-06-18#terminal-and-swe-agents Thu, 18 Jun 2026 14:03:08 +0800 Production data integration is bottlenecked by repeated, lossy handoffs between data owners, engineers, and analysts who must collaboratively discover, structure, and query enterprise data. We present Data Intelligence Agents (DIA), a system of three agents (Data Interpreter, Schema Creator, and Query Generator) that compresses this workflow by treating autonomous coding agents (ACAs) as a first-class abstraction: rather than emitting text, the agents generate, execute, validate, and repair con… All Smoke, No Alarm: Oracle Signals in Agent-Authored Test Code ../papers/arxiv-982cb530f375.html https://arxiv.org/abs/2606.18168v1#2026-06-17#terminal-and-swe-agents Wed, 17 Jun 2026 14:22:19 +0800 Software practitioners increasingly use AI coding agents that generate test code alongside production code in open source pull requests (PRs). Recent studies report more than 932,000 agent-authored PRs across more than 116,000 repositories, yet whether their test files contain meaningful verification logic remains underexplored. Test files lacking explicit assertions execute code without verifying behavior, so quality gates based on test-file presence overestimate verification strength. 
The goa… LoopCoder-v2: Only Loop Once for Efficient Test-Time Computation Scaling ../papers/arxiv-1ffe5e95cd6c.html https://arxiv.org/abs/2606.18023v1#2026-06-17#terminal-and-swe-agents Wed, 17 Jun 2026 14:22:19 +0800 Looped Transformers scale latent computation by repeatedly applying shared blocks, but sequential looping increases latency and KV-cache memory with the loop count. Parallel loop Transformers (PLT) alleviate this cost through cross-loop position offsets (CLP) and shared-KV gated sliding-window attention, making loop count a practical design choice. We therefore study PLT loop-count selection through a gain-cost view: an extra loop may refine representations, but CLP also introduces a positiona… VoidPadding: Let [VOID] Handle Padding in Masked Diffusion Language Models so that [EOS] Can Focus on Semantic Termination ../papers/arxiv-026cd79abeae.html https://arxiv.org/abs/2606.17999v1#2026-06-17#terminal-and-swe-agents Wed, 17 Jun 2026 14:22:19 +0800 MDLMs generate text by denoising a preallocated masked response canvas, making response-length modeling central to instruction tuning. Existing MDLMs often inherit the autoregressive convention of using repeated [EOS] tokens for padding during instruction tuning, giving [EOS] a dual role as both a semantic terminator and a padding token. We show that this dual role is a root cause of [EOS] overflow under large-block decoding. To decouple these roles, we propose VoidPa… GameCraft-Bench: Can Agents Build Playable Games End-to-End in a Real Game Engine? ../papers/arxiv-a08b927cb2ff.html https://arxiv.org/abs/2606.17861v1#2026-06-17#terminal-and-swe-agents Wed, 17 Jun 2026 14:22:19 +0800 Game generation is an emerging application of coding agents, requiring models to transform natural-language specifications into playable interactive systems. 
Unlike traditional coding tasks, game generation takes place within a game engine, where scripts, scenes, assets, rendering, and runtime interactions must jointly produce coherent gameplay. We formalize end-to-end game generation as the problem of producing a complete game artifact that realizes a specification through observable player-ga… Position: Coding Benchmarks Are Misaligned with Agentic Software Engineering ../papers/arxiv-e2056f1b32cd.html https://arxiv.org/abs/2606.17799v1#2026-06-17#terminal-and-swe-agents Wed, 17 Jun 2026 14:22:19 +0800 Coding agents have become a major mode of software engineering, but the benchmarks we use to compare them were designed in a pre-agent era: they collapse model, harness, and environment into a single end-to-end score, typically computed against one reference solution, with no component-level signal for iteration. We argue that current coding benchmarks are misaligned with agentic software engineering. A coding agent in practice is not a model: it is a system harness -- a composite of models, ha… Agent trajectories as programs: fingerprinting and programming coding-agent behavior ../papers/arxiv-1006c0328948.html https://arxiv.org/abs/2606.16988v1#2026-06-16#terminal-and-swe-agents Tue, 16 Jun 2026 14:38:43 +0800 Benchmark scores tell you what an agent got right; they do not tell you how it got there. In this work, we introduce methods for comparing agents procedurally in different contexts, where the model, tasks, and approaches vary. We compare ten agents and find that they are identifiable by their behavioral habits, which we define as fingerprints: a probe over these procedural signatures attributes an unseen trajectory to the correct agent at 85.7% accuracy, controlling for leakage across tasks. 
We… Towards LLM Accelerated Rapid Reviews for Software Tool Discovery -- Case for Log Anomaly Detection ../papers/arxiv-894d6a7c7127.html https://arxiv.org/abs/2606.16839v1#2026-06-16#terminal-and-swe-agents Tue, 16 Jun 2026 14:38:43 +0800 In software engineering research, the primary outcome is frequently a tool. However, for practitioners and academics alike, it is hard to tell which tools are maintained and whether they work out of the box. In this paper, we propose a pipeline to identify relevant studies with LLM screening, extract the tools presented in them, and run them with an LLM-based coding agent. To evaluate the feasibility of our approach, we focus on software log anomaly detection tools. We begin the study by designing a bro… No Resource, No Benchmarks, No Problem? Evaluating and Improving LLMs for Code Generation in No-Resource Languages ../papers/arxiv-3ac1fcf1ccb2.html https://arxiv.org/abs/2606.16827v1#2026-06-16#terminal-and-swe-agents Tue, 16 Jun 2026 14:38:43 +0800 Large Language Models (LLMs) have significantly advanced the automation of software engineering tasks. One prominent example is code generation, where an LLM produces code in a specified programming language based on a natural language description. Most research in this area has focused on high-resource languages, such as Python or Java, which benefit from abundant training data. A smaller body of work has explored low-resource languages, which are underrepresented in training corpora. In contr… Understanding the Rejection of Fixes Generated by Agentic Pull Requests -- Insights from the AIDev Dataset ../papers/doi-55309c0d716b.html https://arxiv.org/abs/2606.13468v1#2026-06-12#terminal-and-swe-agents Fri, 12 Jun 2026 13:55:02 +0800 AI coding agents are increasingly used to generate pull requests (PRs) that propose code fixes in software projects. 
From a first exploration of the AIDev dataset, we find that 46.41% of the fixes proposed by the agents Copilot, Devin, Cursor, and Claude are rejected. This represents a significant amount of wasted resources that require human reviews, verifications, and running tests and validations for fixes that are merely discarded. Our goal in this paper is to understand the failure modes… Recursive Agent Harnesses ../papers/arxiv-de2b00c4f07a.html https://arxiv.org/abs/2606.13643v1#2026-06-12#terminal-and-swe-agents Fri, 12 Jun 2026 13:55:02 +0800 Recursive language models (RLMs) showed that recursion over model calls is an effective strategy for long-context reasoning, and production coding agents have begun to write code that spawns subagents at scale, most recently in Anthropic's dynamic workflows. We name and study the pattern between these two lines of work, where the recursive unit is a full agent harness with filesystem tools, code execution, and planning rather than a model call with no tools. We call this the Recursive Agent Har… PROJECTMEM: A Local-First, Event-Sourced Memory and Judgment Layer for AI Coding Agents ../papers/arxiv-02d4273f4e6d.html https://arxiv.org/abs/2606.12329v1#2026-06-11#terminal-and-swe-agents Thu, 11 Jun 2026 13:59:12 +0800 AI coding assistants now support a growing share of software work, from quick scripts to production applications. Yet these agents remain largely stateless: each new session re-reads project files, re-derives prior decisions, and - most costly - may repeat debugging attempts that already failed. Reconstructing this context can consume an estimated 5,000-20,000 tokens per session; the bottleneck is often not model capability but missing project memory. 
We present projectmem, an open-source, loca… Exploration Structure in LLM Agents for Multi-File Change Localization ../papers/arxiv-011c1398e437.html https://arxiv.org/abs/2606.11976v1#2026-06-11#terminal-and-swe-agents Thu, 11 Jun 2026 13:59:12 +0800 Software engineering tools increasingly rely on LLM-based agents to localize files to change to resolve a software issue. Most AI agents explore repositories linearly, that is, visiting one directory or file per step. We postulate that this is a structural mismatch for changes that span several subsystems. We compare linear sequential exploration against non-linear, domain-scoped parallel agentic exploration. Using SWE Bench Pro as initial benchmark, we focus on ansible as an exemplar. We const… Agents All the Way Down; A Methodology for Building Custom AI Agents from Substrate to Production ../papers/arxiv-8e436ede4dfa.html https://arxiv.org/abs/2606.11869v1#2026-06-11#terminal-and-swe-agents Thu, 11 Jun 2026 13:59:12 +0800 Custom AI agents are agents that live inside their own application, talk to their own data and tools, enforce their own security boundaries, and carry their own brand and audit trail. What separates them from the general-purpose tier is fit, not capability: each is built for one job, by the engineer who will maintain it. No published practice sets out how to build one end to end. The pieces are everywhere (function-calling APIs, the Model Context Protocol, code agents to pair with), but the prac… Frontier Coding Agents Use Metaprogramming to Adapt to Unfamiliar Programming Languages ../papers/arxiv-9b497ddc554e.html https://arxiv.org/abs/2606.10933v1#2026-06-10#terminal-and-swe-agents Wed, 10 Jun 2026 13:25:04 +0800 LLM-based coding agents are usually evaluated in familiar software settings: mainstream languages, common libraries, and public repositories. These benchmarks remain important, but they can hide how agents behave when the language itself is unfamiliar. 
We evaluate six contemporary coding agents on four esoteric programming languages using a sequential setup with file editing, local execution, and hidden-test grading. Our protocol exposes capability differences between these agents that mainstre… AutoPDE: Reliable Agentic PDE Solving via Explicitly Represented Solver Strategies ../papers/arxiv-847112fe86c5.html https://arxiv.org/abs/2606.10752v1#2026-06-10#terminal-and-swe-agents Wed, 10 Jun 2026 13:25:04 +0800 Numerical solvers for partial differential equations (PDEs) are core computational tools in science and engineering. Building reliable PDE solvers requires not only executable code, but a numerical solver strategy, a set of decisions about discretization, stabilization, solver configuration, and resolution control, that matches the PDE structure. Recent LLM-based coding agents have begun to reduce the programming burden by generating and debugging solver implementations. However, they typically… DeNovoSWE: Scaling Long-Horizon Environments for Generating Entire Repositories from Scratch ../papers/arxiv-b9cf97bd5bd4.html https://arxiv.org/abs/2606.10728v1#2026-06-10#terminal-and-swe-agents Wed, 10 Jun 2026 13:25:04 +0800 As the capabilities of LLM-based code agents continue to advance, their expected role is expanding beyond localized bug fixing in existing codebases toward architecting and implementing complete software repositories from high-level specifications. However, training agents for such long-horizon software engineering tasks remains difficult due to the scarcity of large-scale, verifiable whole-repository generation data. 
In this paper, we introduce DeNovoSWE, a large-scale dataset for who… SIGA: Self-Evolving Coding-Agent Adapters for Scientific Simulation ../papers/arxiv-39d71777386c.html https://arxiv.org/abs/2606.09774v1#2026-06-09#terminal-and-swe-agents Tue, 09 Jun 2026 13:12:49 +0800 Advanced scientific simulators expose specialized input languages that turn simulation goals into executable configurations, but learning them can cost domain scientists hours to days. We study simulator setup as a problem of agent-tool interface grounding: what minimal simulator-specific adaptations are needed for an off-the-shelf coding agent to operate real scientific software? Our intuition is that coding agents already know how to navigate files, edit code, run commands, and repair outputs… From 0-to-1 to 1-to-N: Reproducible Engineering Evidence for MetaAI Recursive Self-Design ../papers/arxiv-7b524bd797b1.html https://arxiv.org/abs/2606.09663v1#2026-06-09#terminal-and-swe-agents Tue, 09 Jun 2026 13:12:49 +0800 Recursive self-design refers to AI-assisted modification of the mechanisms by which an AI system is built, evaluated, and improved. This paper treats MetaAI not as a mature paradigm, but as a working term for a human-seeded, AI-expanded development pattern in which the design space itself becomes a target of modification. We propose an operational evidence framework with four criteria: inspectable target system, meta-level modifier, feedback-directed selection, and recursive continuation. We th… Self-Harness: Harnesses That Improve Themselves ../papers/arxiv-8a577d36bd65.html https://arxiv.org/abs/2606.09498v1#2026-06-09#terminal-and-swe-agents Tue, 09 Jun 2026 13:12:49 +0800 The performance of LLM-based agents is jointly shaped by their base models and the harnesses that mediate their interaction with the environment. Because different models exhibit distinct behaviors, effective harness design is inherently model-specific. 
Yet agent harnesses are still largely engineered by human experts, a paradigm that scales poorly as modern LLMs become increasingly diverse and rapidly evolving. In this paper, we introduce Self-Harness, a new paradigm in which an LLM-based agen… ADK Arena: Evaluating Agent Development Kits via LLM-as-a-Developer ../papers/arxiv-9604b8229fbc.html https://arxiv.org/abs/2606.05548#2026-06-05#terminal-and-swe-agents Fri, 05 Jun 2026 13:25:00 +0800 The rapid proliferation of Agent Development Kits (ADKs), SDK-level frameworks for building LLM-powered autonomous agents, has outpaced any empirical understanding of how framework choice affects agent performance. We propose LLM-as-a-Developer, a methodology that replaces human developers with an LLM coding agent that learns each framework's API from documentation, writes agent code, and iteratively repairs it through a validate-and-feedback loop until tests pass. By holding the devel… Asuka-Bench: Benchmarking Code Agents on Underspecified User Intent and Multi-Round Refinement ../papers/arxiv-0fcde6e4a30d.html https://arxiv.org/abs/2606.05920#2026-06-05#terminal-and-swe-agents Fri, 05 Jun 2026 13:25:00 +0800 Existing code-generation benchmarks score a single mapping from a complete prompt to a one-shot output. However, real web development is different. Users seldom write a full spec at the start; many requirements only become clear once they look at an intermediate result and react to it. We present Asuka-Bench, a benchmark that pairs underspecified user intent with multi-round refinement, grounded in browser-rendered behavior. 
Each task is resolved through a closed loop: a Code Agent generates a… Knowledge Matters: Injecting Project and Testing Knowledge into LLM-based Unit Test Generation ../papers/arxiv-775fc4ccf6e9.html https://arxiv.org/abs/2511.14224#2026-06-05#terminal-and-swe-agents Fri, 05 Jun 2026 13:25:00 +0800 Automated unit test generation using large language models (LLMs) holds great promise but often struggles with generating tests that are both correct and maintainable in real-world projects. This paper presents KTester, a novel framework that integrates project-specific knowledge and testing domain knowledge to enhance LLM-based test generation. Our approach first extracts project structure and usage knowledge through static analysis, which provides rich context for the model. It then employs a… SmellBench: Towards Fine-Grained Evaluation of Code Agents on Refactoring Tasks ../papers/arxiv-5aa02ffeb813.html https://arxiv.org/abs/2606.05574#2026-06-05#terminal-and-swe-agents Fri, 05 Jun 2026 13:25:00 +0800 Code Agents have achieved remarkable advances in recent years, exhibiting strong capabilities across a wide range of software engineering tasks. However, their misuse often produces bloated and disorganized code that impairs readability, extensibility, and robustness. Despite this risk, existing benchmarks largely evaluate functional correctness rather than long-term maintainability of code agents. In this paper, we propose SmellBench, an extensible code refactoring benchmark that proactively… From Failed Trajectories to Reliable LLM Agents: Diagnosing and Repairing Harness Flaws ../papers/arxiv-c23696173c5a.html https://arxiv.org/abs/2606.06324#2026-06-05#terminal-and-swe-agents Fri, 05 Jun 2026 13:25:00 +0800 LLM-based agents increasingly rely on harnesses that provide execution environments, tool interfaces, context, lifecycle orchestration, observability, verification, and governance. 
Existing self-improving agents and automatic harness evolution methods mainly improve agents through runtime supervision, prompt optimization, workflow search, or harness modification based on final outcomes. However, they often fail to diagnose where the responsible evidence lies in failed trajectories and which har… RAT: RunAnyThing via Fully Automated Environment Configuration ../papers/arxiv-151b6da280b8.html https://arxiv.org/abs/2604.23190#2026-06-05#terminal-and-swe-agents Fri, 05 Jun 2026 13:25:00 +0800 Automating repository-level software engineering tasks is a foundational challenge for autonomous code agents, largely due to the difficulty of configuring executable environments. However, manual configuration remains a labor-intensive bottleneck, necessitating a transition toward fully automated environment configuration. Existing approaches often rely on pre-defined artifacts or are restricted to specific programming languages, limiting their applicability to diverse real-world repositories.… Closing the Loop on Latent Reasoning via Test-Time Reconstruction ../papers/arxiv-d8f49ccdc82d.html https://arxiv.org/abs/2606.06252#2026-06-05#terminal-and-swe-agents Fri, 05 Jun 2026 13:25:00 +0800 Recent work moves intermediate reasoning from natural-language traces into latent or cache-level representations to reduce token overhead and avoid a discrete communication bottleneck. However, this shift also removes a key advantage of textual reasoning: intermediate states are no longer inspectable, making it difficult to determine whether a latent state still preserves the constraints of the original query. 
As a result, latent reasoning typically operates in an open loop, where a latent stat… Code2LoRA: Hypernetwork-Generated Adapters for Code Language Models under Software Evolution ../papers/arxiv-4c01c0adea3f.html https://arxiv.org/abs/2606.06492#2026-06-05#terminal-and-swe-agents Fri, 05 Jun 2026 13:25:00 +0800 Code language models need repository-level context to resolve imports, APIs, and project conventions. Existing methods inject this knowledge as long inputs (retrieved through RAG or dependency analysis) or through per-repository fine-tuning and LoRA -- costly at repository scale and brittle to evolving codebases. We introduce Code2LoRA, a hypernetwork framework that generates repository-specific LoRA adapters, effectively injecting repository knowledge with zero inference-time token overhead. C… Coding with "Enemy": Can Human Developers Detect AI Agent Sabotage? ../papers/arxiv-9fdd4a7953d9.html https://arxiv.org/abs/2606.05647#2026-06-05#terminal-and-swe-agents Fri, 05 Jun 2026 13:25:00 +0800 AI coding agents are increasingly embedded in real-world software development, collaborating with human developers while gaining broader access to codebases and tools. This creates a new attack surface: an agent can exploit human trust to sabotage development, for instance by inserting malicious code to accomplish a hidden side task. 
Most prior work studies AI sabotage in AI-only settings, paying limited attention to the role of human oversight in detecting and mitigating such malicious behavio… Converted, Not Equivalent: Benchmarking Codebase Conversion via Observational Equivalence ../papers/arxiv-1abfb9f6a50a.html https://arxiv.org/abs/2605.29054#2026-06-05#terminal-and-swe-agents Fri, 05 Jun 2026 13:25:00 +0800 Coding agents increasingly act as codebase-scale collaborators that can assist with codebase conversion, but this progress has exposed a critical weakness: agents often over-trust their own local validation routines and declare success on artifacts that satisfy surface checks while violating the semantic contracts users actually care about. This problem is especially acute in codebase conversion, where prior evaluation is largely outcome-driven and therefore unstable: two implementations can ma… Latent Anchor-Driven Test Generation for Deep Neural Networks ../papers/arxiv-c7c5bfa9c58b.html https://arxiv.org/abs/2606.04310#2026-06-04#terminal-and-swe-agents Thu, 04 Jun 2026 14:02:06 +0800 Deep Neural Networks (DNNs) are increasingly being deployed in security-critical and safety-sensitive applications, which makes rigorous testing essential to identify and mitigate model weaknesses. Existing DNN testing approaches explore either the input space or a learned latent space. While latent-space generation can better maintain plausibility than direct input-space mutation, current methods still face a trade-off among exploration controllability, failure diversity, and seed-relative sem… Can Generalist Agents Automate Data Curation? ../papers/arxiv-4de315772446.html https://arxiv.org/abs/2606.04261#2026-06-04#terminal-and-swe-agents Thu, 04 Jun 2026 14:02:06 +0800 Curating training data is among the most consequential yet labor-intensive parts of modern AI development: practitioners iteratively propose, implement, evaluate, and revise data policies against noisy benchmark feedback. 
We ask whether generalist coding agents can automate this data-curation loop. We introduce Curation-Bench, an agent-centric benchmark that fixes the model, training recipe, and evaluation suite while giving agents command-line access to inspect data, implement policies, subm…