code agent Topic Archive

Long-term tracking RSS feed for the keyword "code agent", collecting all previously matched papers.
code-agent.html
zh-CN
Sun, 28 Jun 2026 05:24:06 +0000

How Much Static Structure Do Code Agents Need? A Study of Deterministic Anchoring
../papers/arxiv-b4fa44958d89.html
https://arxiv.org/abs/2606.26979v1#2026-06-26#code-agent
Fri, 26 Jun 2026 13:16:53 +0800
LLM-based code agents navigate repositories through keyword search but miss the structural relationships, such as call graphs, inheritance hierarchies, and configuration dependencies, that define how software actually works. This makes agent navigation stochastic and difficult to reproduce across runs. We investigate whether lightweight static analysis can provide deterministic anchors for these agents: stable structural facts injected as plain-text comments that constrain probabilistic explora…

Agents All the Way Down; A Methodology for Building Custom AI Agents from Substrate to Production
../papers/arxiv-8e436ede4dfa.html
https://arxiv.org/abs/2606.11869v1#2026-06-11#code-agent
Thu, 11 Jun 2026 13:59:12 +0800
Custom AI agents are agents that live inside their own application, talk to their own data and tools, enforce their own security boundaries, and carry their own brand and audit trail. What separates them from the general-purpose tier is fit, not capability: each is built for one job, by the engineer who will maintain it. No published practice sets out how to build one end to end. The pieces are everywhere (function-calling APIs, the Model Context Protocol, code agents to pair with), but the prac…

AutoPDE: Reliable Agentic PDE Solving via Explicitly Represented Solver Strategies
../papers/arxiv-847112fe86c5.html
https://arxiv.org/abs/2606.10752v1#2026-06-10#code-agent
Wed, 10 Jun 2026 13:25:04 +0800
Numerical solvers for partial differential equations (PDEs) are core computational tools in science and engineering. Building reliable PDE solvers requires not only executable code, but a numerical solver strategy, a set of decisions about discretization, stabilization, solver configuration, and resolution control, that matches the PDE structure. Recent LLM-based coding agents have begun to reduce the programming burden by generating and debugging solver implementations. However, they typically…

DeNovoSWE: Scaling Long-Horizon Environments for Generating Entire Repositories from Scratch
../papers/arxiv-b9cf97bd5bd4.html
https://arxiv.org/abs/2606.10728v1#2026-06-10#code-agent
Wed, 10 Jun 2026 13:25:04 +0800
As the capabilities of LLM-based code agents continue to advance, their expected role is expanding beyond localized bug fixing in existing codebases toward architecting and implementing complete software repositories from high-level specifications. However, training agents for such long-horizon software engineering tasks remains difficult due to the scarcity of large-scale, verifiable whole-repository generation data. In this paper, we introduce DeNovoSWE, a large-scale dataset for who…

Asuka-Bench: Benchmarking Code Agents on Underspecified User Intent and Multi-Round Refinement
../papers/arxiv-0fcde6e4a30d.html
https://arxiv.org/abs/2606.05920#2026-06-05#code-agent
Fri, 05 Jun 2026 13:25:00 +0800
Existing code-generation benchmarks score a single mapping from a complete prompt to a one-shot output. However, real web development is different. Users seldom write a full spec at the start; many requirements only become clear once they look at an intermediate result and react to it. We present Asuka-Bench, a benchmark that pairs underspecified user intent with multi-round refinement, grounded in browser-rendered behavior. Each task is resolved through a closed loop: a Code Agent generates a…

SmellBench: Towards Fine-Grained Evaluation of Code Agents on Refactoring Tasks
../papers/arxiv-5aa02ffeb813.html
https://arxiv.org/abs/2606.05574#2026-06-05#code-agent
Fri, 05 Jun 2026 13:25:00 +0800
Code Agents have achieved remarkable advances in recent years, exhibiting strong capabilities across a wide range of software engineering tasks. However, their misuse often produces bloated and disorganized code that impairs readability, extensibility, and robustness. Despite this risk, existing benchmarks largely evaluate functional correctness rather than long-term maintainability of code agents. In this paper, we propose SmellBench, an extensible code refactoring benchmark that proactively…

RAT: RunAnyThing via Fully Automated Environment Configuration
../papers/arxiv-151b6da280b8.html
https://arxiv.org/abs/2604.23190#2026-06-05#code-agent
Fri, 05 Jun 2026 13:25:00 +0800
Automating repository-level software engineering tasks is a foundational challenge for autonomous code agents, largely due to the difficulty of configuring executable environments. However, manual configuration remains a labor-intensive bottleneck, necessitating a transition toward fully automated environment configuration. Existing approaches often rely on pre-defined artifacts or are restricted to specific programming languages, limiting their applicability to diverse real-world repositories.…

The Meta-Agent Challenge: Are Current Agents Capable of Autonomous Agent Development?
../papers/arxiv-301ad9c78a6d.html
https://arxiv.org/abs/2606.04455#2026-06-04#code-agent
Thu, 04 Jun 2026 14:02:06 +0800
Current AI benchmarks evaluate agents on task execution within human-designed workflows. These evaluations fundamentally fail to measure a critical next-level capability: whether models can autonomously develop agent systems. We introduce the Meta-Agent Challenge (MAC), an evaluation framework designed to test the capacity of frontier models for autonomous agent development. Specifically, a code agent (the meta-agent) is given a sandboxed environment, an evaluation API, and a time limitation to…

What Makes Interaction Trajectories Effective for Training Terminal Agents?
../papers/arxiv-d30ae188c67b.html
https://arxiv.org/abs/2606.03461#2026-06-03#code-agent
Wed, 03 Jun 2026 14:09:56 +0800
Stronger code agents are commonly assumed to be superior teachers for post-training, yet this assumption remains poorly disentangled from task difficulty, harness design, and student capacity. We investigate this pedagogical link using Terminal-Lego, a scalable pipeline that transforms multi-domain real-world issues into environment-verified agentic tasks. Surprisingly, standalone performance does not dictate teaching efficacy: while Claude Opus 4.6 achieves higher scores on Terminal-Bench 2.0,…

Cross-Lingual Token Arbitrage: Optimizing Code Agent Context Windows via Local LLM Preprocessing
../papers/arxiv-7b2eb833fcd5.html
https://arxiv.org/abs/2606.03618#2026-06-03#code-agent
Wed, 03 Jun 2026 14:09:56 +0800
AI-assisted coding agents are bottlenecked by input-token cost. Two pathologies of raw human input drive much of this overhead: tokenization inefficiency for non-English text and structural entropy in conversational prompts. Existing approaches act reactively by compressing already-bloated contexts or intervening after failures occur. We introduce a pre-flight, edge-side prompt-rewriting middleware that operates between the developer and the cloud agent. A local Llama 3.2 (3B) model performs cr…

CRANE: Constrained Reasoning Injection for Code Agents via Nullspace Editing
../papers/arxiv-646dedf0e4c5.html
https://arxiv.org/abs/2605.14084#2026-05-15#code-agent
Fri, 15 May 2026 14:57:29 +0800
Code agents must both reason over long-horizon repository state and obey strict tool-use protocols. In paired Instruct/Thinking checkpoints, these capabilities are complementary but misaligned. The Instruct model is concise and tool-disciplined, whereas the Thinking model offers stronger planning and recovery behavior but often over-deliberates and degrades agent performance. We present CRANE (Constrained Reasoning Injection for Code Agents via Nullspace Editing), a training-free parameter-edit…