<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
<channel>
<title>code editing Topic Archive</title>
<link>code-editing.html</link>
<description>Long-term tracking RSS feed for the keyword "code editing", aggregating all previously matched papers.</description>
<language>en</language>
<lastBuildDate>Sun, 28 Jun 2026 05:24:06 +0000</lastBuildDate>
<item>
<title>WeaveBench: A Long-Horizon, Real-World Benchmark for Computer-Use Agents with Hybrid Interfaces</title>
<link>../papers/arxiv-df62d5981d92.html</link>
<guid>https://arxiv.org/abs/2606.09426v1#2026-06-09#code-editing</guid>
<pubDate>Tue, 09 Jun 2026 13:12:49 +0800</pubDate>
<description>Computer-use agents (CUAs) increasingly operate in runtimes that combine visual desktop control, command-line execution, code editing, browsers, and external tools. Existing benchmarks, however, often evaluate these interfaces as separable capabilities, leaving long-horizon cross-interface orchestration under-tested. Thus, we introduce WeaveBench, a long-horizon hybrid-interface benchmark with 114 tasks across 8 real-world work domains, grounded in real user requests and publicly verifiable art…</description>
</item>
<item>
<title>Comparing Developer and LLM Biases in Code Evaluation</title>
<link>../papers/arxiv-4178c7a2eb59.html</link>
<guid>https://arxiv.org/abs/2603.24586#2026-05-15#code-editing</guid>
<pubDate>Fri, 15 May 2026 14:57:29 +0800</pubDate>
<description>As LLMs are increasingly used as judges in code applications, they should be evaluated in realistic interactive settings that capture partial context and ambiguous intent. We present TRACE (Tool for Rubric Analysis in Code Evaluation), a framework that evaluates LLM judges&#x27; ability to predict human preferences and automatically extracts rubric items to reveal systematic biases in how humans and models weigh each item. Across three modalities -- chat-based programming, IDE autocompletion, and in…</description>
</item>
<item>
<title>BenchCAD: A Comprehensive, Industry-Standard Benchmark for Programmatic CAD</title>
<link>../papers/arxiv-b10cb6181b6f.html</link>
<guid>https://arxiv.org/abs/2605.10865v1#2026-05-12#code-editing</guid>
<pubDate>Tue, 12 May 2026 12:42:08 +0800</pubDate>
<description>Industrial Computer-Aided Design (CAD) code generation requires models to produce executable parametric programs from visual or textual inputs. Beyond recognizing the outer shape of a part, this task involves understanding its 3D structure, inferring engineering parameters, and choosing CAD operations that reflect how the part would be designed and manufactured. Despite the promise of multimodal large language models (MLLMs) for this task, they are rarely evaluated on whether these capabilities…</description>
</item>
<item>
<title>SAFEdit: Does Multi-Agent Decomposition Resolve the Reliability Challenges of Instructed Code Editing?</title>
<link>../papers/arxiv-d738d9ec9beb.html</link>
<guid>https://arxiv.org/abs/2604.25737v1#2026-04-29#code-editing</guid>
<pubDate>Wed, 29 Apr 2026 12:26:28 +0800</pubDate>
<description>Instructed code editing is a significant challenge for large language models (LLMs). On the EditBench benchmark, 39 of 40 evaluated models obtain a task success rate (TSR) below 60 percent, highlighting a gap between general code generation and the ability to perform instruction-driven editing under executable test constraints. To address this, we propose SAFEdit, a multi-agent framework for instructed code editing that decomposes the editing process into specialized roles to improve reliabilit…</description>
</item>
</channel>
</rss>
