Keyword Tracking

Keyword tracking: code editing

This page tracks the keywords in your configuration over time and archives the matching papers by date.

Recent trend

The most recent hit came from Agent Runtime Security: WeaveBench: A Long-Horizon, Real-World Benchmark for Computer-Use Agents with Hybrid Interfaces

[Trend chart; date axis from 2026-06-15 to 2026-06-28]

Hit details

Review, by date, the paper titles that matched this keyword, with the source feed retained for each.

2026-06-09

2026-06-09 13:12:49 (Asia/Shanghai)

Agent Runtime Security

WeaveBench: A Long-Horizon, Real-World Benchmark for Computer-Use Agents with Hybrid Interfaces

Computer-use agents (CUAs) increasingly operate in runtimes that combine visual desktop control, command-line execution, code editing, browsers, and external tools. Existing bench…

2026-05-15

2026-05-15 14:57:29 (Asia/Shanghai)

Terminal and SWE Agents

Comparing Developer and LLM Biases in Code Evaluation

As LLMs are increasingly used as judges in code applications, they should be evaluated in realistic interactive settings that capture partial context and ambiguous intent. We pres…

2026-05-12

2026-05-12 12:42:08 (Asia/Shanghai)

BenchCAD: A Comprehensive, Industry-Standard Benchmark for Programmatic CAD

Industrial Computer-Aided Design (CAD) code generation requires models to produce executable parametric programs from visual or textual inputs. Beyond recognizing the outer shape…

2026-04-29

2026-04-29 12:26:28 (Asia/Shanghai)

SAFEdit: Does Multi-Agent Decomposition Resolve the Reliability Challenges of Instructed Code Editing?

Instructed code editing is a significant challenge for large language models (LLMs). On the EditBench benchmark, 39 of 40 evaluated models obtain a task success rate (TSR) below 6…