SWE bench Topic Archive

swe-bench.html
Long-term tracking RSS feed for the keyword "SWE bench", collecting every paper that has matched it.
zh-CN
Sun, 28 Jun 2026 05:24:06 +0000

Exploration Structure in LLM Agents for Multi-File Change Localization
../papers/arxiv-011c1398e437.html
https://arxiv.org/abs/2606.11976v1#2026-06-11#swe-bench
Thu, 11 Jun 2026 13:59:12 +0800
Software engineering tools increasingly rely on LLM-based agents to localize the files that must change to resolve a software issue. Most AI agents explore repositories linearly, that is, visiting one directory or file per step. We postulate that this is a structural mismatch for changes that span several subsystems. We compare linear sequential exploration against non-linear, domain-scoped parallel agentic exploration. Using SWE Bench Pro as an initial benchmark, we focus on ansible as an exemplar. We const…

From 0-to-1 to 1-to-N: Reproducible Engineering Evidence for MetaAI Recursive Self-Design
../papers/arxiv-7b524bd797b1.html
https://arxiv.org/abs/2606.09663v1#2026-06-09#swe-bench
Tue, 09 Jun 2026 13:12:49 +0800
Recursive self-design refers to AI-assisted modification of the mechanisms by which an AI system is built, evaluated, and improved. This paper treats MetaAI not as a mature paradigm, but as a working term for a human-seeded, AI-expanded development pattern in which the design space itself becomes a target of modification. We propose an operational evidence framework with four criteria: inspectable target system, meta-level modifier, feedback-directed selection, and recursive continuation. We th…

Calibrating Conservatism for Scalable Oversight
../papers/arxiv-ee8f82c8687d.html
https://arxiv.org/abs/2605.28807v1#2026-05-28#swe-bench
Thu, 28 May 2026 13:15:52 +0800
Agentic AI systems capable of autonomous planning and extended environmental interaction pose a fundamental control problem: how can humans maintain meaningful oversight of systems that may exceed their own capabilities? Existing approaches to scalable oversight rely on complex assumptions, remain largely heuristic, or lack practical methods for sequential settings with statistical guarantees. We introduce Calibrated Collective Oversight (CCO), which aggregates diverse auxiliary scoring functio…