terminal bench Topic Archive

terminal-bench.html
Long-term tracking RSS feed for the keyword "terminal bench", collecting every paper that has matched it.
zh-CN
Sun, 28 Jun 2026 05:24:06 +0000

What Makes a Good Terminal-Agent Benchmark Task: A Guideline for Adversarial, Difficult, and Legible Evaluation Design
../papers/arxiv-043256a25d59.html
https://arxiv.org/abs/2604.28093v1#2026-05-01#terminal-bench
Fri, 01 May 2026 12:53:56 +0800

Terminal-agent benchmarks have become a primary signal for measuring the coding and system-administration capabilities of large language models. As the market for evaluation environments grows, so does the pressure to ship tasks quickly, often without thorough adversarial review of the verification logic. This paper is a guideline for writing good benchmark tasks, drawn from over a year of contributing to and reviewing tasks for Terminal Bench. Most people write benchmark tasks the way they wri…