<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
<channel>
<title>terminal bench Topic Archive</title>
<link>terminal-bench.html</link>
<description>Long-term tracking RSS feed for the keyword "terminal bench", collecting every paper that has matched it to date.</description>
<language>en</language>
<lastBuildDate>Sun, 28 Jun 2026 05:24:06 +0000</lastBuildDate>
<item>
<title>What Makes a Good Terminal-Agent Benchmark Task: A Guideline for Adversarial, Difficult, and Legible Evaluation Design</title>
<link>../papers/arxiv-043256a25d59.html</link>
<guid>https://arxiv.org/abs/2604.28093v1#2026-05-01#terminal-bench</guid>
<pubDate>Fri, 01 May 2026 12:53:56 +0800</pubDate>
<description>Terminal-agent benchmarks have become a primary signal for measuring the coding and system-administration capabilities of large language models. As the market for evaluation environments grows, so does the pressure to ship tasks quickly, often without thorough adversarial review of the verification logic. This paper is a guideline for writing good benchmark tasks, drawn from over a year of contributing to and reviewing tasks for Terminal Bench. Most people write benchmark tasks the way they wri…</description>
</item>
</channel>
</rss>
