Terminal-Bench 2.0 launches alongside Harbor, a new platform for testing agents in containers

The creators of Terminal-Bench, a benchmark suite for assessing the performance of autonomous AI agents on real terminal tasks, have published version 2.0 alongside Harbor, a new platform for testing, improving, and optimizing AI agents in container environments.

The dual release aims to address long-standing issues in testing and optimizing AI agents, particularly those built to operate autonomously in realistic development environments.

With a tougher, rigorously validated task set, Terminal-Bench 2.0 replaces version 1.0 as the standard for assessing the capabilities of frontier models.

Harbor, a companion runtime, lets developers and researchers scale evaluations across thousands of cloud containers and integrates with both open-source and proprietary agents and training pipelines.

“Harbor is the package we wish we had when we built Terminal-Bench,” contributor Alex Shaw wrote on X. “It is intended for agent, model and benchmark developers and researchers who want to evaluate and improve agents and models.”

Higher bar, cleaner data

Terminal-Bench 1.0 was quickly adopted after its release in May 2025, becoming the default benchmark for assessing AI agents operating in developer-like terminal environments. These agents interact with systems via the command line, mimicking the way developers work without a graphical user interface.

However, its broad scope came with inconsistencies. The community identified several tasks as poorly defined or unstable due to changes in external services.

Version 2.0 addresses these issues directly. The updated suite contains 89 tasks, each subjected to several hours of manual and LLM-assisted validation. The emphasis is on making tasks solvable, realistic, and clearly defined, raising the difficulty level while improving reliability and reproducibility.

A notable example is the download-youtube task, which was removed or refactored in 2.0 because of its dependency on unstable third-party APIs.

“Astute Terminal-Bench fans may notice that SOTA performance is comparable to TB1.0, despite our claim that TB2.0 is more difficult,” Shaw wrote on X. “We think this is because the quality of the tasks is much higher in the new benchmark.”

Harbor: Unified deployments at scale

With the benchmark update, the team launched Harbor, a new platform for running and evaluating agents in containers deployed in the cloud.

Harbor supports large-scale deployment infrastructure, with compatibility with major vendors such as Daytona and Modal.

Designed to generalize across agent architectures, Harbor supports:

  • Evaluation of any agent installed in a container

  • Scalable supervised fine-tuning (SFT) and reinforcement learning (RL) pipelines

  • Creation and deployment of custom benchmarks

  • Full integration with Terminal-Bench 2.0

Harbor was used internally for tens of thousands of runs while the new benchmark was being built. It is now publicly available via harborframework.com, with documentation for testing and submitting agents to the public leaderboard.
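As a rough sketch, getting started might look like the following. The package name and the local-run invocation are assumptions for illustration – only the flags that also appear in the submission command later in this article come from the source – so check the official documentation for exact syntax.

    # Sketch only: the PyPI package name is an assumption, not confirmed here.
    pip install harbor

    # Evaluate an agent against a benchmark dataset; -d, -m, and -a mirror
    # the submission command quoted below, with illustrative placeholders.
    harbor run -d <dataset> -m "<model>" -a "<agent>"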

Early Results: GPT-5 Leads in Task Success

Preliminary results from the Terminal-Bench 2.0 leaderboard show OpenAI’s Codex CLI (command line interface) agent, in its GPT-5 variant, at the top with a success rate of 49.6% – the highest of any agent tested so far.

Closely behind are other GPT-5 variants and agents based on Claude Sonnet 4.5.

Results of the top 5 agents (Terminal-Bench 2.0):

  1. Codex CLI (GPT-5) – 49.6%

  2. Codex CLI (GPT-5 Codex) – 44.3%

  3. OpenHands (GPT-5) – 43.8%

  4. Terminus 2 (GPT-5 Codex) – 43.4%

  5. Terminus 2 (Claude Sonnet 4.5) – 42.8%

The tight clustering at the top indicates lively competition between platforms, with no single agent solving more than half of the tasks.

Submission and Use

To test or submit an agent, users install Harbor and run the benchmark with a few CLI commands. Leaderboard submissions require five attempts, and results can be emailed to the developers along with job directories for review.

harbor run -d <dataset> -m "<model>" -a "<agent>" --n-attempts 5 --jobs-dir <path>
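Here, assuming the reconstruction above (the command as originally published appears garbled), -d selects the benchmark dataset, -m and -a name the model and agent under test, --n-attempts 5 satisfies the five-attempt submission requirement, and --jobs-dir sets where job artifacts are written so they can be shared for review.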

Terminal-Bench 2.0 is already being integrated into research workflows focused on agentic reasoning, code generation, and tool use. According to co-creator Mike Merrill, a postdoctoral researcher at Stanford University, a detailed preprint covering the verification process and design methodology behind the benchmark is underway.

Striving for standardization

The combined release of Terminal-Bench 2.0 and Harbor is a step towards a more consistent and scalable agent evaluation infrastructure. As LLM agents have proliferated across development and operational environments, the need for controlled, repeatable testing has increased.

These tools offer a potential foundation for a unified evaluation stack – supporting model improvement, environment simulation, and benchmark standardization across the AI ecosystem.
