Terminal Bench
Terminal Bench is an open benchmark repository that tests how well large language models can execute and reason about real Linux command-line tasks inside a sandboxed terminal. Researchers and practitioners use it to compare agentic coding models, such as OpenAI Codex and Anthropic Claude, on end-to-end, tool-using workflows.