SWE-rebench
SWE-rebench is an open-source benchmark and dataset of real GitHub issues with their ground-truth patches, used to evaluate large language models on software-engineering tasks such as bug fixing. It provides the tasks and an evaluation harness that researchers and practitioners use to measure model performance on realistic code-repair problems.
1 story
First: 2026-03-07
Last: 2026-03-08
Stories
Completed digest stories linked to this service.
- Benchmarks Are Breaking: Evaluate LLMs in Your Harness, Not Theirs (2026-03-07): LLM benchmark scores are failing under real-world conditions, so choose and tune models by testing them in you...