SWE-rebench
SWE-rebench is an open-source benchmark and dataset of real GitHub issues with their ground-truth patches, used to evaluate large language models on software-engineering tasks such as bug fixing. It provides the tasks and an evaluation harness that researchers and practitioners use to measure model performance on realistic code-repair problems.
1 story
First: 2026-03-07
Last: 2026-03-08
Stories
Completed digest stories linked to this service.
- Benchmarks Are Breaking: Evaluate LLMs in Your Harness, Not Theirs (2026-03-07): LLM benchmark scores are failing under real-world conditions, so choose and tune models by testing them in you...