AUTORESEARCH PUB_DATE: 2026.04.13

KARPATHY’S 630‑LINE AUTORESEARCH AGENT SHOWS DOUBLE‑DIGIT GAINS FROM FULLY AUTOMATED EXPERIMENT LOOPS

Andrej Karpathy open-sourced a 630-line AutoResearch agent that runs ML experiments autonomously and squeezed double-digit gains out of “well-tuned” code.

In a two-day run, the agent tried about 700 experiments, found 20 real improvements, and cut the “Time to GPT‑2” benchmark from 2.02 hours to 1.80 hours, an 11% efficiency win, per a Business Analytics Review write-up of the launch and results. The pattern: a fixed evaluator, a codebase the agent can safely modify, and a human stating the research goal.
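
That pattern can be captured in a few lines. The sketch below is a minimal, hedged illustration of a fixed-evaluator loop over a mutable set of knobs; `evaluate` and `propose_change` are hypothetical stand-ins for the real evaluator and the agent's code edits, not Karpathy's actual API.

```python
import random

def evaluate(params):
    # Fixed evaluator: lower is better (think "hours to reach a loss target").
    # This toy objective is an illustrative assumption, not the real benchmark.
    return (params["lr"] - 0.01) ** 2 + (params["width"] - 512) ** 2 / 1e6

def propose_change(params, rng):
    # Stand-in for the agent's proposal step: perturb one knob at random.
    candidate = dict(params)
    key = rng.choice(list(candidate))
    candidate[key] *= rng.uniform(0.8, 1.25)
    return candidate

def run_loop(budget, params, seed=0):
    # The loop: propose, score against the immutable evaluator, keep only wins.
    rng = random.Random(seed)
    best = evaluate(params)
    improvements = 0
    for _ in range(budget):
        candidate = propose_change(params, rng)
        score = evaluate(candidate)
        if score < best:  # confirmed improvement: adopt the candidate
            params, best, improvements = candidate, score, improvements + 1
    return params, best, improvements
```

The key design choice mirrored here is that the evaluator never changes during the run, so every accepted candidate is a genuine, comparable win rather than a moving-target artifact.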

Early adopters reported similar wins. Shopify’s Tobi Lütke saw a 19% gain on a 0.8B model from 37 overnight trials, and Red Hat ran 198 hands-off experiments on OpenShift; the repo’s star count also exploded, per the same Business Analytics Review report. One practitioner shared notes from an overnight run on HackerNoon.

This isn’t just hyperparameter flailing. The agent proposed structural code fixes, such as attention-sharpening multipliers and value-embedding regularization, that humans might not try in one sitting, per Business Analytics Review.

[ WHY_IT_MATTERS ]
01.

Autonomous iteration compresses research loops and surfaces structural code changes that beat manual tweaking.

02.

The evaluator/agent pattern generalizes to any measurable score, from fine-tuning to prompt and pipeline optimization.

[ WHAT_TO_TEST ]
  • terminal

    Run a 12–24 hour agent loop on a sandboxed model or pipeline with an immutable evaluator and strict guardrails, then compare wins versus your human-tuned baseline.

  • terminal

    Schedule runs under your existing orchestrator (Kubernetes/Slurm/Airflow), logging cost per confirmed improvement and regression rate to decide promotion thresholds.
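
The bookkeeping in the second test item can be sketched directly. This is an illustrative assumption about what your orchestrator's per-run logs contain; the field names (`gpu_hours`, `delta`) are hypothetical.

```python
def summarize(runs, gpu_hour_cost):
    """Summarize agent runs for a promotion decision.

    runs: list of dicts with 'gpu_hours' (float) and 'delta'
    (metric change vs. baseline; positive means improvement).
    """
    total_cost = sum(r["gpu_hours"] for r in runs) * gpu_hour_cost
    wins = [r for r in runs if r["delta"] > 0]
    regressions = [r for r in runs if r["delta"] < 0]
    return {
        # Dollars burned per confirmed improvement (guard against zero wins).
        "cost_per_confirmed_improvement": total_cost / max(len(wins), 1),
        # Fraction of runs that made things worse.
        "regression_rate": len(regressions) / len(runs),
    }
```

For example, four runs totaling 8 GPU-hours at $4/hour with two wins yields a cost per confirmed improvement of $16; tracking that number over time is what lets you set a defensible promotion threshold.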

[ BROWNFIELD_PERSPECTIVE ]

Legacy codebase integration strategies...

  • 01.

    Wire the agent into CI/CD and experiment tracking so diffs, metrics, and rollbacks are auditable; gate merges on evaluator pass rates.

  • 02.

    Enforce compute quotas and safe I/O: read-only datasets, write access to a scratch fork, secrets via your standard vault.
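
The safe-I/O bullet above can be sketched as a small sandbox setup: the agent gets a writable fork of the repo in scratch space while the canonical dataset is stripped of write permissions. Paths and layout here are illustrative assumptions, not a prescribed structure.

```python
import os
import shutil
import stat
import tempfile

def make_sandbox(dataset_dir, repo_dir):
    """Give the agent a scratch fork to edit; lock the dataset read-only."""
    scratch = tempfile.mkdtemp(prefix="agent-scratch-")
    work_repo = os.path.join(scratch, "repo")
    shutil.copytree(repo_dir, work_repo)  # agent edits the fork, never the source
    for root, _, files in os.walk(dataset_dir):
        for name in files:
            path = os.path.join(root, name)
            # Strip write bits: owner/group/other can read, nobody can write.
            os.chmod(path, stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH)
    return work_repo
```

Filesystem permissions are a floor, not a substitute for the quota and vault controls named above, but they make the common failure mode (an agent "helpfully" rewriting training data) a hard error instead of a silent corruption.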

[ GREENFIELD_PERSPECTIVE ]

Fresh architecture paradigms...

  • 01.

    Design an evaluation-first repo: isolate the evaluator, expose tunable surfaces, and ship a minimal scorer plus agent loop from day one.

  • 02.

    Bake in guardrails: fixed data splits, deterministic seeds, budget caps, and a clear promotion policy for candidate changes.
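
The guardrails above compose naturally into one wrapper. The sketch below assumes a `trial_fn(rng)` callable that returns a score where lower is better; the 1% promotion margin and the shape of the policy are illustrative assumptions, not a recommendation.

```python
import random
import time

def guarded_run(trial_fn, n_trials, seconds_budget, promote_margin=0.01):
    """Run trials under deterministic seeds, a hard time budget,
    and an explicit promotion policy for candidate changes."""
    baseline = trial_fn(random.Random(0))  # fixed seed: reproducible baseline
    promoted = []
    start = time.monotonic()
    for i in range(1, n_trials + 1):
        if time.monotonic() - start > seconds_budget:
            break  # budget cap: stop cleanly, never overrun
        score = trial_fn(random.Random(i))  # one deterministic seed per trial
        if score < baseline * (1 - promote_margin):
            promoted.append((i, score))  # clear, pre-stated promotion rule
    return baseline, promoted
```

Because every trial gets a fixed seed derived from its index, a rerun reproduces the exact same promotion list, which is what makes candidate changes auditable after the fact.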
