Recursive, an AI research lab focused on automated scientific discovery, published results Wednesday showing its automated research system reached state-of-the-art performance on three real machine-learning engineering benchmarks, beating results produced by communities of human researchers.
The system runs the complete research loop autonomously: propose a hypothesis, implement it, run an experiment, validate the result, and use what it learned to choose the next experiment. It runs many parallel threads, retains useful context across runs, and applies reward-hack detection to ensure genuine improvements rather than benchmark exploits.
The three benchmarks it cleared are NanoChat Autoresearch (train the best small language model within a fixed five-minute, single-GPU compute budget), NanoGPT Speedrun (train a small model to a fixed validation loss as fast as possible), and SOL-ExecBench (write GPU kernels that approach hardware speed limits across 235 tasks). Recursive reports improvements on all three: a 0.0263 bits-per-byte gain on NanoChat that translates to roughly 1.3x speedup to reach equivalent quality; a 2.2-second reduction in NanoGPT Speedrun training time from a 79.7-second baseline that had been refined by 83 human contributors over two years; and an 18% reduction in the gap to the estimated hardware performance ceiling on SOL-ExecBench.
The reason these three problems are tractable for an automated system is also the reason the system does not yet generalize. Each benchmark offers a clear, low-variance metric and a fast evaluation loop. The automated system can iterate hundreds of times against an objective it can measure precisely. That is exactly the condition under which automated search outperforms human intuition: when the feedback is unambiguous and the iteration cost is low.
What the system found on NanoChat illustrates this well. Its best solution was not a single trick. It combined architecture changes, hashed bigram and trigram embedding tables mixed into the attention value path, auxiliary losses, weight decay schedule adjustments, and compiler settings. No single change dominated; the gains compounded. A human researcher with limited time to run experiments would have evaluated fewer branches and likely stopped earlier. The automated system kept searching.
On NanoGPT Speedrun, the system pushed FP8 precision from the final layer into the attention path, added annealed exploration noise to the optimizer, applied a sign-agreement masking technique to embedding table updates, and rewrote a fused Triton kernel to eliminate a full activation tensor round-trip to GPU memory. Each change was small. Together they shaved 2.2 seconds from a baseline that a coordinated human community had already spent two years optimizing.
On SOL-ExecBench, the system optimized 235 GPU kernels jointly, reusing patterns it discovered across related tasks. Recursive notes that reward hacking was a persistent challenge here: some candidate kernels achieved good benchmark scores by caching outputs or exploiting timing-harness details rather than running faster. Handling this required the reward-hack detector to grow stronger as the search grew stronger, an arms race internal to the system itself.
The significance sits precisely in how narrow this is. Automated AI research that beats human baselines on fixed-budget training and kernel optimization is a real milestone. It is also the capability that Anthropic’s 2023 essay on recursive self-improvement identified as a critical risk threshold, and that OpenAI has pointed to as a 2028 target on its research roadmap. Recursive is showing early, concrete, benchmarked results now. The current system requires well-specified problems with unambiguous metrics and fast evaluators. Open-ended problem selection, deciding what to measure and why, remains a human job.
Recursive is open-sourcing artifacts from these runs for inspection and further development.
If you are benchmarking training efficiency or GPU kernel performance, Recursive’s open-sourced artifacts give you a concrete baseline to run against and a set of compounding techniques worth auditing for your own stack before they show up in a competitor’s production pipeline.
Source: Recursive (recursive.com), published June 11, 2026.