Autoresearch loops work only when the metric is airtight

A two-week Claude experiment in file compression shows agentic loops reward narrow, measurable goals, not vague ambition.

Alessandro Benigni

PUBLISHED JUL 3, 2026



Elliot C Smith spent two weeks running Claude Code in an unsupervised loop against a single, boring problem: file compression. The exercise, styled after Andrej Karpathy’s “autoresearch” concept, was built to test a specific claim rather than repeat it. Smith wanted to know whether loop-style agentic work, the kind some builders describe as letting one person operate like a small team, actually holds up, or whether it only works under conditions few people bother to name.

The setup was deliberately narrow. Smith scaffolded a Rust project with a stub compress and decompress function, wrote round-trip tests, and built a benchmark script pulling public domain audio, video, and text samples. Two constraints bounded the problem: the decompressed output had to match the original bit for bit, and neither compression nor decompression could exceed 300 seconds. For ten iterations spread across roughly two weeks, Smith cleared Claude’s context between runs, gave the same short prompt asking it to inspect the codebase and push for one more improvement, reviewed the resulting plan, then let the agent run without further intervention.
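A scaffold of this kind can be sketched in a few lines of Rust. This is an illustrative sketch, not Smith's code: the function names, the identity stubs, and the `check_round_trip` helper are assumptions, and only the two constraints (bit-exact output, 300 seconds per direction) come from the writeup.

```rust
use std::time::{Duration, Instant};

/// Time limit from the article: neither direction may exceed 300 seconds.
const TIME_BUDGET: Duration = Duration::from_secs(300);

/// Hypothetical stub: returns the input unchanged. The agent's job is to
/// replace this body with something that actually shrinks the data.
fn compress(input: &[u8]) -> Vec<u8> {
    input.to_vec()
}

/// Hypothetical stub: the inverse of `compress`.
fn decompress(input: &[u8]) -> Vec<u8> {
    input.to_vec()
}

/// Returns the compressed length only if both constraints hold:
/// each direction stays within the time budget and the round trip is bit-exact.
fn check_round_trip(original: &[u8]) -> Result<usize, String> {
    let start = Instant::now();
    let packed = compress(original);
    if start.elapsed() > TIME_BUDGET {
        return Err("compression exceeded the time budget".to_string());
    }

    let start = Instant::now();
    let unpacked = decompress(&packed);
    if start.elapsed() > TIME_BUDGET {
        return Err("decompression exceeded the time budget".to_string());
    }

    if unpacked.as_slice() != original {
        return Err("round trip is not bit-exact".to_string());
    }
    Ok(packed.len())
}
```

With the identity stubs the check passes trivially and reports no savings, which is the point of a scaffold: every later iteration is judged against the same pass/fail gate.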

The first iteration produced a custom LZSS implementation. The next nine layered on additional entropy checks and encoding techniques. According to Smith’s writeup on elliotcsmith.com, each loop cost roughly $4 in Claude usage, and the resulting algorithm ended up competitive with tools already installed on Smith’s machine, notably strong on audio and video samples and roughly on par elsewhere.

The headline result matters less than what Smith says made it possible. The compression task had a fast, cheap, unambiguous feedback loop: compress a file, decompress it, check the byte match, measure the size. Smith’s central claim is that this style of agentic work pays off specifically when a problem offers a metric that is measurable, hard to game, and tightly bounded by constraints, and that finding a problem with those three properties is the actual bottleneck, not the agent’s raw capability.
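To make the "hard to game" property concrete, here is a minimal sketch of how such a metric might be scored. The `score` function and its signature are hypothetical; the idea it illustrates is from the article: size only counts when the byte match holds.

```rust
/// Scores a candidate as compressed size divided by original size (lower is better).
/// Returns `None` when the lossless constraint is violated or the input is empty,
/// so a candidate cannot "win" by throwing data away.
fn score(original: &[u8], packed: &[u8], unpacked: &[u8]) -> Option<f64> {
    if original.is_empty() || unpacked != original {
        return None;
    }
    Some(packed.len() as f64 / original.len() as f64)
}
```

A candidate that emits an empty file has the smallest possible output but earns no score at all, because the decompressed bytes no longer match. That gating is what keeps the size measurement from being exploited.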

That claim cuts against a familiar pitch this year: that autonomous loops let a single operator replace the output of an entire team on almost any goal. Smith’s experiment argues the opposite. The agent did not generalize ambition into results. It exploited a narrow, well-posed target that a human had already done the hard work of specifying, tuning, and bounding before Claude ever touched the code.

Two structural findings back that up. First, Smith observed that the model consistently wanted to be finished: it produced exactly one hypothesis per iteration and declared the loop complete rather than continuing to search. That is an argument for building an explicit looping mechanism into any production version of this workflow. Second, Smith cites a thread from Mitchell Hashimoto describing an agent that cut a renderer’s frame times from roughly 88 milliseconds to 2 while sharply reducing allocations, a result that looked like a clean win until it exposed how narrow the underlying objective function actually was.
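One way to read the first finding is that the stopping decision should live outside the agent. The sketch below is an assumption about what such a driver could look like, not something described in Smith's writeup: `Iteration`, `drive_loop`, and the closure standing in for an agent run are all hypothetical.

```rust
/// Outcome of one agent iteration (hypothetical shape).
struct Iteration {
    /// Compressed size in bytes reported by the benchmark, if the constraints held.
    measured_size: Option<usize>,
    /// Whether the agent declared the work finished.
    agent_says_done: bool,
}

/// Runs exactly `max_iterations` iterations, regardless of what the agent
/// reports, and returns the smallest valid size measured along the way.
fn drive_loop<F>(max_iterations: usize, mut run_iteration: F) -> Option<usize>
where
    F: FnMut(usize) -> Iteration,
{
    let mut best: Option<usize> = None;
    for i in 0..max_iterations {
        let outcome = run_iteration(i);
        // Deliberately ignored: the driver, not the agent, decides when to stop.
        let _ = outcome.agent_says_done;
        if let Some(size) = outcome.measured_size {
            best = Some(best.map_or(size, |b| b.min(size)));
        }
    }
    best
}
```

The design choice is that the agent's own sense of completion is treated as noise, and only the benchmark's measurement feeds the result.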

Most business metrics do not behave like file compression. A checkout conversion rate takes days to measure reliably and carries noise a byte count never has, which pushes teams toward proxy metrics such as page load speed or click count that only loosely track the real goal. Smith’s compression experiment is a favorable case precisely because its feedback loop is instant and unambiguous, a property most production metrics simply do not share.

Teams weighing an unsupervised optimization loop against a real product metric should first check whether that metric behaves like a byte count or like a conversion funnel. If it is the latter, budget for proxy-metric drift and a human checkpoint before letting any loop run for two weeks unattended.

Reported by Elliot C Smith on June 30, 2026.
