The last human job in the AI engineering loop

Langfuse's Lotte Verheyden argues every step of the AI engineering loop can now be automated except one: deciding what good output actually looks like.

Alessandro Benigni

PUBLISHED JUN 11, 2026


The automation argument that tooling vendors have been circling finally has a clean statement. Writing on Langfuse’s blog on June 9, Lotte Verheyden lays out the case that the AI engineering loop is now technically self-contained: an agent can write prompts, run evaluations, analyze failures, propose changes, and apply them. No human required for any individual step.

That framing should make builders uncomfortable, because what Verheyden is really describing is the collapse of “workflow is the moat.”

For the past eighteen months, the practical defense against model commoditization has been process. Your pipeline, your retrieval design, your prompt architecture. The underlying models improve, but your specific workflow produces value that a vanilla API call cannot replicate. That was a reasonable position when workflows took months to build. Agents capable of building workflows in hours hollow it out.

Verheyden’s actual argument, though, is not about the workflow. It is about what sits behind the workflow: the eval. Specifically, who gets to define what counts as a good output. Her claim is that this is the step that agents cannot substitute for, not because of any technical barrier, but because the definition of quality is tacit knowledge the developer holds and has not written down. An agent optimizing against an imperfect eval will find the shortest path to a high score. When the eval misses nuance, the system improves on paper while degrading in production.
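A toy sketch can make that failure concrete. Everything in it is hypothetical and written for illustration, not taken from Verheyden's post: a keyword-matching `proxy_eval` stands in for the imperfect eval, and `tacit_quality` stands in for the judgment the developer never wrote down.

```python
# Hypothetical toy example. Function names, keywords, and candidate answers
# are illustrative assumptions, not anything from the Langfuse post.

def proxy_eval(answer: str) -> float:
    """Imperfect eval: fraction of expected keywords that appear in the answer."""
    keywords = ["refund", "policy", "30 days"]
    text = answer.lower()
    return sum(k in text for k in keywords) / len(keywords)


def tacit_quality(answer: str) -> float:
    """Stand-in for the unwritten standard: a real sentence that answers the question."""
    text = answer.strip()
    is_sentence = text[:1].isupper() and text.endswith(".")
    answers_question = "refund" in text.lower() and "30 days" in text.lower()
    return 1.0 if (is_sentence and answers_question) else 0.0


candidates = {
    "helpful": "You can return the item within 30 days for a full refund.",
    "stuffed": "refund policy 30 days refund policy 30 days",
    "vague": "Please contact support.",
}

# An optimizer that only sees the proxy picks whatever scores highest on it.
best_by_proxy = max(candidates, key=lambda name: proxy_eval(candidates[name]))
# The developer's unwritten standard would have picked a different answer.
best_by_tacit = max(candidates, key=lambda name: tacit_quality(candidates[name]))

print("proxy picks:", best_by_proxy)
print("tacit standard picks:", best_by_tacit)
```

The keyword-stuffed string wins on the proxy because it covers all three keywords, while the answer a person would actually send a customer covers only two. Nothing in the score reveals the problem; only the unwritten standard does.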

She calls the failure mode “agent slop,” which is accurate and not meant kindly.

The piece reads as a vendor argument, and it is one. Langfuse builds observability and evaluation infrastructure; making the case that eval definition requires sustained human attention is good for Langfuse. An argument can be self-serving and still be right.

What makes it worth reading past the vendor framing is that it lines up with signals arriving from several directions this year. Anthropic’s 8x developer velocity finding was enabled by a redesign of how teams specify what they want their agents to do, not just by what agents can do. The DX developer survey showing an 8% productivity lift at the median captured teams that had not changed how they define task success. Perplexity’s 87% productivity figure came from a narrow domain where “correct answer” is well-specified by the task itself. Yoonho Lee’s text-layer argument holds that human-written prompts and examples are sample-efficient exactly because they encode nuance agents cannot infer from outcomes alone.

The convergent finding across those data points: the bottleneck in AI-assisted development has shifted from execution capability to measurement quality. Models can do more than most teams are measuring. The constraint is now the sharpness of the specification for what good looks like.

Verheyden’s framing of this as “eval definition is the last human-only step” is useful precisely because it is concrete. It gives teams a place to point: if the workflow can be automated, the thing you are protecting is not the workflow. It is the judgment that defines the workflow’s success criteria.

That has operational consequences. Teams that have invested heavily in prompt engineering as a defensible skill should be reorienting toward evaluation design. The prompt is increasingly the agent’s job. The eval spec, including the failure modes it explicitly covers, is not something you can delegate until you have written it down clearly enough that an agent could apply it. Writing it that clearly is the work.
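One way to picture "written down clearly enough that an agent could apply it" is an eval spec whose failure modes are explicit, named checks. The sketch below is an assumption-laden illustration: `EvalSpec`, `FailureMode`, and the three example checks are hypothetical names invented here, not a Langfuse API or anything Verheyden describes.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class FailureMode:
    """One explicitly named way an output can be bad (hypothetical structure)."""
    name: str
    description: str
    triggered: Callable[[str], bool]


@dataclass(frozen=True)
class EvalSpec:
    """A task plus the failure modes the team has actually written down."""
    task: str
    failure_modes: List[FailureMode]

    def review(self, output: str) -> List[str]:
        # Return the names of every failure mode this output triggers.
        return [fm.name for fm in self.failure_modes if fm.triggered(output)]


# Illustrative spec; the task and thresholds are made up for the example.
spec = EvalSpec(
    task="Summarize a support ticket in at most 40 words",
    failure_modes=[
        FailureMode("too_long", "More than 40 words", lambda o: len(o.split()) > 40),
        FailureMode("empty", "No content at all", lambda o: not o.strip()),
        FailureMode("filler_opening", "Opens with boilerplate instead of the summary",
                    lambda o: o.lower().startswith("as an ai")),
    ],
)

good = "Customer cannot log in after password reset; needs account unlock."
bad = "As an AI, I think " + "very " * 40 + "long."

print(spec.review(good))
print(spec.review(bad))
```

The checks here are deliberately trivial. The point of the structure is that each failure mode has a name and a description a person agreed to, so anything missing from the list is visibly missing rather than silently assumed.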

Verheyden predicts that in 2026 every analytics startup in the eval-and-observability category (she names Braintrust, Helicone, and Langfuse itself, among others) will make a generational transition from observability products to continual-learning platforms. That architectural shift, where every production trace becomes a potential training signal, raises the stakes on eval quality. A bad eval in a static system produces bad outputs. A bad eval in a continual-learning system produces a model that has been trained to produce bad outputs more consistently.

Teams building on any continual-learning infrastructure should treat eval design as the highest-leverage technical decision they own. Getting the measurement right is not a product management task to be deferred until later. It is the one specification that cannot be handed to the agent.

Lotte Verheyden via Langfuse (langfuse.com), published June 9, 2026.
