The workflow that fixes itself is the deepest moat

Agent observability has been stuck at the trace level for a year. The next shift is harnesses that diagnose, patch, and verify themselves.

Alessandro Benigni

PUBLISHED JUN 10, 2026



Production agent observability has a ceiling, and most teams have already hit it. You can see every tool call, every token, every latency spike. What you cannot see is the repair path, because that part is still entirely manual: an engineer walks the trace, forms a theory, writes a patch, re-runs the suite, and hopes nothing regressed downstream.

Avi Chawla argued in an essay published June 8 at Daily Dose of DS that this ceiling is not a tooling gap. It is an architectural choice that has outlived its usefulness. The next shift is what he calls agentic harness engineering: systems whose observability layer is rich enough to drive automated self-repair.

The argument rests on three observability primitives. The first is component observability: every prompt template, tool definition, and memory write gets file-level representation that can be diffed, reviewed, and attributed to a specific outcome. The second is experience observability: trajectory tokens are distilled into an evidence corpus the harness can query when planning fixes, so past failures are not discarded but encoded. The third is decision observability: every edit to the harness ships with a prediction about what should improve, verified against actual outcomes after deployment.
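As a rough illustration of how component and decision observability might fit together, the sketch below versions a component by content hash and attaches a prediction to an edit. It is not taken from Chawla's essay: the class names, the `prompts/triage.txt` path, the `task_success_rate` metric, and the threshold values are all hypothetical.

```python
from dataclasses import dataclass
import hashlib


@dataclass(frozen=True)
class ComponentVersion:
    """A harness component (prompt, tool definition, ...) held as file content."""
    path: str      # hypothetical path, e.g. "prompts/triage.txt"
    content: str

    @property
    def digest(self) -> str:
        # Short content hash so an outcome can be attributed to an exact version.
        return hashlib.sha256(self.content.encode("utf-8")).hexdigest()[:12]


@dataclass(frozen=True)
class EditPrediction:
    """Decision observability: what an edit claims it will improve."""
    component: ComponentVersion
    metric: str              # hypothetical metric name
    baseline: float          # metric value before the edit
    min_improvement: float   # smallest gain that counts as "prediction held"


def prediction_held(pred: EditPrediction, observed: float) -> bool:
    """Compare the observed post-deployment metric with the prediction."""
    return (observed - pred.baseline) >= pred.min_improvement


# Illustrative values only.
prompt = ComponentVersion("prompts/triage.txt",
                          "Classify the ticket, then cite the evidence.")
pred = EditPrediction(prompt, "task_success_rate",
                      baseline=0.72, min_improvement=0.03)
print(prompt.digest, prediction_held(pred, observed=0.78))
```

The point of the design is that an edit with no stated prediction cannot be verified later, so the prediction travels with the component version it refers to.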

When all three are present, the system can do something qualitatively different from today’s dashboards. It can diagnose a failure mode, propose a patch, apply that patch inside a sandboxed environment, confirm the target metric improved, confirm nothing regressed, and lock the scenario as a permanent regression test. The human engineer becomes the approver, not the detective.
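The loop described above can be sketched in a few lines, under heavy simplifying assumptions. Here the "sandbox" is just a patched copy of the harness held in memory, the evaluator is a toy stand-in, and every name (`try_repair`, `toy_evaluate`, the scenario names, the prompt path) is hypothetical rather than part of any real framework.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Harness = Dict[str, str]     # component path -> file content
Scores = Dict[str, float]    # scenario name -> score in [0, 1]
Evaluator = Callable[[Harness], Scores]


@dataclass
class RepairResult:
    accepted: bool
    reason: str
    regression_suite: List[str] = field(default_factory=list)


def try_repair(harness: Harness, patch: Harness, target: str,
               evaluate: Evaluator, regression_suite: List[str],
               tolerance: float = 0.0) -> RepairResult:
    """Apply a patch to a copy of the harness; accept it only if verified."""
    before = evaluate(harness)
    sandboxed = {**harness, **patch}   # patched copy; the live harness is untouched
    after = evaluate(sandboxed)

    # Step 1: the target scenario must actually improve.
    if after[target] <= before[target]:
        return RepairResult(False, "target scenario did not improve",
                            list(regression_suite))

    # Step 2: nothing already locked may get worse.
    regressed = [s for s in regression_suite
                 if after[s] < before[s] - tolerance]
    if regressed:
        return RepairResult(False, "regressed: " + ", ".join(regressed),
                            list(regression_suite))

    # Step 3: lock the repaired scenario as a permanent regression test.
    locked = list(regression_suite)
    if target not in locked:
        locked.append(target)
    return RepairResult(True, "verified; awaiting human approval", locked)


def toy_evaluate(h: Harness) -> Scores:
    # Stand-in scorer: a scenario passes if its keyword appears in the prompt.
    text = h["prompts/agent.txt"]
    return {"refund_flow": float("refund" in text),
            "login_flow": float("login" in text)}


live = {"prompts/agent.txt": "Handle login issues."}
good_patch = {"prompts/agent.txt": "Handle login issues and refund requests."}
bad_patch = {"prompts/agent.txt": "Handle refund requests."}

good = try_repair(live, good_patch, "refund_flow", toy_evaluate, ["login_flow"])
bad = try_repair(live, bad_patch, "refund_flow", toy_evaluate, ["login_flow"])
print(good.accepted, good.regression_suite)
print(bad.accepted, bad.reason)
```

A real harness would replace the dictionary copy with an isolated execution environment and the keyword check with an evaluation suite, but the acceptance logic keeps the same shape: improve the target, regress nothing, then lock the scenario.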

What makes this moment distinct is that the substrate for it is now shipping. LangSmith’s sandboxes release, covered here last week, gives harnesses a safe execution context to test patches without touching production. Braintrust’s Topics feature, also covered recently, surfaces the pattern-level analysis that experience observability requires: grouping failures by theme rather than logging them individually. GitHub’s agent principals work adds identity semantics so harnesses can act on their own behalf with auditable authority. The architectural argument Chawla is making is not speculative. The primitives are arriving from several directions at once.

The deeper business implication is worth stating plainly. For the past eighteen months, the working assumption was that the model was the bottleneck and that swapping in a better model would fix most production failures. That assumption is eroding. Frontier models are now close enough in capability that workflow design is increasingly the differentiator. The team with a better harness beats the team with a marginally better model. And the team whose harness repairs itself beats the team that depends on an engineer to do that repair on the next sprint.

Self-repairing workflows are not a product feature yet. They are an architectural posture, and the gap between the description and a production implementation is real. Chawla’s essay is the argument for why the architecture matters, not a deployment guide. Building the component observability layer alone requires discipline most teams have skipped: externalizing prompts to versioned files, tagging tool definitions explicitly, making memory writes auditable. Most production harnesses were not built with any of that.
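Two of those habits, externalized prompts and auditable memory writes, are small enough to sketch. The example below is a minimal illustration and not a production design: the file names, the JSON Lines log format, and the helper functions `load_prompt` and `audited_memory_write` are assumptions made for the example.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def load_prompt(path: Path) -> dict:
    """Read a prompt from a file and attach a content hash as its version."""
    text = path.read_text(encoding="utf-8")
    version = hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]
    return {"path": path.name, "version": version, "text": text}


def audited_memory_write(memory: dict, log_path: Path, key: str,
                         value: str, prompt_version: str) -> None:
    """Append a record of the write to a JSONL log, then update memory."""
    record = {"key": key, "old": memory.get(key), "new": value,
              "prompt_version": prompt_version}
    with log_path.open("a", encoding="utf-8") as log:
        log.write(json.dumps(record) + "\n")
    memory[key] = value


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical file names; in practice these would live in version control.
    prompt_file = Path(tmp) / "triage_prompt.txt"
    prompt_file.write_text("Classify the ticket before acting.", encoding="utf-8")
    prompt = load_prompt(prompt_file)

    memory: dict = {}
    log_file = Path(tmp) / "memory_audit.jsonl"
    audited_memory_write(memory, log_file, "customer_tier", "gold",
                         prompt["version"])
    audited_memory_write(memory, log_file, "customer_tier", "platinum",
                         prompt["version"])

    # Read the trail back: each write records what changed and under which prompt.
    audit_trail = [json.loads(line)
                   for line in log_file.read_text(encoding="utf-8").splitlines()]

print(audit_trail)
```

Because every record carries the prompt version that was active at write time, a bad memory state can be traced back to the exact prompt file that produced it.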

But the teams doing this work now are not solving today's problem. They are building the moat for the next two years. When the model-swap argument stops moving the needle, the question becomes who built a workflow that compounds. The harness that updates itself compounds every time it runs.

If you are evaluating agent infrastructure right now, the evaluation criteria need to expand beyond latency and cost-per-token. Ask whether the framework you are adopting makes component observability possible. If the answer is that traces are the only output, you are building on a layer that will require human maintenance at exactly the moment agents are supposed to reduce it.

Avi Chawla via Daily Dose of DS (blog.dailydoseofds.com), published 2026-06-08.
