Asuka Zheng, posting on X on May 29, argues that the “we’re running out of training data” framing has captured the AI policy and investment conversation while missing the actual shape of the data market. The argument lands as a corrective worth taking seriously, especially given how much of the current frontier-training narrative leans on data scarcity as a constraint.

The personal example anchors the argument. Zheng describes an SRE-replacement project that trained two world models targeting site-reliability engineering tasks. The training stalled because end-to-end long-horizon incident trajectories, from first anomaly through full resolution, did not exist as a dataset. Production systems generate logs, alerts, and post-incident reviews, but the full trajectory linking an initial signal through every diagnostic step, every hypothesis revised, every action taken, to a resolved state was simply not captured anywhere.
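To make the missing record concrete, here is a minimal sketch of what one captured incident trajectory could look like as a data structure. It is not Zheng's schema: the names `Step`, `IncidentTrajectory`, `STEP_KINDS`, and `example_trajectory`, along with the sample incident, are hypothetical and chosen only for illustration. The point it shows is that the steps are ordered, a hypothesis can point back at the one it revises, and the record only counts as complete once it reaches a resolution.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical step categories, loosely following the description above.
STEP_KINDS = {"observation", "hypothesis", "action"}


@dataclass
class Step:
    kind: str                      # one of STEP_KINDS
    detail: str                    # free-text description of the step
    revises: Optional[int] = None  # index of an earlier step this one revises


@dataclass
class IncidentTrajectory:
    initial_signal: str                     # the first anomaly that was noticed
    steps: List[Step] = field(default_factory=list)
    resolution: Optional[str] = None        # None until the incident is resolved

    def add(self, kind: str, detail: str, revises: Optional[int] = None) -> int:
        """Append a step in order and return its index."""
        if kind not in STEP_KINDS:
            raise ValueError(f"unknown step kind: {kind}")
        if revises is not None and not 0 <= revises < len(self.steps):
            raise ValueError("revises must point at an earlier step")
        self.steps.append(Step(kind, detail, revises))
        return len(self.steps) - 1

    def is_complete(self) -> bool:
        """End-to-end means: at least one action was taken and a resolution exists."""
        took_action = any(s.kind == "action" for s in self.steps)
        return took_action and self.resolution is not None


def example_trajectory() -> IncidentTrajectory:
    """Build an invented incident purely to exercise the structure."""
    t = IncidentTrajectory(initial_signal="p99 latency alert on checkout service")
    t.add("observation", "error rate flat, latency rising")
    first = t.add("hypothesis", "database connection pool exhausted")
    t.add("observation", "pool utilisation is normal")
    t.add("hypothesis", "slow downstream dependency", revises=first)
    t.add("action", "enable circuit breaker on dependency call")
    t.resolution = "latency back to baseline after circuit breaker enabled"
    return t
```

Logs, alerts, and post-incident reviews would each fill in fragments of such a record. What the project lacked, on Zheng's account, is the ordered chain between `initial_signal` and `resolution`.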

The structural insight is that data scarcity is the wrong framing. The internet has been scraped repeatedly. Code, scientific papers, and structured business data have all been extracted at scale, and synthetic generation pipelines already add to that supply. What is genuinely missing is long-horizon trajectory data: the multi-step processes that real workflows execute over days or weeks, where the state evolves, the context updates, and the resolution depends on which earlier choices were made.

That category of data is hard to manufacture and largely absent from existing corpora. Engineering teams document outcomes, not paths. Researchers publish results, not process. Sales teams record closed deals, not the full sequence of touches that produced them. The training data that would let a model learn to operate as a long-horizon agent in those domains does not exist as a dataset because nobody has historically had reason to capture it that way.

The implication for frontier-lab strategy is direct. The investments in synthetic data pipelines, RLHF infrastructure, and human-in-the-loop labeling are addressing a different problem than the one Zheng identifies. Synthetic data generation produces more of what already exists. RLHF refines what the model can already partially do. Neither captures the missing trajectory category.

For startups building vertical AI for specific workflows, this is a useful framing. The defensible moat is the proprietary trajectory data, not the model weights. The healthcare AI startup that captures full diagnostic-to-treatment trajectories at scale, the legal AI startup that captures full matter-resolution arcs, the SRE startup that captures full incident-to-resolution paths: each of those is building a data asset that frontier models cannot easily replicate by scraping more of the open web.

The skeptical read: trajectory data is not the only thing missing. Many vertical AI projects also stall on evaluation quality, integration into existing workflows, and the operational realities of having a model touch real production systems. Trajectory data is necessary but not sufficient.

Zheng’s framing is a useful correction to the data-scarcity narrative. For founders and operators evaluating which vertical AI bets are defensible over the next 18 months, the question to ask is not whether the technology can do the job. It is whether their specific workflow generates the kind of trajectory data that would let them train a model that no general-purpose lab can match.

For frontier labs reading the same critique, the strategic implication is that long-horizon training data acquisition becomes a structural priority over the next two years. Watch for partnership and acquisition activity targeting workflow categories where proprietary trajectory data is captured.

Posted by Asuka Zheng on X on 2026-05-29.