The most useful thing a visual AI model can do in 2026 is not to produce a beautiful image. It is to produce a program that produces a beautiful image. That distinction is the central argument in a June 2 analysis published by a16z partner Yoko Li, and it is worth taking seriously because the same logic is appearing across at least three separate domains at once.

Li draws a clean line between what she calls pixel-native generation and code-native generation. Pixel-native systems, the diffusion models that dominated the last three years, output final rendered results. They are excellent at texture, mood, and realism. They are useless the moment a designer needs to change one element, because the artifact has no internal structure to change. Code-native systems output an SVG, an HTML/CSS layout, a Blender Python script, a Lottie JSON file, a USD scene graph. The visual output is still pixels in the end, but the source of truth is a structured representation that humans and agents can both edit.
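
To make the "structured source of truth" point concrete, here is a minimal sketch in Python using only the standard library. The SVG content, the element ids, and the `recolor` helper are hypothetical and chosen purely for illustration; they do not come from Li's analysis. The sketch changes one element of a drawing by editing its source and leaves everything else untouched, which is the kind of edit a flat raster image cannot support.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"
# Serialize with a default xmlns instead of an "ns0:" prefix.
ET.register_namespace("", SVG_NS)

# Hypothetical two-element drawing standing in for a generated asset.
SOURCE = """<svg xmlns="http://www.w3.org/2000/svg" width="120" height="60">
  <rect id="background" width="120" height="60" fill="#ffffff"/>
  <circle id="logo-dot" cx="30" cy="30" r="12" fill="#d33"/>
</svg>"""


def recolor(svg_text: str, element_id: str, new_fill: str) -> str:
    """Return a copy of the SVG with one element's fill changed."""
    root = ET.fromstring(svg_text)
    matched = False
    for node in root.iter():
        if node.get("id") == element_id:
            node.set("fill", new_fill)
            matched = True
    if not matched:
        raise ValueError(f"no element with id {element_id!r}")
    return ET.tostring(root, encoding="unicode")


# Change only the dot; the background rectangle is left as it was.
edited = recolor(SOURCE, "logo-dot", "#0a7")
```

The edited string can be handed to any SVG renderer to get new pixels, while the source stays available for the next change.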

The practical consequence is a different kind of iteration loop. In a pixel-native workflow, “improve the output” means regenerate it and hope. In a code-native workflow, it means change the CSS rule, re-render, inspect, change the path, re-render again. Each pass improves the underlying artifact rather than replacing it. Li frames this as the reason code-native visual generation sits on the direct path to benefiting from more test-time compute: the model is debugging a program in a closed, verifiable environment, not sampling from a distribution.
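
The loop described above (edit, re-render, inspect, edit again) can be sketched as a toy in Python. Everything here is an assumption made for illustration: the `Button` type, the pre-measured `label_width`, and the `render`, `inspect`, and `revise` functions are hypothetical stand-ins for a real renderer, a real checker, and a model proposing source-level edits.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Button:
    """Hypothetical source artifact: a button described by parameters."""
    width: int        # total width in px
    padding: int      # horizontal padding on each side in px
    label_width: int  # assumed pre-measured text width in px


def render(button: Button) -> dict:
    """Stand-in renderer: compute the space available for the label."""
    return {"inner_width": button.width - 2 * button.padding,
            "label_width": button.label_width}


def inspect(layout: dict) -> int:
    """Stand-in verifier: how many px the label overflows (0 means it fits)."""
    return max(0, layout["label_width"] - layout["inner_width"])


def revise(button: Button, overflow: int) -> Button:
    """Stand-in edit step: widen the button by the measured overflow."""
    return replace(button, width=button.width + overflow)


def iterate(button: Button, max_passes: int = 5):
    """Edit, re-render, inspect until the check passes or passes run out."""
    history = []
    for _ in range(max_passes):
        overflow = inspect(render(button))
        history.append(overflow)
        if overflow == 0:
            break
        button = revise(button, overflow)
    return button, history


final, history = iterate(Button(width=80, padding=8, label_width=90))
# history records the overflow measured on each pass: [26, 0]
```

Each pass keeps the artifact and changes one parameter of it, so the second render starts from an improved source instead of a fresh sample.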

This argument is not limited to design tooling. Li gives 3D the most extended treatment, and the structural case there is stronger. A rendered image of a chair is not a chair; it is a picture of a chair. For an asset to work inside a game, a simulation, or a 3D editor, it needs consistent geometry, correct part hierarchy, and functional constraints. Doors must open. Wheels must spin. Articraft3D, a research project Li cites, frames articulated 3D generation as writing programs that define parts, geometry, joints, and tests. VIGA wraps Blender as a feedback environment, giving an agent semantic tools to query scene state, change camera angles, isolate objects, and map visual discrepancies back to source-level edits. The geometry becomes a program to be debugged, not an image to be regenerated.
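
As a rough illustration of what "a program that defines parts, geometry, joints, and tests" might look like, here is a small Python sketch. It is not the Articraft3D or VIGA API; the `Part`, `RevoluteJoint`, and `Asset` types, the cabinet dimensions, and the `door_opens` test are all hypothetical and exist only to show the shape of the idea.

```python
from dataclasses import dataclass


@dataclass
class Part:
    name: str
    size: tuple  # hypothetical (width, height, depth) in meters


@dataclass
class RevoluteJoint:
    parent: str
    child: str
    axis: tuple       # rotation axis as a unit vector
    lower_deg: float  # joint limits in degrees
    upper_deg: float

    def can_reach(self, angle_deg: float) -> bool:
        return self.lower_deg <= angle_deg <= self.upper_deg


@dataclass
class Asset:
    parts: dict   # part name -> Part
    joints: list  # list of RevoluteJoint

    def check(self) -> list:
        """Structural checks: joints must reference real parts and valid limits."""
        errors = []
        for joint in self.joints:
            for name in (joint.parent, joint.child):
                if name not in self.parts:
                    errors.append(f"joint references unknown part {name!r}")
            if joint.lower_deg >= joint.upper_deg:
                errors.append(f"joint {joint.parent}->{joint.child} has empty range")
        return errors


def build_cabinet() -> Asset:
    """Hypothetical asset: a cabinet body with one hinged door."""
    parts = {
        "body": Part("body", (0.6, 0.8, 0.4)),
        "door": Part("door", (0.6, 0.8, 0.02)),
    }
    hinge = RevoluteJoint("body", "door", (0.0, 1.0, 0.0), 0.0, 110.0)
    return Asset(parts, [hinge])


def door_opens(asset: Asset, angle_deg: float = 90.0) -> bool:
    """Functional test: some joint attached to the door reaches the angle."""
    return any(j.child == "door" and j.can_reach(angle_deg) for j in asset.joints)


cabinet = build_cabinet()
```

A failed check points at a specific joint or part in the source, which is what lets an agent repair the asset instead of regenerating it.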

The pattern extends past design and 3D. Games and simulations are the natural third instance. A game level defined as code can be modified by another agent or a human without re-rendering every asset. A training environment for robotics, defined as a scene graph, can be perturbed systematically to generate variation. This connects directly to what Nvidia shipped this week with Cosmos 3: an open foundation model whose output is not a rendered video but a structured action plan a robot policy can execute. The output is the program, not the final motion.
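
The point about perturbing a scene graph to generate variation can be sketched in a few lines of Python. The scene layout, the field names, and the jitter ranges below are assumptions made for illustration; they are not taken from Cosmos 3 or from any particular simulator.

```python
import copy
import random

# Hypothetical tabletop scene: positions in meters, masses in kilograms.
BASE_SCENE = {
    "name": "tabletop",
    "children": [
        {"name": "cube", "position": [0.20, 0.00, 0.05], "mass": 0.10},
        {"name": "mug", "position": [-0.10, 0.15, 0.06], "mass": 0.30},
    ],
}


def perturb(scene: dict, seed: int, pos_jitter: float = 0.02,
            mass_scale: tuple = (0.8, 1.2)) -> dict:
    """Return a seeded variant of the scene; the input is not modified."""
    rng = random.Random(seed)
    variant = copy.deepcopy(scene)
    for node in variant["children"]:
        x, y, z = node["position"]
        # Jitter only x and y so objects stay resting on the table.
        node["position"] = [
            x + rng.uniform(-pos_jitter, pos_jitter),
            y + rng.uniform(-pos_jitter, pos_jitter),
            z,
        ]
        node["mass"] = node["mass"] * rng.uniform(*mass_scale)
    return variant


# Four reproducible variants of the same environment.
variants = [perturb(BASE_SCENE, seed) for seed in range(4)]
```

Because each variant is derived from the same source with a known seed, any one of them can be reproduced or edited later without touching rendered frames.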

One abstraction now unifies visual AI, search, and robotics in 2026: generate the representation that produces the artifact, not the artifact itself. Perplexity made a version of this argument about search last week, framing queries as programs over a knowledge graph rather than lookups against an index. Li is making it about pixels. Nvidia is making it about physical actions. Each domain is arriving at the same conclusion from a different direction.

Li notes the market is beginning to organize around runtimes: browser, SVG renderer, Lottie player, Blender, game engine, simulator. Each runtime creates its own wedge because each has its own source representation and feedback loop. The companies that win, her argument implies, will own the full cycle from generation through rendering to inspection and revision.

The open question she does not fully resolve is which symbolic representations will become standards. SVG and HTML/CSS have decades of tooling behind them. USD is gaining ground in 3D but is not yet universal in games. Lottie is established for motion but narrowly scoped. The model that generates Blender scripts today has to navigate a specific API; the one generating game-engine scenes faces a fragmented landscape of Unreal, Unity, and Godot dialects. That fragmentation is the near-term friction point for anyone building in this space.

Teams building design or 3D tooling on top of pixel-native diffusion pipelines should treat this analysis as a structural warning: the iteration loop they are offering is the weaker one, and the gap will widen as code-native models improve.

Source: a16z (a16z.com), published June 2, 2026.