Alibaba's Qwen team built a model that simulates the environments AI agents work in, not the agents themselves

Qwen-AgentWorld trains on 10 million interaction trajectories to create a language-based world model that can stand in for real environments during agent training.

Alessandro Benigni

PUBLISHED JUN 26, 2026


Alibaba’s Qwen research team published a paper on June 23 introducing Qwen-AgentWorld, a family of models designed to simulate the environments that AI agents operate in, rather than act as agents themselves. The distinction is architecturally significant: instead of building a better agent, the team built a better training ground.

The core idea behind a language world model is that an agent can be trained and evaluated entirely inside a language-based simulation of an environment, with no live system required. The world model receives the agent’s action as input and predicts what the environment would return as the next state. At sufficient fidelity, this closes the loop: an agent can iterate on tool calls, browser interactions, or file operations inside a fast, controllable sandbox rather than a slow, unreliable real environment.
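The loop described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the `WorldModel` protocol, the `ToyFileWorldModel` class, and the `write`/`read`/`ls` action strings are all hypothetical names chosen for this example, and the rule-based toy stands in for what would be a learned language model.

```python
from dataclasses import dataclass, field
from typing import Protocol


class WorldModel(Protocol):
    """Hypothetical interface: given the history and an action, predict the observation."""

    def step(self, history: list[str], action: str) -> str: ...


@dataclass
class ToyFileWorldModel:
    """Rule-based stand-in for a learned world model (assumption for illustration).

    A learned model would predict the observation from the text history alone;
    this toy keeps an explicit dict so the example stays deterministic.
    """

    files: dict[str, str] = field(default_factory=dict)

    def step(self, history: list[str], action: str) -> str:
        parts = action.split(maxsplit=2)
        if not parts:
            return "error: empty action"
        cmd = parts[0]
        if cmd == "write" and len(parts) == 3:
            self.files[parts[1]] = parts[2]
            return f"ok: wrote {len(parts[2])} chars to {parts[1]}"
        if cmd == "read" and len(parts) == 2:
            if parts[1] in self.files:
                return self.files[parts[1]]
            return f"error: {parts[1]} not found"
        if cmd == "ls" and len(parts) == 1:
            return " ".join(sorted(self.files)) or "(empty)"
        return f"error: unknown action {action!r}"


def rollout(world: WorldModel, actions: list[str]) -> list[str]:
    """Feed each action to the world model and collect the predicted observations."""
    history: list[str] = []
    observations: list[str] = []
    for action in actions:
        obs = world.step(history, action)
        # The growing transcript is what a language world model would condition on.
        history.extend([f"ACTION: {action}", f"OBSERVATION: {obs}"])
        observations.append(obs)
    return observations


if __name__ == "__main__":
    script = ["write notes.txt hello world", "read notes.txt", "read missing.txt", "ls"]
    for action, obs in zip(script, rollout(ToyFileWorldModel(), script)):
        print(f"{action!r} -> {obs!r}")
```

The point of the interface is that the agent code only ever calls `step`, so a simulated environment and a live one could in principle be swapped without touching the agent.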

Two model variants are described in the arXiv paper: Qwen-AgentWorld-35B-A3B and Qwen-AgentWorld-397B-A17B. Both use a mixture-of-experts architecture (the A3B and A17B suffixes indicate active parameters). Training ran in three sequential stages. The first stage, continued pretraining (CPT), injected general world-modeling knowledge from state-transition dynamics and domain corpora. The second stage, supervised fine-tuning (SFT), trained the model to predict the next environment state given an action. The third stage, reinforcement learning (RL) with hybrid rubric-and-rule rewards, sharpened simulation fidelity. The full dataset spans more than 10 million real environment interaction trajectories across seven domains.
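To make the naming concrete, the short sketch below parses the two model names and computes the share of parameters active per token. It assumes the number before the `-A` suffix is the total parameter count in billions, which follows the common naming convention for mixture-of-experts releases but is not stated in this article; `parse_moe_name` is a hypothetical helper.

```python
import re


def parse_moe_name(name: str) -> tuple[float, float]:
    """Return (total_billions, active_billions) from a name like 'X-35B-A3B'.

    Assumption: the figure before '-A' is the total parameter count.
    """
    match = re.search(r"(\d+(?:\.\d+)?)B-A(\d+(?:\.\d+)?)B$", name)
    if match is None:
        raise ValueError(f"unrecognized model name: {name!r}")
    return float(match.group(1)), float(match.group(2))


if __name__ == "__main__":
    for name in ("Qwen-AgentWorld-35B-A3B", "Qwen-AgentWorld-397B-A17B"):
        total, active = parse_moe_name(name)
        print(f"{name}: {active:g}B of {total:g}B active ({active / total:.1%})")
```

Under that assumption, roughly 9% of the smaller model's parameters and roughly 4% of the larger model's parameters are active for any given token.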

The scale of that trajectory corpus matters for a specific reason. World models for physical robotics have often struggled to generalize because the distribution of real interactions is far richer than any curated dataset. Language-domain environments, covering tasks like web browsing, code execution, and file manipulation, are more tractable to capture at scale. Ten million trajectories across seven domains is a meaningful attempt to cover that distribution rather than cherry-pick it.

The team also introduces AgentWorldBench, an evaluation framework built from real interactions of five frontier models across nine established benchmarks. The paper reports that Qwen-AgentWorld outperforms existing frontier models on this benchmark, but those results come from Alibaba’s own evaluation with no independent replication disclosed at the time of publication. That is standard for an initial arXiv release, not a disqualifying caveat, but teams planning to act on these results should treat the numbers as preliminary.

Two downstream applications are described. In the first, Qwen-AgentWorld functions as a decoupled simulator: agents are trained on thousands of synthetic environment episodes, then moved to real environments. The paper claims this yields gains that exceed training on real environments alone, which, if it holds, would be the key practical result. Synthetic data improving real-world performance is the same pattern that made data augmentation valuable in computer vision. The open question is sim-to-real transfer fidelity, specifically whether edge cases in the world model produce agent behaviors that do not transfer cleanly to production systems.

In the second application, Qwen-AgentWorld is used as a warm-up stage before downstream agent fine-tuning. The world-model training signal appears to improve performance across seven agentic benchmarks without any direct task-specific training, which suggests the model is learning something general about environment structure rather than overfitting to specific task distributions.

Code has been released to the QwenLM GitHub repository. The paper does not describe deployment infrastructure or API access, so teams cannot immediately substitute this for live environments in production agent pipelines. The near-term value is in research replication and in the AgentWorldBench methodology, which offers a framework for evaluating world-model fidelity independently of any specific agent.
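One way to picture an agent-independent fidelity check is to replay recorded real transitions through the world model and score how often its prediction matches what the real environment returned. The sketch below uses exact string match as the score, which is an assumption made for simplicity and not the metric AgentWorldBench uses; `exact_match_fidelity`, the `Transition` tuple, and the toy predictor are all hypothetical.

```python
from typing import Callable

# (context, action, observation actually returned by the real environment)
Transition = tuple[str, str, str]


def exact_match_fidelity(
    predict: Callable[[str, str], str],
    transitions: list[Transition],
) -> float:
    """Fraction of recorded transitions whose real observation the model reproduces."""
    if not transitions:
        raise ValueError("need at least one recorded transition")
    hits = sum(
        predict(context, action).strip() == real_obs.strip()
        for context, action, real_obs in transitions
    )
    return hits / len(transitions)


def toy_predict(context: str, action: str) -> str:
    """Stand-in for a world model call (assumption): every GET succeeds."""
    return "200 OK" if action.startswith("GET") else "error: unsupported"


if __name__ == "__main__":
    recorded: list[Transition] = [
        ("", "GET /index.html", "200 OK"),
        ("", "GET /missing", "404 Not Found"),  # the toy model gets this one wrong
        ("", "DELETE /index.html", "error: unsupported"),
        ("", "GET /about", "200 OK"),
    ]
    print(f"fidelity: {exact_match_fidelity(toy_predict, recorded):.2f}")
```

Because the score depends only on recorded transitions and the model's predictions, no agent needs to run during evaluation. A real benchmark would likely need a softer comparison than exact match, since two observations can differ textually and still describe the same state.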

Teams building agentic systems in mid-2026 should watch this line of work closely: if the sim-to-real transfer claims replicate under independent evaluation, synthetic environment pretraining could cut the cost and latency of agent training substantially.

Described in an arXiv preprint from Alibaba’s Qwen team, submitted June 23, 2026 (arXiv .24597).
