Four labs, one sim: small models behave differently in the wild

A Hugging Face hackathon entry put OpenAI, NVIDIA, OpenBMB, and Qwen small models in the same agent scenario and watched them diverge.

Alessandro Benigni

PUBLISHED JUN 9, 2026

1 MIN READ


A hackathon submission published June 6 on the Hugging Face blog turns a research question into a playable demonstration: do small models from different labs actually behave differently when placed in identical agent roles, or do they just score differently on benchmarks?

Thousand Token Wood v2, built by Lester Leong for the Hugging Face build-small hackathon, assigns each character in a multi-agent finance drama to a distinct small model: gpt-oss-20b from OpenAI, Nemotron-Mini-4B from NVIDIA, MiniCPM3-4B from OpenBMB, and a fine-tuned Qwen 0.5B. Each model was trained on different data with different post-training, and the divergence shows up in how agents speculate, hoard, and form alliances during play.

The engineering finding is more portable than the game itself. Serving four heterogeneous models on one platform exposed friction at the vLLM serving layer rather than at the model layer, and a shared JSON parse-and-repair layer absorbed the tokenizer differences without a refactor.
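As an illustration of what a shared parse-and-repair layer might look like, here is a minimal Python sketch. It is not the project's actual code: the function names, the `action` and `amount` fields, and the specific repairs (stripping Markdown fences, skipping chatter around the object, closing truncated output, dropping trailing commas) are all assumptions chosen to show the general idea of normalizing output from several models in one place.

```python
import json
import re
from typing import Optional

# Hypothetical helper names; the heuristics below are illustrative assumptions,
# not the repair rules used by Thousand Token Wood.
_FENCE = re.compile(r"```(?:json)?\s*(.*?)```", re.DOTALL | re.IGNORECASE)
_TRAILING_COMMA = re.compile(r",\s*([}\]])")


def _first_object_span(text: str) -> Optional[str]:
    """Return the first {...} object in text, closing it if it was truncated."""
    start = text.find("{")
    if start == -1:
        return None
    closers = []  # closing brackets still owed, innermost last
    in_string = escaped = False
    for i, ch in enumerate(text[start:], start):
        if in_string:
            # Brackets inside string literals must not affect nesting depth.
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch in "{[":
            closers.append("}" if ch == "{" else "]")
        elif ch in "}]" and closers and closers[-1] == ch:
            closers.pop()
            if not closers:
                return text[start:i + 1]
    # Ran out of text: terminate an open string and close open brackets.
    tail = '"' if in_string else ""
    return text[start:] + tail + "".join(reversed(closers))


def parse_agent_json(raw: str) -> Optional[dict]:
    """Best-effort parse of one JSON object from a model reply, else None."""
    fenced = _FENCE.search(raw)
    text = fenced.group(1) if fenced else raw
    span = _first_object_span(text)
    if span is None:
        return None
    # Try the span as-is first, then with trailing commas removed.
    for candidate in (span, _TRAILING_COMMA.sub(r"\1", span)):
        try:
            value = json.loads(candidate)
        except json.JSONDecodeError:
            continue
        if isinstance(value, dict):
            return value
    return None


if __name__ == "__main__":
    reply = 'Here is my move:\n```json\n{"action": "trade", "amount": 5,}\n```'
    print(parse_agent_json(reply))
```

Every model's reply goes through the same function, so the game loop sees either a dictionary or `None` and needs no per-model branches. The trailing-comma rule is a plain regex and could alter a string value that happens to contain a comma followed by a closing bracket, so a production version would need a more careful pass.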

Teams building multi-model agent pipelines can treat this open-source project as a concrete reference for heterogeneous-council architecture before committing to a single-model design.

Hugging Face build-small hackathon (huggingface.co/blog), 2026-06-06.
