A hackathon submission published June 6 on the Hugging Face blog turns a research question into a playable demonstration: do small models from different labs actually behave differently when placed in identical agent roles, or do they just score differently on benchmarks?

Thousand Token Wood v2, built by Lester Leong for the Hugging Face build-small hackathon, assigns each character in a multi-agent finance drama to a distinct small model: gpt-oss-20b from OpenAI, Nemotron-Mini-4B from NVIDIA, MiniCPM3-4B from OpenBMB, and a fine-tuned Qwen 0.5B. Each model was trained on different data with different post-training, and the divergence shows up in how agents speculate, hoard, and form alliances during play.

The engineering finding is more portable than the game itself. Serving four heterogeneous models on one platform exposed friction at the vLLM serving layer rather than at the model layer, and a shared JSON parse-and-repair layer absorbed the tokenizer differences without a refactor.
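
To make the parse-and-repair idea concrete, here is a minimal Python sketch of what such a shared layer could look like. It is not the project's actual code: the function names (`extract_braced`, `parse_action`), the repair steps, and the example action fields are all assumptions chosen for illustration. The sketch tries a strict parse first, then falls back to a fenced block, a balanced-brace span, and trailing-comma removal.

```python
import json
import re
from typing import Optional

# Hypothetical helpers for illustration; not taken from the project.
FENCE_RE = re.compile(r"```(?:json)?\s*(.*?)```", re.DOTALL | re.IGNORECASE)
TRAILING_COMMA_RE = re.compile(r",\s*([}\]])")


def extract_braced(text: str) -> Optional[str]:
    """Return the first balanced {...} span, ignoring braces inside strings."""
    start = text.find("{")
    if start == -1:
        return None
    depth = 0
    in_string = False
    escaped = False
    for i in range(start, len(text)):
        ch = text[i]
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return text[start:i + 1]
    return None


def parse_action(raw: str) -> Optional[dict]:
    """Parse one model reply into a dict, or return None if nothing is recoverable."""
    candidates = [raw.strip()]
    fenced = FENCE_RE.search(raw)
    if fenced:
        candidates.append(fenced.group(1).strip())
    braced = extract_braced(raw)
    if braced:
        candidates.append(braced)

    for candidate in candidates:
        # Try the text as-is, then with trailing commas removed.
        # The comma repair is naive and could touch a ",}" inside a string value.
        for attempt in (candidate, TRAILING_COMMA_RE.sub(r"\1", candidate)):
            try:
                value = json.loads(attempt)
            except json.JSONDecodeError:
                continue
            if isinstance(value, dict):
                return value
    return None


# Replies in the styles different models tend to produce (invented examples).
print(parse_action('{"action": "hoard", "amount": 3}'))
print(parse_action('Sure!\n```json\n{"action": "trade"}\n```'))
print(parse_action('I will act: {"action": "ally", "with": "fox",} thanks'))
```

Under this design every agent, whichever model backs it, passes through the same function, so a model that wraps its JSON in chatter or a code fence needs no model-specific branch in the game loop. A caller would treat `None` as a failed turn and retry or fall back to a default action.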

Teams building multi-model agent pipelines can use this open-source project as a concrete reference for a heterogeneous-council architecture before committing to a single-model design.

Hugging Face build-small hackathon (huggingface.co/blog), 2026-06-06.