A new GitHub repository, NMM-Roadmap, catalogs the research progression from modular multimodal assembly (separate vision encoders bolted onto language models) toward native multimodal modeling, where multiple modalities share a unified transformer space or joint backbone from the start of training.
The architectural distinction matters. Many multimodal systems use a modular design, of which open models such as LLaVA are a well-documented example: a separately trained vision encoder converts images into token-like embeddings, which the language model consumes as if they were text. The architectures of closed production systems such as GPT-4o, Claude, and Gemini are not fully public, and OpenAI and Google describe GPT-4o and Gemini as trained across modalities from the start, so they are not clean examples of the modular family. Native multimodal models, by contrast, train the entire architecture jointly across modalities, with no separate “vision adapter” step. ByteDance’s Lance model (covered May 22) is one example. The repo tracks the broader research direction across labs.
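To make the modular versus native distinction concrete, here is a minimal shape-level sketch in plain Python. It is a toy under stated assumptions, not the architecture of any model named here: the class names, dimensions, and random weights are all hypothetical, there is no attention or training loop, and the real difference between the two families lies mostly in how the weights are trained, which code this small can only indicate in comments.

```python
import random

# All sizes below are hypothetical, chosen only to keep the toy small.
D_MODEL = 8    # width of the shared transformer space
D_VISION = 4   # output width of the separate vision encoder (modular case)
PATCH_DIM = 6  # flattened pixel values per image patch
VOCAB = 16     # toy text vocabulary size


def make_matrix(rows, cols, rng):
    """Random rows x cols weight matrix standing in for learned parameters."""
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


def linear(vec, matrix):
    """Multiply a length-rows vector by a rows x cols matrix."""
    return [sum(v * row[j] for v, row in zip(vec, matrix))
            for j in range(len(matrix[0]))]


class ModularModel:
    """Vision encoder -> adapter -> language model input sequence."""

    def __init__(self, rng):
        # Stands in for an encoder pretrained separately from the LM.
        self.vision_encoder = make_matrix(PATCH_DIM, D_VISION, rng)
        # The "vision adapter": maps encoder features into the LM's space.
        self.adapter = make_matrix(D_VISION, D_MODEL, rng)
        self.text_embed = make_matrix(VOCAB, D_MODEL, rng)

    def build_sequence(self, patches, token_ids):
        vis = [linear(linear(p, self.vision_encoder), self.adapter)
               for p in patches]
        txt = [self.text_embed[t] for t in token_ids]
        return vis + txt  # image embeddings are consumed like text tokens


class NativeModel:
    """Patches and text embed straight into one jointly trained space."""

    def __init__(self, rng):
        # Assumption: both embeddings are trained together with the
        # backbone from the start; there is no separate encoder or adapter.
        self.patch_embed = make_matrix(PATCH_DIM, D_MODEL, rng)
        self.text_embed = make_matrix(VOCAB, D_MODEL, rng)

    def build_sequence(self, patches, token_ids):
        vis = [linear(p, self.patch_embed) for p in patches]
        txt = [self.text_embed[t] for t in token_ids]
        return vis + txt


rng = random.Random(0)
patches = [[rng.random() for _ in range(PATCH_DIM)] for _ in range(2)]
token_ids = [1, 5, 2]

modular_seq = ModularModel(rng).build_sequence(patches, token_ids)
native_seq = NativeModel(rng).build_sequence(patches, token_ids)
```

Both variants hand the backbone a sequence of the same length and width, which is why the difference is easy to miss from the outside: what changes is whether the image path passes through a separately trained encoder and adapter or is learned jointly with everything else.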
For research teams evaluating which architectural family to build on next, the repo gathers the relevant work into a single consolidated public reference. Star and follow it if multimodal work is on your roadmap.
Maintained by NMM-Roadmap on GitHub, updated through 2026-05-26.