A new repo tracks the shift to native multimodal models

A new GitHub list tracks research moving away from modular vision-language stitching toward joint transformer spaces that handle multiple modalities natively.

Alessandro Benigni

PUBLISHED MAY 28, 2026

A new GitHub repository, NMM-Roadmap, catalogs research on the shift from modular multimodal assembly, in which separate vision encoders are bolted onto language models, to native multimodal modeling, in which multiple modalities share a unified transformer space or joint backbone from the start of training.

The architectural distinction matters. Many multimodal systems in use today follow a modular design: a vision encoder converts images into token-like representations, which the language model then consumes as if they were text. Native multimodal models, by contrast, train the entire architecture jointly across modalities, with no separate “vision adapter” step. ByteDance’s Lance model (covered May 22) is one example. The repo tracks the broader research direction across labs.
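The difference between the two designs can be illustrated with a toy sketch. It is not taken from the repo or from any real model: the dimensions, the random weight matrices, and the names `modular_sequence` and `native_sequence` are all hypothetical, and a real system would use trained neural networks rather than single random projections. The point is only where each path gets its image tokens from.

```python
import random

random.seed(0)

# Hypothetical toy sizes, chosen only for illustration.
PATCH_DIM, VISION_DIM, MODEL_DIM, VOCAB = 3, 4, 6, 10


def rand_matrix(rows, cols):
    """Random rows x cols matrix standing in for learned weights."""
    return [[random.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def project(vec, matrix):
    """Multiply a row vector (len = rows) by a rows x cols matrix."""
    return [sum(v * row[j] for v, row in zip(vec, matrix))
            for j in range(len(matrix[0]))]


# Modular design: a separately built vision encoder plus an adapter
# produce token-like vectors that the language model consumes like text.
vision_encoder = rand_matrix(PATCH_DIM, VISION_DIM)
vision_adapter = rand_matrix(VISION_DIM, MODEL_DIM)
lm_text_embedding = rand_matrix(VOCAB, MODEL_DIM)


def modular_sequence(patches, text_ids):
    image_tokens = [project(project(p, vision_encoder), vision_adapter)
                    for p in patches]
    text_tokens = [lm_text_embedding[i] for i in text_ids]
    return image_tokens + text_tokens


# Native design: one set of jointly trained weights embeds both modalities
# directly into the backbone's space, with no separate adapter stage.
joint_patch_embedding = rand_matrix(PATCH_DIM, MODEL_DIM)
joint_text_embedding = rand_matrix(VOCAB, MODEL_DIM)


def native_sequence(patches, text_ids):
    image_tokens = [project(p, joint_patch_embedding) for p in patches]
    text_tokens = [joint_text_embedding[i] for i in text_ids]
    return image_tokens + text_tokens


patches = [[0.1, 0.5, 0.9], [0.3, 0.3, 0.3]]  # two toy image patches
text_ids = [1, 4, 7]                          # three toy text token ids

modular = modular_sequence(patches, text_ids)
native = native_sequence(patches, text_ids)
print(len(modular), len(native), len(modular[0]), len(native[0]))
```

Both paths hand the backbone a sequence of same-width vectors. What differs is provenance: in the modular path the image tokens pass through two stages that could be trained separately from the language model, while in the native path every weight that touches an image patch belongs to the jointly trained model.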

For research teams evaluating which architectural family to build on next, the repo offers a consolidated public reference in one place. It is worth starring and following if multimodal work is on your roadmap.

Maintained by NMM-Roadmap on GitHub, updated through 2026-05-26.
