ByteDance released Lance, a 3-billion-parameter multimodal model that handles image understanding, image generation, image editing, video understanding, video generation, and video editing inside a single architecture. The model was published on GitHub on May 20 alongside weights on Hugging Face.

The unified design is the headline claim. Most multimodal stacks today remain stitched together: Google’s Gemini separates understanding and generation pipelines, while Meta’s MovieGen handles video generation but not the full understanding-to-editing loop. Lance folds all six tasks into one transformer backbone trained from scratch, with only the ViT and VAE encoders borrowed from prior work.

The from-scratch training ran on 128 A100 GPUs, a budget-constrained run by frontier standards. ByteDance used a staged multi-task recipe to prevent interference among the six tasks, a known failure mode when generation and understanding objectives are trained jointly.

The 3B active-parameter count is the practical signal for builders. A model that runs image and video tasks in one inference call at this scale is deployable on a 40GB GPU, removing the infrastructure overhead of running separate specialist models per task. Teams building content pipelines that touch both image and video workflows should benchmark Lance before adding another model to their stack.
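As a rough sanity check on the 40GB figure, the weight footprint of a 3B-parameter model can be estimated from the bytes stored per parameter. The sketch below is back-of-envelope arithmetic, not a measurement of Lance: the precisions listed are assumptions about how the weights might be loaded, the helper names are hypothetical, and the estimate covers weights only. Activations, the ViT and VAE encoders, and video latents all add to the total.

```python
# Back-of-envelope weight memory for a 3B-parameter model.
# Assumptions (not from the Lance release): the precisions below,
# and that only the backbone weights are counted.

PARAMS = 3e9   # parameter count stated for Lance
GPU_GB = 40    # GPU memory budget cited in the article

# Bytes stored per parameter at common precisions (assumed options).
BYTES_PER_PARAM = {"fp32": 4, "bf16": 2, "int8": 1}


def weight_memory_gb(num_params: float, bytes_per_param: int) -> float:
    """Return weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


def headroom_table(num_params: float, gpu_gb: float) -> dict:
    """Map each precision to (weights_gb, remaining_gb) on the given GPU."""
    table = {}
    for name, nbytes in BYTES_PER_PARAM.items():
        weights = weight_memory_gb(num_params, nbytes)
        table[name] = (weights, gpu_gb - weights)
    return table


if __name__ == "__main__":
    for name, (weights, left) in headroom_table(PARAMS, GPU_GB).items():
        print(f"{name}: weights ~{weights:.0f} GB, ~{left:.0f} GB left of {GPU_GB} GB")
```

Under these assumptions the weights take roughly 6 GB at bf16, which leaves most of a 40GB card for activations and video latents. Actual headroom depends on resolution, clip length, and batch size, so it should be measured rather than inferred.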

Released by ByteDance on GitHub on 2026-05-20.