Researchers at Google DeepMind and Seoul National University published LiteFrame, a compact vision encoder for Video Large Language Models that reduces total inference latency by up to 35% while processing 8x more video frames within a fixed compute budget.

The problem LiteFrame targets is a second-order bottleneck that prior work created. Teams building Video LLMs typically reduce visual tokens after feature extraction to lighten the LLM’s load. That works, but once you shrink the LLM side, the per-frame cost of the vision transformer (ViT) becomes the dominant latency source. LiteFrame attacks that encoder-side bottleneck directly.
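A back-of-the-envelope model makes the shift visible. The sketch below assumes total latency is a per-frame encoder cost plus a per-token LLM cost. The function name and every number in it are hypothetical, chosen only to illustrate the effect; none comes from the LiteFrame release.

```python
def latency_breakdown(frames, vit_ms_per_frame, tokens_per_frame, llm_ms_per_token):
    """Return (encoder_ms, llm_ms, encoder_share) under a simple additive model.

    Assumption: latency = frames * vit_ms_per_frame
                        + frames * tokens_per_frame * llm_ms_per_token.
    Real systems also have attention costs that grow faster than linearly.
    """
    encoder_ms = frames * vit_ms_per_frame
    llm_ms = frames * tokens_per_frame * llm_ms_per_token
    return encoder_ms, llm_ms, encoder_ms / (encoder_ms + llm_ms)


# Hypothetical numbers, for illustration only.
FRAMES = 64
VIT_MS = 10.0   # encoder cost per frame
LLM_MS = 0.05   # LLM cost per visual token

# Before token reduction: 256 visual tokens per frame reach the LLM.
before = latency_breakdown(FRAMES, VIT_MS, 256, LLM_MS)
# After post-hoc token reduction: 16 tokens per frame, same encoder cost.
after = latency_breakdown(FRAMES, VIT_MS, 16, LLM_MS)

print(f"encoder share before: {before[2]:.0%}")
print(f"encoder share after:  {after[2]:.0%}")
```

With these made-up inputs the encoder's share of total latency rises from roughly 44% to roughly 93%, even though the encoder itself did not get any slower. Only a cheaper encoder moves the total from that point on.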

The method trains a compact student encoder (87 million parameters, down from the 304 million in the teacher model) using Compressed Token Distillation. The student learns to predict spatio-temporally compressed representations from the larger teacher, effectively skipping redundant computation. A follow-on Language Model Adaptation stage aligns the compressed latent space so the downstream LLM can handle up to 512 frames without additional high-resolution training.
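The distillation objective can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the release does not specify how the teacher's compressed targets are formed, so the example assumes plain average pooling over time and space and a mean squared error loss. The names `pool_tokens` and `distillation_loss`, the strides, and the toy data are all hypothetical.

```python
def pool_tokens(tokens, t_stride, s_stride):
    """Average-pool a [frames][tokens][dim] nested list over time and space.

    Assumption: mean pooling stands in for whatever compression the
    teacher-side targets actually use; the source does not specify it.
    """
    frames, n_tokens, dim = len(tokens), len(tokens[0]), len(tokens[0][0])
    pooled = []
    for t0 in range(0, frames, t_stride):
        row = []
        for s0 in range(0, n_tokens, s_stride):
            group = [tokens[t][s]
                     for t in range(t0, min(t0 + t_stride, frames))
                     for s in range(s0, min(s0 + s_stride, n_tokens))]
            row.append([sum(v[d] for v in group) / len(group) for d in range(dim)])
        pooled.append(row)
    return pooled


def distillation_loss(student_out, teacher_tokens, t_stride=2, s_stride=4):
    """Mean squared error between student output and pooled teacher tokens.

    Assumes student_out already has the pooled shape.
    """
    target = pool_tokens(teacher_tokens, t_stride, s_stride)
    total, count = 0.0, 0
    for s_row, t_row in zip(student_out, target):
        for s_vec, t_vec in zip(s_row, t_row):
            for a, b in zip(s_vec, t_vec):
                total += (a - b) ** 2
                count += 1
    return total / count


# Toy teacher output: 4 frames, 8 tokens per frame, 2-dim features.
teacher = [[[float(t + s), float(t * s)] for s in range(8)] for t in range(4)]
target = pool_tokens(teacher, 2, 4)   # compressed grid: 2 frames x 2 tokens
perfect_student = target              # a student that matches the target exactly
zero_student = [[[0.0, 0.0] for _ in row] for row in target]

print(len(target), len(target[0]))
print(distillation_loss(perfect_student, teacher))
print(distillation_loss(zero_student, teacher))
```

The point of the setup is that the student emits the small grid directly. It never materializes the full per-frame token set that the teacher computes before compression, which is where the saved computation would come from.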

On standard video benchmarks including Video-MME, MLVU, and LongVideoBench, LiteFrame establishes a new performance-latency Pareto frontier. The release announcement does not include independent third-party benchmark replication.

Teams running Gemini Vision or Qwen-VL pipelines over long-form clips at scale should treat the reported latency reduction of up to 35% as a figure worth testing against their current encoder configuration before estimating any deployment-cost savings.
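One way to run that comparison is a small timing harness like the sketch below. It uses only the Python standard library. The two encoder functions are hypothetical stand-ins that do trivial arithmetic, not calls into any real model or API; swap in your own pipeline entry points.

```python
import statistics
import time


def median_latency_ms(encode, frames, warmup=2, runs=5):
    """Median wall-clock time of encode(frames), in milliseconds."""
    for _ in range(warmup):          # discard cold-start effects
        encode(frames)
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        encode(frames)
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


def relative_reduction(baseline_ms, candidate_ms):
    """Fractional latency reduction of the candidate versus the baseline."""
    return (baseline_ms - candidate_ms) / baseline_ms


# Hypothetical stand-ins: replace with calls into your real pipelines.
def current_encoder(frames):
    return [sum(f) for f in frames for _ in range(4)]


def candidate_encoder(frames):
    return [sum(f) for f in frames]


clip = [[0.5] * 256 for _ in range(64)]   # toy "clip": 64 frames of 256 values
base = median_latency_ms(current_encoder, clip)
cand = median_latency_ms(candidate_encoder, clip)
print(f"reduction: {relative_reduction(base, cand):.1%}")
```

Because the 35% figure refers to total inference latency, time the whole request path (encoder plus LLM) rather than the encoder alone. On a GPU, make sure asynchronous work has finished before reading the clock, or the measurement will undercount.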

Posted on the LiteFrame project page on 2026-05-20.