LiteFrame cuts video LLM inference latency 35% by shrinking the vision encoder

A Google DeepMind and Seoul National University paper targets the per-frame ViT bottleneck that post-hoc token reduction leaves behind, enabling 8x more frames on the same compute.

Alessandro Benigni

PUBLISHED MAY 21, 2026

2 MIN READ

Follow on Google

MAY 21, 2026

LiteFrame cuts video LLM inference latency 35% by shrinking the vision encoder — featured image for AI Insiders

Researchers at Google DeepMind and Seoul National University published LiteFrame, a compact vision encoder for Video Large Language Models that reduces total inference latency by up to 35% while processing 8x more video frames within a fixed compute budget.

The problem LiteFrame targets is a second-order bottleneck that prior work created. Teams building Video LLMs typically reduce visual tokens after feature extraction to lighten the LLM’s load. That works, but once you shrink the LLM side, the per-frame cost of the vision transformer (ViT) becomes the dominant latency source. LiteFrame attacks that original bottleneck directly.

The method trains a compact student encoder (87 million parameters, down from the 304 million in the teacher model) using Compressed Token Distillation. The student learns to predict spatio-temporally compressed representations from the larger teacher, effectively skipping redundant computation. A follow-on Language Model Adaptation stage aligns the compressed latent space so the downstream LLM can handle up to 512 frames without additional high-resolution training.

On standard video benchmarks including Video-MME, MLVU, and LongVideoBench, LiteFrame establishes a new performance-latency Pareto frontier. The release announcement does not include independent third-party benchmark replication.

Teams running Gemini Vision or Qwen-VL pipelines over long-form clips at scale should treat the 35% latency figure as a deployment-cost benchmark worth testing against their current encoder configuration.

Posted on the LiteFrame project page on 2026-05-20.

LiteFrame cuts video LLM inference latency 35% by shrinking the vision encoder

The morning brief for people inside the AI industry.

More in Models

Anthropic Finds a Workspace for Deliberate Thought in Claude

Broadcom Locks In Apple Silicon Deal Through 2031

Tencent Ships Hy3, a 295B Open Model, Free Through July 21