A 0.22B inpainting model that matches an 11.9B generalist

Moebius, from HUST Vision Lab, matches FLUX.1-Fill-Dev on inpainting benchmarks with under 2% of its parameters and more than 15x faster total inference.

Alessandro Benigni

PUBLISHED JUN 24, 2026

Researchers at Huazhong University of Science and Technology and VIVO AI Lab have published Moebius, a 0.22-billion-parameter image inpainting model that matches or surpasses the generation quality of FLUX.1-Fill-Dev (11.9B parameters) across six benchmarks, while running at over 15 times the inference speed.

The result, described on the Moebius project page by lead author Kangsheng Duan and collaborators, is the product of two combined techniques. The first is a custom architectural block called the Local-lambda Mix Interaction (LMI) block, which replaces standard attention mechanisms by compressing spatial context and semantic information into fixed-size linear matrices. This sidesteps the quadratic cost of full self-attention without discarding the representational depth that inpainting requires. The second is an adaptive multi-granularity distillation strategy: the compact model learns from a heavier teacher model (PixelHacker, also from HUST) entirely within the latent space, avoiding the cost of decoding to pixels during training.
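To make the "fixed-size matrix" idea concrete, here is a minimal sketch of generic kernelized linear attention in plain Python. It is not the authors' LMI block, whose exact formulation is not described here. The feature map `phi` and all function names are assumptions chosen for illustration. The point is only that keys and values can be folded into a summary whose size does not depend on the number of tokens, after which each query reads from that summary instead of from every other token.

```python
import math
import random


def phi(x):
    # Positive feature map elu(x) + 1, a common choice in linear attention.
    # Assumption for illustration; not taken from the Moebius paper.
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def quadratic_attention(Q, K, V):
    """Reference form: every query is scored against every key, O(n^2)."""
    d_v = len(V[0])
    out = []
    for q in Q:
        fq = phi(q)
        weights = [dot(fq, phi(k)) for k in K]
        total = sum(weights)
        out.append([sum(w * v[c] for w, v in zip(weights, V)) / total
                    for c in range(d_v)])
    return out


def summarized_attention(Q, K, V):
    """Fold all keys/values into a d_k x d_v summary, then query it, O(n)."""
    d_k, d_v = len(K[0]), len(V[0])
    S = [[0.0] * d_v for _ in range(d_k)]  # context summary, size independent of n
    z = [0.0] * d_k                        # running normalizer
    for k, v in zip(K, V):
        fk = phi(k)
        for i in range(d_k):
            z[i] += fk[i]
            for c in range(d_v):
                S[i][c] += fk[i] * v[c]
    out = []
    for q in Q:
        fq = phi(q)
        denom = dot(fq, z)
        out.append([sum(fq[i] * S[i][c] for i in range(d_k)) / denom
                    for c in range(d_v)])
    return out


def random_tokens(n, d, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(d)] for _ in range(n)]


if __name__ == "__main__":
    rng = random.Random(0)
    Q, K, V = (random_tokens(16, 4, rng) for _ in range(3))
    a = quadratic_attention(Q, K, V)
    b = summarized_attention(Q, K, V)
    worst = max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    print(f"max difference between the two forms: {worst:.2e}")
```

Both functions compute the same result up to floating-point error. The second does it without ever building the n-by-n score table, which is the general reason this family of techniques scales better with image resolution.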

The efficiency numbers the team reports are notable. At 0.22B parameters, Moebius has under 2% of the parameters of FLUX.1-Fill-Dev (0.22 / 11.9 ≈ 1.85%). On a single GPU, it reaches 26 milliseconds per denoising step. Combined with optimized sampling, the team claims more than 15x total runtime acceleration versus a 10-billion-parameter-class model. Benchmark evaluations span natural scene inpainting (Places2) and portrait inpainting (CelebA-HQ, FFHQ), and the paper states Moebius is on par with or better than FLUX.1-Fill-Dev on all six benchmarks, with particular gains on complex textures and facial plausibility.
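The headline ratios are easy to check from the reported figures. The short sketch below does the arithmetic. The parameter counts and the 26 ms per-step figure come from the team's own reporting. The step count is not given in this article, so `ASSUMED_STEPS` is a purely hypothetical value used to show how per-step latency turns into total denoising time.

```python
def parameter_share(small_b: float, large_b: float) -> float:
    """Small model's parameter count as a percentage of the large one."""
    return 100.0 * small_b / large_b


def denoising_time_ms(ms_per_step: float, steps: int) -> float:
    """Total denoising time, ignoring encode/decode and any other overhead."""
    return ms_per_step * steps


MOEBIUS_B = 0.22     # billions of parameters, as reported
FLUX_FILL_B = 11.9   # billions of parameters, as reported
MS_PER_STEP = 26.0   # reported single-GPU figure; hardware-dependent
ASSUMED_STEPS = 20   # hypothetical: the article does not state a step count

print(f"{parameter_share(MOEBIUS_B, FLUX_FILL_B):.2f}% of the parameters")  # 1.85%
print(f"{FLUX_FILL_B / MOEBIUS_B:.1f}x fewer parameters")                   # 54.1x
print(f"{denoising_time_ms(MS_PER_STEP, ASSUMED_STEPS):.0f} ms of denoising "
      f"at {ASSUMED_STEPS} steps")                                          # 520 ms
```

The reported speedup of more than 15x is an end-to-end claim that also depends on sampling strategy, so it cannot be derived from the parameter ratio alone.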

What this signals for builders

The efficiency thesis here is worth separating from the benchmark claim. Running a roughly 12B-parameter diffusion model for inpainting at scale carries real costs: GPU memory, inference latency, and per-request compute. A model that holds quality parity on a focused task in a fraction of that footprint changes the cost side of that decision. Object removal and image inpainting are now standard features in photo and video tools, not frontier research problems. A model deployable on consumer-grade hardware or edge devices widens what is practical to ship without dedicated high-memory GPU capacity.

That framing is the Moebius team’s own: they describe the project as a “task-specific specialist over bloated generalists,” and they ask whether a model can become smaller and quicker once its job has been narrowly scoped. Their bet is that inpainting is a narrow enough domain for task-specific compression with knowledge distillation to close most of the quality gap that parameter reduction would otherwise create.

What has not been independently verified

The project page presents benchmark results and visual comparisons but this is a preprint (arXiv:2606.19195), not a peer-reviewed publication. The benchmark numbers come from the research team’s own evaluation setup. Independent replication across diverse real-world images, particularly edge cases like fine hair strands, reflective surfaces, or heavily occluded backgrounds, has not been published by third parties. The distillation approach depends on PixelHacker as the teacher model; neither model has been widely stress-tested in production settings at volume. The 26ms-per-step figure is hardware-dependent and not yet independently benchmarked across inference stacks or quantized deployments.

Teams evaluating inpainting infrastructure for object removal features should treat these numbers as directional: the parameter efficiency claim and benchmark results are interesting enough to warrant running your own evaluation, but not yet independently confirmed at the level required to base an architectural decision on them.
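For teams that do run their own evaluation, per-step latency is the easiest number to reproduce. The harness below is a minimal sketch using only the Python standard library. `fake_denoising_step` is a hypothetical stand-in workload and would be replaced by a callable that runs one real denoising step on the target hardware. On a GPU, that callable has to block until the work has actually finished (for example by synchronizing the device), or the timings will be meaningless.

```python
import statistics
import time
from typing import Callable, Dict


def measure_step_latency(step_fn: Callable[[], None],
                         warmup: int = 5,
                         runs: int = 50) -> Dict[str, float]:
    """Time repeated calls to step_fn and return median and p95 in milliseconds."""
    # Warm-up calls are discarded so one-time setup costs do not skew results.
    for _ in range(warmup):
        step_fn()

    samples_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        step_fn()
        samples_ms.append((time.perf_counter() - start) * 1000.0)

    samples_ms.sort()
    p95_index = min(len(samples_ms) - 1, int(0.95 * len(samples_ms)))
    return {
        "median_ms": statistics.median(samples_ms),
        "p95_ms": samples_ms[p95_index],
    }


def fake_denoising_step() -> None:
    # Hypothetical placeholder workload; swap in one real model step.
    sum(i * i for i in range(10_000))


if __name__ == "__main__":
    result = measure_step_latency(fake_denoising_step)
    print(f"median {result['median_ms']:.3f} ms, p95 {result['p95_ms']:.3f} ms")
```

Reporting the median alongside a tail percentile, rather than a single average, makes it easier to compare against a published per-step figure and to spot jitter from the inference stack.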

If the benchmark results hold under independent testing, Moebius represents a concrete data point that task-specific distillation can close the quality gap between sub-billion and 10-billion-class generalist models on inpainting. Teams shipping object-removal or image cleanup features at scale should put this model on their evaluation list now rather than waiting for peer review.

Source: the Moebius project page by Kangsheng Duan, Ziyang Xu, Wenyu Liu, Xiaohu Ruan, Xiaoxin Chen, and Xinggang Wang at Huazhong University of Science and Technology and VIVO AI Lab (hustvl.github.io/Moebius/, arXiv preprint 2606.19195, 2026).
