Meta AI released WavFlow on GitHub on May 20, 2026. It is a flow-matching framework that generates synchronized audio from video and text inputs directly in raw waveform space, without routing through a latent audio codec. That design choice sets it apart from the latent-based approaches used by most audio generation systems, which compress audio into a lower-dimensional space before generation.

The technical core uses waveform patchifying and amplitude lifting to make raw audio compatible with flow matching via direct x-prediction. The result: a model that accepts video, text, or both as inputs, and outputs synchronized waveforms at 16 kHz or 44 kHz.
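To make the terms in the paragraph above concrete, here is a minimal NumPy sketch of two of them: splitting a waveform into fixed-size patches, and recovering a flow-matching velocity from a prediction of the clean signal (x-prediction). It is an illustration under stated assumptions, not WavFlow's code. The function names, the patch size of 160 samples, and the linear noise-to-data path are all hypothetical choices, and amplitude lifting is left out because the article does not describe how it works.

```python
import numpy as np


def patchify(wave, patch_size):
    """Split a mono waveform of shape (T,) into non-overlapping patches.

    Returns an array of shape (T // patch_size, patch_size). Trailing samples
    that do not fill a whole patch are dropped. The patch size is a
    hypothetical choice, not a value taken from WavFlow.
    """
    usable = (len(wave) // patch_size) * patch_size
    return wave[:usable].reshape(-1, patch_size)


def unpatchify(patches):
    """Inverse of patchify: flatten patches back into a 1-D waveform."""
    return patches.reshape(-1)


def interpolate(x_clean, noise, t):
    """Point on an assumed linear path from noise (t=0) to data (t=1)."""
    return (1.0 - t) * noise + t * x_clean


def velocity_from_x_prediction(x_pred, x_t, t, eps=1e-5):
    """Convert a clean-signal prediction into a velocity.

    On the linear path above the true velocity is x_clean - noise, which
    equals (x_clean - x_t) / (1 - t). The eps clamp avoids division by zero
    as t approaches 1.
    """
    return (x_pred - x_t) / max(1.0 - t, eps)


rng = np.random.default_rng(0)
sample_rate = 16_000
# One second of a 440 Hz sine as a stand-in for real audio.
wave = np.sin(2 * np.pi * 440.0 * np.arange(sample_rate) / sample_rate)

patches = patchify(wave, patch_size=160)
noise = rng.standard_normal(patches.shape)
t = 0.3
x_t = interpolate(patches, noise, t)

# With a perfect ("oracle") x-prediction, the derived velocity equals the
# true velocity x_clean - noise. A trained network would supply x_pred.
velocity = velocity_from_x_prediction(patches, x_t, t)
```

In a real system a network conditioned on video and text features would produce `x_pred`, and an ODE solver would integrate the derived velocity from noise to a waveform. The sketch only checks that the bookkeeping between x-prediction and velocity is consistent.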

Benchmark results on VGGSound (video-to-audio) and AudioCaps (text-to-audio) show WavFlow matching established latent-based methods on acoustic fidelity and synchronization. Meta AI has not released production-trained checkpoints, citing organizational policy. A foundation checkpoint trained on fully open-source data is in progress.

That constraint matters in practice. With no released checkpoints, builders who want to run WavFlow today must train their own model from scratch using the published training guide. Teams evaluating raw-waveform generation against latent pipelines should factor that training time into their timelines before committing to either architecture for a production audio feature.

Released by Meta AI on GitHub on 2026-05-20.