How OpenAI keeps voice AI fast for 900 million weekly users

OpenAI split WebRTC into a stateless relay and a stateful transceiver, using a protocol field as a routing key to avoid database lookups on Kubernetes.

Alessandro Benigni

PUBLISHED JUL 2, 2026

OpenAI runs real-time voice for 900 million weekly users on WebRTC, the protocol originally built for video calls, and had to redesign how it deploys that protocol to make it work on Kubernetes. ByteByteGo detailed the architecture in a technical breakdown of the system, based on publicly shared details from OpenAI’s engineering team. The core problem was a mismatch between the protocol’s assumptions and the infrastructure running it.

WebRTC was built for servers with stable IPs and fixed ports. Kubernetes treats compute as disposable: pods get scheduled, rescheduled, and replaced constantly. That mismatch created two failures at OpenAI’s scale, according to ByteByteGo. Deploying one UDP port per voice session meant reserving tens of thousands of public ports, which strained load balancers designed for a handful of well-known ports and complicated firewall policy and security audits. Session state also proved sticky: ICE and DTLS, the protocols that handle connection setup and encryption, require every packet in a session to reach the same process that started it, or the session breaks.

The standard fix at this scale is a Selective Forwarding Unit (SFU), the media server architecture used for group calls and classrooms. OpenAI evaluated it and decided against it. An SFU treats every participant, including the AI, as a peer in a multiparty call. OpenAI’s traffic is overwhelmingly one person talking to one model, so the SFU model would have added overhead without solving the actual problem.

OpenAI’s answer splits the stack into two components. A stateless relay sits at the network edge and handles nothing but packet routing. A stateful transceiver, positioned behind it, owns the ICE connectivity checks, the DTLS handshake, SRTP encryption keys, and the full session lifecycle. Signaling goes directly to the transceiver; media enters through the relay first.

The hard part of that split is routing the first packet of a new session, before any mapping between client and transceiver exists. A database lookup on that hot path would add latency and a new dependency. OpenAI instead reuses a field the protocol already carries: the ICE username fragment, or ufrag, generated during signaling and echoed back in the USERNAME attribute of the client’s first STUN binding request. Because OpenAI generates the server-side ufrag itself, it can encode routing metadata directly into that field. The relay reads the ufrag off the first packet, decodes the destination, and forwards accordingly. Every subsequent packet uses an in-memory mapping, skipping the ufrag parsing step entirely. A Redis cache backs up that mapping so a restarted relay can recover it right away instead of waiting for a fresh STUN packet.
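To make the routing idea concrete, here is a minimal Go sketch under stated assumptions. The article does not say which fields OpenAI encodes or how, so the `route` struct, the 12-byte layout, and the helper names are all hypothetical. The sketch packs a cluster and transceiver ID into a ufrag using unpadded base64, whose alphabet fits the characters ICE allows in a ufrag. It then recovers the route from a STUN USERNAME value of the form `serverUfrag:clientUfrag` and caches it per client address. STUN parsing and the Redis backup are left out.

```go
package main

import (
	"crypto/rand"
	"encoding/base64"
	"encoding/binary"
	"errors"
	"fmt"
	"strings"
)

// route is hypothetical routing metadata, not OpenAI's actual format.
type route struct {
	Cluster     uint16
	Transceiver uint32
}

// encodeUfrag packs the route plus 6 random bytes into 16 base64 characters.
// Unpadded standard base64 uses only A-Z, a-z, 0-9, '+' and '/'.
func encodeUfrag(r route) (string, error) {
	raw := make([]byte, 12)
	binary.BigEndian.PutUint16(raw[0:2], r.Cluster)
	binary.BigEndian.PutUint32(raw[2:6], r.Transceiver)
	if _, err := rand.Read(raw[6:]); err != nil {
		return "", err
	}
	return base64.RawStdEncoding.EncodeToString(raw), nil
}

// decodeUfrag reverses encodeUfrag and rejects anything with the wrong shape.
func decodeUfrag(ufrag string) (route, error) {
	raw, err := base64.RawStdEncoding.DecodeString(ufrag)
	if err != nil || len(raw) != 12 {
		return route{}, errors.New("ufrag does not carry a route")
	}
	return route{
		Cluster:     binary.BigEndian.Uint16(raw[0:2]),
		Transceiver: binary.BigEndian.Uint32(raw[2:6]),
	}, nil
}

// relay keeps the in-memory flow table, keyed by client "ip:port".
type relay struct {
	flows map[string]route
}

// lookup uses the flow table when it can and decodes the ufrag only for
// the first packet of a flow. stunUsername is assumed to be the already
// extracted USERNAME value ("serverUfrag:clientUfrag").
func (r *relay) lookup(clientAddr, stunUsername string) (route, error) {
	if rt, ok := r.flows[clientAddr]; ok {
		return rt, nil // fast path: no ufrag parsing
	}
	serverUfrag, _, ok := strings.Cut(stunUsername, ":")
	if !ok {
		return route{}, errors.New("USERNAME is not ufrag:ufrag")
	}
	rt, err := decodeUfrag(serverUfrag)
	if err != nil {
		return route{}, err
	}
	r.flows[clientAddr] = rt
	return rt, nil
}

func main() {
	ufrag, err := encodeUfrag(route{Cluster: 7, Transceiver: 4242})
	if err != nil {
		panic(err)
	}
	r := &relay{flows: map[string]route{}}
	rt, err := r.lookup("203.0.113.5:40000", ufrag+":clientfrag")
	fmt.Println(rt, err)
}
```

The first call to `lookup` for a client address pays for the decode; later calls hit the map. A production relay would also need to decide how far to trust a field that the client can see and replay, which this sketch does not address.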

This structure enabled Global Relay, OpenAI’s fleet of geographically distributed ingress points running identical forwarding logic. Cloudflare handles proximity steering on the signaling side, routing each session’s setup request to a nearby transceiver cluster, which then determines which relay location gets advertised to the client. Both signaling and media enter the network close to the user while the session stays anchored to one transceiver for its duration, shortening the setup round-trip and the first connectivity check.

The relay itself runs as an ordinary userspace Go process rather than a kernel-bypass system that would poll the network card directly. ByteByteGo reports that OpenAI evaluated kernel bypass and decided the added operational complexity was not worth the throughput gain at its traffic level. Three techniques carry the performance load instead: SO_REUSEPORT, which lets multiple worker processes share one UDP port so the kernel distributes packets across them; thread pinning via Go’s runtime.LockOSThread, which ties a worker goroutine to a single OS thread so that a given flow’s packets tend to stay on the same CPU core for better cache locality; and pre-allocated buffers that limit garbage collection pressure.
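Two of those three techniques can be sketched in a few lines of Go. This is an illustration under assumptions, not OpenAI’s code: `bufferPool`, `workerFor`, and the 1500-byte buffer size are hypothetical. The SO_REUSEPORT socket setup is omitted, and a userspace hash stands in for the kernel’s job of spreading flows across workers. What the sketch does show is a pool of buffers allocated once at startup and worker goroutines that each lock themselves to an OS thread with `runtime.LockOSThread`.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"runtime"
	"sync"
)

const bufSize = 1500 // assumed size, roughly one Ethernet MTU

// bufferPool hands out buffers that were all allocated at startup,
// so the packet path itself creates no new garbage.
type bufferPool struct{ free chan []byte }

func newBufferPool(n int) *bufferPool {
	p := &bufferPool{free: make(chan []byte, n)}
	for i := 0; i < n; i++ {
		p.free <- make([]byte, bufSize)
	}
	return p
}

// get returns false when the pool is empty; the caller would drop the packet.
func (p *bufferPool) get() ([]byte, bool) {
	select {
	case b := <-p.free:
		return b, true
	default:
		return nil, false
	}
}

// put returns a buffer to the pool at its full capacity.
func (p *bufferPool) put(b []byte) {
	select {
	case p.free <- b[:cap(b)]:
	default: // pool already full; let the extra buffer go
	}
}

// workerFor maps a flow key to a worker index, so one flow always
// lands on the same worker. It stands in for kernel distribution.
func workerFor(flow string, workers int) int {
	h := fnv.New32a()
	h.Write([]byte(flow))
	return int(h.Sum32() % uint32(workers))
}

type packet struct {
	flow string
	data []byte
}

func main() {
	const workers = 2
	pool := newBufferPool(8)
	queues := make([]chan packet, workers)
	counts := make([]int, workers)
	var wg sync.WaitGroup
	for i := range queues {
		queues[i] = make(chan packet, 8)
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			runtime.LockOSThread() // this goroutine now has one OS thread to itself
			defer runtime.UnlockOSThread()
			for p := range queues[id] {
				counts[id]++ // each worker writes only its own slot
				pool.put(p.data)
			}
		}(i)
	}
	for _, flow := range []string{"a", "b", "a", "a", "b"} {
		buf, ok := pool.get()
		if !ok {
			continue // pool exhausted: drop instead of allocating
		}
		queues[workerFor(flow, workers)] <- packet{flow, buf[:0]}
	}
	for _, q := range queues {
		close(q)
	}
	wg.Wait()
	fmt.Println("packets per worker:", counts)
}
```

Note that `runtime.LockOSThread` binds a goroutine to a thread, not to a CPU core. Keeping that thread on a specific core would need an operating-system affinity setting on top, which is outside this sketch.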

The design has real limits. It assumes one-to-one sessions throughout, so adding group voice calls or human handoff would mean reworking the transceiver model and revisiting the decision to skip an SFU. The relay’s statelessness is also only partial: it holds an in-memory flow table, recoverable via Redis, rather than holding nothing at all. And the ufrag routing trick depends on OpenAI controlling both ends of signaling, a condition that would not hold for a team using an off-the-shelf signaling stack.

Teams operating real-time audio or video infrastructure on Kubernetes at meaningful scale should treat OpenAI’s approach as a template for one specific problem: encoding routing state inside a protocol field a client already sends, rather than adding a lookup service to the hot path. That pattern generalizes well beyond voice AI, and it is worth testing against any service where the first packet of a session currently forces a database round-trip.

Reported by ByteByteGo, based on publicly shared details from OpenAI’s engineering team.