Thinking Machines, the AI research lab known for its Connectionism publications, has built a model around a fact most labs treat as a nuisance to route around: people interrupt, correct, and redirect while a task is still in progress, and a serious assistant should be built to expect that.
The stakes are structural, not cosmetic. Most frontier labs measure progress by autonomy, meaning how well a model can take a task, complete it unsupervised, and hand back a finished result. Thinking Machines is proposing a second axis entirely: how well a model collaborates with a human who never fully lets go.
Today’s voice assistants fake real-time interaction with what amounts to a relay team. A voice activity detector guesses when a person stopped talking. A speech-to-text model transcribes it. A language model composes a reply. A text-to-speech model reads it aloud. A dialog manager holds the whole chain together. Each piece is simpler than the model it surrounds, and that simplicity is where the ceiling sits.
Thinking Machines describes the core model underneath all of this as still fundamentally turn-based: it perceives a finished input, generates a finished output, and cannot process anything new mid-generation. The company argues this forces users to think like a model themselves: compose the full request, submit it, then wait, instead of course-correcting mid-stream the way two people talking in the same room naturally do.
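The relay described above can be sketched in a few lines of Python. This is a minimal illustration, not any vendor's implementation: every function name, the `SILENCE` marker, and the two-frame silence rule are assumptions made up for the example, and each stage is a toy stand-in for a real model. What it shows is the turn-based constraint itself: nothing downstream runs until the detector decides the user has finished.

```python
from dataclasses import dataclass, field
from typing import List, Optional

SILENCE = ""  # hypothetical marker for an audio frame with no speech


def utterance_finished(frames: List[str], trailing_silence: int = 2) -> bool:
    """Toy voice activity detector: speech followed by N silent frames ends the turn."""
    heard_speech = any(frame != SILENCE for frame in frames)
    tail = frames[-trailing_silence:]
    tail_is_silent = len(tail) == trailing_silence and all(f == SILENCE for f in tail)
    return heard_speech and tail_is_silent


def transcribe(frames: List[str]) -> str:
    """Stand-in for speech-to-text: join the non-silent frames."""
    return " ".join(frame for frame in frames if frame != SILENCE)


def generate_reply(text: str) -> str:
    """Stand-in for the language model: it only ever sees a finished input."""
    return f"You said: {text}"


def synthesize(text: str) -> List[str]:
    """Stand-in for text-to-speech: one 'audio chunk' per word."""
    return text.split()


@dataclass
class CascadedAssistant:
    """Toy dialog manager that holds the chain together."""
    buffer: List[str] = field(default_factory=list)

    def on_frame(self, frame: str) -> Optional[List[str]]:
        self.buffer.append(frame)
        if not utterance_finished(self.buffer):
            return None  # still waiting; the model has seen nothing yet
        text = transcribe(self.buffer)
        self.buffer.clear()
        return synthesize(generate_reply(text))


if __name__ == "__main__":
    assistant = CascadedAssistant()
    for frame in ["book", "a", "flight", SILENCE, SILENCE]:
        print(repr(frame), "->", assistant.on_frame(frame))
```

In this sketch the first four frames return `None`, and a reply appears only after the second silent frame. A correction spoken while the reply is being produced has nowhere to go, which is the ceiling the article describes.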
Its answer is a model called TML-Interaction-Small, a 276-billion-parameter mixture-of-experts system with 12 billion active parameters, positioned as the smallest entry in a planned family, with larger versions to follow. Rather than bolting audio and video onto a text-first model, the lab built the architecture around continuous audio and video streams from the start, on the reasoning that live interaction is a harder constraint, one that text does not have to satisfy.
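For a sense of scale, the two reported parameter counts imply that only a small share of the model is active at once. The short calculation below uses only the figures quoted above; the variable names are illustrative, and it says nothing about how the experts are actually routed.

```python
TOTAL_PARAMS_BILLIONS = 276   # reported total parameter count
ACTIVE_PARAMS_BILLIONS = 12   # reported active parameter count

# Share of the model that is active, and how many times larger the total is.
active_fraction = ACTIVE_PARAMS_BILLIONS / TOTAL_PARAMS_BILLIONS
total_to_active_ratio = TOTAL_PARAMS_BILLIONS / ACTIVE_PARAMS_BILLIONS

print(f"Active share of parameters: {active_fraction:.1%}")    # 4.3%
print(f"Total-to-active ratio: {total_to_active_ratio:.0f}x")  # 23x
```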
The architecture rests on three choices. First, time is sliced into 200-millisecond micro-turns instead of discrete conversational turns, so the model continuously decides whether to speak, listen, or stay silent based on everything arriving across audio, video, and text at once. Second, audio and video skip heavy pretrained encoders like Whisper in favor of lightweight components trained from scratch. Third, a fast interaction model handles live exchange while a separate, slower background model manages deep reasoning, tool use, and browsing, sharing context so the two appear to the user as one continuous conversation.
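The first of those choices, the micro-turn loop, can be illustrated with a toy Python sketch. Only the 200-millisecond slice length and the three actions come from the description above. The `Chunk` fields, the `plan_reply` helper, and the rule for choosing an action are hypothetical stand-ins for what would be a learned model. The sketch also picks one action per slice, so it does not capture speaking and listening at the same time, and it leaves out the slower background model.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional

MICRO_TURN_MS = 200  # slice length reported in the article


class Action(Enum):
    SPEAK = "speak"
    LISTEN = "listen"
    SILENT = "silent"


@dataclass
class Chunk:
    """Everything that arrived during one micro-turn (toy stand-ins)."""
    audio: Optional[str] = None  # a word heard in this slice, if any
    video: Optional[str] = None
    text: Optional[str] = None


def plan_reply(heard: str) -> List[str]:
    """Hypothetical reply planner: a two-word reply, one word per micro-turn."""
    return ["heard", heard]


@dataclass
class InteractionLoop:
    context: List[Chunk] = field(default_factory=list)
    pending_reply: List[str] = field(default_factory=list)
    spoken: List[str] = field(default_factory=list)

    def step(self, chunk: Chunk) -> Action:
        """Decide what to do for one 200 ms slice."""
        self.context.append(chunk)
        if chunk.audio is not None:
            # New speech replaces whatever was queued, so an interruption
            # takes effect on the very next slice.
            self.pending_reply = plan_reply(chunk.audio)
            return Action.LISTEN
        if self.pending_reply:
            self.spoken.append(self.pending_reply.pop(0))
            return Action.SPEAK
        return Action.SILENT


if __name__ == "__main__":
    loop = InteractionLoop()
    stream = [Chunk(audio="hi"), Chunk(), Chunk(audio="wait"),
              Chunk(), Chunk(), Chunk()]
    for chunk in stream:
        print(loop.step(chunk).value)
    print(loop.spoken)                                # ['heard', 'heard', 'wait']
    print(len(loop.context) * MICRO_TURN_MS, "ms")    # 1200 ms
```

In the example stream the user interrupts on the third slice, after the loop has spoken only the first word of its reply. The unfinished reply is dropped and a new one starts two slices later, without waiting for any end-of-turn signal.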
That micro-turn design is what lets the system speak while listening, a prerequisite for live translation, or watch while speaking, which is how commentary on a live video feed becomes possible. Thinking Machines built its own benchmarks, including TimeSpeak and CueSpeak, to test these behaviors because existing voice benchmarks were not built to measure them, and reports that most existing models fail these tasks outright rather than performing them poorly.
The company’s own account of the limits deserves as much attention as the capability claims. Long sessions strain the architecture because continuous audio and video accumulate context quickly. The system requires a stable, low-latency connection to function at all. And model size is bounded by latency targets: Thinking Machines says its larger pretrained models are currently too slow to run in this interactive setting, which is why the first release is labeled Small rather than a flagship.
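A rough calculation shows why long sessions add up. The 200-millisecond slice length is the reported figure; the tokens-per-slice value below is a made-up placeholder, since the company's actual token rates for audio and video are not given here, and the function names are illustrative.

```python
MICRO_TURN_MS = 200  # slice length reported in the article


def micro_turns(session_seconds: int) -> int:
    """Number of 200 ms slices in a session of the given length."""
    return session_seconds * 1000 // MICRO_TURN_MS


def context_tokens(session_seconds: int, tokens_per_micro_turn: int) -> int:
    """Context size if every slice adds a fixed number of tokens.

    tokens_per_micro_turn is a hypothetical parameter, not a published figure.
    """
    return micro_turns(session_seconds) * tokens_per_micro_turn


if __name__ == "__main__":
    one_hour = 60 * 60
    print(micro_turns(one_hour))                               # 18000
    # 50 tokens per slice is an arbitrary placeholder for illustration only.
    print(context_tokens(one_hour, tokens_per_micro_turn=50))  # 900000
```

Five slices per second means 18,000 decision points in an hour, so even a modest per-slice token cost grows into a very large context.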
This is a company making a specific bet against the industry’s dominant framing, and the bet has not yet been tested outside benchmarks the company designed itself. A limited research preview is planned for the coming months, with a broader release later this year alongside a research grant program for interaction model work.
Teams building voice products on today’s harness architecture (detection, transcription, generation, and synthesis stitched together) should treat this as a warning about where that approach tops out, not as an immediate replacement to adopt.
Reported by ByteByteGo.