Thinking Machines Lab tested Gemini, Claude, and GPT variants on six routine financial filtering tasks and found they averaged roughly 50 percent accuracy with a simple prompt. That is about coin-flip accuracy on judgments a working investor makes correctly without thinking twice.

The tasks were not exotic. They involved sorting news articles by relevance to a C-suite investor, flagging central bank documents that signal rate moves, and identifying where boilerplate ends in a research report. Investors handle this kind of triage constantly, filtering the noise so they can spend their attention on higher-level decisions. Frontier models, according to Thinking Machines, could not reliably replicate that judgment even with carefully engineered prompts.

Better prompting closed some of the gap. Thinking Machines had its own experts rewrite task instructions and reframe the classification scheme, splitting news into three buckets (relevant and interesting, relevant but uninteresting, irrelevant) instead of two. That pushed accuracy from the mid-40s into the mid-70s. Even the best prompts topped out below 80 percent, the threshold Thinking Machines says investors need before they will trust a system inside their daily workflow.

The lab also found that newer frontier releases are not closing the gap efficiently. In Thinking Machines’ testing, GPT 5.4 cost 43 percent more than GPT 5.2 but delivered only a marginal accuracy improvement on these tasks. Scaling the frontier model, in other words, is not the same as scaling judgment.

Thinking Machines then fine-tuned its own model on top of Qwen3-235B using its Tinker fine-tuning infrastructure, starting with GRPO reinforcement learning and layering in three refinements: interleaved batch training across tasks, a CISPO loss function with asymmetric clipping, and on-policy distillation that promotes the training checkpoint to teacher status only when it clears a new validation-accuracy high. Each addition built on the previous ones, taking the base Qwen model from 44.8 percent accuracy to a final 84.66 percent.
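
To make two of those pieces concrete, here is a minimal Python sketch, not Thinking Machines’ code. It assumes one common formulation of GRPO, in which each sampled answer’s reward is normalized against the other answers to the same prompt, and it models the teacher-promotion rule as a simple running-maximum check on validation accuracy. The names `group_relative_advantages` and `TeacherGate` are hypothetical, and the company has not published its actual implementation or hyperparameters.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages (assumed formulation): normalize each reward
    against the mean and standard deviation of its own sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division finite when every answer earns the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


class TeacherGate:
    """Hypothetical helper: promote a checkpoint to teacher only when it
    sets a new validation-accuracy high."""

    def __init__(self):
        self.best_accuracy = float("-inf")
        self.teacher_step = None

    def maybe_promote(self, step, val_accuracy):
        """Return True and record the step if this checkpoint beats the best so far."""
        if val_accuracy > self.best_accuracy:
            self.best_accuracy = val_accuracy
            self.teacher_step = step
            return True
        return False


if __name__ == "__main__":
    # Four sampled answers to one prompt: two scored correct, two incorrect.
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))

    gate = TeacherGate()
    # Illustrative (step, validation accuracy) pairs, not reported results.
    for step, acc in [(100, 0.50), (200, 0.48), (300, 0.60)]:
        print(step, gate.maybe_promote(step, acc))
```

The gate means a checkpoint that regresses on validation never becomes the distillation target, which is the property the promotion rule as described appears designed to protect.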

That result matters more for what it implies about data than about architecture. Thinking Machines says its first attempt at a training set, sourced from non-expert labelers, produced a model that still performed poorly. The fix was a verification loop: train on the cheap labels, find where the model’s predictions disagree with the labels, and route only those contested examples to expert investors for correction. That let Thinking Machines get expert-quality data without paying expert-labeling costs on every example.
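
The verification loop can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the company’s pipeline: labels and predictions are plain dictionaries keyed by example ID, and `ask_expert` is a hypothetical stand-in for whatever review process sends an example to an investor.

```python
from typing import Callable, Dict, List, Tuple


def find_contested(cheap_labels: Dict[str, str],
                   model_predictions: Dict[str, str]) -> List[str]:
    """Return IDs where the model trained on cheap labels disagrees with them."""
    return [ex_id for ex_id, label in cheap_labels.items()
            if model_predictions[ex_id] != label]


def verify_labels(cheap_labels: Dict[str, str],
                  model_predictions: Dict[str, str],
                  ask_expert: Callable[[str], str]) -> Tuple[Dict[str, str], List[str]]:
    """Send only contested examples to an expert; keep cheap labels elsewhere."""
    contested = find_contested(cheap_labels, model_predictions)
    verified = dict(cheap_labels)  # copy so the original labels are untouched
    for ex_id in contested:
        verified[ex_id] = ask_expert(ex_id)
    return verified, contested


if __name__ == "__main__":
    # Toy data for illustration only.
    cheap = {"a": "relevant", "b": "irrelevant", "c": "relevant"}
    preds = {"a": "relevant", "b": "relevant", "c": "relevant"}
    verified, contested = verify_labels(cheap, preds, lambda ex_id: "relevant")
    print(contested, verified)
```

In this sketch expert time scales with the number of disagreements, not the size of the dataset. One limitation is that an example where the cheap label and the model agree but are both wrong never reaches an expert.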

According to the company, the finished model made 29.8 percent fewer mistakes than the best frontier model Thinking Machines tested, at 13.8 times lower inference cost per task. Thinking Machines frames the result as evidence for a broader shift: instead of one frontier model serving every use case, organizations will increasingly run smaller models tuned to their own proprietary data and judgment calls, and those models will beat general-purpose frontier systems on the tasks that matter to them.
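
A quick back-of-envelope check shows what those ratios imply. The Python sketch below assumes that “29.8 percent fewer mistakes” is a relative reduction in error rate and that it is measured against the 84.66 percent final accuracy on the same benchmark. Neither assumption is confirmed by the company, and the function names are hypothetical.

```python
def implied_baseline_accuracy(tuned_accuracy: float,
                              relative_error_reduction: float) -> float:
    """Back out the baseline accuracy from a relative error reduction.

    Assumes tuned_error = baseline_error * (1 - relative_error_reduction).
    """
    tuned_error = 1.0 - tuned_accuracy
    baseline_error = tuned_error / (1.0 - relative_error_reduction)
    return 1.0 - baseline_error


def relative_cost(times_lower: float) -> float:
    """Convert 'N times lower cost' into a fraction of the baseline cost."""
    return 1.0 / times_lower


if __name__ == "__main__":
    frontier_accuracy = implied_baseline_accuracy(0.8466, 0.298)
    print(f"Implied best-frontier accuracy: {frontier_accuracy:.1%}")
    print(f"Fine-tuned cost as share of frontier cost: {relative_cost(13.8):.1%}")
```

Under those assumptions the implied frontier accuracy comes out near 78 percent, which is consistent with the report that even the best prompts topped out below 80 percent, and the fine-tuned model’s per-task inference cost works out to roughly 7 percent of the frontier model’s.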

The claims come from Thinking Machines’ own benchmark, built on its own proprietary dataset and released only in partial form for public verification. The company has not published the full training set or disclosed which specific frontier model versions it tested beyond noting they were “variants” of Gemini, Claude, and GPT. That limits independent replication, though the underlying mechanism, expert-verified fine-tuning on a narrow task set outperforming general models, matches results other labs have reported in domains like legal and medical document review.

For any operator running high-volume, judgment-heavy workflows on frontier-model APIs, this is a cost and accuracy argument for building a narrow fine-tuned model instead of prompting a general one. The next ninety days are the window to pilot a Tinker-style fine-tuning loop on a single high-frequency internal task before locking in a 2026 frontier-API contract sized for tasks a smaller model could now handle more cheaply.

Reported by Thinking Machines in June 2026.