A 100x cheaper eval judge that matches Claude Opus on chatbot traces

LangChain and Fireworks fine-tuned Qwen-3.5-35B to detect user-perceived errors in agent traces, matching or beating frontier models at a fraction of the inference cost.

Alessandro Benigni

PUBLISHED JUN 17, 2026

3 MIN READ

Follow on Google

-513 MIN AGO

A 100x cheaper eval judge that matches Claude Opus on chatbot traces — featured image for AI Insiders

LangChain and Fireworks AI published a joint engineering post in June 2026 showing they built a production-ready LLM-as-judge (an automated evaluator that uses a language model to score another model’s outputs) capable of matching Claude Opus accuracy on traces (the full recorded logs of a multi-turn agent session) for under one percent of the cost.

The specific task is detecting “perceived error”: the signal that a user believed the assistant made a mistake, regardless of whether the assistant was objectively correct. Perceived error is distinct from factual accuracy. A user who rephrases the same question three times is expressing perceived error even if the original answer was technically right. LangChain argues this signal is both universal across applications and measurable without human labelers at inference time.

The base model selected was Qwen-3.5-35B, chosen after small-scale tests ruled out smaller models as insufficiently capable for multi-turn reasoning. Fine-tuning used managed supervised fine-tuning (SFT) on Fireworks infrastructure with LoRA. The training set came from 707 traces drawn from LangChain’s own chat-langchain dataset, a documentation Q&A agent for LangChain’s developer products.

The accuracy results are concrete. Base Qwen without fine-tuning scored 90.5% on the chat-langchain holdout set and 83.2% on a separate Fleet dataset. After fine-tuning on chat-langchain data only, accuracy rose to 96.1% on chat-langchain and 90.8% on Fleet, without any Fleet-specific training examples. Claude Opus scored 91.6% on chat-langchain and 90.2% on Fleet. The fine-tuned Qwen model outperformed Opus on both sets. GPT-5.5 scored higher on chat-langchain at 98.9% but fell below the fine-tuned Qwen on Fleet at 89.1%.

The cost claim is a range: 10x to 100x cheaper than frontier model inference, depending on trace volume and comparison model. At scale, the gap widens, because serving a hosted open model at high volume is structurally cheaper than paying per-token API rates to a closed frontier lab.

Three factors make this result meaningful for teams running production evals. First, the transfer result: a judge trained on one domain (developer docs Q&A) held its accuracy on a completely different domain (research and document writing via Fleet) without any retraining. That suggests perceived error is not highly domain-specific, which is the prerequisite for using it as a general-purpose production signal rather than a bespoke per-product classifier. Second, the model selection: Haiku, a common cost-optimized substitute for large-model evals, was consistently outperformed by base Qwen before any fine-tuning. Teams currently routing eval traffic to Haiku for cost reasons have a better option. Third, the data volume required was modest: 707 training examples produced a model that crossed frontier accuracy. Most teams running production agents have more labeled trace data than that sitting unused.

The structural skepticism here is worth naming. This is a vendor co-authored post, published by LangChain and Fireworks, the two companies whose products you would use to build and serve the judge. The benchmark datasets are LangChain’s own internal traces (chat-langchain and Fleet), not independently collected or third-party verified. Accuracy numbers compare favorably against Claude Opus and GPT-5.5, but the comparison set was chosen by the authors. The 24% and 18% perceived-error rates in the two datasets are also reported without external validation of the labeling methodology, which relied on model-assisted panel agreement rather than ground-truth human annotation at full scale.

The rollout is staged: LangChain plans to release the fine-tuned perceived error model to a select group of customers before a broader launch in one to two months.

Teams evaluating eval infrastructure in the next quarter should benchmark this approach against their current Haiku-class setup before committing to renewed frontier API spend for trace scoring.

Source: LangChain engineering blog, co-authored with Fireworks AI, published June 2026.

A 100x cheaper eval judge that matches Claude Opus on chatbot traces

The morning brief for people inside the AI industry.

More in Tools

A new document format wants to fix how enterprises feed files to AI

DFlash delivers 4.3x throughput gains on Qwen 3.5 serving

Facebook Turns Its Search Bar Into a Conversational AI Engine