Weibo's 3B model matches flagships on LeetCode, falls short on science

VibeThinker-3B scores 96.1% on LeetCode first-attempt and 94.3 on AIME 2026, but posts only 70.2 on GPQA-Diamond, exposing the gap between verifiable-task optimization and general capability.

Alessandro Benigni

PUBLISHED JUN 18, 2026

A 3-billion-parameter model from Weibo’s AI lab beat GPT-5.2, Kimi K2.5, and Claude Opus 4.6 on a LeetCode contest evaluation, solving 123 of 128 problems on the first attempt for a 96.1 percent rate. That single number is doing a lot of work in the benchmark debate, and the rest of the scorecard explains why.

VibeThinker-3B is built on Qwen2.5-Coder-3B and post-trained through what Weibo calls a Spectrum-to-Signal pipeline: curriculum-based supervised fine-tuning, multi-domain reinforcement learning, offline self-distillation, and instruction-oriented reinforcement learning. The training recipe is published alongside the model. On AIME 2026, the model scored 94.3, rising to 97.1 with claim-level test-time scaling. On LiveCodeBench v6 it posted a Pass@1 of 80.2. All three are verifiable-reasoning benchmarks, where correct answers can be checked by running code or matching against a known answer set.

The GPQA-Diamond score tells a different story. On that graduate-level science benchmark, VibeThinker-3B scored 70.2. Gemini 3 Pro sits at 91.9 on the same benchmark, and Claude Opus 4.5 scores 87.0. The resulting gap of roughly 17 to 22 points is not noise. GPQA-Diamond tests the kind of broad scientific reasoning that cannot be drilled through competition-math curricula or LeetCode repetition; it requires the factual density that tends to scale with parameter count and pretraining breadth.
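The arithmetic behind those headline figures is easy to check. The short Python sketch below recomputes the LeetCode first-attempt rate and the GPQA-Diamond gaps using only the numbers reported in this article; the variable names are illustrative and nothing here comes from Weibo's code.

```python
# Figures as reported in the article; variable names are illustrative.
solved, total = 123, 128
first_attempt_rate = round(100 * solved / total, 1)

gpqa_diamond = {
    "VibeThinker-3B": 70.2,
    "Gemini 3 Pro": 91.9,
    "Claude Opus 4.5": 87.0,
}

# Point gap between each larger model and VibeThinker-3B.
baseline = gpqa_diamond["VibeThinker-3B"]
gaps = {
    name: round(score - baseline, 1)
    for name, score in gpqa_diamond.items()
    if name != "VibeThinker-3B"
}

print(f"LeetCode first-attempt rate: {first_attempt_rate}%")
for name, gap in gaps.items():
    print(f"GPQA-Diamond gap vs {name}: {gap} points")
```

The LeetCode rate rounds to 96.1 percent, and the two GPQA-Diamond gaps come out at 21.7 and 16.8 points.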

The split result puts a clean frame on the efficiency-versus-generality debate. Verifiable benchmarks are, by design, optimizable. A model trained heavily on competition math and code problems, with reinforcement learning shaped by pass-fail signals, can climb those leaderboards disproportionately to its size. That is not a flaw in VibeThinker-3B. It is exactly what the training recipe was designed to do. The question is whether the community treats those scores as a proxy for overall capability or as evidence of a specific, deployable strength.

VentureBeat reported on the release on June 17, 2026, noting the benchmark argument the numbers reignited. That argument has a familiar structure: a small model posts headline-grabbing numbers on a narrow benchmark, skeptics point to a different benchmark where the gap reappears, and both camps are correct. The narrow benchmark reflects a real capability. The broad benchmark reflects a real gap. Neither cancels the other.

For builders, the practical signal is more concrete than the philosophical debate suggests. A 3B model that solves 96 percent of LeetCode problems on the first pass and scores 94 on AIME can run on edge hardware, cost a fraction of a frontier-model API call, and outperform much larger models on the specific tasks it was trained to handle. The tradeoff is that it will not generalize across science, medicine, or open-domain reasoning at the same level. That tradeoff is a product decision, not a research failure.

The deeper pattern worth watching is what this confirms about the benchmark landscape itself. When a model can reach near-frontier scores on verifiable tasks at 3B parameters, the separating signal shifts to the benchmarks that resist narrow optimization. GPQA-Diamond, open-ended reasoning evaluations, and multi-step tasks with ambiguous grounding are where capability differences between a fine-tuned small model and a large pretrained one remain durable.

Teams evaluating small reasoning models for coding pipelines should benchmark VibeThinker-3B against their actual task distribution before defaulting to a larger model; the LeetCode and AIME numbers are reproducible and the cost differential is substantial.
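One way to act on that advice is a small first-attempt (Pass@1) harness run over a team's own tasks. The Python sketch below is a minimal illustration under stated assumptions: `Task`, `pass_at_1`, and `stub_model` are hypothetical names, the stub stands in for whatever inference call a team actually uses, and each task supplies its own pass-fail checker in the spirit of verifiable benchmarks. It is not the evaluation code used by Weibo or any benchmark named above.

```python
from typing import Callable, Iterable, NamedTuple


class Task(NamedTuple):
    """One evaluation item: a prompt plus a pass/fail checker (hypothetical type)."""
    prompt: str
    check: Callable[[str], bool]  # returns True when the model output is correct


def pass_at_1(generate: Callable[[str], str], tasks: Iterable[Task]) -> float:
    """Fraction of tasks solved on a single attempt per task."""
    task_list = list(tasks)
    if not task_list:
        raise ValueError("need at least one task")
    passed = sum(1 for task in task_list if task.check(generate(task.prompt)))
    return passed / len(task_list)


def stub_model(prompt: str) -> str:
    """Stand-in for a real model call; replace with your own inference client."""
    canned = {"2+2": "4", "3*3": "9"}
    return canned.get(prompt, "")


tasks = [
    Task("2+2", lambda out: out.strip() == "4"),
    Task("3*3", lambda out: out.strip() == "9"),
    Task("10/4", lambda out: out.strip() == "2.5"),
]

score = pass_at_1(stub_model, tasks)
print(f"Pass@1 on sample tasks: {score:.3f}")
```

Running the same task list through a small model and a larger one gives a like-for-like comparison on the workload that matters, rather than on a public leaderboard.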

Reported by VentureBeat on June 17, 2026, with benchmark figures corroborated by the model’s arXiv paper (2606.16140).
