Anthropic’s Opus 4.8 reportedly tripled GPT-5.5’s score on ARC-AGI-3, the latest and hardest iteration of François Chollet’s abstract-reasoning benchmark suite, according to an X thread posted June 1 by the independent benchmark-watcher scaling01. The thread puts Opus 4.8 at roughly 60 percent and GPT-5.5 in the low 20s.

The caveat matters: this is a single X thread, not an official ARC Prize leaderboard release. Treat the specific numbers as directional until the ARC Prize team publishes verified results.

If the gap holds, it extends a pattern already visible in BrowseComp and coding benchmarks: Anthropic’s reasoning edge widening against OpenAI’s current flagship. Teams evaluating Opus 4.8 versus GPT-5.5 for abstract-reasoning tasks should not wait for official confirmation before running their own evals.

Reported by scaling01 in an X thread dated June 1, 2026, mirrored at threadreaderapp.com.