Opus 4.8 triples GPT-5.5's score on ARC-AGI-3

A public benchmark thread puts Anthropic's flagship in the 60s and OpenAI's in the 20s on the hardest abstract-reasoning suite yet.

Alessandro Benigni

PUBLISHED JUN 3, 2026

Anthropic’s Opus 4.8 reportedly scored roughly three times as high as GPT-5.5 on ARC-AGI-3, the latest and hardest iteration of François Chollet’s abstract-reasoning benchmark suite, according to an X thread posted June 1 by the independent benchmark-watcher scaling01. The thread puts Opus 4.8’s score in the 60s and GPT-5.5’s in the low 20s.

The caveat matters: this is a single X thread, not an official ARC Prize leaderboard release. Treat the specific numbers as directional until the ARC Prize team publishes verified results.

If the gap holds, it extends a pattern already visible in BrowseComp and coding benchmarks: Anthropic’s reasoning edge widening against OpenAI’s current flagship. Teams weighing Opus 4.8 against GPT-5.5 for abstract-reasoning tasks should run their own evals now rather than wait for official confirmation.

Reported by scaling01 in an X thread dated June 1, 2026, mirrored at threadreaderapp.com.
