OpenAI's new life-science benchmark hits a 36% ceiling

LifeSciBench tested frontier models on 750 real research tasks judged by PhD scientists, and the best system passed only about one in three.

Alessandro Benigni

PUBLISHED JUN 19, 2026


The strongest AI model available today clears roughly one in three authentic life-science research tasks. That number comes from LifeSciBench, a benchmark OpenAI published on June 17 in collaboration with Tacit Labs, a startup focused on drug-development feedback systems.

The benchmark is built around 750 tasks authored by 173 scientists who have no institutional relationship with OpenAI. Those scientists hold PhD-level credentials and direct drug-discovery experience. Their involvement is structural: they wrote the tasks, supplied the grading rubrics, and judged the outputs. The design deliberately excludes OpenAI-internal authorship to sidestep the self-grading critique that has followed lab-published benchmarks.

Task structure matters here. Each entry pairs a natural-language prompt with supporting artifacts and a free-response format, not multiple choice. A scientist would recognize the format because it mirrors how a principal investigator briefs a junior colleague: here is the context, here are the files, tell me what you find. Grading is rubric-anchored and executed by the same population of practicing scientists who wrote the tasks.

The coverage spans seven biological domains: genomics, medicinal chemistry, clinical and translational science, and four others. Seven workflow categories cut across those domains: evidence handling, analysis, design and optimization, scientific reasoning, validation and operations, translation, and research communication. The 7-by-7 grid was designed to catch models that excel in one cell while failing in others.

GPT-Rosalind posted the top score at 36.1 percent, followed by GPT-5.5, Grok 4.3, and Gemini 3.1 Pro in descending order. None cleared 40 percent. The release announcement does not include error bars or methodology for handling tasks where graders disagreed, which leaves some questions about score precision at the margin.
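For a rough sense of what the missing error bars might look like, here is a minimal sketch. It is not drawn from the release. It assumes each of the 750 tasks is scored as an independent pass or fail and that 36.1 percent is a simple pass rate, neither of which the launch materials confirm. The function name `pass_rate_interval` is made up for illustration.

```python
import math


def pass_rate_interval(pass_rate: float, n_tasks: int, z: float = 1.96) -> tuple[float, float]:
    """Normal-approximation interval for a pass rate over n_tasks.

    Assumption: every task is an independent pass/fail trial. Real rubric
    grading may award partial credit, which this sketch ignores.
    """
    standard_error = math.sqrt(pass_rate * (1.0 - pass_rate) / n_tasks)
    margin = z * standard_error
    return pass_rate - margin, pass_rate + margin


# Reported top score (36.1 percent) and task count (750) from the article.
low, high = pass_rate_interval(0.361, 750)
print(f"approximate 95% interval: {low:.3f} to {high:.3f}")
```

Under those assumptions the interval runs from roughly 33 to 40 percent, or about plus or minus 3.4 points, so a gap of a point or two between models would be hard to read as meaningful without more detail from OpenAI.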

What 36 percent means in practice depends on which tasks a model passes. A system that reaches 36 percent overall by clearing most evidence-handling tasks while failing 90 percent of experimental design tasks is a literature-review tool, not a research collaborator. OpenAI has not published per-workflow or per-domain breakdowns in the launch materials, though the benchmark’s architecture implies that such data exists.
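The arithmetic behind that concern is easy to sketch. The counts below are invented purely for illustration, since no per-workflow numbers have been published. They are chosen so that the seven workflow categories sum to 750 tasks and the aggregate lands at 36 percent. The helper names are hypothetical as well.

```python
# Hypothetical (passed, total) counts per workflow category.
# These numbers are invented; OpenAI has not released a breakdown.
HYPOTHETICAL_BREAKDOWN = {
    "evidence handling": (90, 110),
    "analysis": (60, 110),
    "design and optimization": (10, 105),
    "scientific reasoning": (40, 105),
    "validation and operations": (15, 105),
    "translation": (20, 105),
    "research communication": (35, 110),
}


def aggregate_pass_rate(breakdown: dict[str, tuple[int, int]]) -> float:
    """Overall pass rate: total tasks passed divided by total tasks."""
    passed = sum(p for p, _ in breakdown.values())
    total = sum(t for _, t in breakdown.values())
    return passed / total


def per_category_rates(breakdown: dict[str, tuple[int, int]]) -> dict[str, float]:
    """Pass rate within each workflow category."""
    return {name: p / t for name, (p, t) in breakdown.items()}


overall = aggregate_pass_rate(HYPOTHETICAL_BREAKDOWN)
rates = per_category_rates(HYPOTHETICAL_BREAKDOWN)
weakest = min(rates, key=rates.get)

print(f"overall: {overall:.1%}")
for name, rate in sorted(rates.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:28s} {rate:.1%}")
print(f"weakest category: {weakest}")
```

In this made-up profile the headline figure is 36 percent, yet the model passes about 82 percent of evidence-handling tasks and under 10 percent of design and optimization tasks. Two systems with the same aggregate could look nothing alike at the category level.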

The low ceiling is worth naming plainly. Marketing language around AI in drug discovery has outpaced demonstrated capability for at least three years. LifeSciBench is an attempt to reset that conversation with tasks grounded in practitioner judgment rather than isolated trivia questions or curated chemistry datasets on which models have had extensive training exposure.

The benchmark follows a period of intense lab claims around science AI. Google DeepMind's AlphaFold work and various reasoning-model announcements have fed expectations that AI is close to autonomous research. A ceiling of 36 percent across end-to-end workflows, judged by working scientists, is a colder read of where things actually stand.

For biotech and pharma teams currently evaluating AI research assistants, LifeSciBench offers a useful internal calibration tool: the 750-task corpus and rubrics can serve as a reference for scoping which workflow categories a vendor’s model genuinely clears before committing to a deployment contract. The per-workflow breakdown is the number to ask any vendor for.

Reported by OpenAI on June 17, 2026, with additional coverage by MarkTechPost and TechTimes.
