Anthropic published a white paper on June 5 showing that Claude Opus 4.7, a general-purpose model with no chemistry-specific fine-tuning, matched or outperformed ChemDraw and MestReNova on nuclear magnetic resonance (NMR) spectroscopy tasks. The evaluation, designed with practicing chemists at Anthropic, is the first methodologically careful head-to-head between a frontier general-purpose LLM and the dedicated domain software those chemists actually use day to day.
NMR spectroscopy is the standard analytical technique organic chemists rely on to confirm molecular structure. For every compound synthesized, a chemist manually matches each peak in a spectrum to an atom in the proposed structure. The work is time-consuming and requires expertise, and it is the kind of pattern matching that looks tractable for a capable model.
The forward-prediction task gave each tool a molecular structure and asked it to predict where hydrogen and carbon peaks would fall in a 1D spectrum. Anthropic tested 20 compounds drawn from ChemRxiv preprints published after the models’ training cutoffs, specifically to avoid data contamination. On hydrogen shift prediction, Opus 4.7 posted an average error of 0.079 ppm, well under half the 0.20 ppm tolerance window chemists use. On carbon, Opus 4.7 and MestReNova were effectively tied, at 1.37 and 1.48 ppm respectively. Anthropic queried each Claude model three times per compound and averaged the results; ChemDraw and MestReNova return a single deterministic answer. Opus 4.7 also predicted sub-peak spacing to within half a hertz in roughly 80% of cases, compared with 26 to 35% for the two classical tools.
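To make the scoring concrete, here is a minimal Python sketch of how an average shift error could be computed and compared with the 0.20 ppm hydrogen tolerance. It is an illustration under stated assumptions, not the white paper's actual procedure: the peak values are invented, the function names are hypothetical, predicted and observed peaks are assumed to be already matched one-to-one, and averaging the three runs per peak before computing the error is only one plausible reading of "averaged the results".

```python
from statistics import mean

# Tolerance window for 1H shifts quoted in the article.
H_TOLERANCE_PPM = 0.20


def average_runs(runs):
    """Average per-peak predictions across repeated runs.

    Assumes every run lists the same peaks in the same order.
    """
    return [mean(peak_values) for peak_values in zip(*runs)]


def mean_absolute_error(predicted, observed):
    """Mean absolute difference (ppm) between matched peak lists."""
    if len(predicted) != len(observed):
        raise ValueError("peak lists must have the same length")
    return mean(abs(p - o) for p, o in zip(predicted, observed))


# Hypothetical data for a single compound (NOT taken from the white paper).
observed = [7.26, 3.71, 2.15, 1.22]
runs = [
    [7.30, 3.65, 2.10, 1.25],
    [7.28, 3.70, 2.20, 1.20],
    [7.32, 3.75, 2.12, 1.21],
]

averaged = average_runs(runs)
mae = mean_absolute_error(averaged, observed)
within_tolerance = mae <= H_TOLERANCE_PPM

print(f"MAE = {mae:.3f} ppm, within tolerance: {within_tolerance}")
```

A deterministic tool such as ChemDraw or MestReNova would skip `average_runs` and pass its single prediction straight to `mean_absolute_error`.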
The more significant result is the inverse task. Standard NMR software does forward prediction. It does not attempt to work backward from a spectrum to propose a molecular structure. That step is left to the chemist, or to specialized licensed tools that require 2D NMR data and dedicated training. Anthropic gave Opus 4.7 a set of 15 structure-elucidation problems, each supplying only a molecular formula and a 1D peak list. The model recovered all eight simpler structures correctly on every attempt. On the seven harder targets (fused rings, spirocycles), where the starting material was the only additional input, it returned the correct answer on all three runs for four of them, and on two of three runs for the remaining three.
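The per-problem counts above can be rolled up into a single run-level figure. The short Python sketch below does that arithmetic. The aggregate is derived here rather than reported in the article, and it assumes the eight simpler problems were also run three times each, which the article implies ("every attempt") but does not state outright.

```python
from fractions import Fraction

# Assumption: every problem, simple or hard, was attempted three times.
RUNS_PER_PROBLEM = 3

# (number of problems, correct runs per problem), as described in the article.
reported = [
    (8, 3),  # simpler structures: correct on every attempt
    (4, 3),  # harder targets: correct on all three runs
    (3, 2),  # harder targets: correct on two of three runs
]

total_problems = sum(count for count, _ in reported)
total_runs = total_problems * RUNS_PER_PROBLEM
correct_runs = sum(count * correct for count, correct in reported)

# Exact fraction of individual runs that returned the right structure.
run_accuracy = Fraction(correct_runs, total_runs)

# Problems solved on every run, and problems solved on a majority of runs.
always_correct = sum(c for c, ok in reported if ok == RUNS_PER_PROBLEM)
majority_correct = sum(c for c, ok in reported if 2 * ok > RUNS_PER_PROBLEM)

print(f"{correct_runs}/{total_runs} runs correct ({float(run_accuracy):.1%})")
print(f"{always_correct}/{total_problems} problems correct on every run")
print(f"{majority_correct}/{total_problems} problems correct on most runs")
```

Separating "correct on every run" from "correct on most runs" matters for the workflow question: a lab relying on a single query cares about the former, while a lab willing to sample several answers and compare them can lean on the latter.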
That result matters structurally. The existing workflow in most labs is: use software for the parts it handles, then hand the hard cases to a senior spectroscopist. If a general-purpose model can handle a meaningful fraction of the elucidation cases from the same 1D data a chemist would paste into a chat, the handoff point shifts. The workflow changes from software-plus-human to model-plus-human-for-edge-cases.
Anthropic’s own framing is careful. The evaluation covered 20 compounds across four scaffold classes for the forward task, a sample Anthropic describes as indicative rather than precise. The model struggled with complex coupling patterns and with stereochemistry, which 1D NMR cannot pin down in any case. On the densest inverse targets without the starting-material hint, the model sometimes looped through its reasoning without committing to a structure. Solvent coverage was limited to three common choices. Anthropic names the gaps explicitly and states it wants to extend the evaluation to several hundred compounds spanning 20 to 30 scaffold classes before drawing stronger conclusions.
What Anthropic does not address is whether these results hold when chemists integrate Claude into actual laboratory workflows rather than a controlled benchmark. Benchmark compounds were novel molecules from recent preprints; real lab work includes compounds with incomplete characterization data, ambiguous spectra, and multiple plausible structures that require judgment calls. Even so, the evaluation is honest about its limits in a way that AI-for-science announcements often are not.
The broader pattern worth noting: Anthropic is publishing structured domain evaluations in chemistry, social science, and other fields in parallel with its safety research. The chemistry white paper cites the same infrastructure approach described in recent Anthropic research. These evaluations are not product announcements. They are the public record of where a general-purpose model now sits relative to tools built for a single domain, and they will become reference points for labs deciding whether to build chemistry-specific software or pay for Claude API access.
Chemistry teams at pharma companies and research institutions who currently run ChemDraw for NMR assignment should run the Anthropic white paper’s test protocol on their own compound sets before their next software renewal cycle.
Anthropic Research (anthropic.com/research), 2026-06-05.