Researchers built a two-model system for frontier mathematics research and pointed it at genuinely unsolved problems. GPT-5.5 Pro generated candidate proofs. Claude Opus 4.7 checked them line by line. According to the researchers, who posted the results on X, the pairing resolved a list of open questions in areas that vary widely in how familiar they are to the models.

The setup matters more than either model alone. A single LLM asked to prove a hard theorem will often produce confident, wrong reasoning: fluent notation wrapped around a logical gap. Splitting solver and verifier into separate models forces an adversarial check that a single context window cannot replicate, since the verifier has no stake in defending the solver’s approach.
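The solver-verifier split described above can be sketched as a short loop. This is a minimal illustration, not the researchers' implementation: the names `Verdict`, `prove_with_verifier`, `toy_prover`, and `toy_verifier` are hypothetical, the two models are stood in for by plain Python callables, and feeding the verifier's objections back to the solver is an assumption the post does not confirm.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Verdict:
    """Hypothetical verifier output: accept, or reject with an objection."""
    accepted: bool
    objection: str = ""


# Assumed interfaces: the prover sees the problem plus earlier objections,
# the verifier sees the problem plus one candidate proof.
Prover = Callable[[str, list[str]], str]
Verifier = Callable[[str, str], Verdict]


def prove_with_verifier(
    problem: str,
    prover: Prover,
    verifier: Verifier,
    max_rounds: int = 3,
) -> Optional[str]:
    """Return the first candidate the verifier accepts, or None."""
    objections: list[str] = []
    for _ in range(max_rounds):
        candidate = prover(problem, objections)
        verdict = verifier(problem, candidate)
        if verdict.accepted:
            return candidate
        # The verifier never sees the prover's reasoning history,
        # only the finished candidate, so it has nothing to defend.
        objections.append(verdict.objection)
    return None


def toy_prover(problem: str, objections: list[str]) -> str:
    # Stand-in for a model call: numbers each attempt.
    return f"attempt {len(objections) + 1} at {problem}"


def toy_verifier(problem: str, proof: str) -> Verdict:
    # Stand-in for a model call: accepts only the third attempt.
    ok = proof.startswith("attempt 3")
    return Verdict(ok, "" if ok else "gap in step 2")
```

In this sketch an accepted candidate means only that the verifier raised no objection. It says nothing about whether the proof is correct, which is the gap human review has to close.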

The researchers describe the workflow as stress-tested across problems with very different levels of familiarity to the models. That suggests some questions sat close to well-trodden textbook territory while others reached into genuinely novel terrain, and the distinction determines how much credit the system deserves. A model that resolves a lightly documented but ultimately known result has shown strong retrieval and synthesis. A model that resolves a problem with no prior published solution supports a different claim entirely. The post does not specify how the open questions were selected or how many fell into each category.

The prover-verifier pattern itself is not new. It echoes work labs have published on using one model to generate outputs and another to grade them, a structure that shows up in reinforcement learning pipelines and coding-agent evaluation alike. What is new here is pointing that architecture at research-level mathematics rather than at benchmark problems with known answers. At research level there is no ground truth to check against except the community of mathematicians who eventually read the proof.

The announcement, posted directly to X rather than published through a peer-reviewed venue, does not include independent verification from the mathematics community. A proof that survives an LLM verifier is not the same as a proof accepted by human referees, and there are few precedents for an AI-generated mathematical result gaining acceptance without that step. The claim of “surprisingly strong results” is the researchers’ own characterization, not an outside benchmark.

Even if the underlying proofs hold up under mathematician review, the result is narrower than “AI does math research”: it is evidence that a two-model check-and-balance system can extend reasoning further than either model working alone, in a domain where correctness is binary and unforgiving. Labs building reasoning products should treat prover-verifier pairing as a template worth testing on their own hard-verification domains, not as proof that novel discovery is now routine.

Reported by the researchers (on X) on July 1, 2026.