Google announced Aletheia, an AI system built on Gemini 3 Deep Think that solved 6 of 10 novel math problems in the FirstProof challenge. Aletheia also scored roughly 91.9% on IMO-ProofBench, signaling a significant shift toward automated, research-level proof discovery without human intervention.
Unlike traditional benchmarks that often suffer from data contamination—where models inadvertently memorize training data—the FirstProof challenge consists of ten unpublished, research-level mathematical lemmas. Because these problems were sourced from the ongoing work of mathematicians and had never been posted online, it is deemed virtually impossible for the AI to have seen them before. Furthermore, participants were given only one week to submit their solutions.
Handed raw problem prompts without human hints or dialogue loops, Aletheia produced candidate proofs fully autonomously. Expert human evaluators judged 6 of the 10 proposed solutions “publishable after minor revisions.” Notably, the solution to Problem 8 was judged correct by 5 of 7 experts, with the remaining two citing a lack of clarifying detail. Crucially, on the other 4 problems Aletheia explicitly output “No solution found” or timed out, rather than hallucinating a convincing but flawed answer. DeepMind researchers commented:
“This self-filtering feature was one of the key design principles of Aletheia; we view reliability as the primary bottleneck to scaling up AI assistance on research mathematics. We suspect that… many practicing researchers would prefer to trade raw problem-solving capability for increased accuracy.”
OpenAI also tackled the challenge with an internal, unreleased reasoning model. They initially reported solving 6 of the 10 problems (specifically problems 2, 4, 5, 6, 9, and 10), but that estimate was later revised downward to 5 after their solution to Problem 2 was found to be logically flawed. Unlike DeepMind’s strict zero-shot automation, OpenAI acknowledged relying on limited human supervision to manually evaluate and select the best outputs from multiple attempts.
Under the hood, Aletheia leverages the Gemini 3 Deep Think architecture, relying on extended “test-time compute” (inference time). The system uses a multi-agent framework including a Generator to propose logical steps, a Verifier to evaluate steps for flaws, and a Reviser to iterate and patch mistakes. By integrating external tools like Google Search, the agent can navigate existing literature to verify concepts and is more likely to avoid the unfounded citations that typically plague LLMs.
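The Generator–Verifier–Reviser loop described above can be sketched in a few lines. This is a hypothetical illustration of the control flow only: all names (`generate`, `verify`, `revise`, `solve`, `Candidate`) are invented for this sketch and do not correspond to DeepMind's actual API or implementation. Note the self-filtering behavior at the end, returning "No solution found" rather than an unverified proof.

```python
# Hypothetical sketch of a propose-verify-revise loop.
# All names are illustrative; they are not DeepMind's API.
from dataclasses import dataclass


@dataclass
class Candidate:
    proof: str
    flaws: list


def generate(problem: str) -> Candidate:
    # Stand-in for the Generator agent: propose a candidate proof
    # (here, a dummy sketch with one seeded flaw).
    return Candidate(proof=f"sketch for: {problem}", flaws=["gap in step 2"])


def verify(cand: Candidate) -> list:
    # Stand-in for the Verifier agent: return the list of detected flaws.
    return cand.flaws


def revise(cand: Candidate, flaws: list) -> Candidate:
    # Stand-in for the Reviser agent: patch one flaw per iteration.
    return Candidate(proof=cand.proof + " [patched]", flaws=flaws[1:])


def solve(problem: str, max_rounds: int = 3) -> str:
    """Iterate until the Verifier reports no flaws; otherwise declare
    failure instead of emitting an unverified proof (self-filtering)."""
    cand = generate(problem)
    for _ in range(max_rounds):
        flaws = verify(cand)
        if not flaws:
            return cand.proof
        cand = revise(cand, flaws)
    return "No solution found"


print(solve("lemma 8"))  # the seeded flaw is patched in one round
```

The key design choice this sketch highlights is that the loop's failure mode is an explicit refusal, which is what lets reliability scale: a wrong proof costs a reviewer far more time than a declared miss.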
(Source: Google DeepMind blog)
As explored in a deep dive by Luhui Dev, Aletheia operates as a strict, runnable research loop, akin to a CI/CD pipeline for mathematics: propose, verify, fail, repair, and merge. The LLM acts as a creative candidate generator, while a second agent acts as peer reviewer to drive remediation.
However, researchers noted in the paper Towards Autonomous Mathematics Research that while progress over just a few months has been significant, full autonomy has yet to be achieved:
“Even with its verifier mechanism, Aletheia is still more prone to errors than human experts. Furthermore, whenever there is room for ambiguity, the model exhibits a tendency to misinterpret the question in a way that is easiest to answer… This aligns with the well-known tendencies for ‘specification gaming’ and ‘reward hacking’ in machine learning.”
The mathematicians behind the initiative are already working on its second iteration. A second batch of problems will be created, tested, and graded from March to June 2026, designed this time as a fully formal benchmark.