OpenAI has created a new way to check how well AI can understand and replicate real AI research. They call it PaperBench.
Announced on April 2, 2025, this pioneering evaluation framework tests whether AI systems can accurately interpret complex research papers, independently develop necessary codebases, and successfully execute experiments to match published results.
What is PaperBench?
PaperBench challenges AI agents to replicate 20 Spotlight and Oral papers from the International Conference on Machine Learning (ICML) 2024 from scratch.
This requires systems to demonstrate an understanding of the papers' core contributions, develop complete codebases, and successfully execute experiments that match the original research outcomes.
"We introduce PaperBench, a benchmark evaluating the ability of AI agents to replicate state-of-the-art AI research," the OpenAI team stated in their announcement.
Agents need to recreate 20 key papers from the ICML 2024 conference, both Spotlight and Oral ones, starting from zero.
This means figuring out what each paper is about, building the code from scratch, and running the experiments successfully.
This challenge stands out for its scale and granularity: it includes 8,316 individual tasks that can each be graded separately.
These tasks come with guides that break down the big job of replicating a paper into smaller, clear steps with specific rules for grading.
These guides were made together with the people who wrote the original ICML papers, so they’re accurate and practical.
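To make that concrete, here is a minimal sketch of how such a hierarchical rubric could be represented in code, with leaf requirements that get individual grades and weights that roll up into an overall replication score. The class name, fields, and example weights are illustrative assumptions, not PaperBench's actual schema.

```python
from dataclasses import dataclass, field


@dataclass
class RubricNode:
    """One node in a hypothetical replication rubric.

    Leaf nodes are individually gradable requirements (e.g. "the training
    loop applies the paper's custom loss"); internal nodes group related
    requirements and carry a relative weight.
    """
    description: str
    weight: float = 1.0
    children: list["RubricNode"] = field(default_factory=list)
    score: float = 0.0  # leaves get a 0-1 grade from the judge

    def aggregate(self) -> float:
        """Roll leaf grades up the tree as a weighted average."""
        if not self.children:
            return self.score
        total_weight = sum(child.weight for child in self.children)
        return sum(child.weight * child.aggregate()
                   for child in self.children) / total_weight


# Toy rubric: one paper broken into two sub-goals with gradable leaves.
paper_rubric = RubricNode("Replicate Paper X", children=[
    RubricNode("Implement the proposed method", weight=2.0, children=[
        RubricNode("Custom loss matches Eq. 3", score=1.0),
        RubricNode("Ablation flag for the baseline", score=0.0),
    ]),
    RubricNode("Reproduce Table 1 results", weight=1.0, children=[
        RubricNode("Accuracy within reported tolerance", score=0.5),
    ]),
])
print(f"Replication score: {paper_rubric.aggregate():.1%}")
```

Running this toy example prints a replication score of 50.0%, because only some of the weighted leaf requirements are satisfied; the real benchmark aggregates far more leaves per paper.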
PaperBench Evaluation: A Three-Step Process
PaperBench tests an AI's ability to redo real research in three main steps (a simplified sketch of the full pipeline follows the list):
- AI Creates Code: The AI is given a research paper and must write code that does what the paper describes. It works in a controlled environment (an Ubuntu container).
- Running the Code: The AI's code is then run on a machine with powerful graphics cards (GPUs). This step checks whether the code actually works and produces results.
- Checking the Results: The code's outputs are compared to what the original research paper reported. A detailed set of rules (a rubric) is used to see if the AI's results match.
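Here is a deliberately simplified sketch of that three-step flow. The function names, file names, and toy grader are hypothetical stand-ins for illustration; the real benchmark runs agents in containers, executes their submissions on GPU machines, and grades them with an automated judge against detailed rubrics.

```python
import subprocess
from pathlib import Path

# All names below are illustrative stand-ins, not the real PaperBench API.


def run_agent_in_container(paper_pdf: Path, workdir: Path) -> Path:
    """Step 1: an agent reads the paper and writes a codebase.

    In PaperBench this happens inside an Ubuntu container; here we just
    pretend the agent (ignoring the paper) has already written an entry
    point script."""
    workdir.mkdir(parents=True, exist_ok=True)
    (workdir / "reproduce.sh").write_text("echo 'accuracy: 0.91' > results.txt\n")
    return workdir


def execute_submission(submission: Path) -> str:
    """Step 2: run the submission's entry point (on GPU hardware in the
    real benchmark) and collect its outputs."""
    subprocess.run(["bash", "reproduce.sh"], cwd=submission, check=True)
    return (submission / "results.txt").read_text()


def grade_against_rubric(results: str, rubric: dict[str, str]) -> float:
    """Step 3: compare outputs to the paper's reported numbers.

    PaperBench uses an automated judge and detailed rubrics; this toy
    grader just checks for expected strings in the output."""
    hits = sum(expected in results for expected in rubric.values())
    return hits / len(rubric)


if __name__ == "__main__":
    sub = run_agent_in_container(Path("paper.pdf"), Path("/tmp/submission"))
    outputs = execute_submission(sub)
    score = grade_against_rubric(outputs, {"table1_acc": "accuracy: 0.91"})
    print(f"Toy replication score: {score:.0%}")
```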
To make this process faster, OpenAI built an AI judge that automatically grades the agent's work, along with a separate test to verify that this grading tool is itself accurate.
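At its core, such a judge is a language model that is shown one rubric requirement plus relevant excerpts from the submission and asked whether the requirement is met. The sketch below illustrates that idea with the OpenAI Python client; the prompt wording, model choice, and yes/no parsing are assumptions for illustration, not the judge OpenAI actually ships.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def judge_requirement(requirement: str, evidence: str, model: str = "gpt-4o") -> bool:
    """Ask an LLM judge whether one rubric leaf is satisfied.

    `requirement` is a single gradable criterion from the rubric;
    `evidence` is relevant code/log excerpts from the submission.
    The prompt and the YES/NO parsing are illustrative only.
    """
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system",
             "content": "You grade whether a research-replication requirement "
                        "is met by the evidence. Answer with exactly YES or NO."},
            {"role": "user",
             "content": f"Requirement: {requirement}\n\nEvidence:\n{evidence}"},
        ],
    )
    return response.choices[0].message.content.strip().upper().startswith("YES")
```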
Current Model Performance
OpenAI evaluated several frontier models on PaperBench, with Claude 3.5 Sonnet from Anthropic emerging as the top performer.
Claude 3.5 Sonnet achieved an average replication score of 21.0% when equipped with open-source scaffolding, the highest among tested models.
This relatively modest score highlights the significant challenge PaperBench presents and indicates substantial room for improvement in AI systems' ability to understand and replicate complex research.
"We evaluate several frontier models on PaperBench, finding that the best performing tested agent, Claude 3.5 Sonnet (New) with open source scaffolding, achieves an average replication score of 21.0%," the researchers noted.
Notably, the research team also recruited top machine learning PhDs to attempt a subset of PaperBench tasks and found that current AI models do not yet outperform human experts in this domain.
PaperBench Code-Dev: A Lighter Alternative
OpenAI introduced PaperBench Code-Dev, a lighter-weight variant focusing solely on code development requirements.
This version skips the reproduction step and doesn’t evaluate whether the code runs correctly or matches the paper’s empirical results.
PaperBench Code-Dev offers a more accessible alternative that requires fewer computational resources, particularly GPUs, making it more widely usable for researchers with limited access to high-end hardware.
According to OpenAI, this variant typically cuts grading costs by around 85% compared to the full benchmark.
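One way to picture the difference: if every rubric leaf is tagged by what it checks (code development, execution, or result match), the Code-Dev variant grades only the code-development leaves and ignores those that would require running the code or matching reported numbers. The tag names and toy data below are illustrative, not the benchmark's real format.

```python
# Illustrative only: assumes each rubric leaf carries a requirement-type tag.
leaves = [
    {"desc": "Implements the paper's custom attention layer", "type": "code_dev", "score": 1.0},
    {"desc": "Training script runs end to end", "type": "execution", "score": 0.0},
    {"desc": "Reported accuracy within tolerance", "type": "result_match", "score": 0.0},
]

# Code-Dev: keep only the code-development requirements when scoring.
code_dev_leaves = [leaf for leaf in leaves if leaf["type"] == "code_dev"]
score = sum(leaf["score"] for leaf in code_dev_leaves) / len(code_dev_leaves)
print(f"Code-Dev score (code-development leaves only): {score:.0%}")
```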
Open Source Initiative
In line with their commitment to collaborative research, OpenAI has open-sourced the PaperBench code and dataset through their GitHub repository.
This allows researchers worldwide to utilize the benchmark for evaluating models and advancing the field.
"We open source our code to facilitate future research in understanding the AI engineering capabilities of AI agents," the team explained.
The repository contains the complete dataset of papers, rubrics, and evaluation code, enabling researchers to run evaluations on their own AI systems.
Implications for AI Research
PaperBench represents a significant advance in how we measure progress in AI systems' research capabilities.
By focusing on replicating published research, it tests not just an AI’s ability to generate code but also its capacity to understand complex ideas, implement them correctly, and verify results.

The introduction of this benchmark arrives at a crucial moment in AI development, as systems increasingly demonstrate capabilities to assist in or potentially automate aspects of the research process.
PaperBench provides a standardized way to evaluate these capabilities and track progress over time.
PaperBench specifically measures whether AI systems can accurately interpret research papers, independently develop the necessary codebases, and successfully execute experiments, according to a technology publication covering the announcement.
Challenge of Automated AI Research
As AI gets better, we need ways to check its progress. PaperBench is a new tool to do just that.
Right now, even the best AI struggles with this test. This shows that AI still has a long way to go before it can do research like humans.
OpenAI’s work adds to the growing number of tests that help us see how well AI can handle tough, real-world problems.
PaperBench helps us understand what AI can and can’t do in science.
As AI improves and more tools like PaperBench become available, AI might become a better research partner, which could speed up discoveries in many areas.