source("setup_params.R")We study how a frontier language model evaluates social‑science research compared to expert human reviewers in The Unjournal. Using the same structured rubric as human evaluators, we ask GPT‑5 Pro to rate papers on overall quality, methods, evidence, communication, openness, and global relevance, and to produce a narrative assessment anchored in the PDF of each paper. We first compare its quantitative ratings to aggregated human scores across The Unjournal’s existing evaluations, then take a closer qualitative look at a small set of focal papers, including a high‑profile mapping study of natural regeneration. For these focal cases, we examine where the model’s written review overlaps with and diverges from the human reports, and how both sides describe the main strengths, weaknesses, and policy relevance of the work. So far, the model reliably identifies many of the same methodological and interpretive issues that human experts emphasize, but it tends to translate these into more generous numerical ratings and narrower uncertainty intervals. We view this as an initial, work‑in‑progress probe of LLM‑based peer review in a high‑stakes, policy‑relevant domain, and as the first step toward a broader benchmark and set of tools for comparing and combining human and AI research evaluations.
source("setup_params.R")Is AI good at peer-reviewing? Does it offer useful and valid feedback? Can it predict how human experts will rate research across a range of categories? How can it help academics do this “thankless” task better? Is it particularly good at spotting errors? Are there specific categories, e.g. spotting math errors or judging real-world relevance, where it does surprisingly well or poorly? How does its “research taste” compare to humans?
If AI research evaluation works, it could free up substantial scientific resources – perhaps $1.5 billion per year in the US alone (Aczel, Szaszi, and Holcombe 2021) – and offer more continual, detailed review, helping to improve research. It could also help characterize methodological strengths and weaknesses across papers, aiding training and research direction-setting. Furthermore, a key promise of AI is to directly improve science and research. Understanding how AI engages with research evaluations may provide a window into its values, abilities, and limitations.
In this project, we test whether current large language models (LLMs) can generate research evaluations that are comparable, in structure and content, to expert human reviews. The Unjournal systematically prioritizes “impactful” research and pays for high‑quality human evaluations, including structured numeric ratings with credible intervals, claim identification and assessment, predictions, and detailed narrative reports. We use a frontier LLM (OpenAI’s GPT‑5 Pro) to review the same social‑science and policy‑relevant working papers under essentially the same rubric.
For a first pass, we focus on papers that already have completed Unjournal evaluation packages. For each of 47 such papers, the model reads the PDF that human evaluators saw and returns: (i) percentile ratings and 90% credible intervals on The Unjournal’s seven criteria, and (ii) two 0–5 journal‑tier scores (“should” and “will” be published). In an additional, richer run on a small set of focal papers, we keep these quantitative outputs but also require a long diagnostic summary and a high‑effort reasoning trace. We then compare the model’s ratings, journal‑tier predictions, and qualitative assessments to the existing human evaluations.
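To make this concrete, here is a minimal sketch, in R, of how we line up the model’s outputs with the aggregated human scores. The column names and values are hypothetical placeholders rather than our actual schema or results: each row holds the model’s percentile rating and 90% credible interval for one paper and criterion, alongside the aggregated human midpoint, and we summarize the gap, interval width, and interval coverage.

```r
# A minimal sketch (hypothetical column names, toy values) of the per-paper,
# per-criterion data we collect and one way to compare the model's percentile
# ratings to the aggregated human midpoints.
library(dplyr)

llm_vs_human <- tibble::tribble(
  ~paper_id, ~criterion, ~llm_mid, ~llm_lo, ~llm_hi, ~human_mid,
  "p001",    "overall",        82,      70,      90,         68,
  "p001",    "methods",        78,      65,      88,         62,
  "p002",    "overall",        74,      60,      85,         71
)

comparison <- llm_vs_human |>
  mutate(
    gap      = llm_mid - human_mid,                       # signed gap in percentile points
    ci_width = llm_hi - llm_lo,                           # width of the model's 90% CI
    covered  = human_mid >= llm_lo & human_mid <= llm_hi  # does the CI cover the human midpoint?
  )

comparison |>
  group_by(criterion) |>
  summarise(
    mean_gap   = mean(gap),
    mean_width = mean(ci_width),
    coverage   = mean(covered)
  )
```

Interval coverage is of particular interest here, given the narrower uncertainty intervals noted above.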
Future iterations will extend this design to papers that are still in The Unjournal’s pipeline, where no human evaluations are yet public. This will let us study out‑of‑sample prediction, reduce the risk of model contamination from published evaluations, and test LLMs as tools for triaging and prioritizing new work.
Our work in context
Luo et al. (2025) survey LLM roles from idea generation to peer review, including experiment planning and automated scientific writing. They highlight opportunities (productivity, coverage of long documents) alongside governance needs (provenance, detection of LLM-generated content, standardizing tooling) and call for reliable evaluation frameworks.
Eger et al. (2025) provide a broad review of LLMs in science and a focused discussion of AI‑assisted peer review. They argue: (i) peer‑review data is scarce and concentrated in CS/OpenReview venues; (ii) targeted assistance that preserves human autonomy is preferable to end‑to‑end reviewing; and (iii) ethics and governance (bias, provenance, detection of AI‑generated text) are first‑class constraints.
Zhang and Abernethy (2025) propose deploying LLMs as quality checkers to surface critical problems instead of generating full narrative reviews. Using papers from WITHDRARXIV and an automatic evaluation framework that leverages “LLM-as-judge,” they find the best performance from top reasoning models but still recommend human oversight.
Pataranutaporn et al. (2025) asked four near-state-of-the-art LLMs (GPT-4o mini, Claude 3.5 Haiku, Gemma 3 27B, and LLaMA 3.3 70B) to consider 1,220 unique papers “drawn from 110 economics journals excluded from the training data of current LLMs”. They prompted the models to act “in your capacity as a reviewer for [a top-5 economics journal]” and make a publication recommendation on a 6-point scale ranging from “1 = Definite Reject…” to “6. Accept As Is…”. They also asked the models to evaluate each paper on a 10-point scale for originality, rigor, scope, impact, and whether it was ‘written by AI’. Separately, they had the LLMs rate 330 papers with the authors’ identities removed, with the names replaced by fictitious male or female names paired with real elite or non-elite institutions (check this), or with prominent male or female economists listed as authors.
They compare the LLMs’ ratings with the RePEc rankings of the journals in which the papers were published, finding general alignment. They find mixed results on detecting AI-generated papers. In the name/institution comparisons, they also find that the LLMs favor high-prestige male authors over high-prestige female authors, and favor elite institutions and US/UK universities.
There have been several other empirical benchmarking projects, including work covered in “LLM4SR: A Survey on Large Language Models for Scientific Research” and “Transforming Science with Large Language Models: A Survey on AI-assisted Scientific Discovery, Experimentation, Content Generation, and Evaluation”.
Our project distinguishes itself in its use of actual human evaluations of research in economics and adjacent fields, past and prospective, including reports, ratings, and predictions.1 The Unjournal’s 50+ evaluation packages enable us to train and benchmark the models. Their pipeline of future evaluations allows for clean out-of-training-data prediction and evaluation. Their detailed written reports and multi-dimensional ratings also allow us to compare the ‘taste’, priorities, and comparative ratings of humans relative to AI models across the different criteria and domains. The ‘journal tier prediction’ outcomes also provide an external ground truth,2 enabling a human-vs-LLM horse race. We are also planning multi-armed trials on these human evaluations (Brodeur et al. 2025) to understand the potential for hybrid human-AI evaluation in this context.
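To illustrate the kind of horse race we have in mind, the sketch below compares LLM and aggregated human journal-tier predictions against a realized publication tier using a simple mean absolute error. The object and column names are hypothetical and the values are toy numbers; the realized tier would come from tracking where each paper is eventually published, on the same 0–5 scale.

```r
# Sketch of a human-vs-LLM "horse race" on journal-tier prediction
# (hypothetical column names, toy values).
library(dplyr)

tier_preds <- tibble::tibble(
  paper_id      = c("p001", "p002", "p003"),
  llm_will      = c(4.0, 3.5, 2.5),   # model's "will be published in" tier prediction
  human_will    = c(3.5, 3.0, 3.0),   # aggregated human prediction
  realized_tier = c(3.0, 3.0, 2.0)    # observed publication outcome (toy values)
)

tier_preds |>
  summarise(
    mae_llm   = mean(abs(llm_will - realized_tier)),
    mae_human = mean(abs(human_will - realized_tier))
  )
```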
1. Other work has relied on collections of research and grant reviews, including NLPEER, SubstanReview, and the Swiss National Science Foundation. That data is heavily focused on computer-science-adjacent fields and is less representative of mainstream peer-review practices in older, established academic fields. Note that The Unjournal commissions evaluations of impactful research, often from high-prestige working-paper archives like NBER, and makes all evaluations public, even when they are highly critical of the paper.
2. This refers to verifiable publication outcomes, not, of course, to the ‘true quality’ of the paper.