Estimating the deep replicability of scientific findings using human and artificial intelligence

PNAS – Proceedings of the National Academy of Sciences of the United States of America
http://www.pnas.org/content/early/
[Accessed 9 May 2020]

 

Estimating the deep replicability of scientific findings using human and artificial intelligence
Yang Yang, Wu Youyou, and Brian Uzzi
PNAS first published May 4, 2020. https://doi.org/10.1073/pnas.1909046117
Significance
After years of urgent concern about the failure of scientific papers to replicate, an accurate, scalable method for identifying findings at risk has yet to arrive. We present a method that combines machine intelligence and human acumen for estimating a study’s likelihood of replication. Our model—trained and tested on hundreds of manually replicated studies and out-of-sample datasets —is comparable to the best current methods, yet reduces the strain on researchers’ resources. In practice, our model can complement prediction market and survey replication methods, prioritize studies for expensive manual replication tests, and furnish independent feedback to researchers prior to submitting a study for review.
Abstract
Replicability tests of scientific papers show that the majority of papers fail replication. Moreover, failed papers circulate through the literature as quickly as replicating papers. This dynamic weakens the literature, raises research costs, and demonstrates the need for new approaches for estimating a study’s replicability. Here, we trained an artificial intelligence model to estimate a paper’s replicability using ground truth data on studies that had passed or failed manual replication tests, and then tested the model’s generalizability on an extensive set of out-of-sample studies. The model predicts replicability better than the base rate of reviewers and comparably as well as prediction markets, the best present-day method for predicting replicability. In out-of-sample tests on manually replicated papers from diverse disciplines and methods, the model had strong accuracy levels of 0.65 to 0.78. Exploring the reasons behind the model’s predictions, we found no evidence for bias based on topics, journals, disciplines, base rates of failure, persuasion words, or novelty words like “remarkable” or “unexpected.” We did find that the model’s accuracy is higher when trained on a paper’s text rather than its reported statistics and that n-grams, higher order word combinations that humans have difficulty processing, correlate with replication. We discuss how combining human and machine intelligence can raise confidence in research, provide research self-assessment techniques, and create methods that are scalable and efficient enough to review the ever-growing numbers of publications—a task that entails extensive human resources to accomplish with prediction markets and manual replication alone.