Automating the data extraction process for systematic reviews using GPT-4o and o3.

Kataoka Y., Takayama T., Yoshimura K., So R., Tsujimoto Y., Yamagishi Y., Takagi S., Furukawa Y., Sakata M., Bašić Đ., Cipriani A., Cuijpers P., Karyotaki E., Harrer M., Leucht S., Homiar A., Ostinelli EG., Miguel C., Rodolico A., Furukawa TA.

Large language models have shown promise for automating data extraction (DE) in systematic reviews (SRs), but most existing approaches require manual interaction. We developed an open-source system using GPT-4o to automatically extract data with no human intervention during the extraction process. We developed the system on a dataset of 290 randomized controlled trials (RCTs) from a published SR about cognitive behavioral therapy for insomnia. We evaluated the system on two other datasets: 5 RCTs from an updated search for the same review and 10 RCTs used in a separate published study that had also evaluated automated DE. Using GPT-4o, we developed the best approach across all variables in the development dataset. The performance in the updated-search dataset using o3 was 74.9% sensitivity, 76.7% specificity, 75.7% precision, 93.5% variable detection comprehensiveness, and 75.3% accuracy. In both datasets, accuracy was higher for string variables (e.g., country, study design, drug names, and outcome definitions) than for numeric variables. In the third, external validation dataset, GPT-4o performed worse than in the previous study, with a mean accuracy of 84.4%. However, by adjusting our DE method while maintaining the same prompting technique, we achieved a mean accuracy of 96.3%, which was comparable to the previous manual extraction study. Our system shows potential for assisting the DE of string variables alongside a human reviewer. However, it cannot yet replace humans for numeric DE. Further evaluation across diverse review contexts is needed to establish broader applicability.
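For readers less familiar with the reported metrics, the following is a minimal Python sketch of how sensitivity, specificity, precision, and accuracy are conventionally computed from confusion-matrix counts. It uses the standard textbook definitions. How the authors classify an extracted value as a true or false positive or negative is not stated in the abstract, so the names `ConfusionCounts` and `extraction_metrics` and the counts in the example are hypothetical and are not taken from the study. "Variable detection comprehensiveness" is left out because the abstract does not define it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Hypothetical tally of extraction outcomes (not data from the study)."""
    tp: int  # true positives
    fp: int  # false positives
    tn: int  # true negatives
    fn: int  # false negatives


def safe_ratio(numerator: int, denominator: int) -> float:
    """Return numerator / denominator, or NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


def extraction_metrics(c: ConfusionCounts) -> dict[str, float]:
    """Compute the standard confusion-matrix metrics as fractions in [0, 1]."""
    total = c.tp + c.fp + c.tn + c.fn
    return {
        "sensitivity": safe_ratio(c.tp, c.tp + c.fn),
        "specificity": safe_ratio(c.tn, c.tn + c.fp),
        "precision": safe_ratio(c.tp, c.tp + c.fp),
        "accuracy": safe_ratio(c.tp + c.tn, total),
    }


if __name__ == "__main__":
    # Illustrative counts only; chosen for easy arithmetic.
    counts = ConfusionCounts(tp=30, fp=10, tn=40, fn=20)
    for name, value in extraction_metrics(counts).items():
        print(f"{name}: {value:.1%}")
```

Returning NaN for an empty denominator, rather than raising an error, is a design choice made here so that a variable with no positive or no negative cases does not interrupt a batch evaluation.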

DOI

10.1017/rsm.2025.10030

Type

Journal article

Publication Date

2026-01-01

Volume

17

Pages

42 - 62

Total pages

20

Keywords

GPT-4o, data extraction automation, large language models, o3, systematic reviews, Humans, Randomized Controlled Trials as Topic, Algorithms, Reproducibility of Results, Information Storage and Retrieval, Software, Sensitivity and Specificity, Automation, Systematic Reviews as Topic, Cognitive Behavioral Therapy, Data Mining, Review Literature as Topic
