RT Journal Article
SR Electronic
T1 Quality, Accuracy, and Bias in ChatGPT-Based Summarization of Medical Abstracts
JF The Annals of Family Medicine
JO Ann Fam Med
FD American Academy of Family Physicians
SP 113
OP 120
DO 10.1370/afm.3075
VO 22
IS 2
A1 Hake, Joel
A1 Crowley, Miles
A1 Coy, Allison
A1 Shanks, Denton
A1 Eoff, Aundria
A1 Kirmer-Voss, Kalee
A1 Dhanda, Gurpreet
A1 Parente, Daniel J.
YR 2024
UL http://www.annfammed.org/content/22/2/113.abstract
AB PURPOSE Worldwide clinical knowledge is expanding rapidly, but physicians have sparse time to review scientific literature. Large language models (eg, Chat Generative Pretrained Transformer [ChatGPT]) might help summarize and prioritize research articles to review. However, large language models sometimes “hallucinate” incorrect information. METHODS We evaluated ChatGPT’s ability to summarize 140 peer-reviewed abstracts from 14 journals. Physicians rated the quality, accuracy, and bias of the ChatGPT summaries. We also compared human ratings of relevance to various areas of medicine with ChatGPT relevance ratings. RESULTS ChatGPT produced summaries that were 70% shorter (mean length decreased from 2,438 to 739 characters). Summaries were nevertheless rated as high quality (median score 90, interquartile range [IQR] 87.0-92.5; scale 0-100), high accuracy (median 92.5, IQR 89.0-95.0), and low bias (median 0, IQR 0-7.5). Serious inaccuracies and hallucinations were uncommon. Classification of the relevance of entire journals to various fields of medicine closely mirrored physician classifications (nonlinear standard error of the regression [SER] 8.6 on a scale of 0-100). However, relevance classification for individual articles was considerably weaker (SER 22.3). CONCLUSIONS Summaries generated by ChatGPT were 70% shorter than mean abstract length and were characterized by high quality, high accuracy, and low bias. Conversely, ChatGPT had only modest ability to classify the relevance of articles to medical specialties. We suggest that ChatGPT can help family physicians accelerate review of the scientific literature, and we have developed software (pyJournalWatch) to support this application. Life-critical medical decisions should remain based on full, critical, and thoughtful evaluation of the full text of research articles in context with clinical guidelines.