PT - JOURNAL ARTICLE
AU - Rahgozar, Arya
AU - Mortezaagha, Pouria
AU - McGowan, Jessie
AU - Cobey, Kelly
AU - Edwards, Jodi
AU - Tricco, Andrea
AU - Manuel, Doug
AU - Fergusson, Dean
AU - Moher, David
TI - Reducing the Effort for Performing Systematic Reviews Using Natural Language Processing and Large Language Models
AID - 10.1370/afm.22.s1.6188
DP - 2024 Nov 20
TA - The Annals of Family Medicine
PG - 6188
VI - 22
IP - Supplement 1
4099 - http://www.annfammed.org/content/22/Supplement_1/6188.short
4100 - http://www.annfammed.org/content/22/Supplement_1/6188.full
SO - Ann Fam Med 2024 Nov 20; 22
AB - Context: Systematic reviews are critical for supporting future research, assessing adherence to reporting guidelines, and identifying evidence gaps, but they require intensive human synthesis. Objective: To evaluate the feasibility of an instructed large language model (LLM) assisting with human synthesis tasks, such as temporal topic search and screening by citation, in systematic reviews for the Brain-Heart Interconnectome (BHI), in support of reporting guidelines such as CONSORT and SPIRIT. Study Design and Analysis: We used the LangChain framework to implement a Retrieval-Augmented Generation (RAG) system comprising a document loader, a text splitter, a vectorizer (OpenAI embeddings to facilitate similarity search), a vector database, and an instructed LLM (GPT-3.5-Turbo) that automatically answers human questions about systematic reviews given a set of pre-selected papers. We then compared the answers with those generated by a general-purpose LLM. Setting or Dataset: The dataset comprises 846 research papers in BHI that claimed to have followed relevant reporting guidelines.
We compared the two sets of answers to 20 user questions. Population Studied: The corpus covers a variety of categories in randomized clinical trials (RCTs): their assessments, quality, prevalence, outcomes, bias, methods, interventions, and adherence to CONSORT. Intervention/Instrument: The instrument is a specially trained conversational system that significantly expedites the development of BHI systematic reviews against reporting-guideline criteria. Outcome Measures: We used a normalized combination of precision and recall (the F1 index) to count and compare corresponding terms between the answers generated by the two LLMs. Results: We qualitatively assessed the two versions of the auto-generated answers to the systematic review questions. The instructed LLM, constrained to the selected papers, generated more specific answers grounded in the defined citations and the known source papers. The F1 comparison between the two sets of 20 answers showed 30.7% concordance on average, reflecting the effect of our controlled sources and prompt engineering. Conclusion: We showed it is workable to use both a general-purpose LLM and an instructed LLM constrained to a specific set of cited research papers to provide relevant, insightful answers that assist with the human synthesis of material and hence significantly facilitate the development of BHI systematic reviews.
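The RAG pipeline named in the abstract (document loader, text splitter, vectorizer, vector database, instructed LLM) can be sketched as below. This is a minimal stand-in, not the authors' implementation: bag-of-words vectors replace OpenAI embeddings, a string template replaces GPT-3.5-Turbo, and the function names (`split_text`, `embed`, `answer`) are illustrative. Only the pipeline shape (load, split, vectorize, retrieve, answer from retrieved sources) is being reproduced.

```python
import math
from collections import Counter

def split_text(doc: str, chunk_size: int = 200) -> list[str]:
    """Text splitter: fixed-size character chunks."""
    return [doc[i:i + chunk_size] for i in range(0, len(doc), chunk_size)]

def embed(text: str) -> Counter:
    """Vectorizer stand-in: sparse bag-of-words term counts."""
    return Counter(text.lower().split())

def cosine(u: Counter, v: Counter) -> float:
    """Similarity measure used to rank chunks against the question."""
    dot = sum(u[t] * v[t] for t in u)
    nu = math.sqrt(sum(c * c for c in u.values()))
    nv = math.sqrt(sum(c * c for c in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def answer(question: str, papers: list[str], k: int = 2) -> str:
    """RAG loop: retrieve the k most similar chunks from the
    pre-selected papers, then 'generate' an answer constrained
    to that retrieved context."""
    chunks = [c for p in papers for c in split_text(p)]
    index = [(c, embed(c)) for c in chunks]  # in-memory vector database
    q_vec = embed(question)
    top = sorted(index, key=lambda cv: cosine(q_vec, cv[1]), reverse=True)[:k]
    context = " ".join(c for c, _ in top)
    # An instructed LLM would be prompted with `context` here.
    return f"Answer grounded in retrieved context: {context[:80]}"
```

Constraining generation to the retrieved `context` is what makes the instructed LLM's answers more specific and source-grounded than a general-purpose LLM's, as the Results section reports.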
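The outcome measure, a term-level F1 between the two LLMs' answers, can be sketched as follows. The abstract does not specify the authors' tokenization, so lowercasing and whitespace splitting are assumed here, and `term_f1` is an illustrative name.

```python
def term_f1(answer_a: str, answer_b: str) -> float:
    """Term-overlap F1 between two generated answers.

    Precision and recall are computed over the sets of lowercased
    whitespace-split terms, mirroring the abstract's count of
    corresponding terms between the two LLMs' answers.
    """
    terms_a = set(answer_a.lower().split())
    terms_b = set(answer_b.lower().split())
    if not terms_a or not terms_b:
        return 0.0
    common = terms_a & terms_b
    precision = len(common) / len(terms_a)
    recall = len(common) / len(terms_b)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Under this reading, the reported 30.7% average concordance means that roughly a third of the terms matched (in the F1 sense) between the instructed LLM's answer and the general-purpose LLM's answer to the same question.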