Dear Editor,
We write this letter as a Spanish family physician and an independent researcher, both deeply engaged in the ongoing debates surrounding artificial intelligence (AI) in medical practice (1,2). Our shared perspective arises from years of clinical work in primary care settings, coupled with a commitment to exploring how technological innovations can best serve both patients and professionals. Recent advances in Large Language Models (LLMs), such as ChatGPT or Gemini, have ignited our enthusiasm for the transformation they might bring to healthcare, while at the same time raising serious questions about the reliability and potential pitfalls of such systems (3,4), especially as they become integrated into sensitive tasks like medical diagnosis and treatment planning.
An essential point of concern that has recently garnered much attention is the phenomenon commonly termed “hallucinations”. These hallucinations occur when an LLM generates output that appears plausible but is factually incorrect or entirely fabricated. In the legal sphere, the starkest examples have been instances in which judges have confronted fictional citations or misapplied case law, all traced back to LLM-generated documents (5). However, the medical domain faces a different and potentially graver danger. A fabricated case cited by a lawyer usually raises an alarm that can be checked against legal databases (6). But for a busy family physician working in a demanding clinic, a subtle misstep, such as a misapplied clinical guideline, an incorrect dosage, or an invented side effect, may not raise immediate suspicion. This risk is particularly acute in complex diagnostic contexts, where real-time decision-making can literally be a matter of life and death.
We have come across several troubling anecdotes of fabricated references to scientific articles (7,8), a phenomenon that is often easy to spot once the actual journals are consulted. It is one thing when the AI invents an author’s name or confuses the details of a publication; such errors, though worrisome, are readily identified through diligent verification of sources. Far more insidious are the nuanced distortions that mimic plausible findings while omitting crucial details (9). For instance, an LLM may claim that a certain imaging test is the “gold standard” for a specific condition, citing partial or outdated evidence rather than inventing a paper outright. The appearance of verisimilitude lowers our guard, making us more likely to accept the AI’s counsel without the thorough scrutiny that genuine critical appraisal demands. Recent commentaries, such as those by Liebrenz and colleagues in The Lancet Digital Health (10) and by Biswas in Radiology (11), highlight how such hallucinations pose ethical and clinical challenges that extend beyond mere academic inconvenience.
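As a minimal illustration of the kind of source verification we have in mind, the sketch below (in Python, assuming network access, the `requests` library, and a hypothetical helper we have named `title_indexed`) queries the public Crossref bibliographic API with a claimed article title and reports whether any indexed work closely matches it. A title match alone does not, of course, confirm that the authors, journal, or findings were reported accurately; it is merely a first filter before the journals themselves are consulted.

```python
# Minimal sketch: checking whether a claimed article title matches any record
# indexed by Crossref. Illustrative only; it assumes network access and the
# `requests` library, and does not replace manual verification of sources.
import difflib
import requests


def title_indexed(claimed_title: str, threshold: float = 0.9) -> bool:
    """Return True if Crossref lists a work whose title closely matches the claimed one."""
    resp = requests.get(
        "https://api.crossref.org/works",
        params={"query.bibliographic": claimed_title, "rows": 5},
        timeout=10,
    )
    resp.raise_for_status()
    for item in resp.json()["message"]["items"]:
        for candidate in item.get("title") or []:
            ratio = difflib.SequenceMatcher(
                None, candidate.lower(), claimed_title.lower()
            ).ratio()
            if ratio >= threshold:
                return True
    return False


# Example: a title quoted by an LLM is flagged when no close match is found.
claimed = "Generating scholarly content with ChatGPT: ethical challenges for medical publishing"
print("indexed by Crossref" if title_indexed(claimed) else "no close match - verify manually")
```

Even such a crude filter catches only the crudest fabrications; the nuanced distortions described above still require human appraisal.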
From the vantage point of a family physician, the attractiveness of LLM-based tools lies in their potential to enhance efficiency. AI-driven applications promise a degree of rapid, wide-ranging literature synthesis that could theoretically allow doctors to make better decisions in less time. Certain pioneering studies, such as the work by Kung et al. (12), even propose that these models might bolster medical education by rapidly generating useful summaries of complex subjects and facilitating on-demand tutoring for medical students and residents. There is no doubt that, when implemented responsibly, AI can unlock new levels of efficiency (13), offering second opinions within seconds or sifting through electronic health records to spot patterns that would be imperceptible to the human eye (14,15).
Yet what is less frequently discussed, and what concerns us as practicing and researching professionals, is the practical burden of supervising these tools. In principle, a diagnostic support system should lighten the workload by providing reliable, evidence-based suggestions. However, once physicians learn that LLMs can hallucinate (and often do so in ways that are not easily detected), they feel compelled to verify every statement, reference, and recommendation these tools generate. This elevated demand for vigilance could paradoxically make the adoption of AI more time-consuming than traditional approaches, at least in the early stages of integration. In an era when consultations are already time-pressured and administrative tasks are ever increasing, the prospect of continually fact-checking voluminous AI output can feel more burdensome than beneficial.
Moreover, our independent research on the integration of AI into medical workflows indicates that family physicians may bear a disproportionate share of this supervisory load. Primary care requires a breadth of knowledge covering pediatrics, geriatrics, chronic disease management, mental health, and more. An LLM cannot act as an expert across all of these domains simultaneously without occasionally tripping into the kind of oversights that humans can detect only after meticulous analysis. While the AI might identify correlations that our human brains would miss, it can just as easily invent them. This precarious mix of potential brilliance, “black-box” opacity, and imaginative failures underscores the importance of expert oversight, yet that oversight itself demands additional training, greater familiarity with AI’s capabilities and limitations (16), and an extension of the current standards of practice.
Furthermore, adopting LLM-based tools must involve active consideration of ethical and legal implications. The accountability question remains largely unresolved: when an LLM suggests a flawed diagnosis that a physician follows, who is held responsible? Professional guidelines and regulatory frameworks are beginning to adapt, but the process remains in its infancy. In our dialogue with colleagues, we sense a collective concern that the hurried deployment of AI solutions, often motivated by cost-effectiveness or the desire for technological advancement, might outpace the development of robust safeguards and protocols. The real risk is that we adopt these solutions wholesale without building a “fail-safe” mechanism to catch subtle errors before they result in patient harm.
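To make the idea of a “fail-safe” concrete, here is one minimal sketch, under the assumption of a locally curated formulary of acceptable adult daily dose ranges. The drug names, ranges, and the `gate_suggestion` helper below are illustrative placeholders rather than clinical guidance or any existing system; the point is simply that an AI-generated suggestion should pass through a deterministic, human-auditable check before it reaches the patient record.

```python
# Minimal sketch of a "fail-safe" gate between an LLM suggestion and the
# clinician. The formulary below is an illustrative placeholder, NOT clinical
# guidance; a real deployment would rely on a validated drug database.
from dataclasses import dataclass


@dataclass
class DoseSuggestion:
    drug: str
    daily_dose_mg: float


# Hypothetical, locally curated acceptable adult daily dose ranges (mg).
FORMULARY_RANGES = {
    "drug_a": (250.0, 3000.0),
    "drug_b": (5.0, 40.0),
}


def gate_suggestion(s: DoseSuggestion) -> str:
    """Pass a suggestion only when its dose falls inside the curated range;
    anything unknown or out of range is flagged for physician review."""
    bounds = FORMULARY_RANGES.get(s.drug.lower())
    if bounds is None:
        return "flag: drug not in curated formulary - physician review required"
    low, high = bounds
    if not (low <= s.daily_dose_mg <= high):
        return f"flag: {s.daily_dose_mg} mg outside curated range {low}-{high} mg"
    return "pass: within curated range (still subject to clinical judgment)"


# Example: an out-of-range LLM suggestion is caught before it is acted upon.
print(gate_suggestion(DoseSuggestion(drug="drug_b", daily_dose_mg=400.0)))
```

Such gates do not resolve the accountability question, but they at least ensure that the most consequential errors are surfaced to a human rather than silently absorbed into the workflow.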
In closing, we urge a balanced approach: one that wholeheartedly embraces AI’s promise of improved efficiency and expanded analytical capacity, but with an unwavering commitment to independent verification and clinical judgment. As Spanish professionals (one of us in the trenches of family medicine, the other studying these developments from an applied research standpoint), we believe these concerns must be voiced transparently in the scientific literature and in clinical practice guidelines. Our fear is that, in the eagerness to adopt new technology, we may neglect the fundamental principle of “primum non nocere” (first, do no harm). We remain hopeful that concerted efforts in research and development, such as those examining LLM performance in formal medical examinations and specialized diagnostic tasks, will continue to expose weaknesses and sharpen the reliability of these systems. In the meantime, we advocate for a cautious, stepwise integration of AI tools into medical practice, always accompanied by vigilant oversight from trained professionals, even in the finest details (17).
REFERENCES:
1. Rodríguez JE, Lussier Y. The AI Moonshot: What We Need and What We Do Not. Ann Fam Med. 2025 Jan 1;23(1):7.
2. Tenajas R, Miraut D. El pulso de la Inteligencia Artificial y la alfabetización digital en Medicina: Nuevas herramientas, viejos desafíos. Rev Medica Hered. 2023 Oct;34(4):232–3.
3. Tenajas R, Miraut D. The 24 Big Challenges of Artificial Intelligence Adoption in Healthcare: Review Article. Acta Medica Ruha. 2023 Sep 20;1(3):432–67.
4. Tenajas R, Miraut D. The Risks of Artificial Intelligence in Academic Medical Writing [eLetter]. Ann Fam Med. 2024 Feb 10 [cited 2025 Mar 16]. Available from: https://www.annfammed.org/content/risks-artificial-intelligence-academic...
5. Dahl M, Magesh V, Suzgun M, Ho DE. Large Legal Fictions: Profiling Legal Hallucinations in Large Language Models. J Leg Anal. 2024 Jan 1;16(1):64–93.
6. Magesh V, Surani F, Dahl M, Suzgun M, Manning CD, Ho DE. Hallucination-Free? Assessing the Reliability of Leading AI Legal Research Tools [Internet]. arXiv; 2024 [cited 2025 Mar 16]. Available from: http://arxiv.org/abs/2405.20362
7. Buchanan J, Hill S, Shapoval O. ChatGPT Hallucinates Non-existent Citations: Evidence from Economics. Am Econ. 2024 Mar 1;69(1):80–7.
8. Schrager S, Seehusen DA, Sexton S, Richardson CR, Neher J, Pimlott N, et al. Use of AI in Family Medicine Publications: A Joint Editorial From Journal Editors. Ann Fam Med. 2025 Jan 1;23(1):1–4.
9. Tenajas-Cobo R, Miraut-Andrés D. Riesgos en el uso de Grandes Modelos de Lenguaje para la revisión bibliográfica en Medicina. Investig En Educ Médica. 2024 Jan 9;13(49):141.
10. Liebrenz M, Schleifer R, Buadze A, Bhugra D, Smith A. Generating scholarly content with ChatGPT: ethical challenges for medical publishing. Lancet Digit Health. 2023 Mar 1;5(3):e105–6.
11. Biswas S. ChatGPT and the Future of Medical Writing. Radiology. 2023 Apr;307(2):e223312.
12. Kung TH, et al. Performance of ChatGPT on USMLE: Potential for AI-assisted medical education using large language models. PLOS Digit Health [Internet]. [cited 2025 Mar 16]. Available from: https://journals.plos.org/digitalhealth/article?id=10.1371/journal.pdig....
13. Tenajas R, Miraut D. Echoes in space: Online training and AI’s potential in advancing ultrasound competency. WFUMB Ultrasound Open. 2023 Dec 1;1(2):100015.
14. Tenajas R, Miraut D, Illana CI, Alonso-Gonzalez R, Arias-Valcayo F, Herraiz JL. Recent Advances in Artificial Intelligence-Assisted Ultrasound Scanning. Appl Sci. 2023 Jan;13(6):3693.
15. Tenajas R, Miraut D. El renacimiento tecnológico de la Radiología: la revolución open source y la inteligencia artificial. Rev Cuba Informática Médica. 2023;15(2).
16. Tenajas R, Miraut D. Ecografía Inteligente. Rev Cuba Informática Médica. 2023;15(2).
17. Tenajas R, Miraut D. Rethinking ultrasound probe maintenance in the era of AI. WFUMB Ultrasound Open. 2023 Dec 1;1(2):100014.