Abstract
Clinical workflows that prioritize repetitive patient intake screening to meet performance metrics may have unintended consequences. This retrospective analysis of electronic health record data from 24 Federally Qualified Health Centers assessed the effectiveness and accuracy of the 2-item Patient Health Questionnaire (PHQ-2) for depression screening and the Generalized Anxiety Disorder 2 questionnaire (GAD-2) for anxiety screening from 2019 to 2021. Scores on over 91% of PHQ-2 and GAD-2 tests indicated a low likelihood of depression or anxiety, which diverged markedly from published literature on screening outcomes. Visit-based screenings linked to performance metrics may not be delivering the intended value in a real-world setting and risk diverting clinical effort from other high-value activities.
Key words
- performance measures
- health care quality
- administrative burden
- practice-based research
- PHQ-9
- quality improvement
- physician burnout
INTRODUCTION
Primary care visits often start with a myriad of standardized intake screening questions that are tied to performance metrics and incorporated into electronic health records (EHRs). Prioritizing repetition of intake screening questionnaires at primary care visits may have unintended consequences such as administrative burden, provision of low-value care, and reduced clinical capacity to deliver other, high-value services.1
Prior work demonstrated high levels of repetition of 6 intake screening questionnaires tied to performance metrics (eg, Patient Health Questionnaire-2 [PHQ-2], tobacco use screening) during visits to 25 Federally Qualified Health Centers (FQHCs) in 2019.2 The current study extends this research by exploring the accuracy and utility of 2 of these validated questionnaires (PHQ-2, Generalized Anxiety Disorder 2 [GAD-2]) to better understand whether they provide the expected value in real-world settings.
METHODS
We analyzed EHR data to (1) compare rates of positive PHQ-2 and GAD-2 tests in our study population with publicly available US Census data and published literature, and (2) assess the accuracy of these instruments by comparing PHQ-2 and GAD-2 scores with diagnoses for the corresponding patients. The study population included patients aged 18 years and older with at least 1 visit between 2019 and 2021 to 1 of 24 FQHCs (spanning 11 states). The 2 questionnaires were selected because they are widely implemented at the FQHCs, are linked to performance metrics for National Committee for Quality Assurance Patient-Centered Medical Home recognition3 and/or the Health Resources and Services Administration’s Uniform Data System,4 and are embedded into the intake form of the EHR. Questionnaires are predominantly administered verbally during the intake process by medical assistants.
To make our results comparable to the US Census Bureau’s 2021 Household Pulse Survey (HPS), we applied HPS sample weights to generate nationally representative estimates of adults experiencing symptoms of depression and anxiety as measured by the PHQ-2 and GAD-2.5
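As a rough illustration of this weighting step, the sketch below computes a survey-weighted share of respondents screening positive. The column names (phq2_score, PWEIGHT), the toy data, and the ≥2 cutoff are illustrative assumptions, not the actual HPS variable names or our analytic code.

```python
# Minimal sketch of a survey-weighted prevalence estimate, assuming one row per
# respondent. Column names (phq2_score, PWEIGHT) are illustrative placeholders;
# the HPS public-use file defines its own variables and weights.
import pandas as pd

def weighted_positive_rate(df: pd.DataFrame,
                           score_col: str = "phq2_score",
                           weight_col: str = "PWEIGHT",
                           cutoff: int = 2) -> float:
    """Survey-weighted share of respondents scoring at or above the cutoff."""
    valid = df.dropna(subset=[score_col, weight_col])
    positive = (valid[score_col] >= cutoff).astype(float)
    # Weighted proportion = sum(weight x indicator) / sum(weight)
    return (positive * valid[weight_col]).sum() / valid[weight_col].sum()

# Toy example: the positive respondent carries 3x the weight, so the weighted
# rate is 0.75 rather than the unweighted 0.50.
toy = pd.DataFrame({"phq2_score": [1, 4], "PWEIGHT": [1_000.0, 3_000.0]})
print(round(weighted_positive_rate(toy), 2))  # 0.75
```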
To assess accuracy, we examined score distributions for PHQ-2 and GAD-2 screenings completed by patients with subsequent new evidence of depression or anxiety (defined as a new diagnosis in the EHR). We compared the screeners’ ability to detect disease with sensitivity rates reported in the published literature.
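The sketch below illustrates the structure of this accuracy check under simplified assumptions: for each patient with a new diagnosis, it finds the most recent screening in a 30-day lookback window (the window used in the Results) and asks whether that screen was positive (score ≥2). The table layouts and column names (patient_id, screen_date, score, dx_date) are hypothetical, not our actual EHR schema.

```python
# Minimal sketch of the accuracy analysis: among patients with a new diagnosis,
# what share had a positive screen (score >= cutoff) in the prior 30 days?
# Column names and table shapes are illustrative, not the actual EHR extract.
import pandas as pd

def detection_rate(screens: pd.DataFrame, new_dx: pd.DataFrame,
                   cutoff: int = 2, lookback_days: int = 30) -> float:
    """Share of newly diagnosed patients whose most recent screening in the
    lookback window before diagnosis scored at or above the cutoff."""
    merged = new_dx.merge(screens, on="patient_id", how="inner")
    in_window = merged[
        (merged["screen_date"] <= merged["dx_date"])
        & (merged["dx_date"] - merged["screen_date"] <= pd.Timedelta(days=lookback_days))
    ]
    # Keep only the screening closest to (and preceding) each diagnosis
    latest = in_window.sort_values("screen_date").groupby("patient_id").tail(1)
    return float((latest["score"] >= cutoff).mean())

# Toy example: one patient screened negative, one positive, both within 30 days
screens = pd.DataFrame({
    "patient_id": [1, 2],
    "screen_date": pd.to_datetime(["2021-01-05", "2021-02-10"]),
    "score": [0, 3],
})
new_dx = pd.DataFrame({
    "patient_id": [1, 2],
    "dx_date": pd.to_datetime(["2021-01-20", "2021-02-25"]),
})
print(detection_rate(screens, new_dx))  # 0.5
```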
This study was granted an exemption from review by the Chicago Department of Public Health Institutional Review Board.
RESULTS
Screenings, including 1,883,317 PHQ-2s and 1,573,107 GAD-2s, were performed on 380,057 patients. Of these, 92.3% (1,738,534/1,883,317) of PHQ-2 tests and 91.4% (1,437,234/1,573,107) of GAD-2 tests resulted in a cumulative score of 0 or 1, indicating low likelihood of depression (for PHQ-2) and anxiety (for GAD-2) (Figure 1). The mean (SD) PHQ-2 score was 0.29 (1.024). The mean (SD) GAD-2 score was 0.35 (1.193). The median (interquartile range [IQR]) was 0.00 (0.00-0.00) for both instruments. Score distributions show 11% of patients had a positive PHQ-2 score (≥2) on their first screen, compared with 26% to 43% of first screens in the literature6-9 and census data sets5 (Figure 2). Similarly, score distributions show 11% of patients had a positive GAD-2 score (≥2) on their first screen, compared with 47% to 53% in census data sets5 and previous literature.10
Figure 1. GAD-2 and PHQ-2 score distributions. GAD-2 = Generalized Anxiety Disorder 2 questionnaire; PHQ-2 = 2-item Patient Health Questionnaire.
Figure 2. Comparison of positive PHQ-2 rates. PHQ-2 = 2-item Patient Health Questionnaire.
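For concreteness, the following sketch shows how the descriptive statistics and first-screen positivity rates above could be computed from a long table of screening results; the column names, toy data, and ≥2 cutoff are assumptions for illustration only.

```python
# Minimal sketch of the descriptive statistics reported above: overall score
# distribution plus the share of patients whose *first* screen was positive
# (score >= 2). Column names are illustrative, not the actual EHR extract.
import pandas as pd

def summarize_screens(screens: pd.DataFrame, cutoff: int = 2) -> dict:
    """Return mean/SD/median/IQR of scores and the first-screen positive rate."""
    scores = screens["score"]
    # One row per patient: the earliest screen on record
    first = screens.sort_values("screen_date").groupby("patient_id").first()
    return {
        "n_screens": len(screens),
        "mean": scores.mean(),
        "sd": scores.std(),
        "median": scores.median(),
        "iqr": (scores.quantile(0.25), scores.quantile(0.75)),
        "share_scoring_0_or_1": (scores <= 1).mean(),
        "first_screen_positive_rate": (first["score"] >= cutoff).mean(),
    }

# Toy example with three patients and four screens
toy = pd.DataFrame({
    "patient_id":  [1, 1, 2, 3],
    "screen_date": pd.to_datetime(["2019-01-01", "2020-06-01", "2019-03-01", "2021-02-01"]),
    "score":       [0, 2, 3, 0],
})
print(summarize_screens(toy)["first_screen_positive_rate"])  # ~0.33
```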
Narrowing the analysis to patients with new diagnoses (excluding patients without a diagnosis or with a prior diagnosis), we found that 42.3% (10,624/25,116) of patients with a new depression diagnosis had scored 0 or 1 on the PHQ-2 within the previous 30 days. Of patients with a new anxiety diagnosis, 42.7% (16,272/38,127) had scored 0 or 1 on the GAD-2. Stated differently, screening detected risk in only 57.7% of patients subsequently diagnosed with depression and in only 57.3% of patients subsequently diagnosed with anxiety.
DISCUSSION
Our prior study demonstrated that intake screening questionnaires during primary care visits in FQHCs are often administered repetitively in order to meet performance metrics.2 The current results suggest that existing workflows for screening are also less effective in detecting depression and anxiety than expected. In this real-world setting, PHQ-2 and GAD-2 results were more frequently negative (normal) when compared with settings described in published literature and census data. Although FQHC patients may differ from those in the literature and census data, these differences are unlikely to account for this disparity. In fact, the patients we studied are likely to have a relatively high prevalence of depression and anxiety because FQHC patients are predominantly low income11,12 and because the study period overlapped with the COVID-19 pandemic.13,14
We also evaluated PHQ-2 and GAD-2 results in patients who developed new diagnoses of depression or anxiety. In these patients, the PHQ-2 and GAD-2 had disease detection rates of less than 60%, compared with sensitivities above 90% in the published literature.6-8 We acknowledge that documentation on a diagnosis list in an EHR is not gold-standard proof that the patient has depression or anxiety. Nonetheless, such low positivity (<60%) on a screening test among patients diagnosed within 30 days of screening warrants further exploration.
These results raise the possibility that when screenings are performed frequently to meet performance thresholds, they may be administered in a perfunctory or inconsistent manner that reduces sensitivity. Preliminary qualitative findings from structured interviews with clinicians, staff, and patients point to variation in questionnaire administration and time constraints as underlying factors contributing to inaccuracies, but more comprehensive work in this area is needed.
The US Preventive Services Task Force (USPSTF) recently issued draft recommendations that primary care clinicians screen all adults aged younger than 65 years for anxiety. The recommendations state that “more studies are needed on the diagnostic accuracy of screening tools that are feasible for use in primary care.”15 Our findings indicate potentially compromised accuracy of anxiety and depression screeners when their implementation is driven by a need to meet performance measures and they are embedded into EHRs and visit workflows. Potential improvements include screening at predetermined intervals rather than at every clinical encounter and relying on self-administration, either electronic or on paper, which may have higher fidelity and reliability16 and impose less burden on staff and patients.
Our study has broad relevance for policy makers, regulators, measure developers, and clinician organizations that extends beyond depression and anxiety screening. Focusing on incentivized process measures such as intake screening questionnaires leads to repetitive2 and, we hypothesize, inaccurate completion. The impact on outcomes that matter (ie, reducing mortality and morbidity from depression and anxiety) may not be as favorable as previously perceived, and ineffective screening may unintentionally detract from clinical care because care teams and patients have less time and cognitive energy to devote to other priorities during busy clinical encounters. The importance of not confusing metrics with objectives (“surrogation”) is described in the Harvard Business Review article “Don’t Let Metrics Undermine Your Business.”17 Our findings suggest that similar wisdom could be useful in health care, given that implementing care processes such as depression and anxiety screening to meet a performance metric may inadvertently lead to reduced accuracy and low-value care.
Acknowledgments
Elizabeth Adetoro assisted with study design, data collection, validation, and interpretation. Ryan Jaeger, AllianceChicago, created the data set for analysis.
Footnotes
Conflicts of interest: authors report none. Dr Sinsky is employed by the American Medical Association.
Funding support: This work was funded by the American Medical Association Practice Transformation Initiative.
Disclaimer: The opinions expressed in this article are those of the author(s) and should not be interpreted as American Medical Association policy.
Previous presentations: Illinois Primary Health Care Association Annual Leadership Conference; Chicago, Illinois; October 6, 2022.
- Received for publication October 25, 2022.
- Revision received March 16, 2023.
- Accepted for publication March 29, 2023.
- © 2023 Annals of Family Medicine, Inc.