Abstract
PURPOSE To develop and test a machine-learning–based model to predict primary care and other specialties using Medicare claims data.
METHODS We used 2014-2016 prescription and procedure Medicare data to train 3 sets of random forest classifiers (prescription only, procedure only, and combined) to predict specialty. Self-reported specialties were condensed to 27 categories. Physicians were assigned to testing and training cohorts, and random forest models were trained and then applied to 2014-2016 data sets for the testing cohort to generate a series of specialty predictions. Comparing the predicted specialty to self-report, we assessed performance with F1 scores and area under the receiver operating characteristic curve (AUROC) values.
RESULTS A total of 564,986 physicians were included. The combined model had a greater aggregate (macro) F1 score (0.876) than the prescription-only (0.745; P <.01) or procedure-only (0.821; P <.01) model. Mean F1 scores across specialties in the combined model ranged from 0.533 to 0.987. The mean F1 score was 0.920 for primary care. The mean AUROC value for the combined model was 0.992, with values ranging from 0.982 to 0.999. The AUROC value for primary care was 0.982.
CONCLUSIONS This novel approach showed high performance and provides a near real-time assessment of current primary care practice. These findings have important implications for primary care workforce research in the absence of accurate data.
INTRODUCTION
Approximately 1 in 8 Americans works in health care.1 Translating that into better health depends on the presence of an effective workforce, and many believe the system needs to address shortages and maldistribution.2–4 In response, Congress established the National Health Care Workforce Commission, though it was never funded.1
A primary task of the Commission was to analyze data that would inform responses to workforce threats. For example, organizations have projected increasing shortages of primary care physicians,4–7 underscoring the need for coordination across agencies and for timely, accurate data.8
Unfortunately, the data needed are inadequate. Workforce data sets—the American Medical Association’s Masterfile and the Centers for Medicare and Medicaid Services’ (CMS) National Plan and Provider Enumeration System—have limitations. The Masterfile is a registry that documents medical school, residency, and fellowship training. Whereas training information is accurate, the registry relies on voluntary, self-reported responses for updates.9 Thus, the Masterfile’s accuracy decreases as clinicians age, reduce their hours, or change the type of care they deliver.7,9
The National Plan and Provider Enumeration System similarly has difficulty reflecting actual practice.10,11 Congress requires that physicians, regardless of Medicare participation, have unique identifiers—National Provider Identifiers (NPIs). The NPI specialty is self-reported, and there are neither requests for updated information nor mechanisms to determine whether providers are clinically active.9 Clinicians are instructed to report changes within 30 days, though there are no penalties for failing to do so.9
Even with timely data, misclassification remains a risk. Workforce projections use the most recent residency to categorize specialties. A first problem with this approach is that the services provided might be inconsistent with the residency; eg, family medicine residency graduates might be practicing dermatology. Second, it disregards the contributions of physicians in other specialties and of nonphysicians; eg, a rural cardiologist might be practicing primary care.
The method described below overcomes these limitations by evaluating current behavior to infer specialty. Integrating these additional data has the potential to improve accuracy and serve as a check on traditional approaches. Prescription and procedure data are available via the CMS,12 and technological advances allow us to apply emerging techniques. Machine learning, which develops algorithms to detect patterns, has been used to predict myriad outcomes including cancer survival and myocardial infarctions,13–17 and has also been applied to Medicare billing data to predict physician specialty and identify fraud; however, that work was not restricted to physicians, did not combine specialties performing similar roles, and did not incorporate prescribing data, and as a result it had low accuracy.18
The present study combined prescription and procedure data to predict physician specialty. Rather than relying on residency training, we propose a new method that infers specialty from prescribing and procedural behavior. The objectives were to describe prescriptions and procedures by specialty, combine prescription and procedure data with machine learning to develop algorithms that predict physician specialties, and test model performance against self-reported specialty.
METHODS
Data Sources
The American Academy of Family Physicians Institutional Review Board approved this study. For this cross-sectional study, we used the 2014-2016 CMS Medicare Fee-For-Service Provider Utilization and Payment Data: Part D Prescriber Public Use Files to identify prescriptions.19 These data sets include information regarding beneficiaries enrolled in Medicare Part D (70% of all beneficiaries), information about providers (eg, NPI and self-reported specialty), and prescriptions (except for over-the-counter drugs).
To identify procedures, we used the 2014-2016 CMS Medicare Fee-For-Service Provider Utilization and Payment Data: Physician and Other Supplier Public Use Files.20 In this Medicare Part B data set, procedures were identified with Healthcare Common Procedure Coding System codes. To protect privacy in these data sets, drugs and procedures were not reported by NPI if there were ≤10 claims.
Variables
To assess the same cohort of physicians, the analysis was restricted to nonpediatric physicians appearing in all 3 years (though they needed to appear in only the procedure or the prescription data set for a given year). To maintain consistency, physicians were included only if they self-reported the same specialty across all 3 years. We excluded nonphysicians, including physician assistants (PAs) and nurse practitioners (NPs), because their subspecialties were not listed. We assigned physicians from specialties with few physicians, or from multiple specialties that practice in similar ways, to 1 of 27 larger specialty categories (eg, internal medicine and family medicine were relabeled as primary care). To avoid rare drugs and procedures, we restricted the analysis to the 850 most common prescriptions and the 1,500 most common procedure codes and excluded items that did not appear in all 3 years. For each year, we characterized physicians by whether they prescribed or performed each of the 2,350 prescriptions/procedures; we did not account for the number of times each was prescribed or performed.
Physicians were then randomly assigned to 2 groups of the same size (Train and Test). Each physician in the Train and Test groups had a data set of associated prescription/procedure behavior for each of the 3 years.
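The feature construction and split described above can be summarized in a brief sketch. This is not the authors' code; it assumes a hypothetical long-format table claims with columns npi, year, and code, already restricted to the 2,350 retained drug and procedure codes.

```r
library(dplyr)
library(tidyr)

# One row per physician-year; each retained drug/procedure code becomes a 0/1 indicator
features <- claims %>%
  distinct(npi, year, code) %>%          # presence only; counts are deliberately ignored
  mutate(present = 1L) %>%
  pivot_wider(names_from = code, values_from = present, values_fill = 0L)

# Random 50/50 split of physicians (not physician-years) into Train and Test groups
set.seed(1)
npis       <- unique(features$npi)
train_npis <- sample(npis, size = floor(length(npis) / 2))
train      <- filter(features, npi %in% train_npis)
test       <- filter(features, !npi %in% train_npis)
```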
Deriving the Algorithm
Random forest is an ensemble learning method that creates many decision trees and generates an output based on the class value predicted most frequently across the trees. Random variation is incorporated so that the trees differ slightly from one another—in particular, by limiting the pool of possible variables available at each split—which minimizes overfitting and makes the analysis more robust to imbalanced data.21,22 We selected this method for its conceptual simplicity and favorable statistical properties.23
To begin, we trained a separate random forest model (the combined model, consisting of both prescription and procedure data) for each year. Each random forest consisted of 200 trees and had a pool of 100 possible variables at each node. Changes in hyperparameters failed to significantly improve these models over the default settings, with the exception of slightly better performance with more possible variables at each node than the default; we selected a value of 100 for simplicity. We chose to run 3 separate models as an alternative to cross-validation. Because the prescription and procedure patterns associated with each specialty should be stable across years, applying 3 separate random forest models to each year of Test data was a robust way to generate many sets of predictions and assess how consistently the method predicted specialty. Although these are imbalanced data, various methods to account for the imbalance, including undersampling the larger specialties and weighting the smaller specialties, improved performance for some specialties at the expense of others. Because the goal was accurate prediction for physicians regardless of specialty, we chose to leave the data unbalanced.
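As a concrete illustration, a single year's combined model could be fit with the ranger package (used for all models; see Statistical Analysis) roughly as follows. The objects train_2016 and test_2016 and the column name specialty are hypothetical; they stand for the binary indicators and self-reported specialty for one year's Train and Test cohorts.

```r
library(ranger)

# 200 trees with 100 candidate variables considered at each split, per the settings above
rf_2016 <- ranger(
  dependent.variable.name = "specialty",  # 27-level factor of self-reported specialty
  data      = train_2016,                 # the 2,350 binary indicators plus specialty only
  num.trees = 200,
  mtry      = 100
)

# Predicted specialties for the Test cohort (any of the 3 Test years can be supplied)
pred_2016 <- predict(rf_2016, data = test_2016)$predictions
```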
Validating the Algorithm
To assess consistency, we applied each of the 3 random forest models to each of the 3 years of Test data, giving 9 sets of predictions based on the physicians in the Test group. The 9 sets of predictions were compared with self-reported specialty to generate an F1 score (harmonic mean of precision [positive predictive value] and recall [sensitivity]) for each specialty, and a macro F1 score, calculated from the average precision and recall across all specialties. We reported these values as an average across the 9 sets of predictions. We used the 2016 random forest on the 2016 Test data to create sample receiver operating characteristic curves and calculate area under the curve (AUC) values for each specialty.
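A minimal sketch of these calculations is given below; the helper is ours, not the authors' code, and assumes pred and obs are vectors of predicted and self-reported specialties for one of the 9 prediction sets.

```r
# Per-specialty precision (positive predictive value), recall (sensitivity), and F1
f1_by_specialty <- function(predicted, observed) {
  specialties <- sort(unique(observed))
  rows <- lapply(specialties, function(s) {
    tp <- sum(predicted == s & observed == s)
    fp <- sum(predicted == s & observed != s)
    fn <- sum(predicted != s & observed == s)
    precision <- tp / (tp + fp)
    recall    <- tp / (tp + fn)
    data.frame(specialty = s, precision = precision, recall = recall,
               f1 = 2 * precision * recall / (precision + recall))
  })
  do.call(rbind, rows)
}

scores <- f1_by_specialty(pred, obs)

# Macro F1: harmonic mean of the precision and recall averaged across all specialties
macro_f1 <- with(scores,
  2 * mean(precision) * mean(recall) / (mean(precision) + mean(recall)))
```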
The F1 score was selected as the primary measure instead of the AUC value because of class imbalance. The F1 score is well suited here because it does not take into account true negatives (which will be numerous no matter which specialty is examined). The F1 score will be low for a given specialty if a significant number of false negatives or false positives occur; as a result, the F1 score can be low for an individual specialty even if the model predicts most other specialties well. Because of the large number of true negatives when predicting small specialties, specificity (true negatives/[true negatives + false positives]) can be high even when there are many false positives and precision (true positives/[true positives + false positives]) is low. This high specificity over a large range of sensitivities leads to high AUC values.
Prescription- and Procedure-Only Subanalyses
We generated 3 additional random forests using only the prescription variables and removing physicians with no prescription data available. We did the same for the procedure variables, removing physicians with no procedure data.
We used the 3 prescription-only models to generate 9 sets of predictions on the Test data (eg, the 2016 prescription-only model generated predictions using the 2014, 2015, and 2016 Test data sets), using only the prescription variables. We did the same for the 3 procedure-only models. We then generated an F1 score for each specialty and a macro F1 score for the prescription-only and procedure-only sets of predictions.
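The 9 prediction sets for each subanalysis can be generated with a simple loop. The sketch below uses hypothetical lists models (the 3 fitted prescription-only or procedure-only forests, keyed by year) and test_years (the corresponding Test feature sets for 2014-2016); the same pattern applies to the combined models.

```r
library(ranger)

predictions <- list()
for (model_year in names(models)) {        # "2014", "2015", "2016"
  for (data_year in names(test_years)) {   # "2014", "2015", "2016"
    key <- paste0("model", model_year, "_on_", data_year)
    predictions[[key]] <-
      predict(models[[model_year]], data = test_years[[data_year]])$predictions
  }
}
length(predictions)  # 9 sets of predictions
```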
Statistical Analysis
We used 2-sided paired t tests to assess whether the performance of the combined method differed from the prescription-only or procedure-only method, both for individual specialties and for the macro F1 score. Data are presented as mean (%) or mean (95% CI). We considered P <.05 to be statistically significant.
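The article reports Stata for these tests; an equivalent paired t test in R is sketched below, assuming f1_combined and f1_procedure_only are hypothetical vectors of the 9 macro F1 scores (one per prediction set) for the two methods, with pairing across the 9 matched prediction sets.

```r
# Two-sided paired t test comparing macro F1 across the 9 matched prediction sets
t.test(f1_combined, f1_procedure_only, paired = TRUE, alternative = "two.sided")
```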
Aggregate Analysis
We summed the predicted number of physicians in each specialty for the 9 predictions generated by the combined random forests, averaged the counts, and compared them with the specialty distribution of the Test set to assess whether the overall predicted physician counts were in line with the actual Test set counts. To assess model consistency at the individual physician level, we looked at 2016 data for physicians in the Test set and used the 3 combined (2014-2016) models to generate 3 predictions. We defined model agreement as all 3 models predicting the same specialty. We focused on a single year of prescribing and procedural data because, even though we excluded physicians who did not self-report a consistent specialty across all 3 years, it was still possible that a physician's actual specialty had changed from year to year. Applying the 2014-2016 random forest models to just the 2016 Test data set removed the possibility of the model appearing inconsistent simply because a physician's behavior changed across the years; in that situation, disagreement among predictions based on 2014, 2015, and 2016 data might have reflected the model working as intended. We then categorized physicians according to whether their self-reported specialties did or did not match the predictions.
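A sketch of the agreement check follows, assuming combined_models is a hypothetical list of the 3 fitted combined forests keyed by year and test_2016 holds the 2016 Test data with self-reported specialty.

```r
library(ranger)

preds <- lapply(combined_models, function(m) predict(m, data = test_2016)$predictions)

# Model agreement: all 3 models predict the same specialty for a physician
agree   <- preds[["2014"]] == preds[["2015"]] & preds[["2015"]] == preds[["2016"]]
matched <- agree & preds[["2016"]] == test_2016$specialty

mean(agree)              # proportion with a consistent prediction across models
mean(matched)            # consistently predicted AND matching self-report
mean(agree & !matched)   # consistently predicted as a nonmatching specialty
```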
Statistical analyses were performed with Stata version 15.0 (StataCorp, LLC). The random forest models were run with the ranger package in R, and AUC was calculated in R with the pROC package.24
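For the per-specialty receiver operating characteristic curves, a one-vs-rest curve can be produced with pROC. The sketch below assumes prob_2016 is a hypothetical matrix of predicted class probabilities for the 2016 Test cohort (eg, from a ranger probability forest); the article does not describe this step in detail.

```r
library(pROC)

# One-vs-rest ROC curve and AUC for a single specialty (primary care shown)
roc_pc <- roc(
  response  = as.integer(test_2016$specialty == "Primary care"),
  predictor = prob_2016[, "Primary care"]
)
auc(roc_pc)
plot(roc_pc)
```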
National Provider Identifiers
Despite its flaws, self-report via NPI is an appropriate reference standard. First, it effectively deals with the concern that historical training differs from current practice by divorcing specialty categorization from residency training. This would remain an issue if we used the American Medical Association’s Masterfile. Second, by only including those physicians who appeared in the prescribing or procedural data set, we excluded those not clinically active. Our models are based on the aggregate behavior of a large number of physicians, and we hypothesized that they are not meaningfully influenced by the small number of physicians with inaccurate self-reported specialty.
RESULTS
We included 564,986 physicians (n = 282,493 in each of the Train and Test groups). A breakdown by specialty for the Train and Test sets is shown in Table 1. The smallest specialty was allergy/immunology, comprising 0.6% of the physicians in both data sets, and the largest was primary care, comprising 35.6% and 35.9% of the Train and Test sets, respectively. Using prescription data only, approximately 40% of physicians identified as primary care compared with approximately 34% using procedure data only (Supplemental Table 1). Psychiatrists exhibited a similar pattern, with more appearing in the prescription data set than the procedure data set. The inverse was true for specialists who routinely perform procedures.
Primary care physicians prescribed the greatest mean number of unique drugs (61.4), more than 50% more than the next greatest group (cardiologists, 38.1) (Table 1). Radiologists had the greatest mean number of unique procedure codes (35.7).
Comparing the combined and procedure-only predictions, the combined model was significantly better for 18 (66.7%) specialties, worse for 8 (29.6%), and no different for 1 (3.7%) (Table 2; see Supplemental Table 2 for recall, negative predictive, and positive predictive values). Comparing the combined to prescription-only predictions, 19 (70.4%) were significantly better, 6 (22.2%) were worse, and 2 (7.4%) were no different. Macro F1 scores also showed statistically significant differences; the combined model (0.876) was more than 0.05 greater than the procedure-only model (0.821) and more than 0.10 greater than the prescription-only model (0.745).
With respect to the overall robustness of the combined model, 22 specialties (81.5%) had mean F1 scores > 0.80, and 15 (55.6%) had scores > 0.90 (Table 2). The 3 worst specialties were plastic surgery (0.533), physical medicine and rehabilitation (0.586), and neurosurgery (0.650), and the combined model was significantly better than the procedure-only and prescription-only models for all 3 of these specialties. No specialty had a score of < 0.500 for the combined model. The F1 score for the combined model for primary care was 0.920.
These performance characteristics translated to high AUC values (Supplemental Table 3); 22 specialties (81.5%) had AUC values > 0.99. The lowest AUC was for primary care (0.982).
These models also generated relatively accurate predictions for specialty counts (Table 3). Nineteen (70.4%) of the predicted counts for specialties were within 5% of the actual counts. The models underestimated the number of physicians in several specialties, including infectious disease, neurosurgery, physical medicine and rehabilitation, and plastic surgery. In contrast, the model overestimated the number of physicians practicing primary care by 3.7%.
With respect to consistency, the 3 models predicted the same specialty for 97.0% of physicians when applied to the same year of Test prescription and procedure data (2016) (Table 4). Of all physicians, 89.4% were consistently predicted as the specialty that matched their self-report, whereas 7.6% were consistently predicted as a nonmatching specialty. These values were 98.3%, 92.6%, and 5.8%, respectively, for primary care.
DISCUSSION
In this study, we developed high-performing models to predict specialties. With noted exceptions, these models exhibited high F1 scores and AUC values, especially in comparison to earlier work.18
For several specialties, including neurosurgery and physical medicine and rehabilitation, the models’ performance was suboptimal. We hypothesize that these specialties have high overlap with other specialties, making classification difficult. This finding was not true for primary care, suggesting that the constellation of procedures and prescriptions is also important. Whereas primary care shares prescriptions and procedures with a broad range of specialties, few share its breadth.
Our method has implications for primary care workforce studies. For example, this approach can be used to identify primary care PAs/NPs, who do not have mandated residencies and have eluded classification.25 Workforce projections have been hampered by these limitations. For example, across 40 state workforce assessments, 60% did not include PAs/NPs, citing inadequate data as justification for their exclusion.26 To capture the contribution of PAs/NPs, researchers have relied on surveys and state licensing data,27,28 which have response rates of 20% to 30%.29,30
Our approach also enhances the accuracy and granularity of projections. As noted, workforce projections rely on training, though training might not reflect current practice.5–7 Our approach provides a near real-time assessment of behavior. This subtle distinction might affect which residencies are created and which policies are supported. This method also allows for identification of physicians not easily categorized, such as those providing HIV care.31
There are several limitations to the study. First, we excluded physicians not billing Medicare, only participating in Medicare Advantage, or only providing pediatric care. Physicians had to prescribe drugs or perform procedures >10 times to appear in the data set. A national all-payer claims database would overcome these limitations. Second, we evaluated a single technique in this analysis. Whereas random forest models are broadly used, it is possible that other techniques or changes to parameters might improve accuracy.32 Third, we only included physicians appearing in 3 consecutive years. These analyses need to be repeated with a cohort that involves physicians with less longitudinal data to determine if results are similar. Fourth, we were unable to understand the motivations behind scope deviations, eg, a family physician could practice differently because of unique disease patterns in their service area. Understanding these motivations via a qualitative approach would provide additional context. Finally, we used self-reported specialty for training and testing. As mentioned, this database does not have a penalty for out-of-date information, though physicians are instructed to report changes.9
In summary, we report a novel method for identifying primary care physicians. These models exhibit high performance, and because they identify the practice patterns of specialties, they can be used to identify primary care PAs and NPs. By assessing current practice rather than historical training, this approach has the potential to change how the primary care workforce is tracked.
Footnotes
Conflicts of interest: authors report none.
Prior presentation: 2017 North American Primary Care Research Group Annual Meeting; November 17-21, 2017; Montreal, Canada.
Supplemental materials: available at https://www.AnnFamMed.org/content/18/4/334/suppl/DC1/.
- Received for publication February 22, 2019.
- Revision received November 27, 2019.
- Accepted for publication January 6, 2020.
- © 2020 Annals of Family Medicine, Inc.