Abstract
PURPOSE Mental health screening is recommended by the US Preventive Services Task Force for all patients in areas where treatment options are available. Still, it is estimated that only 4% of primary care patients are screened for depression. The goal of this study was to evaluate the efficacy of machine learning technology (Kintsugi Voice, v1, Kintsugi Mindful Wellness, Inc) to detect and analyze voice biomarkers consistent with moderate to severe depression, potentially enabling greater adherence to this critical primary care public health recommendation.
METHODS We performed a cross-sectional study from February 1, 2021, to July 31, 2022, examining ≥25 seconds of free-form English-language speech captured from 14,898 unique adults in the United States and Canada. Participants were recruited via social media and provided informed consent; their voice biomarker results were compared with a self-reported Patient Health Questionnaire-9 (PHQ-9) at a cutoff score of 10 (moderate to severe depression).
RESULTS From as few as 25 seconds of free-form speech, machine learning technology was able to detect vocal characteristics consistent with a PHQ-9 score of ≥10, with a sensitivity of 71.3 (95% CI, 69.0-73.5) and a specificity of 73.5 (95% CI, 71.5-75.5).
CONCLUSIONS Machine learning has potential utility in helping clinicians screen patients for moderate to severe depression. Further research is needed to measure the effectiveness of machine learning vocal detection and analysis technology in clinical deployment.
INTRODUCTION
Depression is a leading cause of disability, affecting an estimated 18 million Americans each year, with a lifetime prevalence of major depression approaching 30%.1-3 In 2016, the US Preventive Services Task Force recommended universal depression screening for adult patients when adequate follow-up is available.4,5 Still, depression screening rarely occurs in the outpatient setting, with some estimates placing screening rates at <4% of primary care encounters.6-9 Even when patients are identified to undergo screening, those with depression are included <50% of the time.6,10 Thus, there is a substantial opportunity and need to improve primary care screening for depression. Machine learning (ML) can help fill this care gap by augmenting clinical workflows without additional clerical burden, increasing the frequency of depression screening and accelerating patient triage.11-14
Individuals with an active depressive episode have distinct speech patterns, such as more frequent stuttering and hesitations, longer and more frequent pauses, and slower speech cadence.15-22 Vocal signatures associated with a clinical diagnosis are defined as voice biomarkers.23 Using ML technology to evaluate these vocal signatures represents a novel, noninvasive, quantitative, reproducible, and near-seamless assessment that can be added to virtual encounters. We sought to assess whether ML can effectively detect vocal characteristics consistent with a moderate to severe acute depressive episode.
METHODS
Dataset
All data were obtained with approval from the Solutions IRB (https://www.solutionsirb.com) institutional review board. We performed this study in accordance with relevant regulations and guidance including the Declaration of Helsinki. The study population included adults aged ≥18 years in the United States and Canada recruited via social media (ie, Reddit, Craigslist, Facebook, and Instagram). Because young and/or female individuals were more likely to self-enroll in this study, additional advertisements on social media using images of men and older individuals were directed at male and senior populations to strive for a more evenly distributed study sample. From February 1, 2021 to July 31, 2022, 14,898 participants provided informed consent, completed the Patient Health Questionnaire-9 (PHQ-9), and recorded a voice response to the prompt, “How was your day?” for at least 25 seconds in English using their personal electronic device’s microphone from their remote location (Figure 1). Responses were captured using a secure online survey platform. Participants self-reported demographic information, which included age, gender, race and ethnicity, and country of residence, to assess sample representativeness and eligibility. We collected e-mail addresses to distribute compensation of $5 (USD or CAD) on study completion for eligible participants. Other identifying information, such as name or phone number, was not collected, for protection of participant privacy.
Figure 1. Participant Exclusion and Audio Preprocessing Criteria Used to Create Training and Validation Sets to Train and Tune the Model and Evaluate Its Performance
Note: Eligible participants for inclusion in the analysis data sets were adults aged ≥18 years living in the United States or Canada who provided a voice sample in English containing at least 25 seconds of speech content meeting audio quality parameters. The training set and validation set were split to evenly distribute samples on the basis of participant characteristics and audio length.
Study Measurement
Participants were evaluated via completion of the PHQ-9 questionnaire. Scores for the 9 items range from 0 (“Not at all”) to 3 (“Nearly every day”), and total scores range from 0 to 27. A PHQ-9 score of ≥10 was the threshold for a moderate to severe acute depressive episode because it maximizes the sensitivity and specificity of the PHQ-9 instrument.8,9
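For concreteness, this scoring rule can be expressed in a few lines of code. The sketch below is a minimal illustration, not code from the study; the function names are ours.

```python
# Minimal sketch of PHQ-9 scoring and the cutoff used in this study.
# Item responses are integers from 0 ("Not at all") to 3 ("Nearly every day").

def phq9_total(responses: list[int]) -> int:
    """Sum the 9 item scores (each 0-3) into a total score of 0-27."""
    assert len(responses) == 9 and all(0 <= r <= 3 for r in responses)
    return sum(responses)

def moderate_to_severe(total: int) -> bool:
    """Apply the study's reference-standard threshold: PHQ-9 score of >=10."""
    return total >= 10

# A participant whose responses sum to 12 meets the threshold.
print(moderate_to_severe(phq9_total([2, 1, 2, 1, 2, 1, 1, 1, 1])))  # True
```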
Data Processing
Surveys were individually reviewed by study staff for completion, uniqueness, and authenticity. Incomplete, duplicate, or fraudulent surveys (eg, from outside the United States or Canada) were excluded from analysis. The eligible audio recordings were captured as .wav files, and linear pulse code modulation, sampling rate, and voice activity were standardized to limit variations in quality introduced by differences in participants’ personal electronic device microphones. Consistent audio quality was preserved by converting files to 16-kHz linear pulse code modulation, which is the standard for speech processing and minimizes file degradation.24,25 Full details regarding data processing and model architecture and training (Kintsugi Voice, v1, Kintsugi Mindful Wellness) are described in Supplemental Appendixes 1 and 2.
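Although the study’s full preprocessing pipeline is proprietary, the standardization step described above can be approximated with common open source audio libraries. The sketch below is our illustration under that assumption; librosa and soundfile are stand-ins, not necessarily the tools the authors used.

```python
# Illustrative sketch: convert an arbitrary participant .wav file to
# 16-kHz, 16-bit linear pulse code modulation (LPCM) mono, the
# speech-processing standard cited in the text.
import librosa
import soundfile as sf

def standardize_wav(in_path: str, out_path: str, target_sr: int = 16_000) -> None:
    # Load the recording, downmix to mono, and resample to 16 kHz.
    audio, _ = librosa.load(in_path, sr=target_sr, mono=True)
    # Re-encode as 16-bit linear PCM to limit device-dependent variation.
    sf.write(out_path, audio, target_sr, subtype="PCM_16")

standardize_wav("participant_clip.wav", "participant_clip_16k.wav")
```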
Model Evaluation
Predictions were normalized and scaled from 0 to 1. Values closer to 1 represented greater model confidence that the participant had vocal characteristics consistent with a moderate to severe acute depressive episode. We selected the following 3 predicted model outputs: (1) Signs of Depression Detected, for individuals with sufficient vocal characteristics consistent with an active depressive episode; (2) Signs of Depression Not Detected, for individuals with insufficient vocal characteristics consistent with an active depressive episode; and (3) Further Evaluation Recommended, which captured individuals for whom the model did not have sufficient confidence to yield an output and would defer to clinician judgment for a formal screening determination in practice.
Quantitatively, Signs of Depression Detected corresponded to model output values >0.5631 (to a maximum of 1) and anticipated a PHQ-9 score of ≥10. Signs of Depression Not Detected corresponded to model output values <0.4449 (to a minimum of 0). Values from 0.4449 to 0.5631 were labeled Further Evaluation Recommended. We set threshold values to minimize the presence of false-positive and false-negative samples. We evaluated overall model performance by comparing model outputs with self-reported PHQ-9 scores.
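These published thresholds define a simple three-way mapping from the normalized output to the screening labels, sketched below (the function name and return strings are illustrative):

```python
# Three-way output mapping using the thresholds reported above.

def screening_output(score: float) -> str:
    """Map a normalized model output (0-1) to one of the 3 screening labels."""
    if score > 0.5631:
        return "Signs of Depression Detected"       # anticipates PHQ-9 >= 10
    if score < 0.4449:
        return "Signs of Depression Not Detected"
    return "Further Evaluation Recommended"         # defer to clinician judgment

for s in (0.91, 0.50, 0.12):
    print(f"{s:.2f} -> {screening_output(s)}")
```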
RESULTS
The training and validation data sets comprised 10,442 and 4,456 participant files, respectively (Figure 1). The validation set had a speech content range of 25.0-74.9 seconds (median = 57.9 seconds; mode = 58.5 seconds) and a self-reported PHQ-9 score range of 0-27 (median = 9; mode = 0). Demographic characteristics are summarized in Table 1, model performance is shown in Table 2, and subpopulation model performance is shown in Table 3. The subpopulation participant demographics for misclassified samples are listed in Supplemental Table 1.
Table 1. Participant Demographic Characteristics
Table 2. Model Performance
Table 3. Subpopulation Performance
The model provided an output of Signs of Depression Detected and Signs of Depression Not Detected for 3,536 of the validation samples. The performance of these predictions was as follows: overall sensitivity of the model was 71.3 (95% CI, 69.0-73.5), specificity was 73.5 (95% CI, 71.5-75.5), PPV was 69.3 (95% CI, 67.1-71.5), and NPV was 75.3 (95% CI, 73.3-77.2) (Table 2). The output Further Evaluation Recommended was returned for 20% of the overall validation set, or 920 samples.
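For readers who want to reproduce these operating characteristics, the sketch below computes sensitivity, specificity, PPV, and NPV with Wilson score CIs from a 2 × 2 confusion matrix. The cell counts shown are hypothetical, back-derived to approximate the reported point estimates; the study did not publish the underlying counts.

```python
# Operating characteristics with 95% Wilson score intervals from a 2x2
# confusion matrix (model output vs PHQ-9 >= 10).
from math import sqrt

def wilson_ci(k: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for the proportion k/n."""
    p = k / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

# Hypothetical counts chosen so the 4 metrics approximate Table 2
# (tp + fp + fn + tn = 3,536, the samples with a determinate output).
tp, fp, fn, tn = 1150, 510, 463, 1413

metrics = {
    "sensitivity": (tp, tp + fn),  # TP / (TP + FN)
    "specificity": (tn, tn + fp),  # TN / (TN + FP)
    "PPV":         (tp, tp + fp),  # TP / (TP + FP)
    "NPV":         (tn, tn + fn),  # TN / (TN + FN)
}
for name, (k, n) in metrics.items():
    lo, hi = wilson_ci(k, n)
    print(f"{name}: {100 * k / n:.1f} (95% CI, {100 * lo:.1f}-{100 * hi:.1f})")
```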
Within subpopulations, model sensitivity was greatest for the Hispanic or Latine (80.3; 95% CI, 72.6-86.6) and Black or African American (72.4; 95% CI, 64.0-79.8) populations, and model specificity was greatest for the Asian or Pacific Islander (77.5; 95% CI, 72.8-81.8) and Black or African American (75.9; 95% CI, 69.3-81.7) populations, all of which had wider CIs relative to the full sample and the White subpopulation. Sensitivity and specificity differed notably between women and men: for women, sensitivity was 74.0 (95% CI, 71.4-76.5) and specificity was 68.9 (95% CI, 66.2-71.4); for men, sensitivity was 59.3 (95% CI, 54.0-64.4) and specificity was 83.9 (95% CI, 80.8-86.7). The population aged <60 years had a sensitivity of 71.9 (95% CI, 69.5-74.2) and a specificity of 71.8 (95% CI, 69.6-73.9), with narrower CIs than the population aged ≥60 years, whose sensitivity was 63.4 (95% CI, 54.3-71.9) and specificity was 86.8 (95% CI, 81.6-91.0).
DISCUSSION
The present study showed the preliminary effectiveness of ML to detect vocal characteristics consistent with a moderate to severe acute depressive episode from audio clips of ≥25 seconds of free-form speech content. The ML system showed an overall sensitivity of 71.3, specificity of 73.5, PPV of 69.3, and NPV of 75.3 relative to the PHQ-9 at a cutoff score of 10. Many mental health diagnostic and screening inventories have a performance ranging from 60.0 to 90.0 for both sensitivity and specificity.8,28,29 Thus, the performance of the tool relative to the PHQ-9 suggests that an ML device might be effective in assisting with screening and identifying individuals with depression.6,13 Machine learning voice-based approaches used to detect other conditions, such as bulbar dysfunction in amyotrophic lateral sclerosis, show similar performance.30 An ML-based instrument for depression screening holds promise because it could increase the proportion of patients screened without undue clinician clerical burden.
As with any medical device, it is important to consider false positives and false negatives. The balance between false positives and false negatives is a natural trade-off and can be adjusted via the model threshold depending on the demands and objectives of the clinical setting; we set our threshold to >0.5631 for this study. Given that the majority of persons meeting binary diagnostic criteria for a depressive episode have mild to moderate symptoms, and that many do not need clinical intervention and are able to successfully manage a mild depressive episode without formal therapy or medication, increased sensitivity might be a worthwhile aim for future exploration and warrants clinician feedback via formal study.4,5,10,12,14,17,31-33 False negatives might result in patients experiencing an active depressive episode missing formal screening and, consequently, access to subsequent behavioral health treatment; however, because the tool is intended as an adjuvant screening tool, and the proportion of false negatives is in line with the performance of other screening instruments, this risk should be minimal within the context of current practices.6-9 Additional acceptability studies tailored to specific environments will be required to quantify and qualify the implications of false-positive and false-negative screenings when using ML technology to augment clinical workflows for depression screening.
Among the false negatives identified, there was a greater proportion of men (31.7%) relative to the population in the overall validation set (27.9%). There was an observable difference in the sensitivity measure for men (59.3; 95% CI, 54.0-64.4) relative to the overall population (71.3; 95% CI, 69.0-73.5). Whereas there is less precision regarding the estimate for men relative to the full population, owing to a smaller population of men in our sample, the sensitivity measure is still within the bounds of other precedent depression inventories.8,29 The study team made efforts to recruit additional men and seniors to participate in the study; however, there is documented resistance to participating in depression-related research among these groups, and depression in the general population is recorded to be greatest among women and individuals aged <25 years, which could have influenced participants’ motivations to volunteer.34,35 The lower representation of men (27.3%) in training data relative to women (69.5%) might have resulted in decreased exposure to the population’s characteristics for model learning, contributing to lower observed performance on validation data. The lower sensitivity suggests that the model might need to be better trained at identifying signs of depression in men, given that research has shown that artificial intelligence algorithms might falsely correlate a more masculine voice with decreased likelihood of depression, owing to the fact that depression is less prevalent among men.36
Segmenting by age, the population aged <60 years comprised a larger proportion of the data set and had narrower CIs than the group aged ≥60 years. We analyzed the full age range together because we believe the results were sufficiently clinically significant across the entire age range to warrant consideration of voice biomarkers across the age spectrum. Many biomarkers (electrocardiogram morphology, blood pressure, lipid profile) are age dependent, and we suspect that this might be true of voice biomarkers as well. The etiologies of these age-related changes in voice biomarkers are difficult to speculate on and could include both age-related differences in voice and age-dependent neuromotor manifestations of depression. Further study of this phenomenon is warranted and could result in even more accurate screening via age-specific voice analytic biomarker tools. Similarly, further study and honing of the ML device to other patient characteristics that might allow for increased accuracy and value to clinicians is also warranted.
We note several strengths and limitations to be addressed in future ML depression screening studies. First, as a substantial strength, the data set was socioeconomically diverse and drew regional representation from across the United States and Canada, capturing a breadth of speech patterns and accents, and it was comparable in distribution to the racial makeup of the United States and Canada according to aggregate Census data.37,38 Because we did not collect information on comorbid conditions, future studies should expand the overall data set and capture relevant medical history of conditions affecting vocal production to help further understand any effects on voice biomarkers.
To correct imbalances in the representativeness of the study sample, we used targeted advertisements with images of men and seniors during recruitment. Persistent sample bias might be due to recruiting via social media or because depressed individuals might be more likely to participate in depression-based research. The average PHQ-9 score of the sample was 9.8, and just over 45% of participants scored ≥10, which is elevated relative to the 8.6% prevalence of major depressive episodes in the United States.39 Although sensitivity and specificity are generally stable properties of test performance, PPV and NPV vary with the prevalence of disease in the sample, with PPV increased and NPV decreased at higher prevalence, and might therefore have been affected in our results.40 Future study designs should use purposive rather than convenience sampling frames to achieve representativeness. The increased prevalence in our training sample might, however, have allowed the model to gain exposure to a broad spectrum of depression cases, which is important for generalizability given the nonuniform clinical presentation of depression.
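The prevalence dependence of PPV and NPV follows directly from the standard Bayes identities; the short sketch below holds sensitivity and specificity fixed at this study’s observed values and contrasts the study sample’s elevated prevalence with the 8.6% general-population figure.

```python
# PPV and NPV as a function of prevalence, with sensitivity and
# specificity fixed at this study's observed values (71.3 and 73.5).

def ppv(sens: float, spec: float, prev: float) -> float:
    return sens * prev / (sens * prev + (1 - spec) * (1 - prev))

def npv(sens: float, spec: float, prev: float) -> float:
    return spec * (1 - prev) / (spec * (1 - prev) + (1 - sens) * prev)

# ~45% prevalence in this sample vs 8.6% US prevalence of a major
# depressive episode: PPV falls sharply and NPV rises at low prevalence.
for prev in (0.45, 0.086):
    print(f"prevalence {prev:.1%}: PPV {ppv(0.713, 0.735, prev):.1%}, "
          f"NPV {npv(0.713, 0.735, prev):.1%}")
```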
Finally, the ML device was trained using the PHQ-9, which has reliably shown both a sensitivity and a specificity of 88% in screening for depression.8,9 Like the PHQ-9, however, the ML device is not intended as a substitute for a formal clinical interview and qualified clinician assessment, which remain the reference standard for confirming the presence of a depressive episode or clinical depression, nor is it meant to substitute for a comprehensive psychiatric evaluation for those who might be experiencing a mood disorder such as major depressive disorder. The ML device is not intended as a standalone tool for screening or diagnosing depression; we present these data to show how the device might be used by qualified clinicians, particularly primary care physicians such as family medicine doctors, as an adjuvant tool to help in monitoring and screening their patients for depression.
The present study represents one of the first attempts to train and validate ML technology to evaluate clips of free-form speech to detect signs of a depressive episode. Findings from this study suggest that harnessing ML technology to evaluate speech for the detection of signs of a depressive episode is effective compared with the PHQ-9 at a cutoff score of 10. This study supports that the use of ML technology as a clinical decision-support tool might be a step toward universal depression screening, a primary care objective recommended by the US Preventive Services Task Force.4,5 Although this ML device technology is a breakthrough, and we believe it is important to communicate its performance at this juncture, we emphasize that this is an initial validation study. The device analyzes a purely physiologic biometric (voice biomarkers) that does not depend on patient or clinician interpretation and is thus not subject to the inherent biases of natural language processing devices that interpret speech content; as such, it can be used to help validate and direct clinician action. We recognize that future studies are needed and expect this technology to continue to evolve and improve. Future studies will be directed toward determining the acceptability of augmenting primary care workflows with ML technology as a clinical decision-support tool and assessing the effect of other conditions that might influence depression voice biomarker analysis.
Acknowledgments
The authors extend their sincerest gratitude to Victoria Graham, Chase Walker, and Brandn Green for their invaluable contributions to this work.
Footnotes
Annals Early Access article
Conflicts of interest: A.M., P.T., and H.C. are currently employed by Kintsugi Mindful Wellness, Inc, have equity in the company, and have played a role in the investigational device’s development. Kintsugi Mindful Wellness received a grant from the National Science Foundation to conduct this research but had full control of the data and the decision to submit this manuscript to Annals of Family Medicine. M.P.W. and R.G.T. have no conflicts of interest to disclose.
Funding support: A.M., H.C., and P.T. were employed by Kintsugi Health (dba: Kintsugi Mindful Wellness, Inc) during the period this study was conducted. This work was supported by the National Science Foundation (grants #2036213 and #1938831).
Data statement: This manuscript complies with the policies and practices outlined for its respective program by the National Science Foundation SBIR funding, under which some data components are considered proprietary and may not be shared. The authors will respond to reasonable requests and comply as appropriate.
- Received for publication February 20, 2024.
- Revision received September 18, 2024.
- Accepted for publication September 19, 2024.
- © 2025 Annals of Family Medicine, Inc.