Abstract
BACKGROUND Clinicians are faced with a plethora of guidelines. To rate guidelines, they can select from a number of evaluation tools, most of which are long and difficult to apply. The goal of this project was to develop the Guideline Trustworthiness, Relevance, and Utility Scoring Tool (G-TRUST), a simple, easy-to-use checklist to help clinicians identify trustworthy, relevant, and useful practice guidelines.
METHODS A modified Delphi process was used to obtain consensus of experts and guideline developers regarding a checklist of items and their relative impact on guideline quality. We conducted 4 rounds of sampling to refine wording, add and subtract items, and develop a scoring system. Multiple attribute utility analysis was used to develop a weighted utility score for each item to determine scoring.
RESULTS Twenty-two experts in evidence-based medicine, 17 developers of high-quality guidelines, and 1 consumer representative participated. In rounds 1 and 2, items were rewritten or dropped, and 2 items were added. In round 3, weighted scores were calculated from rankings and relative weights assigned by the expert panel. In the last round, more than 75% of experts indicated 3 of the 8 checklist items to be major indicators of guideline usefulness and, using the AGREE tool as a reference standard, a scoring system was developed to classify guidelines as useful, may not be useful, or not useful.
CONCLUSION The 8-item G-TRUST is potentially helpful as a tool for clinicians to identify useful guidelines. Further research will focus on its reliability when used by clinicians.
INTRODUCTION
Clinicians expect clinical practice guidelines to have 3 characteristics. Good guidelines should (1) be trustworthy, in that the recommendations are based on the best available evidence; (2) be relevant, meaning the recommendations are pertinent to one’s practice population and focus on affecting outcomes of importance to patients; and (3) have a high degree of utility, in that the recommendations are clear and actionable.
Numerous researchers have documented issues with the guideline development process. Guidelines vary in their relevance to specific clinical practice,1–5 their use of evidence,6–14 and the role of other factors in the process of drafting recommendations.15–32
Tools are available to evaluate the quality of clinical practice guidelines.33–38 These tools, however, are designed in part to guide guideline development and are difficult for nonresearchers to use without extensive training. None of these tools considers the need for a focus on patient-oriented outcomes, and none allows users to conclude whether a guideline should be followed. The aim of this study was to develop the Guideline Trustworthiness, Relevance, and Utility Scoring Tool (G-TRUST) for clinicians to easily identify useful clinical practice guidelines.
METHODS
The study design used a Delphi approach39 to obtain expert consensus on items for inclusion, to hone the wording of the items, and to develop a ranking system. The Delphi approach is designed to gather the wisdom of the group without succumbing to issues of group process, such as social pressure (groupthink40), personality influence,15 and individual dominance.16,41 It can be applied to generate consensus within groups of individuals who hold different views. Its main attributes include anonymity of participants, structured information flow to participants, and regular feedback to the group on the progress of the decision making.
Because we had already developed and piloted a set of items for evaluating the validity of guidelines, we used a modified Delphi technique, which can be used when basic information is already available.42 The source of checklist items is outlined in Supplemental Appendix 1 at http://www.annfammed.org/content/15/5/413/suppl/DC1.
Selection of Experts
We selected a representative group of volunteer experts from 2 populations: producers of practice guidelines known to be of high quality in several clinical areas;7,13,14,43–46 and self-identified and recognized experts in evidence-based medicine. Physicians in family medicine and primary care internal medicine were represented in both groups. The sources of these experts are further described in the supplementary material (Supplemental Appendix 2, http://www.annfammed.org/content/15/5/413/suppl/DC1).
Initial Items on the Tool
The initial 8 items of the instrument were derived from several sources, including the National Academy of Medicine’s (formerly Institute of Medicine) “Clinical Practice Guidelines We Can Trust,”47 the AGREE II instrument,48,49 and our own research50 and that of others34 on guideline validity. The items hew most closely to the National Academy of Medicine standards47 and are critical for recognizing flaws in the evidence development process, evaluating the relevance of recommendations to clinical practice, and identifying threats to the judgment process of deriving recommendations from the evidence.
Delphi Process
The modified Delphi process consisted of 4 rounds and was conducted using an online survey instrument (http://www.surveymonkey.com). Participants were not told the number or identity of other participants. At each stage we analyzed results from each subgroup (evidence-based medicine and guideline experts) separately to identify any discrepancies in opinion. The complete process is outlined in Figure 1 and explained in Supplemental Appendix 3, http://www.annfammed.org/content/15/5/413/suppl/DC1.
Figure 1. Study flow chart.
The goal of the first round was to develop the wording of the items and identify additional items to be added to the tool. For the second round, participants were asked whether the revised items were “required to identify guidelines that present both relevant and trustworthy recommendations.” For the third round, participants were given aggregate responses from the second round and asked to rank order the items and assign relative weights. Based on these rankings and weights, we used multiple attribute utility analysis51,52 to obtain utility scores for each item on a scale from 0 to 100. During the fourth round, participants determined whether each item was a major or minor threat to the usefulness of a practice guideline.
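To make the weighting step concrete, below is a minimal sketch of one common rank-and-weight utility calculation, assuming each panelist assigns a raw importance weight to every item, each panelist’s weights are rescaled so that his or her top item equals 100, and the median is taken across panelists. The item names and weights are hypothetical, and the authors’ exact multiple attribute utility procedure may differ.

```python
# Minimal sketch of a rank-and-weight utility calculation (hypothetical;
# the authors' exact multiple attribute utility procedure may differ).
from statistics import median

def utility_scores(panel_weights):
    """panel_weights: one dict per panelist mapping item -> raw weight.
    Returns item -> median utility on a 0-100 scale."""
    rescaled = []
    for weights in panel_weights:
        top = max(weights.values())
        # Rescale so each panelist's highest-weighted item scores 100.
        rescaled.append({item: 100 * w / top for item, w in weights.items()})
    return {item: median(r[item] for r in rescaled) for item in rescaled[0]}

# Hypothetical weights from 2 panelists for 3 invented item names.
panel = [
    {"systematic_review": 10, "graded_evidence": 8, "patient_outcomes": 7},
    {"systematic_review": 9, "graded_evidence": 9, "patient_outcomes": 5},
]
print(utility_scores(panel))
# -> {'systematic_review': 100.0, 'graded_evidence': 90.0, 'patient_outcomes': 62.8}
```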
Scoring System
To determine concurrent validity and to develop a scoring system, the final items were used to assess the quality of 35 guidelines previously assessed by others using the AGREE instrument: 26 (74.3%) of low quality and 9 of high quality.13,14 Two authors (L.C. and A.F.S.), independently and masked to the AGREE quality scores, assessed each guideline using 7 of the 8 G-TRUST items, excluding the item evaluating the clinical relevance of the recommendations because this criterion is not considered in AGREE or AGREE II and will vary by user. For each item they determined whether the criterion was met, was not met, or could not be determined from the guideline description. Results from the 2 investigators were compared, and discrepancies were resolved by discussion.
Analysis
For each round, except for utility calculations after round 3, we calculated average responses. For round 4, we used either the Fisher exact test or the χ2 test with Yates correction to determine whether designation of major and minor flaws was different between evidence-based medicine experts and guideline writers.
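For illustration, the sketch below shows how such a subgroup comparison could be run on a 2 × 2 table of major/minor designations; the counts are hypothetical and this is not the authors’ analysis code.

```python
# Illustrative comparison of "major" vs "minor" flaw designations between
# the two expert subgroups; counts below are hypothetical, not study data.
from scipy.stats import chi2_contingency, fisher_exact

#             major  minor
table = [[15, 4],   # evidence-based medicine experts (hypothetical)
         [12, 3]]   # guideline developers (hypothetical)

# Fisher exact test, appropriate when expected cell counts are small.
_, p_fisher = fisher_exact(table)

# Chi-square test; correction=True applies the Yates continuity
# correction (the scipy default for 2x2 tables).
chi2, p_chi2, dof, expected = chi2_contingency(table, correction=True)

print(f"Fisher exact P = {p_fisher:.2f}; chi-square (Yates) P = {p_chi2:.2f}")
```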
RESULTS
Expert Consensus Panel
The group (Table 1) comprised 40 members representing expertise in evidence-based medicine (n = 22) and in guideline development (n = 17); 1 consumer representative had expertise in risk communication and health policy. All panel members participated in rounds 1 and 2, 95% of members (n = 38) participated in round 3, and 85% (n = 34) participated in round 4.
Table 1. Demographic Composition of the Expert Panel
Item Selection and Refinement
Responses from the first Delphi round resulted in changes to the wording and explanations of several items. A general statement (not a specific item) was added that the guideline should have been written or updated within the past 5 years, similar to the requirement for inclusion in the National Guideline Clearinghouse.53
In the second round, 6 of the 8 items were deemed critical to assessing the relevance and validity of practice guideline recommendations. The item, “The guidelines are the official stance or policy of a professional society,” was deemed to be critical by only 7.5% of participants and was dropped from the instrument. Based on written comments, wording for 1 item was slightly changed, and 1 compound item was split into 2 items.
Table 2 contains the median utility scores for the third round, calculated from respondents’ rankings and weightings. Utility varied widely; items evaluating evidence quality (systematic review, evidence grading) had the highest utility, followed by items evaluating relevance.
Table 2. Final Item Wording With Utility Scores and Ratings
The last round produced a cutoff between major and minor threats that corresponded to a utility greater than 16 for the weighted scores (Table 2). Two evidence validity items were considered major: 1 pertained to identifying a systematic review (100%), and the other to the use of graded evidence (85.3%). One relevance item, “recommendations focus on improving patient-oriented outcomes, explicitly comparing benefits versus harms to support clinical decision making,” was also considered a major threat to the usefulness of guidelines by most participants (82.4%). The rest of the items were considered major threats by fewer than one-half of the participants. There was no difference in these designations between evidence-based medicine experts and guideline developers. Results from each Delphi round are outlined in detail in Supplemental Appendix 4, http://www.annfammed.org/content/15/5/413/suppl/DC1.
G-TRUST Scoring System
Using AGREE scores as our reference standard, we evaluated various combinations of item responses from the consensus panel to develop a scale that minimizes the number of lower quality guidelines identified as trustworthy (a minimal sketch of the resulting decision rule follows the list):
Useful: no major items answered “can’t tell” or “no,” and 0–1 minor items answered “no.”
May not be useful: no major items answered “can’t tell” or “no,” but 2 minor items answered “no.”
Not useful: any major item answered “can’t tell” or “no,” or more than 2 minor items answered “no.”
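The sketch below encodes this decision rule directly; the answer values mirror the checklist’s response options, and the example inputs are hypothetical.

```python
# Direct encoding of the scoring rules above; the example call at the
# bottom is hypothetical. Each answer is "yes", "no", or "can't tell".
def classify_guideline(major_answers, minor_answers):
    major_failed = any(a in ("no", "can't tell") for a in major_answers)
    minor_nos = sum(1 for a in minor_answers if a == "no")
    if major_failed or minor_nos > 2:
        return "not useful"
    if minor_nos == 2:
        return "may not be useful"
    return "useful"  # no major failures and 0-1 minor "no" answers

# 3 major items all met, 1 of 5 minor items answered "no" -> "useful"
print(classify_guideline(["yes", "yes", "yes"],
                         ["yes", "no", "yes", "yes", "yes"]))
```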
Applying these cutoffs, 3 of 26 (11%) low-quality guidelines were identified as being useful. These guidelines were downgraded by the AGREE instrument because of scores of 0 for the domain “editorial independence,” comprising editorial independence from the funding body and recording of conflicts of interest. A lack of clarity in these guidelines’ descriptions of conflicts of interest resulted in a score of “cannot be determined” using the G-TRUST instrument. After the analysis, we added the following wording (in italics) to clarify this item: “The Chair of the guideline development committee and a majority of the rest of the committee are free of declared financial conflicts of interest, and the guideline development group did not receive industry funding for developing the guideline.”
Of the 9 guidelines determined to be of high quality by AGREE, the G-TRUST instrument identified 5 (55%) as either may not be useful or not useful. Using AGREE, 1 guideline received a high score (81%) for rigor even though it was not based on a systematic review of the literature. The remaining guidelines were graded as may not be useful because they did not include members from most of the relevant specialties or were not substantially free of conflicts of interest.
DISCUSSION
Through expert consensus we developed an 8-item checklist designed to help clinicians quickly identify useful guidelines to follow in practice. Using AGREE as our reference standard, our checklist identified almost all (92%) of the low-quality guidelines and disqualified many high-quality guidelines because of a stricter definition of trustworthiness. The items in the G-TRUST (tool available from the authors) address issues and concerns voiced by the National Academy of Medicine report47 and the AGREE II instrument33 and add issues of relevance not considered by either.54,55 G-TRUST is more stringent than AGREE II in that it stipulates that an independent (ie, nonconflicted) research analyst or methodologist be part of the process, based on recent research findings that including independent methodological experts may better ensure evidence-based and conservative recommendations.56,57 The tool is also more stringent than AGREE II in its handling of conflicts of interest (barring them rather than simply addressing them) and in requiring broad representation on the guideline development group. Applying the stricter requirement for conflicts of interest reflected in the G-TRUST led to many guidelines being rated as may not be useful that AGREE would rate as high quality.
A major advantage of the G-TRUST is that it assigns different importance to individual items (ie, major vs minor) and arrives at a determination of overall guideline quality (useful, may not be useful, not useful).
In the development of the scoring system, we produced a conservative cutoff score that, while preventing false positives (ie, falsely identifying low-quality guidelines as high quality), will exclude some high-quality guidelines. Given the large number of guidelines, this emphasis is needed to ensure that fewer low-quality guidelines are incorrectly identified as useful.
A second limitation is that it may be difficult for users to determine conflicts of interest and the presence of a research analyst on the guideline development group. Despite extensive searching, we could not determine the answers to these items for almost one-half of the studied guidelines. In a previous study we found that more than one-half (57%) of the guidelines for the treatment of major depressive disorder did not include a conflicts of interest policy or disclosure statement.8
The evidence supporting clinical practice guideline development is preliminary; the National Academy of Medicine report, upon which we based our initial development, is widely regarded as the best available guidance. Still, much of the evidence about what constitutes a reliable and valid practice guideline rests on expert opinion.
Further research should determine the reliability of G-TRUST by comparing scores obtained by individual users. Also, the use of technology, such as smartphone applications that integrate with the National Guideline Clearinghouse, could be explored to determine the usefulness of the tool. In addition, neither the G-TRUST nor any of the other guideline evaluation tools evaluates whether guidelines provide enough information to support shared decision making.
Considering the proliferation of guidelines in all areas of medicine and the well-documented concerns about their validity and trustworthiness, clinicians need an easy-to-use screening tool to enhance evidence-based care. The 8-item G-TRUST instrument is a potentially helpful tool for clinicians to identify clinical practice guidelines that are trustworthy in their development, reliable in their application to patient care, and of high utility in clinical practice.
Acknowledgments
Madeline Brodt, MS, administered the surveys and collated and analyzed the results. The guideline developers and the evidence-based medicine experts volunteered their time for this project.
Footnotes
Conflicts of interest: A.F.S., J.L., and L.C. have published research evaluating clinical practice guidelines and a commentary advocating for more rigorous oversight of guideline development and dissemination.
Funding support: The research included in this article was supported by an R03 grant from the Agency for Healthcare Research and Quality (Grant No. R03HS022940-01A1).
Previous presentation: Guidelines International Network Scientific Programme; August 20, 2015; Amsterdam, the Netherlands.
Supplementary materials: Available at http://www.AnnFamMed.org/content/15/5/413/suppl/DC1/.
- Received for publication October 24, 2016.
- Revision received February 6, 2017.
- Accepted for publication March 16, 2017.
- © 2017 Annals of Family Medicine, Inc.