Commentary
When should we remain blind and when should our eyes remain open in diagnostic studies?

https://doi.org/10.1016/S0895-4356(02)00408-0

Introduction

The diagnostic work-up in practice starts with a patient presenting with a symptom or sign that raises suspicion of a particular disease. The work-up is commonly a phased, hierarchical process starting with a (disease-specific) patient history and physical examination, followed by more invasive and costly tests in various orders. It amounts to the estimation of the probability of disease presence using all obtained information or test results [1]. Each piece of information, including the answer to a simple question from the patient history and the presence or absence of a sign at physical examination, can be considered a different diagnostic test result [2–4]. The motive for scientific diagnostic research is commonly to increase the efficiency of the work-up in practice, i.e., to decrease patient burden or measurement costs while retaining or improving diagnostic accuracy [5–8]. To quantify whether the test under study truly contributes to the diagnostic work-up, the results of that test are commonly compared to the definitive diagnosis as determined by a reference method. Such a reference may be a single test, a combination of specific tests (or a specific set of criteria), or a consensus diagnosis in which the definitive diagnosis is judged by an outcome panel. The reference method is commonly more burdensome (invasive), time consuming, and expensive. Diagnostic accuracy refers to the proportion of patients that are correctly diagnosed by the test under study compared to the reference.
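
To make this phased estimation concrete: each test result updates the probability of disease, for instance via Bayes' theorem in its odds-likelihood ratio form. The following minimal sketch (ours, not from the original article; the pretest probability and all likelihood ratios are assumed illustration values) shows such sequential updating in Python. Note that multiplying likelihood ratios in this way assumes the test results are conditionally independent given disease status, an assumption the multivariable approaches discussed later avoid.

    # Sequential probability updating with Bayes' theorem (odds form).
    # All numbers are hypothetical illustration values.

    def update(prob, likelihood_ratio):
        """Convert probability to odds, apply the likelihood ratio, convert back."""
        odds = prob / (1.0 - prob)
        post_odds = odds * likelihood_ratio
        return post_odds / (1.0 + post_odds)

    p = 0.10               # assumed pretest probability from the presenting symptom
    p = update(p, 3.0)     # positive item from patient history (assumed LR+ = 3.0)
    p = update(p, 2.5)     # positive sign at physical examination (assumed LR+ = 2.5)
    p = update(p, 8.0)     # positive result of a subsequent test (assumed LR+ = 8.0)
    print(f"estimated probability of disease after the work-up: {p:.2f}")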

In diagnostic research, i.e., scientific studies aiming to quantify the accuracy or cost-effectiveness of diagnostic tests, blinding is commonly recommended to enhance validity [2,3,9–13]. There are, however, various levels at which blinding is possible, depending on the research question of the study, as schematically illustrated in Figs. 1 and 2. It has already been questioned whether blinding in diagnostic research (i.e., blinding at point 1, Figs. 1 and 2) yields meaningful results in circumstances where other, previously documented diagnostic information plays a role in the interpretation of the test(s) under study [10,11]. Nevertheless, the “Patients and Methods” section of reports on diagnostic studies, in particular studies on imaging tests whose results require subjective interpretation, commonly includes statements such as “all test results were interpreted without knowledge of any other diagnostic information” or “the observer of each test was blinded to all other information.” Such statements suggest that blinding has been executed at points 1, 2, and (if applicable) 3 of Figs. 1 and 2. We wondered whether “all this blinding” should always be pursued.

We believe it is timely to briefly reconsider the concept of blinding in scientific diagnostic studies. We give a brief overview of the different types of blinding that may be applied in diagnostic studies, depending on the study question. For each type, we discuss whether blinding is desirable, considering the consequences for the validity, interpretation, and clinical utility of the study results.

Scientific diagnostic studies commonly investigate whether the usually more burdensome, time consuming, and costly reference test(s) can be omitted from the diagnostic work-up (making the work-up more efficient) without unacceptable loss in diagnostic accuracy (i.e., without increasing false positive or false negative diagnoses). Given this aim, it is obvious that the final diagnosis must be made independently of the results of the test(s) under study, i.e., blinding at point 2 in Figs. 1 and 2 [2,3,9–13]. Commonly, the investigator(s) who assesses the final diagnosis is blinded to all preceding test results, including patient history and physical examination. If this blinding is not guaranteed, the information provided by the preceding test(s) under study may partly be used (“incorporated”) in the assessment of the final diagnosis by the reference method [2,3,9,11–13]. Consequently, the two information sources cannot be distinguished, and the estimated accuracy of the test(s) under study will be biased. Theoretically, this bias can lead to an under- or overestimation of the accuracy of the test under study. However, it often results in an overestimation, because the results of the test under evaluation and the results of the reference become more alike (matched), incorrectly decreasing the number of false positive and false negative results [12]. This kind of bias, also referred to as “incorporation bias” [2,9], particularly applies when the interpretation of the reference is subject to (intra- and interobserver) variation, which is common, for example, for imaging tests. It should be noted that, for the same reason, in studies in which the reference might be executed before the test under study is performed, the latter should be interpreted without knowledge of the final diagnosis.
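
The mechanism behind this overestimation can be illustrated numerically. The simulation below is our sketch, not part of the original article: the prevalence, the accuracies of the index test and the reference, and the degree to which the unblinded assessor is swayed by the index test result are all assumed values. It shows how incorporating the index test into the reference inflates the apparent sensitivity and specificity.

    # Simulation of incorporation bias with assumed parameters.
    import numpy as np

    rng = np.random.default_rng(0)
    n = 100_000
    disease = rng.random(n) < 0.30                     # assumed prevalence 30%

    # Index test: assumed true sensitivity 0.80, true specificity 0.90.
    index = np.where(disease, rng.random(n) < 0.80, rng.random(n) < 0.10)

    # Blinded reference: imperfect, but independent of the index test.
    ref_blind = np.where(disease, rng.random(n) < 0.95, rng.random(n) < 0.05)

    # Unblinded reference: in an assumed 30% of patients the assessor lets
    # the index test result override the independent judgement.
    sway = rng.random(n) < 0.30
    ref_unblind = np.where(sway, index, ref_blind)

    def apparent_accuracy(test, reference):
        sensitivity = test[reference].mean()
        specificity = (~test[~reference]).mean()
        return sensitivity, specificity

    for label, ref in [("blinded", ref_blind), ("unblinded", ref_unblind)]:
        sens, spec = apparent_accuracy(index, ref)
        print(f"{label:9s} reference: sensitivity {sens:.2f}, specificity {spec:.2f}")

Under the blinded reference, the apparent accuracy reflects the index test's true accuracy (attenuated somewhat by the reference's own imperfection); under the unblinded reference, both sensitivity and specificity rise, because the two readings have been matched.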

The feasibility of blinding those who assess the final diagnosis to the results of the test(s) under study depends on the type of reference. It is feasible if the reference consists of a specific test or combination of specific tests. Problems, however, arise in diagnostic studies of diseases that lack such a reference: diseases in which the final diagnosis is made by consensus on all available patient information, such as congestive heart failure [12,14]. Such studies commonly apply an outcome panel that judges all patient information, as obtained from the medical history, physical examination, additional tests, and follow-up, to establish a final diagnosis in each patient. Follow-up information is often added to better judge whether the disease at issue was present at the moment of (initial) presentation of the patient. Obviously, to prevent incorporation bias in studies using a consensus diagnosis as reference, the outcome panel should make the final diagnosis without the results of the test(s) under study. Withholding the results of the test(s) under study, however, may lead to misclassification of the final diagnosis, with varying consequences (over- or underestimation of the test's accuracy) that can hardly be judged afterwards. There are no general solutions to this dilemma, which is inherent to the consensus diagnosis as a reference. The pros and cons of excluding or including the results of the test under study in the consensus-based assessment of the final diagnosis should be weighed in each particular study. We have two general suggestions, illustrated by a hypothetical study that aims to quantify which tests from patient history and physical examination have diagnostic value, and whether echocardiography has added value, in patients suspected of heart failure. An outcome panel (consensus diagnosis) determines the “true” presence or absence of heart failure.

First, when studying the accuracy of a test that is known beforehand to carry much weight in the consensus judgement, as may apply to echocardiography in the heart failure example, it seems preferable not to use that test in the assessment of the final diagnosis. If such a test were used in the consensus diagnosis, its (added) value would be highly overestimated. On the other hand, if it is likely that the test under study provides only a “piece of information” within an array of more or less equally contributing tests, as may apply to the different tests from patient history or physical examination in the heart failure example, it may be better to include that test in the consensus judgement. The bias in the test's estimated accuracy can then at least be discussed afterwards: the accuracy would be overestimated due to incorporation bias, but the overestimation would be relatively small. If such a test were excluded from the consensus judgement, the effect of the resulting misclassification of the final diagnosis on the test's accuracy would be more difficult to judge afterwards.

A second suggestion for studies that use a consensus diagnosis as reference is to let the outcome panel judge the final diagnosis (e.g., presence or absence of heart failure) both without and with the test under study (e.g., echocardiography). Analyzing the data against both outcomes, i.e., a kind of sensitivity analysis, provides insight into the effect of incorporation bias on the estimated accuracy of the test under study.
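
Operationally, this sensitivity analysis amounts to scoring the test under study against two reference columns. The following sketch (ours; the column names and toy data are hypothetical) shows the two analyses side by side for the heart failure example:

    # Sensitivity analysis: test accuracy against the consensus diagnosis
    # made without and with the test under study. Data are hypothetical.
    import pandas as pd

    df = pd.DataFrame({
        "echo_positive":      [1, 1, 0, 0, 1, 0, 1, 0],
        "panel_without_echo": [1, 0, 0, 0, 1, 0, 1, 1],
        "panel_with_echo":    [1, 1, 0, 0, 1, 0, 1, 1],
    })

    def accuracy(test, reference):
        sensitivity = test[reference == 1].mean()
        specificity = (1 - test[reference == 0]).mean()
        return sensitivity, specificity

    for ref in ["panel_without_echo", "panel_with_echo"]:
        sens, spec = accuracy(df["echo_positive"], df[ref])
        print(f"vs {ref}: sensitivity {sens:.2f}, specificity {spec:.2f}")

The difference between the two sets of estimates indicates how strongly incorporation bias could affect the conclusions.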

As the diagnostic work-up in practice is commonly a phased and hierarchical process, it is generally acknowledged that the true clinical relevance of a test is determined by its added or incremental value [1–5,15–22]. This is the test's accuracy given the presence of test results that are documented anyway before the test under study would be applied, such as results from patient history and physical examination. To quantify the incremental value, investigators use some kind of multivariable analysis, for example logistic regression analysis or Bayes' theorem, to first estimate the (overall) accuracy of the preceding tests and subsequently add the test under study to estimate whether it improves the diagnostic accuracy. In studies aiming to quantify a test's added value, researchers commonly blind the observer of the test under evaluation to these preceding test results, i.e., blinding at point 1, Figs. 1 and 2. This can be inferred from statements such as “all test results were interpreted without knowledge of any other diagnostic information” or “the observer of each test was blinded to all other information.” Accordingly, the test under study is interpreted in isolation to obtain a so-called “clean test result” [10]. This is probably done for reasons similar to those outlined in the previous section, i.e., to prevent the incorporation of preceding test results in the interpretation of the test under study. We wonder, however, whether a scientific diagnostic study yields the true clinical relevance of a test if in the study the test was interpreted in isolation, i.e., blinded to other test results as obtained, for example, from patient history and physical examination.
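
As a concrete sketch of such an added-value analysis (ours, not the authors'; the data are simulated and all coefficients are assumed values), one can fit a logistic regression on the preceding tests alone, add the test under study, and compare the discrimination of the two nested models:

    # Added value of a test beyond patient history and physical examination,
    # estimated with nested logistic regression models on simulated data.
    import numpy as np
    import statsmodels.api as sm
    from sklearn.metrics import roc_auc_score

    rng = np.random.default_rng(1)
    n = 2000
    history = rng.integers(0, 2, n)    # item from patient history (0/1)
    physical = rng.integers(0, 2, n)   # sign at physical examination (0/1)
    new_test = rng.normal(size=n)      # continuous result of the test under study

    # Simulated "true" disease status; the coefficients are assumed values.
    logit = -2.0 + 0.8 * history + 0.6 * physical + 1.2 * new_test
    disease = (rng.random(n) < 1.0 / (1.0 + np.exp(-logit))).astype(int)

    X_base = sm.add_constant(np.column_stack([history, physical]))
    X_ext = sm.add_constant(np.column_stack([history, physical, new_test]))
    base = sm.Logit(disease, X_base).fit(disp=False)
    ext = sm.Logit(disease, X_ext).fit(disp=False)

    print(f"AUC, preceding tests only:      {roc_auc_score(disease, base.predict(X_base)):.3f}")
    print(f"AUC, with the test under study: {roc_auc_score(disease, ext.predict(X_ext)):.3f}")
    # A likelihood ratio test on the nested models formally tests the added value.
    print(f"likelihood ratio statistic (1 df): {2 * (ext.llf - base.llf):.1f}")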

Interpretation of a test in isolation often contrasts with clinical practice. Consider a study evaluating the accuracy of spiral CT scanning to diagnose pulmonary embolism (as determined by pulmonary angiography) in patients referred to the hospital because of suspected pulmonary embolism. In practice, the spiral CT scan will never be applied and interpreted in isolation. Its results will always be interpreted in view of the patient history and physical examination, and of other preceding test results, for example those obtained from electrocardiography, chest X-ray, and ventilation-perfusion lung scanning. To serve clinical practice, we therefore believe that also in a research setting the observer of the spiral CT scan should not be blinded to the preceding information. Begg already argued in 1987 that although such blinding may seem “more clean” (which is probably why it is so often applied), it is unrealistic [10]. Moreover, it is very likely, though not inevitable, that a diagnostic test will be better interpreted when previous test results (e.g., patient history and physical examination) are taken into account [23,24]. Hence, blinding the observer of the test under study to preceding test results that in practice are obtained anyway would underestimate the true added value of the test as it would be experienced in practice. Finally, as suggested in the previous section, also in studies aiming to quantify the added value of a test, researchers could execute a kind of sensitivity analysis to gain insight into the extent to which preceding results alter the interpretation, and therefore the added value, of the test under study. To facilitate this, researchers should interpret the test under study both without and with knowledge of the preexisting information; the data can then be analyzed in the same way as sketched above for the consensus diagnosis, once for each reading.

To summarize, when the aim of a diagnostic study is to quantify the added value of a particular test, and in routine practice preexisting results are commonly used in the interpretation of that test, researchers should not blind the observer of the test to the preexisting results. Such blinding does not conform to practice and will most likely underestimate the true clinical value of the test under study. The more the preceding test results may influence the interpretation of a subsequent test, the more blinding to them seems contraindicated. If, however, the question is whether a particular test can replace preexisting tests (e.g., the patient history or physical examination), blinding the observer of the test under study to preexisting results becomes necessary. This will be discussed in the next section.

Diagnostic studies often include the comparison of two tests, for example a new test (test 1, Fig. 2) vs. an existing test (test 2, Fig. 2), aiming to quantify their difference in accuracy as compared to a reference or, in case of equal diagnostic accuracy, their difference in patient burden or measurement costs. In such studies the alternative value of the two tests (tests 1 and 2, Fig. 2) is at issue, with the aim of replacing one by the other to increase the efficiency of the diagnostic work-up. Consider a study in patients suspected of pulmonary embolism who all undergo spiral CT scanning (test 1, Fig. 2), ventilation-perfusion lung scanning (test 2, Fig. 2), and pulmonary angiography (reference, Fig. 2). A question could be whether the spiral CT scan has a higher accuracy, as compared to the angiographic result, than the ventilation-perfusion scan, such that the latter could become redundant in the diagnostic work-up. In studies aiming to quantify the difference in accuracy between two tests in order to replace one by the other, the observer of each test must be blinded to the results of the other test to ensure a valid interpretation of both tests, i.e., blinding at point 3, Fig. 2. Otherwise, the information of test 1 may partly be used in the judgement of test 2 and vice versa, again increasing the likelihood of incorporation bias.
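
Because every patient undergoes both tests and the same reference, the comparison is paired. As an illustrative sketch (ours; all counts are hypothetical), the sensitivities of the two tests can be compared with McNemar's test on the discordant pairs among patients whose angiogram shows an embolism:

    # Paired comparison of two tests against the same reference.
    import numpy as np
    from statsmodels.stats.contingency_tables import mcnemar

    # Hypothetical counts among reference-positive patients:
    # rows = spiral CT positive/negative, columns = V/Q scan positive/negative.
    table = np.array([[70, 15],
                      [5, 10]])

    n_diseased = table.sum()
    sens_ct = table[0].sum() / n_diseased     # CT positive among diseased
    sens_vq = table[:, 0].sum() / n_diseased  # V/Q positive among diseased
    print(f"sensitivity: CT {sens_ct:.2f}, V/Q {sens_vq:.2f}")

    # McNemar's test uses only the discordant cells (here 15 vs. 5).
    result = mcnemar(table, exact=True)
    print(f"p-value for the paired difference in sensitivity: {result.pvalue:.3f}")

The analogous table among reference-negative patients compares the specificities.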

If in clinical practice both tests (tests 1 and 2, Fig. 2) will always be applied in addition to preceding tests such as patient history and physical examination, we suggest not blinding the observers of tests 1 and 2 to these preceding tests, for the reasons outlined in the previous section. This means no blinding at point 1, Fig. 2. The study aim of Fig. 2 should then be to quantify the difference in added value between tests 1 and 2.

Finally, a study may evaluate whether a particular (e.g., new) test can replace the physical examination in terms of improved diagnostic accuracy or reduced patient burden and measurement costs. For example, in patients with low back pain who are suspected of having a lumbar disc herniation, one may evaluate whether MRI is more cost-effective and, therefore, could replace the physical examination. If this question is to be answered, the physical examination is no longer considered a preceding test but, in fact, one of the two tests under study (i.e., test 1 or 2, Fig. 2). Accordingly, the observer of the other test (MRI) should be blinded to the physical examination and vice versa to prevent incorporation bias.


Concluding remarks

When studying the added or alternative value of a particular test, blinding the observer of that test to other test results becomes a nonissue if the result of the test under study can be made fully objective and is not subject to observer variation at all. This may be more or less achievable for some test results, for example, the number of millimeters of ST segment depression on exercise testing in the diagnosis of coronary artery disease using digital measurement (instead of the

Acknowledgments

We gratefully acknowledge the reviewers for their comments, which contributed greatly to the merits of this article, and the Netherlands Organization for Scientific Research (NWO) for its support (NR: gov-66-112).


References (24)

  • D.E. Grobbee et al. Clinical epidemiology: introduction to the discipline. Neth J Med (1995)
  • R. Mackenzie et al. Measuring the effects of imaging: an evaluative framework. Clin Radiol (1995)
  • O.S. Miettinen et al. Evaluation of diagnostic imaging tests: diagnostic probability estimation. J Clin Epidemiol (1998)
  • K.G.M. Moons et al. Limitations of sensitivity, specificity, likelihood ratio and Bayes' theorem in assessing diagnostic probabilities: a clinical example. Epidemiology (1997)
  • D.L. Sackett et al. Clinical epidemiology: a basic science for clinical medicine (1985)
  • A.R. Feinstein. Clinical epidemiology: the architecture of clinical research (1985)
  • K.G.M. Moons et al. Redundancy of single diagnostic test evaluation. Epidemiology (1999)
  • M.G. Hunink. Outcomes research and cost-effectiveness analysis in radiology. Eur Radiol (1996)
  • L. Dalla-Palma et al. An overview of cost-effective radiology. Eur Radiol (1997)
  • D.F. Ransohoff et al. Problems of spectrum and bias in evaluating the efficacy of diagnostic tests. N Engl J Med (1978)
  • C.B. Begg. Biases in the assessment of diagnostic tests. Stat Med (1987)
  • C.B. Begg et al. Assessment of radiologic tests: control of bias and other design considerations. Radiology (1988)