The ongoing tyranny of statistical significance testing in biomedical research

  • METHODS
European Journal of Epidemiology

Abstract

Since its introduction into the biomedical literature, statistical significance testing (SST) has caused much debate. This perspective article reviews frequent fallacies and misuses of SST in the biomedical field and a potential way out of them. Two frequentist schools of statistical inference merged to form SST as it is practised today: the Fisher school and the Neyman-Pearson school. The P-value is both reported quantitatively and checked against the α-level to produce a qualitative, dichotomous result (significant/nonsignificant). However, a P-value mixes the estimated effect size with its estimated precision, and a single number cannot measure both. For a valid interpretation of SSTs, a variety of presumptions and requirements have to be met. We point here to four of them: study size, correct statistical model, correct causal model, and absence of bias and confounding. It has been stated that the P-value is perhaps the most misunderstood statistical concept in clinical research. As in the social sciences, the tyranny of SST is still highly prevalent in the biomedical literature, even after decades of warnings against it. The ubiquitous misuse and tyranny of SST threaten scientific discoveries and may even impede scientific progress. In the worst case, misuse of significance testing may harm patients who are eventually treated incorrectly because of improper handling of P-values. For a proper interpretation of study results, both the estimated effect size and its estimated precision are necessary ingredients.
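
The point that a P-value mixes effect size with precision can be illustrated with a minimal sketch. It assumes a Wald (normal-approximation) test on the log risk ratio; the two studies, their numbers, and the helper name `wald_summary` are invented for illustration and do not come from the article.

```python
import math

Z_975 = 1.959964  # approximate 97.5th percentile of the standard normal


def wald_summary(log_rr, se):
    """Summarise a log risk ratio and its standard error.

    Returns (risk ratio, lower 95% limit, upper 95% limit, two-sided P-value)
    under a normal approximation. Hypothetical helper, not from the article.
    """
    z = log_rr / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail area
    half_width = Z_975 * se
    return (
        math.exp(log_rr),
        math.exp(log_rr - half_width),
        math.exp(log_rr + half_width),
        p_value,
    )


# Two invented studies with the same test statistic (z = 2):
# a large but imprecise estimate, and a small but precise one.
imprecise = wald_summary(math.log(2.0), math.log(2.0) / 2.0)
precise = wald_summary(math.log(1.1), math.log(1.1) / 2.0)

for label, (rr, low, high, p) in (("imprecise", imprecise), ("precise", precise)):
    print(f"{label}: RR = {rr:.2f}, 95% CI {low:.2f} to {high:.2f}, P = {p:.3f}")
```

Both studies yield the same P-value, yet one estimates a doubling of risk with a wide interval and the other a 10% increase with a narrow interval. Reporting the estimate together with its interval keeps the two pieces of information separate.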

Disclosure

None of the authors reports any conflict of interest.

Author information

Corresponding author

Correspondence to Andreas Stang.

About this article

Cite this article

Stang, A., Poole, C. & Kuss, O. The ongoing tyranny of statistical significance testing in biomedical research. Eur J Epidemiol 25, 225–230 (2010). https://doi.org/10.1007/s10654-010-9440-x

  • DOI: https://doi.org/10.1007/s10654-010-9440-x
