Studies in which data from multiple patients are collected per clinician or per practice are becoming common in primary care research, particularly with the increase of studies conducted in practice-based research networks. These studies generate data that are clustered. A special case of clustered data is an intervention study where clinicians or practices are randomized into an intervention or control group. In such cluster-randomized designs, all patients of a clinician or practice are assigned to the same treatment, and this design is often used when logistics of implementation or the need to avoid contamination of treatment arms is a priority.
A major issue in the analysis of clustered data is that observations within a cluster are not independent, and the degree of similarity is typically measured by the intracluster correlation coefficient (ICC).1 Ignoring the intracluster correlation in the analysis could lead to incorrect P values, confidence intervals that are too small, and biased estimates and effect sizes, all of which can lead to incorrect interpretation of associations between variables.2 Failure to take into account the clustered structure of the study design during the planning phase of the study also can lead to underpowered study designs in which the effective sample size and statistical power to detect differences are smaller than planned.
In most situations, the numeric value of the intracluster correlation tends to be small and positive. Several authors have provided guidelines for interpreting the magnitude of the intraclass correlation,3 with small, medium, and large values of the intraclass correlation coefficient reported as .05, .10, and .15. Small values of the intracluster correlation can be deceiving, however. Investigators need to be aware that the cluster effect is a combination of both the intracluster correlation and the cluster size. Small intracluster correlations coupled with large cluster sizes can still affect the validity of conventional statistical analyses.
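To make the interplay of intracluster correlation and cluster size concrete, the following is a brief illustration using the widely used design effect (variance inflation factor) for equal-sized clusters. The symbols $m$ (cluster size) and $\rho$ (intracluster correlation) are introduced here for illustration and are an assumption, not notation taken from the articles discussed in this editorial.

```latex
% Design effect for clusters of equal size m and intracluster correlation \rho
\mathrm{DEFF} = 1 + (m - 1)\,\rho

% Worked example with a "small" intracluster correlation, \rho = 0.05
m = 10:  \quad \mathrm{DEFF} = 1 + 9(0.05)  = 1.45, \qquad \sqrt{\mathrm{DEFF}} \approx 1.20
m = 100: \quad \mathrm{DEFF} = 1 + 99(0.05) = 5.95, \qquad \sqrt{\mathrm{DEFF}} \approx 2.44
```

Under these assumptions, the same small correlation of .05 inflates the variance of an estimate by about 45% when clusters contain 10 patients, but nearly sixfold when clusters contain 100 patients, so naive standard errors would be too small by a factor of roughly 2.4.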
Although clustered data are common, investigators often overlook both the special analysis challenges and the unique opportunities inherent in clustered data.4,5 In this issue of the Annals, Reed suggests a convenient correction procedure to address clustered data.6 The correction involves applying a formula to the standard errors and then conducting the planned analysis with the corrected standard errors. Also in this issue, the article by Killip et al7 provides a formula to compute an effective sample size for clustered data. Computation of the effective sample size is important, as it avoids costly sample size errors caused by underpowered studies. Examples in the Killip et al article show how the intracluster correlation, number of observations within a cluster, and number of clusters are all interrelated in estimating sample size and power for clustered data.
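The relationship among intracluster correlation, cluster size, and number of clusters can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: it uses the common design-effect approximation for equal-sized clusters, and the function names (`design_effect`, `effective_sample_size`, `clusters_needed`) are hypothetical helpers written for this illustration. It is not presented as the exact formula in either of the featured articles.

```python
import math


def design_effect(icc: float, cluster_size: float) -> float:
    """Variance inflation for equal-sized clusters: 1 + (m - 1) * ICC.

    Assumption: all clusters contain the same number of observations.
    """
    return 1.0 + (cluster_size - 1.0) * icc


def effective_sample_size(n_clusters: int, cluster_size: int, icc: float) -> float:
    """Number of independent observations the clustered sample is worth."""
    total = n_clusters * cluster_size
    return total / design_effect(icc, cluster_size)


def clusters_needed(n_independent: int, cluster_size: int, icc: float) -> int:
    """Clusters required to match a sample size computed assuming independence."""
    inflated_total = n_independent * design_effect(icc, cluster_size)
    return math.ceil(inflated_total / cluster_size)


if __name__ == "__main__":
    # Hypothetical study: 20 practices, 50 patients each, ICC = 0.05.
    print(design_effect(0.05, 50))              # variance inflation factor
    print(effective_sample_size(20, 50, 0.05))  # roughly 290 of 1,000 patients
    print(clusters_needed(400, 50, 0.05))       # practices needed for n = 400
```

In this hypothetical example, 1,000 enrolled patients carry roughly the information of 290 independent patients, which is why a sample size computed without regard to clustering can leave a study underpowered.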
Clustered data imply a hierarchical nature to the data, and while many levels can be considered, two levels are most commonly specified. The outcome measure is always assessed at the lowest level. Explanatory variables, however, may be considered at any of the levels (eg, patient variables and/or physician or practice level variables). Consequently, clustered data provide considerable opportunities to explore, in greater depth, the interrelationships among variables at any level; these analyses are generically called multilevel analyses.
Consider, for example, data in which patients are clustered within physicians. A comprehensive multilevel analysis aims to assess the direct effects of patient-level and clinician/practice-level variables on the outcome. One could also determine whether the variables at the clinician/practice level serve as moderators of patient-level relationships by testing cross-level interactions between variables from the patient level and the physician level.8 Hence, multilevel analyses are designed to analyze variables from different levels simultaneously, all the while taking into account the intracluster correlation.
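One common way to write such an analysis is as a two-level random-intercept model. The formulation below is a standard textbook sketch rather than a model taken from the cited sources; the symbols ($y_{ij}$ for the outcome of patient $i$ of physician $j$, $x_{ij}$ for a patient-level variable, $z_j$ for a physician-level variable) are assumptions introduced for illustration.

```latex
% Two-level model: patient i nested within physician j
y_{ij} = \beta_0 + \beta_1 x_{ij} + \beta_2 z_j + \beta_3\, x_{ij} z_j + u_j + e_{ij},
\qquad u_j \sim N(0, \tau^2), \quad e_{ij} \sim N(0, \sigma^2)

% Intracluster correlation implied by the variance components
\rho = \frac{\tau^2}{\tau^2 + \sigma^2}
```

Here $\beta_1$ and $\beta_2$ are the direct effects of the patient-level and physician-level variables, $\beta_3$ is the cross-level interaction (whether the physician-level variable moderates the patient-level relationship), and the physician random effect $u_j$ is what accounts for the intracluster correlation.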
Statistical software for conducting these types of analyses and for computing sample size for clustered data now exists, and we encourage its wider use.9–11 While the two articles featured in this issue help raise awareness of the challenges and some solutions to analyzing clustered data, the skills required for optimal analysis of clustered data often are beyond those of most clinician-investigators. Studies involving clustered data would greatly benefit from the expertise provided by statisticians versed in the analysis of clustered data. Several recent textbooks3,9,12–14 and Web sites15–17 provide a good introduction to the area with realistic health care examples. Finally, the recent CONSORT statement delineating guidelines for reporting of randomized controlled trials has now been extended to the special case of cluster-randomized trials.18
Footnotes
- Conflicts of interest: none reported
- Received for publication March 26, 2004.
- Accepted for publication April 9, 2004.
- © 2004 Annals of Family Medicine, Inc.