
Multi-level longitudinal learning curve regression models integrated with item difficulty metrics for deliberate practice of visual diagnosis: groundwork for adaptive learning


Abstract

Visual diagnosis of radiographs, histology and electrocardiograms lends itself to deliberate practice, facilitated by large online banks of cases. Which cases to supply to which learners, and in which order, remains to be worked out, and there is considerable potential for adapting the learning to the individual. Advances in statistical modeling based on an accumulating learning curve offer methods for more effectively pairing learners with cases of known calibration. Using demonstration radiograph and electrocardiogram datasets, the advantages of moving from traditional regression to multilevel methods for modeling growth in ability or performance are demonstrated, with a final step of integrating case-level item-response information based on diagnostic grouping. This produces more precise individual-level estimates that can eventually support learner-adaptive case selection. The progressive increase in model sophistication is not merely statistical: it brings the models into alignment with core learning principles, including the importance of accounting for individual differences in baseline skill and learning rate, as well as the differential interaction of learners with cases of varying diagnosis and difficulty. The developed approach can thus give researchers and educators a better basis on which to anticipate learners' pathways and individually adapt their future learning.


Fig. 2 The left panel reproduces from Fig. 1 both the individual no-pooling regressions (blue lines) and the complete-pooling regression (red line), for comparison with the right panel, which shows the random intercept model along with the fixed-effect line (red) at the mean intercept. The random intercept model shows more consistent alignment with the central tendency in base knowledge


References

  • Anderson, S. J., Hecker, K. G., Krigolson, O. E., & Jamniczky, H. A. (2018). A reinforcement-based learning paradigm increases anatomical learning and retention—a neuroeducation study. Frontiers in Human Neuroscience, 12, 38.

  • Bates, D., Maechler, M., Bolker, B., & Walker, S. (2015). Fitting linear mixed-effects models using lme4. Journal of Statistical Software, 67(1), 1–48. https://arxiv.org/abs/1406.5823v1

  • Bates, D., Maechler, M., Bolker, B., & Walker, S. (2015). lme4: Linear mixed-effects models using Eigen and S4. R package version 1.1–10. http://CRAN.R-project.org/package=lme4

  • Bok, H. G., de Jong, L. H., O'Neill, T., Maxey, C., & Hecker, K. G. (2018). Validity evidence for programmatic assessment in competency-based education. Perspectives on Medical Education, 7(6), 362–372.

  • Bolsin, S., & Colson, M. (2000). The use of the cusum technique in the assessment of trainee competence in new procedures. International Journal for Quality in Health Care, 12(5), 433–438.

  • Boutis, K., Pecaric, M., Carrière, B., Stimec, J., Willan, A., Chan, J., & Pusic, M. (2019). The effect of testing and feedback on the forgetting curves for radiograph interpretation skills. Medical Teacher, 41(7), 756–764.

  • Bryk, A. S., & Raudenbush, S. W. (1987). Application of hierarchical linear models to assessing change. Psychological Bulletin, 101, 147–158.

  • Cepeda, N. J., Vul, E., Rohrer, D., Wixted, J. T., & Pashler, H. (2008). Spacing effects in learning: A temporal ridgeline of optimal retention. Psychological Science, 19(11), 1095–1102.

  • Cepeda, N. J., Coburn, N., Rohrer, D., Wixted, J. T., Mozer, M. C., & Pashler, H. (2009). Optimizing distributed practice: Theoretical analysis and practical implications. Experimental Psychology, 56(4), 236–246.

  • Chaiklin, S. (2003). The zone of proximal development in Vygotsky's analysis of learning and instruction. In A. Kozulin, B. Gindis, V. Ageyev, & S. Miller (Eds.), Vygotsky's educational theory in cultural context (pp. 39–64). Cambridge, England: Cambridge University Press.

  • Davis, A. L., Pecaric, M., Pusic, M. V., Smith, T., Shouldice, M., Brown, J., & Boutis, K. (2020). Deliberate practice as an educational method for learning to interpret the prepubescent female genital examination. Child Abuse and Neglect, 101, 104379.

  • De Boeck, P., et al. (2011). The estimation of item response models with the lmer function from the lme4 package in R. Journal of Statistical Software, 39(12), 1–28.

  • De Boeck, P., & Wilson, M. (2004). Explanatory item response models: A generalized linear and nonlinear approach. New York: Springer.

  • Diederich, E., Thomas, L., Mahnken, J., & Lineberry, M. (2018). Pretest scores uniquely predict 1-year-delayed performance in a simulation-based mastery course for central line insertion. Simulation in Healthcare, 13(3), 163–167.

  • Doran, H., Bates, D., Bliese, P., & Dowling, M. (2007). Estimating the multilevel Rasch model: With the lme4 package. Journal of Statistical Software, 20(2), 1–18.

  • Downing, S. M. (2003). Item response theory: Applications of modern test theory in medical education. Medical Education, 37(8), 739–745.

  • Ericsson, K. A. (2004). Deliberate practice and the acquisition and maintenance of expert performance in medicine and related domains. Academic Medicine, 79(10), S70–S81.

  • Ericsson, K. A. (2015). Acquisition and maintenance of medical expertise. Academic Medicine, 90(11), 1471–1486.

  • Ertmer, P. A., & Newby, T. J. (1993). Behaviorism, cognitivism, constructivism: Comparing critical features from an instructional design perspective. Performance Improvement Quarterly, 6(4), 50–72.

  • Faraway, J. J. (2016). Linear models with R. Boca Raton: Chapman and Hall/CRC.

  • Gelman, A., & Hill, J. (2007). Multilevel linear models: The basics. In Data analysis using regression and multilevel/hierarchical models (pp. 251–278). Cambridge: Cambridge University Press.

  • Gelman, A., & Hill, J. (2007). Simulation of probability models and statistical inferences. In Data analysis using regression and multilevel/hierarchical models (pp. 251–278). Cambridge: Cambridge University Press.

  • Guadagnoli, M., Morin, M., & Dubrowski, A. (2012). The application of the challenge point framework in medical education. Medical Education, 46, 447–453.

  • Gulliksen, H. (1934). A rational equation of the learning curve based on Thorndike's law of effect. The Journal of General Psychology, 11(2), 395–434.

  • Hatala, R., Gutman, J., Lineberry, M., Triola, M., & Pusic, M. (2019). How well is each learner learning? Validity investigation of a learning curve-based assessment approach for ECG interpretation. Advances in Health Sciences Education, 24(1), 45–63.

  • Jaber, M. Y., & Bonney, M. (1997). A comparative study of learning curves with forgetting. Applied Mathematical Modelling, 21(8), 523–531.

  • Jonassen, D. (1991). Objectivism versus constructivism: Do we need a new philosophical paradigm? Educational Technology Research and Development, 39(3), 5–14.

  • Kerfoot, B. P., Baker, H., Pangara, L., Agarwal, K., Taffet, G., Mechaber, A. J., et al. (2012). An online spaced-education game to teach and assess medical students. Academic Medicine, 87(10), 1443–1449.

  • Linacre, J. (1994). Sample size and item calibration stability. Rasch Measurement Transactions, 7(4), 328.

  • Lindsey, R. V., Shroyer, J. D., Pashler, H., & Mozer, M. C. (2014). Improving students' long-term knowledge retention through personalized review. Psychological Science, 25(3), 639–647.

  • Park, O., & Lee, J. (2003). Adaptive instructional systems. Educational Technology Research and Development, 25, 651–684.

  • Pavlik, P. I., & Anderson, J. R. (2008). Using a model to compute the optimal schedule of practice. Journal of Experimental Psychology: Applied, 14(2), 101–117.

  • Pecaric, M., Boutis, K., Beckstead, J., & Pusic, M. (2017). A big data and learning analytics approach to process-level feedback in cognitive simulations. Academic Medicine, 92(2), 175–184.

  • Price, D. W., Swanson, D. B., Irons, M. B., & Hawkins, R. E. (2018). Longitudinal assessments in continuing specialty certification and lifelong learning. Medical Teacher, 40(9), 917–919.

  • Pusic, M. V., Pecaric, M., & Boutis, K. (2011). How much practice is enough? Using learning curves to assess the deliberate practice of radiograph interpretation. Academic Medicine, 86, 731–736.

  • Pusic, M. V., Boutis, K., Hatala, R., & Cook, D. (2015). Learning curves in health professions education. Academic Medicine, 90(8), 1034–1042.

  • Pusic, M. V., Boutis, K., Pecaric, M. R., Savenkov, O., Beckstead, J. W., & Jaber, M. Y. (2017). A primer on the statistical modelling of learning curves in health professions education. Advances in Health Sciences Education, 22(3), 741–759.

  • Pusic, M. V., Boutis, K., & McGaghie, W. C. (2018). Role of scientific theory in simulation education research. Simulation in Healthcare, 13(3S), S7–S14.

  • R Core Team (2015). R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing. http://www.R-project.org/

  • Rijmen, F., Tuerlinckx, F., De Boeck, P., & Kuppens, P. (2003). A nonlinear mixed model framework for item response theory. Psychological Methods, 8(2), 185–205.

  • Robson, K., & Pevalin, D. J. (2016). Multilevel modeling in plain language. London: SAGE.

  • Singer, J. D., & Willett, J. B. (2003). Doing data analysis with the multilevel model for change. In Applied longitudinal data analysis: Modeling change and event occurrence (pp. 75–137). Oxford: Oxford University Press.

  • Thurstone, L. L. (1919). The learning curve equation. Psychological Review, 34, 278–286.

  • van der Linden, W. J. (2009). Constrained adaptive testing with shadow tests. In Elements of adaptive testing (pp. 31–55). New York: Springer.

  • Versteeg, M., Hendriks, R. A., Thomas, A., Ommering, B. W. C., & Steendijk, P. (2020). Conceptualising spaced learning in health professions education: A scoping review. Medical Education, 54(3), 205–216.

  • Wang, L., Zhang, Z., McArdle, J. J., & Salthouse, T. A. (2008). Investigating ceiling effects in longitudinal data analysis. Multivariate Behavioral Research, 43(3), 476–496.

  • Wolfe, J. M., Evans, K. K., Drew, T., Aizenman, A., & Josephs, E. (2016). How do radiologists use the human search engine? Radiation Protection Dosimetry, 169(1–4), 24–31.

  • Wood, G., Batt, J., Appelboam, A., Harris, A., & Wilson, M. R. (2013). Exploring the impact of expertise, clinical history, and visual search on electrocardiogram interpretation. Medical Decision Making, 34(1), 75–83.


Acknowledgements

This work was funded by the U.S. Department of Defense Medical Simulation and Information Sciences Research Program, Grant Number W81XWH-16-1-0797. The funding source played no role in the design and conduct of the study. The authors would like to acknowledge the support of the full grant team in developing the ECG dataset, including Drs. Julie Friedman, Joseph Bennett, David Rhee, Barry Rosenzweig and Jeffrey Lorin. The elbow radiograph dataset is provided courtesy of Drs. Kathy Boutis and Martin Pecaric; that work was funded by a Royal College of Physicians and Surgeons of Canada Medical Education Research Grant.


Corresponding author

Correspondence to Martin V. Pusic.


Appendices

Appendix 1

Variable Definition and Notation

See Table 4

Table 4 Notation table for variables, parameters and indexes

Appendix 2

Comments on Local Independence

Conditional independence is an important limiting assumption of both IRT and traditional linear models (De Boeck, 2004). Specifically, the errors in a regression model are assumed to be independent of each other. This may be more or less plausible depending on what is conditioned on in the model. For instance, once we include the covariate vector \({\mathbf{x}}\), the fixed effects for diagnosis (or item, or item order), the assumption applies only to the portion of the response not explained by these inputs. This makes the independence assumption more plausible, because responses are then allowed to be more similar to each other within groups defined by the inputs (for instance, among responses for items with the same diagnosis).

A critical advantage of multilevel modeling is the ability to explicitly account for non-independence of observations even beyond this. For instance, the random effects associated with each person p in our models allow for (a specific type of) correlation across measurements for the same person, beyond simply predicting that those measurements will have a similar average value. That is, they let us make the more flexible assumption that measures are correlated within person.

Our models do not include this type of explicit correlation within items or even within diagnoses. However, in a learning context with longitudinal data, we expect the knowledge gained on early items to improve learners' accuracy on subsequent items, especially those with the same diagnosis. For our final model in Eq. 14 ("Discussion" section), we have assumed that a learner's responses at a given time are independent of their previous responses, conditional on the learner-specific parameters (learning rate and base knowledge). In the data used for this paper, learners reviewed items in a random order, and we measured the learning trajectory adjusted for the difficulty of the items within diagnosis. The inclusion of fixed effects for diagnosis ameliorates some of the problem by allowing items with the same diagnosis to share a common mean. Additional correlation may nevertheless exist, and the time trajectory could induce further correlation.

Future models could consider other covariance structures to account for response dependence, for example by assuming that the learning rate at future times depends on the amount of learning accumulated up to that point. Another strategy would be to include random effects for diagnosis, thereby allowing us to distinguish specific item effects within the different diagnosis groups; a sketch of this contrast follows below. Our intent is not to map out all possibilities but rather to point out that the multilevel modeling context has the flexibility to address a limiting assumption of the traditional logistic model.
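
To make the second strategy concrete, the sketch below contrasts the two specifications in lme4 syntax. This is a minimal illustration under assumed variable names (a data frame responses with columns correct, log_attempt, learner, diagnosis and item, where log_attempt is one plausible learning-curve transformation of the attempt number); it is not the authors' actual code.

    library(lme4)

    # Diagnosis as fixed effects, as in our models: items sharing a
    # diagnosis share a common mean, with learner-level random
    # intercepts and slopes.
    m_fixed <- glmer(correct ~ log_attempt + diagnosis +
                       (1 + log_attempt | learner),
                     data = responses, family = binomial)

    # Diagnosis, and items nested within diagnosis, as random effects:
    # this induces within-diagnosis and within-item correlation beyond
    # the shared mean, relaxing the local independence assumption.
    m_random <- glmer(correct ~ log_attempt +
                        (1 + log_attempt | learner) +
                        (1 | diagnosis / item),
                      data = responses, family = binomial)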

Appendix 3

Model Evaluation

This appendix evaluates and compares the models used to develop the nonlinear learning trajectories – the multilevel logistic models from Eqs. 6, 8 and 14. The primary goal of this comparison is to evaluate whether adjusting for the diagnosis helps explain the learning in greater detail and provides more accurate learner-level predictions.
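
For orientation, one plausible general form for the richest of these models is sketched below; the notation is illustrative (compare Table 4) rather than a verbatim reproduction of Eq. 14:

\[ \operatorname{logit}\,\Pr(y_{pi} = 1) = (\beta_0 + b_{0p}) + (\beta_1 + b_{1p})\, f(t_{pi}) + \delta_{d(i)} \]

Here \(y_{pi}\) is learner \(p\)'s response to item \(i\), \(f(t_{pi})\) is a (possibly transformed) count of practice attempts, \(b_{0p}\) and \(b_{1p}\) are learner-level random intercepts and slopes capturing base knowledge and learning rate, and \(\delta_{d(i)}\) is a fixed effect for the diagnosis of item \(i\). Under this reading, the complete pooling model drops the random effects, and the intermediate model drops the diagnosis terms.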

We first evaluate relative model performance by comparing the Akaike and Bayesian information criteria (AIC and BIC) across the models. As we add informative predictors to our models, we expect these measures to improve (i.e., decrease). As Table 5 demonstrates, both criteria do indeed favor the learning trajectory models that incorporate diagnosis information.
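
As a concrete sketch of this comparison (again with assumed variable names rather than the authors' code), the three models can be fit and their criteria tabulated as follows:

    library(lme4)

    # Complete pooling: one learning curve for all learners,
    # no hierarchical structure on the parameters.
    m_pool <- glm(correct ~ log_attempt,
                  data = responses, family = binomial)

    # Random intercepts and slopes per learner, no diagnosis terms.
    m_ri_rs <- glmer(correct ~ log_attempt +
                       (1 + log_attempt | learner),
                     data = responses, family = binomial)

    # Random intercepts and slopes plus fixed effects for diagnosis.
    m_diag <- glmer(correct ~ log_attempt + diagnosis +
                      (1 + log_attempt | learner),
                    data = responses, family = binomial)

    # Lower AIC/BIC indicates a better fit-complexity trade-off.
    data.frame(model = c("complete pooling", "random int + slope", "plus diagnosis"),
               AIC   = c(AIC(m_pool), AIC(m_ri_rs), AIC(m_diag)),
               BIC   = c(BIC(m_pool), BIC(m_ri_rs), BIC(m_diag)))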

See Table 5

Table 5 Model diagnostics comparison

To further evaluate our models, we can perform simulations to generate new data corresponding to the assumptions implicit in each model (Gelman and Hill 2007a). For each simulated dataset we calculate relevant summary statistics of the data. We use two such statistics: the mean and the standard deviation of the total number of correct responses at the population level (across learners; see Fig. 10, Elbow dataset). We then compare the same statistics in our observed data to the distribution of statistics formed from the simulated datasets. This evaluation helps us understand whether our modeling assumptions are reasonable for the data we are trying to understand. Said another way, we can see whether it is reasonable to assume our observed data arose from a world meeting the assumptions of the model.
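
A minimal version of this simulation check (assumed names, continuing the sketches above; the authors' actual code is linked at the end of this appendix) might look as follows, using the simulate() method that lme4 provides for fitted models:

    set.seed(42)
    n_sims <- 200

    # Draw replicate response vectors from the fitted model; each
    # column of 'sims' is one simulated dataset of 0/1 responses.
    sims <- simulate(m_diag, nsim = n_sims)

    # Summary statistics: mean and SD of each learner's total correct.
    stat_by_learner <- function(y) {
      totals <- tapply(y, responses$learner, sum)
      c(mean = mean(totals), sd = sd(totals))
    }

    sim_stats <- t(apply(sims, 2, stat_by_learner))  # n_sims x 2 matrix
    obs_stats <- stat_by_learner(responses$correct)

    # If the model's assumptions are reasonable, the observed statistics
    # should fall well within the simulated distributions.
    obs_stats
    apply(sim_stats, 2, quantile, probs = c(0.025, 0.975))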

See Fig. 10.

Fig. 10 Elbow dataset. The test statistics used for this plot are the mean and standard deviation of the total number of correct responses by learner. The red line represents the true observed value in our training data; the blue line the simulated result. From left to right: the complete pooling model (no hierarchical structure on the parameters), the random intercepts and slopes without diagnosis, and the random intercepts and slopes with the diagnosis terms. Top panel: mean; bottom panel: standard deviation

Figure 10a and b show the results of these two test statistics for the three models compared in Table 5. They demonstrate that although the complete pooling approach closely reproduces the mean of the total correct responses, it underestimates the standard deviation of this summary statistic by an important margin. Note also that when we include random intercepts and slopes, the additional variance terms correct this behavior: we obtain a better fit not only to the standard deviation but to the mean as well. Both multilevel models closely reproduce the first summary statistic, and the second falls well within the simulated predictive distributions for both models. The datasets and code necessary to repeat these analyses are available at: https://github.com/IlanReinstein/ecg_paper


Cite this article

Reinstein, I., Hill, J., Cook, D.A. et al. Multi-level longitudinal learning curve regression models integrated with item difficulty metrics for deliberate practice of visual diagnosis: groundwork for adaptive learning. Adv in Health Sci Educ 26, 881–912 (2021). https://doi.org/10.1007/s10459-021-10027-0

