THE RELIABILITY OF ABFM EXAMINATIONS: IMPLICATIONS FOR TEST-TAKERS

Kenneth D. Royal; James C. Puffer

doi:10.1370/afm.1303

A common theme among family physicians who have repeatedly performed poorly on the ABFM Maintenance of Certification (MC-FP) Examination is the complaint that they received a score that was identical, or almost identical, to their score on a previous administration of the exam. From their perspective, it is a mystery why they received the same score (or a very similar one) despite additional study time and preparation. Often, physicians assume a mix-up has occurred and ask whether results from their previous attempt have been provided in error. After a psychometric review, it is clear that there is no mistake at all. In fact, we anticipate that many test-takers will receive a comparable score on future attempts at the exam. We base this expectation on the psychometric concept of reliability.

Overview of Reliability

The notion of reliability is perhaps one of the oldest, yet most misunderstood, notions in the measurement and assessment arena. Commonly, researchers of all experience levels assert that their instruments are reliable. The truth is that there is no such thing as a reliable instrument. Only the scores produced from an assessment have the property of reliability. The reliability of those scores depends on the characteristics of the test, the test administration, and the group of examinees. It is the interaction among these 3 elements that determines the reliability of results for any test.

With regard to the 3 major elements, let us briefly discuss each. Test characteristics typically include test length, item type, and item quality. Generally speaking, longer tests produce more reliable scores than shorter tests. With regard to item type, objective items such as multiple-choice items typically produce more reliable scores than subjective items such as essays. Item quality is also important as poor quality items tend to reduce reliability. Also, good quality items should sufficiently vary in difficulty so that they effectively discriminate among examinees. Discrimination is useful in that it helps identify which examinees possess the knowledge necessary to correctly answer an item. Those who possess the most knowledge will have the greatest probability of answering difficult items correctly. Over the course of a lengthy examination, distinctions between examinees become clearer, and we are better able to determine how much knowledge an examinee possesses.
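The relationship between test length and reliability described above can be illustrated with the Spearman-Brown formula from classical test theory. This is a minimal sketch, not a description of how the ABFM evaluates its examinations: the starting reliability of .70 and the length factors are hypothetical values chosen for illustration, and the formula assumes that any added items are comparable in quality to the existing ones.

```python
def spearman_brown(reliability: float, length_factor: float) -> float:
    """Predict reliability after multiplying test length by length_factor.

    Classical Spearman-Brown formula; assumes added (or removed) items
    are comparable in quality to the items already on the test.
    """
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must be between 0 and 1")
    if length_factor <= 0:
        raise ValueError("length_factor must be positive")
    return (length_factor * reliability) / (1.0 + (length_factor - 1.0) * reliability)


# Hypothetical starting point: a short test whose scores have reliability .70.
base_reliability = 0.70
for factor in (0.5, 1, 2, 4):
    predicted = spearman_brown(base_reliability, factor)
    print(f"length x{factor}: predicted reliability = {predicted:.3f}")
```

Under these assumptions, doubling or quadrupling the test raises the predicted reliability, while halving it lowers the estimate, which is the pattern the paragraph describes.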

Conditions of administration are also important. Conditions include physical conditions (eg, temperature levels, noise, etc, in the testing room), exam instructions, and time limits. Our testing vendor goes to great lengths to ensure these factors remain as constant as possible across multiple administrations of our examination. Variation in these conditions could affect some examinees differently, resulting in scores that vary for reasons other than an examinee knowing more or less about the content. The ABFM acknowledges that disruptions such as excessive noise or other distractions can introduce additional error into one’s score, thus potentially invalidating results. We have policies in place to rectify situations when this occurs. However, other administration factors such as instructions and time limits are imposed equally upon everyone, unless a disability is documented in which case extra time and possibly other accommodations may be permitted.

Finally, the characteristics of the group of examinees are also important. As mentioned previously, a good test should contain a considerable number of items with varying degrees of difficulty. But what happens when a good test is attempted by a very homogeneous sample, say, all high-achievers with similar levels of knowledge? Although the test may be psychometrically sound, the sample of examinees varies so little that scores cannot be reliably differentiated. When this happens, low reliability estimates are produced and many researchers quickly dismiss the instrument (or assessment) as being of poor quality. It is for this reason that reliability estimates are not the measure of exam quality, but rather a measure of exam quality. In order for a test to produce reliable scores, the ability of examinees must also sufficiently vary. When there is a great range of ability in a group, reliable distinctions between what an examinee knows and does not know can be made.
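The effect of a homogeneous group can be seen in a small simulation. The sketch below is illustrative only and rests on several assumptions that do not come from the ABFM: responses are generated from a simple logistic model, the same 100 hypothetical items are given to two groups of 500 simulated examinees, and reliability is estimated with KR-20, one common internal-consistency coefficient. The function names and all parameter values are made up for this example.

```python
import math
import random
import statistics


def kr20(responses):
    """KR-20 internal-consistency estimate for rows of 0/1 item scores."""
    n_people = len(responses)
    n_items = len(responses[0])
    totals = [sum(row) for row in responses]
    total_variance = statistics.pvariance(totals)
    # Sum of item variances p * (1 - p), where p is the proportion correct.
    item_variance_sum = 0.0
    for i in range(n_items):
        p = sum(row[i] for row in responses) / n_people
        item_variance_sum += p * (1.0 - p)
    return (n_items / (n_items - 1)) * (1.0 - item_variance_sum / total_variance)


def simulate(ability_sd, n_people=500, n_items=100, seed=0):
    """Simulate 0/1 responses; ability_sd controls how much examinees differ."""
    rng = random.Random(seed)
    # Hypothetical item difficulties spread evenly from easy (-2) to hard (+2).
    difficulties = [-2.0 + 4.0 * i / (n_items - 1) for i in range(n_items)]
    responses = []
    for _ in range(n_people):
        ability = rng.gauss(0.0, ability_sd)
        row = []
        for b in difficulties:
            p_correct = 1.0 / (1.0 + math.exp(-(ability - b)))
            row.append(1 if rng.random() < p_correct else 0)
        responses.append(row)
    return responses


wide = kr20(simulate(ability_sd=1.0))    # examinees differ substantially
narrow = kr20(simulate(ability_sd=0.2))  # examinees are nearly alike
print(f"KR-20, wide range of ability:   {wide:.2f}")
print(f"KR-20, narrow range of ability: {narrow:.2f}")
```

The items are identical in both runs, so any drop in the estimate for the second group reflects the examinees' similarity rather than the quality of the test.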

Empirical Example and Interpretation

Although no strict guidelines for minimum levels of reliability exist, many measurement experts tend to agree with Nunnally and Bernstein’s recommendations.¹ That is, the minimum reliability necessary for a group of test scores is .90 if important decisions are going to be made based on those scores. Estimates between .80 and .89 indicate reasonably reliable scores. The 2009 ABFM MC-FP examination had a reliability estimate of .94, which is considered a very high estimate of internal consistency. It indicates that an estimated 94% of the observed variance in scores is due to systematic differences in examinee performance, with the remaining 6% due to chance differences. Another way to interpret this estimate is to subtract the observed reliability (.94) from perfect reliability (1.0). The difference, in this case .06 (or 6%), is the proportion of observed variance that is due to measurement error.
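The arithmetic above can be written out directly, and it can be extended with the standard error of measurement from classical test theory to suggest why repeat scores tend to land close together. This is a sketch under stated assumptions: the reliability of .94 comes from the text, but the score standard deviation of 100 is a hypothetical value used only for illustration and is not an ABFM figure.

```python
import math


def error_variance_share(reliability: float) -> float:
    """Proportion of observed score variance attributed to measurement error."""
    return 1.0 - reliability


def standard_error_of_measurement(score_sd: float, reliability: float) -> float:
    """Classical test theory SEM: score_sd * sqrt(1 - reliability)."""
    return score_sd * math.sqrt(1.0 - reliability)


reliability = 0.94             # reported for the 2009 MC-FP examination
hypothetical_score_sd = 100.0  # assumption for illustration, not an ABFM value

share = error_variance_share(reliability)
sem = standard_error_of_measurement(hypothetical_score_sd, reliability)
print(f"Share of variance due to error: {share:.2f}")
print(f"SEM for a score SD of {hypothetical_score_sd:.0f}: {sem:.1f} points")
```

With a high reliability the standard error is small relative to the spread of scores, so an examinee whose knowledge has not changed should expect a similar score on a repeat attempt.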

Implications for High-Stakes Testing

In many ways, high estimates of reliability echo for test-takers the old adage, “If you always do what you’ve always done, you’ll always get what you’ve always gotten.” For an examinee who has a history of scoring very high on the exam, this notion will typically work in the examinee’s favor. However, it should be made abundantly clear that this is not a guarantee. On the other hand, test-takers who have previously failed an examination may find this news disconcerting. This is not to say that they are incapable of improving. With a significantly improved approach to exam preparation, most examinees who have failed previously are capable of making the types of gains necessary to pass this examination. It all begins with asking the right question and preparing an effective study plan.

Examinees should not ask themselves “what do I have to do to reach the minimum score necessary for passing?” but rather “how can I become a more knowledgeable physician?” Physicians whose goal is simply to pass the test have misguided intentions and, possibly, a misguided preparation strategy. One’s goal should not be to pass the exam, but rather to become a better family physician. With an increased fund of medical knowledge, the chances of passing the examination will improve naturally as a result of actual learning. However, if one’s goal is simply to receive a passing score, then the examinee will likely find himself or herself in the position of trying to anticipate examination items and otherwise resorting to methods similar to “cramming.” Spending exorbitant amounts of time and energy attempting to memorize content solely for the purpose of regurgitating it later, or working on test-taking skills such as identifying distracters, does not work well on a high-stakes, criterion-referenced examination such as ours, which measures one’s fund of medical knowledge.

As we have demonstrated previously, simply being a good test-taker is not likely to significantly improve one’s chances of passing a high-stakes certification exam.² Also, the scoring methods used for our exams work in such a way that one’s ability is estimated based on correct/incorrect responses to items of varying degrees of difficulty. When both person ability and item difficulty are mapped onto a single continuum, it becomes clear from a psychometric perspective what an examinee knows and what he or she does not.³ Therefore, only when a physician has taken an improved approach to exam preparation, particularly one that focuses on increasing one’s fund of medical knowledge, can one seriously expect to advance along that continuum of ability.
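The idea of placing person ability and item difficulty on a single continuum can be sketched with the dichotomous Rasch model, the model family discussed in the cited Linacre reference. The text does not spell out the exact scoring model, so treating it as this particular form is an assumption, and the ability and difficulty values below are hypothetical.

```python
import math


def rasch_probability(ability: float, difficulty: float) -> float:
    """Probability of a correct response under the dichotomous Rasch model.

    Ability and difficulty are expressed on the same (logit) scale, so
    only their difference matters.
    """
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


# Hypothetical examinees and items, all on one shared scale.
examinees = {"lower ability": -1.0, "higher ability": 1.0}
items = {"easy item": -1.5, "medium item": 0.0, "hard item": 1.5}

for person, ability in examinees.items():
    for item, difficulty in items.items():
        p = rasch_probability(ability, difficulty)
        print(f"{person} on {item}: P(correct) = {p:.2f}")
```

In this sketch the only way to raise the probability of success on every item at once is to move the examinee's position along the ability scale, which matches the paragraph's point about increasing one's fund of knowledge.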

Conclusion

It is important to emphasize clearly and directly that an examinee of marginal ability, or someone with a history of previous failures, is likely to continue to fail the MC-FP exam if he or she continues with the same preparation approach or otherwise uses study methods that do not promote actual and sustained learning. Improving test-taking skills will be of minimal benefit to a test-taker, as high-stakes examinations are not a measure of one’s test-taking skills. The MC-FP examination is constructed in such a way that the influence of test-taking skills is negligible. Examinees should understand that the only legitimate way to improve their performance on the MC-FP Examination is to increase their fund of medical knowledge and their decision-making ability in clinical scenarios; that is what the exam measures. When examinees make real gains in these areas, they are most likely to receive higher scores. It should be noted that the ABFM provides important information about its exams on its Web site, intended to help the family physician understand both the type and amount of content one might expect to see, as well as tips for developing a study plan.⁴ Using this information can assist with improving performance on our examinations.

References

1. Nunnally JC, Bernstein IH. Psychometric Theory. 3rd ed. New York, NY: McGraw-Hill; 1994.
2. O’Neill TR, Royal KD, Puffer JC. Performance on the American Board of Family Medicine (ABFM) certification examination: are superior test-taking skills alone sufficient to pass? J Am Board Fam Med. 2011;24(2):175–180.
3. Linacre JM. KR-20 or Rasch reliability: which tells the “truth”? Rasch Measurement Transactions. 1997;11(3):580–581. http://www.rasch.org/rmt/rmt113l.htm.
4. ABFM. Examination Descriptions. 2010. https://www.theabfm.org/cert/exams.aspx. Accessed Nov 11, 2010.