Methods for Evaluating the Clinical Competence of Residents in Internal Medicine: A Review

  1. Eric S. Holmboe, MD; and
  2. Richard E. Hawkins, MD
  1. From the Robert Wood Johnson Clinical Scholars Program, Yale University School of Medicine, New Haven, Connecticut; and Uniformed Services University of the Health Sciences, Bethesda, Maryland. Disclaimer: The opinions and views expressed herein are solely those of the authors and not those of the Department of Defense or Department of the United States Navy. Requests for Reprints: Eric S. Holmboe, MD, Division of General Medicine, National Naval Medical Center, Bethesda, MD 20814. Current Author Addresses: Dr. Holmboe: Division of General Medicine, National Naval Medical Center, Bethesda, MD 20814. Dr. Hawkins: Department of Medicine, EDP, Uniformed Services University of the Health Sciences, 4301 Jones Bridge Road, Bethesda, MD 20814-4799.

    Abstract

    This paper reviews methods commonly used to assess the clinical competence of residents in internal medicine, including the In-Training Examination, medical record audits, rating scales, clinical evaluation exercises, and the use of standardized patients. Studies were identified through a MEDLINE search (1966 to September 1997) and from the bibliographies of relevant articles and were selected for inclusion according to consensus between the authors. Whenever possible, original studies were chosen over reviews and editorials.

    No single assessment method can successfully evaluate the clinical competence of residents in internal medicine, and educators need to be cognizant of the most appropriate applications and the advantages and disadvantages of the available evaluation tools. A combination of assessment tools provides the best opportunity to evaluate and educate physicians-in-training.

    The evaluation of clinical competence is a major responsibility of medical educators, and forces within and outside of organized medicine are pushing training programs to establish and enforce standards of clinical competence [1-5]. Both the American Board of Internal Medicine (ABIM) and the Accreditation Council for Graduate Medical Education have developed specific guidelines for the evaluation of housestaff [1, 6]. Although it is considered a benchmark for certification, the ABIM certifying examination is inadequate as the sole measure of clinical competence. For this reason, the ABIM actively encourages the use of other forms of competency assessment.

    This article reviews the major instruments available to educators for the assessment of clinical competence during residency: the In-Training Examination, the ABIM Longitudinal Evaluation Form (rating scale), medical record audits, the clinical evaluation exercise (CEX), and the use of standardized patients (Table 1). Because use of these evaluation tools is voluntary and is not required for certification, we focus on the advantages and disadvantages of these tools in an attempt to help educators decide which combination best provides the information they need to assess their residents' knowledge, skills, and attitudes.

    Table 1. Relative Effectiveness of Evaluation Tools To Measure Specific Elements of Clinical Competence*

    The term competence is often used broadly to incorporate the domains of knowledge, skills, and attitudes. We should note, however, that other educators define competence more narrowly. Miller [4] divides clinical ability into a four-step pyramid: knowledge (knows), competence (knows how), performance (shows how), and action (does). Regardless of the framework chosen, clinical ability, or competence, is a multidimensional construct, and, as Miller notes [4], “no single assessment method can provide all the data required for judgment of anything so complex as the delivery of professional services by a successful physician.”

    Methods

    We did a MEDLINE search for articles published between 1966 and September 1997, and we searched bibliographies of reviews for relevant articles. We selected articles by consensus. Although this is not a structured review, we emphasized original studies rather than reviews and opinion pieces and, whenever possible, we chose articles directly relevant to internal medicine training programs.

    Medical Record Audits

    Critique of the written medical record is a time-honored approach to the evaluation of physicians-in-training [7]. One advantage of this method is the ready availability of charts. Record audits, which supply information related directly to patient care, can provide a powerful template for patient-centered feedback.

    Chart review can change the clinical practice of residents. When combined with feedback, it has been used successfully to improve the preventive care practices of residents, assess physical examination skills, promote the cost-effective use of laboratory tests, and improve record documentation [8-16]. Reviews can also be conducted longitudinally over time, providing many opportunities for evaluation. Audits can obtain a high level of reliability when explicit criteria are used [8-13]. Chart review is also an excellent method with which to involve housestaff in self-assessment, peer assessment, and quality improvement [17, 18].

    The major disadvantages of chart audit are related to the quality of the documentation. Medical records may not accurately reflect what occurred during the clinical encounter; may not include pertinent information that was, in fact, collected; and may fail to justify impressions and plans. Tugwell and Dok [7] summarized the main weakness of chart review as “the fact that records are used more as an aide-de-memoir rather than a documentation of the justification for management decisions, which continues to compromise the validity of the medical record.” Norman and colleagues [19] found that physicians often failed to completely record information obtained and procedures performed; diagnosis, patient education, and procedures were undocumented up to 50% of the time. In addition, lack of direct observation of resident-patient interactions makes a valid assessment of the “quality” of the visit difficult, particularly with regard to the accuracy of physical examination findings, patient education, interviewing skills, and judgment [20, 21].

    The nature of the criteria used for chart audit is important. Studies using implicit review (for example, subjective or qualitative impressions) have reported low reliability among reviewers [22]. Ognibene and Norman and their colleagues [15, 19] noted that many charts are needed to ensure reasonable reliability even when explicit criteria are used. This ultimately means that chart review requires a substantial time commitment by faculty. Trained ancillary personnel can perform certain reviews, but the more complex the target of review, the more likely it is that physician involvement will be necessary.
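
    Agreement among chart reviewers is often summarized with a chance-corrected statistic such as Cohen's kappa. The following is a minimal sketch, not a method taken from the studies cited above: the function name and the pass/fail judgments for ten charts are hypothetical and serve only to show how agreement between two reviewers could be quantified.

```python
from collections import Counter


def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two reviewers who judged the same set of charts."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("need two non-empty rating lists of equal length")
    n = len(ratings_a)
    # Proportion of charts on which the two reviewers agree.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Agreement expected by chance, from each reviewer's category frequencies.
    counts_a = Counter(ratings_a)
    counts_b = Counter(ratings_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both reviewers used one identical category throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical judgments by two reviewers on the same ten charts.
reviewer_1 = ["pass", "pass", "fail", "pass", "fail",
              "pass", "pass", "fail", "pass", "pass"]
reviewer_2 = ["pass", "fail", "fail", "pass", "pass",
              "pass", "pass", "fail", "pass", "fail"]

kappa = cohens_kappa(reviewer_1, reviewer_2)
print(f"kappa = {kappa:.2f}")
```

    In this made-up example the reviewers agree on 7 of 10 charts, but kappa is much lower than 0.7 because a large share of that agreement would be expected by chance alone.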

    In summary, chart audit is a useful evaluation tool that can provide important information about a trainee's clinical skills. When accompanied by specific feedback, it can lead to changes in behavior. However, the achievement of optimal results may require a substantial time commitment from faculty, and problems with the assessment of accuracy limit the value of chart audit for summative resident evaluations.

    The In-Training Examination

    The In-Training Examination, first administered in 1988, is intended to provide formative feedback to trainees on their knowledge relative to that of a national peer group at the midpoint of their training [23]. It also allows program directors to assess progress made toward acquisition of the knowledge base that will ultimately be tested by the ABIM certifying examination [24]. However, its sponsors (the American College of Physicians, the Association of Professors in Medicine, and the Association of Program Directors in Internal Medicine) emphasize that the In-Training Examination is not intended to be used in retention or promotion decisions [23].

    The examination has several advantages. It is a 1-day test that primarily measures a resident's knowledge base, and its overall reliability coefficients are consistently greater than 0.9. However, reliability coefficients for the subsections of the test range from 0.54 to 0.80; this is probably because each section has fewer questions [23].
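
    The lower reliability of shorter subsections is what the Spearman-Brown formula from classical test theory would predict. The sketch below is illustrative only: it assumes that a subsection behaves like a subset of parallel items drawn from the full test, and the length fractions are hypothetical rather than taken from the examination itself.

```python
def spearman_brown(reliability, length_factor):
    """Predicted reliability when test length is multiplied by length_factor.

    Assumes the added or removed items are parallel to the existing ones.
    """
    if length_factor <= 0:
        raise ValueError("length_factor must be positive")
    return (length_factor * reliability
            / (1.0 + (length_factor - 1.0) * reliability))


FULL_TEST_RELIABILITY = 0.9  # overall coefficient reported in the text

# Hypothetical subsection sizes, as fractions of the full test length.
for fraction in (0.5, 0.25, 0.125):
    predicted = spearman_brown(FULL_TEST_RELIABILITY, fraction)
    print(f"{fraction:>6.3f} of full length -> reliability {predicted:.2f}")
```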

    Several studies [24-27] have shown the predictive validity of the In-Training Examination for subsequent performance on the ABIM certifying examination. Grossman and colleagues [25] found that the In-Training Examination results for 109 postgraduate year 2 (PGY-2) residents explained up to 70% of the variation in scores on the ABIM certifying examination. Scores above and below the 35th percentile (relative to a national peer group) had a positive predictive value of 89% and a negative predictive value of 83% for a pass or fail, respectively, on the ABIM certifying examination [25]. Waxman and coworkers [24], in a study of eight programs and 223 PGY-2 residents, found that a percentile score greater than 70 predicted a 100% pass rate. However, sensitivity and specificity for the examination vary depending on the composition of the group studied.
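
    Predictive values of this kind come from a two-by-two table of In-Training Examination result (above or below a cutoff) against certifying examination outcome (pass or fail). The sketch below uses hypothetical counts for 100 residents; they are not the data from the studies cited, and the function and key names are our own.

```python
def predictive_values(above_pass, above_fail, below_fail, below_pass):
    """Summary measures when 'above cutoff' is read as predicting a pass.

    above_pass: scored above the cutoff and passed (true positives)
    above_fail: scored above the cutoff and failed (false positives)
    below_fail: scored below the cutoff and failed (true negatives)
    below_pass: scored below the cutoff and passed (false negatives)
    """
    return {
        "ppv": above_pass / (above_pass + above_fail),
        "npv": below_fail / (below_fail + below_pass),
        "sensitivity": above_pass / (above_pass + below_pass),
        "specificity": below_fail / (below_fail + above_fail),
    }


# Hypothetical cohort: 80 residents above the cutoff, 20 below it.
result = predictive_values(above_pass=72, above_fail=8,
                           below_fail=16, below_pass=4)
for name, value in result.items():
    print(f"{name}: {value:.2f}")
```

    Changing the mix of strong and weak residents in the hypothetical cohort changes these figures, which is one reason results from a single program should be generalized with caution.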

    The In-Training Examination also complements other aspects of resident evaluation by serving as an objective measure of aspects of a resident's knowledge that are not accurately assessed by other methods. In our own program, faculty could identify residents at risk for failing the ABIM certifying examination only about one third of the time (Sumption KF. The In-Training Examination in internal medicine. Presented at Navy ACP meeting, 4 November 1996, San Diego, California). In a related study, family practice program directors accurately predicted only 25% of their residents' scores on a licensing examination [28].

    Although the In-Training Examination has predictive power, studies show that earlier examinations, such as the National Board of Medical Examiners tests, also have predictive accuracy for the ABIM certifying examination [29, 30]. Is the information gained through the In-Training Examination worth the cost of the test (approximately $75.00 per examinee)? The Examination may measure certain important abilities, such as analytical and interpretive skills, less accurately and comprehensively than other instruments, and it may not identify deficiencies in subspecialty knowledge.

    Despite these limitations, the number of residents who take the examination continues to increase, and many residents take the examination in all 3 years of their training. In our experience, residents are increasingly aware of the examination's predictive value for the ABIM certifying examination, and more than 90% of our residents modified their schedules or study habits on the basis of the examination results (Sumption KF. The In-Training Examination in internal medicine). Therefore, the In-Training Examination is important to housestaff and is a useful tool for knowledge assessment for program directors and residents alike. It remains to be shown whether feedback from the In-Training Examination leads to better performance on the ABIM certifying examination or improves the competence of residents in any other way.

    The American Board of Internal Medicine Longitudinal Evaluation Form

    The ABIM has expended considerable effort to develop a global rating scale for measuring various facets of a resident's performance, particularly “soft end points,” such as clinical judgment, humanism, interpersonal skills, attitudes, and professionalism. Program directors generally use the forms to assess resident progress and to produce the composite evaluations required for admission to the ABIM certifying examination.

    The ABIM rating form uses a 9-point scale: Scores 1 to 3 denote unsatisfactory performance, scores 4 to 6 denote satisfactory performance, and scores 7 to 9 denote superior performance. The categories evaluated are clinical judgment, medical knowledge, history taking and physical examination skills, procedural skills, interpersonal skills, medical care, attitudes and professionalism, and overall competence. For each domain, the form provides “behaviorally anchored” descriptors at each end of the scale to guide the rater.

    The rating scale is easy to use and quick to complete. The form, which is less intrusive than other evaluation tools, provides an opportunity for a composite evaluation of a resident's knowledge, skills, and attitudes over a period of time [31]. The rating scale is an excellent template for communicating expectations and objectives or giving feedback to residents; most forms allow space for written feedback and evaluation. Rating forms for use by peers and patients have also been developed [32-35]. Evaluations by peers and patients appear to complement faculty ratings in the areas of professionalism and humanism [32-35].

    Unfortunately, rating scales as they are generally used do not effectively discriminate among the domains of clinical competence in individual residents or even between the competence levels of different trainees. The “halo effect,” under which all ratings are unduly influenced in a positive way by a single characteristic, is evident in evaluations of residents. Thompson and colleagues [36] noted that 96% of all rating scores for 85 residents fell between 6 and 9. Factor analysis shows that ratings tend to be weighted heavily on perceived knowledge and interpersonal skills [36, 37]. Standardizing the observation of the behavior of interest and defining the nomenclature for the desired expectations may help reduce errors resulting from the halo effect [38].

    Why are rating scales associated with such difficulties? Important domains, such as history taking, physical examination, and interpersonal skills, are often not directly observed by faculty. Evaluations, which are based on limited direct observation, usually reflect only skills used in case presentations and team management. Thus, the major limitation of the ABIM form is really a reflection of the skill of the raters, and major changes in the forms alone do not improve reliability or discrimination [39]. More research is needed on the training of raters in the effective use of these forms, especially given the central and vital role of the ABIM evaluation form for the assessment of residents' competence.

    Performance-Based Assessment

    We have been focusing on traditional methods of competency assessment (medical record audit and rating scales), which primarily concentrate on knowledge or the “end products” of the clinical encounter. Direct observation of trainees is necessary to evaluate the process of data acquisition and care. A trainee's ability to take a complete history; perform an accurate, thorough physical examination; communicate effectively; and demonstrate appropriate interpersonal and professional behavior can best be measured through the direct sampling of these clinical skills.

    Program directors may not feel a compelling need to directly observe the above skills and attitudes because they believe that these skills and attitudes have been sufficiently mastered in medical school. However, trainees often enter residencies with significant deficiencies in clinical skills [40-42]. Furthermore, recent work clearly shows the striking lack of proficiency in physical examination skills among residents [43, 44]. The following sections review evaluation methods that involve direct observation: the CEX and standardized-patient-based testing.

    The Clinical Evaluation Exercise

    The CEX was conceived as part of the ABIM's clinical competency program after the oral portion of the ABIM certifying examination was abandoned. It is designed to introduce the direct observation of a trainee's clinical competence and to assess integration of clinical skills. The ABIM recommends that the CEX be performed during the first 6 months of internship so that deficiencies can be detected early. In the CEX, which is used by 80% to 88% of training programs, the resident is observed while taking a comprehensive history and doing a physical examination. He or she then presents the case and discusses management plans with the faculty observer [41].

    Recently, the ABIM introduced the “mini-CEX,” a new format designed to evaluate residents in a setting that better reflects day-to-day practice [42]. In this exercise, which is designed to last approximately 20 minutes, a resident is directly observed while taking a focused history and doing a physical examination in a clinic, an emergency department, or an inpatient ward. On completion of the exercise, the observer gives the resident feedback and completes a 9-point rating scale.

    Both the CEX and the mini-CEX use direct observation of residents and give the opportunity for immediate feedback. These exercises are less expensive than the use of standardized patients because actual patients are used. The CEX also promotes one-on-one “bonding” with a faculty member, and resident satisfaction with the exercise is high [42].

    For the evaluation of competency, however, the CEX has several important problems, most notably inter-rater variability [41, 45-47]. Noel and colleagues [46] found a wide range of ratings produced by faculty from 12 teaching hospitals. Problems included frequent failure to note both excellent and poor clinical skills and substantial disagreement in global assessments of resident performance, including pass-fail determinations. Kroboth and coworkers [48] concluded that the CEX has to be done 6 to 10 times to achieve a reliability coefficient of 0.8.
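
    The relation between the number of observed encounters and overall reliability can be sketched with the Spearman-Brown formula. This is an illustration under the assumption that encounters act as parallel, independent measurements; we do not know that Kroboth and coworkers derived their estimate this way, and the function names are hypothetical.

```python
import math


def single_encounter_reliability(target, n_encounters):
    """Reliability of one encounter implied by reaching `target` with n encounters."""
    return target / (target + n_encounters * (1.0 - target))


def encounters_needed(single_reliability, target):
    """Smallest number of encounters whose pooled reliability reaches `target`."""
    n = (target * (1.0 - single_reliability)
         / (single_reliability * (1.0 - target)))
    # Subtract a tiny tolerance so floating-point noise does not round 6.0 up to 7.
    return math.ceil(n - 1e-9)


TARGET = 0.8
for n in (6, 10):
    r1 = single_encounter_reliability(TARGET, n)
    print(f"{n} encounters for {TARGET}: single-encounter reliability ~ {r1:.2f}")
    print(f"  check: encounters_needed({r1:.2f}) = {encounters_needed(r1, TARGET)}")
```

    Under these assumptions, needing 6 to 10 exercises to reach 0.8 corresponds to a single-exercise reliability of roughly 0.3 to 0.4.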

    Other difficulties with the CEX include limited reliability due to the “content specificity” of clinical skills. The quality of physician performance varies from patient to patient, and a single observation with the traditional format of the CEX is not sufficient to form an accurate impression of an examinee's clinical skills. The mini-CEX may alleviate this problem by sampling more cases with variable medical content, but more data are needed on this point [42].

    The traditional CEX requires substantial time commitments from faculty. Although the mini-CEX requires less time per encounter, 12 to 14 encounters are needed to reach a reliability coefficient of 0.8. Finally, a modest correlation has been noted with the CEX in relation to evaluation rating forms and clinical competency committee ratings [41, 49]. However, the lack of correlation between seemingly unrelated measurement tools may actually be a strength in that the CEX may measure components of competence not captured by these other two evaluation methods.

    In conclusion, the direct observation of clinical skills is important and the CEX is a valuable teaching and feedback tool. Variability in resident performance in specific content areas may compromise assessment reliability when the number of evaluation encounters is limited; here, the mini-CEX is clearly an improvement.

    Use of Standardized Patients

    Since it was first introduced by Barrows more than 30 years ago, standardized-patient-based evaluation has gradually assumed a significant role in the assessment of clinical competence [50]. A standardized patient is someone other than a physician who is trained to portray a patient in a standardized and reproducible fashion [51]. Standardized patients may use checklists or rating scales that can be used to teach or to evaluate clinical skills [52-54]. A broad range of persons have been used as standardized patients, including asymptomatic persons with normal findings on physical examination, real patients with stable physical findings, and persons who can simulate various physical findings [52, 53]. Standardized patients are most often used in the teaching and evaluation of history taking, physical examination, patient education, and counseling or in focused encounters involving combinations of these skills [55, 56].

    The incorporation of supplementary clinical material (such as radiographs, electrocardiograms, and laboratory results) into multistation standardized-patient examinations has allowed the teaching and evaluation of a broad spectrum of clinical skills. In multistation examinations, examinees perform single or related tasks (such as taking a focused history, doing a physical examination, or counseling a patient) at a series of stations, with or without ancillary data. In contrast to the more comprehensive nature of a single standardized-patient encounter, these exercises focus on distinct skills during brief encounters but sample skills more broadly overall, improving reliability [50, 51, 57, 58]. Such examinations are most often referred to as OSCEs (Objective Structured Clinical Examinations), although other descriptive acronyms have been suggested [50]. The OSCE format is widely used in the performance-based assessment of clinical competence [59]. Recent studies have described a wide variety of applications for OSCEs, including the assessment of the efficacy of an alcohol and drug curriculum; the teaching or evaluating of specific outpatient management skills; clinical breast examination skills; and the ability to address clinical ethical situations, counsel patients about risk factor modification, deliver bad news, and assess pain control in patients with cancer [60-70]. In addition, multistation standardized-patient exercises are used in “high-stakes” testing, including the Medical Council of Canada's qualifying examination for licensure and programs rating international medical graduates for Canadian and U.S. program directors [71-73].

    The psychometric qualities of standardized-patient exercises have been studied extensively. The reliability of standardized-patient-based evaluation varies from 0.41 to 0.85, and coefficients of reliability are better for pass-fail determinations than for absolute scores [53, 54]. Depending on the number of cases, the duration of encounters, and the complexity of individual cases, a reliability coefficient of 0.80 may be achieved [53, 54]. Stations of shorter duration (5 to 10 minutes), with focused instruction and evaluation of discrete skills, may provide optimal reproducibility with shorter testing times [72, 74, 75]. Standardized-patient exercises give more reliable results in the evaluation of history taking, physical examination, or communication skills than in the measurement of problem-solving or clinical reasoning skills [52, 53, 65]. Recent studies, however, suggest that this technique may adequately evaluate problem-solving skills if interstation progress notes are incorporated, postencounter questions focusing on key diagnostic findings are added, or oral presentation and problem-solving stations are included [76-78]. Further investigation is needed because older data suggest that linking written follow-up questions to standardized-patient encounters may detract from overall reproducibility [53].
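
    Reliability coefficients for multistation examinations are often reported as an internal-consistency measure such as Cronbach's alpha, although the studies cited above may have used other approaches (for example, generalizability analysis). The sketch below computes alpha from a small table of hypothetical scores, one row per examinee and one column per station; the data and names are invented for illustration.

```python
from statistics import pvariance


def cronbach_alpha(scores):
    """Cronbach's alpha for a table of scores (rows: examinees, columns: stations)."""
    n_examinees = len(scores)
    n_stations = len(scores[0])
    if n_examinees < 2 or n_stations < 2:
        raise ValueError("need at least two examinees and two stations")
    if any(len(row) != n_stations for row in scores):
        raise ValueError("every examinee needs a score for every station")
    # Variance of each station's scores across examinees.
    station_variances = [pvariance([row[j] for row in scores])
                         for j in range(n_stations)]
    # Variance of each examinee's total score.
    total_variance = pvariance([sum(row) for row in scores])
    if total_variance == 0:
        raise ValueError("total scores do not vary; alpha is undefined")
    return (n_stations / (n_stations - 1)
            * (1.0 - sum(station_variances) / total_variance))


# Hypothetical scores for five examinees on four stations.
scores = [
    [7, 6, 8, 7],
    [5, 5, 6, 4],
    [9, 8, 9, 9],
    [4, 5, 3, 5],
    [6, 7, 7, 6],
]

alpha = cronbach_alpha(scores)
print(f"alpha = {alpha:.2f}")
```

    Because alpha depends on how consistently examinees are ranked across stations, adding stations that sample similar skills tends to raise it, which is consistent with the dependence on the number of cases described above.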

    Standardized-patient exercises have face validity because the skills and behaviors taught or evaluated are intrinsic to those involved with clinical practice. Practicing physicians and residents cannot differentiate between real and standardized patients when these patients are sent into their offices unannounced [54, 79]. Examinees presumed to have different levels of competence perform as expected on standardized-patient exercises: Resident performance is better than medical student performance, senior residents score higher than junior residents, and residents from programs with stronger academic records do better than residents from programs with lesser reputations [51, 52, 54, 58]. The results of standardized-patient-based evaluations show variable correlation with written standardized examinations, faculty ratings, rotation evaluations, or clinical competency committee evaluations [51-53, 58, 61, 80, 81]. In the absence of a gold standard for measuring overall clinical competence, many experts suggest that the variable correlation noted may occur because standardized-patient exercises measure unique components of clinical competence and may be considered complementary to the other evaluation tools.

    Two important theoretical and practical advantages of standardized-patient-based teaching and evaluation are the patient-centered nature of these evaluation methods and their incorporation of direct observation of clinical skills. In addition, educators can control the clinical encounter and its educational content, allowing them to guarantee uniform presentation of material (ensuring exposure to important but infrequently encountered clinical situations) or to tailor the test to the level of the examinee. Trainees can practice skills in a non-threatening and low-risk environment; this is particularly important when patient education skills, counseling skills, or the ability to deal with certain emergency situations are being taught and evaluated. The quality of the feedback can be high; it can be provided immediately and can be directed from the patient's perspective [50].

    The principal disadvantages of standardized-patient-based evaluation are its expense and the lack of uniform access to experienced staff who can train standardized patients and develop clinical scenarios. The development of consortia for training standardized patients, conducting exercises, and sharing costs may allow more programs to participate [50]. In addition, although standardized patients can be trained to portray a wide range of clinical content and to simulate many physical findings, they remain a supplement to real patients. A high volume of patient contact is still required for an appreciation of the diversity of clinical presentations.

    In conclusion, standardized-patient-based methods allow for high-quality teaching and evaluation of the basic clinical skills of history taking, physical examination, communication, and interpersonal relating. They are less efficient in the assessment of knowledge, clinical reasoning, or judgment. However, reports describing the inclusion of techniques that focus on problem solving seem promising, and the requisite integration of clinical skills in these stations may further endorse the validity and relevance of this evaluation tool. Concerns about reliability are important when standardized-patient-based testing is used for “high-stakes” summative testing but are less critical when it is used primarily for formative assessment and teaching.

    Conclusions

    The evaluation of clinical competence is a daunting, complex task, and no single evaluation tool can adequately assess a resident's knowledge, skills, and attitudes. Successful completion of a certification examination is not an adequate measure of the overall clinical competence of physicians-in-training. To ensure an effective assessment of competence, a multifaceted approach is needed. Table 1 summarizes the strengths and weaknesses of each evaluation tool, but use of all of these tools can be time-consuming and expensive. The “right” combination depends mainly on each program's goals, needs, and resources. A competency program should include tools that successfully measure knowledge, skills, and attitudes and should incorporate direct observation.

    If effectively used, the In-Training Examination (knowledge), rating scales (skill and attitude), and the CEX (skills) can serve as the core of a successful assessment program. However, the reliability and accuracy of faculty evaluation with these measures are consistently shown to be suboptimal, and programs may need to use such techniques as standardized-patient encounters (to better evaluate skills and attitudes) or medical record audit (to examine clinical care practices more effectively). We clearly need to find ways to equip faculty better for their role as evaluators as we struggle to find the optimal strategy with which to assess clinical competence. Regardless, all programs should choose the combination of tools that best meets the needs of their trainees.

    References

    1. 1.
    2. 2.
    3. 3.
    4. 4.
    5. 5.
    6. 6.
    7. 7.
    8. 8.
    9. 9.
    10. 10.
    11. 11.
    12. 12.
    13. 13.
    14. 14.
    15. 15.
    16. 16.
    17. 17.
    18. 18.
    19. 19.
    20. 20.
    21. 21.
    22. 22.
    23. 23.
    24. 24.
    25. 25.
    26. 26.
    27. 27.
    28. 28.
    29. 29.
    30. 30.
    31. 31.
    32. 32.
    33. 33.
    34. 34.
    35. 35.
    36. 36.
    37. 37.
    38. 38.
    39. 39.
    40. 40.
    41. 41.
    42. 42.
    43. 43.
    44. 44.
    45. 45.
    46. 46.
    47. 47.
    48. 48.
    49. 49.
    50. 50.
    51. 51.
    52. 52.
    53. 53.
    54. 54.
    55. 55.
    56. 56.
    57. 57.
    58. 58.
    59. 59.
    60. 60.
    61. 61.
    62. 62.
    63. 63.
    64. 64.
    65. 65.
    66. 66.
    67. 67.
    68. 68.
    69. 69.
    70. 70.
    71. 71.
    72. 72.
    73. 73.
    74. 74.
    75. 75.
    76. 76.
    77. 77.
    78. 78.
    79. 79.
    80. 80.
    81. 81.