Assessment of emergency medicine residents: a systematic review.

BACKGROUND
Competency-based medical education is becoming the new standard for residency programs, including Emergency Medicine (EM). To inform programmatic restructuring, guide resource allocation, and identify publication gaps, we reviewed the published literature on the types and frequency of resident assessment.


METHODS
We searched MEDLINE, EMBASE, PsycINFO, and ERIC from January 2005 through June 2014. MeSH terms included "assessment," "residency," and "emergency medicine." We included studies of EM residents reporting either of two primary outcomes: 1) assessment type and 2) assessment frequency per resident. Two reviewers screened abstracts, reviewed full-text studies, and abstracted data. Reporting of assessment-related costs was a secondary outcome.


RESULTS
The search returned 879 articles; 137 were reviewed in full text; 73 met inclusion criteria. Over half of the studies (54.8%) were pilot projects and one-quarter (26.0%) described fully implemented assessment tools/programs. Assessment tools (n=111) fell into 12 categories, most commonly simulation-based assessments (28.8%), written exams (28.8%), and direct observation (26.0%). Median assessment frequency (n=39 studies) was twice per month/rotation (range: daily to once in residency). No studies thoroughly reported costs.


CONCLUSION
EM resident assessment commonly uses simulation or direct observation, performed once per rotation. Fully implemented assessment systems and assessment-associated costs are poorly reported. Moving forward, routine publication will facilitate the transition to competency-based medical education.


Background
This movement harkens back to the mid-20th century, when educational systems were being redesigned to ensure pre-specified, discrete learner outcomes.5 Since the 1980s, a revival of this movement has given rise to various bodies and initiatives within medical education, such as the General Medical Council (GMC) guidance in the United Kingdom. The current shift in educational systems towards emphasizing learner-oriented outcomes, such as competencies in various skills, has created a need for more robust (valid and reliable) tools and systems to assess learners.8 An assessment tool is a single structured scale, form, rubric, or exam used to measure performance, knowledge, skills, or abilities, whereas assessment programs and systems involve a formalized, multi-faceted approach used to evaluate and offer feedback to learners. Further, there is increasing interest in measuring clinical performance in the workplace, and in ensuring that a learner is able to achieve the "Does" level at the peak of Miller's Pyramid (which outlines a learner's progression from "Knows" at the base, through "Knows How" and "Shows," to "Does" at the peak).9

Importance
Within emergency medicine (EM) training, learners must develop a wide range of skills and competencies outlined by CanMEDS and the ACGME.10,11 Since the introduction of CanMEDS 2005,4 available assessment tools relevant to EM in the Western world have been described in recent consensus reports and summaries.12-14 Still, the actual prevalence of the use of these tools has not been reported in the literature.
The growing emphasis on competency assessment in medical training increases the need for resources devoted to assessment.14,15 Assessment tools vary in cost: contrast, for example, the resources required to create, administer, and mark a pen-and-paper MCQ exam with the costs of training, personnel, simulation mannequins, equipment, and software programs required for a simulation-based assessment.16 The cost and true value of a tool are determined in the context of outcomes (using, for example, cost-effectiveness, cost-benefit, or cost-feasibility analyses).17-19 In medical education, however, cost is infrequently measured. Determining the overall impact and value of an assessment strategy adopted for competency assessment demands measuring not only outcomes but also the associated resources or costs.
Measuring the clinical competence of EM residents will require educators to understand the breadth of existing assessment tools and systems in order to identify next steps in transitioning to CBME (including implementation of existing tools or systems and development of new ones). Literature on costs associated with assessment tools/systems, or effectiveness analyses, will be useful in guiding planning for (re)allocation of resources to implement competency-based education. To date, there has been no detailed description of how frequently different types of assessment systems are being used in Western training programs, nor has there been a review of cost reporting associated with assessment tools or systems.

Goals of this investigation
To quantify what assessment systems are in use and to summarize the regularity of their use, we systematically searched the published literature to determine 1) the type and availability of published assessment tools or systems and 2) their frequency of use in emergency medicine resident assessment. As a secondary outcome, we summarized information on the cost of these assessments.

Study design
This study is a systematic review of published literature. It does not require research ethics board approval. Our study was conducted according to an a priori protocol agreed upon by all authors, and reporting follows PRISMA guidelines.

Methods and measurements
The literature search was developed in collaboration with a research librarian and included EMBASE, MEDLINE, ERIC, and PsycINFO, the databases most likely to retrieve our articles of interest, as well as abstracts from relevant EM and medical education conferences. We searched for MeSH terms such as "resident," "assessment," and "evaluation," then used published filters to limit our search to EM.21-23

The search was limited to studies in the English language, published January 2005 through June 2014 (i.e., the period following the release of the CanMEDS 2004 competencies). A sample search strategy is included in Appendix A.
Two authors (ICG, TMC) independently reviewed titles and abstracts for suitability, and then further reviewed full-text studies for inclusion. Inclusion criteria required full-text studies or abstracts of Emergency Medicine trainees (residents) in North America, Europe, Australia, or New Zealand, and a report of at least one of the primary outcomes of interest. We excluded studies of undergraduate medical students or fellows only, non-EM residency programs, and studies published before 2005. As our objective was to review assessment programs and tools that were actually used (rather than to list the available types, which has been done elsewhere),12,14,24-27 we excluded review/summary articles and consensus reports. Data abstraction followed our pre-specified protocol and included demographics of the study population, teaching centre, assessment tools, scope of the program, frequency of assessment, and associated costs/resources.

Definitions of validity have changed greatly over the past century.28 More recent definitions of validity center on the interpretations or actions that result from a tool, as well as the appropriateness of the tool for a particular context, and have moved away from treating validity as an inherent property of the tool itself. The Messick framework describes six aspects of construct validity: structural; content; substantive; external; generalizability; and consequential (the intentional or unintentional social impact of the score as a basis for action or change). Although the Messick criteria are not structured as a hierarchy of validity, the more criteria a tool demonstrates, the stronger the argument for the global construct validity of that tool, and the more meaningful it becomes. To characterize the strength of validity evidence for the tools found in our review, we defined a tool as demonstrating "good" construct validity if it had been tested on at least two different aspects of Messick's validity framework. Since the goal of this study was to quantify the prevalence of various assessment tools and programs reported in the literature, we did not evaluate each publication for its methodological quality, as this would have had little or no bearing on our study results and their interpretation.

Outcomes
The two primary outcomes of this study are: 1) the types of assessment tools used, including assessment programs; and 2) the number of assessments per resident over whatever timeframe a study reported. A secondary outcome is the presence of any report of cost for a described assessment system.

Analysis
Findings were tabulated and summarized using descriptive statistics calculated in Microsoft Excel (2011). Where possible, medians and interquartile ranges (IQR) are presented. Multicentre trials were counted as individual centres when calculating program duration and number of participants. Frequency-of-assessment calculations assumed one-month rotations; "one-off" or pilot studies were not included in overall frequency-of-assessment calculations. We used a post-hoc sensitivity analysis to test the impact of our assumption of one-month rotations by assuming a three-month rotation (i.e., when extrapolating the annual assessment frequency for a tool reported per rotation, we multiplied the number of assessments by four, rather than 12, to test our assumption). Given the descriptive nature of our study design, we did not conduct comparative analyses.
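The extrapolation and sensitivity analysis described above can be sketched as follows. This is an illustrative example only: the per-rotation counts are made-up values, not the review's abstracted data, and the function name is hypothetical.

```python
from statistics import median

# Hypothetical per-rotation assessment counts abstracted from studies
# (illustrative values only; not the review's actual data).
per_rotation_counts = [2, 2, 5, 1, 3]

def annualize(counts, rotations_per_year):
    """Extrapolate per-rotation assessment counts to an annual
    per-resident frequency, given an assumed rotation length."""
    return [c * rotations_per_year for c in counts]

# Primary assumption: one-month rotations -> 12 rotations per year.
annual_1mo = annualize(per_rotation_counts, 12)
# Sensitivity analysis: three-month rotations -> 4 rotations per year.
annual_3mo = annualize(per_rotation_counts, 4)

print(median(annual_1mo))  # 24 with these illustrative counts
print(median(annual_3mo))  # 8
```

Comparing the two medians shows how sensitive the overall annual frequency is to the assumed rotation length, which is exactly what the post-hoc analysis tests.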

Cost reporting
Estimation and analysis of costs in the medical education literature is notoriously challenging.30 One useful approach is the "ingredients" method, in which the resources required for an intervention are itemized and costed. Dividing ingredients into a number of different categories may facilitate identifying the key components of cost. Most attention should be paid to the ingredients that make up most of the costs (such as equipment, resources, and personnel, including faculty or staff physicians). A list of ingredients relevant to determining costs related to assessment tools is outlined in Box 1.18,19,31 For the purposes of this study, our cost analysis involved abstracting reports of resources (i.e., costs) required for an assessment tool/system; however, since our goal was to identify the presence (and not the quantification) of resource/cost reporting, we did not conduct further analyses.

Results
The literature search returned 879 articles after removal of duplicates (Figure 1). We excluded 742 articles based on screening of title and abstract. Of the remaining 137 articles that went to full-text review, 64 were subsequently excluded, most commonly for lacking an outcome of interest (n=21), for study type (i.e., papers that were summary or consensus reports; n=17), or for lacking our population of interest (n=13). Other reasons are detailed in Figure 1. In total, 73 reports met our inclusion criteria: 40 full-text articles and 33 abstracts.

Study demographics
Over 80% of reports originated from the United States and a smaller proportion (14%) were from Canada (Table 1). There were three multicentre studies.33 We used the Messick framework of construct validity (comprising the six criteria described in the Methods section) to evaluate the strength of validity evidence of the reported assessment tools. The median number of Messick validity criteria reported per tool was one (IQR: 1-2). Thirty-five reports (47.9%) fulfilled two or more of Messick's criteria, suggesting that roughly half of assessment tools had attempted to demonstrate multiple forms of validity evidence (Table 1). Detailed information on the demographics of each included study is available online (eSuppl 1).

Assessment tools
Studies described a variety of assessment tools that differed in scope (Table 1). One program was scaled back in some aspects and expanded in others.

Frequency of assessment
There were 39 studies reporting information on how often residents received any form of assessment (eSuppl 2). The frequency of assessment ranged from daily to once during residency. The most common frequencies reported were twice per month/rotation (n=6), once annually (n=4), and three times ever (n=3). Daily (n=2), bi-weekly (n=3), and weekly (n=3) feedback within a month/rotation were also reported.
The reported assessment frequency per resident per tool is summarized in Figure 2. The median number of assessments was stratified by the time period over which the assessment tool was used: within the entire residency program (median: 4 [IQR: 1.75-4], n=6); per annum (1.5 [1-24], n=8); and per month/rotation (2.5 [2-5.4], n=16). Assuming the reported assessment frequency continued throughout residency, the overall median number of assessments per resident was 24 annually, i.e., twice monthly (median: 24 [IQR: 1.1-48], n=30). In pilot studies of assessment tools, the median number of assessments was one (IQR: 1-2, n=9). As a sensitivity analysis to test our assumption of one-month rotations, we calculated a separate frequency for studies reporting assessments "per rotation" (n=13), using a three-month assumption for rotation duration. With this assumption, the median annual assessment frequency was 12 (IQR: 8-32) among the studies reporting "per rotation" assessments. Using this same three-month rotation assumption, the overall median number of assessments was 20 (IQR: 8-48.5), a change of 16% from the previous model assuming one-month rotations (median 24 assessments per annum).
The most frequently used assessment tools were: daily encounter cards;93,99 direct observation;86 and 360-degree/multisource feedback.48 Of the studies reporting higher assessment frequency, only one48 was a fully implemented program. Lower frequency of assessment was associated with being a pilot or "one-off" study. Tools used for more infrequent assessments (four or fewer per year) include: written exams;

Cost reporting
We recorded the presence of cost reporting for a given assessment tool or system within our 73 studies. Though no article presented the exact cost of its assessment tool or curriculum, two provided estimates. Over half of the reports in our review describe pilot (i.e., "one-off") studies. Clearly, there is an abundance of literature describing, testing, and validating novel assessment tools; what is missing, however, is follow-up from such studies on higher levels of outcome, including the learner level (e.g., achievement on standardized exams, advancement or promotion within a residency program, or graduating sooner), the patient level (e.g., improved satisfaction with care, time waiting to be seen), or the system level (e.g., readmission rates, productivity, medical errors, near-misses, etc.).109 As we continue to adopt CBME and its educational approach, innovation will be key to building capacity in sound competency assessment.
Studies in this review largely omitted cost reporting. Estimates of costs related to assessment tools were provided by only two studies. Determining the cost(s) associated with an assessment tool is paramount to its sustainability; without secured resources (including funding), an assessment system will be difficult to sustain. Medical education researchers should be strongly encouraged to determine the value of an intervention, beyond an instrument's correlation with other learning tools or whether learners and/or faculty enjoy it. The move towards CBME is already in progress and, by determining costs, administrators and directors can anticipate how they must (re)allocate resources to support this approach to learner assessment.
Cost analyses of medical education programs are notoriously difficult; competency assessment systems are no exception.30 There is insufficient precedent, experience and, to a certain degree, interest among medical educators in conducting cost analyses of proposed assessment tools.19,30 In the context of resource analyses for learner assessment tools, the "ingredients" method, which compiles a list of the resources required, is useful for tabulating total cost. Common categories that have emerged in the literature and are relevant to learner assessment tools or systems are summarized in Box 1. Further, we suggest cost be reported in three ways: 1) ingredients; 2) total cost; and 3) per-learner cost. Should a large upfront investment be required (for example, the purchase of new equipment for simulation training), reporting this "initial investment cost" will provide context for interpreting the three aforementioned costs.
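The three-way cost reporting suggested above can be sketched in a few lines. All figures here are invented for illustration; the ingredient categories loosely follow those named in the text (equipment, personnel), not the actual contents of Box 1.

```python
# Illustrative "ingredients" costing for a hypothetical assessment tool.
# All dollar values and category names are made up for this sketch.
ingredients = {
    "faculty assessor time": 4000.0,
    "simulation equipment (amortized)": 1500.0,
    "administrative support": 800.0,
}
initial_investment = 20000.0  # e.g., one-time simulator purchase
n_learners = 24               # residents assessed per year

# 1) ingredients (itemized above) -> 2) total cost -> 3) per-learner cost
total_cost = sum(ingredients.values())
per_learner_cost = total_cost / n_learners

print(f"Total annual cost: ${total_cost:.2f}")
print(f"Per-learner cost:  ${per_learner_cost:.2f}")
```

Reporting the itemized ingredients alongside the total and per-learner figures lets other programs substitute their own local prices and class sizes, which is the main appeal of the ingredients method.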

Limitations
We may not have captured the full breadth of information available on this topic, for two main reasons. First, as with all systematic reviews, it is possible our search did not capture the full extent of indexed literature. However, we did capture a large number of abstracts, which suggests a broad search. As well, given our interest in English-language studies published after 2004, the vast majority of published studies would likely be indexed and captured. Secondly, and most importantly, peer-reviewed publication of assessment methods and systems is not done systematically across all residency programs. This limitation was anticipated a priori, and our study intentionally highlights the paucity of publications on assessment tools and systems. Capturing unpublished information on resident assessment, such as through a survey of program directors or a review of program websites, was a delimitation of our study and out of the scope of our systematic review, but would be valuable to pursue in future studies. Another limitation is our inability to assess costs. We did abstract cost metrics; however, these are challenging to approximate or report. For example, "ingredients" such as hours spent by faculty members, running a computer system, or hospital supplies are difficult to quantify but are key to implementing and establishing a CBME system. Lastly, we made a reasonable assumption that a rotation was one month, which allowed us to calculate an overall median frequency of assessment. If the average duration of rotations is longer than one month, then our assumption overestimates assessment frequency. Despite this bias toward the "best case scenario" of one-month rotations, assessment still occurred rather infrequently. Our sensitivity analysis, which checked the one-month assumption by assuming three-month rotations, showed minimal change in the overall frequency of annual assessment (24 vs 20 assessments annually).

Lessons learned
As medical educators develop and validate methods of learner assessment, their research should be held to the same standards as any other area of rigorous scientific inquiry; this necessitates peer-reviewed publication and distribution of knowledge and experiences, as well as related costs. Through this, we can develop assessment methods that are feasible, resource-effective and, hence, sustainable. CBME presents a great opportunity to galvanize our nation's community of medical educators. We hope that by pointing out the deficits in the present literature, we can encourage our community to share their innovations and contribute to the field as a whole. Key take-home points for medical educators are summarized in Box 2.

Box 2: Key findings and next steps for resident assessment
Key findings:

Figure 2 :
Figure 2: Median number of assessments of residents, by the time interval reported for each tool

Table 1 : Characteristics of included papers
*65 publications comprising 74 programs
**55 publications comprising 59 programs
***Number of construct validity levels demonstrated for each assessment tool (based on the 6 criteria of the Messick framework for global construct validity)
Messick levels of validity evidence coding: 0 = none met (and no alternative paradigm used); 1 = structural validity; 2 = content validity; 3 = substantive validity; 4 = external validity; 5 = generalizability validity; 6 = consequential validity
Abbreviations: ACGME = Accreditation Council for Graduate Medical Education; CORD-EM = Council of Emergency Medicine Residency Directors; EM = Emergency Medicine; ITER = In-Training Evaluation Report; Mini-CEX = Mini-Clinical Evaluation Exercise; OSAT = Objective Structured Assessment of Technical Skills; OSCE = Objective Structured Clinical Exam; PGY = Post-Graduate Year (i.e., residency year); SDOT = Standardized Direct Observation Tool