Competency-based simulation assessment of resuscitation skills in emergency medicine postgraduate trainees - a Canadian multi-centred study.

BACKGROUND
The use of high-fidelity simulation is emerging as a desirable method for competency-based assessment in postgraduate medical education. We aimed to demonstrate the feasibility and validity of a multi-centre simulation-based Objective Structured Clinical Examination (OSCE) of resuscitation competence with Canadian Emergency Medicine (EM) trainees.


METHOD
EM postgraduate trainees (n=98) from five Canadian academic centres participated in a high fidelity, 3-station simulation-based OSCE. Expert panels of three emergency physicians evaluated trainee performances at each centre using the Queen's Simulation Assessment Tool (QSAT). Intraclass correlation coefficients were used to measure the inter-rater reliability, and analysis of variance was used to measure the discriminatory validity of each scenario. A fully crossed generalizability study was also conducted for each examination centre.


RESULTS
Inter-rater reliability in four of the five centres was strong with a median absolute intraclass correlation coefficient (ICC) across centres and scenarios of 0.89 [0.65-0.97]. Discriminatory validity was also strong (p < 0.001 for scenarios 1 and 3; p < 0.05 for scenario 2). Generalizability studies found significant variations at two of the study centres.


CONCLUSIONS
This study demonstrates the successful pilot administration of a multi-centre, 3-station simulation-based OSCE for the assessment of resuscitation competence in post-graduate Emergency Medicine trainees.


Introduction
The assessment of resuscitation skills in postgraduate medical trainees is moving towards competency-based methods, and away from knowledge-based examination. 1,2 As noted by Miller and colleagues, 3 there is a progression from "knows" to "knows how," "shows how" and "does" through medical training. Thus, it is necessary to assess not only technical knowledge, but also practical competencies such as clinical reasoning and teamwork. This move towards appropriate competency-based assessment in postgraduate education has been endorsed by the Accreditation Council for Graduate Medical Education (ACGME, USA) and the Royal College of Physicians and Surgeons of Canada (RCPSC), as well as at the 2010 Ottawa Conference "Assessment of Competence in Medicine and the Healthcare Professions." [4][5][6] High fidelity simulations using computer-controlled mannequins with mechanized movements, cues, and simulated vital signs are being used throughout medical education to emulate real patient encounters. 7,8 This simulation-based medical education gives trainees opportunities to practice, while simultaneously allowing them to develop their teamwork and communication skills. 9,10 A recent meta-analysis found this method of training to be superior to opportunistic exposures in clinical medical education in achieving specific clinical skill acquisition goals. 11 As a result, there has been a dramatic increase in the use of high-fidelity simulation in Emergency Medicine (EM) and Anesthesia. 6 In fact, the ACGME has required that simulation be directly incorporated into postgraduate EM curricula, not as a separate adjunct but as the primary education strategy for topics that have been deemed best taught in a simulation format. 12 Despite the recent development and integration of high-fidelity mannequin-based simulation in postgraduate medical education, there have been limited advancements in integrating simulation within competency-based assessment systems. 13
An examination of the literature for simulation-based assessment within EM and across the other specialties reveals a deficit of easily modifiable and useful assessment tools for dynamic resuscitation skill performance. Many studies have been published within Anesthesia, Pediatrics, and EM that demonstrate excellent discriminatory ability and inter-rater reliability in the assessment of specific ACLS, [14][15][16][17] CRM, 18,19 or team competency based skills. [20][21][22] However, these assessments lack the appropriate metrics for widespread use by postgraduate licensing bodies and training programs. 8 This lack of metrics is further supported by a recent systematic review of technology-enhanced simulation in the assessment of health professionals, which states that evidence for the validity of previous studies is sparse with "room for improvement." 23,24 Although many studies have repeatedly illustrated strong assessment tool performance for measuring defined outcomes, they have often been limited in scope, used simple checklists or a specified algorithmic approach, and have failed to satisfy the "unified model" for validity. 24,25 As a result, there remains a great need for valid and reliable simulation-based assessment tools for assessing competency-based resuscitation skills of medical trainees.
Valid and reliable simulation-based competency assessment of resuscitation skills would be beneficial to postgraduate licensing bodies and training programs both in EM and in other specialties. 8 In order for simulation-based activities to be used for assessment of resuscitation skills, appropriate metrics must be constructed. In Pediatrics and Anesthesia, 26

Study design
A prospective observational design was employed to study a multi-centre, simulation-based OSCE for assessment of resuscitation skills in EM postgraduate trainees. A previously developed and validated simulation-based resuscitation assessment tool was used to assess the performances of all trainees.
Assessment system: The QSAT was designed to be simple and modifiable in assessing specific and generalized resuscitation parameters. 32 It is unique in its basic framework in that it can be modified for specific clinical scenarios. The QSAT uses a standardized format with two components: 1) four domain scores (initial assessment, diagnostic approach, therapeutic approach, and communication skills), and 2) a single global assessment score (GAS). All domain scores and the GAS are based on a 5-point Likert rating scale (1=inferior to 5=superior) with descriptors for each numerical score. Each domain score also contains anchored skills to assist in scoring (sample assessment available from authors upon request).

Study setting
The study took place over an 18-month period at five academic university simulation centres across Canada: 1) Queen's University (Kingston Resuscitation Institute Lab), 2) University of Toronto (Sunnybrook campus), 3) University of Ottawa (Ottawa Civic Campus), 4) University of Calgary (Foothills Hospital Campus), and 5) Dalhousie University (Queen Elizabeth II Campus, Halifax). Multiple models of high-fidelity simulation mannequins were employed: Gaumard Hal® and Susie® (Gaumard Scientific, Miami, FL), and Laerdal SimJunior® (Laerdal Medical Canada, Ltd., Toronto, ON). All simulations were run by a simulation technician and a member of the research team (AKH, DD). Physiologic parameters (e.g. vital signs, eye opening, breath sounds) were adjusted using a predetermined set of palettes. The progression of palettes followed the therapeutic actions of the trainees during the OSCE scenario in a standardized fashion. The simulation lab at each centre was set up on each occasion to re-create the physical environment of an emergency department resuscitation bay, and all necessary equipment or tools were available to the trainees.

Scenarios and assessment tool development
Three previously developed and validated standardized emergency department resuscitation scenarios, each with a corresponding QSAT, were used. 32 All scenarios were based on core content and objectives for EM postgraduate programs, as outlined by the RCPSC. An expert panel of EM faculty, with training in high fidelity simulation-based instruction, developed the scenarios. The chosen scenarios were: 1) Acute Congestive Heart Failure with Respiratory Distress, 2) Subarachnoid Hemorrhage with Decreased Level of Consciousness, and 3) Sympathomimetic Stimulant Ingestion. The scenario duration was 7 minutes for each station. Each scenario included scripted roles and clear instructions for the trained actors and the simulation technician.

OSCE administration and evaluation
The 3-station, simulation-based OSCE was administered at each of the five academic centres. Each trainee completed the 3 standardized scenarios over a total of 30 minutes. Each trainee was provided with verbal instructions for the OSCE prior to their participation and given one minute to read a written scenario stem immediately prior to performing each scenario. The trainees' performances were recorded from 3 fixed camera angles to allow adequate views of the trainee, the mannequin, and the cardiac monitors (Appendix 1). Members of the research team observed and directed each session from an obscured location. Each candidate received a formal debriefing by an associated faculty member immediately following their performance for the purposes of formative feedback. For each centre, the videotaped performances were stored on a secure laptop computer and subsequently rated by three independent, blinded content experts. All raters were EM faculty with 5 years or more experience and specific training in simulation-based education. To minimize bias, the raters were not faculty physicians at the same centre as the trainees they were assessing, and the raters were not given any information about the trainees' identities or level of training. Each centre had a different group of 3 raters; the same 3 raters scored all of the trainees at a centre. In order to standardize evaluations, all raters underwent a 3-hour training session, led by an investigator, on the use of the assessment tools. They practiced using the assessment tool with standardized recorded performances designed specifically for orientation.

Statistical analysis
Data analyses were performed using IBM SPSS version 22.0. The initial analyses focused on the inter-rater reliability in order to establish that the raters could provide consistent scores for the residents across centres. Intraclass correlation coefficients (ICCs) are a preferred method to determine the average rate of absolute agreement when there are two or more raters scoring the same trainees. 34 Given that distinct groups of three raters assessed residents from different locations, separate two-way random ICCs were calculated to determine the average level of absolute rater agreement across the three raters at each centre for each scenario. The set of ICCs from the centres and items was compared to determine if the raters were able to use the QSAT consistently.
To examine discriminatory validity, trainees were grouped by level of training. These groups were compared using three one-way (level of training) ANOVAs, using the three scenarios as dependent variables. For each scenario, the null hypothesis was that trainees' scores would not differ based on their level of training. Discriminatory evidence was provided if trainees with higher levels of training obtained significantly higher scores on the 3 scenarios. An omnibus F-test was conducted to determine if any group differences existed. Given that there were three scenarios, a Bonferroni correction was used to reduce the likelihood of a Type I error. Thus the null hypothesis was rejected if p < 0.017 (0.05/3). If a significant group difference was found, post hoc analyses of the main effects were conducted using Tukey's HSD method (p < 0.017).
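To illustrate the reliability analysis described above, the following is a minimal sketch of a two-way random-effects, absolute-agreement ICC computed from the two-way ANOVA mean squares. It is not the SPSS procedure used in the study: the function name and the example ratings are hypothetical, and because the text does not state whether single-measure or average-measure coefficients were reported, the sketch returns both.

```python
def icc_two_way_random_absolute(scores):
    """Two-way random, absolute-agreement ICCs.

    scores[i][j] is the rating given to trainee i by rater j; every rater
    must score every trainee. Returns (single_measure, average_measure).
    """
    n, k = len(scores), len(scores[0])
    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]                       # per trainee
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]  # per rater

    # Sums of squares for trainees (rows), raters (columns), and residual.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((v - grand) ** 2 for row in scores for v in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    # Absolute agreement: systematic rater differences count as error.
    single = (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )
    average = (ms_rows - ms_error) / (ms_rows + (ms_cols - ms_error) / n)
    return single, average


# Hypothetical global assessment scores: 5 trainees rated by 3 raters.
example_scores = [
    [4, 4, 5],
    [2, 3, 2],
    [3, 3, 4],
    [5, 5, 5],
    [1, 2, 2],
]
single_icc, average_icc = icc_two_way_random_absolute(example_scores)
print(f"single-measure ICC: {single_icc:.2f}, average-measure ICC: {average_icc:.2f}")
```

In a design like the one described, the function would be applied separately to each centre and scenario, since each group of three raters scored only the trainees at one centre.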
Finally, a Generalizability study (G-study) was conducted for trainees' scores at each centre (Trainee X Rater X Scenario) to determine the variance components and G-coefficients. G-Studies are useful for initial test design research as they can be used to identify the sources of score variation and then to help determine the optimal number of stations needed for reliable estimates of trainee performance.
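As an illustration of how variance components can be estimated in a fully crossed Trainee X Rater X Scenario design, the sketch below uses the standard expected-mean-squares approach for a random-effects design with one score per cell. It is an assumption-laden sketch, not the authors' analysis: the function name `g_study` and the data layout are hypothetical, and negative estimates are set to zero, which is one common convention among several.

```python
from itertools import product


def g_study(x):
    """Estimate variance components for a fully crossed T x R x S design.

    x[t][r][s] is the score of trainee t from rater r on scenario s, with
    exactly one score per cell and at least two levels of every facet.
    """
    nt, nr, ns = len(x), len(x[0]), len(x[0][0])
    cells = list(product(range(nt), range(nr), range(ns)))
    grand = sum(x[t][r][s] for t, r, s in cells) / len(cells)

    def means(keep):
        # Marginal means keyed by the tuple of facet indices that are kept.
        sums, counts = {}, {}
        for idx in cells:
            key = tuple(idx[i] for i in keep)
            sums[key] = sums.get(key, 0.0) + x[idx[0]][idx[1]][idx[2]]
            counts[key] = counts.get(key, 0) + 1
        return {key: sums[key] / counts[key] for key in sums}

    m_t, m_r, m_s = means((0,)), means((1,)), means((2,))
    m_tr, m_ts, m_rs = means((0, 1)), means((0, 2)), means((1, 2))

    def main_ss(m, weight):
        return weight * sum((v - grand) ** 2 for v in m.values())

    def inter_ss(m_ab, m_a, m_b, weight):
        return weight * sum(
            (v - m_a[(a,)] - m_b[(b,)] + grand) ** 2 for (a, b), v in m_ab.items()
        )

    ss = {
        "t": main_ss(m_t, nr * ns),
        "r": main_ss(m_r, nt * ns),
        "s": main_ss(m_s, nt * nr),
        "tr": inter_ss(m_tr, m_t, m_r, ns),
        "ts": inter_ss(m_ts, m_t, m_s, nr),
        "rs": inter_ss(m_rs, m_r, m_s, nt),
    }
    total = sum((x[t][r][s] - grand) ** 2 for t, r, s in cells)
    ss["trs"] = total - sum(ss.values())  # residual (t x r x s confounded with error)

    df = {
        "t": nt - 1, "r": nr - 1, "s": ns - 1,
        "tr": (nt - 1) * (nr - 1), "ts": (nt - 1) * (ns - 1),
        "rs": (nr - 1) * (ns - 1),
        "trs": (nt - 1) * (nr - 1) * (ns - 1),
    }
    ms = {k: ss[k] / df[k] for k in ss}

    # Solve the expected-mean-square equations for each component.
    raw = {
        "trs": ms["trs"],
        "tr": (ms["tr"] - ms["trs"]) / ns,
        "ts": (ms["ts"] - ms["trs"]) / nr,
        "rs": (ms["rs"] - ms["trs"]) / nt,
        "t": (ms["t"] - ms["tr"] - ms["ts"] + ms["trs"]) / (nr * ns),
        "r": (ms["r"] - ms["tr"] - ms["rs"] + ms["trs"]) / (nt * ns),
        "s": (ms["s"] - ms["ts"] - ms["rs"] + ms["trs"]) / (nt * nr),
    }
    # Negative estimates are truncated to zero (a convention, assumed here).
    return {k: max(v, 0.0) for k, v in raw.items()}
```

Dividing each component by the sum of all components gives its relative contribution to score variance; a large "ts" share would indicate that trainee rankings change from scenario to scenario.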
G-studies identify variance components for main and interaction effects. For these analyses, trainees were the object of measurement and raters and scenarios served as the potential sources of error. The calculated variance components were subsequently used to conduct a series of Decision studies (D-studies) to determine the generalizability that could be obtained using different combinations of raters and scenarios, maximizing efficiency and accuracy of assessments.

Results
[...] (Table 1). The intraclass correlation coefficients (ICCs), which provide a measure of the inter-rater reliability of the QSAT ratings, are reported for each scenario in the corresponding table.
[Table note: a = PGY-1,2 average scores are significantly lower than PGY-3,4,5 scores (p<0.001).]
Finally, a Generalizability study (G-study) was conducted using a fully crossed trainee by rater by scenario (T x R x S) design. Since each group of raters only marked the trainees from a single centre, separate analyses were conducted for each of the 5 centres. The estimated variance components, the relative contributions to score variance, and the G-coefficients are provided in Table 4. With one exception, the largest source of variance was the trainee by scenario interaction. This means that trainees' performances varied from scenario to scenario. This finding is not surprising as the trainee by scenario interaction in G-theory is essentially highlighting content specificity, which has been previously documented in the medical education literature. 36 D-studies were conducted to evaluate the effectiveness of alternative designs with differing numbers of facets for each of the administered examinations (Table 4). Coefficients had significant variability based on centre, with Queen's, Ottawa, and Dalhousie having the most similar estimates.
The D-studies suggest that increasing the number of scenarios per OSCE to between 6 and 9 with only a single rater per station would produce G-coefficients ranging from 0.81 to 0.91. For instance, with 6 scenarios and 1 rater, the range of G-coefficients would be 0.83-0.91. As noted previously, the subsamples from specific centres resulted in differences in the centre level estimates. In contrast, the Toronto and Calgary centres demonstrated minimal improvements in G-coefficients (0.31-0.76) when increasing the number of scenarios regardless of the number of raters.
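The D-study projections above can be sketched as a simple calculation: each error variance component is divided by the number of raters and scenarios it is averaged over. The sketch below assumes the relative (norm-referenced) G-coefficient, which the text does not confirm, and the variance components are hypothetical values chosen for illustration; they are not taken from Table 4.

```python
def g_coefficient(var, n_raters, n_scenarios):
    """Projected relative G-coefficient for a trainee x rater x scenario design.

    var holds variance components keyed "t" (trainee), "tr", "ts", and
    "trs" (residual). Rater and scenario main effects do not affect the
    relative ranking of trainees, so they are not used here.
    """
    relative_error = (
        var["tr"] / n_raters
        + var["ts"] / n_scenarios
        + var["trs"] / (n_raters * n_scenarios)
    )
    return var["t"] / (var["t"] + relative_error)


# Hypothetical variance components, for illustration only.
example = {"t": 0.40, "tr": 0.02, "ts": 0.35, "trs": 0.15}

for n_scenarios in (3, 6, 9):
    for n_raters in (1, 3):
        g = g_coefficient(example, n_raters, n_scenarios)
        print(f"scenarios={n_scenarios} raters={n_raters} G={g:.2f}")
```

When the trainee by scenario component dominates the error, adding scenarios raises the projected coefficient far more than adding raters, which is why a design with more stations and a single rater per station can be efficient. When the trainee variance itself is small, as in a homogeneous sample, the coefficient stays low regardless of the design.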

Discussion
Simulation-based education has become an integral part of postgraduate training of resuscitation skills for EM, Anesthesiology, Internal Medicine, Surgery, and Critical Care. 7 Understandably, there has been a broad call for the development and implementation of tools to assess competency of postgraduate medical trainees. [4][5][6] Within EM, competency-based assessment tools that demonstrate validity and reliability sufficient to be used for learning progress, readiness for practice, and high stakes decision making have been recommended by numerous accreditation bodies. 5,28,29,37 This pilot study examined the performance of a previously validated simulation-based assessment tool (the QSAT) in a 3-station resuscitation OSCE at 5 postgraduate EM training programs across Canada. 32 Despite the challenges of standardizing the environmental testing conditions for all trainees, recruiting trainees for participation, and orienting new expert examiners at each of the five separate centres, the QSAT performed well, demonstrating its promise for high stakes or summative assessment contexts.
The inter-rater reliability (ICCs) of the QSAT was high to very high (0.84-0.97) in all but one centre (Toronto). The lower values at the Toronto centre were likely due to the raters of the Toronto trainees noticing various aspects of poor performance, resulting from a restricted, homogeneous sample. 38 Overall, the set of findings demonstrates that our study methodology was successful in training experts to be reliable raters.
Importantly, the QSAT was able to differentiate between levels of trainees, highlighting its ability to discriminate between likely levels of competence. Residents with higher levels of training (Sr-FRCP) consistently demonstrated higher scores compared to those with lower levels of training (Jr-FRCP). This means the three stations were able to differentiate between varying levels of trainee competence and could be used in future high stakes summative exams to assess thresholds of performance. It is important to note that a third group of trainees was also examined: the CCFP-EM senior resident group. This group was specifically separated out from the Jr-FRCP and Sr-FRCP groups for analysis because of their different certification model. The CCFP-EM group, with two years of Family Medicine training followed by a single year of EM, is a very heterogeneous group with high variability in their expected performances for resuscitation scenarios. There were no significant differences between CCFP-EM trainees when compared to either the Jr-FRCP or Sr-FRCP groups. Admittedly, the smaller CCFP-EM sample likely impacted the analyses. Our future studies will revisit this CCFP-EM trainee group while continuing to examine the different levels within the FRCP.
Lastly, the generalizability of the QSAT was examined, with the intention of providing guidance for the subsequent use of QSAT in actual testing conditions. In our study, the largest sources of variance were the trainees, followed by the trainee x scenario component at four of the five centres. 39 Across all five centres, there was minimal variance for the rater and scenario components individually.
These findings indicate that more than three cases would be required to increase the generalizability to a sufficient level for high-stakes assessments of trainees. The D-study results found that the ideal number of scenarios to achieve a G-coefficient greater than 0.8 with only 1 rater would be greater than 6, which is consistent with existing literature. 40 This combination would likely be the best methodology to pursue when considering the feasibility, resource allocation, and statistical acceptability when designing future simulation OSCEs of resuscitation competence. Admittedly, only three of the five centres demonstrated this finding. Nevertheless, our analyses suggest the Toronto and Calgary centres contained very homogeneous groups of trainees. Recruitment at each of these centres was complicated by a non-uniform distribution of participating trainees' abilities.
Our methodology and results were not without limitations. Although this study did not reproduce ICC and generalizability (G) values as strong as the previous single centre QSAT study, 32 this was not surprising as there were many variables that were less easily controlled for (e.g., different simulation lab environments at test centres, variable trainee experience in simulation education and assessment, and non-standardized resident recruitment). As well, trialing more than 3 OSCE stations at 5 separate centres was not feasible due to limitations of time and resources at the various centres. As a result, the 3-station OSCE was not as generalizable as was hoped, and future studies would require a more comprehensive approach with six or more stations to decrease the trainee by scenario variance component. Nevertheless, our study provides the first multi-centre exploration of assessment of EM postgraduate trainees' resuscitation skills in a dynamic, high fidelity simulation environment.
Currently, the Anaesthesia Training Program in Israel employs a high stakes simulation-based OSCE examination for its post-graduate trainees. With its centralized approach and testing centre, every trainee undergoes competency-based testing in a simulation environment to fulfill licensing requirements. Creating a similar system of assessment is consistent with the desired direction of the ACGME (USA), the FMEC (Canada), and the Ottawa conference proceedings (International). [4][5][6] Looking forward, the next step will be to develop a competency-based summative assessment resuscitation OSCE with a centralized single centre model with multi-centre recruitment, with six or more scenarios, and one expert examiner per station. Such a study could focus on senior level residents alone and work towards an appropriate benchmarking strategy for competency thresholds for high stakes pass/fail summative examination. As well, a single centre model would help standardize trainee recruitment, OSCE station execution, and training of expert raters.

Conclusion
In this study, we conducted a 3-station simulation-based OSCE at 5 academic EM training programs across Canada. We assessed the performance of EM postgraduate trainee resuscitation skills using the QSAT, a simple and modifiable competency-based assessment tool that has been validated previously at Queen's University. Our study demonstrates a framework by which competency-based assessment can be performed in a simulation setting with a tool that shows promise in its discriminatory capabilities, inter-rater reliabilities, and generalizability. Future research should be directed at developing a single-centre, multi-station OSCE format with multi-centre recruitment that further tests the use of this assessment system.