Are You Using High-Stakes Assessment? Do You Have Intra/Interrater Reliability?

By: Jean Ellen Zavertnik and Ann Holland

Is high-stakes assessment in simulation used in your program? By this we mean “an evaluation process associated with a simulation activity that has a major academic or educational consequence” (Meakim et al., 2013, p. S7).

As greater emphasis is placed on high-stakes assessment of simulation performance in nursing education, programs must ensure that assessment methods are fair and reliable (National League for Nursing [NLN], 2012). The NLN Project to Explore the Use of Simulation for High Stakes Assessment (Rizzolo, Kardong-Edgren, Oermann, & Jeffries, 2015) evaluated the process and feasibility of using manikin-based high-fidelity simulation for high-stakes assessment in pre-licensure RN programs. The study produced as many questions as answers. One such question was: What are the best methods to train raters?

This blog post reports on our unique perspectives, derived from participating in a study that tested the effectiveness of a training intervention for faculty evaluators in achieving intra- and interrater reliability of simulation performance. Interrater reliability is the extent to which raters assign the same score to the same variable (McHugh, 2012). Intrarater reliability is the extent to which a rater assigns the same score to separate observations of the same performance variables.
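
To make the two definitions concrete, here is a minimal sketch of simple percent agreement. It assumes each competency is scored 1 (demonstrated) or 0 (not demonstrated); the function name and all scores are invented for illustration and are not data from the study.

```python
def percent_agreement(scores_a, scores_b):
    """Return the share of items on which two score lists match exactly."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("score lists must be non-empty and equal in length")
    matches = sum(a == b for a, b in zip(scores_a, scores_b))
    return matches / len(scores_a)


# Hypothetical scores for one student video, one entry per competency.
rater_a_first_viewing = [1, 1, 0, 1, 0, 1, 1, 0]
rater_a_second_viewing = [1, 1, 0, 1, 1, 1, 1, 0]
rater_b_first_viewing = [1, 0, 0, 1, 0, 1, 1, 1]

# Interrater: two different raters scoring the same performance.
interrater = percent_agreement(rater_a_first_viewing, rater_b_first_viewing)

# Intrarater: the same rater scoring the same performance on two occasions.
intrarater = percent_agreement(rater_a_first_viewing, rater_a_second_viewing)

print(f"interrater agreement: {interrater:.3f}")
print(f"intrarater agreement: {intrarater:.3f}")
```

Percent agreement is easy to read but does not account for agreement that would occur by chance, which is why chance-corrected statistics such as kappa (McHugh, 2012) are often reported alongside it.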

Describing the Study (Ann Holland)
I served as principal investigator of the study titled “The Effect of Evaluator Training on Intra/Inter Rater Reliability in High-Stakes Assessment in Simulation.”

My research team received permission from the NLN to use the Creighton Competency Evaluation Instrument (CCEI) and student performance videos produced for the NLN project. We launched the study in 2015 by creating a training intervention using best practices from the nursing literature. Since we recruited participants from across the country, the evaluator training was delivered online. We created training videos and documents, conducted training webinars, and provided feedback to participants based on their scoring of practice videos.

Preliminary results of the study were presented at the 2017 NLN Education Summit, and articles are now being prepared for publication. Here is a sneak peek at some of the study highlights and lessons learned by the research team.

  1. We piloted the study procedures with five participants. With this small sample, we learned how to reach agreement about interpreting student performances and applying the CCEI competencies. Since we adopted the CCEI tool used in the NLN project, we did not modify performance criteria. Pilot participants wished they could change some criteria, yet they acknowledged that “no tool is perfect.”
  2. We used only 10 of the 28 student performance videos produced by the NLN.
    There was significant variation in the quality of these recordings. We selected videos that were consistent in cueing, in sound and visual quality, and in length, so that students had sufficient time to perform the expected skills. Consistency contributes to higher inter/intrarater reliability because it allows raters to apply all criteria on the evaluation tool in the same way.
  3. Clear communication is a foundational skill for achieving reliability in assessment. We succeeded through active listening, clear expression of ideas, and valuing and respecting different opinions.
  4. Developing a shared mental model of evaluation is an iterative process. We can identify several points along our study timeline at which we thought the shared mental model had crystallized, only to see it morph again. The addition of new evaluators provided new perspectives that prompted modifications to the shared mental model. Since time was limited, it was necessary to ask in the consensus-building process: “Is this model good enough?” and “Can everybody live with this?”
  5. A meaningful research study is only as good as its data analysis. Our pilot study helped us refine data analysis strategies to better fit our research questions and data collection tools. We computed percent correct, intraclass correlation coefficients, and kappa statistics to analyze inter/intrarater reliability.
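
Of the statistics named in the last item, kappa is the easiest to show in a few lines. The following is a minimal sketch of Cohen's kappa for two raters, computed as (observed agreement - chance agreement) / (1 - chance agreement). It is not the study's analysis code: the function name and the scores are invented for illustration, and the study may have used a different kappa variant or statistical software.

```python
from collections import Counter


def cohen_kappa(scores_a, scores_b):
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    n = len(scores_a)
    if n == 0 or n != len(scores_b):
        raise ValueError("score lists must be non-empty and equal in length")

    # Observed agreement: share of items scored identically.
    observed = sum(a == b for a, b in zip(scores_a, scores_b)) / n

    # Chance agreement: for each category, multiply the two raters'
    # marginal proportions, then sum over categories.
    counts_a = Counter(scores_a)
    counts_b = Counter(scores_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)

    if expected == 1.0:
        # Both raters used a single identical category; kappa is undefined.
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1 - expected)


# Hypothetical scores (1 = demonstrated, 0 = not demonstrated) on 10 criteria.
rater_a = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
rater_b = [1, 1, 1, 1, 1, 0, 0, 0, 0, 1]

kappa = cohen_kappa(rater_a, rater_b)
print(f"kappa = {kappa:.2f}")
```

In this made-up example the raters agree on 8 of 10 criteria (80 percent), yet kappa is noticeably lower because about half of that agreement would be expected by chance alone.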

The Participant Perspective (Jean Ellen Zavertnik)
I benefited as a participant in the study. The evaluator training gave me insight into the importance of using a quality assessment tool. I gained knowledge about evaluating high-stakes assessments and the significance of training evaluators to increase intra/interrater reliability.

Here are some takeaways from my point of view as a participant.

  1. A well-developed assessment tool is key. The criteria need to be simple, clearly stated, and well defined. Multiple elements (for example, BP, heart rate, level of consciousness, and peripheral pulses) within one criterion make it difficult to score and can lead to decreased interrater reliability.
  2. Criteria should be weighted by importance. All assessment criteria in the study had equal value. Critical elements should be determined and required for successful completion of the scenario. For example, the student must ID the patient with two identifiers to pass.
  3. Training evaluators takes time. In this study, the advanced training took about 10 hours and consisted of several practice assessments and reassessments. The more videos I reviewed, the better I became at noticing the criteria I was evaluating. I believe evaluators need to be very familiar with the tool before using it in a real situation.
  4. Gaining interrater reliability among faculty may prove to be more difficult than first thought. In this study, the raters defined a “shared mental model” by which to judge the performance. It can be difficult for raters to find that common ground. Flexibility and consensus are necessary to move the process forward.
  5. Video recording is a good idea, because it can be easy to miss something the student does or says. I did review a few of the videos more than once in order to make sure I was scoring correctly, thus improving intrarater reliability.
  6. Video quality can affect the raters’ ability to judge the performance. In many of the labs, the angle of the camera did not allow for visualization of hand hygiene upon entering the patient room. I did not feel comfortable scoring that criterion unless I could see it on the video.
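
The second takeaway, weighting criteria and requiring critical elements, can be sketched as a simple scoring rule. Everything here is hypothetical: the criterion names, the weights, and the cut score are invented for illustration and do not come from the CCEI or from the study, in which all criteria had equal value.

```python
def score_performance(observed, weights, critical, cut_score):
    """Return (weighted_score, passed) for one student performance.

    observed:  dict mapping criterion name to True/False (demonstrated or not)
    weights:   dict mapping criterion name to its weight
    critical:  set of criterion names that must be demonstrated to pass
    cut_score: minimum weighted score (0 to 1) required to pass
    """
    total = sum(weights.values())
    earned = sum(w for name, w in weights.items() if observed.get(name, False))
    weighted_score = earned / total

    # A missed critical element fails the scenario regardless of the score.
    all_critical_met = all(observed.get(name, False) for name in critical)
    passed = all_critical_met and weighted_score >= cut_score
    return weighted_score, passed


# Hypothetical tool: four criteria, one of them critical.
weights = {
    "identifies_patient_with_two_identifiers": 3,
    "performs_hand_hygiene": 2,
    "assesses_vital_signs": 1,
    "communicates_with_team": 1,
}
critical = {"identifies_patient_with_two_identifiers"}

# This student did everything except identify the patient.
observed = {
    "identifies_patient_with_two_identifiers": False,
    "performs_hand_hygiene": True,
    "assesses_vital_signs": True,
    "communicates_with_team": True,
}

score, passed = score_performance(observed, weights, critical, cut_score=0.5)
print(f"weighted score = {score:.2f}, passed = {passed}")
```

Here the student clears the hypothetical cut score but still does not pass, because the critical element was missed. That is the behavior the takeaway argues for.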

Summative evaluation of student performance through high-stakes testing can be a valuable method to assess clinical competency and progression in the program. We believe a quality assessment tool, sufficient evaluator training, and adequate video recording are key to improving intra/interrater reliability and fair appraisal of student performance.


Meakim, C., Boese, T., Decker, S., Franklin, A.E., Gloe, D., Lioce, L., . . . Borum, J.C. (2013). Standards of best practice: Simulation standard I: Terminology. Clinical Simulation in Nursing, 9(6S), S3-S11. doi:10.1016/j.ecns.2013.04.001

McHugh, M. (2012). Interrater reliability: The kappa statistic. Biochemia Medica, 22(3), 276-282.

National League for Nursing (NLN). (2012). Fair testing guidelines for nursing education.

Rizzolo, M.A., Kardong-Edgren, S., Oermann, M.H., & Jeffries, P.R. (2015). The National League for Nursing project to explore the use of simulation for high-stakes assessment: Process, outcomes, and recommendations. Nursing Education Perspectives, 36(5), 299-303. doi:10.5480/15-1639

Comments

  1. I also had the opportunity to be a participant in this study. I saw the importance of using a shared mental model utilizing the CCEI tool for evaluation. This study motivated me to create shared mental models for evaluation of the simulations we now run at our college along with standardizing the evaluation tool. Thank you for the research on this topic of interrater reliability.

  2. I too was a participant in this study. Unfortunately I was in the control group and therefore was not afforded the opportunity to partake in this added rater training. Despite this I did come to some important conclusions regarding high-stakes testing. First, the test must be recorded for repeat examination. It was amazing how easy it was to miss items on the grading rubric when the student was performing a number of tasks quickly and simultaneously while communicating with the client. Secondly, since no grading rubric is perfect for every situation, I can see where a strong shared mental model between raters is extremely valuable, for reliability and ease of scoring the rubric.
    Thank you for your research on this important topic.

