Reliability by Design: Human–AI Writing Assessment

Authors

  • Ali Mikaeili, University of Calgary

DOI:

https://doi.org/10.55016/tspy3m52

Abstract

High-stakes writing assessment is a standards-referenced judgment practice in which reliability depends on shared interpretations of quality rather than mechanical measurement. This study examines reliability and standards communication in a rubric-driven, AI-supported system for Alberta English Language Arts 30–1. Fifteen publicly released exemplar essays (2022–2024) were evaluated using verbatim rubric embedding, criterion-level evidence requirements, and a gated decision process without numerical aggregation. Reliability was assessed through alignment with authorized classifications and stability across repeated evaluations (150 decisions). The system matched official classifications in 93.3% of cases and was fully stable across repetitions, with all disagreements confined to adjacent performance boundaries. Qualitative comparison of teacher and AI commentaries indicates that the system more consistently externalized criterion-referenced warrants, enhancing transparency without displacing human authority. Findings frame reliability as an emergent property of assessment-system design and position AI as infrastructure for contestable standards communication.
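
The two reliability checks named in the abstract (alignment with authorized classifications, and stability across repeated evaluations) can be illustrated with a minimal sketch. Everything below is an assumption for illustration, not the study's implementation: the level labels, the function names `modal_decision` and `reliability_summary`, the choice of the modal run as an essay's decision, and the three-essay toy data, which is not the study's data. As a point of arithmetic, if "cases" refers to the fifteen essays, 93.3% corresponds to 14 of 15.

```python
from collections import Counter

# Assumed ordinal scale, lowest to highest (illustrative labels only).
LEVELS = ["Poor", "Limited", "Satisfactory", "Proficient", "Excellent"]
RANK = {name: i for i, name in enumerate(LEVELS)}


def modal_decision(runs):
    """Return the most frequent classification across repeated runs."""
    return Counter(runs).most_common(1)[0][0]


def reliability_summary(official, repeated):
    """Summarize agreement and stability.

    official: {essay_id: level} authorized classifications.
    repeated: {essay_id: [level, ...]} system decisions per repetition.
    """
    n = len(official)
    exact = 0          # essays whose modal decision equals the official level
    stable = 0         # essays classified identically on every repetition
    non_adjacent = []  # disagreements more than one level from the official one
    for essay_id, target in official.items():
        runs = repeated[essay_id]
        if len(set(runs)) == 1:
            stable += 1
        gap = abs(RANK[modal_decision(runs)] - RANK[target])
        if gap == 0:
            exact += 1
        elif gap > 1:
            non_adjacent.append(essay_id)
    return {
        "exact_agreement": exact / n,
        "stability": stable / n,
        "non_adjacent": non_adjacent,
    }


# Toy data (hypothetical): three essays, two repetitions each.
official = {"A": "Excellent", "B": "Proficient", "C": "Satisfactory"}
repeated = {
    "A": ["Excellent", "Excellent"],
    "B": ["Proficient", "Proficient"],
    "C": ["Proficient", "Proficient"],  # stable, but one level above official
}
summary = reliability_summary(official, repeated)
```

In this toy case essay C is classified the same way on every run yet sits one level above its official classification, so stability is complete while exact agreement is not, and the single disagreement is adjacent. That is the same pattern the abstract reports, at a much smaller scale.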

Published

2026-05-19