Reliability by Design: Human–AI Writing Assessment
DOI:
https://doi.org/10.55016/tspy3m52
Abstract
High-stakes writing assessment is a standards-referenced judgment practice in which reliability depends on shared interpretations of quality rather than mechanical measurement. This study examines reliability and standards communication in a rubric-driven, AI-supported system for Alberta English Language Arts 30–1. Fifteen publicly released exemplar essays (2022–2024) were evaluated using verbatim rubric embedding, criterion-level evidence requirements, and a gated decision process without numerical aggregation. Reliability was assessed through alignment with authorized classifications and stability across repeated evaluations (150 decisions). The system matched official classifications in 93.3% of cases and was fully stable across repetitions, with all disagreements confined to adjacent performance boundaries. Qualitative comparison of teacher and AI commentaries indicates that the system more consistently externalized criterion-referenced warrants, enhancing transparency without displacing human authority. Findings frame reliability as an emergent property of assessment-system design and position AI as infrastructure for contestable standards communication.