Reliability by Design: Human–AI Writing Assessment

Ali Mikaeili

doi:10.55016/tspy3m52

Authors

Ali Mikaeili, University of Calgary

DOI: https://doi.org/10.55016/tspy3m52

Abstract

High-stakes writing assessment is a standards-referenced judgment practice in which reliability depends on shared interpretations of quality rather than mechanical measurement. This study examines reliability and standards communication in a rubric-driven, AI-supported system for Alberta English Language Arts 30–1. Fifteen publicly released exemplar essays (2022–2024) were evaluated using verbatim rubric embedding, criterion-level evidence requirements, and a gated decision process without numerical aggregation. Reliability was assessed through alignment with authorized classifications and stability across repeated evaluations (150 decisions). The system matched official classifications in 93.3% of cases and was fully stable across repetitions, with all disagreements confined to adjacent performance boundaries. Qualitative comparison of teacher and AI commentaries indicates that the system more consistently externalized criterion-referenced warrants, enhancing transparency without displacing human authority. Findings frame reliability as an emergent property of assessment-system design and position AI as infrastructure for contestable standards communication.
