(In)Stability of Test Scores


  • Stefan Merchant Queen's University
  • Jessica Rich Queen's University
  • Don Klinger Waikato University


large-scale testing, G-theory, educational policy, test reliability


Both school and district administrators use the results of standardized, large-scale tests to inform decisions about the need for, or success of, educational programs and interventions. However, test results at the school level are subject to random fluctuations due to changes in cohort, test items, and other factors outside of the school’s control. This study examined year-to-year changes in school-level results on standardized tests delivered in Ontario, Canada. Generalizability theory (G-theory) analyses found that test scores are not stable enough to support meaningful conclusions based on year-to-year changes in school-level results. For small and medium-sized schools, multiple years of data need to be collected before defensible decisions can be made about trends in test scores. The authors introduce a ‘bounce’ statistic that provides a simple, easy-to-interpret measure of test score stability.

Author Biography

Stefan Merchant, Queen's University

Ph.D. Candidate | Faculty of Education | Queen's University

