Using Statistical and Judgmental Reviews to Identify and Interpret Translation Differential Item Functioning

  • Mark Gierl
  • W. Todd Rogers
  • Don A. Klinger


The purpose of this study was to evaluate the equivalence of two translated tests using statistical and judgmental methods. Performance differences for a large random sample of English- and French-speaking examinees were compared on a grade 6 mathematics and social studies provincial achievement test. Items displaying differential item functioning (DIF) were flagged using three popular statistical methods—ManteTHaenszel, Simultaneous Item Bias Test, and logistic regression—and the substantive meaning of these items was studied by comparing the back-translated form with the original English version. The items flagged by the three statistical procedures were relatively consistent, but not identical across the two tests. The correlation between the DIF effect size measures were also strong, but far from perfect, suggesting that two procedures should be used to screen items for translation DIF. To identify the DIF items with translation differences, the French items were back-translated into English and compared with the original English items by three reviewers. Two of seven and six of 26 DIF items in mathematics and social studies respectively were judged to be nonequivalent across language forms due to differences introduced in the translation process. There were no apparent translation differences for the remaining items, revealing the necessity for further research on the sources of translation differential item functioning. Results from this study provide researchers and practitioners with a better understanding of how three popular DIF statistical methods compare and contrast. The results also demonstrate how statistical methods inform substantive reviews intended to identify items with translation differences.