Comparing Change-Score and ANCOVA Methods of Analyzing Data

When analyzing educational data, statisticians often find associations between subjects' initial status and their growth after an experiment. These associations introduce a bias that can obscure a subject's true improvement. The two methods typically used in this sort of analysis, Change-Score and ANCOVA (also known as regressor-variable), incur different biases. Statisticians have debated for decades which method is generally better suited to the task—in other words, which method generally incurs the smaller bias.

In their 2016 article "Accounting for the Relationship Between Initial Status and Growth in Regression Models," Sean Kelly and Feifei Ye argue that the appropriate method depends on multiple factors, including measurement error and any intrinsic association between initial status and growth. They simulated hundreds of possible scenarios and provided several tables showing the bias in each case for both methods of analysis. The tables are indexed by a few statistics: Cohen's d (for the group difference in initial status), the sample size (N), any intrinsic association between initial status and growth (δ), and the reliability of the achievement scores at Time 1 and Time 2* (r). Given these values, one can look up the corresponding bias, which is considered significantly large if its absolute value exceeds 0.5.
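One way to picture how such tables translate into a lookup is to key each scenario by its four statistics. This is only a minimal sketch of the idea; the bias values below are placeholders, not numbers from Kelly and Ye's article.

```python
# Hypothetical lookup keyed by (Cohen's d, N, delta, reliability r).
# The bias values are illustrative placeholders, NOT results
# from Kelly and Ye (2016).
BIAS_TABLE = {
    # (d,   N,   delta, r): (change_score_bias, ancova_bias)
    (0.5, 100, 0.0, 0.8): (0.10, 0.55),
    (0.5, 100, 0.2, 0.8): (0.35, 0.20),
    (0.5, 100, 0.0, 0.9): (0.05, 0.30),
}

def lookup_bias(d, n, delta, r):
    """Return (Change-Score bias, ANCOVA bias) for one simulated scenario."""
    return BIAS_TABLE[(d, n, delta, r)]
```

For example, `lookup_bias(0.5, 100, 0.0, 0.8)` would return the placeholder pair `(0.10, 0.55)`, where the ANCOVA bias exceeds the 0.5 threshold.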

We have converted these tables into a program that takes the statistics above as input and returns a table of the bias for that scenario under each method. If the reliability is unknown, the program returns a table with the biases for every reliability in the list. The bias with the smaller absolute value is highlighted in green, and any bias with an absolute value greater than 0.5 is printed in red.
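The highlighting rule described above can be sketched as follows. This is an illustrative implementation, not the tool's actual code; it assumes terminal ANSI colors and lets red (a significantly large bias) take precedence over green when both rules apply.

```python
GREEN, RED, RESET = "\033[32m", "\033[31m", "\033[0m"

def format_row(change_score_bias, ancova_bias, threshold=0.5):
    """Color the smaller |bias| green and any |bias| above the
    threshold red, mirroring the page's highlighting rule."""
    biases = (change_score_bias, ancova_bias)
    smaller_idx = min(range(2), key=lambda i: abs(biases[i]))
    cells = []
    for i, bias in enumerate(biases):
        text = f"{bias:+.2f}"
        if abs(bias) > threshold:
            text = RED + text + RESET    # significantly large bias
        elif i == smaller_idx:
            text = GREEN + text + RESET  # smaller absolute bias
        cells.append(text)
    return "  ".join(cells)
```

For instance, `format_row(0.10, 0.55)` colors the Change-Score bias green and the ANCOVA bias red.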

You can read Kelly and Ye's article here.

Cohen's d:

Sample size (N):

Estimated intrinsic association between initial status and growth (δ):

Estimated reliability (r):

* This assumes a stable reliability. In practice, the reliability may well differ at each time point, but the authors did not feel this warranted simulations, as there is no rule of thumb for whether it would generally rise or fall. Fatigue might lower the second score, but in education the spring scores are often the high-stakes ones, so that rule does not apply in much education policy research. Still, the reader should be aware of this assumption.