K. Gwet's Inter-Rater Reliability Blog: 2015. Inter-rater reliability: Cohen kappa, Gwet AC1/AC2, Krippendorff Alpha

On August 9, 2015, I received an email from a researcher at the University of Manchester about the standard error associated with Krippendorff's alpha coefficient. He was asking why my software AgreeStat for Excel produces a standard error for Krippendorff's alpha that is always higher than the one produced by Dr. Hayes' SAS and SPSS macros, both called KALPHA. On this issue, I would like to make two comments:

1) AgreeStat uses a variance expression given in equation (7) of the document entitled "On Krippendorff's Alpha," while the KALPHA macro is based on the bootstrap standard error. However, the use of these two approaches cannot and should not explain the observed difference in standard error estimations.

2) I do not recommend using Dr. Hayes’ macro programs for computing the standard error of Krippendorff’s alpha. They always underestimate (often by a wide margin) the magnitude of the standard error associated with Krippendorff’s alpha. Here is why I believe so. In his paper “Answering the Call for a Standard Reliability Measure for Coding Data,” released in 2007 and co-authored by Krippendorff himself, Dr. Hayes says the following regarding the algorithm he has used:

The bootstrap sampling distribution of alpha is generated by taking a random sample of 239 pairs of judgments from the available pairs, weighted by how many observers judged a given unit. Alpha is computed in this “resample” of 239 pairs, and this process is repeated very many times, producing the bootstrap sampling distribution of Alpha.

This bootstrapping algorithm is terrible, and does not reflect in any way the bootstrap method previously introduced by Efron & Tibshirani (1998). Instead of resampling the rows of Table 1 of their article before re-computing the alpha coefficient (this is what Efron & Tibshirani recommend), Dr. Hayes generates several sets of 239 pairs of judgments and computes alpha for each of them. The number 239 came from the original table, and it should be allowed to change from one bootstrap sample to the next. By keeping the exact same structure as the original sample, with the exact same number of missing judgments, you can only obtain a constrained (and therefore smaller) variance. What Dr. Hayes should have done is simply generate several random samples with replacement from the set {1,2,3,4, ..., 40}. A with-replacement random sample will have duplicates, which is OK. The next step would be to extract from Table 1 only the rows whose numbers were selected in the with-replacement sample (a row selected several times appears several times), and use them to form the bootstrap sample. This bootstrap sample would then be used to compute the alpha coefficient.
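To make the resampling scheme concrete, here is a minimal Python sketch of a unit-level bootstrap. It is only an illustration under stated assumptions, not the code used by AgreeStat or by KALPHA. The function names `nominal_alpha` and `bootstrap_se` and the small ratings table are hypothetical, the alpha computed is the nominal-scale version, and missing judgments are coded as `None`.

```python
import math
import random
import statistics
from collections import Counter


def nominal_alpha(units):
    """Krippendorff's alpha for nominal data.

    `units` is a list of rows, one per unit; each row holds one rating
    per observer, with None marking a missing judgment.
    """
    totals = Counter()   # category counts over all pairable judgments
    disagree = 0.0       # sum of off-diagonal coincidences
    for ratings in units:
        vals = [r for r in ratings if r is not None]
        m = len(vals)
        if m < 2:        # a unit with fewer than 2 judgments is not pairable
            continue
        counts = Counter(vals)
        totals.update(counts)
        # ordered disagreeing pairs in this unit, weighted by 1 / (m - 1)
        disagree += (m * m - sum(c * c for c in counts.values())) / (m - 1)
    n = sum(totals.values())
    expected = n * n - sum(c * c for c in totals.values())
    if n < 2 or expected == 0:
        return math.nan  # alpha is undefined when only one category appears
    return 1.0 - (n - 1) * disagree / expected


def bootstrap_se(units, n_boot=1000, seed=0):
    """Bootstrap standard error of alpha, resampling whole units (rows)."""
    rng = random.Random(seed)
    n_units = len(units)
    estimates = []
    for _ in range(n_boot):
        # with-replacement sample of row numbers; duplicates are expected
        rows = [rng.randrange(n_units) for _ in range(n_units)]
        alpha = nominal_alpha([units[i] for i in rows])
        if not math.isnan(alpha):   # skip degenerate resamples
            estimates.append(alpha)
    return statistics.stdev(estimates)


if __name__ == "__main__":
    # Hypothetical ratings table: 8 units, 3 observers, some missing judgments.
    table = [
        [1, 1, None], [2, 2, 2], [1, 2, 1], [3, 3, None],
        [2, None, 2], [1, 1, 1], [3, 2, 3], [None, 3, 3],
    ]
    print("alpha:", nominal_alpha(table))
    print("bootstrap SE:", bootstrap_se(table, n_boot=2000, seed=1))
```

Because whole rows are resampled, the number of pairable judgments and the number of missing judgments both change from one bootstrap sample to the next, which is exactly the variation that resampling a fixed number of pairs suppresses.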


Saturday, August 22, 2015

Standard Error of Krippendorff's Alpha Coefficient