1) AgreeStat uses a variance expression given in equation (7) of the document entitled "On Krippendorff's Alpha," while the KALPHA macro is based on the bootstrap standard error. However, the difference between these two approaches cannot and should not explain the observed difference in standard error estimates.

2) I do not recommend using Dr Hayes’ macro programs for computing the standard error of Krippendorff’s alpha. They consistently underestimate (often by a wide margin) the magnitude of the standard error associated with Krippendorff’s alpha. Here is why I believe so. In his 2007 paper “Answering the Call for a Standard Reliability Measure for Coding Data,” co-authored with Krippendorff himself, Dr Hayes says the following regarding the algorithm he used:

*The bootstrap sampling distribution of alpha is generated by taking a random sample of 239 pairs of judgments from the available pairs, weighted by how many observers judged a given unit. Alpha is computed in this “resample” of 239 pairs, and this process is repeated very many times, producing the bootstrap sampling distribution of Alpha.*

This bootstrapping algorithm is terrible, and does not reflect in any way the bootstrap method introduced by Efron & Tibshirani (1998). Instead of resampling the rows (i.e., the units) of Table 1 of their article before re-computing the alpha coefficient, which is what Efron & Tibshirani recommend, Dr Hayes generates several sets of 239 pairs of judgments and computes alpha for each of them. First, the number 239 comes from the original table, whereas it is supposed to change from one bootstrap sample to the next. By keeping the exact same structure as the original sample, with the exact same number of missing judgments, you can only obtain a constrained (and therefore smaller) variance. What Dr Hayes should have done is simply generate several random samples with replacement from the set {1, 2, 3, 4, ..., 40} of unit numbers. A with-replacement random sample will contain duplicates, which is expected. The next step would be to extract from Table 1 only the rows whose numbers were selected in the with-replacement sample, and use them to form the bootstrap sample. This bootstrap sample would then be used to compute the alpha coefficient.
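The unit-level (row-resampling) bootstrap described above can be sketched as follows. This is a minimal illustration, not Dr Hayes' or AgreeStat's actual code: the data set is made up, and for brevity the resampled statistic is plain pairwise percent agreement rather than a full Krippendorff's alpha routine, which you would plug in in practice.

```python
import random

def bootstrap_se(ratings, stat, n_boot=1000, seed=12345):
    """Unit-level bootstrap: resample whole rows (units) with replacement,
    recompute the statistic on each resample, and return the standard
    deviation of the replicates as the bootstrap standard error."""
    random.seed(seed)
    n = len(ratings)
    reps = []
    for _ in range(n_boot):
        # Sample n row indices with replacement from {0, ..., n-1};
        # duplicates are expected and are part of the method.
        idx = [random.randrange(n) for _ in range(n)]
        reps.append(stat([ratings[i] for i in idx]))
    mean = sum(reps) / n_boot
    return (sum((r - mean) ** 2 for r in reps) / (n_boot - 1)) ** 0.5

def percent_agreement(ratings):
    """Stand-in statistic: average pairwise agreement per unit,
    skipping missing judgments (None) and units with fewer than
    two usable judgments."""
    per_unit = []
    for row in ratings:
        vals = [v for v in row if v is not None]
        if len(vals) < 2:
            continue
        pairs = agree = 0
        for i in range(len(vals)):
            for j in range(i + 1, len(vals)):
                pairs += 1
                agree += vals[i] == vals[j]
        per_unit.append(agree / pairs)
    return sum(per_unit) / len(per_unit)

# Made-up example: 10 units rated by 3 observers, one missing judgment.
data = [[1, 1, 1], [1, 1, None], [2, 2, 2], [1, 2, 1], [3, 3, 3],
        [2, 2, 1], [1, 1, 1], [3, 3, 3], [2, 2, 2], [1, 3, 1]]
print(bootstrap_se(data, percent_agreement))
```

Note that each resample keeps whole rows, so the number of usable pairs of judgments varies freely from one bootstrap sample to the next, which is precisely what the fixed-239-pairs scheme prevents.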

I have another question related to KAlpha that I hope you can answer. I am also using the Hayes macro in SPSS (don't judge me!) and I have quite a few variables that have high % agreement and low Kappa because of a homogeneous group of subjects and a variable with a rare value (e.g., No is very rare in a yes/no question). My question is: why are my bootstrap CI ranges quite large for these variables as well? I understand the paradox of low Kappa with high % agreement in the presence of rare values, but why the large CI? Unfortunately I'm not great at understanding the maths behind KAlpha (or Kappa, etc.), and I'm sure if I were I could answer my own question! I'm hoping you can make it easy to understand. Many thanks, Natasha.

Hi Natasha,

Kappa, like many other agreement coefficients, is calculated by taking the difference between the percent agreement (Pa) and the percent chance agreement (Pe), and dividing it by 1-Pe. That is, kappa = (Pa-Pe)/(1-Pe). The kappa paradox is often caused by unexpectedly large values of the percent chance agreement Pe. When Pe gets close to 1, kappa's denominator 1-Pe gets close to 0, and kappa itself becomes very unstable in a bootstrap process, taking values that vary from small to large to very large depending on what Pa turns out to be. That is, the wide bootstrap CI you get is also a consequence of the kappa paradox.