Saturday, August 28, 2021

Cohen's Kappa paradoxes make sample size calculation impossible

Cohen's kappa coefficient often yields unduly low estimates, which can be counterintuitive when compared with the observed agreement level quantified by the percent agreement. This problem is known in the literature as the kappa paradoxes and has been widely discussed by several authors, including Feinstein and Cicchetti (1990).
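To make the paradox concrete, here is a minimal sketch that computes the percent agreement and Cohen's kappa from a two-rater contingency table. The table of counts is hypothetical and chosen only for illustration, and the helper name `cohen_kappa` is my own, not taken from any library.

```python
def cohen_kappa(table):
    """Return (percent agreement, chance agreement, kappa) for a square table of counts.

    table[k][l] is the number of subjects that rater A put in category k
    and rater B put in category l.
    """
    q = len(table)
    n = sum(sum(row) for row in table)
    # Observed (percent) agreement: share of subjects on the diagonal.
    p_o = sum(table[k][k] for k in range(q)) / n
    # Marginal classification proportions of each rater.
    row_marg = [sum(table[k]) / n for k in range(q)]
    col_marg = [sum(table[i][k] for i in range(q)) / n for k in range(q)]
    # Chance agreement under independence of the two raters.
    p_e = sum(row_marg[k] * col_marg[k] for k in range(q))
    return p_o, p_e, (p_o - p_e) / (1 - p_e)


# Hypothetical ratings of 100 subjects: the raters agree on 90 of them,
# yet almost all subjects fall in the first category.
table = [[90, 5],
         [5, 0]]
p_o, p_e, kappa = cohen_kappa(table)
print(f"percent agreement = {p_o:.3f}, chance agreement = {p_e:.3f}, kappa = {kappa:.3f}")
```

With these made-up counts the raters agree on 90% of the subjects, but the chance agreement is 0.905, so kappa comes out slightly negative.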

Although researchers have primarily been concerned about the magnitude of kappa, another equally serious and often overlooked consequence of the paradoxes is the difficulty of performing sample size calculations. Suppose you want to know the number \(n\) of subjects required to obtain a standard error of kappa smaller than 0.30. The surprising reality is that, no matter how large the number \(n\) of subjects, there is no guarantee that kappa's standard error will be smaller than 0.30. In other words, a particular set of ratings can always be found that yields a standard error exceeding 0.30.

Note that for an arbitrarily large number of raters \(r\), Conger's kappa (which reduces to Cohen's kappa for \(r=2\)), Krippendorff's alpha, and Fleiss' generalized kappa all have similar large-sample variances. Therefore, I have decided to investigate Fleiss' generalized kappa only. The maximum variance of Fleiss' kappa is given by:

\[MaxVar\bigl(\widehat{\kappa}_F\bigr) =\frac{an}{n-b},\hspace{3cm}(1)\]

where \(a\) and \(b\) are two constants that depend on the number of raters \(r\) and the number of categories \(q\). For more details about the derivation of this expression, see Gwet (2021, chapter 6).

For 2 raters, \(a=0.099\) and \(b=3.08\). Since \(an/(n-b) > a\) whenever \(n > b\), and this ratio decreases toward \(a\) as \(n\) grows, the maximum standard error will still exceed \(\sqrt{a}=\sqrt{0.099}\approx 0.315\) even if the number of subjects goes to infinity. That is, it will always be possible to find a set of ratings that leads to a standard error exceeding 0.3.
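As a quick numerical check, the sketch below evaluates equation (1) for a few sample sizes, using the two-rater constants \(a=0.099\) and \(b=3.08\) quoted above. The function name `max_se` and the chosen values of \(n\) are my own and serve only as illustration.

```python
import math

A = 0.099  # constant a for 2 raters, as quoted in the text
B = 3.08   # constant b for 2 raters, as quoted in the text


def max_se(n, a=A, b=B):
    """Maximum standard error of Fleiss' kappa from equation (1): sqrt(a*n / (n - b))."""
    if n <= b:
        raise ValueError("equation (1) requires n > b")
    return math.sqrt(a * n / (n - b))


# The bound shrinks as n grows, but never drops below sqrt(a).
for n in (10, 50, 100, 1_000, 1_000_000):
    print(f"n = {n:>9,d}  max SE = {max_se(n):.4f}")
print(f"limit as n grows: sqrt(a) = {math.sqrt(A):.4f}")
```

Whatever sample size you plug in, the maximum standard error stays above \(\sqrt{a}\), so no value of \(n\) can guarantee a standard error below 0.3.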


Feinstein, A.R. and D.V. Cicchetti (1990), "High agreement but low kappa: I. The problems of two paradoxes." Journal of Clinical Epidemiology, 43, 543-549.

Gwet, K. (2021), Handbook of Inter-Rater Reliability, 5th Edition. Volume 1: Analysis of Categorical Ratings, AgreeStat Analytics, Maryland, USA.