Saturday, August 28, 2021

Cohen's Kappa paradoxes make sample size calculation impossible

Cohen's kappa coefficient often yields unduly low estimates, which can be counterintuitive when compared to the observed agreement level quantified by the percent agreement. This problem is referred to in the literature as the kappa paradoxes and has been discussed by several authors, Feinstein and Cicchetti (1990) among them.
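
To make the paradox concrete, here is a minimal sketch that computes the percent agreement and Cohen's kappa from a 2-rater contingency table. The table of counts is hypothetical (it is not taken from any of the cited papers), and the function name `cohen_kappa` is just an illustrative choice. The two raters agree on 90 of the 100 subjects, yet kappa comes out slightly below zero because almost all ratings fall in the same category.

```python
def cohen_kappa(table):
    """Return (percent agreement, chance agreement, kappa) for a square table of counts.

    table[i][j] is the number of subjects placed in category i by the first
    rater and in category j by the second rater.
    """
    n = sum(sum(row) for row in table)
    q = len(table)
    # Observed agreement: share of subjects on the diagonal.
    p_o = sum(table[k][k] for k in range(q)) / n
    # Marginal classification probabilities of each rater.
    row_marg = [sum(table[k]) / n for k in range(q)]
    col_marg = [sum(table[i][k] for i in range(q)) / n for k in range(q)]
    # Chance agreement as defined for Cohen's kappa.
    p_e = sum(row_marg[k] * col_marg[k] for k in range(q))
    kappa = (p_o - p_e) / (1 - p_e)
    return p_o, p_e, kappa


# Hypothetical ratings of 100 subjects into 2 categories (illustration only).
table = [[90, 5],
         [5, 0]]
p_o, p_e, kappa = cohen_kappa(table)
print(f"percent agreement = {p_o:.3f}, chance agreement = {p_e:.3f}, kappa = {kappa:.3f}")
```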

Although researchers have primarily been concerned about the magnitude of kappa, another equally serious and often overlooked consequence of the paradoxes is the difficulty of performing sample size calculations. Suppose you want to know the number \(n\) of subjects that is required to obtain a standard error for kappa that is smaller than 0.30.  The surprising reality is that, no matter how large the number \(n\) of subjects, there is no guarantee that kappa's standard error will be smaller than 0.30.  In other words, a particular set of ratings can always be found that would yield a standard error that exceeds 0.30.

Note that for an arbitrarily large number of raters \(r\), Conger's kappa (which reduces to Cohen's kappa for \(r=2\)), Krippendorff's alpha and Fleiss' generalized kappa have similar large-sample variances.  Therefore, I have decided to investigate Fleiss' generalized kappa only. The maximum variance of Fleiss' kappa is given by:

\[MaxVar\bigl(\widehat{\kappa}_F\bigr) =\frac{an}{n-b},\hspace{3cm}(1)\]

where \(a\) and \(b\) are 2 constants that depend on the number of raters \(r\) and the number of categories \(q\).  For more details about the derivation of this expression see Gwet (2021, chapter 6).

For 2 raters, \(a=0.099\) and \(b=3.08\). Consequently, even if the number of subjects goes to infinity, the maximum standard error will still exceed \(\sqrt{a}=\sqrt{0.099}\approx 0.315\). That is, it will always be possible to find a set of ratings that leads to a standard error that exceeds 0.3.
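
As a quick numerical check, the sketch below evaluates the square root of equation (1) for a few sample sizes, using the constants \(a=0.099\) and \(b=3.08\) quoted above for 2 raters. The function name `max_std_error` and the chosen sample sizes are illustrative assumptions. The maximum standard error decreases as \(n\) grows, but it stays above \(\sqrt{a}\) and therefore above 0.3.

```python
import math

A, B = 0.099, 3.08  # constants a and b quoted in the text for 2 raters


def max_std_error(n, a=A, b=B):
    """Square root of equation (1): sqrt(a * n / (n - b)). Requires n > b."""
    if n <= b:
        raise ValueError("equation (1) is only meaningful for n > b")
    return math.sqrt(a * n / (n - b))


# Sample sizes chosen for illustration only.
sample_sizes = (10, 50, 100, 1000, 1_000_000)
for n in sample_sizes:
    print(f"n = {n:>9,d}   maximum standard error = {max_std_error(n):.4f}")

print(f"limit as n grows: sqrt(a) = {math.sqrt(A):.4f}")
```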


Feinstein, A.R. and D.V. Cicchetti (1990), "High agreement but low kappa: I. The problems of two paradoxes." Journal of Clinical Epidemiology, 43, 543-549.

Gwet, K. (2021), Handbook of Inter-Rater Reliability, 5th Edition. Volume 1: Analysis of Categorical Ratings, AgreeStat Analytics, Maryland USA.

Tuesday, March 30, 2021

Agreement Among 3 Raters or More When A Subject Can be Rated by No More than 2 Raters

Most methods proposed in the literature for evaluating the extent of agreement among 3 raters or more assume that each rater is expected to rate all subjects. In some inter-rater reliability applications however, this requirement cannot be satisfied, either because of the prohibitive costs associated with the rating process or because the rating process is too demanding for a human subject.  For example, scientific laboratories are often rated by accrediting agencies to have their work quality officially certified. These accrediting agencies themselves need to conduct inter-rater reliability studies to demonstrate the high quality of their accreditation process.  Given the high costs associated with accrediting a laboratory (a staggering number of lab procedures must be verified and documentation reviewed), agencies are willing to fund a single round of rating for each laboratory with one rater, and use another rater to provide the ratings during the regular accreditation process, which is funded by each lab.

The question now becomes "Is it possible to evaluate the extent of agreement among 3 raters or more, given that a maximum of 2 raters are allowed to rate the same subject?"  The good news is that it is indeed possible to design an experiment that would achieve that goal.  However, there is a price to be paid to make this happen. The agreement coefficient based on such a design will have a higher variance than the traditional coefficient based on the fully-crossed design where each rater must rate all subjects.  The general approach is as follows:

  • Suppose your problem is to quantify the extent of agreement among the group of 5 raters \({\cal R}=\{Rater1, Rater2, Rater3, Rater4, Rater5 \}\)
  • Out of the roster of 5 raters \(\cal R\), one can form 10 different pairs of raters, numbered from 1 to 10. (Note that if \(r\) is the number of raters, then the number of pairs that can be formed is \(r(r-1)/2\), which equals \(5\times4/2=10\) here.)
  • Suppose that a total of \(n=15\) subjects will participate in your experiment.  The procedure consists of selecting 15 pairs of raters randomly and with replacement (i.e. one pair of raters could be selected more than once) from the above 10 pairs. The 15 selected pairs of raters will be assigned to the 15 subjects on a flow basis (i.e. sequentially as they are selected). 
  • Select with replacement 15 random integers between 1 and 10.  Suppose the 15 random integers are \(\{2, 6, 2, 5, 4, 1, 8, 1, 3, 3, 5, 4, 2, 5, 9\}\).  That is, the \(2^{nd}\) pair (Rater1, Rater3) will be assigned to subjects 1, 3 and 13. The \(6^{th}\) pair (Rater2, Rater4) will be assigned to subject 2, and so on. The resulting experimental design associates one pair of raters with each of the 15 subjects.
  • Once all 15 subjects are rated, the dataset of ratings will have 3 columns.  The first column, Subject, will identify subjects, and the remaining 2 columns will contain the ratings from the different pairs of raters assigned to subjects.  The agreement coefficient will then be calculated as if the same 2 raters produced all the ratings.  What will be different is the variance associated with the agreement coefficient. 
  • What is described here is referred to as a Partially Crossed design with 2 raters per subject, or \(\textsf{PC}_2\) design, and is discussed in detail in the \(5^{th}\) edition of the Handbook of Inter-Rater Reliability to be released in July of 2021.
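
The assignment steps above can be sketched in a few lines of Python. This is only an illustration: the numbering of the 10 pairs is assumed to be lexicographic, which is consistent with the two pairs named in the text (pair 2 is (Rater1, Rater3) and pair 6 is (Rater2, Rater4)), and the helper `draw_pair_numbers` with its `seed` argument is a hypothetical name introduced here for the random selection step.

```python
import random
from itertools import combinations

raters = ["Rater1", "Rater2", "Rater3", "Rater4", "Rater5"]

# All r(r-1)/2 pairs, in lexicographic order (assumed numbering 1..10).
pairs = list(combinations(raters, 2))


def draw_pair_numbers(n_subjects, n_pairs, seed=None):
    """Select n_subjects pair numbers between 1 and n_pairs, with replacement."""
    rng = random.Random(seed)
    return rng.choices(range(1, n_pairs + 1), k=n_subjects)


# The 15 integers used in the text; a fresh draw would instead be
# draw_pair_numbers(15, len(pairs)).
draws = [2, 6, 2, 5, 4, 1, 8, 1, 3, 3, 5, 4, 2, 5, 9]

# Assign the selected pairs to subjects 1..15 in the order they were drawn.
design = {subject: pairs[number - 1] for subject, number in enumerate(draws, start=1)}

for subject, (first_rater, second_rater) in design.items():
    print(f"Subject {subject:>2}: {first_rater}, {second_rater}")
```

Each subject ends up with exactly 2 raters, so the resulting ratings fit in the 3-column dataset described above.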

Monday, February 22, 2021

Testing the Difference Between 2 Agreement Coefficients for Statistical Significance

Researchers who use chance-corrected agreement coefficients such as Cohen's Kappa, Gwet's AC1 or AC2, Fleiss' Kappa and many other alternatives in their research often need to compare two coefficients calculated with 2 different sets of ratings.  A rigorous way to do such a comparison is to test the difference between these 2 coefficients for statistical significance. This issue was extensively discussed in my paper entitled Testing the Difference of Correlated Agreement Coefficients for Statistical Significance. AgreeTest, a cloud-based application, can help you perform the techniques discussed in this paper and more.  Do not hesitate to check it out when you find time.

The 2 sets of ratings used to compute the agreement coefficients under comparison may be totally independent or may have several aspects in common. Here are 2 possible scenarios you may encounter in practice:

  • Both datasets of ratings were produced by 2 independent samples of subjects and 2 independent groups of raters.  In this case, the 2 agreement coefficients associated with these datasets are said to be uncorrelated. Their difference can be tested for statistical significance with an Unpaired t-Test (also implemented in AgreeTest).    
  • Both datasets of ratings were produced either by 2 overlapping samples of subjects or 2 overlapping groups of raters, or both.  In this case, the 2 agreement coefficients associated with these datasets are said to be correlated. Their difference can be tested for statistical significance with a Paired t-Test (also implemented in AgreeTest).
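
The sketch below illustrates the general shape of both tests; it is not the procedure implemented in AgreeTest or derived in the paper. It assumes that the 2 coefficients, their standard errors and (for the correlated case) their covariance have already been estimated, and it uses a large-sample normal approximation in place of the t distribution. The function name `difference_test` and all numeric inputs are hypothetical.

```python
import math
from statistics import NormalDist


def difference_test(coef1, se1, coef2, se2, covariance=0.0):
    """Test statistic and two-sided p-value for coef1 - coef2.

    covariance = 0 corresponds to the uncorrelated (unpaired) scenario;
    a nonzero covariance corresponds to the correlated (paired) scenario.
    The p-value relies on a normal approximation (an assumption of this sketch).
    """
    diff = coef1 - coef2
    # Variance of a difference: Var(X - Y) = Var(X) + Var(Y) - 2 Cov(X, Y).
    variance = se1 ** 2 + se2 ** 2 - 2.0 * covariance
    if variance <= 0:
        raise ValueError("the variance of the difference must be positive")
    statistic = diff / math.sqrt(variance)
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(statistic)))
    return diff, statistic, p_value


# Hypothetical inputs, for illustration only.
unpaired = difference_test(0.72, 0.05, 0.58, 0.06)
paired = difference_test(0.72, 0.05, 0.58, 0.06, covariance=0.0015)
print("unpaired: diff=%.3f, statistic=%.3f, p=%.4f" % unpaired)
print("paired:   diff=%.3f, statistic=%.3f, p=%.4f" % paired)
```

With a positive covariance, the variance of the difference shrinks, so treating correlated coefficients as if they were independent would understate the test statistic in this example.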
Several researchers have successfully used these statistical techniques in their research.  Here is a small sample of these publications:

Tuesday, February 16, 2021

New peer-reviewed article

Many statistical packages have implemented the wrong variance equation for Fleiss' generalized kappa (Fleiss, 1971). SPSS and the R package "rel" are among these packages. I recently published in "Educational and Psychological Measurement" an article entitled "Large-Sample Variance of Fleiss Generalized Kappa." I show in this article that it is not Fleiss' variance equation that is wrong. Instead, it is the way it has been used that is wrong. Fleiss' variance equation was developed under the assumption of no agreement among raters and for the sole purpose of being used in hypothesis testing. It does not quantify the precision of Fleiss' generalized kappa and cannot be used for constructing confidence intervals either.