Monday, July 20, 2020

Large-sample variance of Fleiss generalized kappa coefficient

A number of researchers brought to my attention the fact that the variance associated with Fleiss' generalized kappa (Fleiss, 1971) calculated with my R package irrCAC differs from the variance obtained from alternative software products such as SPSS (Reliability with option FleissKappa) and with another R package named rel. In fact, SPSS and the R package rel produce the same variance estimate for Fleiss' generalized kappa. So, why that discrepancy and what should you do about it? Answering this question is the purpose of this rather long post. 

If SPSS and the rel package agree, it is because they are both based on the variance formula proposed by Fleiss et al. (1979).  While writing my own package irrCAC, I knew very well about this paper and read it multiple times. I decided not to use the variance formula that Fleiss, Nee, and Landis proposed and would strongly discourage anybody else from using it for any purpose other than for testing the null hypothesis of no agreement among raters.  In this post, I explain the rationale behind my decision, and will briefly discuss my alternative approach.
Initially, I thought about writing and submitting a formal article to a peer-reviewed journal on this issue.  Because I am uncertain that I will find time to do it (although I might still end up doing it at a much later time), I thought I would share with everybody the general approach I often use for deriving these variance formulas. My approach relies heavily on what is known in mathematics as the linearization technique. Linearization is a very popular technique that has been widely used across several fields of mathematics. Although I personally learned it well as a PhD student in mathematics (several years ago), in my opinion this technique can and should be introduced much earlier and in other non-mathematics fields. Here, I will restrict myself to the way linearization is often used in mathematical statistics (I used this technique in one of my previous papers - see Gwet, 2016).

Let me consider an inter-rater reliability experiment, which involves \(n\) subjects, \(r\) raters and \(q\) categories into which each of the \(r\) raters is expected to classify all \(n\) subjects (there could be some missing ratings in case some raters do not rate all subjects, but I will ignore these practical considerations for now). A total of \(r_{ik}\) raters have classified the specific subject \(i\) into category \(k\). Now, \(\pi_k\) the probability for a random rater to classify a subject into category \(k\) is given by, \[ \pi_k = \frac{1}{n}\sum_{i=1}^nr_{ik}/r\hspace{3cm}(1). \] The complement of this probability is given by \(\pi_k^\star = 1-\pi_k\), representing the probability for a rater to classify subject \(i\) into a category other than \(k\).

Fleiss' generalized kappa (cf. Fleiss, 1971) is defined as \(\widehat{\kappa} = (p_a-p_e)/(1-p_e)\), where \(p_a\) is the percent agreement and \(p_e\) the percent chance agreement. These two quantities are defined as follows:

\begin{equation} p_a = \frac1n\sum_{i=1}^n\sum_{k=1}^q\frac{r_{ik}(r_{ik}-1)}{r(r-1)},\mbox{ and }p_e = \sum_{k=1}^q\pi_k^2. \end{equation}
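To make these definitions concrete, here is a minimal Python sketch that computes \(\pi_k\), \(p_a\), \(p_e\) and \(\widehat{\kappa}\) from a table of counts \(r_{ik}\). It is not the irrCAC implementation: the function name `fleiss_kappa` and the small ratings table are made up for illustration, and the sketch assumes that every subject is rated by the same number of raters (no missing ratings).

```python
def fleiss_kappa(counts):
    """Fleiss' generalized kappa from counts[i][k] = r_ik.

    Assumes every subject was rated by the same number of raters r
    (no missing ratings).
    """
    n = len(counts)       # number of subjects
    q = len(counts[0])    # number of categories
    r = sum(counts[0])    # number of raters per subject
    # Equation (1): pi_k = (1/n) * sum_i r_ik / r
    pi = [sum(row[k] for row in counts) / (n * r) for k in range(q)]
    # Percent agreement p_a
    p_a = sum(c * (c - 1) for row in counts for c in row) / (n * r * (r - 1))
    # Percent chance agreement p_e
    p_e = sum(p * p for p in pi)
    kappa = (p_a - p_e) / (1 - p_e)
    return kappa, p_a, p_e, pi


# Hypothetical ratings: 6 subjects, 3 categories, 4 raters per subject.
counts = [
    [4, 0, 0],
    [0, 4, 0],
    [0, 0, 4],
    [2, 2, 0],
    [1, 1, 2],
    [3, 1, 0],
]

kappa, p_a, p_e, pi = fleiss_kappa(counts)
print(f"p_a = {p_a:.4f}, p_e = {p_e:.4f}, kappa = {kappa:.4f}")
```

Each row of `counts` must sum to the same number of raters; handling missing ratings would require a different treatment that this sketch does not attempt.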
Variance Proposed by Fleiss et al. (1979)

Here is the variance that Fleiss et al. (1979) proposed:
\begin{equation}\small  Var(\hat{\kappa}) = \frac{2}{\displaystyle nr(r-1)\biggl(\sum_{k=1}^q\pi_k\pi_k^\star\biggr)^2}\times \Biggl[\biggl(\sum_{k=1}^q\pi_k\pi_k^\star\biggr)^2 - \sum_{k=1}^q\pi_k\pi_k^\star(\pi_k^\star-\pi_k)\Biggr],\hspace{5mm}(2) \end{equation}
In this paper, the authors clearly say the following in the second column on page 974:
In this article, formulas for the standard error of kappa in the case of different sets of equal numbers of raters that are valid when the number of subjects is large and the null hypothesis is true are derived
The authors clearly state that their variance formulas are only valid when the null hypothesis is true (this generally means there is no agreement beyond chance). They go on to say on page 975, right after equation (5) the following:
Consider the hypothesis that the ratings are purely random in the sense that for each subject, the frequencies \(n_{i1}, n_{i2}, \cdots, n_{ik}\) are a set of multinominal frequencies with parameters \(n\) and \((P_1, P_2, \cdots, P_k)\), where \(\sum P_j = 1\).
The fact that the ratings are purely random translates in practice into an absence of agreement among raters beyond chance (you may notice that the notations in their paper are different from mine - \(n_{i1}\) for example is what I refer to as \(r_{i1}\)).  Consequently, equation (2) should only be used if you are testing the null hypothesis of no agreement among raters. It should never be used to construct confidence intervals, for example, nor to do anything else unrelated to hypothesis testing.
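For readers who want to see equation (2) in action, the following Python sketch evaluates it on a small made-up table of counts \(r_{ik}\). The function name `fleiss_null_variance` and the data are hypothetical, and the code is only my transcription of equation (2) as written above, not the code used by SPSS or the rel package.

```python
import math


def fleiss_null_variance(counts):
    """Equation (2): variance of kappa under the null hypothesis of no agreement.

    counts[i][k] = r_ik; every subject is assumed to have the same number of raters.
    """
    n = len(counts)       # number of subjects
    q = len(counts[0])    # number of categories
    r = sum(counts[0])    # number of raters per subject
    pi = [sum(row[k] for row in counts) / (n * r) for k in range(q)]
    # s = sum_k pi_k * pi_k_star, with pi_k_star = 1 - pi_k
    s = sum(p * (1 - p) for p in pi)
    # t = sum_k pi_k * pi_k_star * (pi_k_star - pi_k)
    t = sum(p * (1 - p) * ((1 - p) - p) for p in pi)
    return 2.0 * (s * s - t) / (n * r * (r - 1) * s * s)


# Hypothetical ratings: 6 subjects, 3 categories, 4 raters per subject.
counts = [
    [4, 0, 0],
    [0, 4, 0],
    [0, 0, 4],
    [2, 2, 0],
    [1, 1, 2],
    [3, 1, 0],
]

var_null = fleiss_null_variance(counts)
se_null = math.sqrt(var_null)
print(f"null variance = {var_null:.6f}, null standard error = {se_null:.4f}")
```

Note that the result depends on the data only through the marginal proportions \(\pi_k\): two rating tables with the same category totals get the same variance, however much or little the raters actually agree. This is consistent with a formula derived under purely random ratings.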

Variance Proposed by Gwet (2014)

The variance of Fleiss' generalized kappa I proposed in my book (see Gwet, 2014) is defined as follows: \begin{equation} Var(\hat{\kappa}) = \frac{1-f}{n}\frac1{n-1}\sum_{i=1}^n\bigl(\kappa_i^\star - \hat{\kappa}\bigr)^2,\hspace{2cm}(3) \end{equation} where \(f\) is the sampling fraction (negligible when the subjects are drawn from a large population), and \begin{equation} \kappa_i^\star = \kappa_i - 2(1-\hat{\kappa})\frac{p_{e|i}-p_e}{1-p_e}, \end{equation} with \(\kappa_i = (p_{a|i}-p_e)/(1-p_e)\). Moreover \(p_{a|i}\) and \(p_{e|i}\) are defined as follows: \begin{equation} p_{a|i} = \sum_{k=1}^q\frac{r_{ik}(r_{ik}-1)}{r(r-1)}\mbox{ and }p_{e|i} = \sum_{k=1}^q\pi_kr_{ik}/r. \end{equation}
Equation (3) is a general purpose variance estimator, which is valid either for hypothesis testing or for confidence interval construction or for anything else. I expect equations (2) and (3) to agree reasonably well when the extent of agreement among raters is close to 0. I did not personally verify this, but I expect it to be true if equation (2) was properly derived. But if there is an agreement to some extent among raters, then I expect equation (2) to often yield a smaller variance and to be an understatement of the true variance.
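Equation (3) can be sketched in a few lines of Python. The function name `gwet_variance` and the small ratings table are made up for illustration; this is not the irrCAC code. The sketch assumes no missing ratings and treats \(f\) as a sampling fraction that defaults to 0 (a large subject population), which is an assumption of this example.

```python
def gwet_variance(counts, f=0.0):
    """Equation (3): variance of Fleiss' kappa based on the kappa_i_star terms.

    counts[i][k] = r_ik; f is treated as the sampling fraction (assumed 0 here).
    """
    n = len(counts)       # number of subjects
    q = len(counts[0])    # number of categories
    r = sum(counts[0])    # number of raters per subject
    pi = [sum(row[k] for row in counts) / (n * r) for k in range(q)]
    p_e = sum(p * p for p in pi)
    # Subject-level agreement and chance-agreement terms
    p_a_i = [sum(c * (c - 1) for c in row) / (r * (r - 1)) for row in counts]
    p_e_i = [sum(pi[k] * row[k] for k in range(q)) / r for row in counts]
    p_a = sum(p_a_i) / n
    kappa = (p_a - p_e) / (1 - p_e)
    # kappa_i_star = kappa_i - 2 * (1 - kappa) * (p_e|i - p_e) / (1 - p_e)
    kappa_star = [
        (pa - p_e) / (1 - p_e) - 2 * (1 - kappa) * (pe - p_e) / (1 - p_e)
        for pa, pe in zip(p_a_i, p_e_i)
    ]
    var = (1 - f) / n * sum((ks - kappa) ** 2 for ks in kappa_star) / (n - 1)
    return kappa, var, kappa_star


# Hypothetical ratings: 6 subjects, 3 categories, 4 raters per subject.
counts = [
    [4, 0, 0],
    [0, 4, 0],
    [0, 0, 4],
    [2, 2, 0],
    [1, 1, 2],
    [3, 1, 0],
]

kappa, var, kappa_star = gwet_variance(counts)
print(f"kappa = {kappa:.4f}, variance = {var:.6f}")
```

A useful sanity check is that the \(\kappa_i^\star\) values average exactly to \(\widehat{\kappa}\), because the \(p_{a|i}\) average to \(p_a\) and the \(p_{e|i}\) average to \(p_e\). Equation (3) is therefore just the usual variance of a sample mean applied to the \(\kappa_i^\star\).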

How do you get to equation (3)?

The linearization technique is based upon the basic fact that the human mind has always found it convenient to deal with linear expressions, because they only involve summing and averaging stuff (stuff here represents variables to which fixed scalar coefficients may or may not be attached). That is why averages or means have been so popular and so thoroughly investigated in statistics.

Now, Fleiss' generalized kappa \(\widehat{\kappa}\) is anything but linear.  Its calculation involves more, a lot more than summing and averaging basic entities.  When dealing with a nonlinear expression such as kappa, the question always becomes: can we not find a linear expression that is sufficiently close to kappa, derive the variance of the linear expression and show that it is sufficiently close to the actual variance of kappa?  As it turns out, when the number of subjects is very large, Fleiss' generalized kappa tends to take a linear form, and the larger the number of subjects, the closer that linear form gets to kappa. But what does that linear form look like?  Since linearity only takes place when the number of subjects is far beyond what you would normally have in practice, we need to be able to zoom in enough to take a decent shot at that linear form from a distance. Finding a powerful zoom and using it well is the challenge we need to overcome to resolve this issue. To be successful, we need to tackle this issue step by step, piece by piece.
  • When calculating the variance of Fleiss' generalized kappa \(\widehat{\kappa}\), the first very annoying issue you face is the \(1-p_e\) factor that appears in the denominator.  Here is a very clear-cut way to get rid of it.  The fundamental building block of the percent chance agreement \(p_e\) is \(\pi_k\).  Since this quantity depends on the sample of \(n\) subjects (see equation 1), and is expected to change as the subject sample changes, let us denote this probability by \(\widehat{\pi}_k\) to stress that it is an approximation of something fixed \((\pi_k)\) that we will call the "true" propensity for classification into category \(k\). Equation (1) as well as the percent chance agreement \(p_e\) have to be rewritten as follows:
    \begin{equation} \widehat{\pi}_k = \frac1n\sum_{i=1}^nr_{ik}/r, \mbox{ and }p_e = \sum_{k=1}^q\widehat{\pi}_k^2. \end{equation} Since \(\widehat{\pi}_k\) is a simple average, the Law of Large Numbers (LLN) stipulates that \(\widehat{\pi}_k\) converges (in probability) to a fixed number \(\pi_k\) as the number of subjects \(n\) grows. \begin{equation} \widehat{\pi}_k \overset{P}{\longrightarrow} \pi_k\hspace{3cm}(4) \end{equation} We might never know for sure what the real value of \(\pi_k\) is. But that's ok, we don't really need it for now. Remember we are zooming in on something in the vicinity of infinity to see a possible linear form for kappa. Once we have it, then we will proceed to estimate what we don't know.
    The Continuous Mapping Theorem (CMT) (applied twice) allows me to deduce from equation (4) that, \begin{equation} \widehat{\pi}_k^2 \overset{P}{\longrightarrow} \pi_k^2, \mbox{ and } p_e \overset{P}{\longrightarrow} P_e\hspace{3cm}(5) \end{equation} That is, as the number of subjects grows, the (random) percent chance agreement \(p_e\) converges in probability to a fixed quantity \(P_e\), which is the sum of the \(q\) terms \(\pi_k^2\). Note that \(p_e\), the sample-based random quantity, is in lowercase, while \(P_e\), the fixed limit value, is in uppercase.
  • Now, note that in the formula defining Fleiss' generalized kappa, the numerator must be divided by \(1-p_e\), that is, multiplied by \(1/(1-p_e)\).  This inverse can be rewritten as follows: \begin{equation} \frac1{1-p_e} = \frac{1}{1-P_e}\times\frac{1}{1-\displaystyle\biggl(\frac{p_e-P_e}{1-P_e}\biggr)}\hspace{1cm}(6) \end{equation} First consider the expression on the right side of the multiplication sign \(\times\) in equation (6) and let \(\varepsilon = (p_e-P_e)/(1-P_e)\). It follows from Taylor's theorem that \(1/(1-\varepsilon) = 1+\varepsilon + \mbox{Remainder}\), where the "Remainder" goes to 0 (in probability) faster than \(\varepsilon\) as the number of subjects goes to infinity. Consequently, it follows from Slutsky's theorem that the large-sample probability distribution of \(1/(1-\varepsilon)\) is the same as the probability distribution of \(1+\varepsilon\). It follows from equation (6) that the large-sample distribution of the inverse \(1/(1-p_e)\) is the same as the distribution of the following statistic: \begin{equation} L_e = \frac{1+(p_e-P_e)/(1-P_e)}{1-P_e}\hspace{3cm}(7). \end{equation} You will note that the \(L_e\) statistic does not involve any sample-dependent quantity in the denominator, which is the objective I wanted to accomplish.
  • Now, we know that the large-sample distribution of Fleiss' generalized kappa is the same as the distribution of \(\kappa_0\) defined as follows: \begin{equation} \kappa_0 = (p_a-p_e)L_e\hspace{4cm}(8) \end{equation} Applying the Law of Large Numbers again, I also know that the percent agreement \(p_a\) converges (in probability) to a fixed probability \(P_a\), whose exact value may never be known to us (an issue I'll worry about later). Writing \(p_a-p_e = (p_a-P_e)-(p_e-P_e)\) in equation (8), and replacing the ratio \((p_a-p_e)/(1-P_e)\) that multiplies \((p_e-P_e)/(1-P_e)\) by its limit \(\kappa\) (Slutsky's theorem again), \(\kappa_0\) has the same large-sample distribution as the following expression, which I will still call \(\kappa_0\): \begin{equation} \kappa_0 = \frac{p_a-P_e}{1-P_e} - (1-\kappa)\frac{p_e-P_e}{1-P_e},\hspace{2cm}(9) \end{equation} where \(\kappa=(P_a-P_e)/(1-P_e)\) is the fixed value (or estimand) to which Fleiss' kappa is expected to converge. Our linear form is slowly and gradually taking shape. While the sample-based percent agreement \(p_a\) is already a linear expression, that is not yet the case for the percent chance agreement, which depends on \(\widehat{\pi}_k^2\) - a sample-dependent statistic that is squared. Let us deal with it.
  • I indicated earlier that the estimated propensity \(\widehat{\pi}_k\) for classification into category \(k\) converges (in probability) to the fixed quantity \(\pi_k\). It follows from Taylor's theorem again that \(\widehat{\pi}_k^2 = \pi_k^2 +2\pi_k(\widehat{\pi}_k-\pi_k) + \mbox{Remainder}\), where the remainder goes to 0 faster than the difference \(\widehat{\pi}_k-\pi_k\). Consequently, the large-sample distribution of the difference \((p_e-P_e)\) of equation (9) is the same as that of  \(2(p_{e|0}-P_e)\) where \(p_{e|0}\) is given by: \begin{equation} p_{e|0} = \sum_{k=1}^q\pi_k\widehat{\pi}_k = \frac{1}{n}\sum_{i=1}^np_{e|i}, \mbox{ where }p_{e|i}=\sum_{k=1}^q\pi_kr_{ik}/r. \hspace{1cm}(10) \end{equation} Consequently, the large-sample distribution of \(\kappa_0\) of equation (9) is the same as the distribution of \(\kappa_1\) given by, \begin{equation} \kappa_1 = \frac{p_a-P_e}{1-P_e} - 2(1-\kappa)\frac{p_{e|0}-P_e}{1-P_e}. \hspace{3cm}(11) \end{equation} Note that equation (11) is the pure linear expression I was looking for. That is, \begin{equation} \kappa_1 =\frac{1}{n}\sum_{i=1}^n\kappa_i^\star\hspace{5cm}(12), \end{equation} where \(\kappa_i^\star\) is defined right after equation (3) above, except that \(p_e\) and \(\widehat{\kappa}\) are replaced here by their fixed limits \(P_e\) and \(\kappa\). Now, we have found a simple average whose probability distribution is the same as the large-sample distribution of Fleiss' generalized kappa. All you need to do is to compute the variance of \(\kappa_1\). If there are some outstanding terms that are unknown, you estimate them based on the sample data you have, which is how \(P_e\), \(\kappa\) and \(\pi_k\) end up being replaced by \(p_e\), \(\widehat{\kappa}\) and \(\widehat{\pi}_k\) in equation (3). This is how these variances are calculated.
  • Note that the Central Limit Theorem ensures that the large-sample distribution of the sample mean is Normal.  Therefore, it is equation (12) that is used to show that the large-sample distribution of Fleiss' kappa is Normal and to compute the associated variance.
  • To conclude, I may say that for the derivation of Fleiss' generalized kappa variance, I did not need to make any special assumptions about the ratings.  I only used (sometimes multiple times) the following 5 theorems:
    • The Law of Large Numbers
    • Taylor's Theorem
    • Slutsky's Theorem
    • The Continuous Mapping Theorem
    • The Central Limit Theorem
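To get a feel for why replacing \(1/(1-p_e)\) by the statistic \(L_e\) of equation (7) is harmless in large samples, here is a small numerical check in Python. The limit value \(P_e = 0.35\) and the deviations are arbitrary choices of mine, and the function names are hypothetical. The point is only to show that the error of the linear form behaves like the square of the deviation \(p_e - P_e\), so it vanishes faster than the deviation itself.

```python
def inverse_exact(p_e):
    """The nonlinear factor 1 / (1 - p_e) appearing in kappa."""
    return 1.0 / (1.0 - p_e)


def inverse_linear(p_e, P_e):
    """L_e of equation (7): linear in p_e, with only the fixed P_e in the denominator."""
    eps = (p_e - P_e) / (1.0 - P_e)
    return (1.0 + eps) / (1.0 - P_e)


P_e = 0.35                        # hypothetical fixed limit of p_e
deltas = (0.04, 0.02, 0.01)       # shrinking deviations p_e - P_e

errors = []
for delta in deltas:
    p_e = P_e + delta
    errors.append(inverse_exact(p_e) - inverse_linear(p_e, P_e))

# Each time the deviation is halved, the error is divided by roughly 4.
ratios = [errors[i] / errors[i + 1] for i in range(len(errors) - 1)]

for delta, err in zip(deltas, errors):
    print(f"deviation = {delta:.2f}, error of linear form = {err:.6f}")
print("successive error ratios:", [round(x, 3) for x in ratios])
```

Since \(p_e - P_e\) itself shrinks as the number of subjects grows, an error that is of the order of its square becomes negligible relative to the linear term, which is the content of the "Remainder" statement in the Taylor step.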
The linearization technique used here is definitely the most effective way for deriving the variance of complicated statistics.  Unfortunately, students outside of traditional mathematics departments have hardly been exposed to it.   I attempted here to outline the key steps and the main theorems needed to zoom in on kappa in the vicinity of infinity so that one can read its linear structure. Hopefully researchers with some background in mathematics will get a glimpse into the mechanics of this powerful technique.  The most delicate aspect of this method is the application of Taylor's theorem.  In fact, the "remainder" term must be carefully looked at to ensure that it does not include a term that is not supposed to be there.

A rigorous mathematical demonstration of the steps discussed above should not be a concern to researchers.  It would definitely require PhD-level mathematics that should be left to PhD students in maths.     

Side Note:
In his book entitled "How Not to Be Wrong: The Power of Mathematical Thinking", here is what the author Jordan Ellenberg says:
A basic rule of mathematical life: if the universe hands you a hard problem, try to solve an easier one instead, and hope the simple version is close enough to the original problem that the universe doesn't object.


  • Fleiss, J.L. (1971). Measuring nominal scale agreement among many raters. Psychological Bulletin, 76, 378-382.
  • Fleiss, J.L., Nee, J.C.M., and Landis, J.R. (1979). Large Sample Variance of Kappa in the Case of Different Sets of Raters. Psychological Bulletin, 86, 974-977.
  • Gwet, K.L. (2014). Handbook of Inter-Rater Reliability (4th Edition), Advanced Analytics, LLC, Maryland, USA.
  • Gwet, K.L. (2016). Testing the Difference of Correlated Agreement Coefficients for Statistical Significance. Educational and Psychological Measurement, 76(4), 609–637.