*I decided not to use the variance formula that Fleiss, Nee, and Landis proposed and would strongly discourage anybody else from using it for any purpose other than for testing the null hypothesis of no agreement among raters*. In this post, I explain the rationale behind my decision, and will briefly discuss my alternative approach.

Let me consider an inter-rater reliability experiment, which involves \(n\) subjects, \(r\) raters and \(q\) categories into which each of the \(r\) raters is expected to classify all \(n\) subjects (there could be some missing ratings in case some raters do not rate all subjects, but I will ignore these practical considerations for now). A total of \(r_{ik}\) raters have classified the specific subject \(i\) into category \(k\). Now, \(\pi_k\) the probability for a random rater to classify a subject into category \(k\) is given by, \[ \pi_k = \frac{1}{n}\sum_{i=1}^nr_{ik}/r\hspace{3cm}(1). \] The complement of this probability is given by \(\pi_k^\star = 1-\pi_k\), representing the probability for a rater to classify subject \(i\) into a category other than \(k\).

Fleiss' generalized kappa (c.f. Fleiss, 1971) is defined as \(\widehat{\kappa} = (p_a-p_e)/(1-p_e)\), where \(p_a\) is the percent agreement and \(p_e\) the percent chance agreement. These 2 quantities are defined as follows:

\begin{equation} p_a = \frac1n\sum_{i=1}^n\sum_{k=1}^q\frac{r_{ik}(r_{ik}-1)}{r(r-1)},\mbox{ and }p_e = \sum_{k=1}^q\pi_k^2. \end{equation}**Variance Proposed by Fleiss et al. (1979)**

In this article, formulas for the standard error of kappa in the case of different sets of equal numbers of raters that are valid when the number of subjects is largeand the null hypothesis is trueare derived

Consider the hypothesis that the ratings are purely random in the sense that for each subject, the frequencies \(n_{i1}, n_{i2}, \cdots, n_{ik}\) are a set of multinominal frequencies with parameters \(n\) and \((P_1, P_2, \cdots, P_k)\), where \(\sum P_j = 1\).

**equation (2) should only be used if you are testing the null hypothesis of no agreement among raters, and should never ever be used to construct confidence intervals for example, nor to do anything else unrelated to hypothesis testing**.

**Variance Proposed by Gwet (2014)**

**How to you get to equation (3)?**

*can we not find a linear expression that is sufficiently close to kappa, derive the variance of the linear expression and show that it is sufficiently close to the actual variance of kappa?*As it turned out, when the number of subjects is very large, Fleiss' generalized kappa tends to take a linear form, and the larger the number of subjects, the closer that linear form gets to kappa. But what does that linear form look like? Since linearity is only taking place when the number of subjects is far away from what you would normally have in practice, we need to be able to zoom in enough to take a decent shot at that linear form from a distance. Finding a powerful zoom and using it well is the challenge we need to overcome to resolve this issue. To be successful, we do need to tackle this issue step by step, piece by piece.

- When calculating the variance of Fleiss' generalized kappa \(\widehat{\kappa}\), the first very annoying issue you face is the \(1-p_e\) factor that appears in the denominator. Here is a very clear-cut way to get rid of it. The fundamental building block of the percent chance agreement \(p_e\) is \(\pi_k\). Since this quantity depends on the sample of \(n\) subjects (see equation 1), and is expected to change as the subject sample changes, let us denote this probability by \(\widehat{\pi}_k\) to stress out that it is an approximation of something fixed \((\pi_k)\) that we will call the "true" propensity for classification into category \(k\). Equation (1) as well as the percent chance agreement \(p_e\) have to be rewritten as follows:

\begin{equation} \widehat{\pi}_k = \frac1n\sum_{k=1}^nr_{ik}/r, \mbox{ and }p_e = \sum_{k=1}^q\widehat{\pi}_k^2. \end{equation} Since \(\widehat{\pi}_k\) is a simple average, the**Law of Large Numbers (LLN)**stipulates that \(\widehat{\pi}_k\) converges (in probability) to a fixed number \(\pi_k\) as the number of subjects \(n\) grows. \begin{equation} \widehat{\pi}_k \overset{P}{\longrightarrow} \pi_k\hspace{3cm}(4) \end{equation} We might never know for sure what the real value of \(\pi_k\) is. But that's ok, we don't really need it for now. Remember we are zooming in on something in the vicinity of infinity to see a possible*linear form*for kappa. Once we have it, then we will proceed to estimate what we don't know.

- The
- Now, note that in the formula defining Fleiss' generalized kappa, the numerator must be divided by \(1-p_e\) or it must be multiplied by \(1/(1-p_e)\). This inverse can be rewritten as follows:
\begin{equation}
\frac1{1-p_e} = \frac{1}{1-P_e}\times\frac{1}{1-\displaystyle\biggl(\frac{p_e-P_e}{1-P_e}\biggr)}\hspace{1cm}(6)
\end{equation}
First consider the expression on the right side of the multiplication sign \(\times\) in equation (6) and let \(\varepsilon = (p_e-P_e)/(1-P_e)\). It folows from
**Taylor's theorem**that \(1/(1-\varepsilon) = 1+\varepsilon + Remainder\), where the ``Remainder" goes to 0 (in probability) as the number of subjects goes to infinity. Consequently, it follows from**Slustky's theorem**that the large-sample probability distribution of \(1/(1-\varepsilon)\) is the same as the probability distribution of \(1+\varepsilon\). It follows from equation (6) that the large-sample distribution of the inverse \(1/(1-p_e)\) is the same as the distribution of the following statistic: \begin{equation} L_e = \frac{1+(p_e-P_e)/(1-P_e)}{1-P_e}\hspace{3cm}(7). \end{equation} You will note that the \(L_e\) statistic does not involve any sample-dependent quantity in the denominator, which is the objective I wanted to accomplish. - Now, we know that the large-sample distribution of Fleiss' generalized kappa is the same as the distribution of \(\kappa_0\) defined as follows:
\begin{equation}
\kappa_0 = (p_a-p_e)L_e\hspace{4cm}(8)
\end{equation} I also know that by applying the
**Law of Large Numbers**again that the percent agreement \(p_a\) also converges (in probability) to a fixed probability \(P_a\), whose exact value may never be known to us (an issue I'll worry about later). \(\kappa_0\) can now be rewritten as: \begin{equation} \kappa_0 = \frac{p_a-P_e}{1-P_e} - (1-\kappa)\frac{p_e-P_e}{1-P_e},\hspace{2cm}(9) \end{equation} where \(\kappa=(P_a-P_e)/(1-P_e)\) is a fixed value (or estimand) to which Fleiss' kappa is expected to converge to. Our linear form is slowing and gradually taking shape. While the sample-based percent agreement \(p_a\) is already a linear expression, that is not yet the case for the percent chance agreement, which depends on \(\widehat{\pi}_k^2\) - a sample-dependent statistic that is squared. Let us deal with it. - I indicated earlier that the estimated propensity \(\widehat{\pi}_k\) for classification into category \(k\) converges (in probability) to the fixed quantity \(\pi_k\). It follows from
**Taylor's theorem**again that \(\widehat{\pi}_k^2 = \pi_k^2 +2\pi_k(\widehat{\pi}_k-\pi_k) + \mbox{Remainder}\), where the remainder goes to 0 faster than the difference \(\widehat{\pi}_k-\pi_k\). Consequently, the large-sample distribution of the difference \((p_e-P_e)\) of equation (9) is the same as that of \(2(p_{e|0}-P_e)\) where \(p_{e|0}\) is given by: \begin{equation} p_{e|0} = \sum_{k=1}^q\pi_k\widehat{\pi}_k = \frac{1}{n}\sum_{i=1}^np_{e|i}, \mbox{ where }p_{e|i}=\sum_{k=1}^q\pi_kr_{ik}/r. \hspace{1cm}(10) \end{equation} Consequently, the large-sample distribution of \(\kappa_0\) of equation (9) is the same as the distribution of \(\kappa_1\) given by, \begin{equation} \kappa_1 = \frac{p_a-P_e}{1-P_e} - 2(1-\kappa)\frac{p_{e|0}-P_e}{1-P_e}. \hspace{3cm}(11) \end{equation} Note that equation (11) is the pure linear expression I was looking for. That is, \begin{equation} \kappa_1 =\frac{1}{n}\sum_{i=1}^n\kappa_i^\star\hspace{5cm}(12), \end{equation} where \(\kappa_i^\star\) is defined right after equation (3) above. Now, we have found a simple average whose probability distribution is the same as the large-sample distribution of Fleiss' generalized kappa. All you need to do is to compute the variance of \(\kappa_1\). If there are some outstanding terms that are unknown, you estimate them based on the sample data you have. This is how these variances are calculated. - Note that the
**Central Limit Theorem**ensures that the large-sample distribution of the sample mean is Normal. Therefore, it is equation (12) that is used to show that the large-sample distribution of Fleiss' kappa is Normal and to compute the associated variance. - To conclude, I may say that for the derivation of Fleiss' generalized kappa variance, I did not need to make any special assumptions about the ratings. I only used (sometimes multiple times) the following 5 theorems:
- The Law of Large Numbers
- Taylor's Theorem
- Slustky's Theorem
- The Continuous Mapping Theorem
- The Central Limit Theorem

**Continuous Mapping Theorem (CMT)**(applied twice) allows me to deduce from equation (4) that, \begin{equation} \widehat{\pi}_k^2 \overset{P}{\longrightarrow} \pi_k^2, \mbox{ and } p_e \overset{P}{\longrightarrow} P_e\hspace{3cm}(5) \end{equation} That is, as the number of subjects grows the (random) percent agreement \(p_e\) converges in probability to a fixed quantity \(P_e\) that is the sum of all \(k\) terms \(\pi_k^2\). Note that \(p_e\) the sample-based random quantity is in lowercase, while \(P_e\), the fixed limit value is in uppercase.

**Side Note:**

A basic rule of mathematical life: if the universe hands you a hard problem, try to solve an easier one instead, and hope the simple version is close enough to the original problem that the universe doesn't object.

**Bibliography**

- Fleiss, J.L. (1971). Measuring nominal scale agreement among many raters.
*Psychological Bulletin*,**76**, 378-382. - Fleiss, J.L., Nee, J.C.M., and Landis, J.R. (1979). Large Sample Variance of Kappa in the Case of Different Sets of Raters,
*Psychological Bulletin*,**86**, 974-977. - Gwet, K.L. (2014).
*Handbook of Inter-Rater Reliability (4th Edition)*, Advanced Analytics, LLC, Maryland, USA - Gwet, K.L. (2016). Testing the Difference of Correlated Agreement Coefficients for Statistical Significance,
*Educational and Psychological Measurement*,**76(4),**609–637.