*I decided not to use the variance formula that Fleiss, Nee, and Landis proposed and would strongly discourage anybody else from using it for any purpose other than for testing the null hypothesis of no agreement among raters*. In this post, I explain the rationale behind my decision, and will briefly discuss my alternative approach.

Let me consider an inter-rater reliability experiment that involves \(n\) subjects, \(r\) raters, and \(q\) categories into which each of the \(r\) raters is expected to classify all \(n\) subjects (there could be some missing ratings if some raters do not rate all subjects, but I will ignore these practical considerations for now). Let \(r_{ik}\) be the number of raters who classified subject \(i\) into category \(k\). Then \(\pi_k\), the probability for a random rater to classify a subject into category \(k\), is given by, \[ \pi_k = \frac{1}{n}\sum_{i=1}^nr_{ik}/r\hspace{3cm}(1). \] The complement of this probability, \(\pi_k^\star = 1-\pi_k\), represents the probability for a rater to classify a subject into a category other than \(k\).
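To make equation (1) concrete, here is a minimal Python sketch; the count matrix and variable names are mine, purely for illustration:

```python
# Classification counts r[i][k]: number of raters (out of r = 4) who put
# subject i into category k. Made-up example: n = 3 subjects, q = 3 categories.
counts = [
    [4, 0, 0],
    [1, 3, 0],
    [0, 2, 2],
]
n, q, r = len(counts), len(counts[0]), 4

# Equation (1): pi_k = (1/n) * sum_i (r_ik / r)
pi = [sum(counts[i][k] for i in range(n)) / (n * r) for k in range(q)]

# Complement: probability of classification into any category other than k.
pi_star = [1 - p for p in pi]

print(pi)  # -> [0.4166..., 0.4166..., 0.1666...]
```

Note that the \(\pi_k\) always sum to 1, since every rating falls into exactly one category.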

Fleiss' generalized kappa (cf. Fleiss, 1971) is defined as \(\widehat{\kappa} = (p_a-p_e)/(1-p_e)\), where \(p_a\) is the percent agreement and \(p_e\) the percent chance agreement. These two quantities are defined as follows:

\begin{equation} p_a = \frac1n\sum_{i=1}^n\sum_{k=1}^q\frac{r_{ik}(r_{ik}-1)}{r(r-1)},\mbox{ and }p_e = \sum_{k=1}^q\pi_k^2. \end{equation}

**Variance Proposed by Fleiss et al. (1979)**

*In this article, formulas for the standard error of kappa in the case of different sets of equal numbers of raters that are valid when the number of subjects is large and the null hypothesis is true are derived.*

Consider the hypothesis that the ratings are purely random in the sense that for each subject, the frequencies \(n_{i1}, n_{i2}, \cdots, n_{ik}\) are a set of multinomial frequencies with parameters \(n\) and \((P_1, P_2, \cdots, P_k)\), where \(\sum P_j = 1\).

**Equation (2) should be used only for testing the null hypothesis of no agreement among raters. It should never be used to construct confidence intervals, nor for anything else unrelated to hypothesis testing**.
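Before moving on, Fleiss' generalized kappa itself is straightforward to compute from the definitions given earlier. The sketch below is purely illustrative; the function name and the data are mine, not from any package:

```python
def fleiss_kappa(counts, r):
    """Fleiss' generalized kappa from an n-by-q matrix of counts r_ik,
    where every subject is rated by the same number r of raters."""
    n = len(counts)
    q = len(counts[0])
    # Percent agreement p_a: average over subjects of pairwise rater agreement.
    p_a = sum(
        sum(c * (c - 1) for c in row) / (r * (r - 1)) for row in counts
    ) / n
    # Classification probabilities pi_k, then percent chance agreement p_e.
    pi = [sum(row[k] for row in counts) / (n * r) for k in range(q)]
    p_e = sum(p * p for p in pi)
    return (p_a - p_e) / (1 - p_e)

# Perfect agreement on every subject gives kappa = 1.
print(fleiss_kappa([[4, 0], [0, 4], [4, 0]], r=4))  # -> 1.0
```

A 50/50 split of the raters on every subject, by contrast, yields a negative kappa, since observed agreement then falls below chance agreement.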

**Variance Proposed by Gwet (2014)**

**How do you get to equation (3)?**

*Can we not find a linear expression that is sufficiently close to kappa, derive the variance of that linear expression, and show that it is sufficiently close to the actual variance of kappa?* As it turns out, when the number of subjects is very large, Fleiss' generalized kappa tends to take a linear form, and the larger the number of subjects, the closer that linear form gets to kappa. But what does that linear form look like? Since the linearity only emerges when the number of subjects is far larger than what you would normally have in practice, we need to be able to zoom in enough to take a decent shot at that linear form from a distance. Finding a powerful zoom and using it well is the challenge we need to overcome. To be successful, we need to tackle this issue step by step, piece by piece.

- When calculating the variance of Fleiss' generalized kappa \(\widehat{\kappa}\), the first very annoying issue you face is the \(1-p_e\) factor that appears in the denominator. Here is a clear-cut way to get rid of it. The fundamental building block of the percent chance agreement \(p_e\) is \(\pi_k\). Since this quantity depends on the sample of \(n\) subjects (see equation 1) and is expected to change as the subject sample changes, let us denote it by \(\widehat{\pi}_k\) to stress that it is an approximation of something fixed \((\pi_k)\) that we will call the "true" propensity for classification into category \(k\). Equation (1) as well as the percent chance agreement \(p_e\) must then be rewritten as follows:

\begin{equation} \widehat{\pi}_k = \frac1n\sum_{i=1}^nr_{ik}/r, \mbox{ and }p_e = \sum_{k=1}^q\widehat{\pi}_k^2. \end{equation} Since \(\widehat{\pi}_k\) is a simple average, the **Law of Large Numbers (LLN)** stipulates that \(\widehat{\pi}_k\) converges (in probability) to a fixed number \(\pi_k\) as the number of subjects \(n\) grows. \begin{equation} \widehat{\pi}_k \overset{P}{\longrightarrow} \pi_k\hspace{3cm}(4) \end{equation} We might never know for sure what the real value of \(\pi_k\) is. But that's OK, we don't really need it for now. Remember, we are zooming in on something in the vicinity of infinity to see a possible *linear form* for kappa. Once we have it, we will proceed to estimate what we don't know.
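The convergence claimed in equation (4) is easy to watch in a quick simulation. The sketch below is illustrative only: the "true" propensities and the number of raters are values I made up, and each rating is drawn independently from a fixed multinomial:

```python
import random

random.seed(42)
true_pi = [0.5, 0.3, 0.2]  # assumed "true" propensities (made up)
r = 4                      # raters per subject (made up)

def pi_hat(n):
    """Estimate pi_k from n simulated subjects, each rated r times."""
    totals = [0, 0, 0]
    for _ in range(n):
        for _ in range(r):
            k = random.choices(range(3), weights=true_pi)[0]
            totals[k] += 1
    return [t / (n * r) for t in totals]

for n in (10, 100, 10000):
    print(n, [round(x, 3) for x in pi_hat(n)])
# The estimates drift toward (0.5, 0.3, 0.2) as n grows.
```

This is the LLN at work: \(\widehat{\pi}_k\) is a simple average of \(nr\) independent indicator variables, so it settles on \(\pi_k\) as \(n\) grows.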

- The **Continuous Mapping Theorem (CMT)** (applied twice) allows me to deduce from equation (4) that, \begin{equation} \widehat{\pi}_k^2 \overset{P}{\longrightarrow} \pi_k^2, \mbox{ and } p_e \overset{P}{\longrightarrow} P_e\hspace{3cm}(5) \end{equation} That is, as the number of subjects grows, the (random) percent chance agreement \(p_e\) converges in probability to a fixed quantity \(P_e\), the sum of the \(q\) terms \(\pi_k^2\). Note that the sample-based random quantity \(p_e\) is in lowercase, while the fixed limit value \(P_e\) is in uppercase.
- Now, note that in the formula defining Fleiss' generalized kappa, the numerator \(p_a-p_e\) is divided by \(1-p_e\); equivalently, it is multiplied by \(1/(1-p_e)\). This inverse can be rewritten as follows:
\begin{equation}
\frac1{1-p_e} = \frac{1}{1-P_e}\times\frac{1}{1-\displaystyle\biggl(\frac{p_e-P_e}{1-P_e}\biggr)}\hspace{1cm}(6)
\end{equation}
First consider the expression on the right side of the multiplication sign \(\times\) in equation (6) and let \(\varepsilon = (p_e-P_e)/(1-P_e)\). It follows from **Taylor's theorem** that \(1/(1-\varepsilon) = 1+\varepsilon + \mbox{Remainder}\), where the "Remainder" goes to 0 (in probability) as the number of subjects goes to infinity. Consequently, it follows from **Slutsky's theorem** that the large-sample probability distribution of \(1/(1-\varepsilon)\) is the same as the probability distribution of \(1+\varepsilon\). It then follows from equation (6) that the large-sample distribution of the inverse \(1/(1-p_e)\) is the same as the distribution of the following statistic: \begin{equation} L_e = \frac{1+(p_e-P_e)/(1-P_e)}{1-P_e}\hspace{3cm}(7). \end{equation} You will note that the \(L_e\) statistic does not involve any sample-dependent quantity in the denominator, which is the objective I wanted to accomplish.

- Now, we know that the large-sample distribution of Fleiss' generalized kappa is the same as the distribution of \(\kappa_0\) defined as follows:
\begin{equation}
\kappa_0 = (p_a-p_e)L_e\hspace{4cm}(8)
\end{equation} By applying the **Law of Large Numbers** again, the percent agreement \(p_a\) also converges (in probability) to a fixed probability \(P_a\), whose exact value may never be known to us (an issue I'll worry about later). \(\kappa_0\) can now be rewritten as: \begin{equation} \kappa_0 = \frac{p_a-P_e}{1-P_e} - (1-\kappa)\frac{p_e-P_e}{1-P_e},\hspace{2cm}(9) \end{equation} where \(\kappa=(P_a-P_e)/(1-P_e)\) is the fixed value (or estimand) to which Fleiss' kappa is expected to converge. Our linear form is slowly and gradually taking shape. While the sample-based percent agreement \(p_a\) is already a linear expression, that is not yet the case for the percent chance agreement, which depends on \(\widehat{\pi}_k^2\), a sample-dependent statistic that is squared. Let us deal with it.

- I indicated earlier that the estimated propensity \(\widehat{\pi}_k\) for classification into category \(k\) converges (in probability) to the fixed quantity \(\pi_k\). It follows from **Taylor's theorem** again that \(\widehat{\pi}_k^2 = \pi_k^2 +2\pi_k(\widehat{\pi}_k-\pi_k) + \mbox{Remainder}\), where the remainder goes to 0 faster than the difference \(\widehat{\pi}_k-\pi_k\). Consequently, the large-sample distribution of the difference \(p_e-P_e\) of equation (9) is the same as that of \(2(p_{e|0}-P_e)\), where \(p_{e|0}\) is given by: \begin{equation} p_{e|0} = \sum_{k=1}^q\pi_k\widehat{\pi}_k = \frac{1}{n}\sum_{i=1}^np_{e|i}, \mbox{ where }p_{e|i}=\sum_{k=1}^q\pi_kr_{ik}/r. \hspace{1cm}(10) \end{equation} Consequently, the large-sample distribution of \(\kappa_0\) of equation (9) is the same as the distribution of \(\kappa_1\) given by, \begin{equation} \kappa_1 = \frac{p_a-P_e}{1-P_e} - 2(1-\kappa)\frac{p_{e|0}-P_e}{1-P_e}. \hspace{3cm}(11) \end{equation} Equation (11) is the pure linear expression I was looking for. That is, \begin{equation} \kappa_1 =\frac{1}{n}\sum_{i=1}^n\kappa_i^\star\hspace{5cm}(12), \end{equation} where \(\kappa_i^\star\) is defined right after equation (3) above. We have now found a simple average whose probability distribution is the same as the large-sample distribution of Fleiss' generalized kappa. All you need to do is compute the variance of \(\kappa_1\); if some of its terms are unknown, you estimate them from the sample data you have. This is how these variances are calculated.

- Note that the **Central Limit Theorem** ensures that the large-sample distribution of a sample mean is Normal. Therefore, it is equation (12) that is used to show that the large-sample distribution of Fleiss' kappa is Normal, and to compute the associated variance.

- To conclude, I may say that for the derivation of the variance of Fleiss' generalized kappa, I did not need to make any special assumptions about the ratings. I only used (sometimes multiple times) the following five theorems:
- The Law of Large Numbers
- Taylor's Theorem
- Slutsky's Theorem
- The Continuous Mapping Theorem
- The Central Limit Theorem
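The whole argument can be checked by simulation. The sketch below is illustrative only: the rating propensities and sample sizes are made up, and since this excerpt does not reproduce the definition of \(\kappa_i^\star\), I reconstruct it from equation (11), replacing the unknown \(P_e\), \(\kappa\), and \(\pi_k\) with their sample counterparts. The variance of the resulting sample mean is then compared with the Monte Carlo variance of Fleiss' kappa over repeated experiments:

```python
import random
import statistics

random.seed(7)
q, r, n = 3, 4, 80                 # categories, raters, subjects (made up)
true_pi = [0.5, 0.3, 0.2]          # assumed rating propensities (made up)

def simulate_counts():
    """One experiment: n subjects, each independently rated r times."""
    counts = []
    for _ in range(n):
        row = [0] * q
        for _ in range(r):
            row[random.choices(range(q), weights=true_pi)[0]] += 1
        counts.append(row)
    return counts

def kappa_and_var(counts):
    """Fleiss' kappa plus a linearization-based variance estimate:
    plug sample estimates into the kappa_i^star terms of equation (12)
    and take the variance of their sample mean."""
    pa_i = [sum(c * (c - 1) for c in row) / (r * (r - 1)) for row in counts]
    pi = [sum(row[k] for row in counts) / (n * r) for k in range(q)]
    pe = sum(p * p for p in pi)
    pa = sum(pa_i) / n
    kap = (pa - pe) / (1 - pe)
    pe_i = [sum(pi[k] * row[k] for k in range(q)) / r for row in counts]
    k_star = [(pa_i[i] - pe) / (1 - pe)
              - 2 * (1 - kap) * (pe_i[i] - pe) / (1 - pe)
              for i in range(n)]
    return kap, statistics.variance(k_star) / n  # Var of the mean of k_star

# Compare the analytic estimate with the Monte Carlo variance of kappa-hat.
kappas, var_ests = [], []
for _ in range(500):
    k, v = kappa_and_var(simulate_counts())
    kappas.append(k)
    var_ests.append(v)
print(statistics.variance(kappas))  # empirical variance of kappa-hat
print(statistics.mean(var_ests))    # average linearization-based estimate
```

Under these made-up settings the two printed numbers land close to each other, which is exactly what the linearization argument promises for large \(n\).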


**Side Note:**

A basic rule of mathematical life: if the universe hands you a hard problem, try to solve an easier one instead, and hope the simple version is close enough to the original problem that the universe doesn't object.

**Bibliography**

- Fleiss, J.L. (1971). Measuring nominal scale agreement among many raters. *Psychological Bulletin*, **76**, 378-382.
- Fleiss, J.L., Nee, J.C.M., and Landis, J.R. (1979). Large sample variance of kappa in the case of different sets of raters. *Psychological Bulletin*, **86**, 974-977.
- Gwet, K.L. (2014). *Handbook of Inter-Rater Reliability (4th Edition)*. Advanced Analytics, LLC, Maryland, USA.
- Gwet, K.L. (2016). Testing the difference of correlated agreement coefficients for statistical significance. *Educational and Psychological Measurement*, **76**(4), 609-637.
