Let me consider an inter-rater reliability experiment, which involves $n$ subjects, $r$ raters, and $q$ categories into which each of the $r$ raters is expected to classify all $n$ subjects (there could be some missing ratings in case some raters do not rate all subjects, but I will ignore these practical considerations for now). Let $r_{ik}$ denote the number of raters who classified subject $i$ into category $k$. Then $\pi_k$, the probability for a random rater to classify a subject into category $k$, is given by
$$\pi_k=\frac{1}{n}\sum_{i=1}^{n}r_{ik}/r. \qquad (1)$$
The complement of this probability is given by $\pi_k^\star=1-\pi_k$, representing the probability for a rater to classify a subject into a category other than $k$.
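To make the notation concrete, here is a minimal Python sketch, using a small hypothetical `ratings` matrix of $r_{ik}$ counts (my own example values, not data from any study), that computes the classification probabilities of equation (1):

```python
import numpy as np

# Hypothetical example: n = 4 subjects, q = 3 categories, r = 5 raters.
# ratings[i, k] = r_ik, the number of raters who put subject i in category k.
ratings = np.array([
    [5, 0, 0],
    [2, 3, 0],
    [1, 1, 3],
    [0, 4, 1],
])
n, q = ratings.shape
r = ratings.sum(axis=1)[0]       # every row sums to the number of raters r

pi_k = ratings.mean(axis=0) / r  # equation (1): average of r_ik / r over subjects
print(pi_k)                      # [0.4 0.4 0.2]
```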
Fleiss' generalized kappa (cf. Fleiss, 1971) is defined as $\hat\kappa=(p_a-p_e)/(1-p_e)$, where $p_a$ is the percent agreement and $p_e$ the percent chance agreement. These two quantities are defined as follows:
$$p_a=\frac{1}{n}\sum_{i=1}^{n}\sum_{k=1}^{q}\frac{r_{ik}(r_{ik}-1)}{r(r-1)}, \qquad (2)$$
and
$$p_e=\sum_{k=1}^{q}\pi_k^2. \qquad (3)$$
In this article, formulas are derived for the standard error of kappa in the case where each subject may be rated by a different set of raters (equal in number); these formulas are valid when the number of subjects is large and the null hypothesis is true.
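Continuing the sketch above, both quantities and the resulting kappa are one-liners (for the hypothetical `ratings` matrix, this gives $\hat\kappa \approx 0.34$):

```python
# Percent agreement p_a, percent chance agreement p_e, and Fleiss' kappa.
p_a = (ratings * (ratings - 1)).sum() / (n * r * (r - 1))  # 0.575
p_e = (pi_k ** 2).sum()                                    # 0.36
kappa_hat = (p_a - p_e) / (1 - p_e)                        # ~0.336
print(p_a, p_e, kappa_hat)
```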
Consider the hypothesis that the ratings are purely random in the sense that, for each subject, the frequencies $r_{i1},r_{i2},\cdots,r_{iq}$ are a set of multinomial frequencies with parameters $r$ and $(P_1,P_2,\cdots,P_q)$, where $\sum_{k=1}^{q}P_k=1$.
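Under this hypothesis, a ratings matrix is straightforward to simulate. Here is a minimal sketch (the probabilities `P` and the sample sizes are hypothetical choices of mine); it will also serve the numerical illustrations further down:

```python
rng = np.random.default_rng(42)

P = np.array([0.5, 0.3, 0.2])   # hypothetical true category probabilities
n_sim, r_sim = 1000, 5          # hypothetical numbers of subjects and raters

# Each subject's row (r_i1, ..., r_iq) is multinomial with parameters r and P.
null_ratings = rng.multinomial(r_sim, P, size=n_sim)
print(null_ratings[:3])         # e.g. rows like [3 1 1], each summing to r
```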
- When calculating the variance of Fleiss' generalized kappa $\hat\kappa$, the first very annoying issue you face is the $1-p_e$ factor that appears in the denominator. Here is a very clear-cut way to get rid of it. The fundamental building block of the percent chance agreement $p_e$ is $\pi_k$. Since this quantity depends on the sample of $n$ subjects (see equation 1), and is expected to change as the subject sample changes, let us denote this probability by $\hat\pi_k$ to stress that it is an estimate of something fixed ($\pi_k$) that we will call the "true" propensity for classification into category $k$. Equation (1) as well as the percent chance agreement $p_e$ have to be rewritten as follows:
$$\hat\pi_k=\frac{1}{n}\sum_{i=1}^{n}r_{ik}/r, \quad\text{and}\quad p_e=\sum_{k=1}^{q}\hat\pi_k^2.$$
Since $\hat\pi_k$ is a simple average, the Law of Large Numbers (LLN) stipulates that $\hat\pi_k$ converges (in probability) to a fixed number $\pi_k$ as the number of subjects $n$ grows:
$$\hat\pi_k \xrightarrow{\;P\;} \pi_k. \qquad (4)$$
We might never know for sure what the real value of $\pi_k$ is. But that's OK, we don't really need it for now. Remember we are zooming in on something in the vicinity of infinity to see a possible linear form for kappa. Once we have it, then we will proceed to estimate what we don't know.
- The Continuous Mapping Theorem (CMT), applied twice, allows me to deduce from equation (4) that,
$$\hat\pi_k^2 \xrightarrow{\;P\;} \pi_k^2, \quad\text{and}\quad p_e \xrightarrow{\;P\;} P_e. \qquad (5)$$
That is, as the number of subjects grows, the (random) percent chance agreement $p_e$ converges in probability to a fixed quantity $P_e$, the sum of the $q$ terms $\pi_k^2$. Note that $p_e$, the sample-based random quantity, is written in lowercase, while $P_e$, the fixed limit value, is in uppercase.
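These convergences are easy to watch numerically with the simulated null setup above (reusing `rng`, `P`, and `r_sim`), where the limits are known by construction: $\pi_k=P_k$ and $P_e=\sum_k P_k^2=0.38$ for the probabilities chosen there.

```python
# Watch pi_k-hat and p_e approach their known limits as n grows.
for n_subj in (10, 100, 10_000):
    sample = rng.multinomial(r_sim, P, size=n_subj)
    pi_hat = sample.mean(axis=0) / r_sim
    p_e = (pi_hat ** 2).sum()
    print(n_subj, pi_hat.round(3), round(p_e, 4))
# pi_hat settles near P = [0.5, 0.3, 0.2] and p_e near P_e = 0.38
```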
- Now, note that in the formula defining Fleiss' generalized kappa, the numerator must be divided by $1-p_e$, or equivalently multiplied by $1/(1-p_e)$. This inverse can be rewritten as follows:
$$\frac{1}{1-p_e}=\frac{1}{1-P_e}\times\frac{1}{1-\dfrac{p_e-P_e}{1-P_e}}. \qquad (6)$$
First consider the expression on the right side of the multiplication sign $\times$ in equation (6), and let $\varepsilon=(p_e-P_e)/(1-P_e)$. It follows from Taylor's theorem that $1/(1-\varepsilon)=1+\varepsilon+\text{Remainder}$, where the "Remainder" goes to 0 (in probability) as the number of subjects goes to infinity. Consequently, it follows from Slutsky's theorem that the large-sample probability distribution of $1/(1-\varepsilon)$ is the same as the probability distribution of $1+\varepsilon$. It then follows from equation (6) that the large-sample distribution of the inverse $1/(1-p_e)$ is the same as the distribution of the following statistic:
$$L_e=\frac{1+(p_e-P_e)/(1-P_e)}{1-P_e}. \qquad (7)$$
You will note that the $L_e$ statistic does not involve any sample-dependent quantity in the denominator, which is the objective I wanted to accomplish.
- Now, we know that the large-sample distribution of Fleiss' generalized kappa is the same as the distribution of $\kappa_0$ defined as follows:
$$\kappa_0=(p_a-p_e)L_e. \qquad (8)$$
By applying the Law of Large Numbers again, I also know that the percent agreement $p_a$ converges (in probability) to a fixed probability $P_a$, whose exact value may never be known to us (an issue I'll worry about later). $\kappa_0$ can now be rewritten as:
$$\kappa_0=\frac{p_a-P_e}{1-P_e}-(1-\kappa)\frac{p_e-P_e}{1-P_e}, \qquad (9)$$
where $\kappa=(P_a-P_e)/(1-P_e)$ is the fixed value (or estimand) to which Fleiss' kappa is expected to converge. (To see this, expand the product in equation (8) and note that, in the cross term, $p_a-p_e$ may be replaced with its limit $P_a-P_e$ without affecting the large-sample distribution.) Our linear form is slowly but surely taking shape. While the sample-based percent agreement $p_a$ is already a linear expression, that is not yet the case for the percent chance agreement, which depends on $\hat\pi_k^2$, a squared sample-dependent statistic. Let us deal with it.
- I indicated earlier that the estimated propensity $\hat\pi_k$ for classification into category $k$ converges (in probability) to the fixed quantity $\pi_k$. It follows from Taylor's theorem again that $\hat\pi_k^2=\pi_k^2+2\pi_k(\hat\pi_k-\pi_k)+\text{Remainder}$, where the remainder goes to 0 faster than the difference $\hat\pi_k-\pi_k$. Consequently, the large-sample distribution of the difference $(p_e-P_e)$ of equation (9) is the same as that of $2(p_{e|0}-P_e)$, where $p_{e|0}$ is given by:
$$p_{e|0}=\sum_{k=1}^{q}\pi_k\hat\pi_k=\frac{1}{n}\sum_{i=1}^{n}p_{e|i}, \quad\text{where } p_{e|i}=\sum_{k=1}^{q}\pi_k r_{ik}/r. \qquad (10)$$
It follows that the large-sample distribution of $\kappa_0$ of equation (9) is the same as the distribution of $\kappa_1$ given by
$$\kappa_1=\frac{p_a-P_e}{1-P_e}-2(1-\kappa)\frac{p_{e|0}-P_e}{1-P_e}. \qquad (11)$$
Note that equation (11) is the pure linear expression I was looking for. That is,
$$\kappa_1=\frac{1}{n}\sum_{i=1}^{n}\kappa_i^\star, \qquad (12)$$
where $\kappa_i^\star=\dfrac{p_{a|i}-P_e}{1-P_e}-2(1-\kappa)\dfrac{p_{e|i}-P_e}{1-P_e}$, with $p_{a|i}=\sum_{k=1}^{q}r_{ik}(r_{ik}-1)/\bigl(r(r-1)\bigr)$ being subject $i$'s contribution to the percent agreement. Now, we have found a simple average whose probability distribution is the same as the large-sample distribution of Fleiss' generalized kappa. All you need to do is to compute the variance of $\kappa_1$. If there are some outstanding terms that are unknown, you estimate them based on the sample data you have. This is how these variances are calculated.
- Note that the Central Limit Theorem ensures that the large-sample distribution of a sample mean is Normal. Therefore, it is equation (12) that is used to show that the large-sample distribution of Fleiss' kappa is Normal and to compute the associated variance; a sketch of this computation follows the list of theorems below.
- To conclude, I may say that for the derivation of Fleiss' generalized kappa variance, I did not need to make any special assumptions about the ratings. I only used (sometimes multiple times) the following five theorems:
- The Law of Large Numbers
- Taylor's Theorem
- Slutsky's Theorem
- The Continuous Mapping Theorem
- The Central Limit Theorem
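To make the whole derivation concrete, here is a minimal sketch of the resulting variance computation. The function name `fleiss_kappa_and_se` is my own, and the unknown $\pi_k$, $P_e$, and $\kappa$ are replaced by their sample estimates, which is the estimation step mentioned above; a Monte-Carlo loop (restating the hypothetical null-model parameters for completeness) then checks the linearization-based standard error against the empirical variability of $\hat\kappa$:

```python
import numpy as np

# Same hypothetical null-model parameters as in the earlier sketches.
P = np.array([0.5, 0.3, 0.2])   # true category probabilities
n_sim, r_sim = 1000, 5          # subjects and raters per sample

def fleiss_kappa_and_se(ratings):
    """Fleiss' generalized kappa and a linearization-based standard error.

    `ratings` is an n x q matrix of counts r_ik; every row must sum to the
    common number of raters r. The unknown pi_k, P_e, and kappa are replaced
    by their sample estimates, as described in the text.
    """
    n, q = ratings.shape
    r = ratings.sum(axis=1)[0]

    pi_hat = ratings.mean(axis=0) / r                  # estimates pi_k
    p_a_i = (ratings * (ratings - 1)).sum(axis=1) / (r * (r - 1))
    p_a = p_a_i.mean()                                 # percent agreement
    p_e = (pi_hat ** 2).sum()                          # estimates P_e
    kappa = (p_a - p_e) / (1 - p_e)                    # estimates kappa

    # Per-subject linearized values kappa_i-star of equation (12).
    p_e_i = ratings @ pi_hat / r
    kappa_star = ((p_a_i - p_e) - 2 * (1 - kappa) * (p_e_i - p_e)) / (1 - p_e)

    # kappa_1 is a simple average of the kappa_i-star, so its variance
    # is estimated by the sample variance of kappa_i-star divided by n.
    return kappa, np.sqrt(kappa_star.var(ddof=1) / n)

# Monte-Carlo check: the linearization-based standard error should match
# the empirical standard deviation of kappa-hat across simulated samples.
rng = np.random.default_rng(7)
results = [fleiss_kappa_and_se(rng.multinomial(r_sim, P, size=n_sim))
           for _ in range(2000)]
kappas, ses = zip(*results)
print(np.std(kappas))   # empirical standard deviation of kappa-hat
print(np.mean(ses))     # average linearization-based standard error
```

For large $n$, the two printed numbers should agree closely, which is exactly what the linear form of equation (12) promises.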
A basic rule of mathematical life: if the universe hands you a hard problem, try to solve an easier one instead, and hope the simple version is close enough to the original problem that the universe doesn't object.
Bibliography
- Fleiss, J. L. (1971). Measuring nominal scale agreement among many raters. Psychological Bulletin, 76, 378-382.
- Fleiss, J. L., Nee, J. C. M., and Landis, J. R. (1979). Large sample variance of kappa in the case of different sets of raters. Psychological Bulletin, 86, 974-977.
- Gwet, K. L. (2014). Handbook of Inter-Rater Reliability (4th Edition). Advanced Analytics, LLC, Maryland, USA.
- Gwet, K. L. (2016). Testing the difference of correlated agreement coefficients for statistical significance. Educational and Psychological Measurement, 76(4), 609-637.