K. Gwet's Inter-Rater Reliability Blog Inter-rater reliability: Cohen kappa, Gwet AC1/AC2, Krippendorff Alpha

Wednesday, January 11, 2023

On Bangdiwala's B Coefficient and Its Variance

Bangdiwala's B is an agreement coefficient for 2 raters in which agreement and disagreement are measured by the areas of rectangles. It was suggested by Bangdiwala (1985). See Bangdiwala, S.I., Shankar, V. (2013) and Shankar, V., Bangdiwala, S.I. (2014) for more information about this coefficient.

Consider Table 1, an abstract table showing the distribution of \(n\) subjects by rater and category. Both raters Rater A and Rater B have classified each of the \(n\) subjects into one of \(q\) possible categories.

Table 1: Distribution of n subjects by rater and by category

Bangdiwala's B statistic, denoted by \(\widehat{B}\), is given by the following equation:

\[\widehat{B}=\frac{\displaystyle\sum_{k=1}^qp_{kk}^2}{\displaystyle\sum_{k=1}^qp_{k+}p_{+k}}, \hspace{3cm}(1) \]

where \(p_{kk}=n_{kk}/n\), \(p_{k+}=n_{k+}/n\) and \(p_{+k}=n_{+k}/n\). This coefficient will be rewritten as \(\widehat{B}=\widehat{B}_1/\widehat{B}_2\), \(\widehat{B}_1\) and \(\widehat{B}_2\) being respectively the numerator and the denominator of equation (1).

The variance estimator of Bangdiwala's B that I propose is given by the following equation:

\[var\bigl(\widehat{B}\bigr) = \frac{2}{n\widehat{B}_2^2}\Biggl[2\sum_{k=1}^qp_{kk}^2\bigl(p_{kk}-2\widehat{B}\pi_k\bigr) + \widehat{B}^2\biggl(\sum_{k=1}^q\pi_kp_{k+}p_{+k} + \sum_{k=1}^q\sum_{l=1}^qp_{kl}p_{+k}p_{l+}\biggr)\Biggr]\]

where \(\pi_k=(p_{+k}+p_{k+})/2\), and \(p_{kl}=n_{kl}/n\).
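As a minimal sketch (not the program used for this post), equation (1) and the variance estimator could be coded in R as follows. The function name `bangdiwala_b` and the 3 x 3 table of counts are hypothetical. The bracketed expression is divided by the number of subjects \(n\); this is an assumption of the sketch, reflecting the usual large-sample scaling of a variance.

```r
# Sketch: Bangdiwala's B and its variance from a q x q table of counts
# (rows = Rater A, columns = Rater B). The function name is hypothetical.
bangdiwala_b <- function(tab) {
  tab <- as.matrix(tab)
  stopifnot(nrow(tab) == ncol(tab))
  n    <- sum(tab)
  p    <- tab / n              # p_kl
  pkk  <- diag(p)              # p_kk
  prow <- rowSums(p)           # p_{k+}
  pcol <- colSums(p)           # p_{+k}
  pik  <- (prow + pcol) / 2    # pi_k
  B1 <- sum(pkk^2)             # numerator of equation (1)
  B2 <- sum(prow * pcol)       # denominator of equation (1)
  B  <- B1 / B2
  # Double sum over k and l of p_kl * p_{+k} * p_{l+}
  dsum <- sum(p * outer(pcol, prow))
  bracket <- 2 * sum(pkk^2 * (pkk - 2 * B * pik)) +
    B^2 * (sum(pik * prow * pcol) + dsum)
  # Division by n is an assumption of this sketch
  v <- 2 * bracket / (n * B2^2)
  list(B = B, var = v)
}

# Hypothetical table of counts, for illustration only
tab <- matrix(c(20,  3,  1,
                 4, 15,  2,
                 1,  2, 12), nrow = 3, byrow = TRUE)
res <- bangdiwala_b(tab)
res$B          # point estimate
sqrt(res$var)  # standard error
```

With perfect agreement (all counts on the diagonal) the function returns \(\widehat{B}=1\) and a variance of 0, which is a quick sanity check of the coding.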

How do I know this variance estimator works?

Almost all statistics that can be represented as continuous functions of some variables have a probability distribution that is asymptotically Normal. That is, as the sample size (in our case the number of subjects) increases, the law of probability associated with the statistic in question looks more and more like the Normal distribution.

Traditionally, the way to demonstrate that a variance formula works is to start with an initial population of subjects for which the \(B\) coefficient is known and represents the target parameter to be estimated from smaller samples. Then, you would conduct a Monte-Carlo experiment by simulating a large number of samples of subjects of various sizes selected from the population. For each simulated sample of subjects, you would compute a 95% confidence interval. This process will produce a long series of 95% confidence intervals. If the variance formula works, approximately 95% of the confidence intervals will include the target parameter \(B\), and the coverage rate should improve as the sample size increases.

The Monte-Carlo Simulation

I first created the subject population, a file containing 10,000 records, an extract of which is shown in Table 2. This file contains ratings from Rater A and Rater B, who classified each of the 10,000 subjects into one of 4 categories labeled as 1, 2, 3 and 4. You can download the entire subject population file as an Excel spreadsheet. Bangdiwala's coefficient calculated for this population is \(B=0.2730\) and is the estimand, or the target population parameter that each sample will approximate.

Note that the classification of subjects into categories was done randomly according to the classification probabilities shown in Table 3. For example, both raters would classify a subject into category 1 with probability \(p_{11}=0.251\), whereas Rater A and Rater B would classify a subject into categories 1 and 2 respectively with probability \(p_{12}=0.034\), and so on. An appendix at the end of this post shows an R script used for creating the subject population.

The next step was to select 4,000 samples of a given size from the subject population and use each of them to compute Bangdiwala's B estimate with equation (1), along with its variance estimate. For each sample \(s\), I calculated \(\widehat{B}_s\), its variance \(v_s\), the associated 95% confidence interval given by \(\bigl(\widehat{B}_s - 1.96\sqrt{v_s};\widehat{B}_s+1.96\sqrt{v_s}\bigr)\), and a 0-1 dichotomous variable \(I_s\) that indicates whether the population value \(B=0.273\) is included in the 95% confidence interval or not. Consequently, for each sample size (\(n\)), a file with 4,000 rows and 4 columns is created. I repeated this process with sample sizes \(n=25, 50, 75, 100, 125, 150, 175, 200, 250, 300, 350\).

I expect the average of all 4,000 \(I_s\) values to be reasonably close to 95%, and this coverage rate should get closer and closer to 95% as the sample size increases. It is indeed the case, as shown in Figure 1. Even for a sample size as small as 25, the coverage rate is already close to 92%. From a sample size of 75 onward, all coverage rates fluctuate between 94% and 95.5%.

The Monte-Carlo simulation program was written in R and can be downloaded. It also contains commands for creating the subject population, although it may not generate the exact same population as the one I used here, because the random number generator will use a different seed.
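Independently of the downloadable program, the experiment described above can be sketched in a few lines of R. This is a simplified illustration under stated assumptions: the helper names (`b_stat`, `to_table`, `coverage`), the seed, the 1,000 replicates and the 3 sample sizes are all hypothetical choices, and the variance is divided by \(n\) by assumption. The population is regenerated from the Table 3 probabilities, so its \(B\) value will differ slightly from 0.2730.

```r
set.seed(2023)  # arbitrary seed, only to make this sketch reproducible

# B estimate and variance estimate from a q x q table of counts
# (variance divided by n by assumption)
b_stat <- function(tab) {
  n <- sum(tab); p <- tab / n
  pkk <- diag(p); prow <- rowSums(p); pcol <- colSums(p)
  pik <- (prow + pcol) / 2
  B2 <- sum(prow * pcol)
  B  <- sum(pkk^2) / B2
  bracket <- 2 * sum(pkk^2 * (pkk - 2 * B * pik)) +
    B^2 * (sum(pik * prow * pcol) + sum(p * outer(pcol, prow)))
  c(B = B, var = 2 * bracket / (n * B2^2))
}

# Classification probabilities of Table 3, listed row by row (row = Rater A)
probs <- c(0.251, 0.034, 0.004, 0.007,
           0.216, 0.074, 0.020, 0.005,
           0.067, 0.094, 0.034, 0.040,
           0.020, 0.047, 0.020, 0.067)

# Turn a vector of cell indices (1..16) into a 4 x 4 table of counts
to_table <- function(cells) {
  matrix(tabulate(cells, nbins = 16), nrow = 4, ncol = 4, byrow = TRUE)
}

# Subject population and its B value (the estimand)
pop   <- sample(1:16, size = 10000, replace = TRUE, prob = probs)
B_pop <- unname(b_stat(to_table(pop))["B"])

# Share of 95% confidence intervals that contain B_pop for samples of size n
coverage <- function(n, n_rep = 1000) {
  hits <- replicate(n_rep, {
    est  <- b_stat(to_table(sample(pop, n)))
    half <- 1.96 * sqrt(est["var"])
    (est["B"] - half <= B_pop) && (B_pop <= est["B"] + half)
  })
  mean(hits)
}

sizes     <- c(25, 100, 300)
cov_rates <- sapply(sizes, coverage)
data.frame(n = sizes, coverage = cov_rates)
```

Increasing `n_rep` to 4,000 and extending `sizes` to the full list of sample sizes brings the sketch closer to the experiment reported here, at the cost of a longer run time.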

Table 2: Extract of the population of 10,000 subjects

Table 3: Classification probabilities used for creating the subject population

Figure 1: Coverage rates of 95% confidence intervals by sample size

Table 4: Summary statistics from the Monte-Carlo experiment

References

Bangdiwala S (1985) A graphical test for observer agreement. Proc 45th Int Stats Institute Meeting, Amsterdam, 1, 307–308
Bangdiwala, S.I., Shankar, V. The agreement chart. BMC Med Res Methodol 13, 97 (2013). https://doi.org/10.1186/1471-2288-13-97.
Shankar, V., Bangdiwala, S.I. Observer agreement paradoxes in 2x2 tables: comparison of agreement measures. BMC Med Res Methodol 14, 100 (2014). https://doi.org/10.1186/1471-2288-14-100

Appendix: R script for creating the population of 10,000 subjects

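library(tibble)  # provides as_tibble()

pop.size <- 10000
# Each code encodes a pair of ratings: first digit = Rater A, second digit = Rater B
sframe <- sample(x = c(11,12,13,14,
                       21,22,23,24,
                       31,32,33,34,
                       41,42,43,44),
                 size = pop.size,
                 prob = c(0.251,0.034,0.004,0.007,
                          0.216,0.074,0.020,0.005,
                          0.067,0.094,0.034,0.040,
                          0.020,0.047,0.020,0.067),
                 replace = TRUE)
sframe1 <- as.matrix(sframe)
no.neuro <- trunc(sframe1/10)              # first digit: category chosen by Rater A
w.neuro <- sframe1 - 10*trunc(sframe1/10)  # second digit: category chosen by Rater B
sframe2 <- cbind(sframe1, no.neuro, w.neuro)
sfra <- as_tibble(sframe2) #- This is the subject population -
# Optional export: needs a package that provides write.xlsx() and a user-defined datadir
#write.xlsx(sfra,file=paste0(datadir,"sfra.xlsx"))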

Saturday, August 28, 2021

Cohen's Kappa paradoxes make sample size calculation impossible

Cohen's kappa coefficient often yields unduly low estimates, which can be counterintuitive when compared to the observed agreement level quantified by the percent agreement. This problem has been referred to in the literature as the Kappa paradoxes and has been widely discussed by several authors, including Feinstein and Cicchetti (1990).

Although researchers have primarily been concerned about the magnitude of kappa, another equally serious and often overlooked consequence of the paradoxes is the difficulty of performing sample size calculations. Suppose you want to know the number \(n\) of subjects that is required to obtain a standard error of kappa smaller than 0.30. The surprising reality is that, no matter how large the number \(n\) of subjects, there is no guarantee that kappa's standard error will be smaller than 0.30. In other words, a particular set of ratings can always be found that yields a standard error exceeding 0.30.

Note that for an arbitrarily large number of raters \(r\), Conger's kappa, which reduces to Cohen's kappa for \(r=2\), Krippendorff's alpha or Fleiss' generalized kappa have similar large-sample variances. Therefore, I have decided to investigate Fleiss' generalized kappa only. The maximum variance of Fleiss' kappa is given by:

\[MaxVar\bigl(\widehat{\kappa}_F\bigr) =\frac{an}{n-b},\hspace{3cm}(1)\]

where \(a\) and \(b\) are 2 constants that depend on the number of raters \(r\) and the number of categories \(q\). For more details about the derivation of this expression see Gwet (2021, chapter 6).

For 2 raters, \(a=0.099\) and \(b=3.08\). Since \(n/(n-b)>1\) for every \(n>b\), the maximum variance always exceeds \(a\). Consequently, even if the number of subjects goes to infinity, the maximum standard error will still exceed \(\sqrt{a}=\sqrt{0.099}\approx 0.31\). That is, it will always be possible to find a set of ratings that leads to a standard error that exceeds 0.3.
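The behavior of equation (1) is easy to tabulate. The short R sketch below evaluates the maximum standard error \(\sqrt{an/(n-b)}\) for a few arbitrarily chosen numbers of subjects, using the 2-rater constants quoted above; the variable names are illustrative only.

```r
# Constants for 2 raters, as quoted in the post
a <- 0.099
b <- 3.08

# Arbitrary numbers of subjects, chosen for illustration
n <- c(10, 50, 100, 1000, 1e6)

# Maximum standard error implied by MaxVar = a * n / (n - b)
max_se <- sqrt(a * n / (n - b))
data.frame(n = n, max_se = max_se)

# The bound decreases with n but never drops to 0.3 or below
all(max_se > 0.3)
```

The maximum standard error shrinks toward \(\sqrt{a}\) as \(n\) grows, but it never falls below that limit, which is why no sample size can guarantee a standard error under 0.30.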

Bibliography

Feinstein, A.R. and D.V. Cicchetti (1990), "High agreement but low kappa: I. The problems of two paradoxes." Journal of Clinical Epidemiology, 43, 543-549.

Gwet, K. (2021), Handbook of Inter-Rater Reliability, 5th Edition. Volume 1: Analysis of Categorical Ratings, AgreeStat Analytics, Maryland USA

Tuesday, March 30, 2021

Agreement Among 3 Raters or More When A Subject Can be Rated by No More than 2 Raters

Most methods proposed in the literature for evaluating the extent of agreement among 3 raters or more assume that each rater is expected to rate all subjects. In some inter-rater reliability applications, however, this requirement cannot be satisfied, either because of the prohibitive costs associated with the rating process or because the rating process is too demanding for a human subject. For example, scientific laboratories are often rated by accrediting agencies to have their work quality officially certified. These accrediting agencies themselves need to conduct inter-rater reliability studies to demonstrate the high quality of their accreditation process. Given the high costs associated with accrediting a laboratory (a staggering number of lab procedures must be verified and documentation reviewed), agencies are willing to fund a single round of rating for each laboratory with one rater, and use another rater to provide the ratings during the regular accreditation process, which is funded by each lab.

The question now becomes "Is it possible to evaluate the extent of agreement among 3 raters or more, given that a maximum of 2 raters are allowed to rate the same subject?" The good news is that it is indeed possible to design an experiment that would achieve that goal. However, a price must be paid to make this happen. The agreement coefficient based on such a design will have a higher variance than the traditional coefficient based on the fully-crossed design where each rater must rate all subjects. The general approach is as follows:

Suppose your problem is to quantify the extent of agreement among the group of 5 raters \({\cal R}=\{Rater1, Rater2, Rater3, Rater4, Rater5 \}\).
Out of the roster of 5 raters \(\cal R\), one can form the following 10 different pairs of raters (Note that if \(r\) is the number of raters, then the associated number of pairs that can be formed is \(r(r-1)/2=5\times4/2=10\)):

Suppose that a total of \(n=15\) subjects will participate in your experiment. The procedure consists of selecting 15 pairs of raters randomly and with replacement (i.e. one pair of raters could be selected more than once) from the above 10 pairs. The 15 selected pairs of raters will be assigned to the 15 subjects on a flow basis (i.e. sequentially as they are selected).
Select with replacement 15 random integers between 1 and 10. Suppose the 15 random integers are \(\{2, 6, 2, 5, 4, 1, 8, 1, 3, 3, 5, 4, 2, 5, 9\}\). That is, the \(2^{nd}\) pair (Rater1, Rater3) will be assigned to subjects 1, 3 and 13. The \(6^{th}\) pair (Rater2, Rater4) will be assigned to subject 2, and so on. The experimental design will look like this:
Once all 15 subjects are rated, the dataset of ratings will have 3 columns. The first column, Subject, will identify subjects; the remaining 2 columns will contain the ratings from the different pairs of raters assigned to subjects. The agreement coefficient will then be calculated as if the same 2 raters produced all the ratings. What will be different is the variance associated with the agreement coefficient.
What is described here is referred to as a Partially Crossed design with 2 raters per subject, or \(\textsf{PC}_2\) design, and is discussed in detail in the \(5^{th}\) edition of the Handbook of Inter-Rater Reliability to be released in July of 2021.
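The assignment of rater pairs to subjects described in the steps above can be sketched in R as follows. The object names are illustrative. The pairs are listed in the order produced by `combn()`, an ordering assumed here because it is consistent with the \(2^{nd}\) and \(6^{th}\) pairs quoted in the text.

```r
# Roster of 5 raters and the r(r-1)/2 = 10 possible pairs, one pair per row
raters <- paste0("Rater", 1:5)
pairs  <- t(combn(raters, 2))

# The 15 integers quoted in the text; for a fresh design one would use
# sample(1:nrow(pairs), size = 15, replace = TRUE) instead
draws <- c(2, 6, 2, 5, 4, 1, 8, 1, 3, 3, 5, 4, 2, 5, 9)

# Pairs are assigned to subjects sequentially, in the order drawn
design <- data.frame(Subject     = 1:15,
                     Pair        = draws,
                     FirstRater  = pairs[draws, 1],
                     SecondRater = pairs[draws, 2])
design
```

Each row of `design` names the 2 raters who will rate that subject, so no subject is ever rated by more than 2 raters even though all 5 raters may take part in the study.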

Monday, February 22, 2021

Testing the Difference Between 2 Agreement Coefficients for Statistical Significance

Researchers who use chance-corrected agreement coefficients such as Cohen's Kappa, Gwet's AC1 or AC2, Fleiss' Kappa and many other alternatives in their research often need to compare two coefficients calculated with 2 different sets of ratings. A rigorous way to do such a comparison is to evaluate the difference between these 2 coefficients for statistical significance. This issue was extensively discussed in my paper entitled Testing the Difference of Correlated Agreement Coefficients for Statistical Significance. AgreeTest, a cloud-based application, can help you perform the techniques discussed in this paper and more. Do not hesitate to check it out when you find time.

The 2 sets of ratings used to compute the agreement coefficients under comparison may be totally independent or may have several aspects in common. Here are 2 possible scenarios you may encounter in practice:

Both datasets of ratings were produced by 2 independent samples of subjects and 2 independent groups of raters. In this case, the 2 agreement coefficients associated with these datasets are said to be uncorrelated. Their difference can be tested for statistical significance with an Unpaired t-Test (also implemented in AgreeTest).
Both datasets of ratings were produced either by 2 overlapping samples of subjects or 2 overlapping groups of raters, or both. In this case, the 2 agreement coefficients associated with these datasets are said to be correlated. Their difference can be tested for statistical significance with a Paired t-Test (also implemented in AgreeTest).
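For the uncorrelated scenario only, a generic large-sample comparison can be sketched in R as shown below. This is a textbook test for 2 independent estimates based on the Normal approximation, and is not necessarily the procedure implemented in AgreeTest. The coefficients and standard errors are hypothetical numbers chosen for illustration. The correlated scenario requires the covariance between the 2 coefficients and is not covered by this sketch.

```r
# Hypothetical inputs: 2 agreement coefficients from independent studies,
# each with its own standard error
coef1 <- 0.62; se1 <- 0.07
coef2 <- 0.45; se2 <- 0.08

# For independent estimates, the variance of the difference
# is the sum of the 2 variances
z       <- (coef1 - coef2) / sqrt(se1^2 + se2^2)
p_value <- 2 * pnorm(-abs(z))   # two-sided p-value, Normal approximation

c(z = z, p_value = p_value)
```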

Several researchers have successfully used these statistical techniques in their research. Here is a small sample of these publications:

Tuesday, February 16, 2021

New peer-reviewed article

Many statistical packages have implemented the wrong variance equation of Fleiss' generalized kappa (Fleiss, 1971). SPSS and the R package "rel" are among these packages. I recently published in "Educational and Psychological Measurement" an article entitled "Large-Sample Variance of Fleiss Generalized Kappa." I show in this article that it is not Fleiss' variance equation that is wrong. Instead, it is the way it has been used that is. Fleiss' variance equation was developed under the assumption of no agreement among raters and for the sole purpose of being used in hypothesis testing. It does not quantify the precision of Fleiss' generalized kappa and cannot be used for constructing confidence intervals either.

Monday, July 20, 2020

Large-sample variance of Fleiss generalized kappa coefficient

A number of researchers brought to my attention the fact that the variance associated with Fleiss' generalized kappa (Fleiss, 1971) calculated with my R package irrCAC differs from the variance obtained from alternative software products such as SPSS (Reliability with option FleissKappa) and with another R package named rel. In fact, SPSS and the R package rel produce the same variance estimate for Fleiss' generalized kappa. So, why that discrepancy and what should you do about it? Answering this question is the purpose of this rather long post.

If SPSS and the rel package agree, it is because they are both based on the variance formula proposed by Fleiss et al. (1979). While writing my own package irrCAC, I knew very well about this paper and read it multiple times. I decided not to use the variance formula that Fleiss, Nee, and Landis proposed and would strongly discourage anybody else from using it for any purpose other than for testing the null hypothesis of no agreement among raters. In this post, I explain the rationale behind my decision, and will briefly discuss my alternative approach.

Initially, I thought about writing and submitting a formal article to a peer-reviewed journal on this issue. Because I am uncertain that I will find time to do it (although I might still end up doing it at a much later time), I thought I would share with everybody the general approach I often use for deriving these variance formulas. My approach relies heavily on what is known in mathematics as the linearization technique. Linearization is a very popular technique that has been widely used across several fields of mathematics. Although I personally learned it well as a PhD student in mathematics (several years ago), in my opinion this technique should and can be introduced much earlier and in other non-mathematics fields. Here, I will restrict myself to the way linearization is often used in mathematical statistics only (I used this technique in one of my previous papers - see Gwet, 2016).

Let me consider an inter-rater reliability experiment, which involves \(n\) subjects, \(r\) raters and \(q\) categories into which each of the \(r\) raters is expected to classify all \(n\) subjects (there could be some missing ratings in case some raters do not rate all subjects, but I will ignore these practical considerations for now). A total of \(r_{ik}\) raters have classified the specific subject \(i\) into category \(k\). Now, \(\pi_k\) the probability for a random rater to classify a subject into category \(k\) is given by, \[ \pi_k = \frac{1}{n}\sum_{i=1}^nr_{ik}/r\hspace{3cm}(1). \] The complement of this probability is given by \(\pi_k^\star = 1-\pi_k\), representing the probability for a rater to classify subject \(i\) into a category other than \(k\).

Fleiss' generalized kappa (cf. Fleiss, 1971) is defined as \(\widehat{\kappa} = (p_a-p_e)/(1-p_e)\), where \(p_a\) is the percent agreement and \(p_e\) the percent chance agreement. These 2 quantities are defined as follows:

\begin{equation} p_a = \frac1n\sum_{i=1}^n\sum_{k=1}^q\frac{r_{ik}(r_{ik}-1)}{r(r-1)},\mbox{ and }p_e = \sum_{k=1}^q\pi_k^2. \end{equation}
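The definitions above translate directly into R. The sketch below is a minimal illustration and not the irrCAC implementation: the function name `fleiss_kappa` and the small dataset are hypothetical, and every subject is assumed to be rated by the same number \(r\) of raters (no missing ratings).

```r
# rik: n x q matrix whose (i, k) entry is the number of raters
# who classified subject i into category k
fleiss_kappa <- function(rik) {
  rik <- as.matrix(rik)
  r <- rowSums(rik)
  stopifnot(all(r == r[1]))   # same number of raters for every subject
  r <- r[1]
  pi_k <- colMeans(rik) / r                             # equation (1)
  pa   <- mean(rowSums(rik * (rik - 1)) / (r * (r - 1)))  # percent agreement
  pe   <- sum(pi_k^2)                                   # percent chance agreement
  list(pi_k = pi_k, pa = pa, pe = pe, kappa = (pa - pe) / (1 - pe))
}

# Hypothetical data: 4 subjects, 3 raters, 2 categories
rik <- matrix(c(3, 0,
                2, 1,
                0, 3,
                1, 2), ncol = 2, byrow = TRUE)
out <- fleiss_kappa(rik)
out$kappa
```

For this small dataset, \(p_a=2/3\) and \(p_e=1/2\), so that \(\widehat{\kappa}=1/3\).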

Variance Proposed by Fleiss et al. (1979)

Here is the variance that Fleiss et al. (1979) proposed:

\begin{equation}\small Var(\hat{\kappa}) = \frac{2}{\displaystyle nr(r-1)\biggl(\sum_{k=1}^q\pi_k\pi_k^\star\biggr)^2}\times \Biggl[\biggl(\sum_{k=1}^q\pi_k\pi_k^\star\biggr)^2 - \sum_{k=1}^q\pi_k\pi_k^\star(\pi_k^\star-\pi_k)\Biggr].\hspace{5mm}(2) \end{equation}

In this paper, the authors clearly say the following in the second column on page 974:

In this article, formulas for the standard error of kappa in the case of different sets of equal numbers of raters that are valid when the number of subjects is large and the null hypothesis is true are derived

The authors clearly state that their variance formulas are only valid when the null hypothesis is true (this generally means there is no agreement beyond chance). They go on to say on page 975, right after equation (5) the following:

Consider the hypothesis that the ratings are purely random in the sense that for each subject, the frequencies \(n_{i1}, n_{i2}, \cdots, n_{ik}\) are a set of multinominal frequencies with parameters \(n\) and \((P_1, P_2, \cdots, P_k)\), where \(\sum P_j = 1\).

The fact that the ratings are purely random translates in practice into an absence of agreement among raters beyond chance (you may notice that the notations in their paper are different from mine - \(n_{i1}\) for example is what I refer to as \(r_{i1}\)). Consequently, equation (2) should only be used if you are testing the null hypothesis of no agreement among raters, and should never be used to construct confidence intervals, for example, nor to do anything else unrelated to hypothesis testing.
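For reference, equation (2) can be coded in a few lines of R. This is a sketch under the same assumptions as the definitions above (same number \(r\) of raters for every subject); the function name `fleiss_null_variance` and the example data are hypothetical, and the code is not taken from SPSS or the rel package.

```r
# Variance of Fleiss' kappa under the null hypothesis of no agreement,
# following equation (2). rik is the n x q matrix of rater counts.
fleiss_null_variance <- function(rik) {
  rik <- as.matrix(rik)
  n <- nrow(rik)
  r <- sum(rik[1, ])             # raters per subject (assumed constant)
  pi_k    <- colMeans(rik) / r
  pi_star <- 1 - pi_k
  s <- sum(pi_k * pi_star)
  2 / (n * r * (r - 1) * s^2) *
    (s^2 - sum(pi_k * pi_star * (pi_star - pi_k)))
}

# Hypothetical data: 4 subjects, 3 raters, 2 categories
rik <- matrix(c(3, 0,
                2, 1,
                0, 3,
                1, 2), ncol = 2, byrow = TRUE)
v_null <- fleiss_null_variance(rik)
v_null
```

Note that the per-subject counts enter this formula only through the overall classification propensities \(\pi_k\): 2 datasets with the same \(\pi_k\) values but very different levels of agreement produce the same variance.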

Variance Proposed by Gwet (2014)

The variance of Fleiss' generalized kappa I proposed in my book (see Gwet, 2014) is defined as follows: \begin{equation} Var(\hat{\kappa}) = \frac{1-f}{n}\frac1{n-1}\sum_{i=1}^n\bigl(\kappa_i^\star - \hat{\kappa}\bigr)^2,\hspace{2cm}(3) \end{equation} where, \begin{equation} \kappa_i^\star = \kappa_i - 2(1-\hat{\kappa})\frac{p_{e|i}-p_e}{1-p_e}, \end{equation} with \(\kappa_i = (p_{a|i}-p_e)/(1-p_e)\). Moreover \(p_{a|i}\) and \(p_{e|i}\) are defined as follows: \begin{equation} p_{a|i} = \sum_{k=1}^q\frac{r_{ik}(r_{ik}-1)}{r(r-1)}\mbox{ and }p_{e|i} = \sum_{k=1}^q\pi_kr_{ik}/r. \end{equation}

Equation (3) is a general purpose variance estimator, which is valid either for hypothesis testing or for confidence interval construction or for anything else. I expect equations (2) and (3) to agree reasonably well when the extent of agreement among raters is close to 0. I did not personally verify this, but I expect it to be true if equation (2) was properly derived. But if there is an agreement to some extent among raters, then I expect equation (2) to often yield a smaller variance and to be an understatement of the true variance.
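Equation (3) can be sketched in R as follows. This is an illustration rather than the irrCAC code. The function name `gwet_variance` and the example data are hypothetical. The post does not define \(f\); it is read here as a sampling fraction and set to 0 by default, which is an assumption of the sketch.

```r
# Variance of Fleiss' kappa following equation (3).
# rik: n x q matrix of rater counts; f: assumed sampling fraction
gwet_variance <- function(rik, f = 0) {
  rik <- as.matrix(rik)
  n <- nrow(rik)
  r <- sum(rik[1, ])                 # raters per subject (assumed constant)
  pi_k <- colMeans(rik) / r
  pa_i <- rowSums(rik * (rik - 1)) / (r * (r - 1))   # p_{a|i}
  pe_i <- as.vector(rik %*% pi_k) / r                # p_{e|i}
  pe   <- sum(pi_k^2)
  kappa      <- (mean(pa_i) - pe) / (1 - pe)
  kappa_i    <- (pa_i - pe) / (1 - pe)
  kappa_star <- kappa_i - 2 * (1 - kappa) * (pe_i - pe) / (1 - pe)
  (1 - f) / n * sum((kappa_star - kappa)^2) / (n - 1)
}

# Hypothetical data: 4 subjects, 3 raters, 2 categories
rik <- matrix(c(3, 0,
                2, 1,
                0, 3,
                1, 2), ncol = 2, byrow = TRUE)
v_gwet <- gwet_variance(rik)
v_gwet
```

Unlike a formula that depends on the \(\pi_k\) values alone, this estimator uses the subject-level quantities \(\kappa_i^\star\), so it responds to how the ratings are actually distributed across subjects.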

How do you get to equation (3)?

The Linearization technique is based upon the basic fact that the human mind has always found it convenient to deal with linear expressions. It is because they involve summing and averaging stuff (stuff here represents variables to which fixed scalar coefficients may or may not be attached). That is why averages or means have been so popular and so thoroughly investigated in statistics.

Now, Fleiss' generalized kappa \(\widehat{\kappa}\) is anything but linear. Its calculation involves more, a lot more than summing and averaging basic entities. When dealing with a nonlinear expression such as kappa, the question always becomes, can we not find a linear expression that is sufficiently close to kappa, derive the variance of the linear expression and show that it is sufficiently close to the actual variance of kappa? As it turns out, when the number of subjects is very large, Fleiss' generalized kappa tends to take a linear form, and the larger the number of subjects, the closer that linear form gets to kappa. But what does that linear form look like? Since linearity only takes place when the number of subjects is far away from what you would normally have in practice, we need to be able to zoom in enough to take a decent shot at that linear form from a distance. Finding a powerful zoom and using it well is the challenge we need to overcome to resolve this issue. To be successful, we do need to tackle this issue step by step, piece by piece.

When calculating the variance of Fleiss' generalized kappa \(\widehat{\kappa}\), the first very annoying issue you face is the \(1-p_e\) factor that appears in the denominator. Here is a very clear-cut way to get rid of it. The fundamental building block of the percent chance agreement \(p_e\) is \(\pi_k\). Since this quantity depends on the sample of \(n\) subjects (see equation 1), and is expected to change as the subject sample changes, let us denote this probability by \(\widehat{\pi}_k\) to stress that it is an approximation of something fixed \((\pi_k)\) that we will call the "true" propensity for classification into category \(k\). Equation (1) as well as the percent chance agreement \(p_e\) have to be rewritten as follows:
\begin{equation} \widehat{\pi}_k = \frac1n\sum_{i=1}^nr_{ik}/r, \mbox{ and }p_e = \sum_{k=1}^q\widehat{\pi}_k^2. \end{equation} Since \(\widehat{\pi}_k\) is a simple average, the Law of Large Numbers (LLN) stipulates that \(\widehat{\pi}_k\) converges (in probability) to a fixed number \(\pi_k\) as the number of subjects \(n\) grows. \begin{equation} \widehat{\pi}_k \overset{P}{\longrightarrow} \pi_k\hspace{3cm}(4) \end{equation} We might never know for sure what the real value of \(\pi_k\) is. But that's ok, we don't really need it for now. Remember we are zooming in on something in the vicinity of infinity to see a possible linear form for kappa. Once we have it, then we will proceed to estimate what we don't know.

Continuous Mapping Theorem (CMT)

Since \(\widehat{\pi}_k\) converges (in probability) to \(\pi_k\), the Continuous Mapping Theorem implies that \(p_e\) converges (in probability) to the fixed value \(P_e=\sum_{k=1}^q\pi_k^2\). Now, note that in the formula defining Fleiss' generalized kappa, the numerator must be divided by \(1-p_e\) or it must be multiplied by \(1/(1-p_e)\). This inverse can be rewritten as follows: \begin{equation} \frac1{1-p_e} = \frac{1}{1-P_e}\times\frac{1}{1-\displaystyle\biggl(\frac{p_e-P_e}{1-P_e}\biggr)}\hspace{1cm}(6) \end{equation} First consider the expression on the right side of the multiplication sign \(\times\) in equation (6) and let \(\varepsilon = (p_e-P_e)/(1-P_e)\). It follows from Taylor's theorem that \(1/(1-\varepsilon) = 1+\varepsilon + Remainder\), where the "Remainder" goes to 0 (in probability) as the number of subjects goes to infinity. Consequently, it follows from Slutsky's theorem that the large-sample probability distribution of \(1/(1-\varepsilon)\) is the same as the probability distribution of \(1+\varepsilon\). It follows from equation (6) that the large-sample distribution of the inverse \(1/(1-p_e)\) is the same as the distribution of the following statistic: \begin{equation} L_e = \frac{1+(p_e-P_e)/(1-P_e)}{1-P_e}\hspace{3cm}(7). \end{equation} You will note that the \(L_e\) statistic does not involve any sample-dependent quantity in the denominator, which is the objective I wanted to accomplish.
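The first-order Taylor approximation used in this step can be checked numerically. The R sketch below, with arbitrarily chosen values of \(\varepsilon\), compares \(1/(1-\varepsilon)\) with \(1+\varepsilon\) and shows that the remainder shrinks faster than \(\varepsilon\) itself.

```r
# Arbitrary values of epsilon getting closer to 0
eps <- c(0.2, 0.1, 0.05, 0.01)

exact     <- 1 / (1 - eps)     # the quantity being approximated
linear    <- 1 + eps           # first-order Taylor approximation
remainder <- exact - linear    # equals eps^2 / (1 - eps)

# The ratio remainder / eps goes to 0 as eps goes to 0
data.frame(eps, exact, linear, remainder, ratio = remainder / eps)
```

In the derivation, \(\varepsilon\) is itself a sample quantity that converges to 0 as the number of subjects grows, which is why the remainder can be ignored in large samples.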
Now, we know that the large-sample distribution of Fleiss' generalized kappa is the same as the distribution of \(\kappa_0\) defined as follows: \begin{equation} \kappa_0 = (p_a-p_e)L_e\hspace{4cm}(8) \end{equation} I also know, by applying the Law of Large Numbers again, that the percent agreement \(p_a\) converges (in probability) to a fixed probability \(P_a\), whose exact value may never be known to us (an issue I'll worry about later). \(\kappa_0\) can now be rewritten as: \begin{equation} \kappa_0 = \frac{p_a-P_e}{1-P_e} - (1-\kappa)\frac{p_e-P_e}{1-P_e},\hspace{2cm}(9) \end{equation} where \(\kappa=(P_a-P_e)/(1-P_e)\) is a fixed value (or estimand) to which Fleiss' kappa is expected to converge. Our linear form is slowly and gradually taking shape. While the sample-based percent agreement \(p_a\) is already a linear expression, that is not yet the case for the percent chance agreement, which depends on \(\widehat{\pi}_k^2\) - a sample-dependent statistic that is squared. Let us deal with it.
I indicated earlier that the estimated propensity \(\widehat{\pi}_k\) for classification into category \(k\) converges (in probability) to the fixed quantity \(\pi_k\). It follows from Taylor's theorem again that \(\widehat{\pi}_k^2 = \pi_k^2 +2\pi_k(\widehat{\pi}_k-\pi_k) + \mbox{Remainder}\), where the remainder goes to 0 faster than the difference \(\widehat{\pi}_k-\pi_k\). Consequently, the large-sample distribution of the difference \((p_e-P_e)\) of equation (9) is the same as that of \(2(p_{e|0}-P_e)\) where \(p_{e|0}\) is given by: \begin{equation} p_{e|0} = \sum_{k=1}^q\pi_k\widehat{\pi}_k = \frac{1}{n}\sum_{i=1}^np_{e|i}, \mbox{ where }p_{e|i}=\sum_{k=1}^q\pi_kr_{ik}/r. \hspace{1cm}(10) \end{equation} Consequently, the large-sample distribution of \(\kappa_0\) of equation (9) is the same as the distribution of \(\kappa_1\) given by, \begin{equation} \kappa_1 = \frac{p_a-P_e}{1-P_e} - 2(1-\kappa)\frac{p_{e|0}-P_e}{1-P_e}. \hspace{3cm}(11) \end{equation} Note that equation (11) is the pure linear expression I was looking for. That is, \begin{equation} \kappa_1 =\frac{1}{n}\sum_{i=1}^n\kappa_i^\star\hspace{5cm}(12), \end{equation} where \(\kappa_i^\star\) is defined right after equation (3) above. Now, we have found a simple average whose probability distribution is the same as the large-sample distribution of Fleiss' generalized kappa. All you need to do is to compute the variance of \(\kappa_1\). If there are some outstanding terms that are unknown, you estimate them based on the sample data you have. This is how these variances are calculated.
Note that the Central Limit Theorem ensures that the large-sample distribution of the sample mean is Normal. Therefore, it is equation (12) that is used to show that the large-sample distribution of Fleiss' kappa is Normal and to compute the associated variance.
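Equation (12) has a simple sample counterpart that can be verified numerically. In the R sketch below, the unknown quantities are replaced by their sample estimates, in which case the average of the \(\kappa_i^\star\) values reproduces Fleiss' kappa exactly, and the variance of that average is the usual variance of a sample mean. The dataset and variable names are hypothetical.

```r
# Hypothetical data: 6 subjects, 4 raters, 3 categories
rik <- matrix(c(4, 0, 0,
                2, 2, 0,
                0, 3, 1,
                1, 1, 2,
                0, 0, 4,
                3, 1, 0), ncol = 3, byrow = TRUE)
n <- nrow(rik)
r <- 4

pi_hat <- colMeans(rik) / r                          # estimated propensities
pe     <- sum(pi_hat^2)                              # percent chance agreement
pa_i   <- rowSums(rik * (rik - 1)) / (r * (r - 1))   # p_{a|i}
pe_i   <- as.vector(rik %*% pi_hat) / r              # p_{e|i}

kappa_hat  <- (mean(pa_i) - pe) / (1 - pe)
kappa_star <- (pa_i - pe) / (1 - pe) -
  2 * (1 - kappa_hat) * (pe_i - pe) / (1 - pe)

# The simple average of kappa_star equals Fleiss' kappa
c(kappa_hat = kappa_hat, mean_kappa_star = mean(kappa_star))

# Variance of a sample mean, which matches equation (3) when f = 0
var(kappa_star) / n
```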
To conclude, I may say that for the derivation of Fleiss' generalized kappa variance, I did not need to make any special assumptions about the ratings. I only used (sometimes multiple times) the following 5 theorems:

The Law of Large Numbers
Taylor's Theorem
Slutsky's Theorem
The Continuous Mapping Theorem
The Central Limit Theorem

The linearization technique used here is definitely the most effective way for deriving the variance of complicated statistics. Unfortunately, students outside of traditional mathematics departments have hardly been exposed to it. I attempted here to outline the key steps and the main theorems needed to zoom in on kappa in the vicinity of infinity so that one can read its linear structure. Hopefully researchers with some background in mathematics will get a glimpse into the mechanics of this powerful technique. The most delicate aspect of this method is when Taylor's theorem must be applied. In fact, the "remainder" term must be carefully looked at to ensure that it does not include a term not supposed to be there.

A rigorous mathematical demonstration of the steps discussed above should not be a concern to researchers. It would definitely require PhD-level mathematics that should be left to PhD students in maths.

Side Note:

In his book entitled "How Not to Be Wrong: The Power of Mathematical Thinking", here is what the author Jordan Ellenberg says:

A basic rule of mathematical life: if the universe hands you a hard problem, try to solve an easier one instead, and hope the simple version is close enough to the original problem that the universe doesn't object.

Bibliography

Fleiss, J.L. (1971). Measuring nominal scale agreement among many raters. Psychological Bulletin, 76, 378-382.
Fleiss, J.L., Nee, J.C.M., and Landis, J.R. (1979). Large Sample Variance of Kappa in the Case of Different Sets of Raters, Psychological Bulletin, 86, 974-977.
Gwet, K.L. (2014). Handbook of Inter-Rater Reliability (4th Edition), Advanced Analytics, LLC, Maryland, USA
Gwet, K.L. (2016). Testing the Difference of Correlated Agreement Coefficients for Statistical Significance, Educational and Psychological Measurement, 76(4), 609–637.

Friday, November 8, 2019

Handbook of Inter-Rater Reliability, 5th Edition

Work on the 5th edition of the Handbook of Inter-Rater Reliability is in progress. Due to a large increase in the number of pages from the 4th edition, I decided that the 5th edition will be released in 2 volumes. Volume 1 will be devoted to the Chance-corrected Agreement Coefficients (CAC), while Volume 2 will focus on the Intraclass Correlation Coefficient (ICC). You have the opportunity to review the early drafts of many chapters of this 5th edition and to submit your comments and/or questions to me. I will appreciate it very much if you can report any typo or error to me after your review. Volume 1 chapters that are available can be downloaded here. Volume 2 chapters, on the other hand, can be downloaded here.