Monday, July 20, 2020

Large-sample variance of Fleiss generalized kappa coefficient

A number of researchers have brought to my attention the fact that the variance associated with Fleiss' generalized kappa (Fleiss, 1971), as calculated with my R package irrCAC, differs from the variance obtained from alternative software products such as SPSS (Reliability with the FleissKappa option) and from another R package named rel. In fact, SPSS and the R package rel produce the same variance estimate for Fleiss' generalized kappa. So why the discrepancy, and what should you do about it? Answering this question is the purpose of this rather long post.

If SPSS and the rel package agree, it is because they are both based on the variance formula proposed by Fleiss et al. (1979).  While writing my own package irrCAC, I knew this paper very well and had read it multiple times. I decided not to use the variance formula that Fleiss, Nee, and Landis proposed, and I would strongly discourage anybody else from using it for any purpose other than testing the null hypothesis of no agreement among raters.  In this post, I explain the rationale behind my decision and briefly discuss my alternative approach.
      
Initially, I thought about writing and submitting a formal article on this issue to a peer-reviewed journal.  Because I am uncertain that I will find the time to do it (although I might still end up doing it at a much later time), I thought I would share with everybody the general approach I often use for deriving these variance formulas. My approach relies heavily on what is known in mathematics as the linearization technique. Linearization is a very popular technique that has been widely used across several fields of mathematics. Although I personally learned it well as a PhD student in mathematics (several years ago), in my opinion this technique should and can be introduced much earlier and in other, non-mathematics fields. Here, I will restrict myself to the way linearization is often used in mathematical statistics (I used this technique in one of my previous papers - see Gwet, 2016).

Let me consider an inter-rater reliability experiment, which involves \(n\) subjects, \(r\) raters and \(q\) categories into which each of the \(r\) raters is expected to classify all \(n\) subjects (there could be some missing ratings in case some raters do not rate all subjects, but I will ignore these practical considerations for now). A total of \(r_{ik}\) raters have classified the specific subject \(i\) into category \(k\). Now, \(\pi_k\), the probability for a random rater to classify a subject into category \(k\), is given by \[ \pi_k = \frac{1}{n}\sum_{i=1}^nr_{ik}/r\hspace{3cm}(1). \] The complement of this probability is given by \(\pi_k^\star = 1-\pi_k\), and represents the probability for a rater to classify a subject into a category other than \(k\).

Fleiss' generalized kappa (c.f. Fleiss, 1971) is defined as \(\widehat{\kappa} = (p_a-p_e)/(1-p_e)\), where \(p_a\) is the percent agreement and \(p_e\) the percent chance agreement. These 2 quantities are defined as follows:

\begin{equation} p_a = \frac1n\sum_{i=1}^n\sum_{k=1}^q\frac{r_{ik}(r_{ik}-1)}{r(r-1)},\mbox{ and }p_e = \sum_{k=1}^q\pi_k^2. \end{equation}
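To make these definitions concrete, here is a minimal R sketch (my own illustration, not the code used in irrCAC) that computes \(\pi_k\), \(p_a\), \(p_e\) and \(\widehat{\kappa}\) from an \(n\times q\) matrix of counts \(r_{ik}\), assuming every subject is rated by the same number of raters and there are no missing ratings:
  # Minimal sketch (not the irrCAC code): Fleiss' generalized kappa from an
  # n x q matrix 'counts' whose entry (i, k) is r_ik, the number of raters
  # who classified subject i into category k.
  fleiss.kappa.basic <- function(counts) {
    counts <- as.matrix(counts)
    r <- max(rowSums(counts))                     # number of raters (assumed constant)
    pi.k <- colMeans(counts / r)                  # equation (1)
    pa <- mean(rowSums(counts * (counts - 1)) / (r * (r - 1)))  # percent agreement
    pe <- sum(pi.k^2)                             # percent chance agreement
    (pa - pe) / (1 - pe)                          # Fleiss' generalized kappa
  }
Applied to the distrib.6raters dataset shown further down in this post, this sketch reproduces the coefficient reported by the fleiss.kappa.dist function of irrCAC (about 0.414).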
   
Variance Proposed by Fleiss et al. (1979)

Here is the variance that Fleiss et al. (1979) proposed:
\begin{equation}\small  Var(\hat{\kappa}) = \frac{2}{\displaystyle nr(r-1)\biggl(\sum_{k=1}^q\pi_k\pi_k^\star\biggr)^2}\times \Biggl[\biggl(\sum_{k=1}^q\pi_k\pi_k^\star\biggr)^2 - \sum_{k=1}^q\pi_k\pi_k^\star(\pi_k^\star-\pi_k)\Biggr].\hspace{5mm}(2) \end{equation}
In this paper, the authors clearly say the following in the second column on page 974:
In this article, formulas for the standard error of kappa in the case of different sets of equal numbers of raters that are valid when the number of subjects is large and the null hypothesis is true are derived
The authors clearly state that their variance formulas are only valid when the null hypothesis is true (which generally means there is no agreement beyond chance). They go on to say the following on page 975, right after equation (5):
Consider the hypothesis that the ratings are purely random in the sense that for each subject, the frequencies \(n_{i1}, n_{i2}, \cdots, n_{ik}\) are a set of multinomial frequencies with parameters \(n\) and \((P_1, P_2, \cdots, P_k)\), where \(\sum P_j = 1\).
The fact that the ratings are purely random translates in practice into an absence of agreement among raters beyond chance (you may notice that the notations in their paper are different from mine - \(n_{i1}\), for example, is what I refer to as \(r_{i1}\)).  Consequently, equation (2) should only be used when testing the null hypothesis of no agreement among raters; it should never be used to construct confidence intervals, for example, nor for anything else unrelated to that hypothesis test.
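For completeness, here is a hedged R transcription of equation (2). It is my own sketch of the formula, not code taken from any of the software products mentioned above, and, per the discussion above, it should only be used for testing the null hypothesis of no agreement beyond chance:
  # Sketch of the Fleiss, Nee & Landis (1979) variance of equation (2).
  # 'counts' is the same n x q matrix of r_ik values used above. This variance
  # is only meaningful under the null hypothesis of no agreement beyond chance.
  fleiss1979.variance <- function(counts) {
    counts <- as.matrix(counts)
    n <- nrow(counts)
    r <- max(rowSums(counts))                # number of raters (assumed constant)
    pi.k <- colMeans(counts / r)             # classification probabilities
    pi.k.star <- 1 - pi.k
    s <- sum(pi.k * pi.k.star)
    2 * (s^2 - sum(pi.k * pi.k.star * (pi.k.star - pi.k))) / (n * r * (r - 1) * s^2)
  }
A test of that null hypothesis would then refer the statistic \(\widehat{\kappa}/\sqrt{Var(\widehat{\kappa})}\) to the standard Normal distribution.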

Variance Proposed by Gwet (2014)

The variance of Fleiss' generalized kappa I proposed in my book (see Gwet, 2014) is defined as follows: \begin{equation} Var(\hat{\kappa}) = \frac{1-f}{n}\frac1{n-1}\sum_{i=1}^n\bigl(\kappa_i^\star - \hat{\kappa}\bigr)^2,\hspace{2cm}(3) \end{equation} where \(f=n/N\) is the sampling fraction (it equals 0 when the subject population is considered infinite), and \begin{equation} \kappa_i^\star = \kappa_i - 2(1-\hat{\kappa})\frac{p_{e|i}-p_e}{1-p_e}, \end{equation} with \(\kappa_i = (p_{a|i}-p_e)/(1-p_e)\). Moreover, \(p_{a|i}\) and \(p_{e|i}\) are defined as follows: \begin{equation} p_{a|i} = \sum_{k=1}^q\frac{r_{ik}(r_{ik}-1)}{r(r-1)}\mbox{ and }p_{e|i} = \sum_{k=1}^q\pi_kr_{ik}/r. \end{equation}
Equation (3) is a general-purpose variance estimator, which is valid for hypothesis testing, for confidence interval construction, or for any other purpose. I expect equations (2) and (3) to agree reasonably well when the extent of agreement among raters is close to 0. I did not personally verify this, but I expect it to be true if equation (2) was properly derived. However, if there is some agreement among raters, then I expect equation (2) to often yield a smaller value and to understate the true variance.
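Here is a corresponding R sketch of equation (3). Again, this is my own transcription rather than the irrCAC code, and the finite-population correction \(1-f\) is set to 1 (i.e. the subject population is treated as infinite):
  # Sketch of the variance of equation (3) (Gwet, 2014), with f = n/N set to 0.
  # 'counts' is the same n x q matrix of r_ik values used above.
  gwet.fleiss.variance <- function(counts) {
    counts <- as.matrix(counts)
    n <- nrow(counts)
    r <- max(rowSums(counts))                                # number of raters
    pi.k <- colMeans(counts / r)
    pe <- sum(pi.k^2)
    pa.i <- rowSums(counts * (counts - 1)) / (r * (r - 1))   # p_{a|i}
    pe.i <- as.vector(counts %*% pi.k) / r                   # p_{e|i}
    kappa.hat <- (mean(pa.i) - pe) / (1 - pe)
    kappa.i <- (pa.i - pe) / (1 - pe)
    kappa.i.star <- kappa.i - 2 * (1 - kappa.hat) * (pe.i - pe) / (1 - pe)
    sum((kappa.i.star - kappa.hat)^2) / (n * (n - 1))
  }
The square root of this quantity, applied to the distrib.6raters dataset shown further down, matches the standard error of about 0.081 reported by fleiss.kappa.dist.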

How do you get to equation (3)?

The linearization technique is based on the basic fact that the human mind has always found it convenient to deal with linear expressions, because they only involve summing and averaging things (here, variables to which fixed scalar coefficients may or may not be attached). That is why averages and means have been so popular and so thoroughly investigated in statistics.

Now, Fleiss' generalized kappa \(\widehat{\kappa}\) is anything but linear.  Its calculation involves more, a lot more, than summing and averaging basic entities.  When dealing with a nonlinear expression such as kappa, the question always becomes: can we find a linear expression that is sufficiently close to kappa, derive the variance of that linear expression, and show that it is sufficiently close to the actual variance of kappa?  As it turns out, when the number of subjects is very large, Fleiss' generalized kappa tends to take a linear form, and the larger the number of subjects, the closer that linear form gets to kappa. But what does that linear form look like?  Since linearity only takes hold when the number of subjects is far beyond what you would normally have in practice, we need to be able to zoom in enough to take a decent shot at that linear form from a distance. Finding a powerful zoom and using it well is the challenge we need to overcome to resolve this issue. To be successful, we need to tackle this issue step by step, piece by piece.
  • When calculating the variance of Fleiss' generalized kappa \(\widehat{\kappa}\), the first very annoying issue you face is the \(1-p_e\) factor that appears in the denominator.  Here is a very clear-cut way to get rid of it.  The fundamental building block of the percent chance agreement \(p_e\) is \(\pi_k\).  Since this quantity depends on the sample of \(n\) subjects (see equation 1), and is expected to change as the subject sample changes, let us denote this probability by \(\widehat{\pi}_k\) to stress that it is an approximation of something fixed \((\pi_k)\) that we will call the "true" propensity for classification into category \(k\). Equation (1), as well as the percent chance agreement \(p_e\), must be rewritten as follows:
    \begin{equation} \widehat{\pi}_k = \frac1n\sum_{i=1}^nr_{ik}/r, \mbox{ and }p_e = \sum_{k=1}^q\widehat{\pi}_k^2. \end{equation} Since \(\widehat{\pi}_k\) is a simple average, the Law of Large Numbers (LLN) stipulates that \(\widehat{\pi}_k\) converges (in probability) to a fixed number \(\pi_k\) as the number of subjects \(n\) grows: \begin{equation} \widehat{\pi}_k \overset{P}{\longrightarrow} \pi_k\hspace{3cm}(4) \end{equation} We might never know for sure what the real value of \(\pi_k\) is. But that's OK, we don't really need it for now. Remember, we are zooming in on something in the vicinity of infinity to see a possible linear form for kappa. Once we have it, then we will proceed to estimate what we don't know.
    The Continuous Mapping Theorem (CMT) (applied twice) allows me to deduce from equation (4) that \begin{equation} \widehat{\pi}_k^2 \overset{P}{\longrightarrow} \pi_k^2, \mbox{ and } p_e \overset{P}{\longrightarrow} P_e\hspace{3cm}(5) \end{equation} That is, as the number of subjects grows, the (random) percent chance agreement \(p_e\) converges in probability to a fixed quantity \(P_e\), which is the sum of the \(q\) terms \(\pi_k^2\). Note that \(p_e\), the sample-based random quantity, is in lowercase, while \(P_e\), the fixed limit value, is in uppercase.
  • Now, note that in the formula defining Fleiss' generalized kappa, the numerator must be divided by \(1-p_e\) or, equivalently, multiplied by \(1/(1-p_e)\).  This inverse can be rewritten as follows: \begin{equation} \frac1{1-p_e} = \frac{1}{1-P_e}\times\frac{1}{1-\displaystyle\biggl(\frac{p_e-P_e}{1-P_e}\biggr)}\hspace{1cm}(6) \end{equation} First consider the expression on the right side of the multiplication sign \(\times\) in equation (6) and let \(\varepsilon = (p_e-P_e)/(1-P_e)\). It follows from Taylor's theorem that \(1/(1-\varepsilon) = 1+\varepsilon + \mbox{Remainder}\), where the "Remainder" goes to 0 (in probability) faster than \(\varepsilon\) as the number of subjects goes to infinity. Consequently, it follows from Slutsky's theorem that the large-sample probability distribution of \(1/(1-\varepsilon)\) is the same as the probability distribution of \(1+\varepsilon\). It then follows from equation (6) that the large-sample distribution of the inverse \(1/(1-p_e)\) is the same as the distribution of the following statistic: \begin{equation} L_e = \frac{1+(p_e-P_e)/(1-P_e)}{1-P_e}\hspace{3cm}(7). \end{equation} You will note that the \(L_e\) statistic does not involve any sample-dependent quantity in the denominator, which is the objective I wanted to accomplish.
  • Now, we know that the large-sample distribution of Fleiss' generalized kappa is the same as the distribution of \(\kappa_0\) defined as follows: \begin{equation} \kappa_0 = (p_a-p_e)L_e\hspace{4cm}(8) \end{equation} By applying the Law of Large Numbers again, I also know that the percent agreement \(p_a\) converges (in probability) to a fixed probability \(P_a\), whose exact value may never be known to us (an issue I'll worry about later). After discarding remainder terms that vanish faster, \(\kappa_0\) can now be rewritten as \begin{equation} \kappa_0 = \frac{p_a-P_e}{1-P_e} - (1-\kappa)\frac{p_e-P_e}{1-P_e},\hspace{2cm}(9) \end{equation} where \(\kappa=(P_a-P_e)/(1-P_e)\) is the fixed value (or estimand) to which Fleiss' kappa is expected to converge. Our linear form is slowly and gradually taking shape. While the sample-based percent agreement \(p_a\) is already a linear expression, that is not yet the case for the percent chance agreement, which depends on \(\widehat{\pi}_k^2\) - a sample-dependent statistic that is squared. Let us deal with it.
  • I indicated earlier that the estimated propensity \(\widehat{\pi}_k\) for classification into category \(k\) converges (in probability) to the fixed quantity \(\pi_k\). It follows from Taylor's theorem again that \(\widehat{\pi}_k^2 = \pi_k^2 +2\pi_k(\widehat{\pi}_k-\pi_k) + \mbox{Remainder}\), where the remainder goes to 0 faster than the difference \(\widehat{\pi}_k-\pi_k\). Consequently, the large-sample distribution of the difference \((p_e-P_e)\) of equation (9) is the same as that of  \(2(p_{e|0}-P_e)\) where \(p_{e|0}\) is given by: \begin{equation} p_{e|0} = \sum_{k=1}^q\pi_k\widehat{\pi}_k = \frac{1}{n}\sum_{i=1}^np_{e|i}, \mbox{ where }p_{e|i}=\sum_{k=1}^q\pi_kr_{ik}/r. \hspace{1cm}(10) \end{equation} Consequently, the large-sample distribution of \(\kappa_0\) of equation (9) is the same as the distribution of \(\kappa_1\) given by, \begin{equation} \kappa_1 = \frac{p_a-P_e}{1-P_e} - 2(1-\kappa)\frac{p_{e|0}-P_e}{1-P_e}. \hspace{3cm}(11) \end{equation} Note that equation (11) is the pure linear expression I was looking for. That is, \begin{equation} \kappa_1 =\frac{1}{n}\sum_{i=1}^n\kappa_i^\star\hspace{5cm}(12), \end{equation} where \(\kappa_i^\star\) is defined right after equation (3) above. Now, we have found a simple average whose probability distribution is the same as the large-sample distribution of Fleiss' generalized kappa. All you need to do is to compute the variance of \(\kappa_1\). If there are some outstanding terms that are unknown, you estimate them based on the sample data you have. This is how these variances are calculated.
  • Note that the Central Limit Theorem ensures that the large-sample distribution of the sample mean is Normal.  Therefore, it is equation (12) that is used to show that the large-sample distribution of Fleiss' kappa is Normal and to compute the associated variance.
  • To conclude, I may say that for the derivation of Fleiss' generalized kappa variance, I did not need to make any special assumptions about the ratings.  I only used (sometimes multiple times) the following 5 theorems (a small simulation sketch follows this list for readers who want to see the linearization at work):
    • The Law of Large Numbers
    • Taylor's Theorem
    • Slutsky's Theorem
    • The Continuous Mapping Theorem
    • The Central Limit Theorem
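To see the linearization at work, here is a small Monte Carlo sketch of my own (it is not taken from any of the cited papers). It simulates ratings with a genuine degree of agreement, reuses the three sketch functions defined earlier in this post, and compares the empirical variance of Fleiss' generalized kappa across simulated samples with the averages of equations (2) and (3):
  # Monte Carlo check of the linearization. Each rater classifies a subject
  # into its "true" category with probability theta, and at random otherwise,
  # so there is genuine agreement beyond chance.
  set.seed(123)
  n <- 50; r <- 5; q <- 4; theta <- 0.6
  simulate.counts <- function() {
    true.cat <- sample(q, n, replace = TRUE)
    t(sapply(true.cat, function(tc) {
      votes <- ifelse(runif(r) < theta, tc, sample(q, r, replace = TRUE))
      tabulate(votes, nbins = q)               # r_i1, ..., r_iq for one subject
    }))
  }
  sims <- replicate(2000, {
    counts <- simulate.counts()
    c(kappa = fleiss.kappa.basic(counts),
      v.fleiss1979 = fleiss1979.variance(counts),
      v.gwet = gwet.fleiss.variance(counts))
  })
  var(sims["kappa", ])            # empirical (Monte Carlo) variance of kappa
  mean(sims["v.fleiss1979", ])    # average of equation (2)
  mean(sims["v.gwet", ])          # average of equation (3)
Under these assumptions, I expect the average of equation (3) to track the Monte Carlo variance closely, while the average of equation (2), which was derived under the null hypothesis, should be noticeably smaller.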
The linearization technique used here is definitely the most effective way to derive the variance of complicated statistics.  Unfortunately, students outside of traditional mathematics departments have hardly been exposed to it.  I attempted here to outline the key steps and the main theorems needed to zoom in on kappa in the vicinity of infinity so that one can read its linear structure. Hopefully, researchers with some background in mathematics will get a glimpse into the mechanics of this powerful technique.  The most delicate part of the method is the application of Taylor's theorem: the "remainder" term must be examined carefully to ensure that it does not hide a term that cannot be neglected.

A rigorous mathematical proof of the steps discussed above should not be a concern for applied researchers.  It would require PhD-level mathematics that is best left to PhD students in maths.


Side Note:
Here is what the author Jordan Ellenberg says in his book entitled "How Not to Be Wrong: The Power of Mathematical Thinking":
A basic rule of mathematical life: if the universe hands you a hard problem, try to solve an easier one instead, and hope the simple version is close enough to the original problem that the universe doesn't object.



Bibliography

  • Fleiss, J.L. (1971). Measuring nominal scale agreement among many raters. Psychological Bulletin, 76, 378-382.
  • Fleiss, J.L., Nee, J.C.M., and Landis, J.R. (1979). Large Sample Variance of Kappa in the Case of Different Sets of Raters, Psychological Bulletin, 86, 974-977.
  • Gwet, K.L. (2014). Handbook of Inter-Rater Reliability (4th Edition), Advanced Analytics, LLC, Maryland, USA
  • Gwet, K.L. (2016). Testing the Difference of Correlated Agreement Coefficients for Statistical Significance. Educational and Psychological Measurement, 76(4), 609-637.

Friday, November 8, 2019

Handbook of Inter-Rater Reliability, 5th Edition

Work on the 5th edition of the Handbook of Inter-Rater Reliability is in progress. Due to a large increase in the number of pages compared to the 4th edition, I decided that the 5th edition will be released in 2 volumes. Volume 1 will be devoted to the Chance-corrected Agreement Coefficients (CAC), while Volume 2 will focus on the Intraclass Correlation Coefficient (ICC). You have the opportunity to review the early drafts of many chapters of this 5th edition and to submit your comments and/or questions to me. I would appreciate it very much if you could report any typo or error you find during your review. The available Volume 1 chapters can be downloaded here, and the available Volume 2 chapters can be downloaded here.

Tuesday, October 22, 2019

AgreeStat 360: a cloud app for analyzing the extent of agreement among raters

AgreeStat 360 is a cloud-based app for analyzing the extent of agreement among raters. You may upload your rating data to agreestat360.com as a text file in csv format or as an Excel worksheet. AgreeStat 360 will process it instantaneously. You can compute a variety of Chance-corrected Agreement Coefficients (CAC) on nominal ratings as well as various Intraclass Correlation Coefficients (ICC) on quantitative ratings. The CAC coefficients include the percent agreement, Cohen's kappa, Gwet's AC1 and AC2, Krippendorff's alpha, the Brennan-Prediger coefficient, Conger's kappa, and Fleiss' kappa. The ICC coefficients cover most ANOVA models used in inter-rater reliability studies.
The only tool you will ever need to use AgreeStat 360 is a web browser on a PC or a tablet. You may even run sophisticated analyses from your cell phone browser, although some of the forms may not display very well on a small cell phone screen.
You will need to register in order to receive a password by email that allows you to log in to a trial version and test all the features of AgreeStat 360. Test it now.

Thursday, August 29, 2019

R Package for Chance-corrected Agreement Coefficients

library(irrCAC)

Installation

devtools::install_github("kgwet/irrCAC")

Abstract

irrCAC is an R package that provides several functions for calculating various chance-corrected agreement coefficients. This package closely follows the general framework of inter-rater reliability assessment presented by Gwet (2014). A similar package was developed for Stata users by Klein (2018).
The functions included in this package can handle 3 types of input data:
  • The contingency table,
  • The distribution of raters by subject and by category,
  • The raw data, which is essentially a plain dataset where each row represents a subject and each column, the ratings associated with one rater.
The datasets contained in this package can be listed as follows:
  data(package="irrCAC")

Computing Agreement Coefficients


Computing agreement coefficients from contingency tables

cont3x3abstractors is one of 2 datasets included in this package that contain rating data from 2 raters organized in the form of a contingency table. The following R script shows how to compute Cohen’s kappa, Scott’s Pi, Gwet’s AC1, Brennan-Prediger, Krippendorff’s alpha, and the percent agreement coefficients from this dataset.

  cont3x3abstractors
#>         Ectopic AIU NIU
#> Ectopic      13   0   0
#> AIU           0  20   7
#> NIU           0   4  56
  kappa2.table(cont3x3abstractors)
#>      coeff.name coeff.val   coeff.se     coeff.ci coeff.pval
#> 1 Cohen's Kappa 0.7964094 0.05891072 (0.68,0.913)      0e+00

  scott2.table(cont3x3abstractors)
#>   coeff.name coeff.val   coeff.se      coeff.ci coeff.pval
#> 1 Scott's Pi 0.7962397 0.05905473 (0.679,0.913)      0e+00

  gwet.ac1.table(cont3x3abstractors)
#>   coeff.name coeff.val   coeff.se      coeff.ci coeff.pval
#> 1 Gwet's AC1 0.8493305 0.04321747 (0.764,0.935)      0e+00

  bp2.table(cont3x3abstractors)
#>         coeff.name coeff.val   coeff.se      coeff.ci 
#> 1 Brennan-Prediger     0.835 0.04693346 (0.742,0.928)
#> coeff.pval
#>      0e+00
  krippen2.table(cont3x3abstractors)
#>             coeff.name coeff.val   coeff.se     coeff.ci 
#> 1 Krippendorff's Alpha 0.7972585 0.05905473 (0.68,0.914)      
#> coeff.pval
#>      0e+00
  pa2.table(cont3x3abstractors)
#>          coeff.name coeff.val   coeff.se      coeff.ci 
#> 1 Percent Agreement      0.89 0.03128898 (0.828,0.952)      
#> coeff.pval
#>      0e+00
Suppose that you only want to obtain Gwet’s AC1 coefficient, but don’t care about the associated precision measures such as the standard error, confidence intervals or p-values. You can accomplish this as follows:
  ac1 <- gwet.ac1.table(cont3x3abstractors)$coeff.val
Then use the variable ac1 to obtain AC1 = 0.849.
A second contingency table dataset is also included in this package. You may use it to experiment with the R functions listed above.

Computing agreement coefficients from the distribution of raters by subject & category

Included in this package is a small dataset named distrib.6raters, which contains the distribution of 6 raters by subject and category. Each row represents a subject (i.e. a psychiatric patient) and shows the number of raters (i.e. psychiatrists) who classified that subject into each category used in the inter-rater reliability study. Here is the dataset and how it can be used to compute the various agreement coefficients:
distrib.6raters
#>    Depression Personality.Disorder Schizophrenia Neurosis Other
#> 1           0                    0             0        6     0
#> 2           0                    3             0        0     3
#> 3           0                    1             4        0     1
#> 4           0                    0             0        0     6
#> 5           0                    3             0        3     0
#> 6           2                    0             4        0     0
#> 7           0                    0             4        0     2
#> 8           2                    0             3        1     0
#> 9           2                    0             0        4     0
#> 10          0                    0             0        0     6
#> 11          1                    0             0        5     0
#> 12          1                    1             0        4     0
#> 13          0                    3             3        0     0
#> 14          1                    0             0        5     0
#> 15          0                    2             0        3     1
gwet.ac1.dist(distrib.6raters)
#> coeff.name   coeff  stderr      conf.int  p.value      pa      pe
#> Gwet's AC1 0.44480 0.08419 (0.264,0.625) 0.000116 0.55111 0.19148
fleiss.kappa.dist(distrib.6raters)
#>  coeff.name   coeff  stderr     conf.int  p.value      pa      pe
#>Fleiss Kappa 0.41393 0.08119 (0.24,0.588) 0.000162 0.55111 0.23407
krippen.alpha.dist(distrib.6raters)
#>  coeff.name   coeff  stderr      conf.int p.value      pa      pe
#>Krippendorff 0.42044 0.08243 (0.244,0.597) 0.00016 0.55610 0.23407
bp.coeff.dist(distrib.6raters)
#>  coeff.name   coeff  stderr      conf.int p.value      pa      pe 
#>Brennan-Pred 0.43889 0.08312 (0.261,0.617) 0.00012 0.55111     0.2
Once again, you can request a single value from these functions. To get only Krippendorff’s alpha coefficient without its precision measures, you may proceed as follows:
 alpha <- krippen.alpha.dist(distrib.6raters)$coeff
The newly-created alpha variable gives the coefficient α = 0.4204384.

Two additional datasets that represent ratings in the form of a distribution of raters by subject and by category are included in this package; one of them is cac.dist4cat, which is used in the example below. Note that these datasets contain more columns than needed to run the 4 functions presented in this section. Therefore, the columns associated with the response categories must be extracted from the original datasets before running the functions. For example, computing Gwet’s AC1 coefficient using the cac.dist4cat dataset should be done as follows:
  ac1 <- gwet.ac1.dist(cac.dist4cat[,2:4])$coeff
Note that the input dataset supplied to the function is cac.dist4cat[,2:4]. That is, only columns 2, 3, and 4 are extracted from the original dataset and used as input data. We know from the value of the newly created variable that AC1 = 0.3518903.

Computing agreement coefficients from raw ratings

One example dataset of raw ratings included in this package is cac.raw4raters and looks like this:
  cac.raw4raters
#>    Rater1 Rater2 Rater3 Rater4
#> 1       1      1     NA      1
#> 2       2      2      3      2
#> 3       3      3      3      3
#> 4       3      3      3      3
#> 5       2      2      2      2
#> 6       1      2      3      4
#> 7       4      4      4      4
#> 8       1      1      2      1
#> 9       2      2      2      2
#> 10     NA      5      5      5
#> 11     NA     NA      1      1
#> 12     NA     NA      3     NA
As you can see, a dataset of raw ratings is merely a listing of ratings that the raters assigned to the subjects. Each row is associated with a single subject. Typically, the same subject would be rated by all or some of the raters. The dataset cac.raw4raters contains some missing ratings represented by the symbol NA, suggesting that some raters did not rate all subjects. As a matter of fact, in this particular case, no rater rated all subjects.

Here is how you can compute the various agreement coefficients using raw ratings:
pa.coeff.raw(cac.raw4raters)
#> $est
#>   coeff.name      pa pe coeff.val coeff.se  conf.int  p.value
#>Pct Agreement 0.81818  0 0.8181818  0.12561 (0.542,1) 4.35e-05
#>     w.name
#> unweighted
#> 
#> $weights
#>      [,1] [,2] [,3] [,4] [,5]
#> [1,]    1    0    0    0    0
#> [2,]    0    1    0    0    0
#> [3,]    0    0    1    0    0
#> [4,]    0    0    0    1    0
#> [5,]    0    0    0    0    1
#> 
#> $categories
#> [1] 1 2 3 4 5
gwet.ac1.raw(cac.raw4raters)
#> $est
#>   coeff.name        pa        pe coeff.val coeff.se  conf.int   
#> 1        AC1 0.8181818 0.1903212   0.77544  0.14295 (0.461,1) 
#>       p.value     w.name
#> 1 0.000208721 unweighted
#> 
#> $weights
#>      [,1] [,2] [,3] [,4] [,5]
#> [1,]    1    0    0    0    0
#> [2,]    0    1    0    0    0
#> [3,]    0    0    1    0    0
#> [4,]    0    0    0    1    0
#> [5,]    0    0    0    0    1
#> 
#> $categories
#> [1] 1 2 3 4 5
fleiss.kappa.raw(cac.raw4raters)
#> $est
#>      coeff.name        pa        pe coeff.val coeff.se  conf.int
#> 1 Fleiss' Kappa 0.8181818 0.2387153   0.76117  0.15302 (0.424,1)
#>       p.value     w.name
#> 1 0.000419173 unweighted
#> 
#> $weights
#>      [,1] [,2] [,3] [,4] [,5]
#> [1,]    1    0    0    0    0
#> [2,]    0    1    0    0    0
#> [3,]    0    0    1    0    0
#> [4,]    0    0    0    1    0
#> [5,]    0    0    0    0    1
#> 
#> $categories
#> [1] 1 2 3 4 5
krippen.alpha.raw(cac.raw4raters)
#> $est
#>             coeff.name    pa   pe coeff.val coeff.se  conf.int
#> 1 Krippendorff's Alpha 0.805 0.24   0.74342  0.14557 (0.419,1)
#>        p.value     w.name
#> 1 0.0004594257 unweighted
#> 
#> $weights
#>      [,1] [,2] [,3] [,4] [,5]
#> [1,]    1    0    0    0    0
#> [2,]    0    1    0    0    0
#> [3,]    0    0    1    0    0
#> [4,]    0    0    0    1    0
#> [5,]    0    0    0    0    1
#> 
#> $categories
#> [1] 1 2 3 4 5
conger.kappa.raw(cac.raw4raters)
#> $est
#>       coeff.name        pa        pe coeff.val coeff.se  conf.int
#> 1 Conger's Kappa 0.8181818 0.2334252   0.76282  0.14917 (0.435,1)
#>        p.value     w.name
#> 1 0.0003367066 unweighted
#> 
#> $weights
#>      [,1] [,2] [,3] [,4] [,5]
#> [1,]    1    0    0    0    0
#> [2,]    0    1    0    0    0
#> [3,]    0    0    1    0    0
#> [4,]    0    0    0    1    0
#> [5,]    0    0    0    0    1
#> 
#> $categories
#> [1] 1 2 3 4 5
bp.coeff.raw(cac.raw4raters)
#> $est
#>         coeff.name        pa  pe coeff.val coeff.se  conf.int
#> 1 Brennan-Prediger 0.8181818 0.2   0.77273  0.14472 (0.454,1) 
#>        p.value     w.name
#> 1 0.0002375609 unweighted
#> 
#> $weights
#>      [,1] [,2] [,3] [,4] [,5]
#> [1,]    1    0    0    0    0
#> [2,]    0    1    0    0    0
#> [3,]    0    0    1    0    0
#> [4,]    0    0    0    1    0
#> [5,]    0    0    0    0    1
#> 
#> $categories
#> [1] 1 2 3 4 5
Most users of this package will only be interested in the agreement coefficients and possibly in the related statistics such as the standard error and p-values. In this case, you should run these functions as follows (AC1 is used here as an example. Feel free to experiment with the other coefficients):
ac1 <- gwet.ac1.raw(cac.raw4raters)$est  
ac1
#>   coeff.name        pa        pe coeff.val coeff.se  conf.int     
#> 1        AC1 0.8181818 0.1903212   0.77544  0.14295 (0.461,1) 
#>       p.value     w.name
#>1 0.000208721 unweighted
You can even request only the AC1 coefficient estimate 0.77544. You will then proceed as follows:
ac1 <- gwet.ac1.raw(cac.raw4raters)$est 
ac1$coeff.val
#> [1] 0.77544

References:

  1. Gwet, K.L. (2014) Handbook of Inter-Rater Reliability, 4th Edition. Advanced Analytics, LLC.
  2. Klein, D. (2018) “Implementing a general framework for assessing interrater agreement in Stata,” The Stata Journal, 18, 871-901.

Monday, August 26, 2019

R Package for Intraclass Correlation Coefficient as a Measure of Inter-Rater Reliability

library(irrICC)

Installation

devtools::install_github("kgwet/irrICC")

Abstract

irrICC is an R package that provides several functions for calculating various Intraclass Correlation Coefficients (ICC). This package follows closely the general framework of inter-rater and intra-rater reliability presented by Gwet (2014). Many of the intraclass correlation coefficients discussed by Shrout and Fleiss (1979) are also implemented in this package.

All input datasets to be used with this package must contain a mandatory “Target” column listing all subjects that were rated, and 2 or more columns “Rater1”, “Rater2”, … showing the ratings assigned to the subjects. The Target variable must be the first column of the data frame, and every other column is assumed to contain ratings from a rater. Note that all ratings must be numeric values for the ICC to be calculated. For example, here is a dataset “iccdata1” that is included in this package:

  iccdata1
#>    Target   J1  J2  J3 J4
#> 1       1  6.0 1.0 3.0  2
#> 2       1  6.5  NA 3.0  4
#> 3       1  4.0 3.0 5.5  4
#> 4       5 10.0 5.0 6.0  9
#> 5       5  9.5 4.0  NA  8
#> 6       4  6.0 2.0 4.0 NA
#> 7       4   NA 1.0 3.0  6
#> 8       4  8.0 2.5  NA  5
#> 9       2  9.0 2.0 5.0  8
#> 10      2  7.0  NA 2.0  6
#> 11      2  8.0  NA 2.0  7
#> 12      3 10.0 5.0 6.0 NA

The first column “Target” (the name Target can be replaced with any other name you like) contains subject identifiers, while J1, J2, J3, J4 are the 4 raters (referred to here as judges) with the ratings they assigned to the subjects. You will notice that the Target column contains duplicates, indicating that some subjects were rated multiple times. Moreover, none of these judges rated all subjects, as seen from the presence of missing ratings identified with the symbol NA.

Two other datasets, iccdata2 and iccdata3, come with the package for you to experiment with. Even if your data frame contains several variables, note that only the Target and the rater columns must be supplied as parameters to the functions. For example, the iccdata2 data frame contains a variable named Group, which indicates the specific group into which each Target is categorized. It must be excluded from the input dataset as follows: iccdata2[,2:6].

Computing various ICC values

To determine which function you need, you must first have a statistical description of the experimental data. There are essentially 3 statistical models recommended in the literature for describing quantitative inter-rater reliability data. These are commonly referred to as Model 1, Model 2 and Model 3.

  • Model 1
    Model 1 uses a single factor (hence the number 1) to explain the variation in the ratings. When that factor is the subject, the model is referred to as Model 1A; when it is the rater, the model is named Model 1B. You will want to use Model 1A if not all subjects are rated by the same roster of raters; that is, the raters may change from subject to subject. Model 1B is more appropriate if different raters may rate different rosters of subjects. Note that Model 1A only allows for the calculation of inter-rater reliability, while Model 1B only allows for the calculation of intra-rater reliability.

Calculating the ICC under Model 1A is done as follows:

  icc1a.fn(iccdata1)
#>      sig2s    sig2e     icc1a n r max.rep min.rep Mtot ov.mean
#> 1 1.761312 5.225529 0.2520899 5 4       3       1   40     5.2

It follows that the inter-rater reliability is given by 0.252, the first 2 output statistics being the subject variance component 1.761 and error variance component 5.226 respectively. You may see a description of the other statistics from the function’s documentation.

The ICC under Model 1B is calculated as follows:

  icc1b.fn(iccdata1)
#>     sig2r    sig2e     icc1b n r max.rep min.rep Mtot ov.mean
#> 1 4.32087 3.365846 0.5621217 5 4       3       1   40     5.2

It follows that the intra-rater reliability is given by 0.562, the first 2 output statistics being the rater variance component 4.321 and error variance component 3.366 respectively. A description of the other statistics can be found in the function’s documentation.

  • Model 2
    Model 2 includes a subject factor and a rater factor, both of which are considered random. That is, Model 2 is a pure random factorial ANOVA model. You may have Model 2 with a subject-rater interaction or Model 2 without a subject-rater interaction. Model 2 with subject-rater interaction is made up of 3 factors: the rater, subject and interaction factors, and is implemented in the function icc2.inter.fn.
    For information, the mathematical formulation of the full Model 2 is as follows: \(y_{ijk} = \mu + s_i + r_j + (sr)_{ij} + e_{ijk}\), where \(y_{ijk}\) is the rating associated with subject \(i\), rater \(j\) and replicate (or measurement) \(k\). Moreover, \(\mu\) is the average rating, \(s_i\) is subject \(i\)'s effect, \(r_j\) is rater \(j\)'s effect, \((sr)_{ij}\) is the subject-rater interaction effect associated with subject \(i\) and rater \(j\), and \(e_{ijk}\) is the error effect. The other statistical models are similar to this one. Some may be based on fewer factors, or the assumptions applicable to these factors may vary from model to model. Please read Gwet (2014) for a technical discussion of these models.

Calculating the ICC from the iccdata1 dataset (included in this package) and under the assumption of Model 2 with interaction is done as follows:

  icc2.inter.fn(iccdata1)
#>      sig2s    sig2r    sig2e    sig2sr    icc2r     icc2a n r 
#> 1 2.018593 4.281361 1.315476 0.4067361 0.251627 0.8360198 5 4
#>   max.rep min.rep Mtot ov.mean
#> 1       3       1   40     5.2

This function produces 2 intraclass correlation coefficients, icc2r and icc2a. While icc2r represents the inter-rater reliability, estimated at 0.252, icc2a represents the intra-rater reliability, estimated at 0.836. The first 4 output statistics are respectively the subject, rater, error, and subject-rater interaction variance components.

The ICC calculation with the iccdata1 dataset and under the assumption of Model 2 without interaction is done as follows:

  icc2.nointer.fn(iccdata1)
#>      sig2s   sig2r    sig2e     icc2r    icc2a n r max.rep 
#> 1 2.090769 4.34898 1.598313 0.2601086 0.801157 5 4       3
#>   min.rep Mtot ov.mean
#> 1       1   40     5.2

The 2 intraclass correlation coefficients have now become icc2r = 0.26 and icc2a = 0.801. That is, the estimated inter-rater reliability went up slightly, while the intra-rater reliability coefficient went down slightly.

  • Model 3
    To calculate the ICC using the iccdata1 dataset and under the assumption of Model 3 with interaction, you should proceed as follows:
  icc3.inter.fn(iccdata1)
#>      sig2s    sig2e    sig2sr     icc2r     icc2a n r max.rep 
#> 1 2.257426 1.315476 0.2238717 0.5749097 0.6535279 5 4       3       
#>   min.rep Mtot ov.mean
#> 1       1   40     5.2

Here, the 2 intraclass correlation coefficients are given by icc2r = 0.575 and icc2a = 0.654. The estimated inter-rater reliability went up substantially while the intra-rater reliability coefficient went down substantially compared to Model 2 with interaction.
Assuming Model 3 without interaction, the same coefficients are calculated as follows:

  icc3.nointer.fn(iccdata1)
#>      sig2s    sig2e     icc2r     icc2a n r max.rep min.rep Mtot 
#> 1 2.241792 1.470638 0.6038611 0.6038611 5 4       3       1   40    
#>   ov.mean
#> 1     5.2

It follows that the 2 ICCs are given by icc2r = 0.604 and icc2a = 0.604. As usual, the omission of an interaction factor leads to a slight increase in inter-rater reliability and a slight decrease in intra-rater reliability. In this case, both become identical.

References:

  1. Gwet, K.L. (2014) Handbook of Inter-Rater Reliability, 4th Edition. Advanced Analytics, LLC.
  2. Shrout, P. E., and Fleiss, J. L. (1979), "Intraclass Correlations: Uses in Assessing Rater Reliability." Psychological Bulletin, 86(2), 420-428.

Saturday, January 26, 2019

Inter-Rater Reliability for Stata Users

Stata users now have a convenient way to compute a wide variety of agreement coefficients within a general framework.  The module KAPPAETC can be installed from within Stata and computes various measures of inter-rater agreement and associated standard errors and confidence intervals. 

A very interesting background article entitled "Implementing a general framework for assessing interrater agreement in Stata" by Daniel Klein is certainly a must-read for Stata users who want to understand the calculations performed by KAPPAETC behind the scenes. KAPPAETC is a remarkably well-written Stata package, and it is what I strongly recommend to all Stata users for calculating the AC1, kappa, and Krippendorff's alpha agreement coefficients along with their associated standard errors and confidence intervals.

Monday, August 20, 2018

AC1 Coefficient implemented in the FREQ Procedure of SAS

As of SAS/STAT version 14.2, the AC1 (see Gwet, 2008) and PABAK (see Byrt, Bishop, and Carlin, 1993) agreement coefficients can be calculated using the FREQ procedure of SAS, in addition to Cohen's kappa. Therefore, SAS users no longer need other software to obtain these statistics.

SAS users should nevertheless be aware that, by default, the FREQ procedure systematically deletes all observations with a missing value.  Consequently, the results obtained with SAS may differ from those obtained with the R functions available in several packages if your dataset contains missing ratings.  An option is available for instructing the FREQ procedure to treat missing values as true categories. However, this option is useless for the analysis of agreement among raters.  What would be of interest is for Proc FREQ developers to allow the marginals associated with rater1 and rater2 to be calculated independently.  That is, if a rating is available from rater1, then it should be used for calculating rater1's marginals whether the corresponding rating is available from rater2 or not.

One last comment.  The coefficient often referred to by researchers as PABAK is also known (perhaps more rightfully so) as the Brennan-Prediger coefficient.  It was formally studied by Brennan & Prediger (1981), 13 years earlier.


Bibliography.

Byrt, T., Bishop, J., and Carlin, J. B. (1993). Bias, prevalence and Kappa. Journal of Clinical Epidemiology, 46, 423-429.

Brennan, R. L., and Prediger, D. J. (1981). Coefficient Kappa: some uses, misuses, and alternatives. Educational and Psychological Measurement, 41, 687-699. 

Gwet, K. L. (2008). Computing inter-rater reliability and its variance in the presence of high agreement. British Journal of Mathematical and Statistical Psychology, 61, 29-48.