Tuesday, March 30, 2021

Agreement Among 3 Raters or More When a Subject Can Be Rated by No More Than 2 Raters

Most methods proposed in the literature for evaluating the extent of agreement among 3 raters or more assume that each rater rates every subject. In some inter-rater reliability applications, however, this requirement cannot be satisfied, either because of the prohibitive costs associated with the rating process or because the rating process is too demanding for a human subject.  For example, scientific laboratories are often rated by accrediting agencies to have their work quality officially certified. These accrediting agencies themselves need to conduct inter-rater reliability studies to demonstrate the high quality of their accreditation process.  Given the high costs associated with accrediting a laboratory (a staggering number of lab procedures must be verified and documentation reviewed), agencies are willing to fund a single round of rating for each laboratory with one rater, and to use another rater to provide the ratings during the regular accreditation process, which is funded by each lab.

The question now becomes: "Is it possible to evaluate the extent of agreement among 3 raters or more, given that a maximum of 2 raters is allowed to rate the same subject?"  The good news is that it is indeed possible to design an experiment that achieves this goal.  However, there is a price to be paid: the agreement coefficient based on such a design will have a higher variance than the traditional coefficient based on the fully-crossed design, where each rater rates all subjects.  The general approach is as follows:

  • Suppose your problem is to quantify the extent of agreement among the group of 5 raters \({\cal R}=\{Rater1, Rater2, Rater3, Rater4, Rater5 \}\).
  • Out of the roster of 5 raters \(\cal R\), one can form the following 10 different pairs of raters: (Rater1, Rater2), (Rater1, Rater3), (Rater1, Rater4), (Rater1, Rater5), (Rater2, Rater3), (Rater2, Rater4), (Rater2, Rater5), (Rater3, Rater4), (Rater3, Rater5) and (Rater4, Rater5). Note that if \(r\) is the number of raters, then the associated number of pairs that can be formed is \(r(r-1)/2=5\times4/2=10\).
  • Suppose that a total of \(n=15\) subjects will participate in your experiment.  The procedure consists of selecting 15 pairs of raters randomly and with replacement (i.e. one pair of raters could be selected more than once) from the above 10 pairs. The 15 selected pairs of raters are then assigned to the 15 subjects on a flow basis, i.e. sequentially, in the order in which they are selected.
  • Select with replacement 15 random integers between 1 and 10.  Suppose the 15 random integers are \(\{2, 6, 2, 5, 4, 1, 8, 1, 3, 3, 5, 4, 2, 5, 9\}\).  That is, the \(2^{nd}\) pair (Rater1, Rater3) will be assigned to subjects 1, 3 and 13, the \(6^{th}\) pair (Rater2, Rater4) will be assigned to subject 2, and so on. The full subject-by-pair layout of this experimental design can be reproduced with the sketch that follows this list.
  • Once all 15 subjects are rated, the dataset of ratings will have 3 columns.  The first column, Subject, will identify the subjects; the remaining 2 columns will contain the ratings from the different pairs of raters assigned to the subjects.  The agreement coefficient is then calculated as if the same 2 raters had produced all the ratings.  What will be different is the variance associated with the agreement coefficient.
  • What is described here is referred to as a Partially Crossed design with 2 raters per subject, or \(\textsf{PC}_2\) design, and is discussed in detail in the \(5^{th}\) edition of the Handbook of Inter-Rater Reliability, to be released in July of 2021.
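
To make the procedure concrete, below is a minimal Python sketch of the \(\textsf{PC}_2\) design construction described above. The rater names, the lexicographic pair ordering, and the example random integers are taken from this post; the percent-agreement function at the end is only an illustrative stand-in for whichever agreement coefficient you actually compute on the resulting 2-column ratings.

```python
from itertools import combinations

raters = ["Rater1", "Rater2", "Rater3", "Rater4", "Rater5"]

# Step 1: form all r(r-1)/2 = 10 pairs of raters, in lexicographic order
# (so pair 2 is (Rater1, Rater3) and pair 6 is (Rater2, Rater4), as above).
pairs = list(combinations(raters, 2))

# Step 2: select 15 pair indices at random, with replacement.  The example
# sequence from the post is reused here; in practice you would draw it,
# e.g. with random.choices(range(1, 11), k=15) from the random module.
selected = [2, 6, 2, 5, 4, 1, 8, 1, 3, 3, 5, 4, 2, 5, 9]

# Step 3: assign the selected pairs to subjects 1 through 15 sequentially,
# in the order in which they were selected.
design = {subject: pairs[index - 1] for subject, index in enumerate(selected, start=1)}
for subject, (rater_a, rater_b) in design.items():
    print(f"Subject {subject:2d}: {rater_a} and {rater_b}")

# Step 4: once the ratings are collected, stack them into a 3-column dataset
# (Subject, Rating1, Rating2) and compute the agreement coefficient as if a
# single pair of raters had produced every row.  Percent agreement is shown
# here purely as an illustrative placeholder for your coefficient of choice.
def percent_agreement(rows):
    """rows: list of (subject, rating1, rating2) tuples."""
    return sum(r1 == r2 for _, r1, r2 in rows) / len(rows)
```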