- You want to evaluate the extent of agreement among three or more raters.
- For practical reasons, the inter-rater reliability experiment is designed so that **only 2 raters are randomly assigned to each subject**. For each subject, a new pair of raters is independently drawn from the same pool of raters. Consequently, each subject receives 2 ratings from a pair of raters that may vary from subject to subject.

Note that most inter-rater reliability coefficients found in the literature are based on the assumption that each subject is rated by all raters. This ubiquitous fully-crossed design may prove impractical when rating costs are prohibitive. The question then becomes: which coefficient should be used to evaluate the extent of agreement among multiple raters when only 2 of them rate any given subject?

The solution to this problem is actually quite simple and does not require any new coefficient beyond those already available in the literature. It consists of using your coefficient of choice and computing it as if all the ratings had been produced by the exact same pair of raters. What changes drastically is the interpretation of the coefficient's magnitude compared with a design in which only 2 raters actually participated. If the ratings come from 2 raters only, the standard error associated with the coefficient will be smaller than if the ratings came from 5 or more raters grouped in pairs. In the latter case, the coefficient is subject to an additional source of variation, due to the random assignment of raters to subjects, that must be taken into account. I prepared an unpublished paper on this topic entitled

*"An Evaluation of the Impact of Design on the Analysis of Nominal-Scale Inter-Rater Reliability Studies"*, which interested readers may want to download for a more detailed discussion of this interesting topic.
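The recipe above can be sketched in a few lines of Python. This is a minimal illustration, not the author's code: it computes simple percent agreement on ratings pooled as if a single fixed pair of raters had produced them, and attaches a subject-level bootstrap confidence interval. Resampling whole subjects means the rater-pair-to-subject assignment is resampled along with the ratings, so the interval reflects the extra variation discussed above. The function names and the sample data are my own.

```python
# Hypothetical sketch: percent agreement treated as if one fixed pair of
# raters rated every subject, with a subject-level bootstrap CI.
import random

def percent_agreement(pairs):
    """Fraction of subjects on which the two assigned raters agree."""
    return sum(a == b for a, b in pairs) / len(pairs)

def bootstrap_ci(pairs, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI. Whole subjects are resampled, so the
    random pairing of raters to subjects varies across replicates."""
    rng = random.Random(seed)
    stats = sorted(
        percent_agreement([rng.choice(pairs) for _ in pairs])
        for _ in range(n_boot)
    )
    lo = stats[int(alpha / 2 * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Each tuple holds the two ratings a subject received, regardless of
# which pair of raters happened to produce them (made-up data).
ratings = [("yes", "yes"), ("yes", "no"), ("no", "no"),
           ("yes", "yes"), ("no", "yes"), ("no", "no")]
print(percent_agreement(ratings))  # 4 agreements out of 6 subjects
print(bootstrap_ci(ratings))
```

The same pooling idea applies unchanged to chance-corrected coefficients such as Cohen's kappa or Gwet's AC1; only the statistic computed inside the bootstrap loop would differ.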
Thanks a lot for this very useful contribution! Is there reusable code somewhere to compute the CIs? Thanks!
