Saturday, February 10, 2018

Inter-rater reliability among multiple raters when subjects are rated by different pairs of raters

In this post, I would like to briefly address an issue that researchers have contacted me about on many occasions. It can be described as follows:
  • You want to evaluate the extent of agreement among 3 raters or more.
  • For various practical reasons, the inter-rater reliability experiment is designed in such a way that only 2 raters are randomly assigned to each subject.  For each subject, a new pair of raters is independently chosen from the same pool of several raters. Consequently, each subject gets 2 ratings from a pair of raters that could vary from subject to subject.
Note that most inter-rater reliability coefficients found in the literature are based upon the assumption that each subject is rated by all raters. This ubiquitous fully-crossed design may prove impractical when rating costs are prohibitive. The question then becomes: which coefficient should be used to evaluate the extent of agreement among multiple raters when only 2 of them are allowed to rate any given subject?
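To make the design concrete, here is a minimal sketch of how such an experiment might be laid out. The rater pool, subject labels, and pool size are all hypothetical, chosen purely for illustration:

```python
import random

rng = random.Random(0)
raters = ["R1", "R2", "R3", "R4", "R5"]          # hypothetical pool of 5 raters
subjects = [f"S{i}" for i in range(1, 7)]        # hypothetical subjects

# For each subject, a pair of raters is independently drawn from the pool;
# only those 2 raters score that subject, so the pair varies across subjects.
design = {s: tuple(rng.sample(raters, 2)) for s in subjects}
for subject, pair in design.items():
    print(subject, pair)
```

Each subject therefore ends up with exactly 2 ratings, but the identity of the pair producing them changes from one subject to the next.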

The solution to this problem is actually quite simple and does not involve any new coefficient not already available in the literature. It consists of using your coefficient of choice and calculating the agreement coefficient as if the ratings had all been produced by the exact same pair of raters. It is the interpretation of its magnitude that is drastically different from what it would be if only 2 raters had actually participated in the experiment. If the ratings come from 2 raters only, then the standard error associated with the coefficient will be smaller than if the ratings came from 5 raters or more grouped in pairs. In the latter case, the coefficient is subject to an additional source of variation, due to the random assignment of raters to subjects, that must be taken into consideration. I prepared an unpublished paper on this topic entitled "An Evaluation of the Impact of Design on the Analysis of Nominal-Scale Inter-Rater Reliability Studies," which interested readers may want to download for a more detailed discussion of this interesting topic.
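The computation itself can be sketched as follows. The ratings below are fabricated for illustration, and the coefficient shown is plain percent agreement rather than a chance-corrected one; the same idea applies to any coefficient of choice. The key point is that resampling subjects (here via a simple percentile bootstrap) also resamples the rater pairs attached to them, so the resulting interval reflects the additional variation due to the random assignment of raters to subjects:

```python
import random

# Hypothetical data: one pair of ratings per subject; each pair may come
# from a different pair of raters, but only the two ratings are retained.
ratings = [
    ("A", "A"), ("A", "B"), ("B", "B"), ("A", "A"), ("B", "A"),
    ("A", "A"), ("B", "B"), ("A", "B"), ("A", "A"), ("B", "B"),
]

def percent_agreement(pairs):
    """Observed agreement, computed as if all ratings came from one fixed pair of raters."""
    return sum(r1 == r2 for r1, r2 in pairs) / len(pairs)

def bootstrap_ci(pairs, stat, n_boot=2000, alpha=0.05, seed=42):
    """Percentile bootstrap over subjects. Because each resampled subject
    carries its own rater pair, the interval absorbs the extra variance
    due to the random rater-to-subject assignment."""
    rng = random.Random(seed)
    n = len(pairs)
    boot = sorted(stat([pairs[rng.randrange(n)] for _ in range(n)])
                  for _ in range(n_boot))
    lo = boot[int((alpha / 2) * n_boot)]
    hi = boot[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

pa = percent_agreement(ratings)
lo, hi = bootstrap_ci(ratings, percent_agreement)
print(f"observed agreement = {pa:.2f}, 95% CI = ({lo:.2f}, {hi:.2f})")
```

With 2 fixed raters, a narrower interval would be appropriate; with many raters grouped in random pairs, a subject-level resampling scheme like this one is a simple way to let the extra source of variation show up in the standard error.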

1 comment:

  1. Thanks a lot for this very useful contribution! Is there reusable code somewhere to compute the CIs? Thanks!