Monday, August 20, 2018

AC1 Coefficient implemented in the FREQ Procedure of SAS

As of SAS/STAT version 14.2, the AC1 (see Gwet, 2008) and PABAK (see Byrt, Bishop, and Carlin, 1993) agreement coefficients can be calculated with the FREQ procedure of SAS, in addition to Cohen's kappa. Therefore, SAS users no longer need to turn to other software to obtain these statistics.
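
For illustration, here is a minimal sketch of the corresponding PROC FREQ call. The AC1 and PABAK suboptions of the AGREE option are written as I understand the SAS/STAT 14.2 syntax; check the PROC FREQ documentation for your release before relying on them.

   /* Two raters' nominal ratings stored in columns rater1 and rater2 */
   proc freq data=ratings;
      tables rater1*rater2 / agree(ac1 pabak);   /* kappa, AC1 and PABAK */
   run;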

SAS users should nevertheless be aware that, by default, the FREQ procedure deletes every observation with at least one missing rating.  Consequently, if your dataset contains missing ratings, the results obtained with SAS may differ from those obtained with the R functions available in several packages.  An option exists for instructing the FREQ procedure to treat missing values as actual categories, but it is of no use for the analysis of agreement among raters.  What would be of interest is for the PROC FREQ developers to allow the marginals associated with rater1 and rater2 to be calculated independently.  That is, if a rating is available from rater1, it should be used for calculating rater1's marginals whether or not a rating is available from rater2.
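
The sketch below makes the two behaviors explicit: the default call drops every subject with a partially missing rating, while the MISSING option of the TABLES statement keeps them by turning the missing values into a category of their own. Neither treatment produces the rater-specific marginals one would actually want here.

   /* Default: subjects with a missing rating from either rater are dropped */
   proc freq data=ratings;
      tables rater1*rater2 / agree;
   run;

   /* MISSING option: missing ratings become an extra category of the table, */
   /* which is not a sensible treatment for an agreement analysis            */
   proc freq data=ratings;
      tables rater1*rater2 / agree missing;
   run;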

One last comment.  The coefficient often referred to by researchers as PABAK is also known (perhaps more rightfully so) as the Brennan-Prediger coefficient.  It was formally studied by Brennan and Prediger (1981), 12 years before Byrt, Bishop, and Carlin (1993).
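
For a q-category scale, both definitions lead to the same expression, (pa - 1/q)/(1 - 1/q), where pa denotes the overall percent agreement; with q = 2 categories this reduces to the familiar PABAK value 2*pa - 1.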


Bibliography.

Brennan, R. L., and Prediger, D. J. (1981). Coefficient Kappa: some uses, misuses, and alternatives. Educational and Psychological Measurement, 41, 687-699.

Byrt, T., Bishop, J., and Carlin, J. B. (1993). Bias, prevalence and Kappa. Journal of Clinical Epidemiology, 46, 423-429.

Gwet, K. L. (2008). Computing inter-rater reliability and its variance in the presence of high agreement. British Journal of Mathematical and Statistical Psychology, 61, 29-48.

Saturday, February 10, 2018

Inter-rater reliability among multiple raters when subjects are rated by different pairs of raters

In this post, I would like to briefly address an issue that researchers have contacted me about on many occasions.  This issue can be described as follows:
  • You want to evaluate the extent of agreement among 3 raters or more.
  • For various practical reasons, the inter-rater reliability experiment is designed in such a way that only 2 raters are randomly assigned to each subject.  For each subject, a new pair of raters is independently drawn from the same pool of several raters. Consequently, each subject receives 2 ratings, from a pair of raters that may vary from subject to subject.
Note that most inter-rater reliability coefficients found in the literature assume that each subject is rated by all raters.  This ubiquitous fully-crossed design may prove impractical when rating costs are prohibitive.  The question then becomes: which coefficient should be used for evaluating the extent of agreement among multiple raters when only 2 of them rate any given subject?

The solution to this problem is actually quite simple and does not require any new coefficient beyond those already available in the literature. It consists of taking your coefficient of choice and calculating it as if all the ratings had been produced by the same pair of raters. What changes drastically is the interpretation of its magnitude compared with a study in which only 2 raters actually participated.  If the ratings come from 2 raters only, then the standard error associated with the coefficient will be smaller than if the ratings came from 5 or more raters grouped in pairs. In the latter case, the coefficient is subject to an additional source of variation, due to the random assignment of raters to subjects, that must be taken into consideration. I prepared an unpublished paper on this topic entitled "An Evaluation of the Impact of Design on the Analysis of Nominal-Scale Inter-Rater Reliability Studies," which interested readers may want to download for a more detailed discussion of this interesting topic.
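
As a concrete sketch (the dataset and variable names below are hypothetical), the data would be arranged with one row per subject and two columns holding the two ratings the subject received, whichever pair of raters produced them, and the same PROC FREQ call as in the previous post would then be applied:

   /* One row per subject: rating1 and rating2 are the two ratings received, */
   /* regardless of which pair of raters was assigned to that subject        */
   proc freq data=paired_ratings;
      tables rating1*rating2 / agree;   /* add AC1/PABAK suboptions as needed */
   run;

The point estimate obtained this way is the one discussed above; the standard error printed by PROC FREQ, however, reflects only the fixed two-rater situation and therefore understates the variance when the pairs of raters change from subject to subject.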