K. Gwet's Inter-Rater Reliability Blog : The Perreault-Leigh Agreement Coefficient is ProblematicInter-rater reliability: Cohen kappa, Gwet AC1/AC2, Krippendorff Alpha

Perreault and Leigh (1989) considering that there was a need to have an agreement coefficient "that is more appropriate to the type of data typically encountered in marketing contexts," decided to propose a new agreement coefficient known in the literature with the symbol I_r. I firmly believe that the mathematical derivations that led to this coefficient were wrong, even though the underlying ideas are right. A proper translation of these ideas would inevitably have led to the Brennan-Prediger coefficient or to the percent agreement depending on the assumption made.

The Perreault-Leigh agreement coefficient is formally defined as follows:

where S defined by,

is the agreement coefficient recommended by Bennet et al. (1954) and is a special case of the coefficient recommended by Brennan and Prediger (1981). The symbol Ir used by Perreault and Leigh (1989) appears to stand for “Index of Reliability.”

I carefully reviewed the Perreault and Leigh article. It presents an excellent review of the various agreement coefficients that were current at the time it was written. Perreault and Leigh define I_r as the percent of subjects that a typical judge could code consistent given the nature of the observations. Note that Ir is an attribute of the typical judge, and therefore does not represent any aspect of agreement among the judges. Perreault and Leigh (1989) consider the product NxI_r² (with N representing the number of subjects) to represent the number of reliable judgments on which judges agree. This cannot be true. To see this note that I_r² is the probability that 2 judges both independently perform a reliable judgment. If both (reliable) judgments must lead to an agreement then they have to refer to the exact same category. However the probability I_r² does not say which category was chosen and cannot represent any agreement among judges. Even if you decide to assume that any 2 reliable judgments must necessarily result in an agreement, then the judgments will no longer be independent. The probability for two judges to agree will now become equal to the probability for the first rater to perform a reliable judgment times the conditional probability for the second judge to perform a reliability judgment given that the first judge did. This second conditional probability cannot be evaluated unless there are additional assumptions.

What Perreault and Leigh (1989) have proposed is not an agreement coefficient. Their coefficient quantifies something other than the extent of agreement among raters. It should not be compared with the other coefficients available in the literature until someone can tell us what it does.

References:

[1] Bennett, E. M., Alpert, R. & Goldstein, A. C. (1954). Communication through limited response questioning. Public Opinion Quarterly, 18, 303-308.

[2] Brennan, R. L., & Prediger, D. J. (1981). Coefficient kappa: Some uses, misuses, and alternatives. Educational and Psychological Measurement, 41, 687-699.

[3] Perreault, W. D. & Leigh, L. E. (1989). Reliability of nominal data based on qualitative judgments. Journal of Marketing Research, 26, 135-148.

K. Gwet's Inter-Rater Reliability Blog

Saturday, March 8, 2014

The Perreault-Leigh Agreement Coefficient is Problematic

No comments:

Post a Comment