## Monday, March 31, 2014

### Some R functions for calculating chance-corrected agreement coefficients

Several researchers have expressed interest in R functions that can compute various chance-corrected agreement coefficients, along with their standard errors, confidence intervals, and p-values, as described in my book Handbook of Inter-Rater Reliability (3rd ed.). I have finally found the time to write these R functions, which can be downloaded from the r-functions page of the agreestat website.

All of these R functions handle missing values without problems, and they cover several types of agreement coefficients, including Gwet's AC1/AC2 (2008, 2012), the kappa coefficients of Cohen (1960), Fleiss (1971), and Conger (1980), the Brennan & Prediger (1981) coefficient, the Krippendorff (1970) coefficient, and the percent agreement.
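To illustrate what these coefficients measure, here is a minimal R sketch (not the downloadable agreestat functions themselves) that computes the percent agreement, Cohen's kappa, Gwet's AC1, and the Brennan-Prediger coefficient for two raters on a small made-up set of nominal ratings; the data and variable names are purely illustrative.

```r
# Two raters, nominal ratings, no missing values (hypothetical data)
r1 <- c("a", "a", "b", "b", "c", "a", "c", "b", "a", "a")
r2 <- c("a", "a", "b", "c", "c", "a", "c", "b", "b", "a")
cats <- sort(unique(c(r1, r2)))
q <- length(cats)   # number of categories
n <- length(r1)     # number of subjects

pa <- mean(r1 == r2)   # percent agreement

# Cohen's kappa: chance agreement from each rater's marginal distribution
p1 <- table(factor(r1, cats)) / n
p2 <- table(factor(r2, cats)) / n
pe_kappa <- sum(p1 * p2)
kappa <- (pa - pe_kappa) / (1 - pe_kappa)

# Gwet's AC1: chance agreement from the average category propensities
pi_k <- (p1 + p2) / 2
pe_ac1 <- sum(pi_k * (1 - pi_k)) / (q - 1)
ac1 <- (pa - pe_ac1) / (1 - pe_ac1)

# Brennan-Prediger: chance agreement fixed at 1/q
bp <- (pa - 1/q) / (1 - 1/q)
```

Note how the three chance-corrected coefficients differ only in how the chance-agreement probability is estimated from the same observed percent agreement.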

Bibliography:

[1] Brennan, R. L., and Prediger, D. J. (1981). "Coefficient Kappa: some uses, misuses, and alternatives." Educational and Psychological Measurement, 41, 687-699.
[2] Cohen, J. (1960). "A coefficient of agreement for nominal scales." Educational and Psychological Measurement, 20, 37-46.
[3] Conger, A. J. (1980). "Integration and generalization of Kappas for multiple raters." Psychological Bulletin, 88, 322-328.
[4] Fleiss, J. L. (1971). "Measuring nominal scale agreement among many raters." Psychological Bulletin, 76, 378-382.
[5] Gwet, K. L. (2008). "Computing inter-rater reliability and its variance in the presence of high agreement." British Journal of Mathematical and Statistical Psychology, 61, 29-48.
[6] Gwet, K. L. (2012). Handbook of Inter-Rater Reliability (3rd ed.). Advanced Analytics, LLC, Maryland, USA.
[7] Krippendorff, K. (1970). "Estimating the reliability, systematic error, and random error of interval data." Educational and Psychological Measurement, 30, 61-70.

## Saturday, March 8, 2014

### The Perreault-Leigh Agreement Coefficient is Problematic

Perreault and Leigh (1989), considering that there was a need for an agreement coefficient "that is more appropriate to the type of data typically encountered in marketing contexts," decided to propose a new agreement coefficient, known in the literature by the symbol Ir. I firmly believe that the mathematical derivations that led to this coefficient were wrong, even though the underlying ideas are right. A proper translation of these ideas would inevitably have led to the Brennan-Prediger coefficient or to the percent agreement, depending on the assumptions made.

The Perreault-Leigh agreement coefficient is formally defined as follows:

Ir = √S if S > 0, and Ir = 0 otherwise,

where S, defined by

S = (q·pa − 1) / (q − 1),

with pa the percent agreement and q the number of categories, is the agreement coefficient recommended by Bennett et al. (1954) and is a special case of the coefficient recommended by Brennan and Prediger (1981). The symbol Ir used by Perreault and Leigh (1989) appears to stand for "Index of Reliability."
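The two formulas above can be sketched in R as follows; the function names are mine, and pa and q are as defined above.

```r
# Bennett et al.'s S: percent agreement corrected by a fixed 1/q chance term
bennett_s <- function(pa, q) (q * pa - 1) / (q - 1)

# Perreault-Leigh Ir: the square root of S, truncated at 0 when pa <= 1/q
perreault_leigh_ir <- function(pa, q) {
  s <- bennett_s(pa, q)
  if (s > 0) sqrt(s) else 0
}
```

For example, with a percent agreement of 0.8 and q = 3 categories, S = 0.7 and Ir = √0.7 ≈ 0.837.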

I carefully reviewed the Perreault and Leigh article. It presents an excellent review of the various agreement coefficients that were current at the time it was written. Perreault and Leigh define Ir as the percent of subjects that a typical judge could code consistently, given the nature of the observations. Note that Ir is an attribute of the typical judge, and therefore does not represent any aspect of agreement among the judges.

Perreault and Leigh (1989) consider the product N×Ir² (with N representing the number of subjects) to represent the number of reliable judgments on which judges agree. This cannot be true. To see this, note that Ir² is the probability that 2 judges both independently perform a reliable judgment. If both (reliable) judgments must lead to an agreement, then they have to refer to the exact same category. However, the probability Ir² says nothing about which category was chosen, and therefore cannot represent any agreement among judges. Even if you decide to assume that any 2 reliable judgments must necessarily result in an agreement, then the judgments will no longer be independent. The probability for two judges to agree now becomes the probability for the first judge to perform a reliable judgment, times the conditional probability for the second judge to perform a reliable judgment given that the first judge did. This conditional probability cannot be evaluated without additional assumptions.
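The gap between Ir² and actual agreement can be made concrete with a small numeric sketch. Assuming hypothetical values for Ir and for the distribution of categories that a reliable judgment lands on, the probability that two independent judges are both reliable AND pick the same category is strictly smaller than Ir²:

```r
ir <- 0.9                   # hypothetical probability that a typical judge codes reliably
p_cat <- c(0.5, 0.3, 0.2)   # hypothetical category distribution of a reliable judgment

p_both_reliable <- ir^2                 # what N * Ir^2 implicitly counts
p_agree <- ir^2 * sum(p_cat^2)          # both reliable AND same category (independence)
```

Here p_both_reliable is 0.81, while the probability of an actual agreement between the two reliable judgments is only 0.81 × 0.38 ≈ 0.31, so counting N×Ir² as "reliable judgments on which judges agree" overstates agreement unless one adds the very assumptions that break independence.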

What Perreault and Leigh (1989) have proposed is not an agreement coefficient. Their coefficient quantifies something other than the extent of agreement among raters, and it should not be compared with the other coefficients available in the literature until someone can tell us what it actually measures.

References:

[1] Bennett, E. M., Alpert, R. & Goldstein, A. C. (1954). Communication through limited response questioning. Public Opinion Quarterly, 18, 303-308.

[2] Brennan, R. L., & Prediger, D. J. (1981). Coefficient kappa: Some uses, misuses, and alternatives. Educational and Psychological Measurement, 41, 687-699.

[3] Perreault, W. D. & Leigh, L. E. (1989). Reliability of nominal data based on qualitative judgments. Journal of Marketing Research, 26, 135-148.