Wednesday, December 18, 2013

The Paradoxes of Agreement Coefficients: An Impossible Justification

Feinstein and Cicchetti (1990) exposed the kappa coefficient of Cohen (1960) as an agreement metric prone to yield unduly low values when the distribution of subjects is skewed towards one category, even when the raters strongly agree about their ratings. This problem is known in the inter-rater reliability literature as the kappa paradox. As it turned out, kappa was not the only agreement coefficient to carry this issue. A few other agreement coefficients - Scott’s (1955) π and Krippendorff’s (1980, 2004a, 2012) α among others - used by some researchers have the same problem. Despite the availability of abundant, well-documented and strong evidence supporting its seriousness, some authors have attempted and are still attempting to present the kappa paradox as a non-issue or a side issue. Presenting one’s viewpoint is always welcome, as it increases knowledge and provides insights into a problem. What irritates me is when some scholars become demagogues in an attempt to defend their past contributions to the literature, instead of using new evidence to improve them.

Kraemer et al. (2002, p. 2114) attempted to defend the kappa coefficient by arguing that what is presented as a kappa paradox is not a paradox. I pointed out in Gwet (2012, p. 38) that Kraemer et al. (2002) only made excuses for the poor kappa performance by blaming the distribution of subjects. More recently, when commenting on an article by Zhao, Liu, and Deng (2013), Krippendorff (2013) made an even more demagogic argument, concluding that the low values of Cohen’s (1960) κ, Scott’s (1955) π and Krippendorff’s (1980, 2004a, 2012) α are justified even in the presence of high agreement among raters.

To make his case, here is what Krippendorff (2013) says:

“Suppose an instrument manufacturer claims to have developed a test to diagnose a rare disease. Rare means that the probability of that disease in a population is small and to have enough cases in the sample, a large number of individuals need to be tested. Let us use the authors’ numerical example: Suppose two separate doctors administer the test to the same 1,000 individuals. Suppose each doctor finds one in 1,000 to have the disease and they agree in 998 cases on the outcome of the test. The authors note that Cohen’s (1960) κ, Scott’s (1955) π, and Krippendorff’s (1980, 2004a, 2012) α are all below zero (-0.001 or -0.0005)...”
“... I contend that a test which produces 99.8% negatives, 2% disagreements, and not a single case of an agreement on the presence of the disease is totally unreliable indeed. Nobody in her right mind should trust a doctor who would treat patients based on such test results. The inference of zero is perfectly justifiable. The paradox of “high agreement but low reliability” does not characterize any of the reliability indices cited but resides entirely in the authors’ conceptual limitations. How could the authors be so wrong ?”

I am outraged by this point. Here is how I perceive it. If you are going to quantify the extent of agreement among raters who strongly agree in any sense you can think of, then your scoring method must assign a high agreement coefficient to these raters. If it fails to do so, then don’t switch topics by pretending that your low coefficient must instead be associated with the unascertained shortcomings of the measuring instrument. The propensity of a test to detect a rare trait is an entirely different topic, which requires a different experimental design and different quantitative methods with little in common with agreement coefficients.
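
For readers who want to check the figures quoted above, here is a minimal Python sketch of the three coefficients for the two-doctor example. It assumes the table implied by the quote: no subject rated positive by both doctors, one subject rated positive by each doctor alone, and 998 rated negative by both. The function name `agreement_coefficients` and the cell labels `a`, `b`, `c`, `d` are mine, chosen for illustration only, and the α formula is the simplified nominal-scale version for two raters with no missing ratings.

```python
def agreement_coefficients(a, b, c, d):
    """Agreement coefficients for two raters and two categories.

    a: both raters say positive
    b: rater 1 positive, rater 2 negative
    c: rater 1 negative, rater 2 positive
    d: both raters say negative
    """
    n = a + b + c + d          # number of subjects
    p_o = (a + d) / n          # observed percent agreement

    # Cohen's kappa: chance agreement from each rater's own marginals.
    r1_pos, r2_pos = (a + b) / n, (a + c) / n
    pe_kappa = r1_pos * r2_pos + (1 - r1_pos) * (1 - r2_pos)

    # Scott's pi: chance agreement from marginals pooled over both raters.
    pos = (2 * a + b + c) / (2 * n)
    pe_pi = pos ** 2 + (1 - pos) ** 2

    # Krippendorff's alpha (nominal, two raters, no missing data):
    # like pi, but pairs are drawn without replacement from the 2n ratings.
    n_pos, n_neg, total = 2 * a + b + c, 2 * d + b + c, 2 * n
    pe_alpha = (n_pos * (n_pos - 1) + n_neg * (n_neg - 1)) / (total * (total - 1))

    def chance_corrected(p_e):
        return (p_o - p_e) / (1 - p_e)

    return {
        "percent_agreement": p_o,
        "kappa": chance_corrected(pe_kappa),
        "pi": chance_corrected(pe_pi),
        "alpha": chance_corrected(pe_alpha),
    }


# Rare-disease example: 1,000 subjects, 998 agreements, 2 disagreements.
for name, value in agreement_coefficients(a=0, b=1, c=1, d=998).items():
    print(f"{name}: {value:.4f}")
```

Under these assumptions the percent agreement is 0.998, while κ and π come out at about -0.001 and α at about -0.0005, which are the values mentioned in the quote.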

Let us scrutinize a little further what is said in Krippendorff (2013):

  • In order to justify the unjustifiable, and explain the inexplicable, Krippendorff (2013) attempts to stay away from the initial goal of agreement coefficients by bringing in some fuzzy notions such as “informational context” or “reliability to be inferred.” If an agreement coefficient is now required to quantify such broad notions, then how do we know what theoretical construct we want to quantify? This is an unfortunate demagogic attempt to expand the clear concept of agreement as much as necessary until it can incorporate even the most outlying estimations, which may not be justified otherwise. When the US financial industry lowered the requirements for obtaining a loan, the notion of acceptable mortgage credit risk was artificially expanded. As a result, individuals with bad credit history qualified. We all know what followed.
  • In the example above, suppose instead that the instrument manufacturer developed a highly reliable test to diagnose a very common disease (i.e. a disease with a high prevalence rate). Suppose also that each doctor finds only one individual in 1,000 without the disease, and that both agree in 998 cases (i.e. correctly identify the same 998 patients with the disease). Would Cohen’s (1960) κ, Scott’s (1955) π, and Krippendorff’s (1980, 2004a, 2012) α tell the correct story? Unfortunately, the answer is still no. This proves beyond any doubt that the poor performance of these indices has little to do with the reliability of the measuring instrument.
  • Notice the sentence “The inference of zero is perfectly justifiable.” Really! Can an estimate of 0 now be called “an inference of 0”? What does the word “inference” mean in this context? Is this inference statistical? This is what I refer to as pure demagogy: carrying the word inference everywhere for the sole purpose of conveying a false sense of sophistication.
  • Looking at the experiment described above, how does one know whether the instrument itself is reliable or not? If you want to test the propensity for an instrument to properly detect the presence of a rare trait, the statistical method of choice is the odds ratio, and not the agreement coefficient. Moreover, the use of two ordinary raters in an experiment aimed at testing the effectiveness of an instrument is rather odd, unless they are known experts in the use of that device. I am not sure why the effectiveness of the instrument is even part of this discussion.
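
The second point above, about a very common disease, can be illustrated with a short Python sketch. It compares Cohen’s κ on the rare-disease table with κ on its mirror image, where 998 subjects are rated positive by both doctors. The helper `cohen_kappa` and the nested-list table layout are my own illustrative choices, and only κ is computed here.

```python
def cohen_kappa(table):
    """Return (percent agreement, Cohen's kappa) for a square table.

    table[i][j] is the number of subjects placed in category i by
    rater 1 and in category j by rater 2.
    """
    k = len(table)
    n = sum(sum(row) for row in table)
    p_o = sum(table[i][i] for i in range(k)) / n
    row_marg = [sum(table[i]) / n for i in range(k)]
    col_marg = [sum(table[i][j] for i in range(k)) / n for j in range(k)]
    p_e = sum(row_marg[i] * col_marg[i] for i in range(k))
    return p_o, (p_o - p_e) / (1 - p_e)


# Categories are ordered (disease, no disease) in both tables.
rare_disease = [[0, 1], [1, 998]]     # 998 joint negatives
common_disease = [[998, 1], [1, 0]]   # 998 joint positives

for label, table in [("rare", rare_disease), ("common", common_disease)]:
    p_o, kappa = cohen_kappa(table)
    print(f"{label}: percent agreement = {p_o:.3f}, kappa = {kappa:.4f}")
```

Both tables have 99.8% agreement and the same slightly negative κ, because swapping the category labels leaves the coefficient unchanged.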

[1] Cohen, J. (1960). “A coefficient of agreement for nominal scales.” Educational and Psychological Measurement, 20, 37-46.
[2] Feinstein, A. R., and Cicchetti, D. V. (1990). “High agreement but low kappa: I. The problems of two paradoxes.” Journal of Clinical Epidemiology, 43, 543-549.
[3] Gwet, K. L. (2012). Handbook of inter-rater reliability: The definitive guide to measuring the extent of agreement among multiple raters (3rd ed.). Gaithersburg, MD: Advanced Analytics, LLC.
[4] Kraemer, H. C., Periyakoil, V. S., and Noda, A. (2002). “Kappa coefficients in medical research.” Statistics in Medicine, 21, 2109-2129.
[5] Krippendorff, K. (1980). Content Analysis: An Introduction to Its Methodology. Thousand Oaks, Calif, USA.
[6] Krippendorff, K. (2004). Content Analysis: An Introduction to Its Methodology (2nd ed.). Thousand Oaks, Calif, USA.
[7] Krippendorff, K. (2012). Content Analysis: An Introduction to Its Methodology (3rd ed.). Thousand Oaks, Calif, USA.
[8] Krippendorff, K. (2013). Commentary: “A dissenting view on so-called paradoxes of reliability coefficients.” In C. T. Salmon (Ed.), Communication Yearbook, 36 (pp. 481-499).
[9] Scott, W. A. (1955). “Reliability of content analysis: The case of nominal scale coding.” Public Opinion Quarterly, XIX, 321-325.
[10] Zhao, X., Liu, J. S., and Deng, K. (2013). “Assumptions behind intercoder reliability indices.” In C. T. Salmon (Ed.), Communication Yearbook, 36 (pp. 419-480).