Saturday, August 22, 2015

Standard Error of Krippendorff's Alpha Coefficient

On August 9, 2015, I received an email from a researcher at the University of Manchester about the standard error associated with Krippendorff's alpha coefficient.  He was asking why my software AgreeStat for Excel produces a standard error for Krippendorff's alpha that is consistently higher than the one produced by Dr. Hayes' SAS and SPSS macro KALPHA. On this issue, I would like to make two comments:

1) AgreeStat uses the variance expression given in equation (7) of the document entitled "On Krippendorff's Alpha," while the KALPHA macro relies on a bootstrap standard error.  However, the difference between these two approaches cannot and should not, by itself, explain the observed discrepancy in the standard error estimates.

2) I do not recommend using Dr. Hayes' macro programs for computing the standard error of Krippendorff's alpha.  They systematically underestimate (often by a wide margin) the magnitude of the standard error associated with Krippendorff's alpha.  Here is why I believe so.  In his paper "Answering the Call for a Standard Reliability Measure for Coding Data," released in 2007 and co-authored by Krippendorff himself, Dr. Hayes says the following about the algorithm he used:

The bootstrap sampling distribution of alpha is generated by taking a random sample of 239 pairs of judgments from the available pairs, weighted by how many observers judged a given unit. Alpha is computed in this “resample” of 239 pairs, and this process is repeated very many times, producing the bootstrap sampling distribution of Alpha.

This bootstrapping algorithm is terrible, and does not reflect in any way the bootstrap method introduced by Efron & Tibshirani (1998).  Instead of generating bootstrap replicates of Table 1 of their article (i.e. resampling its rows with replacement) before re-computing the alpha coefficient, which is what Efron & Tibshirani recommend, Dr. Hayes generates several sets of 239 pairs of judgments and computes alpha for each of them.  First, the number 239 comes from the original table and is supposed to change from one bootstrap sample to the next. By keeping the exact same structure of the original sample, with the exact same number of missing judgments, you can only obtain a constrained (and therefore smaller) variance.  What Dr. Hayes should have done is simply generate several random samples with replacement from the set of unit numbers {1, 2, 3, 4, ..., 40}. A with-replacement random sample will contain duplicates, which is fine.  The next step would be to extract from Table 1 only the rows whose numbers were selected in the with-replacement sample, and use them to form the bootstrap sample. This bootstrap sample would then be used to compute the alpha coefficient, as sketched below.
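To make the recommended procedure concrete, here is a minimal sketch in R of a unit-level bootstrap of the kind described above. It is not the AgreeStat or KALPHA code; the alpha function is a bare-bones nominal-scale version, and the data matrix, the number of replicates, and all object names are illustrative assumptions.

```r
# Krippendorff's alpha for nominal ratings (units in rows, raters in columns, NA = missing).
kripp_alpha_nominal <- function(ratings) {
  cats <- sort(unique(na.omit(as.vector(ratings))))
  q <- length(cats)
  coin <- matrix(0, q, q)                 # coincidence matrix
  for (u in seq_len(nrow(ratings))) {
    vals <- na.omit(ratings[u, ])
    m <- length(vals)
    if (m < 2) next                       # units with fewer than 2 judgments are not pairable
    for (i in seq_len(m)) for (j in seq_len(m)) {
      if (i != j) {
        a <- match(vals[i], cats); b <- match(vals[j], cats)
        coin[a, b] <- coin[a, b] + 1 / (m - 1)
      }
    }
  }
  n  <- sum(coin)                         # total number of pairable values
  nc <- rowSums(coin)                     # marginal totals per category
  Do <- (n - sum(diag(coin))) / n         # observed disagreement
  De <- (n^2 - sum(nc^2)) / (n * (n - 1)) # expected disagreement (nominal metric)
  1 - Do / De
}

# Unit-level bootstrap: resample whole rows (units) with replacement (analogous to
# sampling from the set {1, 2, ..., 40} of unit numbers in the article's Table 1),
# recompute alpha on each resample, and take the standard deviation.
bootstrap_alpha_se <- function(ratings, B = 1000) {
  boot_alphas <- replicate(B, {
    idx <- sample(seq_len(nrow(ratings)), replace = TRUE)
    kripp_alpha_nominal(ratings[idx, , drop = FALSE])
  })
  sd(boot_alphas)
}

# Illustrative data: 10 units rated by 3 raters, with a few missing judgments.
set.seed(123)
ratings <- matrix(sample(c(1, 2, 3, NA), 30, replace = TRUE,
                         prob = c(0.4, 0.3, 0.2, 0.1)), nrow = 10)
kripp_alpha_nominal(ratings)
bootstrap_alpha_se(ratings, B = 500)
```

The essential point is in bootstrap_alpha_se: whole units (rows) are resampled with replacement, so both the number of pairable judgments and the pattern of missing values are free to vary from one bootstrap sample to the next, which is what the Efron & Tibshirani bootstrap requires.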

Friday, December 12, 2014

Benchmarking Agreement Coefficients

After computing Cohen's kappa coefficient or any alternative agreement coefficient (Gwet's AC1, Krippendorff's alpha, ...), researchers often want to interpret its magnitude.  Does it qualify as excellent? Good? Or perhaps poor? This task is generally accomplished using a benchmark scale such as the one proposed by Altman (1991) and shown in Table 1 (other benchmark scales have been proposed in the literature; see Gwet (2014, Chapter 6)):
Table 1: Altman's Benchmark Scale (strength of agreement: below 0.20, Poor; 0.21 to 0.40, Fair; 0.41 to 0.60, Moderate; 0.61 to 0.80, Good; 0.81 to 1.00, Very Good)
The benchmarking procedure traditionally used by researchers is straightforward: identify the specific range of values into which the computed agreement coefficient falls, and use the associated strength of agreement to interpret its magnitude.  An agreement coefficient of 0.5, for example, will be categorized as "Moderate." This simple procedure can be misleading for several reasons.
  • A calculated kappa is always based on a specific pool of subjects, and will change if different subjects are used.  Therefore, there is no point in interpreting an estimate, which by definition is always exposed to statistical variation. The correct approach is to use the estimate to shed light on the magnitude of the construct (or estimand) that it approximates. In our case, a more meaningful objective is to use the computed coefficient to form an opinion about the magnitude of the ("true" and unknown) extent of agreement among raters. The correct procedure must be probabilistic (i.e. our interpretation must come with a degree of certainty), precisely because the magnitude of the "true" extent of agreement among raters is unknown; it is an attribute of the raters, abstracted from the characteristics of any particular subjects they may rate.
  • Several factors may affect the magnitude of an agreement coefficient. Among these factors, one can mention the number of subjects, the number of categories, or the distribution of subjects among categories.  A kappa value of 0.6 based on 80 subjects conveys a much stronger message about the extent of agreement among raters than a kappa value of 0.6 based on only 10 subjects. Why should you have the same interpretation of 0.6 in these two very different contexts?
  • Kappa and the many alternative agreement coefficients advocated in the literature often behave very differently when applied to the same group of subjects and raters. Unless there is some form of standardization of the agreement coefficients, the use of the same benchmark scale to interpret all of them may be difficult to justify.
Here is a benchmarking procedure that overcomes many of these problems:

Consider an inter-rater reliability experiment that produced the agreement coefficients shown in Table 2.  The second column shows the calculated agreement coefficients, while the third column contains the associated standard errors. The standard error is a statistical measure that tells you how far the computed agreement coefficient would typically stray from its expected value if the experiment were repeated with different subjects. This standard error plays a pivotal role in the benchmarking procedure we propose, since it quantifies the uncertainty surrounding the computed agreement coefficient.
Table 2: Computed Agreement Coefficients and Their Standard Errors

STEP 1: Computing the Benchmark Range Membership Probabilities of Table 3
  • Consider the first benchmark range, 0.8 to 1.0, in Table 3.  Suppose the "true" agreement among raters falls into this range, that Coeff is the computed agreement coefficient (obtained with any method), and StdErr the associated standard error. If Z is a random variable that follows the standard normal distribution, then you would expect (Coeff-1)/StdErr ≤ Z ≤ (Coeff-0.8)/StdErr with a high probability. The probability of this event represents our certainty level that the extent of agreement among raters belongs to the 0.8-1.0 range.  If this probability is small, then the extent of agreement among raters is likely to be smaller than 0.8.  These certainty levels can be calculated for all ranges and all coefficients as shown in Table 3 (see also the sketch after this list). Gwet (2014, chapter 6) shows how MS Excel can be used to obtain these membership probabilities.
  • From Table 3, you can see the range of values into which the extent of agreement is most likely to fall.  However, even the highest membership probability for a particular agreement coefficient may not be sufficiently high to give us a satisfactory certainty level.  Hence the need to carry out step 2.
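Here is a minimal R sketch of the Step 1 computation. It assumes Altman's scale of Table 1 for the benchmark ranges; the standard error used in the example is invented, and the helper name membership_probs is mine, not part of AgreeStat.

```r
# Membership probability of a benchmark range [a, b]:
#   P(a <= true agreement <= b) is approximated by
#   P((Coeff - b)/StdErr <= Z <= (Coeff - a)/StdErr), with Z ~ N(0, 1).
membership_probs <- function(coeff, se,
                             lower = c(0.8, 0.6, 0.4, 0.2, -1.0),   # Altman's ranges,
                             upper = c(1.0, 0.8, 0.6, 0.4, 0.2)) {  # top to bottom
  pnorm((coeff - lower) / se) - pnorm((coeff - upper) / se)
}

# Example: kappa = 0.676 as in Table 2; the standard error 0.095 is invented.
round(membership_probs(coeff = 0.676, se = 0.095), 3)
```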
STEP 2: Computing the Benchmark Range Cumulative Membership Probabilities of Table 4
  • In this second step, you would compute the cumulative membership probabilities for each agreement coefficient as shown in Table 4. That is, the membership probabilities of Table 3 are added column-wise, successively from the top range to the bottom. You must then set a cut-off point (e.g. 0.95): the first benchmark range whose cumulative membership probability equals or exceeds that cut-off point provides the basis for interpreting your agreement coefficient, and for determining the strength of agreement among raters (see the sketch below).
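Continuing the sketch above (same assumed scale, same invented standard error), Step 2 amounts to a cumulative sum followed by a threshold test:

```r
lower  <- c(0.8, 0.6, 0.4, 0.2, -1.0)                 # Altman's ranges, top to bottom
upper  <- c(1.0, 0.8, 0.6, 0.4, 0.2)
labels <- c("Very Good", "Good", "Moderate", "Fair", "Poor")
coeff  <- 0.676; se <- 0.095                          # kappa from Table 2; the SE is invented
memb   <- pnorm((coeff - lower) / se) - pnorm((coeff - upper) / se)  # Step 1
cum    <- cumsum(memb)                                # Step 2: cumulate from the top down
labels[which(cum >= 0.95)[1]]                         # first range reaching the cut-off
```

With these (partly invented) inputs, the first cumulative probability to reach 0.95 is the one attached to the 0.4 to 0.6 range, so the coefficient would be interpreted as "Moderate," consistent with the interpretation of kappa given in Step 3 below.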
STEP 3: Interpretation of the Agreement Coefficients
  • Table 4 indicates that the range of values 0.4 to 0.6 is the first one whose Kappa-based cumulative probability exceeds 0.95.  Consequently, Kappa is qualified as "Moderate." Note that the procedure currently used by many researchers would qualify kappa as "Good," since its calculated value is 0.676, as shown in Table 2.
  • AC1 is qualified as "Very Good" because the 0.8-to-1.0 range of values is associated with a cumulative membership probability of 0.959, which exceeds 0.95.
Table 3: Benchmark Range Membership Probabilities

Table 4: Benchmark Range Cumulative Membership Probabilities

  • Altman, D. G. (1991). Practical Statistics for Medical Research. Chapman and Hall.
  • Gwet, K. (2014). Handbook of Inter-Rater Reliability. Advanced Analytics Press, Maryland, USA. ISBN 978-0970806284.

Sunday, December 7, 2014

Inter-Rater Reliability in Language Testing

The paper entitled "Assessing inter-rater agreement for nominal judgement variables" (alternative link) summarizes a simple comparative study between Cohen's kappa and Gwet's AC1 for evaluating inter-rater reliability in the context of Language Testing, dichotomous variables, and high-prevalence data.  Researchers may find this analysis instructive. I personally found it attractive for its simplicity and the clarity of its examples. Those who are new to the area of inter-rater reliability assessment may find it useful as well.

Monday, March 31, 2014

Some R functions for calculating chance-corrected agreement coefficients

Several researchers have shown interest in having R functions that can compute several chance-corrected agreement coefficients, their standard errors, confidence intervals, and p-values as described in my book Handbook of Inter-Rater Reliability (3rd ed.).  I have finally found the time to write these R functions, which can be downloaded from the r-functions page of the agreestat website.

All these R functions handle missing values without problems, and cover several types of agreement coefficients, including Gwet's AC1/AC2 (2008, 2012), the kappa coefficients of Cohen (1960), Fleiss (1971), and Conger (1980), the Brennan & Prediger (1981) coefficient, Krippendorff's (1970) alpha, and the percent agreement.
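The r-functions page documents the actual interfaces of these functions, so I will not reproduce them here. Purely as an illustration of the kind of quantities they report, here is a self-contained R computation of the percent agreement, Cohen's kappa, and Gwet's AC1 for two raters, using made-up ratings:

```r
r1 <- c("a", "a", "b", "b", "c", "a", "a", "b", "a", "a")   # rater 1 (made-up data)
r2 <- c("a", "a", "b", "c", "c", "a", "b", "b", "a", "a")   # rater 2 (made-up data)
cats <- sort(unique(c(r1, r2)))
q    <- length(cats)
pa   <- mean(r1 == r2)                        # percent agreement
p1   <- table(factor(r1, cats)) / length(r1)  # marginal proportions, rater 1
p2   <- table(factor(r2, cats)) / length(r2)  # marginal proportions, rater 2
pe_k <- sum(p1 * p2)                          # chance agreement, Cohen's kappa
pik  <- (p1 + p2) / 2                         # average category proportions
pe_g <- sum(pik * (1 - pik)) / (q - 1)        # chance agreement, Gwet's AC1
c(pa = pa,
  kappa = (pa - pe_k) / (1 - pe_k),
  AC1   = (pa - pe_g) / (1 - pe_g))
```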


[1] Brennan, R. L., and Prediger, D. J. (1981). "Coefficient Kappa: some uses, misuses, and alternatives." Educational and Psychological Measurement, 41, 687-699.
[2] Cohen, J. (1960). "A coefficient of agreement for nominal scales." Educational and Psychological Measurement, 20, 37-46.
[3] Conger, A. J. (1980). "Integration and generalization of kappas for multiple raters." Psychological Bulletin, 88, 322-328.
[4] Fleiss, J. L. (1971). "Measuring nominal scale agreement among many raters." Psychological Bulletin, 76, 378-382.
[5] Gwet, K. L. (2008). "Computing inter-rater reliability and its variance in the presence of high agreement." British Journal of Mathematical and Statistical Psychology, 61, 29-48.
[6] Gwet, K. L. (2012). Handbook of Inter-Rater Reliability (3rd ed.). Advanced Analytics, LLC, Maryland, USA.
[7] Krippendorff, K. (1970). "Estimating the reliability, systematic error, and random error of interval data." Educational and Psychological Measurement, 30, 61-70.

Saturday, March 8, 2014

The Perreault-Leigh Agreement Coefficient is Problematic

Perreault and Leigh (1989), considering that there was a need for an agreement coefficient "that is more appropriate to the type of data typically encountered in marketing contexts," decided to propose a new agreement coefficient known in the literature by the symbol Ir. I firmly believe that the mathematical derivations that led to this coefficient are wrong, even though the underlying ideas are right. A proper translation of these ideas would inevitably have led to the Brennan-Prediger coefficient or to the percent agreement, depending on the assumptions made.

The Perreault-Leigh agreement coefficient is formally defined as follows:

Ir = sqrt(S) if S ≥ 0, and Ir = 0 otherwise,

where S, defined by

S = (pa - 1/q) / (1 - 1/q),

with pa the percent agreement and q the number of categories, is the agreement coefficient recommended by Bennett et al. (1954) and is a special case of the coefficient recommended by Brennan and Prediger (1981). The symbol Ir used by Perreault and Leigh (1989) appears to stand for "Index of Reliability."
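For concreteness, here is a small numeric illustration of the two quantities above; the values of pa and q are made up.

```r
pa <- 0.7; q <- 4                      # made-up percent agreement and number of categories
S  <- (pa - 1/q) / (1 - 1/q)           # Bennett et al. / Brennan-Prediger coefficient: 0.6
Ir <- if (S >= 0) sqrt(S) else 0       # Perreault-Leigh index: about 0.775
c(S = S, Ir = Ir)
```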

I carefully reviewed the Perreault and Leigh article. It presents an excellent review of the various agreement coefficients that were current at the time it was written. Perreault and Leigh define Ir as the percentage of subjects that a typical judge could code consistently, given the nature of the observations. Note that Ir is an attribute of the typical judge, and therefore does not represent any aspect of agreement among the judges. Perreault and Leigh (1989) consider the product N x Ir² (with N representing the number of subjects) to represent the number of reliable judgments on which judges agree. This cannot be true. To see this, note that Ir² is the probability that 2 judges both independently perform a reliable judgment. If both (reliable) judgments must lead to an agreement, then they have to refer to the exact same category. However, the probability Ir² does not say which category was chosen, and therefore cannot represent any agreement among judges.  Even if you decide to assume that any 2 reliable judgments must necessarily result in an agreement, then the judgments will no longer be independent. The probability for two judges to agree now becomes the probability for the first judge to perform a reliable judgment times the conditional probability for the second judge to perform a reliable judgment given that the first judge did. This second conditional probability cannot be evaluated without additional assumptions.

What Perreault and Leigh (1989)  have proposed is not an agreement coefficient.  Their coefficient quantifies something other than the extent of agreement among raters. It should not be compared with the other coefficients available in the literature until someone can tell us what it does.


[1] Bennett, E. M., Alpert, R. & Goldstein, A. C. (1954). Communication through limited response questioning. Public Opinion Quarterly, 18, 303-308.

[2] Brennan, R. L., & Prediger, D. J. (1981). Coefficient kappa: Some uses, misuses, and alternatives. Educational and Psychological Measurement, 41, 687-699.

[3] Perreault, W. D. & Leigh, L. E. (1989). Reliability of nominal data based on qualitative judgments. Journal of Marketing Research, 26, 135-148.

Tuesday, February 25, 2014

Inter-rater reliability and Many-Facet Rasch Measurement

I just finished reading the book entitled "Introduction to Many-Facet Rasch Measurement" by Thomas Eckes. In this book, Mr. Eckes argues that the classical approach to inter-rater reliability, which consists of training the raters and measuring their extent of agreement until it reaches an acceptable level, does not really work.  This is because, no matter how much training the raters receive, they will still not be interchangeable. A residual, intrinsic disagreement will remain among the raters, some of them being more stringent than others in their approach to rating.

The solution that Mr. Eckes proposes is to develop statistical models that describe the different facets of the inter-rater reliability experiment, such as the rater facet, the subject facet, and possibly other facets.  These statistical models are then used to make adjustments to the ratings so that the subjects, who are after all human beings, can get a fair test.  This adjustment will supposedly not penalize the subjects who were unlucky enough to be rated by the more severe raters.

I must say I liked very much the way the author describes the different issues associated with an inter-rater reliability experiment.  The presentation of these issues is very instructive and is done with considerable clarity.  That alone justifies the investment in time and money one can make in this book.  However, I have always been somewhat skeptical about the use of theoretical statistical models for the purpose of making important practical decisions, especially decisions involving human subjects.  As a matter of fact, just as the raters introduce some bias into the ratings, two statisticians will probably not recommend the same statistical models either.  Using these models to adjust the ratings may only add a statistician bias that compounds with the rater bias to produce an outcome that can hardly be seen as more reliable. Statistical models can always help the researcher gain more insight into a reality with powerful modelling tools, but they cannot and should not be seen as an expression of that reality.   Nevertheless, this book is remarkably well written, and should certainly be useful to anyone interested in the topic of inter-rater reliability.

[1] Eckes, T. (2011). Introduction to Many-Facet Rasch Measurement. Peter Lang. ISBN: 978-3-631-61350-4.

Wednesday, December 18, 2013

The Paradoxes of Agreement Coefficients: An Impossible Justification

Feinstein and Cicchetti (1990) exposed the kappa coefficient of Cohen (1960) as an agreement metric prone to yielding unduly low values when the distribution of subjects is skewed towards one category, even when the raters strongly agree in their ratings. This problem is known in the inter-rater reliability literature as the kappa paradox. As it turned out, kappa was not the only agreement coefficient to have this problem. A few other agreement coefficients used by researchers - Scott's (1955) π and Krippendorff's (1980, 2004a, 2012) α among others - suffer from it as well. Despite abundant, well-documented, and strong evidence supporting its seriousness, some authors have attempted, and are still attempting, to present the kappa paradox as a non-issue or a side issue. Presenting one's viewpoint is always welcome, as it increases knowledge and provides insights into a problem. What irritates me is when some scholars become demagogues in an attempt to defend their past contributions to the literature, instead of using new evidence to improve them.

Kraemer et al. (2002, p. 2114) attempted to defend the kappa coefficient by arguing that what is presented as a kappa paradox is not a paradox. I pointed out in Gwet (2012, p. 38) that Kraemer et al. (2002) only made excuses for the poor performance of kappa by blaming the distribution of subjects. More recently, when commenting on an article by Zhao, Liu, and Deng (2013), Krippendorff (2013) made an even more demagogic argument to reach the conclusion that the low values of Cohen's (1960) κ, Scott's (1955) π, and Krippendorff's (1980, 2004a, 2012) α are justified, even in the presence of high agreement among the raters.

To make his case, here is what Krippendorff (2013) says:

“Suppose an instrument manufacturer claims to have developed a test to diagnose a rare disease. Rare means that the probability of that disease in a population is small and to have enough cases in the sample, a large number of individuals need to be tested. Let us use the authors’ numerical example: Suppose two separate doctors administer the test to the same 1,000 individuals. Suppose each doctor finds one in 1,000 to have the disease and they agree in 998 cases on the outcome of the test. The authors note that Cohen’s (1960) κ, Scott’s (1955) π, and Krippendorff’s (1980, 2004a, 2012) α are all below zero (-0.001 or -0.0005)...”
“... I contend that a test which produces 99.8% negatives, 0.2% disagreements, and not a single case of an agreement on the presence of the disease is totally unreliable indeed. Nobody in her right mind should trust a doctor who would treat patients based on such test results. The inference of zero is perfectly justifiable. The paradox of “high agreement but low reliability” does not characterize any of the reliability indices cited but resides entirely in the authors’ conceptual limitations. How could the authors be so wrong?”

I am outraged by this point. Here is how I perceive it. If you are going to quantify the extent of agreement among raters who strongly agree in any sense you can think of, then your scoring method must assign a high agreement coefficient to these raters. If it fails to do so, then don’t switch topics by pretending that your low coefficient must instead be associated with the unascertained shortcomings of the measuring instrument. The propensity of a test to detect a rare trait is an entirely different topic, which requires a different experimental design and different quantitative methods with little in common with agreement coefficients.
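The numbers quoted above are easy to verify. The sketch below, in R, rebuilds the 2x2 table implied by the quote (998 joint negatives, one positive for each doctor on two different patients, no joint positives) and computes Cohen's kappa and Scott's π on it; for contrast, it also computes Gwet's AC1, which is discussed elsewhere on this blog and which does not exhibit the paradox on this table.

```r
nn <- 998; np <- 1; pn <- 1; pp <- 0       # (doctor 1, doctor 2) counts: n = negative, p = positive
N  <- nn + np + pn + pp
pa <- (nn + pp) / N                        # observed agreement: 0.998
p1 <- (pp + pn) / N                        # doctor 1's rate of positives: 0.001
p2 <- (pp + np) / N                        # doctor 2's rate of positives: 0.001

pe_kappa <- p1 * p2 + (1 - p1) * (1 - p2)              # Cohen's chance agreement
pe_pi    <- ((p1 + p2) / 2)^2 + (1 - (p1 + p2) / 2)^2  # Scott's chance agreement
ppos     <- (p1 + p2) / 2
pe_ac1   <- 2 * ppos * (1 - ppos)                      # AC1's chance agreement

round(c(kappa    = (pa - pe_kappa) / (1 - pe_kappa),   # about -0.001
        scott_pi = (pa - pe_pi) / (1 - pe_pi),         # about -0.001
        AC1      = (pa - pe_ac1) / (1 - pe_ac1)), 4)   # about 0.998
```

Despite 998 agreements out of 1,000, kappa and π come out slightly negative, exactly as quoted; this is the paradox in action.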

Let us scrutinize a little further what is said in Krippendorff (2013):

  • In order to justify the unjustifiable and explain the inexplicable, Krippendorff (2013) attempts to move away from the initial goal of agreement coefficients by bringing in fuzzy notions such as “informational context” or “reliability to be inferred.” If an agreement coefficient is now required to quantify such broad notions, then how do we know what theoretical construct we want to quantify? There is here an unfortunate, demagogic attempt to expand the clear concept of agreement as much as necessary, until it can accommodate even the most outlying estimates, which could not be justified otherwise. When the US financial industry lowered the requirements for obtaining a loan, the notion of acceptable mortgage credit risk was artificially expanded. As a result, individuals with bad credit histories qualified. We all know what followed.
  • In the example above, suppose the instrument manufacturer had developed a highly reliable test to diagnose a very common disease (i.e. a disease with a high prevalence rate). Suppose also that each doctor finds only one individual in 1,000 without the disease, and that both agree in 998 cases (i.e. they correctly identify the same 998 patients with the disease). Would Cohen’s (1960) κ, Scott’s (1955) π, and Krippendorff’s (1980, 2004a, 2012) α tell the correct story? Unfortunately, the answer is still no. This proves beyond any doubt that the poor performance of these indices has little to do with the reliability of the measuring instrument.
  • Notice the sentence “The inference of zero is perfectly justifiable.” Really! Can an estimate of 0 now be called “an inference of 0”? What does the word “inference” mean in this context? Is this inference statistical? This is what I refer to as pure demagogy: carrying the word inference everywhere for the sole purpose of conveying a false sense of sophistication.
  • Looking at the experiment described above, how does one know whether the instrument itself is reliable or not? If you want to test the propensity of an instrument to properly detect the presence of a rare trait, the statistical method of choice is the odds ratio, not an agreement coefficient. Moreover, the use of 2 ordinary raters in an experiment aimed at testing the effectiveness of an instrument is rather odd, unless they are known experts in the use of that device. I am not sure why the effectiveness of the instrument is even part of this discussion.

[1] Cohen, J. (1960). “A coefficient of agreement for nominal scales.” Educational and Psychological Measurement, 20, 37-46.
[2] Feinstein, A. R., and Cicchetti, D. V. (1990). “High agreement but low kappa: I. The problems of two paradoxes,” Journal of Clinical Epidemiology, 43, 543-549.
[3] Gwet, K. L. (2012). Handbook of Inter-Rater Reliability: The Definitive Guide to Measuring the Extent of Agreement Among Multiple Raters (3rd ed.). Gaithersburg, MD: Advanced Analytics, LLC.
[4] Kraemer, H. C., Peryakoil, V. S., and Noda, A. (2002). “Kappa Coefficients in Medical Research,” Statistics in Medicine, 21, 2109-2129.
[5] Krippendorff, K. (1980). Content Analysis: An Introduction to Its Methodology. Thousand Oaks, Calif, USA.
[6] Krippendorff, K. (2004). Content Analysis: An Introduction to Its Methodology (2nd ed.). Thousand Oaks, Calif, USA.
[7] Krippendorff, K. (2012). Content Analysis: An Introduction to Its Methodology (3rd ed.). Thousand Oaks, Calif, USA.
[8] Krippendorff, K. (2013). Commentary: “A dissenting view on so-called paradoxes of reliability coefficients.” In C. T. Salmon (Ed.), Communication Yearbook, 36 (pp. 481-499).
[9] Scott, W. A. (1955). “Reliability of content analysis: The case of nominal scale coding.” Public Opinion Quarterly, XIX, 321-325.
[10] Zhao, X., Liu, J. S., and Deng, K. (2013). “Assumptions behind intercoder reliability indices.” In C. T. Salmon (Ed.), Communication Yearbook, 36 (pp. 419-499).