Tuesday, September 6, 2016

A t-test for correlated agreement coefficients and application with the R package

Researchers must often compare two groups of raters with respect to the extent to which they agree on the rating of the same group of subjects. The extent of agreement among raters of the same group can also be measured on two occasions (e.g. before and after a training session), in order to assess the effectiveness of training in improving inter-rater reliability. An agreement coefficient must then be calculated twice. The traditional statistical approach for testing the difference for statistical significance is to divide that difference by its standard error (the square root of its variance), and then to compare this ratio (i.e. the test statistic) to the critical value (often 1.96). If the absolute value of the test statistic exceeds the critical value, then one may conclude that the difference is statistically significant. Sometimes the p-value is calculated and used to conclude statistical significance when it falls below 0.05. However, calculating the variance of the difference can sometimes become problematic.

If the two groups of raters (or the same group observed on 2 occasions) must rate the exact same group of subjects, then any agreement coefficient used (e.g. Fleiss' generalized kappa, Gwet's AC1, Conger's generalized kappa, the Brennan-Prediger coefficient, or Krippendorff's alpha) will produce two correlated coefficients, making the calculation of the variance of the difference very difficult due to the embedded correlation structure. Gwet (2016) proposed the linearization method to resolve this problem. This approach consists of using the linear approximation to the agreement coefficient to develop the equivalent of a paired t-test. R users may use the R functions that I developed to implement the linearization method for testing the difference between two agreement coefficients for statistical significance.
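The idea of a paired test on per-subject values can be sketched in Python for the simplest case, the percent agreement, which is already the mean of per-subject agreement values and therefore needs no linearization. This is only an illustration under that assumption: the function names and the data are hypothetical, and the linearized per-subject values that Gwet (2016) derives for kappa, AC1 or alpha are not reproduced here.

```python
import math
from statistics import mean, stdev

def per_subject_agreement(ratings):
    """Fraction of rater pairs that agree on one subject.
    ratings: list of categories, one per rater (at least 2 raters, no missing values)."""
    n = len(ratings)
    pairs = n * (n - 1) / 2
    agree = sum(1 for i in range(n) for j in range(i + 1, n)
                if ratings[i] == ratings[j])
    return agree / pairs

def paired_test(before, after, critical=1.96):
    """Paired test on per-subject agreement values from two occasions.
    before/after: lists of per-subject rating lists, same subjects in the same order."""
    d = [per_subject_agreement(a) - per_subject_agreement(b)
         for b, a in zip(before, after)]
    diff = mean(d)                       # difference between the two percent agreements
    se = stdev(d) / math.sqrt(len(d))    # standard error; the pairing handles the correlation
    t = diff / se
    return diff, se, t, abs(t) > critical

# Hypothetical ratings: 6 subjects, 3 raters, before and after training
before = [[1, 1, 2], [1, 2, 3], [2, 2, 2], [1, 2, 2], [3, 3, 1], [1, 2, 3]]
after = [[1, 1, 1], [1, 2, 2], [2, 2, 2], [2, 2, 2], [3, 3, 3], [1, 1, 3]]
diff, se, t, significant = paired_test(before, after)
print(round(diff, 3), round(se, 3), round(t, 2), significant)
```

The critical value is left as a parameter; with as few subjects as in this toy data set, a t quantile with n-1 degrees of freedom would be more appropriate than 1.96.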

Gwet, K. L. (2016). Testing the Difference of Correlated Agreement Coefficients for Statistical Significance, Educational and Psychological Measurement, Vol 76(4) 609-637

Saturday, August 22, 2015

Standard Error of Krippendorff's Alpha Coefficient

On August 9, 2015, I received an email from a researcher at the University of Manchester about the standard error associated with Krippendorff's alpha coefficient. He was asking why my software AgreeStat for Excel produces a standard error for Krippendorff's alpha that is always higher than that produced by Dr. Hayes' SAS and SPSS macros called KALPHA. On this issue, I would like to make two comments:

1) AgreeStat uses a variance expression given in equation (7) of the document entitled "On Krippendorff's Alpha," while the KALPHA macro is based on the bootstrap standard error.  However, the use of these two approaches cannot and should not explain the observed difference in standard error estimations.

2) I do not recommend using Dr Hayes’ macro programs for computing the standard error of Krippendorff’s alpha. They always underestimate (often by a wide margin) the magnitude of the standard error associated with Krippendorff’s alpha. Here is why I believe so. In his paper “Answering the Call for a Standard Reliability Measure for Coding Data,” released in 2007 and co-authored by Krippendorff himself, Dr Hayes says the following regarding the algorithm he has used:

The bootstrap sampling distribution of alpha is generated by taking a random sample of 239 pairs of judgments from the available pairs, weighted by how many observers judged a given unit. Alpha is computed in this “resample” of 239 pairs, and this process is repeated very many times, producing the bootstrap sampling distribution of Alpha.

This bootstrapping algorithm is terrible, and does not reflect in any way the bootstrap method previously introduced by Efron & Tibshirani (1998). Instead of replicating Table 1 of their article before re-computing the alpha coefficient (this is what Efron & Tibshirani recommend), Dr Hayes generates several sets of 239 pairs of judgments and computes alpha for each of them. First, the number 239 comes from the original table, and is supposed to change from one bootstrap sample to the next. By keeping the exact same structure as the original sample, with the exact same number of missing judgments, you can only obtain a constrained (and therefore smaller) variance. What Dr Hayes should have done is to simply generate several random samples with replacement from the set {1,2,3,4, ..., 40}. A with-replacement random sample will have duplicates, which is ok. The next step would be to extract from Table 1 only the rows whose numbers were selected in the with-replacement sample, and use them to form the bootstrap sample. This bootstrap sample would then be used to compute the alpha coefficient.
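The unit-level resampling described above can be sketched in Python as follows. This is a minimal illustration, not the AgreeStat or KALPHA implementation: the function names and the data are hypothetical, and the percent agreement is used only as a placeholder statistic where a function computing Krippendorff's alpha from the resampled rows would be plugged in.

```python
import random
from statistics import stdev

def percent_agreement(rows):
    """Placeholder statistic: share of agreeing rater pairs over all units,
    skipping missing (None) ratings."""
    agree = total = 0
    for row in rows:
        vals = [v for v in row if v is not None]
        for i in range(len(vals)):
            for j in range(i + 1, len(vals)):
                total += 1
                agree += vals[i] == vals[j]
    return agree / total if total else float("nan")

def bootstrap_se(rows, statistic, n_boot=1000, seed=0):
    """Bootstrap standard error obtained by resampling whole units (rows)."""
    rng = random.Random(seed)
    n = len(rows)
    estimates = []
    for _ in range(n_boot):
        # Draw n unit numbers with replacement; duplicates are expected
        idx = [rng.randrange(n) for _ in range(n)]
        # Extract the selected rows, so the number of available
        # judgments varies from one bootstrap sample to the next
        sample = [rows[i] for i in idx]
        estimates.append(statistic(sample))
    return stdev(estimates)

# Hypothetical table: 8 units rated by 3 observers, None = missing judgment
table = [
    [1, 1, 1], [1, 2, 1], [2, 2, None], [3, 3, 3],
    [1, None, 2], [2, 2, 2], [3, 1, 3], [1, 1, None],
]
print(bootstrap_se(table, percent_agreement))
```

Because entire rows are resampled, both the number of pairs and the pattern of missing judgments change from one bootstrap sample to the next, which is the source of variation that a fixed-size resample of pairs leaves out.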

Friday, December 12, 2014

Benchmarking Agreement Coefficients

After computing Cohen's kappa coefficient or any alternative agreement coefficient (Gwet's AC1, Krippendorff's alpha, ...), researchers often want to interpret its magnitude. Does it qualify as excellent? Good? Or poor maybe? This task is generally accomplished using a benchmark scale such as the one proposed by Altman (1991) and shown in Table 1 (other benchmark scales have been proposed in the literature - see Gwet (2014, Chapter 6)):
Table 1
The benchmarking procedure traditionally used by researchers is straightforward, and consists of identifying the specific range of values into which the computed agreement coefficient falls, and using the associated strength of agreement to interpret its magnitude. An agreement coefficient of 0.5 for example will be categorized as "Moderate." This simple procedure can be misleading for several reasons.
  • A calculated kappa is always based on a specific pool of subjects, and will change if different subjects are used. Therefore, there is no point interpreting an estimation, which by definition is always exposed to statistical variation. The correct approach is to use an estimation in order to shed light on the magnitude of the construct (or estimand) that the estimation approximates. In our case, a more meaningful objective is to use the computed coefficient to form an opinion about the magnitude of the ("true" and unknown) extent of agreement among raters. The correct procedure must be probabilistic (i.e. our interpretation must be associated with a degree of certainty). This is precisely due to the unknown nature of the magnitude of the "true" extent of agreement among raters, which is an attribute of these raters abstracted from the characteristics of any subject they may rate.
  • Several factors may affect the magnitude of an agreement coefficient. Among these factors one can mention the number of subjects, the number of categories or the distribution of subjects among categories. A kappa value of 0.6 based on 80 subjects conveys a much stronger message about the extent of agreement among raters than a kappa value of 0.6 based on 10 subjects only. Why should you have the same interpretation of 0.6 in these two very different contexts?
  • Kappa and the many alternative agreement coefficients advocated in the literature often behave very differently when applied to the same group of subjects and raters. Unless there is some form of standardization of the agreement coefficients, the use of the same benchmark scale to interpret all of them may be difficult to justify.
Here is a benchmarking procedure that overcomes many of these problems:

Consider an inter-rater reliability experiment that produced the agreement coefficients shown in Table 2.  The second column shows the calculated agreement coefficients, while the third column contains the associated standard errors. The standard error is a statistical measure that tells you how far you would normally expect any given agreement coefficient value to stray away from its overall mean. This standard error plays a pivotal role in the benchmarking procedure we propose, since it quantifies the uncertainty surrounding the computed agreement coefficient.
Table 2

STEP 1: Computing the Benchmark Range Membership Probabilities of Table 3
  • Consider the first benchmark range 0.8 to 1.0 in Table 3. Suppose the "true" agreement among raters falls into this range, that Coeff is the computed agreement coefficient (with any method), and StdErr the associated standard error. If Z is a random variable that follows the standard normal distribution, then you would expect (Coeff-1)/StdErr ≤ Z ≤ (Coeff-0.8)/StdErr with a high probability. The probability of this event represents our certainty level that the extent of agreement among raters belongs to the 0.8-1.0 range. If this probability is small, then the extent of agreement among raters is likely to be smaller than 0.8. These certainty levels can be calculated for all ranges and all coefficients as shown in Table 3. Gwet (2014, chapter 6) shows how MS Excel can be used to obtain these membership probabilities.
  • From Table 3, you can see the range of values into which the extent of agreement is most likely to fall. However, even the highest membership probability for a particular agreement coefficient may not be sufficiently high to give us a satisfactory certainty level. Hence the need to carry out step 2.
STEP 2: Computing the Benchmark Range Cumulative Membership Probabilities of Table 4
  • In this second step, you would compute the cumulative membership probabilities for each agreement coefficient as shown in Table 4. That is, all membership probabilities of Table 3 are added columnwise successively from the top range to the bottom. You must then set a cut-off point (e.g. 0.95), so that the first benchmark range associated with a cumulative membership probability that equals or exceeds that cut-off point will provide the basis for interpreting your agreement coefficient, and for determining the strength of the agreement among raters.
STEP 3: Interpretation of the Agreement Coefficients
  • Table 4 indicates that the range of values 0.4 to 0.6 is the first one with a Kappa-based cumulative probability that exceeds 0.95. Consequently, Kappa is qualified as "Moderate." Note that the procedure currently used by many researchers would qualify kappa as "Good" since its calculated value is 0.676 as shown in Table 2.
  • AC1 is qualified as "Very Good" because the 0.8-to-1 range of values is associated with a cumulative membership probability of 0.959 that exceeds 0.95.
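The three steps above can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the benchmark ranges and labels are written out from the Altman scale as it is commonly stated (with -1 as an assumed lower bound for the "Poor" range), the function names are hypothetical, and the standard error of 0.14 is a made-up value for illustration, not the one from Table 2. The membership probabilities follow the inequality given in Step 1, without any further normalization.

```python
from statistics import NormalDist

# Benchmark ranges (lower bound, upper bound, label), listed from the top range down
ALTMAN_SCALE = [
    (0.8, 1.0, "Very Good"),
    (0.6, 0.8, "Good"),
    (0.4, 0.6, "Moderate"),
    (0.2, 0.4, "Fair"),
    (-1.0, 0.2, "Poor"),
]

def membership_probabilities(coeff, stderr, scale=ALTMAN_SCALE):
    """Step 1: P((coeff - hi)/stderr <= Z <= (coeff - lo)/stderr) for each range."""
    phi = NormalDist().cdf
    return [(label, phi((coeff - lo) / stderr) - phi((coeff - hi) / stderr))
            for lo, hi, label in scale]

def interpret(coeff, stderr, cutoff=0.95, scale=ALTMAN_SCALE):
    """Steps 2 and 3: accumulate probabilities from the top range down and
    return the label of the first range whose cumulative value reaches the cut-off."""
    cumulative = 0.0
    for label, p in membership_probabilities(coeff, stderr, scale):
        cumulative += p
        if cumulative >= cutoff:
            return label
    return scale[-1][2]  # cut-off never reached: fall back to the lowest range

# Kappa = 0.676 as in the text, with a hypothetical standard error of 0.14
for label, p in membership_probabilities(0.676, 0.14):
    print(f"{label:10s} {p:.3f}")
print(interpret(0.676, 0.14))
```

With a small standard error the cumulative probability reaches the cut-off in the range that contains the coefficient itself, and the two procedures coincide; the larger the standard error, the further down the scale the interpretation moves.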
Table 3: Benchmark Range Membership Probabilities

Table 4: Benchmark Range Cumulative Membership Probabilities

  • Altman, D. G. (1991). Practical Statistics for Medical Research. Chapman and Hall.
  • Gwet, K. (2014). Handbook of Inter-Rater Reliability, Advanced Analytics Press, Maryland, USA. ISBN: 9 780970 806284.

Sunday, December 7, 2014

Inter-Rater Reliability in Language Testing

The paper entitled Assessing inter-rater agreement for nominal judgement variables (alternative link) summarizes a simple comparative study between Cohen's Kappa and Gwet's AC1 for evaluating inter-rater reliability in the context of language testing, dichotomous variables, and high-prevalence data. Researchers may find this analysis instructive. I personally found it attractive for its simplicity and the clarity of the examples used. Those who are new to the area of inter-rater reliability assessment may find it useful as well.

Monday, March 31, 2014

Some R functions for calculating chance-corrected agreement coefficients

Several researchers have shown interest in having R functions that can compute several chance-corrected agreement coefficients, their standard errors, confidence intervals, and p-values as described in my book Handbook of Inter-Rater Reliability (3rd ed.). I have finally found the time to write these R functions, which can be downloaded from the r-functions page of the agreestat website.

All these R functions can handle missing values without problems, and cover several types of agreement coefficients including Gwet AC1/AC2 (2008, 2012), Kappa coefficients of Cohen (1960), Fleiss (1971), Conger (1980), Brennan & Prediger (1981), Krippendorff (1970), and the percent agreement.


[1] Brennan, R. L., and Prediger, D. J. (1981). "Coefficient Kappa: some uses, misuses, and alternatives." Educational and Psychological Measurement, 41, 687-699.
[2] Cohen, J. (1960). "A coefficient of agreement for nominal scales." Educational and Psychological  Measurement, 20, 37-46.
[3] Conger, A. J. (1980), "Integration and Generalization of Kappas for Multiple Raters," Psychological  Bulletin, 88, 322-328.
[4] Fleiss, J. L. (1971). "Measuring nominal scale agreement among many raters", Psychological Bulletin, 76, 378-382
[5] Gwet, K. L. (2008). "Computing inter-rater reliability and its variance in the presence of high agreement."  British Journal of Mathematical and Statistical Psychology, 61, 29-48.
[6] Gwet, K.L. (2012). Handbook of Inter-Rater Reliability (3rd Ed.), Advanced Analytics, LLC, Maryland, USA
[7] Krippendorff, K. (1970). "Estimating the reliability, systematic error, and random error of interval data," Educational and Psychological Measurement, 30, 61-70

Saturday, March 8, 2014

The Perreault-Leigh Agreement Coefficient is Problematic

Perreault and Leigh (1989), considering that there was a need to have an agreement coefficient "that is more appropriate to the type of data typically encountered in marketing contexts," decided to propose a new agreement coefficient known in the literature by the symbol Ir. I firmly believe that the mathematical derivations that led to this coefficient were wrong, even though the underlying ideas are right. A proper translation of these ideas would inevitably have led to the Brennan-Prediger coefficient or to the percent agreement, depending on the assumption made.

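The Perreault-Leigh agreement coefficient is formally defined as follows:

$$I_r = \begin{cases} \sqrt{S}, & \text{if } S \geq 0, \\ 0, & \text{otherwise,} \end{cases}$$

where S, defined by

$$S = \frac{q}{q-1}\left(p_a - \frac{1}{q}\right),$$

with p_a the percent agreement and q the number of categories, is the agreement coefficient recommended by Bennett et al. (1954) and is a special case of the coefficient recommended by Brennan and Prediger (1981). The symbol Ir used by Perreault and Leigh (1989) appears to stand for “Index of Reliability.”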
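As a small numerical illustration, the two quantities can be computed from the percent agreement and the number of categories. The sketch below assumes the usual statement of the two coefficients, S = (q/(q-1))(p_a - 1/q) and Ir equal to the square root of S when S is positive and 0 otherwise; the function names and the input values are hypothetical.

```python
import math

def bennett_s(p_a, q):
    """Bennett et al. S coefficient from the percent agreement p_a
    and the number of categories q (q >= 2)."""
    return (q / (q - 1)) * (p_a - 1 / q)

def perreault_leigh_ir(p_a, q):
    """Perreault-Leigh Ir: square root of S, set to 0 when S is not positive."""
    s = bennett_s(p_a, q)
    return math.sqrt(s) if s > 0 else 0.0

# Hypothetical example: 80% observed agreement on a 4-category scale
s = bennett_s(0.8, 4)
ir = perreault_leigh_ir(0.8, 4)
print(round(s, 4), round(ir, 4))
```

Since S lies between 0 and 1 whenever it is positive, its square root Ir is always at least as large as S itself.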

I carefully reviewed the Perreault and Leigh article. It presents an excellent review of the various agreement coefficients that were current at the time it was written. Perreault and Leigh define Ir as the percent of subjects that a typical judge could code consistently given the nature of the observations. Note that Ir is an attribute of the typical judge, and therefore does not represent any aspect of agreement among the judges. Perreault and Leigh (1989) consider the product N × Ir² (with N representing the number of subjects) to represent the number of reliable judgments on which judges agree. This cannot be true. To see this, note that Ir² is the probability that 2 judges both independently perform a reliable judgment. If both (reliable) judgments must lead to an agreement, then they have to refer to the exact same category. However, the probability Ir² does not say which category was chosen and cannot represent any agreement among judges. Even if you decide to assume that any 2 reliable judgments must necessarily result in an agreement, then the judgments will no longer be independent. The probability for two judges to agree will now become equal to the probability for the first judge to perform a reliable judgment times the conditional probability for the second judge to perform a reliable judgment given that the first judge did. This second conditional probability cannot be evaluated unless there are additional assumptions.

What Perreault and Leigh (1989)  have proposed is not an agreement coefficient.  Their coefficient quantifies something other than the extent of agreement among raters. It should not be compared with the other coefficients available in the literature until someone can tell us what it does.


[1] Bennett, E. M., Alpert, R. & Goldstein, A. C. (1954). Communication through limited response questioning. Public Opinion Quarterly, 18, 303-308.

[2] Brennan, R. L., & Prediger, D. J. (1981). Coefficient kappa: Some uses, misuses, and alternatives. Educational and Psychological Measurement, 41, 687-699.

[3] Perreault, W. D. & Leigh, L. E. (1989). Reliability of nominal data based on qualitative judgments. Journal of Marketing Research, 26, 135-148.

Tuesday, February 25, 2014

Inter-rater reliability and Many-Facet Rasch Measurement

I just finished reading the book entitled "Introduction to Many-Facet Rasch Measurement" by Thomas Eckes. In this book, Mr. Eckes argues that the classical approach to inter-rater reliability, which consists of training the raters and measuring their extent of agreement until they reach an acceptable level, does not really work. This is because no matter how much training the raters receive, they will still not be interchangeable. A residual intrinsic disagreement will remain among the raters, some of them being more stringent than others in their approach to rating.

The solution that Mr. Eckes proposes is to develop statistical models that describe the different facets of the inter-rater reliability experiment, such as the rater facet, the subject facet and possibly other facets. These statistical models will then be used to make some adjustments to the ratings so that the subjects (assumed to be human) can get a fair test. This adjustment will supposedly not penalize the subjects who were unlucky enough to be rated by the more severe raters.

I must say I did like this book very much for the way the author describes the different issues associated with an inter-rater reliability experiment. The presentation of these issues by the author is very instructive and is done with considerable clarity. That alone justifies the investment in time and money one can make in this book. However, I have always been somewhat skeptical about the use of theoretical statistical models for the purpose of making important practical decisions, especially decisions involving human subjects. As a matter of fact, even if the raters introduce some bias in the ratings, two statisticians will probably not recommend the same statistical models either. Using these models to adjust the ratings may only be adding the statistician bias that could compound with the rater bias to produce an outcome that can hardly be seen as more reliable. The statistical models can always help the researcher gain more insight into a reality with powerful modelling tools, but cannot and should not be seen as an expression of that reality. Nevertheless, this book is remarkably well written, and should certainly be useful to anyone interested in the topic of inter-rater reliability.

[1] Eckes, Thomas. (2011). Introduction to Many-Facet Rasch Measurement. Peter Lang, ISBN: 978-3-631-61350-4.