K. Gwet's Inter-Rater Reliability Blog: December 2014
Inter-rater reliability: Cohen kappa, Gwet AC1/AC2, Krippendorff Alpha

After computing Cohen's kappa coefficient or any alternative agreement coefficient (Gwet's AC₁, Krippendorff's alpha, ...), researchers often want to interpret its magnitude. Does it qualify as excellent, good, or perhaps poor? This task is generally accomplished using a benchmark scale such as the one proposed by Altman (1991) and shown in Table 1. Other benchmark scales have been proposed in the literature; see Gwet (2014, Chapter 6).

Table 1

The benchmarking procedure traditionally used by researchers is straightforward: identify the range of values into which the computed agreement coefficient falls, and use the associated strength of agreement to interpret its magnitude. An agreement coefficient of 0.5, for example, will be categorized as "Moderate." This simple procedure can be misleading for several reasons.
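The traditional lookup described above can be sketched in a few lines of Python. Table 1 is not reproduced here, so the cut points below are assumed to follow Altman's published scale (Poor, Fair, Moderate, Good, Very Good in steps of 0.2). The name `naive_benchmark` and the handling of values that sit exactly on a boundary are illustrative choices, not part of the original post.

```python
# Assumed Altman (1991) scale: (lower bound, label), from highest to lowest.
ALTMAN_SCALE = [
    (0.8, "Very Good"),
    (0.6, "Good"),
    (0.4, "Moderate"),
    (0.2, "Fair"),
]


def naive_benchmark(coeff):
    """Return the label of the range that contains the computed coefficient.

    This is the traditional procedure: it ignores the standard error.
    A value exactly on a boundary is assigned to the lower range (assumption).
    """
    for lower, label in ALTMAN_SCALE:
        if coeff > lower:
            return label
    return "Poor"


print(naive_benchmark(0.5))    # Moderate
print(naive_benchmark(0.676))  # Good
```

Note that the function takes only the coefficient as input; nothing about the number of subjects or the precision of the estimate enters the interpretation.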

A calculated kappa is always based on a specific pool of subjects, and will change if different subjects are used. Therefore, there is no point in interpreting an estimate, which by definition is always subject to statistical variation. The correct approach is to use the estimate to shed light on the magnitude of the construct (or estimand) that it approximates. In our case, a more meaningful objective is to use the computed coefficient to form an opinion about the magnitude of the ("true" and unknown) extent of agreement among raters. The correct procedure must be probabilistic (i.e. our interpretation must be associated with a degree of certainty), precisely because the magnitude of the "true" extent of agreement among raters is unknown. That "true" agreement is an attribute of the raters, abstracted from the characteristics of any subject they may rate.
Several factors may affect the magnitude of an agreement coefficient. Among them are the number of subjects, the number of categories, and the distribution of subjects among categories. A kappa value of 0.6 based on 80 subjects conveys a much stronger message about the extent of agreement among raters than a kappa value of 0.6 based on only 10 subjects. Why should you interpret 0.6 the same way in these two very different contexts?
Kappa and the many alternative agreement coefficients advocated in the literature often behave very differently when applied to the same group of subjects and raters. Unless there is some form of standardization of the agreement coefficients, the use of the same benchmark scale to interpret all of them may be difficult to justify.

Here is a benchmarking procedure that overcomes many of these problems:

Consider an inter-rater reliability experiment that produced the agreement coefficients shown in Table 2. The second column shows the calculated agreement coefficients, while the third column contains the associated standard errors. The standard error is a statistical measure that tells you how far you would normally expect any given agreement coefficient value to stray away from its overall mean. This standard error plays a pivotal role in the benchmarking procedure we propose, since it quantifies the uncertainty surrounding the computed agreement coefficient.

Table 2

STEP 1: Computing the Benchmark Range Membership Probabilities of Table 3

Consider the first benchmark range 0.8 to 1.0 in Table 3. Suppose the "true" agreement among raters falls into this range, that Coeff is the computed agreement coefficient (with any method), and StdErr the associated standard error. If Z is a random variable that follows the standard normal distribution, then you would expect (Coeff-1)/StdErr ≤ Z ≤ (Coeff-0.8)/StdErr with a high probability. The probability of this event represents our certainty level that the extent of agreement among raters belongs to the 0.8-1.0 range. If this probability is small, then the extent of agreement among raters is likely to be smaller than 0.8. These certainty levels can be calculated for all ranges and all coefficients as shown in Table 3. Gwet (2014, Chapter 6) shows how MS Excel can be used to obtain these membership probabilities.
From Table 3, you can see the range of values into which the extent of agreement is most likely to fall. However, even the highest membership probability for a particular agreement coefficient may not be high enough to give us a satisfactory certainty level. Hence the need to carry out step 2.
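As a rough illustration of step 1, the membership probability of a range with bounds a and b can be computed as Φ((Coeff-a)/StdErr) - Φ((Coeff-b)/StdErr), where Φ is the standard normal distribution function. The Python sketch below applies this to each range. The function name, the range labels, the open-ended treatment of the lowest range, and the input values 0.7 and 0.1 are all assumptions made for illustration; they are not the figures of Table 2 or Table 3, and the sketch does not reproduce any adjustment Gwet (2014) may apply beyond the inequality stated above.

```python
from statistics import NormalDist

# Benchmark ranges as (lower, upper, label), from the top range to the bottom.
# Assumption: the lowest range is open-ended below 0.2 (lower=None).
RANGES = [
    (0.8, 1.0, "Very Good"),
    (0.6, 0.8, "Good"),
    (0.4, 0.6, "Moderate"),
    (0.2, 0.4, "Fair"),
    (None, 0.2, "Poor"),
]


def membership_probabilities(coeff, std_err):
    """Return (label, probability) pairs, one per benchmark range.

    For a range [lower, upper] the probability is
    P((coeff - upper)/std_err <= Z <= (coeff - lower)/std_err),
    with Z standard normal.
    """
    phi = NormalDist().cdf
    result = []
    for lower, upper, label in RANGES:
        hi_term = 1.0 if lower is None else phi((coeff - lower) / std_err)
        lo_term = phi((coeff - upper) / std_err)
        result.append((label, hi_term - lo_term))
    return result


# Hypothetical coefficient and standard error, for illustration only.
for label, prob in membership_probabilities(0.7, 0.1):
    print(f"{label:10s} {prob:.3f}")
```

Because the normal distribution places a little probability above 1, the probabilities from this sketch sum to slightly less than 1 when the coefficient is close to the top of the scale.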

STEP 2: Computing the Benchmark Range Cumulative Membership Probabilities of Table 4

In this second step, you compute the cumulative membership probabilities for each agreement coefficient, as shown in Table 4. That is, the membership probabilities of Table 3 are summed successively down each column, from the top range to the bottom. You must then set a cut-off point (e.g. 0.95). The first benchmark range whose cumulative membership probability equals or exceeds that cut-off point provides the basis for interpreting your agreement coefficient and for determining the strength of the agreement among raters.
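The cumulative step and the cut-off rule can be sketched as follows. The membership probabilities in the example are made-up values chosen only to show the mechanics; they are not taken from Table 3. The function name `benchmark_from_memberships` is likewise hypothetical.

```python
from itertools import accumulate


def benchmark_from_memberships(labels, probs, cutoff=0.95):
    """Pick the first range whose cumulative probability reaches the cut-off.

    `labels` and `probs` must be ordered from the top range to the bottom.
    Returns the selected label and the list of cumulative probabilities.
    If the cut-off is never reached, the bottom range is returned (assumption).
    """
    cumulative = list(accumulate(probs))
    for label, cum in zip(labels, cumulative):
        if cum >= cutoff:
            return label, cumulative
    return labels[-1], cumulative


# Hypothetical membership probabilities, top range first.
labels = ["Very Good", "Good", "Moderate", "Fair", "Poor"]
probs = [0.10, 0.55, 0.32, 0.03, 0.00]

label, cumulative = benchmark_from_memberships(labels, probs, cutoff=0.95)
print(label)  # Moderate
```

In this made-up example the most likely single range is "Good" (0.55), yet the cumulative probability only reaches 0.95 once the "Moderate" range is included, so the coefficient is interpreted as "Moderate."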

STEP 3: Interpretation of the Agreement Coefficients

Table 4 indicates that the range of values 0.4 to 0.6 is the first one with a Kappa-based cumulative probability that exceeds 0.95. Consequently, Kappa is qualified as "Moderate." Note that the procedure currently used by many researchers would qualify kappa as "Good," since its calculated value is 0.676 as shown in Table 2.
AC₁ is qualified as "Very Good" because the 0.8-to-1.0 range of values is associated with a cumulative membership probability of 0.959, which exceeds 0.95.

Table 3: Benchmark Range Membership Probabilities

Table 4: Benchmark Range Cumulative Membership Probabilities

References.

Altman, D. G. (1991). Practical Statistics for Medical Research. Chapman and Hall.
Gwet, K. (2014). Handbook of Inter-Rater Reliability. Advanced Analytics Press, Maryland, USA. ISBN: 978-0970806284.

Friday, December 12, 2014

Benchmarking Agreement Coefficients
