Once an agreement coefficient has been computed (Cohen's kappa, Gwet's AC_1, Krippendorff's alpha, ...), researchers often want to interpret its magnitude. Does it qualify as excellent, good, or poor? This task is generally accomplished using a benchmark scale such as the one proposed by Altman (1991) and shown in Table 1 (other benchmark scales have been proposed in the literature; see Gwet (2014, Chapter 6)):

**Table 1**

The benchmarking procedure traditionally used by researchers is straightforward: identify the specific range of values into which the computed agreement coefficient falls, and use the associated strength of agreement to interpret its magnitude. An agreement coefficient of 0.5, for example, will be categorized as "Moderate."
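The traditional lookup can be sketched as a simple range scan. Table 1 is not reproduced in this excerpt, so the range boundaries below are taken from Altman's commonly cited scale, which is consistent with the examples in the text (0.5 is "Moderate"; a kappa of 0.676 would be "Good"):

```python
# Altman's (1991) benchmark scale, listed from the top range down.
# The labels/boundaries are an assumption, reconstructed from the
# examples in the text rather than copied from Table 1.
ALTMAN_SCALE = [
    (0.8, "Very Good"),
    (0.6, "Good"),
    (0.4, "Moderate"),
    (0.2, "Fair"),
]

def benchmark(coeff: float) -> str:
    """Traditional benchmarking: return the strength-of-agreement label
    for the range containing the computed coefficient."""
    for lower, label in ALTMAN_SCALE:
        if coeff >= lower:
            return label
    return "Poor"

print(benchmark(0.5))    # "Moderate"
print(benchmark(0.676))  # "Good"
```

This deterministic lookup is exactly what the rest of the article argues against: it ignores the uncertainty attached to the computed coefficient.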

**This simple procedure can be misleading for several reasons.**

- A calculated kappa is always based on a specific pool of subjects, and will change if different subjects are used. Therefore, there is no point in interpreting an estimate, which by definition is always exposed to statistical variation. The correct approach is to use the estimate to shed light on the magnitude of the construct (or estimand) that it approximates. *In our case, a more meaningful objective is to use the computed coefficient to form an opinion about the magnitude of the ("true" and unknown) extent of agreement among raters.* The correct procedure must be probabilistic (i.e. our interpretation must be associated with a degree of certainty), precisely because the magnitude of the "true" extent of agreement among raters is unknown; it is an attribute of these raters, abstracted from the characteristics of any subjects they may rate.
- Several factors may affect the magnitude of an agreement coefficient, among them the number of subjects, the number of categories, and the distribution of subjects among categories. A kappa value of 0.6 based on 80 subjects conveys a much stronger message about the extent of agreement among raters than a kappa value of 0.6 based on only 10 subjects. Why should 0.6 receive the same interpretation in these two very different contexts?
- Kappa and the many alternative agreement coefficients advocated in the literature often behave very differently when applied to the same group of subjects and raters. Unless the agreement coefficients are standardized in some way, using the same benchmark scale to interpret all of them is difficult to justify.

Consider an inter-rater reliability experiment that produced the agreement coefficients shown in Table 2. The second column shows the calculated agreement coefficients, while the third column contains the associated standard errors. The standard error is a statistical measure that tells you how far you would normally expect any given agreement coefficient to stray from its overall mean. It plays a pivotal role in the benchmarking procedure we propose, since it quantifies the uncertainty surrounding the computed agreement coefficient.

**Table 2**

**STEP 1: Computing the Benchmark Range Membership Probabilities of Table 3**

- Consider the first benchmark range, 0.8 to 1.0, in Table 3. Let *Coeff* be the computed agreement coefficient (obtained with any method) and *StdErr* its associated standard error. If the "true" extent of agreement among raters falls into this range, and *Z* is a random variable that follows the standard normal distribution, then the event *(Coeff-1)/StdErr ≤ Z ≤ (Coeff-0.8)/StdErr* is expected to occur with high probability. The probability of this event represents our certainty level that the extent of agreement among raters belongs to the 0.8-1.0 range. If this probability is small, then the extent of agreement among raters is likely to be smaller than 0.8. These certainty levels can be calculated for all ranges and all coefficients, as shown in Table 3. Gwet (2014, Chapter 6) shows how MS Excel can be used to obtain these membership probabilities.
- From Table 3, you can see the range of values into which the extent of agreement is most likely to fall. However, even the highest membership probability for a particular agreement coefficient may not be high enough to give us a satisfactory certainty level. Hence the need to carry out Step 2.

**STEP 2: Computing the Benchmark Range Cumulative Membership Probabilities of Table 4**

- In this second step, you compute the cumulative membership probabilities for each agreement coefficient, as shown in Table 4. That is, the membership probabilities of Table 3 are added column-wise, successively from the top range to the bottom. You must then set a cut-off point (e.g. 0.95): the first benchmark range whose cumulative membership probability equals or exceeds that cut-off point provides the basis for interpreting your agreement coefficient, and for determining the strength of agreement among raters.
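The column-wise accumulation can be sketched as follows. Since Table 3 is not reproduced in this excerpt, the kappa membership probabilities below are hypothetical stand-ins, ordered from the top benchmark range down:

```python
from itertools import accumulate

# Hypothetical kappa membership probabilities for the ranges
# 0.8-1.0, 0.6-0.8, 0.4-0.6, 0.2-0.4, 0.0-0.2 (Table 3 is not
# reproduced in this excerpt).
kappa_probs = [0.079, 0.727, 0.193, 0.001, 0.000]

# Step 2: add the probabilities successively from the top range down.
cumulative = list(accumulate(kappa_probs))
for (lo, hi), cum in zip([(0.8, 1.0), (0.6, 0.8), (0.4, 0.6),
                          (0.2, 0.4), (0.0, 0.2)], cumulative):
    print(f"{lo:.1f}-{hi:.1f}: {cum:.3f}")
```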

**STEP 3: Interpretation of the Agreement Coefficients**

- Table 4 indicates that the range 0.4 to 0.6 is the first one whose kappa-based cumulative probability exceeds 0.95. Consequently, kappa is qualified as "Moderate." Note that the procedure currently used by many researchers would qualify kappa as "Good," since its calculated value is 0.676, as shown in Table 2.
- AC_1 is qualified as "Very Good" because the 0.8-to-1.0 range of values is associated with a cumulative membership probability of 0.959, which exceeds 0.95.
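Step 3 then reduces to scanning the cumulative column, top range first, for the first entry that reaches the cut-off. A minimal sketch, using hypothetical cumulative probabilities consistent with the conclusions above (kappa qualifying as "Moderate"; AC_1's leading value 0.959 is the one quoted in the text):

```python
def interpret(cumulative_probs, labels, cutoff=0.95):
    """Return the label of the first benchmark range (top range first)
    whose cumulative membership probability reaches the cut-off."""
    for cum, label in zip(cumulative_probs, labels):
        if cum >= cutoff:
            return label
    return labels[-1]  # fall back to the bottom range

# Labels for the ranges 0.8-1.0, 0.6-0.8, 0.4-0.6, 0.2-0.4, 0.0-0.2.
LABELS = ["Very Good", "Good", "Moderate", "Fair", "Poor"]

# Hypothetical cumulative probabilities (Table 4 is not reproduced here);
# AC_1's first entry, 0.959, is the value quoted in the text.
print(interpret([0.079, 0.806, 0.999, 1.000, 1.000], LABELS))  # Moderate
print(interpret([0.959, 0.999, 1.000, 1.000, 1.000], LABELS))  # Very Good
```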

**Table 3: Benchmark Range Membership Probabilities**

**Table 4: Benchmark Range Cumulative Membership Probabilities**

**References**

- Altman, D. G. (1991). *Practical Statistics for Medical Research*. Chapman and Hall.
- Gwet, K. (2014). *Handbook of Inter-Rater Reliability*. Advanced Analytics Press, Maryland, USA. ISBN: 978-0970806284.

Hi Dr. Gwet,

I'm Hong, an ophthalmology registrar based in New Zealand. I have been reading your book and some of your articles on inter-rater reliability. I am about to start a research project comparing images taken with smartphones and those from a standard fundus camera. I am desperately looking for advice, and I came across your blog. I thought you would be the best person to ask. I would be delighted if you could give me your email or contact me at hschiong@gmail.com. Thank you so much.

Hi Dr Gwet,

I am a student also working on kappa and found your work; it was a great development. I wanted to get your article. Regarding some new developments, I found a recent article that shows your AC1 statistic has some problems; kindly check the new one in the 13th proceedings of the conference.

http://www.isoss.net/proceedings

kindly provide me your original article.

You may download the original article using the link: http://www.agreestat.com/research_papers/kappa_statistic_is_not_satisfactory.pdf

This is one of the most insightful discussions that I have ever read about these test statistics.
