Stata users now have a convenient way to compute a wide variety of agreement coefficients within a general framework. The module KAPPAETC can be installed from within Stata and computes various measures of inter-rater agreement, along with their standard errors and confidence intervals.

A very interesting background article entitled "Implementing a general framework for assessing interrater agreement in Stata," by Daniel Klein, is a must-read for Stata users who want to understand the calculations KAPPAETC performs behind the scenes. KAPPAETC is a remarkably well-written Stata package, and it is what I strongly recommend to all Stata users for calculating the AC1, Kappa, and Krippendorff agreement coefficients, along with their standard errors and confidence intervals.

# K. Gwet's Inter-Rater Reliability Blog

On this blog, I discuss techniques and general issues related to the design and analysis of inter-rater reliability studies. My mission is to help researchers improve the way they address inter-rater reliability assessments, through the learning of simple and specific statistical techniques that the community of statisticians has left us to discover on our own.

## Saturday, January 26, 2019

## Monday, August 20, 2018

### AC1 Coefficient implemented in the FREQ Procedure of SAS

As of SAS/STAT version 14.2, the AC1 (see Gwet, 2008) and PABAK (see Byrt, Bishop, and Carlin, 1993) agreement coefficients can be calculated with the FREQ procedure of SAS, in addition to Cohen's Kappa. Therefore, SAS users no longer need separate software to obtain these statistics.

SAS users should nevertheless be aware that, by default, the FREQ procedure deletes every observation with at least one missing value. Consequently, if your dataset contains missing ratings, the results obtained with SAS may differ from those obtained with the R functions available in several packages. An option is available for instructing the FREQ procedure to treat missing values as true categories; however, this option is of no use for the analysis of agreement among raters.

*What would be of interest is for Proc FREQ developers to allow the marginals associated with rater1 and rater2 to be calculated independently. That is, if a rating is available from rater1, then it should be used for calculating rater1's marginals whether or not it is available from rater2.*

One last comment: the coefficient often referred to by researchers as PABAK is also known (perhaps more rightfully so) as the Brennan-Prediger coefficient, since it was formally studied by Brennan & Prediger (1981), 13 years earlier.

Bibliography.

Brennan, R. L., and Prediger, D. J. (1981). Coefficient Kappa: some uses, misuses, and alternatives. *Educational and Psychological Measurement*, 41, 687-699.

Byrt, T., Bishop, J., and Carlin, J. B. (1993). Bias, prevalence and Kappa. *Journal of Clinical Epidemiology*, 46, 423-429.

Gwet, K. L. (2008). Computing inter-rater reliability and its variance in the presence of high agreement. *British Journal of Mathematical and Statistical Psychology*, 61, 29-48.

## Saturday, February 10, 2018

### Inter-rater reliability among multiple raters when subjects are rated by different pairs of raters

In this post, I would like to briefly address an issue that researchers have contacted me about on many occasions. The issue can be described as follows:

- You want to evaluate the extent of agreement among 3 raters or more.
- For various practical reasons, the inter-rater reliability experiment is designed in such a way that
**only 2 raters are randomly assigned to each subject**. For each subject, a new pair of raters is independently chosen from the same pool of several raters. Consequently, each subject gets 2 ratings from a pair of raters that could vary from subject to subject.

Note that most inter-rater reliability coefficients found in the literature are based on the assumption that each subject is rated by all raters. This ubiquitous fully-crossed design may prove impractical if rating costs are prohibitive. The question then becomes: which coefficient should be used to evaluate the extent of agreement among multiple raters when only 2 of them are allowed to rate any given subject?
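For concreteness, here is a minimal sketch (my own illustration, not code from any published package) of the two-rater AC1 of Gwet (2008), computed on the pooled pairs of ratings regardless of which raters produced them; the data in the example are invented:

```python
from collections import Counter

def gwet_ac1(pairs):
    """Gwet's AC1 for two ratings per subject (nominal scale).
    pairs: list of (rating1, rating2) tuples; under a paired-rater design,
    the two ratings of a subject may come from any pair of raters."""
    n = len(pairs)
    cats = sorted({c for pair in pairs for c in pair})
    q = len(cats)
    if q < 2:
        return 1.0                       # degenerate case: one category only
    # observed agreement
    p_a = sum(a == b for a, b in pairs) / n
    # pi_c: category proportions, averaged over the two rating columns
    counts = Counter()
    for a, b in pairs:
        counts[a] += 0.5
        counts[b] += 0.5
    # AC1 chance agreement: sum of pi_c*(1 - pi_c), scaled by 1/(q - 1)
    p_e = sum((counts[c] / n) * (1 - counts[c] / n) for c in cats) / (q - 1)
    return (p_a - p_e) / (1 - p_e)

# invented example: 4 subjects, each rated by some pair of raters
print(gwet_ac1([(1, 1), (1, 2), (2, 2), (1, 1)]))
```

The point estimate here is exactly the fixed-pair AC1; as discussed in this post, it is the interpretation of the standard error, not the formula itself, that must account for the random pairing of raters.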

The solution to this problem is actually quite simple and does not involve any new coefficient not already available in the literature. It consists of using your coefficient of choice and calculating the agreement coefficient as if the ratings were all produced by the exact same pair of raters. What changes drastically is the interpretation of its magnitude, compared to the case where only 2 raters actually participated in the experiment. If the ratings come from 2 raters only, the standard error associated with the coefficient will be smaller than if the ratings came from 5 or more raters grouped in pairs. In the latter case, the coefficient is subject to an additional source of variation, due to the random assignment of raters to subjects, that must be taken into consideration. I prepared an unpublished paper on this topic entitled *"An Evaluation of the Impact of Design on the Analysis of Nominal-Scale Inter-Rater Reliability Studies,"* which interested readers may want to download for a more detailed discussion of this interesting topic.

## Tuesday, September 6, 2016

### A t-test for correlated agreement coefficients and application with the R package

Researchers must often compare two groups of raters with respect to the extent to which they agree on the rating of the same group of subjects. The extent of agreement among raters of the same group can also be measured on two occasions (e.g., before and after a training session) in order to assess the effectiveness of training in improving inter-rater reliability. An agreement coefficient must then be calculated twice. The traditional statistical approach for testing the difference for statistical significance is to divide that difference by its standard error, then compare this ratio (i.e., the test statistic) to a critical value (often 1.96). If the absolute value of the test statistic exceeds the critical value, one may conclude that the difference is statistically significant. Sometimes the p-value is calculated instead, and statistical significance is concluded when it falls below 0.05. However, calculating the variance of the difference can sometimes be problematic.

If the two groups of raters (or the same group observed on 2 occasions) must rate the exact same group of subjects, then any agreement coefficient used (e.g., Fleiss' generalized kappa, Gwet's AC1, Conger's generalized kappa, the Brennan-Prediger coefficient, or Krippendorff's alpha) will produce two correlated coefficients, making the calculation of the variance of the difference very difficult due to the embedded correlation structure. Gwet (2016) proposed the linearization method to resolve this problem. This approach consists of using a linear approximation to the agreement coefficient to develop the equivalent of a paired t-test. R users may use the **R functions** that I developed to implement the linearization method for testing the difference of two agreement coefficients for statistical significance. See more details on Kudos.

Bibliography:

*Gwet, K. L. (2016). Testing the difference of correlated agreement coefficients for statistical significance. Educational and Psychological Measurement, 76(4), 609-637.*

## Saturday, August 22, 2015

### Standard Error of Krippendorff's Alpha Coefficient

On August 9, 2015, I received an email from a researcher at the University of Manchester about the standard error associated with Krippendorff's alpha coefficient. He was asking why my software AgreeStat for Excel produces a standard error for Krippendorff's alpha that is always higher than the one produced by Dr. Hayes' SAS and SPSS macros called KALPHA. On this issue, I would like to make two comments:

1) AgreeStat uses the variance expression given in equation (7) of the document entitled "On Krippendorff's Alpha," while the KALPHA macro is based on the bootstrap standard error. However, the use of these two different approaches alone cannot and should not explain the observed difference in the standard error estimates.

2) I do not recommend using Dr. Hayes' macro programs for computing the standard error of Krippendorff's alpha. They always underestimate (often by a wide margin) the magnitude of the standard error associated with Krippendorff's alpha. Here is why I believe so. In his paper "Answering the Call for a Standard Reliability Measure for Coding Data," released in 2007 and co-authored by Krippendorff himself, Dr. Hayes says the following regarding the algorithm he used:

*The bootstrap sampling distribution of alpha is generated by taking a random sample of 239 pairs of judgments from the available pairs, weighted by how many observers judged a given unit. Alpha is computed in this “resample” of 239 pairs, and this process is repeated very many times, producing the bootstrap sampling distribution of Alpha.*

This bootstrapping algorithm is terrible, and does not reflect in any way the bootstrap method previously introduced by Efron & Tibshirani (1998). Instead of resampling the rows of Table 1 of their article before re-computing the alpha coefficient (which is what Efron & Tibshirani recommend), Dr. Hayes generates several sets of 239 pairs of judgments and computes alpha for each of them. First, the number 239 comes from the original table and is supposed to change from one bootstrap sample to the next. By keeping the exact same structure as the original sample, with the exact same number of missing judgments, you can only obtain a constrained (and therefore smaller) variance. What Dr. Hayes should have done is simply generate several random samples with replacement from the set {1, 2, 3, 4, ..., 40}. A with-replacement random sample will contain duplicates, which is fine. The next step would be to extract from Table 1 only the rows whose numbers were selected in the with-replacement sample, and use them to form the bootstrap sample. This bootstrap sample would then be used to compute the alpha coefficient.
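To make the distinction concrete, here is a minimal sketch (mine, with invented data) of the row-resampling bootstrap just described, paired with a nominal-scale Krippendorff alpha computed from the coincidence matrix; note that the number of units, not the number of pairable values, is held fixed across replicates:

```python
import random
from collections import Counter
from itertools import permutations

def krippendorff_alpha_nominal(units):
    """Nominal-scale Krippendorff alpha via the coincidence matrix.
    units: list of per-unit rating lists; None marks a missing rating."""
    o = Counter()                      # coincidence weights o[(c, k)]
    for unit in units:
        vals = [v for v in unit if v is not None]
        m = len(vals)
        if m < 2:
            continue                   # unpairable units are ignored
        for c, k in permutations(vals, 2):
            o[(c, k)] += 1.0 / (m - 1)
    n = sum(o.values())                # total number of pairable values
    if n == 0:
        raise ValueError("no pairable ratings")
    n_c = Counter()
    for (c, _k), w in o.items():
        n_c[c] += w
    d_o = sum(w for (c, k), w in o.items() if c != k) / n
    d_e = sum(n_c[c] * n_c[k] for c in n_c for k in n_c if c != k) / (n * (n - 1))
    return 1.0 if d_e == 0 else 1.0 - d_o / d_e   # single-category convention

def bootstrap_se(units, n_boot=1000, seed=20150822):
    """Resample whole units (rows) with replacement, as Efron & Tibshirani
    prescribe, and recompute alpha on every resample."""
    rng = random.Random(seed)
    reps = []
    for _ in range(n_boot):
        sample = [units[rng.randrange(len(units))] for _ in units]
        reps.append(krippendorff_alpha_nominal(sample))
    mean = sum(reps) / n_boot
    return (sum((a - mean) ** 2 for a in reps) / (n_boot - 1)) ** 0.5
```

Because each resample is free to contain duplicated rows and a different pattern of missing judgments, the resulting bootstrap distribution is not artificially constrained to the structure of the original table.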

## Friday, December 12, 2014

### Benchmarking Agreement Coefficients

After computing Cohen's kappa coefficient or any alternative agreement coefficient (Gwet's AC1, Krippendorff's alpha, ...), researchers often want to interpret its magnitude. Does it qualify as excellent? Good? Or perhaps poor? This task is generally accomplished using a benchmark scale such as the one proposed by Altman (1991) and shown in Table 1 (other benchmark scales have been proposed in the literature; see Gwet (2014, Chapter 6)).

**Table 1**

The benchmarking procedure traditionally used by researchers is straightforward: identify the specific range of values into which the computed agreement coefficient falls, and use the associated strength of agreement to interpret its magnitude. An agreement coefficient of 0.5, for example, will be categorized as "Moderate."
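In code, this traditional procedure is a plain range lookup; the cutpoints below are my reading of the Altman (1991) scale referenced in Table 1:

```python
ALTMAN_SCALE = [                 # (lower bound, label), scanned top-down
    (0.8, "Very Good"),
    (0.6, "Good"),
    (0.4, "Moderate"),
    (0.2, "Fair"),
    (float("-inf"), "Poor"),
]

def traditional_benchmark(coeff):
    """Return the label of the first benchmark range containing coeff."""
    for lower, label in ALTMAN_SCALE:
        if coeff >= lower:
            return label
```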

Consider an inter-rater reliability experiment that produced the agreement coefficients shown in Table 2. The second column shows the calculated agreement coefficients, while the third column contains the associated standard errors. The standard error is a statistical measure that tells you how far you would normally expect any given agreement coefficient value to stray away from its overall mean. This standard error plays a pivotal role in the benchmarking procedure we propose, since it quantifies the uncertainty surrounding the computed agreement coefficient.

**This simple procedure can be misleading for several reasons.**

- A calculated kappa is always based on a specific pool of subjects, and will change if different subjects are used. There is therefore little point in interpreting an estimate, which by definition is always exposed to statistical variation. The correct approach is to use the estimate to shed light on the magnitude of the construct (or estimand) that it approximates. *In our case, a more meaningful objective is to use the computed coefficient to form an opinion about the magnitude of the ("true" and unknown) extent of agreement among raters.* The correct procedure must be probabilistic (i.e., our interpretation must come with a degree of certainty), precisely because the magnitude of the "true" extent of agreement among raters is unknown; it is an attribute of the raters, abstracted from the characteristics of any subjects they may rate.
- Several factors may affect the magnitude of an agreement coefficient, among them the number of subjects, the number of categories, and the distribution of subjects across categories. A kappa value of 0.6 based on 80 subjects conveys a much stronger message about the extent of agreement among raters than a kappa value of 0.6 based on only 10 subjects. Why should 0.6 receive the same interpretation in these two very different contexts?
- Kappa and the many alternative agreement coefficients advocated in the literature often behave very differently when applied to the same group of subjects and raters. Unless the agreement coefficients are standardized in some way, using the same benchmark scale to interpret all of them is difficult to justify.


**Table 2**

**STEP 1: Computing the Benchmark Range Membership Probabilities of Table 3**

- Consider the first benchmark range, 0.8 to 1.0, in Table 3. Suppose the "true" extent of agreement among raters falls into this range, that *Coeff* is the computed agreement coefficient (with any method), and *StdErr* is the associated standard error. If *Z* is a random variable that follows the standard normal distribution, then you would expect *(Coeff-1)/StdErr ≤ Z ≤ (Coeff-0.8)/StdErr* with high probability. The probability of this event represents our certainty level that the extent of agreement among raters belongs to the 0.8-1.0 range. If this probability is small, then the extent of agreement among raters is likely smaller than 0.8. These certainty levels can be calculated for all ranges and all coefficients, as shown in Table 3. Gwet (2014, Chapter 6) shows how MS Excel can be used to obtain these membership probabilities.
- Table 3 shows the range of values into which the extent of agreement is most likely to fall. However, even the highest membership probability for a particular agreement coefficient may not be high enough to give us a satisfactory certainty level. Hence the need for step 2.

**STEP 2: Computing the Benchmark Range Cumulative Membership Probabilities of Table 4**

- In this second step, compute the cumulative membership probabilities for each agreement coefficient as shown in Table 4. That is, the membership probabilities of Table 3 are added column-wise, successively from the top range to the bottom. You must then set a cut-off point (e.g., 0.95): the first benchmark range whose cumulative membership probability equals or exceeds that cut-off provides the basis for interpreting your agreement coefficient and for determining the strength of agreement among raters.

**STEP 3: Interpretation of the Agreement Coefficients**

- Table 4 indicates that the range of values 0.4 to 0.6 is the first one whose kappa-based cumulative probability exceeds 0.95. Consequently, kappa is qualified as "Moderate." Note that the procedure currently used by many researchers would qualify kappa as "Good," since its calculated value is 0.676, as shown in Table 2.
- AC1 is qualified as "Very Good" because the 0.8-to-1.0 range of values is associated with a cumulative membership probability of 0.959, which exceeds 0.95.

**Table 3: Benchmark Range Membership Probabilities**

**Table 4: Benchmark Range Cumulative Membership Probabilities**
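The three steps above can be sketched as follows; the standard-error value used in the example is an invented placeholder (Table 2 is not reproduced here), and the cutpoints are the Altman scale:

```python
from math import erf, sqrt

def normal_cdf(x):
    """Standard normal CDF, via the error function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

ALTMAN_RANGES = [        # (lower, upper, label), from Table 1 (Altman, 1991)
    (0.8, 1.0, "Very Good"),
    (0.6, 0.8, "Good"),
    (0.4, 0.6, "Moderate"),
    (0.2, 0.4, "Fair"),
    (-1.0, 0.2, "Poor"),
]

def membership_probs(coeff, stderr, ranges=ALTMAN_RANGES):
    """STEP 1: P(lower <= true agreement <= upper) for each benchmark range,
    i.e. P((coeff-upper)/stderr <= Z <= (coeff-lower)/stderr)."""
    return [normal_cdf((coeff - lo) / stderr) - normal_cdf((coeff - hi) / stderr)
            for lo, hi, _ in ranges]

def benchmark(coeff, stderr, cutoff=0.95, ranges=ALTMAN_RANGES):
    """STEPS 2-3: cumulate membership probabilities from the top range down,
    and report the first range whose cumulative probability reaches cutoff."""
    cum = 0.0
    for p, (lo, hi, label) in zip(membership_probs(coeff, stderr, ranges), ranges):
        cum += p
        if cum >= cutoff:
            return label
    return ranges[-1][2]
```

With a kappa of 0.676 and an assumed standard error of 0.12, the first range whose cumulative probability reaches 0.95 is 0.4 to 0.6, so kappa is qualified as "Moderate," in line with the interpretation given in step 3 above.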

**References.**

- Altman, D. G. (1991). *Practical Statistics for Medical Research*. Chapman and Hall.
- Gwet, K. (2014). *Handbook of Inter-Rater Reliability*. Advanced Analytics Press, Maryland, USA. ISBN: 978-0970806284.

## Sunday, December 7, 2014

### Inter-Rater Reliability in Language Testing

The paper entitled "Assessing inter-rater agreement for nominal judgement variables" (alternative link) summarizes a simple comparative study between Cohen's Kappa and Gwet's AC1 for evaluating inter-rater reliability in the context of language testing, dichotomous variables, and high-prevalence data. Researchers may find this analysis instructive. I personally found it attractive for its simplicity and the clarity of its examples. Those who are new to the area of inter-rater reliability assessment may find it useful as well.