Monday, March 31, 2014

Some R functions for calculating chance-corrected agreement coefficients

Several researchers have expressed interest in R functions that can compute several chance-corrected agreement coefficients, their standard errors, confidence intervals, and p-values as described in my book Handbook of Inter-Rater Reliability (3rd ed.). I have finally found the time to write these R functions, which can be downloaded from the r-functions page of the agreestat website.

All these R functions handle missing values and cover several types of agreement coefficients, including Gwet's AC1/AC2 (2008, 2012), the kappa coefficients of Cohen (1960), Fleiss (1971), Conger (1980), and Brennan & Prediger (1981), Krippendorff's alpha (1970), and the percent agreement.


[1] Brennan, R. L., and Prediger, D. J. (1981). "Coefficient Kappa: some uses, misuses, and alternatives." Educational and Psychological Measurement, 41, 687-699.
[2] Cohen, J. (1960). "A coefficient of agreement for nominal scales." Educational and Psychological Measurement, 20, 37-46.
[3] Conger, A. J. (1980). "Integration and generalization of kappas for multiple raters." Psychological Bulletin, 88, 322-328.
[4] Fleiss, J. L. (1971). "Measuring nominal scale agreement among many raters." Psychological Bulletin, 76, 378-382.
[5] Gwet, K. L. (2008). "Computing inter-rater reliability and its variance in the presence of high agreement." British Journal of Mathematical and Statistical Psychology, 61, 29-48.
[6] Gwet, K. L. (2012). Handbook of Inter-Rater Reliability (3rd ed.). Advanced Analytics, LLC, Maryland, USA.
[7] Krippendorff, K. (1970). "Estimating the reliability, systematic error, and random error of interval data." Educational and Psychological Measurement, 30, 61-70.


  1. Dear Kilem,

    First, I would like to express my gratitude for preparing the R functions for computing the AC1 coefficient. It really saved our project. I would, however, like to ask whether there is also an R function for computing the AC2 coefficient? I found only the one for AC1 on your web site.

    many thanks for help & best, Gregor

  2. Ah, I just discovered the weight matrix... tnx anyway :)

  3. Hello, Kilem!

    My name is Gustavo Arruda. I'm from Brazil.

    Congratulations on your contributions to the field of statistics.

    I would like to know whether the modified Kappa (Brennan-Prediger, 1981) can be used for a test-retest (intra-rater) analysis of an ordinal variable with three categories, using simple ordinal weights. The modified Kappa was originally developed for nominal variables, but AgreeStat can run it with a larger number of categories (without distinction).
    Thank you!

  4. Hi Gustavo,
    Although the Brennan-Prediger coefficient is most often used for computing inter-rater reliability, there is nothing that would prevent you from using it to compute intra-rater reliability. The "raters" in this case would represent ratings of the same subjects on different occasions. Other than that, everything else remains the same. If your ratings are ordinal, then you should certainly use ordinal weights to account for the partial agreement that some disagreements represent.
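    As a rough illustration of this advice, here is a minimal, self-contained sketch (not Dr. Gwet's own code) of a weighted Brennan-Prediger coefficient for one rater scoring the same subjects on two occasions. Linear weights are used here as a simple stand-in for ordinal weights, the chance-agreement term sum(w)/q^2 is the usual weighted generalization, and the ratings are made up:

```r
# Weighted Brennan-Prediger for a test-retest (intra-rater) design,
# 3 ordinal categories; the two occasions act as two "virtual" raters.
q <- 3
w <- outer(1:q, 1:q, function(k, l) 1 - abs(k - l) / (q - 1))  # linear weights

occ1 <- c(1, 2, 2, 3, 1, 3, 2, 1)   # hypothetical ratings, occasion 1
occ2 <- c(1, 2, 3, 3, 2, 3, 2, 1)   # hypothetical ratings, occasion 2

pa <- mean(w[cbind(occ1, occ2)])    # weighted percent agreement (0.875 here)
pe <- sum(w) / q^2                  # weighted chance agreement = 5/9
bp <- (pa - pe) / (1 - pe)          # weighted Brennan-Prediger coefficient
```

    With these numbers bp works out to 0.71875; an unweighted analysis (identity weights) would count the two one-category disagreements as full disagreements.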



  5. Thank you for your attention!
    And congratulations for your contributions!

  6. Hello Kilem, a question:
    If we have multiple judges rating the examinees, one judge may have rated only 7 times while others scored more than 100 times. I know it would be expected to have similar numbers of ratings, and I read in one of your articles that they cannot be interpreted in the same way. What would be your recommendation? We are performing the analysis with AC1.

  7. Dr. Gwet,

    I have your 3rd & 4th editions. Thank you very much! Also, thank you very much for creating the R code too!

    I have a 37 (item) x 16 rater matrix of ratings (1=essential, 2=useful, 3=not necessary) with some missing values. As I understand it, the appropriate measure of inter-rater reliability would be the AC2. When I copy the R function from your website, run it, and then try to compute the AC2, I get the following error:

    > gwet.ac1.raw(interrater.reliability)
    Error in gwet.ac1.raw(interrater.reliability) :
    could not find function "identity.weights"

    > gwet.ac1.raw(interrater.reliability,weights="unweighted")
    Error in gwet.ac1.raw(interrater.reliability, weights = "unweighted") :
    could not find function "identity.weights"

    I am not sure what I am doing wrong; could you give me some insight?

    Thank you very much!


  8. Hi Greg,
    Download the file, then load it in R using source("C:\\Your_Directory\\weights.gen.r"). Now you can use any function you want.

  9. Dear Kilem.

    I've used the functions provided in agree.coeff2.r to assess the agreement of two different questionnaires (without any gold standard) that classify a person as positive or negative for one condition according to its own score. Each participant is assessed once with each test in the same visit.

    With Gwet's AC1 there is a lack of agreement between the questionnaires, which yields a negative coefficient (which I assume reflects the negative bias that you point out in reference 5), but what really surprises me is a p-value well above 1.

    This leads me to a dilemma: I don't know whether there is an error in the function, or how a p-value can exceed one and how such a value should be interpreted. Could you shed some light on this issue?



    Here are the numbers.

    A+B+ = 68
    A+B- = 2
    A-B+ = 267
    A-B- = 146

    Gwet's AC1/AC2 Coefficient
    Percent agreement: 0.4430642 Percent chance agreement: 0.4869604
    AC1/AC2 coefficient: -0.08556103 Standard error: 0.04785226
    95 % Confidence Interval: ( -0.1795858 , 0.008463783 )
    P-value: 1.9256

  10. I have updated the r function agree.coeff2.r. It now produces the correct p-value even when the agreement coefficient is negative.


    1. Hi Kilem

      Can the AC1 coefficient be negative, and how should it be interpreted? We have found some negative values but are not sure how to present them.


    2. Yes, AC1 can take a negative value. This indicates an absence of agreement among raters beyond chance.
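    For readers who want to check such numbers, the AC1 reported in comment 9 above can be reproduced by hand from the 2x2 table. This is a sketch of the two-rater, two-category formulas only, not the code in agree.coeff2.r:

```r
# 2x2 table from the comment: A+B+ = 68, A+B- = 2, A-B+ = 267, A-B- = 146
n <- c(68, 2, 267, 146)
N <- sum(n)                                          # 483 participants

pa <- (n[1] + n[4]) / N                              # observed (percent) agreement
p.plus <- ((n[1] + n[2]) + (n[1] + n[3])) / (2 * N)  # average prevalence of "+"
pe <- 2 * p.plus * (1 - p.plus)                      # AC1 chance agreement, q = 2
ac1 <- (pa - pe) / (1 - pe)
# pa = 0.4430642, pe = 0.4869604, ac1 = -0.0855610 (matching the output above)
```

    Since pe exceeds pa here, the numerator is negative, hence the negative AC1.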

  11. Hi Dr. Gwet,

    I am using your gwet.ac1.raw function in agree.coeff3.raw.r. When I use a dataset which contains some missing values (NA), the function returns NaN for the AC1 coefficient.

    > testAC
    [,1] [,2] [,3]
    [1,] NA NA NA
    [2,] 1 NA NA
    [3,] NA NA NA
    [4,] 1 1 1
    [5,] 1 1 1
    [6,] 1 1 1
    [7,] 1 1 1
    [8,] 1 2 1
    [9,] 1 2 1
    [10,] 1 1 1
    [11,] NA NA NA
    > gwet.ac1.raw(testAC)
    Gwet's AC1 Coefficient
    Percent agreement: 0.8095238 Percent chance agreement: NaN
    AC1 coefficient: NaN Standard error: NaN
    95 % Confidence Interval: ( NaN , NaN )
    P-value: NaN

    Hoping you can help. My understanding is the code should work for missing values. It appears to work fine when there are no missing values.


  12. Each row must contain at least one non-missing value. Rows with only NA values must first be deleted from the dataset before executing the functions.
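    In practice, the all-NA rows can be dropped in one line before calling the functions; using the testAC matrix posted above:

```r
# testAC as posted in the comment: rows 1, 3 and 11 are entirely missing
testAC <- matrix(c(NA, NA, NA,
                    1, NA, NA,
                   NA, NA, NA,
                    1,  1,  1,
                    1,  1,  1,
                    1,  1,  1,
                    1,  1,  1,
                    1,  2,  1,
                    1,  2,  1,
                    1,  1,  1,
                   NA, NA, NA), ncol = 3, byrow = TRUE)

# Keep only the rows with at least one non-missing rating
testAC.clean <- testAC[rowSums(!is.na(testAC)) > 0, , drop = FALSE]
nrow(testAC.clean)   # 8 rows remain, ready for gwet.ac1.raw(testAC.clean)
```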

  13. Thanks. That did solve the issue. However, I'm trying the following dataset which is causing issues for AC1 as well as for k-alpha:

    > testNoNA
    [,1] [,2] [,3]
    [1,] 2 NA NA
    [2,] 2 2 2
    [3,] 2 2 2
    [4,] 2 2 2
    [5,] 2 2 2
    [6,] 2 2 2
    [7,] 2 2 2
    [8,] 2 2 2
    > gwet.ac1.raw(testNoNA)
    Gwet's AC1 Coefficient
    Percent agreement: 1 Percent chance agreement: NaN
    AC1 coefficient: NaN Standard error: NaN
    95 % Confidence Interval: ( NaN , NaN )
    P-value: NaN
    > krippen.alpha.raw(testNoNA)
    Error in (agree.mat * (agree.mat.w - 1)) %*% rep(1, q) :
    non-conformable arguments


  14. This particular data set generates an error message mainly because it contains a single category, which is 2. A typical data table would normally show 2 categories or more; when you want to quantify the extent of agreement among raters, it is because the raters have the possibility of selecting different categories.

    I understand very well that despite the availability of 2 or more categories, 2 raters may well decide to assign all subjects to the exact same category (e.g. 2, as in your table). In this case AC1 equals 1 and its variance is 0. Krippendorff's alpha will likely be 0 (which is clearly inaccurate; this coefficient is known not to work well in such a scenario), and its variance will also be 0.

  15. Hi Kilem,

    I recently bought your book and find it very useful.

    It is my understanding that we can compute intra-rater reliability from ordinal ratings produced by a single rater on two occasions for a number of subjects. We can achieve this by treating the ordinal ratings on the two occasions as coming from two independent raters and essentially computing measures of inter-rater agreement which reflect the ordinal nature of the ratings (e.g., AC1, generalized kappa).

    What if we have n raters, each of whom provides ratings on the same subjects on two occasions? Is it fair to assume that the n x 2 sets of ordinal ratings come from n x 2 independent raters and then apply measures of inter-rater agreement to them? I haven't seen anything in the literature covering explicitly the case of more than 2 raters when it comes to dealing with ordinal ratings, so I wanted to make sure I'm on the right track.

    1. Hi Isabella,

      I am sorry for taking so long to respond to your inquiry. Here is how I would approach your problem:

      As far as computing intra-rater reliability goes, it is the number of times a subject is rated that dictates the number of "virtual" raters to consider. That is, if a single rater produces 2 ratings per subject, then assume both ratings come from 2 "virtual" raters, each producing 1 rating per subject.

      Now, suppose each of 3 raters produces 2 ratings per subject for a total of 6 ratings per subject. In this case, assume that the 6 ratings are from 2 "virtual" raters and 3 subjects. For example,

      Subject  RaterA  RaterB  RaterC
      1        a11     b11     c11
      1        a12     b12     c12

      This dataset should be seen as follows:

      Subject  VirtualR1  VirtualR2
      1        a11        a12
      2        b11        b12
      3        c11        c12

      If the number of ratings per subject (or the number of occasions) increases, then increase the number of virtual raters accordingly.

      Hope this helps.
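      The reshaping described above can be sketched in a few lines of R. The data and column names here are hypothetical; each of 3 raters scores every subject on 2 occasions, and the result is the 2-virtual-rater layout shown above:

```r
# occ1/occ2: one row per subject, one column per rater (hypothetical ratings)
occ1 <- matrix(c(1, 2, 1,
                 3, 3, 2), ncol = 3, byrow = TRUE,
               dimnames = list(NULL, c("RaterA", "RaterB", "RaterC")))
occ2 <- matrix(c(1, 2, 2,
                 3, 2, 2), ncol = 3, byrow = TRUE,
               dimnames = list(NULL, c("RaterA", "RaterB", "RaterC")))

# Each (subject, rater) pair becomes one row scored by 2 "virtual" raters,
# giving nrow(occ1) * 3 = 6 rows here
virtual <- cbind(VirtualR1 = as.vector(t(occ1)),
                 VirtualR2 = as.vector(t(occ2)))
# virtual can now be passed to, e.g., gwet.ac1.raw(virtual)
```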