Monday, March 31, 2014

Some R functions for calculating chance-corrected agreement coefficients

Several researchers have expressed interest in R functions that can compute the chance-corrected agreement coefficients, their standard errors, confidence intervals, and p-values described in my book Handbook of Inter-Rater Reliability (3rd ed.).  I have finally found the time to write these R functions, which can be downloaded from the r-functions page of the agreestat website.

All of these R functions handle missing values and cover several types of agreement coefficients, including Gwet's AC1/AC2 (2008, 2012), the kappa coefficients of Cohen (1960), Fleiss (1971), Conger (1980), and Brennan & Prediger (1981), Krippendorff's alpha (1970), and the percent agreement.
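
Here is a minimal sketch of the intended workflow; the file names weights.gen.r and agree.coeff3.raw.r and the function gwet.ac1.raw are those referenced in the comments below, and the ratings matrix is made up for illustration.

# Load the weight functions first, then the agreement functions for raw ratings
# (adjust the paths to wherever the downloaded files were saved).
source("weights.gen.r")
source("agree.coeff3.raw.r")

# Made-up example: 6 subjects rated by 3 raters, with one missing value.
ratings <- matrix(c(1, 1, 1,
                    2, 2, 2,
                    1, 2, 1,
                    3, 3, 3,
                    2, 2, NA,
                    1, 1, 2),
                  nrow = 6, ncol = 3, byrow = TRUE)

# Unweighted AC1; pass weights = "ordinal" (or a custom weight matrix) to obtain
# the weighted coefficient AC2 when the categories are ordered.
gwet.ac1.raw(ratings)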

Bibliography:

[1] Brennan, R. L., and Prediger, D. J. (1981). "Coefficient kappa: some uses, misuses, and alternatives." Educational and Psychological Measurement, 41, 687-699.
[2] Cohen, J. (1960). "A coefficient of agreement for nominal scales." Educational and Psychological Measurement, 20, 37-46.
[3] Conger, A. J. (1980). "Integration and generalization of kappas for multiple raters." Psychological Bulletin, 88, 322-328.
[4] Fleiss, J. L. (1971). "Measuring nominal scale agreement among many raters." Psychological Bulletin, 76, 378-382.
[5] Gwet, K. L. (2008). "Computing inter-rater reliability and its variance in the presence of high agreement." British Journal of Mathematical and Statistical Psychology, 61, 29-48.
[6] Gwet, K. L. (2012). Handbook of Inter-Rater Reliability (3rd ed.). Advanced Analytics, LLC, Maryland, USA.
[7] Krippendorff, K. (1970). "Estimating the reliability, systematic error, and random error of interval data." Educational and Psychological Measurement, 30, 61-70.


28 comments:

  1. Dear Kilem,

    First, I would like to express my gratitude for preparing the R functions for computing the AC1 coefficient; they really saved our project. I would, however, like to ask whether there is also an R function for computing the AC2 coefficient? I found only the one for AC1 on your website.

    Many thanks for your help & best, Gregor

    ReplyDelete
  2. Ah, I just discovered the weight matrix... thanks anyway :)

    ReplyDelete
  3. Hello, Kilem!

    My name is Gustavo Arruda. I'm from Brazil.

    Congratulations on your contributions to the field of statistics.

    I would like to know whether the modified kappa (Brennan-Prediger, 1981) can be used for a test-retest (intra-rater) analysis of an ordinal variable with three categories, using simple ordinal weights.
    The modified kappa was originally developed for nominal variables, although it can be run with a larger number of categories (without distinction among them) in AgreeStat.
    Thank you!

    ReplyDelete
  4. Hi Gustavo,
    Although the Brennan-Prediger coefficient is most often used for computing inter-rater reliability, there is nothing preventing you from using it to compute intra-rater reliability. The "raters" in this case would represent the ratings of the same subjects on different occasions; other than that, everything else remains the same. If your ratings are ordinal, then you should certainly use ordinal weights to account for the partial agreement that some disagreements represent, as in the sketch below.
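
    A minimal sketch of the test-retest setup might look like this, assuming the Brennan-Prediger function in agree.coeff3.raw.r is called bp.coeff.raw (check the script for the exact name) and using made-up ratings:

    # Made-up test-retest data: each row is a subject; the two columns hold the
    # ratings assigned by the same rater on occasions 1 and 2 (ordinal, 3 categories).
    retest <- matrix(c(1, 1,
                       2, 3,
                       3, 3,
                       2, 2,
                       1, 2),
                     nrow = 5, ncol = 2, byrow = TRUE)

    # Treat the two occasions as two "virtual" raters and request ordinal weights.
    bp.coeff.raw(retest, weights = "ordinal")   # assumed function name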

    Thanks

    Kilem

    ReplyDelete
  5. Thank you for your attention!
    And congratulations on your contributions!

    ReplyDelete
  6. Hello Kilem, a question:
    We have multiple judges rating examinees, but one judge may have rated only 7 of them while others rated more than 100. I know similar numbers of ratings per judge would normally be expected, and I read in one of your articles that such results cannot be interpreted in the same way. What would be your recommendation? We are performing the analysis with AC1.

    ReplyDelete
  7. Dr. Gwet,

    I have your 3rd & 4th editions. Thank you very much! Also, thank you very much for creating the R code!

    I have a 37 (item) x 16 (rater) matrix of ratings (1=essential, 2=useful, 3=not necessary) with some missing values. As I understand it, the appropriate measure of inter-rater reliability would be AC2. When I copy the R function from your website, run it, and then try to compute the AC2, I get the following error:

    > gwet.ac1.raw(interrater.reliability)
    Error in gwet.ac1.raw(interrater.reliability) :
    could not find function "identity.weights"

    > gwet.ac1.raw(interrater.reliability,weights="unweighted")
    Error in gwet.ac1.raw(interrater.reliability, weights = "unweighted") :
    could not find function "identity.weights"

    I am not sure what I am doing wrong; could you give me some insight?

    Thank you very much!

    -Greg

    ReplyDelete
  8. Hi Greg,
    Download the file http://www.agreestat.com/software/r/new/weights.gen.r, then load it in R using source("C:\\Your_Directory\\weights.gen.r"). Now you can use any function you want.
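
    For Greg's case, the full sequence might look like the following sketch (paths as in the instruction above; requesting ordinal weights yields the weighted coefficient AC2 for the ordered essential/useful/not-necessary scale):

    # Load the weight functions before the agreement functions so that
    # identity.weights and the other weight generators are available.
    source("C:\\Your_Directory\\weights.gen.r")
    source("C:\\Your_Directory\\agree.coeff3.raw.r")

    # interrater.reliability is the 37 x 16 matrix of ratings from the question above;
    # ordinal weights give AC2 instead of the unweighted AC1.
    gwet.ac1.raw(interrater.reliability, weights = "ordinal")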

    ReplyDelete
  9. Dear Kilem.

    I've used the functions provided in agree.coeff2.r (http://www.agreestat.com/software/r/new/agree.coeff2.r) to assess the agreement between two different questionnaires (without any gold standard) that classify a person as positive or negative for a condition according to their own scores. Each participant is assessed once with each test during the same visit.

    With Gwet's AC1 there is a lack of agreement between the questionnaires, which yields a negative coefficient (which I assume reflects the negative bias that you point out in reference 5), but what really surprises me is a p-value well above 1.

    This leads me to a dilemma: I don't know whether there is an error in the function, how a p-value can exceed one, or how such a p-value should be interpreted. Could you shed some light on this issue?

    Thanks

    Xavier


    Here are the numbers.

    A+B+ = 68
    A+B- = 2
    A-B+ = 267
    A-B- = 146


    Gwet's AC1/AC2 Coefficient
    ==========================
    Percent agreement: 0.4430642 Percent chance agreement: 0.4869604
    AC1/AC2 coefficient: -0.08556103 Standard error: 0.04785226
    95 % Confidence Interval: ( -0.1795858 , 0.008463783 )
    P-value: 1.9256

    ReplyDelete
  10. I have updated the R function agree.coeff2.r. It now produces the correct p-value even when the agreement coefficient is negative.

    Thanks

    ReplyDelete
    Replies
    1. Hi Kilem

      Can the AC1 coefficient be negative, and how should it be interpreted? We have found some negative values but are not sure how to present them.

      Thanks
      Kristina

      Delete
    2. Yes, AC1 can take a negative value. This indicates an absence of agreement among raters beyond chance.

      Delete
  11. Hi Dr. Gwet,

    I am using your gwet.ac1.raw function in agree.coeff3.raw.r. When I use a dataset which contains some missing values (NA), the function returns NaN for the AC1 coefficient.

    Example:
    > testAC
          [,1] [,2] [,3]
     [1,]   NA   NA   NA
     [2,]    1   NA   NA
     [3,]   NA   NA   NA
     [4,]    1    1    1
     [5,]    1    1    1
     [6,]    1    1    1
     [7,]    1    1    1
     [8,]    1    2    1
     [9,]    1    2    1
    [10,]    1    1    1
    [11,]   NA   NA   NA
    > gwet.ac1.raw(testAC)
    Gwet's AC1 Coefficient
    ======================
    Percent agreement: 0.8095238 Percent chance agreement: NaN
    AC1 coefficient: NaN Standard error: NaN
    95 % Confidence Interval: ( NaN , NaN )
    P-value: NaN

    Hoping you can help. My understanding is the code should work for missing values. It appears to work fine when there are no missing values.

    Thanks,
    Dharmesh

    ReplyDelete
  12. Each row must contain at least one non-missing value. Rows with only NA values must first be deleted from the dataset before executing the functions.
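
    For example, a one-line base-R filter along these lines removes the all-NA rows (using the testAC matrix from the previous comment):

    # Keep only the rows that contain at least one non-missing rating.
    testAC.clean <- testAC[rowSums(!is.na(testAC)) > 0, , drop = FALSE]
    gwet.ac1.raw(testAC.clean)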

    ReplyDelete
  13. Thanks, that did solve the issue. However, I'm now trying the following dataset, which causes issues for both AC1 and Krippendorff's alpha:

    Example:
    > testNoNA
         [,1] [,2] [,3]
    [1,]    2   NA   NA
    [2,]    2    2    2
    [3,]    2    2    2
    [4,]    2    2    2
    [5,]    2    2    2
    [6,]    2    2    2
    [7,]    2    2    2
    [8,]    2    2    2
    > gwet.ac1.raw(testNoNA)
    Gwet's AC1 Coefficient
    ======================
    Percent agreement: 1 Percent chance agreement: NaN
    AC1 coefficient: NaN Standard error: NaN
    95 % Confidence Interval: ( NaN , NaN )
    P-value: NaN
    > krippen.alpha.raw(testNoNA)
    Error in (agree.mat * (agree.mat.w - 1)) %*% rep(1, q) :
    non-conformable arguments

    Thanks,
    Dharmesh

    ReplyDelete
  14. This particular data set generates an error message mainly because it contains a single category, namely 2. A typical data table would normally show 2 or more categories: when you want to quantify the extent of agreement among raters, it is because the raters have the possibility of selecting different categories.

    I understand very well that, despite the availability of 2 or more categories, 2 raters may well decide to assign all subjects to the exact same category (e.g. 2, as in your table). In this case AC1 equals 1 and its variance is 0. Krippendorff's alpha will likely be 0 (which is clearly inaccurate; this coefficient is known not to work well in such a scenario), and its variance will also be 0.

    ReplyDelete
  15. Hi Kilem,

    I recently bought your book and find it very useful.

    It is my understanding that we can compute intra-rater reliability from ordinal ratings produced by a single rater on two occasions for a number of subjects. We can achieve this by treating the ordinal ratings on the two occasions as coming from two independent raters and essentially computing measures of inter-rater agreement which reflect the ordinal nature of the ratings (e.g., AC1, generalized kappa).

    What if we have n raters, each providing ratings on the same subjects on two occasions? Is it fair to assume that the n x 2 sets of ordinal ratings come from n x 2 independent raters and then apply measures of inter-rater agreement to them? I haven't seen anything in the literature explicitly covering the case of more than 2 raters when it comes to dealing with ordinal ratings, so I wanted to make sure I'm on the right track.

    ReplyDelete
    Replies
    1. Hi Isabella,

      I am sorry for taking so long to respond to your inquiry. Here is how I would approach your problem:

      As far as computing intra-rater reliability goes, it is the number of times a subject is rated that dictates the number of "virtual" raters to consider. That is, if a single rater produces 2 ratings per subject, then assume both ratings come from 2 "virtual" raters, each producing 1 rating per subject.

      Now, suppose each of 3 raters produces 2 ratings per subject, for a total of 6 ratings per subject. In this case, assume that the 6 ratings come from 2 "virtual" raters, with each (subject, rater) pair treated as a separate subject, so that each original subject contributes 3 rows. For example,

      Subject  RaterA  RaterB  RaterC
         1      a11     b11     c11
         1      a12     b12     c12

      This dataset should be seen as follows:

      Subject  VirtualR1  VirtualR2
         1        a11        a12
         2        b11        b12
         3        c11        c12

      If the number of ratings per subject (i.e., the number of occasions) increases, then increase the number of virtual raters accordingly; the sketch below shows the rearrangement in R.
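
      A minimal sketch of this rearrangement; occ1 and occ2 are hypothetical subject-by-rater matrices holding the ratings from occasions 1 and 2 (same subjects and raters, in the same order), and gwet.ac1.raw is the function from agree.coeff3.raw.r:

      # Hypothetical example: 2 subjects rated by 3 raters on 2 occasions.
      occ1 <- matrix(c(1, 2, 1,
                       3, 3, 2), nrow = 2, byrow = TRUE)  # occasion 1: subjects x raters
      occ2 <- matrix(c(1, 2, 2,
                       3, 2, 2), nrow = 2, byrow = TRUE)  # occasion 2: same layout

      # Each (subject, rater) pair becomes one row; the 2 occasions become the
      # 2 "virtual" raters, giving 2 x 3 = 6 rows and 2 columns.
      virtual <- cbind(VirtualR1 = as.vector(occ1), VirtualR2 = as.vector(occ2))
      gwet.ac1.raw(virtual)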

      Hope this helps.
      Thanks

      Delete
  16. Dear Dr. Kilem,

    Many thanks for your amazing contributions.

    We are working on designing severity classification criteria. In the absence of a gold standard, we used latent class analysis (LCA) to classify our cases with the best-fitting model, and then assessed the agreement of two classification systems (with ordinal classes matching those generated by the LCA) with the latent classes. Is AC1 the appropriate agreement coefficient to use in this context? And how can we assess whether the difference between the agreement coefficients (classification system 1 vs. LCA versus classification system 2 vs. LCA) is statistically significant?

    Thank you,

    Mahmoud Slim

    ReplyDelete
  17. Dear Dr. Kilem,

    Many thanks for your amazing contributions.

    We are working on designing severity classification criteria based on an already validated and reliable scale. In the absence of a gold standard, we used latent class analysis (LCA) to classify our cases with the best-fitting model. We then applied two different combinations of criteria from the same scale to classify case severity (with ordinal classes matching those generated by the LCA) and assessed the agreement of each set of criteria with the LCA classification.
    Is AC1 the appropriate agreement coefficient to use in this context? And is there a way to assess whether the difference between the agreement coefficients (classification system 1 vs. LCA versus classification system 2 vs. LCA) is statistically significant?

    Thank you,

    Mahmoud Slim

    ReplyDelete
  18. Hi, I was using the irrCAC package from CRAN (v1.0), and I noticed that if there is perfect agreement, gwet.ac1.table returns an AC1 estimate of NaN while gwet.ac1.raw returns an AC1 estimate of 1. I believe the latter is correct. It looks like gwet.ac1.raw defines pe in these cases as (1 - 1e-15), whereas gwet.ac1.table does not have that check.
    > library(irrCAC)
    > ratdat<- matrix(rep(2,40),nrow=20,ncol=2)
    > gwet.ac1.table(table(ratdat))$coeff.val
    [1] NaN
    > gwet.ac1.raw(ratdat)$est$coeff.val
    [1] 1

    ReplyDelete
  19. Hi,
    You are right, and thank you for bringing this issue to my attention. I will fix it in the package as soon as possible.

    ReplyDelete
  20. Dr. Gwet,

    I've been going through your 2016 paper on testing the difference between correlated agreement coefficients, as well as the associated R code "paired t-test for agreement coefficients.r". I have a situation with three novice raters (group 1) and three expert raters (group 2). They rate the same set of subjects, and I want to know whether the methods discussed in the paper and the R function ttest.ac2 apply to testing whether the two groups have the same AC1. To quote the paper: "The proposed methods are general and versatile, and can be used to analyze correlated coefficients between overlapping groups of raters, or between two rounds of ratings produced by the same group of raters on two occasions." This isn't quite my situation, since there are no overlapping raters.

    Thanks,
    Matt

    ReplyDelete
  21. Hi Matt,
    The methods discussed are quite general and can be applied to 2 groups of raters, whether they overlap or not. Traditional methods were only applicable to non-overlapping groups; the problem was with groups that overlap, so I extended these traditional methods to cover overlapping groups. Non-overlapping groups can still be analyzed with these methods.

    Thanks

    ReplyDelete
    Replies
    1. Makes sense. Thanks for the clarification. Matt

      Delete
  22. Dr Gwet,

    I am analysing some reliability data in R. I can't work out the correct format for entering category labels (not all of the labels are used in some of the ratings).

    I want to specify the (ordinal) labels 1,2,3

    This does not work: gwet.ac1.raw(Data, weights = "ordinal", categ.labels = 1,2,3).

    Could you advise on the correct format? I've tried formatting categ.labels in various ways.

    Thank you

    Julie

    ReplyDelete
    Replies
    1. Found it! The correct format is categ.labels=c(1,2,3)
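
      So the full call, combining the weights and labels arguments from the question above, is:

      gwet.ac1.raw(Data, weights = "ordinal", categ.labels = c(1, 2, 3))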

      Delete