These R functions can be downloaded from the **r-functions page** of the agreestat website.

All these R functions handle missing values and cover several types of agreement coefficients, including Gwet's AC1/AC2 (2008, 2012), the kappa coefficients of Cohen (1960), Fleiss (1971), Conger (1980), and Brennan & Prediger (1981), Krippendorff's alpha (1970), and the percent agreement.
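For readers who want to see what is being computed, here is a minimal, self-contained sketch of Gwet's AC1 for the two-rater case, following the formula in reference [5]. The function name `ac1.table` is mine and is not part of the AgreeStat code, which works from raw ratings rather than a contingency table:

```r
# Gwet's AC1 for two raters, from a q x q table of classification counts
# (rows = rater 1's categories, columns = rater 2's categories).
ac1.table <- function(counts) {
  p <- counts / sum(counts)               # cell proportions
  pa <- sum(diag(p))                      # observed percent agreement
  pi.k <- (rowSums(p) + colSums(p)) / 2   # mean marginal proportion per category
  q <- nrow(counts)
  pe <- sum(pi.k * (1 - pi.k)) / (q - 1)  # percent chance agreement
  (pa - pe) / (1 - pe)
}
```

A table on which both raters always agree returns 1, and the coefficient can go negative when observed agreement falls below chance agreement.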

**Bibliography:**

**[1]** Brennan, R. L., and Prediger, D. J. (1981). "Coefficient Kappa: some uses, misuses, and alternatives." *Educational and Psychological Measurement*, **41**, 687-699.

**[2]** Cohen, J. (1960). "A coefficient of agreement for nominal scales." *Educational and Psychological Measurement*, **20**, 37-46.

**[3]** Conger, A. J. (1980). "Integration and generalization of kappas for multiple raters." *Psychological Bulletin*, **88**, 322-328.

**[4]** Fleiss, J. L. (1971). "Measuring nominal scale agreement among many raters." *Psychological Bulletin*, **76**, 378-382.

**[5]** Gwet, K. L. (2008). "Computing inter-rater reliability and its variance in the presence of high agreement." *British Journal of Mathematical and Statistical Psychology*, **61**, 29-48.

**[6]** Gwet, K. L. (2012). *Handbook of Inter-Rater Reliability (3rd ed.)*. Advanced Analytics, LLC, Maryland, USA.

**[7]** Krippendorff, K. (1970). "Estimating the reliability, systematic error, and random error of interval data." *Educational and Psychological Measurement*, **30**, 61-70.

Dear Kilem,

First, I would like to express my gratitude for preparing the R functions for computing the AC1 coefficient. It really saved our project. I would, however, like to ask whether there is also an R function for computing the AC2 coefficient. I found only the one for AC1 on your website.

Many thanks for your help. Best, Gregor

Ah, I just discovered the weight matrix... thanks anyway :)

Hello, Kilem!

My name is Gustavo Arruda. I'm from Brazil.

Congratulations on your contributions to the field of statistics.

I would like to know whether the modified kappa (Brennan & Prediger, 1981) can be used for a test-retest (intra-rater) analysis of an ordinal variable with three categories, using simple ordinal weights.

The modified kappa was originally developed for nominal variables, but AgreeStat allows it to be applied with a larger number of categories (without distinction).

Thank you!

Hi Gustavo,

Although the Brennan-Prediger coefficient is most often used for computing inter-rater reliability, there is nothing that would prevent you from using it to compute intra-rater reliability. The "raters" in this case would represent the ratings of the same subjects on different occasions. Other than that, everything else remains the same. If your ratings are ordinal, then you should certainly use ordinal weights to account for the partial agreement that some disagreements represent.
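As an illustration of the advice above, here is a small, self-contained sketch of a weighted Brennan-Prediger coefficient for a test-retest table with q ordered categories. The ordinal weights use the combinatorial definition w[k,l] = 1 - choose(|k-l|+1, 2)/choose(q, 2); the function name `bp.weighted` is mine, and for actual analyses you should rely on AgreeStat's own functions and weights.gen.r:

```r
# Weighted Brennan-Prediger coefficient from a q x q test-retest table
# (rows = first occasion, columns = second occasion).
bp.weighted <- function(counts) {
  q <- nrow(counts)
  # Ordinal agreement weights: 1 on the diagonal, decreasing with category distance
  w <- outer(1:q, 1:q, function(k, l) 1 - choose(abs(k - l) + 1, 2) / choose(q, 2))
  p <- counts / sum(counts)
  pa <- sum(w * p)     # weighted percent agreement
  pe <- sum(w) / q^2   # chance agreement: uniform over all q^2 cells
  (pa - pe) / (1 - pe)
}
```

Perfect test-retest agreement yields 1, while a table with ratings spread uniformly over all cells yields 0.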

Thanks

Kilem

Thank you for your attention!

And congratulations on your contributions!

Hello Kilem, a question:

We have multiple judges rating examinees, but one judge may have rated only 7 of them while others rated more than 100. I know it would be preferable for the judges to have rated similar numbers, and I read in one of your articles that such results cannot be interpreted in the same way. What would be your recommendation? We are performing the analysis with AC1.

Dr. Gwet,

I have your 3rd & 4th editions. Thank you very much! Also, thank you for creating the R code!

I have a 37 (item) x 16 (rater) matrix of ratings (1=essential, 2=useful, 3=not necessary) with some missing values. As I understand it, the appropriate measure of inter-rater reliability would be AC2. When I copy the R function from your website, run it, and then try to compute the AC2, I get the following error:

```
> gwet.ac1.raw(interrater.reliability)
Error in gwet.ac1.raw(interrater.reliability) :
  could not find function "identity.weights"
> gwet.ac1.raw(interrater.reliability, weights="unweighted")
Error in gwet.ac1.raw(interrater.reliability, weights = "unweighted") :
  could not find function "identity.weights"
```

I am not sure what the problem is. Could you give me some insight as to what I am doing wrong?

Thank you very much!

-Greg

Hi Greg,

Download the file http://www.agreestat.com/software/r/new/weights.gen.r, then load it into R with `source("C:\\Your_Directory\\weights.gen.r")`. After that you can use any function you want.

Dear Kilem.

I've used the functions provided in agree.coeff2.r (http://www.agreestat.com/software/r/new/agree.coeff2.r) to assess the agreement between two different questionnaires (without any gold standard) that classify a person as positive or negative for one condition according to its own score. Each participant is assessed once with each test in the same visit.

With Gwet's AC1 there is a lack of agreement between the questionnaires, which yields a negative coefficient (which I assume is part of the negative bias that you point out in reference [5]), but what really surprises me is a p-value well over 1.

This leads me to a dilemma: I don't know whether there is an error in the function, or how a p-value can exceed 1 and how such a value should be interpreted. Could you shed some light on this issue?

Thanks

Xavier

Here are the numbers.

A+B+ = 68, A+B- = 2, A-B+ = 267, A-B- = 146

```
Gwet's AC1/AC2 Coefficient
==========================
Percent agreement: 0.4430642       Percent chance agreement: 0.4869604
AC1/AC2 coefficient: -0.08556103   Standard error: 0.04785226
95% Confidence Interval: (-0.1795858, 0.008463783)
P-value: 1.9256
```

I have updated the R function agree.coeff2.r. It now produces the correct p-value even when the agreement coefficient is negative.

Thanks
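For readers wondering how a p-value can exceed 1 at all: one plausible reconstruction (my assumption, not the actual code) is that the two-sided p-value was computed from the signed t statistic instead of its absolute value, so a negative coefficient pushes the p-value toward 2. Using the numbers reported above, and assuming df = n - 1:

```r
coeff <- -0.08556103          # AC1 coefficient reported above
se    <- 0.04785226           # its standard error
n     <- 68 + 2 + 267 + 146   # 483 subjects in the 2 x 2 table
t.stat <- coeff / se

p.buggy <- 2 * (1 - pt(t.stat, df = n - 1))       # signed t: close to 2 here
p.fixed <- 2 * (1 - pt(abs(t.stat), df = n - 1))  # proper two-sided p-value
```

With these numbers p.buggy comes out near the reported 1.9256, while p.fixed is about 0.07; the two always add up to exactly 2 when t is negative.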

Hi Kilem

Can the AC1 coefficient be negative, and how should it be interpreted? We have found some negative values but are not sure how to present them.

Thanks

Kristina

Yes, AC1 can take a negative value. This indicates an absence of agreement among raters beyond chance.

Hi Dr. Gwet,

I am using your gwet.ac1.raw function from agree.coeff3.raw.r. When I use a dataset that contains some missing values (NA), the function returns NaN for the AC1 coefficient.

Example:

```
> testAC
      [,1] [,2] [,3]
 [1,]   NA   NA   NA
 [2,]    1   NA   NA
 [3,]   NA   NA   NA
 [4,]    1    1    1
 [5,]    1    1    1
 [6,]    1    1    1
 [7,]    1    1    1
 [8,]    1    2    1
 [9,]    1    2    1
[10,]    1    1    1
[11,]   NA   NA   NA
> gwet.ac1.raw(testAC)
Gwet's AC1 Coefficient
======================
Percent agreement: 0.8095238   Percent chance agreement: NaN
AC1 coefficient: NaN           Standard error: NaN
95% Confidence Interval: (NaN, NaN)
P-value: NaN
```

Hoping you can help. My understanding is that the code should work with missing values; it appears to work fine when there are none.

Thanks,

Dharmesh

Each row must contain at least one non-missing value. Rows containing only NA values must be deleted from the dataset before running the functions.
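This cleanup is a one-liner in R (the matrix below is a made-up example):

```r
ratings <- matrix(c(NA, 1, NA, 1,
                    NA, NA, NA, 1,
                    NA, NA, NA, 1), ncol = 3)  # rows 1 and 3 are entirely NA

# Keep only the rows with at least one non-missing rating
clean <- ratings[rowSums(!is.na(ratings)) > 0, , drop = FALSE]
```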

Thanks, that did solve the issue. However, the following dataset is causing issues for AC1 as well as for Krippendorff's alpha:

Example:

```
> testNoNA
     [,1] [,2] [,3]
[1,]    2   NA   NA
[2,]    2    2    2
[3,]    2    2    2
[4,]    2    2    2
[5,]    2    2    2
[6,]    2    2    2
[7,]    2    2    2
[8,]    2    2    2
> gwet.ac1.raw(testNoNA)
Gwet's AC1 Coefficient
======================
Percent agreement: 1   Percent chance agreement: NaN
AC1 coefficient: NaN   Standard error: NaN
95% Confidence Interval: (NaN, NaN)
P-value: NaN
> krippen.alpha.raw(testNoNA)
Error in (agree.mat * (agree.mat.w - 1)) %*% rep(1, q) :
  non-conformable arguments
```

Thanks,

Dharmesh

This particular dataset generates an error message mainly because it contains a single category, namely 2. A typical data table would normally show 2 categories or more: when you want to quantify the extent of agreement among raters, it is because the raters have the possibility of selecting different categories.

I understand very well that, despite the availability of 2 or more categories, 2 raters may well decide to assign all subjects to the exact same category (e.g., 2, as in your table). In this case AC1 equals 1 and its variance is 0. Krippendorff's alpha will likely be 0 (which is clearly inaccurate; this coefficient is known not to work well in such a scenario), and its variance will also be 0.

Hi Kilem,

I recently bought your book and find it very useful.

It is my understanding that we can compute intra-rater reliability from ordinal ratings produced by a single rater on two occasions for a number of subjects. We can achieve this by treating the ordinal ratings on the two occasions as coming from two independent raters and essentially computing measures of inter-rater agreement which reflect the ordinal nature of the ratings (e.g., AC1, generalized kappa).

What if we have n raters, each of whom rates the same subjects on two occasions? Is it fair to treat the n x 2 sets of ordinal ratings as coming from n x 2 independent raters and then apply measures of inter-rater agreement to them? I haven't seen anything in the literature explicitly covering the case of more than 2 raters when it comes to dealing with ordinal ratings, so I wanted to make sure I'm on the right track.

Hi Isabella,

I am sorry for taking so long to respond to your inquiry. Here is how I would go about your problem:

As far as computing intra-rater reliability goes, it is the number of times a subject is rated that must dictate the number of "virtual" raters to consider. That is, if a single rater produces 2 ratings per subject, then treat those ratings as coming from 2 "virtual" raters, each producing 1 rating per subject.

Now, suppose each of 3 raters produces 2 ratings per subject, for a total of 6 ratings per subject. In this case, treat the 6 ratings as coming from 2 "virtual" raters, with each original subject contributing 3 "virtual" subjects (one per actual rater). For example,

```
Subject  RaterA  RaterB  RaterC
      1     a11     b11     c11
      1     a12     b12     c12
```

This dataset should be seen as follows:

```
Subject  VirtualR1  VirtualR2
      1        a11        a12
      2        b11        b12
      3        c11        c12
```

If the number of ratings per subject (i.e., the number of occasions) increases, then increase the number of virtual raters accordingly.
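This reshaping is easy to script. Here is a sketch in base R, assuming the data are laid out as above with one row per subject-occasion; the column and variable names are mine:

```r
# One row per (subject, occasion); columns A, B, C are the three raters' ratings
dat <- data.frame(subject  = c(1, 1, 2, 2),
                  occasion = c(1, 2, 1, 2),
                  A = c("a11", "a12", "a21", "a22"),
                  B = c("b11", "b12", "b21", "b22"),
                  C = c("c11", "c12", "c21", "c22"),
                  stringsAsFactors = FALSE)

m1 <- as.matrix(dat[dat$occasion == 1, c("A", "B", "C")])  # occasion-1 ratings
m2 <- as.matrix(dat[dat$occasion == 2, c("A", "B", "C")])  # occasion-2 ratings

# Each (subject, rater) pair becomes one "virtual" subject seen by 2 virtual raters
virtual <- data.frame(VirtualR1 = as.vector(t(m1)),
                      VirtualR2 = as.vector(t(m2)),
                      stringsAsFactors = FALSE)
```

The resulting data frame has 6 rows (2 subjects x 3 raters) and can be fed directly to the two-rater agreement functions.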

Hope this helps.

Thanks

Dear Dr. Kilem,

Many thanks for your amazing contributions.

We are working on designing severity classification criteria based on an already validated and reliable scale. In the absence of a gold standard, we employed latent class analysis (LCA) to classify our cases using the best-fitting model. We then used two different combinations of criteria on the same scale to classify case severities (ordinal classes matching those generated by the LCA) and assessed the agreement of each set of criteria with the LCA classification.

Is AC1 the appropriate agreement coefficient to use in this context? And is there a way to assess whether the difference between the agreement coefficients (classification system 1 vs. LCA versus classification system 2 vs. LCA) is statistically significant?

Thank you,

Mahmoud Slim