Tuesday, September 6, 2016

A t-test for correlated agreement coefficients and application with the R package

Researchers must often compare two groups of raters with respect to the extent to which they agree on the rating of the same group of subjects. The extent of agreement among raters of the same group can also be measured on two occasions (e.g. before and after a training session), in order to assess the effectiveness of training in improving inter-rater reliability. An agreement coefficient must then be calculated twice. The traditional statistical approach for testing whether the difference is statistically significant is to divide that difference by its standard error and to compare the resulting ratio (i.e. the test statistic) to a critical value (often 1.96). If the absolute value of the test statistic exceeds the critical value, then one may conclude that the difference is statistically significant. Alternatively, the p-value may be calculated, and statistical significance concluded when it falls below 0.05. However, calculating the variance of the difference can sometimes become problematic.
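When the two coefficients are computed on two independent groups of subjects, this calculation is straightforward. Here is a minimal R sketch of that standard test; the coefficient values and standard errors below are made-up numbers used only for illustration:

```r
# Made-up coefficient estimates and standard errors, for illustration only.
coeff1 <- 0.62; se1 <- 0.05   # e.g. agreement before training
coeff2 <- 0.74; se2 <- 0.04   # e.g. agreement after training

# Standard error of the difference -- valid only if the two estimates are
# independent (i.e. no shared subjects):
se.diff   <- sqrt(se1^2 + se2^2)
test.stat <- (coeff2 - coeff1) / se.diff
p.value   <- 2 * pnorm(-abs(test.stat))
c(statistic = test.stat, p.value = p.value)
```

This simple variance formula breaks down as soon as the two estimates share subjects, which is the situation discussed next.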

If the two groups of raters (or the same group of raters observed on two occasions) rate the exact same group of subjects, then any agreement coefficient used (e.g. Fleiss' generalized kappa, Gwet's AC1, Conger's generalized kappa, the Brennan-Prediger coefficient, or Krippendorff's alpha) will produce two correlated estimates, making the calculation of the variance of the difference very difficult due to the embedded correlation structure. Gwet (2016) proposed the linearization method to resolve this problem. This approach consists of using a linear approximation to the agreement coefficient to develop the equivalent of a paired t-test. Users of the R package may use the R functions that I developed to implement the linearization method for testing the difference between two correlated agreement coefficients for statistical significance.
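To illustrate the idea behind the linearization method (this is only a sketch, not the implementation used in my R functions), consider the Brennan-Prediger coefficient: its chance-agreement probability of 1/q does not depend on the data, so each subject contributes a simple linearized component, and the comparison of the two coefficients reduces to a paired t-test on those subject-level components. The data below are simulated purely for illustration:

```r
# Paired t-test on subject-level linearized components of the
# Brennan-Prediger coefficient, for the same subjects rated on two occasions.
set.seed(123)
n <- 40; r <- 5; q <- 3                      # subjects, raters, categories
true.cat <- sample(1:q, n, replace = TRUE)   # latent category of each subject

# Simulate one n x r rating matrix: each rater reports the subject's latent
# category with probability p.agree, and a random category otherwise.
simulate.ratings <- function(p.agree) {
  noise <- matrix(sample(1:q, n * r, replace = TRUE), n, r)
  keep  <- matrix(runif(n * r) < p.agree, n, r)
  ifelse(keep, matrix(true.cat, n, r), noise)
}
ratings1 <- simulate.ratings(0.6)            # occasion 1 (e.g. before training)
ratings2 <- simulate.ratings(0.8)            # occasion 2 (e.g. after training)

# Linearized component of the Brennan-Prediger coefficient for subject i:
# (pa.i - 1/q) / (1 - 1/q), where pa.i is the proportion of agreeing rater
# pairs among the r raters on subject i. The mean of these components equals
# the Brennan-Prediger coefficient itself.
bp.components <- function(ratings) {
  pa.i <- apply(ratings, 1, function(x) {
    counts <- table(x)
    sum(counts * (counts - 1)) / (r * (r - 1))
  })
  (pa.i - 1 / q) / (1 - 1 / q)
}

# Same subjects on both occasions => correlated coefficients, so the test of
# their difference is a paired t-test on the subject-level components.
t.test(bp.components(ratings2), bp.components(ratings1), paired = TRUE)
```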

See more details on Kudos.

Bibliography:
Gwet, K. L. (2016). Testing the difference of correlated agreement coefficients for statistical significance. Educational and Psychological Measurement, 76(4), 609-637.

3 comments:

  1. Hi,
    I am comparing the inter-rater agreement among 5 raters before and after an intervention on the same 40 subjects. Do I compare rater 1's ratings before the intervention to rater 1's ratings after the intervention, and so on for each rater? How do I then combine the 5 comparisons (one for each rater) to decide whether the intervention improves inter-rater agreement? How do I generate a final p-value for the 5 comparisons?
    Thanks
    Hythem

  2. Hi,
    No, you don't do a pairwise analysis; you should perform a global comparison. Check the AgreeTest app using the following link: https://agreestat.net/agreetest/. The test datasets included there will show you how this analysis should be done.

  3. Thanks a lot. This was very helpful.
