Monday, February 22, 2021

Testing the Difference Between 2 Agreement Coefficients for Statistical Significance

Researchers who use chance-corrected agreement coefficients such as Cohen's Kappa, Gwet's AC1 or AC2, Fleiss' Kappa, and many other alternatives often need to compare two coefficients calculated from 2 different sets of ratings. A rigorous way to make such a comparison is to test the difference between these 2 coefficients for statistical significance. This issue is discussed extensively in my paper entitled Testing the Difference of Correlated Agreement Coefficients for Statistical Significance. AgreeTest, a cloud-based application, can help you perform the techniques discussed in this paper and more. Do not hesitate to check it out when you find time.

The 2 sets of ratings used to compute the agreement coefficients under comparison may be totally independent or may have several aspects in common. Here are 2 possible scenarios you may encounter in practice:

  • Both datasets of ratings were produced by 2 independent samples of subjects and 2 independent groups of raters.  In this case, the 2 agreement coefficients associated with these datasets are said to be uncorrelated. Their difference can be tested for statistical significance with an Unpaired t-Test (also implemented in AgreeTest).    
  • Both datasets of ratings were produced either by 2 overlapping samples of subjects or 2 overlapping groups of raters, or both.  In this case, the 2 agreement coefficients associated with these datasets are said to be correlated. Their difference can be tested for statistical significance with a Paired t-Test (also implemented in AgreeTest). A minimal sketch of both tests is shown after this list.
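To make the two scenarios concrete, here is a minimal Python sketch of the underlying calculation, assuming you have already obtained the two coefficients and their standard errors (from AgreeTest or any other tool). The function name diff_test and the numbers in the example are mine, and the sketch uses a large-sample normal approximation rather than the t statistic with estimated degrees of freedom described in the paper; in the correlated case, the covariance term would have to be estimated with the methods covered there.

```python
from math import sqrt
from scipy import stats

def diff_test(coeff1, se1, coeff2, se2, cov=0.0):
    """Test H0: the two agreement coefficients are equal.

    coeff1, coeff2 : the two chance-corrected agreement coefficients
                     (e.g. Gwet's AC1, Fleiss' Kappa) computed elsewhere.
    se1, se2       : their standard errors.
    cov            : covariance between the two coefficients; leave at 0
                     when the 2 sets of ratings are fully independent
                     (unpaired case), supply an estimate when subjects
                     and/or raters overlap (paired case).

    Returns the test statistic and a two-sided p-value based on a
    large-sample normal approximation.
    """
    var_diff = se1**2 + se2**2 - 2.0 * cov   # variance of the difference
    stat = (coeff1 - coeff2) / sqrt(var_diff)
    p_value = 2.0 * stats.norm.sf(abs(stat))
    return stat, p_value

# Example with made-up numbers: two AC1 estimates and their standard errors
stat_u, p_u = diff_test(0.67, 0.08, 0.52, 0.09)              # unpaired
stat_p, p_p = diff_test(0.67, 0.08, 0.52, 0.09, cov=0.003)   # paired
print(f"unpaired: z = {stat_u:.3f}, p = {p_u:.4f}")
print(f"paired:   z = {stat_p:.3f}, p = {p_p:.4f}")
```

Setting cov to 0 reduces the paired formula to the unpaired one, which is why a single function covers both scenarios; the hard part in practice is estimating that covariance, which is what the paper and AgreeTest take care of.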
Several researchers have successfully used these statistical techniques in their research.  Here is a small sample of these publications:

Tuesday, February 16, 2021

New peer-reviewed article

Many statistical packages have implemented the wrong variance equation for Fleiss' generalized kappa (Fleiss, 1971). SPSS and the R package "rel" are among them. I recently published in "Educational and Psychological Measurement" an article entitled "Large-Sample Variance of Fleiss Generalized Kappa." I show in this article that it is not Fleiss' variance equation that is wrong; rather, it is the way it has been used. Fleiss' variance equation was developed under the assumption of no agreement among raters and for the sole purpose of being used in hypothesis testing. It does not quantify the precision of Fleiss' generalized kappa and cannot be used for constructing confidence intervals either.
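To illustrate the distinction, here is a minimal Python sketch based on the standard Fleiss (1971) estimator and its large-sample variance derived under the hypothesis of no agreement (the function name and the example counts are made up). The resulting standard error is only meant to feed the z-test of that hypothesis; it should not be recycled into a confidence interval around kappa.

```python
import numpy as np
from scipy import stats

def fleiss_kappa_h0_test(counts):
    """Fleiss' generalized kappa with the Fleiss (1971) test of no agreement.

    counts : (N, q) array; counts[i, j] = number of raters who classified
             subject i into category j.  Every row must sum to the same
             number of raters n.

    Returns (kappa, se0, z, p_value), where se0 is the standard error
    obtained under the hypothesis of no agreement beyond chance; it is
    valid for this z-test only, not as a measure of kappa's precision.
    """
    counts = np.asarray(counts, dtype=float)
    N, _ = counts.shape
    n = counts[0].sum()                      # raters per subject

    p = counts.sum(axis=0) / (N * n)         # category proportions p_j
    P_i = (np.sum(counts**2, axis=1) - n) / (n * (n - 1))
    Pa = P_i.mean()                          # observed agreement
    Pe = np.sum(p**2)                        # chance agreement
    kappa = (Pa - Pe) / (1 - Pe)

    # Fleiss (1971) large-sample variance under H0: no agreement among raters
    q = 1 - p
    s = np.sum(p * q)
    var0 = 2.0 / (N * n * (n - 1)) * (s**2 - np.sum(p * q * (q - p))) / s**2
    se0 = np.sqrt(var0)

    z = kappa / se0
    p_value = 2.0 * stats.norm.sf(abs(z))
    return kappa, se0, z, p_value

# Example with 4 subjects, 3 raters, 3 categories (made-up counts)
counts = [[3, 0, 0],
          [1, 2, 0],
          [0, 1, 2],
          [0, 0, 3]]
kappa, se0, z, p = fleiss_kappa_h0_test(counts)
print(f"kappa = {kappa:.3f}, se0 = {se0:.3f}, z = {z:.2f}, p = {p:.4f}")
```

Using se0 to build an interval such as kappa ± 1.96*se0 is exactly the misuse described above: that standard error says nothing about how precisely kappa itself is estimated.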