K. Gwet's Inter-Rater Reliability Blog: February 2014
Inter-rater reliability: Cohen kappa, Gwet AC1/AC2, Krippendorff Alpha

I just finished reading the book "Introduction to Many-Facet Rasch Measurement" by Thomas Eckes. In this book, Eckes argues that the classical approach to inter-rater reliability, which consists of training the raters and measuring their extent of agreement until it reaches an acceptable level, does not really work. The reason is that no matter how much training the raters receive, they will still not be interchangeable. A residual intrinsic disagreement will remain among the raters, with some of them being more stringent than others in their approach to rating.

The solution that Eckes proposes is to develop statistical models that describe the different facets of the inter-rater reliability experiment, such as the rater facet, the subject facet and possibly other facets. These statistical models are then used to adjust the ratings so that the subjects, assumed to be human beings, get a fair test. This adjustment is supposed to avoid penalizing the subjects who were unlucky enough to be rated by the more severe raters.
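To make the idea of a severity adjustment concrete, here is a minimal sketch in Python. It is not the many-facet Rasch model described in the book. It is a much simpler additive adjustment, used here only as an illustration: each rater's severity is estimated as the gap between the grand mean and that rater's mean score, and that gap is added back to the rater's scores. The function name `adjust_for_severity`, the ratings and the subject and rater labels are all hypothetical, and the sketch assumes a fully crossed design in which every rater scores every subject.

```python
from statistics import mean


def adjust_for_severity(ratings):
    """Additive severity adjustment (illustrative only, not a Rasch model).

    ratings: dict mapping (subject, rater) -> score.
    Assumes a fully crossed design: every rater scores every subject.
    """
    grand_mean = mean(ratings.values())
    raters = {rater for (_, rater) in ratings}

    # Severity: how far below the grand mean a rater scores on average.
    # Positive values indicate a severe rater, negative values a lenient one.
    severity = {
        rater: grand_mean
        - mean(score for (_, r), score in ratings.items() if r == rater)
        for rater in raters
    }

    # Adjusted score: add each rater's severity back to that rater's scores.
    adjusted = {
        (subject, rater): score + severity[rater]
        for (subject, rater), score in ratings.items()
    }
    return severity, adjusted


# Hypothetical data: rater B scores every subject 2 points lower than rater A.
ratings = {
    ("s1", "A"): 4.0, ("s2", "A"): 6.0, ("s3", "A"): 8.0,
    ("s1", "B"): 2.0, ("s2", "B"): 4.0, ("s3", "B"): 6.0,
}
severity, adjusted = adjust_for_severity(ratings)
```

With these made-up ratings, rater B comes out as the severe rater, and after adjustment both raters give each subject the same score. The choice of an additive model is itself an assumption, and a different model would adjust the ratings differently.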

I must say I liked this book very much for the way the author describes the different issues associated with an inter-rater reliability experiment. His presentation of these issues is very instructive and is done with considerable clarity. That alone justifies the investment of time and money one can make in this book. However, I have always been somewhat skeptical about the use of theoretical statistical models for the purpose of making important practical decisions, especially decisions involving human subjects. As a matter of fact, even if the raters introduce some bias in the ratings, two statisticians will probably not recommend the same statistical model either. Using these models to adjust the ratings may only add a statistician bias that compounds the rater bias, producing an outcome that can hardly be seen as more reliable. Statistical models can always help the researcher gain more insight into a reality with powerful modelling tools, but they cannot and should not be seen as an expression of that reality. Nevertheless, this book is remarkably well written, and should certainly be useful to anyone interested in the topic of inter-rater reliability.

Bibliography
[1] Eckes, Thomas. (2011). Introduction to Many-Facet Rasch Measurement. Peter Lang, ISBN: 978-3-631-61350-4.

Tuesday, February 25, 2014

Inter-rater reliability and Many-Facet Rasch Measurement