Performance Measures: Cohen's Kappa statistic

Cohen’s kappa statistic is a very useful but under-utilised metric. In machine learning we are sometimes faced with a multi-class classification problem. In those cases, measures such as accuracy or precision/recall do not provide a complete picture of the classifier's performance.

In other cases we might face a problem with imbalanced classes. For example, we have two classes, say A and B, and A shows up only 5% of the time. Accuracy can be misleading here, since a classifier that always predicts B is already 95% accurate, so we turn to measures such as precision and recall. There are ways to combine the two, such as the F-measure, but the F-measure has no very intuitive interpretation beyond being the harmonic mean of precision and recall.
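To make the imbalance problem concrete, here is a minimal sketch in plain Python. The function name `binary_metrics`, the toy data, and the convention of returning 0.0 when a ratio is undefined are all assumptions made for illustration, not part of any particular library.

```python
def binary_metrics(y_true, y_pred, positive):
    """Return (accuracy, precision, recall, f_measure) for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    # Assumption: report 0.0 when a denominator is zero.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # The F-measure is the harmonic mean of precision and recall.
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    return accuracy, precision, recall, f_measure


# Toy data: class A makes up 5% of 100 examples.
y_true = ["A"] * 5 + ["B"] * 95
always_b = ["B"] * 100  # a classifier that never predicts A

acc, prec, rec, f1 = binary_metrics(y_true, always_b, positive="A")
print(acc, prec, rec, f1)
```

The always-B classifier scores 95% accuracy while its precision, recall and F-measure for class A are all zero, which is the gap the paragraph above describes.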

Cohen’s kappa statistic is a very good measure that handles both multi-class and imbalanced-class problems well.

Cohen’s kappa is defined as:

$\kappa = \frac{p_o - p_e}{1 - p_e}$

where $p_o$ is the observed agreement and $p_e$ is the expected agreement. It tells you how much better your classifier performs than a classifier that simply guesses at random according to the frequency of each class.
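The definition can be sketched in a few lines of plain Python. The function name `cohen_kappa`, the toy data, and the choice to return NaN when the expected agreement equals 1 are assumptions made for this illustration.

```python
from collections import Counter


def cohen_kappa(y_true, y_pred):
    """Cohen's kappa for two equal-length sequences of class labels."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(y_true)

    # Observed agreement: fraction of items where prediction equals truth.
    p_o = sum(1 for t, p in zip(y_true, y_pred) if t == p) / n

    # Expected agreement: probability that two independent random guessers,
    # one following the true class frequencies and one following the
    # predicted class frequencies, pick the same class.
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    p_e = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)

    if p_e == 1.0:
        # Assumption: kappa is undefined (0/0) when only one class ever occurs.
        return float("nan")
    return (p_o - p_e) / (1 - p_e)


# Imbalanced toy data: class A is 5% of 100 examples.
y_true = ["A"] * 5 + ["B"] * 95
always_b = ["B"] * 100                                   # never predicts A
some_skill = ["A"] * 3 + ["B"] * 2 + ["A"] * 1 + ["B"] * 94  # finds 3 of 5 A's

kappa_always_b = cohen_kappa(y_true, always_b)
kappa_some_skill = cohen_kappa(y_true, some_skill)
print(kappa_always_b, kappa_some_skill)
```

The always-B classifier is 95% accurate but gets a kappa of 0, because its observed agreement equals the agreement expected by chance. The same function works unchanged for more than two classes, since it only counts labels.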

Cohen’s kappa is always less than or equal to 1. Values of 0 or less indicate that the classifier does no better than guessing according to class frequencies, i.e. it is useless. There is no standardized way to interpret its values, but Landis and Koch (1977) provide one way to characterize them. According to their scheme, a value < 0 indicates no agreement, 0–0.20 slight, 0.21–0.40 fair, 0.41–0.60 moderate, 0.61–0.80 substantial, and 0.81–1 almost perfect agreement.
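The Landis and Koch bands can be written as a small lookup. This is only a sketch: the function name `landis_koch_label` is hypothetical, and because the published bands leave small gaps (for example between 0.20 and 0.21), the code assumes each band's upper bound is inclusive and assigns anything above it to the next band.

```python
import math


def landis_koch_label(kappa):
    """Map a kappa value to the Landis and Koch (1977) description."""
    if math.isnan(kappa) or kappa > 1:
        raise ValueError("kappa must be a number less than or equal to 1")
    if kappa < 0:
        return "no agreement"
    # (inclusive upper bound, label); gap handling is an assumption.
    bands = [
        (0.20, "slight"),
        (0.40, "fair"),
        (0.60, "moderate"),
        (0.80, "substantial"),
        (1.00, "almost perfect"),
    ]
    for upper, label in bands:
        if kappa <= upper:
            return label


for value in (-0.1, 0.0, 0.35, 0.65, 0.9):
    print(value, landis_koch_label(value))
```

Since there is no standardized interpretation, labels like these are best read as a rough guide rather than a verdict.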

Cohen’s kappa is provided by many software packages and libraries, such as caret, Weka and scikit-learn. So, next time you face a problem with imbalanced classes or a multi-class classification problem, give it a go! In the meantime, if you want to read about another interesting metric, this time for regression, make sure to check my article about the concordance correlation coefficient.

References

Landis, J.R.; Koch, G.G. (1977). “The measurement of observer agreement for categorical data”. Biometrics 33 (1): 159–174.
