Kappa Coefficient for Dummies
How to measure the agreement between two raters: a beginner's introduction to a widely used statistical technique
Imagine a city where everyone is color blind, and you own a company that manufactures balls of three different colors: Green, Blue and Red. Since everyone in your city is color blind, nobody can sort the balls into the right buckets. So, you decide to hire a person from another city who can identify the different colors. But then it strikes you that a single person could say anything and nobody could verify that he did his job right. So, you decide to hire two people instead of one. Since this is a new kind of job role, and in order to give them the right respect, you decide to call them Judges. Each judge checks every ball and puts it in the right color bucket.
Now, as the owner of the company, you want to evaluate how the judges are doing at their job. If both were experts with no biases, they would agree on the color of every ball, i.e. they would have zero disagreement.
We know that humans are prone to biases, and it is impossible to find a person with zero bias. So you are looking for a mechanism to measure the agreement between the judges. There is a catch, though: on many occasions they would have agreed just by chance.
You could simply calculate agreement by dividing the number of balls both judges agree on by the total number of balls, but this doesn't account for the agreement that happens by chance.
Cohen the Savior
With this problem you reach out to the great statistician Jacob Cohen. Cohen came up with a mechanism to calculate a value that represents the level of agreement between the judges after discounting the agreement that happens by chance. He called this value Kappa (𝜅).
In order to calculate Kappa, Cohen introduced two terms: the agreement that is actually observed, and the agreement that would be expected purely by chance.
Before we dive into how Kappa is calculated, let's take an example. Assume there were 100 balls, both judges agreed on a total of 75 balls, and 25 agreements would be expected to happen just by chance (we will soon uncover how to calculate this).
With the above data, Kappa 𝜅 can be written as
𝜅 = (75 - 25) / (100 - 25) = 50 / 75 ≈ 0.67
This works out to 0.67 as Kappa's level of agreement. You can see that the balls agreed on by chance are removed both from the number of agreed balls and from the total number of balls. And that is the whole intuition behind the Kappa value, aka the Kappa coefficient.
Now we can define Kappa more generically in terms of Pᴏʙsᴇʀᴠᴇᴅ and Pʙʏᴄʜᴀɴᴄᴇ:
𝜅 = (Pᴏʙsᴇʀᴠᴇᴅ - Pʙʏᴄʜᴀɴᴄᴇ) / (1 - Pʙʏᴄʜᴀɴᴄᴇ)
More formally, Kappa is a robust way to find the degree of agreement between two raters/judges where the task is to put N items into K mutually exclusive categories.
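Here is a minimal Python sketch of the formula, using the toy numbers from the example above (the variable names are just for illustration):

```python
# Toy numbers from the example above: 100 balls, 75 agreed on,
# 25 agreements expected purely by chance.
observed_agreements = 75
chance_agreements = 25
total_balls = 100

# Kappa in terms of counts: chance agreements are removed from both
# the numerator and the denominator.
kappa = (observed_agreements - chance_agreements) / (total_balls - chance_agreements)
print(round(kappa, 2))  # 0.67

# Equivalently, in terms of probabilities:
p_observed, p_bychance = 0.75, 0.25
kappa = (p_observed - p_bychance) / (1 - p_bychance)
print(round(kappa, 2))  # 0.67
```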
If you are with me till now, the rest of the read will be easy stuff :) just remember this intuition.
Unweighted or Cohen’s Kappa
Calculating observed probability
Let's take the above example and dive deeper into how the judges might have labelled the balls into the different color buckets.
We can represent the decisions of both the judges in this table.
Figure 1 outlines the matrix of the different labeling combinations given by the judges. For example, if we want to know how many balls there are where Judge 1 said "Green" and Judge 2 said "Blue", we can see from Figure 1 that it is 5 balls (last row, second column).
Figure 2 is the probability version of Figure 1. Now, if we want to calculate the observed probability from Figure 1, it can be written as
Pᴏʙsᴇʀᴠᴇᴅ = (number of balls both judges agreed on) / (total number of balls) = 75 / 100 = 0.75
We can also use the probability table, i.e. Figure 2, to calculate the observed probability; it is simply the sum of the agreement probabilities along the diagonal.
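As a sketch, this is how the observed probability could be computed from a counts table like Figure 1. The matrix below is hypothetical (the article's exact table is not reproduced here); it is only built to be consistent with the totals quoted in the text: 100 balls, 75 agreements, Judge 1 labeling 60 balls red and Judge 2 labeling 50 red, 30 green and 20 blue.

```python
import numpy as np

# Hypothetical counts: rows = Judge 1's label, columns = Judge 2's label,
# in the order [Red, Green, Blue]. Not the article's Figure 1, just an
# illustrative table consistent with the totals mentioned in the text.
counts = np.array([
    [50,  5,  5],   # Judge 1 said Red
    [ 0, 15,  5],   # Judge 1 said Green
    [ 0, 10, 10],   # Judge 1 said Blue
])

# Observed probability = balls both judges agreed on (the diagonal) / total balls.
p_observed = np.trace(counts) / counts.sum()
print(p_observed)  # 0.75
```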
Calculating probability by chance
If we look at the decisions made by each judge individually, we can find the probability of that judge labeling a ball a particular color. For example, let's take Judge 1's decisions.
The probability of Judge 1 deciding a ball is red would be 60/100 = 0.6. Similarly, for Judge 2's decisions:
The probability of Judge 2 deciding a ball is red would be 50/100 = 0.5. So, what is the expected probability of Judge 1 deciding red and Judge 2 also deciding red?
Assuming the two judges decide independently, it can be written as 0.6 × 0.5 = 0.3.
0.3 is the probability of the judges agreeing by chance that a ball is red. We can also calculate the number of balls they would have agreed on by chance as 0.3 × 100 = 30 balls. In the same way, we can calculate the other combinations, for example:
Probability of Judge 1 deciding red and Judge 2 deciding green = 0.6 × 0.3 = 0.18
Probability of Judge 1 deciding red and Judge 2 deciding blue = 0.6 × 0.2 = 0.12
We can represent the above probabilities in this table.
Now we can calculate the probability of agreeing by chance as the sum of the chance-agreement probabilities over all three colors:
Pʙʏᴄʜᴀɴᴄᴇ = P(both say red) + P(both say green) + P(both say blue)
So, we can calculate Kappa as
𝜅 = (Pᴏʙsᴇʀᴠᴇᴅ - Pʙʏᴄʜᴀɴᴄᴇ) / (1 - Pʙʏᴄʜᴀɴᴄᴇ)
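Continuing the earlier sketch (it reuses the hypothetical `counts` matrix and `p_observed` from above, so the printed numbers belong to that hypothetical table, not to the article's figures):

```python
# Marginal probabilities of each judge picking each color.
judge1_probs = counts.sum(axis=1) / counts.sum()   # e.g. P(Judge 1 says Red) = 0.6
judge2_probs = counts.sum(axis=0) / counts.sum()   # e.g. P(Judge 2 says Red) = 0.5

# Chance agreement for each color is the product of the two marginals
# (assuming the judges decide independently); P_bychance is the sum over colors.
p_bychance = (judge1_probs * judge2_probs).sum()

kappa = (p_observed - p_bychance) / (1 - p_bychance)
print(round(p_bychance, 2), round(kappa, 3))  # 0.4 0.583 for this hypothetical table
```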
Weighted Kappa
Weighted Kappa is used when the buckets/categories are ordinal in nature. Let's adapt the above example to understand this.
Consider that your company, instead of manufacturing balls of three different colors, manufactures balls of three different shades of red, namely "Dark red", "Red" and "Light red".
Initially, the agreement between the judges was binary in nature, i.e. they either fully agreed or fully disagreed. With shades of red, binary agreement is no longer enough. For example, let's consider these two scenarios.
Scenario 1: Judge 1 says "Dark red" and Judge 2 says "Red".
Scenario 2: Judge 1 says "Dark red" and Judge 2 says "Light red".
In both scenarios the judges disagree, but we can say that the disagreement in Scenario 2 is larger than the disagreement in Scenario 1. To handle this, we can define different levels of agreement, 0 being fully disagreed and 1 being fully agreed.
To make this more concrete, let's assume there are five shades of red: R1, R2, R3, R4, R5. Our task is to assign a value between 0 and 1 to each pair of red shades. To do that, let's define a variable called "Distance" as the absolute difference between the ordinal ranks of the two shades (for example, the distance between R1 and R3 is 2).
There are different ways by which we can define these weights.
Linear
In this case the weights are spread linearly between 0 and 1. They can be defined as
weight = 1 - (Distance / maximum possible Distance)
Quadratic
In this case the weights are spread non-linearly between 0 and 1. They can be defined as
weight = 1 - (Distance / maximum possible Distance)²
In our case the maximum possible distance is 4 (between R1 and R5). So, if we apply the above formulas, for distances of 0, 1, 2, 3 and 4 we get linear weights of 1, 0.75, 0.5, 0.25 and 0, and quadratic weights of 1, 0.9375, 0.75, 0.4375 and 0.
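Here is a small sketch of both weighting schemes (the helper name make_weights is just for illustration):

```python
import numpy as np

def make_weights(n_categories, kind="linear"):
    """Agreement weights for ordered categories: 1 on the diagonal, 0 at maximum distance."""
    ranks = np.arange(n_categories)
    distance = np.abs(ranks[:, None] - ranks[None, :])   # |rank_i - rank_j|
    max_distance = n_categories - 1
    if kind == "linear":
        return 1 - distance / max_distance
    return 1 - (distance / max_distance) ** 2             # quadratic

# Weights for the five shades R1..R5, shown for distances 0..4 (the first row):
print(make_weights(5, "linear")[0])     # 1, 0.75, 0.5, 0.25, 0
print(make_weights(5, "quadratic")[0])  # 1, 0.9375, 0.75, 0.4375, 0
```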
Coming back to our use case with three shades of red (maximum distance 2), we can define the agreement weight matrix in the same way: 1 on the diagonal, 0.5 (linear) or 0.75 (quadratic) for adjacent shades, and 0 for the "Dark red"/"Light red" pair.
Calculating weighted Kappa
If we follow the process described for the unweighted Kappa, we can calculate the probability tables for "observed" and "by chance".
To calculate the weighted Kappa, we simply multiply these probabilities by their corresponding weights. Now every cell in the matrix represents some level of agreement, not just the diagonal.
So, we calculate the weighted observed and by-chance probabilities as the sum of weight × probability over all cells, and plug them into the same Kappa formula as before.
Linear weighted Kappa
After multiplying by the linear weights and applying the Kappa formula, we get the linear weighted Kappa.
Quadratic weighted Kappa
After multiplying by the quadratic weights and applying the Kappa formula, we get the quadratic weighted Kappa.
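As a sketch under the same assumptions as before (it reuses the hypothetical `counts` matrix, now read as Dark red / Red / Light red, and the make_weights helper above):

```python
def weighted_kappa(counts, weights):
    """Weighted Kappa from a counts matrix and an agreement-weight matrix (1 = full agreement)."""
    probs_observed = counts / counts.sum()                  # observed probability table
    judge1_probs = probs_observed.sum(axis=1)
    judge2_probs = probs_observed.sum(axis=0)
    probs_bychance = np.outer(judge1_probs, judge2_probs)   # by-chance probability table

    p_observed_w = (weights * probs_observed).sum()
    p_bychance_w = (weights * probs_bychance).sum()
    return (p_observed_w - p_bychance_w) / (1 - p_bychance_w)

# Treating the hypothetical counts as Dark red / Red / Light red (ordinal, max distance 2):
print(weighted_kappa(counts, make_weights(3, "linear")))     # ≈ 0.63 for this hypothetical table
print(weighted_kappa(counts, make_weights(3, "quadratic")))  # ≈ 0.68 for this hypothetical table
```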
Usage
The two judges can also be a human and a computer, for example a human annotator and a model. In that case Kappa is a robust way to find the agreement between them while discounting agreement by chance. Kappa's maximum value is theoretically 1, reached when both judges make the same decision for every item. In practice, a Kappa score above 0.75 is considered very good. Wikipedia has a picture that draws a comparison of Kappa vs. accuracy.
A Kaggle competition that uses this metric: https://www.kaggle.com/c/aptos2019-blindness-detection/
Existing libraries
scikit-learn provides a function to calculate the Kappa score, cohen_kappa_score, in the sklearn.metrics module:
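A minimal usage sketch (the toy label lists are just for illustration); cohen_kappa_score computes the unweighted Kappa by default, and accepts weights="linear" or weights="quadratic" for ordinal categories:

```python
from sklearn.metrics import cohen_kappa_score

# Each judge's label for the same six balls (toy data, just for illustration).
judge1 = ["red", "red", "green", "blue", "green", "red"]
judge2 = ["red", "blue", "green", "blue", "green", "red"]
print(cohen_kappa_score(judge1, judge2))  # unweighted Cohen's Kappa

# For ordinal categories (e.g. shades coded 0 = Dark red, 1 = Red, 2 = Light red),
# pass a weighting scheme:
shades1 = [0, 1, 2, 2, 1, 0]
shades2 = [0, 2, 2, 1, 1, 0]
print(cohen_kappa_score(shades1, shades2, weights="linear"))
print(cohen_kappa_score(shades1, shades2, weights="quadratic"))
```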
I hope this is helpful!!