Getting back to the basics of Bayes’ Theorem using Python.
Thomas Bayes and Bayesianism
Thomas Bayes was a rather obscure 18th-century English clergyman. It is not even certain when and where he was born, but it was around 1701 and possibly in Hertfordshire, just north of London. His only mark on history is the eponymous Bayes’ Theorem, but the name Bayesian is now used in many different areas, sometimes with only tenuous links to the original theorem.
This gives the impression that Bayesianism is a huge and complex field covering not just probability but extending into philosophy, computer science and beyond. In this article I will get back to the basics of the theorem, first by applying it to its “standard” example of medical tests, and then by writing a simple demonstration of its use in Python.
Bayes’ Theorem is basically a simple formula so let’s start by chalking it up.
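For reference, the theorem is usually written as below. A and B are generic placeholder events chosen for this sketch (with P(B) not equal to zero), not labels taken from the article's own figure.

```latex
P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}
```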
You may be familiar with the P(A) notation used in probability theory to denote the probability of a specific outcome or event, A. Probabilities are written as decimals between 0 and 1, and the probabilities of all the possible, mutually exclusive outcomes add up to 1. For example, if we know 1% of the population has a certain disease then P(ill) = 0.01, P(healthy) = 0.99 and 0.01 + 0.99 = 1.
The | symbol used in the formula extends the notation to indicate the probability of a certain outcome given an existing state, and the | can be read as “given”. If in the above example we assume a test is available for the disease then Bayes’ Theorem allows us to calculate the probability of a person having the disease given a positive test result.
You might assume that the probability of someone having a disease if they test positive is 1, and conversely the probability is 0 if they test negative. Unfortunately no medical test is perfect: some people with the disease will test negative and some people who do not have the disease will test positive. Even with a highly accurate test this can lead to some startlingly inaccurate results, as we will see.
Let’s make up a few fictitious numbers for an equally fictitious disease, just for demonstration purposes. We need to know the population and the percentage that has the disease. We also need a couple of numbers to describe the accuracy of the test: what percentage of people with the disease test positive, and what percentage of people who are healthy test negative. These are the sensitivity and specificity respectively.
Now let’s assume everyone has been tested and we have the following figures:
The sensitivity and specificity rates of 99% look impressive, but as you can see from the previous table the number of healthy people who wrongly tested positive (shown in bold) is exactly the same as the number of ill people who correctly tested positive (again shown in bold). Therefore if a person tests positive there is only a probability of 0.5 that they are actually ill.
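The counting argument can be written out as plain arithmetic. The population of 1,000,000 below is a hypothetical figure chosen only for illustration; the 1% prevalence and the 99% sensitivity and specificity are the values used in the text.

```latex
\begin{aligned}
N &= 1{,}000{,}000 \quad \text{(hypothetical population)} \\
\text{ill} &= 0.01 \times N = 10{,}000 \\
\text{healthy} &= N - 10{,}000 = 990{,}000 \\
\text{ill and positive} &= 0.99 \times 10{,}000 = 9{,}900 \\
\text{healthy and positive} &= 0.01 \times 990{,}000 = 9{,}900 \\
P(\text{ill} \mid \text{positive}) &= \frac{9{,}900}{9{,}900 + 9{,}900} = 0.5
\end{aligned}
```

The result does not depend on the population chosen, because N cancels out of the final fraction.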
Plugging the Numbers into the Formula
Using the process above we established the probability of a person testing positive actually having the disease. However, it was a messy process which can be simplified by using the formula for Bayes’ Theorem.
This is the theorem applied to our sample problem, which as you can see gives us the 0.5 result we are looking for.
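Written out with the 1% prevalence and 99% sensitivity from the example, the substitution looks roughly like this. The denominator 0.0198 is P(positive), which the following paragraphs explain how to calculate.

```latex
P(\text{ill} \mid \text{positive})
  = \frac{P(\text{positive} \mid \text{ill})\, P(\text{ill})}{P(\text{positive})}
  = \frac{0.99 \times 0.01}{0.0198}
  = 0.5
```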
The values above the line are straightforward, and come straight from our table of known data. However, the part below the line, P(positive), needs to be calculated from:
P(healthy) * P(positive|healthy) + P(ill) * P(positive|ill)
This gives us the overall probability of testing positive, irrespective of whether the subject is ill or healthy.
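With the example's numbers the calculation works out as follows, taking P(positive|healthy) as 1 minus the specificity, that is 1 - 0.99 = 0.01.

```latex
\begin{aligned}
P(\text{positive}) &= P(\text{healthy})\, P(\text{positive} \mid \text{healthy}) + P(\text{ill})\, P(\text{positive} \mid \text{ill}) \\
&= 0.99 \times 0.01 + 0.01 \times 0.99 \\
&= 0.0099 + 0.0099 = 0.0198
\end{aligned}
```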
Let’s Code It
We can stare at a (virtual) blackboard all day but to fully understand what’s going on it’s a good idea to implement the formula in code. This also gives us the opportunity to change values quickly and easily to see how this affects the outcome.
The code for this project is all in one short file called bayes.py which you can clone or download from the GitHub repository.
This is the source code in its entirety.
The main Function
Here we just create a few variables for the population and probabilities which are then passed to the two functions which calculate the probability of being ill if testing positive.
The calculate_without_bayes Function
In this function we calculate a few interim values from the specified population and probabilities, and then use them for our ultimate goal of finding the probability of being ill if testing positive.
All the values are then printed which gives an intuitive idea of the process, but this is a bit long-winded so in the next function we’ll do it the “correct” way using Bayes’ formula.
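A minimal sketch of how a counting-based function like this might look. Only the function name comes from the article; the parameter list, variable names and the population of 1,000,000 are assumptions made for illustration, and this is not the listing from bayes.py.

```python
def calculate_without_bayes(population, p_ill, sensitivity, specificity):
    """Estimate P(ill | positive) by counting people in each group.

    The parameter list is an assumption made for this sketch; the
    function in the original bayes.py may take different arguments.
    """
    # Split the population into ill and healthy people.
    ill = population * p_ill
    healthy = population - ill

    # Ill people who correctly test positive (true positives).
    ill_positive = ill * sensitivity
    # Healthy people who wrongly test positive (false positives).
    healthy_positive = healthy * (1 - specificity)

    # Of everyone who tested positive, what fraction is actually ill?
    p_ill_given_positive = ill_positive / (ill_positive + healthy_positive)

    print(f"Ill:                  {ill:>12,.0f}")
    print(f"Healthy:              {healthy:>12,.0f}")
    print(f"Ill and positive:     {ill_positive:>12,.0f}")
    print(f"Healthy and positive: {healthy_positive:>12,.0f}")
    print(f"P(ill|positive):      {p_ill_given_positive:>12.4f}")

    return p_ill_given_positive


if __name__ == "__main__":
    # Hypothetical population; the rates are the ones used in the article.
    calculate_without_bayes(1_000_000, 0.01, 0.99, 0.99)
```

Returning the probability as well as printing it makes it easy to compare this result with the one from the formula-based approach.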
The calculate_with_bayes Function
Firstly we need to calculate a couple more probabilities from those we already know: the probability of being healthy and the probability of testing positive if healthy. After doing this we can go ahead and implement Bayes’ Theorem.
The rest of the function is taken up with printing out the results, including the interim calculations.
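The steps described above could be sketched along these lines. Again, only the function name is taken from the article; the signature and variable names are assumptions, and the original bayes.py may differ.

```python
def calculate_with_bayes(p_ill, sensitivity, specificity):
    """Compute P(ill | positive) directly from Bayes' Theorem.

    The signature is an assumption made for this sketch, not the
    one used in the original bayes.py.
    """
    # Probabilities derived from those we already know.
    p_healthy = 1 - p_ill
    p_positive_given_healthy = 1 - specificity
    p_positive_given_ill = sensitivity

    # Overall probability of testing positive, ill or healthy.
    p_positive = (p_healthy * p_positive_given_healthy
                  + p_ill * p_positive_given_ill)

    # Bayes' Theorem: P(ill|positive) = P(positive|ill) * P(ill) / P(positive)
    p_ill_given_positive = p_positive_given_ill * p_ill / p_positive

    print(f"P(healthy):          {p_healthy:.4f}")
    print(f"P(positive|healthy): {p_positive_given_healthy:.4f}")
    print(f"P(positive):         {p_positive:.4f}")
    print(f"P(ill|positive):     {p_ill_given_positive:.4f}")

    return p_ill_given_positive


if __name__ == "__main__":
    # 1% prevalence, 99% sensitivity and specificity, as in the article.
    calculate_with_bayes(0.01, 0.99, 0.99)
```

Note that no population is needed here: the formula works entirely with probabilities.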
Let’s Run It
Now we can go ahead and run the program with this command:
The output is
You might want to experiment with different sensitivities and specificities. The 99% figures I used are actually very high and many real-world medical tests are much less accurate, which, as you have probably realised, means that the chances of a person having a disease if they test positive can be very low.
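One way to run such an experiment is to loop over a few accuracy levels while keeping the 1% prevalence fixed. The helper name `posterior` and the accuracy values below are hypothetical choices for this sketch, not part of the original program.

```python
def posterior(p_ill, sensitivity, specificity):
    """Return P(ill | positive) using Bayes' Theorem."""
    # Overall probability of a positive result, ill or healthy.
    p_positive = p_ill * sensitivity + (1 - p_ill) * (1 - specificity)
    return p_ill * sensitivity / p_positive


if __name__ == "__main__":
    p_ill = 0.01  # prevalence used in the article
    # Illustrative accuracy levels, applied to both sensitivity and specificity.
    for accuracy in (0.99, 0.95, 0.90, 0.80):
        result = posterior(p_ill, accuracy, accuracy)
        print(f"accuracy {accuracy:.2f} -> P(ill|positive) = {result:.4f}")
```

Changing `p_ill` as well shows how strongly the result depends on how common the disease is in the first place.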
So does this mean that mass testing or screening of patients even if they have no symptoms is too inaccurate to be worthwhile? This is really a matter of opinion, but if you hear of or have personal experience of misdiagnoses then please bear in mind Thomas Bayes and his theorem.
This article was originally published on codedrome.com