Measurements in Environmental Epidemiology
Causality in Environmental Epidemiology
Earlier, we learned about cause and effect in Environmental Epidemiology. We learned that in Epidemiology we can discuss cause and effect in four different ways: using the notions of chance, bias, and confounding, and using criteria such as Hill’s criteria (which are viewpoints rather than criteria). We learned that causes are always multiple; in other words, many causal variables act together, and conceptually these variables can interact with each other. Finally, we learned to represent causal variables using directed acyclic graphs. To move from concepts to models, we now need to collect data and understand what data we need, and how to use them, to make sense of these concepts. Hence, in this chapter, we will learn about measurements in Epidemiology.
A Brief Tour of Julia (side note)
In order to work with measurements in Epidemiology, we will also use a software tool where we can perform some calculations and estimations. We will use the free and open software Julia for this purpose. You can learn about Julia and download a copy of Julia from the following website:
You can install Julia on your own computer. To do so, you first download the Julia downloader by visiting this page:
After you download and start Julia, you will see a screen like the following:
You can see it is an empty screen. We will write code here and run it. For our work, we will use Julia packages. You can install a Julia package by typing the following code at the empty prompt above:
using Pkg; Pkg.add("Pluto")
Write the above code on a single line. When you do so, Julia will install the package Pluto locally on your computer. Pluto is a package written in Julia that lets you write interactive code in a way that is intuitive to work with. It also lets you write text in a format known as “markdown”. If you are not familiar with markdown, you may want to refer to the following page and learn about it:
Essentially, markdown lets you write plain text in any editor and then convert that plain text to other formats using pandoc, or to HTML for viewing on a webpage. For our purposes, Pluto does this automatically when you press “Shift Enter”, so you will not need any other special software.
Once Julia has installed the Pluto package, you will return to the Julia prompt. There, type using Pluto; Pluto.run()
next to the prompt. Julia will then start Pluto and open a webpage on which you can work; it will look like the following:
Once you see this screen, do the following:
- Click on Create a new notebook
- This will open a page like the following:
Once you have a page like the one above, you are ready, and we will run our measurements and calculations there.
Back to Epidemiological Measurements
When we talk about epidemiological measurements, we talk about two types of measurements: (1) measurements of health states or disease frequencies, and (2) measurements of cause and effect. These are expressed in the form of rates and ratios, as we will explore. In terms of disease frequencies, we have three main measures: (1) incidence proportion (risk), (2) incidence rate (aka incidence density), and (3) prevalence.
Before we proceed, we need to discuss whether the population you are studying is a closed or an open population. For instance, let’s take a look at the following schematic diagram of a population:
The figure above shows how a closed population may look in a schematic diagram. We have 25 circles, and each circle represents 100 people, so there are 2500 people in the block. We have defined it as a closed population because we have specified that NO ONE IS ALLOWED to enter this population once we have fixed it, and the only way people can LEAVE it is if they die or migrate out of it. So, for a closed population, you will see a plot somewhat like the following:
A closed population is also referred to as a cohort. What kind of population might be a closed population? For example, all people born in the year 1966 (or any other specified year) form a cohort. This type of cohort is referred to as a birth cohort. In Environmental Epidemiological studies, you will often come across references to birth cohorts or cohort studies, usually in the context of retrospective or prospective cohort studies. We will discuss them in the study design part of this course.
For now, if you want, you can follow the Pluto page set up for this tutorial:
https://arinscodes.neocities.org/epidemiology
Prevalence
Our first measure is the measure of prevalence. The formula for prevalence is given as:
prevalence = (x / N ) * B
where x = number of people with a health condition
N = Total number of people
B = Baseline, usually 10,000
Prevalence is a simple measure of proportion. Imagine a county with a population of 10,000 in which 150 people have Type 2 Diabetes: the prevalence is 150 per 10,000 population. As simple as that. The way to estimate prevalence is to conduct a cross-sectional survey. Take this cross-sectional study of ADHD in Nairobi [1]. The survey was conducted among 240 children aged 6–12 years who attended a particular hospital in Kenya, and the researchers identified 15 children with symptoms of ADHD (see the following figure):
You can see that the overall prevalence of ADHD in this sample is about 6.3% (15/240), and that the prevalence is higher in boys than in girls. We are next going to study incidence.
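The prevalence formula can be tried out directly in Julia. The following is a minimal sketch; the function name `prevalence` and the variable names are chosen for this illustration only, and the numbers are the county and Nairobi figures quoted above.

```julia
# Prevalence per B people: (x / N) * B
# `prevalence` is a helper name made up for this sketch.
prevalence(x, N; B = 10_000) = (x / N) * B

# County example: 150 cases in a population of 10,000
county = prevalence(150, 10_000)

# Nairobi ADHD sample: 15 of 240 children, expressed as a percentage (B = 100)
adhd_pct = prevalence(15, 240; B = 100)

println("County prevalence per 10,000: ", round(county, digits = 1))
println("ADHD prevalence in the sample (%): ", round(adhd_pct, digits = 2))
```

In a Pluto notebook a cell holds a single expression, so several lines like these need to be wrapped in `begin ... end` or split across cells.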
Incidence refers to the number of new cases of a disease (“condition”) over a given period of time among people who were initially free of the condition. This introduces the concept of person-time (see the following figure)
As the above figure shows, if we follow 1000 people for five years each, this constitutes 5000 person-years. If no one had the condition when we started, and some people develop it during follow-up, then at the end of five years we express the number of new cases over 5000 person-years rather than over 1000 people. This is the concept of the incidence rate (incidence density). Cumulative incidence (the incidence proportion, or risk), by contrast, divides the number of new cases by the number of people at risk at the start.
Consider the study by Engler et al. (2015) [2], who studied people without any features of myocarditis (a type of heart disease), all of whom received immunisation against smallpox and influenza. The study’s primary objective was to determine the prospective incidence of new onset cardiac symptoms, clinical and possible subclinical MP (myocarditis/pericarditis) in temporal association with immunization. The authors presented their data as follows:
As only one year's worth of data was considered, the person-time in this case is 1_390_352 persons times 1 year = 1_390_352 person-years, and 30 MP cases were identified in the Healthy-2002 cohort. This gives an incidence rate of 30/1_390_352 person-years, or 2.16 cases per 100_000 person-years. The formula is as follows:
incidence = ( number of new cases / (Persons followed up * Time Period) ) * B
Where B = Base population usually fixed at 100_000
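The same arithmetic can be checked in Julia. This is a small sketch using the figures quoted above (30 cases, 1_390_352 persons, 1 year); the function name `incidence_rate` is an assumption made for this example.

```julia
# Incidence rate per B person-years:
# (new cases / (persons followed up * time period)) * B
# `incidence_rate` is a helper name made up for this sketch.
incidence_rate(new_cases, persons, years; B = 100_000) =
    new_cases / (persons * years) * B

# Person-time for the Healthy-2002 cohort: everyone counted for one year
person_years = 1_390_352 * 1

# 30 MP cases over that person-time
rate = incidence_rate(30, 1_390_352, 1)

println("Person-years: ", person_years)
println("Incidence rate per 100,000 person-years: ", round(rate, digits = 2))
```

Changing `B` only changes the scale on which the rate is reported; the underlying ratio of cases to person-time stays the same.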
These simple measures are excellent for understanding (a) the proportion of people with a condition in a single population or (b) the force of a disease as it spreads in the community. Consider COVID-19. If you visit the worldometer website, here:
You can see that as of 8th March, the following statistics were shown for the number of new cases and deaths
The words “new” cases and deaths are giveaways that they are presenting data on incidence, although, as you can see, these are raw numbers, not incidence rates as such. I will leave the estimation of incidence rates (also referred to as attack rates when it comes to infections) to you. But you can see the utility of this information: it shows a timeline of the spread of infection across the world, and you can see how reported cases of COVID-19 (SARS-CoV-2 infection) continue to vary over time.
Age-specific and age-standardised incidence
Two other issues. When we study incidence and prevalence, it is common to compare them within the same population across various age groups. Rates broken down in this way are referred to as age-specific incidence rates (or age-specific prevalence). Here is an example of age-specific cancer rates in the United States obtained from SEER*Stat; see the site first:
And here’s the table of data
And here you can see the corresponding barplot of the rates per age group:
So, as you can see, and as expected, cancer rates are very high in the older age groups. This is an example of age-specific rates, but it does not give us the full picture. If we want to make these rates comparable with those of other countries, we need to standardise them. This is done by a process referred to as standardisation. To standardise, we use a standard population and pro-rate the age-specific rates to that population. Here is a standard population of the world
This is how we get to the standardised rates of all cancer from the age-specific rates
- Take the age-specific rate
- Then multiply the age-specific rate by the standard population for that specific age
- Add up the numbers and divide by the total standard population to derive the age-standardised rate
Using the above two populations, this is how we are going to derive the age-standardised rate for all cancers for the United States
Let’s do this in Julia. The following is the Julia code to get the standardised rates
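A minimal sketch of direct standardisation is given here. The age bands, rates and standard-population counts are hypothetical numbers invented for illustration (they are not the SEER rates or the world standard population), and the sum is divided by the total standard population so that the result stays on the per-100,000 scale.

```julia
# Direct age standardisation with HYPOTHETICAL data.
# None of these numbers come from SEER or a published standard population.
age_bands      = ["0-39", "40-64", "65+"]
rates_per_100k = [50.0, 500.0, 2000.0]      # age-specific rates (made up)
standard_pop   = [60_000, 30_000, 10_000]   # standard population counts (made up)

# Steps 1 and 2: multiply each age-specific rate by the standard population of that band
weighted = rates_per_100k .* standard_pop

# Step 3: add up and divide by the total standard population
age_standardised_rate = sum(weighted) / sum(standard_pop)

for (band, w) in zip(age_bands, weighted)
    println(band, ": weighted contribution = ", w)
end
println("Age-standardised rate per 100,000: ", age_standardised_rate)
```

With real data, the two vectors would hold the age-specific rates of the country of interest and the chosen standard population, in the same age-band order.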
Once you obtain the standardised rates, you are able to compare them with the rates from other countries, and indeed across time periods.
So, in summary, we have covered the following four measures of disease occurrence:
- We learned that prevalence is in essence a proportion of people with a condition among all people
- Incidence is the rate at which a disease develops, so it is the rate of new cases over all susceptible people for a time period (hence the need for person-time)
- Age-specific rates are useful for comparing the different age bands
- Age-standardised rates enable us to compare rates either across time periods or between places
Measures of Association in Epidemiology
Measures of association in Epidemiology are essentially different forms of ratios. We will discuss the following:
- Rate Ratios and Relative Risk from prospective studies or longitudinal studies
- Odds and Odds ratios
We can compare two incidences to get the relative risk, or risk ratio. We can also get the absolute risk, or risk difference, which tells us how much higher the incidence is because of a risk factor such as an environmental exposure. Let’s take a look.
Example from a fictitious study
Suppose you’d like to study the association between exposure to non-green urban spaces and heart disease risk (this is a toy example). You have decided to include “prospective” (going forward) data from 1000 people who have been exposed to green rural living and equally 1000 people who have lived in cities and “non-green” urban city spaces; after several years of study, you find the following (in the beginning of the study, none of the participants had reported any heart related illness):
- Incidence of heart disease among the non-green group: 20 per 1000
- Incidence of heart disease among the green group: 10 per 1000
Further, imagine that, on average, 60% of people live in cities and are not exposed to green living. What sense do we make of this?
The first point that we’d like to discuss is whether city living is associated with heart disease. There are two ways to think about it: in relative terms and in absolute terms. Let’s start with the relative risk, which is given by the following formula:
relative risk = Incidence among the exposed / Incidence among the non-exposed
In this case, the relative risk is the incidence of heart disease among the city dwellers divided by the incidence among the green-space dwellers: RR = Ie/I0 = 20/10 = 2. We may say that city dwelling is associated with twice the risk of heart disease compared with non-urban green living. The relative risk tells us about the strength of the association, but not about the size of the excess risk. For the risk difference, we ask: what is the excess risk of heart disease among the city dwellers compared with those in the green-living group? In our case the risk difference is 20 - 10 = 10 per thousand people. Not a huge difference, but it becomes more interesting when we look at what is referred to as Attributable Risk. Attributable risk tells us what proportion of the heart disease risk among city dwellers can be attributed to city dwelling as opposed to green living, and is given by:
Attributable risk = (Ie - I0 )/ Ie
Alternatively,
Attributable Risk = (RR - 1) / RR
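The relative risk, risk difference and attributable risk for the toy study can be worked through in Julia. This is a sketch using the fictitious counts from the example (20 and 10 cases among 1000 people each); all variable names are chosen for this illustration.

```julia
# Toy cohort: fictitious counts from the example above
cases_exposed,   n_exposed   = 20, 1000   # city / non-green group
cases_unexposed, n_unexposed = 10, 1000   # green-living group

Ie = cases_exposed / n_exposed            # incidence among the exposed
I0 = cases_unexposed / n_unexposed        # incidence among the non-exposed

relative_risk   = Ie / I0                 # RR = Ie / I0
risk_difference = (Ie - I0) * 1000        # excess cases per 1000 people
attributable    = (Ie - I0) / Ie          # attributable risk among the exposed
attributable_rr = (relative_risk - 1) / relative_risk   # same quantity via RR

println("Relative risk: ", relative_risk)
println("Risk difference per 1000: ", round(risk_difference, digits = 1))
println("Attributable risk: ", round(attributable, digits = 2))
println("Attributable risk via RR: ", round(attributable_rr, digits = 2))
```

Both attributable-risk expressions give the same value, which is a quick check that the two formulas are equivalent.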
In our toy case, RR = 2.0, so the attributable risk is (2 - 1)/2 = 0.5; that is, 50% of the heart disease risk among city dwellers is attributed to city living. It becomes really interesting, though, when we consider its public health importance. This is expressed as the population attributable fraction, given by the following formula:
PAF = pe * (RR - 1) / (1 + pe * (RR - 1) )
Here pe is the prevalence of exposure in the population, so this measure takes into account how common the exposure is. In other words, (1) if we were able to show a cause and effect association between the exposure and the outcome we are studying, and (2) if we were also able to measure the prevalence of exposure in the community, then we could estimate how much of the health outcome we would prevent if we removed the exposure in the first place. What would that mean in our situation?
At the outset, we stated that in our case about 60% of people lived in cities, or at least in non-green spaces. If all of them were moved over to green living, by how much might we predict that heart disease would decline from this fact alone (mind you, we’d still need to be able to show a cause and effect association)? Using the above formula with pe = 0.60 and RR = 2, we get 0.60 / (1 + 0.60) = 0.375; in other words, we might bring down the rate of heart disease by roughly 38% if we were able to move people from city living to a more green living. This is a sizeable reduction, and it is why the population attributable fraction is interesting.
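The population attributable fraction can be computed from the formula PAF = pe * (RR - 1) / (1 + pe * (RR - 1)). The sketch below plugs in the toy values stated at the outset, a 60% prevalence of exposure and RR = 2; the function name `paf` is an assumption made for this example.

```julia
# Population attributable fraction:
# PAF = pe * (RR - 1) / (1 + pe * (RR - 1))
# `paf` is a helper name made up for this sketch.
paf(pe, RR) = pe * (RR - 1) / (1 + pe * (RR - 1))

pe = 0.60   # prevalence of exposure (60% live in non-green city spaces)
RR = 2.0    # relative risk from the toy cohort

fraction = paf(pe, RR)   # 0.6 / 1.6

println("PAF: ", round(fraction, digits = 3))
println("PAF (%): ", round(100 * fraction, digits = 1))
```

Trying other values of `pe` shows how the same relative risk matters more for public health when the exposure is common.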
Odds and Odds Ratio
We will round off with a discussion of odds and odds ratios. Let’s say we want to study whether second hand smoking is a risk factor for heart disease. We decide to study 100 patients with heart disease and 100 healthy people without heart disease, and we take their histories of exposure to second hand smoking. Second hand smoking (also called passive smoking or environmental tobacco smoke) is when your partner or other people in your house smoke cigarettes or tobacco and you are exposed to the smoke. After completing the study, we were able to construct the following table:
Again, as before, a toy example, a fictitious example I made up. Essentially, what we have here are as follows:
- Among the 100 people who had heart disease in our sample, 60 said that they were exposed to second hand smoking (“SHS”). This tells us that there was a 60% probability of their being exposed to SHS; this also means that there was a 40% probability that they were not exposed to SHS. This in turn tells us that their odds of being exposed, as opposed to not being exposed, to SHS were 60 against 40, or 60/40. This is their Odds of Exposure if they had the disease.
- Among those who did not have the disease, the odds were 20 against 80; in other words, if they did not have the disease, the Odds of Exposure were 20/80
- If we now take the ratio of the ODDS of EXPOSURE among those who had the disease to the odds among those who did not have the disease, this constitutes an Odds Ratio. In this case, the odds ratio is found by dividing 60/40 by 20/80, that is (60/40) / (20/80) = 6.0, a rather large Odds Ratio.
Once you set up a two by two table such as the above, the Odds Ratio becomes a matter of finding a cross-product as follows:
In this case, the odds of exposure for the diseased group, or those with the condition (mind you, it can be any health condition), are given as A/C, and the odds of exposure for the “no disease” or no health condition group are B/D, so the Odds Ratio becomes (A * D) / (B * C), the cross-product ratio
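The odds ratio for the toy second hand smoking study can be computed both ways in Julia. This sketch assumes the cell layout implied by the text: A and C are the exposed and unexposed people with the disease, and B and D are the exposed and unexposed people without it.

```julia
# Two by two table for the fictitious SHS study.
# Assumed layout: A, C = exposed / unexposed with disease;
#                 B, D = exposed / unexposed without disease.
A, B = 60, 20
C, D = 40, 80

odds_diseased    = A / C    # odds of exposure among those with heart disease
odds_no_disease  = B / D    # odds of exposure among those without

odds_ratio       = odds_diseased / odds_no_disease
odds_ratio_cross = (A * D) / (B * C)   # cross-product form

println("Odds of exposure, diseased: ", odds_diseased)
println("Odds of exposure, no disease: ", odds_no_disease)
println("Odds ratio: ", odds_ratio)
println("Odds ratio (cross-product): ", odds_ratio_cross)
```

The two results agree, which confirms that the ratio of the two odds and the cross-product are the same calculation written differently.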
Conclusion
We will visit these issues again when we discuss prospective cohort studies, retrospective cohort studies and case control studies in the study design section of this series. For now, we have covered relative risk and odds ratios, and we have learned that we can use these measures to assess public health significance.
— — —
[1] Wamithi, S., Ochieng, R., Njenga, F.G., Akech, S.O., & Macharia, W.M. (2015). Cross-sectional survey on prevalence of attention deficit hyperactivity disorder symptoms at a tertiary care health facility in Nairobi. Child and Adolescent Psychiatry and Mental Health, 9.
[2] Engler, R., Nelson, M.R., Collins Jr., L.C., Spooner, C.E., Hemann, B.A., Gibbs, B.T., Atwood, J.E., Howard, R.S., Chang, A.S., Cruser, D.L., Gates, D.G., Vernalis, M., Lengkeek, M.S., McClenathan, B.M., Jaffe, A.S., Cooper, L.T., Black, S., Carlson, C., Wilson, C.B., & Davis, R.L. (2015). A Prospective Study of the Incidence of Myocarditis/Pericarditis and New Onset Cardiac Symptoms following Smallpox and Influenza Vaccination. PLoS ONE, 10.