Environmental and Occupational Epidemiology, Part I

Arindam Basu
Environment, Epidemiology, Climate
Jul 14, 2015

Summary
Here we review the principles of environmental epidemiology. We begin with a definition of epidemiology: the study of the distribution and determinants of health-related states, and the application of that knowledge to improve health and prevent illness in populations. We then show how to measure disease states in populations (we will cover prevalence, incidence, and standardised rates). Once we have an idea about the measurement of disease states, we discuss the determinants of diseases (valid association and causal association), with an emphasis that in our case the determinants lie in our physical environment and in our workplaces. We then describe ways to measure such associations.

Definition of Environmental and Occupational Epidemiology

Epidemiology is defined as the study of distribution and determinants of health related states and the application of that information to improve health and prevent illnesses in populations.

This is an excerpt from the entry for the term “Epidemiology” in the Dictionary of Epidemiology. The citation for the Dictionary of Epidemiology is here:

Let’s talk about how we can measure the three components:

  • Distribution of diseases or health states in populations. — Three measures of disease distribution are rates, ratios, and proportions. Incidence is a rate used to measure the force of disease in a population over time. Prevalence is the proportion of people with existing disease, and standardised ratios (standardised mortality or morbidity ratios) are measures of prevalence or incidence for populations that differ in some characteristic, for example age or gender distribution. The aim of using standardised ratios is to estimate the measurements against a standard population so that they can be compared.
  • Determinants of diseases/health states in populations. — In environmental or occupational health, the “determinants” are entities present in our physical environment or occupational (job) setting that are causally associated with diseases. We measure associations of exposures with diseases using rate ratios or relative risk estimates for prospective studies such as intervention studies or cohort studies; for case control studies, we use the odds ratio. Using case control studies, we compare the odds of exposure for people with the disease (cases) and those who do not have the disease (controls).
  • Use of epidemiological knowledge to improve health-related states and prevent illnesses. — When we use epidemiological knowledge to address how much we can gain by reducing exposure, we base such estimates on “counterfactual reasoning”. We ask: if we were to remove the exposure altogether, by how much would we eliminate the resultant health effect? We use the relative risk reduction (a ratio measure), the attributable risk (the difference between disease rates among the exposed and the non-exposed), and the population attributable risk percent (a proportion) to assess the extent to which we would reduce the disease if the exposure were completely removed.
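The three families of measures in the bullets above can be sketched with a few lines of code. The example below is in Python (the article's own snippets are in R), and the 2×2 counts are entirely hypothetical.

```python
# A hypothetical 2x2 table: a = exposed with disease, b = exposed without,
# c = unexposed with disease, d = unexposed without.

def relative_risk(a, b, c, d):
    """Risk ratio for prospective (cohort or intervention) studies."""
    return (a / (a + b)) / (c / (c + d))

def odds_ratio(a, b, c, d):
    """Odds ratio for case-control studies: odds of exposure among
    cases versus odds of exposure among controls."""
    return (a * d) / (b * c)

def attributable_risk(a, b, c, d):
    """Rate difference between the exposed and the non-exposed."""
    return a / (a + b) - c / (c + d)

# Hypothetical cohort: 40/200 exposed and 20/200 unexposed develop disease
rr = relative_risk(40, 160, 20, 180)       # (40/200)/(20/200) = 2.0
orr = odds_ratio(40, 160, 20, 180)         # (40*180)/(160*20) = 2.25
ar = attributable_risk(40, 160, 20, 180)   # 0.20 - 0.10 = 0.10
```

Note how, with a reasonably common outcome, the odds ratio (2.25) overstates the relative risk (2.0) a little; the two converge only when the disease is rare.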

Measure of the distribution of diseases in Population

Three common measurement units to measure distribution of diseases in populations are prevalence, incidence, and standardised mortality and morbidity ratio.

Prevalence

Prevalence refers to the probability of occurrence of diseases in populations and is given by:

Prevalence = (Total number of cases of a disease / Total population) * Base population

Here is an R function for estimation of prevalence:

measure_prev <- function(x, N, base = 10000){
  # x: number of existing cases; N: total population studied
  prevalence = round((x / N) * base, 3)
  return(prevalence)
}
# prevalence of diabetes in a survey of 2500 people with 40 people diagnosed with diabetes
# x = 40, N = 2500
prevalence_diabetes = measure_prev(40, 2500)
prevalence_diabetes
# 160 (per 10,000 population)

Guha Mazumder, Smith and others (1998) conducted a study of 7683 individuals over a two year period (1995–1996) in West Bengal state of India to investigate arsenic-associated skin diseases (keratosis and hyperpigmentation). In their sample, they had 4093 females and 3590 males. Figure 1 provides the distribution of population of males and females in their sample.

Age, sex, and arsenic level (ug/L) distribution of a study population (see reference 1 for more information)

They found that in this population, 48 females and 108 males had presented with the typical skin disease named “keratosis” (keratosis is thickening of skin and discolouration due to high concentration of arsenic in the body that leads to skin changes).

This is what arsenic keratosis looks like: painful lesions on the hands.

Figure 2 provides details of the keratosis data:

Number of keratosis cases by age group and arsenic levels in blood

Based on these data, you can estimate the prevalence of keratosis in the males and females as follows:

Total number of males with keratosis = 108
Total number of males studied = 3590
Base population = 10,000
Formula of Prevalence of keratosis among males = (108/3590) * 10000
Prevalence of keratosis among males = 300.8 per 10,000 population
(this is roughly 3 percent)
Similarly, the prevalence of keratosis among females:
Total number of females with keratosis = 48
Total number of females in the study = 4093
Formula of the prevalence = (48/4093) * 10000
Prevalence of keratosis among females = 117.3 per 10,000 population
Work out the prevalence of keratosis in the entire population.
(Answer: 203 per 10,000 population)
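The arithmetic in the worked example above can be checked with a short function. This is a Python sketch; the article's own R function `measure_prev` does the same job.

```python
def prevalence(cases, population, base=10_000):
    """Prevalence per `base` population, rounded to one decimal."""
    return round(cases / population * base, 1)

# Figures from the West Bengal arsenic keratosis study
males = prevalence(108, 3590)                  # 300.8 per 10,000
females = prevalence(48, 4093)                 # 117.3 per 10,000
overall = prevalence(108 + 48, 3590 + 4093)    # 203.0 per 10,000
```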

Types of Prevalence. — Two types of prevalence are period prevalence and point prevalence. Point prevalence refers to the prevalence of a disease condition at one particular point in time, and period prevalence refers to the prevalence of disease over a period of time. For example, the prevalence of keratosis we estimated above for people exposed to arsenic in drinking water is a period prevalence for 1995–1996, because these were the two years over which they were studied. You could also state that if these were the numbers at the end of 1996, then the figures we estimated would be the point prevalence as of the end of 1996. The point to note here is that prevalence is an estimate of the distribution of disease that accounts for BOTH new cases of the disease and those already existing at the start of the investigation. When we are interested in the “rate” at which a disease is spreading in the population, we need to identify the new cases that occur among people who were free of the disease at the start of a period of observation. This measure is referred to as the incidence of a disease, which we describe next.

Incidence

Incidence refers to the total number of new cases of a disease that occur, over a specified period of time, in a population that is exposed to an environmental factor but initially free of the disease. Note, to establish incidence:

  1. Start with a population that is free of the disease in question
  2. Count the number of “new cases” that occur in that population
  3. That population must be “susceptible”, that is they are exposed to an environmental or other cue that can give rise to the disease under study.

Note that the population under study must be free of the disease when we begin to observe them. Let’s say you want to study the rate at which high blood pressure (“hypertension”) is spreading in the population, and you decide to study 1000 individuals for one year. You must ensure that these 1000 individuals do not already have hypertension when you start to study them, so that if they develop hypertension, they may do so at any point in the year through which you study them. The combination of the number of disease-free individuals and the time period constitutes the person-years that form the denominator of your incidence measurement. As an illustration, if you were to study 1 person for 1 year, your denominator would be 1 person-year. Likewise, if you were to study 1 person for 10 years, your denominator would be 10 person-years. This is the same as studying 10 persons for 1 year; you would still refer to the denominator as 10 person-years. In our hypothetical situation, if we were to study 1000 people for 1 year, we would use 1000 person-years as our denominator.

Assuming that there were 0 persons with the disease at the beginning of our observation, we will continue to count how many people are diagnosed with the disease we are interested in (in our hypothetical case, that would be how many people were to be diagnosed with hypertension over 1 year). Imagine that after studying them for one year, we found 15 new cases of hypertension. The incidence of hypertension is given by:

Incidence = (total number of new cases of a disease / total person-time contributed by individuals who were susceptible and disease-free at the beginning of the observation period) * Base population

Here is the corresponding R code to estimate the incidence rate:

estimate_incidence <- function(x, N, t, B = 1000){
  # x: new cases; N: disease-free people at the start; t: years of follow-up
  person_time = N * t
  incidence = round((x / person_time) * B, 3)
  return(incidence)
}
## Example
# incidence rate for 1000 people followed for 5 years,
# with 50 total new cases of malaria
# in a population that was otherwise free of malaria at the start
estimate_incidence(50, 1000, 5)
# 10 new cases per 1000 person-years

Using the formula above, we see that the incidence of hypertension in our hypothetical population is 15 per 1000 person-years. This concept of person-time is important for incidence, where our aim is to study the pace at which a condition is developing.

You can study incidence in a number of ways. If you count all new cases in a given population over a specified time, then the measure of incidence is “cumulative incidence” as you are calculating an overall estimate. You could also study populations over any period and estimate the “rate at which the disease would develop in such a population”, and that estimate would be referred to as “incidence density rate”.
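When follow-up times differ from person to person, the incidence density denominator is the sum of each individual's disease-free observation time. Here is a Python sketch with made-up follow-up data:

```python
def incidence_density(new_cases, person_years, base=1000):
    """New cases per `base` person-years of observation."""
    return round(new_cases / person_years * base, 1)

# Hypothetical cohort: most people observed for the full year,
# a few censored early (they moved away or were lost to follow-up)
follow_up_years = [1.0] * 997 + [0.5, 0.5, 0.25]
total_person_years = sum(follow_up_years)   # 998.25 person-years

rate = incidence_density(15, total_person_years)  # ~15 per 1000 person-years
```

Censoring a few people early barely changes the denominator here, which is why the answer matches the simpler "1000 person-years" approximation in the text.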

Prevalence and incidence are related to each other. Use prevalence when you want to build a case about the extent of a disease in the community or population you are working with, and use incidence when you want to highlight the rate at which a new condition is developing. In times of outbreaks, you will monitor incidence, not prevalence. Why? Study the diagram below.

Relationship between incidence and prevalence

As the above diagram suggests, prevalence is the “pool” of people with a disease in the population. As new cases of disease get added to that pool, the size of the prevalence increases. But in any population, people also move away, die, or get cured of the disease. In each of these situations (moving away, death, or cure), people are removed from the central box (“prevalence”), and the size of the box shrinks. So, incidence increases the size of the prevalence box, and removal from the pool decreases it. If the size of the box were “steady” (we call that the “steady state condition”), the rate at which people were added to the box would have to match the rate at which people were removed from it. Now, because the size of the pool also depends on the time for which the disease persists (its duration), we say that at the steady state condition:

Prevalence = Incidence Rate * Duration of the disease

Which is why:
1. For outbreaks, new cases can come at a very rapid pace, and if the disease is also rapidly fatal or does not last very long, people get removed from the pool quickly. As the duration of the disease approaches 0, you can state that prevalence is virtually incidence; while building the case for control of outbreaks, focus on the size of incidence, and aim to reduce the incidence!
2. For chronic diseases on the other hand, (say high blood pressure, or diabetes), once you have the disease, you have it for life. As they last a long time, with increasing incidence, the size of the pool of prevalence keeps increasing! Hence, if your aim is to reduce the pool of cases, draw attention to the size of prevalence, but aim to reduce incidence.
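The steady-state relationship explains both points above. A small Python sketch with invented numbers:

```python
def steady_state_prevalence(incidence_rate, duration_years):
    """Prevalence ~= incidence rate * mean disease duration (steady state)."""
    return incidence_rate * duration_years

# Hypothetical outbreak: high incidence (50 per 1000 per year) but the
# illness lasts only about a week, so the prevalence pool stays tiny
acute = steady_state_prevalence(50 / 1000, 7 / 365)

# Hypothetical chronic disease: lower incidence (5 per 1000 per year)
# but a ~30 year duration, so the pool grows large
chronic = steady_state_prevalence(5 / 1000, 30)   # 0.15, i.e. 15%
```

Despite a tenfold higher incidence, the acute condition's prevalence pool is a tiny fraction of the chronic one's, because the duration term dominates.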

Age adjustment and standardisation

With prevalence and incidence, you will be able to assess the extent of disease distribution in one population (or a given population, or a sample of people), or how fast a disease condition is developing in a group, community, or population. But you will need to compare this with the national average, or another place, or say even make a statement as to whether for the same population over time things have worsened or not. This indicates that you need some kind of a “benchmark” to indicate that the “pooled rate” you have for incidence or prevalence is somehow comparable to other populations or the same population over time. You need this so that you can make valid comparisons. How do you do that?

There are a couple of ways. You can age-adjust the rates and/or you can age-standardise the rates. What is the difference? In age adjustment, you use the same population, but you apply a “weight” for each age band and produce a final pooled estimate using those weights, rather than crudely estimating the prevalence from the number of cases of disease and the population affected. A crude estimate is what you would get if you did not weight each age band: you would simply sum the number of people with the disease (or deaths) and divide by the total population. This gives you what is referred to as the “crude rate”. But what if the disease condition is influenced by age itself? For example, “death from all causes” is more likely to occur in older age groups in some populations (say, in developed countries) and in younger age groups in others. From crude rates alone, you would be none the wiser. This is why adjustment for age makes sense when you report summary estimates and age is a factor that you can argue affects the disease condition. If the age-adjusted rates and the crude rates are the same, you would know that age did not matter, but until you adjust for the age variable, you would not know. We have used age here as an example, but the same applies to adjusting for any variable that you can argue is a confounding variable (explained later in the text).

Age-adjusted prevalence

Here is a spreadsheet that you can use to calculate age-adjusted prevalence. We have adopted the spreadsheet from the female keratosis we presented in Tables 1 and 2.

A spreadsheet to estimate age-adjusted prevalence. If you click on the spreadsheet, it will open for editing at https://docs.zoho.com/sheet/published.do?rid=2ti8y2b5a2b06a32543549b346b595fcaf10e

Let’s describe the spreadsheet column by column:

  1. Column A shows the age group starting with less than 9 year olds.
  2. Column B (“count”) shows how many people were in each age group
  3. Column C (“Percent”) shows the percent of the total in each age band. The total number of female participants was 4093; of them, 536 were in the ≤9 years age group, hence the percent was 13.096.
  4. Column D (“Keratosis”) shows how many of them had keratosis.
  5. Column E (“prevalence”) shows the prevalence of keratosis. To derive the keratosis prevalence for an age group, we divided the number in the “Keratosis” column by the corresponding number in the “Count” column. Thus for the ≤9 year age group, the prevalence of keratosis as a percentage is 1/536 = 0.187 percent. This measure of prevalence is referred to as the “age-specific prevalence”.
  6. The weighted prevalence column (Column F) is obtained by multiplying “prevalence” by “Percent”. We call this the weighted prevalence because the percent is the “weight” for the age distribution. When we multiply this weight by the age-specific prevalence, we get the weighted prevalence.
  7. In Row 11, we provide the calculation for the Crude Prevalence. The crude prevalence is obtained by dividing the total number of people with keratosis (in our case 48) by the total number of people in the study (in this case 4093) and multiplying by a base (here we used a base of 100), so we say that the prevalence of keratosis among females is 1.173%.
  8. In Row 12, we provide the Age-adjusted prevalence. We obtained it by summing the weighted prevalence column and dividing the sum by 100 (to account for the fact that all the percents add up to 100).
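The spreadsheet logic can be written as a function. Note that when the weights come from the same population being studied, the weighted sum reduces algebraically to the crude prevalence, which is why the two rows agree. This Python sketch uses the ≤9 year band from Table 1 plus two invented bands chosen so that the totals match the female figures (4093 people, 48 cases):

```python
def crude_prevalence(counts, cases, base=100):
    """Total cases over total population, as a percentage."""
    return round(sum(cases) / sum(counts) * base, 3)

def age_adjusted_prevalence(counts, cases, base=100):
    """Weight each age-specific prevalence by that band's population share."""
    total = sum(counts)
    weighted = sum((n / total) * (x / n) for n, x in zip(counts, cases))
    return round(weighted * base, 3)

# First band (<=9 years: 536 people, 1 case) is from Table 1; the other
# two bands are hypothetical but sum to the article's female totals
counts = [536, 2000, 1557]
cases = [1, 20, 27]

crude = crude_prevalence(counts, cases)            # 1.173 %
adjusted = age_adjusted_prevalence(counts, cases)  # 1.173 % (same weights)
```

Age adjustment becomes genuinely different from the crude rate only when the weights come from a different (standard) population, which is the subject of the next section.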
Test for you:
Here is a partially filled in spreadsheet for the keratosis values for males. Can you fill in "percent", "prevalence", "Weighted Prevalence", and calculate the Crude Prevalence and Age-adjusted prevalence?
The spreadsheet for you to fill out (if you click on the image it will take you to the webpage, or you can directly go to https://docs.zoho.com/sheet/published.do?rid=2ti8y2b5a2b06a32543549b346b595fcaf10e)

Age-standardisation

Age-adjustment allows us to control for the effect of age on the resulting estimate if we believe that age could influence the distribution of an outcome. For example, in the case of males and females here, we thought older age might be associated with keratosis risk, so we used age-adjustment to control for the effect of age in estimating the prevalence. In our case the crude and the age-adjusted rates were similar (nearly identical), so we can state that age did not influence keratosis: children and adults were equally likely to be affected. But what happens when the age structure of two populations is different, and therefore their crude rates are not comparable? In our case, as you can see, the age distributions for males and females were different. Could this be affecting the rates? How do we know whether males and females truly had different rates? To answer this question, we need a “third population” with an agreed distribution of people across age groups, and we estimate what the numbers would be for THAT population. Such a population is referred to as a “standard population”. The World Health Organisation provides such a population, referred to as the Segi world population.

Here is what it looks like:

The SEGI world population. We will use it to standardise the rates.

You can use this “standard population” to compare different populations, or even study transition of diseases within one population. For example, you can use standardised rates to compare if there were more cases of a disease sometime past compared to now. Let us use the method of standardisation to test if high blood pressure or hypertension has increased in New Zealand in 2012 compared with 2009, when the population structures were different.

Recipe for direct standardisation: step by step

Step 1. Adjust the Segi world population to reflect the desired age ranges and counts. In this case, we are interested in studying high blood pressure for all ages between 15 years and 85 years or above. Once we have the 2009, 2012, and Segi populations aligned, they look as follows:

Step 2. Estimate the numbers of people with the disease you want to study. In our case, we will estimate the number of people with hypertension in each of the populations (2012 and 2009) for every age group. We start with prevalence estimates we already have from the New Zealand Health Survey data (see the columns Prevalence (2012) and Prevalence (2009) in the figure below). So, we get the following data:

Using the Segi population, we have multiplied the age-specific prevalence in each age group by the standard population to get the expected numbers for the corresponding years. Then we have added up the numbers (see the row titled “total”).

Step 3. Add up the numbers and compare. Using a standard population (the Segi world population), we see that there are more people with treated hypertension or high blood pressure in 2012 (784) than there were in 2009 (724). The summary measure is expressed as a Standardised Morbidity Ratio:

SMR = Total Number of Cases in Population A / Total Number of Cases in Population B
SMR can be a standardised morbidity ratio; it can also be a standardised mortality ratio when we compare death rates in two populations.

In our case the SMR would be 1.08 (784/724); this indicates that, compared with 2009, the prevalence of treated high blood pressure had increased in 2012 in New Zealand.
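The direct standardisation recipe can be sketched in code. The standard population and age-specific prevalences below are hypothetical (three bands only), but the mechanics are exactly those of the hypertension example:

```python
def expected_cases(standard_pop, age_specific_prev):
    """Apply each age-specific prevalence to the standard population, then sum."""
    return sum(n * p for n, p in zip(standard_pop, age_specific_prev))

# Hypothetical Segi-style standard population and prevalences for two years
standard = [2200, 1900, 1400]
prev_year_a = [0.10, 0.15, 0.25]
prev_year_b = [0.09, 0.14, 0.23]

cases_a = expected_cases(standard, prev_year_a)   # 855 expected cases
cases_b = expected_cases(standard, prev_year_b)   # 786 expected cases
smr = cases_a / cases_b                           # ~1.09: more cases in year A
```

Because both years are projected onto the same standard population, the ratio reflects a genuine difference in age-specific rates rather than a difference in age structure.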

Using this process of standardisation, can you find out the SMR for males and females for arsenic keratoses? The following spreadsheet explains how to calculate the values.

Calculation of SMR for males and females for arsenic caused keratoses (explanations in the text)

Let us walk through this calculation step by step as to how to calculate the standardised morbidity ratio for males versus females for arsenic keratoses:

  1. The first column shows the “Age group”. Note that the age groups here are in jumps of 10 years, so we have less than 9 years, then 10–19 years, and so on. The highest age group is given as 60 years and above.
  2. The second column is the Segi population, now multiplied by 100. Note that for each band, we have summed the Segi population numbers. So, for instance, for ≤9 years, we have (12 + 10 = 22) times 100, hence 2200.
  3. The third column shows the prevalence of keratoses among females, read off from Table 2 presented earlier. Here, we have converted the percentage values to decimals; hence for ≤9 year old females, whose prevalence was 0.2%, we present the value as 0.002.
  4. The fourth column shows the prevalence of keratoses among males, read off from Table 2. Again, as with the females, the percentages are converted to decimals.
  5. The fifth column (“ker_count_females”) shows the expected number of keratoses among females for the Segi population. We obtained this by multiplying the female prevalence of keratoses by the corresponding age-band Segi population.
  6. The sixth column (“ker_count_males”) shows the expected number of keratoses among males for the Segi population.
  7. In row number 9, we have added the expected numbers of keratoses for females and males respectively, so you see 130 (for females) and 285.7 (for males).
  8. In row number 10, we have taken the ratio of 285.7 to 130 to obtain the male:female ratio of 2.198. This is the standardised morbidity ratio for males versus females in this population. The figure 2.198 suggests males are about twice as likely as females to have keratoses.

In summary, we saw that we can use prevalence to estimate the extent to which a disease is a problem in the population. We also saw that incidence provides an estimate of the ‘force’ with which a disease is spreading in the population. These are two different estimates meant to record two different phenomena, but they are related to each other. Age adjustment is a method to adjust or control for the effect of age when we estimate prevalence or incidence. Finally, standardisation of rates is a way to adjust for differences in the age structure of two populations. In our examples and discussion so far, we have used age as the factor. In fact, you can extend this discussion to other factors: gender, socioeconomic status, and other variables that you think will have an impact on the distribution of diseases in populations.

What does it mean for environmental health investigations? We can compare populations with different distributions and populations that are different with respect to environmental factors (city dwellers versus country dwellers and compare rates of diseases); we can compare our regional rates with national rates for environmental variables and disease conditions.

We now turn our attention to estimating the effects of environmental determinants of diseases. But before we do so, we need to understand what we mean by “causation”. How do we know that an environmental factor actually “causes” a disease condition? This is more than a play on words. Once we understand this concept, we can assess the impact of one or more environmental variables on diseases.

How do we know that an environmental factor causes diseases?

Epidemiology in general, and environmental epidemiology in particular, goes deeper than just describing health states, and one of the attractive aspects of environmental epidemiology, as we argued previously, is to identify hazards and characterise risks. How do we do that? Characterising risks involves, at the least, finding associations between environmental agents and the specific health effects that result from exposure to those agents. Such agents are in our physical environment (air, water, soil, food) or in our occupational environment (the agents we are exposed to in our jobs). If we are to minimise hazards to our health, we need to control the agents. But how do we find out which agent is responsible? Also, we can meaningfully talk about controlling exposure hazards only if we know the extent to which the agents are validly associated with the health effects. What is meant by a “valid association”?

We can claim for example, that environmental tobacco smoke, also referred to as “second hand smoking” (someone else is smoking a cigarette and you get to inhale the tobacco smoke because you are close to that person, or perhaps because you live in the same household, or in nearby environment) is associated with chronic respiratory illnesses. The skeptics would say that there is really “no association between second hand smoke inhalation and lung disease” and to prove their point, they will raise three objections against your claim:

  • Chance. — They will say that such an association can be a matter of chance; that these observations can arise at random and reflect no meaningful association.
  • Bias. — They will argue that you are making this claim because you have systematically made errors, such that you end up showing that people exposed to second hand smoke develop more respiratory disease even though that is not true. You may have made wrong observations (wilfully or otherwise), or your study participants may have given deliberately wrong responses.
  • Confounding. — The skeptics will also claim that a “third agent”, related to both second hand smoke AND lung disease, was responsible for the observations you made. If that is true, then it is not really environmental tobacco smoke but something else, associated with both environmental tobacco smoke and chronic lung disease, that explains your finding. For example, it is possible that some genes predispose one to inhale environmental tobacco smoke and the same set of genes predispose one to develop lung disease. Or, we know that those who are exposed to environmental tobacco smoke are also likely to belong to low socioeconomic status groups, and those of low socioeconomic status are also likely to suffer more from chronic lung disease than those who are better off in society; so the relationship you are seeing between environmental tobacco smoke and lung disease may actually be a function of SES.

How can we address the play of chance?

Think of the following two by two table which we construct before beginning our study:

Before we begin our study

The Null Hypothesis Table

We have a theory that “exposure to second hand smoke causes lung disease”. Based on this theory, we will frame two hypotheses — the null hypothesis and the alternative hypothesis. The null hypothesis is the hypothesis we will seek to refute using the data we collect in our study or investigation. The null hypothesis states,

H0: Lung disease is equally likely among those who do and do not inhale second hand smoke

We will examine this null hypothesis in our research. In comparison to null hypothesis, we have an alternative hypothesis. We state this as follows:

H1: Those who inhale second hand smoke are at higher risk of lung disease

In the above statements, H1 refers to “alternative” hypothesis, and H0 refers to the “null hypothesis” (note that it leaves room for equivalence). We specify the above table with our specific example thus:

Before we begin our study, we set up our hypotheses

Before we begin our investigation, we state that we may find sufficient evidence to reject the null hypothesis that ETS has no association with chronic lung disease (that is, that with or without ETS exposure the rates of chronic lung disease would remain pretty much the same). But we can also be mistaken, and there are two ways we can commit errors:

  1. On the basis of our studies, we reject the claim of the null hypothesis while the null hypothesis was actually correct. We would like to be conservative about this error, and we can set a probability limit on it. That is, we can state that we shall be mistaken no more than 5% of the time, or that our probability of error will be 0.05. This is also referred to as the alpha (Type I) error.
  2. Or, on the basis of our study, we can claim that we did not find sufficient evidence to reject the null hypothesis, while the null hypothesis was actually false (that is, ETS exposure would actually increase the risk of chronic lung disease; often the “direction” of the association is important, as in our case, but often it is “just” a matter of association). We can go a little easier on this one and allow a probability of error of about 20% (although you can be stricter here as well). This error is also known as the beta (Type II) error.
  3. Note that in addition to these two errors, we can also be correct on two counts. First, in the top right quadrant, we are correct if we reject the null when the null is indeed false; this is the “power” of our study. It is no surprise that this is referred to as the power of the study, and it is expressed in terms of the beta error, that is:
power = 1 - Type II error (beta)

What do we do with this information to address the chance factor?

While this little bit is important to understand, there are a couple of things we do with this information:

  1. We use this information to calculate how many people we need to include in our study

Let’s dwell on this for a bit. Let’s say we are interested in conducting a study ourselves to find out the association between exposure to ETS and the risk of lung cancer. By the way, a number of studies have already been conducted on this topic; see

The screenshot of the result of searching Google Scholar for “Environmental Tobacco Smoke” and “Lung Cancer”; click on the image to go to the search results.

So, can we invoke our table above and figure out how many people we need for the study? We will discuss these issues in detail in Part 2, when we discuss epidemiological study designs. For more information on how to estimate sample sizes for various epidemiological studies, see the OpenEpi website and start plugging in your numbers.
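For a flavour of what tools like OpenEpi compute, here is the classic normal-approximation formula for comparing two proportions, in Python. The z values correspond to a two-sided 5% alpha and 80% power, and the 10% versus 15% disease figures are purely illustrative, not taken from any study:

```python
from math import ceil

def sample_size_two_proportions(p1, p2, z_alpha=1.96, z_beta=0.84):
    """Per-group sample size to distinguish proportion p1 from p2.

    z_alpha = 1.96 (two-sided alpha of 0.05), z_beta = 0.84 (80% power).
    Simple unpooled normal-approximation formula, rounded up."""
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2)

# e.g. 10% lung disease among the unexposed vs 15% among the ETS-exposed
n_per_group = sample_size_two_proportions(0.10, 0.15)   # 683 per group
```

Notice how the alpha and beta errors from the table above enter directly through the two z values: demanding a smaller beta (more power) or a smaller alpha inflates the required sample size.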

2. After we have conducted the study

Now we go ahead and collect our data. The results of the data analysis will provide us with a measure of probability: if the null hypothesis were correct, what would be the probability of finding what we found? That is, the probability of our results given that the null hypothesis is correct, written as P(Findings | Null), where the “|” symbol is read as “given”. This probability estimate is the p-value. We should not use the p-value as a binary verdict of whether the findings are “significant” or “non-significant”, but interpret it as a probability estimate of the findings under the null hypothesis. Hence, if the p-value is high, we cannot refute, or reject, the null hypothesis. If, on the other hand, the value of p is very low (conventionally less than 0.05), then we state that we can reject the null hypothesis in favour of the alternative hypothesis. Later we will see that a more robust approach is to use the 95% confidence interval around the point estimate to find the boundaries of the estimate.

Let’s illustrate this with an actual study conducted on environmental tobacco smoke and lung cancer. Paolo Boffetta and his colleagues conducted a large case control study on the association between ETS and lung cancer in 12 centres in seven European countries over six years, and published the results in 1998; read the full study here.

http://jnci.oxfordjournals.org/content/90/19/1440.full

A total of 650 patients with lung cancer (cases) and 1542 persons without lung cancer (controls) were studied in this research. Environmental tobacco smoke exposure was determined using a survey questionnaire, and odds ratios were estimated to reflect the association between exposure and outcome. A detailed description of the study is beyond the scope of this tutorial, so only a very brief snapshot is presented of the findings for people who were exposed to ETS both from their workplaces and from their spouses; see the figure below

The risk of lung cancer from exposure to environmental tobacco smoke in the 12-centre study on environmental tobacco smoke and lung cancer by Paolo Boffetta et al (see the reference section for the full text or click on the image)

We can look at many different figures here, but for the sake of learning, let’s review the figures for the section on duration of exposure (hours/day × years). We see that, compared with the baseline condition, the risk increased as the duration of exposure increased (1.00 for baseline, then 0.91 for the 0–165 category, and then 1.31 and 1.46 for progressively increasing levels of exposure). Note a couple of things here. First, the effect sizes are not large; it is the large sample size that makes effects this small detectable at all, since the larger your sample size, the smaller the effect size you can pick up. Second, note that the 95% confidence intervals around these ORs all straddle 1.00, so from a confidence interval perspective you may state that none of these findings was statistically significant. Third, note that as the extent of exposure increased, the corresponding effect size also increased. This is an important finding as far as dose response effects are concerned. We shall review these concepts later, but for the time being, this is an example of how results are presented in the context of an epidemiological study.
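As a sketch of how such confidence intervals are computed, the snippet below derives a Wald 95% confidence interval for an odds ratio from a 2×2 table; the interval is built on the log odds scale and then exponentiated. The cell counts here are hypothetical, not Boffetta’s.

```python
from math import exp, log, sqrt

def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio with a Wald 95% confidence interval.
    a = exposed cases, b = exposed controls,
    c = unexposed cases, d = unexposed controls."""
    or_ = (a * d) / (b * c)
    se_log_or = sqrt(1 / a + 1 / b + 1 / c + 1 / d)   # SE of log(OR)
    lower = exp(log(or_) - z * se_log_or)
    upper = exp(log(or_) + z * se_log_or)
    return or_, lower, upper

# hypothetical cell counts; an interval that straddles 1.00 is
# conventionally read as "not statistically significant"
or_, lo, hi = odds_ratio_ci(120, 100, 80, 90)
print(round(or_, 2), round(lo, 2), round(hi, 2))
```

With these invented counts the point estimate sits above 1 but the lower bound falls below 1, exactly the "straddling 1.00" situation discussed above.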

Biases in Epidemiological Studies

Chance, therefore, in the context of an epidemiological study is something to be ruled out by using statistical tests and hypothesis testing, and by conducting the study with an adequate sample size with respect to the levels of exposure and the expected effects. The next issue is bias. Biases are systematic errors in observation and data analysis that can lead to erroneous conclusions about the nature of the association between an exposure and an outcome. We are concerned with two forms of bias in epidemiological studies: selection bias and response bias, both of which undermine the conclusions drawn from epidemiological studies. Selection bias refers to what happens when an investigator preferentially selects or omits specific groups from a study. For example, take the previous example of the case control study on the association between environmental tobacco smoke and lung cancer. Suppose the lung cancer patients and the controls were selected differently: say the lung cancer patients were selected from among hospital cancer ward patients, while the controls were selected not from another ward of the same hospital but from the general population, or from a population unlikely to be exposed to environmental tobacco smoke of any sort over a long period of time (say a prison environment). The investigators would then have built in an adverse selection that would increase the likelihood of finding a positive result. This is erroneous. On the other hand, it could also be that the measurement of environmental tobacco smoke exposure among the cases and controls was done differently.
If the exposure assessment was done using a questionnaire, and the respondents knew the purpose of the study, then the cases (who knew that they had cancer, and that the purpose of the study was to identify the linkage between ETS exposure and lung cancer) would quite likely over-report their exposure to ETS. This in turn would push the effect sizes in the positive direction, and you would be none the wiser. This type of bias, where the responses of the participants invalidate the study results even though the selection was sound, is referred to as response bias (closely related to recall bias).

The topic of bias is a broad one, and we have barely touched the two most essential types that we encounter in the literature. There are other forms of bias: for example, when errors in measurement are of equal extent in all comparison groups (that is, randomly distributed), the results tend to be pulled towards the null; whereas when the errors are non-random, we do not know in which direction the bias will go. In general, it is best that any source of bias is eliminated at the study design phase.

How do we eliminate bias in the study design phase? One way is to be careful about the selection of participants, and to design the questionnaire or data collection instrument so that the investigator’s involvement at the data collection phase, and hence the scope for subjective interpretation to influence the results, is minimised. What this means is: rather than using questionnaires to collect data on ETS exposure, use something like urinary cotinine or other measures, that is, data that are objective rather than subjective. In addition, the interviewers need to be trained properly so that any subjective assessment during the interview is minimised as much as possible. Third, although not always possible in the context of environmental epidemiological studies, in other contexts one can use some sort of masking or “blinding” so that the interviewers or investigators do not know which comparison group a participant belongs to. All in all, the best way to deal with bias is to eliminate any possible sources of it at the stage of planning the study rather than at the end of it.

Role of Confounding Variables

The third important consideration for assessment of valid association is the role of confounding variables. This is best explored graphically, see the figure below:

Concept of confounding variables; a confounding variable is one that is associated with both the exposure and the outcome but does not lie on the causal pathway between them

A confounding variable is one that, in a sense, “confuses” any association that we get to see between the exposure variable of interest and the health outcome we are discussing. Let’s review this in the light of the environmental tobacco smoke and lung cancer example. What might be some variables that are associated with both ETS and lung cancer? Bennett et al (1998) conducted a case control study on the association between ETS and lung cancer among non-smoking women; in that study the authors considered several gene variants, and the design of the study suggests that gender is a confounding variable as well (men are more likely to suffer from lung cancer anyway, and may also be more likely than women to be exposed to ETS). Other variables such as jobs that expose people both to ETS and to other agents that are known carcinogens, socioeconomic factors, and radon exposure need to be considered in this context. Note that none of these confounders “results” from being exposed to ETS, nor lies on the causal pathway to lung cancer. In most cases, age, gender, and socioeconomic status are quite important variables that confound relationships between specific exposures and health outcomes, and they need to be considered carefully in analyses.

In general, three strategies are used to control for the effect of confounding variables: restriction, matching, and multivariate methods. Restriction refers to selecting participants so that variables known to confound the linkage between exposure and outcome are removed at the stage of planning the study. For example, in a study on the association between ETS and lung cancer, if we suspect that gender is a potential confounding variable, then restricting the study to non-smoking women would eliminate the confounding we can expect from gender. A downside of this strategy, however, is that participants with certain characteristics become unavailable, which shrinks the sample size, so the study can be underpowered as well as restrictive. Also, the findings from this type of study cannot be extended to men, for instance. The second strategy is matching. In the context of case control studies, matching would mean that cases and controls are of similar age and gender; often cases and controls of the same age and gender are paired. If there is any suspicion that socioeconomic status can influence an association between an exposure and an outcome, then the cases and controls can be taken from the same families to adjust for the effect of socioeconomic status; this is in addition to restricting the study to only certain socioeconomic strata. Once more the downside of this strategy is that matching (or overmatching) within families and by other means can reduce the sample size and cause other problems with the effect estimates. The third strategy is to adjust for the effects of confounding at the stage of data analysis.
In this strategy, at the time of data analysis, the potential confounding variables are entered into a multivariate model and the effect estimates are adjusted for them. For example, Carl Bornehag and colleagues investigated the association between phthalates (compounds used in making PVC, plastics, and other household products) and the risk of asthma in children in a sample of 400 children (198 cases and 202 controls). While the phthalate data were collected from the children’s urine, Bornehag et al also collected data on flooring, housing type, and other variables, and used them in a multivariate logistic regression model (more on this later) to test the association between these various factors and the risk of asthma in children; see the table below:

A table from Bornehag et al (1998) on the association between phthalates and children’s asthma; the purpose of this table is to show how they used multivariate analyses to show the relationships. See the bottom of the table, where they mention that they adjusted for sex, age, smoking at home, type of building, construction period, and self-reported flooding during the preceding three years.

As you can see in the table above, Bornehag and colleagues considered several variables in their study; instead of restricting the analysis on the basis of one or more of these variables, they collected all this information and entered the variables together in their data analysis model. We shall discuss logistic regression, and why and how such models are set up, in our module on case control studies; for the time being the point is to illustrate that, typically, several variables are included at the analysis phase to control for confounding.
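Multivariate logistic regression is one way to adjust for confounding at the analysis stage; a simpler, classical alternative that conveys the same idea is stratified adjustment with the Mantel-Haenszel summary odds ratio, sketched below on invented data stratified by a hypothetical confounder (sex). This is not the method Bornehag et al used, just an illustration of adjustment by analysis.

```python
def mantel_haenszel_or(strata):
    """Mantel-Haenszel summary odds ratio, adjusting for a confounder
    by stratification. Each stratum is a tuple (a, b, c, d):
    exposed cases, exposed controls, unexposed cases, unexposed controls."""
    numerator = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
    denominator = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
    return numerator / denominator

# invented data, one stratum per level of the hypothetical confounder
strata = [(40, 60, 30, 70),   # stratum 1, e.g. boys
          (25, 75, 15, 85)]   # stratum 2, e.g. girls
print(round(mantel_haenszel_or(strata), 2))   # adjusted summary OR
```

The summary estimate pools the stratum-specific odds ratios, so any association explained purely by the stratifying variable is removed; multivariate regression generalises this to many confounders at once.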

Chance, Bias, and Confounding

In summary, chance, bias, and confounding are three factors that must be ruled out before any linkage between an exposure and an outcome is accepted. Typically, when the sample size and the effect estimates are well thought out, when the relevant theoretical and practical confounding variables are accounted for, and when biases are eliminated, you can say that the association reported is a valid association. This is also referred to as the internal validity of the study. We still need to consider whether this association is a causal association, and that is another large topic; so before heading in that direction, let’s focus on defining a few measures of association that are important.

Measures of Association

We are going to review four measures of association in this section: the risk difference (attributable risk), the attributable and population attributable risk percents, the risk ratio, and the odds ratio. These measures are applied in different contexts in epidemiological studies.

Risk Difference (Attributable Risk). — This is the difference in the incidence rates between those with an exposure and those without. Environmental epidemiological studies are essentially what are often referred to as observational studies, where we only get to observe the health effects or outcomes of specific exposures and do not conduct any interventions. A study design where the risk difference or attributable risk becomes relevant is the cohort study (more in the study design section). In a cohort study, we follow individuals who are exposed to a factor of interest and those who are not exposed over time, and compare the rates of the outcome in the two groups. Let Ie represent the incidence rate of the health effect among the exposed and Io the incidence rate among the non-exposed. Then the risk difference is given by:

Risk Difference (RD) = Ie - Io
Ie = incidence among the exposed
Io = incidence among the non-exposed
This measure is also referred to as "Attributable Risk"

Relative Risk = Ie / Io
If the relative risk is calculated from incidence rates, Relative Risk is also referred to as Rate Ratio (RR)

Risk difference is also known as attributable risk. Two related terms are attributable risk % (AR%) and population attributable risk % (PAR%). The attributable risk percent provides an estimate of how much of the disease among the exposed can be attributed to the exposure, and is given by:

AR% = 100 * (Ie - Io) / Ie
Alternative formula: AR% = (RR - 1) * 100 / RR, where RR = relative risk or rate ratio

The PAR% (or population attributable fraction, PAF%) provides an estimate of the proportional reduction in the prevalence of a disease in the population if the exposure were reduced to the level of the “non-exposed”, or minimised. The formula is given by:

PAF% = [Pe * (RRe - 1) / (1 + Pe * (RRe - 1))] * 100
where
PAF% = population attributable fraction;
Pe = prevalence of exposure in the population;
RRe = rate ratio or relative risk of the disease as a result of the exposure
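These measures are simple arithmetic, so they are easy to sketch in code; the incidence rates and exposure prevalence below are made-up numbers purely for illustration.

```python
def risk_difference(ie, io):
    """Risk (rate) difference, also called the attributable risk."""
    return ie - io

def relative_risk(ie, io):
    """Relative risk, or rate ratio: Ie / Io."""
    return ie / io

def ar_percent(rr):
    """Attributable risk percent among the exposed: (RR - 1) * 100 / RR."""
    return (rr - 1) * 100 / rr

def paf_percent(pe, rr):
    """Population attributable fraction (per cent), where pe is the
    prevalence of exposure and rr the relative risk."""
    return pe * (rr - 1) / (1 + pe * (rr - 1)) * 100

# made-up incidence rates per 1000 person-years among exposed/non-exposed
ie, io = 8.0, 2.0
rr = relative_risk(ie, io)
print(risk_difference(ie, io))            # 6.0 per 1000 person-years
print(rr)                                 # 4.0
print(ar_percent(rr))                     # 75.0
print(round(paf_percent(0.30, rr), 1))    # 47.4, assuming 30% exposed
```

Read together: 75% of the disease among the exposed, and about 47% of the disease in the whole population, would be attributable to this hypothetical exposure.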


Rate Ratios and Relative Risks. — This brings us to the next concept, relative risk. Relative risk, or rate ratio, refers to the ratio of the rate of disease in the exposed versus the non-exposed. Consider a cohort study where exposed and non-exposed individuals are followed over a period of time to study the incidence of disease or health outcomes in each category. The ratio of the rates of disease occurrence is known as the Rate Ratio or Relative Risk, and is given by:

Relative Risk or Rate Ratio (RR) = Ie / Io where 
Ie = incidence of disease or health condition in the exposed group and
Io = incidence of disease or health effect in the non-exposed group.

Concepts of Odds and Odds Ratios

Prevalence, for instance, provides us with an idea of the probability of a disease in the population. We found earlier that the prevalence of treated hypertension in NZ is about 16%, which means that roughly 16 per 100 people in NZ suffer from high blood pressure and are treated for it. It also means that (100 - 16)/100 = 84% of people in NZ do not suffer from treated high blood pressure. So the odds of having treated high blood pressure versus not having it (either normal or low BP) in the NZ population are about 16:84 ~ 4 to 21, or roughly 1 to 5. So, we define Odds as

Odds = Probability that an event occurs / Probability that an event does not occur
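In code, the conversion from a probability to odds is a one-liner; plugging in the 16% prevalence figure from the text:

```python
def odds(p):
    """Convert a probability (e.g. a prevalence) into odds."""
    return p / (1 - p)

print(round(odds(0.16), 2))   # 0.19, i.e. roughly 1 to 5
```

Note that for small probabilities the odds are numerically close to the probability itself, which is the intuition behind the rare-disease approximation discussed below.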

An Odds Ratio is then simply the ratio of two odds. Let’s review where in epidemiology in general, and environmental epidemiology in particular, we are going to see them. Refer to the following table:

A typical scenario for a case control study. A = cases who have the exposure, B = controls who have the exposure, C = cases who do not have the exposure, and D = controls who do not have the exposure. Cases and controls refer to people with and without the disease condition of interest, and exposed and non-exposed refer to the state of being or not being exposed to the environmental condition of interest

The above table shows a typical scenario of a case control study. As we have been discussing, in a case control study we sample people on the basis of whether they do or do not have the disease of interest. In the ETS and lung cancer example, if we were to set up a case control study, we’d sample cases from among people who have lung cancer and controls from among people who are free of lung cancer. Then, using our measures of ETS, we’d estimate the level of exposure in each group. In this way, we find out the likelihood of the cases having been exposed to ETS (the exposure of our interest) as opposed to the controls. From the above table, we see that out of A + C cases, A had the exposure. So the probability of being exposed for cases would be A/(A + C); similarly, the probability of not being exposed for cases would be C/(A + C). If we now consider the odds of exposure for cases, we see that they come out to A/C (can you work out why?). I leave it to you to reason out what happens with the controls, but you can see that the odds of exposure for controls will be B/D.

If we put the two Odds together to calculate the Odds Ratio, the Odds Ratio comes out to be:

Odds Ratio = (A/C) / (B/D) or 
Odds Ratio = [A * D ]/ [B * C]

This is also referred to as a cross product.
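The cross product is easy to verify in code: the ratio of the two exposure odds, (A/C) / (B/D), reduces algebraically to A*D / (B*C). The counts below are invented for illustration.

```python
def odds_ratio(a, b, c, d):
    """Cross-product odds ratio from a case control 2x2 table.
    a = exposed cases, b = exposed controls,
    c = unexposed cases, d = unexposed controls."""
    return (a * d) / (b * c)

# invented counts: the ratio of exposure odds (A/C) / (B/D)
# equals the cross product A*D / (B*C)
a, b, c, d = 50, 30, 50, 70
assert abs(odds_ratio(a, b, c, d) - (a / c) / (b / d)) < 1e-12
print(round(odds_ratio(a, b, c, d), 2))   # 2.33
```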

Relationship between Risk Ratios and Odds Ratios. — For most practical purposes, odds ratios are treated as measures of relative risk. The key point is that odds ratios approximate relative risks well when the disease is rare (so, assuming a disease is rare, odds ratios and relative risks can be used interchangeably). What this means for environmental health conditions is that you can use the same measures that depend on relative risks (that is, attributable risk percents and population attributable risk percents) as if you were working with relative risk estimates.

Here is an example. In the phthalate study (see above), we noted that, compared with the lowest quartile of exposure, the OR for asthma in children in the highest quartile of phthalate exposure was 2.93. Treating this OR as a relative risk, about 66% of the asthma cases among the most highly exposed children can be attributed to phthalates (to see how this calculation was done, use (RR - 1) * 100 / RR and put RR = 2.93 from the above table).

Say in a population about 40% of children are exposed to high concentrations of phthalates, and we take the overall relative risk of asthma for the highest versus the lowest level of exposure to be 2.93. Then it follows that, if we were to bring the phthalate concentration down to the lowest quartile, we would prevent Pe × (RR - 1) / (1 + Pe × (RR - 1)) = 0.40 × 1.93 / (1 + 0.40 × 1.93), or about 43.6%, nearly 44%, of the asthma cases in the community by reducing phthalates alone. However, even as we write this, we need to make sure that we have established, or at least have grounds to believe, that phthalates are _causally_ related to childhood asthma. After ruling out the play of chance, eliminating biases (or at least significant biases) through the study design, and controlling for confounding variables, we have established that our finding of an association between exposure and outcome is internally valid. But have we established a causal linkage? The answer is “no”, unless we have examined some other sources of information that give us a closer appreciation of what we mean by a cause and effect linkage, to which we now turn.
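We can check both figures in this example with a couple of lines of code; the two helper functions simply encode the AR% and PAF% formulas given earlier in this section.

```python
def ar_percent(rr):
    """Attributable risk percent among the exposed: (RR - 1) * 100 / RR."""
    return (rr - 1) * 100 / rr

def paf_percent(pe, rr):
    """Population attributable fraction (per cent):
    pe = prevalence of exposure, rr = relative risk."""
    return pe * (rr - 1) / (1 + pe * (rr - 1)) * 100

rr = 2.93    # OR for the highest vs lowest exposure quartile, read as an RR
print(round(ar_percent(rr), 1))          # 65.9 -> "about 66%"
print(round(paf_percent(0.40, rr), 1))   # 43.6 -> "nearly 44%"
```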

How do we Judge Causality from Valid Association

When we use measures of association to understand how much of a disease could be prevented if a certain exposure were removed or reduced to the lowest level possible, or how much damage we are incurring at the present level of exposure to a certain environmental agent, we implicitly believe that this particular environmental agent (call it “E”) is “causing” the health outcome we are investigating. For example, if we state that by reducing ETS to a negligible amount we would reduce the prevalence of lung cancer in our population by some amount x, we state this under the assumption that the ETS and lung cancer association goes beyond mere correlation: that they are connected in a cause and effect relationship. It is the “nature” of this relationship that is very important to discern. We shall elucidate this nature of association (causal association) using two related concepts: one, the counterfactual concept, that is, what would happen to the health outcome in the absence of the exposure; and two, a set of “criteria”, or conditions, or a general framework, for examining what constitutes a cause and effect relationship. Let’s check out the counterfactual theory first.

Counterfactual theories of Causality & Rothman’s Pie

Attributable Risk % and what does it mean. — Let me tell you the story of Mrs B, our 80-year-old neighbour: how she broke her hip and had to be admitted to the city hospital. She was happily walking along a path in the city centre to visit her niece. It had snowed a few days earlier, and while the city council workers had shovelled and ploughed some of the snow off the streets, some still persisted on the path; it was icy cold that day, there was black ice all over, and the path was slippery. I did not tell you that about thirty years ago Mrs B, then in her late forties, had suffered what is referred to as a “mild stroke”, and it left her a little wobbly at times. So here she was, walking along the path, when a kid came at her on a scooter on the pavement; she tried to skirt the kid, felt a little wobbly, lost her balance, and slipped on the pavement. In an instant, she knew she had broken her hip. Fortunately for her, she was able to find someone to phone the local ambulance; they took her to the city hospital and she recovered. Now, here is a question for you:

What do you think caused her hip fracture? (Think about it):

  • Was it the black ice and the bad weather? (it immediately preceded the fall, and the city council did a sloppy job)
  • The kid on the scooter? (if the kid had not come along, she would have walked on as usual, wouldn’t she?)
  • Her wobbly gait and the old stroke?

Perhaps all three conspired in some way to “cause”, if you will, her hip fracture. This is the point. Quite often in our lives, several events precede one outstanding event or illness, and we fall into the trap of thinking that the last thing (“the last straw that broke the camel’s back”) must be the cause of the health issue. But quite often the answers are less clear. Ken Rothman proposed a concept in which any model of a disease can be said to contain a set of sufficient and necessary “causes”. He conceptualised them in the form of a series of circles, with arcs denoting the contributions of the different supposed factors. Check them out:

Rothman’s pie. The circles indicate sufficient cause models, and the arcs within each circle, with letters in them, denote the different component causes. Click on the image to visit the source of the image.

What you see above is also referred to as the component cause model that Professor Ken Rothman proposed. Note the three models above: each depicts a “sufficient cause” of a particular health outcome. For instance, think of the hip fracture of our neighbour, Mrs B. We can think of several possible factors that close the circle that led to her hip fracture (some of them right away: the stroke she suffered, the icy path, the kid on the scooter, and so on). Likewise, for a disease, the leftmost model has these five component causes: D, C, B, A, and E; the middle model has G, F, B, A, and H; and the model on the right has I, F, C, A, and J. Note that the one element common to all of these sufficient cause models is the causal factor “A”. In such situations, the factor “A” is termed the “necessary cause”; that is, “A” must be present before any of the sufficient cause loops can be closed.

Note a couple of points here. The first relates to the size of the pie slices: how do we determine the relative sizes of the slices, and do the sizes matter at all? The other point relates to the more fundamental issue of how we know that this pie is causal: how is each element selected to go into the pie? The answer to the second question is partly rooted in the philosophy of science, and we shall discuss it in the next section; but before we get there, let’s dwell a bit on the first question about the size of the slices.

The size of the slices depends on the relative contribution of each factor. From our previous discussions, we know that this contribution in turn depends on the attributable fraction for that factor. For the asthma problem in children, we learned from Bornehag’s study that the phthalate OR was 2.93, and from there we calculated a population attributable fraction of about 44%. So, about 44% of childhood asthma in that population could be attributed to phthalates; but that does not necessarily mean that the remaining 56% of asthma is caused by other sources. It could be less, it could be more. How so?

There may be other causes of asthma that play a bigger role. Say dust allergy leading to asthma carries an RR of 3.0; that would mean something like 66% of childhood asthma may be explained by exposure to dust particles. The fractions can sum to more than 100% because the factors overlap: households with phthalates also have dust particles, and each contributes to asthma to a certain extent. Thus removing phthalates entirely would remove a certain proportion of asthma, but then the remaining “causal factors” would take their place. The fractions might also sum to less than 100%, in which case the remaining share of the disease is explained by unknown variables. We now turn to the next issue: how do we even know which factors go into the causal pie?

Sir Austin Bradford Hill and his criteria

A poster from the original talk by Sir Austin Bradford Hill. Click on the page to visit an interesting webpage for reading the different viewpoints on cause and causality in the health sciences

In 1965, at a meeting of the Section of Occupational Medicine of the Royal Society of Medicine in London, Professor Austin Bradford Hill took the stage. His talk was about a critical issue in science: how we would deduce that X is a cause of Y when it came to diseases and public health issues. He cited the instance of the association between smoking and lung cancer. In that talk, Hill proposed nine conditions (he termed them viewpoints, but they were later, somewhat arguably, referred to as criteria, so much so that they are now known as “Hill’s criteria”) for understanding the causal linkage between an exposure and an outcome. His illustrative examples were drawn from the then US Surgeon General’s report on Smoking and Health, his own research on the British doctors’ studies, and his years of experience with occupational health:

  1. Strength of Association. — The first of the conditions was that the stronger the association between the exposure and the health effect, the more likely it is that the association is causal. Weak associations, on the other hand, are more easily explained away by undetected biases. To some extent, the justification lies in the attributable risk percent: a strong association accounts for a larger share of the outcome, and it is also harder to nullify with rival hypotheses such as unmeasured confounders with stronger effects. It is not necessarily true, however, that weak associations are non-causal.
  2. Consistency. — By consistency of association, Hill meant that under different circumstances and in different situations, the agent of exposure (in our case, the environmental agent) would show a similar or comparable level of association. If the association is one where an increase in environmental exposure is associated with an increased level of the health outcome (say ETS exposure and lung disease), then in different situations and different populations it should remain roughly the same, or similar enough to pass the test of consistency.
  3. Specificity. — By specificity, Hill meant that one exposure would lead to a very specific set of effects, or what we today would call a one-to-one relationship. While this is intuitive, it is not necessarily true, as one exposure can lead to more than one effect (for example, ETS can lead not only to lung cancer but also to other diseases, for instance heart disease); similarly, observing lung cancer does not necessarily mean that people were exposed to ETS alone. Having said that, where specificity does hold, it strengthens the claim of a cause and effect linkage between the exposure and the outcome.
  4. Temporality. — Temporality derives from the Latin word for “time”. By temporality, Hill indicated that the presumed cause should precede the emergence of the health outcome. In Hill’s own words, it is a matter of putting the horse before the cart, as it is the horse that draws the cart: the horse in this case being the exposure, and the cart the outcome. Of all the conditions he outlined in his almost canonical exposition, this one is perhaps the most robust: if E is to qualify as a cause of the outcome O, then E must precede O in time.
  5. Biological Gradient. — For the environmental health sciences, this is better appreciated as the dose response curve, in the sense Hill put it: as the level of exposure increases, so, correspondingly, should the level of the outcome. Intuitively this sounds right, and indeed a robust causal relationship would often show it. But the converse does not hold: even if the dose response relationship does not follow a simple pattern of response rising with dose, the association can still be causal, since the precise relationship can be non-linear or curvilinear, or can show a threshold or a ceiling effect.
  6. Plausibility. — This refers to the biological plausibility that qualifies the association: is there a biological mechanism that we know of that can account for it? While intuitive and straightforward in most cases of linking an exposure and a disease, the reverse, again, is not required for the establishment of causation. The history of public health is replete with examples where, even though a biological mechanism had not been established, the association was robust enough that action could be taken. For example, when John Snow investigated the London cholera epidemic of 1854, nothing was known about the role of Vibrio cholerae in causing the diarrhoeal signs and symptoms he was investigating; yet he deduced the role of water in the causation of the disease and was wise enough to identify the specific pump and take action. This is particularly relevant to environmental epidemiology, where a precise biological mechanism is often not known, yet a causal association cannot be ruled out.
  7. Coherence. — Coherence, in the sense we infer from Hill's work, means that if the exposure and the outcome are causally linked, the link should not conflict with what we already know about the nature of such associations. Again, the converse does not hold: a lack of coherence with what is expected does not necessarily refute a causal linkage (and this is certainly not how Hill defined it).
  8. Experimental Evidence. — Experimental evidence means that, given what we know about the biology of the causation, we should be able to replicate the effect in the laboratory (perhaps on isolated human tissues, perhaps in non-human animals) and observe some of the same effects. While such experimental confirmation is good evidence that the exposure causes the outcome, the lack of it does not necessarily refute the claim of causation. Sometimes the effect cannot be replicated in non-human animals because no faithful animal model exists; at other times, the milieu present in human systems cannot be recreated under laboratory conditions.
  9. Analogy. — Finally, analogy refers to drawing on comparable situations where a similar association has been established. Take the ETS and lung cancer example. We now know that tobacco smoke itself is causally responsible for lung cancer; it therefore stands to reason that ETS, which is nothing but tobacco smoke dispersed in the environment, could likewise be responsible for lung cancer in exposed individuals.

These nine criteria provide a framework and guidelines for judging whether an association is more likely than not to be one of cause and effect. They are conditions and guidelines rather than hard-and-fast criteria to be satisfied in every case; at least in the context of environmental health and environmental epidemiology, not all of them can be established beyond doubt in most situations. The one that is truly essential from the standpoint of establishing causality is temporality: the cause must precede the outcome. This becomes an important consideration for the study design section, to which we turn next.

--

Arindam Basu
Environment, Epidemiology, Climate

Medical Doctor and an Associate Professor of Epidemiology and Environmental Health at the University of Canterbury. Founder of TwinMe,