Overview of (Environmental) Epidemiological Principles
Summary
This is a review of the key principles of environmental epidemiology. We start with a definition of epidemiology as the study of the distribution and determinants of health-related states and the application of that knowledge to improve health states and prevent illnesses in populations. We state what is meant by the description of disease states in populations (prevalence, incidence, and standardised rates), then discuss what is meant by determinants of diseases (valid association and causal association) and the means to identify valid associations (ecological studies, case series, cross-sectional surveys, case control studies, and cohort studies). Lastly, we describe how this information can be applied to improve health states and prevent illnesses (using attributable risk % and related measures, we show how preventive efforts can be prioritised). The focus of this lesson is on the environment; in the context of environmental health and environmental epidemiology, by environment we mean environmental parameters as risk factors.
Epidemiology and Environmental Epidemiology
Epidemiology is defined as the study of the distribution and determinants of health-related states and the application of that information to improve health and prevent illnesses in populations.

From here, we derive three connected components:
- Distribution of diseases or health states in populations. — These are measured by the prevalence and incidence rates of diseases; for comparison, prevalence and incidence rates are compared with each other directly, or indirectly against a standard reference population (such as the Segi world standard population).
- Determinants of diseases/health states in populations. — In the case of environmental epidemiology or environmental health, this refers to environmental determinants. For our study, we have already defined environment as that component where the environmental parameters are caused or engendered by human beings and where the effects are likewise measured on other humans. The determinants that link environmental exposure agents to health effects or outcomes can be causal or non-causal. Before accepting any such link as a valid association, we want to ensure that the health outcomes or health effects did not occur with respect to the exposure merely through a play of chance, that they did not arise from systematic errors in the measurement of the exposure and/or outcomes, and that they were not related to a "third" mutually associated exposure or entity (that is, they are not 'confounded' by any other variable). In addition, when we are considering causal associations, we should also be sure that the environmental exposure agent under consideration actually preceded the health outcome of interest, that as the supposed "causal variable" increased in intensity there was a corresponding increase or alteration in the outcome (a dose-response effect), and that these associations were repeatedly observed in a range of different situations.
- Use of Epidemiological Knowledge to improve health-related states and prevent illnesses. — Epidemiological knowledge can be invaluable in addressing prevention and health promotion activities; in the context of environmental health management, we shall focus on the use of attributable risks to set priorities or test the effectiveness of specific health-related programmes aimed at improving public health.
Measures of the Distribution of Diseases in Populations
Three common measures used to describe the distribution of diseases in populations are prevalence, incidence, and standardised mortality and morbidity ratios. Prevalence refers to the probability of occurrence of a disease in a population and is given by:
Prevalence = (Total number of cases of a disease / Total population) * Base population
An example will make this clear. Using data from the New Zealand Health Survey, let's look at the prevalence of hypertension (high blood pressure):

As can be seen, about 16% of New Zealand residents reported hypertension (high blood pressure) that was managed with medication, and the prevalence of high blood pressure increases with age.
From the above figure, and given the prevalence of high blood pressure, can you work out the total number of people with hypertension in New Zealand in 2012?
Types of Prevalence. — Two types of prevalence are period prevalence and point prevalence. Point prevalence refers to the prevalence of a disease condition that exists at one particular point in time, and period prevalence refers to the prevalence of disease over a period of time. For example, in the above table, the total prevalence of hypertension in New Zealand refers to the period of 2012.
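To make the arithmetic concrete, here is a minimal sketch in Python; the population and case counts are assumed stand-ins for illustration, not the actual survey figures:

```python
def prevalence(cases, population, base=100):
    """Prevalence per `base` people: (cases / population) * base."""
    return cases / population * base

# Hypothetical stand-in figures, for illustration only
nz_population = 4_400_000       # assumed total population
treated_hypertension = 704_000  # assumed number of treated cases

print(prevalence(treated_hypertension, nz_population))  # 16.0 per 100

# Reversing the formula answers the question above:
# cases = (prevalence / base) * population
print(0.16 * nz_population)     # about 704,000 people
```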
Incidence. — The second measure of disease distribution in populations that we shall discuss is incidence. Incidence refers to the total number of new cases of a disease that occur in a population over a period of time. The population under study should be free of the disease at the beginning of the study. For example, let's say you take 1000 individuals at the beginning of a study, follow them for a year, and at the end of the year find that 15 individuals ended up with a new diagnosis of high blood pressure (all of these individuals were known to have normal blood pressure at the beginning of the year). The incidence of hypertension is given by:
Incidence = (total number of new cases of a disease / total number of individuals susceptible to developing the disease at the beginning of the observation period) * Base population
Let's illustrate this with the previous example. If you found 15 new cases of hypertension in the first year of observing 1000 individuals who did not have high blood pressure to start with, then the incidence rate would be 15 per 1000 per year. This type of incidence calculation is known as "cumulative incidence", as you are calculating an overall estimate. Another way might be to follow a number of people for a fixed period of time and observe the number of cases of a particular disease that emerge. For example, let's say we follow 1000 women for 5 years and in those five years we find 50 women are diagnosed with breast cancer. Since 1000 women are followed up for 5 years, this counts as 5000 women-years. The fact that 50 women developed breast cancer over five years in this situation would count as 50/5000 women-years, or 10 per 1000 women-years of observation. This kind of incidence calculation is known as the "incidence density rate".
So, when do we calculate cumulative incidence and when do we calculate the incidence density rate? There is nothing hard and fast about it, but in general the incidence density rate provides a person-time estimate of the force of the disease: when you have prospectively collected data in which individuals are followed for varying lengths of time, you can estimate incidence density rates.
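A small sketch of both calculations, using the numbers from the examples above:

```python
def cumulative_incidence(new_cases, at_risk, base=1000):
    """New cases per `base` initially disease-free people."""
    return new_cases / at_risk * base

def incidence_density(new_cases, person_time, base=1000):
    """New cases per `base` units of person-time (e.g. women-years)."""
    return new_cases / person_time * base

# 15 new hypertension diagnoses among 1000 initially normotensive people
print(cumulative_incidence(15, 1000))   # 15.0 per 1000 per year

# 50 breast cancers among 1000 women followed 5 years = 5000 women-years
print(incidence_density(50, 1000 * 5))  # 10.0 per 1000 women-years
```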
The third type of measure for the description of disease conditions is known as standardised rates. Consider this situation: you know the rates of hypertension in NZ for the year 2012, but you want to know whether there has been a substantial increase in the prevalence of high blood pressure in NZ since 2009. How do you establish this? The population, of course, has changed in composition between the two years. Another comparison might be between two countries, say India and New Zealand: you are interested to see whether there are substantial differences in the prevalence of hypertension between the two countries from the reports you compile on the Internet. This is where standardised mortality and morbidity ratios become so useful.
Illustrative Example
Let's illustrate this with an example. Say we want to find out whether the rate of high blood pressure has increased in New Zealand over the years. To investigate this, we shall use a third population (in this case the Segi world population, see below), and using this population we shall directly (head to head) compare the rates of high blood pressure in 2009 and 2012 in New Zealand. So, let's get started.
Recipe for direct standardisation
- First, let's get the Segi world population (the Google Spreadsheet file here) and line up the two other tables. Now we have the two New Zealand populations and the Segi population as follows:

Next, we estimate the number of people with high blood pressure in each of the populations (2012 and 2009) for each of the age groups. We do this by multiplying the age-specific prevalence figures by the numbers in the corresponding Segi population age groups. We get the following figures:

So, what do we see? Using a standard population (the Segi world population), there are more people with treated hypertension or high blood pressure in 2012 (784) than in 2009 (724). The summary measure is expressed in the form of a standardised morbidity ratio (SMR) as:
SMR = Total Number of Cases in Population A / Total Number of Cases in Population B
In our case the SMR would be 1.08 (784/724); this indicates that, compared with 2009, the prevalence of treated high blood pressure had increased by 2012 in New Zealand.
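Here is a minimal sketch of the recipe in Python; the age bands, age-specific rates, and standard population counts are toy values, not the actual spreadsheet figures:

```python
# Direct standardisation: apply each year's age-specific rates
# to the same standard population, then compare expected counts.

segi_population = {"0-29": 5500, "30-59": 3500, "60+": 1000}  # toy standard

# Hypothetical age-specific prevalence of treated hypertension, per 100
rates_2009 = {"0-29": 1.0, "30-59": 8.0, "60+": 35.0}
rates_2012 = {"0-29": 1.2, "30-59": 8.5, "60+": 37.0}

def expected_cases(rates_per_100, standard):
    """Cases expected if these rates applied to the standard population."""
    return sum(rates_per_100[age] / 100 * n for age, n in standard.items())

cases_2009 = expected_cases(rates_2009, segi_population)
cases_2012 = expected_cases(rates_2012, segi_population)

print(cases_2009, cases_2012, round(cases_2012 / cases_2009, 2))
# 685.0 733.5 1.07  -> the ratio of standardised counts is the SMR
```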
So, these are ways in which you can make sense of the distribution of diseases in populations and measure how they have changed over time. What does this mean for environmental health investigations? For instance, imagine a new source of pollution has been set up at a site and the number of reported cases of a particular disease has gone up as well. How do we know whether that is indeed the case? We can compare with the regional, national, or expected rates using this approach: we take the age-specific figures (or indeed any other specific rates), calculate the total number of cases, and compare them to see whether that may indeed be the case.
Internal validity: chance, bias, and confounding
Epidemiology in general and environmental epidemiology in particular go deeper than just describing health states, and one of the attractive aspects of environmental epidemiology, as we argued previously, is identifying hazards and characterising risks. How do we do that? Characterising risks involves, at the least, finding associations between environmental agents and the specific health effects that result from exposure to those agents. These agents could be anywhere: in our general environment, our occupational environment, all around us. They could be in the air we breathe, the water we drink, the soil we tread, the vehicles we drive, the workplaces we work in; there are myriad ways in which they affect us. But how do we find out? Before delving into the measures of association, let us briefly review what is meant by a valid association and how we establish a valid cause and effect relationship.
Take, for example, the claim that environmental tobacco smoke (ETS) is associated with chronic lung disease. When you make this claim, skeptics would raise three issues:
- Chance. — The association could be just a matter of chance occurrence; there is really no association
- Bias. — It is possible that you are making this claim because of an erroneous set of observations; you made wrong observations, you are biased
- Confounding. — Or it is not really environmental tobacco smoke but something else, some third factor that is associated with both environmental tobacco smoke and the chronic lung disease in question. For example, we know that those who are exposed to environmental tobacco smoke are also likely to belong to low socioeconomic status, and those of low socioeconomic status are also more likely to suffer from chronic lung disease than those who are better off in society. So the relationship you are seeing between environmental tobacco smoke and lung disease may actually be a manifestation of being worse off, rather than of exposure to environmental tobacco smoke. How can you tell?
So you see, it is often a matter of going beyond just the play of numbers. Let's address the first issue: how do we know that the association we see is not due to chance alone? As we shall learn, this is trickier than it first seems, and along the way we shall learn a few things: the role of type I and type II errors, and p-values and how to make sense of them.
To keep things simple, we start with this two by two table. Note it well:
Before we begin our study

The process of any scientific investigation, or any systematic study, proceeds from the planning phase itself. This holds true for all epidemiological studies as well, so we start here. The task of epidemiological studies is to test hypotheses. The hypotheses sit beneath "theses": initial ideas about the state of the world, or our theories explaining the world as it is. Let's pick up our hypothetical study where we'd like to show that environmental tobacco smoke causes chronic lung disease. We have figured out how we'd like to measure environmental tobacco smoke (say we use urine cotinine to establish that individuals are exposed to environmental tobacco smoke), and we have also figured out a way to measure the extent of lung disease (more on this in the study designs section); so now the challenge is to pitch our hypotheses. We have a theory that environmental tobacco smoke causes or leads to chronic lung disease. So people with chronic lung disease are also more likely than others to have been exposed to higher levels of environmental tobacco smoke. We state this as follows:
H1: Those who show higher amounts of urine cotinine (and are thereby considered exposed) are at higher risk of chronic lung disease
H0: (this is the counter-hypothesis, stating that there is no association between environmental tobacco smoke and chronic lung disease) the levels of urine cotinine, or measures of the exposure variable, will be similar among those with and without chronic lung disease
In the above statements, H1 refers to the "alternative" hypothesis, and H0 refers to the "null hypothesis" (note that it leaves room for equivalence). In our studies, we can either reject the null hypothesis or, on the basis of our findings, fail to reject the null hypothesis (note that everything is considered from the point of view of the null hypothesis). We reset the above table thus:

Before we begin our investigation, we state that we may find sufficient evidence to reject the null hypothesis that ETS has no association with chronic lung disease (that is, that with or without ETS exposure, the rates of chronic lung disease would remain pretty much the same). But we can also be mistaken, and there are two ways we can commit errors:
- On the basis of our study, we reject the claim of the null hypothesis while the null hypothesis was actually correct. We'd like to be conservative about this error, and we can set a probability limit on it. That is, we can state that we shall be mistaken no more than 5% of the time, or that our probability of error will be 0.05. This is also referred to as the alpha error.
- Or we can, on the basis of our study, claim that we did not find sufficient evidence to reject the null hypothesis when the null hypothesis was actually false (that is, ETS exposure does increase the risk of chronic lung disease; the "direction" of the association is often important, as in our case, but sometimes it is "just" a matter of association). We can go a little easier on this one and accept that the probability of error here will be about 20% (although you can be tight on this error as well). This is also known as the beta error.
- Note that in addition to these two errors, we can also be correct on two counts. In particular, in the top right quadrant, we are correct if we reject the null when the null is indeed false; this is the "power" of our study. It is no surprise that this is referred to as the power of the study, and it is expressed in terms of the beta error, that is:
power = 1 - beta error
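To see the alpha and beta errors in action, here is a small simulation sketch in Python; the group size and exposure proportions are illustrative assumptions, not estimates from any real study:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

def rejects_null(n, p_exposed_cases, p_exposed_controls, alpha=0.05):
    """Simulate one study comparing exposure proportions in two groups
    of size n with a two-proportion z-test; True if p < alpha."""
    x1 = rng.binomial(n, p_exposed_cases)
    x2 = rng.binomial(n, p_exposed_controls)
    p_pool = (x1 + x2) / (2 * n)
    se = np.sqrt(p_pool * (1 - p_pool) * 2 / n)
    if se == 0:
        return False
    z = (x1 - x2) / n / se
    p_value = 2 * stats.norm.sf(abs(z))
    return p_value < alpha

sims = 5000
# Null true (both groups 40% exposed): rejection rate approximates alpha
alpha_hat = np.mean([rejects_null(200, 0.40, 0.40) for _ in range(sims)])
# Null false (57% vs 40% exposed): rejection rate approximates power
power_hat = np.mean([rejects_null(200, 0.57, 0.40) for _ in range(sims)])
print(f"type I error: {alpha_hat:.3f}, power: {power_hat:.3f}")
```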
What do we do with this information to address the chance factor?
While this little bit is important to understand, there are a couple of things we do with this information:
1. We use this information to calculate how large a sample we shall need for our study
Let's dwell on this for a bit. Say we are interested in conducting a study ourselves to find out the association between exposure to ETS and the risk of lung cancer. (A number of studies have already been conducted on this topic.)
So, can we invoke our table above and figure out how many people we need for the study? This requires some discussion and an idea of what kind of study we are going to do (more on this in the study design section). Here, as we are dealing with a rare disease (lung cancer), we start our sampling from people with and without lung cancer, and we shall see whether there is sufficient evidence that the likelihood of exposure to ETS is higher for people with lung cancer than for people without lung cancer (this is known as a case control study). The quantity we are interested in is the extent to which lung cancer sufferers are more exposed to ETS compared with those who do not have lung cancer. That difference is clinically or experientially determined (also known as a substantive difference). Let's say we think that people with lung cancer should be twice as likely to be exposed to ETS as those without cancer, so we set our odds ratio (a measure of effect size in a case control study) at 2.0. We also need to know the extent of exposure in the general population (that is, among those not suffering from lung cancer); these people are known as controls, and we need this number as well. Quite often this cannot be determined accurately (see, for example, studies such as this). Let's say we set a urine cotinine level above 8 ug as the marker of high exposure to ETS and consider that about 40% of our control sample would be exposed at this level; an odds ratio of 2.0 then implies that about 57% of people with lung cancer would be exposed to ETS. If our study found a difference of that order, beyond what chance would explain, we would be satisfied that there was a valid association between ETS and lung cancer. This information lets us work out how many people we need for our study. For more information on how to estimate sample sizes for your various epidemiological studies, see the openepi website and start plugging in your numbers.
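As a sketch of the arithmetic behind such calculators, here is the standard two-proportion sample size formula in Python; the exposure prevalence, odds ratio, alpha, and power are the assumptions stated above, and the function name is ours:

```python
from math import ceil, sqrt
from scipy.stats import norm

def case_control_sample_size(p0, odds_ratio, alpha=0.05, power=0.80):
    """Per-group sample size to detect a difference in exposure
    proportions between cases and controls (two-proportion z-test)."""
    # Exposure proportion among cases implied by the odds ratio
    odds0 = p0 / (1 - p0)
    p1 = odds_ratio * odds0 / (1 + odds_ratio * odds0)
    z_alpha = norm.ppf(1 - alpha / 2)  # two-sided alpha
    z_beta = norm.ppf(power)           # power = 1 - beta
    p_bar = (p0 + p1) / 2
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * sqrt(p0 * (1 - p0) + p1 * (1 - p1))) ** 2
    return ceil(numerator / (p1 - p0) ** 2)

# 40% exposure among controls, target odds ratio of 2.0
print(case_control_sample_size(0.40, 2.0))  # about 130-140 per group
```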
2. After we have conducted the study
Having specified our alpha error level in advance, we go ahead, collect our data, and conduct the data analysis. The results of the data analysis provide us with a measure of probability: if the null were true, what is the probability of finding what we found (or something more extreme)? This probability estimate is the p-value that we see reported in papers. The p-value does not have an intrinsic significance of its own (in theory it should not matter whether the threshold is 0.05 or whatever it is set at); it is just a probability estimate of finding the specific set of findings that we report.
Let's illustrate this with an actual study conducted on environmental tobacco smoke and lung cancer. Paolo Boffetta and his colleagues conducted a large case control study on the association between ETS and lung cancer in 12 centres in seven European countries over six years, and published the results in 1998; read the full study here:
http://jnci.oxfordjournals.org/content/90/19/1440.full
A total of 650 patients with lung cancer and 1542 persons without lung cancer (controls) were studied in this research. Environmental tobacco smoke exposure was determined using a survey questionnaire, and odds ratios were estimated to reflect the association between exposure and outcome. A detailed description of the study is beyond the scope of this tutorial, so only a brief snapshot is presented of the findings for people who were exposed to ETS both in their workplaces and from their spouses; see the figure below.
We can look at many different figures here, but for the sake of learning, let's review the figures for the section on duration of exposure (hours/day * years). We see that, compared with the baseline condition, the risk increased as the duration of exposure increased (1.00 for baseline, then 0.91 for the 0–165 band, then 1.31 and 1.46 for progressively increasing levels of exposure). Note a couple of things here. First, the effect sizes do not seem very big, and it is partly the large sample size that allowed them to be picked up at all: the larger your sample size, the smaller the effect size you can detect. Second, note that all the 95% confidence intervals straddle 1.00, so from a confidence interval perspective you may state that none of these findings was statistically significant. Third, note that as the extent of exposure increased, the corresponding effect size also increased; this is an important finding as far as dose-response effects are concerned. We shall review these concepts later, but for the time being, this is an example of how results are presented in the context of an epidemiological study.
Biases in Epidemiological Studies
Chance, therefore, in the context of an epidemiological study, is something to be ruled out by using statistical tests and hypothesis testing, and by conducting the study with an adequate sample size with respect to the levels of exposure and the substantive effects sought. The next issue is bias. Biases are systematic errors in observation and data analysis that can lead to erroneous conclusions about the nature of the association between an exposure and an outcome. We are concerned with two forms of bias in epidemiological studies: selection bias and response bias, both of which undermine the conclusions drawn from epidemiological studies. Selection bias refers to what happens when an investigator preferentially selects or omits specific groups from a study. Take the previous example of the case control study on the association between environmental tobacco smoke and lung cancer. If the lung cancer patients and the non-cancer controls were selected differently (say the lung cancer patients were selected from among the hospital cancer ward patients, while the controls were selected not from another ward of the hospital but from a population that would not be exposed to environmental tobacco smoke of any sort over a long period, say a prison environment), then the investigators would have introduced an adverse selection that would in turn increase the likelihood of finding a positive result. This is erroneous. On the other hand, it could also be that the measurement of environmental tobacco smoke exposure among the cases and controls was done differently. If the exposure assessment was done using a questionnaire, and the respondents who were cases knew that they had cancer and that the purpose of the study was to identify the linkage between ETS exposure and lung cancer, it is quite likely that they would over-report their exposure to ETS; this in turn would push the effect sizes towards the positive, and you'd be none the wiser. This type of bias is referred to as response bias (the responses of the participants invalidate the study results, even though the selection was fine).
The topic of bias is broad, and we have barely touched on the two most essential types of bias encountered in the literature. There are other forms of bias: for example, where the errors in measurement are of equal extent in all comparison groups (randomly distributed), the results tend to be pulled towards the null; whereas in situations where the biases are non-random, we do not know in which direction the bias will go. In general, it is best that any source of bias is eliminated at the study design phase.
How do we eliminate bias in the study design phase? One way is to be careful about the selection of participants, and to design the questionnaire or data collection instrument so that the investigator's involvement at the data collection phase is minimised to the extent that no subjective interpretation can influence the results. What this means is: rather than using questionnaires to collect data, use something like urine cotinine or other measures to collect data on ETS exposure; data that are objective rather than subjective. In addition, the interviewers need to be trained properly so that any subjective assessment during the interview is minimised as much as possible. Third, although not always possible in the context of environmental epidemiological studies, in other contexts one can use some sort of masking or "blinding" so that the interviewers or investigators do not know which comparison group each member belongs to. All in all, the best way to deal with bias is to eliminate any possible sources of bias at the stage of planning the study rather than at the end of it.
Role of Confounding Variables
The third important consideration for assessment of valid association is the role of confounding variables. This is best explored graphically, see the figure below:

A confounding variable is one that, in a sense, "confuses" any association that we see between the exposure variable of interest and the health outcome under discussion. Let's review this in the light of the environmental tobacco smoke and lung cancer example. What might be some variables that sit between ETS and lung cancer and are associated with both? Bennett et al (1998) conducted a case control study on the association between ETS and lung cancer among non-smoking women; in that study, the authors considered several gene variants, and the nature of the study suggests that gender is a confounding variable as well (men are more likely to suffer from lung cancer anyway and may also be more likely to be exposed to ETS than women). Other variables, such as jobs that expose people to both ETS and other known carcinogens, socioeconomic factors, and radon exposure, need to be considered in this context. Note that in none of these cases would the confounding factor "result" from being exposed to ETS or sit in the causal pathway to lung cancer. In most cases, age, gender, and socioeconomic status are important variables that confound relationships between specific exposures and health outcomes, and they need to be considered carefully in analyses.
In general, three strategies are used to control for the effect of confounding variables: restriction, matching, and multivariate methods. Restriction refers to selecting participants so that variables known to confound the linkage between exposure and outcome are removed at the planning stage. For example, in the context of a study on the association between ETS and lung cancer, if we suspect that gender would be a potential confounding variable, then restricting the study to non-smoking women would eliminate the potential confounding we could expect from gender. A downside of this strategy, as can be argued, is that there is a smaller sample to play with, as participants with certain characteristics are excluded; this restricts the sample size, and the study can end up underpowered and restrictive. Also, the findings from such a study cannot be extended to men, for instance. The second strategy is matching. In the context of case control studies, matching means that cases and controls are of similar age and gender; often cases and controls of the same age and gender are paired. If there is any suspicion that socioeconomic status can influence an association between an exposure and an outcome, then cases and controls can be taken from the same families to adjust for the effect of socioeconomic status; this is in addition to restricting the study to certain strata of socioeconomic position. Once more, the downside of this strategy is that overmatching, whether within families or by other means, can reduce the effective sample size and cause other issues with effect estimates. The third strategy is to adjust for the effects of confounding at the data analysis stage: the potential confounding variables are entered into a multivariate model and the effects are then adjusted for. For example, Carl Bornehag and colleagues investigated the association between phthalates (compounds used for making PVC, plastics, and other household products) and the risk of asthma in a sample of 400 children (198 cases and 202 controls). While the phthalate data were collected from the children's urine, Bornehag et al also collected data on flooring, housing type, and other variables and used them in a multivariate logistic regression model (more on this later) to test the association between these various factors and the risk of asthma in children; see the table below:
As you can see in the table above, Bornehag and colleagues considered several variables in their study, and instead of restricting the analysis to only one or a few such variables, they collected all of this information and later entered it together into their data analysis model. We shall discuss logistic regression, and why and how such models are set up, in our module on case control studies; for the time being, the point is to illustrate that various variables are typically included at the analysis phase to control for confounding.
Chance, Bias, and Confounding
In summary, chance, bias, and confounding are three factors that must be considered whenever a linkage between an exposure and an outcome is considered. Typically, when the sample size and the effect estimates are well thought out, when the different theoretical and practical confounding variables are accounted for, and when biases are eliminated, you can say that the association reported is a valid association. This is also referred to as the internal validity of the study. We still need to consider whether this association is a causal association, which is another large topic; so before heading in that direction, let's define a few important measures of association.
Measures of Association
We are going to review four measures of association (risk difference, attributable risk and population attributable risks, risk ratios, and odds ratio) in this section. These measures of association are applied in different contexts in epidemiological studies.
Risk Difference (Attributable Risk). — This is the difference in the incidence rates between those with an exposure and those without. Environmental epidemiology essentially deals with what are often referred to as observational studies, where we only get to observe the health effects or outcomes resulting from specific exposures rather than conduct interventions. A study design where the risk difference or attributable risk becomes relevant is the cohort study (more on this in the study design section). In a cohort study, we study the rate of outcomes among individuals who are exposed to a factor of interest and those who are not, following both groups over time. Let Ie represent the incidence rate of the health effect among the exposed and Io the incidence rate among the non-exposed. Then the risk difference is given by:
Risk Difference (RD) = Ie - Io
Risk difference is also known as attributable risk. Two related terms are attributable risk % (AR%) and population attributable risk % (PAR%). The attributable risk percent provides an estimate of how much of the disease among the exposed can be attributed to the exposure and is given by:
AR% = [(Ie - Io) / Ie] * 100, or equivalently AR% = [(RR - 1) / RR] * 100, where RR = relative risk or rate ratio (see below)
The PAR% provides an estimate of the proportional reduction in the disease in the population if the exposure were reduced to the level labelled "non-exposed", or minimised. The formula is given by:
PAR% = [Pe * (RRe - 1) / (1 + Pe * (RRe - 1))] * 100
where PAR% = population attributable risk percent; Pe = prevalence of exposure in the population; RRe = rate ratio or relative risk of the disease as a result of the exposure
A good page to review these concepts is this one from the University of Ottawa: "The Population Attributable Risk (or Population Attributable Fraction) indicates the number (or proportion) of…" (www.med.uottawa.ca)
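As a quick sketch, here are the two formulas in Python; the example numbers are the phthalate figures we revisit later in this section (RR = 2.93, exposure prevalence 40%), used purely for illustration:

```python
def attributable_risk_percent(rr):
    """AR% = (RR - 1) / RR * 100: share of disease among the exposed
    attributable to the exposure."""
    return (rr - 1) / rr * 100

def population_attributable_risk_percent(pe, rr):
    """PAR% = Pe*(RR - 1) / (1 + Pe*(RR - 1)) * 100: share of disease
    in the whole population attributable to the exposure."""
    excess = pe * (rr - 1)
    return excess / (1 + excess) * 100

# Illustrative values, echoed in the phthalate example below
print(round(attributable_risk_percent(2.93), 1))                   # 65.9
print(round(population_attributable_risk_percent(0.40, 2.93), 1))  # 43.6
```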
Rate Ratios and Relative Risks. — This brings us to the next concept, the relative risk. The relative risk or rate ratio is the ratio of the rate of disease in the exposed versus the non-exposed. Consider a cohort study where exposed and non-exposed individuals are followed over a period of time to study the incidence of disease or health outcomes in each category. The ratio of the rates of disease occurrence is known as the rate ratio or relative risk, and is given by:
Relative Risk or Rate Ratio (RR) = Ie / Io, where Ie = incidence of the disease or health condition in the exposed group and Io = incidence of the disease or health effect in the non-exposed group.
Concepts of Odds and Odds Ratios
Prevalence, for instance, provides us with an idea of the probability of a disease in the population. We found that the prevalence of treated hypertension in NZ is about 16%, which means roughly 16 per 100 people in NZ suffer from high blood pressure and are treated. That also means that (100 - 16)/100 = 84% of people in NZ do not suffer from treated high blood pressure. So the odds of having high blood pressure versus not having high blood pressure (either normal or low BP) in the NZ population are about 16:84, which is 4 to 21, or roughly 1 to 5. So we define odds as:
Odds = Probability that an event occurs / Probability that an event does not occur
The Odds Ratio is then simply the ratio of two odds. Let's review where, in epidemiology in general and environmental epidemiology in particular, we are going to see them. Refer to the following table:

The above table shows a typical case control scenario. As we have been discussing, in a case control study we sample people on the basis of whether or not they have the disease of interest. In the ETS and lung cancer example, if we were to set up a case control study, we'd sample cases from among people who have lung cancer and controls from among people who are free from lung cancer. Then, using our measures of ETS, we'd estimate the level of exposure in each group. In this way, we find the likelihood of exposure to ETS (the measured exposure of our interest) among the cases as opposed to the controls. From the above table, we see that out of A+C cases, A people had the exposure. So the probability of being exposed for cases is A/(A+C); similarly, the probability of not being exposed for cases is C/(A+C). If we now consider the odds of exposure for cases, we see that they come out to A/C (can you work out why?). I leave it to you to reason out what happens with the controls, but you can see that the odds of exposure for controls will be B/D.
If we put the two Odds together to calculate the Odds Ratio, the Odds Ratio comes out to be:
Odds Ratio = (A/C) / (B/D) or [A * D ]/ [B * C] or a cross product.
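A minimal sketch of the cross product, with made-up cell counts for illustration:

```python
def odds_ratio(a, b, c, d):
    """Cross product OR = (A*D) / (B*C), where
    A = exposed cases,   B = exposed controls,
    C = unexposed cases, D = unexposed controls."""
    return (a * d) / (b * c)

# Made-up counts: 80 of 100 cases exposed, 40 of 100 controls exposed
print(odds_ratio(80, 40, 20, 60))  # (80*60)/(40*20) = 6.0
```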
Relationship between Risk Ratios and Odds Ratios. — For many practical purposes, odds ratios are treated as relative risks or rate ratios (that is, as measures of relative risk). The point to note is that odds ratios closely approximate relative risks for diseases that are rare (so, assuming a disease is rare, odds ratios and relative risks can be used interchangeably). What this means for environmental health conditions is that you can use the measures that depend on relative risks (attributable risks, population attributable risks, and PAR%s) as if you were using relative risk estimates.
Here is an example. In the phthalate study (see above), we noted that compared with the lowest quartile of exposure, the OR for asthma in children in the highest quartile of phthalate exposure was 2.93. This would indicate that if we were to bring the phthalate concentration in the environment down to the equivalent of the lowest quartile of exposure in this population, we would account for about 66% of the asthma cases among the exposed that are due to phthalates alone (to see how this calculation was done, use (RR - 1) * 100 / RR, put RR = 2.93 from the above table, and re-run the calculation).
Say that in a population about 40% of children are exposed to high concentrations of phthalates, and we agree that the relative risk of asthma for the highest versus the lowest level of phthalate exposure is about 2.93. Then it follows that if we were to bring the phthalate concentration down to the lowest quartile, we would prevent Pe * (RR - 1) / (1 + Pe * (RR - 1)) = 0.40 * 1.93 / (1 + 0.40 * 1.93), or about 43.6%, nearly 44%, of the asthma cases in the community by reducing phthalates alone. However, even as we write this, we need to make sure that we have established, or at least figured out, that phthalates are _causally_ related to childhood asthma. After ruling out the play of chance, eliminating significant biases from the study design so that we have a sound study, and controlling for confounding variables, we have established that our finding of an association between exposure and outcome is internally valid. But have we established a causal linkage? The answer is necessarily "no", unless we have examined some other sources of information that give us a closer appreciation of what we mean by a cause and effect linkage, to which we now turn.
How do we Judge Causality from Valid Association
When we use measures of association to understand how much of a disease could be prevented if a certain exposure were removed or reduced to the lowest possible level, or how much damage we are incurring at the present level of exposure to a certain environmental agent, we implicitly believe that this particular environmental agent (call it "E") is "causing" the health outcome that we are investigating. For example, if we were to state that by reducing ETS to a negligible amount we would reduce the prevalence of lung cancer in our population by x amount, we state this under the assumption that the ETS and lung cancer association goes beyond mere association: that they are connected in a cause and effect relationship. It is the "nature" of this relationship that is very important to discern. We shall elucidate this nature of association (causal association) using two related concepts: one, the counterfactual concept, that is, what would happen to the health outcome in the absence of the exposure; and two, using "criteria", conditions, or a general framework to examine what constitutes a cause and effect relationship. Let's check out the counterfactual theory first.
Counterfactual theories of Causality & Rothman’s Pie
Attributable Risk % and what it means. — Let me tell you the story of Mrs B, our 80-year-old neighbour: how she broke her hip and had to be admitted to the city hospital. She was happily walking along a path in the city centre to visit her niece. It had snowed a few days earlier, and while the city council workers had shovelled and ploughed away some of the snow from the streets, some still persisted on the path; it was icy cold that day, there was black ice all over, and the path was slippery. I did not tell you that about thirty years ago, Mrs B, then in her late forties, had suffered what is referred to as a "mild stroke", which left her a little wobbly at times. So here she was, walking along the path, when a kid came at her on a scooter on the pavement; she tried to skirt around him, felt a little wobbly, lost her balance, and suddenly slipped on the pavement. In an instant, she knew she had broken her hip. Fortunately, she was able to find someone to phone the local ambulance, and they took her to the city hospital, where she recovered. Now, here is a question for you:
What do you think caused her hip fracture? (Think about it):
- Was it the black ice and the bad weather? (they immediately preceded the fall, and the city council did a sloppy job)
- The kid on the scooter? (if he had not come, she would have walked on as usual, wouldn't she?)
- Her wobbly gait and the old stroke?
Perhaps all three conspired in some way to "cause", if you will, her hip fracture. This is the point. Quite often in our lives there are events that precede one outstanding event or illness, and we get into the loop of thinking that the last thing ("the last straw that broke the camel's back") must be what caused the health issue. But quite often the answers are less clear. Ken Rothman proposed a concept in which any model of disease can be said to have a list of sufficient and necessary "causes". He conceptualised them in the form of a series of circles with arcs denoting the contributions from different supposed factors. Check them out:
What you see above is also referred to as the component cause model proposed by Ken Rothman, the noted epidemiologist. Note the three models above. Each depicts a "sufficient cause" for a particular health outcome. For instance, think of the hip fracture of our neighbour, Mrs B. We can think of several possible factors that close the circle leading to her hip fracture (some of them right away: the stroke she suffered, the icy path, the kid on the scooter, and so on). Likewise, for a disease, the leftmost model has five component causes: D, C, B, A, and E. The middle model has five causes: G, F, B, A, and H; the model to the right has another five: I, F, C, A, and J. Note that the one common element in each of these sufficient cause models is the causal factor "A". In such situations, factor "A" is termed the "necessary cause": at least "A" is needed before any of the sufficient causal loops can be closed.
Note a couple of points here. The first relates to the size of the pie slices: how do we determine the relative sizes of the slices, and do the sizes matter at all? The other point relates to the more fundamental issue of how we know that this pie is causal: how is each element selected to go into the pie? The answer to the second question is partly rooted in the philosophy of science, and we shall discuss it in the next section; but before we get there, let's dwell a bit on the first question: what determines the size of the slices, and do the sizes matter?
The size of the slices depends on their relative contribution. From our previous discussions, we know that this contribution in turn depends on the attributable risk percent. That is, for the asthma problem in children, we learned from Bornehag's study that the phthalate OR was 2.93, and from there we saw that the population attributable risk percent was something like 44%. So 44% of childhood asthma was attributable to phthalates, but that does not necessarily mean that the remaining 56% of asthma comes from other sources. It could be less, it could be more. How so?
There may be other causes of asthma that play a bigger role. Say dust allergy leading to asthma has an RR of 3.0; that would mean something like 66% of childhood asthma might be explained by exposure to dust particles. The percentages can add up to more than 100% because there are overlaps between phthalates in the household and dust particles in the household, and each contributes to asthma to a certain extent. Thus removing phthalates entirely may remove a certain share of asthma, but the remaining "causal factors" would take their place. The total might also be less than 100%, which means the remaining share of the disease is explained by unknown variables. We now turn to the next issue: how do we even know which factors go into the causal pie?
Sir Austin Bradford Hill and his criteria
At a 1965 meeting of the Section of Occupational Medicine of the Royal Society of Medicine in London, Professor Austin Bradford Hill took the stage. His talk was about a critical issue in science: how we deduce that X is a cause of Y when it comes to diseases and public health issues. He cited the instance of the association between smoking and lung cancer. In that address, Hill proposed nine conditions (he termed them viewpoints, but they were later, somewhat arguably, referred to as criteria, so much so that they are now known as "Hill's criteria") for understanding the causal linkage between an exposure and an outcome. His illustrative examples were drawn from the then US Surgeon General's report on Smoking and Health, his own research on the British doctors' study, and his years of experience in occupational health:
- Strength of Association. — The first of the conditions was that the stronger the association between the exposure and the health effect, the more likely it is that the association is causal. Weak associations, on the other hand, are more likely to be explained by undetected biases. To some extent, the justification for strong associations as markers of causal linkage lies in the attributable risk percentage, which accounts for the extent to which the specific exposure explains the outcomes. It is not necessarily true, though, that weak associations are non-causal; the point is that a strong association helps to address the rival hypothesis that the association is due to an unmeasured confounder, since such a confounder would need an even stronger effect to produce or nullify the association.
- Consistency. — By consistency of association, Hill meant that under different circumstances and in different situations, the exposure agent (in our case the environmental agent) would show similar or comparable levels of association. If the association is one where an increase in environmental exposure is associated with increased levels of the health outcome (say ETS exposure and lung disease), then in different situations and different populations it would remain roughly the same, or similar enough to pass the test of consistency.
- Specificity. — By specificity, Hill meant that one exposure would lead to a very specific set of effects, what we today might describe as a one-to-one relationship. While this is intuitive, it is not necessarily true: one exposure can lead to more than one effect (ETS could lead not only to lung cancer but also to other diseases, for instance heart disease); similarly, observing lung cancer does not necessarily mean that people were exposed to ETS alone. Having said that, where specificity does hold, it goes a long way towards sealing the claim of a cause and effect linkage between the exposure and the outcome.
- Temporality. — "Tempora" stands for "time". By temporality, Hill indicated that the presumed cause should precede the emergence of the health outcome. In Hill's terms, it is a matter of putting the horse before the cart: it is the horse that draws the cart, the horse in this case being the exposure and the cart the outcome. Of all the conditions he outlined in his almost "canonical" exposition, this one is perhaps the most robust: if E is to qualify as a cause of the outcome O, then E must precede O in time. This one is indisputable.
- Biological Gradient. — For the environmental health sciences, this is better appreciated as the dose-response curve, in the sense that Hill put it. As the level of exposure increases, so, correspondingly, should the level of the outcome. Intuitively this sounds right, and indeed a robust causal framework would uphold it. But the converse does not hold: even if the dose-response relationship does not follow a neat pattern of the response increasing with the dose, we can still speak of a causal relationship. The precise shape of the association can be non-linear or curvilinear, or show a threshold or ceiling effect, and yet the association can be causal.
- Plausibility. — This refers to the biological plausibility that qualifies the association: is there a biological mechanism we know of that can account for it? While intuitive and straightforward in most cases of linking exposure and disease, the converse again does not hold for establishing causation. The history of public health is replete with examples where, even though a biological mechanism had not been established, the association was robust enough that action could be taken. For example, in 1854, when John Snow investigated the London cholera epidemic, nothing was known about the role of Vibrio cholerae in causing the diarrhoeal signs and symptoms he was investigating, but he deduced the role of water in the causation of the disease and was wise enough to identify the specific pump and take action. This is particularly pertinent to environmental epidemiological approaches, where a precise biological cause is often not known, yet a causal association cannot be ruled out.
- Coherence. — Coherence, in the sense we infer from Hill's work, refers to the requirement that if the exposure and the outcome are linked in a causal way, the linkage should not conflict with what we otherwise know about the nature of such associations. Again, the reverse, a lack of coherence with what is expected, does not necessarily refute a causal linkage (and that is certainly not how Hill would have defined it).
- Experimental Evidence. — Experimental evidence points to the expectation that in the laboratory, given what we know about the biology of the causation, we should be able to replicate the effects (perhaps in isolated human tissues, or in non-human animals). While establishment through experimental evidence is good proof that the exposure could cause the outcome, lack of such evidence does not necessarily refute the claim of causation. Sometimes the effects cannot be replicated in non-human animals because faithful animal models do not exist; at other times, the milieu present in human systems is not available under laboratory conditions.
- Analogy. — Finally, analogy refers to the idea that similar associations in other situations can provide an inkling of causation. Take the ETS and lung cancer example: we now know that tobacco smoke itself is causally responsible for lung cancer. Therefore it stands to reason that ETS, which is nothing but tobacco smoke in the environment, can be similarly responsible for lung cancer in exposed individuals.
These nine conditions are good for setting up frameworks and guidelines to ascertain whether an association is one of cause and effect, or more likely to be than not. They are conditions and guidelines rather than hard and fast criteria to be applied in all cases; at least in the context of environmental health and environmental epidemiology, not all of them can be established beyond doubt in most instances. The one that really matters from the perspective of establishing causality is temporality: the cause must precede the outcome. This becomes an important consideration for our study design section, to which we turn next.
Overview of Epidemiological Study Designs
This section presents a very quick, rough outline of the basic features of the different types of epidemiological study designs used in environmental and occupational health studies. In organising this information, we move from simpler studies using secondary data to more complex primary data gathering studies. Some of these will be explained in detail in the individual study sections when we build them. We discuss them in the context of environmental and occupational health.
Ecological Studies. — Ecological studies in the context of environmental health essentially test the association between environmental exposures measured at an aggregate level and outcomes measured at an aggregate level as well. For example, Adrian Barnett and colleagues studied the correlation between air pollution in several Australian cities plus Auckland and Christchurch and the total number of hospital admissions due to cardiovascular illnesses, correlating air pollutant data collected over three years with the admission counts (see below):
What makes this an ecological study is the fact that the investigators collected data at the level of populations (from hospitals, the total number of cases; from the meteorological departments, the air quality) and then correlated the two measurements. A downside of this approach with respect to causal inference is that, since data were not collected at the individual level, one cannot make any inference about one person's risk of hospitalisation on a "bad air day", if you will. Any inference you draw from an ecological study about individual cases is subject to the "ecological fallacy", that is, inferring for individuals from data collected at aggregate levels.
Case Series. — These are study designs where individual-level data are collected: the details of each individual case are noted and possible exposures are tallied as well. However, as these studies do not have any valid comparison group, they are best suited for environmental health surveillance. Environmental health surveillance and tracking is a systematic, ongoing process of data collection, analysis, interpretation, and dissemination of vital information on environmental exposures and health effects. These are vital for environmental health and environmental epidemiological work, as they help to identify clusters of disease, environmental pollutants, and toxins, and thus enable the framing of hypotheses. Check out, for instance, the environmental health surveillance programme in Western Australia to learn more about how these processes happen.
Cross-sectional Surveys. — These provide snapshots of large, small, or otherwise well defined populations over a specific period of time or at a specific point in time. Cross-sectional surveys are usually conducted using questionnaires. These studies are done to establish or identify the prevalence of a specific health outcome.
Case Control Studies. — In this study design, investigators test specific hypotheses about the association between a specific exposure and disease outcomes. They start by sampling individuals with and without the particular disease in question. Participants who have the disease outcome of interest are labelled "cases", and those without it are labelled "controls". The investigators then ascertain the extent of exposure in both groups and compare their likelihoods of exposure. Case control studies report their effect estimates as odds ratios (see above), and the analytical method of choice is usually logistic regression to estimate the odds ratio for specific levels of exposure. Case control studies are great for rare diseases such as cancers, and this strategy enables investigators to test more than one exposure for each health outcome of interest.
Retrospective and Prospective Cohort Studies. — Cohort studies are, in general, study designs where "cohorts", similar groups of individuals who are initially free from the health outcome of interest, are assembled and followed through time to study the pattern of emergence of health states or outcomes. In a prospective study, the time of emergence of the health outcome is unknown; individuals classified by exposure or non-exposure status are selected in the present and then followed through time. In retrospective cohort studies, the cohorts are assembled using the principle of the "historical cohort": we already know the exposure status of the individuals, and we also know their health outcome status as recorded at some later point in the past (more recent than the time of their exposure assignment); the analysis then proceeds in the same manner as in prospective studies. Retrospective cohort studies are well suited to occupational epidemiology, where cohorts are assembled on the basis of whether workers were exposed to specific environmental agents within the industry, and records are then examined to study the emergence of specific health states or outcomes. Analysis of cohort study data is used to estimate the incidence of specific health outcomes; the techniques usually include proportional hazards models, in which the emergence of different health outcomes over time is studied. Case control studies can also be nested within cohort studies. This happens when blood samples are collected from initially disease-free persons and specific biomarkers are preserved to estimate exposure status (as dosage) at that time; then, when a sufficient number of disease or health outcomes has accrued, these previously collected biomarker samples are used as exposure measures in a case control study. Individuals who are disease-free at that stage are assigned the status of controls; those who show signs and symptoms of the disease under study are chosen as cases. These nested case control studies are useful for studying multiple exposures, just like regular case control studies. Otherwise, cohort studies are good designs for studying multiple outcomes resulting from a single or limited number of exposures. Cohort studies are also well suited to studying rare exposures (exposures that are not very common, say agents found only in specific industries). On the downside, they are time consuming and quite expensive to conduct. Otherwise, of all the study designs, they are the most robust for ascertaining causal inference from an epidemiological perspective.
Conclusion
This was a quick introduction to the basic principles of epidemiology. We started with the key definition of epidemiology as given by John Last in his Dictionary of Epidemiology, and proceeded from that definition to learn different measures of disease distribution (prevalence, incidence, and standardised rates) and measures of association (odds ratios, relative risks, and attributable risks). We then learned that the notion of causality in epidemiological studies flows from the establishment of a valid association (ruling out chance, eliminating biases, and controlling for confounding variables), and that causal inference can be built using both counterfactual approaches and the condition-based approach discussed by Hill back in 1965. Finally, we learned about a few study designs used in epidemiology. While this provides a brief (very brief) snapshot of epidemiology and some of its applications in the special case of environmental epidemiology, it will serve as a guide for more interesting work and research on environmental health. We shall learn statistical data analysis and its use in environmental health next.