Life Expectancy Calculation: learn to take ownership over data constructions

Published in

CodeX

19 min read · Aug 30, 2021

An introduction to data mining for people too shy to start

Introduction

Summer and vacation give you more free time and, by extension, more opportunities to question all the information you receive during the year. A friend of mine was wondering how life expectancy is calculated. Because I’m currently studying data mining, I decided to take this topic as an exercise. The calculation seemed to be some kind of average, so it should be easy. And then this journey began… I’m now able to answer my friend, but I decided to write this post to help people who are not satisfied with data constructions provided without explanation. I think that questioning information is good and that we shouldn’t refrain from using raw data to produce our own answers. This document tries to help people take their first steps into data exploration by explaining how to calculate life expectancy from scratch with few computing skills.

About Life Expectancy

Life expectancy is a well-known measure provided by national statistical offices, which shows for each country and for each year the time in years you can expect to live from a specific age. Usually you’ll get the life expectancy at birth for a specific year and trends are separated by gender. A typical trend looks like this:

Typical trend for life expectancy at birth

The first use of this information is to show that life expectancy is growing along with medical and technical improvements, and also that women live longer than men. But how is it calculated?

Intuition is not always a good idea

My first intuition was that they track the deaths of the population born in a specific year and take the average age at death: problem solved.

The problem with that approach is that you have to wait for all the people born in a specific year to die. I was born in 1975, so I would have to wait until around 2090 to know what my life expectancy is. And here’s the major issue: I would also have to be dead before it could be computed. My naive intuition is correct when we look far into the past, but it is of no use for predicting how many years I may expect to live from now.

Basic research is often discouraging

Starting with Wikipedia in French (I’m French, and you might find another formula depending on your language), I found this formula:

pi being the probability of surviving at age i.

If you are not used to mathematics, you’ll probably give up at this stage: this seems too theoretical, and we should be able to find a more practical way to compute life expectancy. I then went to the website of INSEE, our national institute for statistics and economics. They produce a lot of data about demography. The data is easy to find, and you can visualize and download the results as trends or tables. But how they proceeded to get this data is harder to find out. They provide a lot of documents about methodology, and it’s difficult to relate a data projection to the calculation that created it.

Looking for examples: an entry point?

I started searching for life expectancy using ‘how to’, ‘example’, or ‘tutorial’ as key words: it gave me better results. I found a document that explains how life expectancy is actually calculated in the United States.

Here’s the link to the document: https://www.cdc.gov/nchs/data/nvsr/nvsr61/nvsr61_03.pdf

United States Life Tables, 2008

In the document, I learned new important key words:

Life table
Mortality rate

I also understood the following: when you want to compute life expectancy for a given year, you need to focus on the deaths that occurred during that year. Since these are large populations, you have a great number of deaths every year, with people dying at every age from 0 to one hundred and more. You can start by creating a table that shows, for each age, the number of people who died at that age. Life expectancy is projected from the people who died in one year: it doesn’t represent a reality but a statistical estimation. This doesn’t seem very intuitive, and you may have some doubts about how effective it is. In practice, statistical estimation often gives a good idea of what you can expect.

Here’s an example: if you roll a die 1000 times and count the number of times you get a 6, you’ll end up with a number close to 167 (if your die is standard). Dividing this number by 1000 will give you a good estimate of the probability of getting a 6 when rolling the die. The more times you roll, the more accurate your result will be. Comparing a life table to a die, the different ages at death are the different faces of the die (there will be around 120 different faces on that die), and the number of deaths at each age divided by the total number of deaths of the year represents the probability of dying at that age.
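The die example is easy to try out. Here is a minimal sketch that simulates the rolls with Python's standard `random` module; the function name `six_frequency` and the fixed seed are only illustrative choices, not part of any source cited here.

```python
import random


def six_frequency(n_rolls, seed=0):
    """Roll a fair six-sided die n_rolls times and return the share of sixes."""
    rng = random.Random(seed)  # fixed seed (an arbitrary choice) so the run is repeatable
    sixes = sum(1 for _ in range(n_rolls) if rng.randint(1, 6) == 6)
    return sixes / n_rolls


# The observed share gets closer to 1/6 (about 0.1667) as the number of rolls grows.
for n in (100, 1000, 100000):
    print(n, round(six_frequency(n), 4))
```

A life table applies the same idea with ages at death in place of die faces: the count at each age divided by the total count is the estimated probability.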

Of course this projection cannot predict the future and you have to consider that the conditions won’t change. Tracking that information from year to year helps to understand the evolution of this projection.

At that point, I had a better understanding of the problem, but while trying to understand the life table, I discovered a key column named qx, which is described as:

“Column 2. Probability of dying (qx) — Shows the probability of dying between ages x and x + 1. For example, for males in the age interval 20–21 years, the probability of dying is 0.001225 (Table 2). This column forms the basis of the life table; all subsequent columns are derived from it.”

The definition wasn’t giving any indication of how it was computed. I needed to search deeper…

Digging further using new key words

I used the name of the column as key word when searching the web: something like “how to calculate the qx column of a life table”.

And this is how I came up with a new document giving new clues about how to create a life table.

This is an important lesson: at the beginning of a research you don’t have the right key words. Your first questions won’t lead you to the answers, but they will give you the key words that unlock them.

Here’s the link: https://faculty.weber.edu/jcavitt/WildlifeManagementMaterials/Lab/Life%20Table%20Construction.pdf

Life Table Construction

The document has a protocol to build a life table:

You first need a table containing a row for each age (or slice of years) with the number of deaths at that age for a specific year
You can then build a new column containing, at each age, the people who survived to that age: in your table you add a column named surviving. Starting from the last row of the life table, you set the number of survivors equal to the number of deaths at that age (the next year nobody will be alive). Then you go back in time, climbing up your table and adding the number of deaths at an age to the number of survivors at the next age. When you reach the age of 0, your surviving number will be the total number of deaths for that year
The surviving column now represents a fictitious population decreasing at each age until the whole population has died out
You can now add a qx column to the life table that calculates the mortality rate for each age by dividing the number of deaths by the number of survivors at each age
You can also create a px column that contains the surviving population at each age divided by the initial population. You’ll have a number that varies from 1 to 0, representing the proportion of people still living at age x
Using the px column, you can now represent a normalized population by creating the column lx (in a life table they often use a population of 100000). Using the number 100000 makes the data easier to read and ensures that you have the same population sample from year to year, so you can easily compare your life table to other life tables
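The protocol above can be sketched in a few lines of plain Python. This is a minimal illustration, not the code from “Life Table Construction”: the function name `build_life_table` and the toy death counts are assumptions made for the example.

```python
def build_life_table(deaths):
    """Build the surviving, qx, px and lx columns.

    deaths[x] is the number of deaths at age x, with one entry per age
    from 0 up to the oldest age (use 0 where nobody died).
    """
    n = len(deaths)

    # surviving(x) = surviving(x+1) + deaths(x), filled from the oldest age upward
    surviving = [0] * n
    running = 0
    for x in range(n - 1, -1, -1):
        running += deaths[x]
        surviving[x] = running

    total = surviving[0]  # the whole fictitious population
    # qx: mortality rate at each age (0.0 if nobody is left, to avoid dividing by zero)
    qx = [d / s if s else 0.0 for d, s in zip(deaths, surviving)]
    # px: proportion of the initial population still alive at each age
    px = [s / total for s in surviving]
    # lx: the same proportion applied to a normalized population of 100000
    lx = [round(p * 100000) for p in px]
    return surviving, qx, px, lx


# Toy data: 4 deaths spread over ages 0 to 3 (not real figures).
surviving, qx, px, lx = build_life_table([1, 0, 2, 1])
print(surviving)  # [4, 3, 3, 1]
print(lx)         # [100000, 75000, 75000, 25000]
```

Note that qx at the oldest age is always 1, since everyone still alive at that age dies there.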

The qx column is the key element from which all the other columns are computed. We now need to compute life expectancy from that column.

Calculating the life expectancy

Note: This paragraph aims to explain the logic behind the calculation. It is probably more confusing than helping. Don’t worry about that: it will become clearer when we use it in a real use case.

The life expectancy (named ex) is calculated from the mortality rate. But if we look at our different sources of knowledge:

Wikipedia (French): ex for an age x is the sum, for k going from x to the oldest age, of the products of the probabilities to survive (1-qx) at each age from 0 to k (we’ll see how to compute this later)
“National Vital Statistics Reports”: ex is the total of Person-Years lived by people who survived age x divided by the surviving population at age x. ex = Tx/lx
“Life Table Construction”: ex uses the same calculation as “National Vital Statistics Reports” but Tx is computed differently.

These 3 methodologies for computing life expectancy from the same data source will give 3 different results. This is an important discovery: when you deal with data constructions, the methodology will always have an impact on the result. The methodology is often hidden, and different constructions representing the same concept can coexist, adding confusion and misunderstandings.

Wikipedia Calculation

From column qx, which is the mortality rate, we can create a new column plx containing the probability to survive at age x, with plx = 1 - qx. This represents the variable pi in the equation
We can then create a column prod_plx, which is the product of the plx values from plx(0) to plx(x). For example: prod_plx(0) = plx(0), while prod_plx(20) = plx(0)*plx(1)*plx(2)*…*plx(20). This represents the products of pi from 0 to k in the equation

From prod_plx we can now create the column ex, which computes the sum of each prod_plx from x to the oldest age in the table. For example: ex(0) = prod_plx(0) + prod_plx(1) + prod_plx(2) + … + prod_plx(117) (where 117 is the oldest age of the table), while ex(20) = prod_plx(20) + prod_plx(21) + prod_plx(22) + … + prod_plx(117). This represents the sum over k, from x to infinity, of the products of pi in the equation
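Written compactly, the three steps above amount to the following. This is only a transcription of the calculation as I describe it here, with \omega standing for the oldest age in the table (117 in this example); it is not copied from the Wikipedia page.

```latex
\begin{aligned}
\mathrm{plx}(x) &= 1 - q_x \\
\mathrm{prod\_plx}(k) &= \prod_{i=0}^{k} \mathrm{plx}(i) \\
e_x &= \sum_{k=x}^{\omega} \mathrm{prod\_plx}(k)
     = \sum_{k=x}^{\omega} \prod_{i=0}^{k} \bigl(1 - q_i\bigr)
\end{aligned}
```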

“Life Table Construction” Calculation

From column lx (normalized surviving population) we create the column Tx, which is the sum of the values of lx for ages greater than or equal to x. For example, Tx(0) = lx(0) + lx(1) + lx(2) + … + lx(117) (where 117 is the oldest age of the table), while Tx(20) = lx(20) + lx(21) + lx(22) + … + lx(117)
The column ex is created calculating ex = Tx/lx

“National Vital Statistics Reports” Calculation

The calculation of ex required another source of information. They are using a column named Lx which is defined as:

“Column 5. Person-years lived (Lx) — Shows the number of person-years lived by the hypothetical life table cohort within an age interval x to x + 1. Each figure in column 5 represents the total time (in years) lived between two indicated birthdays by all those reaching the earlier birthday. ”

I didn’t understand how the column was computed, so I ran another search using the Lx calculation as a key word.

Here’s the source that provided the answer:

https://www.ssa.gov/OACT/HistEst/CohLifeTables/LifeTableDefinitions.pdf

Definitions of Life Table Functions

The Lx column represents the total number of years lived, between ages x and x+1, by the population that reached age x. Some people dying at age x will die at age x and 364 days, while others will die at age x and 1 day. To get a better estimate, the scientists considered that half of the people who died at age x lived less than 6 months after their birthday and the other half lived more than 6 months. They created the Lx column to include in the expectancy the people who died at age x but lived more than 6 months after their birthday. The formula is then Lx(x) = lx(x) - (0.5*dx(x)), where dx is the number of people who died at age x. For the first year another value is used to represent the share of babies living more than 6 months. Because infant deaths mostly occur in the first 6 months, Lx(0) should be lower than for the other years, where deaths are spread more evenly over the year. I don’t know the formula (maybe it is based on an average over birth months), but in the “National Vital Statistics Reports” file the formula is approximately Lx(0) = lx(0) - (0.87*dx(0)).

Once the column Lx is computed, Tx is computed the same way as in “Life Table Construction” but Lx values are summed instead of lx values.
ex is computed as: ex = Tx/lx exactly as in “Life Table Construction”

Are you still there?

At least we now know that the calculation is not so difficult, but that it still requires a complex protocol that seems to change from one culture to another. It seems that life expectancy presented by different sources can lead to different results just because of the methodology.

Well, this is interesting, but we want to form our own opinion. Let’s use this new knowledge and test it on real data!

Applying the theory: Using raw data to compute our own version of life expectancy

Extracting useful information from raw data

To build a life table that will help us calculate the life expectancy for a specific year, we need all the deaths of that specific year. Where can we find the data? In France, the government provides it to the public through the statistics and demography institutes. Searching for “deaths data France” led to this web site:

Fichier des personnes décédées (Décès) - data.gouv.fr

www.data.gouv.fr

There you can find a text file for each year containing all the deaths registered that year. Each line of a file contains the name, date of birth, date of death and location information of the deceased. All we need from those files is the age at death, which is not provided. We will have to extract the year of death and subtract the year of birth, then group the deaths by age, counting the number of deaths at each age and sorting the ages from the lowest value (0) to the highest (the oldest age at death). This part involves a frightening skill: coding! Some might think that it will be too complicated. Fortunately, there are languages and frameworks that help a lot with manipulating data. One language that is very popular with data scientists is Python, which has many online interpreters and many tools to deal with data. Documentation and samples are plentiful: with a lot of curiosity and a bit of perseverance (or maybe the opposite?) we can easily come up with a few lines of code that will do the job. There won’t be any tutorial here, but I’ll copy the code I used as proof that what we want to achieve doesn’t require many skills.

I downloaded the deaths file for the year 2008 in France, which is named “deces-2008.txt”, then ran the following code in Python (using the pandas module):

Python code to generate 3 life tables (total, male, female)

The code above is all you need to generate the life table from a raw file from French demography data.

The file is read
The gender column is extracted (1 is male, 2 is female): the extraction information came from the data provider’s documentation
The year of birth is extracted as a number
The year of death is extracted as a number
The age at death is calculated by subtracting the year of birth from the year of death
The life table for the full population is created by grouping lines by age (each line is a death): the grouped lines are counted to generate the deaths column
The same operation is repeated with the data filtered on gender to create life tables based on gender (male and female)
The life tables are persisted as text files that can be opened by any spreadsheet
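The steps above can be sketched as follows. This is a minimal illustration using only the Python standard library rather than pandas, and it is not the listing I ran. The column positions (gender at character 81, birth date starting at character 82, death date starting at character 155, both as YYYYMMDD) are assumptions that must be checked against the data provider's documentation, and the helper names are hypothetical.

```python
import csv
from collections import Counter

# Assumed fixed-width layout (0-based slices); check it against the provider's documentation.
GENDER = slice(80, 81)        # "1" = male, "2" = female
BIRTH_YEAR = slice(81, 85)    # first 4 characters of the birth date
DEATH_YEAR = slice(154, 158)  # first 4 characters of the death date


def count_deaths_by_age(lines, gender=None):
    """Return {age: number of deaths}, optionally restricted to one gender code."""
    counts = Counter()
    for line in lines:
        if gender is not None and line[GENDER] != gender:
            continue
        try:
            age = int(line[DEATH_YEAR]) - int(line[BIRTH_YEAR])
        except ValueError:
            continue  # skip lines whose years are missing or not numeric
        counts[age] += 1
    return dict(sorted(counts.items()))


def save_counts(counts, path):
    """Write the age and deaths columns as a csv file readable by a spreadsheet."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["age", "deaths"])
        writer.writerows(counts.items())


def make_line(gender, birth, death):
    """Build a synthetic record in the assumed layout, for demonstration only."""
    return "DOE*JOHN/".ljust(80) + gender + birth + " " * 65 + death


sample = [
    make_line("1", "19300512", "20080103"),
    make_line("2", "19200101", "20081224"),
    make_line("2", "19300630", "20080715"),
]
print(count_deaths_by_age(sample))       # {78: 2, 88: 1}
print(count_deaths_by_age(sample, "2"))  # {78: 1, 88: 1}

# With the real file, something like this would produce the three tables:
# with open("deces-2008.txt", encoding="utf-8", errors="replace") as f:
#     lines = f.read().splitlines()
# save_counts(count_deaths_by_age(lines), "life_table_total.csv")
# save_counts(count_deaths_by_age(lines, "1"), "life_table_male.csv")
# save_counts(count_deaths_by_age(lines, "2"), "life_table_female.csv")
```

Because the age is a difference of years, a wrongly registered date shows up as an odd age rather than being dropped, which is why the result still has to be inspected by eye.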

Note: We could use Python code to calculate the entire life table, but I preferred to use a spreadsheet for the life expectancy calculation, which is easier to follow for most people.

If you are interested to start coding:

Python: a good language to start to work with data: https://www.python.org/about/gettingstarted/
Google Colab: A good online code interpreter (that will allow you to run python code without installing anything): https://colab.research.google.com/notebooks/intro.ipynb?utm_source=scs-index
Pandas: An important data API (a code module that can handle a lot of data tasks for you, such as reading, saving and filtering data): https://colab.research.google.com/github/google/eng-edu/blob/master/ml/cc/exercises/pandas_dataframe_ultraquick_tutorial.ipynb
Pandas documentation site: https://pandas.pydata.org/docs/

Using spreadsheet to perform calculation

Now that I have my csv (csv meaning comma separated values) files, I can open them in a spreadsheet. Note that I end up with 3 files but I’ll only use the first one to demonstrate the calculation. The male and female life tables were processed exactly the same way.

Here’s what I get when importing my csv life table in the spreadsheet:

You can see the age -12. This is probably an error made when the date of death was registered. I’ll remove that line from the table.

At the end of the file I have:

The oldest person who died in 2008 was 117 (well done!). Nobody died at ages 114, 115 and 116 (all other ages have people who died): I’ll add these lines manually to be consistent, setting 0 in the deaths column for those lines.

Note: Getting errors or unexpected result when working with raw data is totally normal and common. Getting the right data is the most time-consuming task and this is probably why they call it data mining.

I can now build a fictitious population in which all the people were born the same year and which died out after 117 years (people dying progressively, having no children, with nobody coming from outside or leaving the population).

I’ll start from age 117, when 1 person died: my surviving population at that age had only one person.

At ages 116, 115 and 114, I still had a population of 1, but at age 113 the surviving population had 2 people: one died that year and the other at 117. Going up and using the number of deaths at each age, I can calculate the population using the rule: surviving(x) = surviving(x+1) + deaths(x)
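The rule can be checked on the last five ages of the table with a short sketch. It uses `itertools.accumulate` from the Python standard library to take a running total from the oldest age upward; the figures are the ones quoted just above.

```python
from itertools import accumulate

ages = [113, 114, 115, 116, 117]
deaths = [1, 0, 0, 0, 1]  # ages 114 to 116 were added by hand with 0 deaths

# surviving(x) = surviving(x+1) + deaths(x): a running total starting at the oldest age
surviving = list(accumulate(reversed(deaths)))[::-1]

for age, d, s in zip(ages, deaths, surviving):
    print(age, d, s)
# 113 1 2
# 114 0 1
# 115 0 1
# 116 0 1
# 117 1 1
```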

Here is what we have using automatic calculation. From age 117:

To age 0:

At age 0 the population is complete: 553111 is the number of deaths in France in 2008 (according to my file). From that number, we create a fictitious population of 553111 people born the same year. Each year some of them will die. The life expectancy is calculated from the projection of how many years a person from that population can expect to live. This projection represents a statistical distribution from this sample of data.

The column qx is created using the rule: qx(x) = deaths(x)/surviving(x):

The column px is created using the rule: px(x) = surviving(x)/surviving(0):

The px column is interesting: it represents the percentage of the population that survived to a specific age. You can observe, for example, that 80% of the population survives until the age of 64, but that only 48% makes it to 82. 18% of the population remains at the age of 90. This percentage can also be used to create a normalized population of 100000 to ensure that we have the same reference from year to year.

The column lx is calculated using the rule: lx(x) = round(px(x) * 100000):

We can then calculate the life expectancy using the 3 formulations taken from above:

1. The Wikipedia formulation doesn’t require any population. It uses only qx to calculate plx, then prod_plx, then ex_wiki:

plx(x) = 1-qx(x)
prod_plx(x+1) = prod_plx(x) * plx(x+1) with prod_plx(0) = plx(0)
ex_wiki(x) = sum(prod_plx(x):prod_plx(117))

ex_wiki is the life expectancy using the French Wikipedia formula.

ex_wiki(x) = the number of years you can expect to live when you are at age x

When you are born you can expect to live 76.6 years, but if you reach the age of 80 you can still expect to live 4.4 years.
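For readers who prefer code to spreadsheet formulas, the same three rules can be sketched in plain Python. The function name `ex_wiki` mirrors the spreadsheet column; the toy mortality rates are made up for the example and are not the French figures.

```python
def ex_wiki(qx):
    """Return ex_wiki(x) for every age x, from a list of mortality rates qx."""
    # prod_plx(x) = plx(0) * plx(1) * ... * plx(x), with plx(x) = 1 - qx(x)
    prod_plx = []
    running = 1.0
    for q in qx:
        running *= 1.0 - q
        prod_plx.append(running)

    # ex_wiki(x) = prod_plx(x) + prod_plx(x+1) + ... up to the oldest age
    ex = [0.0] * len(qx)
    total = 0.0
    for x in range(len(qx) - 1, -1, -1):
        total += prod_plx[x]
        ex[x] = total
    return ex


toy_qx = [0.25, 0.0, 2 / 3, 1.0]  # toy rates for ages 0 to 3
print([round(v, 4) for v in ex_wiki(toy_qx)])  # [1.75, 1.0, 0.25, 0.0]
```

In this toy table four people die at ages 0, 2, 2 and 3, so the value at birth, 1.75, is simply the average number of complete years they lived.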

2. “Life Table Construction” method is creating Tx then calculating ex.

Tx(x) = sum(lx(x):lx(117))
ex_2(x) = Tx(x)/lx(x)

In my opinion there is something wrong in Tx(x) = sum(lx(x):lx(117)).

It should be Tx(x) = sum(lx(x+1):lx(117)): Tx represents the sum of the years lived by the survivors of a year. lx at age 0 is the entire population, but not all of the population survived that age. My opinion is reinforced by this fact: if we calculate ex using an offset of 1 for lx, ex_2 and ex_wiki become proportional:

ex_wiki(x) = (ex_2(x))*px(x)

I created a column ex_test to demonstrate that.

I chose to use Tx(x) = sum(lx(x+1):lx(117)) and not to follow the “Life Table Construction” protocol for the Tx calculation.

Remember that px is the proportion of the population surviving at each age. It means that the Wikipedia calculation divides Tx(x) by the entire population instead of lx(x), which is the surviving population at age x.
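To see what the offset changes, here is a minimal sketch that computes ex_2 both ways on a toy normalized population. The function names and the `offset` parameter are illustrative, not taken from “Life Table Construction”. Since the two sums differ only by the term lx(x), and lx(x)/lx(x) = 1, the two versions differ by exactly one year at every age.

```python
def suffix_sums(values):
    """Return a list whose item x is values[x] + values[x+1] + ... + values[-1]."""
    out = [0.0] * len(values)
    total = 0.0
    for x in range(len(values) - 1, -1, -1):
        total += values[x]
        out[x] = total
    return out


def ex_2(lx, offset=True):
    """ex_2(x) = Tx(x) / lx(x); with offset=True the sum starts at lx(x+1)."""
    tx = suffix_sums(lx)
    if offset:
        tx = tx[1:] + [0.0]  # drop lx(x) itself from each sum
    # a rounded lx can reach 0 at the oldest ages, so guard the division
    return [t / l if l else 0.0 for t, l in zip(tx, lx)]


toy_lx = [100000, 75000, 75000, 25000]  # toy normalized population, ages 0 to 3
with_offset = ex_2(toy_lx, offset=True)
as_published = ex_2(toy_lx, offset=False)
print([round(v, 2) for v in with_offset])   # [1.75, 1.33, 0.33, 0.0]
print([round(v, 2) for v in as_published])  # [2.75, 2.33, 1.33, 1.0]
```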

Here’s the result in the spreadsheet:

Calculating ex from “Life Table Construction”

ex_2 and ex_wiki give the same life expectancy at age 0, but at the age of 80 life expectancy becomes 8.1 instead of 4.4. As you get older, you may feel the urge to know which calculation is the most accurate!

3. Last but not least, the “National Vital Statistics Reports” method. Lx is first calculated by removing half of the deaths from each lx(x), except for lx(0), which uses 87% of the deaths. Tx_2 is built by summing Lx instead of lx, and ex_3 is calculated by dividing Tx_2(x) by lx(x).

Lx(x) = lx(x)-(0.5*(lx(x)-lx(x+1))) with Lx(0) = lx(0)-(0.87*(lx(0)-lx(1)))
Tx_2(x) = sum(Lx(x):Lx(117))
ex_3(x) = Tx_2(x)/lx(x)
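These three rules translate into a short Python sketch. The 0.5 and 0.87 factors are the ones given above; the function name `ex_3`, the `first_year_factor` parameter and the toy population are assumptions made for the example, not the official implementation.

```python
def ex_3(lx, first_year_factor=0.87):
    """Life expectancy at each age using person-years lived (Lx)."""
    n = len(lx)

    # Lx(x) = lx(x) - factor * dx(x), where dx(x) = lx(x) - lx(x+1)
    person_years = []
    for x in range(n):
        next_lx = lx[x + 1] if x + 1 < n else 0  # nobody is left after the oldest age
        dx = lx[x] - next_lx
        factor = first_year_factor if x == 0 else 0.5
        person_years.append(lx[x] - factor * dx)

    # Tx_2(x) = Lx(x) + Lx(x+1) + ... up to the oldest age
    tx_2 = [0.0] * n
    total = 0.0
    for x in range(n - 1, -1, -1):
        total += person_years[x]
        tx_2[x] = total

    # ex_3(x) = Tx_2(x) / lx(x), guarding against a rounded lx of 0
    return [t / l if l else 0.0 for t, l in zip(tx_2, lx)]


toy_lx = [100000, 75000, 75000, 25000]  # toy normalized population, ages 0 to 3
print([round(v, 4) for v in ex_3(toy_lx)])  # [2.1575, 1.8333, 0.8333, 0.5]
```

At the oldest age the result is 0.5: the last survivors are each credited with half a year.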

Because Lx has larger numbers, the life expectancy with this calculation is more optimistic. And because Tx is divided by the surviving population instead of the entire population, we can expect to live longer as we get older.

I’m optimistic so I’ll tend to choose the third methodology.

Here’s the result in the spreadsheet:

Calculating ex from “National Vital Statistics Reports”

Understanding the different methodologies

If you are interested in understanding the differences between the 3 calculations:

The Wikipedia method doesn’t use a population, only the surviving rate (1 - mortality rate), but still comes up with a result in years. How is this possible?

If you look at how lx is calculated: lx(x+1) = lx(x) * plx(x)

You take the previous surviving population and multiply it by the surviving rate. Another way to picture it is to consider that someone who has survived to age x had to survive every year from 0 to x. lx(x+1) can then be calculated as the initial population (lx(0)) multiplied by the product of all the surviving rates (plx) from 0 to x.

lx(x+1) = lx(0)*plx(0)*plx(1)*plx(2)*…*plx(x)

As prod_plx(x) = plx(0)*plx(1)*plx(2)*…*plx(x)

We can write that lx(x+1) = prod_plx(x) * lx(0)

In our file:

ex_wiki(x) = sum(prod_plx(x):prod_plx(117)) the sum of all the plx products from x to the oldest age
Tx(x) = sum(lx(x+1):lx(117))
ex_2(x) = Tx(x)/lx(x)

Because lx(x+1) = prod_plx(x) * lx(0)

We can write Tx(x) = lx(0) * sum(prod_plx(x):prod_plx(117))

And Tx(x) = lx(0) * ex_wiki(x)

And as ex_2(x) = Tx(x)/lx(x)

We can write ex_2(x) = lx(0) * ex_wiki(x)/lx(x)

And because lx(0)/lx(x) = 1/px(x) by definition

ex_2(x) = ex_wiki(x)/px(x)

The only difference between ex_2 and ex_wiki is that ex_wiki divides the sum of the products of surviving rates by the initial population, whereas ex_2 divides it by the surviving population at each age.
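The relation ex_2(x) = ex_wiki(x)/px(x) can also be checked numerically. Here is a minimal sketch on toy death counts (not the French data). It works directly on the unrounded surviving column, because lx is only a rescaled and rounded copy of it and rounding would blur the equality.

```python
deaths = [1, 0, 2, 1]  # toy death counts for ages 0 to 3
n = len(deaths)

# surviving(x) = deaths(x) + deaths(x+1) + ... up to the oldest age
surviving = [sum(deaths[x:]) for x in range(n)]
px = [s / surviving[0] for s in surviving]
plx = [1 - d / s for d, s in zip(deaths, surviving)]

# prod_plx(x) = plx(0) * plx(1) * ... * plx(x)
prod_plx = []
running = 1.0
for p in plx:
    running *= p
    prod_plx.append(running)

ex_wiki = [sum(prod_plx[x:]) for x in range(n)]
# ex_2 with the offset of 1: Tx(x) starts at the survivors of age x+1
ex_2 = [sum(surviving[x + 1:]) / surviving[x] for x in range(n)]

# The last two columns printed on each row should match.
for x in range(n):
    print(x, round(ex_2[x], 4), round(ex_wiki[x] / px[x], 4))
```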

Finally, the third methodology proposes to count half of the people dying at age x as having lived the full year up to x+1, which is equivalent to crediting each of them with half a year on average.

Comparing my calculation with official data

I ran the same calculation (using the most optimistic methodology) on the male and female life tables and came up with these numbers:

Life expectancy for a male from birth in 2008 in France: 73.2
Life expectancy for a female from birth in 2008 in France: 81.2

Looking at https://www.insee.fr/fr/statistiques/2416631 , I found these results:

77.6 for male
84.3 for female

They have more optimistic results than my best results.

There are many possible reasons for that:

I made a mistake somewhere
I didn’t work with the same data sources
They are doing other calculations
…

Taking a closer look at the raw data, I discovered that many deaths in the file came from 2007 and earlier, and that deaths from 2008 were still being declared in files up to 2018. Even using only 2008 deaths, I didn’t get the same results. I tried to find answers in the methodology and discovered that they try to take migration into account, because populations are not static: people leave and arrive. They also made some calculations including fiscal declarations and used the mean of 5 years of deaths to perform the calculation.

To make it short, I discovered that when you try to confront projections with the real world, you realize that you have to take a lot of parameters into consideration because of biases.

Conclusion

Starting from a single data indicator calculated every year (broken down by gender, for instance) and trying to understand what it represents by computing it myself taught me a lot:

I had a bad intuition of what the data was
I discovered that there are different ways to calculate it
I learned new projections about life expectancy (the proportion of a population surviving to a certain age, for example)
Raw data doesn’t have the direct answer: you have to transform it
It was relatively fast (a few hours) to come up with a basic calculation and try it out with some real data. But when I compared my results with the official results, I realized that the real world requires considering the parameters carefully and spending much more time on it
It’s easy to find a result, but it is much more difficult to find the methodology or the formula that led to that result
Having a better understanding of the problem might lead you to disagree with a methodology (for example the Tx calculation from “Life Table Construction”)
A lot of interesting raw data is available to the public
Data mining becomes messy really really fast

I hope that you enjoyed this data journey as I enjoyed sharing it and that you are now ready to start your own data investigations.

Have a nice life!