France population analysis with Python

cloud
Feb 24, 2022

The French government publishes a lot of data online for transparency reasons, which makes a great playground for Python analysts.

Today we’re going to see what we can do with the list of all deaths since 1970 in France.

This analysis has been done with Visual Studio Code and Jupyter-Notebook.

Collect data

The data can be found here and is provided by year. So the first step is to download all the files and convert them to XLSX so that pandas can read them. I did it simply with Excel, one file at a time. Just remember to set the number column (3) to text, otherwise Excel will round it.
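As an alternative to the Excel conversion, pandas can read fixed-width text directly with read_fwf. The sketch below is only an illustration: the column widths and column names are assumptions to check against the layout documented on the data.gouv.fr page, the two records are made up, and make_line is a hypothetical helper used only to build the sample.

```python
import io
import pandas as pd

# Assumed column widths and names: verify them against the documented layout
WIDTHS = [80, 1, 8, 5, 30, 30, 8, 5, 9]
NAMES = ['nom_prenom', 'sexe', 'date_naissance', 'code_lieu_naissance',
         'ville_naissance', 'pays_naissance', 'date_deces',
         'code_lieu_deces', 'num_acte_deces']

def make_line(values):
    # Hypothetical helper: pad each value to its column width
    return ''.join(str(v).ljust(w) for v, w in zip(values, WIDTHS))

# Two made-up records standing in for a real yearly file
sample = '\n'.join([
    make_line(['DUPONT*JEAN PIERRE/', '1', '19200315', '75112', 'PARIS 12',
               '', '19850704', '69123', '001234']),
    make_line(['MARTIN*LUCIE/', '2', '19311100', '99352', 'ALGER',
               'ALGERIE', '20010212', '13055', '000987']),
])

# dtype=str keeps dates and codes as text, so nothing is rounded
# and leading zeros are preserved
raw = pd.read_fwf(io.StringIO(sample), widths=WIDTHS, names=NAMES,
                  dtype=str, header=None)
```

With a real file, the io.StringIO(sample) argument would be replaced by the path of the downloaded text file.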

We now have 52 files that look like this:

Excel screenshot

Transform data

https://www.data.gouv.fr/fr/datasets/fichier-des-personnes-decedees/ gives us all the data types, their length and where they are located.

These 3 columns contain the following fields (data name):

  • last name (nom)
  • first names (prenom)
  • gender (sexe)
  • birth date (date_naissance)
  • City of the birth (for people born in France or in DOM/TOM/COM) (ville_naissance)
  • Country of the birth (for people born abroad) (pays_naissance)
  • Death date (date_deces)
  • Code of the death location (code_lieu_deces)
  • Number of the death certificate (num_acte_deces)

We are going to concatenate all these files to get a single dataframe.

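import pandas as pd

# Read one file per year (1970 to 2021), then concatenate them in one pass.
# DataFrame.append was removed in pandas 2.0, so we use pd.concat instead.
frames = []
for a in range(1970, 2022):
    frames.append(pd.read_excel('<YOUR DIRECTORY>\\deces-' + str(a) + '.xlsx', engine='openpyxl', header=None))
df = pd.concat(frames, ignore_index=True)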

Then we put each field in its own column.

df.columns = ['name', 'naissance', 'mort']
df[['nom', 'prenom']] = df.name.str.split('*' ,expand=True,)
df.prenom = df.prenom.str.rstrip('/')
df = df.drop(['name'], axis=1)
# Birth data
df['sexe'] = df.naissance.str[0]
df['date_naissance'] = df.naissance.str[1:9]
df['ville_naissance'] = df.naissance.str[14:44]
df['pays_naissance'] = df.naissance.str[44:]
df = df.drop(['naissance'], axis=1)
# Death data
df['date_deces'] = df.mort.str[0:8]
df['code_lieu_deces'] = df.mort.str[9:14]
df['num_acte_deces'] = df.mort.str[15:]
df = df.drop(['mort'], axis=1)
df = df.dropna(subset=['date_naissance', 'date_deces'])

We now have a big dataframe with 26268061 entries, which looks like this:

Dataframe before change

We are going to convert the birth and death dates with the to_datetime() method so they are easier to work with. To do that, we first have to remove dates in an invalid format, such as a day or month equal to ‘00’.

# Drop rows whose birth month or day is '00' (the month is characters 4-5, the day 6-7)
df['mois'] = df.date_naissance.str[4:6]
df['jour'] = df.date_naissance.str[6:8]
df = df[(df.mois != '00') & (df.jour != '00')]
df = df.drop(['mois', 'jour'], axis=1)
# Same check for the death date
df['mois'] = df.date_deces.str[4:6]
df['jour'] = df.date_deces.str[6:8]
df = df[(df.mois != '00') & (df.jour != '00')]
df = df.drop(['mois', 'jour'], axis=1)
# Convert to datetime; anything still invalid becomes NaT and is dropped
df['date_naissance'] = pd.to_datetime(df.date_naissance, format="%Y%m%d", errors='coerce')
df['date_deces'] = pd.to_datetime(df.date_deces, format="%Y%m%d", errors='coerce')
df = df.dropna(subset=['date_naissance', 'date_deces'])
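To see how the ‘00’ check and errors='coerce' behave, here is a small illustration on made-up date strings (the values and variable names are invented for the example). Note that a two-digit month needs a two-character slice, [4:6], and likewise [6:8] for the day.

```python
import pandas as pd

# Made-up values: a valid date, an unknown month, an unknown day, an impossible date
dates = pd.Series(['19200315', '19310012', '19451100', '19990230'])

mois = dates.str[4:6]   # characters 4 and 5 hold the month
jour = dates.str[6:8]   # characters 6 and 7 hold the day
inconnu = (mois == '00') | (jour == '00')

# errors='coerce' turns anything unparseable into NaT instead of raising
converties = pd.to_datetime(dates, format='%Y%m%d', errors='coerce')
```

The impossible date (February 30) is not caught by the ‘00’ check but still ends up as NaT, which is why the final dropna is useful.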

Finally, I’ll add a column with the age of each person, because it is very interesting information and we are going to work with it.

df['annee'] = df['date_deces'].dt.year
df['age'] = df.date_deces - df.date_naissance
df['age'] = (df.age.dt.days / 365.25).astype('int')
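Dividing the number of days by 365.25 is an approximation and can be off by one year for a death on or near a birthday. As a hedged alternative, not the method used for the results in this article, the age in completed years can be computed from the calendar fields. The dates and the names age_exact and age_approx below are made up for the example.

```python
import pandas as pd

# Made-up dates: a death one day before a birthday, and one exactly on a birthday
toy = pd.DataFrame({
    'date_naissance': pd.to_datetime(['1920-03-15', '1901-01-01']),
    'date_deces': pd.to_datetime(['2000-03-14', '1983-01-01']),
})

# True when the birthday had already been reached in the year of death
anniversaire_passe = (
    (toy.date_deces.dt.month > toy.date_naissance.dt.month)
    | ((toy.date_deces.dt.month == toy.date_naissance.dt.month)
       & (toy.date_deces.dt.day >= toy.date_naissance.dt.day))
)

# Difference of years, minus one when the birthday was not reached
toy['age_exact'] = (toy.date_deces.dt.year - toy.date_naissance.dt.year
                    - (~anniversaire_passe).astype(int))

# The day-based approximation, for comparison
toy['age_approx'] = ((toy.date_deces - toy.date_naissance).dt.days / 365.25).astype(int)
```

On the second row the two methods disagree: the person died on their 82nd birthday, but the day-based approximation gives 81.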

Save our dataframe

Because I don’t want to spend hours rebuilding the dataframe, I’ll save it with pickle and reload it each time I need a clean dataframe.

df.to_pickle('<YOURDIRECTORY>\\df.pickle')

OK, we now have a big dataframe with our data ready to be analyzed!

Oldest persons since 1970

First, who are the oldest people to have died since 1970?

10 oldest persons since 1970

Wow, we have 2 people who died at 144 years old.

Interestingly, we find Jeanne CALMENT, who was the oldest person in the world when she died in 1997, but we have never heard of the first 2 people, Marie-José MEYER and Jean-Claude DEVEIL, who both died at 144. Normally they should be the oldest people on record, not Jeanne CALMENT. Maybe there was a doubt about their real age.
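The ranking itself can be obtained in one call. This is a minimal sketch on a made-up dataframe (the names and ages are invented); on the real data the same call with 10 instead of 2 would give a top 10.

```python
import pandas as pd

# Made-up rows standing in for the real dataframe
toy = pd.DataFrame({
    'nom': ['NOM_A', 'NOM_B', 'NOM_C', 'NOM_D'],
    'prenom': ['PRENOM_A', 'PRENOM_B', 'PRENOM_C', 'PRENOM_D'],
    'age': [101, 122, 87, 110],
})

# nlargest keeps the n rows with the highest value in the given column
plus_ages = toy.nlargest(2, 'age')[['nom', 'prenom', 'age']]
```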

Now we are going to find the oldest person for each year and draw a graph to see the evolution.

import matplotlib.pyplot as plt

# Keep the oldest person for each year of death
df = df.sort_values(by='age', ascending=False).groupby(by='annee').first()
plt.figure(figsize=(20, 20), dpi=100)
df = df.reset_index()
df = df[df.annee > 1970]
plt.plot(df.annee, df.age)
plt.show()
Oldest persons by year

We can see the age increase up to 1990; since then we have been on a plateau, with some exceptions.

Number of deaths by year

Next, we are going to look at the number of deaths by year.

# We reload the dataframe
df = pd.read_pickle('<YOUR DIRECTORY>\\df.pickle')
df['date_deces'] = df.date_deces.dt.strftime('%Y')
df = df.groupby(by=['annee', 'date_deces']).size().to_frame()
df = df.sort_values(by=[0], ascending=False)
df = df.reset_index()
# We are drawing the result
import matplotlib.pyplot as plt
from matplotlib.pyplot import figure
figure(figsize=(20, 20), dpi=100)
df = df[df.annee>1972]
df = df.sort_values(by=['annee'], ascending=False)
plt.plot(df.annee, df[0], label='Death by year')
plt.show()
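The same count per year can also be obtained more directly with value_counts. This is a sketch on made-up years, assuming only that the dataframe has the annee column created earlier; deces_par_an is a name chosen for the example.

```python
import pandas as pd

# Made-up years of death
toy = pd.DataFrame({'annee': [2019, 2020, 2020, 2021, 2020, 2019]})

# Count the rows for each year, then order the result chronologically
deces_par_an = toy['annee'].value_counts().sort_index()
```

Calling deces_par_an.plot() would then draw the curve with the years on the x axis.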

To really understand this graph, I think we should compare it with the evolution of the French population, but we can see that the curve was smoother from 1990 to 2010 and is now increasing quickly. We know the French population is aging, so that could be a lead. This graph also shows peaks such as the heat wave in 2003, the epidemic in 2015 and COVID in 2020.

Population born in another country

OK, now we are going to see what we can observe using the country of birth.

# We reload the dataframe
df = pd.read_pickle('<YOUR DIRECTORY>\\df.pickle')
# We use seaborn to draw the barplot
import seaborn as sns
from matplotlib.pyplot import figure
figure(figsize=(20, 20), dpi=100)
# We remove the blank country, which represents France
df = df[df['pays_naissance'] != '']
group = df.groupby(['pays_naissance']).size()
df = group.to_frame(name = 'nb_pays').reset_index()
df = df.sort_values(by=['nb_pays'], ascending=False)
# We are going to print the 15 firsts results
df = df.head(15)
sns.barplot(data=df, x=df.pays_naissance, y=df['nb_pays'], label="nb", color="blue")

Unsurprisingly, the top 6 are Algeria, Italy, Spain, Tunisia, Morocco and Portugal.

Death age by country of birth

We are now going to look at the evolution of the age at death for each of these countries to compare them. I’ll start in 1992 because before that date the results are limited by the number of cases.

# We reload the dataframe
df = pd.read_pickle('<YOUR DIRECTORY>\\df.pickle')
import matplotlib.pyplot as plt
from matplotlib.pyplot import figure
# First the overall average
figure(figsize=(20, 20), dpi=100)
group = df.groupby(by='annee')['age'].mean()
group = group[group.index > 1992]
plt.plot(group.index, group, label='Moyenne globale')
plt.legend()
# For ALGERIA
group = df.groupby(by=['annee', 'pays_naissance'])['age'].mean().to_frame()
group = group.reset_index()
group = group[group['pays_naissance'] == 'ALGERIE']
group = group[group['annee'] > 1992]
plt.plot(group.annee, group.age, label='ALGERIE')
plt.legend()
# For MOROCCO
group = df.groupby(by=['annee', 'pays_naissance'])['age'].mean().to_frame()
group = group.reset_index()
group = group[group['pays_naissance'] == 'MAROC']
group = group[group['annee'] > 1992]
plt.plot(group.annee, group.age, label='MAROC')
plt.legend()
# .......... Same thing for each country, then:
plt.plot(group.annee, group.age)
plt.show()
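Rather than repeating the same block for every country, the per-country averages can be computed in one step with a pivot table. This is a sketch on made-up rows; the list pays and the name moyennes are chosen for the example.

```python
import pandas as pd

# Made-up rows standing in for the real dataframe
toy = pd.DataFrame({
    'annee': [1993, 1993, 1993, 1994, 1994, 1994],
    'pays_naissance': ['ALGERIE', 'MAROC', 'ALGERIE', 'MAROC', 'ALGERIE', 'ITALIE'],
    'age': [70, 60, 80, 64, 75, 85],
})

# Countries to compare (ITALIE is left out on purpose here)
pays = ['ALGERIE', 'MAROC']
selection = toy[toy.pays_naissance.isin(pays) & (toy.annee > 1992)]

# One row per year, one column per country, mean age in each cell
moyennes = selection.pivot_table(index='annee', columns='pays_naissance',
                                 values='age', aggfunc='mean')
```

Calling moyennes.plot(figsize=(20, 20)) would then draw one labelled line per country.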

Very interesting: someone born in Italy or Spain lives on average 10 years longer than someone born in Portugal or Morocco.

I don’t have an exact explanation for this, because it can have many causes such as living conditions, work or nutrition, but we can see the difference is huge.

Another detail: since 2009, Algeria, which was around the average before that date, has lost 2 to 3 years.

Death age by month of birth

Now we are going to do something stranger, but the result is interesting because it is different from what I expected. We are going to compare the evolution of the age at death by month of birth.

# We reload the dataframe
df = pd.read_pickle('<YOUR DIRECTORY>\\df.pickle')
# We only keep the birth month
df['date_naissance'] = df.date_naissance.dt.strftime('%m')
tab_annee = [j for j in range(1972, 2022)]
tab_mois = [j for j in range(1, 13)]
tab_moy = []
for j in range(1972, 2022):
    tab = []
    for i in range(1, 13):
        if i<10:
            mois = '0' + str(i)
        else:
            mois = str(i)
        tab.append(df[(df.date_naissance == mois) & (df.annee == j)].age.mean())
    tab_moy.append(tab)
dfmoy = pd.DataFrame(tab_moy, columns=tab_mois, index=tab_annee)
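The double loop filters the whole dataframe once per year and per month, which can be slow on millions of rows. As a sketch on made-up rows, the same year by month table of mean ages can be built in a single pass with pivot_table; mois_naissance and dfmoy_toy are names chosen for the example.

```python
import pandas as pd

# Made-up rows standing in for the real dataframe
toy = pd.DataFrame({
    'date_naissance': pd.to_datetime(['1900-02-10', '1910-02-20',
                                      '1905-06-01', '1915-07-30']),
    'annee': [1980, 1980, 1980, 1981],
    'age': [80, 70, 75, 66],
})

# Birth month as an integer from 1 to 12
toy['mois_naissance'] = toy.date_naissance.dt.month

# Rows are years of death, columns are birth months, cells are mean ages
dfmoy_toy = toy.pivot_table(index='annee', columns='mois_naissance',
                            values='age', aggfunc='mean')
```

Year and month combinations with no rows simply appear as NaN in the resulting table.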

We draw the result:

import matplotlib.pyplot as plt
from matplotlib.pyplot import figure
figure(figsize=(20, 20), dpi=100)
plt.plot(dfmoy.index, dfmoy[1], label='Janvier')
plt.plot(dfmoy.index, dfmoy[2], label='Février')
plt.plot(dfmoy.index, dfmoy[3], label='Mars')
plt.plot(dfmoy.index, dfmoy[4], label='Avril')
plt.plot(dfmoy.index, dfmoy[5], label='Mai')
plt.plot(dfmoy.index, dfmoy[6], label='Juin')
plt.plot(dfmoy.index, dfmoy[7], label='Juillet')
plt.plot(dfmoy.index, dfmoy[8], label='Aout')
plt.plot(dfmoy.index, dfmoy[9], label='Septembre')
plt.plot(dfmoy.index, dfmoy[10], label='Octobre')
plt.plot(dfmoy.index, dfmoy[11], label='Novembre')
plt.plot(dfmoy.index, dfmoy[12], label='Décembre')
plt.legend()
plt.show()

Before seeing the plot, I imagined the lines would cross all the time, but the result is different and shows something more regular. People born in February or March die on average 6 months later than people born in June or July. This is confirmed by the result given by the following code :
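One possible way to check such a gap, not necessarily the code used for this article, is to average the age at death per birth month and convert the difference into months. The rows below are made up, so the resulting figure says nothing about the real data.

```python
import pandas as pd

# Made-up rows: birth month and age at death
toy = pd.DataFrame({
    'mois_naissance': [2, 2, 3, 6, 7, 7],
    'age': [80, 82, 81, 79, 80, 81],
})

# Mean age at death for each birth month
moyenne_par_mois = toy.groupby('mois_naissance')['age'].mean()

# Gap between February/March and June/July, converted from years to months
ecart_en_mois = (moyenne_par_mois.loc[[2, 3]].mean()
                 - moyenne_par_mois.loc[[6, 7]].mean()) * 12
```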

Month with the most deaths

Now we are going to do the same thing to find which month has the most deaths.

# We reload the dataframe
df = pd.read_pickle('<YOUR DIRECTORY>\\df.pickle')
df['date_deces'] = df.date_deces.dt.strftime('%m')
tab_annee = [j for j in range(1972, 2022)]
tab_mois = [j for j in range(1, 13)]
tab_sum = []
for j in range(1972, 2022):
    tab = []
    for i in range(1, 13):
        if i<10:
            mois = '0' + str(i)
        else:
            mois = str(i)
        tab.append(len(df[(df.date_deces == mois) & (df.annee == j)].index))
    tab_sum.append(tab)
dfsum = pd.DataFrame(tab_sum, columns=tab_mois, index=tab_annee)
# We are creating the graph
import matplotlib.pyplot as plt
from matplotlib.pyplot import figure
figure(figsize=(20, 20), dpi=100)
# We plot the death counts (dfsum), not the mean ages (dfmoy)
plt.plot(dfsum.index, dfsum[1], label='Janvier')
plt.plot(dfsum.index, dfsum[2], label='Février')
plt.plot(dfsum.index, dfsum[3], label='Mars')
plt.plot(dfsum.index, dfsum[4], label='Avril')
plt.plot(dfsum.index, dfsum[5], label='Mai')
plt.plot(dfsum.index, dfsum[6], label='Juin')
plt.plot(dfsum.index, dfsum[7], label='Juillet')
plt.plot(dfsum.index, dfsum[8], label='Aout')
plt.plot(dfsum.index, dfsum[9], label='Septembre')
plt.plot(dfsum.index, dfsum[10], label='Octobre')
plt.plot(dfsum.index, dfsum[11], label='Novembre')
plt.plot(dfsum.index, dfsum[12], label='Décembre')
plt.legend()
plt.show()

I can’t explain the peak in February 1989, but we can see that January is the worst month and September the month with the fewest deaths. Calculating the average, we have :
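One possible way to compute such an average, not necessarily the code used for this article, is to count deaths per year and month with crosstab and then average each month over the years. The dates below are made up, and dfsum_toy and moyenne_deces are names chosen for the example.

```python
import pandas as pd

# Made-up death dates
toy = pd.DataFrame({'date_deces': pd.to_datetime([
    '2019-01-05', '2019-01-20', '2019-09-10',
    '2020-01-15', '2020-09-01', '2020-09-30', '2020-01-02',
])})
toy['annee'] = toy.date_deces.dt.year
toy['mois'] = toy.date_deces.dt.month

# Rows are years, columns are months, cells are numbers of deaths
dfsum_toy = pd.crosstab(toy.annee, toy.mois)

# Average number of deaths for each month across the years, highest first
moyenne_deces = dfsum_toy.mean().sort_values(ascending=False)
```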

Conclusion

That’s all for this analysis. I’ll let the demography experts give their explanations; it’s not my area of expertise. It would be interesting to do the same analysis with other countries to see the differences. I hope this article makes you curious to find more data to analyze.

Thanks for reading.

Have fun.
