People Watching on the Wikipedia
Rock stars die young. This used to be common knowledge. Of course, with every year that the Rolling Stones refuse to retire and every concert where Mick keeps jumping, this seems less true: the Stones tour for 2020 is postponed, but only because of Covid. 27 used to be the age to die as a rocker; now a swath of hit makers are closing in on three times that.
It could still be true, of course. How would we find out? How do we find out anything these days? Check Wikipedia!
Wikipedia is of course a massive open, free online encyclopedia with more than 6 million articles, of which over a million are about people. It seems mostly a textual resource, but it actually contains quite a bit of structured data. Using the techniques described in the “How this works” section below, we can extract a table of people born after 1500 with a number of columns.
Grouping by occupation and then sorting by their average lifespan seems to confirm the theory:
If James Dean were alive today, he would be a rapper! Rappers die the youngest and billionaires the oldest, by quite a margin. Before you start decrying the injustice of young artists dying while old billionaires get the best health care money can buy, let’s think about the biases in the data here. The average year of birth for Wikipedia rappers is 1982, so they just haven’t had a chance to die old. As for billionaires, dying young as a billionaire is hard, since it means you have fewer years to become that rich. Kanye at 43 is a billionaire and a rapper but also very much still alive.
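For the curious, the grouping behind that table can be sketched in a few lines of Python. This is a minimal sketch and not the notebook's actual code: the row layout (occupation, born, died), the function name and the sample rows are all made up for illustration, and lifespan is approximated as death year minus birth year.

```python
from collections import defaultdict
from statistics import mean

# Made-up rows for illustration only; the real table is far larger.
people = [
    {"name": "Rapper A", "occupation": "rapper", "born": 1971, "died": 1996},
    {"name": "Rapper B", "occupation": "rapper", "born": 1972, "died": 1997},
    {"name": "Rapper C", "occupation": "rapper", "born": 1977, "died": None},
    {"name": "Poet A", "occupation": "poet", "born": 1800, "died": 1850},
    {"name": "Billionaire A", "occupation": "billionaire", "born": 1920, "died": 2010},
    {"name": "Billionaire B", "occupation": "billionaire", "born": 1930, "died": 2014},
]


def average_lifespan_by_occupation(rows):
    """Return (occupation, mean lifespan, count) tuples, shortest-lived first."""
    lifespans = defaultdict(list)
    for row in rows:
        if row["died"] is None:
            # Still alive, so there is no lifespan to average.
            continue
        lifespans[row["occupation"]].append(row["died"] - row["born"])
    summary = [(occ, mean(ages), len(ages)) for occ, ages in lifespans.items()]
    return sorted(summary, key=lambda item: item[1])


for occupation, avg, count in average_lifespan_by_occupation(people):
    print(f"{occupation:12s} {avg:5.1f} (n={count})")
```

Note that the living rapper is silently dropped from the average, which is exactly where the bias discussed above sneaks in.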
We can plot the year of birth vs the lifespan and see the bias quite clearly:
Average lifespan goes up, at first slowly, but after 1880 it rises quite rapidly, from around 70 to 79 in 1924. After that it drops like a stone: anyone born later who already has a death date must have died relatively young. This hurts the life expectancy of rappers calculated this way.
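To put a number on that bias, here is a tiny sketch of the oldest age at death that can show up in the data for a given birth year. The snapshot year of 2020 is an assumption based on the dates mentioned in this post, and the function name is made up for illustration.

```python
def max_observable_lifespan(birth_year, snapshot_year=2020):
    """Oldest age at death that can appear in the data for this birth year.

    snapshot_year is an assumed date for when the data was collected.
    """
    return snapshot_year - birth_year


for birth_year in (1880, 1924, 1982):
    print(birth_year, max_observable_lifespan(birth_year))
```

For someone born in 1982, the average birth year of the rappers, the cap is 38. Any rapper from that year who has a death date at all died at 38 or younger, no matter how long the rest of that cohort ends up living.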
To counter this bias, we can filter out anybody born after 1924. This will also filter out most rock stars (even Chuck Berry was born in 1926). Looking at the top 60 occupations, we get the following table:
- Aviators die the youngest. That seems to make sense.
- Generals outlive soldiers and both are outlived by admirals.
- Scientists have a higher life expectancy than sports stars. One reason might be that cricketers become famous at a younger age than scientists, and so have more opportunity to die young. In some sports, like boxing, the activity itself might of course reduce life expectancy.
- Republicans live a year longer than Democrats.
- Actresses outlive actors by four years.
Singers end up in the second column, with a life expectancy slightly below average. Then again, poets do die young, so maybe rock stars are just a combination of those two categories?
How this works
The dataset used here is generated from a Wikipedia dump (technically it is first imported into a PostgreSQL database, which helps with performance). These dumps contain the text of each article, laid out in the MediaWiki markup language (wikitext). This is what you see when you edit Wikipedia articles and it looks something like this:
Basically it is normal text with some special characters like ‘==’ and ‘[[’ that have an extra meaning: anything between [[ and ]] automatically becomes a link. All links refer to a different Wikipedia page, but some links are special. In the example above, the [[File:Einstein patentoffice.jpg link actually renders as an image.
Categories are another special type of link. Any link of the form [[Category:<category name>]] indicates that the page it appears on belongs to the category <category name>. For this project we look at three types of categories: births, deaths and occupations.
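Pulling the categories out of a page can be sketched with a regular expression. This is a simplified illustration, not the code from the repository: the pattern, the function name and the sample page are made up, and nested brackets (as found in image captions) are not handled.

```python
import re

# Matches [[target]] and [[target|label]]. Nested brackets are not handled.
LINK_RE = re.compile(r"\[\[([^\[\]|]+)(?:\|[^\[\]]*)?\]\]")


def extract_categories(wikitext):
    """Return the category names a page links to, in order of appearance."""
    categories = []
    for match in LINK_RE.finditer(wikitext):
        target = match.group(1).strip()
        if target.startswith("Category:"):
            categories.append(target[len("Category:"):].strip())
    return categories


# A made-up page, for illustration only.
sample = (
    "'''Jane Doe''' was an [[England|English]] [[cricket]]er.\n"
    "[[Category:1910 births]]\n"
    "[[Category:1982 deaths]]\n"
    "[[Category:English cricketers|Doe, Jane]]\n"
)
print(extract_categories(sample))
```

Ordinary links such as [[cricket]] are skipped, and the part after a pipe is dropped, so only the category name itself is kept.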
The births and deaths categories look like [[Category:1910 births]] and [[Category:1982 deaths]], in this case indicating a person who was born in 1910 and died in 1982. We select all pages from Wikipedia that have a matching birth category and assume they are about people (there’s a separate category for animal births). Pages without a matching death category are presumably about people who are still alive.
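Given a page's category names, the birth and death years can be read off as in the sketch below. The function name and patterns are assumptions for illustration. Requiring exactly four digits is a simplification that fits the post-1500 table used here and skips categories for decades or centuries.

```python
import re

# Four-digit years only: enough for people born after 1500.
BIRTH_RE = re.compile(r"^(\d{4}) births$")
DEATH_RE = re.compile(r"^(\d{4}) deaths$")


def life_years(categories):
    """Return (birth_year, death_year); each is None if no category matches."""
    born = died = None
    for category in categories:
        birth_match = BIRTH_RE.match(category)
        death_match = DEATH_RE.match(category)
        if birth_match:
            born = int(birth_match.group(1))
        elif death_match:
            died = int(death_match.group(1))
    return born, died


print(life_years(["1910 births", "1982 deaths", "English cricketers"]))
print(life_years(["1977 births", "American rappers"]))
```

A page counts as a person when the first value is not None, and as a living person when the second value is None.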
Extracting occupations is slightly more involved; these categories tend to be of the form “adjective adjective occupation”, for example, [[Category:English cricketers]] or [[Category:Shot-down aviators]]. We can get to a usable set of occupations, though, by growing a small initial set into a much broader one.
Start with a small, hand-picked list of occupations. Run through all categories of all people and, for each category that ends with one of our picked occupations, keep track of the prefix. Make a list of all these prefixes and keep the most common ones. Then run through all categories again and do the reverse: keep all categories that start with any of the discovered prefixes and keep track of the suffix. Keep a list of the most common suffixes and call that your list of occupations.
For example, let’s say the category ‘cricketers’ is on our initial list. By going through the list of all categories ending in ‘cricketers’ we discover a whole set of useful prefixes, for example ‘English’, ‘Indian’ and the slightly less generally applicable ‘left handed vs right handed’. If we now go through the list again and look for anything that starts with ‘English’ or ‘Indian’ we might discover that ‘scientist’ is a good candidate for occupation, since we find both English and Indian scientists (though not a lot of ‘left handed vs right handed scientists’).
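The two passes described above can be sketched as follows. This is a minimal sketch under stated assumptions rather than the repository's implementation: the function name, the cut-off parameters and the sample categories are made up, and the input is assumed to be a flat list of the category names of all people, repeats included.

```python
from collections import Counter


def grow_occupations(categories, seeds, top_prefixes=50, top_occupations=100):
    """Grow a seed list of occupations using shared category prefixes."""
    # Pass 1: collect whatever precedes a seed occupation.
    prefix_counts = Counter()
    for category in categories:
        for seed in seeds:
            if category.endswith(" " + seed):
                prefix_counts[category[: -len(seed) - 1]] += 1
    prefixes = [p for p, _ in prefix_counts.most_common(top_prefixes)]

    # Pass 2: collect whatever follows a discovered prefix.
    occupation_counts = Counter()
    for category in categories:
        for prefix in prefixes:
            if category.startswith(prefix + " "):
                occupation_counts[category[len(prefix) + 1:]] += 1
    return [o for o, _ in occupation_counts.most_common(top_occupations)]


# Made-up category names, for illustration only.
sample_categories = [
    "English cricketers",
    "Indian cricketers",
    "English scientists",
    "Indian scientists",
    "English poets",
    "1910 births",
]
print(grow_occupations(sample_categories, ["cricketers"]))
```

Starting from only ‘cricketers’, the sketch picks up ‘scientists’ and ‘poets’ through the shared prefixes ‘English’ and ‘Indian’, while ‘1910 births’ is ignored. On real data the two cut-offs do the heavy lifting, since rare prefixes and suffixes are mostly noise.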
The supplied table contains a few more columns, like country of birth and gender. Wikipedia doesn’t specify these fields in general, but using some data wrangling we get to reasonable estimates. These are not used for the current post though, so we’ll leave the details of those to some future installment.
Head over to https://github.com/DOsinga/wiki_import to see the tools used in this post to get to the data. If you just want the results, open the notebook. Or, fancier still, open the notebook in Google Colab and play with it there. And of course, drop me a line if you spot something interesting in the data.