Published in

The Startup

People Watching on the Wikipedia

Rock stars die young. This used to be common knowledge. Of course, with every year that the Rolling Stones refuse to retire and every concert where Mick keeps jumping, this seems less true: the Stones tour for 2020 is postponed, but only because of Covid. 27 used to be the age to die as a rocker; now a swath of hit makers are closing in on three times that.

27 Club graffiti in Tel Aviv. Left to right: Brian Jones, Jimi Hendrix, Janis Joplin, Jim Morrison, Jean-Michel Basquiat, Kurt Cobain, Amy Winehouse

It could still be true, of course. How would we find out? How do we find out anything these days? Check Wikipedia!

Wikipedia is of course a massive online, open and free encyclopedia with more than 6 million articles, of which over a million are about people. Wikipedia seems mostly a textual resource, but it actually contains quite a bit of structured data. Using the techniques described below in the “How this works” section, we can extract a table of people born after 1500 with a number of columns.

Grouping by occupation and then sorting by their average lifespan seems to confirm the theory:

Some occupations and their (raw) life expectancies in years
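The grouping step can be sketched in a few lines of plain Python. This is only an illustration: the rows and the field names (`occupation`, `born`, `died`) are invented here and need not match the columns of the actual table or the code in the notebook.

```python
from collections import defaultdict
from statistics import mean

# Invented rows; the field names are assumptions, not the real schema.
people = [
    {"name": "A", "occupation": "rapper", "born": 1971, "died": 1996},
    {"name": "B", "occupation": "rapper", "born": 1972, "died": 1997},
    {"name": "C", "occupation": "billionaire", "born": 1915, "died": 2017},
    {"name": "D", "occupation": "billionaire", "born": 1930, "died": None},  # still alive
]


def average_lifespan_by_occupation(rows):
    """Return (occupation, mean lifespan) pairs, shortest-lived first."""
    lifespans = defaultdict(list)
    for row in rows:
        if row["died"] is None:  # living people have no lifespan yet
            continue
        lifespans[row["occupation"]].append(row["died"] - row["born"])
    return sorted(
        ((occupation, mean(spans)) for occupation, spans in lifespans.items()),
        key=lambda pair: pair[1],
    )


ranking = average_lifespan_by_occupation(people)
```

Note that person D, who is still alive, simply drops out of the average. That detail matters for the interpretation of the table.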

If James Dean were alive today, he would be a rapper! Rappers die the youngest and billionaires the oldest, by quite a margin. Before you start decrying the injustice of young artists dying while old billionaires get the best health care money can buy, let’s think about the biases in the data here. The average year of birth for Wikipedia rappers is 1982, so they just haven’t had a chance to die old. As for billionaires, dying young as a billionaire is hard, since it means you have fewer years to become that rich. Kanye at 43 is a billionaire and a rapper but also very much still alive.

We can plot the year of birth vs the lifespan and see the bias quite clearly:

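Average lifespan goes up, at first slowly, but after 1880 it rises quite rapidly from around 70 to 79 in 1924, after which it drops like a stone. Anyone born later who already has a known lifespan must have died relatively young, while their longer-lived contemporaries are still alive and not counted. This hurts the life expectancy of rappers calculated this way.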
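A toy calculation shows where the drop comes from. The lifespans below are invented, and `snapshot_year` is a hypothetical parameter standing in for the moment the dump was taken; the point is only that counting the already-dead lowers the average for recent cohorts even when nothing about real mortality changes.

```python
from statistics import mean


def observed_mean_lifespan(birth_year, true_lifespans, snapshot_year=2020):
    """Average lifespan over cohort members who died on or before snapshot_year.

    Anyone still alive at the snapshot has no death year yet, so a
    "died minus born" calculation never sees them.
    """
    seen = [span for span in true_lifespans if birth_year + span <= snapshot_year]
    return mean(seen) if seen else None


# The same invented set of lifespans for every cohort.
lifespans = [30, 50, 70, 90]

early = observed_mean_lifespan(1900, lifespans)   # everyone has died
later = observed_mean_lifespan(1940, lifespans)   # the longest-lived member is still alive
recent = observed_mean_lifespan(1980, lifespans)  # only the earliest death is visible
```

With these numbers the observed average falls from 60 to 50 to 30 as the cohort gets younger, purely because of who is missing from the data.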

To counter this bias, we can filter out anybody born after 1924. This will also filter out most rock stars: even Chuck Berry was born in 1926. Looking at the top 60 occupations, we get the following table:

Some observations:

  • Aviators die the youngest. That seems to make sense.
  • Generals outlive soldiers and both are outlived by admirals.
  • Scientists have a higher life expectancy than sports stars. One reason might be that cricketers become famous at a younger age than scientists, and so have more opportunity to die young. In some sports, like boxing, the activity itself might of course reduce life expectancy.
  • Republicans live a year longer than Democrats.
  • Actresses outlive actors by four years.

Singers end up in the second column, with a life expectancy slightly below average. Then again, poets do die young, so maybe rock stars are just a combination of those two categories?

How this works

The dataset used here is generated from a Wikipedia dump (technically it is first imported into a Postgres database, which helps with performance). These dumps contain the raw text of every article, laid out in the MediaWiki markup language. This is what you see when you edit Wikipedia articles, and it looks something like this:

Some of the raw code underlying the Wikipedia

Basically it is normal text with some special character sequences like ‘==’ and ‘[[’ that have an extra meaning: anything between [[ and ]] automatically becomes a link. Most links refer to another Wikipedia page, but some links are special. In the example above, the link starting with [[File:Einstein patentoffice.jpg actually renders as an image.

Categories are another special type of link. Any link of the form [[Category:<category name>]] indicates that the page it appears on belongs to the category <category name>. For this project we look at three types of categories: births, deaths and occupations.
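Pulling the category links out of a page's markup can be done with a regular expression. The sketch below is one possible approach, not necessarily what the wiki_import tools do; the sample text and the `extract_categories` name are made up for illustration. It also allows for the optional sort key that may follow a `|` inside a category link.

```python
import re

# Matches [[Category:Name]] and [[Category:Name|sort key]]; captures only Name.
CATEGORY_RE = re.compile(r"\[\[Category:([^\]|]+)(?:\|[^\]]*)?\]\]")


def extract_categories(wikitext):
    """Return the category names linked from a page's wikitext."""
    return [name.strip() for name in CATEGORY_RE.findall(wikitext)]


# Invented sample page, not a real article.
sample = (
    "'''Example Person''' was an [[England|English]] cricketer.\n"
    "[[Category:1910 births]]\n"
    "[[Category:1982 deaths]]\n"
    "[[Category:English cricketers|Person, Example]]\n"
)
categories = extract_categories(sample)
```

Ordinary links such as `[[England|English]]` are ignored because they do not start with the `Category:` prefix.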

The births and deaths categories look like [[Category:1910 births]] and [[Category:1982 deaths]], in this case indicating a person who was born in 1910 and died in 1982. We select all pages from Wikipedia that have a matching birth category and assume they are about people (there’s a separate category for animal births). Pages without a matching death category are presumably about people who are still alive.
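Given a page's list of category names, the birth and death years fall out of a simple pattern match. This is a hedged sketch with hypothetical function names; since only the years are known, the resulting lifespan can be off by up to a year.

```python
import re

BIRTH_RE = re.compile(r"^(\d{1,4}) births$")
DEATH_RE = re.compile(r"^(\d{1,4}) deaths$")


def birth_and_death_year(categories):
    """Return (born, died); either is None when no matching category exists."""
    born = died = None
    for category in categories:
        birth_match = BIRTH_RE.match(category)
        death_match = DEATH_RE.match(category)
        if birth_match:
            born = int(birth_match.group(1))
        elif death_match:
            died = int(death_match.group(1))
    return born, died


def lifespan(categories):
    """Return died minus born in whole years, or None if either year is missing."""
    born, died = birth_and_death_year(categories)
    if born is None or died is None:
        return None
    return died - born
```

A page with a birth category but no death category yields `(born, None)`, which is how presumably living people show up.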

Extracting occupations is slightly more involved. Those categories tend to be of the form “adjective adjective occupation”, for example, [[Category:English cricketers]] or [[Category:Shot-down aviators]]. We can get to a usable set of occupations, though, by growing a small initial set into a much broader one.

Start with a small, hand-picked list of occupations. Run through all categories of all people and, for each category that ends with one of our picked occupations, keep track of the prefix. Make a list of all these prefixes and keep the most common ones. Then run through all categories again and do the reverse: keep all categories that start with any of the discovered prefixes and keep track of the suffix. Keep a list of the most common suffixes and call that your list of occupations.

For example, let’s say the category ‘cricketers’ is on our initial list. By going through the list of all categories ending in ‘cricketers’ we discover a whole set of useful prefixes, for example ‘English’, ‘Indian’ and the slightly less generally applicable ‘left handed vs right handed’. If we now go through the list again and look for anything that starts with ‘English’ or ‘Indian’ we might discover that ‘scientist’ is a good candidate for occupation, since we find both English and Indian scientists (though not a lot of ‘left handed vs right handed scientists’).
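The two passes described above can be sketched as follows. The function name, the cut-off parameters `top_prefixes` and `top_occupations`, and the toy category list are all assumptions made for this illustration; the real tool may count and filter differently.

```python
from collections import Counter


def grow_occupations(categories, seeds, top_prefixes=50, top_occupations=100):
    """Grow a seed list of occupations using shared category prefixes.

    `categories` is a list with one entry per (person, category) pair,
    so frequently used categories appear many times.
    """
    # Pass 1: for categories ending with a seed occupation, count the prefix.
    prefix_counts = Counter()
    for category in categories:
        for seed in seeds:
            if category.endswith(" " + seed):
                prefix_counts[category[: -len(seed) - 1]] += 1
    prefixes = [prefix for prefix, _ in prefix_counts.most_common(top_prefixes)]

    # Pass 2: for categories starting with a common prefix, count the suffix.
    suffix_counts = Counter()
    for category in categories:
        for prefix in prefixes:
            if category.startswith(prefix + " "):
                suffix_counts[category[len(prefix) + 1:]] += 1
    return [suffix for suffix, _ in suffix_counts.most_common(top_occupations)]


# Invented toy data.
all_categories = (
    ["English cricketers"] * 3
    + ["Indian cricketers"] * 2
    + ["Left-handed cricketers"]
    + ["English scientists"] * 2
    + ["Indian scientists", "English poets", "1910 births"]
)
occupations = grow_occupations(
    all_categories, seeds=["cricketers"], top_prefixes=2, top_occupations=3
)
```

With only the two most common prefixes kept, ‘Left-handed’ is dropped, while ‘scientists’ and ‘poets’ are discovered through ‘English’ and ‘Indian’.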

The supplied table contains a few more columns, like country of birth and gender. Wikipedia doesn’t specify these fields in general, but using some data wrangling we get to reasonable estimates. These are not used for the current post though, so we’ll leave the details of those to some future installment.

Head over to https://github.com/DOsinga/wiki_import to see the tools used in this post to get to the data. If you just want to see the results, open the notebook. Or, even fancier, open the notebook in Google Colab and play with it there. And of course, drop me a line if you spot something interesting in the data.

Douwe Osinga

Entrepreneur, Coding enthusiast and Co-Head of Delve a Sidewalk Labs product.
