Was 2016 especially dangerous for celebrities? An empirical analysis.

Jason Crease
Jan 2, 2017
This will be explained later. Guess which one is 2016?

It’s become cliché that unusually many prominent people died in 2016. Is this true? To answer this we need to know:

  1. (The easy part) What is unusually many?
  2. (The hard part) What is a celebrity?

The BBC analysis

But couldn’t this just be due to an increasing number of pre-prepared obits, or some other long-term trend? You can try to account for this by extrapolating from 2012 to 2015 (I used a logarithmic trend; a quadratic gave similar results). On that basis, I’d expect 36.4 celebrities to die in 2016. 49 did.

Using the obvious Poisson interpretation, P(Deaths ≥ 49) = 0.026. So a 1 in 40 year freakiness.
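For anyone who wants to check that tail probability, here is a minimal sketch in Python. It assumes yearly deaths are Poisson with the predicted mean of 36.4; the function name `poisson_tail` is mine, purely for illustration.

```python
import math

def poisson_tail(k, lam):
    """Return P(X >= k) for X ~ Poisson(lam), via 1 - P(X <= k - 1)."""
    if k <= 0:
        return 1.0
    term = math.exp(-lam)      # P(X = 0)
    cdf = term
    for i in range(1, k):      # accumulate P(X = 1) .. P(X = k - 1)
        term *= lam / i
        cdf += term
    return 1.0 - cdf

p = poisson_tail(49, 36.4)     # 49 observed deaths vs 36.4 expected
print(f"P(Deaths >= 49) = {p:.3f}")
print(f"roughly a 1 in {1 / p:.0f} year event")
```

The "1 in N years" figure is just the reciprocal of the tail probability, which only makes sense if each year is treated as an independent draw.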

Just taking January to April gives an even more extreme picture. I’d predict 13.7 deaths — instead there were 24. This has a probability of just 0.007. The specific choice of January to April stinks of data-dredging, but I’m still kinda impressed.

Wikipedia and prominence

  1. It has a British bias (although it’s obviously impossible to be entirely objective.)
  2. When do they prepare obits? Maybe they just happened to write a load in December 2015.
  3. The decision to prepare an obit still remains the subjective opinion of a few bods at the BBC.
  4. Maybe the 2016 deaths were merely unusually expected, thus had obits ready.

Wikipedia to the rescue!

My analysis is similar to the one in the book Who’s Bigger? You may just want to skip my article and read that book.

Using C#, the Wikipedia API, and plenty of regexes, I extracted a list of prominent deaths from each year’s summary page, e.g. https://en.wikipedia.org/wiki/1992#Deaths . This gives a total of 6475 people, or roughly 20 a month. Then I used the Wikipedia API to get the length of each biography in bytes, and the number of revisions per article.
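My own code was C#, but the length lookup can be sketched in a few lines of Python. This is an illustration rather than what I actually ran: it assumes the standard MediaWiki `action=query` endpoint with `prop=info`, which reports each page's `length` in bytes. The helper names and the User-Agent string are made up. Revision counts are not returned by this call and would need a separate request (for example, paging through `prop=revisions`), which is not shown.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://en.wikipedia.org/w/api.php"

def build_info_query(titles):
    """Build a query URL asking for basic page info on several titles."""
    params = {
        "action": "query",
        "prop": "info",
        "titles": "|".join(titles),   # the API accepts a batch of titles
        "format": "json",
        "formatversion": "2",         # pages come back as a list
    }
    return API_URL + "?" + urllib.parse.urlencode(params)

def parse_lengths(response):
    """Map title -> article length in bytes, skipping missing pages."""
    pages = response["query"]["pages"]
    return {p["title"]: p["length"] for p in pages if "length" in p}

def fetch_lengths(titles):
    """Call the API (network access required) and return the lengths."""
    request = urllib.request.Request(
        build_info_query(titles),
        headers={"User-Agent": "celebrity-deaths-sketch/0.1 (illustrative)"},
    )
    with urllib.request.urlopen(request) as resp:
        return parse_lengths(json.load(resp))

# Usage (hits the live API, so keep batches small and infrequent):
# fetch_lengths(["David Bowie", "Johan Cruyff"])
```

Batching titles into one request is the polite way to do this, and would have saved me some of that guilt.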

I probably hit the web API pretty hard, so I made a small donation out of guilt :(.

Article length and revisions as a measure of prominence

This is kinda unsatisfactory. Johan Cruyff’s long football career gives him a long, detailed article, but is he really more significant than Michael Jackson? Michael Jackson has 8x as many revisions as Johan Cruyff; I presume this is because people pay him 8x as much attention.

These are the 20 articles with the most revisions:

Ah, that’s better! Every one a mega-celebrity. Note three are from 2016.

But now I find there is a bias towards contentious figures (such as Indian guru Sathya Sai Baba), and those whom the man in the street has a lot to say about. Some important long-dead figures have good biographies that were rapidly and conclusively written in a few sessions by scholars. Surely they deserve recognition?

A few other random bits:

  • The longest biography on Wikipedia is of Belgian astronomer Eric Walter Elst. It tediously lists thousands of asteroids that he discovered, but has few revisions.
  • Plotting Revisions against Lengths shows a good correlation between the two. The Spearman rank correlation coefficient is 0.884, which is quite high.
  • Looking at revisions and lengths there is an exponential trend. That is, something like 80% of the length/revisions is in 20% of the articles.
  • Most Wikipedia editors are American, male, nerdy, and young. I suspect.
  • I’m only using the English Wikipedia. My analysis is Anglocentric. And US-centric.
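The Spearman coefficient mentioned above is just a Pearson correlation computed on ranks instead of raw values. Here is a small self-contained sketch; the function names are mine, and the sample numbers in the usage line are invented, not my real data.

```python
import math

def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # extend j over the run of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

def spearman(xs, ys):
    """Spearman rank correlation: Pearson on the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))

# Invented (length, revisions) pairs, purely to show the call:
lengths = [1200, 45000, 8000, 150000, 30000]
revisions = [15, 900, 60, 700, 2500]
print(spearman(lengths, revisions))
```

Using ranks makes the measure insensitive to the heavy skew in both variables, which is why it suits this kind of data better than a raw Pearson correlation.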

My definition of celebrity

A maximal celebrity will score 1.0. Unknowns will score 0.0.

The harmonic average has the nice property that it penalises those with an unusually high score for only one of Length or Revisions. So a person with a very long article that has only been revised a few times is probably an anomaly, and will score poorly. Likewise, a short biography that has been heavily revised will also score poorly.
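To make the harmonic-average idea concrete, here is a rough sketch. The exact normalisation is an assumption on my part for this example: Length and Revisions are each divided by the largest value in the dataset, so both land between 0 and 1 before being combined. The function name and the numbers are invented.

```python
def harmonic_score(length, revisions, max_length, max_revisions):
    """Harmonic mean of normalised length and revisions, in [0, 1].

    Normalising by the dataset maximum is an assumption made for this
    sketch, not necessarily the scaling used for the real rankings.
    """
    l = length / max_length
    r = revisions / max_revisions
    if l == 0 or r == 0:
        return 0.0
    return 2 * l * r / (l + r)

# A lopsided article (very long, hardly revised) scores far lower than
# its arithmetic mean of 0.505 would suggest:
print(harmonic_score(length=100, revisions=1, max_length=100, max_revisions=100))

# A balanced article scores the same as its arithmetic mean:
print(harmonic_score(length=50, revisions=50, max_length=100, max_revisions=100))
```

The harmonic mean is dominated by the smaller of the two inputs, which is exactly the behaviour that pushes the asteroid-list style of outlier down the ranking.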

Here are my top-30 based on this metric:

This seems like a nice compromise between the two metrics.

Here’s a few other random mega-celebrities for comparison:

I now make two convenient definitions: a P200 and a P1000, a mega-celebrity and celebrity respectively. Note that every P200 is also a P1000.

  • You’re a P200 if you’re in the top 200 of my list, for those dying 2000–2016. Just making it into P200 territory are Enoch Powell and Edward Heath.
  • You’re a P1000 (or P1K) if you’re in the top 1000. Just making it are Dom DeLuise and Jeff Hanneman.

Prominent People’s deaths in 2016 on Wikipedia

Looking at P200s and P1Ks, there appears to be a long-term linear trend. I guess this is because articles on living celebrities are continuously expanded, so recently dead celebrities have longer articles.

Indeed, Wikipedia statistics show linear increases in most metrics since 2010. I think it’s reasonable to fit a linear trend to 2000–2015, and extrapolate it to predict 2016.
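The fit-and-extrapolate step looks something like the sketch below: ordinary least squares on the yearly counts, then evaluating the fitted line at 2016. The yearly counts here are placeholders I made up to show the mechanics. They are not my actual P200 or P1K numbers.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a*x + b; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = my - a * mx
    return a, b

def predict(a, b, x):
    """Evaluate the fitted line at x."""
    return a * x + b

# PLACEHOLDER counts for 2000..2015 (invented for illustration only).
years = list(range(2000, 2016))
deaths = [5, 7, 6, 9, 8, 10, 9, 12, 11, 13, 12, 15, 14, 16, 15, 17]

a, b = fit_line(years, deaths)
print(f"expected deaths in 2016: {predict(a, b, 2016):.1f}")
```

The predicted value then plays the role of the expected count that the observed 2016 figure is compared against.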

P200s

2016’s P200s were: Fidel Castro, Muhammad Ali, David Bowie, Prince, George Michael, Johan Cruyff, Bhumibol Adulyadej, Leonard Cohen, Antonin Scalia, Elie Wiesel, Nancy Reagan, John Glenn, Carrie Fisher, Chyna, Harper Lee, Kimbo Slice, Ernst Nolte, Rob Ford, Pierre Boulez, Alan Rickman, Shimon Peres, Christina Grimmie, Terry Wogan, Abbas Kiarostami, and Merle Haggard.

P1Ks

Technical note: Deaths are Poisson distributed, not normal! What’s with this linear-least-squares rubbish?

λ(t) = at + b

By the central limit theorem, a Poisson count with a large λ (here, the observed deaths per year) is approximately Gaussian. So doing linear least-squares regression, which assumes Gaussian residuals, on the observed deaths could be fine in the limit.

However, a Poisson’s variance equals λ, so since λ itself increases with time, the residuals will increase in magnitude too. Additionally, the normal approximation is poor for small λ, especially at the tails, which is where we are.

So let’s redo the maths with the Poisson CDFs. Taking the λs from the earlier linear fit:

For P200s: P(D ≥ 25 | λ = 17) = 0.04

For P1000s: P(D ≥ 99 | λ = 78) = 0.01
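Here is a quick sketch for reproducing those two numbers and comparing them with what a Gaussian approximation would give. The normal version uses mean λ, variance λ, and a continuity correction, which is my choice for this example. Function names are illustrative.

```python
import math

def poisson_tail(k, lam):
    """Exact P(D >= k) for D ~ Poisson(lam)."""
    if k <= 0:
        return 1.0
    term = math.exp(-lam)      # P(D = 0)
    cdf = term
    for i in range(1, k):
        term *= lam / i
        cdf += term
    return 1.0 - cdf

def normal_tail(k, lam):
    """Gaussian approximation to P(D >= k), with continuity correction."""
    z = (k - 0.5 - lam) / math.sqrt(lam)
    return 0.5 * math.erfc(z / math.sqrt(2))

for label, k, lam in [("P200s", 25, 17), ("P1000s", 99, 78)]:
    exact = poisson_tail(k, lam)
    approx = normal_tail(k, lam)
    print(f"{label}: exact Poisson = {exact:.3f}, normal approx = {approx:.3f}")
```

The Gaussian figure comes out smaller than the exact one for the P200s, which is the tail problem described above: the approximation understates how often a Poisson throws up a large count.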

It still looks like an unusually high number of celebrities died, but the number of mega-celebrity deaths was less surprising than the large number of rank-and-file celebrities.

Conclusion