Wikipedia’s 100 Most Influential People

A quick introduction

Ethan Hansen
7 min read · Jul 30, 2018

Every year, Time Magazine releases its coveted “100 Most Influential People” of the year, a list filled with icons, revolutionaries, presidents and dictators. However, every so often, Time will update its most coveted list: “The 100 Most Influential People of All Time”. All time! That’s a really, really, really long time. For a matter of perspective, I (a 17 year old) have only been alive for 0.00026% of the time humans have existed on our humble blue planet, and the editors of Time not that much longer. Thus, the sheer brevity of our inconsequential life span begs the question: how can a few editors at Time really determine who has had the most influence, effected the most change, impacted the most people?

If you’re currently writing an angry email to Time after this earth-shattering revelation, stop! Yes, I understand and sympathise with the anger that arises from the obvious arrogance and naivety of those at Time. Rather, I propose change: a new way to determine who deserves to be on that celestial list. The answer to our problem…
Wikipedia.

Yes, the Wikipedia. The website that teachers tell you to avoid, as it is “full of inaccuracies”, and that is partly true. As a community-run site, Wikipedia allows average citizens like you or me to edit its entries. As a result, in our use of Wikipedia, we will try to avoid the pure content as much as possible… “How can we possibly get ANY relevant information by not looking at the content?”, I hear you asking. Ever heard of quality over quantity? We’ll be using that ideology flipped on its head. Rather than focusing on the quality, we will focus solely on the quantity. Akin to the early Google algorithm, which ranked sites based on the number of back-links they possessed, we will rank the current individuals on the list according to the number of links they have on their Wikipedia page.
Let’s begin by analysing the links.

Math Theory

The maths used to analyse and compare the different suggested lists falls under the category of Linear Modelling. A key component of Linear Modelling is Correlation, the relationship between two quantitative variables.
There are 3 essential factors of correlation.
Direction: the direction/trend of the data plotted on a graph, which helps define whether the variables have a direct or inverse relationship. Data with an upward trend is defined as having a positive correlation; data with a downward trend is defined as having a negative correlation.
Linearity: whether or not the data approximately forms a straight line when plotted.
Strength: the measure of how strictly the data follows a linear pattern. Defined as either strong (directly along a straight line), moderate (approximately following a direction with a degree of linearity) or weak (little or no direction, linearity or general pattern).

Pearson’s Correlation Coefficient
Pearson’s correlation coefficient is a quantitative way to analyse data. It provides a single value ranging from -1 to 1 that summarises the linear correlation of the data. A negative value indicates a negative correlation and a positive value indicates a positive correlation. The magnitude of the value represents the strength of the correlation: a value of -1 is a perfect negative correlation, a value of 1 is a perfect positive correlation, and a value near 0 indicates little or no linear relationship.
In this analysis we will use all of these features of linear analysis, to give both brief insights and an in-depth, comparative analysis of the two proposed lists.
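For anyone who wants to see the coefficient computed rather than just described, here is a minimal sketch in plain Python. The function name `pearson_r` is my own for illustration, and this is not necessarily the tool used to produce the R scores later in the article; it is simply the standard definition (covariance divided by the product of the spreads) written out.

```python
import math


def pearson_r(xs, ys):
    """Pearson's correlation coefficient for two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length (at least 2 points)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations (proportional to the covariance)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (proportional to each variance)
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("r is undefined when one variable is constant")
    return cov / math.sqrt(var_x * var_y)
```

Feeding it points that sit exactly on a rising line gives a value of 1 (up to floating-point rounding), and points on a falling line give -1, matching the two extremes described above.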

Links vs Rank

Time Magazine
When modelled as a scatter plot (number of links / rank), the original Time list looks like this.

There is no obvious correlation between the two variables (number of links and rank), so no line of best fit can be drawn. Along the x axis the distribution is relatively even, whereas along the y axis the spread of points could be indicative of an exponential relationship between the top and bottom ranked individuals on our refactored list.
Wikipedia

Our new list, dependent on the number of links on each individual’s Wikipedia page, is modelled below. Note: as the number of links is the refactoring variable, the chart will have a distinct inverse relationship (as the rank number falls towards 1, the number of links increases).
The chart shows a strong relationship between rank and number of links, reflective of their intrinsic relationship. Interestingly, the left-hand values of the graph curve up, as if to model an exponential relationship rather than a linear one. However, as most of the points fall on the line of best fit, and the cluster of outliers isn’t too significantly isolated, the data will be treated as if it is linear.
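The re-ranking itself is nothing more than a sort. As a rough sketch, assuming the link counts have already been collected into a dictionary (the names and numbers below are placeholders, not real counts from Wikipedia, and `rank_by_links` is a hypothetical helper):

```python
def rank_by_links(link_counts):
    """Return {name: rank}, where rank 1 is the page with the most links."""
    ordered = sorted(link_counts, key=lambda name: link_counts[name], reverse=True)
    return {name: position for position, name in enumerate(ordered, start=1)}


# Placeholder data for illustration only
sample = {"Person A": 120, "Person B": 450, "Person C": 300}
ranks = rank_by_links(sample)
```

Because rank is derived directly from the link count, a strong inverse relationship between the two is guaranteed by construction, which is why the chart above looks so tidy.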
Now that we have our initial list, we need to determine whether or not Wikipedia truly offers the best and most valid data for determining who the 100 most influential people of all time really are.
We can do this by introducing a third variable to test for biases in either of the lists, and plotting this third variable against the individuals’ respective ranks on the two lists. In a perfect world, there will be no linear relationship between the two variables, as that would reflect the idea that the list is a result of thorough analysis, and that no one factor holds too significant an amount of influence when it comes to determining individual placements. The variable…
Time.
No, not Time Magazine: the time these influential individuals were alive. Or rather, when they were born. Both lists can be tested against time, and the results indicate whether or not the editors at Time, in their naivety, prioritised recent figures of influence, with the human view that here and now is the epicentre of everything and everywhere (Geocentric vs Heliocentric).
Note: a slight negative correlation can be justified, as with an increasing population comes increasing opportunity for a larger magnitude of influence.

Birthdate Vs Rank

Time Magazine

Fortunately for the Time editors, there is no obvious correlation between date of birth and rank. This is backed up by an R score (Pearson’s Correlation Coefficient) of -0.16, a score that indicates a slight negative trend, reflective of the growing population, but overall represents a very weak relationship between date of birth and rank. This suggests that the editors of Time incorporated other factors in justifying individual ranks on their list, and shows little evidence of bias between the two elements.

Wikipedia

Uh oh… When our new list is modelled against DOB, a moderate negative correlation appears, indicative of a bias in the data, with an unjustifiable amount of influence given to one factor: time, or rather recency. The R score of -0.65 demonstrates a moderate negative correlation between DOB and rank. Additionally, the points follow a linear pattern closely enough for a line of best fit to be appropriately drawn.
A comparison
Rather than visually analysing the graphs, the calculation of the R score allows for a direct quantitative comparison of the two sets of relationships. Given the growth of population, an R value of -0.1 can be set as a justifiable benchmark for a neutral relationship between DOB and rank, indicating little or no bias or unbalanced influence given to the time they were alive. Time’s R score comes in at -0.16, only slightly stronger than the benchmark value, and a score reflective of little bias and a balance of influence given to all factors in the justification of placements on the list. On the other hand, our Wikipedia list has an R score of -0.65, showing a moderate negative correlation. The magnitude of this score shows that, based on our re-ranked Wikipedia list, there is an underlying bias in the information, giving too much influence to time of existence.
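To put the comparison in numbers, here is a small sketch that measures how far each list’s R score sits beyond the benchmark. The R values and the -0.1 benchmark are the ones quoted above; the helper `excess_over_benchmark` and the idea of subtracting magnitudes are my own illustration, not a standard statistical test.

```python
BENCHMARK_R = -0.10  # neutral benchmark proposed in the article
scores = {"Time": -0.16, "Wikipedia": -0.65}  # R scores reported in the article


def excess_over_benchmark(r, benchmark=BENCHMARK_R):
    """How much stronger (in magnitude) r is than the benchmark value."""
    return abs(r) - abs(benchmark)


# Rounded to two decimal places to hide floating-point noise
gaps = {name: round(excess_over_benchmark(r), 2) for name, r in scores.items()}
```

Here `gaps` comes out as 0.06 for Time and 0.55 for Wikipedia, which is the same story the graphs tell: Time sits close to the benchmark, while the Wikipedia list is well beyond it.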

Conclusion

So there we have it. It turns out Time didn’t do so terribly after all. However, nothing is perfect, and a comparison based on a broader and more time-independent source than Wikipedia may produce drastically different results. The bias can be explained, however. As Wikipedia is an online source, and the internet was only created 3 decades ago, there are likely to be more sources of information about recent influential figures. This claim, however, excludes religious figures, whose impact is found in their continued legacy, and who are thus talked about as much today as they were initially.
Sooooooooo, What?
Well, this reflects a real and modern issue with data analysis, and a potentially disastrous situation for machine learning. Yes, we as a society have gathered information, and a lot of it at that. And yes, that information can be useful. But that’s intrinsically small-picture thinking. Societies move in cycles, and we only really started recording data for the internet in the 1900s. We have been alive for a lot longer than that, and thus the naivety that comes with the egocentricity of humans could be our downfall. A dependence on machine learning, based on information that really only reflects recent events, could lead to a system that can and will be blindsided by any situation that occurred before Google. Anything from the plague to a super volcano eruption should not be dismissed just because we don’t have ‘data’ on it.

If you enjoyed
Twitter — @the_firealarm
Web — thefirealarm.co
