Wikipedia’s Ongoing Search for the Sum of All Human Knowledge

Oxford Internet Institute
Oxford University
Published Jan 20, 2016

Wikipedia co-founder, Jimmy Wales, is famously quoted as asking us to: “Imagine a world in which every single person on the planet is given free access to the sum of all human knowledge”. This is pretty much the Wikipedia mission statement: certainly a laudable one, and the recent passing of the five-million-articles mark in the English edition is undoubtedly impressive.

We all use Wikipedia, but how many people (yes you, the reader..) actually contribute to it? In order to celebrate Wikipedia’s 15th Birthday the Oxford Internet Institute (University of Oxford) organized a public editathon to help improve content around “the Social Internet”, with training led by the Bodleian Libraries’ Wikimedian in Residence, Dr Martin Poulter.

At the OII, we have long had an interest in Wikipedia as an object of research. In fact, our first doctoral thesis in 2006 examined Wikipedia’s governance. Our most recent doctoral thesis examined how Wikipedia’s policies about sourcing lead to the creation of “facts”. We were delighted to share some of these insights as well as help first time editors to cut their teeth on editing the world’s largest encyclopedia.

In the Oxford Internet Institute’s library: surrounded by knowledge, and adding it to Wikipedia!

The event combined practical training for first-time editors with short presentations of some of the insights from our research into the platform. The presentations aimed to provide wider context to the editing efforts, examining issues like the geographical coverage and bias of Wikipedia’s content, the increasingly central role Wikipedia plays in today’s information environment, the nature of its famous editing wars, and the cross-influence between Wikipedia’s 291 different language editions.

As part of his excellent general introduction for first-time editors, Martin reminded us of the five pillars of Wikipedia: (1) Wikipedia is an encyclopedia; (2) it is written from a neutral point of view; (3) it is free content that anyone can use, edit, and distribute; (4) people should treat each other with respect and civility; (5) there are no firm rules. As a prompt to further discuss Wikipedia’s core values before we got our hands dirty with actual mark-up, he asked a seemingly simple question: “Is Wikipedia free from censorship?” It wasn’t exactly that all hell immediately broke loose in the room, but the question certainly sparked lots of energetic discussion.

Discussion around Wikipedia’s famous edit wars

For a start: what exactly do we mean by “censorship”, and how does this differ from “compliance with the law”? And which country’s laws, exactly? Very fine points indeed. (The expected answer was “No” — which is either basically true, or entirely true depending on your viewpoint…). What’s clear is that Wikipedia is a vast and tremendously complex structure: over the day we took a look under the hood to grapple with some of the big questions around equality, censorship, bias, conflict — as well as more prosaic but no less important ones like “how do I add a citation to this article?”

The World According to Wikipedia

The opening research presentations were kicked off by the OII’s Dr Mark Graham, in an exploration of the geographical nature of Wikipedia. About 25% of Wikipedia’s articles are geocoded: they carry geographical coordinates indicating a connection to some place in physical space (for example, an article about Nelson’s Column that includes its location). This means we can have a pretty good go at placing Wikipedia on a map, to find out what places and things in the world are written about, and where exactly they are. Or not. Many places aren’t yet written about.
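To make the idea of “placing Wikipedia on a map” concrete, here is a minimal sketch of how geocoded articles could be counted per grid cell. It is an illustration only, not the method used in the research: the function names, the 10-degree cell size, and the sample coordinates (approximate values chosen for the example) are all assumptions.

```python
import math
from collections import Counter


def grid_cell(lat, lon, cell_deg=10.0):
    """Return the (row, col) index of the grid cell containing a coordinate.

    Rows count up from the South Pole, columns from longitude -180.
    """
    if not (-90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0):
        raise ValueError("coordinate out of range")
    # Clamp so that lat == 90 or lon == 180 falls in the last cell.
    max_row = int(180 / cell_deg) - 1
    max_col = int(360 / cell_deg) - 1
    row = min(int(math.floor((lat + 90.0) / cell_deg)), max_row)
    col = min(int(math.floor((lon + 180.0) / cell_deg)), max_col)
    return row, col


def count_articles_per_cell(articles, cell_deg=10.0):
    """Count geocoded articles in each grid cell.

    `articles` is an iterable of (title, latitude, longitude) tuples.
    """
    counts = Counter()
    for _title, lat, lon in articles:
        counts[grid_cell(lat, lon, cell_deg)] += 1
    return counts


# Hypothetical sample data: approximate coordinates, for illustration only.
SAMPLE = [
    ("Nelson's Column", 51.5077, -0.1279),
    ("Bodleian Library", 51.7540, -1.2540),
    ("Fotokol", 12.3667, 14.2167),
]

if __name__ == "__main__":
    for cell, n in sorted(count_articles_per_cell(SAMPLE).items()):
        print(cell, n)
```

Cells with a count of zero are the interesting ones: they are the places that are not yet written about.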

The “geography of information” (the relation between information and space) has always mattered, but perhaps matters even more now that the digital and the material are increasingly woven together, not just representing but also constituting spaces. If you want to find a restaurant in Tel Aviv, Google Maps will tailor the results according to the language of the request: completely different representations of the physical space will be presented, often invisibly to the user.

But who (or what) makes these decisions about what we see, and therefore what we go on to do? We can no longer understand “the representation” and “the represented” as separate or isolated, and the dynamic between the two is what’s interesting. Wikipedia doesn’t just reflect the world, it shapes it: Magritte’s famous “image of a pipe” is becoming the pipe itself.

The world revealed by Wikipedia. Can you spot anything missing? http://geography.oii.ox.ac.uk/

This really matters if there’s feedback between the representation and the thing being represented. Things that are more visible (e.g. on Wikipedia) will become more visible (e.g. in Google), and more written about (e.g. in the press), and therefore more visible (to the public), and more written about (e.g. on Wikipedia: and now you have a media source!). Instant feedback loop. The rich get richer, and the poor may disappear entirely.
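The feedback loop described here can be sketched as a toy model. This is a deliberately simplified illustration, not a model from the research: it assumes a fixed amount of new attention per round, shared out in proportion to each topic’s current visibility raised to a hypothetical exponent `alpha`. With `alpha = 1` the shares stay constant; with `alpha > 1` (an assumed stand-in for ranking effects that favour already-visible topics) the less visible topic’s share shrinks over time.

```python
def simulate_visibility(initial, rounds=50, new_attention=10.0, alpha=1.5):
    """Deterministic toy model of a visibility feedback loop.

    Each round, `new_attention` units are split among topics in
    proportion to visibility ** alpha. All parameters are illustrative.
    """
    if any(v <= 0 for v in initial):
        raise ValueError("initial visibilities must be positive")
    visibility = list(initial)
    for _ in range(rounds):
        weights = [v ** alpha for v in visibility]
        total = sum(weights)
        visibility = [
            v + new_attention * w / total
            for v, w in zip(visibility, weights)
        ]
    return visibility


def shares(values):
    """Convert absolute visibilities into fractions of the total."""
    total = sum(values)
    return [v / total for v in values]


if __name__ == "__main__":
    start = [10.0, 1.0]  # a well-covered topic and a poorly covered one
    print("initial shares:", shares(start))
    print("alpha = 1.0:   ", shares(simulate_visibility(start, alpha=1.0)))
    print("alpha = 1.5:   ", shares(simulate_visibility(start, alpha=1.5)))
```

Even in the proportional case the absolute gap between the two topics keeps growing; the superlinear case adds a shrinking share on top of that.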

And the Poor Get Poorer

Most academic content (one of the things Wikipedia thrives on) is published in a few places: principally the US and UK. This means there are many countries, most obviously across Africa, where most of the content about them is produced by outsiders: more than 95% of the content in some cases. In certain countries, e.g. Kenya, locals may use local newspapers to generate articles about Kenyan things, but may face problems (such as reversions and deletions) if outside editors don’t accept these sources as authoritative.

The highly uneven spatial distribution of (geotagged) Wikipedia articles in 44 language versions of the encyclopaedia.

So is Wikipedia making long-standing issues around global visibility and inequality better or worse? Who knows, basically: but it doesn’t on the face of it seem to be fixing the problem. Indeed, there are indications that systemic bias is more pronounced in Wikipedia than in academia generally; Wikipedians being more conservative than journal editors. Long-standing editors can be quite territorial — and attitude change can be hard to realize.

This systemic bias towards the West is all filtered through the great Wikipedia lens of “notability”. Does the massive geographical skew we see simply reflect that things are inherently more “notable” in Europe and the US (the Rockies and Great Plains excepted…)? Obviously not, but what gets classed as notable in part depends on power — i.e. which editors are making the decision, and perhaps what language they speak and where they’re from — but it also depends on the availability of suitable sources, and what information will be allowed as a source. The fact there is such bias in Wikipedia is a terrible thing, but practically: what can we do to correct it when the available source material can be so scant and so difficult to access?

A Local Encyclopedia For Local People

In the next presentation, the OII’s Dr Bernie Hogan highlighted the example of the press response to a Boko Haram attack in the Cameroonian town (or maybe it’s a village, or a commune?) of Fotokol. Wikipedia was widely used to bulk-up press stories, while being unable to meet the needs of journalists — most basically, in supplying an accurate number of fatalities, which varied widely in the resulting coverage. One obvious solution is to encourage local people to edit content about their own areas. There is evidence that this is happening, but this “local representation” varies dramatically between languages, with Europe and North America dominating, as usual.

If we look at where edits about local content are taking place, we see that not only is most of the content about the Global South being originated in the West, but a lot of the edits made from the Middle East and North Africa (MENA) region are actually about places outside the region (primarily the US). So, those areas that already have relatively few edits are actually sending most of them away. This double whammy is particularly unfortunate when we discover that these outgoing edits represent a vanishingly small proportion of the total edits about the West.

Difficulties around sourcing have already been mentioned as a problem — even when writing articles about gigantic physical things in physical space. Bernie gave the example of an Egyptian obelisk that was geocoded to completely the wrong place. A local editor could see the obelisk outside his window, but his edits were summarily rejected. After all, his local knowledge was not considered a legitimate source, while the other editor was working from a book (from the early 20th Century, no less)!

There was inevitably some discussion around how geographical coverage could be improved, particularly of the Global South. Could it be improved via mobile? (Well, despite improved mobile interfaces, editing from a mobile is still difficult.) How about free Internet? Well (for example), Facebook’s controversial “Free Basics” platform will allow people to consume Wikipedia. However, it won’t allow you to search on Google for new sources, making it easier than ever to persist in the role of a consumer rather than a producer. In the discussion, people also noted the questionable impact of IMF and World Bank policies on the Global South (and by extension Wikipedia’s coverage).

Who Controls the Facts?

Wikipedia is not just incredibly popular (it’s the seventh most visited site in the world); it’s increasingly authoritative in our online information environment. It is a “generative platform”, and its information works its way into all sorts of different places we might not expect. This can be an issue. It was wonderful to welcome back OII alumna Dr Heather Ford to explore “New Authorities: Wikipedia and the reconfiguration of expertise”.

She opened with Time magazine’s famous awarding of their 2006 Person of the Year to “You”. She noted that Wikipedia used to be the David to big media’s Goliath, but that this has since reversed: Wikipedia has become a Goliath in its own right. Indeed, something very significant happened in 2012, when Google announced that it would add relevant “facts” to its search results, through the Fact Box in the top right-hand corner of the page. (Interestingly and annoyingly, our own Fact Box is semi-wrong.)

Wikipedia plays a central role in the information that is presented in the Fact Box, which has interesting implications. Firstly: do collective and open systems like Wikipedia lead to greater equality? There are examples of this in software, but maybe not with encyclopedias, where the ‘market’ of open volunteers will not necessarily serve certain areas of interest. Information that is not already available will not be made available in Wikipedia, and therefore also not in Google’s Fact Box. Wikipedia’s totalizing claims (“the sum of human knowledge”) exacerbate this problem. As new actors and logics come on the scene, people and organizations will edit Wikipedia to advance their own interests.

On Editing, and the Importance of Good Coffee and Cake

It was at this point in the early afternoon that we moved on to the editathon proper, patiently and expertly led by Martin. But it was also the exact point at which his seemingly innocent question “Is Wikipedia censored?” provoked, well: lots of discussion around the nature and definition of censorship. Having already discussed a series of research presentations that deconstructed the platform and offered a fleeting glimpse of the wizard behind the curtain, the newbie editors were properly fired up with lots of clever and difficult questions. Which was great!

This was the first time the department had attempted a hybrid event: an equal mix of practical effort to improve certain articles, and contextualizing academic insight and discussion, but we had suspected that the practical and the academic might usefully inform each other. After all, without adding content to Wikipedia, there’s no Wikipedia to study; and without study, we’re rendered blind and ignorant (although potentially happier as a result).

And hopefully we got the balance right: one of the participants commented that she “came away with more respect for Wikipedia as well as more awareness of its problems worldwide. The day was a mix of introspection, thoughtful and informed discussion, and practical help, without any hype. I’ve gained a greater sense of responsibility towards keeping it useful, honest and well used in every sense.” Which is what we were hoping for! Keep Wikipedia useful, honest, and well-used, people!

Editing is hard work, and we stuck at it for about three hours, fuelled by coffee and a large quantity of cake. The second set of presentations capped the day by considering the many terabytes of data collected by the Wikipedia platform, and what we can do with it. As a social science department we love social data, and Wikipedia comes with lots of it — geocodes, time stamps, user edit logs, links between language editions, etc., all of which help to untangle the complex social stuff of the world it aims to represent, and also the complexity of what’s happening behind Wikipedia’s famously basic and ingenuous interface.

Wikipedia As: Big Data?

The degree of geotagging varies across different languages, and it was great to welcome back Dr Stefano De Sabbata to give more insight into the “Geographies of Wikipedia”: to examine the new kinds of data and geographies we are creating, and to show how the GeoHack tool can be used to link Wikipedia articles to other data sets. The theme of linking across datasets was picked up by the OII’s Dr Scott Hale, who explored the question of how much interaction there is between language editions.

The English Wikipedia is the largest version, at nearly three times the size of the German one. But larger editions don’t cover all the topics of smaller editions: only half the German articles have an English equivalent, and self-focus bias (a recurrent theme of the day) is seen here again. Despite much of the world population being multilingual (~50% of Europeans and 20% of Americans, for instance), most online platforms are designed for monolinguals. About 15% of Wikipedia users are multilingual and edit more than one language edition of the encyclopedia. These multilingual users tend to be very active, with over twice as many edits as the average editor.

So what are the benefits of people editing in multiple languages? They can serve as bridges and share relevant information and content across the different language versions. Certain languages seem to be connected, with clear pathways between (for example) the Romance languages. (Incidentally, it was great to see one of the participants editing in Kannada, the official language of Karnataka.) English is clearly a central language connecting many other language editions. In contrast to other languages, first- and second-language speakers of English contribute content of roughly similar complexity, showing how English really serves as a bridge language.

The Great Big Pillow Fight

And what actually happens when you allow “mass collaboration of non-professional individuals” and let them all loose on a platform? In the case of Wikipedia, you’ll (incredibly) end up with over 38 million articles in almost 300 languages. And a lot of edit wars, the temporal dynamics of which the OII’s Dr Taha Yasseri, in his presentation on “Social aspects of collaborative editing: revenge, conflict, and war”, likened to a mass pillow fight. Very complicated, pretty violent, and very fast. The fact that all these individual edits are time- and user-stamped makes Wikipedia a hugely valuable data source for social scientists to examine and understand these conflicts: how they erupt, and how they eventually resolve and die down. Or not.

Wikipedia’s system of edit reversion is intended to stop vandalism, but is often abused by users, and also provides a good indication of how much controversy and opposition there is on Wikipedia. The more aggressive interactions between editors can involve a complex and strategic series of repeated attacks, self-defense, multilateral attack, third-party defense, and serial attacks. And different types of interactions are associated with different editor seniorities: which is interesting; a sort of attack hierarchy.
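As an illustration of how reversions can be read out of an article’s edit history, here is a minimal sketch of one common heuristic: treat a revision as a revert if its text is identical to an earlier, non-adjacent revision. The presentation did not specify a detection method, so this approach, the `Revision` class, and the sample history are all assumptions made for the example.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class Revision:
    editor: str
    text: str


def find_reverts(history):
    """Find revisions that restore an earlier version of the text.

    Returns a list of (reverting_editor, reverted_editors, restored_index)
    tuples, where reverted_editors wrote the revisions that were undone.
    """
    last_seen = {}  # content hash -> index of the latest revision with it
    reverts = []
    for i, rev in enumerate(history):
        digest = hashlib.sha1(rev.text.encode("utf-8")).hexdigest()
        j = last_seen.get(digest)
        # Require at least one intervening revision; an identical
        # consecutive save undoes nobody's work.
        if j is not None and j < i - 1:
            reverted = [r.editor for r in history[j + 1:i]]
            reverts.append((rev.editor, reverted, j))
        last_seen[digest] = i
    return reverts


# Hypothetical edit history, for illustration only.
HISTORY = [
    Revision("alice", "Version A"),
    Revision("bob", "Version B"),
    Revision("alice", "Version A"),   # alice undoes bob
    Revision("bob", "Version B"),     # bob undoes alice in return
    Revision("carol", "Version B, with a citation"),
]

if __name__ == "__main__":
    for reverter, reverted, restored in find_reverts(HISTORY):
        print(f"{reverter} reverted {reverted} back to revision {restored}")
```

Pairs of editors who repeatedly revert each other, as alice and bob do here, are the raw material for measuring how contested an article is.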

If we look at conflict evolution over time, we see that the levels of reversions map onto real-world events. But there are also (a very few) articles that never resolve to a point where reversion stops: in particular, the articles on anarchism and George Bush exist in a state of almost constant war.

So what are the most controversial topics in Wikipedia? Well, Jesus appears in all lists across languages; but mostly the topics are locally driven: editors of the Spanish edition argue an awful lot about football, and editors of the Czech edition about homosexuality — these controversial topics can give us an idea of what different country and language communities care about.

Wikipedia tries to resolve these problems by bringing in unassociated editors: a time-consuming process. But might this sort of research help the Wikipedia community to identify persistent troublemakers? Contropedia is a website that identifies problematic and controversial articles: but might the identification of troublesome editors be even more useful?

It’s All In Your Hands

One of the (many!) vague doubts we had when organizing this fairly unusual event was that it might end up being incredibly depressing: a catalogue of conflict, inequality, and complex and inflexible rules and conservative editors grinding down the efforts of good people.

One participant commented: “I found it fascinating that once we burrow down to usage at a local level we find that contributors in say the Middle East or South Africa are not necessarily contributed knowledge about their own locality/country but are offering more insight into the Northern hemisphere. It was quite a glum picture presented of who is still framing the narrative about the Global South.”

But of course we wouldn’t have organized it had we not been utterly convinced that Wikipedia was something worth spending time on. Heather Ford probably summed it up best: that whatever the (friendly, constructive) criticisms made over the course of the day, individuals today have MORE POWER to edit and change information than ever before — and that this is an amazing opportunity. Indeed the overall message of the day seemed to be that to create change, we all need to get stuck in. And that means YOU!

Hearing from Stefano De Sabbata about the geographies of Wikipedia

One of the participants commented that she “came to the editathon as a novice editor and slight skeptic about some of the content on Wikipedia. Now I have better understanding of the strengths, and some weaknesses, of its community authorship model and feel more confident to use Wikipedia than I have in the past.”

Hopefully these editathon events, including the brilliant series of editathons around Women in Science organized by Oxford University’s Bodleian Libraries, and of course the thousands of other events taking place all the time around the world, will help introduce a new group of editors to the crazy, amazing, maddening, brilliant, bizarre and incredibly important and valuable world of Wikipedia!

It’s not the sum of all human knowledge — but it’s a pretty decent stab at it.

If you’re interested in improving Wikipedia’s content around “the Social Internet”, check out our list of target articles here: and please add to it!

By: David Sutcliffe, OII
