Outreachy Internship: Project 0

Zareen Farooqui
Becoming a Data Analyst
11 min read · Dec 23, 2016

If you’ve read a Wikipedia article before, you may have noticed that it’s separated into sections like “Career”, “Plot” or “References”. The purpose of these section headings is to organize the content on each page. Which of these section headings is most popular? This is the question I answered in my first data analyst internship project with the Wikimedia Foundation. I investigated section headings in 5 large Wikipedia languages and released a brand new public dataset of article section headings!

Comparison of most frequent headings across 5 large Wikipedia language editions

All analyzed language editions have some version of “References” and “External Links” as their two most frequent headings. This isn’t surprising, since each article should include external references that verify the content on a page and inform readers of its sources. These headings can appear in articles on any subject, but many other headings are subject-specific. For example, an article about a film may include “Cast”, “Plot”, or “Production” section headings.

Different language editions favor comparable but distinct headings. For example, in German, “Life” (“Leben”) appears in 13% of articles (presumably most of these are articles about people) and is the 5th most frequent heading. In English Wikipedia, however, “Life” appears in less than 1% of articles and is only the 27th most frequent heading. Instead, “Biography” appears in 3% of English articles and is the 7th most frequent heading, while in German, “Biography” (“Biografie”) is the 32nd most frequent heading and appears in under 1% of articles.

For a full list of the top 100 section headings in English, French, German, Italian and Spanish Wikipedia, along with more results, check out the meta research page.

Here are some ideas of ways you can use this dataset for interesting analysis:

  • look at the number of section headings per article, calculate statistics such as average and median and create a frequency distribution histogram
  • look for variations (and redundancies) of heading titles across a language (for example all headings which contain “life” — similar to what a user did here for German Wikipedia)
  • look for variations (and redundancies) of heading titles across multiple languages to understand which version of a heading is more frequently used in each language edition
  • get a rough break down of article subjects and their percentages (for example, you could count the number of unique articles with the “Early life” heading to get a count of biography pages, although there are more thorough methods of doing this using article categories)
  • use the results from the bullet point above to identify section heading gaps across Wikipedia and understand how closely article layout guidelines are being followed on each language edition (here is a German Wikipedia page suggesting section headings for biography pages)
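As a minimal sketch of the first idea above, here is how headings-per-article statistics could be computed with pandas. The toy TSV and its column names (`page_id`, `heading_level`, `heading_text`) are assumptions standing in for the released dataset:

```python
import io
import pandas as pd

# Toy stand-in for the released headings TSV; the real file is much larger
# and its column names may differ.
tsv = io.StringIO(
    "page_id\theading_level\theading_text\n"
    "1\t2\tReferences\n"
    "1\t2\tExternal links\n"
    "2\t2\tReferences\n"
    "2\t2\tCareer\n"
    "2\t3\tEarly life\n"
)
df = pd.read_csv(tsv, sep="\t")

# Number of headings per article, then summary statistics.
per_article = df.groupby("page_id").size()
print(per_article.mean())    # average headings per article
print(per_article.median())  # median headings per article

# A frequency-distribution histogram could follow with
# per_article.plot.hist(bins=20) once matplotlib is available.
```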

Also, if you’re interested in a language edition which I did not include in my project, you can re-use my code and generate a new dataset for this language. Keep reading this post to learn about this process.

This project started as a question from the Reading team about the number of articles in which the “See also” heading appears. The microtasks I completed as part of my application process served as a simplified exercise to get initial results for this question, but they relied on assumptions that don’t reflect real-world data.

My task was to calculate the 100 section headings used in the largest number of articles for 5 large Wikipedia editions using PAWS, a web-based interactive programming and publishing environment created by WMF which provides Jupyter notebooks. The motivation behind PAWS is to reduce the amount of unnecessary complexity involved in programming and to provide easy access to public data released by WMF. PAWS is still in beta and there’s not a whole lot of documentation on it yet, but anyone with a MediaWiki login can try it out.

PAWS notebook interface

First, I had to generate article heading datasets for English, French, German, Italian and Spanish Wikipedias. I did this by parsing through the “Articles, templates, media/file descriptions, and primary meta-pages” data dump using Aaron Halfaker’s method of extracting headings, using the mwparserfromhell library.
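The mwparserfromhell library exposes a `filter_headings()` helper for exactly this. As a dependency-free illustration of the same idea, a regex can pull the level and title out of wikitext; this is a simplification I wrote for this post, not the project's actual parser, and it ignores edge cases (templates, comments) that mwparserfromhell handles:

```python
import re

# Matches wikitext headings like "== History ==" or "===Early life===".
# Simplified stand-in for mwparserfromhell's filter_headings().
HEADING_RE = re.compile(r"^(={2,6})\s*(.*?)\s*\1\s*$", re.MULTILINE)

def extract_headings(wikitext):
    """Return (level, title) pairs; level is the number of '=' signs."""
    return [(len(m.group(1)), m.group(2)) for m in HEADING_RE.finditer(wikitext)]

sample = (
    "'''Charles River''' is a river in Massachusetts.\n"
    "== History ==\n"
    "Some text.\n"
    "===Early life===\n"
    "== References ==\n"
)
print(extract_headings(sample))
# [(2, 'History'), (3, 'Early life'), (2, 'References')]
```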

The page id, page title, page namespace, heading level and heading text of each article page in that language edition are captured. The data dumps contain page information for all namespaces throughout Wikipedia. Namespaces, indicated by a prefix at the start of a page’s name, identify the type of page. All encyclopedia articles are in the main (or article) namespace, which is what readers see when browsing an article, but there are other namespaces such as user pages, help pages, talk pages and more.

Here is the article page for the Charles River, which is in namespace 0.

Charles River article in English Wikipedia

Here is the talk page for the Charles River, which is in namespace 1.

Charles River talk page in English Wikipedia

My project is about the main/article namespace (0 for all languages), so only pages with namespace 0 were added to the datasets I generated.

It took anywhere from 5–20 hours to generate each of these datasets (depending on the size of the language) so I typically ran this code overnight. Once my code was done running, I ran some data quality checks to ensure that my datasets were complete.

First, I wanted to be certain that my code parsed through the entire data dump and didn’t stop prematurely. I downloaded the last XML file of each data dump to my personal laptop and pulled the last 500 lines using a tail command in terminal.

example of tail -500 to get last 500 lines in Italian data dump

I saved these results to a separate file and manually checked that each page in namespace 0 which was not a redirect appeared in the generated headers dataset.

All rows which contain “Tour du Finistère 2016” in the Italian headings dataset

Then, I went to each article page to check that I had captured all section headings in the correct sequential order.

“Tour du Finistère 2016” Wikipedia article

Here, you can see that the table of contents (“Indice” in the Italian article) does indeed match the results I got in my terminal window. I repeated this for each article page in the last 500 lines of each data dump.

If this all passed, I took it one step further and used the random article generator to randomly spot check headings from additional articles.

Example of random article generator in Italian Wikipedia

This was a long process because a lot of the derived datasets didn’t pass these checks, so I had to keep recreating them until they did. I’m still not sure why the code doesn’t always complete, so if you want to reuse that part of this project, I highly recommend checking the datasets for completeness. If you don’t want to generate these yourself, you can download the public datasets I released (which have passed all the data quality checks) here.

Once I had a clean dataset, the next step was to complete the analysis. The actual question I was aiming to answer was not complicated to code; my main hurdle in this project was getting the code to run in PAWS. I stuck with PAWS partly to allow anyone to recreate this project easily in the future and partly to help debug and test PAWS itself.

As mentioned, PAWS is still in beta, and one caveat is that each user is restricted to a 1 GB memory limit. This may not be an issue for many users, but the headers datasets can get very large. Below are the final (uncompressed) sizes for the datasets generated from the November 1, 2016 data dump.

  • English: 1.2 GB
  • French: 543.7 MB
  • German: 448.5 MB
  • Italian: 327 MB
  • Spanish: 295.7 MB

Initially, simply reading a tsv file into a pandas dataframe crashed the PAWS kernel, so I had to come up with workarounds in my code to account for the memory limitations. For example, the section headers tsv files are read in small chunks of 100,000 rows, then concatenated into one pandas dataframe to avoid reading the entire file into memory at once. Also, three columns of the dataframe are converted from the standard np.int64 to np.int32/16/8 data types to conserve memory. Still, the English, French, and German editions produce header files which are too large to analyze in PAWS, so the results for these were computed on my personal laptop and posted to a GitHub repo. I also created PAWS notebooks for these languages which contain commented-out code that can run if and when the PAWS memory limits are increased.
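A minimal sketch of those two workarounds, using a toy TSV and a chunk size of 2 in place of the real 100,000-row chunks (column names are assumptions):

```python
import io
import numpy as np
import pandas as pd

# Toy stand-in for a large headers TSV file.
tsv = io.StringIO(
    "page_id\theading_level\theading_text\n"
    "1\t2\tReferences\n"
    "1\t2\tExternal links\n"
    "2\t3\tEarly life\n"
    "2\t2\tCareer\n"
)

# Read the file in chunks, then concatenate, so the whole file is never
# parsed into memory in one shot.
chunks = pd.read_csv(tsv, sep="\t", chunksize=2)
df = pd.concat(chunks, ignore_index=True)

# Downcast integer columns to save memory: np.int32 holds page ids fine,
# and np.int8 is plenty for heading levels (2-6).
df["page_id"] = df["page_id"].astype(np.int32)
df["heading_level"] = df["heading_level"].astype(np.int8)
print(df.dtypes)
```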

In some cases, the wikitext for headings in an article may be formatted as “== See also ==”, but in other cases it may appear as “==See also==”. The former method produces the heading “ See also ”, while the latter produces “See also”.

Example of wikitext heading in XML data dump file with no whitespace versus with whitespace

These appear identical to readers of a Wikipedia article and can therefore be regarded as the same for my analysis. All leading and trailing whitespace in the heading text is removed to avoid duplicate titles. This created another memory issue in PAWS, because the code to strip whitespace causes memory usage to spike. To get around this, right after stripping whitespace I wrote the dataframe to a new tsv file, restarted the PAWS kernel, and read this tsv file into a new dataframe.
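The normalization itself is a one-liner in pandas; a small sketch of how stripping collapses the two wikitext variants into one heading:

```python
import pandas as pd

# "== See also ==" yields " See also " while "==See also==" yields
# "See also"; stripping whitespace treats them as the same heading.
headings = pd.Series([" See also ", "See also", " References "])
normalized = headings.str.strip()
print(normalized.value_counts())  # "See also" now counted once, with count 2
```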

Another issue I went down a rabbit hole on was counting the number of unique articles in each language edition. This was required to get the percentage of articles each heading is present in. Here is the line of code I originally used for this:

Count number of unique articles in German Wikipedia

My code is telling me there are 1.72 million articles in German Wikipedia, but this is too low, because according to the official count there are 2.01 million articles:

Official statistics from German Wikipedia as of December 19, 2016

There are a couple of things going on here. My dataset is generated from the content available on November 1, 2016, so the official count anytime after this date should be higher, as new article pages are constantly being created. However, it’s unreasonable to assume that nearly 300,000 articles were added in such a short amount of time. The real culprit: the headers dataset adds rows for each heading in an article page, but an article page with no headings at all won’t be included. For example, the article page below is considered a stub page, or an article that is too short to provide encyclopedic coverage of its subject. There are no section headings on this page, so it isn’t included in my derived dataset, even though it is officially counted as an article.

Stub article example

To get around this, I investigated the Cree Wikipedia language edition (which had only about 125 articles total, so it was easy to test my methods on) to understand article counts. My solution was adding a counter variable to the code which generates the headers datasets, incremented each time a page is in namespace 0 and not a redirect. This counter is still not perfectly identical to the official article counting method, but it’s a much closer approximation and works for my purposes. The counter for German Wikipedia counts 1.99 million total article pages.
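A toy illustration of why the two counts differ. The `pages` list and its field names are invented for this sketch, standing in for pages streamed from the XML dump:

```python
# Why counting unique page ids in the headings dataset undercounts
# articles: stubs with no headings never contribute a row.
pages = [
    {"id": 1, "namespace": 0, "redirect": False, "headings": ["References"]},
    {"id": 2, "namespace": 0, "redirect": False, "headings": []},  # stub
    {"id": 3, "namespace": 1, "redirect": False, "headings": ["Heading"]},
    {"id": 4, "namespace": 0, "redirect": True, "headings": []},
]

# Count from the derived dataset: only pages that produced a heading row.
ids_in_dataset = {p["id"] for p in pages
                  if p["namespace"] == 0 and not p["redirect"] and p["headings"]}

# Counter incremented while generating the dataset: every non-redirect
# article page, with or without headings.
article_count = sum(1 for p in pages
                    if p["namespace"] == 0 and not p["redirect"])

print(len(ids_in_dataset), article_count)  # the stub makes these differ
```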

Once the headers tsv files were thoroughly vetted, I had to compress them to make them more manageable for others to download. To do this, I downloaded the files from PAWS to my laptop and used the terminal to compress them into .bz2 files. When I tried manually uploading these back to PAWS, I learned about the 25 MB upload limit.
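The same compression step could also be done in Python with the standard-library bz2 module, shown here on a throwaway file rather than the real datasets:

```python
import bz2
import os
import tempfile

# Sketch: compress a TSV to .bz2 with Python's bz2 module, as an
# alternative to the bzip2 terminal command used in the project.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "headers.tsv")
    with open(src, "wb") as f:
        f.write(b"page_id\theading_text\n1\tReferences\n" * 1000)

    dst = src + ".bz2"
    with open(src, "rb") as fin, bz2.open(dst, "wb") as fout:
        fout.write(fin.read())

    src_size = os.path.getsize(src)
    dst_size = os.path.getsize(dst)

print(src_size, dst_size)  # repetitive text compresses dramatically
```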

I got around this by uploading the compressed files to my personal website and using wget to download them back into PAWS. The English headers file (which is the largest at 203 MB compressed) wouldn’t finish uploading to my personal website, so I had to get help from one of the tools team members to manually move it into PAWS via an scp command; there isn’t an easy process set up for this (yet).

I posted a quick summary of my results in the different language editions’ village pump pages and got some good feedback from the community about my results. Some folks suggested better translations for the way a heading is used in practice in that language edition instead of the literal translation. Some folks were surprised at the percentages of some of the headings (ideally, each language would have References in 100% of the articles) and others offered explanations why some headings may be as popular as they are.

After wrapping up this project, I gave a quick presentation at the weekly internal research meeting at WMF and learned about a current research project to expand stub articles across different language Wikipedias where my project results may be useful. I’ve set up a meeting to talk to one of the researchers working on this project while I am at the WMF headquarters in January.

During the time I worked on this project, I had daily check-ins with my mentor via IRC, where we discussed the work I’d completed, any roadblocks I was facing, and ideas about how to debug them. I got exposure to members of the tools team at WMF when I ran into issues with PAWS and worked with them to debug these. I was also able to give them feedback about my experience with PAWS. Everyone I’ve worked with so far at WMF is incredibly helpful (even when they’re all super busy), smart and friendly.

Trying to get all this working in PAWS was frustrating at times and very time consuming, but I ended up learning a ton about memory management in Python, pandas and PAWS, which is really good knowledge for a data analyst. I had to do a lot of testing, debugging and creative problem solving to complete this project. Overall, despite the project taking longer than expected and some minor setbacks, I’m really happy with it. It’s useful to the community, I learned a lot, and I got to release a brand new public dataset for the world to play with. Stay tuned for more blog posts about my internship!
