Parsing Wikipedia in Scala and Spark

Alexey Novakov
SE Notes by Alexey Novakov
4 min read · Oct 2, 2017

If one day you need to analyze the Infoboxes of Wikipedia articles, this post might be useful for you. Here I describe how one can get the Wikipedia articles and extract certain meta-information about every article from its Infobox object, which holds some of the article's attributes.

Example of an Infobox

The Infobox text of a Wikipedia article looks like the example below. This particular one belongs to the Writer Infobox and is described at [[:Template:Infobox Writer/doc]]:

Similar definitions exist for all the other Infoboxes. An Infobox name resembles an article category or type. Usually, you can see the Infobox on the right-hand side of the article page.

{{Infobox writer <!-- For more information, see [[:Template:Infobox Writer/doc]]. -->
| name = Alain Dister
| image =
| caption =
| image_size =
| alt =
| birth_name =
| birth_date = {{birth date|1941|12|25}}
| birth_place = Lyon France
| death_date = {{death date|2008|07|02}}
| occupation = Journalist, Photographer, Writer
| nationality = French
| education =
| alma_mater =
| genre = Rock and Roll, Punk Music, Beat Literature
| movement =
| notableworks =
| spouse =
| partner =
| children =
| relatives =
| awards =
| signature =
| signature_alt =
| years_active = 1962-present
| module =
| website = http://alaindister.com/
| portaldisp =
}}

How to access the Wikipedia articles

Wikipedia gives us the option to download the current articles as XML or SQL dump files. I chose the XML file in order to parse it with Apache Spark directly.

Link to English Wikipedia XML file: https://meta.wikimedia.org/wiki/Data_dump_torrents#English_Wikipedia

The file I downloaded was about 13 GB as an archive. Unzipped, it was around 50 GB; I am not sure about the exact size, since I removed the file once I had parsed the required Infoboxes for my further analysis.

Program to parse the XML file

The file consists of one huge XML tree which includes all the articles of a specific language edition, for example English, that were current as of the dump's creation date.

Let’s create a program to be run like this:

./wikipedia-page-processor-1.0/bin/wiki-dump-parser enwiki-20170801-pages-articles-multistream.xml data writer en

parameters:

1. input file name “enwiki….”
2. parent output folder name
3. Infobox name, for example “writer”. Only articles with this Infobox name will be parsed and saved as separate files.
4. output folder prefix, which indicates the language of the Wikipedia. According to the command above, the output folder will be named “data/en-writer” (see the entry-point sketch below).
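
For illustration, here is a hypothetical sketch of an entry point that wires these four arguments together. The object name, the file-naming scheme, and the PageParser.parse signature (sketched in the next section) are assumptions, not the article's actual code:

import java.nio.file.{Files, Paths}

// Hypothetical entry point; the real wiring in the linked code base may differ.
object WikiDumpParser {

  def main(args: Array[String]): Unit = args match {
    case Array(inputFile, outputDir, infoBoxName, langPrefix) =>
      // e.g. "data/en-writer" for the command shown above
      val targetDir = Paths.get(outputDir, s"$langPrefix-$infoBoxName")
      Files.createDirectories(targetDir)

      // PageParser (sketched below) streams the dump and hands every matching
      // Infobox to this callback, which writes one file per article.
      PageParser.parse(inputFile, infoBoxName) { infoBox =>
        val fileName = infoBox.title.replaceAll("[^A-Za-z0-9_-]", "_") + ".txt"
        Files.write(targetDir.resolve(fileName), infoBox.infoBoxText.getBytes("UTF-8"))
      }

    case _ =>
      sys.error("usage: wiki-dump-parser <dump.xml> <output dir> <infobox name> <lang prefix>")
  }
}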

Scala code of the Parser

PageParser uses an event-based XML parser to find the specific Infoboxes and save them as separate files.

Let’s look at two main methods of the implemented parser.

callback: PageInfoBox => Unit is a function that saves the extracted Infobox to disk as a separate file.
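
The full source is embedded as a gist in the original post; the following is only a rough sketch of the event-based approach using the JDK's StAX XMLStreamReader, with the PageInfoBox fields and the parse signature assumed for illustration:

import java.io.FileInputStream
import javax.xml.stream.{XMLInputFactory, XMLStreamConstants}

// Assumed shape of the extracted data; the real PageInfoBox may carry more fields.
case class PageInfoBox(title: String, infoBoxText: String)

object PageParser {

  // Streams the dump with an event-based (StAX) reader, so the ~50 GB file never
  // has to be held in memory. For every <page> whose wiki text contains the
  // requested Infobox, the callback is invoked once.
  def parse(dumpPath: String, infoBoxName: String)(callback: PageInfoBox => Unit): Unit = {
    val reader  = XMLInputFactory.newInstance().createXMLStreamReader(new FileInputStream(dumpPath))
    val title   = new StringBuilder
    val text    = new StringBuilder
    var current = "" // name of the element whose character data we are reading

    while (reader.hasNext) {
      reader.next() match {
        case XMLStreamConstants.START_ELEMENT =>
          current = reader.getLocalName
          if (current == "page") { title.clear(); text.clear() }

        case XMLStreamConstants.CHARACTERS =>
          if (current == "title") title.append(reader.getText)
          else if (current == "text") text.append(reader.getText)

        case XMLStreamConstants.END_ELEMENT =>
          if (reader.getLocalName == "page") {
            val wikiText = text.toString
            // A real implementation would cut only the {{Infobox ...}} block out
            // of the page text by balancing the braces; the sketch keeps it simple.
            if (wikiText.contains(s"{{Infobox $infoBoxName"))
              callback(PageInfoBox(title.toString, wikiText))
          }
          current = ""

        case _ => ()
      }
    }
    reader.close()
  }
}

Passing a callback keeps the parser free of I/O concerns: the caller decides whether an extracted Infobox goes to a file, a queue, or anywhere else.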

Convert property files to CSV files with Spark

I used Spark here just for fun. Parallel processing of the files and writing them into another format could also be done with Akka Streams or a similar tool, by leveraging async or parallel combinators to speed up the entire process.

Let’s implement a program which can be run like this:

type=$1
prefix=$2
./wikipedia-page-processor-1.0/bin/spark-reducer data/$2-$1 data/csv-$1 $1

Here, prefix stands for the language, for example “en”, and type stands for the Infobox name, for example “writer”.

Below is a program that converts the Infobox files (property-based) to CSV files.

We pass an input folder path to Spark using wholeTextFiles, so that it can distribute reading and writing of the files across multiple partitions.

infoBoxPropsMap: Map[String, Map[String, String]] is a map from Infobox name to a map of that Infobox's property names, each with an empty string as the default value. Later we populate the result map with what was parsed from the file.
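
The converter is also embedded as a gist in the post; here is a condensed sketch of the idea, with only a handful of writer properties, a naive property parser, and no CSV escaping (column names and helper structure are assumptions):

import org.apache.spark.sql.SparkSession

object SparkReducer {

  def main(args: Array[String]): Unit = {
    val Array(inputDir, outputDir, infoBoxName) = args
    val spark = SparkSession.builder().appName("spark-reducer").getOrCreate()

    // Map from Infobox name to its known properties, each defaulting to "".
    // Only a handful of the "writer" properties are listed for illustration.
    val writerColumns = Seq("name", "birth_date", "death_date", "occupation", "nationality")
    val infoBoxPropsMap: Map[String, Map[String, String]] =
      Map("writer" -> writerColumns.map(_ -> "").toMap)
    val defaults = infoBoxPropsMap(infoBoxName)

    // wholeTextFiles yields (file path, file content) pairs, one per Infobox
    // file, and distributes the reads across partitions.
    val csvLines = spark.sparkContext
      .wholeTextFiles(inputDir)
      .map { case (_, content) =>
        // Parse "| key = value" lines into a property map on top of the defaults.
        val parsed = content.split("\n").collect {
          case line if line.trim.startsWith("|") && line.contains("=") =>
            val Array(key, value) = line.drop(line.indexOf("|") + 1).split("=", 2)
            key.trim -> value.trim
        }.toMap
        val props = defaults ++ parsed.filter { case (k, _) => defaults.contains(k) }
        // Naive CSV line; a real implementation needs proper quoting/escaping.
        writerColumns.map(props).mkString(",")
      }

    csvLines.saveAsTextFile(outputDir)
    spark.stop()
  }
}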

Spark SQL to analyze Wikipedia Infoboxes

Let's merge the single-line CSV files into one CSV file which will contain all the Infoboxes, using a simple bash script:

for filename in data/csv-$1/*.txt; do
  cat "${filename}"
  echo
done > $1.csv

The single argument is the Infobox name. We could also reduce the whole RDD to a single dataset and store its content to a file, instead of doing it manually.
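
For that alternative, the merge could stay inside Spark, for example by collapsing the RDD of CSV lines from the sketch above into a single partition before writing (the output path here is made up):

// Alternative to the bash merge: one partition means Spark writes one part file.
csvLines
  .coalesce(1)
  .saveAsTextFile(s"data/csv-$infoBoxName-merged")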

Now let’s use Spark SQL to run some queries.

The data quality of the Wikipedia Infoboxes is very bad. Even though every Infobox template has a property definition, property values may contain some trash. All the values are optional and may contain MediaWiki characters, which are quite annoying to filter out in any data analysis.

As noted above, I had to do some normalization and data filtering before aggregating the counts.
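
The actual query lives in the linked code base; below is a minimal sketch of the kind of normalization and aggregation that produces the table that follows, written with the DataFrame API and reusing the assumed column layout from the converter sketch (the cleanup regex is an assumption as well):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object WriterAnalysis {

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("writer-analysis").getOrCreate()

    // The merged writer.csv from the bash script above, read with the same
    // assumed column layout as in the converter sketch.
    val writers = spark.read
      .csv("writer.csv")
      .toDF("name", "birth_date", "death_date", "occupation", "nationality")

    // Normalization: strip leftover MediaWiki characters such as [[ ]] { } |
    // and whitespace, then drop empty values before counting.
    val counts = writers
      .withColumn("nationality_normalized",
        trim(regexp_replace(col("nationality"), "[\\[\\]{}|]", "")))
      .filter(col("nationality_normalized") =!= "")
      .groupBy("nationality_normalized")
      .count()
      .orderBy(desc("count"))

    counts.show(30, truncate = false)
    spark.stop()
  }
}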

+----------------------+-----+
|nationality_normalized|count|
+----------------------+-----+
|American              |5834 |
|British               |1578 |
|Indian                |1091 |
|Canadian              |851  |
|Australian            |485  |
|French                |414  |
|English               |364  |
|Irish                 |228  |
|South Korean          |199  |
|Polish                |151  |
|Italian               |140  |
|German                |136  |
|Russian               |130  |
|Spanish               |126  |
|Chinese               |125  |
|Pakistani             |118  |
|Japanese              |113  |
|Romanian              |99   |
|Scottish              |99   |
|South African         |83   |
|Swedish               |75   |
|Brazilian             |74   |
|Dutch                 |68   |
|Norwegian             |60   |
|Ecuadorian            |52   |
|Bangladeshi           |51   |
|Ugandan               |50   |
|Austrian              |42   |
|Turkish               |40   |
|Mexican               |39   |
+----------------------+-----+

No surprises: the English Wikipedia has most of its writers from English-speaking countries :-). However, Indian editors have been really active in the English Wikipedia, adding lots of writers and taking third place among the nationalities.

The full code base can be found here:
