Diving Into Wikimedia Foundation’s Data
My second data analysis internship project with the Wikimedia Foundation involves generating (at least) three readership metrics reports. Each report aggregates various data sources to display critical information about Wikimedia projects, such as daily pageview counts, the percentage of desktop vs. mobile traffic, and the number of daily iOS and Android app downloads. These are common statistics for websites to track, as they help teams understand user behavior and call attention to unusual events.
During the course of my internship, I plan to release one report per month. I published my first readership report on December 12, 2016, covering the 18-week timespan from August 1 to December 4, 2016. My second report was published on January 6, 2017 and covered the four weeks since the previous report. I plan to release a third report in early February and, time permitting, may also generate a fourth report in early March.
Working on these reports gave me a lot of insight into the data WMF collects and analyzes, and into the general trends it has been seeing over the years. For example, I learned that Wikimedia projects see a “slump and bump” around the winter holidays: desktop engagement declines while mobile engagement increases.
December 20, 2015 was the first day the majority of pageviews came from mobile devices (mobile traffic tends to be higher on weekends). However, the week from December 26, 2016 — January 1, 2017 was the first full week where this was the case.
I also learned that although the number of pageviews from mobile devices is growing, when readers visit English Wikipedia from a desktop device, they tend to visit more pages than they would from a mobile device (not including the Wikipedia app).
To understand which changes are seasonal, I looked at year-over-year pageviews going back to May 2013 (charted below). This shows that Wikimedia projects see a seasonal drop in pageviews around the Christmas holidays, with traffic rising back to normal levels in January.
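The underlying comparison boils down to a daily pageview total that can be lined up by calendar day and compared across years. Here is a minimal sketch of such a query, assuming the wmf.pageview_hourly table with its view_count and agent_type columns and year/month/day partitions (the table and column names are illustrative, not the exact production query):

```sql
-- Daily pageview totals, which can then be compared across years
-- calendar day by calendar day to spot seasonal patterns.
-- Sketch only: assumes the wmf.pageview_hourly schema.
SELECT
  year,
  month,
  day,
  SUM(view_count) AS daily_pageviews
FROM wmf.pageview_hourly
WHERE year >= 2015              -- partition filter; adjust for the desired window
  AND agent_type = 'user'       -- exclude self-identified bots and spiders
GROUP BY year, month, day
ORDER BY year, month, day
LIMIT 10000;
```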
Before I could start this project, I had to request and gain access to WMF’s private data servers. I started this before my Outreachy internship period officially began as it can sometimes take a couple weeks to get everything sorted out. This process entailed:
- signing a volunteer/intern NDA
- signing the Acknowledgement of Wikimedia Server Access Responsibilities form
- determining which servers I needed access to for my projects
- creating a developer MediaWiki account
- opening a Phabricator ticket with my request
- creating an SSH keypair for production shell access
Once I got access to the servers, I still had some setup work on my end. This was a little confusing, but I documented my steps here (hopefully this will help others requesting access in the future). After that, I could SSH into the WMF servers and run Hive queries from the command line.
Before starting to generate a new readership report, I read through a couple of the previous editions to familiarize myself with the metrics and queries. Tilman, my mentor (who had previously been creating these reports), sent me his Google sheets where he had been capturing this data and generating the charts.
Running the queries and updating the data wasn’t especially hard since this part is already well documented and for the most part I simply had to update the date range. However, as I started this project, I realized I had only a fuzzy understanding of Hadoop from a video I watched a few years ago, so I took some time for personal learning. I gained experience with Hive, MySQL, and Hadoop MapReduce while working on this project.
Here are the resources I used:
- I watched this 1-hour video, which is a great intro to MapReduce
- Hadoop in Practice (ch 1 + 10)
- Hadoop, The Definitive Guide (ch 1 + 12)
- Hive Language Manual
- Hive Cheatsheet for SQL Users
Apache Hive was developed at Facebook as an open source data warehouse infrastructure for querying and analyzing huge volumes of data stored in Hadoop clusters. Hive makes it easy to run SQL-like queries, written in HiveQL, which execute as MapReduce jobs in Hadoop. My previous experience with SQL was useful in quickly getting up to speed with HiveQL.
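To give a flavor of what HiveQL looks like in this context, here is a sketch of the kind of query behind the desktop vs. mobile breakdown for a single day. It again assumes the wmf.pageview_hourly table; the actual report queries are the documented ones mentioned above, so treat this as illustrative:

```sql
-- Pageviews split by access method (desktop, mobile web, mobile app)
-- for one day; percentages are computed afterwards in the spreadsheet.
-- Sketch only: table and column names follow an assumed schema.
SELECT
  access_method,
  SUM(view_count) AS views
FROM wmf.pageview_hourly
WHERE year = 2016 AND month = 12 AND day = 4   -- the date filter is what changes for each report
  AND agent_type = 'user'                      -- exclude self-identified bots and spiders
GROUP BY access_method;
```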
Apache Hadoop is an open source framework for distributed storage and processing of very large data sets. The storage part is the Hadoop Distributed File System (HDFS) and the processing part is Hadoop MapReduce.
It’s probably worth mentioning that although MapReduce and Hive are still widely used in industry, they are somewhat dated technologies, and faster options are now available.
Once I felt more comfortable with these technologies, I finished generating the data by running queries in Hive, updating the Google sheets and charts, and calculating average values for the timespan of the report.
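For the averages, the pattern is a nested aggregation: total pageviews per day, then the mean over the days of the report window. A hedged sketch, with the dates and schema assumed:

```sql
-- Average daily pageviews across the four-week report window
-- (Dec 5, 2016 to Jan 1, 2017): sum per day, then average over the days.
-- Sketch only, assuming the wmf.pageview_hourly schema.
SELECT AVG(daily_views) AS avg_daily_pageviews
FROM (
  SELECT SUM(view_count) AS daily_views
  FROM wmf.pageview_hourly
  WHERE ((year = 2016 AND month = 12 AND day >= 5)
         OR (year = 2017 AND month = 1 AND day <= 1))
    AND agent_type = 'user'
  GROUP BY year, month, day
) daily;
```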
Next, it was time to make sense of all the data and catch any anomalies and patterns. This part was super interesting. I learned the (obvious) lesson that things aren’t always what they seem and sometimes the data doesn’t back up one’s initial assumptions.
Specifically, there was a large spike in pageviews on November 9, 2016. Of course, I assumed this was due to the US Presidential Election, but I had to prove it through the data. One way to do this was to investigate along the country dimension and find the countries with the largest change around this time. However, when I ran the query to do this, I found that the US showed no unusual growth around this time. This doesn’t necessarily disprove my theory, but it doesn’t support it either. After more investigating, I couldn’t find a good explanation for the spike, and it wasn’t worth spending more time looking into it.
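For reference, the country-level check was essentially a breakdown of daily pageviews by country in the days around the spike. A sketch of that kind of query (the country column name here is an assumption on my part):

```sql
-- Daily pageviews per country around November 9, 2016,
-- to see which countries (if any) account for the spike.
-- Sketch only: assumes a country column in wmf.pageview_hourly.
SELECT
  country,
  day,
  SUM(view_count) AS views
FROM wmf.pageview_hourly
WHERE year = 2016 AND month = 11
  AND day BETWEEN 6 AND 12
  AND agent_type = 'user'
GROUP BY country, day
ORDER BY views DESC
LIMIT 200;
```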
I wrote up the draft in a Google Doc with input from Tilman, and sent it out to the appropriate WMF mailing lists. I posted the report and all the charts to Wikimedia Commons.
This was a fun project for familiarizing myself with some of the most important WMF metrics while providing useful information to members of the Wikimedia community. If you’re interested in learning more about these metrics, be on the lookout for the next report in early February!