It’s crazy how fast the past 3 months have gone by, but I’m in the final week of my Outreachy internship with the Wikimedia Foundation!
The week after the Wikimedia Developer Summit, I spent 3 days working from WMF headquarters in San Francisco. It was great to meet and work with some of the team in person, mainly my mentor, Tilman Bayer, a senior analyst at WMF. The office is spread across a couple of floors with lots of meeting rooms and open spaces for people to collaborate.
I gave a talk at a PyLadies San Francisco meetup, which was hosted…
The Outreachy Internship program grants a $500 travel allowance to each intern to attend a relevant conference or workshop of their choice. I attended the Wikimedia Developer Summit this week in San Francisco, CA at the Golden Gate Club.
This is an annual gathering for technical contributors, third-party developers, users of MediaWiki and users of the Wikimedia APIs. The first two days of the summit consist of technical sessions, while the third day is a completely unscheduled day to “Get Stuff Done”. Basically, everyone gets together to hack on the ideas discussed the previous days to actually make them happen.
My second data analysis internship project with the Wikimedia Foundation involves generating (at least) 3 readership metrics reports. Each report aggregates various data sources to display critical information about Wikimedia projects, such as daily pageview counts, the percentage of desktop vs. mobile traffic, and the number of daily iOS + Android app downloads. These are common statistics for websites to track: they help explain user behavior and call attention to unusual events.
During the course of my internship, I plan to release one report per month. I published my first readership report on December 12th, 2016, which covered the 18…
If you’ve read a Wikipedia article before, you may have noticed that it’s separated into sections like “Career”, “Plot” or “References”. The purpose of these section headings is to organize the content on each page. Which of these section headings is most popular? This is the question I answered in my first data analyst internship project with the Wikimedia Foundation. I investigated section headings in 5 large Wikipedia languages and released a brand new public dataset of article section headings!
All analyzed language editions have some version of “References” and “External Links” as the 2 most frequent heading titles. This…
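To give a feel for what counting section headings involves, here is a minimal sketch in Python. It is not the code used for the project: the `count_headings` function, the regular expression and the two sample articles are all made up for illustration, and it only assumes that headings are written in wikitext as `== Title ==`.

```python
import re
from collections import Counter

# Wikitext headings look like "== Title ==" (more "=" signs mean deeper levels).
# This pattern is an illustrative assumption, not the project's actual parser.
HEADING_RE = re.compile(r"^==+[ \t]*(.*?)[ \t]*==+[ \t]*$", re.MULTILINE)


def count_headings(articles):
    """Count how often each heading title appears across a list of wikitext strings."""
    counts = Counter()
    for wikitext in articles:
        counts.update(HEADING_RE.findall(wikitext))
    return counts


# Two tiny, invented articles standing in for real page dumps.
sample_articles = [
    "Intro\n== Career ==\ntext\n== References ==\n== External links ==\n",
    "Intro\n== Plot ==\ntext\n== References ==\n",
]

heading_counts = count_headings(sample_articles)
print(heading_counts.most_common(3))
```

On real dumps the same idea applies per language edition, although actual wikitext has edge cases (templates, comments, malformed headings) that a simple pattern like this one would miss.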
Since I started coding, I’ve released all my completed projects on GitHub under the GNU General Public License, an open source license. I hadn’t heard of the term open source until I started getting technical, and since then I’ve been intrigued by the idea. Lots of people are incredibly passionate about open source software. One major advantage of open source software is the flexibility to easily adapt the source code for different use cases. Established open source projects often provide high quality software since anyone can add features, fix bugs, report issues and test the code.
Often, people who contribute to open…
Here’s what I learned while exploring Wikipedia Clickstream Data:
The Wikimedia Foundation intermittently releases Wikipedia Clickstream Data. This data shows where the traffic to Wikipedia pages comes from (like Google, Facebook, Twitter or other specific Wikipedia pages). This helps to understand what people are curious about.
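As a rough illustration of how referrer data like this can be explored, here is a small Python sketch. The four-column tab-separated layout (`prev`, `curr`, `type`, `n`), the `top_referrers` helper and every row in the sample are assumptions made for the example, not taken from an actual release.

```python
import csv
import io
from collections import Counter

# Invented rows in an assumed layout: referrer, target page, link type, count.
SAMPLE_TSV = (
    "other-search\tHarry_Potter\texternal\t1200\n"
    "Hermione_Granger\tHarry_Potter\tlink\t300\n"
    "Main_Page\tHarry_Potter\tlink\t150\n"
    "other-search\tPalau\texternal\t80\n"
)


def top_referrers(tsv_text, page, limit=3):
    """Return the (referrer, count) pairs that send the most traffic to `page`."""
    counts = Counter()
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    for prev, curr, _link_type, n in reader:
        if curr == page:
            counts[prev] += int(n)
    return counts.most_common(limit)


print(top_referrers(SAMPLE_TSV, "Harry_Potter"))
```

With a real file, the same loop would read from an opened file handle instead of the in-memory sample.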
I followed this article by Ellery…
While building the Harry Potter Word2Vec web application, I tried to hide all of the technical complexity from the end user. I wanted to make the user’s experience as simple and streamlined as possible. In this post, I’ll explain how the web app works behind the scenes and my design choices as I built it.
If you have not read the previous “Word2Vec on Harry Potter” blog post introducing the website and the w2v algorithm, please do that now.
Here is my code.
“You shall know a word by the company it keeps.” -J.R. Firth
Over the past few months, I’ve become fascinated by how machine learning applies to natural language problems. In the earlier Harry Potter Text Analysis project, I wrote Python code to extract insights. By using machine learning, I can take a more sophisticated approach. I’ve been specifically learning about the open source word2vec ML algorithm from Google that aims to learn the meaning behind words.
This $9 scratch-off card is the gateway to the internet for the people of Palau.
Over the past few months, I’ve taken for granted that knowledge is only a few Google searches away. Then, I spent 2 weeks working remotely from Palau (with plenty of scuba diving breaks) and quickly realized how important high-speed internet is for learning technical topics.
The internet in Palau is really, really slow. This made me very curious… why is the internet in Palau so much slower than the internet in Boston? How does internet speed work?
There are 2 factors:
The seventh and final book in the Harry Potter series, Harry Potter and the Deathly Hallows, was published in July 2007 and sold 11 million copies worldwide within 24 hours. This makes it the fastest selling book in history. I was one of those dedicated fans who stood in line at midnight to get my hands on the book. I read the entire book in one sitting immediately after getting home. It was that good.
My first Python project is a text analysis of the Harry Potter series. I’m familiar with the data, which in the scientific world is called…