Photo by Markus Spiske on Unsplash

A little over a month ago I had the honor and privilege of traveling to London, England with my friend Maya Filipp, for our first ever Mozilla Festival (MozFest) experience. I should also mention that it was an all-inclusive, entirely paid for trip, sponsored by Mozilla for winning the Overscripted-Data-Analysis-Challenge.

The challenge was put forth by The Systems Research Group (SRG) and Open Innovation Team at Mozilla, and it open-sourced a massive data set of publicly available information collected by a November 2017 Web crawl. The idea was to ‘explore the unseen or otherwise not obvious series of JavaScript execution events that are triggered once a user visits a webpage, and all the first- and third-party events that are set in motion when people retrieve content.’

I came across the challenge about 10 days before the deadline, when I saw a tweet by Mozilla about it. It was the end of the summer and I really wanted to do something on my own before classes started back up in September. I had never used Python before, but I wanted to learn it. I’d tried going through the standard tutorials and documentation in the past, but those tutorials always bored me before I reached anything good. I’m a strong believer in learning through doing, and saw this as a perfect opportunity to get my feet wet. In addition to basic Python, we would need to familiarize ourselves with particular libraries such as NumPy, pandas, Dask, and Spark, as well as figure out how to do our work in a Jupyter notebook. I could tell it would be a fantastic learning experience, and I knew right away I wanted to participate; winning the challenge would just be the icing on the cake.

I read through the challenge description, and apart from the desire to satisfy my technical itch, the topic itself really interested me. We all know we are being tracked online, and realize that the Internet isn’t, but should be, a safe and secure space where our privacy is not a Privilege but a Right. I knew about cookies and about how HTTP headers can be used to pass and store all sorts of information, but what I didn’t realize was the extent to which these technologies are being exploited by companies, businesses and organizations to collect more personal data on us than we would care to admit.

Martin Lopatka, a really awesome guy I had the pleasure of meeting over MozFest weekend, was one of the people overseeing the project. He put out a great blog post detailing the goals of the challenge and introducing the data set. His post described some exploratory work that had already been done, and provided examples of insights discovered. He also talked about limitations they encountered during their exploratory work, and ended by listing some suggestions for future directions.

Reading through his post, the unfamiliar terminology had me feeling beat, until I reminded myself that I was doing this for me. There was no grade, it wouldn’t have any sort of negative impact on my school or future career, and if anything, it would be a positive (and ideally useful) contribution to the team who kindly open-sourced the data set and invited the public to get involved.

I began by doing some reading and research on my own. The first thing I did was read and re-read Martin’s blog post in an attempt to understand what on earth this was all about. I visited every source he linked, following sources in those sources, googling terms and trying to process the information I came across. I read over a dozen papers, making sense of some more than others, and revisited many pages numerous times. I had a ton of tabs open, and probably spent a little too much time on the before-the-coding-starts part. While I was doing my prep research, I thought I’d tell some other people about the challenge and see if anyone wanted to join. Luckily, Maya was very interested and she jumped right in with me, helping me search and discover even more interesting material on the subject.

Doing research was the easy part. Time was running out, and we really had to get started on the analyses. The first thing we did was export some of the Parquet files to CSV format so we could scroll through the data and view it in rows and columns. We chose this route because we needed a quick and easy way to see the data and get a sense of what was going on. We wanted to see if we could identify, offhand, any patterns related to some of the topics we wanted to explore. This way we could avoid wasting time figuring out how to query the data using code until we knew more or less what we wanted to extract and conduct our analyses on.
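For anyone who wants to try the same trick, here is a minimal sketch of what such an export might look like with pandas. It is an illustration, not the exact code we ran: the function names, the file names and the 1,000-row cut-off are all hypothetical, and `pd.read_parquet` needs a Parquet engine such as pyarrow installed.

```python
import pandas as pd


def write_sample(df, csv_path, n_rows=1000):
    """Write the first n_rows of a DataFrame to a CSV file.

    Returns the number of rows actually written.
    """
    sample = df.head(n_rows)
    sample.to_csv(csv_path, index=False)
    return len(sample)


def export_parquet_sample(parquet_path, csv_path, n_rows=1000):
    """Load one Parquet file and dump a small slice of it to CSV.

    Both paths are placeholders supplied by the caller. This loads the
    whole file into memory, so it only suits a single, modest-sized part
    of a larger data set.
    """
    df = pd.read_parquet(parquet_path)
    return write_sample(df, csv_path, n_rows)
```

Calling something like `export_parquet_sample("one_part.parquet", "one_part_sample.csv")` (made-up file names) would then produce a file small enough to open in a spreadsheet. Keeping only the first rows is a deliberate shortcut: the goal is to eyeball the columns, not to draw a representative sample.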

Originally I was intrigued by the concept of Evercookies (a persistent tracking tool), and wanted to see if I could think of a unique way to identify their prevalence in the data, beyond the techniques already explored. Meanwhile, Maya was sifting through the data and finding all sorts of goodies in the department of canvas fingerprinting. Given the tight deadline and amount of data we encountered which seemed to be linked to canvas fingerprinting, we agreed it was the best topic to try and conduct our analyses on, leaving Evercookies to the end, if there was time (there wasn’t).

We would not have known where or how to begin coding, if it weren’t for the populated data prep and analyses folders in the repository, and the exploratory analyses already conducted by previous UCOSP (Undergraduate Capstone Open Source Projects) interns. Armed with some really useful .ipynb files, we had just what we needed to lead us in the right direction. The existing work served as invaluable examples, helping us figure out how to code for data extraction and analyses in our quest to identify instances of canvas fingerprinting.
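To give a rough idea of what identifying canvas fingerprinting can involve, here is a small, simplified sketch in plain Python. It is not the code from our notebook. It assumes each recorded JavaScript call is a dict with `script_url` and `symbol` keys, and that the symbol names look like the ones listed below; both are assumptions to check against the real data. The heuristic itself (flag a script that both draws text onto a canvas and reads the pixels back) is a deliberately crude stand-in for the more careful criteria used in published research.

```python
from collections import defaultdict

# Assumed symbol names for drawing text onto a canvas.
WRITE_SYMBOLS = {
    "CanvasRenderingContext2D.fillText",
    "CanvasRenderingContext2D.strokeText",
}

# Assumed symbol names for reading the rendered image back out.
READ_SYMBOLS = {
    "HTMLCanvasElement.toDataURL",
    "CanvasRenderingContext2D.getImageData",
}


def find_candidate_scripts(calls):
    """Return a sorted list of script URLs that both write text to a
    canvas and read the canvas contents back.

    `calls` is any iterable of dicts with "script_url" and "symbol" keys.
    """
    seen = defaultdict(set)
    for call in calls:
        symbol = call["symbol"]
        if symbol in WRITE_SYMBOLS:
            seen[call["script_url"]].add("write")
        elif symbol in READ_SYMBOLS:
            seen[call["script_url"]].add("read")
    return sorted(url for url, kinds in seen.items() if kinds == {"write", "read"})
```

A script that only draws, or only reads, is left out, which filters out a lot of ordinary canvas use. A flagged script is still only a candidate: plenty of legitimate code does both, so the results need a closer look before calling anything fingerprinting.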

My experiences have taught me that whenever you’re contributing to a project in an unfamiliar language, the most efficient way to get started is by exploring existing code. If Maya and I had tried to learn Python and all those libraries from scratch by walking through tutorials, it would have taken weeks or months before we knew enough to do the analyses. When jumping into unfamiliar code territory, I’m starting to believe that the key is to find a good sample code base you can use for reference. That way, you can follow what was done already, adapt and change things, and figure out what the different parts of the code do, to see how it all works together.

With a combination of those source files, documentation pages, Google, and a lot of trial and error, we were able to whip up a really rough analysis for our submission. We knew it needed work, but the amount we learned and gained in such a short time left us feeling really motivated to continue the work regardless of the challenge outcome.

Some time after opening the Pull Request, feedback and code reviews started coming in. Anyone involved in open source knows what a difference it makes to receive solid, descriptive and detailed feedback, and with each comment that came in we felt more and more appreciative, amazed and surprised that people were actually paying attention to our work and taking great time and care to guide us (special shout out to Martin Lopatka, David Zeber, and Sarah Bird).

As it stands now, there is much to be done on our analyses and a massive data set waiting to be explored and picked apart. We have opened issues in our forked repository based off some of the comments left on our notebook, and our goal now is to get the PR ready for merge over winter break, before school starts up again and dominates our lives. Helping us with our branch would be a great starting point for anyone who is new to all this but wants to get involved, and Maya and I would happily guide, advise, and assist you in any way we can.

What I loved about this challenge (and what I continue to love about this project) is that it’s about much more than just building something or accomplishing some task. Rather, it’s an initiative to empower the community to use this data and come up with new observations, patterns and research findings that will help us better understand the Web. It’s a call to action for anyone and everyone to get involved in open source, placing emphasis on how shared resources and collaborative innovation can lead to much greater discoveries.

I am so happy I came across this challenge and chose to get involved. It’s opened my eyes to the astounding amount of insight to be gained by analyzing data like this, and it has piqued my interest, inspiring me to continue learning more about data science and analysis. When I first read Martin’s post and took a look at the Overscripted repository, I was nervous. I thought it might be too much for me given the tight deadline, and that even if I was lucky enough to get something in on time, it would be lame. I had no idea where to start.

Getting involved might seem meaningless or intimidating, but making the commitment and sticking to it will lead you on a journey of learning, collaboration, growth and more. Hell, I ended up in London, England.