Context Graph Data Analysis

Jason Chuang
Firefox Context Graph
11 min read · Mar 3, 2017

Hello!

We are members of a new data science team at Mozilla, tasked with analyzing the data from the Context Graph experiment that many of you participated in.

We expect this blog post to be the first of many that we’ll share with you in the coming months. Through these messages, we hope to establish an open two-way communication with you about what we’re doing with data at Mozilla — so that you can see your contribution at work, and we can respond to any questions or feedback you may have.

The Context Graph experiment explores whether we can use data to better understand how people interact with the web, in order to provide you, our users, with a better browsing experience, and to help ourselves, the employees and contributors at Mozilla, more effectively advance our non-profit mission. We’re excited that you’ve decided to embark on this journey with us. In this blog post, we outline some of our plans, and describe some of our first steps and initial findings.

Improving the user experience

The advent of new web technologies, services, and devices is rapidly changing the way people access the web. In particular, the way users discover and rediscover content on the web has changed alongside new browser features: how our users utilize bookmarks, password managers, the AwesomeBar, and various add-ons and online tools; and how these habits and features affect a user’s security, privacy, and convenience while they’re online.

Your data, collected in this Context Graph experiment and in future collection efforts, can help us uncover these patterns.

We plan to explore browsing data along various usage patterns, so that we can compare the experiences of different groups of users. For example, how do users who use password managers experience the web differently than those who don’t? How do users who open multiple tabs conduct various tasks (e.g., shopping, reading news, planning a vacation) differently than those who do them in a single tab?

We plan to examine the conditions under which users employ various features within Firefox. For example, how do users arrange windows and tabs to stay organized? How do users file their bookmarks into folders and subfolders? How do users separate work- and home-life through profiles? How do we improve these features to better meet your needs?

Might our users be experiencing performance issues, slow internet connections, or compatibility problems? How do we detect them early and respond faster through better data analysis?

Advancing Mozilla’s mission and contributing to science

Beyond improving Firefox, your data will also help contribute to various scientific studies.

Unlike some other companies, for whom collecting and holding massive amounts of personal data is critical to their business model and revenue stream, our main goals here at Mozilla are to protect your privacy while using your data in an effective and open manner to advance our non-profit mission and benefit the public.

To this end, we’ve sought out collaborators in academia and industry labs, and will look into research questions such as:

1. Differential Privacy

How do we collect information from users in a privacy-preserving manner that masks any signal that could identify a single individual, while still allowing us to extract large-scale statistical patterns? Results from these studies will both contribute to the latest academic research on differential privacy and inform the Context Graph data collection itself.
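As a deliberately simplified illustration of this idea, here is a sketch of the classic randomized-response mechanism in Python. This is not the mechanism used in the experiment; the function names and the probability parameter p are our own:

```python
import random

def randomized_response(truth: bool, p: float = 0.75) -> bool:
    """Report the true bit with probability p, otherwise a coin flip.

    Any single report is deniable (it may just be noise), yet the
    population-level rate can still be recovered.
    """
    if random.random() < p:
        return truth
    return random.random() < 0.5

def estimate_true_rate(reports, p: float = 0.75) -> float:
    """Invert the noise: observed = p * true + (1 - p) * 0.5."""
    observed = sum(reports) / len(reports)
    return (observed - (1 - p) * 0.5) / p
```

With enough reports, the estimate converges on the true population rate even though no individual answer can be trusted; that tension between individual deniability and aggregate accuracy is the heart of the question above.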

2. Algorithm Fairness

Many business and legal decisions, including patrol decisions, salaries, promotions, and so on, are increasingly based on “big data analysis.” Many of these decisions, however, are complex and require thoughtful deliberation beyond “calculating the most prominent trend.” For example, suppose a company has historically paid men more than women, and promoted men faster than women into middle and upper management positions. A naive algorithm that treats data as “objective” will faithfully uncover the patterns it finds in the historical data, and continue to enforce these biases when it is asked to assess future salary and promotion decisions. How do we design data analysis techniques that are aware of, and respect, all members of a diverse population? Your browsing data will allow us and our collaborators to study, design, and evaluate algorithm fairness.

3. Internet Health

Historically, the world wide web was defined by hyperlinks that brought together content from all over the internet and connected users with relevant information. However, large parts of the internet are now being walled off from public access. More and more, users discover content on the web not by following hyperlinks but via social media and other types of ad hoc content recommendations. Many sites now operate with financial incentives to keep users on their site for as long as possible, but are they serving the best interests of their visitors? Your browsing data will help us identify how users move across the internet; better understand how to monitor and assess the health of the internet; develop tools for discovering content that is of interest to users, not just of value to advertisers; and ensure that the valuable resource that is the internet stays open, connected, and accessible to all.

Initial data exploration

While we have a lot of plans (and hope you’re as excited about them as we are), let’s take a step back and look at what data analysis involves. In particular, for our very first experiment, we collected limited browsing history from 10,000 opt-in participants over a three-month period.

What does a data scientist do, when asked to “examine a dataset”? A critical first step in data analysis is getting a sense of what’s in the data: How is the data stored? What are the fields of the database? What is the quality of the data?

Data scientists spend a significant amount of time on data cleaning and transformation. While these steps are often referred to as janitor work, they take up 50% to 80% of the time in data analysis. Beyond scrubbing dirty data, however, these exploratory steps can also inform us of statistical analyses that the dataset may support or hypotheses that we may test against it.

We collected three types of information from participants who opted into our experiment.

  • URLs that the participants visited during the course of the study;
  • time and duration of the visits; and
  • an arbitrary identifier that allows us to disambiguate whether multiple URLs were opened concurrently.
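For illustration, a single record in such a dataset might look like the sketch below. The field names are hypothetical, ours rather than the actual telemetry schema:

```python
from dataclasses import dataclass

@dataclass
class VisitRecord:
    # Hypothetical field names, for illustration only.
    url: str           # the URL visited
    start_ts: int      # visit start time (Unix epoch seconds)
    duration_s: float  # time spent on the page, in seconds
    context_id: str    # arbitrary identifier used to tell whether
                       # multiple URLs were open concurrently

record = VisitRecord(
    url="https://example.org/article",
    start_ts=1478000000,
    duration_s=42.5,
    context_id="a3f9",
)
```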

Without burdening you with all the “janitor” data cleaning work, below are four analyses we performed to examine potential statistical patterns and identify data issues.

1. Dirty Data: The Super-Human Clickers

We looked at how the total number of URL visits is distributed across users.

Why might this be an important question to ask? Well, before we move ahead with any analysis, it would be good to know whether the records in our database come from a large number of average users or a small number of extreme users. Statistically, we need the former in order to ensure we have a good representation of the user base. We also prefer the former from a human-centered design perspective. If the data were to come from a small number of extreme users, any statistical patterns we extract, and any subsequent decisions we make based on the data, would be tailored toward those few users and would likely not benefit the greater Firefox or web user population.

To examine how URL visits are distributed across users, we grouped the users by their amount of activity into bins on a log scale. The x-axis in the visualization below represents user activity. For example, the bar over the number 3.0 represents all URL visits coming from users who visited approximately 10^3.0 = 1,000 URLs over the course of the study. The y-axis represents the total amount of activity from all users in each bin. For example, the users in the 10^3.0 bin (there were 582 of them) made a total of 675,708 URL visits.
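The binning described above can be sketched as follows; the bin width of 0.1 on the log scale is our assumption, not necessarily the one used for the actual chart:

```python
import math
from collections import Counter

def bin_visits_by_log_activity(visit_counts, bin_width=0.1):
    """Group total URL visits into log10 bins of user activity.

    visit_counts: one total-visit count per user.
    Returns {bin center on the log10 axis: total visits from all
    users falling into that bin}.
    """
    totals = Counter()
    for n in visit_counts:
        if n <= 0:
            continue
        center = round(math.log10(n) / bin_width) * bin_width
        totals[round(center, 1)] += n
    return dict(totals)
```

A user with 1,000 total visits lands in the bin centered at 3.0, matching the reading of the x-axis described above.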

The resulting chart is (mostly) a Gaussian distribution with a mode of users who made approximately 10^3.6 ≈ 4,000 URL visits. Whew! Breathe a big sigh of relief. We actually have a pretty decent spread of users by the amount of their activity.

But wait! On the right side of this chart, we do see three users who each made approximately 10^5.3 ≈ 200,000 URL visits, and who collectively sent in over ⅔ million URLs. That’s approximately 1.7 URL visits per minute, per user, non-stop over three months! While I heart the internet, I suspect I would get a pretty sore finger (or two) if I browsed the web so intensely. We suspect the browsing behaviors exhibited by these users are not representative of the typical Firefox (or web) user, so any statistical patterns extracted from them would not be meaningful; we have therefore excluded these outliers from subsequent analyses.
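Excluding such outliers is a one-liner; the 100,000-visit cutoff below is our illustrative choice, not the exact threshold used in the analysis:

```python
def exclude_extreme_users(visits_by_user, max_visits=100_000):
    """Drop users whose total visit count exceeds the threshold.

    visits_by_user: {user_id: total URL visits}. The cutoff is an
    illustrative assumption, not the study's actual threshold.
    """
    return {uid: n for uid, n in visits_by_user.items() if n <= max_visits}
```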

Pretty simple so far, eh? Yes, but we have to repeat these steps for numerous data dimensions and factors (and combinations of these data dimensions and factors). While it’s good news to see the data (mostly) fitting expected patterns, the process gets dry pretty fast. There’s a reason why it’s called janitor work, but these sanity checks need to be done!

To spare the readers from these mundane details, let’s jump ahead to a few other analyses we performed further down the line.

2. Recurring Temporal Activities

We looked at user activities on various recurring temporal scales (i.e., hourly, daily, weekly). After transforming the timestamps of the URL visits, we applied seriation to group together users whose temporal activities exhibit similar recurring patterns.

In the visualization below, each of the major columns represents one day of the week (Sunday through Saturday). Each of the minor columns represents one hour within a day. Each of the major rows represents one user (labeled by their UserID). Each of the minor rows represents one week within the study. Some users joined the study mid-way and therefore sent in fewer than 12 weeks of data. The blue squares represent the average number of (concurrent) tabs opened by each user, at each hour of each day of each week during the course of the study. The darker the blue, the more tabs opened. A white space indicates that a user did not use Firefox during the corresponding hour.

We then calculated a “recurring temporal activity similarity score” between all pairs of users. We sorted the 10,000 users so that those with similar recurring temporal activity patterns are placed near each other in the visualization. (Actually, we created multiple similarity scores to assess how the different measures affect our analysis. For clarity and simplicity, we show only the results from one run of the analysis in this blog post.)
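As a sketch of the idea, one could score each pair of users by the cosine similarity of their hour-of-week activity vectors and then order users greedily so that neighbors are similar. This greedy pass is a simple stand-in for seriation; the similarity measures actually used in the analysis are not specified here:

```python
import math

def cosine_similarity(a, b):
    """Similarity between two activity vectors (e.g. 168 hour-of-week bins)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def greedy_ordering(vectors):
    """Order users so that neighbors have similar activity profiles.

    A simple greedy stand-in for seriation: repeatedly append the
    remaining user most similar to the last one placed.
    """
    remaining = list(range(len(vectors)))
    order = [remaining.pop(0)]
    while remaining:
        last = vectors[order[-1]]
        nxt = max(remaining, key=lambda i: cosine_similarity(last, vectors[i]))
        remaining.remove(nxt)
        order.append(nxt)
    return order
```

Like seriation proper, this produces a soft arrangement rather than hard cluster boundaries, which is why the group sizes reported below are approximate.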

We found 1,000+ users who use Firefox regularly during the same eight hours: business hours in Eastern Time, and only on weekdays. The first 30+ users in this group are shown in the visualization. (The number is only approximate because seriation is akin to “soft clustering” and therefore produces no hard boundaries between the user groups.) Other consistent groupings of users include those who exhibit 7-day recurring patterns, continuous activity, and night-owl behaviors, as well as 5-day business-hour patterns phase-shifted for the various North American time zones and/or workday lengths. Through this process, we also found 2,500 users who exhibit highly irregular temporal activity patterns.

3. Concurrent Browsing Activities

We repeated a similar analysis, but instead of grouping users by their recurring temporal activities, we looked for patterns of concurrent activity: in other words, the number of tabs a user had open at the same time.

In the visualization below, each row represents a user (among the 8,000 users who exhibit a recurring temporal pattern). The columns represent cases where a user opened N tabs concurrently, where N = [1, 2, 3, 4, 5, 6, 7–8, 9–10, 11–20, 20+]. The length of the bars represents the amount of user activity that took place while the user had N tabs open. Users are sorted so that those with similar concurrent activities are placed near each other. (We also explored various similarity measures and various ways of binning N, including no binning, to assess the stability of our analysis.)
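The binning of N described above can be sketched as a small helper; treating a count of exactly 20 tabs as part of the 11–20 bin is our reading of the chart’s overlapping bin labels:

```python
def tab_bin(n):
    """Map a concurrent-tab count to the chart's bins:
    1, 2, 3, 4, 5, 6, 7-8, 9-10, 11-20, 20+.

    The chart's labels overlap at 20; placing exactly 20 tabs in
    "11-20" is our assumption.
    """
    if n <= 6:
        return str(n)
    if n <= 8:
        return "7-8"
    if n <= 10:
        return "9-10"
    if n <= 20:
        return "11-20"
    return "20+"
```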

We found that very few of the users in the dataset browse the web mainly in a single-tab mode. For example, from rows 700 to 1,100, we observe that about 400 (or 5%) of the users in the study visited the vast majority of the URLs while having 1 to 2 tabs open. From rows 0 to 700, about 9% of the users work with up to 3 tabs. From rows 1,100 to 2,600, about 19% of the users work with up to 4 tabs. At the bottom of the chart, from rows 6,000 to 8,200, more than 1/4 of the users opened more than 20 tabs at some point during the first Context Graph experiment. Do you constantly have 20+ tabs open when using Firefox? You’re not alone! ;)

4. Domains

We also looked at the summary statistics of URL domains. From September to December 2016, the self-selected participants in our study visited the following domains (including all subdomains of each site) the most frequently.

  1. Google
  2. Facebook
  3. Amazon
  4. Craigslist
  5. Yahoo
  6. Reddit
  7. YouTube
  8. eBay
  9. Wikipedia
  10. Live

The rankings of Amazon, Craigslist, and eBay stood out in this list, both individually and collectively as e-commerce sites: they ranked as more popular here than in published national statistics. Unfortunately, we do not have sufficient data to explain the discrepancy.

Given that our experiment spanned the holiday shopping period, one hypothesis is seasonality in web browsing behavior. Interestingly, the most frequently visited subdomain for Amazon (other than www.amazon.com itself) is smile.amazon.com, an option for users to donate to charitable organizations through their Amazon purchases. Could the holiday shoppers in our study also be feeling extra charitable?
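Folding subdomains into their parent site, as done for the ranking above, can be sketched as follows. The two-label heuristic is naive (it mishandles suffixes like .co.uk); a real analysis should use the Public Suffix List:

```python
from collections import Counter
from urllib.parse import urlparse

def base_domain(url):
    """Naive base-domain extraction: keep the last two host labels.

    Good enough to fold smile.amazon.com into amazon.com, but wrong
    for multi-label suffixes such as example.co.uk.
    """
    host = urlparse(url).hostname or ""
    parts = host.split(".")
    return ".".join(parts[-2:]) if len(parts) >= 2 else host

def top_domains(urls, k=10):
    """Rank the most frequently visited base domains."""
    return Counter(base_domain(u) for u in urls).most_common(k)
```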

Next steps

We are excited to be moving forward with the next phase of the Context Graph experiment.

We will be triangulating our analysis findings with other studies done by the user research team at Mozilla, so that we have a complete picture of the Firefox user experience not just in terms of numbers and statistics, but also in terms of what the numbers mean to our users, how they affect your daily work, and how they shape your interactions on the web.

We will also be launching a second data collection effort, so that we can learn more about long-term usage patterns and answer targeted questions. The findings will go to relevant product teams to help improve Firefox features, engineering teams to help improve Firefox performance, as well as folks at the Mozilla Foundation examining the health of the internet.

Your participation in the Context Graph experiment is contributing to the work across the entire Mozilla organization (and beyond, to leading-edge academic research). You may hear from us again shortly. We hope you will respond to our requests.

If you have any feedback or questions, please do not hesitate to reach out to us. We would like to keep this conversation about data and analysis open between us, so that we all learn from this experiment. Cheers!
