By three methods we may learn wisdom. First, by reflection, which is noblest; Second, by imitation, which is easiest; and third by experience, which is the bitterest. — Confucius
For the purpose of tracking the time I spend on a browser, I use the Limitless extension on Chrome. While it gives me the time spent under categories, I thought it might be useful to inspect all my browsing data for the past year.
Here begins my quest to understand all that there was in my browsing data.
Getting History Data
The first step in the process was to get all the browsing data for the past year. Google chrome stores the past 3 months of history on a device in SQLite format, but I ended up exporting my Google tracked data using Google TakeOut. The exported json has my browsing history across all devices, including mobile.
The history stored by Chrome or tracked by Google does not give me the session information i.e. time spent on each tab. So my analysis is mainly focused on the number of visits and the time of visit rather than the session or the duration. A part of me is relieved actually, to know that Google is not tracking it yet.
Once the data was downloaded, I began by loading the data into a Pandas dataframe:
import pandas as pd
with open("BrowserHistory.json") as f:
data = json.loads(f.read())
df = pd.DataFrame(data["Browser History"])# A possible param if differentiation is needed b/w different clients
df.drop('client_id', axis=1, inplace=True)
df.drop('favicon_url', axis=1, inplace=True)
This is how the output looks like:
page_transition: Contains info on the type of page open like reload, type & enter, link open etc. I was satisfied by filtering only on LINK and TYPED
df = df[(df['page_transition'] == "LINK") | (df['page_transition'] == "TYPED")]
Extracting/Extrapolating new columns(features):
To start off, I needed to break the time (in microseconds) to human-readable datetime format. Then I needed to derive features from it like hour, day, month, or day_of_week. From the URL field, extracting the top-level domain could be a useful field for analysis. So I used tldextract to create a new domain column in the dataframe.
return datetime.datetime.fromtimestamp(x/1000000)days_arr = ["Mon","Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
return days_arr[x.weekday()]def get_domain(x):
domain = tldextract.extract(x)
sub_domain = tldextract.extract(x)
if sub_domain == "mail":
return sub_domain + "." + domain
# Ugly hack to differentiate b/w drive.google.com and google.com
if domain == "google" and sub_domain=="www":
return domain# time_usec column is picked and for each row, convert_time(row) is called. The result is stored in the same dataframe under column dt
df['dt'] = df['time_usec'].apply(convert_time)
df['domain'] = df['url'].apply(get_domain)
Then I extrapolated the domain information to group well known domains into one or the other categories(buckets) defined by me:
if x in ["coursera", "kadenze", "fast", "kaggle", "freecodecamp"]:
elif x in ["ycombinator", "medium", "hackernoon"]:
return "Other"# Cluster popular domains into a category
df['category'] = df['domain'].apply(get_category)
After all the operations, the dataframe now contains the following columns and basic analysis could begin.
Available columns: title,date,hour,month,is_secure,is_weekend,day_of_week,domain,category
Exploring data and creating visualizations
Secure vs Insecure usage:
Once I have a dataframe with some numerical and categorical columns (month), creating a plot is super easy.
import seaborn as sns
sns.countplot(x="month", hue="is_secure", data=df)
# Manual inspection, picking 50 random domains which were insecure
random.sample(df[df["is_secure"] == "N"].domain.unique(), 50)# To view data for such domains
df[(df["domain"] == "mydomain") & (df["is_secure"] == "N")]["url"]
After looking at a couple of such visits, I ended up checking this site. It asks for Passport or Aadhar (India’s equivalent of SSN) number, along with email and mobile, while booking a jungle-safari, over HTTP. I failed to notice it earlier! Final booking is handled through a separate and secure gateway. However, I still would feel much safer typing my demographics and passport details over HTTPS.
Instead of manually exploring rows, one stricter solution could be to add all such domains to an extension like BlockSite. They could be enabled as and when needed.
Weekday vs Weekend browser usage:
#is_weekend="Y" for saturday and sunday, "N" otherwise
sns.countplot(x="hour", hue="is_weekend", data=df)
Browser usage over months:
To achieve this, I selected a subset of rows based on month’s condition and then grouped everything by the hour and date, to form a GitHub style heatmap viz.
from matplotlib import pyplot as plt# Getting unique values after grouping by hour and date
df_new = df[(df["month"] >= 11)].groupby(["hour", "date"])["domain"].size()
df_new = df_new.reset_index(name="count")plt.figure(figsize = (16,5))# Pivot the dataframe to create a [hour x date] matrix containing counts
sns.heatmap(df_new.pivot("hour", "date", "count"), annot=False, cmap="PuBuGn")
The above code can easily be filtered. This can be done by adding more conditions to identify productive vs non-productive tab open timings and to view patterns over days. For example:
cat_arr = ["Shopping", "TravelBookings", "YouTube", "Social"]df_new = df[(df["category"] in cat_arr)].groupby(["hour", "date"])["domain"].size()
Browser visits by day of week and hour:
I created another type of aggregated heatmap where I tried visualizing wrt hours and which day of the week it is.
df_heat = df.groupby(["hour", "day_of_week"])["domain"].size().reset_index()
df_heat2 = df_heat.pivot("hour", "day_of_week", "domain")
sns.heatmap(df_heat2[days_arr] , cmap="YlGnBu")
One would expect post 5 pm Fridays through Monday morning to be of light usage. But what was interesting for me to reflect on was the light-colored areas on Wednesday evenings.
Now to use the custom categories that I manually bucketed the domains into. I generate the same heatmap again. But now with a condition on popular shopping sites. Note that the list is manually created by me based on my memory and random peeks into the unique domains I visited.
df_heat = df[df["category"] == "Shopping"].groupby(["hour", "day_of_week"])["category"].size().reset_index()
It’s good to have the satisfaction that I usually do not go on a shopping spree during office hours. However, the chart encouraged me to manually explore Thursday(20:00–21:00) and Friday(15:00–16:00, 00:00–01:00). At a higher level, I was very confident that I never shop during office timings. However, the heat-map shows some instances of such visits, shattering my illusions.
Most revisited stackoverflow questions:
A good friend once told me:
Understanding stackoverflow usage helps you understand either your areas of improvements or configurations/syntax you ought to remember.
In any case, it’s good to have a cursory look at the most frequent visits for each month/quarter.
df_so = df[df["domain"] == "stackoverflow"].groupby(["url", "title"]).size()
df_so = df_so.reset_index(name='count').sort_values('count',ascending=False)[["title", 'count']]df_so.head(15)
Maybe I should cache the page which shows me how to iterate over a Pandas dataframe!
Apart from stackoverflow, one of my most visited sites related to Pandas would be Chris Albon’s notes on python and data-wrangling.
In general, it is very interesting to observe how your most-visited pages change theme over months. For example, they may move from simple questions to more complex, deeper ones. This is true as you build your understanding towards something new.
Lastly, just for fun, I ended up concatenating titles of all my stack-overflow searches for the past year. I then generated a decent looking word-cloud out of it.
Thank you very much for your time. If you enjoyed reading, please give me some claps so more people see the article. Thank you! And, until next time, have a great day :)
A working notebook is present on GitHub with some more visualizations and some quick hacks around the data. Please do try it out with your own history dump and share interesting insights!