Visualization of 10 Years Twitter Data (Part 1 — Datasets)

This article is part of the production notes for the data visualization project.

10 years ago today, I opened my Twitter account and since then I’ve tweeted more than 102K times. Yes, I tweet a lot. It’s a private account and I tweet mostly in Korean, so please don’t leave to follow me right now.

As a data visualization designer, I’ve always wanted to make something out of my own tweets, possibly the most personal, the most massive, and the most robust dataset that I could acquire. Before the actual design and coding, I spent time contemplating which aspects of this data would be fun for me to design and would amuse other people. What I wanted to discover at the beginning of this project included:

  • Do I really tweet a lot, even during work hours, even right after waking up?
  • How often do I use Twitter to talk with friends?
  • Which one do I use more often for tweeting, computer or phone?
  • What does my network of Twitter friends look like? (This question was more exploratory and ended up generating lots of ideas.)

In this post, I describe what I wanted to show with this visualization project and accordingly how I generated the datasets.

Acquiring and Cleaning Data

I figured that there are two ways to retrieve all your tweets: 1) downloading an archive of your tweets, an option you can find under your account settings, and 2) using Twitter’s REST APIs. As we all know, no dataset is perfectly formatted with no missing information, so I had to clean the datasets quite extensively. I could write a whole new article about this process, but my Python code explains it better. A few highlights are:

  1. Regarding timestamps, old tweets in the downloaded archive are dated only to the day, without the hour, minute, or second. Since I wanted to know what time of day I tweet, I needed this data. To get it, I used the REST APIs to fetch the complete metadata of every tweet whose ID is found in the archive.
  2. Again, I wanted to know when I tweet. The metadata of each tweet now includes a detailed timestamp down to the second. However, it’s in UTC, so I had to convert all the timestamps to the correct timezone. For this, I made a JSON file that lists where I lived and traveled to for the past 10 years, including the timezone.

To build this file, I looked up the stamps in my passport and dug into my flight booking emails. Because I don’t know the exact time of day at which I moved into a different timezone, I converted all tweets on the starting date of a move or trip to the new timezone. I understand this cleaning is not perfect, but it is acceptable considering the volume of the entire data (100K+ tweets).
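The conversion described above can be sketched roughly as follows. This is only an illustration, not my actual script: the structure of the location log, the place names, and the function names are all made up for the example, and I assume the API timestamps look like "Sat Apr 15 04:30:00 +0000 2017".

```python
import json
from datetime import datetime, timezone
from zoneinfo import ZoneInfo  # Python 3.9+; needs the system tz database

# Hypothetical location log (places and dates are invented for this sketch).
# Each entry applies from its start date until the next entry begins.
LOCATIONS_JSON = """
[
  {"start": "2007-04-15", "place": "Seoul",  "tz": "Asia/Seoul"},
  {"start": "2010-08-20", "place": "Boston", "tz": "America/New_York"}
]
"""

def load_locations(raw):
    """Return (start_date, ZoneInfo) pairs sorted by start date."""
    entries = json.loads(raw)
    pairs = [
        (datetime.strptime(e["start"], "%Y-%m-%d").date(), ZoneInfo(e["tz"]))
        for e in entries
    ]
    return sorted(pairs, key=lambda pair: pair[0])

def parse_created_at(value):
    """Parse a UTC timestamp string in the assumed API format."""
    return datetime.strptime(value, "%a %b %d %H:%M:%S %z %Y")

def to_local(utc_dt, locations):
    """Convert an aware UTC datetime to the timezone I was in on that day.

    Every tweet whose local date is on or after an entry's start date is
    assigned to that entry's timezone, newest entry first.
    """
    for start, tz in reversed(locations):
        local = utc_dt.astimezone(tz)
        if local.date() >= start:
            return local
    # Earlier than the first entry: fall back to the first timezone.
    return utc_dt.astimezone(locations[0][1])

locations = load_locations(LOCATIONS_JSON)
local = to_local(parse_created_at("Sat Apr 15 04:30:00 +0000 2017"), locations)
print(local.isoformat())
```

Because the log only has dates, a tweet posted on a travel day before the flight ends up in the destination timezone, which is the imperfection mentioned above.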

A JSON data for a single tweet looks like this:

Datasets: “Tweets” and “Friends”

Now I have 102K+ tweets with correct time and timezone info and other metadata. Then what’s next? When designing a data visualization, it is critical to clarify what messages you want to deliver. Clarifying the message does not mean that you constrain the insights or stories that users will explore; rather, it helps you figure out which aspects of the original data you have to focus on.

After lots of iterations and ideation about what to visualize with my tweet data, I ended up with two main topics: Tweets and Friends.

Tweets

The Tweets view shows when I tweeted from both macro and micro points of view. First, it shows the number of tweets by month in a bar chart. A slider is overlaid on this bar chart to select a time span, and further analysis follows for the selected range. For the selected time span, I added a visualization of a day (7) by hour (24) matrix, in which each of the 7×24 blocks represents how many tweets I made in that day and hour combination during the selected time span.
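As a rough sketch of the two aggregations behind this view (monthly counts for the bar chart and the 7×24 matrix for a selected span), something like the following would work. The function names are hypothetical, and I assume the timestamps have already been converted to local time.

```python
from collections import Counter
from datetime import datetime

def tweets_per_month(local_times):
    """Count tweets per 'YYYY-MM' month for the bar chart."""
    return Counter(t.strftime("%Y-%m") for t in local_times)

def day_hour_matrix(local_times, start, end):
    """Count tweets per (weekday, hour) cell within [start, end).

    Rows are weekdays (Monday=0 .. Sunday=6), columns are hours 0..23.
    """
    matrix = [[0] * 24 for _ in range(7)]
    for t in local_times:
        if start <= t < end:
            matrix[t.weekday()][t.hour] += 1
    return matrix

# Tiny made-up sample: two tweets on a Saturday morning, one in May.
sample = [
    datetime(2017, 4, 15, 8, 5),
    datetime(2017, 4, 15, 8, 40),
    datetime(2017, 5, 1, 13, 0),
]
months = tweets_per_month(sample)
matrix = day_hour_matrix(sample, datetime(2017, 4, 1), datetime(2017, 5, 1))
print(months["2017-04"], matrix[5][8])
```

The slider only changes `start` and `end`, so the matrix can be recomputed cheaply whenever the selection moves.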

While the time of tweets surely tells something interesting about myself (e.g., I tweet a lot as soon as I get up), I also wanted to utilize all the metadata of a tweet. Among this rich data, I focused on four categories: Interaction, Media, Language, and Source.

Interaction has three types: Mention, Retweet, and Quote. A Mention is a direct reference to another user, usually a tweet that starts with @ followed by a user name. Retweet and Quote are ways of referencing other open tweets.

Media has three types: Photo, Video, and URL. In fact, a tweet can have both a photo and a URL at the same time, but in this analysis, URL means a tweet that has only links to external sources.

As I mentioned earlier, I tweet mostly in Korean, but at the beginning, I tweeted mainly in English. So I added this category to see how my use of language changed over time. These two, Korean and English, are the language types that I include. Other or undefined languages (mostly pure emoji tweets) are not specified in the analysis.

Source means the medium that I used when I tweeted. Some examples are “Twitter for iPhone” (the official app for iOS) and “Twitter Web Client.” In this analysis, I aggregated these sources into two broad groups: Big Screen and Small Screen. Big Screen includes sources like the web client that I use on a big screen, that is, my computer. Small Screen means mobile devices, including the iPhone, Android, and Windows Phone apps, and mobile browsers. If you tweet through a share button on a 3rd party service, the source is specified as the service, for example “Ask FM” or “YouTube.” These 3rd party sources technically come from either a big or a small screen, but I did not include them because I want to see source as the physical device I use for tweeting.
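The aggregation of sources could look something like the sketch below. The two sets are illustrative rather than my complete list: only “Twitter for iPhone” and “Twitter Web Client” are named above, and the other labels are assumed. The sketch also assumes the source field may arrive wrapped in an HTML link, so tags are stripped first.

```python
import re

# Illustrative groupings; labels other than "Twitter for iPhone" and
# "Twitter Web Client" are assumptions made for this sketch.
BIG_SCREEN = {"Twitter Web Client"}
SMALL_SCREEN = {
    "Twitter for iPhone",
    "Twitter for Android",
    "Twitter for Windows Phone",
    "Mobile Web",
}

TAG = re.compile(r"<[^>]+>")

def screen_type(source):
    """Return 'big', 'small', or None for 3rd party or unknown sources."""
    label = TAG.sub("", source).strip()
    if label in BIG_SCREEN:
        return "big"
    if label in SMALL_SCREEN:
        return "small"
    return None  # e.g. "Ask FM" or "YouTube": left out of the analysis

print(screen_type('<a href="http://twitter.com" rel="nofollow">Twitter for iPhone</a>'))
print(screen_type("YouTube"))
```

Returning `None` for anything outside the two sets keeps 3rd party services out of the counts without having to list them all.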

While mining the data acquired in the earlier steps into the final datasets used on the front end, I wanted to optimize the file size and format so that the JavaScript code would do minimal work. Now the data for a single tweet looks like the example below. The single letters in the second item of the array represent the types of the categories (e.g., p is for Photo, k is for Korean). The size of this minimized data for the entire 102K+ tweets is now 3.6 MB!

["2017-04-15 13 6", ["p", "f", "k", "s"]]
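A minimal sketch of producing a record in this shape is below. Several details are assumptions for illustration: I treat the first item as "date hour weekday" with Sunday as 0, and every letter code except p (Photo) and k (Korean) is invented here rather than taken from the real dataset.

```python
import json
from datetime import datetime

# Only "p" and "k" come from the text above; the rest are hypothetical codes.
CODES = {
    "Photo": "p",
    "Korean": "k",
    "Mention": "m",
    "Retweet": "r",
    "Quote": "q",
    "Video": "v",
    "URL": "u",
    "English": "e",
    "Big Screen": "b",
    "Small Screen": "s",
}

def compact(local_dt, categories):
    """Pack one tweet as [\"YYYY-MM-DD hour weekday\", [codes...]]."""
    weekday = (local_dt.weekday() + 1) % 7  # assumed: Sunday=0 .. Saturday=6
    stamp = f"{local_dt:%Y-%m-%d} {local_dt.hour} {weekday}"
    return [stamp, [CODES[c] for c in categories]]

record = compact(datetime(2017, 4, 15, 13, 5), ["Photo", "Korean"])
# Compact separators drop the spaces that json.dumps adds by default.
print(json.dumps(record, separators=(",", ":")))
```

Precomputing the hour and weekday means the front end never has to parse dates or handle timezones, and single-letter codes keep each record short.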

Friends

I met lots of great people on Twitter. For a 10-year Twitter user, I do not have many followers or followings, mostly because I’ve protected my account for most of the time. The analysis described above shows that I have 349 friends, not including deleted accounts. Initially I wanted to include those deleted accounts, although I could not retrieve their unique numerical Twitter IDs. Then I realized that if I manually traced all those deleted accounts from my tweet text (mostly through @ mentions), it might go against those friends’ intention to erase their traces on Twitter. Thus, the 349 friends are those who kept their accounts active as of April 2017.

What interesting facts could I discover by analyzing my Twitter friends? Some immediate thoughts I had were how many mentions I’ve sent them and how long I’ve been talking with them. The Count and Duration of communication are easily mined. Count means the number of my tweets that reply to a specific friend’s own tweets; a friend is not counted when her or his ID is simply included in a tweet. Duration is the difference in time between the first mention and the last mention.
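Count and Duration can be mined in a few lines. The sketch below assumes each tweet has been reduced to a dict with hypothetical keys `created` (a datetime) and `reply_to` (the ID of the friend being replied to, or None), and it measures Duration in whole days, which is also an assumption.

```python
from collections import defaultdict
from datetime import datetime

def count_and_duration(tweets):
    """Return {friend_id: {"count": n, "duration": days}} from reply tweets."""
    times = defaultdict(list)
    for tweet in tweets:
        # Only direct replies count; a bare @ID elsewhere in the text does not.
        if tweet["reply_to"] is not None:
            times[tweet["reply_to"]].append(tweet["created"])
    return {
        friend: {"count": len(ts), "duration": (max(ts) - min(ts)).days}
        for friend, ts in times.items()
    }

# Made-up sample data for illustration.
sample = [
    {"created": datetime(2017, 2, 6, 10, 0), "reply_to": "111"},
    {"created": datetime(2017, 3, 1, 9, 30), "reply_to": "111"},
    {"created": datetime(2017, 3, 2, 9, 30), "reply_to": None},
    {"created": datetime(2017, 3, 5, 20, 0), "reply_to": "222"},
]
print(count_and_duration(sample))
```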

Besides Count and Duration, I wanted to do something more personal and community-oriented.

For the personal aspect, I manually tagged each friend as one of the following: 1) a person that I met in the real world first, 2) a person I have talked with only via Twitter, 3) a person that I met first on Twitter and then in person, 4) a non-human account (i.e., an institution or event account) or a celebrity. These four categories are color-coded in the visualization. A data blurb of a friend looks like this:

{
  "name": "exampleUser",
  "points": [
    ["2017-04", 32],
    ["2017-03", 69],
    ["2017-02", 27]
  ],
  "first": "Feb 6, 2017",
  "count": 128,
  "duration": 53,
  "id": "931827750",
  "common": 9,
  "category": 2
}

Twitter allows me to build my network by connecting to the friends of my friends. Considering this nature of Twitter, I wanted to see who was involved when I talked with other people. When a thread of replies is generated, multiple Twitter accounts are often mentioned in one tweet. By parsing these mentions, I came up with a dataset of the community: who is involved in a conversation with a friend.
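One way to parse these mentions is sketched below. It is a simplified illustration, not my actual code: the function name is hypothetical, and the regular expression is a rough approximation of a Twitter user name (up to 15 letters, digits, or underscores after @).

```python
import re
from collections import Counter, defaultdict

# Rough user name pattern; ASCII-only so adjacent Korean text is not captured.
MENTION = re.compile(r"@([A-Za-z0-9_]{1,15})")

def co_mentions(texts):
    """Map each mentioned account to a Counter of accounts mentioned with it."""
    community = defaultdict(Counter)
    for text in texts:
        names = {name.lower() for name in MENTION.findall(text)}
        for a in names:
            for b in names:
                if a != b:
                    community[a][b] += 1
    return community

# Made-up tweets for illustration.
sample = [
    "@alice @bob see you there",
    "@Alice @carol @bob haha",
    "@alice good morning",
]
community = co_mentions(sample)
print(community["alice"].most_common())
```

Here `len(community["alice"])` gives the number of distinct accounts that appeared in conversations with that friend, and the counts show how often each one did.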

In the end, I analyzed the distribution of Count, Duration, and the number of shared friends across all of my Twitter friends. The Friends section is more of a visual analytics tool than a set of interactive visualizations. A friend can be selected in multiple ways, including a conventional search/dropdown menu as well as interactive features in the visualizations. Selecting a friend triggers an update of all these visualizations and shows the friend’s ranking in Count, Duration, and number of shared friends.

What’s Next?

This project is now available at http://tany.kim/twitter

In the next post, I will describe the design process of the data visualization, including the decisions made about visual forms and interaction.