A new way to analyze your personal chat history
In the last decade, I’ve sent and received over 460,000 messages, which works out to an average of 126 messages per day! For photos, Google and Apple have both introduced a “Memories” feature in their respective photo apps, which lets you look back on photos that the almighty machine learning algorithm has deemed meaningful to your life. Why not do the same for messages?
Inspired by Chandler’s similar project, for the last couple of months I’ve been working on an open source tool called Converscope (github, interactive preview). It pulls data from your Facebook chats and iMessages and shows you who you message the most—over the last year, particular time periods of your life, or over all time. You can drill down into a particular thread and see interesting metrics like
- a histogram of message counts per day
- the total number of characters sent per person (a proxy for how talkative each party is)
- your longest streak and when it ended, a la Snapchat
- the most used emoji per person
- Blast From The Past™: a random, potentially cringy message from your past
- the TF-IDF top tokens for the person/group chat. TF-IDF stands for term frequency-inverse document frequency. It automatically surfaces the topics that you talk about a lot in this chat that you don’t talk about with other people/groups.
I’ve anonymized names and released an interactive preview at converscope.daylen.com, so you can browse around and see what it looks like. (Certain features such as Blast From The Past™ and TF-IDF Top Tokens are not available in the preview, for privacy reasons.) If you’re one of the top 20 people I message the most, you can try to figure out what SHA-1 hash you are 🙃. And as I mentioned, Converscope is open source, so you can run this analysis on your own data! Instructions are in the README, and please do let me know (e.g. via Twitter) if anything doesn’t work.
The rest of this post is split into two sections: Interesting Findings, where I talk about some of the interesting things I learned by poking around this data, and Building Converscope, which is about the visual and technical design decisions I made.
If you pop open converscope.daylen.com and flip over to the Group Chats tab, the first two chats you see are these mega-chats with over 20K messages each:
An artifact of the pre-Slack era, THE H@BMIND was the officer chat for Hackers at Berkeley, a club I joined as a wide-eyed freshman. We were primarily an educational club: we hosted events where we taught eager students things like how to build a web app, or how to use git. The chat was where we did event planning, and the ebb and flow of message counts corresponds to the school seasons: for example, lots of activity in fall 2013 and spring 2014 versus a drop in summer 2014. You can also see that the club never recovered from the summer 2015 slump, when some of the more senior members graduated or moved on to TAing.
As for Laurel Grove crew: in the summer of 2015, I interned at Facebook, and this was the group chat for a set of us who lived at corporate housing at Laurel Grove (such creative naming). We were a chatty bunch, and attempts were made at keeping the group alive after the summer ended. But that petered out by the end of 2016. Every group chat eventually dies.
The predominant flavor of group chat varies based on time period. In my college days, there were the classics like the aptly named presentation due sunday 10pm! (which did in fact see its highest number of messages on Sunday, November 27, 2016), CS162 project group, and 189 shittrs. These group chats shuttered after the respective classes finished.
Also in that same time period were several group chats dedicated to trips: Thailand 12/31–1/13 (video recap!), NYC 8/15–8/21, and Roadest Trip 2017 (a spring break road trip across the PNW, after roader trip 2k16: A Profoundly Religious Experience and the original road trip in 2015, which lacks a group chat). Here, the Most Used Emoji metric comes in handy to distinguish the trips. (In Facebook Messenger, you can set a default emoji for a group, and it turns out people tend to smash that button a lot.)
After graduating, a larger percentage of group chats shifted to be trip-oriented. It turns out that group chats centered around trips follow a fairly standard formula:
- There is an intense planning phase several months or weeks prior to the trip
- Next, the actual trip generates the majority of the messages
- Finally, there may be one last hurrah where the Splitwise is closed out or photos are shared
The Longest Streak metric consistently identifies the actual trip duration. Modeled after Snapchat streaks, Longest Streak requires that at least one message be sent every day for consecutive days. So if I’m looking at the confusingly titled You and Taco chat, I can see that perhaps this chat was about the post-graduation Europe tour that my college friends took in 2017, along with our diminishing attempts to stay in touch afterwards.
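The streak rule above (at least one message on every one of a run of consecutive days) can be sketched in a few lines. This is a minimal illustration, not Converscope’s actual code: the function name `longest_streak` and the input shape (a plain list of dates, one per message) are assumptions made for the example.

```python
from datetime import date, timedelta


def longest_streak(message_dates):
    """Return (length_in_days, end_date) of the longest run of consecutive
    calendar days that each have at least one message.

    `message_dates` is a hypothetical input: one `date` per message.
    """
    days = sorted(set(message_dates))  # several messages on one day count once
    if not days:
        return 0, None

    best_len, best_end = 1, days[0]
    run = 1
    for prev, cur in zip(days, days[1:]):
        # Extend the run only if `cur` is exactly the day after `prev`.
        run = run + 1 if cur - prev == timedelta(days=1) else 1
        if run > best_len:
            best_len, best_end = run, cur
    return best_len, best_end


example = [
    date(2017, 6, 1),
    date(2017, 6, 2),
    date(2017, 6, 2),
    date(2017, 6, 3),
    date(2017, 6, 10),
]
print(longest_streak(example))  # (3, datetime.date(2017, 6, 3))
```

Returning the end date alongside the length matches the “longest streak and when it ended” metric; on ties, this sketch keeps the earliest streak.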
Obviously, it’s more exciting when you can actually see the names as opposed to just XXX everywhere. I’ve got a private instance of Converscope with STRIP_PII (personally identifiable information) set to false, but in this post I’ll refer to everyone by their (salted!) SHA-1 hash.
Using the “Sort by…” dropdown and selecting the College option, it’s interesting to see who from college I’ve kept in touch with and who has fallen by the wayside. Just contrast the histogram for my longtime friend 414129b (with whom I have a 59-day streak!) with the one for 368db00.
Equally interesting is seeing the new friends I’ve made post-college. Turns out, there are a lot! b01f615, bd417c1, and 8b20864, just to name a few. Here, the TF-IDF Top Tokens feature shows how most of these friendships are oriented around specific activities like cycling and photography:
And finally, to close out the Interesting Findings section, here’s a fun Blast From The Past™ to when I faceplanted when riding my Boosted Board:
Notes on building Converscope
The Converscope frontend is a React app and is pretty much just a fancy JSON viewer. I’ll just highlight two of my favorite features. First up is the hover animation for the card design, inspired by the card design in Apple TV. On hover, the card gains a drop shadow and also subtly grows in size.
Second, I’m pretty proud of the dark mode theme, which automatically kicks in if you switch to dark mode on your iOS or macOS device. (This is thanks to the prefers-color-scheme CSS Media Query!)
You can try these for yourself at converscope.daylen.com.
There are two parts to the Converscope backend. First, a script parses and merges the Facebook and iMessage data formats into easy-to-understand Inbox, Conversation, and Message protos. Second, the metrics to display are computed from those protos, which is mostly a matter of filling a dictionary.
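To make the Inbox / Conversation / Message shape concrete, here is a rough sketch. Converscope uses protos; the dataclasses below are only a stand-in, and every field name (`sender`, `text`, `timestamp`, and so on) plus the `compute_metrics` helper is an assumption made for illustration, not the project’s real schema.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class Message:  # hypothetical fields, not the real proto definition
    sender: str
    text: str
    timestamp: datetime


@dataclass
class Conversation:
    name: str
    messages: list = field(default_factory=list)


@dataclass
class Inbox:
    conversations: list = field(default_factory=list)


def compute_metrics(conversation):
    """Fill a plain dictionary with a few per-conversation metrics."""
    per_day = Counter(m.timestamp.date().isoformat() for m in conversation.messages)
    chars = defaultdict(int)
    for m in conversation.messages:
        chars[m.sender] += len(m.text)
    return {
        "message_count": len(conversation.messages),
        "messages_per_day": dict(per_day),       # feeds the histogram
        "characters_per_sender": dict(chars),    # proxy for talkativeness
    }


chat = Conversation("demo", [
    Message("alice", "hello", datetime(2017, 6, 1, 9, 0)),
    Message("bob", "hi", datetime(2017, 6, 1, 9, 5)),
    Message("alice", "lunch?", datetime(2017, 6, 2, 12, 0)),
])
metrics = compute_metrics(chat)
```

Because the result is a plain dictionary of strings and numbers, it can be serialized to JSON as-is for a frontend to display.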
A special thanks to Kevin Chen who pointed out that I should add a salt before hashing the sender name when generating conversation IDs—otherwise it would be trivial to deanonymize the preview website by running a dictionary attack on an easily obtainable friend list (e.g. Facebook).
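A small sketch of why the salt matters, using only Python’s standard `hashlib` and `secrets` modules. The function name `conversation_id`, the 16-byte salt, and the truncation to seven hex characters (chosen to resemble IDs like 414129b) are assumptions for this example, not details confirmed by the Converscope source.

```python
import hashlib
import secrets


def conversation_id(sender_name, salt, length=7):
    """Hash a name together with a secret salt and keep a short hex prefix.

    Without the salt, anyone holding a list of likely names could hash each
    one and match it against the published IDs.
    """
    digest = hashlib.sha1(salt + sender_name.encode("utf-8")).hexdigest()
    return digest[:length]


# Generate the salt once, keep it private, and reuse it for every name so
# that IDs stay stable between runs.
salt = secrets.token_bytes(16)
print(conversation_id("Example Friend", salt))
```

The same name always maps to the same ID under a given salt, but the mapping cannot be reproduced by someone who does not know the salt.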
For TF-IDF, setting maximum document frequency to 0.2 really boosted the quality of the displayed tokens. That means that if a token appears in more than 20% of chats, it is discarded. This helps eliminate common words like “the” and “and.”
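The effect of the 20% cutoff can be seen in a tiny standard-library sketch. The post doesn’t say which library or tokenizer Converscope uses, so the `top_tokens` function, its `max_df` and `k` parameters, and the toy chats below are all illustrative assumptions.

```python
import math
from collections import Counter


def top_tokens(chats, max_df=0.2, k=2):
    """Rank tokens per chat by TF-IDF, dropping tokens that appear in more
    than `max_df` (a fraction) of all chats.

    `chats` maps a chat name to its list of tokens.
    """
    n_chats = len(chats)
    doc_freq = Counter()
    for tokens in chats.values():
        doc_freq.update(set(tokens))  # count each token once per chat

    result = {}
    for name, tokens in chats.items():
        scores = {}
        for token, count in Counter(tokens).items():
            df = doc_freq[token]
            if df / n_chats > max_df:
                continue  # too common across chats, e.g. "the" and "and"
            scores[token] = (count / len(tokens)) * math.log(n_chats / df)
        # Highest score first; break ties alphabetically for stable output.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        result[name] = ranked[:k]
    return result


chats = {
    "cycling": ["the", "ride", "and", "the", "climb", "ride"],
    "photo": ["the", "lens", "and", "the", "camera", "lens"],
    "work": ["the", "meeting", "and", "deadline"],
    "family": ["the", "dinner", "and", "sunday"],
    "trip": ["the", "flight", "and", "hotel"],
}
tokens_by_chat = top_tokens(chats)
print(tokens_by_chat["cycling"])  # ['ride', 'climb']
```

With five chats, a token that shows up in two or more of them exceeds the 20% threshold and is dropped, which is how “the” and “and” disappear without needing a hand-written stop-word list.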
In a similar vein, to ensure good quality for Blast From The Past™, I require that a message be at least 30 characters (otherwise you just get a lot of “lol,” “haha,” and “yeah”).
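The length filter is simple enough to sketch directly. The names `MIN_LENGTH` and `blast_from_the_past`, and the use of plain strings for messages, are assumptions for this example rather than the project’s actual code.

```python
import random

MIN_LENGTH = 30  # the threshold described above


def blast_from_the_past(messages, rng=random):
    """Pick a random message that is at least MIN_LENGTH characters long.

    Returns None when no message is long enough.
    """
    candidates = [m for m in messages if len(m) >= MIN_LENGTH]
    return rng.choice(candidates) if candidates else None


history = [
    "lol",
    "haha",
    "yeah",
    "I just faceplanted riding my board down the hill",
]
print(blast_from_the_past(history))
```

Passing in `rng` (for example `random.Random(42)`) keeps the pick reproducible when testing.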
Wow, you made it this far! Converscope has been a long time in the making and I’m happy to finally put it out there. I encourage you to check out the interactive preview (flip over to “Group Chats” so you don’t see XXX everywhere) and if you’re handy with the command line, to run it on your own data.