Rethinking Twitter’s “who to follow” (using Node.js and d3.js)
I did some work with the Twitter API and my approach gave me some pretty nice suggestions 😄
I often look at Twitter’s “who to follow” area and don’t see people that I really want to follow. I don’t know which approach Twitter is using (and I don’t want to say how anyone should do their job), but a long time ago I created followInsights as a way to find new GitHub profiles to be inspired by, and I’ve applied the same logic here.
Let’s say that we have a “root user” (because this kind of reminds me of a tree). If this “root user” follows some people, we can surmise that they trust this network, right? Let’s call the users followed by the “root user” “first level” users. Now, if this “first level” is relevant to the “root user”, the people that they follow (let’s call them “second level”) are probably relevant too. So what we have to do is get all the users from the second level and count how many times each user appears there, because if lots of people on the first level follow a given user, that indicates some relevance (funny jokes, cat GIFs, nice ideas).
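Here is a minimal sketch of that counting step. It is not the code from the repository: the function name `countSecondLevel` and the shape of `followingByUser` (each first-level username mapped to the usernames they follow) are assumptions made for the example, and the data is made up.

```javascript
// Hypothetical input: each first-level user mapped to the users they follow.
const followingByUser = {
  alice: ['carol', 'dave'],
  bob: ['carol', 'erin'],
  frank: ['carol', 'dave'],
};

// Count how many first-level users follow each second-level user.
function countSecondLevel(followingByUser) {
  const counts = new Map();
  for (const following of Object.values(followingByUser)) {
    // A Set guards against the same username being listed twice for one user.
    for (const username of new Set(following)) {
      counts.set(username, (counts.get(username) || 0) + 1);
    }
  }
  // Most-followed users first.
  return [...counts.entries()].sort((a, b) => b[1] - a[1]);
}

console.log(countSecondLevel(followingByUser));
// [ [ 'carol', 3 ], [ 'dave', 2 ], [ 'erin', 1 ] ]
```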
The problem: Twitter’s API is rate limited (which makes sense, they don’t want us to download everything from there). So we need to do this in batches and it’ll take some time.
Once we get the users on the first level, we need to repeat the process for all of them to find out who they are following. This took more than one week with a process running in a terminal (at least it does not consume much memory 😬). Also, I established a limit (configurable in the repository) to only consider profiles that are following fewer than 5000 users, because brands/companies usually follow lots of people and those are not necessarily accounts that they are interested in.
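The batching idea could look roughly like the sketch below. It is deliberately independent of any real API client: `fetchPage` and `fetchPageFor` are hypothetical functions you would implement on top of your API wrapper, the page shape `{ ids, nextCursor }` and the cursor convention (start at `'-1'`, stop at `'0'`) are assumptions, and `delayMs` is a placeholder that should be chosen from the rate limit of the endpoint you actually call.

```javascript
// Resolve after `ms` milliseconds.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Collect every page of a cursor-based listing, pausing between requests.
// `fetchPage(cursor)` is hypothetical and must resolve to
// { ids: [...], nextCursor: string }, where nextCursor '0' means "no more pages".
async function fetchAllPages(fetchPage, delayMs) {
  const ids = [];
  let cursor = '-1'; // assumed marker for the first page
  while (cursor !== '0') {
    const page = await fetchPage(cursor);
    ids.push(...page.ids);
    cursor = page.nextCursor;
    if (cursor !== '0') await sleep(delayMs);
  }
  return ids;
}

// Crawl who each first-level user follows, skipping profiles that follow too many.
// `users` is a list of { username, followingCount } (assumed shape);
// `fetchPageFor(username)` is a hypothetical factory returning a `fetchPage`.
async function crawlFollowing(users, fetchPageFor, options = {}) {
  const { maxFollowing = 5000, delayMs = 60000 } = options;
  const followingByUser = {};
  for (const user of users) {
    if (user.followingCount >= maxFollowing) continue; // probably a brand/company
    followingByUser[user.username] = await fetchAllPages(fetchPageFor(user.username), delayMs);
    await sleep(delayMs); // wait before moving on to the next user
  }
  return followingByUser;
}
```

Passing the fetching function in as a parameter keeps the waiting logic testable without hitting the network, and makes it easy to save partial results between runs.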
👋 Followers
Using myself as an example, here are the first eleven results, counting how many times a user appears:
- paul_irish, 277
- rafael_sps (it’s me 👋), 249
- jeresig, 247
- addyosmani, 247
- github, 228
- BrendanEich, 213
- elonmusk, 187
- nodejs, 183
- rauchg, 181
- mathias, 177
- BarackObama, 175
Obviously if I follow many people that follow me back that will make me relevant to my own network (even more than Obama? 🤔😄).
So let’s remove people that I’m already following from these results:
- elonmusk, 157
- BarackObama, 156
- chriscoyier, 126
- ChromiumDev, 124
- brianleroux, 114
- sarah_edo, 112
- slightlylate, 110
- SaraSoueidan, 108
- sindresorhus, 103
- rmurphey, 102
And here is a list with some suggestions. Yay!
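The filtering step is small enough to sketch in a few lines. The function name `suggest` and the input shapes are assumptions for illustration, and the data below is made up.

```javascript
// `ranked` is a list of [username, count] pairs, most frequent first.
// `alreadyFollowing` holds the first-level usernames; `rootUser` is the
// profile being analysed (it is removed too, since it also shows up in the counts).
function suggest(ranked, alreadyFollowing, rootUser, limit = 10) {
  const skip = new Set([...alreadyFollowing, rootUser]);
  return ranked.filter(([username]) => !skip.has(username)).slice(0, limit);
}

// Made-up data for illustration only.
const ranked = [['carol', 3], ['root', 3], ['dave', 2], ['erin', 1]];
console.log(suggest(ranked, ['carol'], 'root'));
// [ [ 'dave', 2 ], [ 'erin', 1 ] ]
```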
But WHAT IF we compute a score that weights each follower by how many times that follower appears? Let's say that a user is followed by jeresig (who appears 247 times) and paul_irish (277), so this user will have a score of 247 + 277 = 524. The list becomes (username, score):
- paul_irish, 14306
- jeresig, 13284
- addyosmani, 12941
- rauchg, 10971
- mathias, 10968
- BrendanEich, 10957
- jaffathecake, 10474
- brianleroux, 9707
- tomdale, 9481
- stubbornella, 9388
And removing the people I’m already following:
- brianleroux, 9707
- slightlylate, 8956
- dalmaer, 8613
- chriscoyier, 8586
- sarah_edo, 8399
- rmurphey, 8008
- ChromiumDev, 7777
- cramforce, 7752
- reybango, 7706
- sindresorhus, 7469
And we have some new faces here! ✨
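The score described above can be sketched like this: every first-level user "votes" for the people they follow, and each vote is worth that voter's own appearance count. The function name `scoreUsers` and the input shapes are assumptions for the example, not the repository's code.

```javascript
// followingByUser: first-level username -> usernames they follow (assumed shape).
// counts: Map of username -> how many first-level users follow them.
function scoreUsers(followingByUser, counts) {
  const scores = new Map();
  for (const [follower, following] of Object.entries(followingByUser)) {
    // A follower's vote is worth their own count (0 if they never appear).
    const weight = counts.get(follower) || 0;
    for (const username of new Set(following)) {
      scores.set(username, (scores.get(username) || 0) + weight);
    }
  }
  // Highest score first.
  return [...scores.entries()].sort((a, b) => b[1] - a[1]);
}

// The example from the text: a user followed by jeresig (247) and paul_irish (277).
const counts = new Map([['jeresig', 247], ['paul_irish', 277]]);
const followingByUser = { jeresig: ['someuser'], paul_irish: ['someuser'] };
console.log(scoreUsers(followingByUser, counts));
// [ [ 'someuser', 524 ] ]
```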
If we parse the data to fit the amazing d3.js Hierarchical Edge Bundling graph, this is the result:
The cool thing is that you can hover over a username to see how all the users are interconnected:
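To feed a graph like this, the follow relationships have to be reshaped into nodes that list their outgoing links. The sketch below assumes the `{ name, imports }` node shape used by the data in the classic d3.js hierarchical edge bundling example; the function name `toEdgeBundlingNodes` is hypothetical, and the exact shape may need adjusting for the version of the example you start from.

```javascript
// followingByUser: username -> usernames they follow (assumed shape).
// keep: the usernames that should be drawn in the graph.
function toEdgeBundlingNodes(followingByUser, keep) {
  const kept = new Set(keep);
  return keep.map((username) => ({
    name: username,
    // Only keep links that point to other users present in the graph.
    imports: (followingByUser[username] || []).filter(
      (target) => kept.has(target) && target !== username
    ),
  }));
}

// Made-up data for illustration only.
const nodes = toEdgeBundlingNodes({ a: ['b', 'x'], b: ['a'] }, ['a', 'b', 'c']);
console.log(JSON.stringify(nodes));
// [{"name":"a","imports":["b"]},{"name":"b","imports":["a"]},{"name":"c","imports":[]}]
```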
🌎 Locations
Since we have the users’ location (or something like it), we could group them to see if I’m missing someone who lives in the same city.
The problem with this approach is that the location field is free text and does not follow a pattern. For example, here are some of the ways San Francisco is written:
San Francisco, CA
San Francisco
San Francisco, California
San Francisco Bay Area
San Francisco, California USA
San Francisco Bay Area, California, USA
San Francisco, CA, US
San Francisco, The Internet (🤔 cool state btw)
Vancouver || San Francisco
Maybe by parsing the names I could remove the state and try to group them (wow such regular expression), then order by number of users and/or score.
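A rough sketch of that parsing idea: keep only what comes before the first comma and drop a trailing "Bay Area". This is a heuristic written for the examples above, not a general geocoder, and `normalizeLocation` is a hypothetical name.

```javascript
// Reduce a free-text location to a rough city key (heuristic only).
function normalizeLocation(location) {
  return location
    .split(',')[0]                // "San Francisco, CA" -> "San Francisco"
    .replace(/\s+bay area$/i, '') // "San Francisco Bay Area" -> "San Francisco"
    .trim();
}

const examples = [
  'San Francisco, CA',
  'San Francisco',
  'San Francisco, California',
  'San Francisco Bay Area',
  'San Francisco, California USA',
  'San Francisco Bay Area, California, USA',
  'San Francisco, CA, US',
  'San Francisco, The Internet',
  'Vancouver || San Francisco',
];

// Distinct keys after normalizing.
console.log([...new Set(examples.map(normalizeLocation))]);
// [ 'San Francisco', 'Vancouver || San Francisco' ]
```

Entries that list more than one city are left untouched here; deciding which city they belong to is a separate problem.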
Here is the accumulated score (all users’ scores summed) and the number of users that I’m not following for each location:
(Location, score, total users in the location)
- Bend, 10088, 2
- Vancouver || San Francisco, 9707, 1
- Glasgow, 7798, 2
- South Florida, 7706, 1
- Bangkok, 7469, 1
Nice to see really relevant people who are not only in Silicon Valley (duh, mr obvious!).
And when sorting locations by number of users:
- “”, 1419, 224 (many people leave the location blank)
- San Francisco, 1377, 116
- São Paulo, 610, 58
- New York, 2784, 40
- London, 808, 36
So I’m mostly following people from big cities and that’s a shame because this is kind of a bubble. I should try to read more from folks that are from different places, not only white men working in California.
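Both location tables above come down to one grouping step followed by a different sort. Here is a minimal sketch; the `{ username, location, score }` shape and the function name `groupByLocation` are assumptions, and the numbers are made up.

```javascript
// users: profiles that the root user is not following yet (assumed shape).
function groupByLocation(users) {
  const groups = new Map();
  for (const { location, score } of users) {
    const key = (location || '').trim(); // a missing location counts as blank
    const group = groups.get(key) || { location: key, score: 0, users: 0 };
    group.score += score; // accumulated score for the location
    group.users += 1;     // how many users share the location
    groups.set(key, group);
  }
  return [...groups.values()];
}

const byScore = (a, b) => b.score - a.score;
const byUsers = (a, b) => b.users - a.users;

// Made-up data for illustration only.
const groups = groupByLocation([
  { username: 'a', location: 'Glasgow', score: 10 },
  { username: 'b', location: ' Glasgow ', score: 5 },
  { username: 'c', location: '', score: 1 },
  { username: 'd', location: null, score: 2 },
  { username: 'e', location: 'Bend', score: 20 },
]);

console.log([...groups].sort(byScore).map((g) => g.location));
// [ 'Bend', 'Glasgow', '' ]
console.log([...groups].sort(byUsers)[0]);
// { location: 'Glasgow', score: 15, users: 2 }
```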
📝 Descriptions/Bios
We can also group users by bio/description. My network’s top five words are (word, frequency):
- web, 46
- developer, 25
- javascript, 21
- google, 14
- creator, 14
So, the users that have Web in their description are:
- slightlylate, 124
- reybango, 105
- lukew, 100
- cramforce, 96
- timberners_lee, 96 (I think this guy knows a thing or two about the “web”)
And for the word “developer”:
- ChromiumDev, 140
- sarah_edo, 126
- SaraSoueidan, 120
- MylesBorins, 100
- ThePracticalDev, 90
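The word lists above can be approximated with a small frequency count over the bios. In this sketch each word is counted once per bio, and the stop-word list is a guess; the article does not say how words were counted or which ones were ignored. The `{ username, bio, count }` shape and the function names are also assumptions, and the data is made up.

```javascript
// Hypothetical stop-word list.
const STOP_WORDS = new Set(['a', 'and', 'at', 'for', 'in', 'of', 'the', 'to']);

// Lowercased, de-duplicated words of a bio, without stop words.
function bioWords(bio) {
  const words = (bio || '').toLowerCase().match(/\p{L}+/gu) || [];
  return new Set(words.filter((word) => !STOP_WORDS.has(word)));
}

// How many bios mention each word, most common first.
function wordFrequency(users) {
  const counts = new Map();
  for (const { bio } of users) {
    for (const word of bioWords(bio)) {
      counts.set(word, (counts.get(word) || 0) + 1);
    }
  }
  return [...counts.entries()].sort((a, b) => b[1] - a[1]);
}

// Users whose bio contains `word`, ordered by how often they appear.
function usersWithWord(users, word) {
  return users
    .filter(({ bio }) => bioWords(bio).has(word.toLowerCase()))
    .sort((a, b) => b.count - a.count)
    .map(({ username, count }) => [username, count]);
}

// Made-up data for illustration only.
const users = [
  { username: 'a', bio: 'Web developer at Example', count: 5 },
  { username: 'b', bio: 'JavaScript and web', count: 9 },
  { username: 'c', bio: 'Creator of things', count: 2 },
];

console.log(wordFrequency(users)[0]); // [ 'web', 2 ]
console.log(usersWithWord(users, 'Web')); // [ [ 'b', 9 ], [ 'a', 5 ] ]
```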
This is a good method to find relevant people in your network, but I believe that we should try to have a more diverse network to read about different subjects. Maybe I should run the same script on a different profile, someone that I trust as a voice for underrepresented people, and take a look at that network (not only mine). That would give even better insights.
I’m using Twit as the API wrapper and d3.js to do the graph. The code is available here.
There are a lot of improvements to make, but I hope that this can be useful for some people! Liked this? Be sure to give it a clap. 👏