A better way to automate custom Twitter lists
Twitter is a very good way to communicate with people you like and for people you don’t like to communicate with you. There are a few ways to deal with this problem. The simplest is to set your profile to “private,” which gives only followers you have approved access to your tweets. Many find this undesirable; the point of Twitter is to be heard by many, not just a select few. It’s not too hard to simply block individuals you dislike, and many celebrities gladly block commenters without a second thought. For some, even the process of blocking individuals, which takes all of a few seconds, isn’t enough. They turn to autoblockers.
I’m not going to go into the ethics or philosophy of using an autoblocking tool, or the technical aspects of the blocking tool itself. It’s largely a matter of putting together a for-loop script in your favorite programming language and feeding it a list. It’s the generation of this list that I’m interested in. And for the record, there are many very legitimate reasons to generate such a list that have nothing to do with autoblocking other Twitter users. You may be interested in history or local DJs and use the methods I’m about to describe to find people you may want to follow or contact.
Before we really get into the matter, I’m just going to state that the point of this is to show how terrible the “Good Game Auto-Blocker” is (on a technical level) and how easy it is to build something that actually makes sense. I have experience with Python (and networkx, a Python module) and with Gephi, a network analysis tool, but nothing I’m doing takes more than a bit of dedicated effort to learn. These tools are free and the amount of actual coding needed is minimal. If I can do this for fun in an afternoon, then somebody serious about creating meaningful lists can put in the effort to do it correctly. For those not in the know, the GG autoblocker simply compared the follower lists of two particular users and added any account that appeared on both lists to a blacklist. Getting a list of somebody’s followers is trivial (see below), and the amount of analysis done here amounts to running a single command on a Linux computer against two sorted lists: comm -12 list1.txt list2.txt. Seriously, that’s it. And that’s all the time I care to spend on that particular issue.
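For readers who would rather see that comparison in Python than in a shell, here is a minimal sketch of the same idea: keep only the accounts that appear in both follower lists. The function name and the sample account names are made up for illustration; this is not the autoblocker’s actual code.

```python
def shared_followers(followers_a, followers_b):
    """Return the accounts that appear in both follower collections."""
    # Set intersection does the whole job; sorting just makes output stable.
    return sorted(set(followers_a) & set(followers_b))


# Tiny made-up lists standing in for two real follower lists.
list1 = ["alice", "bob", "carol"]
list2 = ["bob", "carol", "dave"]
print(shared_followers(list1, list2))  # ['bob', 'carol']
```

That really is the entire analysis: no weighting, no look at who follows whom inside the group, just one intersection.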
For the tl;dr crowd here’s the rundown of how to properly create a list of Twitter users that actually share something in common. One, generate a list of followers using the Twitter API. There’s a wonderful client called “t” written in Ruby. (Find it here: https://github.com/sferik/t; you don’t need to know any Ruby.) Two, use Python (or some other language) to parse the results into a format useful for analysis. I output the results in the *.graphml format since that’s what Gephi likes. Three, open up your graph in Gephi and run a few statistical analyses. If you fancy Python and can use the “networkx” package, you can do it all in Python instead of Gephi. Granted, at this point you need to know a bit about graph theory, but Gephi makes it easy to learn as you go along. My goal was to find who is “important” among a certain group of like-minded individuals.
For the non-tl;dr crowd I’ll explain in a bit more detail, though you can skip to the end if you’re not interested in the finer technical points. The first task was to define my problem and my approach. I decided to start with four sample individuals whom I suspected would be useful for finding community leaders. To get their “friends” I used the Ruby client “t” mentioned above, which can provide output in CSV format. Note that I’m using “friends” here to mean mutual following between two users, which is how the t client uses the word. (The Twitter API itself uses “friends” for every account a user follows, mutual or not.) This neatly constrains the problem of determining whether two users share the same interests. Twitter allows users to follow anybody, and celebrity accounts may have thousands or even millions of followers who have nothing in common. The assumption that people who follow each other have the same interests needs to be revisited for more rigorous applications, but for my purposes it works well.
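The parsing step is small. Below is a rough sketch of turning one account’s friends CSV into (seed, friend) pairs that can later become graph edges. It assumes the CSV has a header row with a “Screen name” column; that column name is my assumption, so check the header of your own export and adjust it. The function name and the sample data are hypothetical.

```python
import csv
import io

# ASSUMPTION: the CSV header contains a column with this exact name.
SCREEN_NAME_COLUMN = "Screen name"


def friend_edges(seed, csv_text, column=SCREEN_NAME_COLUMN):
    """Turn one seed account's friends CSV into (seed, friend) pairs."""
    reader = csv.DictReader(io.StringIO(csv_text))
    edges = []
    for row in reader:
        name = (row.get(column) or "").strip()
        if name:  # skip rows with a missing or empty screen name
            # Screen names are case-insensitive, so normalize to lowercase.
            edges.append((seed.lower(), name.lower()))
    return edges


# Made-up sample in place of a real export.
sample = "ID,Screen name,Name\n1,Alice,Alice A.\n2,Bob,Bob B.\n"
print(friend_edges("seed_user", sample))
# [('seed_user', 'alice'), ('seed_user', 'bob')]
```

Because the friendships are mutual, each pair can be treated as an undirected edge.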
With the list of friends for the four users and a little bit of Python, I shortly had a graphical representation of the small network. The users were chosen based only on their particular interest, so there were only a few nodes (accounts) that crossed between the four clusters. These were the ones of interest to me because they appealed to at least two semi-randomly chosen accounts. In graph theory, such nodes have a higher “betweenness centrality” value. Basically, it’s a measure of how often a node sits on the shortest path between two other nodes, so a high value means it connects lots of otherwise separate parts of the graph. (Graph theory is fantastically interesting; read up more about it.)
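To make that concrete, here is a small sketch using networkx: build an undirected graph from each seed account’s friends and rank nodes by betweenness centrality. The account names are made up and the helper names are mine; nx.Graph and nx.betweenness_centrality are the real networkx calls.

```python
import networkx as nx


def build_graph(friends_by_seed):
    """Build an undirected graph with one edge per (seed, friend) pair."""
    graph = nx.Graph()
    for seed, friends in friends_by_seed.items():
        for friend in friends:
            graph.add_edge(seed, friend)
    return graph


def rank_by_betweenness(graph, count=10):
    """Return the `count` nodes with the highest betweenness centrality."""
    scores = nx.betweenness_centrality(graph)
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:count]


# Made-up data: two seed accounts that share a single friend.
friends_by_seed = {
    "seed_a": ["u1", "u2", "bridge"],
    "seed_b": ["u3", "u4", "bridge"],
}
graph = build_graph(friends_by_seed)
for name, score in rank_by_betweenness(graph, count=3):
    print(f"{name}: {score:.2f}")
```

In this toy graph the shared friend scores far above the accounts that only one seed knows. To hand the graph to Gephi, nx.write_graphml(graph, "friends.graphml") writes it in the GraphML format.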
So then it was time to iterate. I took the users with the highest betweenness centrality, got their friends, and compared again. A few nodes again were notably more central than others, and these were the ones I wanted to learn more about, so I got lists of their friends too. After the first iteration the analysis becomes biased: fetching the friends of particular users adds edges around those users, which automatically gives them a higher centrality rating. I could have written some smart code to deal with this, but I found it easier to manually inspect the graph for new nodes of interest. The goal here is proof of concept more than anything else.
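I did this step by hand, but a crude automated version is easy to sketch. The hypothetical helper below picks the highest-scoring accounts whose friends have not been fetched yet. Note what it does not do: it only avoids re-selecting accounts that were already expanded, and it makes no attempt to correct the centrality scores for the bias described above.

```python
def next_seeds(scores, already_expanded, count=3):
    """Pick the highest-scoring accounts whose friends were not fetched yet."""
    candidates = [
        (account, score)
        for account, score in scores.items()
        if account not in already_expanded
    ]
    # Highest centrality first.
    candidates.sort(key=lambda item: item[1], reverse=True)
    return [account for account, _ in candidates[:count]]


# Made-up centrality scores; "a" was a seed in an earlier round.
scores = {"a": 0.9, "b": 0.5, "c": 0.7, "d": 0.1}
print(next_seeds(scores, already_expanded={"a"}, count=2))  # ['c', 'b']
```

The scores dictionary could come from any centrality measure; the selection logic does not care which one.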
After a few iterations the graph largely stabilized. I could always add more data by getting the friends of more high-centrality nodes, but a clear community (with leadership) emerged quickly. The pleasant result is that the graph clearly showed certain accounts as important even though I had never heard of them before. When I checked their profiles and timelines, it turned out they did indeed share the interests I was targeting. So I had a success. Using only network analysis I had found some new accounts that had a high degree of importance as far as the subject at hand was concerned.
It was time for some more in-depth analysis. I’ve glossed over what I mean by importance here, but luckily it’s very well defined mathematically. Gephi has (among others) three metrics of importance: PageRank, very similar to Google’s ranking system; betweenness centrality, which was mentioned before without a rigorous definition; and authority (from the HITS algorithm), which is similar to PageRank but has some interesting technical differences. This isn’t the place to explain these metrics, so the interested reader is encouraged to look into them on their own. For my purposes, having three different kinds of ranks would be useful to really figure out who the “important” accounts were. One analysis may assign a very high score to a node whereas another will give it a much lower one. Accounts with high scores in all three metrics are good candidates for being considered leaders. As mentioned before, my goal here is a proof of concept. The best and/or correct metric to use depends on the final goal. My needs here are simply anything that’s “good enough.”
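The “high in all three” idea can be sketched in a few lines of plain Python. The helper below takes one dictionary of scores per metric and returns the accounts that land in the top N of every one of them. The function names, the cutoff, and the numbers are all made up for illustration. In networkx, score dictionaries like these could come from nx.pagerank, nx.betweenness_centrality, and the authority half of nx.hits; Gephi exposes the same values as columns you can export.

```python
def top_names(scores, count):
    """Return the `count` highest-scoring account names as a set."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:count])


def leaders(metrics, count=10):
    """Accounts that rank in the top `count` for every metric given."""
    tops = [top_names(scores, count) for scores in metrics.values()]
    return sorted(set.intersection(*tops)) if tops else []


# Made-up scores for three accounts under three metrics.
metrics = {
    "pagerank": {"a": 0.5, "b": 0.3, "c": 0.2},
    "betweenness": {"a": 0.6, "b": 0.1, "c": 0.3},
    "authority": {"a": 0.4, "b": 0.25, "c": 0.35},
}
print(leaders(metrics, count=2))  # ['a']
```

Requiring agreement across all metrics is a strict rule; averaging ranks would be a gentler alternative, and which one is right depends on the end goal.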
So with these three metrics in place and the graph analyzed, it’s time to actually look at it and see what it really means.


The full graph has around 1200 nodes, of which around 200 are shown here. Nodes with a degree (number of connections) of less than 10 have been excluded to show only the most relevant data. The color and size represent authority and PageRank, with larger and deeper red indicating more importance. Not shown are the account name labels, whose size indicates betweenness centrality.
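Gephi has a filter for this, but the same degree cutoff can be sketched with networkx. The helper name is mine and the threshold of 10 simply mirrors the figure above; the demo uses a tiny built-in star graph rather than real data.

```python
import networkx as nx


def filter_by_degree(graph, min_degree=10):
    """Keep only nodes whose degree in the full graph is at least `min_degree`."""
    keep = [node for node, degree in graph.degree() if degree >= min_degree]
    # Degrees are measured before filtering; nodes in the result may have
    # fewer connections among themselves than the threshold.
    return graph.subgraph(keep).copy()


# Star graph: node 0 is connected to nodes 1, 2 and 3, which have degree 1.
star = nx.star_graph(3)
print(sorted(filter_by_degree(star, min_degree=2).nodes))  # [0]
```

Raising or lowering min_degree is the simplest way to widen or narrow the community being shown.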
The largest, deepest red node in the middle is a very well known person. Other large, deep-red dots are the accounts I used to generate the lists of friends. Most were accounts known to me as interested parties. The more intriguing results are the medium-sized nodes, most of which are also lighter in color. Of these 40 to 50 moderately important nodes, only about half were known to me. The other half (including some of the larger, redder nodes) were unknown to me before this analysis. Note that even the smaller dots are ones that have a high degree and are reasonably likely to share the same interests. All the parameters here can be scaled, and the desired breadth of the community can be chosen at will.
So the result is that I have a list of about 200 of the most important figures in a given community for a certain topic. Depending on my end purposes, I could select only the most significant group members or anybody with a certain degree or higher. The resulting list is based not on a simple coincidence but on a somewhat more advanced analysis of the connections between accounts. With only a little bit of work we can create a considerably more accurate list of community members than one gets from a simple follower list comparison.
The technical part is over. If you don’t care about my personal interpretation of the above graph, feel free to stop reading now. If you do, here goes:
The GG autoblocker was written to grab as many accounts as possible. As we’ve seen, that’s led to a horrifically pointless tool that can’t distinguish harassment from fried chicken. If I were for some reason inclined to make a list of community members, I’d at least bother to make it somewhat accurate. And even the process described here has severe limitations. All assumptions about community interests are still “by association.” Just because two people follow each other on Twitter hardly means they share the same opinions or values. That being said, social communities do tend to share similar values, and people tend to follow those who espouse the values they champion the most. This sort of network analysis is a tool that can provide a starting point for more detailed analysis. If I really were trying to build a list of community members, I’d start with such a list and work my way down from the top, manually checking each account to see if there’s a clear interest in the topic or not. Or perhaps cross-reference certain accounts with certain words. There are many options. There’s so much somebody can do with minimal effort to learn about membership in a community. Sloppy code that’s poorly executed and a problem that’s poorly defined and poorly approached all point to the effort behind it being political in nature, not technical.
My analysis here is largely inspired by previous work done by Chris von Csefalvay, who had no input of any kind into the content of this post. You can find his analysis of #GamerGate on his site at http://chrisvoncsefalvay.com/2014/12/07/Gamergate.html. It’s much more in depth and useful. My point here was to contrast the methods used and the level of effort required to generate community membership lists.