For this blog I am going to show how you can scrape Twitter for certain keywords using DMI-TCAT:
“The Digital Methods Initiative Twitter Capture and Analysis Toolset (DMI-TCAT) is a set of tools to retrieve and collect tweets from Twitter and to analyze them in various ways. It is written mostly in PHP and runs in a webserver (LAMP) environment. On a Ubuntu or Debian machine, installation can be done with one command.”
The wiki has a great walkthrough on how to install and set up DMI-TCAT. I prefer to run it on a cloud instance such as Amazon AWS or DigitalOcean. The whole setup takes a matter of minutes as long as you follow the automated installation steps.
Once that's done you can start scraping, depending on the type of DMI-TCAT capture you have chosen. There are three options:
- Capture phrases/keywords (monitor a certain quote or hashtag for instance)
- Follow user(s) (follow one or more Twitter users and capture all their tweets)
- 1% sample (capture 1% of all Twitter's live streaming traffic)
The most common option to choose would be either "Capture phrases/keywords" or "Follow users". In this blog I am using a DMI-TCAT installation based on the Capture phrases/keywords option. I've uploaded the dataset I used as well as the end result for download in this blog.
Collecting the data
First log in to your DMI-TCAT cloud environment by pointing your browser to: https://<IP address of your DMI-TCAT cloud server>/admin
Here you see my admin panel. Currently I've got three scraping bins running: one scraping the word "osint", one scraping the words "School AND Shooting" and one scraping the word "socmint". For this blog I am going to use the tweets scraped for "osint". I started the scraping on 2018-06-16 at 12:00, and from that point on the bin listens to the Twitter streaming API and collects each tweet containing the word "osint".
Visualizing the scraped data
In the upper right corner of the admin panel you can click >>analysis. This takes you to the analysis menu. Here you see my analysis menu:
As you can see there are numerous options to analyse your scraped Twitter data. These options are divided into an overview of the currently selected dataset and four analytic export subgroups:
- Tweet statistics and activity metrics (All statistics and activity metrics come as a .csv)
- Tweet exports (All tweet exports produce a .csv or .tsv)
- Networks (All network exports come as .gexf or .gdf files which you can open in Gephi)
- Experimental (Well... it's experimental)
As you can see the possibilities for analysis are wide-ranging and there are lots of options to choose from. So depending on your well-chosen research question you can pick the best fitting option. This is where the "state of mind" comes in.
Visualizing the “osint” keyword
For this blog I have chosen the dataset containing the keyword "osint". The date range is from 2018-06-16 12:00:00 till 2018-06-22 12:00:00. This dataset contains 1,917 tweets and 847 distinct users, which you can already see in the "overview of your selection".
Already you can see lots of information to gain analytic insight from. There are 847 unique Twitter users who have used the keyword "osint" and these 847 unique users are responsible for 1,922 tweets. 55.6% of the 1,922 tweets contain a link. You can also see the peaks in time, and so on.
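If you want to double check overview numbers like these yourself, here is a minimal sketch of how they could be recomputed from exported tweets. The column names from_user_name and text are an assumption about the export layout, and the four sample rows are made up for illustration; they are not part of my dataset.

```python
import re

# Hypothetical mini-sample; in practice the rows would come from a tweet export file.
# The keys "from_user_name" and "text" are assumed column names.
tweets = [
    {"from_user_name": "alice", "text": "New #osint guide https://example.com/guide"},
    {"from_user_name": "bob", "text": "Learning osint basics"},
    {"from_user_name": "Alice", "text": "osint tools list http://example.org"},
    {"from_user_name": "carol", "text": "RT @alice: New #osint guide https://example.com/guide"},
]

URL_PATTERN = re.compile(r"https?://\S+")


def overview(rows):
    """Return the tweet count, distinct user count and percentage of tweets with a link."""
    total = len(rows)
    # Usernames are compared case-insensitively.
    users = {row["from_user_name"].lower() for row in rows}
    with_link = sum(1 for row in rows if URL_PATTERN.search(row["text"]))
    pct = round(100 * with_link / total, 1) if total else 0.0
    return {"tweets": total, "distinct_users": len(users), "pct_with_link": pct}


stats = overview(tweets)
print(stats)
```

With a real export you would load the rows with csv.DictReader instead of typing them in by hand.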
So far so good. To keep this blog from getting too long I want to look specifically at how to visualize who mentions whom, based on this dataset of 1,922 tweets. To do that, select the "graph by mentions" option under the Networks section and click on launch.
After clicking you will get a pop-up menu asking:
Specify number of top users you want to get. (by number of mentions, enter 0 to get all)
By default it is set to 500. This keeps things fast when you have large datasets containing millions of tweets. But since this dataset only contains 1,922 tweets I chose to get all. Once that's done you are presented with a file which you can download.
As you can see the file is automatically named after the bin name and the date range of your selection, and the name states that it is a mention network Gephi file.
After downloading you can open the file in Gephi for further analysis and visualization.
If you want to play around with the Gephi file yourself you can download it here
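To give an idea of what a mention network consists of, here is a rough sketch of how directed mention edges could be built from tweet text and then limited to the most mentioned users. This is my own illustration and not how DMI-TCAT does it internally: the regex, the column names and the way the top users filter is applied (keep edges pointing at the top N most mentioned users, 0 keeps everything) are all assumptions.

```python
import re
from collections import Counter

MENTION_PATTERN = re.compile(r"@(\w+)")

# Hypothetical sample rows with assumed column names.
tweets = [
    {"from_user_name": "alice", "text": "@bob have you seen this osint thread by @carol?"},
    {"from_user_name": "bob", "text": "Thanks @alice, great osint tip"},
    {"from_user_name": "dave", "text": "@bob @carol osint meetup next week"},
    {"from_user_name": "carol", "text": "@bob agreed"},
]


def mention_edges(rows):
    """Count directed (sender, mentioned user) pairs across all tweets."""
    edges = Counter()
    for row in rows:
        sender = row["from_user_name"].lower()
        for mentioned in MENTION_PATTERN.findall(row["text"]):
            edges[(sender, mentioned.lower())] += 1
    return edges


def keep_top_users(edges, top_n):
    """Keep edges that point at the top_n most mentioned users; 0 keeps all edges."""
    if top_n == 0:
        return dict(edges)
    received = Counter()
    for (_, target), weight in edges.items():
        received[target] += weight
    top = {user for user, _ in received.most_common(top_n)}
    return {pair: weight for pair, weight in edges.items() if pair[1] in top}


all_edges = mention_edges(tweets)
top_one = keep_top_users(all_edges, 1)
print(len(all_edges), sorted(top_one))
```

Each key is a directed edge from the user who tweeted to the user who got mentioned, and the value is the number of mentions, which is the kind of information a mention network file carries.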
Visualizing the file in GEPHI
After you start Gephi, choose your file. After choosing it you will see something similar to this:
Since this is a mention network file the graph is directed, so you do not have to change anything here. Simply click open. After you've opened it you will see something like this:
Now let's clean up the data and make it look a bit prettier so we can start making some sense out of it. In the lower left corner you can see the section "layout". The drop-down menu gives you several options to visually lay out the data. For this mention dataset I suggest you use the option "Force Atlas 2". Force Atlas 2 is a force-directed layout designed for network visualisation, and it works well for a mention network like this one.
Once you've chosen Force Atlas 2 from the layout menu we need to fine-tune it a bit more before we actually hit the run button.
There is no right or wrong in these fine-tuning settings. How Force Atlas 2 maps out the network always depends on the size of the dataset and on the directions and hubs in it. When you use Gephi a lot you will get "the feel" for fine-tuning your data. For this dataset I'd suggest you stick to the settings above. After that, hit the run button and let it run for about 15 seconds before hitting stop.
You can see that the data is now starting to make a little more sense. Some small clusters are already appearing.
Now let's make the nodes in the network more visual by giving them proportional sizes. In the upper left corner you see the section "appearance". Click on "Nodes" and then click on "Ranking". You will get a pull-down menu with some options, which I will describe:
- Degree = number of other nodes a node is connected to, without weighting
- In-Degree = number of mentions a node received (top 500 of nodes)
- Out-Degree = number of mentions a node has given
- no_mentions = number of mentions a node received (whole dataset)
- no_tweets = number of tweets a node posted containing the keyword
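To make the difference between these measures a bit more tangible, here is a small sketch that computes them by hand for a made-up directed mention network. The edge list is hypothetical. Degree is taken here as in-degree plus out-degree, and the weighted_in value (the sum of the edge weights pointing at a node) is only meant to be similar in spirit to no_mentions, not a reproduction of how that attribute is calculated.

```python
from collections import defaultdict

# Hypothetical weighted mention edges: (sender, mentioned user) -> number of mentions.
edges = {
    ("alice", "bob"): 2,
    ("carol", "bob"): 1,
    ("bob", "alice"): 1,
    ("dave", "carol"): 3,
}


def degree_table(edges):
    """Return unweighted in/out/total degree plus weighted incoming mentions per node."""
    incoming = defaultdict(set)
    outgoing = defaultdict(set)
    weighted_in = defaultdict(int)
    for (source, target), weight in edges.items():
        outgoing[source].add(target)
        incoming[target].add(source)
        weighted_in[target] += weight
    nodes = set(incoming) | set(outgoing)
    return {
        node: {
            "in_degree": len(incoming[node]),
            "out_degree": len(outgoing[node]),
            "degree": len(incoming[node]) + len(outgoing[node]),
            "weighted_in": weighted_in[node],
        }
        for node in sorted(nodes)
    }


table = degree_table(edges)
for node, row in table.items():
    print(node, row)
```

In this toy network bob has an in-degree of 2 (two different users mention him) but received 3 mentions in total, which is exactly the kind of difference between an unweighted and a weighted measure.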
Since I want to visualize who received the most mentions in this network I'll choose the option no_mentions. I'll set the min node size to 7 and the max node size to 25.
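The idea behind a ranking by size can be sketched as a simple linear rescaling: the lowest value gets the minimum size, the highest value gets the maximum size and everything else lands in between. The function below is my own illustration of that idea with made-up mention counts; it assumes a plain linear mapping and is not Gephi's actual code.

```python
def scale_sizes(values, min_size=7.0, max_size=25.0):
    """Linearly map each node's value onto the range [min_size, max_size]."""
    lowest, highest = min(values.values()), max(values.values())
    span = highest - lowest
    sizes = {}
    for node, value in values.items():
        # If all values are equal there is nothing to rank, so use the minimum size.
        fraction = (value - lowest) / span if span else 0.0
        sizes[node] = min_size + fraction * (max_size - min_size)
    return sizes


# Hypothetical mention counts per user.
mentions = {"alice": 0, "bob": 10, "carol": 5}
sizes = scale_sizes(mentions)
print(sizes)
```

A user with half of the maximum number of mentions ends up with a node size halfway between 7 and 25.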
After changing the node size you can zoom in on the graph and you will see which nodes received the most mentions based on their size. The bigger the node, the more mentions that specific node has received. I suggest you now run Force Atlas 2 again for a few seconds to take the new node sizes into account. You will see something similar to this:
Now we are getting there! We are starting to see small networks and hubs of Twitter users who get mentioned in tweets containing the word "osint". The larger the node, the more mentions. You can also see that hubs are connected to each other. This means that certain users (or groups of users) are mentioning a specific user but are also mentioning other specific users. Now let's zoom in on a hub that receives a lot of mentions and make things even more visual by showing the Twitter usernames on the graph, so that what we are seeing makes even more sense. We can do this by clicking on the bold T and adjusting the size of the font to see who is who.
When you hover over a node you can see the subnetworks of who that node received mentions from and who that node gave mentions to. Now you can start identifying key players in your graph network.
Let's make the graph even better by color coding the nodes, so we can see more differentiation between the lower ranked nodes. We can do this by clicking on "Nodes" in the upper left menu, selecting the color palette icon and clicking on "Ranking". Then choose "no_mentions" from the pull-down menu and click on the color icon. Most of the time I like to choose a default color and invert it, which gives me the best visualization. I also like to change the spline from linear to curved to show the differences between the lower ranked nodes even more clearly.
After doing this you can zoom in on the different hubs and make sense of them. You can start answering questions like:
- Who is the most mentioned in the network
- Who gives the most mentions
- What is the most popular conversation group/hub
And by looking into the additional data exports DMI-TCAT has to offer you can even analyse what the conversations are about or what sentiment a conversation or group has. DMI-TCAT has basically all you need to do an in-depth OSINT analysis of Twitter data.
Lastly I'd like to show how to visualize the betweenness centrality. This is a well-known measure that calculates how often a node appears on the shortest paths between other nodes in the network. Basically it identifies the nodes that act as bridges between the hubs. In the right menu under "Network Overview" you see "Network Diameter". Once you click on the run button you'll see a pop-up. Just click on OK. In the background Gephi has now calculated the betweenness centrality, which we are going to color code in the same way as we color coded the nodes before.
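For those who want to see what is being calculated, here is a compact sketch of betweenness centrality for an unweighted directed graph, using Brandes' algorithm. The toy graph and the function name are made up for illustration, and the scores are left unnormalized. It shows the concept and is not Gephi's own implementation.

```python
from collections import deque


def betweenness(graph):
    """Unnormalized betweenness for a directed graph given as node -> list of successors."""
    nodes = set(graph) | {t for targets in graph.values() for t in targets}
    score = {node: 0.0 for node in nodes}
    for source in nodes:
        # Breadth-first search that counts the shortest paths from source.
        order = []
        predecessors = {node: [] for node in nodes}
        path_count = {node: 0 for node in nodes}
        distance = {node: -1 for node in nodes}
        path_count[source] = 1
        distance[source] = 0
        queue = deque([source])
        while queue:
            current = queue.popleft()
            order.append(current)
            for neighbour in graph.get(current, []):
                if distance[neighbour] < 0:
                    distance[neighbour] = distance[current] + 1
                    queue.append(neighbour)
                if distance[neighbour] == distance[current] + 1:
                    path_count[neighbour] += path_count[current]
                    predecessors[neighbour].append(current)
        # Walk back from the farthest nodes and accumulate the dependencies.
        dependency = {node: 0.0 for node in nodes}
        for node in reversed(order):
            for pred in predecessors[node]:
                share = path_count[pred] / path_count[node]
                dependency[pred] += share * (1 + dependency[node])
            if node != source:
                score[node] += dependency[node]
    return score


# Hypothetical network: "bridge" connects two senders to two receivers.
graph = {"alice": ["bridge"], "bob": ["bridge"], "bridge": ["carol", "dave"]}
scores = betweenness(graph)
print(scores)
```

In this toy network every shortest path from alice or bob to carol or dave runs through "bridge", so that node gets the highest score while the others stay at zero.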
And now you are ready to zoom in on the nodes and start making sense of the dataset. You can try to describe this graph to answer your questions.
I've uploaded this last visualization here for you to play with some more in Gephi. I hope you liked reading this blog. Feel free to share your opinions and suggestions or to ask me questions.
Twitter : Dutch_Osintguy