Twitter Analytics — Part II
Using The Cloud as an excuse to stalk the Tweeters of Melbourne
This is a follow-up to a previous post, visible here.
Be aware: this post contains GIFs, which may take a while to load.
So what did I do?
As part of my course of study, I worked alongside a group of incredibly talented people. Our objective was to cross-reference Twitter data harvested from Melbourne with available socio-economic datasets, and from that draw insights and meaningful conclusions about the nature of our city and its dwellers.
This was a monumental task to attempt within a month-long time frame. And it was ambitious for us few, so new to the cloud-game.
Reaching our desired outcome required capitalizing on a medley of resources and technologies. As will be mentioned later, we couldn’t have gotten far without a few core computing assets at our disposal, these being:
- NeCTAR, a federally-funded eResearch cloud service that let us configure and run our own cloud architecture with fine-grained control;
- AURIN, an online gateway for Australian urban research and our mediator for cross-organisational socioeconomic datasets; and
- CouchDB, a NoSQL database program for facilitating the management of our collected Tweets.
For a full run-down of our application, you can view our report here. Alternatively, you can enjoy the work process summary below.
So, how do we go about this?
Before we had even started, the possibilities for socioeconomic discovery were innumerable. We had public data at our disposal, including census results, health reports, demographic statistics, and the results of citywide mapping and surveys, covering tree locations, liquor-licensed venues, even the location of every public toilet. In short, we had an arsenal of information on the nature of our city. It was just a matter of collecting it, distilling it, and contrasting it with the information that we could garner from one of the world’s foremost social media websites, Twitter.
In theory, we could cross-reference Twitter users’ location data to analyse the flow of human traffic across Melbourne’s road and transit architecture. Or we could apply machine learning techniques to gauge crowd sizes for sporting events, based on previous correlations between recorded crowd sizes and the number of tweeters who attended. We could even analyse the language used by tweeters and cross-reference it with their originating suburbs, estimating things like their likely education levels, social standing, or native backgrounds. The list is long and compelling.
Before deciding on our angle, we first had to start harvesting. Tweets come with an array of information, the most useful of which, for our purposes, were the location data and the tweet’s message itself, though other inclusions like hashtags and follower activity also proved invaluable.
Each tweet comes with GIS location data attached, in the form of a latitude and longitude, provided the user has not disabled this setting. This information is accessible in GeoJSON format, consistent with OpenGIS web standards. It is highly accurate in both time and space: it is captured immediately on publication of the tweet and can pinpoint the user’s location to the nearest meter or so.
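To make the location format concrete, here is a minimal sketch of pulling a coordinate pair out of a tweet payload. It assumes the tweet has already been parsed into a Python dict and that the exact location sits in a `coordinates` field holding a GeoJSON Point. The function name and the sample payloads are made up for illustration and are not taken from our harvester.

```python
from typing import Optional, Tuple


def tweet_lon_lat(tweet: dict) -> Optional[Tuple[float, float]]:
    """Return (longitude, latitude) for a geotagged tweet, else None."""
    point = tweet.get("coordinates")
    # The field is missing or null when no exact location was shared.
    if not point or point.get("type") != "Point":
        return None
    # GeoJSON stores positions as [longitude, latitude], in that order.
    lon, lat = point["coordinates"][:2]
    return float(lon), float(lat)


# Hypothetical payloads, trimmed to the fields used here.
geotagged = {
    "text": "Coffee in the city",
    "coordinates": {"type": "Point", "coordinates": [144.9656, -37.8172]},
}
untagged = {"text": "No location here", "coordinates": None}

print(tweet_lon_lat(geotagged))
print(tweet_lon_lat(untagged))
```

The main trap is the axis order: GeoJSON puts longitude first, which is the reverse of how most people say "lat, long" out loud.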
Using the Tweepy Python library, and running our tweet harvesting program overnight in a tmux session, we were able to harvest tens of thousands of tweets within the first day, which we could filter using location queries. We could keep this process running continuously, feeding tweets directly into our cloud-hosted database, which allowed persistent data visualization and a live feed through to our front-end web application.
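A location filter of this kind boils down to a bounding-box check. The sketch below is a plain-Python illustration and not our actual Tweepy code: the box is a rough, hand-picked rectangle around Greater Melbourne, and the names `MELBOURNE_BBOX`, `in_bbox` and `keep_local` are hypothetical.

```python
from typing import Iterable, Iterator, Tuple

# Rough rectangle around Greater Melbourne (an assumption for illustration):
# (west, south, east, north) in degrees of longitude and latitude.
MELBOURNE_BBOX = (144.3, -38.5, 145.9, -37.4)


def in_bbox(lon: float, lat: float,
            bbox: Tuple[float, float, float, float] = MELBOURNE_BBOX) -> bool:
    """True when the point lies inside the box, edges included."""
    west, south, east, north = bbox
    return west <= lon <= east and south <= lat <= north


def keep_local(tweets: Iterable[dict],
               bbox: Tuple[float, float, float, float] = MELBOURNE_BBOX) -> Iterator[dict]:
    """Yield only tweets whose GeoJSON point falls inside the box."""
    for tweet in tweets:
        point = tweet.get("coordinates")
        if not point:
            continue  # no exact location attached
        lon, lat = point["coordinates"][:2]
        if in_bbox(lon, lat, bbox):
            yield tweet


# Hypothetical tweets: one in Melbourne, one in Sydney, one without location.
sample = [
    {"id": 1, "coordinates": {"type": "Point", "coordinates": [144.9631, -37.8136]}},
    {"id": 2, "coordinates": {"type": "Point", "coordinates": [151.2093, -33.8688]}},
    {"id": 3, "coordinates": None},
]
local_ids = [tweet["id"] for tweet in keep_local(sample)]
print(local_ids)
```

A rectangle is a blunt instrument, so a check like this is only a first pass before matching tweets to actual suburb boundaries.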
Our architecture was composed of four virtual machines, along with mounted volumes for extra storage space. Our tweets were fed directly to one of these machines, and we used CouchDB to shard and distribute our data across the other nodes, ensuring reliability and accessibility in case a node failed. Our cloud architecture was not only responsible for storing our tweets; it was also the medium for our analysis and web serving. As administrators of our own cloud system, we automated the storage and management of our database through Ansible playbooks, in preparation for possible future scaling.
Once our data collection was up and running, we required socioeconomic data. This is where AURIN came in. Using what the Australian Bureau of Statistics classifies as an SA2 Statistical Area, we could draw on multiple diverse datasets in a homogeneous format. These sets, as collected by the 2011 and 2013 censuses and with close to 100 available, offered socio-economic insight at the granularity of individual, approximated suburbs. By collating this GeoJSON data, combining each suburb’s geographic polygon with its associated socioeconomic properties, we could easily map each tweet and visualize its suburb of origin using the Google Maps libraries and API, which is notable for its GeoJSON functionality.
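Matching a tweet to a suburb is a point-in-polygon problem. The following is a minimal sketch using the classic ray-casting test, not necessarily how our application did it. It assumes each suburb is a GeoJSON Feature with a simple `Polygon` geometry (holes and `MultiPolygon` shapes are ignored) and a `name` property, which is a made-up key. The two square "suburbs" are invented test shapes.

```python
from typing import List, Optional, Sequence


def point_in_ring(lon: float, lat: float, ring: Sequence[Sequence[float]]) -> bool:
    """Ray-casting test against one closed ring of [lon, lat] vertices."""
    inside = False
    for i in range(len(ring)):
        x1, y1 = ring[i][:2]
        x2, y2 = ring[(i + 1) % len(ring)][:2]
        # Only edges that straddle the point's latitude can be crossed.
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside  # each crossing flips the state
    return inside


def suburb_for_point(lon: float, lat: float, features: List[dict]) -> Optional[str]:
    """Return the name of the first polygon feature containing the point."""
    for feature in features:
        geometry = feature["geometry"]
        if geometry["type"] != "Polygon":
            continue  # other geometry types are skipped in this sketch
        exterior = geometry["coordinates"][0]  # first ring is the outer boundary
        if point_in_ring(lon, lat, exterior):
            return feature["properties"]["name"]
    return None


def square(name: str, west: float, south: float, east: float, north: float) -> dict:
    """Build a rectangular GeoJSON Feature (invented test data)."""
    ring = [[west, south], [east, south], [east, north], [west, north], [west, south]]
    return {
        "type": "Feature",
        "properties": {"name": name},
        "geometry": {"type": "Polygon", "coordinates": [ring]},
    }


suburbs = [
    square("Suburb A", 144.9, -37.9, 145.0, -37.8),
    square("Suburb B", 145.0, -37.9, 145.1, -37.8),
]
print(suburb_for_point(144.95, -37.85, suburbs))
print(suburb_for_point(145.05, -37.85, suburbs))
print(suburb_for_point(146.00, -37.85, suburbs))
```

A linear scan over every polygon is fine for a few hundred suburbs. With many more shapes or tweets, a spatial index or a dedicated geometry library would be the more sensible choice.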
We looked at three analysis scenarios:
1 — Sentiment Analysis
By analysing the sentiment of each tweet, we could ascertain which suburbs were, for example, the ‘happiest’ in Melbourne. We could also ‘tag’ these suburbs with their corresponding sentiment value and correlate this sentiment rating with other Twitter or census information, such as the device each tweet was sent from: are iOS tweeters the happiest tweeters?
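Once every tweet has a suburb and a sentiment score, ranking suburbs is a simple aggregation. The sketch below assumes the scoring has already happened upstream and that each score is a single number between -1 and 1. The function names, the `min_tweets` cut-off and the scores themselves are all invented for illustration.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, List, Tuple


def mean_sentiment_by_suburb(scored: Iterable[Tuple[str, float]]) -> Dict[str, float]:
    """Average the sentiment scores of all tweets seen in each suburb."""
    buckets = defaultdict(list)
    for suburb, score in scored:
        buckets[suburb].append(score)
    return {suburb: mean(scores) for suburb, scores in buckets.items()}


def rank_suburbs(scored: Iterable[Tuple[str, float]], min_tweets: int = 2) -> List[str]:
    """Suburbs from happiest to least happy, skipping thinly sampled ones."""
    scored = list(scored)
    counts = defaultdict(int)
    for suburb, _ in scored:
        counts[suburb] += 1
    averages = mean_sentiment_by_suburb(scored)
    eligible = [s for s in averages if counts[s] >= min_tweets]
    return sorted(eligible, key=lambda s: averages[s], reverse=True)


# Invented (suburb, score) pairs; the scores are not real results.
scored_tweets = [
    ("Carlton", 0.6), ("Carlton", 0.2),
    ("Fitzroy", -0.1), ("Fitzroy", 0.5),
    ("Richmond", 0.9),
]
averages = mean_sentiment_by_suburb(scored_tweets)
ranking = rank_suburbs(scored_tweets, min_tweets=2)
print(averages)
print(ranking)
```

The `min_tweets` cut-off is there because a suburb with a single glowing tweet would otherwise top the table, which is exactly the small-sample trap.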
2 — Movement and Tracking
By mapping location data, we could build representations of the movement patterns of our evaluated users. We were also able to filter for the most-followed users who had enabled location data recording, using CouchDB’s ‘View’ functionality. We could further collect these users’ tweet histories and produce a spatial storyline; in essence, a geo-chronological mapping of the movement of a thousand high-profile Melburnians… with scary accuracy. Undoubtedly, this is where our analysis was the most sensitive and provocative, and we had to take care not to go too far and breach the privacy of users, ironically in the name of furthering our lesson on data privacy.
Let me reiterate that this geographical information was easily available to us, and without doubt it is the foundation of a stalker’s dreamscape.
3 — Language Modelling
By training on socioeconomic datasets and the language used in tweets, we could forecast language patterns by suburb, and estimate not only where a tweet was likely to have originated, but also what level of social advantage its author was likely to have.
For full details of each analysis, such as the application of the VADER library in our sentiment SVM learning algorithm or the use of n-grams in the language model, see the report linked earlier in this post.
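As a toy illustration of the idea, and not the model from our report, the sketch below counts bigrams per suburb and scores a new piece of text against each suburb with add-one smoothing. It ignores how many tweets each suburb has, uses whitespace tokenisation, and trains on four invented sentences. The suburb labels are only there to make the output readable.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple


def bigrams(text: str) -> List[Tuple[str, str]]:
    """Lowercase, split on whitespace, and pair each token with the next."""
    tokens = ["<s>"] + text.lower().split()
    return list(zip(tokens, tokens[1:]))


def train(examples: Iterable[Tuple[str, str]]) -> Dict[str, Counter]:
    """Count bigrams separately for each suburb label."""
    counts = defaultdict(Counter)
    for suburb, text in examples:
        counts[suburb].update(bigrams(text))
    return dict(counts)


def score(text: str, counts: Dict[str, Counter]) -> Dict[str, float]:
    """Log-likelihood of the text under each suburb, add-one smoothed."""
    vocab = set()
    for counter in counts.values():
        vocab.update(counter)
    scores = {}
    for suburb, counter in counts.items():
        total = sum(counter.values())
        scores[suburb] = sum(
            math.log((counter[gram] + 1) / (total + len(vocab)))
            for gram in bigrams(text)
        )
    return scores


def predict(text: str, counts: Dict[str, Counter]) -> str:
    """Suburb whose bigram counts best explain the text."""
    scores = score(text, counts)
    return max(scores, key=scores.get)


# Invented training tweets; not drawn from our dataset.
training = [
    ("St Kilda", "sunset swim at the beach"),
    ("St Kilda", "beach volleyball by the pier"),
    ("Collingwood", "craft beer at the brewery"),
    ("Collingwood", "new brewery pouring craft beer"),
]
model = train(training)
print(predict("craft beer festival", model))
print(predict("walk along the beach", model))
```

Smoothing matters here because most bigrams in a new tweet will never have been seen in training, and a single zero probability would otherwise wipe out a suburb’s score.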
What we found
In terms of sentiment analysis, we found insights of little overall significance. While our algorithm was accurate, the small sample sizes hindered the discovery of any grand conclusions. In fact, the most interesting conclusion we found was just how heavily populated the Twittersphere is with bots: close to 50% in some of our cases. This hurdle immediately cut our supply of meaningful tweets. Additionally, the majority of (human) tweets evidently came either from high-profile users, like local celebs or politicians, or from your resident sports zealot and his three followers. Topics were stereotypical and irredeemably idiosyncratic. Material was hardly diverse or deep.
All in all, it painted a bleak and fairly incomplete picture of the fabric of society, and it obscured the wealth of culture and diversity our city should contain, or at least that we hoped it contained. It suggested that Twitter in the Melbourne area was a dying, or at the very least a limited, social media platform: a shadow of the gigantic social platform it once was, now catering to niche pockets of culture, weather forecasts, and marketing, whether of sport, of the self, or of commercial products.
My disenchantment with Twitter aside, the language model turned out a success, insofar as it was an enjoyable source of data analysis and experimentation. While not perfect, again due to the small number of geographically significant tweets we could extract from Melbourne within a month, the model was appreciably germane.
Coastal suburbs lit up at the mention of beachside terms, and fake tweets about craft beer festivals lit up suburbs that contain local craft brewers. Even queries designed specifically to attract suburbs of high social advantage were mostly successful, lighting up areas typical of wealth, privilege or education while leaving areas of social disadvantage dark.
As for user tracking, it taught us a lot about user privacy and security. It was easy for us to identify the highest-profile citizens of the city and then narrow down their whereabouts, simply by querying our database. By the nature of Twitter and other large data-gathering platforms, qualities like popularity and social reach are quantified, for example through the number of followers or retweets per user. And with the rise of other technologies and platforms, like Google Streetview, Instagram, and Facebook, we could delve further into the personal lives of people we had never met, bringing more resources into scope to extend and validate our findings.
For example, we could find a user’s tweeting hotspots and evaluate these locations. Then, using Google Maps, we could ascertain the addresses of their homes: in essence, their residential hotspots. These we could in theory decorate with Google Streetview captures of their house front and their car parked in the driveway. Further, we could locate their place of work or their favourite local haunts. It was even easier for older users. The rule was: the more you tweeted, the greater the resolution you gave to our stalking map.
If anything, our project gave us a lesson on the consequences of social media colliding with the Big Data movement. Your data does not terminate where a web domain finishes, and access to your private information hardly expires.
The internet is ever swollen and saturated with your information, which floats around ambient and immortal, waiting to be pulled down to satisfy adventurous Melburnian data analysts with too much idle time on their hands. I guess this is why they call it the Cloud.
As the contracts for our cloud resources have long since expired, this project is no longer running. Instead, a recorded demo of our program is available on YouTube here, and you can read our report as linked above.