Yet another exploratory analysis on Twitter geographical data


I decided to learn a little more Python, and I needed some data on which to perform a simple exploratory analysis and practice with Pandas data frames & co.

So I put together a collection of 79,764 tweets, gathered between 2017-05-09 15:47:36 and 2017-05-19 and geolocated within the London bounding box:

{ p1: [-0.489, 51.28], p2: [0.236, 51.686] }

To collect them, I just used a simple Scala/Play app that listens to all geo-tagged tweets within a particular area and stores them in a relational database.

The code is open source and can be found here.
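For a feel of what the listener does, here is a roughly equivalent sketch in Python using Tweepy's 3.x streaming API (a hypothetical stand-in, not the original Scala/Play code; the credentials and the save_tweet function are placeholders):

```python
import tweepy

# [west, south, east, north], matching the bounding box above
LONDON_BBOX = [-0.489, 51.28, 0.236, 51.686]

class GeoListener(tweepy.StreamListener):
    def on_status(self, status):
        if status.coordinates:  # keep only precisely geo-tagged tweets
            save_tweet(status)  # hypothetical persistence function

auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
auth.set_access_token(ACCESS_TOKEN, ACCESS_TOKEN_SECRET)

stream = tweepy.Stream(auth=auth, listener=GeoListener())
stream.filter(locations=LONDON_BBOX)
```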

Number of tweets by hour

Let’s group the tweets by the hour of their creation date and count them. The result is pretty obvious: loud voices during the day and pretty quiet during the night.

Hourly distribution of tweets
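For reference, a minimal sketch of the grouping with Pandas, assuming a DataFrame df with a created_at datetime column (the column name is an assumption):

```python
import matplotlib.pyplot as plt

# `created_at` is assumed to already be parsed as a datetime column
tweets_by_hour = df.groupby(df["created_at"].dt.hour).size()

tweets_by_hour.plot(kind="bar")
plt.xlabel("Hour of the day")
plt.ylabel("Number of tweets")
plt.show()
```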

Distribution by language

To understand the nature of the tweets in London a little better, it might be useful to plot the language distribution.

The plot uses a logarithmic scale on the axis representing the number of tweets, due to the preponderance of tweets marked as English or undefined over all other languages.

The second largest language is Spanish, two orders of magnitude behind English.
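A possible way to produce such a plot, assuming a lang column holding the language code Twitter attaches to each tweet:

```python
import matplotlib.pyplot as plt

lang_counts = df["lang"].value_counts()

ax = lang_counts.plot(kind="bar", logy=True)  # log scale on the count axis
ax.set_xlabel("Language")
ax.set_ylabel("Number of tweets (log scale)")
plt.show()
```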

Distribution by London borough

Now that we know a little more about the languages of London, it might be worth exploring where people tweet.

Similarly to what we did for languages, we can plot the distribution by borough.

Again, we can notice a huge skew in the dataset: Westminster leads the chart with one order of magnitude more tweets than Camden, the second most frequent borough.
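The borough of a tweet isn’t in the raw data; one way to derive it is a spatial join of the tweet coordinates against the borough boundaries. A hedged sketch with GeoPandas, assuming a GeoJSON of London boroughs with a name property (the file and column names are assumptions):

```python
import geopandas as gpd

# Borough boundaries, e.g. from the London Datastore (file name is an assumption)
boroughs = gpd.read_file("london_boroughs.geojson").to_crs("EPSG:4326")

points = gpd.GeoDataFrame(
    df.copy(),
    geometry=gpd.points_from_xy(df["lon"], df["lat"]),
    crs="EPSG:4326",
)

# Assign each tweet to the borough polygon containing it
# (older GeoPandas versions use op="within" instead of predicate="within")
joined = gpd.sjoin(points, boroughs, how="left", predicate="within")
borough_counts = joined["name"].value_counts()
```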

Visualising tweets on a map

At this point it is probably worth plotting those tweets on a London map. However, I couldn’t find any suitable tool that would let me plot thousands of tweets without crashing my browser.

I came across Folium, which nicely combines Python with Leaflet.js, a JS library for interactive maps.

It seems that Folium can handle a larger number of data points on a map than other tools, but I still needed to sample the data to keep the map readable.

English tweets are shown in blue, tweets in any other language in red. The map also differentiates the statutory Inner London boroughs from the Outer London ones.
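A minimal sketch of such a map with Folium, assuming a DataFrame with lat, lon and lang columns (the column names and sample size are assumptions):

```python
import folium

# Centre on London; sample to keep the map responsive
m = folium.Map(location=[51.507222, -0.1275], zoom_start=10)

for row in df.sample(n=2000, random_state=0).itertuples():
    folium.CircleMarker(
        location=[row.lat, row.lon],
        radius=2,
        color="blue" if row.lang == "en" else "red",
        fill=True,
    ).add_to(m)

# The borough boundaries could be overlaid with folium.GeoJson(...)
m.save("london_tweets.html")
```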

At first glance, it seems that non-English tweets tend to cluster around one specific area (Westminster). It’s probably worth zooming in:

At this point it is probably useful to ask a simple question:

Is there any relationship between the language of a tweet and whether it comes from Inner London or Outer London?

Test for Independence

One way to check whether there is any relationship between these two characteristics (or variables) of a tweet, its language (English or non-English) and its geographical position (Inner London or Outer London), is the Chi-squared test.

For a great explanation of the Chi-squared test, I recommend Probability and Statistics for Engineers and Scientists.

In a nutshell, with the Chi-squared test we assume that there is no relationship between the two variables (the null hypothesis) and check whether the observed frequencies match the ones we would expect under that assumption.

If they don’t, the null hypothesis is rejected, and we have to conclude that the two variables are not independent.
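Concretely (this formula isn’t in the original post, but it’s the standard definition of the statistic), the test compares each observed count O_ij with the count E_ij expected under independence:

```latex
\chi^2 = \sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}},
\qquad
E_{ij} = \frac{(\text{row } i \text{ total}) \cdot (\text{column } j \text{ total})}{N}
```

Under the null hypothesis, the statistic follows a chi-squared distribution with (r − 1)(c − 1) degrees of freedom; for a 2×2 table like ours, that is 1 degree of freedom.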

First thing, we compute the following contingency table:
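A minimal way to build such a table with Pandas, assuming two hypothetical boolean columns, is_english and inner_london:

```python
import pandas as pd

# One boolean flag per tweet for each of the two variables
observed = pd.crosstab(df["is_english"], df["inner_london"])
print(observed)
```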

Then we can run our Chi-squared test, obtaining:

statistic=173.1806659689629, pvalue=1.4945622695801191e-39
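These figures come from SciPy; a minimal sketch of the call, assuming observed is the contingency table built above:

```python
from scipy.stats import chi2_contingency

statistic, pvalue, dof, expected = chi2_contingency(observed)
print(f"statistic={statistic}, pvalue={pvalue}")
```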

Given such a small p-value, we can confidently reject the null hypothesis.

Distance from the geographical center of London

However, the test of independence doesn’t tell us anything about the nature of the possible relationship between the two variables.

We might assume that non-English tweets are denser closer to the geographical centre of London. This might be because tweets in a foreign language are likely to come from tourists, and tourists tend to hang out around the very centre of the city.

To visualise that, we can compute the distance of each tweet from the coordinates:

51.507222, -0.1275

And then draw a scatter plot of the newly computed variable (distance from the centre) against the categorical variable expressing the language:
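A sketch of the distance computation using the haversine formula, plus a strip plot with Seaborn (the lat, lon and is_english column names are assumptions):

```python
import numpy as np
import seaborn as sns

EARTH_RADIUS_KM = 6371.0
CENTRE_LAT, CENTRE_LON = 51.507222, -0.1275

def distance_from_centre_km(lat, lon):
    """Haversine (great-circle) distance from the centre of London, in km."""
    lat1, lon1, lat2, lon2 = map(np.radians, (CENTRE_LAT, CENTRE_LON, lat, lon))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

df["distance_km"] = distance_from_centre_km(df["lat"], df["lon"])

# One strip of points per language group rather than a plain scatter
sns.stripplot(x="distance_km", y="is_english", data=df, jitter=True, size=2)
```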

As we can see, the density of non-English tweets (in green) drops off beyond 10 kilometres from the centre, while English tweets keep roughly the same density well past 20 kilometres.

Technology used

  • Scala/Play to collect the data
  • Python: Pandas, Folium, Seaborn for data visualisation and analysis
  • SherlockML for computational power (disclaimer: I work with the team that is building SherlockML. Please reach out to me if you want a private invite)
