Inferring Jakarta Commuting Statistics from Twitter
Some estimates put the population of Greater Jakarta at over 30 million. Within the boundaries of the city itself, the transport system has to handle 1.38 million daily commuters. Policy makers need regular updates to track the rhythm of the city and optimise public transport, so we teamed up with the Indonesian Institute of Statistics to look at whether data from social media could help.
Transport policy in a megacity
Jabodetabek, as locals fondly refer to the megacity, includes the settlements of Jakarta, Bogor, Depok, Tangerang and Bekasi. Jakarta itself is split into five administrative cities: North, South, East, West and Central. Population estimates range from 10 million for Jakarta to over 30 million for the broader metropolitan area, also known as Greater Jakarta.
The scale of the population and the state of the transport infrastructure make the daily commute a common complaint among residents. The city administration is working to improve the commuting experience through significant investments in transport infrastructure, such as the new Jakarta MRT.
To inform these infrastructure investments and to understand the rhythm of the city, the Indonesian Bureau of Statistics conducted the first Jakarta commuting survey in 2014. The survey filled an initial data gap, but it took a year from design to delivery for the results to become available.
Filling the data gaps
The challenge of data relevance in urban planning is not unique to Jakarta, and it has spurred many attempts to use other types of data to produce similar statistics. The most promising of these use geolocated information from sources such as GPS devices, sensors, social media and mobile phones.
In Indonesia, social media is recognised as a promising data source to understand macro patterns of behaviour. This is especially true of Jakarta, which has been named the Twitter Capital of the World due to the ten million tweets posted there every day.
Pulse Lab Jakarta used this opportunity to test whether the locational information from social media on mobile devices can reveal commuting patterns in the Greater Jakarta area. First, we produced origin-destination statistics for the ten cities in Greater Jakarta from the GPS-stamped tweets in the database by identifying a subset of people who commute between these areas. Second, we calibrated the initial result based on the population distribution and the Twitter user distribution. Finally, we verified the result against the official commuting statistics produced by the Indonesian Bureau of Statistics.
We collected all GPS-stamped tweets posted in Greater Jakarta from a data firehose and kept those posted between 1 January 2014 and 30 May 2014, because the official commuting survey was conducted during the first quarter of 2014.
For each user, we inferred two locations, an origin and a destination, both at sub-district level. The origin was taken to be the sub-district the user tweeted from most often between 9pm and 7am. The destination was the sub-district the user tweeted from most often on weekdays, excluding the origin.
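As a rough illustration, the origin and destination rule could be sketched in Python as follows. This is a minimal sketch and not the code used in the study: the function names, the (user, timestamp, sub-district) input format and the sample tweets are all hypothetical, and we assume that ties go to the sub-district seen first and that users with no weekday tweets outside their origin are dropped.

```python
from collections import Counter
from datetime import datetime


def is_night(ts):
    # Assumed reading of "between 9pm and 7am": 21:00 up to, but not including, 07:00.
    return ts.hour >= 21 or ts.hour < 7


def infer_origin_destination(tweets):
    """tweets: iterable of (user_id, timestamp, sub_district) tuples (hypothetical format)."""
    night_counts = {}
    weekday_counts = {}
    for user, ts, district in tweets:
        if is_night(ts):
            night_counts.setdefault(user, Counter())[district] += 1
        if ts.weekday() < 5:  # Monday=0 ... Friday=4
            weekday_counts.setdefault(user, Counter())[district] += 1

    result = {}
    for user, counts in night_counts.items():
        origin = counts.most_common(1)[0][0]
        # Destination: most tweeted weekday sub-district, excluding the origin.
        candidates = Counter(weekday_counts.get(user, {}))
        candidates.pop(origin, None)
        if candidates:
            result[user] = (origin, candidates.most_common(1)[0][0])
    return result


# Invented sample tweets, for illustration only.
sample = [
    ("u1", datetime(2014, 1, 6, 22, 0), "A"),   # Monday night
    ("u1", datetime(2014, 1, 7, 6, 0), "A"),    # Tuesday early morning
    ("u1", datetime(2014, 1, 7, 10, 0), "B"),   # Tuesday daytime
    ("u1", datetime(2014, 1, 8, 11, 0), "B"),   # Wednesday daytime
    ("u1", datetime(2014, 1, 8, 12, 0), "A"),   # Wednesday daytime
    ("u2", datetime(2014, 1, 4, 23, 0), "C"),   # Saturday night only
]
result = infer_origin_destination(sample)
print(result)
```

In this toy input, user u1 gets an origin and a destination, while u2 is dropped because there is no weekday location other than the origin.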
Using this approach, among the 1,456,927 unique users who posted GPS-located tweets in Greater Jakarta during the five months from January 2014, we found the origin and destination information for 305,761 users at the sub-district level (i.e. we were not zooming in any further). This represents about 2.8 per cent of the whole population, and 14 per cent of the commuting population in Greater Jakarta.
Because Twitter penetration rates differ from place to place, we aggregated the origin-destination information from sub-district level to city level and calibrated it against population data for the ten cities. After calibration, the cross-correlation score between the official statistics and the statistics from our approach improved from 0.92 to 0.97.
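To make the idea of calibration and comparison more concrete, here is a small Python sketch. The exact calibration formula is not spelled out in this post, so the weighting below (scaling each origin city's flows by its population divided by its number of Twitter users) is an assumption, as is the use of a plain Pearson correlation for the comparison. All names and numbers are hypothetical and invented for illustration.

```python
from math import sqrt


def calibrate(flows, population, twitter_users):
    """Scale each origin city's flow by population / Twitter users (assumed weighting)."""
    return {
        (origin, dest): count * population[origin] / twitter_users[origin]
        for (origin, dest), count in flows.items()
    }


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


def compare(estimated, official):
    """Correlate the two sets of flows over the city pairs they share."""
    pairs = sorted(set(estimated) & set(official))
    return pearson([estimated[p] for p in pairs], [official[p] for p in pairs])


# Invented toy data for two cities, X and Y.
flows = {("X", "Y"): 10, ("Y", "X"): 20}
population = {"X": 1000, "Y": 500}
twitter_users = {"X": 100, "Y": 100}
calibrated = calibrate(flows, population, twitter_users)
print(calibrated)
```

Here city X has the same number of Twitter users as city Y but twice the population, so each of its observed commuters is given twice the weight.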
Results in full
The chord diagram shows that Twitter is a promising source of data for inferring commuting statistics in Greater Jakarta.
Table C shows the rank difference between Table A and Table B. For instance, the value for SJ ⇒ CJ is ‘0’ because that flow has the same rank in both tables. Table C suggests that our approach produces broadly reliable predictions.
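A rank difference of this kind could be computed along the following lines. This is only a sketch under our own assumptions: flows are ranked across all origin-destination pairs from largest to smallest, the difference is taken as the rank in the first table minus the rank in the second, and the flow values and the EJ and WJ labels are invented for illustration.

```python
def rank(flows):
    """Rank origin-destination pairs from the largest flow (rank 1) downwards."""
    ordered = sorted(flows, key=lambda pair: flows[pair], reverse=True)
    return {pair: position + 1 for position, pair in enumerate(ordered)}


def rank_difference(table_a, table_b):
    """Rank in table_a minus rank in table_b, for pairs present in both tables."""
    ranks_a, ranks_b = rank(table_a), rank(table_b)
    return {pair: ranks_a[pair] - ranks_b[pair] for pair in ranks_a if pair in ranks_b}


# Invented toy flows; the values do not come from the study.
table_a = {("SJ", "CJ"): 50, ("EJ", "CJ"): 30, ("WJ", "CJ"): 20}
table_b = {("SJ", "CJ"): 500, ("EJ", "CJ"): 150, ("WJ", "CJ"): 200}
table_c = rank_difference(table_a, table_b)
print(table_c)
```

A value of 0 means the flow sits at the same position in both rankings, and larger absolute values mean the two sources disagree more about how that flow compares with the others.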
We hope to improve the method with better calibration using other demographic variables, and to expand our research by analysing commuting data from Transjakarta.
This research was originally presented at NetMob 2017 and more recently at the Asia-Pacific Economic Statistics Week. Our partner in this research, the Institute of Statistics, part of Statistics Indonesia, is planning to use this method to enhance the commuting statistics produced by the Government.
Pulse Lab Jakarta is grateful for the generous support of the Government of Australia.
This blog post also appeared on the United Nations Global Pulse website.