The Origin of Jakartans

Visualizing where 7 million people came from using Indonesia’s 2014 General Election data, for free.

This is a map of the actual hometowns of Jakartans, circa 2014. It only includes origin cities with over 1,000 people migrating to Jakarta, and the line width is proportional to the logarithm of the actual number, which ranges from 10³ to 10⁶. Each color represents their current domicile: South, North, West, East, or Central Jakarta. Can you guess which color represents which?

How It Started

So the story is that I always wanted to do something with the Indonesia Voter List data for three simple reasons:

  1. It’s huge. 188,268,423 unique voters, to be exact. I don’t want to use the B-word because it’s not exactly something on a terabyte scale. Based on my back-of-the-napkin calculation, the entire dataset would actually fit on a $50 USB flash disk. For a publicly available dataset, though, it’s still awesome!
  2. It’s public. It’s actually somewhat scary that each and every voter’s name and where they live (down to the village level) is available for anyone to browse, but if it’s already open, why not actually use it, right? Just to be on the safe side, I asked my NGO friends working on elections and a commissioner of the Election Commission, who confirmed that the list itself is indeed public, with no provision on how it can be used outside the election context.
  3. It’s easy to access. FOR ONCE, IT’S NOT INSIDE A PDF OF MISALIGNED PHOTOCOPIED TABLE!
Like, one Chrome Inspector and an overnight Python script away
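As a rough check on the back-of-the-napkin estimate in the first point, here is the arithmetic as a tiny Python sketch. The voter count comes from the text; the 100 bytes per record is purely an assumed figure for illustration (a name, a birthplace, and a few region names as plain text), not something measured from the actual dataset.

```python
# Back-of-the-napkin size estimate for the full voter list.
VOTERS = 188_268_423      # unique voters, from the text
BYTES_PER_RECORD = 100    # ASSUMPTION: average plain-text size of one record

def estimate_size_gb(records: int, bytes_per_record: int) -> float:
    """Return the uncompressed size in gigabytes (1 GB = 10**9 bytes)."""
    return records * bytes_per_record / 10**9

size_gb = estimate_size_gb(VOTERS, BYTES_PER_RECORD)
print(f"~{size_gb:.1f} GB")  # prints "~18.8 GB" under the assumption above
```

Even if the real records were several times larger than assumed, the total would still be in flash-disk territory rather than anything on a terabyte scale.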

Immediately you can see that while their detailed address, gender, and age are not listed (as it should be), one unusual variable is: birthplace. Of course, you can easily get the same, and definitely much better coded, thing from Indonesia’s 2010 Population Census from the statistics bureau, which they will happily give you for some $$$, but you have to fill out a form. Yikes. No, let’s just scrape it and see how far we can get.


The Deceptively Easy Part

The only problem was that I didn’t know how to process the data easily, as the entire thing was beyond what my laptop was able to handle. Fortunately, Google BigQuery came to the rescue! Something like:

SELECT KELURAHAN, TEMPATLAHIR, COUNT(*) AS CNT FROM DPTS WHERE PROVINSI = 'DKI JAKARTA' GROUP BY KELURAHAN, TEMPATLAHIR

gave us something like this in under 15 seconds:

I don’t want to disclose how many times I screwed up the SQL query
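To make the shape of that aggregation concrete, here is a minimal sketch that runs the same kind of query against an in-memory SQLite table using Python’s standard library. SQLite is only a stand-in for BigQuery here, and the rows are invented for illustration; only the table and column names come from the query in the text.

```python
import sqlite3

# Toy stand-in for the DPTS table; these rows are invented for illustration.
ROWS = [
    ("DKI JAKARTA", "MENTENG", "JAKARTA"),
    ("DKI JAKARTA", "MENTENG", "JAKARTA"),
    ("DKI JAKARTA", "MENTENG", "BANDUNG"),
    ("DKI JAKARTA", "GAMBIR", "SURABAYA"),
    ("JAWA BARAT", "CIHAPIT", "BANDUNG"),
]

def count_birthplaces(rows):
    """Count voters per (KELURAHAN, TEMPATLAHIR) pair within DKI Jakarta."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE DPTS (PROVINSI TEXT, KELURAHAN TEXT, TEMPATLAHIR TEXT)"
    )
    conn.executemany("INSERT INTO DPTS VALUES (?, ?, ?)", rows)
    result = conn.execute(
        """
        SELECT KELURAHAN, TEMPATLAHIR, COUNT(*) AS CNT
        FROM DPTS
        WHERE PROVINSI = 'DKI JAKARTA'   -- filter rows before grouping
        GROUP BY KELURAHAN, TEMPATLAHIR
        ORDER BY KELURAHAN, TEMPATLAHIR  -- only to make the output stable
        """
    ).fetchall()
    conn.close()
    return result

counts = count_birthplaces(ROWS)
for kelurahan, birthplace, cnt in counts:
    print(kelurahan, birthplace, cnt)
```

Note that the WHERE clause has to come before GROUP BY; swapping the two is a syntax error in standard SQL.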

As with any data exploration exercise, though, it’s never that easy. Fire up RStudio or an IPython notebook and you’ll see there’s a problem with the number of unique entries in the birthplace column:

For the purpose of this post, it looks fancier than R

120,000 divided among the 5 parts of Jakarta should ideally yield ~24,000 unique hometown names instead of almost 80,000. All of Indonesia has only 34 provinces, roughly 500 districts, and 80,000 villages. The data itself is sourced from the National ID database, which is supposed to be the answer to Life, the Universe, and Everything. They used a super expensive computerized system to build the database of every Indonesian ever in the whole Milky Way. It can’t be wrong.

But I can be wrong about them.

Sadly, opening the dataset in OpenRefine and running the clustering function will give you bad, bad news. I can almost hear the program whisper softly in my ear, “I know how it feels. There’s nothing you can do. Just remember it’s part of the Plan.”

What you see, dear good people, is one of 22,000+ clusters of similarly spelled places. The typos are real. I actually checked several dozen of them on the Election Commission website just to make sure it’s not my scraper acting funny. Somewhere, somehow, an actual public official entered JQKARTA, JUAKARTA, JAKARTAT, or one of the many variations above instead of Jakarta on a citizen’s National ID as his/her birthplace.

You would have thought they’d actually have a dropdown menu in the system instead of a textbox. Come on. It’s goddamn Jakarta, not a random marker on Google Maps. It’s been there and called by that name for 50 YEARS.
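To illustrate the kind of near-duplicate matching involved, here is a minimal sketch using difflib from Python’s standard library. It is not the algorithm OpenRefine uses; the tiny canonical name list and the 0.8 similarity cutoff are assumptions chosen for this example.

```python
from difflib import get_close_matches
from typing import Optional

# ASSUMPTION: a tiny reference list; a real one would hold every city name.
CANONICAL = ["JAKARTA", "BANDUNG", "SURABAYA"]

def canonicalize(raw: str, cutoff: float = 0.8) -> Optional[str]:
    """Return the closest canonical name, or None if nothing is similar enough."""
    cleaned = raw.strip().upper()
    matches = get_close_matches(cleaned, CANONICAL, n=1, cutoff=cutoff)
    return matches[0] if matches else None

for typo in ["JQKARTA", "JUAKARTA", "JAKARTAT", "SINGAPURA"]:
    print(typo, "->", canonicalize(typo))
```

The three typos from the text resolve to JAKARTA, while SINGAPURA matches nothing in this short list and comes back as None. A lower cutoff catches more typos but also starts merging places that are genuinely different, which is the trade-off that makes this cleanup so tedious.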

Even worse, what constitutes a birthplace is not always a city. The value actually ranges from the name of a village (without any other geographic indicator) to a country. Even if it’s spelled correctly, it might also be spelled differently (Singapura instead of Singapore), shortened (Tanjung becomes Tjg or Tj), or ambiguous (is it X the village, the city, or the subdistrict? the one in that province or the other one?).

all hail r/HighQualityGifs
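The shortened and alternate spellings described above could in principle be handled with a lookup step before any fuzzy matching. The sketch below is a hypothetical illustration: both mapping tables contain only the examples named in the text, and a real cleanup would need far longer ones.

```python
# Both tables hold only the examples from the text; real ones would be much longer.
ABBREVIATIONS = {"TJG": "TANJUNG", "TJ": "TANJUNG"}
ALIASES = {"SINGAPURA": "SINGAPORE"}  # alternate spelling -> preferred spelling

def normalize_place(raw: str) -> str:
    """Uppercase, drop dots, expand abbreviated words, then apply whole-name aliases."""
    tokens = raw.upper().replace(".", " ").split()
    expanded = " ".join(ABBREVIATIONS.get(tok, tok) for tok in tokens)
    return ALIASES.get(expanded, expanded)

print(normalize_place("Tjg. Priok"))  # TANJUNG PRIOK
print(normalize_place("Singapura"))   # SINGAPORE
```

A lookup like this does nothing for the ambiguous cases: a bare village name that exists in several provinces still needs extra context to resolve.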

While the data is free, my time isn’t.


Moving On

In the end, I wanted this to be just a data visualization exercise. It would take me hours upon hours to clean that godawful mess. I selected the sinful Select All, Merge Selected, and Close on that clustering window. I accepted the default merging of those similar names.

A couple minutes passed by and OpenRefine crashed.

I selected only the pairs with over 1,000 people coming from the origin city and deleted the rest. May the Gods of data science forgive me, for I have gravely sinned.

I ran each domicile-hometown pair through my bespoke, artisanal, locally-sourced geocoder, whose job was to assign a latitude and longitude to each geography. I also created an additional field containing the log value of the number of people, as it varies from a thousand to a million people. A couple of manual edits later, I had two CSV files of nodes and edges.
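As a rough sketch of that preparation step, the snippet below keeps only the pairs above a minimum count, adds a log column, and writes an edge list as CSV. The sample pairs, the column names, the hometown-to-domicile edge direction, and the choice of base 10 are all assumptions for illustration; the geocoding and the node file are left out.

```python
import csv
import io
import math

# ASSUMPTION: made-up (domicile, hometown, count) tuples, not real figures.
PAIRS = [
    ("JAKARTA SELATAN", "BANDUNG", 25_000),
    ("JAKARTA TIMUR", "TEGAL", 1_000_000),
    ("JAKARTA BARAT", "SOMEVILLAGE", 12),
]

def build_edges(pairs, min_count=1000):
    """Keep pairs with more than min_count people and attach log10 of the count."""
    return [
        {"source": home, "target": dom, "cnt": cnt,
         "log": round(math.log10(cnt), 3)}
        for dom, home, cnt in pairs
        if cnt > min_count
    ]

def edges_to_csv(edges) -> str:
    """Serialize the edge dictionaries to CSV text with a header row."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["source", "target", "cnt", "log"])
    writer.writeheader()
    writer.writerows(edges)
    return buf.getvalue()

edges = build_edges(PAIRS)
print(edges_to_csv(edges))
```

Taking the logarithm squeezes counts that span three orders of magnitude into a range of roughly 3 to 6, which is far easier to map onto a line width.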

Aww yiss.

Now let’s visualize it! I always have a thing for airline route maps.

If only I could afford something beyond zero-mileage promo tickets.

Let’s see whether it makes sense to do so. Fire up Gephi, load the node file into the node database and the edge file into the edge database. It was a four-click affair. Assign the newly created log value as the edge weight. Install the GeoLayout plugin and assign the latitude and longitude columns. Click Apply. Boom.

We’re on to something!

Now it actually made sense to export this pretty thing to a mapping platform and overlay it on top of an equally pretty basemap. Unfortunately, the newest version of Gephi broke the only Shapefile-exporting plugin, which hasn’t been updated since 2013. Well, the only thing left to do was to export the graph as a 4000x4000 PNG and overlay it manually using pirated Photos…I mean, the free image editor of your choice. You’re done!

But Why Not Make It Interactive

Yes, you can plot it using CartoDB, even on the free account. Upload the same two files, create a new map from the node layer, and, thanks to this handy tutorial from them, use this on the SQL tab:

SELECT a2.cartodb_id,
       a2.name, r.dest, r.cnt, r.origin, r.log,
       ST_Transform(
         ST_Segmentize(
           ST_MakeLine(
             a2.the_geom,
             a1.the_geom
           )::geography,
           100000
         )::geometry,
         3857
       ) AS the_geom_webmercator
FROM node a1
JOIN edges r ON r.source = a1.id
JOIN node a2 ON r.target = a2.id

The script above will create Great Circle lines for each city pair. Once again, I applied some additional styling rules to vary the line thickness depending on the log value and differentiate the line color based on the domicile. Now you can pan and zoom around! Hover your cursor over a line and you can see the hometown name along with the number as well. The embedded map is probably too small for your phone screen, but you can access the full screen version here.
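For intuition about what such a densified great-circle line looks like, here is a minimal Python sketch that interpolates points along the great circle between two coordinates on a perfect sphere. It only illustrates the geometry and is not how PostGIS implements ST_Segmentize (which works on a spheroid and takes a maximum segment length in meters rather than a point count). The coordinates below are approximate values picked for the example.

```python
import math

def great_circle_points(lat1, lon1, lat2, lon2, n=10):
    """Return n+1 (lat, lon) points evenly spaced along the great circle.

    Assumes a spherical Earth; nearly antipodal endpoints are not handled.
    """
    p1, l1, p2, l2 = map(math.radians, (lat1, lon1, lat2, lon2))
    # Convert both endpoints to unit vectors in 3D.
    a = (math.cos(p1) * math.cos(l1), math.cos(p1) * math.sin(l1), math.sin(p1))
    b = (math.cos(p2) * math.cos(l2), math.cos(p2) * math.sin(l2), math.sin(p2))
    dot = max(-1.0, min(1.0, sum(x * y for x, y in zip(a, b))))
    omega = math.acos(dot)  # central angle between the endpoints
    if omega == 0:
        return [(lat1, lon1)] * (n + 1)
    points = []
    for i in range(n + 1):
        t = i / n
        # Spherical linear interpolation between the two unit vectors.
        wa = math.sin((1 - t) * omega) / math.sin(omega)
        wb = math.sin(t * omega) / math.sin(omega)
        x, y, z = (wa * ai + wb * bi for ai, bi in zip(a, b))
        lat = math.degrees(math.asin(max(-1.0, min(1.0, z))))
        lon = math.degrees(math.atan2(y, x))
        points.append((lat, lon))
    return points

# Approximate coordinates for Medan and Jakarta, for illustration only.
route = great_circle_points(3.6, 98.7, -6.2, 106.8, n=5)
for lat, lon in route:
    print(f"{lat:.2f}, {lon:.2f}")
```

Drawn on a Web Mercator basemap, the intermediate points are what give long routes their characteristic airline-map curve; short hops like this one stay almost straight.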

Things rarely look bad on CartoDB.

Final Thoughts

To be honest, the thousands of typos thing is a bummer. The map would look so much cooler had that one crazy issue not been there in the first place. All things considered, though, I think it’s still an interesting dataviz mini project using entirely free software and services that started from a dataset meant for an entirely different purpose.

Millions of Jakartans will experience mudik, the ritual of going back to their hometown for the Eid holiday, in a couple of months’ time. Those who live will go back to where they were born, where their families live (or were), one more time. More than a tradition, it’s a gradually fading connection to places thousands of miles away where they once started their long journey,

the places once known as home.