My Experience as Jakarta Smart City Data Science Trainee

Grace K Susanto · Published in jakartasmartcity · 9 min read · Oct 14, 2021


Written by Grace Susanto, Hansen Wiguna, and Ayu Andika

Data Science Trainee Program

I was a Computer Engineering and Computer Science fresh graduate when I came across an opening for Jakarta Smart City’s Data Science Trainee Batch 3 program. I decided to try my luck with the program, submitted my application, went through 2 rounds of interviews and 1 hands-on project, and received my “accepted” decision around 2 weeks after I first applied.

There were 8 of us accepted as “mentees” in the program. The program itself is a mentorship program in which we are exposed to city-related problems and, over a 2-month period, are guided to contribute to solving these problems using data. The first two weeks were spent familiarizing ourselves with Jakarta Smart City, especially from a data analytics perspective: what kinds of problems Jakarta has been facing, and how Jakarta Smart City teams have been using data to help the government make better decisions faster.

After that, we were asked to define a city-related problem statement that we would solve using data over the next one and a half months with the help of our mentors. The problem I chose to solve was: creating a data pipeline to import time series data from an external API. I chose this problem because I realized that in order to make data-driven decisions, data scientists need to collect a lot of data. Some of these data can be gathered from public APIs, but some time series data are only available for a certain period of time. For example, some public API endpoints may only provide live, current data, while others may provide only the latest 24 hours’ worth of data. In order to analyze these data, we need a pipeline that incrementally fetches the data we need and stores it in our database.

Dealing with Time Series Data Sets

Time series data is interesting because each data point’s timestamp can be used as an ordering sequence. Time series data sets can reveal trends related to events happening in the real world. For example, intuitively, we know that long public holidays in Indonesia are usually followed by a surge in the number of covid-19 cases. This intuition can be affirmed with a time series data set: using daily new positive covid-19 case counts in Indonesia, we can see a trend that shows a positive correlation between the number of positive covid-19 cases and public holidays. In other words, the data show an increase in the number of cases a few weeks after public holidays.

There is a surge in the number of daily new covid-19 cases after long public holidays.

Back when I was formulating the problem statement, Jakarta was facing a significant rise in covid-19 cases, and the government took action by enacting a city-wide emergency public activity restrictions policy. To measure the effectiveness of this policy, data analysts can track the traffic index over time during the restrictions period, or compare it with the traffic index before and after the social distancing period.

One public API that provides this data for free is TomTom. The TomTom Traffic Index has been providing detailed insights on traffic congestion levels in over 400 cities around the world for the past 10 years. One of TomTom’s API endpoints provides weekly traffic congestion by time of day: a data set of hourly aggregated average traffic index values for Jakarta covering a 7 x 24 hour period.
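As a rough illustration, fetching and reshaping this kind of response might look like the sketch below. The endpoint URL, API key parameter, and JSON field names are hypothetical placeholders, not the actual TomTom API contract.

```python
import requests
import pandas as pd

# Hypothetical endpoint, key, and response fields -- the real TomTom API
# has its own URL, authentication scheme, and schema.
API_URL = "https://api.example.com/traffic-index/jakarta/hourly"
API_KEY = "YOUR_API_KEY"

def fetch_weekly_traffic() -> pd.DataFrame:
    """Fetch the rolling 7 x 24 hours of hourly traffic index data."""
    response = requests.get(API_URL, params={"key": API_KEY}, timeout=30)
    response.raise_for_status()
    records = response.json()["data"]          # assumed field name
    df = pd.DataFrame(records)                 # assumed columns: timestamp, congestion
    df["timestamp"] = pd.to_datetime(df["timestamp"])
    return df.sort_values("timestamp").reset_index(drop=True)
```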

Problem

The data itself is a rolling 7-day window of hourly time series data. That means when I fetch the data on 7 September at 13:30, I will get the data from 1 September at 14:00 to 7 September at 13:00. When I fetch the data the day after, on 8 September at 8:30, I will get data from 2 September at 9:00 to 8 September at 8:00.

Overlapped data illustration
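To make the overlap concrete, here is a small sketch using the example timestamps above; it shows that two consecutive fetches share most of their rows and only a small slice is genuinely new.

```python
import pandas as pd

# Two consecutive fetches of the rolling 7 x 24-hour window (hourly data),
# using the example timestamps from the paragraph above.
first_fetch = pd.date_range("2021-09-01 14:00", "2021-09-07 13:00", freq="h")
second_fetch = pd.date_range("2021-09-02 09:00", "2021-09-08 08:00", freq="h")

overlap = second_fetch.intersection(first_fetch)   # hours present in both fetches
new_rows = second_fetch.difference(first_fetch)    # only these should be appended

print(len(overlap), len(new_rows))
```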

In importing this kind of data, I could not just fetch the data and put it in a csv file, because that would overwrite the previous data. I also could not simply append the data, because there would be some overlap. A manual workaround would be to save the new data in another csv file, find the cutoff timestamp where the overlap ends, and copy and paste the remainder into the main csv file. But this is not feasible in the long run, so I started brainstorming a design to automate the data import process.

Proposed Data Pipeline Architecture

Knowledge

The first data pipeline flow I came up with was inspired by the mechanism of web caching that I learned in my computer networking class. The internet relies on web caches that store previously fetched web content. So, the next time we try to access a website, we don’t need to travel across the internet and ask the origin server for the content; we can get it from the cache instead. The problem with web caching is that we cannot guarantee that the copy of the website in our cache is up to date.

Source: https://infosolution.biz/wp-content/uploads/2021/04/Web-Cache-Diagram.png

The internet solves this problem with conditional requests. (Side note: the popular HTTP request methods are GET and POST, but there is also the less popular HEAD method. HEAD is functionally similar to GET, except that the server replies with only a response line and headers, no entity body, so the response is very lightweight.) To solve the web caching problem, the cache asks the website’s server, “Hey, I saved a version of your website from this timestamp. Has it been modified between then and now?”. In practice this is a GET request carrying an If-Modified-Since header. The response is just headers with status 304 Not Modified, without a body, if our cached copy is still fresh, or the full website if it is not. Sending a response with only headers saves a lot of unnecessary bandwidth.
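As a minimal illustration of this revalidation idea in Python (using the requests library; the URL is just a placeholder):

```python
import requests

URL = "https://example.com/some-page"  # placeholder URL

# A HEAD request returns only the status line and headers, never a body.
head = requests.head(URL, timeout=10)
last_modified = head.headers.get("Last-Modified")

# A conditional GET: the server answers 304 Not Modified (no body) if the
# resource has not changed since the given timestamp, otherwise it sends
# the full, fresh content. (requests drops the header if the value is None.)
response = requests.get(URL, headers={"If-Modified-Since": last_modified}, timeout=10)
if response.status_code == 304:
    print("Cached copy is still fresh")
else:
    print(f"Content changed, received {len(response.content)} bytes")
```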

Based on this knowledge, I tried to solve the data import problem. Whenever I import and save data to the csv file and the database, I also save the latest timestamp of the imported data to a config file. The next time I try to save a new batch of data, I first consult my config file for the timestamp of the latest traffic data already saved in the database. Then I slice off the data that is newer than the saved timestamp, append it to the csv file, insert it into the database, update the latest timestamp in my config file, and repeat. It worked fine, with no duplicate data and no performance issues.
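A simplified sketch of this first approach, assuming illustrative file names and column names (the database insert is only indicated by a comment):

```python
import json
from pathlib import Path
import pandas as pd

CONFIG_PATH = Path("config.json")     # stores the latest imported timestamp
CSV_PATH = Path("traffic_index.csv")  # single, ever-growing csv (original design)

def incremental_append(new_data: pd.DataFrame) -> None:
    """Append only rows newer than the last saved timestamp."""
    last_ts = pd.Timestamp.min
    if CONFIG_PATH.exists():
        last_ts = pd.Timestamp(json.loads(CONFIG_PATH.read_text())["latest_timestamp"])

    fresh = new_data[new_data["timestamp"] > last_ts]
    if fresh.empty:
        return

    fresh.to_csv(CSV_PATH, mode="a", header=not CSV_PATH.exists(), index=False)
    # (the same rows would also be inserted into the database here)

    CONFIG_PATH.write_text(
        json.dumps({"latest_timestamp": fresh["timestamp"].max().isoformat()})
    )
```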

Then I submitted a pull request and asked my mentors to review my code.

Experience

One of my mentors is a data engineer with industry experience. Upon reviewing my pull request, he raised one concern: by appending data to a single csv file, the data will accumulate over time, and eventually reading from a csv file with that many rows becomes a performance problem. The old approach would therefore incur unnecessary overhead when dealing with an ever-growing csv file.

He proposed another approach, in which the data saved in csv files is partitioned by date. So I revised my system design: first, extract the data from the API endpoint and transform it into a proper storage format and structure for querying and analysis. Then slice the transformed data by date and save each slice in a separate csv file, so each file ends up holding only 24 hours’ worth of data. The data from the csv files is then exported to the database. This new implementation is run by a scheduler once a day.

General workflow of the data pipeline
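A rough sketch of the revised design, partitioning the transformed data into one csv per date before loading it into the database (the output directory and the loading step are again just assumptions):

```python
from pathlib import Path
import pandas as pd

OUTPUT_DIR = Path("data/traffic_index")  # assumed output directory

def save_per_date(transformed: pd.DataFrame) -> None:
    """Write one csv per calendar date, each holding at most 24 hourly rows."""
    OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
    for date, daily in transformed.groupby(transformed["timestamp"].dt.date):
        path = OUTPUT_DIR / f"traffic_{date}.csv"
        daily.to_csv(path, index=False)  # a rerun simply refreshes that date's file

# A scheduler (e.g. a daily cron job) would call save_per_date() on the freshly
# transformed data and then load each csv into the database, upserting on the
# timestamp column to avoid duplicates.
```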

The detail of the implementation of this data pipeline can be found in my other Medium article post.

Reflection

My experience at JSC, my first step in the transition from the fail-safe environment that universities offer to the real-world environment of industry, has been a very humbling one. School projects were difficult. Sometimes they required implementing more complicated algorithms. Other times, as in Operating Systems or Networking class, they required us to code in lower-level programming languages and to do concurrent programming (as opposed to sequential programming).

In my experience with school projects, writing the code was hard, getting it to compile successfully could be a challenge, and debugging was a headache. But once I had successfully compiled my code, written unit tests, and run them against my code, and once those passed, I could safely submit the code for grading and probably never touch it again. Coding was hard, but it had always been a one-time deal. If it worked for small inputs, I held the untested conviction that it would also work for inputs of any size. The projects also usually started from scratch. And sometimes, hours of debugging resulted in spaghetti code that I did not bother to fix because of approaching deadlines; as long as it worked, I would submit it.

Moving into industry, I learned that I needed to revise my perspective on programming. It is no longer a one-time deal. Code needs to be readable; applications need to be reproducible, robust, and scalable. The ability to foresee and anticipate potential points of failure in the system is crucial, like how my mentor pointed out the unnecessary overhead my original design might incur. My original design might work well for some period of time, but in the longer term its performance would decline. It takes more than knowledge; it takes experience and a shift of mindset to create a robust system.

Conclusion

I might have exaggerated the story a little bit, and I do not mean to do injustice to the hard-earned knowledge acquired from school. Knowledge is powerful in that you learn from the experience of world experts, for example the tricks they use to optimize the performance of a system, which become a library you can draw on when building your own system. But knowledge by itself is sometimes not sufficient to build a robust system. Experience is needed to foresee and anticipate potential points of failure in the system.

My experience at Jakarta Smart City has given me invaluable exposure to industry-level system design and programming, to real-world examples of the problems cities face, and to how to make better, faster data-driven decisions. These things are hard to come by from school alone, and I am grateful for the opportunity the Jakarta Smart City Data Science team has entrusted to me to learn under their guidance.

This article was written by Grace Kartika Susanto (Data Science Trainee), Hansen Wiguna (Business Analyst & Lead Sub-Team), and Ayu Andika (Data Analyst) from the Jakarta Smart City Data and Analytics Team. All opinions expressed in this article are those of the authors and do not necessarily represent the views of Jakarta Smart City or the DKI Jakarta Provincial Government.
