How did you get the data? How did you process it? What tools did you use to generate all the beautiful images and videos?
Projects like the data analysis of car2go availability are not really my bread and butter — after all, processing and visualizing 10MM rows of raw data probably isn’t an everyday task for most people. The whole project was a great learning experience for me, and since the original post raised a couple of questions, I decided to follow up with a more technical explanation of how it went from an idea to the finished post.
On one sunny winter day I noticed that the car2go website included a nice, automatically updated map of all their available cars in Berlin. As this map was rendered on the client side, there had to be some connection to the car2go backend, and a quick inspection of the network traffic in the browser confirmed this suspicion.
Every 5 seconds the website was making a request to a specific API endpoint (`/vehicles`) which returned a list of all the available cars in Berlin, together with their address, coordinates, VIN, and license plate (and some other data). The endpoint was the same one documented in the public API docs, only used with a clearly temporary OAuth key (`car2gowebsite`). It was immediately clear to me that this data would make a perfect foundation for the analysis that I had had in my head for some time.
The first and most important thing I needed to do was to start saving this data somewhere (at least temporarily), so that I could analyze it later. A bit of ruby scripting and a cron setup, and we were good to go.
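The original collector was a ruby script run from cron, and the `/vehicles` endpoint no longer exists, so the following is only a hedged sketch in python of what such a collector might look like — the URL, the `placemarks` response shape, and all field names are assumptions, not the actual code:

```python
import csv
import datetime
import json
import os
import tempfile
import urllib.request

# Assumed URL and response shape — the real endpoint has been shut down.
API_URL = "https://www.car2go.com/api/v2.1/vehicles?loc=berlin"

def fetch_vehicles(url=API_URL):
    """One API call: return the list of currently available cars."""
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)["placemarks"]

def save_snapshot(vehicles, path):
    """Append one CSV row per car, stamped with the current minute."""
    ts = datetime.datetime.now().strftime("%Y-%m-%d %H:%M")
    with open(path, "a", newline="") as f:
        writer = csv.writer(f)
        for v in vehicles:
            lon, lat = v["coordinates"][:2]
            writer.writerow([ts, v["name"], v["address"], lat, lon])

# Offline demo with a fabricated response (no network needed)
sample = [{"name": "B-GO1234", "address": "Torstr. 1, 10119 Berlin",
           "coordinates": [13.401, 52.530, 0]}]
demo_path = os.path.join(tempfile.gettempdir(), "car2go_demo.csv")
save_snapshot(sample, demo_path)
```

Running something like this from cron every minute yields one CSV row per car per minute — the shape of the 10MM-row data set described below.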
Initially I was planning to get at least a month of data to have a relevant sample, however, at the beginning of February, car2go started deprecating their API as well as removing the nice map from their website, which meant my plan no longer worked. This left me with around 2 weeks of data — not as much as I wanted, but good for a start.
In hindsight, this step already taught me one big lesson — always get as much data as you can. Initially I focused on and stored only data between 7am and midnight, which slightly limited my available data set. I also did not store the VIN, which, as it turns out, can be used to identify the specific make and model of each car. Well, next time I’ll know better…
What to do with all this data? 🤔
Just push all of it to Postgres, no?
And so I did. I spun up a local Postgres 10 instance (the version is most likely completely irrelevant anyways) and imported all the data points there, doing some light manipulation along the way (e.g. separating the zip code from the address so that I could work with it more easily). This left me with about 10M rows, each representing the position of one car in a specific minute.
Couple of things that I learned on the way:
- Importing a couple thousand CSVs into the DB through ruby was slow as hell, but the Postgres COPY 📑 function made it super easy.
- The same applies to actually processing the data. Manipulating the data directly in the DB (using temporary tables) was WAY faster than doing the same with ruby or python.
- I have no idea about how indexes work and what is the correct way to set them up.
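The post doesn’t include the actual import code, but the COPY approach can be sketched like this — the table name, column list, and file path below are made up for illustration:

```python
import subprocess

# Sketch only: the real table and file names aren't in the post.
def copy_command(csv_path, dbname="car2go"):
    r"""Build a psql call that bulk-loads one CSV via COPY.

    \copy is COPY's client-side twin: the file is read locally and
    streamed to the server, so no server-side file access is needed.
    """
    return ["psql", dbname, "-c", rf"\copy positions FROM '{csv_path}' CSV"]

cmd = copy_command("/data/2018-01-15.csv")
# subprocess.run(cmd, check=True)  # uncomment against a real database
```

One COPY per file replaces thousands of individual INSERTs, which is where the speedup comes from.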
Data cleanup 🛁
Before diving into the actual analysis I wanted to clean the data up a bit, to make sure there were no surprises.
One of the things I found was that a little over 400k rows had no zip code filled in. As the zip code was extracted from the address, something must have been odd with it. It turned out that the majority of those were entries for cars parked at one of the two airports, where the address in the API response did not include a zip code. Others were for locations that either don’t have a specific zip code (cars parked along the A100 and A103), or that were not identified properly.
A couple of manual Google searches later (and manual mapping for the airports) and all of the entries were correct.
There was also a couple of curious cases of entries with zip codes larger than 15000, which should not occur in Berlin. Looking at that data I realized that, as part of the API deprecation process, the address returned on the last day before the shutdown contained only the GPS coordinates, and since my regular expression matched the first 5 consecutive digits, it returned the first 5 decimal places of the latitude. I did not want to spend time determining the specific locations of those cars and deleted the entries instead 🤷.
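The actual regular expression isn’t in the post, but a naive pattern like the one described reproduces the bug nicely — the addresses here are fabricated examples:

```python
import re

# A naive pattern matching the first run of 5 digits in the address
ZIP_RE = re.compile(r"\d{5}")

def extract_zip(address):
    m = ZIP_RE.search(address)
    return m.group(0) if m else None

# Normal address → real zip code
extract_zip("Torstr. 1, 10119 Berlin")   # → "10119"
# Coordinates-only "address" from the last API day → bogus zip
extract_zip("52.51626, 13.37770")        # → "51626" (> 15000, not a Berlin zip)
```

The second case shows exactly how the decimal places of the latitude end up masquerading as a zip code.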
Making sense out of it
One of my favorite questions when interviewing product managers is “How comfortable are you with data analysis and presentation?” With this little project I put myself in front of the same question — how comfortable am I with analyzing this kind of data, at this scale? What tools do I even use?
My initial attempts at processing the data were in ruby, but that turned out to be a dead end, partly for performance reasons, and partly because I could not really find any serious data-analysis tooling for it.
That’s when I turned to python, remembering it as a somewhat quirky, but otherwise really powerful language often used for data processing.
After some experiments I would usually prepare a SQL query that would give me the data that I need, partially aggregated, and then use SQLAlchemy and Pandas to directly load the results into a DataFrame. Afterwards I could do pretty much anything with it, since DataFrame is such a lovely structure for column-based data sets.
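The real setup used SQLAlchemy against Postgres; to keep a sketch self-contained, the same pattern is shown here with Python’s bundled sqlite3 module — table and column names are invented:

```python
import sqlite3
import pandas as pd

# In-memory stand-in for the real Postgres DB (schema is assumed)
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE positions (seen_at TEXT, plate TEXT, zip TEXT);
    INSERT INTO positions VALUES
        ('2018-01-15 08:00', 'B-GO1', '10119'),
        ('2018-01-15 08:00', 'B-GO2', '10245'),
        ('2018-01-15 08:01', 'B-GO1', '10119');
""")

# Pre-aggregate in SQL, then load the (small) result into a DataFrame
df = pd.read_sql(
    "SELECT seen_at, COUNT(*) AS cars FROM positions "
    "GROUP BY seen_at ORDER BY seen_at",
    conn,
)
# df now has one row per minute: 2 cars at 08:00, 1 car at 08:01
```

Doing the heavy aggregation in SQL and only loading the result keeps the DataFrame small, which is exactly why this split was so much faster than processing raw rows in ruby or python.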
Analysis 1: Average availability
Figuring out the average availability over time was fairly straightforward. After aggregating the data per timestamp (truncated to a minute) I would calculate the average, minimum, and maximum value directly in the query, grouping the data set by weekdays and weekend.
The charts were then generated using matplotlib, which plays nicely with the individual columns from the DataFrame.
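A minimal sketch of that aggregation and plotting step, using a fabricated stand-in for the query result (the real data had one row per car per minute, grouped by weekday vs weekend as described above):

```python
import io
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt
import pandas as pd

# Fabricated per-minute availability counts standing in for the real query result
df = pd.DataFrame({
    "minute":  ["08:00", "08:01", "08:00", "08:01"],
    "weekend": [False, False, True, True],
    "cars":    [950, 940, 1010, 1005],
})

# avg/min/max per minute, split weekday vs weekend (mirroring the post's grouping)
agg = df.groupby(["weekend", "minute"])["cars"].agg(["mean", "min", "max"])

fig, ax = plt.subplots()
for weekend, grp in agg.groupby(level="weekend"):
    ax.plot(grp.index.get_level_values("minute"), grp["mean"],
            label="weekend" if weekend else "weekday")
ax.legend()
buf = io.BytesIO()
fig.savefig(buf, format="png")  # one PNG chart, ready to embed
```

Each aggregated column (`mean`, `min`, `max`) can be plotted directly, which is what makes the DataFrame-to-matplotlib handoff so pleasant.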
Analysis 2: Average car density per zip code
The second analysis was definitely a trickier one, since it involved much more data (I needed to group not only by time, but also by each zip code), as well as a much more complex visualization.
Thanks to this project I discovered the whole world of geographical information online, including the GeoJSON and TopoJSON formats, and almost got lost in the whole topic of geoclustering, which is a bigger deal than it might seem at first.
While I initially wanted to base the analysis on automatic clustering, as you can see in the picture the results were not that great. All the experiments with different methods, libraries, and algorithms yielded either too many or too few clusters. At that point I decided to go with a choropleth map instead, calculating and visualizing the density per zip code rather than relying on computed clusters.
The process to get from 10MM rows of raw data in Postgres to the beautiful video that you could see in the original article took a lot of time and experimentation, and can be summarized in the following 4 steps:
- Get Berlin’s geographical data
- Calculate the absolute density of each area over time
- Generate the individual choropleth maps
- Compile the video
Getting Berlin’s geographical data 🗺
To position the analysis in space (and time) I needed to get some sort of geographical data. This was possible thanks to this repository from Berliner Morgenpost, which included, among other things, the TopoJSON for all Berlin’s Postleitzahlen, or zip codes. This raw data was then used to calculate the actual area of each zip code, which is needed for the density calculation.
To generate the choropleth I needed a simplified map of Berlin, ideally in a format that is easy to manipulate programmatically (such as SVG). Getting that data turned out to be much easier than anticipated, as this website has all the different source files available under the Open Database License, based on data from OpenStreetMap contributors. I cleaned up the map a bit before using it, as some areas had little to no car2go traffic and I just did not need them, and added a placeholder for the timestamp, so that I could embed the actual time information while generating the images.
Calculating the absolute density of each area over time 💎
As in the previous case, most of the data work was done directly in the database, first calculating the density per zip code per date and time, then aggregating and averaging the values per zip code and time of the day, and then finally normalizing each density value on a scale from 0.0 to 1.0 using my beloved window functions.
Generating the images 🖼
The resulting dataset, which was pretty much just a list comprising the zip code, time of the day, and the normalized density, was then used to generate a SVG and PNG version of the choropleth map, where each area was filled with the same shade of blue, just with varying opacity (that’s where 0.0 to 1.0 came handy), and with an embedded timestamp for reference.
The SVG manipulation itself was done through ElementTree, finding the nodes representing each zip code (based on their id) and setting their fill attribute to a specific shade of blue. The SVG was then converted to PNG using ImageMagick as part of the processing.
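A minimal sketch of that ElementTree step — the tiny inline SVG, the `zip-` id naming scheme, and the shade of blue are all made up, as the real map file isn’t part of the post:

```python
import xml.etree.ElementTree as ET

# Tiny stand-in for the simplified Berlin map (the real file has one path per zip)
svg = """<svg xmlns="http://www.w3.org/2000/svg">
  <path id="zip-10119" d="M0 0 H10 V10 Z"/>
  <path id="zip-10245" d="M20 0 H30 V10 Z"/>
</svg>"""

NS = {"svg": "http://www.w3.org/2000/svg"}
BLUE = "#1e6fb8"  # made-up shade

def colorize(svg_text, densities):
    """Fill each zip-code path with blue at opacity = normalized density."""
    root = ET.fromstring(svg_text)
    for path in root.findall(".//svg:path", NS):
        zip_code = path.get("id").replace("zip-", "")
        path.set("fill", BLUE)
        path.set("fill-opacity", str(densities.get(zip_code, 0.0)))
    return ET.tostring(root, encoding="unicode")

out = colorize(svg, {"10119": 1.0, "10245": 0.5})
```

Repeating this once per minute of the day, with the matching density values, produces the full set of snapshots.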
The result of this step was 1020 snapshots (each as SVG and PNG), one for each minute of the day, with accurate representation of the density per zip code.
Making it all fast ⚡️
At the beginning I was doing all of this in a single thread, which took a long time. After all, we’re talking about processing some 120k rows and generating more than 1000 high-resolution PNG images.
Thankfully python has solid support for parallel processing, so instead of going through all the data one by one, I split the DataFrame into several groups based on the timestamp and processed them in 4 parallel processes. As opposed to having to wait for ~10 minutes I could get all the results in ~4 minutes, which made it much easier to test, iterate, and fine-tune the results. Going up to 8 processes (obviously) did not lead to further speed up, since my CPU only has 4 threads 🤷️.
```
~/D/c/python >>> time python generate_images.py 1
Reading data from DB
python generate_images.py 1  533.95s user 21.20s system 93% cpu 9:54.11 total

~/D/c/python >>> time python generate_images.py 4
Reading data from DB
python generate_images.py 4  792.82s user 22.94s system 325% cpu 4:10.76 total

~/D/c/python >>> time python generate_images.py 8
Reading data from DB
python generate_images.py 8  827.38s user 30.15s system 255% cpu 5:35.11 total
```
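The split-and-map pattern described above can be sketched as follows; the worker here is a trivial stand-in for the real PNG rendering, and the fork start method assumes a Unix-like system:

```python
import multiprocessing as mp
import numpy as np
import pandas as pd

def render_chunk(chunk):
    """Stand-in for 'generate one PNG per timestamp in this chunk'."""
    return len(chunk)

# 1020 fake per-minute snapshots, split into 4 roughly equal chunks by position
df = pd.DataFrame({"minute": range(1020), "density": 0.5})
chunks = np.array_split(df, 4)

# fork context (Unix only) keeps this runnable as a plain script
with mp.get_context("fork").Pool(4) as pool:
    counts = pool.map(render_chunk, chunks)
# counts sums back to the full 1020 snapshots
```

Each process gets its own chunk of the DataFrame, which is why 4 processes on a 4-thread CPU roughly halves the wall-clock time while adding nothing beyond that.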
What I also found out was that rendering the SVG into a PNG directly in python is much faster than batch converting all the SVGs later using the command-line version of ImageMagick (even without parallelization), which not only took longer, but also totally locked up my terminal.
Making it move 📹
The last step in the process was to compile together all the images into an animation. While I originally wanted to create a GIF, since they’re in fashion now, the results were not very pretty, so I went with a simple video instead.
Everyone’s go-to tool for such purposes is FFmpeg, which not only powers a lot of open-source video editing and conversion tools, but is also super easy to use directly from the command line. Combining all the images into an MP4 file was then just a matter of Googling “ffmpeg combine PNG into MP4” and slightly adapting one of the examples from their wiki. You can see the resulting video in the original post.
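The exact invocation isn’t in the post; this is one plausible PNG-sequence-to-MP4 command built from standard FFmpeg options, with made-up file names:

```python
import subprocess

def ffmpeg_command(pattern="frames/%04d.png", out="availability.mp4", fps=24):
    """One plausible PNG-sequence → MP4 invocation (file names are invented)."""
    return [
        "ffmpeg",
        "-framerate", str(fps),   # input frame rate of the image sequence
        "-i", pattern,            # numbered PNG files: 0001.png, 0002.png, …
        "-c:v", "libx264",        # encode as H.264
        "-pix_fmt", "yuv420p",    # widest player compatibility
        out,
    ]

cmd = ffmpeg_command()
# subprocess.run(cmd, check=True)  # uncomment with ffmpeg installed
```

At 24 fps the 1020 snapshots come out to roughly 42 seconds of video.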
Was it all worth it?
Yes, definitely. I really enjoyed this whole project, not only because I could finally answer some questions I had had for a long time, but also because I got to create something while rediscovering old technologies and learning new ones. I’m sure that knowing how to parallelize computations in python will come in handy at some point in the future, right? RIGHT?
Overall, the whole process of conceptualizing, iterating, debugging, and finally making it work is really rewarding, and while it was also a bit frustrating at times (I’m looking at you, numpy.datetime64), for the last couple of weeks it felt like I was back in my childhood room, working on crazy LEGO creations using small DC engines from a gutted toy train. And that’s what matters.
Thanks for reading ✌️!