Analysis of Warsaw Public Transport Data in Kibana and Elasticsearch

Maciej Szymczyk
Jun 14 · 7 min read

In the previous posts I documented the creation of a data flow using technologies such as Kafka, Kafka Streams, Logstash and Elasticsearch. After a few days of operation, I have enough data to explore what Elasticsearch and Kibana can offer for analyzing urban transport data.

Data

The first record has the timestamp 2020-06-02 20:24:58 (a Tuesday). I took the screenshot below on Sunday evening, so it shows about 4.5 GB collected over 5 days. An API request is sent every 10 seconds. At the time of writing, there are 16 578 668 records under the “ztm” alias.

Image for post

Cardinality

Using the cardinality aggregation, we can check how many lines and vehicles the data set contains. There are about 309 bus lines and 1828 vehicles.

You probably wonder why I wrote “about”. The answer is in the documentation. This query (like many others in Elasticsearch) returns only estimated values: the cardinality aggregation is based on the HyperLogLog++ algorithm, which trades some accuracy for lower memory use. I will explain it on the blog sometime.

POST ztm/_search
{
  "size": 0,
  "aggs": {
    "cardinality_lines": {
      "cardinality": {
        "field": "lines"
      }
    },
    "cardinality_vehicle": {
      "cardinality": {
        "field": "vehicleNumber"
      }
    }
  }
}
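
The accuracy of the estimate can be tuned. Below is a minimal sketch, assuming the request body is built in Python and sent to `ztm/_search` with a client of your choice. The `precision_threshold` option of the cardinality aggregation sets the number of distinct values below which counts are expected to be close to exact (Elasticsearch caps it at 40000). The helper name `cardinality_body` and the value 10000 are my own choices for illustration, not something from the original setup.

```python
import json


def cardinality_body(fields, precision_threshold=10000):
    """Build a search body with one cardinality aggregation per field."""
    if not 0 <= precision_threshold <= 40000:
        raise ValueError("precision_threshold must be between 0 and 40000")
    return {
        "size": 0,  # we only want the aggregation results, not the hits
        "aggs": {
            f"cardinality_{field}": {
                "cardinality": {
                    "field": field,
                    "precision_threshold": precision_threshold,
                }
            }
            for field in fields
        },
    }


body = cardinality_body(["lines", "vehicleNumber"])
print(json.dumps(body, indent=2))
```

A higher threshold costs more memory per aggregation, so it is worth raising only when the exact count matters.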

Map

Viewing city traffic on a map allows analysis using a protein interface (read: the human eye 😊). Below you can see clusters of vehicles that suggest bus loops and depots.

Image for post

Maps in Kibana allow you to add layers based on document aggregations.

Image for post

Heatmap

Unfortunately, the heatmap does not allow aggregation using the average. Technically you can get around this, but for now, without any workarounds, let’s see what we can get from simple record counts.

Filters are available, so let’s check how the heatmap looks for records with a speed greater than 50 km/h.

Image for post

High speeds are most often seen on bridges and transit roads. What will the map look like when we negate this condition?

Image for post

Now the city center and bus depots prevail. Keep in mind that the heatmap shows record counts, not average speeds (a bus standing at a loop keeps sending records).
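
The two filters used above can also be written in the Elasticsearch Query DSL. This is only a sketch of roughly equivalent queries, built as Python dictionaries. The `speed` field is the one used in the queries in this post, while the helper names are mine.

```python
import json


def speed_above(threshold_kmh, field="speed"):
    """Match records whose speed is strictly greater than the threshold."""
    return {"range": {field: {"gt": threshold_kmh}}}


def negate(query):
    """Match every record that the given query does not match."""
    return {"bool": {"must_not": [query]}}


fast = speed_above(50)
not_fast = negate(fast)
print(json.dumps({"query": not_fast}, indent=2))
```

Note that the negated form also matches records that have no `speed` value at all, which a plain `lte: 50` range would leave out.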

Grid rectangles

The next type of aggregation on the map is a grid. Here we can already use the average speed. Let’s take a look at what Warsaw looks like from afar and around the city center.

The values and the size of the squares depend on the area we look at and on the zoom level.
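
Outside of Kibana, a similar grid can be requested with the `geotile_grid` aggregation and an `avg` sub-aggregation. The sketch below is not the request Kibana actually sends. I am assuming a `geo_point` field called `location`, a name that does not appear in this post, so adjust it to your mapping.

```python
import json


def grid_body(precision, geo_field="location", speed_field="speed"):
    """Average speed per map tile; `geo_field` is a hypothetical field name."""
    if not 0 <= precision <= 29:
        raise ValueError("geotile_grid precision must be between 0 and 29")
    return {
        "size": 0,
        "aggs": {
            "grid": {
                # Higher precision means smaller tiles (a closer zoom).
                "geotile_grid": {"field": geo_field, "precision": precision},
                "aggs": {"avg_speed": {"avg": {"field": speed_field}}},
            }
        },
    }


print(json.dumps(grid_body(precision=12), indent=2))
```

Because the precision works like a zoom level, the same area yields fewer, larger cells with different averages at a lower precision, which matches what the squares on the map do.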

Image for post

The darker the shade of blue, the higher the speeds. You can see the distinctive “Wisłostrada”, a route running alongside the Vistula River.

Image for post

Above you can see the approach to Warsaw from the direction of Łomianki. The first thing that stands out is the inaccuracy of the measurements. Buses do not usually end up in the Vistula River or the nearby forests 😉.

In the lower right you can see high speeds on the north bridge. However, let’s focus on the stretch of road near the Młociny forest. At first glance, it’s not so bad. What happens if we add a condition that shows only records of vehicles heading southeast?

Image for post

Speeds have dropped significantly. After all, hundreds of drivers sit in a traffic jam here every morning. Bus drivers are not exempt from this “attraction” 🙄 What about the northwest direction?

Image for post

It’s much faster here. Let’s return to the southeast direction and compare 6:00–9:00 a.m. with 9:00 a.m.–12:00 p.m.

Image for post
6:00–9:00 >> 9:00–12:00
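
How a direction condition like the ones above looks in a query depends on how the heading is stored. As a sketch only, assuming a numeric field with the heading in degrees (0 = north, clockwise), which I call `bearing` here purely for illustration, an eight-point compass sector could be turned into a range filter like this:

```python
# 8-point compass; each sector spans 45 degrees centred on its heading.
SECTORS = ["N", "NE", "E", "SE", "S", "SW", "W", "NW"]


def direction_filter(sector, field="bearing"):
    """Range filter for one compass sector; `field` is a hypothetical name."""
    centre = SECTORS.index(sector) * 45  # raises ValueError for unknown sectors
    low = (centre - 22.5) % 360
    high = (centre + 22.5) % 360
    if low < high:
        return {"range": {field: {"gte": low, "lt": high}}}
    # The north sector wraps around 0/360, so it needs two ranges.
    return {
        "bool": {
            "should": [
                {"range": {field: {"gte": low}}},
                {"range": {field: {"lt": high}}},
            ],
            "minimum_should_match": 1,
        }
    }


print(direction_filter("SE"))
print(direction_filter("NW"))
```

If the data stores the direction as a label instead of degrees, a simple `term` filter on that label would do the same job.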

You can analyze maps at length and with passion. I think the observations above show the value that lies dormant in Kibana maps.

Graphs and charts

Let’s see if any meaningful charts come out of the collected data. First up is the date histogram. The aggregations cover records from 2020-06-03 to 2020-06-07.

Image for post

The above graph is a date histogram of average speed. However, something is missing. It would not be legible for more than a few days. A histogram that groups records from all days by hour of the day would be clearer. However, this is not so simple.
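
For reference, a date histogram of average speed like the one above can also be requested directly from Elasticsearch. This is a sketch, not the request Kibana generates. It assumes Elasticsearch 7.2 or later (for `fixed_interval`) and uses the `@timestamp` and `speed` fields that appear in the queries in this post.

```python
import json


def avg_speed_over_time(start, end, interval="1h"):
    """Average speed per time bucket for records with start <= @timestamp < end."""
    return {
        "size": 0,
        "query": {"range": {"@timestamp": {"gte": start, "lt": end}}},
        "aggs": {
            "over_time": {
                "date_histogram": {
                    "field": "@timestamp",
                    "fixed_interval": interval,
                },
                "aggs": {"avg_speed": {"avg": {"field": "speed"}}},
            }
        },
    }


body = avg_speed_over_time("2020-06-03T00:00:00Z", "2020-06-08T00:00:00Z")
print(json.dumps(body, indent=2))
```

Each bucket is tied to a concrete date and hour, which is exactly why this shape cannot answer "what does a typical 8 a.m. look like" without an extra field.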

Adding hour field

The first thing that came to my mind was Scripted Fields in Kibana. Unfortunately, these fields are read-only and I didn’t manage to aggregate over them.

Image for post

The Script Fields option in an Elasticsearch query will not meet our expectations either. We can display the computed value, but we cannot pass it on to Kibana.

POST ztm/_search
{
  "_source": "@timestamp",
  "script_fields": {
    "test1": {
      "script": {
        "lang": "painless",
        "source": "doc['@timestamp'].value.getHour()"
      }
    }
  }
}

As a result, I had to use Update By Query, which executes a script on each record and adds a new field with the hour derived from the timestamp. New records need the field as well, so the previously created Logstash pipeline has to be updated too.
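
The exact script is not shown here, so the following is only a sketch of what such a request body might look like. It assumes that `@timestamp` is stored in `_source` as an ISO 8601 string and that the new field is called `hour` (a name I picked). The body would be sent to `ztm/_update_by_query`. The small `hour_of` function mirrors the script in plain Python so the logic can be checked without a cluster.

```python
import json
from datetime import datetime, timezone


def add_hour_body(target_field="hour"):
    """Body for _update_by_query that stores the UTC hour of @timestamp."""
    return {
        "script": {
            "lang": "painless",
            # In an update script the timestamp is a string, so parse it first.
            "source": (
                "ctx._source[params.target] = "
                "ZonedDateTime.parse(ctx._source['@timestamp']).getHour()"
            ),
            "params": {"target": target_field},
        },
        # Skip records that already have the field, so reruns are cheap.
        "query": {"bool": {"must_not": [{"exists": {"field": target_field}}]}},
    }


def hour_of(timestamp):
    """Plain-Python equivalent of the script above (hour in UTC)."""
    parsed = datetime.fromisoformat(timestamp.replace("Z", "+00:00"))
    return parsed.astimezone(timezone.utc).hour


print(json.dumps(add_hour_body(), indent=2))
print(hour_of("2020-06-02T20:24:58.000Z"))
```

Like the `getHour()` call in the script field above, this yields the hour in UTC. If the charts should follow local Warsaw time, the offset has to be applied in the script.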

Charts

Personally, I expected a greater drop in speed during peak hours. Interestingly, 22:00 is quite “slow”.

Image for post
Image for post

The median together with the average gives a slightly fuller picture of what urban traffic looks like. Maybe the low speeds between 22:00 and 23:00 are caused by buses standing in the depot?
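
To see why stationary buses could explain such a dip, here is a toy illustration. The numbers are made up for the example and are not taken from the collected data.

```python
from statistics import mean, median


def summarize(speeds):
    """Return (average, median) of a list of speeds in km/h."""
    return mean(speeds), median(speeds)


# Made-up speeds for one hour of the day, purely for illustration.
moving = [22, 25, 28, 31, 35]
parked = [0, 0, 0, 0]  # buses that keep reporting while standing in a depot

print(summarize(moving))
print(summarize(moving + parked))
```

With the parked buses included, the average falls much more than the median, and once parked records become the majority the median drops to zero. Comparing both values per hour is a quick way to spot hours dominated by standing vehicles.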

Image for post

I have tried different combinations of aggregations, and the one below looks interesting. It is an hour-of-day histogram in which each bar is split by a speed histogram with an interval of 10 km/h. The height of the bars is the number of records, so we can see how many buses were active.
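
The same nesting can be expressed as a `histogram` aggregation inside another `histogram`. This is a sketch that assumes the hour was stored in a numeric field named `hour` (my name for it). The `flatten` helper shows how the nested buckets in the response map to rows.

```python
def speed_distribution_by_hour(hour_field="hour", speed_interval=10):
    """Hour-of-day histogram, each bucket split into speed bands."""
    return {
        "size": 0,
        "aggs": {
            "by_hour": {
                "histogram": {"field": hour_field, "interval": 1},
                "aggs": {
                    "by_speed": {
                        "histogram": {"field": "speed", "interval": speed_interval}
                    }
                },
            }
        },
    }


def flatten(response):
    """Turn the nested buckets into (hour, speed_band_start, count) rows."""
    rows = []
    for hour_bucket in response["aggregations"]["by_hour"]["buckets"]:
        for speed_bucket in hour_bucket["by_speed"]["buckets"]:
            rows.append(
                (hour_bucket["key"], speed_bucket["key"], speed_bucket["doc_count"])
            )
    return rows
```

The sum of the counts within one hour gives the height of the whole bar, and the individual rows give the colored segments.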

Image for post
Image for post

Chart for lines 122 and 190

Image for post
Image for post

The fastest bus in the city

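I know this is what you have been waiting for. Which line is the fastest? Here is the query (Kibana could not sort on percentiles). It ranks lines by the 99th percentile of speed; despite its name, the 95_percentile_speed aggregation computes the 99th percentile. Note that the terms aggregation with size 10 first picks the 10 lines with the most records, so the ranking covers only those lines.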

POST ztm/_search
{
  "size": 0,
  "aggs": {
    "by_lines": {
      "terms": {
        "field": "lines",
        "size": 10
      },
      "aggs": {
        "95_percentile_speed": {
          "percentiles": {
            "field": "speed",
            "percents": [
              99
            ]
          }
        },
        "sales_bucket_sort": {
          "bucket_sort": {
            "sort": [
              {
                "95_percentile_speed[99.0]": {
                  "order": "desc"
                }
              }
            ]
          }
        }
      }
    }
  }
}

Results:

{
"key" : "186",
"doc_count" : 225539,
"95_percentile_speed" : {
"values" : {
"99.0" : 65.91829377908059
}
}
},
{
"key" : "112",
"doc_count" : 171393,
"95_percentile_speed" : {
"values" : {
"99.0" : 64.00226838213632
}
}
},
{
"key" : "509",
"doc_count" : 186253,
"95_percentile_speed" : {
"values" : {
"99.0" : 62.05728980093291
}
}
},
{
"key" : "523",
"doc_count" : 176120,
"95_percentile_speed" : {
"values" : {
"99.0" : 59.11082142026221
}
}
},
{
"key" : "190",
"doc_count" : 179619,
"95_percentile_speed" : {
"values" : {
"99.0" : 58.488623481840385
}
}
},
{
"key" : "116",
"doc_count" : 190566,
"95_percentile_speed" : {
"values" : {
"99.0" : 54.07943470657681
}
}
},
{
"key" : "105",
"doc_count" : 179109,
"95_percentile_speed" : {
"values" : {
"99.0" : 53.821592225265405
}
}
},
{
"key" : "189",
"doc_count" : 266184,
"95_percentile_speed" : {
"values" : {
"99.0" : 53.12168077972748
}
}
},
{
"key" : "179",
"doc_count" : 207693,
"95_percentile_speed" : {
"values" : {
"99.0" : 49.59739096543882
}
}
},
{
"key" : "157",
"doc_count" : 171677,
"95_percentile_speed" : {
"values" : {
"99.0" : 46.13637487155884
}
}
}
}
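
The query above sorts only the 10 lines with the most records, because `terms` with `size: 10` selects its buckets by document count before `bucket_sort` runs. To rank all lines, one option is to raise the `terms` size above the number of lines (about 309 according to the cardinality estimate) and let `bucket_sort` keep the top 10. The sketch below is a variant I have not run against this data; the aggregation names and the limit of 500 are my assumptions.

```python
def fastest_lines_body(top_n=10, max_lines=500, percent=99):
    """Rank all lines by a speed percentile and keep the top_n fastest."""
    path = f"speed_percentile[{float(percent)}]"  # e.g. speed_percentile[99.0]
    return {
        "size": 0,
        "aggs": {
            "by_lines": {
                # Large enough to hold every line, not just the busiest ones.
                "terms": {"field": "lines", "size": max_lines},
                "aggs": {
                    "speed_percentile": {
                        "percentiles": {"field": "speed", "percents": [percent]}
                    },
                    "fastest_first": {
                        "bucket_sort": {
                            "sort": [{path: {"order": "desc"}}],
                            "size": top_n,
                        }
                    },
                },
            }
        },
    }


def ranking(response, percent=99):
    """Extract (line, percentile_speed) pairs in the order returned."""
    key = str(float(percent))
    buckets = response["aggregations"]["by_lines"]["buckets"]
    return [(b["key"], b["speed_percentile"]["values"][key]) for b in buckets]
```

Lines with very few records can produce noisy percentiles, so it may be worth adding a minimum record count before trusting such a ranking.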

Conclusions

It’s hard to find surprising conclusions about public transport traffic here. They are more along the lines of “better to be beautiful, healthy and young than old, sick and ugly” 🙂 The analysis also lacks context. Searching for the sake of searching is not very fruitful.

In my opinion, Elasticsearch and Kibana offer quite a few out-of-the-box options. The entire process, from processing to analysis and visualization, is covered by the Basic license. You just have to remember that Elasticsearch is not Spark or a similar engine, so all denormalization and joins should be done before the data is loaded into Elasticsearch.

The data, although pre-cleaned in Kafka Streams, is still a little dirty. The remaining issues range from measurement accuracy and filtering out nonsensical records to marking specific routes and their direction. Data enriched in this way would allow further analysis and conclusions.

If you have an interesting idea, let me know 🙂
