Mining geo-tagged photos
TLDR: This post releases some details of my old unfinished project: an automatic city guide generator based on public geotagged photos. I’ll try to describe some ideas and ways to extract this implicit information signal from the data.
I have owed myself this post for about six years: all the results presented here date from around 2012. For some time I did not publish them, hoping to continue the project. Then I passed it on to my friends, who completely redesigned the original ideas and eventually built a tourist mobile application (which, alas, has since been closed).
Here I publish my initial results in almost their original form. I just rebuilt them with modern technologies, mostly using leaflet.js (with a bunch of plugins) for visualization and tiles from Stamen. Detailed credits are at the end of this post.
The illustrations for this post are mainly based on data from Berlin and Moscow, as I know these two capitals well enough to check the plausibility of the results.
In many respects, this research was initially inspired by the visualizations of cities based on the Flickr dataset, made by Eric Fischer in 2010.
DATASET
For all experiments, I used a dataset of geotagged photos published by users on the Internet. I scraped the first versions of the dataset myself from various services similar to Panoramio. Later, the Yandex.Fotki team provided me with a larger anonymized dataset for analysis, so all the illustrations in this post are based on it.
The dataset structure is pretty simple: one photo per line, with each photo described by 5 fields: unique photo id, unique author id, latitude, longitude, timestamp. Here is a tiny random sample:
DOTS AND CLUSTERS
The first thing you can do with such a dataset is to draw all the dots:
Obviously, the dataset mostly consists of Russian photos, but large European and American cities are also present.
However, it’s not very interesting to draw just the photo locations; it’s much more useful to visualize the relative density of dots, for example, on a heat map:
Indeed, the photos clearly condense around the main attractions and form distinct groups. On this map of Berlin, for example, you can clearly see Schloss Charlottenburg, Berlin Zoo, Ku’damm, East Side Gallery, et cetera. The big blobs in the center also break up into distinct clusters as you zoom in. Is it possible to detect these areas automatically so that each cluster corresponds to a meaningful zone for walking and its center coincides with some attraction?
To check this, I tried several algorithms for clustering points on a plane. There are a lot of such algorithms and they can give noticeably different results. For example, here’s how the algorithm from Leaflet.markercluster forms the clusters:
You can see that the centers of the selected clusters are shifted away from the regions of maximum density, so this algorithm is not very suitable for our purposes. After a series of experiments, I found that MeanShift with a Gaussian kernel produces stable, decent results. It does not require specifying the number of resulting clusters and uses only a sensitivity threshold parameter, which is also very handy.
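As a rough illustration of the idea, here is a minimal pure-Python sketch of mean shift with a Gaussian kernel. It is not the code used for the maps in this post. The function names, the `bandwidth` parameter (standing in for the sensitivity threshold) and the greedy `group_modes` step are assumptions made for the example, and coordinates are assumed to be already projected to a locally flat plane (for example, metres).

```python
import math

def mean_shift(points, bandwidth, n_iter=50, tol=1e-6):
    """Shift every point uphill on a Gaussian kernel density estimate.

    points: list of (x, y) pairs in a locally flat projection.
    Returns one converged position (mode) per input point.
    """
    modes = []
    for px, py in points:
        x, y = px, py
        for _ in range(n_iter):
            wsum = sx = sy = 0.0
            for qx, qy in points:
                d2 = (qx - x) ** 2 + (qy - y) ** 2
                w = math.exp(-d2 / (2 * bandwidth ** 2))  # Gaussian weight
                wsum += w
                sx += w * qx
                sy += w * qy
            nx, ny = sx / wsum, sy / wsum  # weighted mean of all points
            shift = math.hypot(nx - x, ny - y)
            x, y = nx, ny
            if shift < tol:
                break
        modes.append((x, y))
    return modes

def group_modes(modes, merge_radius):
    """Greedy grouping (an assumption of this sketch): a mode within
    merge_radius of an existing center gets that center's label."""
    centers, labels = [], []
    for mx, my in modes:
        for i, (cx, cy) in enumerate(centers):
            if math.hypot(mx - cx, my - cy) <= merge_radius:
                labels.append(i)
                break
        else:
            centers.append((mx, my))
            labels.append(len(centers) - 1)
    return centers, labels

# Two tiny synthetic groups of "photos", far apart from each other.
points = [(0, 0), (1, 0), (0, 1), (1, 1),
          (100, 100), (101, 100), (100, 101), (101, 101)]
centers, labels = group_modes(mean_shift(points, bandwidth=5.0), merge_radius=5.0)
```

This version is quadratic in the number of photos per iteration, so a real dataset would need spatial indexing or binning first. The appeal of the approach is that each cluster center lands on a local density maximum, which is what the marker-clustering approach lacked.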
Manual verification showed that the algorithm accurately identifies the main attractions, including temples, palaces, parks and cozy places for walking with beautiful views.
Let’s check it on the center of Moscow:
To color the clusters I used a simple heuristic from Eric Fischer: if a given photographer has old photos (for example, older than a couple of weeks) from the same city, you can consider him a local. If all of his previous photos were taken in other cities, you can consider him a tourist. This partition of the authors allows you to divide clusters into tourist attractions (red) and cozy places chosen by the locals (blue). Clusters without a clear tourist or local character are shown in yellow.
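The heuristic above could be sketched roughly as follows. This is an illustration only: it assumes that each photo has already been assigned a city name and that timestamps are Unix seconds. The function names, the "unknown" label and the `majority` threshold for coloring are hypothetical choices, not values from the original project.

```python
DAY = 86400  # seconds in a day

def classify_author(author_photos, city, min_history_days=14):
    """author_photos: list of (timestamp, city_name) for a single author.

    Returns "local", "tourist", "unknown", or None if the author
    has no photos in the given city.
    """
    in_city = sorted(ts for ts, c in author_photos if c == city)
    if not in_city:
        return None
    latest = in_city[-1]
    # Old photos from the same city: treat the author as a local.
    if latest - in_city[0] > min_history_days * DAY:
        return "local"
    # No old photos here, but earlier photos elsewhere: a tourist.
    if any(c != city and ts < latest for ts, c in author_photos):
        return "tourist"
    return "unknown"

def cluster_color(author_labels, majority=0.66):
    """Color a cluster by the share of tourists among its classified authors.
    The 0.66 cut-off is an arbitrary value chosen for this sketch."""
    known = [label for label in author_labels if label in ("local", "tourist")]
    if not known:
        return "yellow"
    tourist_share = known.count("tourist") / len(known)
    if tourist_share >= majority:
        return "red"
    if tourist_share <= 1 - majority:
        return "blue"
    return "yellow"

local = classify_author([(0, "Berlin"), (30 * DAY, "Berlin")], "Berlin")
tourist = classify_author(
    [(0, "Moscow"), (100 * DAY, "Berlin"), (101 * DAY, "Berlin")], "Berlin")
```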
To automatically generate a description of a cluster, I used a few photos closest to its center, as well as information from the geo-APIs of services such as Wikipedia, Wikimapia and Foursquare.
I also noticed that the attraction clusters separate well by their popularity profile over time (months, days of the week, hours). For example, parks and manors are more popular on weekends, in summer and in early autumn, while beautiful views of the city are more popular in the evening hours. But I never managed to use this idea.
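Such a popularity profile is easy to compute per cluster. The sketch below is only an illustration of the idea, since it was never used in the project. It assumes Unix timestamps and, for simplicity, interprets them in UTC rather than in the local time of the city.

```python
from collections import Counter
from datetime import datetime, timezone

def popularity_profile(timestamps):
    """Return normalised histograms of photo counts by month,
    day of the week (0 = Monday) and hour for one cluster."""
    months, weekdays, hours = Counter(), Counter(), Counter()
    for ts in timestamps:
        dt = datetime.fromtimestamp(ts, tz=timezone.utc)
        months[dt.month] += 1
        weekdays[dt.weekday()] += 1
        hours[dt.hour] += 1
    n = len(timestamps) or 1  # avoid division by zero for empty clusters

    def normalise(counter, keys):
        return [counter[k] / n for k in keys]

    return {
        "month": normalise(months, range(1, 13)),
        "weekday": normalise(weekdays, range(7)),
        "hour": normalise(hours, range(24)),
    }

# A single photo taken on Saturday, 7 July 2012, at 15:00 UTC.
sample_ts = datetime(2012, 7, 7, 15, tzinfo=timezone.utc).timestamp()
profile = popularity_profile([sample_ts])
```

Clusters could then be compared by these vectors, for example to separate "weekend park" profiles from "evening viewpoint" profiles.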
TRACKS AND ROUTES
Having information about the author of the photo and the time of the shot, you can construct sessions describing the successive movements of each photographer, and connect them with lines:
It’s pretty, but not very informative in this form. However, you can clean the tracks a little and get a more useful signal. Since we are mostly interested in pedestrian walks, it is possible to exclude routes containing large “jumps” in time or space. Also, knowing the distance between the photos and the time difference between the shots, it’s possible to estimate the speed of the photographer’s movement and remove segments with a speed above some threshold, say, 10 km/h.
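A minimal sketch of this cleaning step might look like the following. The 10 km/h speed limit comes from the text above, while `max_gap_s` and `max_jump_km` are hypothetical values picked for the example. For brevity the sketch drops individual segments, whereas the description above excludes whole routes that contain large jumps.

```python
import math
from collections import defaultdict

EARTH_RADIUS_KM = 6371.0

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points given in degrees."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def walking_segments(photos, max_gap_s=3600, max_jump_km=2.0, max_speed_kmh=10.0):
    """photos: iterable of (photo_id, author_id, lat, lon, timestamp).

    Returns (author_id, (lat, lon), (lat, lon)) for consecutive shots
    of the same author that look like a pedestrian move.
    """
    by_author = defaultdict(list)
    for _, author, lat, lon, ts in photos:
        by_author[author].append((ts, lat, lon))

    segments = []
    for author, shots in by_author.items():
        shots.sort()  # chronological order within one author
        for (t0, la0, lo0), (t1, la1, lo1) in zip(shots, shots[1:]):
            dt = t1 - t0
            if dt <= 0 or dt > max_gap_s:
                continue  # jump in time (or duplicate timestamp)
            dist = haversine_km(la0, lo0, la1, lo1)
            if dist > max_jump_km:
                continue  # jump in space
            if dist / (dt / 3600.0) > max_speed_kmh:
                continue  # too fast for a walk
            segments.append((author, (la0, lo0), (la1, lo1)))
    return segments

photos = [
    (1, "a", 52.5163, 13.3777, 0),    # near the Brandenburg Gate
    (2, "a", 52.5186, 13.3762, 600),  # short walk, 10 minutes later
    (3, "a", 52.5286, 13.3762, 720),  # about 1.1 km in 2 minutes: too fast
]
segments = walking_segments(photos)
```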
That’s what remains, for example, in the center of Berlin and in the center of Moscow:
It is a bit dirty, but all the main walking routes and park areas are here. How can you further purify this signal? I used an iterative merge of overlapping segments into generalized routes, and then added a couple of simple heuristics: one to clean up excess intersections and one to add closing segments in situations where the extensions of existing segments intersect.
Finally, I smoothed the resulting segments with Bezier curves:
Here are the results for a district in the center of Moscow:
You can see well-defined squares, embankments and park areas; some walking streets are also drawn. Note that the OSM road graph was not used at all; with it, the results would probably be better.
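For the Bezier smoothing step mentioned above, one simple option is to treat the vertices of a merged route as control points of a single Bezier curve and sample it with de Casteljau's algorithm. This is only a sketch of one possible approach: the exact smoothing scheme used for the pictures is not described here, and `smooth_route` and `samples` are names invented for the example.

```python
def de_casteljau(control, t):
    """Evaluate a Bezier curve with the given control points at t in [0, 1]
    by repeated linear interpolation between neighbouring points."""
    pts = list(control)
    while len(pts) > 1:
        pts = [((1 - t) * x0 + t * x1, (1 - t) * y0 + t * y1)
               for (x0, y0), (x1, y1) in zip(pts, pts[1:])]
    return pts[0]

def smooth_route(route, samples=20):
    """Replace a polyline by `samples` points along its Bezier curve.
    Routes with fewer than 3 vertices are returned unchanged."""
    if len(route) < 3 or samples < 2:
        return list(route)
    return [de_casteljau(route, i / (samples - 1)) for i in range(samples)]

route = [(0, 0), (1, 2), (2, 0)]  # a sharp corner at (1, 2)
smooth = smooth_route(route, samples=5)
```

The curve keeps the endpoints of the route and rounds off the corners in between. For long routes a single high-degree curve drifts far from the original vertices, so chaining low-degree pieces would likely be a better fit.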
WIND AND FIELDS
Just when I was experimenting with cleaning and smoothing all these routes, Fernanda Viégas and Martin Wattenberg published their beautiful WindMap visualization. So I came up with the idea of visualizing the route data with a similar technique.
I used the metaphor of a magnetic field: let’s pretend each of the route segments is a magnet pulling dots from one of its poles to the other.
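One simple way to turn this metaphor into numbers is sketched below. It is an assumption-laden illustration, not a physical dipole model and not necessarily the formula used for the pictures in this post. Each segment contributes its unit direction vector, weighted by a Gaussian falloff of the distance from the query point to the segment. The `sigma`, `step` and `n_steps` parameters are hypothetical.

```python
import math

def closest_point_on_segment(p, a, b):
    """Project point p onto segment ab and clamp to the segment."""
    ax, ay = a
    bx, by = b
    dx, dy = bx - ax, by - ay
    len2 = dx * dx + dy * dy
    if len2 == 0:
        return a
    t = ((p[0] - ax) * dx + (p[1] - ay) * dy) / len2
    t = max(0.0, min(1.0, t))
    return (ax + t * dx, ay + t * dy)

def field_at(p, segments, sigma=1.0):
    """Sum of unit direction vectors of all segments, each weighted
    by exp(-d^2 / (2 sigma^2)), d = distance from p to the segment."""
    fx = fy = 0.0
    for a, b in segments:
        dx, dy = b[0] - a[0], b[1] - a[1]
        length = math.hypot(dx, dy)
        if length == 0:
            continue  # degenerate segment has no direction
        cx, cy = closest_point_on_segment(p, a, b)
        d2 = (p[0] - cx) ** 2 + (p[1] - cy) ** 2
        w = math.exp(-d2 / (2 * sigma ** 2))
        fx += w * dx / length
        fy += w * dy / length
    return fx, fy

def advect(p, segments, step=0.1, n_steps=10, sigma=1.0):
    """Move a particle along the field with simple Euler steps."""
    x, y = p
    trail = [(x, y)]
    for _ in range(n_steps):
        fx, fy = field_at((x, y), segments, sigma)
        x += step * fx
        y += step * fy
        trail.append((x, y))
    return trail

segments = [((0, 0), (10, 0))]  # one route segment pointing east
trail = advect((0, 0), segments, step=0.5, n_steps=4)
```

With this choice, segments walked in opposite directions cancel each other, so the field highlights places with a dominant direction of movement. Whether that is desirable depends on what the visualization should emphasize.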
So it’s possible to calculate the total magnetic field for the whole set of routes, and then use it as the basis for a visualization of particle motion:
Drawn on top of the map, it shows the main city routes and poles of interest:
THANKS AND CREDITS
To conclude the article, I want to list all those who somehow contributed to this old story. Thanks to:
- Eric Fischer for his beautiful visualizations and clever ideas,
- Aleksander Krainov for participating in the discussion and development of my ideas, as well as for helping me obtain the data,
- Yandex.Fotki team for the dataset provided,
- Elena Kolmanovskaya and the whole Yandex.Progulki team for their attempt to turn these strange experiments into a working service,
- Stamen team for, perhaps, the best open map tiles on the Internet,
- Leaflet.js team for, perhaps, the best open-source map engine on the Internet,
- Fernanda Viégas and Martin Wattenberg for their beautiful WindMap visualization,
- Danny Cochran for a Leaflet.js Windable plugin, which implements the wind field visualization,
- Dmitry Laptev for our joint research on proper area-of-interest detection.