Week 6 — Landmark Identifier

Serhat Sağlık · bbm406f17 · Dec 31, 2017

After a long time with no posts, we are going to talk about the data we collected from different resources. TL;DR: even though collecting data is easy, cleaning the noise out of it is a painful process.

To start with the resources we harvested our data from: we used Flickr, Google Places, and Google Images. Let us tell you more about the scripts we used to do the “harvesting”.

Flickr is an image sharing platform.

For Flickr, we used the Flickr API with a Python wrapper (https://github.com/sybrenstuvel/flickrapi, https://www.flickr.com/services/api/). The Flickr API itself is very interesting and easy to use, and the wrapper made it even easier. Searching for pictures is as easy as writing:

```python
photos = flickr.walk(text=keyword, tags=keyword, per_page=200, sort="relevance")
```

After this, all you have to do is build a URL from the id, server and some other fields that come with each photo object, and download the image. For the downloading part, we limited the photo count to 200 per class, because past a certain point the results get irrelevant and start containing more noise: cleaning 30 noisy pictures out of 200 is much easier than cleaning 200 noisy pictures out of 500. Even though the noise was not that bad, it wasn’t good either. There were plenty of selfies and pictures of kids shared by their parents. Flickr wouldn’t be my first choice for sharing my kids’ pictures, but apparently most people don’t think this way.
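Putting those pieces together, the download loop looked roughly like the sketch below. This is a minimal version, not our exact script: `api_key`, `api_secret` and `keyword` are placeholders you supply yourself, and the farm-style static URL is the format Flickr documented at the time.

```python
import flickrapi
import requests

flickr = flickrapi.FlickrAPI(api_key, api_secret)  # your own credentials

photos = flickr.walk(text=keyword, tags=keyword, per_page=200, sort="relevance")

for i, photo in enumerate(photos):
    if i >= 200:  # cap each class at 200 pictures, as explained above
        break
    # build the static image URL from the fields of the photo element
    url = "https://farm{}.staticflickr.com/{}/{}_{}.jpg".format(
        photo.get("farm"), photo.get("server"),
        photo.get("id"), photo.get("secret"))
    with open("{}_{}.jpg".format(keyword, i), "wb") as f:
        f.write(requests.get(url).content)
```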

As we said, the noise gets ridiculous. This image is supposedly from “Boğaziçi Köprüsü” (the Bosphorus Bridge).

After getting a maximum of 200 pictures from each of almost 200 classes, we noticed that some classes had fewer pictures than others. To fix that, we changed our keywords slightly and re-downloaded photos for the missing classes.

After finishing our job with Flickr, we had around 3 GB of data, which is of course far less than necessary. Then we started mining data from Google Places, which was much harder than Flickr because Google wasn’t as generous: the Google Places API lets you download at most 10 pictures per place. As we hate giving up so easily, we took this as a challenge and tried to beat the Google Places API, which we did, indeed.
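To show what that limit looks like in practice, here is a minimal sketch against the official Place Details and Place Photos endpoints (the endpoint and parameter names follow Google’s documentation; `place_id` and the API key are placeholders). The details response simply never contains more than 10 photo references, no matter how many photos a place actually has.

```python
import requests

API_KEY = "YOUR_API_KEY"  # placeholder: your own Google API key
DETAILS_URL = "https://maps.googleapis.com/maps/api/place/details/json"
PHOTO_URL = "https://maps.googleapis.com/maps/api/place/photo"

details = requests.get(DETAILS_URL,
                       params={"place_id": place_id, "key": API_KEY}).json()

# "photos" holds at most 10 entries — this is the hard limit of the API
for i, p in enumerate(details["result"].get("photos", [])):
    img = requests.get(PHOTO_URL, params={
        "maxwidth": 1600,
        "photo_reference": p["photo_reference"],
        "key": API_KEY,
    })
    with open("{}_{}.jpg".format(place_id, i), "wb") as f:
        f.write(img.content)
```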

Google Places, also known as Google My Business, is a platform mostly used by business owners to put pictures of and information about their companies on Google.

We wrote a web scraper using the Selenium library for Python and some macro code that used the win32 API for Python. The main problem we encountered while downloading from Google Places was the photo limit; after some sleepless nights, we found a workaround for that too. The code for getting images from both Google Places and Flickr can be found on GitHub and can be modified freely and easily for your own use.
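We won’t spell out the exact workaround here, but the general shape of such a Selenium scraper is sketched below: open the photo gallery of a place, scroll so the lazy-loaded thumbnails appear, then collect the image URLs. Everything in it (the `place_photos_url` variable, the scroll counts, the selectors) is a placeholder, not our actual code.

```python
import time
import requests
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get(place_photos_url)  # placeholder: gallery page of one place

# scroll a few times so more thumbnails get lazy-loaded
for _ in range(10):
    driver.execute_script("window.scrollBy(0, 1000);")
    time.sleep(1)

# grab every loaded image URL and download it
for i, img in enumerate(driver.find_elements(By.TAG_NAME, "img")):
    src = img.get_attribute("src")
    if src and src.startswith("http"):
        with open("place_{}.jpg".format(i), "wb") as f:
            f.write(requests.get(src).content)

driver.quit()
```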

The third resource we used for getting our data is our reliable fellow Google Images. We used it to fill out the classes that were still short on data. For Google Images we didn’t write any script; it was all done by hand, except for the Firefox plugin we used to download every image from a Google Images search. The reason for the manual work was also to clean the noise left over from the previous cleanings: we searched for every place by hand and cleaned the noise after downloading.

Lastly, after downloading our data from various resources, we ended up with more than 30,000 pictures, almost 6 GB in size, across 176 classes. The final dataset is “almost” noise-free. Almost, because it is possible that some irrelevant pictures slipped past us when our eyes were hurting. To be honest, we didn’t check every single picture; it was more like skimming the folders. But getting 99% noise-free images in 2 days is easier than getting 100% in 2 weeks.
