Identifying Berlin birds. Part I.

Gaiar Baimuratov
6 min read · May 3, 2019


Motivation

After I relocated to Berlin from overpopulated Moscow, I was surprised by how vivid and alive nature is here. You can find rabbits, foxes, and even raccoons and wild boars just wandering around the city.

No one is surprised by a fox in the center of the city

But Berlin's nature surprises you with more than just animals. The city is full of birds: ducks and swans on almost every lake, seagulls flying over Alexanderplatz, and jays in Humboldthain Park.

Following my parents' tradition, one of the first things I put on our balcony was a bird feeder (even though it is only "semi-allowed" in our building). And then I suddenly realized that I was almost unable to identify the birds having lunch next to our windows.

The more birds I captured on camera, the more puzzled I became. There were so many kinds of birds I had never seen before. Of course, by now I can classify around 10–15 species without any hesitation, but at the beginning it was an absolute mystery.

The bird on the right is probably a Kestrel. The one on the left I've identified as a Common Wood Pigeon

And what does a regular person do when they don't know something? Correct, they buy a book. So did I. I found a wonderful classifier called "Vogel in Europa" ("Birds in Europe").

A Green Woodpecker?! I didn't even know they existed until I met one.

And yes, it helped. At least I learned some German bird names.

But you won't always have the book with you on your trips, and to be honest, most of the time I forgot it on the bookshelf.

Switching from Lo-Tech

Around the same time, I started exploring Computer Vision and Deep Learning. And since "to a man with a hammer, everything looks like a nail", I picked up the idea of automating bird identification. And that is my story.

Preparing the dataset

Looking for the bird names

It is amazing that the City of Berlin has a website that provides almost any information: from visa and immigration questions all the way to aerial shots from the beginning of the 20th century.

Aerial shot of the North railway station area from 1928. The station no longer exists today.

My favorite sections of the berlin.de website cover green walking routes and the flora and fauna species inhabiting the city. I used one of those sections as the starting point for building the dataset.

Since the city provides information about Berlin's bird species, let's pull it and convert it into a structured form such as CSV.

The most common way of getting data from websites (aka scraping) with Python is using a library called BeautifulSoup.

Preparing the environment

For all parts of my project, I'm using the conda package manager.

Installing conda

Skip this part if you are already using conda.
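A minimal sketch for Linux x86_64 (the installer file name differs per platform; grab the right one from https://docs.conda.io):

    # Download and run the Miniconda installer (Linux x86_64 example)
    wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
    bash Miniconda3-latest-Linux-x86_64.sh
    # Reload the shell so the conda command becomes available
    source ~/.bashrc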

Creating an environment

To avoid messing up existing Python programs and packages, I prefer keeping every project in an isolated environment.
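A sketch of the setup (the environment name matches the output shown below; the Python version is my assumption):

    # Create an isolated environment for this project
    conda create --name berlin-birds python=3.7
    # Activate it
    conda activate berlin-birds
    # Check which Python interpreter is now active
    which python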

Running these commands should give you output similar to:

/YOUR_USER/miniconda3/envs/berlin-birds/bin/python

If your output is different, try the following steps:

  • Check that conda is installed by running conda info
  • If the command above shows an error, try running source ~/.bashrc
  • If nothing helps, log off from the current session and log in again. Or just restart your computer

Installing packages

To pull bird information from the website, we'll need two additional packages:

  • Requests — for easy HTTP request handling
  • Beautiful Soup — for parsing HTML page structure and extracting information
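Both can be installed into the active environment in one go (a sketch; pip inside the conda environment works just as well):

    # Install the scraping dependencies into the active conda environment
    conda install requests beautifulsoup4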

Scraping the website

I'm not going to dive into the details of using the Beautiful Soup library; I hope the code itself is self-explanatory.

You may notice that I've added a list of user agents. That is because, before I figured out the exact steps and page elements for extracting the data, I was sending more requests than a usual browser would and could have been banned as a bot.
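Here is a minimal sketch of the approach; the URL and the CSS selector are placeholders I've assumed, so adapt them to the actual berlin.de page structure:

    import csv
    import random

    import requests
    from bs4 import BeautifulSoup

    # NOTE: the URL and the selector below are illustrative assumptions
    URL = "https://www.berlin.de/path/to/bird-species-page"  # hypothetical

    # Rotate user agents so the scraper looks like a regular browser
    USER_AGENTS = [
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14) AppleWebKit/537.36",
    ]

    response = requests.get(URL, headers={"User-Agent": random.choice(USER_AGENTS)})
    response.raise_for_status()

    soup = BeautifulSoup(response.text, "html.parser")

    # Assume each species name is a link inside a list; adjust as needed
    names = [a.get_text(strip=True) for a in soup.select("ul li a")]

    # Save the raw names into a CSV for the cleanup step
    with open("berlin-birds.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["name"])
        for name in names:
            writer.writerow([name])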

Cleaning the data

Now open the CSV file and check its structure. The combined name strings need to be split into separate name columns.

As you may see, the results are mostly good, except for a few names like "Amsel Schwarzdrossel", which is actually an original name plus a synonym. For cases like that, I've added an additional column called "Alternative name". In addition, I've added a Wikipedia link and the Russian name (as it is my native language). It is worth mentioning that I renamed "Star" to "Gemeiner Star", as we won't need images of ⭐️ but photos of 🐦.
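A sketch of that cleanup with pandas; the column names are my assumptions, and the naive first-space split is only an illustration of handling double names like "Amsel Schwarzdrossel" (the real curation was partly manual):

    import pandas as pd

    df = pd.read_csv("berlin-birds.csv")

    # Split double names (name + synonym) on the first space.
    # NOTE: column names here are illustrative assumptions.
    mask = df["name"].str.contains(" ")
    df.loc[mask, "alternative_name"] = df.loc[mask, "name"].str.split(" ").str[1]
    df.loc[mask, "name"] = df.loc[mask, "name"].str.split(" ").str[0]

    # "Star" alone would fetch ⭐️ images, so make the query unambiguous
    df.loc[df["name"] == "Star", "name"] = "Gemeiner Star"

    df.to_csv("berlin-birds-extended.csv", index=False)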

Enriched list of Berlin birds

Downloading dataset images

Now that I've got a curated list of birds inhabiting Berlin, I can start building an image dataset.

Sources of the images

There are different ways of getting images for training a neural network, and each comes with its pros and cons. You can collect images yourself, buy labor on crowdsourcing websites, use APIs like Flickr or Bing Image Search, or just use Google Images as a data source. I've selected the latter, as it has access to virtually any image website and is supported by Python libraries like google-images-download (https://github.com/hardikvasa/google-images-download).

Install Google Images Download

First, install the library:
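    # Install google-images-download from PyPI into the active environment
    pip install google_images_download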

Using Google Images Download

Google Images Download (hereafter GID) has a very flexible configuration. I'm going to review the parts I'm using for my project; for the rest, I recommend checking the official manual on GitHub.

There are two ways of using GID: as a CLI application, by calling googleimagesdownload [Arguments…], or directly from a Python application via from google_images_download import google_images_download.

I’ll be working with the second one.
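In Python, the entry point looks like this (arguments are passed later as a dict, shown in the setup section below):

    from google_images_download import google_images_download

    # Instantiate the downloader; configuration is passed to .download() as a dict
    downloader = google_images_download.googleimagesdownload()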

Setup Google Images Download

I won't describe every single parameter, as most of them are self-explanatory, but here are a few that I think are not straightforward until you discover them.

ChromeDriver — As I need more than 100 photos for every species, GID requires ChromeDriver for downloading that many images. Make sure you have it installed and provide its path to the library.

Stopwords — I discovered that along with bird photos, I received images of cafes, magazines, and football teams. To avoid junk downloads, prepare a list of stopwords, but remember that Google doesn't allow search terms containing more than 32 words.

Enable logging — Log all your actions to save time when debugging errors. You can pick up examples from my code.
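Putting those pieces together, a configuration sketch (paths and stopwords are example values; I'm assuming stopwords are appended as negative search terms, which count toward Google's 32-word limit):

    # Stopwords become negative search terms appended to the query
    stopwords = ["cafe", "magazin", "fussball"]  # example values
    query = "Gemeiner Star " + " ".join("-" + w for w in stopwords)

    arguments = {
        "keywords": query,
        "limit": 300,                             # >100 images requires ChromeDriver
        "chromedriver": "/path/to/chromedriver",  # adjust to your install
        "output_directory": "data",               # species folders are created here
        "format": "jpg",
        "print_urls": False,
    }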

Download images

I've wrapped all the parts mentioned above into a reusable script.

The script needs:

  1. berlin-birds-extended.csv — the source list of bird names
  2. stopwords — a file with words to ignore, one word per line
  3. proxies.lst — a list of proxies (optional)

As output, I get one folder per species, filled with bird images. All folders are located in the data folder.
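A sketch of what that wrapper can look like (the CSV column name and the chromedriver path are assumptions; proxy handling is omitted for brevity):

    import csv

    from google_images_download import google_images_download

    downloader = google_images_download.googleimagesdownload()

    # One stopword per line, turned into negative search terms
    with open("stopwords") as f:
        stopwords = [line.strip() for line in f if line.strip()]
    negative_terms = " ".join("-" + w for w in stopwords)

    with open("berlin-birds-extended.csv", newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            name = row["name"]  # assumed column name
            downloader.download({
                "keywords": name + " " + negative_terms,
                "limit": 300,
                "chromedriver": "/path/to/chromedriver",
                "output_directory": "data",
                "image_directory": name,  # one folder per species
            })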

Sort and clean downloaded images

Of course, images freshly downloaded from Google search can't be used as is. The file names are cryptic, some "images" are HTML responses, and some are empty files.

The easiest way I found was to check each file's contents: if it matched an image type, rename and keep it; otherwise, delete it.

There is a nice Python library called filetype that does exactly what I needed: https://github.com/h2non/filetype.py
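A sketch of the cleanup pass with filetype, assuming the data folder layout described above:

    import os

    import filetype

    DATA_DIR = "data"

    for species in os.listdir(DATA_DIR):
        folder = os.path.join(DATA_DIR, species)
        if not os.path.isdir(folder):
            continue
        for i, fname in enumerate(sorted(os.listdir(folder))):
            path = os.path.join(folder, fname)
            kind = filetype.guess(path)  # inspects magic bytes, not the extension
            if kind is None or not kind.mime.startswith("image/"):
                os.remove(path)  # HTML responses, empty files, etc.
            else:
                # Rename cryptic file names to a clean sequential scheme
                new_path = os.path.join(folder, f"{species}_{i:04d}.{kind.extension}")
                os.rename(path, new_path)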

Extra: Removing duplicate images

In my case, another thing that needed cleanup was the large number of duplicate images. Photos of lesser-known species were especially affected. With the help of StackOverflow, I came up with a small script that removes duplicate images from all folders.
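A sketch of the deduplication idea: hash each file's bytes and delete repeats within a folder (an exact-bytes hash catches identical copies only, not resized variants):

    import hashlib
    import os

    DATA_DIR = "data"

    for species in os.listdir(DATA_DIR):
        folder = os.path.join(DATA_DIR, species)
        if not os.path.isdir(folder):
            continue
        seen = {}
        for fname in sorted(os.listdir(folder)):
            path = os.path.join(folder, fname)
            with open(path, "rb") as f:
                digest = hashlib.md5(f.read()).hexdigest()
            if digest in seen:
                os.remove(path)  # exact byte-for-byte duplicate
            else:
                seen[digest] = path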

Summary

That was the first part of my project to train a neural network to detect the bird species inhabiting Berlin and the surrounding areas.

Using the code described above, I've collected more than 300 photos of every bird species, which should be more than enough for training the network well.

Stay tuned for the upcoming parts!

Part I: Building the dataset

Part II: Sorting the images using a neural network

Part III: Training MobileNet using a Transfer Learning approach
