Identifying Berlin birds. Part I.
Motivation
After I relocated to Berlin from overpopulated Moscow, I was surprised by how vivid and alive nature is here. You can find rabbits, foxes, and even raccoons and wild boars just wandering around the city.
But Berlin's nature surprises with more than just animals. The city is full of birds: ducks and swans on almost every lake, seagulls flying over Alexanderplatz, and jays in Humboldthain Park.
Following my parents’ tradition, one of the first things I put on our balcony was a bird feeder (even though it is only “semi-allowed” in our building). And then I suddenly realized that I was barely able to identify the birds having lunch next to our windows.
The more birds I captured, the more puzzled I became. There were so many kinds of birds I had never seen before. Of course, now I can classify around 10–15 species without any hesitation, but at the beginning it was an absolute mystery.
And what does a regular person do when they don’t know something? Correct, they buy a book. So did I. I found a wonderful field guide called “Vogel in Europa”.
And yes, it helped. At least, I learned some German bird names.
But you won’t always have a book with you on your trips, and to be honest, most of the time I left it on the bookshelf anyway.
Switching from Lo-Tech
At the same time, I started exploring the field of Computer Vision and Deep Learning. And since “to a man with a hammer, everything looks like a nail”, I picked up the idea of automating bird identification. This is that story.
Preparing dataset
Looking for the bird names
It is amazing that the City of Berlin has a website which provides you with almost any information: from visa and immigration questions to aerial shots from the beginning of the 20th century.
My favorite sections of the berlin.de website are about green walking routes and the flora and fauna species inhabiting the city area. I used one of these sections as the starting point for building the dataset.
As the city provides information about Berlin’s bird species, let’s pull it and convert it into a structured form such as CSV.
The most common way of getting data from websites (aka scraping) with Python is a library called BeautifulSoup.
Preparing the environment
For all parts of my project, I’m using the conda package manager.
Installing conda
Skip this part if you are already using conda.
Creating an environment
To avoid messing up existing Python programs and packages, I prefer keeping every project in an isolated environment.
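A minimal sketch of the setup (the environment name berlin-birds matches the output path shown below; the Python version is my assumption):

```shell
# Create an isolated environment for the project
conda create --name berlin-birds python=3.7

# Activate it
conda activate berlin-birds

# Verify that the environment's Python is the one in use
which python
```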
Running these commands should give you an output similar to
/YOUR_USER/miniconda3/envs/berlin-birds/bin/python
If your output is different, try the following steps:
- Check that conda is installed by running conda info
- If the command above shows an error, try running source ~/.bashrc
- If nothing helps, log out of the current session and log back in, or just restart your computer
Installing packages
To pull bird information from the websites, we’ll need additional packages:
- Requests — for easy handling of HTTP requests
- Beautiful Soup — for parsing HTML page structure and extracting information
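Both can be installed in one go (note that Beautiful Soup is published on PyPI under the name beautifulsoup4):

```shell
# Install the scraping dependencies into the active environment
pip install requests beautifulsoup4
```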
Scraping the website
I’m not going to dive into the details of using the Beautiful Soup library; I hope the code itself is self-explanatory.
You may notice that I added a list of user agents. That is because, before I figured out the actual steps and page elements for extracting the data, I was sending more requests than a usual browser would and could have been banned as a bot.
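A condensed sketch of the scraping step. The rotating user agents are the trick mentioned above; the URL handling, the table layout, and the helper names are illustrative, not the real berlin.de page structure:

```python
import csv
import random

import requests
from bs4 import BeautifulSoup

# Rotate user agents so repeated requests look less bot-like
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14)",
]

def fetch(url):
    """Download a page pretending to be a regular browser."""
    headers = {"User-Agent": random.choice(USER_AGENTS)}
    return requests.get(url, headers=headers).text

def parse_species(html):
    """Extract bird names from the first cell of each table row."""
    soup = BeautifulSoup(html, "html.parser")
    return [row.find("td").get_text(strip=True)
            for row in soup.find_all("tr") if row.find("td")]

def save_csv(names, path):
    """Write the scraped names into a one-column CSV file."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["name"])
        writer.writerows([n] for n in names)
```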
Cleaning the data
Now open the CSV file and check the structure, splitting the scraped names into separate name columns.
As you may see, the results are mostly good, except for a few entries like “Amsel Schwarzdrossel”, which is actually an original name plus a synonym. For cases like that, I added an additional column called “Alternative name”. In addition, I added a Wikipedia link and the Russian name (as it is my native language). It is worth mentioning that I renamed “Star” to “Gemeiner Star”, as we won’t need images of ⭐️ but photos of 🐦.
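The extra column can be filled with a small helper along these lines (the SYNONYMS mapping is illustrative; in reality the name/synonym pairs were found by inspecting the CSV by hand):

```python
# Manually curated name + synonym pairs found while inspecting the
# scraped CSV (this mapping is illustrative, not the full list)
SYNONYMS = {
    "Amsel Schwarzdrossel": ("Amsel", "Schwarzdrossel"),
}

def add_alternative_name(raw_name):
    """Return (name, alternative_name) for a scraped name cell.

    Names without a known synonym keep an empty alternative column."""
    return SYNONYMS.get(raw_name, (raw_name, ""))
```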
Downloading dataset images
Now that I have a curated list of the birds inhabiting Berlin, I can start building an image dataset.
Sources of the images
There are different ways of getting images for training a neural network, and each comes with its pros and cons. You can collect images yourself, pay for labeling on crowdsourcing websites, use APIs like Flickr or Bing Image Search, or just use Google Images Search as a data source. I selected the latter, as it has access to almost any image website and is supported by Python libraries like google-images-download (https://github.com/hardikvasa/google-images-download).
Install Google Images Download
First, install the library.
Using Google Images Download
Google Images Download (GID from here on) has a very flexible configuration. I’m going to review the parts I used for my project; for the rest, I recommend checking the official manual on GitHub.
There are two ways of using GID: as a CLI application, by calling googleimagesdownload [Arguments…], or directly from a Python application via from google_images_download import google_images_download.
I’ll be working with the second one.
Setup Google Images Download
I won’t describe every single parameter, as most of them are self-explanatory, but here are a few which I think are not straightforward until you discover them.
ChromeDriver — As I need more than 100 photos for every single species, GID requires ChromeDriver to download that many images. Make sure you have it installed and provide its path to the library.
Stopwords — I discovered that along with bird photos, I was receiving images of cafés, magazines, and football teams. To avoid junk downloads, prepare a list of stopwords, but remember that Google doesn’t allow search terms containing more than 32 words.
Enable logging — Log all your actions to save time when debugging errors. You can pick up examples from my code.
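Put together, the download configuration looks roughly like the following sketch. The argument names come from the GID manual, but the paths are placeholders, and passing stopwords as negative search terms is my assumption about how the query was built:

```python
# Arguments for googleimagesdownload().download(...), one search per species
arguments = {
    "keywords": "Amsel -cafe -magazin -fussball",  # stopwords as negative terms
    "limit": 300,                           # more than 100 requires ChromeDriver
    "chromedriver": "/path/to/chromedriver",
    "output_directory": "data",             # one subfolder per species
    "format": "jpg",
    "print_urls": True,                     # basic logging of fetched URLs
}
```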
Download images
I’ve wrapped all parts mentioned above into a reusable script
The script needs:
- berlin-birds-extended.csv — the source for the list of bird names
- stopwords — a file with words to ignore, one word per line
- proxies.lst — a list of proxies (optional)
As output, I get a folder per species filled with bird images. All folders are located in the data folder.
Sort and clean downloaded images
Of course, images freshly downloaded from Google Search can’t be used as is. The file names are cryptic, some “images” are HTML responses, and some are empty files.
The easiest way I found was to check the file contents: if they match an image type, rename the file and keep it; otherwise, delete it.
There is a nice Python library called filetype which does exactly what I needed — https://github.com/h2non/filetype.py
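The same magic-byte check can be sketched with the standard library alone, if you want to avoid the extra dependency (the signatures below cover only JPEG and PNG; the filetype library knows many more formats):

```python
import os

# First bytes of common image formats (magic numbers)
SIGNATURES = {
    b"\xff\xd8\xff": "jpg",
    b"\x89PNG\r\n\x1a\n": "png",
}

def image_extension(path):
    """Return 'jpg'/'png' if the file starts with a known image
    signature, or None for HTML responses, empty files, etc."""
    with open(path, "rb") as f:
        head = f.read(8)
    for magic, ext in SIGNATURES.items():
        if head.startswith(magic):
            return ext
    return None

def clean_folder(folder):
    """Keep real images (renamed with the right extension), drop the rest."""
    for name in os.listdir(folder):
        path = os.path.join(folder, name)
        ext = image_extension(path)
        if ext is None:
            os.remove(path)
        else:
            base, _ = os.path.splitext(path)
            os.rename(path, base + "." + ext)
```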
Extra: Removing duplicate images
In my case, another part that needed cleanup was the large number of duplicate images; photos of the less well-known species were especially affected. With the help of StackOverflow, I came up with a small script which removes duplicate images from all folders.
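A minimal version of such a script, hashing file contents and keeping the first copy seen (this catches only byte-for-byte duplicates; near-duplicates would need perceptual hashing):

```python
import hashlib
import os

def remove_duplicates(folder):
    """Delete files whose contents are identical to an earlier file
    in the same folder; return the names of the removed files."""
    seen = {}
    removed = []
    for name in sorted(os.listdir(folder)):
        path = os.path.join(folder, name)
        with open(path, "rb") as f:
            digest = hashlib.md5(f.read()).hexdigest()
        if digest in seen:
            os.remove(path)
            removed.append(name)
        else:
            seen[digest] = name
    return removed
```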
Summary
That was the first part of my project on training a neural network to detect the bird species inhabiting Berlin and the surrounding areas.
Using the code described above, I collected more than 300 photos of every bird species, which is more than enough to train a network with transfer learning.
Stay tuned for the upcoming parts!
Part II: Sorting the images using a neural network
Part III: Training MobileNet using Transfer Learning approach