City Guesser AI — Classifying Street View Images Using Transfer Learning

For access to our codebase, visit our GitHub repo.


Over the last 15 years, Google has photographed 10 million miles of street views, which is nearly 20% of the earth's surface. With this many images, numerous applications have emerged that incorporate these unique perspectives. Perhaps the most popular is GeoGuessr, an online game that places you inside a random street view and has you guess where on the planet you are. The closer the guess is to the actual location, the more points are awarded. Having all played this game for a while, we decided to explore how we could use deep neural networks to classify a Google Maps street view image into its correct city.

An Example Round of GeoGuessr

The Dataset

Our data came from Mapillary, a Swedish company based in Malmö that crowdsources street-level photos from around the globe. Mapillary provides their datasets for free, along with powerful APIs that allow further access to image metadata. We chose to avoid the more complex datasets, opting for their Street-Level Sequences dataset. The full set includes 1.6 million photos from 30 cities across six continents, encompassing all four seasons over a nine-year span. To keep the model simple, we decided to focus on classifying images for 16 cities: Amsterdam, Austin, Boston, Budapest, Helsinki, London, Manila, Melbourne, Miami, Phoenix, San Francisco, São Paulo, Tokyo, Toronto, Trondheim and Zurich. We chose these cities because they have distinct features that we believed the model would have the best chance of picking up on. For example, we were curious whether the model would confuse cities like London, Melbourne and Tokyo, since all three drive on the left side of the road. We took 1,000 photos from each city for our dataset, using 800 to train our model and the remaining 200 to test it.

Sample Photos for Each City


Our main goal for this project is to create a model that achieves over 50% prediction accuracy, and at least 70% top three accuracy, meaning that the correct label is in the top three guesses from the model. Additionally, we wanted to create a confusion matrix that allows us to further analyze where our model tends to make mistakes.
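To make the top-three metric concrete, here is a small pure-Python sketch of how top-k accuracy can be computed (an illustration, not our actual evaluation code, which used Keras's built-in metric):

```python
def top_k_accuracy(probs, true_labels, k=3):
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    hits = 0
    for scores, label in zip(probs, true_labels):
        # indices of the k largest scores
        top_k = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
        hits += label in top_k
    return hits / len(true_labels)

# Two samples over four classes: the first's true label (2) is ranked 1st,
# the second's true label (0) is ranked 3rd -- both count as top-3 hits.
probs = [[0.1, 0.2, 0.6, 0.1],
         [0.15, 0.5, 0.25, 0.1]]
print(top_k_accuracy(probs, [2, 0], k=3))  # 1.0
```

With k=1 this reduces to ordinary prediction accuracy, which is the other number we set a goal for.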


Data Preprocessing

Before we could use Keras libraries to manipulate the images, we first had to download, move, upload, and process 16,000 of them. Preprocessing began with downloading the data for each city of interest from Mapillary.

The models we had in mind only accepted images with dimensions of 224 x 224. Since most of the Mapillary images arrived at 640 x 480, some image manipulation was required. Scaling down also limited the size of our dataset, making uploads and file movement easier. We used the following script to go through each folder of city images and replace each larger image with a scaled-down version of itself.

import os
from PIL import Image
from tqdm import tqdm

cities = os.listdir('./bigDataCopy/train')
for city in tqdm(range(len(cities))):
    if cities[city] != ".DS_Store":
        cityDir = os.path.join('./bigDataCopy/train', cities[city])
        for image in os.listdir(cityDir):
            if image != ".DS_Store":
                imagePath = os.path.join(cityDir, image)
                img = Image.open(imagePath)
                fmt = img.format  # remember the format; resize() clears it
                img = img.resize((224, 224))
                img.save(imagePath, fmt)

Later on, when using our model, we also took advantage of the Keras ImageDataGenerator class, whose built-in functionality let us separate the training and testing data. It also has many useful tools to help diversify our dataset, such as randomly flipping images horizontally, randomly zooming in on images, and shearing images. This helped us avoid overfitting later on when we trained the model.
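As a rough illustration of what one of these augmentations does under the hood, a horizontal flip simply reverses each row of pixels. A toy sketch on a tiny grid (not the Keras implementation):

```python
def flip_horizontal(image):
    """Mirror an image (a list of pixel rows) left-to-right."""
    return [list(reversed(row)) for row in image]

tiny = [[1, 2, 3],
        [4, 5, 6]]
print(flip_horizontal(tiny))  # [[3, 2, 1], [6, 5, 4]]
```

Because a mirrored street scene is still a plausible street scene, each flip effectively gives the model a new training example for free.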

Minimum Viable Product with CNNs

As we are processing image data, it is typical to use a convolutional neural network. Convolutional neural networks (CNNs) help mitigate the issue of computing massive numbers of weights. In a CNN, a filter a fraction of the size of the input image is slid across the pixels, computing a dot product at each position. This reduces the dimensionality of the data, leaving fewer weights to compute, while also helping extract features. To create a minimum viable product, we built a small CNN and trained it on our data.
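The sliding dot product can be sketched in a few lines of plain Python (a toy stride-1, no-padding convolution, not the optimized library code):

```python
def convolve2d(image, kernel):
    """Slide a small filter over the image; each output pixel is the dot
    product of the filter with the patch it covers (stride 1, no padding)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            row.append(sum(image[i + di][j + dj] * kernel[di][dj]
                           for di in range(kh) for dj in range(kw)))
        out.append(row)
    return out

# A 3x3 vertical-edge-style filter over a 4x4 image yields a 2x2 feature map:
image = [[1, 1, 0, 0],
         [1, 1, 0, 0],
         [1, 1, 0, 0],
         [1, 1, 0, 0]]
kernel = [[1, 0, -1],
          [1, 0, -1],
          [1, 0, -1]]
print(convolve2d(image, kernel))  # [[3, 3], [3, 3]]
```

Note how the 4x4 input shrinks to a 2x2 output, and how the filter responds wherever bright pixels sit to the left of dark ones: this is the dimensionality reduction and feature extraction described above.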

The Basic Convolutional Neural Network

Even when trained and tested on a dataset of only seven cities, the basic CNN performed extremely poorly, achieving a top-one accuracy of only around seven percent. This was nowhere near the goals we were aiming for, so we decided to take a different approach.

Transfer Learning

Transfer learning involves utilizing pre-trained models that have been developed through research and testing in order to quickly develop powerful models. By using an already trained neural network, and adding our own layers at the end, we are able to build an accurate model that we wouldn’t necessarily be able to train with our limited computing power.

Building the Model

After researching and trying out several different models that Keras provides, including ResNet50 and DenseNet121, we found that the VGG16 model gave the best results. VGG16 consists of 13 convolution layers and 3 fully connected layers, for 16 weight layers in total. We removed the fully connected classifier head of the pre-trained model so as to extract only the convolutional features of each image, and then added several dense layers of our own on top of the transferred model.

In early iterations of our neural network, we noticed a large discrepancy between our training accuracy and validation accuracy. This was most likely due to overfitting: the model was getting used to the training data and slowly losing its ability to generalize to new examples. Thus, we added several dropout layers to the model, randomly excluding 5% of the nodes in each of those layers from weight updates. This eliminated a significant amount of overfitting, as will be shown later in the model evaluation.
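As an illustration, dropout can be sketched in plain Python (this uses the "inverted" dropout scaling that Keras applies at training time; the function and names here are ours, not the library's):

```python
import random

def dropout(activations, rate=0.05, training=True, seed=None):
    """Zero each activation with probability `rate` during training and scale
    survivors by 1/(1 - rate) so the expected sum is unchanged; at inference
    time, pass everything through untouched."""
    if not training:
        return list(activations)
    rng = random.Random(seed)
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

acts = [1.0] * 100
dropped = dropout(acts, rate=0.05, seed=0)
print(sum(a == 0.0 for a in dropped))  # on average ~5 of the 100 are zeroed
```

By forcing the network to make predictions without a random subset of its nodes, dropout discourages any single node from memorizing quirks of the training set.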

from tensorflow.keras.applications import VGG16
from tensorflow.keras.layers import Flatten, Dense, Dropout
from tensorflow.keras.models import Model

conv_model = VGG16(weights='imagenet', include_top=False, input_shape=(224,224,3))
x = Flatten()(conv_model.output)
x = Dense(100, activation='relu')(x)
x = Dense(100, activation='relu')(x)
x = Dropout(0.05)(x)
x = Dense(100, activation='relu')(x)
x = Dropout(0.05)(x)
predictions = Dense(16, activation='softmax')(x)

full_model = Model(inputs=conv_model.input, outputs=predictions)
for layer in conv_model.layers:
    layer.trainable = False  # freeze the pre-trained VGG16 layers

This is our full model structure, consisting of over 17 million parameters. One key thing to note is that despite having this many parameters, only around 2 million of them are trainable, since the majority of the layers belong to the frozen, pre-trained VGG16 model. This allows each epoch to run much faster without losing much accuracy.

The Full Model we Used

Training the Model

Once we had established the neural network we would use, we were ready to train the model. We experimented with several training parameters, and ended up using the following code to train our model.

full_model.compile(loss='categorical_crossentropy',
                   optimizer=Adam(learning_rate=0.001),
                   metrics=['acc', TopKCategoricalAccuracy(k=3)])
history = full_model.fit_generator(
    train_generator,
    epochs=20,
    validation_data=validation_generator)
In the code snippet, you can see that we used categorical cross-entropy as our loss function, along with the Adam optimizer at a learning rate of 0.001. Although this learning rate seemed small at first, the models we tested with larger learning rates seemed unable to converge to an optimum. We also used a large number of epochs, which, although it can raise concerns of overfitting, we were able to take advantage of without seeing a large discrepancy between training and validation accuracy.
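For a single sample, categorical cross-entropy is just the negative log of the probability the model assigned to the true class. A plain-Python sketch (illustrative numbers, not values from our model):

```python
import math

def categorical_cross_entropy(y_true, y_pred, eps=1e-12):
    """Loss for one sample: -sum(t * log(p)) over classes, where y_true is a
    one-hot vector and y_pred the model's softmax probabilities."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true, y_pred))

# The true city is class 2; a confident correct prediction is penalized
# far less than an uncertain one.
confident = categorical_cross_entropy([0, 0, 1, 0], [0.05, 0.05, 0.85, 0.05])
uncertain = categorical_cross_entropy([0, 0, 1, 0], [0.25, 0.25, 0.30, 0.20])
print(confident < uncertain)  # True
```

This is why the loss pushes the softmax output toward putting high probability on the correct city rather than merely ranking it first.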


After experimenting with different numbers of layers, numbers of epochs, and training/validation splits, we were able to train the model to nearly 90% top-three validation accuracy and 70% top-one validation accuracy. We were very pleased with this result, as we had originally intended to train on only 10 cities but expanded the dataset to images from 16 different cities. One thing to note from the accuracy and loss graphs is how the validation lines behave as more and more epochs pass. We were originally using only 10 epochs and getting similar results, and wanted to see whether training the same model for longer would improve the accuracy. However, we were still able to achieve remarkable accuracy with the model.

Training Accuracy vs. Validation Accuracy Over 20 Epochs

Confusion Matrix

We developed the confusion matrix using the metrics library from sklearn. The x-axis corresponds to each predicted label and the y-axis to each true label. Each entry is the number of times a given label was predicted for images of a given city (identified by its true label), divided by the total number of test images for that city (200 each). For instance, the top-right entry is the number of times the model predicted an image was Zurich when it was really from Amsterdam, divided by 200. You can see in this matrix that the predicted and true labels generally matched far more often than not, which is reflected in the accuracy of the model.
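The row normalization described above can be sketched in plain Python (illustrative counts, not our actual results):

```python
def normalize_confusion(counts):
    """Divide each row (true label) of a raw count matrix by its row total,
    so entry [i][j] is the fraction of class-i images predicted as class j."""
    return [[c / sum(row) for c in row] for row in counts]

# Two cities, 200 test images each: 180 of city 0's images were predicted
# correctly and 20 were mistaken for city 1, and so on.
counts = [[180, 20],
          [30, 170]]
print(normalize_confusion(counts))  # [[0.9, 0.1], [0.15, 0.85]]
```

Normalizing per row makes the diagonal directly readable as per-city recall, so cities the model struggles with stand out even though every city has the same number of test images.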

Confusion Matrix, Showing Predicted vs. True Labels

Interestingly enough, there seemed to be two main mistakes our model made: predicting Trondheim for Helsinki, and guessing Budapest for Helsinki. Looking at samples where the model guessed wrong, it is easy to see how the model, and even a human, could have made similar mistakes.

Predicted Helsinki, Actually Trondheim
Predicted Trondheim, Actually Helsinki

Further Steps


While the model’s accuracy was impressive, there was no easy way for anyone to pick up our code and test it out. Thus, we built a GUI to turn our model into a user-friendly experience. Using the Tkinter library, we built a simple Python application where a user can upload an image, display it, and feed it through our model. The top three predictions are then displayed based on the model’s guesses. A quick video demoing our application can be found here.

The Basic GUI for Our Application

Model Improvements

One other further step might be to improve our model. Using other techniques to mitigate overfitting such as regularization could help bridge the gap between training and validation accuracy, and help our model generalize better when confronted with new images of cities.

More Data and More Cities

The most interesting further step we could pursue with our project would have to do with adding cities to the dataset. Due to dataset and computing power limitations, we opted to only include 16 cities. However, with more time and resources, a similar model could be trained with a much wider variety of cities. Experimenting with more cities would test the boundaries of the transfer learning approach, and it would be interesting to see if accuracy could be maintained.

However, if more cities were included in the dataset, then almost surely more training data would be needed. The 800 images per city would hardly be enough, again driving up computing costs and resource needs. With ample time, though, this would definitely be a step worth taking.


Overall, we were able to create a model that achieved the goals we outlined for our project. Reaching ~77% top-one validation accuracy and ~91% top-three validation accuracy across 16 different cities, our model ended up performing remarkably well. By creating a confusion matrix, we were able to better understand how our model worked. Lastly, by creating a simple graphical user interface, we were able to interact with our model in a streamlined manner, and ultimately enjoy the product we created.



