# Teaching computers how to see

Feb 18, 2019 · 5 min read

This leaves our Data Scientists with a vast amount of unstructured data to analyze. We already use structured data from user profiles, such as age, occupation and gender, to better understand how users behave on the platform, study how new and exciting features fare, craft better recommendations and so on. Our next logical step is pictures. If we can harness the information they contain, there are many analyses we could run to improve our user experience. What knowledge can we get out of those pictures? Do some users only contact rooms that are full of natural light? Do they prefer to live with people who depict themselves practicing sport? Let’s try to answer some of these questions, but first we’ll need to translate these images into a form a computer can understand. Let’s extract features out of them.

A one-megapixel image, which by current camera standards is extremely poor quality, is composed of 1 million pixels, each with three color channels: red, green and blue. This means that, for a machine learning model, a single image has 3 million columns! Such a vast number of features is an issue for machine learning models. We need to reduce the dimensions of these images to something that is bearable for a computer but still encodes their essence. That’s where Neural Networks kick in.
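To make the arithmetic concrete, here is a minimal sketch that flattens an image into the single row of columns a model would see. The 1000 x 1000 size is just one convenient way to get a megapixel, and the all-zero array is a placeholder, not a real photo.

```python
import numpy as np

# A hypothetical 1-megapixel RGB image: 1000 x 1000 pixels, 3 color channels.
height, width, channels = 1000, 1000, 3
image = np.zeros((height, width, channels), dtype=np.uint8)  # placeholder pixels

# Flattening gives one feature (column) per pixel per channel.
features = image.reshape(-1)
n_features = features.shape[0]

print(n_features)  # 3000000
```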

Luckily for us, we’re standing on the shoulders of giants: in recent years Artificial Neural Networks have advanced enormously. Roughly speaking, they are systems vaguely inspired by the human brain that are capable of learning complex patterns, and their neurons are mathematical operations applied to the data. Whether the data are closings of the S&P, the contents of tweets, audio files, or images is up to you.

One very active area of research in Neural Networks has been image classification. Tasks in this field include telling what’s in a picture, detecting objects on the road (which is required to teach your Tesla how to drive), or detecting hotdogs. The networks used for these tasks are called Convolutional Neural Networks (CNNs) because they use convolutions, mathematical operations that emulate the response of neurons to visual stimuli, where each neuron only sees its surrounding area. Applying the convolution over the whole image means striding through it in patches.
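Striding through an image in patches can be sketched in a few lines of NumPy. This is only an illustration: the toy image and the hand-written edge kernel are made up for the example, whereas a real CNN learns its kernel values during training. As in most deep learning libraries, the kernel is not flipped, so strictly speaking this is a cross-correlation.

```python
import numpy as np

def convolve2d(image, kernel, stride=1):
    """Slide `kernel` over a 2-D `image` (no padding) and return the response map."""
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    output = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            top, left = i * stride, j * stride
            # Each output value only "sees" one small patch of the image.
            patch = image[top:top + kh, left:left + kw]
            output[i, j] = np.sum(patch * kernel)
    return output

# Toy 6x6 grayscale image: dark left half, bright right half.
image = np.zeros((6, 6))
image[:, 3:] = 1.0

# Hand-written kernel that responds to dark-to-bright vertical edges.
kernel = np.array([[-1.0, 0.0, 1.0],
                   [-1.0, 0.0, 1.0],
                   [-1.0, 0.0, 1.0]])

response = convolve2d(image, kernel)
print(response.shape)  # (4, 4)
print(response[0])     # [0. 3. 3. 0.]
```

The response is zero over the flat regions and peaks where a patch straddles the edge, which is the kind of local pattern a convolutional layer picks up.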

We’ve also seen an enormous democratization of AI, with many frameworks being open sourced and already trained, state-of-the-art networks being released, ready to be used. The most time-consuming, and probably most resource-intensive, part of designing a Deep Neural Network is gathering labeled images and training it, so it’s a luxury to be able to use these pretrained networks without having a cluster of computers running for days.

However, many of these networks may not be suitable for our desired task. CNNs are normally trained to detect thousands of different objects, and we may be interested in telling apart double beds from single beds, not a dog from a car. Here’s where transfer learning comes in. It consists of transferring the knowledge of a neural network trained for a general task, i.e. detecting thousands of different objects, to your specific task, such as detecting a good-looking apartment. This is done by taking the outputs of intermediate layers of the neural network, called embeddings, and using them as the input for your specific problem. These embeddings are an intermediate representation of the image, not very specific to the original problem, but able to encode important information about the image. The result of this process is that each image is translated into a fixed-size vector of floating-point numbers.
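As an illustration of the idea (not something done in this post, which clusters the embeddings instead), the sketch below trains a small scikit-learn classifier on top of embeddings. The embeddings here are random placeholders standing in for the output of a pretrained network, and the "single bed" / "double bed" labels, the sample counts and the 64-dimensional size are all hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Placeholder embeddings: two synthetic groups standing in for vectors
# that a pretrained CNN would produce for two kinds of room pictures.
n_per_class, dim = 200, 64
single_bed = rng.normal(loc=0.0, scale=1.0, size=(n_per_class, dim))
double_bed = rng.normal(loc=1.0, scale=1.0, size=(n_per_class, dim))

embeddings = np.vstack([single_bed, double_bed])
labels = np.array([0] * n_per_class + [1] * n_per_class)  # 0 = single, 1 = double

X_train, X_test, y_train, y_test = train_test_split(
    embeddings, labels, test_size=0.25, random_state=0, stratify=labels
)

# The "specific task" model is tiny: the heavy lifting was already done
# by whatever network produced the embeddings.
classifier = LogisticRegression(max_iter=1000)
classifier.fit(X_train, y_train)

accuracy = classifier.score(X_test, y_test)
print(f"held-out accuracy: {accuracy:.2f}")
```

The point of the design is that only the small model on top needs labeled data for the new task; the pretrained network is reused as is.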

This representation makes very little sense to humans. The numbers are the result of evaluating a long chain of nested functions on that specific image, which still leaves a very high-dimensional problem for a human to understand. In order to see its results, we encoded a few hundred thousand images of rooms and users on Badi, and clustered similar embeddings. For the neural networks, we used Keras to extract the encodings after playing around with various architectures such as VGG16 and ResNet50, which yielded similar results. In both cases we removed the fully connected top of the network, which maps neurons to the outputs of the model.
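A minimal sketch of this kind of extraction with Keras might look as follows. The post does not say which pooling, input size or preprocessing were used, so `pooling="avg"`, the helper names `build_extractor` and `embed`, and the use of `tensorflow.keras` are assumptions made for illustration.

```python
import numpy as np
from tensorflow.keras.applications.resnet50 import ResNet50, preprocess_input

def build_extractor(weights="imagenet"):
    """ResNet50 without its fully connected classification head.

    include_top=False drops the final fully connected layer; pooling="avg"
    (an assumption) averages the last feature map so that each image
    becomes a single 2048-dimensional vector.
    """
    return ResNet50(weights=weights, include_top=False, pooling="avg")

def embed(extractor, images):
    """Map RGB images of shape (n, height, width, 3), values 0-255, to embeddings."""
    batch = preprocess_input(np.array(images, dtype="float32"))
    return extractor.predict(batch, verbose=0)
```

Calling `build_extractor()` downloads the ImageNet weights on first use; `embed(extractor, images)` then returns an array with one row of 2048 floats per image. Swapping in VGG16 from `tensorflow.keras.applications.vgg16` follows the same pattern, with a different embedding size.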

Once we had the embeddings of the images, we used K-Means clustering from scikit-learn.
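A minimal sketch of that step, assuming the embeddings are stacked in a NumPy array with one row per image. The random array below is a placeholder for real embeddings, and the number of clusters is hypothetical, since the post does not state the value used.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)

# Placeholder for real image embeddings: 500 images, 128 floats each.
embeddings = rng.normal(size=(500, 128)).astype("float32")

n_clusters = 10  # hypothetical; tune for your own data
kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=42)

# One cluster id per image; no labels are involved at any point.
cluster_ids = kmeans.fit_predict(embeddings)

# Indices of the images in the first cluster, e.g. to plot them together.
first_cluster = np.flatnonzero(cluster_ids == 0)

print(cluster_ids.shape, kmeans.cluster_centers_.shape)  # (500,) (10, 128)
```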

And here are some examples of plots of images clustered together. Bear in mind that these groups were obtained in a completely unsupervised manner.

As you can see, clustering embeddings produces groups of images that resemble each other in scenery, facial traits, etc. This is because intermediate layers in neural networks encode shapes or patterns, not the specific class an image belongs to.

If you love Data Science, are passionate about making an astounding product through Artificial Intelligence and really want to revolutionize how people find their next home, Badi is the place for you. Have a look at our jobs page or reach out to me with all your questions.


Curated stories on Machine learning, Data Science, AI and many more…

Written by

VP of Data Science at Badi. Compulsive learner.
