Team Members: Ahmet Tarık KAYA, Ayça Meriç ÇELİK, Kaan MERSİN

As we mentioned in our first blog post, the aim of this project is to label the objects that can be found in a typical house and to classify rooms using that object information. This week, we searched for a suitable dataset and for prior work that could serve as a starting point for our project.

OUR DATASET: ADE20K
Until now, we had not been able to find a suitable dataset for this project. As we mentioned earlier, all the datasets we could find were RGB-D datasets. On the recommendation of our advisor, we decided to use MIT's ADE20K dataset. Further information about the dataset can be found here:
http://groups.csail.mit.edu/vision/datasets/ADE20K/

It basically contains images separated by scene category. Each image has both object and part segmentation, and the dataset is mainly used for scene parsing. It is a quite large dataset containing both indoor and outdoor images. We will extract the indoor images of houses from the dataset to train our model.

We gathered some information about the available images by using keywords. We have two label sets: the scenes (rooms) and the objects (furniture/items). Information about the labels and the number of images that contain those labels can be found in the graphs below.
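Here is a minimal sketch of how such a per-scene image count can be gathered, assuming the ADE20K release's folder layout (images grouped under `images/training/<letter>/<scene_category>/`); the root path and the room keywords are placeholders for our actual subset:

```python
import os
from collections import Counter

# Placeholder paths/keywords; adjust to the actual ADE20K release location.
ADE_ROOT = "ADE20K_2016_07_26/images/training"
ROOM_KEYWORDS = {"bedroom", "bathroom", "kitchen", "living_room", "dining_room"}

scene_counts = Counter()
for letter in sorted(os.listdir(ADE_ROOT)):
    letter_dir = os.path.join(ADE_ROOT, letter)
    if not os.path.isdir(letter_dir):
        continue
    for scene in os.listdir(letter_dir):
        scene_dir = os.path.join(letter_dir, scene)
        if not os.path.isdir(scene_dir):
            continue
        # Count only the raw images, not the *_seg.png annotation files.
        scene_counts[scene] += sum(f.endswith(".jpg") for f in os.listdir(scene_dir))

# Keep only the indoor house scenes for our extracted subset.
house_counts = {s: c for s, c in scene_counts.items() if s in ROOM_KEYWORDS}
print(house_counts)
```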

OUR INSPIRATION: PSPNet
Aside from analyzing the dataset, we also searched for projects that had used it. There are lots of incredible projects, but we could not find one similar to ours. That excited us and made us more ambitious about our goal. While looking for a starting point, we found the paper that amazed us the most: “Pyramid Scene Parsing Network”.

“Scene parsing is challenging for unrestricted open vocabulary and diverse scenes. In this paper, we exploit the capability of global context information by different-region-based context aggregation through our pyramid pooling module together with the proposed pyramid scene parsing network (PSPNet). Our global prior representation is effective to produce good quality results on the scene parsing task, while PSPNet provides a superior framework for pixel-level prediction tasks. The proposed approach achieves state-of-the-art performance on various datasets. It came first in ImageNet 2016 scene parsing challenge, PASCAL VOC 2012 benchmark, and Cityscapes benchmark. A single PSPNet yields the new record of mIoU accuracy 85.4% on PASCAL VOC 2012 and accuracy 80.2% on Cityscapes.”

Further information about the paper can be found here: https://hszhao.github.io/projects/pspnet/

OUR IMPLEMENTATION
Let us explain our ideas for the research inspired by that paper. We want to parse a stream of scenes, as in a video, and label the objects. Of course, our stream will come from indoors, and our model will try to label furniture and household items. To achieve this, we will train the model on the extracted dataset we mentioned earlier.
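The per-frame loop could look roughly like the sketch below. Note that this uses torchvision's DeepLabV3 purely as a stand-in segmentation model; our actual model will be a PSPNet trained on the indoor subset of ADE20K, and the video path is a placeholder:

```python
import cv2
import torch
from torchvision import models, transforms

# Stand-in model; to be replaced with PSPNet trained on our ADE20K subset.
model = models.segmentation.deeplabv3_resnet101(pretrained=True).eval()
preprocess = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

cap = cv2.VideoCapture("indoor_video.mp4")  # placeholder input stream
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    batch = preprocess(rgb).unsqueeze(0)
    with torch.no_grad():
        out = model(batch)["out"]          # (1, n_classes, H, W) logits
    labels = out.argmax(1).squeeze(0)      # per-pixel class ids
    objects_in_frame = torch.unique(labels).tolist()
    print(objects_in_frame)                # object labels seen in this frame
cap.release()
```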

In the last part of our project, we will predict which room our model sees by using the objects it has found. We can use simple machine learning models for this purpose and obtain probabilities for the different room labels.
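As a rough sketch of what we have in mind, each scene could be encoded as a bag-of-objects vector (one entry per object label, 1 if present) and fed to a simple classifier such as logistic regression; the object vocabulary and training pairs here are invented for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy object vocabulary and training data, purely for illustration.
OBJECTS = ["bed", "oven", "bathtub", "sofa", "window"]

def encode(found_objects):
    """Binary presence vector over the object vocabulary."""
    return np.array([int(o in found_objects) for o in OBJECTS])

X = np.array([encode(s) for s in [
    {"bed", "window"}, {"oven", "window"}, {"bathtub"}, {"sofa", "window"},
]])
y = ["bedroom", "kitchen", "bathroom", "living room"]

clf = LogisticRegression(multi_class="multinomial", solver="lbfgs").fit(X, y)
print(clf.predict_proba([encode({"oven"})]))  # probability per room label
```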

We are aware that not all the items our model finds are equally significant for room labeling. Consider a window and an oven: both occupy enough space in a scene to be considered, but a window cannot tell us much about the room type because it appears in almost every room. Some items have strong connections with a particular room label, such as a bathtub and a bathroom. So we will consider only the most significant items for room labeling.
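One possible way to score this significance is an IDF-style weight: an object seen across many room types (like a window) gets a low weight, while one concentrated in a single room type (like a bathtub) gets a high one. The counts in this toy sketch are invented for illustration:

```python
import math

# Toy data: which room types each object has been observed in.
rooms_per_object = {
    "window": {"bedroom", "kitchen", "bathroom", "living room"},
    "bathtub": {"bathroom"},
    "oven": {"kitchen"},
}
n_room_types = 4

# IDF-style weight: log(total room types / room types the object occurs in).
weights = {obj: math.log(n_room_types / len(rooms))
           for obj, rooms in rooms_per_object.items()}
print(weights)  # window -> 0.0, bathtub/oven -> log(4) ~ 1.39
```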

References:

  1. https://hszhao.github.io/projects/pspnet/
  2. http://groups.csail.mit.edu/vision/datasets/ADE20K/
