Convolutional Autoencoder based Dimension Estimation from Depth Map of Monocular Images

Hariharan Natesh · Published in Geek Culture · 5 min read · Jul 28, 2021

This project was done as part of my undergraduate final-year thesis (2020). My teammates were Guruprasad Viswanathan Ramesh and Rakesh Vaideeswaran. We did this project under Dr. E. S. Gopi of the National Institute of Technology Tiruchirappalli (NITT). This article describes the process.

Introduction

This project uses a deep learning approach to estimate the Euclidean distance between two arbitrary points in 3D space, given a monocular image and its corresponding depth image. The technique is user-friendly: the user only needs to select two arbitrary points on the monocular image. An Autoencoder plus Artificial Neural Network (ANN) architecture is used, in which feature vectors are extracted by an autoencoder network and then used to build an ANN-based regression model. The mean deviation error in estimating the Euclidean distance is 0.059 meters. The experimental results demonstrate the usefulness of the proposed technique, which can be incorporated into various dimension measurement applications. The flow chart below describes the method.

Fig 1: Flow Chart of the method

Data Collection

A prerequisite for any machine learning project is data. A Microsoft Kinect was used to prepare the dataset. Doors were used as the reference objects, and a set of stickers, which serve as the points of interest, was placed on them (Fig. 2). Seven yellow stickers were placed on each of the three doors used for data collection; since seven stickers form 7C2 = 21 unique pairs, 21 different target values (sticker-pair distances) were obtained for each door. The color and depth images of the doors were collected simultaneously using the Kinect camera, placed at an arbitrary distance. For each position of the camera, 6–8 different angles were captured for each door.

Fig 2: Three different doors were used for the data collection process. Each row represents a unique door (Door 1, Door 2, Door 3 in order), and each column represents an arbitrary angle, relative to an arbitrary position of the Kinect, at which the image of the door was captured. Stickers are placed on the doors so that different distances can be measured between each sticker pair.

To capture color and depth images at the same resolution simultaneously, a Robot Operating System (ROS) module was interfaced with the Kinect camera. The Kinect works best in the 1.5–4 meter range, so it was placed at positions where both the color and depth images could be captured reliably by the Kinect software, and data was collected for each of the three doors. By ensuring that the target values of different doors were unique, a total of 63 distinct target values was obtained. In all, 99 raw color images and their corresponding depth images were collected. Each image provided 21 pairs of points for distance estimation, giving a total of 99 × 21 = 2079 preprocessed samples, each labeled with the corresponding dimension. The breakdown is given in the table below, and a code sketch of the capture node follows the table.

Fig 3: Table mentioning the number of values for each position and angle for each door.
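For reference, here is a minimal sketch of what such a ROS capture node might look like. This is an illustrative assumption rather than our exact setup: the topic names follow common openni/freenect driver defaults and may differ on other configurations.

```python
#!/usr/bin/env python
# Hypothetical minimal ROS node for grabbing synchronized Kinect color
# and depth frames. Topic names assume openni/freenect driver defaults
# and may differ on other setups.
import rospy
import message_filters
import cv2
from cv_bridge import CvBridge
from sensor_msgs.msg import Image

bridge = CvBridge()
count = 0

def save_pair(color_msg, depth_msg):
    global count
    color = bridge.imgmsg_to_cv2(color_msg, desired_encoding="bgr8")
    # Depth is typically 16UC1 (millimeters) or 32FC1 (meters) depending
    # on the driver; 16-bit PNG storage is assumed here.
    depth = bridge.imgmsg_to_cv2(depth_msg, desired_encoding="passthrough")
    cv2.imwrite("color_%03d.png" % count, color)
    cv2.imwrite("depth_%03d.png" % count, depth)
    count += 1

rospy.init_node("kinect_capture")
color_sub = message_filters.Subscriber("/camera/rgb/image_color", Image)
depth_sub = message_filters.Subscriber("/camera/depth/image_raw", Image)
# Pair up color/depth messages whose timestamps are within 50 ms
sync = message_filters.ApproximateTimeSynchronizer(
    [color_sub, depth_sub], queue_size=10, slop=0.05)
sync.registerCallback(save_pair)
rospy.spin()
```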

Data Preprocessing

A convolutional autoencoder model was developed to extract features from the images without losing significant information. The raw depth images were passed through the autoencoder, and the feature maps from the encoder portion were retained. The coordinates of the chosen points are later fed to the artificial neural network along with these feature maps. The autoencoder has a compression factor of 4. The architecture is given in Fig 4, and an illustrative code sketch follows it.

Fig 4: Auto Encoder-Decoder Architecture
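As an illustration of this kind of architecture, below is a minimal Keras sketch of a convolutional autoencoder. The filter counts and the single-channel bottleneck (which yields the 4× compression on 640×480 depth maps) are assumptions for illustration, not the exact thesis architecture from Fig 4.

```python
# Illustrative Keras sketch of a convolutional autoencoder; filter counts
# and the single-channel bottleneck are assumptions, not the exact
# thesis architecture.
from tensorflow.keras import layers, models

def build_autoencoder(input_shape=(480, 640, 1)):
    inp = layers.Input(shape=input_shape)

    # Encoder: one 2x2 downsampling halves each spatial axis, so a
    # single-channel bottleneck keeps 1/4 of the input values
    # (the compression factor of 4).
    x = layers.Conv2D(16, 3, activation="relu", padding="same")(inp)
    x = layers.MaxPooling2D(2, padding="same")(x)
    encoded = layers.Conv2D(1, 3, activation="relu", padding="same")(x)

    # Decoder mirrors the encoder to reconstruct the depth map.
    # Sigmoid output assumes depth maps normalized to [0, 1].
    x = layers.Conv2D(16, 3, activation="relu", padding="same")(encoded)
    x = layers.UpSampling2D(2)(x)
    decoded = layers.Conv2D(1, 3, activation="sigmoid", padding="same")(x)

    autoencoder = models.Model(inp, decoded)
    encoder = models.Model(inp, encoded)  # reused later for feature maps
    autoencoder.compile(optimizer="adam", loss="mse")
    return autoencoder, encoder
```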

The Mean Squared Error (MSE) loss function and the Adam optimizer were used. The training and validation loss as a function of the number of epochs can be seen in Fig 5.

Fig 5: Loss plot for the auto encoder-decoder

Fig 6 shows the raw depth images and their corresponding reconstructed images after passing through the model.

Fig 6: Top row shows the input depth images and the bottom row shows the corresponding reconstructed images from the encoder-decoder model.

Artificial Neural Network

The feature maps (encodings) of the depth images obtained from the encoder part of the autoencoder model (Fig 4) are flattened. The coordinates of the two points chosen by the user are then concatenated with the flattened encodings to form the input to the ANN. The input layer is connected to a layer of 1000 neurons, followed by a batch normalization layer. The next layer narrows to 100 neurons, and finally a single output neuron produces the predicted distance. The MSE loss function and the RMSProp optimizer are used.
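A minimal Keras sketch of this regression head is shown below. The flattened encoding size is a placeholder argument (it depends on the autoencoder's bottleneck shape), and the hidden-layer activations are assumptions, since only the layer sizes are stated above.

```python
# Illustrative Keras sketch of the regression head; encoding_dim is a
# placeholder for the flattened size of the autoencoder bottleneck.
from tensorflow.keras import layers, models

def build_regressor(encoding_dim):
    # Input: flattened depth encoding concatenated with the (x, y) pixel
    # coordinates of the two user-selected points (4 extra values).
    inp = layers.Input(shape=(encoding_dim + 4,))
    x = layers.Dense(1000, activation="relu")(inp)
    x = layers.BatchNormalization()(x)
    x = layers.Dense(100, activation="relu")(x)
    out = layers.Dense(1)(x)  # predicted Euclidean distance in meters

    model = models.Model(inp, out)
    model.compile(optimizer="rmsprop", loss="mse")
    return model
```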

Results

The models are implemented in Keras with a TensorFlow backend and trained on a Google Colab TPU. The model's performance is determined by how close the predicted dimension is to the actual dimension. Since this is a regression problem (the output variable can take any real value), accuracy cannot be measured as a count of exactly correct predictions, because it is very unlikely that any value is predicted exactly. Instead, the model is evaluated by the amount of error in its predictions. For instance, if the true value of a dimension is 150 cm, a model that predicts 148 cm or 152 cm is better than one that predicts 165 cm. The scatter plot is shown in Fig 7.

Fig 7: Scatter plot between the True value (original measurement) and the Predicted Value.

The Mean Squared Error of the model was found to be 0.00339, and the resulting error is ±0.059 meters on average.
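For completeness, here is a minimal sketch of how such error metrics can be computed from the predictions. The arrays below are placeholders (only the first pair is taken from Fig 8 below); note that the square root of the reported MSE, √0.00339 ≈ 0.058 m, is close to the quoted average error.

```python
import numpy as np

# Placeholder arrays; in practice y_pred comes from model.predict().
# Only the first true/predicted pair mirrors the example in Fig 8.
y_true = np.array([0.950, 1.200, 0.400])
y_pred = np.array([0.912, 1.250, 0.430])

mse = np.mean((y_true - y_pred) ** 2)   # mean squared error
rmse = np.sqrt(mse)                     # root mean squared error
mae = np.mean(np.abs(y_true - y_pred))  # mean absolute deviation
print("MSE: %.5f  RMSE: %.3f m  MAE: %.3f m" % (mse, rmse, mae))
```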

Fig 8: True value: 0.95m, Predicted value 0.912m

Limitations

  1. Our method of estimating the distance between two points is based on the depth maps generated by the Kinect. It assumes that a depth map of the image is available in order to estimate distances in 3D space.
  2. Improving accuracy in a machine learning problem involves a great deal of trial and error, and further experimentation could improve performance. The amount and type of data available are a significant constraint on achieving the best results, so collecting more data can improve the model's performance, provided the new data adds variety to the patterns the model can learn. Choosing proper hyperparameters also plays a role in improving the model's performance.
