The main purpose of the summer research project I’m working on is to find out whether customer interaction with a store shelf can be reliably detected using a combined sensor made up of three cameras. The overall goal is to adjust the layout of the merchandise on the shelf based on the detected interaction. A customer might perform one of the following actions:
- Take merchandise (and keep it)
- Return merchandise (from the shopping cart)
- Take merchandise and return it immediately
- Touch the merchandise
Similar work on this topic already exists; however, it relies on low-quality sensors, and none of it uses the combination of sensors proposed here, so the results are suboptimal. The goal is to combine the existing research efforts from separate areas and develop a coherent and reliable approach to recognizing customer interaction.
The idea is to mount an RGB, a depth, and a thermal camera on top of the shelf facing down on the aisle (see the image above), record the customer interaction, and then use computer vision algorithms to analyze it and recognize which items the customer was taking, returning, or touching.
The depth camera should help with segmenting the hand and merchandise from the background based on the distance data, which could be further improved using skin detection based on the color data from the RGB camera. With a thermal camera, it should be easier to separate smaller merchandise from the customer’s hand, because their temperatures usually differ. Using a thermal camera, however, raises new challenges, mainly in synchronizing different types of sensors from different manufacturers.
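To make the idea more concrete, here is a minimal sketch in Python of how depth and thermal data could be combined per pixel. It is only an illustration under strong assumptions, not the project's algorithm: the function name `segment`, the threshold values, and the premise that the depth and thermal images are already aligned pixel to pixel are all hypothetical.

```python
# All constants below are assumed values chosen for illustration only.
BACKGROUND_DISTANCE_MM = 1500  # assumed distance from the camera to the background
MARGIN_MM = 200                # assumed minimum height above the background
SKIN_TEMP_C = 30.0             # assumed threshold between warm skin and cooler items


def segment(depth_mm, thermal_c):
    """Label each pixel as 'background', 'hand' or 'item'.

    depth_mm and thermal_c are equally sized 2D lists that are assumed
    to be aligned pixel to pixel already.
    """
    labels = []
    for depth_row, thermal_row in zip(depth_mm, thermal_c):
        row = []
        for depth, temp in zip(depth_row, thermal_row):
            # A depth of 0 is treated here as "no measurement".
            if depth == 0 or depth > BACKGROUND_DISTANCE_MM - MARGIN_MM:
                row.append("background")
            elif temp >= SKIN_TEMP_C:
                row.append("hand")   # close to the camera and warm
            else:
                row.append("item")   # close to the camera but cool
        labels.append(row)
    return labels
```

A real implementation would work on whole images (for example with OpenCV) and would add the skin color check from the RGB stream; the sketch only shows the order of the decisions.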
The following sensors are used for this project:
- Intel RealSense Depth Camera D415, providing two streams:
  - Color: resolution up to 1920 x 1080 px at 30 frames per second
  - Depth: resolution up to 1280 x 720 px at 30 frames per second
- FLIR Tau2 with Workswell USB3 and GigE modules, providing one stream:
  - Thermal: resolution 640 x 512 px at 30 frames per second
An attempt to solve customer interaction recognition has already been made at the Faculty of Information Technology at Czech Technical University in Prague, using Microsoft Kinect sensors for the depth and RGB streams. This approach proved unreliable because of the low depth stream resolution and the low frame rate; a high frame rate is crucial for capturing the fast motion of a hand reaching towards the shelf. The following image illustrates the previous software in action:
Numerous publications exist on hand detection using RGB and/or depth cameras. The two most relevant ones recommended for this project are:
- Minsun Park et al. propose combining depth thresholding with color and shape detection from an RGB stream. Under certain conditions, the proposed method detects the hand directly, without searching for the body or face first.
- Ekaterini Stergiopoulou et al. propose a technique that can reliably detect a hand (with 98.75% accuracy) against a complex background and under difficult lighting conditions. They achieve these results by combining motion, skin color, and morphology features.
Other relevant articles include, for example, those by Mittal et al. and Mei et al.
Use of thermal camera
Ing. Lukáš Brchl’s bachelor thesis focuses solely on distinguishing between the customer’s hand and merchandise using a thermal camera. Even though his thermal camera delivered only 7.5 frames per second, he achieved 85–88% accuracy. It seems reasonable to expect higher accuracy with a higher frame rate, although this still has to be verified.
While the summer research project is still in its early stages, progress is being made on working with the cameras and synchronizing the video streams. The synchronization alone is a complex problem with research potential of its own. The current approach is to record the data first and analyze it afterward; real-time processing is left for future development.
The synchronization of video streams
To reliably detect customer interaction, it is crucial to have all the video streams properly aligned. If the hand with merchandise is segmented using depth and color information and the thermal sensor is then used to determine what is hand and what is merchandise, the three frames (RGB, depth, and thermal) must be taken at the same time, or the recognition won’t work correctly. Fortunately, the RGB and depth cameras are both on the same device (Intel RealSense) with built-in software synchronization. The difficult part is synchronizing the frames from those two streams with the thermal camera, which is a separate device.
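One possible way to pair frames from two independent devices is to match them by timestamp after recording. The sketch below is not necessarily what our application does; it assumes that both devices' timestamps have already been converted to a common clock in milliseconds, and the name `match_nearest` and the tolerance of half a frame period are hypothetical choices.

```python
from bisect import bisect_left


def match_nearest(rgbd_ts, thermal_ts, max_diff_ms):
    """For each RGB-D timestamp, find the index of the nearest thermal frame.

    Both lists must be sorted in ascending order. The result contains
    None where no thermal frame lies within max_diff_ms.
    """
    matches = []
    for ts in rgbd_ts:
        pos = bisect_left(thermal_ts, ts)
        # Only the neighbours just before and just after ts can be nearest.
        candidates = [i for i in (pos - 1, pos) if 0 <= i < len(thermal_ts)]
        best = min(candidates, key=lambda i: abs(thermal_ts[i] - ts), default=None)
        if best is not None and abs(thermal_ts[best] - ts) <= max_diff_ms:
            matches.append(best)
        else:
            matches.append(None)
    return matches


# Example with made-up timestamps: the thermal stream has a gap,
# as it would during one of the camera's calibration pauses.
rgbd = [0.0, 33.3, 66.7, 100.0]
thermal = [2.0, 35.0, 120.0]
pairs = match_nearest(rgbd, thermal, max_diff_ms=1000 / 30 / 2)
```

RGB-D frames without a thermal partner end up as `None`, so a later processing step can decide whether to skip them or to fall back to depth and color only.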
Custom synchronization solution
My colleague Bc. Petr Kasalický and I are developing a C++ application that collects the frames and saves them in an orderly manner, so that the data is easy to work with later. The RGB and depth streams are synchronized by the RealSense library, since they come from the same device. Working with the thermal camera raised new challenges, though, because the Workswell API doesn’t provide hardware synchronization options even though the sensor supports it. Also, the thermal camera recalibrates itself from time to time, during which no frames are available for up to one second. It was therefore necessary to come up with our own synchronization solution.
The Intel RealSense API, which has been my primary focus so far, is much more pleasant to work with. It contains high-level data structures that allow polling for frames using the blocking call wait_for_frames(milliseconds), which returns a synchronized pair of frames from the RGB and depth streams when both are configured properly. The milliseconds parameter specifies how long to wait for frames before timing out, and our solution takes advantage of it.
The current approach is to create two separate threads that poll for frames once per frame period (every 1000/FPS milliseconds) and insert blank frames if no data arrives within that time window. One thread polls for thermal frames; the other polls for synchronized pairs of depth and RGB frames. All collected frames are stored in their respective queues, where two more threads come in. Their purpose is to pick up frames from the queues and save them to disk (possibly with metadata).
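The thread layout described above can be sketched as follows. The real application is written in C++ and talks to the camera APIs; this Python version is a simplified model in which `fake_camera`, `poll_stream`, `drain`, and `record` are hypothetical names, the cameras are replaced by replayed lists, and saving to disk is replaced by appending to an in-memory list.

```python
import queue
import threading

BLANK = "BLANK"   # placeholder stored when no frame arrived in the time window
STOP = object()   # sentinel telling a saving thread that the stream has ended


def fake_camera(frames):
    """Stand-in for a camera API: replays a list, None simulates a timeout."""
    it = iter(frames)

    def read_frame(timeout_s):
        # A real blocking call would wait up to timeout_s here.
        return next(it, None)

    return read_frame


def poll_stream(read_frame, frames, slots, period_s):
    """Poll one stream once per slot and insert BLANK when nothing arrived."""
    for slot in range(slots):
        frame = read_frame(period_s)
        frames.put((slot, BLANK if frame is None else frame))
    frames.put(STOP)


def drain(frames, save):
    """Take frames off the queue and hand them to save() until STOP arrives."""
    while True:
        item = frames.get()
        if item is STOP:
            break
        save(*item)


def record(sources, slots, period_s):
    """Run one polling thread and one saving thread per stream."""
    saved = {name: [] for name in sources}
    threads = []
    for name, read_frame in sources.items():
        frames = queue.Queue()
        # Saving is simulated by appending to a list instead of writing to disk.
        save = lambda slot, frame, store=saved[name]: store.append((slot, frame))
        threads.append(threading.Thread(
            target=poll_stream, args=(read_frame, frames, slots, period_s)))
        threads.append(threading.Thread(target=drain, args=(frames, save)))
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return saved


result = record(
    {"rgbd": fake_camera(["f0", "f1", "f2"]),
     "thermal": fake_camera(["t0", None, "t2"])},  # None: e.g. a calibration pause
    slots=3,
    period_s=1 / 30,
)
```

Because every stream gets exactly one entry per slot, the saved sequences stay the same length, and a blank entry marks precisely where data is missing.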
The disk speed turned out to be an issue. While testing the program, the available 5400 RPM HDD couldn’t keep up with the frame rate, and the queues kept filling up. A single test with only the depth and RGB streams was conducted with a much faster SSD, but the queues emptied properly only when the RGB resolution was decreased to 1280 x 720. More tests with all of the streams are currently being conducted.
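A rough back-of-the-envelope calculation helps explain why the disk becomes the bottleneck. The sketch below assumes that frames are written uncompressed, with 3 bytes per color pixel and 2 bytes per depth and thermal pixel; these storage formats are my assumptions for illustration, and the helper `raw_rate_mb_s` is hypothetical.

```python
def raw_rate_mb_s(width, height, bytes_per_pixel, fps):
    """Uncompressed data rate of one stream in megabytes per second."""
    return width * height * bytes_per_pixel * fps / 1e6


# Assumed pixel sizes: 24-bit color, 16-bit depth, 16-bit thermal.
rates = {
    "color 1920x1080": raw_rate_mb_s(1920, 1080, 3, 30),
    "color 1280x720": raw_rate_mb_s(1280, 720, 3, 30),
    "depth 1280x720": raw_rate_mb_s(1280, 720, 2, 30),
    "thermal 640x512": raw_rate_mb_s(640, 512, 2, 30),
}

total_full_hd = (rates["color 1920x1080"] + rates["depth 1280x720"]
                 + rates["thermal 640x512"])
total_720p = (rates["color 1280x720"] + rates["depth 1280x720"]
              + rates["thermal 640x512"])
```

Under these assumptions, full HD color alone needs about 187 MB/s, and all three streams together need about 262 MB/s, or about 158 MB/s with color reduced to 1280 x 720. That is probably more than a 5400 RPM drive can sustain, which would be consistent with what we observed.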
RealSense synchronization analysis
The official RealSense documentation doesn’t say much about depth and color stream synchronization other than that wait_for_frames() returns a synchronized pair of frames. For that reason, the frame metadata were collected and analyzed to better understand how the synchronization works. As it turns out, the streams drift apart ever so slightly, so once in a while a color frame is dropped.
For the first measurement, I used the example code from the RealSense documentation, modified to save the metadata into a single CSV file. The first graph shows the difference between the frame counters of the depth stream and the RGB stream. After 4400 frames, the depth stream was 36 frames ahead. The second graph shows the difference between the sensor timestamps obtained from the metadata. It reveals that the largest difference between the streams is up to one frame period (33.33 ms at 30 frames per second), which seems reasonable.
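For reference, one frame period at 30 frames per second is 1000 / 30 ≈ 33.33 ms, and a lead of 36 frames after 4400 frames corresponds to a drift of roughly 0.8%. The kind of analysis behind the two graphs can be sketched as below; the CSV column names and the sample rows are made up for illustration and do not come from the actual recordings.

```python
import csv
import io

FRAME_PERIOD_MS = 1000 / 30  # about 33.33 ms at 30 frames per second


def analyze(csv_text):
    """Return (final counter lead of depth over color, largest timestamp gap in ms)."""
    last_counter_diff = 0
    max_ts_diff = 0.0
    for row in csv.DictReader(io.StringIO(csv_text)):
        last_counter_diff = int(row["depth_counter"]) - int(row["color_counter"])
        ts_diff = abs(float(row["depth_ts_ms"]) - float(row["color_ts_ms"]))
        max_ts_diff = max(max_ts_diff, ts_diff)
    return last_counter_diff, max_ts_diff


# Made-up sample: the color stream skips one frame in the third row.
sample = """depth_counter,color_counter,depth_ts_ms,color_ts_ms
1,1,0.0,5.0
2,2,33.3,38.3
4,3,100.0,71.6
5,4,133.3,105.0
"""

counter_lead, largest_gap = analyze(sample)
within_one_frame = largest_gap <= FRAME_PERIOD_MS
```

The same function can be run on the metadata from both measurements to compare the counter drift and the largest timestamp gap side by side.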
The second measurement was taken using the custom solution, which polls for frames once per frame period (every 1000/FPS milliseconds). In this solution, wait_for_frames is called with the timeout parameter, so if the RealSense device doesn’t provide any frames within this time window, blank ones are inserted and the computation can continue. The data reveal that the behavior is very similar and acceptable. Even though there is a small discrepancy in the frame counters (one more RGB frame was skipped), the largest difference between the two timestamps is still the same as before.
To summarize, the data collection program seems to work reliably with the RealSense camera, with virtually no dropped frames and only occasional duplicate frames. Because no frames are dropped, blank frames don’t have to be inserted, so the recognition algorithm won’t be missing any data from RealSense at any time. Duplicate frames are not an issue as long as the difference in timestamps is reasonably low. However, more testing is necessary, especially together with the thermal camera.
After the data collection program is thoroughly tested, data sets of people grabbing merchandise will be recorded. Then development of the detection algorithm can begin.
So far, we are using C++ for the data gathering application. The language of choice for the final recognition application is Python with the OpenCV library for image processing, and possibly more libraries for machine learning and data classification algorithms. Further research on hand recognition and tracking is needed to choose optimal detection strategies; a few approaches have already been mentioned in the “Related work” section of this post. What remains is to analyze body recognition algorithms to enable reliable detection of people.
 Keruľ-Kmec, Oliver. Detekcia prítomnosti tovaru v ruke zákazníka. Bachelor thesis. Prague: Czech Technical University in Prague, Faculty of Information Technology, 2016.
 Monitoring for users. http://126.96.36.199/tiki/tiki-index.php?page=Monitoring+pro+u%C5%BEivatele
 Surmon. Detekce osob v supermarketu. https://www.youtube.com/watch?v=FiwRjwvfrfE
 Brchl, Lukáš. Detekce zboží v ruce zákazníka pomocí analýzy snímků z termokamery. Bachelor thesis. Prague: Czech Technical University in Prague, Faculty of Information Technology, 2017.
 Park, Minsun, et al. Hand detection and tracking using depth and color information. In: Proceedings of the International Conference on Image Processing, Computer Vision, and Pattern Recognition (IPCV). The Steering Committee of The World Congress in Computer Science, Computer Engineering and Applied Computing (WorldComp), 2012. p. 1.
 Stergiopoulou, Ekaterini, et al. Real time hand detection in a complex background. Engineering Applications of Artificial Intelligence, 2014, 35: 54–70.
 Intel Corporation. SDK Knowledge base. https://dev.intelrealsense.com/docs/sdk-knowledge-base
 Save-to-disk example code. Intel RealSense library (librealsense) Github. https://github.com/IntelRealSense/librealsense/blob/master/examples/save-to-disk/rs-save-to-disk.cpp
 Mittal, Arpit; Zisserman, Andrew; Torr, Philip HS. Hand detection using multiple proposals. In: BMVC. 2011. p. 1–11.
 Mei, Kuizhi, et al. A real-time hand detection system based on multi-feature. Neurocomputing, 2015, 158: 184–193.