Tallying votes in a meeting using pose detection: A case study with Nexity and GluonCV

Valentin Lecerf
Published in Apache MXNet · 5 min read · Oct 21, 2020


Picture from the camera: on the left, the meeting-room scenario; on the right, the output of the model applied to it

Who is Nexity, and what was the challenge?

Nexity is France’s leading integrated real estate group, with business operations in all areas of real estate development and services (residential real estate, commercial real estate, real estate services to individuals and companies, distribution networks and client relations, and major urban projects), and it enjoys a strong presence across all industry cycles (short, medium and long). Nexity has heavily adopted the cloud as a catalyst for technology-driven innovation.

How did Nexity solve the challenge?

This initiative took place during an internal innovation hackathon organized at Nexity. The goal of the hackathon was to raise executive-level awareness of the potential of cloud technologies. Volunteer teams competed by demonstrating an innovative system that could be built quickly using cloud services, and each team could choose its own scenario. We decided to work on the services we could offer at our general meetings of co-owners. Real-estate management involves many in-person meetings where decisions are submitted to show-of-hand votes and manually tallied. The team “KFC”, consisting of engineers, tech leads and project managers Antoine Pellet, Vincent Boidin, Xavier Top, Valentin Lecerf, Jérémy Desvaux and Grégory Hivin, decided to tackle the challenge of automatically tallying the votes of a show-of-hand poll in real time from fictional yet representative pictures of meeting rooms. In this post, the team shares its experience with Apache MXNet and GluonCV.

“Initially, our idea was to use an object detection model to detect raised hands. Due to the very limited implementation time, we switched approach and used a pre-trained pose estimation model from the MXNet GluonCV model zoo. This model gave us a list of coordinates representing people, and from this we could easily deduce whether a right or left hand was up. We used the coordinates of the hands, shoulders and head to decide whether an arm is raised. Without specific tuning we observed a 70% success rate. The main challenge was the inference latency, as we implemented a web application displaying a real-time vote count over a 1080p video stream. Some important aspects to keep in mind for this type of solution are (1) the performance (the need to handle a video stream) and (2) the video size and quality.”
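The latency concern can be mitigated before the model is even involved: downscaling each frame substantially reduces detection cost. Below is a minimal sketch of that idea, assuming OpenCV (cv2) for video capture; it is our illustration, not the team’s actual streaming code.

import cv2

# Capture from the default camera (index 0) and downscale each 1080p
# frame before running inference, trading resolution for latency.
cap = cv2.VideoCapture(0)
while True:
    ok, frame = cap.read()
    if not ok:
        break
    small = cv2.resize(frame, (640, 360))  # 1080p -> 360p
    # ... run person detection and pose estimation on `small` ...
cap.release()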

The code

The team developed and tested from an Amazon SageMaker Notebook and provided the following sample code:

Installation of gluoncv and import of relevant packages

! pip install gluoncv

from gluoncv import model_zoo, data, utils
from gluoncv.data.transforms.pose import detector_to_simple_pose, heatmap_to_coord
import mxnet as mx
from mxnet import nd
from mxnet.gluon.model_zoo import vision as models

Load the object detector and the pose estimation model, and restrict the detector to the person class only:

detector = model_zoo.get_model('yolo3_mobilenet1.0_coco', pretrained=True)
pose_net = model_zoo.get_model('simple_pose_resnet18_v1b', pretrained=True)
detector.reset_class(["person"], reuse_weights=['person'])
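The detection step below expects a preprocessed tensor x and the corresponding image array img. A minimal way to obtain them for a still image on disk is GluonCV’s YOLO preset loader (the file name meeting.jpg is hypothetical):

# Load and preprocess an input image; 'meeting.jpg' is a hypothetical name.
# x is the normalized tensor fed to the detector; img is the resized image
# array used for pose preprocessing and plotting.
x, img = data.transforms.presets.yolo.load_test('meeting.jpg', short=512)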

Run person detection and feed output into pose estimation:

class_IDs, scores, bounding_boxs = detector(x)
pose_input, upscale_bbox = detector_to_simple_pose(img, class_IDs, scores, bounding_boxs)
predicted_heatmap = pose_net(pose_input)
pred_coords, confidence = heatmap_to_coord(predicted_heatmap, upscale_bbox)
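To sanity-check the pipeline, the predicted keypoints can be drawn over the image with GluonCV’s built-in visualization helper; this plotting step is our addition, following the standard GluonCV pose estimation tutorial pattern:

from matplotlib import pyplot as plt

# Overlay the detected boxes and pose keypoints on the image.
ax = utils.viz.plot_keypoints(img, pred_coords, confidence,
                              class_IDs, bounding_boxs, scores,
                              box_thresh=0.5, keypoint_thresh=0.2)
plt.show()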

For each pose detection, check if the left hand or right hand is raised:

if (coords[10][1] < coords[8][1] < coords[6][1]      # left hand raised
        or coords[9][1] < coords[7][1] < coords[5][1]):  # right hand raised
    return True
else:
    return False
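The fragment above assumes a surrounding function and a single person’s coords array. A minimal sketch (our completion, not the team’s exact code) that wraps the check and tallies votes across every detected person follows; note that a smaller y-coordinate means higher in the image:

def is_hand_raised(coords):
    # Wrist above elbow above shoulder (on either side) means a raised arm.
    return (coords[10][1] < coords[8][1] < coords[6][1]     # left arm
            or coords[9][1] < coords[7][1] < coords[5][1])  # right arm

# pred_coords has shape (num_persons, 17, 2); count the raised hands.
votes = sum(is_hand_raised(person.asnumpy()) for person in pred_coords)
print('%d votes out of %d people detected' % (votes, len(pred_coords)))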

The full mapping between the predicted keypoint indices and body joints was listed to facilitate development:

# 0  # face points
# 1  # face points
# 2  # face points
# 3  # face points
# 4  # face points
# 5  # right shoulder
# 6  # left shoulder
# 7  # right elbow
# 8  # left elbow
# 9  # right hand
# 10 # left hand
# 11 # pelvis right
# 12 # pelvis left
# 13 # right knee
# 14 # left knee
# 15 # right foot
# 16 # left foot

Going further with GluonCV

Abundant models, training scripts and inference scripts are available in the gluoncv package, which as of May 2020 featured 14 pre-trained models and 4 tutorials for pose estimation alone!
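For instance, the available pose models can be enumerated programmatically; the name-based filter below is our heuristic, not an official API beyond get_model_list:

from gluoncv import model_zoo

# List every model-zoo entry whose name mentions pose estimation.
pose_models = [name for name in model_zoo.get_model_list() if 'pose' in name]
print(len(pose_models), pose_models)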

Accuracy-Throughput tradeoff in the gluoncv pose estimation model zoo

GluonCV is a Python computer vision toolkit built on top of the efficient Apache MXNet deep learning framework. GluonCV also comes with features dedicated to inference optimization, such as pruned models and quantization functions.
Once the scientific code is developed, it can be exposed as a service via a model server. Multi-Model Server (MMS) is a framework-agnostic model server that can be deployed as-is on compatible platforms, and that also ships as the managed backend of the SageMaker MXNet inference containers. Beyond that, numerous ideas can be explored to improve the performance and economics of the deployment, such as model compilation and hardware acceleration; those concepts were presented in a previous blog post.
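As an illustration of the serving idea, here is a minimal sketch of an MMS custom handler wrapping the pipeline above. This is our assumption of how such a handler could look, not the team’s deployment code; MMS invokes handle(data, context) for each request.

# pose_service.py: minimal MMS handler sketch (assumption, not the team's code)
import json
import mxnet as mx
from gluoncv import model_zoo, data
from gluoncv.data.transforms.pose import detector_to_simple_pose, heatmap_to_coord

detector, pose_net = None, None

def handle(request, context):
    global detector, pose_net
    if detector is None:  # one-time lazy model loading
        detector = model_zoo.get_model('yolo3_mobilenet1.0_coco', pretrained=True)
        detector.reset_class(["person"], reuse_weights=['person'])
        pose_net = model_zoo.get_model('simple_pose_resnet18_v1b', pretrained=True)
    if request is None:
        return None
    # The request body carries raw image bytes.
    img = mx.image.imdecode(request[0].get('body'))
    x, frame = data.transforms.presets.yolo.transform_test(img, short=512)
    class_IDs, scores, bounding_boxs = detector(x)
    pose_input, upscale_bbox = detector_to_simple_pose(frame, class_IDs, scores, bounding_boxs)
    predicted_heatmap = pose_net(pose_input)
    pred_coords, confidence = heatmap_to_coord(predicted_heatmap, upscale_bbox)
    return [json.dumps({'people_detected': len(pred_coords)})]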

This is not the first time deep learning has been used on a real-estate or urban-planning use-case, yet this is a particularly original and creative one! Among existing deep learning work relating to real estate, we note in particular:

— In Launching Similar Homes and Real-Time Personalized Recommendations (Gautam Narula, Ran Ding, Samuel Weiss, and Joseph Sirosh), Compass researchers describe the challenge of recommending real-estate listings. A deep embedding model is developed with Apache MXNet to learn listing similarity, and significant business impact is reported (+153% click-through rate and +107% engagement actions).
— In 2018, Development Seed described using Apache MXNet to classify building presence from aerial imagery.
— In an AWS ML Blog post, the property data analytics company EagleView presents aerial computer vision solutions developed with Apache MXNet to assess urban damage created by natural disasters. Using deep learning, EagleView can assess property damage within 24 hours and inform insurers and homeowners more rapidly.

Conclusion

In summary, if you are working on a computer vision use-case, chances are GluonCV and Apache MXNet can help you drastically reduce your time to result while maintaining a state-of-the-art bar for accuracy and efficiency. Do not hesitate to give them a try, contribute to these projects, and reach out to the community on the forum at discuss.mxnet.io/!
