Hand tracking in Unity3D

Jiasheng Tang
AI2 Labs
Aug 5, 2019

Since we were very young, we have dreamed of using our hands to remotely control the things around us. Sure, we can touch things, move things, roll things around, and throw things away, but none of that is cool. We only become a Wizard or a Jedi Master when we can control something without being in contact with it.

While this dream is hard to realize in real life, people have long been exploring the possibility in the virtual world. Microsoft Kinect was once popular because you could use your hands to hack and slash virtual fruits. Another commercial product is Leap Motion, which lets you interact with virtual objects in a delicate way. But neither has become mainstream: both require additional hardware, most people are not motivated enough to spend a few hundred bucks on the experience, and even if you own the hardware, it is difficult to carry around with you.

So how awesome would it be if there were a way to track your hands using just a normal RGB camera, the kind every smartphone has?

This post introduces how to do hand tracking with an RGB camera in Unity3D. We use Unity3D because it is multi-platform: once you build the application, you can deploy it to PC, Mac, Web, Android, and iOS.

The pre-requisites are:

  • Knowledge of Unity3D (my version is 2018.3.14)
  • Basic knowledge of OpenCV, plus the “OpenCV For Unity” plugin from the Unity3D asset store
  • A basic understanding of neural networks

There are 3 ways to do hand tracking using an RGB camera, namely the Haar Cascade way, the Contour way, and the Neural Network way. In the following sections, I am going to illustrate how to implement each of them, but before that, let me give a summary of the performance of each method and my opinion of it, so that you can feel free to jump to the sections you are interested in and skip the ones you are not.

Summary of the 3 ways

  • Haar Cascade way:

In short, it is super easy to implement, super quick, but super unstable.

This is the conventional method for tracking faces, but hands are unlike faces in that they do not have a fixed form. If you just want to find a static, standard hand pose (an open hand with the palm facing front), this method may work.

  • Contour way:

The concept is straightforward, it is easy to implement, there are many parameters you can tune to fit your use case, and it is not computationally expensive.

It works well if the user is using it in his or her room. However, on a mobile device, with changing light, a moving background, and other people around the user, it does not perform so well.

  • Neural Network way:

It is very cool; I mean, everything with a sprinkle of neural network is cool, isn’t it? Besides being cool, it has the best stability of the three methods, and it can cope with various situations. However, it is rather computationally expensive.

Ok, it is time to dive into the code, and you can check the full code here in my Github repo.

Haar Cascade Way

Haar Cascade is a method that uses machine learning to train a cascade of classifiers on simple image features and saves the result into an xml file. For more information on this topic, I suggest you take a look at this medium post.

To get Haar cascade detection running, we first need a pre-trained xml file (if you plan to train one by yourself, you can take a look at this post). I got the pre-trained xml from this repo. The file is called “palm.xml”, and as the name suggests, it is specifically trained to recognize palms. Once the file is downloaded, move it to the “StreamingAssets” folder in Unity.

To load the cascade file, just do this:

var cascadeFileName = Utils.getFilePath("palm.xml");
cascadeDetector = new CascadeClassifier();
cascadeDetector.load(cascadeFileName);

To run the detection, we do some necessary pre-processing: convert the image to gray-scale and equalize its histogram. Then we can call the “detectMultiScale” function; for the meaning of its parameters, I find this answer provides a good explanation.

MatOfRect hands = new MatOfRect();
Mat gray = new Mat(imgHeight, imgWidth, CvType.CV_8UC3);
Imgproc.cvtColor(image, gray, Imgproc.COLOR_BGR2GRAY);
Imgproc.equalizeHist(gray, gray);
cascadeDetector.detectMultiScale(gray, hands, 1.1, 2, 0 | Objdetect.CASCADE_DO_CANNY_PRUNING | Objdetect.CASCADE_SCALE_IMAGE | Objdetect.CASCADE_FIND_BIGGEST_OBJECT, new Size(10, 10), new Size());
OpenCVForUnity.CoreModule.Rect[] handsArray = hands.toArray();
if (handsArray.Length != 0)
{
    //Hand detected
}
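
As a quick sanity check, you can draw the detected rectangles onto the image before displaying it in Unity. The snippet below is just a minimal sketch of mine (not part of the original code), assuming "image" is the color Mat you are rendering:

//Sketch: draw each detected palm rectangle onto the image for debugging
foreach (var handRect in handsArray)
{
    Imgproc.rectangle(image, handRect.tl(), handRect.br(), new Scalar(255, 0, 0, 255), 2);
}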

Contour Way

The contour-based method is a straightforward concept. It is pure computer vision, without any fancy models. This method is heavily inspired by this post, so please check it out if you are interested.

Basically, there are two major steps involved.

  • Find the area in the image that matches human skin color
  • Find contour shapes that match fingers

Human skin color usually falls within a band of the color spectrum. To find it, we convert the color representation from RGB to YCrCb and check whether each pixel is within that range.

Mat YCrCb_image = new Mat();
int Y_channel = 0;
int Cr_channel = 1;
int Cb_channel = 2;
Imgproc.cvtColor(imgMat, YCrCb_image, Imgproc.COLOR_RGB2YCrCb);
//Mat.zeros takes (rows, cols), so the mask is created as imgHeight x imgWidth
var output_mask = Mat.zeros(imgHeight, imgWidth, CvType.CV_8UC1);
for (int i = 0; i < YCrCb_image.rows(); i++)
{
    for (int j = 0; j < YCrCb_image.cols(); j++)
    {
        double[] p_src = YCrCb_image.get(i, j);
        //Mark the pixel as skin if it falls inside the YCrCb skin-color range
        if (p_src[Y_channel] > 80 && p_src[Cr_channel] > 135 && p_src[Cr_channel] < 180 && p_src[Cb_channel] > 85 && p_src[Cb_channel] < 135)
        {
            output_mask.put(i, j, 255);
        }
    }
}

The end result is a mask in which the pixels that match human skin color are white and the rest of the area is black. After we get this mask, we extract its contour and convex hull, and detect the “defects”. “Defect” points are points on the contour line that are far from the convex hull. If the parameters of a “defect”, such as its angle and depth, satisfy certain criteria, then we know this “defect” corresponds to a finger.

Once we find enough of these “finger defects” (empirically, we define “enough” as more than 1 and fewer than 4), we tell the system that we have found a hand.

Illustration of a defect point. The blue line is the convex hull, the green line is the contour.

//Find contours in the mask image
List<MatOfPoint> contours = new List<MatOfPoint>();
Imgproc.findContours(maskImage, contours, new MatOfPoint(), Imgproc.RETR_EXTERNAL, Imgproc.CHAIN_APPROX_SIMPLE);

//Find the convex hull ("index" is the index of the contour of interest, e.g. the largest contour)
var points = new MatOfPoint(contours[index].toArray());
var hull = new MatOfInt();
Imgproc.convexHull(points, hull, false);

//Find defects
var defects = new MatOfInt4();
Imgproc.convexityDefects(points, hull, defects);

var start_points = new MatOfPoint2f();
var far_points = new MatOfPoint2f();

//Loop through the defects to see if each one satisfies the finger criteria
for (int i = 0; i < defects.size().height; i++)
{
    int ind_start = (int)defects.get(i, 0)[0];
    int ind_end = (int)defects.get(i, 0)[1];
    int ind_far = (int)defects.get(i, 0)[2];
    double depth = defects.get(i, 0)[3] / 256;

    //Side lengths of the triangle formed by the start, end and far points
    double a = Core.norm(contours[index].row(ind_start) - contours[index].row(ind_end));
    double b = Core.norm(contours[index].row(ind_far) - contours[index].row(ind_start));
    double c = Core.norm(contours[index].row(ind_far) - contours[index].row(ind_end));
    double angle = Math.Acos((b * b + c * c - a * a) / (2 * b * c)) * 180.0 / Math.PI;

    double threshFingerLength = ((double)maskImage.height()) / 8.0;
    double threshAngle = 80;
    if (angle < threshAngle && depth > threshFingerLength)
    {
        //Keep the start and far points of this "finger defect"
        start_points.push_back(contours[index].row(ind_start));
        far_points.push_back(contours[index].row(ind_far));
    }
}

//Check whether the number of defects found is within range
//(per the text above: min_defects_count = 1, max_defects_count = 4)
if (far_points.size().height > min_defects_count && far_points.size().height < max_defects_count)
{
    //Hand detected
}
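
Once a hand is confirmed, you usually want a region to hand off to the rest of your application. A minimal sketch of mine (not from the original code) is to take the bounding rectangle of the chosen contour and draw it onto the source image for debugging:

//Sketch: localize the hand with the bounding rectangle of the chosen contour
OpenCVForUnity.CoreModule.Rect handRect = Imgproc.boundingRect(contours[index]);
Imgproc.rectangle(imgMat, handRect.tl(), handRect.br(), new Scalar(0, 255, 0, 255), 2);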

Neural Network Way

In recent years, neural network based solutions have achieved better performance than traditional solutions on many tasks. This is especially true in the computer vision field. As our task is a computer vision task, it is natural that we want to use a neural network.

In fact, object detection is an intensely researched domain, and there are already very good Github projects available for hand tracking. One good example is this repo. It uses a classic object detection architecture, SSD (Single Shot Detector). SSD is famous for its sheer speed at inference time, making it suitable for real-time applications. To further boost its speed, the model has been adapted with a MobileNet backbone, ending up with the SSD-MobileNet structure. The pre-trained model is only ~20 MB, which is super petite in the neural network model kingdom.

However, these solutions are written in Python and depend on the Tensorflow framework. How can we use them in Unity3D?

I encountered a few posts along the way. One post suggests launching a Python process alongside Unity3D, but it was far too slow when I tried it, so I immediately gave up on it. Another popular solution is TensorflowSharp, a C# binding for the Tensorflow framework. I tried that too; it is faster than the previous solution, but still too slow for a real-time application (the frame rate is not even 1 fps).

Finally, the OpenCV DNN module came to the rescue. This module is part of the OpenCVForUnity plugin, and there are some example scenes showing how to use it. The examples make everything look easy, but they are not the whole story: if you want to use a custom model, there are extra steps to convert the model into a format that OpenCV understands.

So let’s start with the conversion of the model. OpenCV requires a specific “pbtxt” file to load the model properly. To generate this file, we need the frozen graph of the model and its pipeline configuration.
I got the respective files (frozen_inference_graph.pb and ssd_mobilenet_v1_coco.config) from this project; please refer to my repo for the implementation of the conversion. The generated frozen_inference_graph.pbtxt file is also included in the repo.

Once we have the frozen_inference_graph.pb and frozen_inference_graph.pbtxt files ready, we move them to the “StreamingAssets” folder in Unity.

To load the model:

var modelPath = Utils.getFilePath("frozen_inference_graph.pb");
var configPath = Utils.getFilePath("frozen_inference_graph.pbtxt");
tfDetector = Dnn.readNetFromTensorflow(modelPath, configPath);

To do detection:

var blob = Dnn.blobFromImage(image, 1, new Size(300, 300), new Scalar(0, 0, 0), true, false);
tfDetector.setInput(blob);
Mat prob = tfDetector.forward();
Mat newMat = prob.reshape(1, (int)prob.total() / prob.size(3));

float maxScore = 0;
int scoreInd = 0;
for (int i = 0; i < newMat.rows(); i++)
{
    var score = (float)newMat.get(i, 2)[0];
    if (score > maxScore)
    {
        maxScore = score;
        scoreInd = i;
    }
}

if (maxScore > thresholdScore)
{
    // hand detected
}
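
The loop above only picks the most confident detection; to actually use it, you still need its bounding box. The following is a minimal sketch of mine, assuming the standard SSD detection layout where each row is [image id, class id, score, left, top, right, bottom] with coordinates normalized to [0, 1]:

//Sketch: recover the pixel-space bounding box of the best detection
float left = (float)newMat.get(scoreInd, 3)[0] * image.cols();
float top = (float)newMat.get(scoreInd, 4)[0] * image.rows();
float right = (float)newMat.get(scoreInd, 5)[0] * image.cols();
float bottom = (float)newMat.get(scoreInd, 6)[0] * image.rows();
var handRect = new OpenCVForUnity.CoreModule.Rect(new Point(left, top), new Point(right, bottom));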

One parameter you may want to pay attention to is the size value in the “blobFromImage” function. The image is resized to this value before being passed to the model. The recommended size is 300, as this is the size of the training images for the model, but if frame rate is crucial to your application, this value can be reduced to 150.
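
For example, the reduced-resolution call would only differ in the Size argument (a sketch of the trade-off mentioned above; detection accuracy will drop, so treat 150 as a tuning knob rather than a recommendation):

//Sketch: shrink the network input to gain frame rate at the cost of accuracy
var blob = Dnn.blobFromImage(image, 1, new Size(150, 150), new Scalar(0, 0, 0), true, false);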

Conclusion

You may have noticed that what we have done so far has a big limitation: we can only track the hand as a whole, while finger tracking is not included in the package. There are certainly ways to track fingers too. One trendy way is to use neural network models to estimate 3D hand pose from an RGB image, and this is an example repo. I have not tried it yet, but judging from the size of the pre-trained model (~140 MB), you definitely need a beefy machine to get it running at near real-time speed. I am sure one day it will become a feasible solution even for mobile devices, but meanwhile, hardware like Leap Motion and Kinect remains the best alternative.
