Number Hand Gestures Recognition Using TensorFlow.js

Theory

As the demand for data-driven products grows, the data science community has been rapidly developing solutions that let us create and apply all the recent revolutionary advances in artificial intelligence across multiple platforms. In the early years of the so-called AI era, it was very common for a deep learning model to run inside a script. But as the scope of our problems and their requirements evolved, deep learning frameworks were ported to other platforms such as IoT devices, mobile devices, and the browser.

To answer the demand for a battle-proven and browser-centric solution, in March 2018 the TensorFlow team released TensorFlow.js, a library that allows web and JavaScript developers to develop and train machine learning models in JavaScript and deploy them right in the browser.

Like its larger and more complete counterpart, TensorFlow.js provides many tools and off-the-shelf models, such as MobileNet, that simplify the already complicated and time-consuming task of training a deep learning model from scratch. It provides the means to convert pre-trained TensorFlow models from Python into the TensorFlow.js format, supports transfer learning (a technique for fine-tuning pre-existing models with a small amount of custom data), and even offers a way to build ML solutions without dealing with low-level implementations, through the ml5.js library.
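To give a sense of how little code this takes, here is a minimal illustrative sketch (not part of this project) of loading a converted Layers-format model in the browser and running a single prediction; the model URL is a placeholder:

// Illustrative sketch: load a converted model and run one prediction.
// The URL is a placeholder for wherever your converted model.json is hosted.
async function quickDemo() {
  const model = await tf.loadLayersModel('https://example.com/my-model/model.json');
  // A dummy batch containing one 224 x 224 RGB image
  const input = tf.zeros([1, 224, 224, 3]);
  const output = model.predict(input);
  output.print();
}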

In this article I will use MobileNetV1 from tfhub.dev. MobileNets are a class of small, low-latency, low-power models that can be used for classification, detection, and other common tasks for which convolutional neural networks are suitable. Due to their small size, they are considered excellent deep learning models for use on mobile devices.

As a quick comparison, the size of the full VGG16 network on disk is about 553 megabytes. One of the largest MobileNet networks currently is around 17 megabytes in size, so that’s a huge difference, especially when you’re thinking of deploying a model in a mobile app or running it in a browser.

╔═══════════╦═════════╦═════════════╗
║ Model     ║ Size    ║ Parameters  ║
╠═══════════╬═════════╬═════════════╣
║ VGG16     ║ 553 MB  ║ 138,000,000 ║
║ MobileNet ║ 17 MB   ║ 4,200,000   ║
╚═══════════╩═════════╩═════════════╝

This huge difference in size is due to the number of parameters within these networks. For example, VGG16 has 138 million parameters, while the 17MB MobileNet mentioned above has only 4.2 million.

Now, while MobileNets are faster and smaller than large networks like VGG16, there is a tradeoff: accuracy. But don't let that disappoint you.

Yes, MobileNets are usually not as accurate as these larger, resource-demanding models, but they still perform very well, with only a relatively small reduction in accuracy. The MobileNets paper elaborates on this tradeoff if you want to study it further.

Practice

To learn how to classify several different classes from a webcam feed in a short amount of time, we will fine-tune a pretrained MobileNet model, using the output of its internal conv_pw_13_relu layer as the input to our new model.

So, to get things done, we actually need two models:

  • The first model is a pretrained MobileNet that is truncated to output the internal activations. This model is not trained after being loaded into the browser.
  • The second model takes the output of the truncated MobileNet's internal layer as input and predicts probabilities for each of the output classes, which are the digits 0 to 9. This is the model we will actually train right in the browser (a short sketch of this data flow follows the list).
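To make the split concrete, the whole pipeline boils down to two predict calls; this is only an illustrative preview of the code explained step by step below:

// Conceptual preview (illustrative): the frozen MobileNet produces activations,
// and a small trainable head turns them into digit probabilities
const activations = mobilenet.predict(capture());   // pretrained, frozen
const probabilities = model.predict(activations);   // trained in the browser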

First of all, we need to define a web page layout for this example. The stream from the webcam will be shown using a <video> HTML element:

<video autoplay playsinline muted id="camera" width="224" height="224"></video>

The prediction from our model will be displayed in this <span> element:

<span id="prediction">0</span>

Then we need a group of 10 buttons (one for each digit) that will be used to label the training samples:

// The id of this button defines the label for the training sample
<button type="button" id="0" onclick="handleButton(this)">0</button>
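For orientation, here is a rough sketch of how these fragments could be assembled into a single page. The TensorFlow.js CDN URL and the index.js file name are assumptions, while the samples_0 counter <span> matches the element id used by the code later in this article:

<!-- Rough page sketch (illustrative) -->
<script src="https://cdn.jsdelivr.net/npm/@tensorflow/tfjs"></script>
<video autoplay playsinline muted id="camera" width="224" height="224"></video>
<div>
  <!-- One labeling button per digit; only 0 is shown, 1-9 follow the same pattern -->
  <button type="button" id="0" onclick="handleButton(this)">0</button>
  <!-- Counter for the number of samples collected for this digit -->
  <span id="samples_0">0</span>
</div>
<span id="prediction">0</span>
<!-- index.js is an assumed name for the script containing the functions below -->
<script src="index.js"></script>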

Web camera and MobileNet initialization

After that, we need to initialize the webcam, pass its stream to the video component and prepare the MobileNet network:

// Adjust the size of the webcam viewport
// while maintaining the original aspect ratio
function adjustVideoSize(width, height) {
  const aspectRatio = width / height;
  if (width >= height) {
    webcamElement.width = aspectRatio * webcamElement.height;
  } else if (width < height) {
    webcamElement.height = webcamElement.width / aspectRatio;
  }
}

async function setup() {
  return new Promise((resolve, reject) => {
    // Prompt the user for permission to use the webcam and
    // get the stream of media content from it
    if (navigator.mediaDevices.getUserMedia) {
      navigator.mediaDevices.getUserMedia(
          {video: {width: 224, height: 224}}).then(stream => {
        webcamElement.srcObject = stream;
        webcamElement.addEventListener('loadeddata', async () => {
          // Take the intrinsic height and width of the video track
          adjustVideoSize(
              webcamElement.videoWidth,
              webcamElement.videoHeight);
          resolve();
        }, false);
      }).catch(error => {
        reject(error);
      });
    } else {
      reject();
    }
  });
}

// Download and prepare the MobileNet model
// The base model used in this example is MobileNet
// with a width multiplier of 1.0 and an input image size of 224 x 224
async function loadMobilenet() {
  const mobileNetModel = await tf.loadLayersModel(
      'https://storage.googleapis.com/tfjs-models/tfjs/mobilenet_v1_1.0_224/model.json');
  // Pick an intermediate depthwise convolutional layer
  const layer = mobileNetModel.getLayer('conv_pw_13_relu');
  mobilenet = tf.model({inputs: mobileNetModel.inputs,
                        outputs: layer.output});
}
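The snippets above do not show how these functions are invoked or where the global variables live; here is a hedged sketch of the start-up wiring (the init() function and the variable declarations are my assumption, not copied from the project):

// Assumed global state and start-up wiring (illustrative)
let webcamElement, mobilenet, model;
let isPredicting = false;

async function init() {
  webcamElement = document.getElementById('camera');
  await setup();          // ask for webcam access and size the <video> element
  await loadMobilenet();  // download and truncate the MobileNet model
}

init();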

Frame capturing

Here we need to capture raw images from the webcam and preprocess them for use in our deep learning models:

function cropImage(img) {
  const size = Math.min(img.shape[0], img.shape[1]);
  // Find the center of the image
  const centerHeight = img.shape[0] / 2;
  const centerWidth = img.shape[1] / 2;
  // Find the starting points for the cropped image
  const beginHeight = centerHeight - (size / 2);
  const beginWidth = centerWidth - (size / 2);
  return img.slice([beginHeight, beginWidth, 0], [size, size, 3]);
}

function capture() {
  // tf.tidy() executes the provided function and, after
  // it is executed, cleans up all intermediate tensors
  // allocated by that function (except the returned ones)
  return tf.tidy(() => {
    // Create a tf.Tensor from an image
    const webcamImage = tf.browser.fromPixels(webcamElement);
    // Reverse the image horizontally (mirror the webcam feed)
    const reversedImage = webcamImage.reverse(1);
    // Crop the image to a square with 3 channels (RGB)
    const croppedImage = cropImage(reversedImage);
    const batchedImage = croppedImage.expandDims(0);
    // Normalize pixel values from [0, 255] to [-1, 1]
    return batchedImage.toFloat()
        .div(tf.scalar(127))
        .sub(tf.scalar(1));
  });
}
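As a quick sanity check, you can call capture() from the browser console once the webcam is running; this illustrative snippet just inspects the tensor shape and frees the memory afterwards:

// Illustrative usage: inspect the preprocessed frame and dispose of it
const frame = capture();
console.log(frame.shape);  // expected: [1, 224, 224, 3] for a 224 x 224 stream
frame.dispose();           // tensors returned from tf.tidy() must be disposed manually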

Creating one training sample

A single training sample consists of the MobileNet output for an image captured from the webcam, together with the user-provided ground-truth label:

function addExample(example, label) {
  if (xs == null) {
    xs = tf.keep(example);
  } else {
    const oldX = xs;
    xs = tf.keep(oldX.concat(example, 0));
    oldX.dispose();
  }
  labels.push(label);
}

// This function is called when the user clicks one of the label buttons
function handleButton(elem) {
  // Get the ground-truth label from the id
  // of the button that the user clicked
  let label = parseInt(elem.id);
  array[label]++;
  // Update the corresponding sample counter
  document.getElementById("samples_" + elem.id).innerText =
      array[label].toString();
  // Capture an image from the webcam feed
  const img = capture();
  // And pass it to the MobileNet model, then save its output
  addExample(mobilenet.predict(img), label);
}
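Note that xs, labels, and array are global variables that never appear with declarations in the snippets above; a hedged sketch of how they might be initialized is shown below (the names follow the code, but the initialization itself is my assumption):

// Assumed global training state (illustrative)
let xs = null;                        // concatenated MobileNet activations for all samples
let ys = null;                        // one-hot encoded labels, built later by encodeLabels()
const labels = [];                    // raw integer labels (0-9), one per sample
const array = new Array(10).fill(0);  // per-digit sample counters shown in the UI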

Model training

This is where the actual training process occurs. Before training, we need to encode the user-provided labels as one-hot vectors using the encodeLabels() function. These vectors are then used as target labels during model training:
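For intuition, here is what a single one-hot encoded label looks like (an illustrative snippet, not part of the project code):

// Illustrative: one-hot encoding the label 3 with 10 classes
const y = tf.oneHot(tf.tensor1d([3]).toInt(), 10);
y.print();  // [[0, 0, 0, 1, 0, 0, 0, 0, 0, 0]]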

function encodeLabels(numClasses) {
  for (let i = 0; i < labels.length; i++) {
    const y = tf.tidy(() => {
      return tf.oneHot(tf.tensor1d([labels[i]]).toInt(), numClasses);
    });
    if (ys == null) {
      // tf.keep() keeps a tf.Tensor generated inside
      // a tf.tidy() from being disposed automatically
      ys = tf.keep(y);
    } else {
      const oldY = ys;
      ys = tf.keep(oldY.concat(y, 0));
      // tf.dispose() disposes any tf.Tensors found
      // within the provided object
      oldY.dispose();
      y.dispose();
    }
  }
}

async function train() {
  ys = null;
  // Encode labels as one-hot vectors
  encodeLabels(10);
  model = tf.sequential({
    layers: [
      // Simply take the output of the last layer
      // of our truncated MobileNet model and flatten it
      tf.layers.flatten({inputShape: mobilenet.outputs[0].shape.slice(1)}),
      // Then pass the result to the dense layer - the 'core'
      // of our second fine-tuning model
      tf.layers.dense({units: 100, activation: 'relu'}),
      // The output layer gives us probabilities for each
      // of the output classes
      tf.layers.dense({units: 10, activation: 'softmax'})
    ]
  });
  // Compile the fine-tuning model using the Adam optimizer
  // and the categorical crossentropy loss function
  model.compile({optimizer: tf.train.adam(0.0001),
                 loss: 'categoricalCrossentropy'});
  let loss = 0;
  // Train the model for 10 epochs and report
  // the loss value after each batch
  await model.fit(xs, ys, {
    epochs: 10,
    callbacks: {
      onBatchEnd: async (batch, logs) => {
        loss = logs.loss.toFixed(5);
        console.log('Loss: ' + loss);
      }
    }
  });
}

// This function is called when the user clicks the Train button
async function doTraining() {
  // Wait for training to finish before showing the alert
  await train();
  alert("Training Done!");
}
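The Train button that triggers doTraining() is not shown in the markup above; a minimal sketch of it, consistent with the handler name, could look like this:

<!-- Illustrative markup for the Train button -->
<button type="button" onclick="doTraining()">Train</button>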

Inference process

The process of recognizing digits is quite similar to the training process itself: take an image from the webcam stream, run it through the MobileNet model, pass the resulting activations to the trained fine-tuning model, and take the class with the maximum probability:

async function predict() {
  while (isPredicting) {
    const predictedClass = tf.tidy(() => {
      const img = capture();
      const activation = mobilenet.predict(img);
      const predictions = model.predict(activation);
      return predictions.as1D().argMax();
    });
    document.getElementById("prediction").innerText =
        (await predictedClass.data())[0];
    predictedClass.dispose();
    await tf.nextFrame();
  }
}

// This function is called when the user clicks
// the Start/Stop Predicting buttons
function setPredicting(predicting) {
  isPredicting = predicting;
  predict();
}
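Similarly, the Start/Stop Predicting buttons can be wired to setPredicting(); this markup is illustrative rather than taken from the project:

<!-- Illustrative markup for the prediction controls -->
<button type="button" onclick="setPredicting(true)">Start Predicting</button>
<button type="button" onclick="setPredicting(false)">Stop Predicting</button>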

Results

For the best results, you will need about 20 samples for each class. Please note that too many samples may cause the tab to crash. To create one training sample, show the desired gesture and press one of the 0-9 buttons to label it. Then train the model by clicking the Train button, wait for the alert, and start the gesture recognition process by clicking the Start Predicting button:

After the training process is finished, an alert pops up
Just as expected, this gesture is correctly recognized by our fine-tuned model

The full project is available here on my GitHub profile.
