Play Hangman with your hand gestures using TensorflowJS

A hangman game played in ASL, built with the pre-trained HandPose detection model in TensorflowJS

Tensorflow has been a powerful tool for machine learning tasks for quite a while. But did you know that ML tasks can also run right in the browser? The learning curve for the Tensorflow Python library is notoriously steep for beginners in ML, let alone for people coming to ML from other backgrounds. In this article, I will show you the wonderful world of machine learning in the browser and how easy it is to get started with tools you already know for the web, particularly Javascript.

In this article we will:

  • quickly go over what ML is
  • see how the pre-trained HandPose model works to recognize gestures
  • create a Hangman game played using sign language

Have a look at the finished project here: ASL hangman

A quick primer on Machine Learning (skip if you like)

Machine Learning is essentially the opposite of what you have been doing in software development. Instead of giving the computer a defined set of steps, you give it lots and lots of data; the machine figures out patterns in that data and uses them to solve a particular task, somewhat like how our brains work. There are, famously, two ways you can approach an ML task:

  1. Supervised Learning - you give the machine a bunch of observations along with the outcome (label) for each one, and train it on them. When you then present an unknown observation, the machine predicts an outcome based on that training data. This is only the high-level idea; the rabbit hole goes a lot deeper, and I encourage you to explore it.
  2. Unsupervised Learning - in this type of training you only show the machine the observations, not the outcomes or labels. The machine figures out the patterns in the dataset by itself and usually groups/clusters data points with similar properties. When you present new, unknown data, the trained model assigns it to the cluster whose properties it most resembles. This kind of ML task is a bit harder to grasp; most of the time you will be using supervised learning.

In this article, we will do neither, though! Just kidding. We could technically take a dataset, train a model and analyse its behaviour on new data points, but often you don't need to reinvent the wheel. A useful skill for an engineer to master is adapting already existing solutions to your particular problem. By an already existing solution I mean that collecting the data, training and evaluating the model has already been done, so you can get started with the actual fun part: building something cool with your new ML superpowers!

The ASL Hangman Game

We could of course play hangman with traditional keyboard input, but where's the fun in that? Let's learn how to build a hangman game that recognizes letter inputs from our hand gestures. Now please don't fret, as it isn't as hard a task as it sounds. We will be using the pre-trained HandPose model provided by TensorflowJS. You can have a look at it here github.com/tensorflow/tfjs-models/tree/mast.. The great thing is that in order to run the model all you have to do is insert a couple of script tags (or install the npm packages) and the desired model is loaded and ready to run for you. So we will first inspect the HandPose model and explore its features. Go ahead and feel free to run the barebones demo on Codepen.

{% codepen codepen.io/Silver1/pen/BadjVPP default-tab=js,html %}
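If you just want to see the core calls the demo makes before poking around the Codepen, here is a minimal sketch. It assumes a video element on the page is already streaming your webcam (the Codepen handles that setup) and uses the same imports the full project uses later in this article:

import * as tf from "@tensorflow/tfjs";
import * as handpose from "@tensorflow-models/handpose";

const detect = async () => {
  // assumes a <video> element that is already playing the webcam stream
  const video = document.querySelector("video");
  await tf.setBackend("webgl");         // run the model on the GPU via WebGL
  const model = await handpose.load();  // downloads the pre-trained weights
  const predictions = await model.estimateHands(video);
  console.log(predictions);             // this is the array we inspect below
};

detect();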

Inspecting the returned predictions in the Chrome console:

[{…}]
  0:
    annotations: {thumb: Array(4), indexFinger: Array(4), middleFinger: Array(4), ringFinger: Array(4), pinky: Array(4), …}
    boundingBox: {topLeft: Array(2), bottomRight: Array(2)}
    handInViewConfidence: 0.9999997615814209
    landmarks: (21) [Array(3), Array(3), Array(3), Array(3), Array(3), Array(3), Array(3), Array(3), Array(3), Array(3), Array(3), Array(3), Array(3), Array(3), Array(3), Array(3), Array(3), Array(3), Array(3), Array(3), Array(3)]
    [[Prototype]]: Object
  length: 1
  [[Prototype]]: Array(0)

If you inspect the array of objects called predictions in the browser console, you will find a number of useful fields; among them, handInViewConfidence gives the probability with which a hand was detected, which you can use to ignore shaky detections.
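For instance, a minimal guard could look like this (the 0.9 threshold is an arbitrary value picked for illustration, not something the model prescribes):

const predictions = await model.estimateHands(video);

// Skip frames where the model is not reasonably sure a hand is in view.
if (predictions.length > 0 && predictions[0].handInViewConfidence > 0.9) {
  const { annotations } = predictions[0];
  // ...work with annotations.thumb, annotations.indexFinger, etc.
}

For our purpose, the annotations key is of particular interest. Let's take a closer look at it: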

   [{…}]
     0:
       annotations:
         indexFinger: Array(4)
           0: (3) [389.5245886969721, 244.7159004390616, -0.30365633964538574]
           1: (3) [381.65693557959514, 181.97510097266763, -3.5919628143310547]
           2: (3) [374.36188515696244, 132.26145430768776, -8.026983261108398]
           3: Array(3)
           length: 4
           [[Prototype]]: Array(0)
         middleFinger: (4) [Array(3), Array(3), Array(3), Array(3)]
         palmBase: [Array(3)]
         pinky: (4) [Array(3), Array(3), Array(3), Array(3)]
         ringFinger: (4) [Array(3), Array(3), Array(3), Array(3)]
         thumb: (4) [Array(3), Array(3), Array(3), Array(3)]
         [[Prototype]]: Object

You will see that it contains a key for each of the five fingers (plus palmBase), and upon expanding each finger's key we find four nested arrays of x, y and z co-ordinates, one for each landmark along the finger. With a little experimentation we further discover that the 0th element is the base of the finger, the 3rd element is the tip of the finger, and the two arrays in between correspond to the two knuckles of each finger.
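In code, reading one of these landmarks is just array indexing on the first detected hand, for example:

const { annotations } = predictions[0];

// Each finger is an array of four [x, y, z] landmarks:
// index 0 is the base of the finger and index 3 is the fingertip.
const [baseX, baseY, baseZ] = annotations.indexFinger[0];
const [tipX, tipY, tipZ] = annotations.indexFinger[3];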

Detecting gestures

We will use simple mathematics to recognize certain sign language gestures. Take a look at the ASL chart for reference

ASL chart from Google images

We will only concentrate on the vowels for simplicity. So let's start with A. You can see that in this gesture only the thumb is upright and the other four fingers are folded. So if we want to define this gesture, the y co-ordinate of the tip of the thumb must be lower than those of the other four fingertips. We say lower because the co-ordinate system of the browser canvas is flipped vertically compared to what we are used to in maths: y increases as you go down the screen, so a finger that is higher up has a smaller y value. Hmm, all good, but how do we get the co-ordinates of those fingers? Ahh, the HandPose model to the rescue. Remember how the model returns the co-ordinates of every landmark of all five fingers? Let's use those.

To recognize the A gesture we could do the calculations like so

// y co-ordinates of the fingertips (landmark 3 of each finger)
const thumbTip = predictions[0].annotations.thumb[3][1];
const indexTip = predictions[0].annotations.indexFinger[3][1];
const middleFTip = predictions[0].annotations.middleFinger[3][1];
const ringFTip = predictions[0].annotations.ringFinger[3][1];
const pinkyTip = predictions[0].annotations.pinky[3][1];
// y co-ordinates of the first knuckle (landmark 1) of each finger
const indexBase1 = predictions[0].annotations.indexFinger[1][1];
const middleFBase1 = predictions[0].annotations.middleFinger[1][1];
const ringFBase1 = predictions[0].annotations.ringFinger[1][1];
const pinkyBase1 = predictions[0].annotations.pinky[1][1];
// We need to check that the tip of the thumb is higher than the other four
// fingers. A finger counts as folded when its tip sits below its first knuckle
// (remember: a larger y value means lower on the screen).
const otherFourFingersFolded =
      indexTip > indexBase1 &&
      middleFTip > middleFBase1 &&
      ringFTip > ringFBase1 &&
      pinkyTip > pinkyBase1;
// the entire condition to check for the A gesture goes like so
if (
  thumbTip < Math.min(indexTip, middleFTip, ringFTip, pinkyTip) &&
  otherFourFingersFolded
) {
  console.log("a");
}

Similarly, we combine simple maths with if-else statements to check for the other gestures, and the whole thing looks like this.

const thumbTip = predictions[0].annotations.thumb[3][1];
const indexTip = predictions[0].annotations.indexFinger[3][1];
const middleFTip = predictions[0].annotations.middleFinger[3][1];
const ringFTip = predictions[0].annotations.ringFinger[3][1];
const pinkyTip = predictions[0].annotations.pinky[3][1];
const indexBase1 = predictions[0].annotations.indexFinger[1][1];
const middleFBase1 = predictions[0].annotations.middleFinger[1][1];
const ringFBase1 = predictions[0].annotations.ringFinger[1][1];
const pinkyBase1 = predictions[0].annotations.pinky[1][1];
const diffThumbIndex = thumbTip - indexTip;
const diffIndexMiddle = indexTip - middleFTip;
const otherFourFingersFolded =
      indexTip > indexBase1 &&
      middleFTip > middleFBase1 &&
      ringFTip > ringFBase1 &&
      pinkyTip > pinkyBase1;
if (diffThumbIndex >= 20 && diffIndexMiddle <= 0) {
    // O: thumb tip at least 20px below the index tip, index tip not below the middle tip
    console.log("o");
} else if (pinkyTip < Math.min(middleFTip, ringFTip, indexTip)) {
    // I: pinky tip higher than the index, middle and ring tips
    console.log("i");
} else if (
    thumbTip < Math.min(indexTip, middleFTip, ringFTip, pinkyTip) &&
    otherFourFingersFolded
) {
    // A: thumb tip higher than all four fingertips while those fingers are folded
    console.log("a");
} else if (
    thumbTip > Math.max(indexTip, middleFTip, ringFTip, pinkyTip) &&
    !(diffThumbIndex >= 20 && diffIndexMiddle <= 0)
) {
    // E: thumb tip below all four fingertips
    console.log("e");
} else if (diffThumbIndex > 100 && diffIndexMiddle <= 20) {
    // U: thumb tip more than 100px below the index tip, index and middle tips at a similar height
    console.log("u");
}

Now for the fun part: let's incorporate this new superpower into a hangman game and give the traditional game a nice twist.

I have created a barebones hangman script for you; feel free to customise it and add your personal touch. The basic concept is that you fill in the missing letters by showing a gesture to your webcam, and the model decodes that gesture into a possible letter. The code structure is very simple and has no external UI/JS framework dependencies. I like keeping related functionality in separate files, so since we have two distinct concerns I have put the hangman logic in hangman.js and the HandPose model code in index.js. The output of these is displayed in the index.html file. The entire project is made using the vanilla JS template from Codesandbox with Parcel as the bundler.
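To give you an idea of how the two halves meet, here is an illustrative sketch only: the secret word, the revealed state and the guessLetter/onGuess helpers are hypothetical and not taken from the actual hangman.js. The one grounded piece is the input element with id "letter", which index.js fills with the decoded gesture:

// Illustrative sketch; wire onGuess() to a button click or a timer as you prefer.
const inputLetter = document.getElementById("letter"); // filled in by index.js

const secretWord = "banana";                  // hypothetical example word
const revealed = Array(secretWord.length).fill("_");

// Reveal every occurrence of the guessed letter in the secret word.
function guessLetter(letter) {
  secretWord.split("").forEach((ch, i) => {
    if (ch === letter) revealed[i] = ch;
  });
  return revealed.join(" ");
}

function onGuess() {
  const letter = inputLetter.value.toLowerCase();
  console.log(guessLetter(letter));
  // ...update the on-screen word, remaining attempts, etc.
}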

It is worth spending some time looking at how we set up the ML model to run in index.js

import * as tf from "@tensorflow/tfjs";
import * as handpose from "@tensorflow-models/handpose";
let video;
let model;
const init = async () => {
  video = await loadVideo();
  await tf.setBackend("webgl");
  model = await handpose.load();
  main();
};
const loadVideo = async () => {
  const video = await setupCamera();
  video.play();
  return video;
};
const setupCamera = async () => {
  if (!navigator.mediaDevices || !navigator.mediaDevices.getUserMedia) {
    throw new Error(
      "Browser API navigator.mediaDevices.getUserMedia not available"
    );
  }
  video = document.querySelector("video");
  video.width = window.innerWidth;
  video.height = window.innerHeight;
  const stream = await navigator.mediaDevices.getUserMedia({
    audio: false,
    video: {
      facingMode: "user",
      width: window.innerWidth,
      height: window.innerHeight
    }
  });
  video.srcObject = stream;
  return new Promise(
    (resolve) => (video.onloadedmetadata = () => resolve(video))
  );
};
init();
async function main() {
  const predictions = await model.estimateHands(
    document.querySelector("video")
  );
  if (predictions.length > 0) {
    const thumbTip = predictions[0].annotations.thumb[3][1];
    const indexTip = predictions[0].annotations.indexFinger[3][1];
    const middleFTip = predictions[0].annotations.middleFinger[3][1];
    const ringFTip = predictions[0].annotations.ringFinger[3][1];
    const pinkyTip = predictions[0].annotations.pinky[3][1];
    const indexBase1 = predictions[0].annotations.indexFinger[1][1];
    const middleFBase1 = predictions[0].annotations.middleFinger[1][1];
    const ringFBase1 = predictions[0].annotations.ringFinger[1][1];
    const pinkyBase1 = predictions[0].annotations.pinky[1][1];
    const diffThumbIndex = thumbTip - indexTip;
    const diffIndexMiddle = indexTip - middleFTip;
    const otherFourFingersFolded =
      indexTip > indexBase1 &&
      middleFTip > middleFBase1 &&
      ringFTip > ringFBase1 &&
      pinkyTip > pinkyBase1;
    const inputLetter = document.getElementById("letter");

    if (diffThumbIndex >= 20 && diffIndexMiddle <= 0) {
      // O: thumb tip at least 20px below the index tip, index tip not below the middle tip
      inputLetter.value = "o";
    } else if (pinkyTip < Math.min(middleFTip, ringFTip, indexTip)) {
      // I: pinky tip higher than the index, middle and ring tips
      inputLetter.value = "i";
    } else if (
      thumbTip < Math.min(indexTip, middleFTip, ringFTip, pinkyTip) &&
      otherFourFingersFolded
    ) {
      // A: thumb tip higher than all four fingertips while those fingers are folded
      inputLetter.value = "a";
    } else if (
      thumbTip > Math.max(indexTip, middleFTip, ringFTip, pinkyTip) &&
      !(diffThumbIndex >= 20 && diffIndexMiddle <= 0)
    ) {
      // E: thumb tip below all four fingertips
      inputLetter.value = "e";
    } else if (diffThumbIndex > 100 && diffIndexMiddle <= 20) {
      // U: thumb tip more than 100px below the index tip, index and middle tips at a similar height
      inputLetter.value = "u";
    }
  }
  requestAnimationFrame(main);
}

After importing the necessary libraries, the init method sets up the webcam stream on the video element in index.html and loads the model. The model then runs on each frame of the webcam feed and saves its output in a placeholder called predictions. Once you have your predictions, you plug in your logic, as we did with the finger co-ordinates. The hangman game gets its letter inputs from this part of the project and plays accordingly. You can view the full working project here: ASL hangman. That's it, folks! In this article, you learned the basic concepts of Machine Learning and saw how you can build fun stuff in the browser with already existing models.