Gesture-Controlled Drone using Hand Pose Estimation — Part 2

Vishal
Warwick Artificial Intelligence
13 min read · Nov 15, 2021


Hey everyone, welcome to the second and final part of our gesture-controlled drone project! In part 1 we developed the code for real-time hand pose estimation and hand classification, using a webcam as input. Today, we will be adding to the codebase from part 1, to do the following:

  • Use the 21 3D landmark points on each hand to check the orientation of our hands and ensure they are in their correct initial orientation required for gesture control.
  • Compute and store geometric data that represents the initial orientation of our hands.
  • Compute vectors between different points on each hand in real-time and use them in conjunction with the stored geometric data to compute raw tilt/acceleration values in each axis.
  • Perform basic mathematical operations to convert the raw values into controller values and transmit them to the drone.

Don’t worry if the above steps seem complicated; they are quite easy to implement, especially since we are using Python. All the source code and required files for Part 2 are available here.

Import additional Libraries 📄

In addition to the libraries and variables we defined at the start of our code in Part 1, we need to add a few more. Download the files tello.py and stats.py and put them in the same folder as the code we are developing.

Note: The tello.py file available to you only has the method definitions but doesn’t actually do anything as you don’t have the drone to test it on at the moment. During the live demo session, this file will be swapped for a fully implemented one that transmits commands and receives data to/from the drone. This fully implemented file will also be available on the GitHub repository after the demo session, if you would like to analyse exactly how the computer communicates with the drone.

We need to update the block of import statements to include:

  • The Plane class from SymPy, which allows us to define and store planes in 3D space.
  • The Tello class from tello.py, which will allow us to interface with the drone.
  • Python’s inbuilt time library.

So the following are all the import statements used in our final code:

import cv2 as cv
import mediapipe as mp
import numpy as np
from sympy import Plane
from tello import Tello
import time

Then, we add a few more lines of code just below src = cv.VideoCapture(0) to connect to the drone, put it in command mode (so that it receives commands from the laptop via the Tello SDK) and instruct it to auto takeoff. We also initialise some global variables that will be used in the main loop of our code.

The code to be inserted is shown in the green box
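As a rough sketch, the inserted block does something like this (the Tello method names here are assumptions until the fully implemented tello.py is released; the global variable names match the ones used later in this article):

tello = Tello()
# Put the drone into SDK 'command' mode so it accepts commands from the laptop
tello.send_command('command')   # assumed method name
# Instruct the drone to take off automatically
tello.send_command('takeoff')

# Global variables used in the main loop
command_timer = time.time()   # timestamp of the last command sent to the drone
rightPlane_N = None           # normal vector of the right palm's initial plane
rightPlane_Orth_N = None      # normal vector of the perpendicular plane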

The command_timer variable is used to store the timestamp of the last command sent to the drone. We will use this to do 2 things:

  1. Ensure that if our hands are not detected or are suddenly removed during active control of the drone, we send a default command that makes the drone stop moving and hover in place.
  2. Prevent the drone from autolanding before the program initialises. This is due to the drone being programmed by default to auto land if it has not received a command in more than 5 seconds.

Such failsafes are vital when developing code as they prevent unpredictable/unsafe behaviour during unexpected scenarios such as the removal of hands or loss of connection between the drone and the controlling computer. Insert the following code immediately within while src.isOpened() to implement the described failsafe behaviour.
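A minimal sketch of that failsafe (the 4-second threshold and send_command are assumptions; the drone itself auto lands after 5 seconds without a command):

# If nothing has been sent recently, tell the drone to stop moving and
# hover in place - this also keeps the 5-second auto-land timer fed
if time.time() - command_timer > 4:
    tello.send_command('rc 0 0 0 0')   # assumed method name
    command_timer = time.time()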

Pose Initialisation 🙌

Firstly, we need to ensure our hands and their major joints are in the correct orientation before we start using them to control the drone. To do this, we will first define a helper function called calcAngle(l1, l2, l3) that will be used to calculate the 2D angle on the YZ plane between the lines (vectors) formed by any three 3D landmarks output by the MediaPipe model. The 3 landmarks l1, l2 and l3 are parameters of the function and the 2D angle in degrees is calculated via the formula:

The formula for calculating the angle θ between 2 vectors a and b: θ = arccos((a · b) / (‖a‖ ‖b‖))

We are only interested in the angle formed by these 3 points along the YZ plane, as that is the plane on which the joints of our fingers revolve (from the camera’s perspective). This allows us to calculate only the angle by which each finger is bending, without worrying about any subtle ‘left/right’ deviations along the plane of our palm. The function calcAngle(l1, l2, l3) is defined as follows:

def calcAngle(l1, l2, l3):
    # Store the (y, z) coordinates of each landmark as points
    a = np.array([l1.y, l1.z])
    b = np.array([l2.y, l2.z])
    c = np.array([l3.y, l3.z])
    # Calculate the 2 vectors (ab and bc) between the points
    ab = b - a
    bc = c - b
    # Apply the formula to calculate the cosine of the angle
    # (NumPy makes this kind of vector/matrix computation very concise)
    cosine_angle = np.dot(ab, bc) / (np.linalg.norm(ab) * np.linalg.norm(bc))
    # Use arccos to get the angle, convert to degrees and round to 0 d.p.
    return np.round(np.degrees(np.arccos(cosine_angle)), decimals=0)

Now that we have the above function, we can use it to calculate the bend of each finger exactly the way we want and check that every finger is within the allowed range for the initial pose. This initial pose check will be done by another function called initialPose that takes 3 parameters: img (video frame), left (set of landmarks corresponding to the left hand) and right (set of landmarks corresponding to the right hand). It returns True or False depending on whether or not both our hands are in their expected initial pose.

def initialPose(img, left, right):
    # If either hand has not been detected yet, return False
    if not right or not left:
        return False
    # Initialise flag to True - default assumption is that the hands are
    # in the initial pose
    pose = True
    # Iterate through the landmarks at the base of each finger
    # (i.e. 5, 9, 13 and 17. Refer to the landmarks image in part 1)
    for i in range(5, 18, 4):
        # Calculate the angle formed at the base of the finger (i) between
        # the wrist (0) and the fingertip (i+3)
        angle = calcAngle(right.landmark[0], right.landmark[i], right.landmark[i+3])
        # Render the angle as text at the tip of the finger on the video
        # frame (img) - this is just to aid the user via the UI
        x = int(right.landmark[i+3].x * img.shape[1])
        y = int(right.landmark[i+3].y * img.shape[0])
        cv.putText(img, str(angle), (x, y), cv.FONT_HERSHEY_COMPLEX, 0.6, (255, 0, 0), 1, cv.LINE_AA)
        # If the angle is more than 15 degrees, consider the finger to be
        # bent and thus out of the initial pose (set flag to False)
        if angle > 15:
            pose = False
        ## Do the same for the corresponding finger on the left hand ##
        angle = calcAngle(left.landmark[0], left.landmark[i], left.landmark[i+3])
        x = int(left.landmark[i+3].x * img.shape[1])
        y = int(left.landmark[i+3].y * img.shape[0])
        cv.putText(img, str(angle), (x, y), cv.FONT_HERSHEY_COMPLEX, 0.6, (255, 0, 0), 1, cv.LINE_AA)
        if angle > 15:
            pose = False
    # Return the pose, which will be False if any finger on either hand is
    # perceived to be bent by more than 15 degrees
    return pose

The initialPose function can be used in the main loop of our program to develop an ‘initialisation phase’ such that the program proceeds to later parts of the pipeline (i.e. storing initial orientation data, computing and sending control commands etc) only if our hands have been in the initial pose for at least 5 seconds continuously. To do this, we need to store 2 things: a timestamp of when our hands first went into the expected initial pose and a flag representing whether or not initialisation has been completed. So, add the following 2 lines of code immediately below with hands_model.Hands(min_detection_confidence=0.7, min_tracking_confidence=0.5) as hands: so that the start of our main program loop now looks like this:

Add initialisation timestamp and flag (shown in the green box)
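Those two lines look something like this:

# Timestamp of when both hands first entered the expected initial pose
init_timer = time.time()
# Flag representing whether or not initialisation has been completed
initialised = False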

We used the multi_handedness object of results after running results = hands.process(img) in part 1 to identify whether a hand was left or right and render that as text on the respective hand in the video frame, but we never stored the set of landmarks separately as belonging to either the left or right hand. We will do so now by declaring 2 variables, left and right, immediately after results = hands.process(img) and initialising both of them to None like so:

Add the line shown in the green box
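In other words:

# Landmark sets for each hand - None until that hand is detected in the frame
left = None
right = None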

Inside the loop for num, hand in enumerate(results.multi_hand_landmarks) we will assign hand to be the value of either the left or right variable we have declared above, depending on which hand it belongs to.

Store the set of landmarks in the variable corresponding to the hand to which they belong
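A sketch of that assignment, using the handedness label from results.multi_handedness (the same lookup we used to render the ‘Left’/‘Right’ text in part 1):

# Inside: for num, hand in enumerate(results.multi_hand_landmarks):
label = results.multi_handedness[num].classification[0].label
if label == 'Left':
    left = hand
else:
    right = hand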

We can now use left and right in the code for our ‘initialisation phase’, to ensure the program only proceeds once our hands have been in the expected initial pose for at least 5 seconds continuously, as described above. The following block of code is to be inserted after the if results.multi_hand_landmarks block, to implement our initialisation phase:
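As a rough sketch, the block looks something like this (the variable names follow the walkthrough below; the exact on-screen text and positions are assumptions):

if initialised:
    # Initialisation already completed on a previous iteration - pass the
    # landmarks straight to the controller function (defined later)
    controller(left, right)
else:
    if not initialPose(frame, left, right):
        # Hands are not in the initial pose - reset the timer and warn the user
        init_timer = time.time()
        cv.putText(frame, "Not in initial position", (10, 40), cv.FONT_HERSHEY_COMPLEX, 0.8, (0, 0, 255), 2, cv.LINE_AA)
    else:
        # Hands are in the initial pose - show the countdown in green
        remaining = int(5 - (time.time() - init_timer))
        cv.putText(frame, "Initialising... " + str(remaining), (10, 40), cv.FONT_HERSHEY_COMPLEX, 0.8, (0, 255, 0), 2, cv.LINE_AA)
        if time.time() - init_timer >= 5:
            initialised = True
            # Store the wrist (0) and the bases of the index (5) and pinky (17) fingers
            wrist = (right.landmark[0].x, right.landmark[0].y, right.landmark[0].z)
            index_base = (right.landmark[5].x, right.landmark[5].y, right.landmark[5].z)
            pinky_base = (right.landmark[17].x, right.landmark[17].y, right.landmark[17].z)
            # Plane representing the orientation of the right palm at initialisation
            rightPlane = Plane(wrist, index_base, pinky_base)
            rightPlane_N = np.array(rightPlane.normal_vector, dtype=np.float64)
            # Perpendicular plane through the base (9) and tip (12) of the middle finger
            rightPlane_Orth = rightPlane.perpendicular_plane(
                (right.landmark[9].x, right.landmark[9].y, right.landmark[9].z),
                (right.landmark[12].x, right.landmark[12].y, right.landmark[12].z))
            rightPlane_Orth_N = np.array(rightPlane_Orth.normal_vector, dtype=np.float64)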

This was a large chunk of code, so we are going to go through it in parts. Firstly, on line 170 we check, using the initialised flag, whether the program has already completed the initialisation phase on a previous iteration of the main loop. If it has, we pass the sets of landmarks for the left and right hand obtained from processing the current frame of the video to a controller function, which we will develop later. This function will be responsible for making use of these landmarks and sending actual commands to the drone.

When initialised is False (i.e. the program hasn’t yet completed initialisation), we pass left and right to the initialPose function (line 175) we defined at the beginning, to check if both hands are in the expected initial pose. If initialPose(frame, left, right) returns False, we reset the initialisation timer and render a message in red onto the video frame informing the user they are not in the initial position. Otherwise (the hands are in the initial pose), we render a green “Initialising…” message onto the video frame, along with the number of seconds left for initialisation to complete.

The final block in this section, if time.time() - init_timer >= 5 (line 185), checks whether 5 or more seconds have elapsed since both hands entered their initial positions. If they have, we do the following:

  • Set the initialised flag to true so that in the next iteration of the loop, the initialisation phase is skipped and we go directly to the controller function that is to be defined.
  • We store the x, y and z points of the wrist (landmark 0) and base of both the index and pinky fingers (landmarks 5 and 17, respectively) as tuples of the form (x, y, z).
  • These 3 tuples (3D points) are used to construct a plane (Line 193) in 3D space that contains these points. This is done using the Plane class provided by SymPy and will be the plane representing the orientation of the palm of the right hand at the moment of initialisation.
  • We also construct a plane perpendicular to the first one; since there are infinitely many such perpendicular planes, we pass in the x, y and z coordinates of the landmarks of the base and tip of the middle finger on our right hand (Line 197), thereby creating a nearly vertical plane (based on our orientation of the world) that is perpendicular to the plane of the palm.
  • To improve the performance and frame rate of the program when controlling the drone, we store the normal vector of both these planes and only use that to calculate the angle between landmarks on our hand (converted into vectors further below) and either of the planes. At line 195 rightPlane_N stores the normal vector of the first plane (representing the palm) and at line 199 rightPlane_Orth_N stores the normal vector of the perpendicular/orthogonal plane.

Drone Controller 🎮

This is the final section of the program, where we will be developing the controller function seen above. This function is responsible for taking the left and right sets of landmarks and using them in conjunction with the normal vectors of both planes (defined in global scope) to generate and send commands to the drone!

Before we start developing the controller function, there are 3 simple helper functions we will create to simplify the controller function and make the code more reusable and concise. The first helper function is angleToPlane(vec, plane_N). It takes a vector vec as a NumPy array and the normal vector to a plane, and returns the angle between the vector and that plane in degrees. Since the angle between the vector and the plane’s normal is arccos((vec · plane_N) / (‖vec‖ ‖plane_N‖)), the angle to the plane itself is simply 90° minus that value:

def angleToPlane(vec, plane_N):
    # Calculate the cosine of the angle between the vector and the plane's normal
    cosine_angle = np.dot(vec, plane_N) / (np.linalg.norm(vec) * np.linalg.norm(plane_N))
    # Use arccos to get the angle, convert it to degrees, subtract from 90
    # and return the value
    return 90 - np.degrees(np.arccos(cosine_angle))

The second helper function is scaleValue(val, in_min, in_max, out_min, out_max) and, as the name suggests, it is used to scale the value val from the range in_min to in_max into the range out_min to out_max. The function is simply a single mathematical equation and can be implemented as shown.

def scaleValue(val, in_min, in_max, out_min, out_max):
    return (val - in_min) * (out_max - out_min) / (in_max - in_min) + out_min

The last helper function is constrain(val, min, max) and is simply used to constrain the value val within the range min to max.

def constrain(val, min, max):
    if val < min:
        return min
    elif val > max:
        return max
    else:
        return val

Now we will develop the final function in our program, controller(left, right). The following section of code shows the portion of the function responsible for handling the controls of the right hand, namely tilt in all 4 directions (forwards, backwards, left and right).

The portion of the controller function that handles all controls mapped to the right hand

There are 4 major steps being implemented:

  1. Lines 80 – 83: Creating a vector from the wrist (landmark 0) to the base of the index finger (landmark 5) and calculating its angle to the plane representing the initial orientation of the right hand’s palm. This angle is the raw forwards/backwards tilt angle.
  2. Lines 84 – 87: Creating a vector from the base (landmark 9) to the tip (landmark 12) of the middle finger and calculating its angle to the plane perpendicular/orthogonal to the initial orientation of the right hand’s palm. This angle is the raw right/left tilt angle.
  3. Lines 95 – 97 and 101 – 103: Defining deadband regions for both axes of tilt. For inputs (angle values) within these regions, the output of the function is 0. This is common in control system design and is used here to ensure small deviations from the hand’s exact orientation at the instant of initialisation don’t cause the drone to drift in any particular direction.
  4. Lines 98 – 100 and 104 – 106: The raw tilt values are scaled into the percentage range required for controlling the drone, in this case limited to +- 90% for safety. The outputs from the scaling function scaleValue are then cast to an integer and passed into the constrain function to ensure the final outputs are within +- 90%. This is required because if the input to the scaling function falls outside the given input range, its output will also fall outside the given output range.
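Putting those four steps together, a minimal sketch of this portion of controller might look like the following (the 10-degree deadband and the +-45-degree input range fed to scaleValue are illustrative assumptions, not necessarily the values used in the demo):

def controller(left, right):
    global command_timer
    fb = 0        # forwards/backwards command (%)
    lr = 0        # left/right command (%)
    throttle = 0  # up/down command (%), computed from the left hand below
    if right and type(rightPlane_N) == np.ndarray:
        # Step 1: vector from the wrist (0) to the base of the index finger (5),
        # and its angle to the palm plane -> raw forwards/backwards tilt
        v_fb = np.array([right.landmark[5].x - right.landmark[0].x,
                         right.landmark[5].y - right.landmark[0].y,
                         right.landmark[5].z - right.landmark[0].z])
        fb_raw = angleToPlane(v_fb, rightPlane_N)
        # Step 2: vector from the base (9) to the tip (12) of the middle finger,
        # and its angle to the orthogonal plane -> raw right/left tilt
        v_lr = np.array([right.landmark[12].x - right.landmark[9].x,
                         right.landmark[12].y - right.landmark[9].y,
                         right.landmark[12].z - right.landmark[9].z])
        lr_raw = angleToPlane(v_lr, rightPlane_Orth_N)
        # Steps 3 and 4: deadband, then scale to a percentage and constrain to +-90%
        if abs(fb_raw) > 10:
            fb = constrain(int(scaleValue(fb_raw, -45, 45, -90, 90)), -90, 90)
        if abs(lr_raw) > 10:
            lr = constrain(int(scaleValue(lr_raw, -45, 45, -90, 90)), -90, 90)
    # ... left-hand height control and command transmission follow (see below)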

Now that we have our command values for tilt, we use the left hand to control the height of the drone, simply using the length of the vector (i.e. the Cartesian distance) from the tip of the thumb to the tip of the index finger. The following section of code implements this and is part of controller(left, right), immediately after the if right and type(rightPlane_N) == np.ndarray block.

The portion of the controller function that handles all controls mapped to the left hand - i.e. height control

We create 2 vectors (Lines 109 - 112): one from the tip of the thumb to the tip of the index finger, and a second from the wrist to the base of the index finger. We then take the ratio between the lengths of the first and second vectors (Line 114) to make the raw value invariant to the distance of the hand from the camera. If we did not do this, the height control value would change significantly when we move our left hand closer to or further from the camera, even without changing the real-world distance between our thumb and index finger, simply because of the camera’s perspective (an object that is closer appears bigger than the same object further away). By dividing by the length of another vector on the hand, whose length is perceived in the same way, we ensure the value used (i.e. the ratio) is not significantly affected by distance from the camera. This ratio is then scaled and constrained to a target height in the range of 350mm (0.35m) to 2200mm (2.2m).

Lines 117 - 122 request the current height in millimetres from the drone, check that the response is valid and parse it into an integer, which is used in line 124 to calculate the difference between the target height and the current height and store it as the error. Similar to the right-hand controls, a deadband is applied to this error: only if it is greater than 15mm is the error scaled and constrained to produce a throttle value between -90% and 90%.

Finally, at line 134, we format a string command of the form rc {right/left} {forwards/backwards} {throttle} {yaw} and send it to the drone, causing it to move in the direction controlled by the orientation of our right hand and at a height represented by our left hand!
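Continuing the sketch of controller from above (the get_height helper is hypothetical, and the scaling ranges and the yaw value of 0 are assumptions; the real API ships with the fully implemented tello.py):

    # Left hand: height control via the thumb-to-index 'pinch' distance
    if left:
        # Vector from the tip of the thumb (4) to the tip of the index finger (8)
        pinch = np.array([left.landmark[8].x - left.landmark[4].x,
                          left.landmark[8].y - left.landmark[4].y,
                          left.landmark[8].z - left.landmark[4].z])
        # Reference vector from the wrist (0) to the base of the index finger (5)
        ref = np.array([left.landmark[5].x - left.landmark[0].x,
                        left.landmark[5].y - left.landmark[0].y,
                        left.landmark[5].z - left.landmark[0].z])
        # The ratio of the two lengths is (roughly) invariant to camera distance
        ratio = np.linalg.norm(pinch) / np.linalg.norm(ref)
        # Scale and constrain the ratio to a target height of 350 mm - 2200 mm
        target = constrain(int(scaleValue(ratio, 0.1, 1.0, 350, 2200)), 350, 2200)
        # Request the current height in mm (hypothetical helper), validate and parse it
        response = tello.get_height()
        if response and str(response).isdigit():
            error = target - int(response)
            # 15 mm deadband on the height error, then scale/constrain to +-90%
            if abs(error) > 15:
                throttle = constrain(int(scaleValue(error, -500, 500, -90, 90)), -90, 90)
    # Format and send the command: rc {right/left} {forwards/backwards} {throttle} {yaw}
    tello.send_command('rc {} {} {} {}'.format(lr, fb, throttle, 0))
    command_timer = time.time()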

To make the drone stop and land upon pressing q, we send the 2 commands within the if key == ord('q'): block as shown below.
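Something along these lines (again, send_command is an assumed method name):

if key == ord('q'):
    # Stop all movement, then land
    tello.send_command('rc 0 0 0 0')
    tello.send_command('land')
    break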

While running the code, the terminal will show the commands that are being sent. Try tilting your right hand and pinching the tip of your thumb and index finger to see how these values change. We now have a drone that can be controlled just by using our hands, through an AI-based computer vision application we developed in just 2 sessions! 🎉

Come try it out for yourself on the actual drone at the demo session this Friday, 19th November 2021, in OC0.01 at 18:00.

Extension: For now we can only control the tilt in all 4 directions and the height of the drone. Can you come up with a way of adding yaw control (the last value in the control command string) to make the drone turn/rotate based on the pose of one of the hands?
