Published in Analytics Vidhya

Mediapipe: Hand gesture-based volume controller in Python w/o GPU

Suppose you wanted to build something with hand gesture recognition in Python. What is the first solution that comes to mind: training a CNN, contour detection, or convex hulls? These sound good and feasible, but when you actually use them, the detection is not very good and needs special conditions (a proper background, say, or conditions similar to the ones you trained under).

Recently I came across a super cool library called Mediapipe, which makes things much simpler for us. I suggest you go through its official site to read more about it, because the site explains pretty much everything the library provides. What I will do in this article is show how I used this library to come up with some great projects, because that is what brought you here.

Before we begin with the real code, let’s take some time to appreciate this library. I am really fascinated by how easy it is to use, and by how it lets me do innovative things that I otherwise found very difficult to code from scratch. You don’t even need a GPU: things run pretty smoothly even on a regular CPU. On top of that, it is backed by Google, which gives another reason to use it. In this article I will stick to Python, but the library supports nearly all platforms (Android, iOS, C++). At the time of writing, only a few modules are available for Python, but don’t worry: the library is evolving fast and you can expect more very soon.


In this article, I will use the hands module of this library to create 2 cool projects. The hands module localizes 21 points on your hand. This means that if you supply a hand image to this module, it returns a 21-point vector holding the coordinates of 21 important landmarks on your hand. If you want to know how it does that, check the documentation on their page.

The image is taken from Mediapipe's official website. Please check here to view full working details.

The points mean the same thing irrespective of your input image: point 4 is always the tip of your thumb, and point 8 is always the tip of your index finger. So once you have the 21-point vector, the kind of project you create is up to your creativity.
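To make the fixed numbering concrete, here is a minimal sketch of a lookup table for the five fingertip indices in the 21-point hand model. The dictionary and helper names are my own for illustration and are not part of Mediapipe.

```python
# Fingertip indices in Mediapipe's 21-point hand model.
# FINGERTIP_IDS and fingertip_id are illustrative names, not Mediapipe APIs.
FINGERTIP_IDS = {
    "thumb": 4,
    "index": 8,
    "middle": 12,
    "ring": 16,
    "pinky": 20,
}


def fingertip_id(finger_name):
    """Return the landmark index of the named fingertip."""
    return FINGERTIP_IDS[finger_name]


# The two landmarks used later for the volume gesture.
print(fingertip_id("thumb"), fingertip_id("index"))
```

With a table like this, gesture code can refer to fingers by name instead of by bare numbers.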

Determining Hand Landmarks

What we are trying to do is control our system volume through the following hand gesture (notice the volume change in the bottom right corner):

Before using the mediapipe library in Python, you have to install it:

pip install mediapipe

Let’s create a utility class called HandDetector, which will keep our project modular.

1. Import the needed packages:

import mediapipe as mp
import cv2

2. Create an instance of the hands module provided by Mediapipe, followed by an instance of Mediapipe’s drawing utility. The drawing utility helps to draw the 21 landmarks and the lines connecting them on your image or frame (which you may have noticed in the video above).

This step is pretty much constant and would have to be done in each of your projects.

mpHands = mp.solutions.hands
mpDraw = mp.solutions.drawing_utils

3. We then start writing our class:

class HandDetector:
    def __init__(self, max_num_hands=2, min_detection_confidence=0.5, min_tracking_confidence=0.5):
        # pass all three settings straight through to Mediapipe's Hands class
        self.hands = mpHands.Hands(max_num_hands=max_num_hands,
                                   min_detection_confidence=min_detection_confidence,
                                   min_tracking_confidence=min_tracking_confidence)

max_num_hands: the maximum number of hands that you want Mediapipe to detect. Mediapipe returns an array of hands, and each element of the array (a hand) in turn has its 21 landmark points.

min_detection_confidence, min_tracking_confidence: when Mediapipe first starts, it detects the hands. After that, it tries to track them, because detection is more time-consuming than tracking. If the tracking confidence falls below the specified value, it switches back to detection.

All these parameters are needed by the Hands() class, so we pass them straight through to it in the constructor.

4. Next we define the function findHandLandMarks() of this class where the main stuff happens

    def findHandLandMarks(self, image, handNumber=0, draw=False):
        originalImage = image
        image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)  # mediapipe needs RGB
        results = self.hands.process(image)
        landMarkList = []

        if results.multi_hand_landmarks:  # None if no hand is found
            # results.multi_hand_landmarks holds landmarks for all detected hands
            hand = results.multi_hand_landmarks[handNumber]

            for id, landMark in enumerate(hand.landmark):
                # landMark holds x, y, z ratios of a single landmark
                imgH, imgW, imgC = originalImage.shape  # height, width, channels of the image
                xPos, yPos = int(landMark.x * imgW), int(landMark.y * imgH)
                landMarkList.append([id, xPos, yPos])

            if draw:
                mpDraw.draw_landmarks(originalImage, hand, mpHands.HAND_CONNECTIONS)

        return landMarkList

The function takes three arguments: image, the image on which hand landmarks are detected; handNumber, which selects one hand when the image contains several, so that the function returns landmarks for that hand only; and draw, a boolean that decides whether Mediapipe draws the landmarks on our image.

The next line does the heavy lifting. This one small call does a lot behind the scenes and gets all the landmarks for you:

results = self.hands.process(image)

We then create an empty list landMarkList which would contain the final result returned from the function.

results.multi_hand_landmarks is None if no hand is detected, so we check it first as a fail-safe condition.

When hands are found, results.multi_hand_landmarks holds the landmarks for all detected hands, so indexing it with handNumber gives you the data for the chosen hand.

hand.landmark gives the 21 landmarks for the selected hand. We iterate over these 21 points, where id holds the index of each landmark.

The important point to note here is that the landmark information returned by Mediapipe is not the pixel location of the landmark. Instead, x and y are normalized by the image width and height, so they are fractions of the image dimensions. To get the x and y pixel coordinates of the landmark we do this simple calculation:

xPos, yPos = int(landMark.x * imgW), int(landMark.y * imgH)
landMarkList.append([id, xPos, yPos])

We then append the id (0…20) of the landmark and the corresponding x and y coordinates to the empty list we created earlier. We return this list to the calling function.
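As a small worked example of that scaling, here is a standalone sketch. The helper name to_pixel and the 640x480 frame size are my own assumptions for illustration.

```python
def to_pixel(norm_x, norm_y, img_w, img_h):
    """Scale normalized landmark coordinates to integer pixel positions."""
    return int(norm_x * img_w), int(norm_y * img_h)


# A landmark reported at the centre of an assumed 640x480 frame.
print(to_pixel(0.5, 0.5, 640, 480))  # (320, 240)
```

Note that the width scales x and the height scales y, which is easy to mix up because image.shape lists the height first.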

This is what the findHandLandMarks() would return:

0th index → id of landmark, 1st index → x coordinate of landmark, 2nd index → y coordinate of landmark

The last part draws the landmarks on the image if the boolean variable draw says so

mpDraw.draw_landmarks(originalImage, hand, mpHands.HAND_CONNECTIONS)

That's pretty much it for our custom class. All you need to do is pass your hand image to findHandLandMarks() and you get back a list with information about all 21 landmarks.

Volume Controller

Now comes the part promised in the title of this article.

Before we write any of our custom code we need to install an external Python package, pycaw (pip install pycaw). This library handles the control of our system volume. Note that pycaw works with the Windows audio API, so this part of the project runs on Windows only.

1. We start by importing the custom class that we created in the previous section and other required packages:

from handDetector import HandDetector
import cv2
import math
import numpy as np

Also, import pycaw related packages and class:

from ctypes import cast, POINTER
from comtypes import CLSCTX_ALL
from pycaw.pycaw import AudioUtilities, IAudioEndpointVolume

2. We then create an instance of our custom HandDetector class:

handDetector = HandDetector(min_detection_confidence=0.7)

Then the standard code for initializing our webcam:

webcamFeed = cv2.VideoCapture(0)

Then some standard initialization for our volume controller. This part is boilerplate, so there is not much to explain:

# Volume related initializations
devices = AudioUtilities.GetSpeakers()
interface = devices.Activate(
    IAudioEndpointVolume._iid_, CLSCTX_ALL, None)
volume = cast(interface, POINTER(IAudioEndpointVolume))
print(volume.GetVolumeRange())  # (-65.25, 0.0)

Have a careful look at the last line. volume.GetVolumeRange() gives the range of volume levels that your system supports: here -65.25 is the minimum value and 0.0 is the maximum. These levels are in decibels, and the range can differ from one audio device to another, so check what your own system prints. We will need these values later in the article, so note them down.

3. The main volume controller stuff:

while True:
    status, image = webcamFeed.read()
    handLandmarks = handDetector.findHandLandMarks(image=image, draw=True)

    if len(handLandmarks) != 0:
        # for volume control we need the 4th and 8th landmarks
        x1, y1 = handLandmarks[4][1], handLandmarks[4][2]
        x2, y2 = handLandmarks[8][1], handLandmarks[8][2]
        length = math.hypot(x2 - x1, y2 - y1)

        # Hand range (length): 50-250
        # Volume range: (-65.25, 0.0)

        # convert the length proportionately to the volume range
        volumeValue = np.interp(length, [50, 250], [-65.25, 0.0])
        volume.SetMasterVolumeLevel(volumeValue, None)

        # mark both fingertips and join them with a line
        cv2.circle(image, (x1, y1), 15, (255, 0, 255), cv2.FILLED)
        cv2.circle(image, (x2, y2), 15, (255, 0, 255), cv2.FILLED)
        cv2.line(image, (x1, y1), (x2, y2), (255, 0, 255), 3)

    cv2.imshow("Volume", image)
    cv2.waitKey(1)  # imshow needs waitKey to refresh the window

We read from our webcam one frame at a time and send each frame to our findHandLandMarks():

handLandmarks = handDetector.findHandLandMarks(image=image, draw=True)

We then extract the x and y coordinates of the tips of the thumb and index finger (check the YouTube video and you will see that the thumb and index finger are the ones of interest here).

x1, y1 = handLandmarks[4][1], handLandmarks[4][2]

The [4] means we are referring to landmark 4, which is the tip of our thumb. The [1] means we want the x coordinate, which sits at index 1 of each entry returned by findHandLandMarks(). You can figure out the rest by yourself.

We then calculate the length of the line which joins landmark 4 and landmark 8:

length = math.hypot(x2-x1, y2-y1)
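To check the distance calculation without a webcam, here is a small sketch that runs on a made-up landmark list in the same [id, x, y] format. The helper name fingertip_distance and the coordinates are assumptions for illustration only.

```python
import math


def fingertip_distance(landmarks, first_id=4, second_id=8):
    """Pixel distance between two landmarks in a list of [id, x, y] entries."""
    _, x1, y1 = landmarks[first_id]
    _, x2, y2 = landmarks[second_id]
    return math.hypot(x2 - x1, y2 - y1)


# Made-up data: 21 landmarks at the origin, then move the two fingertips.
fake_landmarks = [[i, 0, 0] for i in range(21)]
fake_landmarks[4] = [4, 100, 100]  # thumb tip
fake_landmarks[8] = [8, 130, 140]  # index finger tip

print(fingertip_distance(fake_landmarks))  # 50.0
```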

What’s the use of this length variable? When the tips of our index finger and thumb touch each other, length has some value (let’s say L1), which should mean that we want to set our system volume to 0%.

When you spread your index finger and thumb wide apart, length has another value (let’s say L2), which should mean that we want to set our system volume to 100%.

Try running your code at this point and print out the length. Note down the values of L1 and L2 at which your fingers feel comfortable. For me, L1 was 50 and L2 was 250. Recall that in step 2 we noted the minimum and maximum volume values; let’s call them V1 and V2.

So now we have L1=50, L2=250, V1=-65.25, V2=0
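If you would rather not read L1 and L2 off the console by eye, one possible approach is to collect the lengths while you pinch and spread your fingers a few times and then take the smallest and largest. This is only a sketch: the function name and the sample numbers are made up, and real samples would come from the length variable in the loop.

```python
def calibrate(lengths):
    """Return (L1, L2): the smallest and largest fingertip distance observed."""
    if not lengths:
        raise ValueError("need at least one length sample")
    return min(lengths), max(lengths)


# Made-up samples standing in for lengths printed during a test run.
samples = [48.2, 120.0, 251.7, 50.3, 249.1]
l1, l2 = calibrate(samples)
print(round(l1), round(l2))  # 48 252
```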

The next line of code maps our length from the range (L1, L2) proportionately onto the volume range (V1, V2):

volumeValue = np.interp(length, [50, 250], [-65.25, 0.0])
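To see what this mapping does, here is a pure-Python sketch of the same linear interpolation using my values for L1, L2, V1 and V2. The function name is my own. Like np.interp, it clamps lengths outside the range to the nearest end.

```python
def length_to_volume(length, min_len=50.0, max_len=250.0,
                     min_vol=-65.25, max_vol=0.0):
    """Linearly map a fingertip distance onto the volume range, with clamping."""
    clamped = max(min_len, min(max_len, length))
    fraction = (clamped - min_len) / (max_len - min_len)
    return min_vol + fraction * (max_vol - min_vol)


print(length_to_volume(50))   # -65.25 (fingers touching)
print(length_to_volume(150))  # -32.625 (halfway)
print(length_to_volume(250))  # 0.0 (fingers wide open)
print(length_to_volume(300))  # 0.0 (clamped to the maximum)
```

If your own L1, L2 or volume range differ, pass them in as arguments instead of relying on the defaults.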

Once you obtain the correct volume value, you just use it to set your system volume:

volume.SetMasterVolumeLevel(volumeValue, None)

The remaining part just marks the tips of the index finger and thumb and draws a line between those two points for better visualization.

That is pretty much all the code. When you run it, you will notice that the system volume changes quite smoothly, without any lag. The code might look long and tricky, but once you consolidate it in a Python file it is only a few lines.

Index finger and thumb controls system volume

Final words

See how simple it was to do something that creative. The idea of this article was to introduce you to this super awesome library. Everything else is up to your will and creativity. You can find the code for this project here. Sorry for the untidy Github; I usually start writing the code right away and then feel lazy about tidying things up.

Thanks again to Murtaza Hasan, from whom I came to know about this library. You can check his website for some cool projects. This article is also inspired by his projects and videos.

Since this article became longer than I expected, I will write a separate article for the second project using the same library: Finger Counting. Don't forget to check that article.




Dhruv Pandey

A machine learning and computer vision enthusiast working as a web developer in Finland.
