How to control a drone using computer vision

Guilhem S
5 min read · Nov 23, 2023


This article shows how to control a drone using computer vision in Python. An object or a person can be tracked in the image to provide movement commands to the drone, and specific hand gestures can be recognized to make the drone perform maneuvers. These algorithms can serve as a foundation for practical exercises or be used directly to illustrate concepts in courses. A Python library, tello-vision-control (https://github.com/guilsch/tello-vision-control), was designed specifically to provide scripts and tools for these vision-based drone control tasks. Feel free to use it!

The Ryze Tello Drone (Tello, 2023)

In this work, we use a Ryze Tello drone, known for its affordability, compact size, and availability of an SDK for interaction via WiFi.

1. Object detection

The first step in making the drone track an object is determining the object's position in the image. Numerous techniques exist for this task, and choosing the most suitable one can be challenging: performance varies with the use case, and some methods are better suited than others to object tracking for drone control. Throughout this project, I explored different methods to identify the most relevant ones for our specific situation.

Regardless of the chosen tracking method, the following obstacles should be considered:

  • Occlusion management: Tracking can become challenging when the object is partially or completely hidden. In such cases, there is a risk of losing the object entirely without being able to find it once it becomes visible again.
  • Fast motion: If the object moves too quickly between frames, it can be lost.
  • Lighting: Good lighting and shooting conditions are often required.
  • Noise: Noise in the video frames can degrade algorithm performance.
  • Computational cost: Some methods are highly effective but require significant computing power, making them unsuitable in certain cases.

I tested methods of varying complexity: frame-by-frame object detection with neural networks, object detection and tracking with neural networks, object tracking by keypoint matching, object tracking with OpenCV's built-in trackers, and detection and tracking of anatomical keypoints of the human body (also a neural-network method). The last two proved to be the most effective.

For the first of these two methods, the initial video frame is displayed and the object to track is selected by hand. OpenCV's tracker then continuously estimates the object's position in the image (in pixels) using a motion estimator: it combines the current frame with information about the object's position in previous frames to determine the current position as accurately as possible.

Tracking of an apple with OpenCV’s tracker
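Below is a minimal sketch of this approach using OpenCV's CSRT tracker (any of OpenCV's built-in trackers works similarly). It assumes the opencv-contrib-python package; on some 4.x builds the factory is exposed as cv2.legacy.TrackerCSRT_create instead. The webcam source here stands in for the drone's video stream.

```python
import cv2

# Open the video stream (0 = default webcam; replace with the drone's stream)
cap = cv2.VideoCapture(0)
ok, frame = cap.read()

# Let the user draw a bounding box around the object on the first frame
bbox = cv2.selectROI("Select object", frame)
cv2.destroyWindow("Select object")

# Create and initialize a CSRT tracker (cv2.legacy.TrackerCSRT_create on some builds)
tracker = cv2.TrackerCSRT_create()
tracker.init(frame, bbox)

while True:
    ok, frame = cap.read()
    if not ok:
        break

    # Update the estimate of the object's bounding box in the current frame
    found, bbox = tracker.update(frame)
    if found:
        x, y, w, h = [int(v) for v in bbox]
        cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
        # Pixel coordinates of the target's center, used later for drone control
        cx, cy = x + w // 2, y + h // 2

    cv2.imshow("Tracking", frame)
    if cv2.waitKey(1) & 0xFF == 27:  # Esc to quit
        break

cap.release()
cv2.destroyAllWindows()
```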

When tracking a person, it is preferable to use the neural network provided by the MediaPipe framework. This network uses a machine learning model called BlazePose GHUM 3D to estimate the position of the person's anatomical keypoints in the video, each keypoint corresponding to a part of the human body. The coordinates of the keypoints of interest can then be extracted from the image.

Skeleton detection
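As a sketch of this second approach, the snippet below runs MediaPipe's Pose solution on a video stream and converts one landmark from normalized coordinates to pixels. Tracking the nose is an arbitrary, illustrative choice, not necessarily the keypoint used in the library.

```python
import cv2
import mediapipe as mp

mp_pose = mp.solutions.pose

cap = cv2.VideoCapture(0)  # replace with the drone's video stream

with mp_pose.Pose(min_detection_confidence=0.5, min_tracking_confidence=0.5) as pose:
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break

        # MediaPipe expects RGB images
        results = pose.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))

        if results.pose_landmarks:
            h, w, _ = frame.shape
            # Landmarks are normalized to [0, 1]; convert e.g. the nose to pixels
            nose = results.pose_landmarks.landmark[mp_pose.PoseLandmark.NOSE]
            cx, cy = int(nose.x * w), int(nose.y * h)
            cv2.circle(frame, (cx, cy), 5, (0, 0, 255), -1)

        cv2.imshow("Pose tracking", frame)
        if cv2.waitKey(1) & 0xFF == 27:
            break

cap.release()
cv2.destroyAllWindows()
```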

These two methods provide coordinates in two dimensions, but it is also possible to estimate depth using a reference distance in the image, for example the size of the bounding box (method 1) or the distance between the keypoints of the two ears (method 2).
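As a rough illustration of that idea, a pinhole-camera relation can turn an apparent size in pixels into a distance. The focal length and real-world sizes below are placeholder values, not calibrated constants from the library.

```python
def estimate_distance(pixel_size, real_size_m, focal_length_px):
    """Approximate camera-to-target distance (in meters) from apparent size.

    pixel_size      -- apparent size of the reference in the image (pixels),
                       e.g. the bounding-box width or the ear-to-ear distance
    real_size_m     -- known real-world size of that reference (meters)
    focal_length_px -- camera focal length expressed in pixels (placeholder value)
    """
    if pixel_size <= 0:
        return None  # reference not visible, depth unknown
    return focal_length_px * real_size_m / pixel_size

# Example: ears appear 40 px apart, head width ~0.15 m, focal length ~920 px (guess)
print(estimate_distance(40, 0.15, 920))  # ~3.45 m
```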

2. Drone Control

With these two tracking methods, we can determine the position of the object or person in the image. To make the drone follow the target, we must command the drone in velocity along four axes (forward/backward, left/right, up/down, and rotation about the vertical axis) so that the center of the image coincides with the target's position. We use a simple PID control loop to regulate the drone's speed so that movements are both fast and smooth.

Drone control diagram

We first compare the object’s position in the image to the center of the image to obtain the error. We convert the error from the image coordinate system to the drone’s coordinate system and then apply a PID controller for each axis. With carefully chosen gains, we can easily achieve fast, smooth, and precise drone movement.
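A minimal sketch of how this could look is shown below, assuming the djitellopy package for sending commands to the Tello (its send_rc_control method takes left/right, forward/backward, up/down, and yaw speeds between -100 and 100). The PID class, the gains, and the pixel values are illustrative, not the structure or values used in tello-vision-control.

```python
import time
from djitellopy import Tello

class PID:
    """Simple PID controller for one control axis."""
    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = 0.0
        self.prev_time = time.time()

    def update(self, error):
        now = time.time()
        dt = max(now - self.prev_time, 1e-3)
        self.integral += error * dt
        derivative = (error - self.prev_error) / dt
        self.prev_error, self.prev_time = error, now
        return self.kp * error + self.ki * self.integral + self.kd * derivative

# One controller per axis (illustrative gains)
yaw_pid = PID(kp=0.25, ki=0.0, kd=0.05)
ud_pid = PID(kp=0.30, ki=0.0, kd=0.05)

tello = Tello()
tello.connect()
tello.takeoff()

frame_w, frame_h = 960, 720   # Tello stream resolution
cx, cy = 600, 300             # target position from the tracker (placeholder values)

# Error = offset of the target from the image center, mapped to drone axes
yaw_speed = int(max(-100, min(100, yaw_pid.update(cx - frame_w / 2))))
ud_speed = int(max(-100, min(100, ud_pid.update(frame_h / 2 - cy))))

# Arguments: left/right, forward/backward, up/down, yaw
tello.send_rc_control(0, 0, ud_speed, yaw_speed)
```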

3. Hand Gesture Detection

To go further, I also designed a script to detect and identify specific hand gestures to command the drone to perform particular movements.

For this, I use a model for detecting anatomical keypoints of the hand, also provided by MediaPipe. It detects and tracks 21 keypoints of each detected hand in the image.

Hand key points (Google, 2022)
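A minimal sketch of that detection step with MediaPipe's Hands solution is given below; limiting detection to a single hand and drawing the landmarks are illustrative choices.

```python
import cv2
import mediapipe as mp

mp_hands = mp.solutions.hands
mp_draw = mp.solutions.drawing_utils

cap = cv2.VideoCapture(0)  # replace with the drone's video stream

with mp_hands.Hands(max_num_hands=1, min_detection_confidence=0.7) as hands:
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break

        results = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))

        if results.multi_hand_landmarks:
            for hand_landmarks in results.multi_hand_landmarks:
                # 21 normalized (x, y, z) keypoints per detected hand
                mp_draw.draw_landmarks(frame, hand_landmarks, mp_hands.HAND_CONNECTIONS)

        cv2.imshow("Hand keypoints", frame)
        if cv2.waitKey(1) & 0xFF == 27:
            break

cap.release()
cv2.destroyAllWindows()
```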

Based on Kazuhito Takahashi’s work (https://github.com/Kazuhito00/hand-gesture-recognition-using-mediapipe), I trained a model to identify hand gestures from the detected keypoints. The model only requires a series of keypoint positions, each associated with a label (open hand, closed hand, thumbs up, etc.). Users are free to train the model to recognize any specific hand position as long as they provide training data. The model can then identify hand gestures from the keypoints, and as soon as a gesture is detected, the drone can be instructed to perform a maneuver such as a flip, landing, or takeoff.

Gesture detection
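To give a flavor of this pipeline, the sketch below trains a small classifier on labeled keypoint vectors and maps the predicted label to a drone maneuver. It uses scikit-learn rather than the model from Kazuhito Takahashi's repository, and the gesture labels, file names, and wrist-relative normalization are hypothetical choices.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from djitellopy import Tello

# Hypothetical training data: each row is 21 (x, y) keypoints flattened to 42 values,
# normalized relative to the wrist, with one integer label per row.
X_train = np.load("keypoints.npy")   # shape (n_samples, 42) -- placeholder file
y_train = np.load("labels.npy")      # shape (n_samples,)    -- placeholder file
GESTURES = {0: "open_hand", 1: "closed_hand", 2: "thumbs_up"}

clf = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500)
clf.fit(X_train, y_train)

def keypoints_to_vector(hand_landmarks):
    """Flatten MediaPipe hand landmarks into a wrist-relative feature vector."""
    pts = np.array([[lm.x, lm.y] for lm in hand_landmarks.landmark])
    pts -= pts[0]  # make coordinates relative to the wrist keypoint
    return pts.flatten()

def act_on_gesture(tello: Tello, hand_landmarks):
    """Predict the gesture and trigger the corresponding drone maneuver."""
    gesture = GESTURES[int(clf.predict([keypoints_to_vector(hand_landmarks)])[0])]
    if gesture == "thumbs_up":
        tello.flip_back()   # perform a flip
    elif gesture == "closed_hand":
        tello.land()
```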

Conclusion

With the tello-vision-control library (https://github.com/guilsch/tello-vision-control/), you can control the drone simply using computer vision. We’ve seen that, initially, an object detection or tracking algorithm must be used. Subsequently, it’s relatively simple to guide the drone with feedback from the object’s position.

Feel free to share your observations and comments!
