Design for Expression and Creativity — Creating a Gesture-Based Interface for YouTube Video Playback

13 min read · Oct 21, 2020

by Rafe Batchelor, Brian Gourd, Jordan Montero, Robb Alexander

Final product demonstration video.

To try this demo out yourself, click here.

The user interfaces most commonly tied to computer usage have remained surprisingly stagnant when compared to the evolution of computer programs and their input/output capabilities. Take, for example, the development of the smartphone: this piece of technology relies upon several major advances in the world of computing, from the shrinking of the parts themselves to the integration of a touch-based interface. We have seen how smartphones such as the iPhone have begun to incorporate advances in machine learning and facial recognition through the Face ID feature included in recent models. Yet, when it comes to how we interact with the majority of computing devices, the methods for delivering input and extracting results remain overwhelmingly consistent: some form of keyboard is pressed to deliver input, and some form of screen is used to deliver output. Given the vast array of alternatives for interfacing with computational hardware, from movement-tracking cameras to other sensors, why have we remained so true to the keyboard and mouse model?

Mouse and keyboard forty years apart.

One reason lies in the idea of consistency itself, providing a somewhat circular answer to this question. Computers are complicated devices with a steep learning curve for most individuals who have not interacted with them throughout their upbringing or professional lives. Functions that seem intuitive to one person, such as what to expect when double-clicking or right-clicking some region, may be completely puzzling to another. Thus, the standardization of input and targeting methods over time creates a set of expectations and a consistency that holds across generations of users. An individual today should be able to sit down at a computer in 2050 and still have some sense of how to deliver input and receive output, provided the device still relies on the constructs of a keyboard and mouse. While this individual might not have any idea how the programs themselves function, at the very least they possess the fundamental skills needed to interact with the computer itself and unlock its features. Introducing new schemes for receiving user input only widens the already large gulf of execution present in computing.

Another reason lies in the sheer recency of emerging input/output technologies. The fields of machine learning and computer vision have attracted enormous interest of late, stemming from advances in computational performance and from accessible, modularized frameworks for fast and accurate detection and classification. A reasonable assumption is that these means of communicating gesture-based input to a computer simply have not had the time to be perfected and implemented in consumer systems in an intuitive, highly usable fashion. This is precisely where our design enters the picture. In our design process, we sought to transform the traditional keyboard/mouse UI space into one reliant on computer vision-based gesture recognition in a way that is beneficial for the user. In other words, we are not simply translating functionalities that can be performed with a keyboard and mouse into gestures with the same degree of difficulty or usefulness; rather, we are transforming these functionalities into a gesture-space that enhances the user’s experience within a program.

One idea to highlight here is the contextuality of such a transformation. If one were to translate keyboard and mouse functionality into hand-gesture input for, say, typing a document, one’s rate of delivering information through gestures or signs would dramatically decrease. In our case, however, such a translation of functionality from keyboard and mouse to gesture increases the usability of the program itself. More specifically, we have designed a lightweight hand gesture-controlled web player for YouTube videos that promotes a natural, engaging interaction. Building upon browser-based pose estimation libraries to infer hand and finger positions, we have translated the functionalities of clicking and pressing keys to manipulate a YouTube video into a gesture-space defined by various hand positions.

Need Finding and Sketching

As with most design processes, ours began with a need-based use case. As our team sifted through potential ideas for a pose-based interaction environment that would enhance a user’s experience in some regard, we quickly converged on the idea of using gestures to control YouTube video playback. Upon starting a video, an individual often relaxes their hands away from the keyboard and mouse and places them in a comfortable position near their body, perhaps on their lap or on the desk. Then a lawn mower suddenly drives by the window, causing them to miss several seconds of audio. At this point, the user must un-relax their position, return their hands to the keyboard and mouse, and rewind the video to the last point they actually heard, only to settle back into their posture once more. Such an occurrence is so frequent that our team quickly agreed that a gesture-based system for delivering playback adjustments would allow users to maintain a comfortable position, free from the inconvenience of significant and repetitive movement.

Once we had settled on this use case for lightweight browser-based pose estimation libraries, the next stage in our design process was gathering actual user feedback on whether our expectations of how such a platform could benefit the general user were accurate. We planned to do this through a Wizard-of-Oz demo: an early concept demo in which users interact with an interface they believe to be autonomous, while the responses are actually generated by a human pulling the strings behind the scenes rather than by a computer, allowing researchers to gauge user needs from the interaction. Before demonstrating our earliest design prototype in this form, however, we had to develop an initial set of hand gestures to base our model around. Above all, these gestures had to be distinct from one another and comfortable to make.

Distinct — Expressing input through a series of different gestures might seem feasible (think sign language!), but only so many unique hand formations are discernible from one another on a 2D plane. In other words, computer vision does not have the luxury of natural depth perception; fingertips are detected and their positions are mapped, but whether one finger sits in front of or behind another is not so readily known. Thus, while the space of possible hand gestures is wide, the space of usable gestures in this application is not.

Examples of gestures that can easily be misclassified by the key point based recognition framework.

Comfort — Another key need when developing our initial designs was the comfort of gestures and the ease of reproducing them. This ties heavily into the previous point about distinctness: we’ve seen that, in order to be recognizable, gestures must be clearly discernible from one another. Holding up two fingers versus one or three is likely to be misclassified at some point; a thumbs up versus an index finger up will probably lead to even more trouble. So while the human eye easily distinguishes subtly differing gestures, this ability does not translate well to our lightweight gesture detection framework. The need for distinct gestures limits our gesture-space so tremendously that finding usable gestures that weren’t a pain to make (in terms of actual physical pain) was quite difficult. Because gestures that rely on counting raised fingers are ambiguous, most of our gestures are instead directional “points” defined by the position of fingers relative to the rest of the hand. The full panel of gestures is shown below.

All hand gestures that our system recognizes as unique with their paired playback function.

At this point, we were ready to demonstrate our initial idea at our Wizard-of-Oz demo day. The feedback from this experience confirmed our team’s expectation of the need for such a design: the YouTube playback experience is hindered by having to un-relax one’s limbs just to adjust some aspect of the video. Users indicated that making gestures while keeping arms and hands near the body, rather than at the computer, improved their comfort and their ease of communicating input. Specific comments included how “annoying it is to have to keep clicking through a video” or how “frequently [they] have to move [their] hands back and forth.” Thus, demonstrating our approach to gesture-based playback control confirmed that the expected need certainly existed and motivated us to continue. Resources used to inform our design and prototyping decisions can be found here and here.

Prototyping

After our initial “sketches” of how gestures would control a YouTube video were developed, the next stage was to design a system for actually communicating these gestures to the computer in a recognizable fashion. Among the lightweight frameworks available for such a gesture-recognition task, we focused on two potential libraries: handpose, a hand-tracking relative of poseNet, and Google’s Teachable Machine for image-based pose recognition. We relied upon both Glitch and, eventually, p5 as our JavaScript web editors of choice for integrating these lightweight backends. Glitch is a great resource for working with others in real time, and p5 serves as an excellent testbed for rapid execution and revision.
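To give a sense of what this looks like in practice, the minimal p5 sketch below loads the TensorFlow.js handpose model against a webcam feed and draws the detected landmarks. The variable names and script setup are illustrative rather than taken directly from our project.

```javascript
// Minimal p5 + handpose setup (illustrative; assumes the TensorFlow.js
// handpose model script is included alongside p5 in the page).
let video;
let model;           // handpose model, loaded asynchronously
let predictions = [];

function setup() {
  createCanvas(640, 480);
  video = createCapture(VIDEO);
  video.size(width, height);
  video.hide();

  // Load the handpose model once; estimation starts in draw().
  handpose.load().then((loadedModel) => {
    model = loadedModel;
  });
}

function draw() {
  image(video, 0, 0, width, height);
  if (model) {
    model.estimateHands(video.elt).then((hands) => {
      predictions = hands; // each entry has 21 [x, y, z] landmarks
    });
  }
  // Draw the most recently detected landmarks as small circles.
  for (const hand of predictions) {
    for (const [x, y] of hand.landmarks) {
      circle(x, y, 8);
    }
  }
}
```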

Teachable Machine — Our initial design approach involved Google’s Teachable Machine for image recognition. We theorized that by training the Teachable Machine with a large enough data set, our classification accuracy would be robust. To train it, we included several hundred images for each hand gesture, made in various positions relative to our bodies and the webcam itself. Each gesture received a unique label, and when we tested the classification accuracy of the output model, the Teachable Machine performed well in determining the gesture being shown. Thus, we proceeded to implement the model into our p5 prototype, such that the model produced a classification label in the JavaScript framework. These classification labels were then integrated with the YouTube Player API, such that each gesture could produce the desired outcome in video playback, as seen in the video below.
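As a rough sketch of that integration (the model URL, label names, and skip interval below are placeholders rather than our exact code), an exported Teachable Machine image model can be loaded with the tmImage library and its top label mapped onto YouTube IFrame Player API calls:

```javascript
// Illustrative mapping from a Teachable Machine image model's labels to
// YouTube IFrame Player API calls. The model URL, label names, and skip
// amount are placeholders for whatever an exported model actually uses.
const MODEL_URL = 'https://teachablemachine.withgoogle.com/models/XXXX/';
const SKIP_SECONDS = 2;
let classifier;
let player; // created by the YouTube IFrame API's onYouTubeIframeAPIReady()

async function loadClassifier() {
  classifier = await tmImage.load(MODEL_URL + 'model.json',
                                  MODEL_URL + 'metadata.json');
}

async function classifyAndControl(videoElement) {
  const results = await classifier.predict(videoElement);
  // Pick the most probable label.
  const best = results.reduce((a, b) => (a.probability > b.probability ? a : b));
  if (best.probability < 0.5) return; // ignore low-confidence frames

  switch (best.className) {
    case 'pause':        player.pauseVideo(); break;
    case 'play':         player.playVideo(); break;
    case 'skip_forward': player.seekTo(player.getCurrentTime() + SKIP_SECONDS, true); break;
    case 'skip_back':    player.seekTo(player.getCurrentTime() - SKIP_SECONDS, true); break;
  }
}
```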

We, fortunately, had the opportunity to evaluate this image-based model at our prototype demo day. Watching other individuals interact with our prototype, we quickly determined that an image-based backend was far from robust enough to serve as the classifier for our design. Anything in a user’s webcam frame that differed from the training data, such as clothing, background, or lighting, led to extremely poor classification performance. It became clear that we could not move forward with the Teachable Machine as the backend for gesture detection.

Demonstration of Teachable Machine backend prototype.
Teachable Machine Layout

Handpose — Fortunately, we had anticipated that this might happen and had also begun implementing a skeletal gesture recognition approach into our design using a library known as handpose. Here, rather than training a model on static images of individuals performing gestures, the model is trained on the positions of key points relative to one another. Since we used a hand-based network, rather than a full skeletal network such as poseNet, the key points in this case are the four points along each finger, plus the palm point, as depicted in the images and gif taken from the handpose site shown below.

Examples of handpose backend recognizing position of fingers and hand structure.
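Concretely, each handpose prediction exposes both a flat list of 21 landmarks and per-finger groupings. The short snippet below simply logs them; the function and variable names are illustrative.

```javascript
// Each handpose prediction exposes 21 [x, y, z] landmarks plus per-finger
// groupings: four points along each finger and a single palm base point.
async function logKeyPoints(model, videoElement) {
  const hands = await model.estimateHands(videoElement);
  if (hands.length === 0) return;

  const { landmarks, annotations } = hands[0];
  console.log(landmarks.length);               // 21 key points in total
  console.log(annotations.indexFinger.length); // 4 points along the index finger
  console.log(annotations.palmBase[0]);        // [x, y, z] of the palm point
}
```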

To collect data for training our model, we simply recorded the coordinate data of each key point while holding the desired gesture. Because the positions of the key points are taken relative to one another, handpose allows gestures to be recognized regardless of where the hand sits in the webcam frame or what is changing in the background. The code for collecting such data can be found here; instructions for creating a set of gesture data to train on are included at the top of the script.
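The linked script is the authoritative version; as a rough illustration of the idea, the sketch below records each frame’s landmarks relative to the palm base and saves the collected samples as a .json file. The variable and file names are placeholders.

```javascript
// Rough sketch of key point data collection (our actual script is linked
// above; names here are illustrative).
const samples = []; // one entry per recorded frame of the target gesture

async function recordSample(model, videoElement, gestureLabel) {
  const hands = await model.estimateHands(videoElement);
  if (hands.length === 0) return;

  // Store landmarks relative to the palm base so the gesture looks the
  // same regardless of where the hand sits in the frame.
  const [palmX, palmY] = hands[0].landmarks[0];
  const relative = hands[0].landmarks.map(([x, y]) => [x - palmX, y - palmY]);
  samples.push({ label: gestureLabel, landmarks: relative });
}

// After enough frames, save the collected samples as a .json file.
function downloadSamples(filename) {
  const blob = new Blob([JSON.stringify(samples)], { type: 'application/json' });
  const link = document.createElement('a');
  link.href = URL.createObjectURL(blob);
  link.download = filename;
  link.click();
}
```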

After a set of gesture coordinate data is collected and output in the form of a .json file, this data can be fed to the training script to produce a deployable model, which you can experiment with yourself here. This model is used as the backend for our integration with the YouTube Player API. Similarly to the Teachable Machine framework, the handpose-based model outputs a classification result with an associated probability for the gesture being performed. Once any one classification exceeds the necessary confidence threshold (> 0.5), the program outputs the label of the gesture being shown. However, this does not immediately affect the playback of the YouTube video; rather, the gesture must be held in place for a brief period (~1 second), so that rapid fluctuations in classification output or the mere flashing of a hand in front of the screen do not impact video playback. With this, our prototype was born: a robust system for interpreting a succinct set of user gestures capable of controlling YouTube playback in a comfortable fashion.
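As a hedged sketch of the classification step (our trained model’s exact architecture and label set live in the linked code, so the labels and loading details below are assumptions), a small TensorFlow.js classifier over the flattened, normalized landmarks could report a label only when its probability clears the threshold:

```javascript
// Illustrative classification step, assuming the trained gesture model is a
// small TensorFlow.js classifier over flattened landmark coordinates and
// that tf is loaded on the page. Label names are placeholders.
const LABELS = ['point_left', 'point_right', 'point_up', 'point_down', 'fist'];
const CONFIDENCE_THRESHOLD = 0.5;

async function classifyGesture(gestureModel, relativeLandmarks) {
  // Flatten the 21 [x, y] pairs into a single 42-value input tensor.
  const input = tf.tensor2d([relativeLandmarks.flat()]);
  const probabilities = await gestureModel.predict(input).data();
  input.dispose();

  let bestIndex = 0;
  for (let i = 1; i < probabilities.length; i++) {
    if (probabilities[i] > probabilities[bestIndex]) bestIndex = i;
  }
  // Only report a label once it clears the confidence threshold.
  return probabilities[bestIndex] > CONFIDENCE_THRESHOLD ? LABELS[bestIndex] : null;
}
```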

Design Features

Throughout our ideation and design implementation, one principle we kept in constant focus was making conscious, intentional choices about our gestures and our approach. We have already discussed several aspects of this intentionality with regard to the comfort and distinctness of gestures, as well as the robustness of interaction with the system itself. Here, we’d like to highlight these features, and several others, in more detail.

1. Comfort — Gestures must be comfortable to make without requiring the user to contort their hands in any spectacular fashion. Additionally, gestures must be accessible to the common user. As our gesture-space does not rely upon any particular finger to be used relative to another, the user must simply have a finger to point. Thus, we feel that our small gesture-space that primarily relies upon variations of a “pointing” action creates a comfortable and accessible space for interaction.

2. Distinction — Gestures must be easily discernible from one another. By maintaining a small set of gestures that are manipulated in some fashion to indicate difference — for example, a left-hand point towards the right vs. a right-hand point towards the left — we reduce the likelihood of misclassification due to an overwhelming number of possible gestures. Through such a condensed set, we make it very difficult for the trained model to misclassify a gesture; if our gesture-space included a vast array of slightly varying gestures, then our model accuracy would likely be far lower. Thus, through an intentionally small set, we hope to create a highly-usable product.

3. Robustness in timing — While robustness with regard to recognition and classification was discussed in the previous point on distinction, robustness in terms of user interaction with the video player itself has not yet been covered. Imagine, for example, that a user watching a YouTube video temporarily adjusts their shirt or scratches their face. Obviously this user does not want the video to skip ahead 10 seconds, pause, mute, or do anything else; they simply want to scratch their face, or whatever it may be. Thus, to ensure that gestures affecting playback are intentional, the timing and duration of gestures must be taken into account in our program.

We implement this timing element in the form of polling. Every half second, our program compares the current classification label to the label from the previous half second. If the label has changed, a counter begins. If this new label remains unchanged over the following two half-second polls (i.e., a total gesture display time of 1.5 seconds), then the classification label is allowed to change and the video playback is altered. Through such a polling scheme, accidental flashes of the wrist or fluctuations in the computer vision classification do not have an undesirable effect on the video playback. We hope that this component creates a user experience free from the frustration of undesired mispredictions. A rough sketch of this polling logic follows.
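The snippet below is a sketch of the scheme rather than our exact implementation; latestClassification() and applyPlaybackAction() are hypothetical helpers standing in for the model output and the YouTube player calls.

```javascript
// Sketch of the half-second polling scheme described above. A new label
// only takes effect after it has been seen on three consecutive polls,
// i.e. held for roughly 1.5 seconds.
const POLL_INTERVAL_MS = 500;
const REQUIRED_POLLS = 3;

let activeLabel = null;    // label currently driving playback
let candidateLabel = null; // label waiting to be confirmed
let candidateCount = 0;

setInterval(() => {
  const current = latestClassification(); // most recent model output, or null

  if (current === activeLabel) {
    // Nothing new; discard any pending candidate.
    candidateLabel = null;
    candidateCount = 0;
    return;
  }

  if (current === candidateLabel) {
    candidateCount += 1;
    if (candidateCount >= REQUIRED_POLLS) {
      activeLabel = candidateLabel;
      if (activeLabel) applyPlaybackAction(activeLabel); // e.g. pause, skip, volume
      candidateLabel = null;
      candidateCount = 0;
    }
  } else {
    // A different label appeared; start counting from one.
    candidateLabel = current;
    candidateCount = 1;
  }
}, POLL_INTERVAL_MS);
```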

4. Continuous action — The next design element featured in our final product is the availability of continuous action. Take, for instance, the skip ±2 seconds feature triggered by pointing left or right. Rather than requiring the user to point, remove their hand from the frame to reset the classification label, point again, and repeat this process to keep skipping through the video, we allow the user to simply hold the pose in place for continuous skipping in the desired direction. Actions available for continuous input are displayed on the control splash screen depicted in the third figure of this article; these include gestures for adjusting volume, scrubbing through the video, and the aforementioned skipping. A sketch of how the polling loop above could support continuous actions is shown below.
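Building on the polling sketch above, the snippet below illustrates one way continuous actions could be handled: while a confirmed gesture with a continuous mapping stays active, its effect is reapplied on every poll. The gesture labels are placeholders, and player is the YouTube IFrame API player object.

```javascript
// Illustrative continuation of the polling loop: while a confirmed
// "continuous" gesture remains active, its effect repeats on every poll
// instead of firing once. Labels and step sizes are placeholders.
const CONTINUOUS_ACTIONS = {
  point_right: () => player.seekTo(player.getCurrentTime() + 2, true),
  point_left:  () => player.seekTo(player.getCurrentTime() - 2, true),
  volume_up:   () => player.setVolume(Math.min(player.getVolume() + 5, 100)),
  volume_down: () => player.setVolume(Math.max(player.getVolume() - 5, 0)),
};

function applyContinuousAction(label) {
  const action = CONTINUOUS_ACTIONS[label];
  if (action) action(); // called on every poll while the gesture is held
}
```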

5. Locking screen — The last primary feature included in our design is the ability for the user to lock the gesture classifier, such that any gesture, regardless of duration, has no effect on the playback of the video. We felt that such a feature gives the user full control over their playback experience, serving as an easy on/off switch that allows full freedom of hand placement. The gesture to lock the handpose classification is simply raising and holding the left fist in place. The hold time required to lock gesture control is slightly longer than the hold time required for the other gestures to take effect, to ensure that the user truly intends it. A minimal sketch of this lock toggle follows.
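As a minimal sketch (not our exact code), a single flag can suppress all playback actions while locked; the label name and the showLockIndicator() helper are hypothetical, and the longer hold for the lock gesture would simply require an extra poll or two of confirmation in the polling loop above.

```javascript
// Sketch of the lock feature: while locked, confirmed gestures are ignored;
// a confirmed "lock" gesture (a raised left fist) toggles the lock.
const LOCK_LABEL = 'left_fist';

let locked = false;

function handleConfirmedLabel(label) {
  if (label === LOCK_LABEL) {
    locked = !locked;          // toggle gesture control on/off
    showLockIndicator(locked); // hypothetical UI cue for the user
    return;
  }
  if (!locked) {
    applyPlaybackAction(label); // other gestures only act when unlocked
  }
}
```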

All of these features are demonstrated in the demo video linked at the top of this page.

Conclusion

Through this design process, we feel that we have created a computer vision-based interface that genuinely enhances a user’s experience when watching YouTube videos. Through a series of intentional steps in regards to ideation, prototype design, user feedback, prototype redesign, and the eventual expansion of design features and functionalities, we have produced an interface that breaks from the mold of the traditional keyboard and mouse in a highly usable and beneficial fashion. A focus on accessibility, a low gulf of execution, and robustness have allowed our design to feel like an improvement on current means of playback manipulation.

For those who wish to create their own deployable handpose gesture models like the one featured in this article, the code for data collection, model training, and integration with the YouTube API can all be found here.
