Tutorial: Making a Hands-Free Video Player in Unity

Ian Lavery · Published in Picovoice
Mar 29, 2021 · 6 min read

Unity has proven to be a uniquely flexible tool for creating stunning visual experiences that can be deployed to an impressive list of platforms, without changing a line of code. However, there are some noticeable shortcomings when it comes to integrating voice controls into cross-platform applications. A quick look at the Unity asset store shows only a handful of results for speech recognition. Many only deploy to one or two platforms, and those that are cross-platform rely on calls to third-party cloud services, which have unpredictable latency and require a constant network connection. If you’re making an application or game in Unity and you want to integrate voice recognition, many of these options just aren’t going to work or are going to severely limit your final product. The good news is: things are changing in this space with the rise of offline voice AI.

In this tutorial, I’m going to make a voice-controlled video player using the Picovoice SDK for Unity, which is cross-platform, processes all audio offline, and has a low package footprint. A hands-free video player is particularly desirable for virtual reality, where using physical controllers has been historically cumbersome.

1 — Make a Virtual Video Screen

Turning game objects into video players is surprisingly straightforward in Unity. My preferred method relies on Unity’s Render Texture, which can receive frames of a video as they are generated by the video player and render them as a texture. By using Render Textures, we can turn any surface that can receive a texture into a video screen.

The first step is to import a video into your Unity project. Once you’ve done that, drag it into your scene — this will create a Video Player game object in the hierarchy, with the clip preloaded as the Video Clip property. Click on the Video Player object and change the Render Mode property in the Inspector to Render Texture. Below the Render Mode dropdown is the “Target Texture” property, which is currently empty. Right-click in your Project panel and select Create > Render Texture. Give this new object a name, and drag it into the “Target Texture” property of the video player.

With that, we’ve created a video player that will generate frames of our video and render them to a texture. Now, to make our screen.

Create a new material with the shader type “Unlit/Texture” and drag the render texture to the empty texture box. Next, create a new piece of 3D geometry in the scene to apply the material to. I chose the least exciting, but entirely practical option: a plane. Drag the material onto this new object and hit the play button — you should now see your video playing on the surface object!
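If you’d rather wire this up from a script than through the Inspector, the same setup can be done in code. Here’s a minimal sketch — the clip and render texture fields are just placeholders for the assets you imported and created above:

```csharp
using UnityEngine;
using UnityEngine.Video;

// Builds the same video screen at runtime: a VideoPlayer that renders
// into a RenderTexture, which is then applied to this object's material.
public class VideoScreenSetup : MonoBehaviour
{
    public VideoClip clip;               // assign your imported video in the Inspector
    public RenderTexture targetTexture;  // assign the Render Texture asset you created

    void Start()
    {
        var videoPlayer = gameObject.AddComponent<VideoPlayer>();
        videoPlayer.clip = clip;
        videoPlayer.renderMode = VideoRenderMode.RenderTexture;
        videoPlayer.targetTexture = targetTexture;

        // Point the surface's material at the render texture so it
        // displays the video frames as they are generated.
        GetComponent<Renderer>().material.mainTexture = targetTexture;

        videoPlayer.Play();
    }
}
```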

A virtual video screen in Unity, while cool, is not novel. Controlling a video entirely through voice commands, however, is a truly exciting prospect.

2 — Import the Picovoice Unity Package

To achieve an entirely hands-free experience, we’re going to need an always-listening wake word and a collection of voice commands. For this use-case, the Picovoice platform SDK for Unity gives us everything we need. Download the latest Picovoice Unity package from the Picovoice GitHub repository, and import it into your Unity project.

3 — Initialize Wake Word and Voice Command Platform

The Picovoice SDK encapsulates both the Porcupine Wake Word engine and the Rhino Speech-to-Intent engine. These two engines in concert allow you to say a wake word followed by a voice command, e.g.:

Porcupine, skip ahead 30 seconds

In this example, Porcupine detects the keyword “Porcupine” and Rhino processes the command that follows. Instead of transcribing it to text and interpreting the result, Rhino infers the intent using an embedded grammar and returns an Inference object. For our example command, the Rhino inference would look like this:

{
  IsUnderstood: true,
  Intent: 'seek',
  Slots: {
    seconds: '30',
    skipDirection: 'ahead'
  }
}

Picovoice has made several pre-trained Porcupine and Rhino models available, which can be found in the Picovoice GitHub repositories[1][2]. You can also train a custom wake word on Picovoice Console. For the video player, we’re going to use the trigger phrase Porcupine and the Video Player context, which will give us all the voice commands we need to control a video player.

We’ll also need a Picovoice AccessKey, which can be obtained by signing up for a free account on the Picovoice Console.

Once we have a Porcupine model (.ppn file) and a Rhino model (.rhn file), drop them into your project under the StreamingAssets folder — this ensures that the model files are accessible on every platform. Next, we’ll create a script called VideoController.cs and attach it to the video screen. In this script, we’ll initialize a PicovoiceManager with the keyword and context files, as well as a callback for when Porcupine detects the wake word (OnWakeWordDetected) and a callback for when Rhino has finished an inference (OnInferenceResult).
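A minimal initialization sketch is below. The model file names and the AccessKey placeholder are illustrative, and the exact `PicovoiceManager.Create` signature and namespace should be confirmed against the Picovoice Unity SDK docs for the version you installed:

```csharp
using System.IO;
using UnityEngine;
using Pv.Unity;

public class VideoController : MonoBehaviour
{
    // Replace with the AccessKey from your Picovoice Console account.
    private const string AccessKey = "{YOUR_ACCESS_KEY}";

    private PicovoiceManager _picovoiceManager;

    void Start()
    {
        // Model files placed under Assets/StreamingAssets (names are placeholders).
        string keywordPath = Path.Combine(Application.streamingAssetsPath, "porcupine.ppn");
        string contextPath = Path.Combine(Application.streamingAssetsPath, "video_player.rhn");

        _picovoiceManager = PicovoiceManager.Create(
            AccessKey,
            keywordPath,
            OnWakeWordDetected,
            contextPath,
            OnInferenceResult);
    }

    private void OnWakeWordDetected()
    {
        // e.g. highlight the screen border so the user knows we're listening
    }

    private void OnInferenceResult(Inference inference)
    {
        // e.g. dispatch on inference.Intent and read inference.Slots
    }
}
```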

Recording audio from Unity can be painful. Fortunately, PicovoiceManager handles audio capture automatically: we simply call .Start() to begin recording and .Stop() to cease it. If you have a pre-existing audio pipeline, you can use the Picovoice class instead, which lets you pass audio frames to the speech recognition engine yourself.
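For example (a sketch, reusing the manager field from the snippet above), we can start listening when the screen becomes active and stop when it is disabled:

```csharp
void OnEnable()
{
    // Begin capturing audio and listening for the wake word.
    _picovoiceManager?.Start();
}

void OnDisable()
{
    // Release the microphone when the screen is not active.
    _picovoiceManager?.Stop();
}
```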

4 — Integrate Voice Command Interface

Now that Picovoice is processing the audio from our microphone, we want to modify the Porcupine and Rhino callbacks to translate what we get from the speech recognition engines to changes in the GUI.

For the wake word, it’s easy — we just want to let the user know we’re listening. We’re going to do that by changing the colour of the border around the video screen.
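One way to do that — a sketch, where the border renderer reference and colours are placeholders — is to keep a renderer for a border object behind the screen and tint it when Porcupine fires:

```csharp
// A renderer sitting just behind the video screen that acts as a border.
public Renderer borderRenderer;
public Color listeningColour = Color.cyan;
public Color idleColour = Color.black;

private void OnWakeWordDetected()
{
    // Porcupine heard the wake word: show the user we're listening.
    borderRenderer.material.color = listeningColour;
}
```

The border can then be reset to the idle colour once Rhino’s inference callback fires.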

Picovoice controls the switch from Porcupine to Rhino automatically, so we don’t have to worry about that.

The Speech-to-Intent engine will do the heavy lifting for controlling the video player. The video_player context has five main intents that will help us control our video in Unity:

  • changeVideoState — play/pause/stop/etc. the video
  • seek — skip forward or backward from the current time, or seek from the beginning
  • changeVolume — change the output volume of the video
  • changePlaybackSpeed — alter the playback speed
  • help — toggle the help dialog

In our VideoController.cs script, we can filter by intent and pass the slots to the relevant function.
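Here’s a minimal sketch of that dispatch — the handler methods are hypothetical names; only the intent names come from the video_player context:

```csharp
private void OnInferenceResult(Inference inference)
{
    if (!inference.IsUnderstood)
    {
        // Rhino couldn't match the command to the video_player context.
        return;
    }

    var slots = inference.Slots;
    switch (inference.Intent)
    {
        case "changeVideoState":
            ChangeVideoState(slots);
            break;
        case "seek":
            SeekVideo(slots);
            break;
        case "changeVolume":
            ChangeVolume(slots);
            break;
        case "changePlaybackSpeed":
            ChangePlaybackSpeed(slots);
            break;
        case "help":
            ToggleHelp();
            break;
    }
}
```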

The slots can be thought of as arguments that are associated with the intent. For instance, if we receive the intent seek we’ll probably get minutes and/or seconds slots that will tell us what time to set the video to. Using the slots, our function for seeking through the video will look something like this:
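A sketch of that function, assuming videoPlayer is our VideoPlayer component reference and that the seek intent supplies minutes, seconds, and an optional skipDirection slot (as in the inference example above):

```csharp
private void SeekVideo(Dictionary<string, string> slots)
{
    // Work out the offset from the minutes/seconds slots (if present).
    double offset = 0;
    if (slots.ContainsKey("minutes"))
    {
        offset += double.Parse(slots["minutes"]) * 60;
    }
    if (slots.ContainsKey("seconds"))
    {
        offset += double.Parse(slots["seconds"]);
    }

    if (slots.TryGetValue("skipDirection", out string direction))
    {
        // Relative seek, e.g. "skip ahead 30 seconds";
        // anything other than "ahead" is treated as backward here.
        videoPlayer.time += direction == "ahead" ? offset : -offset;
    }
    else
    {
        // Absolute seek from the beginning, e.g. "go to 2 minutes 30 seconds".
        videoPlayer.time = offset;
    }
}
```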

To complete the project, we now just need to connect each intent to a change in the UI. Then we can launch the app and put down the controllers: the video player is now 100% hands-free.

The full source code from this tutorial can be found on GitHub. For more information regarding Picovoice’s SDKs and products, visit the website, docs or explore the GitHub repositories. If your project requires custom wake word or context models, sign up for the Picovoice Console.
