
How to create a musical instrument using Computer Vision in JavaScript

Piotr Jaworski · Published in Jit Team
Sep 23, 2020 · 9 min read


Have you ever heard of the theremin? If you’re into spy stories, you’ve probably heard of The Thing, a listening device developed by Leon Theremin. If you’re a fan of The Big Bang Theory, this name probably rings a bell (pun intended!).

Pretty neat, huh? And you know what’s even neater? It’s possible to build one in JavaScript! Cool, right? And that’s not even the best part. What do you think is the best way to control the pitch? The mouse? Come on, that’s so 1992. A true theremin should be controlled without the need to touch anything. We’re going to use the computer vision approach!

Wait, what? Computer vision? In JavaScript? In the browser?! Yes, yes, and yes! We’re going to build a motion-controlled theremin that’s going to run in the browser, using JavaScript! How? Read further to find out!

If you’re not a very patient person (like me), you can go ahead and check out the result here (although as of now, it does not work on mobile devices) — all you need to do is have the sound on, click the “Start” button and show something horizontal to the camera — like the edge of a sheet of paper or the edge of your phone.

If you’re curious about the implementation, please read on! First, we’ll need some kind of scaffolding for our solution — to get that out of the way, let’s use:

npx create-react-app theremin

What does that do? npx is a handy tool bundled with npm that allows us to run other npm packages without the need to install them, create-react-app is Facebook’s React project starter that lets us run a React app with pretty much no configuration needed, and theremin is our app’s name.

Once we’ve got our project set up, we’ll need some help with the image processing we’re going to have to do. There are some nice, performant libraries that cover the basics of computer vision we need here, such as jsfeat or gammacv (we’re actually going to use the former to build our theremin), but neither of them has made it to version 1 yet. In a production-grade solution, it would probably make the most sense to implement all the needed image processing helpers from scratch. In order to use jsfeat, we first need to add it to the dependencies:

yarn add -SE jsfeat

Next, we need to create a component that will serve as our theremin. In order to do that, we’ll create a components/Theremin folder with an index.jsx file inside:
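A bare skeleton for that file could look something like this (we’ll flesh it out step by step):

import React from 'react';

const Theremin = () => {
  return (
    <div className="Theremin">
      {/* the video and canvas elements will go here */}
    </div>
  );
};

export default Theremin;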

At first, we need to establish the size of the video that we’re going to process. This is an important step, as every increase in resolution leads to a performance hit, especially on mobile phones. For the simple transformations we’re going to do for the theremin, 640x480 will do just fine, but for more complicated calculations it could be worth falling back to 320x240 pixels.
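For example, as a pair of constants at the top of the file (the names here are just a suggestion):

// resolution used for capture and processing; higher means slower
const VIDEO_WIDTH = 640;
const VIDEO_HEIGHT = 480;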

Once we get that done, we can continue with preparing some variables that we’re going to use to store references to the HTML video and canvas elements, as well as the rendering context. We need to put those at the top of the component function:
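Something along these lines, using the useRef and useState hooks imported from React (the names are a suggestion; note the state value on the last line):

const videoRef = useRef(null);   // the hidden <video> element
const canvasRef = useRef(null);  // the <canvas> we draw the output on
const contextRef = useRef(null); // the canvas' 2D rendering context
const [linePosition, setLinePosition] = useState(null); // row of the detected line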

Our theremin is going to detect the position of the most prominent horizontal line in the camera image and use it to control the sound pitch, so the last line is going to be useful for storing this value. In order for the references to make sense, we also need to add the corresponding HTML elements inside of our root element:
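For example (playsInline and muted are there mostly to keep mobile browsers from complaining about autoplaying video):

<div className="Theremin">
  <video ref={videoRef} width={VIDEO_WIDTH} height={VIDEO_HEIGHT} playsInline muted />
  <canvas ref={canvasRef} width={VIDEO_WIDTH} height={VIDEO_HEIGHT} />
</div>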

Since we don’t need the video element visible (we’ll show the output in the canvas), it’s probably best to hide it using CSS — we can add a Theremin.css file to the component folder:
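Something as simple as this will do:

/* Theremin.css - the video element only feeds the processing, so keep it out of sight */
.Theremin video {
  display: none;
}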

and import it in the JS one:
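Like so:

import './Theremin.css';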

Next, we need to create the rendering context so it can be used for the image manipulation that we’re going to do. Since we only need to do it once after the component mounts, we’re going to use the useEffect hook:
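A sketch of that effect:

useEffect(() => {
  // grab the canvas' 2D context once, right after the component mounts
  if (canvasRef.current) {
    contextRef.current = canvasRef.current.getContext('2d');
  }
}, []);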

At this point, it’s worth mentioning that jsfeat is fast because it doesn’t use immutable data structures — for performance reasons we’re supposed to declare the variable that is going to hold the image pixel data only once, as a fixed-length integer array. At each frame we’re going to iterate over the same array — this way the image data uses the same part of the memory for every frame, without the costly need to reallocate it over and over again. In order to make it work, we need to declare some additional refs:
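Here’s a sketch of what those refs could look like, assuming jsfeat is imported at the top of the file (the names are illustrative, and the loop filling the “ones” vector is just one way to do it):

// greyscale image buffer: one unsigned 8-bit value per pixel, allocated once and reused
const imgU8Ref = useRef(
  new jsfeat.matrix_t(VIDEO_WIDTH, VIDEO_HEIGHT, jsfeat.U8_t | jsfeat.C1_t)
);
// Scharr derivatives: two signed 32-bit values (x- and y-gradient) per pixel,
// hence the doubled width
const gradientRef = useRef(
  new jsfeat.matrix_t(VIDEO_WIDTH * 2, VIDEO_HEIGHT, jsfeat.S32_t | jsfeat.C1_t)
);
// column vector with a 1 in every y-gradient slot and a 0 in every x-gradient slot
const onesRef = useRef(
  new jsfeat.matrix_t(1, VIDEO_WIDTH * 2, jsfeat.S32_t | jsfeat.C1_t)
);
// column vector that will receive the per-row sums of y-gradients
const sumsRef = useRef(
  new jsfeat.matrix_t(1, VIDEO_HEIGHT, jsfeat.S32_t | jsfeat.C1_t)
);

useEffect(() => {
  // counting from 1, every even entry corresponds to a y-gradient
  for (let i = 0; i < VIDEO_WIDTH * 2; i += 1) {
    onesRef.current.data[i] = i % 2 === 1 ? 1 : 0;
  }
}, []);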

Wow, what happened here? Well, jsfeat provides its own data structures and all we need to do is provide the dimensions and the data type — in this case it’s going to be a single channel, unsigned 8-bit array. Still, what does that mean? In order to be performant (and we need to be performant to achieve a smooth video experience) jsfeat uses greyscale — so instead of the usual four channels (red, green, blue and alpha) we’re only going to use one — luminosity, which is represented by a single number for a pixel ranging from 0 (black) to 255 (white) with different shades of grey in between. And since we only need to store a number that may take 256 different values per pixel, 8 bits will be just enough.

We also need an additional array that will hold the gradient data that will help us find the edges — I’ll explain that in detail when we get there. All we need to know for now is that we’re going to need a bigger array, as we have to store two values per pixel (which is why we need to double the width), and the values are going to be signed and a bit bigger — hence the signed 32-bit data type.

The final two variables that we need to declare as refs are going to be used to quickly sum up the rows of our gradients. The first one looks complicated, but it’s nothing more than a one-dimensional matrix whose height is the doubled width of the original image and which has ones in every even row. The second is a one-dimensional matrix that we will use to hold the results of the matrix multiplication.

One final thing that we need to do before we start processing each frame of the camera video is setting up the video itself. We’ll do that by using the getUserMedia API and streaming whatever our camera captures to the previously hidden video element:
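One way to do it (error handling kept to a bare minimum):

useEffect(() => {
  navigator.mediaDevices
    .getUserMedia({ video: { width: VIDEO_WIDTH, height: VIDEO_HEIGHT }, audio: false })
    .then((stream) => {
      if (videoRef.current) {
        // pipe the camera stream into the hidden video element and start playing it
        videoRef.current.srcObject = stream;
        videoRef.current.play();
      }
    })
    .catch(console.error);
}, []);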

After that we can finally get to processing the image data itself! First we will need to declare a function that will run at each tick — since we don’t need it to be declared all over again at each rerender, we will use the useCallback hook:
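Roughly like this (the extra effect at the bottom is just one way to kick the loop off after mounting):

const tick = useCallback(() => {
  // ...all the frame processing described below goes here...

  // queue the next frame; the browser calls us again on its next repaint
  requestAnimationFrame(tick);
}, []);

useEffect(() => {
  requestAnimationFrame(tick);
}, [tick]);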

As you can see, we’re using requestAnimationFrame to run this function once per browser repaint, which should give us a frame rate close to 60 fps (if possible).

Inside of the tick function, the first thing that we need to do is to retrieve data from the video element (provided there’s enough data for us to work with in the first place):
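Inside the tick function, that could look more or less like this:

if (
  videoRef.current &&
  videoRef.current.readyState === 4 // HAVE_ENOUGH_DATA
) {
  // draw the current frame onto the canvas...
  contextRef.current.drawImage(videoRef.current, 0, 0, VIDEO_WIDTH, VIDEO_HEIGHT);
  // ...and read the raw RGBA pixel values back from it
  const imageData = contextRef.current.getImageData(0, 0, VIDEO_WIDTH, VIDEO_HEIGHT);

  // ...the jsfeat processing described below goes here...
}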

After checking if the video is rendered and assigned to the ref, and if it has enough data for us to process, we can proceed with drawing the video data into the canvas using the context.drawImage method, and then retrieving the RGBA pixel values from the canvas using the getImageData method.

Next, we’re ready to start processing the data with jsfeat. First, we need to take the data field, which is nothing else than an array of pixel values (four per pixel: one for red, one for green, one for blue and one for alpha), and convert it to greyscale:
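With the refs declared earlier, something like:

// convert the RGBA pixels to a single-channel greyscale image, stored in imgU8Ref
jsfeat.imgproc.grayscale(imageData.data, VIDEO_WIDTH, VIDEO_HEIGHT, imgU8Ref.current);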

As an attentive reader might observe, we’re not assigning the result to anything — as stated before, jsfeat operates on mutable data structures, so the last argument of that method is actually the array to store the output in.

Our next step will be calculating the gradients, which will be crucial to detecting the edges in the image. Details of this extremely interesting and potent approach can be found here, but for the sake of our theremin the only thing we need to know at this point is that applying a jsfeat method that calculates Scharr derivatives on the image data results in an array holding two values for each pixel, both ranging from -4080 to 4080. The first of those is the x-gradient (which we’ll ignore for now), and the second is the y-gradient, which is much more interesting for us. The further a pixel’s y-gradient is from zero (the sign doesn’t matter), the higher the probability that the pixel is located on a horizontal edge.

Calculating the gradients is easy with jsfeat, all it takes is calling a provided method:
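For example:

// calculate Scharr derivatives: x- and y-gradients for every pixel of the greyscale image
jsfeat.imgproc.scharr_derivatives(imgU8Ref.current, gradientRef.current);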

The first argument is the input (our greyscale image data), the second is the output. As a result, we have a matrix of 1280 by 480 values, in which every odd member of a row is an x-gradient and every even one is a y-gradient. So, how do we know where in the camera-captured video the most prominent horizontal line that will control the pitch of our theremin is? It’s easy: it’s the row of the image with the biggest sum of absolute values of y-gradients! In order to calculate those easily, we only need to multiply our gradient matrix by another matrix that has a width of one, a height equal to the width of the gradient matrix (so the doubled width of the original image), and has ones in every even row and zeroes in every odd one. Sounds familiar? It should, as we already have this matrix ready and prepared!

First, we need to manually set the correct number of columns — despite setting the correct matrix size, jsfeat doesn’t seem to get this right on its own. If we didn’t do that, the matrix multiplication would produce incorrect results and the theremin wouldn’t work correctly. As for the method call, the first argument is the resulting matrix (which we’ve already prepared before) and the other two are the matrices that we’re multiplying.
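Put together, those two steps could look like this:

// as mentioned above, jsfeat doesn't get the column count right on its own, so set it by hand
gradientRef.current.cols = VIDEO_WIDTH * 2;
// per-row sums of y-gradients = gradient matrix multiplied by the "ones" column vector
jsfeat.matmath.multiply(sumsRef.current, gradientRef.current, onesRef.current);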

Next, we need to calculate the absolute values of the sums — since the gradients can take both positive and negative values:
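A simple loop does the job:

// flip the per-row sums to their absolute values
for (let i = 0; i < VIDEO_HEIGHT; i += 1) {
  sumsRef.current.data[i] = Math.abs(sumsRef.current.data[i]);
}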

After that, it’s really simple. We just need to find the biggest value, and if the most prominent horizontal line is significant enough, we’ll find its index in the array (which is the row number in the image, counting from the top down), draw a line at the position of the biggest edge, and store it in the state for further sound use:
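Put together, the end of the tick function could look something like this (the threshold is just a starting point; it’s worth experimenting with the value):

const sums = sumsRef.current.data;

// find the strongest row of y-gradients
let max = 0;
let maxIndex = 0;
for (let i = 0; i < VIDEO_HEIGHT; i += 1) {
  if (sums[i] > max) {
    max = sums[i];
    maxIndex = i;
  }
}

// illustrative value - tune it to your lighting conditions
const EDGE_THRESHOLD = 100000;

if (max > EDGE_THRESHOLD) {
  // draw the detected line on top of the video frame...
  contextRef.current.strokeStyle = 'red';
  contextRef.current.lineWidth = 2;
  contextRef.current.beginPath();
  contextRef.current.moveTo(0, maxIndex);
  contextRef.current.lineTo(VIDEO_WIDTH, maxIndex);
  contextRef.current.stroke();
  // ...and remember its position for the sound
  setLinePosition(maxIndex);
} else {
  setLinePosition(null);
}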

Otherwise, we’ll clear the line position.

Only one more thing is needed to make our theremin work — we need to take the position of the horizontal line and produce a sound with a pitch based on it! Let’s create another component in components/Oscillator/index.jsx:
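The outline of that component could look like this:

import React, { useEffect, useRef } from 'react';

const Oscillator = ({ pitch }) => {
  // ...the audio setup will go here...
  return null;
};

export default Oscillator;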

where pitch is the relative height of our sound, which takes values from 0 to 1. In order for it to work, we need to assume some frequencies — our theremin will produce sound ranging from 261.63 Hz to 523.25 Hz — and a duration for the volume change (it won’t play any sound if there’s no horizontal line detected):
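For example (the constant names and the ramp duration are just a suggestion):

// the range of our theremin (roughly C4 to C5) and how quickly the volume changes
const MIN_FREQUENCY = 261.63;
const MAX_FREQUENCY = 523.25;
const RAMP_DURATION = 0.1; // seconds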

Next, we need to establish some refs for the Web Audio API elements:
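Along these lines:

const audioContextRef = useRef(null); // the Web Audio context
const oscillatorRef = useRef(null);   // the oscillator producing the tone
const gainRef = useRef(null);         // the gain node controlling the volume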

After the component is loaded, we need to set up all the elements that are necessary to play a simple sound using JavaScript:
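A sketch of that setup effect (the gain starts at a barely audible value rather than zero, since exponential ramps can’t start from zero):

useEffect(() => {
  // build a minimal audio graph: oscillator -> gain -> speakers
  const AudioContextClass = window.AudioContext || window.webkitAudioContext;
  audioContextRef.current = new AudioContextClass();
  oscillatorRef.current = audioContextRef.current.createOscillator();
  gainRef.current = audioContextRef.current.createGain();
  gainRef.current.gain.value = 0.0001;
  oscillatorRef.current.connect(gainRef.current);
  gainRef.current.connect(audioContextRef.current.destination);
  oscillatorRef.current.start();
}, []);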

Next, we check if the gain is fully loaded and if the pitch value is present — if so, we ramp up the volume and change the frequency based on the pitch itself, otherwise — we turn the volume down to a value that can’t be heard.
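Something like this, in an effect that reacts to the pitch prop:

useEffect(() => {
  if (!gainRef.current) {
    return;
  }
  const now = audioContextRef.current.currentTime;
  if (pitch !== null && pitch !== undefined) {
    // a line was detected: ramp the volume up and set the frequency within our range
    gainRef.current.gain.exponentialRampToValueAtTime(1, now + RAMP_DURATION);
    oscillatorRef.current.frequency.setValueAtTime(
      MIN_FREQUENCY + pitch * (MAX_FREQUENCY - MIN_FREQUENCY),
      now
    );
  } else {
    // no line detected: fade out to an inaudible level
    gainRef.current.gain.exponentialRampToValueAtTime(0.0001, now + RAMP_DURATION);
  }
}, [pitch]);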

Since we don’t need any actual markup in this particular component, we can simply return null from it.

The last thing needed to make the theremin play is to place the Oscillator in our Theremin component:
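For example, scaling the line position down to the 0–1 range (whether the top or the bottom of the frame should give the higher pitch is a matter of taste):

import Oscillator from '../Oscillator';

// ...

<div className="Theremin">
  <video ref={videoRef} width={VIDEO_WIDTH} height={VIDEO_HEIGHT} playsInline muted />
  <canvas ref={canvasRef} width={VIDEO_WIDTH} height={VIDEO_HEIGHT} />
  <Oscillator
    pitch={linePosition !== null ? 1 - linePosition / VIDEO_HEIGHT : null}
  />
</div>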

and the Theremin in the main App component. Et voilà, congratulations! You just created a computer vision-powered musical instrument in JavaScript. As a reminder, you can check out the final result here and you can check out the complete code in my GitHub repository. If you have any thoughts or remarks, be sure to let me know in the comments!

I mentor software developers. Drop me a line on MentorCruise for long-term mentorship or on CodeMentor for individual sessions.

