Virtual Reality & Voice: using A-Frame, React (with hooks) and Google Cloud Platform - Part I

The role of the avatar in virtual reality online communities has come to be much more than a placeholder for a missing physical body: to the user, it is an abstract form of expression for the individual operating a given digital shell.

In the future, our ability to distinguish between AI companions and human players within VR space will continue to blur until there is no difference between the two. Whether robotic machinery or intelligence, our interactions with artificial creations are at an all-time high.


In this four-part tutorial series, we will create a virtual voice persona that aims to be approachable and evocative, and to give the user a meaningful experience within this space. It is aimed at intermediate JavaScript users familiar with React and WebGL. However, PLEASE don't shy away if you are a beginner; leave a comment and I will be more than glad to answer.

In the first part of this tutorial series, we will cover the groundwork needed to get the scene for our voice visualization in place. Starting with a core React application and an A-Frame scene, we will use the browser APIs to get hold of the microphone and have our virtual voice persona respond to audio levels with its geometry (specifically through vertex displacement).

In the second part, we will flesh out the scene with surroundings, add UX mechanics (think spatial interfaces) and bring our voice orb and scene to life. We'll explore techniques such as geometry instancing and procedural generation, and recreate physical effects like audio reverb driven by user interaction.

In the third part, we will deploy a Node.js server on Google Cloud Platform (GCP) and tap into some of the APIs offered by GCP. Here we will connect our client to the REST API we build, integrating speech-to-text and speech synthesis functionality. The goal is to reach a basic conversational capability with the user.

The fourth and final part will focus on polishing the experience for production. We'll go over where we can optimize performance, address security concerns and make our web app a progressive web app (PWA) with the addition of a service worker (among other things). Analytics will be discussed as well.

TL;DR

For those that like to read code (part 1 branch):


The results of part I:

Overview

Let’s take a quick inventory of what we’ll be fiddling with.

The ingredients

NPM Packages:

  • react (and react-dom)
  • aframe
  • react-loadable
  • parcel (bundler, used in development)

Browser APIs:

  • Web Audio API (AudioContext, ScriptProcessorNode)
  • MediaDevices.getUserMedia (MediaStream)
  • Permissions API
  • Canvas 2D context

If you have not had the chance to try out any of these APIs, no worries! I will go over each one and do my best to cover how each is used within this project.

Setting up the project

If you haven't already and wish to follow along, I recommend cloning the repository:

git clone https://github.com/Francois-Esquire/voice-vr.git
git checkout part-1

installing the npm packages:

yarn install
# or
npm install

and starting the development server:

yarn dev
# or
npm run dev

Parcel will grab the source JavaScript and the vendor modules imported in our code, output a build to the public/ folder and serve it on localhost:1234. Dev mode will rebuild on change, in case you wish to play with the code yourself. Let's start!

The Core Application Using React

Logical compositions with React custom hooks

Before diving into the scene, there are a few features that we will tackle first:

  • check and handle permission state in the browser
  • capture microphone input
  • normalize the audio data from media stream
  • store and access the dataset for visual representation

The goal here is to use React custom hooks to take imperative JavaScript browser APIs and form a declarative set of components that we can use throughout our scene.

I highly recommend checking out the src/hooks/ directory to view all the usage examples that we can do with custom hooks.

/* src/hooks/useAudioContext.js */
import React from 'react';

export default function useAudioContext({ latencyHint, sampleRate }) {
  const [ctx, setCtx] = React.useState(null);

  React.useEffect(() => {
    if (!ctx) {
      setCtx(
        new (window.AudioContext || window.webkitAudioContext)({
          latencyHint,
          sampleRate,
        }),
      );
    }

    return () => ctx && ctx.state === 'running' && ctx.close();
  }, [ctx]);

  return ctx;
}

(Source)

(useAudioNode Source) — used in conjunction with useAudioContext
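To make the hook concrete, here is a minimal usage sketch (not from the repo; the component name, import path and render-prop shape are assumptions):

/* a hedged usage sketch, not from the repo */
import React from 'react';
import useAudioContext from '../hooks/useAudioContext';

// Hypothetical provider: creates the shared AudioContext and hands it to its
// children through a render prop once it exists.
export default function AudioContextProvider({ children, latencyHint = 'interactive' }) {
  const ctx = useAudioContext({ latencyHint });

  // Render nothing until the AudioContext has been created.
  return ctx ? children({ ctx }) : null;
}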

Component blocks

Thanks to our hooks, our AudioContext and AudioNodes can now be composed declaratively. Take for example this composition:

<AudioContext>
  <ScriptProcessorNode>
    <MediaStreamSourceNode />
  </ScriptProcessorNode>
</AudioContext>

The beauty of node-based systems, and of how React is written, is how intuitively they articulate the relationships between parent and child nodes and their shared context. The advantage is that if we ever want to swap our source (for example, when the voice assistant speaks rather than the user's mic), we can do so with a state change and some conditional logic in our JSX, as sketched below.
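As a rough illustration, the swap could look like this (a sketch; the assistant-playback component and the state variable are hypothetical, not part of the repo):

<AudioContext>
  <ScriptProcessorNode>
    {assistantIsSpeaking ? (
      // e.g. a node wrapping an <audio> element playing the assistant's voice
      <MediaElementSourceNode element={assistantAudioElement} />
    ) : (
      // the user's microphone
      <MediaStreamSourceNode />
    )}
  </ScriptProcessorNode>
</AudioContext>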

The same approach applies to the MediaStream and Permissions components as well. When we access a privileged API such as MediaStream or Geolocation, it helps to track permissions so we know the current state of the user's decision. Wrapping our MediaStream interface in a Permissions block gives us more insight into the browser state, which we can reflect in the UI or scene to decide how best to approach the user, earn their trust and enable the capabilities the app requires. For us, that means only the microphone.
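Under the hood, a Permissions block boils down to a query like this (a minimal sketch; note that not every browser supports querying the 'microphone' permission name, so a fallback is wise):

// A minimal sketch of the Permissions API check a Permissions component wraps.
navigator.permissions
  .query({ name: 'microphone' })
  .then((status) => {
    // status.state is 'granted', 'denied' or 'prompt'
    console.log('microphone permission:', status.state);

    // React to the user changing their decision later on.
    status.onchange = () => console.log('changed to:', status.state);
  })
  .catch(() => {
    // The browser can't query this permission; fall back to simply asking.
  });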

Check out src/components/ for other abstractions over the browser and other pieces of logic. So, how are we going to use these core components?

Sub application entry

Right before we enter our scene, we bring our components, hooks and scene together in one meeting place.

/* src/voice/index.jsx */
import React from 'react';

import { MediaStream, Permissions } from '../components';
import AudioTextureSamplingCanvas from './AudioTextureCanvas';
import Voice from './Voice';

export default function VoiceEntryPoint({ autoStart = false }) {
  const voiceCanvasId = 'voice-texture-map';

  return (
    <Permissions>
      {({ permissions }) => (
        <MediaStream
          video={false}
          auto={autoStart && permissions.microphone === 'granted'}
        >
          {({ media }) => (
            <>
              <AudioTextureSamplingCanvas
                id={voiceCanvasId}
                media={media}
              />
              <Voice
                voiceCanvasId={voiceCanvasId}
                media={media}
              />
            </>
          )}
        </MediaStream>
      )}
    </Permissions>
  );
}

(Source)

Canvas Audio Texture

Using the microphone and AudioContext API

The media stream created by a call to getUserMedia is passed in and becomes available for distribution throughout the rest of the application.
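For reference, acquiring that stream is a single call (a minimal sketch of what the MediaStream component does behind the scenes):

// A minimal sketch of acquiring the microphone stream.
navigator.mediaDevices
  .getUserMedia({ audio: true, video: false })
  .then((stream) => {
    // hand the stream to whatever consumes it (our audio graph and canvas)
  })
  .catch((error) => {
    // access was denied, or no microphone is available
  });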

Using the ScriptProcessorNode (from above), its audioprocess event handler receives an event object with an inputBuffer, where we can tap into each audio channel's data. We will only be using the left channel for now.

function onAudioProcess(event) {
  const left = event.inputBuffer.getChannelData(0);

  // ...
}

Here is what you should know about the ScriptProcessorNode:

  • channel data is a Float32 typed array with values ranging from -1 to 1
  • it's computationally heavy as a consequence of the specification's design
  • it's being deprecated in favor of AudioWorklet
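For context, the wiring that feeds onAudioProcess looks roughly like this (a sketch assuming a mono microphone source; not the project's exact code):

// A rough sketch of wiring the microphone into a ScriptProcessorNode.
const source = audioCtx.createMediaStreamSource(stream);
const processor = audioCtx.createScriptProcessor(2048, 1, 1); // bufferSize, inputs, outputs

source.connect(processor);
processor.connect(audioCtx.destination); // some browsers require this for the event to fire
processor.onaudioprocess = onAudioProcess;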

Drawing PCM data

The technique of storing derived values in a texture or buffer is not new; keep this point in mind. What we're essentially doing here is keeping a historical representation of the recorded microphone input as time passes, drawing grayscale values row by row for each reading.

In the short clip at the start of the article, did you notice how the sphere filled (or inflated) from the bottom up? That was each reading being drawn at the top of the canvas image, with subsequent readings drawn under the one before it.

What we have is a texture made up of time and PCM values across its dimensions. In the next part, we will use this to create interesting effects like reverb, along with whatever else we can come up with from sampling a texture that carries the element of time.

Take note, the PCM data values range from -1 to 1, so we map them to a range of 0 to 255 for compatibility with the canvas context. These values will be passed directly to our shader as a texture which we can look up.
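As a minimal sketch of that mapping and the row-by-row drawing (illustrative names, not the exact implementation in AudioTextureCanvas):

// Map PCM samples from [-1, 1] to [0, 255] grayscale and write one canvas row.
function drawRow(ctx2d, samples, row, width) {
  const image = ctx2d.createImageData(width, 1);

  for (let x = 0; x < width; x += 1) {
    // Nearest-sample lookup into the PCM buffer for this pixel column.
    const sample = samples[Math.floor((x / width) * samples.length)];
    const value = Math.round(((sample + 1) / 2) * 255);
    const i = x * 4;

    image.data[i] = value;     // R
    image.data[i + 1] = value; // G
    image.data[i + 2] = value; // B
    image.data[i + 3] = 255;   // A
  }

  ctx2d.putImageData(image, 0, row);
}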

A-Frame Scene

Markup

In a nutshell, our scene will fundamentally look like this:

/* src/voice/Voice.jsx */
<a-scene>
  <a-camera>
    <a-cursor />
  </a-camera>

  <a-sky />

  <a-icosahedron />

  <a-entity>
    <a-plane />
    <a-text />
  </a-entity>
</a-scene>

(snippet — complete Source)

Vertex Displacement

This is where the magic happens. Let’s register an A-Frame component called “voice” and connect it to our canvas texture.

We will need a few things here. First, we will need to import a fragment and a vertex shader (discussed below) for a ShaderMaterial to consume. We will export a custom React hook and put the component registration there, with an init() method.

/* src/voice/useVoiceComponent.js */
import AFRAME from 'aframe';
import React from 'react';
import {
  vertex as vertexShader,
  fragment as fragmentShader,
} from './shaders';

const {
  ShaderMaterial,
  DataTexture,
  RGBAFormat,
  NearestFilter,
  DoubleSide,
} = AFRAME.THREE;

export default function useVoiceComponent() {
  React.useEffect(() => {
    AFRAME.registerComponent('voice', {
      init() {
        const mesh = this.el.getObject3D('mesh');

        // The canvas our audio texture is drawn on (AudioTextureCanvas).
        this.canvas = document.getElementById(this.data.id);
        this.ctx = this.canvas.getContext('2d');

        const data = this.getImageData();
        // The canvas is square, so width = height = sqrt(pixel count).
        const textureSize = Math.sqrt(data.length / 4);
        this.texture = new DataTexture(
          new Uint8Array(data.length),
          textureSize,
          textureSize,
          RGBAFormat,
        );
        this.texture.magFilter = NearestFilter;
        this.texture.needsUpdate = true;

        this.material = new ShaderMaterial({
          vertexShader,
          fragmentShader,
          uniforms: {
            texture: { type: 't', value: this.texture },
          },
        });
        this.material.side = DoubleSide;

        mesh.material = this.material;
      },
    });
  }, []);
}

(Source)

Here is where we query the canvas element used by AudioTextureCanvas to draw the audio texture, and read its data on every tick() of the A-Frame component instance.
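The per-frame update is left out of the snippet above, but conceptually a handler along these lines would sit next to init() in the registerComponent call (a sketch, assuming getImageData() returns the canvas pixel buffer):

tick() {
  const data = this.getImageData();  // latest canvas pixels
  this.texture.image.data.set(data); // copy into the DataTexture's buffer
  this.texture.needsUpdate = true;   // tell Three.js to re-upload it to the GPU
},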

Displacing a vertex on the normal

At this stage, we can finally access our audio texture from within our shaders.

The best way to learn some fundamental GLSL and about vertex displacement is to check out these two tutorials. The first is the original tutorial (which inspired me) and the other is an A-Frame adaptation done for the community:

Vertex displacement with Three.js (original)

Vertex displacement with A-Frame (docs, adapted from the original)

If you are new to WebGL or GLSL, I highly recommend you check out these two tutorials before moving forward.

vertex shader:

What's happening here is that the vertex shader runs over an array in a buffer stored in GPU memory, which represents all the points of our object in 3D (in chunks of three: x, y, z).

We read a texel from the texture at the vertex's UV coordinate and store it as a vec3 called sound. To displace the vertex along a given orientation, we use the normal vector as the direction and the sound value as the amplitude. This scaled offset is then added to the current position before being used to calculate gl_Position.

varying vec2 vUv;
uniform sampler2D texture;
void main() {
  vUv = uv;
  vec3 sound = texture2D( texture, uv ).rgb;
  vec3 newPosition = position + ( normal * clamp( sound, 0., 1. ) );
  gl_Position = projectionMatrix * modelViewMatrix * vec4( newPosition, 1.0 );
}

fragment shader:

The fragment shader is what draws the pixels to the screen (specifically, into the frame buffer). Here we use our varying vUv (the interpolated UV coordinate) to pick a color for each fragment.

varying vec2 vUv;
void main() {
  vec3 color = mix( vec3( vUv, 1. ), vec3(1), .1 );
  gl_FragColor = vec4( color.rgb, 1.0 );
}

(Source)

Optimizations

The web engineer in me has a knack for dissecting the app at hand to find where real-world situations, like slow or lost connections during requests, or low-powered devices running the web app, could cause bottlenecks or failures.

To top it off, the domain of real-time rendering has its own performance restrictions, which can ruin the experience for the end user. No good. In the next part of this series, we'll look specifically at such problems. For now, let's assume this aspect of our application is immaculate (it's not).

Where else can we make improvements then? Let’s start with the JavaScript bundling.

Code splitting to the rescue

A-Frame and Three together cost us about 1 MB in file size alone. If we haphazardly bundle our code into one main.js, or even split out a vendors.js on entry as well, the time and network traffic needed for both resources on entry would take away from the experience if we were intermittently in and out of network coverage (under the subway, for example) or simply on a slow connection.

The goal is to keep our entry under 150 KB for JavaScript, CSS and HTML all together. To remedy this, let's split our application at a strategic point and try to make our entry script as light as possible. The idea is to stage the bulk of the application code and only ship the system that will orchestrate the runtime on entry.

/* src/index.jsx */
import Loadable from 'react-loadable';

export default Loadable({
  loader: () => import('./voice'),
  loading: () => null,
});

(Source)

What this accomplishes is that only the base startup script we wrote and the React dependencies are bundled as the entry vendor scripts, without the A-Frame or Three libraries. This gives us a chance to load the base script, which in turn can initialize the web app (page lifecycle) and entertain the end user while the rest of the app downloads.

Instead of a 1.2 MB JavaScript entry download, main.js is only 125 KB, with source and vendor code combined.

Service Worker

An additive solution we will explore in the last part is using a service worker and the Cache API to manage our resources and ease server/client traffic (good for compute engines), and in tandem offer offline capabilities! The ultimate benefit, from my experience building PWAs over the past eleven months, is that loading a cached asset is nearly instant (predictable, proportional to file size) and independent of network traffic.
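As a small preview of what part IV will cover properly (a sketch; the cache name and asset list are placeholders):

// service worker: cache static assets on install, serve from cache when possible.
self.addEventListener('install', (event) => {
  event.waitUntil(
    caches.open('voice-vr-v1').then((cache) => cache.addAll(['/', '/main.js'])),
  );
});

self.addEventListener('fetch', (event) => {
  event.respondWith(
    caches.match(event.request).then((cached) => cached || fetch(event.request)),
  );
});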

Better memory management

Although the audio canvas texture is essential to our process, the work could be offloaded to different threads. There are two cases in particular in our project:

ScriptProcessorNode

First, the ScriptProcessorNode API, by design, requires instancing two different arrays to use it. The catch-22 is that we either replace the old array with the new one to update the texture, or copy the values over into a fixed array of our own, and both options come with their own problems. If we replace the array, garbage collection will kick your app in the gut at the rate at which this happens. If we copy values over into our own array, the cost of performing this operation at the same rate at which the event fires limits us to lower quality samples from the latency alone.

To solve this, we need to adopt the AudioWorklet. Check out this video:
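For a flavor of what that migration involves, here is a sketch of the worklet side (hypothetical module, not part of this repo yet):

// A hypothetical mic-processor.js, loaded via audioCtx.audioWorklet.addModule().
// It runs on the audio rendering thread, away from the main thread.
class MicProcessor extends AudioWorkletProcessor {
  process(inputs) {
    const left = inputs[0][0]; // first input, left channel (a 128-sample block)
    if (left) this.port.postMessage(left.slice()); // ship a copy to the main thread
    return true; // keep the processor alive
  }
}

registerProcessor('mic-processor', MicProcessor);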

Canvas Texture

Do you recall when I wrote that there are other techniques for storing a data texture, specifically on the GPU? If we delegate our texture processing and manage to efficiently move audio data to the GPU, we can offload this work from the main thread and keep the data physically closer to where it needs to be. More on this in the next part.

Conclusion

For those of you that have survived this far, I truly appreciate you taking the time to read! Please feel free to leave any questions or comments below.

[Placeholder for part II]

Thank You