Applied Neural Nets in Virtual Reality

How to bring predictive power to your XR application

James C Kane
9 min read · Jan 3, 2023

[Author’s note: the original publisher and host of this article, VRFocus, was bought and shut down by something called “Good Morning Web3.” Reprinting a draft here for posterity.]

Forget the state of the art, even the basics of machine learning can be comically misunderstood by the general public and tech industry luminaries alike. (Full disclosure: I include myself in this indictment). We’re told the possibilities are endless, but how do we get there? At a high level, how does the average programmer in 2018 direct a machine to study massive quantities of data or perform tasks ad infinitum such that it learns to make accurate predictions later on? Specifically, how do you apply such a model within the context of an XR application?

These are big, multivariate questions. There’s linear algebra and differential calculus to grapple with on the ML theory side, plus a tangle of languages and libraries to weave together on the application end. To explore these concepts, it helps to define achievable goals. Ours will be:

  1. create a Unity VR drawing app with Leap Motion hand-tracking
  2. train a neural network to accurately classify shapes drawn in that app

This will produce a novel XR input or game mechanic, and we’ll learn about the intersection of ML and spatial computing along the way. Unto the breach!

VR Phase: Finger Painting

I’ve already written about the powerful UX patterns unlocked by real-time hand-tracking — and my belief that future generations of XR HMDs should feature this tech natively with wide field-of-view and low latency as soon as technically feasible. For this project I wanted to bring in the best available option: Leap Motion tracking to allow the user to paint with their hands (sans controller) in 3D space. I like this as a less VR-centric input mechanic, since AR will rely heavily on this kind of hand tracking and gesture-based input.

Leap Motion has a Unity VR starter project that gets you up and running quickly; on load, it instantiates the necessary prefabs for your headset’s SDK. We want to be able to make simple drawings in VR, so a Unity TrailRenderer attached to the tip of the index finger will do nicely. After that, I scripted a container GameObject to activate only when the index finger is extended (check out the very handy Leap Motion docs for more).

There we have it: we can draw, and we can access that Vector3[] drawing data in real time via scripting in several ways. We’ll have to flatten and normalize it somehow later on, but we’ll cross that bridge once we understand our required input format.

The virtual reality app is the easy part. That’s kind of weird, right?

Planning Phase: Some Supervision Required

Unity’s experimental machine learning package ML-Agents might seem like an option here, but it’s currently geared toward reinforcement learning. That’s a training technique in which the model isn’t fed labeled data; instead, it’s given only a narrowly defined set of actions it can use to pursue a goal. Over millions of iterative trial-and-error training sessions, these agents experiment with their abilities to discover the “best” way to achieve the goals you’ve set, or at least some kind of local optimum or cheat.

The work being done on that project is fantastic, but reinforcement learning isn’t appropriate for our app. First, glyph recognition in XR is a classification problem, and I’ve got a few labeled training datasets in mind for supervised learning. Second, ML-Agents essentially stands up a Unity 3D game environment in which physics-based training sessions can run. That’s amazing, but I don’t need it in this case either.

What I may find handy is ML-Agents’ accompanying TensorFlowSharp library, which will let me load any pre-trained model and run it with standard, lower-level TensorFlow functionality from Unity’s native C# scripting. We’ll revisit this shortly.

Training Phase: Start Small and Scale

First, consider the humble MNIST dataset. This classic classification problem is the “Hello, World!” of machine learning, consisting of 70,000 28-by-28-pixel grayscale images of handwritten digits (60,000 for training, 10,000 for testing). The objective is to use this data to build a model that can accurately guess the digit depicted in new, unfamiliar drawings.

Sample MNIST data. That stands for Modified National Institute of Standards and Technology, which is one of the worst acronyms I’ve ever heard.

Some great TensorFlow documentation runs down a very basic process for analysis of MNIST data, which turns out to be fairly easy to grok (they also provide more advanced and accurate architectures in subsequent examples).

During the training phase, the program studies every pixel of every digit image passed to it, each pixel carrying a value between 0 and 1 according to the “ink strength” in that cell. After viewing thousands of examples, the program learns a set of weights describing how strongly ink in each pixel counts as evidence for or against each digit. The resulting model can be visualized as a heat map per digit, with red indicating negative correlation with a class and blue positive.

Consider the zero: given the tens of thousands of handwritten examples this model was trained on, it’s very unlikely that an unknown sample depicts a zero if there are strokes running directly through the center of the image, where a zero is normally empty.
That per-class evidence tally boils down to evidence_i = Σ_j (W_ij · x_j) + b_i, where W_i is the weights and b_i is the bias for class i, and j is an index for summing over the pixels of our input image x. The heat maps described above are just those weights visualized.

When we submit a new sample to our trained model, it converts the evidence tally for each digit, zero through nine, into predicted probabilities using the “softmax” function and declares a best guess (seriously, read the tutorials for a better explanation of the maths than I could proffer). The algorithms I’m describing may look fancy, but this is one of the simpler classification problems in deep learning. There are certainly more advanced and reliable methods, but let’s stay focused on how to apply an arbitrary machine learning model, if not the most advanced one.
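If you’d rather see that in code than prose, here’s a minimal sketch of the same softmax-regression idea. Note it uses tf.keras rather than the tutorial’s lower-level TensorFlow calls, and the optimizer and epoch count are just reasonable defaults I’ve picked, not anything canonical:

```python
# Minimal softmax-regression classifier for MNIST, sketched with tf.keras.
import tensorflow as tf

# 60,000 training and 10,000 test images; scale pixel values to the 0-1 range.
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),     # 784-element input vector
    tf.keras.layers.Dense(10, activation="softmax"),   # evidence -> probability per digit
])

model.compile(optimizer="sgd",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

model.fit(x_train, y_train, epochs=5)
print(model.evaluate(x_test, y_test))   # [loss, accuracy] on unseen digits

# The learned weights are exactly the heat maps described above:
# w[:, i].reshape(28, 28) is the red/blue evidence map for digit i.
w = model.layers[-1].get_weights()[0]    # shape (784, 10)
```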

So numbers are good, but I want glyphs, baby. Enter Google’s Quick, Draw! dataset, which is not at all dissimilar. Google’s like-named game asks you to draw something with your mouse or touchscreen device and the app uses deep learning to guess what you’ve drawn. This data is then made public by Google for teaching purposes.

I could train a model to recognize any or all of the hundreds of classes Google provides free to the public, but for brevity’s sake let’s limit our sample to some fairly primitive glyphs … zigzag, circle and square.
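Training on those three classes looks almost identical to the MNIST sketch above. Here’s a rough outline, assuming you’ve downloaded the 28-by-28 numpy bitmap files for each class from the public Quick, Draw! data release and saved them as zigzag.npy, circle.npy and square.npy (the filenames, sample cap and layer sizes are all just placeholder choices):

```python
# Sketch: train a three-class glyph classifier on Quick, Draw! bitmap data.
import numpy as np
import tensorflow as tf

CLASSES = ["zigzag", "circle", "square"]
PER_CLASS = 10000   # cap samples per class to keep training quick

xs, ys = [], []
for label, name in enumerate(CLASSES):
    data = np.load(f"{name}.npy")        # shape (num_drawings, 784), values 0-255
    n = min(len(data), PER_CLASS)
    xs.append(data[:n] / 255.0)          # normalize to 0-1, same as MNIST
    ys.append(np.full(n, label))

x, y = np.concatenate(xs), np.concatenate(ys)
idx = np.random.permutation(len(x))      # shuffle so the classes are interleaved
x, y = x[idx], y[idx]

model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu", input_shape=(784,)),
    tf.keras.layers.Dense(len(CLASSES), activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x, y, epochs=5, validation_split=0.1)

model.save("glyphs.h5")   # the file our web app will load later
```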

Web App Phase: Separation of Powers

In this project, I avoided Unity’s ML-Agents for the training phase because I didn’t need a physics-based reinforcement-learning environment for my application. Now that it’s time to bring a pre-trained model into my VR app, there are two ways we could do it.

I mentioned TensorFlowSharp earlier — if you wanted to implement the model directly in Unity C# scripting to run glyph recognition processing locally on the end-user device, take that route. But there are a lot of practical or business reasons you might not want to do that — plus if you’re new to TensorFlow or machine learning, the C# API exposed is fairly low-level. If we were instead running some kind of external Python-based app, we could utilize a more human-readable TensorFlow library such as @fchollet’s Keras.

There’s something to be said for separation of powers. In a production environment, it might make more sense to have a centralized group of highly-optimized machines taking requests and serving predictive output from your model to individual end-users over the web. For this project, let’s treat the ML/predictive component in that way, as a separate application. Now our Unity VR app is just a Unity VR app — one that happens to send extremely typical HTTP requests to our locally-running web app.

It’s alive — and it has internet access!

That app need not be complex (at least not at this proof-of-concept stage). Basically, I want to run a Python script predict.py that implements Keras/TensorFlow whenever I send a POST request containing image data to some arbitrary (and for now locally-running) URL. It might sound complicated, but it only took a few lines of Django and Keras code to accomplish once the environment was set up.

To be clear, this is terribly inefficient, as it loads and throws away the model on every request. But for convenience’s sake, it contains every step necessary to load a model and serve predictions based on the image data contained in a web request, in less than ten lines of Python.
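For reference, a view along those lines might look like the sketch below. Everything here is illustrative: the view name, URL, model filename and upload field are placeholders, and it assumes Django 2.x with Keras and Pillow installed:

```python
# views.py -- naive Django view that loads a Keras model and returns a prediction.
# Deliberately wasteful: the model is reloaded and discarded on every request.
import numpy as np
from PIL import Image
from django.http import JsonResponse
from django.views.decorators.csrf import csrf_exempt
from tensorflow import keras

CLASSES = ["zigzag", "circle", "square"]

@csrf_exempt
def predict_view(request):
    img = Image.open(request.FILES["image"]).convert("L").resize((28, 28))
    x = np.asarray(img, dtype="float32").reshape(1, 784) / 255.0
    model = keras.models.load_model("glyphs.h5")
    probs = model.predict(x)[0]
    return JsonResponse({"guess": CLASSES[int(np.argmax(probs))],
                         "confidence": float(np.max(probs))})

# Wire it up in urls.py with something like: path("predict/", predict_view)
```

Depending on how the VR screenshot is rendered, you may also need to invert or threshold the pixels so the drawn stroke reads as “ink” on an empty background, the way the training bitmaps do.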

Once your predictive app and VR app are working separately, it’s time to make them talk to each other. I’d love to send drawing coordinate data in real-time to allow for the kind of ongoing predictive analysis Google Quick Draw offers, since that’s easily accessed through our TrailRenderer component. But our ultra-simple model currently accepts only image data (in the form of flattened arrays). We could:

a) choose another training model with a vector-based input format

b) flatten the drawing coordinate data in real time (or every few seconds) and send it over a websocket connection to our app (sketched a little further below)

c) for proof-of-concept purposes, just capture a screenshot of the drawing from an HMD-relative camera, triggered by my non-drawing hand

I like all of that, but to get this thing up and running, let’s choose the last option and iterate from there.
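That said, the flattening in option (b) is less scary than it sounds. Purely as an illustration, assuming the stroke points have already been projected onto a 2D plane and normalized to the 0-1 range, the server-side rasterization might look something like this:

```python
# Sketch: rasterize a drawn stroke into the flat 28x28 input our model expects.
import numpy as np

def rasterize(points, size=28):
    """points: iterable of (x, y) pairs in [0, 1]. Returns a flat (size*size,) array."""
    grid = np.zeros((size, size), dtype="float32")
    for x, y in points:
        col = min(int(x * size), size - 1)
        row = min(int((1.0 - y) * size), size - 1)   # flip y so "up" in VR is "up" in the image
        grid[row, col] = 1.0                         # full "ink strength" wherever the stroke passes
    return grid.reshape(-1)

# Example: a rough diagonal stroke from corner to corner
stroke = [(i / 27, i / 27) for i in range(28)]
print(rasterize(stroke).shape)   # (784,)
```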

Web development experience comes in handy — I won’t bore you with too many details, but there were a few hiccups due to version misalignment across my array of ML, VR, web and Unity APIs. I ended up having to use the lower-level Unity HTTP API to generate my own boundary strings to fit the request format required by my specific Python/Django version.
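While wrestling with that request format, it helps to confirm the web app behaves when hit with a plain multipart POST, independent of Unity. A quick sanity check with the Python requests library (the URL and field name are placeholders matching the view sketched earlier):

```python
# Quick sanity check: POST a test image to the locally running prediction endpoint.
import requests

with open("test_circle.png", "rb") as f:
    resp = requests.post("http://127.0.0.1:8000/predict/", files={"image": f})

print(resp.json())   # e.g. {"guess": "circle", "confidence": 0.93}
```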

Again, this is inefficient and doesn’t isolate behaviors at all, but for purposes of illustration, this script does everything needed to take a snapshot from your HMD camera, transfer it to a web app, and deal with the response.

Once in place — bam! I’ve got proof of concept end-to-end functionality as the VR app pings the web app and reacts to the predictive response sent back.

Deployment Phase: Pretty Wizardry

So, the architecture of our multi-application VR/ML project is set. Previously it existed in a hopelessly boring grey void, but one asset store shopping spree later, we’re ready for the big reveal.

Okay, okay. *takes breath*

Conclusions: A Fun & Accessible XR Input Mechanic

I have to say, the first time this worked end-to-end, I was elated. We’re talking about a complex array of emerging tech that’s generally open source and fast evolving. This input style is new, but glyph-drawing in 3D space is bound to become a powerful input pattern as the spatial computing revolution rolls on. Think of the potential XR accessibility applications — it could be a custom, OS-level input style for people with different physical abilities to run common applications and more.

This also fits into a broader narrative: it’s high time we designers and developers seek a deeper understanding of both how ML tech functions, and how it can be applied to our world of XR applications going forward.

Thanks & Stretch Goals

Many thanks to developer and creative technologist Samuel Snider-Held of Mediamonks for writing the piece that inspired this article. He was also super gracious when I was having difficulty connecting the dots on some of these pieces. Thanks to Leap Motion’s community manager Alex Colgan for pointing me to that piece in the first place. Finally, thanks to Blake Schreurs, who does great XR dev work on his YouTube channel and who was kind enough to email some extremely helpful suggestions early on in this project.

I finished the proof of concept app described above about a week ago, but have since been preoccupied by contract work. What I hope to accomplish when I revisit this thing:

1a) open a two-way websocket connection between Unity and Django to enable real-time drawing processing

or 1b) integrate the Keras/TF model directly into Unity with TFSharp and enable real-time processing in that way

2) I’ve written about how voice recognition is low-hanging fruit for designers and developers in 2018. I could toss the newly released IBM Watson API for Unity into our XRI stew, allowing for expanded ML predictive capabilities

3) What if I want to use this mechanic to trigger precisely-targeted abilities? I’d strive to implement Martin Schubert’s great concept for shoulder-aligned distance targeting
