Whatever you say, happens

Live creation of VR/AR experiences

Mike Johnston
15 min read · Oct 1, 2017

This post was written for the Live 2017 workshop at the ACM SIGPLAN conference on Systems, Programming, Languages and Applications: Software for Humanity (SPLASH).

Moatboat is a way to create experiences in virtual reality and augmented reality using your voice. The goal is simple: “Whatever you say, happens.” It works like this:

Using Moatboat in augmented reality.

Tens of thousands of people have tried the first experimental release so far, and we’re now hard at work on the next version.

I’ll start by summarizing our vision and how Moatboat works. Then I’ll outline our motivations and go into more detail about how it’s designed and implemented. Along the way I’ll point out some challenges we’ve faced and some known limitations, and I’ll conclude with a summary of future work.

Vision

Like many before us, we believe humans and computers are better together. Technology can augment human creativity by empowering people to express themselves in ways that aren’t otherwise possible. To this end, Moatboat helps people quickly express their thoughts and ideas as experiences in VR/AR.

How it works

We’re building a software platform that you can use by wearing a VR headset or by looking through an AR device. By speaking naturally, you can create and control simulations. For example, you can create a small ecosystem by saying something like, “I want some wolves eating some sheep.” A scenario with wolves eating sheep will start happening in front of you. You can keep adding layers by saying more sentences like, “Sheep eat grass.” As you add more objects and behaviors, increasingly complex scenarios can emerge.

Unlike other efforts that focus on text-to-scene transformation for object placement, e.g. “An axe is next to a tree,” we focus on behaviors and interactivity, e.g. “The person cuts down the tree with an axe.”

Under the hood, we use natural language processing (NLP) powered by machine learning together with networking, simulation, and artificial intelligence technology from the video game industry. We transform your sentences into dynamic 3D environments populated by simulated characters called agents. Since this transformation typically happens within hundreds of milliseconds, creators experience a very tight feedback loop between ideas and results.

The vocabulary we support grows over time. We learn from our users, and then scale up the platform based on what they try to do. In particular, we use anonymous usage data to improve our understanding of intent and to decide what content we should add in new versions. By iterating in this way, we can support a growing variety of objects, behaviors, sentences, and audiovisuals.

Voice isn’t great for certain interactions like placing objects (“A little to the left!”), so we support multimodal input. You can use virtual hands to point, grab, and move things around. When inferring spoken intent, we include context based on your position, where you’re looking, and where you’re pointing. For example, you might lean in close to a character in the world while pointing and say, “Go pick up the shovel over there.”
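
To give a concrete sense of how pointing context might feed into spoken intent, here is a minimal sketch that resolves a phrase like “the shovel over there” by picking the world object nearest to the user’s pointing ray. The function name and the ray math are illustrative assumptions, not Moatboat’s actual implementation.

```python
# Hypothetical deictic resolution: choose the object closest to the pointing ray.
import numpy as np

def resolve_pointing(objects, origin, direction):
    """Return the name of the object whose position lies closest to the ray."""
    direction = direction / np.linalg.norm(direction)
    best, best_dist = None, float("inf")
    for name, pos in objects.items():
        to_obj = pos - origin
        t = max(np.dot(to_obj, direction), 0.0)        # project onto the ray
        dist = np.linalg.norm(to_obj - t * direction)  # perpendicular distance
        if dist < best_dist:
            best, best_dist = name, dist
    return best

objects = {"shovel": np.array([2.0, 0.0, 5.0]), "axe": np.array([-3.0, 0.0, 1.0])}
print(resolve_pointing(objects, np.zeros(3), np.array([0.3, 0.0, 1.0])))  # shovel
```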

Working from first principles, we aim to provide realtime transformation of ideas into dynamic, interactive worlds. Since Moatboat can also be used live with other people, it becomes a platform for communicating experiences.

Programming meets storytelling

Our team lives at the intersection of programming and storytelling—we sometimes call it storymaking—and we’re not alone in this space. There are many video games, tools, and other experiences that live here too. Specifically, our theme of using interactive simulations to communicate is shared with tools like Smalltalk and NetLogo, games like Dwarf Fortress, The Sims, and Scribblenauts, and inspiring work from people like Bret Victor and Nicky Case. The idea is also well documented in the excellent book Simulation and Learning. Constructionism, a framework for participatory learning, captures many of the potential benefits.

In programming terms, Moatboat is most like a live, declarative, natural language programming environment. But we don’t really help people to program in any traditional sense.

When you create with Moatboat, your intentions start at the concrete level of worlds and characters. By saying what you want to see happen, you’re transforming your intentions into an approximation of reality. “Wolves eating sheep” gives you exactly that, but with a predator-prey relationship that’s preconfigured with simple defaults. This is an instantly gratifying way to create virtual and augmented worlds. But it’s not really programming, since your initial results are fairly arbitrary defaults. We’re working on ways for users to achieve more specific results, like wolves eating sheep at a particular rate over time. Making Moatboat deeper and more rigorous in this way is a challenge, much like trying to make programming lighter and more fun.

In terms of storytelling, Moatboat is most similar to an emergent story generator or systems for interactive fiction like Inform. But we don’t really help people tell stories in any traditional sense either.

Instead, like Dwarf Fortress, stories emerge from the dynamic systems that you create and run over time. Participants have agency, actions have consequences, and results can be influenced unexpectedly. This is a satisfying way to express and experience systemic causality. When a world feels alive and reactive, it invites agency in the form of curiosity, exploration, and experimentation. You want to play with it! But it’s not a great way to tell a specific story, since creators have too little control over what people will experience.

This tension between story and agency is well understood by video game designers. We’re working on ways users can be guided down particular paths when the goal is to communicate a more specific experience. In the meantime, we happily join others who explore this space between programming and storytelling, and by focusing on immersive experiences, we see opportunities to push forward from a different angle.

A medium for experiences

Our use of VR/AR is important. As a medium, it’s better at some things, but worse at others. For our purposes, it’s really good at making you feel like a creative badass who can create and control worlds with a few words and the flick of your wrist. It’s also really good at feeling social, at least compared to tools like collaborative text editors where people are disembodied cursors.

Although the basic idea of Moatboat can work in 2D as well—and that’s exactly how we started, on an iPad—you don’t get the same feeling in 2D. In 2D, you’re moving text and pixels around on a canvas. In an immersive 3D medium like VR/AR, you’re meeting with other people to shape reality together.

Immersive mediums, with the best example being reality itself, excel at communicating experiences: they’re dynamic, fully interactive environments. Non-immersive mediums like paper and film excel at communicating information: linearized, interwoven text and pixels. Both are useful, and there can be overlap, but with Moatboat, we’re focusing on ways to communicate experiences.

Shaping reality as a way to communicate was suggested by early VR pioneer Jaron Lanier, who called it postsymbolic communication: the notion that we might one day communicate by simply manipulating objects in virtual worlds. Though nebulous, and in some ways misguided, it was a valuable idea that only recently started becoming practical to implement. Moatboat is one way that communicating by shaping reality might actually work today.

We also see overlap with HARC’s Realtalk as a computational medium. Bret Victor and his team want to enable the creation and use of computational media using ordinary physical objects. We share many of the same values, but perhaps to Bret Victor’s horror, we’re embracing the incremental virtualization that he eloquently cautions against. We’re starting with VR/AR instead of physical objects because we think it’s the most dynamic, experiential medium we can offer people right now. When it becomes possible to more humanely shape reality with something like nanobot clouds or tangible holograms or some other future invention, everything we learn from Moatboat will carry forward to these improved immersive mediums as well.

Our goal may become clearer when we look at the user feedback loop in Moatboat.

The feedback loop

We aim to achieve a tight user feedback loop between intentions and results, with an interface that is natural, efficient, highly accessible, and broadly applicable. The loop we focus on is:

  1. Ideate. The user has an idea. Users can start with their own roughly formed idea, or they may arrive at this phase after being inspired by an experience (their own or someone else’s).
  2. Express. The user speaks and gestures. We recognize the words people say, where they are, where they’re looking, and where they’re pointing.
  3. Understand. Given some words and context, we infer the user’s intent using natural language processing powered by machine learning.
  4. Represent. We transform the inferred intent into a dynamic, simulated environment. Appropriate audiovisual cues are rendered for the user to make it more obvious what’s happening.
  5. Repeat. Upon experiencing the result, the user can return to ideation.

The tighter this loop, the more satisfying it feels to create and experiment.
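
As a rough illustration, the loop can be read as a pipeline. The sketch below uses placeholder stages whose names are invented to mirror the five phases above; it is not Moatboat’s API.

```python
# Placeholder pipeline mirroring the five phases; every name here is invented.
def capture_input():                    # 2. Express: speech + gaze + pointing
    return "wolves eat sheep", {"gaze": None, "pointing": None}

def infer_intent(utterance, context):   # 3. Understand (see Phase 1 below)
    return {"subject": "wolf", "verb": "eat", "object": "sheep"}

def apply_to_world(intent, world):      # 4. Represent (see Phase 2 below)
    world.append(intent)
    return f"{intent['subject']}s now {intent['verb']} {intent['object']}"

world = []
utterance, context = capture_input()    # 1. Ideate happens in the user's head
print(apply_to_world(infer_intent(utterance, context), world))
# 5. Repeat: the user experiences the result and returns to ideation.
```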

Using realtime networking techniques borrowed from video games, multiple people can create together live. Much like an improvised performance, their respective loops can feed into each other, allowing one person’s change to immediately inspire another’s:

Multiple people using Moatboat together in virtual reality.

Our technical implementation of this feedback loop happens in two broad phases: transforming intent, and representing results.

Phase 1: Transforming intent

To understand what people intend when they say a sentence, we use our own TensorFlow-backed recurrent neural network for natural language processing to convert language and context into structured output. This is a well-established approach, and we’re merely applying it to our domain.
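
As a rough sketch of this kind of model, a recurrent encoder over an utterance feeding a softmax over intent classes can be written with TensorFlow’s Keras API as below. The vocabulary size, label set, and layer sizes are invented for illustration, and our real system also produces structured slots rather than a single class.

```python
# Illustrative intent-classification model; all sizes and labels are assumptions.
import tensorflow as tf

VOCAB_SIZE = 10_000   # hypothetical token vocabulary
NUM_INTENTS = 32      # hypothetical intent labels (add object, add behavior, ...)

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(VOCAB_SIZE, 64),                 # token embeddings
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),   # recurrent encoder
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(NUM_INTENTS, activation="softmax"),  # intent distribution
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```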

When we were first getting started, we needed some way to seed our machine learning model. We used natural language generation based on patterns we identified through analytics, user testing, and tools like Mechanical Turk. Now we’re able to keep improving our model by training it with real sentences that people are actually using in the versions we ship. We’re gathering our own corpus because we want people to push the limits by saying things we don’t expect, and we need to support creative sentence patterns that are often absent from publicly available corpora.
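
Here is a minimal sketch of the kind of template-driven generation that can seed such a model; the templates, slots, and label format are assumptions for illustration rather than the patterns we actually identified.

```python
# Hypothetical template-based generation of (sentence, labels) training pairs.
import random

SUBJECTS = ["wolves", "sheep", "people", "delivery trucks"]
VERBS = ["eat", "chase", "follow"]
OBJECTS = ["sheep", "grass", "red boxes"]
TEMPLATES = [
    "{subj} {verb} {obj}",
    "I want {subj} to {verb} {obj}",
    "make the {subj} {verb} the {obj}",
]

def generate(n):
    """Yield n synthetic training pairs for seeding the model."""
    for _ in range(n):
        subj = random.choice(SUBJECTS)
        verb = random.choice(VERBS)
        obj = random.choice(OBJECTS)
        sentence = random.choice(TEMPLATES).format(subj=subj, verb=verb, obj=obj)
        yield sentence, {"subject": subj, "verb": verb, "object": obj}

for sentence, labels in generate(3):
    print(sentence, labels)
```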

We can handle an increasingly wide variety of sentence variations as we feed the model with more examples. We can also handle multiple ways of expressing the same idea, as well as more advanced language features like adjectives, adverbs, conjunctions, and prepositional phrases.

The output we generate is specifically designed to create and control a simulation layer inspired by the video game industry. Simple sentences like “Wolves eat sheep” have a clear enough outcome. But more complex sentences like “I want delivery trucks to pick up red boxes and deliver them to all the houses” are more challenging. Our inference generates output that combines grammar like nouns and verbs with concepts like behaviors and goals.
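
To make that concrete, here is one hypothetical shape the structured output could take for the delivery-truck sentence; the schema and field names are invented for illustration, not our actual format.

```python
# Invented schema showing grammar (nouns, verbs, modifiers) combined with
# simulation concepts (goals, quantifiers).
intent = {
    "actions": [
        {
            "verb": "pick_up",
            "agent": {"noun": "delivery_truck", "plural": True},
            "patient": {"noun": "box", "modifiers": ["red"]},
        },
        {
            "verb": "deliver",
            "agent": {"noun": "delivery_truck", "plural": True},
            "patient": {"noun": "box", "modifiers": ["red"]},
            "destination": {"noun": "house", "quantifier": "all"},
        },
    ],
    "goal": "ongoing",   # keep doing this until told to stop
}
```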

Equipped with a structural understanding of the user’s intent, we’re able to proceed with representing the results as objects and behaviors in the world.

Phase 2: Representing results

We try to acknowledge your intent by making something happen right away. After you utter a sentence, we can infer intent and invoke a change in the simulation in under a second on average. It’s important for you to be able to understand quickly whether your intent was fully realized. Our three main channels for giving feedback are audio cues, visual cues, and text-to-speech synthesis (i.e. an artificial voice like Alexa or Siri).

For actions like adding more agents, the feedback can be self-evident: new agents simply appear with corresponding audiovisual effects to draw attention to their existence. Feedback for actions involving behaviors is more difficult. When you change behaviors, we make sure that some agents in the world react to what you said. For example, “Wolves eat sheep” will immediately cause some number of wolves to chase and eat some sheep.

By using hypernyms (“furniture” is a hypernym of “chair”), you can apply behaviors across entire categories of objects. “Clouds rain furniture” will cause a variety of furniture to start falling from the sky. Perhaps more usefully, “birds fly” will give pigeons, hawks, seagulls, and any other birds the ability to fly around.
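
Mechanically, this is similar to walking a hypernym chain in a lexical database. Here is a small sketch using WordNet via NLTK; our own ontology and lookup may differ.

```python
# Hypernym check with WordNet (requires: import nltk; nltk.download("wordnet")).
from nltk.corpus import wordnet as wn

def is_a(word, category):
    """True if any noun sense of `word` has `category` among its hypernyms."""
    category_synsets = set(wn.synsets(category, pos=wn.NOUN))
    for synset in wn.synsets(word, pos=wn.NOUN):
        ancestors = set(synset.closure(lambda s: s.hypernyms()))  # walk the chain up
        if ancestors & category_synsets:
            return True
    return False

print(is_a("chair", "furniture"))  # True, so "clouds rain furniture" includes chairs
print(is_a("pigeon", "bird"))      # True, so "birds fly" includes pigeons
```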

To constrain our efforts in these early days, we deliberately sacrifice realism for extreme flexibility. Since our animations are procedural, we can allow any object to behave in any way. For example, you can tell your coffee machine to make donuts that run around and chase your dog if you really want. Or camels can deliver pizza.

Our simulation subsystem is designed with these requirements in mind:

  1. Modular. Sentences like “wolves eat sheep” use multiple components to create the desired behavior, e.g. Movement, Predator, Prey, Hunger, Survival. These can be applied to any noun; for example, you can make trees eat houses (see the sketch after this list).
  2. Scalable. Modularity should promote scalability (more depth and breadth) by making it easy to add new objects and recombine behavioral components. We use data-driven techniques where possible.
  3. Emergent. Multiple simple components and behaviors applied to groups of agents should allow for more complex behaviors and relationships to emerge.
  4. Comprehensible. It should be possible for users to understand the layers of behaviors that have been added, both for their own creations and someone else’s.
  5. Reversible. All behaviors can be stopped just as easily as they’re started. “Donuts, stop chasing my dog!” should do just that.
  6. Compatible. Behaviors must be designed for and driven by NLP output.
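
As a minimal sketch of the modularity requirement, here is how a sentence like “wolves eat sheep” might attach reusable components to any noun; the class and field names are illustrative, not our actual component model.

```python
# Hypothetical entity-component setup: behaviors are components attached to agents.
from dataclasses import dataclass, field

@dataclass
class Agent:
    noun: str
    components: dict = field(default_factory=dict)

    def add(self, component):
        self.components[type(component).__name__] = component

@dataclass
class Predator:
    prey_noun: str       # what this agent hunts

@dataclass
class Prey:
    predator_noun: str   # what this agent flees from

def apply_sentence(agents, subject, verb, obj):
    """Apply 'subject verb obj' (e.g. wolf eat sheep) as component changes."""
    if verb == "eat":
        for agent in agents:
            if agent.noun == subject:
                agent.add(Predator(prey_noun=obj))
            elif agent.noun == obj:
                agent.add(Prey(predator_noun=subject))

world = [Agent("wolf"), Agent("sheep"), Agent("tree")]
apply_sentence(world, "wolf", "eat", "sheep")  # the same machinery handles "trees eat houses"
```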

The simulation allows behaviors to be layered. I might say, “the person throws the bone to the dog.” Then you say, “the dog brings bones to the person,” and “cats sometimes chase dogs.” We’ll get a living vignette: a person plays fetch with a dog, and once in a while the cat will come over and chase the dog away. But the dog will eventually come back and play fetch again—at least until the cat returns. This will keep happening nondeterministically.

When an agent has multiple behaviors, AI planning determines how and when it will act. We use established planning techniques from the video game industry; there have been decades of research in this area, and video games use practical implementations that work well enough. You can adjust the probability of certain behaviors by using words like “sometimes” or “often.” Behaviors can be reactive, for example when an animal is being hunted, it can be configured to stop whatever it’s doing and run away when it sees a predator. Behaviors can also be chained, so a sentence like “build a campfire” can lead to multiple actions happening in a sequence. In this case, based on a preconfigured recipe, a character might go to some wood, pick it up, bring it to an open area, and light a fire.
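
As a hedged sketch of the probability and reactivity pieces, the snippet below weights behavior selection by adverbs like “sometimes” and “often” and lets a reactive behavior pre-empt everything else; the weights and names are invented for illustration, not our planner.

```python
# Invented adverb weights and a toy behavior picker; not Moatboat's actual planner.
import random

ADVERB_WEIGHTS = {"rarely": 0.1, "sometimes": 0.3, "often": 0.7, "always": 1.0}

class Behavior:
    def __init__(self, name, weight=0.5, reactive=False):
        self.name, self.weight, self.reactive = name, weight, reactive

def choose_action(behaviors, percepts):
    # Reactive behaviors (e.g. fleeing a predator) pre-empt everything else.
    for b in behaviors:
        if b.reactive and b.name in percepts:
            return b.name
    # Otherwise pick among behaviors in proportion to their adverb weights.
    names = [b.name for b in behaviors if not b.reactive]
    weights = [b.weight for b in behaviors if not b.reactive]
    return random.choices(names, weights=weights, k=1)[0]

dog = [
    Behavior("run_from_cat", reactive=True),
    Behavior("fetch_bone", weight=ADVERB_WEIGHTS["often"]),
    Behavior("wander", weight=ADVERB_WEIGHTS["sometimes"]),
]
print(choose_action(dog, percepts={"run_from_cat"}))  # the cat shows up: flee first
print(choose_action(dog, percepts=set()))             # otherwise mostly fetches
```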

Designing and implementing our simulation subsystem leads to some fascinating team discussions. Do we support inter-species breeding? Should eggs hatch into chickens by default, or should the creator have to add that behavior explicitly? When you say, “Donuts chase the dog,” should they do it forever, or only for a while? If you make people drive trucks, sell lemonade, and dance, which should they do first? If you make rabbits breed, then should the system also make them die of old age to avoid a population explosion? And what happens if clouds rain clouds?

Many of these issues will be familiar to programmers, like choosing good defaults, classes vs. instances, always vs. sometimes, procedural vs. declarative, order of operations, ambiguity, and recursion. In these ways and others, some of our challenges overlap with programming.

A challenge more particular to our domain is that voice is transient. Spoken words are lost once they’re transformed into a simulation, which can make it hard to understand what’s happening after a while. We’ve experimented with a variety of ways to store and show text and visual representations of the sentences you’ve used and the defaults you might configure. So far this has always felt like a departure from the medium’s strengths, as seen in this failed prototype:

A failed prototype for configuring defaults.

We call this issue the white box problem.

White box simulations

Sim City is an amazing video game, but critics point out that as an urban simulator, it’s an opinionated black box. For simulations to work well as tools for communication, they should be white boxes rather than black boxes.

A white box simulation lets users dig into how it works so they can discover assumptions and change the rules at a deeper level. Done well, this caters beautifully to curiosity and exploration. Yet, the deeper users dig, the further they’ll stray from that feeling of shaping reality. They’ll spend less time changing the world, and more time telling the computer what to do.

Nonetheless, we aspire for Moatboat to be a white box. Though we don’t yet achieve that goal, we also don’t want the main experience to feel like programming. We’re actively exploring how to provide more transparency and modifiability in a way that’s aligned with the strengths of immersive mediums, for example by letting you ask the world and its objects questions.

To help us approach this white box problem in a general-purpose way, we’re also committed to supporting a certain minimum breadth of content.

Breadth of content

The first version of Moatboat was modest, with around 200 objects and behaviors represented by deliberately simple audiovisuals. We’ll reach 1,000 with our next version in early 2018, and in 2019 we’ll exceed 5,000 objects and behaviors.

We chose 5,000 as our target based on the following comparisons:

To help focus our development efforts and frame users’ expectations, we group objects and behaviors by themes. For example, a farm theme helps us focus on animals, machinery, and systems you might want to have on a farm. To ensure generality, everything should work cross-theme as well. If you want monkeys to milk goats on your deserted island, we need to support that too.

Learning to use it

“Whatever you say, happens” taken literally is a lofty goal that needs a large community effort over many years. In the meantime it’s more like, “Whatever you say, sorta happens (usually) as long as we support it.”

Until we can support a much wider variety of things you can say, we need to help you learn what works and what doesn’t. On top of that, we also need to help you overcome any blank canvas anxiety you may feel. We use typical videogame-style onboarding to teach the basics. Beyond that, we’re still experimenting.

One experiment used text-to-speech to acknowledge an unsupported request and offer an alternative: “Sorry, we don’t have any platypuses yet, but you could try adding a helicopter.” (It wasn’t very smart.) We’re also trying an idea button. When pressed, a voice suggests a possible sentence for you to say: “You could try making people pick up axes.” (This one is a little smarter.) Our latest experiment offers points of interest that guide you through a variety of vignettes you can create within some themed worlds.

Our forthcoming experiment, multiuser support, allows people to experience each other’s sentences and results inside shared worlds. You can learn what works and what doesn’t by simply being around others. We’re excited because this scales well as the vocabulary evolves, and it’s similar to how we all learn to communicate using language in the first place.

Future work

There’s a tall fence between most people and the ability to create dynamic experiences. Having worked on visual programming systems in the past, we appreciate the decades of effort that have gone into helping people try to climb that fence from the side of programming. With Moatboat, we’re approaching it from the opposite side: how can people tap the power of computational media using an interface that feels less like programming, and more like storytelling around a campfire? There’s a lot of work to do:

  • A larger vocabulary and greater sentence complexity.
  • Supporting the long tail of words past 7,000 or so requires more flexibility. We’re exploring ways to foster a community by supporting user-generated objects and behaviors.
  • Ways to create more meaningfully deep experiences, including better ways to design for evolution and chaos.
  • Support for more VR/AR platforms.
  • Support for languages other than English.
  • More queryability, so you can just ask the system questions and have it answer you.
  • Ways to solve the white box problem with more transparency and modifiability, for example tangible 3D abstractions you can hold and manipulate with your hands.

Bringing user-generated content into Moatboat via Google Blocks.

Conclusion

The name Moatboat has some history. My co-founder and I actually picked it out of a hat!

We added it to the hat because a moat is a problem, and a boat is a creative solution. We know that Moatboat currently looks and feels quite playful, and this is deliberate. But our ambitions are bigger. “Creativity is about solving problems.”

Some of our very first Moatboat creations were quite serious, covering topics like wealth inequality, water shortage, and the spread of malaria. We may return to more purposeful topics like these once we better understand experiential communication in more casual settings.

An early Moatboat world explored the effectiveness of malaria interventions.

We’re in the age of being able to shape reality as a computational medium. Though it’s not clear exactly how this medium will evolve towards mainstream use, Moatboat is our contribution to that end.

Technology changes quickly, but our needs do not — we need to entertain, communicate, experiment, convince each other, learn, understand, explore, make decisions, and solve problems. The ability to instantly shape reality together offers a powerful tool for meeting these same needs in more experiential ways.

Find me on Twitter for more thoughts and updates.

Thanks to Katrika Morris, Tom Witkin, Shiv Kumar, and the SPLASH Live 2017 reviewers for contributing to this post.
