Digitize Reality with Sensor Stream Pipe

Moetsi

TL;DR: We think the best mixed reality experiences will be powered by live data from multiple sensors. That’s why we built the Sensor Stream Pipe: so sensors can communicate with processing hardware at low bandwidth and latency.

Making a digital copy of the physical world is interesting enough that organizations are putting a lot of effort into ambitious world-scale projects (Magicverse, Live Maps, Open AR Cloud) but new enough that there isn’t a common name for the technology (AR Cloud, Digital Twins, Mirrorworld). A lot of smart people think this digital reality infrastructure will power the next wave of innovation.

The AR Cloud will be the single most important software infrastructure in computing, far more valuable than Facebook’s social graph or Google’s pagerank index — Augmented World Expo Founder and CEO Ori Inbar.

A big idea, like the smartphone — Tim Cook

Augmented reality and virtual reality technology will have a far bigger impact than smartphones ever did — Alex Kipman

There aren’t many details yet on how future Digital Twin/AR Cloud/Mirrorworld hardware and software systems will operate, or how “killer apps” will make use of this future infrastructure. But looking at how mixed reality hardware and software development is progressing, examining current mixed reality system architectures, and studying how emerging technologies have reshaped similar architectures can give us insight into how mixed reality systems might evolve and power experiences that actually deliver on the hype.

Hardware manufacturers have been updating their devices with better/new sensors and improving SDKs with APIs that do more. Apple has released 3 versions of ARKit and is rumored to be adding a depth sensor to their devices in 2020 (glasses apparently in 2023). Google has been updating ARCore (formerly Project Tango) with new functionality for developers. Microsoft has released 2 versions of the HoloLens, released an updated Kinect (and SDK), and updated their Mixed Reality Toolkit. New entrants like Magic Leap and Nreal are also creating wearables. There are also improvements in commodity spatial sensors from Intel and Structure. Chip manufacturers like Qualcomm are preparing to power a new generation of mixed reality devices.

Mixed reality services and SDKs help developers create great multi-platform applications. OpenXR and Unity’s AR Foundation abstract device SDKs for developers. 6d.ai provides developers with a real-time digitized environment for any device with a camera. Google provides the Cloud Anchor service, and Microsoft the Spatial Anchor service, to solve finding the relative position of devices across platforms.

A commonality across current architectures is that device sensor data is processed on-device and results are periodically uploaded or downloaded; the data shared between devices is a snapshot of the past. Apple’s, Google’s, Microsoft’s, and 6d.ai’s solutions for finding the relative position between devices all require:

  1. An initial device scanning a small area to collect raw sensor data
  2. Processing that sensor data to create a small “map”
  3. Uploading the map to either the cloud or another device
  4. The second device downloading the “map”
  5. The second device scanning the same area to collect raw sensor data
  6. The second device processing the collected data and the “map” to find its position relative to the “map”

The hard “thinking” is done on the device in this architecture. The first device must take in all its sensor data (raw frame data from its sensors/IMU) and process it down to a lightweight map: a huge amount of data in, a small fingerprint out. The second device must likewise take in all its sensor data, plus the downloaded map, and process it to understand where it is relative to the area fingerprint. In this entire workflow there is only one small upload (the initial device submitting the map) and one small download (the second device pulling in the map). The current state of mixed reality architectures focuses on local processing with only light, sporadic networking.
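The scale of this asymmetry is easiest to see with numbers. A minimal sketch, where every figure is an illustrative round number rather than a measurement of any vendor’s SDK:

```python
# Illustrative sketch of the data asymmetry in map-based relocalization.
# All sizes are hypothetical round numbers, not measurements of any SDK.

BYTES_PER_FRAME = 1280 * 720 * 3   # one raw RGB camera frame
FPS = 30
SCAN_SECONDS = 10                  # device scans a small area

raw_scan_bytes = BYTES_PER_FRAME * FPS * SCAN_SECONDS
map_bytes = 200 * 1024             # the resulting lightweight "map" fingerprint

print(f"raw sensor data processed on-device: {raw_scan_bytes / 1e6:.0f} MB")
print(f"map uploaded/downloaded:             {map_bytes / 1e6:.1f} MB")
print(f"data reduction done by the device:   {raw_scan_bytes // map_bytes}x")
```

Hundreds of megabytes of raw frames are crunched on-device so that only a fingerprint a few thousand times smaller ever touches the network.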

Emerging technologies like 5G and edge computing can open up new mixed reality infrastructure architectures. With 5G, devices can stream at higher bandwidth and lower latency than is currently available and edge computing can process this data and return results faster than current cloud solutions. A relevant case study of how these technologies can alter system architectures that previously relied on local processing and light networking is multiplayer video games.

Video games need to keep sometimes thousands of players synchronized in dynamic worlds. Players interact, add their own data to the shared world, and expect to always have the latest state on their device. Until recently, video game experiences were powered by local processing: when users played a game on their PC or console, their device ran all the calculations needed to render the game on screen. Only small state updates are exchanged with the authoritative game server, which decides what actually happened when there is conflicting data (the communication is more complex than can be summed up in two sentences, so check out this write-up if interested). This approach is similar to current mixed reality architectures: the local device does most of the processing, with lightweight networking.

Edge computing has made it possible for games to no longer depend on local processing power. Stadia is a gaming solution that trades demanding device processing requirements for demanding device networking requirements. Instead of the edge server streaming only state updates (small data) to the client for further processing, it sends the entire rendering of the game (big data). With this approach the client device does not need to process a state update and then render the game (big compute); it only needs to display the video sent back from the edge server (small compute). We think the same approach can be applied to mixed reality architectures (and actually provide clear improvements over current infrastructure): rather than processing sensor data locally with light networking, we can process sensor data remotely with heavy networking.
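The difference in network load between the two approaches can be sketched with back-of-the-envelope numbers; both figures below are illustrative assumptions, not measured values from any game or from Stadia:

```python
# Rough comparison of the two architectures' network loads (illustrative numbers).
# State-sync multiplayer: small per-tick deltas, heavy local compute.
# Stadia-style streaming: a full rendered video stream, light local compute.

STATE_UPDATE_BYTES = 512            # hypothetical per-tick state delta
TICK_RATE = 60                      # server updates per second
VIDEO_BITRATE_BPS = 15_000_000      # plausible ~15 Mbps 1080p60 video stream

state_bps = STATE_UPDATE_BYTES * 8 * TICK_RATE   # bits per second

print(f"state-sync networking:   {state_bps / 1e6:.2f} Mbps + heavy local compute")
print(f"video-stream networking: {VIDEO_BITRATE_BPS / 1e6:.0f} Mbps + light local compute")
```

The orders-of-magnitude gap in bandwidth is exactly the trade being made: networking absorbs the load that the local device used to carry.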

In this type of architecture mixed reality devices can stream their sensor data to an authoritative edge server for fusion. The server would do the computationally intensive hard “thinking” of creating a real-time digitized reality. The benefits of this type of architecture can be seen in the (only?) “killer app” making use of digital reality: self-driving. Tesla’s cars process input from eight cameras, twelve ultrasonic sensors, a front-facing radar, GPS, and mapping data to make a digital representation of their surroundings. While people likely don’t need this infrastructure and AR glasses to tell them there is a pole in front of them the way a car does, this infrastructure could power a life-saving alert that an out-of-sight car ran a red light and is headed toward you. Users would gain superior situational awareness because they would have the sight and understanding of every sensor connected to the infrastructure.

Tesla’s sensors are hardwired to their processing hardware which allows for huge bandwidth data delivery with low latency, but XR wearables will not have the luxury of being hardwired to an edge server and will need to use networking like WiFi or 5G, which is why…

we have developed and open sourced the Sensor Stream Pipe (SSP) to compress and stream sensor data to a remote server at low latency and low bandwidth. SSP is designed as a modular kit-of-parts.

  • The on-device SSP Server takes in raw frame data and compresses it to reduce bandwidth
  • The SSP Client can receive multiple streams and decompress the data at low latency for further processing
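SSP itself is written in C++, but the shape of the server/client split can be sketched in a few lines of Python. The function names and the zlib codec below are hypothetical stand-ins for SSP’s actual interfaces and encoders:

```python
import zlib

# Minimal illustration of the SSP split (hypothetical names, not SSP's real C++ API):
# the on-device "server" compresses a raw frame; the remote "client" decompresses it.

WIDTH, HEIGHT = 640, 576    # Azure Kinect NFOV depth resolution
BYTES_PER_PIXEL = 2         # 16-bit depth values

def ssp_server_encode(raw_frame: bytes) -> bytes:
    """On-device side: compress the raw frame before it hits the network."""
    return zlib.compress(raw_frame, level=1)   # fast setting to keep latency low

def ssp_client_decode(payload: bytes) -> bytes:
    """Remote side: decompress back to raw bytes for further processing."""
    return zlib.decompress(payload)

# A flat synthetic depth frame standing in for real sensor data.
raw = bytes(WIDTH * HEIGHT * BYTES_PER_PIXEL)
payload = ssp_server_encode(raw)   # this is what would cross the network
assert ssp_client_decode(payload) == raw
print(f"raw: {len(raw)} bytes, on the wire: {len(payload)} bytes")
```

In the real pipe the encoder is a proper video/depth codec and the transport is a network socket, but the division of labor is the same: compress at the sensor, decompress where the compute lives.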

We hope to evolve and extend SSP with the goal of it becoming the ffmpeg/libav of general sensor data (interfaces for all platforms and sensor types, supporting all compressions/encodings). Streaming data is not “free”: compression and networking increase latency and battery consumption, so it is not a silver bullet. Depending on their application, users can choose between compressing data for lower bandwidth or sending raw SSP frames for reduced latency.
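To make that tradeoff concrete, here is a rough estimate of what streaming Azure Kinect depth frames costs; the 10x compression ratio is an assumption chosen for illustration, not a measured SSP figure:

```python
# Back-of-the-envelope bandwidth for streaming Azure Kinect depth (illustrative).
WIDTH, HEIGHT = 640, 576    # NFOV unbinned depth mode
BYTES_PER_PIXEL = 2         # 16-bit depth values
FPS = 30

raw_mbps = WIDTH * HEIGHT * BYTES_PER_PIXEL * FPS * 8 / 1e6
compressed_mbps = raw_mbps / 10   # assumed ~10x codec ratio, for illustration only

print(f"raw frames:        {raw_mbps:.0f} Mbps (lowest latency, highest bandwidth)")
print(f"compressed frames: {compressed_mbps:.0f} Mbps (extra encode/decode latency)")
```

Raw depth alone approaches 200 Mbps, which is workable over a wire or good WiFi but motivates compression anywhere bandwidth is scarce.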

A future with 5G devices and readily available edge computing is still a ways away, but current experiences can be improved by utilizing edge processing architectures today. We think there are 4 core problems that need to be solved in mixed reality applications:

  1. tracking
  2. relative position
  3. environment digitization
  4. object digitization

Streaming sensor data to an authoritative server provides solutions with unique benefits for each of these problems. We hope that anyone who believes an authoritative edge server architecture is the way forward for mixed reality applications will use SSP.

So why write this whole thing for a GitHub repo? We hope that by laying our cards on the table about how we are approaching reality digitization system design, and the assumptions we are making, we can get feedback on where we are probably wrong and maybe get people rowing in the same direction where we are probably right.


Our Roadmap

1. (2019) Sensor Stream Pipe (SSP)

We have released the Sensor Stream Pipe. It has interfaces for the Azure Kinect as well as for seminal spatial computing datasets (more interfaces coming).

Hopefully the explanation above of how we see things gives a good sense of its purpose and what it does, but if not, the README is pretty solid.

2. (2019) SSP Microsoft Body Tracking SDK interface

B-b-b-b-b-bonus release. We have created a Sensor Stream Client interface that works with algorithms for the Azure Kinect. This way, when you stream your Azure Kinect data using the Sensor Stream Pipe to hardware for further processing, you can actually use the Microsoft Body Tracking SDK.

It’s not really interesting to stream data if you can’t do anything with it, which brings us to…

3. (2020) Digitizing Reality with Azure Kinect Unity Sample Project

The Unity sample project will be a multiplayer game where users will be able to join and navigate a shared world, similar to Minecraft.

The nifty part is that you will be able to connect an Azure Kinect through SSP to the game and the results of the Microsoft Body Tracking algorithm will be input into a game object that can be seen by all players.

Remote players will be able to interact with a live humanoid game object captured by the Azure Kinect.

This will be only an example of solving the problem of object digitization, but it’s a start…

4. (2020) Multi-sensor Support Unity Sample Project

This will be an extension of the previous sample project but updated to handle multiple sensor streams.

An Azure Kinect and another AR device (we’re thinking an RGB-D iPad or HoloLens 2) will be able to join the game and be set in the correct relative position to each other.

This would allow the user with the AR device to always see the digitized people tracked by the Kinect, even if they are obstructed by a wall!

5. (2020) Relative Position as a Service

Stream sensor data with the SSP to our servers and we will send back the relative position of the streams. Now you can have superior multi-user experiences.

Read our post on the benefits of edge processing architecture.

6. (2020) Tracking as a Service

Stream sensor data with the SSP to our servers and we will send optimized tracking data. Now you can have more accurate mixed reality experiences in new environments.

Read our post on the benefits of edge processing architecture.

7. (eventually 🙄) Digitizing Reality as a Service

Stream us all the sensor data you got, and we will provide you with a digital representation of everything it covers.


Please reach out to let us know what you think of our approach or if you want to help out.