by Fabio Toste, Senior Creative Developer at Jam3.
Fabio has worked as a developer/software engineer for 18 years, on projects ranging from WebGL to AR/VR, installations, apps, interactive experiences, and games.
Jam3 Labs
Part of our bread & butter at Jam3 is to generate innovative ideas, separately from client work, that can sometimes be technically challenging and require intensive research. Jam3 Labs is where we take all those great ideas and make them happen in quick R&D sprints.
In this article we’re going to talk about AR, so let’s start by defining what it is.
What is AR?
AR (augmented reality) is a technology that merges 3D objects into the real world. It can go beyond that and enhance any real experience with multi-sensory stimuli, but for this article we’re going to focus on the visual side. Today you can find a huge number of AR apps in any app store, from utility tools to simple experiences and games.
At Jam3, we’ve become experts in this area, from recent games and experiences like Baby Laser Tag for Amazon Prime Video’s The Boys and cinematic moments from The Mandalorian, to dozens of AR masks.
What is Pixel or Render Streaming?
Pixel Streaming takes advantage of streaming technologies such as WebRTC to stream content from a computer to a TV, mobile phone, or even another computer, enabling high-quality content to be delivered from a powerful machine to less powerful devices.
Although we’ve had devices doing this for a while now (Valve’s Steam Link is a good example), the new frontier is streaming live over the internet as connection speeds increase. I did some tests with this technology back in 2019, when it was available in beta in Unreal Engine 4.21, so it’s been around for a little while.
Cool! What is the Problem Though?
When I started playing with pixel streaming in Unreal I saw some potential.
We could create some cool experiences on the web and even overcome limitations in terms of graphics: even with recent advances, WebGL is not close to the visuals you can get from a native graphics application. But this tech had limitations of its own. You basically needed one computer per experience, which increases the cost of production and distribution; that can be interesting for installations or shared applications, but as a real replacement for web rendering it is still very far from reality.
After 2019, some uses of this tech started popping up. Good examples are Google Stadia and GeForce Now, both game-streaming platforms that render high-quality content remotely so you can play on any device you want.
After that, another piece of promising tech arrived: 5G, with higher mobile speeds and faster downloads, and the promise of reliable streaming over a mobile connection. Back then I realized that combining pixel streaming and AR could bring some advantages.
Why use streaming to deliver AR?
For WebAR! Today, AR through the web is very, very limited. You can use APIs/frameworks like 8thwall, which requires finding the right balance between quality, performance, and AR tracking. It works okay and you can make some cool stuff, but it’s not even close to the level you can reach with native AR on iOS or Android.
Even native applications have limitations, in both graphics and features. So, rendering the AR content on a server and then sending it to the client application would enhance graphic quality, reduce battery consumption, and could add new life and possibilities to WebAR.
With all of that in mind, I started a 2-week sprint at Jam3 Labs and dedicated some time to do some experimental tests trying to get a server to render AR.
Initial tests
As soon as I got the idea of using streaming to render high-fidelity graphics for AR, I started playing with it. The first test was making pixel streaming work, which was very easy using the Unreal Pixel Streaming plugin. The plugin ships with a signaling server, so after installing everything and making a small configuration change, it all worked as it should.
Unreal uses WebRTC to stream the content from the engine to a web server. It uses only WebSockets to communicate with the signaling server, unlike the Unity version, which supports both HTTP and WebSockets.
What is WebRTC?
WebRTC, or Web Real-Time Communication, is, as the name says, a web API that provides a fast and reliable interface between computers to deliver data, audio, and video in real time. It’s mostly used for video/audio conferencing and for streaming content over the web.
How does WebRTC work?
WebRTC is a peer-to-peer interface, meaning that communication between computers happens directly from one to the other, but it also uses a server to signal the interaction. Basically, you only need the server for the first handshake (and if the negotiation changes in the middle of the session); all the data exchange is done directly from one device to another.
It can also use STUN and relay (TURN) servers in more sophisticated setups if needed, but as WebRTC is not the main focus here, let’s keep it as simple as possible.
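To make the handshake a little more concrete, here’s a minimal TypeScript sketch of the browser side of a WebRTC negotiation over a WebSocket signaling channel. The signaling URL and message shapes are placeholders made up for illustration; Unreal’s Pixel Streaming plugin ships its own signaling server and protocol.

```ts
// Minimal browser-side WebRTC handshake over a WebSocket signaling channel.
// The endpoint and message format below are illustrative only.
const signaling = new WebSocket('wss://signaling.example.com'); // hypothetical signaling server
const peer = new RTCPeerConnection({
  iceServers: [{ urls: 'stun:stun.l.google.com:19302' }],
});

// ICE candidates also travel through the signaling server during the handshake.
peer.onicecandidate = (event) => {
  if (event.candidate) {
    signaling.send(JSON.stringify({ type: 'iceCandidate', candidate: event.candidate }));
  }
};

// Once the remote video track arrives, the media flows peer-to-peer.
peer.ontrack = (event) => {
  const video = document.querySelector('video');
  if (video) video.srcObject = event.streams[0];
};

signaling.onmessage = async (event) => {
  const msg = JSON.parse(event.data);
  if (msg.type === 'offer') {
    // The streamer describes its video/data channels in the offer; we answer.
    await peer.setRemoteDescription(msg);
    const answer = await peer.createAnswer();
    await peer.setLocalDescription(answer);
    signaling.send(JSON.stringify(answer));
  } else if (msg.type === 'iceCandidate') {
    await peer.addIceCandidate(msg.candidate);
  }
};
```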
Below, two examples of a WebRTC communication:
The most important thing about WebRTC is that it’s capable of exchanging data, video, and audio content in a very fast and reliable way. Since it’s a direct connection, the latency is low, or at least low enough to enable platforms like Stadia to stream 4K games running at 60fps.
It’s also true that they do a great job of optimizing the rendering, encoding, and decoding process, as you can see in this video, but it proves that it’s possible to deliver content very quickly.
Now that we understand WebRTC better and the first pixel-streaming test is done, it’s time to make it AR.
To make it happen, I used 8thwall as the web AR API and sent commands (camera position and rotation) to the server application running Unreal over the WebRTC data channel. Through the video stream, I could then receive the rendered 3D in the client application, with the camera controlled by the client.
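As a rough sketch of the client side of that idea, the TypeScript below pushes the tracked camera pose over a WebRTC data channel once per frame. The channel name and message format are invented for the example, and the real Pixel Streaming plugin exposes its own input/data protocol, so treat this as the general shape rather than the actual implementation.

```ts
// Assumes `peer` is the RTCPeerConnection already negotiated by the streaming client.
declare const peer: RTCPeerConnection;

// Dedicated channel for camera updates (name and payload are illustrative).
const poseChannel = peer.createDataChannel('cameraTransform', { ordered: true });

interface CameraPose {
  position: { x: number; y: number; z: number };
  rotation: { x: number; y: number; z: number; w: number }; // quaternion
}

// Called once per tracked frame with the AR camera pose coming from 8thwall.
function sendCameraPose(pose: CameraPose): void {
  if (poseChannel.readyState !== 'open') return;
  // The server application parses this JSON and applies it to the engine camera.
  poseChannel.send(JSON.stringify({ type: 'camera', ...pose }));
}
```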
Here’s how it works:
Receiving the Camera Transforms in Unreal:
First Problem: Transparency (Alpha Channel)
With the communication done and the data driving the Unreal camera, it was time to combine the background camera feed in the client application with the render coming from Unreal. To make that possible we needed a video with an alpha channel, so we could blend the two with transparency.
Everyone who works with 3D rendering knows that transparency is always a problem: blending transparent and opaque geometry can be a real pain, between setting the render order, the material render order, the order of the buffers, and so on. It can be even harder depending on whether you use a forward or deferred render path. So transparency in 3D is not easy to handle on its own, and when it comes to streaming it adds new layers of problems.
The first problem comes from WebRTC itself, as it doesn’t support an alpha channel in video streaming yet. So even if you render the scene with a proper alpha channel, it won’t be transmitted to the client.
To try to overcome these limitations and create a proof of concept, we came up with an old-fashioned workaround: chroma key. We set the background to pure green and streamed the scene over it. Due to video compression the “pure” green gets somewhat lost, but it should be okay for testing.
On the client side, this brought another problem: the Unreal WebRTC example uses an HTML video element to render the streamed scene, so if you render it directly, it will cover the camera feed with the scene and its green background.
To fix that, we also built a WebGL video renderer with a special shader that removes the green background and renders the final stream with transparency, revealing the camera feed behind it.
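The sketch below shows the general idea in TypeScript: draw the streamed video element onto a transparent WebGL canvas and turn every pixel close to the key color into a transparent one. The threshold values and setup details are illustrative rather than the exact shader we ended up with.

```ts
// Hedged sketch: chroma-key compositor for the streamed <video> element.
// Assumes the stream renders on a (roughly) pure green background and that
// `streamVideo` is the <video> element fed by the WebRTC track.

const vertexSrc = `
attribute vec2 position;
varying vec2 vUv;
void main() {
  vUv = position * 0.5 + 0.5;
  gl_Position = vec4(position, 0.0, 1.0);
}`;

const fragmentSrc = `
precision mediump float;
uniform sampler2D uStream;
uniform vec3 uKeyColor;     // chroma key colour, e.g. (0, 1, 0)
uniform float uThreshold;   // how close to the key colour counts as background
uniform float uSmoothing;   // soft edge around the threshold
varying vec2 vUv;
void main() {
  vec4 color = texture2D(uStream, vUv);
  float dist = distance(color.rgb, uKeyColor);
  // 0 where the pixel matches the key colour, 1 elsewhere
  float alpha = smoothstep(uThreshold, uThreshold + uSmoothing, dist);
  gl_FragColor = vec4(color.rgb, alpha);
}`;

function createProgram(gl: WebGLRenderingContext, vs: string, fs: string): WebGLProgram {
  const compile = (type: number, src: string) => {
    const shader = gl.createShader(type)!;
    gl.shaderSource(shader, src);
    gl.compileShader(shader);
    return shader;
  };
  const program = gl.createProgram()!;
  gl.attachShader(program, compile(gl.VERTEX_SHADER, vs));
  gl.attachShader(program, compile(gl.FRAGMENT_SHADER, fs));
  gl.linkProgram(program);
  return program;
}

export function startChromaKey(canvas: HTMLCanvasElement, streamVideo: HTMLVideoElement) {
  // Transparent canvas so the camera feed behind it stays visible.
  const gl = canvas.getContext('webgl', { alpha: true, premultipliedAlpha: false })!;
  const program = createProgram(gl, vertexSrc, fragmentSrc);
  gl.useProgram(program);
  gl.pixelStorei(gl.UNPACK_FLIP_Y_WEBGL, true);

  // Fullscreen quad as a triangle strip.
  const buffer = gl.createBuffer();
  gl.bindBuffer(gl.ARRAY_BUFFER, buffer);
  gl.bufferData(gl.ARRAY_BUFFER, new Float32Array([-1, -1, 1, -1, -1, 1, 1, 1]), gl.STATIC_DRAW);
  const position = gl.getAttribLocation(program, 'position');
  gl.enableVertexAttribArray(position);
  gl.vertexAttribPointer(position, 2, gl.FLOAT, false, 0, 0);

  // Texture that receives the streamed video each frame.
  const texture = gl.createTexture();
  gl.bindTexture(gl.TEXTURE_2D, texture);
  gl.texParameteri(gl.TEXTURE_2D, gl.TEXTURE_MIN_FILTER, gl.LINEAR);
  gl.texParameteri(gl.TEXTURE_2D, gl.TEXTURE_WRAP_S, gl.CLAMP_TO_EDGE);
  gl.texParameteri(gl.TEXTURE_2D, gl.TEXTURE_WRAP_T, gl.CLAMP_TO_EDGE);

  gl.uniform3f(gl.getUniformLocation(program, 'uKeyColor'), 0.0, 1.0, 0.0);
  gl.uniform1f(gl.getUniformLocation(program, 'uThreshold'), 0.4);
  gl.uniform1f(gl.getUniformLocation(program, 'uSmoothing'), 0.1);

  const render = () => {
    if (streamVideo.readyState >= 2) {
      gl.texImage2D(gl.TEXTURE_2D, 0, gl.RGBA, gl.RGBA, gl.UNSIGNED_BYTE, streamVideo);
      gl.clear(gl.COLOR_BUFFER_BIT);
      gl.drawArrays(gl.TRIANGLE_STRIP, 0, 4);
    }
    requestAnimationFrame(render);
  };
  render();
}
```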
Looking at the video, you can see the 3D render is visible, and in the background we have the 8thwall app running with a simple cube as a reference.
The first thing to notice is the delay (latency) between the camera feed and the final video render, which may come from the fact that I’m using ngrok, which can add some latency to the communication.
With these first findings, we can say it’s totally possible to make this happen: with encoding and decoding that preserves the alpha channel, plus work on the buffer creation and the render engine to output proper alpha information, graphically it could work.
Working with Latency
After the initial tests, it was time to make it work more smoothly. Another option would be a different approach: what about sending the camera feed along with the camera transforms? This could solve the transparency problem, as we could add the camera video as a background for the 3D, or even better, add it to a buffer before the 3D scene is rendered.
To do this, we first need to configure the client (web application) to get the camera from the user and send it over WebRTC. This is not hard, but it adds more complexity to the system.
To get the user’s camera on the web, we can use the mediaDevices.getUserMedia method in JavaScript and, as soon as we receive the stream, send it over WebRTC. To do that, though, we first need a secure connection, so instead of HTTP we must use HTTPS (now a requirement for camera access).
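A minimal TypeScript sketch of that capture step, assuming `peer` is the RTCPeerConnection already created by the streaming client (adding a track changes the SDP, so the connection normally has to be renegotiated afterwards):

```ts
// Assumes `peer` is the existing RTCPeerConnection from the streaming client.
declare const peer: RTCPeerConnection;

async function sendCameraToServer(): Promise<void> {
  // getUserMedia only works in a secure context (https:// or localhost).
  const stream = await navigator.mediaDevices.getUserMedia({
    video: { facingMode: 'environment', width: { ideal: 1280 }, height: { ideal: 720 } },
    audio: false,
  });

  // Push the rear-camera track to the server so the engine side can
  // (in principle) decode it into a texture.
  for (const track of stream.getVideoTracks()) {
    peer.addTrack(track, stream);
  }
}
```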
That’s the “easy” part, because we also need to make sure WebRTC on the Unreal side is capable of receiving the stream, decoding it, and writing it to some kind of render texture or texture object. Looking into the source code, it seems this is not implemented yet: Unreal can send video through the stream but not receive it, and implementing it from the source code would add too much time and work, as I would need to recompile the engine and the Pixel Streaming plugin.
Unreal has a Media Framework that could help me achieve what I need using the Stream Media Source class, but it seems it doesn’t support WebRTC yet. The only success I had with it was getting video from a URL. One option might be to use a media server to pull the stream from WebRTC and expose it as a video URL, but that would also add more complexity and latency to the system, and it would somewhat break the peer-to-peer negotiation.
With none of those solutions working, we jumped to another platform: Unity. Unity has its own pixel streaming package, called Render Streaming. It works exactly like the Unreal version; it’s just a little easier to modify, as all the scripts are normal C# code.
So I started exploring Unity Render Streaming as I did with the Unreal version. I could easily make it work, as it also ships with a ready-made web server, so receiving and sending information was really easy.
The first step was taking the web app that comes with Unity Render Streaming and changing the video-player.js script to send video instead of receiving it, and, on the Unity side, changing RenderStreaming.cs to receive video instead of sending it. During the process I discovered a few things about the Unity package, such as the SendOffer methods in WebSocketSignaling.cs and HttpSignaling.cs not being implemented yet. That’s okay for this test, because we can use the offer sent by the web client, but it shows that Render Streaming doesn’t yet expect to receive video/audio from the other side.
Luckily, version 2.2.2 at least includes the method to receive the video stream and convert it to a render texture.
With all the tools in place, we could modify it so that the camera from the web client is rendered in Unity as a background before the 3D scene is rendered.
Below is the flow that shows how it works:
With this flow, all the tracking and AR is managed by the web client application using 8thwall as the API. We send the video stream to the Unity application, which combines the background and the 3D scene using a command buffer (legacy pipeline) and sends the result back to the client with the 3D and the video background combined. We use the data channel to send the camera position and rotation from 8thwall.
You can see in the example above that we have some latency, mostly in the 3D scene sync (position and rotation). It happens because the camera data is sent out of sync with the background camera render. In the second video, we have the web app on the left and the Unity view on the right.
First, we are sending the frame over the stream channel and the position/rotation over the data channel (which may add sync problems). The tracking is also processed on the camera on the client side before being sent to Unity (which adds some latency too).
Looking at how the tech feels, I can tell you that camera latency itself is not a big deal. After playing with the camera feed coming from Unity for a while, I must say the latency is not something you notice that much; your brain can’t really process the difference between the video on screen and the real environment outside of it.
As a final proof of concept it worked well, and it’s something that could be close to production-ready as it is right now. One big thing I noticed after playing with 8thwall a lot is that thermal throttling kicks in really fast, so even with the camera sync fixed (for example by syncing the video frame with the camera transforms), keeping the tracking on the client side could still be a problem.
Future Work
For future explorations, a good idea would be to work more on the 3D/real-camera sync and bring the latency between the 3D scene and the video background down to almost zero.
Doing the tracking on the server side could also help with that, so a next step would be to use OpenCV to do SLAM (Simultaneous Localization and Mapping) on the Unity side of the process. With that, we could also sync the frames with the 3D camera movement and only send the result to the client when it’s ready.
Apart from the technical future work, we could also use WebRTC to help with other AR-related problems, like emulating phone behavior in the engine editor. Today, debugging and testing AR apps is really painful, as you need to build every time to make sure everything works properly. Of course, you still need builds to fully test the app, but using data, video, and audio to emulate the device could help a lot during development.
These explorations can help improve WebAR, bringing more versatility and genuinely high-quality visuals to WebAR experiences, which could be huge in terms of future possibilities. Using a desktop computer with plenty of compute power that can be scaled easily means we can start adding features to AR without compromising performance, since everything is calculated on a much more powerful machine.
Conclusion
After all the tests and prototypes, we came to the conclusion that cloud AR is possible but needs a bit more work. There are some technical aspects that are out of the scope of this article, but better encoding/decoding and better use of the available bandwidth can improve communication speed and reliability.
Latency can still be a problem in the end, but if we can reduce it to the point where the human eye doesn’t perceive it, we can make it happen.
Scalability will still be a problem, so in the end the trade-off is between better quality and easier scalability. As devices continue to evolve, we’ll see improvements in AR graphics quality, but I feel it will always be a step behind full desktop graphics, and WebAR will be even further behind. At some point, with scalability solved, rendering in the cloud will be the real thing.
Being able to process everything on a more powerful computer and send the result to a mobile device will help content creators deliver better experiences and have more control.
Lower battery consumption and better graphics quality are a big deal for me, but being able to update the application, replacing or changing assets without going through the app store review process, can also be a big thing to consider.
Some platforms, such as Microsoft HoloLens and some VR headsets, have already taken advantage of cloud rendering, and I believe we’re going to see it soon on other platforms. I personally feel we’re going to have development on both sides in parallel: native AR and cloud-based rendering will live together and open up great opportunities for AR in the future.