UPDATE: Unity 2018.1 introduced a new asynchronous GPU readback API which should make this process significantly easier.
Capturing video or screenshots in-engine is a nice sharing feature for any game or graphical application. This can be useful for bug reports, social sharing, or just for tracking the progress of your development. In Unity it’s easy to capture frames of video directly from your game. Really easy. However, with easy APIs comes great responsibility. For those developing in VR and wanting to deliver a great user experience, maintaining great performance is key.
In this article, I’ll explain how to capture video using only C# and stock Unity APIs, while still maintaining a high refresh rate for a comfortable VR experience.
Our goal was to capture video suitable for sharing in Tilt Brush, in real-time, and maintain the 90Hz refresh rate required for VR.
Our first attempt was to use a third-party plugin, which worked until we hit video encoding issues that couldn’t be resolved. The next logical option would have been to write our own native plugin. That comes with non-trivial technical weight, however, so we first attempted to work around the issues in the C# API before dropping down to a native plugin. Happily, it worked.
To capture a framebuffer in Unity you will need two things: a RenderTexture and a Texture2D. After that, copying pixels is easy.
The naive approach:
// Set up a camera, texture and render texture
Camera cam = ...;
Texture2D tex = ...;
RenderTexture rt = ...;

// Render to RenderTexture
cam.targetTexture = rt;
cam.Render();

// Read pixels to texture
RenderTexture.active = rt;
tex.ReadPixels(rectReadPicture, 0, 0);

// Read texture to array (GetPixels() returns a newly allocated Color[])
Color[] framebuffer = tex.GetPixels();
And done! Do this on every frame and performance is also done! In fact, if you’re building a VR experience, you can’t do this even once.
Here are the underlying reasons this is slow:
- GetPixels() blocks for ReadPixels() to complete
- ReadPixels() blocks while flushing the GPU
- GetPixels() allocates a new array on every call, thrashing the garbage collector
The first issue is actually easy to avoid. We’ll put a one frame delay between ReadPixels() and GetPixels(), then whatever transfer happens will be complete by the time we need to access the values. So far, so good.
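As a minimal sketch of that one-frame delay, assuming a RenderTexture that is already being rendered elsewhere: the component name `DelayedFrameReader`, the `source` field and the `LatestPixels` property are hypothetical names chosen for illustration, not part of the original implementation.

```csharp
using System.Collections;
using UnityEngine;

// Hypothetical component: separates ReadPixels() and GetPixels() by one frame.
public class DelayedFrameReader : MonoBehaviour
{
    // Assumption: assigned in the inspector and rendered to by some camera.
    public RenderTexture source;

    // Most recent frame that was copied out to managed memory.
    public Color[] LatestPixels { get; private set; }

    private Texture2D staging;

    void Start()
    {
        staging = new Texture2D(source.width, source.height, TextureFormat.RGBA32, false);
        StartCoroutine(CaptureLoop());
    }

    IEnumerator CaptureLoop()
    {
        var endOfFrame = new WaitForEndOfFrame();
        var area = new Rect(0, 0, source.width, source.height);

        while (true)
        {
            // Frame N: start the copy from the render texture.
            yield return endOfFrame;
            RenderTexture previous = RenderTexture.active;
            RenderTexture.active = source;
            staging.ReadPixels(area, 0, 0);
            RenderTexture.active = previous;

            // Frame N+1: only now touch the pixel data.
            yield return endOfFrame;
            LatestPixels = staging.GetPixels();
        }
    }
}
```

This only addresses the first bullet. ReadPixels() still flushes the GPU and GetPixels() still allocates a new array on every call.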
Much trickier is the fact that ReadPixels() will trigger a GPU flush. What does that mean anyway?
When commands/draw calls are issued to the GPU, those commands are batched into bulk command buffers in the driver. “Flushing the GPU,” means waiting for all remaining commands in the current command buffer to execute. The CPU and GPU can run in parallel, but during a flush, the CPU sits idle waiting for the GPU to become idle as well, which is why this is also known as a “synchronization point.” So why is this happening?
If you trace Unity’s use of DirectX with a profiling tool like NVIDIA’s Nsight, you’ll find that ReadPixels() is implemented by calling CopySubresourceRegion followed immediately by Map and Unmap. Map is effectively reading the result of CopySubresourceRegion.
As documented, the GPU copy can be pipelined and executed in parallel with the CPU. However if the data is requested before the copy is complete, the only way to return a consistent value is to complete all pending commands, thus forcing a CPU-GPU sync.
You can see this happening clearly in the following Nsight performance graph:
It would seem we are out of luck at this point: the Unity API is forcing a sync, which is going to be slow. What can we do? We could write a native plugin and implement this ourselves. This would likely be the fastest path, but there are still options to explore.
Since we know this forces a sync, perhaps there is some time when the sync is less expensive. What if the GPU were already idle? Then the sync time should be limited to the cost of the transfer, which would be considerably less than waiting on a full or partial frame to render.
We know there is a point at which the GPU is idle, because in our case SteamVR forces a sync as well. This requires some intimate knowledge of your render engine, but tools such as Nsight’s frame debugger or RenderDoc can help explore what’s happening under the hood when it’s a black box.
OnPreRender() seems like a promising place to do the copy. As you can see in the trace below, this approach improves performance slightly, but the CPU still blocks waiting for some work to complete before the transfer starts:
The reason is that this camera isn’t the only camera in the scene, so we aren’t necessarily in an idle state during OnPreRender().
Ok, we know SteamVR forces a sync and we have the source, what if we hack their render loop? The render loop is just a coroutine into which we can insert a callback to our own code to copy the pixels.
Unfortunately, I was still seeing a 2ms sync, as shown in the previous screenshot.
At this point I carefully broke down a frame to see exactly what was happening. Sure enough, in the trace there was an early depth pass, shadow passes, etc. How do we ensure no work has been done on the GPU?
The real problem was the additional camera. The video capture camera was rendering outside of the SteamVR render loop, which is bad because the render loop implements the running start algorithm. So the additional camera was both messing up the running start and ensuring we had no idle moment on the GPU.
In the end, we moved both the additional render and the pixel copy into SteamVR’s render loop. In the following screenshot, the sync time has been reduced to the transfer alone:
Here is the final sequence of events:
- Render frame
- Blit to render texture as a post-effect
- End frame
- In the SteamVR render loop, copy render texture to Texture2D
- In the SteamVR render loop, render the secondary camera
- Wait one frame
- In the SteamVR render loop, copy the texture bits out to C#
Notice that we need three frames to implement this technique (limiting capture to 30FPS for a 90Hz display), however if your application is not memory constrained, these steps can be pipelined as well.
Now we’ve reduced a sync that scales with the amount of pending work on the GPU down to a sync that scales with the number of captured pixels and the speed of the PCIe bus.
At this point, our implementation was running with 0.5ms overhead (5% of frame budget), which was acceptable. There was the additional CPU overhead of copying the pixels back into C#, which in total was about 3ms. This sounds bad (30% of our frame budget), however we have scheduled that cost to run when the CPU may already be idle due to the running start.
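For reference, the budget arithmetic can be written out as a small helper. This is only a sketch: the `FrameBudget` class and its method names are made up for illustration. Note that at exactly 90Hz the budget is about 11.1ms per frame, so the percentages quoted in this article are rounded against a budget of roughly 10ms.

```csharp
using System;

// Hypothetical helper for reasoning about per-frame costs.
public static class FrameBudget
{
    // Milliseconds available per frame at the given refresh rate.
    public static double BudgetMs(double refreshHz)
    {
        return 1000.0 / refreshHz;
    }

    // Share of the frame budget consumed by a cost, in percent.
    public static double PercentOfBudget(double costMs, double refreshHz)
    {
        return 100.0 * costMs / BudgetMs(refreshHz);
    }

    // Capture rate when each captured frame takes several display frames.
    public static double CaptureRate(double refreshHz, int framesPerCapture)
    {
        return refreshHz / framesPerCapture;
    }
}
```

With these definitions, `FrameBudget.PercentOfBudget(0.5, 90.0)` comes out at about 4.5 and `FrameBudget.PercentOfBudget(3.0, 90.0)` at about 27, while `FrameBudget.CaptureRate(90.0, 3)` gives the 30FPS capture rate mentioned above.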
Happiness ensues… but wait… we’re still stuttering every 20 frames or so, what gives?! Looking in the Unity profiler, we find some spikes from the garbage collector, about 12ms each:
This leads us back to GetPixels(), which is allocating memory on each call and transferring ownership of that memory to the caller. Since it can’t be reused on the next call to GetPixels(), each frame capture generates heap garbage, which gets reaped about every 20 frames, depending on the framebuffer size.
Ok, so what can we do about this? What if we just preemptively force a garbage collection on every frame? If there’s only a small amount of garbage, then perhaps the collection cost will be small as well… maybe?
It turns out that running a garbage collection has significant overhead (finding roots, etc.), which scales with the total memory allocated, not just the garbage.
This did get the cost down to 7ms (70% of our frame budget), but that’s still way too slow.
Ok, here is a crazy idea: if the garbage collector is thread safe, maybe we can run it on a background thread and avoid blocking the main render thread. In fact it is thread safe, however if the render thread allocates any memory at all, it will again block.
In our case, Unity was the only call site allocating memory, so it worked! The overhead of garbage collection was now in the noise.
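A minimal sketch of the idea, using only standard .NET threading calls: the `BackgroundCollector` class is a hypothetical name, and this is not the actual Tilt Brush code. Whether the main thread really stays unblocked depends on the runtime and on nothing else allocating while the collection runs, as described above.

```csharp
using System;
using System.Threading;

// Hypothetical helper: runs GC.Collect() on a worker thread on request.
public sealed class BackgroundCollector : IDisposable
{
    private readonly AutoResetEvent wake = new AutoResetEvent(false);
    private readonly Thread worker;
    private volatile bool running = true;

    public BackgroundCollector()
    {
        worker = new Thread(Run);
        worker.IsBackground = true;
        worker.Name = "BackgroundCollector";
        worker.Start();
    }

    // Call from the main thread, e.g. once per captured frame.
    // Returns immediately; the collection happens on the worker.
    public void RequestCollect()
    {
        wake.Set();
    }

    private void Run()
    {
        while (true)
        {
            wake.WaitOne();
            if (!running)
            {
                break;
            }
            // Any thread that allocates while this runs will block.
            GC.Collect();
        }
    }

    public void Dispose()
    {
        running = false;
        wake.Set();
        worker.Join();
        wake.Close();
    }
}
```

The worker sleeps on the event between requests, so it costs nothing when no capture is in progress.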
For the finishing polish, we applied blur and vignette post-effects to match the “Skillman-style” established for previous promotional material. In addition, 2x supersampling is applied for videos and 4x supersampling is applied for stills to produce high quality content for sharing. To counter the cost of supersampling, quality in the HMD is reduced while capturing video.