Streaming Video With FFmpeg and DirectX 11

Ori Gold · The Startup · Sep 13, 2020

A few months ago at work, I was tasked with developing a custom, low-latency video player. Prior to this, I had worked only briefly with FFmpeg and not at all with DirectX 11. But I figured it shouldn’t be that hard. FFmpeg is pretty popular, DirectX 11 has been around for a while now, and it’s not like I needed to create intense 3D graphics or anything (yet).

Surely there would be tons of examples on how to do something basic like decode and render video, right?

Nope. Hence this article.

So that the next poor soul who needs to do this without experience in FFmpeg or DirectX 11 won’t have to bash their head into a wall just to spit out some video onto a screen.

Okay. Just a few more housekeeping things before we get to the juicy stuff.

  • The code samples provided are very simplified. I’ve left out return code checking, error handling, and, well, a bunch of stuff. My point is that the code samples are just that: samples. (I would have provided more fleshed-out examples, but you know. Intellectual property and all that.)
  • I won’t cover the principles of hardware-accelerated video decoding/rendering because it’s a little outside of the scope of this article. Besides, there are plenty of other resources that explain it far better than I could.
  • FFmpeg supports a huge range of protocols and encoding formats. Both RTSP and UDP worked with these samples, as did video encoded in H.264 and H.265. I’m sure tons of others will work, too.
  • The project I created was CMake-based and doesn’t rely on Visual Studio’s build system (since we need to support non-DX renderers as well). It made things a tad more difficult, which is why I thought I’d mention it.

Without further ado, let’s get started!

Step #1: Setting up the stream source and video decoder.

This is pretty much exclusively FFmpeg stuff: just a matter of setting up the format context, the codec context, and all the other structs FFmpeg expects you to fill in. For the setup, I relied pretty heavily on this example and the source code from another project called Moonlight.

Note that you have to provide the hardware device type in some way to the AVCodecContext. I opted to do this the same way the FFmpeg example does: a basic string.

// initialize stream
const std::string hw_device_name = "d3d11va";
AVHWDeviceType device_type = av_hwdevice_find_type_by_name(hw_device_name.c_str());

// set up codec context
AVBufferRef* hw_device_ctx;
av_hwdevice_ctx_create(&hw_device_ctx, device_type, nullptr, nullptr, 0);
codec_ctx->hw_device_ctx = av_buffer_ref(hw_device_ctx);

// open stream
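
For completeness, here’s roughly what the rest of the setup looks like. This is a minimal sketch with no error handling, and names like stream_url and stream_index are mine, not from my actual code:

AVFormatContext* format_ctx = avformat_alloc_context();
avformat_open_input(&format_ctx, stream_url, nullptr, nullptr);
avformat_find_stream_info(format_ctx, nullptr);

// find the video stream in the source
int stream_index = av_find_best_stream(format_ctx, AVMEDIA_TYPE_VIDEO, -1, -1, nullptr, 0);
AVStream* stream = format_ctx->streams[stream_index];

// create and open the decoder for that stream
const AVCodec* codec = avcodec_find_decoder(stream->codecpar->codec_id);
AVCodecContext* codec_ctx = avcodec_alloc_context3(codec);
avcodec_parameters_to_context(codec_ctx, stream->codecpar);
codec_ctx->hw_device_ctx = av_buffer_ref(hw_device_ctx); // same as above
avcodec_open2(codec_ctx, codec, nullptr);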

Once the setup is done, the actual decoding is pretty straightforward: it’s just a matter of retrieving AVPackets from the stream source and decoding them into AVFrames with the codec.

AVPacket* packet = av_packet_alloc();
av_read_frame(format_ctx, packet);
avcodec_send_packet(codec_ctx, packet);
AVFrame* frame = av_frame_alloc();
avcodec_receive_frame(codec_ctx, frame);
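
In a real player this runs in a loop, using FFmpeg’s standard send/receive pattern (one packet can produce zero or more frames). A rough sketch, reusing the stream_index from the setup sketch above:

while (av_read_frame(format_ctx, packet) >= 0)
{
    if (packet->stream_index == stream_index)
    {
        avcodec_send_packet(codec_ctx, packet);
        // drain every frame the decoder has ready
        while (avcodec_receive_frame(codec_ctx, frame) == 0)
        {
            // hand the decoded frame off to the renderer here
        }
    }
    av_packet_unref(packet);
}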

These are simplifications, but still, it didn’t take much to cobble something together. While I couldn’t render anything to the screen yet, I wanted to verify that I was producing valid decoded frames, so I thought I’d just write them to a bitmap file and check them that way.

There was one slight problem.

Step #2: Converting NV12 to RGBA.

To create a bitmap (and, as it turns out, to render to a DX11 swapchain), I needed the frames to be in RGBA format. The decoder, however, was spitting out frames in NV12 format, so I used FFmpeg’s swscale to convert from AV_PIX_FMT_NV12 to AV_PIX_FMT_RGBA.

Setting up the SwsContext can be as easy as a single function call.

SwsContext* conversion_ctx = sws_getContext(
SRC_WIDTH, SRC_HEIGHT, AV_PIX_FMT_NV12,
DST_WIDTH, DST_HEIGHT, AV_PIX_FMT_RGBA,
SWS_BICUBLIN | SWS_BITEXACT, nullptr, nullptr, nullptr);

Of course, in order to use sws_scale(), we need to transfer the frame from the GPU to the CPU. I did this with FFmpeg’s built-in av_hwframe_transfer_data(). There are loads of examples of this.

// decode frame

AVFrame* sw_frame = av_frame_alloc();
av_hwframe_transfer_data(sw_frame, frame, 0);
sws_scale(conversion_ctx, sw_frame->data, sw_frame->linesize,
          0, sw_frame->height, dst_data, dst_linesize);
sw_frame->data[0] = dst_data[0];
sw_frame->linesize[0] = dst_linesize[0];
sw_frame->format = AV_PIX_FMT_RGBA;
sw_frame->width = DST_WIDTH;
sw_frame->height = DST_HEIGHT;
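
One detail these snippets gloss over: dst_data and dst_linesize have to be allocated somewhere. Assuming fixed output dimensions, something like av_image_alloc (from libavutil/imgutils.h) does the job. This is a sketch, not my actual code:

// allocate the RGBA destination buffers used by sws_scale() above
uint8_t* dst_data[4] = { nullptr };
int dst_linesize[4] = { 0 };
av_image_alloc(dst_data, dst_linesize, DST_WIDTH, DST_HEIGHT, AV_PIX_FMT_RGBA, 1);
// ...and av_freep(&dst_data[0]) once you're done with them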

This worked fine for the time being, but there were two main issues with this as a long-term solution.

  1. At this point, what I wanted from the AVFrame was a straightforward, no-nonsense byte array; using “d3d11va” as the hardware device name gives us a GPU-side texture rather than a simple byte array. So instead, I changed the hardware device name to “dxva2”. That way, the frame->data is just a bitmap in uint8_t* form. It works for now, but as a long-term solution, not using “d3d11va” basically misses the point.
  2. In order to call sws_scale() and convert the frame to RGBA, we need to move the frame from the GPU to the CPU. Again, fine for now, but definitely something we want to remove in the future.

So not perfect by any means, but at least we now have decoded frames that we can throw onto a bitmap and see with our own eyes.

That’s it for the FFmpeg portion (for now). On to rendering in DirectX 11.

Step #3: Setting up DirectX 11 rendering.

In case you don’t already know, here’s your warning: DX11 is nothing like DX9. Nothing. At. All.

After many failed attempts to display anything other than a green or black screen, I copied and pasted this example just so I could start out with working code. After that came the disproportionately complicated task of turning the triangle into a square. (I went for the four-vertices-six-indices option.)

Additionally, rather than compile the shaders at runtime, I opted to compile them during, well, compile time. For a second, I thought I’d have to include a third party library to do this, but all it required was a couple of lines in the CMakeLists.txt file. Find the fxc.exe executable, and execute the command with the appropriate options to compile your shaders. (I used /Fh to compile them into autogenerated headers.)

Step #4: Swapping color for texture.

Once I got a rainbow square working, it was just a matter of switching COLOR for TEXCOORD in the defined input layout. Obviously, this meant changing a few things, sketched in code after the list:

  • The vertex struct now has an XMFLOAT2 (x, y) for the texture coordinate instead of XMFLOAT4 (r, g, b, a) for color.
  • The pixel shader needs to sample the color from the texture rather than just using the provided color. This means needing a sampler.
  • Also, keep in mind that texture coordinates and position coordinates are different. I didn’t know this initially, and it caused me a ton of needless grief.
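
To make the bullet points above concrete, here’s roughly what that ends up looking like. It’s a sketch under my own naming (m_sampler, quad_vertices, and so on are mine), not code from the actual player:

// Vertex: position plus a texture coordinate instead of a color.
struct Vertex
{
    DirectX::XMFLOAT3 position; // clip-space x, y, z (y = +1 is the top)
    DirectX::XMFLOAT2 texcoord; // texture-space u, v (v = 0 is the top)
};

static const D3D11_INPUT_ELEMENT_DESC input_layout_desc[] =
{
    { "POSITION", 0, DXGI_FORMAT_R32G32B32_FLOAT, 0, 0,
      D3D11_INPUT_PER_VERTEX_DATA, 0 },
    { "TEXCOORD", 0, DXGI_FORMAT_R32G32_FLOAT, 0, D3D11_APPEND_ALIGNED_ELEMENT,
      D3D11_INPUT_PER_VERTEX_DATA, 0 },
};

// The four-vertices-six-indices square. Note how the vertical axes of the two
// coordinate systems are flipped relative to each other.
static const Vertex quad_vertices[] =
{
    { { -1.0f,  1.0f, 0.0f }, { 0.0f, 0.0f } }, // top-left
    { {  1.0f,  1.0f, 0.0f }, { 1.0f, 0.0f } }, // top-right
    { {  1.0f, -1.0f, 0.0f }, { 1.0f, 1.0f } }, // bottom-right
    { { -1.0f, -1.0f, 0.0f }, { 0.0f, 1.0f } }, // bottom-left
};
static const UINT quad_indices[] = { 0, 1, 2, 0, 2, 3 };

// A basic linear-clamp sampler for the pixel shader to sample the texture with.
D3D11_SAMPLER_DESC sampler_desc = {};
sampler_desc.Filter = D3D11_FILTER_MIN_MAG_MIP_LINEAR;
sampler_desc.AddressU = D3D11_TEXTURE_ADDRESS_CLAMP;
sampler_desc.AddressV = D3D11_TEXTURE_ADDRESS_CLAMP;
sampler_desc.AddressW = D3D11_TEXTURE_ADDRESS_CLAMP;
sampler_desc.ComparisonFunc = D3D11_COMPARISON_NEVER;
sampler_desc.MaxLOD = D3D11_FLOAT32_MAX;
m_device->CreateSamplerState(&sampler_desc, m_sampler.GetAddressOf());
m_device_context->PSSetSamplers(0, 1, m_sampler.GetAddressOf());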

Once I was able to render a basic, static JPEG image, I knew I was getting close. All that remained was transferring the actual bitmap from the frame to the shared texture.

Step #5: Rendering actual frames.

Since our frames were still straightforward byte arrays in RGBA format, and our ID3D11Texture2D was in DXGI_FORMAT_R8G8B8A8_UNORM format, a simple memcpy did the trick. The number of bytes to copy is just width_in_pixels * height_in_pixels * bytes_per_pixel.

Note that we also need to call the device context’s Map() to get a pointer that allows us to access the texture’s underlying data.

// decode and convert frame

static constexpr int BYTES_IN_RGBA_PIXEL = 4;

D3D11_MAPPED_SUBRESOURCE ms;
device_context->Map(m_texture.Get(), 0, D3D11_MAP_WRITE_DISCARD, 0, &ms);
memcpy(ms.pData, frame->data[0], frame->width * frame->height * BYTES_IN_RGBA_PIXEL);
device_context->Unmap(m_texture.Get(), 0);

// clear the render target view, draw the indices, present the swapchain
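
One thing glossed over here: for Map() with D3D11_MAP_WRITE_DISCARD to succeed, m_texture has to have been created as a dynamic, CPU-writable RGBA texture. Roughly like this, with frame_width/frame_height standing in for the video dimensions:

D3D11_TEXTURE2D_DESC desc = {};
desc.Width = frame_width;
desc.Height = frame_height;
desc.MipLevels = 1;
desc.ArraySize = 1;
desc.Format = DXGI_FORMAT_R8G8B8A8_UNORM;
desc.SampleDesc.Count = 1;
desc.Usage = D3D11_USAGE_DYNAMIC;              // CPU-writable, GPU-readable
desc.BindFlags = D3D11_BIND_SHADER_RESOURCE;
desc.CPUAccessFlags = D3D11_CPU_ACCESS_WRITE;  // required for Map(WRITE_DISCARD)
m_device->CreateTexture2D(&desc, nullptr, m_texture.GetAddressOf());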

Getting to this point and seeing live video on the screen was practically euphoric. Seriously, I raised my hands in the air and praised the coding gods for having guided me thus far.

But alas. My work was far from over. Now, it was time to go back and fix the two issues I caused back in Step #2.

Step #6: Rendering actual frames… but, like, properly this time.

I knew from the beginning of my research that providing FFmpeg with the hardware device name “d3d11va” should output the AVFrame in such a way that the DirectX 11 renderer can easily digest it. But how could I make this happen?

We need to properly initialize the d3d11va hardware device context. Basically, the FFmpeg decoder needs to know about the D3D11 device it’s working with.

AVBufferRef* hw_device_ctx = av_hwdevice_ctx_alloc(AV_HWDEVICE_TYPE_D3D11VA);
AVHWDeviceContext* device_ctx = reinterpret_cast<AVHWDeviceContext*>(hw_device_ctx->data);
AVD3D11VADeviceContext* d3d11va_device_ctx = reinterpret_cast<AVD3D11VADeviceContext*>(device_ctx->hwctx);

// m_device is our ComPtr<ID3D11Device>
d3d11va_device_ctx->device = m_device.Get();

// codec_ctx is a pointer to our FFmpeg AVCodecContext
codec_ctx->hw_device_ctx = av_buffer_ref(hw_device_ctx);
av_hwdevice_ctx_init(codec_ctx->hw_device_ctx);

It looks like a lot of setup, but ultimately, all we’re doing here is stashing a pointer to our renderer’s ID3D11Device in the decoder’s AVCodecContext. This is what allows the decoder to output frames as DX11 textures.

So now, when we send our decoded frames to the renderer, we don’t need to transfer them to the CPU, and we don’t need to convert them to RGBA. We can simply do this:

ComPtr<ID3D11Texture2D> texture = (ID3D11Texture2D*)frame->data[0];

But are we done? Nope. Not even close.

We need to move the pixel format conversion to the GPU. Our swap chain didn’t start magically being able to render NV12 frames, which means the conversion from NV12 to RGBA still has to happen somewhere. Now, instead of happening in the CPU, it’ll happen in the GPU. Specifically, in the pixel shader.

This makes logical sense; we can’t just sample a location in our texture anymore because our texture is no longer in RGBA. For our pixel shader to return the right RGBA value for every pixel, it’ll need to calculate it from the texture’s YUV values.

What that means is that we need to upgrade our pixel shader to take in NV12 and output RGBA. You could derive such a shader yourself, or just use one that’s already been written.

Add another shader resource view. While the RGBA pixel shader takes a single shader resource view as input, the NV12 pixel shader actually needs two: chrominance and luminance. So we’ll need to split our one texture into two shader resource views. (Before this moment, I didn’t understand why DirectX needed to distinguish between textures and shader resource views. Boy, am I glad they did.)

// DXGI_FORMAT_R8_UNORM for the NV12 luminance (Y) channel
D3D11_SHADER_RESOURCE_VIEW_DESC luminance_desc = CD3D11_SHADER_RESOURCE_VIEW_DESC(
    m_texture.Get(), D3D11_SRV_DIMENSION_TEXTURE2D, DXGI_FORMAT_R8_UNORM);
m_device->CreateShaderResourceView(m_texture.Get(), &luminance_desc,
    m_luminance_shader_resource_view.GetAddressOf());

// DXGI_FORMAT_R8G8_UNORM for the NV12 chrominance (UV) channel
D3D11_SHADER_RESOURCE_VIEW_DESC chrominance_desc = CD3D11_SHADER_RESOURCE_VIEW_DESC(
    m_texture.Get(), D3D11_SRV_DIMENSION_TEXTURE2D, DXGI_FORMAT_R8G8_UNORM);
m_device->CreateShaderResourceView(m_texture.Get(), &chrominance_desc,
    m_chrominance_shader_resource_view.GetAddressOf());

Of course, we also need to make sure to allow our pixel shader to access these chrominance and luminance channels.

m_device_context->PSSetShaderResources(0, 1, m_luminance_shader_resource_view.GetAddressOf());
m_device_context->PSSetShaderResources(1, 1, m_chrominance_shader_resource_view.GetAddressOf());

We need to open our texture as a shared resource. The ID3D11Texture2D object we keep in the renderer is the true bridge between the FFmpeg frame and the shader resource views. We copy the new frames into it and extract the shader resource views out of it. It’s a shared resource, and we need to treat it as such.

ComPtr<IDXGIResource> dxgi_resource;
m_texture->QueryInterface(__uuidof(IDXGIResource), reinterpret_cast<void**>(dxgi_resource.GetAddressOf()));
dxgi_resource->GetSharedHandle(&m_shared_handle);
m_device->OpenSharedResource(m_shared_handle, __uuidof(ID3D11Texture2D), reinterpret_cast<void**>(m_texture.GetAddressOf()));

We need to change how we copy the received texture. It’s obviously costly to create new shader resource views every time a frame is rendered, and memcpy isn’t an option anymore since we can’t access our texture’s underlying data easily. I figured the right way to copy the received frame to the texture was to use built-in DirectX functions, like CopySubresourceRegion().

ComPtr<ID3D11Texture2D> new_texture = (ID3D11Texture2D*)frame->data[0];
// with d3d11va, data[1] holds the index into the decoder's texture array
const int texture_index = static_cast<int>(reinterpret_cast<intptr_t>(frame->data[1]));
m_device_context->CopySubresourceRegion(
    m_texture.Get(), 0, 0, 0, 0,
    new_texture.Get(), texture_index, nullptr);
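
For all of this to line up, m_texture itself now has to be created differently than in Step #5: in NV12 format (so the R8/R8G8 views above are valid) and with the shared misc flag (so OpenSharedResource works). A sketch, assuming video_width/video_height match the decoder’s output:

D3D11_TEXTURE2D_DESC desc = {};
desc.Width = video_width;
desc.Height = video_height;
desc.MipLevels = 1;
desc.ArraySize = 1;
desc.Format = DXGI_FORMAT_NV12;                // planar Y plus interleaved UV
desc.SampleDesc.Count = 1;
desc.Usage = D3D11_USAGE_DEFAULT;              // only GPU-to-GPU copies now
desc.BindFlags = D3D11_BIND_SHADER_RESOURCE;
desc.MiscFlags = D3D11_RESOURCE_MISC_SHARED;   // lets us open it as a shared resource
m_device->CreateTexture2D(&desc, nullptr, m_texture.GetAddressOf());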

After these changes, I could safely kiss those av_hwframe_transfer_data() and sws_scale() functions goodbye, and at long, long last, say hello to a fully integrated FFmpeg-DirectX11 video player.

Fin.
