iOS Realtime Video Streaming App Tutorial: Part 2

Jade
7 min read · Oct 3, 2022


Overview

In Part 1, we captured raw picture data, converted it into video data, and sent it to the server over the network in real time. This time, on the server side, we will receive the video data from the client and play it on the screen.

On the server side, we need a few capabilities: accepting a connection and receiving data from the client, parsing the received data to extract NALUs, and decoding them back into raw picture data so that we can play them on the screen. Let’s do this step by step.

Listening

First of all, we need to listen for connection requests from the client. As on the client side, we will use the Network framework, this time for listening.

Our TCPServer class is just a wrapper around NWListener, which provides the listening capability.

In the start(port:) method, we start listening on the given port. When a new connection request arrives, we establish the connection, and once it is established, we start receiving data from it.

Note that the recieveData(on:) method recursively receives between a minimum of 1 byte and a maximum of 65,000 bytes (approximately 2¹⁶), an upper bound I chose more or less arbitrarily.

When we receive data from the client, the ‘recievedDataHandling’ callback is invoked, so all we have to do is assign a closure with whatever handling we want to the ‘recievedDataHandling’ property.
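
Here is a minimal sketch of what such a listener might look like. The names ‘recievedDataHandling’ and ‘recieveData(on:)’ follow the post; everything else (the queue, the single-connection handling) is my assumption and will differ from the repository code.

```swift
import Foundation
import Network

// A minimal sketch of a TCP listener built on NWListener.
class TCPServer {
    private let queue = DispatchQueue(label: "tcp.server.queue")
    private var listener: NWListener?
    private var connection: NWConnection?

    // Called every time a chunk of data arrives from the client.
    var recievedDataHandling: ((Data) -> Void)?

    func start(port: NWEndpoint.Port) throws {
        let listener = try NWListener(using: .tcp, on: port)
        self.listener = listener

        listener.newConnectionHandler = { [weak self] connection in
            guard let self = self else { return }
            self.connection = connection
            connection.start(queue: self.queue)
            self.recieveData(on: connection)
        }
        listener.start(queue: queue)
    }

    private func recieveData(on connection: NWConnection) {
        // 65,000 bytes (≈ 2¹⁶) is an arbitrary upper bound, as noted above.
        connection.receive(minimumIncompleteLength: 1,
                           maximumLength: 65_000) { [weak self] data, _, isComplete, error in
            guard let self = self, error == nil, !isComplete else { return }
            if let data = data {
                self.recievedDataHandling?(data)
            }
            self.recieveData(on: connection)   // keep receiving recursively
        }
    }
}
```

This sketch keeps only the most recent connection, which is enough for a one-client demo like ours.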

Parsing

Before we go further, we need to talk about NAL unit types. In the previous post we looked at several NALU types, including SPS, PPS, I-frame, and P-frame. The low 5 bits of a NALU’s first byte represent its type. For instance, say the first byte of a NALU is 01000111; the low 5 bits are 00111, which equals 7, and 7 indicates an SPS.
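
In code, reading the type is a single bitwise AND (a generic illustration, not tied to any particular class):

```swift
let firstByte: UInt8 = 0b01000111      // example NALU header byte
let naluType = firstByte & 0x1F        // keep the low 5 bits
print(naluType)                        // 7 → SPS
```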

I declared a struct named H264Unit, which is just a wrapper around Data that holds the pure NALU data in the case of an SPS or PPS, and 4-byte length + NALU data in the case of an I-frame, P-frame, or anything else. You can probably now see how the type number is calculated in the initializer of H264Unit.

The reason I use two different data formats (pure NALU data for SPS and PPS, 4-byte length plus NALU data for everything else) is that creating a CMFormatDescription from the SPS and PPS requires nothing more than their pure data, whereas creating a CMSampleBuffer requires the (4-byte length + NALU data) format.
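
Here is a sketch of how such a struct could look, assuming type numbers 7 (SPS) and 8 (PPS) and a 4-byte big-endian length prefix for everything else; the actual H264Unit in the repository may differ in its details.

```swift
import Foundation

struct H264Unit {
    enum NALUType {
        case sps
        case pps
        case vcl   // I-frames, P-frames, and other units
    }

    let type: NALUType
    let data: Data   // pure NALU for SPS/PPS, 4-byte length + NALU otherwise

    init(payload: Data) {
        // The low 5 bits of the first byte are the NALU type.
        let typeNumber = payload[payload.startIndex] & 0x1F

        switch typeNumber {
        case 7:
            type = .sps
            data = payload
        case 8:
            type = .pps
            data = payload
        default:
            type = .vcl
            let lengthPrefix = withUnsafeBytes(of: UInt32(payload.count).bigEndian) { Data($0) }
            data = lengthPrefix + payload
        }
    }
}
```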

With all this in mind, all we need to do in this step is extract each NALU from the data stream, which is a sequence of NALUs separated by start codes, and convert it into the H264Unit we just declared.

The data stream from the client might look like the following image.

All we have to do here is iterate over the data stream looking for start codes; once we find one, we remove it and wrap the NALU in an H264Unit.

Picking a NALU out of the data stream

NALUParser is responsible for separating NALUs out of the data stream. When you receive data from the client, all you need to do is enqueue it to the NALUParser, and it will pick NALUs out of the stream and emit them as H264Units.

Note that we used a primitive Array for the dataStream, which is terribly inefficient because Array’s ‘removeSubrange(_:)’ method is costly. I highly recommend refactoring that logic in a more efficient way, such as with a circular buffer, and comparing the CPU usage yourself.
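
A rough sketch of the parsing idea, using Data instead of the Array mentioned above and assuming 4-byte start codes (0x00 00 00 01) plus the H264Unit sketched earlier; the same O(n) removal caveat applies here.

```swift
import Foundation

class NALUParser {
    private var dataStream = Data()
    private let startCode = Data([0x00, 0x00, 0x00, 0x01])

    // Called with each complete NALU, wrapped as an H264Unit.
    var h264UnitHandling: ((H264Unit) -> Void)?

    func enqueue(_ data: Data) {
        dataStream.append(data)

        // We need a second start code to know where the current NALU ends.
        while let first = dataStream.range(of: startCode),
              let second = dataStream.range(of: startCode,
                                            in: first.upperBound..<dataStream.endIndex) {
            let nalu = dataStream.subdata(in: first.upperBound..<second.lowerBound)
            h264UnitHandling?(H264Unit(payload: nalu))

            // Dropping the consumed prefix copies the remainder every time;
            // a circular buffer would avoid this cost.
            dataStream.removeSubrange(dataStream.startIndex..<second.lowerBound)
        }
    }
}
```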

Converting to CMSampleBuffer

So now we can get an H264Unit from NALUParser’s callback closure. An H264Unit can be an SPS, a PPS, or another NALU. The next step is to convert the H264Unit into an encoded CMSampleBuffer so that we can decode it and display it. An H264Converter does this.

Remember that a CMSampleBuffer is composed of a CMFormatDescription and a CMBlockBuffer? That means we need a CMFormatDescription and a CMBlockBuffer to create a CMSampleBuffer. The CMFormatDescription is built from the SPS and PPS, and the CMBlockBuffer wraps media data like an I-frame or a P-frame.

First, we need to create a CMFormatDescription, which is referenced by a sequence of CMSampleBuffers.

Since a CMVideoFormatDescription is reused across a sequence of CMBlockBuffers, we keep it in our ‘description’ property. ‘CMVideoFormatDescriptionCreateFromH264ParameterSets’ is a C function that takes some of its parameters as pointers, so we need to allocate memory to hold the SPS and PPS. Note that we deallocate those pointers afterwards to prevent memory leaks, because Swift doesn’t manage memory behind Unsafe- or Unmanaged-prefixed types.
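
A sketch of that call, written here as a standalone helper rather than the repository’s exact method; it assumes the SPS and PPS are passed in as pure NALU payloads without start codes.

```swift
import CoreMedia

// Build a CMVideoFormatDescription from raw SPS/PPS payloads.
func createDescription(sps: Data, pps: Data) -> CMVideoFormatDescription? {
    let spsPointer = UnsafeMutablePointer<UInt8>.allocate(capacity: sps.count)
    sps.copyBytes(to: spsPointer, count: sps.count)

    let ppsPointer = UnsafeMutablePointer<UInt8>.allocate(capacity: pps.count)
    pps.copyBytes(to: ppsPointer, count: pps.count)

    // Swift doesn't manage Unsafe* allocations, so free them when we're done.
    defer {
        spsPointer.deallocate()
        ppsPointer.deallocate()
    }

    let parameterSets: [UnsafePointer<UInt8>] = [UnsafePointer(spsPointer),
                                                 UnsafePointer(ppsPointer)]
    let parameterSetSizes = [sps.count, pps.count]

    var description: CMVideoFormatDescription?
    let status = CMVideoFormatDescriptionCreateFromH264ParameterSets(
        allocator: kCFAllocatorDefault,
        parameterSetCount: 2,
        parameterSetPointers: parameterSets,
        parameterSetSizes: parameterSetSizes,
        nalUnitHeaderLength: 4,          // matches the 4-byte length prefix we use
        formatDescriptionOut: &description)

    guard status == noErr else { return nil }
    return description
}
```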

Next, let’s create a CMBlockBuffer.

We allocate memory just like before. In contrast to the previous step, we don’t deallocate it ourselves, because ‘CMBlockBufferCreateWithMemoryBlock’ takes a block allocator, which means it takes care of deallocation for us.
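
A sketch of that step, again as a standalone helper with assumed names; here `data` is the (4-byte length + NALU) payload of an H264Unit.

```swift
import CoreMedia

// Wrap a (4-byte length + NALU) payload in a CMBlockBuffer.
func createBlockBuffer(from data: Data) -> CMBlockBuffer? {
    let pointer = UnsafeMutablePointer<UInt8>.allocate(capacity: data.count)
    data.copyBytes(to: pointer, count: data.count)

    var blockBuffer: CMBlockBuffer?
    let status = CMBlockBufferCreateWithMemoryBlock(
        allocator: kCFAllocatorDefault,
        memoryBlock: pointer,
        blockLength: data.count,
        blockAllocator: kCFAllocatorDefault,   // frees `pointer` for us
        customBlockSource: nil,
        offsetToData: 0,
        dataLength: data.count,
        flags: 0,
        blockBufferOut: &blockBuffer)

    guard status == kCMBlockBufferNoErr else { return nil }
    return blockBuffer
}
```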

Now we have a CMFormatDescription and a CMBlockBuffer so it’s time to create a CMSampleBuffer.

The ‘CMSampleBufferCreateReady’ function creates a CMSampleBuffer and marks it ready for use. It can take more than one sample, but since we split the data stream to get NALUs one by one, ours contains only a single sample.

And we set the attachment dictionary key ‘kCMSampleAttachmentKey_DisplayImmediately’ to ‘kCFBooleanTrue’, which indicates that the CMSampleBuffer needs to be displayed immediately, because we are streaming in real time.
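
Putting those two calls into one helper might look like this (a sketch with assumed names; the timing info is left invalid because we display immediately anyway).

```swift
import CoreMedia

// Create a CMSampleBuffer from a block buffer and a format description,
// and mark it to be displayed as soon as it is enqueued.
func createSampleBuffer(from blockBuffer: CMBlockBuffer,
                        formatDescription: CMFormatDescription) -> CMSampleBuffer? {
    var sampleBuffer: CMSampleBuffer?
    var timingInfo = CMSampleTimingInfo(duration: .invalid,
                                        presentationTimeStamp: .invalid,
                                        decodeTimeStamp: .invalid)

    let status = CMSampleBufferCreateReady(
        allocator: kCFAllocatorDefault,
        dataBuffer: blockBuffer,
        formatDescription: formatDescription,
        sampleCount: 1,                 // one NALU → one sample
        sampleTimingEntryCount: 1,
        sampleTimingArray: &timingInfo,
        sampleSizeEntryCount: 0,
        sampleSizeArray: nil,
        sampleBufferOut: &sampleBuffer)

    guard status == noErr, let sampleBuffer = sampleBuffer else { return nil }

    // We are streaming in real time, so ask for immediate display.
    if let attachments = CMSampleBufferGetSampleAttachmentsArray(sampleBuffer,
                                                                 createIfNecessary: true) {
        let dictionary = unsafeBitCast(CFArrayGetValueAtIndex(attachments, 0),
                                       to: CFMutableDictionary.self)
        CFDictionarySetValue(dictionary,
                             Unmanaged.passUnretained(kCMSampleAttachmentKey_DisplayImmediately).toOpaque(),
                             Unmanaged.passUnretained(kCFBooleanTrue).toOpaque())
    }
    return sampleBuffer
}
```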

Now we can write a simple method that creates a CMSampleBuffer from an H264Unit.
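
Using the helpers sketched above, the converter’s public entry point could be shaped roughly like this; the SPS/PPS caching details are my assumption, not the repository’s exact code.

```swift
import CoreMedia

final class H264Converter {
    private var sps: Data?
    private var pps: Data?
    private var description: CMVideoFormatDescription?

    // SPS/PPS feed the format description; everything else becomes a sample buffer.
    func convert(_ unit: H264Unit) -> CMSampleBuffer? {
        switch unit.type {
        case .sps:
            sps = unit.data
            createDescriptionIfNeeded()
            return nil
        case .pps:
            pps = unit.data
            createDescriptionIfNeeded()
            return nil
        case .vcl:
            guard let description = description,
                  let blockBuffer = createBlockBuffer(from: unit.data) else { return nil }
            return createSampleBuffer(from: blockBuffer, formatDescription: description)
        }
    }

    private func createDescriptionIfNeeded() {
        guard description == nil, let sps = sps, let pps = pps else { return }
        description = createDescription(sps: sps, pps: pps)   // helper sketched earlier
    }
}
```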

Putting together

Now that we have everything we need to receive data from the client, parse it, and create CMSampleBuffers, it’s time to put it all together into one facade that manages the whole process. I call it ‘VideoServer’.
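
One way to wire the facade up, assuming the classes sketched above; the actual VideoServer in the repository may expose a different interface.

```swift
import CoreMedia
import Network

final class VideoServer {
    private let server = TCPServer()
    private let parser = NALUParser()
    private let converter = H264Converter()

    // Emits display-ready (still encoded) sample buffers.
    var sampleBufferHandling: ((CMSampleBuffer) -> Void)?

    func start(port: NWEndpoint.Port) throws {
        // received bytes → parser
        server.recievedDataHandling = { [weak self] data in
            self?.parser.enqueue(data)
        }
        // parsed NALUs → converter → sample buffers
        parser.h264UnitHandling = { [weak self] unit in
            guard let self = self,
                  let sampleBuffer = self.converter.convert(unit) else { return }
            self.sampleBufferHandling?(sampleBuffer)
        }
        try server.start(port: port)
    }
}
```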

Display

But wait, don’t we need to decode the CMSampleBuffer before displaying it? No, because AVSampleBufferDisplayLayer accepts either an encoded CMSampleBuffer or a decoded one, which means that if we provide the encoded one, AVSampleBufferDisplayLayer will decode it on our behalf.

All we need to do is just enqueue a CMSampleBuffer to the AVSampleBufferDisplayLayer.
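
For example, on the view controller side it can be as small as this, assuming the VideoServer sketch above; the layer setup and the port number are only illustrative.

```swift
import AVFoundation
import UIKit

final class PlayerViewController: UIViewController {
    private let displayLayer = AVSampleBufferDisplayLayer()
    private let videoServer = VideoServer()   // facade sketched above

    override func viewDidLoad() {
        super.viewDidLoad()

        displayLayer.frame = view.bounds
        displayLayer.videoGravity = .resizeAspect
        view.layer.addSublayer(displayLayer)

        // Hand every encoded sample buffer to the layer; it decodes for us.
        videoServer.sampleBufferHandling = { [weak self] sampleBuffer in
            DispatchQueue.main.async {
                guard let layer = self?.displayLayer,
                      layer.isReadyForMoreMediaData else { return }
                layer.enqueue(sampleBuffer)
            }
        }
        try? videoServer.start(port: 12005)   // arbitrary example port
    }
}
```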

But it’s also possible to decode a CMSampleBuffer manually. You can use ‘VTDecompressionSession’ for that, which is the counterpart of the VTCompressionSession we used for encoding.

Testing

It’s finally time to test what we built. But before we dive in, there are a few things to keep in mind.

  • The server should not be connected via a broadband router or another NAT (Network Address Translation) system, including cellular networks, or
  • if the server is connected via a broadband router or another NAT system, you either have to use port forwarding, which most broadband routers support, or the client must be connected to the same network as the server.

In my case, I have only one real device, so I used the simulator as the server, with the server and the client connected to the same network via a broadband router.

If you’re using a simulator, you can check your IP address easily in the terminal with the ‘ifconfig’ command.

IP address
The client side (left) and the server side (right)

Wrap up

Now, I hope you have a grasp of the overall picture of video streaming.

I left some points that you can improve yourself, such as:

  • performance: we did some inefficient things for the sake of simplicity.
  • features: we only deliver video, but it would be nicer if we could deliver audio along with it.
  • transport: we used TCP for the transport layer, but we could use UDP.
  • direction: we currently only deliver video from the client to the server. We could make it bidirectional, like video conferencing apps.

So I hope you tinker with it and let me know what you come up with. I would be very happy if you improved it yourself.

And if you want to learn more about video streaming, I recommend checking out technologies like WebRTC, RTMP, and HLS.

You can find the full code explained in this post in the following repository:
https://github.com/pinkjuice66/VideoStreaming
