iOS Realtime Video Streaming App Tutorial: Part 1

Jade
Sep 25, 2022 · 9 min read


Motivation

Have you ever wondered how to make a video streaming app? You might have searched Google and found a bunch of SDKs that support it. The problem was that I was curious about how it works under the hood, not about SDKs.

So I dug deeper and deeper, but I couldn’t find any good step-by-step references, and I realized it’s a fairly tricky problem. After a lot of research and experimentation, I finally came to understand it and was able to build a simple video streaming app.

In this step-by-step tutorial, I want to share everything I’ve learned about video streaming and help you get a better understanding of media streaming.

Overview

What we want to do, in the end, is capture video on one device (the client), send it to another device (the server), and have the server play the video in realtime. In the big picture, it looks like the following image.

The client side needs to capture video data and send it over the network, and the server receives it and plays it. I’ve abstracted away some details, so each step above may actually consist of several sub-steps.

Video Data

There are tons of concepts you have to know to understand video streaming technology. First of all, video data. You might know that video footage is a sequence of pictures. If a video shows 30 pictures per second, we say it has a frame rate of 30 fps.

But that doesn’t mean video data consists of a sequence of full picture data. Why? Can you imagine how big it would be to store 30 full pictures for every second of video? It would be tremendous! So we need a much more efficient strategy for delivering video data.

The basic idea is this: if we have a previous picture (we call it an i-frame or key-frame) and data that represents the difference between the previous picture and the current picture (we call it a p-frame), we can reconstruct the current picture, and thus the whole sequence of pictures.

To recap, we can rebuild a sequence of pictures using one full picture plus the differences between pictures. This is efficient in most cases because consecutive pictures share a lot of duplicated data, as long as the scene doesn’t change quickly (imagine a man sitting on a chair and talking; the only changes happen around his face).

Getting Raw Picture Data

Converting a sequence of pictures into video data is called compression, and the reverse is called decompression. Before doing compression, we have to be able to receive raw picture data so that we can compress it into video data.

The first thing we have to do is set the camera usage key (NSCameraUsageDescription, shown as “Privacy – Camera Usage Description”) in our Info.plist; otherwise the app will crash when it tries to access the camera.

There are four objects needed to receive raw picture data in realtime. AVCaptureDevice represents the iPhone’s input hardware, such as the camera or the mic, and AVCaptureDeviceInput wraps that device so it can be attached to a session. AVCaptureOutput can deliver captured data to a file or a stream; since what we want here is raw picture data, we set its delegate so the delegate can receive it. You can think of AVCaptureSession as a bridge between AVCaptureDeviceInput and AVCaptureOutput that connects them.

We will create a class named VideoCaptureManager, which is responsible for the configuration described above.

Note that we have a serial queue called ‘sessionQueue’ because we don’t want to block the main queue while configuring the AVCaptureSession. ‘videoOutputQueue’ is the queue on which the captured data output will be handled. An error type and a result type will help us handle the configuration result.
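A minimal sketch of the class shell might look like this (the error and result type names here are one possible choice, not necessarily the exact ones):

```swift
import AVFoundation

class VideoCaptureManager {
    private enum SessionSetupResult {
        case success
        case notAuthorized
        case configurationFailed
    }

    private enum ConfigurationError: Error {
        case cannotAddInput
        case cannotAddOutput
        case defaultDeviceNotExist
    }

    private let session = AVCaptureSession()
    private let videoOutput = AVCaptureVideoDataOutput()

    // Serial queue for configuring/starting the session off the main queue.
    private let sessionQueue = DispatchQueue(label: "session.queue")
    // Queue on which captured sample buffers are delivered to the delegate.
    private let videoOutputQueue = DispatchQueue(label: "video.output.queue")

    private var setupResult: SessionSetupResult = .success
}
```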

The first thing we need to do is request camera usage authorization from the user.
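Inside VideoCaptureManager, the authorization check could look roughly like this (the method name is my own); note how we suspend sessionQueue until the user answers the system prompt:

```swift
private func requestCameraAuthorizationIfNeeded() {
    switch AVCaptureDevice.authorizationStatus(for: .video) {
    case .authorized:
        // The user has previously granted access to the camera.
        break
    case .notDetermined:
        // Suspend the session queue until the user answers the permission prompt.
        sessionQueue.suspend()
        AVCaptureDevice.requestAccess(for: .video) { [weak self] granted in
            if !granted {
                self?.setupResult = .notAuthorized
            }
            self?.sessionQueue.resume()
        }
    default:
        // Access was previously denied or is restricted.
        setupResult = .notAuthorized
    }
}
```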

Next, we need to add the input and the output to the AVCaptureSession so that they can be connected. This happens in the following methods.
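A sketch of those two methods, assuming we use the default wide-angle back camera and the ConfigurationError cases from the skeleton above:

```swift
private func addVideoDeviceInput() throws {
    // Pick the default wide-angle back camera as the capture device.
    guard let camera = AVCaptureDevice.default(.builtInWideAngleCamera,
                                               for: .video,
                                               position: .back) else {
        throw ConfigurationError.defaultDeviceNotExist
    }
    let videoDeviceInput = try AVCaptureDeviceInput(device: camera)
    guard session.canAddInput(videoDeviceInput) else {
        throw ConfigurationError.cannotAddInput
    }
    session.addInput(videoDeviceInput)
}

private func addVideoOutput() throws {
    // Deliver pixel buffers in a format the H.264 encoder accepts.
    videoOutput.videoSettings =
        [kCVPixelBufferPixelFormatTypeKey as String: kCVPixelFormatType_420YpCbCr8BiPlanarVideoRange]
    guard session.canAddOutput(videoOutput) else {
        throw ConfigurationError.cannotAddOutput
    }
    session.addOutput(videoOutput)
}
```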

We put the steps that add the input and output together in a method named ‘configureSession’, and we also need a method that starts the AVCaptureSession running.
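Something along these lines (the second method’s name is my own):

```swift
private func configureSession() {
    guard setupResult == .success else { return }

    session.beginConfiguration()
    defer { session.commitConfiguration() }

    do {
        try addVideoDeviceInput()
        try addVideoOutput()
    } catch {
        setupResult = .configurationFailed
    }
}

private func startSessionIfPossible() {
    guard setupResult == .success else { return }
    session.startRunning()
}
```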

All right. We’re now ready to set up the AVCaptureSession. Add the following code to the initializer of VideoCaptureManager so the session is set up automatically when the manager is initialized.
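For example, roughly:

```swift
init() {
    // sessionQueue is serial, so these run in order; the authorization step
    // can suspend the queue until the user responds to the prompt.
    sessionQueue.async {
        self.requestCameraAuthorizationIfNeeded()
    }
    sessionQueue.async {
        self.configureSession()
    }
    sessionQueue.async {
        self.startSessionIfPossible()
    }
}
```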

And lastly, we need an interface to set the object that will receive the raw video data.
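A one-liner is enough here; it forwards the delegate to AVCaptureVideoDataOutput together with videoOutputQueue:

```swift
func setVideoOutputDelegate(with delegate: AVCaptureVideoDataOutputSampleBufferDelegate) {
    videoOutput.setSampleBufferDelegate(delegate, queue: videoOutputQueue)
}
```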

Compression

Now that we are able to get raw picture data, the next thing we have to do is convert the sequence of raw pictures into video data. There are many compression methods, but we will use H.264 (also called MPEG-4 Part 10), which is the most popular these days.

First, we will create a class named H264Encoder, which is responsible for converting picture data into video data.

VTCompressionSession is the object that actually compresses picture data, so technically speaking, our H264Encoder is just a wrapper around a VTCompressionSession.

Our H264Encoder class conforms to the AVCaptureVideoDataOutputSampleBufferDelegate protocol so it can receive the raw picture data stream. Note that we receive raw picture data in the form of CMSampleBuffer, a Core Media object composed of media data and its description (such as the picture’s width and height).
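A sketch of the class shell. The naluHandling closure is my own name for the hook that will later hand finished NAL units to the network layer, and the delegate method forwards every captured buffer to an encode(buffer:) method we’ll write shortly:

```swift
import AVFoundation
import VideoToolbox

class H264Encoder: NSObject {
    // Called with each encoded NAL unit, ready to be sent or stored.
    var naluHandling: ((Data) -> Void)?

    private var _session: VTCompressionSession?
}

extension H264Encoder: AVCaptureVideoDataOutputSampleBufferDelegate {
    // Receives raw picture data (a CMSampleBuffer wrapping a CVPixelBuffer)
    // from AVCaptureVideoDataOutput and forwards it to the compression session.
    func captureOutput(_ output: AVCaptureOutput,
                       didOutput sampleBuffer: CMSampleBuffer,
                       from connection: AVCaptureConnection) {
        encode(buffer: sampleBuffer)
    }
}
```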

All right. Now we need a way to create and set up the VTCompressionSession.

The function named ‘VTCompressionSessionCreate’ is a C function that creates a VTCompressionSession, and many core library functions in Swift are C functions. You may not be used to calling C functions from Swift, but there’s no need to be scared: it looks a little different from plain Swift code, but not by much.

‘VTCompressionSessionCreate’ takes several parameters:

  • outputCallback : when an encoding task completes, this closure is called. You can think of it as a completion callback. We will declare this closure later.
  • outputCallbackRefCon : a reference to the object that the outputCallback closure will work with; in our case it’s the H264Encoder instance. We need to pass it as a raw pointer, so we use a bit of magic here (Unmanaged.passUnretained(self).toOpaque()).
  • compressionSessionOut : a pointer that will contain the created VTCompressionSession. We can pass our _session variable as a pointer using an ampersand (&).

With VTSessionSetProperties and VTCompressionSessionPrepareToEncodeFrames, we set some properties and get the session ready to use.
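Putting the creation and configuration together, a setup method inside H264Encoder could look roughly like this (encodingOutputCallback is declared later, as mentioned above, and the exact property set may differ):

```swift
private func configureCompressionSession(width: Int32, height: Int32) {
    // Create the compression session, passing self (via an opaque pointer)
    // so the C callback can reach back into this H264Encoder instance.
    let status = VTCompressionSessionCreate(
        allocator: kCFAllocatorDefault,
        width: width,
        height: height,
        codecType: kCMVideoCodecType_H264,
        encoderSpecification: nil,
        imageBufferAttributes: nil,
        compressedDataAllocator: kCFAllocatorDefault,
        outputCallback: encodingOutputCallback,
        refcon: Unmanaged.passUnretained(self).toOpaque(),
        compressionSessionOut: &_session)

    guard status == noErr, let session = _session else { return }

    // Realtime H.264 with one key-frame every 60 frames.
    let properties: [CFString: Any] = [
        kVTCompressionPropertyKey_ProfileLevel: kVTProfileLevel_H264_Main_AutoLevel,
        kVTCompressionPropertyKey_RealTime: kCFBooleanTrue!,
        kVTCompressionPropertyKey_MaxKeyFrameInterval: 60
    ]
    VTSessionSetProperties(session, propertyDictionary: properties as CFDictionary)
    VTCompressionSessionPrepareToEncodeFrames(session)
}
```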

We are now able to convert raw picture data to video data.

The ‘VTCompressionSessionEncodeFrame’ function encodes the raw picture data and calls the callback we previously provided to VTCompressionSessionCreate.
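A sketch of the encode method: it pulls the pixel buffer and the timing information out of the sample buffer and hands them to the session.

```swift
func encode(buffer: CMSampleBuffer) {
    guard let session = _session,
          let pixelBuffer = CMSampleBufferGetImageBuffer(buffer) else { return }

    let timestamp = CMSampleBufferGetPresentationTimeStamp(buffer)
    let duration = CMSampleBufferGetDuration(buffer)

    // Hand the raw pixel buffer to VideoToolbox; the encoded result arrives
    // asynchronously in encodingOutputCallback.
    VTCompressionSessionEncodeFrame(session,
                                    imageBuffer: pixelBuffer,
                                    presentationTimeStamp: timestamp,
                                    duration: duration,
                                    frameProperties: nil,
                                    sourceFrameRefcon: nil,
                                    infoFlagsOut: nil)
}
```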

It’s time to declare encodingOutputCallback:

encodingOutputCallback is of type VTCompressionOutputCallback, which takes several parameters (a sketch follows the list below):

  • outputCallbackRefCon : the outputCallbackRefCon we provided to VTCompressionSessionCreate; in our case it’s the H264Encoder instance.
  • sampleBuffer : the encoded frame, in the form of a CMSampleBuffer.
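A sketch of the callback. The helper calls at the bottom (isKeyFrame, extractSPSAndPPS, extractDataAndConvertToAnnexB) are the names I’m using for the steps explained in the rest of this section:

```swift
private var encodingOutputCallback: VTCompressionOutputCallback = { refCon, _, status, _, sampleBuffer in
    guard status == noErr,
          let refCon = refCon,
          let sampleBuffer = sampleBuffer else { return }

    // Recover the H264Encoder instance we passed as outputCallbackRefCon.
    let encoder = Unmanaged<H264Encoder>.fromOpaque(refCon).takeUnretainedValue()

    // For key frames, pull the SPS/PPS out of the format description first
    // (isKeyFrame and extractSPSAndPPS are sketched in the next sections).
    if sampleBuffer.isKeyFrame {
        encoder.extractSPSAndPPS(from: sampleBuffer)
    }
    // Then convert the length-prefixed NALUs into Annex-B NALUs and emit them.
    encoder.extractDataAndConvertToAnnexB(from: sampleBuffer)
}
```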

Now we finally get an encoded frame (video data) in our ‘encodingOutputCallback’ closure. But what exactly is it?

In H.264, there are several types of data that carry either video data or format data. There are many more types, but knowing these four is enough for the time being:

  • i-frame (key-frame) : a full picture. We can reconstruct an image from it without any other frames.
  • p-frame : predictive data. It represents the difference between pictures, and we can use it to calculate the next picture from the previous one.
  • PPS (Picture Parameter Set) : not picture data itself, but auxiliary parameters that individual pictures refer to when they are decoded.
  • SPS (Sequence Parameter Set) : similar to PPS, but a whole sequence of video frames refers to it rather than a single picture; it carries information such as the frames’ width and height.

Back to our encoded CMSampleBuffer: let’s see how it relates to the H.264 data types I just explained.

The CMFormatDescription object contains the SPS and PPS, and the CMBlockBuffer object contains the other types of video data (mostly i-frames or p-frames). We set the key frame interval to 60 in ‘VTSessionSetProperties’, which means we want one key-frame for every 60 frames. So, for instance, out of 60 frames, one will be a key-frame and the rest will be p-frames.

Keep in mind that we cannot play video with only i-frames and p-frames; they need additional information to refer to (the SPS and PPS). I hope the image below helps you understand.

So, if a given CMSampleBuffer represents an i-frame (key-frame), we need to extract the SPS and PPS from it.

First, we will extend CMSampleBuffer so it can easily tell us whether it represents a key frame.
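A sketch of that extension, based on the kCMSampleAttachmentKey_NotSync sample attachment (the property name isKeyFrame is my own):

```swift
import CoreMedia

extension CMSampleBuffer {
    var isKeyFrame: Bool {
        // Each sample carries an attachments dictionary; if kCMSampleAttachmentKey_NotSync
        // is absent or false, the sample is a sync (key) frame.
        guard let attachments = CMSampleBufferGetSampleAttachmentsArray(self, createIfNecessary: true)
                as? [[String: Any]],
              let first = attachments.first else { return false }

        let notSync = first[kCMSampleAttachmentKey_NotSync as String] as? Bool ?? false
        return !notSync
    }
}
```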

And we need a method that extracts the SPS and PPS from a CMSampleBuffer if it’s a key frame.
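A sketch of the extraction inside H264Encoder, using CMVideoFormatDescriptionGetH264ParameterSetAtIndex. naluHandling comes from the earlier skeleton, naluStartCode is declared here, and the method name is my own:

```swift
private let naluStartCode = Data([0x00, 0x00, 0x00, 0x01])

private func extractSPSAndPPS(from sampleBuffer: CMSampleBuffer) {
    guard let description = CMSampleBufferGetFormatDescription(sampleBuffer) else { return }

    // First call: only ask how many parameter sets there are (SPS + PPS → 2).
    var parameterSetCount = 0
    CMVideoFormatDescriptionGetH264ParameterSetAtIndex(description,
                                                       parameterSetIndex: 0,
                                                       parameterSetPointerOut: nil,
                                                       parameterSetSizeOut: nil,
                                                       parameterSetCountOut: &parameterSetCount,
                                                       nalUnitHeaderLengthOut: nil)
    guard parameterSetCount == 2 else { return }

    var sps: Data?
    var pps: Data?
    for index in 0..<parameterSetCount {
        var pointer: UnsafePointer<UInt8>?
        var size = 0
        CMVideoFormatDescriptionGetH264ParameterSetAtIndex(description,
                                                           parameterSetIndex: index,
                                                           parameterSetPointerOut: &pointer,
                                                           parameterSetSizeOut: &size,
                                                           parameterSetCountOut: nil,
                                                           nalUnitHeaderLengthOut: nil)
        guard let ptr = pointer else { continue }
        // Prepend the Annex-B start code so each parameter set becomes a standalone NALU.
        let data = naluStartCode + Data(bytes: ptr, count: size)
        if index == 0 { sps = data } else { pps = data }
    }

    if let sps = sps, let pps = pps {
        naluHandling?(sps)
        naluHandling?(pps)
    }
}
```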

Note that we extract the SPS and PPS data and prepend the naluStartCode to each of them. NAL (Network Abstraction Layer) is the way video data is packaged so it’s easy to handle over the network or in a local file. We treat an SPS, a PPS, or a frame as a NAL unit (NALU), which can be picked out of a stream by its start code; the start code is usually 0x000001 or 0x00000001.

Back to the encoded CMSampleBuffer: we can get a CMBlockBuffer from the CMSampleBuffer’s dataBuffer property. The CMBlockBuffer contains one or more NALUs (mostly i-frames or p-frames, but possibly other types, except for the SPS and PPS, which live in the CMFormatDescription).

But here’s a trap: the block buffer doesn’t contain NALU start codes. Instead, each NALU is prefixed with its 4-byte length in big-endian order, as in the following image.

That’s because the length prefix can be more efficient when you’re not sending the data over a network. Say you want to write it to a file: it’s much easier to pick NALUs out of the stream later, because you just read the 4-byte length and then read exactly that many bytes; there is no extra effort to find a start code. But we want to send it over the network, where there is no guarantee the NALU stream arrives framed the way we want, so we replace the 4-byte length prefix with a start code.

Also, the 4-byte length is in big-endian order. That means that to read a NALU’s length, we have to convert it to little-endian, because iOS uses little-endian byte order. That’s what the function ‘CFSwapInt32BigToHost’ does.
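Putting the last two points together, a sketch of converting the length-prefixed NALUs into start-code-prefixed NALUs (naluStartCode and naluHandling come from the earlier sketches, and the method name is my own):

```swift
private func extractDataAndConvertToAnnexB(from sampleBuffer: CMSampleBuffer) {
    guard let blockBuffer = sampleBuffer.dataBuffer else { return }

    var totalLength = 0
    var dataPointer: UnsafeMutablePointer<Int8>?
    let status = CMBlockBufferGetDataPointer(blockBuffer,
                                             atOffset: 0,
                                             lengthAtOffsetOut: nil,
                                             totalLengthOut: &totalLength,
                                             dataPointerOut: &dataPointer)
    guard status == kCMBlockBufferNoErr, let pointer = dataPointer else { return }

    var offset = 0
    while offset < totalLength - 4 {
        // Read the 4-byte, big-endian NALU length that precedes each NALU...
        var naluLength: UInt32 = 0
        memcpy(&naluLength, pointer + offset, 4)
        naluLength = CFSwapInt32BigToHost(naluLength)

        // ...and replace it with the Annex-B start code before handing the NALU on.
        let naluData = naluStartCode + Data(bytes: pointer + offset + 4, count: Int(naluLength))
        naluHandling?(naluData)

        offset += 4 + Int(naluLength)
    }
}
```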

Network

So far, we can receive raw picture data and convert it into video data. Now we want to send it over the network. We’re going to use the Network framework, which provides a straightforward interface on top of sockets. We will first create a class named TCPClient, which is responsible for behaving as a TCP client. As you can see, it’s just a wrapper around NWConnection, which handles all the networking for us.
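A sketch of the class shell:

```swift
import Foundation
import Network

class TCPClient {
    // Queue on which NWConnection delivers its events.
    private let queue = DispatchQueue(label: "tcp.client.queue")

    private var connection: NWConnection?
    private var state: NWConnection.State = .setup
}
```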

And we need two methods: one for connecting to the server given its IP address and port, and another for sending data.

In the connect(to:) method, we set up a TCP connection with the server, and once the connection is established (the 3-way handshake has completed), ‘state’ is set to NWConnection.State.ready. Of course, if you want a connection over UDP instead, you can get one by initializing NWConnection with UDP parameters.
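A sketch of both methods (the parameter labels are my own):

```swift
extension TCPClient {
    func connect(to ipAddress: String, with port: UInt16) {
        // .tcp gives us a plain TCP connection; swap in .udp for UDP.
        let connection = NWConnection(host: NWEndpoint.Host(ipAddress),
                                      port: NWEndpoint.Port(rawValue: port)!,
                                      using: .tcp)
        connection.stateUpdateHandler = { [weak self] state in
            // .ready means the 3-way handshake completed and we can send data.
            self?.state = state
        }
        connection.start(queue: queue)
        self.connection = connection
    }

    func send(data: Data) {
        guard case .ready = state else { return }
        connection?.send(content: data, completion: .contentProcessed({ error in
            if let error = error {
                print("failed to send data: \(error)")
            }
        }))
    }
}
```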

Wrap up

Now that we have the ability to create video data and send it over the network, it’s time to put everything together into one object that manages all the processes we’ve built.

It exposes just two interfaces: one for connecting to the server and another for starting to send video data. Inside those methods, it injects delegate closures so the data is delivered the way we want, as in the sketch below.
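A hypothetical wrapper (the class and method names here are mine, not necessarily the original ones) could look like this:

```swift
class VideoClient {
    private let captureManager = VideoCaptureManager()
    private let videoEncoder = H264Encoder()
    private let tcpClient = TCPClient()

    func connect(to ipAddress: String, with port: UInt16) {
        tcpClient.connect(to: ipAddress, with: port)
    }

    func startSendingVideoToServer() {
        // Raw frames flow from the capture output into the encoder,
        // which acts as the sample buffer delegate...
        captureManager.setVideoOutputDelegate(with: videoEncoder)
        // ...and every encoded NAL unit is pushed straight to the TCP connection.
        // (Somewhere, the encoder's compression session also needs to be
        // configured, e.g. lazily when the first frame arrives.)
        videoEncoder.naluHandling = { [weak self] data in
            self?.tcpClient.send(data: data)
        }
    }
}
```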

That’s it for today. I’ll be back with the next post, where we build the server side. If you have any questions, feel free to ask, and please leave a comment if you find anything incorrect.
