We’ve all seen custom cameras in one form or another in iOS, but how can we make one ourselves?
This tutorial is going to cover the basics, while at the same time talk about more advanced implementations and options. As you will soon see, options are plenty when it comes to audio/visual hardware interactions on iOS devices! As always, I aim to develop an intuition behind what we are doing rather than just provide code to copy-paste.
Already know how to make a camera app in iOS? Looking for more of a challenge? Check out my more advanced tutorial on implementing filters.
The starter code can be found on my GitHub:
A tutorial on making a custom camera in iOS using AVFoundation - barbulescualex/iOSCustomCamera
If you run it you’ll see there’s very little going on. All of our logic will take place in
ViewController.swift. We just have a capture button, a switch camera button, and a view to hold our last taken picture. I’ve also included the request to access the camera; if you deny it, the app will abort 😁. The view setup and authorization checking are under a class extension in a separate file,
ViewController+Extras.swift, so as not to pollute our working space with less relevant code.
Setting Up A Standard Custom Camera
We will now look into how we can use the
AVFoundation framework to both capture and display camera feed, allowing our users to take pictures from within our app without using a
UIImagePickerController (the easy, barebones way of accessing the camera from within an app).
Let’s get started!
AVFoundation is the highest level framework for all things audio/visual in iOS. But don’t underestimate it, it is very powerful and gives you all the flexibility you could possibly want (within reason, of course).
What we’re interested in is Camera and Media Capture.
The AVFoundation Capture subsystem provides a common high-level architecture for video, photo, and audio capture services in iOS and macOS. Use this system if you want to:
Build a custom camera UI to integrate shooting photos or videos into your app’s user experience.
Give users more direct control over photo and video capture, such as focus, exposure, and stabilization options.
Produce different results than the system camera UI, such as RAW format photos, depth maps, or videos with custom timed metadata.
Get live access to pixel or audio data streaming directly from a capture device.
In this part we’ll be accomplishing the first point.
So what is this “Capture subsystem” and how does it work? You can think of it as a pipeline from hardware to software. You have a central
AVCaptureSession that has inputs and outputs. It mediates the data between the two. Your inputs come from
AVCaptureDevices which are software representations of the different audio/visual hardware components of an iOS device. The
AVCaptureOutputs are objects, or rather ways, to extract data out from whatever is feeding into the capture session.
Section 1: Setting Up The AVCaptureSession
The first thing we need to do is import the AVFoundation framework into our file. After that, we can create the session and store a reference to it. But before configuring the session, we should tell it to begin a configuration using beginConfiguration() and then commit those changes using
commitConfiguration(). Why do we do this? Because it’s good practice! This way, anything you do to the capture session will be applied atomically, meaning all of the changes happen at once. Why do we want this? Well, while it’s not necessary for the initial setup, when we switch the camera, queuing up the changes (removing one input, adding another) leads to a smoother transition for the end user. After making all our configurations, we want to start the session.
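Putting that together, a minimal sketch of the setup could look like the following. The property name captureSession and the choice of dispatch queue are my own; your starter project’s actual code may differ.

```swift
import AVFoundation
import UIKit

extension ViewController {
    // assumes ViewController declares: var captureSession: AVCaptureSession!
    func setupAndStartCaptureSession() {
        // startRunning() blocks, so do all of this off the main thread
        DispatchQueue.global(qos: .userInitiated).async { [weak self] in
            guard let self = self else { return }
            let session = AVCaptureSession()
            session.beginConfiguration()
            // session-level configuration (preset, inputs, outputs) goes here;
            // it is applied atomically when we commit
            session.commitConfiguration()
            self.captureSession = session
            // blocking call: returns once the session is running or has failed
            session.startRunning()
        }
    }
}
```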
This is all simple enough, but why do we execute the body of
setupAndStartCaptureSession() on a background thread? This is because
startRunning() is a blocking call, meaning that the execution of your app stops at that line until the capture session actually starts, or until it fails. How do you know it failed? Well, you can subscribe to the .AVCaptureSessionRuntimeError notification.
Now how do we actually configure the session and what does that mean? As you’re probably thinking, adding the inputs and outputs is definitely part of it but you can do more than that.
If you look through the documentation for
AVCaptureSession there exist multiple things you can do. While all of them are important I’ll only mention the ones you really need in a barebones case.
- Manage Inputs And Outputs — We’ll get to this in the following sections
- Manage Running State — Important for applications going into production, you want to be able to keep track of what’s going on with the capture session.
- Manage Connections — This offers more fine-tuning of the data pipeline. When you connect inputs and outputs to the capture session, they’re connected either implicitly or explicitly through an
AVCaptureConnection object. We will cover this later on.
- Manage Color Spaces — This offers you the ability to use a wider color gamut (if possible), iPhone 7s and up can take advantage of this as they have P3 color gamuts whereas older devices only have sRGB.
In this section, we care about presets, that is the quality level of the output. Since the capture session is a mediator between the inputs and outputs, it has the ability to control some of these things.
AVCaptureSession.Presets provide a higher level way to fine tune the quality coming out of the devices. Although as we will see in the next section, we can go into much more depth at the device level. Since we’re making a camera, the
photo preset makes the most sense. This preset will tell the devices (cameras) connected to it to use a configuration that returns the highest quality images.
Since we care about the quality, if we have access to a wider color gamut, we should use it by enabling the wider color space.
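In code, that session-level configuration might look like this sketch, placed inside the begin/commit block from before. automaticallyConfiguresCaptureDeviceForWideColor is a real AVCaptureSession property (it defaults to true); setting it explicitly just documents our intent.

```swift
captureSession.beginConfiguration()
// ask the attached cameras for their highest-quality still-image configuration
if captureSession.canSetSessionPreset(.photo) {
    captureSession.sessionPreset = .photo
}
// let the session pick the P3 wide color space when the device supports it
captureSession.automaticallyConfiguresCaptureDeviceForWideColor = true
captureSession.commitConfiguration()
```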
We’re now ready to setup our inputs ☺️.
Section 2: Setting Up Inputs
Apple’s documentation describes AVCaptureDevice as “a device that provides input (such as audio or video) for capture sessions and offers controls for hardware-specific capture features.”
So what’s the run down on this? This object represents a “capture device”. A capture device is a piece of hardware such as a camera or a microphone. You attach it to your capture session so it can feed its data into it.
For our purposes we need two devices: a front camera and a back camera. The simplest way to ask for a device is the class method:
func default(for mediaType: AVMediaType) -> AVCaptureDevice?
This takes in an
AVMediaType , and there are a lot of types… 🥴
As you can see, this struct isn’t just for
AVCaptureDevices, it includes stuff like subtitles and text. For reference, the one we care about is video, but we will not be going forward with this option because we have no way of specifying if we want the front camera or the back camera.
I said we care about video, but why video? Why is there no photo option? Well, cameras (digital cameras in general) don’t turn on when you try to take a photo; they turn on well before that. While you’re not taking pictures with them, they act as video capturing devices, and taking a picture is really just grabbing a frame from the video the camera is already producing.
func default(_ deviceType: AVCaptureDevice.DeviceType,
             for mediaType: AVMediaType?,
             position: AVCaptureDevice.Position) -> AVCaptureDevice?
This method introduces 2 new parameters, a position (front, back or unspecified) and a device type.
If we look under
AVCaptureDevice.DeviceType, you may be surprised at the amount of options:
Want a microphone? Great there’s only one representation for it. Want a camera? Pull out Google because you’re going to have to figure out which devices support what.
Luckily, all devices have a builtInWideAngleCamera for both front and back. You’ll have no problem sticking with this camera type and it is what we’ll be moving forward with for simplicity. In a real world application, you may want to take advantage of the other, better, camera options that the user’s device may have.
Now that we know what we’re looking for let’s get both of them and connect the back camera to the capture session (since usually camera apps open up to the back camera).
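A sketch of that input setup, assuming stored properties named backCamera, frontCamera, backInput, and frontInput (these names are my own; adjust to your project):

```swift
func setupInputs() {
    // get the wide-angle camera on each side of the device
    guard let back = AVCaptureDevice.default(.builtInWideAngleCamera, for: .video, position: .back),
          let front = AVCaptureDevice.default(.builtInWideAngleCamera, for: .video, position: .front) else {
        fatalError("no wide-angle cameras available")
    }
    backCamera = back
    frontCamera = front

    // wrap the devices in input objects the session can accept
    guard let backIn = try? AVCaptureDeviceInput(device: back),
          let frontIn = try? AVCaptureDeviceInput(device: front) else {
        fatalError("could not create device inputs")
    }
    backInput = backIn
    frontInput = frontIn

    // open on the back camera, like most camera apps
    if captureSession.canAddInput(backInput) {
        captureSession.addInput(backInput)
    }
}
```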
We call setupInputs() from within our
setupAndStartCaptureSession() function. We now have our inputs 😁. Now, earlier I said you connect a device to the capture session; while that is true, the way it’s connected is by turning it into an input object, an AVCaptureDeviceInput:
The abstract superclass for objects that provide input data to a capture session
These are useful for when you want to dive into ports for devices that carry multiple streams of data. You can get a better idea behind their utility here in the discussion section.
In the previous section, we discussed configuring capture sessions with some basic options and I mentioned you can get much more detailed with capture devices. Well here are your options:
- Formats — things like resolution, aspect ratio, refresh rate
- Image Exposure
- Depth data
- Torch — essentially a flashlight mode
- Transport — things like playback speed
- Lens position
- White balance
- ISO — sensitivity of image sensor
- Color spaces
- Geometric distortion correction
- Device calibration
- Tone Mapping
And most of these come with some sort of function to check if the configuration options are available. Different “cameras” have different configuration options. And remember, a “camera” is not one monolithic object such as “the back camera on my iPhone 11 Pro”. iOS devices, especially the newer ones, have multiple camera representations for their camera unit each with different capabilities.
Needless to say, once you start combining the different camera types with all the options mentioned above, your options become exponential.
It’s also worth mentioning that you have to check that you’re not stepping on your own toes when configuring
AVCaptureDevices. Turning on some options will disable others. A notable example: if you change a device’s
activeFormat, it will disable any preset you set on your capture session. Configurations made closer to the hardware, meaning lower in the data pipeline, override configurations made higher up in the pipeline.
While these options are most certainly useful, this tutorial won’t cover them. But don’t worry, it’s not hard to configure them, Apple gives a good example in their documentation here.
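Just as a taste of device-level configuration, here is a hedged sketch of toggling the torch. The lockForConfiguration()/unlockForConfiguration() pair is required around any device change; the function name is my own.

```swift
func setTorch(on: Bool) {
    guard let device = AVCaptureDevice.default(.builtInWideAngleCamera, for: .video, position: .back),
          device.hasTorch else { return }
    do {
        // device-level changes must happen inside a configuration lock
        try device.lockForConfiguration()
        device.torchMode = on ? .on : .off
        device.unlockForConfiguration()
    } catch {
        print("could not lock capture device: \(error)")
    }
}
```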
Section 3: Displaying The Camera Feed
We have our inputs, meaning that the capture session is currently receiving the video streams from the camera, but we can’t see anything!
The AVFoundation Framework, thankfully, provides us with an extremely simple way to display the video feed:
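A sketch of that, assuming a stored previewLayer property and the starter project’s switch camera button (which I’ll call switchCameraButton here) so the controls stay above the video:

```swift
func setupPreviewLayer() {
    previewLayer = AVCaptureVideoPreviewLayer(session: captureSession)
    // insert below the controls so the buttons stay visible on top of the video
    view.layer.insertSublayer(previewLayer, below: switchCameraButton.layer)
    previewLayer.frame = view.layer.frame
}
```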
This is very simple, the preview layer is just a CALayer you can create from a capture session, and add it as a sublayer into your view. All it does is present to you the video that is running through the capture session.
You can finally run your app and see something!
We now get to the discussion of sizing and aspect ratios. Different configurations will give you different dimensions. For example if I’m running the code on my iPhone X with the photo capture session preset, my back camera’s active format’s dimensions are 4032x3024. This changes depending on the configuration options. For example if you had chosen to use the max frame-rate option for the back camera on an iPhone X, you’d get 240FPS but with a much less impressive 1280x720 resolution.
The front facing camera, as you will soon see can also have different dimensions. For example if you’re running on an iPhone X with the photo capture session preset, your front resolution will be 3088x2320. That is almost the same aspect ratio as the back camera, meaning that the user won’t notice the change in size. Depending on your configurations, your aspect ratios can be all over the place. Your UI should work with all the aspect ratios the resultant preview might give.
If you want to play around with how the frames fill the preview layer, you can look into the videoGravity property on AVCaptureVideoPreviewLayer.
Section 4: Setting Up The Output & Taking A Picture
What is an output again? It’s what we attach to the capture session to be able to get data out of it. In the previous section we explored the built-in preview layer; conceptually, that too is an output.
We have 2 options here, both are
objects that output the media recorded in a capture session.
That is they provide us with the data that the capture session mediates from its input devices.
This is a super simple option called
AVCapturePhotoOutput. All you have to do is create the object, attach it to the session, and when the user presses the capture button, call
capturePhoto(with:delegate:) on it; you will receive back a photo object which you can manipulate or save with ease. Don’t get me wrong, this is a powerful class: you can take Live Photos and define a ton of options for taking the photo itself and the representation you want it in. If you are looking to just add your own UI to a regular camera to fit your app, this class is perfect.
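For reference, a sketch of that path (we won’t be using it in this tutorial); the delegate protocol is AVCapturePhotoCaptureDelegate:

```swift
let photoOutput = AVCapturePhotoOutput()
// attach during session configuration
if captureSession.canAddOutput(photoOutput) {
    captureSession.addOutput(photoOutput)
}

// when the user presses the capture button:
photoOutput.capturePhoto(with: AVCapturePhotoSettings(), delegate: self)

// AVCapturePhotoCaptureDelegate callback:
func photoOutput(_ output: AVCapturePhotoOutput,
                 didFinishProcessingPhoto photo: AVCapturePhoto,
                 error: Error?) {
    guard error == nil, let data = photo.fileDataRepresentation() else { return }
    let image = UIImage(data: data) // display or save it
}
```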
This option, AVCaptureVideoDataOutput, returns raw video frames. It is just as easy to implement, but you can take your custom camera in all sorts of directions with it, so this is the one we will be moving forward with.
A capture output that records video and provides access to video frames for processing.
If you pay attention to the object name, you’ll notice this one references video rather than photo like the previous option. This is because you get every single frame from the camera. You can decide what to do with those frames. That means that when a user presses the camera button, all you have to do is pluck the next frame that comes in. It also means that you can ditch the preview layer we’re using, but that’s in my next tutorial 😉.
As always, there are a bunch of configuration options for this output object, but we’re not concerned with them in this tutorial. Our method of interest is:
func setSampleBufferDelegate(_ sampleBufferDelegate: AVCaptureVideoDataOutputSampleBufferDelegate?,
queue sampleBufferCallbackQueue: DispatchQueue?)
The first parameter is the delegate, which will be called back with the frames; the second is the queue on which those callbacks will be invoked. The delegate is called at the frame rate of the camera (if the queue is not busy), and since you’re expected to process that callback data, it’s important for usability that this does not happen on the main (UI) thread.
If the queue is busy when a new frame becomes available, the output will drop that frame as “late”, per its alwaysDiscardsLateVideoFrames property (which defaults to true).
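The output setup can be sketched as follows; the names videoOutput and videoQueue are my own, and this assumes the view controller conforms to AVCaptureVideoDataOutputSampleBufferDelegate:

```swift
let videoOutput = AVCaptureVideoDataOutput()
// a serial background queue so frame processing never touches the UI thread
let videoQueue = DispatchQueue(label: "videoQueue", qos: .userInteractive)
videoOutput.setSampleBufferDelegate(self, queue: videoQueue)
// attach during session configuration
if captureSession.canAddOutput(videoOutput) {
    captureSession.addOutput(videoOutput)
}
```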
As you can see, setting up the output was rather easy. We now focus in on the delegate function
captureOutput and its 3 parameters.
The output specifies which output object this came from (in case you are mediating multiple
AVCaptureOutput objects with the same delegate). The sample buffer houses our video frame data. The connection specifies which connection object the data came over. We haven’t touched connections yet and we only have one output object so the only thing we care about is the sample buffer.
object containing zero or more compressed (or uncompressed) samples of a particular media type (audio, video, muxed, etc), that are used to move media sample data through the media pipeline
The full extent of this object gets rather complicated. What we care about is getting it into an image representation. How do we represent images in Cocoa applications? There are three distinct types, each part of a different framework, representing different levels of an image:
- UIImage (UIKit) — the highest level image container; you can create a UIImage out of many different image representations, and it’s the one we’re all familiar with
- CGImage (Core Graphics) — a bitmap representation of an image
- CIImage (Core Image) — a recipe for an image, which you can process efficiently using the Core Image framework
Back to the
CMSampleBuffer. Essentially it can contain a whole array of different data types, what we expect/want is an image buffer. Since it can represent many different things, the Core Media framework provides many functions to try and retrieve different representations out of it. The one we’re interested in is
CMSampleBufferGetImageBuffer(). This once again returns another unusual type, a
CVImageBuffer . Now from this image buffer, we can get a
CIImage out of it, and since a
CIImage is just a recipe for an image, we can create a
UIImage out of it.
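That conversion chain, inside the delegate callback, might look like the following sketch. The takePicture flag and the capturedImageView outlet (the starter’s view for the last taken picture) are assumed names:

```swift
var takePicture = false // flipped to true by the capture button

func captureOutput(_ output: AVCaptureOutput,
                   didOutput sampleBuffer: CMSampleBuffer,
                   from connection: AVCaptureConnection) {
    // ignore frames unless the user just asked for a picture
    guard takePicture else { return }
    // pull the image buffer out of the sample buffer
    guard let cvBuffer = CMSampleBufferGetImageBuffer(sampleBuffer) else { return }
    // CVImageBuffer → CIImage (a recipe) → UIImage (displayable)
    let ciImage = CIImage(cvImageBuffer: cvBuffer)
    let uiImage = UIImage(ciImage: ciImage)
    DispatchQueue.main.async { [weak self] in
        self?.capturedImageView.image = uiImage
        self?.takePicture = false
    }
}
```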
As you can see, we’ve added a boolean flag to determine whether or not we need to use the sample buffer that came back. If we run it, we’ll get back our first picture 🎉.
Unfortunately the orientation is incorrect. While the video preview layer automatically displays the correct orientation, the data coming through the
AVCaptureVideoDataOutput object does not. We can fix this in 2 places, on the connection itself (that is the connection between the output object and the session) or when we create the UIImage from the CIImage. We will change this on the connection itself.
A connection between a specific pair of capture input and capture output objects in a capture session.
Earlier I mentioned that connections are formed through the capture session when you attach inputs and outputs. Well those connections are objects themselves, we’ve implicitly created them through
addOutput() on the capture session. The connection objects can be accessed anywhere throughout the data pipeline (the inputs, the capture session, and the outputs). Under Managing Video Configurations in the documentation, we have the option to set a
videoOrientation on our output connection.
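Setting it is a one-liner, assuming the videoOutput name from earlier:

```swift
// after adding videoOutput to the session:
videoOutput.connection(with: .video)?.videoOrientation = .portrait
```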
We now have the correct video orientation for when we take a picture.
Section 5: Switching Cameras
Okay, so we’ve established the whole capture pipeline enabling us to both display and take images, but how do we change the camera?
If you recall, the capture session mediates inputs to outputs. We’ve just covered outputs (for taking the picture) and before that we got the input devices and formed the input objects for them. Since we’ve stored 2 references, to both the back and the front camera, all we need to do is reconfigure the session object.
The only caveat here is that since we’ve changed the inputs, the connection objects have changed, so we need to reset the video output connection’s video orientation back to portrait.
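A sketch of the switch, reusing the backInput/frontInput properties from earlier and a hypothetical backCameraOn flag to track which camera is live:

```swift
func switchCameraInput() {
    // queue the changes so they apply atomically
    captureSession.beginConfiguration()
    if backCameraOn {
        captureSession.removeInput(backInput)
        captureSession.addInput(frontInput)
        backCameraOn = false
    } else {
        captureSession.removeInput(frontInput)
        captureSession.addInput(backInput)
        backCameraOn = true
    }
    // the connection was recreated, so re-apply the orientation
    videoOutput.connection(with: .video)?.videoOrientation = .portrait
    captureSession.commitConfiguration()
}
```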
If you run it and take a picture, you might notice that the preview layer we use shows the video from the front facing camera mirrored while our output object does not return a mirrored video. This is the same case as the orientation. The preview layer automatically handles it since it is a “higher level object” whereas it is not handled in our “lower level” output object. To fix this, you can set the
isVideoMirrored property on the connection based upon which camera is currently showing.
Let’s fix that up really quickly, just how we fixed the video orientation earlier.
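With the same hypothetical backCameraOn flag, mirroring only the front camera is one more line inside the camera-switching code:

```swift
// mirror the output connection only when the front camera is active
videoOutput.connection(with: .video)?.isVideoMirrored = !backCameraOn
```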
And we’re now done 🎉. We have a template for taking pictures using a custom camera which allows us to fit it inside whatever type of UI we want.
The complete part 1 can be found on my GitHub:
- Explore more camera features. Remember the “Configuring Capture Device” section? Well it wasn’t just there to overwhelm you, you can use those to expand your camera’s functionality.
- Videos! We only use the frames we get from our output object when the user wants to take a picture. The rest of the time, those frames are going unused! Capturing video, while not trivial, is not too difficult either. All it implies is that you bundle together the video frames into a file. This is also a great opportunity to explore using audio devices in iOS.
- Implement filters into your camera! This cough cough is a plug for my follow-up tutorial.
If you’ve enjoyed this tutorial and would like to take your camera to the next level check-out my tutorial for applying filters!
Using CIFilters & Metal To Make A Custom Camera In iOS
Leveraging Metal and Core Image to implement fast and efficient filters for your app’s camera
Interested in learning about graphics on iOS? Check out my introduction to using Metal Shaders.
Already familiar with Metal, but want to see how you can leverage it to do some cool things? Check out my tutorial on audio visualization.
As always, if you have any question or comments, feel free to leave them below ☺️.