Getting Started with the Structure Sensor

This tutorial guides you through building an iOS application with the Structure Sensor and obtaining a coloured point cloud from it.

Background Knowledge

Before we start coding, we need to understand what output to expect from our sensor; in other words, what the sensor does. There is a great post named Accuracy and utility of the Structure Sensor for collecting 3D indoor information that explains how the Structure Sensor works in great detail. Please take a look at it if you are interested. Below I will summarise the essential points.

With the sensor mounted on our iPad, there are two cameras: the native RGB camera and the external Structure Sensor. The Structure Sensor itself carries an infrared (IR) projector and an IR sensor on opposite sides, as shown below:

When we take a measurement, the Structure Sensor gives us depth information only, like the graph below. In plain terms, the IR projector casts infrared light onto an object, and the IR sensor receives the reflected light; the depth is then computed from the angle of the reflection (triangulation). Since the Structure Sensor provides depth data only, we will use a sensor-fusion technique to combine the RGB data and the depth data.

Figure 2 shows the geometry of converting disparity to depth when the target point (black dot in the figure) is projected at depth Z on a plane farther from the IR speckle pattern reference plane (green dot). The blue circle shows the location of the IR camera at a distance b = 65 mm from the IR projector, the red circle. The purple circle represents the iPad’s RGB camera, at a distance c = 6.5 mm from the IR camera (Occipital 2014), as allowed by the precision bracket accessory mounted on the iPad.

With that knowledge in mind, and a scenario for how you would use the sensor, we can start coding.

I will outline the whole pipeline first:

  1. Initialization
  2. Sensor start streaming
  3. Application receives streaming output
  4. Processing for your specific usage
  5. Stop streaming

Then we will address each step one-by-one.


Adopt STSensorControllerDelegate in your view controller; this lets you implement the built-in methods from the Structure SDK. Then we can declare a global shared instance, named sharedController, to handle all upcoming calls.

This is how we obtain the application-level shared sensor (sharedController) instance.

Then initialization takes just one line; you can also put it in your initializer if you prefer.
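Put together, the setup might look like the following minimal sketch, assuming the Structure SDK's STSensorController API (method and case names may differ slightly between SDK versions):

```swift
import UIKit

class ViewController: UIViewController, STSensorControllerDelegate {

    // Application-level shared sensor instance
    let sharedController = STSensorController.shared()

    override func viewDidLoad() {
        super.viewDidLoad()
        sharedController.delegate = self
        // One-line initialization of the sensor connection
        let result = sharedController.initializeSensorConnection()
        if result != .success && result != .alreadyInitialized {
            print("Sensor connection failed: \(result)")
        }
    }
}
```

The delegate protocol has further required methods (connect/disconnect callbacks and so on), which we will meet below.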


Start Streaming

First, we need to switch on our sensor. To do this, you must specify what kind of output you want to receive, which is controlled by the streaming options. Once streaming has started, you will see the red light from the IR projector.

The most important attribute is kSTFrameSyncConfig, which has three options:

  1. STFrameSyncConfig.off — — No RGB data will be used.
  2. STFrameSyncConfig.depthAndRgb — — Use depth and RGB data.
  3. STFrameSyncConfig.infraredAndRgb — — Use infrared and RGB data.

Hence, we can enrich our data by fusing it with the RGB camera when STFrameSyncConfig is not set to off, or we can use depth data alone if that is sufficient.
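With those options, starting the stream could look like this minimal sketch; the kSTStreamConfigKey resolution value and the Swift bridging of startStreamingWithOptions:error: are assumptions and may differ in your SDK version:

```swift
let options: [AnyHashable: Any] = [
    kSTStreamConfigKey: STStreamConfig.depth640x480.rawValue,  // assumed resolution
    kSTFrameSyncConfigKey: STFrameSyncConfig.depthAndRgb.rawValue
]

do {
    // Switch the sensor on; the projector's red light becomes visible
    try sharedController.startStreaming(options: options)
} catch {
    print("Failed to start streaming: \(error)")
}
```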

Application receives streaming output

Once the sensor is streaming, there are four main delegate methods for receiving its output:

I suggest implementing all of them at first, to see for yourself what happens.
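For reference, the delegate methods I am referring to look roughly like this (names from the Structure SDK; signatures may vary slightly between SDK versions):

```swift
// Called when the sensor is plugged in / unplugged
func sensorDidConnect() { }
func sensorDidDisconnect() { }

// Depth-only output; fires when no RGB frame sync is active
func sensorDidOutputDepthFrame(_ depthFrame: STDepthFrame!) { }

// Synchronized depth + colour output; fires when frame sync is active
// and the app feeds colour frames to the controller
func sensorDidOutputSynchronizedDepthFrame(_ depthFrame: STDepthFrame!,
                                           colorFrame: STColorFrame!) { }
```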

If you run your app now, you may find that you receive output from the sensorDidOutputDepthFrame method, which contains only depth information, rather than from sensorDidOutputSynchronizedDepthFrame. If you want the RGB data as well, the latter is triggered by the following method:


Hence, for sensor fusion, we also need to activate the RGB camera (to obtain a sampleBuffer). If you do not know how to do this, you can refer to other posts such as this one. We will use the receiver method below (the input buffer must be of type CMSampleBuffer; you may need to convert it yourself if you use another capture method):

func captureOutput(_ output: AVCaptureOutput, didOutput sampleBuffer: CMSampleBuffer, from connection: AVCaptureConnection) {
    // Feed each RGB frame to the Structure sensor for synchronization
    sharedController.frameSyncNewColorBuffer(sampleBuffer)
}

We simply pass the sampleBuffer to the frameSyncNewColorBuffer() method; sensorDidOutputSynchronizedDepthFrame will then be triggered.

func sensorDidOutputSynchronizedDepthFrame(_ depthFrame: STDepthFrame!, colorFrame: STColorFrame!) {
    // Align the depth frame to the colour frame
    let alignedDepthFrame: STDepthFrame = depthFrame.registered(to: colorFrame)
    // YOUR OWN PROCESSING ON alignedDepthFrame
}

After we receive the depthFrame along with the colorFrame, aligning the depthFrame to the colorFrame is necessary for an accurate result.

You can think of a depthFrame as an array of depth values, in which you can look up the depth z of a point by its coordinates (x, y). Similarly, a colorFrame is an array of RGB colours, and you can look up the colour of a pixel by its coordinates as well.

For details, you can refer to Align Depth and Color Frames — Depth and RGB Registration.

Then you can start doing your own work on the alignedDepthFrame.

Here is a small example of extracting point cloud data from an STDepthFrame. According to the documentation, alignedDepthFrame.depthInMillimeters is a 1-D array of length alignedDepthFrame.width * alignedDepthFrame.height. We will iterate through it to obtain all the depth values.

// Get the depth data as an array we can iterate over
let pointer: UnsafeMutablePointer<Float> = UnsafeMutablePointer(mutating: alignedDepthFrame.depthInMillimeters)
let count = Int(alignedDepthFrame.width * alignedDepthFrame.height)
let depthArray = Array(UnsafeBufferPointer(start: pointer, count: count))

Then we can read the depth value from the array.

// Iterate through the depth array
let width = Int(alignedDepthFrame.width)
let height = Int(alignedDepthFrame.height)
for r in 0..<height {
    for c in 0..<width {
        let pointIndex = r * width + c
        let depth = depthArray[pointIndex]
        // The sensor cannot detect objects that are too far away or too
        // close; the depth value is NaN in that case.
        if !depth.isNaN {
            let x = r
            let y = c
            let z = depth
            // Store (x, y, z) somewhere you like
        } else {
            // Handle empty values as you wish:
            // ignore them or insert a placeholder
        }
    }
}

So far, you have obtained a pixel image with depth rather than colours. To make it an RGBD image, we need to get the colour information from the colorFrame.

// Extract the CVPixelBuffer from the colorFrame
let pixelBuffer: CVPixelBuffer = CMSampleBufferGetImageBuffer(colorFrame.sampleBuffer)!
// Lock the buffer before reading
CVPixelBufferLockBaseAddress(pixelBuffer, CVPixelBufferLockFlags(rawValue: 0))
let baseAddress = CVPixelBufferGetBaseAddress(pixelBuffer)
let buffer = baseAddress!.assumingMemoryBound(to: UInt8.self)
// ... read the pixel values here, while the buffer is still locked ...
// Unlock the buffer when you are done reading
CVPixelBufferUnlockBaseAddress(pixelBuffer, CVPixelBufferLockFlags(rawValue: 0))

The pixel values can then be read from the buffer. It is important to lock and unlock the pixel buffer around any reads, or you will see a lot of errors. We can then get the RGB data by the coordinates; remember that this must happen while the buffer is locked.

let bytesPerRow = CVPixelBufferGetBytesPerRow(pixelBuffer)
// x is the row index, y the column index; each pixel occupies 4 bytes
let index = (x * bytesPerRow) + y * 4
let b = buffer[index]
let g = buffer[index + 1]
let r = buffer[index + 2]

The whole buffer is laid out as […, b, g, r, a, b, g, r, a, …] (BGRA format), which is why each pixel occupies 4 bytes and why we multiply the column index y by 4. With that, an RGBD image has been obtained.
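The index arithmetic can be wrapped in a small plain-Swift helper. Note that bytesPerRow can include per-row padding, which is why it should come from CVPixelBufferGetBytesPerRow rather than being computed as width * 4:

```swift
// Byte offset of a pixel's first (blue) channel in a BGRA buffer.
// bytesPerRow may include per-row padding, so it is passed in.
func bgraIndex(row: Int, col: Int, bytesPerRow: Int) -> Int {
    return row * bytesPerRow + col * 4
}

// Example: 640 pixels per row, 4 bytes per pixel, no padding
let i = bgraIndex(row: 2, col: 3, bytesPerRow: 640 * 4)
// buffer[i] = blue, buffer[i + 1] = green,
// buffer[i + 2] = red, buffer[i + 3] = alpha
```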

The next step is to obtain the actual point cloud from the RGBD image. How do we recover exact real-world measurements? Since we are getting data from a camera, we can measure what has been scanned by using the camera's internal parameters, known as the intrinsics.

The intrinsics depend on which camera you are using and vary from device to device. The intrinsics I am using are for the iPad Air 2 and come from here; the code snippet is from here.

// Scale the VGA calibration values to the depth frame's resolution
let _fx = Float(VGA_F_X) / Float(VGA_COLS) * Float(depthFrame.width)
let _fy = Float(VGA_F_Y) / Float(VGA_ROWS) * Float(depthFrame.height)
let _cx = Float(VGA_C_X) / Float(VGA_COLS) * Float(depthFrame.width)
let _cy = Float(VGA_C_Y) / Float(VGA_ROWS) * Float(depthFrame.height)
// Back-project pixel (r, c) with depth z to camera-space x, y:
let x = Double(depth * (Float(r) - _cx) / _fx)
let y = Double(depth * (_cy - Float(c)) / _fy)
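For clarity, here is the same back-projection as a self-contained function following the standard pinhole camera model. The VGA_* numbers below are placeholders for illustration, not the real iPad Air 2 calibration:

```swift
// Pinhole back-projection: pixel (r, c) with depth z (mm)
// to camera-space x, y (mm), given focal lengths fx, fy and
// principal point cx, cy.
func backProject(r: Int, c: Int, depth: Float,
                 fx: Float, fy: Float,
                 cx: Float, cy: Float) -> (x: Float, y: Float, z: Float) {
    let x = depth * (Float(r) - cx) / fx
    let y = depth * (cy - Float(c)) / fy
    return (x, y, depth)
}

// Placeholder VGA intrinsics, scaled to a 320x240 depth frame
let fx: Float = 570.0 / 640.0 * 320.0   // = 285
let fy: Float = 570.0 / 480.0 * 240.0   // = 285
let cx: Float = 320.0 / 640.0 * 320.0   // = 160
let cy: Float = 240.0 / 480.0 * 240.0   // = 120

let point = backProject(r: 100, c: 50, depth: 1000.0,
                        fx: fx, fy: fy, cx: cx, cy: cy)
// point.z is the input depth (1000 mm); x and y follow the formulas above
```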

At this point, you are able to get the point cloud from the sensor.

Stop Streaming

After we have finished our work, shutting down the sensor takes a single command:
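Assuming the same sharedController instance as before, that command would be:

```swift
// Stop the depth stream and power down the sensor
sharedController.stopStreaming()
```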


That covers the whole pipeline for using the Structure Sensor.