Adventures in the machine-learning land of drones & lidars, part I
This is the first part of a two-part text. It is written for an audience with a background in computer graphics, but to make the text more accessible, I have written a high-level TL;DR below with key takeaways and added links throughout the text that explain key concepts.
Two years ago, I had a chat with KDI’s CTO about the use of our in-house rendering engine COGS to produce training data for machine learning. KDI has a 3D visualization group and a machine learning group sitting next to each other, so pulling off something like this would likely fuel some interesting synergies. I was encouraged to investigate this, and I did, and it ended up as a little side-project of mine.
To make “training data for machine learning” a bit more concrete, I focused on cases that had been discussed in KDI. In particular, I wanted to simulate an aerial drone, fitted with cameras, sensors, and a rotating lidar. One case was to navigate such a drone inside a metal construction without lights, compass, or GPS. Another case was to navigate along a power-line and inspect insulators on transmission towers. My aim was to do minimal proofs-of-concept, getting an idea of what was possible. I wanted a test bench for quick experimentation, with direct access from Python, where one could experiment with machine learning algorithms.
This text is a kind of development journal written in retrospect. Quite a few bits and pieces had to fit together, and my intention here is to go through some of the interesting ones and discuss what the challenges were, how I chose to solve them, and share the observations I made along the way.
TL;DR Microsoft AirSim is a simulator for autonomous vehicles, which can run in both Unreal Engine and Unity to provide sandboxes for experimenting with autonomous vehicles and machine learning. I have added AirSim to our in-house rendering engine, named COGS, in a similar fashion, making our own sandbox. In this sandbox, I added an emulator for a rotating lidar. The approach here was to enclose the lidar’s 360° field-of-view using four cameras rendering depth images, and from these calculate the lidar returns. The lidar returns are broadcast over the local network the same way as a real lidar, so actual lidar software can receive and process the data. The sweeping motion of the lidar is visualized using dots at the immediate returns. In addition, returns are also aggregated to build a 3D map of the environment as seen by the lidar. Figure 1 shows all pieces put together.
The COGS engine
To have some structure to peg the presented pieces onto, I’ll start with a very brief overview of relevant parts of our in-house rendering engine COGS. A valid question might be why we have a custom engine when excellent and widely used game engines like Unity and Unreal exist. One answer is vertical integration: COGS is a small engine where we have overview and control over all the parts; it is easy to extend or tweak COGS to accommodate support for novel data sources, such as echosounder data or reservoir models. And it is designed to smoothly embed into larger applications. That, in a way, sums up the modus operandi of the KDI 3D visualization group: create industry applications that provide new insight into industrial data sources through visualization. The obvious downside is of course that we have to fix everything ourselves and don’t have the support of a large existing infrastructure.
Onto more technical details: everything managed by the engine is typically either a resource (model, texture, shader) or an entity (e.g. things in the scene, things with behaviour). COGS is based around an entity-component system, where an entity is a set of components, and the components carry the state of the entity. Components come in different kinds: each kind handles some aspect of entity behaviour, is stored contiguously in memory, and is processed by its own system. The systems run one by one, once per frame, in a fixed order. When the systems are done, the renderer builds renderlists for that frame, which it subsequently executes. The concept of a scene-graph is limited to the SceneComponent, which contains pointers to child entities and is used to compose transforms.
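As a condensed illustration, the entity-component pattern described above can be sketched like this. The class names and the toy additive "transform" are mine for illustration, not actual COGS API:

```python
# Illustrative sketch of an entity-component system with a scene component
# that composes transforms down a child hierarchy. A real engine would use
# matrices; a 3D offset stands in for a transform here.
class SceneComponent:
    def __init__(self, local_offset=(0.0, 0.0, 0.0), children=None):
        self.local_offset = local_offset      # toy stand-in for a local transform
        self.children = children or []        # ids of child entities
        self.world_offset = (0.0, 0.0, 0.0)   # composed result

class SceneSystem:
    """Each component kind is processed by its own system, once per frame."""
    def __init__(self):
        self.components = {}                  # entity id -> SceneComponent

    def update(self, roots):
        for root in roots:
            self._compose(root, (0.0, 0.0, 0.0))

    def _compose(self, entity, parent_offset):
        # Compose the parent transform with the local one, then recurse.
        c = self.components[entity]
        c.world_offset = tuple(p + l for p, l in
                               zip(parent_offset, c.local_offset))
        for child in c.children:
            self._compose(child, c.world_offset)
```

A drone entity with a rotor child would, for example, place the rotor relative to the drone's own transform.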
Tasks may be offloaded to pools of worker threads through a task system. There are two typical patterns: One pattern is that a system issues a set of tasks at the start of the frame, and synchronizes to these tasks at the end just before the renderer takes control. Another pattern is that a system issues long-running tasks, and polls every frame on the state of running tasks before potentially issuing new tasks. Resources are loaded and processed in background tasks as well, and once per frame there is a slot where resources can be swapped into the foreground.
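The first pattern, issuing tasks at the start of the frame and synchronizing just before the renderer takes over, might look roughly like this. A Python thread pool stands in for the COGS task system, and all names are illustrative:

```python
# Sketch of the fork-at-frame-start / join-at-frame-end task pattern.
from concurrent.futures import ThreadPoolExecutor

pool = ThreadPoolExecutor(max_workers=4)

def process_chunk(chunk):
    return sum(chunk)  # placeholder for per-chunk background work

def run_frame(data_chunks):
    # Frame start: issue one task per chunk of work.
    futures = [pool.submit(process_chunk, c) for c in data_chunks]
    # ... other per-frame systems would run here, overlapping the tasks ...
    # Frame end: synchronize with the tasks before rendering takes control.
    return [f.result() for f in futures]
```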
The core of the COGS engine is relatively small. Project and domain-specific functionality is typically provided by extensions, which define new components and systems. These extensions are typically dynamic libraries, which can be loaded and initialized on demand, e.g. from scene description files. Most of the functionality presented in this blog-post has been added to COGS as such extensions.
Integrating Microsoft AirSim
Microsoft AirSim is an open-source simulator for drones and other autonomous vehicles. The actual simulator resides in AirLib, and AirSim has separate projects that integrate AirLib into Unreal and Unity. My job was to integrate AirLib into COGS in a similar way. An AirSim simulation is set up by instantiating an AirLib physics engine, which is used to instantiate an AirLib physics world. AirLib’s MultiRotor class implements a flying drone, and a drone can be put into the world by adding a connector object that implements the UpdateableObject interface and returns a MultiRotor object from its getPhysicsBody function. To let the drone be managed through the AirSim RPC interface, the connector object must also implement VehicleApiBase.
An entity in COGS gets drone behaviour by adding an AirSimComponent. The AirSimComponent has properties that enable or disable the simulation, select which firmware to use (AirLib’s SimpleFlight or a Pixhawk flight controller), enable forwarding of COGS inputs (e.g. from a keyboard or an Xbox controller) as RC control signals, and reference entities that pose as rotors and should spin according to the simulation. AirSimComponents are managed by the AirSimSystem, which owns an AirLib world with a physics engine. The system pairs each AirSimComponent with a connector object that manages a MultiRotor object. Once per frame, the drone’s position, attitude, and rotor state are fetched from the AirLib simulation, and the transforms of the drone and rotor entities are updated. In addition, events from the COGS input system and collision events are forwarded to the connector, which sets the corresponding AirLib state. I chose to keep sensor simulation (lidar and various camera captures) outside the AirSim integration.
The physics integration was interesting. COGS has an extension with the Bullet physics engine, and AirLib has its own physics engine that handles the drone flight. The tricky part was to mediate collisions between these two concurrent engines, which run at vastly different frequencies. I added a btGhostObject to represent the drone in Bullet, with its transform controlled by AirLib. Each frame, all relevant contact manifolds from Bullet get traversed, forwarding collision events to AirLib and applying impulses in Bullet. Thus the drone could crash into objects, moving them or tipping them over, while objects pushed back at the drone. Quite fun, but I won’t vouch for the accuracy. Nevertheless, the case where either the drone or the obstacle was immovable worked quite well, as this makes collisions much easier to resolve.
However, full physics interaction wasn’t necessary. I needed interaction with an immovable ground during take-off and landing, and I wanted to know if the drone crashed into something. I added a mode that forwarded the ground plane below the drone position instead of collision events. The ground plane was provided by a GroundRefComponent, which either held a fixed plane or was updated by the vegetation provider (discussed in the second post). AirLib’s UpdateableObject interface includes an update method that is invoked once every time-step, and in this method I checked if any of the drone’s legs pierced the ground, and if so, set collision events. This approach worked very well, providing a stable interaction in the critical phase where the flight controller has to determine whether or not the drone is resting on the ground using sensor readings.
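A minimal sketch of the per-time-step ground test, assuming the ground plane is given as a unit normal n and offset d (so n·x + d = 0 on the plane) and the leg tips as points; the function names are mine, not AirLib's:

```python
# Sketch of the ground-contact test: at each physics time-step, check
# whether any of the drone's legs pierce the ground plane.
def leg_distances(leg_positions, normal, d):
    """Signed distance of each leg tip to the plane; negative = below ground."""
    return [sum(n_i * p_i for n_i, p_i in zip(normal, p)) + d
            for p in leg_positions]

def ground_collision(leg_positions, normal, d):
    """True if any leg is on or below the ground plane."""
    return any(dist <= 0.0 for dist in leg_distances(leg_positions, normal, d))
```

When this returns true, the real implementation sets collision events on the AirLib side, letting the flight controller settle into its on-ground state.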
The drone movement produced by AirSim is quite smooth. To create more challenging scenarios for pose estimation, I added a WobbleComponent, which makes its entity oscillate, twist and turn. This motion is configured by magnitude and frequency. Typically, the wobble entity is a child of the AirSim entity, while the drone model and sensor rig are children of the wobble entity. The motion is formed by keyframes for position and rotation axes. The position is calculated using a Catmull-Rom spline on four keyframes, while the orientation is calculated by first slerp'ing two rotation axis keyframes, and then rotating the entity around this axis by the magnitude times the cosine of the animation time. New position keyframes were set to three normally distributed random numbers scaled by magnitude, and rotation axis keyframes were set using sphere point picking. If the angle between two consecutive axis keyframes was close to π, the new axis was negated to avoid degenerate slerps. The WobbleComponent can reference a drone entity and disable the wobbling when the drone sits on the ground.
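The keyframe machinery can be sketched as follows; this is a condensed illustration with my own function names, not the actual WobbleComponent code:

```python
# Sketch of the wobble keyframes: positions via a Catmull-Rom spline through
# random keyframes, rotation axes via sphere point picking with negation of
# nearly opposite axes to avoid degenerate slerps.
import math, random

def catmull_rom(p0, p1, p2, p3, t):
    """Interpolate between p1 and p2 for t in [0, 1], per component."""
    return tuple(
        0.5 * (2*b + (-a + c)*t + (2*a - 5*b + 4*c - d)*t*t
               + (-a + 3*b - 3*c + d)*t*t*t)
        for a, b, c, d in zip(p0, p1, p2, p3))

def random_position_keyframe(magnitude, rng=random):
    """Three normally distributed numbers scaled by magnitude."""
    return tuple(rng.gauss(0.0, 1.0) * magnitude for _ in range(3))

def random_axis_keyframe(prev_axis, rng=random):
    """Uniform point on the unit sphere (sphere point picking); negated if
    nearly opposite the previous axis, to avoid a degenerate slerp."""
    z = rng.uniform(-1.0, 1.0)
    phi = rng.uniform(0.0, 2.0 * math.pi)
    r = math.sqrt(max(0.0, 1.0 - z * z))
    axis = (r * math.cos(phi), r * math.sin(phi), z)
    if prev_axis is not None:
        if sum(a * b for a, b in zip(prev_axis, axis)) < -0.99:
            axis = tuple(-a for a in axis)
    return axis
```

The threshold -0.99 (about 172°) is a placeholder for whatever "close to π" means in the actual code.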
Emulating a lidar puck
Velodyne’s VLP-16 lidar puck is power efficient and compact, and thus a good fit for drones. Both the datasheet and the manual are available at the product’s website. It has a vertical fan of 16 lasers with altitudes from -15° to +15°, producing 300,000 samples per second (twice that with dual return). The fan rotates at either 5Hz, 10Hz or 20Hz, yielding azimuthal resolutions of about 0.1°, 0.2° and 0.4°. The maximum distance of returns is 100m. As the beam has a width, the returns are not point-samples, but samples over an area. A guesstimate of this width is 0.05°, half the minimum angular resolution, which at 100m is ∼9cm. The laser might partially penetrate a thin surface, and there might be multiple surfaces that reflect inside the beam-width area, so the result is not a single measure but a sequence of returns of different magnitudes. The VLP-16 can return either the strongest or the last return of this sequence. Multiple returns are particularly prominent when measuring through foliage. Data are broadcast as UDP packets via its 100Mbps ethernet port, and the protocol is documented in the manual. Velodyne also provides the VeloView software to visualize samples received from the lidar in realtime.
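The resolutions and the beam footprint quoted above follow directly from the datasheet numbers, and can be sanity-checked with a few lines:

```python
# 300,000 single-return samples/s over 16 lasers gives the number of
# 16-laser "fans" fired per second; dividing a full rotation by the
# fans per rotation gives the azimuthal resolution in degrees.
import math

SAMPLES_PER_SECOND = 300_000
LASERS = 16

def azimuthal_resolution_deg(rotation_hz):
    fans_per_second = SAMPLES_PER_SECOND / LASERS      # 18,750 fans/s
    fans_per_rotation = fans_per_second / rotation_hz
    return 360.0 / fans_per_rotation

def beam_footprint_m(beam_width_deg, distance_m):
    """Approximate linear footprint of the beam width at a given distance."""
    return distance_m * math.tan(math.radians(beam_width_deg))
```

At 5Hz this gives 0.096°, rounding to the 0.1° quoted above, and a 0.05° beam at 100m covers roughly 9cm.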
The emulator needs to cast and process at least 300,000 rays per second. This work was done well before RTX was announced, so GPU-based raytracing wasn’t the immediate solution to such a problem, and I tackled it using rasterization. The rays are quite coherent, but I had to resample from the rasterization’s planar projection to the lidar’s equiangular projection. There is a subtlety in reconstructing depth from four adjacent samples: Bilinear interpolation works well along relatively planar surfaces, but around discontinuities we get samples far away from any geometry, see Figure 5. These samples appear as returns from non-existing geometry, and are very noticeable. To overcome this, the reconstruction filter also takes normals into account. If the four samples are close to coplanar, bilinear interpolation is used; otherwise, nearest neighbour is used. Experimenting with more samples would be interesting, as would dithered transparency and using the distribution of samples in the reconstruction to estimate both the strongest and the last return.
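A sketch of this reconstruction filter is below. The particular coplanarity test (similar normals and a small relative depth spread) is one plausible formulation of my own; the thresholds and the exact criterion in the actual shader may differ:

```python
# Sketch of depth reconstruction from four adjacent samples: bilinear
# interpolation when the samples look coplanar, nearest neighbour around
# discontinuities to avoid returns from non-existing geometry.
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def reconstruct_depth(depths, normals, u, v,
                      normal_thresh=0.95, depth_thresh=0.1):
    """depths: 4 samples [d00, d10, d01, d11]; normals: matching unit
    normals; (u, v): fractional position within the 2x2 sample quad."""
    d00, d10, d01, d11 = depths
    coplanar = (min(dot(normals[0], n) for n in normals[1:]) > normal_thresh
                and max(depths) - min(depths) < depth_thresh * max(depths))
    if coplanar:
        return ((d00 * (1 - u) + d10 * u) * (1 - v)
                + (d01 * (1 - u) + d11 * u) * v)
    # Nearest neighbour around a discontinuity.
    idx = (1 if u > 0.5 else 0) + (2 if v > 0.5 else 0)
    return depths[idx]
```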
A single lidar-enabled camera is created by adding a LidarEmulatorComponent. Parameters are the number of samples and the azimuthal and altitudinal opening angles, and from these it calculates a frustum that covers the beams so that every beam is fully surrounded by four samples. The mismatch between planar and equiangular projections grows with increasing aperture, and 90° is a practical maximum. The component also configures a special lidar-emulator pipeline, which renders a depth image that is resampled to the lidar’s equiangular projection and read back to CPU memory; all operations are pipelined using multiple sets of buffers to avoid flushing the pipeline.
The full lidar puck is emulated by the PuckSystem, where the PuckComponent has properties for rotation speed, enabling UDP broadcast, and configuring added noise and drop-outs of returns. The PuckSystem creates and manages four entities with LidarEmulatorComponents to provide full 360° azimuthal cover. The system sets a callback for when lidar data has been downloaded, which inserts the new data into a common buffer covering all laser directions. Optionally, Gaussian noise can be added to returns, and returns can be randomly discarded. To quickly generate enough random numbers, uniform random numbers are generated with a simple linear congruential generator. Gaussian random numbers are approximated by adding 12 uniform random numbers and subtracting 6: the sum of n uniform random numbers has the Irwin-Hall distribution, with mean n/2 and variance n/12. With n=12, the variance is 1, which makes scaling slightly simpler, and the shape of the distribution is close enough to Gaussian for this use.
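A sketch of this noise generation is below. The LCG constants here are the common Numerical Recipes ones, chosen for illustration; I am not claiming these are the constants the actual code uses:

```python
# Cheap uniform numbers from a linear congruential generator, and an
# approximate standard normal as the sum of 12 uniforms minus 6 (Irwin-Hall
# with n=12 has mean 6 and variance 1, so the shift alone standardizes it).
class Lcg:
    def __init__(self, seed=1):
        self.state = seed

    def next_uniform(self):
        """Uniform number in [0, 1)."""
        self.state = (1664525 * self.state + 1013904223) & 0xFFFFFFFF
        return self.state / 2**32

    def next_gaussian(self):
        """Approximately standard normal via the Irwin-Hall trick."""
        return sum(self.next_uniform() for _ in range(12)) - 6.0
```

One nicety of this approximation is that its support is bounded to [-6, 6], so it never produces the extreme outliers a true Gaussian occasionally would.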
The puck system always has the full set of samples for all directions, but three frames delayed. Thus, the PuckSystem also keeps track of the poses at the time a capture was taken. This data is available in the Python client (detailed in the next post). In addition, the PuckSystem can optionally keep a thread running that broadcasts lidar returns over UDP. This thread keeps track of the azimuth direction an actual rotating lidar would have, transmitting a datagram with 24 fans and sleeping until it is time to send a new datagram, in a loop. This allows the emulator to be used with actual lidar software; e.g. Figure 4 right shows the VeloView application visualizing lidar returns from the emulator.
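The pacing of this broadcast loop follows from the numbers given earlier. With the nominal 300,000 samples/s and 16 lasers per fan, a datagram carrying 24 fans must go out every 1.28 ms to match a real puck's data rate:

```python
# Derive the datagram interval from the protocol numbers quoted above.
SAMPLES_PER_SECOND = 300_000
LASERS_PER_FAN = 16
FANS_PER_DATAGRAM = 24

def datagram_interval_s():
    fans_per_second = SAMPLES_PER_SECOND / LASERS_PER_FAN   # 18,750 fans/s
    return FANS_PER_DATAGRAM / fans_per_second              # 0.00128 s
```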
In retrospect, now that GPU raytracing is available, would I approach this differently? Rasterization is just a very fast way of raytracing a set of beams spread uniformly over a planar projection, and is often employed as an optimization for the first hit; the actual shading process is basically the same. Raytracing allows rays to be cast in any direction, and thus avoids the resample-to-equiangular pass, with its associated error. A question is whether the resampling error is significant compared to the error from approximating an area sample. Adding samples reduces both errors. An interesting thought for rasterization is to use variable rate shading to sample more where the distortion is large. Raytracing allows refraction and reflection, which might approximate returns from vegetation more faithfully, where light may bounce around. However, this would require a stochastic path tracer with a significant number of rays per lidar beam. And since the continuum of the lidar response is quantized into one or a few values, the impact of refraction and reflection might be marginal in many cases. Also, partial occlusion and translucency are manageable with rasterization. The cost of raytracing, besides casting rays, is that meshes must have an acceleration structure in addition to bounding volume hierarchies, and these must be maintained. So, if the engine already employed raytracing for e.g. shadows, reflections or ambient occlusion, and the cost of acceleration structures thus already was paid, I would definitely use raytracing. If only the lidar were to use raytracing, I’m undecided whether I would introduce it.
Visualizing lidar returns
I also wanted to visualize the lidar returns directly in COGS. A convenient way to visualize point data in COGS is to use the two components PointDataComponent and PointVisualizationComponent. The PointDataComponent turns the entity into something that holds a view into a buffer of point positions. The PointVisualizationComponent has a reference to a material resource and a reference to an entity; if this entity contains a PointDataComponent, it sets up a mesh resource with these positions as vertex data, which gets rendered with the referenced material.
In the PuckSystem, if the puck entity has a PointDataComponent, the system populates this buffer: from the current time and the time since the last frame, it calculates the set of azimuth directions sampled since the last frame. Only visualizing these returns illustrates the sweeping motion of the lidar. The world-space positions of these returns are calculated from the set of distance samples and the corresponding pose. Adding an entity with a PointVisualizationComponent referencing the puck entity visualizes the returns as points, see Figure 4 left. This gives nice feedback when adjusting lidar noise, as one sees the point cloud widen away from the wall.
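One way to compute the set of azimuth directions swept since the last frame, assuming azimuths are discretized into bins; a sketch with my own names, not the actual PuckSystem code:

```python
# Which azimuth bins did the rotating fan sweep through since last frame?
# Bins wrap around at a full rotation.
def swept_azimuth_bins(prev_t, now_t, rotation_hz, bins=360):
    """Indices of azimuth bins swept in (prev_t, now_t]."""
    start = int(prev_t * rotation_hz * bins)
    end = int(now_t * rotation_hz * bins)
    return [i % bins for i in range(start + 1, end + 1)]
```

At 10Hz rotation, a 10ms frame sweeps a tenth of a rotation, i.e. 36 one-degree bins.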
Another way of visualizing the lidar data was to aggregate the positions of lidar returns over time and visualize the aggregate. Over time, this aggregation forms a sort of probability field that describes the likelihood of an obstacle being present at a particular spatial location, similar to the mapping part of SLAM; in fact, this code started out as an experiment with SLAM. I modelled this probability as a grid of cells. Each cell holds the number of samples added to the cell as well as the average position of those samples. To visualize, I add a dot at the cell's average position for each non-empty cell, sized and coloured by the number of samples in the cell. Figure 6 shows such a visualization. The average position gives the most likely position to find an obstacle in a cell, giving a sort of sub-cell precision; rendering cell averages instead of centres allowed much larger cell sizes without sacrificing visual fidelity.
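The per-cell state can be maintained incrementally, so no sample history needs to be stored; a minimal sketch:

```python
# Each cell keeps a sample count and a running average of sample positions;
# the average serves as a sub-cell position estimate when rendering.
class Cell:
    def __init__(self):
        self.count = 0
        self.mean = (0.0, 0.0, 0.0)

    def add(self, position):
        """Incrementally update the average position with a new return."""
        self.count += 1
        self.mean = tuple(m + (p - m) / self.count
                          for m, p in zip(self.mean, position))
```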
The number of cells grows with the cube of the resolution, that is, the number of cells grows rapidly. However, the obstacles that produce returns are surfaces, so the number of occupied cells grows only with the square. Thus, there are a lot of empty cells, and the relative amount of empty cells grows with the resolution. To avoid storing empty cells, I grouped M³ adjacent cells into a block, and these blocks were stored in a sparse grid. The cells of a block are arranged contiguously so nearby cells are nearby in memory. If blocks are too large, they get sparsely populated and memory is wasted; if they are too small, the overhead of managing blocks starts to cost. After a bit of experimentation, I found that M=16 seemed to strike a good balance. The size of blocks is optimized for processing incoming lidar returns and updating cells. Rendering prefers even larger batches, since there is a significant cost to managing a buffer and issuing a draw call. Thus I grouped N³ blocks into what I called a chunk, where a chunk corresponds to a GPU buffer of point positions and a draw call. Yet again, N=2 is just the result of some experimentation, balancing the cost of a draw call against the amount of work involved in updating the buffer and how likely a buffer is to need updating.
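The index arithmetic for this hierarchy can be sketched as follows, using the M=16 and N=2 from above; the function names are mine:

```python
# Map a world-space position to cell, block, and chunk indices, plus the
# contiguous offset of a cell within its block.
import math

M = 16  # cells per block side
N = 2   # blocks per chunk side

def cell_index(position, cell_size):
    return tuple(math.floor(p / cell_size) for p in position)

def block_index(cell):
    # Floor division handles negative indices correctly in Python.
    return tuple(c // M for c in cell)

def chunk_index(block):
    return tuple(b // N for b in block)

def cell_offset_in_block(cell):
    """Contiguous layout within a block: nearby cells are nearby in memory."""
    i, j, k = (c % M for c in cell)
    return (k * M + j) * M + i
```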
So I have three levels of spatial subdivision: chunks contain blocks, which contain cells, the actual bearers of data. Updating of blocks runs in a background task. The task is provided with a set of new lidar return positions. First, the block index of each point is found, maintaining the minimum and maximum block indices. Then, if any block indices are outside the current grid of blocks, the grid is resized to accommodate them. Then all positions are processed again, this time updating the cells. Each block where a cell is updated is flagged as modified. This flagging is done by inserting the block into a singly-linked list, which allows quick iteration over modified blocks. The next-pointer also serves as a flag, so a block is inserted only once. The chunks are updated after all incoming points have been processed. First, all modified blocks are visited, which amounts to traversing a list, to find the minimum and maximum chunk indices. If any chunks are outside the grid of chunks, the grid is resized. Then, each chunk with modified blocks is processed by running through all blocks of that chunk, populating a buffer of positions from non-empty cells.
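The modified-block flagging can be sketched with an intrusive singly-linked list whose next-pointer doubles as the flag. A sentinel marks the end of the list, so that a null pointer can unambiguously mean "not flagged"; names are illustrative, not the actual COGS code:

```python
# Dirty blocks are pushed onto an intrusive singly-linked list; the
# next-pointer doubles as the "already inserted" flag, so each block
# appears at most once.
class Block:
    def __init__(self, index):
        self.index = index
        self.next_modified = None   # None means "not on the modified list"

class ModifiedList:
    SENTINEL = object()             # marks the end of the list

    def __init__(self):
        self.head = None

    def mark(self, block):
        if block.next_modified is None:          # not flagged yet
            block.next_modified = self.head if self.head else self.SENTINEL
            self.head = block

    def drain(self):
        """Iterate over modified blocks, clearing flags as we go."""
        blocks, node = [], self.head
        while node is not None:
            blocks.append(node)
            nxt = node.next_modified
            node.next_modified = None            # clear the flag
            node = None if nxt is self.SENTINEL else nxt
        self.head = None
        return blocks
```

Marking is O(1) and draining touches only the blocks that actually changed, which is what makes the chunk update cheap.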
In the frame loop, the system that manages this visualization checks if the background update for the component is still running. If so, the system does nothing for this component this frame. Otherwise, the processed buffers for the chunks are swapped in. Then the data sources are processed, the same PointDataComponents used for visualizing the lidar returns, as depicted in the left image of Figure 4. If any of these components have any new data, the new data is copied into temporary buffers, which are given to a new background task that updates blocks and chunks.
The end for now
This concludes the first part of this two-part text. In the second part, I’ll cover how Python can interact with a running COGS application and capture synchronized images at full performance, and I’ll discuss my approach to generating scenery with vegetation from maps of existing sites.