Adventures in the machine-learning land of drones & lidars, part II
This is the second part of a two-part text; the first part can be found here. The text is written for an audience with a background in computer graphics, but to make it more accessible, I’ve written a high-level TL;DR below with key takeaways.
In the first part, I gave the background and motivation for this project, followed by an overview of our in-house rendering engine COGS, which is the basis for this work, and then details about the AirSim integration and the rotating lidar emulator.
TL;DR Python is the prevalent programming language for experimenting with machine learning, and I’ve added functionality that allows a Python program to connect remotely to a running COGS application, manipulating state and capturing images and data. Numerous sample images paired with ground-truths are used to train machine learning algorithms, and new rendering modes and a new efficient image capture system make acquisition of such images convenient. I needed vegetated landscapes in which the drone could operate. To create virtual representations of actual places, and to minimize the labour-intensive task of modelling large scenes, I augmented the terrain rendering of COGS with a system that generates vegetation on-the-fly, see Figure 1. The input to both the terrain rendering and the vegetation system is regular map data obtainable from various authorities.
Interacting with COGS directly from Python
The TCP server runs in its own thread listening to a port, and when a client connects, it creates a reader and a writer thread that handle the communication with the COGS main thread. Serialization and deserialization were written by hand in C++ and Python, as the API is quite small. The communication between the reader and writer threads and the main thread is buffered by a pair of queues per client, so the client cannot indirectly halt the main thread. The TCP server inserts itself into the frame loop, handling requests once per frame. Fire-and-forget type requests get queued up during the frame and are processed in one go. However, chains of back-and-forth request-responses have a performance problem: the client is unlikely to receive its response and manage to send a new request before the main thread is finished with that frame’s request processing, and any new requests won’t be handled before the next frame. Only one response per frame makes traversing the scene graph from Python noticeably slow (as this amounts to a lot of queries for properties and children), and makes capturing multiple images at a single instant tricky. Pipelining requests and handling them asynchronously in Python is one solution, but it adds a lot of complexity.
A simpler approach is to let the main thread spin in the request-processing stage until the client considers itself done. I added HALT, RESUME, and NEXTFRAME requests. The HALT request keeps the main thread spinning in the request-processing loop until a matching RESUME request is received. These requests can be nested. The NEXTFRAME request breaks out of the request-processing loop, and spinning resumes in the next frame. On the Python side, this is exposed as a context manager called FrameLock, where the enter function sends a HALT request and the exit function sends a corresponding RESUME. Thus, statements wrapped in a with-statement are processed in the same frame, see Figure 4. A danger here is again that the main thread may get hogged to such an extent that it interferes with running simulations. To avoid this, the request-processing loop has a timer limiting the time it spends processing requests before continuing with the frame.
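A minimal sketch of how such a context manager might look in Python. The client class, its `send` method, and the request strings are assumptions made for illustration; the stand-in client only records what is sent, so the sketch runs without a COGS server.

```python
class RecordingClient:
    """Hypothetical stand-in for the real client; it only records requests."""

    def __init__(self):
        self.sent = []

    def send(self, request):
        self.sent.append(request)


class FrameLock:
    """Keeps the main thread in its request loop for the duration of a with-block."""

    def __init__(self, client):
        self.client = client

    def __enter__(self):
        self.client.send("HALT")
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Always send the matching RESUME, also when the body raised.
        self.client.send("RESUME")
        return False  # do not swallow exceptions


client = RecordingClient()
with FrameLock(client):
    client.send("GET camera.position")  # hypothetical request
    with FrameLock(client):             # locks can be nested
        client.send("CAPTURE depth")    # hypothetical request
```

Sending RESUME from the exit function means the main thread is released even if the Python code inside the block fails, which complements the server-side timer described above.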
This functionality is wrapped into a single self-contained COGS extension, so any COGS-based application can easily be made manageable from Python in this way. Using FrameLocks, the client can get multiple image captures and property readouts in a single frame, quickly traverse the scene graph, or step frame-by-frame.
Non-blocking image capture
To train a machine learning algorithm, sets of representative samples of input data paired with ground-truth data are fed into the training process. What kind of ground-truth is needed depends on the problem that is to be solved. E.g., segmented images, which are images that encode what kinds of objects are present where in the image, can be used to train object recognition. Depth and surface normal images can be used to train spatial comprehension. The representative image and the ground-truth should be synchronized, that is, they should depict the same view at the same time. And one usually needs quite a lot of data.
To make image capture convenient, I created a CaptureComponent that can be added to any camera, overriding the render mode of the camera and retrieving images. This component has a set of pre-defined render modes: normal rendering, depth images, images with surface normals, and segmented images, see Figure 5. COGS entities have an object id field, which is a 32-bit integer that can be set by the user with no restrictions, and segmented images are produced by using the object id as colour.
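To illustrate using a 32-bit object id as colour, here is a small sketch that splits an id into four 8-bit channels and recovers it again. The channel order (lowest byte in red) and treating the id as unsigned are assumptions; the text does not specify how COGS lays out the bits.

```python
def object_id_to_rgba(object_id):
    """Split a 32-bit object id into four 8-bit channels (assumed order: R is the lowest byte)."""
    if not 0 <= object_id <= 0xFFFFFFFF:
        raise ValueError("object id must fit in 32 bits")
    return tuple((object_id >> shift) & 0xFF for shift in (0, 8, 16, 24))


def rgba_to_object_id(rgba):
    """Recover the object id from a pixel of a segmented image."""
    r, g, b, a = rgba
    return r | (g << 8) | (b << 16) | (a << 24)


pixel = object_id_to_rgba(0x01020304)
```

On the training side, the inverse function is what turns a segmented image back into per-pixel labels.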
Basic capture was already possible in COGS via an API call that draws the current view and directly reads back the resulting image. The problem with this approach is that it flushes the graphics pipeline. A single capture and a single flush every now and then is no problem, but continuously capturing multiple cameras will hurt performance severely due to an avalanche of pipeline flushes. Any capturing approach that immediately returns the image will have this problem.
So, to make image capture efficient, I decoupled the task of requesting an image from that of receiving it. Capturing is either continuous or enabled for just a few frames, triggered by a property of CaptureComponent, which can be set either from Python, scene setup, or via the inspection GUI in the application. The captured image is available a few frames later and can be automatically stored to disc, or be retrieved by the Python client. I added some convenience functionality in the Python client that lets Python wait for the correct frame to be available, yielding an illusion of immediate capture. This supports capturing from multiple cameras at the same time, so input data with matching ground-truth data can be generated with multiple identically placed cameras, see Figure 6.
The actual capturing is handled by the CaptureSystem. If capturing is enabled, it sets up a render pipeline that, after rendering the image, issues a compute task that packs the image data and triggers downloading this result from GPU to CPU memory. It uses multiple sets of buffers, so that when frame i is rendered, frame i-1 is packed, frame i-2 is downloaded, and frame i-3 is ready for the CPU. On the CPU side, two pipelined asynchronous tasks are issued to the task system: one optional task to encode the image data to PNG, and another optional task to store the image to disc. If either encoding the image or storing it to disc takes more than a frame to complete, a property on CaptureComponent chooses whether to just discard new frames until the offending task is finished, or to wait for completion, which may hurt framerate.
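The staggering of frames through the capture stages can be sketched as below. The stage names and the choice of exactly four buffer sets (one per stage) are assumptions for illustration; the text only says that multiple sets of buffers are used.

```python
STAGES = ("rendered", "packed", "downloaded", "ready for CPU")


def pipeline_state(frame):
    """Map each stage to the frame it handles during `frame` (None while the pipeline fills)."""
    state = {}
    for lag, stage in enumerate(STAGES):
        source = frame - lag
        state[stage] = source if source >= 0 else None
    return state


def buffer_set(frame):
    """Index of the buffer set a frame occupies, assuming one set per stage."""
    return frame % len(STAGES)


for frame in range(5):
    print(frame, pipeline_state(frame))
```

With one buffer set per stage, the four frames in flight at any time never share a set, so no stage has to wait for another to release its buffers.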
Terrain and vegetation on the fly
I needed a scene where the drone could fly along a power line. I fired up Blender and started to model a little dell with a power line with trees on both sides. As I quickly realized that placing trees by hand was very tedious and time-consuming, I decided to automate this task, and base it on real data so I could easily create scenes that were similar to existing places. To that end, I needed maps with elevation and vegetation data. The Norwegian Mapping Authority (Kartverket) provides elevation data and orthophotos online, the Norwegian Institute of Bioeconomy Research (NIBIO) has a multitude of maps that describe vegetation kind and quality, and the Norwegian Water Resources and Energy Directorate (NVE) has maps of power lines, as well as the positions of power lines and masts as GeoJSON.
COGS has a built-in terrain system, which can pull elevation and imagery data directly from e.g. WMS. It is an implementation of geometry clipmaps, and out-of-core datasets are handled by on-demand fetching, level-of-detail selection and caching. It is the usual go-to tool for us when we need map-based visualization. In addition, we have a very basic terrain system with minimal setup, which assumes that elevation and imagery can be fit into textures. The domain is dynamically refined based on the camera position using a 2D quad-tree tiling, where each tile is rendered using the same M×M grid. T-junctions between adjacent tiles of different refinement are handled with the approach of semi-uniform tessellations. As a side-note, skipping tessellation shaders allows this approach to run readily in WebGL.
To handle vegetation, I created a system that creates vegetation on the fly as the camera moves. It is based around dividing the horizontal plane into a set of tiles, see Figure 7. The VegetationComponent has a list of relevant cameras, from which potentially visible tiles can be deduced. The elevation range for new tiles is not known, and thus the conservative approach of an elevation range of ±∞ is used: The frusta of the relevant cameras are projected down onto the horizontal plane, and the tiles that intersect such a projected frustum in 2D are potentially visible.
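A rough sketch of the tile selection, under simplifying assumptions: the z axis is up, tiles are axis-aligned squares of a fixed size, and the projected frustum is replaced by its 2D bounding box. The bounding box is more conservative than an exact polygon intersection test, so it may select a few extra tiles; the function name and the tile id scheme are hypothetical.

```python
import math


def potentially_visible_tiles(frustum_corners, tile_size):
    """Return ids (ix, iy) of tiles overlapping the 2D bounding box of the projected frustum."""
    # Projecting onto the horizontal plane simply drops the elevation (z).
    xs = [x for x, y, z in frustum_corners]
    ys = [y for x, y, z in frustum_corners]
    ix0, ix1 = math.floor(min(xs) / tile_size), math.floor(max(xs) / tile_size)
    iy0, iy1 = math.floor(min(ys) / tile_size), math.floor(max(ys) / tile_size)
    return {(ix, iy) for ix in range(ix0, ix1 + 1) for iy in range(iy0, iy1 + 1)}


corners = [(5.0, -5.0, 0.0), (25.0, 5.0, 10.0), (5.0, 5.0, 3.0), (25.0, -5.0, 7.0)]
tiles = potentially_visible_tiles(corners, tile_size=10.0)
```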
The VegetationSystem communicates with a data provider via TileRequests and TileResponses. A TileRequest contains the tile id and extent and a suggested sample count. From this the provider returns a TileResponse with the tile id and extent, a sample count and gridded samples with elevation, ground normal and vegetation layers (e.g. grass, pine, birch). Each layer contains the conditional probability for vegetation of that particular type to be present at a given position. Thus, the first layer contains just the probability for grass. The next layer contains the probability that the position contains pine, given that it does not contain grass, and so on. For each vegetation layer, an ordered list of models is defined (as there can be multiple models for e.g. pine), each with a probability, a footprint, and an occlusion flag. This probability is used to weight the models within a vegetation layer, the footprint is the extent that the model occupies, and the occlusion flag tells whether or not spawning a model of this type will prevent other models from spawning within its footprint. Typically, models do occlude, which aligns with interpreting probabilities as conditional, but I found that in some cases, e.g. for grass models with a number of straws, it looked better to allow footprints to overlap freely.
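To make the conditional interpretation concrete, the sketch below converts an ordered list of conditional layer probabilities into absolute ones. The function name and the example values are made up for illustration.

```python
def absolute_probabilities(conditional):
    """Convert ordered conditional layer probabilities into absolute probabilities.

    Returns the absolute probability per layer and the probability that
    no layer claims the position.
    """
    remaining = 1.0  # probability that no earlier layer claimed the position
    absolute = []
    for p in conditional:
        absolute.append(remaining * p)
        remaining *= 1.0 - p
    return absolute, remaining


# Hypothetical layers: grass 0.5, pine 0.5 given no grass, birch 0.25 given neither.
layers, bare = absolute_probabilities([0.5, 0.5, 0.25])
# layers == [0.5, 0.25, 0.0625] and bare == 0.1875
```

The absolute probabilities and the bare-ground remainder always sum to one, which is what makes the ordered, occluding placement consistent with the layer data.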
The BuildTile task, issued with a TileResponse as input, spawns vegetation onto a tile. First, an initial random seed is created using Wang’s multiplicative hash function on the tile id. Consecutive random numbers for this tile are produced using a linear congruential generator. Thus, tile construction is reproducible and can be done in any order. The tile is uniformly split into a set of cells, where each cell is flagged as occupied or not, initially set to unoccupied. The builder then iterates through the model lists of the vegetation layers present in the tile. For each model, the extent of the tile is subdivided into segments of the size of that model’s footprint. For each segment, the covered cells are traversed to compute the average probability: an occupied cell contributes a probability of zero, otherwise the probability is sampled from the model’s vegetation layer. Then, a random number is drawn. If this number is less than the aggregate probability, an instance of that model is placed at a random position inside that segment with a random azimuthal rotation and a slight random scale. If the model has its occlusion flag set, all cells covered by the segment are set to occupied. The output for that tile is a list of ModelInstances objects, where each object can hold a fixed number of model instance transforms for a single model as well as an overall bounding box. Splitting the data into such fixed blocks allows efficient memory management, as tiles continuously get created and destroyed as the camera moves.
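The per-tile random number scheme can be sketched as follows. The hash shown is one commonly reproduced variant of Thomas Wang's 32-bit integer hash, and the generator uses the Numerical Recipes constants; both are assumptions, as the text does not say which variant or constants are used. The class name `TileRandom` is hypothetical.

```python
MASK32 = 0xFFFFFFFF


def wang_hash(key):
    """A commonly cited 32-bit integer hash attributed to Thomas Wang (assumed variant)."""
    key &= MASK32
    key = (key ^ 61) ^ (key >> 16)
    key = (key * 9) & MASK32
    key = key ^ (key >> 4)
    key = (key * 0x27D4EB2D) & MASK32
    key = key ^ (key >> 15)
    return key


class TileRandom:
    """Linear congruential generator seeded from the tile id."""

    def __init__(self, tile_id):
        self.state = wang_hash(tile_id)

    def next_float(self):
        # LCG constants from Numerical Recipes (assumed); result is in [0, 1).
        self.state = (1664525 * self.state + 1013904223) & MASK32
        return self.state / 2**32


first = TileRandom(42)
second = TileRandom(42)
sequence_a = [first.next_float() for _ in range(4)]
sequence_b = [second.next_float() for _ in range(4)]
```

Because the seed depends only on the tile id, both sequences above are identical, which is what lets a tile be destroyed and later rebuilt with the same vegetation regardless of the order in which tiles are built.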
Each frame, the OrganizeInstances task culls populated tiles with the current frustum, and bakes the positions of ModelInstances into a buffer suitable for instanced rendering. The task is issued early in the frame and is waited upon by the renderer. Each model item has a level-of-detail sequence, with steps at set distances. Each step has a version of the model with decreasing detail. The OrganizeInstances task calculates the camera distance for all instances, figuring out which detail level to use, and culling instances that are too far away from the camera. Everything is organized so we end up with one buffer per specific model detail level per camera.
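A simplified sketch of the distance-based sorting for a single model and a single camera. The function names, the step distances, and the use of plain position tuples instead of full transforms are assumptions made to keep the example short; frustum culling of tiles is left out.

```python
import bisect
import math


def select_lod(distance, lod_distances, max_distance):
    """Return the detail level index for `distance`, or None if the instance is culled."""
    if distance > max_distance:
        return None
    # Level 0 is the most detailed; each step distance moves to a coarser level.
    return bisect.bisect_right(lod_distances, distance)


def organize_instances(camera, positions, lod_distances, max_distance):
    """Group instance positions into one list per detail level."""
    buckets = {}
    for position in positions:
        level = select_lod(math.dist(camera, position), lod_distances, max_distance)
        if level is not None:
            buckets.setdefault(level, []).append(position)
    return buckets


buckets = organize_instances(
    camera=(0.0, 0.0, 0.0),
    positions=[(10.0, 0.0, 0.0), (100.0, 0.0, 0.0), (500.0, 0.0, 0.0)],
    lod_distances=[20.0, 60.0, 150.0],
    max_distance=400.0,
)
```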
To tie the pieces together: The VegetationSystem first issues OrganizeInstances background tasks using the current set of populated tiles and current camera transforms. It also issues BuildTile background tasks for all the TileResponses that have arrived since the last frame. Then it finds the set of potentially visible tiles, and issues TileRequests for new tiles. Later, at the rendering stage, the VegetationRenderer inserts a draw command with a hook for its render routine. Then the COGS renderer filters and organizes these tasks, figuring out the render order. When the render hook of VegetationRenderer is called, it waits on the OrganizeInstances task, sets up view transform data, the model mesh and the material, and draws the instance buffers. There is a slight overlap between the levels of detail, where a model is rendered twice with adjacent detail levels. In this region, the pixel shader discards pixels using the relative distance inside this region and a Bayer matrix. This makes the transition between levels of detail smooth, alleviating popping artefacts.
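The dithered transition can be illustrated on the CPU with the standard 4×4 Bayer matrix. This is a sketch of the general technique, not the actual shader: the matrix size, the threshold formula, and the use of complementary patterns for the two detail levels are assumptions.

```python
BAYER_4X4 = (
    (0, 8, 2, 10),
    (12, 4, 14, 6),
    (3, 11, 1, 9),
    (15, 7, 13, 5),
)


def keep_pixel(x, y, t, fading_in):
    """Decide whether a pixel survives the discard test.

    `t` in [0, 1] is the relative distance through the overlap region.
    The level fading in keeps more pixels as t grows; the level fading
    out keeps exactly the complementary set.
    """
    threshold = (BAYER_4X4[y % 4][x % 4] + 0.5) / 16.0
    if fading_in:
        return t > threshold
    return t <= threshold


# Halfway through the overlap, half of the pixels come from each level.
halfway = sum(keep_pixel(x, y, 0.5, fading_in=True) for y in range(4) for x in range(4))
```

With complementary patterns, every pixel in the overlap region is drawn by exactly one of the two detail levels, so the transition neither leaves holes nor draws a pixel twice.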
When setting up the scene for the drone, I restricted myself to a 4km×4km area, and with ½m resolution (8000×8000 samples), I could fit a map into an 8192² texture and use the basic terrain system. In addition, I created a corresponding basic vegetation provider that reads GeoTIFF maps and populates TileResponses from them. With some editing in GIMP, I created the source maps based on geographical data: The border between different forest types was blurred, and I created a ground texture by replacing site index with different kinds of green noise, and using orthophoto for places without vegetation, see Figure 8.
And this concludes the second part of this two-part text. Figure 1 in the first text shows all pieces coming together: Drone, ground-truth images and lidars. Nothing here is rocket science, but there are some interesting engineering problems, and some are quite generic, and code from this project has found its way into other projects. The adventure isn’t concluded, and I revisit it from time to time.