Batch LiDAR processing made easy with Python + PDAL + Entwine + AWS
Recently I was presented with an opportunity at work: leverage USGS LiDAR to collect building heights for nearly half a million locations spread across the United States. As one of the few people in the company with LiDAR experience, I took this on as my project to solve and as an opportunity to learn some new geospatial tools. Specifically, I wanted to use PDAL to answer this question.
My previous experiences with LiDAR had been with datasets collected from a UAV (Unmanned Aerial Vehicle) or TLS (Terrestrial LiDAR Scanner), which meant I operated directly on very large point cloud .las files and spent a lot of time in CloudCompare, or even working with LAStools in (gasp!) ArcMap. This was (and remains) a time- and machine-intensive process. Loading and rendering that many points is a real drag on a machine, and while looking at a LiDAR point cloud is pretty cool, the end result was typically a LiDAR-derived product like a raster Digital Elevation Model (DEM), a raster Canopy Height Model (CHM), or just a height-above-ground value to assign to a point or polygon. Downloading and processing hundreds of point clouds was simply not a realistic approach to my building heights problem. There had to be a better way!
A Better Way: PDAL + Entwine
I have attended FOSS4G the last several years, so I was loosely aware of PDAL, but because I hadn't worked with LiDAR for several years, I had not gotten around to exploring it in depth. The Point Data Abstraction Library (PDAL) is like GDAL, but for point clouds. The entire USGS LiDAR collection is available on a public S3 bucket in Entwine format. Wait, what is Entwine? Don't worry, I had to look it up too. Entwine organizes millions/billions/trillions of LiDAR points into an indexed, lossless tile structure (Entwine Point Tiles, or EPT), similar to XYZ tiles for imagery. An added bonus of Entwine is the ability to send range requests with a bounding box to pull only the LiDAR points within an area of interest for processing, and to have the processing steps simply cascade down to a desired output. This is a massive timesaver!
These days I do most of my geospatial work in python. PDAL has python bindings, so to get started I set up a new conda environment with PDAL installed. PDAL can read, write, and explore metadata from the command line, but more complicated data processing requires constructing pipelines. Pipelines are JSON objects made up of readers, filters, and writers, which PDAL interprets and executes. I constructed a pipeline that did the following:
- Read EPT data within a specified bounding box
- Calculate Height Above Ground (HAG; USGS LiDAR is already classified with ground points as part of the data release)
- Copy the new HAG values into the Z dimension
- Limit the HAG values to between 0 and 150 meters above ground
- Clip the bounding box to the polygon footprint
- Find the max Z value within my polygon
- Write the result to a csv
The actual syntax of the pipeline looks something like this:
[
    {
        "type": "readers.ept",
        "filename": "LINK TO PUBLIC BUCKET ENTWINE",
        "resolution": 0.1,
        "bounds": "BOUNDS IN EPSG 3857 CHANGE BY CSV"
    },
    {
        "type": "filters.hag"
    },
    {
        "type": "filters.ferry",
        "dimensions": "HeightAboveGround=Z"
    },
    {
        "type": "filters.range",
        "limits": "Z[0:150]"
    },
    {
        "type": "filters.crop",
        "a_srs": "EPSG:3857",
        "polygon": "WKT POLYGON TO CLIP"
    },
    {
        "type": "filters.locate",
        "dimension": "Z",
        "minmax": "max"
    }
]
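From the python bindings, running this pipeline for a single location takes only a few lines. A minimal sketch, assuming the JSON above has been saved as pipeline.json with the placeholders filled in:

import json
import pdal

# Load the filled-in pipeline and hand it to PDAL as a JSON string
with open('pipeline.json') as f:
    pipeline = pdal.Pipeline(json.dumps(json.load(f)))
pipeline.execute()

# filters.locate leaves a single point: the one with the max height above ground
point = pipeline.arrays[0][['X', 'Y', 'Z']][0]
print(f'Max height above ground: {point[2]:.2f} m')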
Easy! This pipeline took about one second to run per location. Things get a little slower with a bigger bounding box, or if I were to write the results to a .las or a .tif, but that wasn't necessary in my case. And I never had to open a single point cloud in desktop software! Pipelines are a flexible and powerful way to quickly process LiDAR data.
This is all pretty standard stuff so far, most of which I was able to learn from the great online workshop on the pdal.io website. What wasn’t clear to me though was how to batch this process for hundreds of thousands of polygons all needing the same pipeline treatment. Time to get creative!
Adding Python to the Mix
Because PDAL pipelines are JSON objects, I decided to use python to batch-manipulate them for me. I created and filled extra columns in my geopandas dataframe for each field that would need to change per location: the readers.ept filename, the bounds, and the polygon in Well Known Text. Next, I wrote a python function to run PDAL's pipeline.execute() in a loop, updating the JSON object for each new row and writing the results, which come back as a numpy array, out to a csv.
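Filling those columns might look roughly like the sketch below. This is not my exact code: the footprint file, the EPT URL, and the column names are assumptions, and I'm assuming the footprints are already in EPSG:3857 to match the EPT data.

import geopandas as gpd

# Building footprints, reprojected to web mercator to match the EPT data
gdf = gpd.read_file('footprints.geojson').to_crs(epsg=3857)
gdf['id'] = gdf.index

# Hypothetical EPT root on the public USGS bucket; in practice this varies by location
gdf['filename'] = 'https://s3-us-west-2.amazonaws.com/usgs-lidar-public/USGS_LPC_EXAMPLE/ept.json'

# readers.ept expects bounds as "([xmin, xmax], [ymin, ymax])"
gdf['bounds'] = gdf.geometry.bounds.apply(
    lambda b: f'([{b.minx}, {b.maxx}], [{b.miny}, {b.maxy}])', axis=1)

# filters.crop takes the footprint as Well Known Text
gdf['polygon'] = gdf.geometry.apply(lambda geom: geom.wkt)

# Write out just the fields the batch function will loop over
gdf[['id', 'filename', 'bounds', 'polygon']].to_csv('locations.csv', index=False)

And the batching function itself: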
import json

import pandas as pd
import pdal


def run_pdal(csv):
    # Open the csv of locations (id, filename, bounds, polygon)
    _csv = pd.read_csv(csv)
    # Open the JSON pipeline template
    with open('S3/path/to/pipeline.json') as json_file:
        the_json = json.load(json_file)
    # Open an empty csv for the results
    with open('filled_by_pdal.csv', 'w') as fp:
        fp.write('id,x,y,z\n')
        for _id, _filename, _bounds, _polygon in zip(_csv.id, _csv.filename, _csv.bounds, _csv.polygon):
            # Swap the per-location values into the pipeline
            the_json[0]['filename'] = _filename
            the_json[0]['bounds'] = _bounds
            the_json[4]['polygon'] = _polygon
            pipeline = pdal.Pipeline(json.dumps(the_json))
            try:
                pipeline.execute()
                # filters.locate leaves one point: the max HAG within the footprint
                xyz = pipeline.arrays[0][['X', 'Y', 'Z']][0]
                fp.write(f'{_id},{xyz[0]:.2f},{xyz[1]:.2f},{xyz[2]:.2f}\n')
            except RuntimeError as e:
                # e.g. filters.hag: Input PointView does not have any points classified as ground
                print(e)
                print('RuntimeError, writing 0s and moving to next bounds')
                fp.write(f'{_id},0,0,0\n')
This function submits a new pipeline to PDAL for each row and keeps going until it reaches the end of the csv. Later, another function joins these results back to my original dataframe.
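That join is straightforward pandas; something like the sketch below, where the footprint file, the id column, and the output column name are assumptions.

import geopandas as gpd
import pandas as pd

# Heights written out by run_pdal(), joined back to the footprints on id
heights = pd.read_csv('filled_by_pdal.csv')
footprints = gpd.read_file('footprints.geojson')
footprints['id'] = footprints.index
result = footprints.merge(heights[['id', 'z']].rename(columns={'z': 'building_height_m'}),
                          on='id', how='left')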
Running it on the cloud with AWS EC2
Even at roughly one second per record, with just under 500,000 records it would take my MacBook almost six days to finish this task. Possible, but not practical. Time to call in some backup and take this to the cloud. With the help of the largest instance available on AWS and some additional code to split the job across individual cores, the same process that would have taken six days on my local machine finished in just over two hours! Of course, unlike PDAL, taking my project this route cost money: just over $20.
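The per-core split can be as simple as chunking the input csv and handing each chunk to its own worker process running the same loop. A rough sketch, not the exact code I used; it assumes run_pdal() has been adapted to take an output filename so the workers don't overwrite each other:

import multiprocessing as mp

import numpy as np
import pandas as pd

def split_and_run(csv, n_workers=mp.cpu_count()):
    # Break the big csv into one smaller csv per core
    df = pd.read_csv(csv)
    chunk_files = []
    for i, chunk in enumerate(np.array_split(df, n_workers)):
        name = f'chunk_{i}.csv'
        chunk.to_csv(name, index=False)
        chunk_files.append(name)
    # Each worker runs the same run_pdal() loop on its own chunk,
    # writing to its own output csv (heights_0.csv, heights_1.csv, ...)
    jobs = [(name, f'heights_{i}.csv') for i, name in enumerate(chunk_files)]
    with mp.Pool(n_workers) as pool:
        pool.starmap(run_pdal, jobs)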
Open Source and the People Behind it
I can't think of a way I would have solved a problem at this scale without the processing power of PDAL, the accessibility of the data in Entwine format, and the added muscle of AWS. And of course python, the grease that makes everything go. Now I can do all my LiDAR work with PDAL and keep the viewing of LiDAR point clouds to slide decks!
PDAL and Entwine are open source software maintained by Hobu, Inc. and are in continuous development. If you are interested in using PDAL, take the time to run through the great tutorial workshop on the website. I'd also recommend subscribing to the mailing list should you get stuck, or just to follow along with the occasional question that comes through: http://lists.osgeo.org/mailman/listinfo/pdal
Lastly, the methodology and code described here are just one of what I'm sure are many ways batch processing with PDAL could be achieved. If you have thoughts or alternatives, I'd love to hear about them!