Among the challenges we face at Picterra, how to set up an application and infrastructure that enables the processing of large quantities of GIS data has to be the most complex but also one of the coolest. Since we recently unleashed detection cells, which enable detection on much larger images (much more than our previous 900MP maximum), we wanted to challenge ourselves with something larger-scale: the detection of solar panels on the whole city of Zurich, Switzerland.
Our typical workflow involves our user creating a detector using our UI. Once trained, the detector can be applied on images to detect objects in those images. Some users will upload their own orthophotos (GeoTiff) and run the detector on those images.
For such large datasets though, the images are typically hosted in a Web Map Service (WMS), such as ArcGIS or GeoServer (we’ve even compiled a list of awesome open services that you can import directly into Picterra). With WMS, the difference from our traditional workflow is that the images are not present on our servers when a detection starts: we need to download them first. This means a lot of HTTP GET requests that have to complete before we can start the actual detection. Depending on network conditions and on how well the server tolerates being hit with this many requests, this step can take a considerable amount of time, with timeouts and failures likely.
In this article, we’ll look back at some of the hurdles we faced and the solutions we came up with to get detection on large WMS working at scale.
The first hurdle we encountered was resource limitations. Processing these large areas means that, at some point, the data has to be present in memory. This meant either extra-large machines with huge pools of memory, or many smaller machines processing data in smaller batches. We went for the latter and, thanks to the fact that our machine learning models are highly parallelizable, we simply split the detection area into what we call “detection cells”: rectangular areas with a maximum size of 30912x30912 pixels (our previous ~900MP maximum). Anything larger than that generates new cells that are processed independently. (Note that the number of cells does not affect pricing: we charge only for the amount of pixels we process, which is what we actually pay for.)
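Conceptually, the splitting works like the following sketch (a minimal illustration; the function and constant names are ours, not Picterra's actual code):

```python
import math

MAX_CELL_SIZE = 30912  # pixels per side, i.e. roughly 900 MP per cell


def split_into_cells(width, height, cell_size=MAX_CELL_SIZE):
    """Split a (width, height) detection area into independent
    rectangular cells of at most cell_size x cell_size pixels.

    Returns a list of (x, y, cell_width, cell_height) tuples that
    tile the whole area; edge cells may be smaller than cell_size.
    """
    cells = []
    for y in range(0, height, cell_size):
        for x in range(0, width, cell_size):
            cells.append(
                (x, y, min(cell_size, width - x), min(cell_size, height - y))
            )
    return cells


# A 100000 x 100000 px area needs ceil(100000 / 30912) = 4 cells per
# side, so it splits into a 4 x 4 grid of 16 independently-processed cells.
print(len(split_into_cells(100000, 100000)))
```

Because every cell is processed independently, the number of workers (not the size of any single machine) becomes the main scaling knob.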
Our scale test was split into 50 cells, for a total of 18,996 MP.
Then, we looked at how time was spent during detection on these cells: most of it went to downloading tiles from the WMS, with just a fraction of the total time actually using the (costly) GPU. To better utilize our resources (and cash), we added a “WMS Prefetching” job. This job, which runs on high-memory, GPU-less nodes, is responsible for pre-downloading and caching all the tiles we will need for detection. Networking is always tricky, with failures being common and servers sometimes not responding, so our solution also features a simple retry mechanism. Once downloaded, the tiles are put in a cache the detection job can read from as it needs. A particular technical point is that we’re using GDAL datasets behind the scenes to manipulate our GIS data, and the download happens during GDAL’s ReadAsArray calls.
Using GDAL’s great WMS support, we can indeed transparently open a WMS using an XML description file like this:
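The original file isn't reproduced here, but a representative description for GDAL's WMS driver looks like the following (server URL, layer name, extent, and sizes are all placeholders; EPSG:2056 is the Swiss coordinate system a Zurich orthophoto would typically use):

```xml
<GDAL_WMS>
  <Service name="WMS">
    <Version>1.3.0</Version>
    <ServerUrl>https://example.com/geoserver/wms?</ServerUrl>
    <Layers>orthophoto</Layers>
    <CRS>EPSG:2056</CRS>
    <ImageFormat>image/jpeg</ImageFormat>
  </Service>
  <DataWindow>
    <!-- Geographic extent of the dataset, in CRS units -->
    <UpperLeftX>2676000</UpperLeftX>
    <UpperLeftY>1254000</UpperLeftY>
    <LowerRightX>2690000</LowerRightX>
    <LowerRightY>1240000</LowerRightY>
    <!-- Pixel dimensions of the full-resolution dataset -->
    <SizeX>140000</SizeX>
    <SizeY>140000</SizeY>
  </DataWindow>
  <!-- Size of the blocks GDAL fetches per HTTP request -->
  <BlockSizeX>1024</BlockSizeX>
  <BlockSizeY>1024</BlockSizeY>
</GDAL_WMS>
```

Opening this file with `gdal.Open()` yields a regular raster dataset; each `ReadAsArray` on it triggers the HTTP requests for the blocks it covers.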
To make sure our prefetching downloads the correct tiles (that is, the same ones our detection will need), we had to ensure that both the prefetching and the detection logic performed the same ReadAsArray calls. In this case, a simple Python generator shared by both modules solved the issue:
```python
def enumerate_tiles(ds, detection_area):
    # Perform the shared splitting logic based on the GDAL dataset
    # `ds` and the detection area `detection_area`. This results
    # in a grid of `rows x cols` cells. Then, yield the computed
    # coordinates for each cell.
    # Prefetch and Detection jobs both use enumerate_tiles, which
    # guarantees they will both query the same (x_from, y_from,
    # width, height) tuple of coordinates.
    ...
    for r in range(rows):
        for c in range(cols):
            yield x_from, y_from, width, height
```
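Put together, the prefetching pass walks that generator and wraps each download in the retry mechanism. Here is a simplified sketch (the download callable, the dict-based cache, and the function names are illustrative, not our production code):

```python
import time


def fetch_with_retries(download, coords, max_attempts=3, base_delay=0.0):
    """Attempt to download one tile, retrying on failure.

    `download` is any callable taking an (x_from, y_from, width, height)
    tuple and raising an exception when the WMS request fails.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return download(coords)
        except Exception:
            if attempt == max_attempts:
                raise  # give up after max_attempts failures
            time.sleep(base_delay * attempt)  # simple linear back-off


def prefetch_tiles(tiles, download, cache):
    """Download every tile the detection job will later read and
    store it in a shared cache, keyed by tile coordinates."""
    for coords in tiles:
        if coords not in cache:
            cache[coords] = fetch_with_retries(download, coords)
```

Because `prefetch_tiles` iterates over the same `enumerate_tiles` output the detection job uses, every read the GPU worker performs later is a cache hit.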
Then we scaled things up a notch. Our platform runs on top of a job queue: jobs are picked up by workers (pods in a Kubernetes cluster), so it’s easy for us to scale the number of workers and increase the platform’s parallel capabilities.
With higher parallelism, however, we had more failing jobs. After investigation, it turned out that some workers were picking up the same job at the same time, which resulted in SQL constraint violations of all kinds as they went on their merry ways. To ensure no two workers picked up the same job at the same time, we patched our job selection logic to use Django’s select_for_update(skip_locked=True) queryset method (read this great article about the subject). Now, whenever a worker tries to find its next job, it locks any row it selects. Other workers won’t see these rows until the original worker releases the lock (which happens fairly quickly). This serializes the order in which workers pick up jobs, and ensures they never pick up the same job.
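As a sketch (the `Job` model and its fields are hypothetical, and this won't run outside a configured Django project), atomically claiming a job looks roughly like:

```python
from django.db import transaction


def claim_next_job():
    # select_for_update(skip_locked=True) locks the selected row and
    # makes concurrent workers skip it instead of blocking on it, so
    # two workers can never claim the same row.
    with transaction.atomic():
        job = (
            Job.objects
            .select_for_update(skip_locked=True)
            .filter(status="pending")
            .order_by("created_at")
            .first()
        )
        if job is not None:
            job.status = "running"
            job.save(update_fields=["status"])
    return job
```

The key point is that the lock, the status check, and the status update all happen inside one transaction: by the time the lock is released, the row no longer matches the `pending` filter.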
Scaling up also meant playing along with Kubernetes’ pods scheduling. Our workers run in isolated units of work called Pods. Kubernetes deploys Pods onto Nodes (machines) in a way that best fits the available nodes, and bases its decision on information we provide it.
With our current detection cell size of 30912, the WMS prefetching job eats up roughly 3GB of memory. Without explicit memory requests on pod specifications, Kubernetes cannot know this beforehand and might schedule too many pods on the same node, which would create “memory pressure”. Memory pressure happens when the pods running on a node consume all of its memory: the node runs out of memory, leading to failures. Under memory pressure, Kubernetes terminates pods according to their QoS class and reschedules them. They might get reassigned to the same node, to a new node with enough memory, or to another node without enough memory. This isn’t necessarily bad, since a job whose worker gets terminated is simply put back in the queue. It’s still pretty bad though: in the best case it leads to delays and wasted resources, with jobs needing to run multiple times before they can finish; in the worst case, the workload never reaches a stable state and no job can complete.
To prevent this, we let Kubernetes know how much memory our worker is expected to need, using Kubernetes’ memory requests. This lets the scheduler make an informed decision on where to place pods, and warns us when we try to scale too high (pods become unschedulable) or too low (requested resources sit idle). We also took this opportunity to take a closer look at how our nodes were utilized and reconfigured our pools to ensure better memory utilization.
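In practice this is a `resources.requests` entry on the container spec. A fragment along these lines (the container name, image, and exact numbers are illustrative):

```yaml
containers:
  - name: wms-prefetch-worker
    image: registry.example.com/wms-prefetch:latest
    resources:
      requests:
        # observed footprint for a 30912px detection cell;
        # the scheduler only places the pod on a node with this much free
        memory: "3Gi"
        cpu: "1"
      limits:
        # headroom before the kubelet terminates the pod
        memory: "4Gi"
```

Setting the request close to (but not below) real usage is the balance to strike: too low recreates the memory-pressure problem, too high wastes node capacity.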
Even with requests in place, we still overestimate by a fair amount. We’ll be monitoring usage across the coming weeks and fine-tuning our clusters according to what we observe.
The last brick in our iteration on the scaling aspect of the platform was easing the load on our database instance. We use a PostgreSQL database, which (among other things) lets jobs store and retrieve all the data they need while running. With this many jobs running in parallel, our database was getting hit pretty hard, so we had to add a connection pooler, namely pgbouncer.
Each new connection to a postgres instance spawns a new process, costing some memory and time, and increasing the chances of one transaction being blocked by another. To mitigate this, postgres defaults to a maximum of 100 simultaneous connections, and refuses to open any new connection beyond that limit (actually, by default it reserves 3 connections for superusers, which means it will gladly open 97 connections but refuse the 98th). For these reasons, it is best practice to maintain and reuse a small number of connections to a postgres instance, while ensuring those connections are kept busy most of the time. This is exactly what pgbouncer does: it acts as a proxy in front of the database, pre-establishing connections to postgres and serving them to any client that needs one. After pointing all of our services at pgbouncer instead of postgres, it was just a matter of tweaking pgbouncer’s settings to fit our needs, and this really unlocked the platform’s scaling potential.
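A minimal pgbouncer.ini illustrating the idea (the host, database name, and pool sizes are placeholders to adapt to your own workload):

```ini
[databases]
; clients connect to pgbouncer, which proxies to the real instance
picterra = host=10.0.0.5 port=5432 dbname=picterra

[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
auth_type = md5
; hand a server connection out for the duration of a transaction,
; then return it to the pool
pool_mode = transaction
; many workers may connect to pgbouncer...
max_client_conn = 1000
; ...but only this many real postgres connections are kept open
default_pool_size = 20
```

The ratio between `max_client_conn` and `default_pool_size` is where the win is: hundreds of mostly-idle workers share a small set of always-busy postgres connections.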
One thing we haven’t talked about is how we’ve set up our HPA (horizontal pod autoscaler) and scaling metric. We’ll do a separate article just on that later on.
It was a mad team effort, and everyone involved deserves props for pulling this one off. We’re very happy with the results, and are now only limited by our imagination.
Got a ridiculously large image you want to detect something on? Get in touch.
Scaling is fun.