As you might have guessed after reading some of GeoPhy’s blog posts: we deal with a very large amount of geospatial data. In the case of GeoPhy, we are referring to two distinct types of geospatial data:
- Point based, having a location on the planet (latitude/longitude) and one or more attributes we typically want to aggregate. A typical example would be points representing bars (Lat/Lng + Attribute::Bar)
- Polygon based, like the points but with multiple locations describing a closed boundary (technically speaking we also have MultiPolygons like a boundary with holes). A typical example would a county or metropolitan area.
Since data integration and maintaining a flexible, well-described data structure is one of the key activities of GeoPhy, we made the choice to use RDF as our main means of storage. There are several great extensions on the standard query language for RDF (Sparql) for performing the queries we need to do like: GeoSparql and STSparql, so we felt confident going in this direction.
After some initials tests we determined that it would indeed be a great fit for us with one caveat: performing geospatial queries at this scale is really slow. Nevertheless, we felt like RDF would be a good fit and started looking for other ways to satisfy our hunger for geospatial queries.
To give you some sense of the scale of the dataset some high level numbers of the dataset at moment of testing:
- 350,000 Polygons describing geographical regions
- 10,000.000 Points describing amenities
- 100,000,000 Points describing buildings
- 1,200,000,000 Polygons describing catchments (see below), 12 per building
We consider this a pretty large dataset, with a focus on the US. But in the near future, we will start expanding resulting in an increase of amenities and building. As you can see, we will have 12 extra polygons for each building with an expected target next year of 250,000,000 buildings… So scale needs to be taken into account!
When to query and how much?
Our Data Management Platform [DMP] takes data as something that is always in motion: coming at different moments and in different sizes. Therefore the DMP is built as an event-driven streaming platform in its core: all our services for matching and enriching data operate automatically upon the data flowing into the platform. The DMP itself has 2 distinct phases: ingestion and enrichment. The geospatial challenge we are describing in this post is confined to the enrichment phase alone. In this phase, one of the core activities is to take a building location and enrich it with numerous datapoints describing its surroundings, i.e. the number of restaurants within a 5mins walking distance. The common way of doing this is using the buffer technique: take the lat/long of a location, draw a circle around it with a certain radius, and count all points in the circle. Early on, we have concluded in our research that using the buffer does not result in a fair representation of the actual world: there might be a river next to the building that is blocking access.
To solve this, we developed the GeoPhy Reach catchment polygon: taking all data known about the road network, we construct the polygon describing the actual accessible area. This is why every building has 12 extra catchment polygons: driving, walking and public transport catchments for various travel times. Since a circle is an easy-to-describe geospatial feature, the impact of the query is marginal compared to a geospatial query involving a polygon.
Often, the query filters on distance from the circumcenter instead of intersecting two polygons, making this much more efficient (a circle can be described in a single mathematical function where a polygon requires the actual points in the correct order). Nevertheless we favour the catchment over the buffers, since its impact on the final result can be significant.
Within our architecture, we have an Aggregation service responsible for retrieving all aggregations we would require. Since these aggregations serve as an import input for our models, there are quite a lot, more than 300 per building. There are 3 types of aggregations that we perform:
- Count the number of points-in-polygons given various filter criteria;
- Sum/average the attributes of point-in-polygons;
- Sum the area of intersecting polygons.
These queries are performed on every new building inserted in our database or on every change in the aggregated data (i.e. a new restaurant is added). Our typical ingestion size is between 10.000–100.000 buildings per day with peaks hitting 1 million. To keep up with the influx of new data, we are aiming to enrich 5,000 buildings/hour, resulting in 1.5 million queries per hour or 420/second as a target value.
The shear size of data and its constant changing nature makes that there is very little to gain from caching: keeping it up to date would be equal to the amount of queries for doing it “live”.
One advantage is that the DMP is considered an internal platform with no direct access for our clients. Its main purpose is to feed our models and datastores we have for our products.
This makes queries per second more important than latency as long as we can have concurrency.
A small but important distinction, also considering the service-oriented nature of our architecture, where we can scale the number of running aggregation services.
Summing up the requirements
To sum up the requirements, the final geospatial storage engine must be:
- Able to support indexing and querying geospatial shapes, in particular points, polygons and multipolygons at scale;
- Able to filter based on the geospatial features’ attribute;
- Able to perform aggregation methods like SUM(), COUNT() and AVG();
- Able to sum up the area of intersecting polygons;
- Able to handle a constant inflow of data (important for considering the indexing mechanism).
For testing purposes we had to define which engines to use — there are many to choose from. To limit the scope, we decided on using engines that were at least known to use to reduce the amount of complexity: PostgreSQL, ElasticSearch and SOLR. SOLR might be an odd choice, but it turned out to be a pretty good engine for doing the buffer-style queries in a previous project. PostgreSQL being an obvious contender having excellent geospatial support using PostGIS. ElasticSearch was recommended by several people as being an interesting contender because of its proven way of scaling and integration of geospatial search queries.
- PostgreSQL 10 + Postgis 2.4
- SOLR 7.2.0
- ElasticSearch 6.2, ( Index settings: 1 shard, 0 replica)
Since running the benchmark on the full dataset would result in creating several terabytes of data, we decided to come up with a smaller representative sample that would illustrate the issues we had in the past. The following 4 scenarios were defined:
- test-a1: Amenities 1
Real world spread throughout the United States having 1 million amenities
- test-a2: Amenities 2
Local stress test clustered in the county of Los Angeles having 1 million densely packed points simulating amenities
- test-b1: Landuse 1
700,000 Landuse polygons throughout the United States
- test-c1: Census 1
33,120 Census tracts throughout the United States
The target data consists of a dataset with 10,000 random properties in the United States, enriched with 4 polygons that represent a buffer and 4 polygons that represent a catchment. The buffer radius is determined by the average catchment size to represent a similar type of aggregation. The catchment sizes are a combination of two modes of transport (walking and driving) and a 10 and 30 minute commute time.
The test environment
For each of the databases we set up an c5.2xlarge instance on AWS (8 vCPU, 16GB memory) and an additional instance for coordinating the tests and running jMeter. Each of the 10,000 buildings would be used in the test, having either 4 buffers or 4 polygons to query resulting in 40,000 samples. The buffer tests were done 1 and 100 threads so see how it would handle concurrency and the catchment test just at 100.
Not really much to tell from a single thread query. It does not even come close to the required 420 queries/sec
This does start to show some differences. Clearly the simple buffer style queries work really well in ElasticSearch and SOLR. ElasticSearch really shines doing point-in-buffer queries and SOLR outperforms all when it comes to buffer-polygon intersections and fast enough in the point-in-polygon. Postgresql holding the middleground being fast enough on all query styles
As expected the more complex queries are the true torture test for all databases. SOLR could not handle the large volume of points doing point-in-polygon queries. PostgreSQL remains a good performer, only having minor struggles doing the sparse (A1) queries.
As you can see ElasticSearch is missing in this result set. At the moment of testing, and even a couple of days after, ElasticSearch was still doing importing and indexing data… At low precision (> 100 meters) it all went fine but lowering it to the required < 10 meter put the engine to its knees. The high precision would be fine for high level aggregations but since we are typically dealing with highly localised differentiators this disqualified ElasticSearch as an option.
- Very very fast at indexing (and a small index disk size);
- Weak performance for point in simple polygon (buffer), but still not that bad;
- Similar performance for polygon in buffer. Best performance for anything to do with multipolygons (catchments).
- Extremely slow at indexing;
- Consumes large amount of disk space with high precision (<10meter) polygons;
- Best performance for point in simple polygon (buffer);
- Worst for polygon in polygon.
- Very fast for buffer-polygon intersections;
- Extremely slow for point-in-polygon queries;
- Indexing also fairly slow.
SOLR and Elasticsearch are really good at handling tons of concurrent requests for simple spatial queries. But it becomes problematic when they need to index complex polygons because it takes a very long time and the index size balloons out of proportion. This is solved by increasing the precision to 100m. Even with the lower precision Postgresql still outperforms them for multipolygon spatial queries, while also having better results. There were some cases where SOLR was wrong in the aggregation due to the 100 meter precision round off. Postgresql was chosen as the engine to move forward on.
One more thing…
One of the possible challenges we brought up during the process is the scaling of Postgresql to handle the 1 billion+ datasets we described in the technical requirements. There are known solutions for setting up larger PostgreSQL clusters with several read-replicas to spread the load whilst diving also into sharding the data (sharding by geographical region is a relatively simple solution here) but they are known to be tricky to setup and maintain. In the last week of the benchmark we decided on doing a “quick” test on the, then newly, released Amazon Aurora PostgreSQL cluster since this claimed to be a perfect solution to our problems
“Aurora is up to five times faster than standard MySQL databases and three times faster than standard PostgreSQL databases.”
“Aurora features a distributed, fault-tolerant, self-healing storage system that auto-scales up to 64TB per database instance.”
We decided on doing a large scale test feeding it with over 1.2 billion polygons (12 polygons for 100,000,000 locations scattered throughout the US) and run the same tests. The results were almost identical to the test we did on a single instance but with 120x the amount of data. This result was astounding. Yes we compared a full cluster to a single instance… but this is a fair comparison in our minds since it took as the same amount to setup and will not cost more to maintain. Plus the promise of seamless scaling is a very nice bonus.