Going Global — Part II

This is the follow-up to my previous post, Going Global — Part I, which discussed the challenges of assessing feature requests.

The creation of a map of Flyover Country’s requests to the Macrostrat API made it clear there was a large demand for global coverage of our geologic maps, but why has there not been more extensive coverage from the beginning? To answer this, a bit of background on Macrostrat and Flyover Country is required.

When the concept for Flyover Country was being explored in Fall 2014, Macrostrat only had two geologic maps — the geologic map of the US (GMUS) and the geologic map of North America (GMNA). Given the research objectives of Macrostrat, and the fact that Macrostrat primarily covers North America, this was sufficient for most of our needs. However, when we came across additional valuable maps, it became obvious that we needed a coherent workflow and a homogenized schema in order to integrate them. From this need, a new geologic database for Macrostrat was developed, internally referred to as `Burwell`.

Burwell is intended to accommodate diverse, multiscale geologic map data, so long as each map unit has, at a minimum, an age and a lithology. Each map source retains all of its original attributes while also being placed into a scale-specific homogenized table. Four scales exist: tiny (global), small (continental), medium (state/province), and large (county/quad).
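As a rough sketch of what that homogenization implies, a minimal map-unit record might look like the following. The field names here are hypothetical and illustrative only, not Burwell's actual schema:

```python
from dataclasses import dataclass

# Hypothetical minimal record for a homogenized map unit.
# Field names are invented for illustration; they are not
# Burwell's actual schema.
@dataclass
class MapUnit:
    source_id: int     # which original map this unit came from
    scale: str         # one of "tiny", "small", "medium", "large"
    age_top: float     # youngest age of the unit, in Ma
    age_bottom: float  # oldest age of the unit, in Ma
    lithology: str     # e.g. "sandstone"

unit = MapUnit(source_id=1, scale="small",
               age_top=66.0, age_bottom=145.0, lithology="sandstone")
```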

Flyover Country has been using the North American geologic map, which belongs in the small scale in Burwell, but the API service they rely on predates this newer database. To make Burwell work for Flyover Country, a few challenges needed to be overcome.

Challenge #1: coverage of a source in Burwell is not guaranteed to be unique. For example, in the small scale layer both GMNA and an arctic dataset cover Canada north of about 70 degrees, meaning overlapping geometries would be returned for that area if queried in the same way as GMNA.

Burwell small scale sources in blue, on top of the tiny source in orange

Challenge #2: Given #1, how does a consumer of our API become aware of overlapping sources and choose which one they would like?

Challenge #3: If an API consumer knows which sources they would like, and in which order, how do we efficiently execute the spatial operations necessary to return the requested geometry? Often this requires intersecting tens of thousands of complex polygons, which can be a very slow process.

Challenge #4: The geometries returned by continental flight paths can be large, but are still small enough to be delivered and rendered quickly. However, global flight paths can be much longer, resulting in potentially much larger responses to the client. How can we make sure that, no matter the flight route, the geometries are quickly cut up and returned to the client?

Challenges #1 and #2 are the most difficult to solve (at least for me) because they include both technical and user experience problems.

I have always felt that there are two main approaches one can take when designing a user experience — sensible defaults and customizability. While both can be valid solutions to a problem, I almost always start with a sensible-default approach and, if necessary in the future, expand in the direction of customizability. There are a few reasons for this: the first is that I find it cognitively easier to develop quickly when I don’t have to consider every possible combination of options. The second is that sensible defaults lead to a simpler user experience that allows the developer to guide users with their expertise.

I like to think of this in terms of two different design approaches: mobile first and the cartographer as the abstractor of reality. Mobile first is, in essence, creating the simplest interface first, and the only way to do that is to make decisions and assumptions about users’ needs. Likewise, cartographers are taught that maps never perfectly represent reality, and that it is their job to generalize reality in a way appropriate for the user’s task. These principles agree that in order to create an enjoyable and productive user experience, the expert must use her expertise to mold a coherent and simplified experience for the user.

With that in mind, I decided that the best way to approach challenges #1 and #2 was through the use of sensible defaults. In this case, that meant making decisions about which geologic map sources have priority in the event of an overlap, instead of revealing all options to the user and letting them decide. After all, as the curator of the data I should have the knowledge to make the decision that the overwhelming majority of users would make if they had the same information available. To accomplish this, when a new map source is ingested into the database, it can be assigned a priority that is used as a tie breaker when intersections occur within the same scale.
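A minimal sketch of that tie-breaking rule, with priority values invented purely for illustration:

```python
# Hypothetical tie-breaker: when two sources at the same scale
# overlap, the source with the higher assigned priority wins.
def pick_source(overlapping_sources):
    return max(overlapping_sources, key=lambda s: s["priority"])

# Priorities below are made up for this example.
overlap = [
    {"name": "GMNA", "priority": 1},
    {"name": "arctic dataset", "priority": 2},
]
winner = pick_source(overlap)  # → the arctic dataset
```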

With this decision, the API consumer no longer has to worry about which sources are available, the relative quality of each, or which order they should be rendered in. All they have to know is a general scale they would like to visualize. Additionally, because the sources and priorities are known a priori, a table of these intersected geometries can be built, saving processing time on each client request.

Challenges #3 and #4 are more technical in nature. To solve #3, the first step is to union the geometries of each source into a single polygon (or multipolygon) that represents the maximum extent of that source. With sources that contain hundreds of thousands of complex polygons, this can take hours per source, so the process is run once in the background upon ingestion into the database and the result is stored. With this unioned geometry available for each source, spatial intersections and differences become significantly faster, because only a single polygon needs to be run through those expensive algorithms instead of tens or hundreds of thousands.

(Sidenote: because some geologic map sources contain so many complex polygons, a simple ST_Union in PostGIS cannot always be used, as it will throw an error indicating that a maximum array size has been reached. To get around this, I use Shapely and Python’s multiprocessing module. For each source, I create groups of polygons no larger than 10 million vertices each and use Shapely to union the polygons in each group. To speed things up, multiprocessing is used to distribute these jobs across multiple CPUs. Once every group has been unioned, the resultant geometries are unioned to form the final geometry.)
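A condensed sketch of that chunk-and-union strategy, assuming Shapely. The function names and the sequential fallback are my own, and the vertex count here only considers exterior rings for brevity:

```python
from multiprocessing import Pool
from shapely.ops import unary_union

def chunk_by_vertices(polygons, max_vertices=10_000_000):
    """Group polygons so no group exceeds max_vertices exterior
    vertices (interior rings ignored for brevity)."""
    groups, current, count = [], [], 0
    for poly in polygons:
        n = len(poly.exterior.coords)
        if current and count + n > max_vertices:
            groups.append(current)
            current, count = [], 0
        current.append(poly)
        count += n
    if current:
        groups.append(current)
    return groups

def union_source(polygons, workers=None):
    """Union each group (optionally across CPUs), then union the
    partial results into a single geometry."""
    groups = chunk_by_vertices(polygons)
    if workers:
        with Pool(workers) as pool:
            partials = pool.map(unary_union, groups)
    else:
        partials = [unary_union(g) for g in groups]
    return unary_union(partials)
```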

To create the preprocessed table of geometries, a few steps need to happen. Within each scale, we first take the unioned geometry of each high-priority source and use it to clip the lower-priority sources, and then combine that result with the geometries of the high-priority sources. This gives us a non-overlapping set of geometries that represents a given scale.
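In Shapely terms, the clip-and-combine step for a single pair of sources might look like the following sketch (the actual pipeline performs these operations in the database, and the function name is my own):

```python
from shapely.ops import unary_union

def combine_by_priority(high, low):
    """Clip the lower-priority geometry by the higher-priority one,
    then merge the two into a non-overlapping whole."""
    clipped_low = low.difference(high)  # remove the overlap from `low`
    return unary_union([high, clipped_low])
```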

Next, we union the unioned geometries of an entire scale and use that resultant geometry to clip the lower-priority scale. This results in a non-overlapping set of geometries that combines multiple scales and takes into account the quality of sources within each scale.

Combination of the small and tiny scales of Burwell

Finally, to speed up the delivery of this data, the result of each query is run through mapshaper, an excellent tool for simplifying geometries. While it adds negligible overhead to the request (around 200–300ms on average), the response can be cut down to about 17% of its original size with an imperceptible loss in accuracy, especially at the scales at which this data is intended to be viewed. This means a response that used to be 2.3 MB gzipped is now 485 KB gzipped, which is a huge difference for data that is typically consumed on a mobile device.
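mapshaper is what the pipeline actually uses; as an in-process analogue of the same idea, Shapely’s Douglas-Peucker `simplify` shows how much geometry a small tolerance can shed while barely changing the shape:

```python
from shapely.geometry import Point

# A dense, near-circular polygon standing in for a complex map unit.
detailed = Point(0, 0).buffer(1.0, 64)  # 64 segments per quarter circle

# Simplify with a small tolerance; most vertices are discarded,
# but the area changes very little.
simplified = detailed.simplify(0.01)
```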

The most important lesson I learned from this exercise is that there is nothing wrong with maintaining an alternate version of your data strictly for visualization purposes. If you frequently create maps of large, complex datasets, storing simplified or generalized versions can make the design process significantly easier. Additionally, if the data is only used for visualization, especially at small scales, simplifying the geometry will provide significant performance gains and may even be more aesthetically pleasing.

Going forward, these visualization-specific variants of the data will be useful for applications such as Flyover Country, and also open the door to producing vector tiles.