Native Support for British National Grid with Mosaic on Databricks — The Performance Catalyst
British National Grid — BNG shaped keys to the kingdom
With the version 0.3.2 update to the Mosaic framework, we have introduced British National Grid (BNG) as a first-class citizen. We continue to work on making geospatial data processing simpler and more scalable. In this case our focus was on the UK Public Sector and UK-based organisations that consume geospatial data from Public Sector data providers such as Ordnance Survey, DEFRA, the Met Office and others.
BNG is a geospatial indexing system that overlays a grid of squares across all of Great Britain. Each square is subdivided into 4 smaller ones, with each sub-square being divided again in turn. This provides a set of squares that cover GB ranging from 1000km2 down to 1m2. BNG is a central concept for bringing geospatial data together in the context of the UK Public Sector. Many departments put geospatial data at the centre of their data analytics, insights production and statistical work, but tooling is not standardised across teams, and many organisations are struggling to scale their workloads to the Big Data world of today. With the addition of BNG support, Mosaic now provides an easy-to-use, scalable and open source toolkit for working with British geospatial data. By providing standard transformations in Apache Spark on Databricks, it allows geospatial experts to work on big data problems that were previously intractable, without requiring prior expertise in Big Data processing.
The National Data Strategy (NDS) clearly outlines the need for better data quality, more standardisation and greater interoperability in order to drive economic growth and enable public good outcomes. These concepts all come into play when establishing a standard way to store geospatial data that ensures both good performance and interoperability, internally and externally.
By using BNG one can make sure that they are standardising to the representation many UK Public Sector organisations have adopted. This means you can preempt many challenges that naturally arise with geospatial data processing at scale. Integration and combination with external data sources will be much simpler, since both the external data and your internal data will adhere to the same indexing standard. This is paving the way towards a geospatial data marketplace within the UK Government.
A hard byte to chew on
- There are many layers: geospatial data processing involves many joins, and joins tend to be the most time-consuming operations in large scale computing.
- Some slices are larger than others: most data sources have hotspots, i.e. data is clustered around specific areas. This makes partitioning hard, because some parts of a distributed join will run much slower than others. This is known as join skew.
- Some bites have more custard: different rows in the data can represent very different things. In vector sources, shapes vary widely in complexity: in the same data source one row can describe a river (a LineString with a few dozen vertices) and another a lake with a few small islands (a Polygon with multiple holes, represented by hundreds of vertices).
- Some layers don’t stick together: combining vector and raster data isn’t straightforward because of their different storage formats and indexing patterns.
The approach that addresses all of the above problems is centred around grid index systems such as H3, S2, geohash and, of course, British National Grid (BNG). The Mosaic framework was originally inspired by a joint project from Ordnance Survey, Microsoft and Databricks (link) and it continues to be inspired by examples such as the great work of Thasos (link) who are processing petabyte scale of geospatial data using Mosaic on Databricks.
Why Local Grid
Up until now Mosaic supported the H3 grid index system as a universal indexing system. H3 as a grid system brings a lot of value to the framework:
- Databricks has a native implementation of many H3 expressions in Photon, which brings unrivalled performance for grid indexing capabilities.
- Mosaic uses these capabilities on enabled workspaces in Databricks.
- H3 forms an envelope around the globe, i.e. we can use it for any locality.
- H3 is a hierarchical system, which means we can index our data at different resolutions. This helps us address data skew differently in different circumstances.
- There are many 3rd party solutions (e.g. Carto, Safe Software, etc.) that have already adopted H3 as an indexing system and centering around H3 makes integrations much easier.
However, there are disadvantages to consider when working in a local context. Global grid index systems such as H3 are designed around a global grid, which can introduce unnecessary complications when generating non-global geospatial data products. Organisations like Ordnance Survey have been producing geospatial and cartographic data for centuries with a clear focus on national geospatial data needs. The main disadvantage of H3 here is that it requires data in the EPSG:4326 coordinate reference system, whereas most geospatial data for Great Britain is provided in EPSG:27700 (OSGB36). To use H3 we would therefore need to perform coordinate reference system conversions, which are costly and can introduce computational errors into the data.
Another very important consideration is catering for existing data consumers and delivering data in the formats they expect. Using a global index system would introduce changes into the data supply chain that, other than standardising to another system, bring little to no value. Taking all of this into account, the natural choice is to keep BNG as the indexing standard and focus on delivering a stable and performant implementation of BNG.
British National Grid & Mosaic
We have implemented BNG natively as part of the Mosaic offering. This means that all Databricks users in the UK can easily access a BNG implementation that is robust and scalable. This applies to organisations in both the Public and Private sectors, enabling much wider adoption of the standard and easier integration of both open and commercial data assets.
Mosaic brings all the abstractions needed to use BNG on Spark with little to no additional effort from end users, while introducing several optimisations at the same time. Mosaic implements BNG natively within Apache Spark by extending the Catalyst Expressions. This means no user defined functions (UDFs) are needed. One can use these expressions in Python, SQL, R and Scala, which unifies the geospatial tooling across user groups that may prefer different programming languages.
The availability of BNG is paving the way to implement other coordinate reference systems based on planar concepts similar to northings and eastings. The next planned extensions to the supported grid index systems will include Irish Grid Reference (IGR) to ensure complete United Kingdom coverage.
Mosaic introduces some optimisations to the way BNG can be used. The traditional way of using BNG is to reference string typed cell IDs: TQ2323 for resolutions in base 10 metres and TQ2323SW for resolutions in base 5 metres. Mosaic adds a way to encode these strings as long type identifiers. Long type identifiers are better suited to operations like joins and ZORDER of your geospatial Delta tables, which helps ensure the best performance.
While Mosaic implements the opinionated long based approach, the traditional approach of encoding BNG cell IDs as strings is supported natively. One can easily switch between the two approaches by a configuration on the Mosaic execution context.
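To make the traditional string cell IDs concrete, here is a minimal pure-Python sketch of how a reference such as TQ2323 or TQ2323SW can be derived from an eastings/northings pair in EPSG:27700. It is not Mosaic code: the function name `bng_reference` and its parameters are hypothetical, and the digit order (easting digits first, then northing digits) follows the usual Ordnance Survey convention, which is an assumption of this sketch.

```python
# Minimal sketch: eastings/northings (EPSG:27700, metres) -> BNG reference string.
# Names are illustrative only and are not part of the Mosaic API.

POWERS_OF_TEN = {10 ** k for k in range(6)}        # 1 m .. 100 km cells
QUADRANT_SIZES = {5 * 10 ** k for k in range(5)}   # 5 m .. 50 km cells


def bng_reference(eastings: int, northings: int, cell_size_m: int) -> str:
    if not (0 <= eastings < 700_000 and 0 <= northings < 1_300_000):
        raise ValueError("point is outside the British National Grid extent")
    if cell_size_m not in POWERS_OF_TEN | QUADRANT_SIZES:
        raise ValueError("unsupported cell size")

    # Letter pair of the 100 km square (the letter I is skipped in the grid).
    e100, n100 = eastings // 100_000, northings // 100_000
    first = (19 - n100) - (19 - n100) % 5 + (e100 + 10) // 5
    second = (19 - n100) * 5 % 25 + e100 % 5
    letters = "".join(
        chr(ord("A") + idx + (1 if idx > 7 else 0)) for idx in (first, second)
    )

    # Base 5 sizes are quadrants of the next larger base 10 cell.
    has_quadrant = cell_size_m in QUADRANT_SIZES
    base = cell_size_m * 2 if has_quadrant else cell_size_m

    # Digits per axis: 0 for a 100 km cell, up to 5 for a 1 m cell.
    n_digits = 5 - (len(str(base)) - 1)
    digits = ""
    if n_digits:
        e_bin = (eastings % 100_000) // base
        n_bin = (northings % 100_000) // base
        digits = f"{e_bin:0{n_digits}d}{n_bin:0{n_digits}d}"

    quadrant = ""
    if has_quadrant:
        half = base // 2
        quadrant = ("S" if northings % base < half else "N") + (
            "W" if eastings % base < half else "E"
        )
    return letters + digits + quadrant


print(bng_reference(523456, 123456, 1000))  # 1 km cell
print(bng_reference(523456, 123456, 500))   # 500 m quadrant of that cell
```

With these inputs the sketch yields TQ2323 for the 1 km cell and TQ2323SW for the 500 m quadrant, matching the two styles of cell ID mentioned above.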
Finally, Mosaic supports easy plotting of BNG cells in KeplerGL using mosaic plotting magics in Databricks notebooks. This brings an interactive development experience to the end user.
For a full set of capabilities please visit our documentation page (link).
We provide full walkthrough notebooks (link) that demonstrate how to configure Mosaic for BNG and how to perform your first optimised massive scale geospatial joins based on BNG.
Special thanks to:
Steven Kingston (Ordnance Survey) for inspiration for long type encoding and constructive feedback during the implementation
Dr Linda Sheard (Microsoft) for being an amazing Mosaic champion
Ed Fawcett-Taylor and Dan Lewis (DEFRA) for many hours discussing BNG and how it can be applied to both vector and raster data
Erni Durdevic, Robert Whiffin, Michael Johns (Databricks) for all the work they put into building and testing the BNG capability
For the curious — Space filling in BNG using long cell IDs
Space filling curves are a very important concept when it comes to optimising storage of your data for big data use cases. Delta implements this concept as part of the ZORDER operation. ZORDER is defined as a multidimensional clustering that ensures similar data is collocated in the same set of files in the storage location. Given that geospatial data represents spatially correlated rows of data, and most geospatial queries rely on the vicinity of geospatial features, ZORDER-ing a table based on a spatial grid index can significantly improve the performance of queries.
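As a rough illustration of what a space filling curve does, the sketch below builds a generic Z-order (Morton) key by interleaving the bits of two grid coordinates and sorts cells by that key. It is a textbook construction shown under stated assumptions, not the internals of Delta's ZORDER; the function names `interleave_bits` and `z_order` are hypothetical.

```python
def interleave_bits(x: int, y: int, bits: int = 16) -> int:
    """Interleave the low `bits` bits of x and y into one Z-order key."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)       # x fills the even bit positions
        key |= ((y >> i) & 1) << (2 * i + 1)   # y fills the odd bit positions
    return key


def z_order(cells):
    """Sort (x, y) grid cells along the Z-order curve."""
    return sorted(cells, key=lambda c: interleave_bits(c[0], c[1]))


grid = [(x, y) for y in range(4) for x in range(4)]
ordered = z_order(grid)
# The first four cells visited form one 2x2 block, so neighbours stay together.
print(ordered[:4])
```

Sorting by such a key keeps cells that are close in two dimensions close in the one-dimensional order, which is why clustering files on a grid index helps queries that filter by location.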
The way BNG is set up naturally lends itself to space filling. The cell IDs are defined as a zig zag pattern between the 2 coordinates. The first letter is computed from the northings coordinate and the second letter from the eastings coordinate. The numerical bins follow the same pattern: the first N digits are based on the northings value and the second N digits are based on the eastings value. The number of digits N is determined by the resolution at which we are indexing our data.
This approach means we are drawing a specific space filling curve.
This concept applies in the case of base 5 metres indexing resolutions as well. In this case we are adding a quadrant encoding as a letter pair (SW, NW, NE, SE). We have added an encoding number in our integer packing to address this need.
Finally, putting everything together, we get the space filling curve for the long type cell ID encoding in the 1(NL)(EL)(NB)(EB)(Q) format, where:
- NL are 2 digits encoding the northings letter
- EL are 2 digits encoding the eastings letter
- NB are N digits encoding the northings bin (N is determined based on the selected resolution)
- EB are N digits encoding the eastings bin
- Q is 1 digit encoding the quadrant
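The field layout listed above (northings letter, eastings letter, northings bin, eastings bin, quadrant, behind a leading 1) can be sketched as decimal digit packing. This is only an illustration and not Mosaic's implementation: the function names are hypothetical, and both the two-digit letter codes (alphabet position, A = 00) and the quadrant numbering (0 = none, then SW, NW, NE, SE) are assumptions made for this example. The leading 1 presumably keeps leading zeros of the letter code from being lost when the digits are read as a number.

```python
# Hypothetical digit packing that follows the 1(NL)(EL)(NB)(EB)(Q) layout.
# Letter codes and quadrant numbers are assumptions made for this sketch.
QUADRANTS = ["", "SW", "NW", "NE", "SE"]  # list index = assumed quadrant digit


def pack_cell_id(n_letter, e_letter, n_bin, e_bin, n_digits, quadrant=""):
    """Pack the components of a cell into a single integer."""
    if not (0 <= n_bin < 10 ** n_digits and 0 <= e_bin < 10 ** n_digits):
        raise ValueError("bin does not fit in the requested number of digits")
    nl = ord(n_letter) - ord("A")  # assumed 2-digit letter code
    el = ord(e_letter) - ord("A")
    bins = ""
    if n_digits:
        bins = str(n_bin).zfill(n_digits) + str(e_bin).zfill(n_digits)
    return int(f"1{nl:02d}{el:02d}{bins}{QUADRANTS.index(quadrant)}")


def unpack_cell_id(cell_id):
    """Recover the components; N is implied by the total number of digits."""
    s = str(cell_id)
    n_digits = (len(s) - 6) // 2  # 6 fixed digits: 1 + NL + EL + Q
    n_letter = chr(int(s[1:3]) + ord("A"))
    e_letter = chr(int(s[3:5]) + ord("A"))
    n_bin = int(s[5:5 + n_digits]) if n_digits else 0
    e_bin = int(s[5 + n_digits:5 + 2 * n_digits]) if n_digits else 0
    return n_letter, e_letter, n_bin, e_bin, n_digits, QUADRANTS[int(s[-1])]


cell = pack_cell_id("T", "Q", 23, 23, 2, "SW")
print(cell)
print(unpack_cell_id(cell))
```

Even at the finest resolution (N = 5) this layout uses 16 decimal digits, which fits comfortably in a 64-bit long.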
Either of the two available BNG encodings combines well with the ZORDER operation in Delta. The long typed encoding is slightly more performant thanks to a more optimised data type and better space filling, while the string typed encoding is more widely used and more interoperable. Depending on the application, one can easily convert between the two formats using Mosaic functionality.