Simpler OpenStreetMap Data Exchange Formats
Simpler OpenStreetMap Data Exchange Formats
The PBF format appeared at the end of 2010 as a compact binary alternative to the voluminous XML representation of OpenStreetMap data. It was a major step forward, reducing the size of files by half or more and making OSM processing much less resource-intensive. I have made heavy use of PBF and certainly appreciate the benefits of the format, but having now written several pieces of software that consume and produce it I have come to the conclusion that PBF is significantly more complex than would be necessary to meet its objectives of compactness and speed.
While working on Vanilla Extract I created a plain text debug output format with one OSM entity per line. On a hunch I applied a couple of the most effective tricks from PBF (delta compression and variable-byte zigzag encoding), piped the resulting binary stream through a general purpose compression algorithm, and was pleasantly surprised to discover output that was consistently about 8% smaller than PBF and faster to process despite requiring substantially simpler code.
The columns labeled “vex” represent the size of OSM data in this format, in comparison with equivalent PBF and bzipped XML files, as well as the text format (more on that below).
Observations on PBF
Over several years of working with the PBF format I’ve encountered the following issues which motivated and informed the design of my format:
- PBF has a multi-level structure. Optionally compressed Blobs contain PrimitiveBlocks, which contain PrimitiveGroups, which finally contain lists of OSM entities. While breaking the file into blocks is reasonable, this layering strikes me as unnecessarily complex. A single flat stream is much simpler to work with.
- There is a separate “dense nodes” data type and heavy use of parallel arrays to deter Protobuf from adding integer tags to each stored field. This need to circumvent the natural mapping of one Protobuf message to each OSM entity (i.e.one Protobuf type to each OSM entity type) is a sign that Protobuf is not an ideal fit for OSM data where compactness is desired.
- String tables are used to replace highly repetitive strings such as tag keys with integers. Because PBF blocks can be (are usually are) gzipped, these string tables are redundant: compression amounts to building a very efficient lookup table for all data in the stream, strings or otherwise. The additional code and computation are not neccessary.
- There are two separate Protobuf specifications used in PBF, one for the file block structure and one for the OSM data. Actually encoding and decoding the format requires the implementer to intersperse quite a bit of hand-written auxiliary code into that generated from these specifications. In my view this defeats the purpose of using a schema-driven declarative code generator like Protobuf.
Applying the most effective elements of PBF
The aspect of Protocol Buffers that the PBF format benefits from most is the variable-byte zigzag coding of integers, which PBF applies to delta-coded fixed-precision coordinates. This technique used in isolation yields a better compression rate than PBF when coupled to a general purpose compression library like zlib. Interestingly, even storing the delta-coded values as fixed-width integers yields output roughly the same size as PBF. Gzip is quite effective at compressing out all the empty space and repeated small numbers.
Armed with this knowledge, we can achieve compression ratios and processing speeds markedly better than those of PBF with file formats that are much simpler to implement. Both of the OSM exchange formats in widespread use today are complex to parse, and this is likely impeding the development of OSM infrastructure components. Therefore I encourage the OSM community to consider employing much simpler binary formats. I have included an example implementation of such a format in Conveyal’s OSM library. The diagram below summarizes the format. The colored rows represent the flat data stream (all numbers are variable-byte coded), and the nested boxes above those rows show how the stream is structured.
Each block contains entities of a single type (nodes, ways, or relations). This diagram shows one block of each entity type, but there could be zero, one, or several blocks of each type. Every entity begins with the same common segment consisting of an ID and a list of tags, which is followed by a type-specific segment: nodes have a pair of coordinates, ways have a list of their constituent nodes, and relations have a list of references to other entities. Delta coding and variable-byte coding are applied throughout. This example does not contain author or timestamp metadata (we omit it in routing applications) but it could easily be incorporated into the common segment of every entity.
Note that the grammar for this format is LL(1) and it can therefore be consumed top-down using nothing more than a few clearly named functions that call one another.
The potential of text formats
In light of these results, I gave my original text OSM format a quick makeover and performed some similar tests. Here’s an example of this basic line/field based format:
W 26201468 name=Larch Mountain Trail 441;highway=footway;sac_scale=yes
N 491901232 45.575881 -122.113032
N 1540261796 45.575889 -122.112751
W 35567365 oneway=yes;highway=service;service=driveway
N 416781319 45.578931 -122.116013
N 416781320 45.578946 -122.116195
N 806988988 45.579015 -122.117271
N 806989064 45.579041 -122.117595 highway=crossing
N 697817499 45.579072 -122.118158
N 697817504 45.579103 -122.119243
Each line represents an OSM entity. The first field gives the element type (W=way, N=node, R=relation). Node lines have the fields [N nodeID lat lon tags] and way lines have the fields [W wayID tags]. Each way is followed by an ordered list of the nodes that make it up. The full details of each node are included inline in this list. This text is piped through gzip to compress out repetitive tokens and number fragments. To avoid redundancy, once a node has appeared in the stream it could subsequently be referenced by ID alone. However this optimization has only a minor effect on file size that is not really worth the increase in fragility and complexity.
In experiments, this format is about 40–50 percent bigger than eqivalent PBF, but still 18% smaller than bzipped XML (and keep in mind that it is using faster zlib compression). The code to write and consume it is obviously simpler than that for PBF and XML, but the VEX binary format remains clearly superior for most purposes given its compactness, lack of string/numeric conversions, and only marginally higher code complexity.
EDIT: The text format described here is reminiscent of the OPL format. However, OPL considers each line in isolation. An important characteristic of the format described above is in-lining a full representation of each referenced entity (nodes) after their parent entities (ways).
Comparisons with O5M
In the charts above the VEX format is also compared to the O5M format which has some similar goals. While O5M possesses a flat binary structure, like PBF it still requires consumers and producers to manage string tables. While the string tables in PBF are merely redundant due to the optional zlib compression usually applied on top of them, O5M uses string tables specifically to avoid relying on general purpose compression. The tables are of a fixed size with a least-recently-used eviction strategy, making them more complex to implement and prone to backfiring on certain input. I question the effectiveness of this design decision considering that zlib compression is quite resource efficient, can operate on streams, and is available practically everywhere. Use of general purpose compression seems particularly justified in light of the above test results: O5M is actually larger than a basic text dump piped through zlib, and gzipped O5M (in which string table construction is redundant) is larger than both PBF and VEX. In one case (the state of Oregon) the O5M output was actually 12% larger than bzipped XML.