Tapalcatl: cloud-optimized tile archives

Update: I’ve posted a follow-up to this with some clarifications, implementations, and more formalized specification here: https://medium.com/@mojodna/tapalcatl-2-updates-8eb827d969c1

With all of the discussion and energy around Cloud Optimized GeoTIFFs (COGs) over the past year or so, it periodically makes me think about efficient storage of and access to data that’s already been tiled. This is a problem that’s becoming more and more personal as projects like OSMesa create massive numbers of tiles that I’d like to provide easy access to (while managing them efficiently).

One of the many interesting (and under-rated) things that came out of the great Mapzen experiment is Tapalcatl. Its README describes it as

.. a ‘metatile server’, which takes requests for a specific single format and tile coordinate and makes a request upstream for a ‘metatile’ or bundle of tiles which contains the tile originally requested.

What is Tapalcatl?

Strictly speaking, Tapalcatl refers to the tile server. However, for our purposes here (and as a general reference), let’s treat “Tapalcatl” as shorthand for the format that‘s used under the hood, which is a ZIP file containing a compressed directory of tiles. Tiles may be present in multiple formats (GeoJSON, TopoJSON, Mapbox Vector Tile) and are grouped into separate archives according to the “metatile” that they belong to. This reduces the number of files required to store a map style tremendously (even more so when storing multiple variants together).

3x3 metatile, incorporating 9 regular tiles

If the number of files at a given zoom level is 4^zoom(it is), a metatile size of 2 (each metatile contains 2^size tiles) will reduce that to 4^(zoom — 1). With each doubling of the metatile size, the number of tiles will be reduced by a quarter.

This is a big deal, as a full set of only zoom 15 tiles is over 1 billion files (4¹⁵ = 1,073,741,824, 1,431,655,765 tiles encompassing all zooms). Just writing (or deleting) this many files can take days and will push most filesystems to their limits (ever run out of inodes before running out of disk space? This’ll make that happen).

With a metatile size of 2, that billion tiles drops to a mere 268 million (4¹⁴ = 268,435,456). With a metatile size of 8 (covering 4 zooms — 2³ — as recommended below), it drops further to just under 17 million (4¹² = 16,777,216, 22,369,621 in total). This is a seriously big deal when managing files and even more of a benefit when considering cost: there will be 1/64th (4¹⁵/4¹²) the number of PUT requests and many fewer GET requests, as each request can cover multiple tiles.

There are latency considerations at play here too, particularly when used with cloud object stores like Amazon S3. S3 has relatively high latency, which is why it’s usually used with CloudFront when serving content to users. At scale, the use of a content delivery network hides this for popular content, but it turns out that accessing fewer, larger files (and caching them, fully or in part) also reduces the overall latency in the system by decreasing the number of upstream requests. There’s a slight increase in the cold start time, as more data needs to be loaded before serving the first tile, but if the data and the tile server are on the same network, this is negligible.

Anyway, Tapalcatl (the tile server) allows the metatile size to be configured and will read tiles out of a dramatically smaller set of ZIPs (with the metatile size balanced against the size of individual archives), easing management and data updates. (It can optionally hash coordinates for use as a path prefix in order to improve performance on certain object stores by distributing objects across multiple partitions or wider trees — the typical {z}/{x}/{y} scheme results in very uneven partition sizes, as each partition initially corresponds to a single zoom, with the number of objects increasing by a factor of 4 for each level).

This is interesting because…

ZIP is a clever format. It harks back to the days when floppy disk were prevalent and I/O was an extremely expensive operation. As a result of this constraint, ZIPs facilitate random access and cheap updates as part of their design.

Random access is achieved by providing a central directory, which contains a listing of all files contained in the archive (as well as their metadata and offsets). Once the central directory has been read, entry offsets can be used to seek to the portion of the file that contains a file of interest.

Furthermore, central directories are located at the end of the archive. Thus, assuming an archive without a comment (another interesting ZIP feature; more on this in a bit), a full accounting of the contents can be made with 2 reads: 22 bytes at the very end of the tile for the EOCD record (which contains the size of the central directory) and <central directory size> bytes for the central directory itself. (Both headers contain signatures, so a single, aggressively-sized read of the end of the file can capture both the EOCD and central directory.)

Because the central directory is at the end of the archive, updates can be made cheaply by appending compressed files followed by an updated central directory containing entries for the new version of the file (which also accounts for “deletes”). This creates dead zones within the ZIP for outdated content, but disk is cheap and I/O is expensive, right? ZIPs can be rewritten when it’s deemed that they’ve become too inefficient due to these dead zones dominating.

This contrasts starkly to the tarball, the lingua franca of Unix systems for decades. The fundamental difference is stream-ability; ZIPs cannot be read efficiently in a streaming manner because the authoritative directory of entries is at the end of the tile. Tarballs include headers before file content, allowing files to be skipped when reading sequentially. This means that a list of all files in a tarball requires seeking through the whole file (granted, skipping potentially large swaths). It also means that files stored within tarballs cannot be accessed randomly, especially when compression is applied across the top of the concatenated set of files (block-level compression helps this some, but still requires that offsets be calculated beforehand).

Key takeaway: ZIPs facilitate random access. HTTP Range Requests facilitate random access to remote files. This means that individual files can be extracted from remote ZIP files using 2–3 requests with the total transfer size being <EOCD header size> + <central directory size> + <compressed file size>. In other words, by requesting mere kilobytes from potentially multi-gigabyte files.

Right, so Tapalcatl uses ZIPs.

If you squint, Tapalcatl is the tile (raster, vector, generic data) equivalent of an internally tiled GeoTIFF, facilitating partial reads without needing to load the full dataset (i.e. fully reading the archive containing a target tile). However, rather than coming from the single-file approach taken by large rasters, Tapalcatl approaches the problem from the opposite pain point: too many files.

I’ve been saying for a while that ZIPs are the Cloud Optimized GeoTIFF of tiles. Tapalcatl is only missing 2 things to fulfill that promise: use of overviews (lower-resolution versions of primary content) and metadata. Both are easily addressed.

Overviews

Overviews for tiles already exist: they’re called lower zooms. Metatiles of size 2 (4 tiles) have 1 overview level. Size 4, 2 levels.

Metatiles with overviews can be thought of as sub-pyramids. 0/0/0 is one such sub-pyramid. With a metatile size of 8, there are 4 overview levels and 64 tiles at the highest zoom (4³ = 64), with a total of 85 tiles(4³ + 4² + 4¹ + 4⁰). Thus, the first 4 zooms of a tileset can be contained in a single archive. The next 4 zooms (zooms 4–7) occupy 256 archives (4⁴).

Tapalcatl archives in the wild contain a structure that looks like a sub-pyramid (see the Tilezen Tapalcatl Formulation below). However, rather than being used as overviews, they‘re used to support multiple tile sizes with differing levels of detail; 0/* within the sub-pyramid corresponds to a 1024𝗑1024 pixel view of the world, 1/* to 512𝗑512, and 2/* to 256𝗑256. When lower resolution data is desired, sub-zooms (0/* or 1/*) of lower zoom tiles are requested. (Note: the Tilezen tiles are vector-only, so raster considerations don’t apply other than incorporating 256𝗑256 tiles as a baseline “standard size”, even though they’re not always generated.)

The idea is that the coordinates within the metatile are offsets to the coordinate of the metatile itself, so 2/dx/dy corresponds to the smallest tile and 0/0/0 the largest. For historical reasons, our smallest is 256𝗑256px — even when we don’t actually generate those size tiles.

If you peruse Tilezen source code, you may encounter references to “nominal zoom”, the concept which led to this implementation. It’s confusing, and there seem to be clearer ways of achieving resolution-independence. My instinct is to treat scale as a top-level concern and to adopt <filename>@<scale>x.<extension>-style filenames as a convention.

Metadata

For a Tapalcatl archive to be useful, we need to know a few things about it:

  • what’s the metatile size? this can be inferred by reading the central directory, but explicit is better than implicit
  • how many overview levels are present? also implicit in the archive’s contents
  • what’s the sub-pyramid root? implicit in archive’s contents source path
  • are there format variants present? GeoJSON, TopoJSON, MVT, Geobuf, scale-variant PNG, JPEG, etc.
  • what’s the content-type for each variant? this can be used as a default rather than providing a content-type for each tile
  • what scales are available for each format? implicit in the archive’s content, but more useful if explicitly defined in metadata
  • is there additional metadata present? TileJSON is a good set of per-format metadata, including layers present in an MVT tileset and the overall bounds (helpful to explain why a sub-pyramid may be incomplete)

We have a pair of options for storing metadata within a Tapalcatl ZIP.

We could store a meta.json with the above metadata fields in the ZIP. However, reading this requires 2–3 reads and includes fetching the central directory (which may be very large and isn’t strictly necessary when scanning archives).

As I alluded to above, the ZIP format includes the ability to store archive- and file-level comments. Archive-level comments are stored at the very end of the ZIP and thus will be included when reading the EOCD record (which is required before doing anything else). These are not compressed, but our metadata JSON certainly isn’t huge in the grand scheme of things.

ZIP archive-level comments seem like the right place to store global metadata.

Per-Tile Metadata

Individual tiles also have metadata needs. When serving them up, it’s important to provide appropriate Content-Type and Content-Encoding headers and helpful to provide others, such as Content-Length, ETag, and Last-Modified. Special circumstances may warrant additional ones.

As Content-Type and Content-Encoding are likely to be common across all tiles belonging to a particular variant, deferring to archive-level metadata accounts for these.

Content-Length can be fulfilled by using “uncompressed size” from the central directory. CRC-32, while problematic and subject to hash collisions, is probably suitable enough for use as an ETag value, and file modification date / time can be used to construct Last-Modified (the ZIP format limits its precision to 2s, but that’s also good enough for our purposes). Using the latter 2 fields facilitates efficient responses to HTTP requests containing ETag or If-Modified-Since, as the data will already be available in the central directory without needing to read the file.

What to do about additional headers, then? As with archive-level metadata, we have 2 options: {z}/{x}/{y}.meta.json or as file-level comments. File-level comments are stored in the central directory (thus completely available prior to reading individual tiles; not a particularly useful trait in this case) and are uncompressed. Standalone metadata files increase the number of archive reads necessary (by 1, though we can structure the ZIP to place metadata files adjacent to their corresponding tile) while increasing the size of the central directory (to the tune of ~64 bytes (including filename approximation) per file). Small JSON files also don’t compress particularly well, especially since each file contained in a ZIP is compressed separately.

Given these trade-offs, storing per-tile metadata as file-level comments seems to make the most sense.

Why not MBTiles or GeoPackages?

MBTiles has long been the preferred format for sharing tilesets (and publishing them to Mapbox). GeoPackage is a relative newcomer (formalizing how Spatialite had been used) and a very flexible file format, as it can store both row-based vector data and rasters. GDAL supports both natively (GeoPackage support since 2.0, read support for MBTiles since 1.10, read/write support for MBTiles rasters since 2.1, and read/write support for MBTiles vectors (i.e. a Tippecanoe alternative) since 2.3).

Both MBTiles and GeoPackage are SQLite-based and result in a single archive containing everything that can be randomly accessed (a very useful characteristic that’s surprisingly uncommon). However, SQLite databases are more difficult to perform partial reads on, as storage is page-based and content navigation requires the SQLite library. It’s an incredible embedded database, but it relies on fast random access to perform appropriately.

Furthermore, the relational and spatial database capabilities are unnecessary when addressing tiles, as keys (“filenames”) can be easily calculated. MBTiles’ claim to reduce storage size by eliminating redundancy is specious for most uses and incurs a performance penalty due to the indirection it uses to achieve it.

An embedded key/value data store (e.g. LevelDB / RocksDB or BerkeleyDB) would be appropriate if one were committed to using a database, but it’s difficult to find one that stores everything in a single file (UnQLite comes closest, based on some cursory searches).

What’s a filesystem, if not a key/value store mapping filenames to content? What’s a ZIP, if not a compressed filesystem?

The Tilezen Formulation — Tapalcatl 1.x

  • ZIPs each containing 3 nominal zooms (corresponding to a metatile size of 4 for “standard” 256𝗑256 tiles). The size of the root archive (0/0/0.zip) is 7.7MB; the zoom 10 tile covering Bend, OR is 925KB. The average archive size is 61KB.
  • Multiple formats. Tangram GeoJSON-ish (.json), TopoJSON (.topojson), and Mapbox Vector Tile (.mvt).
  • TileJSON is derived and not explicitly stored anywhere
  • (optional) hash prefixes — for improved partition distribution / filesystem tree breadth. MD5("{z}/{x}/{y}.zip")[:5]or MD5("/{layer}/{z}/{x}/{y}.zip")[:5]

There are a few inconsistencies to my eye, but that have historical reasons behind them.

First, tiles within each sub-pyramid are named according to their relative position, so 10/166/372 is 0/0/0(this theoretically helps with de-duplication, but that hasn’t been implemented yet). If you are given an archive without knowing its path (zoom, x, y), the only way to know its coordinate is to extract coordinates from its root tile. By preserving absolute positions, knowing the archive’s path is unnecessary and multiple archives can be blindly unzipped to produce a filesystem extract.

Second, the notion of “nominal zoom” results in paths that look like sub-pyramids but aren’t.

Tapalcatl 2 (proposed)

Based lessons learned from Tapalcatl 1.x and Cloud Optimized GeoTIFFs, I propose the following high-level requirements for Tapalcatl 2:

  • ZIPs containing sub-pyramids, each containing 4 zoom levels, corresponding to a metatile size of 8 (configurable, seems like a reasonable default) — this attempts to strike a balance between too many files to manage and files that are too large to manage when updating with new data.
  • archive metadata (as archive-level comments) to facilitate simple sharing of internally-consistent datasets — partial collections of sub-pyramids are fully valid tilesets that are easier to extract / take offline (at their natural granularity) without shuffling tiles around.
  • optional per-tile metadata (as file-level comments) to provide additional context / HTTP headers when needed
  • tiles may be sorted by z-value — preserves locality for efficient coalesced reads and partial caching
  • may contain multiple variants (by prefix) — per-variant metadata is present in archive-level metadata
  • may contain multiple formats(by suffix / extension, e.g. @2x.<extension> or .png) — per-format metadata is present in archive-level metadata, keyed by variant name.
  • optional hash prefixes — for improved partition distribution / filesystem tree breadth. md5("{z}/{x}/{y}")[:5]

I’ve put together a metadata proposal as an Observable notebook.

Next Steps

As the revisions comprising Tapalcatl 2 are largely theoretical, work needs to be done on the tooling to turn it into reality. Here are some rough TODOs for that path:

  • ✔️ write this document
  • ✔️ circulate this document among current Tapalcatl users
  • circulate among users with Tapalcatl-shaped problems
  • define / document archive-level metadata
  • write a proof-of-concept JavaScript library to validate concepts and serve as the core of some subsequent tasks
  • write a Lambda proxy w/ efficient support for larger Tapalcatl 2 archives — to allow Tapalcatl 2 sources to be accessed like normal tile services
  • write a command line archive creation tool (zip + zipnote work in the meantime)
  • write a Tilelive module — for reading/writing/conversion using existing tools (e.g tessera and tl)
  • write a Service Worker —partial reads of Tapalcatl 2 archives can be done within a browser (likely transparently for Leaflet / Mapbox GL, as Service Workers can be programmable network proxies), including block-level caching (yauzl + zlib-browserify)
  • write a Python BytesIO (IOBase / RawIOBase / BufferedIOBase) implementation that makes HTTP Range requests on demand; unzip accepts BytesIO inputs, allowing this to work

Client Performance Optimizations

Here are some initial thoughts on optimizing Tapalcatl 2 reads:

  • central directories can be cached locally, keyed by archive path, to provide rapid repeated lookups for tiles
  • partial remote reads (vs. downloading entire archive) — allows larger archives (larger metatile sizes) to be used
  • block-aligned range reads (similar to Nginx’s ngx_http_slice_module) + LRU block caching, trading off fewer requests for larger response bodies.
  • ordering tiles by z-value increases locality with the archive, further increasing the potential cache hit-rate.

Implementations

These are the implementations I know of. I’ll expand this list as support grows.

Contributing

If this sounds interesting to you and/or you have a Tapalcatl-shaped problem (i.e. you’re currently experiencing pain managing tiled data sources) that would benefit from the proposed standardization, please weigh in. There’s no obvious place to coordinate (yet!), so feel free to comment on this post and we can go from there.

I want to thank Nathaniel Vaughn Kelso, Matt Amos, Ian Dees, Chris Holmes, Rob Emanuele, Jennings Anderson, Dave Lindenbaum, and Joe Flasher for their feedback and support.