ZDW: An Efficient Archival Storage Format for Well-Structured Flat Files
Save on long-term archival data storage with ZDW
Within the Adobe Experience Cloud, Adobe Analytics maintains a canonical data store of hundreds of petabytes of raw and enriched customer data across billions of files. This data set grows in size every year, so having a strategy to efficiently store and access this massive amount of data is critical. To meet that need, we developed a novel compression technique called ZDW that is optimized for long-term storage of well-structured flat files.
ZDW provides an average compression ratio approaching 70:1 on Analytics data, which saves millions of dollars in annual storage costs over alternative compression options. The ZDW technology has been recently open-sourced on GitHub and is now available outside of Adobe for others to use.
The name ZDW combines “Z,” as used in compressors such as “zip” and “zlib,” with “DW,” for “data warehouse” archival data. ZDW has gone through multiple iterations over the years to improve efficiency and functionality; this article covers only the most recent version, which is also the simplest and most efficient.
Adobe Analytics maintains a long-term archival copy of canonical customer data. It is stored in well-structured, flat files that contain raw, post-processed and enriched columnar data.
Each file stores a chunk of data for a customer, bounded within a logical time range (i.e., one calendar day or less).
Multiple Analytics services draw from this canonical data source, including:
- Interactive reporting, e.g., Analysis Workspace (maintains a columnar view for accelerated reporting)
- Data Feeds (delivers scheduled batches of raw, hit-level feeds to customers)
- Data Warehouse reporting (ad-hoc OLAP reports with optional custom reporting logic)
- Cross-solution integrations
Most canonical data access traffic is for data received in the last 24 hours. Beyond the most recent days, the canonical data set is rarely accessed.
Under Adobe’s use cases, there is not a pressing need for low latency (i.e., sub-second) access to historical data in the canonical data set. As such, it is expedient to store the archival data in a format optimized for a small, long-term storage footprint.
ZDW usage and composition
The ZDW archival format is for row-oriented, well-structured text (i.e., CSV/TSV formatted) data with explicitly typed columns. ZDW is used in tandem with standard compression formats like XZ (LZMA) and GZIP to yield highly efficient compression. It is best suited for optimizing storage footprint for archival data, accessing large segments of the data, and outputting data row-by-row, as opposed to extracting only a few columns or key values.
ZDW uses a combination of:
- A global, sorted dictionary of unique strings across all text columns, for normalizing all text data
- Typed numeric and text column values, with each column's type taken from the accompanying schema file
- Variable byte-size values for integers as well as dictionary indexes
- Minimum value baseline per column for integers and dictionary indexes, to reduce the magnitude of the value needed on each row of that column
- Bit-flagging repeat column values across consecutive rows (similar approach to run-length encoding, but applied on a per-row basis); for each row, only column values that have changed from the value on the previous row are included.
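The last technique in the list can be illustrated with a minimal sketch. It assumes one change bit per column and plain Python lists for the output; the function names and the bit layout are hypothetical and do not reproduce the actual ZDW byte format.

```python
from typing import List, Optional, Tuple

Row = List[str]


def encode_rows(rows: List[Row]) -> List[Tuple[int, List[str]]]:
    """Encode each row as (change_mask, changed_values).

    Bit i of change_mask is set when column i differs from the previous row.
    The first row has no predecessor, so all of its bits are set.
    (Illustrative layout only; not the real ZDW encoding.)
    """
    encoded = []
    prev: Optional[Row] = None
    for row in rows:
        mask = 0
        changed = []
        for i, value in enumerate(row):
            if prev is None or value != prev[i]:
                mask |= 1 << i
                changed.append(value)
        encoded.append((mask, changed))
        prev = row
    return encoded


def decode_rows(encoded: List[Tuple[int, List[str]]], num_columns: int) -> List[Row]:
    """Rebuild full rows by carrying unchanged values forward."""
    rows = []
    current = [""] * num_columns
    for mask, changed in encoded:
        values = iter(changed)
        for i in range(num_columns):
            if mask & (1 << i):
                current[i] = next(values)
        rows.append(list(current))
    return rows


rows = [
    ["visitor-1", "US", "home"],
    ["visitor-1", "US", "cart"],
    ["visitor-2", "US", "home"],
]
encoded = encode_rows(rows)
decoded = decode_rows(encoded, num_columns=3)
```

In this example the second row stores only the changed page value, and the third stores only the visitor and page values; the repeated country code is never written again.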
These approaches leverage domain knowledge of the well-structured nature of the data set. A schema file is provided alongside the TSV data to a ZDW writer, indicating the semantic type of each column in the TSV file. The column type information is used by the compression algorithm to efficiently represent the values for each column. Considering the non-contiguous nature of column data, the above techniques are orthogonal to those applied in standard data-agnostic compression formats, enabling compression savings by ZDW to be compounded by standard compression.
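As a rough illustration of how type information lets a writer shrink an integer column, the sketch below subtracts a per-column minimum and stores each offset in the smallest whole number of bytes that fits. The helper names and the little-endian byte order are assumptions made for this example, not details of the ZDW specification.

```python
from typing import List, Tuple


def bytes_needed(max_offset: int) -> int:
    """Smallest whole number of bytes that can hold max_offset (at least 1)."""
    return max(1, (max_offset.bit_length() + 7) // 8)


def pack_column(values: List[int]) -> Tuple[int, int, bytes]:
    """Return (baseline, width, payload) for one integer column.

    Each value is stored as its offset from the column minimum, using a
    fixed width chosen for the whole column. Byte order is an assumption.
    """
    baseline = min(values)
    width = bytes_needed(max(values) - baseline)
    payload = b"".join((v - baseline).to_bytes(width, "little") for v in values)
    return baseline, width, payload


def unpack_column(baseline: int, width: int, payload: bytes) -> List[int]:
    """Invert pack_column."""
    return [
        baseline + int.from_bytes(payload[i:i + width], "little")
        for i in range(0, len(payload), width)
    ]


# Hypothetical epoch timestamps that sit close together within one file.
timestamps = [1_600_000_000, 1_600_000_042, 1_600_000_199]
baseline, width, payload = pack_column(timestamps)
restored = unpack_column(baseline, width, payload)
```

Because the three timestamps span fewer than 256 seconds, each one fits in a single byte after the baseline is subtracted, rather than the four or more bytes the raw values would need.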
Data is compressed more efficiently in cases where column values repeat from row to row. For instance, analytics visitor or session data may feature repeat values across touchpoints. Pre-sorting the data (e.g., using an appropriate ORDER BY clause when dumping activity data from SQL) can yield improved compression efficiency.
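The effect of pre-sorting can be estimated by counting how many column values differ from the previous row, since those are the values a repeat-flagging encoder would still have to store. The sketch below uses made-up hit data and a hypothetical helper name; it measures changed cells only and does not model actual ZDW output sizes.

```python
from typing import List, Sequence


def changed_cells(rows: List[Sequence[str]]) -> int:
    """Count values that differ from the same column in the previous row.

    Every value in the first row counts as changed.
    """
    total = 0
    prev = None
    for row in rows:
        if prev is None:
            total += len(row)
        else:
            total += sum(1 for a, b in zip(row, prev) if a != b)
        prev = row
    return total


# Hypothetical hits: (visitor, session, page), interleaved as they arrived.
hits = [
    ("visitor-2", "session-9", "home"),
    ("visitor-1", "session-4", "home"),
    ("visitor-2", "session-9", "cart"),
    ("visitor-1", "session-4", "search"),
]

unsorted_cost = changed_cells(hits)
# Equivalent in spirit to ORDER BY visitor, session when dumping from SQL.
sorted_cost = changed_cells(sorted(hits, key=lambda r: (r[0], r[1])))
```

Grouping each visitor's hits together lowers the count of changed cells, because the visitor and session values then repeat on consecutive rows.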
For data access, ZDW is best suited to:
- Decompressing an entire file’s contents, either to disk, streamed to stdout, or unpacked into a row-level in-memory text buffer
- Accessing data row-by-row
- Accessing multiple columns in all rows
ConvertDWfile, our implementation of the ZDW compressor, takes two input files: the raw TSV data and an accompanying schema file, both formatted as generated by a MySQL table dump. For instance:
$ ./convertDWfile analytics-hits.sql analytics-hits.desc.sql
Example files can be found here.
ConvertDWfile takes a simple but compute-intensive approach. It performs two passes over the uncompressed data. The first pass compiles, sorts and outputs
- a header with schema information, per-column offset sizes (i.e., the number of bytes used to represent the offset for each column) and baseline values for numeric columns, and
- a global string dictionary containing all text values in the source data.
The second pass converts and outputs the compressed row-oriented data. For each string value, its position in the compiled string dictionary is provided. When an additional compression flag is applied, this output data is piped into the specified standard compression binary (e.g., ‘xz’, ‘gz’) for additional size reduction.
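The two passes can be sketched in a few lines. This is a simplified illustration, assuming in-memory rows, a caller-supplied list of text column positions, and no header or byte packing; the function names are hypothetical and this is not the ConvertDWfile implementation.

```python
from typing import Dict, List, Sequence, Tuple, Union


def first_pass(
    rows: List[Sequence[str]], text_columns: Sequence[int]
) -> Tuple[List[str], Dict[str, int]]:
    """Pass 1: gather every distinct string across all text columns
    into one sorted dictionary, plus a reverse lookup by string."""
    unique = set()
    for row in rows:
        for i in text_columns:
            unique.add(row[i])
    dictionary = sorted(unique)
    index = {s: pos for pos, s in enumerate(dictionary)}
    return dictionary, index


def second_pass(
    rows: List[Sequence[str]],
    text_columns: Sequence[int],
    index: Dict[str, int],
) -> List[List[int]]:
    """Pass 2: replace each text value with its dictionary position
    and parse the remaining columns as integers."""
    text = set(text_columns)
    return [
        [index[v] if i in text else int(v) for i, v in enumerate(row)]
        for row in rows
    ]


# Hypothetical rows: (hit id, page, referrer or country).
rows = [
    ["101", "home", "US"],
    ["102", "cart", "home"],
    ["103", "home", "US"],
]
dictionary, index = first_pass(rows, text_columns=[1, 2])
encoded = second_pass(rows, [1, 2], index)
```

Note that the string "home" appears in two different columns but is stored once, which is the benefit of a dictionary shared across all text columns.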
Multiple internal data blocks are supported to keep dictionary sizes manageable. These blocks are for memory management and are not intended to provide intra-file seek optimizations.
Implementations of the ZDW un-converter support various output modes and convenience operations, such as selective column output, file validation, file statistics, and synthesis of virtual metadata columns.
As mentioned above, ZDW’s structure is designed to complement standard text or binary compression algorithms layered on top. As ZDW is intended for highly efficient compression of long-term archival data, applying a standard binary compression on top of the ZDW file format, such as XZ (LZMA), is useful and recommended.
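To show what layering a standard compressor on top looks like, the sketch below wraps an arbitrary byte payload with XZ and, for comparison, gzip, using Python's standard library. The payload here is synthetic placeholder data standing in for a converted file, and the helper names are hypothetical; ConvertDWfile itself pipes its output to an external compression binary rather than calling these modules.

```python
import gzip
import lzma


def layer_xz(data: bytes) -> bytes:
    """Apply XZ (LZMA2) at the highest standard preset."""
    return lzma.compress(data, format=lzma.FORMAT_XZ, preset=9)


def layer_gzip(data: bytes) -> bytes:
    """Apply gzip at its highest compression level."""
    return gzip.compress(data, compresslevel=9)


# Synthetic stand-in for the bytes of an already converted file.
payload = b"".join(
    f"row-{i % 50}\tvalue-{i % 7}\n".encode() for i in range(20000)
)

xz_blob = layer_xz(payload)
gz_blob = layer_gzip(payload)

# Both layers are lossless: decompression returns the original bytes.
xz_roundtrip = lzma.decompress(xz_blob)
gz_roundtrip = gzip.decompress(gz_blob)

print(len(payload), len(xz_blob), len(gz_blob))
```

The relative sizes printed for XZ and gzip depend heavily on the input, so results on this synthetic payload should not be read as representative of Analytics data.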
XZ incurs a one-time, compute-intensive compression cost, followed by a reasonable compute cost each time the data is decompressed. While our implementation is pipelined to improve wall-clock time, applying ZDW and XZ compression together is anticipated to require about double the total CPU cycles of applying XZ alone. Our decompression implementation is also pipelined. XZ is typically the bottleneck during decompression, because string value lookups into an in-memory dictionary are O(1) operations.
Less efficient formats such as GZ (gzip) may be used with ZDW instead. The trade-off is lower overall compression compute time, at the cost of an anticipated 25% reduction in compression efficiency. GZ requires about half the CPU cycles of XZ for decompression.
A typical Adobe Analytics archival file has a very wide table schema and a few thousand rows. It is sparsely populated and has string values that are repeated between columns. ZDW’s shared dictionary has a strong positive impact in reducing overall file size. For Analytics data, as a rule of thumb, the dictionary segment occupies about 80% of the total file space and the encoded row data occupies about 20%.
To illustrate representative storage performance on Analytics archival data, the above and below graphics provide an empirical comparison in file sizes among common compression techniques.
ZDW usage and volume
Adobe Analytics maintains hundreds of petabytes of archival data stored across billions of files, all in the ZDW format. An average compression ratio approaching 70:1 is achieved on the entire data set. Applying ZDW with XZ saves millions of dollars annually over alternative compression options.
The ZDW technology has been recently released as open source on GitHub:
ZDW is an archival format for row-level, well-structured data (specifically, TSV with associated schema file, as from a…
A detailed description of the ZDW file format is included here.
ZDW file readers are implemented in Java/Scala and C++ for streaming data and files. Hadoop and Spark SQL read support is also implemented.
Contributions to improve ZDW and its write and read layers are welcome!