The zlib R Package: Why Another Zlib Package for R?

Semjon Geist
4 min read · Sep 8, 2023

In the vast universe of R packages, I’ve introduced yet another one: the zlib package. But why? The answer lies in the need for a more robust and flexible solution for data compression and decompression in R.

Introduction

The zlib package for R aims to offer an R-based equivalent of Python's built-in zlib module for data compression and decompression. This package provides a suite of functions for working with zlib compression, including utilities for compressing and decompressing data streams, manipulating compressed files, and working with gzip, zlib, and deflate formats.

Why Another Zlib Package?

R already has built-in methods for compression and decompression. So, why reinvent the wheel? Here’s why:

The Problems with Built-in Methods

Handling Corrupt Data: R’s built-in functions, such as memDecompress, can behave unpredictably when given corrupt or truncated gzip bytes. Depending on the input, a call may hang or crash the R session, especially when faced with:

  • Incomplete Data Streams
compressed_data <- memCompress(charToRaw(paste0(rep("This is an example string.", 1000), collapse = ", ")))
# Trying to decompress only a part of the data
rawToChar(memDecompress(compressed_data[1:100], type="gzip")) # This can cause a hang-up or crash
  • Multiple Header Blocks
# Compressing two strings separately and concatenating the results
double_compressed_data <- c(memCompress(charToRaw("Hello")), memCompress(charToRaw("World")))
# Trying to decompress the concatenated compressed data
rawToChar(memDecompress(double_compressed_data, type="gzip")) # This can lead to unexpected results

GZIP File Format Specification: With type="gzip", the built-in memCompress function doesn't strictly adhere to the GZIP File Format Specification (RFC 1952), especially concerning the usage of window bits. This discrepancy was reported as early as 2012 and 2013, as evidenced by these messages. Some developers have attempted to create an alternative package, which can be found in this repository. Potential implications include:

  • Incorrect Window Bits (for gzip header)
# Compressing with only 15 window bits
compressed_data <- memCompress(charToRaw("Hello World"), type="gzip")
  • Incompatible with other tools
# Compressing with only 15 window bits
compressed_data <- memCompress(charToRaw("Hello World"), type="gzip")
tmp_file <- tempfile(fileext=".gz")
# Write the compressed bytes to disk so that external tools can read them
writeBin(compressed_data, tmp_file)
# using other tools like zcat for decompression
system(sprintf("zcat %s", tmp_file), intern=TRUE)
# character(0)
# or using gzip
readLines(pipe(sprintf("gzip -d --verbose --stdout %s", tmp_file), open = "rb"))
# character(0)
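
The window-bits setting determines which wrapper surrounds the deflate data, and that wrapper can be recognized from the first two bytes of a stream. The following is a minimal base-R sketch, not part of the zlib package; the helper name detect_wrapper is hypothetical. It relies on two facts from the specifications: a gzip member (RFC 1952) starts with the magic bytes 1f 8b, while a zlib stream (RFC 1950) starts with a two-byte header whose low nibble is 8 (deflate) and whose 16-bit value is a multiple of 31.

```r
# Hypothetical helper: guess the container format from the first two bytes.
detect_wrapper <- function(bytes) {
  if (length(bytes) < 2) return("unknown")
  b <- as.integer(bytes[1:2])
  # gzip members (RFC 1952) start with the magic bytes 0x1f 0x8b
  if (b[1] == 0x1f && b[2] == 0x8b) return("gzip")
  # zlib streams (RFC 1950): compression method 8 in the low nibble of the
  # first byte, and the two header bytes form a multiple of 31
  if (b[1] %% 16 == 8 && (b[1] * 256 + b[2]) %% 31 == 0) return("zlib")
  "unknown"
}

detect_wrapper(as.raw(c(0x1f, 0x8b, 0x08)))  # returns "gzip"
detect_wrapper(as.raw(c(0x78, 0x9c)))        # returns "zlib"

# Inspect what memCompress actually emits on your R installation
detect_wrapper(memCompress(charToRaw("Hello World"), type = "gzip"))
```

Running the last line on your own installation is a quick way to check which wrapper a given compression call produces before handing the bytes to an external tool.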

No Streaming Support: R lacks a native way to handle Gzip streams/chunks from REST APIs or other data sources. This often necessitates the use of temporary files or intricate workarounds to ensure robust continuation of decompression/compression.

# Example of cumbersome workaround using pipes and tmp files
url <- "https://example.com/data.gz"
tmp_file <- tempfile(fileext=".gz")
download.file(url, tmp_file, mode="wb")
# Reads at most the first 1000 decompressed bytes
decompressed_data <- readBin(pipe(sprintf("gzip -d --stdout %s", tmp_file)), "raw", 1000)

What My Package Offers

Robustness: The zlib package can efficiently handle corrupted or incomplete gzip data without causing system failures.

  • Handling Incomplete Data:
library(zlib)
compressed_data <- zlib$compress(charToRaw(paste0(rep("This is an example string.", 1000), collapse = ", ")))
# Decompressing only a part of the data without causing a hang-up
rawToChar(zlib$decompress(compressed_data[1:100]))
  • Handling Multiple Header Blocks:
double_compressed_data <- c(zlib$compress(charToRaw("Hello ")), zlib$compress(charToRaw("World")))
# Decompressing concatenated compressed data
rawToChar(zlib$decompress(double_compressed_data))

Compliance: The package strictly adheres to the GZIP File Format Specification, ensuring compatibility across systems.

  • Correct Window Bits (e.g. 31 for gzip):
# Compressing with correct window bits
compressed_data <- zlib$compress(charToRaw("Hello World"), zlib$Z_DEFAULT_COMPRESSION, zlib$DEFLATED, zlib$MAX_WBITS + 16)
# This will be decompressed correctly by other tools or libraries
rawToChar(zlib$decompress(compressed_data, zlib$MAX_WBITS + 16))
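
The window-bits argument follows the convention of the underlying zlib C library, which Python's zlib module also uses: values from 8 to 15 select a zlib wrapper, adding 16 selects a gzip wrapper, and negative values select raw deflate with no header. The lookup below is a small illustrative sketch in base R; the helper name wbits_for is hypothetical, and it assumes the maximum window size of 15, the value the zlib C library defines as MAX_WBITS.

```r
# Hypothetical helper mapping a container format to a window-bits value.
# Assumes MAX_WBITS is 15, as defined by the zlib C library.
MAX_WBITS <- 15L

wbits_for <- function(format = c("zlib", "gzip", "deflate")) {
  format <- match.arg(format)
  switch(format,
    zlib    = MAX_WBITS,        # zlib header and Adler-32 trailer (RFC 1950)
    gzip    = MAX_WBITS + 16L,  # gzip header and CRC-32 trailer (RFC 1952)
    deflate = -MAX_WBITS        # raw deflate stream, no wrapper (RFC 1951)
  )
}

wbits_for("gzip")  # 31, the same value as MAX_WBITS + 16
```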

Flexibility: With zlib, you can manage Gzip streams from REST APIs without the need for temporary files or other workarounds.

# Byte-Range Request and decompression in chunks

# Initialize the decompressor
decompressor <- zlib$decompressobj(zlib$MAX_WBITS + 16)

# Define the URL and initial byte ranges
url <- "https://example.com/api/data.gz"
range_start <- 0
range_increment <- 5000 # Adjust based on desired chunk size

# Placeholder for the decompressed content
decompressed_content <- character(0)

# Loop to make multiple requests and decompress chunk by chunk
for (i in 1:5) { # Adjust the loop count based on the number of chunks you want to retrieve
  # HTTP byte ranges are inclusive, so subtract 1 to request exactly range_increment bytes
  range_end <- range_start + range_increment - 1

  # Make a byte-range request
  response <- httr::GET(url, httr::add_headers(`Range` = paste0("bytes=", range_start, "-", range_end)))

  # Check if the request was successful
  if (httr::http_type(response) != "application/octet-stream" || httr::http_status(response)$category != "Success") {
    stop("Failed to retrieve data.")
  }

  # Decompress the received chunk
  compressed_data <- httr::content(response, "raw")
  decompressed_chunk <- decompressor$decompress(compressed_data)
  decompressed_content <- c(decompressed_content, rawToChar(decompressed_chunk))

  # Update the byte range for the next request
  range_start <- range_end + 1
}

# Flush the decompressor after all chunks have been processed
final_data <- decompressor$flush()
decompressed_content <- c(decompressed_content, rawToChar(final_data))
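
To try the streaming pattern without a network connection, the same decompressor calls can be fed from an in-memory vector. This is a sketch under assumptions: split_raw_chunks is a hypothetical helper, the chunk size of 64 bytes is arbitrary, and the zlib calls are the ones shown earlier in this post, so the block only runs them when the package is installed.

```r
# Hypothetical helper: split a raw vector into consecutive chunks of at most `size` bytes.
split_raw_chunks <- function(x, size) {
  if (length(x) == 0) return(list())
  starts <- seq(1, length(x), by = size)
  lapply(starts, function(s) x[s:min(s + size - 1, length(x))])
}

if (requireNamespace("zlib", quietly = TRUE)) {
  library(zlib)
  original <- charToRaw(paste(rep("This is an example string.", 1000), collapse = ", "))
  # Compress with a gzip wrapper, using the same arguments as earlier in this post
  compressed <- zlib$compress(original, zlib$Z_DEFAULT_COMPRESSION, zlib$DEFLATED, zlib$MAX_WBITS + 16)

  # Feed the compressed bytes to a single decompressor, one small chunk at a time
  decompressor <- zlib$decompressobj(zlib$MAX_WBITS + 16)
  pieces <- lapply(split_raw_chunks(compressed, 64), function(chunk) decompressor$decompress(chunk))
  result <- c(do.call(c, pieces), decompressor$flush())

  # Compare the reassembled bytes with the input
  print(identical(result, original))
}
```

Keeping one decompressor object alive across all chunks is the point of the design: the object carries the inflate state from one call to the next, so chunk boundaries do not need to align with anything in the compressed stream.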
