Processing large compressed files with PHP

Jose Manuel Cardona
Softonic Engineering
6 min read · Apr 14, 2023

One day we received a new task: download a huge file on a daily basis and process it as quickly as possible to make the information available as soon as possible.

The problem with usual file processing methods

[Image: a server downloading files from the cloud. Generated with Stable Diffusion.]

The usual approach involves downloading the file to disk, decompressing it, and then reading the result, for example with a library like Archive Tar. For a file of this size, that took too long.

We then started to investigate streaming download, but the file was compressed, and PHP doesn’t offer a built-in way to decompress a tar.gz file on the fly. However, it does offer stream filters, and this is where things started to get interesting.

Understanding the file requirements

The first thing to know is that the file is a gzip-compressed tar archive (tar.gz) of around 2 GB, and it always contains a single file.

The challenge of working with tar.gz files

We started by downloading the file as a stream and removing the first problematic layer, the gzip compression.
For that, we tried the native filters in PHP:

$stream = fopen('https://domain/path/to/file.tar.gz', 'rb');

stream_filter_append($stream, "zlib.inflate", STREAM_FILTER_READ);

Unfortunately, this doesn’t work: zlib.inflate expects a raw DEFLATE stream, not gzip. To understand the difference we turned to the gzip RFC (RFC 1952), which shows that a gzip file is a DEFLATE stream wrapped in a header and a trailer. So to use the zlib.inflate filter we first need to remove the gzip header, and we went for it.
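To make the difference concrete, here is a minimal sketch (assuming the zlib extension is loaded) that takes apart a gzip string produced by PHP’s own gzencode. The sample payload and the variable names are illustrative and not part of our pipeline. When no optional header fields are set, the header is 10 bytes and the trailer 8, and what sits between them is the raw DEFLATE data:

```php
<?php

declare(strict_types=1);

// Illustrative payload; any string works.
$payload = str_repeat("hello gzip\n", 100);
$gzip = gzencode($payload);

// Fixed header: ID1 ID2 CM FLG MTIME(4 bytes) XFL OS = 10 bytes.
$magic = substr($gzip, 0, 2); // "\x1f\x8b" identifies gzip
$flags = ord($gzip[3]);       // FLG byte: which optional fields follow

// Trailer: CRC-32 and uncompressed size, both 32-bit little-endian.
$trailer = unpack('Vcrc/Vsize', substr($gzip, -8));

// gzencode() sets no optional fields, so the DEFLATE data starts at byte 10.
$deflate = substr($gzip, 10, -8);
$decoded = gzinflate($deflate);

var_dump($decoded === $payload);
var_dump($trailer['size'] === strlen($payload));
```

A file created by a command-line tool may carry optional fields such as the original file name, in which case the header is longer than 10 bytes.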

Creating a custom stream filter to remove gzip headers

After researching how stream filters work, we found that we can register our own, so we created a filter that removes the gzip header and lets zlib.inflate handle the rest.
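For readers new to the mechanism, here is a minimal sketch of a user filter, unrelated to gzip: it upper-cases whatever is read from the stream. The class name UppercaseFilter and the filter name demo.uppercase are made up for this example. A filter receives the data in buckets, may change each bucket, and appends it to the output:

```php
<?php

declare(strict_types=1);

class UppercaseFilter extends php_user_filter
{
    public function filter($in, $out, &$consumed, $closing): int
    {
        // Take each incoming bucket, transform it, and pass it along.
        while ($bucket = stream_bucket_make_writeable($in)) {
            $consumed += $bucket->datalen;
            $bucket->data = strtoupper((string)$bucket->data);
            stream_bucket_append($out, $bucket);
        }

        return PSFS_PASS_ON;
    }
}

stream_filter_register('demo.uppercase', UppercaseFilter::class);

// An in-memory stream stands in for the remote file.
$stream = fopen('php://memory', 'w+b');
fwrite($stream, "streamed text\n");
rewind($stream);

stream_filter_append($stream, 'demo.uppercase', STREAM_FILTER_READ);
$result = stream_get_contents($stream);
fclose($stream);
```

The gzip header filter that follows has the same structure; only the work done on each bucket differs.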

<?php

declare(strict_types=1);

namespace App\Infrastructure\StreamFilters;

use php_user_filter;

class GzipHeaderFilter extends php_user_filter
{
    // Fixed part of the gzip header (RFC 1952):
    // ID1, ID2, CM, FLG, MTIME (4 bytes), XFL, OS.
    private const DEFAULT_HEADER_LENGTH = 10;

    private bool $headerProcessed = false;

    public function filter($in, $out, &$consumed, $closing): int
    {
        while ($bucket = stream_bucket_make_writeable($in)) {
            if (!$this->headerProcessed) {
                // Assumes the whole header arrives in the first bucket.
                $headerLen = self::DEFAULT_HEADER_LENGTH;
                $header = substr((string)$bucket->data, 0, $headerLen);
                // FLG is the fourth byte of the header.
                $flags = ord($header[3]);
                if (($flags & 0x08) !== 0) {
                    // a filename is present: skip it and its null terminator
                    $headerLen = strpos((string)$bucket->data, "\0", $headerLen) + 1;
                }
                // Other optional fields (extra, comment, header CRC) are not handled.

                $bucket->data = substr((string)$bucket->data, $headerLen);
                $this->headerProcessed = true;
            }

            $consumed += $bucket->datalen;

            stream_bucket_append($out, $bucket);
        }

        return PSFS_PASS_ON;
    }
}

The important parts here are the if that contains the logic to remove the header, and the stream_bucket_append call that appends the bucket to the output stream.

$headerLen = self::DEFAULT_HEADER_LENGTH;
$header = substr((string)$bucket->data, 0, $headerLen);
// FLG is the fourth byte of the header.
$flags = ord($header[3]);
if (($flags & 0x08) !== 0) {
    // a filename is present: skip it and its null terminator
    $headerLen = strpos((string)$bucket->data, "\0", $headerLen) + 1;
}
$bucket->data = substr((string)$bucket->data, $headerLen);
$this->headerProcessed = true;

We can see here that it takes the default header length and, if the flags byte of the header says a file name is present, increases it to skip the name as well. The data that is passed on is then a raw DEFLATE stream that we can decompress using the zlib.inflate filter.

stream_filter_register('gzip_header_filter', GzipHeaderFilter::class);

$stream = fopen('https://domain/path/to/file.tar.gz', 'rb');

stream_filter_append($stream, "gzip_header_filter", STREAM_FILTER_READ);
stream_filter_append($stream, "zlib.inflate", STREAM_FILTER_READ);
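Our file only ever carried the optional file name, but RFC 1952 defines other optional fields too. As a hedged sketch, the hypothetical helper below (gzipHeaderLength is our own name for this example, not a PHP function) computes the header length for any combination of them, following the field order in the RFC. It assumes the whole header is available in one string:

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper: number of bytes occupied by the gzip header at the
 * start of $data, following the field order in RFC 1952.
 */
function gzipHeaderLength(string $data): int
{
    if (strlen($data) < 10 || substr($data, 0, 2) !== "\x1f\x8b") {
        throw new InvalidArgumentException('Not a gzip stream');
    }

    $flags = ord($data[3]);
    $offset = 10; // fixed part of the header

    if (($flags & 0x04) !== 0) {
        // FEXTRA: 2-byte little-endian length, then that many bytes.
        $offset += 2 + unpack('v', $data, $offset)[1];
    }
    foreach ([0x08, 0x10] as $flag) {
        // FNAME, then FCOMMENT: zero-terminated strings.
        if (($flags & $flag) !== 0) {
            $end = strpos($data, "\0", $offset);
            if ($end === false) {
                throw new InvalidArgumentException('Header is truncated');
            }
            $offset = $end + 1;
        }
    }
    if (($flags & 0x02) !== 0) {
        $offset += 2; // FHCRC: CRC-16 of the header
    }

    return $offset;
}

// Usage: a hand-built gzip member that stores the file name "file.tar".
$body = 'tar bytes would go here';
$gzip = "\x1f\x8b\x08\x08" . "\0\0\0\0" . "\x00\x03" . "file.tar\0"
    . gzdeflate($body) . pack('V', crc32($body)) . pack('V', strlen($body));

$headerLength = gzipHeaderLength($gzip);
$inflated = gzinflate(substr($gzip, $headerLength, -8));
```

Inside a stream filter the same logic would also have to cope with a header split across buckets, which this sketch does not attempt.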

But this is not enough: we still have the tar layer, and we need to remove it to get to the file that we need.

Creating a custom stream filter to extract a single file from tar

Unfortunately, PHP doesn’t offer a built-in way to extract a tar file on the fly, so we need to create our own filter to do that.
Again, we need to check the tar format specification (the POSIX ustar format) to know how it works and how to get to the file content.

Thankfully, we have a single file inside the tar, so it seems easy to get the file content.
The specification describes the header structure, so we know how to strip the header and reach the content that we want.
The problem is that tar files also end with a footer (zero padding up to the next 512-byte boundary, followed by zero-filled end-of-archive blocks), so we need to remove that too to avoid problems in our final reader.
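The layout is easier to see on a tiny example. The sketch below builds a simplified single-file archive in memory and reads the content back using only the size field. The file name and content are made up, and the header leaves the checksum and most other fields at zero, so real tar tools would reject it. It is only meant to show where the pieces sit:

```php
<?php

declare(strict_types=1);

$content = "id,name\n1,foo\n";

// Header block: 512 bytes, mostly zeros in this simplified sketch.
$header = str_repeat("\0", 512);
// Name field at offset 0.
$header = substr_replace($header, 'data.csv', 0, 8);
// Size field at offset 124: 11 octal digits plus a terminating NUL.
$sizeField = sprintf('%011o', strlen($content)) . "\0";
$header = substr_replace($header, $sizeField, 124, 12);

// Content is padded with zeros up to the next 512-byte boundary.
$padding = str_repeat("\0", (512 - strlen($content) % 512) % 512);

// The archive ends with two zero-filled 512-byte blocks.
$archive = $header . $content . $padding . str_repeat("\0", 1024);

// Reading it back: parse the size, skip the header, take exactly $size bytes.
$size = (int)octdec(trim(substr($archive, 124, 12)));
$extracted = substr($archive, 512, $size);
```

Everything after those $size bytes is the footer that the filter has to drop.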

<?php

declare(strict_types=1);

namespace App\Infrastructure\StreamFilters;

use php_user_filter;

use function octdec;
use function substr;

class TarExtractSingleFile extends php_user_filter
{
    private const HEADER_BLOCK_SIZE = 512;

    private const FILE_SIZE_HEADER_OFFSET = 124;

    private const FILE_SIZE_HEADER_LENGTH = 12;

    private bool $headerProcessed = false;

    private int $fileSize = 0;

    private int $cursor = 0;

    public function filter($in, $out, &$consumed, $closing): int
    {
        while ($bucket = stream_bucket_make_writeable($in)) {
            $consumed += $bucket->datalen;
            $data = (string)$bucket->data;

            // Remove the header if it is the first block.
            // Assumes the whole 512-byte header arrives in the first bucket.
            if (!$this->headerProcessed) {
                $this->fileSize = (int)$this->tarSizeToSize(
                    substr($data, self::FILE_SIZE_HEADER_OFFSET, self::FILE_SIZE_HEADER_LENGTH)
                );
                $data = substr($data, self::HEADER_BLOCK_SIZE);
                $this->headerProcessed = true;
            }

            // Remove the footer: keep only the bytes that still belong to the file.
            $remaining = max(0, $this->fileSize - $this->cursor);
            if (strlen($data) > $remaining) {
                $data = substr($data, 0, $remaining);
            }

            // Annotate the position in the file.
            $this->cursor += strlen($data);

            // Append the data to the output stream
            $bucket->data = $data;
            $bucket->datalen = strlen($data);
            stream_bucket_append($out, $bucket);
        }

        return PSFS_PASS_ON;
    }

    /**
     * Convert Tar record size to actual size
     */
    private function tarSizeToSize(string $tarSize): int|float
    {
        /*
         * First byte of size has a special meaning if bit 7 is set.
         *
         * Bit 7 indicates base-256 encoding if set.
         * Bit 6 is the sign bit.
         * Bits 5:0 are most significant value bits.
         */
        $ch = ord($tarSize[0]);
        if (($ch & 0x80) !== 0) {
            // Full 12-bytes record is required.
            $recStr = $tarSize . "\x00";

            $size = (($ch & 0x40) !== 0) ? -1 : 0;
            $size = ($size << 6) | ($ch & 0x3f);

            for ($numCh = 1; $numCh < 12; ++$numCh) {
                $size = ($size * 256) + ord($recStr[$numCh]);
            }

            return $size;
        }

        return octdec(trim($tarSize));
    }
}

To write this code, we looked at the Archive Tar library mentioned earlier for help with the tar processing.
There we found the tarSizeToSize function, which we used to get the file size from the header.

/**
 * Convert Tar record size to actual size
 */
private function tarSizeToSize(string $tarSize): int|float
{
    /*
     * First byte of size has a special meaning if bit 7 is set.
     *
     * Bit 7 indicates base-256 encoding if set.
     * Bit 6 is the sign bit.
     * Bits 5:0 are most significant value bits.
     */
    $ch = ord($tarSize[0]);
    if (($ch & 0x80) !== 0) {
        // Full 12-bytes record is required.
        $recStr = $tarSize . "\x00";

        $size = (($ch & 0x40) !== 0) ? -1 : 0;
        $size = ($size << 6) | ($ch & 0x3f);

        for ($numCh = 1; $numCh < 12; ++$numCh) {
            $size = ($size * 256) + ord($recStr[$numCh]);
        }

        return $size;
    }

    return octdec(trim($tarSize));
}
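In the common case the size field is just an octal number in ASCII, which is what the last line of the function handles. A small worked example, assuming 64-bit PHP so the values fit in an int, with a 2 GiB size picked only for illustration:

```php
<?php

declare(strict_types=1);

// A 2 GiB file, written the way a tar header stores it:
// 11 octal digits followed by a NUL byte.
$twoGiB = 2 * 1024 ** 3;
$field = sprintf('%011o', $twoGiB) . "\0";

// trim() strips the NUL (and any spaces), octdec() parses the digits.
$parsed = octdec(trim($field));

// The largest size that fits in 11 octal digits is 8 GiB minus one byte.
$largestOctal = octdec(str_repeat('7', 11));

var_dump($parsed === $twoGiB);
```

Sizes above that limit cannot be written in octal, which is why the function also has the base-256 branch.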

The next important part is that we use it to read the file size from the header, and then skip the 512-byte header block. Remember that this code is coupled to our specific use case: an archive that contains a single file.

$this->fileSize = (int)$this->tarSizeToSize(
    substr($data, self::FILE_SIZE_HEADER_OFFSET, self::FILE_SIZE_HEADER_LENGTH)
);
// Skip the 512-byte header block; what follows is file content.
$data = substr($data, self::HEADER_BLOCK_SIZE);
$this->headerProcessed = true;

And with that we have reached our objective: the data passed through the stream is already the file content. We still need
to remove the tar footer, but that is easy: we know the file length from the tar header, so we just
need to drop everything after we reach that point.

// Keep only the bytes that still belong to the file.
$remaining = max(0, $this->fileSize - $this->cursor);
if (strlen($data) > $remaining) {
    $data = substr($data, 0, $remaining);
}
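The same bookkeeping can be tried outside a filter. In this sketch the hypothetical function takeFileBytes (a name made up for the example) plays the role of the trimming step, and small 8-byte chunks stand in for the buckets PHP would hand to the filter:

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper: returns the part of $chunk that still belongs to the
 * file and advances $cursor by the number of bytes kept.
 */
function takeFileBytes(string $chunk, int $fileSize, int &$cursor): string
{
    // Never negative, so chunks that are pure footer yield an empty string.
    $remaining = max(0, $fileSize - $cursor);
    $kept = substr($chunk, 0, $remaining);
    $cursor += strlen($kept);

    return $kept;
}

// Ten bytes of content followed by zero padding, as after a tar header.
$content = 'abcdefghij';
$stream = $content . str_repeat("\0", 22);

$cursor = 0;
$output = '';
foreach (str_split($stream, 8) as $chunk) {
    $output .= takeFileBytes($chunk, strlen($content), $cursor);
}
```

Clamping the remaining count at zero matters when the footer spans more than one chunk, because a negative length passed to substr would cut bytes from the end instead of returning nothing.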

Putting it all together

Now we have everything we need: a decompressed file, ready to be read.

stream_filter_register('gzip_header_filter', GzipHeaderFilter::class);
stream_filter_register('tar_extract_single_file', TarExtractSingleFile::class);

$stream = fopen('https://domain/path/to/file.tar.gz', 'rb');

stream_filter_append($stream, "gzip_header_filter", STREAM_FILTER_READ);
stream_filter_append($stream, "zlib.inflate", STREAM_FILTER_READ);
stream_filter_append($stream, "tar_extract_single_file", STREAM_FILTER_READ);
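Once the filters are attached, the stream is consumed like any other, for example line by line. The sketch below shows such a loop on a stand-in: an in-memory stream holding raw DEFLATE data with only the built-in zlib.inflate filter attached, so it runs without a remote file. The sample lines are made up:

```php
<?php

declare(strict_types=1);

// Stand-in for the remote file: raw DEFLATE data in memory.
$stream = fopen('php://memory', 'w+b');
fwrite($stream, gzdeflate("line 1\nline 2\nline 3\n"));
rewind($stream);

stream_filter_append($stream, 'zlib.inflate', STREAM_FILTER_READ);

// Read the decompressed content one line at a time.
$lines = [];
while (($line = fgets($stream)) !== false) {
    $lines[] = rtrim($line, "\n");
}
fclose($stream);
```

With the full filter chain in place, the same loop reads the extracted file while it is still downloading, without ever writing the archive to disk.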

Using stream filters to process tar.gz files in PHP

In this post, we saw how to use stream filters to decompress a tar.gz file and extract a single file from it.
PHP doesn’t support this natively, but it is possible if you know the file structure.
Format specifications such as the RFCs are hard to read, but they give you the knowledge you need to reach your objective.
