The ZIP File Format

Felix Stridsberg
7 min read · Jul 19, 2023

This is a technical overview of the binary format of ZIP files for programmers or other technical people.

The scope of this article is the ZIP file format itself, not the optional compression or encryption algorithms. After reading this article you should have enough information to be able to read and write ZIP files manually byte by byte.

Overview

The ZIP file consists of several headers. Each header is identified by a signature followed by header data.

The headers are:

  • Local file header — Contains metadata about a single file and is followed by the file data. Each file can be compressed and encrypted independently.
  • Central directory file header — Contains metadata about a single file and an offset to the Local file header for that file. A list of these headers forms the Central directory which can be used to quickly enumerate all files in the archive.
  • End of central directory record — Contains information on where to find the first central directory file header.

The entry point of a ZIP file is the End of central directory record, which is always at the end of the file.

This is a visualization of a ZIP file containing a.txt and b.txt:

ZIP file

The reason the central directory sits at the end is that we can replace it by simply appending a new one. If we want to add a file c.txt, we can append that file record and then a new central directory that includes it:

Adding c.txt to “ZIP file” without any rewrite

The old directory still exists, but is no longer referenced. This way we have modified the archive without rewriting the file.
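
The append-only update can be sketched in Python. This is only a minimal sketch under several assumptions: the archive has no archive comment (so the EOCD is its last 22 bytes), the new file is stored uncompressed, its name is plain ASCII, and the helper name append_stored_file is made up for this example. The record layouts it packs are the ones defined later in this article.

```python
import struct
import zlib

def append_stored_file(archive: bytes, name: str, content: bytes) -> bytes:
    """Return archive plus a new file record, central directory and EOCD."""
    # Assumption: no archive comment, so the EOCD is exactly the last 22 bytes.
    eocd_pos = len(archive) - 22
    sig, _, _, _, total, cd_size, cd_offset, _ = struct.unpack_from(
        "<I4H2IH", archive, eocd_pos)
    if sig != 0x06054B50:
        raise ValueError("EOCD not found at the expected position")
    old_directory = archive[cd_offset:cd_offset + cd_size]

    raw_name = name.encode("ascii")
    crc, size = zlib.crc32(content), len(content)
    lfh_offset = len(archive)  # the new record starts right after the old EOCD

    # Local file header + name + data (version 1.0, no flags, method 0,
    # time 00:00:00, date 0x21 = 1980-01-01).
    local = struct.pack("<I5H3I2H", 0x04034B50, 10, 0, 0, 0, 0x21,
                        crc, size, size, len(raw_name), 0) + raw_name + content

    # Matching central directory file header pointing at the new record.
    entry = struct.pack("<I6H3I5H2I", 0x02014B50, 10, 10, 0, 0, 0, 0x21,
                        crc, size, size, len(raw_name), 0, 0, 0, 0, 0,
                        lfh_offset) + raw_name

    # The new directory is a copy of the old headers plus the new one.
    new_directory = old_directory + entry
    new_cd_offset = lfh_offset + len(local)
    eocd = struct.pack("<I4H2IH", 0x06054B50, 0, 0, total + 1, total + 1,
                       len(new_directory), new_cd_offset, 0)

    # Everything is appended; the original bytes are left untouched.
    return archive + local + new_directory + eocd
```

The returned bytes start with the unmodified original archive, which is the whole point: only the tail is new.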

This behavior was especially important when floppy disks were used for storage. Writes were slow and ZIP archives could span multiple disks.

You can do more than add files without rewriting: you can also modify and delete files in an archive just by appending data. The deleted file and the previous version of the modified file will of course remain on disk, but a compliant ZIP reader will not read them.

This is how you can achieve modification and deletion of the “ZIP file” example above, by only appending new data:

Modifying b.txt and deleting a.txt from “ZIP file” without any rewrite

Some fun facts:

Since a ZIP file is read from the end, anything can be put at the start of the ZIP file. You can, for example, make a file that is both a valid PNG and a valid ZIP (a polyglot) to hide a ZIP inside a file that looks and works like an image.
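
A quick way to see this is to put unrelated bytes in front of an archive and check that a reader still finds the files. The sketch below uses Python's standard zipfile module instead of hand-written parsing, and the prefix is just a stand-in for a real image. Note that the offsets stored in the archive no longer match positions in the combined file; zipfile compensates for that shift, but not every reader is guaranteed to.

```python
import io
import zipfile

# Build a small archive in memory.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("a.txt", "hello")
zip_bytes = buffer.getvalue()

# Put unrelated bytes in front, standing in for a PNG or an executable.
prefix = b"\x89PNG\r\n\x1a\n" + b"not really an image"
polyglot = prefix + zip_bytes

# The reader locates the EOCD from the end and ignores the prefix.
with zipfile.ZipFile(io.BytesIO(polyglot)) as archive:
    names = archive.namelist()
    content = archive.read("a.txt")
```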

If you decide to write your own ZIP extractor after this article, you can also append your ZIP file to your ZIP extractor binary and make it unzip itself when executed. This is called a self-extracting archive and is a common technique for auto unzipping when you double click the file.

The ZIP format allows “comments” to be put at several places in the file. These comments can store arbitrary data to make the ZIP file compatible with more than just the ZIP format.

Multi-Part/Multi-Disk ZIP Archive

The ZIP format allows for an archive to be split into multiple files. This was implemented to be able to split large archives over multiple floppy disks. It may still have uses today to circumvent file size limitations when uploading or emailing bigger archives.

The specification mentions fields called “disk number” and similar. You can think of these as “file number” when working with multi-part archives. In this article we only work with single-file archives, so you can always assume those fields are “disk 0”.

Reading a ZIP file

Now let's look at how to read the actual bytes of a ZIP file. All multi-byte numbers are stored in little-endian byte order.
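
In Python, the struct module can do these conversions; the "<" prefix in the format string selects little-endian. A small illustration, where the two values being unpacked are arbitrary examples:

```python
import struct

# The EOCD signature 0x06054b50 as it is laid out on disk.
raw = struct.pack("<I", 0x06054B50)   # the four bytes 50 4b 05 06, i.e. "PK\x05\x06"

# Reading a 2-byte and a 4-byte little-endian number back from bytes.
(count,) = struct.unpack("<H", b"\x02\x00")           # 2
(offset,) = struct.unpack("<I", b"\x4a\x00\x00\x00")  # 0x4a = 74
```

This is also why ZIP signatures show up as "PK" followed by two more bytes when you look at a file in a hex editor.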

Find the End of Central Directory Record

To read a ZIP file we must first find the End of central directory record (shortened EOCD) at the end of the file. This sounds trivial, but is a little tricky.

This is the definition of the EOCD:

Bytes | Description
------+-------------------------------------------------------------------
4 | Signature (0x06054b50)
2 | Number of this disk
2 | Disk where central directory starts
2 | Number of central directory records on this disk
2 | Total number of central directory records
4 | Size of central directory in bytes
4 | Offset to start of central directory
2 | Comment length (n)
n | Comment

The last line in the definition makes the finding of this record a little complicated. The record has dynamic length.

Depending on the comment length, the start of the EOCD will be at different offsets from the end of file.

  • If n=0 (empty comment), the EOCD starts at 22 bytes from the end
  • If n=0xffff (max length comment), the EOCD starts at 22 + 0xffff = 65557 bytes from the end

The EOCD signature can therefore start anywhere between 65557 and 22 bytes from the end of the file, so the window to search ends 18 bytes from the end. That is about 65.5 kB in total, which is not much on a modern computer, so we can read that whole interval into a buffer and scan it backwards to find the signature.

When the EOCD signature is found, the hardest part of reading the ZIP is done. The rest is just parsing predefined binary structures at specific offsets.
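
The backward scan could look roughly like this in Python. It is a simplified sketch that assumes the whole file is already in memory as bytes (a real reader would only load the last 65557 bytes) and that the last signature in the window is the real one. The function names and dictionary keys are made up for this example.

```python
import struct

EOCD_SIGNATURE = b"PK\x05\x06"   # 0x06054b50 in little-endian byte order
EOCD_FIXED_SIZE = 22             # size of the EOCD without its comment
MAX_COMMENT_SIZE = 0xFFFF

def find_eocd(data: bytes) -> int:
    """Return the offset of the last EOCD signature in the search window."""
    if len(data) < EOCD_FIXED_SIZE:
        raise ValueError("too short to contain an EOCD")
    start = max(0, len(data) - EOCD_FIXED_SIZE - MAX_COMMENT_SIZE)
    # The signature must end at least 18 bytes before the end of the file.
    pos = data.rfind(EOCD_SIGNATURE, start, len(data) - 18)
    if pos < 0:
        raise ValueError("no EOCD signature found")
    return pos

def parse_eocd(data: bytes, pos: int) -> dict:
    """Unpack the fixed EOCD fields at pos, plus the trailing comment."""
    (_, disk, cd_disk, disk_records, total_records,
     cd_size, cd_offset, comment_len) = struct.unpack_from("<IHHHHIIH", data, pos)
    comment_start = pos + EOCD_FIXED_SIZE
    return {
        "disk": disk,
        "cd_disk": cd_disk,
        "disk_records": disk_records,
        "total_records": total_records,
        "cd_size": cd_size,
        "cd_offset": cd_offset,
        "comment": data[comment_start:comment_start + comment_len],
    }
```

Using rfind with an explicit window keeps the scan inside the interval discussed above and returns the candidate closest to the end of the file first.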

Reading the Central Directory

From the EOCD we know the offset where the central directory starts, and how many records there are. So we just need to seek to that offset and start reading the central directory file headers.

Each central directory file header looks like this:

Bytes | Description
------+-------------------------------------------------------------------
4 | Signature (0x02014b50)
2 | Version made by
2 | Minimum version needed to extract
2 | Bit flag
2 | Compression method
2 | File last modification time (MS-DOS format)
2 | File last modification date (MS-DOS format)
4 | CRC-32 of uncompressed data
4 | Compressed size
4 | Uncompressed size
2 | File name length (n)
2 | Extra field length (m)
2 | File comment length (k)
2 | Disk number where file starts
2 | Internal file attributes
4 | External file attributes
4 | Offset of local file header (from start of disk)
n | File name
m | Extra field
k | File comment

This defines everything we need to know about a file, including where we can find the local file header.
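
As a sketch, the table above maps onto a single struct format string. The function below walks the central directory given the offset and record count from the EOCD. It assumes the whole file is in memory, and it decodes file names as CP437, which is an assumption tied to Bit flag = 0. The function name and dictionary keys are hypothetical.

```python
import struct

CDFH_SIGNATURE = 0x02014B50
CDFH_FORMAT = "<I6H3I5H2I"   # the 46 fixed bytes of the header
CDFH_FIXED_SIZE = struct.calcsize(CDFH_FORMAT)

def read_central_directory(data: bytes, cd_offset: int, count: int) -> list:
    """Parse count central directory file headers starting at cd_offset."""
    entries = []
    pos = cd_offset
    for _ in range(count):
        fields = struct.unpack_from(CDFH_FORMAT, data, pos)
        if fields[0] != CDFH_SIGNATURE:
            raise ValueError(f"bad central directory signature at offset {pos}")
        name_len, extra_len, comment_len = fields[10], fields[11], fields[12]
        name_start = pos + CDFH_FIXED_SIZE
        entries.append({
            # Assumption: names are CP437 because Bit flag = 0.
            "name": data[name_start:name_start + name_len].decode("cp437"),
            "bit_flag": fields[3],
            "method": fields[4],
            "crc32": fields[7],
            "compressed_size": fields[8],
            "uncompressed_size": fields[9],
            "lfh_offset": fields[16],
        })
        # Skip the three variable-length parts to reach the next header.
        pos = name_start + name_len + extra_len + comment_len
    return entries
```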

Since multi-part archives, compression and encryption are outside the scope of this article, we can make the following assumptions:

  • Bit flag = 0
  • Compression method = 0
  • Compressed size = Uncompressed size
  • Disk number where file starts = 0

The CRC-32 is a checksum to detect corrupt data. This can be ignored in toy projects, but in real projects it must be verified to avoid creating corrupt files when extracting.
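
In Python the check can be done with zlib.crc32 from the standard library. A minimal sketch, where crc_matches is a hypothetical helper name:

```python
import zlib

def crc_matches(file_data: bytes, expected_crc: int) -> bool:
    """Compare the CRC-32 of the extracted data with the value from the header."""
    # The mask keeps the result an unsigned 32-bit value.
    return (zlib.crc32(file_data) & 0xFFFFFFFF) == expected_crc
```

An extractor would call this with the uncompressed file data and the CRC-32 field from the header, and refuse to write the file if it returns False.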

The rest of the fields should be self-explanatory, except for Extra field, which we will get back to later.
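
One field that may deserve a short illustration is the MS-DOS date and time pair, since each packs several values into 16 bits. The sketch below assumes the usual MS-DOS layout: the date holds the year since 1980 in the top 7 bits, then a 4-bit month and a 5-bit day; the time holds a 5-bit hour, a 6-bit minute and the seconds divided by two in the low 5 bits. The function name is made up.

```python
import datetime

def dos_to_datetime(dos_date: int, dos_time: int) -> datetime.datetime:
    """Decode the two 16-bit MS-DOS fields into a naive datetime."""
    year = ((dos_date >> 9) & 0x7F) + 1980
    month = (dos_date >> 5) & 0x0F
    day = dos_date & 0x1F
    hour = (dos_time >> 11) & 0x1F
    minute = (dos_time >> 5) & 0x3F
    second = (dos_time & 0x1F) * 2   # stored with two-second resolution
    return datetime.datetime(year, month, day, hour, minute, second)
```

The format carries no time zone, so the result is whatever local time the writer used.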

Extracting a File

From the central directory we know the offset to all the files. To extract a file we start with reading the local file header that looks like this:

Bytes | Description
------+-------------------------------------------------------------------
4 | Signature (0x04034b50)
2 | Minimum version needed to extract
2 | Bit flag
2 | Compression method
2 | File last modification time (MS-DOS format)
2 | File last modification date (MS-DOS format)
4 | CRC-32 of uncompressed data
4 | Compressed size
4 | Uncompressed size
2 | File name length (n)
2 | Extra field length (m)
n | File name
m | Extra field

Some of the data in the local file header duplicates the entry in the central directory. Since the local file header also contains all the data needed for extraction, we can even extract files that are not present in the central directory.

The local file header is followed by the actual file data, whose length is given by the compressed size field. Since we are not working with compressed or encrypted data, the file data is the raw content of the file. We can extract it by reading it and writing it to a file named according to the file name field.
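
Put together, extracting one stored file could look like the sketch below. It assumes the whole archive is in memory, rejects anything outside the assumptions of this article (Bit flag and Compression method must both be 0), and returns the name and data instead of writing to disk. The function name is hypothetical.

```python
import struct

LFH_SIGNATURE = 0x04034B50
LFH_FORMAT = "<I5H3I2H"   # the 30 fixed bytes of the local file header
LFH_FIXED_SIZE = struct.calcsize(LFH_FORMAT)

def read_stored_file(data: bytes, lfh_offset: int) -> tuple:
    """Return (file name, file data) for the record starting at lfh_offset."""
    (sig, _version, bit_flag, method, _time, _date, _crc,
     compressed_size, _uncompressed_size,
     name_len, extra_len) = struct.unpack_from(LFH_FORMAT, data, lfh_offset)
    if sig != LFH_SIGNATURE:
        raise ValueError(f"bad local file header signature at offset {lfh_offset}")
    if bit_flag != 0 or method != 0:
        raise ValueError("only plain stored files are handled in this sketch")

    name_start = lfh_offset + LFH_FIXED_SIZE
    # Assumption: names are CP437 because Bit flag = 0.
    name = data[name_start:name_start + name_len].decode("cp437")

    # The file data starts after the name and the extra field.
    data_start = name_start + name_len + extra_len
    return name, data[data_start:data_start + compressed_size]
```

A caller would take lfh_offset from a central directory entry and then write the returned bytes to a file with the returned name.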

With that done, we have successfully understood the ZIP file format!
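
Going in the other direction is a good way to check that understanding. The following sketch writes a complete archive from the three structures above, under the same assumptions as before: stored (uncompressed) data, ASCII file names, a single disk, no comments and a fixed timestamp. The function name build_zip is made up for this example.

```python
import struct
import zlib

def build_zip(files: dict) -> bytes:
    """Build a ZIP archive from a {name: bytes} mapping, without compression."""
    records = b""     # local file headers, each followed by its file data
    directory = b""   # central directory file headers
    for name, content in files.items():
        raw_name = name.encode("ascii")
        crc, size = zlib.crc32(content), len(content)
        lfh_offset = len(records)

        # Local file header: version 1.0, no flags, method 0,
        # time 00:00:00, date 0x21 = 1980-01-01.
        records += struct.pack("<I5H3I2H", 0x04034B50, 10, 0, 0, 0, 0x21,
                               crc, size, size, len(raw_name), 0)
        records += raw_name + content

        # Central directory file header pointing back at the record.
        directory += struct.pack("<I6H3I5H2I", 0x02014B50, 10, 10, 0, 0, 0, 0x21,
                                 crc, size, size, len(raw_name), 0, 0, 0, 0, 0,
                                 lfh_offset)
        directory += raw_name

    # EOCD: disk 0, record counts, directory size and offset, empty comment.
    eocd = struct.pack("<I4H2IH", 0x06054B50, 0, 0, len(files), len(files),
                       len(directory), len(records), 0)
    return records + directory + eocd
```

Writing the result of build_zip({"a.txt": b"hello"}) to a file should give an archive that ordinary ZIP tools can open.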

Extra fields

The extra fields are there to make the ZIP format extensible. They can, for example, be used to add extra metadata required for encryption or certain compression algorithms.

An extra field contains a list of records like this one, which together fill the length of the field:

Bytes | Description
------+-------------------------------------------------------------------
2 | Header ID
2 | Data length (n)
n | Data
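
Splitting an extra field into its records is a short loop. A minimal sketch, where the function name is hypothetical and truncated trailing bytes are simply ignored:

```python
import struct

def parse_extra_field(extra: bytes) -> list:
    """Split an extra field into (header_id, data) records."""
    records = []
    pos = 0
    # Each record needs at least its 4-byte header (ID + data length).
    while pos + 4 <= len(extra):
        header_id, length = struct.unpack_from("<HH", extra, pos)
        pos += 4
        records.append((header_id, extra[pos:pos + length]))
        pos += length
    return records
```

The data of each record is returned as raw bytes, since its meaning depends entirely on the Header ID.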

One common extra field is the one with Header ID 0x5455, which holds a UTC Unix timestamp.

Another common extra field is the one with Header ID 0x0001. This is a field that contains data related to ZIP64 to allow sizes and offsets bigger than 32 bits.

You can also define your own if you have a specific use case.

Conclusion

The ZIP format itself is quite simple, but extensible. It also has some interesting quirks.

To find the entry point, you have to scan for the signature 0x06054b50. However, that signature may exist in a comment or even as valid data in the EOCD record itself. To be sure you found the real EOCD signature, you likely have to follow a few offsets and validate that their signature is correct as well, otherwise go back and continue scanning.
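
One possible way to do that validation is sketched below. The three checks are a choice made for this example and not something the format prescribes: the comment must end exactly at the end of the file, the central directory must fit before the candidate, and the first central directory header must carry its own signature. A decoy that happens to pass all three would still fool it.

```python
import struct

EOCD_SIGNATURE = b"PK\x05\x06"   # 0x06054b50, little-endian
CDFH_SIGNATURE = b"PK\x01\x02"   # 0x02014b50, little-endian

def find_valid_eocd(data: bytes) -> int:
    """Scan backwards and return the first EOCD candidate that validates."""
    if len(data) < 22:
        raise ValueError("too short to contain an EOCD")
    start = max(0, len(data) - 22 - 0xFFFF)
    pos = data.rfind(EOCD_SIGNATURE, start, len(data) - 18)
    while pos >= 0:
        # Fields at offset 10: total records, directory size and offset, comment length.
        total, cd_size, cd_offset, comment_len = struct.unpack_from(
            "<HIIH", data, pos + 10)
        comment_fits = pos + 22 + comment_len == len(data)
        directory_fits = cd_offset + cd_size <= pos
        first_header_ok = (total == 0
                           or data[cd_offset:cd_offset + 4] == CDFH_SIGNATURE)
        if comment_fits and directory_fits and first_header_ok:
            return pos
        # Not convincing: continue with the next candidate before this one.
        pos = data.rfind(EOCD_SIGNATURE, start, pos + 3)
    raise ValueError("no valid EOCD found")
```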
