A complete guide for working with I/O streams and zip archives in Python 3

Naren Yellavula
Published in Dev bits · Mar 8, 2020



As part of a daily job, you sometimes have to work with zip archives. Even though it looks straightforward, a few custom requirements can leave you banging your head while searching for a clean way to manage zip files.

Recently, at my work, I implemented a feature where I had to download a zip file from S3, update its content, and upload it back to S3. The content can be dynamic, and I had to update only a specific part (a file) and retain all the others. In the process, I researched the topic a bit and explored the Python 3 standard library's zip utilities. I want to share my knowledge here.

After reading this article, you can work with zip files effortlessly in Python. I try to cover the use cases you might come across, along with tests that show how things work.

Note: We use Python 3.7 for the code samples. You can set up Python 3.7 in a virtual environment and activate it:

https://docs.python.org/3/library/venv.html

All the code samples can be found at this GitHub link:
https://github.com/narenaryan/python-zip-howto


Python I/O streams and buffers

Before jumping into Python’s zipfile API, let us discuss the basics of streams and buffers. They are necessary to understand the internals of how Python treats files and data in general. A file recognized by Python can store three types of data:

  1. Text (string)
  2. Binary (bytes)
  3. Raw data

Python considers an object falling into any of the above three categories a “file-like object.” File-like objects are also called streams, from which data can be read and to which data can be written. The data stored in a stream is called its buffer. The first two, i.e. text and binary streams, are buffered I/O streams; the raw type is unbuffered. In this article, we are only interested in buffered streams.

Everyone who has worked with Python has probably operated on files from disk. In Python, one can open a file like this:

with open('file.ext', 'mode') as f:
    # read from f

What precisely is the above code doing? It does this:

  1. Opens the file “file.ext” in mode “mode” and returns a stream f
  2. f is a stream whose buffer can be accessed randomly or sequentially
  3. We can read from or write to that stream, depending on the mode

A stream can be a file-like object in Python

In Python, we can also create in-memory streams that can hold different kinds of buffers. Python’s io package provides two classes:

  1. StringIO: for storing text (string) buffers
  2. BytesIO: for storing binary buffers

Let us discuss each buffered I/O type in detail.

Text Streams

A text stream operates on a text buffer. Using StringIO, we can create an initialized, in-memory “file-like object” that stores a text buffer, like this:
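The embedded gist is not reproduced in this export; a minimal sketch of such a program might look like this (the message text is illustrative):

```python
import io

# Create a new in-memory text stream
text_stream = io.StringIO()

# Write some data into the stream's buffer
text_stream.write('Hello from a text stream!')

# getvalue() returns the entire buffer content as a str
print(text_stream.getvalue())
```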

The preceding program creates a new text stream, writes some data to its buffer, and then prints the buffer content to the console. This text stream can be passed freely among Python functions that accept an I/O stream.

One should be aware that, in Python, a file-like object can be used in any I/O operation. The classic example is the print function.

Python’s print function takes a keyword argument called file that decides which stream to write the given message/objects to. Its value is a “file-like object.” See the signature of print:

print(*objects, sep=' ', end='\n', file=sys.stdout, flush=False)

The sys.stdout is a stream, which is a file-like object. That default value makes Python write to the console. We can also change that destination to any custom writable stream. Let us modify our program to change the destination of print to our custom text stream.
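The modified program is missing from this export; a sketch that redirects print into a custom text stream might look like this:

```python
import io

# Create an in-memory text stream to act as print's destination
stream = io.StringIO()

# Redirect print's output into the stream instead of sys.stdout
print('Hello, stream!', file=stream)

# The message (plus print's trailing newline) now lives in the buffer
print(stream.getvalue())
```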

Text streams are only useful for operating on text buffers (XML, JSON, CSV). There are many other cases where we have to represent binary buffers (ZIP, PDF, custom extensions) in program memory. Binary streams come to the rescue.

Binary Streams

A binary stream stores and operates on binary data (bytes). It has the same methods as StringIO, like getvalue, read, and write, except that it operates on a different kind of buffer internally. Let us see an example where we create an in-memory binary stream with some data. We can read the contents of the buffer using the getvalue method. If we try to copy the content of a binary stream into a text stream, it throws a TypeError.
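The original snippet is not included here; a sketch of the same idea, with illustrative bytes, might look like this:

```python
import io

# Create an in-memory binary stream initialized with some bytes
# (these happen to be the 'PK' magic bytes that start a zip file)
binary_stream = io.BytesIO(b'\x50\x4b\x03\x04')
print(binary_stream.getvalue())

# Copying binary content into a text stream raises TypeError
text_stream = io.StringIO()
try:
    text_stream.write(binary_stream.getvalue())
except TypeError as err:
    print('TypeError:', err)
```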

One can store any binary data coming from a PDF or ZIP file into a custom binary stream like the preceding one.

StringIO & BytesIO are high-level interfaces for Buffered I/O in Python.

In the next section, we discuss the basics of zipping in Python. We also see many use cases with examples.

Understanding the Python `zipfile` API

A zip file is a binary file. The contents of a zip file are compressed using an algorithm, and paths are preserved. All open-source zip tools do the same thing: understand the binary representation and process it. It is a no-brainer that one should use BytesIO while working with zip files. Python provides a package to work with zip archives, called zipfile.

The zipfile package has two main classes:

  1. ZipFile : to represent a zip archive in memory.
  2. ZipInfo : to represent a member of a zip file.

A ZipFile is an exact representation of a zip archive. It means you can load a .zip file directly into that class object or dump a ZipFile object to a new archive. Every ZipFile has a list of members. Those members are ZipInfo objects.

A ZipInfo object represents a path in the zip file: the combination of its directories plus the file name. For example, let us say we have a directory called config that stores configurations for the application, the containers, and some root-level settings. Assume the content looks like this:

config
├── app
│   └── app-config.json
├── docker
│   └── docker-compose.yaml
└── root-config.json

2 directories, 3 files

If you zip the config directory with your favourite zip tool (I pick this Python command),

python -m zipfile -c config.zip config

and then try to list the contents of config.zip using this Python command,

python -m zipfile -l config.zip

It displays all paths in the zip file.

What are those paths? Each path listed in the output is a ZipInfo object to Python. To prove that, let us write a small script that loads config.zip into memory as a zip archive and prints its members.
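The script itself did not survive this export; a self-contained sketch follows. To avoid depending on files on disk, it first recreates config.zip the way the earlier command would (the CLI records directory entries too, which is why there are 6 members, not 3); repr details such as the compression type may differ slightly from the listing below.

```python
from zipfile import ZipFile

# Recreate config.zip as `python -m zipfile -c config.zip config` would,
# including the directory entries themselves (contents are illustrative)
with ZipFile('config.zip', 'w') as archive:
    for name in ('config/', 'config/app/', 'config/app/app-config.json',
                 'config/docker/', 'config/docker/docker-compose.yaml',
                 'config/root-config.json'):
        archive.writestr(name, '')

# Open the archive and print every ZipInfo member
with ZipFile('config.zip', 'r') as archive:
    for info in archive.filelist:
        print(info)
    print(f'There are {len(archive.filelist)} ZipInfo objects present in archive')
```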

When you run this script, you see the following output:

<ZipInfo filename='config/' filemode='drwxr-xr-x' external_attr=0x10>
<ZipInfo filename='config/docker/' filemode='drwxr-xr-x' external_attr=0x10>
<ZipInfo filename='config/docker/docker-compose.yaml' compress_type=deflate filemode='-rw-r--r--' file_size=0 compress_size=2>
<ZipInfo filename='config/app/' filemode='drwxr-xr-x' external_attr=0x10>
<ZipInfo filename='config/app/app-config.json' compress_type=deflate filemode='-rw-r--r--' file_size=0 compress_size=2>
<ZipInfo filename='config/root-config.json' compress_type=deflate filemode='-rw-r--r--' file_size=0 compress_size=2>
There are 6 ZipInfo objects present in archive

This ZipInfo object is critical for modifying a file/path in the archive. It is a high-level wrapper around a file stream. On a ZipInfo object, one can read or modify data. One can also create new ZipInfo objects and add them to the archive. Let us walk through all the variations with simple Python programs that create and update zip archives in the next section. The examples use in-memory streams wherever possible instead of creating zip files on disk.

Use case #1: Create zip archive with files

We can create a zip file with the given name by opening a new ZipFile object with write mode ‘w’ or exclusive create mode ‘x.’

Next, we can add files/paths to the zip archive. There are two approaches to do that:

v1: Add a file as a file-like object

This approach writes independent files as file-like objects.
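The v1 gist is missing from this export; a sketch of the approach, with hypothetical file names and contents, might look like this:

```python
from io import StringIO
from zipfile import ZipFile, ZIP_DEFLATED

# Each file's content is held in a file-like object (contents are illustrative)
files = {
    'docker/docker-compose.yaml': StringIO('version: "3"\n'),
    'app/app-config.json': StringIO('{"name": "app"}\n'),
    'root-config.json': StringIO('{"env": "dev"}\n'),
}

# Open a new archive in write mode and add each path with its data
with ZipFile('config.zip', 'w', compression=ZIP_DEFLATED) as zip_archive:
    for path, stream in files.items():
        zip_archive.writestr(path, stream.read())
```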

v2: Add a file as a ZipInfo object

This approach composes files as objects and gives more flexibility to add meta information on file.
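The v2 gist is also missing; a sketch with hypothetical content shows how a ZipInfo object carries metadata such as the modification date:

```python
from zipfile import ZipFile, ZipInfo, ZIP_DEFLATED

with ZipFile('config.zip', 'w') as zip_archive:
    # Compose a ZipInfo for the path, setting metadata up front
    info = ZipInfo('docker/docker-compose.yaml',
                   date_time=(2020, 3, 8, 20, 5, 48))
    # Per-member compression settings live on the ZipInfo object
    info.compress_type = ZIP_DEFLATED
    zip_archive.writestr(info, 'version: "3"\n')
```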

Both versions create a config.zip file on disk. While creating a file in the archive, they use relative paths like this:

docker/docker-compose.yaml

v2 is slightly more flexible, as it gives the freedom to modify ZipInfo object properties at any point in time.

Use case #2: Read a file from zip archive

Another possible use case is to read a file from an existing zip archive. Let us use the config.zip file created from Use case #1.

For example: read the content of docker-compose.yaml from the zip and print it.
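The reading snippet is not included here; a self-contained sketch follows (the setup step stands in for the config.zip produced in use case #1, with illustrative content):

```python
from zipfile import ZipFile

# Setup: stand-in for the config.zip created in use case #1
with ZipFile('config.zip', 'w') as archive:
    archive.writestr('docker/docker-compose.yaml', 'version: "3"\n')

# Open the archive in read mode and print one member's content
with ZipFile('config.zip', 'r') as archive:
    data = archive.read('docker/docker-compose.yaml')
    print(data.decode('utf-8'))
```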

Use case #3: Update or Insert a file in zip archive

This use case is the trickiest part of the zipping business in Python. At first look, it might seem simple. Let us attempt a few solutions.

Attempt #1

The obvious thing that comes to one’s mind is to update the specific file in the archive with the latest data, like this:
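The naive script is missing from this export; it might look like the following sketch (the new YAML content is illustrative):

```python
from zipfile import ZipFile

# Naive attempt: open the existing archive in write mode 'w'
# and write the updated file into it
with ZipFile('config.zip', 'w') as zip_archive:
    zip_archive.writestr('docker/docker-compose.yaml', 'version: "3.7"\n')
```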

If we run the preceding script, it replaces the file in the archive config.zip, but because the zipfile is opened in write mode ‘w,’ the other files/paths in the archive vanish. You can check it using this command:

❯ python -m zipfile -l config.zip
File Name                   Modified                   Size
docker/docker-compose.yaml  2020-03-08 20:05:48          27

Woah, the root config and app config have vanished from config.zip. That is a nasty side-effect.

Don’t use ‘w’ mode, when you update/replace a single file in a zip archive, or your data is gone for good.

Attempt #2

Can’t we append a file to the existing zip? Will it magically overwrite the file? Yes, it does. Just change the mode in the previous code snippet from ‘w’ to ‘a’:

with ZipFile('config.zip', 'a') as zip_archive:
    ...

Now rerun the script on a fresh config.zip (one that has the root, docker, and app configs). You see this warning:

/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/zipfile.py:1506: UserWarning: Duplicate name: 'docker/docker-compose.yaml'
return self._open_to_write(zinfo, force_zip64=force_zip64)

It is just a warning. So go ahead and extract the content like this to see what is inside:

python -m zipfile -e config.zip config

and you find there is only one docker-compose.yaml file in the docker directory, with all other files/paths preserved. Wonderful!

❯ tree config
config
├── app
│   └── app-config.json
├── docker
│   └── docker-compose.yaml
└── root-config.json

2 directories, 3 files

Even though it seems to be an obvious solution, there is a serious bug here. The extracted archive may not show duplicate files, but the underlying archive stores duplicate entries.

Run the zipfile list command to see those hidden duplicates:

❯ python -m zipfile -l config.zip
File Name                   Modified                   Size
docker/docker-compose.yaml  1980-01-01 00:00:00          23
app/app-config.json         1980-01-01 00:00:00          21
root-config.json            1980-01-01 00:00:00          22
docker/docker-compose.yaml  2020-03-08 20:34:48          31

So docker/docker-compose.yaml appears twice in the ZipInfo list but only once in the extraction. With every update, the zip archive grows by roughly the size of the updated file. If you ignore the Python warning, at some point the junk in the archive may occupy more space than the actual files.

The two attempts so far couldn’t achieve an acceptable solution. Now comes the third, which is clean and elegant.

Attempt #3

There is no easy way to update the contents of a zip archive in place. A clean way is to create a new zip archive in memory and copy the old ZipInfo objects from the old archive into the new one. For the path where data should be inserted or replaced, instead of reading from the old archive, create a custom ZipInfo object with the new data and add it to the new archive.

The algorithm looks like this:
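The algorithm gist is not reproduced in this export; a sketch of it might look like the following (the function name `update_zip` is illustrative, not from the original repo):

```python
from io import BytesIO
from zipfile import ZipFile, ZipInfo, ZIP_DEFLATED

def update_zip(archive_name, path, data):
    """Clone the archive in memory, replacing or inserting `path` with `data`."""
    buffer = BytesIO()
    with ZipFile(archive_name, 'r') as old_archive, \
            ZipFile(buffer, 'w', compression=ZIP_DEFLATED) as new_archive:
        replaced = False
        for item in old_archive.filelist:
            if item.filename == path:
                # Existing member: write fresh data under a new ZipInfo
                new_archive.writestr(ZipInfo(path), data)
                replaced = True
            else:
                # Other members are copied over untouched
                new_archive.writestr(item, old_archive.read(item.filename))
        if not replaced:
            # The path was not present in the old archive, so insert it
            new_archive.writestr(ZipInfo(path), data)
    # Overwrite the old archive with the cloned one
    with open(archive_name, 'wb') as f:
        f.write(buffer.getvalue())
```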

Here, we define a function that takes the path in the archive and the data to replace it with. It iterates over the old archive and copies the existing members into the new archive. When it spots the element to update, it creates a new ZipInfo object with the fresh data and puts that into the new archive instead. Run the script on a fresh config.zip (created by createzipv1.py), and you see there is no duplication of file objects, and docker/docker-compose.yaml is updated as expected.

This solution has a minor drawback: dealing with two streams at a given time can, in the worst case, consume double the amount of run-time memory. That is rarely a problem unless you are dealing with gigabyte-sized zip files.

Use the technique of cloning for updating/inserting paths in a zip archive.

Use case #4: Remove an existing file from zip archive

By now, after looking at many use cases, one can guess how to remove a file from the archive. The cleanest way again is to copy contents from the old archive to the new archive and skip the ZipInfo objects that match the given path. Finally, overwrite the old zip file with the new zip file. The algorithm should have only one condition like this,

...
for item in old_archive.filelist:
    if item.filename != path:
        new_archive.writestr(item, old_archive.read(item.filename))
...

The delete script then has a function that takes only a path argument and skips the matching ZipInfo object while copying.
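Filling in around that condition, a complete sketch of the delete function might be (the name `delete_from_zip` is illustrative):

```python
from io import BytesIO
from zipfile import ZipFile, ZIP_DEFLATED

def delete_from_zip(archive_name, path):
    """Clone the archive into memory, skipping the member that matches `path`."""
    buffer = BytesIO()
    with ZipFile(archive_name, 'r') as old_archive, \
            ZipFile(buffer, 'w', compression=ZIP_DEFLATED) as new_archive:
        for item in old_archive.filelist:
            if item.filename != path:
                new_archive.writestr(item, old_archive.read(item.filename))
    # Overwrite the old zip file with the new one
    with open(archive_name, 'wb') as f:
        f.write(buffer.getvalue())
```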

That covers the use cases that commonly pop up while working with zip files in Python.

Note: The in-memory stream objects created (using BytesIO) in the above scripts can also be used with AWS S3 instead of being flushed to disk.

Final words

As we already discussed, one should monitor the size of the zip file and the program memory consumed by a copy operation.

A proper implementation uses a combination of techniques instead of a brute-force approach. For example, when a stream holds a considerable buffer, Python provides a function called shutil.copyfileobj() to copy file-like objects from source to destination efficiently; it chunks the buffer while copying. To solve the memory problem while updating/inserting/deleting paths in a big archive, one can use it for copying objects.
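A quick illustration of chunked copying between two in-memory streams (the buffer size and chunk length are illustrative):

```python
import shutil
from io import BytesIO

# A source stream holding a 1 MB buffer
src = BytesIO(b'x' * (1024 * 1024))
dst = BytesIO()

# Copy in 64 KB chunks instead of materializing the whole buffer at once
shutil.copyfileobj(src, dst, length=64 * 1024)

print(dst.getbuffer().nbytes)
```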

You can find more information in the Python shutil documentation.

I hope you enjoyed this article! You can find all the code samples here.

https://github.com/narenaryan/python-zip-howto
