Working with Compressed Tar Files in Go

Vladimir Vivien
Jul 19, 2019 · 8 min read

This post shows how to use the archive and the compress packages to create code that can programmatically build or extract compressed files from tar-encoded archive files. Both packages use Go’s streaming IO idiom which makes it easy to read or write data from diverse sources that can be compressed and archived.

Source code for this post https://github.com/vladimirvivien/go-tar

Tar

A tar file is a collection of binary data segments (usually sourced from files). Each segment starts with a header that contains metadata about the binary data that follows it and how to reconstruct it as a file.

The tar Package

Let us start with a simple example that uses in-memory data (synthetic files) and tars that data into archive file out.tar. This illustrates how the different pieces of the tar package work.

The next section shows how to create tar files from actual file sources.

The next code snippet creates a function value assigned to tarWrite which loops through the provided map (files) to create the tar segments for the archive:

Source file https://github.com/vladimirvivien/go-tar/simple/tar1.go

In the previous snippet, variable tw is created as a *tar.Writer which uses tarFile as its target. For each (synthetic) file from map data, a tar.Header is created which specifies a file name, a file mode, and a file size. The header is then written with tw.WriteHeader followed by the content of the file using tw.Write.

There are many more tar header fields. The three illustrated are the minimum required to create a functional archive.

When the code is executed, it creates file out.tar. We can verify that the archive was properly created using the tar -tvf command:


We can see that the tar contains all three files as expected. However, note that because we used incomplete header information, some file information is either wrong or missing (such as the date, file ownership, etc).

To test the generated tar, use command tar -xvf out.tar to extract the files.

Programmatically, the files contained in the archive can be extracted using the tar package as well. The following source snippet opens the tar file and reconstructs its content on stdout:

In the previous snippet, variable tr, of type *tar.Reader, is used to extract the files from archive file tarFile. Using a forever-loop, the code visits each archive segment in order to reconstruct it by printing its content to standard out. The first step is to get the section’s header and check for EOF using tr.Next(). If not at EOF, the code reads the content of the section (using io.Copy) and prints it.

While these examples are functionally complete, they are not the best way to use the package. The next sections introduce a couple of functions that make working with tar files more robust.

Tar from Files

After reading the previous section, readers should be familiar with the pieces necessary to create tar-encoded archives and extract files from them programmatically. This section, however, explores the more common usage of building tar files from file sources.

Function tartar, in the following snippet, creates a tar file from a list of specified paths. It uses function filepath.Walk and type filepath.WalkFunc (from package path/filepath) to walk the specified file trees:

Full source github.com/vladimirvivien/go-tar/tartar/tartar.go

For the most part, this follows the same approach as before, where all of the work is done inside the walker function block. Here, however, instead of creating the tar header manually, function tar.FileInfoHeader is used to populate the header from the os.FileInfo value fi.

Note that when a directory is encountered, the code simply writes the header and moves on to the next file without writing any content. This creates a directory entry as an archive header, which allows the fidelity of the tree structure to be maintained in the tar file.

When this code creates a tar, we can see that all of the file header information is added properly, including the correct time/date, ownership, file mode, etc:


Next, let us look at how the content of the archive can be extracted into a file tree on the filesystem programmatically. The following code uses function untartar to extract and reconstruct files from tar file tarName into path xpath:

Again, the extraction mechanism is similar to how it was done previously. Within a forever loop, method tr.Next is used to access the next header in the archive file. If the header is for a directory, the code creates the directory and moves on to the next header.

Recall that in the tartar function, header.Name is forced to be a relative path. This ensures that the file is placed in the proper subdirectory when it is extracted.

If the header is for a file, the file is created using os.OpenFile. This ensures that the file is created with the proper permission value. Finally, the code uses function io.Copy to transfer content from the archive into the newly created file.

Adding Compression

The compress package offers several compression formats (including gzip, bzip2, lzw, etc) that can easily be incorporated in your code. Again, since both archive/tar and compress/gzip packages are implemented using Go’s streaming IO interfaces, it is trivial to change the code to compress the content of the archive file using gzip.

The following snippet updates function tartar to use the gzip compression when the archive file ends in .gz:

The previous code update is all that is necessary to compress the content being added to the archive. The io.Writer instances tw and gz are chained with tarFile, allowing bytes destined for tarFile to be compressed as they are pipelined through gz. Pretty sweet!

Files compressed using tartar can be inspected using the gzip command:

Programmatically, the code can decompress tar-encoded content while unpacking files from the archive. The following code snippet updates function untartar to chain io.Readers tarFile, gz, and tr:

With this change, the program will automatically decompress the content of the archived files using gzip. The same chaining strategy can be done to support other compression algorithms that implement the streaming IO API.

Conclusion

The archive and the compress packages in Go demonstrate how a powerful standard library can help programmers build serious tools. Both packages use Go’s streaming IO constructs to work with compressed tar-encoded files. Astute or curious readers are encouraged to update the code to use other archive or compression algorithms.

Also, don’t forget to check out my book on Go, titled Learning Go Programming from Packt Publishing.


Learning the Go Programming Language
