Working with Compressed Tar Files in Go

Vladimir Vivien
Learning the Go Programming Language
Jul 19, 2019 · 8 min read


This post shows how to use the archive and compress packages to write code that programmatically builds or extracts compressed, tar-encoded archive files. Both packages use Go’s streaming IO idiom, which makes it easy to read or write data from diverse sources that can be compressed and archived.

Source code for this post: https://github.com/vladimirvivien/go-tar

Tar

A tar file is a collection of binary data segments (usually sourced from files). Each segment starts with a header that contains metadata about the binary data that follows it and how to reconstruct it as a file.

+---------------------------+
| [name][mode][uid][gid]    |
| ... |
+---------------------------+
| XXXXXXXXXXXXXXXXXXXXXXXXX |
| XXXXXXXXXXXXXXXXXXXXXXXXX |
| XXXXXXXXXXXXXXXXXXXXXXXXX |
+---------------------------+
| [name][mode][uid][gid]    |
| ... |
+---------------------------+
| XXXXXXXXXXXXXXXXXXXXXXXXX |
| XXXXXXXXXXXXXXXXXXXXXXXXX |
+---------------------------+

The tar Package

Let us start with a simple example that uses in-memory data (synthetic files) and tars that data into archive file out.tar. This illustrates how the different pieces of the tar package work.

The next section shows how to create tar files from actual file sources.

The next code snippet creates a function value assigned to tarWrite which loops through the provided map (files) to create the tar segments for the archive:

import (
    "archive/tar"
    "log"
    "os"
)

func main() {
    tarPath := "out.tar"
    files := map[string]string{
        "index.html": `<body>Hello!</body>`,
        "lang.json":  `[{"code":"eng","name":"English"}]`,
        "songs.txt":  `Claire de la lune, The Valkyrie, Swan Lake`,
    }
    tarWrite := func(data map[string]string) error {
        tarFile, err := os.Create(tarPath)
        if err != nil {
            return err
        }
        defer tarFile.Close()
        tw := tar.NewWriter(tarFile)
        defer tw.Close()
        for name, content := range data {
            hdr := &tar.Header{
                Name: name,
                Mode: 0600,
                Size: int64(len(content)),
            }
            if err := tw.WriteHeader(hdr); err != nil {
                return err
            }
            if _, err := tw.Write([]byte(content)); err != nil {
                return err
            }
        }
        return nil
    }

    if err := tarWrite(files); err != nil {
        log.Fatal(err)
    }
}

Source file https://github.com/vladimirvivien/go-tar/simple/tar1.go

In the previous snippet, variable tw is created as a *tar.Writer which uses tarFile as its target. For each (synthetic) file from map data, a tar.Header is created which specifies a file name, a file mode, and a file size. The header is then written with tw.WriteHeader followed by the content of the file using tw.Write.

There are many more tar header fields. The three illustrated are the minimum required to create a functional archive.

When the code is executed, it will create file out.tar. We can verify that the archive was created properly using the tar -tvf out.tar command.

The tar contains all three files as expected. However, note that because we used incomplete header information, some file information is wrong or missing (such as the date and file ownership).

To test the generated tar, use command tar -xvf out.tar to extract the files.

Programmatically, the files contained in the archive can be extracted using the tar package as well. The following source snippet opens the tar file and reconstructs its content on stdout:

import (
    "archive/tar"
    "fmt"
    "io"
    "log"
    "os"
)

func main() {
    tarPath := "out.tar"
    tarUnwrite := func() error {
        tarFile, err := os.Open(tarPath)
        if err != nil {
            return err
        }
        defer tarFile.Close()
        tr := tar.NewReader(tarFile)
        for {
            hdr, err := tr.Next()
            if err == io.EOF {
                break // end of archive
            }
            if err != nil {
                return err
            }
            fmt.Printf("Contents of %s: ", hdr.Name)
            if _, err := io.Copy(os.Stdout, tr); err != nil {
                return err
            }
            fmt.Println()
        }
        return nil
    }

    if err := tarUnwrite(); err != nil {
        log.Fatal(err)
    }
}

In the previous snippet, variable tr, of type *tar.Reader, extracts the files from archive file tarFile. Inside an infinite loop, the code visits each archive segment in order and reconstructs it by printing its content to standard out. The first step is to call tr.Next() to advance to the next segment’s header; when it returns io.EOF, the archive is exhausted and the loop breaks. Otherwise, the code reads the content of the segment (using io.Copy) and prints it.

While these examples are functionally complete, they are not the best way to use the package. The next sections introduce a couple of functions that make working with tar files more robust.

Tar from Files

After reading the previous section, readers should be familiar with the pieces necessary to create tar-encoded archives and extract files from them programmatically. This section, however, explores the more common usage of building tar files from file sources.

Function tartar, in the following snippet, creates a tar file from a list of specified paths. It uses function filepath.Walk with a filepath.WalkFunc (from package path/filepath) to walk the specified file trees:

import (
    "path/filepath"
)

func tartar(tarName string, paths []string) error {
    tarFile, err := os.Create(tarName)
    if err != nil {
        return err
    }
    defer tarFile.Close()
    tw := tar.NewWriter(tarFile)
    defer tw.Close()
    for _, path := range paths {
        walker := func(f string, fi os.FileInfo, err error) error {
            ...
            // fill in header info using func FileInfoHeader
            hdr, err := tar.FileInfoHeader(fi, fi.Name())
            ...
            // calculate relative file path
            relFilePath := f
            if filepath.IsAbs(path) {
                relFilePath, err = filepath.Rel(path, f)
                if err != nil {
                    return err
                }
            }
            hdr.Name = relFilePath
            if err := tw.WriteHeader(hdr); err != nil {
                return err
            }
            // if path is a dir, go to next segment
            if fi.Mode().IsDir() {
                return nil
            }
            // add file to tar
            srcFile, err := os.Open(f)
            ...
            defer srcFile.Close()
            if _, err := io.Copy(tw, srcFile); err != nil {
                return err
            }
            return nil
        }
        if err := filepath.Walk(path, walker); err != nil {
            fmt.Printf("failed to add %s to tar: %s\n", path, err)
        }
    }
    return nil
}

Full source github.com/vladimirvivien/go-tar/tartar/tartar.go

For the most part, this follows the same approach as before, where all of the work is done inside the walker function block. Here, however, instead of creating the tar header manually, function tar.FileInfoHeader is used to populate the header from the os.FileInfo value fi.

Note that when a directory is encountered, the code simply writes the header and moves on to the next file without writing any content. This creates a directory entry in the archive, which allows the fidelity of the tree structure to be maintained in the tar file.

When this code creates a tar, we can see that all of the file header information is added properly, including the correct time/date, ownership, and file mode.

Next, let us look at how the content of the archive can be extracted into a file tree on the filesystem programmatically. The following code uses function untartar to extract and reconstruct files from tar file tarName into path xpath:

func untartar(tarName, xpath string) (err error) {
    tarFile, err := os.Open(tarName)
    ...
    defer tarFile.Close()
    absPath, err := filepath.Abs(xpath)
    ...
    tr := tar.NewReader(tarFile)

    // untar each segment
    for {
        hdr, err := tr.Next()
        if err == io.EOF {
            break
        }
        if err != nil {
            return err
        }
        // determine proper file path info
        finfo := hdr.FileInfo()
        fileName := hdr.Name
        absFileName := filepath.Join(absPath, fileName)
        // if a dir, create it, then go to next segment
        if finfo.Mode().IsDir() {
            if err := os.MkdirAll(absFileName, 0755); err != nil {
                return err
            }
            continue
        }
        // create new file with original file mode
        file, err := os.OpenFile(
            absFileName,
            os.O_RDWR|os.O_CREATE|os.O_TRUNC,
            finfo.Mode().Perm(),
        )
        if err != nil {
            return err
        }
        fmt.Printf("x %s\n", absFileName)
        n, cpErr := io.Copy(file, tr)
        if closeErr := file.Close(); closeErr != nil {
            return closeErr
        }
        if cpErr != nil {
            return cpErr
        }
        if n != finfo.Size() {
            return fmt.Errorf("wrote %d, want %d", n, finfo.Size())
        }
    }
    return nil
}

Again, the extraction mechanism is similar to how it was done previously. Within a forever loop, method tr.Next is used to access the next header in the archive file. If the header is for a directory, the code creates the directory and moves on to the next header.

Recall that in the tartar function, header.Name is forced to be a relative path. This ensures that the file is placed in the proper subdirectory when it is extracted.

If the header is for a file, the file is created using os.OpenFile. This ensures that the file is created with the proper permission value. Finally, the code uses function io.Copy to transfer content from the archive into the newly created file.

Adding Compression

The compress package offers several compression formats (including gzip, bzip2, and lzw) that can easily be incorporated into your code. Again, since both the archive/tar and compress/gzip packages are implemented using Go’s streaming IO interfaces, it is trivial to change the code to compress the content of the archive file using gzip.

The following snippet updates function tartar to use the gzip compression when the archive file ends in .gz:

import (
    "compress/gzip"
    "strings"
)

func tartar(tarName string, paths []string) (err error) {
    tarFile, err := os.Create(tarName)
    ...

    // enable compression if file ends in .gz
    tw := tar.NewWriter(tarFile)
    if strings.HasSuffix(tarName, ".gz") {
        gz := gzip.NewWriter(tarFile)
        defer gz.Close()
        tw = tar.NewWriter(gz)
    }
    defer tw.Close()
    ...
}

The previous code update is all that is necessary to compress the content being added to the archive. The io.Writer instances tw and gz are chained to tarFile, allowing bytes destined for tarFile to be compressed as they are pipelined through gz. Pretty sweet!

Files compressed using tartar can be inspected using the gzip command:

> gzip -l tartar.tar.gz
  compressed  uncompressed  ratio  uncompressed_name
      724385       6213632  88.3%  tartar.tar

Programmatically, the code can decompress tar-encoded content while unpacking files from the archive. The following code snippet updates function untartar to chain io.Readers tarFile, gz, and tr:

func untartar(tarName, xpath string) (err error) {
    tarFile, err := os.Open(tarName)
    ...
    tr := tar.NewReader(tarFile)
    if strings.HasSuffix(tarName, ".gz") {
        gz, err := gzip.NewReader(tarFile)
        if err != nil {
            return err
        }
        defer gz.Close()
        tr = tar.NewReader(gz)
    }
    ...
}

With this change, the program will automatically decompress the content of the archived files using gzip. The same chaining strategy can be done to support other compression algorithms that implement the streaming IO API.

Conclusion

The archive and compress packages in Go demonstrate how a powerful standard library can help programmers build serious tools. Both packages use Go’s streaming IO constructs to work with compressed, tar-encoded files. Astute or curious readers are encouraged to update the code to use other archive formats or compression algorithms.

Also, don’t forget to check out my book on Go, titled Learning Go Programming, from Packt Publishing.
