Writing generic file reader in Rust

Mingwei Zhang
BGPKIT
Nov 25, 2021

When writing Rust code that handles compressed data files, one important design question is: how can we write code that reads files with different compression algorithms, potentially from different locations? In this post, we will discuss how we wrote a generic file reader that can read bzip2- or gzip-compressed files from either the local file system or remote locations.

Generics in Rust: Trait System

First, let's talk a bit about how Rust handles generics. We will use our coding task as an example: create a generic read interface that handles reading from files regardless of their compression algorithm or storage location.

A trait tells the Rust compiler about functionality a particular type has and can share with other types. [1]

When approaching the task, the first observation is that we are dealing with different types of IO sub-tasks that share a common feature: they all read from something, somewhere. In Rust, we define this kind of common behavior in a trait, in this case the Read trait.

Read trait definition.
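In simplified form, the core of std::io::Read is a single read method. Here is a minimal, std-only sketch of reading through the trait; Cursor is just a stand-in byte source for illustration:

```rust
use std::io::{Cursor, Read};

fn main() {
    // Simplified core of std::io::Read:
    //     pub trait Read {
    //         fn read(&mut self, buf: &mut [u8]) -> std::io::Result<usize>;
    //     }
    // Anything implementing Read can be consumed byte by byte.
    let mut source = Cursor::new(b"hello".to_vec()); // Cursor<Vec<u8>> implements Read
    let mut buf = Vec::new();
    source.read_to_end(&mut buf).unwrap();
    println!("read {} bytes", buf.len()); // prints "read 5 bytes"
}
```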

A trait defines a set of functions that every type sharing the trait must implement. The Read trait allows for reading bytes from a source. Reading from a compressed file, whether local or remote, always involves reading bytes from some source, and thus all of these cases should implement the Read trait.

As a downstream user of a library that handles reading, one does not necessarily care what happens behind the scenes, as long as the bytes keep coming in. In this case, we can define a function that returns a trait object.

A trait object points to both an instance of a type implementing a specified trait as well as a table used to look up trait methods on that type at runtime. [2]

In essence, a trait object is a way to let users call functions defined in a trait without specifying the concrete type that implements it. In our example, the user cares only about the bytes to read, not necessarily where the bytes come from. Here we can define such a trait object as Box<dyn Read> .

Here are a few points about this definition:

  • Box means the value lives on the heap rather than on the stack. Because we don’t know exactly what type the reader is, the compiler cannot allocate a fixed-size space for it on the stack.
  • dyn Read : the dyn keyword indicates that calls to the trait’s methods are dynamically dispatched. For more about dynamic dispatch, see the official documentation. Read is of course the trait we need here.
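To make this concrete, here is a minimal, std-only sketch of a function returning Box<dyn Read>, where the concrete reader type is chosen at runtime. The function name make_reader and the two sources are hypothetical stand-ins, not the article's actual code:

```rust
use std::io::{self, Cursor, Read};

// Returns a trait object: callers only see "something readable",
// not the concrete type behind it.
fn make_reader(use_memory: bool) -> Box<dyn Read> {
    if use_memory {
        // Cursor<Vec<u8>> implements Read
        Box::new(Cursor::new(b"from memory".to_vec()))
    } else {
        // io::empty() is another Read implementor, picked at runtime
        Box::new(io::empty())
    }
}

fn main() {
    let mut reader = make_reader(true);
    let mut text = String::new();
    reader.read_to_string(&mut text).unwrap();
    println!("{}", text); // prints "from memory"
}
```

Because the two branches return different concrete types, the shared Box<dyn Read> return type is exactly what makes this compile.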

Implementing a Generic Reader

Now we can implement a function that returns a generic reader that we want.

The specific task now is to design a function that takes a file path and returns a reader that can handle the file pointed to by the path. The path can be either a local file system path or a remote http[s] URL. The file pointed to by the path can be either a .gz or a .bz2 file, determined by the file suffix.

Determine File Type

The following one-liner determines the file type by looking at the file suffix:

let file_type = path.split('.').last().unwrap();

It effectively splits the path by . and takes the last piece (i.e. the file suffix).
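For illustration, a quick check of the suffix extraction on a made-up example path:

```rust
fn main() {
    // Hypothetical example path; the suffix is whatever follows the last '.'.
    let path = "http://archive.example.org/rib.20211125.bz2";
    let file_type = path.split('.').last().unwrap();
    println!("{}", file_type); // prints "bz2"
}
```

Note that a path containing no . at all yields the whole path as the "suffix", so real code may want to validate the result against the known extensions.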

Reader for Remote and Local Files

Now we need to create a reader that first reads the raw bytes before decompression.

let raw_reader: Box<dyn Read> = match path.starts_with("http") {
    true => {
        let response = Request::get(path).body(())?.send()?;
        Box::new(response.into_body())
    }
    false => {
        Box::new(File::open(path)?)
    }
};

The code above checks whether the file path starts with http. If so, we create a remote request and box the response body into a Read trait object. Here we use the isahc crate, whose response body implements the Read trait, fitting our usage here. If the file is a local one, we can simply box the File::open(path) result, which also implements the Read trait.

In essence, we can use any crate here that returns a content body implementing the Read trait.

Reader for Different Compressions

The second reading step is to read the raw bytes and decompress them. The data source in this step is the reader trait object from the previous step. Using the file type we determined earlier, we can decide which decompression library to use accordingly:

match file_type {
    "gz" => {
        let reader = Box::new(GzDecoder::new(raw_reader));
        Ok(Box::new(BufReader::new(reader)))
    }
    "bz2" => {
        let reader = Box::new(BzDecoder::new(raw_reader));
        Ok(Box::new(BufReader::new(reader)))
    }
    t => {
        panic!("unknown file type: {}", t)
    }
}

Here, raw_reader is the raw bytes reader from the previous step. We also wrap the decompression reader in a buffered reader so that we can handle decompression more efficiently.

Full Example

The full example is shown in the code snippet below:
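In case the embedded snippet does not render, the following sketch assembles the three steps into one function. It assumes the isahc, flate2, and bzip2 crates are available as dependencies; the function name get_reader is ours for illustration and may differ from the actual BGPKIT Parser code:

```rust
use std::fs::File;
use std::io::{BufReader, Read};

use bzip2::read::BzDecoder;
use flate2::read::GzDecoder;
use isahc::prelude::*;
use isahc::Request;

fn get_reader(path: &str) -> Result<Box<dyn Read>, Box<dyn std::error::Error>> {
    // Step 1: determine the file type from the suffix.
    let file_type = path.split('.').last().unwrap();

    // Step 2: open a raw byte reader, remote or local.
    let raw_reader: Box<dyn Read> = match path.starts_with("http") {
        true => {
            let response = Request::get(path).body(())?.send()?;
            Box::new(response.into_body())
        }
        false => Box::new(File::open(path)?),
    };

    // Step 3: wrap the raw reader in the matching decompressor,
    // buffered for efficiency.
    match file_type {
        "gz" => Ok(Box::new(BufReader::new(GzDecoder::new(raw_reader)))),
        "bz2" => Ok(Box::new(BufReader::new(BzDecoder::new(raw_reader)))),
        t => panic!("unknown file type: {}", t),
    }
}
```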

For more context and full code changes, see this pull request for BGPKIT Parser: https://github.com/bgpkit/bgpkit-parser/pull/16

Why not `reqwest`?

For those familiar with the Rust ecosystem, you may wonder why we don’t use the reqwest crate for handling HTTP requests. There are actually some good reasons why reqwest does not work well for our usage here.

We are designing our code base around a synchronous runtime, meaning that for reqwest we can only use the reqwest::blocking::* functions. The issue is that the API provided by reqwest's blocking feature does not allow us to do byte-by-byte reading. We have to read all the bytes first and then wrap them into a reader afterward. This is particularly unfavorable for our case, because in BGPKIT Parser we often need to handle large RIB table dumps (hundreds of MB in size), and loading whole files into memory is slow and memory intensive.

Instead of waiting for reqwest to download all remote bytes and then decompressing, we can use isahc's response body type, read bytes as they come in, and start parsing immediately. This approach not only allows the parser to start the parsing process significantly sooner, it also greatly reduces memory usage when handling multiple large table dump files.
