Rust: Reading a file line by line while being mindful of RAM usage.

Thomas Simmer
6 min read · Feb 10, 2024


Imagine your task is to read a file and do some work with each of its lines. In Rust, you could do something like:

use std::fs::File;
use std::io::{BufRead, BufReader};

let file = File::open(file_path).unwrap();
let reader = BufReader::new(file);
let lines = reader.lines().enumerate();

for (_index, bufread) in lines {
    let line = bufread.unwrap_or_default();
    // Do your stuff here...
}

The issue here is that if your file contains 100 GB of data on a single line (i.e., before any “\n” character), your program will try to hold all of it in memory at once, and both your program and your computer will grind to a halt.

Let’s understand why.

For that, go to the definition of the lines() method.

Screenshot: the lines() method in the BufRead trait.
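Paraphrased from the standard library source (details may vary between Rust versions), it looks roughly like this:

fn lines(self) -> Lines<Self>
where
    Self: Sized,
{
    Lines { buf: self }
}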

Now go to the definition of this Lines object.

As you can see, Lines is an iterator over the lines of our file, and it uses the buffer’s read_line method to read them. Its next method creates a mutable variable buf (not to be confused with the buf field of the iterator) and passes it to read_line. Let’s look at this other method.
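For reference, here are the Lines struct and its next method, again paraphrased from the standard library source (stability attributes omitted; details may vary between versions):

pub struct Lines<B> {
    buf: B,
}

impl<B: BufRead> Iterator for Lines<B> {
    type Item = Result<String>;

    fn next(&mut self) -> Option<Result<String>> {
        // A fresh String is allocated for every line.
        let mut buf = String::new();
        match self.buf.read_line(&mut buf) {
            Ok(0) => None, // End of file.
            Ok(_n) => {
                // The trailing "\n" (and "\r") is stripped before the line is returned.
                if buf.ends_with('\n') {
                    buf.pop();
                    if buf.ends_with('\r') {
                        buf.pop();
                    }
                }
                Some(Ok(buf))
            }
            Err(e) => Some(Err(e)),
        }
    }
}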

I am not an expert, but it seems to me that read_line simply appends the result of read_until to the buf variable. This read_until method is the last thing we need to understand.
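In the standard library, read_line is essentially a thin wrapper around read_until (paraphrased here; append_to_string is an internal helper that checks the appended bytes are valid UTF-8):

fn read_line(&mut self, buf: &mut String) -> Result<usize> {
    // Appends bytes up to and including '\n' into `buf`, then validates UTF-8.
    unsafe { append_to_string(buf, |b| read_until(self, b'\n', b)) }
}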

This method relies on a loop. Inside it, fill_buf gives us a bounded slice called available, taken from the internal buffer of our reader r, which implements the BufRead trait.

Just for your information, if you dig a bit deeper, you will find that this internal buffer’s capacity is DEFAULT_BUF_SIZE (8 * 1024 bytes, i.e. 8 KiB).

If available contains our delimiter “\n”, we append the beginning of that slice to buf, up to and including the delimiter, and we are done. If it does not contain the delimiter, we append the whole of available to buf and go through the loop again. This is where the problem lies: this buf variable can grow indefinitely.
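Here is the whole read_until function, paraphrased from the standard library source (the real code uses std’s internal memchr; the memchr crate provides the same function, and details may vary between versions):

use std::io::{BufRead, ErrorKind, Result};

fn read_until<R: BufRead + ?Sized>(r: &mut R, delim: u8, buf: &mut Vec<u8>) -> Result<usize> {
    let mut read = 0;
    loop {
        let (done, used) = {
            // `available` is a view into the reader's internal buffer,
            // so it holds at most DEFAULT_BUF_SIZE bytes.
            let available = match r.fill_buf() {
                Ok(n) => n,
                Err(ref e) if e.kind() == ErrorKind::Interrupted => continue,
                Err(e) => return Err(e),
            };
            match memchr::memchr(delim, available) {
                // Delimiter found: copy up to and including it, then stop.
                Some(i) => {
                    buf.extend_from_slice(&available[..=i]);
                    (true, i + 1)
                }
                // No delimiter: copy everything and loop again. `buf` keeps growing.
                None => {
                    buf.extend_from_slice(available);
                    (false, available.len())
                }
            }
        };
        r.consume(used);
        read += used;
        if done || used == 0 {
            return Ok(read);
        }
    }
}

Notice that nothing bounds how much gets copied into buf. That is exactly what our own version will fix.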

To meet our needs, let’s create our own LimitedBufferedReader. Just before that, add these two constants to your config.rs file.

pub const NB_BYTES_ALLOWED_PER_LINE: usize = 10_000;
pub const USE_LIMITED_BUFFER: bool = true;

Now, let’s start with the necessary methods and traits to implement.

use crate::config::*;
use std::cmp::min;
use std::io::{BufRead, BufReader, ErrorKind, Read, Result};

pub struct LimitedBufferedReader<R> {
    reader: BufReader<R>,
}

impl<R: Read> LimitedBufferedReader<R> {
    pub fn new(inner: R) -> Self {
        LimitedBufferedReader {
            reader: BufReader::new(inner),
        }
    }
}

impl<R: Read> Read for LimitedBufferedReader<R> {
    fn read(&mut self, buf: &mut [u8]) -> Result<usize> {
        self.reader.read(buf)
    }
}

impl<R: Read> BufRead for LimitedBufferedReader<R> {
    fn fill_buf(&mut self) -> Result<&[u8]> {
        self.reader.fill_buf()
    }

    fn consume(&mut self, amt: usize) {
        self.reader.consume(amt)
    }
}

Now, let’s add two other methods to this BufRead trait implementation. Note that the code below uses the memchr crate, so add it to your Cargo.toml.


    fn read_line(&mut self, buf: &mut String) -> Result<usize> {
        unsafe { self.read_until(b'\n', buf.as_mut_vec()) }
    }

    fn read_until(&mut self, delim: u8, buf: &mut Vec<u8>) -> Result<usize> {
        println!("Using our optimized method to read a line...");

        let mut read = 0;

        loop {
            let (done, used) = {
                let available = match self.fill_buf() {
                    Ok(n) => n,
                    Err(ref e) if e.kind() == ErrorKind::Interrupted => continue,
                    Err(e) => return Err(e),
                };

                match memchr::memchr(delim, available) {
                    Some(i) => {
                        if read <= NB_BYTES_ALLOWED_PER_LINE {
                            // `available` holds at most DEFAULT_BUF_SIZE bytes, but we
                            // shouldn't write more than NB_BYTES_ALLOWED_PER_LINE bytes
                            // into `buf` in total.
                            let max_bytes_to_write = min(NB_BYTES_ALLOWED_PER_LINE - read, i);
                            buf.extend_from_slice(&available[..max_bytes_to_write]);
                        }
                        (true, i + 1)
                    }
                    None => {
                        let available_len = available.len();
                        if read <= NB_BYTES_ALLOWED_PER_LINE {
                            let max_bytes_to_write =
                                min(NB_BYTES_ALLOWED_PER_LINE - read, available_len);
                            buf.extend_from_slice(&available[..max_bytes_to_write]);
                        }
                        (false, available_len)
                    }
                }
            };

            self.consume(used);
            read += used;

            if done || used == 0 {
                return Ok(read);
            }
        }
    }
}

To put it in simple words, we are just putting a limit on the size of that buf variable. Whether available contains our delimiter or not, we copy its content into buf only up to the NB_BYTES_ALLOWED_PER_LINE limit, while still consuming the rest of the line from the underlying reader.
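To illustrate how this plugs in, here is a minimal usage sketch (the file name is hypothetical). Since LimitedBufferedReader implements BufRead, the provided lines() method automatically goes through our read_line:

use std::fs::File;
use std::io::BufRead;

use crate::structs::limited_buffered_reader::LimitedBufferedReader;

let file = File::open("huge_line.txt").unwrap();
let reader = LimitedBufferedReader::new(file);

for line in reader.lines() {
    // Each line is truncated to at most NB_BYTES_ALLOWED_PER_LINE bytes,
    // but the rest of the line is still consumed from the file.
    let line = line.unwrap_or_default();
    println!("{} bytes kept", line.len());
}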

Good. Now, how can we be sure that this does exactly what we want?

We will write a test that creates a file with a huge line and watch how the RAM usage reported by htop reacts while we iterate over its lines, first with the regular BufReader and then with our LimitedBufferedReader.

The first thing to have is a helper to easily toggle between our LimitedBufferedReader and the regular BufReader.

use crate::config::*;
use crate::structs::limited_buffered_reader::LimitedBufferedReader;
use std::fs::File;
use std::io::{BufRead, BufReader, Read, Result};

pub enum CustomReader {
    Limited(LimitedBufferedReader<BufReader<File>>),
    Regular(BufReader<File>),
}

pub fn build_reader(file: File) -> CustomReader {
    if USE_LIMITED_BUFFER {
        let inner = BufReader::new(file);
        CustomReader::Limited(LimitedBufferedReader::new(inner))
    } else {
        CustomReader::Regular(BufReader::new(file))
    }
}

// There should be other necessary methods to implement, but I try to keep this article short.
// You can find them on the GitHub project at the end of this page.
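For completeness, here is a plausible sketch of those delegations, continuing in the same file (this is my own assumption of what they look like; the actual code on GitHub may differ). The important detail is that read_line must be forwarded explicitly, otherwise the default BufRead implementation would bypass our limited read_until:

impl Read for CustomReader {
    fn read(&mut self, buf: &mut [u8]) -> Result<usize> {
        match self {
            CustomReader::Limited(r) => r.read(buf),
            CustomReader::Regular(r) => r.read(buf),
        }
    }
}

impl BufRead for CustomReader {
    fn fill_buf(&mut self) -> Result<&[u8]> {
        match self {
            CustomReader::Limited(r) => r.fill_buf(),
            CustomReader::Regular(r) => r.fill_buf(),
        }
    }

    fn consume(&mut self, amt: usize) {
        match self {
            CustomReader::Limited(r) => r.consume(amt),
            CustomReader::Regular(r) => r.consume(amt),
        }
    }

    // Forwarding read_line is what makes lines() use the limited version.
    fn read_line(&mut self, buf: &mut String) -> Result<usize> {
        match self {
            CustomReader::Limited(r) => r.read_line(buf),
            CustomReader::Regular(r) => r.read_line(buf),
        }
    }
}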

And now let’s write two small tests: one for the case where a line is shorter than NB_BYTES_ALLOWED_PER_LINE bytes, and one where it is longer.

#[cfg(test)]
mod tests {
    use rust_file_read::config::{NB_BYTES_ALLOWED_PER_LINE, USE_LIMITED_BUFFER};
    use rust_file_read::utils::helpers::build_reader;
    use std::fs;
    use std::thread::sleep;
    use std::time::Duration;
    use std::{fs::File, io::BufRead};

    fn write_in_file_and_read(n: usize, duration: Option<Duration>) {
        // Build a file with a line containing n zeros.
        let file_path = "test";
        File::create(file_path).unwrap();
        fs::write(file_path, &"0".repeat(n)).unwrap();

        let file = File::open(file_path).unwrap();
        let reader = build_reader(file);
        let lines = reader.lines().enumerate();

        for (_index, bufread) in lines {
            let line = bufread.unwrap_or_default();

            if USE_LIMITED_BUFFER && n > NB_BYTES_ALLOWED_PER_LINE {
                assert_eq!(line.len(), NB_BYTES_ALLOWED_PER_LINE);
            } else {
                assert_eq!(line.len(), n);
            }

            if let Some(duration) = duration {
                println!("\n---------- LOOK htop NOW -----------\n");
                sleep(duration);
                println!("\n---------- STOP LOOKING htop NOW -----------\n");
            }
        }

        fs::remove_file(file_path).unwrap();
    }

    #[test]
    fn read_lines_under_limit() {
        write_in_file_and_read(4, None);
    }

    #[test]
    fn read_lines_above_limit() {
        write_in_file_and_read(1_000_000_000, Some(Duration::from_millis(5000)))
    }
}

As you can see, I am using htop to monitor RAM usage while the program holds the content of the line in memory. This is mainly useful for the read_lines_above_limit test: when it runs and the “LOOK htop NOW” message appears, I look at the RAM usage and compare it between our LimitedBufferedReader and the regular BufReader.

Let’s start with BufReader. Change your config.rs so that you have:

pub const NB_BYTES_ALLOWED_PER_LINE: usize = 10_000;
pub const USE_LIMITED_BUFFER: bool = false;

Before we run it, let’s see our normal RAM usage.

It is about 3.82 GB. Now run the test, and don’t forget to watch htop.

The RAM usage climbed to 4.76 GB. The test still passes.

Now let’s try with this config:

pub const NB_BYTES_ALLOWED_PER_LINE: usize = 10_000;
pub const USE_LIMITED_BUFFER: bool = true;

The RAM usage stays around 3.83 GB, the same as before the test, and the test passes.

This confirms that our optimization worked.

Conclusion

I hope this article has helped or interested you. I am not yet an expert in Rust, and even if I were, please correct me if I got anything wrong in this article, and feel free to share any recommendations.

You will find the GitHub project here: https://github.com/thomassimmer/rust-file-read
