Making an HTML parsing script a hundred times faster with Rayon

Sam Van Overmeire
4 min read · Apr 26, 2024


Image by Bing Image Creator

Recently, I was writing a Rust script to gather some data from the internet. The task at hand was fairly simple: visit a URL, retrieve the HTML, parse it, use selectors to find relevant information, and write the result to a CSV file. I used the reqwest crate for visiting the sites, scraper for parsing the HTML, and csv for writing the file. There was only one catch: the script would have to visit hundreds of pages.

// imports from std, scraper, reqwest, serde and anyhow

fn main() -> Result<()> {
    // setup
    let services = retrieve_services_from_file()?;
    let client = Client::new();
    let class_selector = Selector::parse(".impl-items > details").unwrap();
    // other selectors

    // loop, retrieve, analyze, write
    let results = services
        .iter()
        .filter(|s| !s.is_empty())
        .map(|service| {
            let docs = retrieve_page(&client, service)?;
            let analysis = analyze_text(
                &docs,
                &class_selector,
                // more selectors
            )?;
            write_to_file(service, analysis)?;
            Ok(())
        })
        .collect::<Vec<Result<()>>>();

    // print a message for the failures (if any)
    for result in results {
        match result {
            Ok(_) => {}
            Err(e) => {
                println!("An error occurred: {}", e);
            }
        }
    }

    Ok(())
}

// functions

Running the script serially on my machine took over 5 minutes (324 seconds) according to the time command. That’s a long feedback loop. And while writing the first version of my code, I noticed that most of the work could be done in parallel, like building the URL, retrieving the HTML, parsing and analyzing it. In fact, I had decided to write to one file per page — instead of having a single file for all output — because that would make writing results independent and parallelizable. (Plus, combining files afterward is just one simple and fast bash command away, e.g. cat *.csv >> output.csv.)

I had already toyed around with the rayon library, which regularly features in blogs and articles as an easy way to speed up Rust code without changing your implementation. (Coincidentally, while I was writing this, Shuttle released a post about rayon and good use cases for the library.) So, I decided to give it a try. After adding the crate (rayon = "1.10.0"), all I did was add two imports and change the iter into par_iter.

// earlier imports
use rayon::iter::IntoParallelRefIterator; // <= additional import
use rayon::iter::ParallelIterator; // <= additional import

fn main() -> Result<()> {
    // setup is still the same

    let results = services
        .par_iter() // <= par_iter instead of iter!
        .filter(|s| !s.is_empty())
        .map(|service| { // <= this map can now be done in parallel!
            let docs = retrieve_page(&client, service)?;
            let analysis = analyze_text(
                &docs,
                &class_selector,
                // ...
            )?;
            write_to_file(service, analysis)?;
            Ok(())
        })
        .collect::<Vec<Result<()>>>();

    // everything else is also the same
}

// functions
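A side note on the collect at the end: the script collects into Vec<Result<()>>, so one failing page doesn't abort the others, and every failure can be reported afterward. Collecting into Result<Vec<_>> instead would short-circuit at the first error. A minimal sketch of the difference, with toy values rather than the script's types:

```rust
fn main() {
    let results: Vec<Result<i32, String>> =
        vec![Ok(1), Err("page 2 failed".to_string()), Ok(3)];

    // Keeping Vec<Result<_>> preserves every outcome, so failures can be
    // inspected individually, as the script does.
    let failures: Vec<&String> = results.iter().filter_map(|r| r.as_ref().err()).collect();
    println!("{} failure(s)", failures.len());

    // Collecting into Result<Vec<_>> short-circuits on the first Err.
    let combined: Result<Vec<i32>, String> = results.into_iter().collect();
    assert!(combined.is_err());
}
```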

After this extremely minor change, time cargo run took less than 8 seconds. With a release build, that went down to 3.7 seconds. Most projects I’ve worked on have significantly longer unit test runs! Obviously, you should take the exact figures with a large grain of salt. I only measured each run once, and the original serial run was done without the release flag, making the already approximate 100x speed bump unfair, as we are not measuring the same thing… yet 100x just sounds so much better than 40x. Still, let’s be skeptical and say the real speedup was closer to 10 or 20x. That is still absolutely amazing for a one-line code change.

For a quick comparison, I wrote the first part of the script (where we get information from a file, build URLs, and retrieve the HTML) in Javascript using Promise.all. On my machine, that took 23 seconds. Impressive. But then again, IO is exactly what Javascript is best at. So it’s safe to say that duration would have — at the very least — doubled had I added the parsing, analyzing, and outputting.

“But which of the two versions required more code? Which one took longer to write?” were the valid questions another programmer asked me afterward. Hard to say, as I never wrote the entire implementation in Javascript. Still, I don’t think it will surprise anyone that the simplified Javascript version was short, some 15 lines of code. Few would call it verbose. Rust, meanwhile, required 30 lines to do the same work. Sounds like a win for Javascript. Although… there was code dedicated to handling errors in my Rust code, which I skipped entirely in the other version. And when retrieving all the pages in a single Promise.all, the code would crash with connection timeouts, forcing me to write 10+ lines of code to batch the calls. Only when visiting fewer than 100 pages could I get away with a single, simple all.

Comparing the speed of development is even harder. The Javascript version was very easy to write, though I did make two mistakes that caused the program to crash at runtime. The Rust code, meanwhile, ‘just worked’: doing calls with reqwest took little effort thanks to prior experience with the crate, and despite being new to scraper, getting it running proved easy. What I spent most of my time on was figuring out which selectors would work best for extracting the information I needed from the HTML. And I would have done that regardless of my language preferences!

To conclude: rayon allows for very easy parallelization of Rust tasks, like doing REST calls, parsing HTML, and writing the results to a CSV file. On my machine, it outperformed basic Javascript at IO. Comparing the Rust and Javascript implementations for retrieving HTML, the former was twice as long, though this was partially due to error handling and ignores batching. Finally, writing the script in Rust and not Javascript did not noticeably slow me down, mostly because I spent the bulk of my time thinking about (good) selectors.

