Published in The Startup

Digging Out the News 📰— With Rust

I’ve been looking at web scraping for a side project of mine and thought I’d have a look at what Rust has to offer. I found the select crate, which seemed to have all the right stuff.

Now I just needed some website for target practice 🎯.
I was reading an article on the Associated Press’s website at the moment so…
The news it is!

With tools and target set, let’s try digging out some information from the data soup called HTML 🍲.

Photo by Luís Feliciano on Unsplash

The Select crate 📦

So what do we have to work with?
A quick peek at the top-level modules reveals some main concepts.

- document
- node
- predicate
- selection

From the naming alone, I think we’re confident enough to get this show on the road.

The Node ⚫

A Node according to the docs can be an element, a comment or text.

So a Node can for example be:
- <div>|<a>|<h1>
- <!-- This is a comment -->
- This is just some text
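
To make the three kinds concrete, here is a tiny toy model. It is only a sketch: `ToyNode` and `describe` are names I made up for illustration, and the crate's own types look different.

```rust
// Toy model of the three node kinds; `ToyNode` and `describe` are
// invented for this sketch and are not the crate's own types.
pub enum ToyNode {
    Element(&'static str),
    Comment(&'static str),
    Text(&'static str),
}

/// Renders a short, human-readable label for each kind of node.
pub fn describe(node: &ToyNode) -> String {
    match node {
        ToyNode::Element(name) => format!("element <{}>", name),
        ToyNode::Comment(body) => format!("comment <!-- {} -->", body),
        ToyNode::Text(text) => format!("text \"{}\"", text),
    }
}
```

Calling `describe(&ToyNode::Element("div"))` yields the label for a `<div>` element, and the other two variants cover comments and plain text.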

The Predicate❓

A Predicate is a statement used to evaluate whether something is true or false. Predicate is a trait, implemented by a handful of structs that can be found under the predicate::* module.

Here is the complete list of predicates from the docs:

- And
- Any
- Attr
- Child
- Class
- Comment
- Descendant
- Element
- Name
- Not
- Or
- Text
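
If the trait idea feels abstract, a toy version might help. This is a sketch under my own assumptions, not the crate's code: `ToyNode`, `ToyPredicate`, `ToyName` and `ToyAttr` are invented names, and only the "does this node match?" shape follows the description above.

```rust
// Toy model only: these types are made up for illustration and are
// not the ones defined in the select crate.
#[derive(Debug)]
pub struct ToyNode {
    pub name: &'static str,
    pub attrs: Vec<(&'static str, &'static str)>,
}

/// A predicate answers one question about a node: does it match?
pub trait ToyPredicate {
    fn matches(&self, node: &ToyNode) -> bool;
}

/// Matches nodes by element name, e.g. "div".
pub struct ToyName(pub &'static str);

/// Matches nodes that carry a given attribute key/value pair.
pub struct ToyAttr(pub &'static str, pub &'static str);

impl ToyPredicate for ToyName {
    fn matches(&self, node: &ToyNode) -> bool {
        node.name == self.0
    }
}

impl ToyPredicate for ToyAttr {
    fn matches(&self, node: &ToyNode) -> bool {
        node.attrs.iter().any(|(k, v)| *k == self.0 && *v == self.1)
    }
}
```

Each struct carries the value to compare against, and the trait gives them all the same question to answer.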

The Document 🗒️

Our entry point will be to create a Document representing the page we want to work with. So let’s create a Document, but from what?

I thought we could use the front page of AP’s website. We can probably find something of interest there.

We can download and load the website into our code like this 💾.

curl <URL of AP's front page> > src/index.html

let webpage_content = include_str!("./index.html");

With that out of the way, let’s create the Document.

use select::document::Document;

let document = Document::from(webpage_content);

Working with the Document 🛠️

With the document created, we need to figure out what to do with it.
The methods find and nth seem to be a good start.

Let’s look at the nth method first.

The .nth method 🧮

nth(&self, usize) -> Option<Node>

According to the docs, it returns the node in the n:th place, counting from zero. So in most cases document.nth(0) will return a Some(Node) containing the data for the <html> tag.

Calling nth with the values 0, 1 and 2 would conceptually result in something like this:

.nth(0) -> Some(Node(<html>...</html>))
.nth(1) -> Some(Node(<head>...</head>))
.nth(2) -> Some(Node(<meta>...</meta>))
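
The zero-based lookup and the Option return value can be sketched with a toy document. I'm assuming here that nodes are kept in the order the parser meets them, which is what the conceptual output above suggests; `ToyDocument` is an invented type, not the crate's.

```rust
// Toy model: a document as a flat list of tag names in the order the
// parser met them. `ToyDocument` is invented for this sketch.
pub struct ToyDocument {
    pub nodes: Vec<&'static str>,
}

impl ToyDocument {
    /// Returns the node at position `n` (zero-based), or `None` when
    /// `n` is past the end of the document.
    pub fn nth(&self, n: usize) -> Option<&'static str> {
        self.nodes.get(n).copied()
    }
}
```

Asking for an index past the end gives `None` instead of a panic, which is why the return type is an Option.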

Maybe not super exciting, but we’ve got a Node at least.

Before we go on and play with the Node type we should take a look at the find method on our document.

The .find method 🔍

find<P: Predicate>(&self, predicate:P) -> Find<P>

This method accepts a Predicate and evaluates it against every node in our Document. Every node that tests true against our predicate is included in the result, which the method returns as a Find. But what is a Find?

Having a peek at the docs gives you a hint that it’s more or less an Iterator.
The actual source code for filtering based on the predicate is close to this:

if self.predicate.matches(&node) {
    return Some(node);
}
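
To see how a check like that turns into an Iterator, here is a toy filtering iterator. It is a sketch and not the crate's Find type: `ToyFind` is an invented name, nodes are plain tag names, and the predicate is an ordinary closure.

```rust
// Toy model of a filtering iterator; `ToyFind` is invented for this
// sketch and is not the crate's `Find` type.
pub struct ToyFind<'a, P: Fn(&str) -> bool> {
    nodes: std::slice::Iter<'a, &'a str>,
    predicate: P,
}

impl<'a, P: Fn(&str) -> bool> ToyFind<'a, P> {
    pub fn new(nodes: &'a [&'a str], predicate: P) -> Self {
        ToyFind { nodes: nodes.iter(), predicate }
    }
}

impl<'a, P: Fn(&str) -> bool> Iterator for ToyFind<'a, P> {
    type Item = &'a str;

    fn next(&mut self) -> Option<&'a str> {
        // Walk the remaining nodes and stop at the first match.
        for node in self.nodes.by_ref() {
            if (self.predicate)(*node) {
                return Some(*node);
            }
        }
        // Nothing left that matches: the iterator is exhausted.
        None
    }
}
```

Because it implements Iterator, the usual adapters such as `.map` and `.collect` work on it, which is exactly what we rely on in the rest of the article.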

Chop chop 🪓

Armed and ready, we prepare to chop our way through the markup jungle.

What do we want to find?
Well, we could try to extract the text from AP’s headline story. It’s usually the first headline on the front page featured with a big picture 🖼️.

Let’s put on our exploratory programming cap and start out a bit naive.
A really simple but crude statement could be to claim that our data of interest exists in a <div> element.

The Name Predicate 📛

This predicate is used for matching nodes in our document with a specific element name. This could be <h1>, <div>, <body> and so on.

Let’s match all <div> elements in our document.

We’ll use the .text method on the Node type in order to extract the human-intended String from each node in our Find iterator.

use select::predicate::Name;

let head_lines = document
    .find(Name("div"))
    .map(|node| node.text())
    .collect::<Vec<String>>();

We check the results and, without much hope, realize that using <div> alone as the matching criterion is not specific enough. Let’s iterate! ♻️

The Attr Predicate 🎀

A second look at the <div> I had in mind reveals an attribute:

<div data-key="main-story">

This might help us be more precise in finding what we want. We can make use of the Attr predicate like this:

use select::predicate::Attr;

Attr("data-key", "main-story");

The predicate above doesn’t combine our requirements though. We want the element to be a <div> and have the right attribute key/value.

The And Predicate ➕

In order to combine the predicates we can use another predicate 🤯.
The And predicate.

use select::predicate::{And, Name, Attr};

And(Name("div"), Attr("data-key", "main-story"))

There is also a convenience method on the Predicate type:

and<T: Predicate>(self, other: T) -> And<Self, T>

With this we can chain our predicates together, which can be nice for syntax aesthetics.

use select::predicate::{Name, Attr};
use select::predicate::Predicate;
Name("div").and(Attr("data-key", "main-story"));
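
How can a method on a trait hand back another predicate? A toy version shows the trick. This is a sketch with invented names (`ToyPredicate`, `ToyTag`, `ToyName`, `ToyDataKey`, `ToyAnd`); it imitates the shape of the API described above and is not the crate's code.

```rust
// Toy model of predicate chaining; all names are invented for this
// sketch and only imitate the shape of the API described above.
pub struct ToyTag {
    pub name: &'static str,
    pub data_key: Option<&'static str>,
}

pub trait ToyPredicate: Sized {
    fn matches(&self, tag: &ToyTag) -> bool;

    /// Combine `self` with `other`; the result matches only when both do.
    fn and<T: ToyPredicate>(self, other: T) -> ToyAnd<Self, T> {
        ToyAnd(self, other)
    }
}

pub struct ToyName(pub &'static str);
pub struct ToyDataKey(pub &'static str);
pub struct ToyAnd<A, B>(pub A, pub B);

impl ToyPredicate for ToyName {
    fn matches(&self, tag: &ToyTag) -> bool {
        tag.name == self.0
    }
}

impl ToyPredicate for ToyDataKey {
    fn matches(&self, tag: &ToyTag) -> bool {
        tag.data_key == Some(self.0)
    }
}

// The combinator is itself a predicate, so chains can keep growing.
impl<A: ToyPredicate, B: ToyPredicate> ToyPredicate for ToyAnd<A, B> {
    fn matches(&self, tag: &ToyTag) -> bool {
        self.0.matches(tag) && self.1.matches(tag)
    }
}
```

With this in place, `ToyName("div").and(ToyDataKey("main-story"))` builds a single value that only matches a tag when both the name and the data key agree.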

We need to go deeper 🤿

So far we have this, which is much better than before.

use select::predicate::{Attr, Name, Predicate};

let head_lines = document
    .find(Name("div").and(Attr("data-key", "main-story")))
    .map(|node| node.text())
    .collect::<Vec<String>>();

We’re getting headlines for the news articles on the front page, but the result is still too information-heavy. We are also capturing the author and the published date of each article. We just want the headline.

A closer inspection shows that we need to dig into the resulting div.
Every article headline is surrounded by an <a> element.

The Child Predicate 👶

There’s a child method on the Predicate type and at first I thought that would be the solution. But after a few failing attempts, I realized that the <a> element I’m trying to find is wrapped in another <div>.

child<T: Predicate>(self, other: T) -> Child<Self, T>

Name("div")
    .and(Attr("data-key", "main-story"))
    .child(Name("a")) // <-- Will not work

<div data-key="main-story">   <-- The div we found earlier
    <div>                     <-- The wrapping div
        <a> ... </a>          <-- What we want

The Descendant Predicate 👪

Hope is not lost! Predicate has another method called descendant.
This will not only search through the child nodes, but the children’s children and so on.

descendant<T: Predicate>(self, other: T) -> Descendant<Self, T>

use select::predicate::{Attr, Name, Predicate};

let head_lines = document
    .find(Name("div")
        .and(Attr("data-key", "main-story"))
        .descendant(Name("a")))
    .map(|node| node.text())
    .collect::<Vec<String>>();
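
The difference between child and descendant is easiest to see on a tiny tree. The sketch below is a toy model with invented names (`ToyNode`, `is_child_of`, `is_descendant_of`); each node only remembers its tag name and the index of its parent.

```rust
// Toy model: each node stores its tag name and the index of its parent.
// These types are invented for this sketch, not taken from the crate.
pub struct ToyNode {
    pub name: &'static str,
    pub parent: Option<usize>,
}

/// True when the direct parent of `node` is named `ancestor`.
pub fn is_child_of(nodes: &[ToyNode], node: usize, ancestor: &str) -> bool {
    nodes[node].parent.map_or(false, |p| nodes[p].name == ancestor)
}

/// True when any node on the path from `node` up to the root is
/// named `ancestor`.
pub fn is_descendant_of(nodes: &[ToyNode], node: usize, ancestor: &str) -> bool {
    let mut current = nodes[node].parent;
    while let Some(p) = current {
        if nodes[p].name == ancestor {
            return true;
        }
        // Keep climbing: one level up per iteration.
        current = nodes[p].parent;
    }
    false
}
```

With a three-level tree (outer element, wrapper, `<a>`), the child check from the `<a>` to the outer element fails because the wrapper sits in between, while the descendant check succeeds.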

Close but no cigar 🚭

Now if we run this we can see that we are really close, but there is still some noise. The code is matching some unwanted <a> tags.

Looking closer at the <a> tag, we see that it has data-key="card-headline" as an attribute. This might be of use. Using our friend And again gives us this code.

use select::predicate::{Attr, Name, Predicate};

let head_lines = document
    .find(Name("div")
        .and(Attr("data-key", "main-story"))
        .descendant(Name("a").and(Attr("data-key", "card-headline"))))
    .map(|node| node.text())
    .collect::<Vec<String>>();

Getting fancy 💅

It’s good, but we can improve. By adding an extra step we can become even more explicit. Inside the <a> tag there lives an <h1> holding the actual headline. We could use Name to solve this, but let’s try something new.

Selection 🪜

The select library has a fourth concept which we haven’t tried out yet, Selection.

A Selection seems to be focused around concepts that allow you to walk through a Document of Nodes, with methods like:

- parent
- children
- prev
- next

There is a nice method on the Find type called into_selection, which is perfect since we have a Find and want a Selection.

The .children method 👶👶

children(&self) -> Selection<'a>

The method children could solve our problem of finding the <h1> tag.
In order for us to .map over the result from Selection we need to turn it into an Iterator again with .iter.

With this knowledge we can do this:

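use select::predicate::{Attr, Name, Predicate};

let head_lines = document
    .find(Name("div")
        .and(Attr("data-key", "main-story"))
        .descendant(Name("a").and(Attr("data-key", "card-headline"))))
    .into_selection()
    .children()
    .iter()
    .map(|node| node.text())
    .collect::<Vec<String>>();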

The .first method 🥇

At the beginning of this I said that I was only interested in the first main story, so let’s try to fix that as well.

Right now we have some duplicates and other articles being caught in our scrape. Looking closer at the Selection type shows that there is a method called first. That might fix it for us.

first(&self) -> Option<Node<'a>>

Let’s replace the .iter() call with .first and the .collect call with .unwrap_or_default().

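use select::predicate::{Attr, Name, Predicate};

let head_line = document
    .find(Name("div")
        .and(Attr("data-key", "main-story"))
        .descendant(Name("a").and(Attr("data-key", "card-headline"))))
    .into_selection()
    .children()
    .first()
    .map(|node| node.text())
    .unwrap_or_default();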
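
The first / map / unwrap_or_default combination is plain Option handling, so it can be tried out without any HTML at all. Here is a small sketch on a slice of strings; `first_headline` is a helper name I invented for illustration.

```rust
/// Returns the text of the first candidate, or an empty string when
/// nothing matched. `first_headline` is invented for this sketch.
pub fn first_headline(candidates: &[&str]) -> String {
    candidates
        .first()                      // Option<&&str>
        .map(|text| text.to_string()) // Option<String>
        .unwrap_or_default()          // String, empty when there was no match
}
```

The nice part is that an empty result does not panic: with no candidates we simply get an empty String back.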

Cleaning up space junk 🛰️

Perfect, one single main story headline served!

The resulting text will probably have some new-line characters and white-space junk though, so let’s trim it down like this:

... .map(|node| node.text())
    .map(|t| {
        t.lines() // assumed: split the text on line breaks
            .map(|s| s.trim())
            .collect::<Vec<&str>>()
            .join(" ")
    })
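
The clean-up step works on any string, so here it is as a small standalone function. It is a sketch of the same idea with one addition of my own: blank lines are filtered out so they don't leave double spaces behind. `tidy_headline` is an invented name.

```rust
/// Trims every line and joins the non-empty ones with single spaces.
/// `tidy_headline` is a name invented for this sketch.
pub fn tidy_headline(raw: &str) -> String {
    raw.lines()
        .map(|line| line.trim())
        .filter(|line| !line.is_empty()) // drop lines that were only white space
        .collect::<Vec<&str>>()
        .join(" ")
}
```

Collecting into a Vec before joining is needed because `join` is a method on slices, not on iterators.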

Wrapping up 🎁

There we have it, a single white-space junk-free main story headline from AP.
Now that I, and maybe you, know a little bit more about the select library, we can go back and make this code even better.

Link to a complete example.

I’ll leave it at that for now.

/ Robert




robert barlin
