Scraping the web: experiments with functional programming in javascript — part II

Ricardo Araújo
Published in Frontend Weekly
18 min read · Dec 28, 2017

In Part I we went through the fundamental constructs of functional programming, at least the ones that we’ll now be using to build our scraper. Now, we can start building it!

What we want to build

So, our first goal is to build a scraper to obtain movie information from IMDB, stuff like title, summary, director’s name, etc. After that, we’ll want to generalize our scraper and turn it into a library so we can use it to scrape any web page.

OK, some web pages.

Yeah, it will probably be useless.

Libraries we’ll use

If you read part I, you probably already guessed that one of the libraries we’ll use is Ramda. It’s a utility library that’s specifically designed for functional programming. Other excellent alternatives would be Lodash/fp or Sanctuary (and, no doubt, many others I don’t even know about).

As cool as it is, Ramda does not provide us the basic types that we talked about previously: Maybe, Either and Future (note however that we’ll be using map and chain from Ramda). We’ll use Sanctuary for Maybe and Either, and Fluture for Future.

For fetching the web page we’ll just use Request. Do note that this means our scraper will not work for pages with dynamic content; specifically, it won’t work for single page apps (SPAs). SPAs will typically load a static HTML page and then let Javascript kick in and load the bulk of the page dynamically. But for that to work you need a web browser (or something that can mimic one, like PhantomJS). With Request though, we will only do a single request to the server, meaning we won’t get past the static HTML page. Bummer.

So, Request will help us get the HTML for a specific URL. For extracting information from elements of that HTML we will use Cheerio, which lets us use jQuery selectors to get the values we need.

In summary, these will be our external imports:

import request from 'request'
import cheerio from 'cheerio'
import R from 'ramda'
import S from 'sanctuary'
import Future from 'fluture'

Most of the time, I will not be describing the functions that I use from these libraries. Hopefully, their names will be more or less self-explanatory; if that’s not the case, you are hereby encouraged to take a detour and check the online documentation for them.

The main scraper function — part I

Our main function should receive an URL and give back an object that describes a movie. Let’s start by defining the type signature of this function:

scrapeUrl :: String -> Object

That seems overly generic, not very descriptive, and honestly not all that helpful. For example, if we pass in a random string, are we really expecting to get back information about a movie? And what about the shape of our Object? Let’s do better:

type alias Url = String
type alias Movie = {
  title: String,
  summary: String,
  year: String,
  director: String
}
scrapeUrl :: Url -> Movie

(Mind you that these are just comments, it’s not valid Javascript, so we can basically write what we want here, just as long as it proves helpful to us, and to whoever has to read our code.)

Although this is much better, it still has some problems. For example, supposing that we pass in an Url, is that any guarantee that we’ll be able to get a movie description back? Of course not! Lots of things can go wrong, especially if you give it this.

So, if we give it an Url to the wrong page, we want our function to fail. It’s a good opportunity to use our friend Either! Our function will either give us back a Movie or an Error.

type alias Url, Movie...
// we could also use something like Boom, for example
type alias Err = { msg: String }
scrapeUrl :: Url -> Either Err Movie

OK, so immediately we now know that our scraper can fail synchronously. But what if our network connection is unavailable? Or the website is down? That’s an asynchronous error that we are not contemplating in our code. For that, we can use Future:

type alias Url, Movie, Future...
scrapeUrl :: Url -> Future Err (Either Err Movie)

Ouch! This is getting ugly fast.

So, it’s kind of cool to know that our function can fail due to several problems, but seems to me like there’s too much detail there.

Honestly, when using this function I only need to know if it failed or not. I don’t really care if it failed due to network, or invalid website, or whatever! Let’s ditch the Either and keep only the Future (more on this later):

type alias Url, Movie, Future...
scrapeUrl :: Url -> Future Err Movie

Adapting the external libraries

OK, so after all this work we got a function signature that takes an Url and gives back a Future. The implementation should be simple: (1) use Request to get the HTML and (2) use Cheerio to turn that HTML into a Movie object.

The problem is that Request and Cheerio don’t use the functional style that we’re imposing in our program, and as such, they won’t compose easily with our own functions. We’ll have to build adapters for the parts of those libraries that we want to use; a functional API of sorts.

In the case of Request, we just want to always get the HTML wrapped in a Future. Request is callback-based out of the box (promise wrappers such as request-promise also exist). Here I chose callbacks:

// type alias Html = String
// getHtml :: Url -> Future Err Html
export const getHtml = url => Future((rej, res) => {
  const opts = { method: "GET", uri: url };
  request(opts, (error, response, body) => {
    if (error) {
      rej(error);
    } else {
      res(body);
    }
  });
});

Now we never have to deal with callbacks again! For Cheerio, it’s going to take a little bit more work because we need it to perform several operations for us when scraping a web page. At a minimum we need to:

  1. from an Html string we want to build a Cheerio object (let’s name that type DOM)
  2. from DOM and a jQuery selector (let’s name that Selector) we want to get a list of elements
  3. from DOM and a Selector we want to get a single element (Just the first one, if it exists; Nothing otherwise)
  4. from a DOM and the name of an attribute, we want to get the value of that attribute
  5. from DOM we want to get its text content

That’s a lot of functions for us to adapt Cheerio for our functional programming style:

// type alias Html = String
// type alias DOM = Object // unknown structure, defined by Cheerio
// type alias Selector = String

// 1. loadDom :: Html -> DOM
// exported like the others, since it is used later as C.loadDom
export const loadDom = cheerio.load.bind(cheerio)

// 2. selectAll :: Selector -> DOM -> List DOM
// get all the elements that match the selector
export const selectAll = R.curry((sel, dom) => {
  const res = dom(sel)
  // we want all selected elements to also be valid Dom elements
  // that way, we'll be able to query them the same way as a full doc
  return R.map(cheerio, res.toArray())
})

// 3. selectFirst :: Selector -> DOM -> Maybe DOM
// get the first element that matches the selector
export const selectFirst = R.curry((sel, dom) =>
  R.compose(S.toMaybe, R.head, selectAll(sel))(dom)
)

// 4. attr :: String -> Dom -> Maybe String
export const attr = R.curry((attrName, dom) =>
  R.compose(R.map(R.trim), S.toMaybe, dom.attr.bind(dom))(attrName)
)

// 5. text :: Dom -> String
export const text = dom => R.trim(dom.text())

Note how we use selectAll in order to define selectFirst, and wrap the result in a Maybe simply because it may not exist. Note that S.toMaybe wraps a value in a Just, or returns Nothing if we give it null or undefined.
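
To see the same idea without Cheerio, here is a minimal sketch that uses plain arrays instead of DOM elements. Just, Nothing and toMaybe are hand-rolled stand-ins (not Sanctuary’s implementations), and selectAllWhere / selectFirstWhere are hypothetical names that only mimic the shape of selectAll and selectFirst.

```javascript
// Hand-rolled stand-ins for illustration only; Sanctuary's real Maybe is richer.
const Just = value => ({ isJust: true, value })
const Nothing = { isJust: false }

// toMaybe :: a? -> Maybe a
// null and undefined become Nothing, anything else is wrapped in a Just
const toMaybe = x => (x === null || x === undefined ? Nothing : Just(x))

// Hypothetical array-based analogue of selectAll: keep the items that match
const selectAllWhere = predicate => items => items.filter(predicate)

// Analogue of selectFirst: reuse selectAllWhere, take the head, wrap it in a Maybe
const selectFirstWhere = predicate => items =>
  toMaybe(selectAllWhere(predicate)(items)[0])

const isEven = n => n % 2 === 0
const found = selectFirstWhere(isEven)([1, 4, 6]) // a Just holding 4
const missing = selectFirstWhere(isEven)([1, 3])  // Nothing
```

The head of an empty array is undefined, which is exactly the case toMaybe turns into a Nothing.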

Scraping a single movie

With our new Cheerio functional API, we are now ready to turn a DOM representation into a Movie. Similarly to what we did with scrapeUrl, let’s start with the function signature:

// decodeMovie :: DOM -> Either Err Movie

That looks neat. Parsing a movie from any old DOM element can obviously fail, so this should return an Either.

Side note: notice a recurring pattern here? Our functions can return values wrapped inside some functor like Either, Future or Maybe whenever it makes sense. That’s not a problem because we already know how to compose them with other functions: we’ll use map and chain. That’s our main reason to use functors and monads: being able to compose our functions!

However, very rarely will our functions take these wrapped values as arguments; we want our functions to be as simple as possible. If I say decodeMovie takes a DOM element, it means it must be called with one no matter what. If that DOM element may not be there and it’s wrapped in a Maybe, then map will make sure this function will not get called. This allows our functions to be as simple as they can be, without needing to check if the Maybe is a Just or a Nothing, for example. That kind of plumbing is handled before the function gets called, and best of all: we don’t have to handle it ourselves.
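
A small sketch of that plumbing, using a hand-rolled Maybe (an assumption for illustration, not Sanctuary’s implementation). The function shout is hypothetical; the point is that it only ever deals with a plain String.

```javascript
// Hand-rolled Maybe for illustration; not Sanctuary's implementation.
const Just = value => ({
  isJust: true,
  value,
  map: fn => Just(fn(value)) // apply fn to the wrapped value
})
const Nothing = {
  isJust: false,
  map: _fn => Nothing // fn is never called
}

let calls = 0
// A plain function: it takes a String, never a Maybe
const shout = s => {
  calls += 1
  return s.toUpperCase() + '!'
}

const loud = Just('hello').map(shout) // a Just holding 'HELLO!'
const silent = Nothing.map(shout)     // still Nothing, shout was not called
```

After both lines run, calls is 1: map decided whether shout should run, so shout itself needs no null checks.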

We already defined a Movie as an object with four fields: title, summary, year and director. Let’s assume that all these fields are required except for the year. This means that if any of the other fields isn’t found by our scraper, we want it to fail (meaning decodeMovie will return a Left), otherwise it will succeed. In particular, if the year isn’t found for some reason the function will succeed, but that field will be an empty string.

Let’s start by building a small function that uses our new cheerio API for getting the text of a required element. We should also specify the error message, in case that element can’t be found.

// required :: Err -> Selector -> DOM -> Either Err String
// C is the module that exports our Cheerio adapters (selectFirst, text, ...)
const required = R.curry((err, selector) => R.compose(
  S.maybeToEither(err),    // Either Err String
  R.map(C.text),           // Maybe String
  C.selectFirst(selector)  // Maybe Dom
))

Remember that R.compose reads from right-to-left, or in this case bottom-up.

Tip: if you don’t like this particular ordering, you can also use R.pipe which is exactly the same as R.compose but the functions are specified in the reverse — perhaps more natural? — order.

Also, I included some comments to make it clear what type we get at the end of each line. I will be doing this in most functions that define this type of pipeline from now on.

So, this function selects the first element that matches the given jQuery selector and extracts its text content. However, remember that the selector function from our cheerio API returns Maybes instead of Eithers, which makes sense on its own, but it’s not what we’re looking for here. Fortunately, our functional toolset (in this case, Sanctuary) gives us the means to easily convert a Maybe to an Either. We just need to provide what we want to put in the Left of the Either when our Maybe has Nothing (hopefully that’s not too confusing by now), which in this case is our error value.

Side note: Whenever we’re transforming one functor into another (in this case, from Maybe to Either) that is called a natural transformation (seriously though, don’t open that link). It’s possible to define natural transformations for a lot of functors, such as from Either to Future, Either to Maybe, etc.

It’s very interesting to compare map and chain with natural transformations: whereas both map and chain allow us to change the value that’s inside the functor/monad, but not change the functor itself, a natural transformation does just the opposite: the wrapped value stays the same, and the wrapper is changed.
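
The contrast can be sketched with hand-rolled stand-ins (assumptions for illustration; these are not Sanctuary’s types, and mapMaybe / mapEither are hypothetical helper names). Mapping changes the value and keeps the wrapper; maybeToEither changes the wrapper and keeps the value, so the two can be applied in either order.

```javascript
// Hand-rolled stand-ins for illustration; not Sanctuary's types.
const Just = value => ({ tag: 'Just', value })
const Nothing = { tag: 'Nothing' }
const Left = value => ({ tag: 'Left', value })
const Right = value => ({ tag: 'Right', value })

// map changes the wrapped value and keeps the wrapper
const mapMaybe = (fn, m) => (m.tag === 'Just' ? Just(fn(m.value)) : Nothing)
const mapEither = (fn, e) => (e.tag === 'Right' ? Right(fn(e.value)) : e)

// maybeToEither changes the wrapper and keeps the wrapped value
const maybeToEither = (err, m) => (m.tag === 'Just' ? Right(m.value) : Left(err))

const title = Just('  Some Title ')
const trim = s => s.trim()

// Transform-then-map and map-then-transform end up in the same place
const a = mapEither(trim, maybeToEither('no title', title))
const b = maybeToEither('no title', mapMaybe(trim, title))

const failed = maybeToEither('no title', Nothing) // a Left holding 'no title'
```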

Now for a function that does the same for an optional field:

// optional :: String -> Selector -> DOM -> String
const optional = R.curry((defaultValue, selector) => R.compose(
  S.fromMaybe(defaultValue), // String
  R.map(C.text),             // Maybe String
  C.selectFirst(selector)    // Maybe Dom
))

Notice there is no need to return an Either here: the function cannot fail, worst case scenario it should just return the defaultValue.

Now we can finally start building our movie decoder:

// decodeMovie :: DOM -> Either Err Movie
const decodeMovie = dom => {
  const obj = {
    title: required('what the hell is it named?',
      '.title_block .title_wrapper h1')(dom),
    summary: required(`Don't know what it's about!`,
      '.summary_text')(dom),
    year: optional('', '.title_block #titleYear a')(dom),
    director: required('Could not find director',
      '.plot_summary span[itemprop=director]')(dom),
  }
  // ...
}

This is now the shape of our object:

{
title: Either Err String,
summary: Either Err String,
year: String,
director: Either Err String,
}

Clearly not what we want. To solve this, we need to first make sure that all our fields are Eithers:

const decodeMovie = dom => {
  const obj = {
    title: ...,
    summary: ...,
    year: S.Right(optional('', '.title_block #titleYear a')(dom)),
    director: ...,
  }
  // ...
}

Easy part done. We know we always have a value for the year, so we can just wrap it in a Right. Now we need to somehow unwrap our individual fields and wrap our whole object in an Either. The idea being that if at least one of our fields is a Left, then the whole object will be wrapped in a Left. Else, it will be wrapped in a Right (success!).

Good news first: every time we have a value wrapped inside two special kinds of functors (for the record, traversable and applicative) we can flip the order in which they wrap that value (pretty obscure, I know, but tremendously useful in many cases, as we’ll soon see).

For this, we can use Ramda’s sequence function. Example straight from the documentation (here we want to “flip” an Array with a bunch of Maybes inside, to get a Maybe with an Array inside it):

R.sequence( Maybe.of, [Just(1), Just(2), Just(3)] );
//=> Just([1, 2, 3])
R.sequence(Maybe.of, [Just(1), Just(2), Nothing]);
//=> Nothing

The first argument is a function that teaches R.sequence how to build a Maybe. And R.sequence just knows how to traverse an Array! Note that if there’s a Nothing somewhere, then the whole thing will be a Nothing. The same thing happens with Either/Left, or with Future: one failed Future is all it takes for R.sequence to give us back another failed Future.
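
To make the “flip” less magical, here is a hand-rolled sketch for the specific case of an array of Eithers. Left, Right and sequenceEithers are stand-ins written for this example; this is not how R.sequence is implemented, it only reproduces the behaviour described above.

```javascript
// Hand-rolled Either for illustration; not Sanctuary's implementation.
const Left = value => ({ tag: 'Left', value })
const Right = value => ({ tag: 'Right', value })

// sequenceEithers :: Array (Either e a) -> Either e (Array a)
// Hypothetical helper: returns the first Left found, otherwise a Right
// holding every unwrapped value in order.
const sequenceEithers = eithers =>
  eithers.reduce(
    (acc, e) => {
      if (acc.tag === 'Left') return acc // already failed, keep the first error
      if (e.tag === 'Left') return e     // first failure becomes the result
      return Right([...acc.value, e.value])
    },
    Right([])
  )

const allGood = sequenceEithers([Right(1), Right(2), Right(3)])
const oneBad = sequenceEithers([Right(1), Left('boom'), Left('later')])
```

In this sketch allGood is a Right holding [1, 2, 3], while oneBad is the Left holding 'boom': a single failure is enough to turn the whole result into a failure.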

Now for the bad news: a plain object (i.e., our Movie) is not a valid traversable, and so R.sequence doesn’t know how to traverse its fields. Fortunately, it’s pretty easy to do it ourselves: we can turn the object’s values into an array, apply R.sequence to it, and zip the object back up:

// sequenceObject :: (* -> f *) -> Object (f *) -> f (Object *)
// This is basically R.sequence, but for objects
const sequenceObject = R.curry((appl, obj) => {
  // e.g. obj = {title: Maybe(1), summary: Maybe(2), year: Maybe(3)}
  const keys = R.keys(obj)
  // e.g. ['title', 'summary', 'year']
  const wrappedValues = R.values(obj)
  // e.g. [Maybe(1), Maybe(2), Maybe(3)]
  const unwrappedValues = R.sequence(appl, wrappedValues)
  // e.g. Maybe([1,2,3])
  return R.map(R.zipObj(keys))(unwrappedValues)
  // e.g. Maybe({ title: 1, summary: 2, year: 3 })
})

Notice that, if unwrappedValues becomes a Nothing, the last R.map that zips the object back up would be a no-op and return Nothing. We didn’t even have to check for failures. Sweet!

Now we can finish our decodeMovie function:

// decodeMovie :: DOM -> Either Err Movie
const decodeMovie = dom => {
  const obj = {
    title: required('what the hell is it named?',
      '.title_block .title_wrapper h1')(dom),
    summary: required(`Don't know what it's about!`,
      '.summary_text')(dom),
    year: S.Right(optional('', '.title_block #titleYear a')(dom)),
    director: required('Could not find the director',
      '.plot_summary span[itemprop=director]')(dom),
  }
  return sequenceObject(S.of(S.Either), obj)
}

The main scraper function — part II

Now, we’re finally ready to build the scrapeUrl function we defined earlier:

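// scrapeUrl :: Url -> Future Err Movie
// U is the utilities module: getHtml is the function defined earlier;
// eitherToFuture (Either Err a -> Future Err a) is not shown in this article
const scrapeUrl =
  R.compose(
    R.chain(U.eitherToFuture),
    R.map(decodeMovie),
    R.map(C.loadDom),
    U.getHtml
  )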

By now, this should look familiar. Our function takes an URL, fetches its HTML, loads the result into a Dom object, decodes the movie, and flattens the result so that we don’t get an Either inside a Future (notice that U.eitherToFuture returns a Future; if we used map instead of chain here we would have gotten a Future inside another Future). So it defines a pipeline with the following types:

Url => Future Err Html => Future Err Dom => Future Err (Either Err Movie) => Future Err Movie

To use this we just have to fork the Future:

scrapeUrl('http://www.imdb.com/title/tt0089695')
.fork(console.error, console.log)

Here, we’re just logging the results to the console.
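
The article never shows eitherToFuture, so here is a sketch of what it presumably does: a Left becomes a rejected Future and a Right becomes a resolved one. Everything below is an assumption made for illustration: MiniFuture is a tiny hand-rolled stand-in (Fluture’s real Future does far more), and decode is a hypothetical decoder playing the role of decodeMovie.

```javascript
// Tiny lazy Future stand-in for illustration; not Fluture's implementation.
const MiniFuture = run => ({
  fork: (onRejected, onResolved) => run(onRejected, onResolved),
  map: fn => MiniFuture((rej, res) => run(rej, x => res(fn(x)))),
  chain: fn => MiniFuture((rej, res) => run(rej, x => fn(x).fork(rej, res)))
})
const resolved = x => MiniFuture((rej, res) => res(x))
const rejected = e => MiniFuture((rej, res) => rej(e))

const Left = value => ({ tag: 'Left', value })
const Right = value => ({ tag: 'Right', value })

// eitherToFuture :: Either e a -> Future e a
// Assumed behaviour: Left becomes a rejection, Right becomes a resolution.
const eitherToFuture = e => (e.tag === 'Left' ? rejected(e.value) : resolved(e.value))

// Hypothetical decoder standing in for decodeMovie
const decode = html =>
  html.includes('<h1>') ? Right({ title: 'found' }) : Left('no title')

const outcomes = []
const record = label => value => outcomes.push([label, value])

// chain flattens Future (Either e a) into Future e a
resolved('<h1>Movie</h1>').map(decode).chain(eitherToFuture)
  .fork(record('rejected'), record('resolved'))
resolved('<p>nope</p>').map(decode).chain(eitherToFuture)
  .fork(record('rejected'), record('resolved'))
```

The first fork ends on the resolved branch with the decoded object; the second ends on the rejected branch with 'no title', so a decoding failure and a network failure reach the caller through the same channel. With the real libraries, a combination of Sanctuary’s S.either and Fluture’s rejection and resolution constructors would be a natural way to write this, but check the API of the versions you have installed.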

Generalising our scraper

Now we have a scraper that works for IMDB movie pages. What if we want to get all the movies from a specific actor? We need to get the list of movies from the actor page, which means we have to scrape that page. Let’s change our scrapeUrl function to be able to adapt to any page:

// scrapeUrl :: (Dom -> Either Err a) -> Url -> Future Err a
const scrapeUrl = R.curry((strategy, url) =>
  R.compose(
    R.chain(U.eitherToFuture),
    R.map(strategy),
    R.map(C.loadDom),
    U.getHtml
  )(url)
)

In the spirit of parameterizing all the things, we’ll pass a function to our scraper that knows how to turn some Dom into the data we need (or an error). Note how the type of this data must match the data inside the Future.

By the way, we’ve been using R.compose a lot to build our functions, but that’s not the only option we have to build our function pipelines. Using R.compose (or R.pipe) promotes a programming style known as point-free style, where we can define our functions without ever naming their arguments. However, many people tend to think this code is difficult to read and understand (like that’s of any importance…). It’s perfectly OK to write our functions explicitly instead:

const scrapeUrl = R.curry((strategy, url) => {
  const html = U.getHtml(url)
  const dom = R.map(C.loadDom, html)
  const res1 = R.map(strategy, dom)
  const res2 = R.chain(U.eitherToFuture, res1)
  return res2
})

Of course, having to name all the things sucks sometimes, as my last two constants show… Luckily, there’s an even better, and at least equally readable alternative, at least when we just need to map or chain stuff: our beloved types Maybe, Either and Future (as well as any other functor or monad) are also javascript objects that have these functions available as methods (hard-core functional programmer rolls eyes):

const scrapeUrl = R.curry((strategy, url) =>
  U.getHtml(url)
    .map(C.loadDom)
    .map(strategy)
    .chain(U.eitherToFuture))

This looks very clean and natural, at least in javascript. In some other languages, where every function is curried by default, we would use the pipe operator (usually |> or <|) to do something similar. For example, in Elm it could look something like this:

getHtml url
    |> Task.map loadDom
    |> Task.map strategy
    |> Task.andThen eitherToFuture
-- Task is equivalent to Future
-- andThen is equivalent to chain
-- Elm doesn't have type classes, so there isn't a single map interface,
-- i.e., there's a Task.map, List.map, Maybe.map, etc.

Anyway, just use whatever option works best for you. Indeed, you can even mix and match styles:

const scrapeUrl = R.curry((strategy, url) =>
  U.getHtml(url)
    .map(R.compose(strategy, C.loadDom))
    .chain(U.eitherToFuture))

By the way, we haven’t really covered this, but here we used one of the functor laws (creepy music plays in the background) that says that composing two mapped functions is the same as mapping one composed function. This is always true, and we can apply this law blindly, without even knowing what those functions do. Really cool stuff: imagine you have a large array and you want to apply two functions over it. You can just compose those two functions and map the new function on the array once. In some languages, the compiler can actually perform these optimizations on the code automatically!
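
Arrays are functors too, so the law is easy to check with plain JavaScript. The sketch below uses two made-up functions, double and increment, and a two-function compose written inline; it assumes both functions are pure.

```javascript
const double = n => n * 2
const increment = n => n + 1
// compose for two functions: apply g first, then f
const compose = (f, g) => x => f(g(x))

const numbers = [1, 2, 3]

// Two passes over the array
const twoMaps = numbers.map(double).map(increment)
// One pass with the composed function
const oneMap = numbers.map(compose(increment, double))
```

Both twoMaps and oneMap are [3, 5, 7]; the second version simply walks the array once instead of twice.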

Scraping a list of movie URLs

OK, back to reality. We can now scrape any page, so now we’re ready to get a list of URLs of movies for a single actor. That should be fairly straightforward:

// decodeActorMovieUrls :: Dom -> Either Err (List Url)
const decodeActorMovieUrls = dom => {
  const actorMoviesSelector =
    '#filmography #filmo-head-actor + .filmo-category-section .filmo-row b a'
  const arr = R.compose(
    R.sequence(S.of(S.Either)),                    // Either Err (List String)
    R.map(S.maybeToEither('Invalid Url')),         // List (Either Err String)
    R.map(R.map(x => 'http://www.imdb.com' + x)),  // List (Maybe String)
    R.map(C.attr('href')),                         // List (Maybe String)
    C.selectAll(actorMoviesSelector)               // List Dom
  )(dom)
  return arr
}

We use a CSS selector to pick all the links to movies in which this person has acted, and get the href attribute of each one. At this point we have a list of Maybes, so we know we’ll have to mapmap!

Then, since those URLs are relative, we have to turn them into absolute URLs by prefixing the server address, then we turn the Maybes into Eithers because we want to specify an error message in case something goes wrong.

Finally, we just use our friend sequence in order to “flip” the result from a List of Eithers into an Either of List.
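
Prefixing the server address works as long as every href is root-relative. A slightly more defensive variation, sketched under the assumption that the standard URL class is available as a global (as in modern Node.js and browsers), lets the platform resolve the address. The helper name toAbsoluteUrl is hypothetical and not part of the scraper above.

```javascript
// Resolve a possibly relative href against a base address.
// Returns null when the pair cannot be parsed, so that the caller can turn
// the result into a Maybe (for example with a toMaybe-style function).
const toAbsoluteUrl = (base, href) => {
  try {
    return new URL(href, base).href
  } catch (err) {
    return null
  }
}

const fromRelative = toAbsoluteUrl('http://www.imdb.com', '/title/tt0089695/')
const fromAbsolute = toAbsoluteUrl('http://www.imdb.com', 'http://example.com/x')
const invalid = toAbsoluteUrl('not a base', '/title/tt0089695/')
```

Here fromRelative is 'http://www.imdb.com/title/tt0089695/', fromAbsolute is left as 'http://example.com/x', and invalid is null because the base cannot be parsed.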

Putting it all together

Finally, we’re ready to scrape all the movies of one actor in one go:

const actorPage = 'http://www.imdb.com/name/nm0000241'

scrapeUrl(decodeActorMovieUrls, actorPage)
  // Future Err (List Url)
  .map(R.map(scrapeUrl(decodeMovie)))
  // Future Err (List (Future Err Movie))
  .chain(R.sequence(Future.of))
  // Future Err (List Movie)
  .fork(e => {
    // e :: Err
    console.error(e)
  }, x => {
    // x :: List Movie
    console.log(x)
  })

By the way, notice we opted to use method chaining here instead of R.compose. It really is just a matter of preference.

We first scrape the URLs from the page, then we scrape each one of those movie pages. Then all that’s left is to use R.sequence and chain to get the data in the format we want, and fork the Future. Since Futures are lazy, it’s only when we fork that the requests are performed on the network. Up until fork, we’ve only been dealing with pure functions. That is one of the mantras of functional programming: since you can’t live without side effects, you should at least try to push them to the edges of your program.

It’s interesting to note that we only forked once, and that single fork is responsible for setting everything else in motion. In particular, we only needed a single fork to perform the request that fetches the list of movie URLs and each subsequent request for the details of each movie.
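
Laziness is easy to demonstrate with a counter. The sketch below uses a tiny hand-rolled MiniFuture (an assumption for illustration, not Fluture’s implementation) and a hypothetical fakeGet that counts how many “requests” actually run.

```javascript
// Tiny lazy Future stand-in for illustration; not Fluture's implementation.
const MiniFuture = run => ({
  fork: (onRejected, onResolved) => run(onRejected, onResolved),
  map: fn => MiniFuture((rej, res) => run(rej, x => res(fn(x)))),
  chain: fn => MiniFuture((rej, res) => run(rej, x => fn(x).fork(rej, res)))
})

let requests = 0
// Hypothetical stand-in for getHtml: counts the "requests" that really happen
const fakeGet = url => MiniFuture((rej, res) => {
  requests += 1
  res('<html>' + url + '</html>')
})

// Building the pipeline performs no request at all
const pipeline = fakeGet('/actor')
  .map(html => html.length)
  .chain(len => fakeGet('/movie/' + len))

const before = requests // nothing has been forked yet

let result = null
pipeline.fork(console.error, html => { result = html })
const after = requests // the single fork triggered both requests
```

In this sketch before is 0 and after is 2: describing the work and running it are two separate steps, and only fork does the latter.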

Side note: By using sequence the way that we do here, if the scraper fails for even a single movie, then the whole thing will fail. We get nothing back. Which kinda sucks.

One way of solving this is to transform each individual Future into an Either, and then handle errors and successes accordingly. More information here.
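
A rough sketch of that idea, with hand-rolled stand-ins so it runs on its own: settle and collect are hypothetical helper names, MiniFuture is not Fluture, and the three scraped futures are made-up data. Each future is turned into one that always resolves, carrying either a Left (the error) or a Right (the movie).

```javascript
// Minimal stand-ins for illustration; not Fluture's or Sanctuary's implementations.
const MiniFuture = run => ({
  fork: (onRejected, onResolved) => run(onRejected, onResolved)
})
const resolved = x => MiniFuture((rej, res) => res(x))
const rejected = e => MiniFuture((rej, res) => rej(e))
const Left = value => ({ tag: 'Left', value })
const Right = value => ({ tag: 'Right', value })

// settle :: Future e a -> Future x (Either e a)
// A rejection becomes a resolved Left, a resolution becomes a resolved Right.
const settle = future =>
  MiniFuture((rej, res) => future.fork(e => res(Left(e)), x => res(Right(x))))

// collect :: Array (Future e a) -> Future e (Array a)
// Runs the futures one after another and gathers their results.
const collect = futures => MiniFuture((rej, res) => {
  const results = []
  const runFrom = i => {
    if (i === futures.length) return res(results)
    futures[i].fork(rej, x => {
      results.push(x)
      runFrom(i + 1)
    })
  }
  runFrom(0)
})

// Made-up data: the second "movie page" fails
const scraped = [
  resolved({ title: 'A' }),
  rejected('page not found'),
  resolved({ title: 'B' })
]

let settled = []
collect(scraped.map(settle)).fork(console.error, xs => { settled = xs })

const movies = settled.filter(e => e.tag === 'Right').map(e => e.value)
const errors = settled.filter(e => e.tag === 'Left').map(e => e.value)
```

The batch as a whole now succeeds: movies holds the two decoded objects and errors holds 'page not found', so the caller decides what to do with partial results.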

Performance

If we run this for an actor that stars in a lot of movies, we’ll see that the whole process is somewhat slow. That happens because of the way R.sequence works: it goes sequentially through the list of URLs and handles each one at a time! Surely, we can speed things up if only we can parallelize some of those requests.

Fortunately, Fluture comes to the rescue: Fluture.parallel works just like R.sequence, but takes an extra argument that dictates how many futures can be running at the same time. For example, if we specify 10, Fluture will start running 10 futures, and as soon as one finishes it will spin up another one. And so on, until they’re all finished. We basically get this parallel functionality for free, almost without changing our code. Pretty neat!

scrapeUrl(decodeActorMovieUrls, actorPage)
  .map(R.map(scrapeUrl(decodeMovie)))
  .chain(Future.parallel(10))
  .fork(console.error, console.log)

Very important: remember not to set this value too high because you don’t want to put too much stress on the server. Scrape responsibly!
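
For the curious, the “start a few, then start another one each time one finishes” behaviour can be sketched with plain callbacks. This is not how Fluture implements it: runLimited and makeTask are hypothetical names, and a task here is just a function that receives a done callback.

```javascript
// runLimited :: (Number, Array Task, Array a -> void) -> void
// A Task is a function that receives a `done` callback and calls it with its result.
const runLimited = (limit, tasks, onAllDone) => {
  if (tasks.length === 0) return onAllDone([])
  const results = new Array(tasks.length)
  let nextIndex = 0
  let finished = 0

  const startNext = () => {
    if (nextIndex >= tasks.length) return // nothing left to start
    const index = nextIndex
    nextIndex += 1
    tasks[index](value => {
      results[index] = value // keep results in the original order
      finished += 1
      if (finished === tasks.length) onAllDone(results)
      else startNext() // a slot freed up: start another task
    })
  }

  // Fill the initial slots
  for (let i = 0; i < Math.min(limit, tasks.length); i += 1) startNext()
}

// Demo with tasks that we complete by hand
const pending = [] // completion triggers of the tasks currently in flight
const makeTask = name => done => pending.push(() => done(name))
const tasks = ['a', 'b', 'c', 'd', 'e'].map(makeTask)

let output = null
runLimited(2, tasks, results => { output = results })
const inFlightAtStart = pending.length // only two tasks were started

// Finish the tasks one by one; each completion starts the next waiting task
while (pending.length > 0) pending.shift()()
```

With a limit of 2, inFlightAtStart is 2, and once every task has been completed output is ['a', 'b', 'c', 'd', 'e'].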

Conclusion

Developing software (as well as solving almost any other problem) has always been about decomposing some problem into smaller pieces, smaller problems that are easier to solve on their own, and then composing those solutions back together.

What I find nice about this functional programming style is that it almost forces us to write very small, very focused, and thus very reusable functions. Consequently, these functions are also very easy to test, (almost) without any mocking (we didn’t go into testing, but you can check the GitHub repo for some examples). These functions become the basic building blocks of many potentially different applications.

All in all, this was a really fun little project. It wasn’t all roses though, as there were some bumps along the way, such as

  • Javascript is not type checked. And although many people don’t like it, I think it really would help in this case. Even for such a simple problem like building our scraper, I must admit I found myself struggling from time to time to make sure the types align correctly when composing functions. A type checker would prove valuable here, but due to the dynamic nature of Javascript, and heavy use of currying in this programming style, I doubt compile-time type-checking (like TypeScript or Flow) will exist in the near future for this kind of thing. Note however that Sanctuary does provide run-time type checking in case you want to try it out.
  • Sometimes, especially when using point-free style, if an error occurs, the stack trace will be completely obfuscated. Worst case scenario, you will have no idea where the error has occurred! For more information on this problem and how to solve it, take a look at this article.

Link-throwing

Here are some useful links:

And of course, learn other languages. Let’s be honest. This hasn’t been the most idiomatic javascript ever written. The fact is that this stuff looks and works a lot better in other programming languages.

Just look at the amount of effort that’s been put into trying to turn javascript into a more FP-friendly language:

  • Immutable.js — immutable data structures
  • TypeScript and Flow — compile-time type checking
  • Ramda, Sanctuary, Folktale, Ramda-fantasy, …— functional style utility libraries, algebraic data types implementations, etc.
  • To some extent, even React and Redux are also trying to bring stuff from the FP community to the front-end
  • and many others, I’m sure

Don’t get me wrong: these libraries are great! The thing is, most purely functional languages give you this stuff and more for free. It’s worth checking them out, if nothing else for learning purposes.

  • Haskell: Learn You a Haskell for Great Good seems to have become the de facto standard in learning both Haskell and functional programming. Give it a try!
  • Elm: It’s becoming one of the programming languages of choice for the front-end. Also be sure to check out ClojureScript, PureScript or Reason.
  • F#: I’ve never coded a single line of F#, but F# for fun and profit has some really good content, and most of it can be easily translated into other functional languages. Specifically, stuff about domain-driven design and making it impossible for our software to even represent illegal states. The site is run by Scott Wlaschin, who also has some great introductory videos about functional programming on YouTube: 1 and 2.
  • If you’re on the JVM, you can check out Scala and Clojure in order to have access to the gigantic Java ecosystem.
