Scraping the web: experiments with functional programming in javascript — part I

Published in Frontend Weekly, 13 min read, Dec 28, 2017

Recently I needed to scrape some data for a personal project. Of course, there are a lot of options that I could use for this, from helper libraries for various programming languages to full-blown applications designed specifically for this purpose.

However, this seemed like a good opportunity to learn something new, so I decided to implement the scraper in javascript using functional programming techniques described in the excellent Professor Frisby’s Mostly Adequate Guide to Functional Programming. This book is written by Brian Lonsdorf and is available online for free. Check it out! You should also see his free egghead.io video course that moves at a faster pace. Great stuff.

Since this will be a long post, I will be dividing it into two parts. Part I will describe (well, mostly enumerate) some fundamental aspects of functional programming, all the way from composing pure functions to describing some useful functors and monads. Then, in Part II, we’ll be using that good stuff to incrementally build our scraper.

Some fundamentals

Before we start coding there are some concepts regarding functional programming that we should grasp, or else we would be moving way too slow. I will be quickly describing these concepts in this section.

Pure functions

These are functions that:

given the same set of inputs, always produce the same output. This means they cannot depend on any outside state (i.e., anything that is not among their inputs). As a consequence, a pure function without any inputs could also be called a constant!
cannot produce any side effects, such as changing some outside state (outside of the function, I mean), writing to a database, or even logging to the console.
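To make the distinction concrete, here is a small sketch. The names (nextId, makeId, counter) are made up for illustration: the first function reads and mutates outside state, while the second depends only on its inputs.

```javascript
// Impure: reads and mutates state that lives outside the function.
let counter = 0
const nextId = prefix => {
  counter += 1                      // side effect: changes outside state
  return `${prefix}-${counter}`     // result depends on how often we called it
}

// Pure: the output depends only on the inputs, and nothing else is touched.
const makeId = (prefix, n) => `${prefix}-${n}`

nextId('po')      // 'po-1'
nextId('po')      // 'po-2' (same input, different output)
makeId('po', 1)   // 'po-1'
makeId('po', 1)   // 'po-1' (always)
```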

Pure functions have some advantages over impure ones, such as:

easier to test, because we don’t need to mock anything. Provide the inputs, test for expected outputs.
no side-effects means that you can always safely call the function without thinking for a second if it will destroy the world (or worse, your app). It’s a liberating feeling: no matter how many times I call this function, the result will always be the same, and it won’t destroy or change anything. As a matter of fact, I can safely memoize this function and never look back!
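As a rough illustration of the memoization point, here is a minimal sketch of a one-argument memoize helper. The helper and the calls counter are hypothetical and only exist for this demo (the counter is instrumentation so we can watch how often the function body runs; without it, square is pure).

```javascript
// A hypothetical one-argument memoize helper (not a library API).
const memoize = fn => {
  const cache = new Map()
  return x => {
    if (!cache.has(x)) cache.set(x, fn(x))   // compute only on the first call
    return cache.get(x)
  }
}

let calls = 0   // demo instrumentation only
const square = x => { calls += 1; return x * x }
const fastSquare = memoize(square)

fastSquare(9)   // 81, computed
fastSquare(9)   // 81, served from the cache
// calls is 1 at this point
```

This only works because the result depends on nothing but the input; memoizing an impure function would freeze whatever it happened to return the first time.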

We’ll be writing most of our functions as pure functions, and try to delay the use of impure functions as much as possible. Note that we always want our program to produce some side-effects (such as printing some output, producing a file, or whatever…), so it would be impossible to use only pure functions.

Composing

Composing is basically taking two functions and creating a new one that executes the first two in order. For example, given:

const add1 = x => x+1
const mult3 = x => x*3

We can build a new function that does both:

const add1mult3 = x => mult3(add1(x))
add1mult3(5)  // 18

We’ll be using Ramda’s compose function for this (we could also easily make our own. That’s actually a cool exercise!):

const add1mult3 = R.compose(mult3, add1)

Note that order matters! R.compose reads from right to left, so add1 will be executed first, and mult3 will be executed with the result from add1.
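If you want to try the exercise of making your own, one possible sketch is below. It is a simplified stand-in, not Ramda's implementation: every function here is assumed to take exactly one argument.

```javascript
// A hypothetical stand-in for R.compose: applies the functions right to left.
const compose = (...fns) => x => fns.reduceRight((acc, fn) => fn(acc), x)

const add1 = x => x + 1
const mult3 = x => x * 3

const add1mult3 = compose(mult3, add1)   // add1 runs first, then mult3
const mult3add1 = compose(add1, mult3)   // mult3 runs first, then add1

add1mult3(5)   // 18
mult3add1(5)   // 16
```

Swapping the arguments gives a different function, which is the "order matters" point in practice.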

One (cool?) property of this version is we didn’t even have to come up with a name for the parameter (x in the case of the “manual” version). We just make sure the types of both functions match (i.e., the return type of the first is what the second function is expecting) and compose away.

Composing functions (either this way or the “manual” way) is the way to manage the complexity of software. We decompose big and complex problems into smaller ones and build small functions in order to solve those smaller problems. We then compose those functions (i.e., call them in the correct order) to solve the big problems.

Two additional remarks about R.compose:

we’re not limited to composing two functions at a time. Guess what this does:

const add5mult3 = R.compose(mult3, add1, add1, add1, add1, add1)

one very important constraint of function composition done this way is that each function must receive one parameter, no more, no less. WHAT?! That seems rather limiting. Do you really expect me to code functions such as add1, add2, add3? Of course not. To solve that, we need:

Currying

Currying a function creates a new function whose parameters needn’t be supplied all at once. If we supply all of them, then the function is immediately executed, as usual. However, if we supply fewer parameters than needed by the function, then a new function is returned, one that is expecting the missing parameters. For example, consider a function that takes three parameters:

const volume = (a, b, c) => a * b * c
// we'll curry the function using Ramda.curry
const curriedVolume = R.curry(volume)
curriedVolume(1,2,3)    // 6
curriedVolume(1,2)(3)   // 6
curriedVolume(1)(2)(3)  // 6
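To see that there is no magic involved, here is a minimal sketch of a curry helper. It is an illustration rather than Ramda's actual implementation, and it assumes the wrapped function declares all its parameters plainly (it relies on fn.length, which does not count rest or default parameters).

```javascript
// A hypothetical curry helper: keeps collecting arguments until there are enough.
const curry = fn => {
  const collect = (...args) =>
    args.length >= fn.length
      ? fn(...args)                               // enough arguments: call the original
      : (...more) => collect(...args, ...more)    // otherwise wait for the rest
  return collect
}

const volume = (a, b, c) => a * b * c
const curriedVolume = curry(volume)

curriedVolume(1, 2, 3)    // 6
curriedVolume(2, 3)(4)    // 24
curriedVolume(2)(3)(4)    // 24
```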

The first time I read about currying I thought “So what? Why da hell would I want to do that?!”. At first sight it seemed rather useless. However, currying is extremely important in order to be able to compose functions with more than one parameter. Let’s look at our previous example with curry in our toolbox:

const add = R.curry( (x,y) => x+y )
const mult = R.curry( (x,y) => x*y )
const add1mult3 = R.compose(mult(3), add(1))
add1mult3(5)  // 18

Much nicer! Now, our functions are much more general, and so our chances of reusing them (by composing them with other functions) are far greater.

One extremely important thing to pay attention to with respect to currying is the order of the parameters when we are defining our curried functions. The main data parameter, if you will, that we want to pass to our function should always be the last parameter, because that will be the thing that gets passed from the previous function in a composition.

As an example, imagine we’re defining a function that returns the value of a field on a given object. This function has two parameters:

the name of the field that we want to fetch from the object. This is not the data that we’ll pass to the function, but more of a configuration parameter: for example, if we want to, we can create another function that is set up to fetch the name field of any object we throw at it
the object that holds the field. Most of the time, this will be the data that we’ll pass to the function; it’s the variable part, not a configuration parameter.

With that in mind, we should define this function like this:

const getProp = R.curry( (field, obj) => obj[field] )

And we can use it in a composition that reverses the name of some guy:

const guy = { name: 'Ritchie' }
const getName = getProp('name')
const reverseName = R.compose(R.reverse, getName)
reverseName(guy)  // eihctiR

If we had chosen to place obj as the first parameter, we’d be screwed right now, because we couldn’t have configured our function to fetch the name field of any object using currying.
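The difference is easy to see side by side. In this sketch, compose2, reverse and getPropDataFirst are hypothetical helpers written by hand (manual currying instead of Ramda) so the snippet runs on its own.

```javascript
const compose2 = (f, g) => x => f(g(x))           // two-function compose
const reverse = s => [...s].reverse().join('')    // reverse a string

// Data last: configure the field first, feed the object later.
const getProp = field => obj => obj[field]
const reverseName = compose2(reverse, getProp('name'))

// Data first: there is nothing useful to pre-configure,
// so we have to fall back to a wrapper function.
const getPropDataFirst = obj => field => obj[field]
const reverseNameDataFirst = compose2(reverse, obj => getPropDataFirst(obj)('name'))

reverseName({ name: 'Ritchie' })            // 'eihctiR'
reverseNameDataFirst({ name: 'Ritchie' })   // 'eihctiR'
```

Both versions work, but only the data-last one lets us drop the explicit obj parameter.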

Side note: it actually is possible in Ramda to configure parameters other than the first using R.__. Sometimes we’ll have to, because deciding which parameter is the main data and which ones are configuration parameters may not be simple, and may depend entirely on the context in which we use the function.

By the way, two of the cooler features of Ramda are that its functions are automatically curried, and the order of their parameters is carefully crafted so that they can easily be curried in most situations (once again, depending on the context in which we’re using these functions, that may not always be the case). Other libraries such as Lodash/fp or Sanctuary (and probably others) do the same, so feel free to use any of those.

Mapping

Probably every javascript developer knows the map function. It simply takes an array and a function and iterates through the array applying the function to each of its elements. In the end, it returns a new array, with the same number of elements, that contains the results of applying the function.

However, at least for the purpose of this post, map will no longer be about iteration; map will be about applying a function to a value inside some context.

The context is what determines if and how the function is applied. In the case of an array, the array itself is the context and it determines that the function must be applied to all of its elements. Also, if the array is empty, the function is never applied to any values.

Array is just one such context that we can use map with. There are lots of other interesting contexts, some of which we’ll use for our web scraper, namely Maybe, Either and Future. All these contexts hold (or wrap) some value (like our array, which actually holds a bunch of them) but the really interesting thing about them is the way they control how functions get applied to that value.

One very important thing about map is that it always returns a context of exactly the same type. Meaning:

mapping over an Array always produces another Array with the exact same number of elements. Note that the values inside the new array may well be of a different type (for example, we can create an array of strings from an array of numbers by mapping the function toString), but the new array will always have the same number of elements. No more, no less.
mapping over a Maybe will always produce another Maybe (we’ll briefly discuss Maybes and other types next)
and so on

A type (or context) that can be mapped over is also called a functor.
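As a small illustration of "the context decides how the function is applied", here is a sketch with a generic map that simply delegates to the context. Box is a hypothetical one-value context invented for this example; it is not part of Ramda.

```javascript
// A generic map that delegates to the context's own map method.
const map = (fn, functor) => functor.map(fn)

// Box: a hypothetical context that holds exactly one value.
const Box = value => ({
  value,
  map: fn => Box(fn(value)),   // always returns another Box
})

const add1 = x => x + 1

map(add1, [1, 2, 3])     // [2, 3, 4]: the array applies add1 to every element
map(add1, [])            // []: nothing to apply it to
map(add1, Box(41))       // a Box holding 42
```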

Chaining

The chain function is very similar to map in the sense that it applies a function to a value inside some context. The difference between the two is the type of function that is mapped and how that mapping occurs.

Consider the following scenario: we have an array of purchase order (PO) numbers. For each order, we want to get its line items, extract the price from each item, and calculate the sum for all POs. This is what we got:

// this returns an array of line items. a line item is an object that looks like:
// { id: 1, name: "socks", price: 3, ... }
const getLineItems = poNumber => ...
// get the price of one line item
// this is the same as using R.prop('price')
const extractPrice = li => li.price
// all the PO numbers
const poNumbers = [1, 23, 11, 45]

The important part here is that getLineItems is a function that returns an array. If we try to map it to our array of poNumbers, we get an array of arrays:

const lineItems = R.map(getLineItems, poNumbers)
// --> [ [{id:1,price:3,...} , {id:2,price:56,...}]
//     , [{id:10,price:42,...} , {id:292,price:4,...}]
//     , [{id:43,price:2,...} , {id:456,price:5000,...}]
//     ]

In this case, it would be better if we got a flattened array because we just want to treat each line item equally, no matter which PO it belongs to. We just want to extract their prices and sum the whole thing.

We can use chain to do that. Like map, chain also operates on these contexts we’ve been talking about, but takes a function that returns a value that’s also inside the same kind of context. In our example, we use chain on an array, and give it a function that also returns an array. What chain does is collapse the two arrays into one:

const lineItems = R.chain(getLineItems, poNumbers)
// --> [ {id:1,price:3,...} , {id:2,price:56,...}
//     , {id:10,price:42,...} , {id:292,price:4,...}
//     , {id:43,price:2,...} , {id:456,price:5000,...}
//     ]

From there, we can easily complete our requirement:

const prices = R.map(extractPrice, lineItems)
const total = R.sum(prices)

Or, the whole thing in a more concise way (remember that compose reads from right-to-left):

const getLineItems = poNumber => ...
const poNumbers = [1, 23, 11, 45]
const calc = R.compose(
    R.sum,                     // returns the total price
    R.map(R.prop('price')),    // returns an array of prices
    R.chain(getLineItems))     // returns an array of line items
calc(poNumbers)

This way we don’t have to make up names for the intermediate values. Whether it reads better or worse is probably a matter of opinion.

This is a rather convoluted example, but chain is really extremely helpful in many situations where we don’t want to have a value inside two contexts (e.g., array of arrays, maybe inside maybe, etc.), as we’ll see later.

Other possible names for chain are bind and flatMap. A type (or context) that can be chained over in this way is called a monad.
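For arrays specifically, plain JavaScript already ships this behaviour as Array.prototype.flatMap. The sketch below uses it on the purchase order scenario; the orders object and its contents are made-up stub data, since the real getLineItems is not shown in this post.

```javascript
// Hypothetical stub data: the real getLineItems would look the order up somewhere.
const orders = {
  1: [{ id: 1, price: 3 }, { id: 2, price: 56 }],
  23: [{ id: 10, price: 42 }],
  11: [],
}
const getLineItems = poNumber => orders[poNumber] || []

const poNumbers = [1, 23, 11]

const nested = poNumbers.map(getLineItems)      // array of arrays (3 inner arrays)
const flat = poNumbers.flatMap(getLineItems)    // one flat array (3 line items)

const total = flat
  .map(item => item.price)
  .reduce((sum, price) => sum + price, 0)       // 101
```

Note that the PO without line items simply contributes nothing to the flat array.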

Types

We’ve been talking about contexts (or types) that contain values, and on which we can map and chain functions. We also know that Array is one of those types. In this section we’ll describe three other types that we’ll use to build our scraper: Maybe, Either and Future.

Maybe

The Maybe type represents a value that may or may not be there. It can only have one of two possible values:

it can be Nothing, in which case it doesn’t have anything else associated with it. If we map or chain over Nothing, we always get Nothing back
or it can be Just something. In this case, it wraps another value (which can be anything, a string, a number, or even the whole state of your application). If we map over Just something, we get back another Just wrapping whatever was the result of applying the function to the original value. Note that if we chain over Just we can get back a Nothing, though!

Or, in code:

// 'a' means any type of value, such as string, number, object, etc.
// '|' reads as OR
// Maybe a = Nothing | Just a
const add1 = x => x + 1
R.map(add1, Just(5))     // Just(6)
R.map(add1, Nothing)     // Nothing

We are going to be using Maybes to replace null and undefined in our code. Every time there’s a field somewhere that may not have a value, or a function that may return an empty or non-existent value, we’ll use a Maybe.

The advantage of Maybe over null or undefined is that we can safely apply functions on empty Maybes (i.e., Nothings) without it blowing up on us. And when it’s time to get the value inside the Maybe, we have to take into account the possibility that there is no value, forcing us to handle that case.
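Here is a minimal sketch of how such a type could look. This is a hand-rolled, hypothetical Maybe for illustration only; the method names (getOrElse in particular) and the safeProp helper are assumptions, not the API of any specific library.

```javascript
// A minimal, hypothetical Maybe (not the API of any particular library).
const Nothing = {
  map: _fn => Nothing,             // functions are never applied to Nothing
  chain: _fn => Nothing,
  getOrElse: fallback => fallback,
}
const Just = value => ({
  map: fn => Just(fn(value)),
  chain: fn => fn(value),          // fn itself returns a Maybe
  getOrElse: _fallback => value,
})

// Wrap a possibly missing field instead of returning undefined.
const safeProp = field => obj =>
  obj != null && obj[field] != null ? Just(obj[field]) : Nothing

const cityOf = user =>
  safeProp('address')(user)
    .chain(safeProp('city'))           // chain avoids a Maybe inside a Maybe
    .map(city => city.toUpperCase())   // skipped entirely on Nothing
    .getOrElse('unknown')              // forces us to handle the empty case

cityOf({ address: { city: 'Lisbon' } })   // 'LISBON'
cityOf({ address: {} })                   // 'unknown'
cityOf({})                                // 'unknown'
```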

Either

Either is similar to Maybe in the sense that it has only two possible values: a Left and a Right. However, in an Either both of those values wrap another value:

// Either a b = Left a | Right b
const add1 = x => x + 1
R.map(add1, Right(5))              // Right(6)
R.map(add1, Left({msg: "Ouch!"}))  // Left({msg: "Ouch!"})

The Right value is basically the same as Just: it represents the “regular” case, one where the functions we pass to map will actually get applied.

The Left value, on the other hand, is like the Nothing because it ignores the function we try to map on it, and just returns the value it already had.

Why is this useful? We are going to use Either to avoid throwing and catching exceptions in our scraper! Every time we code a synchronous function that may fail, that function will return an Either. That way, if the function fails in the middle of a large composition of functions, any subsequent mapping or chaining will be no-ops.

And, in the end, we will be forced to look inside the Either to check if all went well or if we had an error. This means that we can still choose to ignore errors that happen in our program, but that will be a conscious decision on our part; it won’t be because we simply forgot to handle the error case!
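A minimal sketch of the idea follows. Left, Right and the fold method are hand-rolled here for illustration (fold is an assumed name for "look inside and handle both cases"), and parseJson is a hypothetical example of wrapping a throwing function.

```javascript
// A minimal, hypothetical Either (not the API of any particular library).
const Left = value => ({
  map: _fn => Left(value),                      // ignores the function
  chain: _fn => Left(value),
  fold: (onLeft, _onRight) => onLeft(value),
})
const Right = value => ({
  map: fn => Right(fn(value)),
  chain: fn => fn(value),
  fold: (_onLeft, onRight) => onRight(value),
})

// Turn a throwing function into one that returns an Either.
const parseJson = text => {
  try {
    return Right(JSON.parse(text))
  } catch (err) {
    return Left({ msg: `invalid JSON: ${text}` })
  }
}

const readPrice = text =>
  parseJson(text)
    .map(item => item.price)     // both maps are no-ops on a Left
    .map(price => price * 2)
    .fold(error => `error: ${error.msg}`, price => `price: ${price}`)

readPrice('{"price": 21}')   // 'price: 42'
readPrice('oops')            // 'error: invalid JSON: oops'
```

The error case travels through the whole pipeline untouched and only has to be dealt with once, at the fold.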

Futures

Whereas Either will be used to handle synchronous functions that may fail, Future will be used to handle any asynchronous function (which, of course, can always fail).

A Future is like a Promise, with some very important differences:

A Promise is eager, which means it will start doing its stuff (e.g., network connection) right after it’s created. A Future is always lazy: you create it, map or chain over it several times and it doesn’t do anything. Only when you call a special method called fork will it actually do its thing.
A Promise has a then method that’s used to chain operations on the promise. From the then callback you can return either another promise or a “regular” value, and the next then in the chain will receive the value from the previous one. So then behaves like our map function (when we return a “regular” value) or like our chain function (when we return another promise). A Future keeps these as two separate operations, map and chain.
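The laziness difference can be sketched in a few lines. The Future below is a toy, hand-rolled version for illustration only: the (reject, resolve) argument order of fork and the Future.of helper are assumptions, and real libraries offer much more (cancellation, error mapping, and so on). The example resolves synchronously just to keep it easy to follow.

```javascript
// A toy, hypothetical Future: it only stores a description of the work.
const Future = fork => ({
  fork,
  map: fn =>
    Future((reject, resolve) => fork(reject, x => resolve(fn(x)))),
  chain: fn =>
    Future((reject, resolve) => fork(reject, x => fn(x).fork(reject, resolve))),
})
Future.of = x => Future((_reject, resolve) => resolve(x))

let started = 0
const fetchNumber = Future((_reject, resolve) => {
  started += 1        // lets us observe when the work really begins
  resolve(20)
})

// Building the pipeline runs nothing.
const program = fetchNumber
  .map(x => x + 1)
  .chain(x => Future.of(x * 2))
const startedBeforeFork = started   // still 0

let result
program.fork(
  err => { result = `failed: ${err}` },
  value => { result = value },
)
// started is now 1 and result is 42

// A Promise, by contrast, runs its executor as soon as it is created.
let promiseStarted = false
new Promise(resolve => { promiseStarted = true; resolve(1) })
// promiseStarted is already true
```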

Before we finally start our scraper, there’s one last thing to talk about that will help us gather important insight about our functions without looking at their code, or even their name: type signatures!

Type signatures

We’ll use type signatures extensively in our scraper, specifically to describe our functions. A type signature will let us know immediately the inputs and the output of a function.

As an example, let’s consider the following type signature.

// add :: Number -> Number -> Number

This means there’s a function named add that takes two numbers and returns another number. Or we could also read it as a function add that takes a number and returns another function that takes a number and returns a number (we won’t read functions that way, but it’s good to recognize this as currying!)

Now, for our purposes we can implement this function in any of the following ways:

// uncurried version
const add = (x, y) => x + y

// curried version (with standard arrow functions)
const add = x => y => x + y

// curried version (using Ramda)
const add = R.curry( (x, y) => x + y )

Mind you, these functions are not identical (can you spot the differences in the ways each version can be used?), but we can treat them as roughly the same function. We’ll use the first version if we think we probably don’t need currying (or if the function has only one parameter, of course). And we’ll use the third version if we know that we might need currying.
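In case you want to check your answer, the sketch below calls each version in both styles. To keep it runnable without Ramda, the third version uses a small hand-written curry helper as a stand-in for R.curry (an assumption for this demo), and the three functions get distinct, made-up names.

```javascript
// Hypothetical stand-in for R.curry, only so this snippet runs on its own.
const curry = fn => {
  const collect = (...args) =>
    args.length >= fn.length ? fn(...args) : (...more) => collect(...args, ...more)
  return collect
}

const addUncurried = (x, y) => x + y
const addManual = x => y => x + y
const addAuto = curry((x, y) => x + y)

addUncurried(1, 2)   // 3
addUncurried(1)      // NaN, because y is undefined

addManual(1)(2)      // 3
addManual(1, 2)      // a function: the second argument is silently ignored

addAuto(1, 2)        // 3
addAuto(1)(2)        // 3
```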

Type signatures are really helpful because they tell a lot about our functions even without needing to look at their body. For example, can you guess which function this is?

??? :: (Number -> String) -> Array Number -> Array String

So, this takes a function that transforms a number into a string, and an array of numbers. And it puts out an array of strings. This is a specialized version of our best friend, the map function! A more generic approach would be:

mapArray :: (a -> b) -> Array a -> Array b

(here, a and b are type variables that can be replaced by any two types). Of course, we’ve already seen that map is not at all about arrays, so:

mapMaybe :: (a -> b) -> Maybe a -> Maybe b

Or, for completeness, in the most general type signature:

map :: Functor f => (a -> b) -> f a -> f b

Here, Functor f is just a constraint that says that f must be a Functor. Remember that Array, Maybe, Either and Future are all functors. So, in the type signature above, you can just replace f by one of those and get mapArray, mapMaybe, etc.

As an aside, here Functor is also called a type class because it’s saying: any type that wants to join the club needs to implement a function named map that has this signature. It’s important to understand that each specific functor implements map differently: Arrays must iterate, Maybes must ignore Nothing, etc. However, the map function must obey the functor laws.
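For reference, the two functor laws are identity (mapping the identity function changes nothing) and composition (mapping twice is the same as mapping the composed function once). Here is a quick sketch that checks both for Array with some sample values; it is a spot check on one input, not a proof.

```javascript
const identity = x => x
const add1 = x => x + 1
const mult3 = x => x * 3
const xs = [1, 2, 3]

// Identity law: mapping identity gives back an equal array.
const mappedIdentity = xs.map(identity)        // [1, 2, 3]

// Composition law: two maps in a row equal one map of the composed function.
const twoMaps = xs.map(add1).map(mult3)        // [6, 9, 12]
const oneMap = xs.map(x => mult3(add1(x)))     // [6, 9, 12]
```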

That’s it for part I! Head on over to Part II, which contains the fun bit, where we actually build the scraper using the stuff we’ve just been over.