pet project report: a small downloader utility

I wrote a small Haskell program to download the videos from the 2012 Oregon Programming Languages Summer School. The GitHub repository can be found here.

(Actually, the videos are really huge, so you might do better watching them on YouTube; here are a few links.)

Getting the page

I used wreq as the HTTP client; it is simple and easy to use, although its lens dependency is perhaps a bit heavyweight. It seems like it could work perfectly well with a smaller lens library such as microlens.
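For reference, a minimal wreq request looks something like this (the URL is a placeholder, not the summer-school page):

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Control.Lens ((^.))
import Network.Wreq (get, responseStatus, statusCode)

-- Fetch a page and report the HTTP status code.
-- wreq exposes the response through lenses such as responseStatus.
fetchStatus :: String -> IO Int
fetchStatus url = do
    r <- get url
    pure (r ^. responseStatus . statusCode)
```

The lens-based accessors are the part that could, in principle, be served by microlens instead.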

Parsing the page

I used a combination of tagsoup and megaparsec, with the useful tagsoup-megaparsec library acting as glue.

I hadn’t used megaparsec before. People coming from attoparsec need to be aware that megaparsec doesn’t backtrack automatically, except in a few combinators. You need to request backtracking explicitly by using try.
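A toy illustration of the difference (the keyword parser is mine, not code from the downloader):

```haskell
import Data.Void (Void)
import Text.Megaparsec
import Text.Megaparsec.Char (string)

type Parser = Parsec Void String

-- Without try, the first branch consumes "if" before failing on the
-- second "f", and the <|> alternative is never attempted. Wrapping the
-- branch in try resets the input on failure.
keyword :: Parser String
keyword = try (string "iff") <|> string "if"
```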

One nice thing about megaparsec is that it is a monad transformer. I was having problems debugging a parse error, and I put a Writer monad under Parser to be able to inspect partial results. Putting Writer below Parser means you can get debug messages even for branches that backtrack.
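A sketch of the idea, assuming megaparsec's ParsecT over mtl's Writer (the number parser and the message text are illustrative, not the downloader's actual code):

```haskell
import Control.Monad.Writer
import Data.Void (Void)
import Text.Megaparsec
import Text.Megaparsec.Char (digitChar)

-- A Writer underneath the parser collects debug messages, and the log
-- survives even when a branch later backtracks.
type P = ParsecT Void String (Writer [String])

number :: P Int
number = do
    ds <- some digitChar
    lift (tell ["number: " ++ ds])  -- debug message, kept regardless of backtracking
    pure (read ds)

runP :: P a -> String -> (Either (ParseErrorBundle String Void) a, [String])
runP p = runWriter . runParserT p ""
```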

Creating the folder structure

In the page, the videos are structured in courses and lessons. I wanted an easy way of selecting which courses and lessons to download. The approach I took was to work in stages:

  • First, create a structure of empty course/lecture folders without downloading anything.
  • Then, ask the user to delete the folders they are not interested in.
  • Download only the files corresponding to the folders that were not deleted.

A rudimentary solution, but it works.
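The selection step can be sketched like this (the pairing of folder and URL is hypothetical, not the program's actual types):

```haskell
import Control.Monad (filterM)
import System.Directory

-- Keep only the downloads whose target folder still exists,
-- i.e. the ones the user did not delete.
selectRemaining :: [(FilePath, String)] -> IO [(FilePath, String)]
selectRemaining = filterM (doesDirectoryExist . fst)
```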

Data.Tree is used to represent the folder structure, each node containing the folder name and a list of files to download.
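For instance, a hypothetical catalog might look like this (course and file names invented for illustration):

```haskell
import Data.Tree

-- Each node pairs a folder name with the files to download into it.
catalog :: Tree (FilePath, [String])
catalog =
    Node ("oplss2012", [])
        [ Node ("logic", ["lecture1.mp4", "lecture2.mp4"]) []
        , Node ("types", ["lecture1.mp4"]) []
        ]
```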

As a learning exercise, I decided to manipulate the trees without explicit recursion, using only catamorphisms (the latest version of containers includes the useful foldTree catamorphism).
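A small example of foldTree replacing explicit recursion (a node here pairs a folder name with its files, as described above):

```haskell
import Data.Tree

-- Count every file in the tree with a single catamorphism:
-- foldTree :: (a -> [b] -> b) -> Tree a -> b
fileCount :: Tree (FilePath, [String]) -> Int
fileCount = foldTree (\(_, files) childCounts -> length files + sum childCounts)
```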

“Vanilla” catamorphisms destroy a tree starting from the leaves. However, there were cases in which I wanted to propagate information from the root to the leaves, not the other way around, for example when computing the full file path of each node by concatenating path pieces starting from the root.

I ended up using this auxiliary function:

inherit :: forall a. Tree a -> Tree (NonEmpty a)  -- forall needed for the inner signature (ScopedTypeVariables)
inherit tree = foldTree algebra tree []
  where
    algebra :: a -> [[a] -> Tree (NonEmpty a)] -> [a] -> Tree (NonEmpty a)
    algebra a fs as = Node (a :| as) (fs <*> [a : as])

This decorates each node with the (non-empty) list formed by the node itself and all its ancestors. The function uses a common “trick” to propagate information from the root: make the catamorphism return a function that builds the final result, not the final result itself.
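For instance, combining inherit with fmap yields the full path of each node (inherit is repeated here, minus the inner signature, so the snippet stands alone):

```haskell
import Data.List (intercalate)
import Data.List.NonEmpty (NonEmpty(..))
import qualified Data.List.NonEmpty as NE
import Data.Tree

inherit :: Tree a -> Tree (NonEmpty a)
inherit tree = foldTree algebra tree []
  where
    algebra a fs as = Node (a :| as) (fs <*> [a : as])

-- Each node's list goes from the node up to the root, so we reverse it
-- before joining the pieces with "/".
fullPaths :: Tree FilePath -> Tree FilePath
fullPaths = fmap (intercalate "/" . reverse . NE.toList) . inherit
```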

A simpler version of this trick is expressing foldl in terms of foldr (if you squint a little, the first element of a list is like the “root” of a linear tree). A beefed-up version of this trick allows calculating inherited attributes in attribute grammars.
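Concretely, the foldl-via-foldr version of the trick looks like this:

```haskell
-- foldl written with foldr: the catamorphism returns a function that
-- takes the accumulator "inherited" from the left, instead of a value.
foldlViaFoldr :: (b -> a -> b) -> b -> [a] -> b
foldlViaFoldr f z xs = foldr (\x k acc -> k (f acc x)) id xs z
```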

Downloading the files

To allow multiple concurrent downloads, I resorted to the ever-useful Concurrently Applicative from the async package.

In order to avoid overwhelming the server, I throttled the requests by bracketing the IO actions with semaphore operations:

traverseThrottled :: Traversable t => Int -> (a -> IO b) -> t a -> IO (t b)
traverseThrottled concLevel action taskContainer = do
    sem <- newQSem concLevel
    let throttledAction = bracket_ (waitQSem sem) (signalQSem sem) . action
    runConcurrently (traverse (Concurrently . throttledAction) taskContainer)
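A usage sketch (traverseThrottled is repeated so the snippet stands alone; a threadDelay stands in for real downloads):

```haskell
import Control.Concurrent (threadDelay)
import Control.Concurrent.Async (Concurrently(..), runConcurrently)
import Control.Concurrent.QSem
import Control.Exception (bracket_)

traverseThrottled :: Traversable t => Int -> (a -> IO b) -> t a -> IO (t b)
traverseThrottled concLevel action taskContainer = do
    sem <- newQSem concLevel
    let throttledAction = bracket_ (waitQSem sem) (signalQSem sem) . action
    runConcurrently (traverse (Concurrently . throttledAction) taskContainer)

-- Run at most 2 "downloads" at once; the delay simulates network work.
-- Results come back in the original order despite the concurrency.
demo :: IO [Int]
demo = traverseThrottled 2 (\n -> threadDelay 1000 >> pure (n * 2)) [1 .. 5]
```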

Finally, I assumed I would need some streaming library for handling each download, but a combination of wreq’s foldGet and withFile worked well enough, without further complication.
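A sketch of how that combination might look (the path and URL are placeholders; foldGet feeds the response body to a callback chunk by chunk, so the whole file never sits in memory):

```haskell
import qualified Data.ByteString as BS
import Network.Wreq (foldGet)
import System.IO (IOMode(WriteMode), withFile)

-- Stream a download straight to disk. The accumulator is () because
-- we only care about the side effect of writing each chunk.
downloadTo :: FilePath -> String -> IO ()
downloadTo path url =
    withFile path WriteMode $ \h ->
        foldGet (\() chunk -> BS.hPut h chunk) () url
```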