pet project report: a small downloader utility
(Actually, the videos are really huge and perhaps you could do better watching them on YouTube, here are a few links.)
Getting the page
Parsing the page
I hadn’t used megaparsec before. People coming form attoparsec need to be aware that megaparsec doesn’t backtrack automatically, except for a few combinators. You need to request backtracking by using try.
One nice thing about megaparsec is that it is a monad transformer. I was having problems debugging a parse error, and I put a Writer monad under Parser to be able to inspect partial results. Putting Writer below Parser means you can get debug messages even for branches that backtrack.
Creating the folder structure
In the page, the videos are structured in courses and lessons. I wanted an easy way of selecting which courses and lessons to download. The approach I took was to work in stages:
- First, create a structure of empty course/lecture folders without downloading anything.
- Then, ask the user to delete the folders in which he is not interested.
- Download only the files corresponding to the folders that were not deleted.
A rudimentary solution, but it works.
Data.Tree is used to represent the folder structure, each node containing the folder name and a list of files to download.
As a learning exercise, I decided to manipulate the trees without explicit recursion, using only catamorphisms (the latest version of containers includes the usefult foldTree catamorphism).
“vanilla” catamorphisms destroy a tree starting from the leaves. However, there were cases in which I wanted to propagate information from the root to the leaves, not the other way around. For example when computing the full file path for each node by concatenating path pieces starting from the root.
I ended up using this auxiliary function:
inherit :: Tree a -> Tree (NonEmpty a)
inherit tree = foldTree algebra tree  where
algebra :: a -> [[a] -> Tree (NonEmpty a)] -> [a] -> Tree (NonEmpty a)
algebra a fs as = Node (a:|as) (fs <*> [a:as])
Which decorates each node with the (non-empty) list of all its ancestors. The function uses a common “trick” to propagate information from the root: make the catamorphism return a function that builds the final result, not the final result itself.
A simpler version of this trick is expressing foldl in terms of foldr (if you squint a little, the first element of a list is like the “root” of a linear tree). A beefed-up version of this trick allows calculating inherited attributes in attribute grammars.
Downloading the files
In order to avoid overwhelming the server, I throttled the requests by bracketing the IO actions with semaphore operations:
traverseThrottled :: Traversable t => Int -> (a -> IO b) -> t a -> IO (t b)
traverseThrottled concLevel action taskContainer = do
sem <- newQSem concLevel
let throttledAction = bracket_ (waitQSem sem) (signalQSem sem)
runConcurrently (traverse (Concurrently . throttledAction) taskContainer)