Automating Download from Project Gutenberg using the Linux Terminal
Recently I switched to Linux as my primary OS, and so I started using the terminal, then bash, and then… One day I wanted to download all of the Top 100 EBooks of the last 30 days in text format from Project Gutenberg, but who is going to open, click, and save-as again and again? So instead of reaching for a fancy programming language I used the terminal; the rest is what I did.
Again, for a new Linux user this can serve as a simple demonstration of how powerful the terminal is when combined with a few great tools.
Getting Started
In this automation, I am going to use:
- bash (of course, it is automation using the terminal)
- sed and grep
- cURL
- HTML-XML-utils
- wget
Before anything else, the installation on a Debian/Ubuntu machine is:
sudo apt-get update
sudo apt-get install curl html-xml-utils
This installs cURL and HTML-XML-utils. Bash, grep, sed and wget are available on every GNU/Linux system; if not, they are just a search away.
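Before running anything, it may help to confirm the tools are actually on your PATH. A quick sketch (the hx* commands ship with html-xml-utils):

```shell
# Sanity check: report whether each tool used in this article is installed.
for tool in curl hxclean hxselect hxwls grep sed wget; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "$tool: found"
  else
    echo "$tool: MISSING"
  fi
done
```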
Starting…
So, the automation starts from this page. This page contains the Top 100 EBooks lists based on various criteria. I am interested in the Top 100 EBooks of the last 30 days, so I have to scrape the links that lead to the page of each book title. Here comes the first line.
curl -s -A "Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/51.0.2704.79 Chrome/51.0.2704.79 Safari/537.36" http://www.gutenberg.org/browse/scores/top | hxclean | hxselect -s '\n' li | tail -n 200 | head -n 100 > ebooksPages.txt
This is a single statement, and I know there is a good amount of piping. Here is the breakdown: curl fetches the page and hands its HTML to the next pipe. hxclean then applies heuristics to correct the HTML. hxselect, combined with the -s flag, prints every li tag on a separate line. hxclean and hxselect both belong to html-xml-utils. The output is about 600 tags, but we need only tags 401–500, because they correspond to our list. So we use our famous tools head and tail, and save the output to the ebooksPages.txt file. Now we have the links to our pages, which will eventually lead us to the links of the text files.
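The tail/head windowing can be checked on its own with numbered lines, where seq stands in for the ~600 li lines from the real page:

```shell
# From 600 numbered lines, tail -n 200 keeps lines 401-600 and
# head -n 100 then keeps lines 401-500; print the first and last kept line.
seq 600 | tail -n 200 | head -n 100 | sed -n '1p;100p'
# prints 401 and then 500
```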
Now we are going to extract the links from the partial HTML file.
hxwls ebooksPages.txt | sed 's/^/www.gutenberg.org/g' > ebooksPages-2.txt
hxwls extracts the href attribute from the a tags. We save these links in ebooksPages-2.txt. This file contains a link to each individual ebook's page.
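The sed step that prepends the hostname can be tried on a sample relative link (the path below is invented for illustration):

```shell
# Prepend the hostname to a relative link, as the sed in the pipeline does:
# s/^/.../ substitutes the site name at the start of each line.
echo "/ebooks/1342" | sed 's/^/www.gutenberg.org/'
# → www.gutenberg.org/ebooks/1342
```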
Okay, now we access the page and get the download link.
for i in `cat ebooksPages-2.txt`; do
curl -s -A "Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/51.0.2704.79 Chrome/51.0.2704.79 Safari/537.36" $i | hxwls | grep txt >> finalLinks.txt
done
In the above three lines of code, curl fetches each page: -s suppresses the progress output and -A specifies the User-Agent. Next, hxwls finds all the links on the page, and lastly grep keeps only those containing ‘txt’. Using the for loop we do this for each link: simply, we get the page, find all the links in it, and then search that list of links for the text file. Using >> we append the output of each iteration to the file, so in finalLinks.txt we get the download links.
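The grep filtering step can be seen in isolation on a toy link list (the three links below are invented for illustration, mimicking what hxwls might emit):

```shell
# Keep only the plain-text link among several candidate links.
printf '%s\n' \
  "/ebooks/1342.epub.images" \
  "//www.gutenberg.org/ebooks/1342.txt.utf-8" \
  "/ebooks/1342.kindle.images" | grep txt
# prints only the .txt.utf-8 line
```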
Just one step left
Finally, it is time to download. But if you open this file you will see that each download link looks something like //www.gutenberg.org/ebooks/xxxx.txt.utf-8, with a double forward slash at the beginning.
for i in `cat finalLinks.txt`; do
wget `sed -r 's/..(.*)/\1/g' <<< $i`
done
The command substitution in the 2nd line uses the sed tool to remove the two leading characters (the double forward slash), and then finally wget does its job.
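The substitution can be checked in isolation on a sample link (the ebook number is made up):

```shell
# Strip the first two characters (the "//"), as the command substitution does:
# the regex matches two characters, captures the rest, and keeps the capture.
echo "//www.gutenberg.org/ebooks/1342.txt.utf-8" | sed -r 's/..(.*)/\1/'
# → www.gutenberg.org/ebooks/1342.txt.utf-8
```

An equivalent, perhaps clearer, substitution is `sed 's|^//||'`, which anchors on the slashes themselves rather than on any two characters.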
Before you repeat this.
But before you start doing this, there is a problem. The website is really strict about its usage and access. It may block you for 24 hours or more if it sees a large number of downloads happening in a short time from a single computer. When I did this, about 17 files would get downloaded and then I would be blocked for 24 hours.
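One way to be gentler on the site is to pause between downloads. A sketch of a throttled variant of the loop; the 60-second pause is an arbitrary guess on my part, not an official rate limit:

```shell
# Gentler download loop (sketch): wait between requests to reduce the
# chance of being blocked. Assumes finalLinks.txt from the earlier step.
if [ -f finalLinks.txt ]; then
  while read -r link; do
    # Strip the leading "//" as before, then download.
    wget "$(echo "$link" | sed -r 's/..(.*)/\1/')"
    sleep 60   # arbitrary pause between requests
  done < finalLinks.txt
fi
```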