Automating Download from Project Gutenberg using the Linux Terminal

Gurdit Singh Bedi
3 min read · Dec 25, 2016

--

Recently I switched to Linux as my primary OS, so I started using the terminal, then bash, and then… One day I wanted to download all of the Top 100 EBooks of the last 30 days in text format from Project Gutenberg, but who's going to open, click, and "save as" again and again? So instead of using a fancy programming language I decided to use the terminal; the rest is what I did.

For a new Linux user, this can also serve as a simple demonstration of how powerful the terminal is when combined with a few great tools.

Getting Started

In this automation, I am going to use:

  1. bash (of course, it's automation using the terminal)
  2. sed and grep
  3. cURL
  4. HTML-XML-utils
  5. wget

Before anything else, the installation on a Debian/Ubuntu machine is:

sudo apt-get update
sudo apt-get install curl html-xml-utils

This installs cURL and HTML-XML-utils. Bash, grep, sed, and wget are available on practically every GNU/Linux system; if not, they are just a quick search away.
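
If you want a quick sanity check that every tool is in place before starting, something like the following should do (this check is my addition, not part of the original workflow):

for tool in bash sed grep curl wget hxclean hxselect hxwls; do
command -v "$tool" > /dev/null || echo "$tool is missing"
done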

Starting…

So, the automation starts from http://www.gutenberg.org/browse/scores/top. This page contains the lists of top 100 ebooks based on various criteria. I'm interested in the Top 100 EBooks of the last 30 days, so I have to scrape the links leading to the page of each book title. Here comes the first line.

curl -s -A "Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/51.0.2704.79 Chrome/51.0.2704.79 Safari/537.36" http://www.gutenberg.org/browse/scores/top |  hxclean | hxselect -s '\n' li | tail -n 200 | head -n 100 > ebooksPages.txt

This is a single statement, and I know there is a good amount of piping. Here is the breakdown: curl fetches the page and passes its HTML to the next pipe. hxclean then applies heuristics to correct the HTML. hxselect, combined with the -s flag, prints each li element on its own line. The hxclean and hxselect commands both belong to html-xml-utils. Now the output has about 600 li elements, but we only need numbers 401–500, because those correspond to the Top 100 EBooks of the last 30 days. So we use the famous tools tail and head: tail -n 200 keeps the last 200 lines, and head -n 100 keeps the first 100 of those. We save the output to the ebooksPages.txt file. Now we have the links to the pages which will eventually lead us to the links of the text files.
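
If you want to confirm that the 401–500 slice really is the Top 100 EBooks of the last 30 days (the page layout could change over time), a rough sanity check is to number the li elements and scroll through them yourself; this is just an extra step I am suggesting, not part of the original pipeline:

curl -s http://www.gutenberg.org/browse/scores/top | hxclean | hxselect -s '\n' li | cat -n | less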

Now we are going to extract the links from this partial HTML file.

hxwls ebooksPages.txt | sed 's/^/www.gutenberg.org/g' > ebooksPages-2.txt

hxwls extracts the href attribute from the a tags, and sed prefixes each link with the domain. We save these links in ebooksPages-2.txt; this file contains a link to each individual ebook's page.
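
The prefixed links carry no scheme, and curl will assume http:// when it is missing. If you prefer the scheme to be explicit, a slightly modified version of the same sed would do it; this is only an alternative sketch, the command above works as well:

hxwls ebooksPages.txt | sed 's|^|http://www.gutenberg.org|' > ebooksPages-2.txt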

Okay, now we access each page and get the download link.

for i in `cat ebooksPages-2.txt`; do 
curl -s -A "Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/51.0.2704.79 Chrome/51.0.2704.79 Safari/537.36" $i | hxwls | grep txt >> finalLinks.txt
done

In the above three lines of code, curl accesses each page, -s silences the progress output, and -A specifies the User-Agent. Next, hxwls finds all the links, and lastly grep searches them for 'txt'. Using the for loop we do this for each link: simply, we get the page, find all the links in it, and then search that list of links for the text file. Using >> we append the output of each iteration to the file, so in finalLinks.txt we end up with the download links.
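
For what it's worth, the same loop can be written with while read instead of for over cat, which handles each line a bit more safely; this is just an alternative sketch, not what I originally ran:

while read -r i; do
curl -s -A "Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/51.0.2704.79 Chrome/51.0.2704.79 Safari/537.36" "$i" | hxwls | grep txt >> finalLinks.txt
done < ebooksPages-2.txt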

Just one step left

Finally, it's time to download. But if you open this file, you will see that each download link looks something like //www.gutenberg.org/ebooks/xxxx.txt.utf-8 ; there is a double forward slash at the beginning.

for i in `cat finalLinks.txt`; do 
wget `sed -r 's/..(.*)/\1/g' <<< $i`
done

In the command substitution on the second line, we use the sed tool to remove the two leading forward slashes, and then wget finally does its job (with no scheme given, wget assumes http://).
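
An equivalent trick, if you would rather avoid the sed call, is to prepend http: to the protocol-relative link with plain string concatenation; again, just an alternative sketch, the sed version works just as well:

for i in `cat finalLinks.txt`; do
wget "http:$i"
done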

Before you repeat this.

But before you start doing this, there is a problem. The website is really strict about its usage and access. It may block you for 24 hours or more if it sees a large amount of downloading happening in a short time from a single computer. When I did this, about 17 files would get downloaded and then I got blocked for 24 hours.
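
If you still want to try it, one way to be gentler on the server is to let wget pause between downloads. A rough sketch (the 30-second wait is an arbitrary value I picked, and finalLinks-http.txt is just a name I am making up here):

sed 's|^//|http://|' finalLinks.txt > finalLinks-http.txt
wget --wait=30 --random-wait -i finalLinks-http.txt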
