A WayBack Machine Downloader
Using the Terminal to Excavate Digital Artifacts from Archived Websites on the Internet Archive
The wayback_machine_downloader is designed to download an entire website from the Internet Archive Wayback Machine, but I am focusing on how to use it to download specific filetypes. Read on if you are curious …
You need to install Ruby on your system (>= 1.9.2) — if you don’t already have it. Then run:
gem install wayback_machine_downloader
Tip: If you run into permission errors, you might have to add sudo in front of this command.
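If you would rather avoid sudo entirely, RubyGems can also install into your home directory. A minimal sketch, assuming a standard RubyGems setup (you may need to add the printed bin directory to your PATH):

```shell
# Install the gem for the current user only, without sudo
gem install --user-install wayback_machine_downloader

# Print the per-user executable directory so you can add it to your PATH
ruby -e 'require "rubygems"; puts File.join(Gem.user_dir, "bin")'
```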
Having old computers can sometimes mean software and other supporting material is hard to come by; well, not always, but in some cases it can be. Enter the Internet Archive’s Wayback Machine!
For the last 20+ years the Internet Archive (IA) has been tirelessly and diligently downloading and storing as many websites as it can. As a side effect, yesterday’s web can easily be experienced just as it was all those years ago, and if you find luck on your journey, some of those old .txt, .doc, .pdf, .zip, .bin, .sit, .dmg and many other filetypes can still be downloaded. How cool is that?
To achieve this we will install and use a very simple tool, aptly named the WayBack Machine Downloader. We will use it to search for and download specific filetypes from an entire website’s history, or from a specific range of dates within a website’s history. Installation and use are easily performed in the terminal.
I collect a lot of old digital files and computer hardware because it’s something I enjoy immensely; it’s just fun to rediscover, learn and ultimately admire what was once mainstream computing. There are many websites and blogs whose niches nicely fit my interests, but sometimes links to items that interest me are simply dead. It might also be the case that an old magazine or book that references an item of interest just can’t be easily found anymore. In these instances there is a good chance that the Wayback Machine can come to the rescue.
First we need to download and install the wayback_machine_downloader, which can be found on GitHub. There is plenty of instruction there already, but I will take some time to summarize the tool and get into some examples of how I use it.
I will begin with the command that I most often use, along with a breakdown of its use.
wayback_machine_downloader -s -a -c3 -o "/\.(zip|hqx|sit|pdf|bin|txt|tgz|dmg)$/i" somewebsite.com
--all-timestamps (-s): Download all snapshots/timestamps for a given website
--all (-a): Expand downloading to error files (40x and 50x) and redirections (30x)
--concurrency NUMBER (-c): Number of multiple files to download at a time. Default is one file at a time (ie. 20). Note: enter the number without a space after -c (e.g. -c3)
--only ONLY_FILTER (-o): Restrict downloading to urls that match this filter (use // notation for the filter to be treated as a regex)
In summary, the command at the top is set to consider all snapshots (--all-timestamps). It is also instructed to download even if some of the pages had a 40x, 50x or 30x status (--all). In my example I set the downloader to run 3 concurrent downloads at a time; the default is 1 (--concurrency). My experience has been that high concurrencies may result in skipped files, requiring the command to be rerun a second time. The --only flag is the secret sauce, as it allows the download to focus on specific filetypes. I should mention that there is an exclusion flag as well, --exclude, which skips everything that matches its filter instead.
Let’s take a closer look at the only filter (-o). Inside the brackets are some filetypes; you might use more or fewer depending on your interests. Each filetype is separated by a pipe character.
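Before kicking off a long download, you can sanity-check a filter against a few sample URLs. grep’s case-insensitive extended-regex mode is close enough to the Ruby regex the tool uses for this purpose; the URLs below are made up for illustration:

```shell
# Three hypothetical URLs: the first two should match, the last should not
printf '%s\n' \
  "http://somewebsite.com/files/game.sit" \
  "http://somewebsite.com/docs/manual.PDF" \
  "http://somewebsite.com/index.html" |
grep -iE '\.(zip|hqx|sit|pdf|bin|txt|tgz|dmg)$'
# prints the .sit and .PDF urls only
```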
The tail of the command line is simply the website address, typically something like somewebsite.com.
Where do files download to?
The default location for downloads is inside a folder named “websites” found inside your home folder.
If you want items to download to a specific location on your hard drive, you will need to prefix the location with the --directory flag (-d for short). For example:
wayback_machine_downloader -d /Volumes/Partition1/Alex/Desktop/download_folder_on_my_desktop -s -a -c3 -o "/\.(zip|hqx|sit|pdf|bin|txt|tgz|dmg)$/i" somewebsite.com
Alternatively you can just enter a name for a folder and it will be created in the directory your terminal is currently in (your home folder, by default). For example I use:
wayback_machine_downloader -d this_is_the_name_of_my_folder_for_this_download_IA -s -a -c3 -o "/\.(zip|hqx|sit|pdf|bin|txt|tgz|dmg)$/i" somewebsite.com
TIP: On a Mac, you can type wayback_machine_downloader -d, hit the space bar once to create a space, and then drag and drop your favorite download folder from wherever it is on your hard drive onto the terminal window. This will auto-fill the terminal with the path to your download folder. Follow that with a space and then continue typing the rest of your command.
wayback_machine_downloader Help file
To get help on all the commands, simply enter the following in the terminal:
wayback_machine_downloader --help
Caveat: If you choose to download txt files, you will find a lot of robots.txt files come down with the flow. After the download is complete, search the download folder for “robots.txt” and simply trash them, but this will leave behind a lot of empty folders. Fortunately, there is a Mac app by Thomas Tempelmann called Find Empty Folders. Install it on your Mac to quickly find and trash empty folders. I find it invaluable for post-processing.
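If you would rather stay in the terminal for the cleanup, the same two steps can be done with find. A sketch, assuming the default ~/websites download location (point it at your own folder otherwise). Note that -delete is destructive, so run each command without -delete first to preview what would be removed:

```shell
# Remove every robots.txt under the download folder
find ~/websites -type f -name 'robots.txt' -delete

# Then prune any directories left empty (find works depth-first here,
# so nested empty folders are removed in a single pass)
find ~/websites -type d -empty -delete
```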