Wouldn’t it be great to travel back in time and uncover long-forgotten versions of a website, like an archaeologist discovering secrets from the past, but working in the digital world?
There are lots of reasons to dig up a website’s past.
If you’re a business with a new domain name, you might be interested in knowing what the website that previously lived on that domain looked like.
If you’re a writer who has had an idea stolen, you can prove that you published it before the thief did, or simply rediscover lost content.
If you’re an SEO investigating a drop in a recently changed website’s visibility, having access to a prior version might provide you with clues to what went wrong.
In the digital world, information gets shared, copied and embedded. It becomes a part of our cultural heritage, passed on from one individual to another.
And even though the information is out of sight and out of mind, it never quite disappears. Broken links aren’t the only thing left behind when a website comes to an end. In fact, it’s possible to recreate the content even if all the resources were deleted from the origin server.
It’s all (mostly) being saved.
But the question is: By whom?
Whenever Google’s crawler visits your website, it also creates a snapshot of the page. This gets preserved in the Google cache. You can access it by typing cache: before the page’s URL in the Google search engine:
The cache’s date alone will give you an idea of when Googlebot last visited the page. Ideally, it shouldn’t be older than a day.
If you discover that the cached version of your page is three weeks old, this should ring an alarm bell. Your website might have some serious issues with its crawling strategy.
However, Google cache doesn’t enable us to dig deep enough.
It only provides access to the latest version of the page, which in the vast majority of cases is identical to the page that is currently live.
In terms of Internet Archeology, Google cache performs the role of a trowel. It lets us uncover the surface, but it doesn’t allow us to go deeper.
Therefore, for our online excavations, we need a better tool.
We need a spade.
Travel Back in Time Using the Internet Archive
According to this report, around 50% of the overall online traffic is caused by bots, not actual humans.
A variety of automated programs crawl the web. Some of them are malicious, such as scrapers, spammers or hacker tools. Others (search engine bots, feed fetchers, etc.) perform a spectrum of useful tasks.
One such crawler has a rather unusual mission of perpetuating fragments of the digital world. It belongs to the Internet Archive, a non-profit organization which has undertaken the task of preserving digital information for future generations. They collect all sorts of data: book scans, videos (including television news programs), audio records, images, and even software programs.
Most importantly, the Internet Archive gives us access to over 20 years of web history, with over 361 billion stored web pages!
You can find them by using the Wayback Machine.
How Does it Work?
The service was set up in 1996 and went public in 2001, after five years of collecting data. In 2016, a more advanced version of the website was released.
As mentioned before, the Wayback Machine uses a bot to archive pages. It navigates between websites by using a network of links. And it saves everything it finds in the process.
The more links point to your site from other domains, the greater the chance that the bot will discover it. It stands to reason that bigger and more popular websites are more likely to be stored, whereas a tiny personal blog might never be archived at all.
If you want to make sure that your website will be discovered, you can always use the Save Page button and submit the URL that should be archived.
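The same request can also be prepared from a script. The sketch below only builds the address, assuming the Save Page Now feature is reachable at `https://web.archive.org/save/` followed by the page URL. The helper name `build_save_url` and the example page are made up for illustration.

```python
from urllib.parse import quote

# Assumption: Save Page Now accepts the target URL appended to this prefix.
SAVE_ENDPOINT = "https://web.archive.org/save/"


def build_save_url(page_url: str) -> str:
    """Return the address that asks the archive to capture page_url."""
    # Keep normal URL punctuation intact and escape only unsafe
    # characters such as spaces.
    return SAVE_ENDPOINT + quote(page_url, safe=":/?&=%")


print(build_save_url("https://example.com/blog/my post"))
```

Opening the resulting address in a browser (or requesting it with an HTTP client) should trigger a capture, subject to whatever rate limits the archive applies.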
Online Excavations Utilizing the Wayback Machine
If you wish to discover the archived versions of a page, you should enter the URL in the search bar on the Wayback Machine (if you don’t know the exact address, you can try searching by the keywords it should contain).
This is what you will get after typing your URL:
On the timeline at the upper part of the page, you can see a graph representing how many snapshots of a given page were created during a single year.
After selecting a year, a number of dots, different in size and color, appear on a calendar below the timeline.
A dot means that the page was archived on a given date, and the size of the dot indicates how many snapshots were taken. A single date can have one, a few or no snapshots at all, and you can see the exact number by simply hovering over the dot:
As you can see, on November 16, 2017, five snapshots were taken of a given page, at different hours. The timezone is GMT.
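Snapshot timestamps in the Wayback Machine are written as 14 digits (`YYYYMMDDhhmmss`). As a rough sketch of what the calendar view is counting, the snippet below parses such timestamps as GMT and tallies them per day. The timestamps in the list are invented for illustration and do not come from a real capture history.

```python
from collections import Counter
from datetime import datetime, timezone


def parse_wayback_timestamp(ts: str) -> datetime:
    """Turn a 14-digit Wayback timestamp into a timezone-aware datetime."""
    return datetime.strptime(ts, "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)


def snapshots_per_day(timestamps):
    """Count how many snapshots fall on each calendar day (GMT)."""
    return Counter(
        parse_wayback_timestamp(ts).date().isoformat() for ts in timestamps
    )


# Hypothetical timestamps: five captures on one day, one on the next.
timestamps = [
    "20171116010203",
    "20171116083000",
    "20171116121500",
    "20171116180000",
    "20171116235959",
    "20171117000001",
]

counts = snapshots_per_day(timestamps)
print(counts["2017-11-16"])  # 5
```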
The dots can have four different colors:
- The Blue dot indicates that an object (URL address) has been successfully visited and archived;
- The Green dot reveals that an object contains a redirect to a different snapshot (or to a different object which might not be available in the archive);
- The Orange dot means that when a bot visited a URL, it returned an HTTP 4XX status code;
- The Red dot is an indicator that a server error appeared when a bot tried to reach an origin URL.
Only the blue dots contain stored archives. The other colors can tell you about issues the bot encountered or about changes in the website’s structure.
If you open a given archive, you will get the version of the page that the bot encountered at the given time, and by using the timeline, you can navigate among the newer and older snapshots.
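The color legend above boils down to a mapping from the HTTP status class the bot received to a dot color. Here is a small sketch of that mapping as described in this article. The function name `dot_color` is hypothetical, and treating every 3XX response as a redirect is an assumption.

```python
def dot_color(status_code: int) -> str:
    """Map an HTTP status code to the calendar dot color described above."""
    if 200 <= status_code < 300:
        return "blue"    # page visited and archived successfully
    if 300 <= status_code < 400:
        return "green"   # redirect to another snapshot or object
    if 400 <= status_code < 500:
        return "orange"  # client error such as 404
    if 500 <= status_code < 600:
        return "red"     # server error on the origin
    raise ValueError(f"unexpected status code: {status_code}")


for code in (200, 301, 404, 503):
    print(code, dot_color(code))
```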
However, it won’t be a 1:1 copy.
Also, it’s worth noting that the archive often has problems storing images.
Despite its limitations, the tool is still incredibly useful and it gives you the unique ability to investigate the history of a single page and its evolution over the years (provided that enough of the archived data exists).
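Whether any archived data exists for a page can also be checked without opening the calendar. The sketch below assumes the Wayback Machine's availability endpoint at `https://archive.org/wayback/available`, which takes a `url` and an optional `timestamp` parameter and answers with JSON. The sample response is hand-written to mirror that format and is not real data.

```python
import json
from urllib.parse import urlencode

AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def build_availability_query(page_url, timestamp=None):
    """Build the query URL; timestamp is an optional YYYYMMDD[hhmmss] string."""
    params = {"url": page_url}
    if timestamp is not None:
        params["timestamp"] = timestamp
    return f"{AVAILABILITY_ENDPOINT}?{urlencode(params)}"


def closest_snapshot(response_text):
    """Return (timestamp, snapshot_url) from a response, or None if absent."""
    data = json.loads(response_text)
    closest = data.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest["timestamp"], closest["url"]
    return None


# Hand-written sample mirroring the response shape (illustrative only).
SAMPLE_RESPONSE = """
{"archived_snapshots": {"closest": {
    "available": true,
    "url": "http://web.archive.org/web/20171116083000/http://example.com/",
    "timestamp": "20171116083000",
    "status": "200"}}}
"""

print(build_availability_query("example.com", "20171116"))
print(closest_snapshot(SAMPLE_RESPONSE))
```

An empty `archived_snapshots` object in the response means the archive has nothing stored for that address.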
You can also access the summary of the page’s history. Here you can see the details concerning the resources of the archived page.
Using the Wayback Machine for SEO and Business Purposes
If the website has been running for years, there is a fair chance that its structure has been rebuilt many times.
The Wayback Machine gives you the ability to see how the URLs have changed, and what’s even more important, it will give you insight about which pages have and haven’t been redirected.
It is possible that somewhere in the website’s structure, long forgotten pages are hidden. Combining the Wayback Machine and Screaming Frog can help you track down valuable redirect opportunities.
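One way to look for such forgotten pages is to compare the URLs known to the archive with the URLs found by a crawl of the live site. The sketch below assumes the archive's CDX search endpoint (`https://web.archive.org/cdx/search/cdx`) with its `url`, `matchType`, `output` and `fl` parameters. The sample rows and the `live_urls` set (standing in for a crawler export) are invented for illustration.

```python
import json
from urllib.parse import urlencode

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def build_cdx_query(domain):
    """Build a CDX query listing archived URLs and status codes for a domain."""
    params = {
        "url": domain,
        "matchType": "domain",
        "output": "json",
        "fl": "original,statuscode",
    }
    return f"{CDX_ENDPOINT}?{urlencode(params)}"


def archived_ok_urls(cdx_json_text):
    """Return the set of URLs that were archived with a 200 status."""
    rows = json.loads(cdx_json_text)
    if not rows:
        return set()
    header, *records = rows  # with output=json the first row names the fields
    url_idx = header.index("original")
    status_idx = header.index("statuscode")
    return {row[url_idx] for row in records if row[status_idx] == "200"}


def redirect_candidates(archived, live):
    """URLs that once returned 200 but are missing from the live crawl."""
    return sorted(archived - live)


# Illustrative data only.
sample_cdx = json.dumps([
    ["original", "statuscode"],
    ["https://example.com/", "200"],
    ["https://example.com/old-offer", "200"],
    ["https://example.com/broken", "404"],
])
live_urls = {"https://example.com/"}

print(redirect_candidates(archived_ok_urls(sample_cdx), live_urls))
```

Each URL in the result is a page that once worked and is now absent from the crawl, so it is worth checking whether it redirects anywhere today.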
For businesses, the Wayback Machine’s sitemap feature might also prove useful when investigating old Information Architecture.
It gathers a year’s worth of data from the page’s archive to build up a scheme of the website’s Information Architecture.
It is presented on a graph, with the inner circle representing the homepage. Subsequent subpages are added in the form of additional layers, just as you can see in the screenshot:
By clicking on different years you can observe how the website’s architecture evolved over time.
However, keep in mind that this sitemap is not 100% accurate (it may actually be far from complete), because its appearance depends solely on the Wayback Machine’s archive. If a snapshot of a given page was never taken, the page won’t make its way into the sitemap.
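To get a feel for how such a layered view can be derived from a list of archived URLs, here is a rough approximation. It assumes that each layer corresponds to the depth of the URL path, which is a simplification and not necessarily how the Wayback Machine builds its own graph. The example URLs are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def group_by_depth(urls):
    """Group URL paths by the number of path segments (0 = homepage)."""
    layers = defaultdict(set)
    for url in urls:
        segments = [part for part in urlsplit(url).path.split("/") if part]
        layers[len(segments)].add("/" + "/".join(segments))
    return {depth: sorted(paths) for depth, paths in sorted(layers.items())}


archived_urls = [
    "https://example.com/",
    "https://example.com/blog/",
    "https://example.com/blog/post-1",
    "https://example.com/about",
]

for depth, paths in group_by_depth(archived_urls).items():
    print(depth, paths)
```

Running this for URL lists taken from different years would show which sections appeared or disappeared, within the limits of what the archive captured.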
Retrieving the HTML Code of an Archived Page
In the Wayback Machine, you can take a look at the HTML of the stored version of the page. This will allow you to investigate the metadata of the old pages, see the changes made in the code, or even check the analytics code placement.
You have to keep in mind though that the code has been slightly modified from its original form:
- All URLs (internal links, resources, etc.) have been modified, so they point to the archive;
- A big block of additional code has been added to create the Wayback Machine’s toolbar.
You may browse through the code keeping that in mind; however, there is a simple trick for extracting the original HTML. To achieve that, simply add after a timestamp in the URL of the given archive.
For example, if the URL of the archived page is:
Then, the path to view the version of the code would be:
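As a sketch of this URL rewrite: the text above does not show the exact suffix, so the snippet assumes it is the Wayback Machine's `id_` modifier, which, appended directly to the timestamp, returns the page as it was originally captured. It also assumes snapshot addresses of the form `https://web.archive.org/web/<timestamp>/<original URL>`. The function name and the example address are hypothetical.

```python
import re

# Assumed snapshot layout: /web/<timestamp>[<modifier>]/<original URL>
SNAPSHOT_PATTERN = re.compile(
    r"^(https?://web\.archive\.org/web/)(\d{1,14})([a-z]{2}_)?/(.+)$"
)


def raw_snapshot_url(snapshot_url: str) -> str:
    """Rewrite a snapshot URL so it requests the unmodified capture."""
    match = SNAPSHOT_PATTERN.match(snapshot_url)
    if match is None:
        raise ValueError(f"not a Wayback snapshot URL: {snapshot_url}")
    prefix, timestamp, _old_modifier, original = match.groups()
    # Assumption: "id_" placed right after the timestamp disables rewriting.
    return f"{prefix}{timestamp}id_/{original}"


print(raw_snapshot_url(
    "https://web.archive.org/web/20171116083000/https://example.com/"
))
```

The rewritten address should serve the HTML without the toolbar block and without links pointing back into the archive.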
Investigating Crawling Issues
If your website’s Google Search Console is reporting issues concerning the robots.txt file, the Wayback Machine can give you access to the old version of the file.
First, determine when the errors started to appear, and then look for snapshots from that time range inside the archive. The archive stores everything it finds, so there is a decent chance that robots.txt also made its way to the database.
You might be surprised how often it happens in the case of big websites:
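A search for robots.txt snapshots within a time range can also be scripted. This sketch again assumes the archive's CDX search endpoint, using its `from` and `to` parameters and the `digest` field (a hash of the captured content) to spot the captures where the file actually changed. The domain, dates and digest values are made up for illustration.

```python
from urllib.parse import urlencode

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def build_robots_history_query(domain, start, end):
    """Build a CDX query for robots.txt captures between two dates (YYYYMMDD)."""
    params = {
        "url": f"{domain}/robots.txt",
        "from": start,
        "to": end,
        "output": "json",
        "fl": "timestamp,statuscode,digest",
    }
    return f"{CDX_ENDPOINT}?{urlencode(params)}"


def versions_that_changed(rows):
    """Keep only captures whose content digest differs from the previous one."""
    changed = []
    previous_digest = None
    for timestamp, status, digest in rows:
        if digest != previous_digest:
            changed.append((timestamp, status))
            previous_digest = digest
    return changed


# Illustrative rows: (timestamp, status code, content digest).
sample_rows = [
    ("20180101000000", "200", "AAA"),
    ("20180201000000", "200", "AAA"),
    ("20180301000000", "200", "BBB"),
]

print(build_robots_history_query("example.com", "20180101", "20180331"))
print(versions_that_changed(sample_rows))
```

The captures that survive the filter are the ones worth opening and comparing against the date the errors first showed up.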
Recovering Lost Content
Last but not least, the Wayback Machine can help you recover what once was lost.
If, for some reason, the content of a given page has been deleted and you don’t have a saved copy of that page, the Internet Archive might save the day. You can hope that at some point the lost page was stored in the archive, and if so, just retrieve the content. It’s as simple as that.
All in all, the tool can prove really useful for a number of SEO and business-related tasks, and it gives you the unique ability to take a look at the past of the World Wide Web.
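Once the HTML of an archived page has been saved locally, the readable text can be pulled out of it with the standard library alone. The following is a minimal sketch, not a full-featured extractor: it keeps visible text and skips scripts and styles. The HTML sample is invented for illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, ignoring script, style and noscript content."""

    SKIPPED_TAGS = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip_depth:
            self.chunks.append(text)


def extract_text(html: str) -> str:
    """Return the visible text of an HTML document, one chunk per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


# Illustrative HTML standing in for a saved archive page.
sample_html = (
    "<html><head><title>Lost post</title>"
    "<script>var x = 1;</script></head>"
    "<body><h1>Hello</h1><p>Recovered <b>text</b>.</p></body></html>"
)

print(extract_text(sample_html))
```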
Originally published at https://www.onely.com.