Google is probably the first thing that springs to most people’s minds when you mention the word “search engine”. Your friend or colleague asks you a question that you have absolutely no idea about - Google it. I think just about everyone has done that.
Google has a 90.14% market share worldwide, according to statcounter.com. Bing comes in at a not-very-close second with 3.24%.
So we use search engines every day, without even having to think about it or be told to do it. We just do it. And one of the reasons Google has conquered the search universe is its homepage. Lovely white space. Without doubt the simplest web form on the internet, full stop!
But believe me, Google is like a duck paddling around a serene lake. Supremely calm on top, but its little webbed feet are working like troopers underneath. Google is working 24/7, 365.
So what exactly happens when you type “cute kittens” into that form field and click Search? Well, before I answer that, we need to look at three important processes which make it all possible - crawling, indexing and document selection (or serving results).
Creepy web spiders
Before web users can find your website, Google needs to know it exists. You can submit your website to alert Google that you want to be found, although Googlebot also discovers pages on its own by following links. Either way, this sets off a process known as crawling.
The process of crawling uses a web crawler, officially known as Googlebot, but more often referred to as a spider. In essence, the spider is sent on its way to discover new and updated pages to be added to our next step, the Google Search index. It does this by starting with the most authoritative websites, such as CNN and The New York Times, and following links from there until the process starts all over again (see below). The aim is to index as much of the web as possible, while keeping results fresh.
Technically speaking, Googlebot is actually an incredibly advanced software programme, running across thousands of computers, which crawls (or fetches) billions - yes billions - of pages all over the web.
The start of the crawl process requires a list of web addresses - ones that webmasters have submitted in the form of XML sitemaps and ones generated from previous crawls. The crawler visits each URL provided and detects any links on those pages (SRC and HREF attributes). Newly discovered links get added to the original list and will be crawled later. Changes to existing pages and dead links are noted and used to update the index.
In simple terms, our creepy spiders are actually software programmes with instructions to create a big list of URLs, follow them, find more and report back on anything new or updated. This provides the basis for the next step: indexing.
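The crawl loop described above can be sketched in a few lines of Python. This is a toy illustration, not Googlebot: `fetch` is a hypothetical stand-in for a real HTTP client, and the whole thing runs breadth-first from a seed list.

```python
from collections import deque
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects href/src values found in a page (the SRC and HREF scan)."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)

def crawl(seed_urls, fetch, max_pages=100):
    """Breadth-first crawl: start from a seed list, follow discovered links.

    `fetch` is a hypothetical callable mapping a URL to its HTML, or None
    for a dead link - a stand-in for a real HTTP client.
    """
    frontier = deque(seed_urls)
    seen = set(seed_urls)
    crawled = {}          # url -> html, handed on to the indexing step
    dead_links = set()
    while frontier and len(crawled) < max_pages:
        url = frontier.popleft()
        html = fetch(url)
        if html is None:
            dead_links.add(url)   # noted and used to update the index
            continue
        crawled[url] = html
        extractor = LinkExtractor()
        extractor.feed(html)
        for link in extractor.links:
            if link not in seen:  # newly discovered links join the list
                seen.add(link)
                frontier.append(link)
    return crawled, dead_links
```

The real thing runs across thousands of machines with politeness rules, deduplication and scheduling, but the core idea - a growing list of URLs worked through one by one - is the same.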
Organising the web with words
So what happens to all of the links discovered by the crawling phase of this process? Well, hold on to your hats - here we go.
When a working web page has been found, Google renders the content of the page (just as you would when viewing a web page in a browser) and scrapes through all of the words on this page. So from our example above, the words “cute” and “kittens” may appear once or many times on a web page or document.
Google doesn’t just look at the content of the page, but also other areas including key content tags and attributes such as title tags, ALT attributes and the words in the page URL. It can figure out the location of the words on the page and also their proximity to each other (Google are clever, aren’t they?).
All of these documents are saved (indexed) in databases all over the world. Interestingly, Google claim there is an “entry for every word seen on every web page we index. When we index a web page, we add it to the entries for all of the words it contains.” This incredible document organisation is absolutely key to the running of a search engine as it ensures that when we come to the document selection part of the process, everything is filed away nicely and ready to go.
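That “entry for every word” idea is what search engineers call an inverted index. Here is a minimal sketch of one in Python - a toy version recording, for each word, which document it appears in and where, not Google’s actual data structure.

```python
import re
from collections import defaultdict

def build_index(documents):
    """Build a toy inverted index from a dict of doc_id -> text.

    Each word gets an entry listing every (document, position) where it
    appears - position data is what lets a search engine reason about
    word location and proximity.
    """
    index = defaultdict(list)   # word -> [(doc_id, position), ...]
    for doc_id, text in documents.items():
        for position, word in enumerate(re.findall(r"[a-z]+", text.lower())):
            index[word].append((doc_id, position))
    return index
```

For example, indexing `{"page1": "Cute kittens are cute!"}` gives the word “cute” an entry pointing at positions 0 and 3 of page1, so a later lookup never has to rescan the page itself.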
When you search Google, you’re not searching the live web. Instead, you’re searching Google’s index of the web which, like the list in the back of a book, helps you pinpoint exactly the information you need.
Amazingly, the Google Search index contains hundreds of billions of web pages and is in excess of 100,000,000 gigabytes in size. (I wonder how many pages of cute kittens are in there?!).
Show me cute kittens, damn it!
At a basic level, when we click Search, Google begins searching through its vast index of documents and returns the pages it believes are most relevant to the search term. It then ranks them based on various factors to surface the freshest and most relevant content available.
Analysing your search
The Google search algorithm first attempts to understand what exactly you are looking to discover, whether it’s a very specific search or a broad query, local information (e.g. “near me”), trending results and so on. This is a system developed over many years, and one that keeps getting more sophisticated, now correcting spelling mistakes and applying natural language understanding.
Matching your search terms
Next Google will select all of the documents (from the indexing phase above) that contain the words from our search, so “cute” and “kittens”. This could be 1,000 documents or it could be 100,000,000. Once we have this scaled-down list of documents, Google can now begin to rank them in order of relevance.
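In inverted-index terms, this selection step is an intersection: pull the entry for each query word and keep only the documents that appear in all of them. A simplified sketch, using a made-up slice of an index:

```python
def match(index, query_terms):
    """Return the documents containing every query term (an AND match),
    by intersecting the index entries for each term."""
    doc_sets = [set(index.get(term, ())) for term in query_terms]
    if not doc_sets:
        return set()
    return set.intersection(*doc_sets)

# A hypothetical slice of the index: word -> set of page ids.
index = {
    "cute":    {"page1", "page3", "page7"},
    "kittens": {"page1", "page2", "page3"},
    "sleep":   {"page2"},
}
```

Here `match(index, ["cute", "kittens"])` narrows millions of potential documents down to just the pages containing both words, which is the scaled-down list the ranking step then works on.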
Ranking matching pages
PageRank works by counting the number and quality of links to a page to determine a rough estimate of how important the website is. The underlying assumption is that more important websites are likely to receive more links from other websites.
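The published PageRank idea can be sketched with a few lines of power iteration: every page repeatedly shares its current score out along its outgoing links, so pages that attract many links from well-scored pages end up with high scores. This is the textbook formulation, not Google’s production system.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Toy power-iteration PageRank.

    `links` maps each page to the list of pages it links to. On each
    iteration a page keeps a small baseline score and shares the rest
    of its score equally among its outgoing links.
    """
    pages = set(links) | {p for targets in links.values() for p in targets}
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / len(pages) for p in pages}
        for page, targets in links.items():
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
        rank = new_rank
    return rank
```

Run it on a tiny link graph and the page with the most inbound links from well-ranked pages comes out on top, which is exactly the “important sites receive more links” assumption in action.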
Google will also be looking at things such as (not in order):
- the freshness of content
- how many times the search term appears in a number of places on the page (i.e. in the page content, the title tag and the URL, and how prominently it appears on the page)
- the usability of the page, which includes Page Speed and Mobile-friendliness
- the use of HTTPS
- the quality of the content written
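One simple way to picture how signals like these might combine is a weighted score per page. The weights below are entirely made up for illustration; Google’s real weighting is secret and far more complex than a linear sum.

```python
# Hypothetical signal weights - illustrative only, not Google's.
WEIGHTS = {
    "freshness": 0.2,
    "term_frequency": 0.3,
    "usability": 0.2,   # e.g. page speed, mobile-friendliness
    "https": 0.1,
    "quality": 0.2,
}

def score(signals):
    """Combine per-page signals (each scaled 0-1) into one ranking score."""
    return sum(WEIGHTS[name] * signals.get(name, 0.0) for name in WEIGHTS)

def rank_pages(pages):
    """Sort candidate pages by score, highest (most relevant) first."""
    return sorted(pages, key=lambda p: score(p["signals"]), reverse=True)
```

Feed it the matched pages from the previous step and the output is an ordered results list, which is what the next paragraph describes.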
Once Google has decided on the order of pages, they are returned to the user in order of relevance, with the web pages deemed most useful to the user and their specific query at the top.
Below is the result for “cute kittens” (20–06–18).
These are, at this exact point in time, the most relevant to my search. And remember that these results have been narrowed down from 459,000,000 indexed documents in just 0.48 seconds. Wow!
So next time you type something into Google, just think what it takes to show you that page of results.