Crawling with Spiders: How the Google Search Engine Works
Written by Tanisha Thatte
How big is the web? Well, it’s called the “World Wide Web”, so it is safe to assume that it is fairly large. We can estimate that the web is made up of over 130 trillion individual pages, and it is constantly growing. With so many different pages to choose from, how can someone find exactly the page they want to see? Well, the answer is easy: “just Google it”. If you tell someone to “Google” something, they know exactly what you mean. “Google” has become part of our everyday vernacular, a new verb for the internet age. Every second of every day, Google handles an average of about 40,000 searches, which adds up to over 100 billion searches a month. And even though an average search takes less than ½ second, several steps occur behind the scenes before you see the most relevant results.
Let’s continue using the term “World Wide Web”. Speaking literally, what is something every “web” has? The answer is spiders. When you “Google” something, you’re not actually searching through the entire web. Instead, you’re searching through Google’s index of the web. Google creates this index using software programs called spiders. Spiders start by looking at a few webpages, and then they follow the links on the initial pages to find other pages. Spiders repeat this process until they have indexed many billions of pages. In this sense, the spiders have “crawled” across the web.
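To make the crawling idea concrete, here is a minimal sketch in Python. It is not Google's actual crawler. It assumes a tiny, made-up “web” stored in a dictionary (the names TOY_WEB and crawl, and the .example page names, are hypothetical) and simply follows links outward from a starting page, the way the spiders described above do.

```python
from collections import deque

# Hypothetical miniature "web": each page maps to the pages it links to.
TOY_WEB = {
    "a.example": ["b.example", "c.example"],
    "b.example": ["c.example", "d.example"],
    "c.example": ["a.example"],
    "d.example": [],
    "e.example": ["a.example"],  # no page links here
}


def crawl(seeds, links):
    """Visit the seed pages, then keep following links to pages not seen yet."""
    seen = set(seeds)
    queue = deque(seeds)
    order = []
    while queue:
        page = queue.popleft()
        order.append(page)  # a real spider would fetch and index the page here
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return order


visited = crawl(["a.example"], TOY_WEB)
print(visited)
```

Notice that e.example never shows up in the result: no page links to it, so a spider starting from a.example has no way to discover it.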
Suppose you want to know the number of Olympic medals each country has won. Your search term might be something like “Olympic medal count by country”. The Google software will then search through the index to find every page that includes those search terms. However, within the index, there are hundreds of thousands of pages that include those terms. To decide which few pages you would most like to see, Google asks over 200 questions, such as “How many times does this page contain your keywords?” or “Is this page high quality or is it spam?” Google also looks at something called “PageRank”. PageRank is an algorithm that rates a webpage’s importance by looking at how many outside links point to it and how important the pages behind those links are. Finally, Google combines all these different factors to determine each page’s overall score. Based on these scores, you get your most relevant search results. All this takes about ½ second.
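The steps in this paragraph can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not Google's real ranking: the pages and links are made up, the PageRank function is the simplified textbook version (with a commonly used damping factor of 0.85), and the final score multiplies just two signals, keyword count and PageRank, where Google combines over 200. All function and variable names are hypothetical.

```python
# Hypothetical pages (URL -> text) and links (URL -> pages it links to).
PAGES = {
    "medals.example": "olympic medal count by country olympic medal table",
    "blog.example": "my olympic medal count by country guess",
    "cats.example": "pictures of cats",
}
LINKS = {
    "medals.example": [],
    "blog.example": ["medals.example"],
    "cats.example": ["medals.example"],
}


def build_index(pages):
    """Map each word to the set of pages that contain it."""
    index = {}
    for url, text in pages.items():
        for word in text.lower().split():
            index.setdefault(word, set()).add(url)
    return index


def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank: a page is important if important pages link to it."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Rank held by pages with no outgoing links is shared among all pages.
        dangling = sum(rank[p] for p in pages if not links[p])
        new_rank = {p: (1 - damping) / n + damping * dangling / n for p in pages}
        for p in pages:
            for target in links[p]:
                new_rank[target] += damping * rank[p] / len(links[p])
        rank = new_rank
    return rank


def search(query, pages, links):
    """Find pages containing every search term, then sort them by score."""
    index = build_index(pages)
    terms = query.lower().split()
    candidates = set(pages)
    for term in terms:
        candidates &= index.get(term, set())
    ranks = pagerank(links)

    def score(url):
        words = pages[url].lower().split()
        keyword_hits = sum(words.count(term) for term in terms)
        return keyword_hits * ranks[url]  # assumed way of combining the signals

    return sorted(candidates, key=score, reverse=True)


results = search("Olympic medal count by country", PAGES, LINKS)
print(results)
```

Here medals.example comes out on top: it mentions the keywords more often, and both other pages link to it, which raises its PageRank. The page about cats is never considered because it does not contain the search terms.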
Of course, even with these complex algorithms in place, Google is constantly evolving. Instead of just matching keywords from a search to pages in an index, Google is now trying to understand what exactly those words mean. If Google can understand the meaning of the words you search for, it can do a better job of delivering relevant search results. Google is collecting this information about the real world in what it calls the “Knowledge Graph”. The Knowledge Graph attempts to capture the relationships between various real-world things so that Google can analyze these relationships to bring you better search results.
Recently, you may have noticed a panel with some information about your search appearing next to some of your search results. This is a feature that utilizes the Knowledge Graph. For example, if you search for “Marie Curie” because she is the only woman you know who has won a Nobel Prize, you will now see a panel within your search results that will help you explore the broader topic of women who have won a Nobel Prize. You can then learn more about different women who have won and more about the topics they studied. Google is now transitioning from a search engine to what it likes to call a “knowledge engine”, and the Knowledge Graph is the first step in that direction.
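One simple way to picture a knowledge graph is as a list of facts, each linking two things with a relationship. The sketch below is only an illustration of that idea, not how Google's Knowledge Graph is actually built (its internals are not described here). The TRIPLES list and the function names are hypothetical, and the handful of facts were chosen to mirror the Marie Curie example.

```python
# Each fact is a (subject, relation, object) triple.
TRIPLES = [
    ("Marie Curie", "won", "Nobel Prize in Physics"),
    ("Marie Curie", "won", "Nobel Prize in Chemistry"),
    ("Dorothy Hodgkin", "won", "Nobel Prize in Chemistry"),
    ("Maria Goeppert Mayer", "won", "Nobel Prize in Physics"),
    ("Marie Curie", "studied", "radioactivity"),
    ("Dorothy Hodgkin", "studied", "X-ray crystallography"),
    ("Maria Goeppert Mayer", "studied", "nuclear shell structure"),
]


def objects(subject, relation, triples):
    """Everything that `subject` is connected to by `relation`."""
    return [o for s, r, o in triples if s == subject and r == relation]


def subjects(relation, obj, triples):
    """Everything connected to `obj` by `relation`."""
    return [s for s, r, o in triples if r == relation and o == obj]


def related_people(person, triples):
    """People who won at least one of the same prizes as `person`."""
    related = []
    for prize in objects(person, "won", triples):
        for other in subjects("won", prize, triples):
            if other != person and other not in related:
                related.append(other)
    return related


for name in related_people("Marie Curie", TRIPLES):
    print(name, "studied", objects(name, "studied", TRIPLES))
```

Starting from a single search for Marie Curie, following the “won” relationship leads to other laureates, and following “studied” leads to their fields, which is the kind of exploration the panel is meant to support.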
So as you can see, much more goes on behind the scenes in the ½ second it takes you to Google something than is apparent. Google search is a constantly evolving process. The Google search team is made up of programmers, scientists, and engineers, but it is also made up of the millions of people who use Google every day, including you. The original ambition of Google co-founders Larry Page and Sergey Brin was to “organize the world’s information and make it universally accessible and useful”. Today, Google has put the world’s information right at our fingertips. And as the world’s information base continues to grow, Google will too.
About the Author:
Tanisha is a founding member of Humans For AI, a non-profit focused on building a more diverse workforce for the future leveraging AI technologies. Learn more about us and join us as we embark on this journey to make a difference!