Make It Real Elite — Week 3: Search Engine Indexer — Crawling the internet

Sebastian Zapata Mardini
6 min read · Apr 30, 2018


Search engines play an essential part in our lives, and they influence many jobs around the world, especially software development positions. Have you ever wondered how they work? How does Google have the right information every time you search for something? What is the strategy for crawling the whole internet? Is Google some sort of wizard able to read your mind every time you search for something? To find out whether this is “googling magic” or not, we will build a web crawler. We may not have a magic wand, but we have Golang and some of its amazing features, like goroutines, channels and worker pools, to help us reach our goal. Stay tuned.

Crawling the internet

A Web crawler, sometimes called a spider, is an Internet bot that systematically browses the World Wide Web, typically for the purpose of Web indexing (web spidering).

Using a single command we will tell our crawler to start at a specific URL, and from there we will extract the information below:

  • Page title
  • Page description
  • HTML Body
  • Links (This is a key point. Later on, we’ll visit each link in this set and repeat the same crawling process. That’s how we can index the whole internet from a single starting point :) Magic?)

Storage

I decided to use ElasticSearch as the database to store all the information we want to index from the internet. ElasticSearch is one of the most popular, highly scalable full-text search engines on the market, and it uses an inverted index to allow very fast full-text searches. Simply put, it’s perfect for what we need in this project.

Furthermore, we will use an excellent ElasticSearch client for Golang called elastic. (Special thanks to Oliver Eilhard for his amazing work maintaining this project.)

Code

Let’s say that if we really had a magic wand, it would be Go’s concurrency. Stay with me to see how we’ll use its features in the following strategy:

  1. A channel called queue. We’ll use a FIFO structure to send all the links we find on a website. Want to know more about FIFO queues? You can read up on them here. Our workers will receive and process from this channel all the links found by our goroutines (a minimal sketch of this channel follows this list).
  2. A worker pool of size 10. Each worker in the pool will be in charge of visiting one URL, extracting its information, and using it to create or update an index in our database. This lets us have 10 goroutines working on different links at the same time. Each worker will also find the links on the page it visits and send all of them through the queue channel.
  3. Some special considerations. We will discard invalid links, like the ones with # or javascript: in the href attribute. If we visit a URL that already has an index, we will update its information in the database by default. Also, we won’t implement a page rank algorithm for this project; we will trust the score ElasticSearch gives each result as a relevance measure.
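
The queue channel itself doesn’t appear in the snippets below, so here is a minimal sketch of how it could be declared (the repository may buffer it or name it differently):

// queue is the shared channel the workers consume links from.
// Sketch only: an unbuffered channel of URL strings.
var queue = make(chan string)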

Since this is the first of three posts about building a search engine with Golang, I’d like to let the code speak for itself. You can check all of the indexer code in this repository.

// Scraper for each website
type Scraper struct {
    url string
    doc *goquery.Document
}

We will use the Scraper struct to represent a scraper, or crawler, for a specific page. This struct has a string property holding the URL and a doc property holding a goquery document. goquery is a Go package that brings a syntax and a set of features similar to jQuery, which is perfect for our purpose in this tutorial.
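
The NewScraper constructor used later in crawlURL isn’t reproduced in this post. A minimal sketch, assuming it simply fetches the URL with net/http and wraps the response body in a goquery document (returning nil on any failure), could look like this:

// NewScraper fetches the given URL and returns a Scraper wrapping the
// parsed document, or nil if the page cannot be retrieved or parsed.
func NewScraper(url string) *Scraper {
    res, err := http.Get(url)
    if err != nil {
        return nil
    }
    defer res.Body.Close()
    doc, err := goquery.NewDocumentFromReader(res.Body)
    if err != nil {
        return nil
    }
    return &Scraper{url: url, doc: doc}
}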

Attached to this struct we have some methods to extract the information from the website. Let’s take a look.

To find the title and description we’ll use the MetaDataInformation function:

// MetaDataInformation returns the title and description from the page
func (s *Scraper) MetaDataInformation() (string, string) {
    var t string
    var d string
    t = s.doc.Find("title").Contents().Text()
    s.doc.Find("meta").Each(func(index int, item *goquery.Selection) {
        if item.AttrOr("name", "") == "description" || item.AttrOr("property", "") == "og:description" {
            d = item.AttrOr("content", "")
        }
    })
    return t, d
}

To get the body we’ll use the Body function:

// Body returns a string with the body of the page
func (s *Scraper) Body() string {
    body := s.doc.Find("body").Text()
    // Remove leading/ending white spaces
    body = strings.TrimSpace(body)
    return body
}

Finally, to get all the links from the website, we’ll use the Links function:

// Links returns an array with all the links from the website
func (s *Scraper) Links() []string {
    links := make([]string, 0)
    var link string
    s.doc.Find("body a").Each(func(index int, item *goquery.Selection) {
        link = ""
        linkTag := item
        href, _ := linkTag.Attr("href")
        if !strings.HasPrefix(href, "#") && !strings.HasPrefix(href, "javascript") {
            link = s.buildLink(href)
            if link != "" {
                links = append(links, link)
            }
        }
    })
    return links
}
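
The buildLink helper used above isn’t shown in this post. A minimal sketch, assuming it resolves relative hrefs against the scraper’s base URL with net/url and returns an empty string for anything it can’t parse, could be:

// buildLink turns a (possibly relative) href into an absolute URL based
// on the scraper's own URL, or returns "" if the href cannot be parsed.
func (s *Scraper) buildLink(href string) string {
    base, err := url.Parse(s.url)
    if err != nil {
        return ""
    }
    ref, err := url.Parse(href)
    if err != nil {
        return ""
    }
    return base.ResolveReference(ref).String()
}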

We will use a struct called Page, which represents a scraped website. In this struct we have unique ID, title, description, body, and URL attributes. We also define the JSON tags, since we will marshal the struct when interacting with the ElasticSearch client.

// Page struct to store in database
type Page struct {
    ID          string `json:"id"`
    Title       string `json:"title"`
    Description string `json:"description"`
    Body        string `json:"body"`
    URL         string `json:"url"`
}
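
The CreatePage, UpdatePage and ExistingPage helpers live in the repository and aren’t reproduced here. As an illustration of how a Page could be indexed with the elastic client, here is a sketch that assumes a package-level *elastic.Client named client and an index called "pages" (both names may differ in the actual code):

// CreatePage indexes a new Page document and returns false on failure.
// Sketch only: client and the "pages" index name are assumptions.
func CreatePage(p Page) bool {
    _, err := client.Index().
        Index("pages").
        Type("page").
        Id(p.ID).
        BodyJson(p).
        Do(context.Background())
    return err == nil
}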

In the startCrawling function we allocate our workers and publish the first URL to the queue channel, where it will be received and processed by one of the workers.

func startCrawling(start string) {
    checkIndexPresence()
    var wg sync.WaitGroup
    noOfWorkers := 10
    // Send first url to the channel
    go func(s string) {
        queue <- s
    }(start)
    // Create worker pool with noOfWorkers workers
    wg.Add(noOfWorkers)
    for i := 1; i <= noOfWorkers; i++ {
        go worker(&wg, i)
    }
    wg.Wait()
}
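
The checkIndexPresence call at the top isn’t shown in this post either. A sketch, assuming it only makes sure the "pages" index exists before crawling starts (again using the assumed package-level client), might look like:

// checkIndexPresence creates the "pages" index if it doesn't exist yet.
// Sketch only: client and the index name are assumptions.
func checkIndexPresence() {
    ctx := context.Background()
    exists, err := client.IndexExists("pages").Do(ctx)
    if err != nil {
        log.Fatal(err)
    }
    if !exists {
        if _, err := client.CreateIndex("pages").Do(ctx); err != nil {
            log.Fatal(err)
        }
    }
}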

In each iteration of this loop we create a goroutine. The worker function runs in each of these goroutines, receiving (or listening for) the links sent through the queue channel and processing each of them with our crawlURL function.

func worker(wg *sync.WaitGroup, id int) {
    for link := range queue {
        crawlURL(link)
    }
    wg.Done()
}

Now let’s take a look at the main piece, the crawlURL function. The workflow of this function is really simple:

  • Start a new Scraper struct for the given URL
  • Extract the title, description, body and links from the URL
  • Check whether the current URL is already stored in our database. If not, create a new index; if the URL already exists, update the existing index
  • Publish all the links found to the queue channel, to be consumed later by the workers

func crawlURL(url string) {
    // Extract links, title and description
    s := NewScraper(url)
    if s == nil {
        return
    }
    links := s.Links()
    title, description := s.MetaDataInformation()
    body := s.Body()
    // Check if the page exists
    existsLink, page := ExistingPage(url)
    if existsLink {
        // Update the page in database
        params := map[string]interface{}{
            "title":       title,
            "description": description,
            "body":        body,
        }
        success := UpdatePage(page.ID, params)
        if !success {
            return
        }
        fmt.Println("Page", url, "with ID", page.ID, "updated")
    } else {
        // Create the new page in the database.
        id, _ := shortid.Generate()
        newPage := Page{
            ID:          id,
            Title:       title,
            Description: description,
            Body:        body,
            URL:         url,
        }
        success := CreatePage(newPage)
        if !success {
            return
        }
        fmt.Println("Page", url, "created")
    }
    for _, link := range links {
        go func(l string) {
            queue <- l
        }(link)
    }
}
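
For completeness, here is a sketch of what ExistingPage could look like with the elastic client: a term query on the url field (assuming it is indexed as a keyword), unmarshalling the first hit back into a Page. The client variable and the "pages" index name are the same assumptions as before:

// ExistingPage reports whether a page with this URL is already indexed
// and, if so, returns it. Sketch only: query details are assumptions.
func ExistingPage(url string) (bool, Page) {
    var page Page
    result, err := client.Search().
        Index("pages").
        Query(elastic.NewTermQuery("url", url)).
        Size(1).
        Do(context.Background())
    if err != nil || result.TotalHits() == 0 {
        return false, page
    }
    for _, item := range result.Each(reflect.TypeOf(page)) {
        if p, ok := item.(Page); ok {
            return true, p
        }
    }
    return false, page
}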

And we are done! One special consideration before testing the code: since I’m running this code in a Docker containerized environment, you may want to change this line and also this one in the code, since you will be running your ElasticSearch server locally.
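
For reference, connecting to ElasticSearch with olivere/elastic usually looks something like the sketch below (the exact code in the repository may differ); the piece to change is the host passed to SetURL:

// newElasticClient is a sketch of a typical client setup; point SetURL
// at http://localhost:9200 when running ElasticSearch locally.
func newElasticClient() (*elastic.Client, error) {
    return elastic.NewClient(
        elastic.SetURL("http://localhost:9200"),
        elastic.SetSniff(false),
    )
}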

After running the code with the command go run *.go index STARTING_URL, with your ElasticSearch server up, you should see output like the following:

# go_search_engine_indexer[master]$ go run *.go index http://www.elcolombiano.com
Elasticsearch version 6.2.4
Page http://www.elcolombiano.com with ID JfGBscGig updated
Page http://www.elcolombiano.com/colombia with ID FXGBy5GmR updated
Page http://www.elcolombiano.com/antioquia with ID MunfycGig updated
Page http://www.elcolombiano.com/inicio with ID QIVBs5Gmg updated
Page http://www.elcolombiano.com/antioquia with ID MunfycGig updated
Page http://www.elcolombiano.com with ID JfGBscGig updated
...
...
...

Well, magic or not, here we are at the end of the first post, crawling the internet like pros. :) If you want to see more of the search engine indexer in Go, you can visit the repository, and don’t forget to check the readme file too. I would appreciate any comments, feedback or contributions. Till next time, cheers!
