Webl: A simple web crawler written in Go

An exploration of the Go language (golang) to build a simple web crawler; all code is available on GitHub. This application was written as an exploration and demonstration of the language's various features. It is not feature complete, but it should be complex enough to provide some examples of using Go's concurrency features, tied together in a simple command-line and web interface.


The web crawler uses Redis to store results. Please install it and ensure it is running before starting.

Grab the three projects

go get github.com/aforward/webl/api
go get github.com/aforward/webl/weblconsole
go get github.com/aforward/webl/weblui

Build and test them

cd $GOPATH/src
go test github.com/aforward/webl/api
go build github.com/aforward/webl/weblconsole
go build github.com/aforward/webl/weblui

The Web Application

To start the webserver

cd $GOPATH/src/github.com/aforward/webl/weblui
go build
./weblui

The launched application should be available locally, and you can add URLs to crawl.

Using WebSockets, the UI attaches itself to the running crawler and streams the current status back to the JavaScript front end.

Data is persisted to Redis, so that you can view recently crawled domains.

In the details view, we show the sitemap as a table, listing links (to other pages) and assets (used on the current page, e.g. JavaScript / CSS). For simplicity, we only crawl within a domain (e.g. a4word.com); we do not look beyond it (e.g. twitter.com / facebook.com) or within other subdomains (e.g. admin.a4word.com).
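The same-domain rule reduces to a small predicate on the parsed host. Here is a minimal sketch, using a hypothetical shouldProcessUrl helper (the project's actual implementation may differ):

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// shouldProcessUrl reports whether rawUrl belongs to the domain being
// crawled. External hosts and other subdomains are rejected.
func shouldProcessUrl(domainName, rawUrl string) bool {
	u, err := url.Parse(rawUrl)
	if err != nil {
		return false
	}
	return strings.EqualFold(u.Host, domainName)
}

func main() {
	fmt.Println(shouldProcessUrl("a4word.com", "http://a4word.com/snippets.php")) // true
	fmt.Println(shouldProcessUrl("a4word.com", "http://twitter.com/a4word"))      // false
	fmt.Println(shouldProcessUrl("a4word.com", "http://admin.a4word.com/"))       // false
}
```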

I experimented with Graph Dracula for some better visualizations, but right now the results are far too busy.

The Console Application

To start the console

cd $GOPATH/src/github.com/aforward/webl/weblconsole 
go build
# change a4word.com with your url
./weblconsole -url=a4word.com

The webl library is consumed by a web application (described above) and a console application, described here. Both are thin clients that push most of the work to the Go library.

In fact, the logged output in the web application is drawn from the same logging information used in the console (the console simply has extra flags to turn verbosity up and down).

The Data Store

For simplicity, data is stored in a Redis database. We are using a set to manage all crawled domains.

> smembers domains
1) "instepfundraising.com"
2) "a4word.com"

Each resource is uniquely identified by its URL and is stored internally as a hash of properties.

> hgetall "resources:::http://a4word.com/snippets.php"
1) "name"
2) "/snippets.php"
3) "lastanalyzed"
4) "2014-05-19 15:54:23"
5) "url"
6) "http://a4word.com/snippets.php"
7) "status"
8) "200 OK"
9) "statuscode"
10) "200"
11) "lastmodified"
12) ""
13) "type"
14) "text/html"

Assets and links within a page are stored in an edges set for each resource.

> smembers "edges:::http://a4word.com/snippets.php"
1) "http://a4word.com/Andrew-Forward-Resume.php"
2) "http://a4word.com/css/gumby.css"
3) "http://a4word.com/snippets.php"
4) "http://a4word.com/js/sh_3.0.83/scripts/shBrushBash.js"
5) "http://a4word.com/js/libs/ui/gumby.retina.js"
6) "http://a4word.com"
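The key layout above is straightforward to reproduce in code; the helper names below are illustrative, not necessarily the ones used in the project:

```go
package main

import "fmt"

// Illustrative helpers mirroring the Redis key layout shown above:
// a "domains" set, a "resources:::<url>" hash per resource, and an
// "edges:::<url>" set of links and assets per resource.
func domainsKey() string            { return "domains" }
func resourceKey(url string) string { return "resources:::" + url }
func edgesKey(url string) string    { return "edges:::" + url }

func main() {
	fmt.Println(resourceKey("http://a4word.com/snippets.php"))
	// resources:::http://a4word.com/snippets.php
	fmt.Println(edgesKey("http://a4word.com/snippets.php"))
	// edges:::http://a4word.com/snippets.php
}
```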

The results of a crawl are captured in Go in a Resource struct.

type Resource struct {
	Name         string
	LastAnalyzed string
	Url          string
	Status       string
	StatusCode   int
	LastModified string
	Type         string
	Links        []*Resource
	Assets       []*Resource
}

Configure The Logger

To flexibly manage the logging of information, we configured four types of logs:

// TRACE: For a more in-depth view of how the code is behaving
// INFO: Key feedback of the running system
// WARN: When things go awry, but not necessarily totally unexpected
// ERROR: Catastrophic issues that typically result in
//        a shut down of the app

From the command line, by default, we display INFO, WARN, and ERROR. Using the -verbose flag, we include TRACE; using the -quiet flag, we turn off INFO (TRACE stays off as well). This is accomplished by setting the appropriate io.Writer.

// use ioutil.Discard to ignore message 
// use os.Stdout for displaying messages to the standard console
// use os.Stderr for displaying messages to the error console

In addition, the logger also accepts a WebSocket (*websocket.Conn), which allows us to stream results from the crawler directly to the browser.

Using WaitGroups For Concurrency

The Crawl algorithm is broken down into two parts: fetch and analyze. The fetch portion performs an HTTP GET on the domain being crawled, and then passes the response to the analyzer to extract the desired metadata (like status code and document type), as well as to look for additional links to fetch.

// A trimmed down version of Crawl to show the basics of WaitGroups
func Crawl(domainName string) {
	var wg sync.WaitGroup

	alreadyProcessed := set.New()
	url := toUrl(domainName, "")
	name := ToFriendlyName(url)

	AddDomain(&Resource{
		Name:         name,
		Url:          url,
		LastAnalyzed: lastAnalyzed})

	wg.Add(1)
	go fetchResource(name, url, alreadyProcessed, &wg)

	wg.Wait()
}

In the crawl, we set up a synchronized set to avoid processing the same link twice, and delegate the fetching of the data to a fetchResource goroutine. To ensure the main program doesn't quit right away, we use a WaitGroup to track how many things we have left to process.

In a first iteration, fetchResource needs to tell the WaitGroup it's done, and we do this with a defer call (similar to ensure in Ruby).

func fetchResource(domainName string,
	currentUrl string,
	alreadyProcessed *set.Set,
	wg *sync.WaitGroup) {

	// When the method ends, decrement the wait group
	defer wg.Done()

	// Don't re-process the same urls
	if alreadyProcessed.Has(currentUrl) {

		// Only process the Urls within the same domain
		// (a rule for this web crawler, not necessarily yours)
	} else if shouldProcessUrl(domainName, currentUrl) {

		// Fetch the data
		resp, err := http.Get(currentUrl)
		should_close_resp := true

		if err == nil {
			contentType := resp.Header.Get("Content-Type")
			lastModified := resp.Header.Get("Last-Modified")

			// Only webpages (e.g. text/html) should be traversed;
			// other links are assets like JS, CSS, etc
			if IsWebpage(contentType) {
				if resp.StatusCode == 200 {
					should_close_resp = false

					// More work to do, so increment the WaitGroup
					// and delegate to the analyzer
					wg.Add(1)
					go analyzeResource(domainName,

		// Note that the analyzeResource will close
		// the response, but it's not called in all cases,
		// so I have this extra code here
		if should_close_resp {
			if resp != nil && resp.Body != nil {
				defer resp.Body.Close()
				defer io.Copy(ioutil.Discard, resp.Body)
			}
		}

The analyzer will process the data, identify the links and then fetch them.

func analyzeResource(domainName string,
	currentUrl string,
	resp *http.Response,
	alreadyProcessed *set.Set,
	httpLimitChannel chan int,
	wg *sync.WaitGroup) {
	defer wg.Done()
	defer resp.Body.Close()
	defer io.Copy(ioutil.Discard, resp.Body)

	tokenizer := html.NewTokenizer(resp.Body)
	for {
		token_type := tokenizer.Next()
		if token_type == html.ErrorToken {
			// Something went wrong in the processing of the file
			if tokenizer.Err() != io.EOF {
				WARN.Println(fmt.Sprintf("HTML error found in %s due to %s",
					currentUrl, tokenizer.Err()))
			}
			return
		}
		token := tokenizer.Token()
		switch token_type {
		case html.StartTagToken:
			path := resourcePath(token)

			// If the found token contains a link to another URL
			// we have more work to do, and must fetch another resource
			if path != "" {
				nextUrl := toUrl(domainName, path)
				wg.Add(1)
				go fetchResource(domainName,

For simplicity, I stripped out the persistence of the crawled data, so please refer to the GitHub project to browse the working code.

Throttle http.Get with a Channel (semaphore)

Go is fast, fast enough that you can run out of resources locally (i.e. 1024 open files) or burden the remote server. To throttle goroutines, we use a "full" channel to limit the number of executing http.Get calls, implementing a semaphore in Go.

// fills up a channel of integers to capacity
func initCapacity(maxOutstanding int) (sem chan int) {
	sem = make(chan int, maxOutstanding)
	for i := 0; i < maxOutstanding; i++ {
		sem <- 1
	}
	return
}

// initialize the http GET limit channel
httpLimitChannel := initCapacity(4)

// wrap the http.Get in channel operations,
// as <- will block until a slot is available
<-httpLimitChannel
resp, err := http.Get(currentUrl)
httpLimitChannel <- 1
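A self-contained demonstration of the pattern; the counter below just verifies that no more than maxOutstanding workers run at once (the worker body is where the http.Get would go):

```go
package main

import (
	"fmt"
	"sync"
)

// initCapacity fills a buffered channel to capacity, so receiving from
// it acquires one of maxOutstanding slots (a counting semaphore).
func initCapacity(maxOutstanding int) chan int {
	sem := make(chan int, maxOutstanding)
	for i := 0; i < maxOutstanding; i++ {
		sem <- 1
	}
	return sem
}

func main() {
	sem := initCapacity(2) // at most 2 workers at once
	var wg sync.WaitGroup
	var mu sync.Mutex
	inFlight, maxSeen := 0, 0

	for i := 0; i < 10; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			<-sem // acquire a slot (blocks while 2 are busy)
			mu.Lock()
			inFlight++
			if inFlight > maxSeen {
				maxSeen = inFlight
			}
			mu.Unlock()
			// ... an http.Get(currentUrl) would go here ...
			mu.Lock()
			inFlight--
			mu.Unlock()
			sem <- 1 // release the slot
		}()
	}
	wg.Wait()
	fmt.Println("max concurrent workers:", maxSeen, "(limit 2)")
}
```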

Here's a great talk introducing Go's concurrency patterns.

Respecting Robots.txt

The Web Robots Pages describe how bots like webl are allowed to interact with a site. To achieve this, we used the robotstxt package and enhanced fetchResource to keep track of which robots.txt files it had already loaded, avoiding a fresh fetch on each request.

The bulk of the heavy lifting looks like the following:

func canRobotsAccess(input string,
	allRobots map[string]*robotstxt.RobotsData) (canAccess bool) {
	canAccess = true
	robotsUrl := toRobotsUrl(input)
	inputPath := toPath(input)

	if robot, ok := allRobots[robotsUrl]; ok {
		if robot != nil {
			canAccess = robot.TestAgent(inputPath, "WeblBot")
		}
	} else {
		allRobots[robotsUrl] = nil
		TRACE.Println(fmt.Sprintf("Loading %s", robotsUrl))
		resp, err := http.Get(robotsUrl)
		if resp != nil && resp.Body != nil {
			defer resp.Body.Close()
		}
		if err != nil || resp == nil || resp.StatusCode != 200 {
			TRACE.Println(fmt.Sprintf(
				"Unable to access %s, assuming full access.", robotsUrl))
		} else {
			robot, err := robotstxt.FromResponse(resp)
			if err == nil {
				allRobots[robotsUrl] = robot
				canAccess = robot.TestAgent(inputPath, "WeblBot")
				TRACE.Println(fmt.Sprintf(
					"Access to %s via %s (ok? %t)", input, robotsUrl, canAccess))
			}
		}
	}
	return
}

Next Steps

  • Adding the ability to manage multiple crawls over a domain and provide a diff of the results.
  • Adding security to prevent abuse from crawling too often.
  • Improve visualization based on how best to use the data (e.g. broken links, unused assets, etc). This will most likely involve an improved data store (like Postgres) to allow for richer searching.
  • Improved sitemap.xml generation to grab other fields like priority, last modified, etc.
  • Improved resource meta-data like title, and keywords, as well as taking thumbnails of the webpage.
  • Improved link identification by analyzing JS and CSS for urls.

Originally published at a4word.com in 2013.
