Friday, 27 December 2013

My solution to the Go Tutorial Web Crawler exercise

I've been having a fun time learning Go. The final exercise (http://tour.golang.org/#73 at time of writing) asks you to modify a web crawler to crawl pages in parallel without fetching the same page twice.

One general rule of thumb in parallel programming is to never write your own synchronisation code: it's subtle, error-prone, and difficult to validate. This is true for monitor-based synchronisation, and as neat as Go's channels and goroutines are[1], they're still relatively low-level, so I believe it's generally true for Go as well. Instead of writing your own synchronisation code, use well-written libraries that provide a suitable abstraction.

There are two concurrency problems in the web crawler exercise. The first is making the main function wait until all spawned goroutines complete. The second is storing and querying the set of visited URLs in a safe manner.

For the first problem, the Go library provides a WaitGroup type in the sync package. For those who are familiar with Doug Lea's excellent java.util.concurrent package, WaitGroup is similar to the java.util.concurrent.Phaser class.

The WaitGroup type simply holds a counter. Clients can increment the counter (using Add(delta int)) or decrement the counter (using Done()). Clients can also call WaitGroup.Wait(), which blocks until the counter reaches zero.

We want to have the main routine wait for an unknown number of goroutines, so WaitGroup is perfect for our needs. Before spawning a goroutine, we simply increment the counter, and in the goroutine itself, we use Go's defer feature to ensure that the counter is decremented before the goroutine completes.
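
A minimal sketch of the pattern (the loop and the worker bodies below are placeholders of my own, not the exercise code):

package main

import (
    "fmt"
    "sync"
)

func main() {
    var wg sync.WaitGroup
    for i := 0; i < 3; i++ {
        wg.Add(1) // increment before spawning, so Wait can't pass early
        go func(n int) {
            defer wg.Done() // decrement even if the work returns early
            fmt.Println("worker", n, "finished")
        }(i)
    }
    wg.Wait() // blocks until the counter drops back to zero
}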

For the second problem, multiple goroutines need to record visited URLs and query the visited set, all in a safe manner. A 'test-and-set' style operation is ideal for this, so I created a Visited type with a TestAndSetVisited(url string) method that uses a sync.Mutex to protect reads and writes to a map. I considered a sync.RWMutex, but with test-and-set semantics a plain mutex is simpler. I suspect an RWMutex might be more efficient if the URL graph were dense, but it doesn't seem worth the hassle.
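
For comparison, here's a rough sketch of what the RWMutex variant might look like (the VisitedRW type and its fields are hypothetical, purely to illustrate the extra hassle). Because Go's sync.RWMutex can't upgrade a read lock to a write lock, the fast path needs a second check under the write lock:

package main

import (
    "fmt"
    "sync"
)

type VisitedRW struct {
    url_map map[string]bool
    mutex   sync.RWMutex
}

func (self *VisitedRW) TestAndSetVisited(url string) bool {
    // Fast path: many goroutines may hold the read lock at once.
    self.mutex.RLock()
    seen := self.url_map[url]
    self.mutex.RUnlock()
    if seen {
        return true
    }

    // Slow path: re-check under the write lock, since another
    // goroutine may have marked the URL between the two locks.
    self.mutex.Lock()
    defer self.mutex.Unlock()
    seen = self.url_map[url]
    self.url_map[url] = true
    return seen
}

func main() {
    v := &VisitedRW{url_map: make(map[string]bool)}
    fmt.Println(v.TestAndSetVisited("http://golang.org/")) // false
    fmt.Println(v.TestAndSetVisited("http://golang.org/")) // true
}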

Below is the tutorial code with my modifications.

package main

import (
    "fmt"
    "sync"
)

type Fetcher interface {
    // Fetch returns the body of URL and
    // a slice of URLs found on that page.
    Fetch(url string) (body string, urls []string, err error)
}

type Visited struct {
    url_map map[string]bool
    mutex sync.Mutex
}

// TestAndSetVisited reports whether url was already visited, and
// marks it as visited. A single mutex guards both the read and the write.
func (self *Visited) TestAndSetVisited(url string) bool {
    self.mutex.Lock()
    defer self.mutex.Unlock()

    visited := self.url_map[url]
    self.url_map[url] = true
    return visited
}

// Crawl uses fetcher to recursively crawl
// pages starting with url, to a maximum of depth.
func Crawl(url string, depth int, fetcher Fetcher, 
           visited *Visited, wg *sync.WaitGroup) {
    defer wg.Done()
    if depth <= 0 || visited.TestAndSetVisited(url) {
        return
    }
    body, urls, err := fetcher.Fetch(url)
    if err != nil {
        fmt.Println(err)
        return
    }
    fmt.Printf("found: %s %q\n", url, body)
    for _, u := range urls {
        wg.Add(1)
        go Crawl(u, depth-1, fetcher, visited, wg)
    }
    return
}

func main() {
    var visited = Visited{ url_map: make(map[string]bool) }
    var wg sync.WaitGroup
    wg.Add(1)
    go Crawl("http://golang.org/", 4, fetcher, &visited, &wg)
    wg.Wait()
}

// fakeFetcher is a Fetcher that returns canned results.
type fakeFetcher map[string]*fakeResult

type fakeResult struct {
    body string
    urls []string
}

func (f fakeFetcher) Fetch(url string) (string, []string,
                                        error) {
    if res, ok := f[url]; ok {
        return res.body, res.urls, nil
    }
    return "", nil, fmt.Errorf("not found: %s", url)
}

// fetcher is a populated fakeFetcher.
var fetcher = fakeFetcher{
    "http://golang.org/": &fakeResult{
        "The Go Programming Language",
        []string{
            "http://golang.org/pkg/",
            "http://golang.org/cmd/",
        },
    },
    "http://golang.org/pkg/": &fakeResult{
        "Packages",
        []string{
            "http://golang.org/",
            "http://golang.org/cmd/",
            "http://golang.org/pkg/fmt/",
            "http://golang.org/pkg/os/",
        },
    },
    "http://golang.org/pkg/fmt/": &fakeResult{
        "Package fmt",
        []string{
            "http://golang.org/",
            "http://golang.org/pkg/",
        },
    },
    "http://golang.org/pkg/os/": &fakeResult{
        "Package os",
        []string{
            "http://golang.org/",
            "http://golang.org/pkg/",
        },
    },
}

Update: I have since discovered that fmt.Printf is not thread-safe, and concurrent calls can interleave their output. Concurrency is hard.
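
One way around the interleaving (a sketch of my own, not part of the exercise) is to give a single goroutine ownership of standard output and have the crawlers send it messages over a channel:

package main

import (
    "fmt"
    "sync"
)

func main() {
    logc := make(chan string, 16) // buffered so senders rarely block
    done := make(chan struct{})

    // Only this goroutine writes to stdout, so lines never interleave.
    go func() {
        for msg := range logc {
            fmt.Println(msg)
        }
        close(done)
    }()

    var wg sync.WaitGroup
    for i := 0; i < 3; i++ {
        wg.Add(1)
        go func(n int) {
            defer wg.Done()
            logc <- fmt.Sprintf("found: page %d", n)
        }(i)
    }
    wg.Wait()
    close(logc) // no more messages; let the printer drain and exit
    <-done
}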

Also, Sonia at the Sonia Codes blog has a nice write-up of a very Go-like solution to this exercise. It's 'Go-like' in that it explicitly shares memory by communicating. See http://soniacodes.wordpress.com/2011/10/09/a-tour-of-go-69-exercise-web-crawler/ .
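
To give a flavour of that approach, here's a small sketch (the visitReq type and channel names are mine, not Sonia's code) in which a single goroutine owns the visited map and answers test-and-set requests over a channel, with no mutex in sight:

package main

import "fmt"

type visitReq struct {
    url   string
    reply chan bool // true if the URL was already visited
}

// visitedServer owns the map; no other goroutine touches it.
func visitedServer(reqs chan visitReq) {
    seen := make(map[string]bool)
    for req := range reqs {
        req.reply <- seen[req.url]
        seen[req.url] = true
    }
}

func main() {
    reqs := make(chan visitReq)
    go visitedServer(reqs)

    reply := make(chan bool)
    reqs <- visitReq{"http://golang.org/", reply}
    fmt.Println(<-reply) // false: first visit
    reqs <- visitReq{"http://golang.org/", reply}
    fmt.Println(<-reply) // true: already seen
}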



[1] For an introduction to where the ideas behind Go's channels and goroutines came from, have a look at C. A. R. Hoare's Communicating Sequential Processes.