Unleashing Concurrency in Web Scraping with Go Routines: A Beginner’s Guide

Introduction

Emilio Limo Cliff
7 min read · Mar 12, 2024


Have you ever stared impatiently at your web scraper, waiting for it to finish? Scraping often means waiting on network requests, which creates bottlenecks. Web scraping has become an essential skill for data enthusiasts, and Go’s concurrency features open up new possibilities. This article aims to demystify the concept of concurrency, specifically in the context of web scraping. We’ll walk through the entire process, from setting up the environment to comparing runtimes and understanding the advantages of Go routines.

GO ROUTINES

Go routines are lightweight threads built into the Go programming language. They allow you to run multiple tasks concurrently, significantly boosting your scraper’s speed. The real magic happens when we use channels to communicate between these Go routines. Channels facilitate the safe exchange of data, making them an ideal tool for concurrent programming. Let’s dive into the code and explore how Go routines achieve this magic.
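Before we touch the scraper, here is a minimal, self-contained sketch of the two ideas we will lean on: launching functions as goroutines with the go keyword, and collecting their results over a channel. The fetch function and its one-second sleep are made-up stand-ins for real network work.

package main

import (
	"fmt"
	"time"
)

// fetch simulates a slow task (e.g. a network request) and sends
// its result back over the channel instead of returning it.
func fetch(id int, results chan<- string) {
	time.Sleep(time.Second) // stand-in for network latency
	results <- fmt.Sprintf("task %d done", id)
}

func main() {
	results := make(chan string, 3) // buffered: senders never block

	// "go" launches each call as a goroutine and returns immediately.
	for i := 1; i <= 3; i++ {
		go fetch(i, results)
	}

	// Each receive blocks until some goroutine has sent a result.
	for i := 0; i < 3; i++ {
		fmt.Println(<-results)
	}
}

All three tasks sleep at the same time, so the whole program finishes in roughly one second rather than three.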

To highlight the benefits of Go routines, we’ll implement two scraping approaches: one using a traditional for loop and another leveraging Go routines. We’ll measure and compare the runtimes of both methods to showcase the efficiency gains achieved through concurrency.
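To make that comparison concrete before we scrape anything real, here is a small sketch of the measurement pattern we will use later. slowTask is a made-up stand-in for a page download, and the timings in the comments are approximate.

package main

import (
	"fmt"
	"time"
)

// slowTask stands in for a single page download.
func slowTask() { time.Sleep(500 * time.Millisecond) }

func main() {
	// Sequential: each task waits for the previous one (~2s total).
	start := time.Now()
	for i := 0; i < 4; i++ {
		slowTask()
	}
	fmt.Println("sequential:", time.Since(start))

	// Concurrent: all four tasks run at once (~500ms total).
	start = time.Now()
	done := make(chan struct{})
	for i := 0; i < 4; i++ {
		go func() {
			slowTask()
			done <- struct{}{}
		}()
	}
	for i := 0; i < 4; i++ {
		<-done // wait for each goroutine to signal completion
	}
	fmt.Println("concurrent:", time.Since(start))
}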

For our project we will need the github.com/gocolly/colly package.

In our project directory, initialize a Go module and install colly:

go mod init <module_name>
go get github.com/gocolly/colly

We will use colly to extract structured data from a movie streaming site and store it in a CSV file.

— In our main.go we first declare a struct Movie that will hold each movie’s details, some constants, and a global variable for the time taken to execute the scraping

package main

import (
	"encoding/csv"
	"fmt"
	"log"
	"os"
	"sync"
	"time"

	"github.com/gocolly/colly"
)

// Movie holds the details scraped for a single movie.
type Movie struct {
	Genre         string
	Title         string
	MovieLink     string
	Year          string
	ImageURL      string
	QualityLevels string
}

const (
	URL         = "https://www.goojara.to/watch-movies-genre"
	scraperType = "without"
)

// elapsedTime records how long the scraping took.
var elapsedTime time.Duration

— In our main function we open a CSV file and create a writer, then write the header titles for each column

func main() {
	file, err := os.Create("new-movies.csv")
	if err != nil {
		// Exit early: writing to a nil file would panic later.
		log.Fatal("Error while creating file: ", err)
	}
	defer file.Close()

	writer := csv.NewWriter(file)
	defer writer.Flush()

	// Write the header row first.
	writer.Write([]string{
		"Genre",
		"Title",
		"Year",
		"QualityLevels",
		"MovieLink",
		"ImageURL",
	})

— Specify the genres we want to search for, declare a WaitGroup together with a channel, and record the start time


	genres := []string{"Action", "Adventure", "Comedy", "Drama", "Sci-Fi", "Horror", "Crime", "Thriller", "Romance", "Fantasy"}

	var wg sync.WaitGroup
	ch := make(chan []Movie, len(genres))

	startTime := time.Now()
  • wg: It helps coordinate and wait for a collection of goroutines to finish their execution before proceeding. It’s like a counter that keeps track of how many goroutines are active: when a goroutine completes its task it signals the WaitGroup and the counter decreases. The main program (or another goroutine) can wait until the counter reaches zero, indicating that all goroutines have finished. This ensures proper synchronization and orderly termination of concurrent tasks.
  • ch: A channel is a communication mechanism that allows one goroutine to send data to another. make(chan []Movie, len(genres)) creates a buffered channel whose buffer size is len(genres), meaning the channel can hold that many movie slices at once. This allows asynchronous communication between goroutines: when a goroutine wants to send a slice of movies it can put it into the channel without blocking, and the receiving side can take it out later. The sketch below shows the two working together in isolation.
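Here is a stripped-down, self-contained sketch of the same fan-in pattern the scraper uses (worker and its string payload are made up for illustration):

package main

import (
	"fmt"
	"sync"
)

// worker stands in for a scraping function: it produces one result
// and signals the WaitGroup on exit.
func worker(wg *sync.WaitGroup, id int, ch chan<- string) {
	defer wg.Done() // decrement the counter when done
	ch <- fmt.Sprintf("result from worker %d", id)
}

func main() {
	var wg sync.WaitGroup
	ch := make(chan string, 5) // buffer sized to the number of workers

	for i := 1; i <= 5; i++ {
		wg.Add(1) // increment the counter before launching
		go worker(&wg, i, ch)
	}

	// Close the channel only after every worker has called Done,
	// so the range loop below knows when to stop.
	go func() {
		wg.Wait()
		close(ch)
	}()

	for result := range ch {
		fmt.Println(result)
	}
}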

— Create a switch statement that checks the scraperType and chooses which method to use to scrape the data

	switch scraperType {
	case "with":
		for _, genre := range genres {
			wg.Add(1)
			go scrapeMoviesWithGoroutines(&wg, genre, ch)
		}

		// Record the elapsed time and close the channel only after
		// every scraping goroutine has signalled Done.
		go func() {
			wg.Wait()
			elapsedTime = time.Since(startTime)
			close(ch)
		}()

		// Ranging over ch ends once the channel is closed.
		for movieList := range ch {
			writeMoviesToCSV(writer, movieList)
		}

	case "without":
		var allMovieList [][]Movie
		for _, genre := range genres {
			returnMovieList := scrapeMoviesWithoutGoroutines(genre)
			allMovieList = append(allMovieList, returnMovieList)
		}

		elapsedTime = time.Since(startTime)

		for _, movieList := range allMovieList {
			writeMoviesToCSV(writer, movieList)
		}

	default:
		fmt.Println("Invalid input. Please enter 'with' or 'without'.")
	}
  • In the case of “with” the scraping is done using goroutines. Prefixing a function call with go launches it as a goroutine, allowing concurrent execution without waiting for it to complete. Before each call we increment the counter with wg.Add(1), then pass the WaitGroup, the channel ch, and the genre as arguments to scrapeMoviesWithGoroutines. A separate anonymous goroutine waits for all the scrapers to finish, records the elapsed time, and closes the channel; meanwhile the main goroutine ranges over ch and writes each batch of movies to the CSV as it arrives.
  • In the case of “without” the scraping is done with a plain for loop that iterates over the genres list. scrapeMoviesWithoutGoroutines only receives a genre as an argument. We collect each returned list into a slice of Movie slices, record the elapsed time, and then pass each list to writeMoviesToCSV.
  • The default case is executed if none of the previous cases match, i.e. if scraperType is neither “with” nor “without”. When this happens the program prints a message via fmt.Println, handling unexpected or invalid input and giving feedback to the user. Since scraperType is a constant here, this branch is only reachable if you change how the mode is chosen; the sketch after this list shows one way to pick it at runtime.
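As written, scraperType is a compile-time constant, so the default branch can never actually fire. A small hypothetical variation (not in the original code) reads the mode from a command-line flag instead, which makes all three branches reachable:

package main

import (
	"flag"
	"fmt"
)

func main() {
	// Hypothetical variation: choose the mode at runtime, e.g.
	//   go run . -mode=with
	mode := flag.String("mode", "without", "scrape 'with' or 'without' goroutines")
	flag.Parse()

	switch *mode {
	case "with":
		fmt.Println("would scrape with goroutines")
	case "without":
		fmt.Println("would scrape without goroutines")
	default:
		fmt.Println("Invalid input. Please enter 'with' or 'without'.")
	}
}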

— In the final part of our main function we print the elapsedTime and close the function

	fmt.Printf("Time Taken: %v\n", elapsedTime)
}

— Creating our helper function and the scraping functions

func writeMoviesToCSV(writer *csv.Writer, movieList []Movie) {
	for _, movie := range movieList {
		if err := writer.Write([]string{
			movie.Genre,
			movie.Title,
			movie.Year,
			movie.QualityLevels,
			movie.MovieLink,
			movie.ImageURL,
		}); err != nil {
			fmt.Println("Error writing data to CSV: ", err)
		}
	}
}

func scrapeMoviesWithGoroutines(wg *sync.WaitGroup, genre string, ch chan<- []Movie) {
	defer wg.Done() // signal the WaitGroup when this goroutine exits

	c := colly.NewCollector()

	movieList := []Movie{}

	newURL := fmt.Sprintf("%s-%s", URL, genre)

	c.OnHTML("div.dflex", func(e *colly.HTMLElement) {
		e.ForEach("div > a", func(_ int, a *colly.HTMLElement) {
			var movie Movie
			movie.Genre = genre
			movie.Title = a.ChildText("span.mtl")
			movie.MovieLink = a.Attr("href")
			movie.ImageURL = a.ChildAttr("img", "src")
			movie.QualityLevels = a.ChildText("span.hda")
			movie.Year = a.ChildText("span.hdy")
			movieList = append(movieList, movie)
		})
	})

	c.OnRequest(func(r *colly.Request) {
		fmt.Println("Visiting: ", r.URL)
	})

	// log.Fatal inside a goroutine would kill the whole program,
	// so log the error and send whatever was collected.
	if err := c.Visit(newURL); err != nil {
		log.Println("Error visiting", newURL, ":", err)
	}

	ch <- movieList
}

func scrapeMoviesWithoutGoroutines(genre string) []Movie {
	c := colly.NewCollector()

	movieList := []Movie{}

	newURL := fmt.Sprintf("%s-%s", URL, genre)

	c.OnHTML("div.dflex", func(e *colly.HTMLElement) {
		e.ForEach("div > a", func(_ int, a *colly.HTMLElement) {
			var movie Movie
			movie.Genre = genre
			movie.Title = a.ChildText("span.mtl")
			movie.MovieLink = a.Attr("href")
			movie.ImageURL = a.ChildAttr("img", "src")
			movie.QualityLevels = a.ChildText("span.hda")
			movie.Year = a.ChildText("span.hdy")
			movieList = append(movieList, movie)
		})
	})

	c.OnRequest(func(r *colly.Request) {
		fmt.Println("Visiting: ", r.URL)
	})

	// Log the error but still return any movies collected so far.
	if err := c.Visit(newURL); err != nil {
		log.Println("Error visiting", newURL, ":", err)
	}

	return movieList
}
  • writeMoviesToCSV: This function receives a csv.Writer and a list of Movies; it iterates through the list, writes each entry to our CSV file using the writer, and prints an error if one occurs while writing.
  • scrapeMoviesWithGoroutines: wg *sync.WaitGroup is a pointer to a sync.WaitGroup, used for synchronization so that the program can wait until all goroutines finish their work; genre string is the genre of movies to be scraped; ch chan<- []Movie is a send-only channel into which the scraped movie data is sent. The defer statement ensures wg.Done() is called when the function exits, decrementing the WaitGroup counter and signalling that the goroutine has completed. We then create a new instance of the colly collector, create an empty list of Movies, and construct our URL. c.OnHTML("div.dflex", func(e *colly.HTMLElement) {...}) registers an event handler that fires whenever an HTML element matching the selector “div.dflex” is encountered; it iterates over the child elements “div > a” and extracts each movie’s data. c.Visit initiates the scraping process by visiting the constructed URL, which triggers the registered handlers. If the visit fails we log the error, and finally we send the scraped movie data to the channel for further processing. (If colly’s event model is new to you, see the minimal example after this list.)
  • scrapeMoviesWithoutGoroutines: Same logic as scrapeMoviesWithGoroutines, but instead of sending the scraped data into a channel we simply return the list of Movies.
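For readers new to colly, its event flow is easier to see on a tiny self-contained example: handlers are registered up front, and Visit drives them. The URL and the h1 selector here are just for illustration.

package main

import (
	"fmt"

	"github.com/gocolly/colly"
)

func main() {
	c := colly.NewCollector()

	// Fires for every element matching the selector.
	c.OnHTML("h1", func(e *colly.HTMLElement) {
		fmt.Println("Heading:", e.Text)
	})

	// Fires before each request is made.
	c.OnRequest(func(r *colly.Request) {
		fmt.Println("Visiting:", r.URL)
	})

	// Visit blocks until the page has been fetched and all
	// matching handlers have run.
	if err := c.Visit("https://example.com"); err != nil {
		fmt.Println("visit failed:", err)
	}
}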

With that, we are now ready to run our file with the different cases and compare their runtimes.

1. Let’s start without goroutines and record the runtime.

[Screenshot: runtime without Go routines]

2. Now with goroutines:

[Screenshot: runtime with Go routines]

Conclusion

The significant contrast in time illustrates the power of goroutines in concurrent execution. In the scenario with goroutines, the script concurrently visits multiple URLs for different genres, leveraging the concurrent processing capability of goroutines. This results in a substantial reduction in the total execution time compared to the non-goroutine version, where each URL is visited sequentially, causing a considerable increase in the overall time.

Advantages Of Goroutines:

  • Concurrency: Goroutines enable concurrent execution, allowing multiple tasks to progress simultaneously. This is particularly advantageous for tasks like web scraping, where waiting for network requests can be a bottleneck.
  • Efficiency: Goroutines are lightweight, and their overhead is minimal. They efficiently manage concurrent tasks, making them suitable for scenarios with high levels of concurrency.
  • Non-Blocking: Goroutines operate independently and are non-blocking, meaning that the execution of one goroutine doesn’t hinder the progress of others. This makes them well-suited for scenarios where tasks can be performed concurrently without waiting for each other.
  • Improved Performance: The concurrent nature of goroutines can significantly enhance the performance of applications by utilizing available resources more effectively.

In summary, the time comparison demonstrates that utilizing goroutines for concurrent tasks, such as web scraping, can lead to substantial improvements in efficiency and performance. Go’s concurrency model, with its simplicity and efficiency, makes it a powerful tool for handling concurrent tasks, contributing to faster and more responsive applications.
Hope we’ve all learnt something today. Thank you for reading!
