Implementing Akka Crawler

Mladen Bolic
5 min read · Nov 22, 2018

Introduction

Building thread-safe, highly concurrent applications in Java can be hard. As a developer you need to take care of thread synchronization, deadlocks and race conditions, and the threading boilerplate this introduces makes such applications hard to read and maintain. Luckily, Akka can help us manage application concurrency, thereby reducing the overall complexity of the code.

So what is Akka?

Akka

Akka is a set of libraries for building highly scalable, resilient and responsive applications using the Actor Model. The actor is the main building block of an Akka application. By using actors, developers are freed from the responsibility of dealing explicitly with threads and thread management.

Actors are very lightweight and there can be several million actor instances running within a single application.

Actors communicate with each other by exchanging asynchronous messages. We can think of an actor’s messages as commands and events that are exchanged between actors. In a way, messages reveal an actor’s public API, so it is important to use descriptive names and rich semantics to describe their role inside an actor-based system. Since messages are shared between different threads, they should be immutable.

So how do we start building an Akka application?

We start by creating one or more actors using an ActorSystem. All other actors should be created by the same ActorSystem or by one of its child actors.

Additionally, every actor has a createReceive method that defines the types of messages it is going to handle.

As for common programming practices, it’s usual to keep an actor and its messages as close together as possible, i.e. messages are defined as static nested classes inside the actor. This makes it easier to understand how the actor handles its messages and its business logic.
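
To make this convention concrete, here is a minimal sketch in plain Java with no Akka dependency. The message names DownloadFile, FileDownloadResult and FileDownloadError are the ones used later in this article, while the outer class name, the fields and the accessors are assumptions made for illustration. Every field is final and set once in the constructor, so an instance can be shared between threads safely.

```java
import java.util.Objects;

// Stand-in for an actor class; it only shows where the messages would live.
final class FileDownloadMessagesSketch {

    /** Command: asks the actor to download the given url. */
    static final class DownloadFile {
        private final String url;

        DownloadFile(String url) {
            this.url = Objects.requireNonNull(url, "url");
        }

        String getUrl() { return url; }
    }

    /** Event: the file was downloaded and saved at the given path (assumed fields). */
    static final class FileDownloadResult {
        private final String url;
        private final String path;

        FileDownloadResult(String url, String path) {
            this.url = Objects.requireNonNull(url, "url");
            this.path = Objects.requireNonNull(path, "path");
        }

        String getUrl() { return url; }
        String getPath() { return path; }
    }

    /** Event: the download of the given url failed (assumed field). */
    static final class FileDownloadError {
        private final String url;

        FileDownloadError(String url) {
            this.url = Objects.requireNonNull(url, "url");
        }

        String getUrl() { return url; }
    }
}
```

Because the classes have no setters and no mutable fields, there is nothing a receiving actor could change behind the sender's back.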

Akka enforces parental supervision, hence every actor is supervised by its parent. This means that if an error occurs in a child actor, its parent decides how to handle the error. It can resume, stop or restart the child actor, or escalate the error up the hierarchy.
This is one of the main features of Akka, and it allows us to build fault-tolerant applications.

Ok, now that we know what Akka is, let’s start implementing our crawler.

Implementation

Our goal will be to create a website crawler that traverses a website and downloads all its files to the local file system.

Let’s start by describing the overall architecture of our application.

Our Akka crawler will use three types of actors: SupervisorActor, FileDownloadActor and LinkExtractActor.

SupervisorActor supervises the other two actors and holds the current crawl status. FileDownloadActor is responsible for downloading a file and saving it to the local file system. LinkExtractActor reads the content of a downloaded file, extracts the links from it and sends the list of urls back to SupervisorActor. SupervisorActor receives the list of urls, adds them to the list of remaining urls and distributes the new urls to FileDownloadActor. Crawling stops when all urls are processed and all website files are downloaded to the local file system.
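
The bookkeeping described above can be sketched in plain Java, independent of Akka. The class and method names below are hypothetical, and skipping urls that were already seen is an assumption added here so that the crawl terminates on pages that link to each other. Since an actor processes one message at a time, such state needs no synchronization when it lives inside the actor.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical crawl status that a supervising actor could hold as a field.
final class CrawlStateSketch {
    private final Set<String> seen = new HashSet<>();      // every url queued so far
    private final Set<String> remaining = new HashSet<>(); // queued but not yet processed

    /** Registers discovered urls and returns only those that were not seen before. */
    List<String> addNew(Collection<String> urls) {
        List<String> fresh = new ArrayList<>();
        for (String url : urls) {
            if (seen.add(url)) { // add() returns false for duplicates
                remaining.add(url);
                fresh.add(url);
            }
        }
        return fresh;
    }

    /** Marks a url as fully processed (downloaded and scanned, or failed). */
    void markProcessed(String url) {
        remaining.remove(url);
    }

    /** True when no queued url is still waiting to be processed. */
    boolean isFinished() {
        return remaining.isEmpty();
    }

    int totalSeen() {
        return seen.size();
    }
}
```

Only the urls returned by addNew would be forwarded for download, and the crawl is complete once isFinished returns true after the last markProcessed call.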

Supervisor Actor

As stated before, SupervisorActor is responsible for orchestrating FileDownloadActor and LinkExtractActor and delegating work to them.

We trigger the website crawl by sending a StartCrawling message to our SupervisorActor. The StartCrawling message contains the HTTP link of the website we are going to crawl and whose content we are going to download.

Upon receiving the message, SupervisorActor will save the url to the list of remaining urls and pass the url to FileDownloadActor.

File Download Actor

FileDownloadActor receives the DownloadFile message sent by SupervisorActor, reads the url from it and downloads the file from that url. Once the download is done, it notifies SupervisorActor that the file is downloaded and sends back the location where the file is saved.
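
The download step itself needs nothing beyond the JDK. The following is a minimal sketch rather than the code from the example project: the class name, the method names and the rule for mapping a url to a local path (host name as the top folder, index.html for directory urls) are all assumptions.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Objects;

final class FileDownloadSketch {

    /** Maps a url to a file below root, e.g. http://example.com/css/a.css to root/example.com/css/a.css. */
    static Path localPathFor(URI url, Path root) {
        String host = Objects.requireNonNull(url.getHost(), "url must contain a host");
        String path = url.getPath() == null ? "" : url.getPath();
        if (path.isEmpty() || path.endsWith("/")) {
            path = path + "index.html"; // assumed file name for directory urls
        }
        while (path.startsWith("/")) {
            path = path.substring(1);   // make it relative so resolve() appends it
        }
        Path hostDir = root.resolve(host).normalize();
        Path target = hostDir.resolve(path).normalize();
        if (!target.startsWith(hostDir)) {
            throw new IllegalArgumentException("url escapes the download folder: " + url);
        }
        return target;
    }

    /** Downloads the url and returns the location where the file was saved. */
    static Path download(URI url, Path root) throws IOException {
        Path target = localPathFor(url, root);
        Files.createDirectories(target.getParent());
        try (InputStream in = url.toURL().openStream()) {
            Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        }
        return target;
    }
}
```

Inside the actor, the returned path would be wrapped in the reply message that goes back to SupervisorActor.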

There are two strategies we can use to handle errors during the execution of an Akka actor. We can propagate the thrown error to the parent actor and let the parent handle it according to its supervisor strategy, or we can send an error message to the parent actor.

For file downloads we will use the latter solution and send FileDownloadError if something goes wrong.
Using an error message allows us to send additional data to the parent actor. In our case we will send the url within the message, which lets us keep track of all file urls that couldn’t be handled by FileDownloadActor.
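
The idea of turning a failure into a message can be sketched without Akka. In the snippet below the reply is simply returned instead of being sent to a parent, the downloader is passed in as a function so the logic can be exercised without network access, and the reason field on FileDownloadError is an assumption that is not taken from the example project.

```java
import java.util.function.Function;

final class DownloadReplySketch {

    static final class FileDownloadResult {
        final String url;
        final String path;

        FileDownloadResult(String url, String path) {
            this.url = url;
            this.path = path;
        }
    }

    static final class FileDownloadError {
        final String url;
        final String reason; // assumed extra detail for logging

        FileDownloadError(String url, String reason) {
            this.url = url;
            this.reason = reason;
        }
    }

    /** Runs the download and converts the outcome into the message the parent would receive. */
    static Object replyFor(String url, Function<String, String> downloader) {
        try {
            return new FileDownloadResult(url, downloader.apply(url));
        } catch (RuntimeException e) {
            // The failure is not rethrown, so the actor keeps running and the parent learns which url failed.
            return new FileDownloadError(url, String.valueOf(e.getMessage()));
        }
    }
}
```

The parent can then collect the url of every FileDownloadError it receives and report the failed downloads at the end of the crawl.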

Link Extract Actor

Once a file is downloaded, SupervisorActor receives the FileDownloadResult message, reads the file path from it and calls LinkExtractActor, passing it the path inside an ExtractLinks message.

LinkExtractActor will receive the message, read the file content, extract all links (JavaScript, CSS, images) and send back the result to the SupervisorActor.
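
How the links are extracted is not shown here, so the following is only one possible sketch. It uses a regular expression over href and src attributes, which is fragile compared to a real HTML parser, and it keeps only links on the same host as the base url. Both choices, as well as the class and method names, are assumptions.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LinkExtractSketch {
    // Matches href="..." or src="..." with single or double quotes, case-insensitively.
    private static final Pattern LINK =
            Pattern.compile("(?i)\\b(?:href|src)\\s*=\\s*([\"'])(.*?)\\1");

    /** Returns absolute, de-duplicated links found in html that point to the same host as base. */
    static List<String> extractLinks(String html, URI base) {
        Set<String> links = new LinkedHashSet<>(); // keeps first-seen order
        Matcher m = LINK.matcher(html);
        while (m.find()) {
            String raw = m.group(2).trim();
            if (raw.isEmpty() || raw.startsWith("#") || raw.startsWith("javascript:")
                    || raw.startsWith("mailto:") || raw.startsWith("data:")) {
                continue; // nothing to download for these
            }
            try {
                URI resolved = base.resolve(raw); // turns relative links into absolute ones
                if (base.getHost() != null && base.getHost().equalsIgnoreCase(resolved.getHost())) {
                    links.add(resolved.toString());
                }
            } catch (IllegalArgumentException e) {
                // Malformed link: skipped in this sketch.
            }
        }
        return new ArrayList<>(links);
    }
}
```

The base url passed in should be the url of the page the file was downloaded from, so that relative links resolve against the right folder.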

SupervisorActor will add new urls to the list of remaining urls and pass them to FileDownloadActor for further processing.

If an error occurs during link extraction, we are going to notify the parent actor about it, but this time we will let the child actor fail (i.e. throw LinkExtractException) instead of sending an error message. This way we let the parent actor handle the error according to its SupervisorStrategy. In our case, SupervisorActor will match the exception and restart the actor, thereby clearing its internal state.
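
The decision the parent makes can be modelled as a plain function from a failure to a directive. The sketch below is not Akka's SupervisorStrategy API. It only mirrors the four outcomes listed earlier (resume, restart, stop, escalate), and escalating every failure other than LinkExtractException is an assumption.

```java
final class SupervisionSketch {

    /** The four ways a parent can react to a failing child. */
    enum Directive { RESUME, RESTART, STOP, ESCALATE }

    /** Thrown by the link extraction step; the constructor shape is assumed. */
    static final class LinkExtractException extends RuntimeException {
        LinkExtractException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    /** Chooses a directive based on the type of the failure. */
    static Directive decide(Throwable failure) {
        if (failure instanceof LinkExtractException) {
            return Directive.RESTART; // a fresh instance starts with clean internal state
        }
        return Directive.ESCALATE;    // assumed default: let the next level decide
    }
}
```

In a real Akka application the same mapping would be expressed in the strategy that the parent actor exposes, and Akka would apply the chosen directive to the child.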

Once we process all urls, we can print the final crawling result.

Now, if we run the application, everything should work as expected.
SupervisorActor will receive the StartCrawling message and start delegating work to FileDownloadActor and LinkExtractActor. FileDownloadActor and LinkExtractActor will do their job: download files and extract links. But can we do more? Can we speed things up?

The answer is “Yes”. Let us see how to do this.

Routing

If you remember, we are using ActorSystem to initialize SupervisorActor. At the same time we define how FileDownloadActor should be created. The same is true for LinkExtractActor, but we will omit its code here to keep our example readable.

If you take a look at the getFileDownloadActorCreator method, you will see that we are creating a single actor instance for handling file downloads. This is not very efficient. To speed things up, we will introduce a Round Robin Router.

By using the Round Robin Router, the actor system can distribute messages to multiple actor instances, allowing them to work in parallel to get the job done.

In our case, SupervisorActor will distribute DownloadFile messages to multiple instances of FileDownloadActor. The rest of the implementation will stay the same.
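
The selection rule behind round-robin routing is simple enough to sketch in plain Java. Akka ships its own router for this, so the class below is not something you would write in an Akka application. It only illustrates how consecutive messages end up on different instances, and all names in it are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

final class RoundRobinSketch<T> {
    private final List<T> routees;
    private final AtomicInteger next = new AtomicInteger();

    RoundRobinSketch(List<T> routees) {
        if (routees.isEmpty()) {
            throw new IllegalArgumentException("at least one routee is required");
        }
        this.routees = new ArrayList<>(routees); // defensive copy
    }

    /** Returns the routees one after another, wrapping around at the end of the list. */
    T pick() {
        // floorMod keeps the index non-negative even after the counter overflows.
        int index = Math.floorMod(next.getAndIncrement(), routees.size());
        return routees.get(index);
    }
}
```

With three routees, five consecutive calls to pick return the first, second, third, first and second routee, so no single instance becomes the bottleneck while the others are idle.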

So what we basically did is speed up our application by letting several actor instances handle the job instead of only one, and all that by changing one line of code. Pretty cool, right?

Conclusion

As we saw in the previous example, the Akka Actor Model is a very powerful tool for building reactive, high-performance, concurrent applications. It frees developers from the responsibility of handling threading issues explicitly and lets them concentrate on writing business logic.

By introducing routing we get the additional benefit of speeding up the application, because the actor system distributes the work among several actor instances instead of using a single actor for the same job.

The full source code of the example is available on GitHub. For more info on how to run the example, please see the README file.