The Tech Behind BlockBust

BlockBust
Mar 4, 2016

There has been an ongoing battle between users and publishers since the advent of digital advertising. Publishers want to serve ads to generate revenue, and site users want to block ads for a more pleasant viewing experience. Although ad blockers have been around for quite some time, their use was not as widespread as it is today. Publishers of all sizes have watched the revenue that supports their sites shrink steadily, threatening their businesses.

Since we work at an ad tech company, this dilemma was all too familiar to us.

During one of our company-sponsored hackathons, we decided to make a tool to do something about it. From our perspective, publishers are typically non-technical and do not understand why ad blockers bar their content. To help them better understand, we created http://blockbust.io, a web service that scans your site and identifies issues that would cause an ad blocker to stop resources from loading or remove pieces of HTML. We believe this tool will help guide publishers by showing them which parts of their site are affected by ad blockers and by recommending changes, when possible, so that ad blockers cannot prevent their ads from displaying.

Go

From the start, we knew that we wanted to use Go for the backend process that checks a site’s HTML against known block lists. We realized this was a massively parallelizable problem, and Go is well suited for concurrent processing. We also wanted to make our code as modular as possible by completely separating the frontend from the backend and having them communicate over HTTP, which Go is also designed to do very efficiently. Since we were building this during a hackathon and speed was of the essence, we also needed something that was easy to build and deploy. Go met all of these requirements.
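For illustration, here is a rough sketch of that fan-out pattern: checking each resource URL found on a page against block-list rules in its own goroutine. The rule matching here is a naive substring check and the names (`isBlocked`, the sample rules) are placeholders, not the actual BlockBust code.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// isBlocked is a stand-in for real block-list matching (e.g. EasyList-style
// rules); it only does a naive substring check for illustration.
func isBlocked(resourceURL string, rules []string) bool {
	for _, rule := range rules {
		if strings.Contains(resourceURL, rule) {
			return true
		}
	}
	return false
}

func main() {
	rules := []string{"/ads/", "doubleclick.net", "adserver"}
	resources := []string{
		"https://example.com/ads/banner.js",
		"https://example.com/static/app.js",
		"https://cdn.doubleclick.net/tag.js",
	}

	var wg sync.WaitGroup
	results := make(chan string, len(resources))

	// Check each resource in its own goroutine; the real service fans work
	// out across the resources found on a page in the same way.
	for _, r := range resources {
		wg.Add(1)
		go func(resourceURL string) {
			defer wg.Done()
			if isBlocked(resourceURL, rules) {
				results <- resourceURL
			}
		}(r)
	}
	wg.Wait()
	close(results)

	for blocked := range results {
		fmt.Println("would be blocked:", blocked)
	}
}
```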

Microservice container architecture

Because BlockBust was a greenfield project without dependencies on other code bases, we had the opportunity to use some new pieces of technology. We had been testing Docker containers for some time but had not yet used them in production.

The appeal of running BlockBust inside Docker containers was twofold: 1) we wanted to use the best tool for the job for the different pieces of our architecture, which require drastically different dependencies, and 2) we wanted to independently scale each component of our architecture based on the load it was experiencing. The frontend of the application is HTML and JavaScript, and the primary backend is written in Go. Initially we used Go to fetch the HTML, but we soon realized that we needed to execute the JavaScript on the page rather than just grab the source HTML. Many websites use JavaScript to place additional elements on the page, so if we only used Go’s http.Get to fetch the HTML, we might report incomplete information about the state of blocked items on the page.
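To make the limitation concrete, here is a minimal sketch of fetching a page with plain net/http. It is not the BlockBust code, just an illustration: whatever scripts inject after load (ad slots, trackers, iframes) never appears in this output, which is why a real browser engine was needed.

```go
package main

import (
	"fmt"
	"io/ioutil"
	"log"
	"net/http"
)

// fetchHTML returns only the raw page source. Elements injected by
// JavaScript after load will not appear in the result.
func fetchHTML(url string) (string, error) {
	resp, err := http.Get(url)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()

	body, err := ioutil.ReadAll(resp.Body)
	if err != nil {
		return "", err
	}
	return string(body), nil
}

func main() {
	html, err := fetchHTML("https://example.com")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(len(html), "bytes of source HTML (pre-JavaScript)")
}
```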

Enter Phantom JS

Conveniently, one of our other developers had just finished creating a NodeJS and Phantom JS web scraper for another project. He graciously added a simple Express front end, and we were off to the races. However, adding NodeJS to our stack was going to make our deployment more complicated than simply uploading a new Go binary and updated HTML and JavaScript to our production environment.

Our initial solution was to containerize the NodeJS application. We would then use a subprocess to run a new ephemeral container, passing in the URL from which we wanted the HTML. The scraper would return its output on stdout.
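A rough sketch of that subprocess approach, assuming a hypothetical image name `blockbust/scraper`: the backend shells out to `docker run --rm`, passes the URL as an argument, and captures whatever the scraper writes to stdout.

```go
package main

import (
	"fmt"
	"log"
	"os/exec"
)

// scrape runs a throwaway container from a (hypothetical) scraper image,
// passes it the target URL, and returns the scraper's stdout.
// "--rm" makes the container ephemeral: it is removed as soon as it exits.
func scrape(url string) (string, error) {
	cmd := exec.Command("docker", "run", "--rm", "blockbust/scraper", url)
	out, err := cmd.Output()
	if err != nil {
		return "", err
	}
	return string(out), nil
}

func main() {
	html, err := scrape("https://example.com")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(len(html), "bytes of rendered HTML from the container")
}
```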

This approach worked well, but we quickly realized that we would need to heavily scale out this portion of our stack, because a thorough scrape can take anywhere from 5 to 20 seconds depending on the size of a given website’s resources. We could have gone the traditional route and simply brought up many instances of our Go backend that used the Dockerized scraper, but we felt this was an inefficient use of computing resources.

Docker

At this point, we decided we should containerize every component of our stack. This provided several advantages as the project grew and required additional services. A huge benefit was that development and production environments were identical except for the environment variables used to configure how the application connects to resources such as memcached. For example, in development memcached is another container, and in production it is an Elasticache node.

Using containers also made it simpler for another developer to bring up their environment. If they already had Docker installed, they only needed to run `docker-compose up` to test the entire stack. In the past we used Vagrant and Ansible extensively. They simplified the creation of a local development environment, but it could take a considerable amount of time to download the Vagrant box and/or provision it with Ansible. Docker images are generally much faster to create or download. Additionally, Vagrant boxes provisioned via Ansible at different times were not identical, especially if all dependencies were not pinned down. That isn’t a problem until it is.
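As a sketch of that dev/prod switch, assuming a hypothetical `MEMCACHED_HOST` variable and the bradfitz/gomemcache client: the application code is identical in both environments, and only the value of the variable changes.

```go
package main

import (
	"log"
	"os"

	"github.com/bradfitz/gomemcache/memcache"
)

func main() {
	// MEMCACHED_HOST is a hypothetical variable name: in development it might
	// point at a linked container ("memcached:11211"), in production at an
	// Elasticache node. Nothing else in the application changes.
	host := os.Getenv("MEMCACHED_HOST")
	if host == "" {
		host = "memcached:11211"
	}

	mc := memcache.New(host)
	if err := mc.Set(&memcache.Item{Key: "ping", Value: []byte("pong")}); err != nil {
		log.Fatalf("memcached not reachable at %s: %v", host, err)
	}
	log.Printf("connected to memcached at %s", host)
}
```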

Running containers inside ECS

After containerizing the entire stack, we began to think about deploying it. As an AWS shop, we had heard about Amazon’s Elastic Container Service (ECS) at re:Invent 2015 and were eager to try it out. We were not disappointed. It allowed us to seamlessly deploy our frontend and backend code bases with minimal coordination. Having a microservice architecture and containerizing it forced us to decouple everything. The result? A clear separation of concerns.

Deploys were as simple as pushing a new Docker image to our private ECS repository and then updating our ECS service definition to use that new image. ECS handled rolling updates of the containers in a configurable way and ensured they passed the ELB health check before making them active servers (much like EC2 nodes in an autoscaling group).
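For anyone scripting that step, here is a sketch of pointing an ECS service at a new task definition revision with the AWS SDK for Go; the cluster, service, and task definition names are placeholders, and the same update can be done from the console or the AWS CLI as described below.

```go
package main

import (
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ecs"
)

func main() {
	svc := ecs.New(session.Must(session.NewSession()))

	// Placeholder names: point the service at a new task definition revision
	// that references the freshly pushed image. ECS then performs the rolling
	// update and health checking described above.
	_, err := svc.UpdateService(&ecs.UpdateServiceInput{
		Cluster:        aws.String("blockbust-cluster"),
		Service:        aws.String("blockbust-backend"),
		TaskDefinition: aws.String("blockbust-backend:42"),
	})
	if err != nil {
		log.Fatal(err)
	}
	log.Println("rolling update started")
}
```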

Another benefit of ECS and immutable images was that it was very easy to roll back to the previous image version if we happened to find a bug in production. We could do this from the ECS UI or with the AWS CLI tools.

Limitations

One unfortunate downside is that there is currently no way to automatically scale out the number of container instances. The cluster that your containers run on can be an autoscaling group, but scaling the cluster does not automatically place new containers on the newly added machines.

We hit another limitation when placing an Elastic Load Balancer in front of our containers. We could only run a single instance of a load-balanced container per EC2 instance in our cluster, because ELB can only map a single external port to a single internal port on an instance. This prevented us from leveraging Docker’s ability to bind to any high-level port on the host OS.

Leveraging Lambda

Everything worked well during the testing and development phases. However, when we asked the company for help with testing, we ran into a memory issue inside our Phantom JS containers. Phantom JS is essentially a headless WebKit browser, and like any full browser engine it is a memory hog. Our scrapers would work fine until each one was handling too many concurrent requests, at which point the container would hit its memory limit and crash. ECS would bring up a new container automatically (which was a delightful surprise), but this approach wouldn’t scale in production.

One of our ambitious developers found a solution by creating a quick proof of concept of the scraper running inside AWS Lambda. After a bit of tweaking, we were able to get the Lambda version to return the same results as the standalone scraper.

With Lambda, we didn’t have to worry about scaling out this portion of our architecture. Without it, we would have needed to implement a job queue for the scraping, which would have complicated the process. The only change required in our backend to support moving the scraper to Lambda was updating the environment variable for the scraper host URL.
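A sketch of why that swap was so cheap, assuming a hypothetical `SCRAPER_HOST` variable and a `?url=` query parameter: the backend only knows it calls a scraper endpoint over HTTP, so pointing the variable at the Lambda-backed endpoint instead of the container leaves the calling code untouched.

```go
package main

import (
	"fmt"
	"io/ioutil"
	"log"
	"net/http"
	"net/url"
	"os"
)

// renderPage asks the scraper service for the fully rendered HTML of target.
// SCRAPER_HOST is a hypothetical variable: it pointed at the containerized
// scraper before the move and at the Lambda-backed endpoint after it.
func renderPage(target string) (string, error) {
	host := os.Getenv("SCRAPER_HOST")
	resp, err := http.Get(host + "?url=" + url.QueryEscape(target))
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()

	body, err := ioutil.ReadAll(resp.Body)
	if err != nil {
		return "", err
	}
	return string(body), nil
}

func main() {
	html, err := renderPage("https://example.com")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(len(html), "bytes of rendered HTML")
}
```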

Conclusion

Overall, we learned a lot from this project that we can apply to current and future endeavors. Despite our widespread internal use of provisioning tools such as Ansible, we found working with Docker’s immutable infrastructure to be a pleasurable experience. It gave us a high level of confidence that if something worked locally in our development environment, it would very likely work in production, because the images running the system were identical. As an added benefit, it made rolling back changes dead simple, and when a container crashed, ECS knew exactly how to spin up a new one in a timely fashion.

We also found a huge benefit in dividing our architecture into microservices rather than one gigantic service. It allowed us to deploy components independently because their implementations were decoupled, which is always a programming best practice. The other benefit of a microservice architecture is that our components can scale independently on different platforms. If we ever experience hockey-stick growth, we know we will only need to scale the components doing the heavy lifting (in our case, probably the web scraper).
