The Tech Behind BlockBust
There has been an ongoing battle between users and publishers since the advent of digital advertising. Publishers want to serve ads to generate revenue, and site users want to block ads for a more pleasant viewing experience. Although ad blockers have been around for quite some time, their use was not as widespread as it is today. Publishers of all sizes have watched the revenue that supports their sites shrink steadily, threatening their businesses.
Since we work at an ad tech company, this dilemma was all too familiar to us.
During one of our company-sponsored hackathons, we decided to make a tool to do something about it. From our perspective, publishers are typically non-technical and do not understand why ad blockers bar their content. To help them better understand, we created http://blockbust.io, a web service that scans your site and identifies issues that would cause an ad blocker to stop resources from loading or remove pieces of HTML. We believe that this tool will help guide publishers by showing them which parts of their site are affected by ad blockers and, when possible, recommending changes to the site so that ad blockers cannot prevent their ads from displaying.
From the start, we knew that we wanted to use Go for the backend process that checks a site’s HTML against known block lists. We realized that this was a massively parallelizable problem, and Go is well suited for concurrent processing. We also wanted to make our code as modular as possible by completely separating the front-end from the back-end and having them communicate over HTTP, which Go is also designed to do very efficiently. Since we were making this during a hackathon and speed was of the essence, we also needed something that was easy to build and deploy. Go met all these requirements.
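To illustrate why this kind of check parallelizes so naturally, here is a minimal sketch, not the actual BlockBust code. It assumes the resources found on a page are plain URL strings and that a block list is a set of substrings; real ad-blocker filter lists use a richer rule syntax. The function name `matchBlocked` and the sample URLs are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// matchBlocked reports, for each resource URL, whether it contains any of
// the block-list substrings. Every URL is checked in its own goroutine.
// Assumption: plain substring matching stands in for real filter rules.
func matchBlocked(urls, patterns []string) []bool {
	blocked := make([]bool, len(urls))
	var wg sync.WaitGroup
	for i, u := range urls {
		wg.Add(1)
		go func(i int, u string) {
			defer wg.Done()
			for _, p := range patterns {
				if strings.Contains(u, p) {
					// Each goroutine writes only to its own slot,
					// so no extra locking is needed.
					blocked[i] = true
					return
				}
			}
		}(i, u)
	}
	wg.Wait() // block until every URL has been checked
	return blocked
}

func main() {
	urls := []string{
		"https://example.com/static/app.js",
		"https://example.com/ads/banner.js",
	}
	patterns := []string{"/ads/", "doubleclick.net"}
	for i, b := range matchBlocked(urls, patterns) {
		fmt.Printf("%s blocked=%v\n", urls[i], b)
	}
}
```

Because each URL is independent of the others, the work fans out cleanly and the only synchronization point is the final `wg.Wait()`.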
Microservice container architecture
Because BlockBust was a greenfield project without dependencies on other code bases, we had the opportunity to use some new pieces of technology. We had been testing Docker containers for some time but had not yet used them in production.
Enter Phantom JS
Initially, we decided instead to containerize the NodeJS application. We would then use a subprocess to run a new ephemeral container, passing it the URL whose HTML we wanted. The scraper would then return its output on stdout.
This approach worked well, but we quickly realized that we would need to heavily scale out this portion of our software stack because a thorough scrape can take anywhere from 5 to 20 seconds, depending on the size of a given website's resources. We could have gone the traditional route and just brought up many instances of our Go backend, each using the Dockerized scraper, but we felt this was an inefficient use of computing resources.
At this point, we decided we should containerize each component of our entire stack. This provided several advantages as the project grew and required additional services. A huge benefit was that development and production environments were identical except for the environment variables used to configure how the application connects to resources such as memcached. For example, in development memcached is another container, and in production it is an ElastiCache node. Using containers also made it simpler for another developer to bring up their environment. If they already had Docker installed, they only needed to run `docker-compose up` and they could test the entire stack. In the past we used Vagrant and Ansible extensively. They simplified the creation of a local development environment, but it could take a considerable amount of time to download the Vagrant box and/or provision the box with Ansible. With Docker, creating or downloading the images is generally much faster. Additionally, Vagrant boxes provisioned via Ansible at different times were not identical, especially if not all dependencies were pinned. This isn’t a problem until it is.
Running containers inside ECS
After containerizing the entire stack, we began to think about deploying it. As an AWS shop, we heard about Amazon’s Elastic Container Service at re:Invent 2015 and were eager to try it out. We were not disappointed. It allowed us to deploy our frontend and backend code bases seamlessly and with minimal coordination. Having a microservice architecture and containerizing it forced us to decouple everything. The result? A clear separation of concerns.
Deploys were as simple as pushing a new Docker image to our ECS private repository and then updating our ECS service definition to use that new image. ECS handled rolling updates of the containers in a configurable way and ensured they passed the ELB health check before making them active servers (much like EC2 nodes in an autoscaling group).
Another benefit of ECS and immutable images was that it was very easy to roll back to the previous image version if we happened to find a bug in production. We could do this from within the ECS UI or with the AWS CLI tools.
One unfortunate downside is that currently there’s no way to scale out the number of running containers automatically. The cluster that your containers are running on could be an autoscaling group, but scaling the cluster would not automatically place new containers on the newly added instances.
We met another limitation when placing an elastic load balancer in front of our containers. We were only able to run a single instance of that load balanced container per EC2 instance in our cluster because ELB can only map a single external port to a single internal port on your instance. This prevented us from leveraging Docker’s ability to bind to any available high-numbered port on the host OS.
Everything was working well during the testing and development phases. However, when we asked the company for help during testing, we ran into a memory issue inside our Phantom JS containers. Phantom JS is essentially a headless WebKit browser, and like desktop browsers such as Chrome, it is a memory hog. Our scrapers would work fine until each one was handling too many concurrent requests, and then the container would hit its memory limit and crash. ECS would bring up a new container automatically (which was a delightful surprise), but this approach wouldn’t scale in production.
One of our ambitious developers found a solution by creating a quick proof of concept of the scraper running inside AWS Lambda. After a bit of tweaking, we were able to get the Lambda version to return the same results as the standalone scraper.
With Lambda, we didn’t have to worry about scaling out this portion of our architecture. If we didn’t use Lambda, we would have needed to implement a job queue for the scraping, which would have complicated the process. The only change we had to make to our backend to support moving the scraper to Lambda was updating the environment variable for the scraper host URL.
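A rough sketch of why the switch was so cheap on the backend side: if the backend only ever talks to the scraper over HTTP, and the scraper's base URL comes from the environment, the backend cannot tell whether a container or a Lambda function sits behind that URL. Everything named here is an assumption for illustration: the `SCRAPER_HOST` variable, the `/scrape?url=` path, and the default host are hypothetical and may differ from the real service.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
	"os"
	"time"
)

// scrapeURL builds the request URL for the scraper service.
// Assumption: the scraper exposes GET /scrape?url=<target>.
func scrapeURL(host, target string) string {
	return host + "/scrape?url=" + url.QueryEscape(target)
}

// fetchHTML asks the scraper at host for the rendered HTML of target.
func fetchHTML(client *http.Client, host, target string) (string, error) {
	resp, err := client.Get(scrapeURL(host, target))
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("scraper returned %s", resp.Status)
	}
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return "", err
	}
	return string(body), nil
}

func main() {
	// SCRAPER_HOST is a hypothetical variable name. Pointing it at a
	// different endpoint is the only change the backend would need.
	host := os.Getenv("SCRAPER_HOST")
	if host == "" {
		host = "http://scraper:8080"
	}
	client := &http.Client{Timeout: 30 * time.Second}
	html, err := fetchHTML(client, host, "https://example.com")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Print(html)
}
```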
Overall, we learned a lot from this project that we can apply to current and future endeavors. Despite our widespread internal use of provisioning tools such as Ansible, we found that working with Docker’s immutable infrastructure was a pleasurable experience. It gave us a high level of confidence that if something worked locally in our development environment, it would work in production, because the images running the system were identical. As an added benefit, it made rolling back changes dead simple, and when a container crashed, ECS knew exactly how to spin up a new one and did so in a timely fashion. We also found a huge benefit in dividing our architecture into microservices as opposed to one gigantic service. It allowed us to deploy components independently because their implementations were decoupled, which is always a programming best practice. The other benefit of a microservice architecture is that our components can scale independently on different platforms. If we ever experience hockey stick growth, we know that we will only need to scale the components doing the heavy lifting (in our case probably the web scraper).