Scaling up your SaaS product infrastructure: High-availability

Ofir Chakon
7 min read · Jun 3, 2017


Many SaaS (software-as-a-service) products store massive amounts of data while being required to provide a seamless user experience whenever that data is accessed. At low traffic volumes, this mission is relatively simple: one main server can handle HTTP responses, static assets, databases, and background tasks. Things become more complex when higher traffic volumes enter the picture.

We recently reached this point of evolution at ClickFrauds, which made us restructure our server infrastructure to be robust as well as highly scalable. At ClickFrauds, we help Google AdWords advertisers monitor their campaigns for fraudulent clicks, and we automatically prevent fraudsters from interacting with the ads.

A highly sensitive application component exposed to the web is the tracker, which gets hit thousands of times per minute and is responsible for:

  1. Storing data from each request (click)
  2. Redirecting the request to its final destination (the advertiser’s website)

By the nature of the tracker, it goes without saying that a malfunction of this component is not something our system can afford.
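
To make the tracker's two responsibilities concrete, here is a minimal sketch of such an endpoint as a Django view (we use Django, as discussed below). The Click model, its fields, and the dest query parameter are hypothetical illustrations, not ClickFrauds' actual schema:

```python
# views.py (sketch): the two jobs of a tracker endpoint as a Django view.
from django.http import HttpResponseRedirect

from .models import Click  # hypothetical model: one row per tracked click

def track(request):
    # 1. Store data from each request (the click).
    Click.objects.create(
        ip_address=request.META.get("REMOTE_ADDR"),
        user_agent=request.META.get("HTTP_USER_AGENT", ""),
        campaign_id=request.GET.get("campaign"),
    )
    # 2. Redirect the request to its final destination (the advertiser's
    # website). A real system would validate this URL before redirecting.
    return HttpResponseRedirect(request.GET.get("dest", "https://example.com"))
```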

When we started, we compared the main cloud providers out there in order to make a “smart enough” temporary choice for the next few generations of the product. Choosing DigitalOcean as our cloud provider turned out to be an extremely wise decision. DigitalOcean has a seamless, easy-to-use dashboard and a convenient pricing model. Droplets (server units) are highly scalable, robust, and cover most of the features an early-stage startup needs. The support team is very helpful, and the greatest thing about DigitalOcean is its endless collection of tutorials, which also rank at the top of many Google search results.

Infrastructure 2.0

Starting with the requirements, we listed three main concepts that had to be taken into account when redesigning ClickFrauds’ server architecture:

  1. High availability and redundancy.
  2. Security layers against different potential cyber attacks.
  3. Monitoring mechanisms to ensure a healthy infrastructure over time.

Now we will dive into the steps we took to fulfill each of these requirements.

High-Availability

High-Availability illustration

High availability means avoiding a single point of failure in each and every component of the system. With many moving parts serving different objectives, plus critical hardware failures that can neither be controlled nor predicted, this is not a trivial task. However, there are great open-source tools and mechanisms that, used properly, can achieve a highly available infrastructure in an elegant way.

Load Balancing

To eliminate a single point of failure on machines that are exposed to the web and respond to HTTP requests, we use redundancy. Redundancy means duplicating machines that fulfill the exact same set of tasks. We control and distribute the load over the redundant machines using a load balancer. The load balancer then becomes the only machine exposed to external requests, while the duplicated nodes stay hidden inside the internal network.

To implement a basic yet reliable load balancer, we can use Nginx with a round-robin algorithm for distributing requests between the nodes. Nginx takes care of everything for you, from health checks on each node to taking failed nodes out of rotation and bringing them back after recovery. Additional configuration is needed to ensure proper session stickiness, consistent and reliable headers (such as the client’s IP address), and proper caching.
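
As a minimal sketch, an Nginx configuration along these lines implements round-robin balancing with passive health checks and client-IP forwarding. The hostnames, ports, and domain are placeholders, not our production values; note that cookie-based stickiness requires Nginx Plus, so open-source setups often approximate it with ip_hash:

```nginx
# Upstream pool of redundant tracker nodes (placeholder addresses).
upstream tracker_nodes {
    # Round-robin is the default distribution algorithm.
    # ip_hash;  # uncomment for basic stickiness by client IP
    server 10.0.0.11:8000 max_fails=3 fail_timeout=30s;  # passive health checks
    server 10.0.0.12:8000 max_fails=3 fail_timeout=30s;
}

server {
    listen 80;
    server_name tracker.example.com;  # placeholder domain

    location / {
        proxy_pass http://tracker_nodes;
        # Forward the original client's IP address to the backend nodes.
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header Host $host;
    }
}
```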

At this point, you may ask yourself: what happens if the load balancer itself goes offline? This is a great question, and it has to be answered to achieve a fully highly available system.

Floating IP addresses

Another great feature offered by DigitalOcean (for free!) is the floating IP address. As you may already know, DNS A records might take some time to propagate from one server to another. In the case of a load balancer failure, we would need to change the main A record, for example of the subdomain tracker.clickfrauds.com, to point from the unavailable load balancer to an available backup one. A floating IP address enables us to achieve this goal seamlessly. It is an external IP that can be attached to different machines immediately. This way, in case of a load balancer failure, we leave the DNS A record for tracker.clickfrauds.com pointing at the floating IP as is, and simply reattach the floating IP address to the backup machine.
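
The reattachment can even be scripted against DigitalOcean’s floating IP API, so failover happens with a single call. A hedged sketch in Python using the requests library; the token environment variable, IP, and droplet ID are placeholders:

```python
# Sketch: reassign a DigitalOcean floating IP to a backup load balancer.
import os
import requests

DO_TOKEN = os.environ["DO_API_TOKEN"]  # assumed environment variable
FLOATING_IP = "203.0.113.10"           # placeholder floating IP
BACKUP_DROPLET_ID = 12345678           # placeholder backup droplet ID

def reassign_floating_ip(ip: str, droplet_id: int) -> None:
    """Point the floating IP at the given droplet (e.g. the backup LB)."""
    # If the IP is still attached elsewhere, an "unassign" action
    # may be required first, depending on the API's behavior.
    resp = requests.post(
        "https://api.digitalocean.com/v2/floating_ips/%s/actions" % ip,
        headers={"Authorization": "Bearer " + DO_TOKEN},
        json={"type": "assign", "droplet_id": droplet_id},
    )
    resp.raise_for_status()

if __name__ == "__main__":
    reassign_floating_ip(FLOATING_IP, BACKUP_DROPLET_ID)
```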

Floating IP address illustration

Databases

Like any other component in a multi-layered architecture, databases can fail for many different reasons, whether hardware or software related. Databases are highly sensitive components that are needed at all times during the application life cycle. It is therefore extremely important to keep the data sustainably accessible, using a method called replication. Replication, in its simplest form, means holding your data in several different accessible places rather than on a single machine. Replication eliminates the single point of failure in your data system and provides your application with multiple access points to your data. Multiple access points help distribute I/O (input/output) operations across different machines within your infrastructure: different user requests may reach different nodes (machines) for the data they need, balancing the load of requests.

To accomplish proper replication of a database, your data is either distributed among different machines without duplication, or duplicated across the machines and therefore identical from one machine to another. In the latter case, synchronization is a consideration that has to be taken into account: once something changes on one machine, it has to be updated on all the rest. Fortunately, the popular databases, both SQL and NoSQL based, ship with robust replication mechanisms. SQL (relational) databases will mostly use duplication-based replication, while NoSQL databases, which were built for scale and for holding huge amounts of data, will typically distribute (shard) the data, so that each node holds a different part of it.

It is common in mature applications to use both types of databases, SQL and NoSQL, for different purposes. At ClickFrauds, we decided to use PostgreSQL, which works seamlessly with the Django web framework, as our relational database for user-related data, and Redis, a super-fast in-memory store, for cache data and for holding the queue of all background processes in our system. Both databases are replicated across several machines, which prevents a single point of failure and dramatically extends the I/O capabilities of each database.
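
One common way to take advantage of such replicas in a Django application is a database router that sends reads to a replica and writes to the primary. A minimal sketch, with placeholder hostnames rather than our actual topology:

```python
# settings.py (sketch): one primary PostgreSQL node plus a read replica.
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.postgresql",
        "NAME": "appdb",
        "HOST": "db-primary.internal",  # placeholder hostname
    },
    "replica": {
        "ENGINE": "django.db.backends.postgresql",
        "NAME": "appdb",
        "HOST": "db-replica.internal",  # placeholder hostname
    },
}
DATABASE_ROUTERS = ["routers.PrimaryReplicaRouter"]

# routers.py (sketch): reads go to the replica, writes to the primary.
class PrimaryReplicaRouter:
    def db_for_read(self, model, **hints):
        return "replica"

    def db_for_write(self, model, **hints):
        return "default"

    def allow_relation(self, obj1, obj2, **hints):
        return True  # replicas hold the same data, so relations are safe
```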

Celery Workers

Analyzing click data is a computationally heavy operation, especially since the analysis we perform on each click is based on a deep investigation of its every aspect, using technologies like supervised machine learning models, as well as a graph database for pairing clicks that have a high likelihood of originating from the same source. This analysis must not come at the expense of user experience; rather, it has to be performed in a background process. On the other hand, the analysis is most effective when it is completed as close as possible to the click itself, so that we can decide whether a specific IP address should be blocked immediately.

Therefore, to accomplish our task at ClickFrauds, we needed a component that manages a queue of thousands of background tasks per minute (a.k.a. a broker) and endpoints that consume the click-analysis tasks and execute them (workers). We decided to work with Celery, which is highly capable of managing and consuming background processes at large scale. Celery cannot operate without a broker to store its task queue, for which we chose Redis. To scale up the task-consuming capacity, all we have to do is duplicate the Celery worker machine, and the consumption velocity increases immediately.
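
A minimal sketch of this setup, with a placeholder Redis URL and a toy heuristic standing in for the real machine learning analysis:

```python
# tasks.py (sketch): a Celery app that uses Redis as its broker.
from celery import Celery

app = Celery("tracker", broker="redis://redis.internal:6379/0")  # placeholder URL

@app.task(bind=True, max_retries=3, default_retry_delay=10)
def analyze_click(self, click_data: dict) -> bool:
    """Analyze one click in the background; return True if it looks fraudulent."""
    try:
        # Toy heuristic standing in for the real ML and graph analysis.
        suspicious = click_data.get("clicks_from_ip", 0) > 20
        if suspicious:
            # In the real system this is where the IP would be blocked.
            print("blocking", click_data.get("ip"))
        return suspicious
    except Exception as exc:
        # Retry transient failures instead of dropping the task.
        raise self.retry(exc=exc)
```

Producers enqueue work with analyze_click.delay({...}); each duplicated worker machine then just runs celery -A tasks worker --concurrency=8 (or similar) and starts consuming from the shared Redis queue.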

Once you reach the point of scaling up your SaaS infrastructure, the main concept to bear in mind is the ability to spin up new servers in seconds that integrate into the infrastructure seamlessly and start doing their job. You can achieve this easily by taking snapshots of your machines just before you launch the new infrastructure.
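
Booting a fresh node from such a snapshot can also be automated through DigitalOcean’s droplet API. A hedged sketch, reusing the token convention from the earlier example; the region, size slug, and snapshot ID are placeholders:

```python
# Sketch: create a new droplet from a pre-built snapshot image.
import os
import requests

DO_TOKEN = os.environ["DO_API_TOKEN"]  # assumed environment variable

def launch_from_snapshot(name: str, snapshot_id: int) -> dict:
    """Boot a droplet from an existing snapshot so it joins the fleet ready to work."""
    resp = requests.post(
        "https://api.digitalocean.com/v2/droplets",
        headers={"Authorization": "Bearer " + DO_TOKEN},
        json={
            "name": name,
            "region": "nyc3",      # placeholder region
            "size": "2gb",         # placeholder size slug
            "image": snapshot_id,  # snapshot taken from a configured worker
        },
    )
    resp.raise_for_status()
    return resp.json()["droplet"]
```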

The next posts in the series will discuss how to secure your highly available infrastructure and how to monitor it for potential issues. Each of the topics mentioned here deserves further discussion in detail, which will come in the next posts; please feel free to comment about the specific topics that interest you the most.

Originally published on CodingStartups — coders with an entrepreneurial mindset.
