My Take on Web Scalability for Startup Engineers — Summary and Book Review

Ahmed Ibrahim
21 min read · Mar 23, 2023


Ever wondered how to jump into web scalability and get a grasp of how startups grow to support millions of users? Artur Ejsmont's "Web Scalability for Startup Engineers" is a great starting point, and this article gives you a summary of its most important points.

Artur Ejsmont designed the book to include both high-level concepts and deep-dive details of web scalability in addition to giving selected examples of real-world technologies.

In the following sections, I have assembled the points from different chapters that I believe are the most important, written them in my own words, and reordered them in a way that helps me present the essence of this book to you in article form.

Introduction

In this section, I lay the foundations for the rest of the points that I will talk with you about in this article.

Scalability

To tackle scalability, you need to understand it well first. Let's make sure we don't confuse scalability with performance: they are related, but they are not the same. For simplicity, you can think of performance as the time taken for each task, and scalability as the ability to grow and handle more load. Stopping there would leave the definition incomplete, though, as it covers scaling up but leaves out scaling down, which is often the root of unjustified costs.

For growing startups, vertical scalability (adding more resources to a machine without adapting the architecture) is not enough; what they target is horizontal scalability. That means splitting your system into parts and scaling each part independently, adding servers according to traffic volume and server load. Past a certain growth threshold, this is even less expensive than continuing to scale vertically.

And to design for scale there are 3 basic techniques:

  • Adding more interchangeable clones, so that a request can be sent to any of them with no difference in outcome.
  • Functional partitioning, by extracting closely related subsets of functionality into independent subsystems.
  • Data partitioning, by keeping a disjoint subset of the data on each machine instead of cloning the entire data set. This follows the share-nothing principle: nothing needs synchronization, and each node is fully autonomous. However, it adds complexity, and locating the node that holds a given piece of data can be challenging.

Statelessness

Being stateless is about delegating the responsibility of keeping any information that requires synchronization to an external service, instead of keeping it on your servers. Statelessness gives you interchangeable servers, which is one of the keys to horizontal scalability.

Self-Healing

Self-healing goes beyond graceful failure handling: it is the ability to detect and fix problems automatically. Ideally, the system should be what is called crash-only, ready to crash at any time and to continue without human intervention after a reboot. For better availability, you should also strive to minimize the mean time to recovery.

Infrastructural Components

To introduce to you the components that you need to know before diving into the rest of the article, here’s a diagram showing some components that can typically be in your data center:

Data center infrastructure summary - by Ahmed Ibrahim

Some tips to keep in mind

  • Simplicity is key; it keeps your system easy to comprehend as it grows. Avoid upfront assumptions about the future, as they are a recipe for over-engineering, and always try to use test-driven development.
  • Promote loose coupling by reducing the amount of information that components know about each other, to avoid complexity whenever you want to make a change.
  • Don't Repeat Yourself (DRY), which is promoted by automation, prohibiting copy-paste programming, and not reinventing the wheel even if it is tempting.
  • Code to a contract, so that clients and providers can be modified independently. This requires exposing only what you need to expose, to avoid future liability and contract renegotiation when making changes.
  • Draw diagrams: start with a use case diagram, then a sequence diagram, then a class diagram, and maybe a module diagram. This gives you a 360-degree view of your design with its dependencies so that you can validate it.

The Front End Layer

The front end should only be responsible for the user interface, without any state or business logic. It should be the "skin", as the book calls it, and should be considered a plugin to your business logic.

It comprises everything from the client to the data center components that respond directly to clients. It is your first line of defense; design it well so it can scale to meet its high throughput and concurrency demands.

Scalable Front End

Here’s a diagram showing the components that can be used to create a scalable front end:

Scalable front end summary - by Ahmed Ibrahim

Some notes to complement the diagram:

  • Auto-scaling is about adapting to changes in traffic volume and server load by adding and removing virtual servers from your clusters. This can save you as much as 25%-50% of your overall hosting costs, but it requires stateless servers so that nothing is lost when any of them is removed.
  • Do your homework before choosing a hosted load balancer such as Elastic Load Balancer; check that expected sudden spikes can be accommodated. If not, something like Nginx (a reverse proxy) or HAProxy (an open-source load balancer) might be more suitable, as you can configure them according to your needs.
  • Keep your load balancers stateless and interchangeable, so that you can scale them horizontally and use round-robin DNS to distribute traffic among them.
  • As your startup grows, hosting on your own hardware becomes cheaper than renting. However, be careful with your design and plan for scale-out, as you will not be able to auto-scale or provision hardware on demand. In all cases, using third-party DNS and CDN providers is still recommended, even if the rest of the stack is managed by you.

Achieving Statelessness in the Front End Layer

Here are some examples:

  • Sessions are a form of state. There are two recommended approaches: storing all session data in cookies, or storing it in a dedicated low-latency data store (e.g., Memcached, Redis, DynamoDB, Cassandra). Always avoid relying on a load balancer to track your sessions (sticky sessions), as that breaks statelessness and prevents auto-scaling. A minimal sketch of the external-store approach follows this list.
  • Instead of storing files on your servers, you can delegate this to a third party. Amazon's Simple Storage Service (S3) is a good example, and it provides both public and private buckets. However, if you decide to implement your own file storage, I recommend using an existing data store such as MongoDB to leverage its partitioning and replication instead of reinventing the whole wheel; remember the DRY concept.
  • Delegate cache management to a dedicated data store to avoid the pain associated with cache invalidation. The only case in which you can keep caches on your servers is when the cached objects don't need to be synchronized or invalidated; in other words, objects that expire based on an absolute time-to-live.
  • The same applies to locks: they can be delegated to a global locking service instead of being handled locally on your servers. It is worth noting that lock granularity should be balanced: many fine-grained locks increase complexity and reduce clarity, while a few coarse locks leave multiple threads waiting on the same lock.
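Here is a minimal sketch of the external session store approach in Python, assuming the redis-py client; the function names and the host are my own illustration, not from the book:

```python
import json
import uuid

import redis  # assumes the redis-py client is installed

# Keeping sessions in an external store leaves web servers stateless
# and interchangeable; the host name below is hypothetical.
store = redis.Redis(host="session-store.internal", port=6379)

SESSION_TTL = 30 * 60  # expire idle sessions after 30 minutes

def create_session(user_id: str) -> str:
    """Persist session data in Redis; return the id to set in a cookie."""
    session_id = uuid.uuid4().hex
    store.setex(f"session:{session_id}", SESSION_TTL,
                json.dumps({"user_id": user_id}))
    return session_id

def load_session(session_id: str):
    """Return the session data, or None if it expired or never existed."""
    raw = store.get(f"session:{session_id}")
    if raw is None:
        return None
    store.expire(f"session:{session_id}", SESSION_TTL)  # sliding expiration
    return json.loads(raw)
```

Because every web server talks to the same store, any server can handle any request, which is exactly what auto-scaling needs.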

The Web Services Layer

Before diving into this topic, I want to clarify that creating web services is not a rule carved in stone; you shouldn't go for them blindly. Ask yourself: is this complexity truly necessary? Should I split this functionality into a separate service? Am I working on a prototype with loosely defined requirements? Be quick and keep it simple, and as your startup grows and the product matures, gradually move from the prototyping and proof-of-concept phase toward a service-oriented architecture. Don't invest too much upfront; be pragmatic.

Quick Note on Types of Web Services

Web services come in two types: function-centric and resource-centric, exemplified by the Simple Object Access Protocol (SOAP) and Representational State Transfer (REST) respectively. Neither is strictly better; they are alternatives. However, REST is convenient for startups because it is lightweight and less strict, which allows faster changes, less complex integrations and deployments, and easier scalability.

Achieving Statelessness in Web Services Layer

The guidelines mentioned in the front end layer section, such as delegating cache management and using a global locking service, apply here as well.

Functional Partitioning

The web services layer should be divided into smaller, independent functional units, and you should always strive to reduce the dependencies between them; ideally, each web service should be fully independent. Partitioning closely related subsets of functionality lets you see each partition's distinct access patterns and needs (caching, data store, and so on), which improves your scalability.

Distributed Transactions

A distributed transaction is a set of internal service steps and external web service calls that either complete together or fail entirely, to avoid inconsistencies. However, distributed transactions are not easy to implement, so check whether dropping them would cause only minor inconsistencies that your business can live with.

It is usually implemented as a two-phase commit (2PC), which requires all services involved in the transaction to maintain persistent connections and hold application resources so that any taken action can be rolled back if any service fails to commit. This is not recommended: transactions become more prone to failure as more services get involved, and 2PC introduces scalability and availability issues.

Some alternatives to avoid issues of 2PC:

  • Implementing graceful failure handling rather than trying to prevent failure.
  • Designing your transactions to use the atomic operations provided by your data store.
  • Implementing a compensating transaction that reverts the actions taken before the failure, instead of making services wait and hold resources for the whole transaction duration (see the sketch after this list).
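To make the last alternative concrete, here is a minimal sketch of a compensating transaction in Python; the order steps are hypothetical stand-ins, not an API from the book:

```python
# Each step is paired with an action that undoes it, so a failure part-way
# through rolls back completed work instead of holding locks and connections
# for the whole transaction as 2PC would.

def charge_card(order):   print("charged", order)
def refund_card(order):   print("refunded", order)
def reserve_stock(order): print("reserved", order)
def release_stock(order): print("released", order)

def place_order(order):
    steps = [(charge_card, refund_card), (reserve_stock, release_stock)]
    compensations = []  # undo actions for every step that succeeded
    try:
        for do, undo in steps:
            do(order)
            compensations.append(undo)
    except Exception:
        # Compensate in reverse order; an undo that fails itself would
        # need logging and manual repair.
        for undo in reversed(compensations):
            undo(order)
        raise
```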

The Data Layer

As important as it is to design your data layer well to allow scalability, it is equally important to treat data stores as easily replaceable, plug-and-play extensions, just like caches, search engines, and message queues. None of them is a central piece that should drive the evolution of the system.

2 Approaches for Scaling the Data Layer

First approach — Replication:

You go with this approach if your goal is to scale concurrent reads; it is a good availability tool as well.

It is about keeping multiple copies of the data on different machines: all writes are executed on a single server (the master) and then replicated asynchronously to the others (the slaves). This makes a slave failure a nonevent; a master failure, however, takes a lot of work to promote the most up-to-date slave, especially when you're using MySQL. So you can choose one of the following two approaches:

  • Deploying two masters that accept writes and replicate them to each other.
  • Creating an identical recovery group, with its master, ready for failover.

Second approach — Data Partitioning (Sharding):

You go with this approach if your goal is to improve write scalability as well as overall data set scalability. It also makes your system robust against partial failures.

It is about choosing a sharding key that evenly divides the data into disjoint partitions, where each machine deals with only one partition, independently and in a share-nothing way. However, it adds complexity: for example, queries spanning multiple shards require a lot of work to fetch a large data set and compute the final result in the application layer.

Another challenge is the data migration required when adding a new server, especially if you rely on a simple algorithm that maps keys with the modulo operation. You can tackle this challenge in one of two ways (a sketch of the second follows this list):

  • Keeping the mappings in a central database to add more flexibility.
  • Creating a number of identical logical databases on a small number of physical machines, giving the illusion of physical partitions, and turning them into actual physical partitions when you need to scale out. All you do then is move an entire logical database from one server to another, with minimal downtime.
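A small sketch of that second idea, with my own numbers: the key-to-logical-database mapping is fixed forever, while the logical-to-physical mapping is the only thing that changes on scale-out:

```python
NUM_LOGICAL_DBS = 32  # chosen upfront and never changed (example value)

# Start with all 32 logical databases packed onto 4 physical servers;
# scaling out later only edits this table and moves whole databases.
logical_to_physical = {db: f"db-server-{db % 4}" for db in range(NUM_LOGICAL_DBS)}

def route(user_id: int) -> str:
    """Map a sharding key to the physical server holding its data."""
    logical_db = user_id % NUM_LOGICAL_DBS  # this mapping never changes
    return logical_to_physical[logical_db]

print(route(12345))  # -> "db-server-1"
```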

Quick Note on Indexing

Without indexing, you need a full table scan to find a piece of data in your database, which has linear complexity, O(n). Good indexes reduce and sort the searched data set, bringing the complexity of a search down to logarithmic, O(log n).

Choose the index field based on its cardinality (the number of unique values the field can take); the higher the cardinality, the better. But this is not the only criterion: you also have to choose an index that guarantees an even distribution of data.

If you cannot find one best index, you can combine two fields and use the second to narrow down your search and improve efficiency. This is called a compound (or composite) index; a small sketch follows.
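As an illustration, here is how a compound index could be created with pymongo; MongoDB is my choice here because the article mentions it elsewhere, and the field names are made up:

```python
from pymongo import ASCENDING, MongoClient

client = MongoClient("mongodb://localhost:27017")
users = client.app_db.users

# Documents are sorted by country first, then by email within each country:
# the low-cardinality field narrows the range, and the high-cardinality
# field pins down the exact document.
users.create_index([("country", ASCENDING), ("email", ASCENDING)])

# This lookup becomes an O(log n) index seek instead of a full scan.
match = users.find_one({"country": "EG", "email": "ahmed@example.com"})
```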

Using NoSQL

Unlike relational databases, where you start by modeling data and relationships and then add indexes to speed up queries, NoSQL treats data as if it were an index already, rather than as tables with relationships.

Using NoSQL starts with the queries in mind, leading to data models optimized for executing the queries of all your use cases. This requires denormalizing your data and allowing redundancy, so that a single document access returns all the required data, which is what enables the optimization and scaling. But be careful: this sacrifices some flexibility for the sake of performance and scalability, and it might lead you to situations where you need additional tables and more redundancy to access data efficiently. It is always recommended to re-validate your data model against the list of known use cases once you think the model is complete.
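As a toy illustration of query-first modeling, here is a denormalized document (my own example) that answers a "render the order page" query with a single access:

```python
# Everything the query needs is embedded, at the cost of redundancy:
# the customer name is duplicated from the customer record, and the
# total is precomputed so reads need no joins or aggregation.
order_document = {
    "_id": "order-1001",
    "customer_id": "cust-42",
    "customer_name": "Ahmed",  # redundant copy, owned by the customer record
    "items": [
        {"sku": "book-7", "title": "Web Scalability", "price": 30.0},
    ],
    "total": 30.0,
}
```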

Another difference is that NoSQL databases were built to reach higher availability and horizontal scalability levels than relational databases could ever offer. But at this level of distribution you cannot have it all; you have to accept some sacrifices. Next, I will mention two related points: the CAP theorem and eventual consistency.

CAP Theorem:

CAP stands for Consistency, Availability, and Partition tolerance (functioning even during network failures, when nodes cannot communicate). When dealing with CAP, "pick two" is your catchphrase: prioritize these characteristics based on your requirements, and simply pick two. For example, MongoDB is a CP data store that automatically shards and distributes data among multiple servers, but if a server fails, operations on its data fail as well. You can improve this with redundancy by using replica sets, but keep in mind that replication is done asynchronously. You can configure it to be synchronous to achieve the common definition of data consistency, but in practice that is expensive for MongoDB.

Eventual Consistency:

This means that different nodes may have different versions of the data, but state changes eventually propagate to all of the servers through ongoing data synchronization.

This was introduced by Amazon's Dynamo data store, where the requirement of full consistency had to be relaxed to enable high availability. Conflicts can arise when multiple clients update the same data at the same time, but there are several approaches to resolving them:

  • Pushing the responsibility of resolving conflicts to clients (which is what Dynamo does).
  • Accepting that "the most recent write wins."
  • Applying self-healing strategies, for example letting a percentage of reads trigger a background read repair.

There is also another type of consistency, called quorum consistency, which requires a majority of replicas to agree: when you write, the majority of servers must confirm that they have persisted the change, and when you read, the majority must respond, so that you are guaranteed to always get the most up-to-date data and preserve read-after-write semantics.
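The rule behind this, as I understand it, is simply that read and write quorums must overlap:

```python
N = 3  # replicas per piece of data
W = 2  # replicas that must acknowledge a write
R = 2  # replicas consulted on every read

# If R + W > N, every read quorum intersects the latest write quorum,
# so at least one consulted replica holds the newest version.
assert R + W > N
```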

Commonly Used NoSQL Data Stores

  • Key-value data stores: the least complex, as they support only the most simplistic access patterns, and they can implement automatic sharding based on the key. Examples: Dynamo and Riak. Redis is related but has more to offer than plain key-value mappings, and Memcached is considered a key-value cache, as it doesn't persist data.
  • Wide columnar data stores: they model data as if it were a compound index, so you provide the row key and the column name, both of which are indexed and sorted, to retrieve your data efficiently. Having no predefined schema allows adding more columns/fields dynamically. They usually provide data partitioning and horizontal scalability out of the box.
  • Document-oriented data stores: they allow more complex objects to be stored and indexed. They suit systems whose data is difficult to fit into a predefined schema but which require scalability at the same time. Examples: MongoDB, CouchDB, and Couchbase.
  • Other types: graph databases and object stores.

A good example of a commonly used database is Cassandra, which is a wide columnar data store. So let’s talk about it a little bit:

  • It provides redundancy, but with no master-slave relationship; replication is quorum-based, according to how many copies of the data you configure the cluster to keep. This redundancy enables self-healing against failures: when a node fails, you can still read and write through the other nodes until the failed node comes back and catches up on the data it missed.
  • It is also worth noting that in Cassandra any node can be the session coordinator, as all nodes know the cluster topology and which data every other node is responsible for. A client can connect to any node, which simply becomes its session coordinator while keeping the topology hidden from the client.
  • Cassandra is optimized for writes, which mitigates the cost of writing to multiple tables to keep them in sync, a consequence of denormalization and redundancy.
  • Cassandra is an eventually consistent database that applies self-healing strategies to resolve conflicts: 10% of its reads trigger a mechanism that fetches the requested data from all replicas and repairs any inconsistency. This adds overhead but makes your system more resilient.

Caching

It is one of the tools to increase performance and scalability without enduring the high costs of re-architecture.

Cache Effectiveness

The more times you can reuse the same cached response, the higher the effectiveness, and this can be measured using the cache hit ratio which is affected by:

  • Cache key space, the smaller it is, the better. So choose the key wisely to be able to reuse the same cache for multiple users.
  • Average size of objects and the size of your cache, as they both determine how many items you can store in your cache before running out of space and starting to evict objects which reduces the cache hit ratio.
  • Time to live (TTL) of the cached object, which is how long it can be cached before it expires. Caching objects that change frequently is useless. The ideal scenario is when the whole response can be cached forever, but even when it can't, you can cache partial results, focusing on the parts that don't change frequently.

Note that the higher up the call stack you can cache, the more resources you can save. This means that caching an entire page fragment is better than just caching the database query. Of course, the ideal case is when the entire requested page is cached, this allows saving 100% of your resources.

Deciding what to cache is also important, and you should rely on metrics rather than gut feelings. One useful metric is the aggregated time spent (time spent per request multiplied by the number of requests), which shows you the pages where investing in caching pays off most, as the toy example below shows.
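Here is that metric computed over made-up numbers; note how the fastest page per request can still dominate the total:

```python
pages = {
    "/home":         {"avg_ms": 40,    "requests": 500_000},
    "/search":       {"avg_ms": 300,   "requests": 20_000},
    "/admin/report": {"avg_ms": 2_000, "requests": 50},
}

# Rank pages by aggregated time spent = time per request * request count.
for path, s in sorted(pages.items(),
                      key=lambda kv: kv[1]["avg_ms"] * kv[1]["requests"],
                      reverse=True):
    total_s = s["avg_ms"] * s["requests"] / 1000
    print(f"{path}: {total_s:,.0f} s")
# /home tops the list (20,000 s) despite being the cheapest per request.
```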

Caching in the HTTP Layer

All the technologies in the HTTP layer are read-through caches: they act as proxies that transparently add caching to HTTP connections in a pluggable way, without requiring any changes to the clients. Technologies of this type include:

  • Browser caching, which is enabled by headers sent by the web server.
  • Caching proxies can be installed in local corporate networks or by the internet service provider; the larger the network, the more effective they can be. However, they have become less effective with the spread of SSL, since encrypted requests cannot be intercepted.
  • Reverse proxies work exactly like a regular caching proxy, but they are placed in your own data center to take load off your web servers. They can also be deployed between your front-end and web services layers, and they scale horizontally easily: just add more of them and use a load balancer to distribute the traffic among them.
  • A CDN is a distributed network of cache servers that improves user experience by pushing content closer to users. CDNs scale transparently and charge flat fees per million requests or per GB of data. Another advantage is that they can mitigate distributed denial-of-service attacks, as Cloudflare does.

Some tips for a better HTTP caching:

  • Make sure your GET handlers are read-only, with no side effects, so that their responses can be cached.
  • Don't base your monitoring reports on the number of GET requests reaching your servers, since caching will skew those numbers.
  • Don't cache frequently changing objects; this leads to inconsistencies.
  • Make GET responses public so there is a single cached object per URL, unless the response really needs to be specific to each user.
  • Bundle static files like CSS and JS under unique, versioned URLs, so that they can be cached forever (or for one year, the practical maximum for HTTP caching) and so that different clients always download compatible versions. A sketch of this tip follows the list.
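Here is a sketch of that last tip using Flask, which is my choice of framework for illustration; any web framework exposes equivalent response headers:

```python
from flask import Flask, send_from_directory

app = Flask(__name__)
ONE_YEAR = 31_536_000  # seconds; the practical upper limit for HTTP caching

@app.route("/static/<bundle_hash>/<path:filename>")
def versioned_static(bundle_hash, filename):
    # A new deployment changes bundle_hash, so clients fetch fresh bundles,
    # while old URLs keep serving cached copies to not-yet-reloaded pages.
    response = send_from_directory("static", filename)
    response.headers["Cache-Control"] = f"public, max-age={ONE_YEAR}, immutable"
    return response
```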

Caching Application Objects

Object caches are cache-aside rather than transparent: the application sees them as independent key-value data stores and actively uses them to store and retrieve objects, saving time and resources.

Common types:

  • Client-side cache: browsers allow JavaScript to store application data directly in a key-value store on the user's device. Although the storage is persistent, the user can clear it at any time, so treat it as a cache rather than a reliable data store.
  • Caches co-located with code: either directly in the application's memory, or in shared-memory segments reachable by multiple processes on the same machine.
  • Distributed object caches: usually simple key-value stores such as Redis or Memcached, deployed on cache servers accessible to the front-end and web services layers. They scale better, as you can create a pool for your cache and add more servers whenever you need more overall memory or throughput. Removing objects is also more efficient, which makes cache invalidation easier.

Tip: even if you are starting with a single server, I recommend deploying Redis or Memcached on that one server anyway, to make it easy to move your cache to a dedicated cluster later without modifying your code.
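A minimal cache-aside sketch, assuming the redis-py client; the database call is a hypothetical stand-in:

```python
import json

import redis  # assumes the redis-py client

cache = redis.Redis(host="localhost", port=6379)  # later: a dedicated cluster
USER_TTL = 300  # seconds before the cached copy expires

def get_user_from_db(user_id: int) -> dict:
    return {"id": user_id, "name": "Ahmed"}  # stand-in for the real query

def get_user(user_id: int) -> dict:
    key = f"user:{user_id}"
    cached = cache.get(key)
    if cached is not None:            # cache hit: skip the database entirely
        return json.loads(cached)
    user = get_user_from_db(user_id)  # cache miss: fetch from the source
    cache.setex(key, USER_TTL, json.dumps(user))
    return user
```

Only the connection line changes when the cache moves to its own cluster.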

Asynchronous Processing

It is about issuing requests that don't block execution: the caller fires the request and then forgets about it, removing temporal coupling. The caller continues its work and is notified once the operation finishes, at which point a callback function is called to handle the result. This way, the weakest link doesn't dictate the overall latency.

A pizza restaurant is a simple example that embodies asynchronous processing. The restaurant keeps taking orders while the chefs prepare them in the background, without freezing the whole restaurant until each order is done. When an order is ready, the chefs signal the waiter (the callback function), who delivers the pizza to the customer.
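The analogy translates almost directly into code. Here is a small sketch using a Python thread pool, purely for illustration:

```python
import time
from concurrent.futures import ThreadPoolExecutor

kitchen = ThreadPoolExecutor(max_workers=4)  # the chefs

def bake_pizza(order_id: int) -> str:
    time.sleep(2)  # slow work happens in the background
    return f"pizza #{order_id}"

def deliver(order):  # the callback: the waiter
    print("Delivering", order.result())

order = kitchen.submit(bake_pizza, 7)  # fire the request...
order.add_done_callback(deliver)       # ...and register the delivery
print("Taking the next order")         # the caller was never blocked
```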

Message Queues

They are a great tool for achieving asynchronous processing even if your programming language runtime doesn’t support asynchronous processing.

Their main function is to buffer and distribute asynchronous requests, assumed to be one-way, fire-and-forget, from producers to consumers, in either a push or a pull model. When message queues provide more functionality, such as permission control, routing, and failure recovery, they act like independent applications and are referred to as message brokers or message-oriented middleware (MOM).

Message brokers support different ways of routing:

  • Direct worker queue method, in which messages are added to a single queue and each message is routed to exactly one consumer. This is suited to spreading time-consuming tasks across multiple worker machines. A sketch of this method follows the list.
  • Publish/subscribe method, in which producers publish messages to a topic (not a queue), and the messages are delivered to the private queues of all subscribed consumers to be consumed independently.
  • Custom routing rules, where a consumer can decide more flexibly which messages should be routed to its queue. RabbitMQ, for example, provides this through bindings that create routing rules based on text pattern matching.
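Here is a sketch of the direct worker queue method using pika, RabbitMQ's Python client (my choice for the example); producer and worker would normally run as separate processes:

```python
import pika  # assumes the pika client and a local RabbitMQ broker

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue="thumbnails", durable=True)  # survive restarts

# Producer side: fire-and-forget; each message reaches exactly one worker.
channel.basic_publish(exchange="", routing_key="thumbnails",
                      body='{"image_id": 42}')

def process_thumbnail(body: bytes) -> None:
    print("resizing", body)  # hypothetical stand-in for the real work

# Consumer side: a stateless worker takes one message at a time and
# acknowledges it only after the work succeeded.
def on_message(ch, method, properties, body):
    process_thumbnail(body)
    ch.basic_ack(delivery_tag=method.delivery_tag)

channel.basic_qos(prefetch_count=1)  # hand each worker one task at a time
channel.basic_consume(queue="thumbnails", on_message_callback=on_message)
channel.start_consuming()
```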

Tip: worker machines should be isolated into a separate set of servers so they can be scaled independently. They should also be stateless, getting all of their data from the queue and external data stores. This way, machine failures and scaling out are not a problem.

Some benefits of using message queues:

  • Enabling asynchronous processing to avoid blocking. Queues can also keep accepting requests for critical features even when the corresponding servers are down for any reason.
  • Decoupling producers and consumers, which gives more flexibility, even in choosing the technology each of them is implemented in; they only need a contract for the message format.
  • Making scalability easier thanks to deferred processing: multiple producers can fire-and-forget requests in parallel, and consumers scale by adding more machines to handle messages in parallel.
  • Evening out traffic spikes, as you can keep enqueuing messages and consumers eventually catch up and drain the queues, even if some messages sit in them a little longer.
  • Isolating failures, because producers don't depend directly on consumers being available; this also allows maintaining and deploying consumer servers at any time.
  • Recovering from failure by simply replacing the failed server; the system heals itself by catching up with the queues and draining messages over time, essentially promoting self-healing.

Guidelines to avoid problems associated with message queues:

  • Build your system to assume that messages can arrive in random order, or use a message queue that supports a partial message ordering guarantee, such as ActiveMQ, which allows dividing messages into groups whose messages are delivered to the same consumer and consumed in order, preventing race conditions and other problems caused by out-of-order consumption.
  • Make all of your consumers idempotent, meaning they can process the same message multiple times without affecting the final result (see the sketch after this list). Alternatively, add an extra layer of tracking and persistence, if it covers all the failure scenarios without adding much complexity.
  • Document your system, its routes, and the message flows through it so that developers have a reference. Whenever possible, auto-generate this documentation from the code itself to make your life easier.
  • Don't treat the queue as a TCP socket by using it as a return channel for consumers to send messages back to producers; this creates coupling, and you end up with a synchronous application and excessive resource consumption, exactly the opposite of your goal.
  • Don't treat it as a database where you can randomly access queue elements to delete or update them. It should be treated as a FIFO.
  • Never put anything in a message that requires the producer and consumer to share the same class; this creates nasty coupling and even forces them to be implemented in the same technology. Messages should only be data transfer objects.
  • You cannot assume that messages are always valid; that would make you vulnerable to a message of death (also known as a poison message) that crashes your consumer. Even with auto-respawning consumers, you may end up with a pipeline freeze where all consumers keep crashing. So implement poison message handling, and remove failing messages from the queue after a certain retry limit.
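To make the idempotency guideline concrete, here is a sketch where processing is keyed by a unique message id; using Redis SET NX as the dedup record is my own choice, not from the book:

```python
import redis  # assumes the redis-py client

dedup = redis.Redis(host="localhost", port=6379)

def apply_business_logic(payload: dict) -> None:
    print("processing", payload)  # stand-in for the real side effect

def handle(message_id: str, payload: dict) -> None:
    # set(..., nx=True) returns None if the key already exists, i.e. if
    # this exact message was processed before; ex= keeps the record a day.
    first_time = dedup.set(f"processed:{message_id}", 1, nx=True, ex=86_400)
    if not first_time:
        return  # duplicate delivery: safely ignore
    apply_business_logic(payload)
```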

Quick Note on Event-Driven Architecture (EDA)

To understand EDA, you need to shift your thoughts from thinking about software in terms of requests and responses, to thinking about it in terms of components announcing things that have already happened (aka. events) instead of requesting work to be done.

If you think about it, even when a message queue makes the communication between producers and consumers asynchronous, so that the producer is not temporally coupled to the consumer, the producer still knows what the consumer has to do, which leaves it somewhat logically coupled to that service.

In event-based interaction, on the other hand, the event publisher has no idea about the presence of the consumers; it just makes announcements to the event-driven framework. That framework can actually be a message queue; don't get me wrong, it is about how you use the message queue, not about the message queue itself.

EDA is asynchronous and highly decoupled by nature. Producers just announce events, and the framework is responsible for routing them to the consumers that declared their interest in these types of events. So it is clear that the only possible point of coupling here is the event definition itself.

Book Review

The book is very well structured and written. In my opinion the language is simple, though it assumes a reader who is not completely new to web development; a junior or mid-level software or DevOps engineer should be able to follow it easily. The one point people might debate is that it covers many topics instead of focusing on a single one and diving deep into it. I actually see that as a strength: it gives the reader the big picture of web scalability, and the amount of detail on each topic is still rich.

Final Words

The points discussed in this article are the ones I consider most important. I hope this has been an interesting read and that it helps you approach any web scalability topic with a solid background. And if you are a bookworm, I definitely encourage you to read the book itself; it is simply a masterpiece.
