Tips: Design for Scale

Crazy Geek
13 min read · Jun 10, 2024


Scalability is a critical factor in the design and success of modern software systems. It is the property of a system that allows it to handle a growing amount of work, either by using its existing resources more efficiently or by adding more resources.

Scaling is not a new idea in system architecture. It has been around for a long time and has been applied across the Linux kernel and application development in many ways:

1. Creating multiple instances or threads from the same process to handle a growing number of requests.

2. Scheduling those instances across logical processors.

3. Creating and maintaining priority queues.

4. Distributing requests across multiple threads.

5. Avoiding context switches.

6. Avoiding blocking calls.

7. Avoiding synchronization issues that result in deadlocks.

These enhancements, made over the years in many kernel modules and user applications, have a direct impact on how many requests a system can process with the available resources. With large distributed systems, however, scaling procedures become more complex and less well understood. Many new techniques have been introduced to make architectures more scalable and resilient. This blog talks about these techniques in detail.

Key Points

  1. Where to store data, especially when you have a big dataset.
  2. Which database to use so that you can store, search, and manage data efficiently.
  3. How to ensure that a large RPS (requests per second) or QPS (queries per second) does not affect overall latency.

How to Scale?

There are some common techniques that make a system scalable, such as designing services to be stateless rather than stateful, adding a load balancer, sharding databases, partitioning data, using messaging queues, and many more.

Stateful and Stateless services
“Stateless microservices do not maintain any state within the services across calls. They take in a request, process it, and send a response back without persisting any state information.” If an instance crashes, a new instance is spawned that restarts processing from the initial state, irrespective of what the previous state was. On the other hand, stateful microservices maintain processing state (about the client’s session) in RAM or on disk so that when a new request or event is received, the service instance can retrieve the previous processing state and use it to correctly process the new request.

It is easier to horizontally scale your microservices (adding more instances as the workload increases) when the services are stateless, because each instance of the microservice operates independently. In the case of stateful services, the previous state determines the current state of a newly orchestrated instance.

Some key points to keep in mind while designing a stateful service are:

1) In a stateful architecture, users are typically bound to a specific server for the duration of their session to ensure consistent access to their session data. (session affinity or sticky session)

2) Persist the session data in session memory.

3) The response to a request can depend on past interactions.

4) Find a way to share state amongst service instances.

5) Session data must be synchronized to ensure that users’ session states are consistent across all servers.

6) Scaling the database layer to handle the increased load and storage requirements

7) Balanced resource allocation across servers while minimizing resource contention and maximizing utilization.

8) Prefer a stateful architecture for applications that handle sensitive data or require strong data consistency and integrity, as maintaining session state can provide additional security measures.

For stateless services, we don’t need to worry about any of the above points. From a scalability point of view, stateless services are highly scalable and easy to create and maintain, but they come with their own challenges, such as reduced functionality, increased latency, and security concerns.
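
As a minimal sketch of the stateless approach, the handler below keeps no session data inside the process and instead reads and writes it through a generic SessionStore interface, so any instance can serve any request. The SessionStore interface and the in-memory implementation here are hypothetical placeholders; in practice the shared store would be something like Redis or a database.

package main

import (
    "fmt"
    "sync"
)

// SessionStore is a hypothetical abstraction over a shared session store
// (e.g. Redis or a database) so that any service instance can serve any request.
type SessionStore interface {
    Get(sessionID string) (string, bool)
    Set(sessionID, value string)
}

// memoryStore is an in-memory stand-in used only for this sketch.
type memoryStore struct {
    mu   sync.RWMutex
    data map[string]string
}

func (m *memoryStore) Get(id string) (string, bool) {
    m.mu.RLock()
    defer m.mu.RUnlock()
    v, ok := m.data[id]
    return v, ok
}

func (m *memoryStore) Set(id, value string) {
    m.mu.Lock()
    defer m.mu.Unlock()
    m.data[id] = value
}

// handleRequest is stateless: all session state lives in the external store,
// so a crashed instance can be replaced without losing anything.
func handleRequest(store SessionStore, sessionID, input string) string {
    prev, _ := store.Get(sessionID)
    result := prev + input
    store.Set(sessionID, result)
    return result
}

func main() {
    store := &memoryStore{data: make(map[string]string)}
    fmt.Println(handleRequest(store, "s1", "a")) // "a"
    fmt.Println(handleRequest(store, "s1", "b")) // "ab" — state survived across calls
}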

For more information about stateful and stateless services, refer to:

https://medium.com/@dmytro.misik/building-scalable-go-microservices-15d3bc7e28bb

Load Balancing

Load balancing is a technique that helps distribute the workload evenly across multiple servers, ensuring that no single server is overwhelmed. Having an effective load balancing solution can significantly improve your system’s performance and resilience under heavy workloads.

Some of the major benefits and responsibilities of a load balancer are:

1) Each server node should receive an equal share of the workload.

2) It should keep track of which nodes are unavailable or not in use.

3) It should effectively manage and distribute work to ensure that it is finished on time.

4) Distribution should maximize speed and use all available capacity.

5) Distribution should not result in stagnating resource utilization on any node.

Load balancers provide high scalability, high throughput, and high availability through balanced resource distribution.

To know more about designing a robust load balancer, you can refer to the following link: System design of Loadbalancer
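
To make the idea concrete, here is a minimal sketch (not a production design) of round-robin selection, one of the simplest load-balancing policies: each incoming request is handed to the next backend in the list, so every node receives roughly an equal share of the work. The backend names are placeholders.

package main

import (
    "fmt"
    "sync"
)

// RoundRobin hands out backends in a fixed rotation so each one
// receives roughly an equal share of requests.
type RoundRobin struct {
    mu       sync.Mutex
    backends []string
    next     int
}

// Pick returns the next backend in the rotation.
func (rr *RoundRobin) Pick() string {
    rr.mu.Lock()
    defer rr.mu.Unlock()
    b := rr.backends[rr.next]
    rr.next = (rr.next + 1) % len(rr.backends)
    return b
}

func main() {
    lb := &RoundRobin{backends: []string{"server-a", "server-b", "server-c"}}

    // Six requests are spread evenly: a, b, c, a, b, c.
    for i := 1; i <= 6; i++ {
        fmt.Printf("request %d -> %s\n", i, lb.Pick())
    }
}

Real load balancers additionally track node health (point 2 above) and may weight nodes by capacity; those concerns are left out here to keep the sketch short.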

Asynchronous Messaging

Asynchronous messaging refers to communication where the exchange of messages between a sender (publisher) and a receiver (subscriber) does not require both parties to be actively engaged in the interaction at the same time. Asynchronous messaging with buffering and multi-queue support at each API endpoint ensures that sent messages can be received at the receiver’s convenience.

The producer sends messages to a specific queue or topic, and subscribers are notified and receive the messages based on whether they are subscribed to that particular topic.

Asynchronous messaging can handle a large number of messages because it does not require constant connectivity or real-time interaction. That’s why it is a critical component of modern communication systems, providing flexibility and efficiency across various applications and industries. More information:

Partitioning or Chunking

Data partitioning is the process of dividing a large dataset into smaller, more manageable subsets called partitions. These partitions can then be stored, accessed, and managed separately. Each partition can have separate CPU and memory allocated to it and can live as a separate instance of the larger entity.

Partitioning your data can help make your application more scalable and performant, but it can also introduce significant complexities and challenges. More information:
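
As an illustrative sketch, the snippet below shows the chunking side of the idea: a large slice is split into fixed-size partitions that could then be stored or processed independently. The chunk size and data are arbitrary values chosen for the example.

package main

import "fmt"

// partition splits a large dataset into fixed-size chunks so each chunk
// can be stored, processed, or owned by a separate worker or instance.
func partition(data []int, size int) [][]int {
    var chunks [][]int
    for start := 0; start < len(data); start += size {
        end := start + size
        if end > len(data) {
            end = len(data)
        }
        chunks = append(chunks, data[start:end])
    }
    return chunks
}

func main() {
    data := make([]int, 10)
    for i := range data {
        data[i] = i + 1
    }

    // Split 10 records into partitions of 4: [1..4], [5..8], [9 10].
    for i, chunk := range partition(data, 4) {
        fmt.Printf("partition %d: %v\n", i, chunk)
    }
}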

Sharding

Sharding is a technique that involves splitting your data into smaller, more manageable partitions (or shards) and distributing them across multiple servers. This approach can help reduce the load on individual nodes, improve query performance, and enable horizontal scaling.

Sharding is all about:

1. Partitioning data into more manageable pieces called shards.

2. Identifying each shard by a unique shard key.

3. Distributing shards to different nodes in the cluster.

4. Routing data to the specific shard it belongs to.

5. Resharding, or redistributing data when nodes are added or removed.

Note: Shards and partitions are similar concepts, but sharding distributes data across multiple database instances, whereas partitioning distributes data across multiple tables within the same database instance.
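
Below is a minimal sketch of the routing step (point 4 above): a record’s shard key is hashed and mapped onto one of N shards, each of which would be a separate database instance in a real deployment. The shard count and key values are made up for the example.

package main

import (
    "fmt"
    "hash/fnv"
)

const numShards = 4

// shardFor hashes the shard key and maps it onto one of numShards shards.
// In a real system each shard ID would correspond to a separate database instance.
func shardFor(shardKey string) int {
    h := fnv.New32a()
    h.Write([]byte(shardKey))
    return int(h.Sum32() % numShards)
}

func main() {
    keys := []string{"user-1001", "user-1002", "user-1003", "order-77"}
    for _, k := range keys {
        fmt.Printf("%s -> shard %d\n", k, shardFor(k))
    }
}

Note that a plain hash-modulo scheme forces most keys to move when the number of shards changes; consistent hashing, described next, addresses exactly that problem.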

Distribution with Consistent Hashing

Implement consistent hashing to evenly distribute data partitions across nodes in the cluster. Each data partition is assigned a token, and consistent hashing determines which node is responsible for storing and serving data for a given partition. As the cluster scales horizontally by adding more nodes, consistent hashing ensures that data distribution remains balanced and optimized. More information:
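
Here is a compact sketch of a consistent-hash ring, assuming a single position per node for brevity (real implementations add many virtual nodes per server to smooth out the distribution). Node names and keys are placeholders.

package main

import (
    "fmt"
    "hash/fnv"
    "sort"
)

// Ring places nodes on a hash ring; a key is served by the first node
// whose hash is clockwise from the key's hash.
type Ring struct {
    hashes []uint32          // sorted node hashes
    nodes  map[uint32]string // hash -> node name
}

func hashOf(s string) uint32 {
    h := fnv.New32a()
    h.Write([]byte(s))
    return h.Sum32()
}

func NewRing(nodes []string) *Ring {
    r := &Ring{nodes: make(map[uint32]string)}
    for _, n := range nodes {
        h := hashOf(n)
        r.hashes = append(r.hashes, h)
        r.nodes[h] = n
    }
    sort.Slice(r.hashes, func(i, j int) bool { return r.hashes[i] < r.hashes[j] })
    return r
}

// NodeFor returns the node responsible for the given key.
func (r *Ring) NodeFor(key string) string {
    h := hashOf(key)
    // Find the first node hash >= key hash, wrapping around the ring.
    i := sort.Search(len(r.hashes), func(i int) bool { return r.hashes[i] >= h })
    if i == len(r.hashes) {
        i = 0
    }
    return r.nodes[r.hashes[i]]
}

func main() {
    ring := NewRing([]string{"node-a", "node-b", "node-c"})
    for _, key := range []string{"partition-1", "partition-2", "partition-3", "partition-4"} {
        fmt.Printf("%s -> %s\n", key, ring.NodeFor(key))
    }
}

When a node is added or removed, only the keys between that node and its neighbour on the ring move, which is what keeps redistribution cheap as the cluster scales.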

Handling Large Data: Implement Pagination

Consider a news website with a long list of articles. Implement pagination to divide the list into manageable chunks, displaying only a subset of articles per page. When users navigate through pages, the website retrieves and displays the next set of articles dynamically, reducing load times and improving responsiveness.

With pagination, the news website efficiently handles large amounts of data, ensuring fast loading times and a smooth browsing experience for users. This approach optimizes server resources and network bandwidth while accommodating varying user preferences and browsing behaviours. More information: pagination and filtering.
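
A minimal sketch of offset-based pagination over an in-memory slice is shown below; in a real service the same page and pageSize parameters would typically be pushed down into the database query (for example as LIMIT/OFFSET) rather than slicing in memory.

package main

import "fmt"

// paginate returns the items for a 1-based page of the given size.
func paginate(items []string, page, pageSize int) []string {
    start := (page - 1) * pageSize
    if start < 0 || start >= len(items) {
        return nil
    }
    end := start + pageSize
    if end > len(items) {
        end = len(items)
    }
    return items[start:end]
}

func main() {
    articles := []string{"a1", "a2", "a3", "a4", "a5", "a6", "a7"}

    fmt.Println(paginate(articles, 1, 3)) // [a1 a2 a3]
    fmt.Println(paginate(articles, 2, 3)) // [a4 a5 a6]
    fmt.Println(paginate(articles, 3, 3)) // [a7]
}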

Monitor and Optimize Performance

Regularly monitoring your system’s performance is crucial for identifying bottlenecks and areas for optimization. Using system insights such as response times, error rates, and resource utilization, we can make data-driven decisions about optimizations, infrastructure upgrades, and other improvements that enhance the system’s scalability.
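
As a small illustration of capturing such insights, the sketch below wraps an HTTP handler and records each request’s latency and status code; the printed output stands in for whatever monitoring backend (Prometheus, StatsD, etc.) you actually use.

package main

import (
    "fmt"
    "net/http"
    "time"
)

// statusRecorder captures the status code written by the wrapped handler.
type statusRecorder struct {
    http.ResponseWriter
    status int
}

func (r *statusRecorder) WriteHeader(code int) {
    r.status = code
    r.ResponseWriter.WriteHeader(code)
}

// withMetrics records latency and status for every request.
func withMetrics(next http.Handler) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, req *http.Request) {
        start := time.Now()
        rec := &statusRecorder{ResponseWriter: w, status: http.StatusOK}

        next.ServeHTTP(rec, req)

        // In a real system these values would be exported to a metrics backend.
        fmt.Printf("path=%s status=%d latency=%v\n", req.URL.Path, rec.status, time.Since(start))
    })
}

func main() {
    handler := http.HandlerFunc(func(w http.ResponseWriter, req *http.Request) {
        fmt.Fprintln(w, "ok")
    })

    http.Handle("/", withMetrics(handler))
    http.ListenAndServe(":8080", nil)
}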

Handling Traffic Spikes: Use Autoscaling

Implement autoscaling to automatically adjust the number of web server instances based on traffic demand. When traffic increases, the autoscaling system dynamically provisions additional server instances to handle the load. Conversely, when traffic decreases, it scales down the number of instances to minimize costs.

With autoscaling, an e-commerce website, for example, can seamlessly handle sudden increases in traffic without manual intervention, ensuring optimal performance and availability during peak periods. This approach improves scalability, reduces downtime, and optimizes resource utilization, leading to a better overall user experience.
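
In managed environments this is usually handled by the platform (for example Kubernetes’ Horizontal Pod Autoscaler), but the core decision is simple enough to sketch: scale the replica count in proportion to how far a load metric is from its target. The numbers below are purely illustrative.

package main

import (
    "fmt"
    "math"
)

// desiredReplicas scales the current replica count in proportion to how far
// the observed metric (e.g. average CPU utilization) is from its target.
func desiredReplicas(current int, observed, target float64) int {
    d := int(math.Ceil(float64(current) * observed / target))
    if d < 1 {
        d = 1
    }
    return d
}

func main() {
    // 4 replicas at 90% average CPU with a 60% target -> scale up to 6.
    fmt.Println(desiredReplicas(4, 90, 60))
    // 4 replicas at 30% average CPU with a 60% target -> scale down to 2.
    fmt.Println(desiredReplicas(4, 30, 60))
}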

Multi-Region and Multi-Availability Zone Deployment

Distributing your workloads and data across geographically dispersed regions and availability zones enhances redundancy, improves latency, and ensures that systems can scale to accommodate users worldwide.

Caching and Indexing

Adopt caching and indexing strategies to optimize search performance: store frequently accessed data in memory and maintain a comprehensive index of the content being searched. These techniques help deliver fast, accurate search results, even as the volume of data and user queries continues to grow. More information:
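
As a minimal illustration of the caching half, here is an in-memory cache with a time-to-live, so frequently accessed values are served from memory and stale entries expire; production systems would typically use a dedicated cache such as Redis or Memcached instead.

package main

import (
    "fmt"
    "sync"
    "time"
)

type entry struct {
    value     string
    expiresAt time.Time
}

// TTLCache keeps values in memory and drops them after a fixed time-to-live.
type TTLCache struct {
    mu   sync.RWMutex
    ttl  time.Duration
    data map[string]entry
}

func NewTTLCache(ttl time.Duration) *TTLCache {
    return &TTLCache{ttl: ttl, data: make(map[string]entry)}
}

func (c *TTLCache) Set(key, value string) {
    c.mu.Lock()
    defer c.mu.Unlock()
    c.data[key] = entry{value: value, expiresAt: time.Now().Add(c.ttl)}
}

func (c *TTLCache) Get(key string) (string, bool) {
    c.mu.RLock()
    defer c.mu.RUnlock()
    e, ok := c.data[key]
    if !ok || time.Now().After(e.expiresAt) {
        return "", false
    }
    return e.value, true
}

func main() {
    cache := NewTTLCache(100 * time.Millisecond)
    cache.Set("query:go", "cached search results")

    if v, ok := cache.Get("query:go"); ok {
        fmt.Println("cache hit:", v)
    }

    time.Sleep(150 * time.Millisecond)
    if _, ok := cache.Get("query:go"); !ok {
        fmt.Println("cache miss: entry expired")
    }
}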

Design Patterns

Some of the common design patterns applicable to large scale distributed systems are:

1. Queue-based load levelling pattern:

It introduces a queue between the task and the service to prevent the service from being overloaded with messages and failing. The pattern is useful for operations that don’t need to show immediate results. It increases throughput by processing a fixed number of operations at a time and can reject requests when resource utilization reaches its limit. In short, the queue smooths out spikes in traffic so that the system can handle bursts of load without becoming overwhelmed.

package main

import (
    "fmt"
    "math/rand"
    "sync"
    "time"
)

// Job represents a unit of work.
type Job struct {
    ID int
}

// worker drains jobs from the queue at its own pace.
func worker(id int, jobs <-chan Job, wg *sync.WaitGroup) {
    defer wg.Done()
    for job := range jobs {
        fmt.Printf("Worker %d processing job %d\n", id, job.ID)
        time.Sleep(time.Second * time.Duration(rand.Intn(3)))
    }
}

func main() {
    const numWorkers = 3
    const numJobs = 10

    // The buffered channel acts as the levelling queue between producer and workers.
    jobs := make(chan Job, numJobs)
    var wg sync.WaitGroup

    // Start workers
    for i := 1; i <= numWorkers; i++ {
        wg.Add(1)
        go worker(i, jobs, &wg)
    }

    // Send jobs to the queue
    for j := 1; j <= numJobs; j++ {
        jobs <- Job{ID: j} // Adds jobs to the channel.
    }
    close(jobs) // Closes the channel to signal no more jobs.

    // Wait for all workers to complete
    wg.Wait()
}

This implementation distributes jobs to workers, which process the jobs concurrently, ensuring balanced load handling.

2. Command Query Responsibility segregation(CQRS):

CQRS separates updates (commands) from reads (queries) against a data store, which allows each side to be scaled and secured independently and helps maximize performance. It is often combined with the event sourcing pattern, where application state is stored as a sequence of events and the current state is rebuilt by replaying those events. In Go, you can implement CQRS by defining separate command and query handlers.

package main

import (
    "fmt"
    "sync"
)

// CommandHandler for handling write operations
type CommandHandler struct {
    mu    sync.Mutex
    store map[int]string
}

func (ch *CommandHandler) Create(id int, value string) {
    ch.mu.Lock()
    defer ch.mu.Unlock()
    ch.store[id] = value
}

// QueryHandler for handling read operations
type QueryHandler struct {
    store map[int]string
}

func (qh *QueryHandler) Get(id int) (string, bool) {
    value, ok := qh.store[id]
    return value, ok
}

func main() {
    // Both handlers share the same store in this sequential example;
    // a concurrent system would need synchronized or separate read/write models.
    store := make(map[int]string)

    // Initialize handlers
    commandHandler := CommandHandler{store: store}
    queryHandler := QueryHandler{store: store}

    // Execute commands
    commandHandler.Create(1, "item1")
    commandHandler.Create(2, "item2")

    // Execute queries
    if value, ok := queryHandler.Get(1); ok {
        fmt.Println("Item 1:", value)
    }

    if value, ok := queryHandler.Get(2); ok {
        fmt.Println("Item 2:", value)
    }
}

3. Asynchronous Request and Reply:

With every request there is an associated response. Every HTTP request from a client waits for a response before the flow completes. This can add latency when a task is long-running and blocks other tasks waiting for resources. It can be handled by having separate endpoints for submitting the request and for fetching the response or status. Below is a simple example that simulates an asynchronous request/reply service using channels.

package main

import (
    "fmt"
    "time"
)

// Request represents a request structure
type Request struct {
    ID      int
    Message string
}

// Response represents a response structure
type Response struct {
    ID      int
    Message string
    Result  string
}

// handler processes requests asynchronously and closes the response
// channel when all requests have been handled.
func handler(reqChan <-chan Request, resChan chan<- Response) {
    for req := range reqChan {
        // Simulate processing time
        time.Sleep(2 * time.Second)

        // Create a response
        res := Response{
            ID:      req.ID,
            Message: req.Message,
            Result:  fmt.Sprintf("Processed: %s", req.Message),
        }

        // Send response to response channel
        resChan <- res
    }
    close(resChan) // No more responses will be produced.
}

func main() {
    reqChan := make(chan Request)
    resChan := make(chan Response)

    // Start handler goroutine
    go handler(reqChan, resChan)

    // Send requests asynchronously
    go func() {
        for i := 1; i <= 5; i++ {
            req := Request{
                ID:      i,
                Message: fmt.Sprintf("Request %d", i),
            }
            reqChan <- req
        }
        close(reqChan)
    }()

    // Receive responses asynchronously; the loop ends when resChan is closed.
    for res := range resChan {
        fmt.Printf("Received response: ID=%d, Message=%s, Result=%s\n", res.ID, res.Message, res.Result)
    }
}

Asynchronous request and reply patterns allow for non-blocking communication, which can improve the performance and responsiveness of an application. A backend worker takes the queued work items (via a message broker) and executes them, while the client uses HTTP polling against a status endpoint to check progress.

4. Scatter and Gather Pattern:

The scatter-gather pattern is used to distribute a task among multiple workers (scatter) and then collect the results (gather). The request is scattered to all the replicas, each replica does a small amount of processing and returns a fraction of the results, and the root server combines them all and sends the complete response back to the client. More information on replication.


package main

import (
    "fmt"
    "sync"
)

// worker squares each job it receives and reports the partial result.
func worker(id int, jobs <-chan int, results chan<- int, wg *sync.WaitGroup) {
    defer wg.Done()
    for job := range jobs {
        // Simulate work with a square operation
        results <- job * job
    }
}

func main() {
    const numWorkers = 3
    const numJobs = 5

    jobs := make(chan int, numJobs)
    results := make(chan int, numJobs)
    var wg sync.WaitGroup

    // Start workers
    for i := 1; i <= numWorkers; i++ {
        wg.Add(1)
        go worker(i, jobs, results, &wg)
    }

    // Scatter: Send jobs to the workers
    go func() {
        for j := 1; j <= numJobs; j++ {
            jobs <- j
        }
        close(jobs)
    }()

    // Gather: Collect results from the workers
    go func() {
        wg.Wait()
        close(results)
    }()

    // Print the results
    for res := range results {
        fmt.Println("Result:", res)
    }
}

5. Saga Pattern or choreography:

The saga pattern manages distributed transactions. A transaction is an operation performed on data; it becomes a distributed transaction when it updates data on two or more distinct nodes of a distributed datastore. In choreography, the saga is implemented with the asynchronous messaging pattern: the client’s request publishes a message to a messaging queue, the message is pushed to a subscriber, the subscriber performs its task and pushes the result back onto the queue (and eventually to the client), and failed steps can be retried or compensated. More information about distributed transactions here.
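
A tiny sketch of the choreography idea is shown below, using an in-process event channel to stand in for a message broker and hypothetical order/payment steps: each service reacts to the previous event, and a failure publishes a compensating event instead of relying on a central coordinator.

package main

import "fmt"

// Event is a message exchanged between services on the (simulated) broker.
type Event struct {
    Name    string
    OrderID int
}

func main() {
    events := make(chan Event, 10) // stands in for a message broker topic

    // Step 1: the order service creates an order and publishes an event.
    events <- Event{Name: "OrderCreated", OrderID: 42}

    // The saga proceeds purely by reacting to events; there is no central coordinator.
    for len(events) > 0 {
        ev := <-events
        switch ev.Name {
        case "OrderCreated":
            // Step 2: the payment service reacts; here we simulate a failure.
            fmt.Printf("payment service: charging order %d ... failed\n", ev.OrderID)
            events <- Event{Name: "PaymentFailed", OrderID: ev.OrderID}
        case "PaymentFailed":
            // Step 3: the order service compensates by cancelling the order.
            fmt.Printf("order service: compensating, cancelling order %d\n", ev.OrderID)
            events <- Event{Name: "OrderCancelled", OrderID: ev.OrderID}
        case "OrderCancelled":
            fmt.Printf("saga finished: order %d rolled back\n", ev.OrderID)
        }
    }
}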

6. Sequential convoy pattern:

The sequential convoy pattern processes tasks in a specific order, but because of resource locking or other constraints one task can hold up others, leading to inefficiencies. The goal is to handle related tasks sequentially while avoiding the convoy effect by distributing the overall workload evenly among multiple workers. Related messages are pushed into categories within the queuing system, and a queue listener locks and pulls from only one category, one message at a time.

package main

import (
    "fmt"
    "sync"
    "time"
)

// Task is a single unit of work pulled from the queue.
type Task struct {
    ID int
}

func worker(id int, tasks <-chan Task, wg *sync.WaitGroup) {
    defer wg.Done()
    for task := range tasks {
        fmt.Printf("Worker %d processing task %d\n", id, task.ID)
        // Simulate task processing time
        time.Sleep(time.Second * 2)
        fmt.Printf("Worker %d finished task %d\n", id, task.ID)
    }
}

func main() {
    tasks := make(chan Task)
    var wg sync.WaitGroup

    // Start workers
    for i := 1; i <= 3; i++ {
        wg.Add(1)
        go worker(i, tasks, &wg)
    }

    // Create and send tasks
    go func() {
        for j := 1; j <= 5; j++ {
            tasks <- Task{ID: j}
            fmt.Printf("Task %d sent to the queue\n", j)
            // Simulate delay between task submissions
            time.Sleep(time.Second * 1)
        }
        close(tasks)
    }()

    // Wait for all workers to finish
    wg.Wait()
    fmt.Println("All tasks completed")
}

7. Event sourcing pattern:

Event sourcing is a pattern in which state changes are logged as a sequence of events. Instead of storing the current state of an object, you store the history of events that have occurred to it. The sequence of events is recorded in an append-only store. The event store publishes these events so that consumers can be notified and handle them if needed. Multithreaded applications and multiple instances of an application all append their events to the event store.

You can find more details on event sourcing and distributed event processing here.
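
Here is a compact sketch of the core mechanic, using a hypothetical bank-account aggregate: events are appended to an in-memory log, and the current balance is never stored directly, only rebuilt by replaying the events.

package main

import "fmt"

// Event records a single state change; the amount is positive for deposits
// and negative for withdrawals.
type Event struct {
    Type   string
    Amount int
}

// EventStore is an append-only log of events.
type EventStore struct {
    events []Event
}

func (s *EventStore) Append(e Event) {
    s.events = append(s.events, e)
}

// Replay rebuilds the current balance from the full event history.
func (s *EventStore) Replay() int {
    balance := 0
    for _, e := range s.events {
        balance += e.Amount
    }
    return balance
}

func main() {
    store := &EventStore{}
    store.Append(Event{Type: "Deposited", Amount: 100})
    store.Append(Event{Type: "Withdrawn", Amount: -30})
    store.Append(Event{Type: "Deposited", Amount: 50})

    // The state (balance = 120) is derived, not stored.
    fmt.Println("current balance:", store.Replay())
}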

8. Publisher and Subscriber Pattern:

The Publish-Subscribe (Pub-Sub) pattern is a messaging pattern where publishers send messages without knowing who will receive them, and subscribers receive messages without knowing who sent them. The publisher sends messages or events to topics, and subscribers express interest in a topic and receive its messages. This decouples the producers of messages from the consumers.

package main

import (
    "fmt"
    "sync"
    "time"
)

// Publisher maintains a list of subscriber channels and provides methods
// to subscribe, unsubscribe, and publish messages.
type Publisher struct {
    subscribers map[chan string]struct{}
    mu          sync.Mutex
}

// Subscribe adds a new subscriber channel
func (p *Publisher) Subscribe() chan string {
    p.mu.Lock()
    defer p.mu.Unlock()

    ch := make(chan string)
    p.subscribers[ch] = struct{}{}
    return ch
}

// Unsubscribe removes a subscriber channel
func (p *Publisher) Unsubscribe(ch chan string) {
    p.mu.Lock()
    defer p.mu.Unlock()

    delete(p.subscribers, ch)
    close(ch)
}

// Publish sends a message to all subscribers, allowing multiple subscribers
// to receive messages from a single publisher in a decoupled manner.
func (p *Publisher) Publish(message string) {
    p.mu.Lock()
    defer p.mu.Unlock()

    for ch := range p.subscribers {
        ch <- message
    }
}

func main() {
    publisher := &Publisher{
        subscribers: make(map[chan string]struct{}),
    }

    // Subscriber 1
    sub1 := publisher.Subscribe()
    go func() {
        for msg := range sub1 {
            fmt.Println("Subscriber 1 received:", msg)
        }
    }()

    // Subscriber 2
    sub2 := publisher.Subscribe()
    go func() {
        for msg := range sub2 {
            fmt.Println("Subscriber 2 received:", msg)
        }
    }()

    // Publish messages
    go func() {
        for i := 0; i < 5; i++ {
            publisher.Publish(fmt.Sprintf("Message %d", i))
            time.Sleep(time.Second)
        }
        publisher.Unsubscribe(sub1)
        publisher.Unsubscribe(sub2)
    }()

    // Give some time for subscribers to receive messages
    time.Sleep(10 * time.Second)
}

Scaling a system is a complicated topic, and the points I have covered here are just a fraction of all the possible strategies that can be adopted. I will keep adding more points to make this blog really useful for newcomers to the world of distributed design, so keep following for more and updated information.
