Building The Next-Generation E-Commerce Platform For 10X Growth

Siddharth Patnaik
Walmart Global Tech Blog
9 min read · Oct 23, 2020

Introduction

In this post, we will discuss the architecture of a cloud-native, next-generation E-Commerce platform built on the public cloud. We will cover the problems at hand, the motivation behind building such a new platform, the architecture guidelines and operating principles we followed, the technology stack we chose, and the rationale behind these decisions. Since this post focuses on the cloud-native architectural aspects, I suggest reading my earlier blog on cloud-native application architecture as a pre-read.

The context is one of Walmart’s E-Commerce businesses, which generates around 5 billion dollars in annual revenue. The business is growing at 30–40% YoY, and it expects the technology to scale up to support this hyper-growth. In reality, the current E-Commerce application has been around for a few years, is built on commercial-off-the-shelf (COTS) monolith software, does not scale, and is expensive from a development, support and hardware cost perspective. We face all the problems seen in a typical monolith application: it is extremely difficult to scale beyond a point, release cycles are delayed, deployments require huge coordination and communication between teams, development is slow, QA cycles are long, and we are unable to innovate through rapid experimentation or by leveraging new technologies.

These problems motivated us to build a new platform from scratch using modern technologies on the public cloud with the following high-level objectives:

  • High performance, massively scalable — Should be able to scale to 10X of the current throughput. In E-Commerce applications, throughput is normally measured in PVM (page views per minute) and OPM (orders per minute). We wanted to scale to 10X of these current metrics
  • High quality — Bug-free high-quality software
  • Time to market — Should be able to quickly build new features, deploy frequently
  • Elastic — We run multiple promotional events in a year. The platform should scale up during such events and scale down post these events. Follow a pay-per-use model
  • Innovate — Should be able to experiment & adopt new technologies at speed
  • Cost-effective — The solution should be cost-effective. Should scale gradually with business
  • Highly available — System downtime is directly linked to a loss in revenue. The system should provide an uptime of 99.99%

Architecture Guidelines & Operating Principles

We set a few architecture guidelines and design patterns and followed them religiously. We made sure these principles were followed across microservices and the respective teams. In the subsequent sections, we will deep dive into each of these areas and discuss implementation details from a Cart & Checkout service perspective.

  • Best-Of-Breed Technology — Use a multi-cloud strategy and pick the best-of-breed technology from public cloud providers like Azure, GCP etc. Also, prefer PaaS over IaaS. Most of our services are built on Azure. There are specific services which leverage GCP as well. Critical services like Payments and Tax run on the private cloud
  • Domain Oriented Microservice Architecture — We broke the monolith into multiple microservices based on domain boundaries. We ended up with services like Catalog, Search, Cart, Checkout, Promotion, Payment, Inventory, Order, Return etc. These services are designed independently in such a way that they are managed by their respective teams with clear ownership, offer an API contract, use the most suitable technology stack, and are independently tested and deployed with a clear separation of concerns
  • Containerised Deployments — Microservices are packaged as lightweight containers using Docker and orchestrated on a Kubernetes platform
  • Stateless, Omni-channel & Massively Scalable — no shared state, massively scalable using NoSQL databases (eventual consistency wherever possible), caching wherever possible, multi-region active-active deployment topology
  • Elastic — Scale-up & down on demand. Autoscale at all levels from containers to database
  • Resiliency At The Core — The system should be resilient to failures and should minimise the blast radius by impacting as few customers as possible
  • STOSA — Single Team Owned Service Architecture. The team is responsible end to end, from development to deployment to production support
  • Agile DevOps & Automation — Continuous integration & deployment strategy to automate everything from dev to deployment. Frequent deployments (multiple times a day)
  • Observability — Every service declares its service level objectives (SLOs) with service level indicators (SLIs), for example average response time, P95 response time, error percentage and throughput (requests per second). All these SLIs are monitored using dashboards. Alerts are defined which get triggered once the thresholds are breached, and appropriate action is taken
  • Cost-effective — The right technology should be selected keeping cost considerations in mind. Also, cloud resources are monitored regularly to ensure effective resource usage

Microservice Deployment Architecture

Kubernetes has become the de facto platform for running containerised workloads. All our microservices are deployed into Azure Kubernetes Service. Kubernetes provides many important features: service discovery and load balancing, scheduling of pods/containers across VMs based on configured CPU/memory, deployment strategies (blue/green, rolling updates), autoscaling via the horizontal pod autoscaler and node autoscaler, and management of the application pod lifecycle. We also use Istio as the service mesh, which uses a sidecar model to give us more flexibility in the areas of traffic management, load balancing, canary deployments, secured communication between microservices and observability of the complete service topology.

We follow an active-active model for achieving high throughput and HA/DR. The services are deployed in the Azure East and West regions. In the future, we plan to deploy to more regions.

Deployment Topology

Horizontal Scalability

To achieve linear scalability, the architecture is built using various strategies like asynchronous processing, statelessness, caching, autoscaling, a high-performance NoSQL database, multi-region deployment etc. Let us look into each of these aspects in detail.

Asynchronous Processing

Services like Cart & Checkout depend on multiple remote services as part of their processing logic. These services have been implemented using asynchronous processing frameworks like WebFlux, CompletableFuture and async HTTP clients. This non-blocking, reactive style of processing helps us achieve high throughput with relatively smaller hardware configurations.
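
As an illustration of this style, here is a minimal sketch that uses CompletableFuture to fan out two remote lookups in parallel and combine the results. The Price, Availability and CartLine types and the stubbed remote calls are hypothetical placeholders, not our actual service contracts.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;
import java.util.concurrent.Executors;

// Hypothetical response types, for illustration only.
record Price(String itemId, double amount) {}
record Availability(String itemId, boolean inStock) {}
record CartLine(String itemId, Price price, Availability availability) {}

public class CartEnricher {

    private final Executor ioPool = Executors.newFixedThreadPool(32);

    // Each remote lookup returns a CompletableFuture, so the caller thread is never blocked.
    CompletableFuture<Price> fetchPrice(String itemId) {
        // Stub standing in for an async HTTP call to a pricing service.
        return CompletableFuture.supplyAsync(() -> new Price(itemId, 9.99), ioPool);
    }

    CompletableFuture<Availability> fetchAvailability(String itemId) {
        // Stub standing in for an async HTTP call to an inventory service.
        return CompletableFuture.supplyAsync(() -> new Availability(itemId, true), ioPool);
    }

    // Fan out both lookups in parallel and combine them into a single cart line.
    public CompletableFuture<CartLine> enrich(String itemId) {
        CompletableFuture<Price> price = fetchPrice(itemId);
        CompletableFuture<Availability> availability = fetchAvailability(itemId);
        return price.thenCombine(availability, (p, a) -> new CartLine(itemId, p, a));
    }
}
```

WebFlux expresses the same fan-out with reactive operators such as Mono.zip; either way, the key point is that a request thread is never blocked while waiting on a remote call.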

Non-Blocking IO

Autoscaling In Kubernetes

Horizontal pod autoscaler rules have been configured for autoscaling. Whenever traffic spikes, new pods and nodes get added and the service scales out. Also, as shown in the picture above, the services are deployed into multiple Azure regions.

Autoscaling in Kubernetes

Azure Cosmos Scalability

Our earlier application was based on an RDBMS with hundreds of tables, and we faced a lot of issues with respect to scalability. While designing the new platform, we were sure NoSQL was the way forward. We selected Azure Cosmos DB, a globally distributed, horizontally scalable, schema-less database designed for planet scale. The data model is carefully designed with the right partition key and access patterns. For our use cases, access is always by key (getDocumentByID), which gives us less than 10 ms latency for the 99th percentile of our calls. From a consistency perspective, we use session consistency. Cosmos ensures consistent reads when the right session token is passed, so after every write the token is stored for the corresponding document in a key-value store (Redis), and for subsequent reads this token is used.
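
Here is a minimal sketch of this read-your-writes pattern, assuming the Azure Cosmos DB Java SDK v4 and a Jedis client for Redis. The Cart model, the database and container names, the choice of the cart id as partition key and the Redis key scheme are illustrative, not our production code.

```java
import com.azure.cosmos.ConsistencyLevel;
import com.azure.cosmos.CosmosClient;
import com.azure.cosmos.CosmosClientBuilder;
import com.azure.cosmos.CosmosContainer;
import com.azure.cosmos.models.CosmosItemRequestOptions;
import com.azure.cosmos.models.CosmosItemResponse;
import com.azure.cosmos.models.PartitionKey;
import redis.clients.jedis.Jedis;

// Minimal document model for illustration; the cart id doubles as the partition key here.
class Cart {
    private String id;
    public String getId() { return id; }
    public void setId(String id) { this.id = id; }
}

public class CartRepository {

    private final CosmosContainer container;
    private final Jedis redis = new Jedis("localhost", 6379); // session-token store (illustrative)

    public CartRepository(String endpoint, String key) {
        CosmosClient client = new CosmosClientBuilder()
                .endpoint(endpoint)
                .key(key)
                .consistencyLevel(ConsistencyLevel.SESSION)
                .buildClient();
        this.container = client.getDatabase("commerce").getContainer("cart");
    }

    // After every write, store the returned session token for this document in Redis.
    public void saveCart(Cart cart) {
        CosmosItemResponse<Cart> response = container.upsertItem(
                cart, new PartitionKey(cart.getId()), new CosmosItemRequestOptions());
        redis.set("cart-session:" + cart.getId(), response.getSessionToken());
    }

    // On reads, pass the stored token back so Cosmos returns a read-your-writes consistent result.
    public Cart getCart(String cartId) {
        CosmosItemRequestOptions options = new CosmosItemRequestOptions();
        String token = redis.get("cart-session:" + cartId);
        if (token != null) {
            options.setSessionToken(token);
        }
        return container.readItem(cartId, new PartitionKey(cartId), options, Cart.class).getItem();
    }
}
```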

Azure Cosmos

Resiliency / Chaos Engineering

Everything fails, all the time! In a cloud architecture, while we cannot avoid failures completely, we can plan for them instead of simply accepting them. So how do you define a reliable application? Here are the characteristics of a resilient application:

  • Recover gracefully from failures and continue to operate
  • Highly available (HA) and run as designed with no significant downtime

We built resiliency into our architecture by following several design patterns. A few examples are given below:

  • Build redundancy everywhere. No SPOF. Deploy the services into multiple regions to circumvent complete regional failures
  • Throttling to withstand sudden spikes in traffic and bot attacks
  • Resilient inter-service communication, with timeouts & retries configured for every remote service call
  • Circuit breakers for each remote service, with fallback logic in place to handle failures
  • Bulkheads to avoid complete failure when a part of the system fails (a sketch combining these patterns follows this list)
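
The post does not name a specific resiliency library, but in a Java stack these patterns are commonly wired together with something like Resilience4j. The sketch below decorates a hypothetical inventory call with a retry, a circuit breaker and a bulkhead, and falls back to a default value when the call fails or the breaker is open; request timeouts would typically be configured on the underlying HTTP client.

```java
import io.github.resilience4j.bulkhead.Bulkhead;
import io.github.resilience4j.bulkhead.BulkheadConfig;
import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;
import io.github.resilience4j.retry.Retry;
import io.github.resilience4j.retry.RetryConfig;

import java.time.Duration;
import java.util.function.Supplier;

public class InventoryClient {

    // Open the breaker when 50% of recent calls fail; probe again after 30 seconds.
    private final CircuitBreaker circuitBreaker = CircuitBreaker.of("inventory",
            CircuitBreakerConfig.custom()
                    .failureRateThreshold(50)
                    .waitDurationInOpenState(Duration.ofSeconds(30))
                    .build());

    // Retry transient failures a couple of times with a short pause.
    private final Retry retry = Retry.of("inventory",
            RetryConfig.custom()
                    .maxAttempts(3)
                    .waitDuration(Duration.ofMillis(100))
                    .build());

    // Bulkhead: cap concurrent calls so one slow dependency cannot exhaust all threads.
    private final Bulkhead bulkhead = Bulkhead.of("inventory",
            BulkheadConfig.custom().maxConcurrentCalls(25).build());

    public String getInventory(String itemId) {
        Supplier<String> call = () -> callRemoteInventoryService(itemId); // hypothetical remote call
        Supplier<String> decorated =
                Bulkhead.decorateSupplier(bulkhead,
                        CircuitBreaker.decorateSupplier(circuitBreaker,
                                Retry.decorateSupplier(retry, call)));
        try {
            return decorated.get();
        } catch (Exception e) {
            return fallbackInventory(itemId); // degrade gracefully instead of failing the page
        }
    }

    private String callRemoteInventoryService(String itemId) {
        // Placeholder for the real HTTP call, where connect/read timeouts would be configured.
        return "IN_STOCK";
    }

    private String fallbackInventory(String itemId) {
        return "UNKNOWN"; // e.g. serve a cached or default value
    }
}
```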

Fault Injection Testing

To simulate failures, faults are injected and the behaviour is verified. We use various tools like Istio and Gremlin for this purpose.

Agile & DevOps Practice

Every microservice is designed in such a way that it is owned by a single team. The team operates using the STOSA model (single team owned service architecture), where it has end-to-end responsibility spanning architecture, development, testing, deployment, monitoring and addressing production issues. Development is done using agile methodologies. The source code is stored in Git, and the team follows trunk-based development as the branching strategy. For testing, the team relies heavily on automated tests, and the code is continuously tested using CI/CD best practices. Here is the CI/CD model the team follows:

CI/CD

Observability

One of the most important characteristics of a service is its observability. Each service pushes its application logs to a centralised Splunk cluster for aggregation and search. For real-time monitoring and alerting, we use Prometheus as the metrics store and Grafana for the dashboards. We initially started with Azure Application Insights but later moved to Splunk and Prometheus for performance reasons.
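
As an example of how a service can expose these SLIs, here is a small sketch using Micrometer's Prometheus registry; the post does not name the instrumentation library, and the metric names are illustrative. Prometheus scrapes the text returned by scrape() from a /metrics endpoint, and Grafana builds the latency, throughput and error-rate panels from those series.

```java
import io.micrometer.core.instrument.Timer;
import io.micrometer.prometheus.PrometheusConfig;
import io.micrometer.prometheus.PrometheusMeterRegistry;

import java.util.function.Supplier;

public class CheckoutMetrics {

    private final PrometheusMeterRegistry registry =
            new PrometheusMeterRegistry(PrometheusConfig.DEFAULT);

    // SLI: request latency; the avg and P95 panels in Grafana are built from this timer.
    private final Timer checkoutLatency = Timer.builder("checkout_request_latency")
            .publishPercentiles(0.95, 0.99)
            .register(registry);

    // Wrap the checkout call so every request is timed.
    public <T> T timed(Supplier<T> checkoutCall) {
        return checkoutLatency.record(checkoutCall);
    }

    // SLI: error count tagged by HTTP status, used for error-rate dashboards and alerts.
    public void recordError(int httpStatus) {
        registry.counter("checkout_request_errors", "status", String.valueOf(httpStatus))
                .increment();
    }

    // Exposed on a /metrics endpoint for Prometheus to scrape.
    public String scrape() {
        return registry.scrape();
    }
}
```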

Monitoring Setup

The following metrics are part of the dashboard.

  • Kubernetes Dashboard
  • CPU / Memory at the pod level
  • Tomcat thread status at the pod level
  • Garbage Collection stats at the pod level
  • Circuit breaker status
  • Availability
  • Throughput
  • Response time (avg, P95)
  • Errors (based on HTTP status codes)
  • All dependent services (response time, availability, errors)

Dashboards

Cost Management

Cost is a very important aspect of any architecture, and we kept this in mind right from the beginning. In this section, we will see the guidelines we followed and the actions we took to build the most cost-effective solution.

Technology Selection

Cloud providers offer multiple PaaS services for a given domain. For example, for database storage there are services like Cosmos DB, PostgreSQL, MySQL and SQL Server, and every offering has its own pricing model. Also, the pricing differs based on the features selected. For example, Cosmos DB multi-region writes have a different pricing model than single-region writes.

Optimum Cloud Resource Usage

One of the biggest promises of cloud computing is that you pay for what you use, so it is extremely important to monitor resource usage and optimise it. In our case, we used autoscaling features wherever possible, for example AKS cluster autoscaling and Cosmos DB Autopilot, which help ensure the best resource usage and cost-effectiveness.

Cost Reports

As part of Azure Cost Management, it is possible to create reports and slice & dice them across various dimensions. We created reports for every service and defined budgets and alerts for multiple thresholds. This gives us very good visibility into the overall cost, and the respective cost centre owners are notified on a budget breach.

Here is an example of a cost report:

Cost Report

Conclusion

Though a lot of thinking, research and hard work has gone into building the platform, you don't get everything right in the first attempt. The idea is to constantly monitor the systems, observe system behaviour under production traffic and improve with every iteration.
