The 10 Commandments of Building a Resilient, Robust, Scalable, and Never-Fail Web App
Too often, new web apps go live, sometimes with a lot of PR, and then they crash. When I tried to figure out why, I found that it is not because of the quality of the engineers, nor the quality of the IT team. Usually, the reason is the lack of knowledge and experience in designing and building a resilient, robust, scalable, and never-fail web application. Even nowadays, the knowledge is not widespread, and the same mistakes are made repeatedly, causing a lot of embarrassment, and harming our most valuable asset — our customers. In this article, I will describe the ten most important design and architecture components that should be implemented in a system if you want to achieve those goals.
In this article, “server” means a logical server. A logical server can be a physical server, a virtual server, and/or a container (my preference).
Like the Biblical Ten Commandments, all are equally important and the order does not matter.
1. CDN
Put all your read-only assets (CSS, images, video, audio, etc.) on a CDN (Content Delivery Network) service such as Akamai.
Your users will consume all these assets from dedicated servers that are at a nearby site (their country or zone), without affecting your servers and bandwidth, and this will provide the best performance, no matter how many users are accessing your system.
Make sure that all read-only assets in your application have a dedicated separate URL from the app URL.
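A minimal sketch of keeping asset URLs separate from the app URL. The host names, the `asset_url` helper, and the version prefix used for cache busting are all illustrative assumptions, not part of any specific CDN's API.

```python
from urllib.parse import urljoin

# Hypothetical hosts: read-only assets live on a CDN host, the app on its own host.
ASSET_BASE_URL = "https://assets.example-cdn.com/"
APP_BASE_URL = "https://app.example.com/"


def asset_url(path: str, version: str) -> str:
    """Build the CDN URL of a read-only asset.

    The version prefix is an assumption: it lets a new release publish new
    files without waiting for cached copies of the old ones to expire.
    """
    return urljoin(ASSET_BASE_URL, f"{version}/{path.lstrip('/')}")
```

Templates would then call `asset_url("css/site.css", "v42")` instead of hard-coding paths under the app host, so asset traffic never reaches the app servers.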
2. Auto Scale with No SPoF
Duplicate each system component with a minimum of two instances, so there is no single point of failure (SPoF). The actual number of instances of each component, and each component’s configuration, should follow from capacity planning and system sizing. The best practice is to plan for the maximum capacity of a standard business day plus 25% extra.
All components must have auto-scaling capability. Whenever the load measured at the load balancer reaches a predefined threshold, additional resources are automatically provisioned and added to the system on the fly.
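The sizing rule above can be sketched as simple arithmetic. The function name, the requests-per-second unit, and the per-instance capacity figure are assumptions for illustration; real sizing would come from load tests.

```python
import math


def required_instances(peak_load_rps: float, per_instance_rps: float,
                       headroom: float = 0.25, minimum: int = 2) -> int:
    """Instances needed for the daily peak plus headroom, never fewer than two.

    peak_load_rps: maximum load of a standard business day (assumed unit: requests/s).
    per_instance_rps: measured capacity of one instance (assumed to come from load tests).
    """
    target = peak_load_rps * (1 + headroom)
    return max(minimum, math.ceil(target / per_instance_rps))
```

For a peak of 1000 requests per second and instances that each handle 300, the target is 1250, which rounds up to five instances. A very small system still gets two, to avoid a single point of failure.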
3. Stateless
All operations, transactions, and APIs within the system must be stateless. Different messages from the same client form, and certainly different steps in a process, can be routed to different servers.
When there is a need to manage a state (user preferences, user credentials, user parameters, wizard data, business process data, session data, etc.), the state data will be stored in an in-memory state database such as Redis. Every user session should have a session token where, in the state database, all state data will be stored in a JSON document with the session token as key.
Managing the session token helps with application security, because all credentials are kept there, and you can easily drop users who quit the application without logging out. You can also drop users who have not interacted with the system for a certain amount of time and, in high-security systems, replace the token after each transaction.
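A minimal sketch of the session-token idea, assuming a plain dictionary as a stand-in for the state database (a real system would use something like Redis with key expiry). The `SessionStore` class and its method names are hypothetical.

```python
import json
import secrets
import time


class SessionStore:
    """In-memory stand-in for a state database (assumption: not real Redis)."""

    def __init__(self, idle_timeout_seconds: float = 900.0, clock=time.monotonic):
        self._data = {}  # token -> (JSON document, last-access time)
        self._idle_timeout = idle_timeout_seconds
        self._clock = clock

    def create(self, state: dict) -> str:
        """Store the state as a JSON document under a fresh random token."""
        token = secrets.token_urlsafe(32)
        self._data[token] = (json.dumps(state), self._clock())
        return token

    def load(self, token: str):
        """Return the state, or None if the token is unknown or idle too long."""
        entry = self._data.get(token)
        if entry is None:
            return None
        document, last_seen = entry
        if self._clock() - last_seen > self._idle_timeout:
            del self._data[token]  # drop users who stopped interacting
            return None
        self._data[token] = (document, self._clock())
        return json.loads(document)

    def save(self, token: str, state: dict) -> None:
        """Overwrite the state document of an existing session."""
        self._data[token] = (json.dumps(state), self._clock())

    def rotate(self, token: str):
        """Replace the token (for example after each transaction); old token dies."""
        state = self.load(token)
        if state is None:
            return None
        del self._data[token]
        return self.create(state)
```

Because every server reads the state by token from the shared store, any server can handle any request, which is what makes the servers themselves stateless.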
4. Cache and In-Memory State Database
All state data, system parameters, code tables, metadata, session data, and data cache should be stored and managed in a resilient, robust multi-node in-memory NoSQL database, such as Redis or Memcached.
Consider having two instances of this database: persistent (slower but safe) for process/transaction data, and non-persistent (faster but not safe) for volatile data such as code tables, metadata, system parameters, etc.
You must build a process to populate this database when the system loads and update it whenever there is a change in the relevant data. The update process should be triggered from the GUI layer, the business logic (BL) layer, and/or the data access layer (DAL).
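The populate-and-update process could look roughly like the sketch below. Both the system of record and the cache are plain dictionaries here, which is an assumption made to keep the example self-contained; `CachedCodeTable` is a hypothetical name.

```python
class CachedCodeTable:
    """Keeps a cached copy of a code table in step with its source."""

    def __init__(self, source: dict):
        self._source = source  # stand-in for the database table (assumption)
        self._cache = {}       # stand-in for the in-memory database (assumption)

    def warm_up(self) -> None:
        """Populate the cache when the system loads."""
        self._cache = dict(self._source)

    def get(self, code: str):
        """Reads are served from the cache only."""
        return self._cache.get(code)

    def update(self, code: str, label: str) -> None:
        """Write to the system of record first, then refresh the cached copy."""
        self._source[code] = label
        self._cache[code] = label
```

Routing every change through one `update` path is the design choice that keeps the cache from drifting: whichever layer triggers the change, the cached copy is refreshed in the same step.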
5. Asynchronous
Every transaction that does not have to be synchronous must be asynchronous. It is faster (quicker feedback to the user), safer, more fault tolerant, and more secure. Asynchronous processing should be implemented using an enterprise-grade, robust, and resilient queue, such as RabbitMQ, IBM MQ, Kafka, or Redis.
When high performance and real 24/7 availability are key system requirements, consider using the queue for all database transactions: read data from the cache, and update data (CRUD) via an asynchronous message to the data access layer (DAL), which will update the database and the cache.
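A minimal sketch of that write path, assuming Python's standard `queue.Queue` and a worker thread as stand-ins for an enterprise queue and a DAL server. The database and cache are plain dictionaries, and `start_dal_worker` is a hypothetical helper.

```python
import queue
import threading

STOP = None  # sentinel message that tells the worker to shut down


def start_dal_worker(messages: queue.Queue, database: dict, cache: dict) -> threading.Thread:
    """Start a DAL worker that applies queued updates to the database and the cache."""

    def run() -> None:
        while True:
            message = messages.get()
            try:
                if message is STOP:
                    return
                key, value = message
                database[key] = value  # update the system of record
                cache[key] = value     # then refresh the cached copy for readers
            finally:
                messages.task_done()   # runs for every message, including STOP

    worker = threading.Thread(target=run, daemon=True)
    worker.start()
    return worker
```

The caller only enqueues `(key, value)` and returns to the user immediately; readers keep reading from the cache, which the worker refreshes as part of the same step. A production queue would also need durability and retry handling, which this sketch leaves out.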
For multistep processes, consider using a workflow or batch framework, such as Spring Batch.
6. No ORM
Do not use an ORM (Hibernate/NHibernate, Entity Framework, etc.). Build your DAL with native database access (e.g., JDBC or ADO.NET) using ANSI SQL as much as possible, or at least use a lightweight ORM (such as Dapper in Microsoft’s world).
ORMs are between three and ten times slower than direct access (JDBC, ADO.NET, etc.). The cost of multiplying your infrastructure by three will always be more than the cost of building your data access layer (DAL) without an ORM.
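As an illustration of a hand-written DAL over native database access, here is a small sketch using Python's built-in `sqlite3` driver as a stand-in for JDBC or ADO.NET. The `customer` table, the `Customer` type, and the `CustomerDal` class are assumptions made up for the example.

```python
import sqlite3
from dataclasses import dataclass


@dataclass
class Customer:
    id: int
    name: str


class CustomerDal:
    """Data access layer that maps rows to objects by hand, with plain SQL."""

    def __init__(self, connection: sqlite3.Connection):
        self._conn = connection

    def create_schema(self) -> None:
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS customer (id INTEGER PRIMARY KEY, name VARCHAR(100))"
        )

    def insert(self, name: str) -> int:
        # Parameterized statement: the driver binds the value, no string building.
        cursor = self._conn.execute("INSERT INTO customer (name) VALUES (?)", (name,))
        self._conn.commit()
        return cursor.lastrowid

    def find_by_id(self, customer_id: int):
        row = self._conn.execute(
            "SELECT id, name FROM customer WHERE id = ?", (customer_id,)
        ).fetchone()
        return Customer(*row) if row else None
```

The mapping code is explicit and small: each method issues exactly the SQL it shows, so there is no generated query to discover during a performance investigation.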
7. Database is Database Only
The database should be a database only; no code should reside in the database. This means that there shall be no stored procedures (stored procedures that perform atomic select and CRUD actions without business logic are OK), no referential integrity, no constraints, and no triggers in the database. All business logic (including constraints and referential integrity enforcement when they are really needed) must be in the business logic layer (BL), and all data logic must be in the data access layer (DAL).
Using stored procedures prevents scaling (or makes it very expensive), creates database dependency and vendor lock-in, and slows the system (business logic competes with read and write operations for the same database server resources). In addition, debugging, logging, unit testing, and load/stress testing become very complicated.
If the organization must be SOX-compliant, additional issues arise, for the database firewall cannot prevent illegal operations and cannot track database changes — all it knows is the name of the stored procedure.
8. Multiple Databases
In any modern system, the likelihood of a single traditional OLTP SQL row-based database being suitable for all of an app’s business needs is a big ZERO! Modern systems have different types of data, different types of business entities, and different types of business operations; each one requires a database system optimized for its specific purpose.
In a typical setup, you will have the following databases:
(1) Row-based tabular SQL for transactions and ACID compliant operations
(2) Document for hierarchical objects (such as insurance policy, sale basket)
(3) Search for text search and logs
(4) In-memory for cache and pre-fetch
and maybe also:
(5) Columnar for batch processes and calculations based on specific columns
(6) Graph for hierarchical search
In most cases, you will store the same data in more than one database, each suited to its use case: for example, a row-based SQL database for transaction data, joined data, and the data warehouse, and a document database for fast retrieval.
A simple example that illustrates the benefit of storing data twice, in SQL and in a document NoSQL database, is shown in the diagram below. Retrieving a typical invoice or sale order requires one disk read when it is stored in a document database and twelve disk reads when it is stored in a SQL database.
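The contrast can be sketched in code, under loud assumptions: `sqlite3` stands in for the SQL database, a dictionary of JSON strings stands in for the document database, and the two-table order schema is invented for the example. The sketch counts lookups, not disk reads; actual I/O depends on the engine and its caching.

```python
import json
import sqlite3


def build_relational_store() -> sqlite3.Connection:
    """Create a normalized order: one header row plus one row per order line."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sale_order (id INTEGER PRIMARY KEY, customer TEXT)")
    conn.execute(
        "CREATE TABLE order_line (order_id INTEGER, line_no INTEGER, sku TEXT, quantity INTEGER)"
    )
    conn.execute("INSERT INTO sale_order (id, customer) VALUES (1, 'Acme')")
    conn.executemany(
        "INSERT INTO order_line (order_id, line_no, sku, quantity) VALUES (?, ?, ?, ?)",
        [(1, 1, "SKU-1", 2), (1, 2, "SKU-2", 5)],
    )
    conn.commit()
    return conn


def load_order_relational(conn: sqlite3.Connection, order_id: int) -> dict:
    """Assemble the order from two queries (more tables would mean more queries)."""
    header = conn.execute(
        "SELECT id, customer FROM sale_order WHERE id = ?", (order_id,)
    ).fetchone()
    lines = conn.execute(
        "SELECT sku, quantity FROM order_line WHERE order_id = ? ORDER BY line_no",
        (order_id,),
    ).fetchall()
    return {
        "id": header[0],
        "customer": header[1],
        "lines": [{"sku": sku, "quantity": qty} for sku, qty in lines],
    }


def build_document_store(order: dict) -> dict:
    """Store the whole order as one JSON document under one key."""
    return {f"order:{order['id']}": json.dumps(order)}


def load_order_document(documents: dict, order_id: int) -> dict:
    """Fetch the complete order with a single key lookup."""
    return json.loads(documents[f"order:{order_id}"])
```

Both loaders return the same object; the difference is how many lookups it took to assemble it. The relational copy remains the place for transactions and joins, while the document copy serves fast retrieval.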
9. Near Online Data Warehouse
A common system includes read and write (CRUD) operations, both competing for the same resource: the database servers. Moreover, most reports, dashboards, and even entity displays require joining tables, calculating statistics, summarizing data, sorting data, and so on, each of which is a very expensive database operation.
You should create a near online data warehouse that is updated almost immediately. Updating the near online data warehouse is the responsibility of the DAL, via asynchronous messages to a dedicated business logic server.
This data warehouse should include denormalized business entities, summaries and statistics, and any real-time required data for business processes, such as dashboard and reports, alerts, balance, inventory level, etc.
Most of your read operations should be from your near online data warehouse rather than your OLTP database.
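A minimal sketch of how the DAL might feed a near online data warehouse. A `deque` stands in for the message queue, the summaries live in memory, and the class names, event shape, and integer-cents amounts are all assumptions for illustration.

```python
from collections import defaultdict, deque


class NearOnlineWarehouse:
    """Holds pre-computed, denormalized summaries that reads can use directly."""

    def __init__(self):
        self.sales_cents_by_customer = defaultdict(int)
        self.order_count = 0

    def apply(self, event: dict) -> None:
        """Fold one change event into the summaries."""
        if event["type"] == "order_created":
            self.sales_cents_by_customer[event["customer"]] += event["total_cents"]
            self.order_count += 1


class OrderDal:
    """Writes to the OLTP store and publishes a message for the warehouse."""

    def __init__(self, outbox: deque):
        self.orders = {}       # stand-in for the OLTP database (assumption)
        self._outbox = outbox  # stand-in for the message queue (assumption)

    def create_order(self, order_id: int, customer: str, total_cents: int) -> None:
        self.orders[order_id] = {"customer": customer, "total_cents": total_cents}
        self._outbox.append(
            {"type": "order_created", "order_id": order_id,
             "customer": customer, "total_cents": total_cents}
        )


def drain(outbox: deque, warehouse: NearOnlineWarehouse) -> int:
    """Consumer side: apply all pending messages and return how many were handled."""
    processed = 0
    while outbox:
        warehouse.apply(outbox.popleft())
        processed += 1
    return processed
```

A dashboard that needs total sales per customer then reads one pre-computed value instead of running a join and a sum against the OLTP database.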
10. Load Balance and Three Tiers
Build your system using true three tiers with a load balancer between tiers: a load balancer between your users and the web servers (GUI layer), a load balancer between your web servers and the business logic servers (BL layer), and a load balancer between your business logic servers and your data access layer servers (DAL).
Use load balancers (LB) and not NLB (network load balancer). NLB will balance only connections, where most of the time, what you really need is to balance CPU, memory, IO, and bandwidth, and that is what LB does.
Do not use sticky sessions in your load balancer, as your system must be stateless.
If possible, use more than one database server, either as a cluster or via CRUD node and separated mirrored read nodes.
All layers should communicate with each other via APIs only, preferably RESTful for synchronous calls and a message bus/queue for asynchronous ones. The GUI layer will access the BL and DAL exactly as external systems do; therefore their APIs can and should be exposed via an API gateway, micro gateway, or service mesh (K8s).
In your DAL, enforce read/write splitting (one connection string for CRUD operations and one for read operations). This will enable splitting the load between several servers, where one updatable server will handle the CRUD operations and multiple load balanced read replicas will handle the read operations (there are more reads than writes).
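Read/write splitting in the DAL can be sketched as a small router that hands out the right connection string. The class name, the placeholder connection strings, and the round-robin choice of replica are assumptions; in practice a load balancer in front of the replicas may do the spreading instead.

```python
import itertools


class ConnectionRouter:
    """Chooses a connection string by operation type."""

    def __init__(self, write_dsn: str, read_dsns: list):
        if not read_dsns:
            raise ValueError("at least one read replica is required")
        self._write_dsn = write_dsn
        self._read_cycle = itertools.cycle(read_dsns)  # simple round-robin

    def for_write(self) -> str:
        """All updates go to the single updatable server."""
        return self._write_dsn

    def for_read(self) -> str:
        """Reads are spread across the replicas."""
        return next(self._read_cycle)
```

Making the caller state its intent (`for_read` or `for_write`) avoids guessing from the SQL text, which tends to misclassify statements such as a select that takes locks.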
The DAL should handle all database operations: implementing data access permissions, mapping logical entities to database objects (such as tables and views), and exposing data services. The preferred method for DAL APIs is the OData standard.
For read operations that occur at the GUI layer (such as populating a list/grid, populating dropdowns, lookups, displaying entity data, etc.), bypass the BL and access the DAL directly via data services.
A proper architecture based on these ten principles can be the difference between a working system that earns kudos and a catastrophic failure that embarrasses our organization and requires a lot of firefighting.
Good architecture and good design are the key!