Wix has a huge scale of traffic. more than 500 billion HTTP requests and more than 1.5 billion Kafka business events per day.
Caching the responses and events is critical for having the best performance and resilience for Wix’s 1500 microservices.
As traffic to your service grows, customers can face much longer loading and response times, and not return to your service. Network costs can increase and the unexpected load on your DBs can potentially cause them to fail. One of the ways you can mitigate these issues is to introduce a cache.
A cache will reduce latency, by avoiding performing a costly query to a DB, or HTTP request to a Wix service or a 3rd party service. It will also reduce the needed scale to service these costly requests.
It will also improve reliability, by making sure some data can be returned even if aforementioned DB or 3rd party service are currently unavailable.
Below are 4 Caching patterns that are used by Wix’s 1500 microservices in order to provide the best experience for our users along with saving costs and increasing availability.
Most of these patterns (partially) rely on libraries and services Wix developed in-house. The motivation for this was to reduce costs & operational overhead and to better integrate with existing infrastructure.
1. Locally Persisted (or S3 backed) Configuration Data Cache
Wix’s Koboshi library offers resilient caching of any configuration/resources that a service may need to rely on, but are obtained from an unreliable source (such as other internal services, databases, or any other remote source).
The library user implements RemoteDataSource interface:
On init, Koboshi will read the potentially stale persisted data. If newer data is available from the external source it will try to bring it in the background in order not to block the app from going up, and optionally periodically afterwards.
The data can be persisted to local disk, or to S3 (so persistence will outlive the kubernetes pod’s life).
Local disk or S3 are the only dependencies in order to achieve greater resiliency.
Read-through cache (interaction with data source is done only through the cache)
Data has to fit into memory, as Koboshi brings the entire “dictionary” (no LRU capability).
Best used for
When the service needs an external source for static configuration which may become unavailable.
Best when the the app can live with stale data or even without it (by using fallback values or if it has other crucial flows it wants to serve that don’t need the data)
Not good for
Dynamically updating data or internally updating data which has to always be up-to-date.
2. Kafka topic based 0-latency Cache
You may be familiar with Apache Kafka as a message broker, but Kafka can also be used to store data.
When an application starts, the Greyhound KVCacheReader will consume all the records from the partitions of a cache topic. It will write them into memory, starting from the beginning of the log (aka partitions). When done, the KVCacheReader will contain the most up-to-date key-value map. This is illustrated in the following image.
The user can decide to block reading values until the store is fully initialized (so that the key-values will be up-to-date), the user can also add a health check that will make sure no incoming requests arrive until the cache is initialized.
The user sets a TTL (Time To Live) for KVCacheWriter, which will set Kafka topic retention time accordingly. The Kafka broker will clean up topics on a best-effort basis, so that records with stale values may still be present for a while after they expire.
Because there is a chance that KVCacheReader will read stale values, it is also set with a TTL. When it reads records, it first checks the write timestamp for each record before deciding whether to write it to the in-memory map as demonstrated in the image below:
The cached data can easily be replicated to other data centers by simply replicating the underlying Kafka topic.
Cache-Aside (The application will go to original data source upon cache miss) or Cache-only (The cache is the source of truth for transient data)
Data has to fit into memory, as Greyhound KVCacheReader brings the entire “dictionary” (no LRU capability).
Best used for
Transient Data that needs to be persisted and can fit in-memory and needs to be accessed with 0-latency
Not good for
Very large datasets (Causes long startup times and may cause Out-Of-Memory exceptions)
3. (Dynamo)DB+CDC based Cache
In order to avoid the data size limitation of loading a Kafka topic backed cache into memory, a more traditional DB can be used for persistence.
Wix uses AWS’s DynamoDB as it’s go-to NoSQL store. It delivers low-latency performance at any scale, and has built-in cross-DC replication.
In order to simplify provisioning, Wix has built a simple service called DataStore that manages virtual “tables” in physical DynamoDB tables.
While there is a caching service by AWS called DAX (Amazon DynamoDB Accelerator), it does incur extra latency (in addition to the extra latency that Wix’s dataStore service incurs) and reduces availability.
Wix decided to create an in-memory library to gain 0-latency read performance for the vast majority of read requests.
The library, called CachedKVStore, sets-up a CDC (change-data-capture) Kafka topic that contains all updates done to a specific virtual table in DynamoDB (another service is responsible for processing events from DynamoDB Streams via Kinesis adapter and producing them to the dedicated Kafka topic).
Then, CachedKVStore consumes from the Kafka topic and updates a LRU cache (traditionally implemented with a hashmap and a doubly-linked list) where least-used key-values are discarded when new values from the Kafka topic are inserted.
Wix chose DynamoDB (together with auto-provisioning + CDC + LRU Cache) instead of Redis, because Redis has a higher cost & operational overhead.
Read-through cache (read interaction with DynamoDB is done only through the cache). Also called lazy loading
No limitation on DB data size, LRU cache size is dependent on configuration (which has upper bound of available application/host memory)
Best used for
Make DB read access faster for “popular” values
Not good for
Highly critical application startup information (DynamoDB/Kafka can be inaccessible)
4. HTTP Reverse Proxy Caching (Varnish cache)
A few years ago, as the world moved to view sites on mobile devices, Wix switched to SSR (Server Side Rendering) to render its sites, in order to improve performance by utilizing the Server more powerful CPUs/GPUs.
In order to optimize performance further, a Varnish Cache based layer was introduced between the load balancers and the site rendering services.
A lot of the work has been done to be able to cache all the dynamic content provided by Wix’s many different services (including user defined custom site code).
This is done either by requesting to invalidate the cached data (e.g. when a Wix site is edited and republished), or by encoding the version of dynamic content as part of unique URLs, so invalidation won’t be needed at all.
Cache Invalidation is done through a dedicated Purger service. As illustrated by the image above, on each purge-request Kafka event this service consumes, it sends a purge request to Varnish with the relevant
ETag associated with the cached HTTP request (e.g.
ETag for SiteId when the Site rendering is cached)
Size limit depends on the configured Storage backends
- malloc is memory based — so the size is limited by (more) expensive memory cost
- file is memory-backed-by-disk based — which is limited by less expensive disk cost, but can cause performance penalty due to fragmentation…
- Massive Storage Engine (MSE) — also backed by file, but has a “fragmentation proof algorithm.” part of Varnish Cache Plus (available for paying customers only)
Best used for
Caching of any external URL HTTP Response (Including HTML, Json, even Rest endpoints like app-settings)
Not good for
Caching internal service configuration, critical service startup data.
Each performance and resilience use case has its own optimal caching technique and features.
It’s important to remember that a cache should never be a critical component of the design of your service and your system should be working perfectly fine without it.
This is especially true if you’re looking to optimize performance. Maybe a change in your service code can make its response time drop significantly.
When choosing a caching patterns you should answer the following questions:
- Is your data source reliable?
- Does all data fit into memory?
- Is the data-to-be-cached highly dynamic, and can easily become stale?
- Do you pay high Ingress costs (cost to enter a data center network)?
Below you will find a handy flowchart to make it easy for you to choose the right cache pattern for your use case.
Thank you for reading!
If you’d like to get updates on my future software engineering blog posts, follow me on Twitter and Medium.
You can also visit my website, where you will find my previous blog posts, talks I gave in conferences and open-source projects I’m involved with.
If anything is unclear or you want to point out something, please comment down below.