Creating a Reactive Cache Policy
By Varun Palavalasa, Senior Software Engineer — .NET
We are the Checkout Experience team, the backbone of purchases on Farfetch.com. We build and run the application that handles the checkout process: every order placed on Farfetch.com is processed and fulfilled through our platform.
The Checkout Experience interacts with its backend systems, a complex ecosystem that supports highly scalable, cross-platform transactions. Designing platforms of this nature requires building resilient systems.
We identify Single Points Of Failure (SPOFs) within the Checkout Experience and mitigate them. We rely on various metrics to understand the behaviour of the systems the Checkout Experience depends on: availability and reliability, along with other metrics in our New Relic, Kibana and Prometheus monitoring dashboards, help us understand those systems and their vulnerabilities. Once we detect a SPOF, we brainstorm and apply mitigation techniques using resilience libraries like Polly.
Polly is an open-source resilience framework supported by the .NET Foundation that defines a broad set of resilience policies to help systems become more fault tolerant. The sequence diagram below (taken from the Polly wiki) shows the flow of events in Polly and describes how policies are applied.
Preemptive Cache policy:
Once configured, policies intercept outgoing client requests and come into action when an exception is raised. The diagram above depicts the internal flow of policies, which execute in chronological order when enacted. As the diagram shows, when a cache policy is configured, it serves the client response from cache before making a call to the endpoint. When no cached value exists, it waits for the response and, once available, saves it to cache. The response stays in cache according to its TTL (Time-To-Live: a configured duration that defines the lifetime of the cache key). Polly has a built-in interface (IAsyncCacheProvider) that supports configuring the cache policy with a distributed cache.
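A minimal sketch of what this looks like, assuming Polly v7's syntax; `cacheProvider` and the client call are illustrative stand-ins, not the actual Farfetch code:

```csharp
using Polly;
using Polly.Caching;

// Sketch only: Polly's built-in (preemptive) cache policy.
// "cacheProvider" is assumed to be an IAsyncCacheProvider implementation,
// e.g. one backed by a distributed cache such as Redis.
var cachePolicy = Policy.CacheAsync(
    cacheProvider,
    TimeSpan.FromMinutes(5));   // TTL: the cached response lives for five minutes

// By default, the Context's OperationKey is used as the cache key.
var paymentMethods = await cachePolicy.ExecuteAsync(
    ctx => paymentServiceClient.GetPaymentMethodsAsync(),  // illustrative client call
    new Context("paymentmethods"));
```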
Reactive Fallback policy:
The Fallback policy, unlike the Cache policy, returns the fallback value only after the endpoint throws an exception back to the caller. Moreover, it has no built-in support for IAsyncCacheProvider.
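A sketch of the reactive behaviour, assuming Polly v7; `PaymentMethods` and its `Empty` value are illustrative types:

```csharp
using System.Net.Http;
using Polly;

// Sketch only: a Fallback policy returns its fallback value only after
// the call throws; nothing is intercepted on the happy path.
var fallbackPolicy = Policy<PaymentMethods>
    .Handle<HttpRequestException>()
    .FallbackAsync(PaymentMethods.Empty);

var result = await fallbackPolicy.ExecuteAsync(
    () => paymentServiceClient.GetPaymentMethodsAsync());
```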
More details about each policy are documented in the Polly wiki.
The Cache policy is preemptive by nature: unlike the Fallback policy, it doesn't wait for an exception to arise from the endpoint. As a result, it is not an appropriate solution when the endpoint response changes over time at short intervals. Even with an efficient TTL strategy, we would end up making too many writes to the cache, raising the risk of excessive IO operations.
Why Reactive over Preemptive?
We could adopt the Fallback policy; however, it has no support for an external cache provider like Redis. So we decided to use Polly's custom policy engine to define the reactive behaviour of the Cache policy. To support our decision, let's walk through an example:
The Fig PaymentMethodsFlow shown below explains how online payments are used during the Checkout process.
PaymentService, aka ECommerce PCI, assists the user by redirecting the chosen payment to the corresponding payment gateway.
The user enters the Checkout Experience to purchase the items in their shopping bag. The Checkout Experience offers a list of payment options (PayPal, credit/debit card, Google Pay) to facilitate the online purchase. To make this possible, the Checkout Experience calls PaymentService to retrieve the PaymentMethods. PaymentService evaluates a set of pre-conditions (the current culture of the user, their currency preferences, their purchase history) and returns the payment methods that match their preferences.
The flow shows how important PaymentService is for a smoothly functioning Checkout Experience and for fulfilling an order. Since PaymentService is a SPOF, it is important that we minimize the impact on the Checkout Order flow when PaymentService is down. When the user logs into the Checkout Experience, a Checkout Order is created for them; it tracks the progress of the order from initiation to fulfilment. From the CheckoutOrder's point of view, the order is only considered complete when the online payment succeeds. If online payment is interrupted for any reason, the CheckoutStatus of the order is frozen, a state from which there is no return. This invalidates the order and, even if PaymentService comes back online, there is no way to continue with it.
To mitigate the impact, we identified two important methods of PaymentService on which the Checkout flow depends: GetPaymentMethodsAsync and GetUserTokensAsync.
We decided to degrade the user experience and provide payment options relevant to the country, rather than the user. We believe this is a better approach to minimize the problem.
Going one step further into the flow, GetPaymentMethodsAsync is context-specific (i.e., it depends on which country you are browsing from) and GetUserTokensAsync is user-specific (i.e., which payment options are associated with a user).
We can still keep the Checkout flow running without relying on UserTokens: when a client throws an exception, we can choose to return empty tokens to the Checkout flow so that it assumes this is a new user with no user-specific payment options. We also return empty tokens via the Fallback policy when PaymentService is down.
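This empty-tokens fallback might be sketched as follows, assuming Polly v7; `UserToken` and the handled exception types are illustrative:

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Net.Http;
using Polly;

// Sketch: return empty user tokens when PaymentService fails, so the
// Checkout flow treats the user as new, with no stored payment options.
var userTokensFallback = Policy<IEnumerable<UserToken>>
    .Handle<HttpRequestException>()
    .Or<TaskCanceledException>()
    .FallbackAsync(Enumerable.Empty<UserToken>());
```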
Since GetPaymentMethodsAsync is context-specific, we wanted to cache the PaymentMethods and return only those associated with the country. We cannot achieve that with the Fallback policy; it is only possible through the Cache policy. However, the default behaviour of the Cache policy serves PaymentMethods without verifying whether PaymentService is throwing an exception or returning a different set of PaymentMethods, and this is exactly where the reactive behaviour of the Cache policy makes the difference.
How to define Reactive behaviour:
Using Polly’s Custom policy engine patterns, we can define the reactive behaviour as shown below:
Step1: Define a class and name it AsyncReactiveCachePolicy. Extend the class from AsyncPolicy.
Step2: Define a CacheProvider (either in-memory or distributed) that implements the Polly interface IAsyncCacheProvider. Additionally, define an ITtlStrategy, a strategy that determines the Time-To-Live of the cache key, and a CacheKeyStrategy that defines the structure of the cache key, feeds it to the context and retrieves it at execution time.
Step3: Define callback hooks for probing and logging purposes.
Step4: Create an internal constructor. Throw the necessary exceptions so that configuration errors bubble up at this higher level of the hierarchy.
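A sketch of Steps 1 to 4, assuming Polly v7's abstractions; the field and parameter names here are illustrative, not the exact Farfetch implementation:

```csharp
using System;
using Polly;
using Polly.Caching;

public class AsyncReactiveCachePolicy<TResult> : AsyncPolicy<TResult>
{
    // Step2: the cache provider, TTL strategy and cache-key strategy.
    private readonly IAsyncCacheProvider<TResult> _cacheProvider;
    private readonly ITtlStrategy _ttlStrategy;
    private readonly Func<Context, string> _cacheKeyStrategy;

    // Step3: callback hooks for probing and logging.
    private readonly Action<Context, string> _onCacheGet;
    private readonly Action<Context, string> _onCacheMiss;
    private readonly Action<Context, string> _onCachePut;
    private readonly Action<Context, string, Exception> _onCacheGetError;
    private readonly Action<Context, string, Exception> _onCachePutError;

    // Step4: internal constructor that bubbles up configuration errors.
    internal AsyncReactiveCachePolicy(
        IAsyncCacheProvider<TResult> cacheProvider,
        ITtlStrategy ttlStrategy,
        Func<Context, string> cacheKeyStrategy,
        Action<Context, string> onCacheGet,
        Action<Context, string> onCacheMiss,
        Action<Context, string> onCachePut,
        Action<Context, string, Exception> onCacheGetError,
        Action<Context, string, Exception> onCachePutError)
    {
        _cacheProvider = cacheProvider ?? throw new ArgumentNullException(nameof(cacheProvider));
        _ttlStrategy = ttlStrategy ?? throw new ArgumentNullException(nameof(ttlStrategy));
        _cacheKeyStrategy = cacheKeyStrategy ?? throw new ArgumentNullException(nameof(cacheKeyStrategy));
        _onCacheGet = onCacheGet;
        _onCacheMiss = onCacheMiss;
        _onCachePut = onCachePut;
        _onCacheGetError = onCacheGetError;
        _onCachePutError = onCachePutError;
    }

    // Step5: the ImplementationAsync override defines the runtime behaviour.
}
```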
Note: onCacheGet, onCacheMiss and onCachePut are callback hooks invoked during cache operations. onCacheGetError and onCachePutError are callback hooks raised when an exception occurs during a cache operation.
Step5: Override the base class virtual method ImplementationAsync to define the behaviour of the policy at execution time.
Note: You can define custom behaviour in the override and raise relevant events through the callback hooks, letting the user define further actions.
You can see the onCacheGet and onCacheMiss callback hooks raised depending on whether the cache entry is found at query time. Callback hooks can be used to define custom telemetry, such as logging or custom monitoring events.
Step6: In Step5, we save the value to cache when the client call returns successfully; if there is an exception instead, we try to retrieve the value from the cache. This is exactly how we make the cache policy reactive: it acts based on the result returned by the client.
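The override might look like the sketch below, assuming Polly v7's AsyncPolicy&lt;TResult&gt; base class and a policy that holds an IAsyncCacheProvider&lt;TResult&gt;, an ITtlStrategy, a cache-key strategy and the callback hooks as private fields (all names illustrative):

```csharp
// Sketch of Steps 5 and 6: call the client first, refresh the cache on
// success, and read from the cache only when the client call throws.
protected override async Task<TResult> ImplementationAsync(
    Func<Context, CancellationToken, Task<TResult>> action,
    Context context,
    CancellationToken cancellationToken,
    bool continueOnCapturedContext)
{
    string cacheKey = _cacheKeyStrategy(context);
    try
    {
        // Happy path: call the endpoint, then save the fresh value to cache.
        TResult result = await action(context, cancellationToken)
            .ConfigureAwait(continueOnCapturedContext);
        try
        {
            Ttl ttl = _ttlStrategy.GetTtl(context, result);
            await _cacheProvider.PutAsync(cacheKey, result, ttl,
                    cancellationToken, continueOnCapturedContext)
                .ConfigureAwait(continueOnCapturedContext);
            _onCachePut?.Invoke(context, cacheKey);
        }
        catch (Exception cacheException)
        {
            // A failing cache write must not fail a successful client call.
            _onCachePutError?.Invoke(context, cacheKey, cacheException);
        }
        return result;
    }
    catch (Exception)
    {
        // Reactive part: the endpoint failed, so try the last cached value.
        bool cacheHit = false;
        TResult cachedValue = default;
        try
        {
            (cacheHit, cachedValue) = await _cacheProvider
                .TryGetAsync(cacheKey, cancellationToken, continueOnCapturedContext)
                .ConfigureAwait(continueOnCapturedContext);
        }
        catch (Exception cacheException)
        {
            _onCacheGetError?.Invoke(context, cacheKey, cacheException);
        }

        if (cacheHit)
        {
            _onCacheGet?.Invoke(context, cacheKey);
            return cachedValue;
        }

        _onCacheMiss?.Invoke(context, cacheKey);
        throw; // nothing cached: let the original exception bubble up
    }
}
```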
Step7: Define a static class named CustomPolicy that contains the overloads for creating an AsyncReactiveCachePolicy object.
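A sketch of such a factory class; the overload shapes are illustrative, while RelativeTtl is Polly's built-in fixed-duration ITtlStrategy:

```csharp
using System;
using Polly;
using Polly.Caching;

public static class CustomPolicy
{
    // Simplest overload: fixed TTL, cache key taken from the Context's OperationKey.
    public static AsyncReactiveCachePolicy<TResult> ReactiveCacheAsync<TResult>(
        IAsyncCacheProvider<TResult> cacheProvider,
        TimeSpan ttl)
        => ReactiveCacheAsync(
            cacheProvider,
            new RelativeTtl(ttl),
            context => context.OperationKey);

    // Full overload: the caller supplies the TTL and cache-key strategies.
    public static AsyncReactiveCachePolicy<TResult> ReactiveCacheAsync<TResult>(
        IAsyncCacheProvider<TResult> cacheProvider,
        ITtlStrategy ttlStrategy,
        Func<Context, string> cacheKeyStrategy)
        => new AsyncReactiveCachePolicy<TResult>(
            cacheProvider, ttlStrategy, cacheKeyStrategy,
            onCacheGet: null, onCacheMiss: null, onCachePut: null,
            onCacheGetError: null, onCachePutError: null);
}
```

Further overloads could accept the callback hooks for wiring up telemetry.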
The approach described above brings certain advantages in controlling the flow of execution of the policy. It checks for failure from the service and, in the event of failure, attempts to retrieve the value from cache, thus acting as a safety net. This also improves the reliability of your application and, by defining effective telemetry through the callback hooks, helps you identify how many requests were degraded by service failures but saved by the cache policy. You can also choose to safely filter out certain exceptions that the cache policy should not handle.
Originally published at https://www.farfetchtechblog.com on July 12, 2020.