Resilience Testing in NewsKit API

Stoyan Z Yanev
NewsKit design system
8 min read · Apr 23, 2021

NewsKit API is the News UK data service that consolidates different data sources across the business and provides a single federated GraphQL endpoint to the applications powered by NewsKit Components.

The idea of NewsKit API is based on federated architecture and Apollo Federation. It brings the benefits of federated architecture and the flexibility of GraphQL together.

The architecture for the NewsKit API is laid out below. Each data source is accessed via a GraphQL wrapper or a service. The difference between a service and a wrapper is that a wrapper acts as a transformer between two data entities (models), while a service is responsible for serving the data directly.

This distinction doesn’t bear any significance to our clients, and we use the two terms interchangeably in this document. These services/wrappers translate data into well-defined GraphQL schemas which are federated by the NewsKit API Gateway.

The Gateway’s responsibility is solely to federate the individual services. Each service in NewsKit API resolves the part of a graph request that the Gateway passes down to it, and the combined data is returned to clients in a coherent GraphQL schema. This approach also lets NewsKit API future-proof the data contract against changes in the underlying data sources.
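As a rough illustration of this split of responsibilities, the toy sketch below models a gateway that holds no data itself and only routes each requested field to the service that owns it, then merges the answers. It is not Apollo Federation and not the NewsKit API code; the service functions, field names, and the `ToyGateway` class are all hypothetical.

```python
from typing import Callable, Dict, List


# Hypothetical services: each one resolves only the fields it owns.
def article_service(field: str) -> object:
    data = {"headline": "Example headline", "byline": "Example author"}
    return data[field]


def comments_service(field: str) -> object:
    data = {"commentCount": 3}
    return data[field]


class ToyGateway:
    """Routes each requested field to its owning service and merges the results."""

    def __init__(self, owners: Dict[str, Callable[[str], object]]) -> None:
        self.owners = owners  # field name -> service that resolves it

    def query(self, fields: List[str]) -> Dict[str, object]:
        # The gateway only delegates and combines; it stores no data.
        return {field: self.owners[field](field) for field in fields}


gateway = ToyGateway({
    "headline": article_service,
    "byline": article_service,
    "commentCount": comments_service,
})

result = gateway.query(["headline", "commentCount"])
print(result)
```

Because clients only see the merged result, a service could be swapped for a different data source without changing the query, which is the future-proofing argument made above.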

Why did we build NewsKit API?

We built a Design System to make our front-end composable. In order to provide a seamless development experience for the engineering teams in other business units, we also needed a composable back-end.

GraphQL solves the challenge of dynamic data fetching from front-end applications. Clients can now get all of the data required to construct complex UI components via a single query.

Introducing GraphQL into an established tech stack is no mean feat. Over the years, some of our existing products have grown organically, both in complexity and size. In order to construct a single web page, one team had to make over 20 requests to different REST API endpoints. Worse, only a tiny portion of the fetched data was used, with most of it discarded. This was inefficient and expensive.

We took an incremental approach to our GraphQL adoption, through GraphQL federation and the idea behind federated architecture (FA). This enabled us to expand our Graph schema iteratively and make the API usable immediately, delivering value to the business. The discoverability of the data in the GraphQL schema reduces the risk of different teams building the same things in isolation.

What is Resilience Testing?

Software testing, in general, involves many different techniques and methodologies to test every aspect of the software regarding functionality, performance, and bugs.

Resilience is the ability of a system to minimise the impact of failure. Failure happens, all the time. Resilient systems take progressive steps to allow the most useful parts of the system to still serve their purpose.
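To make the idea of keeping the most useful parts working more concrete, here is a minimal sketch of graceful degradation. It is an illustration only, not how NewsKit API is implemented; `build_page`, the section names, and the two fetch functions are hypothetical.

```python
from typing import Callable, Dict, List, Tuple


def build_page(sections: Dict[str, Callable[[], str]],
               critical: List[str]) -> Tuple[Dict[str, str], List[str]]:
    """Render every section that works; fail only if a critical one is down."""
    rendered: Dict[str, str] = {}
    degraded: List[str] = []
    for name, fetch in sections.items():
        try:
            rendered[name] = fetch()
        except Exception:
            if name in critical:
                raise  # the page is useless without this section
            degraded.append(name)  # drop the section, keep serving the rest
    return rendered, degraded


def fetch_article() -> str:
    return "article body"


def fetch_recommendations() -> str:
    # Simulates a dependency that is currently failing.
    raise TimeoutError("recommendations service unavailable")


page, degraded = build_page(
    {"article": fetch_article, "recommendations": fetch_recommendations},
    critical=["article"],
)
print(page, degraded)
```

The failing recommendations section is dropped and reported, while the article, the part readers came for, is still served.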

The aim of these tests is not only to provide information about how the system handles failure (resilience), but also to improve performance and observability.

When we consider resilience, we mainly focus on availability: are the services available during stress testing, and is there any downtime? We then analyse the results, decide whether our infrastructure reacted as we hoped and planned, and identify potential improvements.

When we talk about performance, we think about stressing resources: spikes in CPU and memory usage, and other resource constraints.

Observability is the least talked about benefit, but it is a really important one. The tests let us check whether we have the right alarms in place and whether they are triggered when they are supposed to be. If those alarms are in place, we can also check whether the team is ready to react to them effectively.

The product that we are building is a GraphQL federation (NewsKit API) deployed on EKS. Because of the complex infrastructure and the high traffic, we rely on a thorough testing strategy, which includes resilience tests.

What is AWS Fault Injection Simulator?

According to AWS:

Fault Injection Simulator is a fully managed chaos engineering service that makes it easier for teams to discover an application’s weaknesses at scale in order to improve performance, observability, and resiliency. Chaos engineering is the process of stressing an application in testing or production environments by creating disruptive events, such as server outages or API throttling, observing how the system responds, and implementing improvements.

With Fault Injection Simulator, teams can quickly set up experiments using pre-built templates that generate the desired disruptions, such as server latency or database error. Fault Injection Simulator provides the controls and guardrails that teams need to run experiments in production, such as automatically rolling back or stopping the experiment if specific conditions are met. With a few clicks in the console, teams can run complex scenarios with common distributed system failures happening in parallel or building sequentially over time, enabling them to create the real world conditions necessary to find hidden weaknesses.

Benefits of using AWS FIS

(Diagram from AWS re:Invent)
  • Improve application performance, resiliency, and observability
  • Validate how your application performs on AWS — Amazon EC2, Amazon EKS, Amazon ECS, and Amazon RDS
  • Safeguard chaos experiments — AWS Fault Injection Simulator provides the fine-grained controls that teams need to define the specific conditions under which they want to stop an experiment or roll back to the pre-experiment state.
  • A fast and easy way to get started with chaos engineering — AWS Fault Injection Simulator provides prebuilt templates that enable teams to set up and run high quality experiments in minutes. Fault Injection Simulator structures the chaos engineering process so that teams can quickly run chaos engineering experiments by following the step-by-step process in the console and selecting from a predefined list of actions.
  • Get superior insights by generating real-world failure conditions
  • Integrated security model — AWS Fault Injection Simulator is integrated with AWS Identity and Access Management (IAM) so that you can control which users and resources have permission to access and run Fault Injection Simulator experiments, and which resources and services can be affected.
  • Visibility throughout an experiment — AWS Fault Injection Simulator provides visibility throughout every stage of an experiment via the console and APIs. As an experiment is running you can observe what actions have executed. After an experiment has completed you can see details on what actions were run, if stop conditions were triggered, how metrics compared to your expected steady state, and more.

The conditions during the tests are real: the metrics are not faked and the environments are not manipulated. The events described in the templates really happen. For example, in a memory utilisation test the specified amount of memory is actually consumed during the test, rather than false alerts simply being raised.

Use Cases:

  • Periodic Game Days — This is a completely hands-on opportunity for technical professionals to explore AWS services, architecture patterns, best practices, and group cooperation.
  • Continuous Delivery Pipeline Integration — You can integrate AWS Fault Injection Simulator into your continuous delivery pipeline. This will enable you to repeatedly test the impact of fault actions as part of your software delivery process.
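For the pipeline integration use case, a gate step could start an experiment, poll until it finishes, and fail the build unless it completed cleanly. The sketch below assumes `fis_client` behaves like the boto3 FIS client (`boto3.client("fis")`), using its `start_experiment` and `get_experiment` calls as I understand them. The function name `run_experiment_gate`, the IDs, and the `FakeFisClient` stand-in are hypothetical and exist only so the example runs without AWS credentials.

```python
import time
from typing import Any

# Experiment states after which no further progress is expected (assumption).
TERMINAL_STATES = {"completed", "stopped", "failed"}


def run_experiment_gate(fis_client: Any, template_id: str,
                        poll_seconds: float = 10.0,
                        max_polls: int = 60) -> bool:
    """Start an experiment and return True only if it completes cleanly."""
    started = fis_client.start_experiment(experimentTemplateId=template_id)
    experiment_id = started["experiment"]["id"]
    for _ in range(max_polls):
        current = fis_client.get_experiment(id=experiment_id)
        status = current["experiment"]["state"]["status"]
        if status in TERMINAL_STATES:
            # "stopped" means a stop condition fired or someone halted the run.
            return status == "completed"
        time.sleep(poll_seconds)
    return False  # still running after the polling budget: treat as a failure


class FakeFisClient:
    """Hypothetical stand-in that replays a fixed list of statuses."""

    def __init__(self, statuses):
        self._statuses = iter(statuses)

    def start_experiment(self, experimentTemplateId):
        return {"experiment": {"id": "EXP-example"}}

    def get_experiment(self, id):
        return {"experiment": {"state": {"status": next(self._statuses)}}}


passed = run_experiment_gate(
    FakeFisClient(["initiating", "running", "completed"]),
    template_id="EXT-example", poll_seconds=0)
halted = run_experiment_gate(
    FakeFisClient(["running", "stopped"]),
    template_id="EXT-example", poll_seconds=0)
print(passed, halted)
```

In a real pipeline, the boolean would typically be turned into the step's exit code so that a stopped or failed experiment blocks the release.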

In practice, what experiments can we run?

AWS FIS is a flexible framework that can be used for almost any type of chaos engineering.

(Diagram from AWS re:Invent)

There are three core components of each experiment:

  • Action: the fault injection itself, such as data being lost from a database, container resources being constrained, certain services being unavailable, or latency problems. We can combine several simple actions into much more complicated scenarios (several performance-degrading events at once). We can build a timeline of events (in sequence or in parallel) and test scenarios which are very hard to mimic in real life.
(Diagram from AWS re:Invent)
  • Targets — A target is one or more AWS resources on which an action is performed during an experiment. You define targets when you create an experiment template. You can use the same target for multiple actions in your experiment template. To identify your target resources, you can specify resource IDs, filters and tags.
(Diagram from AWS re:Invent)
  • Stop Condition: a mechanism to stop an experiment if it reaches a threshold that you define as an Amazon CloudWatch alarm. If a stop condition is triggered during an experiment, AWS FIS stops the experiment, and in some cases rolls back the state of the target resources.
(Diagram from AWS re:Invent)
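The three components above can be sketched as a single experiment template. The structure below follows the FIS template format as I understand it (`targets`, `actions`, `stopConditions`, `roleArn`), written as a Python dictionary so it can be checked locally. It is an illustrative sketch rather than our actual template: every ARN, tag, and name is a placeholder, and the `undefined_targets` helper is hypothetical.

```python
import json

# All ARNs, names, and tag values below are placeholders, not real resources.
template = {
    "description": "Terminate a share of the instances in an EKS node group",
    "roleArn": "arn:aws:iam::111122223333:role/example-fis-role",
    # Targets: which resources the actions may touch.
    "targets": {
        "apiNodegroups": {
            "resourceType": "aws:eks:nodegroup",
            "resourceTags": {"service": "example-api"},
            "selectionMode": "ALL",
        }
    },
    # Actions: the faults to inject, each pointing at a named target.
    "actions": {
        "terminateNodes": {
            "actionId": "aws:eks:terminate-nodegroup-instances",
            "parameters": {"instanceTerminationPercentage": "40"},
            "targets": {"Nodegroups": "apiNodegroups"},
        }
    },
    # Stop conditions: CloudWatch alarms that halt the experiment.
    "stopConditions": [
        {
            "source": "aws:cloudwatch:alarm",
            "value": "arn:aws:cloudwatch:eu-west-1:111122223333:alarm:example-5xx-rate",
        }
    ],
}


def undefined_targets(tpl: dict) -> list:
    """Return target names that actions reference but the template never defines."""
    defined = set(tpl["targets"])
    return sorted(
        name
        for action in tpl["actions"].values()
        for name in action.get("targets", {}).values()
        if name not in defined
    )


missing = undefined_targets(template)
print(json.dumps(template, indent=2))
print("undefined targets:", missing)
```

To my understanding, actions in a template run in parallel unless an action lists other actions under `startAfter`, which is how a sequential timeline of events would be expressed.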

Other alternatives for chaos engineering include the open-source Chaos Monkey and the paid offerings from Gremlin, but after thorough research we decided to choose AWS FIS. Our main goal is to test Kubernetes clusters deployed using Amazon EKS.

In NewsKit API we plan to test our resilience by terminating EKS node groups, throttling our API, and running many custom actions. By regularly running experiments that simulate outages, we want to be able to identify any systemic weaknesses early and fix them.

We are still working on our implementation, which is why this post provides an example template rather than a specific implementation. We will share our progress and findings in the near future.
