Photo by Taylor Vick on Unsplash

In part 1, we discussed some of the concepts and theory around application resilience and provided some guidelines to help maintain an excellent user experience.

In part 2, we explored some of the infrastructure patterns that can help to provide a sound foundation for building resilient applications.

In this installment, we walk through some of the architecture patterns that build on that theory and infrastructure to keep applications resilient to failure.

Architecture Patterns

Application resilience is the ability of an application to react to failure in one or more of its components and still provide the best possible experience to the user. …


Photo by Ali Yılmaz on Unsplash

In part 1, we discussed some of the concepts and theory around application resilience and provided some guidelines to help maintain an excellent user experience.

In this newsletter, we walk through some of the infrastructure patterns that can be utilised to help implement some of the ideas discussed in part 1.

Infrastructure Patterns

Having a solid foundation upon which to build your application is vitally important to its long-term resilience. Here are some well-established infrastructure patterns that help you build resilient applications:

Redundancy

Redundancy is the duplication of components of a system. This is where multiple instances of your component, e.g. a web server, are deployed such that traffic can be handled by any instance. …
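To make the idea concrete, here is a minimal sketch of redundancy at the routing level, assuming a small pool of interchangeable web server instances. The `RedundantPool` class and the instance names are hypothetical and purely illustrative; a real deployment would normally rely on a load balancer with health checks rather than application code.

```python
from itertools import cycle


class RedundantPool:
    """Round-robin over duplicate instances, skipping any marked unhealthy.

    Hypothetical helper for illustration only.
    """

    def __init__(self, instances):
        if not instances:
            raise ValueError("at least one instance is required")
        self._instances = list(instances)
        self._healthy = set(self._instances)
        self._ring = cycle(self._instances)

    def mark_down(self, instance):
        # Stop routing traffic to a failed instance.
        self._healthy.discard(instance)

    def mark_up(self, instance):
        # Resume routing once a known instance has recovered.
        if instance in self._instances:
            self._healthy.add(instance)

    def pick(self):
        # Try each instance at most once per call so we never loop forever.
        for _ in range(len(self._instances)):
            candidate = next(self._ring)
            if candidate in self._healthy:
                return candidate
        raise RuntimeError("no healthy instances available")


pool = RedundantPool(["web-1", "web-2", "web-3"])
first = pool.pick()
second = pool.pick()
pool.mark_down("web-3")
third = pool.pick()  # web-3 is skipped, so traffic moves on to the next healthy instance
```

Because any instance can serve a request, losing one of them reduces capacity but does not take the service down.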


Photo by Rami Al-zayat on Unsplash

In today’s digital world, having your app, product, or service globally available 24/7/365 is more crucial than ever before.

What it means for your applications to be resilient is not well understood and can mean different things to different groups within an organization. I wanted to take some time to talk through what you should know about application resilience and ways to adopt a mindset that can give you the tools to prioritize your efforts.

Definition

Application resilience is the ability of an application to react to failure in one or more of its components and still provide the best possible experience to the user. …


Update: Further research has demonstrated that People Data Labs did not own the IP as listed previously, but it does appear to be a very similar dataset. This article was updated to reflect these new details.

There have been several significant data leaks in the past couple of weeks. I wanted to take some time to dissect one of them and talk through some preventative measures you can put in place to help minimize the risk to your organisation.

Exposed Elasticsearch cluster

Data Viper recently found a publicly exposed, unauthenticated 4TB Elasticsearch cluster. This cluster contains personally identifiable information (PII) such as names, emails, phone numbers, and social media information (LinkedIn, Facebook, Twitter, and GitHub) of over 1.2 …


Some data breaches happen through a complex set of interactions between multiple systems. Others are down to pure ignorance and/or negligence. An article I read recently talked about a mobile dating app that had some of the worst practices I have seen in quite some time.

I wanted to take some time to dissect what was happening and make some suggestions for making sure similar exposures do not happen in your organisation.

First, some background context. Many dating apps now use geolocation to enable you to find like-minded people in your local area. To achieve this, your location is first determined by your device. The mobile app then sends this geolocation information to the backend systems. Depending on the permissions you’ve granted the app, this information may be updated continuously or only when the app is in the foreground. …



Service meshes are a powerful way to manage network traffic at runtime. They work best when the mesh encompasses every endpoint. If you already have a Kubernetes cluster running in production, introducing a service mesh such as Istio can be hard.

Real Kinetic has helped clients deploy Istio to production with great effect, and I wanted to talk through some of the tips and strategies we’ve employed to achieve that.

Getting started

To start with, install Istio into lower, non-prod environments. There are a plethora of Helm chart configuration options that can be used to fine-tune the deployment of Istio to the cluster. …



Building and maintaining infrastructure, especially in the cloud, is becoming more and more complex. Infrastructure as Code (IaC) has become an essential part of managing that complexity. We at Real Kinetic have worked with many teams to help implement and maintain large deployments across AWS and GCP.

Terraform

Both AWS and GCP come with their own flavors of IaC — CloudFormation and Cloud Deployment Manager, respectively. Both have their pros and cons, but we have found that HashiCorp’s Terraform is the simplest, best documented, and most widely supported. Many of our clients find Terraform to be the best option.

Repository structure

When maintaining infrastructure through Terraform, we recommend a two-repo structure. The first is the modules repo. This is where the blueprints of the infrastructure are stored. It is a shared repo to which product and operations teams contribute their infrastructure definitions. Standard SDLC applies to this repo, i.e. pull requests, code review, tagging, and releasing. …



In July 2019, Capital One was breached and around 30GB of credit application data was exfiltrated, impacting around 106M people.

There are plenty of sites that can give you an in-depth technical breakdown of how the breach occurred so I won’t go too far into the issue here. This is what we know:

  • A misconfigured firewall unintentionally exposed a service to the public internet.
  • A vulnerability in this service allowed the attacker to execute arbitrary commands remotely. Some signs point to SSRF, but again, the details aren’t specific.
  • By querying the internal metadata service that AWS provides, the attacker was able to gain the credentials associated with the instance that was executing the commands. …
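The last step above works because the instance metadata service is reachable at the link-local address 169.254.169.254 from anything running on the instance. As a rough illustration only, and not a description of what Capital One ran or how it should have been fixed, a service that fetches user-supplied URLs could refuse IP-literal targets in internal ranges. The function name `is_blocked_target` is hypothetical, and this sketch does not resolve hostnames, which a real guard would also need to do.

```python
import ipaddress
from urllib.parse import urlsplit


def is_blocked_target(url: str) -> bool:
    """Return True if the URL points at an internal IP literal.

    Hypothetical helper: it only inspects IP-literal hosts. A real guard
    must also resolve hostnames and re-check the resulting addresses.
    """
    host = urlsplit(url).hostname
    if host is None:
        # No parseable host: refuse rather than guess.
        return True
    try:
        addr = ipaddress.ip_address(host)
    except ValueError:
        # Not an IP literal (a DNS name); out of scope for this sketch.
        return False
    return addr.is_link_local or addr.is_loopback or addr.is_private


metadata_blocked = is_blocked_target("http://169.254.169.254/latest/meta-data/")
loopback_blocked = is_blocked_target("http://127.0.0.1:8080/admin")
```

A check like this is one layer at most; restricting what the instance's credentials are allowed to do limits the damage if the check is bypassed.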

Photo by Jordan Harrison on Unsplash

On June 6, Google Cloud suffered a major degradation in service that caused many US-based services to experience high levels of packet loss and latency. Affected services included:

  • YouTube
  • Compute Engine (including GKE & Cloud Load Balancer)
  • Cloud Storage
  • App Engine

It would appear that non-US regions experienced little impact unless they were requesting resources from the affected areas. In total, the outage lasted for around 4.5 hours.

Human error

Based on my reading of the post mortem, it appears that a series of misconfiguration events significantly reduced the network control plane’s capacity, causing the network to fail into a static mode where changes to the data plane were not possible (although data could still flow). After a few minutes, BGP routes were automatically switched away from the impacted areas. As a result, a massive amount of traffic had to be handled by other physical locations in the same region, which caused the ‘network congestion’. …



GKE is a managed Kubernetes offering by Google Cloud Platform (GCP). The services that you deploy work together to form the application. Each service needs to be able to communicate with its neighbours, and that communication typically needs to be authenticated and authorised. This post is going to walk you through setting up and using Google Cloud service accounts to authorise access to Google Cloud services such as Cloud Storage and KMS.

When you create the cluster, you provide a service account and a set of scopes (or permissions) that make up the default credentials that the underlying nodes (aka VMs) will use to access other Google Cloud services. …

About

Nick Joyce

Cloud herder. Code monkey. Wood worker. Husband. Human. Managing Partner at Real Kinetic.
