Dealing with S3 backups in a creative way

Calin Florescu
7 min read · Jul 6, 2023

Introduction

Hello there! Similar to my last article, I would like to share the experience I had implementing an exciting feature with, let's say, an interesting setup. As always, the purpose is to help others in a similar situation: if not with a complete solution, at least with an idea that gets you moving forward.

The setup

In the project I am currently working on, we have a microservice architecture with services running in a Kubernetes cluster. Since these services cover many functionalities, the need for an external storage solution (to store images or documents) was bound to appear for some of them, so the answer was Cloud Storage (GCP's equivalent of AWS S3).

Since we have a multi-cluster environment, we decided to empower developers to manage their own storage instances, distributing a load that previously sat with the platform team.

We did that using Crossplane, a tool that allows you to manage external infrastructure from within a Kubernetes cluster.

Crossplane uses predefined libraries called providers (similar to Terraform providers) that install the required CRDs into the cluster to manage external infrastructure resources. For example, if you want to create a Bucket in GCP, you would use a GCP provider like this one; after you install it in the Kubernetes cluster and grant it the required permissions on the GCP side, you will have a new resource type for managing your Buckets.
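Installing a provider is itself done through a Kubernetes resource. A minimal sketch (the package version is illustrative, not the one we used):

```yaml
# Installs the Upbound GCP storage provider into the cluster.
# After this, Bucket (and related) CRDs become available.
apiVersion: pkg.crossplane.io/v1
kind: Provider
metadata:
  name: provider-gcp-storage
spec:
  package: xpkg.upbound.io/upbound/provider-gcp-storage:v1.0.0  # version is an assumption
```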

apiVersion: storage.gcp.upbound.io/v1beta1
kind: Bucket
metadata:
  annotations:
    meta.upbound.io/example-id: storage/v1beta1/notification
  labels:
    testing.upbound.io/example-name: bucket
  name: bucket-${Rand.RFC1123Subdomain}
spec:
  forProvider:
    location: US

Now that we understand the setup we are working with, let's tackle the problem!

The challenge

The target is to create a disaster recovery plan for the data stored in the Cloud Storage instances, so that we can revert a storage instance to a previous state in case of a malicious attack. The built-in object versioning was not sufficient for this use case because we needed the data replicated across multiple regions, while the original instances use a regional location type. We also wanted to restore an instance as a whole.

The backup should happen daily, and we must keep a couple of versions of the data.

As I said, a pretty exciting setup, with challenges ahead 😀

Searching for a solution

I started by splitting the requirement into atomic tasks representing achievable milestones. After some brainstorming, I concluded that, at a high level, I needed the following:

  • For each storage instance, I need to create a backup
  • A system is required to do the daily syncing between the buckets
  • The solution must be integrated into the existing system and be as automated as possible
  • Data recovery should be a straightforward process

The first task is relatively simple to implement since we already have a Crossplane composition that creates the storage instances; we can derive a new composition from the initial one and add one more resource. (We want two compositions because some clients may not opt in to the disaster recovery plan, so there is no need to force a flow that increases costs.)

apiVersion: apiextensions.crossplane.io/v1
kind: Composition
metadata:
  name: xbackupinstance
  labels:
    crossplane.io/xrd: xbackupinstance
spec:
  compositeTypeRef:
    apiVersion: clouds.storage.org/v1alpha1
    kind: XBackupInstance
  resources:
  - base:
      apiVersion: storage.gcp.upbound.io/v1beta1
      kind: Bucket
      metadata:
        name: crossplane-bucket
      spec:
        providerConfigRef:
          name: gcp-provider
        forProvider:
          uniformBucketLevelAccess: true
          storageClass: "REGIONAL"
        deletionPolicy: "Orphan"
  # Backup bucket
  - base:
      apiVersion: storage.gcp.upbound.io/v1beta1
      kind: Bucket
      metadata:
        name: crossplane-bucket-backup
      spec:
        deletionPolicy: "Orphan"
        providerConfigRef:
          name: gcp-provider
        forProvider:
          uniformBucketLevelAccess: true
          storageClass: "MULTI_REGIONAL"

(I deliberately simplified the code and the configuration because that is not the purpose of the article)

For the second task, I was looking for something similar to a cron job that can be programmed to run periodically and execute a sync between the two storage instances.

A solution that came to my mind was creating a CronJob in Kubernetes that uses rclone, for example, to sync the two buckets.
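As a sketch of that idea, such a CronJob could look like the following; the schedule, bucket names, and the assumption that an rclone remote named `gcs` is configured (e.g. via a mounted rclone.conf) are all illustrative:

```yaml
# Hypothetical daily sync between two buckets using rclone.
# Assumes an rclone remote "gcs" is configured via a mounted config.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: bucket-backup-sync
spec:
  schedule: "0 2 * * *"  # every day at 02:00
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
          - name: rclone
            image: rclone/rclone:latest
            args: ["sync", "gcs:source-bucket", "gcs:backup-bucket"]
```

We ultimately did not go this route, for the reasons below.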

In these scenarios, I usually try to avoid reinventing the wheel, because someone smarter than me has already designed a better solution for this problem. My impact comes from combining technologies in a form that suits the customer, considering the business model, budget, scalability and maintainability.

With this in mind, I started looking for a solution on the GCP side and found the Storage Transfer Service, which ticks all my requirements:

  • A managed GCP service (we are not increasing the technology stack)
  • Can schedule jobs and sync data between storage instances (functional)
  • Uses GCP's internal network, so there is no egress traffic (cost saving)
  • Can exclude unmodified objects from sync operations, reducing the number of operations performed (cost saving again)
  • Being a GCP service, the API is probably supported by the Crossplane provider we are using (maintainability)

Using this solution, the last two tasks are also checked, because we can easily integrate it into our current flow. And since it's GCP, the recovery process is well documented and easy to perform with proper access.
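Conceptually, recovery is just a transfer in the opposite direction: a one-off Storage Transfer Service job with source and sink swapped. A hedged sketch (resource shape and names are illustrative, mirroring the transfer jobs shown further down):

```yaml
# Hypothetical one-off recovery transfer: backup bucket -> original bucket.
apiVersion: storage.gcp-tj.upbound.io/v1alpha1
kind: TransferJob
metadata:
  name: restore-transfer-job
spec:
  forProvider:
    description: One-off restore from backup
    transferSpec:
    - gcsDataSource:
      - bucketName: crossplane-bucket-backup  # source and sink swapped
      gcsDataSink:
      - bucketName: crossplane-bucket
```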

Design that I had in mind

Looks like we found a winner!!

I created an action plan:

  • Research the Crossplane Provider documentation for Storage Transfer API
  • Test it to see how it’s working
  • Integrate it into the existing setup
  • Test the backup concept
  • Focus on automation

Too good to be true

In this project, we use the Upbound Crossplane provider to interact with the GCP API.

At the time of implementation, however, my assumption was wrong: the provider did not support the Storage Transfer API, unfortunately.

My solution came to a dead end. Or not.

Back to the drawing board

With some research, I found that Upbound allows you to create your own provider based on their template.

Creating a new provider is well documented, and it is an excellent way to build a library with just the resources you need; of course, this flexibility increases the work you put into maintaining the API definitions and updates. (Maybe a topic for another article.)

Bringing the concept to reality

First, I did the work to create a provider that supports GCP's Storage Transfer API. It was pretty straightforward, and in no time I had the required CRD in the cluster, from which I could create objects.

apiVersion: storage.gcp-tj.upbound.io/v1alpha1
kind: TransferJob
metadata:
  name: backup-transfer-job
spec:
  providerConfigRef:
    name: provider-gcp-tj
  forProvider:
    description: Backup transfer job
    schedule:
    - repeatInterval: 0
      scheduleEndDate: {}
      scheduleStartDate: {}
      startTimeOfDay: {}
    transferSpec:
    - gcsDataSource:
      - bucketName: ""
      gcsDataSink:
      - bucketName: ""

After this step, I could integrate it into the backup composition so that all the resources are created together.

Said and done, the composition was modified:

apiVersion: apiextensions.crossplane.io/v1
kind: Composition
metadata:
  name: xbackupinstance
  labels:
    crossplane.io/xrd: xbackupinstance
spec:
  compositeTypeRef:
    apiVersion: clouds.storage.org/v1alpha1
    kind: XBackupInstance
  resources:
  - base:
      apiVersion: storage.gcp.upbound.io/v1beta1
      kind: Bucket
      metadata:
        name: crossplane-bucket
      spec:
        providerConfigRef:
          name: gcp-provider
        forProvider:
          uniformBucketLevelAccess: true
          storageClass: "REGIONAL"
        deletionPolicy: "Orphan"
  # Backup bucket
  - base:
      apiVersion: storage.gcp.upbound.io/v1beta1
      kind: Bucket
      metadata:
        name: crossplane-bucket-backup
      spec:
        deletionPolicy: "Orphan"
        providerConfigRef:
          name: gcp-provider
        forProvider:
          uniformBucketLevelAccess: true
          storageClass: "MULTI_REGIONAL"
  # Transfer job between the two buckets
  - base:
      apiVersion: storage.gcp-tj.upbound.io/v1alpha1
      kind: TransferJob
      metadata:
        name: backup-transfer-job
      spec:
        providerConfigRef:
          name: provider-gcp-tj
        forProvider:
          description: Backup transfer job
          schedule:
          - repeatInterval: 0
            scheduleEndDate: {}
            scheduleStartDate: {}
            startTimeOfDay: {}
          transferSpec:
          - gcsDataSource:
            - bucketName: crossplane-bucket
            gcsDataSink:
            - bucketName: crossplane-bucket-backup

(Again, the code is simplified to offer just an example)

It’s working on my machine and every other machine

It looked like I had everything required to test this new flow, so I created a template for a storage instance that uses the new composition.

apiVersion: clouds.storage.org/v1alpha1
kind: BucketInstance
metadata:
  name: calin
  namespace: nebula
spec:
  parameters:
    namespace: nebula
    bucket:
      name: calin
    transfer:
      schedule:
        startDate:
        - day: 9
          month: 4
          year: 2023
  compositionSelector:
    matchLabels:
      crossplane.io/xrd: xbackupinstance

After some tests, I saw that we had a working product, although something was still missing.

We want a couple of versions from which we can recover the data. To achieve that, I created one more transfer job that syncs the data into the same backup storage instance, but under a different path.
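A hedged sketch of how the second job's transfer spec differs, assuming the provider exposes the Transfer Service path field on the sink (the prefix names are illustrative):

```yaml
# Second transfer job (sketch): same sink bucket, different path prefix,
# so each job writes its version of the data under its own directory.
transferSpec:
- gcsDataSource:
  - bucketName: crossplane-bucket
  gcsDataSink:
  - bucketName: crossplane-bucket-backup
    path: "version-1/"  # hypothetical; the other job could use "version-2/"
```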

This means that the composition will have one more transfer job added to it, and the Storage Instance template will change to accommodate the setting for the second Transfer Job:

apiVersion: clouds.storage.org/v1alpha1
kind: BucketInstance
metadata:
  name: calin
  namespace: nebula
spec:
  parameters:
    namespace: nebula
    bucket:
      name: calin
    transfer:
      schedule:
        startDate:
        - day: 9
          month: 4
          year: 2023
        # start date for the second transfer job
        - day: 10
          month: 4
          year: 2023
  compositionSelector:
    matchLabels:
      crossplane.io/xrd: xbackupinstance

It's not so nice to always check the calendar for the dates needed to configure the transfer jobs, so I created a Helm helper template that computes them.

{{- define "start-dates" }}
{{- /* Variables must live inside the define: a named template has its own scope. */ -}}
{{- $firstIteration := now | date "2006-01-2" }}
{{- $secondIteration := now | date_modify "+24h" | date "2006-01-2" }}
startDate:
- day: {{ (split "-" $firstIteration)._2 | int }}
  month: {{ (split "-" $firstIteration)._1 | int }}
  year: {{ (split "-" $firstIteration)._0 | int }}
- day: {{ (split "-" $secondIteration)._2 | int }}
  month: {{ (split "-" $secondIteration)._1 | int }}
  year: {{ (split "-" $secondIteration)._0 | int }}
{{- end -}}
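The helper can then be pulled into the Storage Instance template with `include` (the surrounding fields and the indentation level are illustrative):

```yaml
# Hypothetical usage inside the BucketInstance template:
transfer:
  schedule:
{{ include "start-dates" . | indent 4 }}
```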

This implementation also supports future extensions if needed.

My use case involved GCP as the infrastructure provider, but a similar approach can be taken with AWS, since Upbound also has providers for AWS.

Conclusion

During this implementation, I learned how important it is to dedicate time to understanding the situation you are dealing with. With a proper understanding, the next big step was to split the seemingly impossible task into smaller ones that feel reachable and prevent you from getting off track when things are not working correctly.

As I said in the article, technologies are the tools that allow us to build solutions for our clients, and there are multiple ways to reach excellence in a domain like this.

You don't need to be part of the team that created GCP, Crossplane, Helm or Kubernetes. If you can understand what those great people created, make the best use of their tools, and combine them into a solution that perfectly suits your client's needs and makes their life easier, you are also one of the greatest, at least in my eyes.


Written by Calin Florescu

Platform Engineer passionate about problem solving and technology.
