AWS S3 — Disaster recovery using versioning and objects’ metadata

Jacek Małyszko
Published in Fandom Engineering
14 min read · Nov 6, 2020

AUTHOR: Jacek Małyszko, Data Engineer @ Fandom

Accidental removal of data on S3 is something that no Data Engineer on AWS wants to be involved in. Unfortunately, storing data on S3 may be expensive, so from time to time you may need to get rid of some terabytes here and there. Such data deletions may be performed automatically or manually. As we’re “only humans”, mistakes happen. For example, I once found myself in this sad situation: instead of removing all files under the 2020/05/01 prefix, I removed everything under 2020/01.

Fortunately, the bucket had versioning enabled, so the data was not completely lost; it was just hidden by delete markers (see https://docs.aws.amazon.com/AmazonS3/latest/dev/Versioning.html). Still, bringing back those files wasn’t straightforward. You cannot just “undelete” all files under a given prefix; the S3 API does not provide a method for that. It would not be a wise idea anyway, because some files under the prefix may have been removed intentionally on a different occasion than the accidental removal we performed just five minutes ago. We need a good, battle-tested way to bring back the files that we removed accidentally, and only those.

This article contains all the code and comments/descriptions/explanations necessary to guide you through disaster recovery in case of accidental removal of data on a versioning-enabled S3 bucket. We also analyse what to do when we accidentally overwrite some versioned objects and want to bring back the older versions as the current ones.

In this article, I conduct an exercise, where I:

  1. prepare a new bucket with sample data and versioning enabled
  2. simulate “disaster” (accidental data loss) by randomly deleting and overwriting some files
  3. identify which files were affected by analysing objects’ metadata
  4. restore the bucket to the state from before the “disaster”

You’ll find plenty of code samples which you may hopefully find useful and, with few or no changes, reuse in your own disaster recovery scenario. The general procedure presented in this article is described below.

General Procedure for disaster recovery using S3 versioning and objects’ metadata

  1. Identify the exact time when the erroneous operation was performed. Define START_TIMESTAMP and END_TIMESTAMP of the operation which caused the data loss
  2. Identify roughly what data may have been affected, i.e. see on what bucket and prefixes in this bucket the operation may have been performed
  3. If possible (e.g. if the data is small enough), make a copy of the affected prefixes in the bucket, including historical versions, so that if the recovery goes wrong you still have a backup. This is an optional step, as you may not always want to back up, say, 100 terabytes of data
  4. Identify affected objects based on the defined timestamps (START_TIMESTAMP and END_TIMESTAMP). Using the AWS Console or the AWS API, you may list all objects and versions under a given prefix and check when each version was created. Thus, we may identify all versions that were created in the period when the disaster happened. If nobody else was messing with the bucket and prefixes that we analyse during this period, this gives us the list of all objects affected by the “disaster”
  5. Remove versions that were identified in the previous point
  6. If everything went well, delete markers and incorrect new versions are now gone and the old versions from before the accidental operation are now used by AWS as the current versions of these objects — we’ve “undone the disaster”

Warning!

In the code supplied with this article we upload and remove data automatically using boto3. There is a risk of accidental real data loss, e.g. if some S3 paths get confused during the exercise. To make sure that nothing goes wrong, I suggest one of the following:

  • either use your private AWS account without any production data (the exercise should not cost more than a few cents)
  • or, if you use an AWS account that stores real, valuable data on S3, create a new IAM user with programmatic access and grant it read/write permissions on S3 only for selected buckets, e.g. those whose names start with disaster-recovery-test. Then make sure that in the experiment you use a bucket with this prefix and an AWS profile with the credentials of this new, limited user!

OK, let’s do this!

The rest of the article is an export from a Jupyter notebook, so it mixes sections of code, generated output and regular text describing what’s happening. Please keep in mind that variables defined in one code block are available in the code blocks below it as well.

Please fill in the name of your AWS profile from ~/.aws/credentials as PROFILE_NAME in the block below. All operations will be conducted on a newly created bucket with a random name.
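A minimal sketch of this block; the BUCKET_NAME variable name and the uuid-based random suffix are just one possible choice:

```python
import uuid

# Name of a profile from your ~/.aws/credentials file; replace with your own
PROFILE_NAME = 'my-disaster-recovery-test-profile'

# A random bucket name so the exercise never touches any existing bucket
BUCKET_NAME = f'disaster-recovery-test-{uuid.uuid1()}'
print(BUCKET_NAME)
```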

Output:

disaster-recovery-test-43c357fc-1d59-11eb-8069-acde48001122

Let’s define functions that will later help us analyse and obtain listings of files and versions in our bucket. Below I define two Python functions:

  • get_versioning_delete_overwrite_status identifies which object keys were affected (deleted or overwritten) in a given period of time. The most important assumption in this exercise is that we know when the disaster happened (e.g. when the script causing data loss was fired and when it finished its execution). The function lists all versions (including delete markers) in the given bucket and checks the creation time of each. If this creation time is between the provided timestamps, the version (or delete marker) is identified as a result of the "disaster"
  • print_versions_summary - this one just nicely prints the results of get_versioning_delete_overwrite_status

These functions are a bit long, so you may either read through them right now or come back here later when we demonstrate their usage.
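A sketch of these two functions, built on boto3’s list_object_versions paginator, could look as follows. The return value (a dict with four lists) and the exact classification logic are just one convenient way to structure this, so treat it as a starting point rather than a definitive implementation. Both timestamps must be timezone-aware (UTC) datetimes so they can be compared with the LastModified values returned by S3.

```python
def get_versioning_delete_overwrite_status(s3_client, bucket, start_date, end_date, prefix=''):
    """Classify versions and delete markers relative to the [start_date, end_date]
    window in which the "disaster" happened. Both dates must be timezone-aware (UTC).

    Returns a dict with four lists:
      - delete_markers: delete markers created inside the window
      - other_delete_markers: keys with delete markers created outside the window
      - new_overwriting_versions: current versions created inside the window that
        overwrote an older version of the same key
      - new_versions: current versions created inside the window for keys that had
        no earlier version (completely new objects)
    """
    delete_markers, other_delete_markers = [], []
    new_overwriting_versions, new_versions = [], []

    # Collect all versions and delete markers per key
    versions_by_key, markers_by_key = {}, {}
    paginator = s3_client.get_paginator('list_object_versions')
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for v in page.get('Versions', []):
            versions_by_key.setdefault(v['Key'], []).append(v)
        for m in page.get('DeleteMarkers', []):
            markers_by_key.setdefault(m['Key'], []).append(m)

    for key, markers in markers_by_key.items():
        for m in markers:
            if start_date <= m['LastModified'] <= end_date:
                delete_markers.append({'Key': key, 'VersionId': m['VersionId']})
            else:
                other_delete_markers.append(key)

    for key, versions in versions_by_key.items():
        for v in versions:
            if v['IsLatest'] and start_date <= v['LastModified'] <= end_date:
                overwrote = any(o['LastModified'] < start_date for o in versions if o is not v)
                entry = {'Key': key, 'VersionId': v['VersionId']}
                (new_overwriting_versions if overwrote else new_versions).append(entry)

    return {'delete_markers': delete_markers,
            'other_delete_markers': other_delete_markers,
            'new_overwriting_versions': new_overwriting_versions,
            'new_versions': new_versions}


def print_versions_summary(status):
    """Nicely print the dict returned by get_versioning_delete_overwrite_status."""
    print('For the given period the following changes were found:')
    print('The following delete markers were created in the analyzed period:')
    for marker in status['delete_markers']:
        print(marker)
    print('The following files are also hidden behind delete markers, '
          'but those markers were added in a different time period:')
    for key in status['other_delete_markers']:
        print(key)
    if status['new_overwriting_versions']:
        print('The following versions overwriting older files were uploaded:')
        for version in status['new_overwriting_versions']:
            print(version)
    else:
        print('No versions overwriting older files were uploaded')
    if status['new_versions']:
        print('The following were newly uploaded files found (not overwriting any older files):')
        for version in status['new_versions']:
            print(version)
```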

Let’s establish a boto3 session and create a bucket for our exercise. We enable versioning for the bucket.
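A sketch of this step; the variable names are assumptions, and outside us-east-1 the create_bucket call also needs a CreateBucketConfiguration with your region:

```python
import boto3

session = boto3.session.Session(profile_name=PROFILE_NAME)
s3 = session.client('s3')

# In regions other than us-east-1, add:
# CreateBucketConfiguration={'LocationConstraint': session.region_name}
s3.create_bucket(Bucket=BUCKET_NAME)

# With versioning enabled, deletes only add delete markers and overwrites keep old versions
s3.put_bucket_versioning(
    Bucket=BUCKET_NAME,
    VersioningConfiguration={'Status': 'Enabled'},
)
```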

Output:

{'ResponseMetadata': {'RequestId': '9D30CF0133077377',
'HostId': '5JcxJj2gw+O4OH17M4YTTlBjHR2wBFZX2GbpvU9uIQYjwHVd1U1UyyS96nCB1j4tWrFh6csYqrU=',
'HTTPStatusCode': 200,
'HTTPHeaders': {'x-amz-id-2': '5JcxJj2gw+O4OH17M4YTTlBjHR2wBFZX2GbpvU9uIQYjwHVd1U1UyyS96nCB1j4tWrFh6csYqrU=',
'x-amz-request-id': '9D30CF0133077377',
'date': 'Mon, 02 Nov 2020 22:18:01 GMT',
'content-length': '0',
'server': 'AmazonS3'},
'RetryAttempts': 0}}

We create a certain number of files in the bucket. The object names are just incremental integers, and the contents of each created file is just the filename concatenated with the word “contents”.
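A sketch of the upload step; the count of 26 is inferred from the listing output further down, and the keys are zero-padded so that they sort nicely:

```python
for i in range(26):
    key = f'{i:02d}.txt'
    s3.put_object(Bucket=BUCKET_NAME, Key=key, Body=f'{key} contents'.encode())
```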

Let’s see how many files are listed now in the bucket.
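For example, using the list_objects_v2 paginator (a single list_objects_v2 call would also do for a bucket this small):

```python
paginator = s3.get_paginator('list_objects_v2')
found = sum(len(page.get('Contents', [])) for page in paginator.paginate(Bucket=BUCKET_NAME))
print(f'Found files: {found}')
```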

Output:

Found files: 26

If you like, go ahead and log in to your S3 console and examine the created bucket to see the created files.

Accidental data removal — how to identify what’s “lost”

Actually, the word “lost” is inaccurate; if we remove objects with versioning enabled, the data is still there, but it’s hidden behind a delete marker.

Let’s remove the first file returned by S3 when listing the bucket. Let’s assume that this remove operation is intentional: I really want to remove this file. This assumption will later let us check whether we can tell intentional and unintentional data removals apart, as long as they happened at different moments in time.
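A sketch of the intentional delete; with versioning enabled this only puts a delete marker on top of the object:

```python
# First key returned by a regular listing (00.txt in this run)
first_key = s3.list_objects_v2(Bucket=BUCKET_NAME)['Contents'][0]['Key']
print(f'Intentionally removing file {first_key}')
s3.delete_object(Bucket=BUCKET_NAME, Key=first_key)
```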

Output:

Intentionally removing file 00.txt

Now, let’s randomly delete some files in the bucket. In the code below there’s a 25% chance for each file to be removed. Let’s assume that this operation is unintentional, e.g. it’s executed against the wrong S3 bucket. This is our “simulated disaster”, and later in the article we will want to revert those delete operations.

The crucial factor here is that we need to know when we executed the erroneous operation; by checking when the delete markers were created and comparing those timestamps to the period in which the operation ran, we can identify which delete markers were created by accident. To make things simpler, I record the start and end timestamps of the accidental data removal in the code. I also make sure that at least 60 seconds have passed since the intentional file removal, so that the two periods are easy to tell apart.
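A sketch of the “disaster” cell; the 10-second buffers and the exact variable names are one possible implementation, the important part is recording START_TIMESTAMP and END_TIMESTAMP as timezone-aware UTC datetimes:

```python
import random
import time
from datetime import datetime, timezone

time.sleep(60)  # keep a clear gap after the intentional delete above

START_TIMESTAMP = datetime.now(timezone.utc)
print("Let's wait another 10 seconds to make sure that some minor changes "
      "in time synchronization will not break our logic...")
time.sleep(10)

# The simulated disaster: each object has a 25% chance of being deleted
for obj in s3.list_objects_v2(Bucket=BUCKET_NAME)['Contents']:
    if random.random() < 0.25:
        print(f"deleting {obj['Key']}")
        s3.delete_object(Bucket=BUCKET_NAME, Key=obj['Key'])

print('Waiting 10 more seconds after the delete...')
time.sleep(10)
END_TIMESTAMP = datetime.now(timezone.utc)
print('Done')
```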

Output:

Let's wait another 10 seconds to make sure that some minor changes in time synchronization will not break our logic...
deleting 05.txt
deleting 09.txt
deleting 12.txt
deleting 13.txt
deleting 18.txt
deleting 19.txt
deleting 23.txt
Waiting 10 more seconds after the delete...
Done

A regular listing of files will now return only the non-removed files (those without delete markers).

Output:

Found files: 18

Let’s see the results of a version-aware listing of this bucket. We use the values of START_TIMESTAMP and END_TIMESTAMP recorded while deleting files in the simulated disaster, together with the two functions defined earlier in the article. By providing start_date and end_date we filter the results to only those events which happened in the relevant period.
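Something along these lines; the status variable and the dict keys follow the sketch of the functions above:

```python
status = get_versioning_delete_overwrite_status(
    s3, BUCKET_NAME, START_TIMESTAMP, END_TIMESTAMP)
print_versions_summary(status)

# Keep the delete markers created in the window; we will remove them in a moment
delete_markers = status['delete_markers']
```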

Output:

For the given period the following changes were found:
The following delete markers were created in the analyzed period:
{'Key': '05.txt', 'VersionId': 'Tprx_1S2c9JbKWx0hxFu_vyV6UnKhMco'}
{'Key': '09.txt', 'VersionId': 'rk7gDGApWPinmMWr8GkhfxXN0EuC4S.V'}
{'Key': '12.txt', 'VersionId': 'PGO4o2vKTCv3hqtPWFAjm.EvbfYebNOZ'}
{'Key': '13.txt', 'VersionId': 'TyUDpCD4qOU04o64w6rtpnpcjkIgRC8t'}
{'Key': '18.txt', 'VersionId': 'j5ut3iQWL0KaUgq4G4MKJbGs2mJ0PKrC'}
{'Key': '19.txt', 'VersionId': '1qNd2Ozz0grg_7g5daNzE3_SbnX5iUJU'}
{'Key': '23.txt', 'VersionId': '6Ghvijy9my5kM9LdU0VHS1IbDOuwQfGW'}
The following files are also hidden behind delete markers, but those markers were added in a different time period:
00.txt
No versions overwriting older files were uploaded

Please note that the identified files are the same as those printed before when randomly removing objects. Also, the delete marker created before the relevant period is correctly identified as not relevant and kept separately, so that we know we should not restore that file (its delete marker should be left intact).

Remove identified delete markers

The delete_markers list holds the keys of the delete markers together with their VersionIds, which need to be specified when talking to S3 so that we point at the delete markers and not at some other versions of these objects. To restore the old files "hidden" by the delete markers, let's just remove those delete markers and see if the files become visible again when listing the bucket.

First, let’s make sure that the files are not returned by a regular, non-version-aware listing.
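A quick check against a regular listing:

```python
listed_keys = {obj['Key'] for obj in s3.list_objects_v2(Bucket=BUCKET_NAME).get('Contents', [])}
for marker in delete_markers:
    if marker['Key'] not in listed_keys:
        print(f"{marker['Key']} is not returned by listing operation, "
              "it's probably hidden by a delete marker.")
```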

Output:

05.txt is not returned by listing operation, it's probably hidden by a delete marker.
09.txt is not returned by listing operation, it's probably hidden by a delete marker.
12.txt is not returned by listing operation, it's probably hidden by a delete marker.
13.txt is not returned by listing operation, it's probably hidden by a delete marker.
18.txt is not returned by listing operation, it's probably hidden by a delete marker.
19.txt is not returned by listing operation, it's probably hidden by a delete marker.
23.txt is not returned by listing operation, it's probably hidden by a delete marker.

Now, let’s remove the delete markers! That will hopefully restore the files.

WARNING! Deleting objects via boto3 with a VersionId specified, as shown below, removes that version permanently and beyond any possibility of restoring it. So we need to be sure that what we’re deleting really are those delete markers and not some valid file versions. Any mistake here may mean that some data is completely lost, gone, bye bye, like in the Parrot Sketch:

This parrot is no more! He has ceased to be! ‘E’s expired and gone to meet ‘is maker! ‘E’s a stiff! Bereft of life, ‘e rests in peace! If you hadn’t nailed ‘im to the perch ‘e’d be pushing up the daisies! ‘Is metabolic processes are now ‘istory! ‘E’s off the twig! ‘E’s kicked the bucket, ‘e’s shuffled off ‘is mortal coil, run down the curtain and joined the bleedin’ choir invisibile!! He’s f*ckin’ snuffed it!….. THIS IS AN EX-PARROT!!

https://www.dailymotion.com/video/x2hwqnp

Enough of the Pythons; in short: if you’re executing the code below, you should really make sure that you’re running it against the appropriate bucket and prefixes.

Let’s see the code for removing the delete markers:
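A sketch of the delete-marker removal; specifying the marker’s VersionId is what makes delete_object remove the marker itself and un-hide the previous version:

```python
for marker in delete_markers:
    print(f'attempting to restore {marker}')
    s3.delete_object(Bucket=BUCKET_NAME,
                     Key=marker['Key'],
                     VersionId=marker['VersionId'])
print('Finished')
```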

Output:

attempting to restore {'Key': '05.txt', 'VersionId': 'Tprx_1S2c9JbKWx0hxFu_vyV6UnKhMco'}
attempting to restore {'Key': '09.txt', 'VersionId': 'rk7gDGApWPinmMWr8GkhfxXN0EuC4S.V'}
attempting to restore {'Key': '12.txt', 'VersionId': 'PGO4o2vKTCv3hqtPWFAjm.EvbfYebNOZ'}
attempting to restore {'Key': '13.txt', 'VersionId': 'TyUDpCD4qOU04o64w6rtpnpcjkIgRC8t'}
attempting to restore {'Key': '18.txt', 'VersionId': 'j5ut3iQWL0KaUgq4G4MKJbGs2mJ0PKrC'}
attempting to restore {'Key': '19.txt', 'VersionId': '1qNd2Ozz0grg_7g5daNzE3_SbnX5iUJU'}
attempting to restore {'Key': '23.txt', 'VersionId': '6Ghvijy9my5kM9LdU0VHS1IbDOuwQfGW'}
Finished

Now the files should be returned by a regular listing of the bucket. There are no delete markers, so the files are “restored”:
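The same listing check as before, this time expecting the keys to be present:

```python
listed_keys = {obj['Key'] for obj in s3.list_objects_v2(Bucket=BUCKET_NAME).get('Contents', [])}
for marker in delete_markers:
    if marker['Key'] in listed_keys:
        print(f"The file with key {marker['Key']} is returned by a regular, "
              "non-version-aware listing operation.")
```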

Output:

The file with key 05.txt is returned by a regular, non-version-aware listing operation.
The file with key 09.txt is returned by a regular, non-version-aware listing operation.
The file with key 12.txt is returned by a regular, non-version-aware listing operation.
The file with key 13.txt is returned by a regular, non-version-aware listing operation.
The file with key 18.txt is returned by a regular, non-version-aware listing operation.
The file with key 19.txt is returned by a regular, non-version-aware listing operation.
The file with key 23.txt is returned by a regular, non-version-aware listing operation.

If there is such a line in the output for each file affected by our simulated disaster, it means that we’ve successfully restored the accidentally removed files. One last time, let’s make sure that the delete markers from the relevant period are no more, but that the delete marker created at a different time is still there (because it was added intentionally):
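This simply re-runs the classification for the same window:

```python
print_versions_summary(
    get_versioning_delete_overwrite_status(s3, BUCKET_NAME, START_TIMESTAMP, END_TIMESTAMP))
```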

Output:

For the given period the following changes were found:
The following delete markers were created in the analyzed period:
The following files are also hidden behind delete markers, but those markers were added in a different time period:
00.txt
No versions overwriting older files were uploaded

Overwriting files — how to identify accidental overwrites and restore older versions

Overwriting happens when we upload a file whose path (bucket + key) is the same as that of an object already stored in the same bucket on S3. In such a situation, if we have versioning enabled, the old file is still stored, but it becomes a “hidden”, historical version of the object under the given key; the newly uploaded file becomes the “current” version and is returned by default when fetching or scanning the object. The old file may still be retrieved by specifying its VersionId. We may also “restore” an old version of an object to be the current one by removing the newer versions that overwrote it.
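For illustration, fetching the current version and a specific historical version of the same key differs only in the VersionId parameter (the ID below is a placeholder for a real one taken from a version listing):

```python
current_body = s3.get_object(Bucket=BUCKET_NAME, Key='07.txt')['Body'].read()
older_body = s3.get_object(Bucket=BUCKET_NAME, Key='07.txt',
                           VersionId='SOME-OLDER-VERSION-ID')['Body'].read()
```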

Let’s overwrite the first file. Again, as with the delete markers, let’s assume that we overwrite this file intentionally.
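A sketch of the intentional overwrite (the new body is arbitrary):

```python
first_key = '00.txt'
print(f'Intentionally overwriting file: {first_key}')
s3.put_object(Bucket=BUCKET_NAME, Key=first_key,
              Body=f'{first_key} new contents, overwritten on purpose'.encode())
```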

Output:

Intentionally overwriting file: 00.txt

Now let’s overwrite files randomly, again keeping track of the start and end timestamps. We skip the first file (it was overwritten intentionally in the step before). We also upload some additional files that were not uploaded before (completely new objects; these will be the first versions under their keys in the bucket).
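A sketch of this cell; the 10% overwrite probability, the OVERWRITE_* variable names, the backward padding of the start timestamp and the single new key 46.txt are assumptions chosen to roughly match the output below:

```python
from datetime import timedelta

# Pad the window start a little backwards against clock skew between us and S3
OVERWRITE_START_TIMESTAMP = datetime.now(timezone.utc) - timedelta(seconds=10)

existing_keys = [obj['Key'] for obj in s3.list_objects_v2(Bucket=BUCKET_NAME)['Contents']]
for key in existing_keys[1:]:  # skip the first, intentionally overwritten file
    if random.random() < 0.1:
        print(f'Accidentally overwriting file: {key}')
        s3.put_object(Bucket=BUCKET_NAME, Key=key,
                      Body=f'{key} accidentally overwritten contents'.encode())

# Completely new objects, not present in the bucket before
for key in ['46.txt']:
    print(f'Uploading new file: {key}')
    s3.put_object(Bucket=BUCKET_NAME, Key=key, Body=f'{key} contents'.encode())

time.sleep(10)
OVERWRITE_END_TIMESTAMP = datetime.now(timezone.utc)
```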

Output:

Accidentally overwriting file: 07.txt
Accidentally overwriting file: 13.txt
Uploading new file: 46.txt

Let’s list the contents of the bucket and see if we are able to identify the accidentally overwritten files based on the creation dates of the respective versions. The get_versioning_delete_overwrite_status function gives us, among other things, two lists corresponding to new versions created in the relevant period:

  • new_overwriting_versions - versions which overwrote some older versions, that is, their keys are the same as the keys of objects that already existed in the bucket before we started the upload
  • new_versions - versions whose keys did not appear in this bucket before, which means that they did not overwrite any existing data.

The contents of those lists are appropriately printed below.
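Again a sketch, reusing the same helper and keeping the list we need later:

```python
overwrite_status = get_versioning_delete_overwrite_status(
    s3, BUCKET_NAME, OVERWRITE_START_TIMESTAMP, OVERWRITE_END_TIMESTAMP)
print_versions_summary(overwrite_status)

new_overwriting_versions = overwrite_status['new_overwriting_versions']
```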

Output:

For the given period the following changes were found:
The following delete markers were created in the analyzed period:
The following files are also hidden behind delete markers, but those markers were added in a different time period:
00.txt
The following versions overwriting older files were uploaded:
{'Key': '00.txt', 'VersionId': 'VeXV9wy6C7tIhUMJ7iKa3UMAUMFxmSK0'}
{'Key': '07.txt', 'VersionId': 'MFWLe8nZCo.xxoDkFli8GxIZhqSVQJC3'}
{'Key': '13.txt', 'VersionId': '.9dxk9s3.Y0IAYiJ2T5vxw6yabjWib7i'}
The following were newly uploaded files found (not overwriting any older files):
{'Key': '46.txt', 'VersionId': '0rzLzSiOa7kOay3yDPzNCbmukpuACp0Z'}

Note that the file we overwrote “intentionally” (00.txt) is also picked up here: unlike in the deletion example, only a couple of seconds separate the intentional overwrite from the accidental ones, so its new version falls inside the analysed window as well. In a real recovery this is exactly the moment to review the list and exclude versions you know were changed on purpose; for simplicity, in the rest of this exercise we treat all of the identified versions as accidental.

Now, for each key found to be overwritten, let’s fetch and display a detailed list of all versions of these keys:
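A sketch of the per-key version listing (for these short keys a Prefix filter is enough to isolate a single key):

```python
for version_info in new_overwriting_versions:
    key = version_info['Key']
    print(f'All versions for key {key}')
    response = s3.list_object_versions(Bucket=BUCKET_NAME, Prefix=key)
    for v in response.get('Versions', []):
        print(f"ID {v['VersionId']}, IsLatest: {v['IsLatest']}, LastModified: {v['LastModified']}")
```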

Output:

All versions for key 00.txt
ID VeXV9wy6C7tIhUMJ7iKa3UMAUMFxmSK0, IsLatest: True, 'LastModified: 2020-11-02 22:19:36+00:00
ID QZeuwqtryA43kx6dvtpsjEqif2oXWcjF, IsLatest: False, 'LastModified: 2020-11-02 22:18:02+00:00
All versions for key 07.txt
ID MFWLe8nZCo.xxoDkFli8GxIZhqSVQJC3, IsLatest: True, 'LastModified: 2020-11-02 22:19:38+00:00
ID VZwz2FRJeUm8etPAOMRq3Y4Fj9_AOaA7, IsLatest: False, 'LastModified: 2020-11-02 22:18:03+00:00
All versions for key 13.txt
ID .9dxk9s3.Y0IAYiJ2T5vxw6yabjWib7i, IsLatest: True, 'LastModified: 2020-11-02 22:19:38+00:00
ID liw6uFc4ziXnKmVSP5713S_I5sc3vR9J, IsLatest: False, 'LastModified: 2020-11-02 22:18:05+00:00

Cool. Each of those keys should now have exactly two versions: the original one and the overwriting one created in the analysed window. Let’s remove those overwriting versions.

Again, please be extra careful, because removing versions as shown below deletes them permanently, with no way to restore them.
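A sketch of the removal; as before, passing the VersionId makes the delete permanent for that specific version:

```python
for version_info in new_overwriting_versions:
    print(f'Removing accidentally uploaded version {version_info}')
    s3.delete_object(Bucket=BUCKET_NAME,
                     Key=version_info['Key'],
                     VersionId=version_info['VersionId'])
print('Finished')
```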

Output:

Removing accidentally uploaded version {'Key': '00.txt', 'VersionId': 'VeXV9wy6C7tIhUMJ7iKa3UMAUMFxmSK0'}
Removing accidentally uploaded version {'Key': '07.txt', 'VersionId': 'MFWLe8nZCo.xxoDkFli8GxIZhqSVQJC3'}
Removing accidentally uploaded version {'Key': '13.txt', 'VersionId': '.9dxk9s3.Y0IAYiJ2T5vxw6yabjWib7i'}
Finished

Now let’s display all versions of those keys again. The newer versions uploaded in the relevant period are gone, which means that the old versions from before the disaster are the current ones again and will be used when e.g. scanning those files. (The one exception visible in the output is 00.txt: its remaining version still reports IsLatest: False, because the delete marker we intentionally left in place earlier is once again its latest entry.)

Output:

All versions for key 00.txt
ID QZeuwqtryA43kx6dvtpsjEqif2oXWcjF, IsLatest: False, 'LastModified: 2020-11-02 22:18:02+00:00
All versions for key 07.txt
ID VZwz2FRJeUm8etPAOMRq3Y4Fj9_AOaA7, IsLatest: True, 'LastModified: 2020-11-02 22:18:03+00:00
All versions for key 13.txt
ID liw6uFc4ziXnKmVSP5713S_I5sc3vR9J, IsLatest: True, 'LastModified: 2020-11-02 22:18:05+00:00

Yaay! We’ve “undone” the disaster!

Procedure for disaster recovery in detail

Before the final summary, let’s write down the steps that should be taken in case of accidental data loss when versioning is enabled. In the intro to the article I already presented such a procedure; now let’s extend it with pointers to the appropriate functions and pieces of code presented above.

The starting point is the moment we realise that we’ve accidentally removed or overwritten some data and want to revert our actions. From here:

  1. Identify the exact time when the erroneous operation was performed. Define START_TIMESTAMP and END_TIMESTAMP of the operation which caused the data loss
  2. Identify what data may have been affected, i.e. see on what bucket and prefixes in this bucket the operation was performed
  3. If possible (e.g. if the data is small enough) make a copy of the affected prefixes in the bucket. This is an optional step.
  4. If the number of affected files is small and they are easy to identify in the AWS Console (the web GUI), you may remove the delete markers or incorrect current versions manually there to revert to the previous ones
  5. If you cannot easily identify the affected files, identify the affected versions automatically based on the defined timestamps (START_TIMESTAMP and END_TIMESTAMP) using the get_versioning_delete_overwrite_status function. List the identified, potentially incorrect versions using the print_versions_summary function
  6. Analyse print_versions_summary output. Pick some random versions identified to be created in the given period and make sure that the output is correct (i.e. that indeed these versions were created at this time).
  7. It is possible that the output of print_versions_summary will include versions created in the analysed period by some other operation which just happened to run at the same time. If this is the case, try to further limit the part of the bucket being listed using the Prefix parameter, so that the results still include the unintentional data loss but skip the versions created correctly by the other operation
  8. When you’re sure that you have a complete list of incorrect versions (including delete markers), proceed with removing them using s3.delete_object calls with the VersionId specified, as shown in the examples above.
  9. If the previous step was successful, delete markers and incorrect new versions are now gone and the old versions from before the accidental operation are now used by AWS as the current versions of these objects.
  10. Celebrate!

In this article we’ve seen that, if we know when the data was deleted or overwritten by accident, we are able to identify erroneous versions and delete markers merely by utilising versioning, bucket listing and object metadata. We do not need any logs to analyse. We can also automatically revert those erroneous actions with appropriate Python code. Still, the crucial thing is that we need to know exactly when the data was deleted or overwritten.

Wrapping things up, if you are a Data Engineer working with S3 on a daily basis, I’d recommend running such a “simulated” disaster as an exercise. If some bad stuff happens, you’ll have some experience of what to do and the whole situation will hopefully be far less painful. Thanks for reading!

Originally published at https://dev.fandom.com.
