Replaying events using AWS S3

Marko Savic
SMG Real Estate Engineering Blog
Apr 8, 2022

In the world of event-driven microservices, one important task is to have a mechanism to recover from failed events, as well as the possibility to replay certain events. At Homegate we work with serverless microservices. If any of these services fails while processing a resource, we might need to recreate the last event, or a series of events, for that resource.

Use case

The central part of our ingestion process is an AWS Step Functions state machine that lives in the so-called importer service. It is responsible for various enrichments and conditional actions on a resource. The last step of the state machine sends an event to AWS SNS, which is consumed by many downstream services, as you can see in the following diagram.

Step Functions as the central part of the ingestion process
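To make the flow concrete, here is a minimal sketch of such a state machine in Amazon States Language, built as a Python dictionary. The single enrichment Lambda, the state names and the ARNs are hypothetical placeholders (the real importer has more enrichment and conditional steps). The last state uses the Step Functions SNS service integration to hand the result to downstream services.

```python
import json

# Hypothetical ARNs: replace with the importer's real Lambda function and SNS topic.
ENRICH_LAMBDA_ARN = "arn:aws:lambda:eu-central-1:123456789012:function:enrich-resource"
RESOURCE_TOPIC_ARN = "arn:aws:sns:eu-central-1:123456789012:resource-events"


def build_importer_definition(enrich_lambda_arn: str, topic_arn: str) -> dict:
    """Return a minimal Amazon States Language definition: enrich, then publish to SNS."""
    return {
        "Comment": "Sketch of an importer flow (assumed structure)",
        "StartAt": "EnrichResource",
        "States": {
            "EnrichResource": {
                "Type": "Task",
                # Optimised Lambda integration; the function's result is under "Payload".
                "Resource": "arn:aws:states:::lambda:invoke",
                "Parameters": {"FunctionName": enrich_lambda_arn, "Payload.$": "$"},
                "OutputPath": "$.Payload",
                "Next": "PublishEvent",
            },
            "PublishEvent": {
                "Type": "Task",
                # Last step: fan the enriched resource out to downstream services via SNS.
                "Resource": "arn:aws:states:::sns:publish",
                # SNS messages are strings, so serialise the current state explicitly.
                "Parameters": {"TopicArn": topic_arn, "Message.$": "States.JsonToString($)"},
                "End": True,
            },
        },
    }


if __name__ == "__main__":
    definition = build_importer_definition(ENRICH_LAMBDA_ARN, RESOURCE_TOPIC_ARN)
    print(json.dumps(definition, indent=2))
```

The printed JSON is what you would pass as the state machine definition, for example in a CloudFormation AWS::StepFunctions::StateMachine resource.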

This raises several open questions:

  • How do we deal with failures within AWS Step Functions?
  • How do we recreate an SNS event if a downstream system needs it?
  • Would it be possible to replay the whole Step Functions execution?

Solution

One possible answer to these questions is to store the events that go through AWS Step Functions in some kind of event storage; at Homegate, we chose AWS S3. To keep the whole history of events for a given resource, we enable S3 bucket versioning. This is quite handy for replaying a series of events, and even for debugging. Here is a sample CloudFormation resource that enables this functionality:

```yaml
Resources:
  EventLogBucket:
    Type: 'AWS::S3::Bucket'
    Properties:
      BucketName: 'sample-unique-bucket-name'
      AccessControl: Private
      PublicAccessBlockConfiguration:
        BlockPublicAcls: true
        BlockPublicPolicy: true
        IgnorePublicAcls: true
        RestrictPublicBuckets: true
      Tags:
        - Key: service
          Value: sample-service
        - Key: env
          # Assumes a 'Stages' mapping and a 'Stage' parameter defined elsewhere in the template
          Value: !FindInMap [ Stages, !Ref Stage, Env ]
      # Keeps every event written under the same key as a separate object version
      VersioningConfiguration:
        Status: Enabled
```

That way, we can replay the whole Step Functions flow using the last logged event for a given resource. This functionality can be triggered from different places, such as a dead-letter queue (DLQ) or an API endpoint. The following diagram and the description below explain how all these components communicate with each other.

S3 event log bucket as the central part of the replay mechanism
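Replaying from the last logged event could look roughly like the sketch below. It assumes the event log bucket is keyed by resource id and holds JSON documents; replay_latest is a hypothetical name, and the S3 and Step Functions clients are passed in so the logic can be exercised with stand-ins.

```python
import json


def replay_latest(s3, sfn, bucket: str, state_machine_arn: str, resource_id: str) -> str:
    """Restart the state machine with the most recently logged event for one resource."""
    # GetObject without a VersionId returns the current (latest) version of the key.
    obj = s3.get_object(Bucket=bucket, Key=resource_id)
    # Parse before starting, so a corrupt log entry fails here rather than inside the flow.
    event = json.loads(obj["Body"].read())
    # No execution name is passed, so Step Functions generates a unique one per replay.
    response = sfn.start_execution(stateMachineArn=state_machine_arn, input=json.dumps(event))
    return response["executionArn"]
```

In production the clients would be boto3.client("s3") and boto3.client("stepfunctions"), which requires boto3 and AWS credentials.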

As you can see, the first step of the state machine is to store the event in the corresponding S3 bucket, using the resource id as the S3 object key. Here is an example of the S3 structure with versioning enabled, where the whole history of events is available under the same object key:

Object versioning in S3
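Writing to and reading from such a versioned log might look like this minimal sketch. It assumes each event is JSON with its resource id in an "id" field (a hypothetical convention); store_event and load_version are illustrative names, and s3 is a boto3 S3 client or anything with the same put_object/get_object methods. With versioning enabled, PutObject returns the VersionId of the new version, which is what makes older events retrievable later.

```python
import json
from typing import Any, Dict


def store_event(s3, bucket: str, event: Dict[str, Any]) -> str:
    """Log an event under its resource id and return the S3 version id of this write."""
    response = s3.put_object(
        Bucket=bucket,
        Key=event["id"],  # assumption: the resource id travels in an "id" field
        Body=json.dumps(event).encode("utf-8"),
        ContentType="application/json",
    )
    # With bucket versioning enabled, every write to the same key gets a new VersionId.
    return response["VersionId"]


def load_version(s3, bucket: str, resource_id: str, version_id: str) -> Dict[str, Any]:
    """Read one historical event for a resource by its version id."""
    response = s3.get_object(Bucket=bucket, Key=resource_id, VersionId=version_id)
    return json.loads(response["Body"].read())
```

Keeping the version id returned by store_event (for example in the execution's logs) makes it easy to later point at the exact event a given run processed.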

In case of a failure, a simple DLQ can be used for error handling. Different approaches are possible here, such as an automatic replay mechanism that triggers a Lambda function on a schedule to fetch messages from the DLQ and start the replay. If all conditions are satisfied (for example, if the error is retriable), the Lambda function fetches the latest event for the given resource and triggers the state machine again.
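A scheduled auto-replay Lambda along those lines could be sketched as follows. The DLQ message format (resourceId, errorType), the set of retriable errors and the function names are assumptions for illustration; sqs, s3 and sfn stand for boto3 SQS, S3 and Step Functions clients.

```python
import json
from typing import Any, Dict

# Assumption: each DLQ message body is JSON such as {"resourceId": "r1", "errorType": "States.Timeout"}.
# Which errors count as retriable is a hypothetical policy; adjust it to your own failure modes.
RETRIABLE_ERRORS = {"States.Timeout", "ThrottlingException", "ServiceUnavailable"}


def is_retriable(body: Dict[str, Any]) -> bool:
    return body.get("errorType") in RETRIABLE_ERRORS


def drain_dlq(sqs, s3, sfn, queue_url: str, bucket: str, state_machine_arn: str, max_batches: int = 10) -> int:
    """Scheduled auto-replay: restart the flow for retriable failures; returns how many were replayed."""
    replayed = 0
    for _ in range(max_batches):
        response = sqs.receive_message(QueueUrl=queue_url, MaxNumberOfMessages=10, WaitTimeSeconds=1)
        messages = response.get("Messages", [])
        if not messages:
            break
        for message in messages:
            body = json.loads(message["Body"])
            if not is_retriable(body):
                continue  # stays in the DLQ and becomes visible again after the visibility timeout
            # GetObject without VersionId returns the latest logged event for the resource.
            latest = s3.get_object(Bucket=bucket, Key=body["resourceId"])
            sfn.start_execution(
                stateMachineArn=state_machine_arn,
                input=latest["Body"].read().decode("utf-8"),
            )
            # Delete only after the replay has started: a crash before this line means a retry, not a loss.
            sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=message["ReceiptHandle"])
            replayed += 1
    return replayed
```

Deleting the message only after the new execution has started favours a duplicate replay over a lost one, which fits a flow that already stores and re-sends events idempotently by resource id. Non-retriable messages stay in the DLQ for manual inspection.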

Replay functionality is also available from outside the service, in case we need to repeat the import of certain resources. For more flexibility, it is exposed through AWS API Gateway, where the consumer of the endpoint can provide various input parameters: multiple resources, all the resources that belong to a specific user, or whether the new execution needs special treatment (replaying a series of events, or applying a specific data manipulation). To support this flexibility, a Replay SQS queue sits between the Replay Handler Lambda and the Replay Listener Lambda.
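The Replay Handler's job of turning one API request into many queue messages could be sketched like this. The request fields (resourceIds, userId, mode), the resources_of_user lookup and the function names are hypothetical; sqs stands for a boto3 SQS client, and event is an API Gateway proxy event whose JSON payload is in "body".

```python
import json
from typing import Any, Callable, Dict, Iterable, List

# Hypothetical request contract for the replay endpoint, for example
#   {"resourceIds": ["r1", "r2"]}  or  {"userId": "u42", "mode": "series"}
VALID_MODES = {"latest", "series"}
SQS_BATCH_LIMIT = 10  # SendMessageBatch accepts at most 10 entries per call


def build_replay_messages(
    request: Dict[str, Any], resources_of_user: Callable[[str], Iterable[str]]
) -> List[Dict[str, str]]:
    """Expand one replay request into one message per resource."""
    mode = request.get("mode", "latest")
    if mode not in VALID_MODES:
        raise ValueError(f"unsupported mode: {mode}")
    resource_ids = list(request.get("resourceIds", []))
    if "userId" in request:
        # resources_of_user is a hypothetical lookup, e.g. against the importer's datastore
        resource_ids.extend(resources_of_user(request["userId"]))
    # Deduplicate while keeping order, so each resource is replayed once per request.
    unique_ids = list(dict.fromkeys(resource_ids))
    if not unique_ids:
        raise ValueError("no resources to replay")
    return [{"resourceId": rid, "mode": mode} for rid in unique_ids]


def handle_replay_request(event, sqs, queue_url, resources_of_user) -> Dict[str, Any]:
    """API Gateway proxy handler body: validate, enqueue, answer 202 (accepted)."""
    try:
        messages = build_replay_messages(json.loads(event.get("body") or "{}"), resources_of_user)
    except ValueError as exc:  # also covers json.JSONDecodeError
        return {"statusCode": 400, "body": json.dumps({"error": str(exc)})}
    for start in range(0, len(messages), SQS_BATCH_LIMIT):
        chunk = messages[start:start + SQS_BATCH_LIMIT]
        response = sqs.send_message_batch(
            QueueUrl=queue_url,
            Entries=[{"Id": str(i), "MessageBody": json.dumps(m)} for i, m in enumerate(chunk)],
        )
        if response.get("Failed"):
            raise RuntimeError(f"failed to enqueue: {response['Failed']}")
    return {"statusCode": 202, "body": json.dumps({"queued": len(messages)})}
```

Splitting a request into one message per resource is what lets the listener work independently of the API call: in this sketch, even a request for all of a user's resources is answered as soon as the messages are queued.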

Finally, the Replay Listener is responsible for fetching the events from S3 and triggering the state machine according to the given input.
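A Replay Listener might look like the sketch below, assuming each queue message is JSON with a resourceId and an optional mode of "latest" or "series" (a hypothetical format). In "latest" mode it replays the current version; in "series" mode it lists all versions of the key with the ListObjectVersions paginator and replays them oldest first. The clients are assumed to be boto3 S3 and Step Functions clients, and the function names are illustrative.

```python
import json
from typing import Any, Dict, List, Optional


def version_ids_oldest_first(s3, bucket: str, key: str) -> List[str]:
    """All stored versions of one resource's event, oldest first."""
    paginator = s3.get_paginator("list_object_versions")
    newest_first: List[str] = []
    for page in paginator.paginate(Bucket=bucket, Prefix=key):
        # Prefix also matches longer keys (e.g. "r1" matches "r10"), so compare exactly.
        newest_first.extend(v["VersionId"] for v in page.get("Versions", []) if v["Key"] == key)
    return newest_first[::-1]  # S3 lists the versions of a key newest first


def replay_record(s3, sfn, bucket: str, state_machine_arn: str, message: Dict[str, Any]) -> List[str]:
    """Start one execution per event to replay; returns the execution ARNs."""
    key = message["resourceId"]
    versions: List[Optional[str]]
    if message.get("mode") == "series":
        versions = list(version_ids_oldest_first(s3, bucket, key))
    else:
        versions = [None]  # None: read whatever version is current
    execution_arns = []
    for version_id in versions:
        kwargs = {"Bucket": bucket, "Key": key}
        if version_id is not None:
            kwargs["VersionId"] = version_id
        body = s3.get_object(**kwargs)["Body"].read().decode("utf-8")
        result = sfn.start_execution(stateMachineArn=state_machine_arn, input=body)
        execution_arns.append(result["executionArn"])
    return execution_arns


def handle_sqs_event(event, s3, sfn, bucket: str, state_machine_arn: str) -> int:
    """Entry point for the SQS-triggered listener; returns the number of executions started."""
    started = 0
    for record in event["Records"]:  # SQS delivers message payloads in record["body"]
        started += len(replay_record(s3, sfn, bucket, state_machine_arn, json.loads(record["body"])))
    return started
```

Starting executions in order does not make them finish in order. If a series must be applied strictly sequentially, the listener would have to wait for each execution (for example by polling describe_execution) before starting the next one.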

To conclude, in modern system architecture the ability to replay events is a necessity and has many use cases. As we saw in the simplified real-life example above, AWS S3 can serve as a simple yet very effective tool for replaying events and for helping to keep intercommunicating microservices consistent.
