Design Patterns / Tools / Curated Articles

Julien Hurault
4 min read · Oct 30, 2023


Data Eng Weekly — Ep. 3

Design Pattern of the week: idempotency (2/2)

Last week, we looked at the definition of idempotency in a data engineering context: a Lambda function is idempotent if processing the same message multiple times produces the same result as processing it once.
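Before reaching for a library, the idea can be sketched in plain Python (no AWS dependencies; all names here are ours, not the library's): cache each result under a hash of the payload, so a redelivered message returns the stored output instead of re-running the side effect.

```python
import hashlib
import json

# Results keyed by a hash of the payload: a duplicate delivery
# finds its key already present and gets the cached result back.
_results: dict[str, dict] = {}

def _payload_key(event: dict) -> str:
    # Hash a canonical (sorted-keys) JSON encoding of the event.
    return hashlib.md5(json.dumps(event, sort_keys=True).encode()).hexdigest()

def handle(event: dict) -> dict:
    key = _payload_key(event)
    if key in _results:
        return _results[key]  # duplicate: skip the side effect
    result = {"charged": event["amount"], "status": "ok"}  # the "side effect"
    _results[key] = result
    return result
```

Calling `handle` twice with the same event runs the effect once and returns the same result both times; that is the property the decorator below gives you for free.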

Implementing idempotency protection from scratch is quite costly in terms of time and quickly generates a large amount of code with no business logic in it that will still have to be maintained.

This week, we will see how the Python library aws-lambda-powertools can help us set up idempotency quickly. The library actually contains a broader set of useful utilities (logging, tracing, batch processing) that encapsulate design best practices.

One submodule of the library is called “idempotency”: it provides an @idempotent decorator that you can simply add to your Lambda handler (and/or to a specific idempotent function).

from aws_lambda_powertools.utilities.idempotency import (
    DynamoDBPersistenceLayer,
    idempotent,
)

persistence_layer = DynamoDBPersistenceLayer(table_name="IdempotencyTable")

@idempotent(persistence_store=persistence_layer)
def handler(event, context):
    payment = create_subscription_payment(
        user=event['user'],
        product=event['product_id']
    )
    ...
    return {
        "payment_id": payment.id,
        "message": "success",
        "statusCode": 200,
    }

For each event, the decorated handler automatically writes its processing status (“INPROGRESS”, “COMPLETED”, or “EXPIRED”) to a persistence layer (most of the time, a DynamoDB table).

The event table of the persistence layer in DynamoDB has the following structure:

  • id = Partition key of the table; hashed representation of the payload
  • expiration = Unix timestamp of when the record expires
  • status = “INPROGRESS”, “COMPLETED”, or “EXPIRED”
  • data = Stores the result of a successfully executed Lambda handler
  • validation = Hashed representation of the parts of the event used for validation
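To make the field list concrete, here is what one such record might look like, built in plain Python. The field names mirror the list above; the concrete values (user, product, payment id, 1-hour TTL) are made up for illustration.

```python
import hashlib
import json
import time

event = {"user": "u-42", "product_id": "p-7"}

# Sketch of a single idempotency record; hypothetical values.
record = {
    # Partition key: hash of the (canonicalized) payload
    "id": hashlib.md5(json.dumps(event, sort_keys=True).encode()).hexdigest(),
    "expiration": int(time.time()) + 3600,  # Unix timestamp, 1 hour from now
    "status": "COMPLETED",
    # Serialized handler result, replayed to duplicate deliveries
    "data": json.dumps({"payment_id": "pay-123", "statusCode": 200}),
    "validation": None,  # only populated when payload validation is enabled
}
```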

A typical lambda workflow would then be similar to this:

Thanks to the persistence layer, we can make sure the event is processed only once:

  • if a duplicate event arrives while the original one is still being processed
  • if a duplicate event arrives after the original one has completed (the validity window of an invocation is defined by the expires_after_seconds parameter passed to the decorator)
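Both duplicate scenarios, plus expiration, can be sketched with a tiny in-memory store (a library-free simplification; aws-lambda-powertools does the equivalent against DynamoDB, and the class and method names here are ours):

```python
class IdempotencyStore:
    """In-memory sketch of the persistence layer's duplicate handling."""

    def __init__(self, expires_after_seconds: int):
        self.ttl = expires_after_seconds
        self.records: dict[str, dict] = {}

    def try_start(self, key: str, now: float) -> str:
        rec = self.records.get(key)
        if rec and now < rec["expiration"]:
            # Live record: report its status so the caller can reject
            # the duplicate (mid-flight or already completed).
            return rec["status"]
        # No record, or record expired: claim the key and start processing.
        self.records[key] = {"status": "INPROGRESS",
                             "expiration": now + self.ttl}
        return "STARTED"

    def complete(self, key: str) -> None:
        self.records[key]["status"] = "COMPLETED"
```

A duplicate arriving mid-flight sees "INPROGRESS", one arriving after completion sees "COMPLETED", and once the record passes its expiration the event can be processed again, which is exactly the window expires_after_seconds controls.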

Additionally, it can log other details such as the start and end times of event processing, any errors encountered, and more.

The decorator offers a wide range of customization options; you can:

  • use a subset of the event for idempotency
  • attach the decorator to a custom function (outside of the handler)
  • extract the idempotency key automatically from the event itself (leveraging JMESPath for complex JSON parsing)
  • replace DynamoDB with any other persistent storage (a customizable abstract class is provided) or with the Lambda cache
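The first option deserves a quick illustration. Keying idempotency on a subset of the event means that retries differing only in noise fields (timestamps, request ids) still hash to the same key. A hypothetical helper, not the library's API:

```python
import hashlib
import json

def idempotency_key(event: dict, fields: list[str]) -> str:
    # Only the listed fields feed the hash; everything else is ignored.
    subset = {f: event[f] for f in fields}
    return hashlib.md5(json.dumps(subset, sort_keys=True).encode()).hexdigest()

# Two deliveries of the same order, sent at different times:
a = idempotency_key({"order_id": 1, "sent_at": "10:00"}, ["order_id"])
b = idempotency_key({"order_id": 1, "sent_at": "10:05"}, ["order_id"])
```

In the library itself, this is what the event_key_jmespath option of IdempotencyConfig is for: a JMESPath expression selects which part of the event forms the idempotency key.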

To conclude, AWS Lambda Powertools makes it really easy to build a persistence layer across your stack. You can leverage this layer to easily track the progress of events, troubleshoot issues, and improve the observability of your data platform.

Video/Article of the week

An interesting survey in the SeattleDataGuy’s Newsletter about the data stacks used by ~400 companies.

Key takeaways:

  • 49% of respondents rely on multiple analytics platforms (BigQuery, Snowflake, Databricks)
  • orchestration-tool chaos: surprisingly, Airflow still has a large market share even with the rise of Dagster and Prefect
  • data teams’ problems: above all, talent

It seems the gap between talent demand and supply is still not closed in the data engineering industry and won’t be in the next few years. Still, there are lots of opportunities in the area.

Tool of the week

The modern data stack is extremely DevOps-intensive: it quickly requires significant resources to keep the various open-source tools synchronized and up to date. Plural automates this process by providing data stacks ready to deploy in your cloud.

In one command, you get a complete data stack deployed on Kubernetes in your cloud account, and for free: they offer an “Open-source” tier with an unlimited number of apps in your stack.

Looks promising!

https://www.plural.sh/

Thank you for reading.

-Ju

I would be grateful if you could help me improve this newsletter. Don’t hesitate to share what you liked or disliked and the topics you would like to see tackled.

P.S. You can reply to this email; it will get to me.
