Creating an AWS data pipeline using Terraform Modules

Iva Boishin
Published in Converteo · 14 min read · Feb 6, 2023

Recently, I had the opportunity to design a data pipeline for data warehousing and visualisation purposes. The pipeline was built using Terraform to facilitate resource management. For those who are unfamiliar with it, Terraform is a cloud-agnostic IaC (Infrastructure as Code) tool that lets you define the infrastructure you need to deploy. Using Terraform, you can deploy the same architecture to different environments (e.g. dev and prod), and it helps you manage your resources with ease.

Terraform is a great tool but can quickly become messy and overwhelming as your infrastructure grows. During the build phase of the data warehouse project, I was dissatisfied with the amount of copying and pasting of resources I was doing and with the amount of time I spent trying to locate a specific resource in the code. I found it hard to find resources online that explained how to reduce the repetition and improve clarity. With this article, I aim to help those who are starting out with Terraform on both of those fronts.

The goal of this article is to demonstrate how to go from an architecture using only Terraform resources to one using Terraform modules. Terraform modules can be understood as functions with certain input variables. Instead of writing out all of the steps necessary for a specific task, those steps are aggregated into a function. In the Terraform world, instead of specifying all of the resources needed for a certain task, modules allow the programmer to group resources that are often deployed together.

The architecture for this project was built on AWS for one simple reason: data privacy. With the prevalence of data privacy concerns over the past couple of years, my client insisted that the data be stored exclusively in the country of operation: France. At the time, Google Cloud Platform did not have data centers in France, so Amazon Web Services was chosen for this project.

Below is a (significantly) simplified version of the architecture that was built.

Just by looking at the architecture, some elements clearly repeat, notably the Lambda Functions and their triggers. Before going into Terraform modules, here is the basic Terraform code that deploys the above architecture:

##################################################
################## LAMBDA ROLE ###################
##################################################

## Role for Lambda Functions
resource "aws_iam_role" "iam_for_lambda_analytics" {
  name = "analytics-lambda-role"

  assume_role_policy = <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": "sts:AssumeRole",
      "Principal": {
        "Service": "lambda.amazonaws.com"
      },
      "Effect": "Allow",
      "Sid": ""
    }
  ]
}
EOF
}

resource "aws_iam_policy" "iam_for_lambda_s3" {
  name = "analytics-lambda-role-s3-policy"

  policy = <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": [
        "s3:GetObject*"
      ],
      "Effect": "Allow",
      "Resource": [
        "arn:aws:s3:::mb-analytics",
        "arn:aws:s3:::mb-analytics/*"
      ]
    }
  ]
}
EOF
}

# Needed for S3 access
resource "aws_iam_role_policy_attachment" "iam_for_lambda_s3" {
  role       = aws_iam_role.iam_for_lambda_analytics.name
  policy_arn = aws_iam_policy.iam_for_lambda_s3.arn
}

# Needed for logs
resource "aws_iam_role_policy_attachment" "iam_for_lambda_logs" {
  role       = aws_iam_role.iam_for_lambda_analytics.name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole"
}

##################################################
################ LAMBDA FUNCTIONS ################
##################################################

## Zip code for Lambda Functions
data "archive_file" "lambda_code_seo" {
  type        = "zip"
  source_file = "source/seo/lambda_function.py"
  output_path = "source/seo.zip"
}

data "archive_file" "lambda_code_sea" {
  type        = "zip"
  source_file = "source/sea/lambda_function.py"
  output_path = "source/sea.zip"
}

data "archive_file" "lambda_code_website_events" {
  type        = "zip"
  source_file = "source/website_events/lambda_function.py"
  output_path = "source/website_events.zip"
}

data "archive_file" "lambda_code_website_sessions" {
  type        = "zip"
  source_file = "source/website_sessions/lambda_function.py"
  output_path = "source/website_sessions.zip"
}

## Deploy Lambda Functions
resource "aws_lambda_function" "seo" {
  function_name = "seo"
  description   = "Lambda function to transform seo data"
  role          = aws_iam_role.iam_for_lambda_analytics.arn
  filename      = "source/seo.zip"
  handler       = "lambda_function.lambda_handler"
  memory_size   = 256
  timeout       = 30

  source_code_hash = data.archive_file.lambda_code_seo.output_base64sha256

  runtime = "python3.9"
}

resource "aws_lambda_function" "sea" {
  function_name = "sea"
  description   = "Lambda function to transform sea data"
  role          = aws_iam_role.iam_for_lambda_analytics.arn
  filename      = "source/sea.zip"
  handler       = "lambda_function.lambda_handler"
  memory_size   = 256
  timeout       = 30

  source_code_hash = data.archive_file.lambda_code_sea.output_base64sha256

  runtime = "python3.9"
}

resource "aws_lambda_function" "website_events" {
  function_name = "website_events"
  description   = "Lambda function to transform website_events data"
  role          = aws_iam_role.iam_for_lambda_analytics.arn
  filename      = "source/website_events.zip"
  handler       = "lambda_function.lambda_handler"
  memory_size   = 256
  timeout       = 30

  source_code_hash = data.archive_file.lambda_code_website_events.output_base64sha256

  runtime = "python3.9"
}

resource "aws_lambda_function" "website_sessions" {
  function_name = "website_sessions"
  description   = "Lambda function to transform website_sessions data"
  role          = aws_iam_role.iam_for_lambda_analytics.arn
  filename      = "source/website_sessions.zip"
  handler       = "lambda_function.lambda_handler"
  memory_size   = 256
  timeout       = 30

  source_code_hash = data.archive_file.lambda_code_website_sessions.output_base64sha256

  runtime = "python3.9"
}

##################################################
################### SNS TOPIC ####################
##################################################

resource "aws_sns_topic" "website" {
  name = "website"

  policy = jsonencode(
    {
      "Version" : "2012-10-17",
      "Statement" : [
        {
          "Effect" : "Allow",
          "Principal" : { "Service" : "s3.amazonaws.com" },
          "Action" : "SNS:Publish",
          "Resource" : "arn:aws:sns:*:*:website",
          "Condition" : {
            "ArnEquals" : { "aws:SourceArn" : "arn:aws:s3:::mb-analytics" }
          }
        }
      ]
    }
  )
}

##################################################
############### SNS SUBSCRIPTIONS ################
##################################################

resource "aws_sns_topic_subscription" "lambda_subscription_to_sns_website_events" {
  topic_arn = aws_sns_topic.website.arn
  protocol  = "lambda"
  endpoint  = aws_lambda_function.website_events.arn

  depends_on = [
    aws_sns_topic.website
  ]
}

resource "aws_sns_topic_subscription" "lambda_subscription_to_sns_website_sessions" {
  topic_arn = aws_sns_topic.website.arn
  protocol  = "lambda"
  endpoint  = aws_lambda_function.website_sessions.arn

  depends_on = [
    aws_sns_topic.website
  ]
}

##################################################
################### S3 BUCKET ####################
##################################################

# Necessary for raw data files
resource "aws_s3_bucket" "raw_bucket" {
  bucket = "mb-analytics"
}

resource "aws_s3_bucket_acl" "raw_bucket" {
  bucket = aws_s3_bucket.raw_bucket.id
  acl    = "private"
}

##################################################
#################### S3 NOTIF ####################
##################################################

resource "aws_s3_bucket_notification" "s3_bucket_notification" {
  bucket = aws_s3_bucket.raw_bucket.id

  lambda_function {
    lambda_function_arn = aws_lambda_function.seo.arn
    events              = ["s3:ObjectCreated:*"]
    filter_suffix       = "seo.csv"
  }

  lambda_function {
    lambda_function_arn = aws_lambda_function.sea.arn
    events              = ["s3:ObjectCreated:*"]
    filter_suffix       = "sea.csv"
  }

  topic {
    topic_arn     = aws_sns_topic.website.arn
    events        = ["s3:ObjectCreated:*"]
    filter_suffix = "website.csv"
  }

  depends_on = [
    aws_lambda_permission.allow_bucket_config_seo,
    aws_lambda_permission.allow_bucket_config_sea,
    aws_lambda_permission.allow_invocation_from_sns_website_events,
    aws_lambda_permission.allow_invocation_from_sns_website_sessions,
    aws_sns_topic.website
  ]
}

## Permissions to invoke Lambda Functions
resource "aws_lambda_permission" "allow_bucket_config_seo" {
  statement_id  = "AllowExecutionFromS3Bucketseo"
  action        = "lambda:InvokeFunction"
  function_name = aws_lambda_function.seo.arn
  principal     = "s3.amazonaws.com"
  source_arn    = aws_s3_bucket.raw_bucket.arn
}

resource "aws_lambda_permission" "allow_bucket_config_sea" {
  statement_id  = "AllowExecutionFromS3Bucketsea"
  action        = "lambda:InvokeFunction"
  function_name = aws_lambda_function.sea.arn
  principal     = "s3.amazonaws.com"
  source_arn    = aws_s3_bucket.raw_bucket.arn
}

resource "aws_lambda_permission" "allow_invocation_from_sns_website_events" {
  statement_id  = "AllowExecutionFromSNSWebsiteEvents"
  action        = "lambda:InvokeFunction"
  function_name = aws_lambda_function.website_events.arn
  principal     = "sns.amazonaws.com"
  source_arn    = aws_sns_topic.website.arn
}

resource "aws_lambda_permission" "allow_invocation_from_sns_website_sessions" {
  statement_id  = "AllowExecutionFromSNSWebsiteSessions"
  action        = "lambda:InvokeFunction"
  function_name = aws_lambda_function.website_sessions.arn
  principal     = "sns.amazonaws.com"
  source_arn    = aws_sns_topic.website.arn
}

As you can see from the code above, certain code blocks follow the same logic and are repeated for the different file types. My goal was thus to find the optimal way of grouping and standardizing my code. This has two main advantages: (1) it improves code readability and thus maintainability, and (2) it facilitates the deployment of new use cases in the future. Here are the five tips that I would give to someone starting out with Terraform modules.

Tip 1: As your architecture grows, organize your code

Even with this relatively simple architecture, the code necessary to deploy it is already quite long. The first piece of advice I can give to those starting out with Terraform is to organize your code into functional files. Terraform does not care whether you have one .tf file or many: during deployment, it simply concatenates all of the .tf files in the folder and derives the deployment order from the dependency graph between resources, not from the order in which they appear in your code. Note: if you need one resource to be deployed before another and Terraform cannot infer that from references between them, add a depends_on clause to the resource declaration.
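
A minimal sketch of the distinction, reusing resources from the code above: the dependency on the bucket is inferred from the reference, while the dependency on the permission must be declared explicitly because nothing in the block references it.

resource "aws_s3_bucket_notification" "s3_bucket_notification" {
  bucket = aws_s3_bucket.raw_bucket.id # implicit dependency via the reference

  lambda_function {
    lambda_function_arn = aws_lambda_function.seo.arn
    events              = ["s3:ObjectCreated:*"]
    filter_suffix       = "seo.csv"
  }

  # Hidden dependency: S3 checks the invoke permission when the notification
  # is created, but nothing above references it, so Terraform must be told.
  depends_on = [aws_lambda_permission.allow_bucket_config_seo]
}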

Instead of having one massive main.tf file, it is therefore more readable to split the code by task or service. In this case, I would split my main.tf file above into 3 .tf files for the sources (seo.tf, sea.tf, website.tf) and 1 .tf file (global.tf) for the shared resources, as sketched after the list below. This allows me to easily duplicate my infrastructure for new use cases. Say that next week there is a need for an email data pipeline: all I would need to do is duplicate one of the source .tf files, update the code for the new use case and add the relevant configuration to the global.tf file. This has several advantages:

  • Saves time in development by making it easier to locate a specific section
  • Increases maintainability through a clear organisation of resources
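
Here is a sketch of that split (the assignment of resources to files is indicative):

|____ global.tf    # IAM role and policies, S3 bucket, bucket notification
|____ sea.tf       # sea zip archive, Lambda Function and invoke permission
|____ seo.tf       # seo zip archive, Lambda Function and invoke permission
|____ website.tf   # SNS topic, subscriptions, website Lambda Functions and permissions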

Tip 2: Group resources into logical modules

Beyond the organisation of the code, even a quick skim makes it obvious that several sections repeat the same code with minor modifications. This does not respect the basic DRY principle: Don't Repeat Yourself. As I was creating the various services of this data pipeline, not only was I copying and pasting very often, but for each new service several of the resources were tightly coupled. For example, to create a Lambda Function resource, I needed to zip my Python code every time. Thus, it is logical to create a module out of those two resources.

The code would then go from this:

## Zip code for Lambda Functions
data "archive_file" "lambda_code_seo" {
  type        = "zip"
  source_file = "source/seo/lambda_function.py"
  output_path = "source/seo.zip"
}

## Deploy Lambda Functions
resource "aws_lambda_function" "seo" {
  function_name = "seo"
  description   = "Lambda function to transform seo data"
  role          = aws_iam_role.iam_for_lambda_analytics.arn
  filename      = "source/seo.zip"
  handler       = "lambda_function.lambda_handler"
  memory_size   = 256
  timeout       = 30

  source_code_hash = data.archive_file.lambda_code_seo.output_base64sha256

  runtime = "python3.9"
}

To this:

module "lambda_s3" {
source = "./modules/lambda_function"

lambda-role-arn = aws_iam_role.iam_for_lambda_analytics.arn
function-name = each.value
description = "Lambda function to transform ${each.value} data"
lambda-code-path = "source/${each.value}/lambda_function.py"
lambda-zip-path = "source/${each.value}.zip"

}

With the source module defined as:

##################################################
############## Lambda Python Code ################
##################################################

data "archive_file" "lambda_code_zip" {
  type        = "zip"
  source_file = var.lambda-code-path
  output_path = var.lambda-zip-path
}

##################################################
########### Raw Data Lambda Functions ############
##################################################

# Generic Lambda Function deployed from the zipped code above
resource "aws_lambda_function" "lambda_dash" {
  function_name = var.function-name
  description   = var.description
  role          = var.lambda-role-arn
  filename      = var.lambda-zip-path
  handler       = "lambda_function.lambda_handler"
  layers        = ["arn:aws:lambda:eu-west-3:336392948345:layer:AWSDataWrangler-Python39:3"]
  memory_size   = var.memory_size
  timeout       = var.timeout

  source_code_hash = data.archive_file.lambda_code_zip.output_base64sha256

  runtime = "python3.9"
}

This example shows that, while a module requires more code than basic Terraform for a single resource, it results in significant time savings when the module is used frequently. It also helps reduce copy-and-paste errors caused by forgetting to update part of a duplicated code block.

Note the use of var.* in the module definition. When creating a module, a variables.tf file declares the variables that can be passed in when calling the module. When a variable has a default, that input is optional; when it has no default, it is required. Here is the variables.tf file for the Lambda Function module:

variable "lambda-role-arn" {
type = string
description = "IAM role used for lambda execution"
}

variable "function-name" {
type = string
description = "Name for lambda function"
}
variable "description" {
type = string
description = "Description of lambda function"
}

variable "lambda-code-path" {
type = string
description = "Location of code for lambda function"
}

variable "lambda-zip-path" {
type = string
description = "Location of zip file for lambda function"
}

variable "memory_size" {
type = number
default = 256
}

variable "timeout" {
type = number
default = 30
}
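
Because memory_size and timeout have defaults, they can be omitted or overridden at the call site. A quick sketch (the heavy_transform use case is hypothetical, added for illustration):

module "lambda_heavy" {
  source = "./modules/lambda_function"

  lambda-role-arn  = aws_iam_role.iam_for_lambda_analytics.arn
  function-name    = "heavy_transform" # hypothetical use case
  description      = "Lambda function for a heavier transformation"
  lambda-code-path = "source/heavy_transform/lambda_function.py"
  lambda-zip-path  = "source/heavy_transform.zip"

  # Optional inputs: override the defaults declared in variables.tf
  memory_size = 1024
  timeout     = 120
}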

Tip 3: Create modules for repetitive blocks as you go

When grouping resources into modules, it is not always simple to see where to draw the line. If we look at the architecture, it is obvious that triggers are tightly coupled with Lambda Functions: generally, there is one trigger per Lambda Function. I tried adding the S3 notification resource to the Lambda Function module, but found that there can only be one S3 notification resource block per bucket; if more are declared, each one overwrites the previously deployed version. Therefore, the notifications cannot be added to the lambda_function module alongside the Lambda instance and the archive data source. To avoid having to rethink or refactor the entire deployment of your architecture at the end of your project, my recommendation is to create modules as you go and to approach the process with trial and error.
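
One consequence is that the bucket-level notification needs the ARNs of Lambdas created inside modules. The code later in this article rebuilds those ARNs from the account ID; an alternative (a hedged sketch, not what the repo necessarily does) is to expose the ARN as a module output:

# modules/lambda_function/outputs.tf (hypothetical)
output "lambda_arn" {
  description = "ARN of the Lambda Function deployed by this module"
  value       = aws_lambda_function.lambda_dash.arn
}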

Modules can be created directly within the Terraform project that you are working on, which allows for easy trial and error. For example, if the basic Terraform code were completely refactored, the folder hierarchy would look something like this:

|____ .terraform
|____ modules
|     |____ lambda_function
|     |____ s3_sns
|____ source
|     |____ sea
|     |____ seo
|     |____ website_events
|     |____ website_sessions
|____ main.tf
|____ providers.tf
|____ README.md
|____ variables.tf

In this case, as shown above, the modules can simply be called in the .tf files using a relative path:

source = "./modules/lambda_function"

Modules can also reside in their own project. The advantage of keeping modules in a separate project is being able to version them for use in different environments. Versioning can be done with tags that are then specified at the end of the source URL (see How to create reusable infrastructure with Terraform modules). This lets developers test a new feature or resource of a module in a development environment while keeping the production pipeline on the latest stable tag. It also allows programmers to reuse those modules in other projects. For this article, though, we will stick to modules within the same project for the sake of simplicity.
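
As a sketch, a version-pinned module source would look like this (the repository URL and tag are hypothetical placeholders):

module "lambda_seo_versioned" {
  # The double slash separates the repository from the module subdirectory;
  # ref pins the module to a git tag so dev and prod can diverge safely.
  source = "git::https://github.com/your-org/terraform-modules.git//lambda_function?ref=v1.0.0"

  lambda-role-arn  = aws_iam_role.iam_for_lambda_analytics.arn
  function-name    = "seo"
  description      = "Lambda function to transform seo data"
  lambda-code-path = "source/seo/lambda_function.py"
  lambda-zip-path  = "source/seo.zip"
}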

Tip 4: Take advantage of for loops and conditionals

Continuing with the optimization of the basic Terraform code, loops and conditional statements can be used directly in resources or modules to further reduce the amount of code that needs to be written. In the module example above, I demonstrated how to use a module to create one service based on a Lambda Function and a zip file. Given that the pipelines for the different sources are very similar, we can loop over the use cases to deploy multiple Lambda services with the same module.

module "lambda_s3" {
source = "./modules/lambda_function"
for_each = toset(var.use_case_s3)

lambda-role-arn = aws_iam_role.iam_for_lambda_analytics.arn
function-name = each.value
description = "Lambda function to transform ${each.value} data"
lambda-code-path = "source/${each.value}/lambda_function.py"
lambda-zip-path = "source/${each.value}.zip"

}

The only differences between this code and the single-instance version are the for_each line and the each.value references: Terraform deploys one instance of the module per element of the collection, substituting each.value with the element. for_each can iterate over a set (i.e. a collection of unique values) or a map of strings (i.e. a dictionary). The elements to loop over are specified in the variables.tf file at the root of the Terraform deployment folder.

variable "use_case_s3" {
type = list(string)
default = ["seo", "sea"]
}

The above example creates multiple instances of the Lambda service with just one extra line of code, which clearly illustrates the economies of scale when using modules to create multiple services.
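
With for_each, each module instance is also addressable by its key, which is useful when another resource needs to reference one specific instance. A hedged sketch (it assumes the lambda_arn output sketched in Tip 3):

# Referencing one instance of a for_each module by its key
resource "aws_lambda_permission" "allow_bucket_seo" {
  statement_id  = "AllowExecutionFromS3Bucketseo"
  action        = "lambda:InvokeFunction"
  function_name = module.lambda_s3["seo"].lambda_arn
  principal     = "s3.amazonaws.com"
  source_arn    = aws_s3_bucket.raw_bucket.arn
}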

Loops can also be used for blocks within a resource. If we consider the triggers of the first two Lambda Functions, the code can be optimized with a dynamic block. Notice that you no longer reference each.value but rather use the name of the block: lambda_function.value.

resource "aws_s3_bucket_notification" "s3_bucket_notification" {
bucket = var.bucket_name

dynamic "lambda_function" {
for_each = toset(var.use_case_s3)

content {
lambda_function_arn = "arn:aws:lambda:eu-west-3:${var.aws_id}:function:${lambda_function.value}"
events = ["s3:ObjectCreated:*"]
filter_suffix = "${lambda_function.value}.csv"
}

}

}

This can also be extended to work for the SNS topic blocks.

resource "aws_s3_bucket_notification" "s3_bucket_notification" {
bucket = var.bucket_name

dynamic "lambda_function" {
for_each = toset(var.use_case_s3)

content {
lambda_function_arn = "arn:aws:lambda:eu-west-3:${var.aws_id}:function:${lambda_function.value}"
events = ["s3:ObjectCreated:*"]
filter_suffix = "${lambda_function.value}.csv"
}

}

dynamic "topic" {
for_each = toset(var.use_case_sns.*.sns_topic)

content {
topic_arn = "arn:aws:sns:eu-west-3:${var.aws_id}:${topic.value}"
events = ["s3:ObjectCreated:*"]
filter_suffix = "${topic.value}.csv"
}
}

}

In the case of the SNS topic, a specific key (i.e. sns_topic) is extracted from the variable to build the set for the topic loop. Since the website source has two different transformation functions, the S3 notification functionality cannot directly trigger both Lambda Functions. Instead, the S3 notification passes through an SNS topic, which fans the event out to the two Lambda Functions. To create this mapping, I used a dictionary-style variable (a list of objects) to specify the two use cases.

variable "use_case_sns" {
type = list(object({
lambda_func = string
sns_topic = string
}))
default = [
{
lambda_func = "website_sessions"
sns_topic = "website"
},
{
lambda_func = "website_events"
sns_topic = "website"
}
]
}

Two separate list variables could have been used, as in the case of the seo and sea sources; however, this could lead to errors if you have multiple topics and are not careful with the order of the list elements. This point is illustrated in the resource below, where we have to give the SNS topic permission to invoke the Lambda Function.

resource "aws_lambda_permission" "allow_invocation_from_sns" {
for_each = {for idx, uc in var.use_case_sns: uc.lambda_func => uc}

statement_id = "AllowExecutionFromSNS"
action = "lambda:InvokeFunction"
function_name = "${each.value.lambda_func}"
principal = "sns.amazonaws.com"
source_arn = "arn:aws:sns:::${each.value.sns_topic}"
}

As a result, using a single dictionary-style variable ensures consistency in the resource deployment and guarantees that the correct lambda_func/sns_topic pairs are always used together.
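
For completeness, the matching SNS subscription can be driven by the same variable. A hedged sketch (assumed to live in the same module as the permission above):

resource "aws_sns_topic_subscription" "lambda_subscription_to_sns" {
  for_each = { for idx, uc in var.use_case_sns : uc.lambda_func => uc }

  topic_arn = "arn:aws:sns:eu-west-3:${var.aws_id}:${each.value.sns_topic}"
  protocol  = "lambda"
  endpoint  = "arn:aws:lambda:eu-west-3:${var.aws_id}:function:${each.value.lambda_func}"
}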

Tip 5: Get creative but remain simple

Conditionals can also be used in modules, for example when you would like to pass a different value depending on the output of some other resource or on an input variable. In this simple example, a conditional statement was ultimately not required; nonetheless, I did use one initially for the SNS topic creation.
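
To illustrate, here is a minimal sketch of a conditional module input (var.environment is a hypothetical variable added for the example):

module "raw_bucket" {
  source = "./modules/s3"

  # Hypothetical: suffix the bucket name outside of production
  bucket_name  = var.environment == "prod" ? "mb-analytics" : "mb-analytics-${var.environment}"
  use_case_s3  = var.use_case_s3
  use_case_sns = var.use_case_sns
  aws_id       = var.aws_id
}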

When starting out with modules and loops in Terraform, I was worried that if I declared a resource in a module, at least one instance of each resource in the module would have to be deployed. I was therefore concerned about the versatility of my module in use cases where I would need only S3 notifications and no SNS topic, for example. For this reason, I initially introduced the logic below.

resource "aws_sns_topic" "dash-s3-event-topic" {
for_each = length(var.use_case_sns) == 0 ? toset([]) : toset(var.use_case_sns.*.sns_topic)

name = each.value
}

In essence, it states that if there are no elements in the use_case_sns variable, then (denoted by ?) use toset([]); otherwise (denoted by :), use the elements passed in the variable. In the end, I decided to simply specify a default value of an empty list, so that if the use_case_sns variable is not set when calling the module, an empty list is passed.

variable "use_case_sns" {
  type = list(object({
    lambda_func = string
    sns_topic   = string
  }))
  description = "List of objects, one per use case with an SNS notification. 'lambda_func' is the name of the Lambda Function; 'sns_topic' is the name of the SNS topic."
  default     = []
}

This allowed me to simplify the code to directly use the values from the variable passed to it. Code is already hard to read and make sense of several months after you write it; there is no need to complicate it further and make it completely incomprehensible. Just because a feature is available does not mean it is always the best solution.

resource "aws_sns_topic" "dash-s3-event-topic" {
for_each = toset(var.use_case_sns.*.sns_topic)

name = each.value
}

That being said, the conditional functionality in Terraform can be very useful (as explained by Yevgeniy Brikman in Terraform tips & tricks: loops, if-statements, and gotchas). My advice, and not only for code, is to define your need and start by solving it. Once you have solved it, see if you can optimize and simplify the solution, instead of getting bogged down trying to find the perfect solution on the first try.

Where we ended up

Using modules and loops, I was able to condense my deployment code to the following 100 lines:

##################################################
################## LAMBDA ROLE ###################
##################################################

## Role for Lambda Functions
resource "aws_iam_role" "iam_for_lambda_analytics" {
  name = "analytics-lambda-role"

  assume_role_policy = <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": "sts:AssumeRole",
      "Principal": {
        "Service": "lambda.amazonaws.com"
      },
      "Effect": "Allow",
      "Sid": ""
    }
  ]
}
EOF
}

resource "aws_iam_policy" "iam_for_lambda_s3" {
  name = "analytics-lambda-role-s3-policy"

  policy = <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": [
        "s3:GetObject*"
      ],
      "Effect": "Allow",
      "Resource": [
        "arn:aws:s3:::mb-analytics",
        "arn:aws:s3:::mb-analytics/*"
      ]
    }
  ]
}
EOF
}

# Needed for S3 access
resource "aws_iam_role_policy_attachment" "iam_for_lambda_s3" {
  role       = aws_iam_role.iam_for_lambda_analytics.name
  policy_arn = aws_iam_policy.iam_for_lambda_s3.arn
}

# Needed for logs
resource "aws_iam_role_policy_attachment" "iam_for_lambda_logs" {
  role       = aws_iam_role.iam_for_lambda_analytics.name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole"
}

##################################################
################ LAMBDA FUNCTIONS ################
##################################################

module "lambda_s3" {
  source   = "./modules/lambda_function"
  for_each = toset(var.use_case_s3)

  lambda-role-arn  = aws_iam_role.iam_for_lambda_analytics.arn
  function-name    = each.value
  description      = "Lambda function to transform ${each.value} data"
  lambda-code-path = "source/${each.value}/lambda_function.py"
  lambda-zip-path  = "source/${each.value}.zip"
}

module "lambda_sns" {
  source   = "./modules/lambda_function"
  for_each = toset(var.use_case_sns.*.lambda_func)

  lambda-role-arn  = aws_iam_role.iam_for_lambda_analytics.arn
  function-name    = each.value
  description      = "Lambda function to transform ${each.value} data"
  lambda-code-path = "source/${each.value}/lambda_function.py"
  lambda-zip-path  = "source/${each.value}.zip"
}

##################################################
######### S3 BUCKET + S3 and SNS NOTIF ###########
##################################################

module "raw_bucket" {
  source       = "./modules/s3"
  bucket_name  = var.bucket_name
  use_case_s3  = var.use_case_s3
  use_case_sns = var.use_case_sns
  aws_id       = var.aws_id
}

Notice how succinct that last module call is. Look through the GitHub repo and see if you understand how and why the module was built. The repo contains the entirety of the basic Terraform code from the beginning of this article (main_V0.tf) as well as the modules and the updated code (main.tf).
