Using Blocks in Terraform as a Data Engineer

Object types, Data sources, and Modules in Terraform.

Isaac Omolayo
CodeX
7 min read · May 5, 2024


In this post, we are going to look at the fundamental building blocks of Terraform for data engineering. Building a well-documented, maintainable, and reproducible data system means applying the right solution to the right problem.

When provisioning multiple resources from the same resource block on AWS, we can leverage list or object data types, or use modules, to provision these cloud resources easily. We can use modules to define a template for the properties of the storage or data engine we want to deploy. We can also use data sources to reference resources that already exist in our cloud account and add features or adjust the configuration as our data engineering solution requires.

But let’s wait a minute: using the right tools and resources in data pipeline deployment matters for documentation, maintainability, reproducibility, and so on. Consider a situation where we want to set up multiple Lambda functions on AWS, or create Glue jobs that can easily be replicated to other AWS regions.

With these skills and tools, handling production incidents and coming back online after an issue becomes much easier. We are going to look at object types, data sources, and modules in Terraform. Having a solid understanding of how Terraform object types work is important when we design and implement data sources and modules. To begin, let’s look at type constructors in Terraform.

Constructors in Terraform

One of the powerful features of Terraform’s type system is type constructors. Type constructors allow us to specify complex types, such as the collections list(number), set(string), and map(number), as well as structural types built with object({...}). The object constructor in particular lets us declare multiple named attributes, each with its own type, inside a single value. This works much like a JSON document: we include the key as a string and then declare the value as whatever data type we are interested in.

# example JSON document
{
  "id": "job2",
  "events": ["get"],
  "sns_arn": "arn:aws:sns:us-east-1:xxxxxxxxxxxxx:sns_sample_topic",
  "filter": {
    "prefix": "path/to/job2/"
  }
}
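
Before we get to objects, here is a quick sketch of the simpler collection constructors mentioned above; the variable names and values are our own illustrations:

### ----------------- variables.tf ------------------

# Illustrative collection types built with type constructors.
variable "retry_counts" {
  type    = list(number)
  default = [1, 3, 5]
}

variable "data_regions" {
  type    = set(string)
  default = ["us-east-1", "eu-west-1"]
}

variable "job_timeouts" {
  type = map(number)
  default = {
    extract = 300
    load    = 600
  }
}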

Object data type in Terraform

We can make our EC2 instance setup more dynamic, so that we can use different values for the AMI and the instance type, by using the object data type. Objects are closely related to maps in Terraform, but while every element of a map must share a single type, each attribute of an object can have its own type. The object data type therefore allows us to define complex data structures that contain multiple attributes with their own data types. For example, the code below shows a sample ec2_instance object variable that takes values for an EC2 AMI, an EC2 instance type, and a list of availability zones.

### ----------------- variables.tf ------------------

variable "ec2_instance" {
  type = object({
    ami                = string
    instance_type      = string
    availability_zones = list(string)
  })

  default = {
    ami                = "ami-0889a44b331db0194"
    instance_type      = "t2.micro"
    availability_zones = ["us-east-1a", "us-east-1b"]
  }
}

Object data type declaration in Terraform
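
Since the whole configuration lives in one variable, we can override the default per environment without touching the resource code. Below is a minimal sketch of a terraform.tfvars override; the AMI id and zone values are placeholders of our own choosing:

### ----------------- terraform.tfvars ------------------

ec2_instance = {
  ami                = "ami-0abcdef1234567890" # placeholder AMI id
  instance_type      = "t3.micro"
  availability_zones = ["us-east-1c"]
}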

Now we can pass the values of the variable into our EC2 instance resource block declaration. We will do this as shown below:

### ----------------- main.tf ------------------

resource "aws_instance" "list_instance" {
  ami           = var.ec2_instance.ami
  instance_type = var.ec2_instance.instance_type
  count         = length(var.ec2_instance.availability_zones)

  # Place one instance in each availability zone from the variable.
  availability_zone = var.ec2_instance.availability_zones[count.index]
}

The Terraform code will create EC2 instances in the us-east-1a and us-east-1b availability zones with the same AMI and instance type values. Interestingly, we can define even more complex data structures with objects, as shown below: a list of objects whose attributes have different types, which we can iterate over to create multiple resources at once.

### ----------------- variables.tf ------------------

variable "jobs_notification" {
  type = list(object({
    id      = string
    events  = list(string)
    sns_arn = string
    filter = object({
      prefix = string
    })
  }))
}

We can then pass the value of the variable by using the terraform.tfvars file as shown below.

### -------------------- terraform.tfvars -------------------------------

jobs_notification = [
  {
    id      = "job1"
    events  = ["create", "update"]
    sns_arn = "arn:aws:sns:us-east-1:xxxxxxxxxxxxxx:sample_sns_topic"
    filter = {
      prefix = "path/to/job1/"
    }
  },
  {
    id      = "job2"
    events  = ["delete"]
    sns_arn = "arn:aws:sns:us-east-1:xxxxxxxxxxxxxx:sample_sns_topic"
    filter = {
      prefix = "path/to/job2/"
    }
  }
]

The usage of the jobs_notification variable is shown below. In this Terraform code, one job notification subscription will be created on AWS for each element defined in the terraform.tfvars file.

### ----------------- main.tf ------------------

resource "aws_sns_topic_subscription" "job_notification" {
  for_each = { for job in var.jobs_notification : job.id => job }

  topic_arn = each.value.sns_arn
  protocol  = "sqs"
  endpoint  = "arn:aws:sqs:us-east-1:xxxxxxx:${each.key}_queue"

  # SNS filter policies expect attribute values as arrays.
  filter_policy = jsonencode({
    "prefix" = [each.value.filter.prefix]
  })
}
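
Because for_each keys the resources by job id, each subscription gets a stable address such as aws_sns_topic_subscription.job_notification["job1"]. As an illustrative addition, an output block can collect the subscription ARNs per job:

### ----------------- outputs.tf ------------------

# Illustrative output: map of job id to subscription ARN.
output "job_subscription_arns" {
  value = { for id, sub in aws_sns_topic_subscription.job_notification : id => sub.arn }
}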

Resource blocks in Terraform

Resources are defined through resource blocks; this is our way of telling Terraform, “Hey Terraform, please create this resource for us”. In the code below, we ask Terraform to create a text file with the content ‘We love Data Engineers’ in our current directory, and to create an AWS S3 bucket, in each case using a resource type and a local name.

###-----------------variables.tf---------------------------

variable "bucket_name" {
  description = "The name of the S3 bucket"
  type        = string
  default     = "bronze-data-bucket" # S3 bucket names cannot contain underscores
}

After the variable declaration, we are going to define our resource block as shown below to create the text file and the S3 bucket accordingly.

### -----------------main.tf------------------

resource "local_file" "text_file" {
  filename = "${path.module}/data_engineers.txt" # write into the current module directory
  content  = "We love Data Engineers!!!"
}

resource "aws_s3_bucket" "bronze_data" {
  bucket = var.bucket_name

  tags = {
    Name        = "Data Lake Bronze Bucket"
    Environment = "Dev"
  }
}

In the above Terraform code, we first specified the resource type after the resource keyword: “local_file” for the text file resource and “aws_s3_bucket” for the AWS S3 bucket resource. We then provided a local name for each resource: text_file for the “local_file” resource and bronze_data for the “aws_s3_bucket” resource.

Now we can reference these resources from other resource blocks in our configuration using local_file.text_file.filename and aws_s3_bucket.bronze_data.bucket.

### -----------------main.tf------------------

# aws_s3_bucket_object is deprecated in AWS provider v4+; aws_s3_object is the current name.
resource "aws_s3_object" "object" {
  bucket = aws_s3_bucket.bronze_data.bucket
  key    = "my_file.txt"
  source = local_file.text_file.filename
  etag   = filemd5(local_file.text_file.filename) # note: the file must exist at plan time
}

Data sources in Terraform

Data sources in Terraform are used to get information about resources that are external to Terraform and to use that information when setting up our own Terraform resources. By external, we mean resources that were created manually or outside the current Terraform configuration, and that we would like to use in our present Terraform development.

Data sources also allow Terraform to use resources defined by a separate Terraform configuration, or values modified by functions. An example usage of data sources is getting the ‘id’ and the ‘arn’ values of an already existing S3 bucket, here called ‘dvdrental-play’, that contains the Postgres sample data in the development account.

### -----------------main.tf------------------

data "aws_s3_bucket" "external_s3_bucket" {
  bucket = "dvdrental-play"
}

output "external_s3_bucket_id" {
  value = data.aws_s3_bucket.external_s3_bucket.id
}

output "external_s3_bucket_arn" {
  value = data.aws_s3_bucket.external_s3_bucket.arn
}

In the code widget above, we used a data source block to get the S3 bucket properties, then output the bucket ARN and the bucket id. The screenshot below shows the result of our Terraform code when we run terraform plan.

Response for “terraform plan” command in the terminal.
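
Beyond outputs, data source attributes can feed directly into new resources. As a sketch, assuming we want Terraform to manage versioning on that externally created bucket, we could write:

### ----------------- main.tf ------------------

# Illustrative: enable versioning on the externally created bucket.
resource "aws_s3_bucket_versioning" "external_bucket_versioning" {
  bucket = data.aws_s3_bucket.external_s3_bucket.id

  versioning_configuration {
    status = "Enabled"
  }
}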

Modules in Terraform

The secret to creating reusable Terraform code is using modules to manage our deployment. A Terraform module allows us to group resources and reuse that group in multiple places throughout our code, instead of having the same code copied and pasted in the staging and the production environments. With this, we can reuse the code from a single module across multiple environments.

To use a Terraform module, we only need to specify a module block with the module source and all the necessary variable values. In our case, instead of copying and pasting our S3 bucket resources over and over again to create data buckets, we can put the S3 bucket resource block into a modules folder and reference that module to create as many S3 buckets as our development requires.

We will define our module call like the one below. Since this is the first time we are using the module, we must initialize it before Terraform can use it: running ‘terraform init’ downloads the module and all its dependencies.

###-----------------main.tf------------------

module "s3_bucket" {
  source      = "./modules/s3"
  bucket_name = "api-data-bucket" # must match the variable name the module declares
}

The module’s resources are defined in ./modules/s3/s3.tf as shown below.

###----------------- ./modules/s3/s3.tf ------------------

resource "aws_s3_bucket" "bronze_data" {
  bucket = var.bucket_name

  tags = {
    Name        = "Data Lake Bronze Bucket"
    Environment = "Dev"
  }
}
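
For the module call to work, the module must also declare the bucket_name variable it consumes. A minimal sketch of the module’s variables file (the file name ./modules/s3/variables.tf is our own choice):

###----------------- ./modules/s3/variables.tf ------------------

variable "bucket_name" {
  description = "Name of the S3 bucket the module creates"
  type        = string
}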

We have called our S3 module in main.tf as shown above. We also provided our bucket name in the module block, since the module requires us to pass the name in. We can then run aws s3 ls && terraform init to list the available buckets and initialize the necessary module, in our case the s3 module.

Terraform response to the initialization command “aws s3 ls && terraform init”.

Now with this, we can continue to reference the same module from all of our Terraform applications.
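
For instance, a staging and a production bucket can come from the same module with different names; the bucket names below are illustrative:

###----------------- main.tf ------------------

module "s3_bucket_staging" {
  source      = "./modules/s3"
  bucket_name = "api-data-bucket-staging" # illustrative name
}

module "s3_bucket_prod" {
  source      = "./modules/s3"
  bucket_name = "api-data-bucket-prod" # illustrative name
}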

Conclusion

The takeaway from all of this is that development in Terraform becomes worthwhile and interesting when we organise our code and follow the rules of clean code. Using the right Terraform tools for the right solution is important for a performant and stress-free development experience: that is why we should use resource blocks, data sources, and modules where they are most necessary and useful.
