Terraform for a Data Engineer

Mariusz Kujawski
Oct 2, 2023

Building a robust data platform entails integrating various components such as tools, data processing logic, methodologies, architecture, and infrastructure. When it comes to infrastructure, the options range from physical servers for on-premises solutions to cloud resources. This post specifically focuses on harnessing cloud infrastructure within Microsoft Azure.

In a cloud environment, you can choose between virtual machines (Infrastructure as a Service, IaaS) and serverless services like Azure SQL and Databricks. From my perspective, serverless services not only provide flexibility but also simplify configuration and maintenance. In this article, I’ll explore the power of Terraform, a robust tool for Infrastructure as Code (IaC) that enables effortless setup of a cloud data platform.

Terraform empowers engineers to define infrastructure using code, a significant advantage compared to configuring resources through the Azure Portal. By storing the entire configuration in a codebase, which can be managed in a Git repository, you can replicate configurations across different environments and seamlessly integrate them into DevOps pipelines. This approach aligns perfectly with modern Continuous Integration/Continuous Deployment (CI/CD) patterns.

This post will focus on building a foundational data platform incorporating essential Azure services: Storage Account, Key Vault, Databricks, Azure Synapse, and Azure Data Factory (ADF). Through Terraform, I’ll demonstrate how tasks such as saving secrets in Key Vault, deploying Notebooks and ADF pipelines, mounting storage accounts in a Databricks cluster, and managing security can be automated seamlessly.

Engineer terraforming Mars

Terraform

Terraform is a tool for automating and managing cloud infrastructure. It allows you to define your infrastructure in code and then create and maintain those resources on various cloud platforms like AWS, Azure, and Google Cloud. This approach makes infrastructure provisioning and management efficient and repeatable.

The most important components of a Terraform script are:

Providers

Providers allow Terraform to interact with cloud providers, SaaS providers, and other APIs. Additionally, all Terraform configurations must declare which providers they require so that Terraform can install and use them.

terraform {
  required_providers {
    azurerm = {
      source  = "hashicorp/azurerm"
      version = "3.37.0"
    }
  }
}

provider "azurerm" {
  features {}
}

Data sources

Data sources allow Terraform to use information defined outside of Terraform. In the example below, they read the current Azure client configuration (such as tenant_id and client_id) and a service principal.

data "azurerm_client_config" "current" {
}

data "azuread_service_principal" "this" {
display_name = "sp-mk-test"
}

Resource

Resources are the most important element in the Terraform language. Each resource block describes one or more infrastructure objects, such as a storage account, Synapse Workspace, or higher-level components such as DNS records.

# Azure Data Factory
resource "azurerm_data_factory" "adf_transform" {
  resource_group_name = var.resource_group
  location            = var.region
  name                = "mk-${var.project}-adf01"

  identity {
    type = "SystemAssigned"
  }
}

Variables

In Terraform, we have several types of variables that help us manage and customize our code. When you declare variables in the root module of your configuration, you can set their values using CLI options and environment variables. When you declare them in child modules, the calling module should pass values in the module block. We can compare Terraform modules to function definitions:

  • Input variables are similar to function arguments.
  • Output values are akin to function return values.
  • Local values are comparable to a function’s temporary local variables.

As you can see in the example below, we can use variables to specify the resource group and Azure region for our infrastructure.

variable "region" {
type = string
}

variable "resource_group" {
type = string
}
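
To complete the function analogy, here is a brief sketch of an output value and a local value; the names are illustrative, not taken from the repository.

locals {
  # Local value: a naming prefix reused across resources
  prefix = "mk-${var.project}"
}

output "storage_account_name" {
  # Output value: exposed to the calling module and printed after apply
  value = azurerm_storage_account.datalake.name
}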

Modules

Modules are containers for multiple resources that are used together. There are two types of modules in Terraform:

  • Root Module: Every Terraform script has at least one module called the root module. It consists of resources defined in the .tf files.
  • Child Modules: These can be compared to functions that contain resource definitions along with input and output variables. They can be called multiple times within a root module, as sketched below.
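
Here is a minimal sketch of calling a child module; the module path, inputs, and output name are illustrative, not taken from the repository.

# Call a child module stored under ./modules/storage (illustrative path)
module "storage" {
  source = "./modules/storage"

  # Input variables passed by the calling (root) module
  resource_group = var.resource_group
  region         = var.region
  project        = var.project
}

# Read an output value exposed by the child module
output "datalake_name" {
  value = module.storage.datalake_name
}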

Building a Data Platform

Terraform workflow

Terraform Installation and Configuration

In this section, I’ll guide you through creating the essential components of a data platform with Terraform. To get started, download Terraform from the official Terraform website. Once downloaded, save the files in a dedicated ‘terraform’ folder and add this folder’s location to your PATH environment variable for quick and easy access. Additionally, you need to install the Azure CLI to authenticate to your Azure subscription using az login.

Within your Azure subscription, it’s crucial to create a service principal that Databricks will use, along with a designated resource group. Once you’ve completed these initial steps, clone my repository and adjust the ‘test.tfvar’ file to match your specific requirements.

Following this, initialize Terraform within your project directory and, finally, execute the Terraform script. These steps lay the foundation for your data platform.

# test.tfvar

resource_group = "mk-test"
region         = "West Europe"

client_id = "xxxx" # Service principal client id
object_id = "xxx"  # Service principal object id
secrete   = "xxx"  # Service principal secret
project   = "az"   # Project name

acr_enable = false

Authenticate in Azure subscription:
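
az login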

Initialize Terraform:
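
terraform init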

Terraform apply:
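
terraform apply -var-file="test.tfvar"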

Before applying, you can run terraform plan to review the resources that will be created. Alternatively, terraform apply previews the same changes and asks you to confirm them by typing 'yes'.
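
For example, to preview the changes with the same variable file:

terraform plan -var-file="test.tfvar"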

After Terraform has finished, it prints a summary of the resources it created. You can verify the results in the Azure Portal.

Terraform Script Overview

Below, you’ll find an example of a storage account resource in our Terraform script. We specify the resource type as azurerm_storage_account and name it datalake. Inside the block, you'll see the storage account configuration details. Notably, I use variables to pass essential parameters such as the resource group name and Azure region, and the account name is constructed from the project variable.

resource "azurerm_storage_account" "datalake" {
name = "mk${var.project}sa001"

resource_group_name = var.resource_group
location = var.region
account_kind = "StorageV2"
account_tier = "Standard"
account_replication_type = "LRS"
access_tier = "Hot"
enable_https_traffic_only = true
is_hns_enabled = true

network_rules {
default_action = "Allow"
bypass = ["Metrics"]
}

identity {
type = "SystemAssigned"
}

}

Within the same main.tf file, you'll find the definitions that create the storage account containers and assign the Storage Blob Data Contributor role to the service principal. In this example, you can see the for_each meta-argument, which creates four containers from a single Terraform resource block.

resource "azurerm_storage_container" "container" {
for_each = toset( ["landing","bronze", "silver", "gold"] )
name = each.key
storage_account_name = azurerm_storage_account.datalake.name

}

resource "azurerm_role_assignment" "data_contributor_role" {
scope = azurerm_storage_account.datalake.id
role_definition_name = "Storage Blob Data Contributor"
principal_id = data.azuread_service_principal.this.object_id
}

With Terraform, creating an Azure Key Vault, setting access policies for users, and storing sensitive values become seamless tasks. In this example, you’ll see how to leverage data sources to retrieve essential parameters such as the tenant ID and the service principal object ID. With these capabilities, you can efficiently manage Azure Key Vault configurations and keep secret management secure.

# Key Vault
resource "azurerm_key_vault" "kv" {
  name                = "mk-${var.project}-kv002"
  resource_group_name = var.resource_group
  location            = var.region

  sku_name  = "standard"
  tenant_id = data.azurerm_client_config.current.tenant_id
}

# Key Vault access policy
resource "azurerm_key_vault_access_policy" "sp" {
  key_vault_id       = azurerm_key_vault.kv.id
  tenant_id          = data.azurerm_client_config.current.tenant_id
  object_id          = data.azuread_service_principal.this.object_id
  secret_permissions = ["Get", "List", "Set", "Delete"]
  depends_on         = [azurerm_key_vault.kv]
}

# Key Vault secret
resource "azurerm_key_vault_secret" "secrete_id" {
  name         = "secreteid"
  value        = var.secrete
  key_vault_id = azurerm_key_vault.kv.id
  depends_on   = [azurerm_key_vault_access_policy.user]
}
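
The secret above depends on an access policy named user that isn’t shown in this excerpt. A minimal sketch, assuming it grants the identity running Terraform permission to manage secrets, could look like this:

resource "azurerm_key_vault_access_policy" "user" {
  key_vault_id = azurerm_key_vault.kv.id
  tenant_id    = data.azurerm_client_config.current.tenant_id
  # Object ID of the identity running Terraform (user or service principal)
  object_id    = data.azurerm_client_config.current.object_id

  secret_permissions = ["Get", "List", "Set", "Delete", "Purge"]
}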

Terraform provides robust automation capabilities for Databricks deployment. With Terraform, you can effortlessly mount a storage account, create clusters, and deploy notebooks, streamlining the configuration and management processes for your Databricks environment.

# Notebook deploy
resource "databricks_notebook" "this" {
  path     = "/Shared/test/test"
  language = "PYTHON"
  source   = "./test.py"
}

# Mount
resource "databricks_mount" "this" {
  name       = "landing"
  cluster_id = databricks_cluster.this.id
  uri        = "abfss://${azurerm_storage_container.container["landing"].name}@${azurerm_storage_account.datalake.name}.dfs.core.windows.net"

  extra_configs = {
    "fs.azure.account.auth.type" : "OAuth",
    "fs.azure.account.oauth.provider.type" : "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id" : var.client_id,
    "fs.azure.account.oauth2.client.secret" : var.secrete, # in practice, read this from a Databricks secret scope
    "fs.azure.account.oauth2.client.endpoint" : "https://login.microsoftonline.com/${data.azurerm_client_config.current.tenant_id}/oauth2/token",
    "fs.azure.createRemoteFileSystemDuringInitialization" : "false",
  }

  depends_on = [
    databricks_cluster.this,
    azurerm_role_assignment.data_contributor_role,
    azurerm_storage_container.container
  ]
}

The provided code demonstrates references to other resources, such as the storage account and container: by using the resource type and name, we can access properties like the container name or the cluster ID. Notably, it also uses the depends_on argument, which defines explicit dependencies so that resources are created in the right order.

In this setup, Terraform creates the mount only after the cluster, the containers, and the role assignment exist, ensuring a seamless and synchronized deployment.
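
The mount also references a cluster (databricks_cluster.this) that isn’t shown in this excerpt. A minimal sketch of such a cluster definition, with an assumed Databricks runtime version and node type, could look like this:

# Minimal illustrative cluster; runtime version and node type are assumptions
resource "databricks_cluster" "this" {
  cluster_name            = "mk-${var.project}-cluster"
  spark_version           = "13.3.x-scala2.12"
  node_type_id            = "Standard_DS3_v2"
  num_workers             = 1
  autotermination_minutes = 30
}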

Deploying Azure Data Factory pipelines traditionally involves Azure DevOps, but Terraform offers an alternative. This approach is flexible, although it requires pasting JSON exported from Azure Data Factory Studio into a file. With Terraform, you can deploy not only pipelines but also Linked Services, making it a versatile way to manage your Azure Data Factory configuration.

resource "azurerm_data_factory_pipeline" "databricks_pipe" {
name = "databricks_pipeline"

data_factory_id = azurerm_data_factory.adf_transform.id
description = "Databricks"
depends_on = [
databricks_cluster.this,
azurerm_data_factory_linked_service_azure_databricks.at_linked
]
activities_json = <<EOF_JSON
[
{
"name": "Transform_Notebook",
"type": "DatabricksNotebook",
"dependsOn": [],
"policy": {
"timeout": "0.12:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false,
"secureInput": false
},
"userProperties": [],
"typeProperties": {
"notebookPath": "/Shared/test/test"
},
"linkedServiceName": {
"referenceName": "ADBLinkedServiceViaAccessToken",
"type": "LinkedServiceReference"
}
}
]

EOF_JSON

}
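
The pipeline also depends on a Databricks linked service (at_linked) that isn’t shown in this excerpt. Here is a rough sketch, assuming a workspace resource named azurerm_databricks_workspace.this and a hypothetical databricks_token variable holding a personal access token:

resource "azurerm_data_factory_linked_service_azure_databricks" "at_linked" {
  name            = "ADBLinkedServiceViaAccessToken"
  data_factory_id = azurerm_data_factory.adf_transform.id
  description     = "Databricks linked service authenticated with an access token"

  # Assumed workspace resource name; adjust to match your configuration
  adb_domain = "https://${azurerm_databricks_workspace.this.workspace_url}"

  # Hypothetical variable holding a Databricks personal access token
  access_token = var.databricks_token

  existing_cluster_id = databricks_cluster.this.id
}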

The final resource I’ll introduce is Azure Synapse, which can be created with the script below. The code demonstrates how to generate a password, store it securely in Azure Key Vault, and pass it as a parameter in the Azure Synapse configuration. This improves security by avoiding plain-text passwords in variables and the need to share them between people.

# password generator
resource "random_password" "sql_administrator_login_password" {
  length           = 16
  special          = true
  override_special = "!@#$%^"
  min_lower        = 2
  min_upper        = 2
  min_numeric      = 2
  min_special      = 1
}

# save to key vault
resource "azurerm_key_vault_secret" "sql_administrator_login" {
  name            = "synapseSQLpass"
  value           = random_password.sql_administrator_login_password.result
  key_vault_id    = azurerm_key_vault.kv.id
  content_type    = "string"
  expiration_date = "2111-12-31T00:00:00Z"

  depends_on = [
    azurerm_key_vault.kv,
    azurerm_key_vault_access_policy.user
  ]
}

# Azure Synapse
resource "azurerm_synapse_workspace" "this" {
  name                                 = "mk${var.project}syn001"
  resource_group_name                  = var.resource_group
  location                             = var.region
  storage_data_lake_gen2_filesystem_id = azurerm_storage_data_lake_gen2_filesystem.sym.id
  sql_administrator_login              = "mariusz"
  sql_administrator_login_password     = azurerm_key_vault_secret.sql_administrator_login.value

  identity {
    type = "SystemAssigned"
  }

  depends_on = [
    azurerm_storage_account.datalake,
    azurerm_key_vault_secret.sql_administrator_login
  ]
}

Summary

In this post, I presented how we can use Terraform to deploy the components of a data platform. These basic examples show that working with Terraform isn’t complex, yet it can improve your development automation and CI/CD processes. You can run the script from a local machine or from a DevOps pipeline and build more sophisticated pipelines that create infrastructure, test it, and promote changes to production without manual interaction. As I underlined, the presented examples are basic and fit a small platform; an enterprise environment will additionally require network configuration that isolates it from unwanted access from outside your network (you can read how to secure the network in my other posts).

Feel free to contact me if you have questions, or if you’re interested in implementing a data platform in your organization.

If you found this article interesting, please consider liking it on LinkedIn and clicking the “clap” button.
