Deploy your Azure Data Factory through Terraform

Gerrit Stapper
Published in NEW IT Engineering
Nov 29, 2021 · 5 min read
Photo by Akil Mazumder on Pexels

Just recently I was asked to set up a simple ETL pipeline in the Azure cloud to merge CSV files stored in a blob storage into a single SQL table, and to have that infrastructure set up via Terraform.

The final infrastructure looked similar to the image below. CSV files are uploaded into an Azure blob storage. A time-based trigger inside Data Factory then kicks off a pipeline that uses those files as data sources, merges them (they all share a common identifier) and finally loads the single, “big” table into an Azure SQL database table.

Schematic architecture of the Terraform infrastructure: CSV files are uploaded to an Azure Blob Storage, Azure Data Factory reads and merges their content, and writes the result into an Azure SQL table

While doing it, I learned about the following things:

  • Remote State for Terraform to better secure secrets
  • Data Factory’s Infrastructure as Code
  • SQL credentials obtained from Azure Key Vault
  • Terraform file upload

Let’s go over the different steps to set up our fully scripted ETL pipeline!

Remote State & Secrets

Halfway into the task, I got to the point where I wanted to set up the SQL database and noticed that this is where secrets come into play. I certainly didn’t want to store them in clear text within the Terraform files (Falk gives good reasons not to do that with any secret in his blog post) and started looking into how to avoid it.

Result: Even if you keep the clear text out of the Terraform files, it will still be present in the Terraform state, as the state always directly reflects the deployed infrastructure. That means as soon as you share the state within the team, you essentially share the secrets in clear text as well.

That’s when I read this great guide by Yevgeniy Brikman where he talked about storing your Terraform state remotely, e.g. in an Azure blob storage. That way you can still share the code, but the state file now lives inside Azure and can be access-restricted a lot better. I just manually created an Azure storage account in its own Azure resource group for this.

As a next step, this storage can now be referenced from within the “real” Terraform file:

Terraform code referencing the Azure storage account for the remote state
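
A minimal sketch of what that configuration might look like; the resource group, storage account and container names are placeholders for whatever you created manually:

terraform {
  backend "azurerm" {
    resource_group_name  = "terraform-state-rg"  # placeholder: resource group created by hand
    storage_account_name = "tfstatestorage"      # placeholder: storage account created by hand
    container_name       = "tfstate"             # container inside the storage account for the state
    key                  = "terraform.tfstate"   # name of the state blob
  }
}

provider "azurerm" {
  features {}
}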

The place where Terraform stores the state file is called a backend. The Azure backend first of all requires the resource group and the storage account we just created. Additionally, we need to give it the name of the container within the storage account that the state file should be stored in. The key is the last attribute and sets the name of the state blob, replacing the terraform.tfstate file that Terraform would otherwise create locally by default.

Now, whenever we execute a Terraform command, the state change is synced to the cloud instead of being stored on our machines.

Data Factory Infrastructure as Code

The next thing I was worried about was the fine-grained control of Data Factory components through Terraform. It’s important to say that I hadn’t worked with Data Factory before. As soon as you do, you will see that there is no reason to be worried (as long as you can use either GitHub or Azure DevOps), because Data Factory will sync all its infrastructure with a Git repository. Whenever you update your pipeline, Data Factory will sync the changes.

On top of that, the Terraform resource for Data Factory lets you reference a GitHub repository to bootstrap the Factory from — how convenient:

Terraform resource for the Data Factory including the GitHub repo reference
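
A sketch of such a resource, assuming a resource group defined elsewhere in the configuration; the factory, account and repository names are placeholders:

resource "azurerm_data_factory" "etl" {
  name                = "adf-csv-merge"                   # placeholder
  location            = azurerm_resource_group.etl.location
  resource_group_name = azurerm_resource_group.etl.name

  github_configuration {
    account_name    = "my-github-account"                 # placeholder
    git_url         = "https://github.com"
    repository_name = "data-factory-pipelines"            # placeholder
    branch_name     = "main"                              # the collaboration branch, not the publish branch (see update below)
    root_folder     = "/"
  }
}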

SQL Credentials from Azure Key Vault

Jumping back to the situation that I outlined above when I introduced the remote state: again, Yevgeniy Brikman already gave a great hint in his post when he talked about secret stores. So the idea was simple: create the credentials inside Azure Key Vault and reference them from within the Terraform files so that the SQL database can be created.

As a first step, we need a Key Vault that already exists when we execute the Terraform commands. I’ve created mine manually in the same resource group as the storage account for the remote state. We can then reference that Key Vault via a data source, which lets us use already existing infrastructure inside Terraform files:

UPDATE: You can also create the Key Vault through Terraform and let it generate passwords for you.
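
If you go down that route, a rough sketch could pair the random provider with a Terraform-managed secret (everything here is hypothetical and not part of the setup described in this post):

# requires the hashicorp/random provider
resource "random_password" "sql_password" {
  length  = 32
  special = true
}

resource "azurerm_key_vault_secret" "sql_password" {
  name         = "sql-password"                       # placeholder secret name
  value        = random_password.sql_password.result
  key_vault_id = azurerm_key_vault.etl.id             # assumes a Key Vault resource defined elsewhere
}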

Terraform data source to reference an already existing Azure Key Vault
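
Roughly, such a data source could look like this, with placeholder names for the manually created Key Vault and its resource group:

data "azurerm_key_vault" "etl" {
  name                = "etl-credentials-vault"   # placeholder: Key Vault created by hand
  resource_group_name = "terraform-state-rg"      # placeholder: same resource group as the remote state
}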

Now we can use that Key Vault to access the secrets within and make them available as data sources as well:

Terraform data sources for the SQL username and password secrets stored in Azure Key Vault
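
A sketch with hypothetical secret names, building on the Key Vault data source from above:

data "azurerm_key_vault_secret" "sql_username" {
  name         = "sql-username"                  # placeholder secret name
  key_vault_id = data.azurerm_key_vault.etl.id
}

data "azurerm_key_vault_secret" "sql_password" {
  name         = "sql-password"                  # placeholder secret name
  key_vault_id = data.azurerm_key_vault.etl.id
}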

Finally, the values can be referenced to create the SQL database:

Terraform resource for the Azure SQL database using the credentials from Azure Key Vault
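
The azurerm provider splits this into a server and a database resource; a sketch using the secret values might look like this (server, database and SKU names are placeholders, and the resource group is assumed to be defined elsewhere):

resource "azurerm_mssql_server" "etl" {
  name                = "etl-sql-server"          # placeholder
  resource_group_name = azurerm_resource_group.etl.name
  location            = azurerm_resource_group.etl.location
  version             = "12.0"

  # credentials come from Key Vault, never from the code
  administrator_login          = data.azurerm_key_vault_secret.sql_username.value
  administrator_login_password = data.azurerm_key_vault_secret.sql_password.value
}

resource "azurerm_mssql_database" "etl" {
  name      = "etl-database"                      # placeholder
  server_id = azurerm_mssql_server.etl.id
  sku_name  = "Basic"                             # placeholder SKU
}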

The credentials never appear in clear text within the Terraform files, and while they still end up in the state file, that file is managed via the remote, access-restricted storage rather than being shared alongside the code!

Terraform file upload

The last thing I didn’t know was possible is uploading the CSV files as blobs into Azure’s blob storage. You can simply set the source attribute on the resource and reference the file you want to upload:

Terraform resource for a blob inside Azure blob storage
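
A sketch of such a blob resource, assuming a storage account and container defined elsewhere; the blob name and local file path are placeholders:

resource "azurerm_storage_blob" "input_csv" {
  name                   = "input/customers.csv"                    # placeholder blob path
  storage_account_name   = azurerm_storage_account.etl.name
  storage_container_name = azurerm_storage_container.csv.name
  type                   = "Block"
  source                 = "${path.module}/data/customers.csv"      # local file to upload
}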

This allows you to set up the entire blob storage without any manual interaction, as all the folders, files and their content can be managed via Terraform.

Conclusion

Despite my worries that certain things would just not be possible with Terraform, you can rather easily set up an entire Data Factory, including CSV files as a data source and a SQL database as a data sink! And all that without exposing any secrets or credentials in your source code.

I was actually surprised how quickly I was able to leverage the remote backend for Terraform. Big thanks again to Yevgeniy!

On top of this, you will still need to do a few things: Make sure your IAM is set up correctly to ensure that the Terraform state is actually as secure as possible. Also, introduce secret/key rotation in the Key Vault to swap out the credentials every now and then. Finally, if the files are sensitive, don’t store them inside a public Git repository either.

You can find the code of the Data Factory here and the Terraform code for the setup here.

UPDATE March 10th, 2023: Fixed the branch references when creating the Data Factory instance with a GitHub configuration. The branch required here is the “collaboration branch”, not the “publish branch”.
