Deploy your Azure Data Factory through Terraform
Just recently I was asked to set up a simple ETL pipeline in the Azure cloud that merges CSV files stored in a blob storage into a single SQL table, with the entire infrastructure set up via Terraform.
The final infrastructure looked similar to the image below. CSV files are uploaded into an Azure blob storage. A time-based trigger inside Data Factory then uses those files as data sources, merges them (they all share a common identifier) and finally loads the single, “big” table into an Azure SQL database table.
While doing it, I learned about the following things:
- Remote State for Terraform to better secure secrets
- Data Factory’s Infrastructure as Code
- SQL credentials obtained from Azure Key Vault
- Terraform file upload
Let’s go over the different steps to set up our fully scripted ETL pipeline!
Remote State & Secrets
Halfway into the task, I reached the point where I wanted to set up the SQL database and noticed that this is where secrets come into play. I certainly didn’t want to store them in clear text within the Terraform files (Falk gives good reasons not to do that with any secret in his blog post) and started looking into how to avoid it.
Result: Even if you keep the clear text out of the Terraform file, it will still be present in the Terraform state as that always directly reflects the deployed infrastructure. That means as soon as you share the state within the team, you essentially also share the secrets in clear text as well.
That’s when I read this great guide by Yevgeniy Brikman where he talks about storing your Terraform state remotely, e.g. in an Azure blob storage. That way you can still share the code, but the state file now lives inside Azure, where access can be restricted much more tightly. I just manually created an Azure storage account in its own Azure resource group for this.
As a next step, this storage can now be referenced from within the “real” Terraform file:
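A minimal sketch of such a backend configuration could look like this; the resource group, storage account and container names are placeholders for whatever you created manually:

```hcl
terraform {
  backend "azurerm" {
    resource_group_name  = "terraform-state-rg"  # hypothetical, the manually created resource group
    storage_account_name = "terraformstatestore" # hypothetical, the manually created storage account
    container_name       = "tfstate"             # container inside the storage account
    key                  = "terraform.tfstate"   # name of the state file within the container
  }
}
```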
The place where Terraform stores the state file is called the backend. The Azure backend first of all requires the resource group and the storage account we just created. Additionally, we need to give it the name of the container within the storage account that the state file should be stored in. The key attribute, finally, tells Terraform the name under which to store the state file, matching the terraform.tfstate that Terraform creates locally by default.
Now, whenever we execute a Terraform command, the state change is synced to the cloud instead of being stored on our machines.
Data Factory Infrastructure as Code
The next thing I was worried about was fine-grained control of Data Factory components through Terraform. It’s important to say that I hadn’t worked with Data Factory before. As soon as you do, you will see that there is no reason to worry (as long as you can use either GitHub or Azure DevOps), because Data Factory will sync all its infrastructure with a Git repository. Whenever you update your pipeline, Data Factory will sync the changes.
On top of that, the Terraform resource for Data Factory lets you reference a GitHub repository to bootstrap the Factory from — how convenient:
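A sketch of that resource might look as follows; the factory name, GitHub account and repository are placeholders, and I assume a resource group managed elsewhere in the configuration:

```hcl
resource "azurerm_data_factory" "etl" {
  name                = "etl-data-factory" # hypothetical
  location            = azurerm_resource_group.main.location
  resource_group_name = azurerm_resource_group.main.name

  github_configuration {
    account_name    = "my-github-account"      # hypothetical GitHub user/org
    repository_name = "data-factory-pipelines" # hypothetical repository
    branch_name     = "main"                   # the collaboration branch, not the publish branch
    git_url         = "https://github.com"
    root_folder     = "/"
  }
}
```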
SQL Credentials from Azure Key Vault
Let’s jump back to the situation I outlined above when I introduced the remote state. Again, Yevgeniy Brikman gave a great hint in his post when he talked about secret stores. The idea was simple: create the credentials inside Azure Key Vault and reference them from within the Terraform files so that the SQL database can be created.
As a first step, we need a Key Vault that already exists when we execute the Terraform commands. I created mine manually in the same resource group as the storage account for the remote state. We can then reference that Key Vault via a data source, which allows us to leverage already existing infrastructure inside Terraform files:
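A data source for a manually created Key Vault could look like this sketch, with both names as placeholders:

```hcl
data "azurerm_key_vault" "secrets" {
  name                = "etl-key-vault"      # hypothetical, the manually created Key Vault
  resource_group_name = "terraform-state-rg" # hypothetical, same group as the remote state storage
}
```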
UPDATE: You can also create the Key Vault through Terraform and let it generate passwords for you.
Now we can use that Key Vault to access the secrets within and make them available as data sources as well:
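For example, assuming the secrets were stored under the (hypothetical) names below, the two credentials could be pulled in like this:

```hcl
data "azurerm_key_vault_secret" "sql_user" {
  name         = "sql-admin-username" # hypothetical secret name in the vault
  key_vault_id = data.azurerm_key_vault.secrets.id
}

data "azurerm_key_vault_secret" "sql_password" {
  name         = "sql-admin-password" # hypothetical secret name in the vault
  key_vault_id = data.azurerm_key_vault.secrets.id
}
```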
Finally, the values can be referenced to create the SQL database:
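A sketch of that last step, assuming the secret data sources and resource group above; server and database names are placeholders:

```hcl
resource "azurerm_mssql_server" "sql" {
  name                         = "etl-sql-server" # hypothetical
  resource_group_name          = azurerm_resource_group.main.name
  location                     = azurerm_resource_group.main.location
  version                      = "12.0"
  # credentials come from Key Vault, never from the Terraform files themselves
  administrator_login          = data.azurerm_key_vault_secret.sql_user.value
  administrator_login_password = data.azurerm_key_vault_secret.sql_password.value
}

resource "azurerm_mssql_database" "db" {
  name      = "etl-db" # hypothetical
  server_id = azurerm_mssql_server.sql.id
}
```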
The credentials no longer appear in clear text within the Terraform files, and while they still end up in the state file, that file now lives in the access-restricted remote storage!
Terraform file upload
The last thing I didn’t know was possible is the file upload for the CSV blobs into Azure’s blob storage. You can simply set the source attribute on the resource and reference the local file you want to upload:
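A sketch of such a blob resource; the blob name, file path, and the referenced storage account and container are placeholders:

```hcl
resource "azurerm_storage_blob" "csv" {
  name                   = "input/customers.csv" # hypothetical blob name (folders are just path prefixes)
  storage_account_name   = azurerm_storage_account.data.name
  storage_container_name = azurerm_storage_container.csv.name
  type                   = "Block"
  source                 = "${path.module}/files/customers.csv" # local file to upload
}
```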
This allows you to set up the entire blob storage without any manual interaction, as all the folders, files and their content can be managed via Terraform.
Conclusion
Despite my worries that certain things just wouldn’t be possible with Terraform, you can rather easily set up an entire Data Factory, including CSV files as a data source and a SQL database as a data sink! And all that without exposing any secrets or credentials in your source code.
I was actually surprised how quickly I was able to leverage the remote backend for Terraform. Big thanks again to Yevgeniy!
On top of this, you will still need to do a few things: make sure your IAM is set up correctly so that the Terraform state is actually as secure as possible. Also, introduce secret/key rotation in the Key Vault to swap out the credentials every now and then. Finally, if the files are sensitive, don’t store them inside a public Git repository either.
You can find the code of the Data Factory here and the Terraform code for the setup here.
UPDATE March 10th, 2023: Fixed the branch references when creating the Data Factory instance with a GitHub configuration. The branch required here is the “collaboration branch”, not the “publish branch”.