Infra As Code for Data products — part 1 (design use case)
A lot has been discussed about data products in recent years, with the new concept of data mesh further emphasizing the importance of this topic.
The aim of this post is not to delve into data mesh specifically, but rather to use its principles to define a data product in terms of the infrastructure and automation required to consistently and efficiently deliver the bare minimum of data product functionality.
What is a data product?
In the context of data mesh, a data product is any representation of data that holds value for a consumer. Martin Fowler has written an informative article that provides a valuable perspective on data as a product, based on Zhamak Dehghani’s data mesh principles.
However, it can be overwhelming to think about all possible data representations as data products. A more practical approach is to start small and focus on the technology that your company is currently using or plans to use in the future.
For example, our fictional company is using Google Cloud Platform (GCP) for its data storage and processing needs. Data is primarily stored in databases or Pub/Sub topics from microservices, and teams use Spark or BigQuery to build their data pipelines.
Additionally, the teams use Google Cloud Storage (GCS) to store raw and curated data, as well as data science models and Spark applications. The data is mainly exposed in a structured format on BigQuery, governed by Google Data Catalog, and can be consumed by any analytics tool that integrates with GCP.
Therefore, the data products produced and exposed by our fictional company are useful data, created and accessed using GCP technology. The company should focus on delivering data products that align with the majority of its use cases, and select technologies and platforms that fit well with GCP.
Self-serve data infrastructure
Now that we have identified the technology requirements for data products within the organization, the next step is to provide the infrastructure needed to support these products. This can be a complex task, as it involves many different components and may require specialized skill sets that not all teams possess. To address this issue, the Data Mesh principle of self-serve data platform can be employed.
The idea is to use self-serve infrastructure to provide a high level of abstraction that removes complexity and empowers teams to easily access the infrastructure they need to produce data.
IaC (infra-as-code) appears here to save the day and make the data world beautiful and "simple".
Designing the data product infrastructure
A useful way to think about the infrastructure components of data products is to divide them into three categories: actors, artifacts, and security objects.
Actors are the entities that interact with the data products, such as groups, users, or service users. They represent the “principals” that will use the artifacts.
Artifacts are the data storage and processing resources that the actors interact with, such as datasets, clusters, projects, folders, and catalogs. They are the actual data products that actors will use.
Security Objects are the permissions, roles, and policies that control access to the artifacts. They establish the connection between actors and artifacts and ensure that the right actors have the appropriate access to the data products.
Let’s look at the diagrams; they make it easier to process and understand the components of the data product infrastructure.
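As a rough, purely illustrative mapping (my own assumption, not something prescribed by the diagrams), the three categories line up with GCP and Terraform concepts roughly like this:

```hcl
# Illustrative grouping only -- these locals are not consumed by any resource;
# they just list the kinds of Terraform resources each category tends to map to.
locals {
  actors = [
    "google_cloud_identity_group", # developer and admin groups
    "google_service_account",      # the service user that runs production workloads
  ]

  artifacts = [
    "google_storage_bucket",   # raw/curated data, Spark apps, models
    "google_bigquery_dataset", # structured, consumable data
    "google_project",          # the container for everything above
  ]

  security_objects = [
    "google_project_iam_member",        # role bindings for the groups
    "google_storage_bucket_iam_member", # bucket-level access for the service user
  ]
}
```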
Our use case example is focused on creating a data product within a company that uses Google Cloud Platform (GCP). The infrastructure components that we have identified are listed below, followed by a minimal Terraform sketch:
- Groups: To manage access to the data product, we will create groups and assign roles to them. These will include a developer group for members who design data pipelines and models and work with the data as producers, and an admin group for members with responsibilities such as managing member access and controlling billing permissions.
- Service user: The service user is the only entity with permissions to access the production environment, and will be responsible for creating and executing the data product.
- Storage: A dedicated bucket will be created to store the data, applications, local data, parameter files, and credentials for the data product.
- Roles: Roles will be assigned to the groups, using the default roles provided by GCP.
- Services: No service artifacts will be created as Infrastructure as Code (IaC) at the start, but they can be added later to support data workloads.
- GitHub repository: The data product’s developed code will be stored in a GitHub repository, with version control and automation capabilities.
- Jenkins pipeline: To enforce best practices and ensure consistency, a CI/CD pipeline will be set up from day one using Jenkins.
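A minimal Terraform sketch of the core GCP components might look like the snippet below. Everything here is an assumption for illustration: the resource names, the variables (var.data_product_name, var.customer_id, var.company_domain, and so on), and the chosen role are placeholders rather than the actual code of this PoC.

```hcl
# Minimal sketch -- names, variables and roles are illustrative placeholders.

# Developer group (an admin group would be declared the same way)
resource "google_cloud_identity_group" "developers" {
  display_name         = "${var.data_product_name}-developers"
  parent               = "customers/${var.customer_id}"
  initial_group_config = "WITH_INITIAL_OWNER"

  group_key {
    id = "${var.data_product_name}-developers@${var.company_domain}"
  }

  labels = {
    "cloudidentity.googleapis.com/groups.discussion_forum" = ""
  }
}

# Service user: the only principal allowed to run the product in production
resource "google_service_account" "data_product" {
  project      = var.project_id
  account_id   = "${var.data_product_name}-sa"
  display_name = "Service user for the ${var.data_product_name} data product"
}

# Dedicated bucket for data, applications, parameter files and credentials
resource "google_storage_bucket" "data_product" {
  project                     = var.project_id
  name                        = "${var.project_id}-${var.data_product_name}"
  location                    = var.region
  uniform_bucket_level_access = true
}

# One example of a default GCP role assigned to the developer group
resource "google_project_iam_member" "developers_bigquery" {
  project = var.project_id
  role    = "roles/bigquery.dataEditor"
  member  = "group:${google_cloud_identity_group.developers.group_key[0].id}"
}
```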
This is an initial list of components, and other items may be added as necessary. For now, the focus is on producing the data product and not on sharing it. The plan is to code it and improve it as necessary.
Terraform Code
The code provided should be clear and easy to understand, at least in its intention. Rather than walking through it line by line, I will point out specific parts and give hints that make it easier to follow.
One important consideration is the handling of provider credentials. In the provided code, these credentials have been omitted for obvious reasons 😝. In a real-world enterprise setup, these credentials should be stored on the CI/CD machines and properly protected to prevent unauthorized access.
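One common way to keep credentials out of the code, sketched below under the assumption that the CI/CD agent has a protected key file or an attached service account, is to let the google provider resolve them from the environment:

```hcl
# No "credentials" argument in the code: the google provider falls back to
# GOOGLE_APPLICATION_CREDENTIALS (or the agent's attached service account),
# so the key only ever lives on the protected CI/CD machine.
provider "google" {
  project = var.project_id
  region  = var.region
}
```

On the Jenkins agent this typically means exporting `GOOGLE_APPLICATION_CREDENTIALS=/path/to/key.json` from a locked-down location before running `terraform plan` and `terraform apply`.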
The code provided is designed to create the entire data product infrastructure with minimal input. The specific variables required will vary depending on the use case; however, some general variables are likely to be the same for all data products.
In this proof of concept (PoC), these variables are set via the terraform.tfvars file, however, they can also be passed as arguments to the terraform plan and terraform apply commands.
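For illustration, a terraform.tfvars for this PoC could look like the following; the variable names and values are examples, not the exact ones used:

```hcl
# terraform.tfvars -- example per-data-product values (names are illustrative)
project_id        = "my-gcp-project"
region            = "europe-west1"
data_product_name = "customer-orders"
technical_lead    = "some-github-user" # GitHub username of the tech lead
```

The same values could equally be passed on the command line, e.g. `terraform plan -var="data_product_name=customer-orders"`.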
Creating a repository for data product code based on a template is a good approach to help teams get started with a solid foundation in place.
The data product team can also be created on GitHub and associated with the repository. This makes it easier to add and remove collaborators for the data product. The technical lead is the maintainer of both the repository and the team.
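With the integrations/github Terraform provider, that part could be sketched roughly as below; the template repository name, team name, and permissions are assumptions for the example:

```hcl
# Repository created from an org-wide template repository (hypothetical name)
resource "github_repository" "data_product" {
  name        = var.data_product_name
  description = "Code for the ${var.data_product_name} data product"
  visibility  = "private"

  template {
    owner      = var.github_org
    repository = "data-product-template"
  }
}

# Team associated with the repository, to manage collaborators in one place
resource "github_team" "data_product" {
  name    = "${var.data_product_name}-team"
  privacy = "closed"
}

resource "github_team_repository" "data_product" {
  team_id    = github_team.data_product.id
  repository = github_repository.data_product.name
  permission = "push"
}

# The technical lead maintains the team...
resource "github_team_membership" "tech_lead" {
  team_id  = github_team.data_product.id
  username = var.technical_lead
  role     = "maintainer"
}

# ...and the repository itself
resource "github_repository_collaborator" "tech_lead" {
  repository = github_repository.data_product.name
  username   = var.technical_lead
  permission = "maintain"
}
```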
It is assumed that a domain folder already exists as a component, following the data mesh concept of a domain; this folder aggregates all data products within that domain.
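Under that assumption, the folder is only looked up, never created by the data product itself; something like this sketch (variable names are again placeholders):

```hcl
# The domain folder is assumed to pre-exist (created by the domain's own IaC);
# the data product merely resolves it by display name under the organization.
data "google_active_folder" "domain" {
  display_name = var.domain_name
  parent       = "organizations/${var.organization_id}"
}

# Resources belonging to the data product can then reference
# data.google_active_folder.domain.name ("folders/<id>") as their parent.
```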
I would like to draw attention to line 119, where the credential key for the service user is created. This type of resource is registered in the Terraform state file along with its value, so it is important not to push the state file to repositories; instead, set up a proper backend and keep the state in a bucket with strict access controls.
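In other words, a resource along the lines of google_service_account_key ends up with its secret material in the state, which is why the state belongs in a remote, access-controlled backend. A sketch, where the bucket name is an example and the service account reference points at the earlier sketch:

```hcl
# The generated private key is stored (base64-encoded) in the Terraform state,
# which is exactly why the state must never be committed to the repository.
resource "google_service_account_key" "data_product" {
  service_account_id = google_service_account.data_product.name
}

# backend.tf -- keep the state in a bucket with strict access controls instead
terraform {
  backend "gcs" {
    bucket = "my-company-terraform-state" # example name
    prefix = "data-products"
  }
}
```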
A Jenkins folder for the data product is created and assigned to the technical lead with the necessary permissions. A Jenkins pipeline job is also created based on a template.xml file (omitted) located within the data product folder.
The template.xml file includes variable placeholders that are replaced using Terraform’s templatefile function. This template file already includes all the necessary configuration for the pipeline to connect with the GitHub repository created in the IaC process.
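Assuming a community Jenkins provider (for example taiidani/jenkins) and an illustrative set of template variables, that step could look roughly like this:

```hcl
# Sketch only: the provider, resource names and template variables are assumptions.
resource "jenkins_folder" "data_product" {
  name        = var.data_product_name
  description = "CI/CD jobs for the ${var.data_product_name} data product"
}

resource "jenkins_job" "pipeline" {
  name   = "${var.data_product_name}-pipeline"
  folder = jenkins_folder.data_product.id

  # template.xml contains placeholders that templatefile() substitutes,
  # including the URL of the GitHub repository created earlier.
  template = templatefile("${path.module}/template.xml", {
    repository_url = github_repository.data_product.http_clone_url
    description    = "Pipeline for the ${var.data_product_name} data product"
  })
}
```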
One of the key benefits of this approach is that the data product’s service credentials are registered within the Jenkins folder and are only accessible within that scope. This ensures that no other job has access to them and reduces the risk of the credentials being compromised.
What's Next (updated)
Part 2 is now available
…and for part 3: when dealing with multiple data products in one project, running the Terraform plan can take a significant amount of time. One possible solution is for each data product to use its own state file and backend configuration, which would allow faster plan runs and better management of each data product’s resources.
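One rough sketch of that idea is a partial backend configuration, with each data product supplying its own bucket and prefix when Terraform is initialized (the values below are examples):

```hcl
# backend.tf -- partial configuration; each data product provides its own
# settings at init time, for example:
#   terraform init \
#     -backend-config="bucket=my-company-terraform-state" \
#     -backend-config="prefix=data-products/customer-orders"
terraform {
  backend "gcs" {}
}
```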
I am open to suggestions.