Terraform — how I split my monolithic state

Adrian Arba
Mar 1, 2024

When project requirements are unclear at the start, or you begin with a PoC for a large-scale infrastructure environment, you may find later in development that the infrastructure is hard to scale: as new requirements come up or become clear, some components need to stay tightly coupled while others need to fan out into isolated components.

If you don't deal with it early, the rework gets more cumbersome the longer it sits in the backlog, and the tech debt keeps growing.

Terraform code becomes harder to read, variables become harder to maintain, the state file keeps growing, and releases take longer and longer because more components have to be validated against the single, ever-growing state file.

You may also find that a small change to one isolated component still has to go through the same pipeline, where every potential infrastructure change is checked.

Other pipelines and developers can't work against that state in parallel to test features or bugfixes. At some point, splitting the state becomes a task you have to plan for and perform in waves, and because so few people do it, and so rarely, the outcome is hard to predict.

As a good practice, when you start building out the infra, make sure the requirements are clear, identify which resources need to be tightly coupled, and use separate state files and data sources wherever components can exist on their own. As soon as you have more details, split the state if necessary. Don't postpone it until it becomes a scary thing to do (a live Production environment with tens, hundreds, or even more components).

Lessons learned

On a project I worked on, we started with a PoC that turned out so well that it became a template for onboarding other applications into the infrastructure we were building.

The problem was that we were short on time, and as soon as the PoC was presented, the next phases were prioritized, which meant building out more infrastructure ASAP.

For the PoC, we implemented a state file for applications and a state file for the supporting infrastructure, containing both tightly coupled components and components shareable between different applications and teams.

The new requirement was that all of these applications and teams would exist in their own subset of services and live as independently as possible, and that the current PoC would immediately serve Production workloads.

In the infrastructure team, we immediately realized that if we scaled the infrastructure (in this case, the Terraform state) horizontally, we would slowly pave the path to chaos.

So, the lessons learned here were:

  • Make sure you understand the elasticity of your future infrastructure: clarify whether it needs to scale horizontally or vertically in terms of Terraform resources
  • Understand the needs and urgency of the teams relying on the infrastructure you provision: what can be destroyed or changed without a big impact
  • In general, don’t put all your eggs in one basket unless explicitly asked to do so (a lesson learned from upgrading Terraform from version 0.11)
  • Don’t leave infrastructure rework in the backlog if possible
  • Don’t take PoCs to Production; make sure you have enough time to evaluate the work done, and ask yourself what you would do better in version 2

Coming back to our problem, we had two main options:

  • Option A: ask for a bit more time, pause the dev work, destroy the Production (and all other) environments, rewrite the Terraform components, and rebuild everything;
  • Option B: address the issue on the fly, one environment at a time: rewrite the Terraform components, create new state files per component, and split the current monolithic state by migrating components to the new state files, with only small interruptions to dev work.

I’m going to talk about Option B. We chose it to limit the impact on dev work, to balance introducing new infra features with splitting the old infra state on the fly, and to keep the data added while the PoC was unfolding.

Below, without going into nonessential details, I’ll show how we did this at the Terraform state level. Some context first:

  • infrastructure was GCP-heavy
  • state files were stored in Cloud Storage buckets, versioned
  • we used Terragrunt to distinguish between environments, so the commands I used all start with terragrunt, but bear in mind this is all Terraform functionality under the hood
  • I used bash to run through the commands

Step 1 — Add the code of the new component

As a first step, we started writing the new Terraform components in the way that we wanted, accounting for tightly coupled components, reusability, splitting shared components into a new folder, cleaning up variables, and reworking the Terragrunt folder structure to account for each component.

To simplify, I’ll use a tree that shows only the in-scope files (but imagine these paths contain a lot of Terraform files):

# pre migration single component

└── project
    └── terraform_monolith
        ├── application.tf
        ├── database.tf
        └── variables.tf


# intermediary step (where we are now)
# this is where you would start adding the new terraform structure - 'database'

└── project
    ├── terraform_monolith
    │   ├── application.tf
    │   ├── database.tf
    │   └── variables.tf
    │
    └── database
        ├── database.tf
        └── variables.tf


# post migration components
# 'terraform_monolith' is renamed to 'application' and cleaned up

└── project
    ├── application
    │   ├── application.tf
    │   └── variables.tf
    │
    └── database
        ├── database.tf
        └── variables.tf

Step 2 — Don’t remove any components from the monolithic component (yet)

The initial monolithic component was left untouched because we need its state file in full to be able to ‘move’ infrastructure from the monolithic state to the new states.

So any changes to the monolithic component will happen later. For now, just copy whatever you need from it into the new components, and edit it in the new components' folders (including the variables.tf files).

Step 3 — Download the monolithic state file locally

To interact with the monolithic state, we first downloaded it locally as a backup, and only then started running Terragrunt commands against the one stored in the cloud.

# pull the current backend state locally, as a backup
# you have multiple choices:
# - download from Cloud console
# - download via Cloud cli
# - use Terraform's functionality (convenient, because it goes through the backend and credentials the component already uses)

# in your monolithic component
cd project/terraform_monolith

# make sure everything is up to date and you can access the backend
terragrunt init

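# get a list of resources in this state file and save it in a text file
# this helps you select individual resources to move
# (resources.txt is just an example file name)
mkdir -p state_migration/terraform_monolith
terragrunt state list > state_migration/terraform_monolith/resources.txt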

# pull the state to a separate location
terragrunt state pull > state_migration/terraform_monolith/monolith.tfstate

Note: If you use a versioned remote bucket as the state backend, you will see that as you move resources from one state to another, the original state gets smaller and smaller. It is a move action, not a copy action. If something goes wrong and you need to get back to the initial state, you will need to restore an older version of the state object in the bucket, one that still contains all the resources of your monolithic component. Or, if you have a local backup that you know is not stale, you can work from that.

Note: Terraform keeps a counter in each state file (the serial field). Each time the state changes, the counter is incremented. When working with state files directly, make sure the serial of the file you push is not lower than the one already in the backend.
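A small sketch for checking that counter before a push. The counter is the top-level serial field of the state JSON, stored next to a lineage ID. To my knowledge, state push refuses a file whose lineage differs from the backend's or whose serial is lower, unless you force it. The helper names below are hypothetical, and the parsing assumes the pretty-printed JSON that state pull writes.

```shell
# Print the top-level "serial" of a state file.
# Assumes one JSON key per line, as written by `state pull`.
state_serial() {
  sed -n 's/^ *"serial": *\([0-9][0-9]*\),*$/\1/p' "$1" | head -n 1
}

# Print the top-level "lineage" ID of a state file.
state_lineage() {
  sed -n 's/^ *"lineage": *"\([^"]*\)",*$/\1/p' "$1" | head -n 1
}

# Succeeds when the local file ($1) has the same lineage as a fresh pull
# of the backend state ($2) and a serial that is not lower.
# An empty or missing backend state is a separate case, not handled here.
safe_to_push() {
  [ "$(state_lineage "$1")" = "$(state_lineage "$2")" ] &&
    [ "$(state_serial "$1")" -ge "$(state_serial "$2")" ]
}
```

For example, pull the backend state to a scratch file right before pushing, then run safe_to_push with your edited local file as the first argument and the scratch file as the second.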

Step 4 — Provision the state backends for the new components

In this step, we needed to make sure that terragrunt init could run against the new, still empty components, with a state backend being created in the cloud. To move state around, besides the Terraform resources, you also need a source (input) state file and a destination (output) state file.

So make sure you run this successfully in your new components (Terragrunt can create the state bucket for you in the Cloud estate, which is nice):

# switch to your new terraform component's root folder (where all the .tf files are).
cd project/database

# run Terragrunt to initialize the backend, download all modules and create a state file in your designated bucket.
# if the command is successful, you are ready to migrate to this new state file
terragrunt init

Step 5 — Move the resources in the state from monolithic to split components

Remember that list of resources we pulled at Step 3? We will need it here.

# in the same new component folder, continue with the next command
# we are going to pull down the empty, newly initialized state file of the new component
terragrunt state pull > state_migration/database/database.tfstate

# switch back to the monolithic component
cd project/terraform_monolith

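# and start moving state to the new component using the destination local state file
# (make sure the -state-out path resolves from the folder you run the command in; an absolute path is the safest)

# using a few resources as examples
# a resource inside a module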
terragrunt state mv -state-out=state_migration/database/database.tfstate "module.pubsub.google_pubsub_subscription.pull_subscriptions[\"my_pull_subscription\"]" "module.pubsub.google_pubsub_subscription.pull_subscriptions[\"my_pull_subscription\"]"

# a regular resource
terragrunt state mv -state-out=state_migration/database/database.tfstate "google_pubsub_topic_iam_member.subscriber" "google_pubsub_topic_iam_member.subscriber"

# a resource created with count (a list element)
terragrunt state mv -state-out=state_migration/database/database.tfstate "google_service_account.app_service_account[0]" "google_service_account.app_service_account[0]"

Each run of state mv will remove that resource from the source state.

Let’s break down the command:

terragrunt state mv                # or terraform state mv: reads the component's backend state and moves one resource to another state

-state-out=/path/to/file.tfstate   # the destination state file

1st "cloud_resource.name"          # the resource address in the source state

2nd "cloud_resource.name"          # the resource address in the destination state
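With more than a handful of resources, typing each state mv by hand gets error-prone. Here is a minimal sketch of a loop that reads addresses from a text file, one per line, and moves each one under the same address, as in the examples above. The function name, the TG_BIN and DRY_RUN variables, and the list file are all hypothetical; only state mv and -state-out come from Terraform itself.

```shell
# Move every address listed in a file from the current component's state
# into a local destination state file.
#   $1 - text file with one resource address per line
#   $2 - destination state file (passed to -state-out)
# Hypothetical knobs for this sketch:
#   TG_BIN  - binary to call (defaults to terragrunt; terraform also works)
#   DRY_RUN - when set to 1, print the commands instead of running them
move_addresses() {
  local list_file=$1
  local dest_state=$2
  local addr
  while IFS= read -r addr || [ -n "$addr" ]; do
    # skip blank lines and lines commented out with '#'
    case "$addr" in '' | '#'*) continue ;; esac
    if [ "${DRY_RUN:-0}" = "1" ]; then
      printf "%s state mv -state-out=%s '%s' '%s'\n" \
        "${TG_BIN:-terragrunt}" "$dest_state" "$addr" "$addr"
    else
      # </dev/null keeps the command from swallowing the rest of the list
      "${TG_BIN:-terragrunt}" state mv -state-out="$dest_state" \
        "$addr" "$addr" </dev/null || return 1
    fi
  done < "$list_file"
}
```

I would run it with DRY_RUN=1 first and read through the printed commands before letting it touch the real state. The loop stops at the first failed move, so a partially processed list is easy to spot.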

Step 6 — Update the new state component’s backend state

So far, the only backend state changed by the commands above is that of the terraform_monolith component (hence the initial backup, just in case something goes terribly wrong).

The new component's state (database) has only been updated in the local file. Its backend state has not been updated yet, and that’s the last piece of the puzzle, really:

# switch paths to your new component's Terraform configuration
cd project/database

# simply run a state push using the destination state local file
terragrunt state push state_migration/database/database.tfstate

# and validate everything works fine
terragrunt init

# you should see all of your defined resources matched to their moved state entries,
# with no changes to be performed
terragrunt plan
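If you want a pipeline or script to confirm the "no changes" result instead of reading the plan by eye, Terraform's -detailed-exitcode flag on plan returns 0 for an empty diff, 2 for a non-empty diff and 1 for an error. A minimal sketch, where the function name and the TG_BIN override are hypothetical:

```shell
# Run a plan in the current component and report whether it is clean.
# TG_BIN is a hypothetical override (defaults to terragrunt).
# The plan output itself is discarded; re-run a plain plan to read the diff.
check_no_changes() {
  "${TG_BIN:-terragrunt}" plan -detailed-exitcode -input=false >/dev/null
  case $? in
    0) echo "clean: state matches the configuration" ;;
    2) echo "drift: the plan wants to change something"; return 2 ;;
    *) echo "error: the plan failed"; return 1 ;;
  esac
}
```

Running it in both the new component and the trimmed-down monolith gives a quick signal that nothing is about to be created or destroyed on either side of the move.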

Step 7 (Optional) — Delete/keep what you need from your monolithic component

At this stage, you can choose to keep your trimmed-down terraform_monolith component, rename it, or delete it if you migrated everything.

During the state migration, you can choose to migrate an entire state file to another state file, split a state file into multiple new state files, or move parts of a state file to a new state file and keep what’s left in the original one.
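Whichever of these variants you pick, it is worth checking that every address from the original state ended up in exactly one state afterwards. A minimal sketch, assuming you saved the output of state list before the migration and again for each component after it (the function name and file names are hypothetical):

```shell
# Compare the address list saved before the migration with the lists taken
# from every component afterwards.
#   $1      - list saved from the monolith before the migration
#   $2 ...  - lists saved from each component after the migration
# Prints "missing: <addr>" for addresses found in no list afterwards and
# "duplicate: <addr>" for addresses found in more than one list.
# Returns 0 when there is nothing to report, 1 otherwise.
reconcile_addresses() {
  local before=$1
  shift
  local tmp report
  tmp=$(mktemp -d) || return 1
  sort -u "$before" > "$tmp/before"
  cat "$@" | sort > "$tmp/after_all"
  sort -u "$tmp/after_all" > "$tmp/after"
  report=$(
    comm -23 "$tmp/before" "$tmp/after" | sed 's/^/missing: /'
    uniq -d "$tmp/after_all" | sed 's/^/duplicate: /'
  )
  rm -rf "$tmp"
  [ -z "$report" ] && return 0
  printf '%s\n' "$report"
  return 1
}
```

A call could look like reconcile_addresses before.txt after_monolith.txt after_database.txt. Addresses you renamed or removed on purpose will show up as missing, so treat the report as a checklist rather than a verdict.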

You can also rename resources in the destination state file, or delete resources entirely if they are not needed.

Hope this tutorial helps you stay on top of your infrastructure :)

Here’s another source I found helpful when building out my solution; the steps are really well explained: https://www.maxivanov.io/how-to-move-resources-and-modules-in-terragrunt/
