A journey with Terraform, Grafana, Delve, VSCode and a stack trace…

Jérôme NAHELOU
Aug 4 · 4 min read

I use Terraform to manage all my infrastructure components, from GCP projects to the monitoring stack.

Like many Terraform users, my main difficulty is managing versions across multiple deployments: Terraform itself, providers, and the software or APIs they target.

Last week, I had to redeploy code I wrote a year ago (Terraform 0.12) that creates a predefined dashboard in Grafana (with its datasource, notification channel, …).

So I decided to upgrade the code to Terraform 1.0+ and started my deployment without success 💣

In this story, I will show you how I applied the concepts described in the documentation to debug the Grafana Terraform provider with Delve and VSCode, on a real-life issue.

Let’s take a look at my context

My advice is to reproduce the issue with minimal Terraform code.
Most of the users I have worked with try to debug more than 200 lines of Terraform code spread across multiple files, resources, ...

What’s more boring than waiting for a terraform plan?

For this issue, I used a single resource, with a function to reproduce my content:

Then I ran my terraform init and plan commands.

Stack trace from the terraform-provider-grafana plugin:

panic: Attempted to unmarshal invalid JSON. This unexpectedly got past schema validation.

goroutine 54 [running]:
github.com/grafana/terraform-provider-grafana/grafana.unmarshalDashboardConfigJSON(0x110cae4, 0x24, 0x0)
/home/jnahelou/golang/src/github.com/grafana/terraform-provider-grafana/grafana/resource_dashboard.go:268 +0x17a

Three notable changes have been made to the environment since my last deployment:

  • Terraform version 0.12 to 1.0+ (what I want)
  • Grafana 7 to 8 (what the administrators did)
  • Provider version from 1.5 to 1.13.2 (which is required by Grafana 8)

After a quick read of the changelogs, I found that the dashboard json_config format changed between Grafana 7 and 8.
Thanks to the sample above (the smallest dynamic content required to create a dashboard) and to the provider's unit tests (each of which uses static content), I can rule out typos in my template.

I found that a GitHub issue was already open for the same bug:

Of course, I could add log.Printf calls in the provider code…

But let’s try to improve the debugging experience with a real debugger like Delve.

Thanks to the great work on terraform-plugin-sdk v2 and the examples provided by the community, it’s easy to get started!

  • Start by compiling the provider with optimizations and inlining disabled, as described in the Terraform documentation:
    go build -gcflags="all=-N -l"
  • Configure VSCode to attach to an already running debugging session:
    {
      "version": "0.2.0",
      "configurations": [
        {
          "name": "Remote",
          "type": "go",
          "request": "attach",
          "mode": "remote",
          "remotePath": "${workspaceFolder}",
          "port": 8888,
          "host": "127.0.0.1"
        }
      ]
    }
  • Set breakpoints on the functions where you want execution to stop (in my case, the function called just before the panic is triggered)
  • Run the debugger session with the dlv command:
dlv exec --listen=:8888 --headless --api-version=2 ./terraform-provider-grafana -- -debug
  • Connect to the debugging session in VSCode (press F5 or Run -> Start Debugging)
  • In the Delve output, you will be invited to export the TF_REATTACH_PROVIDERS environment variable
  • Now, in the shell, export TF_REATTACH_PROVIDERS, run terraform as usual and follow the execution in VSCode.
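
When Terraform does not seem to reach the debugged provider, it can help to check what TF_REATTACH_PROVIDERS actually contains. Here is a minimal sketch in Go that decodes it and prints, for each provider address, the process id and socket. The struct fields are my assumption based on the JSON I have seen the provider print (Protocol, Pid, Test, Addr); other fields may exist and are simply ignored. The names reattachConfig and parseReattach are hypothetical helpers, not part of any SDK.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
)

// reattachConfig mirrors the fields assumed to be present in each entry
// of TF_REATTACH_PROVIDERS. Unknown fields are ignored by encoding/json.
type reattachConfig struct {
	Protocol string
	Pid      int
	Test     bool
	Addr     struct {
		Network string
		String  string
	}
}

// parseReattach decodes the variable's JSON into a map keyed by
// provider address (for example registry.terraform.io/grafana/grafana).
func parseReattach(value string) (map[string]reattachConfig, error) {
	providers := map[string]reattachConfig{}
	if err := json.Unmarshal([]byte(value), &providers); err != nil {
		return nil, err
	}
	return providers, nil
}

func main() {
	providers, err := parseReattach(os.Getenv("TF_REATTACH_PROVIDERS"))
	if err != nil {
		fmt.Println("TF_REATTACH_PROVIDERS is missing or malformed:", err)
		return
	}
	for name, cfg := range providers {
		fmt.Printf("%s -> pid %d on %s %s\n", name, cfg.Pid, cfg.Addr.Network, cfg.Addr.String)
	}
}
```

Run it in the same shell where you exported the variable: if the provider address printed here differs from the source declared in your required_providers block, Terraform will start its own copy of the plugin and your breakpoints will never be hit.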

You can also configure Terraform to use your local provider:

Let’s go back to my issue…

It seems that, for a new plan, the StateFunc defined on the json_config attribute (“normalizeDashboardConfigJSON”) is called with the string “74D93920-ED26-11E3-AC10-0800200C9A66” instead of the result of my template.

You can refer to the documentation to read more about schema helpers:

In fact, at this stage, Terraform cannot know the content of the template, because the template depends on another resource (“known after apply”).

This value comes from the terraform-plugin-sdk itself :

// UnknownVariableValue is a sentinel value that can be used
// to denote that the value of a variable is unknown at this time.
// RawConfig uses this information to build up data about
// unknown keys.
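
To see why this sentinel leads to a panic in a function that expects dashboard JSON, here is a minimal standalone sketch. It does not import the SDK: the constant is copied by hand, and isJSONObject is a hypothetical helper that only mimics the first step of the provider's unmarshalling.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Copied by hand from terraform-plugin-sdk so the sketch needs no dependency.
const unknownVariableValue = "74D93920-ED26-11E3-AC10-0800200C9A66"

// isJSONObject reports whether s can be unmarshalled into a map,
// which is what a dashboard definition is expected to be.
func isJSONObject(s string) bool {
	var obj map[string]interface{}
	return json.Unmarshal([]byte(s), &obj) == nil
}

func main() {
	fmt.Println(isJSONObject(`{"title": "my dashboard"}`)) // true: a real dashboard body
	fmt.Println(isJSONObject(unknownVariableValue))        // false: the placeholder seen during plan
}
```

The placeholder is a plain UUID-like string, so any code that assumes "schema validation already guaranteed valid JSON" fails as soon as the value is still unknown.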

As a workaround, I opened a pull request suggesting that the maintainers replace the panic() with proper error handling and skip this step when json_config is not in the expected format.
It is now merged, and I’m happy it helps!
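
The idea behind the workaround can be sketched as follows. This is not the code that was merged, only an illustration of the principle: normalizeJSON is a hypothetical stand-in for the provider's StateFunc, which returns its input untouched when it cannot be parsed instead of panicking.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// normalizeJSON is a hypothetical stand-in for the provider's StateFunc.
// It re-encodes valid JSON in a compact form and, instead of panicking,
// returns the input unchanged when it cannot be parsed.
func normalizeJSON(raw string) string {
	var obj map[string]interface{}
	if err := json.Unmarshal([]byte(raw), &obj); err != nil {
		return raw // e.g. the "known after apply" placeholder
	}
	out, err := json.Marshal(obj)
	if err != nil {
		return raw
	}
	return string(out)
}

func main() {
	fmt.Println(normalizeJSON(`{ "title" : "demo" }`))                 // {"title":"demo"}
	fmt.Println(normalizeJSON("74D93920-ED26-11E3-AC10-0800200C9A66")) // returned as-is
}
```

With this shape, the plan can complete while the value is unknown, and the normalization runs later, once the template has been rendered.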

There is no magic way to avoid this situation. Applications evolve, users ask for new features, security issues must be fixed, … But you want to choose when to apply these changes.

  • Use version constraints for modules, providers and Terraform to avoid unpredictable deployment issues
  • Schedule Terraform plans using the Terraform Cloud API and monitor the result
  • Follow upgrades!

During my Grafana provider upgrade from 1.5 to 1.13, I skipped important versions that changed how dashboards are identified in the tfstate, from slug to uid.
Now I have to manually update my tfstate for all existing workspaces.

Thanks to Louis Billiet and Gaetan Bogaert for their tips during our “Sfeir Share”. My next priority is to use Dependabot, which makes it easy to detect and upgrade provider and module versions.

CodeShake

Learnings and insights from SFEIR community.