IAM Roles for K8s Service Accounts Lesson Learned

Weston Bassler
The Emburse Tech Blog
4 min read · Nov 29, 2022

If you are using EKS, chances are you are utilizing IAM Roles for Service Accounts (IRSA) in order to interact with other AWS services via the AWS SDK or AWS CLI. IRSA was an absolute game changer when it was announced for anyone who had been forced to use other projects such as kube2iam or kiam to take advantage of IAM Roles on Kubernetes. AWS did a great job of making this process very simple to set up and manage. There are also several Terraform modules out there that simplify this if you are a Terraform and AWS shop.

This post, however, is not about IRSA or the process of setting up IRSA. It is about a lesson learned while developing one of our services that needed access to write to DynamoDB tables in various regions. I will show how we discovered that we were not getting the expected results, and how we overcame it.

Background Story

Recently, we have been going through the process of migrating one of our machine learning models from AWS Lambda to our Data Science Platform, running on EKS. The final step before returning the results is to write the results of the prediction to a DynamoDB table so that they can be retrieved later. We have several Dynamo tables that exist in different regions, and each individual request decides which Dynamo table receives the write based on headers we receive in the payload. We use the boto3 DynamoDB client in our code to write to the Dynamo tables.
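Conceptually, the routing looks something like the sketch below. The header name and the region mapping are hypothetical and only illustrate the idea; the real service uses its own header and mapping.

# Hypothetical mapping from a payload header to the region whose
# DynamoDB table should receive the write (names are illustrative only).
REGION_BY_COUNTRY = {
    "CA": "ca-central-1",
    "US": "us-east-2",
    "EU": "eu-west-1",
}

def resolve_dynamo_region(headers):
    # Pick the target DynamoDB region based on a request header.
    country = headers.get("x-country-code", "US")
    return REGION_BY_COUNTRY.get(country, "us-east-2")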

So far this seems pretty simple. We just need to make sure that the policy within the IAM Role we attach to our service account has write permissions to the DynamoDB table in every region. We use Terraform, with a policy similar to this example:

data "aws_iam_policy_document" "iam_policy" {

statement {
sid = "AllowWriteDynamoDB"

actions = [
"dynamodb:PutItem"
]

effect = "Allow"
resources = [
"arn:aws:dynamodb:${local.REGION_1}:${data.aws_caller_identity.current.account_id}:table/${local.dynamodb_table}*",
"arn:aws:dynamodb:${local.REGION_2}:${data.aws_caller_identity.current.account_id}:table/${local.dynamodb_table}*",
"arn:aws:dynamodb:${local.REGION_3}:${data.aws_caller_identity.current.account_id}:table/${local.dynamodb_table}*"
]
}
}

We will also pass the region to boto3 so that the client writes to the appropriate DynamoDB table:

import boto3
import os

AWS_DYNAMO_REGION = os.getenv('AWS_DYNAMO_REGION')

client = boto3.client('dynamodb', region_name=AWS_DYNAMO_REGION)
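For completeness, the write itself is just a put_item call against the region-specific client. The table name and item attributes below are hypothetical and only show how the client is used:

# Hypothetical example of writing a prediction result with the
# region-specific client. Table name and attributes are illustrative only.
client.put_item(
    TableName='predictions',
    Item={
        'request_id': {'S': 'abc-123'},
        'prediction': {'N': '0.87'},
    },
)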

I set AWS_DYNAMO_REGION as an environment variable in our deployment manifest. We have multiple deployments, each specifying a different region. Pretty simple, right? Not exactly.

apiVersion: apps/v1
kind: Deployment
...
...
env:
  - name: AWS_DYNAMO_REGION
    value: "ca-central-1"

As we started testing our apps, we discovered that every result was being written to the exact same DynamoDB table in the same region, no matter what we passed as AWS_DYNAMO_REGION. It happened to be the region where our EKS cluster lived, even though we were passing a different region to the client as seen above.

First, I checked to see if the deployment was in fact receiving the AWS_DYNAMO_REGION environment variable. It was.

Next, I exec’d into one of the dev pods to confirm that the environment variable held the expected value. It did.

kubectl exec pod -- sh -c 'echo $AWS_DYNAMO_REGION'

Finally, I listed all the exported environment variables and noticed that AWS_REGION and AWS_DEFAULT_REGION were also set.

kubectl exec pod -- sh -c 'export'
export AWS_DEFAULT_REGION='us-east-1'
export AWS_REGION='us-east-1'
export AWS_DYNAMO_REGION='us-east-2'
...
...
...

Having spent almost the last decade using AWS and boto3, I knew right away this was why every request was hitting the same Dynamo table. From the boto3 docs, if any of these environment variables are set, boto3 will use them in the client globally, even if you set the region_name parameter.
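One quick way to see what the SDK is doing is to print the region-related environment variables alongside the region the client actually resolves to; a minimal sketch:

import os
import boto3

# Compare the region-related environment variables with the region
# the client actually resolved to (exposed as client.meta.region_name).
for var in ('AWS_REGION', 'AWS_DEFAULT_REGION', 'AWS_DYNAMO_REGION'):
    print(var, '=', os.getenv(var))

client = boto3.client('dynamodb', region_name=os.getenv('AWS_DYNAMO_REGION'))
print('client region =', client.meta.region_name)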

But how did these get set, since they do not exist in my Deployment manifest?

After doing a bit of Googling, I came across an AWS article “Diving into IAM Roles for Service Accounts”. Towards the end of the article, there is a section that describes the actions of the mutating webhook:

“The mutating webhook does more than just mount an additional token into the Pod. The mutating webhook also injects environment variables.

These environment variables are used by the AWS SDKs and the CLI when using assuming a role from a Web Identity. For example, see the python boto3 sdk. The SDK in our workload will now use these credentials instead of using the credentials found in the EC2 instance profile.”

Sure enough, therein lie both the AWS_DEFAULT_REGION and AWS_REGION environment variables that now existed in the Pods. As mentioned above, these environment variables, if defined, will override any argument passed to the client.

Is there a way to prevent the webhook from injecting these environment variables? It turns out there is, and it’s extremely simple: just set these environment variables yourself, and the webhook will not set them automatically.

apiVersion: apps/v1
kind: Deployment
...
...
...
env:
  - name: AWS_DYNAMO_REGION
    value: "us-east-2"
  - name: AWS_REGION
    value: "us-east-2"

It turns out that if you set AWS_REGION, the webhook doesn’t seem to set the AWS_DEFAULT_REGION environment variable either, so we only need to set AWS_REGION in our case.

Conclusion

IRSA is a really awesome feature when using EKS. Unfortunately, I lost about a day reading through code and making changes to our app trying to resolve the issue, when the culprit ended up being a mutating webhook. This was my first frustrating moment using IRSA, and it ended up being a really great learning experience. I wanted to share it with others, as I have yet to find any articles, blog posts, or Stack Overflow threads where someone was experiencing the same trouble. Hopefully I can spare someone a day’s worth of troubleshooting and help them better understand how IRSA works as part of EKS.
