Stories by Jonathan Merlevede on Medium

Using AWS IAM with STS as an Identity Provider

Jonathan Merlevede — Wed, 03 Dec 2025 15:05:13 GMT

How EKS tokens are created, and how we can use the same technique to use AWS IAM as an identity provider.

Also published on welw.it

I recently tried to connect to an AWS EKS cluster from Python code in an environment that did not have the aws CLI installed, leaving me without a way to retrieve tokens using aws eks get-token. Looking for a Boto call or AWS API call for EKS tokens yielded no results. I decided to look at how these tokens are generated, and as it turns out, the bearer tokens authenticating you to EKS are pre-signed calls to the AWS STS API — specifically for the GetCallerIdentity endpoint.

Pre-signing calls to GetCallerIdentity lets you use IAM credentials to generate an identity token that works for authenticating to EKS and other contexts. Let’s dive in!

Relationship between EKS and STS

How we usually authenticate to EKS

When using EKS, we typically create a cluster and then run aws eks update-kubeconfig to update our kubeconfig file as described in the AWS documentation.

For example, if we have a cluster named confused-blues-mushroom, we can run:

aws eks update-kubeconfig --name confused-blues-mushroom

This updates your ~/.kube/config file with an entry for the cluster, looking like so:

apiVersion: v1
kind: Config
preferences: {}
current-context: arn:aws:eks:eu-west-1:299641483789:cluster/confused-blues-mushroom
clusters:
  - cluster:
      certificate-authority-data: 
      server: 
    name: arn:aws:eks:::cluster/confused-blues-mushroom
users:
  - name: arn:aws:eks:::cluster/confused-blues-mushroom
    user:
      exec:
        apiVersion: client.authentication.k8s.io/v1beta1
        command: aws
        args:
          - --region
          - 
          - eks
          - get-token
          - --cluster-name
          - confused-blues-mushroom
          - --output
          - json
contexts:
  - context:
      cluster: arn:aws:eks:::cluster/confused-blues-mushroom
      user: arn:aws:eks:::cluster/confused-blues-mushroom
    name: arn:aws:eks:::cluster/confused-blues-mushroom

The config file defines your cluster, a user (~credentials), and a context that ties the two together.

A closer look at the user token

The user entry tells us that we can obtain credentials for the cluster by running the following command:

aws --region  eks \
  get-token \
  --cluster-name confused-blues-mushroom \
  --output json

Doing so yields something like this:

{
  "kind": "ExecCredential",
  "apiVersion": "client.authentication.k8s.io/v1beta1",
  "spec": {},
  "status": {
    "expirationTimestamp": "2025-11-25T22:38:50Z",
    "token": "k8s-aws-v1.aHR0cHM6Ly9zdHMuZXUtd2VzdC0xLmFtYXpvbmF3cy5jb20vP0FjdGlvbj1HZXRDYWxsZXJJZGVudGl0eSZWZXJzaW9uPTIwMTEtMDYtMTUmWC1BbXotQWxnb3JpdGhtPUFXUzQtSE1BQy1TSEEyNTYmWC1BbXotQ3JlZGVudGlhbD1BU0lBVUxSQUdHSUdVQ042UFBJUyUyRjIwMjUxMTI1JTJGZXUtd2VzdC0xJTJGc3RzJTJGYXdzNF9yZXF1ZXN0JlgtQW16LURhdGU9MjAyNTExMjVUMjIyNDUwWiZYLUFtei1FeHBpcmVzPTYwJlgtQW16LVNpZ25lZEhlYWRlcnM9aG9zdCUzQngtazhzLWF3cy1pZCZYLUFtei1TZWN1cml0eS1Ub2tlbj1JUW9KYjNKcFoybHVYMlZqRUs3JTJGJTJGJTJGJTJGJTJGJTJGJTJGJTJGJTJGJTJGd0VhQ1dWMUxYZGxjM1F0TVNKSU1FWUNJUUR3c2l2cHdJdDVTVzdnZVFvV3F4TXA4UndZM1k0UFRYSmQ3ZFFBUktOZmhBSWhBTVo3Q1lPR3YlMkZjTEhDZ29CVVFVYWhxQlcwbllmT250RmxuaThuRGQyJTJCREdLcU1EQ0hjUUFob01Nams1TmpReE5EZ3pOemc1SWd6TiUyRmFYamlENkVlV3BpemdRcWdBUEV6b3hRQmdhbWZlQ3FaSEFnM3h3MndZSmExQmlGdHVxQWgyUWdCd2VMQ0xKJTJCY1U1WjhXQzJxTnFwcFN5QTRDSWVaVXJLY2xFUWEwSVFtTFVwMHA1QUZUWmZOV2ZwYXJEZEt5dldTY2Zzd3RNR0pEanBLem1TQlh5UE9FeVlqVlpWdnZVcGJzc2p3TUp1bmRkY09sbjdac2Q5biUyQlBLaTV0JTJCZ1JZbU9hcTFqY285TUMyOVB6WnJrZ3FteERDOCUyRmZHc1k0a1FpVTklMkZndE0lMkJ6JTJCaXZ5YkhFSnV0Z2p5dkhFeG1ncFZmcFJ1d0lEdkFnRXBaTWFUTDNmTVhOczRHSmMwaHVHMWFVMjBNNGNHakg5Z1BVWmpaR2hoY0plYmxNV2dBJTJCV1l4d29XckhpTiUyQnBHNVpwJTJGQUhaV2pONHh3blY1b2Z5UUt4WUl5c0hZQzVsT1hjTWk0bFV0SSUyQnFScHRGVEVsWGpCWUwxd0dmVHFlcHZGSzJhbHVZbGgyU3h2SjhjTTYxaUF2bnZkOU5ac2slMkZsWWROSUZjZUlBVXJleDZWTHdDcXc5UmQxcFd6Znk5N1NKZUVQTzJVY1YxZk5DckZCSW5RJTJGZllmVjNCTk5EdlhlYnoxVURvbHZwcDZvVE84MVJySVowUDZlRWpLNkcxVGVrNUgxVzdUSVh2TTFQeVFmYlF3eXNXWXlRWTZvd0hjVldMTVlmY1AlMkIlMkZEQ0VEQ042RXVCZTRBNnpiZWNlMzRmbEdQNWlVTG1HJTJCVDQ1TkZlOFNmeU9KV01JR2xoRnpyMXhoVHVPRUZYY2hnQ0NJaE9GNTZiSzJvWENGZnZxQSUyRnJSNzlMNHdxaGlGeXZ4WEx5YlBia2tMV25wa1ElMkJLdEpIb1ZNJTJGRlFwdEh2dlJZSFQyMndlTElnV1JHZWNHRFhsS3hndGZLdExnV213aWVjNWlWd1BUb1NmUWxCViUyQjhvS3pkRWtQTXo2UCUyQlgyQ09HSzVESEpJNHpBeiZYLUFtei1TaWduYXR1cmU9MDAxYzZkZmY3YjFkNDAxZGQzMjlmODQwZTZhYjIxY2RjYzE4OGJmYTU2YTg3ZDBkMTdjYTc1NGY4YjE1M2VmZA"
  }
}

Investigating the token further, we see that it consists of a prefix k8s-aws-v1., followed by a URL-safe base64-encoded string. Decoding this string, we get a pre-signed URL for the GetCallerIdentity API call:

echo "aHR0cHM6Ly9zdHMuYW1hem9uYXdzLmNvbS8_QWN0aW9uPUdldENhbGxlcklkZW50aXR5JlZlcnNpb249MjAxMS0wNi0xNSZYLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFTSUFVTFJBR0dJR1VDTjZQUElTJTJGMjAyNTExMjUlMkZ1cy1lYXN0LTElMkZzdHMlMkZhd3M0X3JlcXVlc3QmWC1BbXotRGF0ZT0yMDI1MTEyNVQyMjUxMzhaJlgtQW16LUV4cGlyZXM9NjAmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0JTNCeC1rOHMtYXdzLWlkJlgtQW16LVNlY3VyaXR5LVRva2VuPUlRb0piM0pwWjJsdVgyVmpFSzclMkYlMkYlMkYlMkYlMkYlMkYlMkYlMkYlMkYlMkZ3RWFDV1YxTFhkbGMzUXRNU0pJTUVZQ0lRRHdzaXZwd0l0NVNXN2dlUW9XcXhNcDhSd1kzWTRQVFhKZDdkUUFSS05maEFJaEFNWjdDWU9HdiUyRmNMSENnb0JVUVVhaHFCVzBuWWZPbnRGbG5pOG5EZDIlMkJER0txTURDSGNRQWhvTU1qazVOalF4TkRnek56ZzVJZ3pOJTJGYVhqaUQ2RWVXcGl6Z1FxZ0FQRXpveFFCZ2FtZmVDcVpIQWczeHcyd1lKYTFCaUZ0dXFBaDJRZ0J3ZUxDTEolMkJjVTVaOFdDMnFOcXBwU3lBNENJZVpVcktjbEVRYTBJUW1MVXAwcDVBRlRaZk5XZnBhckRkS3l2V1NjZnN3dE1HSkRqcEt6bVNCWHlQT0V5WWpWWlZ2dlVwYnNzandNSnVuZGRjT2xuN1pzZDluJTJCUEtpNXQlMkJnUlltT2FxMWpjbzlNQzI5UHpacmtncW14REM4JTJGZkdzWTRrUWlVOSUyRmd0TSUyQnolMkJpdnliSEVKdXRnanl2SEV4bWdwVmZwUnV3SUR2QWdFcFpNYVRMM2ZNWE5zNEdKYzBodUcxYVUyME00Y0dqSDlnUFVaalpHaGhjSmVibE1XZ0ElMkJXWXh3b1dySGlOJTJCcEc1WnAlMkZBSFpXak40eHduVjVvZnlRS3hZSXlzSFlDNWxPWGNNaTRsVXRJJTJCcVJwdEZURWxYakJZTDF3R2ZUcWVwdkZLMmFsdVlsaDJTeHZKOGNNNjFpQXZudmQ5TlpzayUyRmxZZE5JRmNlSUFVcmV4NlZMd0NxdzlSZDFwV3pmeTk3U0plRVBPMlVjVjFmTkNyRkJJblElMkZmWWZWM0JOTkR2WGViejFVRG9sdnBwNm9UTzgxUnJJWjBQNmVFaks2RzFUZWs1SDFXN1RJWHZNMVB5UWZiUXd5c1dZeVFZNm93SGNWV0xNWWZjUCUyQiUyRkRDRURDTjZFdUJlNEE2emJlY2UzNGZsR1A1aVVMbUclMkJUNDVORmU4U2Z5T0pXTUlHbGhGenIxeGhUdU9FRlhjaGdDQ0loT0Y1NmJLMm9YQ0ZmdnFBJTJGclI3OUw0d3FoaUZ5dnhYTHliUGJra0xXbnBrUSUyQkt0SkhvVk0lMkZGUXB0SHZ2UllIVDIyd2VMSWdXUkdlY0dEWGxLeGd0Zkt0TGdXbXdpZWM1aVZ3UFRvU2ZRbEJWJTJCOG9LemRFa1BNejZQJTJCWDJDT0dLNURISkk0ekF6JlgtQW16LVNpZ25hdHVyZT1kYmFjNGQ3MzM1NTU1ODllYWRkMTVhZGZiOGI4MGVkZmNkMjE2YzQ1MmQxZWM3MDEwNmNkNjUwNmViMWY0ZTUz" \
| basenc -d --base64url
# Returns:
# https://sts.amazonaws.com/?Action=GetCallerIdentity&Version=2011-06-15&X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=ASIAULRAGGIGUCN6PPIS%2F20251125%2Fus-east-1%2Fsts%2Faws4_request&X-Amz-Date=20251125T225138Z&X-Amz-Expires=60&X-Amz-SignedHeaders=host%3Bx-k8s-aws-id&X-Amz-Security-Token=IQoJb3JpZ2luX2VjEK7%2F%2F%2F%2F%2F%2F%2F%2F%2F%2FwEaCWV1LXdlc3QtMSJIMEYCIQDwsivpwIt5SW7geQoWqxMp8RwY3Y4PTXJd7dQARKNfhAIhAMZ7CYOGv%2FcLHCgoBUQUahqBW0nYfOntFlni8nDd2%2BDGKqMDCHcQAhoMMjk5NjQxNDgzNzg5IgzN%2FaXjiD6EeWpizgQqgAPEzoxQBgamfeCqZHAg3xw2wYJa1BiFtuqAh2QgBweLCLJ%2BcU5Z8WC2qNqppSyA4CIeZUrKclEQa0IQmLUp0p5AFTZfNWfparDdKyvWScfswtMGJDjpKzmSBXyPOEyYjVZVvvUpbssjwMJunddcOln7Zsd9n%2BPKi5t%2BgRYmOaq1jco9MC29PzZrkgqmxDC8%2FfGsY4kQiU9%2FgtM%2Bz%2BivybHEJutgjyvHExmgpVfpRuwIDvAgEpZMaTL3fMXNs4GJc0huG1aU20M4cGjH9gPUZjZGhhcJeblMWgA%2BWYxwoWrHiN%2BpG5Zp%2FAHZWjN4xwnV5ofyQKxYIysHYC5lOXcMi4lUtI%2BqRptFTElXjBYL1wGfTqepvFK2aluYlh2SxvJ8cM61iAvnvd9NZsk%2FlYdNIFceIAUrex6VLwCqw9Rd1pWzfy97SJeEPO2UcV1fNCrFBInQ%2FfYfV3BNNDvXebz1UDolvpp6oTO81RrIZ0P6eEjK6G1Tek5H1W7TIXvM1PyQfbQwysWYyQY6owHcVWLMYfcP%2B%2FDCEDCN6EuBe4A6zbece34flGP5iULmG%2BT45NFe8SfyOJWMIGlhFzr1xhTuOEFXchgCCIhOF56bK2oXCFfvqA%2FrR79L4wqhiFyvxXLybPbkkLWnpkQ%2BKtJHoVM%2FFQptHvvRYHT22weLIgWRGecGDXlKxgtfKtLgWmwiec5iVwPToSfQlBV%2B8oKzdEkPMz6P%2BX2COGK5DHJI4zAz&X-Amz-Signature=dbac4d733555589eadd15adfb8b80edfcd216c452d1ec70106cd6506eb1f4e53
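The same decoding step can be sketched in Python using only the standard library. The function name decode_eks_token is made up for this illustration; the only subtlety is that the token carries no base64 padding, so it has to be restored before decoding.

```python
import base64

TOKEN_PREFIX = "k8s-aws-v1."


def decode_eks_token(token: str) -> str:
    """Return the pre-signed STS URL embedded in an EKS bearer token."""
    if not token.startswith(TOKEN_PREFIX):
        raise ValueError("not an EKS token: missing 'k8s-aws-v1.' prefix")
    payload = token[len(TOKEN_PREFIX):]
    # The token is stripped of base64 padding; add it back so that the
    # length is a multiple of four again.
    payload += "=" * (-len(payload) % 4)
    return base64.urlsafe_b64decode(payload).decode("utf-8")
```

Passing the token value from the ExecCredential output to this function should yield a URL like the one printed by basenc.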

Making a straightforward GET request to this URL returns something like this, which could be interpreted as the pre-signed URL being invalid:

<ErrorResponse>
  <Error>
    <Type>Sender</Type>
    <Code>SignatureDoesNotMatch</Code>
    <Message>The request signature we calculated does not match the signature you provided. Check your AWS Secret Access Key and signing method. Consult the service documentation for details.</Message>
  </Error>
  <RequestId>1d84958b-0ed3-4491-a74b-dbc8c0a3c10a</RequestId>
</ErrorResponse>

Inspection of the pre-signed URL reveals the parameter X-Amz-SignedHeaders=host%3Bx-k8s-aws-id, which tells us that the x-k8s-aws-id header should be included in the request. Assuming that $presigned is the pre-signed URL, the command

curl -H "x-k8s-aws-id: confused-blues-mushroom" \
-H "accept: application/json" \
"$presigned"

returns something like:

{
  "GetCallerIdentityResponse": {
    "GetCallerIdentityResult": {
      "Account": "",
      "Arn": "arn:aws:sts:::assumed-role//",
      "UserId": "AROAULRAGGIG6OJUH7R6U:jonathan.merlevede@dataminded.com"
    },
    "ResponseMetadata": {
      "RequestId": "38591f47-fd34-4145-bc81-33047c54e44a"
    }
  }
}

If you receive the EKS token, you can decode it and call the embedded pre-signed URL. The response tells you a lot about the identity of the caller: it contains the role session ARN (arn:aws:sts:::assumed-role//) and the UserId, which combines the unique ID of the assumed role with the role session name (docs).

Understanding all of this is helpful for several reasons. You now know that:

It should be easy to replace the aws CLI with a more lightweight alternative.
You can create your own token generator fairly easily, which can be useful in some environments where the aws CLI is not available — like Lambda functions.
You can use this technique to support authentication using AWS IAM credentials in your own services.

Lightweight AWS alternative

The aws CLI is a rather heavyweight dependency if all you use it for is token creation. You can use a lighter alternative instead, such as aws-iam-authenticator. Its GitHub page does a pretty good job of explaining the above process, too.

To use aws-iam-authenticator instead of aws, install it and adapt the user entry in your kubeconfig file as follows:

users:
  - name: arn:aws:eks:::cluster/confused-blues-mushroom
    user:
      exec:
        apiVersion: client.authentication.k8s.io/v1beta1
        command: aws-iam-authenticator
        args:
          - token
          - -i
          - confused-blues-mushroom

Indeed, the output of the aws-iam-authenticator command has the same structure as the output of the aws eks get-token command: an ExecCredential object that carries the token.

Generating your own tokens

You can also generate tokens yourself. It helps if you can use a library to handle the heavy lifting, which in this case is the AWS Signature Version 4 (SigV4) signing.

The README documentation of aws-iam-authenticator provides a great example of how to do this using Python (link):

import base64
import re

import boto3
from botocore.signers import RequestSigner


def get_bearer_token(cluster_id, region):
    # Lifetime of the pre-signed URL, in seconds
    STS_TOKEN_EXPIRES_IN = 60
    session = boto3.session.Session()
    client = session.client('sts', region_name=region)
    service_id = client.meta.service_model.service_id
    # Signer that applies SigV4 with the session's credentials
    signer = RequestSigner(
        service_id,
        region,
        'sts',
        'v4',
        session.get_credentials(),
        session.events
    )
    # The request to pre-sign: GetCallerIdentity, tied to the cluster
    # through the signed x-k8s-aws-id header
    params = {
        'method': 'GET',
        'url': 'https://sts.{}.amazonaws.com/?Action=GetCallerIdentity&Version=2011-06-15'.format(region),
        'body': {},
        'headers': {
            'x-k8s-aws-id': cluster_id
        },
        'context': {}
    }
    signed_url = signer.generate_presigned_url(
        params,
        region_name=region,
        expires_in=STS_TOKEN_EXPIRES_IN,
        operation_name=''
    )
    base64_url = base64.urlsafe_b64encode(signed_url.encode('utf-8')).decode('utf-8')
    # remove any base64 encoding padding:
    return 'k8s-aws-v1.' + re.sub(r'=*', '', base64_url)

A token generated by this function can be used as a bearer token in calls to the Kubernetes API.
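If you want kubectl to pick up such a token through the exec mechanism in your kubeconfig, the token has to be wrapped in the ExecCredential structure shown earlier. The following is a minimal sketch of that wrapping; the function name exec_credential is hypothetical, and which expiry you report is your own choice (it is passed in rather than assumed here).

```python
import json
from datetime import datetime, timezone


def exec_credential(token: str, expires_at: datetime) -> str:
    """Wrap a bearer token in the ExecCredential JSON that kubectl reads.

    expires_at must be a timezone-aware datetime.
    """
    timestamp = expires_at.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return json.dumps(
        {
            "kind": "ExecCredential",
            "apiVersion": "client.authentication.k8s.io/v1beta1",
            "spec": {},
            "status": {"expirationTimestamp": timestamp, "token": token},
        }
    )
```

A script that prints this JSON to standard output could then take the place of the aws command in the exec section of the kubeconfig user entry.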

Supporting IAM authentication in your own services

You can use this technique to support IAM authentication in your own services. That’s also the idea behind aws-iam-authenticator, which allows you to add IAM authentication to self-managed Kubernetes clusters.

In fact, aws-iam-authenticator even predates Amazon EKS! EKS adopted the authentication approach introduced by aws-iam-authenticator, standardizing it.

The mechanics are straightforward:

The header(s) prefixed with x- that you add to your call to AWS STS ensure that your pre-signed URL is used only in the context of the service that you are targeting (e.g., a specific EKS cluster). They serve as what would be known as your token’s aud claim in OIDC or your assertion’s audience restriction in SAML.
On the protected resource side, validate incoming tokens by calling the pre-signed URL you receive with the appropriate headers. This is not too different from how OAuth with token introspection works.
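The receiving side of these mechanics can be sketched as follows, assuming the pre-signed URL has already been decoded from the token and that the service reuses the x-k8s-aws-id header as its audience header (your own service could sign a different x- header). The function names are made up for this illustration.

```python
import json
import urllib.request


def build_identity_request(presigned_url: str, audience: str) -> urllib.request.Request:
    """Prepare the GET request that replays a pre-signed GetCallerIdentity call."""
    return urllib.request.Request(
        presigned_url,
        method="GET",
        headers={
            # Must equal the value that was signed, or STS rejects the call.
            "x-k8s-aws-id": audience,
            # Ask STS for JSON instead of its default XML.
            "accept": "application/json",
        },
    )


def caller_arn(response_body: bytes) -> str:
    """Extract the caller's ARN from a JSON GetCallerIdentity response."""
    document = json.loads(response_body)
    return document["GetCallerIdentityResponse"]["GetCallerIdentityResult"]["Arn"]
```

Sending the request with urllib.request.urlopen raises urllib.error.HTTPError when STS rejects the signature, which a service would treat as a failed authentication.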

Several services besides EKS use this method. It is, for example, how HashiCorp Vault’s IAM auth method works:

AWS - Auth Methods | Vault | HashiCorp Developer

Note that STS is not the perfect identity provider for several reasons, including but not limited to:

Generating the token is somewhat complicated; it does not follow a “standard” flow (think the OAuth client credentials flow) and requires SigV4 signing.
STS calls are free, but e.g. throttling might become an issue. The default quota allows 600 requests per second.
Having to call the pre-signed URLs for all incoming requests imposes a load on your protected resource. Self-contained tokens such as JWS-encoded tokens (~JWT) are typically better in this regard.
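You will have to validate the incoming pre-signed URL before calling it for security reasons; otherwise, a caller could make your service send requests to arbitrary URLs. At a minimum, check that the URL points to an STS endpoint and requests the GetCallerIdentity action.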
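What validating the incoming URL could look like is sketched below. The exact rules are assumptions made for illustration: the sketch accepts only the global and regional sts.*.amazonaws.com hosts, only the GetCallerIdentity action, and only URLs whose signature covers the audience header. A production service may need stricter or different checks.

```python
import re
from urllib.parse import parse_qs, urlsplit

# Assumption: only the global and regional commercial STS endpoints are valid.
_STS_HOST = re.compile(r"sts(\.[a-z0-9-]+)?\.amazonaws\.com")


def is_acceptable_presigned_url(url: str, required_header: str = "x-k8s-aws-id") -> bool:
    """Return True only for URLs that are safe to replay against STS."""
    try:
        parts = urlsplit(url)
    except ValueError:
        return False
    host = parts.hostname
    if parts.scheme != "https" or host is None:
        return False
    # Reject user info and explicit ports, and anything that is not STS.
    if parts.netloc.lower() != host or not _STS_HOST.fullmatch(host):
        return False
    if parts.path not in ("", "/"):
        return False
    query = parse_qs(parts.query)
    if query.get("Action") != ["GetCallerIdentity"]:
        return False
    # The audience header must be part of the signature.
    signed_headers = query.get("X-Amz-SignedHeaders", [""])[0].split(";")
    return required_header in signed_headers
```

Only when this check passes would the service go on to call the URL with the expected header value.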

Summary

We explored how EKS uses AWS STS to construct bearer tokens for Kubernetes API access by pre-signing calls to GetCallerIdentity. This technique is not limited to EKS — you can use it to add IAM authentication to your own services, just like HashiCorp Vault does. Whether you need to create tokens in environments without the aws CLI or want to build your own IAM-based authentication system, understanding this pattern opens up some interesting possibilities.

Using AWS IAM with STS as an identity Provider was originally published in Dataminded on Medium, where people are continuing the conversation by highlighting and responding to this story.

Overcoming Corporate Distrust

Jonathan Merlevede — Sat, 04 Oct 2025 22:00:32 GMT

How to trust your organization’s self-signed certificates and deal with applications like Zscaler man-in-the-middling traffic

When your enterprise environment intercepts and inspects HTTPS traffic at a proxy, that proxy is operating as a man-in-the-middle (MITM). This is exactly the situation that HTTPS is designed to protect against. To get your applications to play nice with a proxy like Zscaler, you have to force them to trust its root certificate, which is not part of the globally accepted set of root certificates. System administrators or your proxy client (e.g. Zscaler Client Connector) generally ensure that your system’s trust bundle, used by applications such as your browser and the OS itself, includes these root certificates.

What is really happening. Image source

The problem

The presence of your company’s root certificates allows you to perform tasks like browsing the internet. Unfortunately, not all applications use your system’s trust bundle by default, resulting in errors like this:

curl: (60) SSL certificate problem: self-signed certificate in certificate chain

In fact, many scripting runtimes and their libraries (Python, Node.js) do not integrate with the system trust bundle. Instead, they ship with their own bundles. I have spent more time than I would care to admit dealing with related issues.

This post lists some ways to create a trust bundle — preferably by extracting the system’s trust bundle into a PEM file — and how to get different applications to trust it by setting environment variables.

Step 1: Creating a trust bundle

The first step is to create a trust bundle that includes your corporate root certificates. I prefer to do this by extracting the system’s trust bundle into a PEM file. Depending on your environment, you may want to include public root certificates as well. It almost certainly does not hurt to include them.

Every operating system has its own way of storing and accessing the system trust bundle.

On macOS

On macOS, the system’s trust bundle is stored in your system keychains. You can extract it using the following convenient commands:

security find-certificate -a -p /System/Library/Keychains/SystemRootCertificates.keychain > ~/.bundle.pem
security find-certificate -a -p /System/Library/Keychains/System.keychain >> ~/.bundle.pem

It appears to me that the SystemRootCertificates.keychain includes the certificates that ship with macOS, and the System.keychain includes additional certificates added by administrators or enterprise tools.

On Linux

Linux distributions generally store your system’s trust bundle already in a PEM file. All you have to do is figure out its location. This location depends on your specific distribution.

On Debian-based systems (Debian, Ubuntu, Mint, …), it is located at /etc/ssl/certs/ca-certificates.crt.
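On Red Hat-based systems (RHEL, CentOS, Fedora, Amazon Linux), it is located at /etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem. The file /etc/pki/ca-trust/extracted/openssl/ca-bundle.trust.crt in the same tree uses OpenSSL's extended "TRUSTED CERTIFICATE" format, which not every application can read.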

Your OS generally compiles this bundle from a collection of PEM files that are stored in a different location. On Debian/Ubuntu, for example, certificates are stored in /usr/local/share/ca-certificates/, and compiled into a bundle by the update-ca-certificates command. On Red Hat-based systems, certificates are stored in /etc/pki/ca-trust/source/anchors/ and compiled using update-ca-trust.

On Windows and other systems

Like macOS, Windows keeps your system’s trust bundle in a system-managed certificate store rather than in a file. Unlike macOS, it does not provide convenient commands for extracting it into a PEM file.

Some combination of PowerShell commands and/or calls to certutil allows you to extract the system trust bundle. This is non-trivial, and corporate environments often restrict any type of shell access on Windows machines anyway. So on Windows, instead of extracting the system trust bundle, I tend to construct "my own", starting from Python's default certifi trust bundle and appending additional root certificates to it.

Unlike macOS, Windows does not make it easy to extract your system trust bundle.

You can get the certifi trust bundle from its GitHub repository:

curl https://raw.githubusercontent.com/certifi/python-certifi/refs/heads/master/certifi/cacert.pem > ~/.bundle.pem

To get the additional certificates to trust, that is, the certificates used by the corporate proxy, you have several options:

You can ask your admin for it;
You can often find it hosted on SharePoint or Confluence;
You can also fetch it from the “man-in-the-middle” yourself.

I prefer the last option. If openssl is installed, you can use it to inspect served certificates:

hostname=google.com # any other hostname serving the untrusted certificate
openssl s_client -showcerts -connect "$hostname":443

You can craft a script to create a PEM file from this output automatically if you like.
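Such a script could look like the sketch below. It assumes you have captured the text printed by openssl s_client -showcerts in a string; the function names are made up for this illustration. Which of the served certificates you want to trust is your call, so the sketch simply returns all of them in the order they were served.

```python
import re

# A PEM certificate: everything between the BEGIN and END marker lines.
_PEM_BLOCK = re.compile(
    r"-----BEGIN CERTIFICATE-----.+?-----END CERTIFICATE-----",
    re.DOTALL,
)


def extract_pem_certificates(s_client_output: str) -> list[str]:
    """Return every PEM certificate found in the output of openssl s_client."""
    text = s_client_output.replace("\r\n", "\n")
    return [match.group(0) for match in _PEM_BLOCK.finditer(text)]


def add_to_bundle(bundle_text: str, certificates: list[str]) -> str:
    """Append the certificates that are not yet part of the bundle text."""
    for certificate in certificates:
        if certificate not in bundle_text:
            if bundle_text and not bundle_text.endswith("\n"):
                bundle_text += "\n"
            bundle_text += certificate + "\n"
    return bundle_text
```

You would read ~/.bundle.pem, pass its contents through add_to_bundle together with the extracted certificates, and write the result back.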

If OpenSSL is not available, you can inspect the certificate served by the proxy from your browser’s developer tools or navigate to your PC’s trust store and export it from there. You can find some more detailed instructions on how to do this here.

Step 2: Get applications to use the trust bundle

Now that we have a trust bundle, we need to get applications to use it. There are many ways to do this, and they unfortunately depend on the application you are dealing with. The most common and universal way to configure this is by setting environment variables.

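Depending on the application you are dealing with, you may need to set one or more environment variables. The most common ones are SSL_CERT_FILE (OpenSSL-based tools), CURL_CA_BUNDLE (curl), REQUESTS_CA_BUNDLE (Python's requests library), NODE_EXTRA_CA_CERTS (Node.js), CARGO_HTTP_CAINFO (Cargo) and GIT_SSL_CAINFO (Git). Note that OpenSSL's SSL_CERT_DIR is meant to point at a directory of certificates rather than at a single file.

These more or less apply cross-platform. On UNIX systems, you can set them by running the following commands: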

export SSL_CERT_FILE=~/.bundle.pem
export SSL_CERT_DIR=~/.bundle.pem
export CURL_CA_BUNDLE=~/.bundle.pem
export REQUESTS_CA_BUNDLE=~/.bundle.pem
export NODE_EXTRA_CA_CERTS=~/.bundle.pem
export CARGO_HTTP_CAINFO=~/.bundle.pem
export GIT_SSL_CAINFO=~/.bundle.pem

I shamelessly copied this list from the excellent httpjail

You can set these “permanently” by exporting them from your .bashrc, .zshrc or any other shell configuration file.

conffile=~/.bashrc
echo "export SSL_CERT_FILE=~/.bundle.pem" >> "$conffile"
echo "export SSL_CERT_DIR=~/.bundle.pem" >> "$conffile"
echo "export CURL_CA_BUNDLE=~/.bundle.pem" >> "$conffile"
echo "export REQUESTS_CA_BUNDLE=~/.bundle.pem" >> "$conffile"
echo "export NODE_EXTRA_CA_CERTS=~/.bundle.pem" >> "$conffile"
echo "export CARGO_HTTP_CAINFO=~/.bundle.pem" >> "$conffile"
echo "export GIT_SSL_CAINFO=~/.bundle.pem" >> "$conffile"
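If you would rather not change your shell configuration, the same variables can be handed to a single child process instead. The sketch below builds such an environment in Python; the helper name is hypothetical, and SSL_CERT_DIR is left out on purpose because it is meant to point at a directory rather than a bundle file.

```python
import os

# Variables from the list above that accept a path to a PEM bundle file.
CA_BUNDLE_VARIABLES = (
    "SSL_CERT_FILE",
    "CURL_CA_BUNDLE",
    "REQUESTS_CA_BUNDLE",
    "NODE_EXTRA_CA_CERTS",
    "CARGO_HTTP_CAINFO",
    "GIT_SSL_CAINFO",
)


def env_with_trust_bundle(bundle_path, base_env=None):
    """Return a copy of the environment with all bundle variables set."""
    env = dict(os.environ if base_env is None else base_env)
    # Expand "~" here: unlike a shell, a child process will not do it.
    path = os.path.abspath(os.path.expanduser(bundle_path))
    for name in CA_BUNDLE_VARIABLES:
        env[name] = path
    return env
```

The result can be passed as the env argument of subprocess.run, for example subprocess.run(["git", "fetch"], env=env_with_trust_bundle("~/.bundle.pem")).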

On Windows, you can set environment variables from the System UI panel, or you can create a shortcut which sets them locally.

More approaches for different applications can be found here:

https://help.zscaler.com/zia/adding-custom-certificate-application-specific-trust-store

What about Docker?

Applications running inside a container (e.g. Docker) will use the trust bundle that is part of the container image. Essentially all container images come with a trust bundle, even the minimal ones like distroless.

Yes, indeed, this means that your perfect self-contained containerized applications still have a limited shelf life if they connect to the outside internet. Because even root certificates expire, you will have to rebuild at some point, even if the containerized application itself does not change.

To ensure that your containerized application trusts your man-in-the-middle, you will have to either:

bake your custom bundle into the image or
mount your custom bundle into the container, preferably shadowing the system trust bundle that is part of the image.

Which one is most appropriate depends on your use case, but usually mounting is preferred. You may additionally have to set some of the environment variables listed above if you are using applications that do not use the system trust bundle, like we discussed above.

Distroless images contain a trust bundle at /etc/ssl/certs/ca-certificates.crt

For an Ubuntu- or distroless-based image, this means mounting your bundle read-only over /etc/ssl/certs/ca-certificates.crt, for example by passing -v ~/.bundle.pem:/etc/ssl/certs/ca-certificates.crt:ro to docker run.

The process of building an image is itself containerized and may also require you to set similar settings. Docker Desktop generally uses your system’s trust bundle out of the box, but when you use podman on an OS that is not Linux, you will have to update the trust bundle of the (Linux) VM running your containers. You can do this using podman machine ssh commands.

Originally published at https://welw.it on October 5, 2025.

Authorizing AWS Principals on Azure

Jonathan Merlevede — Mon, 22 Sep 2025 15:50:30 GMT

Use AWS IAM user- or session credentials to access Azure resources

How to delegate trust from Entra to AWS IAM through Cognito, authorizing Azure actions without needing long-lived credentials.

AWS IAM principals can be granted access to AWS resources through AWS IAM policies. Unfortunately, those policies do not carry weight outside of AWS, and certainly not within Azure. But what if you do want to access Azure or Entra resources from AWS? How then can you assign Azure privileges to AWS IAM roles and users?

Azure does not recognize your IAM Principal’s authority (image source).

On Azure, not AWS IAM but another service rules the roost. Microsoft Entra ID — the product formerly known as Azure Active Directory — manages user identities, apps, and access to Microsoft resources, including Azure, Microsoft 365, and even other applications that support Microsoft Entra ID (source). The latter notably do not include AWS applications.

This post shows how to use AWS Cognito as a bridge to generate OIDC tokens using AWS IAM-derived privileges, and how to exchange those for Microsoft Identity Platform tokens authorizing Azure (Entra) actions. We show how to set up the required infrastructure using Terraform.

For those seeking a deeper understanding of authentication, this post also includes a second part that presents the classical approach to machine-to-machine (M2M) authentication, explores what it means to trust, and touches on what it takes to reverse the scenario — setting up AWS to trust Entra principals.

Workload Identity Federation

You can configure Entra application registrations to trust a third-party OpenID Provider (OP). Entra refers to this as “Federated Credentials” and also as “Workload Identity Federation”.

After configuring Entra to trust a third-party OP, the JWT tokens issued by the third party allow obtaining Microsoft tokens through the client credentials flow with token assertions (see below, section Certificate-Based Authentication). You do not pin secrets or public keys inside Entra; instead, Microsoft validates the tokens you present to it using public keys it retrieves from the OIDC well-known endpoint you configure (see below, section OpenID Connect). Private key rotation is part of the protocols, and OPs generally rotate signing keys automatically.

Setting up AWS and Entra for trust delegation

We can configure Entra to trust a third-party OP. Unfortunately, AWS STS is not an OP, and AWS credentials are not OIDC JWT tokens. Luckily, AWS offers an OIDC-compatible identity service in the form of Cognito User Pools, and we can protect access to it using AWS IAM.

We demonstrate how to configure user pools to permit only IAM-protected login flows, allowing the exchange of AWS credentials for short-lived Cognito JWT tokens. We then show how to configure Entra to trust Cognito. Finally, we present the whole flow, culminating in Microsoft identity platform bearer tokens.

Cognito

Cognito allows users to obtain JWT tokens in several ways:

Cognito supports OAuth flows, including the client credentials grant and the authorization code grant.

Additionally, three Cognito AWS API calls result in JWT tokens:

The InitiateAuth AWS API call, also known as Cognito’s “client-side authentication flow”. This call is public and not protected by AWS IAM.
The GetTokensFromRefreshToken AWS API call, which is also public and not protected by AWS IAM.
The AdminInitiateAuth AWS API call, also known as Cognito’s “server-side authentication flow”. This call is protected by AWS IAM.

To the extent possible, we will disable OAuth flows and flows compatible with InitiateAuth at the Cognito client level, as these are not IAM-protected.

AWS Infrastructure

On the AWS side, create a Cognito user pool, register a Cognito application, and create a Cognito user. All these resources are free for the given settings.

User pools require only minimal configuration. You can create one through Terraform as follows:

resource "aws_cognito_user_pool" "this" {
  name             = "demo"
  alias_attributes = ["preferred_username"]
  admin_create_user_config {
    allow_admin_create_user_only = true
  }
}

Cognito clients require more configuration. At the time of writing, setting up a client through the Console web UI always results in at least one enabled OAuth flow, which cannot be disabled. The Cognito API, however, allows the creation of an application that allows only the AdminInitiateAuth API call as desired. You can create such an application using Terraform as follows:

resource "aws_cognito_user_pool_client" "this" {
  name            = "demo"
  user_pool_id    = aws_cognito_user_pool.this.id
  generate_secret = false
  allowed_oauth_flows_user_pool_client = false
  allowed_oauth_flows                  = []
  enable_token_revocation              = false
  explicit_auth_flows = ["ALLOW_ADMIN_USER_PASSWORD_AUTH"]
  id_token_validity      = 60
  access_token_validity  = 60
  refresh_token_validity = 60
  token_validity_units {
    id_token      = "minutes"
    access_token  = "minutes"
    refresh_token = "minutes"
  }
}

Despite not allowing the REFRESH_TOKEN_AUTH auth flow here, at the time of writing AdminInitiateAuth always returns a refresh token, which can then be used together with the unprotected InitiateAuth and/or the GetTokensFromRefreshToken API calls, depending on the refresh token rotation configuration. We set the lifetime of refresh tokens to its minimum value (1 hour) to limit the impact of this weird behavior.

Note that we do not create a Cognito M2M application supporting the client credentials flow. Firstly, this results in a client secret, which, although easily rotated, is not what we want here. Secondly, calls to Cognito’s token endpoint are not IAM-protected; again, this is not what we look for in this post. Thirdly, Cognito M2M applications are not free.

Next, we need a Cognito user. As this user is only able to obtain tokens using the IAM-protected AdminInitiateAuth call and not through password grants or InitiateAuth, its password only serves as an unnecessary second factor and does not have to remain secret:

resource "aws_cognito_user" "this" {
  user_pool_id = aws_cognito_user_pool.this.id
  username     = "dummyuser"
  password     = "dummyPassword1!"
}

Finally, to be able to call AdminInitiateAuth, your IAM principal needs the following permission:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "cognito-idp:AdminInitiateAuth",
            "Resource": "arn:aws:cognito-idp:REGION:ACCOUNT_ID:userpool/"
        }
    ]
}

If you have multiple Azure applications and want to scope access to specific ones through AWS IAM, consider creating one user pool, client, and user for each one. The AdminInitiateAuth IAM action does not support conditions to limit it to specific client IDs or users. There is a generous limit of 1000 user pools per region (can be increased to 10000).

Obtaining AWS Bearer Tokens

If you are authenticated to AWS as a principal authorized to perform AdminInitiateAuth, obtain tokens as follows:

aws cognito-idp admin-initiate-auth \
--region eu-west-1 \
--user-pool-id "$user_pool_id" \
--client-id "$client_id" \
--auth-flow ADMIN_USER_PASSWORD_AUTH \
--auth-parameters 'USERNAME=dummyuser,PASSWORD=dummyPassword1!'

This returns something like:

{
  "ChallengeParameters": {},
  "AuthenticationResult": {
    "AccessToken": "",
    "ExpiresIn": 3600,
    "TokenType": "Bearer",
    "RefreshToken": "",
    "IdToken": ""
  }
}
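From Python, the same call is available through boto3 as admin_initiate_auth on the cognito-idp client. The sketch below takes the client as a parameter so that it does not depend on how you create your session; the function name and its parameters are made up for this illustration.

```python
def get_cognito_id_token(client, user_pool_id, client_id, username, password):
    """Call AdminInitiateAuth and return the identity token from the response.

    client is expected to be a boto3 "cognito-idp" client, for example
    boto3.client("cognito-idp", region_name="eu-west-1").
    """
    response = client.admin_initiate_auth(
        UserPoolId=user_pool_id,
        ClientId=client_id,
        AuthFlow="ADMIN_USER_PASSWORD_AUTH",
        AuthParameters={"USERNAME": username, "PASSWORD": password},
    )
    return response["AuthenticationResult"]["IdToken"]
```

The call is signed with the IAM credentials of the boto3 session, so it only succeeds for principals that are allowed to perform cognito-idp:AdminInitiateAuth on the user pool.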

The access tokens retrieved using AdminInitiateAuth always have scope aws.cognito.signin.user.admin and no audience. Their payload looks as follows:

{
  "sub": "",
  "event_id": "",
  "token_use": "access",
  "scope": "aws.cognito.signin.user.admin",
  "auth_time": 1756773256,
  "iss": "https://cognito-idp..amazonaws.com/",
  "exp": 1756773556,
  "iat": 1756773256,
  "jti": "",
  "client_id": "",
  "username": "dummyuser"
}

Requesting custom scopes with Cognito is possible only when requesting tokens through OAuth flows.

As is suggested in trust delegation scenarios, we will not use the access token. Instead, we will exchange the identity token for a Microsoft/Entra JWT token. The Cognito identity token has the Cognito client ID as the audience. The decoded payload of an identity token looks as follows:

{
  "sub": "",
  "aud": "",
  "event_id": "",
  "token_use": "id",
  "auth_time": 1756771643,
  "iss": "https://cognito-idp..amazonaws.com/",
  "cognito:username": "dummyuser",
  "exp": 1756775243,
  "iat": 1756771643
}
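To inspect such payloads yourself, you can decode the middle segment of the token, which is base64url-encoded JSON. This is a small sketch with a made-up function name; it does not verify the signature and is therefore only suitable for debugging.

```python
import base64
import json


def jwt_payload(token: str) -> dict:
    """Decode the payload of a JWT without verifying its signature."""
    segments = token.split(".")
    if len(segments) != 3:
        raise ValueError("expected three dot-separated segments")
    payload = segments[1]
    # JWT segments omit base64 padding; restore it before decoding.
    payload += "=" * (-len(payload) % 4)
    return json.loads(base64.urlsafe_b64decode(payload))
```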

Configuring Trust Relationship

Now it is time to create an Entra application and configure it to trust our Cognito tokens (“federated credentials”). You can do so using the UI. The code below creates an application and configures trust using Terraform:

resource "azuread_application" "this" {
  display_name = "blogpost"
}

resource "azuread_application_federated_identity_credential" "this" {
  application_id = azuread_application.this.id
  display_name   = "cognito"
  description    = "Trust Cognito"
  audiences      = [aws_cognito_user_pool_client.this.id]
  issuer         = "https://${aws_cognito_user_pool.this.endpoint}"
  subject        = aws_cognito_user.this.sub
}

This configures Entra only to accept specific tokens:

Tokens from your Cognito user pool, thanks to the issuer setting. Under the hood, Entra queries https://${aws_cognito_user_pool.this.endpoint}/.well-known/openid-configuration and retrieves public keys (JWKS) for validating signatures.
Tokens authenticating your specific Cognito user, thanks to the subject setting.
Identity tokens that were generated for your application/client, thanks to the audience setting.
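Conceptually, the three checks listed above boil down to comparing claims. The sketch below is an illustration of that idea, not Entra's actual implementation; signature and expiry validation are deliberately left out, and the function name is hypothetical.

```python
def matches_federated_credential(claims: dict, issuer: str, subject: str,
                                 audiences: list[str]) -> bool:
    """Check the iss, sub and aud claims against a federated credential."""
    aud = claims.get("aud")
    # The aud claim may be a single string or a list of strings.
    token_audiences = aud if isinstance(aud, list) else [aud]
    return (
        claims.get("iss") == issuer
        and claims.get("sub") == subject
        and any(a in audiences for a in token_audiences)
    )
```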

Obtaining tokens

Defining getcreds as an alias for the aws cognito-idp admin-initiate-auth command from before, you can now obtain Microsoft identity platform bearer tokens as follows (reference):

TENANT=""
AZURE_CLIENT_ID=""
COGNITO_TOKEN="$(getcreds | jq .AuthenticationResult.IdToken -r )"
TOKEN_URL="https://login.microsoftonline.com/$TENANT/oauth2/v2.0/token"

curl -X POST "$TOKEN_URL" \
-H "Content-Type: application/x-www-form-urlencoded" \
-d "client_id=$AZURE_CLIENT_ID" \
-d "grant_type=client_credentials" \
-d "scope=https://graph.microsoft.com/.default" \
-d "client_assertion_type=urn:ietf:params:oauth:client-assertion-type:jwt-bearer" \
--data-urlencode "client_assertion=$COGNITO_TOKEN"
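The same request can be made from Python using only the standard library. This is a sketch mirroring the curl call above; the function names are illustrative, and reading `access_token` from the response assumes the standard OAuth token response format.

```python
import json
import urllib.parse
import urllib.request


def build_token_request(tenant: str, azure_client_id: str, cognito_token: str,
                        scope: str = "https://graph.microsoft.com/.default"
                        ) -> urllib.request.Request:
    """Build the client-credentials request that carries the Cognito token."""
    url = f"https://login.microsoftonline.com/{tenant}/oauth2/v2.0/token"
    form = {
        "client_id": azure_client_id,
        "grant_type": "client_credentials",
        "scope": scope,
        "client_assertion_type":
            "urn:ietf:params:oauth:client-assertion-type:jwt-bearer",
        "client_assertion": cognito_token,
    }
    # urlencode takes care of escaping, like curl's --data-urlencode.
    body = urllib.parse.urlencode(form).encode("ascii")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


def fetch_entra_token(request: urllib.request.Request) -> str:
    """Send the request and return the bearer token from the JSON response."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)["access_token"]
```

Splitting request construction from sending keeps the first function free of network access, which makes it easy to inspect or test.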

If you want to understand better what happens here under the hood, or learn about alternate ways of establishing trust, read along.

Classical Approach to Trust

Establishing trust and authority with a sceptical party like Entra is always done by demonstrating to it that you have or know something that only an authorized actor should have or know.

In classical machine-to-machine interactions with Entra, this “secret something” is one of two things:

A client ID and client secret. These work like classical usernames and passwords. Entra knows your secrets, and you show that you do too by a straightforward “show of hands”.
A certificate or public/private key pair. In this case, the public part of the secret is registered with Entra. You present Entra with a secret value derived from your private key. Entra can then validate that you are in possession of the private key using the corresponding public key.

To bridge AWS IAM and Entra, you store the “secret somethings” in a location protected by AWS IAM, e.g. in AWS Secrets Manager, in the Systems Manager (SSM) Parameter Store or even as an object on S3. Then, to make authorized calls to Azure resources, you retrieve and use them.

Certificates and private keys are considered more secure than client ID/secret pairs or username/password combinations because the private keys themselves are never transmitted, and the secrets derived from them are short-lived.

The main problem with this basic approach is that “secret somethings” are usually long-lived, with rotation remaining a manual, infrequent, error-prone, and even forgotten-about process. Automatic rotation can alleviate this. We will get back to this below.

Certificate-Based Authentication

Before moving on, let’s dive a bit deeper into how certificates are used with Azure and Entra ID. This helps understand how similar the flow with federated credentials is.

To establish trust using certificates, you request Entra bearer tokens using the OAuth client credentials grant (RFC 6749) with JWT token assertions (RFC 7523). It works as follows:

Generate an X.509 certificate. Entra requires you to pin specific certificates, so public key infrastructure (PKI) does not apply. You can and probably should use a self-signed certificate.
Register your certificate’s public key in the application registration.
Construct a JWT token that will serve as an assertion. Sign the JWT token using your certificate’s private key.
Obtain an Entra bearer token from Microsoft identity platform using the client credentials flow — that is, make a request to Microsoft’s token endpoint, exchanging the “assertion token” you signed for one that Microsoft signed.

You probably do not want to code this flow yourself; instead opt to use a library, such as Microsoft’s Authentication Library (MSAL) for Python.
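Purely to illustrate what such a library assembles in step 3, the sketch below builds the unsigned part of an assertion (header and claims). The claim set reflects my reading of Microsoft's certificate credential format and should be treated as an assumption; `x5t` stands for the certificate thumbprint, and the actual RS256 signature, which requires a cryptography library, is omitted.

```python
import base64
import json
import time
import uuid


def _b64url(data: bytes) -> str:
    """base64url-encode without padding, as used in JWTs."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def build_signing_input(tenant: str, client_id: str, x5t: str,
                        lifetime_seconds: int = 600) -> str:
    """Return "<header>.<claims>", the string that would be signed."""
    now = int(time.time())
    header = {"alg": "RS256", "typ": "JWT", "x5t": x5t}
    claims = {
        # Assumed claim set: audience is the token endpoint, and the
        # application is both issuer and subject of its own assertion.
        "aud": f"https://login.microsoftonline.com/{tenant}/oauth2/v2.0/token",
        "iss": client_id,
        "sub": client_id,
        "jti": str(uuid.uuid4()),
        "nbf": now,
        "exp": now + lifetime_seconds,
    }
    return ".".join(_b64url(json.dumps(part).encode()) for part in (header, claims))
```

Appending a dot and the base64url-encoded signature over this string would yield the complete assertion token.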

Microsoft identity platform trusts you because it can validate the signature of the assertion token using the public key you gave to Entra ID before. Only someone who knows the corresponding private key can create a valid signature. As a token of this trust (😏), Microsoft gives you a “bearer token”.

The bearer token is again something proving that you are indeed worthy of trust, but this time to different untrusting parties: Azure resources that are trusting of the Microsoft identity platform and nothing else. You can use the MS bearer token to authorize calls to any Entra-protected API by embedding it into the Authorization header of your HTTP calls.

Using federated credentials works in the same way, except the assertion tokens you use are not self-signed JWT tokens, signed using a certificate you pinned in Entra, but rather tokens issued by the trusted OP.

OpenID Connect

At this point, you may wonder how services receiving a Microsoft Identity bearer token or assertion from your federated OP validate its veracity.

Microsoft identity platform is an OpenID Provider (OP), meaning that it complies with the OpenID Connect (OIDC) specification. Like most OPs, it additionally implements the OpenID Connect Discovery protocol. In a nutshell, this means that:

The bearer tokens it issues are JWT tokens, valid only for a limited duration and easily decoded.
The JWT tokens are signed using a private key known only to Microsoft.
The public keys corresponding to the private ones that Microsoft uses for signing its tokens can be retrieved from standardized “well-known” endpoints. Multiple public keys can be acceptable at the same time, allowing for seamless key rotation.

Thanks to standardization, all you need to do to be able to trust tokens issued by the Microsoft identity platform (or any other OP) is know its well-known endpoint, hosted on an HTTPS URL. Thanks to the global PKI and its network of certificate authorities (out of scope here), you know that the information you find there is trustworthy. You retrieve public keys from coordinates you find at the well-known endpoint and can use them to verify bearer tokens. Microsoft can and does rotate its certificates, without it causing any disruptions or requiring changes on the end of its clients.
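The lookup described above can be sketched in two small helpers: one derives the well-known discovery URL from an issuer, the other picks the right public key from a JWKS document using the `kid` found in a token's header. Both function names are illustrative; fetching the documents over HTTPS and verifying the signature are left to a JWT library.

```python
def discovery_url(issuer: str) -> str:
    """Return the OIDC discovery document URL for an issuer."""
    return issuer.rstrip("/") + "/.well-known/openid-configuration"


def select_signing_key(jwks: dict, kid: str) -> dict:
    """Pick the key with the given key ID from a JWKS document.

    Several keys can be published at once, which is what makes
    key rotation seamless for clients.
    """
    for key in jwks.get("keys", []):
        if key.get("kid") == kid:
            return key
    raise KeyError(f"no key with kid {kid!r} in JWKS")
```

The discovery document contains a `jwks_uri` field pointing to the JWKS document, so a verifier fetches the first to find the second.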

Reversed scenario

Because Entra and the Microsoft Identity Platform are OpenID providers and AWS supports what it calls “web identity credentials”, authorizing Entra identities to access AWS resources is a lot easier than the reverse scenario discussed in this post.

To do so, register your Entra tenant as an OIDC web identity provider on your AWS account. Then update your roles’ assume role policies appropriately. This enables the exchange of Entra tokens for temporary AWS credentials using AssumeRoleWithWebIdentity API calls. Exchanging the JWT for AWS session credentials, and refreshing those credentials, can happen automagically under the hood, e.g., by setting environment variables that point to a file containing your Entra JWT token (AWS_WEB_IDENTITY_TOKEN_FILE) and name the role you want to assume (AWS_ROLE_ARN). Unlike the solution above, this flow therefore works with all AWS SDKs out of the box without further customization.
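As a rough sketch of the environment-variable approach, the helper below writes an Entra JWT to a private temporary file and sets the two variables before any AWS SDK client is created. The function name is made up, and obtaining and periodically refreshing the Entra token in that file is assumed to happen elsewhere.

```python
import os
import tempfile


def configure_web_identity(entra_jwt: str, role_arn: str) -> str:
    """Store the token in a file and point the AWS SDKs at it."""
    # mkstemp creates the file readable and writable by the owner only.
    fd, path = tempfile.mkstemp(prefix="entra-token-")
    with os.fdopen(fd, "w") as handle:
        handle.write(entra_jwt)
    os.environ["AWS_WEB_IDENTITY_TOKEN_FILE"] = path
    os.environ["AWS_ROLE_ARN"] = role_arn
    return path
```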

If your Entra tokens are highly privileged, consider exchanging them for less privileged ones before sending them to AWS.

Conclusion

This post leverages Cognito, AWS’s fully managed OIDC-compatible identity provider service, to exchange IAM credentials for Entra bearer tokens, without incurring costs or requiring any form of long-lived credentials. It goes on to show how certificate-based OAuth client credential flows work and how OIDC facilitates establishing trust, including seamless rotation of key pairs. Finally, it discusses how exchanging Azure credentials for AWS ones is possible in a more straightforward manner.

Creating a Cognito user pool, Cognito client, and especially a Cognito user in this setup is arguably awkward, and you may prefer using long-lived secrets on account of their relative simplicity. We did not outline how the solution above can be integrated with Azure SDKs to fetch and refresh tokens automatically. If you know of an easier way to exchange AWS IAM credentials for Entra ones, be sure to comment!

Source: South Park

Authorizing AWS Principals on Azure was originally published in Dataminded on Medium, where people are continuing the conversation by highlighting and responding to this story.

Quack, Quack, Ka-Ching: Cut Costs by Querying Snowflake from DuckDB

Jonathan Merlevede — Wed, 15 May 2024 10:15:17 GMT

The duck escapes with the credits.

How to leverage Snowflake’s support for interoperable open lakehouse technology — Iceberg — to save money.

Update 10/2024: A lot has changed since this article was written. Snowflake stopped generating the version-hint.text file since their 2024_05 bundle, invalidating parts of the post below. Snowflake also released Polaris, which is something to look into if you want to access your Snowflake data from different engines.

Snowflake recently released extensive support for the open table format Iceberg. Using open formats enhances data agility and reduces lock-in. This post explores leveraging this flexibility to decrease Snowflake’s high compute costs by using DuckDB to query Snowflake-managed data.

What is Apache Iceberg?

Apache Iceberg is a table format specification created by Netflix in 2017. In 2018, Netflix open-sourced the Iceberg project and donated it to the Apache Software Foundation.

Netflix designed Iceberg to overcome the limitations of data lakes that contain plain partitioned data files with minimal metadata — also known as Hive-formatted tables. These included performance problems (many file listings, many partitions, limited pruning) and the absence of features that had become common in data warehouses, such as time travel, schema evolution, and ACID transactions.

Table format specification

A table format specification is a standard way of writing metadata to define a table. Metadata allows tools to know what is in a dataset without having to read all of the data inside, but they can also assign different meanings to data — e.g., by marking it as non-current.

Apache Iceberg is not a storage format. You can store your Iceberg table’s data in formats such as Parquet, ORC, or Avro; Iceberg is a standard way to organize metadata next to those data files.

Open toolbox and interoperability

Many engines and tools implement the Iceberg spec. Tools implementing the same spec can all interact with the same Iceberg tables, which is why Apache Iceberg is “multi-engine.” Most major engines, such as AWS Athena, Trino (Starburst), DuckDB, and Snowflake, support Iceberg.

This interoperable approach fundamentally differs from what was common in the past. Databases like Oracle, Vertica, BigQuery, and so on store metadata and data in proprietary formats, presenting a challenge for seamless interoperability, requiring lots of data copying, and potentially leading to vendor lock-in.

Paradigm shift

By working with a centrally accessible format independent of the compute engine, compute engines become interchangeable. This allows us to use the most suited computing engine for a particular task, without having to move around data. Data written by one tool can immediately be read by another.

This architecture results in a paradigm shift, favoring data sharing over redundant data duplication across different computing engines.

Image adapted from https://www.youtube.com/watch?v=_GW3GYZK66U

Featureful lakehouses

In addition to facilitating interoperability, Apache Iceberg enables an ever-growing number of features that close the feature gap between data lakes and data warehouses, giving rise to what is now known as the lakehouse. These include time travel, ACID transactions, partition evolution, hidden partitioning, schema evolution, saving object storage costs, etc. This blog post only focuses on interoperability.

Apache Iceberg and Snowflake

On December 4, 2023, Snowflake published a blog post announcing that its Apache Iceberg integration was in public preview.

Snowflake now offers two ways to work with Iceberg tables:

External catalog. These tables are written externally, by a tool such as Apache Spark, Apache Flink, or even Trino, to your object store and registered in an external catalog such as the Hive Metastore, the AWS Glue Data Catalog, or Nessie. In this mode, tables are read-only from Snowflake.
Snowflake catalog. These tables are read-write from Snowflake and read-only externally.

In both cases, Snowflake stores all data and Iceberg metadata in your own (cloud) object storage. Both ways of working with Iceberg have merits. Given your situation, it should be clear which is the most appropriate.

When using Iceberg tables with the Snowflake catalog, Snowflake behaves like it always does; it remains a “zero-ops” warehouse, and you can remain carefree while Snowflake performs storage maintenance operations like compaction, expiring snapshots, and cleaning up of orphaned files. Iceberg tables behave nearly identically to Snowflake-native tables, although there are some limitations that you may want to check out.

This post assumes that your data lives and breathes in Snowflake and that Snowflake is where your large-scale processing happens. Using the Snowflake catalog is then the right choice.

Image from https://www.snowflake.com/blog/unifying-iceberg-tables/

Iceberg Catalog

When using Iceberg tables with the Snowflake catalog, the “catalog” remains on Snowflake’s side. To determine whether this impedes our ability to interact with data directly, we should know what the metadata catalog does; after all, is a table’s metadata not stored in Iceberg’s metadata files? Catalogs bring at least two things to the table (pun intended):

Database abstraction. Iceberg is a specification for technical metadata at the table level, and Iceberg metadata files are stored next to your data files. The table specification is unaware of concepts such as table names, schemas, and databases or collections. A metadata catalog allows you to consider your “bag of tables” as a database by introducing hierarchy and storing a map of table names onto prefixes.
Pointer to the current table version. When mutating an Iceberg table, new data and metadata files are added and stored next to the old ones. The catalog keeps track of table prefixes but must also know which metadata files are “current.”

TL;DR: You need access to the catalog to know which table version is current, and to access tables by name and write queries as you are used to.
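To make the two responsibilities concrete, here is a toy in-memory catalog. It is only an illustration of the idea and bears no relation to how Snowflake's catalog is implemented; all names and paths are hypothetical.

```python
class ToyCatalog:
    """Maps table names to a storage prefix and the current metadata file."""

    def __init__(self) -> None:
        self._tables: dict[str, dict[str, str]] = {}

    def register(self, name: str, prefix: str, metadata_file: str) -> None:
        # Database abstraction: a name such as "sales.orders" maps to a prefix.
        self._tables[name] = {"prefix": prefix.rstrip("/"), "current": metadata_file}

    def commit(self, name: str, new_metadata_file: str) -> None:
        # Old metadata files stay in storage; only the pointer moves.
        self._tables[name]["current"] = new_metadata_file

    def current_metadata(self, name: str) -> str:
        entry = self._tables[name]
        return f"{entry['prefix']}/metadata/{entry['current']}"
```

A reader asks the catalog for `current_metadata("sales.orders")` and from there follows the Iceberg metadata to the data files.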

Bergs need careful filing in a metadata catalog.

Iceberg Catalog SDK

If you want to read your Iceberg tables using Spark, you’re in luck! Snowflake released an Iceberg Catalog SDK for Spark, which implements Spark’s catalog interface using an (otherwise undocumented) Snowflake Catalog API. Currently, this Snowflake functionality is free and does not require a running warehouse, cost “serverless credits,” or incur “cloud services” charges.

Snowflake’s announcement provides readily usable sample code and confirms that Spark reads Iceberg metadata and Parquet files directly from the customer-managed storage account:

After making an initial connection to Snowflake via the Iceberg Catalog SDK, Spark can read Iceberg metadata and Parquet files directly from the customer-managed storage account. With this configuration, multiple engines can consistently read from a single copy of data.

Unfortunately, this is not immediately helpful for querying from DuckDB. There is no Snowflake catalog SDK available for DuckDB. Luckily, we can use the file system directly to read our data.

Image slightly adapted from https://www.snowflake.com/blog/unifying-iceberg-tables/

Iceberg Filesystem Catalog

If it seems possible to implement a catalog on top of a filesystem or object store through straightforward naming conventions, that is because it is! Indeed, Iceberg’s Hadoop catalog is just that. Its class documentation reads:

HadoopCatalog […] uses a specified directory under a specified filesystem as the warehouse directory, and organizes multiple levels directories that mapped to the database, namespace and the table respectively. The HadoopCatalog takes a location as the warehouse directory. When creating a table such as $db.$tbl, it creates $db/$tbl directory under the warehouse directory, and put the table metadata into that directory.

For Iceberg to know which metadata is the latest, it expects the filesystem tables’ metadata files to have names determined as a function of monotonically increasing version numbers. It also looks for an optional version-hint.text file pointing to the newest version.
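A reader could resolve the current metadata file along these lines. The sketch works on a local directory and assumes the Hadoop-table layout, with files named `v<N>.metadata.json` and the hint file stored in a `metadata` subdirectory; compressed metadata files and object stores are not handled.

```python
import re
from pathlib import Path

_METADATA_NAME = re.compile(r"^v(\d+)\.metadata\.json$")


def current_metadata_file(table_dir: str) -> Path:
    """Return the path of the newest metadata file of a filesystem table."""
    metadata_dir = Path(table_dir) / "metadata"
    hint = metadata_dir / "version-hint.text"
    if hint.exists():
        # The hint file contains just the current version number.
        version = int(hint.read_text().strip())
        return metadata_dir / f"v{version}.metadata.json"
    # Without a hint, fall back to the highest version number present.
    versions = [
        int(match.group(1))
        for path in metadata_dir.iterdir()
        if (match := _METADATA_NAME.match(path.name))
    ]
    if not versions:
        raise FileNotFoundError(f"no metadata files found in {metadata_dir}")
    return metadata_dir / f"v{max(versions)}.metadata.json"
```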

Note: Writers maintain consistency and monotonically increasing versions by implementing the scheme documented here. Unfortunately, this requires storage systems to support atomic renaming, which many storage engines, notably S3, Google Cloud Storage, and Azure Blob Storage, do not do. This is one of the reasons why one of Iceberg’s original authors, Ryan Blue, has referred to the creation of Hadoop tables as “one of his biggest mistakes”. Even on storage systems supporting atomic renames you may see lower performance than when using a “proper” metadata catalog. The use of HadoopCatalog is generally discouraged for production use.

Snowflake presumably uses a proprietary, highly performant catalog implementation in its backend. However, it is nice enough to materialize data and metadata on the customer-managed object storage in a way compatible with the Hadoop catalog — they even maintain a current version-hint.text file! This compatibility means that any reader with support for the Iceberg Hadoop catalog can read Snowflake data directly by pointing it to the root of the Iceberg warehouse on the object storage system.

DuckDB

DuckDB has partial support for the Iceberg Hadoop catalog and filesystem tables. While DuckDB unfortunately does not (yet?) support reading an entire warehouse, you can point it to a table prefix. DuckDB will then pick up on the version-hint.text file and read the latest version of the table.

Creating an Iceberg table

Getting Snowflake to create an Iceberg table on your cloud requires some configuration. The example below uses S3 as a storage layer, but Snowflake also supports Google Cloud Storage and Azure Storage. You can find a playbook for S3 here:

GitHub - datamindedbe/platform-quack-quack-ka-ching: The duck escapes with the credits.

On a high level, this is what needs to happen:

Provision storage: Create an S3 bucket and an IAM role for Snowflake and ensure that the IAM role has the necessary permissions to access the bucket.
Connect Snowflake to storage: Create a Snowflake External Volume. In S3’s case, an external volume will create an IAM user on Snowflake’s account. You need to create a trust relationship so that this IAM user can assume the role with access to your S3 bucket.

We can finally create native Iceberg tables in Snowflake with CREATE ICEBERG TABLE, and you can find your Parquet and Iceberg metadata files in the S3 bucket.

Reading data from DuckDB

Having established a secure connection between S3 and Snowflake and created Iceberg tables in Snowflake, let’s — finally — see how DuckDB facilitates querying them.

We use DuckDB’s iceberg extension to read the Iceberg tables we made in Snowflake directly from S3. Again, you can find the playbook here. The main functionality is provided by the following iceberg_scan function:

select * from iceberg_scan('s3://chapter-platform-iceberg/icebergs/line_item');

The iceberg_scan function reads the table from S3. You don’t have to point to the current metadata file (metadata.json) explicitly because the version-hint.text file points to the current version of the table.
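From Python, the same query could be run through the duckdb package, roughly as sketched below. The helper names are illustrative, and configuring S3 access (credentials, region) is assumed to be handled separately and is not shown.

```python
def build_scan_query(table_prefix: str) -> str:
    """Build an iceberg_scan query for the table stored at the given prefix."""
    # Double any single quotes so the prefix stays inside the SQL literal.
    escaped = table_prefix.replace("'", "''")
    return f"SELECT * FROM iceberg_scan('{escaped}');"


def query_iceberg_table(table_prefix: str) -> list:
    """Run the scan in an in-memory DuckDB database and return all rows."""
    import duckdb  # third-party package: pip install duckdb

    con = duckdb.connect()
    con.execute("INSTALL iceberg;")
    con.execute("LOAD iceberg;")
    return con.execute(build_scan_query(table_prefix)).fetchall()
```

For example, `query_iceberg_table("s3://chapter-platform-iceberg/icebergs/line_item")` corresponds to the SQL statement shown above.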

We have now unlocked the real power of open table formats: we have the convenience of Snowflake and its catalog but can save costs by performing single-node queries on DuckDB.

As of now, DuckDB does not support writing Iceberg tables, only reading them. You can write out to Parquet though, for example to S3 with COPY TO 's3://bucket/file.parquet';. However, even if DuckDB supported Iceberg writes, Snowflake would not accept them, although you could register DuckDB’s output as an Iceberg table with an external catalog in Snowflake.

Why is Snowflake doing this?

If using Iceberg tables on Snowflake is a bit like having your cake and eating it, with Snowflake footing the bill, then why did Snowflake build this integration? The move makes sense in the context of fierce competition from Databricks. Both behemoths are trying to open up their systems to attract customers.

Snowflake sends the message to its (prospective) customers that choosing Snowflake does not tie them to one vendor and that there is no risk of lock-in; with them, you always have the option to switch compute engines when you want. Databricks is behaving similarly by opening up its Delta Lake format and better supporting Hudi and Iceberg through UniForm.

Snowflake still wants to keep as much compute as possible on their systems. There is a clear path for moving external metadata to the Iceberg catalog, but going in the other direction is much more challenging. By owning the metadata catalog, Snowflake remains the compute engine of choice and the only writer. If Snowflake had not opened up its systems, it would likely have lost many customers who were afraid of lock-in.

Conclusion

Open table formats like Iceberg enable true separation of compute from storage. By using Snowflake’s Iceberg tables, you can continue enjoying Snowflake’s powerful and operations-free capabilities, while making it possible to occasionally escape its “walled garden.” Because Iceberg with Parquet has characteristics and features that are very similar to those of native Snowflake tables (efficient compression, partition pruning, schema evolution, etc.), and because Snowflake has implemented support for them, you should be able to use Iceberg tables instead of native tables without a significant impact on performance or functionality. We therefore suggest defaulting to using Iceberg tables with Snowflake.

This post demonstrated how easy it is to run a query on DuckDB instead of on expensive Snowflake compute by directly pointing it to Snowflake-managed data in your own object storage. There, you can even combine it with data that is not available in your Snowflake warehouse. Knowing that you can operate DuckDB from instances that cost around 10% of a comparably powered Snowflake warehouse, such an approach can come with significant cost savings. Of course, we do not mean to suggest DuckDB is a replacement for Snowflake. We do think this is a good demonstration of the power of interoperability.

This post is the result of a collaborative effort by Jelle De Vleminck, Robbert, Moenes Bensoussia, and Jonathan Merlevede.

Quack, Quack, Ka-Ching: Cut Costs by Querying Snowflake from DuckDB was originally published in Dataminded on Medium, where people are continuing the conversation by highlighting and responding to this story.

Two Lifecycle Policies Every S3 Bucket Should Have

Jonathan Merlevede — Thu, 07 Mar 2024 09:31:02 GMT

Stop paying for objects that you do not see

Abandoned incomplete multipart uploads and expired current delete markers: what are they, and why you must care about them thanks to bad AWS defaults.

According to your AWS bill, this seemingly empty bucket might, in fact, be quite full.

There may be items in your buckets that you do not see but that adversely impact S3 costs and performance. This post explains what these invisible objects are and what you can do to remove them, as, for reasons of backward compatibility and possibly also because of perverse incentives, AWS does not remove them for you by default.

TL;DR version: The objects in question are parts of abandoned multipart uploads and expired object deletion markers. I think every bucket should have a lifecycle policy to remove them. If you do not know what these objects are and are interested in knowing, read on.

Aborted, incomplete multipart uploads

What are multipart uploads?

Uploading small objects to AWS S3 is possible using just a single PutObject operation. For larger objects, we use multipart uploads, and the flow is more involved:

Perform the CreateMultipartUpload operation. You specify an object key; AWS returns you an upload ID.
Perform UploadPart operations, one for every part. You present AWS with the object key, upload ID, a “part number”, and part of the file you want to upload.
Perform the CompleteMultipartUpload operation. You specify the object key and upload ID; AWS then creates your object as the concatenation of all the parts you uploaded.

All of this is usually handled by your upload tool or library. For example, if you use the AWS CLI to upload files (using aws s3 cp), it will use multipart uploads by default for files larger than 8MiB.
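For illustration, the three steps above can be sketched in Python as follows. The sketch assumes `client` behaves like `boto3.client("s3")`, and the function name is made up. It deliberately contains no error handling: if it is interrupted before the final call, the uploaded parts stay behind in the bucket.

```python
from typing import Any, BinaryIO

PART_SIZE = 8 * 1024 * 1024  # 8 MiB; S3 requires at least 5 MiB except for the last part


def multipart_upload(client: Any, bucket: str, key: str, data: BinaryIO,
                     part_size: int = PART_SIZE) -> str:
    """Upload `data` in parts and return the upload ID that was used."""
    # Step 1: start the upload and remember the upload ID.
    upload_id = client.create_multipart_upload(Bucket=bucket, Key=key)["UploadId"]

    # Step 2: upload the parts one by one, numbering them from 1.
    parts = []
    part_number = 1
    while chunk := data.read(part_size):
        etag = client.upload_part(
            Bucket=bucket, Key=key, UploadId=upload_id,
            PartNumber=part_number, Body=chunk,
        )["ETag"]
        parts.append({"PartNumber": part_number, "ETag": etag})
        part_number += 1

    # Step 3: ask S3 to assemble the object from the uploaded parts.
    client.complete_multipart_upload(
        Bucket=bucket, Key=key, UploadId=upload_id,
        MultipartUpload={"Parts": parts},
    )
    return upload_id
```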

What is problematic about multipart uploads?

If you never complete an upload, associated uploaded parts remain in your bucket forever. These parts are stored in your bucket; you pay to keep them while you do not see them.

To illustrate this, let’s create a 5GiB test file and commence uploading it to the cloud using aws s3 cp:

bucket=yourbucketname
key=tmp/testfile
dd if=/dev/urandom of=/tmp/testfile bs=1G count=5
aws s3 cp /tmp/testfile "s3://$bucket/$key" &
[1] 71717

After leaving the upload running for a while, kill the upload abruptly:

kill -9 71717

You can see that the multipart upload still exists:

aws s3api list-multipart-uploads --bucket $bucket

{
    "Uploads": [
        {
            "UploadId": "gB8iBxnOiladG...",
            "Key": "tmp/testfile",
            "Initiated": "2023-11-17T14:31:24+00:00",
            "StorageClass": "STANDARD",
            "Owner": {...},
            "Initiator": {...}
        }
    ],
    "RequestCharged": null
}

You can list associated parts using list-parts:

id="gB8iBxnOiladG..."
aws s3api list-parts --bucket "$bucket" --key "$key" --upload-id "$id"

{
    "Parts": [
        {
            "PartNumber": 1,
            "LastModified": "2023-11-17T14:39:17+00:00",
            "ETag": "\"edeb946bf1303ed6350887c519041350\"",
            "Size": 8388608
        },
        {
            "PartNumber": 2,
            "LastModified": "2023-11-17T14:39:17+00:00",
            "ETag": "\"5c288b397c12981b40f19fde2387db5d\"",
            "Size": 8388608
        },
        ...
    ],
    ...
}

What can I do about abandoned multipart uploads?

If you know that an upload is abandoned and should be aborted, you can remove uploaded parts by aborting the upload using the AbortMultipartUpload operation:

aws s3api abort-multipart-upload --bucket "$bucket" --key "$key" --upload-id "$id"

You can verify that the upload ID, and therefore also associated parts, are removed:

aws s3api list-multipart-uploads --bucket $bucket

{
    "RequestCharged": null
}

aws s3api list-parts --bucket "$bucket" --key "$key" --upload-id "$id"

An error occurred (NoSuchUpload) when calling the ListParts operation: The specified upload does not exist. The upload ID may be invalid, or the upload may have been aborted or completed.

What can I really do about abandoned multipart uploads?

The above is highly impractical, as it requires you to monitor your bucket for abandoned uploads and manually abort them. Instead, tell AWS to abort multipart uploads automatically after a certain period expires using an (expiration) object lifecycle rule.

You can do this using the CLI or IaC tools like Terraform, but also from the AWS Console. While you’re there, tick the box “Delete expired object delete markers” too; we will explain what this does in the next section.

Expiration lifecycle policy expressing what should really be the AWS defaults.
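If you prefer to set this up from code, a lifecycle configuration with both settings could be expressed as below. This is a sketch: the rule ID is an arbitrary name, and the default number of days is only a placeholder.

```python
def cleanup_lifecycle_configuration(days: int = 14) -> dict:
    """Build a lifecycle configuration that removes the invisible objects."""
    return {
        "Rules": [
            {
                "ID": "cleanup-invisible-objects",  # arbitrary, illustrative name
                "Status": "Enabled",
                "Filter": {"Prefix": ""},  # empty prefix: apply to the whole bucket
                "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": days},
                "Expiration": {"ExpiredObjectDeleteMarker": True},
            }
        ]
    }
```

With boto3, the result can be passed as the `LifecycleConfiguration` argument of `put_bucket_lifecycle_configuration`. Keep in mind that this call replaces any lifecycle configuration already present on the bucket.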

Are there any disadvantages to this?

Object expiration through lifecycle policies is free.

The only thing to account for is that the number of days you set in the policy limits how long a multipart upload can take. Keep this in mind if you are uploading 5TiB objects from a slow connection. Choose something ridiculously long for a single upload but not so long that it impacts your storage bill, like 14 days.
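To get a feeling for whether a given limit is realistic, you can compute the sustained throughput an upload needs to finish in time. A small sketch with an illustrative helper name:

```python
TIB = 2 ** 40  # bytes in one tebibyte


def required_throughput_mbit_per_s(size_bytes: int, days: float) -> float:
    """Average throughput needed to upload size_bytes within the given days."""
    seconds = days * 24 * 60 * 60
    return size_bytes * 8 / seconds / 1_000_000


# A 5 TiB object with a 14-day limit needs roughly 36 Mbit/s sustained.
print(round(required_throughput_mbit_per_s(5 * TIB, 14), 1))
```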

Expired delete markers

Other invisible objects that tend to linger around in buckets are called “expired object delete markers.” These objects only exist in versioned buckets.

What are object delete markers?

When you enable bucket versioning, every prefix becomes associated with a stack of versioned items. The most recent item is the “latest” or “current” one. There are two types of versioned items: object versions and delete markers. Writing data to a location pushes an object version onto the stack; deleting an object pushes a delete marker. If the current item is an object version, this object version is visible as an object in your bucket; current delete markers remain invisible.
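The stack analogy can be captured in a few lines of Python. This toy model only illustrates the visibility rule described above; it is not how S3 is implemented, and the class name is made up.

```python
class VersionStack:
    """Toy model of the versioned items stored at a single location."""

    def __init__(self) -> None:
        self.items: list[tuple[str, object]] = []  # oldest first; last item is current

    def put(self, data: bytes) -> None:
        # Writing pushes an object version onto the stack.
        self.items.append(("version", data))

    def delete(self) -> None:
        # Deleting pushes a delete marker; nothing is removed.
        self.items.append(("delete-marker", None))

    def visible(self) -> bool:
        # An object shows up only if the current item is an object version.
        return bool(self.items) and self.items[-1][0] == "version"
```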

We can easily illustrate this using a couple of aws CLI commands. Before we upload an object, no version exists:

bucket=yourbucketname
key=tmp/testfile
aws s3 ls "s3://$bucket/$key" # verify that no object exists
aws s3api list-object-versions --bucket "$bucket" --prefix "$key"

{
    "RequestCharged": null
}

After uploading an object, we see the object and a single object version:

dd if=/dev/urandom bs=1M count=2 | aws s3 cp - s3://$bucket/$key
aws s3 ls s3://$bucket/$key

2023-11-21 00:33:09    2097152 testfile

aws s3api list-object-versions --bucket "$bucket" --prefix "$key"

{
    "Versions": [
        {
            "ETag": "\"18f78ecf07a0c41e8ec2defa200a5029\"",
            "Size": 2097152,
            "StorageClass": "STANDARD",
            "Key": "tmp/testfile",
            "VersionId": "bUWdb14EppQR13KLVapM699xySNlo9yR",
            "IsLatest": true,
            "LastModified": "2023-11-20T23:33:09+00:00",
            "Owner": {...}
        }
    ],
    "RequestCharged": null
}

Deletion of the file adds a deletion marker:

aws s3 rm s3://$bucket/$key
aws s3api list-object-versions --bucket "$bucket" --prefix "$key"

{
    "Versions": [
        {
            "ETag": "\"18f78ecf07a0c41e8ec2defa200a5029\"",
            "Size": 2097152,
            "StorageClass": "STANDARD",
            "Key": "tmp/testfile",
            "VersionId": "bUWdb14EppQR13KLVapM699xySNlo9yR",
            "IsLatest": false,
            "LastModified": "2023-11-20T23:33:09+00:00",
            "Owner": {}
        }
    ],
    "DeleteMarkers": [
        {
            "Owner": {},
            "Key": "tmp/testfile",
            "VersionId": "kNig7WIYhADWCr47u_nRrQ8QYdeW4eIj",
            "IsLatest": true,
            "LastModified": "2023-11-20T23:34:59+00:00"
        }
    ],
    "RequestCharged": null
}

Listing objects using aws s3 ls returns an empty result. Removal of a delete marker would restore the object. You could do this as follows (but let’s not for now):

vid="kNig7WIYhADWCr47u_nRrQ8QYdeW4eIj"
aws s3api delete-object --bucket "$bucket" --key "$key" --version-id "$vid"

What are current and noncurrent object delete markers?

If a delete marker is the latest or current item on the version stack, we refer to it as a current object delete marker. Otherwise, we refer to it as a noncurrent object delete marker.

In the example above, the single delete marker at the prefix tmp/testfile is a current delete marker. Uploading another object to the same location creates a new object version:

dd if=/dev/urandom bs=1M count=2 | aws s3 cp - s3://$bucket/$key
aws s3api list-object-versions --bucket "$bucket" --prefix "$key"

{
    "Versions": [
        {
            "Key": "tmp/testfile",
            "VersionId": "l3_yEX5pQZCloovHVMsbIgbzP1pqZ4iU",
            "IsLatest": true,
            ...
        },
        {
            "Key": "tmp/testfile",
            "VersionId": "bUWdb14EppQR13KLVapM699xySNlo9yR",
            "IsLatest": false,
            ...
        }
    ],
    "DeleteMarkers": [
        {
            "Key": "tmp/testfile",
            "VersionId": "kNig7WIYhADWCr47u_nRrQ8QYdeW4eIj",
            "IsLatest": false,
            ...
        }
    ],
    "RequestCharged": null
}

At this point, the delete marker with version ID kNig7WIYhADWCr47u_nRrQ8QYdeW4eIj is still there but has become “noncurrent”, as indicated by its property IsLatest with value false.

What are expired object delete markers?

Expired object delete markers are delete markers at a key for which no object versions remain: the delete marker is the only item left on the version stack.

We can turn the delete marker from our example into an expired object delete marker by removing the object versions at the same location:

aws s3api delete-object --bucket $bucket --key $key \
  --version-id l3_yEX5pQZCloovHVMsbIgbzP1pqZ4iU
aws s3api delete-object --bucket $bucket --key $key \
  --version-id bUWdb14EppQR13KLVapM699xySNlo9yR
aws s3api list-object-versions --bucket "$bucket" --prefix "$key"

{
    "DeleteMarkers": [
        {
            "Key": "tmp/testfile",
            "VersionId": "kNig7WIYhADWCr47u_nRrQ8QYdeW4eIj",
            "IsLatest": true,
            ...
        }
    ],
    "RequestCharged": null
}

The delete marker with version ID kNig7WIYhADWCr47u_nRrQ8QYdeW4eIj is now an expired object delete marker.
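The distinctions between current, noncurrent, and expired delete markers can be checked mechanically against the output of list-object-versions. The following is a minimal Python sketch, not an AWS API: the function name classify_delete_markers is hypothetical, it assumes the response is complete (not paginated or truncated), and it treats a delete marker as expired when it is the only item left at its key, as in the example above.

```python
from collections import Counter


def classify_delete_markers(response: dict) -> dict:
    """Map each delete marker's VersionId to 'expired', 'current' or 'noncurrent'.

    `response` has the shape of `aws s3api list-object-versions` output.
    Assumption: the response is complete (not truncated or paginated).
    """
    versions = response.get("Versions", [])
    markers = response.get("DeleteMarkers", [])
    # Count every item (object versions and delete markers) on each key's version stack.
    items_per_key = Counter(item["Key"] for item in versions + markers)

    labels = {}
    for marker in markers:
        if items_per_key[marker["Key"]] == 1:
            # The marker is the only item left at this key.
            labels[marker["VersionId"]] = "expired"
        elif marker["IsLatest"]:
            labels[marker["VersionId"]] = "current"
        else:
            labels[marker["VersionId"]] = "noncurrent"
    return labels
```

You could feed it the parsed JSON of each listing shown above: the marker is reported as current right after the deletion, as noncurrent after the second upload, and as expired once both object versions are gone.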

Why are expired object delete markers bad?

The problem is that expired object delete markers can be current and can remain in your bucket forever unless you do something about them. Lingering markers can slow list requests and result in redundant results when listing object versions.

When enabling bucket versioning, you typically implement an expiration policy for noncurrent items, as not doing so often quickly becomes prohibitively expensive. With such a policy, all delete markers eventually become current, expired ones. Your expiration policy will not remove these from your bucket. In our example, the delete marker at tmp/testfile will never be automatically removed by a policy expiring noncurrent versions.

How can I remove expired object delete markers?

You can remove delete markers manually:

id=kNig7WIYhADWCr47u_nRrQ8QYdeW4eIj
aws s3api delete-object --bucket $bucket --key $key --version-id $id

How can I really remove expired object delete markers?

As for multipart uploads, the best way to remove expired object delete markers is through an explicit (expiration) lifecycle policy targeting expired object delete markers. One way to do this is by ticking a box in the Console (see the screenshot above). (You should probably use an IaC tool, though.)
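If you do not use an IaC tool, the same policy can be set from Python with boto3. The following is a sketch under stated assumptions: the function names, the rule IDs, and the 7-day default are arbitrary choices made for illustration, and the two clean-ups are kept in separate rules for readability. Note that put_bucket_lifecycle_configuration replaces the bucket's entire existing lifecycle configuration, so merge any rules you already have before applying it.

```python
def build_lifecycle_configuration(abort_after_days: int = 7) -> dict:
    """Build a lifecycle configuration containing the two clean-up rules.

    The rule IDs and the 7-day default are arbitrary choices for this sketch.
    """
    return {
        "Rules": [
            {
                "ID": "abort-incomplete-multipart-uploads",
                "Status": "Enabled",
                "Filter": {"Prefix": ""},  # empty prefix: applies to the whole bucket
                "AbortIncompleteMultipartUpload": {
                    "DaysAfterInitiation": abort_after_days
                },
            },
            {
                "ID": "remove-expired-object-delete-markers",
                "Status": "Enabled",
                "Filter": {"Prefix": ""},
                "Expiration": {"ExpiredObjectDeleteMarker": True},
            },
        ]
    }


def apply_lifecycle_configuration(bucket: str, abort_after_days: int = 7) -> None:
    """Replace the bucket's lifecycle configuration with the one built above."""
    import boto3  # imported lazily so the builder also works without boto3 installed

    boto3.client("s3").put_bucket_lifecycle_configuration(
        Bucket=bucket,
        LifecycleConfiguration=build_lifecycle_configuration(abort_after_days),
    )
```

Calling apply_lifecycle_configuration("my-bucket") (a placeholder bucket name) requires credentials that are allowed to change the bucket's lifecycle configuration.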

Are there any disadvantages to this?

In the improbable event that you would like to have an account of which objects existed in the past and when they were removed but do not need the ability to restore said objects, consider not removing expired object delete markers.

As for aborted multipart uploads, removing delete markers through lifecycle policies is free. Delete markers only exist in versioned buckets, but having a policy to remove them is never harmful.

Conclusion

We have seen what incomplete multipart uploads are and why you should abort the abandoned ones. We have also seen what expired object delete markers are, that they can be current and that you should remove them.

AWS does not abort or remove anything for you by default. Therefore, whenever you create a bucket, create an expiration lifecycle policy that aborts multipart uploads after some days and removes expired object delete markers.


Two Lifecycle Policies Every S3 Bucket Should Have was originally published in Dataminded on Medium, where people are continuing the conversation by highlighting and responding to this story.

Connecting to Databases using JDBC from the CLI

Jonathan Merlevede — Mon, 11 Dec 2023 13:59:13 GMT

How To Establish JDBC Connections From the CLI

A quick guide to using the CLI tool SQLLine and any JDBC driver to construct, test, and use connections to any database

Many applications connect to databases using JDBC drivers, often configured by JDBC Connection URLs. As a system or platform administrator, constructing these strings and testing credentials or permissions is often painful, especially as databases are often only accessible from specific networks; connecting to them from your laptop using graphical tools like DBeaver or SQuirrel may not be an option.

Testing JDBC Connection URLs as an administrator can be a painful experience

sqlline is a command-line shell for issuing SQL to relational databases via JDBC. It can connect to any database for which there exist JDBC drivers. This makes it easy to quickly construct and test JDBC connection URIs to connect to your database. Although sqlline is not under very active development, I have found it to work reliably and for various database types (MySQL, Oracle, PostgreSQL, …).

GitHub - julianhyde/sqlline: Shell for issuing SQL to relational databases via JDBC

Installation

You can download the sqlline JAR pre-packaged together with all its dependencies from the Sonatype Maven Central Repository:

ver=1.12.0
wget https://repo1.maven.org/maven2/sqlline/sqlline/$ver/sqlline-$ver-jar-with-dependencies.jar

You should now be able to run sqlline:

java -jar sqlline-$ver-jar-with-dependencies.jar --help
# Usage: java sqlline.SqlLine
#   -u <database url>               the JDBC URL to connect to
#   -n <username>                   the username to connect as
#   -p <password>                   the password to connect as
#   -d <driver class>               the driver class to use

You will also want to download JDBC drivers, for example, Oracle’s JDBC driver if you want to connect to Oracle servers:

wget https://download.oracle.com/otn-pub/otn_software/jdbc/233/ojdbc11.jar

If you have special requirements, you can follow the instructions in the sqlline repository to compile sqlline from source.

Usage

You can now start sqlline and use it to connect to a database:

java -cp "*" sqlline.SqlLine \
  -n myusername -p supersecretpassword \
  -u "jdbc:oracle:thin:@my.host.name:1521:my-sid"

This will open up an interactive interface into which you can type commands. For example, you can issue the command !tables to list all available tables. Type !help to get a list of all commands.
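When you test many hosts or credentials, it can help to assemble the connection URL and the sqlline invocation programmatically. The following Python sketch is an illustration only: the helper names oracle_thin_url and sqlline_command are hypothetical, and the service-name URL form is an addition not used elsewhere in this post (the Oracle thin driver accepts it alongside the SID form shown above).

```python
from typing import List, Optional


def oracle_thin_url(host: str, port: int = 1521, *,
                    sid: Optional[str] = None,
                    service_name: Optional[str] = None) -> str:
    """Build an Oracle thin-driver JDBC URL from either a SID or a service name."""
    if (sid is None) == (service_name is None):
        raise ValueError("pass exactly one of sid or service_name")
    if sid is not None:
        # SID form, as used in the sqlline example above.
        return f"jdbc:oracle:thin:@{host}:{port}:{sid}"
    # Service-name form (an addition for this sketch).
    return f"jdbc:oracle:thin:@//{host}:{port}/{service_name}"


def sqlline_command(url: str, username: str, password: str) -> List[str]:
    """Return the argument list for the sqlline invocation shown above."""
    return ["java", "-cp", "*", "sqlline.SqlLine",
            "-n", username, "-p", password, "-u", url]
```

Passing the returned list to subprocess.run avoids shell quoting issues; the classpath wildcard is expanded by Java itself, not by the shell.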

You can run SQL queries by typing !sql followed by a SQL statement:

0: jdbc:oracle:thin:@my.host.name> !sql SELECT COUNT(1) FROM mytable;
+----------+
| COUNT(1) |
+----------+
| 3358     |
+----------+
1 row selected (0.02 seconds)

To quit sqlline, type !quit.

Conclusion

SQLLine is an excellent tool that you can use to quickly construct JDBC connection strings and test JDBC drivers and database credentials. It is especially useful to test connections from headless servers. As we have seen in this blog post, installation is a breeze. Stop port forwarding to DBeaver and start using SQLLine!


Connecting to Databases using JDBC from the CLI was originally published in Dataminded on Medium, where people are continuing the conversation by highlighting and responding to this story.

Debugging Running Pods on Kubernetes

Jonathan Merlevede — Wed, 25 Oct 2023 08:45:42 GMT

Exploring Kubernetes’s debugging feature, kubectl debug, and introducing kubectl superdebug — an enhanced kubectl debug supporting volume mounts.

Some pods require debugging.

Executing commands using kubectl exec

If you run software on Kubernetes, you will, at some point, want to debug some aspect of what you deploy. A simple approach to debugging that is natural to people used to working with VMs is to connect to a running pod and hack away:

kubectl exec -it podname -c containername -- bash

This often works and is very useful. However, there are at least two Kubernetes “best practices” limiting exec’s usefulness in the real world:

Not running as root. Containers run with as few privileges as possible and may even run with randomized UIDs.
Minimal images. Images are kept as small as possible, with binaries installed into a distroless image as an extreme.

When applying these best practices, connecting to your container using kubectl exec is either impossible or drops you into a barren wasteland-like environment unsuited for debugging.

kubectl exec does not allow you to specify a user flag or capabilities to start your process with, instead copying those settings from the target container’s main command. Some Kubernetes users think this should be changed.

Debug containers

The Kubernetes-native answer to debugging running containers is to use kubectl debug. The debug command spins up a new container into a running pod. This new container can run as a different user and from any image you choose. Because the debug container runs within the same pod as the container it targets (and therefore on the same node), the isolation between both containers does not need to be absolute. The debug container can share system resources with other containers running in the same pod.

Consider wanting to inspect the CPU usage of a PostgreSQL database running in the container postcont in the pod postpod. The pod does not run as root, and the Postgres image does not have tools like top or htop installed — in other words, the kubectl exec command is of little use. You can then run the following command:

kubectl debug -it \
--container=debug-container \
--image=alpine \
--target=postcont \
postpod

You will be logged in as root (this is the default for the Alpine image) and can easily install your favorite interactive process viewer htop (apk add htop). You share the same process namespace as the postcont container and can see and even kill all the processes running there! When you exit the process, the ephemeral container stops existing, too.

Note: Specifying --target is non-optional if you want your debug container to share the same process namespace as postcont, even if postcont is the only container running in postpod.

Note: You can disconnect from your ephemeral container / bash session without exiting (killing) it by pressing CTRL+P CTRL+Q. You can then later reconnect to it using kubectl attach.

Note: kubectl debug offers more functionality than outlined here, such as the copying of pods with a modified startup command or starting a “node” pod with access to the node’s filesystem.

Under the hood

The kubectl debug command above works by creating something called an ephemeral container. These containers are supposed to run temporarily in an existing pod to support actions such as troubleshooting.

The difference between “normal” containers and ephemeral containers is slim. Nothing really prevents ephemeral containers from running for a long time. I think the reason for having ephemeral containers is best understood by looking at foundational architectural choices made by Kubernetes at its inception:

Pods should be disposable and replaceable, and, supporting this,
the Pod specification is immutable.

This made a lot of sense when Kubernetes was used primarily for deploying stateless workloads — when pods themselves could be considered ephemeral. It can be restrictive in this new world where Kubernetes is used for everything. The Pod spec remains immutable, but Kubernetes models ephemeral containers as a subresource of Pod. Unlike “normal” containers, ephemeral containers are not part of the Pod spec, even if they are part of the pod. This subtle distinction keeps everyone happy 🥳!

Ephemeral containers are still relatively new; they have been stable since Kubernetes v1.25 (August 2022), were in beta from v1.23 (December 2021), and first appeared as an alpha feature in v1.16 (September 2019).

Mounting volumes

The built-in command kubectl debug can be very useful. It allows you to add an ephemeral container to a running pod, optionally sharing its process namespace with that of a running container. However, if you were expecting to use kubectl debug to inspect or modify any part of the running container’s filesystem, you’re out of luck: the filesystem of the debug container is disjoint from that of the container you connect it to.

Luckily, we can do better. The idea is simple:

Retrieve the specification of the running target container.
Patch an ephemeral container into the pod. Configure it to share the same process namespace as the target container and additionally to include the same volume mounts.

There is no kubectl command for creating ephemeral containers with a custom specification, so we need to craft a PATCH request to the K8s API ourselves. The kubectl proxy command makes the K8s API reachable from your machine.
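To make the two steps concrete, here is a minimal Python sketch of how such a patch body could be assembled. It is not the code of the script linked below: the function names are hypothetical, the pod manifest is assumed to be available as a dict (for example parsed from kubectl get pod -o json), and subPath mounts are skipped on the assumption that the API does not accept them for ephemeral containers.

```python
def build_ephemeral_container_patch(pod: dict, target: str,
                                    name: str = "debug-container",
                                    image: str = "alpine") -> dict:
    """Build a patch body adding a debug container that mirrors the target's volume mounts."""
    matches = [c for c in pod["spec"]["containers"] if c["name"] == target]
    if not matches:
        raise ValueError(f"container {target!r} not found in pod")
    # Assumption: subPath mounts are skipped because ephemeral containers cannot use them.
    mounts = [m for m in matches[0].get("volumeMounts", [])
              if "subPath" not in m and "subPathExpr" not in m]
    return {
        "spec": {
            "ephemeralContainers": [
                {
                    "name": name,
                    "image": image,
                    "targetContainerName": target,  # share the target's process namespace
                    "stdin": True,
                    "tty": True,
                    "volumeMounts": mounts,
                }
            ]
        }
    }


def ephemeral_containers_path(namespace: str, pod_name: str) -> str:
    """Path of the pod's ephemeralcontainers subresource, relative to the API server root."""
    return f"/api/v1/namespaces/{namespace}/pods/{pod_name}/ephemeralcontainers"
```

The resulting body would be serialized to JSON and sent as a PATCH request to that path through kubectl proxy, for example as a strategic merge patch; once the container is running, kubectl attach connects you to it.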

This process is not exactly user-friendly, so it makes sense to wrap the procedure into a script or kubectl plugin. You can find an example implementation of such a script over here:

JonMerlevede/kubectl-superdebug

Note that this approach and script can easily be extended to also copy the environment variable specification from the target container.

If you save this script as kubectl-superdebug and make it available on your path, you can run it as kubectl superdebug from anywhere as follows:

kubectl superdebug \
--container=debug-container \
--image=alpine \
--target=postcont \
postpod

You may also want to extend this script to copy other aspects of the target container into your debug container, such as references to environment variables.

This completes the overview of Kubernetes-native approaches to debugging running containers and should cover most people’s needs. However, read on if you’re particularly interested or have special needs!

Non-Kubernetes native approaches

Kubernetes does not offer a way to connect to a running container as root (unless the main process is running as root) or to access a container’s root filesystem from another container. This does not mean that these things are impossible to do. Kubernetes is, after all, simply a container orchestrator sitting on top of a containerization engine. You can usually do whatever you want by removing layers of abstraction if you, for some reason, really have to. Just make sure that you have to…

If you use the Docker Engine and can access your engine directly from a node or through a privileged container running on a node, then you can run docker exec --user and execute a process as a user of your choice. Plugins such as kubectl ssh and kubectl exec-user implement this approach. Unfortunately, modern engines such as containerd and CRI-O no longer offer the --user flag functionality — which means that these plugins do not work on modern Kubernetes installations.

However, even these modern engines usually just interface with Linux namespaces. You can run commands in whatever “container” you want by entering the appropriate set of Linux namespaces. The tool kpexec implements this approach. It starts a privileged pod on the same node as the target container, then determines which (Linux) namespaces to target, executes commands in those (Linux) namespaces, and finally streams their output to your terminal. As an added bonus, it can overlay a set of tools useful for debugging on top of the target container’s filesystem.

Unlike kubectl exec, kpexec can run commands with a different uid/gid and even different capabilities than the container’s main process. It is compatible with containerd and cri-o. kpexec takes a somewhat heavyweight and brittle approach and may not be compatible with your cluster's security configuration. It can be worth considering if kubectl (super)debug fails to suit your needs.

Note that kpexec directly executes commands into namespaces using nsenter. It is compatible with the ubiquitous container runtime runc, but incompatible with runtimes such as Kata Containers.

In this post, we looked at two Kubernetes-native approaches to debugging running containers: kubectl exec and kubectl debug. We investigated how kubectl debug works, and presented kubectl superdebug, a variation of kubectl debug that starts an ephemeral container sharing the same volumes as the target container and the same process namespace. Lastly, we reviewed some non-Kubernetes native approaches to container debugging.

Debugging Running Pods on Kubernetes was originally published in Dataminded on Medium, where people are continuing the conversation by highlighting and responding to this story.

Twelve-Factor Python applications using Pydantic Settings

Jonathan Merlevede — Wed, 27 Sep 2023 13:26:49 GMT

A look at Pydantic Settings and how it can help you reliably deploy applications across environments

Python configuration

Pydantic Settings

Pydantic Settings is a Python package closely related to the popular Pydantic package. It allows defining type-checked “settings” objects that can be automatically populated from environment variables or a dotenv (.env) file.

As a small example, consider the following Python code snippet:

from pydantic_settings import BaseSettings, SettingsConfigDict

class TestSettings(BaseSettings, frozen=True):
    model_config = SettingsConfigDict(env_file=".env")
    foo: str
    bar: bool
    baz: int


if __name__ == "__main__":
    print(TestSettings(baz=1))

The test settings object initializes values that are not configured in the code with values sourced from the environment:

$ foo=bar BAR=true baz=5 python example.py
# foo='bar' bar=True baz=1

If settings remain unset, Pydantic refuses to create the settings object:

$ python example.py
# pydantic_core._pydantic_core.ValidationError: 2 validation errors for TestSettings
# foo
#   Field required [type=missing, input_value={}, input_type=dict]
#     For further information visit https://errors.pydantic.dev/2.3/v/missing
# bar
#   Field required [type=missing, input_value={}, input_type=dict]
#     For further information visit https://errors.pydantic.dev/2.3/v/missing

Pydantic additionally checks whether settings match their declared type:

$ foo=bar bar=baz python example.py
# pydantic_core._pydantic_core.ValidationError: 1 validation error for TestSettings
# bar
#   Input should be a valid boolean, unable to interpret input [type=bool_parsing, input_value='baz', input_type=str]
#     For further information visit https://errors.pydantic.dev/2.3/v/bool_parsing

If possible, Pydantic implicitly converts types; if you set “bar” to 1, Pydantic converts this to True. Settings can also be read automatically from a .env file:

$ echo -e "foo=bar\nbar=1\nbaz=5" > .env && python example.py
# foo='bar' bar=True baz=1

Twelve-factor applications

Now that we know what Pydantic Settings is, let’s look at what we need it for.

It is almost always a good idea™️ to separate application configuration from core application logic. A good way to do this is by injecting configuration through the environment. A fancier way of saying this is that Pydantic Settings helps create applications that adhere to the twelve-factor methodology. The twelve-factor methodology is an influential set of best practices that maximize portability between execution environments, intending to facilitate your application’s deployment. Injecting configuration through environment variables is a big part of that.

Say you build or package your application, and then deploy it to a development or integration environment with certain settings. Injecting all configurations into it from the environment enables moving your application from your testing to your production environment exactly as it was built and tested.

Why Pydantic Settings?

Separating configuration values from your core application code is definitely a good idea. Whether you should use Pydantic Settings for this is more circumstantial. Pydantic Settings does offer several advantages over using python-dotenv directly and/or reading from environment variables or configuration files from all over your application:

Validity. Pydantic Settings checks for the presence and types of your setting variables, allowing you to fail early in case of incorrect configuration.
Error messages. Pydantic Settings presents clear validation errors that tell you exactly which settings are missing or wrong.
Overriding. Environment settings can easily be overridden from within your code. This can be useful, e.g., to set settings to localized values when testing.
Loose coupling. Although the twelve-factor methodology is very specific about using environment variables, I consider this an implementation detail; the important part is that you keep configuration and code separated. Environment variables cannot always be considered sufficiently secure and are resolved at your application’s startup time, so there may be situations where retrieving configuration in another way is more appropriate. For example, Google recommends against storing secret configuration values in environment variables. Creating settings objects at the fringes of your application instead of reading os.environ directly everywhere decouples your application logic from how it retrieves configuration, making it easy to source settings from somewhere else, such as a database, a YAML file, or a secret vault. Although such functionality is not built into Pydantic Settings, loose coupling means you can implement it without changing most of your application.
Features. Pydantic (Settings) has many nice-to-have features not discussed in this introductory article. One example is that it allows cleanly defining custom validators. Another is its ability to mark strings as “secret”, which helps prevent them from ending up in logs. Check out the Pydantic and Pydantic Settings documentation to learn more!
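The loose-coupling argument does not depend on Pydantic Settings itself. The following standard-library sketch uses a plain frozen dataclass instead of Pydantic Settings, purely to illustrate the idea; AppSettings, from_mapping, and greeting are hypothetical names. The application logic only sees the settings object, so the source of the values can change without touching it.

```python
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class AppSettings:
    """Immutable settings object created once, at the edge of the application."""
    foo: str
    baz: int

    @classmethod
    def from_mapping(cls, source: Mapping[str, str]) -> "AppSettings":
        """Build settings from any string mapping: os.environ, a parsed file, a vault response."""
        return cls(foo=source["foo"], baz=int(source["baz"]))


def greeting(settings: AppSettings) -> str:
    """Application logic: depends on the settings object, not on where its values came from."""
    return f"{settings.foo} x {settings.baz}"
```

At the application's edge you might call AppSettings.from_mapping(os.environ); in a test, or when moving to another configuration source, you pass a different mapping and greeting stays unchanged. Unlike Pydantic Settings, this sketch performs only the most basic type conversion and reports missing values with a bare KeyError.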

Twelve-Factor Python applications using Pydantic Settings was originally published in Dataminded on Medium, where people are continuing the conversation by highlighting and responding to this story.

Upserting Data using Spark and Iceberg

Jonathan Merlevede — Thu, 25 May 2023 11:13:51 GMT

Use Spark and Iceberg’s MERGE INTO syntax to efficiently store daily, incremental snapshots of a mutable source table.

Iceberg allows tracking a table’s history by storing incremental diffs. Unfortunately, there are some caveats, and getting this to work as you likely want it to requires non-obvious querying. In this post, we look at the why and the how.

We use Spark as our analytical engine, but this post should also apply to other engines working with Iceberg.

Merge iceberg into table. Thanks, DALL-E!

Problem setting

A typical pattern in analytical data processing is to ingest data from an operational system on a daily basis. Often, we store previously ingested table versions in addition to the current state of things, for example, to support reproducing machine learning results. When working with “standard” Spark and Parquet, we do this by storing daily snapshots and by partitioning on the ingestion date.

We aim to use Apache Iceberg to achieve the same result — storing a queryable history of table snapshots — more efficiently. Iceberg is a project offering a metadata format and a set of execution engine plugins. This extends popular analytical processing engines like Spark, Flink, and Trino with features such as incremental updates, time travel, and ACID transactions.

Upserting

Using Spark with Iceberg unlocks the SQL MERGE INTO statement, which implements a table “upsert”, a portmanteau formed by combining a “table insert” and “table update”:

MERGE INTO prod.db.target t -- a target table
USING (SELECT ...) s        -- the source updates
ON t.id = s.id              -- condition to find updates for target rows
WHEN MATCHED AND <predicate_x> THEN DELETE
WHEN MATCHED AND <predicate_y> THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *

The code above uses the result of the SELECT statement to delete, update and insert rows from and into the table prod.db.target, depending on whether a source row's id already exists in the target table and on whether the placeholder conditions <predicate_x> or <predicate_y> are true. For an overview of the MERGE INTO statement, check out the Iceberg documentation here.

On the storage side, executing an upsert statement like the one above triggers Iceberg to create new data files corresponding to any modified partitions (~ copy-on-write) or to create small files expressing e.g. deletes (~ merge-on-read, available since Iceberg v2). Iceberg also creates new metadata files pointing to these new data files. Unchanged data files (partitions) are re-used. This allows efficient storing of snapshot series by keeping only the snapshot “deltas”. Earlier versions of the prod.db.target table can be “recalled” by time traveling, using the TIMESTAMP AS OF or VERSION AS OF clauses.

Iceberg table format spec. (source)

TL;DR Upserting allows us to keep our target copy up-to-date while maintaining a complete history of previous states of a source table, without storing full snapshots of the data.

MERGE INTO limitations

How can we use upserts to reconcile differences between an existing Iceberg table and a newly extracted snapshot?

Assume that we extract or create a mutable source table snapshot on a daily basis, and want to use it to upsert an Iceberg table called iceberg. Ideally, we would be able to write the following:

MERGE INTO iceberg
USING snapshot
ON iceberg.id = snapshot.id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED BY TARGET THEN INSERT *
WHEN NOT MATCHED BY SOURCE THEN DELETE

Unfortunately, this straightforward query does not work for two reasons.

Deletes. Unlike Delta, Iceberg does not support the syntax NOT MATCHED BY SOURCE. Iceberg’s NOT MATCHED clause corresponds to NOT MATCHED BY TARGET; that is, it fires when a row exists in snapshot but not in iceberg. This is problematic if rows can be deleted from the source system.

Superfluous copies. For rows that are the same in iceberg and snapshot, the clause WHEN MATCHED THEN UPDATE SET * results in an identical duplicate of the data being stored on your filesystem. This means that Iceberg will not bring storage benefits over storing multiple snapshots of the source table.

Under the hood, Iceberg decides which partitions it will re-write based on the ON condition (see this issue for a discussion on the impact on performance). Rows matched by the ON condition but not by any guard on the match conditions will, therefore, still be copied every time you run the upsert statement. Practically, this means that re-writing the MERGE INTO statement above to read roughly as follows still results in duplicate partitions being stored:

MERGE INTO iceberg
USING snapshot
ON iceberg.id = snapshot.id
-- condition expressing change is true if one or more columns is different
-- in iceberg and snapshot, i.e.
-- (iceberg.col1 != snapshot.col1) OR (iceberg.col2 != snapshot.col2) OR ...
WHEN MATCHED AND <condition expressing change> THEN UPDATE SET *
WHEN NOT MATCHED BY TARGET THEN INSERT *
WHEN NOT MATCHED BY SOURCE THEN DELETE

(More detailed analysis with references to source code below)

In fact, your MATCHED conditions are always re-written to include a catchall condition that emits the target row.

Iceberg will construct modified partitions differently depending on whether your MERGE INTO statement contains only MATCHED conditions, NOT MATCHED conditions or both. If only MATCHED conditions exist, a right outer join between target and source suffices (with source being on the right). If only NOT MATCHED conditions exist, Iceberg uses a left anti join and performs a simple append operation instead of re-writing partitions. If both MATCHED and NOT MATCHED conditions exist, a full outer join between target and source is required.

To determine which rows/partitions of the target to re-write, Iceberg performs a quick inner join between source and target tables. To support NOT MATCHED BY SOURCE, a right outer join would be required, as is implemented by Delta here.

Overcoming MERGE INTO limitations

Luckily, we can easily overcome these limitations of Iceberg’s upserting functionality. We do this by first preparing a table containing only the changes to the iceberg table. When dealing with sources where rows can be updated and deleted, this requires a full outer join or a sequence of anti-joins. Then, we can use this changes table as the source of updates for our MERGE INTO statement.

One way to do this is by using a CTE as follows:

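WITH changes AS (
  SELECT
    COALESCE(b.id, a.id) AS id,
    b.col1 AS col1,
    b.col2 AS col2,
    ...
    CASE WHEN b.id IS NULL THEN 'D' WHEN a.id IS NULL THEN 'I' ELSE 'U' END AS cdc
  FROM iceberg a
  FULL OUTER JOIN snapshot b ON a.id = b.id
  -- Keep deletes and inserts explicitly: comparing against the NULL side of the
  -- outer join yields NULL, which the NOT (...) filter alone would discard.
  -- Use the null-safe <=> instead of = if columns can contain NULLs.
  WHERE a.id IS NULL OR b.id IS NULL
     OR NOT (a.col1 = b.col1 AND a.col2 = b.col2 AND ...)
)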

MERGE INTO iceberg
USING changes
ON iceberg.id = changes.id
WHEN MATCHED AND changes.cdc = 'D' THEN DELETE
WHEN MATCHED AND changes.cdc = 'U' THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *

This results in only changes being stored in the target table iceberg, and supports insertions, updates, and deletes in the source table snapshot.

The table changes includes a row for every new insert, update or delete in the source table. To construct changes, we perform a full outer join between the recent snapshot of the source table (snapshot) and the existing Iceberg table (iceberg):

If a row existed in iceberg but no longer exists in snapshot (b.id is null), this corresponds to a delete operation in the source table.
If an id exists in snapshot but does not yet exist in iceberg (a.id is null), this corresponds to an insert operation in the source table.
If a row with a specific id exists in both iceberg and snapshot (both a.id and b.id are non-null), the row with this id was unchanged or updated. We filter out unchanged rows by specifying WHERE NOT (a.col1 = b.col1 AND a.col2 = b.col2 AND ...).

The changes table only has entries for rows actually requiring changes in the iceberg table, working around the problem of superfluous updates. Merging changes into iceberg using MERGE INTO is straightforward and works the way you would expect it to.
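The classification logic can be illustrated without Spark. The following Python sketch is an in-memory stand-in for the SQL above, not Spark or Iceberg code: tables are modeled as dicts mapping an id to a row, and the function names compute_changes and merge are hypothetical.

```python
def compute_changes(iceberg: dict, snapshot: dict) -> dict:
    """Return {id: (cdc, row)} for rows that differ between target and snapshot.

    Both arguments map an id to a row (a dict of column values).
    """
    changes = {}
    for row_id in iceberg.keys() | snapshot.keys():  # full outer join on id
        old, new = iceberg.get(row_id), snapshot.get(row_id)
        if new is None:
            changes[row_id] = ("D", None)  # gone from the source: delete
        elif old is None:
            changes[row_id] = ("I", new)   # new in the source: insert
        elif old != new:
            changes[row_id] = ("U", new)   # present in both but different: update
        # Identical rows are skipped, so they never reach the merge.
    return changes


def merge(iceberg: dict, changes: dict) -> dict:
    """Apply the changes the way the MERGE INTO statement above would."""
    merged = dict(iceberg)
    for row_id, (cdc, row) in changes.items():
        if cdc == "D":
            del merged[row_id]
        else:
            merged[row_id] = row
    return merged
```

Merging the computed changes into the old target reproduces the snapshot, while unchanged rows are never touched; that is the property that lets Iceberg re-use the data files containing them.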

This post looked at how we can leverage Iceberg to maintain a history of full table snapshots efficiently.

Iceberg requires some tinkering for it to work the way we want it to, but enables patterns that were previously impossible or inefficient to use with Spark. Iceberg extends Spark's capabilities with functionality that was previously only available within data warehouses like Snowflake. We hope that with time, Iceberg will become even easier to use. It's definitely a technology worth exploring!

Edit 21/11: Added some sentences to the introduction and re-wrote some sentences for clarity.

Upserting Data using Spark and Iceberg was originally published in Dataminded on Medium, where people are continuing the conversation by highlighting and responding to this story.

Make Gitpod Open Sites in the Browser

Jonathan Merlevede — Fri, 16 Dec 2022 15:24:10 GMT

Gitpod configuration for opening links in the browser instead of in the terminal.

The Lynx browser

At Data Minded, we ❤️ Gitpod for allowing us to easily configure reproducible and sharable development environments. However, we all have our pet peeves — and one small thing I do not like about Gitpod’s default configuration is that opening a URL from a terminal application fires up LYNX… and this despite Gitpod running in a perfectly usable browser 🌎! Luckily, we can change this default and configure Gitpod to open links in a browser window by using a startup script.

Gitpod helper

Gitpod’s helper application gp can open browser windows. As an example, the following command will open Data Minded’s website in a browser window:

gp preview --external https://www.dataminded.com/

Updating alternatives

We can configure the above command as the default way to open links. The idea is to create a small script /usr/local/bin/open.sh with the following contents:

#!/bin/sh
exec gp preview --external "$@"

Then, make it the default way to open links by running:

sudo chmod +x /usr/local/bin/open.sh
sudo update-alternatives --install /usr/bin/www-browser www-browser /usr/local/bin/open.sh 100

The reason for creating the script is that update-alternatives requires you to point to a single executable.

Startup configuration

You can make Gitpod perform the steps above when your workspace starts by adding the following start task to your .gitpod.yml file:

tasks:
  - command: |
      cat <<'EOF' | sudo install /dev/stdin /usr/local/bin/open.sh
      #!/bin/sh
      exec gp preview --external "$@"
      EOF
      sudo update-alternatives --install /usr/bin/www-browser www-browser /usr/local/bin/open.sh 100

Make Gitpod Open Sites in the Browser was originally published in Dataminded on Medium, where people are continuing the conversation by highlighting and responding to this story.