<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[Stories by Jonathan Merlevede on Medium]]></title>
        <description><![CDATA[Stories by Jonathan Merlevede on Medium]]></description>
        <link>https://medium.com/@jonathan.merlevede?source=rss-e29e99a9aba7------2</link>
        <image>
            <url>https://cdn-images-1.medium.com/fit/c/150/150/2*QwU7oOr_ONo6pDUyDixRvg.jpeg</url>
            <title>Stories by Jonathan Merlevede on Medium</title>
            <link>https://medium.com/@jonathan.merlevede?source=rss-e29e99a9aba7------2</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Tue, 19 May 2026 15:00:03 GMT</lastBuildDate>
        <atom:link href="https://medium.com/@jonathan.merlevede/feed" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
        <atom:link href="http://medium.superfeedr.com" rel="hub"/>
        <item>
            <title><![CDATA[Using AWS IAM with STS as an identity Provider]]></title>
            <link>https://medium.com/datamindedbe/using-aws-iam-with-sts-as-an-identity-provider-f5177ca0850c?source=rss-e29e99a9aba7------2</link>
            <guid isPermaLink="false">https://medium.com/p/f5177ca0850c</guid>
            <category><![CDATA[authentication]]></category>
            <category><![CDATA[kubernetes]]></category>
            <category><![CDATA[aws]]></category>
            <dc:creator><![CDATA[Jonathan Merlevede]]></dc:creator>
            <pubDate>Wed, 03 Dec 2025 15:05:13 GMT</pubDate>
            <atom:updated>2025-12-10T18:43:30.275Z</atom:updated>
            <content:encoded><![CDATA[<h3>Using AWS IAM with STS as an Identity Provider</h3><h4>How EKS tokens are created, and how we can use the same technique to use AWS IAM as an identity provider.</h4><blockquote><a href="https://welw.it/posts/using-aws-iam-with-sts-as-an-identity-provider-70e7cd19c499/">Also published on welw.it</a></blockquote><p>I recently tried to connect to an AWS EKS cluster from Python code in an environment that did not have the aws CLI installed, leaving me without a way to retrieve tokens using aws eks get-token. Looking for a Boto call or AWS API call for EKS tokens yielded no results. I decided to look at how these tokens are generated, and as it turns out, the bearer tokens authenticating you to EKS are pre-signed calls to the AWS STS API — specifically for the GetCallerIdentity endpoint.</p><p>Pre-signing calls to GetCallerIdentity lets you use IAM credentials to generate an identity token that works for authenticating to EKS and other contexts. Let’s dive in!</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1020/1*cAIQp5ROWM2j4YdIhjFbow.png" /><figcaption>Relationship between EKS and STS</figcaption></figure><h3>How we usually authenticate to EKS</h3><p>When using EKS, we typically create a cluster and then run aws eks update-kubeconfig to update our kubeconfig file <a href="https://docs.aws.amazon.com/eks/latest/userguide/create-kubeconfig.html">as described in the AWS documentation</a>.</p><p>For example, if we have a cluster named confused-blues-mushroom, we can run:</p><pre>aws eks update-kubeconfig --name confused-blues-mushroom</pre><p>This updates your ~/.kube/config file with an entry for the cluster, looking like so:</p><pre>apiVersion: v1<br>kind: Config<br>preferences: {}<br>current-context: arn:aws:eks:eu-west-1:299641483789:cluster/confused-blues-mushroom<br>clusters:<br>  - cluster:<br>      certificate-authority-data: &lt;base64-encoded-certificate&gt;<br>      server: &lt;cluster-endpoint&gt;<br>    name: arn:aws:eks:&lt;region&gt;:&lt;account-id&gt;:cluster/confused-blues-mushroom<br>users:<br>  - name: arn:aws:eks:&lt;region&gt;:&lt;account-id&gt;:cluster/confused-blues-mushroom<br>    user:<br>      exec:<br>        apiVersion: client.authentication.k8s.io/v1beta1<br>        command: aws<br>        args:<br>          - --region<br>          - &lt;region&gt;<br>          - eks<br>          - get-token<br>          - --cluster-name<br>          - confused-blues-mushroom<br>          - --output<br>          - json<br>contexts:<br>  - context:<br>      cluster: arn:aws:eks:&lt;region&gt;:&lt;account-id&gt;:cluster/confused-blues-mushroom<br>      user: arn:aws:eks:&lt;region&gt;:&lt;account-id&gt;:cluster/confused-blues-mushroom<br>    name: arn:aws:eks:&lt;region&gt;:&lt;account-id&gt;:cluster/confused-blues-mushroom</pre><p>The config file defines your cluster, a user (~credentials), and a context that ties the two together.</p><h3>A closer look at the user token</h3><p>The user entry tells us that we can obtain credentials for the cluster by running the following command:</p><pre>aws --region &lt;region&gt; eks \<br>  get-token \<br>  --cluster-name confused-blues-mushroom \<br>  --output json</pre><p>Doing so yields something like this:</p><pre>{<br>  &quot;kind&quot;: &quot;ExecCredential&quot;,<br>  &quot;apiVersion&quot;: &quot;client.authentication.k8s.io/v1beta1&quot;,<br>  &quot;spec&quot;: {},<br>  &quot;status&quot;: {<br>    &quot;expirationTimestamp&quot;: &quot;2025-11-25T22:38:50Z&quot;,<br>    &quot;token&quot;: &quot;k8s-aws-v1.aHR0cHM6Ly9zdHMuZXUtd2VzdC0xLmFtYXpvbmF3cy5jb20vP0FjdGlvbj1HZXRDYWxsZXJJZGVudGl0eSZWZXJzaW9uPTIwMTEtMDYtMTUmWC1BbXotQWxnb3JpdGhtPUFXUzQtSE1BQy1TSEEyNTYmWC1BbXotQ3JlZGVudGlhbD1BU0lBVUxSQUdHSUdVQ042UFBJUyUyRjIwMjUxMTI1JTJGZXUtd2VzdC0xJTJGc3RzJTJGYXdzNF9yZXF1ZXN0JlgtQW16LURhdGU9MjAyNTExMjVUMjIyNDUwWiZYLUFtei1FeHBpcmVzPTYwJlgtQW16LVNpZ25lZEhlYWRlcnM9aG9zdCUzQngtazhzLWF3cy1pZCZYLUFtei1TZWN1cml0eS1Ub2tlbj1JUW9KYjNKcFoybHVYMlZqRUs3JTJGJTJGJTJGJTJGJTJGJTJGJTJGJTJGJTJGJTJGd0VhQ1dWMUxYZGxjM1F0TVNKSU1FWUNJUUR3c2l2cHdJdDVTVzdnZVFvV3F4TXA4UndZM1k0UFRYSmQ3ZFFBUktOZmhBSWhBTVo3Q1lPR3YlMkZjTEhDZ29CVVFVYWhxQlcwbllmT250RmxuaThuRGQyJTJCREdLcU1EQ0hjUUFob01Nams1TmpReE5EZ3pOemc1SWd6TiUyRmFYamlENkVlV3BpemdRcWdBUEV6b3hRQmdhbWZlQ3FaSEFnM3h3MndZSmExQmlGdHVxQWgyUWdCd2VMQ0xKJTJCY1U1WjhXQzJxTnFwcFN5QTRDSWVaVXJLY2xFUWEwSVFtTFVwMHA1QUZUWmZOV2ZwYXJEZEt5dldTY2Zzd3RNR0pEanBLem1TQlh5UE9FeVlqVlpWdnZVcGJzc2p3TUp1bmRkY09sbjdac2Q5biUyQlBLaTV0JTJCZ1JZbU9hcTFqY285TUMyOVB6WnJrZ3FteERDOCUyRmZHc1k0a1FpVTklMkZndE0lMkJ6JTJCaXZ5YkhFSnV0Z2p5dkhFeG1ncFZmcFJ1d0lEdkFnRXBaTWFUTDNmTVhOczRHSmMwaHVHMWFVMjBNNGNHakg5Z1BVWmpaR2hoY0plYmxNV2dBJTJCV1l4d29XckhpTiUyQnBHNVpwJTJGQUhaV2pONHh3blY1b2Z5UUt4WUl5c0hZQzVsT1hjTWk0bFV0SSUyQnFScHRGVEVsWGpCWUwxd0dmVHFlcHZGSzJhbHVZbGgyU3h2SjhjTTYxaUF2bnZkOU5ac2slMkZsWWROSUZjZUlBVXJleDZWTHdDcXc5UmQxcFd6Znk5N1NKZUVQTzJVY1YxZk5DckZCSW5RJTJGZllmVjNCTk5EdlhlYnoxVURvbHZwcDZvVE84MVJySVowUDZlRWpLNkcxVGVrNUgxVzdUSVh2TTFQeVFmYlF3eXNXWXlRWTZvd0hjVldMTVlmY1AlMkIlMkZEQ0VEQ042RXVCZTRBNnpiZWNlMzRmbEdQNWlVTG1HJTJCVDQ1TkZlOFNmeU9KV01JR2xoRnpyMXhoVHVPRUZYY2hnQ0NJaE9GNTZiSzJvWENGZnZxQSUyRnJSNzlMNHdxaGlGeXZ4WEx5YlBia2tMV25wa1ElMkJLdEpIb1ZNJTJGRlFwdEh2dlJZSFQyMndlTElnV1JHZWNHRFhsS3hndGZLdExnV213aWVjNWlWd1BUb1NmUWxCViUyQjhvS3pkRWtQTXo2UCUyQlgyQ09HSzVESEpJNHpBeiZYLUFtei1TaWduYXR1cmU9MDAxYzZkZmY3YjFkNDAxZGQzMjlmODQwZTZhYjIxY2RjYzE4OGJmYTU2YTg3ZDBkMTdjYTc1NGY4YjE1M2VmZA&quot;<br>  }<br>}</pre><p>Investigating the token further, we see that it consists of a prefix k8s-aws-v1., followed by a URL-safe base64-encoded string. Decoding this string, we get a pre-signed URL for the GetCallerIdentity API call:</p><pre>echo &quot;aHR0cHM6Ly9zdHMuYW1hem9uYXdzLmNvbS8_QWN0aW9uPUdldENhbGxlcklkZW50aXR5JlZlcnNpb249MjAxMS0wNi0xNSZYLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFTSUFVTFJBR0dJR1VDTjZQUElTJTJGMjAyNTExMjUlMkZ1cy1lYXN0LTElMkZzdHMlMkZhd3M0X3JlcXVlc3QmWC1BbXotRGF0ZT0yMDI1MTEyNVQyMjUxMzhaJlgtQW16LUV4cGlyZXM9NjAmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0JTNCeC1rOHMtYXdzLWlkJlgtQW16LVNlY3VyaXR5LVRva2VuPUlRb0piM0pwWjJsdVgyVmpFSzclMkYlMkYlMkYlMkYlMkYlMkYlMkYlMkYlMkYlMkZ3RWFDV1YxTFhkbGMzUXRNU0pJTUVZQ0lRRHdzaXZwd0l0NVNXN2dlUW9XcXhNcDhSd1kzWTRQVFhKZDdkUUFSS05maEFJaEFNWjdDWU9HdiUyRmNMSENnb0JVUVVhaHFCVzBuWWZPbnRGbG5pOG5EZDIlMkJER0txTURDSGNRQWhvTU1qazVOalF4TkRnek56ZzVJZ3pOJTJGYVhqaUQ2RWVXcGl6Z1FxZ0FQRXpveFFCZ2FtZmVDcVpIQWczeHcyd1lKYTFCaUZ0dXFBaDJRZ0J3ZUxDTEolMkJjVTVaOFdDMnFOcXBwU3lBNENJZVpVcktjbEVRYTBJUW1MVXAwcDVBRlRaZk5XZnBhckRkS3l2V1NjZnN3dE1HSkRqcEt6bVNCWHlQT0V5WWpWWlZ2dlVwYnNzandNSnVuZGRjT2xuN1pzZDluJTJCUEtpNXQlMkJnUlltT2FxMWpjbzlNQzI5UHpacmtncW14REM4JTJGZkdzWTRrUWlVOSUyRmd0TSUyQnolMkJpdnliSEVKdXRnanl2SEV4bWdwVmZwUnV3SUR2QWdFcFpNYVRMM2ZNWE5zNEdKYzBodUcxYVUyME00Y0dqSDlnUFVaalpHaGhjSmVibE1XZ0ElMkJXWXh3b1dySGlOJTJCcEc1WnAlMkZBSFpXak40eHduVjVvZnlRS3hZSXlzSFlDNWxPWGNNaTRsVXRJJTJCcVJwdEZURWxYakJZTDF3R2ZUcWVwdkZLMmFsdVlsaDJTeHZKOGNNNjFpQXZudmQ5TlpzayUyRmxZZE5JRmNlSUFVcmV4NlZMd0NxdzlSZDFwV3pmeTk3U0plRVBPMlVjVjFmTkNyRkJJblElMkZmWWZWM0JOTkR2WGViejFVRG9sdnBwNm9UTzgxUnJJWjBQNmVFaks2RzFUZWs1SDFXN1RJWHZNMVB5UWZiUXd5c1dZeVFZNm93SGNWV0xNWWZjUCUyQiUyRkRDRURDTjZFdUJlNEE2emJlY2UzNGZsR1A1aVVMbUclMkJUNDVORmU4U2Z5T0pXTUlHbGhGenIxeGhUdU9FRlhjaGdDQ0loT0Y1NmJLMm9YQ0ZmdnFBJTJGclI3OUw0d3FoaUZ5dnhYTHliUGJra0xXbnBrUSUyQkt0SkhvVk0lMkZGUXB0SHZ2UllIVDIyd2VMSWdXUkdlY0dEWGxLeGd0Zkt0TGdXbXdpZWM1aVZ3UFRvU2ZRbEJWJTJCOG9LemRFa1BNejZQJTJCWDJDT0dLNURISkk0ekF6JlgtQW16LVNpZ25hdHVyZT1kYmFjNGQ3MzM1NTU1ODllYWRkMTVhZGZiOGI4MGVkZmNkMjE2YzQ1MmQxZWM3MDEwNmNkNjUwNmViMWY0ZTUz&quot; \<br>| basenc -d --base64url<br># Returns:<br># https://sts.amazonaws.com/?Action=GetCallerIdentity&amp;Version=2011-06-15&amp;X-Amz-Algorithm=AWS4-HMAC-SHA256&amp;X-Amz-Credential=ASIAULRAGGIGUCN6PPIS%2F20251125%2Fus-east-1%2Fsts%2Faws4_request&amp;X-Amz-Date=20251125T225138Z&amp;X-Amz-Expires=60&amp;X-Amz-SignedHeaders=host%3Bx-k8s-aws-id&amp;X-Amz-Security-Token=IQoJb3JpZ2luX2VjEK7%2F%2F%2F%2F%2F%2F%2F%2F%2F%2FwEaCWV1LXdlc3QtMSJIMEYCIQDwsivpwIt5SW7geQoWqxMp8RwY3Y4PTXJd7dQARKNfhAIhAMZ7CYOGv%2FcLHCgoBUQUahqBW0nYfOntFlni8nDd2%2BDGKqMDCHcQAhoMMjk5NjQxNDgzNzg5IgzN%2FaXjiD6EeWpizgQqgAPEzoxQBgamfeCqZHAg3xw2wYJa1BiFtuqAh2QgBweLCLJ%2BcU5Z8WC2qNqppSyA4CIeZUrKclEQa0IQmLUp0p5AFTZfNWfparDdKyvWScfswtMGJDjpKzmSBXyPOEyYjVZVvvUpbssjwMJunddcOln7Zsd9n%2BPKi5t%2BgRYmOaq1jco9MC29PzZrkgqmxDC8%2FfGsY4kQiU9%2FgtM%2Bz%2BivybHEJutgjyvHExmgpVfpRuwIDvAgEpZMaTL3fMXNs4GJc0huG1aU20M4cGjH9gPUZjZGhhcJeblMWgA%2BWYxwoWrHiN%2BpG5Zp%2FAHZWjN4xwnV5ofyQKxYIysHYC5lOXcMi4lUtI%2BqRptFTElXjBYL1wGfTqepvFK2aluYlh2SxvJ8cM61iAvnvd9NZsk%2FlYdNIFceIAUrex6VLwCqw9Rd1pWzfy97SJeEPO2UcV1fNCrFBInQ%2FfYfV3BNNDvXebz1UDolvpp6oTO81RrIZ0P6eEjK6G1Tek5H1W7TIXvM1PyQfbQwysWYyQY6owHcVWLMYfcP%2B%2FDCEDCN6EuBe4A6zbece34flGP5iULmG%2BT45NFe8SfyOJWMIGlhFzr1xhTuOEFXchgCCIhOF56bK2oXCFfvqA%2FrR79L4wqhiFyvxXLybPbkkLWnpkQ%2BKtJHoVM%2FFQptHvvRYHT22weLIgWRGecGDXlKxgtfKtLgWmwiec5iVwPToSfQlBV%2B8oKzdEkPMz6P%2BX2COGK5DHJI4zAz&amp;X-Amz-Signature=dbac4d733555589eadd15adfb8b80edfcd216c452d1ec70106cd6506eb1f4e53</pre><p>Making a straightforward GET request to this URL returns something like this, which could be interpreted as the pre-signed URL being invalid:</p><pre>&lt;ErrorResponse xmlns=&quot;https://sts.amazonaws.com/doc/2011-06-15/&quot;&gt;<br>  &lt;Error&gt;<br>    &lt;Type&gt;Sender&lt;/Type&gt;<br>    &lt;Code&gt;SignatureDoesNotMatch&lt;/Code&gt;<br>    &lt;Message&gt;The request signature we calculated does not match the signature you provided. Check your AWS Secret Access Key and signing method. Consult the service documentation for details.&lt;/Message&gt;<br>  &lt;/Error&gt;<br>  &lt;RequestId&gt;1d84958b-0ed3-4491-a74b-dbc8c0a3c10a&lt;/RequestId&gt;<br>&lt;/ErrorResponse&gt;</pre><p>Inspection of the pre-signed URL reveals the parameter X-Amz-SignedHeaders=host%3Bx-k8s-aws-id, which tells us that the x-k8s-aws-id header should be included in the request. Assuming that $presigned is the pre-signed URL, the command</p><pre>curl -H &quot;x-k8s-aws-id: confused-blues-mushroom&quot; \<br>-H &quot;accept: application/json&quot; \<br>&quot;$presigned&quot;</pre><p>returns something like:</p><pre>{<br>  &quot;GetCallerIdentityResponse&quot;: {<br>    &quot;GetCallerIdentityResult&quot;: {<br>      &quot;Account&quot;: &quot;&lt;account-id&gt;&quot;,<br>      &quot;Arn&quot;: &quot;arn:aws:sts::&lt;account-id&gt;:assumed-role/&lt;role-name&gt;/&lt;username&gt;&quot;,<br>      &quot;UserId&quot;: &quot;AROAULRAGGIG6OJUH7R6U:jonathan.merlevede@dataminded.com&quot;<br>    },<br>    &quot;ResponseMetadata&quot;: {<br>      &quot;RequestId&quot;: &quot;38591f47-fd34-4145-bc81-33047c54e44a&quot;<br>    }<br>  }<br>}</pre><p>If you receive the EKS token, you can decode it and call the embedded pre-signed URL. You then get a lot of information about the identity of the caller; you know its role session ARN arn:aws:sts::&lt;account-id&gt;:assumed-role/&lt;role-name&gt;/&lt;username&gt;, which is tied to the role with ID &lt;role-id&gt; and tagged with userid &lt;userid&gt; (<a href="https://docs.aws.amazon.com/IAM/latest/UserGuide/reference_policies_variables.html#principaltable">docs</a>).</p><p>Understanding all of this is helpful for several reasons. You now know that:</p><ul><li>It should be easy to replace the aws CLI with a more lightweight alternative.</li><li>You can create your own token generator fairly easily, which can be useful in some environments where the aws CLI is not available — like Lambda functions.</li><li>You can use this technique to support authentication using AWS IAM credentials in your own services.</li></ul><h3>Lightweight AWS alternative</h3><p>The aws CLI is a rather heavyweight dependency if all you use it for is token creation. You can use a lighter alternative instead, such as <a href="https://github.com/kubernetes-sigs/aws-iam-authenticator">aws-iam-authenticator</a>. Their GitHub page does a pretty good job of explaining the above process, too.</p><p>To use aws-iam-authenticator instead of aws, install it and adapt the user entry in your kubeconfig file as follows:</p><pre>users:<br>  - name: arn:aws:eks:&lt;region&gt;:&lt;account-id&gt;:cluster/confused-blues-mushroom<br>    user:<br>      exec:<br>        apiVersion: client.authentication.k8s.io/v1beta1<br>        command: aws-iam-authenticator<br>        args:<br>          - token<br>          - -i<br>          - confused-blues-mushroom</pre><p>Indeed, the output of the aws-iam-authenticator command is exactly the same as the output of the aws eks get-token command.</p><h3>Generating your own tokens</h3><p>You can also generate tokens yourself. It helps if you can use a library to handle the heavy lifting — <em>in casu</em>, the AWS Signature v4 (SigV4) signing.</p><p>The README documentation of aws-iam-authenticator provides a great example of how to do this using Python (<a href="https://github.com/kubernetes-sigs/aws-iam-authenticator/blob/master/README.md#api-authorization-from-outside-a-cluster">link</a>):</p><pre>import base64<br>import boto3<br>import re<br>from botocore.signers import RequestSigner<br>def get_bearer_token(cluster_id, region):<br>    STS_TOKEN_EXPIRES_IN = 60<br>    session = boto3.session.Session()<br>    client = session.client(&#39;sts&#39;, region_name=region)<br>    service_id = client.meta.service_model.service_id<br>    signer = RequestSigner(<br>        service_id,<br>        region,<br>        &#39;sts&#39;,<br>        &#39;v4&#39;,<br>        session.get_credentials(),<br>        session.events<br>    )<br>    params = {<br>        &#39;method&#39;: &#39;GET&#39;,<br>        &#39;url&#39;: &#39;https://sts.{}.amazonaws.com/?Action=GetCallerIdentity&amp;Version=2011-06-15&#39;.format(region),<br>        &#39;body&#39;: {},<br>        &#39;headers&#39;: {<br>            &#39;x-k8s-aws-id&#39;: cluster_id<br>        },<br>        &#39;context&#39;: {}<br>    }<br>    signed_url = signer.generate_presigned_url(<br>        params,<br>        region_name=region,<br>        expires_in=STS_TOKEN_EXPIRES_IN,<br>        operation_name=&#39;&#39;<br>    )<br>    base64_url = base64.urlsafe_b64encode(signed_url.encode(&#39;utf-8&#39;)).decode(&#39;utf-8&#39;)<br>    # remove any base64 encoding padding:<br>    return &#39;k8s-aws-v1.&#39; + re.sub(r&#39;=*&#39;, &#39;&#39;, base64_url)</pre><p>A token generated by this function can be used as a bearer token in calls to the Kubernetes API.</p><h3>Supporting IAM authentication in your own services</h3><p>You can use this technique to support IAM authentication in our own services. That’s also the idea behind aws-iam-authenticator, which allows you to add IAM authentication to self-managed Kubernetes clusters.</p><blockquote><em>In fact</em>, aws-iam-authenticator even predates Amazon EKS! EKS adopted the authentication approach introduced by aws-iam-authenticator, standardizing it.</blockquote><p>The mechanics are straightforward:</p><ul><li>The x--prefixed header(s) that you add to your call to AWS STS ensure that your pre-signed URL is used only in the context of the service that you are targeting (e.g., a specific EKS cluster). They serve as what would be known as <a href="https://sergiodxa.com/articles/oauth2-audience-explained">your token’s </a><a href="https://sergiodxa.com/articles/oauth2-audience-explained">aud claim in OIDC</a> or your assertion’s <a href="https://mojoauth.com/glossary/saml-audience-restriction/">audience restriction</a> in SAML.</li><li>On the protected resource side, validate incoming tokens by calling the pre-signed URL you receive with the appropriate headers. This is not too different from how <a href="https://www.oauth.com/oauth2-servers/token-introspection-endpoint/">OAuth with token introspection</a> works.</li></ul><p>Several services besides EKS use this method. It is, for example, how HashiCorp Vault’s IAM auth method works:</p><p><a href="https://developer.hashicorp.com/vault/docs/auth/aws#iam-auth-method">AWS - Auth Methods | Vault | HashiCorp Developer</a></p><p>Note that STS is not the <em>perfect</em> identity provider for several reasons, including but not limited to:</p><ul><li>Generating the token is somewhat complicated; it does not follow a “standard” flow (think the OAuth client credentials flow) and requires SigV4 signing.</li><li>STS calls are free, but e.g. throttling might become an issue. <a href="https://docs.aws.amazon.com/IAM/latest/UserGuide/reference_iam-quotas.html#reference_iam-quotas-sts-requests">The default quota allows 600 requests per second</a>.</li><li>Having to call the pre-signed URLs for all incoming requests imposes a load on your protected resource. Self-contained tokens such as JWS-encoded tokens (~JWT) are typically better in this regard.</li><li>You will have to validate the incoming pre-signed URL before calling it for security reasons.</li></ul><h3>Summary</h3><p>We explored how EKS uses AWS STS to construct bearer tokens for Kubernetes API access by pre-signing calls to GetCallerIdentity. This technique is not limited to EKS — you can use it to add IAM authentication to your own services, just like HashiCorp Vault does. Whether you need to create tokens in environments without the aws CLI or want to build your own IAM-based authentication system, understanding this pattern opens up some interesting possibilities.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=f5177ca0850c" width="1" height="1" alt=""><hr><p><a href="https://medium.com/datamindedbe/using-aws-iam-with-sts-as-an-identity-provider-f5177ca0850c">Using AWS IAM with STS as an identity Provider</a> was originally published in <a href="https://medium.com/datamindedbe">Dataminded</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Overcoming Corporate Distrust]]></title>
            <link>https://medium.com/@jonathan.merlevede/overcoming-corporate-distrust-5abb2682a6ca?source=rss-e29e99a9aba7------2</link>
            <guid isPermaLink="false">https://medium.com/p/5abb2682a6ca</guid>
            <category><![CDATA[security]]></category>
            <category><![CDATA[linux]]></category>
            <category><![CDATA[web-development]]></category>
            <dc:creator><![CDATA[Jonathan Merlevede]]></dc:creator>
            <pubDate>Sat, 04 Oct 2025 22:00:32 GMT</pubDate>
            <atom:updated>2025-10-06T10:51:26.801Z</atom:updated>
            <content:encoded><![CDATA[<h4>How to trust your organization’s self-signed certificates and deal with applications like ZScaler man-in-the-middling traffic</h4><p>When your enterprise environment intercepts and inspects HTTPS traffic at a proxy, that proxy is operating as a <strong>man-in-the-middle (MITM)</strong>. This is exactly the situation that HTTPS is designed to protect against. To get your applications to play nice with a proxy like ZScaler, you have to force them to trust its root certificate, which is not part of the globally accepted set of root certificates. System administrators or your proxy client (e.g. Zscaler Client Connector) generally ensure that your system’s trust bundle, used by applications such as your browser and the OS itself, includes these root certificates.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*AtB-bHteXTR2lVDQtNjWIg.jpeg" /><figcaption><em>What is really happening. </em><a href="https://oyrendik.blogspot.com/2024/09/what-is-proxy-server.html"><em>Image source</em></a></figcaption></figure><h3>The problem</h3><p>The presence of your company’s root certificates allows you to perform tasks like browsing the internet. Unfortunately, not all applications use your system’s trust bundle by default, resulting in errors like this:</p><pre>curl: (60) SSL certificate problem: self-signed certificate in certificate chain</pre><p>In fact, <em>most</em> scripts (Python, Node.js) do not integrate with the system trust bundle. Instead, they ship with their own bundles. I have spent more time than I would care to admit dealing with related issues.</p><p>This post lists some ways to create a trust bundle — preferably by extracting the system’s trust bundle into a PEM file — and how to get different applications to trust it by setting environment variables.</p><h3>Step 1: Creating a trust bundle</h3><p>The first step is to create a trust bundle that includes your corporate root certificates. I prefer to do this by extracting the system’s trust bundle into a PEM file. Depending on your environment, you may want to include public root certificates as well. It almost certainly does not hurt to include them.</p><p>Every operating system has its own way of storing and accessing the system trust bundle.</p><h4>On MacOS</h4><p>On MacOS, the system’s trust bundle is stored on your system keychain. You can extract it using the following convenient commands:</p><pre>security find-certificate -a -p /System/Library/Keychains/SystemRootCertificates.keychain &gt; ~/.bundle.pem<br>security find-certificate -a -p /System/Library/Keychains/System.keychain &gt;&gt; ~/.bundle.pem</pre><p>It appears to me that the SystemRootCertificates.keychain includes the certificates that ship with macOS, and the System.keychain includes additional certificates added by administrators or enterprise tools.</p><h4>On Linux</h4><p>Linux distributions generally store your system’s trust bundle already in a PEM file. All you have to do is figure out its location. This location depends on your specific distribution.</p><ul><li>On Debian-based systems (Debian, Ubuntu, Mint, …), it is located at /etc/ssl/certs/ca-certificates.crt.</li><li>On Red Hat-based systems (RHEL, CentOS, Fedora, Amazon Linux), it is located at /etc/pki/ca-trust/extracted/openssl/ca-bundle.trust.crt.</li></ul><blockquote><em>Your OS generally compiles this bundle from a collection of PEM files that are stored in a different location. </em><a href="https://documentation.ubuntu.com/server/how-to/security/install-a-root-ca-certificate-in-the-trust-store/"><em>On Debian/Ubuntu, for example</em></a><em>, certificates are stored in</em><em>/usr/local/share/ca-certificates/, and compiled into a bundle by the </em><em>update-ca-certificates command. On Red Hat-based systems, certificates are stored in</em><em>/etc/pki/ca-trust/source/anchors/ and compiled using </em><em>update-ca-trust.</em></blockquote><h4>On Windows and other systems</h4><p>Like MacOS, Windows stores your system’s trust bundle on your system keychain. Unlike MacOS, it does not provide convenient commands for extracting it into a PEM file.</p><p>Some combination of PowerShell commands and/or calls to certutil allow you to extract the system trust bundle. This is non-trivial, and corporate environments often restrict any type of shell access on Windows machines anyway. So on Windows, <strong>instead of extracting the system trust bundle, I tend to construct &quot;my own&quot; instead</strong>, starting from Python&#39;s default certifi trust bundle and appending additional root certificates to it.</p><blockquote><em>Unlike MacOS, Windows does not make it easy to extract your system trust bundle.</em></blockquote><p>You can get the certifi trust bundle from <a href="https://github.com/certifi/python-certifi">its GitHub repository</a>:</p><pre>curl https://raw.githubusercontent.com/certifi/python-certifi/refs/heads/master/certifi/cacert.pem &gt; ~/.bundle.pem</pre><p>To get the additional certificates to trust, that is, the certificates used by the corporate proxy, you have several options:</p><ul><li>You can ask your admin for it;</li><li>You can often find it hosted on Sharepoint or Confluence;</li><li>You can also fetch it from the “man-in-the-middle” yourself.</li></ul><p>I prefer the last option. If openssl is installed, you can use it to inspect served certificates:</p><pre>hostname=google.com # any other hostname serving the untrusted certificate<br>openssl s_client -showcerts -connect &quot;$hostname&quot;:443 &lt;/dev/null</pre><p>You can craft a script to create a PEM file from this output automatically if you like.</p><p>If OpenSSL is not available, you can inspect the certificate served by the proxy from your browser’s developer tools or navigate to your PC’s trust store and export it from there. You can find some more detailed instructions on how to do this <a href="https://dev.to/elphynomenon/resolving-ssl-certificate-issues-with-zscaler-4hba">here</a>.</p><h3>Step 2: Get applications to use the trust bundle</h3><p>Now that we have a trust bundle, we need to get applications to use it. There are many ways to do this, and they unfortunately depend on the application you are dealing with. The most common and universal way to configure this is by setting environment variables.</p><p>Depending on the application you are dealing with, you may need to set one or more environment variables. This is a list of the most common ones:</p><p>These more or less apply cross-platform. On UNIX systems, you can set them by running the following commands:</p><pre>export SSL_CERT_FILE=~/.bundle.pem<br>export SSL_CERT_DIR=~/.bundle.pem<br>export CURL_CA_BUNDLE=~/.bundle.pem<br>export REQUESTS_CA_BUNDLE=~/.bundle.pem<br>export NODE_EXTRA_CA_CERTS=~/.bundle.pem<br>export CARGO_HTTP_CAINFO=~/.bundle.pem<br>export GIT_SSL_CAINFO=~/.bundle.pem</pre><blockquote><em>I shamelessly copied this list from the excellent </em><a href="https://coder.github.io/httpjail/guide/platform-support.html"><em>httpjail</em></a></blockquote><p>You can set these “permanently” by exporting them from your .bashrc, .zshrc or any other shell configuration file.</p><pre>conffile=~/.bashrc<br>echo &quot;export SSL_CERT_FILE=~/.bundle.pem&quot; &gt;&gt; &quot;$conffile&quot;<br>echo &quot;export SSL_CERT_DIR=~/.bundle.pem&quot; &gt;&gt; &quot;$conffile&quot;<br>echo &quot;export CURL_CA_BUNDLE=~/.bundle.pem&quot; &gt;&gt; &quot;$conffile&quot;<br>echo &quot;export REQUESTS_CA_BUNDLE=~/.bundle.pem&quot; &gt;&gt; &quot;$conffile&quot;<br>echo &quot;export NODE_EXTRA_CA_CERTS=~/.bundle.pem&quot; &gt;&gt; &quot;$conffile&quot;<br>echo &quot;export CARGO_HTTP_CAINFO=~/.bundle.pem&quot; &gt;&gt; &quot;$conffile&quot;<br>echo &quot;export GIT_SSL_CAINFO=~/.bundle.pem&quot; &gt;&gt; &quot;$conffile&quot;</pre><p>On Windows, you can set environment variables from the System UI panel, or you can create a shortcut which sets them locally.</p><p>More approaches for different applications can be found here:</p><p><a href="https://help.zscaler.com/zia/adding-custom-certificate-application-specific-trust-store">https://help.zscaler.com/zia/adding-custom-certificate-application-specific-trust-store</a></p><h3>What about Docker?</h3><p>Applications running inside a container (e.g. Docker) will use the trust bundle that is part of the container image. Essentially all container images come with a trust bundle, even the minimal ones like <a href="https://github.com/GoogleContainerTools/distroless">distroless</a>.</p><blockquote><em>Yes, indeed, this means that your perfect self-contained containerized applications still have a limited shelf life if they connect to the outside internet. Because even root certificates expire, you will have to rebuild at some point, even if the containerized application itself does not change.</em></blockquote><p>To ensure that your containerized application trusts your man-in-the-middle, you will have to either:</p><ul><li>bake your custom bundle into the image or</li><li>mount your custom bundle into the container, preferably shadowing the system trust bundle that is part of the image.</li></ul><p>Which one is most appropriate depends on your use case, but usually mounting is preferred. You may additionally have to set some of the environment variables listed above if you are using applications that do not use the system trust bundle, like we discussed above.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Jdbz5ecpm9uc0PiEx53kiQ.png" /><figcaption><em>Distroless images contain a trust bundle at </em><em>/etc/ssl/certs/ca-certificates.crt</em></figcaption></figure><p>For an Ubuntu- or distroless-based image, this can look like this:</p><pre>curl: (60) SSL certificate problem: self-signed certificate in certificate chain</pre><p>The process of building an image is itself containerized and may also require you to set similar settings. Docker Desktop generally uses your system’s trust bundle out of the box, but when you use podman on an OS that is not Linux, you will have to update the trust bundle of the (Linux) VM running your containers. You can do this using podman machine ssh commands.</p><p><em>Originally published at </em><a href="https://welw.it/posts/overcoming-corporate-distrust-59bc99e919e8/"><em>https://welw.it</em></a><em> on October 5, 2025.</em></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=5abb2682a6ca" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Authorizing AWS Principals on Azure]]></title>
            <link>https://medium.com/datamindedbe/authorizing-aws-principals-on-azure-2a9353a3f97f?source=rss-e29e99a9aba7------2</link>
            <guid isPermaLink="false">https://medium.com/p/2a9353a3f97f</guid>
            <category><![CDATA[azure]]></category>
            <category><![CDATA[authentication]]></category>
            <category><![CDATA[aws]]></category>
            <category><![CDATA[security]]></category>
            <dc:creator><![CDATA[Jonathan Merlevede]]></dc:creator>
            <pubDate>Mon, 22 Sep 2025 15:50:30 GMT</pubDate>
            <atom:updated>2026-04-14T08:16:36.718Z</atom:updated>
            <content:encoded><![CDATA[<h4>Use AWS IAM user- or session credentials to access Azure resources</h4><h4>How to delegate trust from Entra to AWS IAM through Cognito, authorizing Azure actions without needing long-lived credentials.</h4><p>AWS IAM principals can be granted access to AWS resources through AWS IAM policies. Unfortunately, those policies do not carry weight outside of AWS, and certainly not within Azure. But what if you do want to access Azure or Entra resources from AWS? How then can you assign Azure privileges to AWS IAM roles and users?</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*1wb3HqE1LrYzaFY4zvBjIw.png" /><figcaption>Azure does not recognize your IAM Principal’s authority (<a href="https://www.qm.clothing/store/p/not-my-king-t-shirt">image source</a>).</figcaption></figure><p>On Azure, not AWS IAM but another service rules the roost. Microsoft Entra ID — the product formerly known as Azure Active Directory — manages user identities, apps, and access to Microsoft resources, including Azure, Microsoft 365, and even other applications that support Microsoft Entra ID (<a href="https://learn.microsoft.com/en-us/azure/azure-arc/servers/cloud-native/identity-access#microsoft-entra-integration">source</a>). The latter notably do not include AWS applications.</p><p>This post shows how to use AWS Cognito as a bridge to generate OIDC tokens using AWS IAM-derived privileges, and how to exchange those for Microsoft Identity Platform tokens authorizing Azure (Entra) actions. We show how to set up the required infrastructure using Terraform.</p><p>For those seeking a deeper understanding of authentication, this post also includes a second part that presents the classical approach to machine-to-machine (M2M) authentication, explores what it means to trust, and touches on what it takes to reverse the scenario — setting up AWS to trust Entra principals.</p><h3>Workload Identity Federation</h3><p>You can configure Entra application registrations to trust a third-party OpenID Provider (OP). Entra refers to this as “Federated Credentials” and also as “<a href="https://learn.microsoft.com/en-us/entra/workload-id/workload-identity-federation">Workload Identity Federation</a>”.</p><p>After configuring Entra to trust a third-party OP, the JWT tokens issued by the third party allow obtaining Microsoft tokens through the client credentials flow with token assertions (see below, section Certificate-Based Authentication). You do not pin secrets or public keys inside Entra; instead, Microsoft validates the tokens you present to it using public keys it retrieves from the OIDC well-known endpoint you configure (see below, section OpenID Connect). Private key rotation is part of the protocols, and OPs generally rotate signing keys automatically.</p><h3>Setting up AWS and Entra for trust delegation</h3><p>We can configure Entra to trust a third-party OP. Unfortunately, AWS STS is not an OP, and AWS credentials are not OIDC JWT tokens. Luckily, AWS offers an OIDC-compatible identity service in the form of <a href="https://docs.aws.amazon.com/cognito/latest/developerguide/cognito-user-pools.html">Cognito User Pools</a>, and we can protect access to it using AWS IAM.</p><p>We demonstrate how to configure user pools to permit only IAM-protected login flows, allowing the exchange of AWS credentials for short-lived Cognito JWT tokens. We then show how to configure Entra to trust Cognito. Finally, we present the whole flow, culminating in Microsoft identity platform bearer tokens.</p><h4>Cognito</h4><p>Cognito <a href="https://docs.aws.amazon.com/cognito/latest/developerguide/authentication-flows-public-server-side.html#amazon-cognito-user-pools-client-side-authentication-flow">allows users to obtain JWT tokens in several ways</a>:</p><ul><li>Cognito supports OAuth flows, including the client <a href="https://www.oauth.com/oauth2-servers/access-tokens/client-credentials/">credentials grant</a> and the <a href="https://www.oauth.com/oauth2-servers/server-side-apps/authorization-code/">authorization code grant</a>.</li></ul><p>Additionally, three Cognito AWS API calls result in JWT tokens:</p><ul><li>The <a href="https://docs.aws.amazon.com/cognito-user-identity-pools/latest/APIReference/API_InitiateAuth.html">InitiateAuth AWS API call</a>, also known as Cognito’s “client-side authentication flow”. This call is public and not protected by AWS IAM.</li><li>The <a href="https://docs.aws.amazon.com/cognito-user-identity-pools/latest/APIReference/API_GetTokensFromRefreshToken.html">GetTokensFromRefreshToken AWS API call</a>, which is also public and not protected by AWS IAM.</li><li>The <a href="https://docs.aws.amazon.com/cognito-user-identity-pools/latest/APIReference/API_AdminInitiateAuth.html">AdminInitiateAuth AWS API call</a>, also known as Cognito’s “server-side authentication flow”. This call is protected by AWS IAM.</li></ul><p>To the extent possible, we will disable OAuth flows and flows compatible with InitiateAuth at the Cognito client level, as these are not IAM-protected.</p><h4>AWS Infrastructure</h4><p>On the AWS side, create a Cognito user pool, register a Cognito application, and create a Cognito user. All these resources are free for the given settings.</p><p><strong>User pools</strong> require only minimal configuration. You can create one through Terraform as follows:</p><pre>resource &quot;aws_cognito_user_pool&quot; &quot;this&quot; {<br>  name             = &quot;demo&quot;<br>  alias_attributes = [&quot;preferred_username&quot;]<br>  admin_create_user_config {<br>    allow_admin_create_user_only = true<br>  }<br>}</pre><p><strong>Cognito clients</strong> require more configuration. At the time of writing, setting up a client through the Console web UI always results in at least one enabled OAuth flow, which cannot be disabled. The Cognito API, however, allows the creation of an application that allows only the AdminInitiateAuth API call as desired. You can create such an application using Terraform as follows:</p><pre>resource &quot;aws_cognito_user_pool_client&quot; &quot;this&quot; {<br>  name            = &quot;demo&quot;<br>  user_pool_id    = aws_cognito_user_pool.this.id<br>  generate_secret = false<br>  allowed_oauth_flows_user_pool_client = false<br>  allowed_oauth_flows                  = []<br>  enable_token_revocation              = false<br>  explicit_auth_flows = [&quot;ALLOW_ADMIN_USER_PASSWORD_AUTH&quot;]<br>  id_token_validity      = 60<br>  access_token_validity  = 60<br>  refresh_token_validity = 60<br>  token_validity_units {<br>    id_token      = &quot;minutes&quot;<br>    access_token  = &quot;minutes&quot;<br>    refresh_token = &quot;minutes&quot;<br>  }<br>}</pre><p>Despite not allowing the REFRESH_TOKEN_AUTH auth flowhere , at the time of writing AdminInitiateAuth always returns a refresh token, which can then be used together with the unprotected InitiateAuth and/or theGetTokensFromRefreshToken API calls depending on refresh token rotation configuration. We set the lifetime of refresh tokens to its minimum value (1 hour) to limit the impact of this weird behavior.</p><blockquote>Note that we do not create a Cognito M2M application supporting the client credentials flow. Firstly, this results in a client secret, which, although easily rotated, is not what we want here. Secondly, calls to Cognito’s token endpoint are not IAM-protected; again, this is not what we look for in this post. Thirdly, Cognito M2M applications <a href="https://aws.amazon.com/cognito/pricing/#Pricing_table_for_Amazon_Cognito_add-ons">are not free</a>.</blockquote><p>Lastly, we need a <strong>Cognito user</strong>. As this user is only able to obtain tokens using the IAM-protected AdminInitiateAuth call and not through password grants or InitiateAuth, its password only serves as an unnecessary second factor and does not have to remain secret:</p><pre>resource &quot;aws_cognito_user&quot; &quot;this&quot; {<br>  user_pool_id = aws_cognito_user_pool.this.id<br>  username     = &quot;dummyuser&quot;<br>  password     = &quot;dummyPassword1!&quot;<br>}</pre><p>Lastly, to be able to call AdminInitiateAuth your IAM principal needs rights:</p><pre>{<br>    &quot;Version&quot;: &quot;2012-10-17&quot;,<br>    &quot;Statement&quot;: [<br>        {<br>            &quot;Effect&quot;: &quot;Allow&quot;,<br>            &quot;Action&quot;: &quot;cognito-idp:AdminInitiateAuth&quot;,<br>            &quot;Resource&quot;: &quot;arn:aws:cognito-idp:REGION:ACCOUNT_ID:userpool/&lt;userpoolid&gt;&quot;<br>        }<br>    ]<br>}</pre><blockquote>If you have multiple Azure applications and want to scope access to specific ones through AWS IAM, consider creating one user pool, client, and user for each one. The <em>AdminInitiateAuth</em> IAM action does not support conditions to limit it to specific client IDs or users. There is a generous <a href="https://docs.aws.amazon.com/cognito/latest/developerguide/quotas.html#amazon-cognito-identity-pools-federated-identities-request-rate-quotas">limit of 1000 user pools</a> per region (can be increased to 10000).</blockquote><h4>Obtaining AWS Bearer Tokens</h4><p>If you are authenticated to AWS as a principal authorized to perform AdminInitiateAuth, obtain tokens as follows:</p><pre>aws cognito-idp admin-initiate-auth \<br>--region eu-west-1 \<br>--user-pool-id &quot;$user_pool_id&quot; \<br>--client-id &quot;$client_id&quot; \<br>--auth-flow ADMIN_USER_PASSWORD_AUTH \<br>--auth-parameters &#39;USERNAME=dummyuser,PASSWORD=dummyPassword1!&#39;</pre><p>This returns something like:</p><pre>{<br>  &quot;ChallengeParameters&quot;: {},<br>  &quot;AuthenticationResult&quot;: {<br>    &quot;AccessToken&quot;: &quot;&lt;JWT token&gt;&quot;,<br>    &quot;ExpiresIn&quot;: 3600,<br>    &quot;TokenType&quot;: &quot;Bearer&quot;,<br>    &quot;RefreshToken&quot;: &quot;&lt;JWT token&gt;&quot;,<br>    &quot;IdToken&quot;: &quot;&lt;JWT token&gt;&quot;<br>  }<br>}</pre><p>The access tokens retrieved using AdminInitiateAuth always have scope aws.cognito.signin.user.admin and no audience. Its payload looks as follows:</p><pre>{<br>  &quot;sub&quot;: &quot;&lt;uuid&gt;&quot;,<br>  &quot;event_id&quot;: &quot;&lt;another-uuid&gt;&quot;,<br>  &quot;token_use&quot;: &quot;access&quot;,<br>  &quot;scope&quot;: &quot;aws.cognito.signin.user.admin&quot;,<br>  &quot;auth_time&quot;: 1756773256,<br>  &quot;iss&quot;: &quot;https://cognito-idp.&lt;region&gt;.amazonaws.com/&lt;userpool-id&gt;&quot;,<br>  &quot;exp&quot;: 1756773556,<br>  &quot;iat&quot;: 1756773256,<br>  &quot;jti&quot;: &quot;&lt;yet-another-uuid&gt;&quot;,<br>  &quot;client_id&quot;: &quot;&lt;your application client id&gt;&quot;,<br>  &quot;username&quot;: &quot;dummyuser&quot;<br>}</pre><blockquote>Requesting custom scopes with Cognito is possible only when requesting tokens through OAuth flows.</blockquote><p>As is suggested in trust delegation scenarios, we will not use the access token. Instead, we will exchange the identity token for a Microsoft/Entra JWT token. The Cognito identity token has the Cognito client ID as the audience. The decoded payload of an identity token looks as follows:</p><pre>{<br>  &quot;sub&quot;: &quot;&lt;uuid of dummyuser&gt;&quot;,<br>  &quot;aud&quot;: &quot;&lt;your application client id&gt;&quot;,<br>  &quot;event_id&quot;: &quot;&lt;another-uuid&gt;&quot;,<br>  &quot;token_use&quot;: &quot;id&quot;,<br>  &quot;auth_time&quot;: 1756771643,<br>  &quot;iss&quot;: &quot;https://cognito-idp.&lt;region&gt;.amazonaws.com/&lt;userpool-id&gt;&quot;,<br>  &quot;cognito:username&quot;: &quot;dummyuser&quot;,<br>  &quot;exp&quot;: 1756775243,<br>  &quot;iat&quot;: 1756771643<br>}</pre><h4>Configuring Trust Relationship</h4><p>Now it is time to create an Entra application and configure it to trust our Cognito tokens (“federated credentials”). You can do so using the UI. The code below creates an application and configures trust using Terraform:</p><pre>resource &quot;azuread_application&quot; &quot;this&quot; {<br>  display_name = &quot;blogpost&quot;<br>}</pre><pre>resource &quot;azuread_application_federated_identity_credential&quot; &quot;this&quot; {<br>  application_id = azuread_application.this.id<br>  display_name   = &quot;cognito&quot;<br>  description    = &quot;Trust Cognito&quot;<br>  audiences      = [aws_cognito_user_pool_client.this.id]<br>  issuer         = &quot;https://${aws_cognito_user_pool.this.endpoint}&quot;<br>  subject        = aws_cognito_user.this.sub<br>}</pre><p>This configures Entra only to accept specific tokens:</p><ul><li>Tokens from your Cognito user pool, thanks to the issuer setting. Under the hood, Entra queries <a href="https://${aws_cognito_user_pool.this.endpoint}">https://${aws_cognito_user_pool.this.endpoint</a>}/.well-known/openid-configuration and retrieves public keys (JWKS) for validating signatures.</li><li>Tokens authenticating your specific Cognito user, thanks to the subject setting.</li><li>Identity tokens that were generated for your application/client, thanks to the audience setting.</li></ul><h4>Obtaining tokens</h4><p>Defining getcreds as an alias for the aws cognito-idp admin-initiate-auth command from before, you can now obtain Microsoft identity platform bearer tokens as follows (<a href="https://learn.microsoft.com/en-us/entra/identity-platform/v2-oauth2-client-creds-grant-flow#third-case-access-token-request-with-a-federated-credential">reference</a>):</p><pre>TENANT=&quot;&lt;your-tenant-id&gt;&quot;<br>AZURE_CLIENT_ID=&quot;&lt;your-app-registration-id&gt;&quot;<br>COGNITO_TOKEN=&quot;$(getcreds | jq .AuthenticationResult.IdToken -r )&quot;<br>TOKEN_URL=&quot;https://login.microsoftonline.com/$TENANT/oauth2/v2.0/token&quot;<br><br>curl -X POST &quot;$TOKEN_URL&quot; \<br>-H &quot;Content-Type: application/x-www-form-urlencoded&quot; \<br>-d &quot;client_id=$AZURE_CLIENT_ID&quot; \<br>-d &quot;grant_type=client_credentials&quot; \<br>-d &quot;scope=https://graph.microsoft.com/.default&quot; \<br>-d &quot;client_assertion_type=urn:ietf:params:oauth:client-assertion-type:jwt-bearer&quot; \<br>--data-urlencode &quot;client_assertion=$COGNITO_TOKEN&quot;</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/317/0*yzn8_NVQjUmu3VlZ" /></figure><p>If you want to understand better what happens here under the hood, or learn about alternate ways of establishing trust, read along.</p><h3>Classical Approach to Trust</h3><p>Establishing trust and authority with a sceptical party like Entra is <em>always</em> done by demonstrating to it that you have or know something that only an authorized actor should have or know.</p><p>In classical machine-to-machine interactions with Entra, this “secret something” is one of two things:</p><ul><li>A client ID and <strong>client secret</strong>. These work like classical usernames and passwords. Entra knows your secrets, and you show that you do too by a straightforward “show of hands”.</li><li>A <strong>certificate</strong> or public/private key pair. In this case, the public part of the secret is registered with Entra. You present Entra with a secret value derived from your private key. Entra can then validate that you are in possession of the private key using the corresponding public key.</li></ul><p>To bridge AWS IAM and Entra, you store the “secret somethings” in a location protected by AWS IAM, e.g. in <a href="https://docs.aws.amazon.com/secretsmanager/latest/userguide/intro.html">AWS Secrets Manager</a>, in the Systems Manager (SSM) <a href="https://docs.aws.amazon.com/systems-manager/latest/userguide/systems-manager-parameter-store.html">Parameter Store</a> or even as an object on <a href="https://aws.amazon.com/s3/">S3</a>. Then, to make authorized calls to Azure resources, you retrieve and use them.</p><blockquote>Certificates and private keys are considered more secure than client id/secret pairs or username/password combinations because private keys themselves are not transmitted, and the secrets derived from it are short-lived.</blockquote><p>The main problem with this basic approach is that “secret somethings” are usually long-lived, with rotation remaining a manual, infrequent, error-prone, and even forgotten-about process. Automatic rotation can alleviate this. We will get back to this below.</p><h3>Certificate-Based Authentication</h3><p>Before moving on, let’s dive a bit deeper into how certificates are used with Azure and Entra ID. This helps understand how similar the flow with federated credentials is.</p><p>To establish trust using certificates, you request Entra bearer tokens using the<a href="https://www.rfc-editor.org/rfc/rfc6749#section-1.3.4"> <strong>OAuth client credentials grant (</strong>RFC6749<strong>)</strong></a><strong> with </strong><a href="https://www.rfc-editor.org/rfc/rfc7523#page-5"><strong>JWT token assertions </strong>(RFC 7523)</a>. It works as follows:</p><ul><li>Generate an X.509 certificate. Entra requires you to pin specific certificates, so public key infrastructure (PKI) does not apply. You can and probably should use a self-signed certificate.</li><li>Register your certificate’s public key in the application registration.</li><li>Construct a JWT token that will serve as an assertion. Sign the JWT token using your certificate’s private key.</li><li>Obtain an Entra bearer token from Microsoft identity platform using the client credentials flow — that is, make a request to Microsoft’s token endpoint, exchanging the “assertion token” you signed for one that Microsoft signed.</li></ul><blockquote>You probably do not want to code this flow yourself; instead opt to use a library, such as <a href="https://learn.microsoft.com/en-us/entra/msal/python/">Microsoft’s Authentication Library (MSAL) for Python</a>.</blockquote><p>Microsoft identity platform trusts you because it can validate the signature of the assertion token using the public key you gave to Entra ID before. Only someone who knows the corresponding private key can create a valid signature. As a token of this trust (😏), Microsoft gives you a “bearer token”</p><p>The bearer token is again something proving that you are indeed worthy of trust, but this time to different untrusting parties: Azure resources that are trusting of the Microsoft identity platform and nothing else. You can use the MS bearer token to authorize calls to any Entra-protected API by embedding it into the Authorization header of your HTTP calls.</p><p>Using federated credentials works in the same way, except the assertion tokens you use are not self-signed JWT tokens, signed using a certificate you pinned in Entra, but rather tokens issued by the trusted OP.</p><h3>OpenID Connect</h3><p>At this point, you may wonder how services receiving a Microsoft Identity bearer token or assertion from your federated OP validate its veracity.</p><p><a href="https://learn.microsoft.com/en-us/entra/identity-platform/v2-protocols-oidc">Microsoft identity platform is an OpenID Provider (OP)</a>, meaning that it complies to the <a href="https://openid.net/specs/openid-connect-core-1_0.html">OpenID Connect (OIDC) specification</a>. Like most OPs, it additionally implements the <a href="https://openid.net/specs/openid-connect-discovery-1_0.html">OpenID Connect Discovery protocol</a>. In a nutshell, this means that:</p><ul><li>The bearer tokens it issues are JWT tokens, valid only for a limited duration and easily decoded.</li><li>The JWT tokens are signed using a private key known only to Microsoft.</li><li>The public keys corresponding to the private ones that Microsoft uses for signing its tokens can be retrieved from standardized “well-known” endpoints. Multiple public keys can be acceptable at the same time, allowing for seamless key rotation.</li></ul><p>Thanks to standardization, all you need to do to be able to trust tokens issued by the Microsoft identity platform (or any other OP) is know its well-known endpoint, hosted on an HTTPS URL. Thanks to the global <a href="https://en.wikipedia.org/wiki/Public_key_infrastructure">PKI</a> and its network of <a href="https://en.wikipedia.org/wiki/Certificate_authority">certificate authorities</a> (out of scope here), you know that the information you find there is trustworthy. You retrieve public keys from coordinates you find at the well-known endpoint and can use them to verify bearer tokens. Microsoft can and does rotate its certificates, without it causing any disruptions or requiring changes on the end of its clients.</p><h3>Reversed scenario</h3><p>Because Entra and the Microsoft Identity Platform are OpenID providers and AWS supports what it calls “web identity credentials”, authorizing Entra identities to access AWS resources is a lot easier than the reverse scenario discussed in this post.</p><p>To do so, register your Entra tenant <a href="https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_providers_create_oidc.html">as an OIDC web identity provider</a> on your AWS account. Then update your roles’ assume role policies appropriately. This enables the exchange of Entra tokens for temporary AWS credentials using <a href="https://docs.aws.amazon.com/STS/latest/APIReference/API_AssumeRoleWithWebIdentity.html">AssumeRoleWithWebIdentity API</a> calls. The exchange of JWT for and refreshing of AWS session credentials can happen automagically under the hood, e.g., by setting appropriate environment variables pointing to your Entra JWT token (AWS_WEB_IDENTITY_TOKEN_FILE) and setting the role you want to assume (AWS_ROLE_ARN) — so, unlike the solution above, this flow will work with all AWS SDKs out of the box without further customization.</p><blockquote>If your Entra tokens are highly privileged, consider exchanging them for less priviledged ones before sending them to AWS.</blockquote><h3>Conclusion</h3><p>This post leverages Cognito, AWS’s fully managed OIDC-compatible identity provider service, to exchange IAM credentials for Entra bearer tokens, without incurring costs or requiring any form of long-lived credentials. It goes on to show how certificate-based OAuth client credential flows work and how OIDC facilitates establishing trust, including seamless rotation of key pairs. Finally, it discussed that exchanging Azure credentials for AWS is possible in a more straightforward manner.</p><p>Creating a Cognito user pool, Cognito client, and especially a Cognito user in this setup is arguably awkward, and you may prefer using long-lived secrets on account of their relative simplicity. We did not outline how the solution above can be integrated with Azure SDKs to fetch and refresh tokens automatically. If you know of an easier way to exchange AWS IAM credentials for Entra ones, be sure to comment!</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/320/0*x9yqf_9IgTYLKmdk" /><figcaption>Source: South Park</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/780/1*G7I3DQwTd7ZojipZQTMrSA.png" /></figure><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=2a9353a3f97f" width="1" height="1" alt=""><hr><p><a href="https://medium.com/datamindedbe/authorizing-aws-principals-on-azure-2a9353a3f97f">Authorizing AWS Principals on Azure</a> was originally published in <a href="https://medium.com/datamindedbe">Dataminded</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Quack, Quack, Ka-Ching: Cut Costs by Querying Snowflake from DuckDB]]></title>
            <link>https://medium.com/datamindedbe/quack-quack-ka-ching-cut-costs-by-querying-snowflake-from-duckdb-f19eff2fdf9d?source=rss-e29e99a9aba7------2</link>
            <guid isPermaLink="false">https://medium.com/p/f19eff2fdf9d</guid>
            <category><![CDATA[platform-engineering]]></category>
            <category><![CDATA[cloud-computing]]></category>
            <category><![CDATA[data-engineering]]></category>
            <category><![CDATA[data-warehouse]]></category>
            <category><![CDATA[snowflake]]></category>
            <dc:creator><![CDATA[Jonathan Merlevede]]></dc:creator>
            <pubDate>Wed, 15 May 2024 10:15:17 GMT</pubDate>
            <atom:updated>2024-10-11T10:49:17.681Z</atom:updated>
            <content:encoded><![CDATA[<h4>The duck escapes with the credits.</h4><h4>How to leverage Snowflake’s support for interoperable open lakehouse technology — Iceberg — to save money.</h4><p><em>Update 10/2024: A lot has changed since this article was written. Snowflake </em><a href="https://docs.snowflake.com/en/release-notes/bcr-bundles/2024_05/bcr-1658"><em>stopped generating the version-hint.txt file since their 2024_05 bundle</em></a><em>, invalidating parts of the post below. Snowflake also released </em><a href="https://github.com/apache/polaris"><em>Polaris</em></a><em>, which is something to look into if you want to access your Snowflake data from different engines.</em></p><p>Snowflake <a href="https://www.snowflake.com/blog/build-open-data-lakehouse-iceberg-tables/">recently released extensive support for the open table format Iceberg</a>. Using open formats enhances data agility and reduces lock-in. This post explores leveraging this flexibility to decrease Snowflake’s high compute costs by using DuckDB to query Snowflake-managed data.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*tDSK7BTnZDNHOdDU" /></figure><h3>What is Apache Iceberg?</h3><p><a href="https://iceberg.apache.org/">Apache Iceberg</a> is a <strong>table format specification</strong> created by Netflix in 2017. In 2018, Netflix open-sourced the Iceberg project and donated it to the Apache Software Foundation.</p><p>Netflix designed Iceberg to overcome the limitations of data lakes that contain plain partitioned data files with minimal metadata — also known as Hive-formatted tables. These included performance problems (many file listings, many partitions, limited pruning) and the absence of features that had become common in data warehouses, such as time travel, schema evolution, and ACID transactions.</p><h4>Table format specification</h4><p>A table format specification is a standard way of writing <strong>metadata</strong> to define a table. Metadata allows tools to know what is in a dataset without having to read all of the data inside, but they can also assign different meanings to data — e.g., by marking it as non-current.</p><p>Apache Iceberg is <strong>not a storage format</strong>. You can store your Iceberg table’s data in formats such as <a href="https://parquet.apache.org/">Parquet</a>, <a href="https://orc.apache.org/">ORC</a>, or <a href="https://avro.apache.org/">Avro</a>; Iceberg is a standard way to organize metadata next to those data files.</p><h4>Open toolbox and interoperability</h4><p>Many engines and tools implement the Iceberg spec. Tools implementing the same spec can all interact with the same Iceberg tables, which is why Apache Iceberg is “multi-engine.” Most major engines, such as <a href="https://aws.amazon.com/athena/">AWS Athena</a>, <a href="https://trino.io/">Trino</a> (<a href="https://www.starburst.io/">Starburst</a>), DuckDB, and Snowflake, support Iceberg.</p><p>This interoperable approach fundamentally differs from what was common in the past. Databases like Oracle, <a href="https://www.vertica.com/">Vertica</a>, <a href="https://cloud.google.com/bigquery">BigQuery</a>, and so on store metadata and data in proprietary formats, presenting a challenge for seamless interoperability, requiring lots of data copying, and potentially leading to vendor lock-in.</p><h4>Paradigm shift</h4><p>By working with a centrally accessible format independent of the compute engine, compute engines become interchangeable. This allows us to use the most suited computing engine for a particular task, without having to move around data. Data written by one tool can immediately be read by another.</p><p>This architecture results in a <strong>paradigm shift</strong>, favoring data sharing over redundant data duplication across different computing engines.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*J5-X9uOAkJ-aGiRg" /><figcaption>Image adapted from <a href="https://www.youtube.com/watch?v=_GW3GYZK66U">https://www.youtube.com/watch?v=_GW3GYZK66U</a></figcaption></figure><h4>Featureful lakehouses</h4><p>In addition to facilitating interoperability, Apache Iceberg enables an ever-growing number of <strong>features</strong> that close the feature gap between data lakes and data warehouses, giving rise to what is now known as the <a href="https://www.techtarget.com/searchdatamanagement/definition/data-lakehouse">lakehouse</a>. These include <a href="https://www.starburst.io/blog/apache-iceberg-time-travel-rollbacks-in-trino/">time travel</a>, <a href="https://medium.com/snowflake/how-apache-iceberg-enables-acid-compliance-for-data-lakes-9069ae783b60">ACID transactions</a>, partition evolution, hidden partitioning, schema evolution, <a href="https://medium.com/insiderengineering/apache-iceberg-reduced-our-amazon-s3-cost-by-90-997cde5ce931">saving object storage cost</a>s, etc. This blog post only focuses on interoperability.</p><h3>Apache Iceberg and Snowflake</h3><p>On December 4, 2023, Snowflake <a href="https://www.snowflake.com/blog/build-open-data-lakehouse-iceberg-tables/">published a blog post</a> announcing their Apache Iceberg integration is in Public preview.</p><p>Snowflake now offers two ways to work with Iceberg tables:</p><ul><li><strong>External catalog.</strong> These tables are written externally, by a tool such as <a href="https://spark.apache.org/">Apache Spark</a>, <a href="https://flink.apache.org/">Apache Flink</a>, or even <a href="https://trino.io/">Trino</a>, to your object store and registered in an external catalog such as the <a href="https://cwiki.apache.org/confluence/display/hive/design">Hive Metastore</a>, the <a href="https://docs.aws.amazon.com/glue/latest/dg/catalog-and-crawler.html">AWS Glue Data Catalog</a>, or <a href="https://github.com/projectnessie/nessie">Nessie</a>. In this mode, tables are read-only from Snowflake<strong>.</strong></li><li><strong>Snowflake catalog. </strong>These tables are read-write from Snowflake and read-only externally<strong>.</strong></li></ul><p>In both cases, Snowflake stores all data and Iceberg metadata in your own (cloud) object storage. Both ways of working with Iceberg have merits. Given your situation, it should be clear which is the most appropriate.</p><blockquote>All data and Iceberg metadata are on your own (cloud) object storage.</blockquote><p>When using Iceberg tables with the Snowflake catalog, Snowflake behaves like it always does; it remains a “zero-ops” warehouse, and you can remain carefree while Snowflake performs storage maintenance operations like compaction, expiring snapshots, and cleaning up of orphaned files. Iceberg tables behave nearly identically to Snowflake-native tables, although <a href="https://docs.snowflake.com/en/user-guide/tables-iceberg#considerations-and-limitations">there are some limitations that you may want to check out</a>.</p><p>This post assumes that your data lives and breathes in Snowflake and that Snowflake is where your large-scale processing happens. Using the Snowflake catalog is then the right choice.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*szYOv4NyZ_sMN6f-.png" /><figcaption>Image from <a href="https://www.snowflake.com/blog/unifying-iceberg-tables/">https://www.snowflake.com/blog/unifying-iceberg-tables/</a></figcaption></figure><h4>Iceberg Catalog</h4><p>When using Iceberg tables with the Snowflake catalog, the “catalog” remains on Snowflake’s side. To determine whether this impedes our ability to interact with data directly, we should know what the metadata catalog does; after all, is a table’s metadata not stored in Iceberg’s metadata files? Catalogs bring at least two things to the table (pun intended):</p><ul><li><strong>Database abstraction</strong>. Iceberg is a specification for technical metadata at the table level, and Iceberg metadata files are stored next to your data files. The table specification is unaware of concepts such as table names, schemas, and databases or collections. A metadata catalog allows you to consider your “bag of tables” as a database by introducing hierarchy and storing a map of table names onto prefixes.</li><li><strong>Pointer to the current table version</strong>. When mutating an Iceberg table, new data and metadata files are added and stored next to the old ones. The catalog keeps track of table prefixes but must also know which metadata files are “current.”</li></ul><p>TL;DR: You need access to the catalog to know which table version is current, and to access tables by name and write queries as you are used to.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*KWIdgglhQIT6bpSl" /><figcaption>Bergs need careful filing in a metadata catalog.</figcaption></figure><h4>Iceberg Catalog SDK</h4><p>If you want to read your Iceberg tables using Spark, you’re in luck! Snowflake <a href="https://www.snowflake.com/blog/iceberg-tables-catalog-support-available-now/">released an Iceberg Catalog SDK for Spark</a>, which implements Spark’s catalog interface using an (otherwise undocumented) Snowflake Catalog API. Currently, this Snowflake functionality is free and does not require a running warehouse, cost “serverless credits,” or incur “cloud services” charges.</p><p>Snowflake’s <a href="https://www.snowflake.com/blog/iceberg-tables-catalog-support-available-now/">announcement</a> provides readily usable sample code and confirms that Spark reads Iceberg metadata and Parquet files directly from the customer-managed storage account:</p><blockquote>After making an initial connection to Snowflake via the Iceberg Catalog SDK, Spark can read Iceberg metadata and Parquet files directly from the customer-managed storage account. With this configuration, multiple engines can consistently read from a single copy of data.</blockquote><p>Unfortunately, this is not immediately helpful for querying from DuckDB. There is no Snowflake catalog SDK available for DuckDB. Luckily, we can use the file system directly to read our data.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*9DmknVKd5uKTPjl1" /><figcaption>Image slightly adapted from <a href="https://www.snowflake.com/blog/unifying-iceberg-tables/">https://www.snowflake.com/blog/unifying-iceberg-tables/</a></figcaption></figure><h4>Iceberg Filesystem Catalog</h4><p>If it seems possible to implement a catalog on top of a filesystem or object store through straightforward naming conventions, that is because it is! Indeed, <a href="https://iceberg.apache.org/docs/1.5.0/java-api-quickstart/#using-a-hadoop-catalog">Iceberg’s Hadoop catalog is just that</a>. Its <a href="https://iceberg.apache.org/javadoc/latest/org/apache/iceberg/hadoop/HadoopCatalog.html">class documentation</a> reads:</p><blockquote>HadoopCatalog […] uses a specified directory under a specified filesystem as the warehouse directory, and organizes multiple levels directories that mapped to the database, namespace and the table respectively. The HadoopCatalog takes a location as the warehouse directory. When creating a table such as $db.$tbl, it creates $db/$tbl directory under the warehouse directory, and put the table metadata into that directory.</blockquote><p>For Iceberg to know which metadata is the latest, it expects the filesystem tables’ metadata files to have names determined as a function of monotonically increasing version numbers. It also looks for an optional version-hint.text file pointing to the newest version.</p><blockquote><strong>Note</strong>: Writers maintain consistency and monotonically increasing versions by implementing the scheme documented <a href="https://iceberg.apache.org/spec/#file-system-tables">here</a>. Unfortunately, this requires storage systems to support <em>atomic renaming</em>, which many storage engines, notably S3, Google Cloud Storage, and Azure Blob Storage, do not do. This is one of the reasons why one of Iceberg’s original authors, Ryan Blue, has referred to the creation of Hadoop tables as “<a href="https://github.com/apache/iceberg/issues/3220#issuecomment-1344611016">one of his biggest mistakes</a>”. Even on storage systems supporting atomic renames you may see lower performance than when using a “proper” metadata catalog. The use of HadoopCatalog is generally discouraged for production use.</blockquote><p>Snowflake presumably uses a proprietary, highly performant catalog implementation in its backend. However, it is nice enough to materialize data and metadata on the customer-managed object storage in a way compatible with the Hadoop catalog — they even maintain a current version-hint.text file! This compatibility means that any reader with support for the Iceberg Hadoop catalog can read Snowflake data directly by pointing it to the root of the Iceberg warehouse on the object storage system.</p><h4>DuckDB</h4><p>DuckDB has partial support for the Iceberg Hadoop catalog and filesystem tables. While DuckDB unfortunately does not (yet?) support reading an entire warehouse, you <em>can</em> point it to a table prefix. DuckDB will then pick up on the version-hint.text file and read the latest version of the table.</p><h3>Creating an Iceberg table</h3><p>Getting Snowflake to create an Iceberg table on your cloud requires some configuration. The example below uses S3 as a storage layer, but Snowflake also <a href="https://docs.snowflake.com/en/user-guide/tables-iceberg-configure-external-volume#configure-an-external-volume">supports Google Cloud Storage and Azure Storage</a>. You can find a playbook for S3 here:</p><p><a href="https://github.com/datamindedbe/platform-quack-quack-ka-ching">GitHub - datamindedbe/platform-quack-quack-ka-ching: The duck escapes with the credits.</a></p><p>On a high level, this is what needs to happen:</p><ul><li>Provision storage: Create an S3 bucket and an IAM role for Snowflake and ensure that the IAM role has the necessary permissions to access the bucket.</li><li>Connect Snowflake to storage: Create a Snowflake <a href="https://docs.snowflake.com/en/sql-reference/sql/create-external-volume">External Volume</a>. In S3’s case, an external volume will create an IAM user on Snowflake’s account. You need to create a trust relationship so that IAM user can assume the role with access to your S3 bucket.</li></ul><p>We can finally create native iceberg tables in Snowflake with <a href="https://docs.snowflake.com/en/sql-reference/sql/create-iceberg-table#examples">CREATE ICEBERG TABLE</a>, and you can find your Parquet and Iceberg metadata files in the S3 bucket.</p><h3>Reading data from DuckDB</h3><p>Having established a secure connection between S3 and Snowflake and created Iceberg tables in Snowflake, let’s — finally — see how DuckDB facilitates querying them.</p><p>We use <a href="https://duckdb.org/docs/extensions/iceberg.html">DuckDB’s iceberg extension</a> to read the Iceberg tables we made in Snowflake directly from S3. Again, you can find the playbook <a href="https://github.com/datamindedbe/platform-quack-quack-ka-ching">here</a>. The main functionality is provided by the following iceberg_scan method:</p><pre>select * from iceberg_scan(&#39;s3://chapter-platform-iceberg/icebergs/line_item&#39;;)</pre><p>The iceberg_scan method fetches the tables from S3. You don’t have to point to the current manifest.json file explicitly because the version-hint.text is pointing to the current version of the tables.</p><p>We have now unlocked the real power of open table formats: we have the convenience of Snowflake and its catalog but can save costs by performing single-node queries on DuckDB.</p><blockquote>As of now, DuckDB does not support writing Iceberg tables — only reading. You can write out to parquet though, for example to S3 with COPY &lt;table_name&gt; TO &#39;s3://bucket/file.parquet&#39;;. However, even if DuckDB would support Iceberg writes, Snowflake would not — although you could register DuckDB’s output as an Iceberg table with external catalog in Snowflake.</blockquote><h3>Why is Snowflake doing this?</h3><p>If using Iceberg tables on Snowflake is a bit like having your cake <em>and</em> eating it, with Snowflake footing the bill, then why did Snowflake build this integration? The move makes sense in the context of fierce competition from Databricks. Both behemoths are trying to open up their systems to attract customers.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*Cz8hfPtrz9J12qHP" /></figure><p>Snowflake sends the message to its (prospective) customers that choosing Snowflake does not tie them to one vendor and that there is no risk of lock-in; with them, you always have the option to switch compute engines when you want. Databricks is behaving similarly by opening up its <a href="https://delta.io/">Delta Lake</a> format and better supporting Hudi and Iceberg through <a href="https://docs.delta.io/latest/delta-uniform.html">UniForm</a>.</p><p>Snowflake still wants to keep as much compute as possible on their systems. There is a clear path for moving external metadata to the Iceberg catalog, but going in the other direction is much more challenging. By owning the metadata catalog, Snowflake remains the compute engine of choice and the only writer. If Snowflake had not opened up its systems, it would likely have lost many customers who were afraid of lock-in.</p><h3>Conclusion</h3><p>Open table formats like Iceberg enable true separation of compute from storage. By using Snowflake’s Iceberg tables, you can continue enjoying Snowflake’s powerful and operations-free capabilities, while making it possible to occasionally escape its “walled garden.” Because Iceberg with Parquet has characteristics and features that are very similar to those of native Snowflake tables — like efficient compression, partition pruning, schema evolution, etc. —, and because Snowflake has implemented support for them, you should be able to use Iceberg tables instead of native tables without a significant impact on performance or functionality. We therefore suggest defaulting to using Iceberg tables with Snowflake.</p><p>This post demonstrated how easy it is to run a query on DuckDB instead of on expensive Snowflake compute by directly pointing it to Snowflake-managed data in your own object storage. There, you can even combine it with data that is not available in your Snowflake warehouse. Knowing that you can operate DuckDB from instances that cost around 10% of a comparably powered Snowflake warehouse, such an approach can come with significant cost savings. Of course, we do not mean to suggest DuckDB is a replacement for Snowflake. We do think this is a good demonstration of the power of interoperability.</p><p><em>This post is the result of a collaborative effort by </em><a href="https://medium.com/u/83eb1c80a9f1"><em>Jelle De Vleminck</em></a><em>, </em><a href="https://medium.com/u/adc70be939f7"><em>Robbert</em></a><em>, </em><a href="https://medium.com/u/cbce5a999a8e"><em>Moenes Bensoussia</em></a>,<em> and </em><a href="https://medium.com/u/e29e99a9aba7"><em>Jonathan Merlevede</em></a><em>.</em></p><ul><li>👏 If you liked this article, don’t forget to clap</li><li>🗣️ Share your insights in the comments; we will try to respond</li><li>🗞️ Follow me and subscribe to <a href="https://medium.com/datamindedbe">datamindedbe</a> for more posts on cloud, platform-, data-, and software engineering.</li><li>👀 For more about Data Minded, visit our <a href="https://hubs.li/Q0295h4B0">website</a>.</li></ul><figure><a href="https://hubs.li/Q0295h4B0"><img alt="" src="https://cdn-images-1.medium.com/max/200/0*5UXa76GxgI83178u" /></a></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*3qDicn4XH8CQJOgk" /></figure><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=f19eff2fdf9d" width="1" height="1" alt=""><hr><p><a href="https://medium.com/datamindedbe/quack-quack-ka-ching-cut-costs-by-querying-snowflake-from-duckdb-f19eff2fdf9d">Quack, Quack, Ka-Ching: Cut Costs by Querying Snowflake from DuckDB</a> was originally published in <a href="https://medium.com/datamindedbe">Dataminded</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Two Lifecycle Policies Every S3 Bucket Should Have]]></title>
            <link>https://medium.com/datamindedbe/two-lifecycle-policies-every-s3-bucket-should-have-f5e51436c060?source=rss-e29e99a9aba7------2</link>
            <guid isPermaLink="false">https://medium.com/p/f5e51436c060</guid>
            <category><![CDATA[aws]]></category>
            <category><![CDATA[data-engineering]]></category>
            <category><![CDATA[cloud]]></category>
            <category><![CDATA[s3]]></category>
            <dc:creator><![CDATA[Jonathan Merlevede]]></dc:creator>
            <pubDate>Thu, 07 Mar 2024 09:31:02 GMT</pubDate>
            <atom:updated>2024-03-07T09:31:02.727Z</atom:updated>
            <content:encoded><![CDATA[<h4>Stop paying for objects that you do not see</h4><h4>Abandoned incomplete multipart uploads and expired current delete markers: what are they, and why you must care about them thanks to bad AWS defaults.</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/841/1*4sYUnSnvAiHP1tSR9x_tEw.png" /><figcaption>According to your AWS bill, this seemingly empty bucket might, in fact, be quite full.</figcaption></figure><p>There may be items in your buckets that you do not see but that adversely impact S3 costs and performance. This post explains what these invisible objects are and what you can do to remove them, as, for reasons of backward compatibility and possibly also because of perverse incentives, AWS does not remove them for you by default.</p><blockquote>TL;DR version: The objects in question are parts of abandoned multipart uploads and expired object deletion markers. I think <em>every</em> bucket should have a lifecycle policy to remove them. If you do not know what these objects are and are interested in knowing, read on.</blockquote><h3>Aborted, incomplete multipart uploads</h3><h4>What are multipart uploads?</h4><p>Uploading small objects to AWS S3 is possible using just a single<a href="https://docs.aws.amazon.com/AmazonS3/latest/API/API_PutObject.html">PutObject operation</a>. For larger objects, we use <em>multipart uploads</em>, and the flow is more involved:</p><ul><li>Perform theCreateMultiPartUpload operation. You specify an object key; AWS returns you an upload ID.</li><li>Perform UploadPart operations, one for every part. You present AWS with the object key, upload ID, a “part number”, and part of the file you want to upload.</li><li>Perform the CompleteMultiPartUpload operation. You specify the object key and upload ID; AWS then creates your object as the concatenation of all the parts you uploaded.</li></ul><p>All of this is usually handled by your upload tool or library. For example, if you use the AWS CLI to upload files (using <a href="https://awscli.amazonaws.com/v2/documentation/api/latest/reference/s3/cp.html">aws s3 cp</a>), it will use multipart uploads <a href="https://docs.aws.amazon.com/cli/latest/topic/s3-config.html#multipart-threshold">by default for files larger than 8MiB</a>.</p><h4>What is problematic about multipart uploads?</h4><p>If you never complete an upload, associated uploaded parts remain in your bucket forever. These parts are stored in your bucket; you pay to keep them while you do not see them.</p><p>To illustrate this, let’s create a 5GiB test file and commence uploading it to the cloud using aws s3 cp:</p><pre>bucket=yourbucketname<br>key=tmp/testfile<br>dd if=/dev/urandom of=/tmp/testfile bs=1G count=5<br>aws s3 cp /tmp/testfile &quot;s3://$bucket/$key&quot; &amp;<br>[1] 71717</pre><p>After leaving the upload running for a while, kill the upload abruptly:</p><pre>kill -9 71717</pre><p>You can see that the multipart upload still exists:</p><pre>aws s3api list-multipart-uploads --bucket $bucket</pre><pre>{<br>    &quot;Uploads&quot;: [<br>        {<br>            &quot;UploadId&quot;: &quot;gB8iBxnOiladG...&quot;,<br>            &quot;Key&quot;: &quot;tmp/testfile&quot;,<br>            &quot;Initiated&quot;: &quot;2023-11-17T14:31:24+00:00&quot;,<br>            &quot;StorageClass&quot;: &quot;STANDARD&quot;,<br>            &quot;Owner&quot;: {...},<br>            &quot;Initiator&quot;: {...}<br>        }<br>    ],<br>    &quot;RequestCharged&quot;: null<br>}</pre><p>You can list associated parts using list-parts :</p><pre>id=&quot;gB8iBxnOiladG...&quot;<br>aws s3api list-parts --bucket &quot;$bucket&quot; --key &quot;$key&quot; --upload-id &quot;$id&quot;</pre><pre>{<br>    &quot;Parts&quot;: [<br>        {<br>            &quot;PartNumber&quot;: 1,<br>            &quot;LastModified&quot;: &quot;2023-11-17T14:39:17+00:00&quot;,<br>            &quot;ETag&quot;: &quot;\&quot;edeb946bf1303ed6350887c519041350\&quot;&quot;,<br>            &quot;Size&quot;: 8388608<br>        },<br>        {<br>            &quot;PartNumber&quot;: 2,<br>            &quot;LastModified&quot;: &quot;2023-11-17T14:39:17+00:00&quot;,<br>            &quot;ETag&quot;: &quot;\&quot;5c288b397c12981b40f19fde2387db5d\&quot;&quot;,<br>            &quot;Size&quot;: 8388608<br>        },<br>        ...<br>    ],<br>    ...<br>}</pre><h4>What can I do about abandoned multipart uploads?</h4><p>If you know that an upload is abandoned and should be aborted, you can remove uploaded parts by aborting the upload using <a href="https://docs.aws.amazon.com/AmazonS3/latest/API/API_AbortMultipartUpload.html">the </a><a href="https://docs.aws.amazon.com/AmazonS3/latest/API/API_AbortMultipartUpload.html">AbortMultipartUload operation</a>:</p><pre>aws s3api abort-multipart-upload --bucket &quot;$bucket&quot; --key &quot;$key&quot; --upload-id &quot;$id&quot;</pre><p>You can verify that the upload ID, and therefore also associated parts, are removed:</p><pre>aws s3api list-multipart-uploads --bucket $bucket</pre><pre>{<br>    &quot;RequestCharged&quot;: null<br>}</pre><pre>aws s3api list-parts --bucket &quot;$bucket&quot; --key &quot;$key&quot; --upload-id &quot;$id&quot;         </pre><pre>An error occurred (NoSuchUpload) when calling the ListParts operation: The specified upload does not exist. The upload ID may be invalid, or the upload may have been aborted or completed.</pre><h4>What can I really do about abandoned multipart uploads?</h4><p>The above is highly impractical, as it requires you to monitor your bucket for abandoned uploads and manually abort them. Instead, tell AWS to abort multipart uploads automatically after a certain period expires using an (expiration) <a href="https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-lifecycle-mgmt.html">object lifecycle rule</a>.</p><p>You can do this using the CLI or IaC tools like Terraform, but also from the AWS Console. While you’re there, tick the box “Delete expired object delete markers” too; we will explain what this does in the next section.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/807/1*YkWnYAijrdG74G72C_563A.png" /><figcaption>Expiration lifecycle policy expressing what should really be the AWS defaults.</figcaption></figure><h4>Are there any disadvantages to this?</h4><p>Object expiration through lifecycle policies <a href="https://docs.aws.amazon.com/AmazonS3/latest/userguide/lifecycle-expire-general-considerations.html#lifecycle-expire-when">is free</a>.</p><p>The only thing to account for is that the number of days you set in the policy limits how long a multipart upload can take. Keep this in mind if you are uploading <a href="https://docs.aws.amazon.com/AmazonS3/latest/userguide/qfacts.html">5TiB objects</a> from a slow connection. Choose something ridiculously long for a single upload but not so long that it impacts your storage bill, like 14 days.</p><h3>Expired delete markers</h3><p>Other invisible objects that tend to linger around in buckets are called “expired object delete markers.” These objects only exist in versioned buckets.</p><h4>What are object delete markers?</h4><p>When you enable bucket versioning, every prefix becomes associated with a stack of versioned items. The most recent item is the “latest” or “current” one. There are two types of versioned items: object versions and <em>delete markers</em>. Writing data to a location pushes an object version onto the stack; deleting an object pushes a delete marker. If the current item is an object version, this object version is visible as an object in your bucket; current delete markers remain invisible.</p><p>We can easily illustrate this using a couple of aws CLI commands. Before we upload an object, no version exists:</p><pre>bucket=yourbucketname<br>key=tmp/testfile<br>aws s3 ls &quot;s3://$bucket/$key&quot; # verify that no object exists<br>aws s3api list-object-versions --bucket &quot;$bucket&quot; --prefix &quot;$key&quot;</pre><pre>{<br>    &quot;RequestCharged&quot;: null<br>}</pre><p>After uploading an object, we see the object and a single object version:</p><pre>dd if=/dev/urandom bs=1M count=2 | aws s3 cp - s3://$bucket/$key<br>aws s3 ls s3://$bucket/$key                                      </pre><pre>2023-11-21 00:33:09    2097152 testfile</pre><pre>aws s3api list-object-versions --bucket &quot;$bucket&quot; --prefix &quot;$key&quot;</pre><pre>{<br>    &quot;Versions&quot;: [<br>        {<br>            &quot;ETag&quot;: &quot;\&quot;18f78ecf07a0c41e8ec2defa200a5029\&quot;&quot;,<br>            &quot;Size&quot;: 2097152,<br>            &quot;StorageClass&quot;: &quot;STANDARD&quot;,<br>            &quot;Key&quot;: &quot;tmp/testfile&quot;,<br>            &quot;VersionId&quot;: &quot;bUWdb14EppQR13KLVapM699xySNlo9yR&quot;,<br>            &quot;IsLatest&quot;: true,<br>            &quot;LastModified&quot;: &quot;2023-11-20T23:33:09+00:00&quot;,<br>            &quot;Owner&quot;: {...}<br>        }<br>    ],<br>    &quot;RequestCharged&quot;: null<br>}</pre><p>Deletion of the file adds a deletion marker:</p><pre>aws s3 rm s3://$bucket/$key<br>aws s3api list-object-versions --bucket &quot;$bucket&quot; --prefix &quot;$key&quot;</pre><pre>{<br>    &quot;Versions&quot;: [<br>        {<br>            &quot;ETag&quot;: &quot;\&quot;18f78ecf07a0c41e8ec2defa200a5029\&quot;&quot;,<br>            &quot;Size&quot;: 2097152,<br>            &quot;StorageClass&quot;: &quot;STANDARD&quot;,<br>            &quot;Key&quot;: &quot;tmp/testfile&quot;,<br>            &quot;VersionId&quot;: &quot;bUWdb14EppQR13KLVapM699xySNlo9yR&quot;,<br>            &quot;IsLatest&quot;: false,<br>            &quot;LastModified&quot;: &quot;2023-11-20T23:33:09+00:00&quot;,<br>            &quot;Owner&quot;: {}<br>        }<br>    ],<br>    &quot;DeleteMarkers&quot;: [<br>        {<br>            &quot;Owner&quot;: {},<br>            &quot;Key&quot;: &quot;tmp/testfile&quot;,<br>            &quot;VersionId&quot;: &quot;kNig7WIYhADWCr47u_nRrQ8QYdeW4eIj&quot;,<br>            &quot;IsLatest&quot;: true,<br>            &quot;LastModified&quot;: &quot;2023-11-20T23:34:59+00:00&quot;<br>        }<br>    ],<br>    &quot;RequestCharged&quot;: null<br>}</pre><p>Listing objects using aws s3 ls returns an empty result. Removal of a delete marker would restore the object. You could do this as follows (but let’s not for now):</p><pre>vid=&quot;kNig7WIYhADWCr47u_nRrQ8QYdeW4eIj&quot;<br>aws s3api delete-object --bucket &quot;$bucket&quot; --key &quot;$key&quot; --version-id &quot;$vid&quot;</pre><h4>What are current and noncurrent object delete markers?</h4><p>If a delete marker is the latest or current item on the version stack, we refer to it as a current object delete marker. Otherwise, we refer to it as a noncurrent object delete marker.</p><p>In the example above, the single delete marker at the prefix tmp/testfile is a current delete marker. Uploading another object to the same location creates a new object version:</p><pre>dd if=/dev/urandom bs=1M count=2 | aws s3 cp - s3://$bucket/$key<br>aws s3api list-object-versions --bucket &quot;$bucket&quot; --prefix &quot;$key&quot;</pre><pre>{<br>    &quot;Versions&quot;: [<br>        {<br>            &quot;Key&quot;: &quot;tmp/testfile&quot;,<br>            &quot;VersionId&quot;: &quot;l3_yEX5pQZCloovHVMsbIgbzP1pqZ4iU&quot;,<br>            &quot;IsLatest&quot;: true,<br>            ...<br>        },<br>        {<br>            &quot;Key&quot;: &quot;tmp/testfile&quot;,<br>            &quot;VersionId&quot;: &quot;bUWdb14EppQR13KLVapM699xySNlo9yR&quot;,<br>            &quot;IsLatest&quot;: false,<br>            ...<br>        }<br>    ],<br>    &quot;DeleteMarkers&quot;: [<br>        {<br>            &quot;Key&quot;: &quot;tmp/testfile&quot;,<br>            &quot;VersionId&quot;: &quot;kNig7WIYhADWCr47u_nRrQ8QYdeW4eIj&quot;,<br>            &quot;IsLatest&quot;: false,<br>            ...<br>        }<br>    ],<br>    &quot;RequestCharged&quot;: null<br>}</pre><p>At this point, the delete marker with version ID kNig7WIYhADWCr47u_nRrQ8QYdeW4eIj is still there but has become “noncurrent”, as indicated by its property IsLatest with value false.</p><h4>What are expired object delete markers?</h4><p>Expired object delete markers are delete markers at a prefix with no noncurrent object versions.</p><p>We can turn the delete marker from our example into an expired object delete marker by removing the object versions at the same location:</p><pre>aws s3api delete-object --bucket $bucket --key $key \<br>  --version-id l3_yEX5pQZCloovHVMsbIgbzP1pqZ4iU<br>aws s3api delete-object --bucket $bucket --key $key \<br>  --version-id bUWdb14EppQR13KLVapM699xySNlo9yR<br>aws s3api list-object-versions --bucket &quot;$bucket&quot; --prefix &quot;$key&quot;</pre><pre>{<br>    &quot;DeleteMarkers&quot;: [<br>        {<br>            &quot;Key&quot;: &quot;tmp/testfile&quot;,<br>            &quot;VersionId&quot;: &quot;kNig7WIYhADWCr47u_nRrQ8QYdeW4eIj&quot;,<br>            &quot;IsLatest&quot;: true,<br>            ...<br>        }<br>    ],<br>    &quot;RequestCharged&quot;: null<br>}</pre><p>The delete marker with version IDkNig7WIYhADWCr47u_nRrQ8QYdeW4eIj is now an expired object delete marker.</p><h4>Why are expired object delete markers bad?</h4><p>The problem is that expired object delete markers can be current and can remain in your bucket forever unless you do something about them. Lingering markers can slow list requests and result in redundant results when listing object versions.</p><p>When enabling bucket versioning, you typically implement an expiration policy for noncurrent items, as not doing so often quickly becomes prohibitively expensive. With such a policy, <em>all</em> delete markers eventually become current, expired ones. Your expiration policy will not remove these from your bucket. In our example, the delete marker at tmp/testfile will never be automatically removed by a policy expiring noncurrent versions.</p><h4>How can I remove expired object delete markers?</h4><p>You can remove delete markers manually:</p><pre>id=kNig7WIYhADWCr47u_nRrQ8QYdeW4eIj<br>aws s3api delete-object --bucket $bucket --key $key --version-id $id</pre><h4>How can I really remove expired object delete markers?</h4><p>As for multipart uploads, the best way to remove expired object delete markers is through an explicit (expiration) lifecycle policy targeting expired object delete markers. One way to do this is by ticking a box in the Console (see the screenshot above). (You should probably use an IaC tool, though.)</p><h4>Are there any disadvantages to this?</h4><p>In the improbable event that you would like to have an account of which objects existed in the past and when they were removed but do not need the ability to restore said objects, consider not removing expired object delete markers.</p><p>As for aborted multipart uploads, removing delete markers through lifecycle policies is free. Delete markers only exist in versioned buckets, but having a policy to remove them is never harmful.</p><h3>Conclusion</h3><p>We have seen what incomplete multipart uploads are and why you should abort the abandoned ones. We have also seen what expired object delete markers are, that they can be current and that you should remove them.</p><p>AWS does not abort or remove anything for you by default. Therefore, whenever you create a bucket, create an expiration lifecycle policy that aborts multipart uploads after some days and removes expired object delete markers.</p><ul><li>👏 If you liked this article, don’t forget to clap</li><li>🗣️ Share your insights in the comments; I’ll try to respond</li><li>🗞️ Follow me and subscribe to <a href="https://medium.com/datamindedbe">datamindedbe</a> for more posts on cloud, platform-, data-, and software engineering</li><li>👀 For more about Data Minded, visit our <a href="https://hubs.li/Q0295h4B0">website</a></li></ul><figure><a href="https://hubs.li/Q0295h4B0"><img alt="" src="https://cdn-images-1.medium.com/max/200/0*5UXa76GxgI83178u" /></a></figure><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=f5e51436c060" width="1" height="1" alt=""><hr><p><a href="https://medium.com/datamindedbe/two-lifecycle-policies-every-s3-bucket-should-have-f5e51436c060">Two Lifecycle Policies Every S3 Bucket Should Have</a> was originally published in <a href="https://medium.com/datamindedbe">Dataminded</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Connecting to Databases using JDBC from the CLI]]></title>
            <link>https://medium.com/datamindedbe/connecting-to-databases-using-jdbc-from-the-cli-d2f5b1c30f5d?source=rss-e29e99a9aba7------2</link>
            <guid isPermaLink="false">https://medium.com/p/d2f5b1c30f5d</guid>
            <category><![CDATA[jdbc]]></category>
            <category><![CDATA[sysadmin]]></category>
            <category><![CDATA[data-engineering]]></category>
            <category><![CDATA[database]]></category>
            <category><![CDATA[sqlline]]></category>
            <dc:creator><![CDATA[Jonathan Merlevede]]></dc:creator>
            <pubDate>Mon, 11 Dec 2023 13:59:13 GMT</pubDate>
            <atom:updated>2023-12-11T13:59:13.407Z</atom:updated>
            <content:encoded><![CDATA[<h3>How To Establish JDBC Connections From the CLI</h3><h4>A quick guide to using the CLI tool SQLLine and any JDBC driver to construct, test, and use connections to any database</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1022/1*PBVz1-rrEXP05obwZfPxrg.png" /></figure><p>Many applications connect to databases using JDBC drivers, often configured by <a href="https://learn.microsoft.com/en-us/sql/connect/jdbc/building-the-connection-url?view=sql-server-ver16">JDBC Connection URLs</a>. As a system or platform administrator, constructing these strings and testing credentials or permissions is often painful, especially as databases are often only accessible from specific networks; connecting to them from your laptop using graphical tools like <a href="https://dbeaver.io/">DBeaver</a> or <a href="https://squirrel-sql.sourceforge.io/">SQuirrel</a> may not be an option.</p><blockquote>Testing JBDC Connection URLs as an administrator can be a painful experience</blockquote><p>sqlline is a command-line shell for issuing SQL to relational databases via JDBC. It can connect to any database for which there exist JDBC drivers. This makes it easy to quickly construct and test JDBC connection URIs to connect to your database. Although sqlline is not under very active development, I have found it to work reliably and for various database types (MySQL, Oracle, PostgreSQL, …).</p><p><a href="https://github.com/julianhyde/sqlline">GitHub - julianhyde/sqlline: Shell for issuing SQL to relational databases via JDBC</a></p><h3>Installation</h3><p>You can download the sqlline JAR pre-packaged together with all its dependencies from <a href="https://central.sonatype.com/artifact/sqlline/sqlline">the Sonatype Maven Central Repository</a>:</p><pre>ver=1.12.0<br>wget https://repo1.maven.org/maven2/sqlline/sqlline/$ver/sqlline-$ver-jar-with-dependencies.jar</pre><p>You should now be able to run sqlline:</p><pre>java -jar sqlline-$ver-jar-with-dependencies.jar --help<br># Usage: java sqlline.SqlLine <br>#   -u &lt;database url&gt;               the JDBC URL to connect to<br>#   -n &lt;username&gt;                   the username to connect as<br>#   -p &lt;password&gt;                   the password to connect as<br>#   -d &lt;driver class&gt;               the driver class to use</pre><p>You will also want to download JDBC drivers, for example, <a href="https://www.oracle.com/database/technologies/appdev/jdbc-downloads.html">Oracle’s JDBC driver if you want to connect to Oracle servers</a>:</p><pre>wget https://download.oracle.com/otn-pub/otn_software/jdbc/233/ojdbc11.jar</pre><p>If you have special requirements, you can follow the instructions in <a href="https://github.com/julianhyde/sqlline">the sqlline repository</a> to compile sqlline from source.</p><h3>Usage</h3><p>You can now start sqlline and use it to connect to a database:</p><pre>java -cp &quot;*&quot; sqlline.SqlLine \<br>  -n myusername -p supersecretpassword \<br>  -u &quot;jdbc:oracle:thin:@my.host.name:1521:my-sid&quot;</pre><p>This will open up an interactive interface into which you can type commands. For example, you can issue the order !tables to list all available tables. Type !help to get a list of all commands.</p><p>You can run SQL queries by typing !sql followed by SQL command:</p><pre>0: jdbc:oracle:thin:@my.host.name&gt; !sql SELECT COUNT(1) FROM mytable;<br>+----------+<br>| COUNT(1) |<br>+----------+<br>| 3358     |<br>+----------+<br>1 row selected (0.02 seconds)</pre><p>To quit sqlline, type !quit.</p><h3>Conclusion</h3><p>SQLLine is an excellent tool that you can use to quickly construct JDBC connection strings and test JDBC drivers and database credentials. It is especially useful to test connections from headless servers. As we have seen in this blog post, installation is a breeze. Stop port forwarding to DBeaver and start using SQLLine!</p><ul><li>👏 If you liked this article, don’t forget to clap!</li><li>🗣️ Share your insights in the comments.</li><li>👀 For more about Data Minded, visit our <a href="https://hubs.li/Q0295h4B0">website</a>.</li></ul><figure><a href="https://hubs.li/Q0295h4B0"><img alt="" src="https://cdn-images-1.medium.com/max/200/0*5UXa76GxgI83178u" /></a></figure><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=d2f5b1c30f5d" width="1" height="1" alt=""><hr><p><a href="https://medium.com/datamindedbe/connecting-to-databases-using-jdbc-from-the-cli-d2f5b1c30f5d">Connecting to Databases using JDBC from the CLI</a> was originally published in <a href="https://medium.com/datamindedbe">Dataminded</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Debugging Running Pods on Kubernetes]]></title>
            <link>https://medium.com/datamindedbe/debugging-running-pods-on-kubernetes-2ba160c47ef5?source=rss-e29e99a9aba7------2</link>
            <guid isPermaLink="false">https://medium.com/p/2ba160c47ef5</guid>
            <category><![CDATA[data]]></category>
            <category><![CDATA[debugging]]></category>
            <category><![CDATA[kubernetes]]></category>
            <category><![CDATA[devops]]></category>
            <dc:creator><![CDATA[Jonathan Merlevede]]></dc:creator>
            <pubDate>Wed, 25 Oct 2023 08:45:42 GMT</pubDate>
            <atom:updated>2025-03-07T12:00:11.748Z</atom:updated>
            <content:encoded><![CDATA[<h4>Exploring Kubernetes’s debugging feature, kubectl debug, and introducing kubectl superdebug — an enhanced kubectl debug supporting volume mounts.</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*OJTXMW3ZhEGvZxT4-UG8vQ.png" /><figcaption>Some pods require debugging.</figcaption></figure><h3>Executing commands using kubectl exec</h3><p>If you run software on Kubernetes, you will, at some point, want to debug some aspect of what you deploy. A simple approach to debugging that is natural to people used to working with VMs is to connect to a running pod and hack away:</p><pre>kubectl exec -it podname -c containername -- bash</pre><p>This often works and is very useful. However, there are at least two Kubernetes “best practices” limiting exec’s usefulness in the real world:</p><ul><li><em>Not running as root</em>. Containers run with as few privileges as possible and may even run with randomized UIDs.</li><li><em>Minimal images</em>. Images are kept as small as possible, with binaries installed into a <a href="https://github.com/GoogleContainerTools/distroless">distroless image</a> as an extreme.</li></ul><p>When applying these best practices, connecting to your container using kubectl exec is either impossible or drops you into a barren wasteland-like environment unsuited for debugging.</p><blockquote>kubectl exec does not allow you to specify a user flag or capabilities to start your process with, instead copying those settings from the target container’s main command. <a href="https://github.com/kubernetes/kubernetes/issues/30656">Some Kubernetes users think this should be changed</a>.</blockquote><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*EgZ4CwmRO5j1wz6AE-N5Pg.png" /><figcaption><a href="https://data-minded-webinars.webinargeek.com/optimizing-kubernetes-for-data-batch-processing-a-deep-dive?cst=medium_blogs">Register here &gt;&gt;</a></figcaption></figure><h3>Debug containers</h3><p><a href="https://kubernetes.io/docs/tasks/debug/debug-application/debug-running-pod/">The Kubernetes-native answer to debugging running containers is to use </a><a href="https://kubernetes.io/docs/tasks/debug/debug-application/debug-running-pod/">kubectl debug.</a> The debug command spins up a new container into a running pod. This new container can run as a different user and from any image you choose. Because the debug container runs within the same pod as the container it targets (and therefore on the same node), the isolation between both containers does not need to be absolute. The debug container can share system resources with other containers running in the same pod.</p><p>Consider wanting to inspect the CPU usage of a PostgreSQL database running in the container postcont in the pod postpod. The pod does not run as root, and the Postgres image does not have tools like top or htop installed — in other words, the kubectl exec command is of little use. You can then run the following command:</p><pre>kubectl debug -it \<br>--container=debug-container \<br>--image=alpine \<br>--target=postcont \<br>postpod</pre><p>You will be logged in as root (this is the default for the Alpine image) and can easily install your favorite interactive process viewer htop (apk add htop). You share the same process namespace as the postcont container and can see and even kill all the processes running there! When you exit the process, the ephemeral container stops existing, too.</p><blockquote><em>Note</em>: Specifying --target is non-optional if you want your debug container to share the same process namespace as postcont, even if postcont is the only container running in postpod.</blockquote><blockquote><em>Note</em>: You can disconnect from your ephemeral container / bash session without exiting (killing) it by pressing CTRL+P CTRL+D. You can then later reconnect to it using kubectl attach.</blockquote><blockquote><em>Note</em>: kubectl debug offers more functionality than outlined here, such as the copying of pods with a modified startup command or starting a “node” pod with access to the node’s filesystem.</blockquote><h3>Under the hood</h3><p>The kubectl debug command above works by creating something called an <a href="https://kubernetes.io/docs/concepts/workloads/pods/ephemeral-containers/">ephemeral container</a>. These containers are supposed to run <em>temporarily</em> in an <em>existing</em> pod to support actions such as troubleshooting.</p><p>The difference between “normal” containers and ephemeral containers is slim. Nothing <em>really</em> prevents ephemeral containers from running for a long time. I think the reason for having ephemeral containers is best understood by looking at foundational architectural choices made by Kubernetes at its inception:</p><ul><li>Pods should be disposable and replaceable, and, supporting this,</li><li>the Pod specification is immutable.</li></ul><p>This made a lot of sense when Kubernetes was used primarily for deploying stateless workloads — when pods themselves could be considered ephemeral. It can be restrictive in this new world where Kubernetes is used for <em>everything</em>. The Pod spec remains immutable, but Kubernetes models ephemeral containers as a <em>subresource</em> of Pod. Unlike “normal” containers, ephemeral containers are not part of the Pod <em>spec</em>, even if they are part of the pod. This subtle distinction keeps everyone happy 🥳!</p><blockquote>Ephemeral containers are still relatively new; they have been stable since Kubernetes v1.25 (August 2022), beta since v1.23 (December 2021) and alpha since v1.22 (August 2021).</blockquote><h3>Mounting volumes</h3><p>The built-in command kubectl debug can be very useful. It allows you to add an ephemeral container to a running pod, optionally sharing its process namespace with that of a running container. However, if you were expecting to use kubectl debug to inspect or modify any part of the running container’s filesystem, you’re out of luck — the filesystem of the debug pod is disjoint from that of the container you connect it to.</p><p>Luckily, we can do better. The idea is simple:</p><ul><li>Retrieve the specification of the running target container.</li><li>Patch an ephemeral container into the pod. Configure it to share the same process namespace as the target container and additionally to include the same volume mounts.</li></ul><p>There is no kubectl command for creating ephemeral containers, so we need to craft a PATCH request to the K8s API to create it. The kubectl proxy command allows reaching the K8s API.</p><p>This process is not exactly user-friendly, so it makes sense to wrap the procedure into a script or kubectl plugin. You can find an example implementation of such a script over here:</p><p><a href="https://github.com/JonMerlevede/kubectl-superdebug">JonMerlevede/kubectl-superdebug</a></p><p>Note that this approach and script can easily be extended to also copy the environment variable specification from the target container.</p><p>If you save this script as kubectl-superdebug and make it available on your path, you can run it as kubectl superdebug from anywhere as follows:</p><pre>kubectl superdebug \<br>--container=debug-container \<br>--image=alpine \<br>--target=postcont \<br>postpod</pre><p>You may also want to extend this script to copy other aspects of the target container into your debug container, such as references to environment variables.</p><p>This completes the overview of Kubernetes-native approaches to debugging running containers and should cover most people’s needs. However, read on if you’re particularly interested or have <em>special</em> needs!</p><h3>Non-Kubernetes native approaches</h3><p>Kubernetes does not offer a way to connect to a running container as root (unless the main process is running as root) or to access a container’s root filesystem from another container. This does not mean that these things are impossible to do. Kubernetes is, after all, simply a container orchestrator sitting on top of a containerization engine. You can usually do whatever you want by removing layers of abstraction if you, for some reason, really have to. Just make sure that you have to…</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/0*EI8aEfjFFQ7IlVKx.jpeg" /></figure><p>If you use the Docker Engine and can access your engine directly from a node or through a privileged container running on a node, then you can run docker exec --user and execute a process as a user of your choice. Plugins such as <a href="https://github.com/jordanwilson230/kubectl-plugins/blob/master/kubectl-ssh">kubectl ssh</a> and <a href="https://github.com/mikelorant/kubectl-exec-user/tree/master">kubectl exec-user</a> implement this approach. Unfortunately, modern engines such as <a href="https://containerd.io/">containerd</a> and <a href="https://cri-o.io/">CRI-O</a> no longer offer the --user flag functionality — which means that these plugins do not work on modern Kubernetes installations.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/456/1*_SukzXUwLIdjf_q9WKEjAw.jpeg" /></figure><p>However, even these modern engines usually just interface with Linux namespaces. You can run commands in whatever “container” you want by entering the appropriate set of Linux namespaces. The tool <a href="https://github.com/ssup2/kpexec">kpexec</a> implements this approach. It starts a privileged pod on the same node as the target container, then determines which (Linux) namespaces to target, executes commands in those (Linux) namespaces, and finally streams their output to your terminal. As an added bonus, it can overlay a set of tools useful for debugging on top of the target container’s filesystem.</p><p>Unlike kubectl exec, kpexec can run commands with a different uid/gid and even different capabilities as the container’s main process. It is compatible with containerd and cri-o. kpexec takes a somewhat heavyweight and brittle approach and may not be compatible with your cluster&#39;s security configuration. It can be worth considering if kubectl (super)debug fails to suit your needs.</p><blockquote>Note that kpexec directly executes commands into namespaces using nsenter. It is compatible with the ubiquitous container runtime runc, but incompatible with runtimes such as <a href="https://katacontainers.io/">Kata Containers</a>.</blockquote><p>In this post, we looked at two Kubernetes-native approaches to debugging running containers: kubectl exec and kubectl debug. We investigated how kubectl debug works, and presented kubectl superdebug, a variation of kubectl debug that starts an ephemeral container sharing the same volumes as the target container and the same process namespace. Lastly, we reviewed some non-Kubernetes native approaches to container debugging.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*EgZ4CwmRO5j1wz6AE-N5Pg.png" /><figcaption><a href="https://data-minded-webinars.webinargeek.com/optimizing-kubernetes-for-data-batch-processing-a-deep-dive?cst=medium_blogs">Register here &gt;&gt;</a></figcaption></figure><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=2ba160c47ef5" width="1" height="1" alt=""><hr><p><a href="https://medium.com/datamindedbe/debugging-running-pods-on-kubernetes-2ba160c47ef5">Debugging Running Pods on Kubernetes</a> was originally published in <a href="https://medium.com/datamindedbe">Dataminded</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Twelve-Factor Python applications using Pydantic Settings]]></title>
            <link>https://medium.com/datamindedbe/twelve-factor-python-applications-using-pydantic-settings-f74a69906f2f?source=rss-e29e99a9aba7------2</link>
            <guid isPermaLink="false">https://medium.com/p/f74a69906f2f</guid>
            <category><![CDATA[coding]]></category>
            <category><![CDATA[software-architecture]]></category>
            <category><![CDATA[python]]></category>
            <category><![CDATA[devops]]></category>
            <category><![CDATA[data]]></category>
            <dc:creator><![CDATA[Jonathan Merlevede]]></dc:creator>
            <pubDate>Wed, 27 Sep 2023 13:26:49 GMT</pubDate>
            <atom:updated>2023-10-23T19:00:58.286Z</atom:updated>
            <content:encoded><![CDATA[<h4>A look at Pydantic Settings and how it can help you reliably deploy applications across environments</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*rdA_6cgriXxwkbcRUMH_qg.png" /><figcaption>Python configuration</figcaption></figure><h3>Pydantic Settings</h3><p><a href="https://github.com/pydantic/pydantic-settings">Pydantic Settings</a> is a Python package closely related to the popular <a href="https://docs.pydantic.dev/latest/">Pydantic</a> package. It allows defining type-checked “settings” objects that can be automatically populated from environment variables or a dotenv (.env) file.</p><p>As a small example, consider the following Python code snippet:</p><pre>from pydantic_settings import BaseSettings, SettingsConfigDict<br><br>class TestSettings(BaseSettings, frozen=True):<br>    model_config = SettingsConfigDict(env_file=&quot;.env&quot;)<br>    foo: str<br>    bar: bool<br>    baz: int<br><br><br>if __name__ == &quot;__main__&quot;:<br>    print(TestSettings(baz=1))</pre><p>The test settings object initializes values that are not configured in the code with values sourced from the environment:</p><pre>$ foo=bar BAR=true baz=5 python example.py<br># foo=&#39;bar&#39; bar=True baz=1</pre><p>If settings remain unset, Pydantic refuses to create the settings object:</p><pre>$ python example.py<br># pydantic_core._pydantic_core.ValidationError: 2 validation errors for TestSettings<br># foo<br>#   Field required [type=missing, input_value={}, input_type=dict]<br>#     For further information visit https://errors.pydantic.dev/2.3/v/missing<br># bar<br>#   Field required [type=missing, input_value={}, input_type=dict]<br>#     For further information visit https://errors.pydantic.dev/2.3/v/missing</pre><p>Pydantic additionally checks whether settings match their declared type:</p><pre>$ foo=bar bar=baz python example.py<br># pydantic_core._pydantic_core.ValidationError: 1 validation error for TestSettings<br># bar<br>#   Input should be a valid boolean, unable to interpret input [type=bool_parsing, input_value=&#39;baz&#39;, input_type=str]<br>#     For further information visit https://errors.pydantic.dev/2.3/v/bool_parsing</pre><p>If possible, Pydantic implicitly converts types; if you set “bar” to 1, Pydantic converts this to True. Settings can also be read automatically from a .env file:</p><pre>$ echo -e &quot;foo=bar\nbar=1\nbaz=5&quot; &gt; .env &amp;&amp; python example.py<br># foo=&#39;bar&#39; bar=True baz=1</pre><h3>Twelve-factor applications</h3><p>Now that we know what Pydantic Settings is, let’s look at what we need it for.</p><p>It is almost always a good idea™️ to separate application configuration from core application logic. A good way to do this is by injecting configuration through the environment. A fancier way of saying this is that Pydantic Settings helps create applications that adhere to the <a href="https://12factor.net/config">twelve-factor methodology</a>. The twelve-factor methodology is an influential set of best practices that maximize portability between execution environments, intending to facilitate your application’s deployment. Injecting configuration through environment variables is a big part of that.</p><p>Say you build or package your application, and then deploy it to a development or integration environment with certain settings. Injecting all configurations into it from the environment enables moving your application from your testing to your production environment exactly as it was built and tested.</p><h3>Why Pydantic Settings?</h3><p>Separating configuration values from your core application code is definitely a good idea. Whether you should use Pydantic Settings for this is more circumstantial. Pydantic Settings does offer several advantages over using <a href="https://pypi.org/project/python-dotenv/">python-dotenv</a> directly and/or reading from environment variables or configuration files from all over your application:</p><ul><li><strong>Validity</strong>. Pydantic Settings checks for the presence and types of your setting variables, allowing you to fail early in case of incorrect configuration.</li><li><strong>Error messages</strong>. Pydantic Settings presents clear validation errors that tell you exactly which settings are missing or wrong.</li><li><strong>Overriding</strong>. Environment settings can easily be overridden from within your code. This can be useful, e.g., to set settings to localized values when testing.</li><li><strong>Loose coupling</strong>. Although the twelve-factor methodology is very specific about using environment variables, I consider this an implementation detail — the important part is that you keep configuration and code separated. Environment variables cannot always be considered sufficiently secure and are resolved at your application’s startup time, so there may be situations where retrieving configuration in another way is more appropriate. For example, <a href="https://cloud.google.com/secret-manager/docs/best-practices#coding_practices">Google recommends against storing secret configuration values in environment variables</a>. Creating settings objects at the fringes of your application instead of calling os.env everywhere directly decouples your application logic from how it retrieves configuration, making it easy to source settings from somewhere else, such as a database, a YAML file, or a secret vault. Although such functionality is not built into Pydantic Settings, loose coupling means you can implement it without changing most of your application.</li><li><strong>Features</strong>. Pydantic (Settings) has many nice-to-have features not discussed in this introductory article. One example is that it allows cleanly defining custom validators. Another is its ability to mark strings as “secret”, which helps prevent them from ending up in logs. Check out the Pydantic and Pydantic Settings <a href="https://docs.pydantic.dev/latest/api/base_model/">documentation</a> to learn more!</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1021/1*0yBy6SfdEN9hjW228WlrQg.png" /></figure><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=f74a69906f2f" width="1" height="1" alt=""><hr><p><a href="https://medium.com/datamindedbe/twelve-factor-python-applications-using-pydantic-settings-f74a69906f2f">Twelve-Factor Python applications using Pydantic Settings</a> was originally published in <a href="https://medium.com/datamindedbe">Dataminded</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Upserting Data using Spark and Iceberg]]></title>
            <link>https://medium.com/datamindedbe/upserting-data-using-spark-and-iceberg-9e7b957494cf?source=rss-e29e99a9aba7------2</link>
            <guid isPermaLink="false">https://medium.com/p/9e7b957494cf</guid>
            <category><![CDATA[apache-spark]]></category>
            <category><![CDATA[etl]]></category>
            <category><![CDATA[sql]]></category>
            <category><![CDATA[data-engineering]]></category>
            <dc:creator><![CDATA[Jonathan Merlevede]]></dc:creator>
            <pubDate>Thu, 25 May 2023 11:13:51 GMT</pubDate>
            <atom:updated>2023-11-21T22:06:37.609Z</atom:updated>
            <content:encoded><![CDATA[<h4>Use Spark and Iceberg’s MERGE INTO syntax to efficiently store daily, incremental snapshots of a mutable source table.</h4><p>Iceberg allows tracking a table’s history by storing incremental diffs. Unfortunately, there are some caveats, and getting this to work as you likely want it to requires non-obvious querying. In this post, we look at the why and the how.</p><blockquote>We use Spark as our analytical engine, but this post should also apply to other engines working with Iceberg.</blockquote><figure><img alt="" src="https://cdn-images-1.medium.com/max/1021/1*kmbF5iy_MKOCe4N4nDGeWg.png" /><figcaption>Merge iceberg into table. Thanks, DALL-E!</figcaption></figure><h3>Problem setting</h3><p>A typical pattern in analytical data processing is to ingest data from an operational system on a daily basis. Often, we store previously ingested table versions in addition to the current state of things, for example, to support reproducing machine learning results. When working with “standard” Spark and Parquet, we do this by storing daily snapshots and by partitioning on the ingestion date.</p><p>We aim to use Apache Iceberg to achieve the same result — storing a queryable history of table snapshots — more efficiently. Iceberg is a project offering a metadata format and a set of execution engine plugins. This extends popular analytical processing engines like <a href="https://spark.apache.org/">Spark</a>, <a href="https://flink.apache.org/">Flink</a>, and <a href="https://trino.io/">Trino</a> with features such as incremental updates, time travel, and ACID transactions.</p><h3>Upserting</h3><p>Using Spark with Iceberg unlocks the SQL MERGE INTO statement, which implements a table “upsert”, a portmanteau formed by combining a “table insert” and “table update”:</p><pre>MERGE INTO prod.db.target t -- a target table<br>USING (SELECT ...) s        -- the source updates<br>ON t.id = s.id              -- condition to find updates for target rows<br>WHEN MATCHED AND &lt;predicate_x&gt; THEN DELETE<br>WHEN MATCHED AND &lt;predicate_y&gt; THEN UPDATE *<br>WHEN NOT MATCHED THEN INSERT * </pre><p>The code above uses the result of the SELECTstatement to delete, update and insert rows from and into the table prod.db.target, depending on whether an id exists in the source table or not and whether or not &lt;predicate_x&gt; or &lt;predicate_y&gt; are true. For an overview of the MERGE INTO statement, check out the Iceberg documentation <a href="https://iceberg.apache.org/docs/latest/spark-writes/#merge-into">here</a>.</p><p>On the storage side, executing an upsert statement like the one above triggers Iceberg to <a href="https://www.dremio.com/blog/row-level-changes-on-the-lakehouse-copy-on-write-vs-merge-on-read-in-apache-iceberg/">create new data files corresponding to any modified partitions (~ copy-on-write) or to create small files expressing e.g. deletes (~ merge-on-read, available since Iceberg v2)</a>. Iceberg also creates new metadata files pointing to these new data files. Unchanged data files (partitions) are re-used. This allows efficient storing of snapshot series by keeping only the snapshot “deltas”. Earlier versions of the prod.db.target table can be “recalled” by time traveling, <a href="https://iceberg.apache.org/docs/latest/spark-queries/#sql">using the TIMESTAMP AS OF or VERSION AS OF clauses</a>.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*N9ovkiRmVuK_anbd.png" /><figcaption>Iceberg table format spec. (<a href="https://iceberg.apache.org/spec/">source</a>)</figcaption></figure><p>TL;DR Upserting allows us to keep our target copy up-to-date while maintaining a complete history of previous states of a source table, without storing full snapshots of the data.</p><h3>MERGE INTO limitations</h3><p>How can we use upserts to reconcile differences between an existing Iceberg table and a newly extracted snapshot?</p><p>Assume that we extract or create a mutable source table snapshot on a daily basis, and want to use it to upsert an Iceberg table called iceberg. Ideally, we would be able to write the following:</p><pre>MERGE INTO iceberg<br>USING snapshot<br>ON iceberg.id = snapshot.id<br>WHEN MATCHED THEN UPDATE *<br>WHEN NOT MATCHED BY TARGET THEN INSERT *<br>WHEN NOT MATCHED BY SOURCE THEN DELETE</pre><p>Unfortunately, this straightforward query does not work for two reasons.</p><p><strong>Deletes</strong>. Unlike Delta, Iceberg does not support the syntax MATCHED BY SOURCE. Iceberg’s NOT MATCHED statement corresponds to NOT MATCHED BY TARGET, that is, it fires when a row exists in snapshot but not in iceberg. This is problematic if rows can be deleted from the source system.</p><p><strong>Superfluous copies.</strong> For rows that are the same in iceberg and snapshot, the WHEN MATCHED THEN UPDATE * results in an identical duplicate of the data being stored on your filesystem. This means that Iceberg will not bring storage benefits over storing multiple snapshots of the source table.</p><p>Under the hood, Iceberg decides which partitions it will re-write based on the ON conditional (see <a href="https://github.com/apache/iceberg/issues/3607">this issue</a> for a discussion on the impact on performance). Rows matched by the ON statement but not by any guard on the match conditions will, therefore, still be copied every time you run the upsert statement. Practically, this means that re-writing the MERGE INTO statement above to read +- as follows still results in duplicate partitions being stored:</p><pre>MERGE INTO iceberg<br>USING snapshot<br>ON iceberg.id = snapshot.id<br>-- condition expressing change is true if one or more columns is different<br>-- in iceberg and target, i.e.<br>-- (iceberg.col1 != snapshot.col1) OR (iceberg.col2 != snapshot.col2) OR ...<br>WHEN MATCHED AND &lt;condition_expressing_change&gt; THEN UPDATE *<br>WHEN NOT MATCHED BY TARGET THEN INSERT *<br>WHEN NOT MATCHED BY SOURCE THEN DELETE</pre><blockquote>(More detailed analysis with references to source code below)</blockquote><blockquote>In fact, your MATCHED conditions are always re-written <a href="https://github.com/apache/iceberg/blob/f5131af5db564427acf7618649796272c1784784/spark/v3.1/spark-extensions/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RewriteMergeInto.scala#L210">to include a catchall condition that emits the target row</a>.</blockquote><blockquote>Iceberg will construct modified partitions differently depending on whether your MERGE INTO statement contains only MATCHED conditions, NOT MATCHED conditions or both. If only MATCHED conditions exist, <a href="https://github.com/apache/iceberg/blob/f5131af5db564427acf7618649796272c1784784/spark/v3.1/spark-extensions/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RewriteMergeInto.scala#LL124C40-L124C40">a right outer join between target and source suffices</a> (with source being on the right). If only NOT MATCHED conditions exist, Iceberg <a href="https://github.com/apache/iceberg/blob/f5131af5db564427acf7618649796272c1784784/spark/v3.1/spark-extensions/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RewriteMergeInto.scala#L80">uses a left anti join and performs a simple append operation instead of re-writing partitions</a>. If both MATCHED and NOT MATCHED conditions exist<a href="https://github.com/apache/iceberg/blob/f5131af5db564427acf7618649796272c1784784/spark/v3.1/spark-extensions/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RewriteMergeInto.scala#L155">, a full outer join between target and source is required</a>.</blockquote><blockquote>To determine which rows/partitions of the target to re-write, Iceberg <a href="https://github.com/apache/iceberg/blob/f5131af5db564427acf7618649796272c1784784/spark/v3.1/spark-extensions/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RewriteMergeInto.scala#L205">performs a quick inner join between source and target tables</a>. To support NOT MATCHED BY SOURCE, a right outer join would be required, as is implemented by Delta <a href="https://github.com/delta-io/delta/blob/1c18c1d972e37d314711b3a485e6fb7c98fce96d/core/src/main/scala/org/apache/spark/sql/delta/commands/MergeIntoCommand.scala#L490">here</a>.</blockquote><h3>Overcoming MERGE INTO limitations</h3><p>Luckily, we can easily overcome these properties of Iceberg’s upserting functionality. We do this by first preparing a table containing only the changes to your iceberg table. When dealing with sources where rows can be updated and deleted, this requires a full outer join or a sequence of anti-joins. Then, we can use this changes table as the source of updates for our MERGE INTO statement.</p><p>One way to do this is by using a CTE as follows:</p><pre>WITH changes AS<br>SELECT <br>  COALESCE(b.id, a.id) AS id,<br>  b.col1 as col1,<br>  b.col2 as col2,<br>  ...<br>  CASE WHEN b.id IS NULL THEN &#39;D&#39; WHEN a.id IS NULL THEN &#39;I&#39; ELSE &#39;U&#39; END as cdc<br>FROM iceberg a<br>FULL OUTER JOIN snapshot b ON a.id = b.id<br>WHERE NOT (a.col1 = b.col1 AND a.col2 = b.col2 AND ...)<br><br>MERGE INTO iceberg<br>USING changes<br>ON iceberg.id = changes.id<br>WHEN MATCHED AND changes.cdc = &#39;D&#39; THEN DELETE<br>WHEN MATCHED AND changes.cdc = &#39;U&#39; THEN UPDATE SET *<br>WHEN NOT MATCHED THEN INSERT *</pre><p>This results in only changes being stored in the target table iceberg, and supports insertions, updates, and deletes in the source table snapshot.</p><p>The table changes includes a row for every new insert, update or delete in the source table. To construct changes, we perform a full outer join between the recent snapshot of the source table (snapshot) and the existing Iceberg table (iceberg):</p><ul><li>If a row existed in iceberg but no longer exists in snapshot (b.id is null), this corresponds to a delete operation in the source table.</li><li>If an id exists in snapshot but does not yet exist in iceberg (a.id is null), this corresponds to an insert operation in the source table.</li><li>If a row with a specific id exists in both iceberg and snapshot (both a.id and b.id are non-null), the row with this id was unchanged or updated. We filter out unchanged rows by specifying WHERE NOT (a.col1 = b.col1 AND a.col2 = b.col2 AND ...).</li></ul><p>The changes tables only has entries for rows actually requiring changes in the iceberg table, working around the problem of superfluous updates. Merging changes into iceberg using MERGE INTO is straightforward and works the way you would expect it to.</p><p>This post looked at how we can leverage Iceberg to maintain a history of full table snapshots efficiently.</p><p>Iceberg requires some tinkering for it to work the way we want it to, but enables patterns that were previously impossible or inefficient to use with Spark. Iceberg extends Spark&#39;s capabilities with functionality that was previously only available within data warehouses like Snowflake. We hope that with time, Iceberg will become even easier to use. It&#39;s definitely a technology worth exploring!</p><p><em>Edit 21/11: Added some sentences to the introduction and re-wrote some sentences for clarity.</em></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=9e7b957494cf" width="1" height="1" alt=""><hr><p><a href="https://medium.com/datamindedbe/upserting-data-using-spark-and-iceberg-9e7b957494cf">Upserting Data using Spark and Iceberg</a> was originally published in <a href="https://medium.com/datamindedbe">Dataminded</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Make Gitpod Open Sites in the Browser]]></title>
            <link>https://medium.com/datamindedbe/make-gitpod-open-sites-in-the-browser-d4e946e44c1c?source=rss-e29e99a9aba7------2</link>
            <guid isPermaLink="false">https://medium.com/p/d4e946e44c1c</guid>
            <category><![CDATA[software-development]]></category>
            <category><![CDATA[gitpod]]></category>
            <category><![CDATA[data-engineering]]></category>
            <dc:creator><![CDATA[Jonathan Merlevede]]></dc:creator>
            <pubDate>Fri, 16 Dec 2022 15:24:10 GMT</pubDate>
            <atom:updated>2022-12-16T15:24:10.117Z</atom:updated>
            <content:encoded><![CDATA[<h4>Gitpod configuration for opening links in the browser instead of in the terminal.</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*1lCDMgvLg9pPmq7-mhYBrA.png" /><figcaption>The Lynx browser</figcaption></figure><p>At Data Minded, we ❤️ Gitpod for allowing us to easily configure reproducible and sharable development environments. However, we all have our pet peeves — and one small thing I do not like about Gitpod’s default configuration is that opening a URL from a terminal application fires up <a href="https://lynx.invisible-island.net/">LYNX</a>… and this despite Gitpod running in a perfectly usable browser 🌎! Luckily, we can change this default and configure Gitpod to open links in a browser window by using a startup script.</p><h3>Gitpod helper</h3><p>Gitpod’s helper application gp can open browser windows. As an example, the following command will open Data Minded’s website in a browser window:</p><pre>gp preview --external <a href="https://www.dataminded.com/">https://www.dataminded.com/</a></pre><h4>Updating alternatives</h4><p>We can configure the above command as the default way to open links. The idea is to create a small script /usr/local/bin/open.sh with the following contents:</p><pre>#!/bin/sh<br>exec gp preview --external &quot;$@&quot;</pre><p>Then, make it the default way to open links by running:</p><pre>sudo chmod +x /usr/local/bin/open.sh<br>sudo update-alternatives --install /usr/bin/www-browser www-browser /usr/local/bin/open.sh 100</pre><p>The reason for creating the script is that update-alternatives requires you to point to a single executable.</p><h3>Startup configuration</h3><p>You can make Gitpod perform the steps above when starting your workspace starts by adding the following <a href="https://www.gitpod.io/docs/config-start-tasks">start task</a> to your gitpod.yml file:</p><pre>tasks:<br>  command: |<br>    cat &lt;&lt;&#39;EOF&#39; | sudo install /dev/stdin /usr/local/bin/open.sh<br>    #!/bin/sh<br>    exec gp preview --external &quot;$@&quot;<br>    EOF<br>    sudo update-alternatives --install /usr/bin/www-browser www-browser /usr/local/bin/open.sh 100</pre><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=d4e946e44c1c" width="1" height="1" alt=""><hr><p><a href="https://medium.com/datamindedbe/make-gitpod-open-sites-in-the-browser-d4e946e44c1c">Make Gitpod Open Sites in the Browser</a> was originally published in <a href="https://medium.com/datamindedbe">Dataminded</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
    </channel>
</rss>