<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[Stories by Nicolai Antiferov on Medium]]></title>
        <description><![CDATA[Stories by Nicolai Antiferov on Medium]]></description>
        <link>https://medium.com/@nklya?source=rss-4e31e7e49895------2</link>
        <image>
            <url>https://cdn-images-1.medium.com/fit/c/150/150/1*ayMyDWexePBEwGqsOr9Tzg@2x.jpeg</url>
            <title>Stories by Nicolai Antiferov on Medium</title>
            <link>https://medium.com/@nklya?source=rss-4e31e7e49895------2</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Thu, 07 May 2026 19:01:44 GMT</lastBuildDate>
        <atom:link href="https://medium.com/@nklya/feed" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
        <atom:link href="http://medium.superfeedr.com" rel="hub"/>
        <item>
            <title><![CDATA[Redis ACL enforcement without downtime]]></title>
            <link>https://nklya.medium.com/redis-acl-enforcement-without-downtime-1f2343c3ed6c?source=rss-4e31e7e49895------2</link>
            <guid isPermaLink="false">https://medium.com/p/1f2343c3ed6c</guid>
            <category><![CDATA[redis]]></category>
            <category><![CDATA[acl]]></category>
            <category><![CDATA[security]]></category>
            <dc:creator><![CDATA[Nicolai Antiferov]]></dc:creator>
            <pubDate>Sun, 05 Apr 2026 16:00:02 GMT</pubDate>
            <atom:updated>2026-04-05T16:00:02.149Z</atom:updated>
<content:encoded><![CDATA[<h4><strong>Disclaimer</strong>: this article is not about ACL in general; Redis has very good <a href="https://redis.io/docs/latest/operate/oss_and_stack/management/security/acl/">documentation</a> on that. Rather, it summarises my experience with executing an ACL migration seamlessly, without downtime or other interruptions to your services.</h4><p>Redis ACL was first introduced in v6.x and extended in v7.x with the ability to limit subcommands. It replaces the previous password protection with a flexible ACL, where you can define multiple users with different permissions and rotate their credentials easily.</p><p>After an upgrade from older versions, Redis automatically adds a default user ACL to the configuration file. It specifies that the user default has no password and that all commands are allowed on all keys and Pub/Sub channels, for example:</p><pre>user default on nopass sanitize-payload ~* &amp;* +@all</pre><p>It’s possible to define users in the config file one by one OR to use the aclfile directive, which is more flexible, as it allows updating the <strong>whole</strong> ACL with a single command like <a href="https://redis.io/commands/acl-load">ACL LOAD</a> / <a href="https://redis.io/commands/acl-save">ACL SAVE</a>, instead of multiple ACL SETUSER commands followed by CONFIG REWRITE. The downside is that redis-server requires a restart to switch to an aclfile, which can take a lot of time on big installations.</p><h4>Preparations</h4><p>I recommend starting by defining which users are required and their permissions. 
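A quick note on the #passwordhash placeholders used below: open source Redis accepts a password in an ACL rule either as plaintext (prefixed with &gt;) or as an unsalted SHA-256 hex digest (prefixed with #). A minimal sketch for producing the hashed form:

```python
import hashlib

def acl_password_hash(password: str) -> str:
    """Build the '#<sha256hex>' password form for a Redis ACL SETUSER rule.

    Open source Redis supports only a plain, unsalted SHA-256 hex digest here,
    so this is simply the hex digest of the password with a '#' prefix.
    """
    return "#" + hashlib.sha256(password.encode()).hexdigest()

# e.g. paste the result into: ACL SETUSER admin on <hash> ~* &* +@all
rule_fragment = acl_password_hash("s3cr3t-password")
```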
Example:</p><ul><li><strong>admin</strong> user(s) (full access to manage/debug): ACL SETUSER admin on #passwordhash ~* &amp;* +@all</li><li><strong>replica</strong> user (sync from master): ACL SETUSER replica on #passwordhash +psync +replconf +ping</li><li><strong>sentinel</strong> user (for a master/replica setup with sentinel): ACL SETUSER sentinel on #passwordhash allchannels +multi +slaveof +ping +exec +subscribe +config|rewrite +role +publish +info +client|setname +client|kill +script|kill</li><li><strong>exporter</strong> user (for <a href="https://github.com/oliver006/redis_exporter">redis_exporter</a>): ACL SETUSER exporter on #passwordhash +@connection +memory -readonly +strlen +config|get +xinfo +pfcount -quit +zcard +type +xlen -readwrite -command +client -wait +scard +llen +hlen +get +eval +slowlog +cluster|info +cluster|slots +cluster|nodes -hello -echo +info +latency +scan -reset -auth -asking</li><li><strong>application</strong> user(s): the ACL depends on which commands your service uses. You might also just limit the default user’s permissions and leave it without a password if you only want to prohibit dangerous commands like flushall, for example: ACL SETUSER app on #passwordhash ~* &amp;* +@all -@dangerous</li></ul><p>Metrics from <a href="https://github.com/oliver006/redis_exporter">redis_exporter</a> are a great help in understanding which commands are currently used. 
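Those counters can be turned into a first candidate allow-list with a small sketch (the command names and counts below are made up for illustration):

```python
def observed_commands(cmd_counts, min_calls=1):
    """Return commands actually seen in redis_exporter's redis_commands_total
    counters; a starting point for drafting an application user's allow-list."""
    return sorted(cmd for cmd, count in cmd_counts.items() if count >= min_calls)

# hypothetical counter values scraped from the exporter
counts = {"get": 5200, "set": 1300, "ping": 87, "flushall": 0}
allowlist = observed_commands(counts)  # flushall drops out: never called
```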
Check the results of the promQL query sum(rate(redis_commands_total[5m])) by (cmd) for details.</p><p>The list of all commands is available in the <a href="https://redis.io/docs/latest/commands/">documentation</a>, and each command lists its <strong>ACL categories</strong> at the beginning, which simplifies ACL definitions by using category aliases instead of enumerating every allowed command, example:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*AJyXqWgID0hm8rdIUK3s6A.png" /></figure><h4>Migration</h4><p>With all preparations done, it’s time to start the migration from a single master/replica Redis or a cluster without ACL to an ACL-enforced setup.</p><ol><li>It requires at least one restart, when all nodes are first rolled out with ACL but the default user still has admin permissions.</li><li>Then you need to populate the <strong>replica</strong> user to all the nodes with the commands: config set masteruser replica + config set masterauth replica-pass.</li><li>Then other components of the setup, like the exporter, should be switched to separate users if they need special permissions or you plan to disable the default user.</li><li>For a master/replica setup, the <a href="https://redis.io/docs/latest/operate/oss_and_stack/management/sentinel/">sentinel user</a> should then be set by executing these commands on <strong>each sentinel node</strong> for <strong>each Redis</strong> it monitors:</li></ol><pre>sentinel auth-user &lt;master-name&gt; &lt;username&gt;<br>sentinel auth-pass &lt;master-name&gt; &lt;password&gt;</pre><p>5. 
Then you limit/disable the default user and update the ACL, either manually by running config set or with ACL LOAD if an aclfile is used.</p><p>Afterwards, all that&#39;s left is to monitor redis_exporter metrics for NOPERM/WRONGPASS errors with a promQL query like this: sum(rate(redis_errors_total[5m])) by (err)</p><h4>Important details</h4><ol><li>Switching between an aclfile and plain users in the config file requires a restart</li><li>ACL is not replicated between master/replicas and should be set on each node separately</li><li>Sentinel requires additional permissions to be able to manage single master/replica setups</li><li>The replica user requires permissions to sync from the master in all setups (single, cluster)</li><li>Prefer storing hashed passwords (starting with #) in config files rather than plaintext (starting with &gt;) for security reasons</li><li>Open source Redis supports only simple SHA-256 hashes without salt</li><li>At the moment all Redis v8.x releases crash on ACL LOAD with the Search module (Vector DB) enabled, <a href="https://github.com/RediSearch/RediSearch/issues/8342">issue</a></li><li>HashiCorp Vault’s Redis plugin doesn’t support anything but a single Redis, <a href="https://github.com/hashicorp/vault-plugin-database-redis/issues/46">issue</a></li></ol>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Ansible, macOS and “A worker was found in a dead state” fix]]></title>
            <link>https://nklya.medium.com/ansible-macos-and-a-worker-was-found-in-a-dead-state-fix-553a2d44e1f1?source=rss-4e31e7e49895------2</link>
            <guid isPermaLink="false">https://medium.com/p/553a2d44e1f1</guid>
            <category><![CDATA[ansible]]></category>
            <category><![CDATA[iac]]></category>
            <category><![CDATA[macos]]></category>
            <category><![CDATA[devcontainer]]></category>
            <dc:creator><![CDATA[Nicolai Antiferov]]></dc:creator>
            <pubDate>Sat, 04 Apr 2026 14:53:40 GMT</pubDate>
            <atom:updated>2026-04-04T14:57:02.927Z</atom:updated>
<content:encoded><![CDATA[<p>If you’re using macOS as an Ansible controller, you might have seen this issue before, when an ansible run failed with ERROR! A worker was found in a dead state in the middle of execution.</p><p>And one of these environment variables helped to fix it:</p><pre>export no_proxy=*<br>export OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES</pre><p>This issue affects not only Ansible, but other Python projects too. Details can be found in the issue <a href="https://github.com/ansible/ansible/issues/32554">https://github.com/ansible/ansible/issues/32554</a></p><p>But with the recent macOS update to Tahoe (v26) these workarounds unfortunately don’t work anymore, and you’ll receive the ERROR! A worker was found in a dead state error in almost 100% of runs.</p><p>There are a couple of possible fixes for this issue:</p><ul><li>Find <strong>another workaround</strong> to continue running Ansible on macOS</li><li>Switch the controller OS to Linux via <strong>Docker/Devcontainers</strong> or some other way</li></ul><p>For a long time after this issue appeared about half a year ago, the only option was <strong>Docker/Devcontainers</strong>. Recently however, there was an update in a <a href="https://www.reddit.com/r/ansible/comments/1nx7zdd/ansiblepython_fork_issue_reoccurring_since_macos/">Reddit thread</a>, which suggested using another environment variable, and I can confirm it works fine. So if you want to quickly fix the issue without changing anything, just export it and that’s it:</p><pre>export OS_ACTIVITY_MODE=disable</pre><p>However, switching to <strong>Devcontainers</strong> might still be beneficial, as the controller environment can be abstracted away and it creates an easily reproducible environment for your teammates. 
You don’t have to worry about installing Ansible and its dependencies locally and keeping them updated.</p><p><a href="https://containers.dev/">Devcontainers</a> allow you to use a container as a full-featured development environment. It can be used to run an application, to separate tools, libraries, or runtimes needed for working with a codebase, and to aid in continuous integration and testing. Dev containers can be run locally or remotely, in a private or public cloud, in a variety of <a href="https://containers.dev/supporting">supporting tools and editors</a>.</p><p>In this example I’ll focus on using devcontainers with VS Code, which has native support for them. The code can be found in the repo:</p><p><a href="https://github.com/Nklya/ansible-devcontainer-example">GitHub - Nklya/ansible-devcontainer-example: Example of usage Devcontainers with Ansible</a></p><p>There are multiple ways to start using Ansible with Devcontainers, from manually creating all the required folders/files to using the helper in the official Ansible extension (<strong>NOTE</strong>: it requires ansible-creator to be installed):</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*4AzKOzg-LowxGESAHGYmDg.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*_V_uMOqqlqUrJLsQyhx2Xw.png" /></figure><p>The Dev Containers extension will be installed automatically when the .devcontainer folder is created.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*q6zdkxMh8-afaxOGpXWuWA.png" /></figure><p>After that VS Code will automatically suggest switching to one of the Dev containers, but you can also do it yourself by clicking the bottom-left button &gt;&lt; and selecting Reopen in Container:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*uzGv1Krarv8V83Z424MV3g.png" /></figure><p>The first open will take some time due to the container build, but after that it will be fast. 
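For orientation, a minimal .devcontainer/devcontainer.json for such a setup might look like the sketch below. Treat the image tag and the extension ID as assumptions (redhat.ansible is the usual ID of the official Ansible extension, and the image matches the devcontainers/python base used elsewhere in this article):

```json
{
  "name": "ansible",
  "image": "mcr.microsoft.com/devcontainers/python:3.13",
  "customizations": {
    "vscode": {
      "extensions": ["redhat.ansible"]
    }
  }
}
```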
To switch back, click the same bottom-left button &gt;&lt; and select “Reopen folder locally”.</p><p>Git access is mirrored from the local machine, so it’s possible to contribute from inside the container, same as usual. In extensions: [] you can add the extensions you usually use.</p><p>After you change something in .devcontainer/devcontainer.json , a pop-up will be shown asking to rebuild it:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/920/1*26ujvHuUopsaBun728hlpg.png" /></figure><p>The aforementioned solution works fine until you need customisations, like specific versions of Ansible, additional tools installed, etc. It’s definitely possible to add them there, but I think it’s easier to manually create a Dev container with all the customisations you need, based on the devcontainers/python image. The full configuration file specification can be found <a href="https://containers.dev/implementors/json_reference/">here</a>.</p><p>Basically, all we need is to create a Dockerfile and point to it in devcontainer.json like:</p><pre>  &quot;build&quot;: {<br>    &quot;dockerfile&quot;: &quot;Dockerfile&quot;<br>  },</pre><p>And define in the Dockerfile all the stuff you want to install/customize, for example:</p><pre>FROM mcr.microsoft.com/devcontainers/python:3.13<br><br># Install Ansible 13 packages<br>RUN pip install --no-cache-dir ansible==13.* ansible-compat==26.* ansible-lint==26.*</pre><p>A full example can be found in the repo <a href="https://github.com/Nklya/ansible-devcontainer-example">https://github.com/Nklya/ansible-devcontainer-example</a></p><h4>Additional things about Devcontainers</h4><p><strong>AWS access. 
</strong>If AWS credentials are required inside the container, you need to mount them:</p><pre>&quot;mounts&quot;: [<br>     &quot;source=${localEnv:HOME}/.aws,target=/home/vscode/.aws,type=bind,readonly&quot;<br> ]</pre><p>In case of <strong>AWS SSO usage</strong>, access should be read-write, as temporary credentials are generated:</p><pre>&quot;mounts&quot;: [<br>     &quot;source=${localEnv:HOME}/.aws,target=/home/vscode/.aws,type=bind&quot;<br> ]</pre><p><strong>User id mismatch.</strong> No matter which approach you use, the user id inside the container and on the local machine won’t match. This might create issues if you’re using Ansible with SSH access and the username on remote hosts matches your local one, so that you don’t override ansible_user in your Ansible configuration.</p><p>To fix this, it’s possible to set an override in the Dockerfile in /etc/ssh/ssh_config.d/custom.conf with User taken from ${localEnv:USER} and passed via args.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[AWS quotas and their usage monitoring]]></title>
            <link>https://awstip.com/aws-quotas-and-their-usage-monitoring-3b4fa2622b51?source=rss-4e31e7e49895------2</link>
            <guid isPermaLink="false">https://medium.com/p/3b4fa2622b51</guid>
            <category><![CDATA[golang]]></category>
            <category><![CDATA[monitoring]]></category>
            <category><![CDATA[aws]]></category>
            <category><![CDATA[prometheus]]></category>
            <category><![CDATA[exporters]]></category>
            <dc:creator><![CDATA[Nicolai Antiferov]]></dc:creator>
            <pubDate>Wed, 22 Oct 2025 20:48:48 GMT</pubDate>
            <atom:updated>2026-02-09T10:19:07.258Z</atom:updated>
<content:encoded><![CDATA[<p>When you’re actively using AWS services, quite soon you will find that they have <a href="https://docs.aws.amazon.com/general/latest/gr/aws_service_limits.html">quotas</a>: limits on how many EC2 instances or RDS databases you can create, rate limits for different operations, etc.</p><p>Some of them can be changed, some cannot. Often the default limits for quotas are quite low, for example only 5 CPU cores are allowed for the <em>“Running On-Demand Standard (A, C, D, H, I, M, R, T, Z) instances (L-1216C47A)”</em> quota.</p><p>And it’s quite frustrating to realise that a quota is exhausted when something breaks out of nowhere, and it is especially bad if there’s an ongoing incident and you cannot solve it without increasing quotas. With support, you can at least escalate a quota increase, but without it you’re stuck with a quota increase request, which may take days to apply, depending on your luck.</p><p>That’s why it’s important to keep an eye on quota values and utilisation for the services you use and to request increases proactively as usage grows.</p><p>While IaC with tools like Terraform (the <a href="https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/servicequotas_service_quota.html">aws_servicequotas_service_quota</a> resource) can help with changing quotas and tracking their history (if you don’t want to rely only on the AWS console for it), some monitoring solution is required to monitor quota usage.</p><p><em>Disclaimer</em>: it’s worth mentioning that you might find this article not very useful if you enjoy <strong>CloudWatch</strong> as a monitoring/alerting solution. 
That’s basically the way AWS recommends to monitor quotas (<a href="https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch-Quotas-Visualize-Alarms.html">docs</a>), and alerts are embedded nicely into the quotas UI, example:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*sQVMwjWW3EDg6Y9X-eXRkA.png" /></figure><p>But this also means that you’ll have to configure CloudWatch alerts for each quota you want to monitor, in every account and region you run your workloads in. That could be quite tedious even with IaC if you have multiple accounts and use a lot of services in different regions.</p><p>But I believe there’s a better way to monitor quotas: <a href="https://prometheus.io/"><strong>Prometheus</strong></a> + an open source <strong>exporter</strong>, which allows you to <strong>collect usage for all available quotas</strong>, visualise it nicely with <a href="https://grafana.com/">Grafana</a>, and create <strong>universal alerts</strong> thanks to promQL and Alertmanager. <strong>NOTE</strong>: Feel free to scroll down to skip the story and check how to set up quota monitoring with <em>aws_quota_exporter</em>.</p><p>When I started researching this problem, my first thought was to use <a href="https://github.com/prometheus/cloudwatch_exporter">cloudwatch_exporter</a> or <a href="https://github.com/prometheus-community/yet-another-cloudwatch-exporter">yet-another-cloudwatch-exporter</a>, which are well known projects for collecting metrics from CloudWatch into Prometheus. 
But unfortunately quota usage is not a CloudWatch metric per se: it requires computation using “metric math”, which is implemented in neither of those exporters, <a href="https://github.com/prometheus/cloudwatch_exporter/issues/406">issue</a>.</p><p>Then I checked quota exporter projects on <a href="https://github.com/search?q=quota%20exporter&amp;type=repositories">GitHub</a>, just to find dozens of them in different stages of maintenance, each working differently. It seems this variety comes from AWS itself, whose approaches to monitoring quotas evolved over time:</p><ul><li><a href="https://github.com/thought-machine/aws-service-quotas-exporter">thought-machine/aws-service-quotas-exporter</a> — collects quota values from the Quotas API, but usage directly from service APIs. For example, to get the number of running EC2 instances, it lists running instances and counts them. As a result, not that many quotas are reported, 8 in total. Last update in 2023.</li><li><a href="https://github.com/brennerm/aws-quota-checker">brennerm/aws-quota-checker</a> — similar in terms of how usage is calculated (plain API calls to count items), but written in Python, with more quotas reported. Also works as a CLI. Last update in 2022.</li><li><a href="https://github.com/danielfm/aws-limits-exporter">danielfm/aws-limits-exporter</a> — this one depends on the<a href="https://aws.amazon.com/premiumsupport/technology/trusted-advisor/"> AWS Trusted Advisor API</a>, which requires a <a href="https://aws.amazon.com/premiumsupport/plans/">Business or Enterprise support plan</a> to monitor quota usage, and from my observation Trusted Advisor is <a href="https://docs.aws.amazon.com/awssupport/latest/user/service-limits.html">missing some services</a> I wanted to monitor, like SageMaker. 
Last update in 2023.</li><li><a href="https://github.com/lablabs/aws-service-quotas-exporter">lablabs/aws-service-quotas-exporter</a> — uses a bash wrapper around aws-cli to collect quota usage.</li><li><a href="https://github.com/emylincon/aws_quota_exporter">emylincon/aws_quota_exporter</a> — at the beginning of the year it was only able to collect quota values, without usage.</li></ul><p>After checking the aforementioned exporters, I decided to extend <a href="https://github.com/emylincon/aws_quota_exporter">emylincon/aws_quota_exporter</a> with functionality to collect quota usage. The project was relatively active, able to report quotas for any service in the Quotas API, supported caching, and the code was easy to understand and extend.</p><p>I thought it shouldn’t be that hard: just call the API for a specific quota to get its usage and report a new metric, since the AWS Web UI shows quota usage with history directly in the interface, example:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*FMMA13nMw2RD_NRMm56lYQ.png" /></figure><p>But as I soon found out, the UI makes calls to the CloudWatch API under the hood to collect usage, so the exporter should do the same.</p><p>Both the <a href="https://docs.aws.amazon.com/servicequotas/2019-06-24/apireference/API_ListServiceQuotas.html"><strong>ListServiceQuotas</strong></a> and <a href="https://docs.aws.amazon.com/servicequotas/2019-06-24/apireference/API_GetServiceQuota.html"><strong>GetServiceQuota</strong></a> API calls return only the quota itself and, if usage is available, a definition of the CloudWatch request in the UsageMetric field, but not the usage value itself. 
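To make the UsageMetric-to-CloudWatch step concrete, here is a rough sketch (not the exporter’s actual code) of translating that field into keyword arguments for a GetMetricStatistics call, e.g. boto3’s get_metric_statistics. The 15-minute default lookback and the wider window for laggy services like RDS follow the delays discussed below:

```python
from datetime import datetime, timedelta, timezone

def cw_request_params(usage_metric, lookback_minutes=15):
    """Build GetMetricStatistics parameters from a Service Quotas UsageMetric.

    A 15-minute lookback covers CloudWatch's usual reporting delay; RDS usage
    metrics arrive ~1 hour late, so they need roughly a 60+15 minute window.
    """
    now = datetime.now(timezone.utc)
    return {
        "Namespace": usage_metric["MetricNamespace"],
        "MetricName": usage_metric["MetricName"],
        # CloudWatch expects dimensions as a list of {"Name": ..., "Value": ...}
        "Dimensions": [
            {"Name": name, "Value": value}
            for name, value in sorted(usage_metric["MetricDimensions"].items())
        ],
        "StartTime": now - timedelta(minutes=lookback_minutes),
        "EndTime": now,
        "Period": 60,
        "Statistics": [usage_metric["MetricStatisticRecommendation"]],
    }
```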
You can check it yourself with aws-cli, example for the <strong>ec2</strong> quota:</p><pre>$ aws service-quotas get-service-quota --service-code ec2 --quota-code L-1216C47A <br><br>{<br>    &quot;Quota&quot;: {<br>        &quot;ServiceCode&quot;: &quot;ec2&quot;,<br>        &quot;ServiceName&quot;: &quot;Amazon Elastic Compute Cloud (Amazon EC2)&quot;,<br>        &quot;QuotaArn&quot;: &quot;arn:aws:servicequotas:eu-north-1:123456789:ec2/L-1216C47A&quot;,<br>        &quot;QuotaCode&quot;: &quot;L-1216C47A&quot;,<br>        &quot;QuotaName&quot;: &quot;Running On-Demand Standard (A, C, D, H, I, M, R, T, Z) instances&quot;,<br>        &quot;Value&quot;: 16.0,<br>        &quot;Unit&quot;: &quot;None&quot;,<br>        &quot;Adjustable&quot;: true,<br>        &quot;GlobalQuota&quot;: false,<br>        &quot;UsageMetric&quot;: {<br>            &quot;MetricNamespace&quot;: &quot;AWS/Usage&quot;,<br>            &quot;MetricName&quot;: &quot;ResourceCount&quot;,<br>            &quot;MetricDimensions&quot;: {<br>                &quot;Class&quot;: &quot;Standard/OnDemand&quot;,<br>                &quot;Resource&quot;: &quot;vCPU&quot;,<br>                &quot;Service&quot;: &quot;EC2&quot;,<br>                &quot;Type&quot;: &quot;Resource&quot;<br>            },<br>            &quot;MetricStatisticRecommendation&quot;: &quot;Maximum&quot;<br>        },<br>        &quot;QuotaAppliedAtLevel&quot;: &quot;ACCOUNT&quot;<br>    }<br>}</pre><p>So in order to collect quota usage, the code should:</p><ul><li>check if the quota has UsageMetric in the response</li><li>form a CloudWatch request based on it</li><li>execute the request to CloudWatch</li><li>get the latest value from the response</li><li>build a usage metric to report to Prometheus</li></ul><p>To retrieve CloudWatch metrics there are two API calls available: <a href="https://docs.aws.amazon.com/AmazonCloudWatch/latest/APIReference/API_GetMetricStatistics.html">GetMetricStatistics</a> and <a 
href="https://docs.aws.amazon.com/AmazonCloudWatch/latest/APIReference/API_GetMetricData.html">GetMetricData</a>. GetMetricData is more efficient and cheaper when requesting large batches of metrics, which is why it’s used in various exporters, like <a href="https://github.com/influxdata/telegraf/issues/5420">here</a>. However, in this case we <strong>need only the latest value</strong> to report as quota usage to Prometheus during a scrape. And the best part: GetMetricStatistics requests are free for up to 1 million API requests, <a href="https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/cloudwatch_billing.html">docs</a>.</p><p>So I decided to go with <em>GetMetricStatistics</em> as the more cost-effective option. The first issue I faced: CloudWatch returns an empty response for quota usage (i.e. Datapoints=[]) if there is none, but that can also be caused by a too-short time window in the request. With the initially set 5-minute window, some requests returned no data.</p><blockquote>If you send 5-minute metrics from CloudWatch, there can be ~5–15 minute delay in receiving your metrics. This is because CloudWatch makes your data available with a 5–10 minute delay. Additionally, CloudWatch API limitations can introduce another 5 minutes of delay.</blockquote><p>Then I found this ⬆️ and increased the window to 15 minutes. If you’re interested, check <a href="https://github.com/emylincon/aws_quota_exporter/pull/165">PR#165</a> with the initial implementation of quota usage collection.</p><p>Only to discover later that RDS usage metrics arrive with a ~1 hour delay, so the request window for them should be increased to 60+15 minutes (<a href="https://github.com/emylincon/aws_quota_exporter/pull/208">PR#208</a>).</p><p>Now, with this feature released, you can easily monitor quota usage with <a href="https://github.com/emylincon/aws_quota_exporter">aws_quota_exporter</a> and Prometheus.</p><p>I won’t describe how to run it, as that depends on your preferences. 
In general, the exporter should have at least these IAM permissions:</p><pre>{<br>    &quot;Version&quot;: &quot;2012-10-17&quot;,<br>    &quot;Statement&quot;: [<br>        {<br>            &quot;Effect&quot;: &quot;Allow&quot;,<br>            &quot;Action&quot;: [<br>                &quot;servicequotas:ListAWSDefaultServiceQuotas&quot;,<br>                &quot;servicequotas:ListServiceQuotas&quot;,<br>                &quot;cloudwatch:GetMetricStatistics&quot;<br>            ],<br>            &quot;Resource&quot;: &quot;*&quot;<br>        }<br>    ]<br>}</pre><p>Define a config file with all the <strong>services/accounts/regions</strong> you want to monitor, see the example below. Service names can be found in the AWS UI or with <em>aws-cli</em>: aws service-quotas list-services.</p><pre>---<br>jobs:<br>  - serviceCode: ec2<br>    accountName: account2<br>    regions:<br>      - eu-west-1<br>      - eu-west-2<br>      - eu-north-1<br>  - serviceCode: lambda<br>    accountName: account1 # optional, but if set will be in labels<br>    regions:<br>      - eu-west-1<br>      - eu-north-1<br>  - serviceCode: cloudformation<br>    accountName: account2<br>    regions:<br>      - eu-west-1<br>      - eu-north-1</pre><p>Set the -collect.usage flag in the exporter deployment in order to collect quota usage. I also recommend increasing the cache interval to lower the rate of API calls, especially for big deployments, with something like -cache.duration 15m (default 5m), and also setting -cache.serve-stale to avoid gaps on graphs when the cache updates.</p><p>After start, the exporter gathers quota values and usage for all jobs defined in the config and reports the same metrics on each scrape until the cache expires. The metric name is formed from the quota name, for example: aws_quota_lambda_concurrent_executions for the “Concurrent executions” quota of the lambda service. 
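The name derivation can be sketched roughly like this (the exporter’s real normalisation rules may differ; this merely reproduces the pattern of the example above):

```python
import re

def quota_metric_name(service_code, quota_name):
    """Sketch of the metric naming scheme: lower-case the quota name,
    collapse runs of non-alphanumerics into underscores, and prefix
    with aws_quota_<service_code>."""
    slug = re.sub(r"[^a-z0-9]+", "_", quota_name.lower()).strip("_")
    return f"aws_quota_{service_code}_{slug}"

name = quota_metric_name("lambda", "Concurrent executions")
```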
Metrics for the quota value will have a {type="quota"} label, and those for usage a {type="usage"} label.</p><p>This makes it easy to define <strong>universal alerts</strong> on the utilisation of any quota that reports usage, with a promQL query such as {job="quota-exporter", type="usage"} / ignoring (type) {job="quota-exporter", type="quota"}&gt;0.75 to alert when usage &gt; 75%.</p><p>You can also visualise this nicely with Grafana, example from the exporter <a href="https://github.com/emylincon/aws_quota_exporter/blob/main/img/grafana.png">repository</a>:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*LEKhTB5tJDhSDzzFxK3LPA.png" /></figure><p>One more question is left: how to monitor the usage of a quota that doesn’t report it in the API? In this case you might still be able to collect usage from other sources, for example via <a href="https://github.com/prometheus/cloudwatch_exporter">cloudwatch_exporter</a> (if available there) into Prometheus, and then use the <strong>quota value</strong> from <a href="https://github.com/emylincon/aws_quota_exporter">aws_quota_exporter</a> and the <strong>quota usage</strong> from <a href="https://github.com/prometheus/cloudwatch_exporter">cloudwatch_exporter</a> to build a promQL query for Alertmanager. This requires research, though, to find which of the metrics a service reports to CloudWatch can serve as usage.</p><p><strong>UPD</strong>: In October 2025 AWS released “<a href="https://aws.amazon.com/about-aws/whats-new/2025/10/automatic-quota-management-service-quotas/">Automatic quota management for AWS Service Quotas</a>”, providing <strong>usage monitoring</strong> for supported quotas; in the future it should be able to request quota increases automatically. 
Note that you have to configure it manually per account/region, and it doesn’t appear to be supported by IaC at the moment.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*wV4gPAIjMc7v-AtUUKQEgQ.png" /></figure><p>So in the end it’s up to you what to choose: plain CloudWatch alerts per quota, the new automatic quota management, or Prometheus + <a href="https://github.com/emylincon/aws_quota_exporter">aws_quota_exporter</a>.</p><hr><p><a href="https://awstip.com/aws-quotas-and-their-usage-monitoring-3b4fa2622b51">AWS quotas and their usage monitoring</a> was originally published in <a href="https://awstip.com">AWS Tip</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Ansible / How to (almost) transparently switch from ssh to ssm-agent]]></title>
            <link>https://nklya.medium.com/ansible-how-to-almost-transparently-switch-from-ssh-to-ssm-agent-c46ac04101f9?source=rss-4e31e7e49895------2</link>
            <guid isPermaLink="false">https://medium.com/p/c46ac04101f9</guid>
            <category><![CDATA[ansible]]></category>
            <category><![CDATA[ssm-agent]]></category>
            <category><![CDATA[aws]]></category>
            <dc:creator><![CDATA[Nicolai Antiferov]]></dc:creator>
            <pubDate>Mon, 09 Jun 2025 21:32:54 GMT</pubDate>
            <atom:updated>2025-08-02T08:41:55.263Z</atom:updated>
<content:encoded><![CDATA[<p>Starting to use Ansible is very simple: you just need ssh access to the host you want to manage and a control host from which you run ansible-playbook with your code.</p><p>However, ssh user management was never an easy task. If you leave it to manual actions, it will end up in a mess. And then begins the story of different solutions, internally developed or external tools/services, which help to fix this issue. On top of that you get topics related to certifications, audits, etc.</p><p>But if you’re running your workloads on AWS, you can relatively easily switch from running Ansible via ssh to <a href="https://github.com/aws/amazon-ssm-agent">amazon-ssm-agent</a> and forget about all the issues with ssh user management. Additionally, you’ll get the ability to nicely audit access to the ec2 instances (<a href="https://docs.aws.amazon.com/systems-manager/latest/userguide/session-manager.html">docs</a>) and to grant access based on IAM policies with AWS auth.</p><p>There are some parts missing and in general the documentation isn’t great, which is why I think it’s worth describing shortly what’s required for this switchover.</p><p><strong>NOTE</strong>: Documentation improved in this <a href="https://github.com/ansible-collections/amazon.aws/pull/2661">PR</a> and now describes how to show hostnames with the <strong>aws_ssm</strong> connection.</p><p>SSM Agent runs on EC2 instances and enables you to quickly and easily execute remote commands or scripts against one or more instances. It doesn’t require inbound network connectivity, only a proper IAM profile attached to the ec2 instance, policy: arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore. 
Most AMIs come with ssm-agent pre-installed, but if yours is missing it, you can install it manually, <a href="https://docs.aws.amazon.com/systems-manager/latest/userguide/manually-install-ssm-agent-linux.html">docs</a>.</p><p><strong>NOTE</strong>: In order to have SSM Agent -&gt; SSH working properly, the <a href="https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-instance-connect-set-up.html"><strong>EC2 Instance Connect</strong></a> package should be installed as well.</p><p>Let’s imagine you’re using the <a href="https://docs.ansible.com/ansible/latest/collections/amazon/aws/aws_ec2_inventory.html">aws_ec2</a> dynamic inventory to get the hosts you want to run Ansible against. It takes some tag for the hostname, for example tag:Name, which will be shown in the log and can be used in --limit , etc.</p><pre>---<br>plugin: amazon.aws.aws_ec2<br>regions:<br>  - us-east-1<br>filters:<br>  instance-state-name: running<br>...<br>hostnames:<br>  - tag:Name # will return some-hostname in example</pre><p>And the log in this case will look like:</p><pre>TASK [Some test task] ***********************************************<br>ok: [some-hostname] =&gt; {<br>    &quot;msg&quot;: &quot;Running task on host&quot;<br>}<br>...<br>PLAY RECAP **********************************************************<br>some-hostname : ok=2    changed=0    unreachable=0    failed=0    </pre><p>As described in the docs, if your tag:Name is not an FQDN, you can use the compose option to set <em>ansible_host</em> to <em>private_ip</em>, so Ansible will know how to reach the host, for example:</p><pre>---<br>plugin: amazon.aws.aws_ec2<br>regions:<br>  - us-east-1<br>filters:<br>  instance-state-name: running<br>...<br>hostnames:<br>  - tag:Name # will return some-hostname in example<br>compose:<br>  ansible_host: private_ip_address</pre><p>When you run ansible-playbook with this config, the log will be the same, i.e. 
<em>some-hostname</em> will be shown.</p><p>If you search Google for how to use Ansible with ssm-agent, you’ll get a bunch of articles from AWS which describe how to pack your code into a bundle, upload it to S3 and execute it with <strong>AWS Systems Manager</strong>. That’s neither good nor bad, but I think there aren’t enough docs clearly describing how to switch from ssh to ssm-agent 1:1, without complicating everything.</p><p>So in short, first you need to check the docs for the <a href="https://docs.ansible.com/ansible/latest/collections/community/aws/aws_ssm_connection.html">community.aws.aws_ssm connection</a>, which gives you the ability to run Ansible over an ssm-agent connection instead of ssh, and define the required variables in your playbook/group_vars/etc, for example:</p><pre>---<br>- name: Wait for connection to be available<br>  vars:<br>    ansible_connection: aws_ssm<br>    ansible_aws_ssm_bucket_name: some-bucket<br>    ansible_aws_ssm_region: us-east-1<br>    ansible_aws_ssm_profile: some-profile<br>  tasks:<br>    - name: Wait for connection<br>      wait_for_connection:</pre><p>Basically, with ansible_connection: aws_ssm, Ansible uploads your files to the provided S3 bucket instead of transferring them via ssh, and passes a presigned S3 URL to the managed host to execute via the ssm-agent connection.</p><p>But there’s a catch: according to the docs, the dynamic inventory configuration has to change a little bit:</p><pre>---<br>plugin: amazon.aws.aws_ec2<br>regions:<br>  - us-east-1<br>filters:<br>  instance-state-name: running<br>...<br>hostnames:<br>  - instance-id</pre><p>With this example, instead of the readable hostname from tag:Name, Ansible will show you the instance id in logs, like i-123456789 . And the same name has to be used in --limit , etc. 
Which is way less convenient and won’t help with a painless migration.</p><pre>TASK [Some test task] ***********************************************<br>ok: [<strong>i-123456789</strong>] =&gt; {<br>    &quot;msg&quot;: &quot;Running task on host&quot;<br>}<br>...<br>PLAY RECAP **********************************************************<br><strong>i-123456789</strong> : ok=2    changed=0    unreachable=0    failed=0</pre><p>What should be done instead is the same trick with the compose option mentioned earlier, but instead of the private IP address, we set instance_id.</p><p>This allows us to migrate transparently from the ssh connection to ssm-agent with no change in behaviour, i.e. the same old hostnames will be shown in logs and used in command line options 🎉</p><pre>---<br>plugin: amazon.aws.aws_ec2<br>regions:<br>  - us-east-1<br>filters:<br>  instance-state-name: running<br>...<br>hostnames:<br>  - tag:Name # will use `tag:Name` for hostname<br>compose:<br>  ansible_host: instance_id # but connect to InstanceID via ssm_agent</pre><h4>Limitations</h4><p>While this approach lets you replace ssh with ssm-agent 1:1, it still has limitations:</p><ol><li>Overall, execution via <em>aws_ssm</em> is slower: according to my tests, at least 2–4x. Not critical for relatively well written and short code, but with some legacy long-running stuff it might be a concern. There’s an <a href="https://github.com/ansible-collections/amazon.aws/issues/2636">issue</a> in the Ansible backlog to improve this.</li><li>Connection via <em>aws_ssm</em> fixes only part of the equation, i.e. how to run Ansible, but not how to replace ssh completely. So in order to log in to the host, <a href="https://awscli.amazonaws.com/v2/documentation/api/latest/reference/ssm/start-session.html">aws-cli</a> should be used with the start-session command. 
It requires the <a href="https://docs.aws.amazon.com/systems-manager/latest/userguide/session-manager-working-with-install-plugin.html">Session Manager plugin</a> installed (Ansible requires it as well), for example: aws ssm start-session --target &quot;i-123456789&quot; . You might remember instances by hostname (tag:Name), for example, but here you have to know the InstanceId. In order to simplify this, you might want to create some bash one-liner with aws-cli querying EC2 by tags and returning the InstanceId, but I think the best option would be to wrap it into some script, like bash or an Invoke task (check this <a href="https://awstip.com/better-make-for-automation-ad371bf42a97">article</a> for details).</li><li>SSM agent only provides the ability to log in and execute commands on remote hosts. In order to copy files to and from EC2, you need to use SSM Agent -&gt; SSH proxying (docs), which lets you run commands like ssh i-123456789 or scp i-123456789:/tmp/something .</li></ol><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=c46ac04101f9" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Ansible / How to measure command duration]]></title>
            <link>https://nklya.medium.com/ansible-how-to-measure-command-duration-4f6bbe0a8523?source=rss-4e31e7e49895------2</link>
            <guid isPermaLink="false">https://medium.com/p/4f6bbe0a8523</guid>
            <category><![CDATA[ansible]]></category>
            <category><![CDATA[ansible-tutorial]]></category>
            <category><![CDATA[iac]]></category>
            <dc:creator><![CDATA[Nicolai Antiferov]]></dc:creator>
            <pubDate>Sun, 05 Jan 2025 12:38:58 GMT</pubDate>
            <atom:updated>2025-01-05T12:38:58.207Z</atom:updated>
            <content:encoded><![CDATA[<p>For various reasons, you might end up in a situation where you need to measure how long some command took to run. I had to do this some years ago, running DB migrations during deploys with Ansible (I know this is not a great solution, but that was PHP 😀).</p><p>If you only want to know how long each task runs, it’s pretty easy: just add <a href="https://docs.ansible.com/ansible/latest/collections/ansible/posix/profile_tasks_callback.html">profile_tasks</a> to ansible.cfg, like this:</p><pre>[defaults]<br>callback_whitelist = profile_tasks</pre><p>And at the end of the ansible-playbook run you will get a summary of how long each task was running, for example:</p><pre>Sunday 05 January 2025  14:15:42 +0200 (0:00:20.028)       0:00:30.058 ******** <br>=============================================================================== <br>Do something even longer ---------------------------------------------- 20.03s<br>Do something long ----------------------------------------------------- 10.02s</pre><p>But what if you want to know how long one particular task was running and <strong>change the flow of tasks</strong> in the playbook depending on this?</p><p>One way could be to report the duration from the module you’re running. 
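For example, the ansible.builtin.command module already reports timing in its result via the start, end and delta fields, so for plain commands the duration can be read directly (a sketch; the script name is made up):</p><pre># Sketch: the command module&#39;s result already includes timing fields<br>- name: Run the migration<br>  ansible.builtin.command: ./migrate.sh # made-up script<br>  register: result<br><br>- name: Show how long it ran<br>  ansible.builtin.debug:<br>    var: result.delta # a string like &quot;0:00:30.123456&quot;</pre><p>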
If it’s your own custom module, that should be possible and not very hard.</p><p>To achieve this with any module, I think the easiest approach is to register the task result and calculate the duration from its stop and start timestamps, for example:</p><pre>---<br>- gather_facts: false<br>  hosts: all<br>  tasks:<br>    - name: Do something long<br>      ansible.builtin.pause:<br>        seconds: 30<br>      register: this<br><br>    - set_fact:<br>        duration: &quot;{{ (this.stop|to_datetime(&#39;%Y-%m-%d %H:%M:%S.%f&#39;) - this.start|to_datetime(&#39;%Y-%m-%d %H:%M:%S.%f&#39;)).total_seconds() }}&quot;<br><br>    - debug:<br>        var: duration</pre><pre>PLAY [all] ****************************************************************************************************************************************<br><br>TASK [Do something long] **************************************************************************************************************************<br>Sunday 05 January 2025  14:36:33 +0200 (0:00:00.007)       0:00:00.007 ******** <br>Pausing for 30 seconds<br>(ctrl+C then &#39;C&#39; = continue early, ctrl+C then &#39;A&#39; = abort)<br>ok: [localhost]<br><br>TASK [set_fact] ***********************************************************************************************************************************<br>Sunday 05 January 2025  14:37:03 +0200 (0:00:30.031)       0:00:30.039 ******** <br>ok: [localhost]<br><br>TASK [debug] **************************************************************************************************************************************<br>Sunday 05 January 2025  14:37:03 +0200 (0:00:00.035)       0:00:30.075 ******** <br>ok: [localhost] =&gt; {<br>    &quot;duration&quot;: &quot;30.005181&quot;<br>}<br><br>PLAY RECAP ****************************************************************************************************************************************<br>localhost                  : ok=3    changed=0    unreachable=0    failed=0    skipped=0    
rescued=0    ignored=0   <br><br>Sunday 05 January 2025  14:37:03 +0200 (0:00:00.013)       0:00:30.088 ******** <br>=============================================================================== <br>Do something long ------------------------------------------------------------------------------------------------------------------------- 30.03s<br>set_fact ----------------------------------------------------------------------------------------------------------------------------------- 0.04s<br>debug -------------------------------------------------------------------------------------------------------------------------------------- 0.01s</pre><p>And then, based on this duration variable, you can decide which tasks/roles to run.</p><p>P.S. Another option might be <a href="https://stackoverflow.com/a/78101279">this approach</a> from SO, where timestamps are registered before and after the executed command(s).</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=4f6bbe0a8523" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[How to adapt Grafana dashboard to changed metrics]]></title>
            <link>https://nklya.medium.com/how-to-adapt-grafana-dashboard-to-renamed-metrics-7e72500b3741?source=rss-4e31e7e49895------2</link>
            <guid isPermaLink="false">https://medium.com/p/7e72500b3741</guid>
            <category><![CDATA[prometheus]]></category>
            <category><![CDATA[grafana]]></category>
            <dc:creator><![CDATA[Nicolai Antiferov]]></dc:creator>
            <pubDate>Fri, 03 Jan 2025 06:36:20 GMT</pubDate>
            <atom:updated>2025-01-05T17:16:20.985Z</atom:updated>
            <content:encoded><![CDATA[<p>From time to time it happens that either a project you’re using changes its metric names or your own services do, and after that you face a dilemma: how to visualize the new metrics while keeping the possibility to check historical data as well. Examples: metric renames in <a href="https://karpenter.sh/v1.0/upgrading/v1-migration/#updated-metrics">karpenter</a>, <a href="https://github.com/oliver006/redis_exporter/pull/256">redis_exporter</a>, <a href="https://github.com/prometheus/node_exporter/blob/master/docs/V0_16_UPGRADE_GUIDE.md">node_exporter</a>.</p><p>One option: create a Dashboard v2, keep the old metrics in the old one, and use both. But you need to remember it exists, announce the change properly team- or company-wide and, worst of all, it most probably won’t be the last change, so eventually you’ll have to create Dashboard v3, v4 and so on, which is not sustainable.</p><p>Another: duplicate the existing panel in the dashboard and change its query to the new one. It’s possible to group such panels in another row or similar for better visibility, but this has the same issue: what happens with the next rename, more duplicates/rows?</p><p>A third option: set up recording rules and hide the metrics change, but that’s not great either (it creates more metrics) and I think it only makes sense if you don’t own the Grafana dashboard and cannot change it.</p><p>The best solution from my perspective is simply to use multiple queries in panels. 
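Roughly speaking, the panel then carries two targets instead of one; here is a YAML-style sketch of the relevant part of Grafana’s panel JSON (metric and legend names are made-up examples):</p><pre># Sketch: one panel, two queries covering a metric rename<br>targets:<br>  - expr: sum(rate(myapp_requests_total[5m])) # old metric name<br>    legendFormat: &quot;requests (legacy)&quot;<br>  - expr: sum(rate(myapp_http_requests_total[5m])) # new metric name<br>    legendFormat: &quot;requests&quot;</pre><p>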
This way you can not only visualize all the renamed metrics in one panel, but also mark them with different legends, so it’s visible on the graph whether you’re looking at the old or the new version of the metric.</p><p><strong>Example</strong>: Imagine you’re <a href="https://medium.com/@nklya/karpenter-v1-upgrade-nuances-15a52642e9d1">updating Karpenter to v1</a> and the NodePool usage metric changed its name from karpenter_nodepool_usage to karpenter_nodepools_usage.</p><p>Before the update, you had a panel which showed how many CPU cores are used: sum(karpenter_nodepool_usage{resource_type="cpu"}) by (nodepool, resource_type)</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*hpzSacd1OS0MrG8ZwqFqWg.png" /><figcaption>One query</figcaption></figure><p>Now you need to clone query A with the “Duplicate query” button (second from the right) and modify the query according to the new naming schema. It’s also worth updating the legend of the old query, so it’s visible on the graph which version you’re seeing.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*4xZIAQ39zuJHT314A3yGMw.png" /><figcaption>Two queries, old renamed to legacy</figcaption></figure><p><strong>NOTE</strong>: If only the <strong>metric name</strong> changed, but not the <strong>labels</strong>, you can also use OR and update <strong>only the PromQL query</strong>, like: karpenter_nodepools_usage or karpenter_nodepool_usage.</p><p>That was pretty easy, but what to do if your variable selector depends on a metric which was renamed?</p><p>Usually Grafana uses the label_values query type and the variable definition looks like this: nodepool: label_values(karpenter_nodepool_usage,nodepool)</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/950/1*owkaOaSAMMxLrspEja9jvg.png" /></figure><p>In order to adapt this variable to support both metric names, you need to slightly modify the query to: nodepool: label_values({__name__=~"karpenter_nodepool_usage|karpenter_nodepools_usage"},nodepool)</p><figure><img alt="" 
src="https://cdn-images-1.medium.com/max/1024/1*AzF7PINnl3wbP93zwLsghQ.png" /></figure><p>This way either of these metrics (old or new) will be used as the source for the nodepool variable.</p><p><strong>NOTE</strong>: Please keep in mind that this requires the label nodepool to exist in both metrics, so it won’t help in all cases.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=7e72500b3741" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Karpenter v1 upgrade gotchas]]></title>
            <link>https://awstip.com/karpenter-v1-upgrade-nuances-15a52642e9d1?source=rss-4e31e7e49895------2</link>
            <guid isPermaLink="false">https://medium.com/p/15a52642e9d1</guid>
            <category><![CDATA[karpenter]]></category>
            <category><![CDATA[kubernetes]]></category>
            <category><![CDATA[aws]]></category>
            <dc:creator><![CDATA[Nicolai Antiferov]]></dc:creator>
            <pubDate>Thu, 02 Jan 2025 18:47:28 GMT</pubDate>
            <atom:updated>2025-02-16T17:47:39.991Z</atom:updated>
            <content:encoded><![CDATA[<p>In August 2024 Karpenter 1.0 was <a href="https://aws.amazon.com/blogs/containers/announcing-karpenter-1-0/">released</a>, marking an important milestone in the project’s development.</p><p>This article <strong><em>is not</em></strong> a step-by-step upgrade guide (one already <a href="https://karpenter.sh/v1.0/upgrading/v1-migration/">exists</a>), but rather details about nuances of this process that might not be obvious, based on my experience; I think it may be useful if you’re planning the update.</p><h4>Drift</h4><p>Karpenter v1 brings quite a lot of significant changes. One of them is <strong>Drift disruption</strong>: previously behind a feature gate, it is now enabled by default, which means that Karpenter will reprovision your nodes with every new AMI release. It’s now possible to disable it at the NodePool level, but that should be done <strong>before</strong> the update to v1.</p><p>In order to disable drift, you need nodes: &quot;0&quot; for reasons: Drifted in the <strong>v1</strong> budget definition, for example:</p><pre>  disruption:<br>    consolidationPolicy: WhenEmptyOrUnderutilized<br>    budgets:<br>      - nodes: &quot;10%&quot;<br>        reasons: <br>        - &quot;Empty&quot;<br>        - &quot;Underutilized&quot;<br>      - nodes: &quot;0&quot; # disable Drift disruption completely<br>        reasons: <br>          - &quot;Drifted&quot;</pre><h4>Namespace</h4><p>If you have used Karpenter since the early versions, you might be running it in the karpenter namespace. With the update to v1, conversion webhooks are required, and if you install only the karpenter Helm chart, the karpenter-crd chart will be installed as a subchart, which will cause errors, because karpenter will look for the service in the kube-system namespace instead, <a href="https://github.com/aws/karpenter-provider-aws/issues/6818#issuecomment-2451081638">issue</a>.</p><p>To fix this, there are 2 options:</p><ul><li>migrate the Karpenter installation to the kube-system namespace. 
This is now the recommended namespace, and the migration can be done without any interruption to workloads.</li><li>install the karpenter-crd Helm chart separately, as described in the aforementioned <a href="https://github.com/aws/karpenter-provider-aws/issues/6818">issue</a>.</li></ul><h4>Update order</h4><p><strong>NOTE</strong>: You can only update from <strong>0.33–0.37</strong> to <strong>1.0</strong>, i.e. v1beta1 -&gt; v1 .</p><p>1. Before the upgrade, ensure you’re running the latest Karpenter 0.33–0.37 version, which provides conversion webhooks and supports both the v1 and v1beta1 APIs. I’ve done this from the latest <a href="https://github.com/aws/karpenter-provider-aws/releases/tag/v0.37.6">v0.37.6</a>.</p><p>2. With that, you should be able to get manifests converted to v1 from the cluster by running kubectl get nodepool/&lt;your-nodepool&gt; -oyaml.</p><p>3. Update the terraform-aws-eks module (if you’re using it) at least to <a href="https://github.com/terraform-aws-modules/terraform-aws-eks/releases/tag/v20.24.1">v20.24.1</a> <strong>before the upgrade</strong>. It includes support for the v1 IAM policy, which differs from 0.33–0.37.</p><p>4. I hope you’re using GitOps, so next prepare a PR with the Karpenter version updated to v1 and the EC2NodeClass/NodePool manifests updated to v1.</p><p>To address the drift disruption issue, you need to ensure that the budget has the section mentioned before:</p><pre>  disruption:<br>    budgets:<br>      - nodes: &quot;0&quot; # disable Drift disruption completely<br>        reasons: <br>          - &quot;Drifted&quot;</pre><p>5. Set enable_v1_permissions = true for the karpenter <a href="https://github.com/terraform-aws-modules/terraform-aws-eks/tree/master/modules/karpenter">terraform module</a> and apply it.</p><p>6. 
Merge the PR with the Karpenter and manifest updates; GitOps should provision everything, and if you disabled drift properly, there will be no disruptions to workloads.</p><h4>Metrics</h4><p>Quite a lot of metrics changed their names (<a href="https://karpenter.sh/v1.0/upgrading/v1-migration/#updated-metrics">details</a>), mostly related to nodes, nodepools and disruptions, like karpenter_nodepool_usage → karpenter_nodepools_usage.</p><p>You need to check the Karpenter <strong>dashboard</strong> and <strong>alert queries</strong> and update them accordingly (this <a href="https://nklya.medium.com/how-to-adapt-grafana-dashboard-to-renamed-metrics-7e72500b3741">article</a> could help). Otherwise, you might miss hitting NodePool limits or issues with node provisioning.</p><p><strong>NOTE</strong>: If you have an alert on the karpenter_cloudprovider_errors_total metric, you need to exclude the NodeClaimNotFoundError error, like {error!="NodeClaimNotFoundError"} , <a href="https://karpenter.sh/v1.0/upgrading/v1-migration/#updated-metrics">details</a>.</p><h4>Resources</h4><p>As a side note, please keep in mind that Karpenter v1 consumes more resources, in my experience.</p><p>If you’re solving the chicken-and-egg problem by running Karpenter on Fargate, you need to ensure you’re providing enough cpu/memory resources. Check it with kubectl describe &lt;karpenter-pod&gt;|grep Capacity.</p><p>Otherwise, it can leave Karpenter unable to perform consolidation and cause metrics fetch timeouts (gaps on graphs).</p><p>Check the rate of the <a href="https://karpenter.sh/v1.0/reference/metrics/#karpenter_voluntary_disruption_consolidation_timeouts_total"><em>karpenter_voluntary_disruption_consolidation_timeouts_total</em></a> metric in order to detect consolidation timeouts.</p><h4>V1.1 upgrade</h4><p>Karpenter 1.1.0 drops support for the v1beta1 APIs. 
Seemingly because of this, if you’re using Flux for GitOps, you might get an error during reconciliation after the update to v1.1, which looks like: timeout waiting for: [EC2NodeClass/name status: &#39;NotFound&#39;, &lt;list of all other EC2NodeClass here too&gt;] .</p><p>To fix that, you need to restart the Flux components.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=15a52642e9d1" width="1" height="1" alt=""><hr><p><a href="https://awstip.com/karpenter-v1-upgrade-nuances-15a52642e9d1">Karpenter v1 upgrade gotchas</a> was originally published in <a href="https://awstip.com">AWS Tip</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Summary “Five Steps to Make Your Go Code Faster & More Efficient” FOSDEM 04.02.2023]]></title>
            <link>https://nklya.medium.com/summary-five-steps-to-make-your-go-code-faster-more-efficient-fosdem-04-02-2023-cc6fc28c3b11?source=rss-4e31e7e49895------2</link>
            <guid isPermaLink="false">https://medium.com/p/cc6fc28c3b11</guid>
            <dc:creator><![CDATA[Nicolai Antiferov]]></dc:creator>
            <pubDate>Sat, 04 Feb 2023 15:57:33 GMT</pubDate>
            <atom:updated>2023-02-04T16:10:08.112Z</atom:updated>
            <content:encoded><![CDATA[<h2>Summary “Five Steps to Make Your Go Code Faster &amp; More Efficient” FOSDEM 04.02.2023 by Bartek Plotka</h2><p>A summary of the ‘<a href="https://bwplotka.dev/2022/efficient-go-release/">Efficient Go</a>’ book; a story from the Thanos project inspired it.</p><p>At first the Thanos compactor was implemented without much optimization, and with increased usage in different companies issues started arising: memory overuse, OOMs, etc.</p><p>Possible solutions:</p><ul><li>Add an option to use less memory. But Go is not Java; it’s already quite efficient</li><li>Vertical scale-up. A waste of money 💰</li><li>Horizontal scale-out. The complexity of the service became huge</li><li>Use other solutions, like Cortex, Mimir, etc</li><li>Switch to a vendor solution</li></ul><p>Meanwhile the code remains inefficient, and that is the main issue. Finally, the compactor code was reviewed and the algorithm optimized.</p><p>In the past, in lots of cases code was over-optimized from the beginning. Now, with all the elasticity of clouds and orchestrators, optimization often happens later.</p><h3>Five pragmatic steps towards more efficient Go programs</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*kE-1W1S7xT4xTCl_jO6XBw@2x.jpeg" /></figure><ol><li>Use TFBO: test/fix/benchmark/optimize, which is TDD wrapped with BDO.</li><li>Understand the current efficiency level with micro-benchmarks, using the built-in Go benchmark framework. It has different options and results should be reproducible. Use the <a href="https://pkg.go.dev/golang.org/x/perf/cmd/benchstat"><strong>benchstat</strong></a> tool to get human-readable results from benchmarks.</li><li>Understand your efficiency requirements, so as not to start with premature optimization. RAER (resource-aware efficiency requirements): try to roughly estimate the requirements.</li><li>Focus on the hot path, using profiling: add a couple of options to the same benchmark command line for cpu and memory. 
And then check in flamegraphs where cpu time is spent.</li><li>Optimize and repeat.</li></ol><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*a_769L95W1oIh-9F5ROcHA@2x.jpeg" /></figure><p>Link to the talk page, where the video and slides will be added: https://fosdem.org/2023/schedule/event/gofivestepsefficient/</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=cc6fc28c3b11" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Summary “Squeezing a go function” FOSDEM 04.02.2023 by Jesús Espino, Mattermost]]></title>
            <link>https://nklya.medium.com/tldr-squeezing-a-go-function-fosdem-04-02-2023-def76a5f3d70?source=rss-4e31e7e49895------2</link>
            <guid isPermaLink="false">https://medium.com/p/def76a5f3d70</guid>
            <dc:creator><![CDATA[Nicolai Antiferov]]></dc:creator>
            <pubDate>Sat, 04 Feb 2023 13:42:47 GMT</pubDate>
            <atom:updated>2023-02-04T16:56:58.008Z</atom:updated>
            <content:encoded><![CDATA[<p>Optimize what you need, when you need it. Don’t over-optimize.</p><p>Don’t guess. Measure everything and optimize based on data.</p><h3>Benchmarks</h3><p>go benchmark is built into Go, like tests:</p><p>“go test -bench .”</p><p>It’s possible to report allocations from the benchmark context.</p><h3>Profiling</h3><p>Usually you profile first and then benchmark the parts.</p><p>But in these examples we first benchmark with output and then profile on that output.</p><h3>Reducing cpu usage</h3><p>Return earlier, for example.</p><h3>Reduce allocations</h3><p>A pre-sized slice is about the same speed, but has fewer allocations and less memory usage. An array is even faster.</p><h3>Packing</h3><p>Many separate variables are less efficient, in terms of memory used, than the same values packed into a struct.</p><h3>Function inlining</h3><p>Inlined functions are faster.</p><h3>Escape analysis</h3><p>Passing by value copies the value on the stack, and there are no heap allocations.</p><h3>Escape analysis and function inlining</h3><p>Combining both, you achieve fewer allocations and less time spent.</p><h3>Concurrency</h3><p>Goroutines are lightweight, but not free.</p><p>If you run more of them than there are cpu cores and the workload is heavy, overall performance will suffer and generate a lot of allocations and memory usage.</p><h3>References</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*ByVMaMgA-9xMS8Uh7TgUgQ@2x.jpeg" /></figure><p>Link to the page where the video and slides will be added: https://fosdem.org/2023/schedule/event/gosqueezingfunction/</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=def76a5f3d70" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Summary “Recipes for reducing cognitive load” FOSDEM 04.02.2023 by Federico Paolinelli]]></title>
            <link>https://nklya.medium.com/tldr-recipes-for-reducing-cognitive-load-fosdem-04-02-2023-1d84928151ca?source=rss-4e31e7e49895------2</link>
            <guid isPermaLink="false">https://medium.com/p/1d84928151ca</guid>
            <dc:creator><![CDATA[Nicolai Antiferov]]></dc:creator>
            <pubDate>Sat, 04 Feb 2023 11:42:44 GMT</pubDate>
            <atom:updated>2023-02-04T16:58:30.304Z</atom:updated>
            <content:encoded><![CDATA[<blockquote>Based on the experience of reviewing PRs in the <a href="https://github.com/metallb/metallb">metallb</a> project.</blockquote><p>The less cognitive load, the less energy you spend.</p><p>The simpler the solution, the lower the load.</p><h3><strong>Line of sight</strong></h3><p>One line of happy path, and the next indent for exceptions. “Align to the left”</p><p>There are different ways to achieve this, like returning earlier, wrapping into functions, etc.</p><p>https://medium.com/@matryer/line-of-sight-in-code-186dd7cdea88</p><h3>Package size and name</h3><p>Utils.copyNode is not as good as Node.copy</p><p>Package names — The Go Programming Language</p><h3>Errors handling</h3><p>Working with Errors in Go 1.13 — The Go Programming Language</p><h3>Pure functions</h3><ul><li>Easier to test</li><li>No side effects</li></ul><h3>Environment variables</h3><p>In the modern container world it’s very easy to use environment variables.</p><p>But it’s hard to track all the parameters read during execution.</p><p>That’s why env variables should be read once in main and propagated to the other parts.</p><h3>Booleans</h3><p>When a function takes multiple booleans as input, you might end up with a call like dosmth(true, false, true, true). It’s impossible to understand what is what without the function definition.</p><p>Use constants or a structure to pass them to the function for a clear understanding of the parameters.</p><h3>Function overload</h3><p>Not supported in Golang. 
But functional options come to the rescue.</p><h3>Methods should be functions if possible</h3><p>Easier to test, easier to understand, easier to write and reuse.</p><h3>Pointers</h3><p>If a function doesn’t change the object, pass it by value; otherwise use a pointer.</p><p>It can be more expensive to copy, but it makes the code simpler and less prone to unclear behavior.</p><h3>Structure</h3><p>Code should read like a newspaper.</p><ul><li>Split the package into files</li><li>Put definitions at the top of the file</li></ul><p>https://github.com/golang/go/wiki/CodeReviewComments</p><h3>Asynchronous functions</h3><p>Move business logic into synchronous functions and then call them in goroutines.</p><p>It’s easier to test synchronous functions.</p><h3>Functions that lie</h3><p>If a function is called, for example, clearnode(), but then doesn’t clean the node in some cases, it will mislead others.</p><h3>Conclusion</h3><p>Pareto principle: around 80% of the outcome is achieved by 20% of the effort.</p><p>Simplicity is complicated, but clarity is worth it.</p><p>Link to the talk page, where slides and video will be added later: <a href="https://fosdem.org/2023/schedule/event/goreducecognitive/">https://fosdem.org/2023/schedule/event/goreducecognitive/</a></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=1d84928151ca" width="1" height="1" alt="">]]></content:encoded>
        </item>
    </channel>
</rss>