<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[Stories by Nicolai Antiferov on Medium]]></title>
        <description><![CDATA[Stories by Nicolai Antiferov on Medium]]></description>
        <link>https://medium.com/@nklya?source=rss-4e31e7e49895------2</link>
        <image>
            <url>https://cdn-images-1.medium.com/fit/c/150/150/1*ayMyDWexePBEwGqsOr9Tzg@2x.jpeg</url>
            <title>Stories by Nicolai Antiferov on Medium</title>
            <link>https://medium.com/@nklya?source=rss-4e31e7e49895------2</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Thu, 07 May 2026 19:01:44 GMT</lastBuildDate>
        <atom:link href="https://medium.com/@nklya/feed" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
        <atom:link href="http://medium.superfeedr.com" rel="hub"/>
        <item>
            <title><![CDATA[Redis ACL enforcement without downtime]]></title>
            <link>https://nklya.medium.com/redis-acl-enforcement-without-downtime-1f2343c3ed6c?source=rss-4e31e7e49895------2</link>
            <guid isPermaLink="false">https://medium.com/p/1f2343c3ed6c</guid>
            <category><![CDATA[redis]]></category>
            <category><![CDATA[acl]]></category>
            <category><![CDATA[security]]></category>
            <dc:creator><![CDATA[Nicolai Antiferov]]></dc:creator>
            <pubDate>Sun, 05 Apr 2026 16:00:02 GMT</pubDate>
            <atom:updated>2026-04-05T16:00:02.149Z</atom:updated>
<content:encoded><![CDATA[<h4><strong>Disclaimer</strong>: this article is not about ACL in general; Redis has very good <a href="https://redis.io/docs/latest/operate/oss_and_stack/management/security/acl/">documentation</a> on that. Rather, it summarises my experience with executing an ACL migration seamlessly, without downtime or other interruptions to your services.</h4><p>Redis ACL was first introduced in v6.x and extended in v7.x with the ability to limit subcommands. It replaces the previous password protection with a flexible ACL, where you can define multiple users with different permissions and rotate their credentials easily.</p><p>After an upgrade from older versions, Redis automatically adds a default user ACL to the configuration file. It specifies that the user default has no password and that all commands are allowed on all keys and Pub/Sub channels, for example:</p><pre>user default on nopass sanitize-payload ~* &amp;* +@all</pre><p>It’s possible to define users in the config file one by one OR to use the aclfile directive, which is more flexible, as it allows updating the <strong>whole</strong> ACL with a single command like <a href="https://redis.io/commands/acl-load">ACL LOAD</a> / <a href="https://redis.io/commands/acl-save">ACL SAVE</a>, instead of multiple ACL SETUSER commands followed by CONFIG REWRITE. The downside is that redis-server requires a restart to switch to an aclfile, which can take a lot of time on big installations.</p><h4>Preparations</h4><p>I recommend starting by defining which users are required and their permissions. 
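A quick note on the #passwordhash placeholders used below: open source Redis accepts a password in an ACL rule either as plaintext (prefixed with &gt;) or as an unsalted SHA-256 hex digest (prefixed with #). A minimal sketch for producing the hashed form:

```python
import hashlib

def acl_password_hash(password: str) -> str:
    """Build the '#<sha256hex>' password form for a Redis ACL SETUSER rule.

    Open source Redis supports only a plain, unsalted SHA-256 hex digest here,
    so this is simply the hex digest of the password with a '#' prefix.
    """
    return "#" + hashlib.sha256(password.encode()).hexdigest()

# e.g. paste the result into: ACL SETUSER admin on <hash> ~* &* +@all
rule_fragment = acl_password_hash("s3cr3t-password")
```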
Example:</p><ul><li><strong>admin</strong> user(s) (full access to manage/debug): ACL SETUSER admin on #passwordhash ~* &amp;* +@all</li><li><strong>replica</strong> user (sync from master): ACL SETUSER replica on #passwordhash +psync +replconf +ping</li><li><strong>sentinel</strong> user (for a master/replica setup with sentinel): ACL SETUSER sentinel on #passwordhash allchannels +multi +slaveof +ping +exec +subscribe +config|rewrite +role +publish +info +client|setname +client|kill +script|kill</li><li><strong>exporter</strong> user (for <a href="https://github.com/oliver006/redis_exporter">redis_exporter</a>): ACL SETUSER exporter on #passwordhash +@connection +memory -readonly +strlen +config|get +xinfo +pfcount -quit +zcard +type +xlen -readwrite -command +client -wait +scard +llen +hlen +get +eval +slowlog +cluster|info +cluster|slots +cluster|nodes -hello -echo +info +latency +scan -reset -auth -asking</li><li><strong>application</strong> user(s): the ACL depends on which commands your service uses. You might also just limit the default user’s permissions and leave it without a password if you only want to prohibit dangerous commands like flushall, for example: ACL SETUSER app on #passwordhash ~* &amp;* +@all -@dangerous</li></ul><p>Metrics from <a href="https://github.com/oliver006/redis_exporter">redis_exporter</a> are a great help in understanding which commands are currently used. 
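Those counters can be turned into a first candidate allow-list with a small sketch (the command names and counts below are made up for illustration):

```python
def observed_commands(cmd_counts, min_calls=1):
    """Return commands actually seen in redis_exporter's redis_commands_total
    counters; a starting point for drafting an application user's allow-list."""
    return sorted(cmd for cmd, count in cmd_counts.items() if count >= min_calls)

# hypothetical counter values scraped from the exporter
counts = {"get": 5200, "set": 1300, "ping": 87, "flushall": 0}
allowlist = observed_commands(counts)  # flushall drops out: never called
```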
Check the results of the promQL query sum(rate(redis_commands_total[5m])) by (cmd) for details.</p><p>The list of all commands is available in the <a href="https://redis.io/docs/latest/commands/">documentation</a>, and each command lists its <strong>ACL categories</strong> at the beginning, which simplifies ACL definitions by using category aliases instead of enumerating every allowed command, example:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*AJyXqWgID0hm8rdIUK3s6A.png" /></figure><h4>Migration</h4><p>With all preparations done, it’s time to start the migration from a single master/replica Redis or a cluster without ACL to an ACL-enforced setup.</p><ol><li>It requires at least one restart, when all nodes are first rolled out with ACL but the default user still has admin permissions.</li><li>Then you need to populate the <strong>replica</strong> user to all the nodes with the commands: config set masteruser replica + config set masterauth replica-pass.</li><li>Then other components of the setup, like the exporter, should be switched to separate users if they need special permissions or you plan to disable the default user.</li><li>For a master/replica setup, the <a href="https://redis.io/docs/latest/operate/oss_and_stack/management/sentinel/">sentinel user</a> should then be set by executing these commands on <strong>each sentinel node</strong> for <strong>each Redis</strong> it monitors:</li></ol><pre>sentinel auth-user &lt;master-name&gt; &lt;username&gt;<br>sentinel auth-pass &lt;master-name&gt; &lt;password&gt;</pre><p>5. 
Then you limit/disable the default user and update the ACL, either manually by running config set or with ACL LOAD if an aclfile is used.</p><p>Afterwards, all that&#39;s left is to monitor redis_exporter metrics for NOPERM/WRONGPASS errors with a promQL query like this: sum(rate(redis_errors_total[5m])) by (err)</p><h4>Important details</h4><ol><li>Switching between an aclfile and plain users in the config file requires a restart</li><li>ACL is not replicated between master/replicas and should be set on each node separately</li><li>Sentinel requires additional permissions to be able to manage single master/replica setups</li><li>The replica user requires permissions to sync from the master in all setups (single, cluster)</li><li>Prefer storing hashed passwords (starting with #) in config files rather than plaintext (starting with &gt;) for security reasons</li><li>Open source Redis supports only simple SHA-256 hashes without salt</li><li>At the moment all Redis v8.x releases crash on ACL LOAD with the Search module (Vector DB) enabled, <a href="https://github.com/RediSearch/RediSearch/issues/8342">issue</a></li><li>HashiCorp Vault’s Redis plugin doesn’t support anything but a single Redis, <a href="https://github.com/hashicorp/vault-plugin-database-redis/issues/46">issue</a></li></ol>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Ansible, macOS and “A worker was found in a dead state” fix]]></title>
            <link>https://nklya.medium.com/ansible-macos-and-a-worker-was-found-in-a-dead-state-fix-553a2d44e1f1?source=rss-4e31e7e49895------2</link>
            <guid isPermaLink="false">https://medium.com/p/553a2d44e1f1</guid>
            <category><![CDATA[ansible]]></category>
            <category><![CDATA[iac]]></category>
            <category><![CDATA[macos]]></category>
            <category><![CDATA[devcontainer]]></category>
            <dc:creator><![CDATA[Nicolai Antiferov]]></dc:creator>
            <pubDate>Sat, 04 Apr 2026 14:53:40 GMT</pubDate>
            <atom:updated>2026-04-04T14:57:02.927Z</atom:updated>
<content:encoded><![CDATA[<p>If you’re using macOS as an Ansible controller, you might have seen this issue before, when an ansible run failed with ERROR! A worker was found in a dead state in the middle of execution.</p><p>And one of these environment variables helped to fix it:</p><pre>export no_proxy=*<br>export OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES</pre><p>This issue affects not only Ansible, but other Python projects too. Details can be found in the issue <a href="https://github.com/ansible/ansible/issues/32554">https://github.com/ansible/ansible/issues/32554</a></p><p>But with the recent macOS update to Tahoe (v26) these workarounds unfortunately don’t work anymore, and you’ll receive the ERROR! A worker was found in a dead state error in almost 100% of runs.</p><p>There are a couple of possible fixes for this issue:</p><ul><li>Find <strong>another workaround</strong> to continue running Ansible on macOS</li><li>Switch the controller OS to Linux via <strong>Docker/Devcontainers</strong> or some other way</li></ul><p>For a long time after this issue appeared about half a year ago, the only option was <strong>Docker/Devcontainers</strong>. Recently however, there was an update in a <a href="https://www.reddit.com/r/ansible/comments/1nx7zdd/ansiblepython_fork_issue_reoccurring_since_macos/">Reddit thread</a>, which suggested using another environment variable, and I can confirm it works fine. So if you want to quickly fix the issue without changing anything, just export it and that’s it:</p><pre>export OS_ACTIVITY_MODE=disable</pre><p>However, switching to <strong>Devcontainers</strong> might still be beneficial, as the controller environment can be abstracted away and it creates an easily reproducible environment for your teammates. 
You don’t have to worry about installing Ansible and its dependencies locally and keeping them updated.</p><p><a href="https://containers.dev/">Devcontainers</a> allow you to use a container as a full-featured development environment. It can be used to run an application, to separate tools, libraries, or runtimes needed for working with a codebase, and to aid in continuous integration and testing. Dev containers can be run locally or remotely, in a private or public cloud, in a variety of <a href="https://containers.dev/supporting">supporting tools and editors</a>.</p><p>In this example I’ll focus on using devcontainers with VS Code, which has native support for them. The code can be found in the repo:</p><p><a href="https://github.com/Nklya/ansible-devcontainer-example">GitHub - Nklya/ansible-devcontainer-example: Example of usage Devcontainers with Ansible</a></p><p>There are multiple ways to start using Ansible with Devcontainers, from manually creating all the required folders/files to using the helper in the official Ansible extension (<strong>NOTE</strong>: it requires ansible-creator to be installed):</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*4AzKOzg-LowxGESAHGYmDg.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*_V_uMOqqlqUrJLsQyhx2Xw.png" /></figure><p>The Dev Containers extension will be installed automatically when the .devcontainer folder is created.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*q6zdkxMh8-afaxOGpXWuWA.png" /></figure><p>After that VS Code will automatically suggest switching to one of the Dev containers, but you can also do it yourself by clicking the bottom-left button &gt;&lt; and selecting Reopen in Container:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*uzGv1Krarv8V83Z424MV3g.png" /></figure><p>The first open will take some time due to the container build, but after that it will be fast. 
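For orientation, a minimal .devcontainer/devcontainer.json for such a setup might look like the sketch below. Treat the image tag and the extension ID as assumptions (redhat.ansible is the usual ID of the official Ansible extension, and the image matches the devcontainers/python base used elsewhere in this article):

```json
{
  "name": "ansible",
  "image": "mcr.microsoft.com/devcontainers/python:3.13",
  "customizations": {
    "vscode": {
      "extensions": ["redhat.ansible"]
    }
  }
}
```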
To switch back, click the same bottom-left button &gt;&lt; and select “Reopen folder locally”.</p><p>Git access is mirrored from the local machine, so it’s possible to contribute from inside the container, same as usual. In extensions: [] you can add the extensions you usually use.</p><p>After you change something in .devcontainer/devcontainer.json , a pop-up will be shown asking to rebuild it:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/920/1*26ujvHuUopsaBun728hlpg.png" /></figure><p>The aforementioned solution works fine until you need customisations, like specific versions of Ansible, additional tools installed, etc. It’s definitely possible to add them there, but I think it’s easier to manually create a Dev container with all the customisations you need, based on the devcontainers/python image. The full configuration file specification can be found <a href="https://containers.dev/implementors/json_reference/">here</a>.</p><p>Basically, all we need is to create a Dockerfile and point to it in devcontainer.json like:</p><pre>  &quot;build&quot;: {<br>    &quot;dockerfile&quot;: &quot;Dockerfile&quot;<br>  },</pre><p>And define in the Dockerfile all the stuff you want to install/customize, for example:</p><pre>FROM mcr.microsoft.com/devcontainers/python:3.13<br><br># Install Ansible 13 packages<br>RUN pip install --no-cache-dir ansible==13.* ansible-compat==26.* ansible-lint==26.*</pre><p>A full example can be found in the repo <a href="https://github.com/Nklya/ansible-devcontainer-example">https://github.com/Nklya/ansible-devcontainer-example</a></p><h4>Additional things about Devcontainers</h4><p><strong>AWS access. 
</strong>If AWS credentials are required inside the container, you need to mount them:</p><pre>&quot;mounts&quot;: [<br>     &quot;source=${localEnv:HOME}/.aws,target=/home/vscode/.aws,type=bind,readonly&quot;<br> ]</pre><p>In case of <strong>AWS SSO usage</strong>, access should be read-write, as temporary credentials are generated:</p><pre>&quot;mounts&quot;: [<br>     &quot;source=${localEnv:HOME}/.aws,target=/home/vscode/.aws,type=bind&quot;<br> ]</pre><p><strong>User id mismatch.</strong> No matter which approach you use, the user id inside the container and on the local machine won’t match. This might create issues if you’re using Ansible with SSH access and the username on remote hosts matches your local one, so that you don’t override ansible_user in your Ansible configuration.</p><p>To fix this, it’s possible to set an override in the Dockerfile in /etc/ssh/ssh_config.d/custom.conf with User taken from ${localEnv:USER} and passed via args.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[AWS quotas and their usage monitoring]]></title>
            <link>https://awstip.com/aws-quotas-and-their-usage-monitoring-3b4fa2622b51?source=rss-4e31e7e49895------2</link>
            <guid isPermaLink="false">https://medium.com/p/3b4fa2622b51</guid>
            <category><![CDATA[golang]]></category>
            <category><![CDATA[monitoring]]></category>
            <category><![CDATA[aws]]></category>
            <category><![CDATA[prometheus]]></category>
            <category><![CDATA[exporters]]></category>
            <dc:creator><![CDATA[Nicolai Antiferov]]></dc:creator>
            <pubDate>Wed, 22 Oct 2025 20:48:48 GMT</pubDate>
            <atom:updated>2026-02-09T10:19:07.258Z</atom:updated>
<content:encoded><![CDATA[<p>When you’re actively using AWS services, quite soon you will find that they have <a href="https://docs.aws.amazon.com/general/latest/gr/aws_service_limits.html">quotas</a>: limits on how many EC2 instances or RDS databases you can create, rate limits for different operations, etc.</p><p>Some of them can be changed, some cannot. Often the default limits for quotas are quite low, for example only 5 CPU cores are allowed for the <em>“Running On-Demand Standard (A, C, D, H, I, M, R, T, Z) instances (L-1216C47A)”</em> quota.</p><p>And it’s quite frustrating to realise that a quota is exhausted when something breaks out of nowhere, and it is especially bad if there’s an ongoing incident and you cannot solve it without increasing quotas. With support, you can at least escalate a quota increase, but without it you’re stuck with a quota increase request, which may take days to apply, depending on your luck.</p><p>That’s why it’s important to keep an eye on quota values and utilisation for the services you use and to request increases proactively as usage grows.</p><p>While IaC with tools like Terraform (the <a href="https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/servicequotas_service_quota.html">aws_servicequotas_service_quota</a> resource) can help with changing quotas and tracking their history (if you don’t want to rely only on the AWS console for it), some monitoring solution is required to monitor quota usage.</p><p><em>Disclaimer</em>: it’s worth mentioning that you might find this article not very useful if you enjoy <strong>CloudWatch</strong> as a monitoring/alerting solution. 
That’s basically the way AWS recommends to monitor quotas (<a href="https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch-Quotas-Visualize-Alarms.html">docs</a>), and alerts are embedded nicely into the quotas UI, example:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*sQVMwjWW3EDg6Y9X-eXRkA.png" /></figure><p>But this also means that you’ll have to configure CloudWatch alerts for each quota you want to monitor, in every account and region you run your workloads in. That could be quite tedious even with IaC if you have multiple accounts and use a lot of services in different regions.</p><p>But I believe there’s a better way to monitor quotas: <a href="https://prometheus.io/"><strong>Prometheus</strong></a> + an open source <strong>exporter</strong>, which allows you to <strong>collect usage for all available quotas</strong>, visualise it nicely with <a href="https://grafana.com/">Grafana</a>, and create <strong>universal alerts</strong> thanks to promQL and Alertmanager. <strong>NOTE</strong>: Feel free to scroll down to skip the story and check how to set up quota monitoring with <em>aws_quota_exporter</em>.</p><p>When I started researching this problem, my first thought was to use <a href="https://github.com/prometheus/cloudwatch_exporter">cloudwatch_exporter</a> or <a href="https://github.com/prometheus-community/yet-another-cloudwatch-exporter">yet-another-cloudwatch-exporter</a>, which are well known projects for collecting metrics from CloudWatch into Prometheus. 
But unfortunately quota usage is not a CloudWatch metric per se: it requires computation using “metric math”, which is implemented in neither of those exporters, <a href="https://github.com/prometheus/cloudwatch_exporter/issues/406">issue</a>.</p><p>Then I checked quota exporter projects on <a href="https://github.com/search?q=quota%20exporter&amp;type=repositories">GitHub</a>, just to find dozens of them in different stages of maintenance, each working differently. It seems this variety comes from AWS itself, whose approaches to monitoring quotas evolved over time:</p><ul><li><a href="https://github.com/thought-machine/aws-service-quotas-exporter">thought-machine/aws-service-quotas-exporter</a> — collects quota values from the Quotas API, but usage directly from service APIs. For example, to get the number of running EC2 instances, it lists running instances and counts them. As a result, not that many quotas are reported, 8 in total. Last update in 2023.</li><li><a href="https://github.com/brennerm/aws-quota-checker">brennerm/aws-quota-checker</a> — similar in terms of how usage is calculated (plain API calls to count items), but written in Python, with more quotas reported. Also works as a CLI. Last update in 2022.</li><li><a href="https://github.com/danielfm/aws-limits-exporter">danielfm/aws-limits-exporter</a> — this one depends on the<a href="https://aws.amazon.com/premiumsupport/technology/trusted-advisor/"> AWS Trusted Advisor API</a>, which requires a <a href="https://aws.amazon.com/premiumsupport/plans/">Business or Enterprise support plan</a> to monitor quota usage, and from my observation Trusted Advisor is <a href="https://docs.aws.amazon.com/awssupport/latest/user/service-limits.html">missing some services</a> I wanted to monitor, like SageMaker. 
Last update in 2023.</li><li><a href="https://github.com/lablabs/aws-service-quotas-exporter">lablabs/aws-service-quotas-exporter</a> — uses a bash wrapper around aws-cli to collect quota usage.</li><li><a href="https://github.com/emylincon/aws_quota_exporter">emylincon/aws_quota_exporter</a> — at the beginning of the year it was only able to collect quota values, without usage.</li></ul><p>After checking the aforementioned exporters, I decided to extend <a href="https://github.com/emylincon/aws_quota_exporter">emylincon/aws_quota_exporter</a> with functionality to collect quota usage. The project was relatively active, able to report quotas for any service in the Quotas API, supported caching, and the code was easy to understand and extend.</p><p>I thought it shouldn’t be that hard: just call the API for a specific quota to get its usage and report a new metric, since the AWS Web UI shows quota usage with history directly in the interface, example:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*FMMA13nMw2RD_NRMm56lYQ.png" /></figure><p>But as I soon found out, the UI makes calls to the CloudWatch API under the hood to collect usage, so the exporter should do the same.</p><p>Both the <a href="https://docs.aws.amazon.com/servicequotas/2019-06-24/apireference/API_ListServiceQuotas.html"><strong>ListServiceQuotas</strong></a> and <a href="https://docs.aws.amazon.com/servicequotas/2019-06-24/apireference/API_GetServiceQuota.html"><strong>GetServiceQuota</strong></a> API calls return only the quota itself and, if usage is available, a definition of the CloudWatch request in the UsageMetric field, but not the usage value itself. 
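To make the UsageMetric-to-CloudWatch step concrete, here is a rough sketch (not the exporter’s actual code) of translating that field into keyword arguments for a GetMetricStatistics call, e.g. boto3’s get_metric_statistics. The 15-minute default lookback and the wider window for laggy services like RDS follow the delays discussed below:

```python
from datetime import datetime, timedelta, timezone

def cw_request_params(usage_metric, lookback_minutes=15):
    """Build GetMetricStatistics parameters from a Service Quotas UsageMetric.

    A 15-minute lookback covers CloudWatch's usual reporting delay; RDS usage
    metrics arrive ~1 hour late, so they need roughly a 60+15 minute window.
    """
    now = datetime.now(timezone.utc)
    return {
        "Namespace": usage_metric["MetricNamespace"],
        "MetricName": usage_metric["MetricName"],
        # CloudWatch expects dimensions as a list of {"Name": ..., "Value": ...}
        "Dimensions": [
            {"Name": name, "Value": value}
            for name, value in sorted(usage_metric["MetricDimensions"].items())
        ],
        "StartTime": now - timedelta(minutes=lookback_minutes),
        "EndTime": now,
        "Period": 60,
        "Statistics": [usage_metric["MetricStatisticRecommendation"]],
    }
```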
You can check it yourself with aws-cli, example for the <strong>ec2</strong> quota:</p><pre>$ aws service-quotas get-service-quota --service-code ec2 --quota-code L-1216C47A <br><br>{<br>    &quot;Quota&quot;: {<br>        &quot;ServiceCode&quot;: &quot;ec2&quot;,<br>        &quot;ServiceName&quot;: &quot;Amazon Elastic Compute Cloud (Amazon EC2)&quot;,<br>        &quot;QuotaArn&quot;: &quot;arn:aws:servicequotas:eu-north-1:123456789:ec2/L-1216C47A&quot;,<br>        &quot;QuotaCode&quot;: &quot;L-1216C47A&quot;,<br>        &quot;QuotaName&quot;: &quot;Running On-Demand Standard (A, C, D, H, I, M, R, T, Z) instances&quot;,<br>        &quot;Value&quot;: 16.0,<br>        &quot;Unit&quot;: &quot;None&quot;,<br>        &quot;Adjustable&quot;: true,<br>        &quot;GlobalQuota&quot;: false,<br>        &quot;UsageMetric&quot;: {<br>            &quot;MetricNamespace&quot;: &quot;AWS/Usage&quot;,<br>            &quot;MetricName&quot;: &quot;ResourceCount&quot;,<br>            &quot;MetricDimensions&quot;: {<br>                &quot;Class&quot;: &quot;Standard/OnDemand&quot;,<br>                &quot;Resource&quot;: &quot;vCPU&quot;,<br>                &quot;Service&quot;: &quot;EC2&quot;,<br>                &quot;Type&quot;: &quot;Resource&quot;<br>            },<br>            &quot;MetricStatisticRecommendation&quot;: &quot;Maximum&quot;<br>        },<br>        &quot;QuotaAppliedAtLevel&quot;: &quot;ACCOUNT&quot;<br>    }<br>}</pre><p>So in order to collect quota usage, the code should:</p><ul><li>check if the quota has UsageMetric in the response</li><li>form a CloudWatch request based on it</li><li>execute the request to CloudWatch</li><li>get the latest value from the response</li><li>build a usage metric to report to Prometheus</li></ul><p>To retrieve CloudWatch metrics there are two API calls available: <a href="https://docs.aws.amazon.com/AmazonCloudWatch/latest/APIReference/API_GetMetricStatistics.html">GetMetricStatistics</a> and <a 
href="https://docs.aws.amazon.com/AmazonCloudWatch/latest/APIReference/API_GetMetricData.html">GetMetricData</a>. GetMetricData is more efficient and cheaper when requesting large batches of metrics, which is why it’s used in various exporters, like <a href="https://github.com/influxdata/telegraf/issues/5420">here</a>. However, in this case we <strong>need only the latest value</strong> to report as quota usage to Prometheus during a scrape. And the best part: GetMetricStatistics requests are free for up to 1 million API requests, <a href="https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/cloudwatch_billing.html">docs</a>.</p><p>So I decided to go with <em>GetMetricStatistics</em> as the more cost-effective option. The first issue I faced: CloudWatch returns an empty response for quota usage (i.e. Datapoints=[]) if there is none, but that can also be caused by a too-short time window in the request. With the initially set 5-minute window, some requests returned no data.</p><blockquote>If you send 5-minute metrics from CloudWatch, there can be ~5–15 minute delay in receiving your metrics. This is because CloudWatch makes your data available with a 5–10 minute delay. Additionally, CloudWatch API limitations can introduce another 5 minutes of delay.</blockquote><p>Then I found this ⬆️ and increased the window to 15 minutes. If you’re interested, check <a href="https://github.com/emylincon/aws_quota_exporter/pull/165">PR#165</a> with the initial implementation of quota usage collection.</p><p>Only to discover later that RDS usage metrics arrive with a ~1 hour delay, so the request window for them should be increased to 60+15 minutes (<a href="https://github.com/emylincon/aws_quota_exporter/pull/208">PR#208</a>).</p><p>Now, with this feature released, you can easily monitor quota usage with <a href="https://github.com/emylincon/aws_quota_exporter">aws_quota_exporter</a> and Prometheus.</p><p>I won’t describe how to run it, as that depends on your preferences. 
In general, the exporter should have at least these IAM permissions:</p><pre>{<br>    &quot;Version&quot;: &quot;2012-10-17&quot;,<br>    &quot;Statement&quot;: [<br>        {<br>            &quot;Effect&quot;: &quot;Allow&quot;,<br>            &quot;Action&quot;: [<br>                &quot;servicequotas:ListAWSDefaultServiceQuotas&quot;,<br>                &quot;servicequotas:ListServiceQuotas&quot;,<br>                &quot;cloudwatch:GetMetricStatistics&quot;<br>            ],<br>            &quot;Resource&quot;: &quot;*&quot;<br>        }<br>    ]<br>}</pre><p>Define a config file with all the <strong>services/accounts/regions</strong> you want to monitor, see the example below. Service names can be found in the AWS UI or with <em>aws-cli</em>: aws service-quotas list-services.</p><pre>---<br>jobs:<br>  - serviceCode: ec2<br>    accountName: account2<br>    regions:<br>      - eu-west-1<br>      - eu-west-2<br>      - eu-north-1<br>  - serviceCode: lambda<br>    accountName: account1 # optional, but if set will be in labels<br>    regions:<br>      - eu-west-1<br>      - eu-north-1<br>  - serviceCode: cloudformation<br>    accountName: account2<br>    regions:<br>      - eu-west-1<br>      - eu-north-1</pre><p>Set the -collect.usage flag in the exporter deployment in order to collect quota usage. I also recommend increasing the cache interval to lower the rate of API calls, especially for big deployments, with something like -cache.duration 15m (default 5m), and also setting -cache.serve-stale to avoid gaps on graphs when the cache updates.</p><p>After start, the exporter gathers quota values and usage for all jobs defined in the config and reports the same metrics on each scrape until the cache expires. The metric name is formed from the quota name, for example: aws_quota_lambda_concurrent_executions for the “Concurrent executions” quota of the lambda service. 
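The name derivation can be sketched roughly like this (the exporter’s real normalisation rules may differ; this merely reproduces the pattern of the example above):

```python
import re

def quota_metric_name(service_code, quota_name):
    """Sketch of the metric naming scheme: lower-case the quota name,
    collapse runs of non-alphanumerics into underscores, and prefix
    with aws_quota_<service_code>."""
    slug = re.sub(r"[^a-z0-9]+", "_", quota_name.lower()).strip("_")
    return f"aws_quota_{service_code}_{slug}"

name = quota_metric_name("lambda", "Concurrent executions")
```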
Metrics for the quota value will have a {type="quota"} label, and those for usage a {type="usage"} label.</p><p>This makes it easy to define <strong>universal alerts</strong> on the utilisation of any quota that reports usage, with a promQL query such as {job="quota-exporter", type="usage"} / ignoring (type) {job="quota-exporter", type="quota"}&gt;0.75 to alert when usage &gt; 75%.</p><p>You can also visualise this nicely with Grafana, example from the exporter <a href="https://github.com/emylincon/aws_quota_exporter/blob/main/img/grafana.png">repository</a>:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*LEKhTB5tJDhSDzzFxK3LPA.png" /></figure><p>One more question is left: how to monitor the usage of a quota that doesn’t report it in the API? In this case you might still be able to collect usage from other sources, for example via <a href="https://github.com/prometheus/cloudwatch_exporter">cloudwatch_exporter</a> (if available there) into Prometheus, and then use the <strong>quota value</strong> from <a href="https://github.com/emylincon/aws_quota_exporter">aws_quota_exporter</a> and the <strong>quota usage</strong> from <a href="https://github.com/prometheus/cloudwatch_exporter">cloudwatch_exporter</a> to build a promQL query for Alertmanager. This requires research, though, to find which of the metrics a service reports to CloudWatch can serve as usage.</p><p><strong>UPD</strong>: In October 2025 AWS released “<a href="https://aws.amazon.com/about-aws/whats-new/2025/10/automatic-quota-management-service-quotas/">Automatic quota management for AWS Service Quotas</a>”, providing <strong>usage monitoring</strong> for supported quotas; in the future it should be able to request quota increases automatically. 
Note that you have to configure it manually per account/region, and it doesn’t appear to be supported by IaC at the moment.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*wV4gPAIjMc7v-AtUUKQEgQ.png" /></figure><p>So in the end it’s up to you what to choose: plain CloudWatch alerts per quota, the new automatic quota management, or Prometheus + <a href="https://github.com/emylincon/aws_quota_exporter">aws_quota_exporter</a>.</p><hr><p><a href="https://awstip.com/aws-quotas-and-their-usage-monitoring-3b4fa2622b51">AWS quotas and their usage monitoring</a> was originally published in <a href="https://awstip.com">AWS Tip</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Ansible / How to (almost) transparently switch from ssh to ssm-agent]]></title>
            <link>https://nklya.medium.com/ansible-how-to-almost-transparently-switch-from-ssh-to-ssm-agent-c46ac04101f9?source=rss-4e31e7e49895------2</link>
            <guid isPermaLink="false">https://medium.com/p/c46ac04101f9</guid>
            <category><![CDATA[ansible]]></category>
            <category><![CDATA[ssm-agent]]></category>
            <category><![CDATA[aws]]></category>
            <dc:creator><![CDATA[Nicolai Antiferov]]></dc:creator>
            <pubDate>Mon, 09 Jun 2025 21:32:54 GMT</pubDate>
            <atom:updated>2025-08-02T08:41:55.263Z</atom:updated>
<content:encoded><![CDATA[<p>Starting to use Ansible is very simple: you just need ssh access to the host you want to manage and a control host from which you run ansible-playbook with your code.</p><p>However, ssh user management was never an easy task. If you leave it to manual actions, it will end up in a mess. And then begins the story of different solutions, internally developed or external tools/services, which help to fix this issue. On top of that you get topics related to certifications, audits, etc.</p><p>But if you’re running your workloads on AWS, you can relatively easily switch from running Ansible via ssh to <a href="https://github.com/aws/amazon-ssm-agent">amazon-ssm-agent</a> and forget about all the issues with ssh user management. Additionally, you’ll get the ability to nicely audit access to the ec2 instances (<a href="https://docs.aws.amazon.com/systems-manager/latest/userguide/session-manager.html">docs</a>) and to grant access based on IAM policies with AWS auth.</p><p>There are some parts missing and in general the documentation isn’t great, which is why I think it’s worth describing shortly what’s required for this switchover.</p><p><strong>NOTE</strong>: Documentation improved in this <a href="https://github.com/ansible-collections/amazon.aws/pull/2661">PR</a> and now describes how to show hostnames with the <strong>aws_ssm</strong> connection.</p><p>SSM Agent runs on EC2 instances and enables you to quickly and easily execute remote commands or scripts against one or more instances. It doesn’t require inbound network connectivity, only a proper IAM profile attached to the ec2 instance, policy: arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore. 
Most AMIs come with ssm-agent pre-installed, but if yours is missing it, you can install it manually, <a href="https://docs.aws.amazon.com/systems-manager/latest/userguide/manually-install-ssm-agent-linux.html">docs</a>.</p><p><strong>NOTE</strong>: In order to have SSM Agent -&gt; SSH working properly, the <a href="https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-instance-connect-set-up.html"><strong>EC2 Instance Connect</strong></a> package should be installed as well.</p><p>Let’s imagine you’re using the <a href="https://docs.ansible.com/ansible/latest/collections/amazon/aws/aws_ec2_inventory.html">aws_ec2</a> dynamic inventory to get the hosts you want to run Ansible against. It takes some tag for the hostname, for example tag:Name, which will be shown in the log and can be used in --limit , etc.</p><pre>---<br>plugin: amazon.aws.aws_ec2<br>regions:<br>  - us-east-1<br>filters:<br>  instance-state-name: running<br>...<br>hostnames:<br>  - tag:Name # will return some-hostname in example</pre><p>And the log in this case will look like:</p><pre>TASK [Some test task] ***********************************************<br>ok: [some-hostname] =&gt; {<br>    &quot;msg&quot;: &quot;Running task on host&quot;<br>}<br>...<br>PLAY RECAP **********************************************************<br>some-hostname : ok=2    changed=0    unreachable=0    failed=0    </pre><p>As described in the docs, if your tag:Name is not an FQDN, you can use the compose option to set <em>ansible_host</em> to <em>private_ip</em>, so Ansible will know how to reach the host, for example:</p><pre>---<br>plugin: amazon.aws.aws_ec2<br>regions:<br>  - us-east-1<br>filters:<br>  instance-state-name: running<br>...<br>hostnames:<br>  - tag:Name # will return some-hostname in example<br>compose:<br>  ansible_host: private_ip_address</pre><p>When you run ansible-playbook with this config, the log will be the same, i.e. 
<em>some-hostname</em> will be shown.</p><p>If you search Google for how to use Ansible with ssm-agent, you’ll get a bunch of articles from AWS which describe how to pack your code into a bundle, upload it to S3 and execute it with <strong>AWS Systems Manager</strong>. That’s neither good nor bad, but I think there aren’t enough docs clearly describing how to switch from ssh to ssm-agent 1:1, without complicating everything.</p><p>So in short, first you need to check the docs for the <a href="https://docs.ansible.com/ansible/latest/collections/community/aws/aws_ssm_connection.html">community.aws.aws_ssm connection</a>, which gives you the ability to run Ansible over an ssm-agent connection instead of ssh, and define the required variables in your playbook/group_vars/etc, for example:</p><pre>---<br>- name: Wait for connection to be available<br>  vars:<br>    ansible_connection: aws_ssm<br>    ansible_aws_ssm_bucket_name: some-bucket<br>    ansible_aws_ssm_region: us-east-1<br>    ansible_aws_ssm_profile: some-profile<br>  tasks:<br>    - name: Wait for connection<br>      wait_for_connection:</pre><p>Basically, with ansible_connection: aws_ssm, Ansible uploads your files to the provided S3 bucket instead of transferring them via ssh, and passes a presigned S3 URL to the managed host to execute via the ssm-agent connection.</p><p>But there’s a catch: according to the docs, the dynamic inventory configuration has to change a little bit:</p><pre>---<br>plugin: amazon.aws.aws_ec2<br>regions:<br>  - us-east-1<br>filters:<br>  instance-state-name: running<br>...<br>hostnames:<br>  - instance-id</pre><p>With this example, instead of the readable hostname from tag:Name, Ansible will show you the instance id in logs, like i-123456789 . And the same name has to be used in --limit , etc. 
Which is way less convenient and won’t help with a painless migration.</p><pre>TASK [Some test task] ***********************************************<br>ok: [<strong>i-123456789</strong>] =&gt; {<br>    &quot;msg&quot;: &quot;Running task on host&quot;<br>}<br>...<br>PLAY RECAP **********************************************************<br><strong>i-123456789</strong> : ok=2    changed=0    unreachable=0    failed=0</pre><p>What should be done instead is the same trick with the compose option mentioned earlier, but instead of the private IP address, we set instance_id.</p><p>This allows us to migrate transparently from the ssh connection to ssm-agent with no change in behaviour, i.e. the same old hostnames will be shown in logs and used in command line options 🎉</p><pre>---<br>plugin: amazon.aws.aws_ec2<br>regions:<br>  - us-east-1<br>filters:<br>  instance-state-name: running<br>...<br>hostnames:<br>  - tag:Name # will use `tag:Name` for hostname<br>compose:<br>  ansible_host: instance_id # but connect to InstanceID via ssm_agent</pre><h4>Limitations</h4><p>While this approach lets you replace ssh with ssm-agent 1:1, it still has limitations:</p><ol><li>Overall, execution via <em>aws_ssm</em> is slower: according to my tests, at least 2–4x. Not critical for relatively well written and short code, but with some legacy long-running stuff it might be a concern. There’s an <a href="https://github.com/ansible-collections/amazon.aws/issues/2636">issue</a> in the Ansible backlog to improve this.</li><li>Connection via <em>aws_ssm</em> fixes only part of the equation, i.e. how to run Ansible, but not how to replace ssh completely. So in order to log in to the host, <a href="https://awscli.amazonaws.com/v2/documentation/api/latest/reference/ssm/start-session.html">aws-cli</a> should be used with the start-session command. 
It requires the <a href="https://docs.aws.amazon.com/systems-manager/latest/userguide/session-manager-working-with-install-plugin.html">Session Manager plugin</a> installed (Ansible requires it as well), for example: aws ssm start-session --target &quot;i-123456789&quot; . You might remember instances by hostname (tag:Name), for example, but here you have to know the InstanceId. In order to simplify this, you might want to create some bash one-liner with aws-cli querying EC2 by tags and returning the InstanceId, but I think the best option would be to wrap it into some script, like bash or an Invoke task (check this <a href="https://awstip.com/better-make-for-automation-ad371bf42a97">article</a> for details).</li><li>SSM agent only provides the ability to log in and execute commands on remote hosts. In order to copy files to and from EC2, you need to use SSM Agent -&gt; SSH proxying (docs), which lets you run commands like ssh i-123456789 or scp i-123456789:/tmp/something .</li></ol><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=c46ac04101f9" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Ansible / How to measure command duration]]></title>
            <link>https://nklya.medium.com/ansible-how-to-measure-command-duration-4f6bbe0a8523?source=rss-4e31e7e49895------2</link>
            <guid isPermaLink="false">https://medium.com/p/4f6bbe0a8523</guid>
            <category><![CDATA[ansible]]></category>
            <category><![CDATA[ansible-tutorial]]></category>
            <category><![CDATA[iac]]></category>
            <dc:creator><![CDATA[Nicolai Antiferov]]></dc:creator>
            <pubDate>Sun, 05 Jan 2025 12:38:58 GMT</pubDate>
            <atom:updated>2025-01-05T12:38:58.207Z</atom:updated>
            <content:encoded><![CDATA[<p>For various reasons, you might end up in a situation where you need to measure how long some command took to run. I had to do this some years ago, running DB migrations during deploys with Ansible (I know this is not a great solution, but that was PHP 😀).</p><p>If you only want to know how long each task runs, it’s pretty easy: just add <a href="https://docs.ansible.com/ansible/latest/collections/ansible/posix/profile_tasks_callback.html">profile_tasks</a> to ansible.cfg, like this:</p><pre>[defaults]<br>callback_whitelist = profile_tasks</pre><p>And at the end of the ansible-playbook run you will get a summary of how long each task was running, for example:</p><pre>Sunday 05 January 2025  14:15:42 +0200 (0:00:20.028)       0:00:30.058 ******** <br>=============================================================================== <br>Do something even longer ---------------------------------------------- 20.03s<br>Do something long ----------------------------------------------------- 10.02s</pre><p>But what if you want to know how long one particular task was running and <strong>change the flow of tasks</strong> in the playbook depending on this?</p><p>One way could be to report the duration from the module you’re running. 
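For example, the ansible.builtin.command module already reports timing in its result via the start, end and delta fields, so for plain commands the duration can be read directly (a sketch; the script name is made up):</p><pre># Sketch: the command module&#39;s result already includes timing fields<br>- name: Run the migration<br>  ansible.builtin.command: ./migrate.sh # made-up script<br>  register: result<br><br>- name: Show how long it ran<br>  ansible.builtin.debug:<br>    var: result.delta # a string like &quot;0:00:30.123456&quot;</pre><p>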
If it’s your own custom module, that should be possible and not very hard.</p><p>To achieve this with any module, I think the easiest approach is to register the task result and calculate the duration from its stop and start timestamps, for example:</p><pre>---<br>- gather_facts: false<br>  hosts: all<br>  tasks:<br>    - name: Do something long<br>      ansible.builtin.pause:<br>        seconds: 30<br>      register: this<br><br>    - set_fact:<br>        duration: &quot;{{ (this.stop|to_datetime(&#39;%Y-%m-%d %H:%M:%S.%f&#39;) - this.start|to_datetime(&#39;%Y-%m-%d %H:%M:%S.%f&#39;)).total_seconds() }}&quot;<br><br>    - debug:<br>        var: duration</pre><pre>PLAY [all] ****************************************************************************************************************************************<br><br>TASK [Do something long] **************************************************************************************************************************<br>Sunday 05 January 2025  14:36:33 +0200 (0:00:00.007)       0:00:00.007 ******** <br>Pausing for 30 seconds<br>(ctrl+C then &#39;C&#39; = continue early, ctrl+C then &#39;A&#39; = abort)<br>ok: [localhost]<br><br>TASK [set_fact] ***********************************************************************************************************************************<br>Sunday 05 January 2025  14:37:03 +0200 (0:00:30.031)       0:00:30.039 ******** <br>ok: [localhost]<br><br>TASK [debug] **************************************************************************************************************************************<br>Sunday 05 January 2025  14:37:03 +0200 (0:00:00.035)       0:00:30.075 ******** <br>ok: [localhost] =&gt; {<br>    &quot;duration&quot;: &quot;30.005181&quot;<br>}<br><br>PLAY RECAP ****************************************************************************************************************************************<br>localhost                  : ok=3    changed=0    unreachable=0    failed=0    skipped=0    
rescued=0    ignored=0   <br><br>Sunday 05 January 2025  14:37:03 +0200 (0:00:00.013)       0:00:30.088 ******** <br>=============================================================================== <br>Do something long ------------------------------------------------------------------------------------------------------------------------- 30.03s<br>set_fact ----------------------------------------------------------------------------------------------------------------------------------- 0.04s<br>debug -------------------------------------------------------------------------------------------------------------------------------------- 0.01s</pre><p>And then, based on this duration variable, you can decide which tasks/roles to run.</p><p>P.S. Another option might be <a href="https://stackoverflow.com/a/78101279">this approach</a> from SO, where timestamps are registered before and after the executed command(s).</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=4f6bbe0a8523" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[How to adapt Grafana dashboard to changed metrics]]></title>
            <link>https://nklya.medium.com/how-to-adapt-grafana-dashboard-to-renamed-metrics-7e72500b3741?source=rss-4e31e7e49895------2</link>
            <guid isPermaLink="false">https://medium.com/p/7e72500b3741</guid>
            <category><![CDATA[prometheus]]></category>
            <category><![CDATA[grafana]]></category>
            <dc:creator><![CDATA[Nicolai Antiferov]]></dc:creator>
            <pubDate>Fri, 03 Jan 2025 06:36:20 GMT</pubDate>
            <atom:updated>2025-01-05T17:16:20.985Z</atom:updated>
            <content:encoded><![CDATA[<p>From time to time it happens that either a project you’re using changes its metric names or your own services do, and after that you face a dilemma: how to visualize the new metrics while keeping the possibility to check historical data as well. Examples: metric renames in <a href="https://karpenter.sh/v1.0/upgrading/v1-migration/#updated-metrics">karpenter</a>, <a href="https://github.com/oliver006/redis_exporter/pull/256">redis_exporter</a>, <a href="https://github.com/prometheus/node_exporter/blob/master/docs/V0_16_UPGRADE_GUIDE.md">node_exporter</a>.</p><p>One option: create a Dashboard v2, keep the old metrics in the old one, and use both. But you need to remember it exists, announce the change properly team- or company-wide and, worst of all, it most probably won’t be the last change, so eventually you’ll have to create Dashboard v3, v4 and so on, which is not sustainable.</p><p>Another: duplicate the existing panel in the dashboard and change its query to the new one. It’s possible to group such panels in another row or similar for better visibility, but this has the same issue: what happens with the next rename, more duplicates/rows?</p><p>A third option: set up recording rules and hide the metrics change, but that’s not great either (it creates more metrics) and I think it only makes sense if you don’t own the Grafana dashboard and cannot change it.</p><p>The best solution from my perspective is simply to use multiple queries in panels. 
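Roughly speaking, the panel then carries two targets instead of one; here is a YAML-style sketch of the relevant part of Grafana’s panel JSON (metric and legend names are made-up examples):</p><pre># Sketch: one panel, two queries covering a metric rename<br>targets:<br>  - expr: sum(rate(myapp_requests_total[5m])) # old metric name<br>    legendFormat: &quot;requests (legacy)&quot;<br>  - expr: sum(rate(myapp_http_requests_total[5m])) # new metric name<br>    legendFormat: &quot;requests&quot;</pre><p>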
This way you can not only visualize all the renamed metrics in one panel, but also mark them with different legends, so it’s visible on the graph whether you’re looking at the old or the new version of the metric.</p><p><strong>Example</strong>: Imagine you’re <a href="https://medium.com/@nklya/karpenter-v1-upgrade-nuances-15a52642e9d1">updating Karpenter to v1</a> and the NodePool usage metric changed its name from karpenter_nodepool_usage to karpenter_nodepools_usage.</p><p>Before the update, you had a panel which showed how many CPU cores are used: sum(karpenter_nodepool_usage{resource_type="cpu"}) by (nodepool, resource_type)</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*hpzSacd1OS0MrG8ZwqFqWg.png" /><figcaption>One query</figcaption></figure><p>Now you need to clone query A with the “Duplicate query” button (second from the right) and modify the query according to the new naming schema. It’s also worth updating the legend of the old query, so it’s visible on the graph which version you’re seeing.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*4xZIAQ39zuJHT314A3yGMw.png" /><figcaption>Two queries, old renamed to legacy</figcaption></figure><p><strong>NOTE</strong>: If only the <strong>metric name</strong> changed, but not the <strong>labels</strong>, you can also use OR and update <strong>only the PromQL query</strong>, like: karpenter_nodepools_usage or karpenter_nodepool_usage.</p><p>That was pretty easy, but what to do if your variable selector depends on a metric which was renamed?</p><p>Usually Grafana uses the label_values query type and the variable definition looks like this: nodepool: label_values(karpenter_nodepool_usage,nodepool)</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/950/1*owkaOaSAMMxLrspEja9jvg.png" /></figure><p>In order to adapt this variable to support both metric names, you need to slightly modify the query to: nodepool: label_values({__name__=~"karpenter_nodepool_usage|karpenter_nodepools_usage"},nodepool)</p><figure><img alt="" 
src="https://cdn-images-1.medium.com/max/1024/1*AzF7PINnl3wbP93zwLsghQ.png" /></figure><p>This way either of these metrics (old or new) will be used as the source for the nodepool variable.</p><p><strong>NOTE</strong>: Please keep in mind that this requires the label nodepool to exist in both metrics, so it won’t help in all cases.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=7e72500b3741" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Karpenter v1 upgrade gotchas]]></title>
            <link>https://awstip.com/karpenter-v1-upgrade-nuances-15a52642e9d1?source=rss-4e31e7e49895------2</link>
            <guid isPermaLink="false">https://medium.com/p/15a52642e9d1</guid>
            <category><![CDATA[karpenter]]></category>
            <category><![CDATA[kubernetes]]></category>
            <category><![CDATA[aws]]></category>
            <dc:creator><![CDATA[Nicolai Antiferov]]></dc:creator>
            <pubDate>Thu, 02 Jan 2025 18:47:28 GMT</pubDate>
            <atom:updated>2025-02-16T17:47:39.991Z</atom:updated>
            <content:encoded><![CDATA[<p>In August 2024 Karpenter 1.0 was <a href="https://aws.amazon.com/blogs/containers/announcing-karpenter-1-0/">released</a>, marking an important milestone in the project’s development.</p><p>This article <strong><em>is not</em></strong> a step-by-step upgrade guide (one already <a href="https://karpenter.sh/v1.0/upgrading/v1-migration/">exists</a>), but rather details about nuances of this process that might not be obvious, based on my experience; I think it may be useful if you’re planning the update.</p><h4>Drift</h4><p>Karpenter v1 brings quite a lot of significant changes. One of them is <strong>Drift disruption</strong>: previously behind a feature gate, it is now enabled by default, which means that Karpenter will reprovision your nodes with every new AMI release. It’s now possible to disable it at the NodePool level, but that should be done <strong>before</strong> the update to v1.</p><p>In order to disable drift, you need nodes: &quot;0&quot; for reasons: Drifted in the <strong>v1</strong> budget definition, for example:</p><pre>  disruption:<br>    consolidationPolicy: WhenEmptyOrUnderutilized<br>    budgets:<br>      - nodes: &quot;10%&quot;<br>        reasons: <br>        - &quot;Empty&quot;<br>        - &quot;Underutilized&quot;<br>      - nodes: &quot;0&quot; # disable Drift disruption completely<br>        reasons: <br>          - &quot;Drifted&quot;</pre><h4>Namespace</h4><p>If you have used Karpenter since the early versions, you might be running it in the karpenter namespace. With the update to v1, conversion webhooks are required, and if you install only the karpenter Helm chart, the karpenter-crd chart will be installed as a subchart, which will cause errors, because karpenter will look for the service in the kube-system namespace instead, <a href="https://github.com/aws/karpenter-provider-aws/issues/6818#issuecomment-2451081638">issue</a>.</p><p>To fix this, there are 2 options:</p><ul><li>migrate the Karpenter installation to the kube-system namespace. 
This is now the recommended namespace, and the migration can be done without any interruption to workloads.</li><li>install the karpenter-crd Helm chart separately, as described in the aforementioned <a href="https://github.com/aws/karpenter-provider-aws/issues/6818">issue</a>.</li></ul><h4>Update order</h4><p><strong>NOTE</strong>: You can only update from <strong>0.33–0.37</strong> to <strong>1.0</strong>, i.e. v1beta1 -&gt; v1 .</p><p>1. Before the upgrade, ensure you’re running the latest Karpenter 0.33–0.37 version, which provides conversion webhooks and supports both the v1 and v1beta1 APIs. I’ve done this from the latest <a href="https://github.com/aws/karpenter-provider-aws/releases/tag/v0.37.6">v0.37.6</a>.</p><p>2. With that, you should be able to get manifests converted to v1 from the cluster by running kubectl get nodepool/&lt;your-nodepool&gt; -oyaml.</p><p>3. Update the terraform-aws-eks module (if you’re using it) at least to <a href="https://github.com/terraform-aws-modules/terraform-aws-eks/releases/tag/v20.24.1">v20.24.1</a> <strong>before the upgrade</strong>. It includes support for the v1 IAM policy, which differs from 0.33–0.37.</p><p>4. I hope you’re using GitOps, so next prepare a PR with the Karpenter version updated to v1 and the EC2NodeClass/NodePool manifests updated to v1.</p><p>To address the drift disruption issue, you need to ensure that the budget has the section mentioned before:</p><pre>  disruption:<br>    budgets:<br>      - nodes: &quot;0&quot; # disable Drift disruption completely<br>        reasons: <br>          - &quot;Drifted&quot;</pre><p>5. Set enable_v1_permissions = true for the karpenter <a href="https://github.com/terraform-aws-modules/terraform-aws-eks/tree/master/modules/karpenter">terraform module</a> and apply it.</p><p>6. 
Merge the PR with the Karpenter and manifest updates; GitOps should provision everything, and if you disabled drift properly, there will be no disruptions to workloads.</p><h4>Metrics</h4><p>Quite a lot of metrics changed their names (<a href="https://karpenter.sh/v1.0/upgrading/v1-migration/#updated-metrics">details</a>), mostly related to nodes, nodepools and disruptions, like karpenter_nodepool_usage → karpenter_nodepools_usage.</p><p>You need to check the Karpenter <strong>dashboard</strong> and <strong>alert queries</strong> and update them accordingly (this <a href="https://nklya.medium.com/how-to-adapt-grafana-dashboard-to-renamed-metrics-7e72500b3741">article</a> could help). Otherwise, you might miss hitting NodePool limits or issues with node provisioning.</p><p><strong>NOTE</strong>: If you have an alert on the karpenter_cloudprovider_errors_total metric, you need to exclude the NodeClaimNotFoundError error, like {error!="NodeClaimNotFoundError"} , <a href="https://karpenter.sh/v1.0/upgrading/v1-migration/#updated-metrics">details</a>.</p><h4>Resources</h4><p>As a side note, please keep in mind that Karpenter v1 consumes more resources, in my experience.</p><p>If you’re solving the chicken-and-egg problem by running Karpenter on Fargate, you need to ensure you’re providing enough cpu/memory resources. Check it with kubectl describe &lt;karpenter-pod&gt;|grep Capacity.</p><p>Otherwise, it can leave Karpenter unable to perform consolidation and cause metrics fetch timeouts (gaps on graphs).</p><p>Check the rate of the <a href="https://karpenter.sh/v1.0/reference/metrics/#karpenter_voluntary_disruption_consolidation_timeouts_total"><em>karpenter_voluntary_disruption_consolidation_timeouts_total</em></a> metric in order to detect consolidation timeouts.</p><h4>V1.1 upgrade</h4><p>Karpenter 1.1.0 drops support for the v1beta1 APIs. 
Seemingly because of this, if you’re using Flux for GitOps, you might get an error during reconciliation after the update to v1.1, which looks like: timeout waiting for: [EC2NodeClass/name status: &#39;NotFound&#39;, &lt;list of all other EC2NodeClass here too&gt;] .</p><p>To fix that, you need to restart the Flux components.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=15a52642e9d1" width="1" height="1" alt=""><hr><p><a href="https://awstip.com/karpenter-v1-upgrade-nuances-15a52642e9d1">Karpenter v1 upgrade gotchas</a> was originally published in <a href="https://awstip.com">AWS Tip</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Summary “Five Steps to Make Your Go Code Faster & More Efficient” FOSDEM 04.02.2023]]></title>
            <link>https://nklya.medium.com/summary-five-steps-to-make-your-go-code-faster-more-efficient-fosdem-04-02-2023-cc6fc28c3b11?source=rss-4e31e7e49895------2</link>
            <guid isPermaLink="false">https://medium.com/p/cc6fc28c3b11</guid>
            <dc:creator><![CDATA[Nicolai Antiferov]]></dc:creator>
            <pubDate>Sat, 04 Feb 2023 15:57:33 GMT</pubDate>
            <atom:updated>2023-02-04T16:10:08.112Z</atom:updated>
            <content:encoded><![CDATA[<h2>Summary “Five Steps to Make Your Go Code Faster &amp; More Efficient” FOSDEM 04.02.2023 by Bartek Plotka</h2><p>A summary of the ‘<a href="https://bwplotka.dev/2022/efficient-go-release/">Efficient Go</a>’ book; a story from the Thanos project inspired it.</p><p>At first the Thanos compactor was implemented without much optimization, and with increased usage in different companies issues started arising: memory overuse, OOMs, etc.</p><p>Possible solutions:</p><ul><li>Add an option to use less memory. But Go is not Java; it’s already quite efficient</li><li>Vertical scale-up. A waste of money 💰</li><li>Horizontal scale-out. The complexity of the service became huge</li><li>Use other solutions, like Cortex, Mimir, etc</li><li>Switch to a vendor solution</li></ul><p>Meanwhile the code remains inefficient, and that is the main issue. Finally, the compactor code was reviewed and the algorithm optimized.</p><p>In the past, in lots of cases code was over-optimized from the beginning. Now, with all the elasticity of clouds and orchestrators, optimization often happens later.</p><h3>Five pragmatic steps towards more efficient Go programs</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*kE-1W1S7xT4xTCl_jO6XBw@2x.jpeg" /></figure><ol><li>Use TFBO: test/fix/benchmark/optimize, which is TDD wrapped with BDO.</li><li>Understand the current efficiency level with micro-benchmarks, using the built-in Go benchmark framework. It has different options and results should be reproducible. Use the <a href="https://pkg.go.dev/golang.org/x/perf/cmd/benchstat"><strong>benchstat</strong></a> tool to get human-readable results from benchmarks.</li><li>Understand your efficiency requirements, so as not to start with premature optimization. RAER (resource-aware efficiency requirements): try to roughly estimate the requirements.</li><li>Focus on the hot path, using profiling: add a couple of options to the same benchmark command line for cpu and memory. 
And then check in flamegraphs where cpu time is spent.</li><li>Optimize and repeat.</li></ol><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*a_769L95W1oIh-9F5ROcHA@2x.jpeg" /></figure><p>Link to the talk page, where the video and slides will be added: https://fosdem.org/2023/schedule/event/gofivestepsefficient/</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=cc6fc28c3b11" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Summary “Squeezing a go function” FOSDEM 04.02.2023 by Jesús Espino, Mattermost]]></title>
            <link>https://nklya.medium.com/tldr-squeezing-a-go-function-fosdem-04-02-2023-def76a5f3d70?source=rss-4e31e7e49895------2</link>
            <guid isPermaLink="false">https://medium.com/p/def76a5f3d70</guid>
            <dc:creator><![CDATA[Nicolai Antiferov]]></dc:creator>
            <pubDate>Sat, 04 Feb 2023 13:42:47 GMT</pubDate>
            <atom:updated>2023-02-04T16:56:58.008Z</atom:updated>
            <content:encoded><![CDATA[<p>Optimize what you need, when you need it. Don’t over-optimize.</p><p>Don’t guess. Measure everything and optimize based on data.</p><h3>Benchmarks</h3><p>go benchmark is built into Go, like tests:</p><p>“go test -bench .”</p><p>It’s possible to report allocations from the benchmark context.</p><h3>Profiling</h3><p>Usually you profile first and then benchmark the parts.</p><p>But in these examples we first benchmark with output and then profile on that output.</p><h3>Reducing cpu usage</h3><p>Return earlier, for example.</p><h3>Reduce allocations</h3><p>A pre-sized slice is about the same speed, but has fewer allocations and less memory usage. An array is even faster.</p><h3>Packing</h3><p>Many separate variables are less efficient, in terms of memory used, than the same values packed into a struct.</p><h3>Function inlining</h3><p>Inlined functions are faster.</p><h3>Escape analysis</h3><p>Passing by value copies the value on the stack, and there are no heap allocations.</p><h3>Escape analysis and function inlining</h3><p>Combining both, you achieve fewer allocations and less time spent.</p><h3>Concurrency</h3><p>Goroutines are lightweight, but not free.</p><p>If you run more of them than there are cpu cores and the workload is heavy, overall performance will suffer and generate a lot of allocations and memory usage.</p><h3>References</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*ByVMaMgA-9xMS8Uh7TgUgQ@2x.jpeg" /></figure><p>Link to the page where the video and slides will be added: https://fosdem.org/2023/schedule/event/gosqueezingfunction/</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=def76a5f3d70" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Summary “Recipes for reducing cognitive load” FOSDEM 04.02.2023 by Federico Paolinelli]]></title>
            <link>https://nklya.medium.com/tldr-recipes-for-reducing-cognitive-load-fosdem-04-02-2023-1d84928151ca?source=rss-4e31e7e49895------2</link>
            <guid isPermaLink="false">https://medium.com/p/1d84928151ca</guid>
            <dc:creator><![CDATA[Nicolai Antiferov]]></dc:creator>
            <pubDate>Sat, 04 Feb 2023 11:42:44 GMT</pubDate>
            <atom:updated>2023-02-04T16:58:30.304Z</atom:updated>
            <content:encoded><![CDATA[<blockquote>Based on the experience of reviewing PRs in the <a href="https://github.com/metallb/metallb">metallb</a> project.</blockquote><p>The less cognitive load, the less energy you spend.</p><p>The simpler the solution, the lower the load.</p><h3><strong>Line of sight</strong></h3><p>One line of happy path, and the next indent for exceptions. “Align to the left”</p><p>There are different ways to achieve this, like returning earlier, wrapping into functions, etc.</p><p>https://medium.com/@matryer/line-of-sight-in-code-186dd7cdea88</p><h3>Package size and name</h3><p>Utils.copyNode is not as good as Node.copy</p><p>Package names — The Go Programming Language</p><h3>Errors handling</h3><p>Working with Errors in Go 1.13 — The Go Programming Language</p><h3>Pure functions</h3><ul><li>Easier to test</li><li>No side effects</li></ul><h3>Environment variables</h3><p>In the modern container world it’s very easy to use environment variables.</p><p>But it’s hard to track all the parameters read during execution.</p><p>That’s why env variables should be read once in main and propagated to the other parts.</p><h3>Booleans</h3><p>When a function takes multiple booleans as input, you might end up with a call like dosmth(true, false, true, true). It’s impossible to understand what is what without the function definition.</p><p>Use constants or a structure to pass them to the function for a clear understanding of the parameters.</p><h3>Function overload</h3><p>Not supported in Golang. 
But functional options come to the rescue.</p><h3>Methods should be functions if possible</h3><p>Easier to test, easier to understand, easier to write and reuse.</p><h3>Pointers</h3><p>If a function doesn’t change the object, pass it by value; otherwise use a pointer.</p><p>It can be more expensive to copy, but it makes the code simpler and less prone to unclear behavior.</p><h3>Structure</h3><p>Code should read like a newspaper.</p><ul><li>Split the package into files</li><li>Put definitions at the top of the file</li></ul><p>https://github.com/golang/go/wiki/CodeReviewComments</p><h3>Asynchronous functions</h3><p>Move business logic into synchronous functions and then call them in goroutines.</p><p>It’s easier to test synchronous functions.</p><h3>Functions that lie</h3><p>If a function is called, for example, clearnode(), but then doesn’t clean the node in some cases, it will mislead others.</p><h3>Conclusion</h3><p>Pareto principle: around 80% of the outcome is achieved by 20% of the effort.</p><p>Simplicity is complicated, but clarity is worth it.</p><p>Link to the talk page, where slides and video will be added later: <a href="https://fosdem.org/2023/schedule/event/goreducecognitive/">https://fosdem.org/2023/schedule/event/goreducecognitive/</a></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=1d84928151ca" width="1" height="1" alt="">]]></content:encoded>
        </item>
    </channel>
</rss>