Saving networking costs for traffic between Flux and GitHub

Dev Shah

Published in

Tenets

4 min read · Apr 11, 2024

Behind mysterious NAT gateway (AWS) cost increases for outbound traffic

What is Flux?

Flux is a tool for keeping Kubernetes clusters in sync with sources of configuration (like Git repositories), and automating updates to configuration when there is new code to deploy.

Source

Working with Flux means adopting the GitOps workflow, where the source code (or, indirectly, image repositories) acts as the source of truth and Flux polls for updates to reconcile the state of the Kubernetes cluster with what's defined in code.

This works great in theory. However, networking costs can get out of hand when you rely on Flux to poll GitHub repositories, especially monorepos, which can grow huge if left unchecked.

Problem

We ran into this problem at VTS Engineering when, one fine day, we saw our region's NAT gateway costs double from $200 to $400 per day. This finding led to a deeper investigation to find the needle in the haystack behind the network cost increase.

The investigation

Initial investigation

This stage consisted of searching for source code changes or network-level changes that might have led to an increase in traffic, such as:

New services deployed in the Kubernetes clusters
Feature work leading to higher outbound traffic to fetch data from the internet
Integration services whose sole job is to fetch/update data from third party providers

However, none of these turned out to be the reason for the increase in costs.

Analyzing VPC flow logs

The best place to look is the VPC flow logs, which show exactly which IP addresses the majority of outbound traffic is going to.

To our surprise, we found that most of the outbound traffic driving the higher charges was going to GitHub IPs. This meant something in our VPCs had suddenly started hitting GitHub's servers to fetch huge amounts of data.

Root Cause

Part 1:

After some sleuthing, we discovered it was the Flux source controllers (HelmRepository, GitRepository) incessantly polling GitHub servers and upstream Helm repositories across multiple K8s clusters running in different VPCs.

However, this had always been the case, so it didn't explain why the amount of data transferred had increased suddenly.

Part 2:

We found that a bad commit had gone into the monorepo we use to host our custom Helm charts, increasing its size to ~1GB! The commit was reverted, but that didn't help: the repository on GitHub keeps the commit in its Git history, so its size didn't shrink.

We were pulling a big Git repo (~1GB) every 2 minutes, multiple times (for different charts), across different clusters.

Source: Diagram explaining the source controller overfetching problem.
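
To get a feel for the scale, here is a rough back-of-the-envelope estimate for a single GitRepository object polling from a single cluster. It assumes every poll transfers the full ~1 GB, and it uses a NAT gateway data processing rate of $0.045 per GB. That rate is an assumption (the us-east-1 list price; other regions differ), not a figure taken from the investigation.

```latex
\begin{align*}
\text{pulls per day} &= \frac{24 \times 60\ \text{min}}{2\ \text{min}} = 720 \\
\text{data per day}  &\approx 720 \times 1\ \text{GB} = 720\ \text{GB} \\
\text{cost per day}  &\approx 720\ \text{GB} \times \$0.045/\text{GB} = \$32.40
\end{align*}
```

Under these assumptions, about six such fetchers (charts multiplied by clusters) would be enough to account for an extra $200 per day.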

Solution

Solving the problem should have been as simple as doing a shallow clone to fetch only the latest commit on the branch instead of everything. However, Flux doesn't support shallow clones via a fetch depth setting. The only alternative is to pin a particular commit, which wasn't an option for us.

Therefore, we bumped the spec.interval field of the GitRepository resource from 2m to 10m (minutes), and did the same for the ImageUpdateAutomation resource, as it is the one that triggers the GitRepository source controller.

Note: There's a careful tradeoff to be made here. Increasing the interval lengthens deployment times, because Flux takes longer to fetch new changes and deploy them.

For future-proofing, we also bumped the interval of the HelmRepository resources to 2 hours (120m), since we rarely pulled the latest charts and wanted to avoid unnecessary outbound calls.

# Example GitRepository resource with an updated spec.interval
apiVersion: source.toolkit.fluxcd.io/v1beta2
kind: GitRepository
metadata:
  name: monorepo
  namespace: flux-system
spec:
  interval: 10m0s
  ref:
    branch: main
  url: 'ssh://git@github.com/org_name/monorepo'
  secretRef:
    name: monorepo
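
The other two interval changes might look like the sketch below. The resource names, chart URL, commit author, and update path are placeholders, and the apiVersion values are assumptions that depend on the Flux version installed in the cluster.

```yaml
# Sketch: ImageUpdateAutomation reconciling every 10 minutes.
# The name, author, and path are hypothetical placeholders.
apiVersion: image.toolkit.fluxcd.io/v1beta1
kind: ImageUpdateAutomation
metadata:
  name: monorepo
  namespace: flux-system
spec:
  interval: 10m0s            # matches the GitRepository interval
  sourceRef:
    kind: GitRepository
    name: monorepo           # the GitRepository this automation updates
  git:
    checkout:
      ref:
        branch: main
    commit:
      author:
        name: fluxcdbot                              # placeholder
        email: fluxcdbot@users.noreply.github.com    # placeholder
    push:
      branch: main
  update:
    path: ./clusters/example   # placeholder path
    strategy: Setters
---
# Sketch: HelmRepository polled every 2 hours.
# The name and URL are hypothetical placeholders.
apiVersion: source.toolkit.fluxcd.io/v1beta2
kind: HelmRepository
metadata:
  name: example-charts
  namespace: flux-system
spec:
  interval: 120m0s           # upstream charts rarely change for us
  url: https://charts.example.com
```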

Results

Bumping spec.interval for the GitRepository did the trick: we cut the outbound calls that pull the big monorepo by half, thereby saving ~$70K USD annually.

Tangentially, we also set fetch-depth in GitHub Actions (CI) so that all workflows do a shallow clone of the affected monorepo, further saving on NAT gateway costs for our self-hosted setup.
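
A minimal sketch of that CI change, assuming the workflows check out code with the actions/checkout action. The workflow name, trigger, and job name are placeholders.

```yaml
# Sketch: shallow clone in a GitHub Actions workflow on a self-hosted runner.
name: ci            # placeholder workflow name
on: [push]
jobs:
  build:            # placeholder job name
    runs-on: self-hosted
    steps:
      - uses: actions/checkout@v4
        with:
          # Fetch only the most recent commit instead of the full history.
          fetch-depth: 1
```

Note that 1 is already the default for actions/checkout, so setting it explicitly matters mostly for workflows that had overridden it, for example with fetch-depth: 0, which fetches the full history.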

Final Thoughts

Flux's lack of shallow clone support severely limits the use of monorepos, which grow larger over time (and not just because of one problematic commit).

Long-term solution: We switched to a new, lightweight, artifact-only GitHub repository that hosts our custom Helm charts.

Thanks for reading! Hope this helps you solve this gotcha and save costs.