Saving networking costs for traffic between Flux and GitHub
Behind a mysterious AWS NAT gateway cost increase for outbound traffic
What is Flux?
Flux is a tool for keeping Kubernetes clusters in sync with sources of configuration (like Git repositories), and automating updates to configuration when there is new code to deploy.
Working with Flux means adopting the GitOps workflow, where the source code (or, indirectly, image repositories) acts as the source of truth and Flux polls for updates to reconcile the state of the Kubernetes cluster with what's defined in code.
This works great in theory. In practice, your networking costs can get out of hand when you rely on Flux to poll GitHub repositories, especially monorepos, which can grow huge if left unchecked.
Problem
We ran into this problem at VTS Engineering when, one fine day, we saw our region's NAT gateway costs double from $200 to $400 per day. This led to a deeper investigation to find the needle in the haystack behind the network cost increase.
The investigation
Initial investigation
This stage consisted of searching for source code changes or network-level changes that might have led to an increase in traffic, such as:
- New services deployed in the Kubernetes clusters
- Feature work leading to higher outbound traffic to fetch data from the internet
- Integration services whose sole job is to fetch/update data from third party providers
However, none of these turned out to be the reason for the increase in costs.
Analyzing VPC flow logs
The best place to look is the VPC flow logs, which show exactly which IP addresses receive the majority of outbound traffic.
To our surprise, we found that most of the outbound traffic behind the higher charges was going to GitHub IPs. This meant something in our VPCs had suddenly started fetching huge amounts of data from GitHub's servers.
Root Cause
Part 1:
After some sleuthing, we discovered it was the Flux source controllers (HelmRepository, GitRepository) incessantly polling GitHub servers and upstream Helm repositories across multiple K8s clusters running in different VPCs.
However, this had always been the case, so it didn't explain why the amount of data transferred increased suddenly.
Part 2:
We found that a bad commit had gone into the monorepo we use to host our custom Helm charts, increasing its size to ~1GB! The commit was reverted, but that didn't help because Git keeps the full history, so the repository size didn't shrink.
We were pulling a big Git repo (1GB) every 2 minutes, multiple times (once per chart), across different clusters.
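As a rough sanity check on those numbers, the daily cost increase can be converted back into data volume. This is only a back-of-the-envelope sketch: the $0.045 per GB NAT gateway data processing rate is an assumed list price (the post states neither the region nor the rate), and it assumes the whole $200 per day increase came from clone traffic.

```latex
\begin{align*}
\text{extra data per day} &\approx \frac{\$200/\text{day}}{\$0.045/\text{GB}} \approx 4{,}444\ \text{GB/day} \\
\text{polls per day at a 2m interval} &= \frac{1440\ \text{min}}{2\ \text{min}} = 720 \\
\text{data per polling cycle} &\approx \frac{4{,}444\ \text{GB}}{720} \approx 6.2\ \text{GB}
\end{align*}
```

Under those assumptions, the spike corresponds to roughly six full ~1GB clones every 2 minutes across all clusters, which is consistent with several chart sources in several clusters each pulling the same repository.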
Solution
Solving the problem should have been as simple as doing a shallow clone that fetches only the latest commit on the branch instead of the whole history. However, Flux doesn't support shallow clones via a configurable fetch depth. The only alternative is to pin a particular commit, which wasn't an option for us.
Therefore, we bumped the spec.interval field of the GitRepository resource from 2m to 10m (minutes), and did the same for the ImageUpdateAutomation controller, as it is the one that triggers the GitRepository source controller.
Note: There's a tradeoff to weigh here. Increasing the interval affects deployment times, because Flux takes longer to fetch and deploy new changes.
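For illustration, the ImageUpdateAutomation change might look like the sketch below. Only the 10m interval and the link to the GitRepository come from the description above; the resource name, commit author, update path, and the image.toolkit.fluxcd.io/v1beta1 API version are assumptions chosen for the example.

```yaml
# Hypothetical ImageUpdateAutomation; everything except interval and sourceRef.name is a placeholder
apiVersion: image.toolkit.fluxcd.io/v1beta1   # assumed API version
kind: ImageUpdateAutomation
metadata:
  name: monorepo-image-updates                # assumed name
  namespace: flux-system
spec:
  interval: 10m0s                             # raised to 10m to reduce how often the repo is fetched
  sourceRef:
    kind: GitRepository
    name: monorepo                            # the GitRepository source this automation works on
  git:
    checkout:
      ref:
        branch: main
    commit:
      author:
        name: fluxcdbot                       # placeholder author
        email: fluxcdbot@example.com          # placeholder email
    push:
      branch: main
  update:
    path: ./clusters                          # assumed path to the manifests being updated
    strategy: Setters
```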
To future-proof the setup, we also bumped the interval of the HelmRepository sources to 2 hours (120m), since we rarely need the latest upstream charts and wanted to avoid unnecessary outbound calls.
# Example GitRepository resource with an updated spec.interval
apiVersion: source.toolkit.fluxcd.io/v1beta2
kind: GitRepository
metadata:
  name: monorepo
  namespace: flux-system
spec:
  interval: 6m0s
  ref:
    branch: main
  url: 'ssh://git@github.com/org_name/monorepo'
  secretRef:
    name: monorepo
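The 2-hour HelmRepository interval mentioned earlier could be expressed in the same way. This is a minimal sketch: the resource name and chart URL are placeholders, not our actual upstream repositories.

```yaml
# Hypothetical HelmRepository; name and url are placeholders
apiVersion: source.toolkit.fluxcd.io/v1beta2
kind: HelmRepository
metadata:
  name: example-charts                 # assumed name
  namespace: flux-system
spec:
  interval: 120m0s                     # poll the upstream chart index every 2 hours
  url: https://charts.example.com      # placeholder chart repository URL
```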
Results
Bumping the spec.interval for GitRepository did the trick: we cut our outbound calls to pull the big monorepo by half, saving ~$70K USD annually.
Tangentially, we also set the fetch-depth setting in GitHub Actions (CI) so that all workflows do a shallow clone of the affected monorepo, further reducing NAT gateway costs for our self-hosted setup.
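A shallow checkout in a workflow might look like the sketch below. The workflow name, trigger, job name, and action version are assumptions for the example; fetch-depth is the actions/checkout input that controls how many commits are fetched (1 fetches only the latest commit, 0 fetches the full history).

```yaml
# Hypothetical workflow; only the fetch-depth setting reflects the change described above
name: ci
on:
  push:
    branches: [main]
jobs:
  build:
    runs-on: self-hosted               # self-hosted runners whose traffic goes through the NAT gateway
    steps:
      - uses: actions/checkout@v4      # assumed action version
        with:
          fetch-depth: 1               # shallow clone: fetch only the latest commit
```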
Final Thoughts
Flux not supporting shallow clones severely limits the use of monorepos, which grow larger over time (and not just because of one problematic commit).
Long-term solution: We switched to a new, lightweight, artifact-only GitHub repository that hosts our custom Helm charts.
Thanks for reading! Hope this helps you solve this gotcha and save on costs.