From Flux v1 to Flux v2… Stairway to Heaven or Highway to Hell?🤘

Julien Haumont
Published in BlaBlaCar
11 min read · Dec 6, 2023

It is a safe assumption that attempting to migrate from any v1 to a v2 will not be a quiet trip. But how tricky is it when it comes to Flux? We’ve had the chance to perform the migration ourselves, and we’ve certainly learned a few things along the way. Fresh from that experience, let’s look back on the journey. If we had to do it all over again, would we do it the same way? Are we happy with Flux v2 now?

Let’s start with a little bit of context.

Flux: modern Continuous Delivery/Deployment

BlaBlaCar applications run on Kubernetes, and we use FluxCD (or Flux, for friends) to deploy them from our CI/CD pipelines. Flux is a GitOps tool (have a look at https://opengitops.dev/) made popular by WeaveWorks, the creator of the GitOps principle. It applies files from a Git repository to Kubernetes clusters without imposing heavy workflow logic.

Flux brings automation and autonomy to the dev teams at BlaBlaCar. It has been a game changer, helping us push the “you built it, you run it” principle even further and we were happy with the tool.

From an infrastructure point of view, BlaBlaCar runs:

  • 8 Kubernetes clusters
  • ~ 500 HelmReleases (~ 200 in Production)
  • ~ 25 different teams using our clusters
  • ~ 80 namespaces (23 in Production)

But in 2020, WeaveWorks announced Flux v2, a brand new version that kept (almost) nothing of Flux v1. The move was made to tackle architectural issues and answer user requests. The end-of-life of Flux v1 was announced at the same time, with a rapid transition to maintenance-only mode as a first step.

We had no real choice but to migrate to Flux v2. It was somewhat sudden and unexpected, but it was also a chance for us to break the scaling limits we had hit with Flux v1: the new version also answered some of our needs.

The stage is set. Let’s move on to the actual migration.

On the road again: migration time!

BlaBlaCar is a medium-sized company with ~250 engineers deploying applications in Production on a daily basis (several times a day). Our migration had to ensure two key points: zero downtime in Production, and helping the dev teams increase their knowledge of Flux while limiting the impact on their roadmaps.

We came up with a four-stage plan to pull it off.

Discovery

We started with a small discovery before the summer of 2021. The idea was to get familiar with Flux v2, building a POC to identify its most important aspects for the migration.

And we were confident enough to greenlight the migration.

We had allocated about two months to this discovery stage.

Migration design and groundwork

We wrapped up other projects, and in November 2021 we finally had some bandwidth to actively prepare the migration. Things were getting serious, and we designed this fine-grained migration procedure for each of our clusters:

Step 1 — Flux v2 is installed but does not manage any Kubernetes resources.

Flux v1 keeps reconciling Kubernetes resources in the cluster, including the newly-created Flux v2’s resources.

Step 2 — Progressively migrate to Flux v2’s image automation (automatically updating an image tag when a new tag is available in the registry) by removing Flux v1 annotations and configuring Flux v2-dedicated manifests on all concerned resources.
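
To make this concrete, here is a simplified sketch of both models, with made-up names and registry paths rather than our actual setup. In Flux v1, automation lives in annotations on the workload; in Flux v2, it moves to dedicated custom resources plus a marker comment on the image field that Flux rewrites:

    # Flux v1: annotations on the workload (illustrative names)
    metadata:
      annotations:
        fluxcd.io/automated: "true"
        fluxcd.io/tag.app: semver:~1.0

    # Flux v2: dedicated manifests replacing the annotations
    apiVersion: image.toolkit.fluxcd.io/v1beta1
    kind: ImageRepository
    metadata:
      name: app
      namespace: flux-system
    spec:
      image: registry.example.com/team/app
      interval: 5m
    ---
    apiVersion: image.toolkit.fluxcd.io/v1beta1
    kind: ImagePolicy
    metadata:
      name: app
      namespace: flux-system
    spec:
      imageRepositoryRef:
        name: app
      policy:
        semver:
          range: 1.x
    ---
    apiVersion: image.toolkit.fluxcd.io/v1beta1
    kind: ImageUpdateAutomation
    metadata:
      name: flux-system
      namespace: flux-system
    spec:
      interval: 5m
      sourceRef:
        kind: GitRepository
        name: flux-system
      git:
        commit:
          author:
            name: fluxcdbot
            email: fluxcdbot@example.com
      update:
        path: ./clusters/production
        strategy: Setters

    # ...and the marker Flux uses to know which field to rewrite:
    #   image: registry.example.com/team/app:1.0.0 # {"$imagepolicy": "flux-system:app"}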

Step 3 — All the changes needed to stop Flux v1 and let Flux v2 take over resource reconciliation are prepared as pull requests, ready to be merged.

Step 4 — Flux v1 is shut down, but not uninstalled yet. Flux v2 both reconciles the Kubernetes cluster with the source of truth (the Git repository) and manages container image update automation.

This step should be performed with maximum care and efficiency to keep the code freeze as short as possible.
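
For reference, "reconcile the cluster with the Git repository" boils down, in Flux v2, to a source object plus a Kustomization applying its paths. A simplified sketch (URL, branch and path are placeholders, not our actual layout):

    apiVersion: source.toolkit.fluxcd.io/v1beta2
    kind: GitRepository
    metadata:
      name: flux-system
      namespace: flux-system
    spec:
      interval: 1m
      url: ssh://git@github.com/example-org/k8s-manifests
      ref:
        branch: main
    ---
    apiVersion: kustomize.toolkit.fluxcd.io/v1beta2
    kind: Kustomization
    metadata:
      name: cluster
      namespace: flux-system
    spec:
      interval: 5m
      path: ./clusters/production
      prune: true   # garbage-collect resources that are removed from Git
      sourceRef:
        kind: GitRepository
        name: flux-system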

Step 5 — The Helm Operator's access is configured so that it progressively operates in read-only mode on each namespace, as documented in Migrate to the Helm Controller.
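
The Flux documentation covers the details; conceptually, "read-only on a namespace" can be expressed with plain Kubernetes RBAC. A hypothetical sketch with illustrative names (the exact mechanism in the official guide, and in your setup, may differ):

    # Give the Flux v1 Helm Operator read-only verbs on HelmReleases in an already-migrated namespace
    apiVersion: rbac.authorization.k8s.io/v1
    kind: Role
    metadata:
      name: helm-operator-read-only
      namespace: my-team
    rules:
      - apiGroups: ["helm.fluxcd.io"]
        resources: ["helmreleases", "helmreleases/status"]
        verbs: ["get", "list", "watch"]   # no create/update/patch/delete
    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: RoleBinding
    metadata:
      name: helm-operator-read-only
      namespace: my-team
    subjects:
      - kind: ServiceAccount
        name: helm-operator
        namespace: flux
    roleRef:
      apiGroup: rbac.authorization.k8s.io
      kind: Role
      name: helm-operator-read-only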

Step 6 — Progressively migrate to Flux v2's Helm Controller by removing the Helm Operator's access and converting HelmReleases to the v2 format (sketched below).

The duration depends on the number of HelmRelease batches to migrate in the cluster. The code freeze only applies to the current batch of HelmReleases.
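
For illustration, here is roughly what that transformation looks like for a chart pulled from a Helm repository (chart name, version and URL are made up). The Helm Operator's HelmRelease becomes a Helm Controller HelmRelease that references a HelmRepository source:

    # Before: Flux v1 / Helm Operator
    apiVersion: helm.fluxcd.io/v1
    kind: HelmRelease
    metadata:
      name: my-app
      namespace: my-team
    spec:
      releaseName: my-app
      chart:
        repository: https://charts.example.com
        name: my-app
        version: 1.2.3
      values:
        replicaCount: 2

    # After: Flux v2 / Helm Controller
    apiVersion: source.toolkit.fluxcd.io/v1beta2
    kind: HelmRepository
    metadata:
      name: example-charts
      namespace: my-team
    spec:
      url: https://charts.example.com
      interval: 10m
    ---
    apiVersion: helm.toolkit.fluxcd.io/v2beta1
    kind: HelmRelease
    metadata:
      name: my-app
      namespace: my-team
    spec:
      interval: 5m
      releaseName: my-app
      chart:
        spec:
          chart: my-app
          version: 1.2.3
          sourceRef:
            kind: HelmRepository
            name: example-charts
      values:
        replicaCount: 2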

Step 7 — One day after the HelmRelease migration, remove any Flux v1 and Helm Operator footprint from the migrated cluster.

The fundamental outputs of this stage were the cluster migration plan and some custom tools (an image automation migration tool, a HelmRelease migration tool, a pull request automatic creation tool, etc.). The idea was to give the team the tooling needed to automate as much of the migration work as possible and to make life as easy as possible for users.

The design phase was a background task for four months. We had initially allocated less, but building tools in anticipation of the implementation took longer than expected. And finally, were those four months enough? Probably not.

Migration

The next stage was about executing the migration according to these steps. We moved forward cluster after cluster. Some of it went according to plan, but of course… not everything did. We’ll discuss this in more depth later on.

We had planned 3 months of work for this part of the project, but it took us 4.

Aftercare

No migration timeline is truly complete without aftercare. Migrating is certainly not the easiest part of the task, but we also planned to deliver improvements (who gets everything right on the first try?) and observability in a second phase. Everything is in flux after all, pun intended.

We planned for two separate milestones.

  • First and immediately after delivery, we had saved up 1 month of bandwidth for bug fixes, quick wins, and stability improvements.
  • We set the date for a thorough review 8 months later, once the dust had settled and the teams had experienced the new Flux.

As we’ve already hinted at, not everything went as expected. So let’s debrief.

The “heaven” part: what went well

Let’s first consider what went well and what we got right:

  • We had no downtime for our users. 🎉
  • The dev teams were kept in the loop through presentations and were actively involved in the migration during peer reviews.
  • The dev teams only had to validate the migration PRs. They didn’t have to write code, change any yaml file, or debug anything.

All of this thanks to the plan we defined!

What helped us here was that we didn’t focus exclusively on the technical aspect of the migration. We tackled it with a “project management” mindset, opening clear lines of communication with the dev teams.

The developer teams’ feedback was largely positive. “I appreciated the clear ownership and associated reporting to Slack,” said Guillaume, a member of the Single Page Application dev team. “It was reassuring to see that the project was considered major, that it was within the scope of a team and led by a Tech lead.”

The added benefit was that none of our changes came as a surprise. As Guillaume pointed out, we spent time “gathering the requirements in order to list the features to be supported, particularly in the way people deploy, and asking for feedback on the tech choices made and the impact of the changes on the service teams”.

The migration brought us the expected technical benefits. We benefited from major performance improvements compared with Flux v1 thanks to the reconciliation split by namespace, reducing our Time To Market for the features we publish.
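
One common way this split materializes in Flux v2 is one Kustomization per namespace (or per team), each reconciled independently, so a slow or failing namespace no longer blocks the others. A simplified sketch with an illustrative team name, not our actual layout:

    # One Kustomization like this per namespace/team
    apiVersion: kustomize.toolkit.fluxcd.io/v1beta2
    kind: Kustomization
    metadata:
      name: team-search
      namespace: flux-system
    spec:
      interval: 1m
      path: ./namespaces/team-search
      prune: true
      targetNamespace: team-search
      sourceRef:
        kind: GitRepository
        name: flux-system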

Flux v2 also helps us handle more use cases than Flux v1. For example, we now have better observability based on Kubernetes events.
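
Flux v2’s notification controller is one of the pieces that makes this possible: it can forward reconciliation events for selected resources to an external system. A minimal sketch, assuming a Slack webhook stored in a secret (all names are illustrative and not necessarily what we run):

    apiVersion: notification.toolkit.fluxcd.io/v1beta1
    kind: Provider
    metadata:
      name: slack
      namespace: flux-system
    spec:
      type: slack
      channel: team-deployments
      secretRef:
        name: slack-webhook-url   # secret containing an "address" key with the webhook URL
    ---
    apiVersion: notification.toolkit.fluxcd.io/v1beta1
    kind: Alert
    metadata:
      name: team-search
      namespace: flux-system
    spec:
      providerRef:
        name: slack
      eventSeverity: error        # only forward failed reconciliations
      eventSources:
        - kind: HelmRelease
          namespace: team-search
          name: '*'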

The planned aftercare was also successful. “The project didn’t stop with Flux v2 in production,” says Guillaume, “but continued to provide a fairly high and robust level of service for the teams, particularly on the monitoring side, where the challenge was to provide service-oriented information that masked flux tech details.”

Each team is indeed now responsible for their Flux resources, with a dedicated alerting system. We are no longer the single point of knowledge for all things Flux. This is not a direct benefit of Flux v2 of course, but migrating gave us the opportunity to improve more than just the technical aspects.

With that mindset, we have built a use-case-oriented monitoring and alerting system that isn’t just technical but speaks to the interests of its consumers.

Example of dashboard:

A section of the user-dedicated dashboard for Flux on Datadog

Example of alert:

Take a look at the body of the alert

We also provide SLOs! This helps us detect issues with the service and gives the teams visibility into the Flux status.

What didn’t go so well…

We achieved a lot, but we made some mistakes to learn from.

It took us 1 year and 3 months to go from the POC discovery to the end of the migration in Production. That was not the plan: we had originally budgeted about 4 to 5 months. This can partly be explained by internal organization challenges (resources, priorities, etc.), but that isn’t the full story.

There were three other major issues.

The first one was the inconsistency of our k8s clusters. We run 8 clusters, and none have the same purpose: a tools cluster for internal tooling, a preprod cluster for application testing, a production cluster, etc.

That made it difficult to design a consistent and reproducible migration procedure. We naturally had to wait until we were migrating the last one (Production) to find all the edge cases. It made the whole process less predictable, rooted in a trial-and-error approach that certainly took time. That part alone probably added several months to the project.

Flux v2 itself was also an unexpected time sink. First because Flux v2 is… not Flux v1. It certainly has the same mission, but its implementation is of course different. Its complexity (6 controllers, many Custom Resource Definitions — CRDs) compared to Flux v1 made the migration relatively painful, and the documentation wasn’t much help in alleviating that. Our sizing of each task became less accurate, since we couldn’t rely on our experience. Predictability was no longer achievable.

As an example, Image Update Automation in Flux v2 turned out to be difficult for many of our users to understand. We had to mitigate that by building our own deploy tool (similar to the fluxctl release command from Flux v1).

We were also doing most of the migration work while Flux v2 was in beta. This obviously wasn’t the most comfortable position: we often needed to backtrack and tackle unexpected breaking changes.

The new observability and alerting layer we implemented on top of Flux v2 was appreciated by consumers, but it created alert fatigue for the migrating team. We then had to transfer our knowledge of that layer to the consuming teams, which took a significant amount of time.

Our third critical issue was that we underestimated the size of the project from the start. The discovery was too superficial, not aligned with what was at stake with Flux v2, and we were left with a lot of uncertainties that we had to tackle on the fly.

The design phase regarding the different k8s clusters took much longer than expected. We needed to write extra tools to support the work: some we had planned for, but many more that we hadn’t anticipated.

At the end of the day, there was also a darker side to our project management medal. It certainly made our consumers happy, but it required extra work from the infrastructure teams involved, and we were not used to that. We needed to be extra cautious at every step so that we didn’t disturb the teams’ flow. Every environment migration was paired with a peer review for each impacted team (25 in Production!). That’s fine in hindsight, but it felt like a bottleneck while moving through the timeline.

So, are we happy with our migration?

Overall, yes we are. But it wasn’t the smoothest ride either.

Flux

Flux v2 does deliver a more complete and capable solution than Flux v1. It will be supported for years. And it helps us scale: we have now reached 1,300 HelmReleases across all clusters, with 750 in Production!

The complexity of its architecture made the migration slower than expected, and it continues to require extra care. However, Flux v2 matches our needs, and we manage 100% of our workloads with it.

If your service teams are used to Flux v1, they will be able to leverage Flux v2 similarly. You can design your migration plan with that continuity of service in mind.

If we had to do it all over again, would we do it the same way?

No. But what would we change then?

The team explicitly decided not to evaluate Flux alternatives before starting this migration, and to go with Flux v2 while it was still in beta.

If we had to do it all over again, we certainly would have invested more in the exploration and design phase, waited for Flux v2 to reach GA, and compared the tool against the rest of the industry (ArgoCD, for example). This would have kept us from hitting surprise roadblocks that put a dent in our timeline.

Allocating more space for tooling from the start would also have helped make the migration more predictable and comfortable.

All this would have helped us reduce the pressure on the team’s work balance.

Are we happy now?

Yes! Flux v2 provides faster Continuous Delivery for BlaBlaCar, with better reliability than we had with Flux v1. In that respect, choosing Flux v2 was good for the teams at BlaBlaCar.

Its overall complexity remains the big question. The benefits are obvious to us, but will the system be maintainable and understood long term? Time will tell.

Migrating to Flux v2

Should you need a few takeaways to migrate to Flux v2 on your side, here are some important tips.

First, invest in building tools to automate the migration of Flux manifests (image automation, HelmReleases). That’s the dull and repetitive part of the job, and your team will hate it by the end of the project.

Then, build scripts to automate PR creation, for the same reasons. Trust me.

Finally, use project management skills to follow the project, and invest in strong, constant communication with the impacted teams during the migration. It will be long, but happy consumers make (almost) all of the effort worth it.

Special thanks to the BlaBlaCar Engineering Productivity Team and Nicolas Salvy. And a big thanks to Guillaume Wuip and Benoit Rajalu for your review and for sharing your impressive blog poster / journalist tips.
