EXPEDIA GROUP TECHNOLOGY — ENGINEERING

Sailing uncharted waters with Spinnaker

How Expedia Group navigated our Spinnaker migration

Alexandra Spillane
Expedia Group Technology

--

Photo by Geran de Klerk on Unsplash

In mid-2020, the Expedia Group Delivery Platform team embarked on an effort to adopt an enterprise-grade continuous delivery tool to replace our existing in-house tool. After evaluating both external and internal solutions, the team selected the Netflix-developed open source continuous delivery tool, Spinnaker. Read on as we share insights into our Spinnaker general availability journey and some of the performance headwinds we encountered along the way.

Throughout 2020, the team prepared for the cutover to Spinnaker. We stood up a production Spinnaker instance on AWS infrastructure, and developed tooling to (where possible!) seamlessly migrate nearly ten thousand application pipelines from our in-house tool to Spinnaker. We also augmented existing publicly available documentation with in-house documentation. Towards the end of December 2020, we migrated over a subset of applications, to “dogfood” the solution.

By the beginning of 2021, the team was ready to make Spinnaker generally available. The next step: start migrating pipelines over en masse. Our goal was to complete the migrations by the last week of Q1 2021 and decommission the in-house tool in Q2 2021. We commenced migrations on 11 January 2021.

Photo by Max Brinton on Unsplash

Storms brew on the horizon

On 8 January 2021, Delivery Platform team members attempting to make changes to the migration automation noticed difficulties performing test migrations. The “save pipeline” phase of each migration was taking much longer than expected, before throwing a 400 Client Error back to the client. The first port of call to diagnose this behaviour was, of course, examining the Spinnaker logs.

We noticed two frequent problems in the logs. The first was that Fiat (Spinnaker's authorization service) was taking too long to perform one particular step during pipeline save operations:

  1. Client initiates a request to save a pipeline definition, including the roles which should be associated with the pipeline trigger, which is received and accepted by Gate (Spinnaker's API frontend).
  2. Gate initiates a request to Orca (Spinnaker's request orchestration service) to fulfil the request.
  3. Orca initiates a request to Front50 (Spinnaker's metadata repository service) to UPSERT an internal Spinnaker service account representing the pipeline trigger roles included in the original request.
  4. Front50 initiates a request to Fiat to "sync roles". That is, requesting that Fiat update its Redis cache (an ElastiCache instance) with all known users, roles and Spinnaker service accounts.
  5. Fiat processes the role sync, but...
  6. Before Fiat finishes the role sync, Front50 times out after 20 seconds and propagates the timeout error back up the stack. It is eventually bubbled up to the client as a 400 Client Error.
  7. Meanwhile, Fiat finishes the role sync successfully a few moments after the timeout.
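
Step 6 is the crux: the caller's deadline expires while the callee is still working, so the request fails even though the sync ultimately succeeds. The following minimal Java sketch (our illustration, not Spinnaker's actual code) reproduces that pattern with a 20-second client-side timeout and a 25-second "sync":

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class CallerTimeoutDemo {
    public static void main(String[] args) throws Exception {
        ExecutorService executor = Executors.newSingleThreadExecutor();

        // "Fiat": a role sync that takes roughly 25 seconds end to end.
        Future<String> roleSync = executor.submit(() -> {
            TimeUnit.SECONDS.sleep(25);
            return "role sync complete";
        });

        // "Front50": waits at most 20 seconds before giving up.
        try {
            roleSync.get(20, TimeUnit.SECONDS);
        } catch (TimeoutException e) {
            // This is the error that bubbles back up the stack and is
            // eventually surfaced to the client as a 400 Client Error.
            System.out.println("caller timed out; propagating an error");
        }

        // Meanwhile, the sync finishes successfully a few moments later.
        System.out.println(roleSync.get());
        executor.shutdown();
    }
}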

The second frequent problem was a duplicate key error encountered during the role sync (step 5 above). The particular user that the error referenced was always the same: jdoe (usernames have been changed to protect the innocent!). The error manifested in the logs as follows:

java.lang.IllegalStateException: Duplicate key jdoe (attempted merging values UserPermission(id=jdoe, accounts=[Account(resourceType=account, name=someaccount, cloudProvider=null, permissions=Permissions(permissions={}))], applications=[], serviceAccounts=[], roles=[Role(resourceType=role, name=somerole, source=null)], buildServices=[], extensionResources=[], admin=false) and UserPermission(id=jdoe, accounts=[Account(resourceType=account, name=someaccount, cloudProvider=null, permissions=Permissions(permissions={}))], applications=[], serviceAccounts=[], roles=[Role(resourceType=role, name=somerole, source=null)], buildServices=[], extensionResources=[], admin=false))

Making the case for lowercasing

The team temporarily mitigated the first issue by increasing timeouts throughout Spinnaker from 20 seconds to 30 seconds. This gave us a small amount of breathing room in which to investigate the second issue.

At Expedia Group (EG), our production Spinnaker instance integrates with a third-party authentication tool for single sign-on (SSO). This means that Spinnaker obtains group (“role”) information via SAML assertion. Consequently, Spinnaker never has a global or complete view of all the users and roles present in the organisation’s Active Directory (AD) domain. Rather, as each user logs in and provides Spinnaker with their particular list of assigned groups, Spinnaker’s Fiat microservice builds up a gradually more complete "picture" of the users and groups available. The role sync process mentioned above ensures that Fiat's backing store (Redis) is kept up-to-date as new information comes in.
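
To make that gradually more complete "picture" concrete, here is a rough Java sketch (our illustration only, not Fiat's actual implementation) of a cache that only ever knows about the users and roles it has been shown via login assertions:

import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch only: an authorisation cache can only ever "know"
// the users and roles it has been shown via SSO logins.
public class IncrementalRoleCache {

    // username -> all roles asserted for that user so far
    private final Map<String, Set<String>> knownUsers = new ConcurrentHashMap<>();

    // Called on each SSO login with the roles carried in the SAML assertion.
    public void recordLogin(String username, Set<String> assertedRoles) {
        knownUsers.computeIfAbsent(username, u -> new TreeSet<>())
                  .addAll(assertedRoles);
        // A subsequent "role sync" would push this picture into the backing
        // store (Redis, in Spinnaker's case) for use by authorisation checks.
    }

    // Only the roles seen so far, never the whole AD domain.
    public Set<String> knownRoles() {
        Set<String> roles = new TreeSet<>();
        knownUsers.values().forEach(roles::addAll);
        return roles;
    }
}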

When we looked up the particular user (jdoe) that was triggering our duplicate key error, we noticed that their canonical username in Active Directory was actually — unusually — JDoe (title-cased). In contrast, the vast majority of other EG users have lowercase usernames. We discovered that during the role sync process, Fiat deduplicates and then lowercases a list of usernames. This resulted in a "deduplicated" List that contained both jdoe and JDoe which, following normalisation, turned into a List containing two jdoe entries. This List was subsequently used as the keys for a Map, resulting in the key duplication error above.
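
The failure mode is easy to reproduce outside Spinnaker. The sketch below (an illustration, not Fiat's actual code) deduplicates before lowercasing, as Fiat did, and hits the same IllegalStateException; performing the lowercasing first avoids the collision:

import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

public class DuplicateKeyDemo {
    public static void main(String[] args) {
        // The same person, learned from SAML assertions under two casings.
        List<String> usernames = List.of("jdoe", "JDoe");

        // Buggy order: deduplicate, then lowercase. distinct() sees two
        // different strings, so both survive and later collide as map keys.
        try {
            usernames.stream()
                    .distinct()
                    .map(String::toLowerCase)
                    .collect(Collectors.toMap(Function.identity(), name -> "permissions for " + name));
        } catch (IllegalStateException e) {
            // Duplicate key jdoe (attempted merging values ...)
            System.out.println(e.getMessage());
        }

        // Fixed order: lowercase first, then deduplicate.
        Map<String, String> permissions = usernames.stream()
                .map(String::toLowerCase)
                .distinct()
                .collect(Collectors.toMap(Function.identity(), name -> "permissions for " + name));
        System.out.println(permissions.keySet()); // prints [jdoe]
    }
}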

We raised a pull request upstream to address this bug, whilst reconfiguring our SSO system to convert usernames to lowercase prior to building each SAML assertion. Subsequently, role syncs were able to run to completion without encountering duplicate key errors.

Taking on water: user-impacting performance problems loom

Meanwhile, although we had increased timeouts throughout the Spinnaker suite, we soon encountered more timeout errors. Every application migration we attempted required at least one or two retries before it succeeded. We watched as role sync times easily eclipsed 30 seconds and headed for 60 seconds on average. We re-applied our band-aid and increased timeouts again, and then again, up to 300 seconds.

Having noticed various recent fixes around role sync in Fiat, we elected to also bump up Spinnaker from 1.22 to 1.23. Unfortunately, this did not result in any appreciable improvement in the situation.

Then, towards the end of January, users began to experience random and intermittent permissions errors, coupled with PT1S errors in the Fiat logs: a sign that not only was role sync performing badly, but real-time authorisation was also impacted, causing spurious failures in any operation requiring authorisation. Such failures, whether in user-initiated operations or in automatic pipeline triggers, manifested in the UI as Access denied to application foo - required authorization: READ. This proved to be very confusing for users who knew they ought to have permission.

On 9 February 2021, we paused all pipeline migrations to Spinnaker to allow us time to remediate these issues.

Photo by Artem Verbo on Unsplash

Tacking into the wind with Spinnaker performance profiling

We took copies of Fiat's Redis cache, and also attached JVM profilers to the running Fiat instances in production, attempting to understand what was happening under the hood.

We discovered that EG’s AD domain contains tens of thousands of security groups and distribution lists in total. Some users belong to hundreds of groups. Through SAML assertions, Spinnaker had learned of around ten thousand of them — so far.

We also learned that Fiat role sync times increase with the number of unique username-role-service account combinations, and that every pipeline causes a new, unique service account to be created. In our case, every user is a member of a dedicated SpinnakerUsers group, and almost every pipeline uses the role SpinnakerUsers in its pipeline trigger(s). This causes an increasingly large amount of data to be read from and written back to Redis during role sync.

So far, Spinnaker had seen less than 15% of the total number of groups in the AD domain. We predicted that performance was not going to plateau at an acceptable level without intervention. So the team went back and forth on options for reducing the amount of data in Redis, and the time taken by Fiat to process that data.

We discovered that we were not the only organisation encountering problems, and that one organisation had contributed a PR upstream that let us configure more precisely when a role sync happened. Running this patch allowed us to perform application migrations successfully without incurring a full sync. It did not, however, address the fundamental data-scale issue causing our problems.

Rebuilding the boat while it’s sailing

We spent some time refactoring the Fiat Redis implementation, applying whatever performance improvements we could find. We raised a total of three PRs (so far) with these improvements (1, 2, 3), but none of them got us entirely "out of the woods".

In order to reduce the amount of data Fiat has to deal with during role syncs and real-time authorisation calls, we developed a new Spinnaker feature. Ordinarily, when managed service accounts have been enabled, Spinnaker creates a unique service account for every single pipeline. Instead, we modified the system to create one service account per unique combination of roles in a pipeline trigger. We called the feature "shared managed service accounts", and rolled it out in mid-February 2021. It cut the number of managed service accounts from 4,000 down to 17. Subsequent to the rollout of this feature, real-time authorisation call performance improved dramatically. We went from authorisation calls in excess of one second to a much healthier 20-30ms. This resolved issues users were having with random/intermittent permissions errors, and has been contributed upstream (1, 2).
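
Conceptually, the feature swaps a per-pipeline identity for an identity derived from the combination of roles in the pipeline trigger. The sketch below illustrates the idea; it is our simplification, and the account-name prefix and hashing choice here are assumptions rather than the upstream implementation:

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Set;
import java.util.TreeSet;

// Illustrative sketch only: derive one service account name per unique
// combination of trigger roles, instead of one per pipeline.
public class SharedServiceAccounts {

    // The same role set (in any order or casing) always yields the same name.
    public static String serviceAccountFor(Set<String> triggerRoles) {
        TreeSet<String> normalised = new TreeSet<>();
        triggerRoles.forEach(r -> normalised.add(r.toLowerCase()));
        String key = String.join(",", normalised);
        // Prefix chosen for illustration only.
        return "shared-managed-service-account-" + sha256(key);
    }

    private static String sha256(String input) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            StringBuilder hex = new StringBuilder();
            for (byte b : digest.digest(input.getBytes(StandardCharsets.UTF_8))) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }
}

Because the derived name depends only on the role combination, the thousands of pipelines whose triggers all use SpinnakerUsers collapse onto a single shared service account, which is how 4,000 accounts became 17.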

We also split Fiat's Redis cache out of the shared Spinnaker ElastiCache cluster into its own dedicated cluster. This instantly and impressively improved role sync times (they dropped from 5+ minutes down to less than one second). However, role sync times very quickly (and linearly) increased, as users logged in again and the new Redis cluster once more "learned" about their groups. Nevertheless, this did give us a "break glass in emergency" option. If necessary, we could simply flush or re-create the Redis cluster to bring things back under control.

Subsequent to these changes, we saw improved performance during migrations and during user interactions with the UI, and we recommenced migrations on 22 February 2021.

Land ahoy!

Finally, and most significantly, we decided to move Fiat from Redis to an SQL backing store. A number of other Spinnaker microservices (including Front50 and Clouddriver) offer SQL backends for environments with high availability and/or high performance requirements, and we had already adopted those backends. The only problem with this option was that an SQL backend for Fiat hadn't been written yet!

So we wrote a fiat-sql backend and contributed it upstream. However, after a few days on the new backend, we encountered more problems, this time around the performance of inserts into the database. The team rolled back to the Redis backend and investigated. This initial version of the SQL backend effectively used the same document structure as the original Redis backend: each user record contained large amounts of duplicated role and account data, resulting in 7.5 million very big records.

To combat this, we employed the classic tool of any relational database developer: normalisation. After normalising the schema, we still have 7.5 million records, but they are 7.5 million very small records. This change (1) has finally closed this chapter on Spinnaker authorisation performance.
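
As a simplified illustration of the difference (our sketch, not the actual fiat-sql schema), compare a document-style record that embeds full copies of every role and account it references with a normalised model that stores each resource once and links users to it by ID:

import java.util.List;

// Illustrative sketch only: contrasting the two storage shapes.
public class PermissionModels {

    // Document style, as the first SQL backend effectively stored it:
    // every user row embeds full copies of its roles and accounts, so the
    // same role and account data is duplicated across millions of rows.
    record DenormalisedUserPermission(String userId,
                                      List<Role> roles,
                                      List<Account> accounts) {}

    // Normalised style: each resource is stored once...
    record Role(String roleId, String name) {}
    record Account(String accountId, String name) {}

    // ...and user rows shrink to small ID pairs that reference them.
    record UserRole(String userId, String roleId) {}
    record UserAccount(String userId, String accountId) {}
}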

Photo by Vita Marija Murenaite on Unsplash

Clear skies ahead. Where to from here?

Having implemented the solutions and fixes outlined above, we have successfully migrated all application pipelines from our in-house tool to Spinnaker and decommissioned the previous tool, all within our original timeframe.

Successfully administering a new system in production is almost always challenging. It takes time for a team to develop those crucial operational “muscles” required to notice, diagnose and resolve problems. It’s especially difficult when faced with a complex, unfamiliar system and the requirement to migrate users and applications en masse from an existing system in a relatively short time frame.

We in the Delivery Platform team, many of us with backgrounds in infrastructure and operations, are no strangers to pernicious and impactful performance issues. Indeed, among the driving factors behind decommissioning our in-house pipeline tooling were its poor performance and its complex, sometimes nondeterministic behaviour, which made it immensely difficult to reason about.

In selecting a continuous delivery tool, we knew that Spinnaker was also complex. We also knew that, as we moved into the general availability phase, we would inevitably encounter operational “bumps in the road” (although we did not expect them to be quite so dramatic!). However, one of our greatest advantages was having access to an active, interested community of developers and fellow enterprise users. We engaged with that community in multiple fora (including the Spinnaker Slack workspace and the Spinnaker GitHub community) as we worked through the issues outlined here (and others!), and took opportunities to both benefit from others’ contributions and to contribute back to the community ourselves.

Before, we were adrift; as we surge forward with Spinnaker, we find ourselves positioned to take advantage of a global community of like-minded users and enterprises, enabling us to build a world-class travel platform atop a world-class continuous delivery platform.
