Dev Journal, Ed. 4

Rob Zienert
6 min read · Mar 12, 2018


Current status on 2018Q1: Acceptance. Taking the week off to unpack and decorate my new place! I’ve got a nice story for you today if you like them post mortems.

Redis & Dynomite Migration

We didn’t go to production with Dynomite last week like I was hoping to. A few things happened in our staging environment that I need to work through before we can even entertain flipping the switch again. Fortunate timing!

  1. Despite my best efforts to avoid non-deterministic commands, I forgot to switch an INCR to an app-generated UUID. This caused some quorum write failures: two different clouddriver servers could (and did) receive an operation request at the same time, and Dynomite would generate the same ID for both of them (see the sketch after this list). I’m still seeing some other non-quorum write errors after fixing this; I suspect I’m performing writes with SCAN results.
  2. I deployed a Dynomite-friendly ClusteredSortAgentScheduler, which is more efficient and correct than the ClusteredAgentScheduler. AgentSchedulers are responsible for scheduling which caches (ex: AWS security groups in the test account in us-east-1) get re-indexed and by which clouddriver server. Unfortunately, the switch wasn’t as seamless as I’d imagined and seems to have corrupted the cache. Since we red/black, I hadn’t considered that I’d be running two different schedulers at the same time with different algorithms. [insert half-laugh, half-crying here]
  3. That cache corruption didn’t fix itself. VERY SUSPICIOUS, given I wrote some “self-healing” magic into the cache. What I’d like to do, for this and other reasons, is have Clouddriver support zero-downtime re-indexing of a cache, or a subset of one, via an admin API.
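
On that first point, here’s a minimal sketch of the kind of change involved, assuming a hypothetical operation store in a Redis hash; clouddriver’s real keys and code differ. The point is that the ID is generated by the application, so concurrent requests on different servers can’t collide and the write is deterministic when Dynomite replicates it.

```kotlin
import java.util.UUID
import redis.clients.jedis.Jedis

// Hypothetical sketch: generate the operation ID in the app instead of asking
// Redis for one. INCR is a non-deterministic command under Dynomite's
// replicated writes (hence the quorum write failures); a UUID generated once
// by the caller is the same value everywhere.
fun nextOperationId(): String = UUID.randomUUID().toString()

fun saveOperation(jedis: Jedis, payload: String): String {
    // Before: an ID from jedis.incr("operations:counter")
    val id = nextOperationId()
    jedis.hset("operations", id, payload) // key name is illustrative
    return id
}
```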

I haven’t really had the time to roll out any of this because of a sweet Orca performance problem we had to work through.

Orca performance regression fun-times

Last year Rob Fletcher and I rewrote Orca’s work scheduling codebase from Spring Batch to our own purpose-built work queue library, since lifted into its own project called Keiko. While Keiko supports plugging in different queue implementations, the one we run is written atop Redis.

While developing Keiko we cut a few corners so we could deliver, as one does. One of the corners we cut was our message locking algorithm, which had a serious performance regression that lay dormant for nearly a year. We knew this would be a problem eventually, but our benchmarks showed it wasn’t a blocker for getting things initially released. When cutting corners, do remember to go back and un-cut them before they become a problem!

The performance regression manifested as the work queue’s ready messages growing (“ready” meaning a delivery time of now or in the past) while the worker thread pool wasn’t filling up. For end users, this shows up as Spinnaker’s pipeline and task executions being slow. One user reported, “My deploy pipeline usually takes about 2 minutes, but today it’s taking 6 minutes.” Yiiiiiiikes.

Every time the QueueProcessor performs a poll, we peek the oldest ready message via a ZRANGEBYSCORE, and we do this every 10ms on every Orca instance. After peeking the message, each Orca instance tries to acquire a lock via SET NX/EX, then finally reads the message off the queue for processing. We did all of this via individual Jedis commands.
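
In sketch form, that pre-fix flow looks something like the following. This isn’t Keiko’s actual source; the key names, lock TTL, and handler are placeholders. What matters is that peek, lock, and read are three separate round trips to Redis.

```kotlin
import redis.clients.jedis.Jedis
import redis.clients.jedis.params.SetParams

// Simplified sketch of the pre-fix poll path (not Keiko's actual code). Because
// peek, lock, and read are separate Redis calls, two Orca instances can
// interleave between any of the steps.
fun pollOnce(jedis: Jedis, queueKey: String, handle: (id: String, body: String) -> Unit) {
    val now = System.currentTimeMillis().toDouble()

    // 1. Peek the oldest ready message (delivery time of now or in the past).
    val messageId = jedis.zrangeByScore(queueKey, 0.0, now, 0, 1).firstOrNull() ?: return

    // 2. Try to take a lock on the message ID (SET with NX and a short expiry).
    val locked = jedis.set("$queueKey:locks:$messageId", "1", SetParams().nx().ex(30))
    if (locked != "OK") return // another instance won the race for this message

    // 3. Only now read the message body and hand it off for processing.
    val body = jedis.hget("$queueKey:messages", messageId) ?: return // body may already be gone
    handle(messageId, body)
}
```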

As we’ve grown, we’ve gone from 3 Orca instances, to 4, then 6, 8, 10… The growth really started to take off a few weeks ago and we still weren’t keeping up with the backlog of ready messages. We even tried scaling to 15 Orca instances for about 10 minutes for “science”, despite realizing this probably wouldn’t help. It didn’t.

Then, because I’m brilliant [sarcasm], I decided to start peeking multiple ready messages and looping over them until a lock was acquired. This didn’t help; in fact, it made things way worse. OK, what? I’ve just made it possible to avoid lock contention, right?! We can safely write off Redis being slow, because Orca is not a hungry hungry hippo. What then?

Looking through the logs after my change, we saw a huge uptick in an error message along the lines of “could not read message body for: {UUID}”. That’s bad. The message itself isn’t necessarily bad, but an increase in quantity is. Somehow, the queue was getting a lock on a message ID that didn’t have a body. In practice, this isn’t a data loss problem, or a sign that Spinnaker broke a pipeline or anything: message bodies are cleaned up once the message is handled. What was actually happening is that one Orca instance was able to peek the message, acquire a lock, read the message, fully handle it, and clean up before another Orca instance even had a chance to move from peeking the message ID to acquiring a lock. The cleanup phase of a message explicitly deletes the lock key for use cases like re-delivery. So a message ID would get peeked, then a lock would be acquired on a message that no longer exists, which, yes… you can set that lock, no problem.
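
Laid out step by step, the interleaving looks something like this (the message ID and key names are made up):

```
Orca A: ZRANGEBYSCORE        -> peeks message 42
Orca B: ZRANGEBYSCORE        -> peeks message 42
Orca A: SET lock:42 NX EX    -> acquires the lock
Orca A: reads the body, handles the message, and cleanup
        deletes both the body and lock:42
Orca B: SET lock:42 NX EX    -> succeeds, because the lock key is gone
Orca B: read body of 42      -> nothing there: "could not read message body for: 42"
```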

As part of another cut corner, we didn’t try to poll the queue more than once per cycle, so if the processor fails for whatever reason on a poll cycle, we’ve totally lost that cycle. Effectively, by adding more Orca instances and adding multiple-read cycles, we were making the problem worse.

The ultimate fix was to rewrite Keiko’s message read / polling logic in Lua, so that the peek, lock, and read operations are performed as a single atomic operation. Once that was deployed, the ready messages dropped to 0.
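
Here’s a sketch of the shape of that fix, not the actual Keiko script: the peek, the lock, and the body read all run inside one EVAL, so another Orca instance can never sneak in between those steps. The key names, lock TTL, and return shape are all placeholders.

```kotlin
import redis.clients.jedis.Jedis

// Sketch of the idea: one Lua script so peek + lock + read happen atomically.
private val ATOMIC_POLL = """
    local queueKey    = KEYS[1]
    local messagesKey = KEYS[2]
    local now         = ARGV[1]
    local lockTtl     = ARGV[2]

    -- Peek the oldest ready message.
    local ids = redis.call('ZRANGEBYSCORE', queueKey, 0, now, 'LIMIT', 0, 1)
    if #ids == 0 then return nil end
    local id = ids[1]

    -- Lock it; if another caller already holds the lock, hand back nothing.
    if not redis.call('SET', queueKey .. ':locks:' .. id, '1', 'NX', 'EX', lockTtl) then
      return nil
    end

    -- Read the body under the same lock, inside the same atomic script.
    local body = redis.call('HGET', messagesKey, id)
    if not body then return nil end
    return { id, body }
""".trimIndent()

fun atomicPoll(jedis: Jedis, queueKey: String): Pair<String, String>? {
    val reply = jedis.eval(
        ATOMIC_POLL,
        listOf(queueKey, "$queueKey:messages"),
        listOf(System.currentTimeMillis().toString(), "30")
    ) as? List<*> ?: return null
    return (reply[0] as String) to (reply[1] as String)
}
```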

While I was at it, I also changed the QueueProcessor behavior to fill up the worker thread pool every poll cycle, rather than processing a single message every interval. (This is turned off by default in Orca, but you can enable it via keiko.queue.fillExecutorEachCycle=true.) This change really put some zoomies on our queue throughput and was simple to do. There are more efficient ways to do it, but this was low impact.
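
Conceptually, the change amounts to something like this sketch (not Orca’s actual QueueProcessor; the capacity check and handler are simplified):

```kotlin
import java.util.concurrent.ThreadPoolExecutor

// Sketch of the fill-the-pool idea: instead of handing one message to the
// worker pool per 10ms tick, keep polling until the queue is empty or the
// pool has no spare capacity left.
fun pollCycle(
    executor: ThreadPoolExecutor,
    poll: () -> Pair<String, String>?,          // e.g. the atomic poll above
    handle: (id: String, body: String) -> Unit
) {
    var spare = executor.maximumPoolSize - executor.activeCount
    while (spare > 0) {
        val (id, body) = poll() ?: break // nothing ready right now; stop this cycle
        executor.execute { handle(id, body) }
        spare--
    }
}
```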

Of course, fixing a performance bottleneck in one place just moves it somewhere else. Improving Orca ended up causing downstream performance problems for Clouddriver, because now we’re sending it a lot more work at any given moment. As my old boss used to say, “That’s Future Rob’s problem. You hate Future Rob.” At least we now have a system we can dial back if Clouddriver truly can’t handle the load, which gives us a little breathing room to make the changes we need in Clouddriver without an immediate panic.

As part of all this work, Fletcher also added a new metric that is definitely a key performance indicator for Orca: queue.message.lag. Lag is defined as the amount of time a message has been in a ready state but has not yet been processed. I’m surprised we weren’t already tracking this, but here we are. Here’s the last week: average in blue, max in red.

Since you can’t even see the blue, here’s a zoom of the last two days:

1–2 seconds max isn’t bad. 20.7 seconds is bad. We’ll be changing this metric to a histogram and alerting off the p99.
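
The measurement itself is simple. Here’s a minimal sketch of it (not the actual Keiko/Spectator wiring), assuming a timer-style metric named queue.message.lag:

```kotlin
import java.time.Duration
import java.time.Instant

// Minimal sketch: lag is the time between a message becoming deliverable
// ("ready") and a worker actually picking it up.
fun recordQueueLag(deliveryTime: Instant, record: (metricName: String, lag: Duration) -> Unit) {
    record("queue.message.lag", Duration.between(deliveryTime, Instant.now()))
}
```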

Other Takeaways

We’ve been talking for a long time about building up some performance tests for Spinnaker. We’ve got a loadtest env, which roughly mirrors our production deployment, but no real way to generate representative load. There’s a performance test we can use to blast an environment (and actually spin up & tear down real AWS and Titus infrastructure), but that’s expensive.

We’ll need to come up with a better solution. Mocking AWS & Titus seems like the right thing to do, but they’re hardly simple systems to mock.

I could just keep load testing with my awesome pipeline…

“Load testing” Orca

Etc

I haven’t made any progress on the scaling Spinnaker articles; I know there’s a good chunk of people out there waiting. Until then, um, work on getting your Spinnaker services reporting metrics and aggregating logs. You’ll need them for when we get into it.

It’s really surprising to me that most people in the Spinnaker community don’t seem to collect metrics or logs. I’m not yet sure why that is.
