Interview Questions in Real Work

by Kent Ross, Software Engineer at Cisco ThousandEyes

I’m sure most of us, at some point in our software engineering careers, have had this thought while going through yet another interview:

“It’s not like I’d ever solve a problem like this during my real work!”

I certainly have had that thought. And most of my days spent configuring our Kubernetes deployments or updating our database models bear very little resemblance to interviews asking me to build an LRU cache from scratch; any good engineer would simply use the Guava library for it, or failing that, Google the algorithm and call it a day.

However, I have found that some of the more interesting problems we’ve worked on for the Alerts and Integrations team have, in fact, been what I would call “interview-esque.” Today, I’d like to walk through a couple of these, as well as discuss the creative solutions we found for them.

Dynamic Baselines Using Weighted Linked Lists

How to store a sliding window of data with configurable compression

Many of our customers’ Alert Rules are set up with numeric thresholds against specific network metrics. For example: “Alert me if the response time of a ping to google.com from my enterprise servers is greater than 100 milliseconds.”

However, the usefulness of such a rule is limited if you want to include servers from all over the world and the target’s DNS routing is not evenly distributed globally. A 100ms ping time from a server in San Francisco to a target in San Jose likely represents a problem; the same cannot be said for a server in Sydney.

Enter Dynamic Baselines. Rather than alerting on a fixed number, the customer can now set up a rule like “Alert me if the response time of a ping to google.com is 2 standard deviations above the average over the last 4 hours for this enterprise server.”

This is much more useful for customers that have a geographically diverse presence. However, the challenge is that the alerting infrastructure now needs to track and calculate the average and standard deviation of all of the network data we receive. In addition, a product requirement was to have the ability to calculate this for several different time windows, like the last 15 minutes, 1 hour, 4 hours, 24 hours, and so on.

A simplistic approach would be to maintain some sort of store of every datapoint we receive over 24 hours. For latency reasons, the ideal way to do this would be to hold the data in Heap space; however, this was very quickly determined to be impractical at scale. The Alerting system at present receives roughly 20,000 data points per second. Even assuming each data point could be compressed to occupy only 32 bytes of memory, we would need roughly 55GB of Heap space to store 24 hours’ worth of data at once. While not impossible, this would require dedicating several times the resources the alerters currently consume to just this one system.
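For those who like to check the math, the back-of-envelope arithmetic behind that estimate works out roughly like this (using the rough figures quoted above):

```java
public class StorageEstimate {
    public static void main(String[] args) {
        // Rough back-of-envelope check of the raw-storage estimate above.
        long pointsPerSecond = 20_000;
        long bytesPerPoint = 32;              // optimistic, fully compressed datapoint
        long secondsPerDay = 24L * 60 * 60;   // 86,400 seconds in 24 hours
        long totalBytes = pointsPerSecond * bytesPerPoint * secondsPerDay;
        System.out.println(totalBytes);       // 55,296,000,000 bytes, i.e. roughly 55GB of Heap
    }
}
```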

The next approach we explored was to possibly store the data on a storage solution that could handle this volume, and potentially also do some of the aggregations for us. We researched several third-party tools, but unfortunately these solutions ended up not being a good match for the specific problem we wanted to solve.

So what we needed was an algorithmic solution that would drastically reduce the storage requirement for our data while still producing reasonably accurate averages and standard deviations. I did a fair bit of research and ran across several articles that discussed maintaining the average and standard deviation of an accumulating dataset; however, these solutions didn’t address the primary difficulty: if we pre-aggregate the data, we lose the ability to evict the oldest data out of any given fixed window of time, like the “last 1 hour.”

The solution we eventually came up with is a hybrid approach, where we store fixed chunks of data that are pre-aggregated over blocks of time. Let’s walk through how this works.

First, for a small number of datapoints, we simply store each datapoint. If we want to query for the last 5 minutes, we iterate through this Linked List of raw datapoints and apply the normal formulas for average and standard deviation.

If we want to query for the last 10 minutes, we look at an additional bucket of data. The 5 data points from T=11 to T=15 are pre-aggregated and stored as 3 numbers per metric: the sum, the power sum, and the count. Using these three numbers, we can calculate the average and standard deviation of the past 10 minutes.
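Roughly, a pre-aggregated bucket could look something like this (the class and field names are illustrative, not our production code):

```java
// Illustrative sketch of a pre-aggregated bucket: three numbers per metric are
// enough to recover the exact average and standard deviation of the datapoints
// that have been folded into it.
class Bucket {
    double sum;        // sum of all values in the bucket
    double powerSum;   // sum of the squares of all values
    long count;        // number of values aggregated into the bucket
    long endMinute;    // T of the newest datapoint in the bucket (used for eviction later)

    void add(double value) {
        sum += value;
        powerSum += value * value;
        count++;
    }

    // Buckets (and raw datapoints treated as single-element buckets) combine by
    // simply adding their components together.
    void merge(Bucket other) {
        sum += other.sum;
        powerSum += other.powerSum;
        count += other.count;
    }

    double mean() {
        return sum / count;
    }

    double stdDev() {
        double avg = mean();
        // Population variance: E[x^2] - (E[x])^2
        return Math.sqrt(powerSum / count - avg * avg);
    }
}
```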

When we want to look even further back, say 20 minutes, we just step back one additional bucket, where we pre-aggregated 10 minutes’ worth of data, and apply the same formulas.
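In code, a query over any window is just a merge of the buckets that cover it, something like this (again, purely illustrative):

```java
import java.util.List;

class BaselineQuery {
    // Combine the newest buckets (raw datapoints can be treated as single-element
    // buckets) until the requested lookback is covered, then read the statistics.
    static Bucket aggregate(List<Bucket> bucketsNewestFirst, int bucketsToInclude) {
        Bucket result = new Bucket();
        for (int i = 0; i < bucketsToInclude && i < bucketsNewestFirst.size(); i++) {
            result.merge(bucketsNewestFirst.get(i));
        }
        return result;   // result.mean() and result.stdDev() give the window's statistics
    }
}
```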

When we receive another new datapoint for T=21, we kick off a compression that aggregates the data from T=16 to T=20 into a new pre-aggregated bucket. When the tail bucket is over 24 hours old, we simply remove that bucket from the list.
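A sketch of that compression and eviction step, reusing the Bucket class above (the structure and names are again illustrative):

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

class SlidingWindow {
    private final Deque<Bucket> buckets = new ArrayDeque<>();  // head = newest bucket
    private static final long WINDOW_MINUTES = 24 * 60;

    // Fold a batch of raw datapoints (e.g. T=16 to T=20) into one pre-aggregated bucket.
    void compress(List<Double> rawPoints, long endMinute) {
        Bucket compressed = new Bucket();
        rawPoints.forEach(compressed::add);
        compressed.endMinute = endMinute;
        buckets.addFirst(compressed);
    }

    // Drop tail buckets that have fallen entirely outside the 24-hour window;
    // once removed they are de-scoped and eligible for garbage collection.
    void evictExpired(long currentMinute) {
        while (!buckets.isEmpty()
                && buckets.peekLast().endMinute < currentMinute - WINDOW_MINUTES) {
            buckets.removeLast();
        }
    }
}
```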

As we go back further in time, the size of the pre-aggregated buckets grows. With our current implementation, this allows us to store 24 hours of data in just 20 or so buckets, rather than the 1440 we would need to store each minute individually. A fully populated 24 hours of data looks like this, with T in minutes:

When the next Round comes in at T = 1441, the entire linked list shifts:

And the oldest bucket for T = 1 to 360 falls off and is de-scoped for eventual garbage collection.

The drawback is that, in some cases, the accuracy of the time window is reduced; for instance, after the compression shown in the last diagram, a query for the last 24 hours actually returns the average over the last 18 hours and 1 minute, because we de-scoped the oldest 6 hours of data. This seemed like a reasonable trade-off, as these sorts of alerts are typically concerned not with minute-by-minute precision but with ongoing issues or trends over time. In addition, the exact number and size of the buckets is easily configured to match the precision needs of the customers using this feature.

Sticky Agents Using Combinatorics and Bit Arithmetic

How to find the intersection of multiple large sets with low computational complexity

A key feature of our Alert Rules is the ability to specify that you only want to be alerted if servers in multiple locations are violating a set of conditions across multiple runs of the same test. Internally, the term we use for a machine in a distinct location is an Agent, and the results of a test run during a fixed time period are called a Round.

So, for instance, a customer might want to set up a rule that says “Alert me if the response time to google.com is > 100ms from 3 of my agents, 3 out of 3 rounds.” This allows customers to narrow their alerts down to problems that are persistent and/or widespread, while ignoring situations that are transient.

A feature we added in the last year makes this even more sensitive to persistent problems: customers can now set up the Alert Rule to say “Alert me if the same 3 agents…” rather than any 3 agents within the set, a behavior we internally call Sticky Agents. This allows customers to set up a single rule that covers hundreds of servers without needing to worry about the alerts becoming too noisy to be useful.

Initially, we thought this would be a pretty straightforward implementation. If a rule said the same N agents, X out of Y rounds, then we would iterate through our Y rounds stored in memory, and for each ‘base’ round we would intersect the triggering agents with each other round. If a given intersection still contained at least N triggering agents, we would increment a counter. If the counter reached X, we would know we had fulfilled the criteria.

In this illustration, we show that the intersection of round 1 with round 2 has 3 agents, and round 1 with round 3 has 4 agents. Along with the original round 1, that fulfills our requirement of the same 3 agents, 3 out of 3 rounds (in this case agents 1, 2, and 4), so we can trigger the alert rule described in our original example.
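In sketch form, that first counter-based attempt looks roughly like this (class, method, and variable names are hypothetical):

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class StickyCheckFirstAttempt {
    // rounds: for each of the last Y rounds, the set of agent IDs that triggered.
    static boolean isSatisfied(List<Set<Integer>> rounds, int minAgents, int minRounds) {
        for (Set<Integer> base : rounds) {
            int matchingRounds = 0;
            for (Set<Integer> other : rounds) {
                Set<Integer> common = new HashSet<>(base);
                common.retainAll(other);              // agents triggering in both rounds
                if (common.size() >= minAgents) {
                    matchingRounds++;                 // the base round intersected with itself counts too
                }
            }
            if (matchingRounds >= minRounds) {
                return true;                          // looks satisfied... but is it really?
            }
        }
        return false;
    }
}
```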

However, this approach had a subtle bug that we didn’t notice at first. Can you spot it?

The issue comes when we have a dataset like this:

Our counter still says we have 3 separate intersections with Round 1 that meet our minimum, so we would trigger this alert. However, the intersections of Rounds 1:2 and 1:3 do not match each other. In this case, only Agents 1 and 4 are triggering on all 3 rounds. So we had to go back to the drawing board for an algorithm that accounts for this possibility.

Our first thought was that, since iterating through rounds didn’t work out, we could drive the check from the other direction: generate every possible combination of agents, then check whether each combination triggers in enough rounds. However, as mentioned earlier, a given Alert Rule can involve hundreds of agents, so the number of agent combinations to check grows exponentially (there are 2^N possible subsets of N agents), which is clearly outside the bounds of acceptable.

However, the number of rounds that an Alert Rule can span is limited to 10 by design. So as long as we can make the computational cost of comparing rounds to each other cheap, then we could simply generate every combination of rounds that could potentially be sufficient to alert, and then see if any of the combinations had sufficient overlap in triggering agents. Generating the combinations of rounds was easily solved by using the Apache commons-math library CombinatoricsUtils. However, at our maximum potential combinations for something like “5 out of 10 rounds,” this still requires 252 different combinations to be examined, and doing something like a Set intersection of 5 separate sets is still somewhat costly, as it becomes a large constant multiple of O(N). Can we possibly do better?
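A rough Set-based sketch of that combination-driven check (only CombinatoricsUtils comes from commons-math; the rest of the names are illustrative):

```java
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Set;
import org.apache.commons.math3.util.CombinatoricsUtils;

class CombinationStickyCheck {
    // Examine every combination of X rounds out of the last Y; the rule is met
    // if any combination shares at least N triggering agents across all its rounds.
    static boolean isSatisfied(List<Set<Integer>> rounds, int minAgents, int requiredRounds) {
        Iterator<int[]> combos =
                CombinatoricsUtils.combinationsIterator(rounds.size(), requiredRounds);
        while (combos.hasNext()) {
            int[] combo = combos.next();
            Set<Integer> common = new HashSet<>(rounds.get(combo[0]));
            for (int i = 1; i < combo.length; i++) {
                common.retainAll(rounds.get(combo[i]));   // intersect with each round in the combination
            }
            if (common.size() >= minAgents) {
                return true;
            }
        }
        return false;
    }
}
```

This fixes the correctness problem, but each retainAll still walks an entire agent Set for every one of those 252 combinations, which brings us back to the question above.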

In fact, we can! The final optimization we made was to store our set of triggering agents as a BitSet rather than as a Set object. Finding the intersection then becomes a bitwise AND, which runs in O(1)* time rather than O(N). This allowed us to cut the computation time for even extremely large agent sets down to single-digit milliseconds.

* Technically this can be more than O(1), as a BitSet is internally represented as an array of 64-bit longs called words. Each word can be combined in O(1), but if the BitSet spans multiple words, computing the AND or the cardinality takes O(words) time. In practice, most Alert Rules involve fewer than 64 agents, so the amortized cost should be O(1).
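A minimal sketch of the BitSet version of the intersection, assuming agent IDs have already been mapped to bit indices (the surrounding combination loop is the same as before; names are illustrative):

```java
import java.util.BitSet;
import java.util.List;

class BitSetIntersection {
    // Each round's triggering agents are stored as a BitSet with one bit per agent,
    // so intersecting the rounds of a candidate combination is a word-wise AND.
    static boolean enoughCommonAgents(List<BitSet> roundsInCombination, int minAgents) {
        BitSet common = (BitSet) roundsInCombination.get(0).clone();  // copy so the round itself isn't mutated
        for (int i = 1; i < roundsInCombination.size(); i++) {
            common.and(roundsInCombination.get(i));
        }
        return common.cardinality() >= minAgents;  // agents that triggered in every round of the combination
    }
}
```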

What are some problems like this that you have encountered? Tell us about them by applying on our Careers page!
