The 24 Hours of “I’ve killed the company”

What does it feel like to think you’ve just obliterated $14M of VC cash?


Answer: it feels slightly better than thinking that you’ve totally screwed your co-founders and employees via your ineptitude.

At Allegro Systems (2000-2001) we were building a very high-speed and high-scale hardware-based IPsec product — on a rocket-ship timeline. Starting-gun to working-at-scale in nine months. We probably spent about two weeks nailing down the hardware + microcode + software architecture, and then split up for a week or so to get the sub-sections documented and pseudo-coded. After that, we were designing, coding, hiring, simulating, debugging, etc. Not much time to ensure perfection up-front, which is a little dangerous with hardware …

We were doing some tricky packet classification at gigabit speeds, and we decided to use TCAMs to do matching (search) on up to five header fields. TCAMs are awesome — ours could simultaneously compare 128k 128-bit words against the search word and produce an ordered search result in nanoseconds (this was 14 years ago!). TCAMs have mask bits, and work great for exact match and subnet matches — typical networking stuff. It seemed like the right choice to handle big policy statements, and also to index the destination tunnel for each packet.

TCAMs are less perfect at matching ranges. Pretend for a moment that someone gives you a cardboard cut-out of a triangle, and you have to cover it with masking tape. You lay down a series of different-length tape strips adjacent to each other until the entire triangle is covered in tape. Each “strip of tape” from the analogy eats an entry in the TCAM (we had 128k entries), because a single value-plus-mask entry can only cover a power-of-two-sized, aligned block of values. So if your policy rule has something like “UDP source port < 1000”, then that rule eats multiple TCAM entries. A simple IP + mask rule would eat only one TCAM entry. We knew this, and we were okay with it. We thought most rules would be IP + mask, with some port matching.
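To make the masking-tape picture concrete, here is a minimal Python sketch of one common way to cut a port range into value/mask "strips". It is an illustration only, not the code we shipped; the function name `range_to_prefixes` and the 16-bit default width are assumptions made for the example.

```python
def range_to_prefixes(lo, hi, width=16):
    """Cover the inclusive range [lo, hi] with ternary (value, mask) entries.

    A mask bit of 1 means "compare this bit"; 0 means "don't care".
    Hypothetical helper for illustration only.
    """
    full = (1 << width) - 1
    entries = []
    while lo <= hi:
        # Largest power-of-two block that is aligned at lo...
        size = (lo & -lo) if lo else (1 << width)
        # ...shrunk until it no longer overshoots hi.
        while size > hi - lo + 1:
            size >>= 1
        entries.append((lo, full & ~(size - 1)))
        lo += size
    return entries


print(len(range_to_prefixes(80, 80)))     # exact match: 1 entry
print(len(range_to_prefixes(0, 999)))     # "source port < 1000": 6 entries
print(len(range_to_prefixes(1, 65534)))   # a nasty 16-bit range: 30 entries
```

The last case is the kind of range that hurts: on a 16-bit field, a single range can need up to 30 strips with this style of expansion.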

Fast-forward many months. We were killing ourselves, but everything was going great. We were actually ahead of schedule. Hardware and software were executing at speed. Then the QA lead came to me and told me that a tester-authored policy would not load in the system. I looked at the policy and said “of course”. The policy was rife with ranges. So here’s what I haven’t told you yet about TCAMs: if your policy has two ranges, say one on UDP source port and the other on UDP destination port, then the number of TCAM entries consumed ends up being X * Y, where X is the number of “masking tape strips” for source port, and Y is the number of “strips” for destination port. And if you also have a range on another field, then it is X * Y * Z: the TCAM entry consumption grows multiplicatively! This policy had enough ranges on enough fields to consume the entire TCAM.
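The X * Y * Z arithmetic is easy to play with. The sketch below is a back-of-the-envelope calculator, not a model of the real device: it assumes “128k” means 131,072 entries, and the strip counts fed into it are made-up numbers, not figures from the tester's policy.

```python
from math import prod

# Assumption: "128k entries" is taken to mean 128 * 1024.
TCAM_ENTRIES = 128 * 1024


def entries_for_rule(strips_per_field):
    """One rule costs the product of the strip counts of its fields.

    A field matched exactly or by IP + mask counts as 1 strip.
    """
    return prod(strips_per_field)


def policy_fits(rules, capacity=TCAM_ENTRIES):
    """True if the summed cost of all rules fits in the TCAM."""
    return sum(entries_for_rule(r) for r in rules) <= capacity


print(entries_for_rule([6, 30]))            # two ranges: 180 entries
print(entries_for_rule([30, 30, 30]))       # three bad ranges: 27000 entries
print(policy_fits([[30, 30, 30]] * 4))      # True  (108000 entries)
print(policy_fits([[30, 30, 30]] * 5))      # False (135000 entries)
```

With numbers like these, a handful of range-heavy rules is enough to exhaust a table that would otherwise hold on the order of a hundred thousand simple IP + mask rules.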

The QA lead gave me that look: the look a QA person gives to a bullshit engineer when the engineer says “that’s how it is supposed to work”. Oh crap. I grabbed a co-founder (lead hardware architect) and we went into my office. All of a sudden, we started having serious doubts about our initial design assumptions, namely the nature of IPsec policies that we’d see in the field. We freaked out. I almost threw up in my trash can.

It was late in the day, but we started to reach out to people who would know more than us about existing use-cases — calls we should have made at the beginning. In 2001, people were not glued to smart phones. There was an actual time delay in people getting back to you. The web wasn’t much help back then, either. It was now night time. What followed was a horrible, anxious night of waiting, a night with zero sleep.

The next day, we began to hear back: our initial assumptions weren’t wrong. Our anxiety level dropped a bit. During the peak of the panic, we actually invented a mitigation scheme. We could limit the growth of the policy in the TCAM, and beyond that limit, the TCAM-based matching would yield a partial result. The packet-processing software could, with reasonable efficiency, finish up the classification using arithmetic. We went ahead and implemented the scheme, and it would do about 150 Mbps, while a TCAM-only policy would do 1 Gbps.
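For the curious, the general shape of such a hybrid scheme can be sketched in a few lines of Python. This is a loose reconstruction of the idea only, under assumptions of my own: the `Entry` layout, the `residual` range checks, and the software walk over candidate matches are all hypothetical, and the real split between hardware and software was more involved.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    value: int      # bits to match
    mask: int       # 1 = compare this bit, 0 = don't care
    action: str
    # Ranges deliberately left out of the TCAM: field name -> (lo, hi).
    residual: dict = field(default_factory=dict)


def tcam_lookup(table, key):
    """Simulated TCAM: yield matching entries in priority order."""
    for entry in table:
        if (key & entry.mask) == (entry.value & entry.mask):
            yield entry


def classify(table, key, fields):
    """Take the TCAM's partial result and finish with integer compares."""
    for entry in tcam_lookup(table, key):
        if all(lo <= fields[name] <= hi
               for name, (lo, hi) in entry.residual.items()):
            return entry.action
    return "default"


table = [
    # 10.0.0.0/8 in the TCAM, two port ranges checked in software.
    Entry(0x0A000000, 0xFF000000, "protect",
          {"sport": (0, 999), "dport": (5000, 6000)}),
    # Plain IP + mask rule, nothing left for software to do.
    Entry(0x0A000000, 0xFF000000, "bypass"),
]

print(classify(table, 0x0A010203, {"sport": 500, "dport": 5500}))   # protect
print(classify(table, 0x0A010203, {"sport": 2000, "dport": 5500}))  # bypass
print(classify(table, 0xC0A80001, {"sport": 500, "dport": 5500}))   # default
```

The trade is visible even in the toy version: the first rule costs one TCAM entry instead of a product of strip counts, and the price is a few compares per packet in software, which is where the throughput goes.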

Jumping forward in time again, we were acquired by Cisco, and the technology was successfully deployed all over the place. This issue rarely came up. We added a tweak here or there when it did, and we were fine.


Lessons I learned:

  • Try to force your paranoia moments to happen up front. For us, we should have dug deeper on policy requirements at the beginning. However, analysis-paralysis at a start-up is death — you must assume and manage a quantum of risk.
  • You can be carrying the emotional equivalent of a “ton of bricks” without thinking about it. When you stumble a little, you get to feel it full-force. It was “heavy” for us, even when we weren’t aware of the heaviness.
  • Horrible stress will drive you to find a way out so that you don’t die. The mitigation scheme we came up with overnight was pretty cool. I doubt we would have invented it without the gun pointed at our heads (metaphorical, of course!).
