The history of infrastructure at Zendesk (Part 2) — the messy middle

In my last post interviewing Zendesk’s co-founder, Morten Primdahl, we talked about Zendesk’s infrastructure development from 2006 to 2012, when we started growing like crazy and put out the first burning fires. This time, I’ll be looking at 2012 to 2017. This period of Zendesk’s history taught us that infrastructure is not static — you have to constantly assess whether you’re still on the right path, based on the environment around you. Further, you can’t change everything at once!

So, let me take you back to 2012. Ruby was still young as a language. Nobody believed it could scale, and Twitter’s fail whale was doing a good job of reminding people of that. In the front-end world, React didn’t exist; the battle was on between Backbone, SproutCore and AngularJS. At Zendesk, we were still a single-product company running entirely on Rackspace.

Zendesk’s physical servers in one of our data centers

Our data center chapter

That year, we made the decision to migrate out of Rackspace into our own data centers. We had concluded this was needed to take control of our own destiny. We built our first data center in Sacramento, and then expanded to Dublin to appeal to our strong EU customer base.

Of course, physical data centers have limitations, which you’re bound to encounter when you’re in the business of rapid product innovation and growth! Those limitations unfolded over the next few years.

One example was Zendesk Connect, our proactive customer communication product. We started building Connect in our data centers, and then realized we needed a big new infrastructure investment to handle the different performance characteristics that behavior-based marketing automation requires. We spent millions of dollars on servers, racking and stacking them all around the world. We had to spin up whole new teams to set up and operate the new infrastructure. All this delayed the product schedule by about 9 months. That’s too long! Delays associated with our infrastructure were a huge drag on the Connect team as they tried to innovate.

As we struggled with these challenges, AWS and Google Cloud Platform were battling it out on price and features, providing incredible cloud-based technology that we realized would have helped us get the product to market more quickly. In the end, we acquired Outbound, a company in the customer engagement space that was 100% cloud-native, and rebuilt Connect on top of their technology. This accelerated Connect’s path to market, but it was emotionally brutal for our engineers. The limitations of physical data centers on product innovation felt very real.

The capacity limits of physical data centers also constrained our infrastructure development. When migrating large chunks of compute to a new system, say Kubernetes, you need double the capacity for a while, so you can confidently run traffic through both and then gradually shift over. But our physical data centers just didn’t have that capacity sitting around! Our pace of adopting new technology was limited by how quickly and how safely we could trial it.
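To make the double-capacity point concrete, here’s a toy sketch (not our actual routing code, and purely illustrative in its names and numbers) of that kind of weighted cutover: a tunable fraction of requests goes to the new system while the rest stays on the old one. While that fraction sits anywhere between zero and one, both stacks have to be fully provisioned, which is exactly the headroom our data centers didn’t have.

```ruby
# Toy illustration of a gradual cutover. Nothing here is Zendesk code;
# the backends and weights are stand-ins.
OLD_STACK = ->(req) { "handled by VMs/bare metal: #{req}" }
NEW_STACK = ->(req) { "handled by Kubernetes: #{req}" }

# Send a configurable fraction of traffic to the new system.
def route(request, new_stack_weight: 0.10)
  (rand < new_stack_weight ? NEW_STACK : OLD_STACK).call(request)
end

route('GET /api/v2/tickets')                          # early in the migration
route('GET /api/v2/tickets', new_stack_weight: 0.95)  # near the end of the cutover
```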

These same capacity limits made it difficult to manage our traffic volumes. We experienced surges in traffic as our customers grew like crazy — looking at you, Uber and Groupon! We also got slammed when our customers had significant events or launches that didn’t go to plan, such as Rockstar Games with GTA 5. Then there were the DDoS attacks, several of which saturated the network links into our data centers. When you layer all these factors on top of each other, it becomes incredibly difficult to predict capacity. Further, we had to contend with long lead times for provisioning new hardware: we sometimes waited up to six months for a new server! It was stressful not having spare capacity available to provide a reliable system to our customers.

There was a moment with Morten that I remember very clearly. We were talking about how AWS finally had solid-state drives. Up to that point, the spinning disks used for EBS couldn’t get us anywhere close to the performance of our NAND FusionIO drives, and that was a deal breaker for us moving to AWS. The initial SSDs were an improvement, but still no match for the sustained-load performance we depended on. But it was clear AWS was catching up, and we would probably be able to make it work. It was dawning on us that we could run in the cloud.

What really hit us was when AWS launched its Aurora managed database service. We had expended so much time and energy scaling MySQL. We were operating hundreds of extremely powerful “bare metal” MySQL database clusters. Even with our customer data split into many shards, these servers had been vertically scaled as far as possible, and this was one of the big reasons we hadn’t been able to go to the cloud. We also had numerous engineers dedicated to running that part of our stack; just running space reclaims across the fleet became a full-time activity. Then AWS launched Aurora, a fully managed, MySQL-compatible database that let you scale the data store layer up and down very quickly, on instances large enough, and with disks fast enough, to credibly hold our data. We could say goodbye to the constant free-space reclamation and table optimization our DBA team had to do to keep the system up and running. And instead of racking and stacking more FusionIO storage, we could leverage AWS elasticity to expand our database clusters by terabytes upon terabytes of additional storage.
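For a sense of what “scale up and scale down the data store layer” means in practice, here’s a minimal sketch using the Ruby AWS SDK. This isn’t our tooling, and the cluster name, instance name and instance class are hypothetical; it just shows that adding or removing an Aurora reader is an API call rather than a months-long racking exercise.

```ruby
# Minimal sketch, not Zendesk tooling: scale an Aurora cluster's read capacity
# with the Ruby AWS SDK. Cluster/instance identifiers are hypothetical.
require 'aws-sdk-rds'

rds = Aws::RDS::Client.new(region: 'ap-northeast-1')

# Add a reader to the (hypothetical) "pod9-accounts" Aurora cluster. The new
# instance attaches to the cluster's shared storage volume, so there is no
# lengthy data copy as there would be for a conventional MySQL replica.
rds.create_db_instance(
  db_instance_identifier: 'pod9-accounts-reader-2',
  db_cluster_identifier:  'pod9-accounts',
  engine:                 'aurora-mysql',
  db_instance_class:      'db.r4.2xlarge'
)

# Scaling back down when load subsides is just as simple.
rds.delete_db_instance(
  db_instance_identifier: 'pod9-accounts-reader-2',
  skip_final_snapshot:    true
)
```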

Morten posed a question one day: “What if a competitor built on top of Aurora today? Would they have any of the burden we are experiencing, managing complicated physical infrastructure?” We said, “We’re an enterprise software company focused on allowing businesses to communicate with their customers. We need to focus our energy on that core competency, rather than on being experts in how to scale and run a fleet of MySQL servers. To stay competitive long-term, we must make use of this technology.” It was this competitive pressure, and the long-term risk to our business, that made the decision clear. That was the real moment we knew we were going to the cloud. We needed to make use of the unique technical advances the cloud had surfaced. It felt right.

Going to the cloud

If there is one thing that is constant in computing, it’s rapid innovation and change. With Amazon, Google and Microsoft all going head-to-head in the lucrative cloud market, the pace of innovation has been phenomenal. With each new offering brought to market, you have to ask how leveraging it could change your approach to infrastructure. I’m pretty proud that we course-corrected, that we were open enough to adapt to a changing environment, especially as we knew it was going to be more expensive.

People often think the goal of Zendesk’s move to AWS was cost optimization — it was not.

The goal was to enable greater efficiency for our engineers and to take advantage of the significant investments AWS has made in automated service provisioning, resilience and performance. We can now build on top of AWS’s innovation! We anticipate our costs increasing initially, but plan to recoup them through our engineers’ ability to innovate faster, plus some benefits from autoscaling. We believe our engineers will spend less time managing the system and more time innovating! Ideally, we don’t want our infrastructure team to have to scale at the same rate as our product. We want our infrastructure to be more generic and more efficient so our engineers can maintain existing features and also create new ones.

Migration lessons (some learned the hard way)

While one mustn’t dwell on the past, lessons learned have great value and can help others. So, what would we have done differently? If we were migrating again, we would proceed more aggressively and start leveraging cloud-native functionality sooner.

To understand our migration strategy, you need to know a little about our infrastructure. The unit of deployment in Zendesk’s multi-regional infrastructure is a POD — a “Point of Delivery” — which contains a complete Zendesk stack. We had five PODs running in colocation facilities in the US and EU. An account lives in a single POD and can be moved to any other POD without downtime or maintenance windows. We decided that we’d build a full POD in AWS, then migrate customers out of an existing POD and into the AWS one. We picked the Tokyo region because it had all the services we required, and because customers there were asking for better performance and for primary data activities to stay in-region. It worked! Then we spun up a couple more AWS-based PODs and started gradually migrating more customers.
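As a mental model of the POD approach, here’s a simplified, hypothetical sketch (not our real routing layer): a global table pins each account’s subdomain to exactly one POD, and migrating an account into a new AWS-based POD amounts to replicating its data and then flipping that pointer.

```ruby
# Hypothetical sketch of the POD model. The POD ids, regions and accounts are
# made up; the real system keeps this mapping in a replicated data store that
# the edge/routing layer consults on every request.
PODS = {
  'pod3' => { region: 'us-west',        platform: :datacenter },
  'pod9' => { region: 'ap-northeast-1', platform: :aws }        # the Tokyo POD
}.freeze

ACCOUNT_POD = { 'acme' => 'pod3', 'initech' => 'pod9' }

def pod_for(subdomain)
  PODS.fetch(ACCOUNT_POD.fetch(subdomain))
end

# Moving an account out of a data center POD and into an AWS POD boils down to
# replicating its data to the target, verifying it, and updating the mapping.
def move_account(subdomain, target_pod)
  # ... replicate and verify the account's data in target_pod, then:
  ACCOUNT_POD[subdomain] = target_pod
end

pod_for('acme')               # => { region: 'us-west', platform: :datacenter }
move_account('acme', 'pod9')
pod_for('acme')               # => { region: 'ap-northeast-1', platform: :aws }
```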

This was a perfectly valid approach — it made sense to get our feet wet and gain experience before committing our company’s mission-critical infrastructure entirely to the cloud. However, it meant that we had to build to the lowest common denominator of what was available in the data centers and in AWS. We had to wait until we were fully out of the data centers to start changing the way we architected our applications. We didn’t want to run two versions of every application, but the data locality commitments we make to our customers meant we couldn’t launch functionality in only one AWS location. We had to be able to continue to innovate from a product standpoint everywhere!

Our Kubernetes migration is a great example. We built Kubernetes in our data centers and we also built it in AWS. We built it the same way in both locations, which meant neither implementation was really great. Then, when teams wanted to migrate to Kubernetes, we didn’t have enough servers in the data centers to sustain double the load. They wanted to switch, but didn’t want to run on Kubernetes in AWS and on VMs or bare metal in our data centers; that would complicate their architecture and increase risk. So we were partially migrated to AWS, but unable to leverage the very reason we wanted to be in AWS (apart from Aurora, which was helpful).

It became blatantly clear that we needed to get out of the data centers entirely. With that, we began a much more aggressive push to rapidly migrate all of our PODs to be AWS-based, a project we called “DC Lights Out,” which, as of writing this in December 2018, is nearly complete.

Lift and shift

For the most part, we took a “lift and shift” approach, where we tried to build new PODs in AWS in pretty much the same way as we built them in the data centers. We took this approach to reduce the complexity of the AWS migration, which we already knew would be a major effort even without re-architecting to be cloud-native.

Nonetheless, we had to change many technology choices in flight. We switched from MySQL to Aurora. We switched from F5 BIG-IP load balancers to Elastic Load Balancing. We realized that Chef was not the best tool to be using in an environment as dynamic as AWS. With some skilled hackery, we were able to get everything to work, but the difficulty of keeping our cloud and co-lo PODs in sync reinforced our conviction that we needed to get out of the data centers and stop straddling two worlds.

Added complexity through acquisitions

While all this was going on, we acquired three companies. In early 2014, Zendesk made its first-ever acquisition: Zopim, a live chat company based in Singapore. Like any new parent, we quickly learned some lessons. For starters, their infrastructure was totally different from ours. We tried to move Zopim closer to Zendesk’s infrastructure, but discovered that, for various architectural reasons, we couldn’t just drop them into the PODs. Once we realized this, we started building shared systems to bring the products together functionally without immediately merging the infrastructures, with the eventual goal of unifying the infrastructure as well.

We developed rules to guide our organization and defined what would be critical going forward. Simple things like: a Zendesk account will always live at subdomain.zendesk.com/[product]. We said, “This is a pattern we’re going to use, and there are lots of good reasons for it. If we acquire a business, they will have to fit into that mold.” We changed our mindset from “here is the architecture we’ve built over the years” to “we need a scalable way to integrate lots of different types of infrastructure into a single place.” That put us in a good spot to integrate our subsequent acquisitions: BIME, an analytics company, in 2015; and Outbound, a marketing automation company, in 2017.
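To show what that rule buys you, here’s a hypothetical sketch of path-based product routing under a single account subdomain. The product slugs and internal backends are made up; the point is that an acquired product slots in by registering a path prefix rather than introducing a new domain or login.

```ruby
# Hypothetical illustration of the subdomain.zendesk.com/[product] rule.
# Backend URLs and product slugs are made up for the example.
PRODUCT_BACKENDS = {
  'chat'    => 'https://chat-backend.internal',     # integrated from Zopim
  'explore' => 'https://explore-backend.internal',  # integrated from BIME
  'connect' => 'https://connect-backend.internal'   # rebuilt on Outbound
}.freeze

# Resolve a request like GET acme.zendesk.com/chat/agent to the internal
# service that should handle it, keeping the account context along the way.
def resolve(subdomain, path)
  product, _, rest = path.delete_prefix('/').partition('/')
  backend = PRODUCT_BACKENDS.fetch(product) { return nil } # unknown product
  "#{backend}/#{rest}?account=#{subdomain}"
end

resolve('acme', '/chat/agent') # => "https://chat-backend.internal/agent?account=acme"
```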

Conclusion

The last five years have been a wild ride. We’ve scaled our infrastructure to keep up with the incredible success Zendesk continues to see, first building out a physical footprint around the world, then migrating to the cloud.

But lifting and shifting to the cloud wasn’t enough. It was a fine first step, but we also needed to start transforming our infrastructure to make the best use of being in the cloud! We needed a team whose core focus would be wrapping AWS services and providing standards to our engineers, so they could utilize all the amazing things available in the cloud. We needed to move from doing things manually, to a world where our infrastructure was living, breathing and morphing on its own. Next time, I’ll talk about how we’re doing that.


Thanks Gary Grossman, Steve Loyd, Morten Primdahl & Ryan Seddon for reviewing and contributing to this.