Where We Went Wrong Implementing AWS

The Cloud, the growth, the growing pains

Andrew Hatch
SEEK blog
6 min read · Sep 15, 2016


This is part 3 of a series about the evolution of DevOps @ SEEK. In this post I’ll be talking about how we started using the cloud, how we screwed it up, and what we learned. Read the previous post here.

Back in 2012/13 there was one task the DevOps Team loathed more than most: rebuilding a Development and Testing environment for a Product Delivery Stream. Back then we maintained 22 separate dev/test environments for the streams, each comprising about 10 servers running a mix of all our sites and services. Many were also configured to integrate with databases and a middleware broker for a CRM system. These environments were old, had no automated baseline to roll back or forward to, and had been subjected to numerous deployments, hacks and modifications, most of which weren’t documented. In summary, they were very unstable, broke down regularly, and over time a couple became completely unusable.

These dev/test environments also lived in our office network with no network segregation, meaning all the servers within each environment needed their own IP addresses and DNS names. We used colours in the names to differentiate them, so that streams would only connect to the environments they needed (e.g. the green one) and not the ones they didn’t. Configuration Management of this setup meant controlling over 500 individual configuration values per environment, which amounted to roughly 11,000 across all 22 environments. Managing this configuration data was a nightmare, so we built tools to help automate it. They helped, but misconfigured environment settings continually plagued us, slowing down the Product Delivery Streams: developers and testers had to re-test constantly when things stopped connecting, waste time working out why, and then hassle Ops people when they hit a wall.
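To give a sense of the direction that tooling took (a minimal sketch with hypothetical keys and values, not our actual configuration), the idea was to keep one master set of values and apply a small per-environment override, rather than hand-maintaining 22 copies of roughly 500 settings:

```python
# Hypothetical sketch: one master config plus small per-environment overrides,
# keyed by the colour used in each environment's DNS names.

MASTER_CONFIG = {
    "crm.broker.host": "middleware.dev.local",
    "search.api.url": "http://search.dev.local/api",
    "db.connection.timeout": "30",
    # ...the real setup had around 500 keys like these
}

ENVIRONMENT_OVERRIDES = {
    "green": {"crm.broker.host": "middleware.green.dev.local"},
    "blue": {"crm.broker.host": "middleware.blue.dev.local"},
}

def build_config(colour: str) -> dict:
    """Merge the master values with one environment's overrides."""
    config = dict(MASTER_CONFIG)
    config.update(ENVIRONMENT_OVERRIDES.get(colour, {}))
    return config

if __name__ == "__main__":
    # Render the effective settings for the "green" environment.
    for key, value in sorted(build_config("green").items()):
        print(f"{key}={value}")
```

Even with tooling like this, every value that differed between environments was another opportunity for a misconfiguration to sneak in.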

Anyway, you get the picture: we had to do something about this.

Enter AWS

When AWS arrived in Sydney in 2013, VPCs came with it. VPCs let you build wholly enclosed networks, effectively your own Data Centre in the Cloud, complete with subnets, routing rules, ACLs, Security Groups and CIDR ranges of your choosing. And of course you could create, update and delete them on the fly using CloudFormation stacks. We envisioned that re-buildable VPC environments, cloned from a single “master” environment and needing only a single set of configuration data to maintain, could be the solution we needed. An added bonus was that we could use our actual production IPs and hostnames within these VPCs, getting the testing closer to Production.
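As a rough illustration of the approach (a minimal sketch, not our actual tooling; the template path, parameter names and stack naming are assumptions), cloning an environment amounted to creating a CloudFormation stack from the single master template with a few per-environment parameters:

```python
# Hypothetical sketch: clone a dev/test VPC environment from one "master"
# CloudFormation template using boto3.

import boto3

cloudformation = boto3.client("cloudformation", region_name="ap-southeast-2")

def clone_environment(colour: str, cidr_block: str) -> str:
    """Create one dev/test VPC environment as a CloudFormation stack."""
    with open("master-environment.template.yaml") as template:
        template_body = template.read()

    response = cloudformation.create_stack(
        StackName=f"devtest-{colour}",
        TemplateBody=template_body,
        Parameters=[
            {"ParameterKey": "EnvironmentColour", "ParameterValue": colour},
            {"ParameterKey": "VpcCidr", "ParameterValue": cidr_block},
        ],
        Tags=[{"Key": "Purpose", "Value": "dev-test"}],
    )
    return response["StackId"]

if __name__ == "__main__":
    print(clone_environment("green", "10.0.0.0/16"))
```

Because each VPC is fully isolated, every clone can reuse the same internal addressing, which is what made re-using production IPs and hostnames possible in the first place.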

It sounded like a great idea, so we wrote a business case, got it approved and went to work.

Sparing the gory details, success on this adventure was mixed, as the technical challenges we faced were massive, requiring a lot more specialist coding and scripting than we anticipated to make it all work. After all, we were trying to lift and shift legacy systems into the cloud! After 12 months we got it working (well, mostly), but upon reflection we realised we’d just built another environment monster that needed constant feeding and attention. It wasn’t quite as bad as what we’d been dealing with in the past, and the overheads were a lot lower, but it still wasn’t as automated or light-touch as we’d hoped.

Where we went wrong

We initially thought it would be a simple matter of replicating the network in a VPC, rebuilding the servers and deploying to them, and then things would mostly just work. We wouldn’t need to rewrite code either; we’d just ring-fence the systems and copy data around.

But when we started peeling back the layers of nearly 10 years of legacy scar tissue, greater complexities and issues began to emerge. It was not as simple as we first thought, and we needed far more help than anticipated from the Product Delivery Streams to work out why the hell certain things worked the way they did. Yet the Product Delivery Streams were incredibly busy too; they couldn’t just spare the time when we wanted it, and they actually needed more of our help getting things deployed and configured in Staging and Production.

So as the complexities and issues kept mounting we became more insulated, trying to solve the issues ourselves and thinking less and less about how the delivery streams, our customers, were going to use it. Not very Agile, not very MVP, and of course the outcome became very predictable. The first and subsequent releases of the solution suffered numerous issues and immediately put the Product Delivery Streams offside as they grappled with testing and deployments, access over VPNs, and scaling up compute resources to address latency and performance issues within the VPCs.

The root cause of the root cause was…

We tried to solve a problem with technology without first focusing on the people who would be using it and their processes. We hadn’t done enough work upfront getting people involved or on board, and we didn’t analyse their skills and capabilities in using AWS. Simply put, we shifted the nucleus of our existing problems to the cloud and learnt a very valuable lesson they probably teach the kids now in Cloud school:

Don’t lift and shift your data centre!

So when the cloud gives you lemons…

It would have been easy to run screaming from the cloud, stay mired in a data centre, lick our wounds and so on, but that’s not very “SEEK”. We knew we were on the right path, so we pushed through, made some tough decisions and kept the focus on taking the positives from this experience and using them to set us up for future success. Here are some of the key things we gained from doing this:

  • We gained broad and deep technical knowledge of AWS so we no longer produced blocking, gated, generic, “won’t someone think of the developer children” solutions. We encourage innovation, ownership and cross-team collaboration as we have enough skills in-house to meet the technical challenges.
  • We refined and honed each new AWS solution we built based on past learnings, making best use of the cloud services available. This means we’re regularly hassling AWS to produce new solutions, make their offerings better and give us more insight into their roadmaps.
  • We are really, really good at managing our cloud usage. We automate turning things off when we don’t need them, and we enforce policies on tagging and cleaning up waste (see the sketch after this list). We can crunch massive bills in very short amounts of time, and we make use of Reserved Instances (RIs) and Spot instances where it makes sense to do so.
  • We get Architecture, Development, Testing and Operations to collaborate on producing solutions from inception. A process that has silenced many ghosts of data centre silos past.
  • Continuous Delivery Pipelines and “you build it you support it” mantras are encouraged for all new development. Learnings and solutions are shared and evolved with each new project to continually make it better.
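As an example of the kind of cost automation mentioned above (a minimal sketch with hypothetical tag names and values; our real policies are more involved), stopping tagged dev/test instances outside business hours can be done with a small scheduled script:

```python
# Hypothetical sketch: stop running instances that are tagged as
# "business-hours only" once the working day is over.

import boto3

ec2 = boto3.client("ec2", region_name="ap-southeast-2")

def stop_after_hours_instances() -> list:
    """Find running instances tagged for shutdown and stop them."""
    reservations = ec2.describe_instances(
        Filters=[
            {"Name": "tag:Schedule", "Values": ["business-hours"]},
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    )["Reservations"]

    instance_ids = [
        instance["InstanceId"]
        for reservation in reservations
        for instance in reservation["Instances"]
    ]
    if instance_ids:
        ec2.stop_instances(InstanceIds=instance_ids)
    return instance_ids

if __name__ == "__main__":
    print("Stopped:", stop_after_hours_instances())
```

Run on a schedule each evening, something like this is cheap insurance against dev/test fleets quietly burning money overnight and on weekends.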

And then there was all this growth

Building our AWS VPC environments solution lasted from late 2013 until early 2015, a timeframe that coincided with one of the biggest growth periods in SEEK’s history. We delivered a huge amount of product, hired lots of developers, testers and Ops people, made some org changes, brought in new build tools, retired old legacy systems, created new systems, and by early 2015 had started focusing exclusively on building and delivering Production solutions in AWS.

The impact on our DevOps team during this growth period was massive. Given we were still the bottleneck for all deployments, and were also on call for the entire operational performance of the website, the strain on all of us was tremendous. Even senior managers were picking up pagers and monitoring Slack channels for alerts late into the night, just to help those burning the candle at both ends. Another effect was the strain it put on our support systems: our build tool started crashing almost weekly, the monitoring system regularly blew up, and we were hacking and patching when we should have been solving problems at the source.

The more you keep adding to the monolith, the higher your support costs.

Quite an environment of organisational change in IT to say the least, but the pay-offs of our perseverance were massive.

So how did we get through this?

Tune in for my next post, where we will look at how we brought more automation and stability into our support efforts. How we learnt the hard way about cloud bill shock. How we got used to saying “no”. And how we went about delivering our first AWS Production system, bringing DevOps practices into the process and turning one of the worst systems for operational stability into one of the best.
