Disaster Recovery Using Hybrid Cloud

By Paul Gibson

Financial Engines recently celebrated the 20th anniversary since the company was founded. Those two decades reflect our growth en route to becoming the largest registered investment advisor in the US.
During those same two decades the technology industry has changed profoundly and we have adjusted along the way. One change we completed earlier in 2016 was moving our disaster recovery footprint to a hybrid cloud solution using AWS. This document describes that effort in more detail and the results we achieved.

Our offerings have been web-based since inception. For hosting these web experiences we utilized top tier colocation providers. That relieved us from building and operating physical datacenters.
Today enterprise capable IaaS is avaiable from providers such as AWS. We are now on a journey to move “up the stack” and adopt IaaS and reduce the burden we bear for things like:

  • hardware (servers, network gear, storage) procurement
  • physical site design and engineering (space, power, cooling, rack design, cable management, etc)
  • hardware maintenance: replacing failed drives, DIMMs, CPUs, NICs, motherboards, switch blades
  • firmware maintenance: qualifying and applying updates/patches across all hardware devices
  • hypervisor work: licensing, installation, tuning, maintenance, patching, upgrades
  • physical storage: design, engineering, and maintenance for iSCSI boot and data disks and NFS/CIFS network attached storage

Moreover, when we are ready to decomission infrastructure we just call an API to terminate/free those resources. This eliminates physical maintenance at the end of hardware lifecycles.

Our transition to IaaS marks early 2016 as the point of “Peak Colo” for Financial Engines.

Over the coming quarters we expect to:

  • require fewer racks in colocation facilities
  • buy fewer servers from Cisco, Dell, or IBM
  • consume fewer VMware licenses
  • spend more on AWS for IaaS resources
  • achieve a net savings in our infrastructure total cost of ownership (see chart below for details)

Our rebuild of the DR environment had a fixed timeline due to a colocation contract ending. We therefore focused our effort on a lift-and-shift approach and moved the Linux compute portion of our stack into a VPC. We connected that VPC using Direct Connect to a reduced colo footprint resulting in a seamless LAN spanning our colo space and AWS:

This hybrid posture converts roughly 80% of our servers from on-prem hosted to cloud-hosted. In doing so we trade capital for expense and ownership for rental.

For disaster recovery this trade is attractive since these resources are rarely needed (our DR utilization is < 10% for testing, drills, etc).

This lift-and-shift hybrid project has a residual footprint in our colo consisting of:

  • backend NetApp storage
  • large database hosts which are more diffcult to run on EC2 (due to size, iops, and cpu requirements)
  • batch machines which currently run on Windows Server

Future revisions of our hybrid posture should enable more of this infrastructure to run on AWS.

Our previous generation disaster recovery consisted of a colocation-hosted footprint containing:

  • 6 racks
  • vmware compute on IBM blades
  • NetApp storage
  • RHEL subscription fees
  • Load balancers as hardware appliances

The new disaster recovery footprint built on a hybrid cloud consists of:

  • 1 rack of UCS and NetApp (tech refreshed to yield better density and performance)
  • EC2 compute (our upgrade to the latest Xeon E5 v3 hardware was just selecting from the M4/C4 instance families)
  • Ubuntu 14.04 LTS
  • Load balancing on ELBs

In terms of costs here is what the transformation looks like:

Our DR site on AWS uses the pilot light model which incurs a modest monthly expense.

In exchange for that pilot light expense we achieved large reductions in depreciation, engineering time, and colo expense.

Following this disaster recovery rebuild we are moving to other re-hosting projects such as:

  • dev and test environments
  • production, starting with cpu-intensive tiers of our footprint

We expect these new environments to utilize the same hybrid cloud architecture with similar results.

In addition to our lift-and-shift projects we are also moving to cloud native substrates for net new functionality.

These projects are using high-level primitives in AWS such as:

  • Lambda
  • API Gateway
  • DynamoDB
  • S3
  • Kinesis

Look for future blog posts covering that work.

--

--

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store