Saving VA $100 Million
By: Alan Ning
The acquisition of — and transition to — modern IT infrastructure is an important way the U.S. Digital Service can make a big impact for government agencies. This post is Part Two in a two-part series that explains how the Digital Services team at Veterans Affairs (DSVA) helped the agency become one of the earliest Federal government adopters of modern, cloud-based IT infrastructure. Read Part One, which discusses how DSVA laid the foundation for this migration through stakeholder engagement and creating an Authority To Operate (ATO) and contracts that work for cloud infrastructure.
For the past year, the Veterans Affairs Enterprise Cloud team (VAEC) has been coordinating a massive IT modernization effort to migrate many of VA’s on-premise applications to the cloud. The primary objective of this effort is to improve the scalability and reliability of the VA’s applications and to reduce IT infrastructure cost through the use of cloud technologies. Through Vets.gov and Caseflow, the Digital Service team at VA (DSVA) became one of the earliest adopters of Amazon Web Services (AWS) within the agency. This experience meant that the Digital Service engineers were asked to review the VA’s Initial Cloud Reference Architecture for AWS. In the process of our review, we spotted an opportunity to streamline the architecture in such a way that VA could potentially save an estimated $100 million over the next 10 years.
The Cloud Migration
The Department of Veterans Affairs (VA) has a massive tech footprint, with 632 on-premise applications actively running in production. Many of these systems are interconnected, are in various stages of maturity, and reflect a wide range of technologies, some of which were launched 40 years ago and are still going strong. These applications are critical systems that Veterans rely on to receive their benefits and assistance.
Considering the magnitude and interconnectivity of these critical systems, VA needs a cloud architecture that ensures reliable and high bandwidth connectivity to the on-premise network. The cloud architecture also needs the flexibility and scalability to accommodate the variety of application frameworks found across VA’s applications.
Transit VPC and Direct Connect
To meet its scalability requirement, VA is deploying the Transit Virtual Private Cloud (VPC) solution with AWS Direct Connect. This architecture follows current industry best practice in security, scalability, and availability. The Transit VPC uses a hub-and-spoke topology: a large number of VPCs (the spokes) share a connection to VA data centers. AWS Direct Connect, rated for connections of up to 10 Gbit/s, links the VA’s on-premise network with the cloud (AWS GovCloud). This network service addresses VA’s need for reliable, high bandwidth connections at a controlled cost.
Single and Multi-tenant Environments
With the connection infrastructure out of the way, the next key challenge is to choose a VPC Network Architecture. For this, we chose to have a mixture of single-tenant VPCs and multi-tenant VPCs (multiple applications share a VPC).
The multi-tenant VPC allows VA to centralize resource provisioning and network security management at the enterprise level. Of the hundreds of applications noted earlier, the mature applications (those in the sustainment phase) will be migrated into the multi-tenant environment using the lift-and-shift strategy.
Applications whose teams support a DevOps culture (e.g., Vets.gov, Caseflow) will be migrated into single-tenant VPCs using the cloud-native strategy. These tenants will have more control over their environment, which allows them to deploy their own CI/CD pipeline and capitalize on the scalability of the cloud.
Initial Reference Architecture
With these basic requirements defined, the VAEC drafted the Initial Reference Architecture: the blueprint for the entire VA’s AWS environment. With Digital Service’s expertise in the cloud, our engineering team was invited to review this architecture. During the briefing, one thing that caught our attention was that it used GRE tunnels over VPC Peering as a layer-3 overlay for VPC interconnectivity and access to Direct Connect.
Beyond the AWS-provided VPC Peering service and VPN service, these GRE tunnels provide several extra capabilities:
● Scaling beyond the VPC peering limit of 125
● Multicasting of packets
● Overlapping IP address space among VPCs
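One way to read the list above is as a checklist for whether a given VPC would need a GRE overlay at all. The following is a minimal Python sketch of that idea, not anything VA actually ran: the VPC names, CIDR blocks, and the helper functions `overlapping_pairs` and `needs_overlay` are all hypothetical, and the only figure taken from the text is the peering limit of 125.

```python
import ipaddress
from itertools import combinations

PEERING_LIMIT = 125  # figure quoted in the list above


def overlapping_pairs(vpc_cidrs):
    """Return pairs of VPC names whose CIDR blocks overlap."""
    nets = {name: ipaddress.ip_network(cidr) for name, cidr in vpc_cidrs.items()}
    return [
        (a, b)
        for a, b in combinations(sorted(nets), 2)
        if nets[a].overlaps(nets[b])
    ]


def needs_overlay(peer_count, uses_multicast, has_overlap):
    """True if any of the three listed capabilities is required."""
    return peer_count > PEERING_LIMIT or uses_multicast or has_overlap


# Hypothetical VPCs and address ranges, for illustration only.
vpcs = {"app-a": "10.0.0.0/16", "app-b": "10.0.128.0/17", "app-c": "10.1.0.0/16"}
print(overlapping_pairs(vpcs))          # [('app-a', 'app-b')]
print(needs_overlay(40, False, False))  # False
```

A VPC that trips none of the three conditions can, in principle, be served by plain VPC Peering and VPN.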
When we brainstormed about potential drawbacks, we identified a significant one. Each Spoke VPC requires at least two Cisco Cloud Services Routers (CSRs) to maintain the GRE tunnels with high availability. With a large number of VPCs, the potential cost of Cisco CSR licenses and AWS EC2 instances alone would be enormous, and the cost of maintenance and upgrades would only add to it. A second major complication arose as well: because all Spoke VPCs must peer with Transit VPCs, additional Transit VPCs would need to be deployed to stay within the VPC peering limit. Managing multiple Transit VPCs would significantly increase both network complexity and maintenance cost.
CSR Resource Cost
To estimate the cost of this architecture in a steady state, we assumed that out of the 600 applications in the cloud, there would be 100 applications in the single-tenant environment. For traffic isolation, we also assumed that there are three environments: development, staging, and production.
Here is a rough estimation of the CSR cost breakdown:
In the end, we concluded that at steady state the annual CSR cost would be at minimum $9,897,120. After factoring in the engineering maintenance cost (e.g., software upgrades) and the overhead of license renewal, the total cost could easily exceed $10 million per year. Since this architecture may last well over 10 years, it amounts to a $100 million architecture.
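The scale behind this figure can be sketched from the stated assumptions. The snippet below is a rough illustration, not the team's actual cost model. It assumes, hypothetically, one Spoke VPC per single-tenant application per environment, ignores the multi-tenant VPCs, and uses the two-CSR minimum and the peering limit of 125 mentioned earlier. Per-CSR prices are not given in this post, so the code counts resources and projects the stated annual figure rather than recomputing it.

```python
import math

ANNUAL_CSR_COST = 9_897_120  # minimum annual figure stated in the post (USD)


def spoke_footprint(apps, environments, csrs_per_vpc=2, peering_limit=125):
    """Count spoke VPCs, CSRs, and Transit VPCs under the assumptions above."""
    spoke_vpcs = apps * environments          # assumed: one VPC per app per environment
    csrs = spoke_vpcs * csrs_per_vpc          # at least two CSRs per Spoke VPC
    transit_vpcs = math.ceil(spoke_vpcs / peering_limit)
    return spoke_vpcs, csrs, transit_vpcs


spoke_vpcs, csrs, transit_vpcs = spoke_footprint(apps=100, environments=3)
print(spoke_vpcs, csrs, transit_vpcs)  # 300 600 3
print(f"${ANNUAL_CSR_COST * 10:,}")    # $98,971,200 over 10 years, before maintenance
```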
Taking a step back, the team determined that the cost and complexity of the GRE tunnels may outweigh their benefits in the short term and could lead us to an outdated architecture in the long term. We decided that we could largely substitute AWS’s Managed VPN Service and VPC Peering Service for the GRE tunnels. Both services are extremely low cost and would cover the requirements of the majority of applications. This substitution frees us from the hundreds of CSRs that the GRE tunnel approach requires. It also still allows for GRE tunnels: as the VA cloud architecture matures, we can iteratively build GRE tunnels for any applications whose needs the AWS services cannot meet. Offloading VPN endpoints to AWS’s Virtual Private Gateways in this manner significantly reduces network complexity, as well as the number of CSRs in the environment. With this design, we estimated that we would only need 18 CSRs* at steady state. Combining the CSR and AWS VPN cost, this translates to a total cost of $403,128 per year, which is a 95.9% reduction in resource cost from the original architecture.
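The reduction can be checked directly from the two annual figures quoted in this post. The short Python sketch below uses only those two numbers; the function name `reduction_percent` is our own, and the 10-year projection assumes flat costs with no maintenance or license overhead.

```python
ORIGINAL_ANNUAL = 9_897_120  # GRE-tunnel architecture, minimum annual CSR cost (USD)
REVISED_ANNUAL = 403_128     # revised design, CSR plus AWS VPN cost per year (USD)


def reduction_percent(before, after):
    """Percentage reduction going from `before` to `after`."""
    return (before - after) / before * 100


annual_savings = ORIGINAL_ANNUAL - REVISED_ANNUAL
print(f"{reduction_percent(ORIGINAL_ANNUAL, REVISED_ANNUAL):.1f}%")  # 95.9%
print(f"${annual_savings:,} per year")                               # $9,493,992 per year
print(f"${annual_savings * 10:,} over 10 years")                     # $94,939,920 over 10 years
```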
It is a rare opportunity to be invited to preview the cloud architecture of an enormous organization before it goes into production. It is even rarer to have the opportunity to streamline the architecture to have a huge impact. Digital Service at VA was fortunate to collaborate with the VA Enterprise Cloud, resulting in a simplified design that could save an estimated $100 million over 10 years. At the end of 2017, this new architecture was deployed and we have begun migrating applications to the cloud environment. Caseflow and Vets.gov will be two of the first dozen to move to this new environment. This effort will no doubt improve the reliability of applications that Veterans use every day, and the Digital Service team is thrilled to work with the VAEC team to achieve this mission.
Reference in this blog post to any specific commercial products, processes, or services, or the use of any trade, firm, or corporation name is for the information and convenience of the site’s visitors and does not constitute endorsement, recommendation, or favoring by the U.S. government.