Terraforming Stack Overflow Enterprise in AWS
Stack Overflow Enterprise (SOE) is a paid private version of Stack Overflow for businesses, available from Stack Overflow as an on-premises installation or hosted service. In this blog post, we discuss our strategy for building and deploying a self-healing, highly available internal SOE instance on Amazon Web Services (AWS). We describe the iterative process of migrating SOE, which is a complex Microsoft stack application, to an all-AWS architecture via HashiCorp’s Packer and Terraform deployment suite. We detail our approach to security, including the use of bastion hosts and front-end/back-end subnets, to help isolate communication routes between components. Overall, we show how we assembled infrastructure that simplifies the maintenance of an increasingly vital knowledge management service for our company.
Effective knowledge management is difficult at any rapidly growing company, but Palantir’s geographic footprint makes it even harder. Our business development, support, and product teams collaborate across different countries and time zones, so we need tools that make Palantirians’ expertise available to all their coworkers, regardless of location. SOE allows an engineer in Sydney to benefit from and contribute to a company-wide knowledge base in real time instead of waiting for the sun to rise in Palo Alto. It reduces support burdens by providing a friendly, searchable interface to many common technical questions, and its knowledge compounding effect frees development teams to focus on core feature development.
During its first year, Palantir’s internal SOE instance was hosted on two on-premises virtual machines located on the US East Coast. While this infrastructure was straightforward to access and administer, it provided suboptimal service to our coworkers outside the United States. In addition, server maintenance became a concern, as our on-premises hardware had to be taken offline for critical security patches, which ruled out zero-downtime maintenance. After each patch we had to manually smoke test all of SOE’s components, and it was easy to overlook a small component or setting that hadn’t been reset correctly.
Six months after our on-premises launch, we began to explore migrating SOE to AWS because of these downsides to our on-premises installation. Moving SOE to AWS offered us several concrete benefits:
- Better uptime and reliability, improved system stability
- Automated and less time-consuming maintenance, upgrade, and patching workflows
- Cost savings through adaptive resource allocation
- Improved security posture through repeatable builds and deploys based on up-to-date, fully-patched images
Terraform Deployment Infrastructure
We embrace the Infrastructure-as-Code paradigm and chose the HashiCorp product suite to define our SOE infrastructure: Packer to create EC2 Amazon Machine Images (AMIs), Terraform to deploy AWS infrastructure, and Vault to control access to system secrets. We chose HashiCorp’s deployment software because we found that Terraform offered a concise way to represent our infrastructure in a number of code files, while Packer provided a straightforward way to create an AMI with clear logging of each step of its build process. With HashiCorp’s GitHub integrations, we are able to develop and review our AWS infrastructure just like a regular piece of software, testing and dry-running each incremental change as a part of the build cycle.
The result is a consistent deployment procedure that is admittedly tedious to get right at first, but provides significant time savings when run repeatedly. Creating the Packer build to install SOE, for instance, required us to research the PowerShell commands needed to install a .NET Runtime application, configure IIS, and start several related Scheduled Tasks for SOE badge and notification processing. Our initial build took over a month to assemble, but it now allows us to upgrade to new SOE releases with a few simple edits to our Packer code.
Our SOE upgrade process has improved after the migration: what used to be an hour-long manual testing protocol is now an automated 20-minute workflow comprising a Packer build followed by a rolling Terraform/EC2 redeployment via Bouncer. Further, we can test changes by executing the same repeatable workflow in a staging environment.
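As a rough illustration of what a rolling redeployment involves, the following Python sketch replaces instances one at a time, launching each replacement before retiring the old instance. It is a toy simulation and not how Bouncer or Terraform work internally; the Instance type, the AMI names, and the min_in_service parameter are hypothetical names introduced only for this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Instance:
    """Hypothetical stand-in for an EC2 Web Server; not a real AWS type."""
    name: str
    ami: str


def rolling_redeploy(fleet: List[Instance], new_ami: str, min_in_service: int) -> List[Instance]:
    """Replace every instance that is not on new_ami, one at a time.

    Each replacement is added before the old instance is removed, so the
    number of in-service instances never drops below min_in_service.
    """
    if len(fleet) < min_in_service:
        raise ValueError("fleet is already below the in-service floor")
    current = list(fleet)
    for old in fleet:
        if old.ami == new_ami:
            continue  # already built from the new image
        replacement = Instance(name=old.name + "-new", ami=new_ami)
        current.append(replacement)  # launch the replacement first...
        current.remove(old)          # ...then retire the old instance
        assert len(current) >= min_in_service
    return current
```

Calling rolling_redeploy with a two-instance fleet built from an old image returns a two-instance fleet built from the new one, with at least two instances in service at every step.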
Packerizing Stack Overflow
We use Packer to create AMIs for the core SOE application as well as Elasticsearch. We deploy a dedicated Elasticsearch cluster because Amazon’s Elasticsearch Service initially did not support encryption at rest. Amazon added encryption support in December 2017, and we may switch to this service in the future.
Our SOE Packer build downloads SOE’s release artifacts, installs several support tools like Chocolatey, Redis, and our security tooling suite, and then deposits all needed resources for SOE installation and customization into a local directory; it then creates and packages an AMI from the result. When an EC2 Web Server is spun up, the User Data script supplied by its EC2 launch configuration runs our customized automated SOE installer script and then applies further configuration: domain binding, security tooling configuration, and CSS styling to make SOE our own.
SOE is a Microsoft .NET application comprising the following components, each of which is listed with the corresponding AWS component:
- IIS Web Server (hosts main site): several EC2 instances in different Availability Zones (AZs)
- Redis (cache and pub/sub service): installed as a Windows Service on each EC2 Web Server
- Elasticsearch (search): one EC2 instance, shared among EC2 Web Servers
- MS-SQL Server Database: RDS MS-SQL Server with multi-AZ availability and database mirroring
The following diagram shows the components and network configuration for our AWS SOE installation:
We create several EC2 Web Servers across different Availability Zones, each with its own Redis Windows service to handle local content caching; this adds redundancy to the notification delivery system as each EC2 Web instance can process notifications separately and have RDS deduplicate them later on.
The EC2 Web Servers are deployed in an Auto Scaling Group behind an Elastic Load Balancer (ELB) with a simple health check that requests the SOE index page over HTTP. If this check fails for longer than a two-minute period, the underperforming instance is removed from the fleet and a replacement EC2 Web Server is created. Individual EC2 instances serve traffic to the ELB over HTTP; the ELB terminates HTTPS on the front end using an Amazon-provided SSL certificate that matches a target Route53 hosted zone, so users reach the entire fleet of EC2 Web Servers over HTTPS.
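The two-minute rule can be modeled as a small function over a series of health check results. This is a simplified sketch in Python: a real ELB health check is configured through a check interval and threshold counts rather than code like this, and the function name and the (timestamp, healthy) input format are assumptions made for illustration.

```python
from typing import Iterable, Optional, Tuple

FAILURE_WINDOW_SECONDS = 120  # the two-minute period described above


def first_replacement_time(
    checks: Iterable[Tuple[float, bool]],
    window: float = FAILURE_WINDOW_SECONDS,
) -> Optional[float]:
    """Return the timestamp at which an instance should be replaced.

    checks is a time-ordered sequence of (timestamp_seconds, healthy) pairs.
    The instance is replaced once it has been failing continuously for
    longer than window seconds; a single passing check resets the clock.
    Returns None if that never happens.
    """
    failing_since: Optional[float] = None
    for timestamp, healthy in checks:
        if healthy:
            failing_since = None          # recovered: reset the clock
        elif failing_since is None:
            failing_since = timestamp     # first failure in a new streak
        elif timestamp - failing_since > window:
            return timestamp              # failing for longer than the window
    return None
```

Resetting on any passing check means a briefly slow instance is not replaced, while one that stays down for more than two minutes is.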
SOE initially supported Azure Blob storage and local storage for user-generated content such as avatars or posted images, and the newest release adds support for SQL Server storage. Our first attempt to make this data available across EC2 instances was to attach an Elastic Block Store (EBS) device to each instance, but we found that this approach increased spin-up time by about 50%. Instead, we now store images on local disk and periodically run s3 sync to copy the content between instances and S3. This approach does create a time window in which an image could be created but not yet synced to or from S3; this is on our list of items to improve upon, and we will likely switch to the new SQL Server storage option during our next upgrade.
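To make the sync window concrete, here is a minimal Python sketch of the decision a two-way sync has to make. It is not the logic of the s3 sync command, which applies its own comparison rules; the plan_sync function and the use of plain dictionaries mapping file names to modification times are assumptions for illustration.

```python
from typing import Dict, List, Tuple


def plan_sync(local: Dict[str, float], remote: Dict[str, float]) -> Tuple[List[str], List[str]]:
    """Decide which files to upload and which to download.

    local and remote map a file name to its last-modified time in seconds.
    A file is copied toward whichever side is missing it or holds the
    older copy. Nothing is ever deleted.
    """
    uploads = sorted(
        name for name, mtime in local.items()
        if name not in remote or remote[name] < mtime
    )
    downloads = sorted(
        name for name, mtime in remote.items()
        if name not in local or local[name] < mtime
    )
    return uploads, downloads
```

An image that exists only in the local listing shows up in uploads; until that upload and the matching download on every other instance have both run, requests served by those other instances cannot find it. That gap is the time window described above.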
The Microsoft SQL Server database is handled via Amazon RDS with multi-AZ duplication in order to support EC2 Web Servers in different geographical locations. We apply a backup strategy to RDS that generates snapshots regularly in case of database corruption or failure.
We encode configuration values as Terraform variables to enable selective deployment to staging or production environments. For example, we maintain a network variable that allows deployment to different VPCs, as well as variables tracking the intended Route53 zone, the base AWS Availability Zone, the EC2 Web Server instance size and type, and the ACM certificate for the ELB.
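The idea of one code base with a different set of variable values per environment can be sketched as follows. This Python example only models the concept: the field names, the environment names, and every value are hypothetical placeholders rather than our actual Terraform variables, and to_tfvars simply prints name = "value" lines in the style of a Terraform variables file.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class DeploymentConfig:
    """Per-environment values; all field names here are hypothetical."""
    network: str                 # selects the VPC to deploy into
    route53_zone: str
    base_availability_zone: str
    instance_type: str
    acm_certificate_arn: str


# Placeholder values only; none of these come from a real deployment.
ENVIRONMENTS = {
    "staging": DeploymentConfig(
        network="staging-vpc",
        route53_zone="soe-staging.example.com",
        base_availability_zone="us-east-1a",
        instance_type="t2.medium",
        acm_certificate_arn="arn:aws:acm:us-east-1:000000000000:certificate/staging-placeholder",
    ),
    "production": DeploymentConfig(
        network="production-vpc",
        route53_zone="soe.example.com",
        base_availability_zone="us-east-1a",
        instance_type="t2.large",
        acm_certificate_arn="arn:aws:acm:us-east-1:000000000000:certificate/production-placeholder",
    ),
}


def select_config(environment: str) -> DeploymentConfig:
    """Look up the variable set for an environment, rejecting unknown names."""
    try:
        return ENVIRONMENTS[environment]
    except KeyError:
        raise ValueError(f"unknown environment: {environment!r}") from None


def to_tfvars(config: DeploymentConfig) -> str:
    """Render the config as name = "value" lines, one per field."""
    return "\n".join(f"{key} = {json.dumps(value)}" for key, value in asdict(config).items())
```

Keeping the environment-specific values in one place means the staging and production deployments differ only in data, not in the infrastructure code that consumes it.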
We use several AWS security features to establish appropriate network and access control boundaries between different SOE components, users, and administrators.
Our base infrastructure is deployed into a Virtual Private Cloud (VPC) with distinct front-end and back-end subnets to classify and protect each component based on the type of traffic it expects to send and receive. Front-end subnets are meant for user-facing infrastructure like SOE’s Elastic Load Balancer (ELB), which is expected to receive direct user traffic and thus has open ingress from within the Palantir network. Back-end subnets are meant for infrastructure that isn’t expected to receive direct user traffic, including SOE’s EC2 application servers, RDS database, and Elasticsearch EC2 servers. By default, the back-end subnets only allow inbound traffic from defined bastion hosts and the ELB; we add specific outbound routes for any external resources, like logging endpoints, that each component requires. Our VPC’s default-deny security stance gives us fine-grained control over the flow of information between SOE’s different components.
We’ve modeled our inter-component connection restrictions as AWS Security Groups. The EC2 Web Servers belong to several different Security Groups that let them exchange information with other components:
- an ELB Security Group for HTTP/S ingress from and egress to Palantir internal IPs,
- an Elasticsearch Security Group for Elasticsearch EC2 access,
- a Remote Desktop and SSH Security Group for access from a list of bastion hosts,
- an S3 Security Group for image backups and synchronization,
- and an RDS Security Group for database connections.
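The default-deny stance behind these groups can be expressed as a simple allow-list lookup. The Python sketch below is a model of the idea, not our actual Security Group definitions: the component names are illustrative, and the port numbers are the common defaults for each protocol (443 for HTTPS, 80 for HTTP, 3389 for Remote Desktop, 22 for SSH, 9200 for Elasticsearch, 1433 for SQL Server), assumed here rather than taken from our configuration.

```python
from typing import FrozenSet, Tuple

Rule = Tuple[str, str, int]  # (source, destination, TCP port)

# Hypothetical rule set loosely following the groups listed above.
ALLOWED: FrozenSet[Rule] = frozenset({
    ("palantir-internal", "elb", 443),   # users reach only the load balancer
    ("elb", "web", 80),                  # ELB forwards plain HTTP to web servers
    ("bastion", "web", 3389),            # Remote Desktop via bastion hosts
    ("bastion", "web", 22),              # SSH via bastion hosts
    ("web", "elasticsearch", 9200),
    ("web", "rds", 1433),
})


def is_allowed(source: str, destination: str, port: int) -> bool:
    """Default deny: traffic passes only if a rule explicitly allows it."""
    return (source, destination, port) in ALLOWED
```

With this rule set, is_allowed("elb", "web", 80) is true, while a direct connection from the internal network to a web server or to the database is rejected because no rule names it.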
Direct access to EC2 Web Servers is disabled in favor of Windows bastion hosts, which require special permissions to connect to via Remote Desktop or SSH. In addition, we manually approve all acceptable certificates in the Certificate Manager so that only valid Route53 hosted zones can be represented by web-facing ELBs. Finally, we tie KMS keys and minimal Identity and Access Management (IAM) policies to a special Terraform AWS user, so that it can create infrastructure but can’t do much else beyond that.
Moving our SOE deployment to AWS was a net positive, and we have achieved our primary goals: better reliability and performance, predictable automated deployment pipelines, and improved security posture. Of course, such a migration requires investment and has trade-offs: developing the initial build was time-consuming, and code-defined infrastructure can be harder to understand, monitor, and debug until administrators are familiar with the appropriate tools. Overall, we believe that our team’s productivity has increased since the move.
We are actively pursuing several improvements to our SOE AWS deployment toolkit. Our first effort centers on the Amazon Elasticsearch Service’s new support for encryption at rest: we are considering migrating to it from our customized EC2 Elasticsearch boxes, which would remove the need to maintain an in-house AMI. Further, we are continuing to refine our SOE PowerShell deployment scripts to automate and combine the steps that take longest to run. We are also in the process of creating a Terraform module for SOE so that we can configure multiple SOE instances as separate Git branches. We would like to continue writing alerts and alarms for the individual components of SOE so that we can make individual services self-healing. Finally, we are continuously working with the SOE team, providing feedback from our migration to help improve the initial installation time and stability of their software.
We hope that our AWS automation story for SOE inspires and helps you with your own AWS services, and we would appreciate feedback and discussion regarding the approach presented here.
You can learn more about Stack Overflow Enterprise at https://www.stackoverflowbusiness.com/enterprise. We would like to thank the Stack Overflow Enterprise team for making themselves available to answer all of our questions while we ironed out our deployment process!