Migrating Scaled-up Infra on AWS to a Private VPC
Overview
Security is a critical piece that often gets overlooked in fast-growing startups. It is important to dedicate quality time and resources to preventing security breaches.
There are three basic levels of security:
- Infrastructure Security: You need to make sure that your entire infrastructure, whether it's on AWS, Azure or Google Cloud, is as invulnerable as possible to unauthorised access to servers. No user who isn't meant to log in and fetch data from your servers should be able to do so. This is generally achieved through a proper DDoS protection layer, a Web Application Firewall, Network Access Control Lists (NACLs), security groups, etc.
- Database Security: Tech companies use databases to store user information, transactional data and other company information that is not meant to be available to the outside world. To ensure proper security around this data, companies use strong password authentication, put databases in private subnets, add IAM roles for user access management, and so on. Data is the asset most valuable to the outside world and hence also the most tempting target. Zomato recently lost 17 million users' data due to a lack of security.
- API-level authentication and authorisation: Application servers serve all the APIs called by frontend clients such as Android, iOS, desktop web and mobile web. It is really important for any tech company to make sure each and every client is authenticated and authorised to access the data it requests through an API call. You also need proper throttling in place to make sure no API or resource is accessed beyond its usual pattern.
At Urbanclap, we had proper API-level authentication, security groups and user access control for our AWS infrastructure. However, the fact that it all still sat in a public subnet made it very fragile: a single instance, if hacked, could potentially compromise the whole infrastructure. Hence, we decided to migrate all of our infrastructure to a private subnet with very controlled and secure entry points.
While it is relatively easy to set up a private subnet, what is really tough is migrating a fully scaled, running production infrastructure. There are a lot of factors to take care of while ensuring as close to zero downtime as possible. This article describes our journey, which we broke into phases, and should be a helpful introductory guide for others who want to do the same.
Through this migration, we solved for infrastructure and database security; API-level authentication we had already taken care of in the past.
Infrastructure overview and strategy
This is a snapshot of our infrastructure before we began the migration:
- We had our complete infrastructure on AWS with around 120 servers.
- We had just a single VPC, with two public subnets and no private subnet.
- All our application servers / databases / caching / RDS / Redshift / Jenkins / data / MySQL / ECS Docker / ELB servers had public IPs associated with them.
- The only security layer we had in place was security groups. If a security group got compromised or we messed up our inbound rules, our whole infra could be compromised.
- We didn’t have strict NACL rules.
Things that we had to keep in mind while doing the migration:
- Developer productivity is paramount and their efficiency should not be compromised during the migration.
- The goal was to migrate all non-internet-facing, internal servers to a private subnet, i.e. no such server could have a public IP associated with it.
- Restricted NACLs and security groups needed to be put in place (a minimal sketch of a NACL rule follows this list).
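To make the NACL idea concrete, here is a minimal, hypothetical sketch using boto3; the NACL ID, rule number and port are assumptions for illustration, not our actual ruleset.

```python
# Hypothetical sketch: adding a restrictive NACL rule with boto3.
# The NACL ID, rule number and port are placeholders, not our real configuration.
import boto3

ec2 = boto3.client("ec2", region_name="ap-south-1")

# Allow inbound HTTPS only; anything not matched by an explicit allow rule
# is dropped by the NACL's default deny.
ec2.create_network_acl_entry(
    NetworkAclId="acl-0123456789abcdef0",  # placeholder NACL ID
    RuleNumber=100,                        # rules are evaluated in ascending order
    Protocol="6",                          # 6 = TCP
    RuleAction="allow",
    Egress=False,                          # inbound rule
    CidrBlock="0.0.0.0/0",
    PortRange={"From": 443, "To": 443},
)
```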
With a team of 3 people, we divided the whole migration into different phases and started to work towards it.
PHASE 1: Migrating developers to a dev VPC
Initially, we had only Prod and Stage environments on our AWS infrastructure. During the migration, we decided to have a development environment on AWS as well, for the following reasons:
- Changes made to the staging environment for testing would conflict with other developers' work.
- The dev environment is one that developers can actively work on, combining different versions of other services and testing their code. The staging environment is a sort of pre-prod for us, so its stability was important.
- Developers can run their own branches on the development environment, while staging always runs the latest master of all services so that it stays as close to production as possible.
So we went ahead and created a new VPC for devs and named it the dev VPC. We created two public subnets in it (one in each availability zone, for better uptime) and migrated all stage ECS servers / MongoDB / MySQL / Redis / Jenkins / LDAP / Nagios servers to it. Servers in the dev VPC were now restricted through security groups and only accessible through LDAP authentication. While this is more or less similar to our old VPC, it has one key difference: it is completely separated and isolated from the production environment.
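For reference, the skeleton of such a VPC takes only a few boto3 calls. This is a minimal sketch with assumed CIDRs and availability zones, not our exact setup; a subnet only becomes "public" once its route table points at an internet gateway.

```python
# Minimal sketch of a dev VPC with two public subnets (CIDRs/AZs are assumptions).
import boto3

ec2 = boto3.client("ec2", region_name="ap-south-1")

vpc_id = ec2.create_vpc(CidrBlock="10.1.0.0/16")["Vpc"]["VpcId"]

# Attach an internet gateway; routing 0.0.0.0/0 to it is what makes a subnet public.
igw_id = ec2.create_internet_gateway()["InternetGateway"]["InternetGatewayId"]
ec2.attach_internet_gateway(InternetGatewayId=igw_id, VpcId=vpc_id)

# One public subnet per availability zone, for better uptime.
for cidr, az in [("10.1.0.0/24", "ap-south-1a"), ("10.1.1.0/24", "ap-south-1b")]:
    ec2.create_subnet(VpcId=vpc_id, CidrBlock=cidr, AvailabilityZone=az)
```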
Note: Remember that you must not have any VPC peering set up between your VPCs. VPC peering enables machines in one VPC to talk to servers in another VPC and can compromise security.
Now that we had migrated all devs to a separate VPC, we refocused our efforts on the Prod / Stage VPC.
PHASE 2: Setting up VPN, NAT and Private Subnet
After creating a separate dev VPC, we now had to migrate all staging and production servers to a private subnet. We had a lot of microservices on Docker that talked to third-party APIs, packages and tools, and migrating them to a private subnet was a challenge.
A server in a private subnet has no public IP associated with it, so the Internet cannot reach it directly, nor can the server reach out to the Internet. To let private servers talk to the Internet, you need NAT (Network Address Translation), which proxies traffic to the Internet on behalf of your private servers.
After creating the new private subnet in the VPC and setting up NAT for it, we added a NAT entry to the private subnet's route table. Here nat-071ef497b75d57e1e is the NAT we created in the AWS console, and 172.31.0.0/16 and 172.32.0.0/16 are the private subnets set up in the VPC. The third rule in this route table states that all traffic going out needs to go through the NAT.
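In boto3 terms, the NAT setup and the default route look roughly like the sketch below; the subnet and route table IDs are placeholders.

```python
# Hypothetical sketch: create a NAT gateway in a public subnet and send the
# private route table's default route through it (IDs are placeholders).
import boto3

ec2 = boto3.client("ec2", region_name="ap-south-1")

# A NAT gateway lives in a public subnet and needs an Elastic IP.
alloc = ec2.allocate_address(Domain="vpc")
nat = ec2.create_nat_gateway(
    SubnetId="subnet-0public0000000000",      # placeholder public subnet
    AllocationId=alloc["AllocationId"],
)
nat_id = nat["NatGateway"]["NatGatewayId"]
ec2.get_waiter("nat_gateway_available").wait(NatGatewayIds=[nat_id])

# "All traffic going out needs to go through NAT": the 0.0.0.0/0 rule.
ec2.create_route(
    RouteTableId="rtb-0private000000000",     # placeholder private route table
    DestinationCidrBlock="0.0.0.0/0",
    NatGatewayId=nat_id,
)
```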
Once the NAT setup was done and the route tables were updated, we tested from a private server using
ping google.com
and it worked fine.
Now that the private subnet and NAT were created, we needed a way to access these servers over the internet for debugging or in case of any issue. There are two ways to do it:
- Boot a jump-box server in a public subnet of the same VPC and put some sort of LDAP authentication on top of it. Then you simply allow port 22 traffic on each private machine from the jump box's IP. But this approach has a significant drawback: if the jump box, sitting in a public subnet, gets compromised, all your credentials and everything in the VPC will be compromised too.
- Use a VPN (Virtual Private Network): A VPN provides a secure tunnel between the end user and the destination machine. You need the VPN set up in a public subnet of the same VPC; it then provides a separate tunnel for each and every user to send and receive traffic to the private subnet. We used OpenVPN for this, along with LDAP authentication. For further security, we allowed traffic to our OpenVPN server only from our office gateway IPs.
Once we had OpenVPN set up, we had to allow port 22 traffic from the OpenVPN server in each and every machine's security group, as in the sketch below.
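A minimal, hypothetical version of that security group rule with boto3 (both group IDs are placeholders):

```python
# Hypothetical sketch: allow SSH into a private machine only from the OpenVPN
# server's security group (both group IDs are placeholders).
import boto3

ec2 = boto3.client("ec2", region_name="ap-south-1")

ec2.authorize_security_group_ingress(
    GroupId="sg-0private0000000000",           # private server's security group
    IpPermissions=[{
        "IpProtocol": "tcp",
        "FromPort": 22,
        "ToPort": 22,
        # Referencing the OpenVPN server's security group instead of a fixed IP
        # keeps the rule valid even if the VPN server's address changes.
        "UserIdGroupPairs": [{"GroupId": "sg-0openvpn0000000000"}],
    }],
)
```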
PHASE 3: Setting up Stage infra in private subnet
After setting up the private subnet with NAT and OpenVPN, it was time to start migrating the stage infra to the private subnet. To achieve this, we did the following:
- Created a new ECS (Elastic Container Service) cluster in the private subnet, as we had multiple microservices running on Docker through ECS.
- We used an application load balancer in front of the ECS cluster, so we set up an internal (private) load balancer to allow traffic to all services running on ECS only through our public-facing Nginx server. This ensured that our services were unreachable directly from the internet (see the sketch after this list).
- Set up new database endpoints for Mongo / MySQL / Redis in the private subnet and then updated the code base to use the new ones.
- Changed deployment scripts for CI/CD on Jenkins based on the new configs.
- An AWS security group is a network security layer on top of servers within an AWS VPC, so we created new security groups with proper inbound rules defining which ports should be open and which servers were allowed to send them traffic.
- Once all the above tasks were completed, we changed our Nginx config to point to the new endpoints.
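Creating the internal load balancer mentioned above boils down to a single flag; here is a minimal boto3 sketch with a placeholder name, subnets and security group.

```python
# Hypothetical sketch: an internal application load balancer for private ECS
# services (the name, subnets and security group are placeholders).
import boto3

elbv2 = boto3.client("elbv2", region_name="ap-south-1")

resp = elbv2.create_load_balancer(
    Name="stage-ecs-internal",
    Scheme="internal",             # no public IP; resolvable only inside the VPC
    Type="application",
    Subnets=["subnet-0private0000000001", "subnet-0private0000000002"],
    SecurityGroups=["sg-0internalalb000000"],  # should admit traffic from Nginx only
)

# The internal DNS name becomes the upstream in the public-facing Nginx config.
print(resp["LoadBalancers"][0]["DNSName"])
```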
After this, our stage infra was ready and people were able to use it without any issue.
PHASE 4: Migrating production data to private subnet
Production data migration was one of the riskiest, most time-consuming and focus-intensive parts of the process, because:
- Migrating a live MongoDB cluster from public to private: For this, you need to set up new machines in the private subnet, sync them with the running production databases and do continuous deployments across all services. Zero downtime is a must. There are a number of steps you need to follow to ensure that, which would require a separate post.
- MongoDB Cloud Manager: We were using MongoDB Cloud Manager to manage one of our main MongoDB clusters, and we were not sure how it would be able to reach the private servers without a public IP. But thankfully, Cloud Manager sends no inbound traffic to the servers it manages; its agents initiate only outbound traffic to Cloud Manager.
- RDS: We had RDS MySQL instances set up as publicly accessible servers with security group restrictions. To move them to the private subnet, we changed their publicly accessible DB parameter to No (see the sketch after this list). The good part is that this didn't change the DNS names of the RDS instances; it just removed the public IPs associated with them.
- Redshift: This is our data warehouse, from which all our reporting and metrics are derived. Previously, our Redshift cluster was publicly accessible. We had to take some downtime to replicate a new cluster in the private subnet and deploy the new endpoint across production services.
- Redis: We use Redis Cluster as a caching layer to power SEO, authentication, API caching, etc. To migrate the Redis cluster from the public to the private subnet, we booted an equal number of instances in the private subnet, added them as slaves to the running Redis Cluster and then gracefully shut down the older ones.
- ELK: We have our own ELK stack that we use for application logging. We shifted the entire ELK cluster from public to private, restored the old snapshots on the new private cluster and replaced the endpoint everywhere in our production systems.
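The RDS change above maps to a single API call; a minimal sketch with an assumed instance identifier:

```python
# Hypothetical sketch of the RDS change described above: turning off public
# accessibility keeps the DNS name but removes the public IP.
import boto3

rds = boto3.client("rds", region_name="ap-south-1")

rds.modify_db_instance(
    DBInstanceIdentifier="prod-mysql-1",   # placeholder instance identifier
    PubliclyAccessible=False,
    ApplyImmediately=True,                 # apply now, not at the next maintenance window
)
```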
We did this end to end for each and every production database and deployed it across services.
Note: A few people might wonder how phase 4 could be completed without migrating the production servers. Well, within the same VPC, servers can talk to each other irrespective of whether they have public IPs. So within a VPC, a public server can directly reach a private server without any proxy.
PHASE 5: Migrating production to private subnet
Now, finally, we were left with migrating our production servers to the private subnet.
Whatever we did to migrate the stage infra had to be done for the prod infra as well to migrate our application servers completely. Only this time there were many more endpoints, and we had to be extra careful with our changes.
This was a fun exercise, as we had to do the migration during night hours, when traffic on our application servers was the lowest. We had a few glitches due to restricted firewall settings, which we rectified as soon as we identified them.
During this migration, we moved around 58 servers to the private subnet and around 17 servers to the newly built dev VPC. We changed endpoints for around 30 machines and put them in more restricted security groups and subnets.
This sums up our effort to move our whole infrastructure to a more secure private subnet. While this was a good overview of how we broke the whole migration process into phases, in the coming weeks, we will outline some of the steps in more detail in separate posts :)
Thanks to Aditya Chowdhry and Rakesh Arora.
We are hiring! Drop your resumes at mohitagrawal@urbanclap.com.