My goal is to share the roses and thorns of our recent migration from an on-prem solution to AWS. It has been a fun and dynamic trip. About a year and a half ago, our company (Primero Systems, a software development company) decided to make a big change in direction: broaden our horizons, use more open source software and, of course, migrate to the cloud. This article is about the IT aspect of our migration. A coworker wrote a similar article about the application aspect, which you can read here.
We were using 99% Microsoft software. The software development team used Visual Studio, Microsoft SQL Server, Office, etc. Our data center had 100% Windows Server machines, most of them running Hyper-V with various VMs. I had read lots of criticism of Hyper-V, but for us it worked beautifully, especially the replication function: we were able to fail over hardware with almost no loss of information. Our infrastructure was standard and solid: Cisco hardware with HSRP (both for the WAN feeds and internally). We had redundant everything: switches, servers, hard drives, power, SQL mirroring, Hyper-V replication, etc. We had periodic fail-over testing, pen testing, configuration backups, database backups, offsite backups. We had VLANs for secure servers (PCI compliance), plus hardware and software monitoring with 1-minute thresholds and email and voice alerting. This was years of thinking, studying, changes, testing and investment. One of the most difficult parts of moving to the cloud is changing your mindset: you have to stop worrying about some problems and start worrying about new ones.
Our first fork in the road: which cloud provider should we use?
I mentioned we were 99% Microsoft (and using Office 365), so our first look was at Azure. We were also migrating one of our clients to Azure at the time, and all the marketing and demos were pretty. I signed up for a 1-month trial (with limited credits) to do some testing. Long story short, Azure performs well, but AWS won us over: our developers are spread all over and need fast access, AWS has had exponential growth in regions and availability zones, it seemed a bit cheaper, and its service catalog is huge. For those reasons we did not even look at other cloud providers.
AWS it is, where do we start?
Hats off to the Udemy and A Cloud Guru online courses. Besides the official AWS documentation, those (very cheap) courses were essential for training. I've done many MS certifications before, all of them self-taught; I wish I had had a Udemy course back then. I was used to my fixed, eternal VMs, all of them perfectly available all the time. The first mindset change in AWS is that everything should be thought of as temporary. You need to figure out a way to regenerate things quickly. Nothing is eternal.
A quick example is web servers. There are many ways you can deploy web sites. In our DevOps suite, we use Jenkins with MS Web Deploy to compile VS projects with MSBuild and deploy them to IIS, and you can point to as many IIS servers as you want. In a presentation I saw a new way of deploying sites in AWS (which may or may not be better, but is certainly different): the "Golden Image". Basically, you deploy to a VM that is not in production, and once you finish your deployment, you create an image from it. The load balancers are then configured with Auto Scaling groups that create new VMs from this golden image. That is changing the mindset: if you have 2 servers, our DevOps strategy is probably good enough; if you have 20 servers, the golden image strategy sounds easier.
A really big mind changer was attending re:Invent 2017 in Las Vegas. It was an overdose of AWS. The boot camps, chalk talks, sessions and all the other fun stuff really pushed me to think differently. The event was so big that I should probably write an entire post about it. After attending, I had a few more months of studying until I took the exam: AWS Solutions Architect — Associate (thanks to A Cloud Guru for the training!).
We quickly learned that there are some key aspects of AWS.
EC2 instances -> VMs. You have "basic" instances with just the OS, and a marketplace where you can buy pre-installed machines. You pay for the type of machine you want to use (CPU, memory, disk). For testing purposes, we used the "T2" machines (cheap, or free for a year). These are great temporary machines, but they have a limit on the CPU credits you can use. In some of our testing a machine was brought to its knees, slow; we learned about CPU credits the hard way. There are many types of instances, some optimized for memory, GPU, DB, etc., and you must learn which instance to use in each case. These instances should not hold any persistent data: if you can avoid storing data on the instances, you gain flexibility for scalability. For example, if you have a web site that saves reports as PDFs, you probably have a reports folder on the web server. What happens if I run a report and it is stored on server A, but John is connected to server B and needs to access the report? You would need a synchronization program between the servers. That is probably not that hard with 2 servers, but in AWS the fleet can be dynamic: you can have 2 or 20, depending on the load. This is solved with our next key aspect: S3.
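To make the CPU credit behavior concrete, here is a toy simulation of a burstable T2 balance; the defaults roughly match a t2.micro (about 6 credits earned per hour, 144 max balance), and one credit is one vCPU running at 100% for one minute. This is a rough sketch for intuition, not AWS's exact accounting.

```python
def credit_balance(start, minutes, utilization,
                   earn_per_hour=6.0, max_balance=144.0):
    """Simulate a T2 instance's CPU credit balance minute by minute.

    utilization is the vCPU fraction used (1.0 = pegged at 100%).
    Defaults approximate a t2.micro; other sizes differ.
    """
    balance = start
    for _ in range(minutes):
        balance += earn_per_hour / 60.0   # credits accrue continuously
        balance -= utilization            # credits burned this minute
        balance = min(max(balance, 0.0), max_balance)
    return balance

# A full balance pegged at 100% drains in roughly 2.5 hours;
# after that the instance is throttled to its baseline.
print(credit_balance(144.0, 60, 1.0))
```

This is exactly how our test machines "went down to their knees": the balance hit zero and the instance was throttled to its baseline.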
S3 is online object storage (I believe Dropbox used S3 in their backend). You save files in S3 using a bucket, which is essentially a virtual hard drive. S3 is huge: you can store a lot of data and it is very cheap. Permissions are kind of odd. You can even host static web sites on S3 and point your domain at the bucket (see CloudFront and S3 static sites). It supports versioning of files, and you can use Glacier to store infrequently accessed S3 files at an even cheaper price. Going back to the persistent data issue: you can use S3 to store the persistent data. When the web site creates the report, it writes a file to a bucket, and any of your web servers can access it. How do we deal with permissions? Next key: IAM.
IAM is the AWS security module. You can give a user 2 types of access: console access (basically logging in to the dashboard, menus, etc.) or programmatic keys ("passwords" used to access AWS services through a CLI, SDK, etc.). You can specify granular access for each user using policies.
Your application can access these buckets using the keys, and it will only be able to perform the functions the policy allows. Better yet, you can assign a Role (with similar policies) to your instance: everything that runs on that instance will have the access the policy grants, and you don't need to use any keys at all. This is an AWS best practice. You must make sure your instance is safe, though; otherwise, anyone with console access to your instance gets access to all those resources.
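For the reports-in-S3 scenario, the policy attached to the role can be scoped down to just that one bucket. Here is a minimal sketch of such a policy document (the bucket name is a placeholder, and a real setup might restrict actions further):

```python
import json

def bucket_read_write_policy(bucket):
    """Build a minimal IAM policy allowing read/write of objects in a
    single bucket, plus listing it, and nothing else."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {   # object-level permissions apply to bucket contents
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject"],
                "Resource": f"arn:aws:s3:::{bucket}/*",
            },
            {   # listing is a bucket-level permission, separate resource
                "Effect": "Allow",
                "Action": "s3:ListBucket",
                "Resource": f"arn:aws:s3:::{bucket}",
            },
        ],
    }

print(json.dumps(bucket_read_write_policy("my-reports-bucket"), indent=2))
```

Attach a policy like this to the instance role and the web servers can read and write reports without any stored keys.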
Load balancers are a very important part of your architecture. Whenever you can (even if you start with one server), your application should face the internet through a load balancer. You can add/remove instances from a load balancer in seconds, and this can even be done automatically with auto scaling: if you have high memory usage, you can spin up other instances from your golden image and add them to your load balancer, then scale down again after peak hours. There are 3 types of load balancers: Application Load Balancer, Network Load Balancer and Classic Load Balancer.
Application Load Balancer: great for new applications, especially if you run them in Docker containers (or ECS). You can terminate SSL directly on the balancer (this is not as easy as it sounds; we had a 4096-bit SSL certificate, not all AWS services are compatible with it, and we ended up uploading the certificate via the AWS CLI because the wizard didn't allow it). Notably, you can route traffic depending on the URL. Example: webtreepro.com/Help -> routes to server A, webtreepro.com/Contact -> routes to server B. This was great for a microservices API project we had: we ended up with one main domain, and traffic was routed to the different APIs depending on the URL. The IPs are dynamic, so the suggestion is to use a CNAME from www.webtreepro.com to the load balancer URL. You can't use an A record because the IP will change.
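Conceptually, the ALB evaluates path-pattern rules in priority order and forwards to the first matching target group, falling back to a default action. A rough local sketch of that decision (the rule patterns and target group names here are illustrative, not our real configuration):

```python
import fnmatch

def route(path, rules, default):
    """Mimic ALB path-based routing: check path-pattern rules in
    priority order; first match wins, otherwise use the default."""
    for pattern, target_group in rules:
        if fnmatch.fnmatch(path, pattern):
            return target_group
    return default

RULES = [                      # checked top to bottom, like rule priority
    ("/Help*", "help-servers"),
    ("/Contact*", "contact-servers"),
]

print(route("/Help/faq", RULES, "main-servers"))   # matches first rule
print(route("/Pricing", RULES, "main-servers"))    # falls through to default
```

In our microservices project, each API got its own target group and one rule like these, all behind a single domain.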
Network Load Balancer: a TCP load balancer; any traffic received on port X is rerouted to the registered servers on the same port. We used this for legacy apps. The SSL certificate is installed directly on the servers (nginx, IIS, etc.), and it does not allow routing per URL. It can also be assigned a static IP (which we needed!).
We didn’t bother to use the classic load balancer because it was tagged as older generation, etc. We had what we wanted with the two above.
Our next key aspect: The database.
This was also a mind changer. Database servers are one of those items on the list with the slogan "if it works, don't touch it!". On-prem we had way too much hardware for our usage, and the same machine in the cloud would cost us thousands of dollars. Did we need that big a machine? This is where the mindset change comes in: you should use what you need, and if it isn't enough you can scale up. It was very hard to establish a baseline of what is "enough". We had to look at many reports and graphs and analyze a lot of timings, data, hard drive speeds, etc. So now what?
We had 2 options for running the database in the cloud (out of the whole migration process, we spent most of our time on DB-related topics).
Use an EC2 instance and install the engine on it.
Use RDS (database as service).
Each has its pros and cons. With an EC2 instance, you oversee all the software installed. It's like our on-prem DB server, but in AWS's data center: all the patches, installations, etc. are done manually, and you are responsible for licensing any software (other than the OS) installed. In our case, we were using Microsoft SQL Server, so we could have chosen to install on EC2 and use our own license (though Microsoft may limit this, depending on the type of license you have).
Using RDS is very flexible. You run a wizard, choose the engine, select the type of machine, security options, etc., and in 15–20 minutes you have a DB engine running. You can even have replication to another availability zone (if the engine is compatible). Backups are automatic, and Point-in-Time restore can give you a new instance within a time frame of your choice (5-minute intervals). Sounds great, doesn't it? Well, it is, but not everything is rosy in RDS. There is a huge price difference between the engines. AWS promotes Aurora (MySQL- or PostgreSQL-compatible), which has great features at an acceptable price, but we were not starting a new project; we were migrating what we have to AWS. We simply cannot rewrite our apps and DBs to use Aurora (at least not yet).
Looking at the different editions of SQL Server, they offer one edition I wasn't aware of: Web Edition. It sits between SQL Express and SQL Standard: it's cheap and does not have the SQL Express limitations, but it has some of its own, such as no mirroring or replication. The Standard edition lets you run a main server in AZ1 and replicate to AZ2, all done behind the scenes without intervention. You can decide which edition to use depending on the downtime you are willing to accept if there is a problem.
The pain was uploading our data to RDS. We added more spice to this: our on-prem engine was a bit old and didn't have some of the latest migration features. After exploring, there were 2 options:
Using the Native Backup/Restore.
Using DMS (a migration service).
The conventional backup/restore process does not exist in RDS: you cannot use the wizard, scripts, etc. Instead, they offer a stored procedure that backs up to and restores from an S3 bucket. This feature is not available out of the box; you have to enable it. The plan was to do a full backup on-prem, upload it to S3, restore it and leave it in NORECOVERY, then at cutover time do a differential backup, upload it and finish the restore. Sounds good? Well… RDS does not allow you to restore a differential backup. This was essential for us, but that feature is not yet available. It's funny that the backup stored procedure does have a differential option: you can take a diff backup, but you cannot restore it. So if we wanted to use native backup/restore, we would need to stop our apps, make a backup and upload it; if that takes 3 hours, that's 3 hours of downtime. Spoiler: we ended up using this option. To minimize the downtime, we migrated on-prem to SQL 2016 Standard so we could use backup compression.
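The stored procedures in question are msdb.dbo.rds_backup_database and msdb.dbo.rds_restore_database (available once the SQL_NATIVE_BACKUP option group is enabled on the instance). A small helper that builds those T-SQL calls might look like this; the bucket and database names are placeholders, and parameter escaping is omitted for brevity:

```python
def rds_backup_sql(db, bucket, key):
    """T-SQL asking RDS to back a database up to an S3 object via the
    msdb.dbo.rds_backup_database stored procedure."""
    return (
        "exec msdb.dbo.rds_backup_database "
        f"@source_db_name='{db}', "
        f"@s3_arn_to_backup_to='arn:aws:s3:::{bucket}/{key}'"
    )

def rds_restore_sql(db, bucket, key):
    """Matching restore call via msdb.dbo.rds_restore_database."""
    return (
        "exec msdb.dbo.rds_restore_database "
        f"@restore_db_name='{db}', "
        f"@s3_arn_to_restore_from='arn:aws:s3:::{bucket}/{key}'"
    )

print(rds_backup_sql("proddb", "yourbucket", "prod/proddb.bak"))
print(rds_restore_sql("proddb", "yourbucket", "prod/proddb.bak"))
```

You run the generated statements with any SQL client connected to the RDS instance, then poll rds_task_status to watch progress.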
DMS is a tool that captures data from one DB (the source) and migrates it to another (the destination). The source and destination do not have to be the same DB engine type, which is great: we used it to migrate a MySQL DB to a PostgreSQL DB. Some changes needed to be made, but the data was migrated. It also offers periodic syncing of data, and we read that many companies use this sync option to continuously replicate data to the cloud, minimizing cutover downtime. Well, as stated above, we did not use this tool for the SQL migration: our SQL engine was not compatible with some of the features needed to run DMS against the RDS instance.
Since we are a software development company, we have "template" projects containing best practices, standard strategies, schemas, etc. When a new project starts, we usually start from this template. While restoring a second DB in RDS I received an error: Cannot restore DB because a DB with the same Family GUID already exists. What is this? RDS does not allow you to take a backup of a DB and restore it as another DB; it is not yet compatible with having two or more DBs with the same family GUID. This was a showstopper for us. What happens when we need to restore a backup for troubleshooting? The AWS answer is: restore into a new RDS instance from a snapshot. Yes, that is an option, but that is extra money. And before worrying about troubleshooting post-migration, I was getting worried about the migration itself. After testing many different tools and searching many different blogs, the only solution was to recreate the offending database: on-prem, you export the DB as a bacpac file and import it with a different name, which creates a new family GUID on the DB. This process is slow; we had to build a huge server with flash drives to accelerate it. Using the SSMS wizard is a lottery (we started getting memory error messages, etc.), so we ended up restoring the bacpac file from the command line.
Uploading SQL backups to S3 is simple: install the AWS CLI on-prem and run: aws s3 cp g:\backups\database.bak s3://yourbucket/prod/
What was our migration strategy?
We have a non-prod and a prod environment, and you can probably guess that we migrated the non-prod environment first. In training we learned about VPCs, public and private subnets, NAT gateways, etc. For non-prod we created a VPC with public-facing load balancers; our web servers and task servers are in private subnets. We also created an instance with a VPN server. None of our instances are public-facing (they are only reachable via the load balancer). Jenkins does all the deployment magic (it does not use the golden image strategy yet).
Testing in both prod and non-prod was done by modifying our hosts file and pointing the domains to the load balancer IP (static). Weeks before the cutover we ran the release, deploy and migration playbook, so the day of migration was not very different: our old sites and DBs were set to read-only, and we used the backup/restore strategy above.
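The hosts-file trick is just a couple of lines in C:\Windows\System32\drivers\etc\hosts (or /etc/hosts), mapping the production domain to the load balancer's static IP before DNS is cut over. The IP below is from the documentation range and purely illustrative:

```
# temporary entries for pre-cutover testing (illustrative IP)
203.0.113.10    webtreepro.com
203.0.113.10    www.webtreepro.com
```

Remove the entries after testing so the machine goes back to normal DNS resolution.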
What role did the DNS records have in the migration?
I mentioned using CNAMEs instead of IPs. You must be careful when recreating load balancers because the address changes, and the addresses are very cryptic domain names. One of our apps is a CMS tool used by hundreds of domains (www.webtreepro.com), and those domains must point to our load balancer. Each time I recreated the load balancer I would have to send hundreds of customers the new LB URL. That is not acceptable from my point of view. I ended up creating a record like wtpprodnlb.primerosystems.com and had all customers point to that domain, which in turn has a CNAME to the AWS LB. If I ever must recreate the load balancer (hopefully never), I just need to modify wtpprodnlb.primerosystems.com.
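In zone-file terms, the indirection looks something like this (the customer domain and the AWS load balancer hostname are illustrative; only wtpprodnlb.primerosystems.com is the real stable name we control):

```
; customer domains point at a stable name under our control
www.customer-site.com.           CNAME  wtpprodnlb.primerosystems.com.

; our stable name aliases the current AWS load balancer
wtpprodnlb.primerosystems.com.   CNAME  wtp-prod-nlb-0123456789.elb.us-east-1.amazonaws.com.
```

If the load balancer is ever recreated, only the second record changes; no customer touches their DNS.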
This was the first phase
Now that our apps are in AWS, we have a lot of flexibility in where our instances are deployed, and we can comply with many regulations that require offsite disaster recovery, etc. It's great to be able to deploy servers all over the world almost instantly. We are now using CloudWatch to monitor the servers and are uploading all kinds of logs to AWS. There are many analytics tools we plan to explore to learn business patterns from these logs.
During this migration we also deployed .NET apps using dockerfiles on ECS (Docker), used CloudFront for static website hosting (on S3!), migrated our monitoring tools to CloudWatch, and much more. I might write a couple more articles if time permits.
I hope you enjoyed reading my experience on our AWS migration. I’m sure we will have many more because we are a step away from being AWS Partners.
IT Manager, Primero Systems Inc.