Cloud lock-in cost me thousands: how I reduced infrastructure costs 10x (From AWS to Kops Kubernetes)

As a solo startup founder with high infrastructure demands I needed to cut costs to be profitable. Migrating away from AWS services to open-source alternatives made running a SAAS project 10 times cheaper.

All clouds offer convenience at a price. My experience with AWS lock-in proved costly.

This is a detailed post about how I was able to make my software project profitable by drastically reducing costs. It first covers my service requirements and infrastructure before examining how I identified and then reduced monthly costs.

Many software projects rely heavily on cloud services. These services are great for prototyping but can become prohibitively expensive at scale. Open source alternatives were a key aspect of my cost reduction.


Background

I founded my software-as-a-service project MailSlurp 6 months ago. It’s an API that lets users create random email addresses on demand and then send and receive emails from them.

It was designed to help people test email-dependent processes in their applications — such as user sign-up, password resets, and email notifications. As a testing service it is used primarily in CI/CD pipelines. This means that the MailSlurp API is invoked at a high scale from automated tests and deployments that users trigger in their development workflow.

Requirements

As MailSlurp is a paid API that deals with the sending and receiving of emails it needs a few things: an SMTP server, a REST API, and a GUI for users to sign up and manage their accounts. I decided right from the start that I would offload the SMTP aspect to a cloud provider, in my case AWS SES. The rest I would build myself using technologies I was familiar with.

False starts: Serverless

The Serverless future hasn’t been completely figured out yet

My first attempt to build MailSlurp was as a serverless Typescript “application” (collection of scripts) that used AWS SDKs to process and store data. SES would invoke certain scripts on email events and I used AWS Cognito for user management, DynamoDB for a database and S3 for email storage. The serverless approach was initially a success but I soon ran into issues:

  • The AWS API Gateway was a rigid and proprietary system
  • Serverless Typescript lacked project structure, library maturity, and features compared with traditional frameworks. (I generally reach for Kotlin and Spring because of the JPA ORM, database migrations, separation of concerns, and excellent test support.)
  • Lack of structure or convention led to a buggy app that was hard to test and reason about.
  • Serverless API was difficult to document or generate SDK clients for. (Other solutions provide automatic Swagger support and client generation.)

Serverless redux: Java + Spring?

I was still happy with the pricing and scaling of AWS Lambda so I decided to rewrite the API in Spring and Kotlin in an attempt to solve the language and library issues I’d encountered with Typescript.

AWS Lambda supports Java and after a lot of head-scratching I was able to run SpringBoot in a serverless way. But I quickly encountered an issue I wish someone had warned me about: the start up time for a Spring application in a serverless environment took around 20 seconds per request…

The serious performance implications of JVM load time and Java application bootstrapping are not clearly discussed on Lambda marketing pages. In my opinion Java is NOT a suitable choice for public facing serverless applications.

I read around and learned about a “pre-heating” approach that some people use. This involves creating another Lambda to periodically ping your Lambda endpoints in order to keep them warm. This means that Spring doesn’t need to boot up for each request but the concept itself is seriously flawed.

Each warming request will only keep one Lambda instance hot. As soon as you get more requests per second than your warmer performs AWS will spin up more cold instances and their first response will take the full boot-up time! This means that Spring in a serverless environment does not scale.
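The scaling problem can be sketched with a deliberately simplified model. It assumes each Lambda instance serves one request at a time and that a single warmer keeps exactly one instance hot. The function name `cold_starts` and the burst sizes are illustrative only, and the 20 second figure is the approximate start-up time mentioned above.

```python
COLD_START_SECONDS = 20  # approximate Spring start-up time quoted above
WARM_INSTANCES = 1       # assumption: one warmer keeps one instance hot


def cold_starts(concurrent_requests: int, warm_instances: int = WARM_INSTANCES) -> int:
    """Number of simultaneous requests that land on a cold instance."""
    return max(0, concurrent_requests - warm_instances)


if __name__ == "__main__":
    for burst in (1, 2, 10):  # illustrative burst sizes
        cold = cold_starts(burst)
        print(f"{burst} concurrent requests: {cold} wait ~{COLD_START_SECONDS}s for a cold start")
```

Under these assumptions every request beyond the first in a burst pays the full boot-up time, which is why warming helps a quiet endpoint but not a busy one.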

Back to Docker

Diving in with the Docker whale

Once I realized Serverless Java is not currently an option for high-demand APIs (due to start times) I migrated the application to a more traditional SpringBoot Docker setup. The layout conventions and strong typing made this relatively painless. I kept the existing AWS services for data and user management (namely S3, Dynamo, and Cognito) and added new endpoints for SES payloads.

The result was far simpler and meant I could now run the application in any environment that supported Docker containers. I had a few options.

Docker deployments

Docker is fantastic because there are countless ways to run a container. Here are four common methods that AWS supports in ascending complexity and roughly descending price:

  • Elastic Beanstalk
  • ECS + EC2/Fargate
  • EKS managed Kubernetes
  • Kops self-managed Kubernetes

I eventually tried every one of these options and learnt one major thing:

Initial investment in complex but scalable solutions saves considerable time and money in the long run. Plus it has many development benefits.

The old adage “move fast and break things” just isn’t suitable for small SAAS projects. Research and planning of infrastructure is necessary for a stable product and one’s own sanity.

AWS and Docker: master of none

Shipping containerized applications is more difficult than it should be.

So now that my MailSlurp service was rewritten and packaged in a Docker container I could try a new approach to Serverless and deploy it with one of the many AWS offerings.

At first I tried Elastic Beanstalk. This was simple and fast but quickly became a black-box of pain when things started breaking. The API instances would die randomly or they’d fail to scale and I had no well documented ways to debug them. I don’t have a lot of love for the AWS documentation and after some frustrating deployment failures I migrated from Elastic Beanstalk to ECS on EC2. I did so as ECS is positioned by AWS as a logical progression from Elastic Beanstalk for those who want more control.

ECS is a proprietary container service from AWS similar to Elastic Beanstalk but with far more configuration available. It also has great Terraform integrations. I set up ECS environments using Terraform and found them reliable and enjoyable to work with. But… I quickly ran into issues similar to those I’d faced with Elastic Beanstalk.

Now you might ask whether MailSlurp itself was the common issue between these two different deployment systems and I did ask myself that too. I load tested a single instance in a local environment and found it stable for indefinite periods. But on ECS things would fall apart after a few days.

When things went wrong it was very difficult to work out why. Tooling and documentation for ECS are all proprietary, which means searching for common solutions is difficult.

In the end I bit the bullet and migrated everything to EKS, a managed Kubernetes service from AWS. Kubernetes is what I should have gone for all along as it has a huge community and countless support articles and tutorials.

AWS EKS is just Kubernetes but AWS manages the Master Node while you manage Node autoscaling and the cluster itself. As AWS is extremely popular and Kubernetes is the predominant container orchestration system this combination was straightforward to set up. It can be deeply configured with Terraform too, so the transition suited my previous codebase.

Ready for customers

So, after a couple of months I had developed an application, migrated it from Serverless to Spring, Dockerized it, and then tried out several container deployment options. In the end I’d come to a pretty familiar stack: EKS managed Kubernetes, DynamoDB, S3, and Cognito. MailSlurp was running and users were signing up. Requests were increasing and services were scaling well.

Then the bills came.


The bills arrive

AWS bills — you look, you die!

DynamoDB and EKS do not scale financially. They’re extremely easy services to get started with but the costs soon grow disproportionately large when compared with open-source solutions operating at the same scale. Let’s look at my actual bills over the first few months of deployment.

April 2019
EKS: $66
DynamoDB: $125

My first bill for a month of public API use included $66 for EKS use and $125 for DynamoDB. Note that at this stage I only had a handful of paying customers. Including all the other AWS services I used this monthly bill came to over $500. This is totally acceptable for a funded startup but for a side-project that doesn’t have a clear revenue stream or even a proven market model this is a significant amount.

In May usage increased and so did the bills. Dynamo was now costing $307 per month and EKS was again $66.

May 2019
EKS: $66
DynamoDB: $307

My database costs had more than doubled in the space of a month. Total monthly bills for a small side-project on AWS were over $1000.

In June the project Dynamo bill was more than $600. It was time to take action and find a way to drastically cut costs or else pull the plug on MailSlurp.
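To make the trend concrete, here is a small sketch that computes the month-over-month growth of the DynamoDB line item from the figures above. The June value is a lower bound (the bill was "more than $600"), and the helper name `month_over_month` is just for illustration.

```python
# DynamoDB line item from the bills above, in USD.
# June is a lower bound: the bill was "more than $600".
dynamo_bills = {"April": 125, "May": 307, "June": 600}


def month_over_month(bills: dict) -> dict:
    """Growth factor between each pair of consecutive months."""
    months = list(bills)
    return {
        f"{prev} -> {cur}": bills[cur] / bills[prev]
        for prev, cur in zip(months, months[1:])
    }


growth = month_over_month(dynamo_bills)
for period, factor in growth.items():
    print(f"{period}: x{factor:.2f}")
```

The bill grew by a factor of about 2.5 from April to May and by at least about 2 from May to June, which is the kind of curve a side-project cannot follow for long.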


Taking action

The AWS Cost Explorer is a great way to examine costs by service. After the initial panic had passed I drilled down through the pricing to find services I could alter or eliminate.

EKS to Kops

The first place I could cut costs was EKS. I didn’t really need it — you can manage your own Kubernetes master easily enough with Kops.

Migrating from EKS to Kops was a bit of a hassle as it required that I learn more about Kubernetes fundamentals and the idiomatic way that Kops instantiates a cluster. In the long run this is a benefit, as more ownership and knowledge of your software stack always pays off.

Kops isn’t exactly free either. While EKS is roughly $792 per year ($66 per month), a yearly lease for a recommended master node type (EC2 m3.medium) is still around $400. For me and MailSlurp this was still a worthwhile saving.

Goodbye Dynamo

Dynamo was a disaster for MailSlurp and I would not recommend it for other fledgling startups. Maybe I’m naive and don’t understand database design and the advantages and compromises that each technology offers but Dynamo is positioned by AWS as a one-size-fits-all solution for everyone.

Dynamo marketing examples show how to read and write data in a few lines of Javascript but these simple scenarios are exactly the wrong thing to consider when selecting a database.

You should prioritize affordability and long-term flexibility of a database over how easy it is to write a hello world.

The ongoing cost of Dynamo was obviously untenable, especially if MailSlurp continued to grow. I decided to migrate the API’s data layer back to a more traditional database like PostgreSQL. Postgres is an open source database that is extremely well supported. It slots right into Spring and was a surprisingly straightforward refactor.

I rewrote my Entity classes and then exported all the Dynamo data, drank a big flask of coffee and transformed the export into one large Postgres migration. Postgres is much easier to run locally than Dynamo so I could test everything thoroughly on my own machine before redeploying to Kubernetes.
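As a rough illustration of that transformation step, the sketch below turns one item in DynamoDB's typed JSON format (for example `{"S": "abc"}` for a string) into a SQL `INSERT` statement. This is not the actual migration script: the table name `inbox`, the column names and the sample item are hypothetical, and only the `S`, `N`, `BOOL` and `NULL` attribute types are handled.

```python
def unwrap(attr: dict):
    """Convert one DynamoDB-typed attribute, e.g. {"S": "abc"}, to a Python value."""
    (type_tag, value), = attr.items()
    if type_tag == "S":
        return value
    if type_tag == "N":  # DynamoDB encodes numbers as strings
        return float(value) if "." in value else int(value)
    if type_tag == "BOOL":
        return value
    if type_tag == "NULL":
        return None
    raise ValueError(f"unsupported DynamoDB type: {type_tag}")


def sql_literal(value) -> str:
    """Render a Python value as a SQL literal, escaping single quotes."""
    if value is None:
        return "NULL"
    if isinstance(value, bool):  # check bool before int: bool is an int subclass
        return "TRUE" if value else "FALSE"
    if isinstance(value, (int, float)):
        return str(value)
    return "'" + str(value).replace("'", "''") + "'"


def to_insert(table: str, item: dict) -> str:
    """Build one INSERT statement from a DynamoDB-typed item."""
    columns = sorted(item)
    values = [sql_literal(unwrap(item[column])) for column in columns]
    return f"INSERT INTO {table} ({', '.join(columns)}) VALUES ({', '.join(values)});"


# Hypothetical exported item.
sample_item = {
    "id": {"S": "inbox-1"},
    "email_count": {"N": "3"},
    "owner": {"S": "O'Brien"},
}
print(to_insert("inbox", sample_item))
```

Generating plain SQL like this is convenient for a one-off migration file that can be replayed against a local Postgres. For anything long-lived, parameterized queries through a database driver are the safer choice.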

Like Kops, Postgres isn’t free. I had to reserve an RDS instance (and this does not automatically scale!) that met the needs of MailSlurp. This worked out at around $400 a year too. Not bad considering my Dynamo bills were more than that per month!
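Putting the two changes side by side, here is a back-of-the-envelope comparison using only figures quoted in this post. It assumes the Dynamo bill would have stayed flat at the June level (a lower bound) and it covers just these two line items, not worker nodes, S3, SES or Cognito.

```python
MONTHS_PER_YEAR = 12

# Figures quoted in this post, in USD.
before_monthly = {"EKS": 66, "DynamoDB": 600}             # DynamoDB: June bill, a lower bound
after_yearly = {"Kops master": 400, "RDS Postgres": 400}  # both "around $400" a year

before_yearly_total = sum(before_monthly.values()) * MONTHS_PER_YEAR
after_yearly_total = sum(after_yearly.values())
ratio = before_yearly_total / after_yearly_total

print(f"before: ${before_yearly_total}/year")
print(f"after:  ${after_yearly_total}/year")
print(f"ratio:  {ratio:.1f}x")
```

For these two services alone that is roughly a tenfold reduction, and the gap would only have widened if Dynamo usage had kept growing.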


Learning from mistakes (looking ahead)

MailSlurp was my first paid service and what I’ve learned from it is worth more than any recurring subscriptions. In my excitement to implement a service I believed in I reached for the easiest infrastructure solutions. Maybe I fell for marketing claims; maybe I was just lazy or naive. In the end the service worked well but my wallet took a large hit. You have to take risks in order to succeed and spending a bit of money to figure out what works best for your application is fine in my view.

A clear take-away has been that tried-and-true open-source solutions are often far better long term choices than trendy proprietary services. This applies to other SAAS products too, MailSlurp included. You need to weigh how much something costs against what it saves you. DynamoDB wasn’t worth it for me. I could build the same thing with some time and patience myself using Postgres. Managed container services were similarly costly. In the end I had far more control and far fewer expenses with a self managed Kubernetes/Kops cluster.

For services like MailSlurp I’d argue there is a difference. Infrastructure is not software. Software as a service has many functions, edge-cases and capabilities. It must be thoroughly tested and that plus planning, maintenance, and development is what you save by paying for it. Infrastructure on the other hand is often a one-off investment. Do it once and do it right.

Addendum

I hope you enjoyed this article. Your likes and comments really motivate me. Plus check out MailSlurp. I think you’ll really like it. You can create infinite email addresses during your tests and use them to test user sign-up, email verification codes, transactional mail and much more.

Cheers!

MailSlurp | Ephemeral Email API

Written by

SaaS API for testing code with real email addresses. See https://www.mailslurp.com for more information.
