Rapidly Launch a Startup on GKE, Node & MongoDB
While I may sometimes discuss the work I do, both on my own and for my employers, my views are my own and are not intended to represent my employers in any way.
Apologies, this is a long one. In this post I’m going to detail how I was able to take a simple idea and turn it into a viable reality, complete with a revenue model, in 3 months. That’s 3 months from inception, through design and architecture, to development and deployment. Whether or not it was a good idea is a subject for another post, haha. I was working with a good friend, Chris Betz (medium.com/@thechrisbetz), who developed the UI while I completed 5 Node.js services. The whole system was packaged with Docker and hosted on Google Kubernetes Engine (GKE).
It was January of 2017 and the idea was simple. I wanted a way for people to send real physical postcards to elected representatives as easily as they might send email. The interface would be relatively straightforward. Users would log into a website from a desktop or mobile device and then either select a predefined issue or define their own postcard content. Once that was selected, they would choose a representative based on their address and then check out. Everything else would happen on the backend. The postcards would be managed on my end, but the actual printing and mailing would happen with the help of a third-party service.
Printing and Mailing: After a lot of googling, I stumbled upon a very interesting pay-per-use API from Lob (lob.com). If you are creating any sort of software-based mailing functionality, you should check them out. Through their service, I am able to validate addresses and send custom postcards at a reasonable price per card. https://lob.com
Address Based Representative Data: Google to the rescue. The Civic Information API from Google aggregates and makes available images and contact information for elected representatives all over the nation at every level from federal to state. We did notice that it sometimes returns mixed content (http vs https), but we were able to compensate. https://developers.google.com/civic-information/
Payment: We initially wanted to target PayPal for this implementation; however, as we pushed forward we realized it might not be as easy to implement as usual. To be fair, PayPal had just changed their API to the new 2017 version and we were having trouble getting up to speed with the new backend implementation quickly. We needed a backend implementation because we relied heavily on async backend processes being triggered by our payments, so we quickly pivoted to Stripe, which is one of the easiest APIs to integrate with, period. https://stripe.com
I’m a big fan of microservice architectures in general. Consequently, I felt this project would be an interesting POC for using a microservice architecture with Docker and Kubernetes. The ROI of the approach was justified because there was nothing to transition from (such as a previous monolith), and I clearly understood the domains required and how to distinguish them from each other. All of that said, I feel it is EXTREMELY important to point out that microservices don’t always make sense. If I were advising a client who came to me with this idea and wasn’t as familiar with the pros and cons, I might not have recommended this approach. When you move toward a partitioned, functional design where services encompass their own domains of data, you introduce new variables that can be difficult to negotiate. For example:
Relational Data: Relational data can really hurt you if you don’t ensure it is available within a defined domain and easily queryable. If you are not careful, you can find yourself making expensive and time consuming cross domain API calls just to orchestrate your data.
Domain Boundaries and the UI: Domain Driven Design allows you to define microservices that can be scaled independently with minimal dependency on each other. At least it’s supposed to… it would be a mistake to assume that any boundary is perfect. Nevertheless, there is one place where boundaries become more complex: the UI. At this level, you need to understand that services may not be defining relational contracts with each other, so if there are cross domain data references, these should be treated as de-normalized and non-relational references only, requiring the UI to orchestrate the interaction.
Data Denormalization: Data denormalization is the process by which data is essentially duplicated across domains in order to allow a service to reference that data quickly and efficiently. Data denormalization raises a question: how? How does Service A make Service B aware of its data? It also means accepting that the process implies eventual consistency. Service A should never atomically write the data in Service B (otherwise we have a dependency that could become problematic), so we accept that Service B will eventually catch up and become consistent.
Triggers and Reactive Design: In an eventually consistent distributed system, it becomes advantageous to be able to trigger asynchronous actions between services. For example, after the Payment Service receives payment for a postcard, it must trigger the Mailer Service to complete the order and process the card.
There are other considerations of course: service discovery, databases, deciding to what degree to isolate cross domain communication… But we are getting off topic.
My design would consist of 6 services that operate as independently as possible, with minimal cross communication except for a few event-based triggers and authentication. The UI, designed and coded by Chris Betz (medium.com/@thechrisbetz), would be fully decoupled and built with the AngularJS Material framework. All artifacts would be containerized via Docker.
Before we dig into the services, let’s look at the connections described in the diagram above. You’ll notice that the services primarily communicate through REST with only the UI; however, the Payment Service does make asynchronous calls to the Mailer Service when completing payments. The fact that this happens so often is an indication that perhaps these two services should not be separate domains, but that’s a debate for a different time. The lightly dotted lines above represent communication with the User Service to provide authentication. Each service makes an initial authentication call and then caches the authenticated token locally until it expires to avoid subsequent calls until needed. This is a level of interconnection that I could have probably gotten more creative about, but for 6 services and a UI, it didn’t seem necessary.
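To make the token caching idea concrete, here is a minimal sketch of how a service might remember the result of an authentication call until the token expires. It is not the code from the project: the `AuthCache` class, the `authenticate` helper and the `validateRemotely` callback (a stand-in for the HTTP call to the User Service) are all hypothetical names, and the cache lives in memory only.

```javascript
// Hypothetical in-memory cache of validated tokens, keyed by the token string.
class AuthCache {
  // "now" is injectable so expiry can be tested without waiting.
  constructor(now = Date.now) {
    this.now = now;
    this.entries = new Map();
  }

  // Remember who a token belongs to and when it stops being valid (ms epoch).
  set(token, user, expiresAt) {
    this.entries.set(token, { user, expiresAt });
  }

  // Return the cached user, or null if the token is unknown or has expired.
  get(token) {
    const entry = this.entries.get(token);
    if (!entry) return null;
    if (this.now() >= entry.expiresAt) {
      this.entries.delete(token); // drop stale entries as we find them
      return null;
    }
    return entry.user;
  }
}

// Only call the User Service on a cache miss.
// validateRemotely(token) is assumed to resolve to { user, expiresAt }.
async function authenticate(cache, token, validateRemotely) {
  const cached = cache.get(token);
  if (cached !== null) return cached;
  const { user, expiresAt } = await validateRemotely(token);
  cache.set(token, user, expiresAt);
  return user;
}
```

With something like this in place, a service only pays for a round trip to the User Service the first time it sees a token and again after that token expires.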
Each service is written in Node.js with Express and uses MongoDB as a persistence layer. Initially I planned to provide these databases as 2 distinct clusters on Google Cloud. Because of budget constraints, I chose to go with a single 3 node cluster with replication (2 nodes and an arbiter). All of the services persist their logs in MongoDB using a custom framework. This is not a good long-term approach for logging, but for such a small project I saw no harm. If this were a more complex design, I would have preferred to take advantage of the Docker-based infrastructure to push logging to a central system like ELK, although GKE also makes it pretty easy to see what is going on in a container.
For the APIs I used swagger 2.0 and documented each endpoint before coding to give Chris a heads up on what to expect. As I explain the services, I am not going to provide a code level walk through of everything. This post is already long and that would make it the length of a small book. I’ll just explain what they do and how I wrote them as quickly as possible.
User Service: This service is actually part of a larger grouping of services I was already writing in order to create my own custom Identity Management (IDM) system called United Effects (UE) Auth. I’ll discuss that project in a different post. The service was pretty close to complete and I simply used it as a standalone system to authenticate and swap Facebook tokens for my own internal tokens. In doing so, I was also able to save a user record, define roles and prepare for an eventual shift to UE Auth. While I had a head start on the service, it also wasn’t very complicated. Essentially, I defined a user data model and provided CRUD and authentication endpoints. The User Service would issue a new token to a caller and accept that token for 12 hours, until it expired. Other services and the UI would need to cache the token and refresh it upon expiration. Initially targeting Facebook-only authentication allowed us to ramp up this piece of the puzzle exceedingly fast while taking advantage of Facebook’s large user base.
Mailer Service: This service is the postcard processing system that connects to and manages the lob.com interface. It defines a data model for a postcard and the various states in which a postcard may exist. Once defined, a postcard can then move through a workflow of states that are partially triggered locally and partially triggered through webhook interfaces to lob.com. The service also makes it very easy to send customizable postcards with many different kinds of content and images to lob.com for printing and mailing.
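A workflow like this can be modelled as a small state machine. The sketch below is an illustration only: the state names (`draft`, `paid`, `printing`, `mailed`, `failed`) and the `transition` function are hypothetical and are not taken from the Mailer Service itself. Repeating the current state is treated as a no-op, on the assumption that the same event may arrive more than once, for example from both a local trigger and a webhook.

```javascript
// Hypothetical postcard states and the moves allowed between them.
const TRANSITIONS = {
  draft: ['paid'],
  paid: ['printing'],
  printing: ['mailed', 'failed'],
  failed: ['printing'], // allow a retry after a failed print
  mailed: [],           // terminal state
};

// Return a new postcard in the requested state, or throw if the move is illegal.
function transition(postcard, nextState) {
  // Receiving the same event twice should not be an error.
  if (postcard.state === nextState) return postcard;

  const allowed = TRANSITIONS[postcard.state] || [];
  if (!allowed.includes(nextState)) {
    throw new Error(
      `Cannot move postcard ${postcard.id} from ${postcard.state} to ${nextState}`
    );
  }
  return {
    ...postcard,
    state: nextState,
    history: [...postcard.history, nextState],
  };
}

// Example: a card that has been paid for and handed to the printer.
let card = { id: 'pc1', state: 'draft', history: ['draft'] };
card = transition(card, 'paid');
card = transition(card, 'printing');
```

Keeping the allowed moves in one table makes it easy to see at a glance which triggers, local or webhook, are permitted to advance a postcard.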
Payment Service: This service defines a payment object to track payments as receipts and then triggers actions against the Mailer Service once payments are confirmed. It integrates with the Stripe API (stripe.com) to process credit cards without ever saving credit card data locally, averting the immediate need to adhere to any financial regulations. The trigger is a poor man’s event system. If this were a more complex architecture with many more services, I would be inclined to use a system like Kafka or NATS (both incredibly powerful tools you should read about) to asynchronously and reliably publish events for consumption and processing, which is the very basis of a reactive, event-based architecture. At the scale of my project, implementing such a layer would have been unreasonably costly. Instead, I used a custom intent framework backed by MongoDB: it records a list of intended HTTP requests and asynchronously retries each one until it completes. In this case, the call being made is an HTTP request to the Mailer Service to mark a postcard as paid and begin printing. As a redundant backup, the Mailer Service also listens to webhooks from Stripe. If the intent system fails for whatever reason, Stripe will eventually send a payment confirmation that the Mailer Service will see. That message includes the Postcard ID, so the Mailer Service can identify which postcard was paid for.
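As a rough illustration of the intent idea, the sketch below keeps a list of intended requests and decides which ones are due for another attempt. It is not the project's framework: `IntentQueue` and `processDue` are hypothetical names, an in-memory array stands in for the MongoDB collection, and the exponential backoff and attempt limit are assumptions added to keep retries bounded.

```javascript
// Hypothetical intent queue; the array stands in for a MongoDB collection.
class IntentQueue {
  constructor({ baseDelayMs = 1000, maxAttempts = 5 } = {}) {
    this.baseDelayMs = baseDelayMs;
    this.maxAttempts = maxAttempts;
    this.intents = [];
    this.nextId = 1;
  }

  // Record an HTTP request we intend to make, due immediately.
  add(request, now) {
    const intent = {
      id: this.nextId++,
      request,
      attempts: 0,
      status: 'pending',
      nextAttemptAt: now,
    };
    this.intents.push(intent);
    return intent;
  }

  // Pending intents whose retry time has arrived.
  due(now) {
    return this.intents.filter(
      (i) => i.status === 'pending' && i.nextAttemptAt <= now
    );
  }

  // Record the outcome of one attempt and schedule the next if needed.
  record(intent, succeeded, now) {
    intent.attempts += 1;
    if (succeeded) {
      intent.status = 'done';
    } else if (intent.attempts >= this.maxAttempts) {
      intent.status = 'abandoned';
    } else {
      // Exponential backoff: base, then 2x base, then 4x base, ...
      intent.nextAttemptAt = now + this.baseDelayMs * 2 ** (intent.attempts - 1);
    }
  }
}

// One worker pass: try every due intent with the supplied async sender.
// send(request) is assumed to resolve to true on success.
async function processDue(queue, send, now = Date.now()) {
  for (const intent of queue.due(now)) {
    let ok = false;
    try {
      ok = await send(intent.request);
    } catch (err) {
      ok = false; // a thrown error counts as a failed attempt
    }
    queue.record(intent, ok, now);
  }
}
```

A worker could call `processDue` on a timer, passing a `send` function that performs the actual HTTP request to the Mailer Service. Because the same postcard may also be marked as paid via the Stripe webhook path, the receiving endpoint should tolerate being called more than once.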
Postcard Content Service: This is just a very basic layer to manage predefined content for postcards. Admins are the only ones who define this content. It is little more than a data layer with a CRUD API.
Content Service: This service is very similar to the Postcard Content Service except that it allows me to persist and reuse images and html that are publicly or privately available.
Civic Wrapper Service: This service is nothing more than a wrapper that allows us to proxy requests to the Google Civic API. We decided to do this so that we could ensure all content being served was https and avoid mixed content warnings on the UI.
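One way a wrapper like this could avoid mixed content is to rewrite `http://` URLs in the upstream response before returning it. The sketch below is an assumption about the approach, not the service's actual code: `upgradeToHttps` is a hypothetical helper, the field names in the example are illustrative, and the rewrite only helps when the upstream host also serves the resource over https (otherwise the wrapper would have to fetch and re-serve it).

```javascript
// Recursively rewrite any string that starts with "http://" to "https://".
// Returns a new structure; the input is left untouched.
function upgradeToHttps(value) {
  if (typeof value === 'string') {
    return value.replace(/^http:\/\//i, 'https://');
  }
  if (Array.isArray(value)) {
    return value.map(upgradeToHttps);
  }
  if (value !== null && typeof value === 'object') {
    return Object.fromEntries(
      Object.entries(value).map(([key, inner]) => [key, upgradeToHttps(inner)])
    );
  }
  return value; // numbers, booleans, null, undefined pass through
}

// Example with illustrative field names.
const upstream = {
  officials: [
    { name: 'Jane Doe', photoUrl: 'http://example.com/jane.jpg' },
  ],
};
const safe = upgradeToHttps(upstream);
```

In an Express handler, the wrapper would apply this to the parsed upstream body just before sending the response to the UI.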
I was able to move quickly through the implementation of these services because I used each service as the starting point for the next. If you were to look through the code of each service, you would see that they have a lot in common. If you’d like, feel free to view the Content Service as an example here: https://github.com/UnitedEffects/UE-Content_Srvc
Admittedly, the Content Service is very simplistic. There are probably hundreds of services just like it in Github that do what I’ve done and do it better. What I gained by writing it and the subsequent services myself was familiarity with the code and speed.
I have to acknowledge that I cut a really big corner. One no production developer should cut if they can avoid doing so. I didn’t add any unit testing. In hindsight, I don’t think I had a good excuse. I honestly don’t believe it would have impacted the schedule to do the testing. Instead I manually tested everything and Chris and I went through several iterations of trial and error. If I had to do it again I’d implement unit testing for sure. I probably still will at some point.
Faster with (mostly) Google
If you are unfamiliar with docker and kubernetes, first I think you really ought to take some time and learn because it will impact your career, and second you can start right here:
Docker is a containerization system that allows you to package not only your application or service, but also its OS userland and all dependencies, inside an extremely small container that is portable and easily managed. For Node.js based services (and Golang) I am a fan of the Alpine Linux containers, specifically “mhart/alpine-node”. These are minimal containers with only what you need for your service to run, making them quite small and portable.
Kubernetes is an orchestration system to run, monitor, load balance, network and manage large numbers of containers. Google provides a managed implementation of Kubernetes on its Cloud Network called Google Kubernetes Engine (GKE).
From here on out I’ll assume you know about these technologies.
Building and Versioning the Containers
Each of the services above has a Github repository and a Dockerfile. You’ll notice it in the Content Service example. In addition, each service has a Docker Cloud (public) or Google Container Registry (private) repository. Through these registries it was pretty trivial to create automated docker builds based on the Git tags that I committed to the Github repositories. I used tags so that I could version each build. One of the things you learn as you work with Kubernetes to orchestrate your containers and provide rolling updates is that always using the “latest” tag can become confusing and sometimes just not work. The best way to proceed is to tag each Docker container version that is created, and the easiest way to do that is to use Github tags and trigger builds off of them.
I have worked with entire teams to stand up Kubernetes clusters manually on cloud providers such as AWS. It is a time-consuming process to do it right. Google has removed that burden. GKE is a managed implementation of Kubernetes that takes about 15 minutes to stand up and begin working with. When you begin the process of standing up a GKE cluster, a wizard will walk you through the available options, including the size of your cluster. I chose to go with a 2 node cluster to start and set auto-scaling to allow it to expand up to 5 nodes as necessary. This ensures that I’m only spending money on the extra nodes when I need them because of increased load. Once everything is up and running, you simply copy and paste the command required to configure your local kubectl CLI to interact with the cluster from your machine.
Google Cloud has a “Click to Deploy” option for a MongoDB cluster with replication by Bitnami. This was the most cost effective solution with replication that I could find. It was also very simple. You just follow the instructions and once it is ready, Google provides you with in-browser SSH terminals to interact with the nodes. You are provided with the ability to make them externally available or keep them only in network on Google. If you don’t need replication, a standalone instance is just as easy to deploy and far less expensive.
Deploying to GKE
I wish I could tell you I created a fancy amazing auto continuous deployment process for my containers to Kubernetes. There are definitely some pretty cool options like Spinnaker, Drone or even good old Jenkins. But I didn’t. I found that with these 6 services, it was easy enough to simply manage the Kubernetes yaml files in a single Github repository that I manually controlled. Here is an example of the Content Service GKE deployment yaml:
I’ve removed some of the more sensitive values from the environment parameter section.
There are a few things you should take note of from the example above. Notice that there is a Service and a Deployment defined. The Service provides load balancing and an access point to the Deployment, which in turn defines the number of containers to balance across (replicas) and the configuration of each of those containers. With GKE, if you plan to use the standard Ingress (next section) it’s best to use the NodePort type for the service. When you see “port” and “targetPort”, these refer to the port other kubernetes based services may request the data on (port) and the port that is exposed at a container level (targetPort).
Within the deployment, there are also some settings that are optional but very helpful. Limits ensure that a single container can’t run wild, either because of traffic or because of an error, and consume all of the available resources within a GKE cluster. You can define an initial resource request and then provide limits up to which you are comfortable expanding. Replicas defines how many instances of your container will run and be balanced; this example defines only a single instance. And finally, revisionHistoryLimit tells Kubernetes not to keep old, no longer running versions of the deployment lying around in your list of replica sets. It keeps your dashboard clean.
The env section is where you define any and all environment variables your container may require for configuration. You’ll notice that if you need to reference a different container, you will do so through its Service. This example has a configuration environment variable called DOMAIN which points to another container. Its internal URI is simply defined as the Kubernetes service name which has an associated internally mapped IP address. So to make a request to this service from within the cluster, you simply need to send that request to http://domain. Similarly, if another internal service wanted to reference the Content Service, they would do so at http://content.
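To tie the settings described above together, here is a hypothetical sketch of a Service and Deployment pair for a service named `content`. It is not the manifest from the project: the image name, container port and resource numbers are made up, and it uses the current `apps/v1` API rather than whatever version was current in 2017. It is written as a Node.js script that prints JSON, which kubectl accepts in place of YAML.

```javascript
// Hypothetical Service + Deployment for a service called "content".
const labels = { app: 'content' };

const service = {
  apiVersion: 'v1',
  kind: 'Service',
  metadata: { name: 'content', labels },
  spec: {
    type: 'NodePort',  // works with the standard GKE Ingress
    selector: labels,  // which pods this Service balances across
    ports: [{ port: 80, targetPort: 3000 }], // cluster port -> container port
  },
};

const deployment = {
  apiVersion: 'apps/v1',
  kind: 'Deployment',
  metadata: { name: 'content', labels },
  spec: {
    replicas: 1,             // a single instance, as in the article
    revisionHistoryLimit: 0, // do not keep old replica sets around
    selector: { matchLabels: labels },
    template: {
      metadata: { labels },
      spec: {
        containers: [{
          name: 'content',
          image: 'example/content-service:1.0.0', // made-up image and tag
          ports: [{ containerPort: 3000 }],
          // Another service is referenced by its Service name.
          env: [{ name: 'DOMAIN', value: 'http://domain' }],
          resources: {
            requests: { cpu: '100m', memory: '128Mi' }, // illustrative numbers
            limits: { cpu: '250m', memory: '256Mi' },
          },
        }],
      },
    },
  },
};

// kubectl accepts a v1 List containing several objects.
const manifest = { apiVersion: 'v1', kind: 'List', items: [service, deployment] };
console.log(JSON.stringify(manifest, null, 2));
```

Saved as, say, `content.js`, the output could be applied with `node content.js | kubectl apply -f -`. Note that the image is pinned to an explicit version tag rather than “latest”, in line with the versioning advice earlier in the post.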
Finally, though not in the above example, you can set auto scaling for these containers as well. Essentially, you can define when the number of replicas (in this case 1) should begin to scale up to a higher value based on CPU usage. You can do this using the kubectl CLI tool with the following command, where I define the minimum number of replicas as 1 and allow scaling up to 8 when CPU usage is at 85% or higher.
kubectl autoscale deployment content --cpu-percent=85 --min=1 --max=8
Routing via Ingress and AWS Route 53
I know I know, this is supposed to be a Google feel good story. But the truth is that Route 53 is just an excellent way to manage domains and subdomains. All of my domains and subdomains for all of my projects are managed through Route 53.
In Kubernetes, you define domain based access to a service through an Ingress. Here is an example of an Ingress for this project. You’ll notice several services are defined:
The Ingress allows you to define access to a Service from a specific path on a specific internal Kubernetes port for that service. In Route 53 I create these subdomains, and here in GKE I tell Kubernetes to route requests for content.freedompostcards.com only to the content service on internal port 80. One thing you’ll notice is that all of these services are serving content on port 80; however, if you look at the annotations above, you’ll notice that I’m actually disabling http based traffic, thereby forcing https (port 443) only. This is not a redirect; it simply gives a 404 error if someone attempts to access these services over http instead of https. This is because the GKE Ingress is based on the Google L7 load-balancer, which at the time of this project did not offer a redirect option. The tls secretName is an internal reference to my SSL certificate, which has been encoded and uploaded to GKE. This allows secure https traffic to be mapped to my services. It took me a while to understand how to set up the https SSL certificate, but it’s pretty simple:
kubectl create secret tls foo-secret --key /tmp/tls.key --cert /tmp/tls.crt
When you apply an ingress, Kubernetes will set it up and return an external IP address. You can watch this happen by observing the progress with the command:
kubectl get ing --watch
When you have the IP address of the ingress, simply provide that IP address to each subdomain defined in Route 53 (or wherever you are managing your DNS). Each one will get that same IP address. Remember that Kubernetes will route the traffic appropriately based on the domain in the request header, so they all send the request to the same IP.
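Pulling together the Ingress pieces described in this section, here is a hypothetical sketch of a host-based Ingress with http disabled and a TLS secret attached. It is not the project's manifest: `buildIngress` is a made-up helper, the second host is a placeholder, and it targets the current `networking.k8s.io/v1` API rather than the version available in 2017. The `kubernetes.io/ingress.allow-http` annotation is the GKE setting for turning off plain http. As with any manifest, the JSON output can be passed to kubectl.

```javascript
// Build an Ingress that maps each host to a Service on port 80.
// hosts is an object of { hostname: serviceName }.
function buildIngress({ name, secretName, hosts }) {
  return {
    apiVersion: 'networking.k8s.io/v1',
    kind: 'Ingress',
    metadata: {
      name,
      // GKE-specific: serve https only, no plain http listener.
      annotations: { 'kubernetes.io/ingress.allow-http': 'false' },
    },
    spec: {
      tls: [{ secretName }], // the secret created with "kubectl create secret tls"
      rules: Object.entries(hosts).map(([host, serviceName]) => ({
        host, // matched against the Host header of the request
        http: {
          paths: [{
            path: '/',
            pathType: 'Prefix',
            backend: { service: { name: serviceName, port: { number: 80 } } },
          }],
        },
      })),
    },
  };
}

const ingress = buildIngress({
  name: 'postcards',        // made-up name
  secretName: 'foo-secret', // matches the secret command shown earlier
  hosts: {
    'content.freedompostcards.com': 'content',
    'payment.example.com': 'payment', // placeholder host and service
  },
});
console.log(JSON.stringify(ingress, null, 2));
```

Every host in the rules resolves to the same external IP address in DNS; the Ingress then picks the backend Service by comparing the request's Host header against these rules.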
Live in 3 Months
We were able to code and configure everything you’ve read above in approximately 3 months. The result was a website we called Freedom Postcards — https://freedompostcards.com.
Trigger Warning — Politically speaking, there is a decidedly left leaning bias on the website. This post documents the technological journey that made it possible but is in no way meant to debate or venerate the site’s political leanings over any other.
What I Learned for Next Time
- GKE is expensive for individuals to maintain. That being said, if you can afford it, it’s a great way to rapidly build and deploy code.
- REST is inefficient as a cross service communication technology. If I had more time, I would have pursued gRPC with Protocol Buffers.
- There are a lot of serverless options available that would probably work just as well with this architecture without the long-term costs of GKE. In the future I would probably use AWS Lambda or Google Cloud Functions and use a GraphQL interface instead of REST. This would be a much more functional design that would also cost less to maintain.
- Having a GKE cluster is definitely fun… I like being able to test new ideas quickly. One thing on my radar to bridge the gap to a functional serverless approach like AWS Lambda is https://github.com/kubeless/kubeless
- I really should have built unit tests. It’s not hard and doesn’t take that much time.
- I love MongoDB, but I should probably branch out to managed NoSQL databases on the cloud to lower costs.
- I should probably set up a ready to go minimal eventing/messaging system for my future projects like https://nats.io/
- Market research first may have pushed us out a few weeks, but would have been worth the time.
- I’d use Google Container Registry over Docker Cloud. It’s much faster.
- I’d blog sooner, so as to avoid a giant article like this.