Re-Engineering ICP-Airways Microservices: Pitfalls and Improvements
For those who are unaware of what ICP-Airways is, let me give you a brief overview. ICP-Airways is a modern cloud-native application, built completely from scratch, inspired by Acmeair, and deployed on IBM Cloud Private. The application comprises several microservices that work in conjunction with various databases and IBM middleware. Its main objective is to demonstrate the many use cases of IBM Cloud Private. The project is completely open-sourced for anyone to play with, and it is well supported, with over 700 commits and 30+ ⭐️ on GitHub.
It has been only a few months since I started working with IBM Cloud Private and Kubernetes, and it is impressive how quickly I have picked up these technologies. The purpose of this blog is to show you the common pitfalls and mistakes I corrected while re-engineering ICP-Airways.
Before we start talking about pitfalls, let me brief you on what this application is all about. It is an airline booking application, where a person can authenticate himself to look for available flights and book them. The application will soon have a chatbot feature to automate the booking and check-in process, as well as AI to enhance the customer experience.
#1 Pitfall (The Architecture)
“NOT EVERYTHING YOU DEPLOY ON KUBERNETES BECOMES A MICROSERVICE”
The following diagram shows the first version of ICP-Airways.
If you understand microservices well, you will easily spot the mistake in this architecture: all the microservices use one database instance. This is not a microservices architecture; it is a distributed monolith. Db2 is a single point of failure here: if Db2 goes down, the application will not function at all! A lot of people treat their REST APIs deployed on Kubernetes as microservices, which is a misconception.
We re-engineered the architecture as follows.
This architecture solves the problem of having Db2 as a single point of failure. Now each microservice has its own independent database. This also brings more security to our application: if one database is leaked, the information in the other databases remains safe. One of the design principles of microservices is that each should be business-domain centric. Hence, we combined the signup and signin microservices into an authentication microservice, and combined the check-in microservice with the booking microservice. Now all our microservices are loosely coupled and highly cohesive.
#2 Pitfall (Using an inefficient bundler)
The backend of this project is built with Node.js (Express) customized to support TypeScript. We used TypeScript on our backend because it gives us capabilities that vanilla JavaScript cannot provide, such as static typing using interfaces.
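As a minimal sketch of what static typing buys us (the names here are illustrative, not taken from the actual codebase), an interface gives a payload a shape the compiler enforces at build time:

```typescript
// Hypothetical payload type for a booking request; vanilla JavaScript
// has no way to enforce this shape at compile time.
interface Booking {
  flightId: string;
  seat: string;
  passengers: number;
}

// The compiler rejects any call that omits a field or passes a wrong type.
function describeBooking(b: Booking): string {
  return `Flight ${b.flightId}, seat ${b.seat}, ${b.passengers} passenger(s)`;
}

console.log(describeBooking({ flightId: "IA101", seat: "12A", passengers: 2 }));
// prints "Flight IA101, seat 12A, 2 passenger(s)"
```

Passing `{ flightId: "IA101" }` alone would fail at compile time rather than at runtime, which is exactly the safety net vanilla JavaScript lacks.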
#3 Pitfall (Individual Kubernetes YAMLs)
Previously, we used individual Kubernetes YAMLs to deploy our applications. Microservice deployments become hard to manage when each Service, Deployment, ConfigMap, DaemonSet, etc. lives in its own YAML file.
#3 Solution (Using Helm charts)
All our microservices have moved to Helm charts.
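A minimal sketch of what one such chart might contain (the names and values are illustrative, not the actual chart):

```yaml
# Chart.yaml — chart metadata for a single microservice
apiVersion: v1
name: bookingsvc
description: Helm chart for the booking microservice
version: 0.1.0
```

```yaml
# values.yaml — one place to tune image, replicas, and ports per environment;
# the templates/ directory renders these values into Deployment and Service YAMLs
image:
  repository: icpairways/bookingsvc
  tag: "1.0.0"
replicaCount: 1
service:
  type: ClusterIP
  port: 8080
```

The whole microservice can then be deployed or upgraded with a single `helm install` or `helm upgrade` command instead of applying many YAMLs by hand, and `helm rollback` undoes a bad release.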
#4 Pitfall (Hardcoded replica counts)
Traditionally, a Kubernetes Deployment has a feature to specify a replica count. With multiple replicas you can run several instances of your application; if one fails, the others keep working. This is good as a start, but not optimal. For example, if your application gets a sudden spike of traffic and needs to scale up, it cannot, because the replica count is hardcoded. Can we autoscale during peak traffic hours? Yes!
#4 Solution (Using the Kubernetes Horizontal Pod Autoscaler)
The Horizontal Pod Autoscaler automatically scales the number of pods in a replication controller, deployment or replica set based on observed CPU utilization (or, with custom metrics support, on some other application-provided metrics).
It is important to specify CPU resource requests for the Horizontal Pod Autoscaler to work. If our application utilizes more CPU than the specified threshold, the deployment will autoscale.
We have defined a minimum replica count of 1 and a maximum of 10, meaning our pods can scale up to 10 replicas at most. The threshold for CPU usage is 50%.
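A sketch of the HPA described above, assuming a Deployment named `bookingsvc` (the name is illustrative). Note the CPU request on the container, which the autoscaler needs as its baseline for computing utilization:

```yaml
# Snippet from the Deployment's container spec: the HPA computes
# utilization relative to this CPU request.
resources:
  requests:
    cpu: 100m
---
# HorizontalPodAutoscaler: scale between 1 and 10 replicas,
# targeting 50% average CPU utilization.
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: bookingsvc-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: bookingsvc
  minReplicas: 1
  maxReplicas: 10
  targetCPUUtilizationPercentage: 50
```

If average CPU across the pods exceeds 50% of the request, the HPA adds replicas (up to 10); when traffic subsides, it scales back down (to a minimum of 1).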
#5 Pitfall (Exposing all databases and microservices as NodePorts)
If you understand Kubernetes well, you should be aware of the techniques for exposing your services to the external world. By default, any pod hosted on Kubernetes has restricted access to the external world, so we use Kubernetes Services to expose them; one option is a NodePort. NodePorts are high-numbered ports in the range 30000–32767. Exposing these ports, especially for databases, to the external world is bad security practice. You can also use a LoadBalancer service to expose your services, but ICP doesn't support that. The best solution is to use an Ingress.
Ingress, added in Kubernetes v1.1, exposes HTTP and HTTPS routes from outside the cluster to services within the cluster. Traffic routing is controlled by rules defined on the ingress resource.
    internet
        |
   [ Ingress ]
   --|-----|--
   [ Services ]
#5 Solution (Using the Istio ingress gateway)
“It’s a bird. It’s a plane. It’s Super…” No, wait, it’s Istio!
One of the most interesting refactorings we have done in our application is leveraging Istio for our microservices. It has simplified our project and added lots of capabilities. Istio is available in the ICP Catalog as a Helm chart that can be easily installed on IBM Cloud Private.
Istio is an open source independent service mesh that provides the fundamentals you need to successfully run a distributed microservice architecture. Istio reduces complexity of managing microservice deployments by providing a uniform way to secure, connect, and monitor microservices.
The Istio site says:
Istio gives you:
- Automatic load balancing for HTTP, gRPC, WebSocket, and TCP traffic.
- Fine-grained control of traffic behavior with rich routing rules, retries, failovers, and fault injection.
- A pluggable policy layer and configuration API supporting access controls, rate limits and quotas.
- Automatic metrics, logs, and traces for all traffic within a cluster, including cluster ingress and egress.
- Secure service-to-service authentication with strong identity assertions between services in a cluster.
We switched from NodePort to the Istio ingress gateway. Traditionally, Kubernetes has used an Ingress controller to handle the traffic that enters the cluster from outside. When using Istio, this is no longer the case: Istio replaces the familiar Ingress resource with Gateway and VirtualService resources. So we expose our services to the external world using a Gateway and VirtualServices. You can reach the Istio ingress gateway either through a LoadBalancer or through a NodePort; common ports such as HTTP and HTTPS are exposed as NodePorts on the Istio ingress gateway service.
We are using the Istio ingress gateway NodePort to expose our services. This is how it works:
- A client makes a request on a specific port.
- The HTTP and HTTPS NodePorts listen for connections coming from the external world.
- Inside the cluster, the request is routed to the Istio IngressGateway service.
- This service forwards the request to the IngressGateway pod.
- The IngressGateway pod is configured by a Gateway and a VirtualService.
- Ports, protocols, and certificates are managed by the Gateway on the ingress pod.
- The VirtualService routes the traffic to the correct application service.
- The application service routes the traffic to its specific pod.
Let's see an example: our booking microservice.
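A sketch of what the exposure might look like, assuming the booking service is named `bookingsvc`, listens on port 8080, and is matched on a `/booking` path prefix (names, hosts, and paths are illustrative):

```yaml
# Gateway: binds the Istio ingress gateway to plain HTTP on port 80
apiVersion: networking.istio.io/v1alpha3
kind: Gateway
metadata:
  name: icp-airways-gateway
spec:
  selector:
    istio: ingressgateway   # use Istio's default ingress gateway pods
  servers:
  - port:
      number: 80
      name: http
      protocol: HTTP
    hosts:
    - "*"
---
# VirtualService: route /booking traffic arriving at the gateway
# to the bookingsvc Kubernetes service
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: bookingsvc
spec:
  hosts:
  - "*"
  gateways:
  - icp-airways-gateway
  http:
  - match:
    - uri:
        prefix: /booking
    route:
    - destination:
        host: bookingsvc
        port:
          number: 8080
```

With this in place, a request to `<node-ip>:<http-nodeport>/booking` flows through the IngressGateway pod and lands on a bookingsvc pod, without bookingsvc itself ever being exposed as a NodePort.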
#6 Pitfall (Poor Resiliency and Fault Tolerance)
In a production environment you will inevitably face faults. Faults can be unpredicted long delays or outright service unavailability. We should prepare our application to counter such behavior.
#6 Solution (Timeouts, circuit breakers, retries, and pool ejection to the rescue)
“Remember that your services and applications will be communicating over unreliable networks. Without ensuring the application actively guarded against network failures, the entire system will be susceptible to cascading failures. Istio comes with many capabilities for implementing resilience within applications, but just as we noted earlier, the actual enforcement of these capabilities happens in the sidecar.”
Timeouts: Timeouts are a crucial component of making systems resilient and available. Calls to services over a network can result in lots of unpredictable behavior, and the worst of it is latency. If one of our services takes too long to respond, we can time out the call.
Retry: It is crucial to recognize that our application operates over unreliable networks, so our services may be unavailable for short periods. To tackle this, our services should retry failed calls automatically, so that no request is lost.
“Much like the electrical safety mechanism in the modern home (we used to have fuse boxes, and “blew a fuse” is still part of our vernacular), the circuit breaker insures that any specific appliance does not overdraw electrical current through a particular outlet”
The same concept applies to microservices, where we do not want to overload a microservice with too many requests. We don't want multiple requests getting queued up and making that instance or pod even slower. So we add a circuit breaker that opens whenever more than one request is being handled by any instance or pod.
The last thing to discuss is pool ejection. Sometimes you want to remove badly behaving pods from the load-balancing pool. Requests will not be routed to these pods, giving them a cool-off period in which to become stable again.
Pool ejection, or outlier detection, is a resilience strategy that applies whenever you have a pool of instances or pods serving client requests. If a request is forwarded to a certain instance and it fails (e.g., returns a 5xx error code), Istio ejects this instance from the pool for a certain sleep window.
Let's see an example.
VirtualService implementing retries and timeouts
The call times out after 2 seconds and is retried 3 times in case of failure.
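A sketch of such a VirtualService (the service name is illustrative):

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: bookingsvc
spec:
  hosts:
  - bookingsvc
  http:
  - route:
    - destination:
        host: bookingsvc
    timeout: 2s          # fail fast instead of waiting on a slow service
    retries:
      attempts: 3        # retry the call up to 3 times
      perTryTimeout: 2s  # each attempt gets its own 2-second budget
```

Because the Envoy sidecar enforces these settings, the application code itself needs no retry or timeout logic.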
Circuit breaker and pool ejection.
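A sketch of a DestinationRule implementing both (the service name and thresholds are illustrative): the connection pool settings open the circuit when more than one request is in flight, and outlier detection ejects a failing pod for a cool-off period.

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: bookingsvc
spec:
  host: bookingsvc
  trafficPolicy:
    connectionPool:              # circuit breaker: at most one in-flight request
      tcp:
        maxConnections: 1
      http:
        http1MaxPendingRequests: 1
        maxRequestsPerConnection: 1
    outlierDetection:            # pool ejection
      consecutiveErrors: 1       # eject a pod after a single 5xx response
      interval: 1s               # how often pods are scanned for errors
      baseEjectionTime: 3m       # cool-off period before re-admission
      maxEjectionPercent: 100    # allow ejecting every pod in the pool if needed
```

Requests that trip the circuit are rejected immediately with a 503 instead of queuing up behind a struggling pod.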
#7 Pitfall (Lack of observability)
Monitoring is one of the most important aspects of building microservices. We need to monitor our microservices' CPU usage and memory consumption to confirm the application is working well and to detect sudden spikes. We also need to see how our microservices interact with each other in the service mesh. The previous version of ICP-Airways had no observability.
#7 Solution (Prometheus, Grafana, Weave Scope, and Kiali)
Istio gives us observability out of the box through Prometheus, Grafana, and Kiali. We installed Weave Scope separately, as it doesn't come with Istio.
Prometheus: Monitoring system & time series database
Weave Scope: Visualizing Kubernetes
#8 Pitfall (Security)
By default, any pod deployed on Kubernetes can freely talk to the other pods in its namespace. This is a big security concern: if attackers compromise one of your pods, they can easily reach all your other pods as well, because your other microservices are reachable by their DNS names.
#8 Solution (Implementing Zero Network Trust Policies using Calico)
In a time where network surveillance is ubiquitous, we find ourselves having a hard time knowing who to trust. Can we trust that our internet traffic will be safe from eavesdropping? Certainly not!
Security is one of the most important aspects of building a cloud-native application. A zero-trust network architecture allows us to deploy cloud-native applications securely: the policy forces us not to trust anyone on the network. The concept of zero-trust networking (ZTN) was introduced in 2010. At its core, the ZTN model means not allowing access to anyone unless they are authenticated and their request to a specific network resource has been authorized.
Let’s see an example here
Deny-all traffic YAML: this denies all traffic, even pod-to-pod communication within the same namespace.
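A sketch of such a deny-all policy. Calico enforces standard Kubernetes NetworkPolicy resources; the namespace here is illustrative:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: deny-all
  namespace: icp-airways
spec:
  podSelector: {}   # an empty selector matches every pod in the namespace
  policyTypes:
  - Ingress         # with no ingress rules listed, all inbound traffic is denied
```

Once this policy is applied, every allowed flow must be whitelisted explicitly by a more specific policy.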
Allow external traffic to bookingsvc
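A sketch of such a policy, assuming bookingsvc pods carry an `app: bookingsvc` label and external traffic enters through the Istio ingress gateway in a namespace labeled `name: istio-system` (both labels are assumptions for illustration):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-external-to-bookingsvc
  namespace: icp-airways
spec:
  podSelector:
    matchLabels:
      app: bookingsvc          # illustrative label on the booking pods
  policyTypes:
  - Ingress
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          name: istio-system   # traffic arriving via the Istio ingress gateway
```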
Allow traffic from bookingsvc to its database (mariadb)
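A sketch of such a policy, again assuming illustrative `app` labels: only bookingsvc pods may reach MariaDB, and only on its database port.

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-bookingsvc-to-mariadb
  namespace: icp-airways
spec:
  podSelector:
    matchLabels:
      app: mariadb             # illustrative label on the database pods
  policyTypes:
  - Ingress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: bookingsvc      # only the booking microservice may connect
    ports:
    - protocol: TCP
      port: 3306               # MariaDB's default port
```

Even if another pod in the namespace is compromised, it cannot open a connection to the booking database.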
This is cloud native at its best. However, there are other aspects of this application we are still working to improve, such as mTLS encryption using Istio, a CI/CD pipeline, and unit and integration testing.