Infrastructure Engineering — Diving Deep

T.v. Vignesh
21 min read · Jan 15, 2021

This blog is a part of a series on Kubernetes and its ecosystem where we will dive deep into the infrastructure one piece at a time.

So far we have gone through the various principles to keep in mind while working with infrastructure as a whole and have also discussed the impact Kubernetes and its ecosystem have on our journey towards establishing a cloud native infrastructure.

But considering that there are a lot of tools and technologies in play to make this all happen, I am pretty sure that you are left with a lot of unanswered questions. While we will be diving deeper into the entire ecosystem in this series, I feel that it is important to clear up some of the clouded thoughts you may have. So, why not proceed with this series with an FAQ (similar to what I did for GraphQL)? That’s what we will do here. I have put together a series of questions and answered them below.

If you are new to Kubernetes or Containers, I would recommend you to start with any of these resources before jumping into this blog post:

Once you are comfortable with the basics, I would highly recommend sticking to the official Kubernetes documentation since it acts as the single source of truth for everything Kubernetes. But if you want to learn more about the various tools supporting Kubernetes, I would recommend going through their respective docs.

And if you really want to hear this all from industry experts and learn more by going through their case studies, I would highly recommend having a look at the CNCF YouTube channel, which hosts tons of useful resources on Kubernetes and its ecosystem (don’t miss the KubeCon + CloudNativeCon talks).

Why Cloud Native?

In our last blog we looked at what Cloud Native is and what the stack looks like. The main reasons why you would want to go cloud native are:

  • Achieve maximum scalability, but get there incrementally, scaling either on demand or automatically based on constraints
  • Build systems which respond well to faults rather than trying to avoid them altogether, which is not possible once you scale a distributed system
  • Avoid any sort of lock-in by adopting a vendor-agnostic model that leverages standards, platforms like Kubernetes and the other cloud native constructs available to you
  • Accelerate the inner dev loop and the complete cycle from application development to production by putting standard automation and scalable CI/CD pipelines in place, allowing for agile development and a better release and delivery process
  • Put a high emphasis on the health and resilience of your application, be it security, monitoring, logging, distributed tracing, backups/failovers and more, by leveraging the tools and mechanisms already available as part of the cloud native stack
  • Handle different architectures, be it private cloud, public cloud, on-premise or even hybrid cloud, without having to change too much in the application layer

So, this would typically mean that it is highly recommended to go Cloud Native for almost everything irrespective of your use case. Your adoption level of various tools and technologies may vary depending on your use case, but the principles do remain almost always the same.

I use Docker Compose. How do I transition to Kubernetes?

If you are starting off and have a lot of Compose files to convert, then I would recommend trying out a tool like Kompose, which takes in your Compose files and generates the K8s YAML files for you, simplifying your task from there.

But this is recommended only when you are starting off as a beginner with Kubernetes. Once you start working closely with it, it is recommended to have your dev environment also run on a K8s cluster, be it a local one (something like Minikube, Kind, MicroK8s, K3s or anything else for that matter) or a remote one (like GKE, AKS, EKS and so on). This way you get a consistent experience from development to production, and you will have to get comfortable with it sooner or later anyway if you use Kubernetes.
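To give a feel for the kind of conversion a tool like Kompose performs, here is a minimal sketch in Python. It is not Kompose's actual logic: the function name and the sample service are hypothetical, and it only handles an image, simple "HOST:CONTAINER" port strings and environment variables.

```python
import json


def compose_service_to_deployment(name, service):
    """Map one simplified Compose service dict to a Kubernetes Deployment manifest."""
    container = {"name": name, "image": service["image"]}

    # Compose ports look like "HOST:CONTAINER"; keep only the container side.
    ports = [
        {"containerPort": int(str(p).split(":")[-1])}
        for p in service.get("ports", [])
    ]
    if ports:
        container["ports"] = ports

    env = service.get("environment", {})
    if env:
        container["env"] = [{"name": k, "value": str(v)} for k, v in env.items()]

    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name},
        "spec": {
            "replicas": 1,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {"containers": [container]},
            },
        },
    }


# Hypothetical Compose service, written as the dict a YAML parser would produce.
web = {"image": "nginx:1.19", "ports": ["8080:80"], "environment": {"MODE": "dev"}}
manifest = compose_service_to_deployment("web", web)
print(json.dumps(manifest, indent=2))
```

Since kubectl accepts JSON as well as YAML, the printed manifest could in principle be applied as-is, but a real conversion also has to deal with volumes, networks, Services and more, which is exactly why a dedicated tool is worth using.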

How do I run a Kubernetes Cluster locally?

As we mentioned, there are a lot of options which help us run K8s clusters locally. Some of the notable options would be:

  • Minikube: A single-node K8s cluster, typically running within a VM, maintained as an official Kubernetes project. It can be pretty bulky if you are low on resources or want to run multiple clusters on one machine, and it also takes quite some time to start or stop the cluster. Has a high compatibility with upstream K8s versions.
  • Kind: Runs Kubernetes clusters within Docker. It is a Kubernetes SIGs project and is used as the tool to test Kubernetes itself. Starts and stops pretty quickly. Every node, including the control plane, is hosted within its own container and leverages the Docker networking for all the communication; since it can run on different container runtimes like Docker and Podman, the behavior can differ between them. Has a high compatibility with upstream K8s versions.
  • K3s: A lightweight, stripped-down version of Kubernetes, maintained by Rancher Labs. Only stable features are shipped, without many plugins, leading to a very small binary size, and it supports auto deployment. Has a high compatibility with upstream K8s versions, but watch out if you are using alpha features/plugins since you have to install them manually before you can use them.
  • MicroK8s: A lightweight Kubernetes version from Canonical, packaged as a snap (so, again, you don’t need a VM). It works better with Ubuntu than with other distributions and is not supported on distributions without snap support.

There are other options as well like Firekube, or you can even use Kubeadm directly if you want 🤔. Try everything and use what is best for you.

But do remember that development and production K8s clusters can turn out to be pretty different. So, try testing in a staging environment before you ship something.

Why do I need Helm?

When you are starting off with Kubernetes, you may not need Helm. In fact, I would recommend not using it when you start. You can just go with good old YAML files and get things working.

But Helm provides great value once you start diving deeper: it helps when you introduce multiple environments with multiple configurations, a packaging/release process, or rollbacks and roll-forwards. It also acts as a package manager, letting you use a lot of the OSS tools already available out there by just changing the config as you need.

Considering that you don’t need Tiller anymore starting with Helm 3, all you need is a client to work with Helm.

In summary, Helm can help you with things like templating, packaging/release/versioning and also be a package manager for Kubernetes.
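One way to picture the multi-environment part: a chart carries default values and each environment overrides only what differs. The sketch below is a plain Python illustration of that layering idea, not Helm's implementation. The function name, keys and values are all hypothetical.

```python
def merge_values(base, override):
    """Return a new dict where `override` wins, merging nested dicts key by key."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are mappings: merge them instead of replacing wholesale.
            merged[key] = merge_values(merged[key], value)
        else:
            merged[key] = value
    return merged


# Hypothetical defaults shipped with a chart, and a production override file.
defaults = {"replicaCount": 1, "image": {"repository": "example/app", "tag": "1.0.0"}}
production = {"replicaCount": 3, "image": {"tag": "1.2.0"}}

values = merge_values(defaults, production)
print(values)
```

The point of the design is that the production file stays tiny: it states only the replica count and image tag, and everything else is inherited from the defaults.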

How do I check in secrets and credentials to my version control?

This has always been a major problem for developers and one of the main reasons behind a lot of attacks on systems. It has led to a lot of concern about the way credentials are handled, while there is still a need to store them safely somewhere.

This can be handled in multiple ways:

  • Use a credential manager like Vault which can help you manage secrets and sensitive data in one place
  • Encrypt the confidential data/credentials with a Key Management Service (KMS) using a tool like SOPS and check the encrypted credentials in to version control. For very confidential credentials you can also use an HSM (Hardware Security Module), which typically provides the highest level of physical security

While you do all this, it is important to have tools in place in case accidents do happen. For instance, a service like GitHub has the ability to scan for secrets in repositories and you may want to leverage it as well. And even after this, if the worst happens, pray for the best and revoke the compromised credentials immediately.
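To illustrate what secret scanning boils down to, here is a minimal sketch that checks text line by line against a couple of patterns. The rule set is a deliberately tiny assumption of mine; hosted scanners use far larger and more precise rule sets, and the sample key is the well-known placeholder from AWS documentation.

```python
import re

# Hypothetical, deliberately small rule set; real scanners ship many more patterns.
PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "private_key_header": re.compile(r"-----BEGIN (?:RSA |EC |OPENSSH )?PRIVATE KEY-----"),
}


def scan_text(text):
    """Return (line_number, rule_name) for every line matching a known pattern."""
    findings = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for rule, pattern in PATTERNS.items():
            if pattern.search(line):
                findings.append((lineno, rule))
    return findings


sample = "user = admin\naws_key = AKIAIOSFODNN7EXAMPLE\nregion = us-east-1\n"
for lineno, rule in scan_text(sample):
    print(f"line {lineno}: possible {rule}")
```

A check like this could run as a pre-commit hook so that a leak is caught before it ever reaches the remote repository.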

I use a legacy stack. How do I make my application Cloud Native?

Making an application Cloud Native is a continuous process with opportunities always available for changes or improvements. The best way to start off is to take one component at a time and migrate rather than attempting an all-out migration which may not be feasible to start with.

For instance, if you have a portion of the application which requires high scalability, try doing a lift and shift of that portion alone, drive just a portion of the traffic to the new deployment and see how it behaves to find out if the migration actually helps you. This is where A/B testing or canary deployments with a service mesh can actually help.
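The idea of driving only a portion of the traffic to the new deployment can be sketched as follows. This is a simplified illustration rather than how any particular service mesh implements it: the function name, the backend names and the choice of hashing a request or user id are all assumptions.

```python
import hashlib


def pick_backend(request_id, canary_percent):
    """Send roughly `canary_percent` percent of ids to the canary, deterministically."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # stable bucket in 0..99
    return "canary" if bucket < canary_percent else "stable"


# Route 10,000 hypothetical users with a 10% canary weight and count the split.
counts = {"canary": 0, "stable": 0}
for i in range(10000):
    counts[pick_backend(f"user-{i}", 10)] += 1
print(counts)
```

Hashing the id instead of picking at random keeps each user on the same side across requests, which makes the behavior of the new deployment easier to compare and to roll back.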

To keep your application available while you complete your migration, it is also recommended to keep the legacy architecture running in parallel with the cloud native implementation.

In many cases, making an application Cloud Native might require application changes or, in very rare cases, a rewrite. So, make sure you evaluate all your options and step in with a clear feasibility study.

How do I build a Highly Available Kubernetes cluster?

In some mission-critical cases, you might need to make your Kubernetes clusters highly available. But don’t get too paranoid, because the more highly available your architecture is, the more complex it becomes. So, be thoughtful and assess if you really need it before you start off.

Making a cluster highly available might involve running multiple replicas of the masters, split across different zones/regions, with a syncing mechanism established between their respective etcd stores. It might also involve provisioning nodes across different regions so that a failure in one region does not affect the traffic in the rest. You might want to check how to create an HA cluster with Kubeadm here.

Also, if you want your application to be highly available, you may want to maintain multiple replicas of your pods and make sure that they don’t all end up on the same node by leveraging labels and pod affinity/anti-affinity.
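As a rough sketch of the pod anti-affinity idea, the snippet below builds a Deployment manifest whose replicas refuse to share a node. The manifest fields are standard Kubernetes ones, while the function name, app name and image are hypothetical.

```python
import json


def deployment_with_anti_affinity(name, image, replicas):
    """Build a Deployment whose pods must land on different nodes."""
    labels = {"app": name}
    anti_affinity = {
        # Hard rule: never schedule next to a pod carrying the same label...
        "requiredDuringSchedulingIgnoredDuringExecution": [
            {
                "labelSelector": {"matchLabels": labels},
                # ...where "next to" means "on the same node".
                "topologyKey": "kubernetes.io/hostname",
            }
        ]
    }
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "affinity": {"podAntiAffinity": anti_affinity},
                    "containers": [{"name": name, "image": image}],
                },
            },
        },
    }


ha_manifest = deployment_with_anti_affinity("web", "example/web:1.0.0", 3)
print(json.dumps(ha_manifest, indent=2))
```

Note that with a hard rule like this, replicas beyond the number of eligible nodes stay pending. The softer `preferredDuringSchedulingIgnoredDuringExecution` form is worth considering if you would rather have co-located pods than unscheduled ones.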

In addition to this, you may also want to scale your application at the load balancer level using a software-based load balancer, leverage DNS to make sure you don’t end up sending all the traffic to the same place all the time, and serve static assets from a CDN, which can help even when the server/cluster fails.

In summary, scalability and availability can and should be addressed at multiple layers depending on where your bottlenecks or issues are.

How do I collaborate with my team in the same Kubernetes cluster?

There are a lot of ways in which you can isolate workloads and still collaborate with your team on Kubernetes. You can use one cluster for every developer, and some organizations do that. There is nothing wrong with it since it provides great workload isolation between team members, but it may turn out to be difficult to maintain over a long time (especially if there is a single Ops team doing that), given the work of handling security patches, upgrades, auth and so on. To avoid such scenarios, some other ways to do this would be:

  • Leverage Kubernetes Namespaces: Allocate a namespace to each team member or to a team as a whole and run your workloads within it. While this can work very well for small teams or individuals, it becomes difficult to manage as the number of namespaces grows with the number of teams/members using the cluster, making it difficult for the admin to do operations like setting up RBAC rules, removing unused namespaces, scaling the cluster and so on. To simplify this, there are tools like Okteto which keep you sane while you work on everything you have to.
    Or, if you have a separate namespace all to yourself, you can also use things like Swap Deployment and Proxy Deployment from Telepresence, effectively allowing you to develop your service locally on your own system.
  • Leverage the Same Namespace with Headers and Proxying: While a single namespace can be used by multiple developers, things can become tricky, especially if multiple developers are modifying the same service at the same time. One developer modifying something will lead to unexpected results for another developer since there is no isolation.
    This is where headers come to the rescue. Tools like Service Preview or Bridge for Kubernetes do exactly this: they leverage a sidecar like Envoy to route to a different instance of a service depending on the headers used in the requests. This is powerful because all you need is one namespace for everything, without blocking any developer from making the changes they want.
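The header-based routing idea can be sketched in a few lines. This is only an illustration of the decision a proxy makes, not the actual behavior of Service Preview, Bridge for Kubernetes or Envoy. The header name `x-dev-id` and the backend addresses are hypothetical.

```python
def choose_backend(headers, overrides, default_backend, header_name="x-dev-id"):
    """Pick a developer's private instance if the request carries their id."""
    # HTTP header names are case-insensitive, so normalize before the lookup.
    normalized = {key.lower(): value for key, value in headers.items()}
    dev_id = normalized.get(header_name.lower())
    return overrides.get(dev_id, default_backend)


# Hypothetical routing table: developer id -> their own instance of the service.
overrides = {"alice": "orders-alice:8080", "bob": "orders-bob:8080"}
shared = "orders:8080"

print(choose_backend({"X-Dev-Id": "alice"}, overrides, shared))
print(choose_backend({"Accept": "application/json"}, overrides, shared))
```

Requests without the header, or with an unknown id, fall through to the shared instance, so everyone else in the namespace is unaffected by one developer's experiment.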

Or if none of this works for you and you want complete control, you can spin up Kubernetes clusters locally and work with them with the help of tools like Tilt or Skaffold. The choice is again yours depending on what you want to do.

How do I do Site Reliability Engineering (SRE) in my K8s cluster?

Contrary to what most people think, Site Reliability Engineering (SRE) encompasses a wide array of activities, including monitoring system health, logging events, handling scalability, responding to incidents/failures in a timely and organized manner, building complex distributed systems and so on. There is so much to it that there is a whole website where Google talks about SRE here.

And while doing all of this is required, especially if your use case is SLA critical, a lot of it is offloaded away from you if you use managed Kubernetes clusters (like GKE, AKS, EKS and so on), leaving relatively little for you to cover. For instance, in the case of GKE, Google manages the masters and, if you are subscribed to a release channel, sends you periodic updates with the latest features both from GKE and from upstream. It also helps you manage logging, monitoring and other operations with the Google Cloud Operations suite (formerly Stackdriver), helping you get a quick start.

In addition to this, there are a lot of amazing tools which help with different SRE problems in their own way. For instance, you can use Prometheus to scrape and store time series metrics, Grafana to manage your dashboards, Fluentd or Loki to do log aggregation and conditional filtering, OpenTelemetry to instrument your application and expose metrics, Jaeger to do distributed tracing and Velero to manage backups and failovers. The list is long, as different tools cater to different problems in SRE.

We will talk more about this in our next blog post.

Is there a dashboard I can use to visualize and manage my clusters?

The great news is, there are quite a few.

  • Some distributions of Kubernetes come with their own default Dashboard UI, giving you all the basic info and control you need for your cluster. But do note that it might not be enabled by default for security reasons.
  • You can also use the dashboard provided by your cloud provider. In the case of GKE, there is a great dashboard where you can drill down into every resource and manage it using the options provided, which is really convenient when you want to do something very quickly.
  • Or you can have an amazing tool like Octant take care of this for you. Think of it as a user interface for your kubectl client.

There are even more dashboards like Weave Scope and other tools like these. Just go for what gives you the most visibility and control over your cluster with great usability and you should be good to go.

But for power user operations, we would always recommend going for kubectl since it is the one tool which almost everyone out there uses to interact with the Kubernetes API Server.

Do I need a Service Mesh?

Service Mesh has garnered a lot of popularity these days, especially after leaders like Linkerd, Istio and Consul demonstrated a different way to do networking, authorization, logging, instrumentation, A/B testing, mTLS and more with sidecars, without having to modify the application code.

While service mesh is really powerful, not every use case might need one, especially when you have very few services to manage.

Adding a service mesh requires both a control plane and a data plane to be set up, with proxies being injected as sidecars to do the heavy lifting for you. While this might seem complex to start with, the benefits become apparent as the number of services you manage increases, and this is when you will reap all its rewards.

Also, once the SMI Spec is widely supported by all the mesh providers (there is good support already), it will bring a lot of standardization to the service mesh ecosystem, avoiding the need to be coupled to an implementation. Do note that a sidecar might not always be well supported by the various tools you use with your application. For instance, adding a sidecar along with your database or event queue may or may not work depending on the protocol being used.

But overall, it has a very promising future especially when adopted incrementally. So, in summary you don’t need a mesh when you start but it is great to have once you have a significant number of services.

How do I manage Authentication & Authorization for the cluster and various services within?

There are various ways to do authentication and authorization, and they can vary depending on your context.

  • If you want to add authentication or authorization within your application/service, you can use any of the mechanisms you see fit, including JWT with OAuth2, sessions, cookies or even basic auth. This requires writing logic within the service, with libraries to help where needed, and redoing this for every service which needs auth is possible but can be difficult. To make this simpler, you can use a tool like OPA as an SDK, which can generalize a lot of things for you. OPA has native Golang support, while other languages are supported via WebAssembly (if you want to use OPA as an SDK).
  • The next way to add authentication/authorization is using sidecars, either through a service mesh or with OPA as a sidecar. This offloads the authentication/authorization away from your application, leaving just the business logic within. It allows you to inject the sidecars whenever needed without having to worry too much about how it might break your app. The sidecars can also handle things like mTLS, rate limiting and more as needed by your application.
  • If you would like to do cluster-level authorization to assign roles, policies and access controls, you can make use of either OPA Gatekeeper or rely on RBAC to get the job done for you.
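For the RBAC option, a minimal sketch of what gets created is shown below: a Role granting read-only access to pods in one namespace, and a RoleBinding attaching it to a user. The manifest fields are the standard `rbac.authorization.k8s.io/v1` ones, while the helper name, namespace and user name are hypothetical.

```python
import json

RBAC_GROUP = "rbac.authorization.k8s.io"


def read_only_pod_access(namespace, user):
    """Build a Role and RoleBinding letting `user` read pods in `namespace`."""
    role = {
        "apiVersion": f"{RBAC_GROUP}/v1",
        "kind": "Role",
        "metadata": {"name": "pod-reader", "namespace": namespace},
        "rules": [
            # "" is the core API group, which is where pods live.
            {"apiGroups": [""], "resources": ["pods"], "verbs": ["get", "list", "watch"]}
        ],
    }
    binding = {
        "apiVersion": f"{RBAC_GROUP}/v1",
        "kind": "RoleBinding",
        "metadata": {"name": "read-pods", "namespace": namespace},
        "subjects": [{"kind": "User", "name": user, "apiGroup": RBAC_GROUP}],
        "roleRef": {"kind": "Role", "name": role["metadata"]["name"], "apiGroup": RBAC_GROUP},
    }
    return role, binding


role, binding = read_only_pod_access("team-a", "jane")
print(json.dumps([role, binding], indent=2))
```

Keeping the Role separate from the RoleBinding means the same set of permissions can be bound to other users or groups later without redefining the rules.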

How do I support hybrid cloud with the help of the Cloud Native Stack?

If you are on board with Kubernetes and the rest of the Cloud Native stack, then supporting hybrid cloud can be pretty easy, provided you don’t use services/APIs specific to a cloud provider. Almost every big cloud provider out there offers managed Kubernetes as a service today, and if you insist, you can also spin up VMs, run your Kubernetes cluster within them and manage it yourself.

Projects like KubeFed and Crossplane are especially useful here since they help you manage and orchestrate clusters, and the requests you send to them, across different cloud providers, even if it’s going to be across regions.

While these are the best tools to manage these kinds of hybrid cloud scenarios, using a service mesh can also help if you have a multicluster architecture like this or this setup, helping you communicate across cloud providers.

Which container runtime should I use?

Kubernetes schedules pods rather than raw containers, which lets it support multiple container runtimes. While Docker was one of the supported runtimes so far, Docker support in the kubelet (dockershim) has recently been deprecated in favor of runtimes which implement the CRI standard directly, removing the need for the shim. The other recognized runtimes would be containerd, or even a low-level runtime like runc. You can read more about how they compare in this post or even this. As they mention, making a call to Docker Engine today makes a call to containerd, which in turn makes a call to runc. The main difference lies in the fact that every runtime offers a different level of abstraction, and ultimately the lowest level of the hierarchy is going to be LXC, which is written in C, or runc, which is written in Go.

While you can go for any runtime which supports your use case and is also supported by your cloud provider, a great start would be to build an OCI-compliant image so that you can use it across different runtimes.

What are the differences between CRI, CSI, CNI, SMI? Why do they matter?

All of them are different standards meant to avoid possible complexities and vendor lock-in allowing for a great level of interoperability.

  • CRI (Container Runtime Interface) is a standard interface between the kubelet and container runtimes, which helps establish interoperability across multiple container runtimes like containerd and others
  • CSI (Container Storage Interface) is a standard which helps establish interoperability between multiple storage providers avoiding the need to have in-tree plugins within the core. So, any storage provider who supports CSI can work with Kubernetes without any issues. You can find a complete list of providers supporting CSI here
  • CNI (Container Networking Interface) is a standard which helps establish interoperability between multiple networking solutions again avoiding the need to have in-tree plugins within the core and separating container networking and execution. There are a lot of plugins and runtimes which support CNI today.
  • SMI (Service Mesh Interface) is a standard which helps establish interoperability between various service mesh solutions like Linkerd, Istio, Consul and more. Things like traffic access control, traffic metrics, traffic specs and traffic splitting are being standardized so that users do not get locked in to a specific provider.

How do I use Kubernetes at the Edge or on IoT Devices?

Kubernetes has become so versatile that it is now being used in very different kinds of environments, including fighter planes and Raspberry Pis. This calls for a different way of thinking, including support for offline operation and control, lightweight distributions which can run with restricted compute, support for different processor architectures and so on.

Use cases like these are made possible by projects like KubeEdge, K3s and Virtual Kubelet. You can read more about how they power the edge with different architectures and compromises here.

How do I start with Infrastructure as Code?

Infrastructure as Code (IaC) is not to be confused with Configuration Management though a lot of tools do have overlapping functionality providing features from both worlds.

There are a lot of tools which help you manage your infrastructure as code, the most notable being the likes of Terraform, Pulumi, Ansible, Puppet and more, each of which works differently. For instance, Terraform is declarative and uses HCL (HashiCorp Configuration Language) while Pulumi leverages general-purpose programming languages to do its job.

The best way to start with Infrastructure as Code is to adopt it incrementally, as you normally would for any migration (unless you are starting from scratch). You also have a lot of community resources which can help you in the process. For instance, if you use Terraform, the Terraform Registry hosts a lot of Terraform modules from the community along with support for a wide range of providers. Interestingly, it also allows you to manage resources like deployments, services and so on in your K8s cluster if you want to do it the Terraform way. So, the options are endless.

But Infrastructure as Code makes the most sense if you truly leverage GitOps, since it is always important to have any proposed change to your infrastructure properly reviewed by the respective stakeholders before it is applied. You should also make sure that you avoid possible conflicts by keeping the state file, with locking, in a remote location like GCS or the equivalent service from your cloud provider.

If you want to adopt DRY principles and keep your code maintainable, a project like Terragrunt can actually help you with this.

As we just saw, there are loads of options to go for. Just make sure that you review your changes properly before applying them, since an unreviewed change can have disastrous effects.

How do I do CI/CD and GitOps with Kubernetes?

There have been a lot of amazing projects in this area of late, so much so that there is now a separate foundation dedicated to this. While projects like Jenkins were leaders before, the cloud native world has an interesting set of problems to be solved, starting with making your CI/CD pipeline itself scalable and adherent to all the cloud native principles we discussed. This is what has led to the rise of projects like Tekton (which was born out of Knative), Jenkins X (which also uses Tekton), Spinnaker, GitLab CI with Kubernetes executors if you are on GitLab, GitHub Actions if you are on GitHub and so on, along with GitOps tools like FluxCD and ArgoCD, giving you a myriad of options to play with.

When doing CI/CD on Kubernetes, you can either leave the runner to be hosted by your service provider (e.g. GitHub/GitLab hosted runners) or host your own runner on the K8s clusters you want. Either way, you get the power to scale the runners as much as you need and run multiple pipelines in parallel without blocking others.

Your runner can take on the job of pulling your code or binaries from version control, building the image out of it, pushing it to the registry and finally deploying it to the target clusters after all the necessary tests and checks are done.

Otherwise, the way you would do CI/CD in a cloud native world is similar to that of any other CI/CD pipeline.

How do I do serverless in Kubernetes?

Well, you always have servers. Serverless, in my view, is nothing but a set of abstractions which help you forget about the servers and all the complexity and scalability concerns behind them, allowing you to focus on just the business logic at hand. Kubernetes allows you to do this as well.

You can opt for projects like Knative, OpenFaaS or Kubeless, all of which allow you to run serverless workloads on Kubernetes. But if you are using a managed offering like GKE, your cloud provider may have its own solution like Cloud Run, which comes close to hosting your own serverless platform. Ultimately, the containers spin up/down in your cluster depending on the compute they need.
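The spin up/down behavior can be pictured with a small calculation: given the number of in-flight requests and a target number of requests per pod, work out how many pods to run, going all the way down to zero when idle. This is a simplified sketch of the general idea, not the actual autoscaling algorithm of Knative or any other project, and all names and numbers are assumptions.

```python
import math


def desired_replicas(in_flight_requests, target_per_pod, max_replicas):
    """Pods needed to keep each pod at or below its target concurrency."""
    if target_per_pod <= 0:
        raise ValueError("target_per_pod must be positive")
    if in_flight_requests <= 0:
        return 0  # nothing to serve: scale to zero
    return min(max_replicas, math.ceil(in_flight_requests / target_per_pod))


# Hypothetical traffic samples over time, with 10 requests per pod as the target.
for load in (0, 4, 25, 1000):
    print(load, "->", desired_replicas(load, target_per_pod=10, max_replicas=5))
```

The upper bound matters as much as scale-to-zero: without it, a traffic spike could request more pods than the cluster can actually host.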

How do I choose my cloud provider?

This is a huge debate to have and can depend on both your use case and a lot of other factors.

But there are a few things to keep in mind while choosing a cloud provider.

Make sure that:

  • They provide all the basic services you need to help you with your use case
  • They provide the services at an affordable cost, even as you scale up/down
  • They have great documentation, support and a great developer relations team
  • The services they provide don’t lock you in to their platform
  • They provide services in most of the regions you would want to serve your customers in
  • They provide a great emphasis on security, performance and usability in all their offerings
  • They have an SLA to offer you depending on your needs and also have a good track record on quickly responding to possible incidents
  • They satisfy all your compliance requirements and also have the necessary certifications to prove the same
  • They have a rapid innovation culture which can support you in your future ventures or as you scale up

These are just a few of the criteria to start with. But above all, make sure that you give it a try yourself before going for it.

Where can I learn more about the Cloud Native projects?

The best place to check this all out would be here:

While this does not cover all the Cloud Native projects (since it only hosts projects within the CNCF), this would be a great start for you depending on the domain you want to explore further.

But if you want to explore other projects, you can also have a look at the Cloud Native Landscape here:

While there are a lot of projects in that list, do note that they are at different maturity levels, ranging from Sandbox to Graduated. So, please be mindful of that when you are choosing something since it might undergo significant changes over time.

Are there some case studies which can actually help me in the implementation?

Yes, and there are lots. For a start, you can go through the Kubernetes specific case studies here and case studies from other CNCF projects here.

If you want to know more about how a specific organization uses Kubernetes, it’s highly likely that it’s already out on YouTube. Just give it a search and you can find tons of them. Or if you want to explore more, all you have to do is head over to KubeCon + CloudNativeCon and you will find a lot of speakers talking about their experience using Kubernetes and all the tools available out there.

How do I contribute back to the Kubernetes and Cloud Native community?

Every bit of contribution really matters. And there are a lot of ways in which you can help.

  • Contribute to the K8s Docs
  • Contribute to K8s with bug fixes, enhancements, failing tests, feedback and so on
  • Help the community by joining the various channels within the Kubernetes Slack community
  • Contribute to all the CNCF projects or projects from the Cloud Native landscape
  • Write blogs like these, host meetups, speak at conferences about your experience and evangelize in the best way you can
  • Join the CNCF and support the projects directly/indirectly
  • Find a problem not addressed by anyone in the community? Propose and develop your own solution and contribute back to the community

There are a lot of small ways in which you can give back. Small or big does not matter. Every contribution counts.

Hope this was informative. Do you have a question which I have not covered in this list, or are you looking for some help or engineering advisory/consultancy? Let me know by reaching out to me @techahoy. I will be treating this blog as a living document and will update it with more useful Q&As when I find time.

If this helped, do share it with your friends, hang around and follow us for more like this every week. See you all soon.