Problems you face working with a big platform
I have worked as a “DevOps Engineer” for different companies at different scales. As you might expect, they have different definitions of what a DevOps engineer is.
These are the types of “DevOps” work I have done:
- Build Infrastructure as Code, CI/CD solutions, and developer-experience tools for internal teams (build, but not operate)
- Use internal platform/public PaaS to build a specific product (use what’s there, build what’s missing), sometimes called local DevOps
- Run and Fix (mostly keeping the engine running and fixing what’s broken; you could call some fixes builds too, but in reality they just patch the wounds so you can move on)
- Build and Run (I prefer to call this SRE, where you have to balance building the platform against keeping it stable)
I find that working as an SRE is great for my career because I get to take on both challenges:
- Keeping up with technologies and building a product, just like a Build team
- Maintaining the stability of the platform and running into real-world problems that deepen your understanding of how things work together, just like a Run team
So, in my view, to be a more complete DevOps engineer, it is better to have some experience on the Run side in addition to the Build side.
Another dimension that contributes to this, regardless of the type of team, is scale.
The bigger the scale, the more difficult the problems you will face and, of course, the more experienced a DevOps engineer you will become.
In this blog post, I will walk you through things I only get to work on when working with a relatively big platform.
Instability
With bigger platforms, there’s a much higher chance that things will go wrong.
Compare running just a few nodes in one cluster with running hundreds of nodes and thousands of pods.
When running a small platform, from time to time you will see OOM-killed pods or nodes filling up; you add more resources and call it a day.
However, on a bigger scale, these problems happen many times a day, continuously. Without good telemetry and automation, you will be flooded with so much toil that you cannot do anything else.
Also, some problems only show up once the system is big enough. For example:
- Prometheus grows far too large because of the high cardinality of metrics exposed by applications on your platform
- The logging pipeline gets flooded by a few applications, degrading logging for everyone
- A rolling upgrade of a 100-node cluster causes a network disruption
- etcd fills up, putting the Kubernetes API into read-only mode
The quote I like to use for this is “what could go wrong, will go wrong”.
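To make the Prometheus cardinality problem from the list above more concrete, here is a minimal Python sketch. It does not talk to Prometheus at all; the metric names, label names, and the SERIES sample are hypothetical and only illustrate how one unbounded label multiplies the number of time series.

```python
from collections import defaultdict

# Hypothetical sample of scraped series: (metric name, labels).
# In a real setup this information would come from Prometheus itself.
SERIES = [
    ("http_requests_total", {"service": "cart", "path": "/items", "user_id": "u1"}),
    ("http_requests_total", {"service": "cart", "path": "/items", "user_id": "u2"}),
    ("http_requests_total", {"service": "cart", "path": "/items", "user_id": "u3"}),
    ("http_requests_total", {"service": "cart", "path": "/pay", "user_id": "u1"}),
    ("process_cpu_seconds_total", {"service": "cart"}),
    ("process_cpu_seconds_total", {"service": "search"}),
]


def series_per_metric(series):
    """Count distinct label combinations (time series) for each metric name."""
    unique = defaultdict(set)
    for name, labels in series:
        unique[name].add(frozenset(labels.items()))
    return {name: len(label_sets) for name, label_sets in unique.items()}


def distinct_values_per_label(series, metric):
    """For one metric, count how many distinct values each label takes."""
    values = defaultdict(set)
    for name, labels in series:
        if name == metric:
            for key, value in labels.items():
                values[key].add(value)
    return {key: len(vals) for key, vals in values.items()}


if __name__ == "__main__":
    print(series_per_metric(SERIES))
    print(distinct_values_per_label(SERIES, "http_requests_total"))
```

In this toy data, the user_id label takes the most distinct values, so it is the one driving the series count. On a real platform with many tenants, a label like that on a single metric can be enough to bloat the whole monitoring stack.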
Technological complexity
On a smaller-scale system, most of the time it is just a matter of choosing tools.
The platform is normally not mature, or sometimes not yet a platform.
It still lacks most of the core features, so your job is mainly to implement what is missing to make it a workable platform.
In this kind of situation, there won’t be many conflicts about what you can or cannot do.
For example, when there’s a requirement for monitoring, someone will say, use Prometheus! When there’s a requirement for logging, someone will say, use ELK!
However, when you work on a bigger scale, there will be problems that have no “out-of-the-box” solution, and edge cases that you have to work around to fit your environment or the specific context of your organization.
Legacy problems
- How to manage multiple Kubernetes versions at the same time? (your codebase will diverge)
- Issues that arise from using out-of-date tools/libraries (trust me, they are scary)
Complex problems
- How to make services highly available across multiple clusters?
- How to minimize the impact of a full cluster rebuild?
- How to define solid and reliable SLIs/SLOs?
- How to handle network connectivity with on-premises environments?
Mostly, when you have more tenants, more nodes, and more workloads, there will also be more toil and operational work to handle.
Reducing this toil effectively requires fine-tuning and automation here and there, and those things can become complex to maintain as well.
Multi-tenancy
When you run a platform that has multiple development teams operating on it, other aspects add complexity to the picture, apart from the instability mentioned above:
Security
- How to detect vulnerabilities from tenants' applications and reduce the blast radius on the platform?
- How to manage authentication and authorization for each customer persona following the principle of least privilege?
Resource
- How to detect and prevent one tenant from using up shared resources (for example logging, traffic, and cluster resources) and affecting other tenants?
- How to autoscale the platform cost-effectively based on tenants' usage (resource quotas, limits, etc.)?
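As a rough illustration of the “one tenant uses up all the resources” question, the Python sketch below flags tenants that either exceed their own quota or hold a dominant share of the cluster. Everything here is an assumption made for the example: the TenantUsage fields, the 50% max_share threshold, and the sample numbers. A real platform would pull usage from its metrics system and enforce limits with its own quota mechanism.

```python
from dataclasses import dataclass


@dataclass
class TenantUsage:
    tenant: str             # hypothetical tenant name
    cpu_cores_used: float   # current CPU usage, in cores
    cpu_cores_quota: float  # CPU the tenant is allowed to use, in cores


def noisy_tenants(usages, cluster_cpu_cores, max_share=0.5):
    """Return (tenant, reason) pairs for tenants that are over their own quota
    or that hold more than `max_share` of the cluster's CPU."""
    flagged = []
    for u in usages:
        if u.cpu_cores_used > u.cpu_cores_quota:
            flagged.append((u.tenant, "over quota"))
        elif u.cpu_cores_used / cluster_cpu_cores > max_share:
            flagged.append((u.tenant, "dominant share"))
    return flagged


if __name__ == "__main__":
    usages = [
        TenantUsage("team-a", cpu_cores_used=60, cpu_cores_quota=80),
        TenantUsage("team-b", cpu_cores_used=12, cpu_cores_quota=10),
        TenantUsage("team-c", cpu_cores_used=5, cpu_cores_quota=20),
    ]
    for tenant, reason in noisy_tenants(usages, cluster_cpu_cores=100):
        print(f"{tenant}: {reason}")
```

Note that team-a is within its quota but still gets flagged, which is the point: a quota that is too generous does not protect the other tenants, so it helps to look at both per-tenant limits and the share of the whole platform.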
Platform visions
Each tenant comes with its own needs and sometimes it’s not possible to say yes to all of the incoming requirements.
Implementing a tenant-specific requirement might:
- impact other tenants
- cause more platform instability
- make the platform more difficult to evolve in the future
So you need to think carefully, weigh your long-term goals against your short-term goals, and select a solution that balances the two.
Service Level Agreement
With a mature enough platform, it’s common to be asked about a Service Level Agreement (SLA) and Service Level Objectives (SLOs). These help build trust with your customers and help you understand the performance and reliability of your platform.
Setting up an SLA/SLO requires a thorough understanding of each component in your platform and a solid Service Level Indicator (SLI) for each of them.
The more you offer your customers, the more work it takes to understand, implement, and observe your agreements.
You cannot promise your customers a good service if you don’t even know how it performs at the moment, right?
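To show how an SLI, an SLO, and the resulting error budget relate to each other, here is a small Python sketch. It assumes a simple request-based availability SLI (good requests divided by total requests); the function names, the 99.9% target, and the request counts are made up for the example and are not taken from any particular platform.

```python
def availability_sli(good_events, total_events):
    """SLI as the fraction of good events; no traffic is treated as fully available."""
    if total_events == 0:
        return 1.0
    return good_events / total_events


def error_budget_remaining(sli, slo_target):
    """Fraction of the error budget left:
    1.0 means untouched, 0.0 means spent, negative means the SLO is breached."""
    allowed_failure = 1.0 - slo_target
    if allowed_failure <= 0:
        raise ValueError("slo_target must be below 1.0")
    actual_failure = 1.0 - sli
    return 1.0 - actual_failure / allowed_failure


def allowed_downtime_minutes(slo_target, window_days=30):
    """Downtime allowed by an availability target over a window of days."""
    return (1.0 - slo_target) * window_days * 24 * 60


if __name__ == "__main__":
    sli = availability_sli(good_events=999_500, total_events=1_000_000)
    print(f"SLI: {sli:.4%}")
    print(f"Error budget remaining: {error_budget_remaining(sli, 0.999):.0%}")
    print(f"Allowed downtime: {allowed_downtime_minutes(0.999):.1f} minutes")
```

With these numbers, a 99.9% target over 30 days leaves roughly 43 minutes of allowed downtime, and failing 0.05% of requests uses up about half of the budget. Seeing how small that margin is makes it clearer why you need to know how each component behaves before promising anything.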
Final thoughts
People enjoy different kinds of work depending on their interests. I used to enjoy working in a Build team, creating cool IaC for other teams.
However, working in an SRE team has helped me grow my understanding of how to build an extensible, stable, and evolvable platform. Facing uncommon issues is always a good way to learn. Defining SLOs/SLAs can be a headache, but the experience is well worth it.
So, I highly recommend that junior DevOps engineers find work that lets them experience both the Build and Run sides of a platform to grow their careers.
Thank you!!