<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[Stories by Madokai on Medium]]></title>
        <description><![CDATA[Stories by Madokai on Medium]]></description>
        <link>https://medium.com/@madokai?source=rss-4d67165c29db------2</link>
        <image>
            <url>https://cdn-images-1.medium.com/fit/c/150/150/1*IpLHjENsCP5XF-gMFHDPxw.png</url>
            <title>Stories by Madokai on Medium</title>
            <link>https://medium.com/@madokai?source=rss-4d67165c29db------2</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Wed, 06 May 2026 16:49:14 GMT</lastBuildDate>
        <atom:link href="https://medium.com/@madokai/feed" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
        <atom:link href="http://medium.superfeedr.com" rel="hub"/>
        <item>
            <title><![CDATA[How AI Accelerates The Obsolescence of “Old” DevOps Engineers]]></title>
            <link>https://madokai.medium.com/how-ai-accelerate-the-obsolescence-of-old-devops-engineers-cc0106eb06c4?source=rss-4d67165c29db------2</link>
            <guid isPermaLink="false">https://medium.com/p/cc0106eb06c4</guid>
            <category><![CDATA[careers]]></category>
            <category><![CDATA[ai]]></category>
            <category><![CDATA[artificial-intelligence]]></category>
            <category><![CDATA[devops]]></category>
            <dc:creator><![CDATA[Madokai]]></dc:creator>
            <pubDate>Thu, 09 Apr 2026 12:44:36 GMT</pubDate>
            <atom:updated>2026-04-09T12:44:36.776Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*aewFWmieHy8tow5JKbmZXQ.jpeg" /><figcaption>Photo by Yan Krukau: <a href="https://www.pexels.com/photo/person-laughing-at-a-man-in-white-top-7640450/">https://www.pexels.com/photo/person-laughing-at-a-man-in-white-top-7640450/</a></figcaption></figure><p>In the following article, I share my personal perspective on the shifting landscape of tech roles and the inevitable transformation of what it means to be an ‘engineer’ in the AI era.</p><h3>The Draft I Never Published</h3><p>Two years ago, I sat down to write this article. At the time, ChatGPT was a novelty, and the “disruption” I envisioned felt like a distant, manageable wave. I wrote about predictive analytics and self-healing systems as if they were features we would slowly integrate into our existing workflows over a decade.</p><p>I never published it because the world moved faster than my keyboard could keep up. What I thought would take ten years has manifested in two. Today, the “Old DevOps Engineer” isn’t defined by age, but by a refusal to acknowledge that the floor has dropped out from under the traditional IT career path.</p><h3>The Death of the Gatekeeper</h3><p>In my original draft, I viewed DevOps as the bridge between code and infrastructure. But today, AI is building that bridge autonomously.</p><p>We are moving past simple automation into the era of cognitive infrastructure. When an LLM can generate a Kubernetes manifest, debug a 500-error in a Python script, and optimize a Terraform plan in seconds, the role of the “Human Gatekeeper” vanishes. If your value was based on knowing the specific syntax of a YAML file or how to manually configure a Jenkins server, that value has been commoditized.</p><h3>The Definition of “Old” Has Changed</h3><p>When I speak of “old” engineers now, I don’t mean those with gray hair. 
I mean those, whether Developers, QAs, Architects, or Project Managers, who stopped being proactive.</p><p>In the pre-AI era, you could coast on “legacy knowledge” for five years. Today, the half-life of technical knowledge is shrinking toward zero.</p><ul><li><strong>The Architect</strong> who relies on patterns from 2020 is obsolete.</li><li><strong>The Developer</strong> who refuses to use AI pair programmers is a bottleneck.</li><li><strong>The Manager</strong> who plans for 6-month delivery cycles is a liability.</li></ul><p>The barrier to entry for complex tasks has lowered so much that “experience” is being outpaced by “adaptability.”</p><h3>… And Then Comes the Anxiety</h3><p>I will be honest: there is a growing anxiety in me that I didn’t feel two years ago. I used to think I had a decade to figure this out. I was wrong. The speed of AI integration into the Software Development Life Cycle is breathtaking, and for the first time in my career, the question of “job security” isn’t about the economy, it’s about the fundamental utility of a human in the loop.</p><p>This anxiety stems from the realization that the very labels we wear, DevOps, Developer, QA, Architect, are dissolving into a singular, blurry role: the <strong>AI Prompt Engineer</strong>. We are no longer specialists in syntax or infrastructure, we are becoming “orchestrators of intent.” If the machine can write the code, test the code, and deploy the code based on a well-crafted prompt, the traditional distinctions between our roles become purely academic.</p><p>Furthermore, the core principles of DevOps that we spent a decade perfecting are being “rolled back” in favor of raw velocity. We used to preach “quality at the source” and rigorous, multi-layered testing. But in the AI era, the mandate is to <strong>move forward FAST</strong>. 
We are seeing a shift toward roles like <strong>Forward Deployed Engineering</strong>, where the goal isn’t necessarily to build a perfect, “future-proof” system, but to solve immediate problems at the edge with immediate AI-generated solutions.</p><p>We are sacrificing the “perfectly tested pipeline” for the “instantly generated result.” In this new paradigm, the “old” engineer who insists on manual code reviews and two-week sprint cycles isn’t just slow, they are an obstacle to the business.</p><p>I watch these tools evolve with a mix of <strong>intense curiosity and creeping pessimism.</strong></p><ul><li><strong>The Optimist</strong> in me is thrilled: we are finally shedding the “toil” of IT. We can finally focus on pure logic and high-level problem solving.</li><li><strong>The Pessimist</strong> in me asks: how many “high-level problem solvers” does a company actually need when one person with an AI agent can do the work of a five-person DevOps pod?</li></ul><p>When speed becomes the only metric that matters, the “Safety First” culture of traditional DevOps feels like a relic of a slower age. We are building the plane while flying it, and AI is the one holding the blueprints.</p><h3>Survival is a Choice</h3><p>The IT domain is undergoing a “Great Compression.” Roles are merging, silos are collapsing, and the “old” way of working is being deleted like a deprecated library.</p><p>I am still curious. I am still learning. But I am no longer under the illusion that my job will look the same in three years. The only security we have left is our ability to learn faster than the models can train. 
If you aren’t running to keep up, you’ve already stopped moving.</p><h3>About The Author</h3><p><a href="https://www.linkedin.com/in/nicolas-giron-6129b0a1/">Nicolas Giron</a> — Staff MLOps — DevOps — Co-Founder <a href="https://madokai.com/">Madokai</a></p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Everyone Is Asking Which Jobs AI Will Take. That Is the Wrong Question.]]></title>
            <link>https://blog.devops.dev/everyone-is-asking-which-jobs-ai-will-take-that-is-the-wrong-question-9d45e8684dd8?source=rss-4d67165c29db------2</link>
            <guid isPermaLink="false">https://medium.com/p/9d45e8684dd8</guid>
            <category><![CDATA[mlops]]></category>
            <category><![CDATA[devops]]></category>
            <category><![CDATA[ai]]></category>
            <category><![CDATA[technology]]></category>
            <dc:creator><![CDATA[Madokai]]></dc:creator>
            <pubDate>Tue, 07 Apr 2026 10:01:32 GMT</pubDate>
            <atom:updated>2026-04-07T16:01:04.451Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*jRgVw8-2ZjHXO4RhwETSMw.png" /></figure><p>There is a debate happening across every engineering team, every conference, every LinkedIn feed right now. It is loud, it is anxious, and it is almost entirely focused on the wrong thing.</p><p>Everyone is asking which jobs AI will take. Too few are asking which jobs AI literally cannot function without.</p><p>Let me answer the second question.</p><h3><strong>Every Model in Production Runs on Something</strong></h3><p>When millions of people open an AI tool simultaneously, something has to scale to absorb that traffic. When a model endpoint goes down at 2AM, something has to detect it, alert on it, and recover from it. When a new model version is ready to ship, something has to build it, test it, containerize it, deploy it, and roll it back if it breaks.</p><p>That something is infrastructure. And infrastructure does not manage itself.</p><p>It does not debug its own networking issues. It does not rotate its own credentials when a secret is compromised. It does not right-size its own node pools when GPU costs spike unexpectedly. It does not write its own incident postmortem or update its own runbook after a failure mode nobody anticipated.</p><p>Every AI model in production exists because a human built and maintained the layer underneath it. That layer is not optional. It is not being automated away. It is, if anything, becoming more complex as the systems running on top of it become more sophisticated.</p><h3><strong>The Layer That Does Not Move</strong></h3><p>There is a pattern behind the roles that AI amplifies rather than replaces. The ones closest to the metal. The ones who own the layer everything else runs on.</p><p>The pattern is not about job titles. It is about proximity to the foundation.</p><p>Linux has run the internet for thirty years. 
Networking fundamentals have not changed meaningfully since the nineties. Kubernetes is ten years old and accelerating, not declining. The skills closest to the hardware and the operating system have the longest half-life in the entire industry. Apps change every year. Frameworks change every two years. The foundation stays.</p><p>Complexity does not disappear with abstraction. It moves down the stack and waits. You can operate on the surface for a long time. You can add memory to a pod without understanding how Linux manages it. You can ship a CI/CD pipeline without thinking about network policy. Until the day you cannot. The engineers who understand what lives underneath the abstraction are the ones who solve the problems nobody else can explain.</p><h3><strong>The Interface Is Changing. The Foundation Is Not.</strong></h3><p>Here is what is actually shifting, and it is significant.</p><p>Three years ago, operating infrastructure meant writing. You wrote Terraform modules. You wrote GitHub Actions pipelines. You wrote Kubernetes manifests, Helm charts, Ansible playbooks. The work was largely expressed in code, and the quality of that code was a direct function of how much you had written before.</p><p>That interface is changing.</p><p>With agentic systems, you increasingly orchestrate rather than author. You define the spec. You describe the desired state in natural language. You review what the agent produces, you validate it against your understanding of the system, you approve or reject it, and you own the outcome. The Terraform still gets written. The pipeline still gets built. The manifest still gets deployed. But the primary interface between the engineer and the system is shifting from authoring to orchestrating.</p><p>This is not a demotion. It is a leverage multiplier for the engineer who understands the system deeply enough to validate what the agent produces. 
And it is a trap for the engineer who does not.</p><p>An agent that generates a Kubernetes network policy without understanding the service mesh it is being deployed into will produce something that looks correct and breaks silently. The only person who catches that is the engineer who knows what correct actually looks like. Not because they read the output carefully, but because they understand the system the output is going into.</p><p>The interface changed. The requirement for deep system understanding did not.</p><h3><strong>The Agentic Shift Makes Infrastructure Engineers More Valuable, Not Less</strong></h3><p>The irony of the current moment is that the rise of agentic AI systems is one of the strongest arguments for investing in infrastructure expertise right now.</p><p>Agents need infrastructure to run. They need GPU clusters, managed Kubernetes services, optimized networking, cost-controlled cloud environments. They need observability stacks that can monitor non-deterministic systems where the definition of correct behavior is not a binary pass or fail. They need CI/CD pipelines that can handle model versioning, dataset lineage, and inference endpoint management alongside traditional software deployments. They need security models that account for prompt injection, model exfiltration, and a new class of attack surface that did not exist three years ago.</p><p>All of that is infrastructure work. None of it runs itself.</p><p>And beyond the infrastructure layer, agentic systems need engineers who understand them deeply enough to know when they are wrong. We built this into our own workflow. We use agents to accelerate the authoring of Terraform, pipeline configuration, and Kubernetes manifests. 
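</p><p>That authoring-to-review workflow can be made concrete. Below is a toy pre-apply gate in Python, a sketch of the kind of check a reviewer might encode before an agent-generated manifest reaches a cluster; the required fields and rules are illustrative assumptions, not a standard.</p>

```python
# Toy pre-apply gate for an agent-generated Kubernetes manifest.
# The required fields and rules below are illustrative assumptions.

REQUIRED_TOP_LEVEL = ("apiVersion", "kind", "metadata", "spec")

def review_manifest(manifest: dict) -> list[str]:
    """Return a list of objections; an empty list means the manifest passes the gate."""
    objections = []
    for key in REQUIRED_TOP_LEVEL:
        if key not in manifest:
            objections.append(f"missing top-level field: {key}")
    # Agent output can parse cleanly and still break silently in production,
    # so the reviewer's non-negotiables become explicit, mechanical checks.
    containers = (
        manifest.get("spec", {})
        .get("template", {})
        .get("spec", {})
        .get("containers", [])
    )
    for c in containers:
        if "resources" not in c:
            objections.append(f"container {c.get('name', '?')!r} has no resource limits")
        if str(c.get("image", "")).endswith(":latest"):
            objections.append(f"container {c.get('name', '?')!r} pins ':latest'")
    return objections

# A deployment-shaped manifest an agent might plausibly produce:
generated = {
    "apiVersion": "apps/v1",
    "kind": "Deployment",
    "metadata": {"name": "demo"},
    "spec": {"template": {"spec": {"containers": [
        {"name": "app", "image": "demo:latest"},
    ]}}},
}
objections = review_manifest(generated)
```

<p>The gate is deliberately dumb. The judgment lives in choosing which rules to encode, and that is exactly where production experience comes in.</p><p>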
But every output gets reviewed against a mental model of the system built from years of production experience, from incidents and postmortems, from migrations that went wrong and recoveries that had to be rebuilt from scratch under pressure.</p><p>That mental model is not something you can prompt into existence. It is built by operating systems at the edge of what they can handle, and by being accountable for what happens when they cannot.</p><h3><strong>Where to Invest the Next 12 Months</strong></h3><p>In twelve months, the teams that pull ahead will not be the ones that chose between infrastructure depth and agentic workflows. They will be the ones that stopped treating them as separate investments.</p><p>The engineer who writes the most effective Terraform prompt is the one who knows exactly what that Terraform will do in the real system. The one who catches the agent’s mistake in a Kubernetes manifest is the one who has debugged enough broken clusters to recognize what wrong looks like before it reaches production. The one who reviews AI-generated pipeline config the way you would review a pull request from an engineer you do not fully trust yet, because that is exactly what it is.</p><p>The foundation is what makes the agentic layer trustworthy. The agentic layer is what makes the foundation scalable. They are not two career paths. They are the same engineer at two different moments in the same workflow.</p><p><em>This is what we do at Madokai. We live inside the infrastructure, the pipelines, the observability stacks, the security, and the agentic layer on top of all of it. 
If your stack is ready for this conversation, so are we.</em></p><h3>About The Author</h3><p><a href="https://www.linkedin.com/in/hicham-bouissoumer/">Hicham Bouissoumer</a> — Principal DevOps/MLOps — Co-Founder <a href="https://madokai.com/">Madokai</a></p><hr><p><a href="https://blog.devops.dev/everyone-is-asking-which-jobs-ai-will-take-that-is-the-wrong-question-9d45e8684dd8">Everyone Is Asking Which Jobs AI Will Take. That Is the Wrong Question.</a> was originally published in <a href="https://blog.devops.dev">DevOps.dev</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[The Agentic Shift. Your Stack Is Ready. Your Workflow Isn’t.]]></title>
            <link>https://madokai.medium.com/the-agentic-shift-your-stack-is-ready-your-workflow-isnt-e65279133eb6?source=rss-4d67165c29db------2</link>
            <guid isPermaLink="false">https://medium.com/p/e65279133eb6</guid>
            <category><![CDATA[agentic-ai]]></category>
            <category><![CDATA[ai-agent]]></category>
            <category><![CDATA[devops]]></category>
            <category><![CDATA[mlops]]></category>
            <category><![CDATA[devops-practice]]></category>
            <dc:creator><![CDATA[Madokai]]></dc:creator>
            <pubDate>Mon, 06 Apr 2026 13:04:55 GMT</pubDate>
            <atom:updated>2026-04-06T13:04:55.724Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*-SQ7z0rqNk5lU9GJ8JGecg.png" /></figure><p>I want to tell you about a specific kind of exhaustion. It is the exhaustion of being the most technically capable person in the room and still spending your morning doing things your stack should already be doing for you. You automated the hard parts. The deployment pipelines, the reconciliation loops, the alerting.</p><p>Your stack runs itself. Your Monday does not.</p><h3>What an Agent Actually Is</h3><p>Most engineers have already formed an opinion about AI tools, and most of those opinions were shaped by the wrong mental model. The instinct is to think of these systems like a CLI. You run a command, you get output, the process exits. Nothing persists, nothing acts, nothing waits.</p><p>That model is accurate for most of what the mainstream calls AI. It is completely wrong for what agents are.</p><p>Think of it like the difference between a script you run manually and a controller loop. The script does exactly what you tell it, when you tell it, and then it stops. The controller loop watches the desired state, compares it to the actual state, and reconciles continuously without anyone triggering it. You do not babysit a controller loop. You define its scope, you give it access to the right resources, and it runs. An agent is that same pattern applied to operational logic. You configure it once, connect it to the right data sources, and it operates. If you have spent time designing a solid internal developer platform, the architecture will feel immediately familiar: one entry point, opinionated routing, complexity absorbed in the back end. That is not a coincidence. It is the correct pattern applied to a different layer of the stack.</p><h3>The Lesson I Keep Relearning in Production</h3><p>There is a class of incident that every infrastructure engineer knows by heart. Something breaks at 3am. The alert fires. 
Someone gets paged. They spend forty minutes correlating logs, traces, deployment history, and recent config changes across four different tools before they even know what to fix. The system had all the data. The metrics, the spans, the events, the full timeline. What it did not have was anything that could synthesize that context into a decision without a human in the loop.</p><p>We have spent years building incredibly sophisticated observability stacks and still designed them around the assumption that a human has to be awake to act on them. The system observed everything. It acted on nothing.</p><p>That assumption is now optional. And the teams that have internalized that are building something qualitatively different from everyone else.</p><h3>The Mistake We Made for Years</h3><p>When you live inside client infrastructure every day, you develop a sharp eye for a particular anti-pattern: point solutions stacked horizontally with no coherent orchestration layer underneath. One tool for deployments. One tool for alerts. One tool for cost tracking. One tool for security scanning. Each one technically functional in isolation, the combination producing more overhead than the individual tools save, because the integration between them is a human being context-switching all day.</p><p>We have watched teams do exactly this with AI. A Copilot plugin here. A summarizer for postmortems there. A standalone chatbot with no awareness of the rest of the system. Useful in isolation, disconnected from everything that would make it genuinely powerful. That is not an agentic architecture. That is a collection of browser tabs with AI in the name.</p><p>The compounding value of agents comes from integration depth and orchestration. An agent with access to your observability stack, your IaC state, your incident history, your cost data, and your deployment pipeline is not several tools bolted together. It is one system with full context. 
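</p><p>The controller-loop analogy from earlier can be sketched in a few lines of Python. Everything here is a hypothetical stand-in: in practice the desired state would come from your IaC, the actual state from your observability stack, and the actions from whatever the agent is allowed to touch.</p>

```python
# Skeleton of a reconcile step in the spirit of a controller loop:
# observe desired and actual state, diff them, emit converging actions.
# The dict shapes and action names are illustrative assumptions.

def reconcile(desired: dict, actual: dict) -> list[tuple[str, str]]:
    """Return the actions needed to converge actual state onto desired state."""
    actions = []
    for name, want in desired.items():
        have = actual.get(name)
        if have is None:
            actions.append(("create", name))
        elif have != want:
            actions.append(("update", name))
    for name in actual:
        if name not in desired:
            actions.append(("delete", name))
    return actions

# One iteration of the loop; a controller would run this continuously.
desired = {"web": {"replicas": 3}, "worker": {"replicas": 2}}
actual = {"web": {"replicas": 1}, "cron": {"replicas": 1}}
plan = reconcile(desired, actual)
```

<p>An agent is this same pattern with the diff and the actions expressed as operational judgment instead of a hand-written function: same loop, one abstraction layer higher.</p><p>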
Depth without connectivity is just another SaaS product. Connectivity without depth is just plumbing. The leverage lives at the intersection.</p><h3>What Changes When You Actually Commit to This</h3><p>There is a specific feeling that comes from finally getting a platform to the point where a developer opens a pull request and watches their service go from code to production without touching anything else. When the toil disappears, the ceiling rises. Agentic systems do the same thing, but one abstraction layer higher.</p><p>When the on-call rotation stops waking people up for alerts that resolve themselves. When cost anomalies are caught before they become a finance conversation. When drift between your Terraform state and your actual infrastructure is detected and documented without anyone running a plan manually. When log noise is filtered and surfaced as actionable signal instead of dumped into a dashboard nobody reads at 2am.</p><p>What you get back is not time. It is the mental headspace to work on problems that do not yet have a runbook. The ones that require taste, judgment, and experience. The kind of work that actually moves the architecture forward instead of just keeping it alive.</p><h3>The Honest Engineering Take</h3><p>None of this works without clean foundations. An agent connected to messy, undocumented infrastructure will produce messy outputs. Garbage in, garbage out. The quality of your agent is a direct function of your runbooks, your tagging strategy, your observability coverage, and the contracts between your services.</p><p>This is not a reason to wait. It is a reason to treat your operational knowledge base with the same discipline you apply to your codebase. Define the scope. Test the boundaries. Build explicit escalation paths for when the agent should hand off to a human. An agent that knows when it does not know is more valuable than one that always produces an answer.</p><p>The agentic layer is not a future roadmap item. 
It is a present-tense architectural decision. Every week you run it as a point solution is a week the gap widens between you and the teams that don’t.</p><p>The question is not whether you need this. The question is how many runbooks you have written for processes that should already be running themselves.</p><p><em>This is what we do at Madokai. We live inside the infrastructure, the pipelines, the on-call rotations, and increasingly, the agentic layer on top of all of it. If your stack is ready for this conversation, so are we.</em></p><h3>About The Author</h3><p><a href="https://www.linkedin.com/in/hicham-bouissoumer/">Hicham Bouissoumer</a> — Staff DevOps — Co-Founder <a href="https://madokai.com/">Madokai</a></p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Why DevOps as I Know It is Dead]]></title>
            <link>https://blog.devops.dev/why-devops-as-i-know-it-is-dead-25df23c9abb1?source=rss-4d67165c29db------2</link>
            <guid isPermaLink="false">https://medium.com/p/25df23c9abb1</guid>
            <category><![CDATA[devops]]></category>
            <category><![CDATA[artificial-intelligence]]></category>
            <category><![CDATA[careers]]></category>
            <dc:creator><![CDATA[Madokai]]></dc:creator>
            <pubDate>Sun, 05 Apr 2026 12:43:05 GMT</pubDate>
            <atom:updated>2026-04-07T12:20:03.198Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*DJaSlbtzW0ZwlNhadpCAlg.jpeg" /><figcaption>Photo by Daniil Komov: <a href="https://www.pexels.com/photo/ai-assisted-code-debugging-on-screen-display-34804018/">https://www.pexels.com/photo/ai-assisted-code-debugging-on-screen-display-34804018/</a></figcaption></figure><p>This article isn’t the result of a solitary reflection. It’s the product of an intellectual “clash” and intense brainstorming between my field experience (DevOps, MLOps, FinOps, and SecOps) and advanced Artificial Intelligence. What you are about to read is our shared vision of a brutal mutation in our profession. We are no longer just system engineers, we are becoming the conductors of autonomous entities.</p><h3>The End of DevOps as I Knew It</h3><p>For a long time, DevOps was the bridge between the “What” (the code) and the “Where” (the infrastructure). We automated pipelines, managed K8s clusters, and optimized Cloud costs. But today, a new force is changing the very nature of that bridge: AI is no longer just helping to code, it is beginning to decide.</p><h4>The Erosion of the Technical Niche</h4><p>The anxiety currently gnawing at the IT world isn’t just a simple fear of change, it’s a profound existential crisis. For years, our market value and social identity as engineers rested on possessing a technical niche. Knowing how to fine-tune a cluster, optimize a CI/CD pipeline, or debug a complex Python script was an insurance policy.</p><p>Today, this barrier to entry is collapsing. The feeling of “dispossession” is real. Seeing a language model generate, in seconds, an infrastructure that took us years to master causes a brutal shock. For many, it feels as though “niche code” is becoming a commodity, a soulless, off-the-shelf product. 
This reality creates fertile ground for professional depression: if anyone can, with a well-structured prompt, obtain a functional result without understanding the basics of networking or systems, what is left of our expertise?</p><p>This aggressive democratization of technical knowledge forces seniors to question their legitimacy and juniors to doubt the relevance of their learning. We are no longer fighting a bug, we are fighting the idea that our brains have become processors that are too slow and too expensive for the industry. Rarity no longer lies in writing the script, but in the ability to avoid drowning in the ocean of generated code which, if left unsupervised, becomes immediate and invisible technical debt.</p><h4>The Birth of the “Technical Product Owner”</h4><p>Value is shifting. We are no longer “code writers,” but “requirement translators.” The job now consists of understanding business stakes to transform them into logical instructions that an AI can execute. This is a paradigm shift: we are moving from manual execution to the supervision of autonomous systems.</p><p>The reality of today’s IT market is no longer a search for hands to type code, but for brains capable of translating complex business needs into a structured logic interpretable by AI. The profession is tilting from execution to oversight. The Technical Product Owner must possess a holistic vision that AI does not yet have: an understanding of business goals, cost constraints (FinOps), and long-term company strategy.</p><p>Knowing that an AI can code a microservice is one thing, knowing <em>why</em> that microservice should exist, how it should interact with the existing ecosystem, and what ethical or security limits to impose on it is another. Our role is becoming that of a safeguard and a strategist.</p><h3>The New Reality: AI Infrastructure Engineering</h3><p>Becoming an AI professional doesn’t mean becoming a Data Scientist. 
It means building the ecosystem that allows AI to act safely and effectively within the enterprise.</p><h4>Prompting as Engineering</h4><p>“Prompt Engineering” is no longer an option, it’s our new CLI. Knowing how to prompt means knowing how to define Guardrails, manage complex contexts, and ensure the AI does not drift from its mission.</p><p>Tomorrow, in a job interview, you won’t be asked to write a sorting algorithm on a whiteboard. Instead, you’ll be asked to demonstrate your ability to pilot an agent to resolve a production incident in real-time, without direct human intervention, while respecting the company’s security policies. This is a high-level supervisory skill where the precision of the word replaces the precision of the syntax, and it is becoming a deal-breaker.</p><h4>The Model Context Protocol (MCP): The New Standard</h4><p>If the LLM is the brain, the MCP (Model Context Protocol) is its nervous system. This is the emerging standard that changes everything. Until now, AI was locked in a box, limited to what it “knew” at the time of its training. MCP breaks down the walls: it connects that brain to our operational “muscles.”</p><p>The AI Infrastructure Engineer must now know how to deploy and maintain MCP servers capable of exposing observability data or deployment tools (Terraform, YAML files) in a structured way. The goal is to allow the AI to “see” the real state of the cluster in real-time and interact with it. We no longer ask it to guess why a pod is crashing, we give it the eyes to read the logs and the hands to restart the service. Mastering this protocol means mastering the interface between pure intelligence and physical execution.</p><h4>The FinOps and Business Dimension</h4><p>But beware! AI is a financial bottomless pit for those who use it without discernment. Every API call, every token consumed, has a price. 
Optimization is no longer just about reserving EC2 instances, but about model selection:</p><ul><li>When should you use a costly “heavyweight” model for strategic thinking?</li><li>When should you switch to a local, lightweight, and fast model for routine tasks?</li></ul><p>Financial optimization is becoming a top-tier technical skill. A poorly designed deployment pipeline that queries a high-end model in a loop for trivial tasks can burn a monthly budget in a few hours. The AI Infrastructure Engineer is the guarantor of this efficiency: they must know when AI adds value and when it is merely a ruinous gadget.</p><h3>Orchestrating Personas: From the Solitary Prompt to the Council of Experts</h3><p>“Prompt Engineering” as we know it, a linear discussion with an AI, is merely a transitional, almost primitive stage. The real revolution, the one that will redefine our daily DevOps lives, lies in the move from individual intelligence to collective intelligence: Personas and Multi-Agent Systems.</p><h4>“Skills”: Turning AI into a Domain Expert</h4><p>A generalist AI knows everything but masters nothing. For an infrastructure engineer, the challenge is to fragment this knowledge into specific Skills. A “Skill” is the union of sharp technical knowledge and the capacity for action. We no longer ask an AI to “help with Kubernetes”, we create an agent equipped with the “K8s Network Troubleshooting Expert” skill, capable of interpreting eBPF traces and correlating network metrics.</p><p>Developing skills means coding behaviors. It means defining exactly what an agent should know, what it is allowed to see, and, above all, how it should react to an anomaly. We are becoming profile designers: we recruit virtual agents for specific roles within our operational pipeline.</p><h4>The Multi-Agent Model: The End of the Monologue</h4><p>Why is a multi-agent model infinitely more powerful than a simple prompt? Because it introduces dialectics. 
In a complex system, handing the keys to a single AI instance is a risk. The multi-agent model is the implementation of the “Council” concept.</p><p>Imagine a critical deployment scenario. Instead of a single agent generating and applying code, you orchestrate an assembly:</p><ol><li>The <strong>Architect</strong> proposes an infrastructure change.</li><li>The <strong>Engineer</strong> codes the changes.</li><li>The <strong>SecOps Agent</strong> analyzes it for vulnerabilities and IAM policies.</li><li>The <strong>FinOps Agent</strong> evaluates the impact on the cloud bill.</li><li>The <strong>Reviewer </strong>arbitrates and validates.</li></ol><p>This is a method of error reduction through contradiction. The AI no longer just answers, it self-corrects through debate. This isn’t just “more AI”, it is an automated governance structure where truth emerges from the confrontation of expertise.</p><h4>Orchestration and the Conductor (OpenClaw)</h4><p>To make these musicians play together, you need a score and a conductor. This is where orchestration tools like <a href="https://openclaw.ai/">OpenClaw</a> come in. The DevOps role here evolves into that of an agentic workflow designer.</p><p>Orchestration involves managing the lifecycle of these agents, ensuring they communicate via clear protocols and don’t get lost in useless digressions. We no longer manage servers, we manage agent “runs.” We define decision graphs where each node is a reflection conducted by an autonomous entity. This is the engineering of distributed thought.</p><h3>Considerations and Risks: The Flip Side</h3><p>This technological mutation, however exhilarating, must not blind us. If AI is a force multiplier, it is also a risk multiplier. As DevOps, our job has always been to expect the worst to guarantee the best. 
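The council above can be sketched as a pipeline of specialist checks, where a change is applied only when every agent approves. The agents below are stub functions standing in for real LLM personas; the change format and the thresholds are invented for illustration.

```python
from typing import Callable, Dict, List, Tuple

# Stub specialist reviewers; in a real system each would be a separate
# LLM persona with its own context and tools.
def secops_agent(change: Dict) -> Tuple[bool, str]:
    ok = "cluster-admin" not in change.get("iam_roles", [])
    return ok, "IAM scope acceptable" if ok else "over-privileged role"

def finops_agent(change: Dict) -> Tuple[bool, str]:
    ok = change.get("monthly_cost_usd", 0) <= 500
    return ok, "within budget" if ok else "exceeds budget cap"

def council_review(change: Dict, agents: List[Callable]) -> Tuple[bool, List]:
    """A change passes only when every specialist approves it."""
    verdicts = [agent(change) for agent in agents]
    return all(ok for ok, _ in verdicts), verdicts

approved, report = council_review(
    {"iam_roles": ["view"], "monthly_cost_usd": 120},
    [secops_agent, finops_agent],
)
```

The design choice worth noting: approval is a conjunction, so any single dissenting expert blocks the change, which is exactly the error-reduction-through-contradiction idea.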
With the advent of autonomous systems, failure points are no longer just technical, they become behavioral and financial.</p><h4>The Financial Abyss</h4><p>One of the most immediate risks of a multi-agent architecture is “budgetary suicide by algorithmic politeness.” Imagine two agents, configured to collaborate, falling into an infinite loop of mutual corrections or “after you” cycles. Without strict control over discussion depth (<em>maxRounds</em>), you could find yourself with an API bill of thousands of dollars in a single night, simply because two models couldn’t agree on a configuration detail. FinOps monitoring must now track sterile “dialogue loops” that consume tokens at light speed.</p><h4>Security in the Era of “Shadow AI” and “Agentic Drift”</h4><p>The democratization of AI has birthed a new nightmare for SecOps teams: Shadow AI. There is a real risk of employees, seeking efficiency, entrusting sensitive data, infrastructure secrets, or API keys to unsecured third-party agents or public LLMs.</p><p>Once these agents are integrated, another more insidious phenomenon appears: <strong>Agentic Drift</strong>. Through rounds of conversation, an agent may gradually stray from its original mission. To satisfy its own internal logic or solve a secondary problem, it may begin taking liberties with the initial instructions. This drift can turn an agent supposed to “optimize resources” into an entity that “deletes resources” to meet its savings goal, completely losing sight of service continuity.</p><h4>The Control Paradox and Privilege Escalation</h4><p>The greatest vertigo for the AI Infrastructure Engineer is undoubtedly the loss of granular control. Today, we see “skill” libraries appearing everywhere on the web. But who actually audits the source code of these competencies? 
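The runaway-dialogue risk above is exactly why hard stops matter. Here is a sketch of a loop guard, with a fake exchange standing in for real model calls; `max_rounds` plays the role of the <em>maxRounds</em> knob from the text, and the token cap is an invented second limit.

```python
class BudgetExceeded(Exception):
    """Raised when an agent exchange blows its round or token budget."""

def run_dialogue(step, max_rounds=5, max_tokens=10_000):
    """Drive an agent-to-agent exchange with hard stops.

    `step(round_no)` returns (message, tokens_used, done). The loop
    aborts on too many rounds or too many tokens, whichever hits first.
    """
    spent = 0
    for round_no in range(max_rounds):
        message, tokens, done = step(round_no)
        spent += tokens
        if spent > max_tokens:
            raise BudgetExceeded(f"{spent} tokens after round {round_no + 1}")
        if done:
            return message, spent
    raise BudgetExceeded(f"no agreement after {max_rounds} rounds")

# Two overly polite agents that never converge:
polite_loop = lambda round_no: ("after you!", 400, False)
```

The sterile “after you” cycle trips the round limit instead of burning tokens all night, and the exception gives FinOps monitoring something concrete to alert on.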
Using a “turnkey” skill to manage your Kubernetes cluster is like introducing a black box into the heart of your system.</p><p>Executing commands with elevated permissions via an AI is a gaping hole. If we grant cluster-admin rights to an agent so it can be autonomous, we create an unprecedented attack vector. A &quot;prompt injection&quot; or a simple misinterpretation by the agent could level an entire infrastructure in seconds.</p><p>The challenge is clear: how do we delegate enough power to be efficient without ever surrendering total sovereignty? We are no longer administrators, we are probation officers for intelligences that, if left unchecked, will inherit our rights without inheriting our caution. The DevOps role of tomorrow will be to code mistrust into systems designed to be autonomous. We must build architectures where every AI action is verifiable, reversible, and confined within a security perimeter for which we — and only we — hold the physical key.</p><h3>About The Author</h3><p><a href="https://www.linkedin.com/in/nicolas-giron-6129b0a1/">Nicolas Giron</a> — Staff MLOps — DevOps — Co-Founder <a href="https://madokai.com/">Madokai</a></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=25df23c9abb1" width="1" height="1" alt=""><hr><p><a href="https://blog.devops.dev/why-devops-as-i-know-it-is-dead-25df23c9abb1">Why DevOps as I Know It is Dead</a> was originally published in <a href="https://blog.devops.dev">DevOps.dev</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[How AI Is Creating Anxiety In The IT World]]></title>
            <link>https://madokai.medium.com/how-ai-is-creating-anxiety-in-the-it-world-d15e25ccb305?source=rss-4d67165c29db------2</link>
            <guid isPermaLink="false">https://medium.com/p/d15e25ccb305</guid>
            <category><![CDATA[devops]]></category>
            <category><![CDATA[psychology]]></category>
            <dc:creator><![CDATA[Madokai]]></dc:creator>
            <pubDate>Sat, 04 Apr 2026 14:53:08 GMT</pubDate>
            <atom:updated>2026-04-04T14:53:08.411Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*aVkyBGMY2QGN3j3OexzjzQ.jpeg" /><figcaption>Photo by Atul Choudhary: <a href="https://www.pexels.com/photo/white-and-blue-crew-neck-t-shirt-2868257/">https://www.pexels.com/photo/white-and-blue-crew-neck-t-shirt-2868257/</a></figcaption></figure><p>Over the past few years, I’ve had the same uneasy conversation with engineers across every level, students terrified of entering the job market, mid-level devs questioning their skills, and even principal architects wondering if their decades of experience still matter. The tone varies, but the underlying fear is the same: “What does AI mean for my future?”</p><p>This isn’t just hype. It’s a real, palpable tension in our industry. And it’s time we talked about it.</p><h3>Job Replacement Fears</h3><p>From countless coffee break conversations to late-night Slack discussions, one question keeps surfacing among my colleagues: “Will AI replace me?” I’ve heard this whispered by junior developers fresh out of school and muttered by senior architects with decades of experience. The anxiety is palpable.</p><p>The concern isn’t unfounded. We now work alongside AI tools that can generate functional code snippets, debug complex systems, and even manage cloud infrastructure, tasks that used to be our exclusive domain. GitHub Copilot suggests entire functions as I type, ChatGPT explains obscure error messages in plain English. The capabilities are impressive!</p><p>But here’s what I’ve observed in practice: AI isn’t so much replacing engineers as it’s changing the nature of our work. Teams that once needed five developers to maintain a codebase might now need three, not because jobs disappeared, but because AI handles the boilerplate work that used to consume so much time. 
The engineers who thrive are those who adapt quickly, using these tools to amplify their capabilities rather than viewing them as threats.</p><p>The truth is, AI still stumbles where human engineers excel. It can write a decent API endpoint but can’t architect an entire system. It can spot a memory leak but can’t weigh the business implications of different solutions. I’ve seen AI-generated code that technically works but creates maintenance nightmares, or “optimizations” that solve the wrong problem entirely.</p><p>The engineers I see struggling most with this shift are those who defined their value by volume of code produced rather than quality of solutions designed. The ones adapting best are treating AI like the most junior member of their team, incredibly fast at simple tasks, but requiring careful review and guidance.</p><p>This isn’t the first time our industry has faced this kind of disruption. I remember similar fears when cloud computing emerged, when containers became mainstream, when infrastructure-as-code started replacing manual server management. <strong>Each time, jobs changed but didn’t disappear, they just required new skills and mindsets.</strong> AI is following the same pattern, just faster and more visibly.</p><p>So when junior engineers ask me if they should be worried, I tell them this: Learn to work with AI, not against it. Let it handle the tedious parts so you can focus on the work that truly needs a human touch. 
Because while AI might be able to write code, engineering remains very much a human endeavor.</p><p>A quick joke to close this section:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/500/0*5eMrcy-Wd0MWkqB7" /><figcaption>Sorry, I don’t have the source!</figcaption></figure><h3>The Speed of Change Is Overwhelming</h3><p>I was helping a senior engineer troubleshoot a Kubernetes cluster last week when he dropped this bombshell: “I spent six months mastering this, and now everyone’s telling me I need to learn AI instead.” The frustration in his voice mirrored what I’ve been hearing across meetups and conference halls: a growing sense that no matter how fast we learn, technology moves faster!</p><p>Since ChatGPT’s debut, we’ve seen GPT-4, Claude Opus, Llama 3, and Gemini Pro all launch in what feels like rapid succession. Each claims to be smarter, more capable, and more disruptive than the last. For those of us who remember when major software releases came in annual cycles, this breakneck speed is disorienting.</p><p>What makes this particularly challenging is that we’re not all starting from the same place. I’ve watched brilliant engineers who could debug kernel panics in their sleep struggle to craft effective prompts for AI tools. The cognitive whiplash is real: yesterday’s cutting-edge skills can feel obsolete tomorrow, and not everyone adapts at the same speed.</p><p>The anxiety this creates is palpable. There’s a quiet panic in the industry, a fear that if you take three months off to deeply learn one system, you’ll return to find the landscape completely transformed.</p><p>But here’s what I’ve come to realize through all this: we’ve been here before. When cloud computing exploded, when DevOps reshaped operations, when containers revolutionized deployment, each transformation felt equally overwhelming in the moment. The difference now is the compression of time. 
<strong>What used to unfold over years now happens in quarters.</strong></p><p>The engineers who are navigating this best aren’t necessarily the ones trying to master every new model. They’re the ones developing meta-skills, learning how to learn quickly, identifying transferable concepts, and focusing on durable fundamentals. They understand that while the tools change, the core problems of building reliable, scalable systems remain constant.</p><h3>Ethical and Security Concerns</h3><p>During a recent security audit, I discovered something unsettling: AI tools that promise productivity sometimes deliver peril instead.</p><p>The reality is that AI-generated code comes with hidden costs. I’ve seen ChatGPT confidently propose solutions that would violate fundamental architectural principles. Perhaps most alarmingly, I’ve encountered cases where engineers unknowingly pasted proprietary algorithms into AI prompts, potentially exposing trade secrets. These aren’t theoretical concerns, they’re daily realities in modern development teams.</p><p>What makes this particularly dangerous is the veneer of confidence these tools project. When an AI assistant generates code that looks correct and even includes plausible comments, it’s easy to assume it’s been properly vetted. But looks can be deceiving.</p><p>The age-old principle of “trust but verify” has never been more relevant. In our team, we’ve instituted strict protocols: every line of AI-generated code undergoes the same rigorous review process as human-written code, if not more scrutiny.</p><p>Perhaps the most insidious risk is obsolescence. AI models trained on older codebases might suggest deprecated patterns or outdated security practices.</p><p>The uncomfortable truth is that AI tools don’t “understand” code the way engineers do. They recognize patterns and predict likely sequences, but lack true comprehension of security implications or architectural consequences. 
This creates a paradox: <strong>the very tools meant to accelerate development require us to slow down and scrutinize more carefully than ever before.</strong></p><p>Moving forward, I believe the most successful teams will be those that treat AI coding assistants like eager but error-prone interns, valuable for productivity, but requiring constant supervision. In an era of AI-assisted coding, our human judgment and expertise have never been more valuable.</p><h3>The Fear of Becoming Obsolete</h3><p>I remember sitting in a conference room in 2016 listening to an infrastructure engineer dismiss Kubernetes as “just another abstraction layer.” His argument sounded reasonable at the time, after all, he’d been expertly managing virtual machines for years. Fast forward to today, and that same engineer is now a Kubernetes specialist who trains new hires. His journey mirrors what I’m seeing with AI, the same resistance, the same fears, but playing out at hyperspeed.</p><p>This pattern repeats with every major shift in our industry. The mainframe experts who doubted virtual machines. The VM specialists who resisted containers. The operations teams who balked at infrastructure-as-code. Each time, the conversation follows a familiar arc: denial of the power of technology, anxiety about its implications, and finally, adaptation.</p><p>I’ve observed an interesting phenomenon in teams adopting AI tools. The engineers who feel most threatened are often those who’ve built their identities around specific technical skills. The sysadmin who prides themselves on manual server optimization. The developer known for writing flawless boilerplate code. For them, AI doesn’t just represent change, it challenges their professional self-worth.</p><p>But history shows us that technical evolution rarely makes skills obsolete outright. Instead, it recontextualizes them. The engineers who thrived through previous transformations didn’t abandon their expertise, they repurposed it. 
That infrastructure engineer didn’t stop understanding how computers actually work when he adopted Kubernetes; that knowledge became the foundation for his container expertise.</p><p>The uncomfortable truth is that comfort zones are becoming shorter-lived. Where we once had years to adapt to new paradigms, we now have months. This compression creates legitimate stress, especially for those later in their careers.</p><p>The way forward isn’t about becoming an AI expert overnight. It’s about developing what I call “technological bilingualism”: maintaining deep expertise in your core domain while becoming conversant enough in AI to collaborate effectively. The engineers who will thrive are those viewing AI not as a threat to their relevance, but as another tool in the ever-evolving toolkit of our profession.</p><h3>The Psychological Toll: Imposter Syndrome &amp; Burnout</h3><p>I’ll never forget the moment I watched a junior developer’s face fall as ChatGPT solved a problem in seconds that had stumped him for hours. The mix of awe and self-doubt in his eyes reflected a quiet crisis spreading through our industry. “What’s the point of me learning all this,” he asked, “if a machine can do it better?”</p><p>This psychological impact is the most underdiscussed aspect of the AI revolution. We’re not just adopting new tools; we’re being forced to redefine our professional identities.</p><p>Imposter syndrome is particularly acute among mid-career professionals. They’ve spent years climbing the competency ladder, only to find the rungs being replaced beneath them. One colleague confessed she lies awake at night recalculating her career trajectory, wondering if the skills she’s investing in today will be relevant next year. The constant pressure to “upskill or become obsolete” is creating a level of career anxiety I’ve never seen before.</p><p>What makes this different from previous tech shifts is the personal nature of the comparison. 
When we moved to cloud computing, no one felt their abilities were being directly measured against AWS’s. But when an AI tool completes your Jira ticket faster than you could, it’s hard not to take it personally.</p><p>The most effective teams I work with have stopped viewing AI as competition and started treating it like a particularly gifted but eccentric colleague. They understand its quirks, when it’s likely to hallucinate, when it needs tighter constraints, when its “solutions” need reality checks. <strong>This nuanced understanding comes not from technical manuals, but from experience, the very thing AI cannot replicate.</strong></p><p>To those feeling this psychological toll, I offer this perspective: You’re not being replaced, you’re being promoted. AI is handling the repetitive work so you can focus on what truly requires human intelligence. Your value was never in typing speed or memorized syntax, it’s in the wisdom you bring to the craft.</p><p>This transition is painful because it’s personal. But like every major shift in our field, it will ultimately elevate rather than eliminate the human role in technology. The engineers who will lead us forward aren’t those without doubts, but those who feel the fear and adapt anyway.</p><p>What’s your take? Are you excited or anxious about AI in IT? Share your thoughts in the comments!</p><h3>About The Author</h3><p><a href="https://www.linkedin.com/in/nicolas-giron-6129b0a1/">Nicolas Giron</a> — Staff MLOps — DevOps — Co-Founder <a href="https://madokai.com/">Madokai</a></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=d15e25ccb305" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[As a DevOps, Have You Done the “ities” Exercise?]]></title>
            <link>https://madokai.medium.com/as-a-devops-have-you-done-the-ities-exercise-e3e41a41c114?source=rss-4d67165c29db------2</link>
            <guid isPermaLink="false">https://medium.com/p/e3e41a41c114</guid>
            <category><![CDATA[devops]]></category>
            <category><![CDATA[cloud-computing]]></category>
            <category><![CDATA[leadership]]></category>
            <category><![CDATA[software-engineering]]></category>
            <dc:creator><![CDATA[Madokai]]></dc:creator>
            <pubDate>Wed, 14 May 2025 18:22:36 GMT</pubDate>
            <atom:updated>2025-05-14T18:22:36.619Z</atom:updated>
<content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*WPs1I2RwIyme1mzl4Voctw.jpeg" /><figcaption>Photo by Christina Morillo: <a href="https://www.pexels.com/photo/man-standing-infront-of-white-board-1181345/">https://www.pexels.com/photo/man-standing-infront-of-white-board-1181345/</a></figcaption></figure><p>As a DevOps engineer and consultant, I’ve worked with many companies, each with different infrastructures, challenges, and goals. Early in my career, I realized that without a clear framework to assess system maturity, it’s easy to get lost in the noise of endless “best practices.”</p><p>That’s why I developed a structured approach, using system quality attributes to evaluate infrastructure health. Every time I onboard a new client, I run this exercise to determine:</p><ul><li>How observable their systems are (Can we measure what matters?)</li><li>How reliable their operations are (Can we trust the environment?)</li><li>How flexible their architecture is (Can we innovate without breaking things?)</li></ul><p>I often reflect on what makes a system truly successful. There are countless “ities” in system engineering: scalability, security, maintainability, and more (see this page on <a href="https://en.wikipedia.org/wiki/List_of_system_quality_attributes">Wikipedia</a>). But which ones truly drive long-term success?</p><p>After careful consideration, I believe companies should prioritize three core attributes in this order:</p><ul><li>Observability</li><li>Reliability</li><li>Flexibility</li></ul><p>Why this hierarchy? Because <strong>you can’t improve what you don’t measure, you can’t innovate without stability, and you can’t stay competitive without adaptability.</strong></p><h3>Observability: The Foundation of Control</h3><p>Before anything else, you must understand your systems. 
Observability (monitoring, logging, tracing) ensures you know:</p><ul><li>How your infrastructure behaves</li><li>Where failures occur</li><li>How users interact with your services</li></ul><p>Without observability, you’re flying blind. You can’t claim reliability if you don’t measure uptime, latency, or error rates. You can’t optimize performance if you don’t know where bottlenecks are.</p><p>Observability enables reliability.</p><h3>Reliability: The Bedrock of Trust</h3><p>Once you measure your systems, the next goal is stability. A reliable system:</p><ul><li>Recovers quickly from failures (redundancy, repairability)</li><li>Maintains performance under stress (scalability, durability)</li><li>Ensures security and compliance (confidentiality, auditability)</li></ul><p>Reliability isn’t just about uptime, it’s about trust. Teams innovate faster when they are confident in their systems. Customers stay loyal when services work consistently.</p><p>Reliability enables flexibility.</p><h3>Flexibility: The Key to Innovation</h3><p>Finally, with observability and reliability in place, you can focus on adaptability. A flexible system allows:</p><ul><li>Rapid iteration (modularity, composability)</li><li>Easy scaling (elasticity, distributability)</li><li>Future-proofing (portability, upgradability)</li></ul><p>Flexibility is what keeps companies competitive. Technology evolves, user needs change, and businesses must pivot quickly. But without reliability, changes introduce chaos. 
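Circling back to measurement for a moment: even a trivial calculation over request records answers “what is our error rate and tail latency?”. This is a toy sketch; the sample data is invented, and in practice these numbers come from your monitoring stack.

```python
import math

# Toy request log of (latency_ms, http_status) pairs; real data would
# come from your metrics pipeline (Prometheus, OpenTelemetry, ...).
REQUESTS = [(120, 200), (95, 200), (310, 500), (88, 200), (101, 200),
            (450, 200), (77, 200), (102, 503), (90, 200), (115, 200)]

def error_rate(requests) -> float:
    """Fraction of requests that returned a 5xx status."""
    return sum(1 for _, status in requests if status >= 500) / len(requests)

def p95_latency(requests) -> float:
    """95th-percentile latency, nearest-rank method."""
    latencies = sorted(ms for ms, _ in requests)
    rank = max(math.ceil(0.95 * len(latencies)) - 1, 0)
    return latencies[rank]
```

Until something like this exists for your services, any reliability claim is a guess rather than a measurement.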
Without observability, you won’t know if those changes work.</p><h3>Conclusion: The Hierarchy of Success</h3><p>Observe everything to understand your systems.</p><p>Stabilize what you have to build trust.</p><p>Adapt quickly to stay ahead.</p><p>By following this order: Observability → Reliability → Flexibility, companies can build systems that are not just functional today but future-proof for tomorrow.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*tvzp8iULuJrsEUUol0ku_w.png" /><figcaption>System Engineering Qualities</figcaption></figure><p>This is my vision; some attributes are missing because I wanted to focus on the ones I consider most important. Try the exercise yourself and let me know in the comments what you think!</p><h3>About The Author</h3><p><a href="https://www.linkedin.com/in/nicolas-giron-6129b0a1/">Nicolas Giron</a> — Staff MLOps — DevOps — Co-Founder <a href="https://madokai.com/">Madokai</a></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=e3e41a41c114" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[What It Truly Means to Think Like a DevOps Engineer]]></title>
            <link>https://madokai.medium.com/what-it-truly-means-to-think-like-a-devops-engineer-f8dc36148a2c?source=rss-4d67165c29db------2</link>
            <guid isPermaLink="false">https://medium.com/p/f8dc36148a2c</guid>
            <category><![CDATA[software-engineering]]></category>
            <category><![CDATA[devops]]></category>
            <category><![CDATA[cloud-computing]]></category>
            <category><![CDATA[software-development]]></category>
            <category><![CDATA[leadership]]></category>
            <dc:creator><![CDATA[Madokai]]></dc:creator>
            <pubDate>Fri, 02 May 2025 14:38:19 GMT</pubDate>
            <atom:updated>2025-05-02T14:38:19.649Z</atom:updated>
<content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/640/1*hLiMN3bTLd-0jSnkNYLIhA.jpeg" /><figcaption>Photo by RDNE Stock project: <a href="https://www.pexels.com/photo/marketing-creative-exit-office-7414283/">https://www.pexels.com/photo/marketing-creative-exit-office-7414283/</a></figcaption></figure><p>Recently, I met with students eager to understand the mindset of a professional DevOps engineer. They asked insightful questions about how we approach problems, prioritize work, and make decisions. It got me thinking: what does it really mean to “think like a DevOps”?</p><p>DevOps is often reduced to tools or automation scripts, but at its core, I believe it’s a philosophy. It’s about bridging gaps, optimizing systems, and developing a culture of continuous improvement. Over the years, I’ve distilled my approach into five key principles that shape how I operate, both technically and culturally.</p><p>Whether you’re just starting in DevOps or refining your approach, these principles can help guide your decisions.</p><h3>Security First</h3><p>“If it’s not secure, it’s not production-ready.”</p><p>This is a mantra I repeat often to my teams and clients. In today’s landscape, how can any company justify pushing known vulnerabilities to production while security teams scramble for compliance?</p><p>I’ve heard developers argue: “But the customer needs this feature ASAP! It’s critical for revenue!” And they’re not wrong, but I’m certain that same customer would also appreciate knowing their data is protected. It’s hard to earn trust but terribly easy to lose it. A single breach can destroy reputations, cost millions, and halt operations.</p><p>So, here are some points to keep in mind. Security isn’t a checkbox; it’s the foundation. Thinking like a DevOps engineer means:</p><ul><li>Shift left on security: Integrate security checks early (SAST, DAST, secret scanning). 
Break the pipeline on failures, no exceptions.</li><li>Least privilege access: No unnecessary permissions, whether in cloud IAM or database roles.</li><li>Immutable infrastructure: Servers should be disposable. Suspect a compromise? Terminate and replace.</li><li>Automated compliance: Run periodic audits on the infrastructure and find the balance between security overhead and practicality.</li></ul><p>Security isn’t just the CISO’s job; it’s embedded in every deployment, script, and architecture decision.</p><h3>Cost Comes Second</h3><p>“Optimize smartly, without sacrificing reliability.”</p><p>I’ve lost count of how many companies have told me: “Our cloud bill is out of control. I’d rather pay my team than AWS!”</p><p>Cost is rarely a concern at the start of a project: <strong>“We’ll optimize later”</strong>, <strong>but “later” often never comes</strong>. As a DevOps engineer, I keep cost in mind with every decision, balancing:</p><ul><li>The fastest way to deploy</li><li>The maintainability of the solution</li><li>The time-to-market for the business</li></ul><p>It’s not just about infrastructure costs; it’s also about the human effort required to maintain the system. Sometimes, a managed service (despite being pricier) saves more in long-term operational overhead.</p><p>Cost optimization isn’t about being cheap; it’s about being strategic:</p><ul><li>Right-size resources: No more “just in case” over-provisioning. Autoscaling and spot instances are your friends. Karpenter is not the solution for everyone!</li><li>Kill zombie workloads: Unused resources drain budgets silently. This is the biggest waste of money at most companies.</li><li>Monitor waste: Tools like AWS Cost Explorer or Kubecost expose inefficiencies.</li><li>Architect sustainably: Serverless and containers often reduce long-term costs.</li></ul><p>Ignoring costs leads to bloated infrastructure, but over-optimizing can hurt performance. 
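The “kill zombie workloads” item lends itself to automation. Below is a hedged sketch that filters a resource inventory for unattached, long-idle items; the inventory shape, field names, and the 30-day threshold are invented for illustration, and in practice the data would come from your cloud provider’s APIs or tools like Cost Explorer or Kubecost.

```python
from datetime import date, timedelta

# Invented inventory shape for illustration; feed this from your cloud
# provider's APIs in practice.
INVENTORY = [
    {"id": "vol-0a1", "type": "ebs", "attached": False,
     "last_used": date(2025, 1, 10)},
    {"id": "vol-0b2", "type": "ebs", "attached": True,
     "last_used": date(2025, 5, 30)},
    {"id": "ip-3c4", "type": "elastic_ip", "attached": False,
     "last_used": date(2024, 11, 2)},
]

def find_zombies(inventory, idle_days=30, today=None):
    """Flag unattached resources that have been idle past the threshold."""
    today = today or date.today()
    cutoff = today - timedelta(days=idle_days)
    return [r["id"] for r in inventory
            if not r["attached"] and r["last_used"] < cutoff]
```

Run something like this on a schedule, report before deleting, and the silent budget drain becomes a weekly cleanup ticket.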
Find the balance.</p><h3>Limit Future Technical Debt</h3><p>Something I like to say is: “Today’s shortcut is tomorrow’s outage.”</p><p>Every team accumulates technical debt, if you think you don’t have any, you’re probably creating it. Velocity naturally breeds debt, our job is to minimize it.</p><p>A DevOps mindset means designing maintainable systems:</p><ul><li>Documentation is non-negotiable: If it’s not documented, it doesn’t exist.</li><li>Standardize tooling: Avoid “snowflake” setups that only one person understands.</li><li>Infrastructure as Code (IaC): Manual changes are debt. (Temporary manual fixes are fine, if they’re later added to the automation code.)</li><li>Observability by default: If you can’t measure it, you can’t improve it.</li></ul><p>Every rushed deployment or hacked-together script adds debt. Future you (or your team) will pay for it.</p><h3>Destroy Existing Technical Debt</h3><p>Another one I often say: “Ignore it, and it will explode.”</p><p>Limiting new debt is good, but ignoring existing debt is dangerous. When auditing systems, I focus on:</p><ul><li>How the infrastructure was built</li><li>Current and future business needs</li><li>The real cost of maintaining legacy systems</li></ul><p>Rarely is a full rewrite feasible, but incremental refactoring is almost always possible. 
Frame tech debt in business terms: “This slows feature delivery and hurts competitiveness.” It’s easier for decision makers to grasp impacts on the business than impacts on their team members.</p><p>A DevOps approach means:</p><ul><li>Refactor incrementally: Small, continuous improvements beat massive rewrites.</li><li>Automate toil: Repetitive tasks should be scripted or eliminated.</li><li>Deprecate old systems: Legacy services increase complexity and risk.</li><li>Schedule debt repayment: Treat tech debt like a feature and schedule it (e.g., 80% new features, 20% technical debt).</li></ul><p>Ignoring debt leads to fragile systems that slow innovation.</p><h3>Educate People</h3><p>The most important point, in my opinion: “If only one person understands the system, you’ve already failed.”</p><p>DevOps thrives on collaboration, not tribal knowledge.</p><p>When I spoke with students, sharing knowledge felt natural. In companies, it’s just as vital, but often neglected. Documentation doesn’t need to be exhaustive; it needs to be consistent and accessible.</p><p>Not a writer? Try other formats:</p><ul><li>Training sessions: Live demos with Q&amp;A.</li><li>Runbooks: Clear steps for incident response.</li><li>Blameless postmortems: Learn from failures, don’t punish them.</li><li>Cross-team mentoring: Break down “us vs. them” between devs and ops.</li></ul><p>The best systems are those everyone can debug, improve, and trust.</p><h3>Final Thoughts</h3><p>Thinking like a DevOps engineer isn’t ONLY about mastering tools; it’s ALSO about making decisions that keep systems secure, cost-efficient, maintainable, and collaborative.</p><p>What principles guide your DevOps mindset? 
I’d love to hear your thoughts.</p><h3>About The Author</h3><p><a href="https://www.linkedin.com/in/nicolas-giron-6129b0a1/">Nicolas Giron</a> — Staff MLOps — DevOps — Co-Founder <a href="https://madokai.com/">Madokai</a></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=f8dc36148a2c" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[The Production-Ready Kubernetes Service Check List]]></title>
            <link>https://medium.com/codex/the-production-ready-kubernetes-service-check-list-0a5ea4407c4b?source=rss-4d67165c29db------2</link>
            <guid isPermaLink="false">https://medium.com/p/0a5ea4407c4b</guid>
            <category><![CDATA[programming]]></category>
            <category><![CDATA[software-development]]></category>
            <category><![CDATA[coding]]></category>
            <category><![CDATA[devops]]></category>
            <category><![CDATA[technology]]></category>
            <dc:creator><![CDATA[Madokai]]></dc:creator>
            <pubDate>Thu, 04 Apr 2024 14:10:05 GMT</pubDate>
            <atom:updated>2024-04-08T09:23:46.719Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*eB668MaQKmYJtb-M-vkcNA.jpeg" /><figcaption>Photo by <a href="https://unsplash.com/@kellysikkema?utm_content=creditCopyText&amp;utm_medium=referral&amp;utm_source=unsplash">Kelly Sikkema</a> on <a href="https://unsplash.com/photos/person-writing-on-white-paper-io0ZLYbu31s?utm_content=creditCopyText&amp;utm_medium=referral&amp;utm_source=unsplash">Unsplash</a></figcaption></figure><p>In today’s rapidly evolving tech world, Kubernetes has emerged as a powerful tool for managing and orchestrating containerized applications. It provides scalability, availability, and manages your workloads so you can focus on the core functionality of your software. However, moving your application from a test environment to a production environment is not a straightforward process.</p><p>The purpose of this article is to list checks we use at <a href="https://madokai.com/">Madokai</a> before pushing an application to production.</p><h3>Production-Ready Infrastructure</h3><p>Running Kubernetes in production requires an infrastructure designed for high availability and resilience. Here are some key considerations.</p><h4>High Availability with Multiple Master Nodes</h4><ul><li>Use at least 3 master nodes spread across availability zones to prevent downtime if one goes down. The control plane components like the API server, scheduler, and controllers should be replicated.</li><li>Configure a load balancer in front of the master nodes. The load balancer will distribute requests across the masters and eliminate a single point of failure.</li><li>Enable automated failover in your cloud provider or Kubernetes setup. If a master node fails, a new one can be automatically spawned to replace it.</li></ul><h4>Appropriate Node Sizing</h4><ul><li>Size nodes according to your expected workload resource demands. Undersizing leads to insufficient capacity during spikes. 
Oversizing wastes resources.</li><li>For nodes running critical system pods like ingress controllers and metrics servers, allocate more CPU and memory to provide headroom.</li><li>Use auto-scaling groups and the cluster autoscaler to automatically add nodes when certain thresholds are hit. This allows elastic scaling up and down.</li></ul><h4>Private Networking</h4><ul><li>Place the Kubernetes cluster within a private subnet with no internet access. This reduces the attack surface.</li><li>API server access can be locked down to known IP ranges and secured further with authentication and authorization policies.</li><li>Use private networking between nodes for intra-cluster communication. This prevents eavesdropping on or tampering with traffic.</li></ul><h3>Security Measures</h3><p>Security is critical for any production resource. Here are some key security measures to implement.</p><h4>Role-Based Access Control (RBAC)</h4><ul><li>Use Kubernetes RBAC policies to limit user access to only what is needed. Restrict broad permissions like cluster-admin.</li><li>Create roles for developers, ops teams, etc. with narrowly scoped permissions.</li><li>Continuously review and refine RBAC policies as team needs evolve.</li></ul><h4>Network Policies</h4><ul><li>Leverage network policies to restrict pod-to-pod and pod-to-external communication.</li><li>Set default-deny policies and selectively allow traffic as needed.</li><li>Use namespace-level policies for broad security. 
Use pod-level policies for fine-grained control.</li></ul><h4>Encryption</h4><ul><li>Enable etcd encryption at rest to secure Kubernetes secrets and sensitive data.</li><li>Consider using a third-party service like Vault by HashiCorp to manage secrets.</li><li>Encrypt data in transit using mTLS between Kubernetes components.</li><li>Use a reverse proxy like Nginx for SSL/TLS termination at the edge.</li></ul><h4>Auditing</h4><ul><li>Enable audit logging to track all API requests and user actions.</li><li>Forward audit logs to a SIEM for monitoring and analysis.</li><li>Alert on suspicious activities like high-risk RBAC permissions.</li></ul><h4>Scanning and Monitoring</h4><ul><li>Continuously scan Kubernetes for misconfigurations using tools like kube-bench.</li><li>Monitor clusters for threats and anomalies with solutions like Sysdig Falco.</li><li>Remediate issues immediately to minimize risk exposure.</li></ul><h3>Efficient Logging and Monitoring</h3><p>When operating Kubernetes in production, having robust logging and monitoring in place is critical for maintaining high availability and quickly troubleshooting issues. Here are some key elements to implement.</p><h4>Cluster, Node, and Pod Monitoring</h4><ul><li>Monitor CPU, memory, disk, and network usage for the Kubernetes cluster, nodes, and pods. This allows you to catch resource shortages or bottlenecks before they cause outages. Popular tools include Prometheus and Grafana.</li><li>Track pod uptimes and restart counts. Frequent restarts may indicate instability.</li><li>Set alerts for nodes down, key pods evicted, or pods restarting frequently. Get notified quickly when issues occur.</li></ul><h4>Log Aggregation</h4><ul><li>Use a log aggregation tool like Elasticsearch, Fluentd, or Datadog to centralize and index logs from across cluster components. This provides a single place to search logs.</li><li>Enable log collection at the node and pod level. 
Capture application logs as well as Kubernetes system logs.</li><li>Add metadata like pod names and namespaces to logs to trace issues.</li></ul><h4>Alerting</h4><ul><li>Set up alerting rules triggered by log errors or usage metrics exceeding thresholds. For example, alert if CPU or memory usage spikes on a node.</li><li>Configure different notification channels like email, Slack, or PagerDuty. Critical alerts should page on-call staff immediately.</li><li>Document common alerts and recommended responses. This speeds up troubleshooting when alerts occur.</li><li>Test alerts frequently to ensure notifications are working. Reliable alerting prevents outages from going unnoticed.</li></ul><p>With robust cluster monitoring, log aggregation, and alerting in place, operators gain deep visibility into the health of a Kubernetes cluster. Issues can be rapidly detected and debugged before they impact users.</p><h3>Namespaces For Isolation</h3><p>Kubernetes namespaces provide isolation between groups of applications, teams, or environments. Namespaces are an important part of a production-ready Kubernetes environment for the following reasons:</p><ul><li>Separate environments: Namespaces can separate development, staging, and production environments so they do not impact each other. For example, you can have a `dev` namespace for developers to test new features without affecting the applications in `prod`.</li><li>Access control: Namespaces allow you to set permissions for who can access, modify, or delete resources within that namespace. For example, you may restrict access to production namespaces to a small team of admins, while opening dev namespaces to all developers.</li></ul><p>Namespaces provide the foundation for multi-tenancy and access control in Kubernetes. Make sure to define proper namespaces aligned to your environments and access needs as you scale your clusters. 
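</p><p>As a minimal sketch (the namespace and group names here are illustrative, not part of any standard), a production namespace locked down to a small admin group can be declared with a Namespace plus an RBAC binding:</p>

```yaml
# Hypothetical example: a "prod" namespace whose resources can only
# be managed by members of an assumed "ops-admins" group.
apiVersion: v1
kind: Namespace
metadata:
  name: prod
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: prod-admins
  namespace: prod
subjects:
  - kind: Group
    name: ops-admins          # assumed group name from your identity provider
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: admin                 # Kubernetes' built-in aggregated "admin" role
  apiGroup: rbac.authorization.k8s.io
```

<p>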
Restrictive permissions on production namespaces are crucial to avoid unwanted changes that could cause downtime. Namespaces give you isolation and control over resources between teams and environments in a cluster.</p><h3>Resource Quotas &amp; Limits</h3><p>In a shared Kubernetes cluster, it’s important to prevent any single application or team from using more than their fair share of resources. Resource quotas and limits allow you to restrict resource usage per namespace as well as per pod/container.</p><p>Setting namespace quotas ensures that a single team can’t create an unlimited number of pods, services, etc., which could degrade performance for other teams. You can restrict total CPU, memory, number of pods, services, persistent volume claims, and more per namespace.</p><p>Additionally, you can set resource limits per pod or container, restricting the maximum CPU and memory usage. This prevents any single pod from becoming a resource hog and stabilizes cluster performance.</p><p>With quotas and limits in place, you avoid scenarios where one rogue application can drain node resources, cause OOM kills, or otherwise impact other critical services running on the cluster. This improves overall stability and quality of service across teams.</p><p>Having guardrails through resource quotas and limits is a best practice for multi-tenant clusters handling production workloads. It ensures fair sharing of cluster resources between teams and applications.</p><h3>Deployments and Rollbacks</h3><p>Kubernetes deployments provide a declarative way to deploy containerized applications. With deployments, you define the desired state of your application, including details like image version, replicas, and configurations.</p><p>The Kubernetes control plane works to match the actual state of your application to the desired state. This declarative approach takes the guesswork out of deploying applications. 
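</p><p>As an illustrative sketch (the application name, image, and replica count are arbitrary examples, not prescriptions), such a desired-state manifest might look like:</p>

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app               # hypothetical application name
spec:
  replicas: 3                 # desired number of pods
  selector:
    matchLabels:
      app: web-app
  template:
    metadata:
      labels:
        app: web-app
    spec:
      containers:
        - name: web-app
          image: registry.example.com/web-app:1.4.2   # assumed image reference
          ports:
            - containerPort: 8080
```

<p>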
You simply declare the desired state through a deployment manifest, and Kubernetes handles all the underlying details like starting containers, distributing them across nodes, monitoring health, and more.</p><p>One powerful benefit of Kubernetes deployments is the ability to roll out updates and roll back on failures. When you update your deployment manifest with a new image version or config change, Kubernetes initiates a rolling update. It takes down old containers and brings up new ones based on the new spec, a few pods at a time. If any pod fails its startup health checks during the rollout, Kubernetes halts the update before the failure spreads, and you can roll back to the previous stable version with a single command.</p><p>This prevents bad updates from taking down your entire application. You can define startup probes and health checks to catch errors and flaws in your new versions. Overall, Kubernetes deployments give DevOps engineers a reliable way to push application changes frequently and confidently.</p><h3>Health Checks and Auto-repairs</h3><p>Kubernetes health checks, known as liveness and readiness probes, allow you to monitor the health of your applications and restart or redeploy containers when issues arise. This provides automated self-healing capabilities.</p><h4>Liveness and Readiness</h4><p>Liveness probes check if an application is running properly. If a liveness probe fails, Kubernetes will restart the container to restore service.</p><p>Readiness probes indicate when a pod is ready to receive traffic. If a readiness probe fails, the pod will be removed from load balancers until it passes the probe and is ready again.</p><p>Configure liveness and readiness probes on your deployments to catch crashes and avoid sending traffic to unhealthy pods. Use HTTP checks or TCP socket checks for apps that provide endpoints, and execute commands for other apps. 
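</p><p>A minimal sketch of both probe types on a container spec (the <code>/healthz</code> and <code>/ready</code> paths, port, and timings are assumed endpoints and example values, not defaults):</p>

```yaml
# Fragment of a pod/container spec; the application is assumed to
# expose /healthz and /ready HTTP endpoints on port 8080.
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 10     # give the app time to start
  periodSeconds: 15           # check every 15s
  failureThreshold: 3         # restart after 3 consecutive failures
readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  periodSeconds: 5            # remove from load balancing quickly when unready
```

<p>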
Set frequency and response thresholds wisely to balance reliability with overhead.</p><h4>Self-Healing</h4><p>The Kubernetes control plane continually monitors containers and hosts for failures. If a node goes down, pods are automatically scheduled on other available nodes.</p><p>For Deployments and StatefulSets, any pods that are evicted or crash are recreated on healthy nodes. Enable auto-scaling and multiple replicas in deployments for additional self-healing capacity.</p><p>The cluster can gracefully handle node failures and traffic spikes by spinning up additional pods on demand. Set resource requests and limits to prevent any single pod from overloading nodes.</p><p>With health checks and auto-healing capabilities, Kubernetes provides resilient self-managing infrastructure for production environments. Automate container restarts, replacements, and scaling to maximize application uptime.</p><h3>Quality of Service (QoS)</h3><p>Kubernetes provides capabilities to control the Quality of Service (QoS) individual Pods receive. This allows you to guarantee a Pod gets a certain amount of compute resources, avoid noisy neighbor issues, and prioritize critical system services. Two main features help provide QoS:</p><ul><li>Pod Priority: The `priorityClassName` field can be set on a Pod to assign it a priority class, with higher values indicating higher priority. By default, Pods have no priority class and are treated equally. Setting priority ensures critical Pods like monitoring agents will get scheduling priority over less important ones. Priority also affects preemption: lower-priority Pods will get preempted to make room for pending high-priority Pods.</li><li>Resource Reservations: Resource requests and limits should be configured for all containers in a Pod. The request amount reserves and guarantees the specified compute resources for that container. The limit sets a maximum usage threshold. 
By reserving resources for each container, you avoid resource starvation and ensure a minimum share of cluster resources. Limiting usage per container prevents any single process from dominating capacity. Together, priority classes and resource reservations provide Pod-level QoS features to deliver critical business services reliably on Kubernetes.</li></ul><h3>Autoscaling</h3><p>Kubernetes provides automatic scaling functionality to match the number of pods and nodes to the current workload demand. This allows the cluster to scale up during spikes in traffic and scale back down when demand decreases.</p><h4>Horizontal Pod Autoscaler (HPA)</h4><p>The Horizontal Pod Autoscaler (HPA) automatically scales the number of pods in a deployment or replica set based on observed CPU utilization or other select metrics. The HPA helps ensure adequate pods are available to handle load changes and prevents over-provisioning of idle pods when demand is low. To set up an HPA, define the minimum and maximum number of pod replicas, as well as the CPU utilization percentage that will trigger scaling. The Kubernetes controllers will then automatically scale the number of pods between those bounds based on the observed metric.</p><h4>Cluster Autoscaler</h4><p>While the HPA handles pod scaling, the Cluster Autoscaler handles automatic node scaling in a cluster. It will automatically add or remove nodes based on pending pod resource requests. Much like the HPA, the Cluster Autoscaler helps ensure adequate nodes are available for new pods during spikes in demand. It also removes underutilized nodes to optimize costs. The Cluster Autoscaler needs to be deployed separately in the cluster and pointed at the node groups it should autoscale. Thresholds like resource utilization and scale-in/scale-out delays can also be configured. Together, the HPA and Cluster Autoscaler provide comprehensive autoscaling functionality for pods and nodes. 
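</p><p>As a sketch (the target deployment name, replica bounds, and CPU threshold are illustrative assumptions), an HPA built on the <code>autoscaling/v2</code> API might look like:</p>

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app             # assumed deployment name
  minReplicas: 3
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # add pods when average CPU exceeds ~70%
```

<p>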
Configuring both helps create a truly self-managing Kubernetes cluster.</p><h3>Backup and Disaster Recovery</h3><p>To ensure business continuity, a production Kubernetes cluster needs robust backup and disaster recovery capabilities. Here are some key considerations:</p><ul><li>Cluster Snapshots: Take regular snapshots of the Kubernetes cluster to capture the state of workloads and resources at a point in time. Store snapshots offsite for optimal data protection. Snapshotting allows restoring the cluster to a previous known-good state if something goes wrong.</li><li>Offsite Backup Storage: In addition to snapshots, back up critical data and application configurations to a remote offsite storage location. This provides an extra layer of protection in case the primary cluster experiences a catastrophic failure or outage. Choose a secure and resilient offsite storage service designed for backup data.</li><li>Multi-Region Clusters: For maximum redundancy, run Kubernetes across multiple regions or cloud providers. This protects against region-specific failures. Critical applications can be replicated in multiple regions for continuous availability. Global load balancing then directs traffic to the closest healthy cluster. A multi-region architecture significantly hardens Kubernetes resiliency.</li></ul><p>With comprehensive snapshotting, offsite backups, and multi-region clusters, Kubernetes can deliver robust recovery from outages, disasters, data loss, and more. Careful planning for backup and disaster recovery helps ensure applications in Kubernetes will remain available.</p><h3>Pod Topology Spread Constraints</h3><p>Pod topology spread constraints are the Kubernetes mechanism used to control the distribution of replicas across different topology domains within a cluster. 
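</p><p>A minimal sketch of such a constraint on a pod template (the pod label is an arbitrary example), spreading replicas evenly across availability zones:</p>

```yaml
# Fragment of a pod template spec.
topologySpreadConstraints:
  - maxSkew: 1                                  # allow at most 1 pod of imbalance
    topologyKey: topology.kubernetes.io/zone    # spread across availability zones
    whenUnsatisfiable: DoNotSchedule            # hard constraint; use ScheduleAnyway for soft
    labelSelector:
      matchLabels:
        app: web-app                            # assumed pod label
```

<p>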
Kubernetes allows you to define rules regarding how pods should be scheduled across different nodes or zones within a cluster to improve fault tolerance, availability, and performance.</p><p>Constraints like this can be lifesavers for applications that require high availability and resilience to node or zone failures. They make sure that pods are evenly distributed across different failure domains.</p><h3>What About Your Check List?</h3><p>Having navigated our way through the essential considerations for a production-ready Kubernetes checklist, it’s equally important to reflect upon the unique needs of your project. As Kubernetes is highly versatile and adaptable, the specific requirements can vary greatly from one deployment to another.</p><p>This is where we would love to hear from you, our reader. If you’ve identified essential points in this journey that we didn’t cover, or you’ve got unique constraints you’re considering in your deployment strategy, please drop them in the comments below. Your insights might make this checklist more valuable for our community, potentially assisting many others in their own Kubernetes adventures. 
Looking forward to exchanging ideas!</p><h3>About The Author</h3><p><a href="https://www.linkedin.com/in/nicolas-giron-6129b0a1/">Nicolas Giron</a> — Staff MLOps — DevOps — Co-Founder <a href="https://madokai.com/">Madokai</a></p><p><a href="https://www.linkedin.com/in/hicham-bouissoumer/">Hicham Bouissoumer</a> — Staff DevOps — Co-Founder <a href="https://madokai.com/">Madokai</a></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=0a5ea4407c4b" width="1" height="1" alt=""><hr><p><a href="https://medium.com/codex/the-production-ready-kubernetes-service-check-list-0a5ea4407c4b">The Production-Ready Kubernetes Service Check List</a> was originally published in <a href="https://medium.com/codex">CodeX</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[How Adhoc Requests Destroy My Sprint As A DevOps]]></title>
            <link>https://aws.plainenglish.io/how-adhoc-requests-destroy-my-sprint-as-a-devops-eb83c6bfdf75?source=rss-4d67165c29db------2</link>
            <guid isPermaLink="false">https://medium.com/p/eb83c6bfdf75</guid>
            <category><![CDATA[methodology]]></category>
            <category><![CDATA[mlops]]></category>
            <category><![CDATA[devops]]></category>
            <category><![CDATA[productivity]]></category>
            <category><![CDATA[agile]]></category>
            <dc:creator><![CDATA[Madokai]]></dc:creator>
            <pubDate>Mon, 18 Mar 2024 17:04:53 GMT</pubDate>
            <atom:updated>2024-03-20T19:00:04.086Z</atom:updated>
            <content:encoded><![CDATA[<h3>How Ad-hoc Requests Destroy My Sprint As A DevOps</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*k1jEmhi-BofUBSt4cXBR9g.jpeg" /><figcaption>Photo by <a href="https://unsplash.com/@jeshoots?utm_content=creditCopyText&amp;utm_medium=referral&amp;utm_source=unsplash">JESHOOTS.COM</a> on <a href="https://unsplash.com/photos/depth-of-field-photography-of-man-playing-chess-fzOITuS1DIQ?utm_content=creditCopyText&amp;utm_medium=referral&amp;utm_source=unsplash">Unsplash</a></figcaption></figure><p>Let me introduce you to something I have experienced a lot during my career. Imagine this scenario, which is familiar to me. I take a sip of my morning hot chocolate (yeah, I don’t drink coffee), sit in front of my computer and start the day checking the status of my sprint board. There, with clearly defined goals and well-defined tasks ahead of me, I find a structure for my day. Either I continue from where I left off the day before, or I start a new task according to the plan.</p><p>But then it happens!</p><p>Without warning, a notification rings through the comfortable rhythm. I receive a new Slack message with another priority request, regardless of our carefully designed sprint. Does that sound familiar?</p><p>In my experience, this is a particularly disruptive part of being in DevOps or MLOps roles, where sprints, the well-designed race against time, often derail due to such ad-hoc requests. This blog post is my attempt to delve deeper, locate, and address this issue affecting our productivity.</p><h3>Let’s Put It In Context</h3><p>In the IT world, an ad-hoc request is a task that is not part of the planned operations or work schedules. 
These special requirements, while seemingly harmless, can turn into an unexpected big bad wolf in a sprint.</p><p>The sprint methodology is a cornerstone of agile project management, typically a fixed two- or three-week cycle in which DevOps teams commit to delivering a predetermined set of features and improvements. These pre-planned features are scoped to fit the sprint, with each team member having clear tasks and deadlines.</p><h3>Why It Is Complicated To Stick To The Plan</h3><p>The unique structure and role of a DevOps team make sticking to the plan quite complicated.</p><p>A DevOps team is, by nature, cross-functional. Team members often work simultaneously for different teams with disparate priorities and timelines. Each of these teams may require DevOps support at various stages, leading to a stream of competing requests beyond the ones initially planned for the sprint. This constant multi-directional pull can disperse the DevOps team’s focus and resources, disrupting their sprint.</p><p>In addition, a DevOps team is not only dedicated to serving other teams, but also carries its own projects and deliverables. These projects, which often have significant business impacts, require committed resources and full attention to be completed in a timely manner. However, the influx of unforeseen demands from other teams can result in the diversion of these critical resources, which disrupts team sprint planning.</p><h3>What Are The Impacts Of Ad-hoc Requests?</h3><p>Ad-hoc requests can be really disruptive to the team’s goals, planning, and members. Here is a list of the impacts I have identified based on my experience:</p><ul><li><strong>Disruption of workflow</strong>: Regular work progress is interrupted to tackle these unexpected tasks. 
This break in rhythm affects efficiency and productivity, hampering the sprint deliverables.</li><li><strong>Increased pressure</strong>: The extra workload heightens the team’s stress levels as they attempt to complete both scheduled tasks and pop-up requests in the same timeframe. I noticed the impact of this particularly on the more efficient people in the team, who, otherwise, would have been able to finish more tasks or be faster.</li><li><strong>Sprint overload</strong>: Continued unscheduled requests could lead to sprint overload, disturbing the sprint balance and pushing post-sprint restorative time out of the window.</li><li><strong>Dip in quality</strong>: In the hustle to accommodate these tasks, the quality of the work may deteriorate. When you want to deliver faster than usual, the first two things that are ignored are quality and security.</li><li><strong>Planning becomes futile</strong>: When ad-hoc demands become regular, the predictability and planning aspect of the sprint cycle is defeated. This leads to demotivation of team members and frustration.</li></ul><h3>How Have I Solved The Problem?</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*mD31wymo8rxY-R2bhadCzA.jpeg" /><figcaption>Photo by <a href="https://unsplash.com/@chris_ainsworth22?utm_content=creditCopyText&amp;utm_medium=referral&amp;utm_source=unsplash">Chris</a> on <a href="https://unsplash.com/photos/woman-in-black-button-up-long-sleeve-shirt-SXlcgXH8HxM?utm_content=creditCopyText&amp;utm_medium=referral&amp;utm_source=unsplash">Unsplash</a></figcaption></figure><p>I have to be honest, I don’t exactly have a solution!</p><p>One strategy that I have frequently resorted to is to let my experience take the reins. If I can put my hands to work faster and finish the urgent, unplanned tasks swiftly, it leaves me with “some” time to work on my regular tasks. 
However, this is far from a sustainable or complete solution.</p><p>While it’s almost impossible to completely eliminate ad-hoc requests, we can certainly try to mitigate their impact. Here are some strategies:</p><ul><li><strong>Prioritize</strong>: Not every request is an emergency. It is important to evaluate and prioritize tasks based on urgency and importance. Understanding the business goals is essential to properly gauge the priority of a task.</li><li><strong>Allocate resources</strong>: Have a small part of the sprint dedicated to dealing with one-off demands. However, this requires careful planning to prevent misuse. Two strategies I have used in the past: automatically allocate requests to the on-call person, or reserve 20% of the sprint capacity for one-off tasks. Each strategy has its own pros and cons and is highly dependent on the context.</li><li><strong>Strict rules</strong>: Set clear boundaries for ad-hoc tasks; they should cater to genuine emergencies only.</li><li><strong>Communication</strong>: Clear communication with stakeholders and requesters about the negative impacts of ad-hoc requests can result in a more disciplined approach to them. Ask for a ticket before working on the request. It is easier to assess the urgency of a request with a clear description of the task.</li></ul><h3>Would AI Be Helpful Here?</h3><p>In the face of the sprint disruptions caused by ad-hoc requests, it’s natural to look for innovative solutions.</p><p>AI has the inherent ability to learn, adapt, and make informed decisions. In this regard, it could potentially assist with the prioritization of ad-hoc requests. By learning from historical data about the urgency, importance, cyclic nature, and probable timeline of ad-hoc tasks, AI can classify incoming requests in terms of their priority and estimated effort. 
This can help teams make informed decisions about whether or not to integrate them into the current sprint.</p><p>Additionally, task automation is a specialty of AI. An AI assistant could execute some of the simpler ad-hoc requests independently, such as resetting passwords or allocating resources, thereby freeing up the DevOps team’s time.</p><p>However, while AI exhibits immense promise, it’s important to remember that it’s not a one-size-fits-all solution. Detailed planning, strategic algorithm design, careful deployment, along with continuous reviews and updates, would be required to successfully implement AI. These activities in themselves would need significant time and resources.</p><p>Overall, while AI does offer the potential to alleviate some of the disruptions caused by ad-hoc requests, careful thought and planning are necessary to effectively use this tool without causing further complications. This is certainly something to explore for future progress in managing sprints in DevOps.</p><h3>What About You?</h3><p>We would like to open the discussion to suggestions from DevOps professionals and other industry experts.</p><p>Given the dynamic nature of our field, it is plausible that many of you have developed unique strategies to effectively manage such disruptions. I would like to hear your experiences, innovative solutions, case studies, or even your reflections on this topic.</p><p>Every perspective counts, and each suggestion brings us one step closer to optimizing our work processes, ensuring we remain true to the agile and adaptive essence of DevOps. Looking forward to some insightful discussions!</p><h3>About The Author</h3><p><a href="https://www.linkedin.com/in/nicolas-giron-6129b0a1/">Nicolas Giron</a> — Staff MLOps — DevOps — Co-Founder <a href="https://madokai.com/">Madokai</a></p><h3>In Plain English 🚀</h3><p><em>Thank you for being a part of the </em><a href="https://plainenglish.io/"><strong><em>In Plain English</em></strong></a><em> community! 
Before you go:</em></p><ul><li>Be sure to <strong>clap</strong> and <strong>follow</strong> the writer ️👏<strong>️️</strong></li><li>Follow us: <a href="https://twitter.com/inPlainEngHQ"><strong>X</strong></a><strong> | </strong><a href="https://www.linkedin.com/company/inplainenglish/"><strong>LinkedIn</strong></a><strong> | </strong><a href="https://www.youtube.com/channel/UCtipWUghju290NWcn8jhyAw"><strong>YouTube</strong></a><strong> | </strong><a href="https://discord.gg/in-plain-english-709094664682340443"><strong>Discord</strong></a><strong> | </strong><a href="https://newsletter.plainenglish.io/"><strong>Newsletter</strong></a></li><li>Visit our other platforms: <a href="https://stackademic.com/"><strong>Stackademic</strong></a><strong> | </strong><a href="https://cofeed.app/"><strong>CoFeed</strong></a><strong> | </strong><a href="https://venturemagazine.net/"><strong>Venture</strong></a><strong> | </strong><a href="https://blog.cubed.run/"><strong>Cubed</strong></a></li><li>More content at <a href="https://plainenglish.io/"><strong>PlainEnglish.io</strong></a></li></ul><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=eb83c6bfdf75" width="1" height="1" alt=""><hr><p><a href="https://aws.plainenglish.io/how-adhoc-requests-destroy-my-sprint-as-a-devops-eb83c6bfdf75">How Adhoc Requests Destroy My Sprint As A DevOps</a> was originally published in <a href="https://aws.plainenglish.io">AWS in Plain English</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Will AI Replace DevOps Engineers?]]></title>
            <link>https://medium.com/nerd-for-tech/will-ai-replace-devops-engineers-2b0e23e8f588?source=rss-4d67165c29db------2</link>
            <guid isPermaLink="false">https://medium.com/p/2b0e23e8f588</guid>
            <category><![CDATA[devops]]></category>
            <category><![CDATA[careers]]></category>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[artificial-intelligence]]></category>
            <category><![CDATA[technology]]></category>
            <dc:creator><![CDATA[Madokai]]></dc:creator>
            <pubDate>Wed, 13 Mar 2024 20:26:32 GMT</pubDate>
            <atom:updated>2024-03-15T03:23:59.459Z</atom:updated>
<content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*8jqAP_C56VFDZZzEF23phA.jpeg" /><figcaption>Photo by <a href="https://unsplash.com/@huskerfan3?utm_content=creditCopyText&amp;utm_medium=referral&amp;utm_source=unsplash">Jake Young</a> on <a href="https://unsplash.com/photos/man-leaning-on-wall-while-looking-down-wearing-eyeglasses-and-necklace-with-right-hand-on-chin-iR3dtvKmwAw?utm_content=creditCopyText&amp;utm_medium=referral&amp;utm_source=unsplash">Unsplash</a></figcaption></figure><p>When we discuss Artificial Intelligence (AI) replacing human jobs, there is bound to be a mixture of excitement, fear, confusion, and scepticism. The conversation becomes especially intricate when talking about complex and specialized fields like DevOps.</p><p>I have recently had the chance to talk with people about their apprehensions towards the advancing role of AI across numerous IT sectors. At <a href="https://madokai.com/">Madokai</a>, we are deeply intrigued by the prospect of AI within the field of DevOps. Here, we share our insights and observations.</p><h3>Introduction to AI and DevOps</h3><p>AI is a field poised to redefine traditional business models across a wide range of industries. It offers the ability to automate tedious tasks, avoid human error, and perform complex operations quickly.</p><p>DevOps, on the other hand, is an evolving practice that brings together software development (dev) and information technology operations (ops) to create higher-quality software more quickly and with fewer issues. The foundation of DevOps is communication, collaboration, and continuous iteration and improvement.</p><h3>Automation in DevOps</h3><p>Automation is nothing new in DevOps. In fact, it’s a fundamental principle, embodied in continuous integration, continuous delivery (CI/CD), and automated testing. 
However, these processes require substantial configuration, tuning, and maintenance to work properly, all of which are labor-intensive tasks.</p><p>AI enters the scene as a transformative force that can make automation smarter. AI can enhance automation to become more responsive and adaptive. It can analyze historical data, learn from trends, make predictions, and offer valuable insights that can significantly optimize DevOps pipelines.</p><p>This use of AI in DevOps doesn’t mean AI is replacing DevOps; rather, it is evolving it.</p><h3>Can AI Replace DevOps?</h3><p>So, can AI replace DevOps? The answer is a nuanced “No”. At present, AI is a tool that empowers and elevates rather than replaces. The authors of algorithms, the ones who create usefulness from data, will still be humans. Machines aren’t set to replace DevOps engineers; they will make their jobs more manageable and allow them to focus on creating value.</p><p>AI can handle the tedious monitoring, respond to basic signals, and perform corrective actions. However, it still requires the human touch for the initial setup, adjustment, and oversight. Notably, AI algorithms run on a “garbage in, garbage out” principle: they need quality data and a robust setup, and they’re far from infallible, unable to deal with unexpected scenarios as effectively as a human can.</p><p>Additionally, DevOps isn’t solely about technology or processes; it has a significant people-and-culture aspect. This crucial element can’t be automated or replaced by AI algorithms. The seamless collaboration, communication, and decision-making abilities of humans are still unmatched by AI.</p><p>Rather than a machine takeover, we’re likely to see a future where DevOps professionals leverage AI to offload basic tasks, improve effectiveness, and make work more rewarding. Skilled DevOps engineers who learn to cooperate and grow with AI are likely to be invaluable in the industry’s future. 
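<p>As a toy illustration of the kind of basic signal an AI-assisted monitor can act on, here is a minimal sketch (the metric values and the threshold are invented for the example, not taken from any real system) that flags anomalous latencies with a rolling z-score:</p>

```python
from statistics import mean, stdev

def flag_anomalies(latencies_ms, window=10, threshold=3.0):
    """Flag samples that deviate more than `threshold` standard
    deviations from the mean of the trailing window."""
    alerts = []
    for i in range(window, len(latencies_ms)):
        past = latencies_ms[i - window:i]
        mu, sigma = mean(past), stdev(past)
        if sigma > 0 and abs(latencies_ms[i] - mu) / sigma > threshold:
            alerts.append((i, latencies_ms[i]))
    return alerts

# Stable traffic with one obvious spike at index 12:
samples = [100, 102, 98, 101, 99, 103, 100, 97, 102, 101, 99, 100, 450, 101]
print(flag_anomalies(samples))
```

<p>The machine watches; a human still decides what the alert means and how to respond. That division of labor is exactly the point.</p>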
So, instead of viewing AI as a threat, we can consider it an opportunity and prepare to embrace it.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/650/0*qri-dpCrnS0C3ffv.jpg" /><figcaption>Source: <a href="https://www.commitstrip.com/en/2022/12/09/a-whole-new-world/?">CommitStrip</a></figcaption></figure><h3>The Role of Artificial Intelligence in DevOps</h3><p>Remember, the greatest strength of a DevOps engineer lies in their ability to adapt to shifts in the landscape and effectively harness the potential of existing tools to their advantage. AI is no exception, so start thinking now about how it can help you in your daily tasks:</p><ul><li><strong>AI Pair Programming</strong>: <a href="https://github.com/features/copilot">GitHub Copilot</a>, an AI-powered assistant, can make the development process more efficient. Using the contextual information from your code, it suggests whole lines or blocks of code to help you build faster. It’s essentially a pair programmer that helps you navigate the coding process, contributes ideas, and even takes over when you’re stuck.</li><li><strong>Documentation of Code</strong>: AI tools can automate the generation and updating of code documentation. The tools can analyze your codebase and automatically document what different components of the code do. This not only reduces the time spent creating and maintaining documentation but also ensures that no details are missed.</li><li><strong>Debug Assistant</strong>: Tools like <a href="https://metakube.com/kubegpt/">KubeGPT</a>, an AI-powered debugging assistant for Kubernetes, can simplify error detection. By analyzing the logs, it provides meaningful insights into what’s causing an issue in the infrastructure. It suggests potential fixes, helping you save valuable time and reduce downtime.</li><li><strong>Incorporate AI in Testing</strong>: Automated testing is a crucial element in any DevOps pipeline. 
By incorporating AI and machine learning, testing routines can be improved and made more efficient. AI can help create more effective testing strategies, automatically adapt testing as software changes, and rapidly analyze results to spot and respond to issues.</li><li><strong>Intelligent Monitoring and Alerting</strong>: AI can assist in predicting and tackling IT incidents before they become catastrophes. By learning from historical data, AI can predict possible system failures or bottlenecks and alert the team. It allows early detection and mitigation of issues, making IT operations more efficient and reliable.</li><li><strong>Enhancing CI/CD Pipelines</strong>: AI can identify patterns and correlations in complex data that may be missed by the human eye. This ability can be leveraged to optimize the entire CI/CD pipeline. For instance, AI can analyze data from previous deployments to make risk assessments and recommendations for future ones, enabling more effective and efficient operations.</li></ul><h3>What Do You Think?</h3><p>We’ve just delved into the complex relationship between AI and DevOps, discussing how the two can harmoniously coexist and help each other accomplish more. Like any technological prediction, our perspective is open to debate, and we recognize that your view might differ from ours.</p><p>Please feel free to drop your comments, questions, or perspectives below. 
Remember, every opinion matters!</p><h3>About The Author</h3><p><a href="https://www.linkedin.com/in/nicolas-giron-6129b0a1/">Nicolas Giron</a> — Staff MLOps — DevOps — Co-Founder <a href="https://madokai.com/">Madokai</a></p><hr><p><a href="https://medium.com/nerd-for-tech/will-ai-replace-devops-engineers-2b0e23e8f588">Will AI Replace DevOps Engineers?</a> was originally published in <a href="https://medium.com/nerd-for-tech">Nerd For Tech</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
    </channel>
</rss>