Stories by Razeen Abdal-Rahman on Medium

Running Production-Minded Kubernetes on a Raspberry Pi

Razeen Abdal-Rahman — Sun, 15 Feb 2026 11:31:00 GMT

Most Kubernetes clusters are built in environments with excess capacity.

Plenty of memory. Plenty of CPU. Elastic infrastructure behind the scenes.

This one is not.

This cluster runs on a Raspberry Pi 3, inside a normal household network, behind an ISP router that cannot be replaced. It shares that network with personal devices, guest devices, and the usual mix of consumer hardware.

The goal was not to build a “homelab”.

The goal was to design a small, portable, security-focused platform using the same principles I would apply in a professional environment.

The constraints were real. And they shaped every decision.

Constraints That Shaped the Design

Before choosing tools, I defined the boundaries:

Consumer ISP router with limited configurability
A separate guest network that must remain functional
Limited RAM and CPU
No appetite for breaking household connectivity
Future physical relocation to a different property

The Raspberry Pi 3 provides 1GB of RAM. Kubernetes does not treat that as generous.

The router cannot be replaced, and guest network behaviour is largely opaque.

These were not inconveniences to be engineered around later. They were design inputs from the start.

If a solution required ideal networking hardware or exposed services to the internet, it was rejected immediately.

This environment had to be:

Self-contained
Portable
Recoverable from scratch

Why Kubernetes Here, Despite the Overhead

Running Kubernetes on a Raspberry Pi is often described as overkill.

In many cases, that criticism is valid.

This cluster is not about scale. It is not about handling traffic spikes. It is not about demonstrating high availability.

It is about discipline.

Using k3s enforces:

Declarative configuration
Explicit resource management
Containerised workloads
Repeatable deployments
Clear separation between services

The overhead is significant on constrained hardware. Default components alone can overwhelm a Pi 3.

But the payoff is structural.

If the hardware changes, if a new node is added, or if the cluster must be rebuilt after relocation, the deployment model remains identical.

The trade-off is complexity in exchange for long-term flexibility.

In this context, that trade-off is intentional.

Architecture Overview

The current cluster is single-node.

The Raspberry Pi 3 runs both control plane components and workloads. There is no high availability and no redundancy. This is a known and accepted limitation.

Core services include:

Pi-hole for network-wide DNS control
Tailscale for secure remote access
Linkding as a stateful user workload

All services are deployed using raw Kubernetes manifests. There are no Helm charts and no abstraction layers beyond Kubernetes itself.

Stateful workloads use PersistentVolumeClaims backed by local storage. This is a compromise driven by hardware limitations.

Traffic flow is simple:

Internal devices -> Router DHCP -> DNS requests to Pi-hole -> Upstream resolver
Remote access -> Encrypted overlay via Tailscale -> Internal services

Nothing is exposed publicly.

The architecture prioritises clarity over cleverness.

Networking and DNS: Where Theory Met Reality

Networking proved to be the first real constraint.

The household network includes:

Trusted devices
IoT devices
A separate guest network

The initial plan was to move DHCP to Pi-hole and centralise both DNS and address management.

This broke the guest network.

The ISP router did not allow sufficient configuration to support that model cleanly. The guest network relied on router-controlled DHCP behaviour that could not be replicated externally.

The solution was not complex.

The router retained DHCP responsibilities. Pi-hole became the sole DNS server configured on the router.

This preserved:

Guest network functionality
DNS-level visibility and filtering
Simplicity

The compromise was correct because it respected the environment.

Good infrastructure adapts to its constraints instead of fighting them.

Security Model and Attack Surface

Security in this cluster is based on reduction rather than exposure.

No services are publicly accessible.

There is no port forwarding. There are no open inbound firewall rules. There is no public ingress controller.

Remote access is handled exclusively through Tailscale, providing:

Encrypted connectivity
Identity-based access control
No reliance on IP trust

The threat model is pragmatic.

It assumes:

The local network may contain untrusted devices
The public internet should not have visibility into internal services

It does not attempt to defend against nation-state actors.

The security posture is proportional to the context.

What Broke, and What Changed Because of It

Several assumptions failed early.

The first mistake was starting from an outdated Raspberry Pi OS image. Package updates were unreliable, and the system required a complete reinstallation using a current release.

Next, default k3s components proved too heavy for the hardware.

Traefik, ServiceLB, metrics-server, and local storage were initially enabled. Performance was inconsistent, and memory pressure was significant.

Removing non-essential components stabilised the node.
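As a rough illustration of what trimming the defaults can look like, the sketch below builds the arguments for `k3s server` with one `--disable` flag per packaged component. Which components were actually removed on this node is an assumption here; the list simply mirrors the four named above.

```python
# Components named in the post as enabled by default. Treating all four as
# disabled is an assumption made for this sketch.
DISABLED_COMPONENTS = ["traefik", "servicelb", "metrics-server", "local-storage"]


def k3s_server_args(disabled):
    """Build the argument list for `k3s server`, one --disable flag per component."""
    args = ["server"]
    for component in disabled:
        args += ["--disable", component]
    return args


if __name__ == "__main__":
    # Prints the command line rather than running it.
    print("k3s " + " ".join(k3s_server_args(DISABLED_COMPONENTS)))
```

Keeping the list in one place makes it easy to re-enable a single component later without touching the rest.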

Later, deploying Linkding introduced new constraints. Django migrations were slow, and default uWSGI settings triggered out-of-memory conditions.

Mitigations included:

Reducing worker and thread counts
Setting explicit resource limits
Re-enabling local storage carefully
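To make "explicit resource limits" concrete, here is a minimal sketch that builds the container section of a manifest as a Python dictionary and checks the total against the node's memory. The image name and every number are hypothetical placeholders, not the values used in this cluster.

```python
import json

NODE_MEMORY_MI = 1024  # the Pi 3's 1GB of RAM


def mebibytes(quantity):
    """Parse a memory quantity written in Mi, e.g. "256Mi" -> 256."""
    if not quantity.endswith("Mi"):
        raise ValueError(f"only Mi quantities are handled in this sketch: {quantity}")
    return int(quantity[:-2])


def container_spec(name, image, memory_request, memory_limit, cpu_limit):
    """Return a container entry with explicit requests and limits."""
    return {
        "name": name,
        "image": image,
        "resources": {
            "requests": {"memory": memory_request},
            "limits": {"memory": memory_limit, "cpu": cpu_limit},
        },
    }


def total_memory_limit_mi(containers):
    """Sum the memory limits of all containers, in Mi."""
    return sum(mebibytes(c["resources"]["limits"]["memory"]) for c in containers)


# Hypothetical image and values, chosen only to show the shape of the spec.
linkding = container_spec(
    "linkding", "example/linkding:placeholder", "128Mi", "256Mi", "500m"
)

if __name__ == "__main__":
    print(json.dumps(linkding, indent=2))
    print(total_memory_limit_mi([linkding]), "of", NODE_MEMORY_MI, "Mi claimed by limits")
```

Summing the limits is a crude check, but on a 1GB node it quickly shows how little headroom remains once the control plane is accounted for.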

Each adjustment followed the same pattern:

Assumption -> failure -> simplification.

The system improved not through expansion, but through restraint.

Future Evolution Without Re-Architecture

The design intentionally supports gradual evolution.

Planned changes include:

Adding a Raspberry Pi 5 as a second node
Moving stateful workloads to SSD-backed storage
Labelling nodes for clearer workload separation

The existing manifests will not require structural redesign.

The deployment model remains consistent whether running on one node or two.

High availability is not currently implemented. It may be considered later, but only if justified by actual requirements rather than theoretical completeness.

The cluster is allowed to grow, but not at the expense of clarity.

What This Project Actually Demonstrates

This cluster is small.

It is not enterprise scale. It is not highly available. It is not complex for its own sake.

What it represents is:

Decision-making under constraint
Security-first design in consumer environments
Comfort with trade-offs
Operational tuning on limited hardware
Discipline in documentation and reproducibility

The most valuable lessons did not come from adding components.

They came from removing them.

For full manifests, diagrams, and documentation, the complete repository is available here:

https://github.com/Razeen-Abdal-Rahman/portable-k3s-homelab

Why DevOps Transformations Fail (And How Culture Fixes It)

Razeen Abdal-Rahman — Sat, 14 Feb 2026 12:31:00 GMT

I’ve seen DevOps transformations fail spectacularly.

Companies use all the right tools. GitHub Actions, AWS, Kubernetes, the whole stack. They hire consultants, run workshops, create roadmaps.

Then six months in, they’re back where they started. Tickets piling up. Teams pointing fingers. Release days that feel like going to war.

The tools weren’t the problem. The culture was.

The Expensive Mistake Everyone Makes

Here’s what most organisations get wrong: they think DevOps is about automation.

They’re half right. Automation is crucial. But it’s not the foundation.

The foundation is trust.

Traditional IT culture is built on distrust. Don’t let developers touch production, they’ll break it. Don’t let operations change the code, they don’t understand it. Put gates everywhere. Make every change go through three approval processes.

This creates bottlenecks. More importantly, it creates blame games.

When something inevitably breaks (and it will), everyone scrambles to prove it wasn’t their fault. The developer points to the deployment process. Operations points to the code. The project manager points to unclear requirements.

Meanwhile, users are still experiencing downtime, and nobody’s actually fixing the problem.

The Cultural Shift That Changes Everything

DevOps culture flips this entirely.

It’s not about preventing all failures through rigid processes. It’s about recovering quickly when failures happen, because they will happen.

The question shifts from “whose fault is this?” to “how do we prevent this from happening again?”

That sounds simple. It’s not.

I’ve sat in postmortems where the room was silent because everyone was terrified of being blamed. I’ve also sat in blameless postmortems where the person who caused the outage was the one leading the discussion, walking everyone through exactly what went wrong.

The difference? In the second room, we actually learnt something. We improved the system. We added guardrails. We documented the edge case.

In the first room, we just learnt to be more careful next time. Which isn’t learning at all.

Breaking Down the Walls

Here’s the thing about silos: they’re comfortable.

As a developer, it’s easier to throw code over the wall and let operations deal with deployment. As an operations person, it’s easier to say “not my problem” when the application logic fails.

But comfortable isn’t the same as effective.

Plenty of teams stay firmly in their own lanes. Developers write code, commit it, and consider their job done. If it doesn’t deploy properly, that’s someone else’s issue to sort out.

The cultural shift starts with something small, like developers sitting in on operations meetings. They begin to understand why that code freeze exists: three months ago, a deployment took down the payment system during peak hours.

Suddenly, those “annoying restrictions” make sense. And code starts getting written differently.

We don’t need operations people to start writing features. We don’t need developers to manage storage arrays.

We need enough context to make better decisions in our own part of the system.

That’s what breaking down silos actually means. Not eliminating specialisation, but eliminating ignorance.

The Feedback Loop Advantage

Fast feedback loops are DevOps’s secret weapon.

Traditional development had month-long cycles. Build for weeks. Integrate everything. Test. Deploy. Then discover what’s broken.

By the time you find the bug, you’ve moved on to three other features. The context is gone. You’re trying to remember what you were thinking weeks ago.

DevOps collapses this entirely.

Write code in the morning. Automated tests run in minutes. Deployed to staging by lunch. In production by afternoon. User feedback by evening.

This isn’t just faster. It fundamentally changes how you approach building software.

You make smaller bets. You validate assumptions quickly. You pivot when something’s not working.

This is why DevOps teams can move faster whilst actually being more stable. They’re not moving recklessly, they’ve just got better safety nets and faster recovery times.

What Actually Works

If you’re trying to shift to DevOps culture, here’s what I’ve seen work:

Start with blameless postmortems. Make them genuinely blameless. The first time someone gets blamed in a “blameless” postmortem is the last time anyone will be honest in one.

Create shared responsibility. Don’t just say “you build it, you run it”, give people the tools and support to actually do that. Pair developers with operations people. Let them shadow each other.

Celebrate failures that lead to learning. Not all failures, obviously. But the ones where someone tried something new, it didn’t work, and they came back with insights? Those are gold.

Measure what matters. Not how many deployments you did, but how quickly you recover when things go wrong. Not how many tickets you closed, but whether users’ problems are actually solved.
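Recovery time is easy to talk about and rarely written down. The sketch below is one simple way to compute a mean time to recovery from incident start and resolution timestamps; the incident data is invented for illustration, and real teams may define "recovered" differently.

```python
from datetime import datetime, timedelta


def mean_time_to_recovery(incidents):
    """Average of (resolved - started) across incidents, or None if there are none."""
    if not incidents:
        return None
    total = sum((resolved - started for started, resolved in incidents), timedelta())
    return total / len(incidents)


# Hypothetical incidents: (started, resolved)
incidents = [
    (datetime(2026, 1, 5, 10, 0), datetime(2026, 1, 5, 10, 30)),
    (datetime(2026, 1, 12, 14, 0), datetime(2026, 1, 12, 15, 30)),
]

if __name__ == "__main__":
    print("MTTR:", mean_time_to_recovery(incidents))
```

Tracking this number over time says more about a team's safety nets than a raw count of deployments or closed tickets.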

The Real Work Begins

The hardest part of DevOps isn’t learning Kubernetes or mastering CI/CD pipelines.

It’s convincing people that failure is a learning opportunity, not a career-limiting move. It’s building trust across teams that have spent years protecting their territory. It’s changing the conversation from “how do we prevent this person from making mistakes?” to “how do we make it safe to try new things?”

The tools will follow. The automation will follow. The improved metrics will follow.

But first, you need people who trust each other enough to share responsibility instead of sharing blame.

That’s the culture shift. And that’s what makes DevOps transformations actually stick.

Have you seen a DevOps transformation succeed or fail? What made the difference?

What DevOps Actually Means (And Why It’s Not About Tools)

Razeen Abdal-Rahman — Sat, 07 Feb 2026 14:49:07 GMT

I remember my first week as a DevOps engineer four years ago.

I kept waiting for someone to sit me down and explain what DevOps actually was. I’d read the job description. I knew the tools listed in the requirements. But I had no idea what I was supposed to be.

Here’s what nobody told me: DevOps isn’t a thing you can download or install. It’s not even really a job title, though we all have it plastered across our LinkedIn profiles.

DevOps is what happens when developers and operations teams stop pretending they work in separate companies.

Why This Actually Matters

Most organisations I’ve worked with have the same problem, just wearing different clothes.

Developers build features as fast as possible. Operations keep systems stable at all costs. These goals don’t just conflict, they actively work against each other when teams aren’t collaborating.

The result is predictable. Developers get frustrated because ops “slow them down.” Operations spend their evenings firefighting issues they had no context for. Everyone blames everyone else, and the software suffers.

I’ve seen brilliant engineers burn out because they were fighting against their own colleagues instead of working with them. The waste isn’t just inefficient, it’s heartbreaking.

DevOps exists to fix this. Not by eliminating roles or making everyone do everything, but by creating shared responsibility for outcomes.

The Wall Nobody Talks About

There’s often an invisible wall between development and operations teams.

Developers write code, throw it over the wall, then move on to the next feature. Operations catch whatever lands on their side and deal with it in production. Sometimes this works. Usually, it creates chaos.

I worked on a team where this pattern was so ingrained that developers genuinely had no idea how their code was deployed. They’d finish a feature, mark the ticket as done, and forget about it. When something broke at 2am, ops would be scrambling to understand code they’d never seen before.

This isn’t a technology problem. You can’t solve it by buying better monitoring tools or switching to Kubernetes.

It’s a people problem.

DevOps tears down that wall by making everyone care about both speed and stability. When developers understand how their code runs in production, they write better code. When operations teams get involved earlier in the process, they can prevent problems instead of just reacting to them.

The best DevOps transformations I’ve seen didn’t start with new tools. They started with developers and ops actually sitting in the same room, talking about their problems, and realising they wanted the same things.

Why Speed Isn’t What You Think It Is

When I tell people DevOps teams obsess over speed, they assume I mean shipping features faster.

That’s not quite right.

Speed in DevOps is about learning faster. When you can deploy changes quickly, you get feedback quickly. That feedback tells you whether you’re building the right thing or whether you need to change direction before you’ve wasted months going the wrong way.

I worked on a team that went from monthly releases to weekly deployments. The technical changes mattered: we automated testing, improved our CI/CD pipeline, and broke down the monolith. But the real transformation was cultural.

We started catching bugs within days instead of weeks. We could validate ideas with real users instead of making guesses in planning meetings. We fixed production issues before most customers even noticed them.

Fast feedback loops mean you’re building on solid ground. Slow feedback loops mean you discover your foundations were shaky after you’ve already built three stories on top.

This is why DevOps practices like continuous integration and automated testing exist. Not because they’re trendy or because some consultant said you need them. They exist to make feedback loops faster, because faster feedback usually means better decisions.

DORA (DevOps Research and Assessment) identified deployment frequency as one of the key metrics that separate high-performing teams from everyone else. It’s not about speed for speed’s sake, it’s about the learning that speed enables.
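Deployment frequency can be approximated with very little tooling. The sketch below counts deployments per ISO week from a list of dates; the dates are made up, and this is a simplified count rather than DORA's own methodology.

```python
from collections import Counter
from datetime import date


def deployments_per_week(deploy_dates):
    """Count deployments per (ISO year, ISO week)."""
    return Counter(tuple(d.isocalendar())[:2] for d in deploy_dates)


# Hypothetical deployment dates
deploys = [date(2026, 1, 5), date(2026, 1, 7), date(2026, 1, 14)]

if __name__ == "__main__":
    for (year, week), count in sorted(deployments_per_week(deploys).items()):
        print(f"{year}-W{week:02d}: {count} deployment(s)")
```

A team moving from monthly releases to weekly deployments would see this go from mostly empty weeks to at least one entry in every week.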

What To Actually Do About This

If you’re trying to break into DevOps, here’s what actually matters:

Learn to think in systems. Understand how code moves from a developer’s laptop to production. Not just the tools, but the human processes around them.

Practice automation, but understand why you’re automating. The goal isn’t to eliminate people, it’s to eliminate the repetitive tasks that stop people from doing valuable work.

Focus on communication skills as much as technical skills. The best DevOps engineers I know aren’t necessarily the ones who can write the most elegant Terraform. They’re the ones who can explain complex systems to different audiences and build bridges between teams.

Start small. You don’t need to transform your entire organisation overnight. Find one painful, manual process and automate it. Find one gap between dev and ops and help close it.

The Thing About Culture

Four years in, I’ve learned something unexpected.

The tools change constantly. Kubernetes replaces Docker Swarm. GitHub Actions replaces Jenkins. Some new monitoring platform promises to solve all your problems (it won’t).

But the fundamental challenge stays the same: getting people to work together towards shared goals instead of optimising for their own team’s metrics.

The best DevOps work I’ve done had nothing to do with writing better YAML. It was helping a developer understand why ops cared about observability. It was showing an operations engineer that faster deployments actually made their life easier, not harder.

DevOps is a mindset, not a job title. It’s culture over tools, people over processes.

The question isn’t whether you know the latest framework. The question is whether you can help people collaborate better than they did yesterday.

Your DevOps Journey: Lessons from 15 Weeks (and 4 Years)

Razeen Abdal-Rahman — Sat, 31 Jan 2026 13:31:02 GMT

Fifteen weeks ago, I started sharing what I’ve learnt in DevOps.

Not because I have all the answers. Because I remember what it felt like to have none of them.

Four years into this career, I’m still learning. Still making mistakes. Still discovering that “best practices” change faster than documentation can keep up.

This week marked the end of my 15-week series. Security, career pitfalls, and the reality that the learning never stops.

Here’s what I wish someone had told me when I started.

Security Can’t Be an Afterthought

“We’ll add security later” ranks alongside “we’ll add tests later” in the hall of famous last words.

It never happens. Or it happens after a breach, when it’s expensive, stressful, and your company’s name is in headlines you’d rather avoid.

DevSecOps isn’t about adding more meetings or slowing down deployments. It’s about building security into every stage so it becomes invisible infrastructure rather than a last-minute blocker.

Security scanning in CI pipelines catches vulnerabilities before they reach production. Dependency checking identifies compromised packages before they’re deployed. Secrets management prevents credentials from living in Git history forever.
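A pipeline check for leaked credentials does not have to be elaborate to be useful. The sketch below scans text for two patterns: the widely documented shape of an AWS access key ID, and a deliberately crude, hypothetical rule for hard-coded passwords. It is an illustration only; dedicated scanners cover far more cases.

```python
import re

# The first pattern matches the common shape of an AWS access key ID. The second
# is a crude, hypothetical rule added for illustration.
PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "hardcoded_password": re.compile(r"password\s*=\s*['\"][^'\"]+['\"]", re.IGNORECASE),
}


def find_secrets(text):
    """Return (line_number, rule_name) for every line that matches a pattern."""
    findings = []
    for number, line in enumerate(text.splitlines(), start=1):
        for rule, pattern in PATTERNS.items():
            if pattern.search(line):
                findings.append((number, rule))
    return findings


# Sample input using AWS's documented example key, not a real credential.
SAMPLE = 'region = "eu-west-2"\nkey = "AKIAIOSFODNN7EXAMPLE"\npassword = "hunter2"'

if __name__ == "__main__":
    for line_number, rule in find_secrets(SAMPLE):
        print(f"line {line_number}: {rule}")
```

Run against each changed file in CI, a non-empty result would fail the build before the credential ever reaches Git history.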

This sounds like extra work.

In practice, it’s significantly less work than dealing with a security incident.

Finding a vulnerability during development takes minutes to fix. A quick dependency update. A configuration change. Maybe a code refactor if you’re unlucky.

Finding that same vulnerability in production after it’s been exploited? That’s weeks of incident response, customer communications, regulatory reporting, and reputation management.

The shift-left approach pushes security concerns earlier in the development process. Not because security teams are bottlenecks, but because earlier is cheaper and easier.

I’ve seen teams transform security from a deployment blocker into an automated part of every build. They didn’t ship slower. They shipped faster.

Because they caught issues when they were easy to fix, not when they were critical.

The Pitfalls Nobody Warns You About

Four years in, here’s what surprised me most about DevOps.

Tools don’t fix culture problems. You can have the most sophisticated CI/CD pipeline, perfect infrastructure as code, and cutting-edge monitoring. If teams don’t collaborate, if developers and operations are still throwing work over the wall, none of it matters.

DevOps is fundamentally about culture. The tools just make that culture scalable.

Automation isn’t always the answer. This feels heretical to say in DevOps circles, but some things are better done manually. Automate repetitive tasks that happen frequently. Don’t spend three days automating something that happens quarterly and takes five minutes.

I’ve wasted time on this. You don’t have to.
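The arithmetic is worth doing before you start. This sketch works out how long an automation takes to pay for itself, assuming an 8-hour working day and, for the comparison case, roughly 250 working days a year; both figures are assumptions.

```python
def break_even_years(build_hours, minutes_saved_per_run, runs_per_year):
    """Years until the time saved equals the time spent building the automation."""
    minutes_saved_per_year = minutes_saved_per_run * runs_per_year
    return (build_hours * 60) / minutes_saved_per_year


# Three 8-hour days (assumed) to automate a 5-minute task that runs quarterly.
quarterly = break_even_years(build_hours=24, minutes_saved_per_run=5, runs_per_year=4)

# The same effort spent on a 5-minute task that runs every working day (assumed 250).
daily = break_even_years(build_hours=24, minutes_saved_per_run=5, runs_per_year=250)

if __name__ == "__main__":
    print(f"quarterly task: {quarterly:.1f} years to break even")
    print(f"daily task: {daily:.2f} years to break even")
```

Under these assumptions the quarterly task takes 72 years to break even, while the daily one pays back in a little over a year.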

Perfect is the enemy of deployed. Ship something working. Improve it based on real feedback from real users. Don’t wait for perfection, because perfection is a moving target that you’ll never hit.

Monitoring without alerting is useless. I’ve built beautiful dashboards that nobody looked at because they required someone to actively check them. If the system can’t wake you up when things go wrong, you’ll find out when customers complain instead.

Documentation ages badly. I’ve spent hours writing detailed wiki pages that were outdated within months. Code and automation are better documentation than prose. If your infrastructure is defined in code, that code is always current documentation.

Production will always surprise you. Test environments never perfectly mirror production. Build systems that handle surprises gracefully rather than trying to predict every possible failure mode.

The biggest lesson? DevOps is a journey, not a destination. You don’t “implement DevOps” and finish. You continuously improve, adapt, and evolve.

Your Career Path in DevOps

Here’s what nobody tells you about learning DevOps: you don’t learn it all at once.

Four years ago, I was overwhelmed by how much there was to learn. Linux, networking, cloud platforms, containers, orchestration, CI/CD, infrastructure as code, security, monitoring, logging.

The list was endless. It still is.

The secret isn’t learning everything. It’s learning what you need, when you need it, and knowing where to find the rest.

The people who succeed in DevOps aren’t the ones who know everything. They’re the ones who keep learning, stay curious, and aren’t afraid to admit when they don’t know something.

New tools emerge constantly. Practices evolve. What’s best practice today might be legacy tomorrow. That’s not a problem. That’s the nature of the field.

Start where you are. Pick one area to dive deeper. Build something. Break it. Fix it. Learn from it.

Join communities. Share what you’re learning. Help others who are a few steps behind you.

Building in public is how you learn faster and build your reputation simultaneously.

When you help someone solve a problem you struggled with last month, you’re not just helping them. You’re reinforcing your own understanding and building connections that will support your career for years.

What to Do Next

If you’re starting your DevOps journey, here’s what I’d focus on:

Pick one area and go deep. Don’t try to learn everything at once. Choose Linux, or containers, or CI/CD, or cloud platforms. Build something real. Then expand.

Automate something that annoys you. The best way to learn automation is to solve a problem you actually have. Even if it’s small.

Prioritise security from day one. Learn to think about security as part of development, not something that comes after. Future you will be grateful.

Find your community. Whether it’s online forums, local meetups, or structured communities, learning with others is faster and more sustainable than learning alone.

Here are a couple of DevOps-focused communities:

Coderco DevOps Academy

Coderco free community

Accept that you won’t know everything. That’s fine. Nobody does. What matters is knowing how to learn and where to find help.

Document your journey. Write about what you’re learning. Share your mistakes. You’ll be surprised how many people are facing the same challenges.

The Journey Never Ends

Fifteen weeks of posts covered a lot of ground. Culture, Linux, networking, cloud computing, CI/CD, infrastructure as code, containers, Kubernetes, security.

That’s a solid foundation. But it’s just a foundation.

The learning doesn’t stop here. This is where it really begins.

Because DevOps isn’t about reaching a finish line. It’s about continuously improving systems, processes, and yourself.

Four years in, I’m still figuring things out. Still learning new tools. Still making mistakes. Still discovering better ways to solve old problems.

The difference is that now I know that’s exactly how it’s supposed to be.

Your DevOps journey doesn’t end when you land your first role, or finish a course, or complete a certification.

It continues as long as you stay curious.

Where are you in your DevOps journey, and what’s the one thing you’re focused on learning next?

Kubernetes: When You Need It (And When You Don’t)

Razeen Abdal-Rahman — Sat, 24 Jan 2026 13:32:03 GMT

Docker vs K8s comparison by ByteByteGo

What happens when teams adopt Kubernetes on their first day of production?

They spend six months wrestling with YAML files, debugging pod networking, and trying to understand why their three-container application needs a 47-step deployment process.

Meanwhile, their competitors ship features using Docker Compose and grow their business.

Why This Matters

Kubernetes has become the default answer to “How do we run containers in production?”

But default answers are rarely the right answers. They’re just the loudest ones.

Understanding when you actually need Kubernetes, and when you’re better off without it, can save you months of complexity and help you focus on what matters: building things people want.

The Problem Kubernetes Actually Solves

Running one container is trivial. Docker handles it perfectly.

Running hundreds of containers across dozens of servers is a completely different challenge.

You need orchestration. Load balancing. Health checks that automatically restart failed containers. Service discovery so containers can find each other. Rolling updates that don’t cause downtime. Resource management so containers don’t starve each other of CPU and memory.

This is where Kubernetes excels.

You describe what you want (ten replicas of this container, exposed on this port, with these health checks) and Kubernetes makes it happen. Container crashes? Kubernetes restarts it. Node fails? Kubernetes reschedules containers to healthy nodes. Traffic spikes? With an autoscaler configured, Kubernetes scales up automatically.

It’s infrastructure as code taken to its logical conclusion.
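The "describe what you want" idea can be pictured as a loop that compares desired state with observed state and works out the difference. The sketch below is a toy model of that comparison, not Kubernetes controller code; the pod names and action labels are invented.

```python
def reconcile(desired_replicas, pods):
    """Return the actions that would move `pods` toward the desired replica count.

    `pods` maps a pod name to True (healthy) or False (crashed).
    """
    healthy = [name for name, ok in pods.items() if ok]
    crashed = [name for name, ok in pods.items() if not ok]

    # Crashed pods are removed; replacements are counted below.
    actions = [("remove", name) for name in crashed]

    shortfall = desired_replicas - len(healthy)
    if shortfall > 0:
        actions += [("start", "new replica")] * shortfall
    else:
        # Too many healthy pods: remove the surplus.
        actions += [("remove", name) for name in healthy[desired_replicas:]]
    return actions


if __name__ == "__main__":
    observed = {"web-a": True, "web-b": False, "web-c": True}
    print(reconcile(3, observed))
```

Running the comparison repeatedly is what turns a static description into a system that corrects itself after a crash.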

But here’s the thing: if you’re running three containers on one server, all of this is complete overkill. You’re trading a simple problem for a complex solution.

The Core Concepts That Matter

Kubernetes has its own vocabulary, and the documentation can feel overwhelming.

But you only need three concepts to get started.

Pod: The smallest unit. Usually one container, sometimes a few that need to run together. Think of it as a wrapper around your container with Kubernetes-specific metadata.

Deployment: Describes how many replicas of a pod you want and how to update them. Want to run five copies of your application? Create a deployment with five replicas. Kubernetes handles scheduling, health checking, and recovery.

Service: A stable network endpoint that routes traffic to pods. Pods come and go, they have ephemeral IP addresses. Services provide a consistent way to reach them.

These three concepts handle 80% of what you’ll do with Kubernetes.

Everything else (ConfigMaps, Secrets, StatefulSets, DaemonSets, Ingress Controllers) you learn when you need them. Not before.
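To show how the three core concepts fit together, the sketch below builds a Deployment and a Service as plain dictionaries and prints them as JSON, which `kubectl` accepts alongside YAML. The application name, image, and port are placeholders chosen for illustration.

```python
import json


def deployment(name, image, replicas, port):
    """A Deployment: how many copies of a Pod to run, and from which image."""
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            # The Pod template: what each replica looks like.
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "containers": [
                        {"name": name, "image": image, "ports": [{"containerPort": port}]}
                    ]
                },
            },
        },
    }


def service(name, port):
    """A Service: a stable endpoint that routes to Pods carrying the same label."""
    return {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": name},
        "spec": {"selector": {"app": name}, "ports": [{"port": port, "targetPort": port}]},
    }


if __name__ == "__main__":
    # Placeholder values: five replicas of a stock web server image.
    items = [deployment("web", "nginx:stable", 5, 80), service("web", 80)]
    print(json.dumps({"apiVersion": "v1", "kind": "List", "items": items}, indent=2))
```

The Service finds its Pods purely through the shared `app` label, which is why Pods can come and go without the endpoint changing. Assuming the script is saved as `manifests.py` (a hypothetical name), piping its output to `kubectl apply -f -` should create both objects.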

I spent weeks trying to understand every Kubernetes concept before deploying anything. I read documentation. I watched tutorials. I built mental models.

Then I deployed something simple: a basic web application with just a Deployment and a Service.

Suddenly everything clicked. The abstractions made sense because I could see them working.

Understanding comes from doing, not from reading documentation.

The Complexity Tax

Here’s what nobody tells you about Kubernetes: it solves real problems by creating new ones.

You no longer worry about manually restarting containers. Now you worry about pod scheduling, resource quotas, and network policies.

You gain powerful orchestration. You lose simplicity.

This trade-off is worth it at scale. When you’re managing hundreds of services across multiple teams, Kubernetes pays for itself. The consistency, automation, and resilience it provides become essential.

But at small scale? The complexity tax is brutal.

I’ve seen companies with a few engineers adopt Kubernetes because “that’s what serious companies use”. They spent months building deployment pipelines, debugging networking issues, and managing cluster upgrades. Meanwhile, their competitors shipped features using simpler tools and captured market share.

I’ve also seen companies delay adopting Kubernetes until they had genuine scaling problems. By the time they migrated, they were fighting fires daily, manually managing containers across dozens of servers, dealing with inconsistent deployments, and losing sleep over outages.

The right answer depends on your specific situation.

The inflection point (where Kubernetes becomes worth the complexity) is somewhere around 10–20 services or when manual container management becomes genuinely painful. But it varies by team size, technical experience, and business requirements.

What You Should Actually Do

Don’t adopt Kubernetes because it’s trendy or because it’s on job postings.

Adopt it when managing containers manually becomes more painful than learning Kubernetes.

If you’re just starting out, begin with Docker and Docker Compose. Learn how containers work. Build things. Deploy them. Understand the fundamental problems before you reach for orchestration solutions.

If you’re already managing multiple services and feeling the pain of manual coordination, start learning Kubernetes. But start small: one simple application, just Pods, Deployments, and Services. Get comfortable with the basics before exploring advanced features.

If you’re at a company considering Kubernetes, ask honest questions: What problem are we solving? Is there a simpler solution? Do we have the expertise to manage this? What’s the opportunity cost of the time we’ll spend learning it?

The goal isn’t to avoid Kubernetes forever. It’s a powerful tool that solves real problems.

The goal is to adopt it when it actually helps rather than when it’s fashionable.

The Real Question

Kubernetes made managing hundreds of containers practical.

But it also made managing three containers unnecessarily complicated.

The technology itself is neutral. The decision to use it isn’t.

I’m curious about your experience: are you running Kubernetes in production, or have you deliberately chosen something simpler?

How Containers Solved ‘It Works on My Machine’ Forever

Razeen Abdal-Rahman — Sat, 17 Jan 2026 13:31:57 GMT

“It works on my machine” used to be the most frustrating sentence in software development.

Your code runs perfectly locally. You deploy it. Everything breaks.

Different OS version. Different library. Different environment variable you forgot to document.

I’ve watched deployments fail for reasons that took hours to diagnose, only to discover the production server had Python 3.8 whilst development used 3.9. Or that a critical environment variable existed on one machine but not another.

These weren’t coding problems. They were environment problems.

And they consumed enormous amounts of time.

The Real Problem Containers Solve

Containers didn’t invent a new way to run code.

They invented a new way to package it.

Before containers, you deployed your application code and hoped the environment matched what you’d tested against. You documented dependencies in README files. You wrote deployment scripts that installed the right versions of everything. You crossed your fingers.
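The kind of pre-flight check those deployment scripts tended to grow can be sketched as below. This is a minimal illustration, not a recommendation: the required version and the variable names `DATABASE_URL` and `API_KEY` are hypothetical, chosen only to mirror the Python 3.8 versus 3.9 and missing-variable failures described earlier.

```python
import os
import sys

# Hypothetical requirements for illustration; not taken from any real project.
REQUIRED_PYTHON = (3, 9)
REQUIRED_ENV_VARS = ("DATABASE_URL", "API_KEY")


def find_environment_problems(python_version, environ):
    """Return readable mismatches between the expected and the actual environment."""
    problems = []
    if tuple(python_version[:2]) != REQUIRED_PYTHON:
        found = ".".join(str(part) for part in python_version[:2])
        wanted = ".".join(str(part) for part in REQUIRED_PYTHON)
        problems.append(f"Python {wanted} expected, found {found}")
    for name in REQUIRED_ENV_VARS:
        if not environ.get(name):
            problems.append(f"environment variable {name} is not set")
    return problems


if __name__ == "__main__":
    # Check the machine this script happens to be running on.
    for problem in find_environment_problems(sys.version_info, os.environ):
        print(problem)
```

A script like this only reports the mismatch after the fact, and only for the things someone remembered to list.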

It was fragile because you were managing two separate things: your code and its environment.

Containers merge these into a single artefact.

Your application code, the exact runtime version, every dependency, all configurations, packaged together. That package runs identically everywhere because it carries its environment with it.

This seems obvious now. But it eliminated entire categories of deployment failures overnight.

If it works in the container on your laptop, it works in the container in production. Not probably. Actually.

When Microservices Became Real

I first understood containers properly in my third year at university.

The previous year, we’d learnt about microservices architecture through abstraction. Lecturers drew diagrams. We discussed separation of concerns, loose coupling, service boundaries. It made sense academically, like understanding encapsulation in object-oriented programming.

But it felt theoretical.

Then we started using Docker to actually build microservices. Each service got its own container with its own runtime, dependencies, and network interface.

Suddenly microservices weren’t diagrams anymore. They were real.

Docker enforced genuine boundaries that Java classes never did. Separate processes. Independent failure modes. Network communication instead of method calls. Each container was truly isolated, truly independent.

That’s when it clicked. Containers didn’t just make deployment easier. They made architectural patterns tangible.

You couldn’t cheat and share state between services because they literally couldn’t access each other’s memory. You had to think about network communication, failure handling, data contracts.

The constraints containers imposed were the constraints that made good architecture work.

The Efficiency That Changed Everything

The technical specifications tell part of the story.

Containers typically start in seconds or less. Virtual machines can take minutes.

Containers use megabytes. VMs use gigabytes.

Containers share the host OS kernel. VMs need a complete operating system for each instance.

But the real impact isn’t the efficiency itself. It’s what that efficiency enables.

Before containers, spinning up a test environment meant provisioning virtual machines, installing operating systems, configuring networks. It took time and resources.

With containers, you can create identical test environments in seconds. Destroy them when you’re done. Create new ones for the next test.

This changes how you work.

You can run dozens of microservices locally on your laptop. Each in its own container. Each with its own database. The entire architecture running on one machine.

You can test deployment changes by spinning up containers, running tests, and tearing everything down, all in your CI pipeline, all automatically.

You can scale applications by launching more containers. Not in minutes. In seconds.

I’ve seen teams move from monthly releases to daily deployments just by adopting containers. Not because containers made the code better. Because they made deployment reliable enough to do frequently.

Blue-green deployments. Canary releases. Auto-scaling. These patterns existed before containers.

Containers made them practical at scale.

Understanding Docker’s Core Concepts

Docker has three fundamental concepts worth understanding properly.

Images are templates. They contain your code and everything it needs to run. Think of them like classes in programming. They define what something is, but they don’t do anything until you instantiate them.

Containers are running instances of images. Like objects instantiated from classes. You can run hundreds of containers from a single image. Destroy them. Create new ones. Each container is isolated and independent.

Registries are repositories for images. Docker Hub is the default, but organisations often run private registries. Think of them like GitHub for container images.

The workflow is straightforward.

Build an image from your code using a Dockerfile. Push that image to a registry. Pull the image onto any server. Run containers from it.

Same image everywhere means same behaviour everywhere.
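The class and object analogy can be made literal with a toy model. This is not how Docker is implemented, and none of these names come from Docker’s API; `Image`, `Container`, `Registry`, and `run` are stand-ins I’ve invented to show the relationships between the three concepts.

```python
from dataclasses import dataclass
from itertools import count


@dataclass(frozen=True)
class Image:
    """An immutable template: a name, a tag, and the environment baked into it."""
    name: str
    tag: str
    runtime: str


@dataclass
class Container:
    """A running instance created from an image, with its own identity and state."""
    container_id: int
    image: Image
    running: bool = True


class Registry:
    """A toy stand-in for a registry: stores images under 'name:tag'."""

    def __init__(self):
        self._images = {}

    def push(self, image):
        self._images[f"{image.name}:{image.tag}"] = image

    def pull(self, reference):
        return self._images[reference]


_ids = count(1)


def run(image):
    """Start a new container from an image; the image itself is never modified."""
    return Container(container_id=next(_ids), image=image)


# Build once, push, pull anywhere, run as many instances as needed.
registry = Registry()
registry.push(Image(name="api", tag="1.0", runtime="python3.9"))

pulled = registry.pull("api:1.0")
first, second = run(pulled), run(pulled)
```

Both containers point at the same image but have separate identities, so stopping one leaves the other untouched.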

This simplicity is deceptive. It seems too simple to be revolutionary.

But it works because it removes variables. You’re not deploying code and hoping the environment matches. You’re deploying the entire environment.

What This Means Practically

If you’re learning DevOps, understanding containers is non-negotiable.

They’re the foundation of modern cloud infrastructure. Kubernetes orchestrates containers. Cloud-native applications are built as containers. CI/CD pipelines build and deploy containers.

Start with Docker locally. Install it. Run a few containers. Build a simple image from a Dockerfile.

The concepts aren’t complicated. Image, container, registry. That’s genuinely 80% of what you need to start.

Then build something real. A small application with a few services. Database in one container. API in another. Frontend in a third.

Watch how they communicate. Watch how you can destroy and recreate them. Watch how the same Dockerfile produces identical results every time.

That’s when you’ll understand why containers changed everything.

The Transformation Isn’t Technical

The real shift wasn’t from VMs to containers.

It was from “deployment is risky” to “deployment is routine.”

From “we release monthly because releases are painful” to “we release hourly because releases are reliable.”

From environment differences causing bugs to environment consistency preventing them.

Containers enabled that transformation. Not through complexity or cleverness.

Through consistency.

Your laptop, your test environment, your production servers. All running identical containers. The code that works in one works in all of them.

“It works on my machine” isn’t a joke anymore.

Because now, your machine is the container. And that container runs everywhere.

What deployment problems have you encountered that containers might solve?

Monitoring vs Observability: Why You Need Both

Razeen Abdal-Rahman — Sat, 10 Jan 2026 13:32:45 GMT

I used to think monitoring was enough.

Set some thresholds, configure a few alerts, watch the dashboards turn green. If something broke, I’d get notified. Simple.

Then I spent hours debugging an issue that monitoring never caught. Response times had degraded, but nothing was technically “broken.” No alerts fired. No thresholds crossed. Just users complaining that everything felt slow.

That’s when I learnt the difference between monitoring and observability. And why you need both.

Why This Actually Matters

Here’s the reality of production systems: you can’t predict every failure mode.

You can monitor for high CPU usage. You can alert on error rates. You can track disk space and memory consumption. But what about the weird edge case where requests from a specific region are slow, or the subtle performance degradation that only affects authenticated users, or the cascade failure that starts in a service you weren’t even watching?

Monitoring tells you when something you expected to go wrong actually goes wrong. Observability gives you the tools to investigate the problems you never saw coming.

Both are essential. Neither is optional.

The Fundamental Difference

Monitoring is predictive. You set a threshold (CPU above 80%, error rate above 5%, response time over 500ms) and you wait for reality to cross that line.

It’s brilliant for known problems. If you know that your database struggles above 1,000 connections, you can monitor for that. When it happens, you get alerted, and you know exactly what to do.
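Reduced to its core, threshold monitoring is a comparison against limits you chose in advance. The sketch below uses the numbers from the examples above; the metric names are my own, and a real monitoring system would also handle collection, time windows, and notification.

```python
# Limits taken from the examples above; the metric names are illustrative.
THRESHOLDS = {
    "cpu_percent": 80.0,
    "error_rate_percent": 5.0,
    "response_time_ms": 500.0,
    "db_connections": 1000.0,
}


def breached_thresholds(metrics, thresholds=THRESHOLDS):
    """Return the names of metrics whose current value is above its limit."""
    return sorted(
        name
        for name, limit in thresholds.items()
        if name in metrics and metrics[name] > limit
    )


snapshot = {"cpu_percent": 91.5, "error_rate_percent": 0.4, "response_time_ms": 620.0}
print(breached_thresholds(snapshot))
```

Note what this cannot see: a response time of 480ms that used to be 120ms passes the check without a sound.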

Observability is exploratory. You don’t start with a hypothesis. You start with a symptom, such as slow requests, and you dig through logs, traces, and metrics to figure out why.

Think of it this way: monitoring is your smoke detector. Observability is your ability to investigate where the smoke is coming from, how the fire started, and what’s actually burning.

I’ve debugged issues where the root cause was three services deep in a dependency chain. Monitoring caught nothing because I hadn’t thought to monitor for “database query performance when cache is cold after deployment during peak traffic.” Observability let me trace the request through the entire system until I found the bottleneck.

The Metrics That Actually Matter

When you’re starting out, it’s tempting to monitor everything. CPU, memory, disk, network, request count, response time, error rate, queue depth, connection pool size. The list goes on.

But more metrics don’t mean better monitoring. They usually just mean more noise.

Google’s “Four Golden Signals” cover most of what you actually need: latency, errors, traffic, and saturation.

Latency: How long do requests take? Because slow is broken from a user’s perspective.

Errors: What percentage of requests fail? Errors directly impact your users’ experience.

Traffic: How many requests per second are you handling? This tells you about load and capacity.

Saturation: How full are your resources? CPU, memory, disk, network. The things that constrain your system.
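As a rough sketch, all four signals can be derived from one window of request records plus a resource reading. The `Request` shape, the choice of the 95th percentile for latency, counting only 5xx responses as errors, and using CPU as the saturation measure are all assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class Request:
    duration_ms: float
    status: int  # HTTP status code


def golden_signals(requests, window_seconds, cpu_used_cores, cpu_total_cores):
    """Summarise one time window of requests as the four golden signals."""
    if not requests:
        raise ValueError("need at least one request in the window")
    durations = sorted(r.duration_ms for r in requests)
    # Nearest-rank 95th percentile: rank = ceil(0.95 * n), done in integer arithmetic.
    rank = -(-95 * len(durations) // 100)
    failures = sum(1 for r in requests if r.status >= 500)
    return {
        "latency_p95_ms": durations[rank - 1],
        "error_rate": failures / len(requests),
        "traffic_rps": len(requests) / window_seconds,
        "saturation": cpu_used_cores / cpu_total_cores,
    }


# Made-up window: 20 requests over 10 seconds, one of which failed.
window = [Request(duration_ms=10.0 * i, status=200) for i in range(1, 20)]
window.append(Request(duration_ms=200.0, status=500))

signals = golden_signals(window, window_seconds=10, cpu_used_cores=1.5, cpu_total_cores=4)
```

Four numbers per service is usually enough for a dashboard that people will actually read.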

I’ve seen dashboards with 50 graphs. Nobody looked at them because it was overwhelming. I’ve seen dashboards with 4 graphs. Everyone understood system health at a glance.

The goal isn’t comprehensive monitoring. It’s actionable monitoring.

When an alert fires, you should immediately know what’s wrong, where it’s wrong, and what you should do about it. If you can’t answer those questions, your metrics aren’t helping.

Start with the basics. Add complexity only when you actually need it.

Setting Alerts That Don’t Destroy You

Alert fatigue is real, and it’s dangerous.

Get woken up for non-critical issues too many times, and you’ll start ignoring alerts. Then you’ll miss the one that actually matters. It’s not sustainable.

Good alerting follows one simple rule: only alert on things that require immediate human action.

CPU at 60%? Log it. Don’t alert.

CPU at 95% and climbing with no auto-scaling configured? Alert.

One request failed? Log it. Don’t alert.

Error rate at 10% for five minutes straight? Alert.

The question isn’t “is this bad?” The question is “does someone need to wake up and fix this right now?”

If the answer is no, it’s not an alert. It’s a dashboard metric. It’s a log line. It’s data you can review later.
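The “error rate at 10% for five minutes straight” rule can be sketched as a check over recent samples. I’m assuming one error-rate reading per minute, oldest first; the function name and defaults are mine.

```python
def should_page(error_rates, threshold=0.10, sustained_samples=5):
    """Page only if the latest `sustained_samples` readings are all at or above the threshold.

    `error_rates` holds per-minute error rates between 0.0 and 1.0, oldest first.
    """
    if len(error_rates) < sustained_samples:
        return False  # not enough history to call it sustained
    recent = error_rates[-sustained_samples:]
    return all(rate >= threshold for rate in recent)
```

A single bad minute returns False, so it ends up in the logs rather than on someone’s phone at 3am.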

I’ve been on teams that generated 50 alerts a day. We ignored most of them, including the ones that actually mattered. I’ve been on teams with 2 alerts a week. We responded immediately every time.

The difference? The second team only alerted on things that actually required action.

Your alerts should be actionable, urgent, and rare. Everything else is noise.

Debugging the Unknown

Here’s where observability really shines.

A user reports that checkout is slow, but only on mobile, and only in the evening. Your monitoring shows nothing unusual. All your thresholds are green.

With observability, you can ask questions you didn’t prepare for. What was different about requests from mobile devices? What changed in the last deployment? Which service in the chain is adding latency? Are there any patterns in the slow requests?
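Asking an unplanned question often comes down to slicing the same events along a dimension nobody thought to alert on. Here is a minimal sketch with made-up events; the field names `device`, `period`, and `latency_ms` are assumptions, and real tooling would run this as a query against stored traces or logs.

```python
from collections import defaultdict
from statistics import median


def median_latency_by(events, *dimensions):
    """Group request events by the given fields and return the median latency per group."""
    groups = defaultdict(list)
    for event in events:
        key = tuple(event[d] for d in dimensions)
        groups[key].append(event["latency_ms"])
    return {key: median(values) for key, values in groups.items()}


# Made-up events for illustration only.
events = [
    {"device": "mobile", "period": "evening", "latency_ms": 2400},
    {"device": "mobile", "period": "evening", "latency_ms": 2600},
    {"device": "mobile", "period": "morning", "latency_ms": 180},
    {"device": "desktop", "period": "evening", "latency_ms": 210},
    {"device": "desktop", "period": "morning", "latency_ms": 190},
]

by_device_and_period = median_latency_by(events, "device", "period")
slowest = max(by_device_and_period, key=by_device_and_period.get)
```

This only works if each event carries those fields in the first place, which is the instrumentation part.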

You’re not waiting for an alert. You’re actively investigating.

Digging through distributed traces, correlating logs across services, piecing together the story of what actually happened. It’s detective work. And it’s only possible when you’ve instrumented your systems to be observable.

What You Should Actually Do

Start with monitoring. Set alerts for the problems you know about. High error rates, resource exhaustion, service failures. The basics.

Then build in observability. Add structured logging. Implement distributed tracing. Make sure you can explore your system’s behaviour, not just measure predefined metrics.
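Structured logging mostly means emitting machine-readable fields instead of free-text sentences. A minimal sketch using only Python’s standard library follows; the `request_id` and `service` field names are my own choice, and many teams would reach for a dedicated logging library instead.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record):
        payload = {
            "level": record.levelname,
            "message": record.getMessage(),
            # Fields passed via `extra=` become attributes on the record.
            "request_id": getattr(record, "request_id", None),
            "service": getattr(record, "service", None),
        }
        return json.dumps(payload)


stream = io.StringIO()  # stands in for stdout or a log shipper
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("checkout")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False  # keep the example's output in one place

logger.info("payment authorised", extra={"request_id": "req-123", "service": "payments"})

entry = json.loads(stream.getvalue())
```

Once every service logs the same `request_id`, correlating one request across services becomes a filter rather than a guessing game.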

Monitor for the problems you know about. Observe everything else.

And please, be thoughtful about your alerts. Your future self will thank you.

The Real Question

I used to think good engineers prevented all failures. Now I know better.

Good engineers build systems that fail gracefully and provide the tools to understand why.

Because the question isn’t whether things will break. It’s whether you’ll be able to figure out what happened when they do.

The Testing Pyramid: Why Most Teams Get It Backwards

Razeen Abdal-Rahman — Sun, 04 Jan 2026 11:02:29 GMT

I’ve watched countless teams ship broken code to production, then scramble to fix it whilst customers are affected.

The irony? They all had tests. Sometimes hundreds of them.

But here’s the thing: having tests isn’t enough. The type of tests you have, and how you balance them, makes all the difference between catching bugs in seconds and discovering them when a customer rings to complain.

Why Testing Architecture Matters

Most engineering teams understand they need tests. What they don’t realise is that not all tests are created equal.

Some tests run in milliseconds and tell you exactly what broke. Others take ten minutes to run and fail for mysterious reasons that require an archaeology degree to debug.

The testing pyramid isn’t just a nice diagram someone drew at a conference. It’s a fundamental principle that determines whether your test suite helps you move faster or slowly grinds your team to a halt.

Get it wrong, and you’ll spend more time maintaining tests than writing features. Get it right, and you’ll ship code with genuine confidence.

The Pyramid: Fast Tests at the Bottom, Slow Tests at the Top

The testing pyramid has three layers, and each serves a specific purpose.

Unit tests form the foundation. These test individual functions in isolation. They run in milliseconds. Thousands of them can execute in seconds. When one fails, it tells you exactly which function broke and why.

This is your early warning system. You change a calculation, a unit test fails, you fix it immediately. The feedback loop is instant.
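A unit test at this level might look like the sketch below. The `apply_discount` function is a made-up example of “a calculation”, and the tests are plain functions of the kind a runner such as pytest would discover.

```python
def apply_discount(price_pence, percent):
    """Return the price after a percentage discount, rounded down to whole pence."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return price_pence * (100 - percent) // 100


def test_ten_percent_off():
    assert apply_discount(2000, 10) == 1800


def test_rounds_down_to_whole_pence():
    assert apply_discount(999, 15) == 849


def test_rejects_negative_discount():
    try:
        apply_discount(2000, -5)
    except ValueError:
        return
    raise AssertionError("expected ValueError for a negative discount")
```

No database, no network, no browser. If someone changes the rounding, exactly one test fails and its name says why.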

Integration tests sit in the middle. These verify that your components actually work together. Does your service talk to the database correctly? Do your API calls return what you expect? Can your message queue handle the load?

These tests are slower because they involve real infrastructure. But they catch a different class of problems: the “it works in isolation but fails when connected” bugs that unit tests miss.
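As a small sketch of the idea, the test below exercises real SQL against an in-memory SQLite database rather than a mock. The `orders` table and both helper functions are invented for the example; a real integration test would usually target the same database engine you run in production.

```python
import sqlite3


def save_order(conn, customer, total_pence):
    """Insert an order and return its generated id."""
    cursor = conn.execute(
        "INSERT INTO orders (customer, total_pence) VALUES (?, ?)",
        (customer, total_pence),
    )
    conn.commit()
    return cursor.lastrowid


def load_order(conn, order_id):
    """Fetch an order as a (customer, total_pence) tuple, or None if it is missing."""
    return conn.execute(
        "SELECT customer, total_pence FROM orders WHERE id = ?", (order_id,)
    ).fetchone()


def test_order_round_trip():
    conn = sqlite3.connect(":memory:")  # a real SQL engine, just not a persistent one
    conn.execute(
        "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, total_pence INTEGER)"
    )
    order_id = save_order(conn, "alice", 1800)
    assert load_order(conn, order_id) == ("alice", 1800)
    assert load_order(conn, order_id + 1) is None
    conn.close()
```

A typo in the column name or a wrong parameter order would pass any unit test that mocks the connection, and fail here.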

End-to-end tests form the top. These simulate actual user behaviour. Click a button, fill a form, verify the entire flow works. They’re slow, fragile, and expensive to maintain.

But they’re also the only tests that prove your application actually does what users need it to do.

The pyramid shape isn’t arbitrary. It reflects the cost and value of each test type. You want lots of cheap, fast tests catching most bugs, with progressively fewer expensive tests handling scenarios that really require them.

The Inverted Pyramid Problem

Here’s what actually happens in most organisations.

They start with end-to-end tests because those feel “real.” Manual testing through the UI. QA teams clicking through flows. Automated Selenium tests that take twenty minutes to run and break when someone changes a CSS class.

Meanwhile, unit tests are an afterthought. “We’ll add those later when we have time.”

Except “later” never comes.

Writing tests after you’ve written the code is exponentially harder than writing them alongside the code. You’ve moved on mentally. The context is gone. You’re already thinking about the next feature.

Plus, code written without testing in mind is usually a nightmare to test. Tight coupling everywhere. Hidden dependencies. Side effects you didn’t document. Retrofitting tests means refactoring first, which means “add tests” just became “rewrite the entire module.”
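A hidden dependency can be as small as a call to the system clock. The contrast below is a minimal sketch with made-up function names: the first version can only be tested by waiting for the right time of day, while the second takes the time as an argument.

```python
from datetime import datetime, timezone


# Hard to test: the clock is a hidden dependency baked into the function.
def is_peak_hours_coupled():
    return 17 <= datetime.now(timezone.utc).hour < 21


# Easier to test: the dependency is passed in, so a test can supply any time.
def is_peak_hours(now):
    return 17 <= now.hour < 21


def test_is_peak_hours():
    assert is_peak_hours(datetime(2026, 1, 4, 18, 30, tzinfo=timezone.utc))
    assert not is_peak_hours(datetime(2026, 1, 4, 9, 0, tzinfo=timezone.utc))
```

Written test-first, you tend to arrive at the second shape naturally. Retrofitted, every caller of the first version has to change.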

So it gets postponed. The technical debt compounds. The codebase becomes increasingly fragile.

I’ve inherited projects with zero unit tests and hundreds of end-to-end tests. Every change was terrifying. You’d make a small update, run the test suite, wait thirty minutes, then discover you’d broken something in a completely unrelated part of the application.

The feedback loop was so slow that by the time you found the bug, you’d forgotten what you’d changed.

Automation: The Safety Net That Lets You Move Fast

The real power of the testing pyramid emerges when you automate it in your CI/CD pipeline.

Push code. Tests run automatically. If they fail, the pipeline stops. Bad code never reaches production.

This happens dozens or hundreds of times per day. Every single change gets validated before it goes anywhere.

Without automation, you’re relying on humans to remember to run tests. Check results. Decide whether it’s safe to deploy. Humans forget. Get busy. Take shortcuts when deadlines loom.

Automated testing in pipelines means you don’t rely on memory or discipline. The pipeline won’t let you deploy broken code even if you try.

I’ve pushed code that would have caused production outages. The pipeline caught it within seconds. That’s exactly the point: catch mistakes before they matter, not after they’ve affected customers.

Fast, automated feedback is what makes frequent deployments safe. Without it, you’re either deploying rarely (and batching up risk) or deploying frequently (and hoping for the best).

Neither option is sustainable.

What Actually Works

If you’re building a testing strategy from scratch, or fixing one that’s already inverted, here’s what I’d recommend.

Start with unit tests. Write them as you code, not after. Test business logic, calculations, transformations, anything with clear inputs and outputs. Aim for hundreds or thousands of these.

Add integration tests for critical paths. Database operations. External API calls. Message queue interactions. You don’t need integration tests for everything, just the connections that matter.

Use end-to-end tests sparingly. Pick the handful of user journeys that absolutely must work. Login. Checkout. Critical workflows. Keep them stable and maintain them properly.

Integrate everything into CI/CD. Tests should run automatically on every push. Fast tests first, slow tests later. If unit tests fail, don’t bother running the expensive end-to-end suite.
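The “fast tests first, stop on failure” ordering is simple enough to sketch directly. The stage names and stand-in checks below are illustrative; in a real pipeline this logic lives in your CI system’s configuration rather than in a script.

```python
def run_pipeline(stages):
    """Run (name, check) stages in order and stop at the first failure.

    Returns the names of the stages that actually ran and whether all of them passed.
    """
    ran = []
    for name, check in stages:
        ran.append(name)
        if not check():
            return ran, False
    return ran, True


# Stand-in checks; each would invoke a real test suite in practice.
stages = [
    ("unit", lambda: False),        # fast, runs first, fails in this example
    ("integration", lambda: True),
    ("end-to-end", lambda: True),   # slow and expensive, never reached here
]

ran, passed = run_pipeline(stages)
print(ran, passed)
```

The expensive suite never starts, so the failure comes back in seconds rather than after the full run.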

The goal isn’t 100% coverage. It’s confidence. Can you deploy on Friday afternoon without sweating through the weekend?

The Real Cost of “We’ll Test It Later”

Testing isn’t about slowing down to be careful. It’s about building the safety net that lets you move fast.

Teams that test as they go ship faster and with more confidence. Not despite the time spent on tests, but because of it.

The alternative is living in constant fear of your own codebase. Every change is a gamble. Every deployment is stressful. Every bug that reaches production chips away at your reputation.

I’ve been on both sides. Codebases with good test coverage feel different. You refactor confidently. You ship features without anxiety. You sleep well after deployments.

The testing pyramid isn’t complicated. It’s just deliberate. Most teams get it backwards not because they don’t understand it, but because they never stopped to think about what they were building.

So here’s the question: what shape is your pyramid?

Git for DevOps: More Than Just Code Version Control

Razeen Abdal-Rahman — Sun, 28 Dec 2025 13:40:28 GMT

I’ve been asked the same question multiple times: “What’s the one skill that’s actually non-negotiable in DevOps?”

Everyone always says Linux: you need to be comfortable on the command line.

But there is another skill that doesn’t get enough recognition: Git.

Not because it’s more important than Linux. But because people underestimate how fundamental it is to everything we do.

Not because Git is exciting or trendy. But because everything else breaks without it.

Why Version Control Is Your Foundation

When I started in DevOps, I thought version control was just for developers keeping track of their code.

I was wrong.

In DevOps, version control is your time machine, your audit trail, and your collaboration platform rolled into one. It’s the foundation everything else builds on.

Here’s what actually belongs in version control:

Your infrastructure definitions
Your configuration files
Your deployment scripts
Your documentation
Your monitoring rules
Your database migration scripts
Everything

If it configures something, it belongs in Git.

I’ve seen teams who understood this principle and teams who didn’t. The difference is staggering.

The teams who version control everything can answer four critical questions within minutes:

What changed?
When did it change?
Who changed it?
Why did they change it?

The teams who don’t? They’re guessing. They’re searching through backup folders. They’re hoping someone remembers what the configuration looked like last Tuesday.

The Git Workflow That Actually Works

Most DevOps teams I’ve worked with use a similar workflow, and there’s a reason why.

You create a feature branch from main. You make your changes. You commit frequently with clear, descriptive messages that future you will actually understand.

Then you open a pull request.

This is where the magic happens. Your teammates review your changes. Automated tests run in your CI pipeline. Both have to pass before anything reaches production.

This workflow does something clever: it catches problems before they become incidents.

Code review spots logic errors, security issues, and configuration mistakes that one person might miss. Automated tests validate that nothing broke. And because you’re merging small changes frequently, conflicts are rare and easy to resolve when they do happen.

I’ve worked in teams where pull requests sat open for days. Progress was glacial. Everyone was blocked waiting for reviews.

I’ve also worked in teams where pull requests merged within hours. We moved incredibly fast whilst maintaining stability.

The difference wasn’t talent or team size. It was discipline around the workflow.

Keep your branches short-lived. Branch for a few hours or a day, not weeks. Small changes are easier to review, easier to test, and safer to deploy.

When something does go wrong in production, you can revert quickly. That’s not possible with massive, week-long branches.

Git History: Your Secret Debugging Weapon

Here’s something that took me embarrassingly long to learn: Git history is your most valuable debugging tool.

Something breaks in production at 2am. What changed?

git log shows you every commit. When it happened, who did it, what they were trying to accomplish.

git diff shows you exactly what code or configuration changed between the working state and the broken state.

git blame (terrible name, great tool) shows you who last touched each line, not for pointing fingers, but for finding the person who has context.
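To show how directly Git history answers the what, when, who, and why questions, here is a small sketch that parses one-line-per-commit output into records. The sample text is made up, and the tab-separated layout is one I’ve chosen for the example (the comment shows a format string that should produce it).

```python
from dataclasses import dataclass

# One way to produce this layout (tab-separated hash, author, ISO date, subject):
#   git log --pretty=format:%h%x09%an%x09%ad%x09%s --date=iso-strict


@dataclass
class Commit:
    short_hash: str  # what changed (inspect it with `git show <hash>`)
    author: str      # who changed it
    date: str        # when it changed
    subject: str     # why, if the message was written well


def parse_log(text):
    """Parse tab-separated `git log` output into Commit records, newest first."""
    commits = []
    for line in text.splitlines():
        if not line.strip():
            continue
        short_hash, author, date, subject = line.split("\t", 3)
        commits.append(Commit(short_hash, author, date, subject))
    return commits


# Made-up sample output for illustration only.
sample = (
    "a1b2c3d\tSam\t2026-01-03T22:14:05+00:00\tRaise DB timeout from 30s to 60s\n"
    "9f8e7d6\tAlex\t2026-01-02T10:01:44+00:00\tBump requests library version\n"
)

latest = parse_log(sample)[0]
```

At 2am, the first record in that list is usually the first suspect.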

I’ve spent hours debugging issues that took minutes once I actually looked at Git history.

“Oh, we changed the database connection string yesterday. That’s probably it.”

“Someone updated this library version last week. Let’s check if that broke compatibility.”

“We modified this config file for an unrelated feature. It might have side effects.”

Version control doesn’t just save your code. It saves context. And context transforms a mysterious production incident into a clear problem with an obvious solution.

Every senior DevOps engineer I know who’s brilliant at debugging is also brilliant at reading Git history. That’s not a coincidence.

Infrastructure as Code Changes Everything

This is where version control becomes truly non-negotiable in DevOps.

Your infrastructure definitions, whether that’s Terraform, CloudFormation, Ansible playbooks, or Kubernetes manifests, all need to live in Git.

Why? Because infrastructure needs the same guarantees as code.

When you version control your infrastructure, you can review changes before they’re applied. You can test them in lower environments. You can roll back when something goes wrong.

Without version control, someone manually clicking through a cloud console can break production with zero record of what they changed.

With version control, every infrastructure change goes through the same rigorous workflow as your application code.

I’ve seen this save teams from disasters more times than I can count. Someone tries to scale up a database. The change is reviewed. Someone spots they’ve selected the wrong instance type. Disaster averted before any damage is done.

What You Can Do Now

If you’re trying to break into DevOps, here’s what this means for you practically.

Put everything in version control. If you’re working on personal projects or lab environments, resist the temptation to make quick changes directly on servers. Practise the discipline now.

Learn to write good commit messages. “Fixed bug” tells future you nothing. “Changed database timeout from 30s to 60s to prevent connection drops under load” tells a complete story.

Familiarise yourself with Git history commands. Not just git log, but git log -p to see the actual changes, git log --grep to search commit messages, and git blame to find context.
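For reference, the history commands worth practising can be written down as argument lists. The wrapper functions are hypothetical helpers of my own; the Git flags themselves (`-p`, `--grep`, `-i`, and `--` before a path) are standard. `run_git` assumes Git is installed and that `repo_dir` is a repository.

```python
import subprocess


def log_with_patches(path):
    """Commits touching `path`, each followed by its diff."""
    return ["git", "log", "-p", "--", path]


def log_matching(pattern):
    """Commits whose message matches `pattern`, ignoring case."""
    return ["git", "log", "-i", f"--grep={pattern}"]


def blame(path):
    """Who last changed each line of `path`, and in which commit."""
    return ["git", "blame", "--", path]


def run_git(argv, repo_dir):
    """Run a git command inside `repo_dir` and return its output as text."""
    result = subprocess.run(argv, cwd=repo_dir, capture_output=True, text=True, check=True)
    return result.stdout
```

For example, `run_git(log_matching("timeout"), ".")` would list every commit in the current repository whose message mentions a timeout.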

And here’s the thing nobody tells you: this skill compounds. The better you get at using Git properly, the more you’ll rely on it, and the better you’ll become at collaborating, debugging, and moving fast safely.

The Foundation Everything Else Builds On

Git isn’t the most exciting part of DevOps.

But it’s the foundation that makes everything else possible. Your CI/CD pipeline needs something to pull from. Your infrastructure as code needs something to track changes. Your disaster recovery plan needs something to revert to.

Without version control, you’re building on sand.

With it, you have a solid foundation that lets you move fast, collaborate effectively, and sleep soundly knowing you can always go back in time if something breaks.

I’ve never met a DevOps engineer who regretted learning Git properly. But I’ve met plenty who wish they’d learnt it sooner.

What’s one thing you’re not version controlling yet that you probably should be?

Infrastructure as Code: Why ClickOps Doesn’t Scale

Razeen Abdal-Rahman — Sat, 20 Dec 2025 13:32:16 GMT


A colleague once told me how he spent three days manually rebuilding a production environment from memory.

He clicked through AWS console screens, cross-referenced old screenshots, and tried to remember which settings he’d changed six months earlier. He got close. But “close” isn’t the same as “identical,” and those small differences caused issues for weeks afterwards.

That’s when I understood that clicking buttons might feel productive, but it’s technical debt disguised as progress.

The Problem with Manual Infrastructure

Manual infrastructure configuration feels intuitive because it mirrors how we interact with most software. You see a button, you click it. You need a setting changed, you change it. It’s immediate and satisfying.

But infrastructure isn’t like writing a document or sending an email. It’s the foundation that everything else runs on, and foundations need to be precise, documented, and reproducible.

The real problems emerge gradually. At first, you configure one server manually and it works perfectly. Then you need another server with similar settings. You try to replicate what you did, but you can’t quite remember every detail. You make your best guess.

Now you have two servers that are mostly the same.

Multiply that across dozens of resources (databases, networks, security groups, load balancers) and you’ve created an environment that nobody fully understands. Documentation goes stale the moment someone makes an undocumented change. Different engineers configure things slightly differently based on their interpretation of requirements.

Production and staging environments drift apart because there’s no single source of truth.

When something breaks, you can’t easily see what changed because changes weren’t tracked. You’re left troubleshooting based on hunches and tribal knowledge.
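Drift is easy to describe and tedious to spot by eye. As a sketch, if each environment’s settings could be exported as a flat dictionary, the comparison is a few lines. The settings below are made up, and real environments are nested and far larger, which is exactly why nobody does this reliably by hand.

```python
def find_drift(reference, candidate):
    """Return {setting: (reference_value, candidate_value)} for every setting that differs.

    A setting missing from one side is reported with None on that side.
    """
    drift = {}
    for key in sorted(set(reference) | set(candidate)):
        left, right = reference.get(key), candidate.get(key)
        if left != right:
            drift[key] = (left, right)
    return drift


# Made-up settings for illustration only.
production = {"instance_type": "m5.large", "tls": "1.2", "max_connections": 1000}
staging = {
    "instance_type": "m5.large",
    "tls": "1.3",
    "log_level": "debug",
    "max_connections": 1000,
}

drift = find_drift(production, staging)
```

Two small differences in four settings. Scale that to hundreds of resources and “mostly the same” stops meaning anything.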

What Infrastructure as Code Actually Solves

Infrastructure as Code means defining your infrastructure in text files instead of clicking through web consoles. You describe what you want (servers, networks, databases, configurations) and the IaC tool creates it.

But IaC isn’t primarily about automation. It’s about reproducibility.

Code is version controlled. Every change is a commit with a timestamp and author. You can see exactly what changed, when, and why.

Code is reviewable. Infrastructure changes go through the same review process as application code. Another engineer looks at your changes before they’re applied.

Code is repeatable. You can delete an entire environment and recreate it identically by running a single command. No guesswork, no missing steps, no “I think it was configured like this.”

The first time you delete a staging environment and bring it back up in twenty minutes with perfect confidence that it matches production, you understand why IaC is fundamental to modern DevOps.

It’s not about being clever with code. It’s about making infrastructure predictable, reliable, and auditable.

The Tools Landscape

The main IaC tools each take different philosophical approaches, but they solve the same core problem.

Terraform is the industry standard. It’s cloud-agnostic, declarative, and has massive community support. You write configuration files describing your desired state, and Terraform works out how to achieve it. It works with AWS, Azure, GCP, and hundreds of other providers. If you’re learning IaC, start here.

CloudFormation is AWS-specific but deeply integrated with AWS services. It’s free, well-documented, and often gets support for new AWS features before third-party tools do. If you’re committed to AWS and want native integration, CloudFormation is powerful. But you’re locked into the AWS ecosystem.

Pulumi lets you write infrastructure using real programming languages: Python, TypeScript, Go, C#. Instead of learning a domain-specific language, you use tools you already know. It’s newer and has a smaller community, but it’s growing rapidly. If you’re a developer who prefers actual code to configuration files, Pulumi feels natural.

Which should you learn?

Terraform is the safest bet for career development. It’s widely used, cloud-agnostic, and the skills transfer across companies and environments.

But as with all things in DevOps, the tool matters less than the concepts. Understanding declarative configuration, state management, and idempotency is more valuable than mastering any specific tool’s syntax. Those concepts transfer across all IaC platforms.
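Those three concepts fit in a toy reconciler. This is a deliberately simplified sketch, not how Terraform or any other tool is implemented: `desired` plays the role of your configuration files, `actual` stands in for recorded state, and the resource names are made up.

```python
def plan(desired, actual):
    """Compare desired and actual state and return the actions needed to converge."""
    actions = []
    for name in sorted(desired.keys() - actual.keys()):
        actions.append(("create", name))
    for name in sorted(desired.keys() & actual.keys()):
        if desired[name] != actual[name]:
            actions.append(("update", name))
    for name in sorted(actual.keys() - desired.keys()):
        actions.append(("delete", name))
    return actions


def apply_plan(desired, actual):
    """Carry out the plan against `actual`, a dict standing in for real infrastructure."""
    for action, name in plan(desired, actual):
        if action == "delete":
            del actual[name]
        else:
            actual[name] = dict(desired[name])
    return actual


# Declarative: you state what should exist, not the steps to get there.
desired = {"web": {"size": "small", "count": 2}, "db": {"size": "large", "count": 1}}
actual = {"web": {"size": "small", "count": 1}, "cache": {"size": "small", "count": 1}}

first_plan = plan(desired, actual)
apply_plan(desired, actual)
second_plan = plan(desired, actual)  # idempotent: nothing left to do
```

The first plan creates `db`, updates `web`, and deletes `cache`. The second plan is empty, which is what idempotency means in practice: running the same configuration again changes nothing.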

Making the Shift

Moving from ClickOps to IaC requires a mindset change as much as a technical one.

You trade immediate gratification for long-term reliability. Writing code to create infrastructure takes longer initially than clicking buttons. But you’re paying that time cost once instead of paying it repeatedly every time you need to replicate, troubleshoot, or audit your infrastructure.

Start small. Don’t try to code your entire infrastructure at once. Pick one new resource (a single server, a database, or a network configuration) and define it as code. Deploy it. Learn how the tool works.

Then incrementally bring existing resources under IaC management. Most tools can import existing infrastructure and generate configuration from it.

Review infrastructure changes like you review code changes. Make infrastructure modifications a team activity, not a solo operation. That review process catches mistakes before they reach production and spreads knowledge across the team.

Document your conventions. How do you name resources? How do you structure your code? What’s your branching strategy? These decisions matter more than you’d think, especially as your infrastructure grows.
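Conventions hold up better when they can be checked automatically. As a rough sketch, assuming a made-up naming scheme of environment, service, and resource type joined by hyphens, a few lines of Python can flag names that drift from it. The pattern and the example names are hypothetical; substitute whatever your team agrees on.

```python
import re

# Hypothetical convention: <environment>-<service>-<resource type>, lowercase.
NAME_PATTERN = re.compile(r"(dev|staging|prod)-[a-z0-9]+-(vm|db|net)")


def invalid_names(names: list[str]) -> list[str]:
    """Return the names that do not follow the convention."""
    return [n for n in names if not NAME_PATTERN.fullmatch(n)]


print(invalid_names(["prod-billing-db", "Staging_Web", "dev-api-vm"]))
```

A check like this can run alongside code review, so naming drift is caught before a resource is created rather than after.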

The Long View

Infrastructure as Code isn’t a nice-to-have anymore. It’s how modern infrastructure is built and managed.

The organisations doing DevOps well treat their infrastructure code with the same rigour as their application code. Version control, code review, automated testing, documentation: all of it applies.

The organisations still doing ClickOps are accumulating technical debt with every manual change. That debt compounds, and eventually it becomes expensive enough that they’re forced to address it reactively instead of proactively.

You don’t want to be the engineer who has to untangle years of undocumented manual changes.

Better to learn IaC now whilst you’re building your foundation. The concepts transfer across tools and clouds. The practices become second nature. And you’ll never again spend three days manually rebuilding an environment from memory.

The question isn’t whether you’ll eventually need to learn Infrastructure as Code. It’s whether you’ll learn it now whilst it’s manageable, or later when you’re already drowning in technical debt.