A DevOps rant: the sh** I’ve seen

I believe this is my first blog post ever, since I can't find anything else I've written on the Interwebs. If there is something, I've probably forgotten about it. So, a bit about myself. People who know me will tell you I rant about pretty much everything, mostly about people not doing their jobs, so I'll continue to do that online as well. I'll occasionally write a helpful article with a tutorial for something, but mostly I'll be writing about my experiences with system operations/SRE, or as they'd say today, DevOps (God, I hate this "title" so much). The stuff in this article is just the tip of the iceberg of what I've seen over the years. There's more; I could probably write a book, but I'm not that much of a writer, as you'll see below.

Service defaults

When it comes to creating a new service, there's no better time to set things up the way you like. You've worked on something before that just wasn't right, something that made you want to quit, vomit or smash your laptop with a hammer every time you looked at the code.

So, you'll start creating something new to ease your job, but here's where most developers create a major problem (salute) for everyone else. They tend to build their perfect setup, usually targeting development, and in the process forget that this new service might end up in production. The setup is usually poorly documented, barely automated, and getting it to work requires a goat sacrifice to whatever you believe in. Your MacBook Pro isn't going to be shipped to AWS or Google Cloud to serve traffic, so the phrase "it works on my machine" doesn't work for me. When I take the repository with your service, I want to run one command and be done with it. If it's running in a container, great. If not, you'd better have a startup script that sets up everything. Reading through README.md files is a waste of my time, time I could have used for something better than watching you invent yet another way of setting up a Node, GoLang, Scala/Java service with unnecessary manual steps.

So, when you start writing your new service, think of the production environment first, then think about your development setup. Read about the twelve-factor app or cloud-native application development, and don't be that guy.

Configuration

One of the principles of designing a modern service is to split configuration from the service itself. By default, it's done either by using environment variables when starting the service or by having some kind of process provide a configuration file, whose contents are usually secret, to the service. The former is usually preferred due to simplicity, but if someone insists on the latter, there are plenty of ways to achieve that. However, since I've always run into problems with environment variables, that's what I'll be ranting about.
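
To make that concrete, here's a minimal Go sketch of the env-variable approach. The variable names DATABASE_URL and PAYMENT_API_TOKEN are made up for the example; the point is that anything secret or environment-specific comes from the environment, not from code.

```go
package main

import (
	"log"
	"os"
)

func main() {
	// Environment-specific and secret values come from the environment,
	// not from a file baked into the image or a constant in code.
	dbURL, ok := os.LookupEnv("DATABASE_URL")
	if !ok {
		log.Fatal("DATABASE_URL is not set") // fail fast and loudly
	}
	apiToken := os.Getenv("PAYMENT_API_TOKEN") // example of a secret

	_ = dbURL
	_ = apiToken
	// ... pass these into your service setup ...
}
```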

So, to start, how many variables is too many? Have you ever seen a configuration with 300 environment variables, some of them multiline walls of text? I have. When you decide what needs to be extracted into an environment variable, figure out what's environment-specific and extract that. For example, say you have an S3 bucket with multiple folders for uploaded/resized images, and three environments. The folders look like this:

https://images-bucket.s3.something.com/image

https://images-bucket.s3.something.com/thumbnails

https://images-bucket.s3.something.com/avatars

You don't need three variables per environment for these, you only need one: the bucket endpoint. If your service is designed with any common sense, your environments will have an identical folder structure, so the only thing that varies is the bucket per environment. This also becomes important if you're using K8s or something similar, where your multiple environments can have almost identical configuration. Think of using docker-compose for your local development and having the DB host named just db. Now you can have the same on production or staging, and you can default the hostname in code and override it with environment variables only when necessary. The same goes for port numbers. Why do you need 5432 for Postgres or 6379 for Redis exposed as environment variables? Those are well-known ports and should be defaulted inside your configuration, only overridden if, for any reason, you need them to be different. So, think about this the next time you start coding a new service.
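
A rough sketch of that defaulting pattern in Go. The variable names and the envOr helper are just for illustration, not anyone's real config:

```go
package main

import (
	"fmt"
	"os"
)

// envOr returns the value of an environment variable, or a default
// when it isn't set. Helper name is made up for this example.
func envOr(key, def string) string {
	if v, ok := os.LookupEnv(key); ok {
		return v
	}
	return def
}

func main() {
	// One environment-specific value: the bucket endpoint. The folder layout
	// is identical across environments, so it stays in code.
	bucket := envOr("IMAGES_BUCKET_ENDPOINT", "https://images-bucket.s3.something.com")
	images := bucket + "/image"
	thumbnails := bucket + "/thumbnails"
	avatars := bucket + "/avatars"

	// Well-known defaults, overridable only when they really differ.
	dbHost := envOr("DB_HOST", "db")         // matches the docker-compose service name
	dbPort := envOr("DB_PORT", "5432")       // Postgres default
	redisPort := envOr("REDIS_PORT", "6379") // Redis default

	fmt.Println(images, thumbnails, avatars, dbHost, dbPort, redisPort)
}
```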

Dependencies

For God's sake, do the planets need to align for the service to work? You didn't declare your library dependencies correctly? It'll probably work fine for a while, until someone pushes a new version of one of your dependencies and breaks everything.

A new developer joins the company, sets up their machine and starts cloning projects. They try to run the service for the first time and it doesn't work: the dependencies installed on their machine aren't right. But it works on your machine. Just because you couldn't be bothered to use a package manager with lock files, and you never delete your package manager cache. Running build procedures inside containers is one way to check whether you locked dependencies properly, since everything will be built from scratch. Unless you mount node_modules or the ivy cache inside the container. Don't do that. Please.
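
In Go terms, for example, this is just declaring your dependencies in go.mod and committing go.sum, which pins exact content hashes, so a clean container build resolves the same versions every time. The module path and versions below are invented for the example:

```
// go.mod -- the manifest; the committed go.sum is what locks exact hashes.
module example.com/myservice

go 1.22

require (
	github.com/lib/pq v1.10.9
	github.com/prometheus/client_golang v1.19.0
)
```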

Telemetry and logs?

You weren't thinking about production at any point during design? Then I'm not going to think about saving your service when it starts crashing down in a burning wreckage. The developers decided to follow everyone else and run some new services on K8s. Great, good for you, moving up in the world. But you forgot to include any kind of telemetry in your service? I'm not a fortune teller who can predict when your service will crash, and if it does, I'll make sure you're the one waking up at 3 AM to fix it. If you're gonna be running on K8s, exposing metrics to Prometheus is a couple of lines of code, and then we can do all kinds of alerting, scaling and restarting when things go bad. Or before they go bad.
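
Here's roughly what those couple of lines look like with the client_golang library; the metric name, label and port are made up for the example:

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// requestsTotal counts handled requests, labeled by HTTP status.
// The metric name is just an example; pick whatever fits your service.
var requestsTotal = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "myservice_http_requests_total",
		Help: "Total HTTP requests handled, by status code.",
	},
	[]string{"status"},
)

func main() {
	prometheus.MustRegister(requestsTotal)

	http.HandleFunc("/work", func(w http.ResponseWriter, r *http.Request) {
		// ... do the actual work ...
		requestsTotal.WithLabelValues("200").Inc()
		w.WriteHeader(http.StatusOK)
	})

	// Prometheus scrapes this endpoint; alerting and scaling rules build on it.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```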

Oh, and one more thing: logging a line for every single successful health check. Does that look helpful to you? To me, it looks like you're trying to kill our logging stack and make my eyes bleed when I'm debugging. The check probably isn't even doing anything properly; it's just returning 200. Anyway, you don't need to log every successful request, you need to log when things go bad. And when they go bad, the service will be restarted anyway, so the event should tell me why it failed. The same goes for everything else you log. Also, log levels. Does anyone still use those? People have forgotten the basics. If you're logging every request just because, it's like having TRACE on all the time. I've seen services that log nothing at all and services that log everything. Neither is helpful, and both look like someone wasn't thinking when they wrote them. Use proper log levels, people!
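
A minimal sketch of what I mean, using Go's standard log/slog; any structured logger with levels works the same way, and the LOG_LEVEL knob is just an example:

```go
package main

import (
	"log/slog"
	"os"
)

func main() {
	// Default to INFO in production; flip to DEBUG via an environment
	// variable only when you actually need it, not with hardcoded spam.
	level := slog.LevelInfo
	if os.Getenv("LOG_LEVEL") == "debug" {
		level = slog.LevelDebug
	}
	log := slog.New(slog.NewJSONHandler(os.Stdout, &slog.HandlerOptions{Level: level}))

	// Noise: nobody needs this on every successful request.
	log.Debug("health check ok", "status", 200)

	// Signal: this is what I want to see when the service dies at 3 AM.
	log.Error("database connection failed", "host", "db", "err", "connection refused")
}
```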

Manual interventions

Your database schema migration is manual? You're doing it wrong. You need to manually create exchanges on the messaging middleware because you half-assed your code? Still wrong. If you go the microservices way, your microservices need to handle schema or data migrations themselves. There are a lot of tools that let you do this, like Flyway or Liquibase. If you're using Rails, it already has everything in place. If you're doing everything with containers, and hopefully using Kubernetes, you can wire your migrations into service startup, so a new version deployment will start up a new container and execute migrations beforehand. Or you can handle schema migrations as one-off jobs. Whatever floats your boat. Just don't do it manually. You will probably forget half of the stuff that needs to be done, and your production and staging environments will start to drift apart really fast. Something will be working in production but failing miserably on staging. When the service fails in the middle of the night, you don't want to dig out runbooks, look for that specific event and then go through the steps to restore operation.

Schema migrations and automation in general can be scary for people at first. I get it. You're handing over database manipulation to a script you or someone else wrote, it'll create everything the service needs when it starts, and maybe it wasn't tested properly, or at all. But that's just it: if it hurts, do it more often and you'll get better at it. You'll get used to it, and later you'll be wondering how the fu** you ever worked without it.
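
For a Go service, running migrations on startup with the golang-migrate library might look roughly like this; the migrations directory and the DATABASE_URL variable are assumptions for the example:

```go
package main

import (
	"log"
	"os"

	"github.com/golang-migrate/migrate/v4"
	_ "github.com/golang-migrate/migrate/v4/database/postgres" // Postgres driver
	_ "github.com/golang-migrate/migrate/v4/source/file"       // read migrations from disk
)

func main() {
	// Apply any pending schema migrations before serving traffic.
	m, err := migrate.New("file://migrations", os.Getenv("DATABASE_URL"))
	if err != nil {
		log.Fatalf("migrations: %v", err)
	}
	if err := m.Up(); err != nil && err != migrate.ErrNoChange {
		log.Fatalf("migrations: %v", err)
	}
	// ... start the HTTP server, consumers, etc. ...
}
```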

Enough from me for now. If I forgot to rant about something, I'll probably have a part deux. Feel free to keep on ranting in the comments, I'm sure everyone has something to rant about. Now I need to get back to work, you know, with a whip and a grumpy look.
