How to bring your infrastructure to the next level

Fred Wynyk
Published in OneFootball Tech
13 min read · Jul 31, 2018

“I choose a lazy person to do a hard job. Because a lazy person will find an easy way to do it.” — Bill Gates

Couldn’t agree more with Uncle Bill. But don’t get me wrong: it’s not always lazy people who find the best solutions. You have to build up a lot of knowledge and, above all, you need the right mindset to choose and apply the right technology, and that can take real effort. But in the end, “easy mode” should be what drives your day-to-day.

This article aims to motivate you to move to new technologies and to show you the benefits of doing so. So take some time for yourself and use this article to think a little bit deeper: first, because change starts with you, and second, because having good tools makes things a lot easier.

I also hope to answer some whys and hows for devs and sysadmins who are somewhat afraid of change, from a DevOps perspective (DevOps as a culture, really). And I’ll try to show how laziness can help with that.

Why do I need to move?

Imagine that everything is running well in your company. Two weeks without a single alert. No pages while on call. You just go to work, grab some coffee, open your monitoring systems: everything is green, graphs without spikes. Your weekends are quiet. Your only job now is to deliver new features. Nobody is bothering you. The perfect dream of any IT professional.

Well, things are just not this way. You have broken builds. You have long deployments. You have tons and tons of code written by people who just tried to “solve” their own problems. You have alerts buzzing your phone. You have to deal with legacy. With some luck, you have some automation systems, with a bazillion lines of legacy code (that probably don’t work) to manage. And in the middle of the chaos, yes, you need to deliver something. Sound familiar?

This is reality for many devs and sysadmins: working in an almost radioactive environment that everyone is afraid to touch because it can break at any time, and receiving alerts out of nowhere on so many different channels that no one can tell what is a real problem and what isn’t.

And then some say: “I don’t want to change. It’s working today. I know how to deal with the current setup. I know how to quickly solve these issues.” Yeah… sometimes I prefer to deal with logic rather than people. This kind of laziness is bad. It produces a thing that nobody likes: surprises. No one likes the goosebumps of on-call alerts on the weekend. Or maybe in the office: do you know that moment when some dark forces come and hand you an urgent production problem to solve? That moment when all your motivation to deliver the sprint goes downhill. And living like this for a long time can drain all your strength and push you out of the game. It’s just sad.

This is the result of many poorly thought-out decisions. Communication problems. Solutions delivered in the rush of the moment. Limited knowledge. Disputes between departments. And the list doesn’t stop there.

And there are more problems: experienced people leave the company and take some of its knowledge with them. You have company changes. You have OKR changes. Change is the only constant. Yeah, it’s hard. But what should you do in this crazy business/tech world? Wait for other people to come and solve your problems?

It’s time for an undeniable truth: no one cares about you but yourself. So don’t expect anyone but yourself to feel your pain and solve your problems. It’s easy to say it’s the company’s fault, or someone else’s. It’s easy and comfortable to complain and expect someone to bring solutions. But how far is that dream from reality?

So, instead of waiting for some saviour, try to change something on your side first. If you have that feeling that ‘someone needs to do something’, or even if you are trying to get out of a similar situation, start by being honest with yourself. Take time to self-reflect. How proud are you of yourself and what you have built? How proud are you of your ecosystem? How proud are you of your team?

Well, you can open a beer at the end of the day and say that life sucks. Some people may prefer to look for other opportunities without even trying anything. And maybe similar problems will be waiting for you in the next place; you never know. In the end, the decision is up to you. Or maybe you can try to do things in a different way. Why not? Let’s start with some basics.

How do you monitor?

How do you see if something is wrong? By CPU usage? Number of requests? Do you have in-house monitoring systems? How many? How much do they cost? Please add up the precious time and effort you spent building monitoring servers and services yourself, and then trying to find the root cause of a problem by cross-referencing all this data from different sources. How quickly did you find the solution? How lost did you get? How outdated are these systems?

But what if you had just one place to look? What about having application metrics, server metrics, custom metrics, API tests and alerts in just one place without the effort to build and manage anything?

TIP: APM

Application Performance Management, yes! Monitoring at the application level. Because in the end, the most important thing is whether your application is working and how well it’s working. And you can easily find SaaS companies offering this kind of service.
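To make “monitoring at the application level” a bit more concrete, here is a minimal Python sketch of the kind of measurement an APM agent takes for you automatically: timing each call and tracking the error rate. Everything in it (the `timed` decorator, the `METRICS` store, the `get_scores` handler) is a hypothetical illustration, not the API of any of the SaaS products.

```python
import time
from collections import defaultdict
from functools import wraps

# Hypothetical in-memory store: metric name -> list of (duration_seconds, ok) samples.
METRICS = defaultdict(list)


def timed(name):
    """Record the duration and success/failure of every call to the wrapped function."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            ok = False
            try:
                result = func(*args, **kwargs)
                ok = True
                return result
            finally:
                # Runs on both success and failure, so errors are measured too.
                METRICS[name].append((time.perf_counter() - start, ok))
        return wrapper
    return decorator


def summary(name):
    """Return call count, error rate and slowest call for one metric name."""
    samples = METRICS[name]
    if not samples:
        return {"calls": 0, "error_rate": 0.0, "max_seconds": 0.0}
    errors = sum(1 for _, ok in samples if not ok)
    return {
        "calls": len(samples),
        "error_rate": errors / len(samples),
        "max_seconds": max(duration for duration, _ in samples),
    }


@timed("get_scores")
def get_scores(match_id):
    # Stand-in for a real request handler; a negative id simulates a failure.
    if match_id < 0:
        raise ValueError("unknown match")
    return {"match_id": match_id, "home": 2, "away": 1}
```

Now imagine writing, shipping and maintaining this for every handler, database call and external request, plus the dashboards and alerts on top. That is the work you are paying someone else to do.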

Most of these companies offer a trial of at least 15 days. If I hear someone saying “it’s too expensive”, it’s clear to me that they don’t understand the value of these tools. You have probably already lost a lot of time and money doing monitoring on your own. I’m not saying that you don’t need any custom monitoring. But I assure you that these tools will deliver much more value, with less effort, to you and your company.

New Relic, Datadog, Sysdig… they are simply worth the price you pay. You can find the problem and the solution amazingly quickly. And you can always negotiate pricing. They will save you precious time to focus on what really matters: your ecosystem. See? Laziness, for the best.

What about configuration management tools?

First of all: read this part until the end, don’t be afraid. :)

These tools are intended to help sysadmins, right? But most of your time ends up being spent learning how to deal with them. I’ve been there before with Puppet, Ansible, Salt… so I know the pain. Pain? Yes, it’s a pain in the ass to manage this code. “But my 4,597 scripts and modules have been working for more than 5 years with no issues.” That is almost a monolithic ERP, and there are people in love with it.

They are great tools, some of them more than 12 years old. Well, then sorry: they *were* great tools. Some of them, like Ansible, Terraform and Packer, are simple (HashiCorp is awesome!) and can still be used for a lot of purposes. But do you really think internet giants such as Facebook, Netflix and Spotify use them to deploy and scale live traffic in production? These tools made you a slave and you didn’t even notice. You need to work the way they work. And, if I’m right, you are probably trying to solve bigger problems than these tools were meant to solve.

These tools give you the false sensation that you are in control of your infrastructure. But when you need to make a critical change and it breaks, for whatever reason (don’t be shy, it breaks for everyone), you realize that you spent some very precious time (months, sometimes years) building a gigantic solution that is far from bulletproof. Or maybe you used the tool in the wrong way. So don’t be afraid to recycle your code.

For instance, Terraform is an amazing tool to build and control your infrastructure. Try getting audited, or sitting through a well-architected review, without automation. Not having a proper automated process to build your infrastructure is, well, bad.

But to deliver new code, new features for your customers… Well, the problem comes when you try to make a simple juicer cook an entire dinner. Are you going to wait half an hour (being optimistic) to have a new version in prod? Do you need to wait for some commands to run inside the instances to have your deployment in production? And when it fails, does the rollback process really work as you expected?

So, my friend, this way of using configuration management tools *was* really good some time ago. Don’t do it anymore, and please retire what you have. Today there is a better way to do this stuff.

TIP: Docker

This thing with configuration management can become a nightmare at some point. But have you ever tried Docker? “Uh, that buzzword.” Yes, and do you know why it is one? There are too many good answers to that. But instead of writing boring stuff, I would kindly ask you to catch up with some Spotify and Netflix presentations. They can show you how many problems Docker solves and how many benefits it has.

Docker (now 5 years old) is good because it creates one pattern for any kind of environment (dev, prod, whatever). It fits the Twelve-Factor App methodology on almost every factor. It was made to scale, to be secure and to be portable. And to reduce costs. You get all of these benefits if you use it properly.
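One example of what “one pattern for any environment” means in practice is the Twelve-Factor rule of keeping configuration in the environment, so the same image runs unchanged in dev and prod. The sketch below is a minimal illustration in Python; the `Settings` class and the variable names `DATABASE_URL`, `PORT` and `DEBUG` are assumptions picked for the example, not something prescribed by Docker.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    database_url: str
    port: int
    debug: bool


def load_settings(env=None):
    """Build settings from environment variables instead of per-environment files."""
    env = os.environ if env is None else env
    return Settings(
        # Required: fail fast at startup if the variable is missing.
        database_url=env["DATABASE_URL"],
        # Optional, with defaults that suit local development.
        port=int(env.get("PORT", "8080")),
        debug=env.get("DEBUG", "false").lower() in ("1", "true", "yes"),
    )
```

The container image stays identical everywhere; only the variables passed at start time (for example with `docker run -e DATABASE_URL=...`) differ between environments.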

The change isn’t easy. But imagine a world with no server/agent setups. Imagine that you don’t need to manage the management tool. Imagine just one replicable setup for all apps. A really cheap, thin, fast, reliable and scalable infrastructure.

Besides that, imagine never hearing “it works on my machine” again. Imagine really quick builds and deployments (around 2 minutes or even less). Your application scaling in seconds to handle big traffic spikes.

Docker delivers patterns that you always tried to have, but never could. So, don’t be biased. Give it a try, you’ll love it.

And where is the laziness here? You will experience laziness when all your infrastructure is running in Docker with a good scheduler. A world with no headaches. Every system working as expected. You definitely need to try it.

"Someone is reporting bugs, but the monitoring is fine."

Some new code goes to production and everything is fine. No alerts. No spikes. Then you go home. Next week, there are some e-mails from customers saying that something is not working in your system. What do you do? Look at the logs? Try to find out when these problems started? When was the last change? Who made it? What was changed? Well, that is a lot of questions, some time lost trying to find the bug, and customers getting angry with your company.

We know that in most cases you can prevent this situation by following some good and important practices in the software delivery process. And you cannot replace good practices with any "magical stuff". But what if you had a tool that tells you something is wrong before anyone else does?

TIP: API monitoring

It would be awesome to be notified when a small change affects your API response, right? Tools like Runscope, SoapUI and New Relic Synthetics can do this for you. If you configure them properly, you can make sure that, for example, your API payloads always return the result that you want. And if they don’t, you can be notified right away (integrations with Slack, HipChat, PagerDuty and OpsGenie are in place), before anything breaks and, above all, before users tell you that something is wrong.

But the greatest benefit of API monitoring is that it’s proactive. You don’t need to wait for your application to suffer before you are notified. API monitoring actively tests real scenarios for you, which tells you far more than passively watching graphs. You have a Swagger file? You can import it. And depending on the tool you choose, you can configure your CI/CD solution to call these tests as well. This will certainly improve the quality of the delivery process and help keep your environment rock solid.
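If you have never seen what such a check does under the hood, the idea is roughly this: call the endpoint on a schedule, then assert on the status code and on the shape of the payload. Below is a minimal sketch in Python using only the standard library. It is not how Runscope or any other tool is implemented, and the function names and the example fields are assumptions made up for the illustration.

```python
import json
import urllib.request


def check_payload(payload, required_fields):
    """Return a list of problems found in a decoded JSON payload (empty means healthy)."""
    if not isinstance(payload, dict):
        return ["payload is not a JSON object"]
    problems = []
    for field, expected_type in required_fields.items():
        if field not in payload:
            problems.append(f"missing field: {field}")
        elif not isinstance(payload[field], expected_type):
            problems.append(f"wrong type for {field}: {type(payload[field]).__name__}")
    return problems


def check_endpoint(url, required_fields, timeout=5):
    """Fetch a URL and validate both the status code and the payload shape."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            if response.status != 200:
                return [f"unexpected status: {response.status}"]
            payload = json.loads(response.read().decode("utf-8"))
    except (OSError, ValueError) as exc:
        # Covers network errors, HTTP errors and invalid JSON.
        return [f"request failed: {exc}"]
    return check_payload(payload, required_fields)
```

A scheduler or CI job would call `check_endpoint` with your real URL and send any non-empty result to your chat or paging tool. The hosted tools add the parts that are tedious to build yourself: multiple locations, history, retries and integrations.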

"The whole infrastructure is a problem"

Now I’m talking directly to my sysadmin/devops friends. One thing I have noticed in my career is that ops people are hard workers. You can blame them for everything, but you can’t say they’re not hard workers. Unfortunately, they can end up in a situation where there are too many problems to handle. It’s difficult to change anything when you are drowning in firefighting. But the question is: how did you, my friend, get there?

We already mentioned most of the problems at the beginning of this article. The thing is: if you spend more than 30% of your week solving production issues, whether they come from code bugs or from infrastructure setup, then you are a serious candidate for changing your infrastructure. Yes, just accept the fact.

It’s simply not healthy to work with everything buzzing around you. Tweak configs, create servers, restart services, over-scale databases… there comes a point where you *must* say out loud, in capital letters: ENOUGH.

But by reading this article you have probably noticed that things have changed. You cannot use the same tools you used 5+ years ago in the same way. You are not a Flintstone in a foot-powered car. Now you have engines, tires, an automatic gearbox. There are some nice new patterns and tools that have improved our lives a lot. And they are really easy to use; you just need to be open to what’s new and to trust what the big tech companies are doing nowadays.

TIP: Container Schedulers

We talked about containers, and now you want to give them a try. But it’s new knowledge. You don’t yet understand how they work and which problems they really solve. So let me give you a tip: never run a container in production without a container scheduler.

A container scheduler gives you what you need to run your containers in production. It was made to take care of your infrastructure in many aspects: health, scaling, timeouts, CPU and memory usage, service discovery… The point is to free you from thinking about provisioning. And then you will always have one common structure for all applications.
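At its core, the “taking care” part is a loop that compares the state you asked for with the state it observes, and then acts on the difference. Here is a toy model of one such step in Python. It is only a sketch to illustrate the idea, not how Kubernetes or any real scheduler is implemented, and the `Container` class and `reconcile` function are made-up names.

```python
from dataclasses import dataclass


@dataclass
class Container:
    name: str
    healthy: bool


def reconcile(desired_replicas, running):
    """Compare desired state with observed state and return the actions to take."""
    actions = []
    healthy = [c for c in running if c.healthy]
    # Unhealthy containers are replaced rather than repaired.
    for container in running:
        if not container.healthy:
            actions.append(("stop", container.name))
    missing = desired_replicas - len(healthy)
    if missing > 0:
        # Scale up: start as many new containers as are missing.
        actions.extend(("start", None) for _ in range(missing))
    elif missing < 0:
        # Scale down: stop the surplus healthy containers.
        for container in healthy[desired_replicas:]:
            actions.append(("stop", container.name))
    return actions
```

For example, asking for 3 replicas while one of two running containers is unhealthy yields one stop and two starts. A real scheduler runs this kind of comparison continuously, for every application, which is exactly the work you no longer do by hand.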

You have many schedulers these days (Mesos, Nomad, Docker Swarm, ECS, etc.). But my choice is Kubernetes. Many interesting companies like GitHub and Microsoft are using Kubernetes. And I can tell you: compared with the competitors, Kubernetes is far easier to learn and to deal with. I will not give an extensive list of benefits; you can research them yourself. But keep in mind that it grew out of what Google had been running internally for years, so yes, it is powerful. :)

But the thing I like most about Kubernetes is that it didn’t come to create new patterns. It was created to *meet* patterns. Many competitors try to solve their own problems, in the way they think infrastructure problems should be handled, and at some point they figure out that they have a new product. The familiar story of a 'bug that became a feature'. But Kubernetes is different. Kubernetes is the result of the rules and patterns that you read about in the books. Every infrastructure aspect that you have already learned about scalability, performance and management is done magnificently by Kubernetes. And it’s open source: it is now hosted by the Cloud Native Computing Foundation, which took Kubernetes on as its first project.

Google offers Kubernetes as a service. AWS just launched its own managed Kubernetes service. Easy peasy, no need to write any code to build a cluster. But of course you can build one yourself the hard way (with some help… lol). Still, I would say that it takes more time to build a server/agent configuration management setup than to build your own Kubernetes cluster.

Are Docker and Kubernetes big changes to your environment? Yes. But they will push you and your team towards reality. They will push you to adopt good practices, to write better software, to closely understand the behaviour of your app, to have a stable environment. You cannot imagine how many headaches they solve together. And in the end, everybody will be proud to have them.

What now?

Well, of course I cannot cover all imaginable points of having a good infrastructure in just one article. But I hope to have given you some direction on how to improve a little bit.

And in my opinion, SaaS is not the solution for everything. And we have a lot of open source tools and good scripts/functions that will never lose value. But, think for a second: why should you reinvent the wheel? Why should you spend time building a car if you can buy or even rent one? Just give yourself a chance to try these new things. They will deliver a lot more value for you.

You probably noticed how many times I used words like “easy”, “laziness” and “fewer headaches”. And you have probably already spent a lot of time trying crazy solutions. Why don’t you spend some time doing things right, doing things that can really solve your problems?

In the end, there is the undeniable truth: you need to take care of your own systems. The systems you need to code and deploy for your company. The ones that make your company what it is and earn its money. This is where you need to put your full attention. But some people just don’t have this clear vision. Or maybe they like to have some “pets” to take care of, some buttons to press, some keys to turn… Well, bear in mind that time is precious these days for companies. And also for you and your future.

So, answering the main question: how do you bring your infrastructure to the next level? Well, you noticed that this is a half-technical article. Yes, there is some good technical advice here, but the key point is not technical. It’s more about people. And totally about mindset.

It’s about accepting reality. Accept changes. Take some risks. Face what’s new. Some people fear losing the ability to SSH into a box or restart a service. They need to see servers working from the inside for some reason. They are just comfortable with that. But there are some realities to face: what kind of team or company do you want to work in? What kind of problems do you want to handle? How proud are you of your role and your position today? How do you see your career over the next 5 years?

The answers are up to you. But I really like the idea of having simple setups. One pattern for any app. Just a few pretty decent lines of code to solve many different problems. Just one reliable place to look to see if all apps are good. Good practices built in. Apps that scale up as fast as the spike they receive. Better budget estimation. What a long list of good things!

Some people may say “oh, there’s too much magic for it to be real”. Well, have you tried it? Just give it a shot. Maybe there’s a good reason for companies like Google, Microsoft, Spotify, Netflix, Cisco, Airbnb (and the list goes on) to use these kinds of tools. And I bet that you’ll be proud at the end. And you will certainly have more time left to be lazy.

Fred Wynyk
OneFootball Tech

Engineering Manager | Platform, Security, Backend