Because You Have to Plan; and Fail

Muhammad Sahputra
Aug 9, 2017

Last week I had a very interesting discussion (well, an interview, to be exact) with the CTO of a company in Jakarta. Among other things, he asked about my activities in my last job. I spent 6 years handling a crucial product in the telecom industry. He asked, “How crucial?” I answered that the product is a kind of database technology holding all subscriber (telecom consumer) related information. Some people say it is like the “heart” of a telecom provider / operator. As with the heart in a human body, when it is disturbed every other part is affected, no matter what. And if one day the heart goes down / collapses / dies, the whole body stops functioning. For a telecom operator, if the central DB goes down then the whole network goes down. The effect is nationwide. Total outage. No matter how many radio systems the operator has, once the “heart” is down no one can get a network signal.

Has such an incident ever happened? Yes, it has happened to several operators around the world, for example Telefonica’s O2 incident in the UK in 2012. Of course the central DB is not the only possible root cause of such a big issue; other products could do similar damage as well. But a failure of the central DB is the best-known root cause of such nationwide incidents.

“So, how did you manage to maintain such systems there?”, he followed up. I have been asked similar questions several times in the past, and the answer is always the same: a good plan, and a testbed environment (staging).

Plan

I attended a Project Management Professional (PMP) course a few years ago in Doha. According to the PMBOK, there are 5 project management process groups: initiating, planning, executing, monitoring & controlling, and closing. Given the nature of my work as an engineer, I would say that planning is the most crucial part. There is a good quote about planning,

Failing to plan is planning to fail

Before performing any activity, small or big, I used to write a note containing the plan for it, from high level down to the details. What is required to execute the activity, what the risks are, how I should execute it, what the strategy would be if it fails, who can support or be involved in it, especially when I face an issue (mostly before big activities such as a system upgrade), etc.

My PMP trainer told us a story about how a 30-floor hotel in China was built within 15 days (about 2 weeks). There is video footage of the construction.

He told us that planning was the secret. In order to execute a 2-week project the team spent 2 years planning. As described in the story, they used pre-fabricated modules that were produced earlier, so in the execution phase they just assembled them into a building.

In my work, I always requested a longer timeline for the planning phase. Let’s say an upgrade is required on the systems. The actual execution in production might require only 5 working days (5 working days spread over 2–3 weeks, depending on the customer’s availability as well as mine, since I worked alone and could not execute every night). But whenever my bosses or the customer asked, I always said that 2 additional weeks had to be included in the project plan for preparation. So the preparation would be 10 working days to plan and to perform the following activities: test, fail, and troubleshoot (or sometimes even 14 days in total if I found an issue that required me to work during the weekend).

Since I need to test and fail, I need a special environment / playground to play with, which brings me to my second topic: the testbed.

Testbed / Staging

We called it the testbed environment, but in IT infrastructure it is usually called staging. In my previous job, maintaining the testbed properly was a top priority. Luckily, my customer had the same vision, so we worked hard to keep our testbed environment always mirroring the state of production. The main idea was that if I did something and it failed in the testbed, the same thing would happen in production. As simple as that.

Activities such as a system upgrade relied on new software coming from R&D. When they release the software, it has certainly been tested in their labs and has passed QA to ensure its quality. But telecom is a huge industry. There are hundreds of different customers around the world using the same product in different environments. In short, I would say every customer is unique. The lab environment used by R&D might be suitable for testing the core features of the software, or the software might have been deployed successfully for another customer, but whenever it is deployed in my customer’s network there is still a risk that the deployment will fail. Why? Because there are hundreds of “unknown” factors within each customer environment that might affect the new software.

So that’s why I need to test new software in an environment that is similar to my production system but carries zero risk. To follow another good quote from Silicon Valley,

Fail often, and fast

I need to find, right away, any issue that might occur when executing the activities in the production system. I need to fail fast, troubleshoot the issue, make a note, fix the deployment procedure that originally came from R&D to suit my customer’s environment, and re-test. And I have to repeat the same cycle for every scenario (sometimes an upgrade consists of several parts). That includes testing the fallback scenario in case something unexpected happens, because sometimes the fallback mechanism does not work either. Of course, during this phase I also have the opportunity to discuss with higher-level technical support and even R&D any issue that I can’t resolve locally.
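The test, fail, note, fix, and re-test cycle can be sketched in a few lines of Python. This is only an illustration, not the tooling I actually used: the `rehearse` function, the `Rehearsal` record, and the two stand-in procedures are all hypothetical names, and the "fix" between attempts is something a human does, not the code.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Rehearsal:
    """Result of rehearsing one scenario (e.g. upgrade or fallback) in the testbed."""
    name: str
    notes: List[str] = field(default_factory=list)
    attempts: int = 0
    passed: bool = False


def rehearse(name: str,
             run_in_testbed: Callable[[int], None],
             max_attempts: int = 5) -> Rehearsal:
    """Run a scenario until it passes, writing a note for every failure.

    run_in_testbed(attempt) is a hypothetical hook that raises an
    exception when the procedure fails in the testbed.
    """
    result = Rehearsal(name)
    for attempt in range(1, max_attempts + 1):
        result.attempts = attempt
        try:
            run_in_testbed(attempt)
        except Exception as exc:
            # Fail fast and write it down; the procedure gets fixed by hand
            # before the next attempt.
            result.notes.append(f"attempt {attempt}: {exc}")
            continue
        result.passed = True
        break
    return result


def flaky_upgrade(attempt: int) -> None:
    # Stand-in for a real procedure: fails twice, then succeeds.
    if attempt < 3:
        raise RuntimeError(f"simulated failure #{attempt}")


def working_fallback(attempt: int) -> None:
    # Stand-in for a fallback procedure that works the first time.
    return None


if __name__ == "__main__":
    for scenario, hook in [("upgrade", flaky_upgrade), ("fallback", working_fallback)]:
        r = rehearse(scenario, hook)
        status = "passed" if r.passed else "FAILED"
        print(r.name, status, "after", r.attempts, "attempt(s)")
        for note in r.notes:
            print("  note:", note)
```

The point of the sketch is that the fallback is rehearsed exactly like the upgrade itself, and that every failure leaves a note behind.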

I could also work out the correct timing for executing activities within the customer environment. For example, the server specs and load in the R&D lab might differ from those in my customer’s environment, so the duration of steps like a system reboot would differ too. My customer was very strict about timing, usually because the production execution window is limited (low-peak hours, when people are sleeping), so I had to measure the timing in the testbed to create a good strategy for how the activities should be executed later in the real production system.

Did the above strategy work? Yes. All the time. I found a lot of issues during the planning and testbed phases. One time I told my boss that the testbed phase was actually the hardest part, because I had to make sure the deployment in the production system would be 99% successful (there is no such thing as a perfect plan, so it is impossible to guarantee a 100% success rate, btw).
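As a rough illustration of that timing check, here is a small Python sketch. The step names, the durations, and the 20% safety margin are made-up assumptions for the example, not numbers from a real upgrade; the 5-hour window matches a typical 00:00 to 05:00 night slot.

```python
from datetime import timedelta
from typing import Dict, Tuple

# Hypothetical step durations measured in the testbed, in minutes.
MEASURED_MINUTES: Dict[str, int] = {
    "backup": 40,
    "software upgrade": 90,
    "system reboot": 25,
    "health checks": 30,
}


def plan_against_window(step_minutes: Dict[str, int],
                        window: timedelta,
                        margin_percent: int = 20) -> Tuple[timedelta, bool]:
    """Add a safety margin to the measured total and compare it to the window."""
    measured = timedelta(minutes=sum(step_minutes.values()))
    planned = measured * (100 + margin_percent) / 100
    return planned, planned <= window


if __name__ == "__main__":
    window = timedelta(hours=5)  # e.g. 00:00 to 05:00
    planned, fits = plan_against_window(MEASURED_MINUTES, window)
    print("planned duration:", planned)
    print("fits in window:", fits)
```

If the planned duration does not fit, the activity has to be split over several nights, which is exactly the kind of decision that should be made in the testbed phase and not at 03:00 in production.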

Execution

Similar to what happened when that hotel in China was built within 2 weeks, I have my own “pre-fabricated modules”. The modules are mostly my own notes for execution in the production system, my own scripts, my own troubleshooting procedures, my own customized solutions (already reviewed with R&D), etc.

When the execution phase comes, I simply read my own notes and follow the procedure step by step. It is similar to stacking pre-fabricated modules when building a 30-floor hotel. I tried my best not to even *think* during production deployment, since everything had been planned properly during the earlier phases. The analogy is copy-pasting everything into the production system. It helps a lot, especially since in the telecom industry we have to work late at night, normally from 00:00 to 05:00. During that time my brain sometimes does not fully work because of the overnight hours, and, as described earlier, a failure of the product I handled would cause a nationwide outage, so the pressure during an activity was very high most of the time. Some activities were considered very risky, and some of them didn’t even have a fallback scenario. In such situations, having a good execution plan helps a lot. Because as humans, we (well, me 😔) tend to make mistakes, especially when so much is at stake, so having a proper plan makes my work easier.
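To make the "follow the notes step by step" idea concrete, here is a minimal Python sketch of a runbook executor. It is an assumption about how such notes could be encoded, not the format I actually used; `execute_runbook` and the example steps are hypothetical names.

```python
from typing import Callable, List, Optional, Tuple

# A step is a description plus an action that returns True on success.
Step = Tuple[str, Callable[[], bool]]


def execute_runbook(steps: List[Step],
                    fallback: Optional[Callable[[], None]] = None) -> List[str]:
    """Run the prepared steps in order and stop at the first failure.

    If a fallback was prepared, it is executed on failure; otherwise the
    log records that there is none (some activities have no fallback).
    """
    log: List[str] = []
    for number, (description, action) in enumerate(steps, start=1):
        ok = action()
        log.append(f"{number}. {description}: {'ok' if ok else 'FAILED'}")
        if not ok:
            if fallback is not None:
                fallback()
                log.append("fallback executed")
            else:
                log.append("no fallback available, escalate")
            break
    return log


if __name__ == "__main__":
    # Stand-in actions; real ones would call the scripts prepared in the testbed.
    steps: List[Step] = [
        ("stop traffic", lambda: True),
        ("upgrade node", lambda: True),
        ("start traffic", lambda: True),
    ]
    for line in execute_runbook(steps):
        print(line)
```

Nothing in the executor makes decisions; all the thinking already happened when the steps and the fallback were written and rehearsed.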


“So, how’s the infrastructure here?”, I asked the CTO in return. He answered that it was pretty much the same. They have a development environment, staging, and a production system, and the process flow is also quite similar to what I described earlier. But he added a lot more information about the infrastructure architecture as well as the process flow.

I then understood that the nature of the infrastructure differs between an IT environment such as e-commerce and the telecom industry. Telecom is a carrier-grade network. It might not have a lot of servers (I handled fewer than 100 servers), nor a very complex architecture, etc. But it carries a high risk: whenever it fails, it affects people’s lives, especially in this era (how do you feel when you are not able to call your family, especially at an urgent time?).

Other IT environments such as e-commerce have a different kind of risk, multiple complex architectures, multiple teams accessing the systems, multiple commits made by developers every day, etc.

However, in terms of successful execution, both environments have a similar goal. I believe a good execution plan and the need to fail quickly also suit any team within an IT environment.

In a later discussion, he explained that the cycle I described earlier must run automatically, all the time, in their environment, meaning that they use DevOps and infrastructure as code to do so. I know about DevOps and have used it for my own projects so far, but because of the risk in the telecom environment I didn’t use it there, especially not for the production system. I think the telecom industry is heading toward that trend since it is moving to cloud solutions.

Moving on from the carrier-grade industry, I think other technology environments such as e-commerce are also an interesting field. They bring new challenges to deal with, more complex architectures to set up, more teamwork and collaboration, and more philosophy to live with.

Don’t you think so?
