Worst Practice Development and How to Avoid It!

Hendra Prastiawan
Published in Inside Bukalapak · 20 min read · Mar 9, 2023

People learn from every experience they have, especially when the worst things happen. Why are the worst experiences in life so important? Because when people go through something bad, they remember it more strongly; it can even traumatize them into never wanting to experience it again. By nature, humans remember bad things better than good things. So what is the correlation between bad experiences and software development? Jack Ma, the founder of Alibaba, once said:

What is the difference between smart and wisdom? Smart people know what they want, but wise people know what they don’t want

Smart people often want this and that, and in the end they forget the future consequences of what they want. Wise people know what they don’t want: they know when to stop, and they know the limits of their desires. That is why smart people are often trapped by silly mistakes, while wise people can usually avoid them.

During my five years on the Bukalapak engineering team, I have made various mistakes, from small ones to fatal ones. In this post, I will share the worst development practices I have committed, the very hard lessons I learned from them, and how they became my mantras in the software development process. There are four things I learned the hard way:

  1. Bombing feature at one shoot!
  2. Work harder, not smarter
  3. Testing in production
  4. Over future thinking

Bombing Feature at One Shoot!

This happened at the end of 2017, when our team was building a new promotional feature for the 2018 Ramadhan campaign. Back then, our team did not yet know how to deliver a feature or product gradually, so we created one branch that held every change needed to ship the feature. Yes, there was one long-lived branch in our big monolith service repository, a repository with around 300 to 400 contributing engineers, and this became our first obstacle. Because of the rapid development in that repository, there were of course many changes every day that conflicted with ours, forcing us to resolve conflicts again and again. Development continued, and our long-lived branch grew larger and larger. When we finished coding, we asked many engineers from other teams to help review our changes, but because the changes were so large, they were naturally hard to review. Below you can see the number of changes we made on that single long-lived branch.

The changes came to around 5,000 lines of code. Can you imagine being in the position of reviewing them? What would go through your mind after being asked to review such big changes? I might have thought “Wow, these changes are big, OK, let’s get this merged, LGTM!”, or “Damn, I’m too lazy to review this, let’s auto-approve it”, or “Reject, this is a really big change, you need to split it”. Luckily, the changes did get reviewed, even though I know the review process could not have been efficient: reviewers lose context in big changes, big diffs are hard to compare, it is hard to build deep knowledge of the changes, and feedback comes back very slowly. If I recall correctly, the code review took more than two weeks, so the whole development process was impacted while the feature deadline crept closer and closer. But the problems did not stop there. After development was done, our next job was to release the feature to production. When we first deployed the changes to production, we had to make several fixes, since the changes caused bugs in production. We could have rolled back when the bugs appeared, but that was hard to do at the time, which is why we decided to fix forward instead. After several rounds of fixes, we fortunately delivered the feature on time, right before our 2018 Ramadhan campaign started.

From that promotional feature development, I learned that the process had several pain points:

  1. Hard to roll back the feature in production after deployment
  2. Hard to fix bugs, since the changes were too big and we could not track the issue or root cause fast enough
  3. Hard to test during development; it took a long time to make the changes ready for the QA team
  4. Slow development overall; we could not generate fast feedback from code review or the testing process

So, how did we overcome this and prevent it from happening again? There are two sides we need to take care of for the development process to work in synergy:

  1. On the technical side, we need to do several things:
  • Change the development process to Trunk Based Development (TBD), so we can break every change into small chunks, gather faster feedback from both code review and testing, and ship code to production sooner. TBD was introduced to me and my team in 2018 by our manager. At first it was hard to develop with TBD because we were unfamiliar with chunking: we did not know how to split the requirements and the development process. Thanks to the people in management and others who kept giving feedback on how to chunk requirements and development, whether in one-on-one sessions, sharing sessions, or code reviews, we came to understand TBD better, and now it is a habit in our product development.
  • Use a release toggle for new features or anything you consider crucial. This lets us ship changes to production safely, because the changes are not automatically activated; we have to turn the toggle on manually. A release toggle also lets us roll back a new feature easily if something goes wrong at release time, without redeploying the service: we just toggle it off. At first we only used feature toggles, but later, because we needed to improve how we released changes to production, my tribe at the time (the Growth Tribe) set a direction to normalize the use of release toggles. Every change we consider critical, or that needs fast mitigation, should sit behind a release toggle, and since our toggle tooling can serve as a release toggle, this required no additional effort.
  • Release changes gradually to see the impact and gather faster feedback. This increases our confidence when we release the full feature, because the changes have already been battle-tested in production. To release gradually, we combine the changes with a release toggle, so that once the code is deployed to production, it is safe: the code is not called anywhere until the toggle is activated.
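To make the release-toggle idea concrete, here is a minimal Python sketch. Everything in it is illustrative: the toggle store, the toggle name, and the pricing functions are hypothetical, and a real setup would typically back the toggles with a config service rather than an in-process dictionary.

```python
# Minimal release-toggle sketch: the new code path ships dark and only
# runs once the toggle is flipped on, so rollback is a toggle flip
# rather than a redeploy. All names here are illustrative.

TOGGLES = {"new_promo_flow": False}  # deployed off by default


def is_enabled(name):
    return TOGGLES.get(name, False)


def apply_promo_discount(price):
    # Hypothetical 10% campaign discount on the new path.
    return int(price * 0.9)


def checkout_price(base_price):
    # The new promotional pricing lives behind the toggle; until it is
    # activated, production behaviour is unchanged.
    if is_enabled("new_promo_flow"):
        return apply_promo_discount(base_price)
    return base_price


# Just after deployment: nothing changes for users.
assert checkout_price(100_000) == 100_000

# Release: flip the toggle on -- no redeploy needed.
TOGGLES["new_promo_flow"] = True
assert checkout_price(100_000) == 90_000

# Rollback on an incident: flip it back off.
TOGGLES["new_promo_flow"] = False
assert checkout_price(100_000) == 100_000
```

The point of the sketch is the shape, not the math: deploying code and activating behavior become two separate, independently reversible steps.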

2. Then, on the product and requirement side, we as engineers also need to pay attention to the following:

  • Make decisions together with the product team: set the MVP of the feature, discuss delivering it gradually, and don’t ship one large feature in a single big iteration. This way, we can validate the feature’s ideas as soon as possible with the smallest effort needed.
  • Offer ideas or alternatives for developing the feature, and think of other strategies that might be easier but still achieve the same goal, so we can minimize the development effort. We often create technical documents before starting development and lay out multiple solutions, so we can compare them and decide on the better one later.
  • Create a realistic timeline with the product team; don’t be a yes-man. Collaboration between the product and engineering teams is a great synergy for achieving the goal, so we as engineers need to estimate a reasonable timeline with them, explain why we need more time to develop the feature, and provide data if needed, or explore first before committing to a timeline.

Work Harder, Not Smarter

I still do this sometimes, and I believe most people think that working harder gives better results. I certainly did, back when I was pursuing the position I wanted. I tried harder in various ways: working overtime almost every day, always trying to work faster, and always wanting to be seen as working harder than everyone else so I would stand out. But what I forgot to ask was how much impact all that hard work produced: did it actually provide more value? Then I thought, why work harder when I can make it simpler? Now I realize that the leverage I produced back then was arguably smaller than the leverage generated by colleagues who always found simple solutions and still achieved the desired goals.

So, what is leverage? According to The Effective Engineer talk by Edmond Lau, leverage is the impact produced per unit of time spent.

Leverage is high when the impact produced is large relative to the time spent, which reminds me of the Pareto Principle: 20% of the effort can produce 80% of the result we want. Consider some examples. If a staff engineer has three times the impact of a junior engineer, does the staff engineer have to spend three times as much time? If a technical fellow delivers ten times the impact of a staff engineer, do they need to spend ten times as much time, or thirty times as much as a junior engineer? That would be nonsense: they all work roughly the same 8 to 9 hours per day. That is leverage, and that is why it matters. A person who creates high leverage must be highly effective, delivering more value than others in the same amount of time.
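Lau’s definition can be written as a formula, leverage = impact / time spent, and a toy calculation (the impact numbers below are made up purely for illustration) shows why the staff engineer in the example does not need triple the hours:

```python
# Leverage = impact produced / time spent. Same working day, different
# leverage -- the numbers are illustrative, not real measurements.

HOURS_PER_DAY = 8


def leverage(impact, time_spent):
    return impact / time_spent


junior = leverage(10, HOURS_PER_DAY)  # hypothetical impact units
staff = leverage(30, HOURS_PER_DAY)   # 3x the impact, same 8 hours

# Tripling impact without tripling hours means tripling leverage.
assert staff == 3 * junior
```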

My most memorable “work harder, not smarter” moment was when my team was tasked with discounting all products in our marketplace by a certain amount. At the time, we had an estimated 300 million active products, and we had no feature for Bukalapak itself to subsidize products by a specific amount; we only had a feature that let sellers subsidize their own products. So we decided to use that feature and modify the implementation so the subsidy would be billed to Bukalapak instead of the seller. But the subsidy could not be set in bulk; it had to be applied to each product one by one. Looking for a simple solution, we decided to write a script to bulk-apply the discount to a given list of products, but then the next problem appeared: we could not apply the discount to too many products at once, so we had to chunk the number of products per script execution. We settled on applying the discount to 150,000 products per execution, with each execution estimated to complete within 12 hours. So how many times would we need to run the script? To discount every product in the marketplace, around 2,000 executions, which works out to about 24,000 hours, or 1,000 days of runtime. After we had run the script and applied the discount to around 3–5 million products, the subsidy did not seem to be working: the metric we wanted to move was hard to reach. So we decided to shut the feature down and retract the applied discounts by running the script again. All that effort was wasted: we worked extremely hard on the script executions, but the result was poor.
Only later did I realize: why didn’t we just add a hardcoded discount for all (or specified) products and handle it in the payment process? That would have been far less effort than running the script thousands of times, and if we needed to shut down or retract the discount, we could simply turn a toggle off. Less effort, and faster feedback.

In another case, while migrating one of our highest-traffic features from the monolith to a separate microservice, we tried to implement the feature exactly as it was implemented in the monolith, especially the caching. The old implementation used three caching platforms: Redis, Memcached, and Varnish Cache. The reason we copied this was “cargo cult”: we never asked ourselves why all three were used at once; we saw that it worked before, so we assumed doing the same would make it work again. Using three caching platforms, of course, increased the complexity of the microservice, because there was more to maintain. In fact, we already had a caching feature in our in-house API Gateway platform, so we could have handled the caching there without any additional platform. The main reason we needed a cache was to absorb spikes of read requests, so the database would not be overloaded while handling write queries. We wanted read requests to avoid hitting the database (by adding Redis and Memcached) and to avoid hitting the new microservice at all (by putting Varnish Cache in front of it). But both problems could be solved simply by using the caching feature of our API Gateway, and the setup was easy: just a YAML configuration in the API Gateway routing config.
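Conceptually, what the API Gateway’s caching feature does can be sketched as a single TTL cache in front of the read path: a burst of identical requests is absorbed by the cache, so the backend (and ultimately the database) sees only occasional misses. This is an illustrative sketch of the idea, not the gateway’s actual implementation, which is configured via YAML rather than written by hand:

```python
# One TTL cache in front of the read endpoint absorbs request spikes,
# standing in for the Redis + Memcached + Varnish stack. Illustrative.

import time


class TTLCache:
    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self.store = {}  # key -> (expiry_timestamp, value)

    def get_or_fetch(self, key, fetch):
        now = time.monotonic()
        hit = self.store.get(key)
        if hit and hit[0] > now:
            return hit[1]      # served from cache, backend untouched
        value = fetch()        # cache miss: hit the backend once
        self.store[key] = (now + self.ttl, value)
        return value


backend_calls = 0


def fetch_product():
    # Stand-in for the expensive database / microservice read.
    global backend_calls
    backend_calls += 1
    return {"id": 1, "price": 10_000}


cache = TTLCache(ttl_seconds=60)
for _ in range(1_000):  # a burst of identical reads, i.e. a spike
    cache.get_or_fetch("product:1", fetch_product)

assert backend_calls == 1  # the whole spike cost one backend call
```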

From both experiences, I learned that working harder is not the best approach; it has several drawbacks:

  1. An imbalanced risk-to-reward ratio. The risk taken is greater than the result achieved; as the experiences above show, the effort was very high but the impact was poor.
  2. Wasted effort. We ran the script every day and monitored its execution all the time, only for the feature to be shut down because the impact was not what we expected; or we set up multiple caching platforms when we could simply have configured our API Gateway to cache the endpoint.
  3. Exhaustion, because we always spent our full energy delivering tasks without considering work-life balance. We worked overtime on almost every task just to get through the overloaded workload we were given.
  4. Toxic productivity. This one is very dangerous. When we are used to working hard and we see teammates relaxing at their work, or putting in less effort than we do, it can trigger the feeling “I have worked harder than him, so my performance must be far better than his.” But trust me, we may not be making more impact than they are. It might look like we perform better, but our leverage is actually lower: if they can relax and still achieve their goals, their leverage is higher than ours. Productivity should not be measured by the effort we put in, but by the impact we produce.

So what do we need to do to avoid this situation? This is what I usually do now:

  1. Look for an easier solution. This sounds easy but is hard to do. We can find easier solutions by setting up discussions with other engineers, creating a comparison table of the candidate solutions, writing it all into technical documentation, and asking other engineers to review it. That way we can pick the most efficient solution and estimate the timeline more accurately. During requirement gathering, we can also offer the product team alternatives that require less development. Don’t be shy about expressing your ideas to the team; it is nice to have a working environment where we can do so without hesitation.
  2. Look for existing technology and check whether it can reduce the development effort. As in the microservice migration case above, we could simply use the API Gateway’s cache instead of setting up Redis, Memcached, and Varnish Cache.
  3. Question and challenge every requirement given to us. Be critical, and ask questions whose answers can convince us that what we are about to develop will be useful and in line with expectations. For example:
  • If we develop this, will it really increase sales?
  • How much of an increase can we achieve?
  • Is there any data supporting those assumptions?
  • What is the expectation for this feature?

4. Always provide comprehensive data when challenging a requirement that feels inappropriate or unreasonable.

Testing In Production

Sometimes we hear someone say “Real men test in production”, which has become a meme these days. Do we need to test in production? Well, testing in the production environment will be a normal thing in the future. Why? Our applications keep growing over time, and the larger an application gets, the more testing it needs. If we test every feature manually, then as the application grows we will eventually run out of manpower, and testing across multiple development phases becomes harder. Say we have 100 features that need to be tested in the staging environment during the testing phase, and then on the canary server before the full production deployment in the release phase. Testing all of them by hand would be exhausting, so we need automated tests running in our continuous delivery pipeline: every time we deploy code to staging or production, the automated tests run and notify us quickly when bugs appear. That is why testing in production will become normal. That said, testing in production, and especially manual testing in a production environment, carries multiple risks, ranging from low to fatal.
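The pipeline-driven automated testing described above can be sketched as a tiny post-deploy smoke test. Everything here is illustrative: the `/health` route, the fake HTTP client, and the test name are hypothetical, and a real pipeline would use an actual HTTP library against the deployed service.

```python
# Post-deploy smoke test sketch: the delivery pipeline runs this after
# every deploy to staging or production and fails the deploy if any
# assertion raises. The fake client keeps the sketch self-contained.

def fake_http_get(path):
    # Stand-in for a real HTTP client hitting the deployed service.
    routes = {"/health": (200, {"status": "ok"})}
    return routes.get(path, (404, None))


def test_health_endpoint():
    status, body = fake_http_get("/health")
    assert status == 200
    assert body["status"] == "ok"


def test_unknown_route_is_404():
    status, _ = fake_http_get("/does-not-exist")
    assert status == 404


# In a real setup a test runner (e.g. pytest) collects these; calling
# them directly keeps the sketch runnable on its own.
test_health_endpoint()
test_unknown_route_is_404()
```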

Now for the case. Back in 2018, when we were developing the promotional feature mentioned in “Bombing Feature at One Shoot!” above, we had no testing environment that mimicked the resources of production. So when we needed to stress-test the feature to learn how much traffic it could handle, we did it on the production servers. Yes, the production servers, and we always pushed the test until the service could no longer handle requests, or simply until it went down. We ran this twice a week, from 1 AM to 5 AM, until we had an optimized version of the feature. While the tests were running, our customers could see the dummy products we had set up as test products. Customers complained that their purchases of those products were cancelled, and that the products in the test campaign had nonsense prices, for example a pencil box selling for Rp1.000.000. This kind of testing obviously confused customers, who could see many dummy products from the test campaign appearing in the homepage section. And because the tests always drove the service to heavy load, there were potential losses whenever the service was overloaded or died.

In another case of manual testing in production, I once ran a very large database query on the beta server, a query that should have run, and must only ever run, on our data warehouse, not on our production database. In my innocence at the time, I tried the query repeatedly, even though my first attempt returned no result because it timed out. I kept trying until someone from the infrastructure team noticed heavy queries in the database process list and asked in our engineers’ group whether anyone was running data-science queries in production. I stopped the query and never did it again: running it over and over could well have taken our database down.

From the two events above, I learned that testing in a production environment requires special attention: the tests must not interfere with the needs of our application’s users. Here is what to do:

  1. Make sure we limit access to what we need to test. For example, if we want to test new changes to a feature, make sure only we can see and use those changes. We can use a whitelist mechanism for this, so access is limited to whitelisted users and other users are not affected by the changes.
  2. Use automated testing. The automated tests must be validated before being added to our pipeline; using them minimizes the human error that manual testing can introduce.
  3. Have a checklist or guideline for what we are going to test in production, so we can be confident we are running the test step by step as written. This minimizes the human errors that happen without a guideline, such as forgetting to set up a prerequisite before testing the changes.
  4. Write down all the dependencies related to the testing process, so that if a problem occurs we can pin down the error quickly and define where the problem lies.
  5. Don’t test in production alone. Ask teammates or colleagues to join the test session to increase awareness of what is being tested; more eyes are better when testing in production.
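The whitelist mechanism from point 1 might look like the following minimal sketch. The user IDs, the whitelist store, and the function name are all hypothetical, not Bukalapak’s actual implementation:

```python
# Whitelist gating for testing in production: only listed tester
# accounts see (and can break) the behaviour under test; everyone else
# keeps the current, stable experience. Illustrative names throughout.

TEST_WHITELIST = {"qa_user_1", "qa_user_2"}


def sees_experimental_campaign(user_id):
    # The experimental page is served only to whitelisted testers.
    return user_id in TEST_WHITELIST


assert sees_experimental_campaign("qa_user_1") is True
assert sees_experimental_campaign("regular_customer") is False
```

Combined with a release toggle, this lets a production test run against real infrastructure while ordinary customers never encounter dummy products or nonsense prices.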

So, is testing in production a wrong decision? I don’t think so. If we can make sure the points above are fulfilled, then a proper test in production is OK, although I would still prefer to test on a staging or testing environment first rather than going straight to production. The saying “Real men test in production” is only complete if we change it to “Real men test in production properly”. I would also say that our Engineering Management’s decision to make software engineers responsible for automation testing was a great one. Because of it, I feel I have matured in the development process, and I have gained knowledge of testing that I never had before; learning how to create automated tests has been a real benefit to us.

Over Future Thinking

Sometimes, in the requirement-gathering phase, we think up or suggest solutions for every problem, even problems that might never occur. Is it wrong to think like that? Well, sometimes it is necessary in certain cases, especially for things related to data security or user privacy. But in real life, we often embellish the requirements without asking whether our customers actually need those embellishments; we rarely ask whether this is really what the customer wants. We also often over-engineer the development because we think it should be reusable later, in case we need to build a similar feature in the future. If we focus too much on what might never happen, we can lose the opportunity to deliver the requirement as soon as possible and validate the idea faster, or our work may turn out to be useless because it is never used.

For example, when my team discusses an upcoming task, I sometimes throw out so many ideas that, of course, increase the development effort, and suggest generalizing the implementation so that what we build now can be reused later. Luckily, my team is full of critical people, so when an idea is too much, they push back and ask, “Is it really needed now, or can we do it later?” If no one pushed back, the team would be caught in tight deadlines working on everything, needed or not, and we would sometimes lose focus on the main goal of the task.

So why might future thinking turn into future trashing? Consider the following points:

  1. By overthinking the future, we might forget the main goal of the task and focus only on what could be improved around it.
  2. The planning becomes immature because the ideas run wild, and we cannot handle all of them at once.
  3. Our embellishments might never be used and become tech debt later, because we will need to clean up that implementation.
  4. We end up solving problems that might never occur.

So how do I avoid this mistake? What I did was:

  1. Be firm in determining the scope of the feature. We cannot achieve this by ourselves, and we cannot achieve it without “gotong-royong” (mutual cooperation) and a speak-up culture; it takes great synergy across the team and a shared vision.
  2. Speak up when an idea is too excessive or too wild: be eager to ask whether there is supporting data to validate the idea, then ask whether the idea is still in line with our main goal.
  3. Take the customer’s point of view. Imagine yourself as a customer of the product and ask whether the idea or solution is really what you need. Or ask yourself: if this improvement did not exist, could you still use the product without any problem? Does it fulfill everything you need?

The habit of future thinking is really difficult to get rid of; it takes a lot of experience to tell what we really need from what merely embellishes the requirement. I still make this kind of mistake sometimes, so after throwing out ideas I always ask myself, “Do we really need this?”

Those are the mistakes that really changed my perspective as a software engineer here at Bukalapak. There are actually many other things that have taught me valuable lessons beyond these four, such as “don’t make decisions quickly”, “don’t do all the work at the same time”, “don’t be a yes-man”, “avoid cargo cult”, and many more. But the four mistakes above are the ones that had the biggest impact on my journey as a software engineer here, and they have become my guideline in every development process. How did I come to recognize all these mistakes, and how did I overcome them? None of these improvements came instantly. It has been a long journey, and I could not have made it alone. I have had so much support from the people around me, my team, and Bukalapak as a company. Here at Bukalapak we have a blameless culture that lets us grow without fear of being blamed; we constantly improve our engineering craft over time, which encourages us to enhance our abilities; and we must not forget Bukalapak’s cultural value from back then, “try, fail, and try again”, which encouraged us not to fear failure but to learn how to overcome it.

After all, humans can never fully escape making mistakes. Don’t be afraid to make mistakes; be afraid to make the same mistake twice. And one thing we all need to remember: what makes humans human is that they are willing to learn and try not to fall into the same mistake again.
