On Technical Decision Making and Problem Solving

Published in Cermati Group Tech Blog · 16 min read · Jul 25, 2022

As software engineers, many of us work to build systems according to the specifications derived from business needs. For software engineers who work in product development, these requirements tend to come from the product managers. For people who work with the corporate technologies not directly related to the services provided by the company’s core products, these requirements might come from some other stakeholders.

Most of the time, we aren’t closely directed or overseen by product managers or other people from the business division, because we mainly interface with fellow engineers from other teams and only occasionally support the product and business divisions with technical matters related to their projects. As a result, Cermati’s infrastructure platform team tends to have quite a lot of freedom in determining what to build and in setting the requirements for the systems we’re building.

As the engineering manager of the team, I’m usually the one who identifies the problems we need to solve and comes up with the general strategy to approach them (in other words, if what we’re proposing to build sucks, that’s going to be on me). From that strategy we derive the system requirements, which are then broken down into executable steps and deliverables for us to start working on.

When the problems and the general strategy have been defined, usually I’m going to ask for thoughts and feedback from the team members and the CTO to make sure I don’t miss anything. After the misses have been revised and the plan is refined, we can proceed to set the goals for it.

In this article I’d like to share a few things about how to identify problems, model the problems, decide on the goals to achieve, and plan the execution according to our team’s scope of work.

I’m writing this article to help transfer the general thought process to the team members, as I think they might benefit from it. I don’t think I manage to convey it all properly in one-on-one sessions, because my verbal communication skills are somewhat of a disappointment. Another reason is that other people might also find it useful.

Stewart Brand, the American writer associated with the taken-out-of-context quote “Information wants to be free” (image from Wikipedia).

Aside from helping their personal growth, developing the team’s competence in this aspect will also help us scale our thinking and decision-making capacity. If every single member of the team is capable of identifying high-value problems, designing and implementing good solutions, and managing the rollout to the organization, the workload of identifying what problem we should tackle next and what solution we should build can be shared among the team members.

Identifying the Problem

The first step to solving a problem is to identify the problem. To do this, we need to first assess the current state of our system and see if there’s anything that might be of concern in its current state.

The following list shows some example questions that we can ask ourselves to see if there’s anything we can improve.

Is it stable and operating well? Does it work well in serving the business purpose it’s supposed to serve? Is there anything we can do to make it better?
Does it scale? Given how much load we’re handling right now and the growth we’re seeing, how much longer will it hold with the current setup?
Is it reasonable to maintain and operate? Is there any way we can improve the situation to reduce our workload on this at a reasonable cost, allowing us to have more spare capacity to work on something else?
Can it support us towards the organization’s strategic goals as defined in our strategic roadmap? If not, is there anything we require in order to support the goals?
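The scaling question in the list above can be turned into rough arithmetic. The following is a minimal sketch, assuming the load grows by a fixed fraction every month and that we know the capacity ceiling of the current setup; the function name and all the numbers are hypothetical and only there for illustration.

```python
import math


def months_until_capacity(current_load: float, capacity: float,
                          monthly_growth: float) -> float:
    """Estimate how many months remain until the load reaches capacity.

    Assumes compound monthly growth (an assumption made for illustration;
    real growth is rarely this regular).
    """
    if current_load <= 0 or capacity <= 0:
        raise ValueError("current_load and capacity must be positive")
    if current_load >= capacity:
        return 0.0  # already at or over capacity
    if monthly_growth <= 0:
        return math.inf  # without growth the load never reaches capacity
    # Solve current_load * (1 + g)^t = capacity for t.
    return math.log(capacity / current_load) / math.log(1 + monthly_growth)


# Hypothetical numbers: 400 requests/s today, a 1000 requests/s ceiling,
# and 10% growth per month.
runway = months_until_capacity(400, 1000, 0.10)
print(f"Roughly {runway:.1f} months of headroom")
```

An estimate like this is only as good as the growth assumption behind it, but it gives a concrete number to weigh against how long a fix would take to build.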

The problem should always be a real business problem faced by the organization, where the organization gets a tangible return on investment from the money and effort spent on solving it. Otherwise, it’s not a valid problem for the organization to solve.

Examples of valid problems:

You noticed that the organization’s software build and deployment pipeline is kinda awkward, as the developers need to build the new build for the service to be released on their local machine and push it manually to each production machine. You want to implement a proper build and deployment pipeline so the developers can focus on developing their services and have minimum time spent on infrastructure operations to improve productivity.
You noticed that the network segmentation and firewall policy for the organization’s production VPC network are improper and insecure. The improper segmentation makes it difficult to manage the connectivity between the network segments, and the firewall rules are so permissive that a compromise of even a single machine (one that isn’t running any sensitive operation or handling sensitive data) has the potential to cause great loss to the organization. You want to rearchitect the production VPC network, migrate the services to a new VPC network, and apply proper firewall policy to the new network to improve the stability and security of the production environment.
You noticed that a lot of manual work needs to be done to provision and manage an engineer’s user accounts across all the system components they need in order to perform their day-to-day work. The organization is planning to start hiring rapidly, and the number of engineers whose user accounts we need to maintain will soon skyrocket to the level where we can’t handle it manually anymore with the current team size. We have the option to hire people just to operate and administer the user accounts, but humans are unreliable and the tasks are pretty repetitive if defined well. You want to implement a service to handle the engineers’ user account access management in a way that’s cheap and reliable for the team to operate, so we can greatly increase the team’s capacity to handle the access management tasks without the need to expand the team as much.

Examples of invalid problems:

You think that Discord is cooler than Slack, where both basically do the exact same thing for the organization if implemented. You want to push the organization to migrate from Slack to Discord, but the organization doesn’t gain any meaningful benefit from the migration.
You think that Golang is the best thing since sliced bread. The organization is using Java. You want to push the organization to rewrite everything in Golang. Golang does offer some performance benefits over Java for the organization’s use cases, but the cost way outweighs the benefit.
You’re aware that adopting Kubernetes can help improve the productivity of a software engineering team from some articles you read online. You propose the adoption of Kubernetes in the organization. But the organization is using an on-premise server deployment with an already well-established waterfall-based workflow for development and delivery. The organization is a bank with a relatively slow software development and delivery process due to the number of security controls they have in order to ensure the security of their software releases, and the software is made into monoliths to make the build version management easier for compliance purposes. Kubernetes adoption might not affect productivity that much since the development is still using a waterfall model with monolithic software and most of the time is spent on the extensive security review and testing process anyway.

Setting the Goal

I don’t recommend operating by feeling or simply copying what other people are doing when setting goals. Someone who gave me advice on decision making and goal setting earlier in my career said to perform these steps.

Look at what other people are doing.
Do some thought experiments, imagine what it’ll be like if it’s implemented.
If it feels right, set it as the goal.

While it could work, there are a few problems with this advice. It’s a bit vague and it emphasizes feeling instead of actual reasoning.

Even if other people’s problems are similar to ours, the constraints and resources available at hand can be very different so their approaches need to be taken with a grain of salt. Also, we’re not even sure if their approach is optimal for their case, let alone our case.
The thought experiment part is actually very good advice.
What feels right isn’t necessarily right, especially if we’re framing the problem in the wrong way without realizing it.

Human feeling is feeble and unreliable. Proper reasoning is the way to go.

Of course, our reasoning can also have its flaws and weaknesses. But we’re being way more responsible if we apply proper reasoning to a decision that affects an entire organization compared to if we’re making the decision purely based on feeling.

To make it better, the three steps in the advice we mentioned previously can be expanded into the following steps.

Understand the problem well, looking at it from different points of view for reference. Mathematically modeling the problem might help us understand it better.
Look at what solutions have been built and implemented by other people to approach the problem as references.
Identify the strengths and weaknesses of the solutions implemented by other people, based on how they were modeling the problem and how they implemented the solution.
Identify what constraints they have that might limit their options in modeling and implementing their solutions, and also see the context of their decisions with regard to the problems their organizations are facing.
Identify what constraints we have that might limit our options in modeling and implementing our solution, and also see the context of our organization and the problems we’re facing.
Construct possible solutions to be implemented in our case, taking our constraints into account.
Try simulating those solutions in situations that might end up with the best, average, and worst-case scenarios for each respective solution.
Choose the solution that’s likely to be the best fit for our problem given the best, average, and worst-case scenarios we’ve identified, and set that as the goal.
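The last two steps above can be sketched as a small comparison. This is only an illustration under stated assumptions: the candidate names, the payoff numbers, the scenario weights, and the rule of discarding candidates with an intolerable worst case are all hypothetical, and a real decision would need payoffs estimated from the organization’s own context.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    """Estimated payoff of one candidate solution in each scenario."""
    best: float
    average: float
    worst: float


def weighted_score(s: Scenario, weights=(0.2, 0.5, 0.3)) -> float:
    """Combine the three scenarios using hypothetical likelihood weights."""
    w_best, w_avg, w_worst = weights
    return w_best * s.best + w_avg * s.average + w_worst * s.worst


def choose(candidates: dict[str, Scenario], min_worst: float) -> str:
    """Pick the highest-scoring candidate whose worst case is tolerable."""
    viable = {name: s for name, s in candidates.items() if s.worst >= min_worst}
    if not viable:
        raise ValueError("no candidate has a tolerable worst case")
    return max(viable, key=lambda name: weighted_score(viable[name]))


# Hypothetical payoffs on an arbitrary scale.
candidates = {
    "build in-house": Scenario(best=9, average=5, worst=-4),
    "adopt managed service": Scenario(best=7, average=6, worst=2),
    "do nothing": Scenario(best=2, average=0, worst=-1),
}
print(choose(candidates, min_worst=-2))
```

Writing the scenarios down like this doesn’t make the estimates more accurate, but it forces the reasoning to be explicit enough for other people to challenge.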

Modeling the Problem

Before we work on the architecture and implementation, it’s best to take some time and think about the problem from various perspectives first.

It’s tempting to immediately start working on the first obvious solution that comes to mind, or even just to hack our way through the problem by simply starting to code and letting the solution come naturally. But if we’re not careful, we might approach the problem in a suboptimal way and miss some better solutions that could give a better payoff in the long run, or we might even approach the problem in the wrong way and cause bigger problems in the future.

In order to model the problem, we need to start by defining how we expect the system to behave.

What inputs does it need to handle?
What outputs does it need to produce?
What states does it need to keep track of?

From the previous points, we can define the system as one big function F that accepts inputs from the input space I = {i_1, i_2, i_3, …, i_n} and produces outputs from the output space O = {o_1, o_2, o_3, …, o_m}, and (if the system is stateful) reads and updates states from the state space S = {s_1, s_2, s_3, …, s_k}. The three spaces don’t need to have the same size.

This function F is the core of the system. If the expected processing from the function F is still too complex for us to define the system’s input-output relation, we might need to break down F into smaller functions. For example, we can have F(i) = B(i) + C(D(i)) where we need to define functions B, C, and D which are components of F — to get the result of F, the results of B, C, and D need to be combined.
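As a minimal sketch of the decomposition F(i) = B(i) + C(D(i)), the components below are hypothetical placeholders; what matters is only that each one is small enough to define and test on its own before being combined.

```python
def d(i: float) -> float:
    """Hypothetical component D: normalize the raw input."""
    return abs(i)


def c(x: float) -> float:
    """Hypothetical component C: transform the result of D."""
    return 2 * x


def b(i: float) -> float:
    """Hypothetical component B: a contribution computed directly from the input."""
    return i + 1


def f(i: float) -> float:
    """F(i) = B(i) + C(D(i)): combine the results of the components."""
    return b(i) + c(d(i))


print(f(-3))
```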

For a stateless system, we can consider that o = F(i) where o is the output of F and i is the input. If it’s stateful, we can add the state to the input and output such as (o, s_o) = F(i, s_i) where o is the output, s_o is the new state after F is performed, i is the input, and s_i is the state before F is performed. Now, we just need to define the expected range of the input, output, and state spaces and how the function F transforms the input into output.
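To make the stateful form (o, s_o) = F(i, s_i) concrete, here is a minimal sketch using a hypothetical access-management function. The `Request` type, the set-of-grants state, and the output strings are assumptions made for illustration, not a description of any real system.

```python
from typing import FrozenSet, NamedTuple, Tuple


class Request(NamedTuple):
    """Input i: a hypothetical access request."""
    action: str    # "grant" or "revoke"
    user: str
    resource: str


# State s: the set of (user, resource) pairs that currently have access.
State = FrozenSet[Tuple[str, str]]


def f(i: Request, s_i: State) -> Tuple[str, State]:
    """(o, s_o) = F(i, s_i): return the output and the new state.

    The function never mutates s_i, so every transition can be replayed.
    """
    key = (i.user, i.resource)
    if i.action == "grant":
        if key in s_i:
            return "already granted", s_i
        return "granted", s_i | {key}
    if i.action == "revoke":
        if key not in s_i:
            return "nothing to revoke", s_i
        return "revoked", s_i - {key}
    return "rejected: unknown action", s_i


state: State = frozenset()
out1, state = f(Request("grant", "alice", "db-prod"), state)
out2, state = f(Request("grant", "alice", "db-prod"), state)
out3, state = f(Request("revoke", "alice", "db-prod"), state)
print(out1, out2, out3, sep=" | ")
```

Keeping the state as an explicit argument and return value makes it easy to enumerate which (input, state) pairs need a defined behavior.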

Note that if there are several core functionalities in the system we can model it as multiple functions, for example, functions F, G, and H.

Modeling it this way allows us to focus on the core structures and behaviors of the system without getting bogged down too much by the implementation details. The system’s core modules should be modeled to be as simple and as predictable as possible. Ideally, it should also be extendable in order to accommodate possible new functionalities in the system.

Once we have modeled the core functionality, we can proceed to think about the input validation, data access policy, and the interfaces the system should be interacting with.

What interfaces does it use to take the input?
What validation methods do we need to use on the input?
What interfaces does it use to deliver the output?
Who should receive the output?
…and some other questions that might be relevant to the requirements.

These parts can also be modeled in the exact same way to help us define the problem while structuring it.
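For instance, input validation can be modeled as its own function V that maps a raw input either to a valid input or to a rejection. The sketch below assumes a hypothetical access request with three required text fields; the field names and the allowed actions are illustrative only.

```python
from typing import Optional, Tuple

ALLOWED_ACTIONS = {"grant", "revoke"}  # hypothetical set of supported actions
REQUIRED_FIELDS = ("action", "user", "resource")  # hypothetical input shape


def validate(raw: dict) -> Tuple[Optional[dict], Optional[str]]:
    """V(raw) -> (valid_input, error); exactly one of the two is None."""
    cleaned = {}
    for field in REQUIRED_FIELDS:
        value = raw.get(field)
        if not isinstance(value, str) or not value.strip():
            return None, f"missing or empty field: {field}"
        cleaned[field] = value.strip()
    if cleaned["action"] not in ALLOWED_ACTIONS:
        return None, f"unsupported action: {cleaned['action']}"
    return cleaned, None


print(validate({"action": " grant ", "user": "alice", "resource": "db-prod"}))
print(validate({"action": "delete", "user": "alice", "resource": "db-prod"}))
```

Defining V separately keeps the core function free of malformed inputs, so its input space stays small and well defined.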

The resulting model should give us a good hint on what the system can fully automate and what the system will need human operators for.

What we can model as a simple function with a well-defined rule for its input-output mapping should be quite easy to implement as a software function. We don’t have any justification to have human operators for this kind of function, as implementing it as a software function will most definitely be cheaper to maintain and more scalable. Human actors might be needed to trigger the function, but the trigger action should be as effortless as possible.
What requires relatively simple statistical modeling to implement as a software function might be automatable using a machine learning-based approach, but this is probably not something we’d like to pursue very early in the system’s lifetime, especially if there are no data points from its actual operations that we can use as a reference. This might be something that we need human operators for, especially in the very early stages. If this is something we’d like to automate, we should wait until we have enough data to build a good enough model.
What requires extremely complex statistical modeling to implement as a software function, possibly with state-of-the-art deep learning and very expensive computing resources, is probably something we’d prefer to leave to human operators. The exception is when automating the function is the problem we’re trying to solve in the first place, and we’re confident enough that it is a good investment for the organization according to the problem we’ve identified in the previous phase.

By the way, this modeling approach can also be applied when we’re trying to understand and debug an existing system, as modeling the system’s supposed behavior and then matching the inputs, outputs, and states of various modeled scenarios to the actual system can help us identify where the system breaks away from its expected behavior.
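A minimal sketch of this debugging approach: run the same scenarios through the model and through the actual system, then list where they disagree. The retry-delay example and the deliberately buggy stand-in for the real system are hypothetical.

```python
from typing import Callable, Iterable, List, Tuple, TypeVar

I = TypeVar("I")
O = TypeVar("O")


def find_divergences(model: Callable[[I], O], actual: Callable[[I], O],
                     scenarios: Iterable[I]) -> List[Tuple[I, O, O]]:
    """Return (input, expected, observed) wherever actual differs from model."""
    divergences = []
    for i in scenarios:
        expected, observed = model(i), actual(i)
        if expected != observed:
            divergences.append((i, expected, observed))
    return divergences


def modeled_delay(attempt: int) -> int:
    """Hypothetical model: the retry delay doubles per attempt, capped at 60 s."""
    return min(2 ** attempt, 60)


def actual_delay(attempt: int) -> int:
    """Stand-in for the real system, with a deliberate bug: the cap is missing."""
    return 2 ** attempt


print(find_divergences(modeled_delay, actual_delay, range(8)))
```

The inputs where the two disagree point to the part of the system whose behavior has drifted from the model.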

Preparing the Implementation Plan

After we model the system, we now need to create our implementation plan as a guideline for us to implement what we’ve previously modeled. This is basically where we turn the model we’ve already made, either formally or mentally, into a formalized set of requirements for the implementation.

The plan should cover the general breakdown of the system’s components, what existing tools and platforms we can leverage, which parts we need to integrate ourselves, and how the components are interacting with each other. This plan may include a proposal to use existing technology products we can readily use to solve our problems if there’s already one — given that the cost is reasonable for our case. If not, we need to identify which components of the system can use an existing technology product and which components can’t.

Cloud platform services like Google Pub/Sub and Google Cloud Storage, and mature database management systems such as PostgreSQL, should count as readily available technology products we can leverage here.

If the organization already has development standards for the programming languages, development framework, and the overall technology infrastructure components, just follow that. If we need to use something different from the standard, state why it is needed, what is the justification for choosing it over the ones we already have in the standard (if something listed in our standard does the same thing), and how much it’s going to cost us in terms of maintenance and incorporating it into the standard.

The components that we should build ourselves must also be described, mainly in terms of what functions we need and how those functions work. The expected input, output, and state spaces we’ve defined for these functions during the modeling phase can be put here.

If there’s already an existing implementation of the functions we’re trying to implement, unless we have concerns regarding the existing implementation, just use that instead of implementing the same thing ourselves. That’s going to save us time as we can avoid the work of implementing it, and we can also avoid the cost of maintaining the implementation ourselves.

Having visual aids like diagrams when describing how the system should work as a whole, how the components interact with each other, and how it should behave in certain scenarios is a must to help us better understand what we’re trying to build.

For the implementation plan to be robust, it’s a good idea to build prototypes and test the feasibility of the plan while writing it. For example, if we’re considering the use of HashiCorp Vault in the implementation and we’ve never really used HashiCorp Vault before, it would be nice to also put the references to the relevant documentation and set up HashiCorp Vault on our development environment to verify if the scenario we’re expecting to carry out would work with the setup we’re expecting to use later in production.

The resulting document should contain the breakdown of the system components, along with the specs of the system’s behaviors and performance requirements. At Cermati, we usually create a technical design document (TDD) for our engineering initiatives to lay out the details of the implementation plans — basically everything I’ve mentioned in this section — to be reviewed by our fellow engineers and the engineering management.

Implementing the System

After the plan has been reviewed and approved, we can proceed to implement the system. This part should be obvious enough, as we only need to implement the system according to the specs we’ve defined.

If the implementation plan we previously created is already well-tested and properly reviewed, the actual execution shouldn’t be that much of a problem. It might be the easiest part of the work.

Suppose we notice some mismatch between our expectations during the modeling and planning phase and the reality during the implementation. In that case, we might be missing something when working on the plan and it’s a good idea to go back to review it and check if the new information we got when working on the implementation can fit into the existing plan nicely. If not, we might need to adjust the plan — and the adjustment might be significant, depending on how far the reality is from what we assumed before.

Generally, during the software implementation phase, most of the decisions we’re going to make will involve the code structure, design patterns, conventions, and test coverage. Try to make the code and configuration as simple, clean, and maintainable as possible, while making sure the system performance, security, reliability, and business requirements are met.

Conclusion

In this article, I’ve tried to explain the various layers of technical decision-making and how to approach them. The closer we are to the implementation work, the simpler the trade-offs usually are, because the variables directly affected by our decisions mostly revolve around the reliability and scalability of the service runtime and the maintainability of the code. But in the stages prior to the implementation, there are a lot more considerations to be made.

Correctly identifying the technical problem to solve and aligning it with the organization’s business strategy is critical to ensure that we’re solving the right problem. We can design an extremely sophisticated system and develop elegantly implemented software all we want. But if what we’re building doesn’t address any of the organization’s concerns, then it wouldn’t be a good investment.

From the problem, we need to derive the goals to be achieved. The goals must be measurable and must be aligned with the business strategy. If we’re pursuing the wrong goals, it’s likely that our solution — while it works for the problem we’re trying to solve — won’t be very well-adopted and might be conflicting with some other goals the organization has.

After we have the goals defined, we can start modeling the problem. Modeling the problem can give us a sense of the system’s general structure and behavior. Problem modeling should be part of the system design phase, but I’ve noticed that even experienced engineers may neglect it. When they try to deliver fast and then improve the system iteratively, they proceed directly to the service and infrastructure architecture design using one of the first rough models they have in mind.

The approach of creating a very basic and somewhat rough working version of the system to be released as soon as possible, and then iteratively improving it, might be a good approach when delivering early product iterations of an early-stage startup for market validation and pivoting. But when we’re working in the domain of infrastructure services and platform architecture, where the interfaces and the expectations we have for each component of the system are already somewhat well-established and there’s no need to validate the product in the market (and no competition), I think it’s a better approach to properly model the problem before we start the design and implementation phase.

Given the model of the problem as the reference, we can start designing the system architecture and then proceed to the implementation as we usually do.