A Smart Programmer Understands The Problems Worth Fixing

The difference between solving any problem and the right one

A picture of two firefighters clearing the snow around a hydrant. In an emergency, the cost in lives of a snow-blocked hydrant far outweighs the cost of clearing every hydrant beforehand. Clearing a hydrant after a snowfall is a problem worth solving.

This is the story of Peter. Peter is a programmer who can do anything. He can create software as well as any of his peers. However, there's a difference between a programmer with experience and one without, even when both have the same technical skills.

How's that possible?

Peter has built a great distributed online booking system for a shop. On the "Available Time Screen," the user chooses one of the times that are still available to book. Mary, the operator of the system, uses the software that Peter wrote. She is responsible for receiving the next person in line and providing the service inside the shop.

A diagram showing the Browser, one service for checking the Available Time and one service to Book. The Browser initiates the booking process on the Available Time Service then redirects the user to the Booking Service to confirm and finalize. The Booking Service notifies the operator when the booking is complete.

However, the booking system has a problem. Although the state of the system updates immediately on the server, the user has to refresh the page or reenter the website to see updates to the available times. So if two people are looking at the "Available Time Screen" simultaneously, they can create the same booking at the same time. If that happens on a busy day, Mary has to sort out the conflict in the shop, which frustrates her and makes the customers unhappy with the service.

Peter is smart and wants to fix it. He plans to develop an in-memory system that updates the "Available Time Screen" instantly and sends a push message to the browser so that it can update the UI in real time. In case that's not fast enough, the code also checks for a duplicate right before finishing the booking. If the user tries to create a duplicate booking before the update arrives, the system shows an error.
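The last-moment check Peter describes could look roughly like the sketch below. Everything in it (`BookingService`, `DuplicateBookingError`, the slot strings) is hypothetical, since the story never shows Peter's code; it only illustrates the idea of verifying the slot right before finishing the booking.

```python
class DuplicateBookingError(Exception):
    """Raised when the requested slot is already taken (hypothetical name)."""


class BookingService:
    """In-memory stand-in for the Booking Service; not Peter's actual code."""

    def __init__(self):
        self._booked = {}  # slot -> customer

    def available(self, slots):
        """Slots not yet booked, as the "Available Time Screen" would list them."""
        return [slot for slot in slots if slot not in self._booked]

    def book(self, slot, customer):
        # The check right before finishing the booking.
        if slot in self._booked:
            raise DuplicateBookingError(f"{slot} is already booked")
        self._booked[slot] = customer


service = BookingService()
service.book("10:00", "alice")

try:
    service.book("10:00", "bob")  # bob's screen was stale
    error_shown = False
except DuplicateBookingError:
    error_shown = True  # bob sees an error instead of a double booking
```

In a single process like this one the check always holds. The rest of the story is about what happens when it has to hold across several computers.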

The same diagram as before. However, now instead of the Booking Service notifying only the operator when the booking is complete, the Booking Service also notifies the browser using a WebSocket.

Mary is starting to get frustrated. Peter is taking too much time to deliver this.

After two months, Peter delivers the improvement. After a few days, Mary notices that there are no duplicate bookings anymore.

Peter considers the job done and moves on to the next thing.

Peter is smart. He can solve anything as well as anybody else!

Two months later, Mary calls Peter again. The booking system is working well. However, three users out of a few thousand complained that they got a duplicate-booking error when trying to book. Peter thought that the real-time, in-memory update of the booking system would be sufficient. It clearly wasn't.

After a few hours of debugging and many days of analysis due to the complex code and validations surrounding the duplicated booking check, Peter can't figure out what happened. He doesn't know how to reproduce the problem because there are too many possible conditions. The software is too complex! Therefore, he decides to add extra logging that will provide more debug information when the error occurs again.

Six weeks later, another user managed to bypass the validation and create a duplicate booking. Although it didn't happen on a busy day, Mary only noticed the issue when both customers showed up at the same time.

The logs didn't show anything useful, only what everyone already knew: a small number of people are able to bypass the validation and create duplicate bookings, and it's impossible to reproduce the problem.

Peter believed he could solve everything, but he didn't. The difference is that he only noticed too late.

Ten years later, I asked Peter what he learned from this episode with the booking system.

Here's what Peter said:

I learned from my colleague, Jasmin, that distributed systems have some hard technical limitations. A constraint such as preventing duplicate bookings can't be enforced efficiently without a significant cost. There are multiple computers involved, network latency, etc. When you work with a distributed booking system at scale, you need to accept that two people will sometimes try to do the same thing at the same time. If there's a business imperative for that not to happen, like no duplicate bookings, you need to take that limitation to the business stakeholders and understand how they want to solve it.
If you allow the program to create the duplicate but give the operators of the system a process to fix the problem themselves, the user who is booking won't get a bad experience like an error. Once the cost for the operators to fix incorrect bookings becomes too high, then you'll have a good problem to solve. In the meantime, you can spend your programming efforts on more important things.
It doesn't make sense to invest your energy before you have enough evidence that the cost pays off.
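One way to picture the limitation Peter describes: if two instances of a booking service each validate against their own copy of the state, and the copies sync with a delay, both can accept the same slot. The sketch below is a deliberately simplified, single-process simulation. The `Replica` class and the `sync` step are assumptions made for illustration, and the story never establishes what actually caused Peter's duplicates.

```python
class Replica:
    """One instance of a booking service with its own local view (hypothetical)."""

    def __init__(self, name):
        self.name = name
        self.bookings = []  # (slot, customer) pairs this replica knows about

    def book(self, slot, customer):
        # Validation only sees the local view, which may be stale.
        if any(taken == slot for taken, _ in self.bookings):
            return False
        self.bookings.append((slot, customer))
        return True


def sync(a, b):
    """Simulate delayed replication: both replicas end up with every booking."""
    merged = sorted(set(a.bookings) | set(b.bookings))
    a.bookings = list(merged)
    b.bookings = list(merged)


east, west = Replica("east"), Replica("west")

# Both requests arrive before the replicas have exchanged state.
accepted_east = east.book("10:00", "alice")
accepted_west = west.book("10:00", "bob")
sync(east, west)

slots_taken = [slot for slot, _ in east.bookings]
```

Both checks pass, and the duplicate only becomes visible after the sync. Closing that window typically means routing every booking through some shared point of coordination, which is where the significant cost comes from.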

Peter went on to give an example:

For example, the server might not send a notification to the user until the operator confirms. If the operator has a UI that flags a duplicated booking, they can talk to both users and reschedule one of them. That happens hours or even days before they come to the shop.
Once the cost to reschedule becomes too big, you can automate the process.
For example, you can create a system that looks at the database and lists all the bookings that are duplicated. Then, you write code to send a notification to interested parties that are able to solve that duplication manually, perhaps even another department.
If the cost to solve the duplication manually becomes too big, then you can create a system that automatically sends an SMS or an e-mail to the user saying "Sorry, we can't serve you at 10AM, please book another time". You trade a considerable cost to the operator for a small inconvenience to the user.
If the cost to the user is unacceptable, then you can explore more complicated ways to validate.
This way, at every step, you spend your time writing software that solves only the right problem. You don't waste your time trying to solve a problem that doesn't exist.

Peter learned that it is more valuable to lift some technical limitations into business scenarios and develop incrementally instead of trying to solve all the problems in one go.
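Two of the incremental steps Peter lists, finding the duplicated bookings and later drafting a message to the affected user, can be sketched in a few lines. This is a minimal illustration under assumptions: bookings are plain `(slot, customer)` pairs rather than database rows, the function names are made up, and keeping the earliest booking in each slot is an assumed policy that the story does not specify.

```python
from collections import defaultdict


def find_duplicates(bookings):
    """Group bookings by slot and keep only the slots booked more than once.

    `bookings` is a list of (slot, customer) pairs in the order they were made.
    """
    by_slot = defaultdict(list)
    for slot, customer in bookings:
        by_slot[slot].append(customer)
    return {slot: names for slot, names in by_slot.items() if len(names) > 1}


def draft_apologies(duplicates):
    """Draft a message for everyone except the first customer in each slot.

    Keeping the earliest booking is an assumed policy, not one from the story.
    """
    messages = []
    for slot, customers in duplicates.items():
        for customer in customers[1:]:
            text = f"Sorry, we can't serve you at {slot}, please book another time"
            messages.append((customer, text))
    return messages


bookings = [("10AM", "alice"), ("11AM", "carol"), ("10AM", "bob")]

# Step 1: a list the operator (or another department) can act on manually.
duplicates = find_duplicates(bookings)

# Step 2: only worth building once fixing duplicates by hand costs too much.
messages = draft_apologies(duplicates)
```

The first function is enough to support the manual process. The second one only needs to exist once the evidence shows that the manual process is too expensive.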

Communication between multiple computers works like communication in the universe does. It's messy, complex, and asynchronous, and events happen in ways you least expect. It doesn’t work like a perfect, synchronous operation where every event occurs one after the other.

Before trying to fix every problem, focus on the problems that matter. If a requirement takes too much effort to satisfy, handle its failure cases as valid business scenarios. Once you have enough evidence that it's worth investing the time and effort to fix them, then you'll have a good problem to solve.

Peter thought he was smart at that time, but it took many years of experience for him to understand that …

… a smart programmer is not the one who fixes all the problems, it's the one who understands which problems are worth fixing.

It’s natural to think this is the job of the "Product Manager," but it is your job. Next time you have a business problem to solve, look at the edge cases and see if you can turn those problems into a valid business scenario.

This is an industry where algorithmic interviews and segregation of roles are the norm. What we need are programmers who understand the value they bring and look at the tradeoffs instead of naively pretending they can fix everything.

You may realize in a few months or years that the code you're writing today is a mistake. At that time, you'll learn the same thing that Peter did. However, learning from your own mistakes means that you need to make them in the first place.

It's better if you don't need to.

It's better if you can learn from Peter.


Thanks for reading. If you have any feedback, reach out to me on Twitter, Facebook or GitHub.

Thanks to Ian Tinsley and Gustavo Henke for their insightful inputs to this post.