XY Problem in Software Engineering

9 min read · Sep 6, 2020

The XY problem is a communication problem in which the person asking a question fails to communicate the actual problem clearly. It arises when the questioner needs to do X, believes Y is the way to achieve X, and therefore asks how to do Y when they should really be asking how to solve X. Without the context of X, people who want to help can only suggest how to solve Y. Meanwhile, there may be a solution Z that solves X more efficiently.

XY problems occur every day in companies. Their effect can range from negligible to draining the engineering team’s resources for years. The following case studies from real life help illustrate the problem.

Case Study: MySQL for Search

In the early stages of a startup with thousands of users, it makes sense to go with the easiest solution that solves the problem at hand and keeps the business up and running. However, that solution may not scale well when the business grows to millions of users.

One example is when we used MySQL to search customers and merchants. Since all the data was already in MySQL, the easiest solution was to run a query on the tables:

SELECT * FROM shop WHERE shop.name LIKE '%q%' OR shop.address LIKE '%q%'

With thousands of rows and even without indexing, this returns results in a reasonable amount of time that matches UX expectations.
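To see why this kind of query degrades as rows accumulate, here is a minimal sketch. It uses Python's built-in sqlite3 module as a stand-in for MySQL (an assumption made so the example runs anywhere), and the sample rows and index name are hypothetical. A pattern that starts with a wildcard generally cannot use an ordinary index, so the engine reads every row, while an exact match on the same column can use the index.

```python
import sqlite3

# In-memory SQLite database as a stand-in for MySQL (assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE shop (id INTEGER PRIMARY KEY, name TEXT, address TEXT)")
conn.execute("CREATE INDEX idx_shop_name ON shop (name)")  # hypothetical index
conn.executemany(
    "INSERT INTO shop (name, address) VALUES (?, ?)",
    [("Blue Bottle Cafe", "1 Main St"), ("Corner Bakery", "22 Oak Ave")],
)


def query_plan(sql, params):
    """Return the plan steps SQLite reports for a query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return [row[3] for row in rows]  # the fourth column holds the description


# Substring search: the leading % rules out an index lookup on name.
like_plan = query_plan(
    "SELECT * FROM shop WHERE name LIKE ? OR address LIKE ?",
    ("%cafe%", "%cafe%"),
)

# Exact match: the index on name can be used directly.
exact_plan = query_plan("SELECT * FROM shop WHERE name = ?", ("Corner Bakery",))

print(like_plan)   # a full-table SCAN step
print(exact_plan)  # a SEARCH step using idx_shop_name
```

The exact wording of the plan differs between SQLite versions, and MySQL reports its plans in a different format through its own EXPLAIN statement, but the pattern is the same: the substring search has to touch every row, so its cost grows with the table.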

As the company grew, the number of rows grew too, and we faced a different problem: paging. A lot of thought went into implementing paging that returns the results closest to the user’s current location. This meant adding more criteria to the SQL, including some math on the user’s location. Since the logic was too complicated for the ORM to handle, we switched to native queries and eventually bypassed the ORM altogether.

When we added international shops to our repository, we faced a new problem: we also had to search the name and address in both English and the local language. Our SQL looked like this:

SELECT * FROM shop WHERE shop.name LIKE '%q%' OR shop.i18n_name LIKE '%q%' OR shop.address LIKE '%q%' OR shop.i18n_address LIKE '%q%' [plus paging and user location logic]

As more users and merchants were added to the system, the query became inefficient. MySQL connections were exhausted and many downstream failures followed. At that point we didn’t know that the growth in connections was caused by the inefficient search, so we focused on healing the database by adding replicas and moving to larger instances with more CPU and memory.

One group of engineers advocated switching to PostgreSQL because they believed it would perform better. Another group debated migrating to a different cloud provider. Both spent a lot of time prototyping and drafting plans to move to another data store and another provider.

At this point let’s take a step back and analyze the situation:

X: “We want to give the ability to the users to perform a search on existing merchants with partial data on the name or address.”

Y1: “How can I use SQL to search the merchant data?”
Y2: “How can I improve the SQL performance to get faster/better response?”

Z: “There are solutions like Elasticsearch that are built to solve this kind of problem.”
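To make Z a little more concrete, here is a toy sketch of the inverted-index idea that search engines of this kind are built around. It is not how Elasticsearch is implemented or called; the function names and sample shops are hypothetical, and matching is on whole words only. The point is that the lookup cost depends on the query terms rather than on scanning every row.

```python
from collections import defaultdict


def tokenize(text):
    """Lowercase a string and split it into word tokens."""
    return text.lower().split()


def build_index(shops):
    """Map each token to the ids of shops whose name or address contains it."""
    index = defaultdict(set)
    for shop_id, (name, address) in shops.items():
        for token in tokenize(name) + tokenize(address):
            index[token].add(shop_id)
    return index


def search(index, query):
    """Return the ids of shops that match every token in the query."""
    token_sets = [index.get(token, set()) for token in tokenize(query)]
    if not token_sets:
        return set()
    return set.intersection(*token_sets)


# Hypothetical sample data: id -> (name, address)
shops = {
    1: ("Blue Bottle Cafe", "1 Main St"),
    2: ("Corner Bakery", "22 Oak Ave"),
    3: ("Main Street Cafe", "5 Main St"),
}
index = build_index(shops)
print(search(index, "main cafe"))  # → {1, 3}
```

A real search engine layers text analysis, relevance ranking, and location-aware queries on top of this basic structure, which is the work we were reinventing piece by piece in SQL.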

Later, after a couple of months of planning and migration, we moved to Elasticsearch and got better and faster responses. Going down the wrong path left many months of tech debt in our code and cost the company many engineering hours and wasted resources.

Case Study: Scrum is not the goal

The XY problem can also show up in an engineering organization’s processes. Agile methodologies have been adopted by many companies. Scrum, a very popular one, is used in many startups to enable agility and fast feedback loops.

Earlier in my career, we practiced Scrum dogmatically. We focused on having all the ceremonies, SMART tickets, burn-down charts, and so on. One part we became obsessed with was ticket estimation. We iterated through different methods:

1 story point = half a day of work, 2 story points = a day of work, …
1/2 story point = half a day of work, 1 story point = a day of work, …
Complexity-based Fibonacci V1 (1,2,3,5)
Complexity-based Fibonacci V2 (1,2,3,5,8)
Software-based online poker planning
Physical cards poker planning

The obsession was so strong that a good chunk of each meeting went to making sure the estimates were right. Every day we held a confidence vote to see how confident people were that they could finish the sprint without any leftovers. We used elaborate math to look at the previous N sprints, average the story points, and predict the future.

The dogmatic Scrum advocates called a sprint a failure if even one ticket slipped, and held hours of discussion and postmortem analysis on why it had happened. The outcome of these analyses was either a new way of doing Scrum or a change to the math.
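The kind of prediction described here can be sketched as a rolling average of completed story points. The actual formula we used is not recorded, so the function name, the window size, and the sample numbers below are all assumptions made for illustration.

```python
def predict_velocity(completed_points, window):
    """Average the story points completed over the last `window` sprints."""
    if window <= 0:
        raise ValueError("window must be positive")
    recent = completed_points[-window:]
    if not recent:
        raise ValueError("no sprint history available")
    return sum(recent) / len(recent)


history = [21, 18, 25, 20, 16]  # hypothetical points completed per sprint
forecast = predict_velocity(history, window=3)
print(round(forecast, 1))  # → 20.3
```

However refined the arithmetic gets, it only answers Y (how accurate is the estimate) and says nothing about X (whether the team is shipping the right things on time).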

After a while the performance and output of the teams decreased significantly. Later we analyzed this and found out the following:

Engineers were more worried about the Scrum ceremonies than their own tasks. They didn’t want to be the reason a sprint failed.
Engineers started to overestimate their tickets to leave more room for error.
Engineers tended to avoid work outside their sprint, such as helping another team, onboarding new engineers, or participating in interviews.

At this point the Scrum advocates were mostly satisfied because they had their perfect planning, but it came at the cost of lower engineering performance and higher stress for engineers.

Let’s analyze this situation:

X: “We want to ship our products and services on time with efficient utilization of our engineering workforce.”

Y: “How can we improve the accuracy of sprint estimates?”

Later the company moved toward the OKR framework and used that to measure the success of teams and individuals. The team started to adopt a standard way of doing sprints across the board with less stress on making it a perfect sprint.

How do we end up with an XY problem?

Mostly we end up with an XY problem without realizing it, and may only understand it years later. We have limited resources and can’t always afford to delay a change for a comprehensive investigation. However, some factors make falling into this trap more likely.

Small wins culture

Suppose there is an outage caused by thread pool exhaustion. The short-term fix is a one-line change to increase the thread pool size, while the long-term fix is to investigate a possibly flawed architecture, which may take a couple of weeks. Many engineers will choose the first option because it brings an early win and makes them look like heroes.

Alpha geek domination

Alpha geek is a slang term for the most tech savvy person within a group. Once identified, an alpha geek becomes the go-to for all problems, issues and advice when it comes to technology.

Every company has, at some level, alpha geeks who earned their reputation by doing an amazing job over the years and owning a major part of the critical systems at some point in their tenure. Engineers look up to these folks to answer most of their technical questions or review their designs.

Being an alpha geek is not bad, and companies in their early stages usually depend on these folks. However, as discussed earlier, the XY problem is a communication problem in which the questioner cannot communicate the actual problem. Alpha geeks act the same way as anyone else in this situation: they make the best judgment they can with the information they are given. But because they are almost always right, it is hard for others to question their judgment. This usually leads other engineers to step back and let the alpha geeks decide what to do.

Resource scarcity

In the early stages of a startup, the goal is to survive and iterate over features quickly to bring in more revenue. In these situations, the more senior engineers are occupied with mission-critical projects, and there are not enough engineers with adequate knowledge to dig into the real problem. In these cases, the easiest solution is the one that gets picked.

Complexity

A complex system that works is invariably found to have evolved from a simple system that worked. A complex system designed from scratch never works and cannot be patched up to make it work. You have to start over, beginning with a working simple system. — John Gall

All systems eventually become complex and, depending on the engineering team, get broken down into one or more simple systems again. As the company grows and technical debt is not paid down in time, the system gets harder to simplify. The tendency in this case is to keep the system working until the right moment comes for refactoring and paying off the debt.

It is hard to analyze a system when it is complex. The time an engineer spends diving deep to find the real issue could be spent elsewhere on a user-facing feature. From the engineer’s perspective, diving into a complex legacy system is not rewarding, so the choice is to address a simpler version of the problem at hand.

Lack of diversity

If a team lacks diversity, its members tend to think about a problem the same way and quickly agree on a common solution. Diversity, on the other hand, brings different points of view to a problem. Looking at the bigger picture or at the problem from a different angle may surface a simpler solution.

How can we prevent falling into the trap?

We can’t eliminate this completely. It is part of the risk in our day-to-day decision making. However, we can help the organization become more aware and take other possible solutions to a problem into account. This shouldn’t be a burden on individuals but a culture that is developed over time. Every individual in an organization can make a difference.

As an individual contributor

Go deeper into the problems and think outside of the box. The problem that is given to you to solve is not necessarily the root problem. Don’t shy away from asking your teammates, your manager and other people in the organization about the root problem.
When framing a problem, provide as much context as possible. Use the Five Whys technique when analyzing a problem to get closer to its root and solve it efficiently.
Don’t get discouraged if you can’t convince your team to pursue the actual problem. Your team may have time constraints and want an easy patch for today’s problem. Create a ticket to follow up later, but let the team know this needs to be solved.
Work on your presentation skills and how you frame the problem. You can change the way people think about the problem by providing convincing facts. Always back your facts with verifiable data.

As a manager

Look to build diversity in the team. People with different mindsets bring unique value and can help the team improve.
Avoid the common information effect. Look for unique information within your team. If people agree on a solution, it doesn’t mean it is the right path.
Encourage your team members to go beyond the current problem and focus on the root cause.
Prevent alpha geeks from dominating discussions. In team discussions, encourage all members to chime in and provide their input.
Encourage simplicity in the team’s designs. When a system is simple, it is easier to find issues in it and to trace problems to their root causes.
If resource constraints lead you to go with a different solution, communicate it clearly to the team. Let the team know that you acknowledge their effort to find the actual solution but that, because of the current constraints, you have decided to focus on Y instead of X.

As a senior leadership team member

Encourage a postmortem culture in your organization. When an incident happens, have individuals dig deep in the root cause analysis to find the actual problem to solve.
Reward innovation and thinking outside of the box. People in the organization are more likely to invest time digging into a problem if they know it is part of the company’s culture.
Set goals with the teams in a way that describes the ideal state without pushing any specific solution. For example, instead of “Increase user engagement by replacing our current email campaign tool from X to Y”, you can set “Increase user engagement by 20%” and let the team define how they want to get there.
Embrace OKR. If implemented correctly, each individual’s contribution can be clearly seen as a contribution to the company’s overall objectives.