21 principles for systematically solving problems in computer science
Let’s say you wanted to:
- (a) Design a Kafka alternative that was 10x cheaper
- (b) Figure out how to store the entire Internet on a mobile device
- (c) Invent the next generation of IDEs
Is there a systematic way to go about these problems?
In this article I’ll go over several principles I’ve come up with.
Imagine the solution
If you don’t know what the solution looks like, you’ll need to figure that out first.
For many problems the solution is clear, like designing a 10x cheaper Kafka or storing the whole Internet on a device.
But if you want to invent the next generation of IDEs, it’s not. You have to imagine what functionality would make them a step change better than what exists now.
Other examples include Google aiming to index the whole Internet and VMware virtualizing servers.
1. One way to imagine what’s next is to consider the job to be done: how can it be best accomplished? Discard preconceived assumptions rooted in how things have always been done.
In the IDE example, a next generation IDE might not allow humans to enter code at all. Instead, it might be a set of tools for a human to manage a fleet of cloud-based LLMs which write the code for you.
An IDE’s purpose was never to let you write code; it was to let you develop software. It doesn’t matter if it accomplishes that by being a programming language editor or a code LLM manager.
2. Another way to imagine what’s next is to think about where things will be in 25 years. Or, what does the ideal service look like?
In the IDE example, the ideal service might be that I can enter a description of my problem, the service will ask me a few relevant questions, and then generate a design I can tweak. Once I approve, it will develop and deploy the software component for me. I can ask it to modify the artifacts at any time using plain English.
3. A third approach is to see how you can improve an existing technology along some axis meaningful to the end user by an order of magnitude.
Essentially, you take a major pain point that users have — like Kafka clusters being too resource intensive or difficult to operate — and reduce it by 10x, enough that the change is palpable.
Redefine the problem
4. Relax one assumption at a time from the problem statement (to make the problem easier)
Many problems are too hard to solve on their own.
However, by solving an easier, related problem, you can make some progress on the original one. In the best case, it might give you insights into solving the original problem.
Let’s consider the initial example (b). We want to store the entire Internet on a mobile device so galactic colonists can browse it from other planets.
Considering the Internet’s size is 64 trillion gigabytes, and the largest iPhone stores only 1 TB of data, this problem seems hopeless. But other problems are not:
- “entire” — can we store a subset of the Internet?
- “Internet” — can we store a curated set of relevant data rather than the whole Internet?
- “mobile device” — can we use another device with more storage instead?
A more promising problem statement might be: “Can we store a curated subset of Internet data on a hardware device (like an NFS drive) that colonists can keep? If possible, let colonists connect their phones to this drive to browse files.”
5. Add and remove constraints (to make the problem easier)
Taking the Kafka example, we could remove a constraint: Can we make Kafka 10x cheaper if we could accept less durability?
While not ideal, we would at least offer some low cost alternative to potential users, where before there was none. And for many users it might be perfectly suitable.
Down the line, we might find a way to solve the problem and remove the constraint.
Or, we could constrain the input: can we make Kafka 10x cheaper if we were given a network that guarantees packet delivery and ordering? Such a network isn’t possible, but it’s a starting point on what might otherwise be an intractable problem.
6. Move up the abstraction ladder: What’s the real problem we want to solve?
Maybe the real purpose behind storing the entire Internet on a mobile device is education: So colonists can figure out how to farm on a foreign planet. In that case, we would be better off just printing out the relevant information and concentrating on more pressing challenges.
Or maybe the purpose is entertainment, in which case we can download a large number of TV shows and movies to the colonists’ devices for offline viewing before they depart.
With this approach, we can see if there is another, easier and also more effective way of solving the same problem.
7. Perform root cause analysis
It’s important not to confuse how the problem is presented with what it actually is.
For example, a user might say “your web app is slow” but it might actually be a problem with their Internet connection. The technique in “Diagnose the problem” below of creating a MECE tree breaking down the problem and then ruling out each sub-tree can help here.
Or, our widget defect rate might be high, because our QA team is not available, because their offices are closed, because the building’s maintenance staff are on strike. A simple 5 Whys interrogation would suffice here.
8. Restate the problem in a specific and measurable manner
We can make our problem SMART: Specific, Measurable, Attainable, Relevant, and Time-bound. We can use numbers as much as possible when restating our problem.
As an example, we can redefine the Internet problem as: “Devise a way to store 64 trillion GB of data on a 1 TB mobile device.” Obviously this is still too difficult, so we’ll have to apply steps (4) and (5) to make the problem easier.
9. Understand the problem’s context
For instance, if our API endpoint is unusually slow, we can ask:
- When did it start getting slower?
- Who is calling this API?
- What was the P95, P99 before this time? What is the P95, P99 now?
- What does the distribution of HTTP status codes look like before and after?
- Were there any code changes, config changes, infra changes, known outages around this time?
We ask questions like who, what, when, where, why to understand what’s going on before thinking of solutions.
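For instance, here’s a quick way to quantify the P95/P99 question using Python’s standard library (the latency samples below are made up for illustration):

```python
from statistics import quantiles

def p95_p99(latencies_ms):
    """Return the 95th and 99th percentile of a list of latency samples."""
    cuts = quantiles(latencies_ms, n=100)  # 99 cut points between percentiles
    return cuts[94], cuts[98]

# Made-up samples standing in for real measurements before/after the regression.
before = [120, 135, 110, 140, 125, 900, 130, 115, 128, 122] * 50
after = [310, 350, 295, 330, 2400, 340, 305, 360, 315, 325] * 50

print("before: P95=%.0f ms, P99=%.0f ms" % p95_p99(before))
print("after:  P95=%.0f ms, P99=%.0f ms" % p95_p99(after))
```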
Diagnose the problem
We should now have a problem that is somewhat solvable. We might be tempted to start brainstorming solutions.
Continuing the Kafka example, we might think to ourselves:
- What if we reduce the replication factor?
- What if we don’t persist to disk at all?
- (Among many other ideas)
We might think of some good ideas, we might not, and even the “good” ideas we think of might not move the needle much.
If you reduce the replication factor from the default of 3 to 1, you may have reduced cost by 3x, but not the 10x we seek. Not persisting to disk at all means you no longer have a viable Kafka alternative.
We need to be more systematic. Coming up with random ideas won’t cut it.
10. Break down our problem so we can grasp its biggest drivers. From this, identify the root problem(s) to be solved.
(Aside: This is not trivial; knowing which axes to break something down by requires skill. For instance, you need to know to break down latency on a node by thread state, or by on-CPU vs. off-CPU time.)
If we want to reduce Kafka costs by 10x, we first need to understand the drivers behind Kafka costs.
If we break down Kafka costs, we might find that 80% of the cost comes from EBS volumes (billed per GB stored), 15% from EC2 instances, and 5% from other sources. (Just for illustration’s sake.)
(If you would like, you can break down the drivers above into sub-drivers and keep going until you find major drivers you are able to move. Creating a MECE tree might help too.)
So, what makes the problem difficult is we need to somehow reduce our EBS storage costs by 10x. Reducing EC2 instances or anything else won’t be sufficient. This seems daunting.
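As a sketch, the breakdown and the 10x target can be made concrete with a few lines of arithmetic (all numbers are the hypothetical ones from above):

```python
# Hypothetical monthly costs matching the illustrative 80/15/5 split above.
costs = {"EBS volumes": 8000, "EC2 instances": 1500, "other": 500}
total = sum(costs.values())

for driver, cost in sorted(costs.items(), key=lambda kv: -kv[1]):
    print(f"{driver}: ${cost}/mo ({cost / total:.0%} of total)")

# For a 10x reduction the whole bill must fit in total/10, here $1,000/mo.
# Even deleting everything *except* EBS leaves $8,000/mo, so the EBS driver
# itself must shrink by roughly 10x.
print(f"target: ${total / 10:.0f}/mo")
```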
Sketch out designs
Now we can start coming up with potential designs for our new technology.
11. Apply patterns to solve sub-problems
Let’s continue with the Kafka example. We want to create a Kafka-like service which reduces cost by 10x. Moreover, we’ve determined that the monthly per-GB storage cost drives almost all of the total cost.
We can come up with a few designs from the patterns for reducing storage cost below:
- Cold(er) storage — We can request HDDs instead of SSDs from EBS
- Retention policies or sampling — But we can’t drop persisted messages
- Compression — We can use specialized compression algorithms and organize our data on disk to maximize our compression rate (see the sketch after this list)
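To illustrate the compression pattern, here’s a toy sketch showing how reorganizing data on disk (column-major instead of row-major, so similar bytes sit together) can improve the compression ratio. The records and layout are made up, and exact gains depend entirely on the data:

```python
import json, random, zlib

random.seed(0)
# Made-up log records: a timestamp plus a couple of low-cardinality fields.
rows = [{"ts": 1700000000 + i, "host": f"node-{random.randint(1, 5)}",
         "level": random.choice(["INFO", "WARN", "ERROR"])} for i in range(10_000)]

row_major = json.dumps(rows).encode()
# Columnar layout: store each field's values together so similar bytes are adjacent.
col_major = json.dumps({k: [r[k] for r in rows] for k in ("ts", "host", "level")}).encode()

print("row-major   :", len(zlib.compress(row_major, 9)), "bytes")
print("column-major:", len(zlib.compress(col_major, 9)), "bytes")
```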
Patterns can be gleaned from a formal education, experience, blog posts, or just looking them up.
Often one problem can be modeled as another; this is especially true in computer science. Think modeling course scheduling as graph coloring and solving it that way instead of writing a brute force backtracking solution.
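For instance, here’s a minimal greedy graph-coloring sketch for the course-scheduling example (the conflict graph is hypothetical, and greedy coloring isn’t optimal in general):

```python
def schedule(conflicts):
    """Greedy graph coloring: courses are nodes, an edge means the two courses
    share a student, and colors are time slots. Returns {course: slot}."""
    slots = {}
    # Color high-degree courses first (a common greedy heuristic).
    for course in sorted(conflicts, key=lambda c: -len(conflicts[c])):
        taken = {slots[n] for n in conflicts[course] if n in slots}
        slots[course] = next(s for s in range(len(conflicts)) if s not in taken)
    return slots

# Hypothetical conflict graph for four courses.
conflicts = {"math": {"physics", "cs"}, "physics": {"math"},
             "cs": {"math", "art"}, "art": {"cs"}}
print(schedule(conflicts))  # e.g. {'math': 0, 'cs': 1, 'physics': 1, 'art': 0}
```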
12. Study prior art
Studying what solutions exist so far can help a lot. In particular, borrowing solutions from state-of-the-art research can pay off, as industry is often well behind academia.
Almost every problem has already been solved. So, logically, it’s a matter of finding the equivalent problem and its solution (which may be in a different domain).
As an example, I was once having trouble designing a relational database schema for audit logs. So I looked at GitLab’s audit log table in their open source code, and that solved my problem too.
When studying prior art, aim to understand: What are the different ways people solve this problem? Then, what are each solution’s goals, how does it achieve them, and what are its limitations? From there, given your own problem’s goals, you can identify the solution which maps most closely to it.
13. Study the fundamentals
If you were designing Kafka, 90% of the design would just be straightforward application of the fundamentals.
Example: We need durability, so we need to replicate data across machines. Let’s say we go with single-leader, asynchronous replication with a typical n/2+1 quorum. We use Raft consensus for leader election.
The remaining 10% would then be challenging problems like guaranteeing message ordering.
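As a tiny illustration of the quorum arithmetic in that example:

```python
def majority_quorum(n):
    """Smallest number of acknowledgements that forms a majority of n replicas."""
    return n // 2 + 1

for n in (3, 5, 7):
    q = majority_quorum(n)
    print(f"{n} replicas: quorum = {q}, tolerates {n - q} failures")
```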
Fundamentals can be broad, like distributed systems, or more granular, like message brokers. You could write a basic Kafka clone from scratch so you properly understand how it works.
14. Apply first principles reasoning
There are really only two ways to reason:
- Examples -> conclusion (inductive)
- Facts and logic -> conclusion (deductive)
Patterns are inductive reasoning, while first principles reasoning is deductive reasoning.
In first principles you would start with facts such as:
- Sequential SSD writes take X microseconds
- Random SSD writes take Y microseconds
- Memory writes take Z microseconds
I don’t have the exact numbers at hand, but we all know Z is orders of magnitude less than X, which in turn is much less than Y.
So, naturally, if you want to create a program that writes data very quickly to disk, you want to find some way to write to memory first and flush to disk later. And when you do need to write to disk, you write in a sequential, append-only manner.
If you keep following this thought process, you might end up with something like an LSM tree, which is designed for fast writes.
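Here’s a toy sketch of that thought process as code: buffer writes in memory, then flush sorted runs to disk sequentially. A real LSM tree adds compaction, indexes, and crash recovery; this just shows the shape of the write path:

```python
import json

class TinyLSM:
    """A toy LSM-style store: buffer writes in memory, flush sorted runs to an
    append-only file. Real LSM trees add compaction, indexes, and recovery."""

    def __init__(self, path, memtable_limit=4):
        self.path, self.limit = path, memtable_limit
        self.memtable, self.runs = {}, []   # runs stand in for on-disk SSTables

    def put(self, key, value):
        self.memtable[key] = value          # fast path: a memory write
        if len(self.memtable) >= self.limit:
            self.flush()

    def flush(self):
        run = dict(sorted(self.memtable.items()))  # sorted run, like an SSTable
        with open(self.path, "a") as f:            # sequential, append-only write
            f.write(json.dumps(run) + "\n")
        self.runs.append(run)
        self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for run in reversed(self.runs):     # newest run wins on conflicts
            if key in run:
                return run[key]
        return None

db = TinyLSM("/tmp/tiny_lsm.log")
for i in range(10):
    db.put(f"k{i}", i)
print(db.get("k3"))  # 3
```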
15. Use different sub-technologies
Let’s practice first principles thinking further.
Some estimates when it comes to storage costs:
- On-premise hard drive — $10–15/TB (one-time, to buy)
- Backblaze B2 — $5/TB/mo
- Cloudflare R2 — $15/TB/mo
- Amazon S3 (Standard Tier) — $20/TB/mo
- EBS (SSD gp3) — $130/TB/mo
Now we can do the math: if you have a 10-node Kafka cluster and each node has 100 GB of EBS SSD space, that’s 1 TB in total, so you’re paying $130/month. If you manage to store that data in B2, you’re now paying $5/month, or 26x less!
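The same back-of-the-envelope math as a script (prices from the list above; I’m assuming decimal units, 1 TB = 1000 GB):

```python
prices_per_tb_month = {"Backblaze B2": 5, "Cloudflare R2": 15,
                       "S3 Standard": 20, "EBS gp3 SSD": 130}

nodes, gb_per_node = 10, 100
data_tb = nodes * gb_per_node / 1000   # 1 TB across the cluster

baseline = data_tb * prices_per_tb_month["EBS gp3 SSD"]
for store, per_tb in prices_per_tb_month.items():
    monthly = data_tb * per_tb
    note = "baseline" if store == "EBS gp3 SSD" else f"{baseline / monthly:.0f}x cheaper"
    print(f"{store}: ${monthly:.0f}/mo ({note})")
```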
What happened? The performance properties of any system are determined by the performance properties of the different pieces of technology it builds on.
Swapping out one piece of sub-technology for another will usually improve some performance properties of the system while damaging others. There is rarely a free lunch. Moreover, if there were, the new technology would already be widely deployed.
In this case, by using a (remote) object store instead of a (local) disk, read and write latency increases vastly. Given this new constraint, how can we keep delivering messages nearly as fast? This is another sub-problem we need to solve now, but if we can, our original problem of a 10x cheaper Kafka alternative is solved!
Suggestions I would add:
- It’s preferable to loosely know a large number of technologies. That way you can generate a lot of potential solutions and pick the best from a wider selection
- New technology will keep emerging; it can be beneficial to stay up to date. For instance, if you needed to optimize latency instead of cost, you could swap out S3 for AWS Wavelength (which lets you store files with your 5G provider)
- Embrace technology that’s more optimal, but has constraints. The most optimal technology is often like this. For instance, radix sort is faster than other sort algorithms, but only works on integers (or values that can be converted to integers).
- Question common practices, even if widespread. For example, ElasticSearch is fast because it indexes many fields in each document you ingest. However, due to this, ElasticSearch is both very disk and memory intensive. You may view indexing as non-negotiable — how else would you achieve full text search? However, Grafana Loki proves indexing is actually not necessary — brute force, parallel search works.
16. Model the problem using a known problem, algorithm, or data structure
As an example, Uber observed that the problem of scheduling push notifications is an optimization problem, and applied integer linear programming to solve it.
Typically problems in CS are either decision (yes/no), search (find a solution), or optimization (find the best solution).
Each type of problem has a well known set of techniques for tackling it. For instance, other optimization methods include genetic algorithms and ant colony optimization.
Often, if you state your problem abstractly, you can reduce it to a famous solved problem. For example, let’s say you are creating a tiering algorithm to move files between different storage classes in S3. You could restate the problem as:
I have 3 buckets, each with infinite capacity. Bucket 1 charges $a per item stored, bucket 2 charges $b, and bucket 3 charges $c, where a < b < c. Bucket 1 charges $d per access, bucket 2 charges $e, and bucket 3 charges $f, where d > e > f. How should I place items across the buckets to minimize cost?
Now, we can look into prior work describing the multiple knapsack problem’s optimal solution to help us with this problem.
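For a first pass, here’s a sketch that picks the cheapest bucket per item given its expected accesses; since capacities here are infinite, this per-item greedy choice already works, and the knapsack-style formulations become relevant once capacities or batched fees enter the picture. All prices are hypothetical:

```python
def cheapest_bucket(expected_accesses, storage_cost, access_cost):
    """Pick the bucket minimizing storage + expected access cost for one item."""
    total = [s + expected_accesses * a for s, a in zip(storage_cost, access_cost)]
    return total.index(min(total))

storage_cost = [0.004, 0.010, 0.020]  # $a < $b < $c per item per month (hypothetical)
access_cost = [0.010, 0.001, 0.000]   # $d > $e > $f per access (hypothetical)

for hits in (0, 1, 5, 50):
    bucket = cheapest_bucket(hits, storage_cost, access_cost) + 1
    print(f"{hits} expected accesses/mo -> bucket {bucket}")
```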
Finally, you can model your problem using a known data structure. Examples include abstract syntax trees, or graphs for maze solving. Once done, you get the data structure’s algorithms, like shortest paths, for “free” — you don’t need to reinvent them yourself.
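For example, modeling a maze as an implicit graph (cells are nodes, legal moves are edges) gives you BFS shortest paths for free:

```python
from collections import deque

def shortest_path(maze, start, goal):
    """BFS over the implicit grid graph; returns the shortest path or None."""
    queue, seen = deque([(start, [start])]), {start}
    while queue:
        (r, c), path = queue.popleft()
        if (r, c) == goal:
            return path
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < len(maze) and 0 <= nc < len(maze[0]) \
                    and maze[nr][nc] == 0 and (nr, nc) not in seen:
                seen.add((nr, nc))
                queue.append(((nr, nc), path + [(nr, nc)]))

maze = [[0, 0, 1],   # 0 = open cell, 1 = wall
        [1, 0, 1],
        [1, 0, 0]]
print(shortest_path(maze, (0, 0), (2, 2)))  # [(0,0), (0,1), (1,1), (2,1), (2,2)]
```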
Evaluate designs
After generating many designs we can evaluate them. I describe some techniques below.
Before we do this, we should identify the key criteria to be used in evaluating any design.
17. Tradeoffs
You can rule out many options just by considering their different tradeoffs and using intuition. You can also use a Pros/Cons/Fixes or a decision matrix.
18. Back of the envelope estimation
By estimating the key factors like latency, throughput, cost, durability, you can get a more precise idea of the properties of different designs, helping you to choose between them.
19. Prototyping
You can develop a simple version of each design, concentrating only on the novel (and risky) pieces.
20. Benchmarking
After developing a prototype, you can develop a benchmark yourself, or choose an existing, well-known benchmark, to test them against. Load testing would be an example of this.
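As a minimal example using Python’s standard library (a toy micro-benchmark; real load tests need realistic data volumes and environments):

```python
import timeit

# Two ways to build the same string; taking min of several repeats reduces noise.
concat = "s = ''\nfor i in range(1000): s += str(i)"
join = "s = ''.join(str(i) for i in range(1000))"

print("concat:", min(timeit.repeat(concat, number=1000, repeat=5)), "s")
print("join:  ", min(timeit.repeat(join, number=1000, repeat=5)), "s")
```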
21. Experimentation
You can also run multiple technologies at once in a realistic environment and see how they fare. For example, writing to and serving from two data stores at once in prod based on a feature flag, in order to compare them.
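A sketch of that dual-write/shadow-read pattern, with placeholder store clients (none of these names are a real library’s API):

```python
import random

class DictStore:
    """Placeholder in-memory store standing in for a real data store client."""
    def __init__(self):
        self.data = {}
    def put(self, key, value):
        self.data[key] = value
    def get(self, key):
        return self.data.get(key)

def log_mismatch(op, key, detail):
    print(f"mismatch during {op} for key={key}: {detail}")

def dual_write(key, value, old_store, new_store, flags):
    old_store.put(key, value)              # the system of record stays authoritative
    if flags.get("dual_write_enabled"):
        try:
            new_store.put(key, value)      # shadow write to the candidate store
        except Exception as exc:
            log_mismatch("write", key, exc)  # never fail the request on the shadow path

def shadow_read(key, old_store, new_store, flags, sample_rate=0.01):
    value = old_store.get(key)
    if flags.get("shadow_read_enabled") and random.random() < sample_rate:
        if new_store.get(key) != value:    # compare a sample of reads across stores
            log_mismatch("read", key, value)
    return value

old, new = DictStore(), DictStore()
flags = {"dual_write_enabled": True, "shadow_read_enabled": True}
dual_write("user:1", {"name": "Ada"}, old, new, flags)
print(shadow_read("user:1", old, new, flags, sample_rate=1.0))
```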
Common guidelines
Problem solving is rarely so linear. Below I’ve collected various common guidelines which didn’t fit into the framework above.
General:
- Practice. Especially in order to build an intuition of when to use each pattern and problem solving technique
- Draw things out: Component, state, and sequence diagrams and flowcharts in particular
- Keep a journal: Document what you’ve tried so far, the results, and any notes
- Write a design doc: To think through the solution before coding
If stuck:
- Naive solution first: Get a solution working first (you can brute force, hard-code, etc) then improve from there. Also, the naive solution gives you a baseline to compare future solutions against.
- Iterative progress: Instead of solving the whole problem initially, what’s the next thing I can do or learn to make progress? Reducing iteration time is helpful: for example, Cruise creating a driving simulation with many roads to speed up assessing new software versions.
- Try different approaches for the same goal: For instance, if you can’t connect to the database using psql, you can retry using PgAdmin or a script. Or if you need to make a service S3 compatible, see if there is a library in your language, then another language, then if you can copy code from an open source project, then if you can do it yourself.
- Divide into cases and solve each problem differently. For example, if some code is too slow for 1% of customers, you could provision special hardware for that 1% only rather than optimize the code for everyone
- Cycle through possible solutions: For instance, for a coding problem, you could consider each of the common data structures and algorithms and see if any of them would help
- Work examples by hand & try induction: See if you can find a pattern (induction). Also, see how you would solve the problem by hand and see if you can turn that into an algorithm
- Solve easier problems to build skill: If you wanted to write a filesystem, you could first write a virtual filesystem in Python, then a basic one in Go, then an advanced one in Rust or C. Each incremental step is easier than diving straight into writing an advanced Rust filesystem
Coding — General:
- Always have a plan: Either an end-to-end plan which encapsulates solving the whole task, or a logical next step progressing towards one. Avoid diving into coding without a plan, or trying random solutions.
- Always be aware of knowns and unknowns: Then you can develop the knowns, and tackle each unknown one at a time. Examples of unknowns include “how should we store seen nodes” and “how should we model each tree node”.
- Break down the problem to the method level (or as much as possible): Define input and output for each method and leave them as black boxes if needed.
- Build solid functions then build on them: Write a function, test it properly, and then you can write another function which builds on it. This saves a lot of time compared to writing everything first and testing it all at the end.
Coding — Dealing with unknowns:
- Understand how your code will work on paper before coding: Go from low to high fidelity. Don’t dive into coding if the algorithm’s operation is not obvious. Run the algorithm by hand on some inputs on paper, run it on edge cases, understand how and why it works, draw things out. Then write pseudocode, then real code
- Write code in a scripting language first: I tend to get code working in Python first, then translate to the actual language, as it’s much faster to develop in and debug
- Write throwaway code: When writing non-trivial code, throw away your first version of the code; just use it to identify any pertinent issues. You can address them in the second version
- Create and stitch “mini programs”: For instance, you could write a separate “mini program” with its own “main” to parse an input file. It will be faster to iterate on and debug when written as a script rather than integrated into the main codebase. Once it’s working, integrate it into the main codebase. Or, you can stitch all the mini programs together at once at the end. (This method works for debugging too.)
- Make new code callable. For instance, create a mini program, call it from the existing “main” function, or expose an API. Often new code is only callable indirectly through other functions, which makes it difficult to test edge cases and debug, and the feedback loop is slow
Coding — Sequencing work:
Work sequencing matters because the naive, linear approach to coding can waste a lot of time on code that isn’t risky, leave you unable to run anything because the code doesn’t work end to end, and give you no feedback loop.
Instead, writing code smartly by combining the techniques below avoids these issues.
- Walking skeleton: Get the code working minimally end to end. All layers are touched, your code is scaffolded, and you can start using and improving the end product (see the sketch after this list)
- Top-down development: Scaffold the top level method, then lower level methods, and keep repeating. Lets you create a walking skeleton
- Bottom-up development: Start writing the lowest level functions, make them solid, and move upwards. Top-down development lets you create something that works end to end, while bottom-up development lets you fill in your code in a minimally risky way.
- Hardest parts first: Solve the hardest parts first in order to derisk the solution. Useful if a subset of parts are much harder than the rest
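Here’s what a walking skeleton built top-down might look like, with stubs and hard-coded data standing in for the unwritten layers (all names are hypothetical):

```python
# A walking skeleton for a toy log-ingestion tool: the top-level flow runs
# end to end immediately; each stub gets filled in (or un-hard-coded) later.

def main():
    records = read_input("events.log")
    rows = [parse(r) for r in records]
    store(rows)

def read_input(path):
    return ["a=1", "b=2"]       # stub: hard-coded sample until file I/O is written

def parse(record):
    key, _, value = record.partition("=")
    return {key: int(value)}    # the one piece implemented "for real" so far

def store(rows):
    print("stored", rows)       # stub: real persistence comes later

if __name__ == "__main__":
    main()
```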
Coding — Getting things working:
- Get things working fast, then improve. Write an initial, rough version that works as fast as possible, before creating another version with good testing, error handling, edge cases, code quality, validation, etc. Don’t jump straight to production code, as that will waste a lot of time
- Scaffold code: Modifying mostly working code is much less taxing than writing code from scratch. Whenever possible, use Stack Overflow, project templates, code generators like ChatGPT, GitHub Copilot, JSON-to-Go, Postman to scaffold your code.
- Hard-code to get things working: For instance, if your endpoint calls another service, you can just get the HTTP call working in Postman and paste the generated code into your project, hard-coded values and all. Once your endpoint works, you can parameterize