4 Problem Solving Frameworks Every Data Scientist Needs to Know

Aditya Chawla
NYU Data Science Review
8 min read · Nov 8, 2022

Introduction

Data Science is popular because it is one of the most abstract ways of solving problems, and it demands more pragmatism than most other fields. Ask a physicist what they covet the most, and they will tell you that breaking things down into their simplest terms is what makes them happy: everything has a reason and can be explained using mathematics. That is also why they use mathematical proofs to solve every problem. But sometimes, mathematical proofs are either hard to find or impossible to reconcile with all the factors involved. How should we proceed then? This article explores the problem-solving frameworks that are available to us right now. A problem-solving framework is a way of mapping some inputs to an output with some degree of precision. Of course, higher precision is more desirable, but we cannot always achieve it.

You can access the code used in this article in the linked Colab notebook.

Theory

When problems are relatively simple and there is a countable number of measurable factors that directly affect the outcome, we can usually use theorems derived from fundamental axioms to come up with a formula that maps the input to the output we are looking for. This is the purest and most precise approach, but it requires knowing everything about the problem at hand. This is where “domain knowledge” becomes a necessary prerequisite for solving a problem. Real-life problems, however, are harder than that: there are too many factors affecting the outcome. Oftentimes, we are only able to measure input variables that are combinations of multiple independent “hidden” variables. If theory doesn’t work, what is the next best approach? That would be simulation and random walks. Let’s explore this paradigm through a few examples before discussing it in general.

A simple example

Let’s start with a very simple problem statement. If we flip a coin 10 times, how many times do we get heads on average? The answer is obviously 5 (50% of 10). It looks simple but things can get a lot more complicated. It’s better to start simple so we can understand the framework better.

First things first. What are we trying to do here? Well, we are carrying out 100 trials of an experiment. In each trial, we flip a coin 10 times and count the number of heads we get.

This demonstrates how we can leverage pseudo-randomness to carry out random experiments using Python. Scroll up to find a reference to the code.
Simulating a coin toss
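A minimal sketch of what such a simulation might look like in Python is shown below; the helper name count_heads and the exact structure are illustrative and may differ from the linked notebook.

```python
import random

def count_heads(num_flips=10):
    """Flip a fair coin num_flips times and return the number of heads."""
    return sum(random.random() < 0.5 for _ in range(num_flips))

# Carry out 100 trials; in each trial, flip the coin 10 times and count heads.
trials = 100
average_heads = sum(count_heads() for _ in range(trials)) / trials
print(f"Average number of heads over {trials} trials: {average_heads}")
```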

If you run this code block multiple times, you will see that you get a different answer each time, but the answer always tends to be around 5.

This is because our number of experiments isn’t enough for a precise answer. Let’s try increasing the number of simulations.

More experiments mean greater precision. Scroll up to find a reference to the code.
Simulating a coin toss many times
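A possible sketch, reusing the count_heads helper from the previous block and simply increasing the number of trials:

```python
# More trials -> a tighter estimate around the true expected value of 5.
trials = 100_000
average_heads = sum(count_heads() for _ in range(trials)) / trials
print(f"Average number of heads over {trials} trials: {average_heads:.3f}")
```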

Now, the answer would be a lot closer to 5. Simply put, more experiments help us be more precise.

Let’s try something a bit more contested. Have you ever heard of the Monty Hall problem?

Suppose you’re on a game show and have the choice of three doors: Behind one door is a car; behind the others, goats. You pick a door, say №1, and the host, who knows what’s behind the doors, opens another door, say №3, which has a goat. He then says to you, “Do you want to pick door №2?” Is it to your advantage to switch your choice?¹

This problem was addressed by Marilyn vos Savant, an American magazine columnist once listed in the Guinness Book of World Records for the highest recorded IQ. As per her answer, switching improves the odds of winning to 2/3, but she received a lot of backlash for this claim. Marilyn received thousands of letters from her readers, many of them PhDs, who disagreed with her answer. Was she right? Let’s find out using simulation.

Simulation & Experimentation

We can simulate the Monty Hall problem using a pseudo-random simulation. Let’s try 10⁴ simulations so that we can get a more precise answer.

Here, we leverage pseudo-randomness to build upon a more complex simulation. Scroll up to find a reference to the code.
A simulation of the Monty Hall problem
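One way such a simulation might be sketched; the function name monty_hall_trial and the door numbering are illustrative, not necessarily what the linked notebook uses.

```python
import random

def monty_hall_trial(switch: bool) -> bool:
    """Play one game and return True if the contestant wins the car."""
    doors = [0, 1, 2]
    car = random.choice(doors)
    pick = random.choice(doors)
    # The host opens a door that hides a goat and is not the contestant's pick.
    host_opens = random.choice([d for d in doors if d != pick and d != car])
    if switch:
        # Switch to the only remaining unopened door.
        pick = next(d for d in doors if d != pick and d != host_opens)
    return pick == car

simulations = 10_000
wins = sum(monty_hall_trial(switch=True) for _ in range(simulations))
print(f"Win rate when switching: {wins / simulations:.2%}")  # close to 66.7%
```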

Voilà! Marilyn was indeed right. If you flip the boolean “switch” in the above code block to False and execute it again, you will see that the answer becomes close to 33.33%, i.e., ⅓.

To see a mathematical proof that yields the same outcome, you can refer to this link².

By now, it should be clear what we are trying to achieve. We build a mathematical model of a problem statement, write a heuristic algorithm that produces an answer, and run that algorithm many times. We then observe a very large sample and try to infer the output from this data. The more data we have, the easier it is to get to a precise solution.

Alright. Now let’s try a problem that shows more clearly why this technique is useful. Consider the problem statement of the gambler’s ruin problem:

Consider a gambler who starts with an initial fortune of $1 and then on each successive gamble either wins $1 or loses $1 independent of the past with probabilities p and q = 1−p respectively. Let Rₙ denote the total fortune after the nth gamble. The gambler’s objective is to reach a total fortune of $N, without first getting ruined (running out of money). If the gambler succeeds, then the gambler is said to win the game.³

To get an idea of how this can be solved using math axioms, refer to the same link. But is there a simpler way?

Let me make a claim first: the most precise answer can be found using mathematical proofs and axioms, but when too many hidden variables are involved, it might take someone a lifetime to come up with those proofs. That is when people start relying on methods that are more practical but a wee bit less precise. In the current scenario, we will leverage pseudo-random simulations yet again to estimate a solution to the gambler’s ruin problem.

Here, we apply the same pseudo-random approach to the gambler’s ruin problem. Scroll up to find a reference to the code.
A simulation of the gambler’s ruin problem
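A sketch of such a simulation, assuming for illustration that the gambler starts with $1, aims for a target of $N = 10, and plays a fair game (p = 0.5); the linked notebook may use different parameters.

```python
import random

def gamblers_ruin_trial(start=1, target=10, p=0.5):
    """Return True if the gambler reaches the target fortune before going broke."""
    fortune = start
    while 0 < fortune < target:
        fortune += 1 if random.random() < p else -1
    return fortune == target

simulations = 10_000
wins = sum(gamblers_ruin_trial() for _ in range(simulations))
print(f"Estimated probability of winning the game: {wins / simulations:.3f}")  # ~0.1 for these parameters
```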

So, to answer the previously posed question about whether there is a simpler way: yes, there is. But it requires computational power and is more prone to errors. At the same time, it is an acceptable way to find an approximate solution as long as there aren’t too many hidden random variables.

Let’s now consider a case for which we don’t have a mathematical theory. Consider this problem:

A pegboard is arranged like a triangle, with 1 peg on level 1, 3 pegs on level 2, 5 pegs on level 3, and so on. All the pegs are symmetrically arranged around the central peg, and all central pegs lie on the same axis. We drop a ball on top of the pegboard. Each time the ball hits a peg, there is a 50% chance it goes left and a 50% chance it goes right. It is assumed that when the ball falls to the next level, it always lands on an adjacent peg of that lower level. A question could be framed here: what are the chances that the ball crosses the 10th peg on the right without ever crossing a peg on the left?

This problem is essentially the gambler’s ruin problem rephrased, so we can solve it using the same theorems established for gambler’s ruin. But wait, there’s more. What if the problem gets more complicated? Consider a situation in which a ball that falls to the right gains “momentum”: if the ball went right on the previous bounce, the chance of it going right increases by 0.01 on the next bounce. At this point, the problem becomes a lot harder to solve using probability theory. But you know what? Pseudo-random simulations can still work!

This is essentially a more complex version of the Gambler’s Ruin problem. Scroll up to find a reference to the code.
Gambler’s Ruin problem with added complexity
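A sketch of the momentum variant. Two modeling details here are my own assumptions: the rightward probability resets to 0.5 whenever the ball goes left, and “crossing a peg on the left” means drifting even one peg left of the starting column. The linked notebook may model these details differently.

```python
import random

def pegboard_trial(target=10, base_p=0.5, momentum=0.01):
    """Return True if the ball drifts `target` pegs right without ever going left of center."""
    position = 0        # net displacement to the right of the starting column
    p_right = base_p    # probability of bouncing right at the next peg
    while True:
        if random.random() < p_right:
            position += 1
            p_right = min(1.0, p_right + momentum)  # the ball gains "momentum"
        else:
            position -= 1
            p_right = base_p                        # assumption: momentum is lost after going left
        if position >= target:
            return True
        if position < 0:
            return False

simulations = 10_000
wins = sum(pegboard_trial() for _ in range(simulations))
print(f"Estimated probability: {wins / simulations:.3f}")
```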

There you go. We managed to arrive at an acceptable approximate solution. So why were we able to solve the problem even though we never worked out a mathematical proof? Because we traded precision for a solution that requires less domain knowledge to implement. While writing this code, I didn’t know enough probability theory to answer this question using mathematical axioms, but simulations worked. Then again, we can’t always simulate real-life scenarios using mathematical models. What do we do then? We make another trade-off: we give up even more precision, but even less domain knowledge is required.

Machine Learning

Quoting the Wikipedia page

Machine learning (ML) is a field of inquiry devoted to understanding and building methods that ‘learn’, that is, methods that leverage data to improve performance on some set of tasks.⁴

Machine learning is about making the best of what historical data gives us. When we use machine learning to solve a problem, we first determine the set of variables that are likely to affect the output. We collect this data and try to come up with an “approximate function” that maps the inputs to the outputs with minimum error. Most of the time, heavy feature engineering is involved, which distills the raw information into relevant inputs that the ML model can process easily. Doing this effectively requires domain knowledge, but still not as much as simulating the problem does. This is another trade-off of precision versus the domain knowledge requirement.
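As a hedged illustration of this workflow: the synthetic data and the choice of a linear model below are mine, purely for demonstration, and are not taken from the article’s notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic "historical data": three measured input variables and a target.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = 2 * X[:, 0] - X[:, 1] + rng.normal(scale=0.1, size=1000)

# Fit an "approximate function" that maps inputs to outputs with minimum error.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LinearRegression().fit(X_train, y_train)
print("Test MSE:", mean_squared_error(y_test, model.predict(X_test)))
```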

Deep Learning

Deep Learning is ML, but at a larger scale of computing and data. When we use DL to solve a problem, we don’t worry about extracting information that very clearly maps to the output in some way or another. We expect our algorithm, which is designed to work with a lot of computing power, to do that for us. We try to come up with a very flexible “function approximator” and give it enough computational power to converge to a point that gives a solution with satisfactory accuracy. At this point, it becomes very hard to understand why we are getting the solution, but we know it is a good one because our accuracy over the test dataset is high. This reduces the need for domain knowledge even further, making it possible for an individual with the DL skill set to tackle a wider range of problem statements.
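A minimal sketch of the same idea with a neural network as the flexible function approximator; PyTorch and the tiny architecture below are assumptions made purely for illustration.

```python
import torch
import torch.nn as nn

# Synthetic data standing in for a problem with an unknown input-output mapping.
X = torch.randn(1000, 3)
y = 2 * X[:, 0:1] - X[:, 1:2] + 0.1 * torch.randn(1000, 1)

# A small multi-layer perceptron: a flexible function approximator.
model = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 1))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

for _ in range(500):
    optimizer.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    optimizer.step()

print("Final training loss:", loss.item())
```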

Where do Data Scientists fit into all of this?

This essentially points toward what Data Scientists are supposed to excel at: solving real-life problems without explicit domain knowledge. The best Data Scientists lose very little precision and can help companies arrive at solutions to very complicated problems very fast. Do you, the reader, have every tool required in your arsenal to be such an individual?

References

[1] Monty Hall Problem, Wikipedia

[2] S. Glen, Monty Hall Problem: Solution Explained Simply, StatisticsHowTo.com

[3] K. Sigman, Gambler’s Ruin Problem (2020), Lecture Notes on Stochastic Modeling

[4] T. Mitchell, Machine Learning (1997), Machine Learning, McGraw Hill

This article was edited by Yara Kyrychenko.
