Ygritte and Jon Snow from Game of Thrones

Introduction to Theory of Probability — an approach to build models to solve real world problems

How can we help Jon Snow and Ygritte meet up for a date using the Theory of Probability?

Suvro Banerjee
Sep 4, 2018 · 17 min read

Motivation

I have watched all seven seasons of Game of Thrones (GOT) and, like many of you, am eagerly waiting for the last one. In this article we will try to help Jon Snow and Ygritte meet up for a date. Before we do that, let me shed some light on these two characters so that they come alive in our discussion for everyone.

I have two favorite characters in GOT, and Jon Snow is one of them (take a guess at the other one :), he is one of the Lannisters :)). Jon is the new King in the North, a prolific fighter and swordsman, and a brave and honest man from the word go. He is so affectionate and likable that even death took a pause to see his mastery and is still undecided about when to take him. So, there he is, Mr. Jon Snow, a perfect candidate for a date.

There were two options for his date partner. The first was Daenerys Targaryen, a vivacious young lady and a brave queen who had terrific on-screen chemistry with Jon Snow, but I chose not to select her because I expect season 8 to throw up some surprises about their relationship, so until then I will leave them at HBO’s mercy :). The other was also my favorite, Ygritte, a woman of the Free Folk who lived north of the Wall. She was a fighter, raw and attractive enough to become the lover of Jon Snow. To me, she played a big role in making Jon Snow what he is today, and the best part is that in real life the two actors are a beautiful husband and wife (God bless them).

So, no confusion: our pair is Jon Snow and Ygritte. I am still reminded of Ygritte looking at Jon Snow with a taunting smile and repeating “You know nothing, Jon Snow”. They had a great on-screen relationship, and I enjoyed them thoroughly.

Now, let’s come back to our objective. We are here to help Jon and Ygritte meet up for a date, and as we do that we will be introduced to a fascinating subject (more fascinating than GOT) called Probability Theory. We will learn the basic foundations of this subject and solve this dating problem in a very interesting manner. Before we state the problem, though, let’s look briefly at where the subject comes from.

A brief history of Probability Theory

William Feller teaching his students

According to William Feller (1906–1970), Eugene Higgins Professor of Mathematics at Princeton University, around 1925 few mathematicians outside the Soviet Union recognized probability as a legitimate branch of mathematics. Today, in 2018, Probability Theory has spread across a multitude of disciplines, pure and applied, such as Physics, Engineering, the Social Sciences, Artificial Intelligence and Machine Learning, and many more.

Pioneers in this field, including Feller, gave it a proper structure and stressed that any field must be studied from three different perspectives: a) the formal logical content, b) the intuitive background and c) the applications. Otherwise the character and the charm of the whole structure cannot be fully appreciated. Let’s understand what is meant by this:

  • Formal Logical Content: Axiomatically, mathematics is concerned solely with relations among undefined things. Take geometry as an example: it does not care what a point or a straight line really is, and they remain undefined. What matters is that relationships such as “two points determine a line” become the axioms of geometry. These are the rules, and there is nothing sacred about them: non-Euclidean geometry has a different set of axioms from Euclidean geometry (the one commonly just called geometry). So it is important to define the logical structure of a field for it to function. Probability Theory went through this process as well, and we will soon define the concepts of sample space, events and so on to make this discipline fully functional and bring it to life.
  • Intuitive Background: A bewildered novice in chess moves cautiously, recalling individual rules, whereas the experienced player absorbs a complicated situation at a glance through intuition. Maxwell’s concept of electromagnetic waves was at first decried as “unthinkable”, but once radio was invented and reached almost every household it became part of the ordinary vocabulary. The same holds for telecommunications, the internet and the modern internet of things: each started as an intuition, was then shaped into an idea, and later became a product we enjoy.
  • Applications: The abstract mathematical models we create serve as tools for applications. Probability Theory is not restricted to Mathematics; it is well ingrained in the Natural Sciences, Applied Sciences, Social Sciences and surely in other fields beyond my knowledge.

What is meant by Probability? Tell me about the INTUITION

Say you are flipping a fair coin. The term fair is important, as it implies that the coin is equally likely to land on either side. Such a coin is also called unbiased.

Suppose we ask, “What is the chance of getting a head?” This question tries to get a handle on an event which is fundamentally random in nature. We don’t know in advance whether a flip will come up heads or tails, but we are trying to quantify the chance of each. This brings me to the definition of Probability.

Probability Theory is the set of models we use to deal with uncertainty in a systematic way.

So, in our coin-flipping example, since the coin is fair and a head and a tail are equally likely, the probability of getting a head is defined as

P(Head) = (number of outcomes that give a head) / (total number of equally likely outcomes) = 1/2

Or we can say that the probability of getting a Head is 50%.

But does that mean that if we flip the coin 10 times we will always get 5 heads and 5 tails? The answer is no, and you can try it right now and see for yourself. But if you flip a fair coin many, many times, say a million times (just a large number), then the share of heads will be very close to 50%. You might well ask WHY? In short, the law of large numbers (informally, the “law of averages”) states that over many repeated independent trials the observed average will be close to its expected value. Here the expected share of heads is 50%, and hence after millions of independent trials the observed share will be very close to, though not necessarily exactly equal to, that value.
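If you would rather not flip a coin a million times by hand, the claim can be tried out with a short simulation. This is only a sketch: the function name `fraction_of_heads` and the fixed seed are my own choices for illustration, and a pseudo-random generator stands in for the fair coin.

```python
import random

def fraction_of_heads(num_flips, seed=0):
    """Flip a simulated fair coin num_flips times and return the share of heads."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    heads = sum(rng.random() < 0.5 for _ in range(num_flips))
    return heads / num_flips

# The share of heads can be far from 0.5 for a few flips,
# and settles close to 0.5 as the number of flips grows.
for n in (10, 1_000, 100_000):
    print(n, fraction_of_heads(n))
```

With only 10 flips the share of heads can easily be 0.3 or 0.7; with 100,000 flips it should sit within a fraction of a percent of 0.5.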

Let me not digress too much; I will get back to this once I start discussing topics like Random Variables, Population, Sample and Expectation. I encourage the reader to explore more on this and to share thoughts in the comments section. I promise I will return to it with a more convincing mathematical treatment, which you will like more :)

Let’s see how Probability works when rolling a fair die. A fair die has six equally likely faces, so by the same definition each face has probability 1/6.

Now, how can you analyze these results intuitively? Take the next two examples to tighten up our intuition a little more: P({1, 2}) = 2/6 = 1/3 and P({3, 4, 5, 6}) = 4/6 = 2/3.

Here, you can see that we are more likely to get an outcome in {3, 4, 5, 6} than in {1, 2} when we roll a fair die. In fact, {3, 4, 5, 6} is twice as likely as {1, 2}, as the probability values we just computed show. And why not: there are twice as many possibilities in {3, 4, 5, 6} as in {1, 2}. So probability simply reflects our intuition in a mathematical manner.

Now, two more questions remain to be answered before we move to the next section, where we build a mathematical framework for Probability Theory.

The first question is: what is the maximum chance of getting a given outcome, in other words how high can a probability value go? In our example, what is the probability of getting some outcome from {1, 2, 3, 4, 5, 6} when we roll a six-faced fair die? It is common sense that every time we roll it we get one of these values, so the probability of getting one of them is always 1, and that is the highest a probability can be.

The second question is: what is the minimum chance of getting a given outcome, or how low can a probability value go? That is to say, when we roll a six-faced fair die, what is the chance of getting a 7? You can say it’s absurd, or in probability lingo you can say it is impossible. So the probability of this outcome is 0, and that is the lowest a probability can be.
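The die examples above all come down to counting, so they are easy to check in a few lines. Here is a minimal sketch, assuming every face is equally likely; the helper name `probability` is mine and not part of any library.

```python
from fractions import Fraction

def probability(event, sample_space):
    """P(event) when every outcome in sample_space is equally likely."""
    favourable = set(event) & set(sample_space)  # outcomes that are actually possible
    return Fraction(len(favourable), len(sample_space))

die = {1, 2, 3, 4, 5, 6}
print(probability({1, 2}, die))        # 1/3
print(probability({3, 4, 5, 6}, die))  # 2/3
print(probability(die, die))           # 1, the highest possible value
print(probability({7}, die))           # 0, the lowest possible value
```

Using `Fraction` keeps the answers exact, so 2/6 shows up as 1/3 rather than 0.333...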

Illustration: Brian Cronin

Now, that’s the intuition, which is super important when approaching any field, and certainly Probability Theory. I recall that when I was in school, a problem I could solve one day I would at times fail to solve the next, and only much later did I realize why. On the days I could solve the problem I started with intuition, and on the bad days I jumped straight to the solution. So intuition, or the lack of it, was the game-changer.

Let’s move to the next section where we formally define our first probabilistic model.

Setting up the Probabilistic Model

  1. Experiments: Tossing a coin once, tossing a coin 10 times, rolling a die once, observing the life span of a person, measuring random noise in an electrical communication system, or watching whether someone defaults on a loan. These are all experiments or observations which we are interested in.
  2. Events: A result of an experiment or observation is called an event. For example, for the experiment “tossing a coin 10 times” an event could be “six of the tosses came up heads”.
  3. Sample Space: Let’s take the experiment “throwing two dice” and say the event we are interested in is “the two dice sum to six”. In what different ways can we get there? (1, 5) or (2, 4) or (3, 3) or (4, 2) or (5, 1); this enumeration decomposes the event “sum six” into five simple events, which are also called “sample points” (or points for short). The aggregate of all sample points of an experiment is called the “Sample Space” of that experiment; for two dice it contains 6*6 = 36 points, and the event “sum six” is the subset made up of the five points above. The sample points are mutually exclusive, meaning no two of them can occur simultaneously, and collectively exhaustive, meaning every possible result of the experiment is one of them; hence a sample space is said to be mutually exclusive and collectively exhaustive. The experiment could have been as trivial as “tossing a coin”, in which case the sample points would be H and T and the sample space would be {H, T}. The sample space is often denoted by Ω (omega).
  4. Discrete Sample Spaces: The simplest sample spaces are those containing only a finite number of sample points, n, as in tossing a coin. It takes only one step from here to imagine a sample space with an infinite sequence of points E1, E2, E3, …; for example, toss a coin as often as necessary to turn up one head. The points of the sample space are then E1 = H, E2 = TH, E3 = TTH, E4 = TTTH, …, so the sample space is infinite. A sample space is called discrete if it contains only finitely many points, or infinitely many points which can be arranged into a simple sequence E1, E2, … Let’s look at a simple example:

Say a die has 4 faces (a tetrahedral die). The experiment we are interested in is rolling the tetrahedral die twice. What is the sample space for this experiment?

A good way to approach this problem is to draw a picture of all the sample points. We can draw it in two ways and both are quite effective.

Rolling a tetrahedral die twice

So the sample space contains 4*4 = 16 sample points in total, and since we can list them one by one it is a discrete sample space.
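If you prefer listing to drawing, the same picture can be produced by enumerating every pair of rolls. A small sketch, with variable names of my own choosing:

```python
from itertools import product

faces = [1, 2, 3, 4]  # faces of the tetrahedral die

# Every sample point is a pair (first roll, second roll).
sample_space = list(product(faces, repeat=2))
print(len(sample_space))  # 16

# Print the points as a 4 x 4 grid, one row per value of the first roll.
for first in faces:
    print([(first, second) for second in faces])
```

Each row of the printed grid corresponds to one value of the first roll, which is the square layout described in the text.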

5. Continuous Sample Spaces: Not every sample space is discrete. The sample space of all positive real numbers, for instance, is not discrete: its points cannot be arranged into a sequence E1, E2, … Let’s draw a parallel with Mechanics. There we first consider discrete mass points, each carrying a finite mass, and then pass to the notion of a continuous mass distribution, where each individual point has zero mass. In the first case, the mass of a system is obtained simply by adding the masses of the individual points; in the second case masses are computed by integration over mass densities.

We do something similar in Probability Theory. The probabilities of events in discrete sample spaces are obtained by mere additions, whereas in other spaces integration is needed. Let’s look at an example to visualize it properly:

Say a person is playing darts (where you throw darts at a board attached to the wall), and say the dart board is a square of size 1 meter x 1 meter. What is the sample space of the experiment in which the player’s dart hits the board? And given that it does, what is the probability of hitting a specific point “M” whose coordinates are (0.5, 0.4)?

So, let’s formulate the outcomes of this experiment in terms of coordinates (x, y) on the board.

A Dart Board of dimension 1m x 1m

An outcome is a point (x, y) with both x and y between 0 and 1. There are infinitely many such points in the square, and hence the sample space is an infinite set. Now, the probability of hitting the specific point “M” is 0, which matches what we said about the continuous mass distribution, in which every individual point has zero mass. Intuitively, a single point has no area, so out of infinitely many points the chance of hitting exactly that one is zero. So, for continuous sample spaces we are always interested in the probability of a set of points and not of individual points.
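A quick simulation makes the difference between a point and a set of points tangible. This is a rough sketch under one loud assumption: the dart is equally likely to land anywhere on the board. The helper `throw_darts` and the small square around M are my own illustrative choices.

```python
import random

def throw_darts(num_throws, seed=0):
    """Throw darts uniformly at the 1 x 1 board; return a list of (x, y) hits."""
    rng = random.Random(seed)
    return [(rng.random(), rng.random()) for _ in range(num_throws)]

throws = throw_darts(100_000)

# Share of darts landing exactly on the single point M = (0.5, 0.4).
on_point_m = sum(1 for x, y in throws if (x, y) == (0.5, 0.4)) / len(throws)

# Share landing in a small square around M:
# 0.4 <= x <= 0.6 and 0.3 <= y <= 0.5, which has area 0.2 * 0.2 = 0.04.
in_region = sum(
    1 for x, y in throws if 0.4 <= x <= 0.6 and 0.3 <= y <= 0.5
) / len(throws)

print(on_point_m)  # a single point has no area, so exact hits are not expected
print(in_region)   # should be close to the area of the region, 0.04
```

The region has a sensible, non-zero probability equal to its area, while the exact point M does not.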

Let’s call out something very important: the concepts and fundamentals of Probability Theory are the same for both discrete and continuous sample spaces; only the way we compute things differs. For the former we use addition and for the latter we use integration. We shall keep our minds open and focused on Probability Theory and not get distracted by unnecessary technical difficulties.

Fundamental Axioms of Probability

  1. Non-negativity : P(A) ≥ 0, i.e. probability of any event A should be non-negative.
  2. Normalization : P(Ω) = 1, i.e. probability of the entire sample space is equal to 1, which also says that the sample space is collectively exhaustive.
  3. Additivity : If A ∩ B = ∅ then P(A ∪ B) = P(A) + P(B), i.e. if two events A and B are mutually exclusive, then the probability of A or B is the sum of their individual probabilities.

Let’s use an example to see these axioms at work. Let’s go back to our earlier example on the Tetrahedral die.

Say a die has 4 faces (a tetrahedral die) and the experiment is to roll it twice. As it is a fair die, all the outcomes are equally likely. We have seen earlier that the sample space contains 16 points (4*4 = 16). So every possible outcome has a probability of 1/16.

Now, let’s answer the following questions:

Q1: What is P((x, y) is (1, 1) or (2, 2)), i.e. the probability that both rolls show 1 or both rolls show 2?

A1: As (1, 1) and (2, 2) are mutually exclusive, the resulting probability is P((1, 1)) + P((2, 2)). Every possible outcome of rolling the die twice has probability 1/16 (please refer to the square in the tetrahedral die example above), so P((x, y) is (1, 1) or (2, 2)) = 1/16 + 1/16 = 2/16 = 1/8.

Q2: What is the probability of the event that the first roll is 3, i.e. P({x = 3})?

A2: This event consists of the sample points {(3, 1), (3, 2), (3, 3), (3, 4)}, and we know each outcome has a probability of 1/16. So P({x = 3}) = 1/16 + 1/16 + 1/16 + 1/16 = 4/16 = 1/4.

Q3: What is the probability that the sum of the two rolls is odd, i.e. P({x + y is odd})?

A3: This event consists of the sample points {(1, 2), (1, 4), (2, 1), (2, 3), (3, 2), (3, 4), (4, 1), (4, 3)}. There are 8 sample points in total, hence P({x + y is odd}) = 8/16 = 1/2.

The next question might be difficult without scribbling on a diagram.

Q4: What is P(min(x, y) = 2)?

A4: The event consists of the sample points (2, 2), (2, 3), (2, 4), (3, 2), (4, 2). Thus P(min(x, y) = 2) = 5/16.
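All four answers can be double-checked by brute force: list the 16 sample points, keep the ones that belong to the event, and divide. The sketch below does exactly that; `prob` is a hypothetical helper written for this article, and it assumes all 16 outcomes are equally likely.

```python
from fractions import Fraction
from itertools import product

faces = [1, 2, 3, 4]
sample_space = list(product(faces, repeat=2))  # 16 equally likely points (x, y)

def prob(predicate):
    """Probability of the event made of all points (x, y) satisfying predicate."""
    favourable = [point for point in sample_space if predicate(*point)]
    return Fraction(len(favourable), len(sample_space))

q1 = prob(lambda x, y: (x, y) in {(1, 1), (2, 2)})  # both rolls 1, or both rolls 2
q2 = prob(lambda x, y: x == 3)                      # first roll is 3
q3 = prob(lambda x, y: (x + y) % 2 == 1)            # sum of the rolls is odd
q4 = prob(lambda x, y: min(x, y) == 2)              # smaller of the two rolls is 2

print(q1, q2, q3, q4)  # 1/8 1/4 1/2 5/16
```

Q4 is the one that is awkward to do in your head, and it is also the one where letting the computer do the counting helps most.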

So, you might have noticed that the above example follows the “Discrete Uniform Law”: if all outcomes are equally likely, then

P(A) = (number of sample points in A) / (total number of sample points in Ω)

There is another set of questions which deal with the “Continuous Uniform Law”, which we just saw with continuous sample spaces. Let’s do a problem.

Q5: Two random numbers x and y are picked between 0 and 1, i.e. in the interval [0, 1]. What is the probability of picking one specific pair of numbers? And what is the probability that their sum is less than or equal to 1/2?

A5: Let’s formalize this question and look at what it means visually. I will attempt the second part first, as it is easier: P(x + y ≤ 1/2).

x + y = 1/2 is a straight line through the points (1/2, 0) and (0, 1/2). All pairs of random numbers whose sum is ≤ 1/2 lie in the triangle between this line and the two axes. So the probability of the event is the area of that triangle, i.e. P(x + y ≤ 1/2) = 1/2 * 1/2 * 1/2 = 1/8. We have seen before that for continuous sample spaces the probability is an area (here the whole unit square has area 1), and we will look more into it when we take this up as a separate topic.

Now let’s attempt the first part. Pick any two numbers between 0 and 1, say 0.4 and 0.7, denoted by the point M on the diagram. The question is: what is the probability of getting exactly 0.4 and 0.7? We have seen that for a continuous sample space the probability of an event is the area it covers, and a point has no area; hence the probability of this event is 0, i.e. P((x, y) = (0.4, 0.7)) = 0.
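The triangle-area answer of 1/8 can be sanity-checked by sampling many pairs and counting how often the sum stays at or below 1/2. This is a Monte Carlo sketch and not a proof; it assumes x and y are independent and uniform on [0, 1], and the name `estimate_sum_probability` is my own.

```python
import random

def estimate_sum_probability(num_samples, threshold=0.5, seed=0):
    """Estimate P(x + y <= threshold) for x, y uniform on [0, 1]."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(num_samples):
        x, y = rng.random(), rng.random()
        if x + y <= threshold:
            hits += 1
    return hits / num_samples

exact = 0.5 * 0.5 * 0.5  # area of the triangle with legs of length 1/2
print(exact)                              # 0.125
print(estimate_sum_probability(200_000))  # should land close to 0.125
```

The estimate wobbles a little from seed to seed, but it stays close to the exact area.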

Now, before we attempt the main question of this article, “the dating problem”, we will explore another technique called partitioning, which is very handy in solving complex probability problems. Let’s explore it in the next section with a problem statement.

Creating Partitions

In a class, 60% of the students are geniuses, 70% of the students like chocolate and 40% of the students are both. Determine the probability that a randomly selected student is neither a genius nor a chocolate lover.

Let’s define the events in terms of probability values-

G: Event that a randomly selected student is a genius. So, P(G) = 0.6

C: Event that a randomly selected student loves chocolate. So, P(C) = 0.7

Both G and C: P(G ∩ C) = 0.4

Now, let’s put all of them into a Venn diagram, a tool for showing how different sets relate to each other.

Now to solve this problem we will use the concept of partitions. With partitions we split the sample space into disjoint sets (they can’t overlap) which together make up the entire sample space. Here there are four of them: y (geniuses who do not love chocolate), z (geniuses who love chocolate), w (chocolate lovers who are not geniuses) and x (students who are neither).

In this question we want the probability of the set x, which represents students who are neither a genius nor a chocolate lover.

P(G) = P(y ∪ z) = P(y) + P(z) = 0.6 [as both y and z are disjoint sets]

P(C) = P(z ∪ w) = P(z) + P(w) = 0.7

P(z) = 0.4

So, substituting the values in the above three equations we get,

P(y) = 0.2 and P(w) = 0.3

Now, as x, y, z, w all are the partitions (i.e. disjoint sets which fill the entire sample space) we have -

1 = P(Ω) = P(x) + P(y) + P(z) + P(w)

=> P(x) = 1 - (0.2 + 0.4 + 0.3)

=> P(x) = 0.1

So, the probability of randomly selecting a student who is neither a genius nor a chocolate lover is 0.1 or just 10 %.
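The same bookkeeping can be written out in a few lines, which is handy when the percentages change. A minimal sketch using exact fractions; the variable names are mine, chosen to describe the four disjoint pieces of the class.

```python
from fractions import Fraction

p_genius = Fraction(60, 100)     # P(G)
p_chocolate = Fraction(70, 100)  # P(C)
p_both = Fraction(40, 100)       # P(G and C)

# Split the class into four disjoint pieces that together cover everyone.
p_genius_only = p_genius - p_both        # genius, not a chocolate lover
p_chocolate_only = p_chocolate - p_both  # chocolate lover, not a genius
p_neither = 1 - (p_genius_only + p_both + p_chocolate_only)

print(p_genius_only, p_chocolate_only, p_neither)  # 1/5 3/10 1/10
```

The four pieces add up to 1, which is just the normalization axiom applied to the partition.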

Help Jon Snow and Ygritte meet for a date

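So, the problem we are trying to solve here is as follows:

Jon and Ygritte plan to meet for a date at an agreed time. Each of them arrives with a delay between 0 and 1 hour, and all pairs of delays are equally likely. The first to arrive waits for 15 minutes and leaves if the other has not shown up. What is the probability that they meet?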

Now let’s pretend for a moment that Jon and Ygritte only arrive in 15-minute increments. What does their arrival sample space look like?

For this discrete framework, the sample space is a 5 x 5 grid of points: each of them can arrive with a delay of 0, 1/4, 2/4, 3/4 or 1 hour, so there are 25 points in total. As all the arrival points are equally likely, the probability of each outcome is 1/25.

They meet when their delays differ by at most 15 minutes. Counting those points (the blue points in the figure) gives 13 desired outcomes: 5 where they arrive together and 4 + 4 where one arrives 15 minutes after the other.

So, the probability that they will meet is 13/25, or a chance of 52%.
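Counting points on a grid is easy to get wrong by one, so here is a short enumeration of the discrete version of the date problem. It is a sketch under the assumptions stated in the text: delays in 15-minute steps from 0 to 1 hour, every pair equally likely, and a meeting whenever the two delays differ by at most 15 minutes. The variable names are my own.

```python
from fractions import Fraction
from itertools import product

# Possible delays in hours: 0, 1/4, 2/4, 3/4, 1.
delays = [Fraction(k, 4) for k in range(5)]

# Every pair (first person's delay, second person's delay).
grid = list(product(delays, repeat=2))

# They meet if the delays differ by at most a quarter of an hour.
meetings = [(a, b) for a, b in grid if abs(a - b) <= Fraction(1, 4)]

print(len(meetings), len(grid))            # 13 25
print(Fraction(len(meetings), len(grid)))  # 13/25
```

Refining the grid (5-minute steps, 1-minute steps and so on) is a nice way to watch the discrete answer drift toward the continuous one.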

Now, let’s head back to real life and accept that time is continuous. In that case the probability that they will meet is exactly the area of the region where their delays differ by at most 15 minutes (we have seen earlier that for a continuous sample space the probability is an area).

Here, the set of all possible outcomes is the square, with area 1*1 = 1.

So, in the real world, we get the probability that they meet by subtracting the two corner triangles, where the delays differ by more than 15 minutes, from the area of the square:

1 - 2*(1/2 * 3/4 * 3/4) = 7/16 = 43.75% chance.

So, if both Jon and Ygritte arrive with a delay between 0 and 1 hour and the first to arrive waits for 15 minutes and then leaves, the probability that they will meet for the date is just 43.75%.

I am sure Jon would have done way better in real life :)
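For the continuous version, the area argument can be cross-checked by simulating a large number of dates. This is a Monte Carlo sketch and not part of the derivation; it assumes both delays are independent and uniform between 0 and 1 hour, and `estimate_meeting_probability` is a name I made up for this example.

```python
import random

def estimate_meeting_probability(num_trials, wait=0.25, seed=0):
    """Estimate the chance that two uniform delays differ by at most `wait` hours."""
    rng = random.Random(seed)
    meetings = 0
    for _ in range(num_trials):
        first_delay = rng.random()   # delay of one person, in hours
        second_delay = rng.random()  # delay of the other person, in hours
        if abs(first_delay - second_delay) <= wait:
            meetings += 1
    return meetings / num_trials

# Area of the unit square minus the two corner triangles.
exact = 1 - 2 * (0.5 * 0.75 * 0.75)
print(exact)                                  # 0.4375
print(estimate_meeting_probability(200_000))  # should land close to 0.4375
```

Changing `wait` shows how much a little patience helps: waiting 30 minutes instead of 15 lifts the exact answer to 1 - 2*(1/2 * 1/2 * 1/2) = 3/4.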

Author’s Note

This is the first article of the series, “Probability”. In the next article we will explore another aspect of Probability Theory called ‘Conditional Probability’ and also look how Bayes’ Theorem was formulated as a result of that. Stay tuned …

Happy Learning :)

Sources

  1. An Introduction to Probability Theory and its Applications by Prof. William Feller
  2. Probabilistic Systems Analysis and Applied Probability taught by Prof. John Tsitsiklis at MIT
  3. Khan Academy, course on Statistics and Probability

Explore Science & Artificial Intelligence

Share interest in Science and explore AI through the principles of Machine/Statistical Learning, Mathematics and Computer Science.

Suvro Banerjee

Written by

“All that is not given is lost” — Tagore | Founder of Explore Science & Artificial Intelligence | MSc. Econometrics from UB | Machine Learning Engineer
