MSc Data Science Lecture Notes. L2: Introduction to Statistics — Part 1 of 2

Ivan Reznikov, PhD
10 min read · Jan 28, 2024


My name is Dr. Ivan Reznikov.
I’m teaching MSc Data Science at Middlesex University Dubai.
This article is part of the series with my brief lecture notes.

This article will be useful if you’re starting with stats or refreshing your memory for an interview.

In case you’ve missed it, check out the first lecture:

· Introduction to Statistics
Historical Reference
· Types of Statistics
· Data Types
· Descriptive Statistics: Measures
Means
Mode and Median
How to choose
· Solutions
Descriptive or Inferential
Means
· Reference

Introduction to Statistics

Let’s face it. Statistics is boring. This is a fact.

I’ll jump out of my pants to make it interesting.

Let’s start.

Historical Reference

The drawing below illustrates a real problem that took place during WWII. The picture displays the frequency map of bullet holes found on airplanes. Each new hit is registered when the plane comes back from a mission.

The army was considering putting on additional armor to increase the survival chances of their planes. The question was: where to put it? You can’t put it everywhere — the plane won’t leave the ground then.

What would you do? Given what you know about airplanes, what part of the plane would you equip with more armor? Don’t rush. Take a couple of minutes.

The story goes as follows: The army wanted to figure out exactly how much armor belonged on parts A, B, and C of the plane. That was the question they asked Abraham Wald, a statistician from Columbia University. The answer surprised them quite a lot.

The armor, said Wald, doesn’t go where the bullet holes are. It goes where the bullet holes aren’t: on the engines and where the pilot sits.

He was met with disagreements: why would one put the armor where no plane was shot? But the reality was different: the planes were getting shot. They weren’t making it back.

Wald explained that if the plane gets shot in any of the ABC sections, it can still make it home. But if the pilot is shot — the chances are close to zero.

This is called survivorship bias (also called survival bias), and it’s quite often met in real life:

  • The enduring Roman aqueducts, some still standing after centuries, showcase their advanced engineering — man, how good were the architects of the past. Bias: only some still exist because the poorly constructed aqueducts have already fallen apart
  • Medieval medical texts often advocate questionable practices like bloodletting and herbal remedies. Bias: many failed treatments and ineffective medicines were likely never documented. Maybe it was worse than we imagine
  • Ghost towns scattered across the American West stand as reminders of the Gold Rush era’s boom and bust cycles. Bias: does anybody know about the numerous small settlements that sprang up and persisted?

And the list goes on and on. The musicians who wrote classical music were more talented, cinema is now much worse, and don’t get me started on news :)

I’m giving a large intro to show how easily one can get confused with statistics if presented incorrectly. Our goal today is to learn more to avoid such confusion.
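If you’d like to see the effect with your own eyes, here is a minimal simulation sketch of the Wald story. Everything in it is an assumption made for illustration: the four sections, the idea that each plane takes exactly one hit, and the survival probabilities are made-up numbers, not historical data.

```python
import random

random.seed(42)  # fixed seed so the run is reproducible

SECTIONS = ["A", "B", "C", "engine"]
# Made-up survival probabilities: a hit to the engine is usually fatal.
SURVIVAL_PROB = {"A": 0.9, "B": 0.9, "C": 0.9, "engine": 0.2}

all_hits = {section: 0 for section in SECTIONS}
returned_hits = {section: 0 for section in SECTIONS}

for _ in range(10_000):
    section = random.choice(SECTIONS)  # every section is equally likely to be hit
    all_hits[section] += 1
    if random.random() < SURVIVAL_PROB[section]:
        returned_hits[section] += 1  # only returning planes get inspected

engine_share_all = all_hits["engine"] / sum(all_hits.values())
engine_share_returned = returned_hits["engine"] / sum(returned_hits.values())

print(f"Engine share of all hits:          {engine_share_all:.1%}")
print(f"Engine share seen on returned planes: {engine_share_returned:.1%}")
```

Engines are hit about as often as any other section, yet they are heavily under-represented on the planes that come back. Looking only at the survivors makes the engine look like the safest place.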

Types of Statistics

When discussing statistics, most assume one of two types: descriptive or inferential.

Descriptive statistics is used to describe the characteristics of the data we actually have, whether that is a sample or the whole population. To describe it, we’ll use the mean, median, and different measures of dispersion. We use it to organize, describe, and present data meaningfully with the help of charts and graphs.

Descriptive statistics is used to summarize known data in a way that can be used for further predictions and analysis.

Inferential statistics uses various analytical tools to draw inferences about the population using samples. We use hypothesis testing and regression analysis to test, predict, and compare data obtained from various samples.

Inferential statistics aims to draw conclusions from a sample and generalize them to the population. It is about using data from a sample and then making inferences about the larger population from which it is drawn.
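To make the difference concrete, here is a small sketch using Python’s standard library. The heights are hypothetical values invented for this example, and the interval uses a rough normal approximation (z = 1.96), which is an assumption on my side; for such a small sample a t-based interval would be wider.

```python
import math
import statistics

# Hypothetical sample: heights (cm) of 10 students, not real measurements.
sample = [168, 175, 162, 180, 171, 159, 177, 169, 173, 166]

# Descriptive: summarize exactly the data we have.
sample_mean = statistics.mean(sample)
sample_max = max(sample)

# Inferential: estimate the mean of a larger population from this sample.
# Rough 95% interval via the normal approximation (an assumption, see above).
standard_error = statistics.stdev(sample) / math.sqrt(len(sample))
ci_low = sample_mean - 1.96 * standard_error
ci_high = sample_mean + 1.96 * standard_error

print(f"Descriptive: mean={sample_mean}, max={sample_max}")
print(f"Inferential: population mean is roughly in [{ci_low:.1f}, {ci_high:.1f}]")
```

The first print statement says something about these ten students only. The second one makes a claim about people we never measured, which is why it comes with uncertainty attached.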

Try answering whether the following tasks are descriptive or inferential:

  1. We measured the pulse of all football players from our team during the match. Their average was 120 bpm with a maximum of 180
  2. We measured the pulse of all football players from our team during the match and can conclude that the goalkeepers’ pulse is lower than anyone else’s
  3. 79% of employees of company A prefer to work at home
  4. The average amount of time spent by a person on the beach in Dubai is 3.5 hours

Take time. The answers will wait below.

Data Types

There are several data types understood in statistics.

Categorical (or qualitative) data represents characteristics, such as gender, language, car brand, etc. Categorical data can also take on numerical values (for example, 0 for male, 1 for female). Such numbers do not always have mathematical meaning.

Nominal values represent “categories” or “labels”:

  • Tesla, BMW, Mercedes, Toyota …
  • English, Arabic, Spanish …
  • Male, Female

Nominal data has no order.

Ordinal data, on the other hand, has order. This is its only difference from nominal data. Ordinal data can be represented by:

  • Intern, Junior, Senior, Manager, Director
  • Elementary, High School, Graduate, Master, PhD
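Here is a minimal sketch of how that difference shows up in code. The job levels come from the list above; the `staff` and `brands` lists and the `RANK` helper are hypothetical names introduced just for this illustration.

```python
# Ordinal scale from the list above; the position in the list encodes the order.
LEVELS = ["Intern", "Junior", "Senior", "Manager", "Director"]
RANK = {level: position for position, level in enumerate(LEVELS)}

# Hypothetical staff list, used only for illustration.
staff = ["Senior", "Intern", "Director", "Junior", "Senior"]

# Ordinal data can be sorted and compared by rank...
sorted_staff = sorted(staff, key=RANK.__getitem__)
junior_below_manager = RANK["Junior"] < RANK["Manager"]
print(sorted_staff)
print(junior_below_manager)

# ...whereas nominal labels only support equality checks and counting.
brands = ["Tesla", "BMW", "Tesla", "Toyota"]
tesla_count = brands.count("Tesla")
print(tesla_count)
```

Note that the ranks only tell us the order. The “distance” between Intern and Junior is not necessarily the same as between Manager and Director, so averaging the ranks would not mean much.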

Numerical or Quantitative data is a bit spicier. Let us remind ourselves what discrete and continuous values are.

Discrete data can only take certain values. It can’t be measured but can be counted: 4, 8, 15, 16, 23, 42, … Continuous data represents measurements. It can’t be counted but can be measured: 12,500 km, 6 × 10²⁴ kg, 4.5 billion years, etc.

Interval values represent numeric values that are ordered and where we know the exact differences between the values. They don’t have a true zero. Let’s look at a couple of examples:

  • Temperature: you can’t say that 50°C is twice as hot as 25°C
  • Clock time: you can’t say that 3 pm is three times smaller than 9 pm

Ratio values are the same as interval values, with the difference that they have an absolute zero.

  • Height
  • Weight
  • Speed
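The temperature example is easy to check numerically. The sketch below converts Celsius (an interval scale) to Kelvin (a ratio scale with an absolute zero); the helper name `celsius_to_kelvin` is my own, introduced for illustration.

```python
def celsius_to_kelvin(celsius: float) -> float:
    """Convert to Kelvin, a temperature scale with an absolute zero."""
    return celsius + 273.15

# On the Celsius (interval) scale, differences are meaningful...
difference = 50 - 25

# ...but the ratio 50 / 25 = 2 is an artifact of where zero was placed.
naive_ratio = 50 / 25
true_ratio = celsius_to_kelvin(50) / celsius_to_kelvin(25)

print(difference)
print(naive_ratio)
print(round(true_ratio, 3))
```

On the Kelvin scale, 50°C is only about 8% “hotter” than 25°C, not twice as hot. Ratios make sense only once the zero really means “nothing”.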

Descriptive Statistics: Measures

Means

Usually, we describe the “typical” or “average” value when talking about something. In a conversation about wildlife, when an elephant is brought up, it’s most likely a “typical” elephant. If you discuss cars, most likely, it’s a typical four-wheel vehicle. In statistics, this is called a central tendency.

Most of the time, the mean is understood as average. But in fact, there are several types of means:

  • Arithmetic Mean or Average is the most common type of mean, calculated by adding all the values and dividing by their number. It’s the most intuitive measure of central tendency but is easily skewed by outliers.
  • Geometric Mean is calculated by multiplying all the values in a set and then taking the nth root, where n is the number of values. It’s useful for datasets where data is constantly changing or growing at a rate. Such data might be COVID cases, investment returns, or population growth.
  • Harmonic Mean is calculated by taking the reciprocal of the arithmetic mean of the reciprocals of all the values in a set. It’s useful for data sets where you want to average rates or inverse values, such as speeds or work rates.

Besides these means, there is a weighted mean, a root mean square, contraharmonic mean, and others. One may wonder why there are so many, but they are all used in different fields of math and science.
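All three means described above are available in Python’s standard `statistics` module (`geometric_mean` needs Python 3.8 or newer). The sketch below uses a tiny hypothetical dataset chosen so that you can verify the results by hand.

```python
import statistics

# Hypothetical positive values chosen so the results are easy to check by hand.
values = [1, 2, 4]

arithmetic = statistics.mean(values)           # (1 + 2 + 4) / 3
geometric = statistics.geometric_mean(values)  # (1 * 2 * 4) ** (1 / 3)
harmonic = statistics.harmonic_mean(values)    # 3 / (1/1 + 1/2 + 1/4)

print(f"arithmetic={arithmetic:.3f}, geometric={geometric:.3f}, harmonic={harmonic:.3f}")
```

For positive values that are not all equal, the harmonic mean is always the smallest of the three and the arithmetic mean the largest, which is a handy sanity check.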

To better understand their applications, let’s try solving several tasks:

Task 1. The number of people contracting some pandemic virus increased
by 10% on day 1,
by 20% on day 2,
by 30% on day 3.

What’s the average daily increase rate?

Task 2. A swimmer swims one lap of freestyle at 3 km/h, then one lap of breaststroke at 2 km/h.

What’s his average speed?

Once again, the solutions will be below. Take your time, and try solving them in Python, Excel, or even with pen and paper.

Mode and Median

There are cases when the mean (here and later, “mean” refers to the arithmetic mean, or average) isn’t descriptive and might even be misleading.

Let’s look at the above list of 20 numbers. Quartiles are values that split sorted data into four equal parts. They mark the points below which 25%, 50%, and 75% of the data fall. The second quartile is the median, the middle of the dataset.

The mode is the value that occurs most often in the set.

The purpose of using multiple values is to describe the sample with several statistical numbers.
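Here is a short sketch of how these numbers can be computed with the standard `statistics` module. The 20 values below are hypothetical, made up for this example, and are not the list from the lecture. Also keep in mind that different quartile conventions exist; `statistics.quantiles` uses its default “exclusive” method here, so other tools may give slightly different Q1 and Q3.

```python
import statistics

# Hypothetical 20-value dataset (not the list from the lecture).
data = [1, 2, 3, 3, 4, 5, 5, 6, 6, 7, 7, 8, 9, 10, 12, 15, 15, 15, 15, 17]

mean = statistics.mean(data)
median = statistics.median(data)
mode = statistics.mode(data)
# n=4 returns the three cut points that split the data into four parts.
q1, q2, q3 = statistics.quantiles(data, n=4)

print(f"mean={mean}, median={median}, mode={mode}")
print(f"quartiles: Q1={q1}, Q2={q2}, Q3={q3}")
```

The second quartile and the median are the same number, and the mean is pulled above the median by the cluster of large values at the end.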

How to choose

Let’s invert this and play a game. Assuming we only know the mean, median, and mode, can we describe our data?

We know that the median is seven, and the mean is slightly higher than the median. That means the second part of the sequence is located farther from the average than the first part.
We also know that the most common element is 15, which represents a large chunk of the second part. Based on that, we can estimate that the average of the first part should be around 4.5 (the average of the averages of two equal halves will be the same as the average of the sample). With this, we can estimate our 1st and 3rd quartiles to be close to 4.5 and 15, respectively.

Basically, just by knowing three numbers, we’ve efficiently described the dataset. If we had included max, min, and other quantiles, we’d be close to reproducing the whole dataset.

In practice, statistical and machine learning algorithms go much deeper in their “thought” process than we’ve done above.

Another reason to use multiple statistical metrics is that they allow us to describe skewed datasets. Let’s look at the above picture, which illustrates the salary distribution in a company.

What do you think describes the data best?

  • mean is 5000
  • median is 3000
  • mode is 2000

Whatever metric you’ve chosen, I doubt it will be enough. Using all three, in this case, is the only correct way to proceed. Relying on a single number is like plotting just one column of a histogram instead of the whole distribution.
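To play with this yourself, here is a sketch with a hypothetical list of seven salaries. I picked the values only so that the mean, median, and mode match the three numbers above; they are not the data behind the picture.

```python
import statistics

# Hypothetical salaries, picked only so the summary matches the numbers above.
salaries = [2000, 2000, 2000, 3000, 4000, 5000, 17000]

mean = statistics.mean(salaries)
median = statistics.median(salaries)
mode = statistics.mode(salaries)
print(f"mean={mean}, median={median}, mode={mode}")

# How many employees actually earn at least the mean?
at_or_above_mean = sum(salary >= mean for salary in salaries)
print(f"{at_or_above_mean} of {len(salaries)} earn the mean or more")
```

In this made-up company, most people earn less than the “average” salary, and the most common salary is less than half of it. Each metric tells a different part of the story.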

Let’s wrap up with some practical advice:

  1. Use the mode for nominal data, where it reflects the “popular choice.” For example, surveying ice cream flavors. Since vanilla, chocolate, and other flavors are categories and not numbers, calculating the mean or median wouldn’t provide meaningful insights.
  2. Use median for skewed data, for example, analyzing incomes. Most of the time, you’ll find a situation with many low and a few high earners. The median would represent the income at the middle point, giving a more accurate picture of “typical” income than the mean, unaffected by the few highly paid individuals.
  3. Use the mean for normally distributed data, for example, analyzing heights of adult males in a city. If the data follows a normal distribution (don’t miss the next part), the mean will be close to the median and mode, effectively representing the “average” height. In this case, the mean gives a concise summary of the central tendency, simplifying the interpretation of the data.
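The first two pieces of advice can be sketched in a few lines. Both the survey answers and the incomes below are hypothetical numbers invented for this example.

```python
import statistics
from collections import Counter

# 1. Mode for nominal data: hypothetical ice cream survey answers.
answers = ["vanilla", "chocolate", "vanilla", "strawberry", "chocolate", "vanilla"]
flavor, votes = Counter(answers).most_common(1)[0]
print(f"Most popular: {flavor} ({votes} of {len(answers)} votes)")

# 2. Median for skewed data: hypothetical incomes (in thousands) plus one outlier.
incomes = [30, 32, 35, 38, 40]
with_outlier = incomes + [1000]

mean_shift = statistics.mean(with_outlier) - statistics.mean(incomes)
median_shift = statistics.median(with_outlier) - statistics.median(incomes)
print(f"mean moved by {mean_shift:.1f}, median moved by {median_shift:.1f}")
```

A single very high earner moves the mean by more than a hundred, while the median barely notices. That is exactly why the median is the safer choice for income-like data.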

Solutions

Descriptive or Inferential

  1. Descriptive, as the average (mean) and maximum pulse rates are being described.
  2. Inferential, as a conclusion (or inference) about a larger population is made based on the sample data. Here, the claim that goalkeepers have a lower pulse rate than anyone else is an inference from the data collected.
  3. Most likely inferential. Think of it this way: who exactly are these “79% of employees of company A” who prefer to work at home? Most likely, the survey didn’t cover all employees. Moreover, employees without a strong preference may simply have ignored the survey.
    In a sense, the situation is similar to the statement “98% of survey respondents indicated they enjoy participating in surveys.”
    Does this ring survival bias bells? :)
  4. Similar to the previous case, this could be either descriptive or inferential. But here it is much more likely to be descriptive: we’re told about a specific location (Dubai), and the figure may be based on parking tolls or similar records, so the numbers could plausibly cover the whole population of beachgoers.

Means

Task 1. Let’s take a look at how this problem might’ve been solved:

  • day 0: 1000 cases
  • day 1: 1000 + 0.1×1000 = 1100 cases (+10%)
  • day 2: 1100 + 0.2×1100 = 1320 cases (+20%)
  • day 3: 1320 + 0.3×1320 = 1716 cases (+30%)

In this case, it makes sense to use the geometric mean of the daily growth factors: (1716/1000)^(1/3) ≈ 1.197, which means an average daily increase of about 19.7%.
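If you’d rather check this in Python, here is a minimal sketch using the standard `statistics` module. The starting value of 1000 cases is the same illustrative assumption as in the solution above.

```python
import statistics

growth_factors = [1.10, 1.20, 1.30]  # +10%, +20%, +30% from the task

geometric = statistics.geometric_mean(growth_factors)
arithmetic = statistics.mean(growth_factors)

cases_start = 1000  # illustrative starting value, as in the solution above
cases_geometric = cases_start * geometric ** 3
cases_arithmetic = cases_start * arithmetic ** 3

print(f"geometric mean rate: {geometric - 1:.1%}")
print(f"after 3 days at the geometric rate:  {cases_geometric:.0f}")
print(f"after 3 days at the arithmetic rate: {cases_arithmetic:.0f}")
```

Growing by the geometric mean rate for three days reproduces the real total, while the arithmetic mean of 20% overshoots it. That is why the geometric mean is the right tool for compounding growth.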

Task 2. For simplicity, let’s set the lap length to 3 km. Freestyle will take 1 hour, and breaststroke will take 1.5 hours. That makes 2.5 hours for 6 km => 6/2.5 = 2.4 km/h.

A more “mathematical” solution uses the harmonic mean: (1 km + 1 km)/(1/3 h + 1/2 h) = 2.4 km/h
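Both approaches can be cross-checked in a few lines of Python. The lap length of 3 km is the same arbitrary assumption as above; any other value gives the same average speed.

```python
import statistics

speeds_kmh = [3, 2]  # freestyle and breaststroke, one lap each

harmonic = statistics.harmonic_mean(speeds_kmh)

# Cross-check from first principles: total distance / total time.
lap_km = 3  # arbitrary lap length; it cancels out of the result
total_time_h = sum(lap_km / speed for speed in speeds_kmh)
average_speed = len(speeds_kmh) * lap_km / total_time_h

print(f"harmonic mean: {harmonic:.2f} km/h")
print(f"distance / time: {average_speed:.2f} km/h")
```

The arithmetic mean of the two speeds would be 2.5 km/h, which is too high: the swimmer spends more time on the slower lap, and the harmonic mean accounts for that.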

Reference

Wald’s work on aircraft survivability: https://people.ucsc.edu/~msmangel/Wald.pdf

This is the end of Part 1 of Lecture 2.
The next part will be published soon.

Also, check out my LangChain 101 course:

Clap and follow me, as this motivates me to write new parts and articles :) Plus, you’ll get notified when the new part is published.
