Statistically Speaking: A Beginner’s Guide to Data Science and Machine Learning. Lecture-01

Apr 24, 2024

In my upcoming blog series, I will cover statistics from scratch, with a focus on demystifying complex topics and making them accessible to beginners. From the basic principles of probability and descriptive statistics to more advanced topics such as inferential statistics and hypothesis testing, the series will guide readers through each concept step by step, using real-world examples and practical applications.

Before going into advanced topics, let’s first understand the basics of statistics, starting with its definition.

Topics Covered:

What is statistics?
Population and Sample
i) Simple Random Sampling
ii) Stratified Sampling
iii) Convenience Sampling
iv) Systematic Sampling
What is Descriptive Statistics?
What is Inferential Statistics?
Variables

1. What is Statistics?

Statistics is the science of collecting, organising, analysing and presenting data. In data science and machine learning, it plays a pivotal role in extracting meaningful information from vast amounts of data.

Understanding statistical concepts is crucial for anyone who wants to learn data science and machine learning, because it forms the foundation for making informed decisions, building predictive models and drawing accurate conclusions.

Statistics is like a magic tool that helps us make sense of the numbers and data all around us. Imagine you have a big jar filled with candies, and you want to know how many candies of each color are in the jar. You cannot count them one by one because there are too many, so you pick a sample, like taking a handful of candies from the jar. Then you use statistics to estimate how many candies of each color are in the jar, based on the sample you picked.

In everyday life we use statistics all the time without even realizing it. For example, when a weather forecast says it is going to rain tomorrow, the forecasters are using past weather patterns to predict the future. Or, when your teacher gives a test and tells you the class average, they are using statistics to summarize how well everyone in the class did.

In data science and machine learning, statistics helps us understand huge amounts of data. For instance, if we are trying to teach a computer to recognise cats in pictures, we can use statistics to analyse thousands of cat pictures and find patterns that help the computer distinguish cats from other objects.

Statistics can be divided into two main parts: descriptive statistics and inferential statistics. Before going into inferential statistics, let’s first understand what descriptive statistics is and what comes under it.

Before diving into them, recall the handful of candies we picked from the big jar. Is there a better way to do it? Is there a process we should follow so that the handful we collect properly represents the whole jar?

2. Population and Sample:
Think of it like this: the population is all the candies in the big jar, and its size is represented by N. The sample is the handful of candies you have taken from that jar, and its size is represented by n. There are different ways to draw this sample depending on the use case you are solving. The sampling techniques are described below.

i) Simple Random Sampling:

Simple random sampling is like picking candies from the jar in a fair and unbiased way. Imagine you close your eyes and pick candies from the jar without looking. That’s simple random sampling. Every candy has an equal chance of being picked.

So, let’s say you want to know which candy color is the most common in the jar, and you cannot count them all because there are too many. Instead, you randomly pick a few candies without looking. This way, you give every candy an equal chance of being picked, just like drawing names out of a hat.
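To make this concrete, here is a minimal sketch of simple random sampling in Python. The jar contents, the sample size of 10 and the seed are all made up for illustration; only the standard library is used.

```python
import random
from collections import Counter

# Hypothetical jar: the colors and counts are made up for illustration.
jar = ["red"] * 50 + ["blue"] * 30 + ["yellow"] * 20

rng = random.Random(42)          # fixed seed so the run is repeatable
sample = rng.sample(jar, k=10)   # every candy is equally likely; no candy is picked twice

# Count the colors in the handful we drew.
print(Counter(sample))
```

Changing the seed gives a different handful, which is a handy way to see how much a small random sample can vary from one draw to the next.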

When should we not use simple random sampling?

Well, imagine some candies in the jar are very tiny and a few are very large. If you randomly pick candies without looking, you might by chance end up with mostly tiny ones or mostly large ones. That wouldn’t give you a fair picture of the candies in the jar.

So, when the candies are really different from each other in some way, simple random sampling might not be the best choice. Instead, we might want to use a different sampling method that makes sure we get a good mixture of all the different types of candies, so that we can understand the jar better.

ii) Stratified Sampling:

Imagine you have noticed that the candies in the jar come in three different sizes: small, medium and large. With stratified sampling, instead of picking candies randomly from the whole jar, you first split them into groups based on their size.

So, you create three groups: one for small candies, one for medium candies and one for large candies. Then, you randomly pick a few candies from each group. This way, you make sure that every size is fairly represented in your sample.
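The grouping-then-picking idea can be sketched in a few lines of Python. The helper name `stratified_sample`, the jar contents and the choice of drawing the same number of candies from every group are assumptions made for this example; drawing in proportion to group size is another common choice.

```python
import random

# Hypothetical jar: each candy is a (size, color) pair; the counts are made up.
jar = (
    [("small", "red")] * 60
    + [("medium", "blue")] * 30
    + [("large", "yellow")] * 10
)

def stratified_sample(candies, per_group, rng):
    """Group candies by size, then draw `per_group` at random from each group."""
    groups = {}
    for candy in candies:
        groups.setdefault(candy[0], []).append(candy)
    sample = []
    for members in groups.values():
        k = min(per_group, len(members))  # a group may be smaller than requested
        sample.extend(rng.sample(members, k))
    return sample

sample = stratified_sample(jar, per_group=3, rng=random.Random(0))
print(sample)
```

Even though large candies make up only a tenth of this jar, the sample is guaranteed to contain some of them, which a small simple random sample cannot promise.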

When should we use stratified sampling?

Well, if we know that the candies in the jar differ from one another in some way, like size, color or flavor, stratified sampling might be a good choice. It helps us make sure that we get every type of candy, so we get a better understanding of the jar.

So, stratified sampling means splitting the candies into groups (three, in our example) and selecting a few candies from each group to get a better picture of the jar. It’s a bit more complicated than simple random sampling, but it helps make sure that we are not missing any important information.

iii) Convenience Sampling:

Imagine you are in a hurry and you want to pick a few candies quickly without much effort. So, instead of closing your eyes and picking randomly, or grouping the candies first, you simply reach into the jar and grab the candies that are easiest to reach. That’s convenience sampling!

Convenience sampling is like taking the easy way out. You are not thinking about fair representation of all the candies in the jar anymore. You are just grabbing what is most convenient for you at that moment.
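Here is a small Python sketch of what grabbing the easiest candies can look like. The jar is made up, and it is deliberately ordered so that the candies on top are all red, which is an assumption chosen to make the bias visible.

```python
# Hypothetical jar stored as a list: the candies "on top" come first.
# The order is made up so that the easy-to-reach candies are all red.
jar = ["red"] * 50 + ["blue"] * 30 + ["yellow"] * 20

convenience_sample = jar[:10]   # just grab the first ten within reach

colors_in_jar = set(jar)
colors_in_sample = set(convenience_sample)

# Any color listed here never made it into our handful.
print("colors missed:", colors_in_jar - colors_in_sample)
```

In this made-up jar the handful contains only red candies, so it tells us nothing about the blue and yellow ones.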

When should we use convenience sampling?

Well, if you are in a hurry and you want to know what’s inside the jar quickly, without spending much time or effort, then convenience sampling might suffice. But it’s important to remember that convenience sampling may not give a fair representation of all the candies in the jar.

So, convenience sampling is like taking a shortcut. It’s quick and easy, but it might not always give you the best results. If you have the time and resources, it is usually better to go with simple random sampling or stratified sampling, so that you get a better understanding of your data.

iv) Systematic Sampling:

Imagine you have a long line of candies in a row, like on a conveyor belt. With systematic sampling, instead of randomly selecting the candies or grouping them, you decide to pick one at a regular interval. So, you might decide to pick every 5th candy in the row, and keep picking every 5th until you reach the other end. That’s systematic sampling.

Systematic sampling is like following a pattern. You’re not randomly picking candies, but you are also not grabbing whatever is most convenient. Instead, you are following a systematic approach that aims to give you a representative sample of all the candies in the row.
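Picking every 5th candy maps neatly onto list slicing in Python. This is a minimal sketch with a made-up belt of 20 candies labelled by position.

```python
# Hypothetical conveyor belt of 20 candies, labelled by position.
belt = [f"candy_{i}" for i in range(1, 21)]

step = 5
# Start at index step - 1 so we take the 5th, 10th, 15th and 20th candy.
sample = belt[step - 1::step]

print(sample)  # ['candy_5', 'candy_10', 'candy_15', 'candy_20']
```

In practice the starting position is often chosen at random within the first interval, so that the first few candies also have a chance of being picked.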

When should we use systematic sampling?

Well, if the candies or the data are arranged in a nice, orderly row, like on a conveyor belt, then systematic sampling can be a good choice. It spreads our picks evenly across the whole row, which helps us get a fair representation of the candies.

So, systematic sampling is like following a plan. It’s not fully random, but it is also not based on convenience. Instead, it’s a methodical way of picking candies to make sure that we get a good sample of all the candies in the row.

3. Descriptive statistics:
Descriptive statistics consists of organising and summarising data. It is like taking a closer look at the candies we picked to understand them better. Instead of just guessing how many candies of each color there are, descriptive statistics helps us organise and summarise the information that we have.

So, let's say we have picked a handful of candies and we need to describe it to someone else. We might count the number of candies, find the most common color among them, and even calculate the average size of a candy.

In simple words, descriptive statistics is like telling a story about the candies in front of us. We are not making any big predictions or guesses yet; we’re just describing what we see in a clear and organized way. It’s kind of like saying, “There are 10 red candies, 15 blue candies and 20 yellow candies in our handful.”
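The summaries mentioned above (how many candies, the most common color, the average size) can be computed with Python's standard library. The handful below is made up, and weight in grams is used as a stand-in for size, which is an assumption of this sketch.

```python
from collections import Counter
from statistics import mean

# Hypothetical handful: (color, weight in grams); the values are made up.
handful = [
    ("red", 2.0), ("blue", 3.0), ("red", 2.5),
    ("yellow", 4.0), ("red", 3.5),
]

color_counts = Counter(color for color, _ in handful)
most_common_color, count = color_counts.most_common(1)[0]
average_weight = mean(weight for _, weight in handful)

print(f"{len(handful)} candies, most common color: {most_common_color} "
      f"({count}), average weight: {average_weight} g")
```

Everything printed here describes only the handful itself; nothing is being claimed about the rest of the jar yet.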

4. Inferential Statistics:
Imagine we have described our sample of candies using descriptive statistics, for example by counting the number of candies of each color. Now, let’s say we want to estimate the make-up of the entire jar based on this small sample.

Inferential statistics is like using the information from our small sample of candies to make predictions or draw conclusions about the entire jar. Instead of just describing what we have seen in the sample, we are making educated guesses about the larger population of candies in the jar.

So, for example, let’s say we found that 50% of the candies in our sample were red. Using inferential statistics, we might estimate that around 50% of all the candies in the jar are red. We are using the information from our sample to infer something about the entire population of candies in the jar. This is why collecting the sample in a proper way is so crucial for drawing conclusions.

Inferential statistics allows us to make these kinds of predictions or draw conclusions based on limited information. It’s like making an educated guess from the sample, even though we haven’t counted every single candy in the jar.

Overall, while descriptive statistics helps us organize and summarize the data we have, inferential statistics takes it a step further by allowing us to make predictions or draw conclusions about a larger population based on a smaller sample.
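As a rough sketch of this idea in Python, we can draw a sample, measure the share of red candies in it, and scale that share up to the whole jar. The jar contents, the sample size of 50 and the seed are made up; in real life we would not know the true make-up of the jar.

```python
import random

# Hypothetical jar whose true make-up we pretend not to know.
jar = ["red"] * 500 + ["blue"] * 300 + ["yellow"] * 200
N = len(jar)                      # population size

rng = random.Random(7)            # fixed seed so the run is repeatable
sample = rng.sample(jar, k=50)    # simple random sample
n = len(sample)                   # sample size

red_share = sample.count("red") / n          # what we observed in the sample
estimated_red_in_jar = round(red_share * N)  # what we infer about the whole jar

print(f"red in sample: {red_share:.0%}, "
      f"estimated red candies in jar: {estimated_red_in_jar}")
```

The estimate will usually land near the true value without matching it exactly, and a different sample would give a slightly different answer.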

5. Variables:
A variable is a property that can take on different values. In our example we can think about two main types of variables: qualitative and quantitative.

i) Qualitative Variables: These are like qualities or characteristics of the candies that we can describe with words. For example, the color of a candy (red, blue, yellow) is a qualitative variable. It’s something that can be described with words but cannot really be measured in numbers.

Qualitative variables are further divided into two types.

Nominal Variables: These are categories or labels that don’t have any kind of order. For example, the color of a candy (red, blue, yellow) is a nominal variable. The colors are just different categories, with no inherent order among them.
Ordinal Variables: These are categories or labels that do have an order or ranking. For example, if we had candies of different sizes (small, medium, large), the size of a candy would be an ordinal variable. There is a clear order from smallest to largest.

ii) Quantitative Variables: On the other hand, these are like quantities or amounts that we can measure with numbers. For example, the number of candies of each color in the jar is a quantitative variable. We can actually count the number of red candies, the number of blue candies and so on.

Quantitative variables are also divided into two types.

Discrete Variables: These take whole-number values or specific counts. For example, the number of candies of each color is a discrete variable because we are counting whole candies.
Continuous Variables: These are measurements that can take on any value within a range. For example, if we measure the weight of each candy in the jar, that would be a continuous variable because weight can take on any value within a range.

So, in our candy jar example we have a qualitative variable, the color of a candy, which is nominal because the colors are just different categories. And we have a quantitative variable, the number of candies of each color, which is discrete since we are counting whole candies.
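One way to see the four kinds of variables side by side is to model a candy in Python. The `Candy` and `Size` types and all the values below are hypothetical and exist only for this illustration.

```python
from enum import IntEnum
from typing import NamedTuple

class Size(IntEnum):
    """Ordinal: categories with a meaningful order."""
    SMALL = 1
    MEDIUM = 2
    LARGE = 3

class Candy(NamedTuple):
    color: str        # qualitative, nominal: a label with no order
    size: Size        # qualitative, ordinal: small < medium < large
    weight_g: float   # quantitative, continuous: any value within a range

# Made-up candies for illustration.
candies = [
    Candy("red", Size.SMALL, 2.1),
    Candy("blue", Size.LARGE, 4.8),
    Candy("red", Size.MEDIUM, 3.3),
]

# Quantitative, discrete: a whole-number count.
red_count = sum(1 for c in candies if c.color == "red")
# Asking for the "largest" candy only makes sense because size is ordered.
largest = max(candies, key=lambda c: c.size)

print(red_count, largest.color)  # 2 blue
```

Notice that we can compare sizes but not colors: asking whether red is "greater than" blue has no meaningful answer, which is exactly the difference between ordinal and nominal variables.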

This should be enough for the first lecture. I know it only covers some basics of statistics, but when you are building advanced machine learning algorithms or artificial intelligence models, statistics plays a key role in selecting the data that you feed to your model. So, first understand the basics and build a strong foundation.

Why is reading better than watching YouTube videos?

Reading lets your imagination run wild. You can create pictures in your mind, and that’s like having a personal movie in your head.
When you are reading, you stay focused on the words, but while watching videos you can be distracted by flashy visuals and ads.
Reading encourages you to think. You pause, reflect and understand things deeply. It’s like having a conversation with the author.
Reading is a patient game. It’s not a race. You learn to enjoy the journey, and that helps you in many areas of life.

Written by Anju Reddy K