Lesson 2: Introduction to Statistics

Oluwafadekemi Ogundiya
Published in dsnaiplusui · 4 min read · Jul 2, 2020

Statistics is a subfield of mathematics. It helps us make educated guesses about the unknown and find useful information in an ocean of data. But despite its usefulness, many people struggle with statistics. What is it? How does it work?

Making sense of data can be cumbersome and intimidating for any machine learning enthusiast. As part of my contribution to the 7 days of statistical theory challenge by AI+Club UI, a student community under Data Science Nigeria, I am writing this introduction to statistics.

What is Statistics?

Statistics refers to a collection of methods for working with data and using data to answer questions. To beginners, this grab bag of methods can seem large and amorphous, and it is often hard to see the line between methods that belong to statistics and methods that belong to other fields of study.

The first thing we will look at is the types of data we encounter when dealing with statistics. Roughly, we can divide data into two distinct classes: categorical data and numerical data. The names are fairly self-explanatory, but I’ll explain each one.

Numerical data: This is data that is measurable, such as time, height, weight, amount, and so on. It is further divided into two types: discrete and continuous data.

Discrete data: This is numerical data that represents countable items. Its values can be written out as a list, and that list may be finite or infinite.

Continuous data: This is numerical data that represents measurements. Its values can fall anywhere within an interval on the real number line, rather than being restricted to counting numbers. We may further divide continuous data into interval and ratio data.

Categorical data: This is data that may be divided into groups, such as race, sex, age group, and so on. It is further divided into two types: nominal and ordinal data. Nominal data names or labels categories without giving them any numerical value or order, and it is sometimes called “labeled” or “named” data. Ordinal data has a set order to it. However, there is no standard scale for measuring the difference between one category and the next.
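To make the four data types concrete, here is a minimal Python sketch. The survey values and variable names are invented for illustration, and the pairing of each type with a summary (mode, median, mean) is a common rule of thumb rather than something this lesson prescribes.

```python
from statistics import mean, median, mode

# Hypothetical survey answers from five people, invented for illustration.
blood_group = ["O", "A", "O", "B", "AB"]                  # nominal
satisfaction = ["low", "high", "medium", "high", "low"]   # ordinal
children = [0, 2, 1, 3, 2]                                # discrete
height_cm = [162.5, 170.1, 158.0, 181.3, 175.4]           # continuous

# Nominal data has no order, so counting is all we can do: report the mode.
most_common_group = mode(blood_group)

# Ordinal data has an order, so we can rank the labels and take the median.
order = ["low", "medium", "high"]
ranks = [order.index(level) for level in satisfaction]
median_level = order[int(median(ranks))]

# Numerical data supports arithmetic, so the mean is meaningful.
mean_children = mean(children)
mean_height = mean(height_cm)

print("Most common blood group:", most_common_group)
print("Median satisfaction:", median_level)
print("Mean number of children:", mean_children)
print("Mean height (cm):", round(mean_height, 2))
```

Notice that taking the mean of the blood groups would be meaningless, while taking the mean of the heights is perfectly sensible. Knowing the data type tells you which summaries you are allowed to use.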

Photo Credit: Ask Data science

With the data types covered, it is helpful to divide the statistical tools we use into two large groups of methods: descriptive statistics for summarizing data and inferential statistics for drawing conclusions from a sample of data.

Descriptive Statistics: Descriptive statistics refer to methods for summarizing raw observations into information that we can understand and share. Descriptive statistics describe a sample. That’s straightforward.

You take a group you’re interested in, record data about the group members, and then use summary statistics and graphs to present the group properties. With descriptive statistics, there is no uncertainty because you are describing only the people or items that you measure. You’re not trying to infer properties about a larger population.

The process involves taking a potentially large number of data points in the sample and reducing them to a few meaningful summary values and graphs. This lets us gain far more insight than poring over row upon row of raw numbers!
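As a small illustration of reducing raw observations to a few summary values, here is a sketch using Python's built-in statistics module. The exam scores are invented for the example.

```python
from statistics import mean, median, stdev

# Hypothetical exam scores for one class, invented for illustration.
scores = [55, 62, 68, 70, 71, 74, 78, 81, 85, 96]

summary = {
    "count": len(scores),
    "mean": mean(scores),
    "median": median(scores),
    "std_dev": stdev(scores),  # sample standard deviation
    "min": min(scores),
    "max": max(scores),
}

# Print each summary value rounded to two decimal places.
for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

Ten numbers are easy enough to read directly, but the same six summary values would describe ten thousand scores just as compactly.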

Inferential Statistics: Inferential statistics is a fancy name for methods that aid in quantifying properties of the domain or population from a smaller set of obtained observations called samples.

Inferential statistics take data from a sample and make inferences about the larger population from which the sample was drawn. Because the goal of inferential statistics is to draw conclusions from a sample and generalize them to a population, we need to have confidence that our sample accurately reflects the population. This requirement affects our process. At a broad level, we must do the following:

  1. Define the population we are studying.
  2. Draw a representative sample from that population.
  3. Use analyses that incorporate sampling error.
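The three steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the population is simulated (10,000 invented heights), the sample size of 100 is arbitrary, and the 1.96 multiplier gives an approximate 95% interval that relies on the normal approximation.

```python
import random
from math import sqrt
from statistics import mean, stdev

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# 1. Define the population: 10,000 simulated heights in cm (invented data).
population = [rng.gauss(170, 10) for _ in range(10_000)]

# 2. Draw a representative sample: simple random sampling without replacement.
sample = rng.sample(population, 100)

# 3. Use an analysis that incorporates sampling error: the standard error
#    of the mean and an approximate 95% confidence interval around it.
sample_mean = mean(sample)
standard_error = stdev(sample) / sqrt(len(sample))
ci_low = sample_mean - 1.96 * standard_error
ci_high = sample_mean + 1.96 * standard_error

print(f"Sample mean: {sample_mean:.2f} cm")
print(f"Approximate 95% interval: {ci_low:.2f} to {ci_high:.2f} cm")
print(f"Population mean (normally unknown): {mean(population):.2f} cm")
```

In real work we would not have the whole population to check against. The interval is what tells us how far the sample mean could plausibly be from the true value.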

We don’t get to pick a convenient group. Instead, random sampling allows us to have confidence that the sample represents the population. It is a primary method for obtaining samples that mirror the population on average. Random sampling produces statistics, such as the mean, that do not tend to be too high or too low.
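A quick simulation shows what "not too high or too low" means in practice. The population below is invented, and the "convenient group" is a deliberately extreme stand-in (the 50 largest values) chosen only to make the bias obvious.

```python
import random
from statistics import mean

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Invented population of 5,000 values centred near 50.
population = [rng.gauss(50, 15) for _ in range(5_000)]
population_mean = mean(population)

# Draw 1,000 random samples of 50 items and record each sample's mean.
random_means = [mean(rng.sample(population, 50)) for _ in range(1_000)]
average_of_random_means = mean(random_means)

# A convenient (non-random) group: here, simply the 50 largest values.
convenience_mean = mean(sorted(population)[-50:])

print(f"Population mean:         {population_mean:.2f}")
print(f"Average of random means: {average_of_random_means:.2f}")
print(f"Convenience sample mean: {convenience_mean:.2f}")
```

Individual random samples still miss the population mean a little, some high and some low, but on average they land on it. The convenience sample is off in one direction no matter how many times you repeat it.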

Using a random sample, we can generalize from the sample to the broader population. Unfortunately, gathering a random sample can be a complicated process.

Conclusion

The writer and philosopher George Santayana once said:

“Those who cannot remember the past are condemned to repeat it.”

Statistics helps you make sense of data, draw calculated inferences from the past, and predict what is to come.

Thank you AI+Club UI for the opportunity.
