Introduction to Descriptive Statistics

Addya Joshi
The Business Club, IIT (BHU) Varanasi
May 23, 2020

In this series of articles, we aim to cover statistics from scratch through to its applications in the real business world. We’ll start by defining what a variable means in a given dataset. Variables represent the characteristics recorded for each data point and can be broadly categorised as numerical or categorical.

Numerical variables are either discrete or continuous. A continuous numerical variable can take any value within a range, such as height or weight, whereas a discrete one can take only distinct, countable values, such as the number of siblings. Categorical variables are of two types: ordinal and nominal. As the name suggests, ordinal categorical variables have an inherent order. For example, education levels follow an order: pre-school, high school, undergraduate, postgraduate, PhD. Nominal categorical variables, on the other hand, have no inherent order or levels.
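To make the ordinal versus nominal distinction concrete, here is a minimal Python sketch. The education order is taken from the example above. The blood-group list and the names `EDUCATION_ORDER` and `education_rank` are assumptions made purely for illustration.

```python
from collections import Counter

# Ordinal: the levels have a meaningful order (lowest to highest).
EDUCATION_ORDER = ["pre-school", "high school", "undergraduate", "postgraduate", "PhD"]

def education_rank(level):
    """Return the position of an education level in the assumed order."""
    return EDUCATION_ORDER.index(level)

# Hypothetical survey answers, in no particular order.
answers = ["PhD", "high school", "undergraduate", "pre-school"]

# Sorting and comparing make sense for an ordinal variable.
sorted_answers = sorted(answers, key=education_rank)

# Nominal: categories have no order, so we can only count or compare for equality.
blood_groups = ["A", "O", "B", "O", "AB", "O"]  # made-up values
blood_group_counts = Counter(blood_groups)

print(sorted_answers)
print(blood_group_counts.most_common(1))
```

Sorting the education levels is meaningful because the variable is ordinal. For the nominal variable, only operations such as counting or finding the most frequent category are meaningful.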

Now how does one analyse the given data? Statistical studies fall into two types: observational and experimental. In an observational study, data collection does not interfere with how the data arise. We merely observe, so we can establish only an association, i.e., a correlation, and cannot conclude that one variable causes another. If the observational study uses data collected in the past, it’s called a retrospective study; if it collects data as the study proceeds, it’s called a prospective study. In an experiment, we randomly assign the subjects to groups, impose a different treatment on each group, and can then establish causal connections.

Let’s make this clearer. Suppose we want to establish a relationship between regularly smoking cigarettes and having cancer. In an observational study, we would split the population into two groups: those who regularly smoke and those who do not. We then compare how many people in each group have cancer. In an experiment, by contrast, we sample a group of people and randomly assign them to two groups: those who will regularly smoke throughout the study and those who will not. Note that we impose the decision to smoke or not to smoke on our subjects. At the end of the study, we compare the number of people with cancer in both groups.

In the observational study, even if we found a pattern among the people who have cancer, we could not attribute it solely to regularly smoking cigarettes. There may have been other variables, such as medical history or lifestyle, that we did not control for and that affected the outcome. In an experiment, however, such variables are likely to be equally represented in both groups because of random assignment. Therefore, if we observe an association between regularly smoking cigarettes and having cancer, we can make a causal statement.

Variables that affect both the explanatory and the response variable, like medical history in our case, and make it seem that there is a relationship between the two are called confounding variables. The important thing to note is that “correlation does not imply causation”. Whether we can claim only a correlation or also causation depends on the kind of study on which we base our conclusions.
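A small simulation can illustrate how a confounding variable produces an association in an observational study but not in a randomised experiment. The sketch below is a toy model, not real data: it uses a generic "exposure" and "outcome", every probability is invented, and the exposure is deliberately given no effect on the outcome. The function name `outcome_rate_gap` is likewise hypothetical.

```python
import random

def outcome_rate_gap(n, randomized, seed=0):
    """Return (outcome rate among exposed) - (outcome rate among unexposed)."""
    rng = random.Random(seed)
    exposed_outcomes, unexposed_outcomes = [], []
    for _ in range(n):
        # Hidden confounder, e.g. a risky lifestyle (invented 50% prevalence).
        risky = rng.random() < 0.5
        if randomized:
            # Experiment: a coin flip decides the exposure, independent of the confounder.
            exposed = rng.random() < 0.5
        else:
            # Observational: the confounder makes exposure more likely.
            exposed = rng.random() < (0.8 if risky else 0.2)
        # By construction, the outcome depends ONLY on the confounder, not on exposure.
        outcome = rng.random() < (0.30 if risky else 0.10)
        (exposed_outcomes if exposed else unexposed_outcomes).append(outcome)
    rate_exposed = sum(exposed_outcomes) / len(exposed_outcomes)
    rate_unexposed = sum(unexposed_outcomes) / len(unexposed_outcomes)
    return rate_exposed - rate_unexposed

observational_gap = outcome_rate_gap(100_000, randomized=False)
experimental_gap = outcome_rate_gap(100_000, randomized=True)

print(f"observational gap: {observational_gap:.3f}")
print(f"experimental gap:  {experimental_gap:.3f}")
```

Under these assumed probabilities, the observational comparison shows a clear gap between the groups even though the exposure does nothing, while the randomised comparison shows a gap close to zero because the confounder is spread evenly across both groups.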

Above, we spoke about sampling the population. But how do we decide whom to choose and whom to leave out? There are three common sampling schemes: simple random, stratified and cluster sampling. In simple random sampling, individuals are chosen at random, so each one has an equal probability of being chosen. In stratified sampling, the population is first divided into homogeneous subgroups called strata. The subjects vary across the different strata but are similar within each stratum. The strata are mutually exclusive and collectively exhaustive. Simple random sampling is then applied within each stratum. Lastly, in cluster sampling, the population is divided into groups called clusters. These clusters are internally heterogeneous but similar to one another. A simple random sample of clusters is then drawn, and every individual in the chosen clusters is included; in multistage sampling, individuals are instead randomly sampled within the chosen clusters.
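The three schemes can be sketched in a few lines of Python using only the standard library. The population below is made up, with `age_group` standing in for a stratum and `city` for a cluster, and all function names are hypothetical. The cluster version shown is the common one-stage form, in which whole clusters are drawn at random and everyone in them is included.

```python
import random
from collections import defaultdict

def simple_random_sample(population, k, rng):
    """Every individual has the same chance of being chosen."""
    return rng.sample(population, k)

def group_by(population, key):
    """Split the population into groups according to a key function."""
    groups = defaultdict(list)
    for person in population:
        groups[key(person)].append(person)
    return groups

def stratified_sample(population, stratum_of, k_per_stratum, rng):
    """Draw a simple random sample from within every stratum."""
    sample = []
    for members in group_by(population, stratum_of).values():
        sample.extend(rng.sample(members, k_per_stratum))
    return sample

def cluster_sample(population, cluster_of, n_clusters, rng):
    """Randomly pick whole clusters and keep everyone inside them."""
    clusters = group_by(population, cluster_of)
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [person for name in chosen for person in clusters[name]]

# Hypothetical population of 90 people.
population = [
    {"id": i, "age_group": ["young", "middle", "old"][i % 3], "city": f"city_{i % 5}"}
    for i in range(90)
]

rng = random.Random(42)  # fixed seed so the run is repeatable
srs = simple_random_sample(population, 10, rng)
strat = stratified_sample(population, lambda p: p["age_group"], 4, rng)
clus = cluster_sample(population, lambda p: p["city"], 2, rng)

print(len(srs), len(strat), len(clus))
```

Note the difference in what gets randomised: stratified sampling draws individuals from every stratum, so each age group is guaranteed to appear, whereas cluster sampling draws only some of the cities and leaves the rest out entirely.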

With this, we have covered the basic terminology and shall continue from here in the next article.
