Statistics & Probability — Introduction

A gentle introduction to statistics

Omar Elgabry
OmarElgabry's Blog
9 min read · Feb 24, 2019

This series of articles is inspired by the Statistics with R Specialization from Duke University. The full series of articles can be found here.

Why do we need statistics?

We need answers, and answers come in the form of numbers. But why?

Perhaps, if you analyze those numbers, you could make more informed decisions. Maybe you want to convince someone of something. Or, by looking at those numbers, you might discover something new about your organization, employees, or customers.

But, …

Some of those numbers are helpful. Some are confusing, and others are probably just distracting us from what’s really important.

The problem, of course, is figuring out which numbers are useful.

So, statistics …

  • Can help us quantify uncertainty.
  • Can help us notice whether results give us a true picture of a situation or a biased view.

The outcome?

  • We can make more informed decisions by knowing which numbers are helpful in each situation.
  • We can explain and illustrate to others how certain statistics will lead to better outcomes.

The first steps …

Questioning and wondering are not only essential skills for becoming a statistician; they are also the first steps in approaching a new study or dataset.

  • What’s the target audience? That is, what is the population of interest (everyone, or only people in a certain city)?
  • What does the sample (data) represent? We may be interested in the whole population, while the sample represents only some people in one area.
  • Where did those numbers come from?
  • How were they collected and calculated? The process might have some flaws.
  • Are those the right numbers needed to make this decision?

Observational Studies & Experiments

These are the two types of studies used to answer research questions.

Observational

Researchers collect data, and merely observe the association (correlation) between the explanatory and the response variables.

The response variable is the focus of a question in a study or experiment. An explanatory variable is one that may explain changes in the response variable, and we are interested in studying its effect on the response variable.

Experiments

Researchers “randomly” assign subjects to groups and can, therefore, establish causal connections between the explanatory and response variables.

(~) The relationship between regularly working out and energy level.

In an observational study, we would sample two types of people from the population: those who “already” work out regularly and those who don’t.

In an experiment, we randomly assign each person to one of two groups: one that “will” work out, and one that won’t.

In both, we then find the average energy level for the two groups of people and compare.

So, what’s the difference between the two studies?

— 1. Observe vs. Impose

In the experiment, whether a person works out is not decided by the person, as in the observational study, but is instead “imposed” by the researcher, who randomly assigns every person to a group.

— 2. Correlation vs. Causation

In an observational study, even if we find a difference between the average energy levels of the two groups, we can’t attribute this difference solely to whether a person works out. Why?

Because there may be other variables that we didn’t consider that contribute to the observed difference. For example, younger people might be more likely to work out regularly and also have higher energy levels. These variables are called confounding variables.

However, in the experiment, such variables that might contribute to the outcome are likely to be equally represented in the two groups due to the random assignment.

Therefore, if we find a difference between the two averages, we can indeed make a causal statement attributing this difference to “working out”.

In general, what determines whether we can infer causation or just correlation is the type of study on which we’re basing our conclusions.

Observational studies, for the most part, allow us to make only correlational statements, while experiments allow us to infer causation.
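To make the role of a confounder concrete, here is a small simulation sketch in Python. Everything in it is an assumption made up for illustration: the numbers, the `make_person` and `energy` helpers, and the idea that working out truly adds 5 energy points while being young adds 20. It is not data from any real study.

```python
import random
from statistics import mean

rng = random.Random(0)  # fixed seed so the sketch is repeatable

def make_person():
    # Hypothetical model: age is a confounder that drives both habits and energy.
    young = rng.random() < 0.5
    return {"young": young, "base_energy": 70 if young else 50}

def energy(person, works_out):
    # Assumed true causal effect of working out: +5 energy points, plus noise.
    return person["base_energy"] + (5 if works_out else 0) + rng.gauss(0, 5)

people = [make_person() for _ in range(20000)]

# Observational study: people choose for themselves; young people work out more.
obs_yes, obs_no = [], []
for p in people:
    works_out = rng.random() < (0.8 if p["young"] else 0.2)
    (obs_yes if works_out else obs_no).append(energy(p, works_out))

# Experiment: the researcher assigns working out by coin flip, ignoring age.
exp_yes, exp_no = [], []
for p in people:
    works_out = rng.random() < 0.5
    (exp_yes if works_out else exp_no).append(energy(p, works_out))

obs_diff = mean(obs_yes) - mean(obs_no)
exp_diff = mean(exp_yes) - mean(exp_no)
print(f"observational difference: {obs_diff:.1f}")
print(f"experimental difference:  {exp_diff:.1f}")
```

Under these made-up assumptions, the observational difference should come out far larger than the true effect of 5, because age is mixed into the comparison, while the experimental difference should land close to 5.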

Causation means that one event brings about the other, while correlation only means that the two tend to change together.

For example, smoking is correlated with alcoholism, but it doesn’t cause alcoholism.

Sampling and sources of bias

What’s a sample, and why do we need it? Wouldn’t it be better to just include everyone? That is not always a good idea:

  • Measuring everything is just way too expensive, and too time consuming.

Political operatives can’t poll every voter. Cell phone companies can’t measure the quality level of every single item they produce.

  • Some individuals may be hard to locate or hard to measure.

Sampling

Instead of measuring everything, we just measure a small group or subset of the total population.

That small subset of measurements is a sample. This sample can act as a representative of the entire population.

Think about something you’re cooking. You taste a small part of it to get an idea about the dish as a whole. After all, you would never eat a whole pot of soup just to check its taste. Analyzing a sample is called “exploratory analysis”.

Suppose you taste a spoonful of soup and decide that the spoonful isn’t salty enough. If you then generalize and conclude that the entire pot needs salt, you are making an “inference”.

For your inference to be valid, the spoonful you tasted, your sample, needs to be representative of your entire pot, your population. If your spoonful comes only from the surface, and the salt is collected at the bottom of the pot, what you tasted is probably not going to be representative of the whole pot unless you first stir the soup thoroughly before you taste.
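The soup analogy can be sketched in a few lines of Python. The pot below is entirely hypothetical: a list of 10,000 “drops” ordered from surface to bottom, with all the salt sitting in the bottom fifth. A large spoonful from the surface is compared with a much smaller spoonful drawn at random, which stands in for tasting after stirring.

```python
import random
from statistics import mean

rng = random.Random(3)  # fixed seed so the sketch is repeatable

# Hypothetical pot: index 0 is the surface, the last index is the bottom.
# Salt (in made-up units per drop) has settled in the bottom 20% of the pot.
pot = [0.0] * 8000 + [5.0] * 2000
true_saltiness = mean(pot)  # average over the whole pot (the population)

# A big spoonful taken only from the surface: large, but not representative.
surface_spoon = pot[:3000]

# A small spoonful drawn at random, as if the soup had been stirred first.
stirred_spoon = rng.sample(pot, 400)

print(f"whole pot:     {true_saltiness:.2f}")
print(f"surface spoon: {mean(surface_spoon):.2f}")
print(f"stirred spoon: {mean(stirred_spoon):.2f}")
```

In this toy setup the surface spoonful contains no salt at all, however large it is, while the small random spoonful should land near the pot’s true average.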

Sampling Methods

Now we have a good idea of why we might want to sample, and why it’s important for our sample to be representative of the population.

Next, how can we take a sample?

— 1. Simple random

We randomly select cases from the population, such that each case is equally likely to be selected.
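As a minimal sketch, simple random sampling can be done with Python’s standard `random.sample`, which draws without replacement so that every case is equally likely to be picked. The population of 1,000 numbered cases is a made-up placeholder.

```python
import random

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical population: 1,000 cases identified by the numbers 1..1000.
population = list(range(1, 1001))

# Draw 50 distinct cases; each case has the same chance of being selected.
sample = rng.sample(population, k=50)

print(sorted(sample)[:10])  # peek at a few of the selected case IDs
```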

— 2. Stratified

The population is divided into groups (strata) based on some characteristic, such as geography or gender. Then, within each group, a random sample is selected.
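A rough sketch of stratified sampling in Python follows. The population, the four regions, and the `stratified_sample` helper are all hypothetical names invented for this illustration.

```python
import random
from collections import defaultdict

rng = random.Random(1)  # fixed seed so the sketch is repeatable

# Hypothetical population: (person_id, region) pairs, 100 people per region.
regions = ["north", "south", "east", "west"]
population = [(i, regions[i % 4]) for i in range(400)]

def stratified_sample(population, per_stratum, rng):
    """Group people by region (the strata), then sample randomly within each."""
    strata = defaultdict(list)
    for person_id, region in population:
        strata[region].append(person_id)
    return {
        region: rng.sample(members, per_stratum)
        for region, members in strata.items()
    }

sample = stratified_sample(population, per_stratum=10, rng=rng)
for region, members in sample.items():
    print(region, len(members))
```

Every stratum contributes to the sample, which is the defining feature of this method.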

— 3. Cluster

The population is divided into groups (clusters), so that every member belongs to exactly one cluster. Unlike strata, the clusters are similar to one another. Then, a random sample of clusters is chosen, and all individuals within the sampled clusters are surveyed.

The difference between cluster and stratified sampling is that, with stratified sampling, the sample includes elements from each stratum. With cluster sampling, in contrast, the sample includes all the elements, but only from the sampled clusters.
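For contrast with stratified sampling, here is a hedged sketch of cluster sampling. The city blocks, resident names, and the `cluster_sample` helper are assumptions made up for the example.

```python
import random

rng = random.Random(2)  # fixed seed so the sketch is repeatable

# Hypothetical population: 20 city blocks (clusters) with 30 residents each.
clusters = {
    f"block_{b}": [f"resident_{b}_{r}" for r in range(30)]
    for b in range(20)
}

def cluster_sample(clusters, n_clusters, rng):
    """Pick whole clusters at random and keep everyone inside them."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return {name: list(clusters[name]) for name in chosen}

sample = cluster_sample(clusters, n_clusters=4, rng=rng)
for name, residents in sample.items():
    print(name, len(residents))
```

Only 4 of the 20 blocks appear in the sample, but each chosen block is surveyed completely.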

— 4. Multistage

We select a sample by using combinations of different sampling methods.

So, we might use cluster sampling to choose clusters from a population. Then, use simple random sampling to select only a subset of elements from each chosen cluster.

One might divide a city into geographic regions that are on average similar to each other (clusters), then randomly sample a few of these regions, and then sample a few people from within those regions.
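The two-stage procedure just described could be sketched like this. The regions, the people, and the `multistage_sample` helper are hypothetical; the sketch combines cluster sampling in the first stage with simple random sampling in the second.

```python
import random

rng = random.Random(7)  # fixed seed so the sketch is repeatable

# Hypothetical city: 12 similar regions (clusters) with 50 people each.
city_regions = {
    f"region_{g}": [f"person_{g}_{p}" for p in range(50)]
    for g in range(12)
}

def multistage_sample(clusters, n_clusters, per_cluster, rng):
    """Stage 1: pick clusters at random. Stage 2: pick people within each."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return {name: rng.sample(clusters[name], per_cluster) for name in chosen}

sample = multistage_sample(city_regions, n_clusters=3, per_cluster=5, rng=rng)
for name, people in sample.items():
    print(name, people)
```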

Sampling Bias

It means that the sample is selected in a way that does not represent the true population: because the selection is not random, some members are more or less likely to be included than others.

Going back to the soup example: if the soup is not well stirred, it doesn’t matter how large a spoon you have, it will still not taste right. If the soup is well stirred, a small spoon will suffice to test the soup.

Sampling bias can happen for the following reasons:

— 1. Convenience

It occurs when individuals who are easily accessible are more likely to be included in the sample.

If you only poll people in your neighborhood, and not from the whole city, your study would suffer from convenience bias.

— 2. Non-response

It happens when only a non-random fraction of the randomly sampled people respond to a survey, such that the sample is no longer representative of the population.

For example, certain segments of the population, say those of lower socioeconomic status, may be less likely to respond to the survey.
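The simulation below is a rough sketch of how non-response can distort a survey even when the initial sample is random. All of the numbers are invented for illustration: a segment making up 40% of the population supports some policy at a 70% rate but answers surveys only 20% of the time, while everyone else supports it at 30% and answers 80% of the time.

```python
import random

rng = random.Random(4)  # fixed seed so the sketch is repeatable

# Hypothetical population of (in_segment, supports_policy) pairs.
population = []
for _ in range(50000):
    in_segment = rng.random() < 0.4
    supports = rng.random() < (0.7 if in_segment else 0.3)
    population.append((in_segment, supports))

def support_rate(people):
    return sum(supports for _, supports in people) / len(people)

true_rate = support_rate(population)

# Step 1: a proper random sample of the population.
sampled = rng.sample(population, 5000)

# Step 2: only some respond, and the segment responds far less often.
respondents = [
    person for person in sampled
    if rng.random() < (0.2 if person[0] else 0.8)
]
observed_rate = support_rate(respondents)

print(f"true support:     {true_rate:.2f}")
print(f"observed support: {observed_rate:.2f}")
```

Under these assumptions the respondents under-represent the segment, so the observed support rate should fall noticeably below the true one.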

— 3. Voluntary response

It occurs when the sample consists of only people who volunteer to respond because they have strong opinions on the issue.

For example, if the only people who responded to a poll are those who felt strongly enough to vote, they definitely do not make up a representative sample.

Experimental Design

There are four principles of experimental design:

— 1. Control

Compare treatment group to control group.

Treatment means the thing you are studying, which will be applied to the subjects, like a medication, an energy gel, etc.

The treatment group is the group of subjects (people) who will receive the treatment, while the control group won’t. But everything else (the other conditions) should remain the same.

We can have multiple treatment groups receiving different treatments, such as different drugs. In that case, we compare the different treatment groups with each other, with or without a control group.

— 2. Collect

Collect a large enough sample, or replicate (re-do) the entire study.

— 3. Randomize

Randomly assign subjects to the treatment and control groups.

— 4. Block

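If there are variables that are known or suspected to affect the outcome (response variable), first group the subjects into blocks based on these variables, and then randomize within each block. These are the confounding variables we discussed earlier.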

(~) Determine whether an energy gel affects athletic performance.

  1. Treatment: energy gel (explanatory). Control: no energy gel. Outcome: performance of athletes (response).
  2. Collect data.
  3. Randomly assign subjects to two groups (treatment and control).
  4. Block variables: the energy gel might affect pro and amateur athletes differently.

So, split the subjects into pro and amateur athletes (the blocks), and within each block randomly assign athletes to the treatment and control groups. That way, pro and amateur athletes are equally represented in both groups, and hence the outcome is more reliable.
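Here is a minimal sketch of blocking for the energy gel example. The athlete names, the counts, and the `blocked_assignment` helper are hypothetical; the point is only that the shuffle happens separately inside each block.

```python
import random

rng = random.Random(5)  # fixed seed so the sketch is repeatable

# Hypothetical subjects: 12 pro athletes and 20 amateur athletes.
athletes = [(f"pro_{i}", "pro") for i in range(12)]
athletes += [(f"amateur_{i}", "amateur") for i in range(20)]

def blocked_assignment(subjects, rng):
    """Split subjects into blocks by level, then randomize within each block."""
    blocks = {}
    for name, level in subjects:
        blocks.setdefault(level, []).append(name)

    groups = {"treatment": [], "control": []}
    for level, members in blocks.items():
        shuffled = members[:]
        rng.shuffle(shuffled)  # random order within this block only
        half = len(shuffled) // 2
        groups["treatment"] += [(name, level) for name in shuffled[:half]]
        groups["control"] += [(name, level) for name in shuffled[half:]]
    return groups

groups = blocked_assignment(athletes, rng)
for group_name, members in groups.items():
    pros = sum(1 for _, level in members if level == "pro")
    print(group_name, "pros:", pros, "amateurs:", len(members) - pros)
```

Because each block is split in half, both groups end up with the same number of pros and the same number of amateurs, which plain randomization over all 32 athletes would not guarantee.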

Experimental Design Terminologies

  • placebo: A fake treatment, often given to the control group in medical studies.
  • placebo effect: When experimental units show improvement simply because they believe they’re receiving a special treatment, even though it is a placebo.
  • blinding: Experimental units do not know whether they are in the control or the treatment group.
  • double-blind: Both the experimental units and the researchers do not know who is in the control group and who is in the treatment group.

Random Sample & Random Assignment

Random sampling occurs when subjects are being selected for a study.

The subjects are selected randomly from the population, and so each subject in the population is equally likely to be selected, and the resulting sample is likely representative of the population. Therefore the study’s results can be “generalized” to the whole population.

Random assignment occurs when the subjects are being assigned to various treatments (including the control group).

Through random assignment, we ensure that the subjects’ different characteristics are represented equally in the treatment and control groups.

This allows us to attribute any observed difference to the treatment applied to the subjects. So, random assignment allows us to make “causal” conclusions based on the study.

(~) Evaluating whether people read “serif” or “sans serif” fonts faster.

  1. Randomly select subjects for your study from your population.
  2. Then, randomly assign the subjects in your sample to two treatment groups: “serif” vs “sans serif”.
  3. Ensure that other factors that may be contributing to the outcome, like fluency, are represented equally in the two groups.

So, sampling happens first, and assignment happens second.
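The order of the two steps can be sketched as follows. The reader IDs and the group sizes are made up for illustration.

```python
import random

rng = random.Random(6)  # fixed seed so the sketch is repeatable

# Hypothetical population of 5,000 readers.
population = [f"reader_{i}" for i in range(5000)]

# Step 1, random sampling: decides WHO takes part in the study.
sample = rng.sample(population, 100)

# Step 2, random assignment: decides WHICH treatment each participant gets.
shuffled = sample[:]
rng.shuffle(shuffled)
serif_group = shuffled[:50]
sans_serif_group = shuffled[50:]

print(len(serif_group), len(sans_serif_group))
```

Random sampling is what supports generalizing to the population, and random assignment is what supports a causal conclusion about the fonts.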

Thank you for reading! If you enjoyed it, please clap 👏 for it.
