Statistics

Unraveling Data Mysteries

Nishitha Kalathil
13 min read · Sep 14, 2023


Statistics is all about understanding and working with data. It helps us make sense of information and draw meaningful conclusions from it.

Imagine you have a bunch of numbers or facts, like how tall people are in your class. Statistics gives you tools and methods to organize, summarize, and analyze this information. It helps you find patterns, make predictions, and decide if the data is reliable.

For example, if you want to know the average height of your classmates, you’d use statistics to add up all the heights and then divide by the number of people. This gives you a number that represents the “typical” height in your class.

Statistics is used in all sorts of fields, from science and economics to sports and medicine. It’s like a toolkit that helps us understand the world around us, using numbers and data as our guide.

What is Data?

Data is information. It can be anything we collect or record to learn more about something. Imagine you’re counting how many apples are in a basket. Each number you write down is a piece of data.

Types of Statistics

Measures of central tendency

Measures of central tendency are statistics that describe the center or average of a dataset. The three main measures of central tendency are:

  1. Mean: The mean is calculated by adding up all the values in a dataset and dividing by the total number of values. It’s sensitive to outliers and extreme values.
  2. Median: The median is the middle value in a dataset when it is sorted in ascending or descending order. It’s not affected by extreme values and is often used when there are outliers.
  3. Mode: The mode is the value that appears most frequently in a dataset. A dataset can have one mode (unimodal), more than one mode (multimodal), or no mode at all.

Let’s use a set of exam scores for this example:

Exam Scores: 15, 18, 20, 20, 22, 23, 25, 25, 27, 30

Mean: Add up all the scores and divide by the total number of scores. (15 + 18 + 20 + 20 + 22 + 23 + 25 + 25 + 27 + 30) / 10 = 225 / 10 = 22.5. The mean is 22.5.

Median: Since there are 10 scores, the middle two scores are the 5th and 6th scores (22 and 23). The median is the average of these two middle values. (22 + 23) / 2 = 45 / 2 = 22.5. The median is 22.5.

Mode: The mode is the value that appears most frequently. In this dataset, both 20 and 25 appear twice, more often than any other value, so the data is bimodal with modes 20 and 25.

These measures provide different perspectives on the center of a dataset, and each has its own strengths and use cases. The choice of which measure to use depends on the nature of the data and the specific question being addressed.
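
If you want to verify these numbers yourself, here is a minimal sketch using Python's built-in statistics module, with the exam scores from the example above:

```python
# Central tendency for the exam scores, using Python's built-in statistics module.
import statistics

scores = [15, 18, 20, 20, 22, 23, 25, 25, 27, 30]

print(statistics.mean(scores))       # 22.5
print(statistics.median(scores))     # 22.5 (average of the 5th and 6th values)
print(statistics.multimode(scores))  # [20, 25]: the data is bimodal
```

Note that statistics.multimode (Python 3.8+) returns every most-frequent value, which is handy when a dataset has more than one mode.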

Measures of dispersion

Measures of dispersion, also known as measures of variability or spread, describe the extent to which data points in a dataset are spread out or dispersed. Some common measures of dispersion include:

Range: The range is the difference between the largest and smallest values in a data set. It provides a basic understanding of the spread of data.

From the above exam score we can calculate the range.

Range = Largest Value − Smallest Value = 30 − 15 = 15

Variance: Variance measures how far the values in a dataset deviate from their mean. It is the average of the squared differences from the mean.

(Population variance formula, σ² = Σ(xᵢ − μ)² / N; image originally from onlinemathlearning.com)

Mean = (15 + 18 + 20 + 20 + 22 + 23 + 25 + 25 + 27 + 30) / 10 = 225 / 10 = 22.5

Variance = ((15 − 22.5)² + (18 − 22.5)² + (20 − 22.5)² + (20 − 22.5)² + (22 − 22.5)² + (23 − 22.5)² + (25 − 22.5)² + (25 − 22.5)² + (27 − 22.5)² + (30 − 22.5)²) / 10

Variance = (56.25 + 20.25 + 6.25 + 6.25 + 0.25 + 0.25 + 6.25 + 6.25 + 20.25 + 56.25) / 10 = 178.5 / 10 = 17.85

Standard Deviation: The standard deviation is the square root of the variance. It gives a measure of the average distance between each data point and the mean.

Standard Deviation = √Variance = √17.85 ≈ 4.22
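
As a quick check of the arithmetic above, the statistics module also provides population variance and standard deviation functions (a minimal sketch):

```python
# Population variance and standard deviation for the exam scores
# (pvariance/pstdev divide by N, matching the calculation above).
import statistics

scores = [15, 18, 20, 20, 22, 23, 25, 25, 27, 30]

print(statistics.pvariance(scores))  # 17.85
print(statistics.pstdev(scores))     # ≈ 4.22
```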

Interquartile Range (IQR): The IQR is the range between the first quartile (Q1) and the third quartile (Q3) in a dataset. It’s a measure of the spread of the middle 50% of the data.

  • Q1 (25th percentile) = 20
  • Q3 (75th percentile) = 25
  • IQR = Q3 − Q1 = 25 − 20 = 5

Coefficient of Variation (CV): The coefficient of variation is the ratio of the standard deviation to the mean. It’s used to compare the relative variability between different datasets.

  • Coefficient of Variation = (Standard Deviation / Mean) * 100
  • Coefficient of Variation = (4.22 / 22.5) * 100 ≈ 18.8%

Percentiles: Percentiles divide a dataset into 100 equal parts. For example, the 25th percentile (Q1) is the value below which 25% of the data falls.
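
The remaining dispersion measures are easy to reproduce with NumPy, assuming it is installed. NumPy's default percentile method (linear interpolation) gives the same quartiles as above:

```python
# Range, quartiles, IQR, and coefficient of variation for the exam scores.
import numpy as np

scores = np.array([15, 18, 20, 20, 22, 23, 25, 25, 27, 30])

data_range = scores.max() - scores.min()   # 15
q1, q3 = np.percentile(scores, [25, 75])   # 20.0 and 25.0
iqr = q3 - q1                              # 5.0
cv = scores.std() / scores.mean() * 100    # ≈ 18.8% (np.std defaults to the population formula)

print(data_range, q1, q3, iqr, round(cv, 1))
```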

Distribution shapes

In descriptive statistics, distribution shapes refer to the way data is spread out or clustered around different values. Understanding the shape of a distribution is crucial for making inferences and drawing conclusions about a dataset. Three key aspects to consider when examining distribution shapes are modality, skewness, and kurtosis:

Modality:

Modality refers to the number of peaks or modes in a distribution. Modes are the values that appear most frequently in the data. There are three common types of modality:

  • Unimodal: The distribution has one clear peak or mode. It is often symmetrical, like a normal distribution.
  • Bimodal: The distribution has two distinct peaks or modes, indicating the presence of two separate groups or processes within the data.
  • Multimodal: The distribution has more than two modes, suggesting the presence of multiple subpopulations or processes contributing to the data.

(Illustration of unimodal, bimodal, and multimodal distributions, from StackExchange.)

Modality is important because it provides insights into the underlying structure of the data and can help identify distinct patterns or groups within it.

Skewness:

Skewness measures the asymmetry of a distribution. It indicates whether the data is skewed to the left (negatively skewed), skewed to the right (positively skewed), or approximately symmetrical (no skew).

  • Negatively Skewed (Left Skewed): The tail of the distribution extends to the left, and the majority of the data is concentrated on the right side. The mean is typically less than the median.
  • Positively Skewed (Right Skewed): The tail of the distribution extends to the right, and most data points are concentrated on the left side. The mean is typically greater than the median.

Symmetrical distributions, like the normal distribution, have skewness close to zero. Skewness is important because it affects the interpretation of central tendency measures (mean, median) and can impact statistical analyses and modeling choices.

Kurtosis:

Kurtosis measures the degree to which data points cluster in the tails of a distribution compared to a normal distribution. It tells us about the “tailedness” or peakedness of the distribution.

  • Leptokurtic: A distribution with positive excess kurtosis has fatter tails and a sharper peak than a normal distribution. It indicates a higher probability of extreme values.
  • Mesokurtic: A distribution with excess kurtosis near zero, i.e. the same tail behavior as a normal distribution, with neither unusually fat nor thin tails.
  • Platykurtic: A distribution with negative excess kurtosis has thinner tails and a flatter peak than a normal distribution. It indicates a lower probability of extreme values.

Kurtosis is important because it provides information about the tails of the distribution and can help identify the presence of outliers or unusual data points.
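
For real datasets you would rarely judge skewness and kurtosis by eye; SciPy (if available) computes both directly. A small sketch using the exam scores from earlier:

```python
# Skewness and excess kurtosis of the exam scores.
# scipy.stats.kurtosis reports *excess* kurtosis, so 0 corresponds to a normal distribution.
from scipy.stats import skew, kurtosis

scores = [15, 18, 20, 20, 22, 23, 25, 25, 27, 30]

print(skew(scores))      # 0.0: these scores happen to be perfectly symmetric around the mean
print(kurtosis(scores))  # ≈ -0.71: negative, i.e. flatter than a normal (platykurtic)
```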

Summary table

A summary table is a structured way of presenting information in a clear and organized format. It’s commonly used to condense and display essential information from a dataset, making it easier to understand and analyze. Summary tables are used in various fields such as statistics, finance, research, and business.

A typical summary table might include columns for different variables or categories, along with corresponding statistics or metrics. For example, in a finance context, a summary table for a portfolio might include columns like “Asset Name,” “Quantity,” “Price,” and “Value.”

Frequency distribution is a way of organizing data into categories or intervals, along with the corresponding frequencies (counts) of observations that fall into each category. It provides a concise summary of the data’s distribution, making it easier to understand patterns and trends.
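
A summary table and a frequency distribution are both one-liners in pandas (assuming it is installed); the bins below are just an illustrative choice:

```python
# Summary statistics and a frequency distribution for the exam scores with pandas.
import pandas as pd

scores = pd.Series([15, 18, 20, 20, 22, 23, 25, 25, 27, 30], name="exam_score")

print(scores.describe())   # count, mean, std, min, quartiles, max in one table

# Frequency distribution: count how many scores fall into each interval.
intervals = pd.cut(scores, bins=[15, 20, 25, 30], include_lowest=True)
print(intervals.value_counts().sort_index())
```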

Graphical visualization

Graphical visualization in statistics refers to the use of visual elements, such as charts, graphs, and plots, to represent data. This visual representation aids in understanding the underlying patterns, relationships, and distributions within the data, and it is a powerful tool for summarizing and communicating complex information clearly and intuitively. Common examples include histograms, box plots, scatter plots, and bar charts.
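
As a quick sketch (assuming matplotlib is installed), here are two of the most common plots applied to the exam scores:

```python
# Histogram and box plot of the exam scores.
import matplotlib.pyplot as plt

scores = [15, 18, 20, 20, 22, 23, 25, 25, 27, 30]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.hist(scores, bins=5, edgecolor="black")  # histogram: overall shape of the distribution
ax1.set_title("Histogram")
ax2.boxplot(scores)                          # box plot: median, quartiles, and outliers
ax2.set_title("Box plot")
plt.tight_layout()
plt.show()
```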

Estimation

In inferential statistics, estimation is the process of using sample data to make an educated guess or estimate about a population parameter. A parameter is a numerical characteristic of a population, such as the population mean, variance, proportion, or some other measure.

Here are the key concepts related to estimation in inferential statistics:

Sample and Population:

  • Population: The entire group of interest about which you want to make inferences. For example, if you’re interested in the average height of all adult males in a country, the entire population would be all adult males in that country.
  • Sample: A subset of the population that is actually observed or measured. It’s often not feasible to collect data from an entire population, so a sample is used to make inferences about the population.

Parameter and Statistic:

  • Parameter: A numerical characteristic of a population, such as the population mean (μ), population standard deviation (σ), or population proportion (p).
  • Statistic: A numerical characteristic of a sample, used to estimate the corresponding population parameter. For example, the sample mean (x̄), sample standard deviation (s), or sample proportion (p̂).

Point Estimation:

Point estimation involves using a single value (a statistic) to estimate a population parameter. For example, using the sample mean to estimate the population mean.

Interval Estimation (Confidence Intervals):

Interval estimation provides a range of values, called a confidence interval, within which we expect the population parameter to lie. It incorporates a level of confidence (e.g., 95%) that the true parameter value falls within the interval.

Margin of Error:

In interval estimation, the margin of error is the amount by which a sample statistic is likely to differ from the true population parameter. It’s determined by factors like sample size and variability.
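
To make point estimation, interval estimation, and the margin of error concrete, here is a minimal sketch (assuming SciPy is installed) that treats the exam scores as a random sample and builds a 95% confidence interval for the population mean:

```python
# 95% confidence interval for the mean, using the t-distribution.
import numpy as np
from scipy import stats

sample = np.array([15, 18, 20, 20, 22, 23, 25, 25, 27, 30])

point_estimate = sample.mean()                         # point estimate of the population mean
sem = stats.sem(sample)                                # standard error of the mean (sample std / sqrt(n))
margin = stats.t.ppf(0.975, df=len(sample) - 1) * sem  # margin of error at 95% confidence

print(point_estimate)
print(point_estimate - margin, point_estimate + margin)  # confidence interval bounds
```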

Sample Size:

The size of the sample used for estimation plays a crucial role. Larger sample sizes tend to result in more accurate estimations.

Sampling Distribution:

The distribution of a statistic (like the sample mean) over many repeated samples taken from the same population. It provides information about the variability of the statistic.
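
A short simulation makes the idea of a sampling distribution tangible; the population below is entirely made up for illustration:

```python
# Sampling distribution of the mean: draw many samples and look at how the means spread.
import numpy as np

rng = np.random.default_rng(0)
population = rng.normal(loc=50, scale=10, size=100_000)  # hypothetical population

sample_means = [rng.choice(population, size=30).mean() for _ in range(5_000)]

print(np.mean(sample_means))  # close to the population mean (50)
print(np.std(sample_means))   # close to 10 / sqrt(30), the standard error of the mean
```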

Bias:

In statistics, “bias” refers to the systematic error or tendency of an estimator to consistently deviate from the true population parameter it is trying to estimate. A biased estimator tends to either overestimate or underestimate the true parameter value. This deviation is consistent across multiple samples from the same population. A statistic is considered unbiased if, on average, it equals the true population parameter.

Efficiency:

An estimator is considered efficient if it has a small sampling variability (i.e., a small standard error) relative to other estimators.

Regression analysis

Regression analysis is a statistical method used to examine the relationship between one or more independent variables (predictors or features) and a dependent variable (the outcome or response). Its primary goal is to model and understand the relationship between variables, make predictions, and infer causality in some cases. Regression analysis is widely used in various fields, including economics, social sciences, healthcare, and machine learning.
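
As a minimal sketch (assuming SciPy; the x and y values are made-up illustrative data), a simple linear regression fits a straight line y = intercept + slope · x:

```python
# Simple linear regression with scipy.stats.linregress.
import numpy as np
from scipy import stats

x = np.array([1, 2, 3, 4, 5, 6, 7, 8])          # hypothetical predictor (e.g. hours studied)
y = np.array([15, 18, 20, 20, 23, 25, 27, 30])  # hypothetical outcome (e.g. exam score)

result = stats.linregress(x, y)
print(result.slope, result.intercept)  # fitted line
print(result.rvalue ** 2)              # R², the share of variance in y explained by x
print(result.pvalue)                   # p-value for the null hypothesis that the slope is 0
```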

Correlation analysis

Correlation analysis is a statistical technique used to quantify the strength and direction of a relationship between two or more continuous variables. It helps us understand how changes in one variable are associated with changes in another. Correlation does not imply causation; it simply indicates that a relationship exists.

Covariance and correlation are both measures of the relationship between two variables in statistics. However, they have some key differences in terms of their interpretation and scale:

Covariance:

Covariance measures the degree to which two variables change together. Specifically, it measures the average product of the deviations of each variable from their respective means. A positive covariance indicates a positive relationship: when X is above its mean, Y tends to be above its mean as well, and vice versa. A negative covariance indicates an inverse relationship: when X is above its mean, Y tends to be below its mean, and vice versa.

Correlation:

Correlation measures the strength and direction of a linear relationship between two variables. It standardizes the relationship to a scale of -1 to 1.

Correlation Coefficient:

The correlation coefficient (often denoted as r) is a numerical measure that ranges from -1 to 1. It quantifies the strength and direction of the linear relationship between two variables.

  • r = 1 indicates a perfect positive correlation (both variables increase together).
  • r = −1 indicates a perfect negative correlation (one variable increases as the other decreases).
  • r = 0 indicates no linear correlation between the variables.

Pearson Correlation:

The Pearson correlation coefficient is used when both variables are normally distributed and have a linear relationship. It is sensitive to outliers and assumes a linear association between variables.

Spearman Correlation:

The Spearman correlation coefficient assesses the strength and direction of a monotonic relationship between variables, which need not be linear. It is more robust to outliers and can handle non-linear (but monotonic) relationships.

Significance Testing:

Hypothesis tests can be conducted to determine whether the correlation coefficient is significantly different from zero.
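
A small sketch tying these ideas together (assuming SciPy; the data are the same hypothetical x and y as in the regression example):

```python
# Covariance, Pearson correlation, and Spearman correlation, with p-values.
import numpy as np
from scipy import stats

x = np.array([1, 2, 3, 4, 5, 6, 7, 8])
y = np.array([15, 18, 20, 20, 23, 25, 27, 30])

print(np.cov(x, y)[0, 1])                # sample covariance between x and y

r, p_pearson = stats.pearsonr(x, y)      # linear relationship
rho, p_spearman = stats.spearmanr(x, y)  # monotonic (rank-based) relationship

print(r, p_pearson)
print(rho, p_spearman)
```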

Hypothesis testing

Hypothesis testing is a fundamental statistical technique used to make inferences about population parameters based on sample data. It allows researchers to evaluate whether there is enough evidence to support or reject a specific hypothesis about a population parameter.

Here are the key steps and concepts involved in hypothesis testing:

Formulating Hypotheses:

  • Null Hypothesis (H0): This is the default assumption that there is no significant difference or relationship between variables. It represents a statement of no effect or no difference.
  • Alternative Hypothesis (H1 or Ha): This is the statement that contradicts the null hypothesis. It represents what the researcher is trying to provide evidence for.

Example:

  • H0: The mean weight of a certain species of birds is 50 grams.
  • H1​: The mean weight of a certain species of birds is not 50 grams.

Choosing a Significance Level (α):

The significance level (α) is the threshold for how much evidence against the null hypothesis is required to reject it. Common choices are 0.05 (5%) or 0.01 (1%). This determines the risk of making a Type I error (rejecting a true null hypothesis).

Collecting and Analyzing Data:

Data is collected through experiments or observations. Descriptive statistics and exploratory data analysis are used to understand the sample data.

Calculating a Test Statistic:

The test statistic is a numerical value that is used to assess the evidence against the null hypothesis. The choice of test statistic depends on the type of hypothesis test being performed (e.g., t-test, chi-squared test, etc.).

Determining the Critical Region (Rejection Region):

The critical region is the range of values of the test statistic that would lead to rejection of the null hypothesis. The critical values are determined based on the chosen significance level and the distribution of the test statistic.

Comparing the Test Statistic to Critical Values:

If the test statistic falls within the critical region, we reject the null hypothesis. If it falls outside the critical region, we fail to reject the null hypothesis.

Interpreting Results:

  • If we reject the null hypothesis, it suggests that there is enough evidence to support the alternative hypothesis.
  • If we fail to reject the null hypothesis, it means we do not have enough evidence to support the alternative hypothesis.

Type I and Type II Errors:

  • Type I Error (α): Rejecting a true null hypothesis (false positive).
  • Type II Error (β): Failing to reject a false null hypothesis (false negative).

Effect Size and Power:

Effect size measures the practical significance of a result, while power is the probability of correctly rejecting a false null hypothesis.

Hypothesis testing is a critical tool in scientific research and decision-making, helping to draw valid conclusions from sample data about larger populations. It provides a structured framework for making objective judgments based on evidence.
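
Putting the steps together, here is a minimal sketch of a one-sample t-test for the bird-weight example above (assuming SciPy; the weights are made-up illustrative data):

```python
# One-sample t-test: H0 is that the mean weight is 50 grams.
import numpy as np
from scipy import stats

weights = np.array([48.2, 51.5, 49.8, 47.9, 52.3, 50.1, 46.8, 49.5, 51.0, 48.7])

t_stat, p_value = stats.ttest_1samp(weights, popmean=50)
print(t_stat, p_value)

alpha = 0.05
if p_value < alpha:
    print("Reject H0: the mean weight differs from 50 grams.")
else:
    print("Fail to reject H0: not enough evidence that the mean differs from 50 grams.")
```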

There are several common hypothesis tests used to make inferences about population parameters based on sample data, including the one-sample and two-sample t-tests, the paired t-test, the z-test, the chi-squared test, ANOVA, and non-parametric alternatives such as the Mann-Whitney U test.

These are some of the most commonly used hypothesis tests in statistics, but there are many others designed for specific situations and types of data. The choice of test depends on the nature of the data, the research question, and the assumptions that can be reasonably met.
