Refresher on Statistics.

Statistics is the science of learning from data and making inferences about a population based on sample data. It finds applications in economics, psychology, medicine, the social sciences, and everyday decision-making.

The mean is the equitable distribution of values: add all the values and divide by how many there are to get the average. But what if you want to understand how much each value varies from the average (and thereby from the others)? Variance measures how “spread out” the values in a data set are. However, because it is calculated by squaring the differences from the mean, its units are also squared and can be hard to interpret. Taking the square root of the variance brings the measure back to the original units of the data; this quantity is termed the Standard Deviation (SD).

Low variance — low SD — consistency between data points
High variance — high SD — large differences between data points

Q. Now, why is the difference from the mean squared while calculating variance?
1. To avoid negative values cancelling out: squaring the deviations makes them all positive, preventing positive and negative differences from cancelling each other out.
2. To emphasize larger deviations: squaring gives larger deviations a greater impact on the variance, effectively capturing the spread of the data (a short sketch follows this list).
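
To make these definitions concrete, here is a minimal sketch in plain Python; the data values are made up for illustration. Note that dividing by n gives the population variance, while the sample variance divides by n − 1:

```python
data = [4, 8, 6, 5, 3, 7]  # hypothetical observations

# Mean: the equitable share of the total across all values.
mean = sum(data) / len(data)

# Variance: the average of the squared deviations from the mean.
# Squaring keeps deviations positive and weights large ones more heavily.
variance = sum((x - mean) ** 2 for x in data) / len(data)

# Standard deviation: the square root of the variance,
# back in the original units of the data.
std_dev = variance ** 0.5

print(mean, variance, std_dev)  # 5.5 2.9166... 1.7078...
```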

Mode for ungrouped data is the value that appears most frequently in a data set. The mode for grouped data can be estimated using the following formula:

Mode = l + ((f1 − f0) / (2f1 − f0 − f2)) × h

where l is the lower limit of the modal class, f1 the frequency of the modal class, f0 the frequency of the preceding class, f2 the frequency of the succeeding class, and h the class width.

  • For discontinuous grouped data, first adjust the class limits to create continuous intervals: subtract half of the gap from the lower limit and add half of the gap to the upper limit of each class.
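
A minimal sketch of this estimate in Python; the class intervals and frequencies below are hypothetical:

```python
# Hypothetical continuous class intervals and their frequencies.
classes = [(0, 10), (10, 20), (20, 30), (30, 40)]
freqs = [5, 12, 8, 3]

i = freqs.index(max(freqs))          # index of the modal class
l = classes[i][0]                    # lower limit of the modal class
h = classes[i][1] - classes[i][0]    # class width
f1 = freqs[i]                        # frequency of the modal class
f0 = freqs[i - 1] if i > 0 else 0    # frequency of the preceding class
f2 = freqs[i + 1] if i < len(freqs) - 1 else 0  # frequency of the succeeding class

mode = l + (f1 - f0) / (2 * f1 - f0 - f2) * h
print(mode)  # 10 + 7/11 * 10 ≈ 16.36
```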

Frequency Polygon is a line graph that connects the midpoints of the tops of the bars of a histogram.

Purpose:

  • To compare different frequency distributions.
  • To see the overall pattern of the data.

When extending the lines of a frequency polygon to the x-axis, the points you choose should represent the midpoints of hypothetical classes (with zero frequency) before the first class and after the last class.
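
As a sketch, here is how such a polygon can be drawn with matplotlib; the class midpoints and frequencies are hypothetical, and the endpoints are the midpoints of the zero-frequency classes on either side:

```python
import matplotlib.pyplot as plt

# Hypothetical classes 10-20, 20-30, ..., 50-60 and their frequencies.
midpoints = [15, 25, 35, 45, 55]
freqs = [4, 9, 14, 7, 2]

# Extend to the midpoints of hypothetical classes before the first
# and after the last class, each with zero frequency.
xs = [5] + midpoints + [65]
ys = [0] + freqs + [0]

plt.plot(xs, ys, marker="o")
plt.xlabel("Class midpoint")
plt.ylabel("Frequency")
plt.title("Frequency Polygon")
plt.show()
```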

A Cumulative Frequency Curve (ogive) helps in understanding the overall shape and spread of a data distribution and shows how cumulative frequencies build up. To mark the median, locate half of the total frequency on the curve: 50% of the data lies below the corresponding value on the x-axis. By plotting multiple cumulative frequency curves on the same graph, you can compare different data sets.
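
Here is a minimal sketch of reading the median off a cumulative frequency curve by linear interpolation; the class boundaries and cumulative frequencies are hypothetical:

```python
# Hypothetical upper class boundaries and "less than" cumulative frequencies.
boundaries = [10, 20, 30, 40, 50]
cum_freq = [5, 17, 32, 41, 45]

half = cum_freq[-1] / 2  # 22.5: the median sits where the curve crosses this

# The median class is the first whose cumulative frequency reaches n/2.
i = next(k for k, cf in enumerate(cum_freq) if cf >= half)
lower = boundaries[i - 1] if i > 0 else 0     # lower boundary of the median class
cf_before = cum_freq[i - 1] if i > 0 else 0   # cumulative frequency before it
f = cum_freq[i] - cf_before                   # frequency of the median class
h = boundaries[i] - lower                     # class width

median = lower + (half - cf_before) / f * h
print(median)  # 20 + (22.5 - 17)/15 * 10 ≈ 23.67
```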

Pie Diagram: The degree of components ‘θ’ for each slice in a pie chart can be calculated using the following formula:

θ = (Value of the component / Total value) × 360°
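
A quick sketch of this calculation in Python, with a hypothetical set of components:

```python
# Hypothetical budget components and their values.
components = {"Rent": 1200, "Food": 600, "Travel": 300, "Savings": 900}

total = sum(components.values())
# Degree of each component: its share of the total, scaled to 360 degrees.
angles = {name: value / total * 360 for name, value in components.items()}

print(angles)  # {'Rent': 144.0, 'Food': 72.0, 'Travel': 36.0, 'Savings': 108.0}
```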

The Normal Probability Curve, also called the bell curve or Gaussian distribution, describes how data values are distributed around the mean.

Normal Distribution
  • 68% of the data falls within ±1 standard deviation from the mean.
  • 95% of the data falls within ±2 standard deviations from the mean.
  • 99.7% of the data falls within ±3 standard deviations from the mean.
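
A quick way to check this 68-95-99.7 rule is to sample from a normal distribution with NumPy and count the share of data inside each band:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=50, scale=10, size=100_000)  # mean 50, SD 10

mean, sd = data.mean(), data.std()
for k in (1, 2, 3):
    within = np.mean(np.abs(data - mean) <= k * sd)
    print(f"within ±{k} SD: {within:.1%}")
# Prints roughly 68.3%, 95.4%, and 99.7%.
```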

Asymptotic Behavior: The tails of the normal curve extend infinitely in both directions and approach the x-axis asymptotically.

Applications of the Normal Distribution

  • Robotics: Path and motion prediction of autonomous robotic systems.
  • Self-Driving Cars: In localization of vehicles to ensure safe trajectory following.
  • Quality: Normal distributions are used in quality control processes (Six Sigma) to monitor and improve processes.
  • Finance and Economics: used to model returns on investment and assess financial risks; stock returns in particular are often modelled as normally distributed to weigh risk against return.
  • Measurement Errors: Normal distributions model measurement errors in experiments and surveys.

Skewness and Kurtosis

Skewness measures the asymmetry of a probability distribution. It indicates whether the data points tend to cluster more on one side of the mean than the other.

Skewness Measures

Zero Skewness:

  • In a perfectly symmetrical distribution, the left and right sides of the distribution are mirror images.

Positive Skewness:

  • Most data points are concentrated on the left side.
  • Mean > Median > Mode
  • The tail points towards the positive direction on the x-axis.
  • Example: Income distribution, where a few people have very high incomes.

Negative Skewness:

  • Most data points are concentrated on the right side.
  • Mean < Median < Mode
  • The tail points towards the negative direction on the x-axis.
  • Example: Age at retirement, where most people retire around a certain age, but a few retire earlier.

Kurtosis measures the sharpness of a distribution’s peak and the heaviness of its tails. It indicates whether data points are more or less concentrated around the mean and in the tails.

Types of Kurtosis

Platykurtic: Distributions with light tails and a flatter peak.

  • Fewer data points in the tails and around the mean.
  • Less extreme outliers.

Leptokurtic: Distributions with heavy tails and a sharp peak.

  • More data points in the tails and around the mean.
  • More extreme outliers.

Mesokurtic: Distributions with tail behaviour similar to that of the normal distribution.
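
As a sketch, SciPy’s skew and kurtosis functions make these categories visible on simulated data; note that kurtosis reports excess kurtosis by default, so a normal (mesokurtic) distribution scores near 0:

```python
import numpy as np
from scipy.stats import skew, kurtosis

rng = np.random.default_rng(0)
normal_data = rng.normal(size=10_000)      # symmetric and mesokurtic
income_like = rng.lognormal(size=10_000)   # long right tail, like incomes

print(skew(normal_data), kurtosis(normal_data))  # both near 0
print(skew(income_like), kurtosis(income_like))  # positive skew, heavy tails
```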

Practical Applications and Examples

Skewness:

  • Asset returns can show positive or negative skewness, indicating potential asymmetric risk.
  • In manufacturing, skewness in quality measurements can indicate a need for process adjustments.
  • Skewness in environmental data, like pollutant levels, can indicate unusual patterns that need further investigation.

Kurtosis:

  • High kurtosis in asset returns indicates a higher risk of extreme values, important for risk management.
  • High kurtosis in quality measurements suggests more variability and potential for defects.
  • High kurtosis can point to frequent extreme environmental events, affecting policy decisions.

Hypothesis Testing uses sample data to evaluate a hypothesis about a population parameter. It determines whether there is enough evidence to reject a null hypothesis in favour of an alternative hypothesis.

  • Null Hypothesis (H0): A statement that there is no effect or no difference. It is the hypothesis that we seek to test.
  • Alternative Hypothesis (H1): A statement that contradicts the null hypothesis. It represents the effect or difference we suspect or hope to find.
  • Significance Level (α): The probability of rejecting the null hypothesis when it is true. Common choices are 0.05, 0.01, and 0.10.
  • P-value: The probability of obtaining a test statistic at least as extreme as the one observed, given that the null hypothesis is true.
  • Test Statistic: A standardized value calculated from sample data, used to decide whether to reject the null hypothesis.
  • Type I Error: Rejecting the null hypothesis when it is true; its probability is denoted α.
  • Type II Error: Failing to reject the null hypothesis when it is false; its probability is denoted β.
  • Power of the Test: The probability of correctly rejecting the null hypothesis when it is false (1 − β).
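
Putting these terms together, here is a minimal sketch of a one-sample t-test with SciPy; the sample is simulated, and the hypothesized mean of 50 is made up for illustration:

```python
import numpy as np
from scipy.stats import ttest_1samp

# H0: the population mean is 50. H1: it is not (two-sided).
rng = np.random.default_rng(0)
sample = rng.normal(loc=52, scale=10, size=40)

alpha = 0.05                                   # significance level
t_stat, p_value = ttest_1samp(sample, popmean=50)

print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
if p_value < alpha:
    print("Reject H0 at the 5% significance level.")
else:
    print("Fail to reject H0.")
```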

One-Tailed and Two-Tailed Tests

A One-Tailed Test assesses the significance of an observed difference in a specified direction. The rejection region for the null hypothesis (H0) lies entirely in one tail of the distribution, on the side indicated by the alternative hypothesis (H1). This means the probability α of rejecting H0 is concentrated in a single tail, and the p-value is the area in that one tail.

A Two-Tailed Test assesses the significance of an observed difference without regard to direction, placing rejection regions in both tails of the distribution. Each tail has a rejection region of α/2, so the total rejection probability α is split across both tails, and the p-value is the combined area of both tails.
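
A sketch contrasting the two, assuming a reasonably recent SciPy (the alternative parameter of ttest_1samp was added in version 1.6); for an effect in the hypothesized direction, the one-tailed p-value is roughly half the two-tailed one:

```python
import numpy as np
from scipy.stats import ttest_1samp

rng = np.random.default_rng(1)
sample = rng.normal(loc=52, scale=10, size=40)

# Two-tailed: H1 says the mean differs from 50 in either direction.
_, p_two = ttest_1samp(sample, popmean=50, alternative="two-sided")

# One-tailed: H1 says the mean is greater than 50,
# so all of α sits in the right tail.
_, p_one = ttest_1samp(sample, popmean=50, alternative="greater")

print(p_two, p_one)  # p_one ≈ p_two / 2 here
```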
