Statistics in Brief for Aspiring Data Scientists — Part II

Simple Statistics to Kickstart Your Data Science Journey

Kamireddy Mahendra
ILLUMINATION’S MIRROR
5 min read · Mar 10, 2024


“The most underestimated thing you can do to excel in any field is to build a strong foundation in the fundamentals. So keep strengthening your fundamentals.”

Before getting into this article, I hope you have already read Part I. If not, kindly go through it first; it sets the context for the end-to-end statistics that are helpful in data science.

This article is a continuation of Part I. There we covered what to learn in descriptive statistics; here we cover, in a precise way, what to know about inferential statistics.

ii. Inferential Statistics

Inferential statistics are used to draw conclusions about a large population from a sample of data, by making estimates and testing hypotheses with the analytical methods I will share in this article.

a. Hypothesis Testing
b. Regression Analysis
c. Cluster Analysis
d. Time Series Analysis
e. Confidence Interval
f. Correlation Analysis

a. Hypothesis Testing:

Hypothesis testing is a common statistical technique that we use in data science to support or reject a claim about the data. It tells us how likely an apparent effect in a random sample is to have occurred by chance alone.

There are a few main types of hypothesis tests commonly used in data science; a minimal sketch of a two-sample t-test follows the list below.

  • T-Tests (one-sample, two-sample): compare a sample mean with a known value, or the means of two groups.
  • ANOVA (Analysis of Variance): compares the means across three or more groups.
  • Chi-square tests: test for association between categorical variables.
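
Below is a minimal sketch of a two-sample t-test using SciPy. The groups, the sample sizes, and the 5% significance level are illustrative assumptions for the example, not values from the article.

```python
# A minimal two-sample t-test sketch with SciPy (illustrative data).
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
group_a = rng.normal(loc=50, scale=5, size=30)  # e.g., test scores of group A
group_b = rng.normal(loc=53, scale=5, size=30)  # e.g., test scores of group B

# Null hypothesis: the two groups have the same mean.
t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")

if p_value < 0.05:
    print("Reject the null hypothesis: the group means differ.")
else:
    print("Fail to reject the null hypothesis.")
```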

b. Regression Analysis:

  • Linear Regression
  • Logistic Regression
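
As a quick illustration of linear regression, here is a minimal scikit-learn sketch on synthetic data; the slope, intercept, and noise level are made up for the example.

```python
# A minimal linear regression sketch with scikit-learn (synthetic data).
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))              # one independent variable
y = 3.0 * X.ravel() + 2.0 + rng.normal(0, 1, 100)  # linear trend plus noise

model = LinearRegression().fit(X, y)
print("estimated slope:", model.coef_[0])
print("estimated intercept:", model.intercept_)
print("prediction at x = 5:", model.predict([[5.0]])[0])
```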

c. Cluster Analysis:

  • Hierarchical Clustering
  • Partitioning Clustering (e.g., K-means)
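
Here is a minimal sketch of partitioning clustering with K-means; the two synthetic clusters and their centers are assumptions made only to illustrate the idea.

```python
# A minimal K-means clustering sketch with scikit-learn (synthetic data).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
cluster_1 = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(50, 2))  # points near (0, 0)
cluster_2 = rng.normal(loc=[5.0, 5.0], scale=0.5, size=(50, 2))  # points near (5, 5)
X = np.vstack([cluster_1, cluster_2])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("cluster centers:\n", kmeans.cluster_centers_)
print("labels of the first five points:", kmeans.labels_[:5])
```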

d. Time Series Analysis:

It is a specific way of analyzing a sequence of data points collected over an interval of time. In time series analysis, analysts record data points at regular intervals over a period rather than sampling them at random times.
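
For instance, here is a minimal pandas sketch that takes daily values and summarizes them over regular intervals; the dates and values are synthetic, used only to show the idea.

```python
# A minimal time series sketch with pandas: monthly aggregation and a 7-day moving average.
import numpy as np
import pandas as pd

dates = pd.date_range("2024-01-01", periods=90, freq="D")  # daily timestamps
values = np.linspace(100, 120, 90) + np.random.default_rng(2).normal(0, 2, 90)
series = pd.Series(values, index=dates)

monthly_mean = series.resample("MS").mean()   # aggregate daily points to monthly averages
rolling_7d = series.rolling(window=7).mean()  # smooth with a 7-day moving average
print(monthly_mean)
print(rolling_7d.tail())
```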

e. Confidence Interval:

A confidence interval gives the range of values within which a population parameter, such as the mean, is likely to fall at a stated confidence level. It helps us quantify the degree of certainty in a sampling method.
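
As a small example, here is how a 95% confidence interval for a sample mean can be computed with SciPy; the sample itself is synthetic.

```python
# A minimal 95% confidence interval sketch for a sample mean (SciPy).
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
sample = rng.normal(loc=170, scale=10, size=40)  # e.g., heights of 40 people

mean = sample.mean()
sem = stats.sem(sample)                          # standard error of the mean
# t-interval with n - 1 degrees of freedom, centered on the sample mean.
low, high = stats.t.interval(0.95, len(sample) - 1, loc=mean, scale=sem)
print(f"mean = {mean:.2f}, 95% CI = ({low:.2f}, {high:.2f})")
```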

f. Correlation Analysis:

It mainly focuses on finding whether a relationship exists between variables and then determining the magnitude and direction of that relationship. There are three possibilities: a positive correlation, a negative correlation, or no correlation.
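
Here is a minimal correlation sketch with pandas; the column names and the relationships between them are invented for illustration.

```python
# A minimal correlation analysis sketch with pandas (synthetic data).
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
hours_studied = rng.uniform(0, 10, 50)
exam_score = 50 + 4 * hours_studied + rng.normal(0, 5, 50)  # positive relationship
hours_gaming = rng.uniform(0, 10, 50)                       # unrelated variable

df = pd.DataFrame({"hours_studied": hours_studied,
                   "exam_score": exam_score,
                   "hours_gaming": hours_gaming})

# Pearson correlation matrix: values near +1 or -1 indicate strong positive or
# negative correlation; values near 0 indicate little or no linear correlation.
print(df.corr())
```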

Therefore, descriptive statistics provide a summary of the main features of a dataset, while inferential statistics involve making predictions or inferences about a population based on sample data.

There are a few other important concepts, closely related to statistics, that data scientists use regularly, as mentioned below.

  • Probability Distributions
  • Machine Learning Metrics

Probability Distributions:

Probability distributions play a crucial role in statistics and data science, providing models for the likelihood of different outcomes. Here are a few major types of probability distributions that are frequently used by data scientists:

a. Normal Distribution (Gaussian Distribution):

A symmetric, bell-shaped distribution characterized by its mean and standard deviation. Many natural phenomena, such as heights and test scores, follow a normal distribution. The mean and standard deviation are the parameters of this distribution.

b. Binomial Distribution:

Models the number of successes in a fixed number of independent Bernoulli trials. The number of trials and probability of success in each trial are used as parameters in this distribution.

c. Poisson Distribution:

Models the number of events occurring in a fixed interval of time or space, given a known average rate of occurrence. The average rate of occurrence is the parameter used in this distribution.

d. Exponential Distribution:

Models the time between events in a Poisson process, where events occur continuously and independently at a constant rate. The rate is the parameter of this distribution.
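
To make the parameters concrete, here is a minimal NumPy sketch that draws samples from each of these four distributions; the parameter values are arbitrary choices for the example.

```python
# A minimal sketch: sampling from the four distributions with NumPy.
import numpy as np

rng = np.random.default_rng(5)

normal = rng.normal(loc=0, scale=1, size=1000)         # mean 0, standard deviation 1
binomial = rng.binomial(n=10, p=0.5, size=1000)        # 10 trials, success probability 0.5
poisson = rng.poisson(lam=3, size=1000)                # average rate of 3 events per interval
exponential = rng.exponential(scale=1 / 3, size=1000)  # rate 3, so scale = 1 / rate

for name, sample in [("normal", normal), ("binomial", binomial),
                     ("poisson", poisson), ("exponential", exponential)]:
    print(f"{name}: mean = {sample.mean():.2f}, variance = {sample.var():.2f}")
```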

These probability distributions serve as foundational tools in statistical modeling and are extensively applied in various data science tasks, including hypothesis testing, regression analysis, and machine learning. The choice of distribution depends on the characteristics of the data and the assumptions underlying the analysis.

As data scientists, we use a variety of metrics to evaluate the performance of machine learning models across different tasks such as classification, regression, and clustering. The choice of metrics depends on the specific problem and the goals of the analysis, as discussed below.

Machine Learning Metrics:

Regression Metrics:

  • Mean Absolute Error (MAE)
    The average absolute difference between predicted and actual values.

MAE = (1/n) * Σ |Actual - Predicted|

  • Mean Squared Error (MSE)
    The average squared difference between predicted and actual values.

MSE = (1/n) * Σ (Actual - Predicted)²

  • Root Mean Squared Error (RMSE)
    The square root of the MSE provides a measure in the same units as the target variable.

RMSE=√MSE

  • R-squared (Coefficient of Determination)
    Measures the proportion of the variance in the dependent variable that is predictable from the independent variables.

R-squared = 1 - (SSR / SST)

where SSR is the sum of squared residuals and SST is the total sum of squares.
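
Here is a minimal sketch computing these regression metrics with scikit-learn; the actual and predicted values are made up for the example.

```python
# A minimal sketch of the regression metrics above (scikit-learn).
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

actual = np.array([3.0, 5.0, 7.5, 10.0, 12.0])
predicted = np.array([2.8, 5.4, 7.0, 10.5, 11.5])

mae = mean_absolute_error(actual, predicted)
mse = mean_squared_error(actual, predicted)
rmse = np.sqrt(mse)               # same units as the target variable
r2 = r2_score(actual, predicted)  # 1 - SSR / SST

print(f"MAE = {mae:.3f}, MSE = {mse:.3f}, RMSE = {rmse:.3f}, R-squared = {r2:.3f}")
```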

These metrics provide a comprehensive understanding of the model’s performance in various machine-learning tasks. The selection of metrics depends on the specific goals and characteristics of the problem at hand.

Confusion Matrix:

A confusion matrix is used to describe the performance of a classification model in machine learning. It gives a visualization or summary of how a classification algorithm performs. A confusion matrix is shown below, and a short scikit-learn sketch computing the related metrics follows the formulas.

Confusion matrix (image designed by the author).

  • Accuracy
    The proportion of correctly classified instances out of the total instances.

Accuracy=(True Positives + True Negatives) / Total

  • Precision
    The ratio of true positive predictions to the total predicted positives; it emphasizes the accuracy of positive predictions.

Precision=True Positives / (True Positives + False Positives)

  • Recall (Sensitivity or True Positive Rate):
    The ratio of true positive predictions to the total actual positives; it highlights the ability to capture all positive instances.

Recall=True Positives / (True Positives + False Negatives)

  • F1 Score:
    The harmonic mean of precision and recall; it provides a balanced measure of a model’s performance.

F1 Score=2 * (Precision * Recall) / (Precision + Recall)
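
Here is a minimal sketch computing the confusion matrix and the metrics above with scikit-learn; the true and predicted labels are illustrative only.

```python
# A minimal sketch of the classification metrics above (scikit-learn).
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]  # actual labels
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]  # model predictions

print("confusion matrix:\n", confusion_matrix(y_true, y_pred))
print("accuracy: ", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall:   ", recall_score(y_true, y_pred))
print("f1 score: ", f1_score(y_true, y_pred))
```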

I hope these Part I and Part II articles serve as a mind map of the important concepts you need to learn to become a data scientist.

If this article helped you, bring your hands together for a resounding clap; your support and encouragement motivate me to share even more valuable content in the future.

Follow me and subscribe to catch any updates from me instantly.

Thank you :)

Reference: Data analytics with Python.
