10 Most Commonly Used Statistical Formulas in Data Science, with Applications and Examples


Statistics plays a crucial role in data science, providing the foundation for understanding, analyzing, and interpreting data. Data scientists use statistical methods to gain insights, make predictions, and make informed decisions based on data.

In data science, there are several commonly used statistical formulas that help in analyzing and interpreting data. Let’s explore the applications of some of the commonly used statistical formulas in data science with examples:

1. Mean (μ): The mean is the average value of a set of numbers. It is calculated by summing all the values and dividing by the total number of values in the dataset.

Formula: μ = (x1 + x2 + … + xn) / n

The mean is widely used to measure the central tendency of a dataset. For example, in a survey, the mean can be calculated to determine the average age of the participants.

Example: Suppose we have a dataset of ages: [25, 30, 35, 40, 45]. The mean age can be calculated as follows:

μ = (25 + 30 + 35 + 40 + 45) / 5 = 175 / 5 = 35
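
A quick way to verify this by hand is with a few lines of Python (standard library only):

```python
# Mean: sum the values and divide by the count.
ages = [25, 30, 35, 40, 45]
mean_age = sum(ages) / len(ages)
print(mean_age)  # 35.0
```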

2. Standard Deviation (σ): The standard deviation measures the dispersion or spread of a dataset around the mean. It quantifies the average amount by which data points deviate from the mean.

Formula: σ = √(Σ(xi - μ)² / n)

Example: Let’s consider a dataset of exam scores: [80, 85, 90, 95, 100]. The standard deviation can be calculated as follows:

μ = (80 + 85 + 90 + 95 + 100) / 5 = 450 / 5 = 90

σ = √(((80-90)² + (85-90)² + (90-90)² + (95-90)² + (100-90)²) / 5) = √(250 / 5) = √50 ≈ 7.07
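
The same computation in Python, using the population formula (dividing by n) as above:

```python
import math

scores = [80, 85, 90, 95, 100]
mu = sum(scores) / len(scores)
# Population standard deviation: mean squared deviation, then square root.
sigma = math.sqrt(sum((x - mu) ** 2 for x in scores) / len(scores))
print(round(sigma, 2))  # 7.07
```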

3. Correlation (ρ): Correlation measures the strength and direction of the linear relationship between two variables. Its value always lies between -1 and +1, where -1 indicates a perfect negative correlation, +1 indicates a perfect positive correlation, and 0 indicates no linear correlation.

Formula: ρ = cov(x, y) / (σx * σy)

Example: Suppose we have two variables: study hours and exam scores for a group of students. By calculating the correlation coefficient, we can determine how closely the study hours and exam scores are related.

Let’s say the study hours and exam scores for five students are as follows:

Study Hours: [3, 4, 6, 5, 7]
Exam Scores: [70, 80, 90, 85, 95]

Calculating the correlation coefficient:

ρ = cov(x, y) / (σx * σy)

First, the means: μx = (3 + 4 + 6 + 5 + 7) / 5 = 5 and μy = (70 + 80 + 90 + 85 + 95) / 5 = 84.

σx = √(((3-5)² + (4-5)² + (6-5)² + (5-5)² + (7-5)²) / 5) = √(10 / 5) = √2 ≈ 1.41

σy = √(((70-84)² + (80-84)² + (90-84)² + (85-84)² + (95-84)²) / 5) = √(370 / 5) = √74 ≈ 8.60

cov(x, y) = ((3-5)(70-84) + (4-5)(80-84) + (6-5)(90-84) + (5-5)(85-84) + (7-5)(95-84)) / 5 = (28 + 4 + 6 + 0 + 22) / 5 = 12

ρ = 12 / (1.41 * 8.60) ≈ 0.99

The correlation coefficient of approximately 0.99 indicates a strong positive (direct) relationship between study hours and exam scores.
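
To sanity-check the arithmetic, NumPy's corrcoef computes the same Pearson correlation directly:

```python
import numpy as np

hours = np.array([3, 4, 6, 5, 7])
scores = np.array([70, 80, 90, 85, 95])
# np.corrcoef returns a 2x2 correlation matrix; the off-diagonal
# entry is the Pearson correlation between the two variables.
rho = np.corrcoef(hours, scores)[0, 1]
print(round(rho, 2))  # 0.99
```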

4. Hypothesis Testing (t-test): A t-test is used to determine whether there is a significant difference between the means of two groups. It compares the means while accounting for the variability within each group.

Formula: t = (x̄1 - x̄2) / √((s1² / n1) + (s2² / n2)), where x̄1 and x̄2 are the sample means, s1² and s2² the sample variances, and n1 and n2 the sample sizes.

Example: Consider two groups of students: Group A and Group B. We want to test if there is a significant difference in their average test scores.

Group A Scores: [80, 85, 90, 95, 100]
Group B Scores: [75, 78, 85, 88, 92]

We can perform a two-sample independent t-test to compare the means of the two groups and assess whether the difference is statistically significant.
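
In practice this is a one-liner with SciPy. A minimal sketch using Welch's t-test (equal_var=False), which matches the formula above because it does not assume equal variances:

```python
from scipy import stats

group_a = [80, 85, 90, 95, 100]
group_b = [75, 78, 85, 88, 92]
# Welch's two-sample t-test: returns the t statistic and the p-value.
t_stat, p_value = stats.ttest_ind(group_a, group_b, equal_var=False)
print(t_stat, p_value)
```

If the p-value falls below the chosen significance level (commonly 0.05), we reject the null hypothesis that the two group means are equal.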

5. Logistic Regression: Logistic regression is used for binary classification problems. It models the relationship between the predictor variables and the probability of belonging to a particular class using the logistic function.

Formula: p = 1 / (1 + e^-(β0 + β1 * x1 + β2 * x2 + … + βn * xn))

The model outputs a probability between 0 and 1, which is then thresholded to predict the binary outcome (e.g., yes/no, true/false).

Example: Let’s consider a scenario where we want to predict if a customer will churn or not based on their demographics and behavior data. Logistic regression can be used to model the relationship between the predictor variables (e.g., age, gender, purchase history) and the probability of churn (binary outcome).

By fitting a logistic regression model, we can estimate the coefficients (β0, β1, β2, …) for the predictor variables and calculate the predicted probabilities of churn for new customers.
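
A minimal sketch with scikit-learn. The features and labels here are made up purely for illustration (say, [age, purchases last month]); a real churn model would be trained on the full demographics and behavior data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: two features per customer, label 1 = churned.
X = np.array([[25, 10], [45, 2], [35, 8], [52, 1], [23, 12], [48, 3]])
y = np.array([0, 1, 0, 1, 0, 1])

model = LogisticRegression().fit(X, y)
print(model.intercept_, model.coef_)         # estimated β0 and β1, β2
print(model.predict_proba([[30, 5]])[0, 1])  # predicted churn probability
```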

6. Variance (σ²): The variance is the square of the standard deviation. It provides a measure of the variability of the dataset.

Formula: σ² = Σ(xi - μ)² / n

Example: Consider a dataset of exam scores: [80, 85, 90, 95, 100]. We have already calculated the mean (μ) and standard deviation (σ) in a previous example. The variance can be calculated as follows:

σ² = ((80-90)² + (85-90)² + (90-90)² + (95-90)² + (100-90)²) / 5 = 250 / 5 = 50
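
NumPy gives the same result; note that np.var divides by n by default, matching the population formula used here:

```python
import numpy as np

scores = np.array([80, 85, 90, 95, 100])
print(np.var(scores))  # 50.0 (population variance, ddof=0 by default)
```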

7. Covariance (cov): Covariance measures how two variables vary together. A positive covariance indicates a direct relationship (the variables tend to increase together), while a negative covariance indicates an inverse relationship.

Formula: cov(x, y) = Σ((xi - μx)(yi - μy)) / n

Example: Suppose we have two variables: the number of hours studied and the corresponding exam scores for a group of students. The dataset is as follows:

Study Hours: [2, 3, 4, 5, 6]
Exam Scores: [70, 75, 80, 85, 90]

The covariance can be calculated as follows:

With μx = 4 and μy = 80:

cov(x, y) = ((2-4)(70-80) + (3-4)(75-80) + (4-4)(80-80) + (5-4)(85-80) + (6-4)(90-80)) / 5 = (20 + 5 + 0 + 5 + 20) / 5 = 10

The positive covariance (10) confirms the direct relationship: exam scores rise as study hours increase.
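
In NumPy, np.cov with bias=True divides by n as in the formula above (the default divides by n - 1):

```python
import numpy as np

hours = np.array([2, 3, 4, 5, 6])
scores = np.array([70, 75, 80, 85, 90])
# np.cov returns the 2x2 covariance matrix; the off-diagonal entry is cov(x, y).
print(np.cov(hours, scores, bias=True)[0, 1])  # 10.0
```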

8. Z-score: The z-score measures the number of standard deviations a data point is from the mean. It is used to standardize values and compare them across different datasets.

Formula: z = (x - μ) / σ

Example: Suppose we have a dataset of exam scores: [80, 85, 90, 95, 100]. We have already calculated the mean (μ = 90) and standard deviation (σ ≈ 7.07) for this dataset. For the score 85:

z = (85 - 90) / 7.07 ≈ -0.71

That is, a score of 85 lies about 0.71 standard deviations below the mean.
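
SciPy can standardize the whole dataset at once (stats.zscore uses the population standard deviation, ddof=0, by default):

```python
from scipy import stats

scores = [80, 85, 90, 95, 100]
print(stats.zscore(scores))  # [-1.41 -0.71  0.    0.71  1.41] (approximately)
```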

9. Confidence Interval: A confidence interval gives a range of values within which a population parameter is estimated to lie, quantifying the uncertainty around an estimated statistic.

Formula: CI = [x̄ - z * (s / √n), x̄ + z * (s / √n)], where x̄ is the sample mean, s the sample standard deviation, and n the sample size.

Example: Let’s say we have collected a sample of test scores from a larger population. Suppose the sample mean is 85, and the sample standard deviation is 10. We want to calculate a 95% confidence interval for the population mean.

Using the formula for a confidence interval with n = 100 and z = 1.96 (the z-score for 95% confidence):

CI = [85 - 1.96 * (10 / √100), 85 + 1.96 * (10 / √100)] = [85 - 1.96, 85 + 1.96] = [83.04, 86.96]

So we are 95% confident that the population mean lies between 83.04 and 86.96.
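
The same interval in a few lines of Python:

```python
import math

mean, sd, n, z = 85, 10, 100, 1.96   # 95% confidence level
margin = z * sd / math.sqrt(n)       # 1.96 * 10 / 10 = 1.96
print(mean - margin, mean + margin)  # 83.04 86.96
```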

10. Simple Linear Regression: Simple linear regression is used to model the relationship between two variables by fitting a straight line. It estimates the slope (β1) and intercept (β0) of the line.

Formula: y = β0 + β1 * x

Example: Consider a dataset that relates the number of hours studied to the corresponding exam scores. The dataset is as follows:

Study Hours: [2, 3, 4, 5, 6]
Exam Scores: [70, 75, 80, 85, 90]

By fitting a simple linear regression model, we can estimate the slope (β1) and intercept (β0) of the line that best represents the relationship between study hours and exam scores.

The regression equation would be:

Exam Scores = β0 + β1 * Study Hours

The values of β0 and β1 can be estimated using regression techniques, allowing us to make predictions for exam scores based on the number of study hours.
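
A minimal sketch using NumPy's least-squares polynomial fit (degree 1 fits a straight line):

```python
import numpy as np

hours = np.array([2, 3, 4, 5, 6])
scores = np.array([70, 75, 80, 85, 90])
# polyfit returns the coefficients highest degree first: [β1, β0].
beta1, beta0 = np.polyfit(hours, scores, 1)
print(beta0, beta1)  # 60.0, 5.0
```

For this dataset the fit is exact: Exam Score = 60 + 5 * Study Hours, so each additional hour of study is associated with five more points.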

These examples illustrate the applications of the most commonly used statistical formulas in data science. Each formula serves a specific purpose and is chosen based on the nature of the data and the analysis objectives; depending on the specific problem and analysis goals, other statistical formulas and techniques may also apply.
