Statistics for Data Science

Wibowo Tangara
Feb 12, 2024


Introduction

In the realm of data science, statistics reigns supreme. It is the cornerstone of every data scientist’s journey, shaping each analytical endeavour from inception to fruition. Before delving into algorithms and predictive modelling, one must first carry out exploratory data analysis (EDA): examining the data through the lens of statistical techniques. By peering beneath the surface, data scientists glean insights into the underlying distribution patterns, laying a solid foundation for subsequent analysis and interpretation.

What is Descriptive Statistics?

In the expansive domain of data science, descriptive statistics serves as a fundamental pillar, offering insights and elucidating patterns within datasets. Essentially, descriptive statistics encompasses a set of techniques and methodologies employed to summarize, visualize, and comprehend the characteristics of data. It provides a comprehensive snapshot of key features, including central tendency, variability, distribution, and shape, enabling data scientists to extract actionable insights and make informed decisions.

At its core, descriptive statistics aims to answer fundamental questions about the data, such as:

  • What is the typical value or central tendency of the dataset?
  • How much does the data vary or spread out around the central value?
  • What is the shape or distribution of the data?
  • Are there any notable patterns or trends within the data?

By leveraging descriptive statistics, data scientists can gain a deeper understanding of their datasets, identify outliers or anomalies, and uncover meaningful relationships between variables. These insights serve as a foundation for further analysis, guiding the development of predictive models, and informing data-driven decision-making processes across various industries and domains.

Descriptive statistics encompasses a variety of tools and measures that help to summarize and describe the characteristics of a dataset. Some examples are:

Measures of Central Tendency:

  • Mean: The average value of a dataset, calculated by summing all values and dividing by the number of observations.
  • Median: The middle value of a dataset when arranged in ascending order. It represents the value below and above which 50% of the data lies.
  • Mode: The most frequently occurring value(s) in a dataset.

Measures of Variability:

  • Range: The difference between the maximum and minimum values in a dataset, providing an indication of the spread of data.
  • Standard Deviation: A measure of the dispersion of data points around the mean. It quantifies the average deviation from the mean.
  • Variance: The average of the squared differences from the mean. It provides a measure of how much the values in a dataset differ from the mean.

Measures of Distribution:

  • Histogram: A graphical representation of the distribution of numerical data. It divides the data into bins and displays the frequency of observations within each bin.
  • Box Plot (Box-and-Whisker Plot): A graphical summary of the distribution of a dataset through five key summary statistics: minimum, first quartile (Q1), median, third quartile (Q3), and maximum.
  • Frequency Table: A tabular representation of the frequency of values or categories in a dataset.

Measures of Shape:

  • Skewness: A measure of the asymmetry of the distribution of values in a dataset. Positive skewness indicates a right-skewed distribution, while negative skewness indicates a left-skewed distribution.
  • Kurtosis: A measure of the peakedness or flatness of the distribution of values in a dataset, which in practice mainly reflects how heavy its tails are. It quantifies whether the distribution is more or less peaked (and heavier- or lighter-tailed) than a normal distribution.

These examples illustrate the diverse array of descriptive statistics that data analysts and scientists utilize to gain insights into datasets and communicate their findings effectively.
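Many of these summary measures can be produced in a single call with pandas. The following is a minimal sketch, assuming a small made-up list of exam scores rather than any particular dataset.

```python
import pandas as pd

# Small, made-up sample of exam scores used purely for illustration
scores = pd.Series([75, 82, 88, 92, 92, 92, 85, 88, 75, 92])

print(scores.describe())                 # count, mean, std, min, quartiles, max
print("Mode:", scores.mode().tolist())   # most frequent value(s)
print("Skewness:", scores.skew())        # asymmetry of the distribution
print("Kurtosis:", scores.kurtosis())    # excess kurtosis relative to a normal curve
```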

When looking at data, the first step of your statistical analysis is to determine whether the dataset you’re dealing with is a population or a sample. The terms “population” and “sample” refer to distinct concepts that play crucial roles in data analysis:

Population:

Population refers to the entire group of individuals, items, or elements that share a common characteristic and are of interest to the researcher. It represents the full set of observations that the researcher aims to study and draw conclusions about. In practice, populations are hard to define and analyse: it is easy to miss values when studying a population, which distorts the analysis, and analysing an entire population is usually expensive and time-consuming. For example, if a researcher is interested in studying the average income of all households in a country, the population would consist of every household within that country.

Sample:

A sample is a subset of the population selected for study. It is chosen in such a way that it represents the characteristics of the population from which it is drawn. Samples are used in research and data analysis because it is often impractical or impossible to study an entire population due to factors such as time, cost, and feasibility. By analysing a representative sample, researchers can make inferences and draw conclusions about the population as a whole. For example, instead of surveying every household in a country to determine average income, researchers might select a random sample of households and use their data to estimate the average income for the entire population.

While the population represents the entire group being studied, the sample is a smaller subset of that population that is selected for analysis. The goal of sampling is to obtain a sample that accurately represents the population, allowing researchers to make valid inferences about the population based on the characteristics of the sample.

There are several common sampling techniques used to obtain a representative sample from a population. Each technique has its own advantages and is suitable for different scenarios. Here are some usual sampling techniques:

Simple Random Sampling:

  • In simple random sampling, every individual in the population has an equal chance of being selected for the sample.
  • This technique is straightforward and unbiased, making it ideal when the population is homogeneous and easily accessible.
  • For example, imagine assigning a unique number to each individual in the population and then using a random number generator to select the sample.

Stratified Random Sampling:

  • Stratified random sampling involves dividing the population into distinct subgroups, or strata, based on certain characteristics (e.g., age, gender, income level).
  • A random sample is then drawn from each stratum in proportion to its size within the population.
  • This technique ensures that each subgroup is represented in the sample, making it useful when the population exhibits significant variability across different strata.
  • For example, if studying a population of students, the population could be stratified by grade level, and then a random sample of students could be selected from each grade.

Systematic Sampling:

  • Systematic sampling involves selecting every nth individual from the population after randomly selecting a starting point.
  • This technique is efficient and easy to implement, making it suitable for large populations where a complete list of individuals is available.
  • However, it may introduce bias if there is a periodic pattern in the population.
  • For example, in a population of 1000 individuals, every 10th person could be selected after randomly selecting a starting point between 1 and 10.

Cluster Sampling:

  • Cluster sampling involves dividing the population into clusters (e.g., geographical areas, schools, households) and then randomly selecting a few clusters to include in the sample.
  • All individuals within the selected clusters are then included in the sample.
  • This technique is efficient and cost-effective when it is difficult or impractical to obtain a complete list of individuals in the population.
  • For example, if studying household income in a city, clusters could be defined as neighbourhoods, and a random sample of neighbourhoods could be selected for inclusion in the study.

These are just a few examples of common sampling techniques used in research and data analysis. The choice of sampling technique depends on various factors, including the characteristics of the population, the research objectives, and practical considerations such as time and resources.
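As a rough illustration of how the first three techniques might look in code, here is a minimal sketch using pandas. The population of 1,000 people, the grade column, and the sample sizes are all invented for the example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Hypothetical population of 1,000 people with a grade-level attribute
population = pd.DataFrame({
    "person_id": np.arange(1, 1001),
    "grade": rng.choice(["9th", "10th", "11th", "12th"], size=1000),
})

# Simple random sampling: every person has an equal chance of selection
simple_sample = population.sample(n=100, random_state=42)

# Stratified random sampling: draw 10% from each grade level
stratified_sample = population.groupby("grade").sample(frac=0.10, random_state=42)

# Systematic sampling: every 10th person after a random start between 0 and 9
start = int(rng.integers(0, 10))
systematic_sample = population.iloc[start::10]

print(len(simple_sample), len(stratified_sample), len(systematic_sample))
```

Cluster sampling would follow the same pattern: randomly pick a few cluster labels (for example, neighbourhoods) and keep every row belonging to the chosen clusters.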

Types of Data

In a dataset, data can be categorized into different types based on their nature and characteristics. These types of data play a crucial role in determining the appropriate statistical analysis methods. Here are the main types of data:

Categorical Data:

Categorical data represent characteristics or attributes and can be divided into distinct categories. It includes data such as:

  • Nominal Data: Categories without any inherent order or ranking. Each category is distinct, and there is no implied order among them. Example: colours of cars (red, blue, green).
  • Ordinal Data: Categories with a specific order or rank. The intervals between categories may not be uniform, but there is a clear sequence. Example: educational attainment levels (high school diploma, bachelor’s degree, master’s degree).

Numerical Data:

Numerical data represent quantities or measurements and can be further divided into two subtypes:

  • Discrete Data: Data that can only take on specific, separate values. These values are typically whole numbers and cannot be broken down into smaller parts. Example: the number of cars in a parking lot.
  • Continuous Data: Data that can take any real or fractional value within a range. These values are infinitely divisible and can take on any value within a given interval. Example: temperature, height, weight.

Understanding the type of data in a dataset is essential for choosing appropriate statistical techniques for analysis. Categorical data may require methods such as chi-square tests or logistic regression, while numerical data may be analysed using techniques such as correlation analysis or linear regression. Additionally, data type considerations influence data visualization techniques and data preprocessing steps.
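In pandas, these data types map naturally onto column dtypes. The sketch below uses a small hypothetical table of cars; the column names and values are invented for illustration.

```python
import pandas as pd

# Hypothetical dataset mixing categorical and numerical columns
cars = pd.DataFrame({
    "colour": ["red", "blue", "green", "blue"],           # nominal (categorical)
    "condition": ["good", "excellent", "fair", "good"],   # ordinal (categorical)
    "num_doors": [2, 4, 4, 2],                            # discrete numerical
    "weight_kg": [1450.5, 1620.0, 1388.2, 1502.7],        # continuous numerical
})

# Declare the ordinal column with an explicit category order
cars["condition"] = pd.Categorical(
    cars["condition"], categories=["fair", "good", "excellent"], ordered=True
)

print(cars.dtypes)
print(cars.select_dtypes(include="number").columns.tolist())  # numerical columns only
```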

Levels of Measurement

Levels of measurement, also known as scales of measurement, refer to the different ways in which variables can be classified and measured in statistical analysis. These levels provide insights into the properties of the data and determine which statistical techniques are appropriate for analysis. There are four main levels of measurement: two for qualitative data and two for quantitative data.

Qualitative data is information that characterizes attributes in the data but does not measure them. It can be divided into two types: nominal and ordinal.

Nominal Level:

  • At the nominal level of measurement, data are categorized into distinct, unordered categories or groups.
  • Nominal data represent qualitative attributes or characteristics without any inherent order or ranking.

Examples include:

  • Categories of colours (red, blue, green).
  • Types of cars (sedan, SUV, truck).
  • Gender (male, female, other).

Nominal data can be represented using labels or codes, but mathematical operations such as addition or subtraction are not meaningful.

Ordinal Level:

  • At the ordinal level of measurement, data are ranked or ordered according to some criterion, but the intervals between categories may not be uniform.
  • Ordinal data represent qualitative or categorical variables with a specific order or rank.

Examples include:

  • Educational attainment levels (high school diploma, bachelor’s degree, master’s degree).
  • Survey responses (strongly agree, agree, neutral, disagree, strongly disagree).
  • Socioeconomic status (low, middle, high).

While ordinal data convey information about relative rankings, the differences between categories are not necessarily equal or meaningful.

Quantitative data measures attributes in the data. It can be divided into two groups: interval and ratio.

Interval Level:

  • At the interval level of measurement, data are measured on a scale where the intervals between values are equal and meaningful, but there is no true zero point.
  • Interval data represent quantitative variables where arithmetic operations such as addition and subtraction are meaningful, but multiplication and division are not.

Examples include:

  • Temperature measured in Celsius or Fahrenheit.
  • Calendar dates (e.g., years, months, days).

Interval data allow for comparisons of the magnitude of differences between values, but ratios between values are not meaningful due to the absence of a true zero point.

Ratio Level:

  • At the ratio level of measurement, data are measured on a scale where both the intervals between values and the presence of a true zero point are meaningful.
  • Ratio data represent quantitative variables where arithmetic operations such as addition, subtraction, multiplication, and division are all meaningful.

Examples include:

  • Height, weight, and length measurements.
  • Counts of objects or events.
  • Time durations.

Ratio data allow for meaningful comparisons of both differences and ratios between values, making them the most versatile and informative level of measurement.

Understanding the level of measurement of variables in a dataset is crucial for selecting appropriate statistical analyses and interpreting the results accurately. Different levels of measurement require different statistical techniques, and using inappropriate methods can lead to misleading conclusions. Therefore, it is essential to consider the nature of the data and the objectives of the analysis when determining the appropriate level of measurement.

Measure of Central Tendency

Measures of central tendency are statistical measures that provide insights into the central or typical value of a dataset. They summarize the distribution of data by identifying the central point around which the data tend to cluster. The three main measures of central tendency are the mean, median, and mode.

Mean:

  • The mean, also known as the arithmetic average, is the sum of all values in a dataset divided by the total number of values.
  • It is calculated by adding up all the values and then dividing by the number of values.
  • Formula: Mean = (Sum of all values) / (Number of values)
  • The mean is sensitive to extreme values, also known as outliers, and can be heavily influenced by them.
  • It is commonly used when the distribution of data is approximately symmetrical and follows a normal distribution.

Example: Consider the following dataset representing the scores of students in a class test: 75, 82, 88, 92, 92, 92, 85, 88, 75, 92.

  • Sum of all values = 75 + 82 + 88 + 92 + 92 + 92 + 85 + 88 + 75 + 92 = 861
  • Number of values = 10
  • Mean = 861 / 10 = 86.1

Median:

  • The median is the middle value in a sorted list of data. If the dataset has an odd number of values, the median is the middle value. If the dataset has an even number of values, the median is the average of the two middle values.
  • It is less sensitive to outliers compared to the mean, making it a robust measure of central tendency.
  • The median is suitable for datasets with skewed distributions or when extreme values are present.

Formula:

  • For an odd number of values: Median = middle value
  • For an even number of values: Median = (lower middle value + upper middle value) / 2

Example:

  • Using the same dataset as above, after sorting the values: 75, 75, 82, 85, 88, 88, 92, 92, 92, 92.
  • As the number of values is even (10), the median is the average of the two middle values, which are 88 and 88.
  • Median = (88 + 88) / 2 = 88

Mode:

  • The mode is the value that occurs most frequently in a dataset.
  • Unlike the mean and median, which are calculated based on numerical values, the mode can be applied to both numerical and categorical data.
  • A dataset may have one mode (unimodal), two modes (bimodal), or more than two modes (multimodal).
  • The mode is useful for identifying the most common or typical value in a dataset.

Formula:

  • Mode = Value with the highest frequency

Example: Using the same dataset, we observe that the value 92 occurs most frequently (four times).

  • Mode = 92
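All three measures can be checked quickly in code. The sketch below uses Python’s built-in statistics module on the same class-test scores as the worked examples above.

```python
import statistics

# Class-test scores from the worked examples above
scores = [75, 82, 88, 92, 92, 92, 85, 88, 75, 92]

print(statistics.mean(scores))    # 86.1
print(statistics.median(scores))  # 88.0 (average of the two middle values)
print(statistics.mode(scores))    # 92 (occurs four times)
```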

Each measure of central tendency has its strengths and weaknesses, and the choice of which to use depends on the characteristics of the dataset and the research question being addressed. While the mean provides a measure of the average value, the median offers a robust measure less influenced by extreme values, and the mode identifies the most frequently occurring value. In practice, a combination of these measures is often used to gain a comprehensive understanding of the central tendency of the data.

These measures should be used together rather than independently. No single measure is best, and relying on only one is not advisable. Moreover, in a normal distribution, all three fall at the same midline point: the mean, median, and mode are equal.

Measures of Variability

Measures of variability, also known as measures of dispersion, quantify the spread or dispersion of data points around the central tendency. They provide insights into the degree of variability or dispersion within a dataset. The most common measures of variability are the range, interquartile range (IQR), variance, and standard deviation.

Range:

The range is the simplest measure of variability and represents the difference between the largest and smallest values in a dataset.

  • Formula: Range = Maximum value - Minimum value

Example:

  • Consider the following dataset representing the ages of students in a class: 12, 14, 15, 16, 18, 20.
  • Maximum value = 20
  • Minimum value = 12
  • Range = 20 - 12 = 8

Interquartile Range (IQR):

The interquartile range is a measure of variability that describes the spread of the middle 50% of the data. It is calculated as the difference between the upper quartile (Q3) and the lower quartile (Q1).

  • Formula: IQR = Q3 - Q1

Example:

  • Consider the same dataset of student ages. After sorting the values: 12, 14, 15, 16, 18, 20.
  • Q1 (25th percentile, here taken as the median of the lower half) = 14
  • Q3 (75th percentile, the median of the upper half) = 18
  • IQR = 18 - 14 = 4

Variance:

Variance measures the average squared deviation of each data point from the mean.

  • Formula: Variance = Σ((x - μ)²) / n, where Σ denotes the sum, x represents each data point, μ is the mean, and n is the number of data points. (Dividing by n gives the population variance; the sample variance divides by n - 1 instead.)

Example:

  • Consider the following dataset representing exam scores: 75, 82, 88, 92, 92, 92, 85, 88, 75, 92.
  • Mean (μ) = 86.1 (calculated earlier)
  • Using the formula: Variance = ((75 - 86.1)² + (82 - 86.1)² + … + (92 - 86.1)²) / 10 = (123.21 + 16.81 + … + 34.81) / 10 = 410.9 / 10 = 41.09

Standard Deviation:

The standard deviation is a measure of variability that indicates the average distance of data points from the mean. It is the square root of the variance.

  • Formula: Standard Deviation = √(Variance)

Example: Using the variance calculated above, Standard Deviation = √(41.09) ≈ 6.41
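A short sketch with Python’s built-in statistics module recomputes these measures on the same ages and exam scores used above; the quartiles follow the median-of-each-half convention used in the IQR example.

```python
import statistics

# Student ages from the range and IQR examples (already sorted)
ages = [12, 14, 15, 16, 18, 20]

data_range = max(ages) - min(ages)       # 20 - 12 = 8
q1 = statistics.median(ages[:3])         # median of the lower half = 14
q3 = statistics.median(ages[3:])         # median of the upper half = 18
iqr = q3 - q1                            # 4

# Exam scores from the variance and standard deviation examples
scores = [75, 82, 88, 92, 92, 92, 85, 88, 75, 92]
variance = statistics.pvariance(scores)  # population variance = 41.09
std_dev = statistics.pstdev(scores)      # ≈ 6.41

print(data_range, iqr, variance, round(std_dev, 2))
```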

Standard deviation is most useful when the data has a unimodal shape. In a normal distribution, approximately 34.1% of data points fall between the mean and one standard deviation above it; since a normal distribution is symmetrical, about 68.2% of data points lie within one standard deviation of the mean. Around 95% of points fall within two standard deviations of the mean, and 99.7% fall within three.

With the z-score, you can check how many standard deviations below (or above) the mean a specific data point lies.
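A quick sketch of the idea, reusing the mean and standard deviation from the exam-score example above:

```python
# Z-score: how many standard deviations a point lies from the mean
mean, std_dev = 86.1, 6.41   # from the exam-score example above
score = 92
z = (score - mean) / std_dev
print(round(z, 2))           # ≈ 0.92 standard deviations above the mean
```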

In summary, measures of variability provide information about the dispersion or spread of data points in a dataset. They complement measures of central tendency by giving insights into the consistency or variability of the data.

Measure of Asymmetry

The measure of asymmetry, also known as skewness, is a statistical measure that quantifies the degree of asymmetry in the distribution of data points around the mean. Skewness indicates whether the distribution is symmetric or skewed to one side.

There are three main types of skewness:

Positive Skewness (Right Skew):

  • In a positively skewed distribution, the tail of the distribution extends to the right, indicating that the majority of the data points are concentrated on the left side of the distribution.
  • The mean is typically greater than the median in a positively skewed distribution, as the presence of outliers pulls the mean towards the higher end of the distribution.
  • Example: Consider a dataset representing the income levels of individuals in a community. A few individuals have significantly higher incomes compared to the majority, resulting in a right-skewed distribution.

Negative Skewness (Left Skew):

  • In a negatively skewed distribution, the tail of the distribution extends to the left, indicating that the majority of the data points are concentrated on the right side of the distribution.
  • The mean is typically less than the median in a negatively skewed distribution, as the presence of outliers pulls the mean towards the lower end of the distribution.
  • Example: Consider a dataset representing scores on an easy exam. Most students score near the top of the scale, but a few very low scores stretch the tail to the left, producing a negatively skewed distribution.

Symmetrical Distribution (Zero Skew):

  • In a symmetrical distribution, the data is evenly distributed around the mean, with no skewness in either direction.
  • The mean, median, and mode are all equal in a symmetrical distribution.
  • Example: A dataset representing the heights of adult males in a population may exhibit a symmetrical distribution if the data is evenly distributed around the mean height.

Skewness can be quantitatively measured using statistical formulas, such as Pearson’s moment coefficient of skewness or the sample skewness formula. These measures provide numerical values indicating the direction and degree of skewness in the distribution.
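For instance, scipy exposes sample skewness and kurtosis directly. The sketch below generates a hypothetical right-skewed income sample purely for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical right-skewed data: incomes drawn from an exponential distribution
incomes = rng.exponential(scale=40_000, size=1_000)

print(stats.skew(incomes))      # positive value indicates right (positive) skew
print(stats.kurtosis(incomes))  # excess kurtosis relative to a normal distribution
```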

Skewness is a crucial measure in statistical analysis as it helps to understand the shape and symmetry of the distribution of data points. It provides valuable insights into the underlying characteristics of the dataset and informs further analysis and interpretation.

Modality:

Modality refers to the number of peaks or modes in a distribution. It describes the shape of the distribution by indicating whether it has one peak (unimodal), two peaks (bimodal), or more than two peaks (multimodal).

  • Unimodal distributions have a single peak, indicating that the majority of data points cluster around a central value.
  • Bimodal distributions have two distinct peaks, suggesting the presence of two different groups or clusters within the dataset.
  • Multimodal distributions have three or more peaks, indicating even more complexity in the underlying data.

Example: A unimodal distribution could represent the heights of adult males in a population, with most individuals clustered around the average height. A bimodal distribution could represent the scores on a test with two distinct groups of high and low achievers.

Covariance:

Covariance measures the extent to which two variables change together. It indicates the direction of the linear relationship between two variables and whether they tend to increase or decrease together.

  • A positive covariance indicates that as one variable increases, the other variable also tends to increase. Conversely, a negative covariance indicates that as one variable increases, the other variable tends to decrease.
  • However, covariance alone does not provide a standardized measure of the strength of the relationship between variables, as it is influenced by the units of measurement of the variables.
  • Formula: Covariance (X, Y) = Σ((xᵢ - μₓ)(yᵢ - μᵧ)) / n, where Σ denotes the sum, xᵢ and yᵢ represent individual data points, μₓ and μᵧ are the means of variables X and Y, and n is the number of data points.

Example: Consider two variables, X and Y, representing hours of study and exam scores, respectively. A positive covariance between these variables would suggest that students who study more tend to achieve higher exam scores, while a negative covariance would indicate the opposite.
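A minimal sketch with numpy, using invented study-hours and exam-score pairs:

```python
import numpy as np

# Hypothetical paired observations: hours of study and exam scores
study_hours = np.array([2, 4, 5, 7, 8, 10])
exam_scores = np.array([55, 60, 68, 75, 82, 90])

# Population covariance matching the formula above (bias=True divides by n)
cov_xy = np.cov(study_hours, exam_scores, bias=True)[0, 1]
print(cov_xy)  # positive: more study hours tend to go with higher scores
```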

Probability Distribution

A probability distribution is a mathematical function or model that describes the likelihood of various outcomes or events occurring in a given scenario. It provides a framework for understanding the probabilities associated with different possible outcomes of a random experiment or process.

In simpler terms, a probability distribution tells us how likely it is for different values to occur in a dataset or a random event. It assigns probabilities to each possible outcome, allowing us to understand the relative likelihood of each outcome happening.

Probability distributions can be classified into two main types: discrete and continuous.

Discrete Probability Distribution:

  • A discrete probability distribution is applicable when the variable of interest can only take on distinct, separate values.
  • Examples of discrete probability distributions include the binomial distribution, Poisson distribution, and geometric distribution.
  • The probabilities associated with each possible outcome are represented by a probability mass function (PMF), which gives the probability of each value occurring.

Continuous Probability Distribution:

  • A continuous probability distribution is used when the variable of interest can take on any value within a certain range.
  • Examples of continuous probability distributions include the normal distribution, uniform distribution, and exponential distribution.
  • The probabilities associated with continuous distributions are represented by a probability density function (PDF), which gives the probability density (likelihood per unit of value) at each point in the distribution.
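The distinction between a PMF and a PDF can be seen with scipy’s distribution objects; the parameters below are arbitrary choices for illustration.

```python
from scipy import stats

# Discrete example: binomial distribution (10 trials, success probability 0.5)
binom = stats.binom(n=10, p=0.5)
print(binom.pmf(5))    # probability of exactly 5 successes (≈ 0.246)

# Continuous example: standard normal distribution
norm = stats.norm(loc=0, scale=1)
print(norm.pdf(0))     # probability density at the mean (≈ 0.399)
print(norm.cdf(1.96))  # cumulative probability up to 1.96 (≈ 0.975)
```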

Probability distributions play a crucial role in various fields, including statistics, mathematics, physics, finance, and engineering. They are used for modelling and analysing random phenomena, making predictions, and making decisions under uncertainty.

Understanding probability distributions allows us to:

  • Calculate probabilities of specific outcomes or events.
  • Calculate expected values, variances, and other summary statistics.
  • Make predictions based on probability models.
  • Assess risks and uncertainties in decision-making processes.

Probability distributions provide a formal framework for understanding and quantifying uncertainty, randomness, and variability in data and random processes. They are essential tools for statistical analysis and decision-making in a wide range of disciplines.

Normal Distribution

Normal distribution, also known as Gaussian distribution, is one of the most fundamental concepts in statistics and probability theory. It describes the probability distribution of a continuous random variable that is symmetrically distributed around its mean, forming a bell-shaped curve.

Key characteristics of the normal distribution include:

  • Symmetry: The normal distribution is symmetric around its mean, with the mean, median, and mode all being equal. This means that the curve is balanced, with the same amount of data on both sides of the mean.
  • Bell-shaped curve: The graph of a normal distribution forms a bell-shaped curve, with the highest point at the mean and the curve tapering off symmetrically on either side. The shape of the curve is determined by the mean and standard deviation of the distribution.
  • Mean and standard deviation: The mean (μ) determines the central tendency of the distribution, while the standard deviation (σ) measures the spread or variability of the data around the mean. The standard deviation controls the width of the bell curve: larger standard deviations result in wider curves, while smaller standard deviations result in narrower curves.

Empirical Rule: The normal distribution follows the empirical rule, also known as the 68–95–99.7 rule, which states that:

  • Approximately 68% of the data falls within one standard deviation of the mean.
  • Approximately 95% of the data falls within two standard deviations of the mean.
  • Approximately 99.7% of the data falls within three standard deviations of the mean.
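These percentages can be verified numerically from the normal cumulative distribution function, as in the short sketch below.

```python
from scipy import stats

# Probability mass within k standard deviations of the mean for a normal distribution
for k in (1, 2, 3):
    prob = stats.norm.cdf(k) - stats.norm.cdf(-k)
    print(f"within {k} standard deviation(s): {prob:.1%}")
# within 1 standard deviation(s): 68.3%
# within 2 standard deviation(s): 95.4%
# within 3 standard deviation(s): 99.7%
```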

Applications of the normal distribution include:

  • Natural phenomena: Many naturally occurring measurements, such as heights, weights, IQ scores, and measurement errors, approximately follow a normal distribution.
  • Statistical inference: The normal distribution is widely used in statistical inference, hypothesis testing, and confidence interval estimation.
  • Process control: Normal distribution is used in quality control and process monitoring to assess whether a process is operating within acceptable limits.
  • Financial markets: Asset returns and other financial metrics are often modelled as normally distributed, an assumption used in risk management and portfolio analysis.

The normal distribution is a fundamental concept in statistics, representing the distribution of data around a central value. Its properties and applications make it a powerful tool in various fields of study, providing insights into the behavior of random variables and facilitating statistical analysis and inference.

Univariate Analysis

Univariate analysis is a statistical method used to analyse and describe the distribution and characteristics of a single variable in a dataset. It focuses on understanding the properties, patterns, and trends within one variable without considering the relationships with other variables. Univariate analysis provides valuable insights into the distribution, central tendency, variability, and shape of the data, aiding in data exploration and decision-making processes.

Key components of univariate analysis include:

Central tendency measures:

  • Mean: The average value of the variable, calculated by summing all values and dividing by the total number of observations.
  • Median: The middle value of the variable when the data is sorted in ascending or descending order. It divides the dataset into two equal parts.
  • Mode: The most frequently occurring value in the dataset.

Measures of variability:

  • Range: The difference between the maximum and minimum values of the variable.
  • Variance: The average of the squared differences between each data point and the mean, measuring the spread of the data.
  • Standard deviation: The square root of the variance, providing a measure of the dispersion of the data around the mean.

Shape of the distribution:

  • Histogram: A graphical representation of the frequency distribution of the variable, showing the distribution’s shape and characteristics.
  • Box plot: A visual summary of the variable’s distribution, including the median, quartiles, and outliers.
  • Probability density function (PDF): A mathematical function representing the probability distribution of a continuous variable.

Examples of univariate analysis:

Examining the distribution of test scores in a classroom:

  • Calculate the mean, median, and mode of the test scores to understand the central tendency.
  • Construct a histogram to visualize the frequency distribution of scores and identify any patterns or clusters.
  • Calculate the standard deviation to measure the variability or spread of scores around the mean.

Analysing the distribution of monthly sales data for a retail chain:

  • Compute the mean, median, and mode of monthly sales to assess the typical performance.
  • Create a box plot to visualize the distribution of sales, including any outliers or extreme values.
  • Calculate the range to understand the variability in monthly sales figures.
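The classroom example might look roughly like this in pandas and matplotlib; the scores are the same made-up values used earlier in this article.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical test scores for one classroom
scores = pd.Series([75, 82, 88, 92, 92, 92, 85, 88, 75, 92], name="score")

# Central tendency and variability of the single variable
print(scores.mean(), scores.median(), scores.mode().tolist(), scores.std())

# Shape of the distribution: histogram and box plot
fig, axes = plt.subplots(1, 2, figsize=(8, 3))
scores.plot.hist(bins=5, ax=axes[0], title="Histogram of scores")
scores.plot.box(ax=axes[1], title="Box plot of scores")
plt.tight_layout()
plt.show()
```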

Univariate analysis is a fundamental statistical technique for exploring and describing the characteristics of a single variable in a dataset. It provides valuable insights into the central tendency, variability, and distribution of data, facilitating data-driven decision-making and further analysis.

Bivariate Analysis

Bivariate analysis is a statistical method used to analyse the relationship between two variables in a dataset. Unlike univariate analysis, which focuses on understanding the characteristics of a single variable, bivariate analysis examines how two variables are related to each other. It helps in understanding the correlation, association, or dependency between the two variables and provides insights into their interactions.

Key components of bivariate analysis include:

Correlation coefficients:

Correlation coefficient: A statistical measure that quantifies the strength and direction of the linear relationship between two variables. It ranges from -1 to 1, where:

  • A correlation coefficient of 1 indicates a perfect positive linear relationship.
  • A correlation coefficient of -1 indicates a perfect negative linear relationship.
  • A correlation coefficient of 0 suggests no linear relationship between the variables.

Scatter plots:

Scatter plot: A graphical representation of the relationship between two variables, where each data point represents a pair of values for the two variables. It helps visualize the pattern, direction, and strength of the relationship between the variables.

Regression analysis:

Regression analysis: A statistical technique used to model the relationship between a dependent variable and one or more independent variables. It helps in understanding how changes in one variable affect the other(s) and allows for making predictions based on the relationship.

Examples of bivariate analysis:

Examining the relationship between study hours and exam scores:

  • Collect data on the number of study hours and corresponding exam scores for a group of students.
  • Calculate the correlation coefficient between study hours and exam scores to determine the strength and direction of the relationship.
  • Create a scatter plot to visualize the relationship between study hours and exam scores, with study hours on the x-axis and exam scores on the y-axis.
  • Conduct regression analysis to model the relationship between study hours and exam scores and make predictions about exam scores based on study hours.
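A minimal sketch of that workflow, with invented study-hours and exam-score data:

```python
import numpy as np
from scipy import stats

# Hypothetical paired data: study hours and exam scores for eight students
study_hours = np.array([1, 2, 3, 4, 5, 6, 7, 8])
exam_scores = np.array([52, 58, 61, 67, 70, 76, 80, 85])

# Pearson correlation coefficient: strength and direction of the linear relationship
r = np.corrcoef(study_hours, exam_scores)[0, 1]

# Simple linear regression: exam_score ≈ intercept + slope * study_hours
result = stats.linregress(study_hours, exam_scores)

print(f"correlation r = {r:.2f}")
print(f"predicted score for 9 hours of study: {result.intercept + result.slope * 9:.1f}")
```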

Analysing the association between advertising spending and sales revenue:

  • Gather data on advertising spending and sales revenue for a company over a period of time.
  • Compute the correlation coefficient between advertising spending and sales revenue to assess the strength and direction of the association.
  • Plot a scatter plot to visualize the relationship between advertising spending and sales revenue.
  • Perform regression analysis to model the impact of advertising spending on sales revenue and predict future sales revenue based on advertising budgets.

Bivariate analysis is a powerful statistical technique for exploring the relationship between two variables in a dataset. It helps in identifying patterns, trends, and dependencies between variables, enabling data-driven decision-making and further analysis.

Multivariate Analysis

Multivariate analysis is a statistical technique used to analyse and understand the relationships between multiple variables simultaneously in a dataset. Unlike univariate and bivariate analysis, which focus on a single variable or the relationship between two variables, respectively, multivariate analysis considers the interactions among multiple variables. It aims to uncover patterns, trends, and associations that may not be apparent when analysing variables individually or in pairs.

Key components of multivariate analysis include:

Factor analysis:

Factor analysis is a statistical method used to identify underlying factors or latent variables that explain the correlations among observed variables. It helps in reducing the dimensionality of the dataset by identifying common patterns or themes among variables.

Cluster analysis:

Cluster analysis, also known as clustering, is a technique used to group similar observations or objects into clusters based on their characteristics or attributes. It helps in identifying natural groupings or segments within the dataset and can be used for segmentation, classification, or pattern recognition.

Discriminant analysis:

Discriminant analysis is a statistical technique used to determine which variables discriminate or differentiate between two or more groups or categories. It helps in identifying the key features or variables that contribute most to the separation of groups and can be used for classification or prediction purposes.

Principal component analysis (PCA):

Principal component analysis is a dimensionality reduction technique used to transform a high-dimensional dataset into a lower-dimensional space while preserving most of the variability in the data. It helps in identifying the most important features or components that explain the variation in the dataset.
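As an example of the idea, scikit-learn’s PCA can project a standardized dataset onto its leading components. The data below is randomly generated and purely illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)

# Hypothetical dataset: 200 observations of 5 variables, two of them correlated
X = rng.normal(size=(200, 5))
X[:, 1] = 0.8 * X[:, 0] + rng.normal(scale=0.3, size=200)

# Standardize, then project onto the first two principal components
X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_scaled)

print(X_reduced.shape)                # (200, 2)
print(pca.explained_variance_ratio_)  # share of variance captured by each component
```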

Variance analysis:

Variance analysis is a method used to understand the differences or variations in the dataset across multiple variables or groups. It helps in identifying sources of variation and can be used to compare means, distributions, or patterns among different groups.

Examples of multivariate analysis:

Market segmentation:

Analysing customer data from a marketing survey to identify distinct segments or groups of customers based on demographics, purchasing behaviour, and preferences. Cluster analysis can be used to group similar customers into segments, while discriminant analysis can identify the key features that differentiate between segments.

Predictive modelling:

Building a predictive model to forecast sales revenue based on multiple factors such as advertising spending, market conditions, and competitor activities. Regression analysis or machine learning algorithms can be used to analyse the relationships between predictor variables and the target variable (sales revenue) in a multivariate context.

Exploratory data analysis:

Conducting an exploratory analysis of a dataset containing multiple variables, such as demographic information, product attributes, and customer satisfaction scores. Principal component analysis can be used to reduce the dimensionality of the dataset and visualize the relationships between variables in a lower-dimensional space.

Multivariate analysis is a powerful statistical technique for analysing complex datasets with multiple variables. It helps in uncovering patterns, trends, and relationships that may not be evident when examining variables individually, enabling more comprehensive insights and informed decision-making.

Inferential Analysis

Inferential analysis is a statistical method used to make inferences, predictions, or generalizations about a population based on sample data. Unlike descriptive statistics, which focus on summarizing and describing characteristics of a dataset, inferential statistics involve drawing conclusions about the broader population from a smaller sample. It allows researchers to make judgments, predictions, or hypotheses about the population parameters using sample data.

Key components of inferential analysis include:

Hypothesis testing:

Hypothesis testing is a statistical method used to evaluate hypotheses about population parameters based on sample data. It involves setting up a null hypothesis (H0) and an alternative hypothesis (Ha), collecting sample data, calculating test statistics, and determining whether the evidence supports rejecting or failing to reject the null hypothesis.

Confidence intervals:

Confidence intervals give a range of values, computed from sample data, that is likely to contain the population parameter at a specified level of confidence. They provide a measure of uncertainty around the estimated population parameter and help in assessing the precision of sample estimates.
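A minimal sketch of a 95% confidence interval for a mean, using scipy’s t distribution on a simulated income sample (the numbers are invented):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# Simulated sample of 40 household incomes (in thousands), for illustration only
sample = rng.normal(loc=50, scale=7, size=40)

mean = sample.mean()
sem = stats.sem(sample)  # standard error of the mean

# 95% confidence interval for the population mean, based on the t distribution
low, high = stats.t.interval(0.95, len(sample) - 1, loc=mean, scale=sem)
print(f"95% CI for the mean: ({low:.1f}, {high:.1f})")
```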

Analysis of variance (ANOVA):

Analysis of variance is a statistical technique used to compare means across multiple groups or populations to determine if there are significant differences between them. It helps in assessing the impact of categorical variables on a continuous outcome variable.

Chi-square tests:

Chi-square tests are statistical tests used to assess the association or independence between categorical variables. They compare the observed frequencies of categorical variables with the expected frequencies under the null hypothesis to determine if there is a significant relationship.

Examples of inferential analysis:

Testing a new drug’s effectiveness:

Researchers want to test the effectiveness of a new drug in reducing blood pressure. They conduct a randomized controlled trial (RCT) where participants are randomly assigned to receive either the new drug or a placebo. After administering the treatment, researchers collect data on participants’ blood pressure levels. By comparing the mean blood pressure levels between the treatment and control groups and conducting hypothesis testing, researchers can infer whether the new drug has a significant effect on blood pressure reduction in the broader population.
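A sketch of how such a comparison might be run with a two-sample t-test; the blood-pressure readings below are simulated for illustration, not real trial data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# Simulated post-treatment blood pressure readings (mmHg), for illustration only
treatment = rng.normal(loc=128, scale=10, size=50)  # assumes the drug lowers pressure
placebo = rng.normal(loc=135, scale=10, size=50)

# Two-sample t-test: is the difference in group means statistically significant?
t_stat, p_value = stats.ttest_ind(treatment, placebo)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
print("reject null hypothesis at alpha = 0.05:", p_value < 0.05)
```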

Assessing the impact of an educational intervention:

A school district implements a new teaching method to improve students’ math scores. Researchers conduct a study where some schools implement the new teaching method (treatment group) while others continue with the existing method (control group). After a semester, researchers collect math test scores from students in both groups. By comparing the mean test scores between the treatment and control groups and conducting hypothesis testing, researchers can infer whether the new teaching method has a significant impact on students’ math performance district-wide.

Analysing customer satisfaction data:

A company collects data on customer satisfaction scores before and after implementing a new customer service training program. By comparing the mean satisfaction scores before and after the intervention and conducting hypothesis testing, the company can infer whether the training program has led to a significant improvement in overall customer satisfaction levels.

Inferential analysis plays a crucial role in drawing conclusions, making predictions, and generalizing findings from sample data to the broader population. It helps researchers make informed decisions, assess the effectiveness of interventions, and understand relationships between variables in a broader context.

This article was first published at

https://www.xennialtechguy.id/posts/statistics-for-data-science/

