From Raw Data to Informed Decisions: Statistics in Data Science and How to Master It
The importance of statistics in the fast-paced field of data science cannot be overstated. Statistics enables data scientists to uncover hidden patterns, draw important conclusions, and extract useful insights from large datasets. Understanding the relevance of statistics is crucial to navigating the field successfully, whether you are an experienced data professional or a curious enthusiast eager to delve into the world of data science.
When you work with data, you’ll uncover interesting findings and patterns. However, when you need to explain your project to a client or someone without technical knowledge, simply sharing your observations might not be enough to convince them of your results. That’s where statistics comes in, and it is essential in data science. Statistics provides the tools to analyze data more deeply, draw meaningful conclusions, and present solid evidence to support your findings. By backing up your insights with proper inference, you make it easier for others to trust the results you’ve obtained from the data. Ultimately, statistics acts as a crucial pillar of data science, giving you the confidence and credibility to communicate the significance of your work to a broader audience.
This blog will take you on an insightful journey through the significance of statistics in data science. We’ll look at its fundamental concepts and approaches, as well as how statistical tools enable data-driven discoveries. We’ll cover a variety of topics, from fundamental concepts to advanced approaches, each of which will reveal new levels of understanding in the field of data science.
Firstly, What is Statistics?
Statistics is the science of collecting, organizing, analyzing, and interpreting data for better decision-making. It is a methodical approach to dealing with information to discover relationships, patterns, and trends in the data.
Statistics can be divided into two types: descriptive and inferential statistics.
Descriptive Statistics
This form of statistics involves summarizing and presenting data in an understandable manner, helping us comprehend the fundamental characteristics of a dataset. Common tools include averages (mean, median, mode), measures of dispersion (standard deviation, range), and visual representations such as histograms or pie charts. Descriptive statistics provides a snapshot of the data and allows us to see at a glance how the data is shaped.
Inferential Statistics
This type of statistics goes beyond summarizing the data: it draws conclusions or makes predictions about a wider population based on sample data. Rather than examining the full population, which may be difficult or impossible, inferential statistics lets us reason from a representative sample. It employs techniques such as hypothesis testing, confidence intervals, and regression analysis to generate these predictions and derive meaningful inferences about the population.
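To make this concrete, here is a minimal sketch of one such technique, a one-sample t-test, assuming scipy and numpy are installed and using simulated (made-up) sample data:
import numpy as np
from scipy import stats
# Hypothetical sample of 30 adult heights in cm (simulated here; in practice, real measurements)
rng = np.random.default_rng(42)
sample = rng.normal(loc=175, scale=7, size=30)
# Test the null hypothesis that the population mean height is 170 cm
t_stat, p_value = stats.ttest_1samp(sample, popmean=170)
print(f"t-statistic: {t_stat:.3f}, p-value: {p_value:.4f}")
# A small p-value (e.g., below 0.05) is evidence that the population mean differs from 170 cm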
More on Descriptive Statistics
Population
The term “population” refers to the complete group or set of persons, objects, or events of interest in a study. It encompasses all of the elements about which you wish to draw inferences or make predictions, because they share the properties under investigation. The population is the whole collection you would ideally like to evaluate, although in many circumstances it is too large or impractical to investigate every single member.
For example, if you are researching the heights of all adult men in a nation, the population would be all adult males in that country.
Sample
A “sample” is a subset of the population chosen to represent the broader group. By evaluating the sample, you can draw conclusions about the entire population. The vital task in data science is to make sure the sample is representative, appropriately reflecting the features and variety of the entire population, so that the results are valid and applicable to the larger context.
For example, since measuring the height of every adult male in the nation may be difficult or time-consuming, you may prefer to gather data from a smaller group; that smaller group is known as a sample.
Sampling Techniques
- Simple Random Sampling: Sampling where every member of the population (N) has an equal chance of being selected for the sample (n).
- Stratified Sampling: Sampling where the population is split into non-overlapping groups, known as strata (layers), and a sample is drawn from each stratum.
- Systematic Sampling: Sampling where, from a population of N members, we pick every k-th observation for the sample.
- Convenience Sampling: A non-probability sampling strategy in which the researcher chooses a sample based on what is most accessible or easily available.
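Here is a minimal sketch of the first three techniques in pandas, using a made-up population DataFrame (the column names and sizes are illustrative, not from any real dataset):
import pandas as pd
# Hypothetical population of 1,000 people, with a 'group' column to stratify on
population = pd.DataFrame({
    'id': range(1000),
    'group': ['A'] * 600 + ['B'] * 400,
})
# Simple random sampling: every member has an equal chance of being selected
simple_random = population.sample(n=100, random_state=42)
# Systematic sampling: pick every k-th observation
k = len(population) // 100
systematic = population.iloc[::k]
# Stratified sampling: draw 10% from each stratum so group proportions are preserved
stratified = population.groupby('group').sample(frac=0.1, random_state=42)
print(len(simple_random), len(systematic), len(stratified))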
Variables
A variable is a property or characteristic that can take on different values.
There are mainly two types of variables:
- Quantitative Variable: A numeric data type.
- Qualitative / Categorical Variable: A non-numeric data type.
Quantitative Variable is further classified into two types:
- Discrete Variable: A variable that takes distinct, countable values.
- Continuous Variable: A variable that can take any value within a range, where the number of possible values is infinite.
Variable Measurement Scales
- Nominal: Nominal data is like putting things into different boxes without any specific order or value attached. For example, colors of cars (red, blue, green) — you can’t say one color is “bigger” or “higher” than the other.
- Ordinal: Ordinal data has a natural order, but the exact differences between the values are not meaningful. It’s like ranking things from “lowest” to “highest” but without knowing how much higher one is from another. An example would be rating satisfaction levels with options like “Very Dissatisfied,” “Neutral,” and “Very Satisfied.”
- Interval: Interval data has a consistent scale with meaningful differences between values, but no true zero point. For example, scores on tests like the SAT or GRE are interval data: the difference between a score of 1500 and 1600 is the same as the difference between 1600 and 1700, but a score of zero does not represent a total absence of the measured quantity.
- Ratio: Ratio data has a meaningful zero point, and the ratios between values are meaningful. For example, speed measured in meters per second, kilometers per hour, or miles per hour: if one car is traveling at 80 km/h and another at 40 km/h, the first car is moving twice as fast as the second.
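In pandas, the distinction between nominal and ordinal data can be made explicit with categorical dtypes. A quick sketch with made-up values:
import pandas as pd
# Nominal: categories with no inherent order
colors = pd.Categorical(['red', 'blue', 'green', 'blue'])
# Ordinal: categories with a defined order, which enables comparison and sorting
ratings = pd.Categorical(
    ['Neutral', 'Very Satisfied', 'Very Dissatisfied', 'Neutral'],
    categories=['Very Dissatisfied', 'Neutral', 'Very Satisfied'],
    ordered=True,
)
print(ratings.min(), '->', ratings.max())  # Very Dissatisfied -> Very Satisfied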
Frequency Distribution in Statistics
Frequency distribution is a method of summarizing and displaying data to discover patterns and variations. It involves counting the number of times each value or category appears in a dataset and displaying the results in a table or graph.
For example, let’s say we have a dataset of test scores of students:
85, 78, 90, 92, 78, 85, 90, 88, 85, 92
To create a frequency distribution, we list each unique score and count how many times it appears:
Score: 78, Frequency: 2
Score: 85, Frequency: 3
Score: 88, Frequency: 1
Score: 90, Frequency: 2
Score: 92, Frequency: 2
This frequency distribution tells us that three students scored 85, two students each scored 78, 90, and 92, and one student scored 88. Frequency distributions are useful for gaining insights into the central tendency, variability, and shape of a dataset.
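The same frequency distribution can be computed in one line with pandas, using the scores above:
import pandas as pd
scores = pd.Series([85, 78, 90, 92, 78, 85, 90, 88, 85, 92])
# Count how many times each unique score appears, sorted by score
frequency = scores.value_counts().sort_index()
print(frequency)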
Histogram
A histogram is a graph that depicts the distribution of the values of a numeric variable as a series of bars. Each bar typically covers a numeric value range known as a bin or class, and the height of a bar indicates the frequency of data points whose values fall within the associated bin.
Bar Graph
A bar chart, also known as a bar graph, is a type of chart or graph that displays categorical data using rectangular bars with heights or lengths proportional to the frequency of values they represent. The bars can be plotted horizontally or vertically. A vertical bar chart is also known as a column chart.
Both charts can be illustrated with the Iris dataset, which is popular in machine learning and data science and often used for classification tasks. It contains measurements for three species of iris flowers: Setosa, Versicolor, and Virginica, with four features each: sepal length, sepal width, petal length, and petal width.
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
# Load the Iris dataset from scikit-learn
iris = load_iris()
# Create a pandas DataFrame with the data
data = pd.DataFrame(data=iris.data, columns=iris.feature_names)
data['species'] = iris.target_names[iris.target]
# Extract the petal length data for each species
setosa_petal_length = data[data['species'] == 'setosa']['petal length (cm)']
versicolor_petal_length = data[data['species'] == 'versicolor']['petal length (cm)']
virginica_petal_length = data[data['species'] == 'virginica']['petal length (cm)']
# Histogram: overlay the petal-length distribution of each species
plt.hist(setosa_petal_length, alpha=0.5, label='Setosa', bins=10)
plt.hist(versicolor_petal_length, alpha=0.5, label='Versicolor', bins=10)
plt.hist(virginica_petal_length, alpha=0.5, label='Virginica', bins=10)
plt.xlabel('Petal Length (cm)')
plt.ylabel('Frequency')
plt.title('Histogram of Petal Length for Iris Dataset')
plt.legend()
plt.show()
# Bar graph: one bar per species, with height equal to the mean petal length
average_petal_length = data.groupby('species')['petal length (cm)'].mean()
species = average_petal_length.index
plt.bar(species, average_petal_length)
plt.xlabel('Species')
plt.ylabel('Average Petal Length (cm)')
plt.title('Average Petal Length for Each Species in Iris Dataset')
plt.show()
Measures of Central Tendency
Measures of central tendency are statistical measures that provide information about the center or average of the data. They are necessary for summarizing and comprehending the central value around which the data points tend to cluster. The mean, median, and mode are the three main measures of central tendency.
1. Mean
Mean is the most commonly used measure of central tendency. It is calculated by summing up all the values in the dataset and dividing the sum by the total number of data points. Mean is represented as:
Mean = (Sum of all values) / (Number of data points)
The mean is sensitive to extreme values, known as outliers. When outliers are present, the mean can be skewed significantly, pulling the average towards the outlier value.
2. Median
Median is the middle value of a dataset when it is arranged in ascending or descending order. If the dataset has an odd number of values, the median is the middle value itself. If the dataset has an even number of values, the median is the average of the two middle values. The median is not affected by extreme values or outliers, making it a robust measure of central tendency.
3. Mode
Mode is the value that appears most frequently in a dataset. A dataset can have one mode (unimodal), two modes (bimodal), or more than two modes (multimodal). A dataset can have no mode, meaning all values occur with the same frequency.
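All three measures are available directly in pandas. A short sketch on a small made-up salary dataset (values in thousands), showing how an outlier pulls the mean while the median stays put:
import pandas as pd
salaries = pd.Series([40, 42, 45, 45, 48, 50, 250])  # 250 is an outlier
print('Mean:  ', salaries.mean())            # ~74.3, pulled up by the outlier
print('Median:', salaries.median())          # 45.0, robust to the outlier
print('Mode:  ', salaries.mode().tolist())   # [45], the most frequent value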
Measures of Dispersion
Measures of dispersion, also known as measures of variability or spread, are statistical metrics that quantify how far data points in a dataset differ from one another and from the mean, median, or mode. These measures supplement the insights derived from measures of central tendency by providing valuable information about the spread or distribution of data points. Some popular dispersion measures are:
1. Range
The range is the most basic measure of dispersion and is determined as the difference between the dataset’s maximum and minimum values. It provides a general approximation of data dispersion but is impacted by extreme values (outliers) and might not accurately represent the overall distribution.
2. Variance
The variance is a numerical measure that quantifies the spread or dispersion of a set of data points. It indicates how far individual data points deviate from the dataset’s mean (average). A large variance implies that the data points are spread far from the mean, whereas a low variance shows that they are clustered close to the mean.
Population Variance (σ²) = Σ (xᵢ − μ)² / N
Sample Variance (s²) = Σ (xᵢ − x̄)² / (n − 1)
where:
- xᵢ represents each individual data point
- μ is the population mean
- x̄ is the sample mean
- n is the total number of data points in the sample
- N is the total number of data points in the population
Key Interview question: Why are the denominators different in the formulas for population variance and sample variance in statistics?
The sample variance formula divides the sum of squared differences between each data point (xᵢ) and the sample mean (x̄) by one less than the number of data points in the sample (n − 1). This (n − 1) adjustment is known as Bessel’s correction. It is applied to make the sample variance an unbiased estimator of the population variance, which matters especially for small sample sizes. The adjustment accounts for the fact that the population mean (μ) is estimated by the sample mean (x̄), which costs one degree of freedom.
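In numpy, this correction is controlled by the ddof ("delta degrees of freedom") argument; a quick sketch on made-up data:
import numpy as np
data = np.array([4, 8, 6, 5, 3, 7])
population_var = np.var(data)       # ddof=0 by default: divides by N
sample_var = np.var(data, ddof=1)   # ddof=1: divides by n - 1 (Bessel's correction)
print(population_var, sample_var)   # ~2.917 vs 3.5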
3. Standard Deviation
The standard deviation is the square root of the variance. Because it expresses the spread of the data in the original unit of measurement, it is one of the most commonly used measures of dispersion. A higher standard deviation indicates greater variability in the data, while a lower standard deviation indicates less variability.
Standard Deviation (σ) = √(Variance)
4. Mean Absolute Deviation
Mean absolute deviation (MAD) is the average of the absolute differences between each data point and the mean. It provides a measure of the average distance of data points from the mean. Like the standard deviation, MAD is expressed in the original unit of measurement.
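A short sketch in pandas, with the computation written out explicitly (the Series.mad convenience method was removed in pandas 2.0):
import pandas as pd
data = pd.Series([4, 8, 6, 5, 3, 7])
# Average absolute distance of each point from the mean
mad = (data - data.mean()).abs().mean()
print(mad)  # 1.5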
5. Percentiles
A percentile is a statistical measure used to describe the position of a specific value within a dataset, relative to the entire distribution of the data. It represents the percentage of data points that fall below or equal to a given value.
6. Quartiles
A quartile is a type of percentile that divides a dataset into four equal parts, each containing 25% (one-fourth) of the data.
First Quartile (Q1): Also known as the 25th percentile, Q1 represents the value below which 25% of the data points lie.
Second Quartile (Q2): The second quartile is the same as the median, which is the middle value of the dataset when it is sorted. It represents the value below which 50% of the data points lie and divides the data into two equal halves.
Third Quartile (Q3): Also known as the 75th percentile, Q3 represents the value below which 75% of the data points lie. In other words, 75% of the data points are less than or equal to Q3.
7. Interquartile Range (IQR)
It is the difference between the third quartile (Q3) and the first quartile (Q1). The IQR provides information about the spread of the middle 50% of the data and is a robust measure that is not affected by extreme values or outliers.
IQR = Q3 − Q1
Lower Fence = Q1 − 1.5 × IQR
Upper Fence = Q3 + 1.5 × IQR
Key Question: How do we identify outliers?
Any value below the lower fence or above the upper fence is treated as an outlier in the data distribution.
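A minimal sketch of this rule with numpy, on a small made-up dataset containing one obvious outlier:
import numpy as np
data = np.array([10, 12, 12, 13, 12, 11, 14, 13, 15, 100])
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
# Keep only the points outside the fences
outliers = data[(data < lower_fence) | (data > upper_fence)]
print('Fences:', lower_fence, upper_fence)
print('Outliers:', outliers)  # [100]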
8. Coefficient of Variation
The coefficient of variation (CV) is the ratio of the standard deviation to the mean, expressed as a percentage. It can be used to compare the variability of two or more datasets with different units or scales. A lower CV indicates relative consistency and stability, whereas a higher CV indicates greater variation.
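A quick sketch comparing the spread of two made-up datasets measured on different scales:
import numpy as np
heights_cm = np.array([160, 165, 170, 175, 180])
weights_kg = np.array([55, 62, 70, 80, 95])
cv_heights = np.std(heights_cm, ddof=1) / np.mean(heights_cm) * 100
cv_weights = np.std(weights_kg, ddof=1) / np.mean(weights_kg) * 100
print(f'CV of heights: {cv_heights:.1f}%, CV of weights: {cv_weights:.1f}%')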
Five Number Summary
The five-number summary is a concise way to describe the distribution of a dataset using five key statistics. It provides a summary of the data’s central tendency, spread, and presence of outliers. The five numbers are:
- Minimum: The smallest value in the dataset, representing the lower extreme of the data.
- First Quartile (Q1): The 25th percentile of the dataset, marking the point below which 25% of the data falls. It is also known as the lower quartile.
- Median (Q2): The middle value of the dataset when it is sorted. It divides the data into two equal halves, with 50% of the data points below and 50% above the median.
- Third Quartile (Q3): The 75th percentile of the dataset, marking the point below which 75% of the data falls. It is also known as the upper quartile.
- Maximum: The largest value in the dataset, representing the upper extreme of the data.
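All five numbers can be obtained at once with numpy's percentile function; a minimal sketch reusing the outlier dataset from above:
import numpy as np
data = np.array([10, 12, 12, 13, 12, 11, 14, 13, 15, 100])
minimum, q1, median, q3, maximum = np.percentile(data, [0, 25, 50, 75, 100])
print(minimum, q1, median, q3, maximum)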
Box Plot
A box plot, also known as a boxplot, is a graphical method for displaying the locality, spread, and skewness of groups of numerical data through their quartiles. A box plot may include lines (called whiskers) extending from the box to indicate variability outside the upper and lower quartiles; hence the plot is also known as a box-and-whisker plot or box-and-whisker diagram. Outliers can be plotted as individual points beyond the whiskers. Box plots are non-parametric: they show variation in samples of a statistical population without making any assumptions about the underlying statistical distribution.
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
# Load the Iris dataset from scikit-learn
iris = load_iris()
# Create a pandas DataFrame with the data
data = pd.DataFrame(data=iris.data, columns=iris.feature_names)
data['species'] = iris.target_names[iris.target]
# Create box plots for each feature in the Iris dataset
plt.figure(figsize=(10, 6))
sns.boxplot(data=data.drop('species', axis=1), orient='v', palette='Set3')
plt.title('Box Plot for Iris Dataset Features')
plt.ylabel('Feature Values')
plt.xticks(rotation=45)
plt.show()
In the above box plot, the sepal width feature exhibits some potential outliers, visible as data points beyond the whiskers.
Finally, a solid understanding of statistics is the foundation for effective data science initiatives. By delving into these critical areas, you will gain the knowledge and confidence to make meaningful data-driven decisions and unearth unique insights in the immense sea of data. So, let us go on this enlightening journey to realize the full potential of statistics in data science!
Can’t wait to post the next part of this article. Follow for more.