The Journey into Time Series: Unveiling Descriptive Analytics

Data Mastery Series — Episode 12: The Art of Forecasting (Part 3)

Donato_TH (Donato Story) · 8 min read · Jul 9, 2023

If you are interested in articles related to my experience, please feel free to contact me: linkedin.com/in/nattapong-thanngam

I hope you’ve enjoyed my tale about embarking on a data analysis and Data Science journey so far. It’s crafted to emulate real-world scenarios to help foster a deeper understanding of the concepts. In your future roles as data scientists or analysts, mastering descriptive analytics will be crucial. For instance, in a cosmetic company with a myriad of products, each exhibiting unique sales patterns, your understanding of descriptive analytics can facilitate efficient communication and instill confidence in your colleagues.

Continuing our Tale

Robust Roast: A Tale of Beans and Bytes — Part 3 ( Image by Author)
Robust Roast: A Tale of Beans and Bytes — Part 4 ( Image by Author)

What about you? Do terms like “Mean”, “Mode”, “Variance”, “Standard deviation”, and others sound familiar? Let’s plunge into the theory.

Note:

  • All of the code and raw files are available in my GitHub repository [Link]
  • This content is designed to provide an overview for beginners in data science and data analytics. It’s not meant for in-depth theoretical explanations, but it does introduce fundamental concepts and statistical measures. After grasping the basics here, you’re encouraged to explore further into the topics that catch your interest.

Descriptive analytics acts like a potent lens, revealing hidden patterns and trends in data. By summarizing a dataset’s key characteristics, we can glean vital insights to guide our decision-making process. The core facets of descriptive analytics are ‘Measures of Centrality’, ‘Measures of Variation’, ‘Measures of Localization’, and ‘Measures of Symmetry’, each providing a unique perspective on our data.

Recap Trend of SKU A, B, C ( Image by Author)

Measures of Centrality

Centrality measures describe the ‘center’ of a dataset. Common measures include the mean (average), median (middle value), and mode (most common value).

  • Mean: The mean, often referred to as the average, is calculated by adding all data points in a set and dividing by the number of points.
    - For instance, if x1, x2, …, xn are the values in a data set, the mean (μ) is calculated as: Mean (μ) = (x1 + x2 + … + xn) / n
  • Median: The median is the middle value in a set when the data points are arranged in ascending order. If there is an odd number of observations, the median is the middle number. If there is an even number of observations, the median is the average of the two middle numbers.
    - For example, if our set is {1, 2, 3, 4, 5}, the median is 3 (middle value). If our set is {1, 2, 3, 4}, the median is (2 + 3) / 2 = 2.5 (average of two middle values).
  • Mode: The mode is the most frequently occurring value in a dataset. A dataset may have one mode, more than one mode, or no mode at all.
    - For example, in the set {1, 2, 2, 3, 4}, the mode is 2 because it occurs more often than the other numbers.
Mean, Median, and Mode of SKU A (Image by Author)
For SKU A: Mean = 43.80, Median = 44.0, Mode = 38
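Here is a minimal pandas sketch of these three measures. The series values are illustrative stand-ins, not the actual SKU A data (which lives in the repository), so the printed numbers will differ from the figures above.

```python
import pandas as pd

# Illustrative stand-in for the SKU A sales series; the real data is in the repo
sku_a = pd.Series([38, 44, 52, 38, 47, 43, 45, 41])

print(f"Mean   = {sku_a.mean():.2f}")       # sum of values / number of values
print(f"Median = {sku_a.median()}")         # middle value of the sorted series
print(f"Mode   = {sku_a.mode().tolist()}")  # most frequent value(s); may be several
```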

Measures of Variation

These statistical tools illustrate the dispersion and distribution of a dataset. Recognizing the degree of data dispersion is vital for interpreting the stability or volatility of the subject under examination.

  • Range: The simplest measure of variation, the range, is the difference between the highest and lowest value in a dataset. It provides a swift overview of data spread but can be significantly affected by outliers.
    - Formula: Range = Max(x) - Min(x)
    - Use: The range is useful for providing a quick sense of data spread but does not account for how data is distributed within this range.
The range of SKU B (Image by Author)
Range of SKU B: 62 - 23 = 39
  • Variance: Variance measures how far a dataset is spread out from its mean. It quantifies dispersion within a dataset, with larger numbers indicating a greater spread.
    - Formula: Variance = Sum[(x - μ)²] / n (the population variance; the sample version divides by n - 1)
    - Use: Variance is ideal for comparing the spread of two different datasets and understanding the reliability of the mean.
  • Standard Deviation: Standard deviation is the square root of the variance. This widely used measure of variability is popular because it is expressed in the same units as the original data.
    - Formula: Standard Deviation = sqrt(Variance)
    - Use: Standard deviation is instrumental in understanding the average distance of each data point from the mean. A low standard deviation suggests data points are close to the mean, indicating a more stable and reliable dataset.
  • Coefficient of Variation: The coefficient of variation (CV) is a relative measure of variability. It expresses the standard deviation as a percentage of the mean.
    - Formula: CV = (Standard Deviation / Mean) * 100
    - Use: CV is useful when comparing variation degrees from one data series to another, even if the means are dramatically different.
Variance of SKU B: 71.27
Standard Deviation of SKU B: 8.44
Coefficient of Variation of SKU B: 20.11%
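A short sketch of how these four measures can be computed with pandas. Note that pandas defaults to the sample versions (ddof=1), so ddof=0 is passed here to match the divide-by-n formulas above. The series values are illustrative, not the real SKU B data, so the outputs will differ from the figures shown.

```python
import pandas as pd

sku_b = pd.Series([23, 35, 41, 44, 48, 39, 52, 62, 40, 45])  # illustrative values

data_range = sku_b.max() - sku_b.min()
variance = sku_b.var(ddof=0)               # population variance: Sum[(x - μ)²] / n
std_dev = sku_b.std(ddof=0)                # square root of the variance
cv = std_dev / sku_b.mean() * 100          # relative spread, in percent

print(f"Range: {data_range}")
print(f"Variance: {variance:.2f}")
print(f"Standard Deviation: {std_dev:.2f}")
print(f"Coefficient of Variation: {cv:.2f}%")
```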

Measures of Localization

These measures give us more granular insight about the distribution of data within the dataset.

  • Percentile: A percentile indicates the value below which a given percentage of observations in a dataset fall. It’s useful when we want to understand the relative standing of a value within a data set.
  • Decile: Deciles are similar to percentiles, but they split the data into ten equal parts instead of a hundred. They give us a more generalized view of the distribution.
  • Quartile: Quartiles divide the data into four equal parts, and they are especially useful when we want to understand the spread and center of our dataset in broad strokes.
The percentile, decile and quartile of SKU C (Image by Author)
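All three of these cut points come from the same pandas method, .quantile(); only the probabilities differ. A minimal sketch with illustrative values standing in for SKU C:

```python
import pandas as pd

sku_c = pd.Series([30, 33, 35, 38, 40, 42, 43, 45, 48, 50])  # illustrative values

# Percentile: e.g., the value below which 90% of observations fall
print(sku_c.quantile(0.90))

# Deciles: the nine cut points that split the data into ten equal parts
print(sku_c.quantile([i / 10 for i in range(1, 10)]))

# Quartiles: Q1, Q2 (the median), and Q3
print(sku_c.quantile([0.25, 0.50, 0.75]))
```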

Measures of Symmetry

These measures help us understand the symmetry and structure of our data distribution.

  • Skewness: Skewness measures the degree and direction of asymmetry. A dataset is symmetric if it looks the same to the left and right of the center point; skewness captures the relative size of the two tails.
    - Formula: Skewness = E[(x - μ)³] / σ³
    - Interpretation:
    — A skewness value > 0 means that there is more weight in the right tail of the distribution (right-skewed or positively skewed).
    — A skewness value < 0 means that there is more weight in the left tail of the distribution (left-skewed or negatively skewed).
    — A skewness value = 0 indicates that the values are symmetrically distributed around the mean (no skew).
  • Kurtosis: Kurtosis measures whether the data are peaked or flat relative to a normal distribution. It indicates whether the tails of a distribution contain extreme values.
    - Formula: Kurtosis = E[(x - μ)⁴] / σ⁴
    - Interpretation:
    — A kurtosis value > 3 indicates a distribution with heavier tails and a sharper peak than the normal distribution (leptokurtic).
    — A kurtosis value < 3 suggests a distribution with lighter tails and a flatter peak than the normal distribution (platykurtic).
    — A kurtosis value = 3 is equivalent to the normal distribution (mesokurtic).
    - Note: Many libraries, including pandas, report excess kurtosis (kurtosis - 3), so their reference point is 0 rather than 3. The SKU C value below is excess kurtosis.
Skewness of SKU C: -0.042
Kurtosis (excess) of SKU C: -0.599
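Both measures are one-liners in pandas; keep the excess-kurtosis convention in mind when reading the output. The series below is an illustrative stand-in for SKU C, so the numbers will differ from those above.

```python
import pandas as pd

sku_c = pd.Series([30, 33, 35, 38, 40, 42, 43, 45, 48, 50])  # illustrative values

skew = sku_c.skew()      # > 0: right-skewed, < 0: left-skewed, near 0: symmetric
kurt = sku_c.kurtosis()  # pandas returns EXCESS kurtosis (kurtosis - 3),
                         # so compare against 0, not 3

print(f"Skewness: {skew:.3f}")
print(f"Kurtosis (excess): {kurt:.3f}")
```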

Special Note: Powerful Python Libraries For EDA

Practically, Python offers a rich suite of tools specifically designed to aid in descriptive analytics.

  • .describe() Function: A convenient function that offers a quick statistical summary of your data: count, mean, standard deviation, quartiles, and more, providing a brief snapshot of your dataset (see the sketch below).
.describe() function (Image by Author)
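A minimal sketch of the call (the file name here is a hypothetical placeholder; the real data is in the repository):

```python
import pandas as pd

df = pd.read_csv("sales.csv")  # hypothetical file name; see the repo for real data

# count, mean, std, min, 25%/50%/75% quantiles, and max per numeric column
print(df.describe())
```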
  • Python Libraries For EDA: Python’s wealth extends to specialized libraries built for Exploratory Data Analysis (EDA). These libraries come equipped with tools that facilitate detailed analysis and visualization of your data.

1. YData Profiling (previously Pandas Profiling): Use this tool to generate an EDA report covering the number of variables and observations, missing-value and duplicate-value percentages, per-variable data profiles, scatter plots, correlations, and more.

Overview YData Profiling report (Image by Author)
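A minimal usage sketch, assuming a DataFrame loaded from a hypothetical sales.csv:

```python
import pandas as pd
from ydata_profiling import ProfileReport

df = pd.read_csv("sales.csv")  # hypothetical file name

profile = ProfileReport(df, title="SKU Sales Profile")
profile.to_file("ydata_profiling_report.html")  # open this file in a browser
```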

2. D-Tale: This library opens an interactive interface in a separate browser tab. It boasts many useful functions such as describe, clean columns, summarize data, time series analysis, duplicate detection, missing-value analysis, feature analysis, correlations, charts, and more.

Overview D-Tale report (Image by Author)
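A minimal usage sketch, again assuming a hypothetical sales.csv:

```python
import pandas as pd
import dtale

df = pd.read_csv("sales.csv")  # hypothetical file name

# Launches the interactive D-Tale grid in a new browser tab
dtale.show(df).open_browser()
```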

3. SweetViz: This library presents data profiling of the overall dataset and each variable. It also features an “Associations” button to display a correlation chart.

Overview SweetViz report (Image by Author)
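A minimal usage sketch under the same assumption:

```python
import pandas as pd
import sweetviz as sv

df = pd.read_csv("sales.csv")  # hypothetical file name

report = sv.analyze(df)
report.show_html("sweetviz_report.html")  # writes the report and opens it
```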

4. DataPrep: This tool offers an overview report, data profiling of each variable, a correlation chart, and missing value analysis.

Overview DataPrep report (Image by Author)
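And a minimal usage sketch for DataPrep, with the same hypothetical file name:

```python
import pandas as pd
from dataprep.eda import create_report

df = pd.read_csv("sales.csv")  # hypothetical file name

report = create_report(df)
report.show_browser()  # opens the HTML report; report.save(...) writes it to disk
```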

The concept of descriptive analysis is not limited to Python; it can be applied in various tools such as Excel, Power BI, Tableau, and more. While tools provide support, knowledge remains the core foundation.

A solid understanding of descriptive analytics is the key to unlocking the true potential of data. These measures illuminate hidden patterns and trends, leading to informed decision-making and a competitive edge. Python’s vast toolset enhances this process, transforming raw data into valuable insights. All of the code and raw files are available in my GitHub repository [Link]

As we move forward, ‘Time Series Visualization’ awaits us. This upcoming chapter will enable us to track our data’s journey over time, revealing even more profound insights. Stay tuned!

Thank you for taking the time to read this article! If you found it enjoyable, I recommend checking out these other stories for your reading pleasure.

Data Science (26 stories) · Dashboard (3 stories) · Donato_Journey (5 stories) · Course_Review (3 stories)

Please feel free to contact me; I am happy to share and exchange ideas on topics related to Data Science and Supply Chain.
Facebook: facebook.com/nattapong.thanngam
LinkedIn: linkedin.com/in/nattapong-thanngam

Donato_TH (Donato Story)

Data Science Team Lead at Data Cafe, Project Manager (PMP #3563199), Black Belt-Lean Six Sigma certificate