CRISP-DM: The Art of Data Understanding

Data Mastery Series — Episode 3: Data Understanding

Donato_TH
Donato Story
6 min read · Feb 2, 2023


If you are interested in articles related to my experience, please feel free to contact me: linkedin.com/in/nattapong-thanngam

CRISP-DM framework (Image by Author)

Data analysis helps modern businesses make informed decisions, but it is only effective when the analyst clearly understands the data and its characteristics. This is where understanding data needs, data profiling, and data visualization come into play. This article covers the key elements of each: how to understand what the data means, how to profile it, and which popular graphs help make sense of it before deeper analysis.

1. Understanding data (Image by Author)

1. Understanding data:

  • Understanding Meaning & Symbol: Before analyzing data, it is important to understand what the data represents. This includes understanding the meaning of each variable and its symbol, as well as any units or scales used in the data. This helps ensure that the data is interpreted correctly and used in the right context.
  • Identifying Available & Needed Data: Identifying what data is available and what data is needed is a crucial step in the data analysis process. It helps ensure that the right data is collected and used for the analysis. The available data can be from internal sources, such as company databases, or external sources, such as public data sets.
  • Exploratory Data Analysis: Exploratory data analysis is the first step in analyzing data. It involves reviewing and summarizing the data, identifying patterns, trends, and relationships, and visualizing the data. This helps in understanding the overall structure and characteristics of the data and identifying any issues that may impact the accuracy and reliability of the data.
  • Understanding Data Characteristics: The quality and completeness of the data determine whether it is fit for analysis. Review the data for missing values, outliers, and anomalies, and check its accuracy and consistency. Because data quality directly affects the results of the analysis, resolve or document these issues before proceeding.
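The first checks in the list above can be sketched in a few lines of plain Python. This is a minimal illustration, not a prescribed method: the data dictionary, the field names (`order_id`, `weight`, `region`, `ship_date`), and the helper functions are all hypothetical.

```python
# Hypothetical data dictionary: the meaning and unit of each variable.
DATA_DICTIONARY = {
    "order_id": {"meaning": "unique order identifier", "unit": None},
    "weight": {"meaning": "shipment weight", "unit": "kg"},
    "region": {"meaning": "sales region", "unit": None},
}


def check_availability(rows, needed):
    """Compare the fields needed for the analysis with the fields present."""
    available = set()
    for row in rows:
        available.update(row.keys())
    return {
        # Needed for the analysis but not found in the data.
        "missing_fields": sorted(set(needed) - available),
        # Present in the data but not described in the data dictionary.
        "undocumented_fields": sorted(available - set(DATA_DICTIONARY)),
    }


def count_missing(rows, fields):
    """Count records where a field is absent, None, or an empty string."""
    return {
        field: sum(1 for row in rows if row.get(field) in (None, ""))
        for field in fields
    }


# Made-up sample records, used only to exercise the helpers.
rows = [
    {"order_id": 1, "weight": 12.5, "region": "North"},
    {"order_id": 2, "weight": None, "region": "South"},
    {"order_id": 3, "weight": 8.0, "region": ""},
]

report = check_availability(rows, needed=["order_id", "weight", "ship_date"])
missing = count_missing(rows, ["order_id", "weight", "region"])
print(report)
print(missing)
```

Here `report` flags `ship_date` as needed but unavailable, and `missing` shows one gap each in `weight` and `region`. On a real project the same checks would usually run against a database table or a DataFrame rather than a list of dictionaries.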
2. Key Elements of Data Profiling (Image by Author)

2. Key Elements of Data Profiling

Data profiling is the process of examining and summarizing the characteristics of a data set. It is an important step in understanding the data and ensuring its quality and reliability. Profiling reveals issues and patterns in the data, such as missing values, outliers, and anomalies, and shows how each variable is distributed. The key elements of data profiling are listed below, along with how each contributes to effective data analysis.

  1. Number of Variables: Data profiling starts with identifying the number of variables in the data set. This helps in determining the scope of the data and the complexity of the analysis. The number of variables also determines the type of visualizations and methods used for analysis.
  2. Categorical/Numeric: The type of data, either categorical or numeric, plays a significant role in the analysis. Categorical data takes one of a limited set of labels, such as gender, region, or product type; the labels may be stored as text or as numeric codes. Numeric data measures a quantity, such as income, height, or weight. Understanding the type of data helps in selecting the appropriate methods for analysis.
  3. Count and Count Distinct: Count and count distinct are basic measures computed for each variable. Count is the number of non-missing values, and count distinct is the number of different values among them. Together they indicate the size of the data and its diversity.
  4. Missing Values: Missing values can impact the accuracy and reliability of the data analysis. Data profiling helps in identifying the number of missing values and their distribution in the data set. This information is important in determining the methods used for handling missing values, such as imputation or removal of observations.
  5. Min/25%/50%/75%/Max: These statistics represent the minimum, first quartile, median, third quartile, and maximum values for each variable. They provide an overview of the distribution of values for each variable and can be useful for identifying outliers and anomalies.
  6. Mean and Standard Deviation: Mean and standard deviation are measures of central tendency and dispersion, respectively. The mean gives the average value of the data, while the standard deviation indicates how widely the values spread around that average. Both are sensitive to extreme values, so comparing the mean with the median is a quick way to spot skew or outliers.
  7. Outliers and Anomalies: Outliers and anomalies are values that fall outside the expected range for a variable. These values can impact the accuracy and reliability of the data, so it is important to identify and understand them.
  8. Distribution of Variables: Distribution of variables is an important aspect of data profiling. Understanding the distribution of values can help determine the shape and spread of the data and can be useful for identifying outliers and anomalies.
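The profiling elements above can be computed with Python's standard library alone. The sketch below is one possible implementation under stated assumptions: `None` marks a missing value, the type check is a simple heuristic, and outliers are flagged with the common 1.5 × IQR rule, which this article does not prescribe. The `profile` function and the sample `heights` list are illustrative names and made-up values.

```python
import statistics


def profile(values):
    """Summarize one variable; None is treated as a missing value."""
    present = [v for v in values if v is not None]
    summary = {
        "count": len(present),             # non-missing values
        "distinct": len(set(present)),     # different values among them
        "missing": len(values) - len(present),
    }

    # Heuristic: a variable is numeric if every present value is int or float.
    is_numeric = bool(present) and all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in present
    )
    summary["type"] = "numeric" if is_numeric else "categorical"

    if is_numeric and len(present) >= 2:
        # Quartiles with linear interpolation between data points.
        q1, median, q3 = statistics.quantiles(present, n=4, method="inclusive")
        iqr = q3 - q1
        low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # assumed outlier fences
        summary.update({
            "min": min(present),
            "25%": q1,
            "50%": median,
            "75%": q3,
            "max": max(present),
            "mean": statistics.mean(present),
            "std": statistics.stdev(present),  # sample standard deviation
            "outliers": [v for v in present if v < low or v > high],
        })
    return summary


# Made-up heights in cm, with one missing value and one suspicious entry.
heights = [150, 160, 165, 170, 175, 180, None, 250]
height_profile = profile(heights)
print(height_profile)
```

For this sample the quartiles are 162.5, 170, and 177.5, so the fences sit at 140 and 200 and the value 250 is flagged. With pandas, `DataFrame.describe()` reports most of the same statistics in a single call.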
3. Popular graphs for data understanding (Image by Author)

3. Popular graphs for data understanding

Visualizing data is an essential step in the data analysis process. It allows us to gain insights into the data, identify patterns and relationships, and make informed decisions based on the findings. Many types of graphs can be used to present data, and each has its own strengths and weaknesses. This section covers the most popular graphs for data understanding and the best use cases for each.

  1. Bar chart: Bar charts are ideal for comparing categorical data, such as the number of sales made by different departments or the number of people who voted for different political parties. They provide a clear and straightforward representation of data and allow for easy comparison of different categories.
  2. Histograms: Histograms are used to represent the distribution of a single variable. They show the frequency of data points within certain ranges and provide insights into the shape of the data distribution. This is particularly useful for understanding the distribution of continuous variables, such as weight or height.
  3. Line chart: Line charts are ideal for representing numerical data over time. They are often used to show trends and patterns in data, and they provide a visual representation of changes in data over a specific time period.
  4. Scatter plot: Scatter plots are used to explore the relationship between two variables. They show the relationship between two numerical values by plotting data points on a coordinate plane. This allows us to see how two variables are related, and whether there is a linear or non-linear relationship between them.
  5. Box plot: Box plots summarize the distribution of a numeric variable and make it easy to compare several variables or groups side by side. They show the median, quartiles, and outliers of a dataset in a compact form. This is particularly useful for data with many outliers. A box plot does not reveal multiple peaks, so use a histogram when the shape of the distribution matters.
  6. Correlation Matrix: Correlation matrices summarize the relationships among multiple numeric variables. They show the correlation coefficient, a value between -1 and 1, for each pair of variables in a dataset, and provide a quick way to identify which variables are strongly related to each other. This is particularly useful in a large dataset with many variables. Note that the commonly used Pearson coefficient captures only linear relationships.
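To make the correlation matrix concrete, the sketch below computes Pearson coefficients for every pair of columns in plain Python. It assumes the Pearson coefficient is the measure of interest, which is an assumption rather than something stated here; the column names and values are made up for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences.

    Raises ZeroDivisionError if either sequence is constant.
    """
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Return a nested dict: matrix[a][b] is the correlation of a and b."""
    names = list(columns)
    return {
        a: {b: pearson(columns[a], columns[b]) for b in names}
        for a in names
    }


# Made-up data: weight rises with height, errors fall with height.
data = {
    "height_cm": [150, 160, 170, 180, 190],
    "weight_kg": [50, 58, 66, 74, 82],
    "errors": [9, 7, 5, 3, 1],
}

matrix = correlation_matrix(data)
for name, row in matrix.items():
    print(name, {other: round(r, 2) for other, r in row.items()})
```

In this toy data the relationships are exactly linear, so the coefficients are close to 1 or -1. Real data rarely behaves this cleanly, and a scatter plot of the same pair is a useful check before trusting a single coefficient.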

In conclusion, it is crucial to understand the strengths and weaknesses of each of these popular graphs in order to make informed decisions about the best way to visualize your data. The choice of graph will depend on the type of data being analyzed and the insights being sought. The following figure presents a basic guide to selecting the right graph for data understanding.

6. Basics of graph selection for data understanding (Image by Author)
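The use cases described in the list of popular graphs can also be condensed into a small lookup. The sketch below is a rule of thumb built from that list, not a transcription of the figure. The type labels (`"categorical"`, `"numeric"`, `"time"`) and the function name are hypothetical, and pairing a categorical variable with a numeric one to get a box plot is an assumption.

```python
# Rule-of-thumb lookup keyed by the types of the variables being plotted.
CHART_RULES = {
    ("categorical",): "bar chart",             # compare categories
    ("numeric",): "histogram",                 # distribution of one variable
    ("time", "numeric"): "line chart",         # numeric values over time
    ("numeric", "numeric"): "scatter plot",    # relationship between two variables
    ("categorical", "numeric"): "box plot",    # assumed: distribution per group
}


def suggest_chart(*variable_types):
    """Suggest a chart for one or two variable types, or many numeric ones."""
    if len(variable_types) > 2 and set(variable_types) == {"numeric"}:
        return "correlation matrix"
    return CHART_RULES.get(
        tuple(variable_types), "no simple rule; explore several charts"
    )


print(suggest_chart("time", "numeric"))
print(suggest_chart("numeric", "numeric", "numeric"))
```

A lookup like this is only a starting point. The insight being sought still decides the final choice, for example a histogram instead of a box plot when the shape of the distribution matters.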

Please feel free to contact me. I am happy to share and exchange ideas on topics related to Data Science and Supply Chain.
Facebook: facebook.com/nattapong.thanngam
Linkedin: linkedin.com/in/nattapong-thanngam

Donato_TH
Donato Story

Data Science Team Lead at Data Cafe, Project Manager (PMP #3563199), Black Belt-Lean Six Sigma certificate