Are you ready to be a Data Analyst?
Data Inspection
Once the business objective has been established, a data analyst will then set out to understand their data. This phase will introduce more technical work involving the initial data collection. There are three major areas:
- Collect initial data: Having already identified your data sources, you now explore your company’s databases and reports, or extract external data through web scraping, to build your initial dataset.
- Determine data availability: Identify how often the data is gathered or updated, and how you will be able to access it for future use.
- Explore data and characteristics: Take a first look to identify important variables, data types, and the format of your data. You will also determine whether you need to gather more data or enrich what you have, and identify the initial data that will be used for analysis.
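A first look at variables, data types, and format can be sketched in a few lines. The example below is a minimal illustration, not a full profiling tool: the sample rows, the column names, and the helper names `infer_type` and `profile` are all made up for this sketch, and it assumes the data arrives as strings, the way `csv.DictReader` would deliver it.

```python
from collections import Counter

# Hypothetical sample rows (assumption), shaped like csv.DictReader output:
# every value is a string, and an empty string means the value is missing.
rows = [
    {"order_id": "1001", "category": "Electronics", "amount": "249.90", "order_date": "2023-01-15"},
    {"order_id": "1002", "category": "Fashion", "amount": "", "order_date": "2023-01-16"},
    {"order_id": "1003", "category": "fashion", "amount": "59.00", "order_date": "2023-01-16"},
]


def infer_type(value):
    """Guess a simple type label for one raw string value."""
    if value == "":
        return "missing"
    for label, cast in (("int", int), ("float", float)):
        try:
            cast(value)
            return label
        except ValueError:
            continue
    return "text"


def profile(rows):
    """Return, per column, the observed type labels and the number of distinct values."""
    report = {}
    for column in rows[0]:
        values = [row[column] for row in rows]
        report[column] = {
            "types": Counter(infer_type(v) for v in values),
            "distinct": len(set(values)),
        }
    return report


for column, info in profile(rows).items():
    print(column, dict(info["types"]), "distinct:", info["distinct"])
```

Even on three rows, a profile like this surfaces things worth following up: a missing `amount`, and a `category` column where "Fashion" and "fashion" are counted as different values.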
Data Pre-processing & Preparation
- Data validation: You want to ensure that your data follows the business logic or rules. You can check this by exploring the ranges and formatting of your variables. Mistakes can occur, especially when data is entered manually. Essentially, you verify whether your data makes sense. For example, a customer table that contains an age of 150 indicates an error.
- Missing value treatment: Missing data is a common issue during data cleaning. Treatment methods include deletion, ignoring the missing data, adding a column that flags missing data, and imputation. More details on these methods will be explained in
- Removing duplicates: Duplicates lead to misleading numbers and incorrect conclusions.
- Outlier treatment: Identify outliers and determine how you will treat them. An outlier is a data point that is significantly different from most of the other data points. Outliers usually result from natural variation in a process or from errors. Treatment methods include ignoring them, imputation, deletion, and transformation.
- Data normalization: As data moves through different tools and phases of the data pipeline, data types may be converted unintentionally. Here you fix the formatting, convert units of measurement, and standardize categorical data.
- Feature engineering: Transform variables to better represent the data. This can involve binning, aggregations, or combining variables.
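The steps in this list can be chained into one small cleaning function. What follows is a sketch under stated assumptions rather than a recommended recipe: the customer records, the 0 to 120 age rule, median imputation, the 1.5 × IQR outlier fence, and the age bins are all illustrative choices, and `preprocess` is a hypothetical helper name.

```python
from statistics import median, quantiles

# Hypothetical raw records (assumption); None marks a missing value.
customers = [
    {"id": 1, "age": 34, "country": "US ", "spend": 120.0},
    {"id": 2, "age": 150, "country": "us", "spend": 80.0},       # breaks the age rule
    {"id": 3, "age": None, "country": "Canada", "spend": 95.0},
    {"id": 3, "age": None, "country": "Canada", "spend": 95.0},  # exact duplicate
    {"id": 4, "age": 41, "country": "canada", "spend": 5000.0},  # unusually high spend
    {"id": 5, "age": 29, "country": "US", "spend": 110.0},
    {"id": 6, "age": 52, "country": "US", "spend": 130.0},
]


def preprocess(records, min_age=0, max_age=120):
    # Removing duplicates: keep the first occurrence of each identical record.
    seen, rows = set(), []
    for rec in records:
        key = tuple(sorted(rec.items(), key=lambda kv: kv[0]))
        if key not in seen:
            seen.add(key)
            rows.append(dict(rec))  # copy so the input is left untouched

    # Data validation: an age outside the assumed business rule is treated as missing.
    for row in rows:
        if row["age"] is not None and not (min_age <= row["age"] <= max_age):
            row["age"] = None

    # Missing value treatment: impute with the median of the valid ages.
    fill = median(r["age"] for r in rows if r["age"] is not None)

    # Outlier treatment: flag (not delete) values outside the 1.5 * IQR fences.
    q1, _, q3 = quantiles([r["spend"] for r in rows], n=4)
    low, high = q1 - 1.5 * (q3 - q1), q3 + 1.5 * (q3 - q1)

    for row in rows:
        if row["age"] is None:
            row["age"] = fill
        # Data normalization: standardize the categorical country labels.
        row["country"] = row["country"].strip().upper()
        row["spend_outlier"] = not (low <= row["spend"] <= high)
        # Feature engineering: bin age into coarse groups.
        if row["age"] < 30:
            row["age_group"] = "under 30"
        elif row["age"] < 50:
            row["age_group"] = "30-49"
        else:
            row["age_group"] = "50+"
    return rows


for row in preprocess(customers):
    print(row)
```

Flagging outliers instead of deleting them keeps the decision visible: you can inspect the flagged rows first and decide later whether they are errors or genuine large orders.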
The EDA Process
The Exploratory Data Analysis (EDA) process is a flexible framework that will help you understand your dataset’s structure, peculiarities, and patterns rather than being a rigid set of guidelines. Typically, the process starts with formulating your interest-driven questions and determining the data required to provide answers. The real journey begins once you have the data in your possession.
- Data cleaning typically comes first in the process. You will fill in any missing values, eliminate duplicates, and fix any errors here. If you skip this step, your subsequent analyses may contain noise and error. It’s important to remember that data cleaning can be an iterative process revisited as your investigation progresses.
- Data summarization, the next step, involves using descriptive statistics like mean, median, variance, and standard deviation. This is a great place to start looking for the first patterns or anomalies that require additional investigation.
- The next stage is data visualization, which is essential for comprehending the underlying structure of the data. When compared to just numerical summaries, visuals like histograms, box plots, and scatter plots help the reader understand the data more quickly and intuitively. They enable you to support or refute initial hypotheses and support the development of new ones.
- Insight generation is the last step, bringing everything together. By combining the numerical summaries and the visuals, you should be able to answer your initial questions and arrive at insights that are actionable or worth further research. You can also point out areas where more information is needed for a more complete understanding. Source.
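To make the summarization and visualization steps concrete, here is a small sketch that uses only Python’s standard library. The order amounts are invented for illustration, `summarize` and `text_histogram` are hypothetical helper names, and the text histogram is a stand-in for a real chart from a plotting library.

```python
from collections import Counter
from statistics import mean, median, stdev, variance

# Hypothetical order amounts (assumption), with one unusually large order.
amounts = [20, 22, 25, 25, 27, 30, 31, 35, 38, 120]


def summarize(values):
    """Descriptive statistics: sample variance and sample standard deviation."""
    return {
        "n": len(values),
        "mean": mean(values),
        "median": median(values),
        "variance": variance(values),
        "stdev": stdev(values),
    }


def text_histogram(values, bin_width):
    """Count values per fixed-width bin and render one row of '#' per non-empty bin."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    lines = []
    for start in sorted(counts):
        label = f"{start}-{start + bin_width - 1}"
        lines.append(f"{label:<9} {'#' * counts[start]}")
    return lines


summary = summarize(amounts)
print(summary)
print("\n".join(text_histogram(amounts, bin_width=10)))
```

In this toy data the mean sits well above the median, which is the kind of first signal the summary step is meant to surface; the histogram then shows why, with a single order far from the rest.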
Following the previous concepts, I’ve worked through a case study on a company’s sales.
Case Study:
You work as a data analyst for ABC, an online retailer that offers everything from fashion to electronics. For the last six months, the company’s sales have been declining, and the management is worried about the viability of the enterprise going forward. The management wants to understand the underlying factors contributing to the decline in sales. They are particularly interested in:
- Identifying the categories of products that are performing well or poorly.
- Understanding customer behavior, including spending patterns and frequency of purchases.
- Evaluating the effectiveness of various sales channels.
Your task is to conduct an exploratory data analysis to uncover insights that can help the company reverse the declining sales trend.
Solution:
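One possible starting point for the three questions in the case, sketched under assumptions: the order records below are invented, the field names (`month`, `customer`, `category`, `channel`, `amount`) are hypothetical, and `revenue_by` and `customer_behavior` are illustrative helpers, not ABC’s actual schema or a complete analysis.

```python
from collections import defaultdict

# Hypothetical order records (assumption); real data would come from ABC's systems.
orders = [
    {"month": "2023-01", "customer": "c1", "category": "Electronics", "channel": "web", "amount": 300.0},
    {"month": "2023-01", "customer": "c2", "category": "Fashion", "channel": "app", "amount": 80.0},
    {"month": "2023-02", "customer": "c1", "category": "Electronics", "channel": "web", "amount": 250.0},
    {"month": "2023-02", "customer": "c3", "category": "Fashion", "channel": "web", "amount": 60.0},
    {"month": "2023-03", "customer": "c2", "category": "Fashion", "channel": "app", "amount": 90.0},
    {"month": "2023-03", "customer": "c1", "category": "Electronics", "channel": "web", "amount": 120.0},
]


def revenue_by(orders, field):
    """Total revenue per (month, field value), e.g. per category or per sales channel."""
    totals = defaultdict(float)
    for order in orders:
        totals[(order["month"], order[field])] += order["amount"]
    return dict(totals)


def customer_behavior(orders):
    """Purchase frequency and average order value for each customer."""
    spend = defaultdict(list)
    for order in orders:
        spend[order["customer"]].append(order["amount"])
    return {
        customer: {"orders": len(values), "avg_order": sum(values) / len(values)}
        for customer, values in spend.items()
    }


by_category = revenue_by(orders, "category")
by_channel = revenue_by(orders, "channel")
behavior = customer_behavior(orders)

for key in sorted(by_category):
    print(key, by_category[key])
```

Comparing the monthly totals per category and per channel shows where revenue is shrinking, while the per-customer figures separate a drop in purchase frequency from a drop in order size.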
Don’t forget to follow me,
Portfolio: yesnersalgado.me