Data Analysis in 10 Easy Steps
The key steps in the process of turning data into insights
Data Analysis is the process of collecting, transforming, cleansing, and organizing data to discover new information, draw conclusions, make predictions, and support decision-making.
Data Analytics takes various approaches and is used across business, science, and industry. In today’s data-driven world, it plays a huge role in driving informed decision-making.
So, are you looking to start a data analysis project but don’t know how to go about it? Or are you a data analyst wanting to sharpen your process?
Well, then this article is for you!
Here’s a TL;DR version of this article:
10 Steps in Data Analysis
1. Define the question
2. Define the ideal data set
3. Obtain data
4. Clean the data
5. Exploratory Data Analysis
6. Statistical Prediction/modeling
7. Interpret results
8. Challenge results
9. Communicate results
10. Build a Data Product
Want the details? Read on 👇
1. Define the question
Often in data analysis work, your results are based on the requirements of the clients and stakeholders, so having a good understanding of the data at hand is essential.
To do that, you have to start asking the right questions before even starting any analysis.
Defining the question will then help reduce the noise in your dataset and help you focus on the right features of the data.
As a result, this narrowing down of your observation is useful for simplifying your problem.
Defining the question is the most powerful dimension reduction tool you can employ.
Let’s say you want to use marketing data to predict next month’s sales. You should be asking questions such as “What data do we have?”, “How should I sample the data to make sure it is representative?”, “How should I impute the missing values in the data?”, and “What’s the best model for predicting sales with the data we have?”
Without asking the right questions, it will be dangerous and even pointless to randomly apply machine learning techniques to the data. Doing so will produce misleading conclusions and results, and you risk losing credibility as a data analyst.
2. Define the ideal data set
There are six different types of data analysis, and each calls for a different kind of data.
Here’s a brief description of what your data should be for the specific type of data analysis.
- Descriptive: data from the whole population
- Exploratory: a random sample with many variables measured
- Inferential: the right population, randomly sampled (the sampling mechanism is important)
- Predictive: training and test data sets from the same population
- Causal: data from a randomized study (experimental data)
- Mechanistic: data about all components of the system
Once you define what data you need, you should also question whether that data is obtainable.
If you want to analyze worldwide COVID-19 statistics, you need case data from the entire world (descriptive). Or, if you want to determine whether website design A or design B is more effective, that is an inferential analysis, and the data should be randomly sampled.
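For the website-design example, a simple random sample is a one-liner in pandas (the `visitors` DataFrame here is hypothetical):

```python
import pandas as pd

# Hypothetical visitor log: which design each visitor saw, and whether they converted
visitors = pd.DataFrame({
    "design": ["A", "B"] * 500,
    "converted": [0, 1] * 500,
})

# Draw a simple random sample of 100 visitors; fixing random_state
# makes the sample reproducible
sample = visitors.sample(n=100, random_state=42)
```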
3. Obtain data
There are plenty of sources online, such as Kaggle and Google Dataset Search, for finding the datasets you need. These platforms provide a plethora of free datasets to choose from, and most of the time, they are enough.
In some cases, the data you want might not exist, and you have to figure out how to collect it yourself. One common way is to scrape the web with Python using libraries like Scrapy or Beautiful Soup, or with no-code tools such as Octoparse.
This is a perfectly fine approach, but make sure you follow best practices and do not violate any site’s rules.
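As a rough sketch of the Beautiful Soup approach (parsing an inline HTML snippet so it runs offline; a real scraper would fetch the page with `requests.get(url).text` first):

```python
from bs4 import BeautifulSoup

# In a real scraper this HTML would come from requests.get(url).text;
# here we parse a small inline snippet so the example is self-contained
html = """
<ul class="books">
  <li><a href="/b1">Practical Statistics</a></li>
  <li><a href="/b2">Python for Data Analysis</a></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")
# CSS selector: every link inside the books list
titles = [a.get_text() for a in soup.select("ul.books a")]
# titles == ["Practical Statistics", "Python for Data Analysis"]
```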
Want to learn web scraping? Check out “Scraping 100+ Free Data Science Books with Python” for a guide to scraping websites.
4. Clean the data
Once the data is obtained and organized, it might still contain missing values, duplicates, or errors.
To counter these, it’s time to do what every data scientist dreads: data cleaning, a task said to be one of the most time-consuming parts of the job.
Common tasks in data cleaning include deduplication, imputing missing values, and record matching, and the problems are usually identified with tools such as Pandas or Excel.
Outlier detection techniques can deal with quantitative data such as price and sales count, which have a high chance of being input incorrectly.
For textual data, you can also use fuzzy matching techniques to turn similar categories into one. Ex: New York, N.Y., NYC → New York City
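A minimal sketch of these cleaning steps in Pandas (the data and the city mapping are made up):

```python
import numpy as np
import pandas as pd

# Hypothetical sales records with the usual problems: duplicates,
# missing values, and inconsistent category labels
df = pd.DataFrame({
    "city":  ["New York", "N.Y.", "NYC", "Boston", "Boston"],
    "sales": [100, 120, np.nan, 90, 90],
})

# 1. Deduplicate exact copies of the same record
df = df.drop_duplicates()

# 2. Impute missing quantitative values (the median is robust to outliers)
df["sales"] = df["sales"].fillna(df["sales"].median())

# 3. Normalize similar categories into one canonical label
city_map = {"New York": "New York City", "N.Y.": "New York City", "NYC": "New York City"}
df["city"] = df["city"].replace(city_map)
```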
5. Exploratory Data Analysis (EDA)
Once the dataset is cleaned, it can be analyzed, a process known as EDA.
EDA is essential for data analysts to understand the hidden information contained within the data.
The process of EDA commonly starts with visualizing the descriptive statistics of the data, which summarize the characteristics of a data set; the three main kinds are distribution, central tendency, and variability.
An important thing to note is EDA plots do not need to look nice or colorful; they’re only for you to understand the data and are not for presentation purposes.
EDA also includes figuring out any relationships between predictors in the data, which is useful for building the model later.
Throughout this process, you might need to do more data cleaning after discovering more errors in the data or even collect more data to help answer your question better.
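In Pandas, the first pass of EDA can be as simple as (toy data):

```python
import pandas as pd

# Hypothetical cleaned dataset
df = pd.DataFrame({
    "height_cm": [160, 165, 170, 175, 180, 185],
    "weight_kg": [55, 62, 68, 74, 81, 88],
})

# Descriptive statistics: central tendency (mean, 50%), variability (std),
# and distribution (min/max/quartiles) in one call
summary = df.describe()

# Relationships between predictors: a correlation matrix is a quick first look
corr = df.corr()

# Quick-and-dirty plots (no styling needed at this stage):
# df.hist()
# df.plot.scatter(x="weight_kg", y="height_cm")
```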
6. Statistical Prediction/Modeling
Data analysts won’t be delving into machine learning most of the time, but many tools exist today that allow them to build simple ML models in seconds.
From the results of EDA and the question of interest, you should determine which features should and should not be used for modeling.
For example, if you wanted to predict height and found out that weight has a high correlation with the target variable — height — you should use weight to make a better prediction.
The exact method you use, such as the type of machine learning algorithm, should also depend on your goal. Do you want to predict customer churn? Your model should be a classification model. Or if you want to predict house prices in New York City, it’s a regression problem instead.
Any transformation or processing you do should also be accounted for in your model, and you should think about how that will affect the model prediction and how you should interpret the results.
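Sticking with the height example above, a minimal regression sketch with scikit-learn (toy data, and the split sizes are arbitrary):

```python
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Toy example from the text: predict height (target) from weight (feature)
weights = [[55], [62], [68], [74], [81], [88], [58], [70]]
heights = [160, 165, 170, 175, 180, 185, 162, 172]

# Train/test split: a predictive analysis needs both sets
# to come from the same population
X_train, X_test, y_train, y_test = train_test_split(
    weights, heights, test_size=0.25, random_state=0
)

model = LinearRegression().fit(X_train, y_train)
score = model.score(X_test, y_test)  # R² on held-out data
```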
7. Interpret results
From the EDA and prediction process, you should be interpreting your results by using appropriate language such as “X correlated with Y,” “certain variables may be associated with the target variable,” “The obtained R² value tells us that…”, “This model had an accuracy score of …”, etc.
If you’re doing inferential statistics, it’s crucial to interpret all the coefficients in your analysis and relate them specifically to the problem you’re solving, along with key statistics such as p-values, R², and confidence intervals.
If you’re using machine learning models, instead of leaving them as black boxes that spit out predictions, you can utilize techniques to explain the “why” behind their reasoning for their predictions. This is called interpretable machine learning.
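As a small sketch of reading off such statistics, here with `scipy.stats.linregress` on toy weight/height data:

```python
from scipy.stats import linregress

# Toy weight (kg) and height (cm) data
weight = [55, 62, 68, 74, 81, 88]
height = [160, 165, 170, 175, 180, 185]

result = linregress(weight, height)

# Interpretation in plain language, tied back to the problem:
# - slope: each extra kg of weight is associated with ~slope cm of height
# - rvalue**2: share of height's variance explained by weight (R²)
# - pvalue: evidence against "no association" (small p = strong evidence)
r_squared = result.rvalue ** 2
print(f"slope={result.slope:.2f} cm/kg, R²={r_squared:.3f}, p={result.pvalue:.4f}")
```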
8. Challenge results
Once you have your results, a good scientist will challenge all steps in the analysis.
Before bringing your findings to your stakeholders, you should ensure that your actions and choices are scientific and unbiased.
Questions that you can ask yourself include:
- Is this technique up to date with the industry, or are there better ways to solve this?
- Are there other models or methods I can use to analyze this data?
- Have I made the best use of the data? Is there any more data I could collect to provide a more conclusive answer?
- and many more…
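One concrete way to act on the “are there other models?” question is to score candidate models the same way, e.g. with cross-validation in scikit-learn (synthetic data stands in for yours):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for your real data
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Evaluate each candidate with the same 5-fold cross-validation
# before committing to one
for name, model in [("logistic", LogisticRegression(max_iter=1000)),
                    ("tree", DecisionTreeClassifier(random_state=0))]:
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean accuracy {scores.mean():.3f}")
```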
9. Communicate results
Once all that’s done, it’s time to share the hard work you have been doing!
You can report your results in different formats, such as charts and graphs made with Tableau, or slides for presenting your analysis.
Below is a good template:
- Start with your question and your problem statement
- Summarize your analysis into a story.
- Include only the essential analysis that adds value to your story and addresses the question.
- Add “pretty” figures that contribute to the story.
- Conclude with a summary of the important findings and any further techniques that can be explored to better answer the question.
In general, you should be giving clear explanations about your actions, explaining why you did what you did, along with uncertainties and assumptions with your analysis.
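For the “pretty figures” step, a minimal matplotlib sketch that adds the title and axis labels an audience needs (the sales numbers are made up):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs anywhere
import matplotlib.pyplot as plt

# Hypothetical monthly sales from the analysis
months = ["Jan", "Feb", "Mar", "Apr"]
sales = [120, 135, 150, 170]

fig, ax = plt.subplots(figsize=(6, 4))
ax.bar(months, sales, color="#4C72B0")
ax.set_title("Monthly sales are trending up")  # lead with the takeaway
ax.set_xlabel("Month")
ax.set_ylabel("Sales (units)")
fig.tight_layout()
fig.savefig("monthly_sales.png")
```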
10. Build a Data Product
Most of the time, if you’re working in a company, your analysis work will act as a tool to increase the productivity and efficiency of other members.
For example, say you analyzed social media metrics and marketing data and communicated all your results in a presentation. The marketing lead loved your work, and now you are tasked with building a dashboard for the marketing team to find out what kinds of social media posts get the most engagement, who their best customers are, and so on.
Or, if you love analyzing data for fun, you can turn your analysis work into a data product using Streamlit, a Python tool that allows you to turn your analysis into an interactive web app!
That’s all for this article! Let me know what you think about it and whether it’s missing any steps.
Here are some resources for those who want to learn more about data analysis!
- Google Data Analytics Certificate
- Data Analysis with Python for Excel Users
- Python For Data Analysis 3rd edition
Thanks for reading!
Liked this article? Here are some articles you may enjoy:
- Using Data Science to Predict Viral Tweets
- Sentiment Analysis on Reddit Tech News with Python
- The Missing Library in your Machine Learning Workflow