Understand Data Analysis Process Step by Step

SONALI SHAW
8 min read · Aug 7, 2019


Often, when we talk about data analysis projects, nobody seems to be able to give a solid explanation of how the entire process goes, from gathering the data all the way up to analysing it and presenting the results.

In this post, I break down the Data Analysis framework, taking you through each step of the project lifecycle.

First, let us understand: what is data analysis, and why is it important?

In simple words, data analysis is the process of collecting data and organizing it so that one can draw conclusions from it.

In technical terms, data analysis is the process of inspecting, cleansing, transforming and modeling data with the goal of discovering useful information, informing conclusions and supporting decision-making.

Data analysis is important in business for understanding the problems facing an organisation and for exploring data in meaningful ways. Data in itself is merely facts and figures; data analysis organizes, interprets, structures and presents data as useful information that gives it context.

I am going to explain the data analysis process with the help of the Titanic dataset, which is available on Kaggle.

Now that we have the dataset, we can start the data analysis process.

The data analysis process includes five steps:

  1. Asking the Right Questions
  2. Data Wrangling or Data Munging or Data pre-processing
  3. Exploratory Data Analysis(EDA)
  4. Drawing Conclusions
  5. Communicating Results

The first step of the data analysis process is

Asking the Right Questions:

To ask questions, you need data. There can be two scenarios. In the first, you get the company's data from your manager (or collect it from a site like Kaggle) and your manager tells you exactly which questions to answer. In the second, you get the data but no questions; you are given an open-ended problem such as: "Here is my company's data from last year. Analyse it and tell me whether we will make more profit this year."

Now you have to ask questions that give you a clear idea about your dataset. Your questions should be clear and concise.

Questions you can ask:

  1. What features will contribute to my analysis?
  2. What features are not important for my analysis?
  3. Which of the features have a strong correlation?
  4. Do I need data preprocessing?
  5. What kind of feature manipulation/engineering is required?

Example: the Titanic dataset contains 12 columns and 891 rows, where each row holds information about one passenger. Suppose you have to build a Titanic survivor predictor; you would start by asking the questions above of the dataset. There are no fixed rules to follow here, only one: ask the right questions of your data. How do you ask better questions? First, it depends on subject-matter expertise: if you know your domain well, you can ask better questions. Second, it depends on experience: if you have worked many times on the same kind of domain, you can more easily spot relationships in the data. Your experience and knowledge together will help you ask better questions.

Here you can see all 12 columns; now we are going to look at how survival correlates with the other features.

In the same way, you can go on to ask further questions.
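As a sketch of how you might start interrogating the data with pandas, here is a snippet that checks how the numeric features correlate with survival. It uses a tiny made-up sample in place of the real Kaggle file so that it is self-contained:

```python
import pandas as pd

# Toy sample mimicking a few Titanic columns (made-up rows, not the real data)
df = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0],
    "Pclass":   [3, 1, 2, 3, 1, 3],
    "Sex":      ["male", "female", "female", "male", "female", "male"],
    "Age":      [22.0, 38.0, 26.0, 35.0, 27.0, None],
})

# Which numeric features correlate with survival?
corr = df[["Survived", "Pclass", "Age"]].corr()["Survived"]
print(corr)
```

In the real dataset you would load `train.csv` instead of building the frame by hand; the correlation of `Pclass` with `Survived` is one quick way to see whether class mattered.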

The second step of the data analysis process is

Data Wrangling/Munging:

Data wrangling, sometimes referred to as data munging, is the process of transforming and mapping data from one “raw” data form into another format with the intent of making it more appropriate and valuable for a variety of downstream purposes such as analytics.

In simple words, if your data is not ready for analysis, you have to clean it first; this is known as data cleaning or munging.

There are 3 steps for data wrangling:

2a. Gathering Data:

In this step, you will need to query databases, using technical skills like MySQL to process the data. You may also receive data in file formats like Microsoft Excel. If you are using Python or R, they have specific packages that can read data from these data sources directly into your data science programs.

The different types of databases you may encounter include PostgreSQL, Oracle, and even non-relational (NoSQL) databases like MongoDB. Another way to obtain data is to scrape it from websites using web scraping tools such as Beautiful Soup.

Another popular option for gathering data is connecting to web APIs. Websites such as Facebook and Twitter allow users to connect to their web servers and access their data; all you need to do is use their web API to fetch it.

And of course, the most traditional way of obtaining data is directly from files, such as downloading them from Kaggle, or from existing corporate data stored in CSV (Comma-Separated Values) or TSV (Tab-Separated Values) format. These are flat text files, and you need a parser to read them; in Python, the built-in csv module or pandas can do this for you. The Titanic dataset mentioned above is distributed as a CSV file.
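A minimal sketch of reading CSV data with pandas. To keep the snippet self-contained it parses a small in-memory string shaped like the Titanic file; with the real dataset you would pass the path to `train.csv` instead:

```python
import io
import pandas as pd

# A small CSV snippet in the same shape as the Titanic file (made-up rows)
csv_text = """PassengerId,Survived,Pclass,Sex,Age
1,0,3,male,22
2,1,1,female,38
3,1,3,female,26
"""

# pd.read_csv accepts a file path or any file-like object
df = pd.read_csv(io.StringIO(csv_text))
print(df.shape)  # (3, 5): three rows, five columns
```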

2b. Assessing Data:

Once you’ve gotten your data, it’s time to get to work on it. Start digging to see what you’ve got and how you can link everything together to answer your original goal.

To get an initial feel for your data, ask some basic questions of it, such as:

i) Finding the number of rows/columns (shape)

ii) Data types of the various columns (info())

iii) Checking for missing values (info() or isnull().sum())

iv) Checking for duplicate rows (duplicated())

v) Memory occupied by the dataset (info())

vi) A high-level mathematical overview of the data (describe())

The describe() function in pandas is very handy for getting various summary statistics. It returns the count, mean, standard deviation, minimum and maximum values, and the quantiles of the data.

You can see some missing values in the Age column. How do you know? The count row of the describe() output is lower for Age than for the other columns.

With these basic commands you can get to know your data and form ideas about how to work with it.
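The assessment questions above can be sketched as a few one-liners. Again the snippet uses a toy frame with a deliberately missing Age value, standing in for the real dataset:

```python
import pandas as pd

# Toy frame with one missing Age value (made-up rows)
df = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4],
    "Survived": [0, 1, 1, 0],
    "Age": [22.0, 38.0, None, 35.0],
})

print(df.shape)               # number of rows and columns
print(df.dtypes)              # data type of each column
print(df.isnull().sum())      # missing values per column
print(df.duplicated().sum())  # number of duplicate rows
print(df.describe())          # count, mean, std, min/max, quartiles
```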

2c. Cleaning Data:

The next step (and by far the most dreaded one) is cleaning your data. You’ve probably noticed that even though you have a country feature for instance, you’ve got different spellings, or even missing data. It’s time to look at every one of your columns to make sure your data is homogeneous and clean.

Warning! This is probably the longest, most annoying step of your data project. Data scientists report data cleaning can take up to 80% of the time spent working on a project. So it’s going to suck a little bit, but as long as you keep focused on the final goal, you’ll get through it.

Example:

In the Titanic dataset, you have to decide which columns matter for the prediction and which you can drop. In my opinion, to predict the survivors you do not need the 'Name', 'Ticket', 'Fare' and 'Embarked' columns, so we are going to drop them.

Before dropping any column, think carefully about its relevance to the problem. Drop a column only when you are sure it is not necessary.

In the Age column of the Titanic dataset you will see NaN values, which you have to replace with a float value, because for analysis you need all the data in the Age column to share the same datatype.

The Sex column contains the strings "male" and "female", which take more space than an integer datatype, so you can replace male with 0 and female with 1.

In the fifth row you can see that the Age value is NaN, so we replace it with 0.0.

If you find duplicate data in the dataset, remove it. Now think about how you can clean your data and make it more suitable for analysis.
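The cleaning steps above can be sketched in pandas as follows. The frame is a made-up three-row sample with the same columns the post discusses; filling missing ages with 0.0 follows the post's choice (the median age is a common alternative):

```python
import pandas as pd

# Toy frame with the columns the post discusses (made-up rows)
df = pd.DataFrame({
    "Name": ["A", "B", "C"],
    "Ticket": ["t1", "t2", "t3"],
    "Fare": [7.25, 71.28, 8.05],
    "Embarked": ["S", "C", "S"],
    "Sex": ["male", "female", "female"],
    "Age": [22.0, None, 26.0],
})

# Drop the columns judged unimportant for the prediction
df = df.drop(columns=["Name", "Ticket", "Fare", "Embarked"])

# Encode Sex as an integer: male -> 0, female -> 1
df["Sex"] = df["Sex"].map({"male": 0, "female": 1})

# Fill missing ages with 0.0 (the post's choice)
df["Age"] = df["Age"].fillna(0.0)

# Remove any duplicate rows
df = df.drop_duplicates()

print(df)
```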

The third step of the data analysis process is

Exploratory Data Analysis:

It refers to the critical process of performing an initial investigation of the data in order to discover patterns, spot anomalies, test hypotheses and check assumptions, with the help of summary statistics and graphical representations.

Here you mainly do two things: you explore the data to find patterns and relationships, and you augment the data, for example by building new features from existing columns. These data-augmenting operations are known as feature engineering.
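Both sides of EDA can be sketched briefly: exploring with groupby summaries, and augmenting with a new feature. `FamilySize` (SibSp + Parch + 1) is a commonly engineered Titanic feature used here as an illustration, and the rows are again a made-up sample:

```python
import pandas as pd

# Toy sample of Titanic-like columns (made-up rows)
df = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1],
    "Pclass":   [3, 1, 2, 3, 1],
    "Sex":      ["male", "female", "female", "male", "female"],
    "SibSp":    [1, 1, 0, 0, 0],
    "Parch":    [0, 0, 0, 0, 2],
})

# Explore: survival rate by sex and by passenger class
print(df.groupby("Sex")["Survived"].mean())
print(df.groupby("Pclass")["Survived"].mean())

# Augment: engineer a FamilySize feature from SibSp and Parch
df["FamilySize"] = df["SibSp"] + df["Parch"] + 1
```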

The fourth step of the data analysis process is

Drawing Conclusions:

Now you know your dataset very well and can make predictions from it. In data analysis we draw conclusions with the help of descriptive statistics.

Now, what are descriptive statistics?

Descriptive statistics are brief descriptive coefficients that summarize a given dataset, which can represent either an entire population or a sample of it.

From the Titanic dataset you can answer many questions, such as:

Does being female increase your chance of survival?

Which class is safest for passengers?

In this way you can draw many conclusions with the help of graphs.
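Both questions above reduce to comparing group means. Here is a sketch with a made-up eight-row sample; on the real dataset the same two groupby calls answer the same questions:

```python
import pandas as pd

# Toy Titanic-like sample (made-up rows)
df = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0, 1, 0],
    "Sex":      ["male", "female", "female", "male",
                 "female", "male", "female", "male"],
    "Pclass":   [3, 1, 2, 3, 1, 3, 2, 2],
})

# Survival rate by sex: does being female increase the chance of survival?
by_sex = df.groupby("Sex")["Survived"].mean()

# Survival rate by class: which class is safest?
by_class = df.groupby("Pclass")["Survived"].mean()

print(by_sex)
print(by_class)
```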

Finally we come to the last step of data analysis, which is

Communicating Results:

It is also known as data storytelling.

This step is very important for any data analyst. Here you have to communicate your analysis to your teammates, your boss, the CEO or anyone else. You can present your analysis through graphs, charts and other visual formats.

Here you need good communication skills; you also have to be assertive and confident.
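One simple way to tell the story is a labelled bar chart. This sketch plots illustrative survival rates by sex (the numbers here are made up for the example) and saves the figure to a file; the Agg backend lets it run without a display:

```python
import matplotlib
matplotlib.use("Agg")  # render without a display, e.g. on a server
import matplotlib.pyplot as plt
import pandas as pd

# Illustrative survival rates by sex (made-up numbers for the example)
rates = pd.Series({"female": 0.74, "male": 0.19})

# A labelled bar chart is often clearer than a table of numbers
ax = rates.plot(kind="bar", color=["tab:orange", "tab:blue"])
ax.set_ylabel("Survival rate")
ax.set_title("Titanic survival rate by sex")
plt.tight_layout()
plt.savefig("survival_by_sex.png")
```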

Conclusion:

The fun part of data analysis is that it is not a linear process: you can jump from one step back to any of the others.

The results should be interpreted very carefully, otherwise they may be misleading.
