Analyze Your Data

Ujjainee · Published in CampusX · Aug 7, 2019

A step-by-step walkthrough of the data analysis process.

What is Data Analysis?

Data analysis is the process of evaluating data using analytical and statistical tools to discover useful information and aid in business decision making. It involves inspecting, cleansing, transforming, and modeling data to discover useful information. There are several data analysis methods, including data mining, text analytics, business intelligence, and data visualization.

What are the steps for analyzing data?

  1. Asking Questions
  2. Data Wrangling
  3. Exploratory Data Analysis
  4. Drawing Conclusion
  5. Communicating Results

Asking Questions:

When you get data, the first thing to do is ask yourself questions to understand it. Consider the following dataset (the passenger manifest of the ship Titanic, recording everything from each passenger’s name to their sex and other necessary information):

After reading the file:

import pandas as pd

# Load the Titanic dataset and preview the first ten rows
data = pd.read_csv('titanic.csv')
data.head(10)

After getting this dataset, what would your questions be? The ones below may arise:

  • Who is going to survive?
  • What is the main objective we have to reach?
  • How many males and females were on board?
  • Is there any correlation between the given data and the chances of survival?

How can you ask better, more topic-relevant questions? Start from prompts like these:

  • What exactly do you want to find out?
  • Where will your data come from?
  • Which statistical analysis techniques do you want to apply?
  • Who are the final users of your analysis results?
  • What data visualizations should you choose?
  • How can you create a data-driven culture?

We will walk through the analysis process by taking our question to be “Which category of people survived the most?”

Data Wrangling:

Raw data may be collected in several different formats, so it must be cleaned and converted before data analysis tools can import it. The data analyst aggregates these different forms of data and converts them into a form suitable for the analysis tools. Wrangling is commonly said to take up about 80% of a data scientist’s time, and it is where most of the real value is created. Three main steps are followed during this process:

  • Gathering Data: This is where you get the data from. You might receive data directly at times, but there will be times when you have to collect it from CSV files, APIs, web scraping, or even directly from a database.
  • Assessing Data: This is about the first operations you run on a new dataset, such as info() and shape, to get to know the data better.
Output of the data.info() function on the dataset

From the above picture, we can easily see how useful the info() operation is. In one step we learn a lot, including how many values are missing and where. Here we get a rough idea that a number of passengers have no recorded age, so if we try to relate survival to age we will need to perform more operations on it, as we will see ahead.
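As a minimal sketch of this assessment step (continuing with the data DataFrame loaded earlier), a few commands that summarize the structure and missing values of the dataset:

data.shape                # (number of rows, number of columns)

data.info()               # column names, dtypes, and non-null counts

data.isnull().sum()       # missing values per column

data.duplicated().sum()   # number of exact duplicate rows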

  • Cleaning Data: This includes filling in or removing missing values, removing duplicates, and correcting wrong data types.

In this step we drop the columns that won’t help us in any way, removing unnecessary information so that we get better and more accurate results. A minimal sketch of such code is below.
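The column names here assume the standard Kaggle Titanic CSV (an assumption; adjust them to your file), continuing with the same data DataFrame:

# Drop columns that are unlikely to help answer our question
data = data.drop(columns=['PassengerId', 'Ticket', 'Cabin'])

# Fill missing ages with the median instead of dropping those passengers
data['Age'] = data['Age'].fillna(data['Age'].median())

# Remove exact duplicate rows, if any
data = data.drop_duplicates()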

Exploratory Data Analysis:

Exploratory Data Analysis (EDA) refers to the critical process of performing initial investigations on data to discover patterns, spot anomalies, test hypotheses, and check assumptions with the help of summary statistics and graphical representations.


Explore: This is about finding correlations and covariances, plotting graphs, and doing univariate and multivariate analysis of the given data.
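As a minimal sketch of this exploration (continuing with data, and assuming the Survived, Sex, Age, and Fare columns; seaborn is one plotting choice among many):

import matplotlib.pyplot as plt
import seaborn as sns

# Correlation between the numeric columns
print(data[['Survived', 'Age', 'Fare']].corr())

# Bivariate summary: survival rate by sex
print(data.groupby('Sex')['Survived'].mean())

# Age distribution of survivors vs. non-survivors, split by sex
g = sns.FacetGrid(data, col='Sex', hue='Survived')
g.map(sns.histplot, 'Age', bins=20)
g.add_legend()
plt.show()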

Survival graph of the Titanic dataset

From the above graphs of the Titanic dataset, we can see that men had a fairly high probability of survival between 18 and 30 years old, which is partly, but not fully, true for women as well. For women, the survival chances were higher between 14 and 40. For men, the probability of survival was very low between the ages of 5 and 18. Another thing to note is that infants also had a slightly higher probability of survival.

Augment: When the data has not been processed properly or is not sufficient to work with, we augment it. Augmenting data is about removing outliers, merging data sets, or even adding new columns.
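As a minimal sketch of augmentation (the IQR rule and the FamilySize column are illustrative choices; SibSp and Parch are assumed from the standard Titanic columns):

# Remove Fare outliers using the interquartile-range (IQR) rule
q1, q3 = data['Fare'].quantile([0.25, 0.75])
iqr = q3 - q1
data = data[data['Fare'].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]

# Add a new column: family size = siblings/spouses + parents/children + self
data['FamilySize'] = data['SibSp'] + data['Parch'] + 1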

Drawing Conclusion:

Inferential Statistics: Inferential statistics allows you to make inferences about the population from the sample data.

Total number of survivors
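As a minimal sketch of one such inference (a chi-square test of independence between sex and survival; this particular test is an illustrative choice, not one named above):

from scipy.stats import chi2_contingency

# Contingency table of sex vs. survival
table = pd.crosstab(data['Sex'], data['Survived'])

# Is survival independent of sex? A small p-value suggests it is not.
chi2, p_value, dof, expected = chi2_contingency(table)
print(f'chi2 = {chi2:.2f}, p-value = {p_value:.4f}')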

Descriptive Statistics: A descriptive statistic is a summary statistic that quantitatively describes or summarizes features of a collection of information. Descriptive statistics are just descriptive. They do not involve generalizing beyond the data at hand.
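As a minimal sketch of the descriptive step (same data DataFrame as above):

# Count, mean, std, min/max, and quartiles for every numeric column
print(data.describe())

# A single descriptive figure: the survival rate in this sample
print(data['Survived'].mean())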

Communicating Results:

One of the most important skills for a data scientist is being able to communicate results clearly so that different stakeholders can understand them. Since data projects are collaborative across functions, and data science results are often incorporated into a larger final project, the true impact of a data scientist’s work depends on how well others can understand their insights and take further action.

Result: By performing the above steps, we could figure out that mostly women and children (below 18) survived, as they were given priority.

You would mainly convey your results to your audience through the following channels:

  • In-Person
  • Reports
  • PPTs/Slides
  • Blog Post

Who is your audience?

  • Your team manager: This is the first line of review for any work you do or show to other stakeholders. Your manager may or may not be technical, but they certainly will be communicating with other teams/stakeholders.
  • Line-of-business (LOB) stakeholders: This person could be a product manager, business analyst, or a VP of customer support.
  • Data engineers/engineering team: The team that’s working to deploy the projects.

Conclusion:

The most surprising thing about the above steps is that they need not be performed strictly in order: you don’t have to follow the process one stage after another, and you can be at the results stage and still go back to exploring the data to find out more.
Considering the Titanic dataset, we can conclude that performing the steps of the analysis process let us figure out many things. We started at a point where we were simply given data and ended up with concrete results and a much better understanding of the dataset.
