The Data Analysis Process

Surabhi Basak
Published in CampusX
5 min read · Aug 7, 2019

In our day-to-day life, we analyze many situations. For example, the pattern of sugar intake in our family may lead us to conclude that older people have less sugar than younger ones. Likewise, Facebook knows how to suggest new friends for us, and Google can complete our search before we have typed the third letter. All of this happens through the process of data analysis. In this article, I will walk you through that process.

Data analysis is the process of evaluating data using analytical and statistical tools to discover useful information and aid decision-making. It involves collecting, cleaning, transforming, and modelling data to find the required information. The results are then communicated to suggest conclusions and support decisions. Data visualization is often used to present the data so that useful patterns are easier to spot.

The data analysis process includes five steps. They are:

  1. Asking Questions: Whenever you get a dataset, start by asking the right questions of it.
  2. Data Wrangling: Also known as data munging, this is the most important step. The data is generally not in the right format, so we need to clean it before processing.
  3. Exploratory Data Analysis: After cleaning the data, we explore it by looking for patterns, plotting graphs, and checking correlations among the attributes.
  4. Drawing Conclusions: Based on what the exploration shows, we decide what the data tells us about our questions.
  5. Communicating Results: After the whole process, we communicate the results to our team.

Now that you know the five steps, let us look at the dataset.

This is a CSV dataset of Titanic passengers. It records details such as the class in which each passenger was travelling and whether they survived. First, we can check the size of the data with

import pandas as pd
titanic = pd.read_csv('titanic.csv')

titanic.shape

This outputs (891, 12), which means there are 891 rows and 12 columns. Let's see which columns are present in the dataset.

titanic.columns

The columns are:

  1. PassengerId: it stores the passenger ID.
  2. Survived: it stores 1 if the passenger survived, else 0.
  3. Pclass: it stores the class in which the passenger was travelling, as 1, 2 or 3.
  4. Name: it stores the name of the passenger.
  5. Sex: it stores either male or female.
  6. Age: it stores the age of the passenger.
  7. SibSp: it stores the number of siblings or spouses travelling with the passenger.
  8. Parch: it stores the number of parents or children travelling with the passenger.
  9. Ticket: it stores the ticket number.
  10. Fare: it stores the fare paid.
  11. Cabin: it stores the cabin number.
  12. Embarked: it stores the port where the passenger boarded (C, Q or S).
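The loading and inspection calls above can be tried without the real file. The sketch below is only an illustration: it reads a made-up three-row CSV from an in-memory string (the `csv_text` and `sample` names are hypothetical) and checks its shape and columns the same way.

```python
import io
import pandas as pd

# Made-up three-row sample; the real titanic.csv has 891 rows and 12 columns.
csv_text = """PassengerId,Survived,Pclass,Sex
1,0,3,male
2,1,1,female
3,1,3,female
"""

# read_csv accepts a file path or a file-like object such as StringIO.
sample = pd.read_csv(io.StringIO(csv_text))

# shape is a (rows, columns) tuple.
n_rows, n_cols = sample.shape
print(n_rows, n_cols)

# columns holds the header names in file order.
print(list(sample.columns))
```

With the real file, only the argument to `read_csv` changes; the rest of the inspection stays the same.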

Now that you know a bit about the data, let's begin by asking questions.

Asking Questions

First, we need to think about which features will contribute to the analysis. I think PassengerId, Name, Ticket and Cabin are not required, so we will remove them during data wrangling. Next, we can ask whether Age has any relation to Survived.
As we all know, women and children were saved first on the Titanic. So our next question can be: are there more women and children among the survivors?
Finally, we can also check whether there is any relation between Embarked and Survived.
But before all this, we first need to clean the data.

Data Wrangling

First, let's remove the columns that do not contribute to the analysis. So, we will write,

titanic.drop(columns=['PassengerId', 'Name', 'Ticket', 'Cabin'], inplace=True)

Then we will be left with the columns Survived, Pclass, Sex, Age, SibSp, Parch, Fare and Embarked.

Next, we will remove the rows where values are missing. So, we will write,

titanic = titanic.dropna()
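Before dropping rows, it can help to see how much data would be lost. The sketch below uses a made-up four-row frame (the `toy` data is hypothetical, not taken from the Titanic file) to count missing values per column and compare the row count before and after `dropna()`.

```python
import numpy as np
import pandas as pd

# Made-up data with one missing Age and one missing Embarked value.
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 0],
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Embarked": ["S", "C", None, "S"],
})

# Number of missing values in each column.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# dropna() removes every row that has at least one missing value.
cleaned = toy.dropna()
rows_lost = len(toy) - len(cleaned)
print(rows_lost)
```

If a single column accounts for most of the missing values, dropping that column instead of the rows may keep more data; which choice is better depends on the questions being asked.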

Next, we will change the Sex column, male to 1 and female to 0. So, we will write,

titanic['Sex'] = titanic['Sex'].map({'male': 1, 'female': 0})

Next, if we look at the Embarked column, there are three kinds of values,

titanic['Embarked'].value_counts()

Since Q has fewer rows than S and C, we can remove the Q rows, and then change S to 1 and C to 0.

titanic = titanic[titanic['Embarked'] != 'Q'].copy()
titanic['Embarked'] = titanic['Embarked'].map({'S': 1, 'C': 0})
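The Embarked encoding can be tried on a small frame first. This is a minimal sketch on made-up values (the `toy`, `kept` and `unfiltered` names are hypothetical): it filters out the Q rows before encoding, and also shows what `map()` does when a value has no entry in the dictionary.

```python
import pandas as pd

# Made-up Embarked values, including one Q row.
toy = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})

# Keep only the S and C rows, then encode S as 1 and C as 0.
kept = toy[toy["Embarked"].isin(["S", "C"])].copy()
kept["Embarked"] = kept["Embarked"].map({"S": 1, "C": 0})
print(kept["Embarked"].tolist())

# Without the filter, map() turns the unmapped 'Q' into a missing value.
unfiltered = toy["Embarked"].map({"S": 1, "C": 0})
print(unfiltered.isna().sum())
```

Filtering first keeps the column fully numeric, which matters for the correlation step that follows.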

Now we will check the datatypes.

titanic.info()

Now we can see that all the data is in numerical form, so let's check the correlation between the columns.

Exploratory Data Analysis

titanic.corr()

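Now, from this, we can observe:

  1. There is no positive correlation between Survived and Pclass, Sex, Age, Embarked.
  2. There is no positive correlation between Pclass and Survived, Age, Fare.
  3. There is no positive correlation between Sex and Survived, SibSp, Parch, Fare.
  4. There is no positive correlation between Age and Survived, Pclass, SibSp, Parch, Embarked.
  5. There is no positive correlation between SibSp and Sex, Age.
  6. There is no positive correlation between Parch and Sex, Age.
  7. There is no positive correlation between Fare and Pclass, Sex, Embarked.
  8. There is no positive correlation between Embarked and Survived, Age, Fare.

Keep in mind that a coefficient at or below zero does not mean two columns are unrelated. A negative value means one column tends to go down as the other goes up, so we look at some of these relationships more closely with plots.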
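To make the sign of a correlation coefficient concrete, here is a minimal sketch on made-up numbers (the `toy` values are hypothetical and are not the Titanic results). In this toy data, fares fall as the class number rises, so the Pclass and Fare coefficient comes out negative even though the two columns are clearly related.

```python
import pandas as pd

# Made-up data: higher class number goes with lower fare and lower survival.
toy = pd.DataFrame({
    "Pclass": [1, 1, 2, 2, 3, 3],
    "Fare": [80.0, 70.0, 30.0, 25.0, 8.0, 7.0],
    "Survived": [1, 1, 1, 0, 0, 0],
})

# corr() returns the pairwise Pearson correlation of the numeric columns.
corr = toy.corr()
print(corr)

# A negative value: one column tends to fall as the other rises.
print(corr.loc["Pclass", "Fare"] < 0)

# A positive value: the two columns tend to rise together.
print(corr.loc["Fare", "Survived"] > 0)
```

The closer a coefficient is to -1 or 1, the stronger the linear relationship; values near 0 indicate little or no linear relationship.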

As we all know, when the Titanic was sinking, women were saved first. So we plot a graph with

import seaborn as sns
import matplotlib.pyplot as plt
sns.barplot(x="Sex", y="Survived", data=titanic)
plt.show()

In this plot, 0 is female and 1 is male.
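By default, the height of each bar in `sns.barplot` is the mean of the y column for that group. Since Survived is 0 or 1, that mean is the survival rate. The sketch below computes the same numbers directly with `groupby` on made-up data (the `toy` frame is hypothetical, not the Titanic results).

```python
import pandas as pd

# Made-up data: Sex is 0 for female and 1 for male, as in the article.
toy = pd.DataFrame({
    "Sex": [0, 0, 0, 1, 1, 1, 1],
    "Survived": [1, 1, 0, 0, 0, 1, 0],
})

# The mean of a 0/1 column is the share of ones, i.e. the survival rate.
survival_rate = toy.groupby("Sex")["Survived"].mean()
print(survival_rate)
```

Printing the rates alongside the plot makes it easier to quote exact figures when communicating the results.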

Next, we can observe that Survived also depends on Pclass.

sns.barplot(x="Pclass", y="Survived", data=titanic)
plt.show()

To see how Survived depends on Age,

sns.boxplot(x="Sex", y="Age", hue="Survived", data=titanic)
plt.show()

Here, Sex is 0 for female and 1 for male.
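The earlier question about women and children can also be checked with numbers. The sketch below is one possible approach on made-up data: it splits Age into two groups with `pd.cut` and computes the survival rate for each combination of Sex and age group. The cut-off of 16 and the `AgeGroup` column are assumptions made for illustration, not part of the dataset.

```python
import pandas as pd

# Made-up data: Sex is 0 for female and 1 for male.
toy = pd.DataFrame({
    "Sex": [0, 0, 1, 1, 1, 0],
    "Age": [4.0, 30.0, 8.0, 40.0, 25.0, 52.0],
    "Survived": [1, 1, 1, 0, 0, 1],
})

# Assumed cut-off: ages up to and including 16 count as "child".
age_group = pd.cut(toy["Age"], bins=[0, 16, 120], labels=["child", "adult"])
toy["AgeGroup"] = age_group.astype(str)

# Survival rate for every (Sex, AgeGroup) combination present in the data.
rate = toy.groupby(["Sex", "AgeGroup"])["Survived"].mean()
print(rate)
```

On the real data, a table like this shows whether the survival rate is higher for women and children than for adult men.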

Drawing Conclusions

After the whole process of exploratory data analysis, I can conclude that whether a person survived depends on the class in which he or she was travelling, as well as on sex and age.

Communicating Results

After the whole process, we need to communicate the results to our team, for example through blogs or presentations.

Conclusion

The fun part about the whole thing is that we can start with any step and jump from one step to another. Data analysis is an important job once you have the raw data, and following these five steps will help you through the whole process. Finally, I will end by quoting Jim Bergeson: “Data will talk to you if you are willing to listen”.

Thank You
