Process of Data Analysis

Want to become a Data Analyst? See how a Data Analyst works on a project.

Vishal Malo
6 min read · Aug 3, 2019
Photo by Carlos Muza on Unsplash

Before starting, let us understand what Data Analysis is.

According to Wikipedia,

Data Analysis is a process of inspecting, cleansing, transforming and modeling data to discover useful information, informing conclusions and supporting decision-making.

In simple words, Data Analysis is the process of exploring and transforming data (a sort of playing with the data) to draw conclusions that help a company make better decisions.

Now, there are five steps in the process of Data Analysis:

  1. Asking questions: We need to ask the right questions of the provided dataset before we can proceed to the other steps.
  2. Data Wrangling: The most important and most time-consuming step of all. In simplified terms, we can call it Data Preprocessing. The provided data is rarely in the right format and often contains errors, so before proceeding we need to clean the dataset and bring it into a proper form.
  3. Exploratory Data Analysis: After getting the cleaned data, we explore it: find patterns, plot graphs, and compute correlations to build a basic understanding of the data, or in simple words, to get acquainted with the data before further analysis.
  4. Drawing Conclusions: Once Exploratory Data Analysis is done, we have a clear understanding of our data and can draw conclusions from it.
  5. Communicating Results: Last but not least, once we have our findings and conclusions, we need to communicate them to the team in a way everyone understands.

Now, let us understand the complete thing with a given dataset.

Asking Questions

There are two scenarios for asking questions:

In the first, the manager provides us with the data along with specific questions, and we need to answer those questions.

In the second, the manager provides us with the data but no specific questions. For example, if we are working on a marketing website, we need to analyze the data and work out how the company can profit from it.

The second case is harder than the first, because there we must come up with the proper questions ourselves rather than being handed them.

Titanic Dataset

In the “Titanic” dataset the proper questions could be:

  1. What features will contribute to our analysis?
  2. What features are not required for our analysis?
  3. Which of the features have a strong correlation?
  4. Do we need to do Data Correction/Data Wrangling?
  5. What kind of manipulation/engineering is required?

Now a question could be raised: how can someone ask the proper questions of a dataset?

For that, you need two things. First, Subject Matter Expertise: a person with good knowledge of the domain behind the dataset will ask better questions than someone unfamiliar with the topic. For example, a person who follows cricket will find it easier to work on cricket-related datasets than someone who does not. Second, Experience: this is nothing new, as experience matters in every field, and it is the same here. After working with many datasets, a person will have far fewer problems working on new ones.

Data Wrangling

The second and most important step is Data Wrangling. There are three parts to Data Wrangling:

Data Gathering: Sometimes we need to collect datasets from different sources, such as APIs, web scraping, and databases. Here, the “Titanic” dataset was collected from kaggle.com.
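
Gathering the data usually amounts to loading the downloaded file into a DataFrame. A minimal sketch with pandas, where the inline sample stands in for the Kaggle `train.csv` file (in practice you would call `pd.read_csv("train.csv")` on the downloaded file):

```python
import io

import pandas as pd

# Tiny stand-in for the Kaggle "train.csv" file, so the sketch is self-contained.
csv_data = io.StringIO(
    "PassengerId,Survived,Pclass,Sex,Age,Fare\n"
    "1,0,3,male,22,7.25\n"
    "2,1,1,female,38,71.2833\n"
    "3,1,3,female,26,7.925\n"
)

# In practice: df = pd.read_csv("train.csv")
df = pd.read_csv(csv_data)
print(df.shape)  # (3, 6): rows and columns of the loaded data
```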

Assessing Data: We familiarize ourselves with the data using different functions so that we know what kind of data it is, which datatypes are present, whether any values are missing, etc. Some basic operations used on the “Titanic” dataset are:

Finding the number of rows and columns
Knowing basic things about the data
Checking whether the values in the columns are unique
A high-level mathematical overview of the data
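
These assessing operations map onto a handful of standard pandas calls. A sketch, using a small stand-in for the Titanic DataFrame:

```python
import pandas as pd

# Small stand-in for the Titanic DataFrame; the real one comes from train.csv.
df = pd.DataFrame({
    "Pclass": [3, 1, 3],
    "Sex": ["male", "female", "female"],
    "Age": [22.0, 38.0, None],
    "Fare": [7.25, 71.28, 7.92],
})

print(df.shape)           # number of rows and columns
df.info()                 # column dtypes and non-null counts (basic things about the data)
print(df.nunique())       # unique values per column
print(df.isnull().sum())  # missing values per column
print(df.describe())      # high-level mathematical overview (count, mean, std, quartiles)
```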

Cleaning Data: After assessing the data comes the main part, which is to clean it. If values are missing in a column, we either fill them with the mean of the other values or remove the rows with missing data entirely. If two or more rows contain the same data, we keep one and remove the rest. If a column is stored with an incorrect datatype, we convert it to the correct one. And finally, the columns that are not required for the analysis at all are removed.

In the “Titanic” dataset, not much cleaning was required: only some columns needed to be removed and the datatype of one column needed to be changed.
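
Each of the cleaning steps above has a direct pandas counterpart. A sketch on toy data (the columns and values are illustrative, not the actual Titanic cleaning):

```python
import pandas as pd

# Toy data with a missing Age, a Fare stored as strings, a duplicate row,
# and a mostly-empty Cabin column we do not need.
df = pd.DataFrame({
    "Age": [22.0, None, 30.0, 30.0],
    "Fare": ["7.25", "71.28", "7.92", "7.92"],
    "Cabin": [None, "C85", None, None],
    "Survived": [0, 1, 1, 1],
})

# Fill missing values with the mean of the other values in the column
df["Age"] = df["Age"].fillna(df["Age"].mean())
# Convert a column stored with an incorrect datatype
df["Fare"] = df["Fare"].astype(float)
# Keep one copy of any repeated rows
df = df.drop_duplicates()
# Remove a column not required for the analysis
df = df.drop(columns=["Cabin"])
```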

Exploratory Data Analysis

The two main parts of Exploratory Data Analysis are:

Exploring Data: This is playing with the data: doing different calculations, plotting graphs (data visualization), etc. In “Titanic”, exploring the data is done as follows:

Finding the correlation and covariance
A graph of Fare vs Pclass
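
Both explorations can be sketched in a few lines of pandas; the sample values below stand in for the real dataset, and the grouped mean is a numeric stand-in for the Fare-vs-Pclass graph:

```python
import pandas as pd

# Stand-in sample; the real values come from the Titanic dataset.
df = pd.DataFrame({
    "Pclass": [1, 1, 2, 3, 3],
    "Fare": [71.28, 53.10, 13.00, 7.25, 7.92],
    "Survived": [1, 1, 0, 0, 1],
})

print(df.corr())  # pairwise correlations between the numeric columns
print(df.cov())   # pairwise covariances

# Numeric version of the Fare vs Pclass graph: mean fare per class
print(df.groupby("Pclass")["Fare"].mean())
# The same summary as a bar chart:
# df.groupby("Pclass")["Fare"].mean().plot(kind="bar")
```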

Augmenting Data: This involves removing outliers, adding new columns, merging two DataFrames, etc. In the “Titanic” dataset, augmenting the data was not required.
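
Although it was not needed here, augmenting typically looks something like the following sketch: dropping outliers (using the common 1.5 × IQR rule as one possible choice) and deriving a new column. The fare values are illustrative:

```python
import pandas as pd

# Illustrative fares, including one extreme outlier.
df = pd.DataFrame({"Fare": [7.25, 7.92, 8.05, 13.0, 26.55, 512.33]})

# Remove outliers outside 1.5 * IQR of the quartiles (one common rule of thumb)
q1, q3 = df["Fare"].quantile(0.25), df["Fare"].quantile(0.75)
iqr = q3 - q1
df = df[df["Fare"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]

# Add a new column derived from an existing one
df["HighFare"] = df["Fare"] > df["Fare"].mean()
```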

Drawing Conclusions

Okay, now that all the above steps are completed, here comes the part of drawing conclusions. With the preceding steps done, a person can draw several conclusions from the final data, using inferential statistics, descriptive statistics, or different Machine Learning algorithms to predict things. Here, in the “Titanic” dataset, we use Machine Learning algorithms to predict which passengers survived the sinking of the “Titanic”.
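
A minimal sketch of such a prediction, assuming scikit-learn is available; the original analysis does not specify an algorithm, so logistic regression stands in, and the tiny hand-made table stands in for the real dataset:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Tiny stand-in for the cleaned Titanic data (Sex already encoded: 0 = male, 1 = female).
df = pd.DataFrame({
    "Pclass":   [1, 1, 2, 2, 3, 3, 3, 1],
    "Sex":      [0, 1, 0, 1, 0, 1, 0, 1],
    "Fare":     [71.3, 53.1, 13.0, 21.0, 7.3, 7.9, 8.1, 30.0],
    "Survived": [1, 1, 0, 1, 0, 1, 0, 1],
})

X = df[["Pclass", "Sex", "Fare"]]
y = df["Survived"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Fit a classifier and check accuracy on held-out passengers
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(model.score(X_test, y_test))
```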

Communicating Results

Now that everything is done comes the final step, Communicating Results. Also known as Data Storytelling, this is the part in which we present our analysis to others: a colleague, a data scientist, the CEO, etc. Here, communication skills matter most, as we have to explain the complete analysis to others. We can also prepare reports, write blog posts, prepare presentations, etc.

Conclusion

This concludes the full process of Data Analysis, in which every step is important. Analytical and communication skills are the ones most required in Data Analysis. Now, there is a fun part too:

All the steps are interrelated. Suppose that while Drawing Conclusions after Exploratory Data Analysis we find another question that was not answered; the earlier steps are then repeated. Or suppose that while Communicating Results a question comes up that we cannot answer; again, the complete process is repeated. Therefore, we can say that Data Analysis is an iterative, trial-and-error process and is not at all linear.

For me, the best part of Data Analysis is Exploratory Data Analysis, because with modern tools it is a bit easier than the other steps, though the complete process of Data Analysis is very interesting. So, future Data Analysts, be ready to work with all this awesome data and do awesome stuff with it.


Vishal Malo

Computer Science and Engineering Student, Graduating 2021 | A Machine Learning Beginner