Adventures of a Data Scientist

Arpit Jain
Published in Grey Matter AI
7 min read · Apr 12, 2021


Journey through the world of Data Science.

One of the hottest skills you can have in the 21st century is Data Science. But have you ever wondered what a Data Scientist does?

In this article, I will briefly explain the different tasks a Data Scientist performs and the roles they have to play. I'll be using the analogy of cooking for a better understanding.

Data Science Process

There is no single defined way to approach a Data Science problem. The list of tasks discussed below is the general process that has been tried and tested by the majority of Data Scientists across the globe.

Don't be a slave to the process. The steps may change based on the problem statement you have in hand.

Step 1: Setting up the Goal

Before you start anything, one thing should be crystal clear: the end goal. Knowing what you are trying to achieve is the most important aspect of starting any task. What, why, and how are the key questions that need to be answered in this step. Travelling without a destination in mind is called wandering, not a journey.

Analogy: Consider you are in your kitchen and want to cook something. How can you start before you decide what you would like to eat today?

Step 2: Gathering Data

The next, and most important, step in the process is to gather data. All the subsequent steps depend on it. If the correct data is not collected, we will not be able to reach the goal state no matter how complex an algorithm we write. Data can be considered the raw material of the whole process.

Data can be collected from either internal or external systems. If data is not available internally, don't shy away from collecting it from an external system. Data can be bought from specific organisations as well; there are plenty of companies out there that have already collected the data you may require. Although data is considered an asset more valuable than oil by some companies, more and more governments and organisations are sharing their data for free with the world.
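To make this concrete, here is a minimal sketch of pulling in an open dataset without any external setup, using the Iris data bundled with Scikit-learn as a stand-in for an external or open data source:

```python
# Sketch: loading an open dataset locally. Iris ships with scikit-learn,
# so it works as a zero-setup stand-in for an external data source.
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
df = iris.frame  # four feature columns plus the "target" column

print(df.shape)             # (150, 5)
print(df.columns.tolist())
```

In a real project this line would more likely be a `pd.read_csv(...)` against a downloaded file or a query against an internal database; the point is simply to get the raw data into one tabular structure early.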

A few open-source data repositories

When we talk about data, how we store it is a major question to be answered. There are many ways to store data: as an SQL table, in a NoSQL store, as JSON objects, etc. If you are dealing with streaming data, there are many cloud options available to opt for.
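As a small illustration of two of these options, the sketch below (table and column names are made up) persists the same rows as an SQLite table and as a JSON document; SQLite ships with Python's standard library, so this runs with no setup:

```python
# Sketch: two common ways to persist a small tabular dataset.
import json
import sqlite3

rows = [("alice", 34), ("bob", 29)]

# Option 1: an SQL table (in-memory SQLite for the example)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, age INTEGER)")
conn.executemany("INSERT INTO users VALUES (?, ?)", rows)
count = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]

# Option 2: a JSON object per record
payload = json.dumps([{"name": n, "age": a} for n, a in rows])

print(count)    # 2
print(payload)
```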

The quality of the data is another important factor to be mindful of. Most of your time in the complete process will be spent cleaning, analysing, and making sense of the data in hand, so it is very important to keep a check on the quality of the data being used. If you do a good job at this step, the rest of the journey will be very smooth.

Analogy: To start the cooking process, we need to accumulate the required raw materials. Consider you are planning to prepare “Paneer Butter Masala”, a common dish for vegetarians in India. The most important step in the preparation is to collect the correct ingredients. If the quality of the Paneer used is not up to the mark, do you think the final dish will come out as you expect?

Step 3: Data Preparation

The information received from the data retrieval phase is likely to be “a diamond in the rough”. Your task now is to sanitise and prepare it for use in the modelling and reporting phases.

Data preparation can be divided into different sub-tasks:
  1. Data cleaning
  2. Transformation
  3. Combining data from various sources

Many a time, the format in which data is received is not suitable for the task you want to perform, and the data needs to be transformed into the required format using suitable steps. For example, your project may require you to collect data from various sensors connected to multiple IoT devices, in which case it becomes essential to combine data from the different sources while maintaining a consistent structure.
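A minimal sketch of that combining step, assuming two hypothetical sensor feeds that share a timestamp column (all names here are invented for illustration):

```python
# Sketch: merging readings from two hypothetical IoT sensors into one
# table keyed on a shared timestamp column.
import pandas as pd

temp = pd.DataFrame({"ts": [1, 2, 3], "temp_c": [21.0, 21.5, 22.1]})
humid = pd.DataFrame({"ts": [1, 2, 3], "humidity": [40, 42, 41]})

# Inner join on the timestamp keeps only rows present in both feeds
combined = temp.merge(humid, on="ts", how="inner")
print(combined)
```

In practice the sensors rarely report on identical timestamps, so the join strategy (`how="inner"` vs `"outer"`, or time-based alignment) is itself a data-preparation decision.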

Data cleansing is a subprocess of the data science process that focuses on removing errors in your data, so that your data becomes a true and consistent representation of the processes it originates from.

Common issues faced in the data cleaning process and their possible resolutions

Sometimes there is a need for more advanced cleaning processes, such as outlier detection and checking the skewness of the data, for which detailed EDA is required. A good practice is to remediate data errors as early as possible in the data collection chain and to fix as little as possible inside your program.
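The sketch below shows two of these cleaning steps on a toy series: filling a missing value with the median, then flagging outliers with the common interquartile-range (IQR) rule. This is one standard heuristic, not the only way to detect outliers:

```python
# Sketch: fill missing values, then flag outliers with the 1.5*IQR rule.
import pandas as pd

s = pd.Series([10.0, 12.0, None, 11.0, 300.0])  # 300.0 looks suspicious
s = s.fillna(s.median())                         # impute the missing value

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
outliers = s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]

print(outliers.tolist())  # [300.0]
```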

Analogy: So now you have collected all the ingredients for the dish you are planning. Before you start, you need to wash all the vegetables to be used, check the quality of all the ingredients, and chop them into the desired slices. Just like the data preparation step, this process may not be limited to these actions.

Step 4: Exploratory Data Analysis (EDA)

Once you have cleaned up your data, the next step is to dive deep into your dataset. Graphical representation is usually the best way to visualise the data: looking at 1 million rows with 100 columns is far more tedious than viewing the dataset in various graphical forms. EDA helps you understand your data better. Understanding the relationships the feature attributes have with the target attribute and with each other will help you model the data in a better way.

EDA can be divided into two major categories: univariate analysis and multivariate analysis. Univariate analysis is when you study the variation within a single feature attribute, or the variation it shows against the target attribute. Graphs that can be used here include histograms, line charts, bar graphs, etc. In multivariate analysis, you consider multiple variables at a time and study the relationships between them. Graphs that can be used here include scatterplots, box plots, etc.
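A tiny sketch of both flavours on an invented two-column dataset; the plotting calls are left as comments so the script runs headless, with the numeric summaries standing in for the charts:

```python
# Sketch: univariate and multivariate EDA on a toy dataset.
import pandas as pd

df = pd.DataFrame({
    "height_cm": [150, 160, 170, 180, 190],
    "weight_kg": [50, 58, 68, 80, 92],
})

# Univariate: distribution of a single feature
print(df["height_cm"].describe())
# df["height_cm"].hist()                          # histogram

# Multivariate: relationship between two features
corr = df["height_cm"].corr(df["weight_kg"])      # Pearson correlation
# df.plot.scatter(x="height_cm", y="weight_kg")   # scatterplot
print(round(corr, 2))
```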

The techniques we described in this phase are mainly visual, but in practice, they’re certainly not limited to visualisation techniques. Tabulation, clustering, and other modelling techniques can also be a part of exploratory analysis. Even building simple KNN models can be a part of this step.

Step 5: Data Modelling

Now, with all the data gathering, clean-up, and exploration done, you come to the most important step: actually modelling your data. This is where the fun starts, and where you start to see your goals being achieved.

This step can be divided into 3 major sub-tasks:

  1. Model selection: where you select the model to be used based on the problem statement. For example, if the end goal is to classify images into a set of classes, we can start with a simple logistic regression model. You can choose from pre-trained open-source models or build your own model from scratch; it depends on the case.
  2. Model execution: once you have chosen the model, you'll need to implement the code to train it and infer results from it. Luckily, most programming languages, such as Python, already have libraries like Scikit-learn with most of these techniques implemented, which takes away the tedious task of coding the mathematical modelling yourself.
  3. Model evaluation: you'll be building multiple models, from which you then choose the best one based on multiple criteria. Working with a holdout sample helps you pick the best-performing model. The most commonly used method for evaluating a classification model is the confusion matrix. This matrix gives you a better understanding of how the model performed on a given set of inputs; you can compute accuracy, precision, recall, etc. from it.
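The three sub-tasks above can be sketched end-to-end with Scikit-learn on the Iris data, using a holdout split and a confusion matrix as described (the specific model and split sizes are illustrative choices):

```python
# Sketch: model selection, execution, and evaluation with scikit-learn.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Holdout sample: keep 30% of the data unseen during training
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

model = LogisticRegression(max_iter=200)  # 1. model selection
model.fit(X_train, y_train)               # 2. model execution (training)
preds = model.predict(X_test)             #    ... and inference

print(confusion_matrix(y_test, preds))    # 3. model evaluation
acc = accuracy_score(y_test, preds)
print(acc)
```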

Step 6: Presentation and Automation

Finally, it's time to reap what you have sown. The time has come to show off the results of the complete process. After you've successfully analysed the data and built a well-performing model, you are ready to present the results to the stakeholders.

To save the repetitive effort of performing the same tasks again and again, it is important to automate the complete process, so that if there is any addition to the dataset you used, you don't have to go through the whole process manually.
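One lightweight way to get this kind of repeatability, sketched below, is Scikit-learn's `Pipeline`: the preparation and modelling steps are bundled into one object, so re-running on new data is a single `fit` call (the scaler and model here are illustrative choices):

```python
# Sketch: bundling preparation and modelling so the whole chain can be
# re-run unchanged whenever new data arrives.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

pipe = Pipeline([
    ("scale", StandardScaler()),                  # data preparation step
    ("model", LogisticRegression(max_iter=200)),  # modelling step
])
pipe.fit(X, y)          # one call runs every step in order
print(pipe.score(X, y))
```

Heavier-weight options such as scheduled jobs or workflow orchestrators extend the same idea to the full data-collection-to-report chain.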

Conclusion: Planning every detail of the data science process upfront isn't always possible, and often you'll iterate between the different steps. Again, there is no fixed sequence of steps needed to achieve the goal you desire; the process may vary from case to case.

I hope this article was helpful and was able to spark some curiosity in your mind.
