The Data Science Process.

Published in ml.careers

5 min read · Sep 10, 2020

“Things get done only if the data we gather can inform and inspire those in a position to make a difference.” — Mike Schmoker

Everything has a start and an end, and between the two a process has to take place. Data science is a process made up of several steps that help us make sense of the data we have. These steps can turn raw, unorganized data into an organized, meaningful dataset that tells a story. The first thing to accept about data is that raw data is almost never clean.

Every process aims at a particular goal, and the data science process is no different. To reach that goal, the following steps are followed.

Set the goal
Data scraping
Data cleaning/cleansing
Data exploration
Data modeling
Data visualization

Set the goal.

A definite goal makes any process easier to implement and work on. The very first step in the data science process is to identify the problem you need to solve. With the ultimate goal set, you can identify the kind of data needed to solve that problem.

For example, if you are trying to understand the causes of climate change, the data you need includes weather patterns from past years. It would be unreasonable for the team to gather its data from a financial institution.

The Data Science process is dependent on the ultimate goal set.

Data Scraping

It is the process of getting the data. According to Techopedia, data scraping is a system in which a technology extracts data from a particular codebase or program. Data scraping provides results for a variety of uses and automates aspects of data aggregation.

Data scraping, also known as data extraction or web scraping, is the process of extracting data from web pages. Scraping tools and software access the Web over the Hypertext Transfer Protocol (HTTP), collect useful data, and extract it according to the requirements. The scraped information is then saved in a central database or downloaded to your hard drive for further use.
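The extraction part of scraping can be sketched with Python's standard library alone. This is a minimal illustration, not a production scraper: the HTML snippet, its values, and the TableCellParser class are made up for this example, and a real scraper would first download the page over HTTP (for instance with urllib.request.urlopen) and respect the site's terms of use.

```python
from html.parser import HTMLParser

# Hypothetical page content with made-up values. A real scraper would
# download this over HTTP before parsing it.
SAMPLE_HTML = """
<table>
  <tr><th>Year</th><th>Avg temp (C)</th></tr>
  <tr><td>2018</td><td>14.7</td></tr>
  <tr><td>2019</td><td>14.9</td></tr>
</table>
"""


class TableCellParser(HTMLParser):
    """Collects the text of every <td> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._current_row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._current_row = []
        elif tag == "td":
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr":
            if self._current_row:  # header rows have no <td> cells, so skip them
                self.rows.append(self._current_row)
            self._current_row = None

    def handle_data(self, data):
        if self._in_cell and self._current_row is not None:
            self._current_row.append(data.strip())


parser = TableCellParser()
parser.feed(SAMPLE_HTML)

# Convert the extracted strings into typed records ready to be stored.
records = [(int(year), float(temp)) for year, temp in parser.rows]
print(records)
```

The dedicated tools handle the parts this sketch leaves out, such as downloading pages, following links, and dealing with pages rendered by JavaScript.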

The following tools will get you started with data scraping.

ScrapingBee
Scrapy
ScraperAPI
Octoparse
ParseHub

Now that we have scraped the data we need for our climate change project, let us move on to the next step.

Data cleaning/cleansing.

This is the third step of the data science process, and arguably the most important one when working with data. Data is usually messy and needs to be cleaned and sorted out. To get valid results from your analysis, you first need to clean your data.

According to Wikipedia, “Data cleaning is the process of detecting and correcting (or removing) corrupt or inaccurate records from a record set or a database and refers to identifying incomplete, incorrect, inaccurate, or irrelevant parts of the data and then replacing, modifying, or deleting the dirty or coarse data.”

Things to look for in the data during the data cleaning process.

Corrupted values, such as invalid entries.
Timezone differences. Perhaps your database doesn’t take into account the different time zones of your users.
Missing values. In some cases you will find null values in a dataset. If you work with such data as it is, expect to get incorrect results.
Date range errors. In some cases you’ll have data that makes no sense at all, such as records dated before sales started.
Repetitive (duplicate) data.
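The checks in the list above can be sketched in a few lines of standard-library Python. The records, field names, and the SALES_START date below are invented for illustration. Dropping bad rows is only one possible policy; depending on the project, you might repair or fill in values instead.

```python
from datetime import datetime, timezone

# Made-up raw records for illustration only.
RAW = [
    {"id": 1, "amount": "19.99", "ts": "2020-03-01T10:00:00+03:00"},
    {"id": 1, "amount": "19.99", "ts": "2020-03-01T10:00:00+03:00"},  # repeated
    {"id": 2, "amount": "abc", "ts": "2020-03-02T09:30:00+00:00"},    # corrupted
    {"id": 3, "amount": None, "ts": "2020-03-03T12:00:00+00:00"},     # missing
    {"id": 4, "amount": "5.00", "ts": "2019-12-31T23:00:00+00:00"},   # too early
    {"id": 5, "amount": "7.50", "ts": "2020-03-04T08:00:00-05:00"},
]

# Assumed date on which sales started; earlier records are treated as errors.
SALES_START = datetime(2020, 1, 1, tzinfo=timezone.utc)


def clean(records):
    """Return records with bad rows dropped and timestamps converted to UTC."""
    seen = set()
    cleaned = []
    for rec in records:
        if rec["amount"] is None:          # missing value
            continue
        try:
            amount = float(rec["amount"])  # corrupted value fails to parse
        except ValueError:
            continue
        # Normalize every timestamp to UTC so time zones are comparable.
        ts = datetime.fromisoformat(rec["ts"]).astimezone(timezone.utc)
        if ts < SALES_START:               # date range error
            continue
        key = (rec["id"], amount, ts)
        if key in seen:                    # repeated record
            continue
        seen.add(key)
        cleaned.append({"id": rec["id"], "amount": amount, "ts": ts})
    return cleaned


cleaned = clean(RAW)
for row in cleaned:
    print(row["id"], row["amount"], row["ts"].isoformat())
```

Only the records with id 1 and id 5 survive, each with its timestamp expressed in UTC.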

Because data is almost never clean, you always need to sort it out to eliminate the errors in untidy data. The quality of this step determines the results of the next steps.

Data Exploration

In simple terms, this is figuring out the relationships in the data. It means getting a general view of the data that will help you reach your result. It generally refers to the user being able to find their way through large amounts of data and gather the necessary information.

Data exploration is an approach similar to data analysis, in which a data analyst or data scientist uses visual exploration to understand a dataset and its characteristics. These characteristics may include the size of the data, its completeness, its correctness, and the possible relationships among the data elements.
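As a rough sketch, the characteristics mentioned above (size, completeness, and relationships) can be computed with standard-library Python. The observations are made up for illustration, and the pearson helper is a hypothetical name for a hand-written Pearson correlation, which measures how strongly two columns move together.

```python
import statistics

# Made-up observations: (year, average temperature). None marks a missing value.
rows = [
    (2015, 14.1),
    (2016, 14.3),
    (2017, None),
    (2018, 14.4),
    (2019, 14.6),
]


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


# Size: how many records there are.
size = len(rows)

# Completeness: the share of records with no missing field.
complete = [row for row in rows if None not in row]
completeness = len(complete) / size

# Relationship: does temperature move together with the year?
years = [row[0] for row in complete]
temps = [row[1] for row in complete]
corr = pearson(years, temps)

print(f"size={size} completeness={completeness:.0%} correlation={corr:.2f}")
```

A correlation close to 1 or -1 suggests a strong linear relationship worth a closer look; it does not by itself show that one variable causes the other.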

Data exploration might be regarded as similar to data mining. Wikipedia describes data mining as a process of discovering patterns in large data sets involving methods at the intersection of machine learning, statistics, and database systems. Data mining has an overall goal of extracting information (with intelligent methods) from a data set and transforming the information into a comprehensible structure for further use. In that sense it is similar to data exploration.

We are almost at the end.

Data Modeling

Data modeling is the process of representing the structure of the data, for example as tables, and the relationships between its parts. It helps you define the data, establish the relationships in it, and ensure its consistency and quality.

Data modeling comes in three types.

Conceptual data models. This model describes what a system contains. It serves the purpose of defining business concepts and rules.
Logical data models. This model describes how a system should be organized. It serves the purpose of defining the rules and structures of the data.
Physical data models. This model describes how a system should be implemented. Its purpose is to implement the database system.
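To make the three levels concrete, here is a small sketch using Python's built-in sqlite3 module. The station and reading tables, their columns, and the sample values are hypothetical. The conceptual model is the idea that stations produce readings, the logical model is the choice of attributes and keys, and the physical model is the CREATE TABLE statements that a specific database executes.

```python
import sqlite3

# An in-memory database, so the sketch leaves no files behind.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked

# Physical model: concrete tables, column types, and constraints.
conn.executescript("""
CREATE TABLE station (
    station_id INTEGER PRIMARY KEY,
    name       TEXT NOT NULL UNIQUE
);
CREATE TABLE reading (
    reading_id  INTEGER PRIMARY KEY,
    station_id  INTEGER NOT NULL REFERENCES station(station_id),
    recorded_on TEXT NOT NULL,
    avg_temp_c  REAL NOT NULL
);
""")

# Made-up rows for illustration only.
conn.execute("INSERT INTO station (station_id, name) VALUES (1, 'Example Station')")
conn.execute(
    "INSERT INTO reading (station_id, recorded_on, avg_temp_c) "
    "VALUES (1, '2020-01-01', 14.2)"
)

# The model protects consistency: a reading for an unknown station is rejected.
try:
    conn.execute(
        "INSERT INTO reading (station_id, recorded_on, avg_temp_c) "
        "VALUES (99, '2020-01-02', 15.0)"
    )
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

reading_count = conn.execute("SELECT COUNT(*) FROM reading").fetchone()[0]
print(rejected, reading_count)
```

The rejected insert shows how a data model helps ensure consistency and quality: the rule lives in the database rather than in every program that writes to it.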

Now that we have modeled the data, let us get the meaning out of it. This is where data visualization, the last step, comes in.

Data visualization

This is the graphical representation of data. It is a process that uses images to reveal the relationships in the data. By using visual elements like charts, graphs, timelines, and maps, data visualization offers an accessible way to see and understand trends, outliers, correlations, and patterns in the data.
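In practice a plotting library would usually draw the chart. To keep the idea runnable without extra dependencies, the sketch below draws a plain-text bar chart instead. The values and the bar_chart helper are made up for illustration.

```python
# Made-up yearly counts for illustration only.
data = {"2016": 3, "2017": 5, "2018": 4, "2019": 8}


def bar_chart(values, width=20):
    """Render a mapping of label -> number as a horizontal text bar chart."""
    peak = max(values.values())
    lines = []
    for label, value in values.items():
        # Scale each bar so that the largest value fills the full width.
        bar = "#" * round(value / peak * width)
        lines.append(f"{label} | {bar} {value}")
    return "\n".join(lines)


chart = bar_chart(data)
print(chart)
```

Even in this crude form, the longest bar stands out at a glance, which is the kind of pattern that is harder to spot in a table of raw numbers.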

Data visualization will enable you to

Make faster and better decisions based on the observations.
Gain improved insights. Visualization makes the relationships in the data easier to capture.
Discover patterns faster.
Discover relationships in the data set.
Ask relevant questions.

Being able to make sense of the data marks the end of the Data Science Process.

Summary

The data science process is a series of steps that aims to get meaning out of the data and help us solve problems. Put another way:

“The goal is to turn data into information, and information into insight.” — Carly Fiorina

Going through the data science process is a path of discovery that can change the world. With data, we have the power to discover solutions that have a positive impact in sectors such as medicine, meteorology, and communication, to mention just a few.

We hope you liked our article. If you did, leave a comment and a like.

#happylearning #keeplearning

Africa Data School