DS Demystified (What is Data Science?)

Published in

TheCyPhy

5 min read · Feb 18, 2020

Data Science has become established as an important scientific field and a paradigm driving research in statistics, computing science and intelligence science. It is bringing practical transformation to science, business, social science, the public sector, and everyday life.

Evolution of Data Science

To understand what Data Science is, let's first go through its history. Data Science started with statistics and has evolved to include concepts such as Artificial Intelligence and Machine Learning.

Many different dates and timelines can be used to trace the growth of Data Science and its current impact on the data management industry.

In 1962, Tukey wrote the paper “The Future of Data Analysis”, in which he introduced the term Data Analysis. In a discussion of this paper, Mallows defined statistics as follows: “Statistics concerns the relation of quantitative data to a real-world problem, often in the presence of variability and uncertainty.”

In 1974 Peter Naur authored the “Concise Survey of Computer Methods” and gave a definition of Data Science as:

“The science of dealing with data, once they have been established, while the relation of the data to what they represent is delegated to other fields and sciences.”

In 1977, the International Association for Statistical Computing (IASC) was formed with the following mission: “It is the mission of the IASC to link traditional statistical methodology, modern computer technology and the knowledge of domain experts in order to convert data into information and knowledge.”

The modern definition of “data science” was first sketched during the second Japanese-French statistics symposium organized at the University of Montpellier (France) in 1992.

Definition

“Data science is an interdisciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from structured and unstructured data.” Source: https://en.wikipedia.org/wiki/Data_science

Data science is an interdisciplinary field. To learn data science, one needs to be good at mathematics and statistics in order to analyze and visualize data. Machine learning, which is used to build models, is the heart of data science.

Important tasks in Data Science

Data Collection
Storage of Data
Data Cleaning
Data Exploration
Data Modelling

Data Collection

Data collection is one of the important tasks in data science. Before collecting data, we should think about the question we want to answer and the environment in which the data scientist works. A question could be descriptive, exploratory, inferential, predictive, causal, or mechanistic. Data collection involves gathering and measuring data about the features (variables) with which the problem can be solved. At this stage the data may be in raw form, meaning it might be noisy and not yet suitable for analytics.

Storage of Data

Data can be structured, semi-structured, or unstructured. Structured data is data in tabular form, such as transactional and operational data, and is typically stored in relational databases. Unstructured data, such as images, videos and audio clips, is not in tabular form and does not fit naturally into tables. To store unstructured and semi-structured data, we can use a distributed file system such as HDFS.
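As a small illustration of the difference, the sketch below stores a few structured rows in a relational table and a few semi-structured records as JSON. It uses only Python's standard library (an in-memory SQLite database stands in for a real relational database), and the table, column and field names are hypothetical examples, not taken from any real system.

```python
import json
import sqlite3

# Structured data: rows with a fixed schema fit naturally in a relational table.
# The table and column names are made up for this example.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE transactions (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO transactions (customer, amount) VALUES (?, ?)",
    [("alice", 120.50), ("bob", 75.00), ("alice", 30.25)],
)
conn.commit()

# Because the schema is fixed, aggregate queries are straightforward.
totals = dict(
    conn.execute("SELECT customer, SUM(amount) FROM transactions GROUP BY customer")
)
conn.close()

# Semi-structured data: records share some keys but have no fixed schema,
# so a nested format such as JSON is a more natural fit than a table.
events = [
    {"user": "alice", "action": "login"},
    {"user": "bob", "action": "upload", "file": {"name": "cat.png", "bytes": 20480}},
]
serialized = json.dumps(events)
restored = json.loads(serialized)

print(totals)
print(restored[1]["file"]["name"])
```

At larger scale the JSON documents or raw media files would live in a distributed store rather than in memory, but the distinction between a fixed schema and a flexible one stays the same.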

Data Cleaning

The goal of data cleaning is to process raw data and obtain tidy data. According to The New York Times article “For Big-Data Scientists, ‘Janitor Work’ Is Key Hurdle to Insights”, data scientists spend 50 to 80 per cent of their time cleaning data. This is because data can have inconsistencies: some columns might contain missing values (NaN, NA) or outliers. A data scientist has to detect these values and get rid of them.
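A minimal sketch of this kind of cleaning, using only Python's standard library: missing entries (None or NaN) are dropped first, and then outliers are filtered with the common 1.5 × IQR rule. The IQR rule is just one possible choice assumed here for illustration, and the column name and values are made up.

```python
import math
import statistics


def drop_missing(values):
    """Return only the entries that are real numbers (drops None and NaN)."""
    return [v for v in values if v is not None and not math.isnan(v)]


def drop_outliers_iqr(values, k=1.5):
    """Keep values inside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if low <= v <= high]


# Hypothetical raw column with missing values and one implausible age.
raw_ages = [23.0, 25.0, None, 27.0, float("nan"), 24.0, 26.0, 250.0, 22.0]

present = drop_missing(raw_ages)
clean = drop_outliers_iqr(present)
print(clean)
```

Whether an extreme value should really be removed depends on the question being asked; sometimes the outliers are the interesting part of the data.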

Data Exploration

Data Exploration, or Exploratory Data Analysis (EDA), is the process of analyzing data to obtain a basic understanding of it. EDA helps to uncover the structure of the data, the distribution of values, the presence of outliers and the interrelationships between data sets. Visual representations of the data help us understand the relations or associations between variables, which in turn guides feature extraction. Feature extraction helps in data modelling.
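Before plotting anything, a first pass at EDA is often just summary statistics per variable and a correlation between pairs of variables. The sketch below does this with Python's standard library; the two variables and their values are invented for the example, and Pearson correlation is only one of several association measures that could be used.

```python
import math
import statistics


def summarize(values):
    """Basic descriptive statistics for one numeric variable."""
    return {
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),
    }


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Hypothetical data set with two variables.
hours_studied = [1, 2, 3, 4, 5, 6]
exam_score = [52, 55, 61, 64, 70, 74]

print(summarize(hours_studied))
print(summarize(exam_score))
print(pearson(hours_studied, exam_score))
```

A correlation close to 1 or -1 suggests a strong linear association worth looking at in a scatter plot; it does not by itself say anything about causation.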

Data Modelling

Data modelling is one of the most interesting tasks in data science, but data cleaning and data exploration are equally crucial for building good models. So it's always a good idea to invest more time in the data cleaning and exploration stages.

In a data set, not all variables are essential for the prediction your model has to make. So you have to reduce the dimensionality of your data set.
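There are many ways to reduce dimensionality. As one very simple illustration (an assumption chosen for brevity, not a method prescribed by this article), the sketch below drops variables whose values never change, since a constant column cannot help distinguish one observation from another. The helper name, column names and values are hypothetical.

```python
import statistics


def select_by_variance(columns, threshold=0.0):
    """Keep only the columns whose population variance exceeds the threshold."""
    return {
        name: values
        for name, values in columns.items()
        if statistics.pvariance(values) > threshold
    }


# Hypothetical data set stored column by column.
dataset = {
    "age": [23, 25, 27, 24],
    "country_code": [1, 1, 1, 1],  # constant, carries no information
    "income": [30.0, 42.0, 38.0, 51.0],
}

reduced = select_by_variance(dataset)
print(list(reduced))
```

More capable techniques, such as principal component analysis or selecting features by their relationship with the target, follow the same idea of keeping only what helps the model.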

In data modelling, we perform two important tasks: first we train the model, and then we test it. To do so, it is good practice to split the data set into two parts, for example sixty per cent of the data to train the model and forty per cent to test it.
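The sixty/forty split described above can be sketched as follows, using only the standard library. The function name split_train_test, its parameters and the fixed seed are assumptions made for this example; the rows are shuffled first so that the split does not depend on the order in which the data was collected.

```python
import random


def split_train_test(rows, train_fraction=0.6, seed=42):
    """Shuffle a copy of the rows and split it into train and test parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical data set of ten observations, identified by their index.
rows = list(range(10))
train, test = split_train_test(rows)

print(len(train), len(test))
```

Every observation ends up in exactly one of the two parts, so the model is always tested on data it has not seen during training.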

References

[1] https://en.wikipedia.org/wiki/Data_science

[2] https://www.jstor.org/stable/25471205seq=1#metadata_info_tab_contents

[3] https://link.springer.com/journal/41060

[4] https://www.sciencedirect.com/topics/computer-science/data-exploration

Written by Vikash