Introduction To Data Science

Rashal Ismath
Published in Analytics Vidhya
6 min read · Nov 30, 2020

What is Data Science?

Data science is the field of study that uses mathematics, programming, and domain knowledge to extract meaningful insights from data. Data scientists apply machine learning algorithms to data such as numbers, text, images, and video to build artificial systems that perform tasks which would otherwise require human intelligence. Using these systems, we extract insights from the data and use them to make better decisions in various situations.

We live in a world that is full of data. Everything from traffic cameras to big corporations produces huge amounts of data every minute of every day. A single click on a Facebook post can record a lot of information about that click.

Have you ever wondered how Amazon or eBay suggests items for you to buy, or how Gmail sorts your emails into spam and non-spam categories? If you want to study this field, it’s time to start wondering about things like these more often.

So how do they do it?

Data science is all about using data to solve these kinds of problems. The problem could be a decision, such as identifying which email is spam and which is not. It could be a recommendation, such as which movie to watch. Or it could be a prediction, such as who will be the next President of the USA.
This is how you get YouTube video recommendations as well. When you watch a video, YouTube records in its database that you watched it. The next time you visit, you are shown videos similar to the ones you watched earlier, which makes you happy because your favorites are on the front page. YouTube’s decision to show you the kind of video you like improves your experience and keeps you on YouTube longer.
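
To make the idea concrete, here is a minimal sketch in Python of a "more of what you watched" recommender. It is only an illustration: the catalog, the categories, and the `recommend` function are made up for this post, and real systems such as YouTube’s are far more sophisticated.

```python
from collections import Counter

# Hypothetical catalog: video title -> category (made-up data for illustration).
CATALOG = {
    "Intro to Python": "programming",
    "Sorting Algorithms": "programming",
    "Pasta in 10 Minutes": "cooking",
    "Knife Skills": "cooking",
    "Recursion Explained": "programming",
}


def recommend(watch_history, catalog=CATALOG, limit=3):
    """Suggest unwatched videos from the viewer's most-watched category."""
    # Count how often each category appears in the viewing history.
    counts = Counter(catalog[title] for title in watch_history if title in catalog)
    if not counts:
        return []  # Nothing known about this viewer yet.
    favorite, _ = counts.most_common(1)[0]
    watched = set(watch_history)
    # Keep only videos in the favorite category that were not watched before.
    picks = [t for t, c in catalog.items() if c == favorite and t not in watched]
    return picks[:limit]


history = ["Intro to Python", "Pasta in 10 Minutes", "Sorting Algorithms"]
print(recommend(history))
```

The stored history is the "record in the database", and the decision about what to show next is derived purely from that data.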

So, the core job of a data scientist is to understand the data, extract useful information from it, and apply that information to solve problems.

Data Science is the future of Artificial Intelligence. It is therefore very important to understand what Data Science is and how it can add value to your business.

Who is a Data Scientist?

Data Scientist: The Sexiest Job of the 21st Century

Data scientists are people who crack hidden problems using strong expertise in scientific disciplines such as mathematics, statistics, and computer science. They rely heavily on computer algorithms to find solutions and reach conclusions that are crucial for an organization’s growth and development. Starting from the raw data available to them, both structured and unstructured, data scientists present it in a much more useful form.

To put it in the simplest words, a data scientist gets data from somewhere (in an organization, most probably from its own data warehouses) and applies some math and programming to that data to find answers to problems within the organization. These answers might take the form of graphs, dashboards, presentations, and so on.

The lifecycle of a Data Science Project

1. Define Problem Statement

The starting point of a data science project is understanding what we want to solve. The problem statement is a brief description of that problem.
Examples:
1) I want to increase revenue.
2) I want to recommend products to customers on my website.
3) I want to predict stock prices.

2. Data Collection and Preparation

We need to collect data that is relevant to our problem and helps to solve it.
Depending on the problem, we will either collect new data or use data that is already publicly available.

When we have a unique problem and no related research has been done on the subject before, we will have to collect new data.
For example, suppose we want to know the average time that employees spend in the cafeteria across companies. There is no public data on this, but we can collect it through surveys, employee interviews, or by monitoring the time employees spend in the cafeteria. This method is time-consuming.

The other method is to use data that is readily available or has been collected by someone else. Such data can be found on the internet, in news articles, government censuses, magazines, and so on. This method is less time-consuming.
One popular place to find data on the internet is kaggle.com, where you will find thousands of data sets.

Once we have the data, and before we start to analyze it (the next step), we have to clean it. Here we perform various operations on the data, such as eliminating missing values from the data set, so that we can perform better statistical analysis.
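
As a rough illustration of this cleaning step, the Python sketch below handles missing values in a tiny, made-up version of the cafeteria survey. The records, field names, and helper functions are all hypothetical. It shows the "eliminate" approach mentioned above and, as an alternative that is my own addition, filling the gaps with the average of the known values.

```python
from statistics import mean

# Hypothetical survey records; None marks a missing value (made-up data).
records = [
    {"employee": "A", "minutes_in_cafeteria": 25},
    {"employee": "B", "minutes_in_cafeteria": None},
    {"employee": "C", "minutes_in_cafeteria": 40},
    {"employee": "D", "minutes_in_cafeteria": 31},
]


def drop_missing(rows, field):
    """Return only the rows where `field` has a value."""
    return [row for row in rows if row.get(field) is not None]


def fill_missing_with_mean(rows, field):
    """Return copies of the rows, replacing missing `field` values with the mean."""
    known = [row[field] for row in rows if row.get(field) is not None]
    if not known:
        raise ValueError(f"no known values for {field!r}")
    average = mean(known)
    return [
        {**row, field: average} if row.get(field) is None else dict(row)
        for row in rows
    ]


cleaned = drop_missing(records, "minutes_in_cafeteria")
filled = fill_missing_with_mean(records, "minutes_in_cafeteria")
print(len(cleaned), "rows remain after dropping missing values")
print("Employee B after filling:", filled[1]["minutes_in_cafeteria"])
```

Dropping rows is simpler but throws data away, while filling keeps every row at the cost of inventing a value. Which one is appropriate depends on the data set and the problem.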

3. Exploratory Data Analysis

This is the most exciting and important step, as it helps us build familiarity with the data and extract useful insights. If we skip this step, we might end up using inaccurate models and choosing insignificant variables for them. (Here, a model is basically a mathematical algorithm that will be used to solve our problem.)

In this step, we use descriptive statistics, such as measures of central tendency (mean, median, mode) and measures of variability (range, variance, standard deviation), to understand the data. Visualization methods such as graphs and plots are also really important in this phase, since they help us understand the data better.
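
Here is a small sketch of what this can look like in Python, using only the standard `statistics` module. The `minutes` sample is made up for illustration, and the text histogram is just a stand-in for a real plot.

```python
import statistics

# Hypothetical sample: minutes spent in the cafeteria (made-up data).
minutes = [25, 40, 31, 25, 29]

summary = {
    # Central value measures
    "mean": statistics.mean(minutes),
    "median": statistics.median(minutes),
    "mode": statistics.mode(minutes),
    # Variability measures
    "range": max(minutes) - min(minutes),
    "sample_stdev": statistics.stdev(minutes),
}

for name, value in summary.items():
    print(f"{name}: {value:.2f}")

# A rough text histogram: one '#' per occurrence of each value.
for value in sorted(set(minutes)):
    print(f"{value:>3} | {'#' * minutes.count(value)}")
```

Even on five numbers, the gap between the mean and the median hints that one value (40) is pulling the average up, which is exactly the kind of thing this step is meant to reveal.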

4. Building the Model

Modeling means formulating every step and gathering the techniques required to reach the solution.

Here we use probability and inferential statistics to establish relationships between variables in the data and solve our problem, so in this step your math skills will be put to the test. Most of the time, however, the calculations needed to get the results are already written and packaged for us as software libraries. (A software library is a collection of programming files that are put together; it contains one or more algorithms.) Even so, math skills help us choose the correct algorithm and set its parameters wisely.
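
To give a feel for what a model is, the sketch below fits a straight line to a handful of points using ordinary least squares, one of the simplest modeling techniques. This is my own choice of example, not the only way to build a model. The `fit_line` helper and the spend/revenue numbers are hypothetical and chosen so the arithmetic is easy to follow.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 points")
    x_bar, y_bar = mean(xs), mean(ys)
    # Sum of squared deviations of x, and of x-y co-deviations.
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


# Hypothetical data: advertising spend vs. revenue (made-up numbers).
spend = [1, 2, 3, 4, 5]
revenue = [3, 5, 7, 9, 11]

slope, intercept = fit_line(spend, revenue)
predicted = slope * 6 + intercept  # Predict revenue for a spend of 6.
print(f"revenue = {slope:.1f} * spend + {intercept:.1f}")
print(f"predicted revenue at spend 6: {predicted:.1f}")
```

In practice you would rarely write this by hand. For example, Python 3.10 and later ship `statistics.linear_regression`, which does the same calculation. Knowing the math behind it is what lets you judge whether a straight line is a sensible model for your data in the first place.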

5. Data Communication

This is the final step, where we present the results of our analysis to the stakeholders. We explain our critical findings and how we arrived at each conclusion. Most often we present to a non-technical audience, such as the marketing team or business executives, so we need to communicate the results in a simple, easy-to-understand manner, using graphs and presentations. The stakeholders then use our insights to make business decisions across the organization.

So that’s basically it for this introduction to data science. In the next post, we will hopefully talk about what artificial intelligence (AI) is, what machine learning (ML) is, and how data science differs from AI and ML.
After that, we will begin the data science journey, starting with some mathematics.

Thank you for reading this post. I hope you enjoyed it (I’ll get better at writing).
