Data Science Tutorial For Beginners — Learn Data Science from Scratch!

Sahiti Kappagantula
Published in Edureka · 13 min read · Jun 5, 2017

Want to start your career as a Data Scientist, but don’t know where to start? You are at the right place! In this Data Science Tutorial, we will discuss the core concepts of data science in depth.

The following topics will be covered in this article:

  1. Why Data Science?
  2. What is Data Science?
  3. Who is a Data Scientist?
  4. Job Trends
  5. How to solve a problem in Data Science?
  6. Data Science Components
  7. Data Scientist Job Roles

Why Data Science?

It’s been said that Data Scientist is the “Sexiest Job of the 21st Century”. Why? Because over the past few years, companies have been storing more and more of their data. With nearly every company doing this, the result has been a data explosion. Data has become one of the most abundant resources today.

But, what will you do with this data? Let’s understand this using an example:

Say you have a company which makes mobile phones. You released your first product, and it became a massive hit. Every technology has a life, right? So now it’s time to come up with something new. But you don’t know what to innovate in order to meet the expectations of the users who are eagerly waiting for your next release.

Somebody in your company comes up with the idea of using user-generated feedback to pick out the things users are expecting in the next release.

This is where Data Science comes in: you apply data mining techniques such as sentiment analysis to that feedback and get the insights you need.

And it’s not only this. You can make better decisions, reduce your production costs by finding more efficient ways of working, and give your customers what they actually want!

Data Science can bring countless benefits like these, and hence it has become very valuable for a company to have a Data Science team. Requirements like these led to “Data Science” as a subject today, and hence I am writing this Data Science Tutorial for you. :)

What is Data Science?

The term Data Science emerged recently with the evolution of mathematical statistics and data analysis. The journey has been amazing, and we have accomplished a lot in the field of Data Science.

Researchers from MIT claim that in the next few years we will be able to predict the future. They have already reached a milestone: their system can now predict what will happen in the next scene of a movie! How? Well, it might be a little complex to understand right now, but don’t worry. By the end of this blog, you will have an answer to that as well.

Coming back to Data Science: it is also known as data-driven science, and it makes use of scientific methods, processes and systems to extract knowledge or insights from data in various forms, i.e. either structured or unstructured.

What these methods and processes are is what we are going to discuss in this Data Science Tutorial today.

Moving forward, who does all this brainstorming, or who practices Data Science? A Data Scientist.

Who is a Data Scientist?

As you can see in the image, a Data Scientist is the master of all trades! A Data Scientist should be proficient in math, strong on the business side, and have great Computer Science skills as well. Scared? Don’t be. Ideally you would be good in all these fields, but even if you aren’t, you’re not alone!

There is no such thing as “a complete data scientist”. In a corporate environment, the work is distributed among teams, and each team has its own expertise. But the thing is, you should be proficient in at least one of these fields. Also, even if these skills are new to you, chill! It may take time, but these skills can be developed, and believe me, it will be worth the time you invest. Why? Well, let’s look at the job trends.

Data Scientist Job Trends

Well, the graph says it all: not only are there a lot of job openings for data scientists, but the jobs are well-paid too! And no, this blog will not cover the salary figures. Go google!

So we now know that learning data science actually makes sense, not only because it is very useful, but also because it offers a great career in the near future.

Let’s start our journey in learning data science now and begin with the next topic, i.e. how to solve a problem in Data Science.

How to solve a problem in Data Science?

So now, let’s discuss how one should approach a problem and solve it with data science. Problems in Data Science are solved using algorithms. But the biggest question is which algorithm to use, and when to use it.

Basically, there are 5 kinds of problems which you can face in data science, and each can be phrased as a question:

  1. Is this A or B?
  2. Is this weird?
  3. How much or how many?
  4. How is this organized?
  5. What should I do next?

Let’s address each of these questions and the associated algorithms one by one:

Problem 1

Is this A or B?

With this question, we are referring to problems that have a categorical answer, i.e. problems where the answer comes from a fixed set of options: yes or no, 1 or 0, interested, maybe or not interested.

For Example:

Q. What will you have, Tea or Coffee?

Here, you cannot say you want a Coke! The question only offers tea or coffee, so you may answer with one of these only.

When we have only two types of answers, i.e. yes or no, 1 or 0, it is called 2-Class Classification. With more than two options, it is called Multi-Class Classification.

To conclude, whenever you come across questions whose answer is categorical, in Data Science you will solve them using Classification Algorithms.
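To make the tea-or-coffee idea a bit more concrete, here is a minimal sketch of one simple classification algorithm, k-nearest neighbours, written in plain Python. The toy data (hours slept, cups already drunk) and the function name `knn_predict` are made up for illustration; real projects would normally use a library and far more data.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # `train` is a list of (features, label) pairs
    nearest = sorted(train, key=lambda pair: math.dist(pair[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: (hours slept, cups already drunk today) -> preferred drink
train = [
    ((8.0, 0), "tea"),
    ((7.5, 1), "tea"),
    ((7.0, 0), "tea"),
    ((5.0, 2), "coffee"),
    ((4.5, 3), "coffee"),
    ((5.5, 2), "coffee"),
]

# A sleepy person who has already had two cups
prediction = knn_predict(train, (5.2, 2))
print(prediction)
```

Notice that the answer can only ever be one of the labels seen in training, which is exactly what makes this a classification problem.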

The next problem that you may come across may be something like the one below.

Problem 2

Is this weird?

Questions like these deal with patterns and can be solved using Anomaly Detection algorithms.

For Example:

Try associating the problem “is this weird?” to this diagram,

What is weird in the above pattern? The red guy, isn’t it?

Whenever there is a break in the pattern, the algorithm flags that particular event for us to review. A real-world application of this has been implemented by credit card companies, wherein any unusual transaction by a user is flagged for review. This improves security and reduces the human effort spent on surveillance.
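As a rough sketch of the credit card example, the snippet below flags transaction amounts that sit far away from a user’s typical spending. It uses the modified z-score (based on the median), which is one common rule of thumb, not the method any particular card company uses. The amounts, the function name `flag_anomalies` and the 3.5 threshold are assumptions for illustration.

```python
import statistics

def flag_anomalies(values, threshold=3.5):
    """Return the values whose modified z-score exceeds `threshold`."""
    median = statistics.median(values)
    # Median absolute deviation: a measure of spread that outliers barely affect
    mad = statistics.median([abs(v - median) for v in values])
    if mad == 0:
        return []  # no measurable spread, so nothing can be flagged this way
    return [v for v in values if 0.6745 * abs(v - median) / mad > threshold]

# Hypothetical card transactions for one user
amounts = [42.0, 38.5, 51.0, 45.2, 39.9, 47.3, 980.0, 44.1]
flagged = flag_anomalies(amounts)
print(flagged)
```

Only the 980.0 transaction breaks the pattern here, so it is the one that would be sent for review.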

Let’s look at the next problem in this Data Science Tutorial. Don’t be scared, it deals with maths!

Problem 3

How much or How many?

Those of you who don’t like maths, be relieved! Regression algorithms are here!

So, whenever there is a problem that may ask for figures or numerical values, we solve it using Regression Algorithms.

For Example:

What will be the temperature for tomorrow?

Since we expect a numeric value in the response to this problem, we will solve it using Regression Algorithms.
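Here is a small sketch of the simplest regression algorithm, a straight-line fit by least squares, applied to the temperature question. The five daily readings are invented, and fitting temperature against the day number is a big simplification of real weather forecasting; `fit_line` is just an illustrative name.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical readings: day number -> temperature in degrees Celsius
days = [1, 2, 3, 4, 5]
temps = [20.1, 20.9, 22.2, 22.8, 24.0]

slope, intercept = fit_line(days, temps)
tomorrow = slope * 6 + intercept  # predicted temperature for day 6
print(round(tomorrow, 2))
```

The output is a number on a continuous scale rather than a category, which is what separates regression from classification.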

Moving along in this Data Science Tutorial, let’s discuss the next algorithm.

Problem 4

How is this organized?

Say you have some data, but you have no idea how to make sense of it. Hence the question: how is this organized?

Well, you can solve it using clustering algorithms. How do they solve these problems? Let’s see:

Clustering algorithms group the data by the characteristics the items have in common. For example, in the above diagram, the dots are organized based on colors. Similarly, whatever the data, clustering algorithms try to work out what the items have in common and hence “cluster” them together.
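One widely used clustering algorithm is k-means. The sketch below is a stripped-down version of it: the points are made up, and the starting centroids are simply picked by hand (real implementations choose them more carefully). It repeatedly assigns each point to its nearest centroid and then moves each centroid to the middle of its group.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Tiny k-means: alternate between assigning points and moving centroids."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its cluster (keep it if the cluster is empty)
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else c
            for cluster, c in zip(clusters, centroids)
        ]
    return centroids, clusters

# Hypothetical 2-D points that visibly form two groups
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, clusters = kmeans(points, centroids=[points[0], points[3]])
print(clusters)
```

Nobody told the algorithm which point belongs where. It grouped them purely by how close they are to each other.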

The next and final kind of problem that you may encounter in this Data Science Tutorial is as below.

Problem 5

What should I do next?

Whenever you encounter a problem wherein your computer has to make a decision based on the training that you have given it, it involves Reinforcement Learning algorithms.

For Example:

A temperature control system that has to decide whether it should lower the temperature of the room or raise it.

How do these algorithms work?

These algorithms are based on human psychology. We like being appreciated, right? Computers implementing these algorithms expect to be “appreciated” (rewarded) while they are being trained. How? Let’s see.

Rather than teaching the computer what to do, you let it decide what to do, and at the end of that action, you give either positive or negative feedback. Hence, rather than defining what is right and what is wrong in your system, you let your system “decide” what to do, and in the end give it feedback.

It’s just like training your dog. You cannot control what your dog does, right? But you can scold him when he does wrong, and pat him on the back when he does what is expected.

Let’s apply this understanding to the example above. Imagine you are training the temperature control system, so that whenever the number of people in the room increases, the system has to take an action: either lower the temperature or raise it. Since our system doesn’t understand anything yet, it takes a random decision. Let’s suppose it raises the temperature. Therefore, you give it negative feedback. With this, the computer learns that whenever the number of people in the room increases, it should not raise the temperature.

Similarly, you give feedback for other actions. With each piece of feedback your system learns and hence becomes more accurate in its next decision. This type of learning is called Reinforcement Learning.
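The feedback loop described above can be sketched in a few lines. This is a deliberately simplified version with a single situation (the room gets crowded), two actions, and a fixed reward of +1 or -1. The function name, the reward values and the parameters `epsilon` and `learning_rate` are all assumptions for illustration, not part of any particular library.

```python
import random

def train_thermostat(episodes=200, epsilon=0.1, learning_rate=0.5, seed=0):
    """Learn which action earns more reward when the room gets crowded."""
    rng = random.Random(seed)
    actions = ["raise", "lower"]
    # Estimated value of each action; the system starts out knowing nothing
    value = {action: 0.0 for action in actions}
    for _ in range(episodes):
        if rng.random() < epsilon:
            action = rng.choice(actions)          # explore: try something at random
        else:
            action = max(actions, key=value.get)  # exploit: use what was learnt so far
        # Feedback from the trainer: lowering is praised, raising is scolded
        reward = 1.0 if action == "lower" else -1.0
        # Nudge the estimate for this action towards the reward it just received
        value[action] += learning_rate * (reward - value[action])
    return value

values = train_thermostat()
best_action = max(values, key=values.get)
print(best_action)
```

Nowhere in the code is the system told that lowering the temperature is correct. It works that out from the rewards alone, just like the dog in the analogy.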

Now, the algorithms that we learnt above in this Data Science Tutorial involve a common “learning practice”. We are making the machine learn, right?

What is Machine Learning?

It is a type of Artificial Intelligence that makes computers capable of learning on their own, i.e. without being explicitly programmed for every case. With machine learning, a model adjusts its own internal parameters whenever it comes across new data, rather than having a programmer rewrite its rules.

To conclude, we now know that Data Science is backed by Machine Learning and its algorithms for analysis. But how do we do the analysis, and where do we do it? Data Science has some components which help us answer these questions.

Before that, let me answer how MIT can predict the future, because I think you might be able to relate to it now. Researchers at MIT trained their model on movies, and the computers learnt how humans respond, or how they act just before performing an action.

For example, when you are about to shake hands with someone, you take your hand out of your pocket, or maybe lean in towards the person. Basically, there is a “pre-action” attached to everything we do. With the help of movies, the computer was trained on these “pre-actions”. And by observing more and more movies, the computers were then able to predict what a character’s next action could be.

Easy, ain’t it? Let me throw one more question at you then! Which Machine Learning algorithm must have been implemented here?

Data Science Components

1. Datasets

What will you analyze? Data, right? You need a lot of data to analyze, and this data is fed to your algorithms or analytical tools. Such data often comes from research and records collected in the past.

2. RStudio

R is an open source programming language and software environment for statistical computing and graphics that is supported by the R Foundation. R is commonly used through an IDE called RStudio.

Why is it used?

Programming and Statistical Language

Apart from being used as a statistical language, it can also be used as a programming language for analytical purposes.

Data Analysis and Visualization

Apart from being one of the most dominant analytics tools, R is also one of the most popular tools used for data visualization.

Simple and Easy to Learn

R is simple and easy to learn, read and write.

Free and Open Source

R is an example of FLOSS (Free/Libre and Open Source Software), which means one can freely distribute copies of this software, read its source code, modify it, etc.

RStudio was sufficient for analysis until our datasets became huge and, at the same time, unstructured. This type of data came to be called Big Data.

3. Big Data

Big data is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications.

To tame this data, a new kind of tool was needed, because traditional software could not handle data of this kind. That tool was Hadoop.

4. Hadoop

Hadoop is a framework which helps us to store and process large datasets in parallel and in a distributed fashion.

Let’s focus on the store and process part of Hadoop.

Store

The storage part of Hadoop is handled by HDFS, i.e. the Hadoop Distributed File System. It provides high availability across a distributed ecosystem. It works by breaking the incoming data into chunks and distributing them to different nodes in a cluster, which allows distributed storage.
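The chunk-and-distribute idea can be pictured with a toy example. To be clear, this is not how HDFS actually decides where blocks go (the real system also keeps several copies of each block, among other things). The block size, the node names and the simple round-robin placement are all assumptions made for the sketch.

```python
def split_into_blocks(data, block_size):
    """Cut a byte string into consecutive blocks of at most `block_size` bytes."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_blocks(blocks, nodes):
    """Assign block numbers to nodes in round-robin order (illustrative only)."""
    placement = {node: [] for node in nodes}
    for index, _ in enumerate(blocks):
        placement[nodes[index % len(nodes)]].append(index)
    return placement

data = b"x" * 1000                      # stand-in for an incoming file
blocks = split_into_blocks(data, 256)   # hypothetical block size in bytes
placement = place_blocks(blocks, ["node-a", "node-b", "node-c"])
print(placement)
```

Because no single machine has to hold the whole file, the cluster can store files larger than any one disk.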

Process

MapReduce is the heart of Hadoop processing. It performs two important tasks, map and reduce. The input is split into smaller pieces which the mappers process in parallel. Once all the mappers have done their share of the work, their intermediate results are grouped together, and the Reduce step then combines them into a smaller, aggregated result.
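The classic way to picture map and reduce is a word count. The sketch below runs on a single machine in plain Python, so it only mimics the flow of data; in Hadoop the mappers and reducers would run on different nodes. The function names and the two text chunks are made up for illustration.

```python
from collections import defaultdict

def map_phase(chunk):
    """Mapper: emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]

def shuffle(mapped_pairs):
    """Group all emitted values by key so each word's counts end up together."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reducer: collapse each word's list of counts into a single total."""
    return {key: sum(values) for key, values in groups.items()}

# Each chunk could be handled by a different mapper
chunks = ["big data is big", "data science uses data"]
mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
counts = reduce_phase(shuffle(mapped))
print(counts)
```

Each mapper only needs to see its own chunk, which is what lets the map step run in parallel.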

If we use Hadoop as our storage in Data Science, it becomes difficult to process the input with RStudio, because R on its own does not perform well in a distributed environment. Hence we have SparkR.

5. SparkR

It is an R package that provides a lightweight way of using Apache Spark with R. Why would you use it over traditional R applications? Because it provides a distributed data frame implementation that supports operations like selection, filtering and aggregation, but on large datasets.
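If the three operations are new to you, here is what selection, filtering and aggregation mean, shown on a tiny in-memory table. Note that this is ordinary Python on a handful of made-up rows, not the SparkR API. SparkR offers the same kinds of operations, but runs them across a cluster.

```python
from collections import defaultdict

# Hypothetical sales records
rows = [
    {"city": "Pune", "product": "phone", "units": 3},
    {"city": "Delhi", "product": "phone", "units": 1},
    {"city": "Pune", "product": "case", "units": 5},
    {"city": "Delhi", "product": "case", "units": 4},
]

# Selection: keep only the columns we care about
selected = [{"city": r["city"], "units": r["units"]} for r in rows]

# Filtering: keep only the rows that satisfy a condition
filtered = [r for r in selected if r["units"] >= 2]

# Aggregation: total units per city
totals = defaultdict(int)
for r in filtered:
    totals[r["city"]] += r["units"]

print(dict(totals))
```

On four rows this is trivial, but the same three steps on billions of rows are where a distributed data frame earns its keep.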

Take a breather now! We are done with the technical part of this article, so let’s look at things from a job perspective. You have probably googled data scientist salaries by now, but still, let’s discuss the job roles which are available to you as a data scientist.

Data Scientist Job Roles

Some of the prominent Data Scientist job titles are:

  • Data Scientist
  • Data Engineer
  • Data Architect
  • Data Administrator
  • Data Analyst
  • Business Analyst
  • Data/Analytics Manager
  • Business Intelligence Manager

The Payscale.com chart in this Data Science Tutorial shows the average Data Scientist salary by skill in the USA and India.

The time is ripe to up-skill in Data Science and Big Data Analytics and take advantage of the Data Science career opportunities that come your way. This brings us to the end of this Data Science Tutorial. I hope this article was informative and added value for you.

If you wish to check out more articles on the market’s most trending technologies like Python, DevOps, Ethical Hacking, then you can refer to Edureka’s official site.

Originally published at www.edureka.co on June 5, 2017.
