Introduction to Data Science

Nupur Kapur
Published in Analytics Vidhya
6 min read · Feb 23, 2020

“Data Scientist: The Sexiest Job ever”. “Pursue data science: one of the top fields to pursue a career in.”

You have probably heard a lot about the buzzword “data science” and wondered what it actually is. Today, we will discuss what data science is, why we need it, and what its real-life use cases are.

Data science is an interdisciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from structured and unstructured data.

But why do we need it?

Around 15 years ago, most data was small and structured, and it was easy to analyze with simple tools like Excel. As the number of customers has grown, companies now collect huge amounts of data, much of it unstructured, and storing it became difficult. Big data tools like Hadoop solved the storage problem, but simple BI tools could not handle and analyze such large, unstructured data. Data science made it possible to analyze huge datasets using machine learning and deep learning algorithms and to draw meaningful insights from them, helping various industries manage and understand their data.

What goes into a Data Science Project?


Step 1: Business Understanding

This step is the starting point of any project. Questions are asked to understand the requirements and the objective of the project, and the requirements are noted down.

Step 2: Data Mining

This step involves gathering the data to begin the project. Either the company provides its customer data or the data is scraped from various sites by using scraping tools like Scrapy, BeautifulSoup, etc.

Step 3: Data Cleaning

The next step is to clean the data. It involves dealing with inconsistencies within the data and handling missing values. The simplest option is to delete the rows with missing values. If the number of missing values is large and deleting rows would lead to significant data loss, we can instead replace the missing values in a numeric column with the mean or median of that column.
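To make the two options concrete, here is a minimal sketch in plain Python. The dataset, the column names, and the helper functions `drop_missing` and `impute_median` are all made up for illustration; in a real project you would more likely use a library such as pandas for this.

```python
from statistics import median

# Hypothetical toy dataset: None marks a missing value.
rows = [
    {"age": 25, "income": 40000},
    {"age": None, "income": 52000},
    {"age": 31, "income": None},
    {"age": 40, "income": 61000},
]

def drop_missing(rows):
    """Option 1: keep only the rows that have no missing values."""
    return [r for r in rows if all(v is not None for v in r.values())]

def impute_median(rows, column):
    """Option 2: return copies of the rows with missing values in
    `column` replaced by the median of the observed values."""
    observed = [r[column] for r in rows if r[column] is not None]
    fill = median(observed)
    return [{**r, column: fill if r[column] is None else r[column]} for r in rows]

dropped = drop_missing(rows)  # loses half of this tiny dataset
imputed = impute_median(impute_median(rows, "age"), "income")  # keeps every row
```

On this toy data, dropping rows throws away two of the four records, while imputing keeps all four. That is the trade-off described above.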

Step 4: Data Exploration

In this step, we form hypotheses about our problem by visually analyzing the data. We use analysis tools like Tableau, Power BI, etc. to understand the data, the relationships between various features, and how the features impact each other.
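Alongside visual tools, a quick numeric check can also hint at how two features relate. The sketch below computes the Pearson correlation coefficient by hand in plain Python. The `ad_spend` and `sales` numbers are invented for illustration, and the `pearson` helper is my own, not part of any tool mentioned above.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation: close to +1 or -1 means a strong linear
    relationship, close to 0 means little linear relationship."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Hypothetical features: advertising spend and resulting sales.
ad_spend = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [2.0, 4.1, 5.9, 8.2, 9.8]

r = pearson(ad_spend, sales)
print(f"correlation between ad_spend and sales: {r:.3f}")
```

A value close to 1 here would support the hypothesis that sales rise with advertising spend. It does not prove that one causes the other.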

Step 5: Feature Engineering

In this step, we select the features in the dataset that influence our results the most and create new features from the raw data so that better results can be derived. We can combine two features or remove some features for better analysis and better results.
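As a small illustration, the sketch below combines two raw features into a new one and removes a feature that carries no signal. The records, the choice of body mass index as the derived feature, and the helper names `add_bmi` and `drop_features` are assumptions made for this example.

```python
# Hypothetical raw records.
people = [
    {"id": 1, "height_m": 1.60, "weight_kg": 64.0},
    {"id": 2, "height_m": 1.80, "weight_kg": 72.9},
]

def add_bmi(record):
    """Combine two raw features (weight and height) into one derived feature."""
    bmi = record["weight_kg"] / record["height_m"] ** 2
    return {**record, "bmi": round(bmi, 1)}

def drop_features(record, names):
    """Remove features that are not useful for the model, such as an ID."""
    return {key: value for key, value in record.items() if key not in names}

engineered = [drop_features(add_bmi(p), {"id"}) for p in people]
```

Each engineered record now has a `bmi` value that a model can use directly, and the `id` column, which says nothing about the person, is gone.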

Step 6: Predictive Modelling

This step involves training our machine learning models on the enhanced version of the dataset, evaluating each model’s performance by checking how accurate its results are, and later using the model to make predictions.
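The train, evaluate, predict loop can be sketched with a very simple model. The example below uses a 1-nearest-neighbour classifier written in plain Python, chosen only because it fits in a few lines. The data, the split, and the function names are hypothetical, and a real project would normally use a library such as scikit-learn.

```python
def predict_1nn(train, x):
    """Predict the label of x as the label of the closest training point."""
    nearest = min(train, key=lambda item: abs(item[0] - x))
    return nearest[1]

def accuracy(train, test):
    """Fraction of held-out examples whose label is predicted correctly."""
    correct = sum(predict_1nn(train, x) == y for x, y in test)
    return correct / len(test)

# Hypothetical (feature, label) pairs: hours studied -> passed (1) or failed (0).
data = [(1.0, 0), (2.0, 0), (3.0, 0), (4.0, 1),
        (6.0, 1), (7.0, 1), (8.0, 1), (9.0, 1)]

train = data[::2]   # every other example is used for training
test = data[1::2]   # the remaining examples are held out for evaluation

score = accuracy(train, test)         # evaluate on data the model has not seen
prediction = predict_1nn(train, 7.5)  # use the model on a new input
```

The key design choice is that the model is scored on examples it was not trained on. Otherwise the accuracy would look better than it really is.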

Step 7: Data Visualization

In the last step of the life cycle, the results are communicated to the stakeholders using various plots and interactive visualizations. The plots are designed so that the findings of the project are easy to explain to the end user.

Use-cases of Data Science

Healthcare

Data science is being used extensively in healthcare these days. One example is analyzing images of lungs to decide whether a patient has pneumonia. Deep learning algorithms learn from previous examples and can pick up complex patterns that are difficult for humans to spot, which has led to more accurate results. The algorithms are also used to detect cancer at much earlier stages and to offer personalized treatment.

Other areas where data science is being used in health care:

  • Medical Imaging
  • Genomes
  • Drug Discovery
  • Monitoring Patient’s Health

Recommender Systems


Shopping websites, social media platforms, and media-service providers like Netflix and Amazon Prime use recommendation systems to give their customers a personalized experience on their platforms. Shopping websites like Amazon and Flipkart try to understand your likes and dislikes and recommend items accordingly, or they recommend another user’s choices if you and that user have similar tastes. Netflix and Amazon Prime recommend movies in the genres you like, or recommend other users’ choices if you share the same interests.
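The idea of recommending another user’s choices when your tastes overlap can be sketched in a few lines. This is only a toy version, assuming Jaccard similarity over sets of liked titles. The users, the titles, and the `recommend` helper are made up, and this is not how Netflix or Amazon actually implement their systems.

```python
def jaccard(a, b):
    """Similarity of two sets of liked items: shared items / all items."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def recommend(user, likes):
    """Suggest items liked by the most similar other user
    that `user` has not liked yet."""
    others = [u for u in likes if u != user]
    most_similar = max(others, key=lambda u: jaccard(likes[user], likes[u]))
    return sorted(likes[most_similar] - likes[user])

# Hypothetical viewing history.
likes = {
    "asha": {"Inception", "Interstellar", "The Matrix"},
    "ben": {"Inception", "Interstellar", "Tenet"},
    "chen": {"Notting Hill", "La La Land"},
}

suggestions = recommend("asha", likes)
print(suggestions)
```

Here the second user shares two titles with the first, so the one title the first user has not seen yet becomes the suggestion.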

Fraud and Risk Detection

Fraud is a billion-dollar business that happens every day, and it increases every year. Common examples include fraudulent credit card transactions, cell phone fraud, false tax return claims, etc. To prevent fraud, expert systems can be used, which look for fraud based on rules. Pattern recognition can be used to study the patterns and behavior of past frauds and look for those patterns in new activity. Neural networks can also be used for fraud detection since they can learn complex, non-linear patterns and give valuable insights.
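A rule-based check of the kind an expert system performs might look like the sketch below. The rules, the thresholds, and the transaction fields are invented for illustration. Real systems use far more rules and usually combine them with learned models.

```python
# Hypothetical rules; every threshold here is illustrative only.
RULES = {
    "large_amount": lambda t: t["amount"] > 5000,
    "foreign_country": lambda t: t["country"] != t["home_country"],
    "night_time": lambda t: t["hour"] < 5,
}

def triggered_rules(transaction):
    """Names of all rules that fire for this transaction."""
    return [name for name, rule in RULES.items() if rule(transaction)]

def is_suspicious(transaction, min_rules=2):
    """Flag a transaction when at least `min_rules` rules fire."""
    return len(triggered_rules(transaction)) >= min_rules

normal = {"amount": 120, "country": "IN", "home_country": "IN", "hour": 14}
odd = {"amount": 9000, "country": "BR", "home_country": "IN", "hour": 3}

print(is_suspicious(normal), triggered_rules(normal))
print(is_suspicious(odd), triggered_rules(odd))
```

Requiring two rules to fire, rather than one, is a simple way to cut down on false alarms, at the cost of missing frauds that trip only a single rule.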

Targeted Advertising

Have you ever searched for something and, minutes later, started seeing advertisements for it on social media and the other websites you visit? That is called targeted advertising. Such tactics help businesses because users find it easier to click through and buy from the advertised company than to search every site on the internet. A user’s laziness becomes their business idea, and in this way these companies are earning billions!

Transport

Data science is being used extensively in the making of self-driving cars. In demonstrations of such cars, the driver simply sits and relaxes while the car reaches the destination without the driver’s help, making turns on its own. The safety of the passengers is also enhanced, since the car keeps monitoring the road for upcoming dangers and takes measures to prevent accidents, such as reducing its speed or taking a turn carefully when the road curves sharply.

Pre-requisites of Data Science

Image Credit: https://www.corpnce.com/data-science-courses-bangalore/

The above diagram shows the skills required to become a data scientist. Maths and statistics are required to understand the data and the variables in the algorithms. Computer programming skills are important for understanding the algorithms and tuning their parameters when needed. You should know the common machine learning algorithms and understand which one to use in a particular case. You also need domain knowledge to understand the business problem properly. Finally, you need good communication skills, because a data scientist needs to be a good storyteller.

Thank you for reading my blog! If you have any recommendations, feel free to leave a comment.

Feel free to connect with me on LinkedIn https://www.linkedin.com/in/nupur-kapur-nk/


Data Science Enthusiast. I am quite inquisitive to learn about new technologies. I love attending tech meetups to learn about developments in the field of AI.