Let’s learn about Cutting Edge Technologies

What is Data Science? [2022]

Let’s unpack the what, why, and how of data science. I hope that after reading this, you’ll have a better grasp of how data is being used around you, and of how you can use data yourself and become a working professional. Interesting, right? So let’s dive into it.

Abhishek Kukreja
10 min read · Feb 6, 2022


First things first, let’s ask Google!

If we Google “What is data science?”, we’ll see a huge amount of confusing information. Is it the science of data? Do we take normal science and then add data to it? Of course, you can also land on the right answer; it’s Google, after all.

Making data work for you

But data science is actually simple. It’s a set of methodologies for taking in the thousands of forms of data available to us today and using them to draw meaningful conclusions. Data is being collected all around us: every like, click, email, credit card swipe, tweet, or even this article is a new piece of data that can be used to better describe the present or better predict the future (sounds fun!).

What can data do?

  • Describe the current state of an organization or process.
  • If we have data on what has happened previously, we can increase efficiency by automatically detecting a new event that is unexpected or abnormal. In simple words, detect anomalous events.
  • Diagnose the causes of observed events and behaviors, for instance your activity on Spotify, Netflix, or Google. Rather than determining correlations between small numbers of events, data science techniques help us understand complex systems with many possible causes.
  • Finally, data can predict future events (like astrology, hell yeah), such as forecasting population size. We can use new techniques to take various causes into account and predict potential outcomes. Further, we can evaluate the probability of our prediction mathematically to clarify our level of uncertainty.
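The second point above — detecting anomalous events — can be sketched in a few lines. This is a minimal illustration with made-up numbers: flag any value that sits unusually far from the average of what has happened previously.

```python
import statistics

def detect_anomalies(values, threshold=2.0):
    """Flag values more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)
    return [v for v in values if abs(v - mean) > threshold * stdev]

# Hypothetical daily transaction counts; the spike on the last day is abnormal.
daily_counts = [102, 98, 105, 99, 101, 97, 103, 100, 410]
print(detect_anomalies(daily_counts))  # [410]
```

Real anomaly-detection systems are far more sophisticated, but the core idea is the same: learn what “normal” looks like from past data, then flag what doesn’t fit.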

So Why now?

So now we know what data science is. The next question is: why is it so popular? Is it because the name is so cool? Nope; it’s because we’re collecting more data than ever before. Suppose you visit a car dealership and fill out some information (you can also think of a normal search on Amazon or any e-commerce platform: you give your email address and show interest in a product, and later you find an ad for that same product on third-party applications and wonder, “damn, how does this app know what I want?” — that’s its recommendation algorithm working its magic). So let’s get into more detail without being too technical.

All of that data (like the email in this case) is automatically entered into a computer and combined with the data from hundreds of dealerships into one big database. Once we have that data, it’s easy to use the email address that you provided when you bought that car to tie your car purchase data to your data from social media or web browsing. Suddenly, we have a very complete picture of everyone who purchased a car in the last year: their ages, their likes and dislikes, their friends and family (I know this is scary as it concerns your privacy, but this post is just for fun). This additional data can be used to predict what price you’re willing to pay for your car, what other purchases you’re likely to make, or how best to sell you insurance for that new car. Data is everywhere, and it is incredibly valuable information for businesses, organizations, and governments.
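The “tie your purchase data to your other data” step is just a join on a shared key. Here is a toy sketch of the idea — all names, emails, and values below are invented for illustration:

```python
# Hypothetical dealership purchase records and social profiles,
# both keyed by email address.
purchases = [
    {"email": "sam@example.com", "car": "hatchback", "price": 18000},
    {"email": "ana@example.com", "car": "suv", "price": 32000},
]
social_profiles = {
    "sam@example.com": {"age": 29, "likes": ["hiking", "podcasts"]},
    "ana@example.com": {"age": 41, "likes": ["cooking"]},
}

# The email acts as a join key, producing a fuller picture of each buyer.
enriched = [{**p, **social_profiles.get(p["email"], {})} for p in purchases]
print(enriched[0]["age"])  # 29
```

In practice this join happens inside a database with SQL rather than in Python, but the principle — one shared identifier links otherwise separate datasets — is exactly the same.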

“Data is the new oil.”
— Clive Humby

The data science workflow[Easy Peasy]

So, how do we start to use data?

In data science, we generally have four steps to any project (mostly). First, we collect data from many sources, such as surveys, web traffic results, geotagged social media posts, and financial transactions. Once collected, we store that data in a safe (privacy, haha) and accessible way.

At this point, the data is in its raw form, so the next step is to prepare it. This includes “cleaning data”, for instance finding missing or duplicate values, and converting the data into a more organized format.
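A minimal sketch of that cleaning step, using only the standard library (in practice a library such as pandas handles this, but the idea is the same): drop rows with missing values and drop exact duplicates.

```python
# Toy raw data with the two classic problems: a duplicate and a missing value.
raw = [
    {"user": "a", "amount": 10.0},
    {"user": "a", "amount": 10.0},   # duplicate row
    {"user": "b", "amount": None},   # missing value
    {"user": "c", "amount": 7.5},
]

seen, cleaned = set(), []
for row in raw:
    key = (row["user"], row["amount"])
    if row["amount"] is None or key in seen:
        continue                      # skip missing and duplicate rows
    seen.add(key)
    cleaned.append(row)

print(len(cleaned))  # 2 rows survive
```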

Then, we explore and visualize the cleaned data. This could involve building dashboards to track how the data changes over time or performing comparisons between two sets of data.
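Exploration often starts even simpler than a dashboard: compute summary statistics for the groups you want to compare. A sketch with invented numbers, say daily site visits before and after a redesign:

```python
import statistics

# Hypothetical daily visit counts before and after a site redesign.
before = [120, 135, 128, 140, 131]
after = [150, 162, 158, 171, 166]

# A first exploratory comparison: the centre and spread of each group.
for name, series in [("before", before), ("after", after)]:
    print(name, statistics.mean(series), statistics.stdev(series))
```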

Finally, we run experiments and predictions on the data. For example, this could involve building a system that forecasts temperature changes or performing a test to find which web page acquires more customers.
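That “which web page acquires more customers” test is usually an A/B test. Here is a minimal sketch of the underlying statistic — a pooled two-proportion z-test — with made-up signup counts:

```python
import math

def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """z-statistic comparing two conversion rates (pooled standard error)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se

# Page A: 200 signups out of 5000 visitors; page B: 260 out of 5000.
z = two_proportion_z(200, 5000, 260, 5000)
print(round(z, 2))  # 2.86
```

A z-score near 2.9 would typically be taken as evidence that page B’s higher conversion rate is not just noise; real experiments also involve choosing sample sizes and significance thresholds in advance.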

Applications of data science

So now we have learned the definition of data science and the steps in a data science workflow. Now let's understand how data science can be applied to real-world problems.

Learning through case studies is the best way

Let’s take a deep dive into three exciting areas of data science: traditional machine learning, the Internet of Things, and deep learning.

First Case study: fraud detection


Suppose you work in fraud detection at a large bank. You’d like to use data to determine the probability that a given transaction is fraudulent.

To answer this question, you might start by gathering information about each purchase, such as the amount, date, location, purchase type, and cardholder’s address. You’ll need many examples of transactions, including this information, as well as a label that tells you whether each transaction is valid or fraudulent. Luckily, you probably have this information in a database. These records are called “training data” and are used to build an algorithm. Each time a new transaction occurs, you’ll give your algorithm information, like amount and date, and it will answer the original question: What is the probability that this transaction is fraudulent?
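To make the idea concrete, here is an illustrative-only logistic model with hand-set weights scoring a transaction. A real system would learn these weights from the labelled training data described above (for example with scikit-learn); everything here — the features, the weights, the numbers — is invented:

```python
import math

def fraud_probability(amount, hour, far_from_home):
    # Made-up weights: large amounts, late-night hours, and purchases far
    # from the cardholder's address all push the score upward.
    score = -4.0 + 0.002 * amount + 1.5 * (hour < 6) + 2.0 * far_from_home
    # Squash the raw score into a probability between 0 and 1.
    return 1 / (1 + math.exp(-score))

print(fraud_probability(amount=950, hour=3, far_from_home=True))   # ~0.80
print(fraud_probability(amount=40, hour=14, far_from_home=False))  # ~0.02
```

The output is exactly the kind of answer the question asks for: not a hard yes/no, but a probability that a human or a downstream rule can act on.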

What do we need for machine learning?

Before we can answer that question, let’s walk through our example and highlight what we need for machine learning to work its magic. First, a data science problem begins with a well-defined question.

Our question was: “What is the probability that this transaction is fraudulent?”

Next, we need some data to analyze. We have months of old credit card transactions and associated metadata, like date and location, that have already been identified as either fraudulent or valid. Finally, we need additional data every time we want to make a new prediction.

We need the same type of information on every new purchase so that we can label it as “fraudulent” or “valid”.

Second Case study: smartwatch

Now, suppose you’re trying to build a smartwatch to monitor physical activity. You want to be able to auto-detect different activities, such as walking or running. Your smartwatch is equipped with a special sensor, called an “accelerometer”, that monitors motion in three dimensions. The data generated by this sensor is the basis of your machine learning problem. You could ask several volunteers to wear your watch and record when they are running or walking. You could then develop an algorithm that recognizes accelerometer data as representing one of those two states: walking or running.
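One of the simplest algorithms that could do this recognition is nearest-neighbour classification: label a new window of sensor readings with the activity of the most similar training example. A toy sketch — the feature values (mean acceleration magnitude and its variance per window) are invented for illustration:

```python
import math

# Toy training data recorded by volunteers: (mean magnitude, variance)
# per window of accelerometer readings, plus the activity label.
training = [
    ((1.1, 0.2), "walking"), ((1.3, 0.3), "walking"),
    ((2.8, 1.4), "running"), ((3.1, 1.7), "running"),
]

def classify(features):
    """1-nearest-neighbour: use the label of the closest training example."""
    return min(training, key=lambda t: math.dist(features, t[0]))[1]

print(classify((2.9, 1.5)))   # running
print(classify((1.2, 0.25)))  # walking
```

A production smartwatch would use many more features, far more training data, and a more robust model, but the shape of the problem — labelled sensor windows in, activity label out — is the same.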

Internet of Things (IoT)

Your smartwatch is part of a fast-growing field called “the Internet of Things”, also known as IoT, which is often combined with Data Science. IoT refers to gadgets that are not standard computers but still have the ability to transmit data. This includes

  • Smartwatches
  • Internet-connected home security systems
  • Electronic toll collection systems
  • Building energy management systems, and much, much more.

IoT data is a great resource for data science projects!

Third Case study: Image recognition

Let’s tackle another example. A key task for self-driving cars(like Tesla) is identifying when an image contains a human. What would the dataset be for this problem?

We could express the picture as a matrix of numbers where each number represents a pixel. However, this approach would probably fail if we fed the matrix into a traditional machine learning model. There’s simply too much input data!


We need more advanced algorithms from a subfield of machine learning called deep learning. In deep learning, multiple layers of mini-algorithms, called “neurons”, work together to draw complex conclusions; you can think of it just like our brains. Deep learning requires much, much more training data than a traditional machine learning model, but it is also able to learn relationships that traditional models cannot. Deep learning is used to solve data-intensive problems, such as image classification or language understanding.
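The “layers of neurons” idea can be sketched in a few lines. Each neuron is just a weighted sum passed through a nonlinearity, and a layer is a row of them; the weights below are arbitrary placeholders, whereas a real network learns them from data:

```python
import math

def neuron(inputs, weights, bias):
    """One 'neuron': a weighted sum squashed through a nonlinearity."""
    return math.tanh(sum(w * x for w, x in zip(weights, inputs)) + bias)

def layer(inputs, weight_rows, biases):
    """A layer is just several neurons reading the same inputs."""
    return [neuron(inputs, w, b) for w, b in zip(weight_rows, biases)]

# Two tiny layers chained together; deep networks stack many such layers.
x = [0.5, -1.0, 2.0]
h = layer(x, [[0.2, -0.4, 0.1], [0.7, 0.3, -0.5]], [0.0, 0.1])
y = layer(h, [[1.0, -1.0]], [0.0])
print(y)
```

Frameworks like TensorFlow and PyTorch do exactly this composition, just with millions of learned weights and efficient hardware behind it.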

Roles in data science [How to be a working professional in this field]

You might be surprised to learn that there isn’t a single job within data science. Generally, there are four jobs: Data Engineer, Data Analyst, Data Scientist, and Machine Learning Scientist. Let’s explore each one.

Data engineer

Data engineers control the flow of data:

  • They build custom data pipelines and storage systems.
  • They design infrastructure so that data is not only collected but easy to obtain and process.
  • Within the data science workflow, they focus on the first stage: data collection and storage.

Data engineering tools

  • Data engineers are proficient in SQL, which they use to store and organize data.
  • They also use a programming language such as Java, Scala, or Python to process data.
  • They use Shell on the command line to automate and run tasks.
  • Finally, data engineers, now more than ever, need to be comfortable with cloud platforms such as AWS, GCP, and Azure to ingest and store large amounts of data.

Data analyst

  • Data analysts describe the present via data.
  • They do this by exploring the data and creating visualizations and dashboards.
  • To do these tasks, they often have to clean data first. Analysts have less programming and stats experience than the other roles.
  • Within the workflow, they focus on the middle two stages: data preparation, and exploration and visualization.

Data analyst tools

  • They use SQL, the same language used by data engineers, to query data.
  • While data engineers build and configure SQL storage solutions, analysts use existing databases to retrieve and aggregate data relevant to their analysis.
  • Data analysts use spreadsheets to perform simple analyses on small quantities of data. Analysts also use Business Intelligence, or BI tools, such as Tableau, Power BI, or Looker, to create dashboards and share their analyses.
  • More advanced data analysts may be comfortable with Python or R for cleaning and analyzing data.

Data scientist [The sexiest job of the 21st century]

  • Data Scientists have a strong background in statistics, enabling them to find new insights from data, rather than solely describing data.
  • They also use traditional machine learning for prediction and forecasting.
  • Within the workflow, they focus on the last three stages: data preparation, exploration and visualization, and experimentation and prediction.

Data scientist tools

  • Similar to analysts, data scientists have strong skills in SQL.
  • Data scientists must be proficient in at least one of Python or R.
  • Within these languages, they use popular data science libraries, such as pandas (in Python) or the tidyverse (in R). These libraries contain reusable code for common data science tasks.

Machine learning scientist [The hottest one, according to me]

  • Machine learning scientists are similar to data scientists, but with a machine learning specialization.
  • Machine learning is perhaps the buzziest part of Data Science; it’s used to extrapolate what’s likely to be true from what we already know.
  • These scientists use training data to classify larger, unrulier data, whether that’s classifying images that contain a car or building a chatbot.
  • They go beyond traditional machine learning with deep learning. Within the workflow, they do the last three stages with a strong focus on prediction.

Machine learning tools

  • Machine learning scientists use either Python or R to create their predictive models.
  • Within these languages, they use popular machine learning libraries, such as TensorFlow or PyTorch, to run powerful deep learning algorithms.

So that’s it! If you understood something from this, I feel powerful, haha. Jokes aside, thanks for giving this blog a read. Let’s learn more about this bleeding-edge technology in upcoming blogs, and believe in the process and, obviously, in yourself.

You can also subscribe to my YouTube channel, where I post videos about data science in Hindi (an Indian language). The name of the channel is “The Coding LaaLa”.


Abhishek Kukreja

I love data, so anything related to it fascinates me, be it Machine Learning, Data Science, Deep Learning, or Cloud Computing; sharing my learning journey ⚡.