What is Data Science & AI and how to start?

Mohamad Ashour
7 min read · Feb 7, 2023

What the future may look like with AI.

Data science combines the scientific method, math and statistics, specialized programming, advanced analytics, AI, and even storytelling to uncover and explain the business insights buried in data.

The data scientist's role covers everything about data, whether it is structured and ready for processing or unstructured. The main goal of the data scientist is to search for patterns in the data and discover insights, such as patterns that repeat. This is usually done by building models (mathematical, statistical, machine learning, or deep learning models) that help anticipate what will happen in the future.

Data scientists generally combine a mathematical or statistical background with computer science, which makes them rare and in demand, because combining these skills is not easy.

Data Science Project Life Cycle

It’s important to first note that the data science lifecycle may look a little different to everyone. There are a few different interpretations, although they all generally resemble the following structure:

Data science life cycle.

1. Define and understand the problem

A problem cannot be solved if you don’t know what the problem is.

It’s important to identify and understand the problem you are trying to solve, and to ask relevant questions that help you understand it and solve it efficiently.

One of the key questions that executives should be asked is how solving the problem will benefit the company (or its customers) and how the problem fits into the other processes of the company.

2. Data collection

Once you have asked the right questions and have a clear idea of the problem you’re trying to solve, you need to collect the right data.

Some data collection techniques.

It’s always a good idea to collect more data than you think you’ll need. You can use Kaggle, web scraping, and databases to collect the data you need to solve your problem.

Kaggle:

Kaggle is an online community platform for data scientists and machine learning enthusiasts.

Kaggle: https://www.kaggle.com

Web scraping

Web scraping is the process of using bots to extract content and data from a website.

Source: Web scraping basics: A developer’s guide to reliably extract data | by Zyte | Medium
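
To make the idea concrete, here is a minimal sketch of the extraction half of web scraping, using only Python's standard library. The `SAMPLE_HTML` string, the `LinkExtractor` class, and the `extract_links` helper are assumptions made up for this illustration. A real scraper would first download the page over HTTP and should respect the site's terms of use.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects the href and visible text of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []            # (href, text) pairs found so far
        self._current_href = None  # href of the <a> tag we are inside, if any
        self._text_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._current_href = dict(attrs).get("href")
            self._text_parts = []

    def handle_data(self, data):
        # Only keep text that appears inside a link.
        if self._current_href is not None:
            self._text_parts.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._current_href is not None:
            text = "".join(self._text_parts).strip()
            self.links.append((self._current_href, text))
            self._current_href = None


# Hypothetical page content, hard-coded so the example runs offline.
SAMPLE_HTML = """
<html><body>
  <h1>Datasets</h1>
  <a href="/datasets/titanic">Titanic</a>
  <a href="/datasets/iris">Iris</a>
</body></html>
"""


def extract_links(html):
    """Parse an HTML string and return its links as (href, text) pairs."""
    parser = LinkExtractor()
    parser.feed(html)
    return parser.links


links = extract_links(SAMPLE_HTML)
print(links)
```

The parser walks through the page tag by tag, which is the same thing a scraping bot does at a larger scale before storing the results as a data set.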

3. Data cleaning and preparation

As described in the previous step, collecting more data than you think you’ll need is always valuable. Most of the time, though, you will need to handle errors in the data and combine datasets and columns to extract new features that are relevant to your goal and solution.

Data science is all about working smart, not hard. This means that in order to produce the right models in step five of the process, you need to properly clean and prepare the data you plan on using.

In most cases you will find many errors in your data, such as:

1 - Missing Values.

Missing values represented by “?”.

2 - Outliers.

Outlier represented by the green point.

3 - Duplicated Values.

Duplicated Values.

4 - Irrelevant symbols such as %, &, or $, which you need to remove.

Irrelevant symbols like $.

5 - Imbalanced Data.

A classification data set with skewed class proportions is called imbalanced.

Imbalanced Data.
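
The five problems above can be sketched in a few lines of plain Python. Everything in this example is an assumption for illustration: the `raw_rows` records, the `parse_amount` helper, the choice to fill missing values with the median, and the 1.5 × IQR rule for flagging outliers. In a real project you would usually reach for a library such as pandas, and the right fix depends on your data.

```python
from collections import Counter
from statistics import median, quantiles

# Hypothetical raw records: (customer_id, amount, label)
raw_rows = [
    ("c1", "$120", "no"),
    ("c2", "?", "no"),       # missing value marked with "?"
    ("c3", "95%", "no"),     # stray "%" symbol
    ("c3", "95%", "no"),     # exact duplicate of the previous row
    ("c4", "110", "yes"),
    ("c5", "$9000", "no"),   # suspiciously large amount
    ("c6", "105", "no"),
    ("c7", "130", "no"),
]


def parse_amount(text):
    """Strip irrelevant symbols; return None for missing values."""
    cleaned = text.strip().strip("$%&")
    return float(cleaned) if cleaned not in ("", "?") else None


# 1. Duplicated values: keep the first occurrence of each identical row.
rows = list(dict.fromkeys(raw_rows))

# 2. Irrelevant symbols and missing values.
parsed = [(cid, parse_amount(amount), label) for cid, amount, label in rows]
known = [amount for _, amount, _ in parsed if amount is not None]
fill_value = median(known)  # the median is less sensitive to outliers than the mean
filled = [(cid, fill_value if a is None else a, label) for cid, a, label in parsed]

# 3. Outliers: flag amounts that fall outside 1.5 * IQR from the quartiles.
q1, _, q3 = quantiles([a for _, a, _ in filled], n=4)
low, high = q1 - 1.5 * (q3 - q1), q3 + 1.5 * (q3 - q1)
outliers = [(cid, a) for cid, a, _ in filled if not low <= a <= high]

# 4. Imbalanced data: compare how often each class label appears.
class_counts = Counter(label for _, _, label in filled)

print("rows after de-duplication:", len(rows))
print("fill value for missing amounts:", fill_value)
print("outliers:", outliers)
print("class counts:", dict(class_counts))
```

Note that the sketch only flags outliers and class imbalance. Whether to drop, cap, or resample is a judgment call that depends on the problem you defined in step one.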

4. Exploratory data analysis

This is arguably the first “fun” step in the data science project lifecycle as you finally get to write some code and see what all of the data you’ve painstakingly cleaned is trying to tell you.

Exploratory Data Analysis (EDA) is an approach to analyzing data sets to summarize their main characteristics, often with visual methods. These visualizations help you quickly see patterns and spot anomalies, and they show what the data can tell you before the modeling task.

EDA
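
As a rough illustration of what a first EDA pass can look like, the sketch below computes summary statistics and prints a text histogram using only the standard library. The `ages` sample and the `histogram` helper are made up for this example. In practice you would more likely use plotting libraries for the visual part.

```python
from collections import Counter
from statistics import mean, median, stdev

# Hypothetical sample: ages of 12 customers.
ages = [22, 25, 25, 27, 29, 31, 33, 34, 35, 38, 41, 67]

# Summary statistics give a first feel for the center and spread of the data.
summary = {
    "count": len(ages),
    "min": min(ages),
    "max": max(ages),
    "mean": round(mean(ages), 2),
    "median": median(ages),
    "stdev": round(stdev(ages), 2),
}


def histogram(values, bin_width):
    """Count how many values fall into each bin of the given width."""
    bins = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(bins.items()))


for key, value in summary.items():
    print(f"{key:>6}: {value}")

# A text histogram: one '#' per observation in each 10-year bin.
for start, count in histogram(ages, 10).items():
    print(f"{start:>3}-{start + 9:<3} {'#' * count}")
```

Even this small output tells a story: most customers are in their twenties and thirties, and the single value in the 60-69 bin is the kind of anomaly worth a closer look.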

5. Model building and deployment

You’ve now reached the second and final “fun” step of the data science project lifecycle. Now is the time to split your data set into train and test sets that will be used to develop your machine learning models.
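
Here is a minimal sketch of a train/test split, assuming a simple random shuffle is appropriate for your data. The `train_test_split` function below is a hand-written helper for illustration (machine learning libraries ship their own versions), and the 80/20 ratio and the seed are arbitrary choices.

```python
import random


def train_test_split(rows, test_ratio=0.2, seed=42):
    """Shuffle a copy of rows and split it into (train, test) lists."""
    shuffled = list(rows)
    # A seeded generator makes the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_ratio))
    return shuffled[n_test:], shuffled[:n_test]


data = list(range(10))  # stand-in for ten rows of a real data set
train, test = train_test_split(data, test_ratio=0.2)

print("train:", train)
print("test:", test)
```

The model only ever sees the train set while it learns. The test set is held back so you can check how well the model handles data it has never seen.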

Here is where you’ll determine whether you need to create a supervised or unsupervised machine learning model.

Supervised models are used to classify unseen data and forecast future trends and outcomes by “learning” patterns in the training data.

Supervised learning and unsupervised learning.

Examples of supervised machine learning include:

  • Classification, identifying input data as part of a learned group.
  • Regression, predicting a continuous numeric value, such as a price or a temperature.
Classification & Regression
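
One way to see the difference between the two is a k-nearest-neighbors sketch, where the same neighbors are used either to vote on a label (classification) or to average a number (regression). k-nearest neighbors is just one technique picked for this illustration, and the study-hours data below is invented.

```python
from collections import Counter
from statistics import mean


def nearest_neighbors(train, x, k):
    """Return the targets of the k training points closest to x."""
    by_distance = sorted(train, key=lambda pair: abs(pair[0] - x))
    return [target for _, target in by_distance[:k]]


def knn_classify(train, x, k=3):
    """Classification: majority vote among the neighbors' labels."""
    return Counter(nearest_neighbors(train, x, k)).most_common(1)[0][0]


def knn_regress(train, x, k=3):
    """Regression: average of the neighbors' numeric targets."""
    return mean(nearest_neighbors(train, x, k))


# Hypothetical training data: hours studied -> pass/fail label
labeled = [(1, "fail"), (2, "fail"), (3, "fail"), (6, "pass"), (7, "pass"), (8, "pass")]
# Hypothetical training data: hours studied -> exam score
scored = [(1, 35), (2, 42), (3, 50), (6, 71), (7, 78), (8, 85)]

print(knn_classify(labeled, 6.5))  # predicts a category
print(knn_regress(scored, 7))      # predicts a number
```

Both functions learn from labeled examples, which is what makes them supervised. Only the type of the answer differs.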

Unsupervised models are used to find similarities within data, understand relationships between different data points within a set, and perform additional data analyses.

Examples of unsupervised machine learning include:

  • Clustering, grouping together data points with similar characteristics.
  • Association, understanding how certain data features connect with other features.
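
Clustering can be sketched with a tiny one-dimensional version of k-means, one common clustering algorithm chosen here as an example. The `kmeans_1d` helper, the `spending` values, and the starting centers are all assumptions for illustration. Notice that no labels are given to the algorithm.

```python
from statistics import mean


def kmeans_1d(values, centers, max_iterations=100):
    """Tiny k-means for 1-D data, starting from the given centers."""
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(max_iterations):
        # Assignment step: attach each value to its nearest center.
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Update step: move each center to the mean of its cluster.
        new_centers = [mean(c) if c else centers[i] for i, c in enumerate(clusters)]
        if new_centers == centers:
            break  # nothing moved, so the grouping is stable
        centers = new_centers
    return centers, clusters


# Hypothetical monthly spending of seven customers (no labels attached).
spending = [10, 12, 11, 50, 52, 49, 51]
centers, clusters = kmeans_1d(spending, centers=[0, 60])

print("centers:", centers)
print("clusters:", clusters)
```

The algorithm discovers a low-spending group and a high-spending group on its own, which is the kind of similarity that unsupervised models are meant to find.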

Artificial Intelligence

AI

Artificial intelligence is the simulation of human intelligence processes by machines, especially computer systems.

Humans can learn, see, speak, classify, walk, and perform many other operations, so the goal of AI is to simulate these human tasks and perform them with good accuracy.

Artificial Intelligence is a Function Approximation

AI helps people and software engineers define algorithms to solve complex problems.

The traditional approach to system development (traditional programming) works as follows:

  • We give the computer inputs and an algorithm, and it produces the output.

The machine learning (AI) approach to system development works as follows:

  • We give the computer inputs and outputs, and the computer gives us the algorithm that solves the problem.
  • The AI approach is useful when we have a complex problem and a large amount of data, because traditional programming depends on programmers to define the algorithms, which is complex and hard for engineers.
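
The contrast between the two approaches can be shown with a toy problem: converting Celsius to Fahrenheit. In the sketch below, `celsius_to_fahrenheit` is the rule a programmer writes by hand, while `fit_line` estimates the same rule from input/output pairs using ordinary least squares. The function names and the sample temperatures are assumptions for this illustration, and least squares is only one simple way to approximate a function from examples.

```python
from statistics import mean


def celsius_to_fahrenheit(c):
    """Traditional programming: the programmer writes the rule."""
    return c * 9 / 5 + 32


def fit_line(xs, ys):
    """Machine learning: estimate slope and intercept from examples
    using ordinary least squares."""
    x_bar, y_bar = mean(xs), mean(ys)
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    return slope, y_bar - slope * x_bar


# Input/output pairs only; the formula itself is never given to fit_line.
celsius = [-40, 0, 10, 25, 37, 100]
fahrenheit = [-40, 32, 50, 77, 98.6, 212]

slope, intercept = fit_line(celsius, fahrenheit)


def learned_conversion(c):
    """The rule recovered from the data."""
    return slope * c + intercept


print("hand-written rule:", celsius_to_fahrenheit(20))
print("learned rule:     ", round(learned_conversion(20), 2))
```

Here the rule is simple enough to write by hand, so machine learning is not needed. The approach pays off when nobody knows how to write the rule down, such as recognizing objects in images.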

Data Science Projects & Applications

Let us discuss the top data science applications of 2023.

ChatGPT

ChatGPT is an advanced AI chatbot trained by OpenAI that interacts in a conversational way.

ChatGPT Interface.

YouChat

YouChat is a ChatGPT-like AI search assistant that you can talk to right in your search results. It stays up to date with the news and cites its sources so that you can feel confident in its answers. Plus, the more you interact with YouChat, the more it improves.

YouChat Interface.
YouChat Welcome Message.

Midjourney

Midjourney is an independent research lab that produces an artificial intelligence program under the same name that creates images from textual descriptions.

Midjourney-generated images.

Final thoughts

While the data science project lifecycle may seem obvious, it’s not often taught in online certificates or courses. This leaves a huge disconnect between the technical skills you learn and the reality of how they’ll be used in the workplace.

However, by learning the basic structure described above, you’ll become a more well-rounded data scientist, you’ll be able to answer any lifecycle questions that may get thrown at you in an interview, and you’ll be better positioned to help your team prepare and deliver a vital data science project.

#datascience #data #AI

Mohamad Ashour

Mainly interested in the field of machine learning and data analysis with strong knowledge of many programming languages, data processing, and data mining