Data Science: Intro to Machine Learning

Friska Ayu Listya Irawan
Sep 20, 2022


Hi everyone 😄

This is the last post I am including as a final project for the MySkill Data Science Bootcamp class. Unlike the Data Analysis Bootcamp class, this class taught me a new topic: Machine Learning. Here, I will try to write up what I learned as an intro to Machine Learning.

Design by macrovector

Definition

Machine Learning (ML) can be defined generally as:

“An approach to achieve artificial intelligence through systems that can learn from experience to find patterns in a set of data” — Jason’s Machine Learning 101

In ML, algorithms are ‘taught’ how to identify features and patterns in huge amounts of data so as to arrive at predictions and decisions based on new data. The quality of the algorithm will determine how much more accurate the predictions and decisions will become as it analyses additional data.

In traditional programming, a programmer writes explicit instructions that tell the computer what to do. Machine learning systems still need programmers to set them up, but they do not need a hand-written rule for every case: they adjust and improve their own models as they learn from data.

Machine Learning is a subfield of Artificial Intelligence (AI). The goal of machine learning is to analyze the structure of data and fit that data into models. The models help Data Scientists and ML Engineers understand the data and use it.

Why is Machine Learning important?

Interest in machine learning has resurged thanks to growing volumes and varieties of available data, affordable data storage, and computational processing that is both cheaper and more powerful.

All these factors make it possible to automatically and quickly create applications that can process larger, more sophisticated data and yield swifter, more accurate outcomes — even on a much bigger scale. And by developing accurate applications, a company is better positioned to identify lucrative opportunities and/or avoid hidden risks.

Who’s using Machine Learning and what’s it used for?

Today, machine learning is used in a wide range of applications. Some of them are:

  • Customer relationship management. CRM software can use machine learning models to analyze emails and prompt sales team members to respond to the most important messages first. More advanced systems can even recommend potentially effective responses.
  • Business intelligence. BI and analytics vendors use machine learning in their software to identify potentially important data points, patterns of data points, and anomalies.
  • Human resource information systems. An HRIS can use machine learning models to filter through applications and identify the best candidates for an open position.
  • Self-driving cars. Machine learning algorithms can even make it possible for a semi-autonomous car to recognize a partially visible object and alert the driver.
  • Virtual assistants. Smart assistants typically combine supervised and unsupervised machine learning models to interpret natural speech and supply context.

Types of Machine Learning

Classical machine learning is often categorized by how an algorithm learns to become more accurate in its predictions. There are four basic approaches: supervised learning, unsupervised learning, semi-supervised learning, and reinforcement learning. The type of algorithm data scientists choose depends on the kind of data they have and what they want to predict. The two most commonly used types are supervised learning and unsupervised learning.

Supervised learning

In this type of machine learning, data scientists supply algorithms with labeled training data and define the variables they want the algorithm to assess for correlations. Both the input and the output of the algorithm are specified.

Supervised machine learning requires the data scientist to train the algorithm with both labeled inputs and desired outputs. Supervised learning algorithms are good for the following tasks:

  • Binary classification: Dividing data into two categories.
  • Multi-class classification: Choosing between more than two types of answers.
  • Regression modeling: Predicting continuous values.
  • Ensembling: Combining the predictions of multiple machine learning models to produce a more accurate prediction.
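To make the regression task above more concrete, here is a minimal sketch in plain Python. It fits a straight line to labeled data with ordinary least squares. The function name `fit_line` and the hours/scores numbers are made up for illustration; a real project would more likely use a library such as scikit-learn.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # How x and y vary together, and how much x varies on its own.
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Labeled training data (made-up numbers): hours studied -> exam score.
hours = [1.0, 2.0, 3.0, 4.0, 5.0]
scores = [52.0, 54.0, 56.0, 58.0, 60.0]

slope, intercept = fit_line(hours, scores)

# Predict a continuous value for an input the model has not seen.
predicted = slope * 6.0 + intercept
print(slope, intercept, predicted)
```

Both the inputs (hours) and the desired outputs (scores) are given during training, which is what makes this supervised learning.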

Unsupervised learning

This type of machine learning involves algorithms that train on unlabeled data. The algorithm scans through data sets looking for any meaningful connection. Only the input data is given; the groups or patterns the algorithm outputs are not specified in advance.

Unsupervised machine learning algorithms do not require data to be labeled. They sift through unlabeled data to look for patterns that can be used to group data points into subsets. Neural networks and other deep learning methods can be trained in an unsupervised way (autoencoders are one example), although many of them are trained with labels. Unsupervised learning algorithms are good for the following tasks:

  • Clustering: Splitting the dataset into groups based on similarity.
  • Anomaly detection: Identifying unusual data points in a data set.
  • Association mining: Identifying sets of items in a data set that frequently occur together.
  • Dimensionality reduction: Reducing the number of variables in a data set.
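As an illustration of the clustering task above, here is a minimal sketch of K-Means on one-dimensional data in plain Python. The function name `k_means_1d`, the starting centers, and the data points are assumptions made for this example, not a production implementation.

```python
def k_means_1d(points, centers, iterations=10):
    """Cluster 1-D points around the given starting centers."""
    groups = [[] for _ in centers]
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest center.
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            groups[nearest].append(p)
        # Update step: move each center to the mean of its group
        # (an empty group keeps its old center).
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers, groups


# Unlabeled data (made-up numbers): no one tells the algorithm which group is which.
points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
centers, groups = k_means_1d(points, centers=[1.0, 12.0])
print(centers)
print(groups)
```

Notice that there are no labels anywhere: the algorithm only uses the distance between points to decide which ones belong together.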

Step by Step to Build Machine Learning

Building a machine learning model can be broken down into the following steps:
  • Understand the problem you want to solve. Do you need Machine Learning? What data do you have?
  • Analyze and process data. Which data can be used? Which data should be discarded? At this stage, data analysis and data processing are carried out.
  • Create machine learning models. Which type of machine learning is right for the data you have?
  • Train the program (Machine Learning Model). The model is trained by providing pre-processed data.
  • Evaluate the model. What’s wrong with the model? Why is the accuracy low?
  • Improve models. Based on the evaluation, which part of the model should be changed? At this stage, the Machine Learning Model is updated.
  • Repeat all processes. We can repeat all processes to add new data or change the method.

Understand the problem

First, you need to understand what problem you will solve. If your data has labels, use Supervised Learning: do Classification if you want to assign the data to categories, or Regression if you want to predict a continuous value. If your data does not have labels, use Unsupervised Learning: do Clustering if you want to group the data, or Dimensionality Reduction if you want to summarize it.
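The decision described above can be written down as a small helper. This is only a sketch of the rule of thumb; the function name `suggest_approach` and the goal keywords (`"classify"`, `"predict_value"`, `"group"`, `"summarize"`) are hypothetical names chosen for this example.

```python
def suggest_approach(has_labels, goal):
    """Map 'do I have labels?' and 'what do I want?' to a type of ML task."""
    if has_labels:
        family = "supervised learning"
        tasks = {"classify": "classification", "predict_value": "regression"}
    else:
        family = "unsupervised learning"
        tasks = {"group": "clustering", "summarize": "dimensionality reduction"}
    if goal not in tasks:
        raise ValueError(f"goal {goal!r} does not fit {family}")
    return family, tasks[goal]


print(suggest_approach(True, "classify"))
print(suggest_approach(False, "group"))
```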

After identifying the Machine Learning problem, look at the form of your data. It can be Structured Data (data in the form of tables, such as dates, phone numbers, home addresses, and names) or Unstructured Data (images, sound, video, message text, emails, and comment files).

Analyze and process data

After understanding the machine learning problem, the process continues with data analysis: deciding which data we should use and which data we should discard. Finding correlations in the data helps with this analysis, and data visualization is a good way to do it.

At this stage, we also process data, known as Data Preprocessing. The computer cannot directly understand text, images, or sounds, so the data that we have must be converted into numbers before the model can read it.
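One common way to turn text categories into numbers is one-hot encoding, where each category gets its own 0/1 column. Here is a minimal sketch in plain Python; the function name `one_hot_encode` and the color values are made up for illustration.

```python
def one_hot_encode(values):
    """Turn a list of category names into rows of 0s and 1s."""
    categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}
    encoded = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1  # mark the column that matches this value
        encoded.append(row)
    return categories, encoded


colors = ["red", "green", "blue", "green"]
categories, encoded = one_hot_encode(colors)
print(categories)
print(encoded)
```

Each row now contains only numbers, so it can be passed to a model.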

After that, proceed with data separation. The data we have will be separated into two parts, Training data and Test data. If you have a large amount of data, you can split the data into three parts, Training data, Validation data, and Test data.
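The data separation above can be sketched like this. It shuffles the rows and holds back a fraction as test data. The function name `split_train_test`, the 20% test fraction, and the seed value are assumptions for this example.

```python
import random


def split_train_test(rows, test_fraction=0.2, seed=42):
    """Shuffle the rows and split them into training and test data."""
    shuffled = list(rows)  # copy, so the original data is left untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


data = list(range(10))
train_rows, test_rows = split_train_test(data)
print(len(train_rows), len(test_rows))
```

Shuffling before splitting matters when the data is sorted in some way, because otherwise the test data might contain only one kind of example.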

Create machine learning models

Once your data is ready, the process continues with creating a Machine Learning Model. Machine Learning models are built by applying mathematical formulas, usually called algorithms.

However, when building Machine Learning models, especially with the Python Programming Language, these algorithms are already available in libraries, and we just have to choose which one to use for each of our problems.

Of course, we have to understand how an algorithm works so that we can choose the right one for each different problem.

Some Machine Learning algorithms are Support Vector Machine (SVM), K-Nearest Neighbours (K-NN), K-Means, Logistic Regression, etc. You don’t have to worry for now, because you don’t have to memorize all of them right away.
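To give a feeling for how one of these works, here is a minimal sketch of K-Nearest Neighbours in plain Python: a new point gets the label that is most common among its k closest training points. The function name `knn_predict` and the data are made up for illustration; in practice you would normally pick a tested implementation from a library.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Label a query point by majority vote among its k nearest neighbours."""
    # Pair every training point's distance to the query with its label.
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Made-up labeled data: two clearly separated groups of 2-D points.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
labels = ["small", "small", "small", "large", "large", "large"]

print(knn_predict(points, labels, (2, 2)))
print(knn_predict(points, labels, (7, 8)))
```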

Train program (Machine Learning Model)

After the model has been created with the algorithm that we have chosen, the next step is to train the model. Training usually takes many passes over the data before the model reaches its best accuracy.

The first time the model is trained, it makes a lot of mistakes, so its accuracy is very low. With every training pass, the model's calculations are adjusted, so if you have chosen a suitable algorithm, the accuracy can increase. The program keeps looking for patterns in the training data so that the accuracy improves as training goes on.
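Here is a minimal sketch of what repeated training can look like, using gradient descent to learn a single weight. The model, learning rate, number of passes, and data are all assumptions for this example; other algorithms train in different ways.

```python
def train(xs, ys, learning_rate=0.01, epochs=200):
    """Learn a weight so that weight * x is close to y, one pass at a time."""
    weight = 0.0  # the untrained model starts with a poor guess
    history = []
    n = len(xs)
    for _ in range(epochs):
        # Mean squared error of the current model on the training data.
        error = sum((weight * x - y) ** 2 for x, y in zip(xs, ys)) / n
        history.append(error)
        # Nudge the weight in the direction that reduces the error.
        gradient = sum(2 * (weight * x - y) * x for x, y in zip(xs, ys)) / n
        weight -= learning_rate * gradient
    return weight, history


xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 6.0, 9.0, 12.0]  # made-up data that follows y = 3x

weight, history = train(xs, ys)
print(weight)
print(history[0], history[-1])
```

The error recorded on the first pass is large, and it shrinks as the weight moves toward 3 over the following passes.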

During training, it often happens that the model memorizes the training data too closely. As a result, the model has very high accuracy during training but low accuracy when tested. This is called overfitting. A model with low accuracy on all data, including the training data, is underfitting, which means it has not learned the underlying patterns.

Overfitting can occur because there is too little training data, because the model is too complex for the problem, or because the model is trained on the same data for too long. Meanwhile, underfitting can occur because the algorithm we choose is too simple or not suitable for the problem we have.
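The small sketch below shows what overfitting looks like in numbers. A model that memorizes the training data (here, a 1-nearest-neighbour lookup) also memorizes one deliberately wrong label, so it is perfect on the training data but worse on new data. For comparison, a hand-written threshold rule stands in for a simpler model. All names and numbers are made up for this example.

```python
def nearest_neighbor_predict(train, x):
    """Memorizing model: copy the label of the closest training point."""
    _, label = min(train, key=lambda pair: abs(pair[0] - x))
    return label


def threshold_predict(x):
    """Simpler model (hand-written here for illustration): 1 if x >= 5."""
    return 1 if x >= 5 else 0


def accuracy(predict, data):
    """Fraction of (x, label) pairs that the model gets right."""
    return sum(1 for x, y in data if predict(x) == y) / len(data)


# Training data with one noisy label: x = 3 is marked 1 by mistake.
train = [(1, 0), (2, 0), (3, 1), (4, 0), (6, 1), (7, 1), (8, 1), (9, 1)]
# Test data the models have never seen, with correct labels.
test = [(0.5, 0), (2.8, 0), (3.2, 0), (5.5, 1), (7.5, 1), (9.5, 1)]


def memorizer(x):
    return nearest_neighbor_predict(train, x)


memorizer_train_acc = accuracy(memorizer, train)
memorizer_test_acc = accuracy(memorizer, test)
simple_train_acc = accuracy(threshold_predict, train)
simple_test_acc = accuracy(threshold_predict, test)

print("memorizer:", memorizer_train_acc, memorizer_test_acc)
print("simple rule:", simple_train_acc, simple_test_acc)
```

The memorizing model scores higher on the training data but lower on the test data, which is the typical sign of overfitting.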

After the training process is complete, it is time to test the model on test data that the model has never seen before, to prove whether the model is worth putting into real-world work or not.

Evaluate the model

After training, we evaluate the results and the training process. The results can be checked with a measure called an Error Metric, such as accuracy for classification. Based on the Error Metric, we can analyze the model and find where it goes wrong.
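Two common error metrics can be sketched in a few lines: accuracy for classification and mean absolute error for regression. The function names and the example values are made up for illustration, and which metric fits best depends on the problem.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_absolute_error(y_true, y_pred):
    """Average size of the gap between true and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Classification example: 3 of the 5 predictions are correct.
labels_true = [1, 0, 1, 1, 0]
labels_pred = [1, 0, 0, 1, 1]
acc = accuracy(labels_true, labels_pred)

# Regression example: the predictions are off by 2, 2, and 3.
values_true = [10.0, 20.0, 30.0]
values_pred = [12.0, 18.0, 33.0]
mae = mean_absolute_error(values_true, values_pred)

print(acc, mae)
```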

Improve models

After we know the errors, it’s time to fix them by updating the model to improve accuracy. This can be done in various ways, such as adding training data or tuning the model's settings. Tuning those settings is known as Hyperparameter Tuning.
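A simple form of hyperparameter tuning is to try several values of a setting and keep the one that scores best on validation data. The sketch below does this for k, the number of neighbours in a small K-NN classifier. The data, the candidate values, and the function names are assumptions for this example.

```python
def knn_predict(train, x, k):
    """Majority vote (labels are 0 or 1) among the k nearest training points."""
    nearest = sorted(train, key=lambda pair: abs(pair[0] - x))[:k]
    votes_for_one = sum(label for _, label in nearest)
    return 1 if votes_for_one * 2 > k else 0


def validation_accuracy(train, validation, k):
    """Accuracy of K-NN with the given k on the validation data."""
    correct = sum(1 for x, y in validation if knn_predict(train, x, k) == y)
    return correct / len(validation)


# Training data with one noisy label at x = 3 (made-up numbers).
train = [(1, 0), (2, 0), (3, 1), (4, 0), (6, 1), (7, 1), (8, 1), (9, 1)]
# Validation data: held back from training and used only to compare settings.
validation = [(2.8, 0), (3.2, 0), (5.5, 1), (7.5, 1)]

candidates = [1, 3, 5]
scores = {k: validation_accuracy(train, validation, k) for k in candidates}
# On a tie, max() keeps the first (smallest) candidate.
best_k = max(candidates, key=lambda k: scores[k])

print(scores)
print(best_k)
```

The test data is left out of this comparison on purpose, so it can still give an honest check of the final model.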

Repeat all processes

After we improve the model, we can repeat all the processes from the beginning to keep improving it: adding data, reprocessing existing data, retraining with Hyperparameter Tuning, trying different algorithms, and re-evaluating the models we have.

The Advantages and Disadvantages of Machine Learning

When it comes to advantages, machine learning can help enterprises understand their customers at a deeper level. By collecting customer data and correlating it with behaviors over time, machine learning algorithms can learn associations and help teams tailor product development and marketing initiatives to customer demand.

Some companies use machine learning as a primary driver in their business models. Google, for example, uses machine learning to surface the right advertisements in searches.

But machine learning comes with disadvantages. First and foremost, it can be expensive. Machine learning projects are typically driven by data scientists, who command high salaries. These projects also require software infrastructure that can be expensive.

There is also the problem of machine learning bias. Algorithms trained on data sets that exclude certain populations or contain errors can lead to inaccurate models of the world that, at best, fail, and at worst, are discriminatory. When an enterprise bases core business processes on biased models, it can suffer regulatory and reputational harm.

That's all for today. It’s just the beginning of learning Machine Learning more deeply. I will not stop learning (but sorry, won’t promise to write it here 😜). Feel free to give feedback, or we can discuss it together.

Thank youuuu 😆
