Towards Machine Learning - Terminologies

Published in

Tech Blog

6 min read · Dec 24, 2022

Machine learning is often called the oil of the future. Every field, from medicine and finance to marketing, can benefit from machine intelligence. The world of ML is broad, and the deeper you dive into it, the more it sparks your curiosity.

This article is the first in the “Towards machine learning” series, which explains the main concepts of the machine learning domain. It defines the most common terms used in the ML field.

Machine learning? What is it?

Machine learning is the field of study that gives computers the ability to learn without being explicitly programmed. In other words, computers learn from experience, using data.

Example: Arthur Samuel wrote a checkers-playing program. The remarkable thing about it was that, by playing thousands of games against itself and observing which positions tended to win and which tended to lose, the program learned over time which positions were good and which were bad.

What are the machine learning types?

There are three main types of machine learning:

Supervised learning: The machine learning algorithm is trained on labeled data. We give the algorithm a data set that includes the correct answers, and the algorithm’s task is to produce correct answers for new inputs.
Unsupervised learning: The algorithm works with unlabeled data and tries to find structure in it, such as groups of similar items.
Reinforcement learning: The algorithm learns how to behave by carrying out actions and observing the rewards that result from those actions.
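
To make the difference between labeled and unlabeled data concrete, here is a minimal sketch in plain Python. The data values and the function names predict_nearest and split_at_largest_gap are made up for illustration, and both techniques (1-nearest neighbor and a largest-gap split) are deliberately simplistic stand-ins for real algorithms.

```python
# Supervised: every example comes with a label (the "correct answer").
labeled = [(1.0, "small"), (1.2, "small"), (4.8, "large"), (5.1, "large")]


def predict_nearest(x, examples):
    """Return the label of the training example closest to x (1-nearest neighbor)."""
    _, nearest_label = min(examples, key=lambda ex: abs(ex[0] - x))
    return nearest_label


# Unsupervised: no labels, so we can only look for structure in the values.
unlabeled = [1.0, 1.2, 4.8, 5.1]


def split_at_largest_gap(values):
    """Split sorted values into two groups at the widest gap between neighbors."""
    ordered = sorted(values)
    gaps = [ordered[i + 1] - ordered[i] for i in range(len(ordered) - 1)]
    cut = gaps.index(max(gaps)) + 1
    return ordered[:cut], ordered[cut:]


print(predict_nearest(4.0, labeled))    # uses the labels
print(split_at_largest_gap(unlabeled))  # finds two groups without labels
```

The supervised function needs the labels to answer; the unsupervised one discovers the two groups from the values alone.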

What is a machine learning model?

A machine learning model is a program or mathematical representation that predicts or classifies values based on a given data set.

How does machine learning work?

Training and evaluation are the two main phases of machine learning models.

1. Training

Given enough input data, a machine learning model can perform tasks such as recognizing objects or making recommendations with high accuracy.

Training a machine learning model (Reference: Feedzai)

In the training phase, the programmer provides input data together with the expected output. The machine learning algorithm uses this information to build a model that represents the patterns it detected in the training data.

2. Evaluation

In the evaluation phase, the machine learning system uses the model built during training to predict the output for new input data.

Evaluating a machine learning model (Reference: Feedzai)
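
The two phases above can be sketched with a very small model. This is only an illustration, assuming a one-variable linear model fitted by ordinary least squares; the function names train and predict and the sample data are hypothetical.

```python
def train(inputs, outputs):
    """Training phase: fit y = slope * x + intercept by ordinary least squares."""
    n = len(inputs)
    mean_x = sum(inputs) / n
    mean_y = sum(outputs) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(inputs, outputs))
    variance = sum((x - mean_x) ** 2 for x in inputs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    # The "model" is just the pattern found in the data: two numbers.
    return {"slope": slope, "intercept": intercept}


def predict(model, x):
    """Evaluation phase: apply the learned pattern to a new input."""
    return model["slope"] * x + model["intercept"]


# Input data with the expected output (here the outputs follow y = 2x + 1).
model = train([1, 2, 3, 4], [3, 5, 7, 9])
print(model)
print(predict(model, 10))
```

Training produces the model from inputs and expected outputs; evaluation only needs the model and a new input.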

Machine learning life-cycle

1. Gathering data

This step is to identify and collect data related to the problem we are working on. In this stage, we need to gather data from different sources, such as files, the internet, social media, etc. The effectiveness of the output will depend on the quantity and quality of the data gathered.

2. Data preparation

Data preparation means organizing our data and getting it ready for machine learning training.

This step can be separated into two processes:

Data exploration: This process helps us understand the nature of the data we have to work with. We need to recognize the features of the data, its format, and its quality. A clearer understanding of the data leads to better results.
Data pre-processing: After exploration, the data is prepared for analysis.

3. Data wrangling

Data wrangling is the process of cleaning and converting raw data into valuable data.

Not all of the data we collect is useful. The collected data may have various issues, including:

Wrong data
Duplicated data
Invalid data
Null data
Noise

We therefore need to clean the data before using it, for example by removing duplicates and dropping or correcting invalid values.
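
As a rough illustration of such cleaning, the sketch below filters a small list of records in plain Python. The records, the 0 to 120 age range, and the function name clean are assumptions chosen for the example, not rules from any particular tool.

```python
raw_records = [
    {"name": "Ana", "age": 34},
    {"name": "Ana", "age": 34},    # duplicated data
    {"name": "Ben", "age": None},  # null data
    {"name": "Cleo", "age": -5},   # invalid data (age cannot be negative)
    {"name": "Dan", "age": 41},
]


def clean(records):
    """Drop records with null ages, out-of-range ages, or exact duplicates."""
    seen = set()
    cleaned = []
    for record in records:
        age = record["age"]
        if age is None:
            continue  # skip null values
        if not 0 <= age <= 120:
            continue  # skip values outside the assumed valid range
        key = (record["name"], age)
        if key in seen:
            continue  # skip duplicates already kept
        seen.add(key)
        cleaned.append(record)
    return cleaned


print(clean(raw_records))
```

In practice, whether to drop or repair a bad record depends on the data set; dropping is simply the easiest choice to show.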

4. Data analysis

The cleaned data is now ready for the analysis step, which involves:

Selecting the technique for analyzing the data.
Building the model.
Reviewing the result.

This step aims to build a model that analyzes the data using algorithms that will be covered in the next article.

5. Training the model

In this phase, we train the model to improve its performance for a better outcome. We use training datasets to train the model using machine learning algorithms. Training is required so that the model can learn the patterns and features in the data.

6. Testing the model

Once the model has been trained on a given dataset, we test it. In this phase, we check the model’s accuracy by running it on a test dataset.
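
Accuracy here means the fraction of test examples the model gets right. A minimal sketch, assuming the predictions and the expected labels are given as two lists of equal length (the function name accuracy and the sample labels are illustrative):

```python
def accuracy(predictions, expected):
    """Return the fraction of predictions that match the expected labels."""
    correct = sum(1 for p, e in zip(predictions, expected) if p == e)
    return correct / len(expected)


model_output = ["spam", "ham", "spam", "ham"]
true_labels = ["spam", "ham", "ham", "ham"]
print(accuracy(model_output, true_labels))  # 3 of 4 correct
```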

7. Deploying the model

The last step is to deploy the model in a real system. If the model prepared in the previous steps produces accurate results, it is ready to be used.

What are data sets?

A dataset is a collection of data arranged in some order. It can hold data from any source, but it should be clean, structured, and clearly formatted so that it can be understood.

Types of data

Numerical data: quantitative data, like prices.
Categorical data: data sorted into groups, like male/female.
Ordinal data: data whose values can be compared and ranked, like letter grades.
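
Since most algorithms work on numbers, categorical and ordinal values are usually converted before training. The sketch below shows two common conversions in plain Python; the grade ranking in GRADE_ORDER and the function names are assumptions made for the example.

```python
# Ordinal data keeps its ranking when mapped to integers.
GRADE_ORDER = {"F": 0, "D": 1, "C": 2, "B": 3, "A": 4}


def encode_ordinal(grades):
    """Map letter grades to integers that preserve their order."""
    return [GRADE_ORDER[g] for g in grades]


def one_hot(values):
    """Encode unordered categories as 0/1 vectors, one column per category."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]


print(encode_ordinal(["B", "A", "F"]))
print(one_hot(["male", "female", "female"]))  # columns: female, male
```

One-hot encoding avoids implying an order between categories, which an integer mapping would do.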

Machine learning data set types

In machine learning, the data set could be split into two or three groups:

Training set: In supervised learning, the training set is the subset of the original data used to train the machine learning model. The model learns from this data. In unsupervised learning, the training data has no labels, so the expected outcomes are unknown.
Validation set (not always needed): A validation set is used to tune the machine learning model.
Testing set: Testing data is used to check the model’s accuracy. It plays the role of unseen data, because the model does not see it during training.

The training dataset is generally larger than the testing dataset. Common ratios for splitting train and test datasets are 80:20, 70:30, or 90:10.
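
An 80:20 split can be sketched in a few lines of plain Python. The function name split_dataset, the default test_ratio, and the fixed seed are choices made for this example; libraries such as scikit-learn offer their own splitting utilities.

```python
import random


def split_dataset(rows, test_ratio=0.2, seed=42):
    """Shuffle the rows and split them into (training set, testing set)."""
    shuffled = list(rows)
    # A seeded generator makes the shuffle reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    test_size = round(len(shuffled) * test_ratio)
    return shuffled[test_size:], shuffled[:test_size]


train_set, test_set = split_dataset(range(100))
print(len(train_set), len(test_set))
```

Shuffling before splitting helps avoid a test set that only contains, for example, the most recent rows.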

Sources of datasets

1. Kaggle datasets

URL: https://www.kaggle.com/datasets

Kaggle is one of the best sources of high-quality datasets. You can search for the dataset you need and download it.

2. AWS Open Data

URL: https://registry.opendata.aws/

We can use AWS resources to search for datasets and download them.

This source provides several types of datasets, along with examples of how to use them.

3. Microsoft Research Open Data

URL: https://msropendata.com/

Microsoft offers free datasets for different areas. These datasets can be downloaded to our local machine or used on cloud infrastructure.

4. UCI repository

URL: https://archive.ics.uci.edu/ml/index.php

This repository contains databases and classifies its datasets by machine learning task.

5. Google dataset search engine

URL: https://datasetsearch.research.google.com/

This source provides online datasets that are freely available for use.

6. Scikit-learn datasets

URL: https://scikit-learn.org/stable/datasets.html

This source provides toy and real-world datasets. They can be obtained from the sklearn.datasets package through its dataset API.

Summary

This article explained the main concepts of machine learning and illustrated several popular data set providers.

Now that we know what machine learning is and how it works, the next article will dive into the main algorithms of traditional machine learning.