Introduction to Machine Learning

Abe Vallerian
Published in Tokopedia Data
6 min read · Dec 5, 2018
source: https://vignette.wikia.nocookie.net/epic-rap-battles-of-cartoons/images/9/9f/Doraemon.png/revision/latest?cb=20180427160304

Hi guys! I’m Abe from the Data Scientist team. Natan (Data Scientist Lead) and I participated in the Machine Learning Bootcamp held by Google in Singapore. The course was exciting and interesting, so I want to share what I learned with you.

What is Machine Learning (ML)?

I believe many of you have heard of Machine Learning because it is one of the hottest terms out there. It is very closely related to Artificial Intelligence (AI). At this point, you may think of Doraemon or J.A.R.V.I.S. from the Iron Man movies, because they are built on AI. And yeah, these are very intelligent machines. However, current technology is still very far from getting there. Machines are getting smarter, but they are still not that smart. By the way, JARVIS is also the name of Tokopedia’s Data Scientist team.

source: https://vignette.wikia.nocookie.net/marvelcinematicuniverse/images/b/b0/JuARaVeInSy.png/revision/latest?cb=20120722164138

So, what is Machine Learning? Machine Learning is a way for machines to learn something without being explicitly programmed. As programmers, we can write a program to do almost anything we want. Let’s say we want to program a chatbot. One approach is to write out every possible response to every possible inquiry. However, I’m sure you will agree that this is impossible, right? That’s why we need Machine Learning. We need to somehow ‘teach’ the machine to learn proper responses on its own, without hand-written rules for every case.
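
To see why hand-written rules do not scale, here is a minimal sketch of a rule-based chatbot in Python. The inquiries, the responses, and the names RULES, FALLBACK and rule_based_reply are all made up for illustration; this is not how any real chatbot at Tokopedia works.

```python
# Hypothetical hand-written rules: exact inquiry -> canned response.
RULES = {
    "where is my order?": "You can track your order on the order status page.",
    "how do i get a refund?": "Open the order and choose the refund option.",
}
FALLBACK = "Sorry, I do not understand."


def rule_based_reply(inquiry):
    """Return the canned response for an exact (case-insensitive) match."""
    return RULES.get(inquiry.strip().lower(), FALLBACK)


known = rule_based_reply("Where is my order?")
unseen = rule_based_reply("My package has not arrived yet")

print(known)   # matches a rule
print(unseen)  # same intent, different wording, so no rule matches
```

The second inquiry means almost the same thing as the first, but the rules only cover exact wording, so every new phrasing would need its own rule.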

Weak vs Strong AI

In general, AI is classified into 2 different categories:

  1. Weak AI is an ML model trained to do one specific task. This is the most common kind of AI you can find nowadays. For example, at Tokopedia we have different models to predict product categories, classify customers’ reviews, and provide product recommendations based on customers’ purchase history.
  2. Strong AI is AI that can do many different tasks at once. Doraemon is the perfect example of this category. He can read, listen, talk, walk, understand things, and even think. If we want to build something like Doraemon, we need to combine many different weak AI models. However, combining those models into an integrated system is one of the biggest challenges we face in practice.

What’s Needed to Build an ML Model?

3 things building ML

So, after understanding all of these things, what do we need to build an ML model? In general, we need to define these 3 things:

  1. Task. It is basically a problem that we want to solve using our model.
  2. Performance Measure. Here, we define proper metrics to measure how well the model performs.
  3. Experience. It is something that the model can learn from. We usually call it training data. It is similar to how babies learn about the things around them by hearing what those things are called.

Let’s take an example. Suppose that we want to build a model to recognize handwritten text from an image. The task is obvious: recognize the text in a given handwriting image. As the performance measure, we can use the percentage of correct predictions. The model then learns patterns from the given training data (experience), which consists of handwriting images with their corresponding text or word as the label.
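
As a rough sketch, the performance measure from this example (percentage of correct predictions) could be computed like this in Python. The function name accuracy and the five labels are assumptions made up for illustration; no real handwriting model is involved.

```python
def accuracy(y_true, y_pred):
    """Return the percentage of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy on an empty list")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return 100.0 * correct / len(y_true)


# Hypothetical labels for five handwriting images and a model's guesses.
true_words = ["cat", "dog", "sun", "map", "pen"]
predicted_words = ["cat", "dog", "sum", "map", "pen"]

score = accuracy(true_words, predicted_words)
print(f"{score:.1f}% correct")  # 4 of the 5 guesses match
```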

ML Workflow

Before building a model, we should understand the ML workflow. There are many variations of this workflow, and this is just one example. Remember, we need to define the 3 things I mentioned previously (task, performance measure, experience) before starting the workflow. In general, there are 4 steps:

  1. Explore Dataset. We should spend most of our time on this step. Here, we need to prepare and preprocess the data first, because real-world data is often quite messy, e.g. missing or wrong values, different formats, unstructured data.
  2. Choose a Model. There are many models available out there, such as decision trees, neural networks, k-NN, and so on. The best practice is to start with the simplest one as a baseline. If its performance is not good enough, we can try building more advanced models.
  3. Build a Model. We also call this the training step. Basically, we feed the preprocessed data into the model so it can recognize patterns within the data.
  4. Test the Model. In this step, we validate whether the model is good enough to solve our problem. If it is not, we can go back to step 1 to get more data, or to step 2 to change our model.
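
To give a feel for step 1, here is a small sketch of cleaning messy records with plain Python. The rows, the column names area_m2 and rent_m_idr, and the helper to_float are hypothetical and were invented to show the kinds of problems mentioned above (missing values, wrong values, different formats).

```python
# Hypothetical raw records; the values are invented to show typical problems.
raw_rows = [
    {"area_m2": "10", "rent_m_idr": "1.4"},
    {"area_m2": " 12 ", "rent_m_idr": "1,6"},  # stray spaces, comma as decimal mark
    {"area_m2": "", "rent_m_idr": "1.8"},      # missing area
    {"area_m2": "15", "rent_m_idr": "n/a"},    # unusable price
]


def to_float(text):
    """Parse a number written with '.' or ',' as decimal mark; None if unusable."""
    cleaned = text.strip().replace(",", ".")
    try:
        return float(cleaned)
    except ValueError:
        return None


clean_rows = []
dropped = 0
for row in raw_rows:
    area = to_float(row["area_m2"])
    rent = to_float(row["rent_m_idr"])
    if area is None or rent is None:
        dropped += 1  # skip rows we cannot repair
        continue
    clean_rows.append((area, rent))

print(clean_rows)
print("dropped", dropped, "rows")
```

Dropping bad rows is only one possible choice. Depending on the problem, filling in missing values may be a better option.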

Before we start training our model, we should divide the data into train and test sets. We train the model on the train data and validate it on the test data. The reason for this split is that we want to make sure our model generalizes to data outside the train set instead of just memorizing the dataset. A common proportion of train to test data is 80:20.
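
An 80:20 split can be sketched with the Python standard library as below. The helper split_train_test, its parameters, and the stand-in data are assumptions for illustration; ML libraries usually ship their own version of this.

```python
import random


def split_train_test(rows, test_ratio=0.2, seed=42):
    """Shuffle a copy of rows and split it into (train, test) lists."""
    if not 0.0 < test_ratio < 1.0:
        raise ValueError("test_ratio must be between 0 and 1")
    shuffled = list(rows)                  # copy, so the input stays untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed gives a repeatable split
    n_test = max(1, round(len(shuffled) * test_ratio))
    return shuffled[n_test:], shuffled[:n_test]


data = list(range(100))  # stand-in for 100 observations
train, test = split_train_test(data)
print(len(train), len(test))
```

Shuffling before splitting matters when the data is stored in some order (for example, sorted by date or price), because otherwise the test set would not look like the train set.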

Case Example

To make it clearer, I’ll give you an example. Let’s take a look at the dataset in the table below.

Rent Price Dataset

Suppose that we have a sample house rent pricing problem. Our task is to predict the Rent Price (M IDR) based on Area (m²). We define the performance measure as the average error, and the experience is the training data in the table.

Let’s start our ML workflow.

  1. Explore Dataset. Although it’s just a dummy example, we explore our dataset first. Usually, we check whether there are missing or inappropriate values (mostly in categorical columns), but in our case the dataset is clean. We have 4 observations in our dataset. The minimum and maximum rent prices are 1.4 and 2.2 M IDR, respectively.
  2. Choose a Model. Next, we choose linear regression as our model. You have most likely heard of linear regression, which is one of the simplest and most popular models in Machine Learning. Its goal is very simple: we want to find the line that minimizes the average error. The error is defined as the difference between the actual and predicted values, and we average the errors over the whole dataset to obtain the final error. In practice, the errors are usually squared (or taken as absolute values) before averaging so that positive and negative errors do not cancel out.
  3. Build a Model. Then, we train the model on the data by minimizing the average error. Our linear regression is represented by the blue line in the Linear Regression figure below.
  4. Test the Model. Finally, we need to test the performance of our model. For example, suppose we have a test observation whose Area is 13 m² and whose Rent Price is 2 M IDR. When we feed that observation into our model, it predicts a Rent Price of 1.68 M IDR. The prediction error is Actual Rent Price - Predicted Rent Price = 2 - 1.68 = 0.32 M IDR. Based on this result, we need to decide on the next action, i.e. whether to add more data or choose another model.
Linear Regression
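
The case example can be sketched in a few lines of Python. Please note the assumptions: the four training points below are hypothetical (they only respect the 4 observations and the 1.4 to 2.2 M IDR price range mentioned above, and are not the values from the table), so the prediction will not be exactly 1.68. The helper fit_line is also made up for this sketch, and it uses ordinary least squares, i.e. it minimizes the average squared error.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical training data (NOT the article's table).
areas = [10.0, 12.0, 16.0, 20.0]  # Area (m²)
rents = [1.4, 1.6, 1.9, 2.2]      # Rent Price (M IDR)

# Step 3: build (train) the model.
slope, intercept = fit_line(areas, rents)

# Step 4: test the model on the observation from the text.
test_area, actual_rent = 13.0, 2.0
predicted_rent = slope * test_area + intercept
error = actual_rent - predicted_rent
print(f"predicted {predicted_rent:.2f} M IDR, error {error:.2f} M IDR")

# The same error calculation with the numbers quoted in the article.
article_error = round(2.0 - 1.68, 2)
print(article_error)
```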

Yeah! Finally, we have finished creating a simple ML model (well, at least conceptually)!

Source: Pixabay

By the way, visit these links if you want to delve deeper!

More interesting posts about Machine Learning are coming. Stay tuned to Tokopedia Data Squad on Medium! Thanks for reading!
