Machine Learning Basics with Examples — Part 1 Introduction

Canburak Tümer
7 min read · Aug 18, 2018


I am a data engineer, and for more than a couple of years I have been reading about Machine Learning and Data Mining to improve my skills. I have completed the Udacity Machine Learning Nanodegree. I am also trying to use these methods in my day job to create useful models for our company, models which may lead to new revenue or cost savings.

My learning path for a new topic consists of a few steps:

  • Reading about the topic,
  • Trying out some examples / demos,
  • Creating a business case or project for the topic,
  • Finding a playground and playing with the code day and night,
  • Starting to write on the topic and learning more while researching the blog posts,
  • Starting to code in my professional life.

Now I feel confident enough about machine learning to move on to the writing phase of my learning cycle.


My plan for the series is as follows:

  • Introduction (this post)
  • Supervised Learning
  • Classification
  • Decision Trees
  • Random Forests
  • SVM
  • Naive Bayes
  • Regression
  • Unsupervised Learning
  • Clustering
  • Feature Selection and PCA
  • Send Models to Production

I will also include a post about Association Rule Mining; although it is not considered a Machine Learning method, it is nice-to-have knowledge and part of the Data Mining discipline.

I will not include any posts about Reinforcement Learning or Deep Learning yet. They may come in the future when I have had more time to explore them.

Technologies and Tools

Throughout the series I will use a handful of tools and technologies to implement the examples; a minimal setup sketch follows the list.

  • Python: the programming language I will use to develop the models.
  • Matplotlib: a data visualization library for Python.
  • Pandas: a data analysis library for Python with a nice table-like data structure called DataFrame.
  • Scikit-learn: a machine learning library for Python.
  • Kaggle: Kaggle will be our main data source; every sample will have a link to its data on Kaggle. We will also use Kaggle for evaluation.
  • GitHub: the sample code will be in my GitHub repository.
  • Anaconda: a software bundle for data science and machine learning.
  • Jupyter Notebook: a nice IDE-like, web-based environment for writing code together with its documentation.
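
To give a feel for how these pieces fit together, here is a minimal sketch of the kind of setup cell the notebooks in this series might start with (the dataset choice is just for illustration):

```python
# A typical setup cell for the notebooks in this series;
# all of these packages ship with the Anaconda distribution.
import pandas as pd                # tabular data handling (DataFrame)
import matplotlib.pyplot as plt    # data visualization
from sklearn import datasets       # scikit-learn's bundled sample datasets

# Load a small built-in dataset and inspect it as a DataFrame
iris = datasets.load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
print(df.head())
```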

Machine Learning 101


According to the Wikipedia definition:

Machine learning is a subset of artificial intelligence in the field of computer science that often uses statistical techniques to give computers the ability to “learn” (i.e., progressively improve performance on a specific task) with data, without being explicitly programmed.

In my own terms, machine learning is finding the mathematical model or formula hidden in the given data using statistical methods. Computers learn the patterns inside the data and can then apply those patterns to predict outputs for newly arriving data.
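
As a toy illustration of "finding the formula in the data", here is a minimal sketch (not from any specific post in this series) where scikit-learn recovers the slope and intercept of a line from noisy samples:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data generated from y = 2x + 1 plus a little noise;
# the "pattern" to recover is the slope 2 and the intercept 1.
rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(50, 1))
y = 2 * X.ravel() + 1 + rng.normal(scale=0.5, size=50)

model = LinearRegression().fit(X, y)
print(model.coef_[0], model.intercept_)  # close to 2 and 1
print(model.predict([[12.0]]))           # applied to unseen input
```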

Machine Learning Steps


1 Identify the Problem

The first thing in a machine learning lifecycle is to clearly identify the problem that you want the computer to solve. You cannot simply pour a pile of data into your processor and expect magic responses. Or it may give you the answer 42, if you’re lucky enough.


There should be a well-defined problem for you to solve, like “Which customers are likely to churn next month?” (the classic example in data mining and machine learning), or “What is a customer going to buy next?”, or “When is the best time to reach a customer?”, or even “Is this nevus cancerous?”

These are all working examples of a good question. If you have some domain knowledge and some business knowledge, it shouldn’t be too hard for you to decide on your feature candidates.

2 Get to Know Data

So now we have a problem to solve and some data in our hands. It’s time to dive deep into the data to see what we can use from it.

Let’s first talk about data types. In machine learning, data is usually considered to come in two types:

  • Categorical Data
  • Numerical Data

Data like gender, education, and marital status falls into categorical data, which is further split into two sub-types: ordinal and nominal. If the categories have an order between them, like education (primary, secondary, high school, college, etc.), they are ordinal. Otherwise they are treated as nominal, as with gender (male, female).

Numerical data is data like age, height, price, etc. It can also be separated into two sub-types: discrete and continuous. Discrete numbers always have boundaries between them; for example, age can be 36 or 37 but not 36.5 (though this depends on the usage), while height can be 178.25 centimeters.
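
As a small sketch of how these types look in practice, here is one way Pandas can represent them (the column names and values are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "gender": ["male", "female", "female"],              # nominal categorical
    "education": ["primary", "college", "high school"],  # ordinal categorical
    "age": [36, 37, 52],                                 # discrete numerical
    "height_cm": [178.25, 165.0, 181.5],                 # continuous numerical
})

# Declaring the order makes comparisons between education levels meaningful
levels = ["primary", "secondary", "high school", "college"]
df["education"] = pd.Categorical(df["education"], categories=levels, ordered=True)
print(df["education"] > "secondary")  # True for "high school" and "college"
```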

You need to know your data: define its types; compute its summary statistics like mean, median, and variance; find its quality problems; clean it; transform it; and then use it.

3 Prepare Data

After you learn about your data, you need to prepare it for the algorithms you are going to use. Some algorithms can handle categorical data and some cannot, so for some algorithms you need to binarize (one-hot encode) the categorical features.

There are also transformations called feature engineering: creating new features from the ones we already have. One example is calculating age from date of birth, or categorizing people as child, young, adult, or old based on age.
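
A minimal Pandas sketch of both ideas, assuming made-up columns `date_of_birth` and `city`:

```python
import pandas as pd

df = pd.DataFrame({
    "date_of_birth": pd.to_datetime(["1982-05-01", "2010-03-15", "1950-11-30"]),
    "city": ["New York", "Paris", "London"],
})

# Feature engineering: derive an (approximate) age from date of birth
reference = pd.Timestamp("2018-08-18")
df["age"] = (reference - df["date_of_birth"]).dt.days // 365

# ...then bucket the age into coarse life-stage categories
df["age_group"] = pd.cut(df["age"], bins=[0, 12, 25, 60, 120],
                         labels=["child", "young", "adult", "old"])

# Binarize (one-hot encode) the categorical city column
df = pd.get_dummies(df, columns=["city"])
print(df)
```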

In the preparation phase the data also needs to be cleaned. What is dirty data? Data with anomalies: for example, a customer aged 200 is probably not real. There can also be typos in the data; if we have city as a feature, an algorithm would consider “New York” and “newyork” to be different cities, so this should be cleaned.
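
A small illustrative sketch of this kind of cleaning in Pandas (the 120-year threshold and the normalization rule are just example choices):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [34, 200, 41],
                   "city": ["New York", "newyork", "Paris"]})

# Flag implausible values: a 200-year-old customer is almost certainly an error
df.loc[df["age"] > 120, "age"] = np.nan

# Normalize spelling so "New York" and "newyork" end up identical
df["city"] = df["city"].str.lower().str.replace(" ", "", regex=False)
print(df)
```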

Finally, there is the problem of missing data; since we are not living in an ideal world, we will always have some. It can come from an incomplete form field, a sensor losing its connection, etc. There are different ways to handle it. The most common methods, sketched in code after the list, are:

  • Deleting the data points with missing values
  • Filling the missing values with the mean
  • Filling the missing values with the most common value
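
Here is a minimal Pandas sketch of the three methods on a made-up table:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [34, np.nan, 41, 29],
                   "city": ["paris", "paris", np.nan, "london"]})

# Option 1: delete the data points (rows) that contain a missing value
dropped = df.dropna()

# Option 2: fill a numeric column with its mean
df["age"] = df["age"].fillna(df["age"].mean())

# Option 3: fill a categorical column with its most common value (the mode)
df["city"] = df["city"].fillna(df["city"].mode()[0])
print(df)
```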

4 Split Data

It is a common practice to split the data we have into two or three parts for machine learning applications.

Splitting the data helps us avoid the overfitting problem. The shortest definition of overfitting is that the algorithm memorizes the data it was trained on and does not perform well on new data. Overfitting can be detected by comparing the training and testing performance of the algorithm.

Usually the data is split 70% to 30%. The big portion is called the training data and the small portion is called the test (or validation) data. The model is developed using the training data and never sees the test data while it is being trained. After training, the model is applied to the test set for evaluation.
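
In scikit-learn this split is one call to `train_test_split`; the sketch below uses the built-in Iris dataset just for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# 70% of the rows go to training, 30% are held out for testing;
# random_state makes the shuffle reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

print(X_train.shape, X_test.shape)  # (105, 4) (45, 4)
```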


5 Let the Computer Learn the Pattern

After splitting the data, it is time to develop models. We need to decide which models are going to be trained depending on the problem we have.

After deciding on the model, the library to implement it with should be chosen as well.

Using the training data and the chosen library, the model is developed. It is best to try different algorithms, different algorithm parameters (called hyperparameters), and different features from the training dataset.
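
As a minimal sketch of this step, here is a decision tree (one of the models coming up in the series) trained on the Iris training split; `max_depth=3` is just an example hyperparameter choice:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# max_depth is a hyperparameter: a setting we choose, not one the model learns
model = DecisionTreeClassifier(max_depth=3, random_state=42)
model.fit(X_train, y_train)  # the model sees only the training split here
```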

6 Try Out the Pattern on Data

After model training is completed, the test data should be fed into the model to produce an output.
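
Continuing the sketch from step 5 (it assumes `model` and `X_test` from there), producing predictions is a single call:

```python
# Assumes `model` and `X_test` from the step 5 sketch above
y_pred = model.predict(X_test)  # one predicted class per test row
print(y_pred[:5])
```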

7 Evaluate Results

Now the output of the model needs to be evaluated. There are different methods for evaluation; the most commonly used are the f-score for binary classification, and r-squared or root mean squared error for regression.

By using the evaluation metrics, we will be able to select the correct model, the correct hyperparameters and the correct features for the best results.
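
Continuing the same sketch (assuming `y_test` and `y_pred` from steps 5 and 6), the metrics live in `sklearn.metrics`; since Iris has three classes, the f-score needs an averaging strategy:

```python
from sklearn.metrics import accuracy_score, f1_score

# Assumes `y_test` and `y_pred` from the previous sketches
print(accuracy_score(y_test, y_pred))
print(f1_score(y_test, y_pred, average="macro"))

# For regression models, sklearn.metrics also provides
# r2_score and mean_squared_error.
```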

8 Plan for Production

Finally, the model should go to production, and this should be planned according to the needs of the business. If the business needs to run the model on batch data at, let’s say, monthly intervals, the plan will be different from that of a business which needs realtime output for a single data point or streaming data points.
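
One common building block for either plan, shown here only as a sketch, is persisting the trained model to disk so that a batch job or a scoring service can load it later (it assumes `model` and `X_test` from the earlier sketches; the file name is an arbitrary example):

```python
import joblib

# Persist the trained model to disk
joblib.dump(model, "iris_model.joblib")

# Later, in a batch job or a scoring service:
loaded = joblib.load("iris_model.joblib")
print(loaded.predict(X_test[:3]))
```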

We will discuss this in further posts.

Final Words

Thank you for reading. Don’t forget to check this post again for updates; it will be updated with links to the new posts as they arrive.

Share, like, comment, keep in touch.
