Rise above archaic tech with the use of ML, NLP, and BOTS! Part 1 — Machine Learning

A gentle introduction to machine learning, its workflow and which tech you need to learn to get started.

Andreas Rein
BDO Digital Labs
8 min read · Jun 1, 2017


The inner workings of an unsupervised machine learning algorithm

What is machine learning?

“Building a model from example inputs to make data-driven predictions vs. following strictly static program instructions”

Types of machine learning

Machine learning can be divided into two types of learning algorithms: supervised and unsupervised.

Supervised

Supervised machine learning is the most commonly used type of machine learning today. It means that we build a model from a data set to make data-driven predictions. In the end, it is all about value prediction.

We train a model using training data, and the model then predicts results on new data. In Bayesian spam filtering, we flag an email as spam, and the filter learns that emails like the one we flagged should be treated as spam.

We never have to do any calculations in our heads anymore

“In supervised machine learning, the computer works out the relationship for you”
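
To make the spam example concrete, here is a minimal sketch of a Bayesian spam filter built with scikit-learn's Naive Bayes classifier. The emails and labels are invented for illustration; a real filter would learn from the thousands of messages users flag.

    # A minimal supervised-learning sketch: a toy Bayesian spam filter.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    emails = [
        "win a free prize now",       # spam
        "cheap meds free shipping",   # spam
        "meeting agenda for monday",  # not spam
        "lunch tomorrow at noon?",    # not spam
    ]
    labels = [1, 1, 0, 0]  # 1 = spam, 0 = not spam

    # Turn each email into word counts, then train on the labeled examples.
    vectorizer = CountVectorizer()
    features = vectorizer.fit_transform(emails)
    model = MultinomialNB().fit(features, labels)

    # The trained model predicts labels for new, unseen emails.
    new_email = vectorizer.transform(["claim your free prize"])
    print(model.predict(new_email))  # [1] -> flagged as spam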

Unsupervised

Unsupervised machine learning is the more intelligent one. The one we have been promised by science fiction movies for years. The one that is wise enough to find structure in data without any reference labels. In more technical terms, unsupervised machine learning identifies clusters of like data.

I know… It’s a lot to take in…

You don’t say anything about what the algorithm should look for. We just give it the data, and since we don’t tell it what’s correct, the result will be unpredictable. Gmail uses unsupervised learning algorithms to cluster emails into different categories. It learns by itself. These kinds of algorithms are still not used nearly as much as supervised ones, though. Should we fear the explosion of unsupervised machine learning? Definitely.

So, what is the difference?

A supervised algorithm learns what you tell it to learn. Say you are building a face detection model: you feed the algorithm a thousand labeled pictures of faces, plus various pictures containing no faces, and it builds a model from that. The next time you show it a picture of a face, it will know that it is a face, and it will eventually distinguish between a skyscraper and a face. There is no machine elfish magic involved.

An unsupervised algorithm, on the other hand, clusters the data into different groups. It won’t learn that a face is a face, but it will find out that faces are very different from skyscrapers.
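
As a sketch of that idea, the snippet below asks scikit-learn's KMeans to group unlabeled points. The data is synthetic, and we never tell the algorithm what the two groups mean.

    # An unsupervised sketch: KMeans finds clusters in unlabeled data.
    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    # Two blobs of points, handed to the algorithm with no labels at all.
    data = np.vstack([
        rng.normal(loc=0.0, scale=0.5, size=(50, 2)),
        rng.normal(loc=5.0, scale=0.5, size=(50, 2)),
    ])

    kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(data)
    print(kmeans.labels_)  # the group each point was assigned to, found on its own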

Workflow

The machine learning workflow can be described in the following five steps.

What are we looking for exactly?

1. Asking the right question

When we start out, it’s all about asking the right question. What are we looking for? What kind of results do we want to achieve? Is 80 percent accuracy a realistic goal?

In this step, we define the solution statement, the scope, the expected performance and how we will achieve this.

2. Preparing data

The statement is defined and ready to be acted on. We start to collect the data, but where should the data come from? It can come from a variety of sources, like Google, government databases or your own company. The golden rule to remember in this step is:

“The closer the data is to what you are predicting, the better”

That said, a second rule also needs to be kept in mind:

“Data will never be in the format you need it to be”

So, with these two rules in mind, we can start eliminating columns we don’t need and dropping duplicate rows. We also need to get rid of empty columns. We want to keep our data set limited to only the needed columns, so correlated columns need to go as well.

Here we see a correlation between the fields skin and thickness

By correlated columns, I mean columns that contain the same information but in a different format. These columns do not provide any additional information, and there is a chance that the algorithm we choose will get confused by them.
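
In pandas, that clean-up can look something like the sketch below. The file name and column names here are hypothetical stand-ins.

    # A preparing-data sketch with pandas (file and columns are made up).
    import pandas as pd

    df = pd.read_csv("patients.csv")

    df = df.drop_duplicates()                  # remove duplicate rows
    df = df.dropna(axis="columns", how="all")  # remove entirely empty columns
    df = df.drop(columns=["id"])               # remove columns we don't need

    # skin and thickness carry the same information, so keep only one.
    df = df.drop(columns=["skin"])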

3. Molding data

When the data is prepared, we mold it. This means dropping rows and adjusting data types. We can also create new columns where needed. Best practice says it is better to change string columns to integers where possible. When molding data, we simply cannot be sloppy: this step can mess up the later steps if we screw up or forget what we did to get a new column.

“Always track how you manipulate data”
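
A small molding sketch, with a hypothetical diabetes column: change the strings to integers and keep the mapping around, so you can always trace how the column was derived.

    # Molding data: turn a string column into integers and track the mapping.
    import pandas as pd

    df = pd.DataFrame({"diabetes": ["True", "False", "True"]})

    # Record the mapping itself; this is how we track the manipulation.
    diabetes_map = {"True": 1, "False": 0}
    df["diabetes"] = df["diabetes"].map(diabetes_map)
    print(df)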

4. Selecting the algorithm

So, the data is now prepared, cleaned and molded. That means we are ready to start thinking about what algorithm we should use.

The algorithm you choose can have a big impact on the result. Always experiment.

Role of the algorithm

The algorithm is the engine that drives the entire process. It executes the logic, processes the training data and then produces the trained model. The model contains the logic used to evaluate new data, together with the parameters created during the training process.

Decision factors

When you’re deciding which algorithm will produce the best results, there are several decision factors to think about. How does the algorithm learn? Is it a profoundly complex algorithm? Is it basic or enhanced? However, you can, and you should, experiment with different algorithms to see which one performs best.

Popular algorithms

There are over 50 algorithms to choose from. Some of these are:

Naive Bayes is based on likelihood and probability. It weighs each feature the same: every column is equally important. It requires a smaller amount of data, is simple to use and is easy to understand.

Logistic Regression puts weights on the relationships between features. It’s also simple and very fast, and it is very stable to data changes.

Decision Tree uses a binary tree model where each node contains a decision. This algorithm requires enough data to be able to determine nodes and splits.

Random Forest is an ensemble algorithm that creates multiple trees. It uses a technique called bagging to improve predictive performance, which does make it quite a bit slower than, for example, Logistic Regression.
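
To see which of these performs best on your data, you can simply score them all, as in the sketch below. It runs on a synthetic data set, so the numbers only illustrate the process.

    # Comparing the four algorithms above with 5-fold cross-validation.
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.naive_bayes import GaussianNB
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=500, random_state=0)

    models = {
        "Naive Bayes": GaussianNB(),
        "Logistic Regression": LogisticRegression(max_iter=1000),
        "Decision Tree": DecisionTreeClassifier(random_state=0),
        "Random Forest": RandomForestClassifier(random_state=0),
    }
    for name, model in models.items():
        scores = cross_val_score(model, X, y, cv=5)
        print(f"{name}: mean accuracy {scores.mean():.3f}")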

5. Training the model

This step is all about letting specific data teach a machine learning algorithm to create a specific forecast model. Before you train the model, it is important to split the data into data used for training and data used for testing. If you forget to split, you will get misleadingly good predictions, since you test with the very data the model was trained on. It’s good practice to use 70% of the data for training and 30% for testing.
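
Here is a minimal split-then-train sketch with scikit-learn, again on a synthetic data set:

    # Hold out 30% of the data for testing; train only on the other 70%.
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import GaussianNB

    X, y = make_classification(n_samples=500, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.30, random_state=0)

    model = GaussianNB().fit(X_train, y_train)
    # Score only on data the model has never seen.
    print(model.score(X_test, y_test))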

There are a lot of options to improve performance when it comes to training the model. It comes down to which algorithm you decide on and what environment you’re in. Sometimes it helps to add additional helper columns for a performance boost.

Tips

In the machine learning workflow, the early steps are the most important. But you should expect to loop back sometimes.

She just found out that by using a random forest algorithm she can increase her accuracy by 10%

Language and frameworks

There is no single definitive programming language, framework or environment when it comes to machine learning, but some are more popular than others, especially Python and R.

Python

Python is certainly not the fastest nor the most powerful language, but it does have some great libraries that help you in the machine learning fairyland. Its syntax is simple, elegant, consistent and math-like, which makes the code very readable and writable. Python’s ecosystem is rich, and some popular libraries for machine learning are:

  • NumPy (scientific computing)
  • pandas (data frames)
  • Matplotlib (2D plotting)
  • scikit-learn (algorithms, pre-processing and performance evaluation)
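
A quick sketch of how those libraries fit together, on made-up data: NumPy generates it, pandas holds it, scikit-learn fits a model and Matplotlib plots the result.

    # The Python ML stack in one small (synthetic) example.
    import matplotlib.pyplot as plt
    import numpy as np
    import pandas as pd
    from sklearn.linear_model import LinearRegression

    x = np.linspace(0, 10, 50)
    y = 2 * x + np.random.default_rng(0).normal(size=50)

    df = pd.DataFrame({"x": x, "y": y})
    model = LinearRegression().fit(df[["x"]], df["y"])

    plt.scatter(df["x"], df["y"])
    plt.plot(df["x"], model.predict(df[["x"]]))
    plt.show()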

R

R is another great choice for machine learning. For statistical workloads it can be faster than Python, and its visualization capabilities are also much better than Python’s. Because R is feature-rich and powerful, it has a steep learning curve at the beginning.

R has been used primarily in academics and research, but it’s growing in the enterprise market. Statistics is R’s central mission, and when it comes to visualization it has, without a doubt, the lead.

Jupyter Notebook

Jupyter Notebook is a great environment for machine learning. Some of you may know it as IPython. You get a single interactive document where you can write documentation, write code and embed interactive widgets.

It is perfect for iterative work like the machine learning workflow. It is also very shareable. You’re not locked into a specific language either, because it supports over 40 programming languages, including popular ones such as Python, R, Julia and Scala.

Where to learn more about machine learning?

There are tons of great resources out there to dive deeper into machine learning. On Pluralsight you can find multiple practical and top-notch courses when searching for machine learning.

On Medium, there are also plenty of supreme articles about machine learning. If you’re more interested in the technical part of it, GitHub is your comrade.

This was part 1 of a series about machine learning, natural language processing and bots. Hit that clap button if you enjoyed this article.

Rise above archaic tech with the use of ML, NLP, and BOTS! Part 2 — Natural Language Processing

Rise above archaic tech with the use of ML, NLP, and BOTS! Part 3 — BOTS
