‘Suitcase Words’: What Machine Learning (AI) really is & what it does

These days, no matter what industry you work in, there are a lot of ‘suitcase words’ flying around. ‘Suitcase words’ are words or phrases that you can stuff almost any meaning into (Machine Learning and AI both seem to be ‘suitcase words’). Because anything fits into such words, they start to lose their original meanings, they lack precise definitions, and the language around them loses precision. ‘Suitcase words’ may arise when people try to encompass broad domains, or when there is a general lack of foundational understanding.

In the business world I operate in, I have often seen ‘suitcase words’ or ‘buzzwords’ used in contexts where they don’t make any sense, and it is often due to a lack of foundational understanding. This article is my attempt to dispel the mysteries around one such ‘suitcase word’, Machine Learning, and to bring forward a semi-technical understanding of what it is and what it does.

The Precise Definition of Machine Learning

In order to veer away from any generic ‘suitcase’ explanation of what Machine Learning is, a precise definition is needed. Machine Learning is a type of AI (Artificial Intelligence) and refers to the science of getting computers to ‘learn’ without being explicitly programmed. While programming means giving a computer specific sets of detailed instructions that tell it what to do and how, the goal of Machine Learning is to have a computer learn from experience, just as human beings learn from the culmination of life experiences. Tom Mitchell, in his book titled ‘Machine Learning’, formalizes this definition in the following way. A machine is said to accomplish the task of ‘learning’ if:

“A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.”

Said in other words, the goal of Machine Learning is to create algorithms that accomplish some task T, drawing on a data set of experiences E, such that performance at T, as measured by P, improves with E. Sounds complicated? It’s really not. In many respects, this statement also describes the human experience of learning. Humans are experiential beings: every day we contextualize data via our human operating system’s natural algorithms and use that data to drive decisions and insights.

There are three different types of Machine Learning you should be aware of. While this list is by no means exhaustive, it should give you an awareness that there are many types of conventional Machine Learning algorithms that can be used to solve different types of real-world problems, depending on the known and unknown parameters.

Supervised Learning: Labeled Data Sets

  • Regression (Market Forecasting, Weather Forecasting, Advertising Popularity Prediction, Estimating Life Expectancy)
  • Classification (Diagnostics, Customer Retention, Image Classification, Identity Fraud Detection)

Unsupervised Learning: Unlabeled Data Sets

  • Clustering (Recommender Systems, Customer Segmentation, Targeted Marketing)
  • Dimensionality Reduction (Meaningful Compression, Feature Elicitation, Structure Discovery, Big Data Visualization)

Reinforcement Learning

  • Real-Time Decisions, Robot Navigation, Learning Tasks, Skill Acquisition, Game AI

Machine Learning Terminology:

In order to walk-the-walk, you need to be able to talk-the-talk. This is where things often go wonky when people try to explain Machine Learning in the context of business applications. Fundamentally, one must understand that all Machine Learning algorithms are solving a problem. To solve a problem there are generally inputs, steps to solve the problem, and the outputs. In order to be able to use Machine Learning successfully, you need to understand the Inputs, the Steps to Solve the Problem, and the Outputs intimately.

In Machine Learning, the following definitions hold:

  1. Inputs: Experience E
    Training Data, Training Sets
  2. Steps to Solve Problem: Task T
    Machine Learning Algorithm
  3. Outputs: Performance P
    Prediction, Outcomes, Hypothesis (Real-Values, Discrete Values)

In Machine Learning, inputs are sent through an algorithm, which is used to map those very inputs to an output:

i.e. Output = Steps to Solve a Problem(Inputs)
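This mapping can be sketched in a few lines of Python. The function name, parameters, and numbers here are purely illustrative, not from the article:

```python
# Minimal sketch: the "steps to solve the problem" live in a function
# that maps inputs to outputs. Its behavior is set by parameters that
# training would normally determine.

def predict(inputs, parameters):
    """Map inputs to outputs: Output = Algorithm(Inputs)."""
    theta1, theta2 = parameters
    return [theta1 + theta2 * x for x in inputs]

outputs = predict([1.0, 2.0, 3.0], parameters=(0.5, 2.0))
print(outputs)  # [2.5, 4.5, 6.5]
```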

Inputs: Training Data, Training Sets

In Machine Learning, there is a notion of training an algorithm so that its performance, as measured by P, improves over a set of experiences E. The more experiences you can provide to train an algorithm, the better the predictions you can typically obtain. A training set is typically characterized by its size, or number of training examples (m). For example, when predicting the price of a house, you might have accumulated the prices of 56 different houses in a particular geographical area. In this case, the size of your training set is m = 56. Furthermore, training sets can be labeled or unlabeled, and this characterization becomes important when determining which type of Machine Learning algorithm can be used: Supervised Learning (labeled data) or Unsupervised Learning (unlabeled data).
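The labeled-versus-unlabeled distinction is easy to show in code. The house figures below are made up for illustration:

```python
# Hypothetical house-price training set (all values illustrative).
# Labeled data pairs each feature x with a known answer y, which is
# what Supervised Learning requires.
labeled = [
    (1200, 245_000),  # (square feet, sale price)
    (1500, 312_000),
    (2100, 401_000),
]

# Unlabeled data has features only, no answers: the territory of
# Unsupervised Learning (e.g. clustering similar houses).
unlabeled = [1200, 1500, 2100]

m = len(labeled)  # m, the size of the training set
print(m)  # 3
```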

Steps to Solve Problem: Machine Learning Algorithm

Depending on the type of Machine Learning algorithm in question, the words used to describe the parts of an algorithm may change. In general though, a Machine Learning algorithm (in the case of Linear Regression, or Multivariate Regression) will include:

  • Hypothesis Function, hθ(x)
  • Variables or Features, n
  • Parameters, θ1 and θ2

The simplest Machine Learning algorithm is that of Linear Regression:

Hypothesis Function: hθ(x) = θ1 + θ2x
Hypothesis Function is dependent upon Variables or Features, x.

hθ(x) refers to the Hypothesis Function, or the prediction, or the outcome.

x refers to the Variables or Features upon which the Hypothesis Function depends. The number of features is referred to as n.

θ1 and θ2 refer to the Parameters of the Hypothesis Function — a different set of Parameters will most likely yield a different prediction or outcome.
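The hypothesis above is a one-liner in code. The parameter values below are the same illustrative pairs compared later in the article:

```python
# The Linear Regression hypothesis from above: h(x) = theta1 + theta2 * x.
def hypothesis(theta1, theta2, x):
    return theta1 + theta2 * x

# Different Parameters yield different predictions for the same feature x:
print(hypothesis(-0.5, 5.0, 2.0))  # 9.5
print(hypothesis(0.5, 2.5, 2.0))   # 5.5
```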

When we say that a Machine Learning algorithm is being trained, what it means is that we are trying to optimize the Parameters of the Hypothesis Function over the set of m training examples in order to get the best match between the predicted result and the actual result over time, i.e. our definition of ‘learning’. So how will we know if we have arrived at the correct θ1 and θ2? How do we know if θ1=-0.5 & θ2=5 is better or worse at making a prediction than θ1=0.5 and θ2=2.5?

Different Parameters θ will result in a different Hypothesis Function. The best Hypothesis Function is the one that best fits the training set examples, m.

We know that we have obtained the best θ1 and θ2 when the difference between the predicted result from the Hypothesis Function and the actual result from observations is as small as possible, i.e. when the error between predicted and actual is minimized. Minimization of the error is represented by something called a Cost Function (also called the Objective Function, the Squared Error Function or the Squared Error Cost Function). The Cost Function can be thought of as the cost that the algorithm pays for its inaccuracies: the higher the value of the Cost Function, the greater the error of the Hypothesis Function.

In Linear Regression, minimizing the Cost Function, in this case J(θ1,θ2), is a task of minimizing the sum-of-squared-errors or differences across the m training set examples in order to get the ‘correct’ values for θ1 and θ2. Instead of getting into the mathematics of how to minimize the Cost Function — know that the minimization of the sum-of-squared-errors will result in the optimum values of θ1 and θ2 that will result in the lowest ‘cost’ that the algorithm will have to pay in order to make a prediction.

Cost Function: Sum-of-Squared-Errors between Predicted & Actual Results, J(θ1, θ2) = (1/2m) Σ (hθ(x⁽ⁱ⁾) − y⁽ⁱ⁾)², summed over the m training examples.
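A minimal sketch of the squared-error cost, using the common 1/(2m) scaling convention; the data set is fabricated so the answer is easy to check by hand:

```python
def cost(theta1, theta2, examples):
    """Squared-error cost J(theta1, theta2) over m training examples.

    `examples` is a list of (x, y) pairs; the 1/(2m) scaling is the
    convention used in many Linear Regression treatments.
    """
    m = len(examples)
    sse = sum((theta1 + theta2 * x - y) ** 2 for x, y in examples)
    return sse / (2 * m)

data = [(1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]  # lies exactly on y = 1 + 2x
print(cost(1.0, 2.0, data))  # 0.0 -- a perfect fit pays zero cost
print(cost(0.0, 2.0, data))  # 0.5 -- this line misses every point
```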

In reality, gradient descent (taking partial derivatives of J(θ1,θ2) with respect to θ1 and θ2) is used in order to iteratively solve for the optimum θ1 and θ2 that will result in the minimum cost. At the end of the day, when trying to solve for the optimum Parameters for a linear Hypothesis Function, it becomes a calculus-type problem of finding the minimum of a 3D surface plot.
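Gradient descent can be sketched in a few lines. The learning rate, iteration count, and data set here are all illustrative choices, not prescriptions:

```python
def gradient_descent(examples, alpha=0.05, iterations=2000):
    """Iteratively adjust theta1, theta2 to minimize the squared-error cost.

    Each step moves the parameters opposite the partial derivatives of
    J(theta1, theta2); alpha is the learning rate (chosen illustratively).
    """
    theta1, theta2 = 0.0, 0.0
    m = len(examples)
    for _ in range(iterations):
        errors = [(theta1 + theta2 * x - y, x) for x, y in examples]
        grad1 = sum(e for e, _ in errors) / m          # dJ/d(theta1)
        grad2 = sum(e * x for e, x in errors) / m      # dJ/d(theta2)
        theta1 -= alpha * grad1
        theta2 -= alpha * grad2
    return theta1, theta2

data = [(1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]  # generated from y = 1 + 2x
t1, t2 = gradient_descent(data)
print(round(t1, 2), round(t2, 2))  # converges toward theta1 = 1, theta2 = 2
```

Because the data were generated from y = 1 + 2x, the iterations settle near θ1 = 1 and θ2 = 2, the parameters with minimum cost.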

Personally, I am a math person, and if you’re a math person like me, you will appreciate the mathematics involved in finding the optimum θ1 and θ2 via the gradient descent algorithm.

But honestly, all of this just barely scratches the surface of the true potential of Machine Learning for a variety of different applications.

The Best Resource

If you really want to learn the ins-and-outs of Machine Learning, the best resource I can point you to is actually a course offered by Stanford University Online.