Machine Learning in a nutshell

Rekha Venkatanarayan — Mon, 23 Mar 2020 17:53:20 GMT

What is Machine Learning?

Machine Learning (ML) is a branch of Artificial Intelligence (AI) that gives systems the ability to learn and improve from experience automatically, without being explicitly programmed. An ML algorithm finds patterns in the given data, creates a model based on that data, and then uses the model to recognize those patterns in new, similar data.

How is ML different from Traditional Programming?

Traditional Programming

Traditional Programming refers to any manually created program that uses input data and pre-canned rules to produce the output; its decision-making capability is built in through those rules. This is a static process, as the rules are applied independently of the data being analyzed. Traditional programs follow a rigorous mathematical approach and are written to perform a specific task according to the defined rules.

For example, to generate the output samples y of a function y = f(X) for a given input sample X, we need to write the program for the function f(X) ourselves.
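As a minimal sketch of the traditional approach, the rule is written by hand before any data is seen. The rule y = 2x + 1 and the name f are assumptions chosen only for illustration.

```python
def f(x):
    """Hand-written rule: the programmer decides the logic up front."""
    return 2 * x + 1


# The same fixed rule is applied to every input, whatever the data looks like.
X = [0, 1, 2, 3]
y = [f(x) for x in X]
print(y)  # [1, 3, 5, 7]
```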

Machine Learning

In ML, the input data and the expected output are both fed to the ML algorithm, which generates a program. ML algorithms can formulate the rules/logic from the given data automatically, without human intervention. The logic created by ML is then used to work with a new set of input data. ML algorithms are data-intensive and dynamic, as they alter the output based on the patterns found in the input data. An ML system is truly a learning system: it is not programmed to perform a task, but programmed to learn to perform the task.

For example, we are given input samples X and the output samples y of a function y = f(X). ML uses the input samples X and the output samples y to “learn” the function f and then evaluates it on a new input data set.

ML is an interactive process and underpins what is called augmented analytics. It is also more prediction-oriented: the insights it finds in existing data can be used to predict future outcomes.
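A minimal sketch of this idea, assuming the samples happen to follow a straight line: ordinary least squares estimates the slope and intercept from X and y, and the learned function is then evaluated on a new input. The helper names fit_line and learned_f, and the sample values, are hypothetical.

```python
def fit_line(X, y):
    """Estimate slope and intercept from samples with ordinary least squares."""
    n = len(X)
    mean_x = sum(X) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(X, y))
    sxx = sum((xi - mean_x) ** 2 for xi in X)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Input samples and expected outputs; the rule behind them is never written down.
X = [0, 1, 2, 3]
y = [1, 3, 5, 7]

slope, intercept = fit_line(X, y)


def learned_f(x):
    """The 'program' produced by learning from the samples."""
    return slope * x + intercept


print(learned_f(10))  # 21.0
```

Only the data changes between problems here; the fitting code stays the same, which is the contrast with a hand-written rule.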

What is ML Data?

Solving any problem using ML depends on gathering pertinent data. Data is at the core of Machine Learning, as the algorithms learn from the data provided. It is critical to feed ML algorithms the right data to solve the problem. Data can arrive in any unprocessed form, such as text, sound, values or pictures, which an algorithm cannot analyze directly. This raw data needs to be pre-processed (converted to a useful scale and format, with meaningful features) and prepared before it can be used to train (create) a model.

How to Prepare Data for Machine Learning?

The data preparation process is as follows:

1. Select relevant data

a. Selecting the relevant data to solve a given problem is the first step. It is important to gather past data in whatever form it exists, such as database tables, images, text and comma-separated files.

2. Cleanse the data

a. Selected data may be in a raw format that is not suitable for working with learning algorithms, so it needs to be cleaned. This is done by identifying and fixing errors in the data set, such as missing, duplicate, incomplete and inconsistent values. Removing the records that contain such errors, rather than fixing them, is known as record sampling.

b. An error-free training data set helps to produce better results from machine learning algorithms.

c. Along with the quality of the data, its quantity is also very important. A condensed data set can still generate accurate results. It is obtained by selecting the most important attributes (called features in ML) in the data set. This is known as attribute sampling.

d. Furthermore, techniques such as feature scaling, feature decomposition and feature aggregation are applied to the cleaned data.

e. Cleaning and preparing the data for the learning algorithm is an iterative process.

f. Once the data is prepared, it is consumed by various machine learning algorithms to find the right model. Finding the right model is also an iterative process.
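The cleansing and scaling steps above can be sketched as follows. This is a minimal illustration under assumptions: the records are small numeric rows, missing values are marked with None and filled with the column mean, and feature scaling is done with min-max scaling. The function names and sample values are hypothetical.

```python
def clean(records):
    """Drop exact duplicate rows and fill missing values with the column mean."""
    unique = []
    for row in records:
        if row not in unique:
            unique.append(row)
    n_cols = len(unique[0])
    means = []
    for c in range(n_cols):
        present = [row[c] for row in unique if row[c] is not None]
        means.append(sum(present) / len(present))
    return [
        [means[c] if row[c] is None else row[c] for c in range(n_cols)]
        for row in unique
    ]


def min_max_scale(records):
    """Rescale every column to the range [0, 1] (one form of feature scaling)."""
    n_cols = len(records[0])
    lows = [min(row[c] for row in records) for c in range(n_cols)]
    highs = [max(row[c] for row in records) for c in range(n_cols)]
    return [
        [
            (row[c] - lows[c]) / (highs[c] - lows[c]) if highs[c] > lows[c] else 0.0
            for c in range(n_cols)
        ]
        for row in records
    ]


raw = [
    [20, 1000],
    [20, 1000],  # duplicate record
    [40, None],  # missing value
    [60, 3000],
]
cleaned = clean(raw)
scaled = min_max_scale(cleaned)
print(cleaned)  # [[20, 1000], [40, 2000.0], [60, 3000]]
print(scaled)   # [[0.0, 0.0], [0.5, 0.5], [1.0, 1.0]]
```

Filling with the column mean is only one option; dropping the incomplete record instead would be an equally valid choice for this sketch.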

How to train the Model?

We cannot create a model without data. Creating a model in ML is called training the model, and for this the data is commonly split in the ratio of 80:20: 80% of the data is used to train the model and the rest is set aside for testing it. To verify the model, its predictions on the test data are compared with the actual output and its accuracy is measured. As a good practice, the model should be re-created regularly as the data changes.
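A minimal sketch of the 80:20 split and the accuracy check, using only the Python standard library. The data set, the threshold "model" and all function names are assumptions made for illustration; a real project would typically use an ML library for both steps.

```python
import random


def train_test_split(samples, train_ratio=0.8, seed=0):
    """Shuffle a copy of the samples and split it into training and test parts."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


def accuracy(predictions, actual):
    """Fraction of predictions that match the actual output."""
    correct = sum(1 for p, a in zip(predictions, actual) if p == a)
    return correct / len(actual)


# Hypothetical labeled data: (feature, label), where the label is 1 for features >= 50.
data = [(x, 1 if x >= 50 else 0) for x in range(100)]
train, test = train_test_split(data)

# "Training": place a threshold halfway between the mean feature of each class.
mean_0 = sum(x for x, label in train if label == 0) / sum(1 for _, label in train if label == 0)
mean_1 = sum(x for x, label in train if label == 1) / sum(1 for _, label in train if label == 1)
threshold = (mean_0 + mean_1) / 2

# Testing: compare predictions on the held-out 20% with the actual labels.
predictions = [1 if x >= threshold else 0 for x, _ in test]
actual = [label for _, label in test]
acc = accuracy(predictions, actual)
print(len(train), len(test), acc)
```

The test samples play no part in choosing the threshold, which is what makes the measured accuracy a fair check.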

Machine Learning: Are we there yet?

Rekha Venkatanarayan — Tue, 08 Jan 2019 17:09:12 GMT

Artificial Intelligence: Are we there yet?

AI machines have a long way to go before they come close to mastering certain human traits that require subtle nuances. Let’s focus on two of their shortcomings: Decision-Making and Depth Perception.

Background:

The core of AI is Machine Learning. In Machine Learning, machines find patterns in the provided data, then create/train a model and use that model to recognize the patterns in new data. Depending on the type of problem to be solved, machines undergo Supervised or Unsupervised Learning. Supervised Learning works with labeled/annotated data. It uses regression techniques for continuous values and classification techniques for discrete (categorical) values to predict the target value, learning from the training data. Unsupervised Learning works with unlabeled data, where the target value is not part of the training data, and uses techniques such as clustering to group similar data. A Machine Learning model is only as good as its training data, à la “Garbage In, Garbage Out.” To derive any valuable pattern, it also matters how the success of the prediction is measured. Hence, before the model is created, preprocessing techniques like data munging are applied to the raw input data to get rid of unreliable, meaningless information that could lead to wrong decisions. Here, missing values are identified and accounted for, and corrupt data is recognized and fixed.
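To make the unsupervised case concrete, here is a minimal sketch of clustering on unlabeled, one-dimensional data using a simple k-means loop. The function name k_means_1d, the deterministic starting centroids and the sample values are assumptions for illustration, and the sketch expects k to be at least 2.

```python
def k_means_1d(values, k=2, iterations=10):
    """Group unlabeled numbers into k clusters around their means."""
    ordered = sorted(values)
    # Deterministic start: spread the initial centroids across the sorted data.
    centroids = [ordered[i * (len(ordered) - 1) // (k - 1)] for i in range(k)]
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in values:
            # Assign each value to the nearest centroid.
            nearest = min(range(k), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Move each centroid to the mean of its cluster (keep it if the cluster is empty).
        centroids = [
            sum(c) / len(c) if c else centroids[i] for i, c in enumerate(clusters)
        ]
    return centroids, clusters


# No labels are supplied; the grouping comes from the data alone.
values = [1.0, 1.5, 2.0, 9.0, 9.5, 10.0]
centroids, clusters = k_means_1d(values)
print(centroids)  # [1.5, 9.5]
print(clusters)   # [[1.0, 1.5, 2.0], [9.0, 9.5, 10.0]]
```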

Shortcoming 1: Decision-Making

AI machines fail in real life quite frequently. Consider the case of Makoto Koike’s cucumber farm, where the model’s accuracy for sorting cucumbers fell from 95% (lab) to 70% (real life). This drop was due to “overfitting,” where the model was precisely tuned to the training data but not to large sets of unseen data. Models predict poorly for two main reasons: overfitting, when the model does not generalize well beyond its training data, and underfitting, when the model is too simple or too few features inform it. To estimate real-world performance more reliably, cross-validation techniques are employed, where the model is trained on a subset of the data and evaluated on the complementary subset.
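As a minimal sketch of how cross-validation divides the data, the generator below produces k pairs of complementary index sets: one part for training and the rest for evaluation. The function name k_fold_indices is hypothetical, and the samples are assumed to be shuffled already.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


folds = list(k_fold_indices(10, 5))
for train, validation in folds:
    print(train, validation)
```

Each sample is used for evaluation exactly once, so averaging the scores over the folds gives a steadier estimate than a single split.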

Possible Solution:

Acquiring good domain-specific data is hard, time-consuming and expensive. One avenue to obtain a large labeled training set could be crowdsourcing. Using a channel like Google’s “reCAPTCHA,” we could get data annotated by humans.

Shortcoming 2: Depth Perception

Vision subsystems hold the key to AI machines understanding the world around them, but their learning is still in its infancy. A two-month-old baby will overshoot or under-reach a toy, but a six-month-old will have no problem going straight for the toy. Depth perception in vision subsystems is a hard problem to solve, and most existing solutions are complex, slow and not always accurate.

Possible Solution:

Supervised learning for depth perception can use a training set of real camera images labeled with their corresponding ground-truth depths, or a set of synthetic graphics images. The machine learns to map monocular vision cues in the image, such as gradients, to the relative depths in a scene, and to create a 3D model that is quantitatively accurate.

Moore’s law suggests that computing power grows exponentially over time, but an AI model’s accuracy scales only logarithmically with the amount of labeled data. This implies that the bottleneck to machine intelligence will not be computing power, but rather how much quality input data we can label for training. Designing the right architecture for a specific problem is difficult as well.
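The scaling argument can be written out under simple assumptions. The symbols below are illustrative, not measured values: C(t) is computing power at time t with doubling period T, and A(n) is accuracy with n labeled samples, with constants a and b > 0.

```latex
C(t) = C_0 \cdot 2^{t/T}, \qquad A(n) = a + b \log n

A(2n) - A(n) = b \log 2
```

Under these assumptions, each doubling of the labeled data adds only the same fixed amount b log 2 to the accuracy, while computing power itself doubles every period T. Labeled data therefore becomes the limiting factor well before computation does.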

Conclusion:

Machine learning is a continuous process, mainly because of the machine’s constantly changing environment. There’s “No Free Lunch”: no one model works perfectly for every problem. Which of any two learning algorithms is superior at a given point depends on the specific problem at hand, the assumptions, the priors, the available data, and the cost. Data modelers need to budget time and resources for model upgrades and maintenance.

Stories by Rekha Venkatanarayan on Medium