The ABC’s of Machine Learning

Abhinav Chauhan
Published in Tilicho Labs · 8 min read · Apr 25, 2018

Machine Learning/Artificial Intelligence — Machine learning has become one of the most ubiquitous buzzwords in the tech industry. From Google, Microsoft, and Amazon to IBM, Tesla, Netflix, Baidu, and Facebook, and even startups, everyone wants a piece of the action. Machine learning figures prominently in the keynote presentations and speeches of chief executives of large corporations. Its huge potential lies in solving complex tasks that are not feasible with traditional programming.

Data Science — Data science is a broad field that primarily deals with generating actionable insights from data. Machine learning is the 1,000-pound gorilla at the center of data science. The data science pipeline consists of data collection/importing, followed by data cleaning/formatting (the single most time-consuming part of the pipeline, covering missing or invalid values, formatting errors, outlier removal, encoding of categorical data, etc.), then exploratory data analysis, followed by data wrangling, and finally data modeling (this is where machine learning comes into play).
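
To make those stages concrete, here is a minimal sketch of the importing, cleaning, and exploration steps in pandas. The file housing.csv and its columns (area, city, price) are hypothetical, chosen only for illustration:

```python
import pandas as pd

# data collection / importing (hypothetical file and columns)
df = pd.read_csv("housing.csv")

# data cleaning / formatting
df = df.drop_duplicates()
df["area"] = df["area"].fillna(df["area"].median())   # deal with missing values
df = df[df["price"] > 0]                              # drop invalid rows
df = pd.get_dummies(df, columns=["city"])             # encode categorical data

# quick exploratory data analysis
print(df.describe())
print(df.corr())
```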

What is Machine Learning — “Machine learning is giving computers the ability to learn without explicitly programming them.” Sounds scary, right? But there is no need to worry just yet, unless you are Elon Musk (Google “Elon Musk on Artificial Intelligence”). Machine learning evolved from pattern recognition, and the demarcation between the two is quite blurred.

Before we delve into the applications of machine learning, let us understand what it really means. First of all, let me clearly state that machine learning is a subset of artificial intelligence, and the subject is not a recent development but has been around for a long time; in fact, the name “machine learning” was coined in 1959 by Arthur Samuel. So why, when machine learning has existed for so long, have we suddenly started seeing it and related buzzwords everywhere, from tech magazines and interviews to blogs and keynotes? The answer is the Neural Network (aka Deep Neural Network/Neural Net/Artificial Neural Network). An ANN is a vast system of interconnected nodes modeled loosely after the human brain. Neural networks require a significant amount of computational power and memory, and they are able to solve complex tasks such as speech/voice recognition and image recognition that were not possible earlier.
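
For intuition, here is a toy sketch of a single pass through such a network of interconnected nodes, written in NumPy with random (untrained) weights; the layer sizes are made up purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random(4)            # 4 input features
W1 = rng.random((8, 4))      # weights connecting the inputs to 8 hidden nodes
W2 = rng.random((3, 8))      # weights connecting the hidden nodes to 3 outputs

hidden = np.maximum(0, W1 @ x)   # each hidden node: weighted sum of inputs + ReLU activation
output = W2 @ hidden             # output scores, one per hypothetical class
print(output)
```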

Most common Platforms and Tools for Machine Learning — Deciding between Python and R

The two most popular open-source machine learning/data science languages

Python ecosystem — Python, along with R, is the most preferred language for data science and machine learning. It has libraries such as NumPy and Pandas for data wrangling, Matplotlib and Seaborn for data visualization, and Scikit-Learn for implementing machine learning algorithms. Python also has Google’s backing: Google’s neural network library TensorFlow, whose core is written in C++, exposes Python as its primary interface.

The R programming language and its packages have an equally large and dedicated following, and ultimately which one you choose (Python or R) is up to you. Just as Python is supported by Google, R has the backing of Microsoft.

I went with Python because it is a general-purpose programming language (meaning you can do a lot of things apart from data science/machine learning, such as creating games, automating simple tasks, back-end work, etc.). It is also one of the simplest languages, with an easy learning path (“Think Python” or “Learn Python the Hard Way” will be enough to get you up to speed with the syntax, and if you have a programming background it will be a cakewalk).

Other tools/software frequently used for machine learning-related tasks are MATLAB, Octave, SAS and SPSS (the latter two primarily for statistical tests and analysis), and Tableau and QlikView (for data visualization).

Types of Machine Learning

1. Supervised Learning — Supervised learning is the class of machine learning where we have labelled training data (existing data with class labels) on which we train our model, and we then use that model to predict outcomes for unlabelled data (test data). Supervised learning can be broadly divided into regression and classification tasks.

  • Regression involves predicting a continuous variable, such as the average price of a house based on its specifications (by analyzing existing data on the housing market) or a candidate’s score on an exam based on their past record. Multiple linear regression, polynomial regression, and decision tree regression are some of the best-known regression algorithms.
  • Classification involves predicting a discrete class label, such as yes or no: whether a loan gets approved, a t-shirt’s size (small, medium, large), and so on. A famous introductory classification problem (on Kaggle) is predicting whether a passenger survived the Titanic based on features such as the port where they boarded, their age, gender, cabin class, accompanying family members, etc. Some of the best-known classification algorithms are support vector machines, logistic regression, decision trees, random forests, and naïve Bayes (a small classification example follows this list).
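
As a concrete illustration of supervised classification, here is a minimal scikit-learn sketch; it uses the library’s bundled breast-cancer dataset as a stand-in for the Titanic data mentioned above:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)          # labelled data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# scale the features, then fit a logistic regression classifier on the labelled data
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

print(model.predict(X_test[:5]))                    # predicted labels for unseen data
print("accuracy:", model.score(X_test, y_test))
```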

2. Unsupervised Learning — Unsupervised learning is the class of machine learning where we draw inferences from data without class labels. For example, segmenting (clustering) customers into different groups based on their needs or traits. Some well-known unsupervised machine learning algorithms are K-means clustering, hierarchical clustering, DBSCAN, etc.
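
Here is a minimal clustering sketch with K-means; the “customers” and the choice of three segments are made up purely for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
customers = rng.random((200, 2))   # two made-up traits per customer, e.g. spend and visit frequency

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(customers)
print(kmeans.labels_[:10])         # the segment assigned to the first 10 customers
print(kmeans.cluster_centers_)     # the centre of each segment
```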

3. Reinforcement Learning — goal-oriented learning based on interaction with an environment. Reinforcement learning is often described as the most promising path toward true artificial intelligence, and its potential is immense. Self-driving cars draw heavily on this field.
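
To give a flavour of “learning from interaction with an environment”, here is a toy tabular Q-learning sketch on a made-up five-state corridor; the environment and all parameters are illustrative, not taken from any real system:

```python
import numpy as np

n_states, n_actions = 5, 2            # actions: 0 = move left, 1 = move right
Q = np.zeros((n_states, n_actions))   # table of "how good is action a in state s"
alpha, gamma, epsilon = 0.1, 0.9, 0.3

rng = np.random.default_rng(0)
for episode in range(500):
    state = 0
    while state < n_states - 1:       # an episode ends at the last state
        # explore occasionally, otherwise act greedily on current knowledge
        if rng.random() < epsilon:
            action = int(rng.integers(n_actions))
        else:
            action = int(Q[state].argmax())
        next_state = max(0, state - 1) if action == 0 else state + 1
        reward = 1.0 if next_state == n_states - 1 else 0.0
        # learn from the interaction: nudge the value estimate toward what was observed
        Q[state, action] += alpha * (reward + gamma * Q[next_state].max() - Q[state, action])
        state = next_state

print(Q)   # after training, moving right should be valued higher in every state
```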

Fuel for Machine Learning — Data

Data has been termed the oil of the 21st century, and data scientist was described as the sexiest job of the 21st century by Harvard Business Review in 2012. Since then, the machine learning and data science landscape has evolved dramatically at an ever-increasing pace. Without data, machine learning is nothing; the “learning” in machine learning comes from data. Say we have a dependent variable that depends on two independent variables (three variables in total): we can plot the data points in three-dimensional space. But if the dependent variable depends on hundreds or thousands of variables (in many machine learning problems it is even more than a thousand), we can no longer visualize the data, and we need hyperplanes (the higher-dimensional generalization of a plane) to separate the different classes from each other. This is where machine learning comes in handy. Put plainly, machine learning is simply fitting a line (or hyperplane) to training data, using it to create classification boundaries (or to predict values in the case of regression), and then predicting the classes of new (test) data using those boundaries.
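
A minimal sketch of “fitting a line and using it as a boundary”: a linear classifier on two synthetic 2-D clusters, where the learned hyperplane is just the coefficients w and intercept b in w·x + b = 0. The data here is randomly generated for illustration:

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
class_a = rng.normal(loc=0.0, scale=1.0, size=(50, 2))   # one synthetic cluster
class_b = rng.normal(loc=4.0, scale=1.0, size=(50, 2))   # a second, shifted cluster
X = np.vstack([class_a, class_b])
y = np.array([0] * 50 + [1] * 50)

clf = LinearSVC().fit(X, y)
print("hyperplane w:", clf.coef_, "b:", clf.intercept_)   # the fitted separating line
print("predicted classes:", clf.predict([[0, 0], [4, 4]]))
```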

Real-life applications of Machine Learning

IBM Watson playing Jeopardy!

Machine learning applications abound everywhere around us; we just rarely notice or think about them. Some of the applications are:

  • Email classifiers — Gmail automatically classifies a mail as “spam” or “not spam”, based on algorithms trained on a large database of mails that Google already knew were spam or not (a toy version of such a classifier is sketched after this list).
  • Self-driving cars — Google and Uber are working on, and already have working prototypes of, driverless cars, which draw in part on reinforcement learning.
  • Grading essays — Using natural language processing techniques, machine learning has proven its mettle at grading students’ essays with accuracy comparable to that of teachers.
  • Detecting cancer — Using image recognition techniques based on convolutional neural networks, tumors can be detected early, before they turn malignant.
  • Recommender systems — With the help of machine learning techniques and large collections of user data, companies such as Amazon, Netflix, and IMDb are able to recommend movies their users may like and products they may be interested in buying.
  • Credit approval — Many banks use classification techniques such as SVMs and logistic regression to decide whether to approve or reject a loan. One problem here is that giving the applicant feedback on why the loan was rejected may not be possible, since the model simply produces the outcome and not the reasoning behind it (the “reason” is buried in the data used for modeling!).
  • Anomaly/Fraud detection — Machine learning can be used to detect anomalous behavior, which helps prevent fraudulent financial transactions and cyber theft. For example, if a user normally never transacts more than $1,000 and the bank’s servers suddenly receive a transaction request for $100,000, a model can flag that something is fishy, and the transaction or even the bank account can be frozen.
  • Speech recognition — “There are over one billion voice searches per month” (January 2018). The underlying technology behind them: yes, you guessed it, none other than machine learning.
  • Image recognition — Facebook can detect our friends in our pictures and ask us to tag them, and Google lets us search our photos by typing the names of objects that might appear in them, again thanks to machine learning-based image recognition.
  • Playing games — Computers have already showcased their gaming abilities by defeating humans not just at chess (which can be achieved by sheer computational power) but also at the ancient Chinese board game Go, where winning by brute-force search is not feasible because the number of possible move combinations is astronomical, so learning from experience and intuition (or intelligence) is required.
  • Chatbots — Chatbots are an emerging field in which users type their queries and automatic responses are generated by analyzing the text (natural language processing). Chatbots are increasingly used for handling customer queries across industries.
  • Walmart coupons — Coupons are only successful if they are redeemed, so companies like Walmart and Target use machine learning systems to create personalized coupons and increase the efficiency of their promotions.
  • IBM Watson — IBM Watson is a computer that was built to win at Jeopardy!, where questions are awkwardly phrased and filled with sarcasm and puns. For a computer not just to understand these questions but to win the game was a big stride for NLP.
  • Predicting the stock market using sentiment analysis — By analyzing a large number of tweets mentioning a specific company and assessing their sentiment (mood), the company’s stock price can be predicted with reasonable accuracy.
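
As mentioned in the email-classifier item above, here is a toy spam classifier. The handful of messages is made up for illustration; a real system would be trained on a very large labelled mail corpus:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

mails = [
    "win a free prize now",
    "claim your free lottery reward",
    "meeting moved to 3 pm",
    "please review the attached report",
]
labels = ["spam", "spam", "not spam", "not spam"]

# turn each mail into word counts, then fit a naive Bayes classifier on the labels
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(mails, labels)
print(model.predict(["free prize inside", "report for the meeting"]))
```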
