My experience with real-world Machine Learning at Couture AI

Pratik Borikar · Published in cverse-ai · 8 min read · Oct 1, 2019

It’s mid-July and my summer internship at Couture AI, Bengaluru, has come to an end. Looking back, I realize that the last two months have been highly productive and resourceful. Before coming to Couture, I had little to no idea about Machine Learning, Big Data or even Hadoop. In these two months, I have grown as an individual, dived into topics like machine learning, AI and algorithms, and gained valuable professional insights into the work culture and environment. Even though I have only scratched the surface, I now know there is great potential, and great curiosity on my part, in the emerging fields of Data Science and Artificial Intelligence. Apart from this, I have also written reports, conducted seminars and published a blog (this is the first blog post I am writing).

During my internship, I worked mainly on two things: (i) understanding the basics of machine learning and some supervised learning algorithms, and (ii) studying the HDFS architecture and automating the process of adding a datanode to a multi-node cluster setup.

While working on these topics I strengthened my grasp of the underlying mathematics and my shell command-line skills, and gained a completely different perspective on solving problems with ML. I will elaborate on the projects I worked on below.

Project 1: Understanding the basics of Machine learning and some supervised learning algorithms

Machine Learning is one of the most exciting technologies in the field of computer science. It attempts to make machines more similar to humans by giving them the ability to think. Machine learning is an implementation of Artificial Intelligence (AI) that gives computer systems the ability to improve and learn from experience, rather than being explicitly instructed. A more formal definition of Machine Learning was given by Carnegie Mellon University professor Tom Mitchell, which states,

“A computer program is said to learn from experience ‘E’, with respect to some class of tasks ‘T’ and performance measure ‘P’ if its performance at tasks in ‘T’ as measured by ‘P’ improves with experience ‘E’.”

Recent years have demonstrated that Machine Learning can be used to automate a variety of complicated tasks like image recognition, playing games, text generation, natural language processing and so on. Machine learning also has practical business applications that yield real business results, such as savings in money and time, which could potentially impact the future of an organization.

How does the Machine Learning process work?

The machine learning workflow is iterative in nature, and it can be a complex and tedious process. Much of the complexity comes from the large amount of data in which we are trying to find predictive patterns and models. The process can be divided into seven steps: (1) gathering data, (2) data preparation, (3) choosing an ML model, (4) training the model, (5) evaluation, (6) parameter tuning, and (7) predicting outcomes using the tuned model. The following diagram visualizes the process:

The ML Process (Source: Google Images)
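To make the steps concrete, here is a minimal sketch of the seven steps on a toy dataset. scikit-learn, its bundled diabetes data and ridge regression are my own choices, purely for illustration; the process itself is not tied to any particular library or model.

```python
# A minimal walk-through of the seven ML steps on a toy regression dataset.
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split

# (1) Gather data: here, a small built-in regression dataset.
X, y = load_diabetes(return_X_y=True)

# (2) Prepare data: hold out a test set for honest evaluation.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# (3) Choose an ML model.
model = Ridge()

# (4) Train the model.
model.fit(X_train, y_train)

# (5) Evaluate it on data it has not seen.
print("MSE before tuning:", mean_squared_error(y_test, model.predict(X_test)))

# (6) Tune parameters, here with a small grid search over the regularization strength.
search = GridSearchCV(Ridge(), {"alpha": [0.01, 0.1, 1.0, 10.0]}, cv=5)
search.fit(X_train, y_train)

# (7) Predict outcomes using the tuned model.
tuned = search.best_estimator_
print("MSE after tuning:", mean_squared_error(y_test, tuned.predict(X_test)))
```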

Supervised Learning Algorithms

In supervised learning, we are given a data set for which the correct outputs are already known. We train the machine on this labeled data, and the trained model is then used to predict outputs for new input data. Supervised learning can be further classified as:
1. Classification: A classification problem is one where the output variable is a category. For example, sorting a range of pictures of people into ‘male’ or ‘female’ categories.
2. Regression: In this type of problem the output is a continuous real value. For example, predicting the price of a house based on its area, locality, which floor it is on, and so on. (A minimal sketch of both kinds of problems follows below.)
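To make the distinction concrete, here is a tiny sketch of both kinds of problems. The built-in scikit-learn toy datasets stand in for the picture-sorting and house-price examples above; the library and datasets are my own choices for illustration.

```python
# Classification vs. regression with the same fit/predict workflow.
from sklearn.datasets import load_diabetes, load_iris
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Classification: the target is a category (an iris species, encoded 0/1/2).
X_c, y_c = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(random_state=0).fit(X_c, y_c)
print("predicted class:", clf.predict(X_c[:1]))      # a class label, e.g. [0]

# Regression: the target is a continuous value (a disease-progression score).
X_r, y_r = load_diabetes(return_X_y=True)
reg = DecisionTreeRegressor(random_state=0).fit(X_r, y_r)
print("predicted value:", reg.predict(X_r[:1]))      # a real number
```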

Random forest regression, Isotonic regression, Survival regression, and Gradient-boosted regression are a few of the supervised learning algorithms that I studied. Let me give you a brief introduction to these algorithms.

Random Forest
In simple terms, Random Forest builds multiple decision trees and merges them together to get a more stable and accurate prediction. The forest is made up of numerous decision trees, and the outcome of each tree is considered when predicting the final result. The decision making is analogous to a voting system: each tree ‘votes’ for its predicted outcome, and the outcome receiving the highest number of votes is chosen as the collective output of the model.

Random Forest with two decision trees

One major advantage of the random forest algorithm is that it can be used for classification as well as regression problems, although it is mainly used for classification. It is less prone to over-fitting than a single decision tree and can handle missing values.
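As a small illustration of the voting idea, the sketch below trains a random forest classifier on a scikit-learn toy dataset and peeks at the votes of a few individual trees. The library and dataset are my own choices and are not part of the original work.

```python
# Random Forest: many decision trees trained on bootstrap samples,
# combined by majority vote (classification) or averaging (regression).
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 100 trees; each tree "votes" for a class and the majority wins.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)
print("test accuracy:", forest.score(X_test, y_test))

# The votes of the first three individual trees for one test sample
# (each tree predicts a class index into forest.classes_).
sample = X_test[:1]
print([int(tree.predict(sample)[0]) for tree in forest.estimators_[:3]])
print("forest's final prediction:", forest.predict(sample))
```

Swapping in RandomForestRegressor gives the regression variant, where the trees’ outputs are averaged instead of voted on.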

Gradient Boosting Algorithm
Gradient Boosting is an ensemble learning technique used for classification and regression problems. It builds a prediction model as a combination of many weak prediction models, typically decision trees, added one at a time so that each new model corrects the errors of the current ensemble. For regression, the algorithm aims to minimize the sum of squared errors of the combined model. Combining many weak predictors in this way yields a stable and accurate prediction.

Gradient Boosting
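Here is a minimal sketch of gradient-boosted regression, again using scikit-learn purely for illustration (the original work does not specify a library). Each new shallow tree is fit to the errors of the ensemble built so far, so the test error can be watched shrinking as trees are added.

```python
# Gradient boosting for regression: shallow trees added sequentially,
# each one fit to the residual errors of the current ensemble.
from sklearn.datasets import load_diabetes
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 200 depth-3 trees, each contributing a small (learning_rate-scaled) correction.
gbr = GradientBoostingRegressor(n_estimators=200, max_depth=3,
                                learning_rate=0.05, random_state=0)
gbr.fit(X_train, y_train)

# staged_predict yields the ensemble's predictions after each added tree,
# so we can see the squared error decreasing as the model is boosted.
errors = [mean_squared_error(y_test, pred) for pred in gbr.staged_predict(X_test)]
print("MSE after 1, 50 and 200 trees:", errors[0], errors[49], errors[-1])
```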

Isotonic Regression
Isotonic Regression involves fitting data points with the best monotonic (non-decreasing) function that minimizes the sum of squared errors.
In isotonic regression, we are given observed responses a1, a2, …, an at points x1, x2, …, xn, each with a positive weight wi. We look for fitted values y1, y2, …, yn that are non-decreasing, i.e. y1 ≤ y2 ≤ … ≤ yn, and that minimize
f = Σ wi (yi − ai)²
subject to that ordering constraint. The Pool Adjacent Violators Algorithm is used to find this fit. Isotonic regression is mainly used for calibrating predicted probabilities. The following graph shows data fitted using Isotonic Regression.
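The same kind of fit can be reproduced in a few lines with scikit-learn’s IsotonicRegression, which uses the Pool Adjacent Violators Algorithm internally; the synthetic data below is made up purely for illustration.

```python
# Isotonic regression: fit the best non-decreasing function to noisy responses.
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.RandomState(0)
x = np.arange(50, dtype=float)
a = np.log1p(x) + rng.normal(scale=0.3, size=50)   # noisy, roughly increasing responses

iso = IsotonicRegression(increasing=True)          # weights default to 1 (sample_weight is supported)
y_fit = iso.fit_transform(x, a)

# The fitted values respect the ordering constraint y1 <= y2 <= ... <= yn.
assert np.all(np.diff(y_fit) >= 0)
print(y_fit[:5])
```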

Project 2: Studying HDFS architecture and automating the process of adding a datanode to a multi-node cluster setup

With the rise in popularity of social media websites like Facebook, WhatsApp, Twitter, Instagram, YouTube, etc., data is being generated at a tremendous rate. Statistics show that approximately 500+ terabytes of data are generated every day on Facebook in the form of photos, videos, messages, etc. The New York Stock Exchange (NYSE), for example, generates about 1 terabyte of new trade data per day. All of this data can be labeled Big Data, not only because of its sheer size but also because of the variety of formats it comes in.

Problems with a traditional RDBMS
The major problem with a Relational Database Management System (RDBMS) is that it relies heavily on structured data such as banking records, transaction details, employee details and so on. It cannot deal with the heterogeneity of Big Data, since data needs to be in a particular format before it can be stored.

How does HDFS solve the problem?
HDFS (Hadoop Distributed File System) is the primary data storage system used by Hadoop applications. It employs a distributed architecture consisting of namenodes and datanodes to implement an efficient file system that provides high-performance access to data across highly scalable clusters. HDFS stores data in a distributed way, in contiguous segments called blocks spread across different datanodes (I will explain what a datanode is shortly). HDFS can also store a variety of data, from structured to unstructured, since there is no ‘pre-dumping schema validation’ as in the case of an RDBMS. The short sketch below shows what this looks like in practice.
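Purely as an illustration of a file being split into blocks across datanodes, this sketch pushes a local file into HDFS and asks the namenode how it was stored. It assumes a configured Hadoop client on the PATH, and the file and directory names are hypothetical.

```python
# Put a file into HDFS and inspect its blocks and replica locations.
import subprocess

def hdfs(*args):
    """Run an 'hdfs' shell command and return its standard output."""
    return subprocess.run(["hdfs", *args], check=True,
                          capture_output=True, text=True).stdout

hdfs("dfs", "-mkdir", "-p", "/user/demo")                   # create a directory in HDFS
hdfs("dfs", "-put", "-f", "local_data.csv", "/user/demo/")  # upload a local file

# fsck reports each block of the file and the datanodes holding its replicas.
print(hdfs("fsck", "/user/demo/local_data.csv", "-files", "-blocks", "-locations"))
```

With that picture in mind, let us now take a look at the HDFS architecture.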

HDFS Architecture
The Apache Hadoop HDFS architecture follows a master-slave topology. The master is typically a high-end machine, whereas the slaves are inexpensive commodity computers. Big Data files are divided into a number of blocks, which are stored in a distributed fashion across the cluster of slave nodes. The two main components of HDFS are the Namenode and the Datanode.

Namenode vs Datanode

The following diagram shows the HDFS architecture:

The major goals of the Hadoop file system are to provide fault tolerance, distributed storage, reliability, and scalability.

Automation of adding a slave node
Couture AI uses the Apache Hadoop Distributed File System on its servers to store mammoth amounts of data and information. Setting up a multi-node HDFS cluster is a tedious task, and adding an additional node manually is equally tedious. Automating this process makes it faster and easier to add a new slave node to an existing cluster.
Shell scripts were written to achieve the objective of adding a new datanode to the cluster (a rough outline of the sequence of steps appears after the list below). Shell scripting was preferred over other languages like Python for the following reasons:

  • It can run a sequence of commands as a single command
  • It is easy to use and understand
  • It is portable, as it can be executed on any Unix-like OS without modifying the underlying code.
  • It can be used to efficiently convert a series of steps into a single program.
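The production automation was written as shell scripts that cannot be reproduced here, so the sketch below is only a rough Python outline of the kind of sequence such a script runs. The hostnames, install path and Hadoop 3.x commands are all assumptions, and password-less SSH between the machines is taken for granted.

```python
# Rough outline of adding a new datanode to an existing HDFS cluster.
import subprocess

NAMENODE = "namenode.example.internal"       # hypothetical hosts
NEW_NODE = "datanode2.example.internal"
HADOOP_CONF = "/opt/hadoop/etc/hadoop"       # hypothetical install path

def ssh(host, command):
    """Run a command on a remote host over SSH (password-less keys assumed)."""
    subprocess.run(["ssh", host, command], check=True)

# 1. Register the new node in the namenode's workers file ('slaves' on Hadoop 2.x).
ssh(NAMENODE, f"echo {NEW_NODE} >> {HADOOP_CONF}/workers")

# 2. Copy the cluster configuration files to the new node
#    (Hadoop itself is assumed to be installed there already).
ssh(NAMENODE, f"scp -r {HADOOP_CONF}/* {NEW_NODE}:{HADOOP_CONF}/")

# 3. Start the datanode daemon on the new node so it registers with the namenode.
ssh(NEW_NODE, "hdfs --daemon start datanode")

# 4. Confirm the new datanode appears in the cluster report.
ssh(NAMENODE, "hdfs dfsadmin -report")
```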

The following image shows the Web UI of HDFS before adding the new datanode. Please note that there is only one datanode in the cluster. The names of the servers have been blacked out for security reasons.

Initial Cluster with one datanode
A new datanode is added using the shell script program

The entire process takes about 2–3 minutes as compared to about 12–15 minutes when done manually.

Lastly, I would like to express my gratitude to the Couture AI team for guiding me and providing valuable insights that helped me complete my projects. I would also like to thank my university, Birla Institute of Technology and Science, Pilani, for giving me this opportunity.
