Training a Machine Learning Engineer

Barath Narayanan
8 min read · Aug 30, 2019


Data science and machine learning are among the most frequently uttered terms in today's technology world. Almost all computer science majors, and in some cases non-computer science majors, have the words 'machine learning' or 'data science' somewhere in their portfolio. However, not many are sure how to approach a particular problem. Given a vision-based classification or segmentation task, some people blindly resort to Convolutional Neural Networks (CNNs) without asking whether the goal can be achieved with a simpler approach; sometimes a simple edge detector is sufficient to accomplish the task. Not all problems are complicated. Likewise, if a network's training results aren't as expected, most people add more layers and hope for a miracle with their fingers crossed. Unfortunately, miracles don't work that way. With the advancement of high-level APIs, people design networks without the proper background or knowledge, hoping that they will solve the problem.

Photo by Franki Chamaki on Unsplash

Instead, one should learn to analyze the data, choose an algorithm that theory suggests will solve the problem, adopt approaches already applied in the same field, and only then implement one's own network/model. The choice of hyperparameters and number of layers depends entirely on the initial round of results on the validation dataset.

There is no clear outline on how to study machine learning/deep learning, so many individuals apply every algorithm they have heard of and hope that one of them works for the problem at hand. Instead, take a step-by-step approach. Below, I've listed some of the steps one should adopt while solving a machine learning problem.

Preprocessing

One of the most important steps in ML is preparing the data; it's like getting the ingredients ready for a recipe. Preprocessing methods differ across domains. For instance, in text-based classification, one would want to make sure all letters are in the same case, no out-of-vocabulary words are present, words are lemmatized, punctuation is removed, and so on. In vision-based classification tasks, resizing all images to the same size is a mandatory preprocessing step.
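As a concrete illustration, here is a minimal text-preprocessing sketch. NLTK and the toy vocabulary are my own assumptions (the article doesn't prescribe a library), and the WordNet data must be downloaded beforehand:

```python
import string
from nltk.stem import WordNetLemmatizer  # requires: nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()

def preprocess(text, vocabulary):
    # Lowercase so 'Cat' and 'cat' map to the same token
    text = text.lower()
    # Strip punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))
    # Lemmatize each word, then drop out-of-vocabulary tokens
    tokens = [lemmatizer.lemmatize(w) for w in text.split()]
    return [t for t in tokens if t in vocabulary]

print(preprocess("The cats were running!", {"the", "cat", "be", "running", "run"}))
```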

In addition, we should preprocess the data in ways that aid the classification/segmentation process itself. Below is a set of images from the NIH Malaria Dataset. This dataset contains images belonging to 2 different classes (Parasitized and Uninfected).

Images from NIH Malaria Dataset Identified as Parasitized
Images from NIH Malaria Dataset Identified as Uninfected

As one would observe, the main difference between parasitized and uninfected images is the presence of Plasmodium (red spots). However, one should also notice the range of colors present in these images. The main purpose of our classification task is to detect Plasmodium in order to distinguish the classes. We can assist the classifier by making the color consistent across the dataset through color constancy. The figure below shows sample images after applying color constancy: the top row presents sample images from the dataset, and the bottom row presents the same images after color constancy has been applied. Preprocessing steps of this type assist classification models, especially networks such as CNNs. Similar image processing methods can be applied depending on the application.

Preprocessing using Color Constancy
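The article doesn't specify which color constancy algorithm was used; the gray-world algorithm is one common choice, and a minimal NumPy sketch of it looks like this:

```python
import numpy as np

def gray_world(image):
    """Gray-world color constancy: scale each channel so its mean
    matches the mean intensity of the whole image.

    image: float array of shape (H, W, 3) with values in [0, 1].
    """
    channel_means = image.reshape(-1, 3).mean(axis=0)   # per-channel mean
    gray_mean = channel_means.mean()                    # global mean
    gains = gray_mean / (channel_means + 1e-8)          # per-channel gain
    return np.clip(image * gains, 0.0, 1.0)

# Usage on a random "image"
img = np.random.rand(64, 64, 3)
balanced = gray_world(img)
```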

Also, in some cases, image segmentation can be applied to assist the classification algorithm. For instance, for lung nodule detection in X-ray or CT scans, one would want the classification algorithm to focus solely on the lungs, thereby reducing its region of interest. Below is a simple example of lung segmentation.

Lung Segmentation on Chest Radiographs
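The article doesn't describe the method behind the figure; as a rough illustration only, a classical thresholding-plus-morphology pipeline (a big simplification of real lung segmentation) could be sketched with scikit-image and SciPy:

```python
from scipy.ndimage import binary_fill_holes
from skimage.filters import threshold_otsu
from skimage.morphology import remove_small_objects

def rough_lung_mask(xray, min_region_size=500):
    """Very rough lung mask: Otsu threshold (lungs appear dark on a
    chest radiograph), fill holes, and drop small spurious regions."""
    t = threshold_otsu(xray)
    mask = xray < t                          # keep the darker (lung) pixels
    mask = binary_fill_holes(mask)           # close holes inside the lungs
    mask = remove_small_objects(mask, min_size=min_region_size)
    return mask

# The classifier would then only see the masked region:
# roi = xray * rough_lung_mask(xray)
```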

Datasets

Training Set: A set of data points/images/text utilized by an ML engineer to train their model.

Validation Set: A set of data points/images/text utilized to estimate performance and tune parameters and hyperparameters for better performance (similar to a mock exam).

Test Set: A set of data points/images/text utilized solely to estimate final performance. Note: this should be treated like an exam; one can't take a sneak peek.

Before designing an architecture, one of the vows an ML engineer should take is never to look at results on the test dataset and then modify parameters. One highly recommended step, however, is to partition a part of the training dataset for validation. Typically, we use 10% of the training data for validation; if the dataset is small, we can perform k-fold cross-validation to study the performance. Andrew Ng refers to the validation set as the development set.

If a separate database/dataset isn't available for testing, one could split a given database into 70% for training, 10% for validation, and 20% for testing, provided there are sufficient data points to study the performance.
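With scikit-learn, that 70/10/20 split can be sketched with two calls to train_test_split; the toy data and proportions below are just illustrations of the numbers suggested above:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, train_test_split

X = np.random.rand(1000, 5)          # toy features
y = np.random.randint(0, 2, 1000)    # toy binary labels

# First carve out the 20% test set, then take 10% of the whole
# dataset (12.5% of the remaining 80%) for validation.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.125, stratify=y_trainval,
    random_state=42)

# For small datasets, k-fold cross-validation is preferable:
for train_idx, val_idx in StratifiedKFold(n_splits=5).split(X_trainval, y_trainval):
    pass  # train on X_trainval[train_idx], validate on X_trainval[val_idx]
```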

Architecture Design

Before designing an architecture, as an ML engineer, observe the major differences between classes (provided it is a classification problem). For instance, if a classification tool is built to distinguish shapes (say, circles and triangles), we don't need a deep CNN; a simple classifier based on geometric features should work, as sketched below. Choose the network accordingly! Notice the differences between the classes and then design the feature space or neural network architecture accordingly.
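For the circle-vs-triangle example, a couple of hand-crafted contour features are enough. Here is a hypothetical sketch using OpenCV (binary input images and the 0.8 circularity cutoff are my own assumptions):

```python
import math
import cv2

def classify_shape(binary_image):
    """Classify the largest contour as 'circle' or 'triangle'
    using simple geometric features instead of a CNN."""
    # OpenCV >= 4 returns (contours, hierarchy)
    contours, _ = cv2.findContours(
        binary_image, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    cnt = max(contours, key=cv2.contourArea)

    # A triangle reduces to ~3 vertices under polygon approximation
    perimeter = cv2.arcLength(cnt, True)
    approx = cv2.approxPolyDP(cnt, 0.04 * perimeter, True)
    if len(approx) == 3:
        return "triangle"

    # Circularity = 4*pi*area / perimeter^2 is ~1 for a circle
    area = cv2.contourArea(cnt)
    circularity = 4 * math.pi * area / (perimeter ** 2)
    return "circle" if circularity > 0.8 else "unknown"
```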

One of the most important aspects before designing an automated tool is to meet the requirements of the end user.

Photo by LinkedIn Sales Navigator on Unsplash

For instance, if a radiologist identifies lung cancer based on a certain set of features (geometric, texture, intensity, etc.), we could design the network around those same features, since the main motive behind the ML model is to assist them in their decision-making process. However, if the tool is required to make fully automated decisions, we can design our network/feature space accordingly.

Another important aspect to consider is the computational capability of the hardware that will be used during deployment. Not all models can run in real time, due to issues that include, but are not limited to, computational capability and memory consumption. Ideally, one would want the model to be as computationally efficient as possible. In some cases, the deployment hardware should also have the capability to re-train the model when a new set of training data becomes available.

Photo by chuttersnap on Unsplash

There is no replacement for a literature search! Before designing any architecture, look at some of the models developed for the same or a similar application. A transfer learning approach built on an architecture from a similar application might work for the problem at hand. If any of the authors have open-sourced their model, take advantage of that: understand their code, model design, and equations, and then design your own. If it's for the same application, you can use their results as the benchmark for your research work.

Transfer learning not only applies to deep learning, it applies to us too!
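As an illustration of transfer learning for images, here is a minimal Keras sketch; ResNet50 and the layer sizes are my own arbitrary choices, not the article's:

```python
import tensorflow as tf

# Reuse ImageNet features; train only a small task-specific head.
base = tf.keras.applications.ResNet50(
    weights="imagenet", include_top=False, input_shape=(224, 224, 3))
base.trainable = False  # freeze the pretrained weights

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(2, activation="softmax"),  # e.g. parasitized vs. uninfected
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```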

Once a clear understanding of the problem is established, design the architecture based on the theory you've learned. I would highly recommend Andrew Ng's Machine Learning and Deep Learning specializations on Coursera for a good head start on the theory as well as its implementation.

Once the model is trained, check the results on the validation dataset to estimate performance. Determine whether the model suffers from under-fitting or over-fitting; brief descriptions of both are provided below. Fine-tune the parameters/layers accordingly. Also, check whether there is any pattern in the cases where the algorithm fails, and modify your algorithm accordingly. For instance, if your face detection algorithm doesn't work for faces captured in darker regions of an image, you could apply color-based image preprocessing that helps the model overcome the issue.

Under-fit

Your model/architecture suffers from under-fitting if both training and validation accuracy are relatively low. For instance, if training and validation accuracy are both about 75% for distinguishing cats and dogs on a database where the benchmark is about 99%, your algorithm is under-fitting. Some remedies, a couple of which are sketched below, are: (i) get more training data (possibly augmented), (ii) compute more features, (iii) compute polynomial features, (iv) reduce the value of the regularization parameter, (v) add more layers (in a deep learning architecture) after observing the results at each layer, and (vi) add preprocessing methods that help distinguish the classes.
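Remedies (iii) and (iv) map directly onto scikit-learn knobs. A small sketch under those assumptions; note that LogisticRegression's C is the inverse of the regularization strength, so raising C reduces regularization:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# (iii) polynomial features give a linear model more capacity;
# (iv) C=10.0 weakens regularization (C is 1/lambda in scikit-learn).
model = make_pipeline(
    PolynomialFeatures(degree=2),
    StandardScaler(),
    LogisticRegression(C=10.0, max_iter=1000),
)
# model.fit(X_train, y_train); model.score(X_val, y_val)
```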

Overfit

Your model/architecture suffers from over-fitting if training accuracy is high and validation accuracy is considerably lower. For instance, if training accuracy is about 99% and validation accuracy is about 80%, your algorithm is over-fitting. Some remedies, two of which are sketched below, are: (i) get more labeled training data, (ii) remove certain features, (iii) apply feature selection to choose a subset of features for classification, (iv) apply regularization or increase its strength, (v) remove certain layers (in a deep learning architecture) after observing the results at each layer, and (vi) apply dropout regularization.
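Remedies (iv) and (vi) look like this in a Keras layer stack; the layer sizes, penalty, and dropout rate are illustrative assumptions:

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(
        128, activation="relu",
        # (iv) L2 weight penalty discourages large weights
        kernel_regularizer=tf.keras.regularizers.l2(1e-4)),
    # (vi) dropout randomly zeroes 50% of activations during training
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(2, activation="softmax"),
])
```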

Test

Once you're done fine-tuning the parameters and hyperparameters based on the performance on the validation dataset, test your algorithm on data it has never seen. This provides the true performance of your algorithm. If the performance is good, you can deploy it for real-time use or publish it in a conference/journal.
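A quick way to summarize that final, one-time evaluation with scikit-learn (assuming a fitted scikit-learn model and the held-out split from earlier):

```python
from sklearn.metrics import classification_report, confusion_matrix

y_pred = model.predict(X_test)          # touch the test set exactly once
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
```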

Photo by Thomas Kolnowski on Unsplash

Test the algorithm under different conditions and on different databases, and make it as robust as possible. Also, get feedback from users and peer researchers for further improvement.

Photo by Charles 🇵🇭 on Unsplash
