DATA, DATA, DATA.

Machine Learning is NOT magic.

Lucrece (Jahyun) Shin
CodeX
4 min read · Sep 12, 2021


A person (97%) and a tennis racket (60%) 😂. Is this object recognition model recognizing the rainbow shirt’s stripes as tennis racket strings? This may be an example of a neural net’s texture bias. This paper is looking into amending this by introducing shape bias. (photo creds: me wearing Better Stay Together ♥️)

A Shift in Thinking

During the early stages of my experience with Machine Learning, my problem-solving philosophy was to focus on modelling. Since my entry into Machine Learning started with Deep Learning, I was drawn to complex neural network architectures from state-of-the-art research papers that showed exceptional results. Over time, however, as I got to work in real-world settings, both as an employee of an AI startup and as an ML master's student, I've come to shift my attention to a different component of the Machine Learning pipeline: DATA.

What humans “learn” comes from their life experiences. We are largely shaped by the situations we were exposed to, the things we saw, the people we met, and the conversations we had. The same applies to a machine learning model: the type and quality of its input features and labels largely determine its performance. Here are some of the important questions I ask myself while optimizing model performance (a small sketch of the first two follows the list):

  • How is the data distributed?
  • How can I augment the data so that the model becomes more robust?
  • Are some features not informative and only contributing to noise?
  • How can I extract the best features?
  • How many features, taken together, are most informative?
  • How can I formulate the labels such that the model can learn most efficiently?
  • What type of loss should I use for optimization with the given features and labels?
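
To make the first two questions concrete, here is a minimal sketch of checking the class distribution and adding augmentation for an image dataset. It assumes a PyTorch ImageFolder-style layout; the data/train path and the specific transforms are illustrative choices, not a prescription.

```python
from collections import Counter

from torchvision import datasets, transforms

# How is the data distributed? Count the samples per class.
dataset = datasets.ImageFolder("data/train")  # hypothetical path
class_counts = Counter(dataset.classes[label] for _, label in dataset.samples)
print(class_counts)  # quickly reveals under-represented classes

# How can I augment the data so the model becomes more robust?
# Random crops, flips, and color jitter are common starting points.
train_transforms = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.ToTensor(),
])
augmented = datasets.ImageFolder("data/train", transform=train_transforms)
```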

Iterative Process

It’s also important to ask such questions iteratively. In a machine learning pipeline, we formulate features and labels, put them through a model, optimize the weights, and inspect the model’s performance. While observing the results, we can start asking questions. What is the model most confused about? Would reducing the number of input features reduce the noise in the data? Is some characteristic of the input data that I didn’t notice before degrading the model’s performance?
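
One concrete way to answer the first question is a confusion matrix, which shows exactly which class pairs the model mixes up. Here’s a minimal sketch with scikit-learn; the y_true and y_pred arrays are dummy stand-ins for labels and predictions collected on a validation set.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Stand-ins for labels and predictions from a validation pass.
y_true = np.array([0, 0, 1, 1, 2, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0, 2])

cm = confusion_matrix(y_true, y_pred)
np.fill_diagonal(cm, 0)  # zero out the correct predictions
i, j = np.unravel_index(np.argmax(cm), cm.shape)
print(f"Most confused: true class {i} predicted as class {j}")
```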

Here’s a fun example of how turning to data saved my life. During my image recognition project using CNNs, I saw poor classification accuracy for one particular class: knife. Before I realized the importance of data, I would have simply assumed that the model capacity was too small and perhaps made the model twice as big. But this time, I instead checked the distribution of the input images’ feature encodings (the output of the final convolutional layer) using t-SNE plots (a small sketch of this check follows below). The plots showed that the encodings of the knife class overlapped with those of another class, avocado 🥑, resulting in a sub-optimal decision boundary. When I looked at the training images of the avocado class, I indeed found some images containing a knife. I had overlooked the fact that the two classes could appear together in a kitchen setting. To fix the problem, I adjusted my data in the following ways:

  • Cropping out the knife part from avocado images
  • Collecting more images of various types of knives

Examples of “avocado” class images that also contained another class’s object, a “knife”
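
For anyone who wants to try the same check, here is a minimal sketch of the t-SNE inspection described above. The encodings and labels arrays are random stand-ins for the real final-conv-layer features and class indices.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
encodings = rng.normal(size=(200, 512))  # stand-in for (N, D) features
labels = rng.integers(0, 5, size=200)    # stand-in for class indices

# Project the high-dimensional encodings to 2D and color by class;
# overlapping clusters flag confusable classes (like knife vs. avocado).
points = TSNE(n_components=2, perplexity=30).fit_transform(encodings)
plt.scatter(points[:, 0], points[:, 1], c=labels, cmap="tab10", s=10)
plt.title("t-SNE of feature encodings")
plt.show()
```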

Machine Learning Model as a Function

The more I focused on the data rather than the model, the more transparent the “black box” of a Machine Learning model became. A machine learning model is a “function”. It is not so different from the most basic function form, y = f(x), from grade 11 math class. It’s just that x and/or y might be high-dimensional, possibly leading to a much more complex relationship between them. I also cannot solve for the weights (slopes, if you’d prefer) on a sheet of paper the way I did with a simple y = ax + b function; there might be hundreds of weights, if not millions. So I let the computer solve for the optimal weights. But those weights are just the numerical output of what I told the computer to do. It is trying to map the data (x) I gave it to a particular form of labels (y), which I also gave it. I’m responsible for setting up the environment and the props. The computer is only the computing hardware.
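
To make the “function” framing concrete, here is a minimal sketch of letting the computer solve for the weights of y = ax + b by gradient descent on made-up data (true slope 3, intercept 1, plus noise):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=100)
y = 3.0 * x + 1.0 + rng.normal(scale=0.1, size=100)

a, b, lr = 0.0, 0.0, 0.1
for _ in range(500):
    y_hat = a * x + b                      # the model is just f(x) = ax + b
    grad_a = 2 * np.mean((y_hat - y) * x)  # d(MSE)/da
    grad_b = 2 * np.mean(y_hat - y)        # d(MSE)/db
    a -= lr * grad_a
    b -= lr * grad_b

print(a, b)  # lands close to the true slope 3 and intercept 1
```

The computer found a and b, but everything else, the data x, the targets y, and the loss, was set up by me. Scale the same idea to millions of weights and you have a neural network.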

Curiosity is the Key

Anyone can download an open-source dataset and train a model using an open-source library. But genuine curiosity, asking why the model is acting a certain way and analyzing the data to find the answers, is the key to understanding how a machine “learns”. I learned never to take a peculiar model outcome for granted or to lose track of how the model was behaving. Failed models are not true failures, but stepping stones to the final optimized model. My new machine learning philosophy, backed by years of personal trial and error, is to analyze and optimize data iteratively.

Thanks for reading! ♥️

- L ☾₊˚.
