Preparing your data for Machine Learning

Ala Shaabana
AI³ | Theory, Practice, Business
3 min read · Mar 12, 2018

As data scientists, the most important asset for our work is in part of our name: data. In the age of big data, there is no shortage of information on which to run our algorithms and analyses.

You can think of your algorithm as a car: the engine is your classifier and the fuel is your data. Engines require fuel to operate, and run better when the fuel is of higher quality. Similarly, your machine learning algorithm will only be as good as the data you feed it. It is therefore important to extract the information we need from the data properly and effectively before training a classifier on it.

To many — especially newcomers — creating a fully functioning machine learning system from start to finish (that is, from raw data to learning results) seems daunting. There are lots of informative discussions and articles on how to apply a specific algorithm, but there isn’t much information out there on how to prepare your data properly from one end to the other. I’m here to change that.

Who am I?

I’m Ala, an infrastructure and machine learning engineer working in Silicon Valley. Last year I completed my Ph.D. in Canada on wearable technologies and machine learning, which gave me a rather broad knowledge spectrum spanning everything from circuit design to machine learning. One of the most important skills I learned was how to create cheap, high-performing “end-to-end systems”: full-fledged systems from hardware specification, to firmware design, to kernel design, to the machine learning models that sit on top.

Entering into this Ph.D. I had no knowledge of machine learning or hardware design. In fact, I had only dabbled in soldering and circuits as a hobby for a few months prior to starting my Ph.D. It was one of the most difficult endeavors I’ve faced, but it was well worth it.

The workflow

When designing a framework or a machine learning algorithm, the end-to-end solution from raw data to prediction is referred to as the overall workflow, and I will be referring to it as such from now on. The workflow consists of three steps that will help you prepare your data for a learning algorithm. Each step depends on the one before it: if the first step is done improperly, the second step will suffer, and so on.

1. Marking

We mark the data so as to capture when it begins and when it ends before using it as training data. This is especially important if we are collecting the data ourselves, as opposed to working from existing datasets.
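For example, when recording sensor data ourselves, a simple amplitude threshold can mark where each activity segment begins and ends. The sketch below assumes NumPy; the function name `mark_segments` and the threshold value are illustrative choices, not prescriptions from this article:

```python
import numpy as np

def mark_segments(x, threshold):
    """Return (start, end) index pairs where |x| stays above the threshold."""
    active = np.abs(x) > threshold
    # +1/-1 edges in the boolean mask mark segment starts and ends.
    edges = np.diff(active.astype(int))
    starts = np.where(edges == 1)[0] + 1
    ends = np.where(edges == -1)[0] + 1
    # Handle segments that touch the very beginning or end of the recording.
    if active[0]:
        starts = np.insert(starts, 0, 0)
    if active[-1]:
        ends = np.append(ends, len(x))
    return [(int(s), int(e)) for s, e in zip(starts, ends)]

# Example: a recording that is quiet except for one burst of activity.
signal = np.zeros(100)
signal[20:50] = 1.0
segments = mark_segments(signal, threshold=0.5)  # → [(20, 50)]
```

Real recordings are noisier than this, so in practice you would smooth the mask or require a minimum segment length, but the idea of marking start and end indices is the same.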

2. Filtering

One of the best ways to clean up noisy data is through the use of filtering. There are dozens of filters that you can use, but the most important of them all are also the most basic: the high pass and low pass filters.
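As a sketch of the idea, here is a crude low-pass filter built from a moving average, with the high-pass obtained as the residual. This uses plain NumPy rather than a formal filter design, and the window size is an illustrative choice:

```python
import numpy as np

def low_pass(x, window=9):
    """Smooth a signal with a moving average (a crude low-pass filter)."""
    kernel = np.ones(window) / window
    return np.convolve(x, kernel, mode="same")

def high_pass(x, window=9):
    """Keep only the fast-changing part: the signal minus its smoothed version."""
    return x - low_pass(x, window)

# Example: a slow 1 Hz component buried under fast 25 Hz noise.
fs = 100                          # sampling rate in Hz
t = np.arange(0, 2, 1 / fs)
clean = np.sin(2 * np.pi * 1.0 * t)
noisy = clean + 0.5 * np.sin(2 * np.pi * 25.0 * t)
recovered = low_pass(noisy)       # close to `clean` away from the edges
```

For production work you would typically reach for a proper filter design (e.g. a Butterworth filter via `scipy.signal`), but a moving average makes the low-pass/high-pass split easy to see.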

3. Feature Extraction

One of the most important — albeit time-consuming — pieces of the workflow is extracting the proper features that are relevant to what you are trying to classify. This is very application-specific; however, there are a few guidelines and standardized features that can be readily extracted for a number of applications. Additionally, if you are adventurous, you can create your own features and extract them.
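As an illustration, a common starting point for time-series data is computing simple statistics over sliding windows. This is a sketch; the window size, step, and feature set are illustrative choices, not from the article:

```python
import numpy as np

def extract_features(window):
    """Compute a few standard time-domain features for one window of samples."""
    sign_changes = np.diff(np.sign(window))
    return {
        "mean": float(np.mean(window)),
        "std": float(np.std(window)),
        "energy": float(np.mean(window ** 2)),
        "zero_crossings": int(np.count_nonzero(sign_changes)),
    }

def sliding_windows(x, size, step):
    """Split a 1-D signal into overlapping windows."""
    return [x[i:i + size] for i in range(0, len(x) - size + 1, step)]

# Example: feature vectors for 0.5 s windows of a 100 Hz signal.
fs = 100
t = np.arange(0, 2, 1 / fs)
signal = np.sin(2 * np.pi * 2.0 * t)
features = [extract_features(w) for w in sliding_windows(signal, size=50, step=25)]
```

Each dictionary in `features` becomes one training example for the classifier; the right features to include depend entirely on what you are trying to classify.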

Thank you for reading. If you enjoyed it, hit that clap button below as many times as possible! I will be creating articles for each of the steps, as well as detailing some more “exotic” machine learning algorithms such as Hidden Markov Models and Transfer Learning!

Let’s also connect on Instagram, LinkedIn, or E-mail.
