Marking & Preparing Data for Machine Learning

Ala Shaabana
AI³ | Theory, Practice, Business
5 min read · Mar 24, 2018

If a Machine Learning algorithm is considered the “engine” behind your application, then surely its training data is the “fuel”. An engine only functions well when it is using proper fuel that is:

  1. Compatible with this engine (you cannot fill up your car with jet fuel!).
  2. Clean enough such that your engine will function more efficiently (there’s a reason that premium fuel costs more than regular fuel).

In this article, we will discuss how to properly mark and set up incoming data from the source before turning it into training data. As data scientists, we may be used to data being readily available for our consumption, but properly preparing this data is just as vital as the algorithm trained on it. We will begin with data marking and segmentation, and then proceed to data preparation. As a running example, we will use the problem of classifying muscle movements based on their electromyographic (EMG) signals.

Problem: We wish to build a classifier that distinguishes the muscle movements of users with carpal tunnel syndrome from those of healthy users. We will train it on data collected from users typing on a keyboard. To extract features, we must record the muscle activity in a user’s forearm as they press the keys, which means we must mark the data with the times at which each key is pressed and released. The workflow would look like this:

EMG signal workflow from raw signals to classification.

The above figure highlights a typical workflow for training a classifier on an EMG signal, the electrical signal generated by muscle movements. In this article, we will primarily work inside the “Data Segmentation” box. The end product should look like this:

Marked EMG data coming from 8 channels (electrodes) of an EMG machine. In this case, the user is pressing the key over and over again. Each marked interval begins with a key press (the first marker) and ends with the key release (the second marker).

Data Marking and Segmentation

Throughout this article, we will refer to the phenomenon we are training a classifier to identify as an “event”. In our problem, the event is a key press and release. When working with temporal data, marking the start and end times of an event is essential for extracting the relevant information and turning it into training data. For example, in speech recognition it may be important to mark the beginning and end of words, sentences, or dialogues in the data in order to extract features and use them as training data. Another example is sign recognition using EMG signals: the start and end time of each gesture must be recorded so that the action potential information produced during the gesture can be extracted.
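
As a minimal sketch of what segmentation looks like once the data is marked, the snippet below assumes the marker is stored as a 0/1 track sampled alongside the signal (1 while the key is held down). That storage format, the sample values, and the function names `pair_markers` and `extract_segments` are illustrative assumptions, not the format used by the actual device.

```python
from typing import List, Sequence, Tuple


def pair_markers(markers: Sequence[int]) -> List[Tuple[int, int]]:
    """Turn a 0/1 marker track into (start, end) sample ranges.

    A 0 -> 1 transition is treated as the event start (key press) and the
    next 1 -> 0 transition as the event end (key release). The end index is
    exclusive. An event still open when the track ends is dropped.
    """
    events = []
    start = None
    previous = 0
    for i, m in enumerate(markers):
        if m and not previous:
            start = i                      # key press
        elif previous and not m and start is not None:
            events.append((start, i))      # key release
            start = None
        previous = m
    return events


def extract_segments(signal: Sequence[float],
                     markers: Sequence[int]) -> List[List[float]]:
    """Cut one channel into the pieces that lie between press and release."""
    return [list(signal[s:e]) for s, e in pair_markers(markers)]


# Hypothetical single-channel recording with two key presses.
signal = [0.1, 0.2, 1.5, 1.7, 1.6, 0.2, 0.1, 1.4, 1.8, 0.3]
markers = [0, 0, 1, 1, 1, 0, 0, 1, 1, 0]
segments = extract_segments(signal, markers)
```

Each entry of `segments` is then one candidate training example, ready for feature extraction. With several channels, the same `(start, end)` ranges would be applied to every channel.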

There are a number of ways to mark data; we will divide them into 3 categories:

  1. Manual: The event is recorded not only on a sensor but also externally, for example on a camera. It then becomes possible to comb through the video manually and find the start and end times of the events. When working with large amounts of data, this can become very cumbersome. Additionally, human error must be accounted for.
  2. Algorithmic: When working with signal data such as audio or bio-signals, some signal features may be good indicators of the start and end of an event. For example, signal energy can be computed and compared against a predefined threshold (usually determined by the hardware or the activity being measured). If the energy rises above the threshold, an event has likely begun and we should extract features here; once it drops below the threshold, the event has likely ended. While this approach removes the possibility of human error, it can produce false positives when a different event raises the signal energy. In our EMG example, muscle exhaustion releases lactic acid which raises the signal’s amplitude, and this in turn can create false positives. Finally, threshold-dependent approaches tend to be somewhat unreliable, since a threshold tuned for one user or session may not suit another.
  3. Hybrid: Combines the algorithmic approach with the manual one. This approach depends on utilizing hardware as well as software to create “triggers”. Triggers mark an event’s start and end time according to an external stimulus.
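
The algorithmic approach in item 2 can be sketched in a few lines. This is only an illustration under stated assumptions: energy is taken as the mean squared amplitude over non-overlapping windows, and the window length, threshold value, sample data, and function names are all made up for the example rather than taken from a real EMG setup.

```python
from typing import List, Optional, Sequence, Tuple


def short_time_energy(signal: Sequence[float], window: int) -> List[float]:
    """Mean squared amplitude over consecutive, non-overlapping windows."""
    return [
        sum(x * x for x in signal[i:i + window]) / window
        for i in range(0, len(signal) - window + 1, window)
    ]


def detect_events(signal: Sequence[float], window: int,
                  threshold: float) -> List[Tuple[int, int]]:
    """Return (start, end) sample ranges where the energy exceeds threshold.

    An event starts at the first window above the threshold and ends at the
    first later window at or below it. The end index is exclusive.
    """
    energies = short_time_energy(signal, window)
    events = []
    start: Optional[int] = None
    for k, energy in enumerate(energies):
        if energy > threshold and start is None:
            start = k * window
        elif energy <= threshold and start is not None:
            events.append((start, k * window))
            start = None
    if start is not None:                  # event still active at the end
        events.append((start, len(energies) * window))
    return events


# Hypothetical signal: quiet, a burst of activity, then quiet again.
signal = [0.1, -0.1, 0.1, -0.1, 1.0, -1.2, 1.1, -0.9, 0.1, -0.1, 0.0, 0.1]
events = detect_events(signal, window=2, threshold=0.25)
```

Anything that raises the energy, whether or not it is the event of interest, passes this test, which is the false-positive problem noted above.
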
XTREMIS: The DAS marked in the photo is a “Data Annotation System”, which flips specific bits according to incoming signals.

The figure above shows a design for an EMG data collection circuit that contains a Data Annotation System (DAS), which marks the signal as it comes in from the electrodes attached to the user’s muscles. The DAS is composed of 6 pins that are connected to the keyboard, and their bits are flipped according to the key being pressed. Since each pin carries one bit, n pins can encode 2ⁿ distinct codes: 6 pins give 2⁶ = 64, while 8 bits give 2⁸ = 256, which is more than enough combinations to represent the entire keyboard. The setup would be as follows:
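
To make the pin arithmetic above concrete, here is a small sketch of how a set of pin states can be read as a key code and back. The bit order (first pin as the least significant bit) and the function names are assumptions made for illustration; the article does not specify how the DAS assigns codes to keys.

```python
from typing import List, Sequence


def pins_to_code(pins: Sequence[int]) -> int:
    """Read pin states as a binary number, pins[0] being the lowest bit."""
    code = 0
    for position, bit in enumerate(pins):
        if bit not in (0, 1):
            raise ValueError("pin states must be 0 or 1")
        code |= bit << position
    return code


def code_to_pins(code: int, n_pins: int) -> List[int]:
    """Inverse of pins_to_code for a fixed number of pins."""
    if not 0 <= code < 2 ** n_pins:
        raise ValueError("code does not fit in the given number of pins")
    return [(code >> position) & 1 for position in range(n_pins)]


# Number of distinct codes available for a given pin count.
codes_with_6_pins = 2 ** 6   # 64
codes_with_8_pins = 2 ** 8   # 256

# Hypothetical example: pins 0 and 2 are high, the rest are low.
example_code = pins_to_code([1, 0, 1, 0, 0, 0, 0, 0])
```

A lookup table from code to key name would complete the mapping; its contents depend on how the keyboard is wired to the pins.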

Although the hybrid approach is the most complex to implement, it is the easiest to use. For example, if collecting data for sign language recognition, the DAS can also be wired to a single button, which is pressed when a gesture begins and pressed again when the gesture ends. The entire data collection system was implemented by me, a fellow graduate student, and my Ph.D. supervisor, and can be seen below with the various components highlighted:

Thank you for reading! I will go into more detail about the device, as well as filtering and feature extraction, in a future article.
