Sleeposcope

Identifying periods of sleep and wakefulness with a non-contact sensor

Is your child sleeping through the night?

Motivation
This is a consulting project for a medical devices company, henceforth referred to as company X, that is interested in identifying nighttime periods of sleep and wakefulness in patients. The signal used for this purpose is continuously recorded from a mattress-embedded, 1-D accelerometer. Please see Figure 1.

Figure 1: Shown above is a 24-hour slice of the signal with ground truth labels for asleep, awake, and out of bed periods. Ground truth labels were provided by an expert at company X.

Goal
The goal of this project is to use machine learning to automatically identify periods of time when a patient is asleep, awake in bed, or not in bed, based on accelerations seen by a sensor embedded in the patient’s mattress. Classification is currently performed visually by an expert, a tedious and time-consuming task.

The data
Expert-labeled data were provided to me to use for training and testing. I used the expert labels as ground truths. The data I’ve had access to thus far includes approximately 16 nights of sleep from each of two patients, henceforth referred to as patient #1 and patient #2. The accelerometer signal was sampled at 1 Hz. Figure 1 shows a representative 24 hour period from patient #1 with ground truth labels for each state.

Feature engineering
I subtracted the signal floor (approximately 200) and scaled the signal by the standard deviation of the whole series. I binned the signal into 60 s windows and created features for each of these windows. These features included 8 signal features from the present window as well as those same features calculated over two neighboring windows (past and future) of 120 s each. Signal features for each of the three windows (past, present, and future) were defined for the scaled signal as follows: mean, maximum, minimum, standard deviation, rate of mean-crossings (explained below), mean log, maximum log, and minimum log. I then performed classification on these 60 s windows. Below, I explain why.
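For the curious, here is a minimal sketch of the preprocessing and binning step, assuming the raw signal arrives as a 1 Hz NumPy array. The floor value and window length are the ones mentioned above; the function names are mine.

```python
import numpy as np

def preprocess(raw_signal, floor=200.0):
    # Subtract the approximate signal floor and scale by the standard
    # deviation of the whole series.
    x = np.asarray(raw_signal, dtype=float) - floor
    return x / x.std()

def bin_signal(x, window_s=60, fs_hz=1):
    # Split the 1 Hz signal into non-overlapping 60 s windows (one row each).
    n = window_s * fs_hz
    n_windows = len(x) // n
    return x[: n_windows * n].reshape(n_windows, n)
```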

The problem I am trying to solve is to extract features from the signal that can differentiate between the three states (awake, asleep, out of bed). It is important to note that accelerometer signals are typically noisy: the mattress will pick up any vibrations in the environment. Moreover, while movement or the lack thereof is an indicator of whether a person is asleep or awake, there is plenty of movement even when people are asleep. You can see this in Figure 1: large high-frequency spikes in the signal persist throughout the period labeled as asleep. Yet, if an expert can visually differentiate between states, it is reasonable to expect that a machine can do it too!

The first thing I wanted to know was this: Is the distribution of signal values alone sufficiently different between the three states to distinguish them from one another? The answer becomes apparent if we take a look at a histogram of signal strength for each state (Figure 2): no. Although the awake-in-bed state has a much longer tail (i.e., it includes more samples at higher signal strength), there is plenty of overlap between the distributions across the three states. The challenge is therefore to extract features from the signal that expose the differences.

Figure 2: Histogram of signal values shows overlap between distributions across the three “states”.

It just so happens I know a thing or two about time series signals. Scaling the signal by its standard deviation and subtracting the signal floor allow for generalizability across individuals. Binning the signal allows us to define various measures such as mean, standard deviation, minimum, maximum, and frequency content. I divided the signal into 60 s windows. 60 s constitutes a large enough window (60 points) to calculate some statistics on, and it is not particularly useful for our purposes to differentiate asleep, awake, and out-of-bed periods lasting shorter than 1 min. In a time series, whatever is happening in the current minute is not independent of the rest of the signal. The signal might look like “awake” right this minute, but if you were asleep for the past two minutes and you remain asleep for the next two, there is a good chance you are asleep right this minute. I didn’t strictly enforce this, but including signal features from the past and future windows in the feature set of the current window leverages the continuity of the signal.
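Here is roughly how a feature vector for one window gets assembled, with an abbreviated feature set to keep the sketch short; the full set uses all 8 features per window, and the variable names are placeholders.

```python
import numpy as np

def window_features(seg):
    # Abbreviated per-window statistics; the full set also includes the
    # mean-crossing rate and the log-based features described below.
    return np.array([seg.mean(), seg.max(), seg.min(), seg.std()])

def feature_vector(x, i, w=60):
    # Features for the i-th 60 s window of the scaled 1 Hz signal x, plus the
    # same features over the neighboring 120 s past and future windows.
    present = x[i * w:(i + 1) * w]
    past = x[max(0, i * w - 2 * w): i * w]
    future = x[(i + 1) * w:(i + 1) * w + 2 * w]
    past = past if past.size else present      # fall back at the edges
    future = future if future.size else present
    return np.concatenate([window_features(present),
                           window_features(past),
                           window_features(future)])
```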

I defined “rate of mean-crossings” as follows: for each window, I subtracted the window’s mean from the signal and counted the number of times this mean-subtracted signal crosses zero. I then divided this count by the length of the window. This is a crude measure of frequency. As it turns out, this variable has distinctly different distributions across the three states. Please see Figure 3.
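A minimal version of that calculation, assuming the window is a NumPy array:

```python
import numpy as np

def mean_crossing_rate(seg):
    # Crude frequency measure: count sign changes of the mean-subtracted
    # signal and divide by the window length.
    centered = seg - seg.mean()
    crossings = np.count_nonzero(np.diff(np.sign(centered)) != 0)
    return crossings / len(seg)
```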

Figure 3: Histogram of “rate of mean crossing”, a crude measure of frequency content, highlights differences between the three states.

Another way to extract useful features from the signal is to take its log. Taking the log shrinks the really large spikes and allows differences in the rest of the signal to be explored. For instance, Figure 4 shows the maximum log of the scaled signal over each 60 s window. Looks like a promising feature, right?
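A sketch of the max-log feature; the small clipping constant is my own guard against taking the log of zero or negative values after floor subtraction, not something prescribed by the pipeline.

```python
import numpy as np

def max_log(seg, eps=1e-6):
    # Maximum of the log of the scaled signal over one window. Clipping keeps
    # log() defined when the floor-subtracted signal touches or dips below zero.
    return np.log(np.clip(seg, eps, None)).max()
```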

Figure 4: Histogram of maximum log values over 60 s windows clearly shows different distributions across the three different states.

Training and testing data sets
I set aside 4 days and nights of data from each patient for testing. The remaining 12 nights from each patient, I used for training and cross-validation. The original training set consisted of 2349, 9196, and 16248 sixty-second samples of out-of-bed, asleep, and awake-in-bed conditions, respectively. To generate a balanced training set, I resampled the awake-in-bed segments of the data by shifting the binning window multiple times, a variable number of seconds at a time. On the other hand, I sub-sampled the out-of-bed data. In this way, I ended up with a balanced dataset of 24075 unique samples. With our 24 features, we are ready to classify!
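To give a flavor of the resampling, here is a rough sketch. Which classes get the shifted windows versus the subsampling, and the actual shift offsets and target counts, are as described above; the values hard-coded below are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

def shifted_windows(segment, shift_s, window_s=60):
    # Re-bin a contiguous stretch of signal starting shift_s seconds in,
    # producing a fresh set of 60 s windows that overlap the originals.
    x = segment[shift_s:]
    n = len(x) // window_s
    return x[: n * window_s].reshape(n, window_s)

def subsample(windows, n_keep):
    # Randomly keep n_keep windows from an over-represented class.
    idx = rng.choice(len(windows), size=n_keep, replace=False)
    return windows[idx]

# Hypothetical usage: oversample a class by re-binning its contiguous
# segments at several offsets, then subsample the abundant classes.
# extra = np.vstack([shifted_windows(seg, s)
#                    for seg in class_segments for s in (10, 25, 40)])
```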

Classification with a random forest
I used a random forest classifier to classify each 60 s window into one of the three states. Tree-based methods are popular in the activity classification literature and easy enough to implement in scikit-learn. So why not give it a try? It turned out, in this case, the random forest performed better than logistic regression (more on that later).
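In scikit-learn this really is only a few lines. The stand-in data below just shows the shapes involved (24 features per 60 s window, three states), not the real feature matrix.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Stand-in data with the right shape; the real X and y come from the
# feature-engineering and balancing steps above.
rng = np.random.default_rng(0)
X = rng.normal(size=(3000, 24))
y = rng.choice(["out_of_bed", "asleep", "awake_in_bed"], size=3000)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, y)
predicted_states = clf.predict(X[:5])   # one label per 60 s window
```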

Training, cross-validation, and testing accuracy
To choose the best model parameters, I performed a grid search on maximum tree depth, maximum number of features to consider at each split, minimum number of samples required to perform a split, and minimum number of samples per leaf. This grid search was performed with 5-fold cross-validation, and cross-validation accuracy was used to score the models. This was all just a couple of lines in scikit-learn. The top-scoring models from this grid search reached a mean accuracy of 95% on the test data (Figure 5A). Not bad, right?
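The grid search itself really is just a couple of lines. The parameter names are scikit-learn’s; the grid values shown here are placeholders rather than the exact ones I searched over.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    "max_depth": [5, 10, 20, None],        # maximum tree depth
    "max_features": ["sqrt", 0.5, None],   # features considered at each split
    "min_samples_split": [2, 5, 10],       # samples required to perform a split
    "min_samples_leaf": [1, 2, 5],         # samples required per leaf
}
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=5, scoring="accuracy", n_jobs=-1)
# search.fit(X_train, y_train)   # X_train / y_train: the balanced training set
# search.best_params_, search.best_score_
```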

Figure 5: (A) The model was trained on 12 nights from each patient. Ground truth and predicted labels are shown for a test night from patient #1. Classification accuracy on the test data was 95%. (B) The model was trained on data from patient #1 only, and tested on patient #2. Classification accuracy on the test data dropped to 55%.

While a mean accuracy of 95% is impressive, it is important to consider that this model was trained on many nights of data from each patient and tested on data from those same patients. To be useful, a model needs to be generalizable. How will my model perform on data from a new patient it has never seen? I can’t know that because I don’t have any more data to test on, but I did go back and train the model only on patient #1 and test it on patient #2. Classification accuracy dropped to 55%. Why? Let’s take a look at Figure 5. In panel B we can see where most of the classification error arises: the model misclassifies out-of-bed as asleep. This is likely because patient #2 has a much noisier and more variable signal than patient #1. In fact, when I train the random forest classifier on patient #2’s data only, I get a test accuracy of 71%. Much better, right? And finally, when I train the model on one night of data from each patient, I reach a mean classification accuracy of 87% on the remaining nights. These results are summarized in Table 1.

Table 1: Mean classification accuracy is shown for each combination of training and testing sets.

In case you are wondering, I did try logistic regression as well. It is simpler and more interpretable, so it makes sense to try it first before moving on to more complex models. On the same data as in the first test in Table 1, logistic regression achieved 83% mean accuracy. So in the end, the random forest won out.
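For reference, the logistic regression baseline looked roughly like this; the standard scaler is my own addition (logistic regression likes standardized inputs, whereas the random forest doesn’t care).

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

logreg = make_pipeline(StandardScaler(),
                       LogisticRegression(max_iter=1000))
# logreg.fit(X_train, y_train)
# logreg.score(X_test, y_test)   # ~83% mean accuracy in the comparison above
```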

What to do next?
The only way to improve generalizability here is to train the model on more patients. Two people is really too few for any kind of activity classification task.

On the other hand, while the 95% test accuracy sounds fantastic, you can see in Figure 5B that if your goal is to count the number of times patient #1 awoke during the night, you are going to be awfully wrong! This problem has a rather simple solution that is commonly used in the actigraphy literature: a sliding window with a bunch of heuristics (a minimal sketch appears at the end of this section). For instance, if an awake period shorter than 2 minutes is surrounded by asleep on both sides, we may reclassify those two minutes as asleep. As another example, if a sleep period is followed immediately by out of bed, you are obviously missing the awake-in-bed period in between.

A more “machine-learny” way to impose common sense is to use a Hidden Markov Model (HMM) with three hidden states (awake, asleep, and out-of-bed). HMMs assume stationary transition probabilities, which is clearly not true here: the probability of transitioning from sleep to awake and vice versa, for instance, is not constant but depends on the time of day or night. Still, the model may be sufficiently robust to the violation of this assumption. The transition and emission probabilities can be calculated simply from the expert labels. I had zero luck with the hmmlearn Python library! I have half a mind to go in and implement the Viterbi algorithm myself if I find some time. One of these days I may actually get around to it.
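Here is a minimal sketch of the first heuristic, relabeling short awake blips that are flanked by sleep on both sides; the label strings and the 2-minute threshold are illustrative.

```python
def smooth_labels(labels, min_run=2):
    # Relabel "awake" runs shorter than min_run minutes that are surrounded
    # by "asleep" on both sides. One of the sliding-window heuristics above.
    labels = list(labels)
    i = 0
    while i < len(labels):
        if labels[i] == "awake":
            j = i
            while j < len(labels) and labels[j] == "awake":
                j += 1
            run_len = j - i
            flanked = (i > 0 and j < len(labels)
                       and labels[i - 1] == "asleep" and labels[j] == "asleep")
            if run_len < min_run and flanked:
                labels[i:j] = ["asleep"] * run_len
            i = j
        else:
            i += 1
    return labels

# Example: a 1-minute "awake" blip between sleep gets absorbed.
print(smooth_labels(["asleep", "awake", "asleep", "asleep"]))
# -> ['asleep', 'asleep', 'asleep', 'asleep']
```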

So what?
Even without further heuristic-based smoothing of the tiny blips that you can see in Figure 5B, you should be able to tell that this model will do a really good job at estimating total sleep time per 24 hours, as well as sleep latency (how long it takes you to fall asleep once you get into bed), both really important parameters in relation to quality of sleep. Now the machines can do the labeling and the people at company X can focus on the important stuff!


Click here for demo slides.
