[Review] 1. U-Time: A Fully Convolutional Network for Time Series Segmentation Applied to Sleep Staging

jun94 · jun-devpBlog · May 8, 2020

1. Introduction

Deep learning networks are popular for the analysis of physiological time series. The most successful models combine convolutional and recurrent layers to extract useful features. However, the recurrent layers in such networks often require domain- (task-) specific adjustments, which can be difficult for non-experts.

The architecture proposed in the paper, U-Time, is a fully convolutional network based on the U-Net architecture and contains no recurrent layers. Therefore, it does not require such task-specific modifications.

According to the authors of the paper, 'sleep stages are characterized by specific brain and body activity patterns and sleep staging is a process of mapping transitions over a night of sleep'.

Sleep staging is important because sleep patterns, combined with other variables, can serve as a cue for diagnosing many sleep-related disorders.

Most of the time, the classification of sleep stages is done manually, which is difficult, time-consuming work that must be carried out by experts such as clinicians. Typically, a clinician inspects 8–24 hours of multi-channel signals, splits them into segments of fixed-length intervals (30 seconds in the paper), and classifies each segment.

The paper proposes a way to automate this process without recurrent layers, even though a recurrent layer is a conceptually appealing choice for analyzing time-series data. Instead of recurrent layers, the paper uses a purely feed-forward network, since many studies have found that recurrent layers can be replaced by feed-forward networks without loss of accuracy.

2. Structure

U-Time is a neural network with a fully convolutional encoder-decoder structure. It originates from the popular U-Net architecture for image segmentation and from temporal convolutional networks.

The following figure illustrates how U-Time maps a long input sequence to segmentations at a chosen temporal scale.

Figure 1. from [1], Example of how U-Time works

The first row of Figure 1 is an input sequence with T segments, C channels, and i sampled points per segment. In the example above, T = 4, with the segments separated by the red dotted lines. The encoder takes the raw input (physiological) signal and compresses it into a deep stack of feature maps. The decoder then reconstructs the input-signal domain as a dense, point-wise segmentation from the given feature stack. A segment classifier uses this dense segmentation to output the final predictions (sleep stages) at a chosen temporal resolution.
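To make the shapes concrete, here is a minimal NumPy sketch of this data flow. The sizes (T = 4 thirty-second segments of i = 3000 samples at an assumed 100 Hz, C = 1 channel, K = 5 sleep stages) are illustrative assumptions, and the random arrays merely stand in for the encoder-decoder outputs:

```python
import numpy as np

# Illustrative sizes (assumptions): T segments of i samples each, C channels, K classes
T, i, C, K = 4, 3000, 1, 5              # e.g. 30-s segments at 100 Hz, one channel

x = np.random.randn(T * i, C)           # raw input sequence covering T segments
pointwise = np.random.randn(T * i, K)   # stand-in for the dense, point-wise scores
segmentwise = pointwise.reshape(T, i, K).mean(axis=1)   # aggregate per segment

print(x.shape, pointwise.shape, segmentwise.shape)  # (12000, 1) (12000, 5) (4, 5)
```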

Figure 2. Structure overview of the U-Time architecture

(1) Encoder

The encoder has four convolution blocks. In each block, two consecutive convolution layers with kernels of size 5, dilated to a width of 9, preserve the dimensionality of the input through zero-padding. These convolution layers are followed by batch normalization and max-pooling.

Figure 3. Dilated convolution with dilation rate of 2

After max-pooling, the next block's two convolutions are applied to the down-sampled signal. With the help of this stack of dilated convolutions, the encoder has a large receptive field at its last convolution layer, and it is this large receptive field that substitutes for the role of a recurrent layer.
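As a rough illustration, here is a minimal PyTorch sketch of one such encoder block. The channel counts, the ELU activation, and the pooling factor are assumptions made for the example, not values taken from the paper:

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """One U-Time-style encoder block: two dilated 1-D convolutions
    (kernel size 5, dilation 2, effective width 9) with zero-padding
    that preserves the temporal length, each followed by batch
    normalization, then max-pooling to down-sample."""
    def __init__(self, in_ch, out_ch, pool=2):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(in_ch, out_ch, kernel_size=5, dilation=2, padding=4),
            nn.ELU(),
            nn.BatchNorm1d(out_ch),
            nn.Conv1d(out_ch, out_ch, kernel_size=5, dilation=2, padding=4),
            nn.ELU(),
            nn.BatchNorm1d(out_ch),
        )
        self.pool = nn.MaxPool1d(pool)

    def forward(self, x):        # x: (batch, channels, time)
        skip = self.conv(x)      # kept for the decoder's skip connection
        return self.pool(skip), skip
```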

(2) Decoder

The decoder of U-Time consists of four transposed-convolution blocks that up-sample the feature maps extracted by the encoder.

Figure 4. The matrix representation of transposed convolution

The filter sizes for the transposed-convolution layers are the same as the filters used in the encoder. The resulting feature maps of each transposed convolution are then concatenated with the corresponding feature maps computed by the encoder at the same scale (skip connections). Lastly, a point-wise (1×1) convolution with K filters produces K scores for each sample of the input sequence. This means that we eventually get a vector of length K for each sampled input point, and we treat these vectors as confidence scores indicating the most probable class for each point.
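Continuing the sketch from the encoder above, a matching decoder block might look as follows in PyTorch; again, the exact channel counts, activation, and up-sampling factor are assumptions for illustration:

```python
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    """One U-Time-style decoder block: transposed convolution for
    up-sampling, concatenation with the encoder feature map at the
    same scale (skip connection), then two plain convolutions."""
    def __init__(self, in_ch, out_ch, scale=2):
        super().__init__()
        self.up = nn.ConvTranspose1d(in_ch, out_ch, kernel_size=scale, stride=scale)
        self.conv = nn.Sequential(
            nn.Conv1d(out_ch * 2, out_ch, kernel_size=5, padding=2),
            nn.ELU(),
            nn.BatchNorm1d(out_ch),
            nn.Conv1d(out_ch, out_ch, kernel_size=5, padding=2),
            nn.ELU(),
            nn.BatchNorm1d(out_ch),
        )

    def forward(self, x, skip):          # skip: encoder features at the same scale
        x = self.up(x)                   # up-sample temporally
        x = torch.cat([x, skip], dim=1)  # skip connection via channel concatenation
        return self.conv(x)

# The final point-wise (1x1) convolution producing K class scores per sample:
K = 5                                    # number of sleep stages (assumed)
head = nn.Conv1d(16, K, kernel_size=1)   # 16 input channels is an assumption
```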

  • Note that transposed convolution is different from deconvolution; the sketch after Figure 5 illustrates why.
Figure 5. Deconvolution
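The following NumPy sketch makes the distinction concrete, using a hand-built convolution matrix C in the spirit of Figure 4; the kernel values and input are arbitrary:

```python
import numpy as np

# Convolution (no padding, stride 1) of a length-4 input with kernel [k0, k1],
# written as multiplication by a sparse matrix C, as in Figure 4.
k0, k1 = 2.0, 3.0
C = np.array([[k0, k1, 0,  0],
              [0,  k0, k1, 0],
              [0,  0,  k0, k1]])   # (3, 4): length-4 input -> length-3 output

x = np.array([1.0, 2.0, 3.0, 4.0])
y = C @ x                          # forward convolution, shape (3,)

# Transposed convolution multiplies by C.T, mapping length 3 back to length 4.
# It restores the original *shape* but not the original values, so it is not
# a true deconvolution (the inverse operation of convolution).
x_up = C.T @ y                     # shape (4,), generally != x
print(y)                           # [ 8. 13. 18.]
print(x_up)                        # [16. 50. 75. 54.]
```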

(3) Segment Classifier

The point-wise segmentation for each sampled point produced by the encoder-decoder is an intermediate representation. What the segment classifier does is map this intermediate representation to the final representation in label space. The dimensionality of the encoder-decoder output is [T, i, K], meaning that for each of the T fixed-length intervals we have i sampled points, and each point has confidence scores in a vector of length K. However, what we want to achieve is not a point-wise segmentation but an interval-wise segmentation, and fulfilling this aim is the job of the segment classifier.

To achieve this, the segment classifier first aggregates the sample-wise scores into predictions over a longer period of time, namely i time steps (the length of one interval). In other words, it performs a channel-wise mean pooling (average pooling) with width i and stride i. As a result, we get confidence scores at a lower temporal resolution, with dimensionality [T, K].
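In PyTorch terms, this aggregation step could be sketched as a single average-pooling call; the sizes are the same illustrative assumptions as before:

```python
import torch
import torch.nn as nn

T, i, K = 4, 3000, 5                    # segments, samples per segment, classes
pointwise = torch.randn(1, K, T * i)    # (batch, K, time): dense per-sample scores

# Mean pooling with width i and stride i collapses the i per-sample score
# vectors of each interval into a single K-vector per interval.
segment_scores = nn.AvgPool1d(kernel_size=i, stride=i)(pointwise)
print(segment_scores.shape)             # torch.Size([1, 5, 4]), i.e. [K, T] per item
```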

Figure 6. from [3], Illustration of channel-wise (global) average pooling

The figure below shows the whole structure of the U-Time architecture in more detail. One can clearly see the shape of U-Net in it.

Figure 7. Structure overview of U-Time architecture in detail
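Putting the pieces together, the following toy model reuses the EncoderBlock and DecoderBlock sketches from above (with only two levels instead of four and small, arbitrary channel counts) purely to verify that the tensor shapes work out end to end:

```python
import torch
import torch.nn as nn

class TinyUTime(nn.Module):
    """Toy two-level U-Time-style network built from the EncoderBlock and
    DecoderBlock sketches above, for an end-to-end shape check only."""
    def __init__(self, in_ch=1, K=5, i=3000):
        super().__init__()
        self.enc1, self.enc2 = EncoderBlock(in_ch, 16), EncoderBlock(16, 32)
        self.dec2, self.dec1 = DecoderBlock(32, 32), DecoderBlock(32, 16)
        self.head = nn.Conv1d(16, K, kernel_size=1)            # point-wise scores
        self.segment = nn.AvgPool1d(kernel_size=i, stride=i)   # segment classifier

    def forward(self, x):                 # x: (batch, C, T*i)
        x1, s1 = self.enc1(x)
        x2, s2 = self.enc2(x1)
        y = self.dec2(x2, s2)             # up-sample + skip connection
        y = self.dec1(y, s1)
        y = self.head(y)                  # (batch, K, T*i) point-wise scores
        return self.segment(y)            # (batch, K, T) segment-wise scores

x = torch.randn(1, 1, 4 * 3000)           # C=1 channel, T=4 segments of i=3000
print(TinyUTime()(x).shape)               # torch.Size([1, 5, 4])
```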

3. Results

Figure 8. Performance comparison table

4. References

[1] M. Perslev, M. H. Jensen, S. Darkner, P. J. Jennum, and C. Igel, "U-Time: A Fully Convolutional Network for Time Series Segmentation Applied to Sleep Staging," NeurIPS 2019.

[2] Dilated and Transposed convolution

[3] https://alexisbcook.github.io/2017/global-average-pooling-layers-for-object-localization/

Any corrections, suggestions, and comments are welcome.
