[Review] 2.1 Sequence-to-Segments Networks for Segment Detection


1. Introduction

Detecting segments of interest from an input sequence is a challenging problem which often requires not only good knowledge of individual target segments, but also contextual understanding of the entire input sequence and the relationships between target segments.

To deal with this problem, the authors of the paper proposed a new architecture for segment detection called the Sequence-to-Segments Network (S²N). It is based on an end-to-end sequential encoder-decoder structure. S²N first encodes the input features into a sequence of hidden states that capture the information contained in the input. It then applies a proposed decoding structure composed of a sequence of units named Segment Detection Units (SDUs), each of which combines the encoder's hidden states with the decoder state to detect segments of interest in the given input sequence (e.g. a video).

  • Definition of ‘interest’: the parts of the data with the highest semantic value. In a YouTube video, for example, these are the parts viewers most look forward to watching.
  • Input data: time-series data with annotated segments of interest (in this case, the annotations are labeled by humans).
  • Network output (target): the segments of a video in which people would be most interested.

2. Previous approach

A traditional approach to this problem is to train a classifier to distinguish the segments of interest from the others. Once the classifier is trained, each candidate segment of the time series is fed to it individually, and this is repeated over the whole time series. However, this approach ignores the fact that the segments of interest are related not only to the local context (e.g. an individual segment) but also to the global context (the whole time series).
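As a minimal sketch of this baseline (plain Python; the `classifier` callable and `threshold` are hypothetical and not from the paper), each candidate window is scored independently, so only local information is ever seen:

```python
def detect_segments_sliding_window(sequence, classifier, window_size, threshold=0.5):
    """Score every candidate window independently; no global context is used.
    `classifier` is any hypothetical scoring function mapping a window to [0, 1]."""
    detections = []
    for start in range(len(sequence) - window_size + 1):
        window = sequence[start:start + window_size]   # one local candidate segment
        score = classifier(window)                     # probability of being "of interest"
        if score >= threshold:
            detections.append((start, start + window_size, score))
    return detections
```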

In short, to address this problem we need to consider not only local pieces of the given data but also its holistic gist.

3. Short recap to the related works

  • GRU

As we all know, the RNN architecture was designed to analyze time-series data. However, it suffers from the vanishing gradient problem when it has to capture long-term dependencies. The reason is that, during backpropagation, an RNN computes its gradient recursively (BPTT), and the gradient contains as many multiplicative terms as the sequence length. As one can easily guess, if each term is below 1 the total gradient converges to zero, which causes the vanishing gradient problem (please refer here for a more detailed derivation of BPTT).
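To make the argument concrete (standard BPTT notation, not taken from the paper), the gradient of the loss at step T with respect to an early hidden state is a product of per-step Jacobians, and if each factor has norm below 1 the product shrinks exponentially with the sequence length:

```latex
\frac{\partial \mathcal{L}_T}{\partial h_1}
  = \frac{\partial \mathcal{L}_T}{\partial h_T}
    \prod_{t=2}^{T} \frac{\partial h_t}{\partial h_{t-1}},
\qquad
\Big\lVert \tfrac{\partial h_t}{\partial h_{t-1}} \Big\rVert \le \gamma < 1
\;\Rightarrow\;
\Big\lVert \prod_{t=2}^{T} \tfrac{\partial h_t}{\partial h_{t-1}} \Big\rVert
  \le \gamma^{\,T-1} \longrightarrow 0 .
```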

There were various attempts to solve this, and the two most representative ones are (1) the LSTM and (2) the GRU. Both are designed in a similar way and show excellent performance in most cases.

Figure 1. from [3], LSTM vs GRU

The GRU addresses the vanishing gradient problem of the standard RNN by using an update gate z_{𝑡} and a reset gate r_{𝑡}, which are computed as described below. Both are vectors whose entries lie in the range [0, 1] thanks to the sigmoid activation, and they determine how much of the previous and current information should contribute to the output. For example, in the convention used here, if z_{𝑡} equals one the output contains no information from the current step and only carries information from previous steps, and the other way around when z_{𝑡} equals zero.

Figure 2. An illustration of GRU

For more detail about the GRU, you can visit here where the whole flow of GRU is well explained.
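As a minimal sketch of a single GRU step (plain NumPy; the parameter names W_z, U_z, etc. are my own and not taken from the paper), following the convention above where z_{𝑡} = 1 keeps only the previous hidden state:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, params):
    """One GRU step: x_t is the current input vector, h_prev the previous hidden state.
    Parameter names are illustrative only."""
    W_z, U_z, b_z = params["z"]   # update gate parameters
    W_r, U_r, b_r = params["r"]   # reset gate parameters
    W_h, U_h, b_h = params["h"]   # candidate-state parameters

    z_t = sigmoid(W_z @ x_t + U_z @ h_prev + b_z)             # update gate, entries in [0, 1]
    r_t = sigmoid(W_r @ x_t + U_r @ h_prev + b_r)             # reset gate, entries in [0, 1]
    h_cand = np.tanh(W_h @ x_t + U_h @ (r_t * h_prev) + b_h)  # candidate hidden state

    # z_t = 1 keeps only the previous state; z_t = 0 keeps only the candidate.
    return z_t * h_prev + (1.0 - z_t) * h_cand
```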

  • Bi-directional RNN
The architecture of the bi-directional RNN (from here) and the pseudocode of its forward and backward passes (from here); a short sketch of the idea is given below.
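As a minimal sketch (plain NumPy; `step_fwd` and `step_bwd` are hypothetical recurrent step functions such as the `gru_step` above, each with its own bound parameters), a bidirectional layer runs one pass forward, one pass backward, and concatenates the two hidden states at every time step:

```python
import numpy as np

def bidirectional_rnn(xs, step_fwd, step_bwd, h0_fwd, h0_bwd):
    """xs: list of input vectors x_1..x_T; step_fwd/step_bwd: one recurrent step
    function per direction (illustrative sketch, not from the paper)."""
    T = len(xs)

    # Forward pass over t = 1 .. T
    h, fwd_states = h0_fwd, []
    for t in range(T):
        h = step_fwd(xs[t], h)
        fwd_states.append(h)

    # Backward pass over t = T .. 1
    h, bwd_states = h0_bwd, [None] * T
    for t in reversed(range(T)):
        h = step_bwd(xs[t], h)
        bwd_states[t] = h

    # Each time step's output now carries both past and future context.
    return [np.concatenate([f, b]) for f, b in zip(fwd_states, bwd_states)]
```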
  • Encoder and Decoder

The encoder-decoder network, also called a sequence-to-sequence network, aims to map an input sequence to an output sequence, where the two lengths don't have to be equal. This architecture is widely used across various fields such as translation, feature extraction, and sentence generation.

Figure 3. from here, an illustration of Encoder-Decoder Network

The encoder network encodes the time series and produces an encoding state vector as well as a sequence of hidden states that progressively capture from local to holistic information about the time series. The encoder often consists of a stack of several recurrent units (RNN, LSTM, or GRU cells).

The decoder network also consists of several recurrent units. However, unlike the encoder, it produces not only its own hidden state but also the actual output at each time step 𝘵.
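As a minimal sketch (PyTorch; the layer sizes and toy tensors are purely illustrative, not from the paper), the decoder starts from the encoder's final state and emits one output per decoding step:

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.rnn = nn.GRU(input_size, hidden_size, batch_first=True)

    def forward(self, x):                  # x: (batch, T_in, input_size)
        hidden_states, last_state = self.rnn(x)
        return hidden_states, last_state   # all hidden states + final encoding vector

class Decoder(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super().__init__()
        self.rnn = nn.GRU(input_size, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, output_size)

    def forward(self, y, init_state):      # y: (batch, T_out, input_size)
        states, _ = self.rnn(y, init_state)
        return self.out(states)            # one output per decoding step

# Toy sizes for illustration only: the decoder is initialized with the encoder's last state.
enc, dec = Encoder(16, 32), Decoder(8, 32, 4)
x = torch.randn(2, 10, 16)                 # a toy input sequence
y = torch.randn(2, 5, 8)                   # a toy decoder input sequence
_, encoding = enc(x)
outputs = dec(y, encoding)                 # shape (2, 5, 4)
```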

  • Attention mechanism

With RNN variants such as the LSTM and GRU, capturing long-term dependencies suffers less from the vanishing gradient problem. However, as one can easily imagine, if the input sentence contains hundreds of words and we want to generate an output sentence based on it (as in translation or sentence generation), information can still be lost and keywords missed. This happens because the typical encoder-decoder is incapable of remembering long sentences: it often forgets the first part of the input, meaning that the influence of the words appearing at the beginning of the sentence decreases as it is propagated through an encoder with many units.

The attention mechanism, however, partially solves this problem by allowing the decoder to take a closer look at all hidden states of the encoder from every time step, rather than taking into account only the last hidden state of the encoder.

Since the decoder now considers the whole input sequence when producing each output, we no longer need to worry as much about forgetting.

Figure 4. from [5], example of an additive Attention

A detailed explanation of attention, with animations, can be found here [6].
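As a minimal sketch of additive (Bahdanau-style) attention in the spirit of Figure 4 (PyTorch; the layer names and sizes are my own, not from [1] or [5]), the decoder state is scored against every encoder hidden state and the resulting weights form a context vector:

```python
import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    """Illustrative additive attention; layer names/sizes are not from the paper."""
    def __init__(self, enc_size, dec_size, attn_size):
        super().__init__()
        self.W_enc = nn.Linear(enc_size, attn_size, bias=False)
        self.W_dec = nn.Linear(dec_size, attn_size, bias=False)
        self.v = nn.Linear(attn_size, 1, bias=False)

    def forward(self, enc_states, dec_state):
        # enc_states: (batch, T, enc_size), dec_state: (batch, dec_size)
        scores = self.v(torch.tanh(
            self.W_enc(enc_states) + self.W_dec(dec_state).unsqueeze(1)
        )).squeeze(-1)                           # (batch, T): one score per encoder step
        weights = torch.softmax(scores, dim=-1)  # attention weights sum to 1
        context = (weights.unsqueeze(-1) * enc_states).sum(dim=1)
        return context, weights                  # context: (batch, enc_size)
```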

4. S2N architecture

  • Model architecture overview

The proposed S2N network, according to the authors, is a novel encoder-decoder network with recurrent layers, specialized in analyzing a time series to detect the temporal segments of interest.

Figure 5. from [1], The architecture of S2N

For the encoding part of the proposed network, you can think of a typical encoder with recurrent layers. At each step it takes the time-series input and produces hidden states (encoding vectors) {e₁, e₂, …, e𝑛} that progressively capture the information from the previous inputs and states. One example of such time-series data could be a video from which you want to extract only the interesting (meaningful) footage.

What separates the S²N network from others is its decoding part. The authors introduced a key component named the Segment Detection Unit (SDU), which is built on the GRU structure but produces three outputs: (b), (d), and (c). (b) and (d) denote the starting and ending points of the predicted segment of interest, while (c) is a confidence score estimating how confident the SDU is about its predicted beginning point (b) and ending point (d).

Since each SDU outputs a tuple ((b), (d), (c)), the network makes as many predictions of interesting segments in total as there are SDUs in the decoder.
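As a rough sketch of this idea (PyTorch; the exact formulation, pointing mechanism, and losses are given in [1], so the layer names and the way (b, d, c) are read off here are my own simplification), each decoding step updates a GRU state and produces a start index, an end index, and a confidence score:

```python
import torch
import torch.nn as nn

class SegmentDetectionUnit(nn.Module):
    """Simplified, illustrative SDU: one decoding step -> (begin, end, confidence).
    The exact formulation is in [1]."""
    def __init__(self, enc_size, dec_size):
        super().__init__()
        self.cell = nn.GRUCell(enc_size, dec_size)
        self.begin_head = nn.Linear(dec_size + enc_size, 1)  # scores each encoder step as a start
        self.end_head = nn.Linear(dec_size + enc_size, 1)    # scores each encoder step as an end
        self.conf_head = nn.Linear(dec_size, 1)               # confidence of this prediction

    def forward(self, context, dec_state, enc_states):
        # context: (batch, enc_size), dec_state: (batch, dec_size),
        # enc_states: (batch, T, enc_size)
        dec_state = self.cell(context, dec_state)

        expanded = dec_state.unsqueeze(1).expand(-1, enc_states.size(1), -1)
        pair = torch.cat([expanded, enc_states], dim=-1)       # (batch, T, dec+enc)

        b = self.begin_head(pair).squeeze(-1).argmax(dim=-1)   # predicted start index
        d = self.end_head(pair).squeeze(-1).argmax(dim=-1)     # predicted end index
        c = torch.sigmoid(self.conf_head(dec_state)).squeeze(-1)  # confidence in [0, 1]
        return (b, d, c), dec_state
```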

So far, we have built the fundamental background needed to understand the general architecture of S²N. In the next chapter, we will take a closer look at its update rules and loss functions.

5. References

[1] https://papers.nips.cc/paper/7610-sequence-to-segment-networks-for-segment-detection.pdf

[2] Encoder and Decoder

[3] GRU vs LSTM

[4] Understanding GRU

[5] Attention

[6] Attention2

[7] https://arxiv.org/abs/1406.1078

Any corrections, suggestions, and comments are welcome
