[Review] 2.2 Sequence-to-Segments Networks for Segment Detection


1. Recap

This article builds on the previous one. If you are not familiar with the topic (e.g., RNNs) or haven't read the last article, I recommend taking a look here first.

Figure 1. The architecture of S²N, from [1]

2. GRU for state update

As we saw in the last article, the authors of this paper introduced quite a few changes to the standard encoder-decoder structure.

Figure 2. Update rule for the hidden state of the decoder in an SDU (left) vs. a general decoder (right), from [7]

One of the interesting parts is the way the decoder updates its hidden state h_{j}. As shown in Figure 2, most decoders update their hidden state based on the hidden state h_{j-1} and the output y_{j-1} of the previous cell. The decoder of S²N, however, takes a different approach. While it receives the previous hidden state as input like other decoders, a learned input vector z (a trainable parameter) is used instead of the previous cell's output y_{j-1}.

Figure 3. code of the implemented learnable input vector z from the author’s git repo
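The code image is not reproduced here, but the idea is easy to sketch. Below is a minimal PyTorch reconstruction (mine, not the code from [2]; the class name, dimension names, and initialization are assumptions): a GRU decoder cell that receives the same trainable vector z as input at every decoding step, instead of the previous output.

```python
import torch
import torch.nn as nn

class SDUDecoder(nn.Module):
    """Sketch of a GRU decoder fed a learned input vector z at every step.

    Illustrative reconstruction, not the code from [2]; names and
    initialization are assumptions.
    """
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.cell = nn.GRUCell(input_size, hidden_size)
        # The learnable input vector z: one trainable parameter shared by
        # all decoding steps, replacing the previous output y_{j-1}.
        self.z = nn.Parameter(torch.randn(1, input_size) * 0.01)

    def forward(self, h, num_steps):
        # h: initial hidden state of shape (batch, hidden_size)
        states = []
        for _ in range(num_steps):
            # The same learned z is the input at every time step j.
            h = self.cell(self.z.expand(h.size(0), -1), h)
            states.append(h)
        return torch.stack(states, dim=1)  # (batch, num_steps, hidden_size)
```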

3. Pointing modules for starting and ending points (b) and (d)

The authors built the boundary pointing modules on top of the Ptr-Net. Given the hidden state h_{j} of the GRU in the SDU at the current time step j, these two modules predict the boundary positions in a similar way to Ptr-Net. The detailed pointer mechanism to predict (b) in the j-th SDU is as follows.
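The equation image is omitted here; in the notation of the surrounding text, it has the following Ptr-Net-style form (my reconstruction from the description below, so details may differ slightly from [1]):

g(e_i, h_j) = vᵀ tanh(W₁ e_i + W₂ h_j),  Pr(b = i) = softmax_i( g(e_i, h_j) )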

W₁, W₂, and v are learnable parameters of the pointing modules. The mechanism selects as the starting point (b) the index i that shows the highest response to the pointer function g. Unlike Ptr-Net, where the input vector x is passed to g, the encoding state vector e is used in S²N, since e contains richer information than x. The same mechanism applies to the End Position Pointer.
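As a concrete sketch, a Ptr-Net-style pointer over the encoder states could be implemented as follows (an illustration under my own naming, not the authors' code):

```python
import torch
import torch.nn as nn

class BoundaryPointer(nn.Module):
    """Ptr-Net-style pointer: scores every encoder state e_i against the
    current SDU hidden state h_j and returns a distribution over positions.

    Illustrative sketch only; layer names and sizes are assumptions.
    """
    def __init__(self, enc_size, dec_size, attn_size):
        super().__init__()
        self.W1 = nn.Linear(enc_size, attn_size, bias=False)
        self.W2 = nn.Linear(dec_size, attn_size, bias=False)
        self.v = nn.Linear(attn_size, 1, bias=False)

    def forward(self, e, h):
        # e: encoder states (batch, T, enc_size); h: SDU state (batch, dec_size)
        # u_i = v^T tanh(W1 e_i + W2 h_j) for every position i
        u = self.v(torch.tanh(self.W1(e) + self.W2(h).unsqueeze(1))).squeeze(-1)
        return torch.softmax(u, dim=-1)  # Pr(b = i), shape (batch, T)
```

The predicted starting point is then simply the argmax of the returned distribution.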

  • Assignment Strategy

4. Loss function

The loss function of S²N has two terms: (1) a loss for localization and (2) a loss for confidence estimation.

Total Loss function
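The equation image is omitted here; schematically, the two terms are combined as

L = L_{loc} + λ·L_{conf}

with λ a balancing weight (the exact combination and weighting used in [1] may differ).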
  • (1) Loss for the localization: L_{loc}

Let's take a look at the first loss term, L_{loc}. The loss for predicting the starting point and the ending point is divided into two parts: L_{loc}^{b} and L_{loc}^{d}.

Definition of L_{loc}
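Spelled out, this is simply the sum of the two boundary terms for each SDU:

L_{loc} = L_{loc}^{b} + L_{loc}^{d}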

Since the two terms are defined in the same way and differ only in their argument, (b) vs. (d), we only need to look at the first term, L_{loc}^{b}.

Loss for the localization computed based on Earth Mover’s Distance(EMD)
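The equation image is omitted; from the description below (squared differences of the two cumulative distributions), the term has the following form, with T the length of the input sequence (a reconstruction, so the notation may differ slightly from [1]):

L_{loc}^{b} = Σ_{i=1}^{T} ( Σ_{k=1}^{i} Pr(b=k) − Σ_{k=1}^{i} p*(k) )²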

The basic idea is to bring the probability distribution over the boundary location returned by the pointing module, Pr(b=i), closer to the ground-truth boundary distribution p*(i). Training therefore proceeds so that the distance between these two distributions is reduced.

As in other neural network models, Pr(b=i) is computed by a softmax layer, and p*(i) is a boundary indicator vector (one-hot encoded: its i-th entry is 1 if i is the ground-truth boundary, and 0 otherwise).

To measure this distance we could use the standard choice, cross-entropy, but the authors instead used a metric called the Earth Mover's Distance (EMD), which can be computed from the differences between the two cumulative distributions of Pr(b=i) and p*(i).

The authors used the squared form of this loss because it usually converges faster than the L₁ loss and is easier to optimize with gradient descent.
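As a minimal PyTorch sketch of this EMD-style loss, assuming the pointer scores are turned into Pr(b=i) by a softmax and p* is one-hot (an illustration, not the authors' code):

```python
import torch
import torch.nn.functional as F

def emd_localization_loss(boundary_logits, gt_index):
    """Squared-EMD loss between the pointer distribution and a one-hot target.

    boundary_logits: (batch, T) raw pointer scores for each position i
    gt_index:        (batch,) ground-truth boundary positions

    Illustrative sketch; the exact formulation in [1] may differ.
    """
    pr = F.softmax(boundary_logits, dim=-1)                 # Pr(b = i)
    p_star = F.one_hot(gt_index, pr.size(-1)).to(pr.dtype)  # p*(i)
    # EMD over an ordered 1-D domain: compare cumulative distributions.
    diff = torch.cumsum(pr, dim=-1) - torch.cumsum(p_star, dim=-1)
    return (diff ** 2).sum(dim=-1).mean()
```

Unlike cross-entropy, which only looks at the probability assigned to the correct index, this loss grows the farther the predicted probability mass sits from the true boundary, which is what makes EMD a natural fit for ordered outputs such as time indices.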

  • (2) Loss function for confidence estimation

Here, a cross-entropy loss is used to measure the compatibility between the predicted confidence score c_n and the desired confidence score σ_n. The predicted score c_n comes from the Score Predictor module of the SDU; how the desired score σ_n is obtained is given in [1].
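The cross-entropy term itself takes the usual binary form (written out here for completeness; any per-term weighting in [1] may differ):

L_{conf} = −Σ_n [ σ_n log c_n + (1 − σ_n) log(1 − c_n) ]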

In cases where the desired confidence score does not have to be binary, an L₂ loss can be used instead, as follows.
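That equation image is also omitted; the L₂ variant is simply the squared difference per SDU:

L_{conf} = Σ_n (c_n − σ_n)²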

5. Reference

[1] Sequence-to-Segments Networks for Segment Detection, NeurIPS 2018: https://papers.nips.cc/paper/7610-sequence-to-segment-networks-for-segment-detection.pdf

[2] Author’s git repo

Any corrections, suggestions, and comments are welcome.
