A Study of Feature Selection of Whale Vocalization

These two IEEE paper from Duke University are recommended to me by Dr. Murphy, who was a visiting assistant professor at Duke. Both paper discuss feature selection techniques specifically for whale signals prior to classification. From my point of view, the first paper is more math-heavy, introducing four transformations related to Fourier tranform, while the second paper introduces four mappings for dimensionality reduction. These paper are for those who have some prior exposure to signal processing and linear algebra.

Whale vocalizations with Weyl Transform

This paper focuses on signal processing techniques, specifically the feature selection process, and does not have any discussion in machine learning methods. The authors propose an unique method to analyze a whale signal. Intuitively speaking, various signal analysis methods suit different kinds of signals. The subject matters in the discussion are right whale and humpback whale signals. To classify one from the other, the authors use the chirp-like signal, an intrinsic identity of marine mammals. To obtain such information, they propose Weyl Transform, which supposedly gives second-order information of a signal. Recall from physics, a second-order derivative of replacement vector x(t) gives us acceleration vector a(t). Put another way, it gives us a global view of the subject matter.

In section 2, the author briefly introduce MFCC and Chirplet transform. MFCC is an very established feature in SR (speech recognition). Chirplet is a niche method very specific for chirp-like signals (ignorable). Basically, these two methods extract coefficients from a signal and present essential information to the classifier. One motivation of proposing an alternative to MFCC is that MFCC only provides first-order frequency information. To understand MFCC, basically it projects STFT to a triangular filter before taking the DCT, which reduces dimensionality and retrieves only the real part (cosine part of Euler’s formula). The filter H(k,m) below manifests why MFCC is characterized by first-order frequency information.

Section 3 explains the source of data and rationale of selecting chirp-like singals as the feature. Section 4 mathematically deduces the feature selection process. The authors propose two feature sets out of the Weyl transform, modifying Feature Set I to get Feature Set II. I have some difficulties understanding the math in this section, yet from formula (4), it is clear that Weyl transform is built on top of Fourier Transform. In addition, Fig. 3 clarifies their intention. We can see that Feature Set I (c) and Feature Set II (d) attempt to attain the chirps from a signal.

Last section is classification result. The authors employ KNN as classifier, Hamming window for MFCC, and ROC (receiver operating characterisitc) curve to visualize classification results of 4 features. To interpret ROC curve, one needs to understand type I and type II error. Generally, depending on the axes, the closer the curves are to the upper-left or lower-right corners, the better the results are. From Fig. 4, Feature Set II outperforms Feature Set I and the others, and proves that Weyl transform is a good feature selector for Whale signals.

Intrinsic Structure Study of Whale Vocalization

Similar to the first paper, this paper also focuses on the feature selection process. It focuses on the dimension reduction techniques, in contrast to the mathematical transform in the first paper. This is reasonable since the whale siganl, as stated in both paper, is a polynomial-phase siganl. We naturally desire the number of coefficients to be as small yet respresentative as possible. The whole propose of the paper is to compare efficacies of nonlinear mapping to linear mapping, as in Fig. 1 below.

PCA (section 2.1)

PCA is perhaps the most popular dimension reduction technique. For those who have just taken Linear Algebra like me, it is not hard to understand its concept. In short, we compute the covariance matrix of the input matrix, and then compute the eigenvector of the covariance matrix. Eigenvector with the highest eigenvalue is the principle component of the data set. I found the following link a good read:


Multidimensional Scaling (section 2.2)

The explanation of MDS is too succinct on the paper. MDS is a spatial rearrangement in such a way that, distances between objects indicates the similarities (dissimilarities) of the objects: similar object are represented by points that are close to each other, dissimilar objects by points that are far apart. To state the problem mathematically, given a pair-wise (Euclidean/non-Euclidean) distance matrix Dx of N points in k-dimension space, find a new set of N points in p-dimension space (p<<k) such that the new pair-wise distance matrix Dy is the best approximation of Dx. Below is the algorithm for classical multidimensional scaling:

  1. Set up distance matrix D with entries,

2. Apply double centering do D:

3. Find p principal components. Since B is symmetric, by spectral theorem, we can decompose it to a diagonal matrix. Multiplying square root of the first p eigenvalues with their corresponding eigenvectors,

A good read with example of MDS: https://pdfs.semanticscholar.org/dbe6/5466b905b63cf67739d3b696d617ec101d00.pdf

The problem with the above two methods is that the linearity identity fails to capture global information of the manifold learning problem (I think of manifold as geometric structure of the data). On the other hand, a nonlinear mapping has two obvious advantages. It captures local structures of the manifold and present a global view by connecting the local data information. Recall from Multivariable calculus, nonlinearity is a function that has order greater than 1.

Laplacian Eigenmap (section 2.3)

To understand the algorithm, one may need elementary knowledge in graph theory. Our starting point is to preserve local structure of the manifold. An idea is to visualize the manifold (data set) with an undirected adjacency weighted graph G (connected/disconnected). Each vertex is an object, while the edge represents similarities (dissimilarities) between objects (data). The graph is undirected because similarity between i and j equals similarity between j and i, and weighted for the degree of similarities. Surprisingly, the conversion is very logical (at least for me).

Now, we have restated the problem to mapping the graph G to an m-dimensional Euclidean space, so that connected points stay as close together as possible. The simplest objective function in the Euclidean space is the following formula,

where we penalize euclidean distances with their corresponding weight. Finding the argument minima is equivalent to finding the eigenvalue of

(I don’t understand the optimization part). The word Laplacian comes from the Laplacian matrix L = W(weight matrix) — D(degree matrix); the word Eigen originates from solving the optimization problem)

A rigorous proof of Laplacian Eigenmap: http://yeolab.weebly.com/uploads/2/5/5/0/25509700/belkin_laplacian_2003.pdf

ISOMAP (section 2.4)

ISOMAP is a modification of MDS. Replace the Euclidean distance in MDS with the following,

Neighbors are motivated from K-NN algorithm (I guess). Recall from data structures, to find the shortest path between two vetices, we can use Dijkstra’s algorithm. After constructing the distance matrix P, we decompose it to its Eigen (principal) components.

Fig. 4 gives a comparison of the four methods with Laplace Eigenmap having the best result under 3 different classifiers, and subsequently proves the author’s thesis that nonlinear mapping provides a better insight to whale vocalization.