Look Who’s Talking

Published in

TL;DR Innovation

4 min read · Feb 22, 2018

Spectral Learning Algorithm for Speech Separation

I do not know the first thing about teaching science to the blind. Even my earliest course in problem solving begins with Step 1: Read the Problem; Step 2: Draw a Picture of the Situation — both initial steps inextricably associated with sight. I’ve exhausted cases of dry-erase markers and meters of chalk creating elaborate diagrams on the board and lean heavily on classroom projectors for animations and presentations. Entire courses in biology require students to memorize images seen through microscopes and sketch intricate cartoons of membrane transport mechanisms. It’s no wonder that computer scientists and electronics engineers are fervently developing algorithms and hardware for machine vision. Even if the ultimate goal of intelligent, sighted machines may be far off, tools capable of sprinting through hours of security video, cabinets of medical images, or mountains of scanned documents can make an intelligent researcher very efficient.

One important area of machine vision research is image segmentation, wherein image pixels are divided into groups or regions that represent objects or parts of objects. This differs from finding a specific object in an image by comparing a known model of the object with the image through correlation or a similar operation. In segmentation, image characteristics including texture, brightness, and color are used to associate pixels into statistically similar clusters, while additional characteristics such as contrast, intensity gradients, and detected edges help to discern the interface between different objects. In this fashion, the segmentation algorithm is “blind” to the identity of the individual objects and simply points out their position, shape, and number.

Professor Michael I. Jordan and his research group at the University of California, Berkeley, have recently developed a blind algorithm for finding the optimum values of a “similarity matrix.” The matrix is used in a process termed “spectral clustering” to partition image pixels or points into disjoint clusters, with points in the same cluster having high similarity and points in different clusters having low similarity. Instead of starting with the entire image and repeatedly partitioning the points into exactly two groups, those belonging and those not belonging to the current cluster (as in the creation of a binary search tree), the algorithm considers the clustering of the image into any number, k, of subsets simultaneously. The optimal number of clusters, k, and the cluster to which each point belongs are treated as an error minimization problem by comparing the algorithm’s results with the “correct” answer: a segmented image created by a human operator. Like other supervised learning algorithms, once it is trained to produce the correct results it can be applied to new but similar input.
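For readers who would like to see the clustering step in action, here is a minimal sketch in Python with NumPy. It is not the Berkeley group's code. It assumes a fixed Gaussian similarity with a hand-picked width `sigma`, a textbook normalized spectral embedding, and a small k-means routine, and it leaves out the learning step that tunes the similarity matrix against a human segmentation. The function names and the three synthetic point clouds are hypothetical stand-ins for pixel features.

```python
import numpy as np

def gaussian_similarity(points, sigma=1.0):
    """W[i, j] = exp(-||x_i - x_j||^2 / (2 sigma^2)); sigma is picked by hand here."""
    diffs = points[:, None, :] - points[None, :, :]
    sq_dists = np.sum(diffs ** 2, axis=-1)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

def kmeans(X, k, iterations=50):
    """Plain k-means with a deterministic farthest-point initialization."""
    centers = [X[0]]
    for _ in range(1, k):
        # Distance from every point to its nearest already-chosen center.
        dists = np.min([np.sum((X - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(dists)])
    centers = np.array(centers)
    for _ in range(iterations):
        sq = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = np.argmin(sq, axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels

def spectral_clusters(W, k):
    """Split the points described by similarity matrix W into k clusters at once."""
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
    # Symmetrically normalized similarity D^{-1/2} W D^{-1/2}.
    M = W * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    # eigh returns eigenvalues in ascending order, so the last k columns
    # are the eigenvectors of the k largest eigenvalues.
    _, vecs = np.linalg.eigh(M)
    embedding = vecs[:, -k:]
    embedding = embedding / np.linalg.norm(embedding, axis=1, keepdims=True)
    return kmeans(embedding, k)

# Three synthetic clouds of 10 points each, standing in for pixel features.
rng = np.random.default_rng(0)
cloud_centers = np.array([[0.0, 0.0], [5.0, 5.0], [10.0, 0.0]])
points = np.concatenate(
    [c + rng.normal(scale=0.3, size=(10, 2)) for c in cloud_centers]
)
labels = spectral_clusters(gaussian_similarity(points, sigma=1.0), k=3)
print(labels)
```

The cluster numbers themselves are arbitrary; what matters is that points from the same cloud share a label. In the learning version described above, the parameters of the similarity function would be adjusted until the output agrees with the human-made segmentation.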

The learning spectral clustering algorithm was presented at the Neural Information Processing Systems (NIPS) conference in December 2004, where Jordan’s group also described its utility in separating the speech of multiple speakers recorded by a single microphone. When an image of multiple objects is recorded by a single digital camera, it is actually being captured by millions of individual pixels and displayed as a rectangular matrix of intensity and color values. A single microphone records pressure values as a function of time, which can be plotted as a rectangular matrix having time along the horizontal dimension and pressure intensity along the vertical. Because sound is most often analyzed in terms of frequency, a windowed, short-time Fourier transform can be applied to the pressure intensity values to obtain a rectangular matrix of frequency versus time, known as a spectrogram. Whereas multiple simultaneous objects are separated by position in an image, simultaneous speech from multiple speakers differs by frequency (pitch) and resonant features known as timbre.
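The spectrogram described above can be sketched in a few lines of Python with NumPy. This is only an illustration. The window length of 256 samples, the hop of 128 samples, the Hann window, and the 8000 Hz sample rate are my assumptions and not values taken from the NIPS paper, and a pure 1000 Hz tone stands in for recorded speech.

```python
import numpy as np

def spectrogram(signal, window_size=256, hop=128):
    """Magnitude of a windowed short-time Fourier transform.

    Returns a matrix with frequency along the rows and time along the columns.
    """
    window = np.hanning(window_size)
    n_frames = 1 + (len(signal) - window_size) // hop
    frames = np.stack(
        [signal[i * hop : i * hop + window_size] * window for i in range(n_frames)]
    )
    # rfft keeps only the non-negative frequencies of a real-valued signal.
    return np.abs(np.fft.rfft(frames, axis=1)).T

# One second of a 1000 Hz tone sampled at an assumed 8000 Hz.
sample_rate = 8000
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 1000 * t)

spec = spectrogram(tone)
peak_bins = spec.argmax(axis=0)  # strongest frequency bin in each time frame
peak_hz = peak_bins * sample_rate / 256
print(spec.shape, peak_hz[0])
```

Each column of `spec` is one short slice of time and each row is one frequency band, so the result can be handled exactly like the pixel matrix of an image.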

Traditional voice recognition algorithms must know the identity of an individual speaker before that speaker’s speech can be separated from other speakers or background noise, much like the known models used in object recognition. Since the learning spectral algorithm is only concerned with clustering similar items, it can separate the speakers blindly, without knowledge of their identity. In addition to clustering similar harmonic features of pitch and timbre, the algorithm tracks non-harmonic cues such as continuity, where two time-frequency points are close in time or frequency, and “common fate cues,” elements that exhibit similar time variation such as identical start and stop times and frequency co-modulation resulting from the “speech psychophysics” of punctuation, inflection, and emphasis. The sound bites used for algorithm training are easily created by merging the speech files of individual speakers, and the original separate files are used to determine the optimal results. With no knowledge of the language spoken, the identity of the speakers, or the mechanics used to create the sounds, the algorithm can be used as a tunable filter to extract a desired conversation. Although sight is not required by the algorithm, its utility in image and speech analysis, and in informatics in general, reflects the far-reaching vision of its developers.
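The training recipe in the paragraph above, merging separate recordings and keeping the originals as the answer key, might look something like the following Python sketch. It rests on assumptions of my own: two pure tones stand in for two speakers, and each time-frequency point is labeled by whichever source is louder there, which is one plausible way to derive a reference segmentation and not necessarily the one the Berkeley group used. The function names are hypothetical.

```python
import numpy as np

def stft_magnitude(signal, window_size=256, hop=128):
    """Spectrogram magnitudes: rows are frequency bins, columns are time frames."""
    window = np.hanning(window_size)
    starts = range(0, len(signal) - window_size + 1, hop)
    frames = np.stack([signal[s : s + window_size] * window for s in starts])
    return np.abs(np.fft.rfft(frames, axis=1)).T

def make_training_example(speaker_a, speaker_b):
    """Merge two recordings and build reference labels from the originals."""
    mixture = speaker_a + speaker_b
    mag_a = stft_magnitude(speaker_a)
    mag_b = stft_magnitude(speaker_b)
    # Assumed labeling rule: 0 where speaker A is at least as loud, 1 otherwise.
    labels = (mag_b > mag_a).astype(int)
    return stft_magnitude(mixture), labels

# Two tones stand in for two speakers (sample rate of 8000 Hz is assumed).
sample_rate = 8000
t = np.arange(sample_rate) / sample_rate
speaker_a = np.sin(2 * np.pi * 500 * t)
speaker_b = np.sin(2 * np.pi * 1500 * t)

mix_spec, labels = make_training_example(speaker_a, speaker_b)

# Keeping only the points labeled 0 acts as a filter that passes speaker A.
extracted_a = np.where(labels == 0, mix_spec, 0.0)
print(labels[16, 0], labels[48, 0])  # bins nearest 500 Hz and 1500 Hz
```

A trained clustering algorithm would have to reproduce `labels` from `mix_spec` alone; here the labels come directly from the separate recordings, which is what makes them usable as the “correct” answer during training.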

This material originally appeared as a Contributed Editorial in Scientific Computing and Instrumentation 22:9 August 2005, pg. 8.

William L. Weaver is an Associate Professor in the Department of Integrated Science, Business, and Technology at La Salle University in Philadelphia, PA, USA. He holds a B.S. Degree with Double Majors in Chemistry and Physics and earned his Ph.D. in Analytical Chemistry with expertise in Ultrafast LASER Spectroscopy. He teaches, writes, and speaks on the application of Systems Thinking to the development of New Products and Innovation.

Written by William L. Weaver