Real-time Sound Event Classification

Chathuranga Siriwardhana
Oct 9 · 5 min read

I introduced a method to classify sound events using machine learning in a previous post. There, the sound events were captured separately as small audio clips, so no segmentation was needed; the isolated clips were used both to train and to test the neural network. In this post, I’m going to introduce a method to classify sound events that occur one after another in a single audio clip (or a stream). We have to classify each event and output its label along with its timestamp within the clip or stream. Let’s call this time-tagging. Note that the same procedure used to classify isolated sound events will be reused here, so I strongly recommend reading the previous post first.

Let’s have a look at the waveform of a sample audio clip that we are going to time-tag.

A sequence of sound events

The suggested method to time-tag is as follows.

1. Reduce noise using a common noise sample

2. Split the audio clip into smaller clips, each containing a single sound event

3. Trim the leading and trailing silences of each single-sound-event clip

4. Classify the single-sound-event clips using the previously trained neural network

5. Output time tags.

This procedure is valid only when no two sound events occur simultaneously, because the prediction model we are using was trained on isolated sound events. Let’s also assume that the noise remains unchanged throughout the sequence. First, we’ll walk through the above steps on a clip of concatenated sound events. Then, at the end of the post, I’ll introduce a method to classify sound events in real-time from a microphone’s audio stream.


You can prepare a sample by concatenating some single audio clips as follows.

raw_audio = numpy.concatenate((raw_audio,data))
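
For completeness, here is a minimal sketch of how such a test sample could be assembled. The file names and the 22050 Hz sample rate are assumptions for illustration only.

import numpy
import librosa

# Hypothetical isolated event recordings; replace with your own clips
clip_files = ['clapping.wav', 'sweeping.wav', 'calling.wav']

raw_audio = numpy.array([], dtype=numpy.float32)
for file in clip_files:
    data, sr = librosa.load(file, sr=22050)  # load each clip at a common sample rate
    raw_audio = numpy.concatenate((raw_audio, data))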

Reduce evenly distributed noise as follows.

import noisereduce as nr

noisy_part = raw_audio[0:50000]  # Empirically selected noisy part, at the same position for every sample
nr_audio = nr.reduce_noise(audio_clip=raw_audio, noise_clip=noisy_part, verbose=False)

Splitting the audio clip

Now we come to the core idea of continuous sound event recognition. The challenge in classifying a sequence of sound events is determining the starting and ending points of each event. There is almost always a silent part between two sound events, although short silences can also occur inside a single event. We can use the silent parts between events to split the sequence. Have a look at the following code, which accomplishes the task. Observe that the parameter tolerence adjusts the splitting sensitivity, so that small silent parts inside one sound event do not split that event further.

# Split a given long audio file on silent parts.
# Accepts audio numpy array audio_data, window length w and hop length h, threshold_level, tolerence
# threshold_level: Silence threshold
# Higher tolerence to prevent small silence parts from splitting the audio.
# Returns array containing arrays of [start, end] points of resulting audio clips
def split_audio(audio_data, w, h, threshold_level, tolerence=10):
    split_map = []
    start = 0
    data = np.abs(audio_data)
    threshold = threshold_level*np.mean(data[:25000])
    inside_sound = False
    near = 0
    for i in range(0, len(data)-w, h):
        win_mean = np.mean(data[i:i+w])
        if(win_mean > threshold and not(inside_sound)):
            inside_sound = True
            start = i
        if(win_mean <= threshold and inside_sound and near > tolerence):
            inside_sound = False
            near = 0
            split_map.append([start, i])
        if(inside_sound and win_mean <= threshold):
            near += 1
    return split_map

The algorithm slides a fixed-size window w over the audio with hop length h, inspecting the window’s mean amplitude. While inside a sound event, if the mean amplitude falls below the threshold, the algorithm increments an internal counter called near. Once near exceeds the parameter tolerence, the end point of the current clip is marked. Likewise, the starting points of clips are determined by the window mean amplitude rising above the threshold. Note that an internal boolean flag inside_sound is maintained to distinguish between start and end split points.
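
To get an intuition for how tolerence maps to an actual silence duration, here is a quick back-of-the-envelope check, assuming the 22050 Hz default sample rate of librosa.load and the window and hop values used below.

sr = 22050          # assumed sample rate (librosa.load default)
w, h = 10000, 2500  # window and hop length in samples, as used below
tolerence = 10

# A split is only triggered after roughly `tolerence` consecutive quiet windows,
# i.e. about tolerence * h samples of silence.
min_silence_sec = (tolerence * h) / sr
print(min_silence_sec)  # ~1.13 s; shorter pauses stay inside one event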


Trim single sound event audio clips

Now that the audio clip is split into single sound events, the resulting small clips still need their leading and trailing silent parts trimmed. Let’s use librosa to accomplish the task.

sound_clips = split_audio(nr_audio, 10000, 2500, 15, 10)
for intvl in sound_clips:
    clip, index = librosa.effects.trim(nr_audio[intvl[0]:intvl[1]], top_db=20, frame_length=512, hop_length=64)

Note that split_audio only provides the time-tag intervals; the actual audio is obtained with nr_audio[intvl[0]:intvl[1]].


Classify sound events

To classify the isolated sound clips, we can use the trained Neural Network model from the previous post.

from keras.models import model_from_json

# Load segment audio classification model
model_path = r"best_model/"
model_name = "audio_NN3_grouping2019_10_01_11_40_45_acc_91.28"

# Model reconstruction from JSON file
with open(model_path + model_name + '.json', 'r') as f:
    model = model_from_json(f.read())

# Load weights into the new model
model.load_weights(model_path + model_name + '.h5')

The model was created to predict a label using a label encoder. We need to replicate the label encoder here as well.

from sklearn.preprocessing import LabelEncoder

# Replicate label encoder
lb = LabelEncoder()
lb.fit_transform(['Calling', 'Clapping', 'Falling', 'Sweeping', 'WashingHand', 'WatchingTV', 'enteringExiting', 'other'])

To classify audio clips with the loaded model, we need the same features used in training: the magnitude STFT averaged over time (mean absolute STFT). Have a look at the following code, which accomplishes the task. Note that the function accepts the label encoder lb as an input so that it can return a meaningful label for the sound event.

def predictSound(X, lb):
    stfts = np.abs(librosa.stft(X, n_fft=512, hop_length=256, win_length=512))
    stfts = np.mean(stfts, axis=1)
    stfts = minMaxNormalize(stfts)
    result = model.predict(np.array([stfts]))
    predictions = [np.argmax(y) for y in result]
    return lb.inverse_transform([predictions[0]])[0]
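
The minMaxNormalize helper comes from the previous post. If you don’t have it at hand, a minimal sketch of a standard min-max normalization (assumed to match the one used at training time) looks like this:

def minMaxNormalize(arr):
    # Scale the feature vector into [0, 1]
    mn = np.min(arr)
    mx = np.max(arr)
    return (arr - mn) / (mx - mn)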

Now, we can use the predictSound function on the isolated clips just after the trimming operation mentioned above. Visit the GitHub repository for the full code and the classification example.
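
Putting the pieces together, the final time-tagging loop could look something like the following sketch. The 22050 Hz sample rate and the output format are my own assumptions here.

sr = 22050  # assumed sample rate of nr_audio
sound_clips = split_audio(nr_audio, 10000, 2500, 15, 10)
for intvl in sound_clips:
    clip, index = librosa.effects.trim(nr_audio[intvl[0]:intvl[1]], top_db=20, frame_length=512, hop_length=64)
    label = predictSound(clip, lb)
    # Output the time tag: start and end in seconds plus the predicted label
    print("%6.2f s - %6.2f s : %s" % (intvl[0] / sr, intvl[1] / sr, label))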


Real-time sound event recognition

Up to now, we have been working on a recorded sequence of sound events. What if we want to do the same thing in real-time, without having to record first? For this, we can buffer a microphone’s input stream into a temporary buffer and run the same pipeline on that buffer. The audio I/O can be handled easily with a Python library such as PyAudio (documentation here). Check out the GitHub repository for the implementation.
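
As a rough illustration of that idea, here is a minimal sketch assuming a 22050 Hz mono float32 stream and a fixed-length temporary buffer. The chunk size and buffer length are arbitrary choices of mine, noise reduction is omitted for brevity, and the repository contains the actual implementation.

import numpy as np
import pyaudio
import librosa

RATE = 22050        # assumed sample rate, matching the offline pipeline
CHUNK = 2048        # frames pulled from the microphone per read
BUFFER_SECONDS = 3  # arbitrary length of the temporary buffer

p = pyaudio.PyAudio()
stream = p.open(format=pyaudio.paFloat32, channels=1, rate=RATE,
                input=True, frames_per_buffer=CHUNK)

buffer = np.array([], dtype=np.float32)
try:
    while True:
        data = np.frombuffer(stream.read(CHUNK), dtype=np.float32)
        buffer = np.concatenate((buffer, data))
        if len(buffer) >= RATE * BUFFER_SECONDS:
            # Treat the buffered audio exactly like the recorded clip above
            clip, _ = librosa.effects.trim(buffer, top_db=20,
                                           frame_length=512, hop_length=64)
            if clip.size > 0:
                print(predictSound(clip, lb))
            buffer = np.array([], dtype=np.float32)
finally:
    stream.stop_stream()
    stream.close()
    p.terminate()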

Hope you find the article useful.
