Machine Learning is Fun Part 6: How to do Speech Recognition with Deep Learning

Alexa, order a large pizza!

Machine Learning isn’t always a Black Box

Turning Sounds into Bits

Images are just arrays of numbers that encode the intensity of each pixel
A waveform of me saying “Hello”
Sampling a sound wave
Each number represents the amplitude of the sound wave, measured at intervals of 1/16,000th of a second
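
To make the sampling step concrete, here is a minimal sketch that turns a sound wave into a list of numbers at 16,000 readings per second. The 440 Hz test tone, the 16-bit amplitude range and the function name `sample_sine` are assumptions chosen for illustration; a real recording would come from a microphone or an audio file.

```python
import math

SAMPLE_RATE = 16000  # readings per second, matching the caption above


def sample_sine(freq_hz, duration_s, sample_rate=SAMPLE_RATE, peak=32767):
    """Read a pure sine tone at evenly spaced instants.

    Returns integer amplitudes in a 16-bit style range (an assumption;
    the caption only says each number is an amplitude).
    """
    count = int(round(duration_s * sample_rate))
    return [
        int(round(peak * math.sin(2 * math.pi * freq_hz * i / sample_rate)))
        for i in range(count)
    ]


# 20 milliseconds of a hypothetical 440 Hz tone
samples = sample_sine(440.0, 0.02)
print(len(samples))  # 320 readings, one every 1/16,000th of a second
print(samples[:5])
```

Each entry in `samples` is one amplitude reading, so 20 milliseconds of audio becomes a list of 320 numbers.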

A Quick Sidebar on Digital Sampling

Can digital samples perfectly recreate the original analog sound wave? What about those gaps?
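
One way to see what the gaps can and cannot hide is the Nyquist limit: samples taken at 16,000 per second can only represent frequencies up to 8,000 Hz. The sketch below is an illustration under that assumption, not a proof. It samples a 9,000 Hz tone and a 7,000 Hz tone (both frequencies picked arbitrarily) and checks that the readings are identical apart from a sign flip, which means the higher tone cannot be told apart from the lower one once sampled.

```python
import math

SAMPLE_RATE = 16000
NYQUIST_HZ = SAMPLE_RATE / 2  # highest frequency the samples can represent


def sample_tone(freq_hz, count, sample_rate=SAMPLE_RATE):
    """Read a unit-amplitude sine tone at `count` evenly spaced instants."""
    return [math.sin(2 * math.pi * freq_hz * i / sample_rate) for i in range(count)]


above_limit = sample_tone(9000, 320)  # 1,000 Hz above the Nyquist limit
below_limit = sample_tone(7000, 320)  # 1,000 Hz below the Nyquist limit

# The two lists match sample for sample, up to a sign flip.
aliased = all(
    math.isclose(a, -b, abs_tol=1e-9) for a, b in zip(above_limit, below_limit)
)
print(NYQUIST_HZ)  # 8000.0
print(aliased)     # True
```

Tones below the limit do not collide this way, which is why sampling at least twice as fast as the highest frequency of interest is enough to capture it.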

Pre-processing our Sampled Sound Data

Each number in the list represents how much energy was in one 50 Hz frequency band
You can see that our 20 millisecond sound snippet has a lot of low-frequency energy and not much energy in the higher frequencies. That’s typical of “male” voices.
The full spectrogram of the “hello” sound clip
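
The captions above can be sketched in code. A 20 millisecond chunk at 16,000 samples per second holds 320 readings, and a Fourier transform of 320 readings yields bands spaced 16000 / 320 = 50 Hz apart. The example below uses a plain, slow discrete Fourier transform from the standard library and reports each band's magnitude as a stand-in for its energy; the function names and the 200 Hz test tone are assumptions, and a real pipeline would normally use an FFT library and a window function.

```python
import cmath
import math

SAMPLE_RATE = 16000
CHUNK = 320                       # 20 ms of audio at 16 kHz
BAND_HZ = SAMPLE_RATE // CHUNK    # 50 Hz between neighbouring bands


def band_magnitudes(chunk):
    """Naive DFT: one magnitude per band from 0 Hz up to 8,000 Hz."""
    n = len(chunk)
    result = []
    for k in range(n // 2 + 1):
        total = sum(
            chunk[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)
        )
        result.append(abs(total) / n)
    return result


def spectrogram(samples, chunk=CHUNK):
    """One row of band magnitudes per consecutive 20 ms chunk."""
    return [
        band_magnitudes(samples[start:start + chunk])
        for start in range(0, len(samples) - chunk + 1, chunk)
    ]


# Hypothetical input: 40 ms of a pure 200 Hz tone
tone = [math.sin(2 * math.pi * 200 * t / SAMPLE_RATE) for t in range(2 * CHUNK)]
rows = spectrogram(tone)

loudest = max(range(len(rows[0])), key=rows[0].__getitem__)
print(len(rows), len(rows[0]))  # 2 chunks, 161 bands each
print(loudest * BAND_HZ)        # 200
```

Stacking the rows chunk after chunk gives the grid that a spectrogram image draws, with time along one axis and frequency band along the other.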

Recognizing Characters from Short Sounds

  • HHHEE_LL_LLLOOO becomes HE_L_LO
  • HHHUU_LL_LLLOOO becomes HU_L_LO
  • AAAUU_LL_LLLOOO becomes AU_L_LO
  • HE_L_LO becomes HELLO
  • HU_L_LO becomes HULLO
  • AU_L_LO becomes AULLO
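
The two clean-up steps in the list above can be sketched as a pair of small functions: first merge runs of the same character, then drop the `_` placeholders. The function names are hypothetical and this only covers the post-processing shown in the bullets, not the neural network that produces the raw character stream.

```python
from itertools import groupby

BLANK = "_"  # placeholder character used in the examples above


def collapse_repeats(text):
    """Merge each run of identical characters into a single character."""
    return "".join(char for char, _run in groupby(text))


def remove_blanks(text, blank=BLANK):
    """Drop the placeholder characters left after collapsing."""
    return text.replace(blank, "")


def decode(text):
    """Apply both steps in order: collapse repeats, then remove blanks."""
    return remove_blanks(collapse_repeats(text))


for raw in ("HHHEE_LL_LLLOOO", "HHHUU_LL_LLLOOO", "AAAUU_LL_LLLOOO"):
    print(raw, "->", collapse_repeats(raw), "->", decode(raw))
# HHHEE_LL_LLLOOO -> HE_L_LO -> HELLO
# HHHUU_LL_LLLOOO -> HU_L_LO -> HULLO
# AAAUU_LL_LLLOOO -> AU_L_LO -> AULLO
```

The order matters: the blank between the two runs of `L` is what keeps the double letter in HELLO, so blanks can only be removed after the repeats have been collapsed.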

Wait a second!

“Hullo! Who dis?”

Can I Build My Own Speech Recognition System?

You can access the same thing for Amazon via your Alexa app. Apple unfortunately doesn’t let you access your Siri voice data.

Where to Learn More

  • The algorithm (roughly) described here to deal with variable-length audio is called Connectionist Temporal Classification or CTC. You can read the original paper from 2006.
  • Adam Coates of Baidu gave a great presentation on Deep Learning for Speech Recognition at the Bay Area Deep Learning School. You can watch the video on YouTube (his talk starts at 3:51:00). Highly recommended.


Interested in computers and machine learning. Likes to write about it.

Adam Geitgey
