Getting AI to Take my Notes for Me

How I built a program to transcribe and summarize audio.

Vedaant Varshney · Published in The Startup · 7 min read · May 6, 2020


In our daily lives, we are constantly being bombarded by data, but here’s the catch: not all of it is digital. On average, we hear 30,000 words and speak at least 7,000 daily.

Putting that into context, that’s the equivalent of about half the words in Harry Potter and the Sorcerer’s Stone, every single day.

As we all know, not every single word is important, and to track what is, we try taking notes. However, we suck at predicting what’s going to come in handy, and we end up taking down far too many notes.

That’s why I built a program that can take notes for you! While that might have been a bit of an oversimplification, the result is still the same.

My program can take an audio feed or audio files, and it uses speech recognition to get a text output. Then it adds punctuation and capitalization back to the text using neural networks, before finally summarizing it into a few useful sentences.

The user receives a complete transcription of the audio, separating up to two speakers. They also get a list of key takeaways or points from the file. Whether it’s small meetings, voice memos, or possibly even a lecture, you can get all of the text in a readable format along with a summary of all the important bits.

Knowing what the system can do is great and all, but I’d argue that it’s most important to know how it works. If I were to categorize the three main features, they would be:

  • Automatic Speech Recognition
  • Punctuation Restoration
  • Text Summarization

Automatic Speech Recognition (ASR) for Transcription

You’re probably already familiar with some ASR systems; Siri and Alexa are two of the most renowned. The purpose of Automatic Speech Recognition is to allow us as humans to “speak” to our devices, paving the way for more natural interactions with them.

Let’s go over how it works in some more detail!

When your computer detects audio, it converts it into waves, which are registered as sounds. In the field of Natural Language Processing, we call the sounds that are used in language phonemes. English has 44 of them, and to use an example, let’s take the word “cat”.

Cat uses 3 phonemes/sounds, and they are the “c” sound, the “a” sound, and the “t” sound. It’s not the easiest to explain through text, but I go over it in more detail in my video about this project.

To determine which sounds correspond with each phoneme, we can use neural networks that treat specific parts of waves as their input, and output probabilities for the wave being each phoneme. We can repeat a similar process for mapping a set of phonemes to words, and then you pretty much have a complete speech recognition system in place.
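To make that idea concrete, here’s a minimal sketch of a phoneme classifier, nothing like a production ASR model: it assumes each slice of the wave has already been converted into 13 MFCC features (a common way to represent audio frames), and it outputs a probability for each of English’s 44 phonemes.

```python
# Minimal sketch: classify one audio frame into one of 44 phonemes.
# Assumes frames are preprocessed into 13 MFCC features each.
import numpy as np
from tensorflow import keras

NUM_PHONEMES = 44  # phonemes in English
NUM_FEATURES = 13  # MFCC coefficients per frame (an assumption)

model = keras.Sequential([
    keras.layers.Dense(128, activation="relu", input_shape=(NUM_FEATURES,)),
    keras.layers.Dense(128, activation="relu"),
    # One probability per phoneme for this frame
    keras.layers.Dense(NUM_PHONEMES, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

frame = np.random.rand(1, NUM_FEATURES)  # stand-in for a real MFCC frame
phoneme_probs = model.predict(frame)     # shape (1, 44)
```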

This might seem great and all, but if you’ve ever tried to speak to Siri, you probably know that it’s not the best right now. That’s because ASR is a very complex field. Mapping sounds to phonemes and phonemes to words is so difficult that it can take a team of hundreds of linguists and tech experts, armed with supercomputers, to build an ASR system of Alexa or Siri’s caliber.

Essentially needing a supercomputer for speech recognition limits the players in the space to some big companies like Google, Apple, and IBM, but there are open-source alternatives like CMUSphinx.

For my project, I ended up using Google’s speech recognition services along with the Python SpeechRecognition module for the best (and free-est) results.
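Here’s roughly what that looks like with the SpeechRecognition module; the file name is just a placeholder.

```python
# Transcribe an audio file by sending it to Google's free web speech API.
import speech_recognition as sr

recognizer = sr.Recognizer()

with sr.AudioFile("meeting.wav") as source:  # placeholder file name
    audio = recognizer.record(source)        # read the entire file

try:
    text = recognizer.recognize_google(audio)  # raw, unpunctuated text
    print(text)
except sr.UnknownValueError:
    print("Could not understand the audio.")
except sr.RequestError as e:
    print(f"API request failed: {e}")
```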

Punctuation Restoration with DNNs

The next step is to make the raw text output from the speech recognition readable, since at this point it’s just an unbroken string of words. I solved this by adding in punctuation and capitalization where necessary.

For the training data, I took a large library of online books and my friends’ old essays, and processed them so that every single word was assigned one of three classes. These classes indicate the piece of punctuation that follows the word, with 0 representing nothing, 1 a period, and 2 a comma.
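A simplified sketch of how text can be turned into those (word, class) pairs, ignoring the messier edge cases:

```python
# Label each word with the punctuation that follows it:
# 0 = nothing, 1 = period, 2 = comma.
def label_text(text):
    labels = []
    for token in text.split():
        word = token.strip(".,").lower()
        if token.endswith("."):
            labels.append((word, 1))
        elif token.endswith(","):
            labels.append((word, 2))
        else:
            labels.append((word, 0))
    return labels

print(label_text("The dog barked, and the cat ran away."))
# [('the', 0), ('dog', 0), ('barked', 2), ('and', 0), ('the', 0),
#  ('cat', 0), ('ran', 0), ('away', 1)]
```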

I then used a technique called word embedding to represent all of the words. This means that every single word is assigned a vector of numbers, and these vectors capture meaning: if we were to represent them in two dimensions, words that have a similar meaning would be closer together.

When the vectors are represented in a 2D space, the closer two vectors are, the more similar the words are.

If we had the word dog, its vector would likely be close to mutt, but the words phone and window would be much farther away. For this project, I used GloVe word embeddings, with the vectors having 50 dimensions.
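If you want to poke at the embeddings yourself, here’s a quick sketch that loads the 50-dimensional GloVe file (assuming you’ve downloaded glove.6B.50d.txt from the GloVe project page) and compares a few words.

```python
# Load 50-dimensional GloVe vectors and compare word similarity.
import numpy as np

def load_glove(path):
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            vectors[parts[0]] = np.array(parts[1:], dtype=float)
    return vectors

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

glove = load_glove("glove.6B.50d.txt")
print(cosine_similarity(glove["dog"], glove["mutt"]))    # relatively high
print(cosine_similarity(glove["dog"], glove["window"]))  # much lower
```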

For the model, I used a Deep Neural Network that took in batches of 5 word vectors as input, passed them through hidden layers with dropout, and then output a probability for each punctuation class.

Simplified Architecture of my Neural Network
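In code, a simplified version of that architecture could look something like this; the layer sizes and dropout rate here are illustrative rather than the exact values I used.

```python
# Punctuation model sketch: a window of 5 word vectors (50 dims each)
# in, probabilities over the 3 punctuation classes out.
from tensorflow import keras

WINDOW, DIM, NUM_CLASSES = 5, 50, 3  # 0 = none, 1 = period, 2 = comma

model = keras.Sequential([
    keras.layers.Flatten(input_shape=(WINDOW, DIM)),  # 5 x 50 -> 250
    keras.layers.Dense(256, activation="relu"),
    keras.layers.Dropout(0.5),
    keras.layers.Dense(128, activation="relu"),
    keras.layers.Dropout(0.5),
    keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```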

I then converted the vectors and classes back into text and was able to get a readable output once again. A final step was to capitalize words after periods and any common proper nouns.
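The capitalization step can be as simple as this sketch, with a (hypothetical) lookup set of known proper nouns:

```python
# Uppercase the first word of each sentence and any known proper nouns.
PROPER_NOUNS = {"google", "siri", "alexa", "english"}  # example set

def recapitalize(text):
    result = []
    start_of_sentence = True
    for word in text.split():
        if start_of_sentence or word.rstrip(".,") in PROPER_NOUNS:
            word = word.capitalize()
        start_of_sentence = word.endswith(".")
        result.append(word)
    return " ".join(result)

print(recapitalize("the meeting is on monday. google made an update."))
# The meeting is on monday. Google made an update.
```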

This text is what makes up the complete transcription of what’s being said.

Summarization with the TextRank Algorithm

Going back to the initial goal, we need to summarize the transcription and get it down to something concise. I ended up using extractive text summarization, which takes the most important points and sentences directly from the transcription.

An alternative was abstractive text summarization which can create new sentences based on the transcription. It is a significantly more complex approach, and after some testing, it was clear that the extractive method was the way to go.

The first step is to preprocess the text, making sure the algorithm can deal with it. This involves removing the punctuation and capitalization that was just added, along with any stop words that don’t contribute to the meaning of the text.

It’s not a complete list, but some of the most common English stop words are “the”, “a”, “an”, “and”, “of”, “to”, “in”, and “is”.
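A sketch of that cleaning step, using NLTK’s built-in stop word list:

```python
# Lowercase, strip punctuation, and drop common English stop words.
import re
from nltk.corpus import stopwords  # requires nltk.download("stopwords")

STOP_WORDS = set(stopwords.words("english"))

def preprocess(sentence):
    words = re.sub(r"[^a-z\s]", "", sentence.lower()).split()
    return [w for w in words if w not in STOP_WORDS]

print(preprocess("The meeting is scheduled for Monday, at the main office."))
# ['meeting', 'scheduled', 'monday', 'main', 'office']
```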

Now we separate this cleaned text into sentences and create a similarity matrix, which is a square matrix with the dimensions n*n, n being the number of sentences. The values in the matrix represent how similar any two sentences are to each other, and we can calculate that by comparing how often the same words appear in both sentences and finding the distance between them.

Each row of the matrix is then normalized, so a sentence’s similarity values are divided between the other sentences and the fractions add up to 1.

That might seem a bit confusing, so let me delve a little deeper into the methodology. First, we can create a vector for each sentence, where each dimension is a word, and its value is how many times the word is repeated. After that, it’s possible to calculate the distance between the two sentence vectors using cosine similarity.

On the left, there are sample sentence vectors for the system to take in. On the right, there is a diagram that goes over cosine distance.

Cosine similarity is based on the angle between the vectors: the smaller the angle, the more similar the sentences are.
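Here’s a small sketch of that calculation, using simple word-count vectors:

```python
# Cosine similarity between two preprocessed sentences.
from collections import Counter
import math

def sentence_similarity(sent1, sent2):
    c1, c2 = Counter(sent1), Counter(sent2)  # word -> count vectors
    dot = sum(c1[w] * c2[w] for w in set(c1) & set(c2))
    norm1 = math.sqrt(sum(v * v for v in c1.values()))
    norm2 = math.sqrt(sum(v * v for v in c2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot / (norm1 * norm2)

a = ["meeting", "scheduled", "monday"]
b = ["meeting", "monday", "morning"]
print(sentence_similarity(a, b))  # ~0.67: two of three words shared
```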

The similarity matrix on the left multiplied by the n*1 score vector on the right.

The values obtained from the cosine similarity can be inputted into the similarity matrix from before, and then we can repeatedly multiply it by a vector with dimensions n*1, with n being the number of sentences (this is the power-iteration idea behind PageRank, which TextRank is based on). The values of the resulting vector are the ranked importance of each sentence. I chose to use the top 3 for each minute of text.
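A simplified sketch of that ranking step; the damping factor and iteration count here are standard PageRank-style defaults, added for illustration.

```python
# Rank sentences: normalize each row of the similarity matrix to sum
# to 1, then repeatedly multiply against a score vector (power iteration).
import numpy as np

def rank_sentences(sim_matrix, iterations=50, damping=0.85):
    n = sim_matrix.shape[0]
    row_sums = sim_matrix.sum(axis=1, keepdims=True)
    row_sums[row_sums == 0] = 1            # avoid division by zero
    transition = sim_matrix / row_sums     # each row sums to 1
    scores = np.ones(n) / n                # start with equal scores
    for _ in range(iterations):
        scores = (1 - damping) / n + damping * transition.T @ scores
    return scores

sim = np.array([[0.0, 0.5, 0.1],
                [0.5, 0.0, 0.3],
                [0.1, 0.3, 0.0]])
scores = rank_sentences(sim)
top_three = np.argsort(scores)[::-1][:3]   # indices of the top 3 sentences
print(top_three, scores)
```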

Combining all three subsections together, you end up with my system!

Although my project is functional, it’s still very much a work in progress. Its accuracy is reasonable but definitely needs some improvement with punctuation. There’s a lot of room to build on in the future, potentially further improving the way we deal with and declutter verbal data.

The amount of information that we’ll be taking in daily is only going to rise, and to prevent information overload, being able to convert verbal information into summarized text effectively is something that we need now more than ever.

While this system might not be a complete solution, it’s definitely an important piece of the puzzle.


Also, feel free to check out my video on this project!

Please feel free to contact me through email for any inquiries or corrections in the article. Feedback is always appreciated as well!

E-mail: vedaant.varshney@gmail.com
Website: vedaantv.com
LinkedIn: Vedaant Varshney
