Published in Voice Tech Podcast

All About Parsing: What it is, and how it relates to Text-to-Speech software

In the world of speech recognition (or any linguistic field, really), the word “parsing” gets thrown around frequently, with more than a couple of meanings and applications.

At its most basic, ‘parsing’ means “to analyze (a sentence, in this case) in terms of grammatical constituents, identifying the parts of speech, syntactic relations, etc.” What this means for us is that we take the sentence as a whole and break it into understandable chunks, each with its own meaning and context, while also describing how those chunks relate to one another. Traditionally, you may see the result drawn as a sentence “tree” once it’s parsed one way or another. Take a favorite example of mine, spoken by Groucho Marx: “This morning, I shot an elephant in my pajamas…”
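To make the idea of a sentence tree concrete, here is a minimal sketch of the two competing parses of Groucho’s sentence, using plain nested tuples as a stand-in for a real parser’s tree structure. The constituent labels (S, NP, VP, PP) are standard; the `show` helper is just an invented pretty-printer for illustration.

```python
def show(tree, depth=0):
    """Pretty-print a (label, *children) tuple tree with indentation."""
    label, *children = tree
    if len(children) == 1 and isinstance(children[0], str):
        # A leaf: a label covering a span of words.
        return "  " * depth + f"{label}: {children[0]}"
    return "\n".join(
        [("  " * depth) + label] + [show(c, depth + 1) for c in children]
    )

# Reading 1: "in my pajamas" attaches to the verb phrase
# (it describes how I shot the elephant).
parse_vp = ("S",
    ("NP", "I"),
    ("VP",
        ("V", "shot"),
        ("NP", "an elephant"),
        ("PP", "in my pajamas")))

# Reading 2: "in my pajamas" attaches inside the noun phrase
# (it describes the elephant -- the punchline's reading).
parse_np = ("S",
    ("NP", "I"),
    ("VP",
        ("V", "shot"),
        ("NP",
            ("NP", "an elephant"),
            ("PP", "in my pajamas"))))

print(show(parse_vp))
print()
print(show(parse_np))
```

Same words, two different trees: the whole ambiguity of the joke lives in where that one prepositional phrase hangs.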

Now, if we take this sentence at face value (no spoilers, those who know this one), we look at the typical interpretation of the main part of this sentence. We’ll leave off “This morning” because it complicates things a bit. We’re left with the following:

I was in my pajamas; I shot an elephant. Just a regular Sunday! (Source)

This is a sentence tree in a very basic sense: a representation of how we might classically parse this. But the very specificity that makes a parse like this informative becomes a problem when information is limited, as we will find with the full version of the quote: “This morning, I shot an elephant in my pajamas. How he got in my pajamas, I don’t know!”

I’m loath to overexplain a 90-year-old joke, so I will let it sink in by providing the alternative (and correct, in this case) parse of the above.

An unlikely scenario indeed! A bit of a garden path sentence.


The point of this, aside from being a wonderful excuse to force you all to hear a bit of classic wordplay, is to show that having all the words in a phrase is often not enough to recover its full meaning. We as humans are pretty darn good at interpreting a sentence’s meaning from context (go humans!), but it’s a tough process for machines.

When we talk about parsing in computational linguistics, sentences have to be interpreted through an established grammar: a set of rules describing how a given language works, used to determine what is “grammatical” and what isn’t. That grammar serves as a framework for structuring the sentence into possible interpretations. I say “possible” here very deliberately, as human languages in general (and certainly English) have a great tendency toward ambiguity.

In practice, this ambiguity is usually resolved with one of several machine learning methods, many of which rely on the frequency of certain types of utterances. In one of the earliest proposed approaches, probabilistic context-free grammars (PCFGs), an ambiguous sentence is compared against other instances of the same or similar sentences, and the parser simply assigns a probability to each possible interpretation. We’ve come a long way since then, but there’s a lot to cover there… so I won’t! Not today, at least.
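To show the PCFG idea in miniature: each rewrite rule carries a probability (estimated in real systems by counting rule frequencies in a parsed corpus), and a parse’s score is the product of the probabilities of every rule it uses. The rules and numbers below are invented purely for illustration.

```python
from math import prod

# Hypothetical rule probabilities for our tiny grammar.
P = {
    "S -> NP VP": 1.0,
    "VP -> V NP PP": 0.2,        # PP attaches to the verb phrase
    "VP -> V NP": 0.8,
    "NP -> NP PP": 0.1,          # PP attaches to the noun phrase
    "NP -> I": 0.3,
    "NP -> an elephant": 0.4,
    "PP -> in my pajamas": 1.0,
    "V -> shot": 1.0,
}

def score(derivation):
    """Probability of a parse = product of the probabilities of its rules."""
    return prod(P[rule] for rule in derivation)

# The two derivations of the ambiguous sentence, as lists of rules used.
pp_on_verb = ["S -> NP VP", "NP -> I", "VP -> V NP PP",
              "V -> shot", "NP -> an elephant", "PP -> in my pajamas"]
pp_on_noun = ["S -> NP VP", "NP -> I", "VP -> V NP",
              "V -> shot", "NP -> NP PP",
              "NP -> an elephant", "PP -> in my pajamas"]

print(score(pp_on_verb))   # probability of "I, in my pajamas, shot it"
print(score(pp_on_noun))   # probability of "the elephant wore my pajamas"
```

With these made-up numbers the verb-attachment reading scores higher, which is exactly the behavior you’d want: the parser ranks the everyday interpretation above the punchline.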

More interesting, perhaps, is how our present-day voice assistants actually work. Alexa, for example, doesn’t do all this hard work locally (neither do the others, although that may not always be the case). Your Echo device takes the sound file it has recorded (read my last post if you’re curious how that magic works) and passes it to the Alexa Service, hosted on Amazon’s cloud, where the bulk of the processing is done. Even then, the work is very trimmed down compared to the weighty model described above. Alexa acts on a few key pieces it looks for in a request, and it uses those to determine the basic meaning of what you’re asking for. An example of a request is shown below, taken from an excellent guide for those looking for a quick overview of beginning development for Alexa-enabled devices:

Alexa’s way of parsing a request (above), and the data it sends to the skill (below) (Source)

These, and most examples of the same kind of request to voice assistants, are radically simplified: all that really needs to happen is determining the invocation and skill names, parsing out where the “utterance” is, and acting on just that small snippet. Even then, it’s much easier to parse a request whose structure is already known; you’re asking your device to perform a task for you, and that severely limits the possibilities of what you could be saying.
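A sketch of what that simplified parsing looks like on the skill’s side: the cloud service hands your code a structured JSON payload, and all the skill does is pull out the intent name and any slot values. The field layout below follows the Alexa Skills Kit IntentRequest format, but the skill, intent, and slot names are made up for illustration.

```python
import json

# A simplified Alexa-style IntentRequest payload (illustrative values).
raw = """
{
  "request": {
    "type": "IntentRequest",
    "intent": {
      "name": "GetHoroscope",
      "slots": {
        "Sign": {"name": "Sign", "value": "virgo"}
      }
    }
  }
}
"""

def parse_request(payload):
    """Pull out the two things a skill usually acts on:
    the intent name and the slot values."""
    request = json.loads(payload)["request"]
    if request["type"] != "IntentRequest":
        return None, {}
    intent = request["intent"]
    slots = {name: slot.get("value")
             for name, slot in intent.get("slots", {}).items()}
    return intent["name"], slots

intent_name, slots = parse_request(raw)
print(intent_name, slots)
```

Notice how little linguistic work remains at this stage: the heavy lifting of turning free speech into a known intent-plus-slots shape already happened in the cloud, and the skill just dispatches on the result.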

There’s a lot to this, and I’m only beginning to scratch the surface, but the gist is that ‘parsing’ what you’re saying outside the scope of virtual assistants is a huge ordeal fraught with errors and inconsistencies, especially when you consider that the way we humans speak is, frankly, fraught with errors and inconsistencies itself. Even in the limited context we’re assessing with Alexa, there’s a lot of work to be done. I definitely plan to continue digging into the nitty-gritty of what happens to get from A to B, but I hope this little peek was at least marginally insightful!

Alex Kitelinger