What Is The Meaning Of All This
The earliest Sumerian writing consisted of non-phonetic logograms. That’s to say, it was not based on the specific sounds of the Sumerian language, and it could have been pronounced with entirely different sounds to yield the same meaning in any other language.
— Jared Diamond, Guns, Germs & Steel
This quote highlights the fact that humans model the world around them as concepts which are imbued with meaning. Language first, and writing subsequently, were invented to enable humans to communicate meaning-loaded concepts with each other. This suggests that the brain is able to decode the elements that carry meaning, whether from sound (spoken language) or symbols (writing). Likewise, the brain can reverse the process, encoding a series of ideas into speech or text.
While we humans do this encoding and decoding of meaning effortlessly, the complexity of the entire process is not readily understood. (Did you know that inventing writing is so hard that it is believed to have been independently invented only twice in human history?) Modeling this machinery effectively, however, is necessary if we are to train machines to work with “natural language.”
Let’s examine how a machine could “understand” natural language and extract an abstract representation of “meaning” that it can work with.
A simple diagram illustrates the basic components.
At the left, we gather input. The spoken or textual input enters a Decoder, where the input is processed through multiple steps to extract a representation that the machine can understand. This entire process is called Comprehension. The output of this decoding process is a Meaning.
The meaning object is consumed first by the execution engine and subsequently by the Encoder. The Encoder (also called “Language Synthesis”) is where we construct a response based on the extracted meaning and the results of the execution engine.
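The flow described above can be sketched in a few lines of Python. This is only an illustration of the ordering (decode, then execute, then encode). The `Meaning` class, the `run_pipeline` function, and the three toy stages are hypothetical names invented for this sketch and are not Ozlo's actual components.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Meaning:
    """Hypothetical stand-in for the output of the Decoder."""
    text: str
    facts: dict = field(default_factory=dict)


def run_pipeline(
    utterance: str,
    decode: Callable[[str], Meaning],
    execute: Callable[[Meaning], dict],
    encode: Callable[[Meaning, dict], str],
) -> str:
    meaning = decode(utterance)      # Comprehension: input -> Meaning
    results = execute(meaning)       # the execution engine consumes the Meaning first
    return encode(meaning, results)  # Language Synthesis builds the response


# Toy stages, for illustration only.
def toy_decode(utterance: str) -> Meaning:
    facts = {"wants": "pizza"} if "pizza" in utterance else {}
    return Meaning(text=utterance, facts=facts)


def toy_execute(meaning: Meaning) -> dict:
    found = ["Example Pizzeria"] if meaning.facts.get("wants") == "pizza" else []
    return {"places": found}


def toy_encode(meaning: Meaning, results: dict) -> str:
    places = results["places"]
    return f"Try {places[0]}." if places else "Sorry, I found nothing."


reply = run_pipeline("where can I get a pizza ?", toy_decode, toy_execute, toy_encode)
print(reply)  # Try Example Pizzeria.
```

Passing the stages in as functions keeps the sketch honest about what it shows: only the order in which the Meaning object is produced and consumed, not how any stage works internally.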
We will concentrate on the first part of this operation: decoding input so that we can extract meaning, also known as Comprehension.
Breaking Down Comprehension
Two prominent efforts in research and academia to capture meaning are logical forms and Abstract Meaning Representation (AMR). At Ozlo we have kept our goals quite modest. We do not require a fully specified “meaning representation” that condenses all aspects of natural language into meaning trees. Instead, we pick areas that are relevant to our domains and product capabilities. We do not aspire to be an all-purpose chatbot, so we can start with simpler elements of meaning representation and incrementally add more complex elements of natural language as we need them.
Two methods that are immediately useful to us are:
- Modeling concepts. Concepts are the semantic units of meaning that Ozlo understands. Within each domain that Ozlo knows about, like Food, or Movies, there are concepts that help him put utterances into context. Within the Food domain, Cuisine and Food Preferences are the sorts of concepts Ozlo needs to be conversant in.
- Modeling actions. These cover a fairly broad range, from modeling the type of question to a granular understanding of the actual verb actions. Some examples are:
  - commands (e.g. do something, get something)
  - interrogatives (e.g. asking about entities or their attributes)
  - statements (e.g. expressing preferences, greetings, salutations)
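One way to picture these two kinds of modeling is as a pair of small data types. The sketch below is an assumption for illustration only: the names `ActionType`, `Concept`, and `guess_action_type` are hypothetical, and the keyword heuristic stands in for whatever trained model a real system would use.

```python
from dataclasses import dataclass
from enum import Enum


class ActionType(Enum):
    COMMAND = "command"
    INTERROGATIVE = "interrogative"
    STATEMENT = "statement"


@dataclass(frozen=True)
class Concept:
    domain: str  # e.g. "Food" or "Movies"
    name: str    # e.g. "Cuisine"
    value: str   # the surface value found in the utterance


def guess_action_type(tokens: list[str]) -> ActionType:
    """Crude keyword heuristic, for illustration only."""
    wh_words = {"where", "what", "who", "when", "which", "how"}
    imperative_verbs = {"get", "find", "show", "book", "order"}
    if not tokens:
        return ActionType.STATEMENT
    first = tokens[0].lower()
    if first in wh_words or tokens[-1] == "?":
        return ActionType.INTERROGATIVE
    if first in imperative_verbs:
        return ActionType.COMMAND
    return ActionType.STATEMENT


cuisine = Concept(domain="Food", name="Cuisine", value="Italian")
print(guess_action_type("where can I get a pizza ?".split()))  # ActionType.INTERROGATIVE
```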
Technically the comprehension component can be thought of as a movement from syntactic to semantic elements. As the utterance passes through Ozlo’s comprehension pipeline we extract increasingly detailed semantic elements.
Let us take an example user input and work through the various components.
where can I get a pizza ?
The user input is a raw run of text (this could be acquired from a text interface or transcribed from voice to text). This is run through a tokenization framework to get a sequence of tokens (or words). The tokenization of the example would be:
where | can | I | get | a | pizza | ?
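A tokenizer for input like this can be very small. The sketch below uses a single regular expression that keeps words (including simple contractions) together and splits punctuation into separate tokens. It is a minimal stand-in, not the tokenization framework Ozlo uses, and the names `TOKEN_PATTERN` and `tokenize` are made up for this example.

```python
import re

# A word, optionally followed by an apostrophe suffix (e.g. "what's"),
# or any single character that is neither a word character nor whitespace.
TOKEN_PATTERN = re.compile(r"\w+(?:'\w+)?|[^\w\s]")


def tokenize(text: str) -> list[str]:
    """Split raw text into word and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)


tokens = tokenize("where can I get a pizza ?")
print(tokens)  # ['where', 'can', 'I', 'get', 'a', 'pizza', '?']
```

Because punctuation is matched separately, the result is the same whether or not the user leaves a space before the question mark.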
In the annotation phase, the comprehension engine attaches rich metadata to the tokens generated in the previous step. Metadata attached to token spans provides a rich signal for downstream processing. Two of the most important pieces of metadata we attach are part-of-speech (POS) tags and categorical, or named-entity recognition (NER), tags. We have created our own POS tagger, trained specifically on utterance structures drawn from our domains and product experience. This gives us a POS tagger that has very high accuracy on the utterances we are interested in and does fairly well with more general language input.
The following is a snippet of the annotation labels we generate. It is a small subset of the entire universe of labels. Notice also that the annotation process can work over multiple token spans (multiple segmentations of the input utterance):
We now build a Constituency Parse that uses the POS tags generated by our focused POS tagger. The Constituency Parser is trained on a corpus which is similar to the POS tagger training data. This approach allows us to target a single training set for multiple components in the comprehension pipeline. In fact, we have spent quite some effort creating the tooling required to collect consistent judgements from uniform training sets of utterances for the various trainable components in our comprehension stack.
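As a rough sketch of what span annotations could look like in code, the example below attaches POS and category labels to token spans. Everything in it is an assumption for illustration: the POS tags follow the Penn Treebank convention rather than Ozlo's own tag set, the dictionary lookups stand in for trained taggers, and the labels `food.dish` and `food.cuisine` are invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    start: int  # index of the first token in the span
    end: int    # index one past the last token in the span
    kind: str   # "POS" or "NER"
    label: str


# Hypothetical lookup tables standing in for trained models.
POS_LEXICON = {"where": "WRB", "can": "MD", "i": "PRP", "get": "VB",
               "a": "DT", "pizza": "NN", "?": "."}
GAZETTEER = {("pizza",): "food.dish",
             ("thai",): "food.cuisine",
             ("thai", "food"): "food.cuisine"}


def annotate(tokens: list[str], max_span: int = 3) -> list[Annotation]:
    annotations = []
    # POS tags apply to single tokens.
    for i, token in enumerate(tokens):
        tag = POS_LEXICON.get(token.lower())
        if tag:
            annotations.append(Annotation(i, i + 1, "POS", tag))
    # Category labels are tried on every span of up to max_span tokens,
    # so overlapping segmentations of the utterance can all be labeled.
    for size in range(1, max_span + 1):
        for start in range(len(tokens) - size + 1):
            key = tuple(t.lower() for t in tokens[start:start + size])
            label = GAZETTEER.get(key)
            if label:
                annotations.append(Annotation(start, start + size, "NER", label))
    return annotations


for annotation in annotate(["where", "can", "I", "get", "a", "pizza", "?"]):
    print(annotation)
```

Storing spans as half-open `(start, end)` token ranges lets a single list hold both single-token POS tags and overlapping multi-token category labels.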
For a small startup this kind of effort is very important, since it allows us to maintain a very sharp focus on the product experience by targeting areas in language comprehension that we need to be really good at and making sure all components in the comprehension stack are trained and tested towards the same targets.
To return to our example, a constituency parse yields a result that looks like this:
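For readers who want something concrete to hold on to, here is one plausible constituency parse of the example, built by hand as a small tree. The labels follow the Penn Treebank convention, which is an assumption: Ozlo's parser and tag set may produce something different. The `Node` class is a hypothetical helper for this sketch.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str
    children: list = field(default_factory=list)  # each child is a Node or a token string

    def leaves(self) -> list:
        """Return the tokens under this node, left to right."""
        tokens = []
        for child in self.children:
            tokens.extend(child.leaves() if isinstance(child, Node) else [child])
        return tokens

    def bracketed(self) -> str:
        """Render the subtree in bracketed (Treebank-style) notation."""
        parts = [c.bracketed() if isinstance(c, Node) else c for c in self.children]
        return "(" + " ".join([self.label] + parts) + ")"


parse = Node("SBARQ", [
    Node("WHADVP", [Node("WRB", ["where"])]),
    Node("SQ", [
        Node("MD", ["can"]),
        Node("NP", [Node("PRP", ["I"])]),
        Node("VP", [
            Node("VB", ["get"]),
            Node("NP", [Node("DT", ["a"]), Node("NN", ["pizza"])]),
        ]),
    ]),
    Node(".", ["?"]),
])

print(parse.bracketed())
# (SBARQ (WHADVP (WRB where)) (SQ (MD can) (NP (PRP I)) (VP (VB get) (NP (DT a) (NN pizza)))) (. ?))
```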
With the Constituency Parse in place we now have a complete syntactic representation of the input utterance which captures both the syntactic units (POS tags) and the relationships between those elements (constituent structure).
Ozlo then combines this syntactic structure with a semantic bag of information to generate a coarse-grained Meaning Representation.
The Meaning Representation tree is represented as a semantically-denoted Predicate-Argument data structure. This Predicate-Argument data structure is inspired by predicate argument structure in linguistics but is not exactly the same.
Our example yields the following Predicate-Argument data structure.
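A Predicate-Argument structure of this kind might be modeled along the following lines. This is a guess at the general shape, not Ozlo's schema: the class names, the role names (`agent`, `theme`), and the values placed in the mood, question type, and semantic bag fields are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Argument:
    role: str  # hypothetical role name, e.g. "agent" or "theme"
    text: str  # the span of the utterance filling this role
    semantic_bag: dict = field(default_factory=dict)


@dataclass
class PredicateArgument:
    predicate: str
    mood: str                     # e.g. "interrogative", "imperative", "declarative"
    question_type: Optional[str]  # None when the utterance is not a question
    arguments: list = field(default_factory=list)

    def semantic_facts(self) -> dict:
        """Merge the semantic bags of all arguments into one dictionary."""
        merged = {}
        for argument in self.arguments:
            merged.update(argument.semantic_bag)
        return merged


meaning = PredicateArgument(
    predicate="get",
    mood="interrogative",
    question_type="location",
    arguments=[
        Argument(role="agent", text="I"),
        Argument(role="theme", text="a pizza",
                 semantic_bag={"domain": "Food", "concept": "Dish"}),
    ],
)

print(meaning.semantic_facts())  # {'domain': 'Food', 'concept': 'Dish'}
```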
The Mood, Question Type & semanticBag in figure 5 are inferences that we draw from both the syntactic structure and the semantic annotations that we are able to extract from parts of the utterance.
This Predicate-Argument data structure is what we call the Meaning Representation artifact that the comprehension layer produces.
The purpose of processing the utterance via the comprehension stack and generating this Meaning Representation artifact is to convert the natural language utterance to an abstraction that the machine can work with. At this level of abstraction we have:
- extracted the grammatical structure (the predicate argument structure)
- inferred the coarse form (e.g. whether it is a statement, question, or command)
- attached bags of semantic information to the appropriate parts of the structure.
What Happens Beyond Comprehension
The consumer of the Meaning Representation is an Intent Classification system. The intent classifier is able to convert the Meaning Representation to a set of features which it matches against a set of tasks registered with the system at startup time.
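The idea of turning a Meaning Representation into features and matching them against registered tasks can be illustrated with a toy matcher. Note the substitution: this sketch scores tasks by simple set overlap, whereas a real intent classifier would more likely be a trained model. The task names, feature strings, and the functions `register_task`, `extract_features`, and `classify` are all hypothetical.

```python
TASK_REGISTRY: dict[str, frozenset[str]] = {}


def register_task(name: str, required_features: set[str]) -> None:
    """Register a task together with the features it expects (done at startup)."""
    TASK_REGISTRY[name] = frozenset(required_features)


def extract_features(meaning: dict) -> set[str]:
    """Flatten a meaning representation into "key=value" feature strings."""
    features = {f"mood={meaning['mood']}", f"predicate={meaning['predicate']}"}
    if meaning.get("question_type"):
        features.add(f"question_type={meaning['question_type']}")
    for key, value in meaning.get("semantic_bag", {}).items():
        features.add(f"{key}={value}")
    return features


def classify(meaning: dict):
    """Return the task whose required features are best covered, or None."""
    features = extract_features(meaning)
    best_task, best_score = None, 0.0
    for name, required in TASK_REGISTRY.items():
        if not required:
            continue
        score = len(features & required) / len(required)
        if score > best_score:
            best_task, best_score = name, score
    return best_task


register_task("find_place", {"mood=interrogative", "question_type=location", "domain=Food"})
register_task("play_trailer", {"mood=imperative", "domain=Movies"})

example = {
    "mood": "interrogative",
    "question_type": "location",
    "predicate": "get",
    "semantic_bag": {"domain": "Food"},
}
print(classify(example))  # find_place
```

Nothing in `classify` looks at the original words of the utterance, only at the features derived from the Meaning Representation.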
The fact that we are able to convert any input utterance (in theory, even one in a different language) into a Meaning Representation allows the intent classifier to aspire to be language independent. This allows the remainder of the system to deal strictly with a machine-compiled representation of the input, which is a big win for developing software that can work with varied inputs. Like the Sumerian logograms, we can process concepts independent of their original encoding.
Discussion of the intent classification and dialog management systems is beyond the scope of this post. In subsequent posts we will explore some of the components that make up the comprehension stack in more detail. We will also discuss the learning algorithms employed at various parts of the comprehension stack. Stay tuned …