Deep Learning and NLU
Note: I documented the following for my recently open-sourced (though now largely deprecated) natural language understanding system.
I view natural language understanding (NLU) as the process of taking arbitrary textual input (including voice-to-text transcriptions), of arbitrary size and with or without context, and outputting a cross-linguistically consistent semantic representation (e.g., a lambda calculus representation). One of the primary reasons I paused development of the above project is that I believe the core of my system will soon be achievable with deep learning, and eventually superior, obsoleting most of my work.
Below, I describe how deep learning can achieve the components that comprise this NLU process. I refer to Google’s SyntaxNet in most of the descriptions because SyntaxNet is the most complete, accurate, and well-documented open-source implementation of these deep learning approaches; other papers have documented similar findings.
A. Text segmentation (word boundary identification)
Google’s SyntaxNet uses pre-trained neural networks to identify word boundaries in 40+ languages (with very high accuracy), including Chinese (which does not separate words with spaces).
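One common way to frame segmentation for a neural model is as a per-character decision: does a word end here? The sketch below shows only the decoding step under that framing. The boundary probabilities are hard-coded stand-ins for a trained model's output, and `segment` is a hypothetical helper, not part of SyntaxNet.

```python
from typing import List

def segment(text: str, boundary_probs: List[float], threshold: float = 0.5) -> List[str]:
    """Split text into words given P(a word ends after character i).

    boundary_probs stands in for the output of a trained character-level
    model; the values used below are made up for illustration.
    """
    if len(text) != len(boundary_probs):
        raise ValueError("need one probability per character")
    words, start = [], 0
    for i, p in enumerate(boundary_probs):
        # Always close the final word, even if the model is unsure.
        if p >= threshold or i == len(text) - 1:
            words.append(text[start:i + 1])
            start = i + 1
    return words

# "我喜欢猫" ("I like cats"): boundaries after 我, after 欢, and after 猫.
print(segment("我喜欢猫", [0.9, 0.1, 0.8, 0.95]))  # ['我', '喜欢', '猫']
```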
B. Entity recognition
Identifying entities (e.g., the name of a city, restaurant, or person) can be incredibly computationally intensive if forced to search for every input n-gram in every possible index of entities (e.g., indices of cities, restaurants, people). Ergo, it is best to use pre-trained language models to estimate the probability that the next term belongs to a particular entity category based on the preceding terms in the input. For example, a trained language model can identify that the next term in “people who live in …” has a high likelihood of appearing in location-related indices (e.g., cities, countries), and will avoid searching low-probability entity categories (of which there could be dozens; e.g., people, book/film names). Applying ML to entity recognition is not new and was a planned component of my system that I never reached.
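The pruning idea can be sketched with a deliberately tiny stand-in for a language model: counts of which entity category followed the last context word in some training data. Everything here (the training pairs, the indices, the function names, the 0.2 cutoff) is hypothetical and chosen only to show the control flow; a real system would use a trained model over the full context.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Optional, Tuple

# Hypothetical training data: (preceding context, category of the entity that followed).
TRAINING = [
    ("people who live in", "city"),
    ("people who live in", "country"),
    ("people who live in", "city"),
    ("books written by", "person"),
    ("restaurants in", "city"),
    ("friends of", "person"),
]

# Hypothetical entity indices, one per category.
INDICES = {
    "city": {"paris", "tokyo"},
    "country": {"france"},
    "person": {"ada lovelace"},
}

def train(examples: List[Tuple[str, str]]) -> Dict[str, Counter]:
    """Count which entity categories follow each final context word."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for context, category in examples:
        counts[context.split()[-1]][category] += 1
    return counts

def likely_categories(counts: Dict[str, Counter], context: str,
                      min_prob: float = 0.2) -> List[str]:
    """Return categories whose estimated probability clears min_prob."""
    seen = counts.get(context.split()[-1])
    if not seen:
        return []  # No evidence: the caller falls back to every index.
    total = sum(seen.values())
    return [c for c, n in seen.most_common() if n / total >= min_prob]

def recognize(counts: Dict[str, Counter], context: str, term: str) -> Optional[str]:
    """Look the term up only in the indices the context makes plausible."""
    categories = likely_categories(counts, context) or list(INDICES)
    for category in categories:
        if term.lower() in INDICES.get(category, set()):
            return category
    return None

model = train(TRAINING)
print(likely_categories(model, "people who live in"))  # ['city', 'country']
print(recognize(model, "people who live in", "Paris"))  # city
```

The point of the design is that the "person" index is never touched for this context, which is where the savings come from once there are dozens of categories.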
C. Morphological analysis
Similar to text segmentation, Google’s SyntaxNet can infer inflection, which is pertinent mainly to dependency parsing but also applies to WSD and grammatical conjugation. “In Russian, a heavily inflected language, morphology can indicate number, gender, whether the word is the subject or object of a sentence, possessives, prepositional phrases, and more.”
D. Word vectors
After input text segmentation and terminal symbol identification and analysis, the next component of a deep learning NLU system would map the text input to word/phrase embeddings. These vectors would be pre-trained with unsupervised learning (e.g., skip-gram) on large corpora (e.g., Wikipedia corpus). These vector representations are essential for the parse tree generation in the next step, but also capture the fuzzy nature of language, such as identifying semantically similar phrases. Also, given language’s fuzziness, there can be varying degrees of precision in mapping to semantic representations. For example, in some use cases, two terms qualify as synonymous (e.g., “alarm” and “clock”), while in others the same terms should not. Vector computations enable the probabilistic semantic analysis to modulate the degree of precision (or semantic fidelity) appropriate for the current task or interface.
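A minimal sketch of how a similarity threshold can modulate that precision, using cosine similarity. The three-dimensional vectors are invented for illustration (real embeddings are learned from a corpus and have hundreds of dimensions), and `are_synonymous` is a hypothetical helper.

```python
import math
from typing import List

# Toy vectors, made up for this example.
VECTORS = {
    "alarm": [0.9, 0.4, 0.1],
    "clock": [0.7, 0.6, 0.2],
    "banana": [0.1, 0.2, 0.9],
}

def cosine(u: List[float], v: List[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def are_synonymous(a: str, b: str, threshold: float) -> bool:
    """Treat two terms as interchangeable if their similarity clears the threshold."""
    return cosine(VECTORS[a], VECTORS[b]) >= threshold

# A loose interface (e.g., voice commands) might accept a lower threshold
# than one where the distinction between the two terms matters.
print(are_synonymous("alarm", "clock", threshold=0.90))  # True
print(are_synonymous("alarm", "clock", threshold=0.98))  # False
```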
E. Parse trees
Parse trees are the paramount component that differentiates NLU from NLP. Primarily, parsing determines the relationships between terms within a text sequence to infer the text’s meaning. These relationships are modeled with a parse tree and an associated semantic tree (which would map to a language-independent semantic representation). Google’s SyntaxNet demonstrates that a simple feed-forward neural network can construct these parse trees: “Given a sentence as input, it tags each word with a part-of-speech (POS) tag that describes the word’s syntactic function, and it determines the syntactic relationships between words in the sentence, represented in the dependency parse tree.”
Admittedly, SyntaxNet performs only part-of-speech tagging and dependency parsing for grammatical structure, not semantic role labeling. But, once this approach identifies the syntactic relationship (e.g., subject + verb/action + object), the word vectors can infer the semantic representation of that relationship. Also, these structures account for the same dependency parsing needed for NLU tasks. While constructing a parse tree, SyntaxNet employs beam search to disambiguate its parses: “An input sentence is processed from left to right, with dependencies between words being incrementally added as each word in the sentence is considered. At each point in processing many decisions may be possible — due to ambiguity — and a neural network gives scores for competing decisions based on their plausibility.” Again, these syntactic parse trees are not NLU’s goal, but now that neural networks can accurately construct the syntactic structure of input and handle disambiguation, and vector spaces can model the word and phrase semantic representations, it is nearly possible to integrate these components to output rich semantic trees.
Google trains its models on treebanks provided by the Universal Dependencies project. Google also notes this deep learning approach is not yet perfect, but it is continuously improving: “While the accuracy is not perfect, it’s certainly high enough to be useful in many applications. The major source of errors at this point are examples such as the prepositional phrase attachment ambiguity described above, which require real world knowledge (e.g. that a street is not likely to be located in a car) and deep contextual reasoning.”
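The left-to-right, score-and-prune process in that quote can be sketched as beam search over parser actions. This is an illustration under stated assumptions, not SyntaxNet's code: it assumes an arc-standard transition system (SHIFT, LEFT-ARC, RIGHT-ARC), and the `PLAUSIBLE` table and `SHIFT_P` constant are made-up stand-ins for the scores a neural network would produce. Arcs are stored as word pairs for readability, so the sketch assumes no repeated words.

```python
import heapq
import math
from typing import List, Tuple

Arc = Tuple[str, str]  # (head, dependent)

# Hypothetical plausibility scores standing in for a trained network's output.
PLAUSIBLE = {("eats", "She"): 0.9, ("eats", "fish"): 0.9}
SHIFT_P = 0.5

def arc_score(head: str, dependent: str) -> float:
    return PLAUSIBLE.get((head, dependent), 0.1)

def parse(words: List[str], beam_width: int = 4) -> List[Arc]:
    """Return the arcs of the best-scoring parse found within the beam."""
    if not words:
        return []
    # Each state: (log score, stack, index of next unread word, arcs so far).
    beam = [(0.0, (), 0, ())]
    finished = []
    while beam:
        candidates = []
        for score, stack, i, arcs in beam:
            if i == len(words) and len(stack) == 1:
                finished.append((score, arcs))  # All words read, one root left.
                continue
            if i < len(words):  # SHIFT: read the next word onto the stack.
                candidates.append((score + math.log(SHIFT_P),
                                   stack + (words[i],), i + 1, arcs))
            if len(stack) >= 2:
                top, below = stack[-1], stack[-2]
                # LEFT-ARC: top becomes the head of the word below it.
                candidates.append((score + math.log(arc_score(top, below)),
                                   stack[:-2] + (top,), i, arcs + ((top, below),)))
                # RIGHT-ARC: the word below becomes the head of top.
                candidates.append((score + math.log(arc_score(below, top)),
                                   stack[:-1], i, arcs + ((below, top),)))
        # Keep only the highest-scoring partial parses.
        beam = heapq.nlargest(beam_width, candidates, key=lambda s: s[0])
    best = max(finished, key=lambda f: f[0])
    return sorted(best[1])

print(parse(["She", "eats", "fish"]))  # [('eats', 'She'), ('eats', 'fish')]
```

A wider beam keeps more competing analyses alive at each step, which costs time but lets a locally unattractive decision win later if the rest of the sentence supports it.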
Note: One such possible deep learning approach is to use neural networks to output dense parse forests, and then use existing implementations to search the parse forests for the k-best semantically valid, disambiguated parse trees with their associated semantic trees and grammatically correct display-text. I wrote a scant account of this approach here.
F. Unseen words
Neural networks can generate character-based input representations, instead of word-based embeddings, to determine the semantic structure of unseen terms by identifying “systematic patterns in morphology and syntax [to] allow us to guess the grammatical function of words even when they are completely novel. […] By doing so, the models can learn that words can be related to each other because they share common parts (e.g. ‘cats’ is the plural of ‘cat’ and shares the same stem; ‘wildcat’ is a type of ‘cat’). […] These models […] are thus much better at predicting the meaning of new words based both on their spelling and how they are used in context.”
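To make the "shared parts" intuition concrete, here is a much simpler technique than the character-level networks the quote describes: represent each word as a bag of character trigrams and compare the bags with cosine similarity. It involves no learning at all, and the function names are my own; it only shows why spelling alone already relates a novel word to known ones.

```python
import math
from collections import Counter

def char_ngrams(word: str, n: int = 3) -> Counter:
    """Bag of character n-grams, with < and > marking the word boundaries."""
    padded = f"<{word}>"
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))

def similarity(a: str, b: str) -> float:
    """Cosine similarity between the n-gram bags of two words."""
    ga, gb = char_ngrams(a), char_ngrams(b)
    dot = sum(count * gb[gram] for gram, count in ga.items())
    norm = math.sqrt(sum(c * c for c in ga.values())) * \
           math.sqrt(sum(c * c for c in gb.values()))
    return dot / norm if norm else 0.0

# 'cats' and 'wildcat' each share trigrams with 'cat'; 'dog' shares none.
for other in ("cats", "wildcat", "dog"):
    print(other, round(similarity("cat", other), 3))
```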
G. Language independence
As SyntaxNet demonstrates, the implementations that use word embeddings and neural networks are independent of language, and Google has trained neural networks for 40+ languages. Again, Google trains its models on treebanks provided by the Universal Dependencies project.
H. Grammatical conjugation
Grammatical conjugation is not unique to NLU, but it is essential for any natural language interface, including tasks that must map a given semantic representation to its corresponding display-text (i.e., the reverse of the above process). Using the same models that the morphological analysis (part C) employs, coupled with the syntactic structure that the parse trees reveal (part E), these systems can correctly conjugate terms according to their language’s grammar.
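The interface of that generation step can be sketched as a function from a lemma plus the subject's morphological features (which the parse tree identifies) to a surface form. The version below is a hand-written, English-only, present-tense rule set, used here purely as a stand-in for a learned morphological model; `conjugate_present` and the `IRREGULAR` table are hypothetical.

```python
# Irregular forms keyed by (lemma, person, number); deliberately incomplete.
IRREGULAR = {
    ("be", 1, "sg"): "am",
    ("be", 3, "sg"): "is",
    ("have", 3, "sg"): "has",
}

def conjugate_present(lemma: str, person: int, number: str) -> str:
    """Return the English present-tense form agreeing with the subject."""
    if (lemma, person, number) in IRREGULAR:
        return IRREGULAR[(lemma, person, number)]
    if lemma == "be":
        return "are"
    if person != 3 or number != "sg":
        return lemma  # Only third-person singular is marked in English.
    if lemma.endswith(("s", "sh", "ch", "x", "z", "o")):
        return lemma + "es"
    if lemma.endswith("y") and lemma[-2:-1] not in "aeiou":
        return lemma[:-1] + "ies"  # try -> tries, but play -> plays
    return lemma + "s"

# Subject features would come from the parse; here they are given by hand.
print("people who", conjugate_present("live", 3, "pl"), "in Paris")
print("a person who", conjugate_present("live", 3, "sg"), "in Paris")
```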