Honest Guide to Machine Learning: Part Three

Axiom Zen Team
Published in Axiom Zen
12 min read · Dec 5, 2016

Natural Language Processing

The Honest Guide to Machine Learning provides a deep dive into machine learning technology — no math degree necessary.

Part 1 of our Honest Guide to Machine Learning
Part 2 of our Honest Guide to Machine Learning

In Part 3 of our guide, we introduce you to the world of natural language processing — what it is, the different forms it takes, and the challenges that are stopping us from conquering it in its entirety.

What is Natural Language Processing?

To understand what natural language processing (NLP) is, we need to know what a “natural” language is. We use this term to differentiate human languages from computer languages like C++ or Python. You may find it hard to believe, but there are over 6,500 languages spoken in the world today. Of that number, roughly 2,000 have fewer than 1,000 speakers; the language with the most speakers is Mandarin Chinese, with around a billion. So when we talk about processing ‘natural’ language, we mean human language.

NLP comes in two forms: generation and understanding. While most researchers focus on one or the other (and today’s article will mostly focus on understanding), there are some researchers who believe it’s a mistake to divide the field into two separate camps. They think a model which will do both things is possible, and stronger for it — in such a case, the divide is instead based on application. Some possible applications include machine translation (Human Language A to Human Language B), question answering (Siri, IBM Watson), and information retrieval.

In some ways, talking about Natural Language Processing is similar to talking about machine learning — everything comes down to input, and the way that input is processed. So, we’ll begin at the logical place.

Input

In order to process language, researchers break that language down into smaller processing units. They have a few options to choose from — the larger the unit, the more data necessary to process it. Options include phonemes, morphemes, lexical information (called tokens), phrases, sentences, or paragraphs.

The most common choice for natural language processing is tokens. Tokens are generally synonymous with words, except they include added normalization. Say there’s a sentence that needs to be broken down into tokens: you don’t just consider the words. You also take into account question marks and periods, which can affect the nature of the words in that sentence. (Note that while this sounds easy in English, in other languages tokens can actually be quite difficult to capture. For example, Mandarin has no boundaries between words; splitting the symbols in different places makes different words, and meaning is all about context.)

Preprocessing (also called normalization) helps researchers turn words into tokens. This is generally done using pre-existing rules, which have to be manually created. Dividing sentences, handling question marks and periods, and reducing verbs to their stem (“going” to “go,” for instance) are all examples of NLP preprocessing.
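To make this concrete, here is a minimal sketch of tokenization and stemming using only the Python standard library. The suffix-stripping stemmer is a toy stand-in for real algorithms like Porter stemming, and the regular expression is a deliberately simple rule of the kind described above.

```python
import re

def tokenize(text):
    """Split text into lowercase tokens, keeping sentence-ending
    punctuation as separate tokens so it can inform later steps."""
    return re.findall(r"\w+|[.?!]", text.lower())

def stem(token):
    """A toy stemmer: strip a few common English suffixes.
    Real systems use algorithms like Porter stemming instead."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 1:
            return token[: -len(suffix)]
    return token

tokens = tokenize("Where is she going?")
print(tokens)                     # ['where', 'is', 'she', 'going', '?']
print([stem(t) for t in tokens])  # ['where', 'is', 'she', 'go', '?']
```

Even this tiny example shows why rules are hard to get right: the stemmer happily mangles words like “pass,” which is exactly the kind of edge case manual normalization rules must account for.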

Types of Processing

After the input style is chosen, the next step is to choose the style of processing. In the same way we had to choose a model in machine learning, choosing a type of processing is all about your desired output. What is the higher level information you need to extract from your tokens or processing units? From the least to most complex, here are the options available:

Bag of Words

Analyzing the structure of a sentence can be incredibly difficult. Luckily, it isn’t always necessary. There are times when the simplest approach is to ignore the structure entirely, focusing solely on individual tokens. This creates a metaphorical ‘bag of words’ — each word is considered in relation to the other words in the bag, drawing conclusions and learning from their similarities. Example of application: Topic modelling often uses Bag of Words. When topic modelling, programmers need to understand the topic of a block of text. They create a bag of words for each article or piece of text, then select words from the ‘bag’ and figure out the topic based on how those words connect. For instance, if you saw recurring words like deep, learning, AI, and research, you would know that the text was about machine learning.
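A bag of words really is just a word count, so a toy version of the topic-guessing idea above fits in a few lines. The topics and cue-word lists here are invented for the example; real topic models learn them from data.

```python
from collections import Counter

# Hypothetical topics with hand-picked cue words (a real system learns these).
TOPIC_WORDS = {
    "machine learning": {"deep", "learning", "ai", "research", "model"},
    "cooking": {"recipe", "oven", "flour", "bake"},
}

def guess_topic(text):
    """Build a bag of words, then score each topic by how many
    of its cue words occur in the bag. Word order is ignored."""
    bag = Counter(text.lower().split())
    scores = {topic: sum(bag[w] for w in words)
              for topic, words in TOPIC_WORDS.items()}
    return max(scores, key=scores.get)

print(guess_topic("new deep learning research in AI"))  # machine learning
```

Notice that scrambling the input sentence would give the same answer, which is exactly the trade-off Bag of Words makes: structure is thrown away in exchange for simplicity.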

Tagging

Perhaps you don’t need the full structure of the sentence, but you do care about the sequence of the tokens. Since Bag of Words wouldn’t work in this case, you would need to use tagging to process your input. Tagging is directly related to a task from machine learning called sequence labelling (structured classification). In tagging, researchers assign a tag to each token based on observations of the sequence: they observe the previous tokens, and tag the next one accordingly. If you had the sequence “I walked on the,” you would know that the next word was likely to be a noun. Two famous kinds of tagging are Part of Speech Tagging and Named Entity Tagging. Part of Speech Tagging labels tokens using 36 classes (such as CD — Cardinal Number and NNP — Proper Noun), and these are used to create parse trees (see below). Named Entity Tagging uses smaller sets of classes, which vary with each individual use; a common scheme divides entities into three classes: Person, Organization, and Location. Example of application: If you need to compute the probability of the next word in an existing sequence of text, you would use tagging. This is used on phones to predict what word you might want to use next while texting.
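The next-word-prediction application can be sketched with a simple bigram model: count which word follows which in a training corpus, then predict the most frequent continuation. The three-sentence corpus below is made up for illustration; a phone keyboard uses the same idea at much larger scale.

```python
from collections import Counter, defaultdict

def train_bigrams(corpus):
    """For each word, count which words followed it in the corpus."""
    follows = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.lower().split()
        for prev, nxt in zip(words, words[1:]):
            follows[prev][nxt] += 1
    return follows

def predict_next(follows, word):
    """Return the most frequent continuation seen in training."""
    return follows[word.lower()].most_common(1)[0][0]

corpus = [
    "I walked on the beach",
    "we sat on the sofa",
    "cats sleep on the beach",
]
model = train_bigrams(corpus)
print(predict_next(model, "the"))  # beach
```

This is sequence-sensitive in a way Bag of Words is not: swap the word order of the training sentences and the predictions change.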

Syntactic Processing

There are times when even a sequence isn’t enough. The next step forward in complexity is syntax. Syntactic processing cares about the relationship between tokens, but this relationship might not be about sequences.

Chunking

One kind of syntactic parsing is called chunking. Chunking lets you group sequences of tokens together in a pyramid shape, building higher and higher levels of information. Take the sentence, “Yesterday I walked from Axiom Zen’s headquarters to the Maritime Museum in Vancouver.” ‘Axiom Zen’ is a single chunk, and ‘Axiom Zen’s headquarters’ is a larger noun phrase; by building ‘Axiom Zen’ as your first chunk and ‘Axiom Zen’s headquarters’ as your second, you can begin to understand how noun phrases connect to each other. Imagine building a wall by placing bricks on top of each other: starting with the smallest chunks, you can build up until you have the entire wall — the whole sentence.
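A first layer of chunking can be sketched by grouping runs of noun-like tokens. The part-of-speech tags below follow Penn Treebank conventions (DT, JJ, NN, NNP), and the hand-tagged sentence is our running example; a real chunker would consume the output of an automatic tagger.

```python
def chunk_noun_phrases(tagged):
    """Greedily group consecutive determiners, adjectives, and nouns
    into noun-phrase chunks; everything else breaks the chunk."""
    chunks, current = [], []
    for word, tag in tagged:
        if tag in {"DT", "JJ", "NN", "NNP", "NNS"}:
            current.append(word)
        elif current:
            chunks.append(" ".join(current))
            current = []
    if current:
        chunks.append(" ".join(current))
    return chunks

sentence = [("Yesterday", "RB"), ("I", "PRP"), ("walked", "VBD"),
            ("from", "IN"), ("Axiom", "NNP"), ("Zen's", "NNP"),
            ("headquarters", "NN"), ("to", "TO"), ("the", "DT"),
            ("Maritime", "NNP"), ("Museum", "NNP")]
print(chunk_noun_phrases(sentence))
# ["Axiom Zen's headquarters", 'the Maritime Museum']
```

Each chunk here is one “brick”; further passes would stack these into prepositional phrases and, eventually, the whole sentence.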

Constituency Parsing

The next stage in processing evolution hasn’t entirely replaced chunking, but is often seen as a more powerful tool. Instead of blocks stacked on top of each other, constituency parsing applies context-free grammar to NLP. Remember our previous example, “Yesterday I walked from Axiom Zen’s headquarters.” Unlike the chunking structure, a tree allows free movement of tokens, so the sentence can be understood in many shapes: “From Axiom Zen’s headquarters, yesterday I walked.” This helps with ordering, and with understanding different orders. So where do these grammar rules come from? Humans create them, but they can also be learned automatically by the computer. Example of application: The Wall Street Journal was used to build the Penn Treebank, one of the first open resources for NLP.

Dependency Parsing

Constituency trees work incredibly well for English, but other languages require more powerful tools. In these cases we might turn to a DAG (directed acyclic graph), a structure in which you can begin at any node and follow directed links to reach the others.

Dependency parsing relies on a DAG. Every token in the sentence depends on another token (its head), and every token can in turn have its own dependents. The structure always starts with a root, which is almost always a verb. If we use our old faithful sentence, that root would be “walked.” Even if “walked” were the only word in the sentence, it would carry basic meaning. If we start with “I,” on the other hand, we have no context for the shape of the sentence, and have to wait for the action before we can infer meaning. So, in “I walked,” “I” depends on “walked,” as does “yesterday.” The difference between this and sequencing is that if you want to predict the next word after “Axiom Zen,” you aren’t comparing it to “Axiom Zen” to see how it relates; you’re still comparing it to “walked.”
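As a sketch, the dependency arcs for our example sentence can be stored as a simple head map. The head assignments below are illustrative hand annotations, not the output of a real parser.

```python
# Each token maps to its head; the root ("walked") has no head.
dependencies = {
    "walked": None,         # root verb
    "I": "walked",          # subject depends on the verb
    "yesterday": "walked",  # temporal modifier of the verb
    "from": "walked",
    "headquarters": "from",
    "Axiom Zen's": "headquarters",
}

def head_path(word):
    """Walk up the arcs from a word to the root."""
    path = [word]
    while dependencies[word] is not None:
        word = dependencies[word]
        path.append(word)
    return path

print(head_path("Axiom Zen's"))
# ["Axiom Zen's", 'headquarters', 'from', 'walked']
```

Every path ends at “walked,” which is why predicting what follows “Axiom Zen” means looking at the verb, not at the neighboring token.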

Dependency parsing is constrained to a single language at a time. There are researchers trying to make universal dependency parsers for every language, but it is very difficult.

Semantic Processing

Syntactic parsing allows us to understand connections between things, but still might not give the meaning of those things. The next level of processing, then, is semantic understanding. For example, “The table smiles in joy” is syntactically correct, but it has no meaning.

Sense Disambiguation

The easiest way to use semantic processing is sense disambiguation — to find the meaning of each token and map it to a database of meanings. For example, a token like “take” can have many different meanings (senses).
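A toy version of this mapping is the overlap idea behind the classic Lesk algorithm: pick the sense whose definition words overlap most with the surrounding context. The sense inventory below is invented for the example; real systems use databases like WordNet.

```python
# A hypothetical, hand-written sense inventory for one word.
SENSES = {
    "take": {
        "grasp": {"hand", "hold", "grab", "object"},
        "transport": {"drive", "bus", "travel", "passenger"},
    }
}

def disambiguate(word, context):
    """Choose the sense whose gloss words overlap most
    with the words of the surrounding context."""
    context_words = set(context.lower().split())
    overlaps = {sense: len(gloss & context_words)
                for sense, gloss in SENSES[word].items()}
    return max(overlaps, key=overlaps.get)

print(disambiguate("take", "the bus will take each passenger downtown"))
# transport
```

Here “bus” and “passenger” tip the decision toward the transport sense, which is exactly the mapping-to-a-database-of-meanings step described above, just in miniature.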

Entity Linking

Another branch of semantic understanding is called “entity linking” or “entity disambiguation.” The goal of entity linking is to connect an entity to the many possible ways of expressing that entity. So, if you were to see the word “Donald,” the goal would be to understand whether it referred to Donald Trump or Donald Duck.

Coreference Resolution

This is incredibly easy for humans, and incredibly challenging for machines. Imagine a paragraph whose first sentence is, “Wren and I considered programming a chatbot, but she thought we should focus on higher-level issues.” Coreference resolution is used to map “she” to “Wren.”

Compositional Semantics

The next progression after entity linking is compositional semantics. Once you know the semantics of each token, you still have to put them together to understand the larger meaning of the sentence. One use of compositional semantics is to find triples inside text. Take “I won a gold medal” and “I was born in 1882.” We can tell from the composition of these sentences that, based on the word “born,” 1882 is a date, and that the “I” in the two sentences refers to the same person.
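Triple extraction can be sketched with a couple of hand-written patterns that pull (subject, relation, object) out of text. The patterns and relation names below are made up for the example; real open information extraction systems learn such patterns rather than hard-coding them.

```python
import re

def extract_triples(sentence):
    """Match hand-written patterns to pull (subject, relation, object)
    triples out of a sentence."""
    patterns = [
        (r"(\w+) was born in (\w+)", "born_in"),
        (r"(\w+) won (?:a |an |the )?([\w ]+)", "won"),
    ]
    triples = []
    for pattern, relation in patterns:
        for match in re.finditer(pattern, sentence):
            triples.append((match.group(1), relation, match.group(2)))
    return triples

print(extract_triples("I was born in 1882"))  # [('I', 'born_in', '1882')]
print(extract_triples("I won a gold medal"))  # [('I', 'won', 'gold medal')]
```

Once both sentences are reduced to triples sharing the subject “I,” connecting them (as a knowledge base would) becomes a simple join on that subject.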

Other Processes

There are many other forms of semantic processing; these are just a few. Frames are used to find relationships between tokens; to train them, researchers create resources called FrameNets, which are frames built on top of text (much as the Penn Treebank was built for parse trees). Open domain extractions are similar to frames, but the labels don’t have to be verbs. Logical forms map a sentence to a formal representation, for instance an SQL query against a database: a chatbot can understand natural language and translate it into SQL so that a person can “talk” directly to a computer.
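The natural-language-to-SQL idea can be sketched with a single template. The question shape, table, and column names (employees, city) are invented for this example; production systems map parsed logical forms to queries rather than matching one fixed pattern, and they parameterize queries instead of splicing strings.

```python
import re

def question_to_sql(question):
    """Translate one fixed question shape into a SQL query.
    Returns None if the question doesn't match the known pattern."""
    match = re.match(r"how many employees work in (\w+)\??", question.lower())
    if match:
        return f"SELECT COUNT(*) FROM employees WHERE city = '{match.group(1)}';"
    return None

print(question_to_sql("How many employees work in Vancouver?"))
# SELECT COUNT(*) FROM employees WHERE city = 'vancouver';
```

The gap between this toy and a real system is precisely the semantic processing described above: recognizing that arbitrary phrasings of the question mean the same logical form.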

Pragmatics

Pragmatics focuses on the relationship between meaning and context. Consider this example: Pat and Chris are getting to know each other on a first date. At the end of the evening, Chris tells Pat, “I like you a lot.” Chances are, Pat will feel good about the situation. But imagine that Pat and Chris have been dating for some weeks, and Pat asks, “Do you love me?” Now if Chris says, “I like you a lot,” the reaction will likely be quite different! The same sentence, in different contexts, can have completely different meanings. Pragmatics is in the very early stages of research, with very few researchers or companies exploring it. (If you’re considering a PhD in NLP, this would be a great area to choose!)

NLP Challenges

Natural Language Processing is not a simple or straightforward branch of machine learning; if it were, we would already have solved it! From the very different ways humans and computers process information to the thousands of human languages, here are some of the reasons we still have a long way to go before we achieve human parity.

Ambiguity

Language is very ambiguous — even humans often have trouble understanding sentence meaning. If you consider syntactic trees, there could be three or four different parse trees for a single sentence. For example, “I ate a cake with a fork.” Did you eat the cake using a fork, or did you eat a cake and then eat a fork? The answer is obvious for a human, but very hard for a computer. (It is because of ambiguity that we’ve begun using probabilistic machine learning models for NLP.)

Many Ways to Express the Same Thing

There are many phrases that have the same meaning, and many meanings that can be paraphrased a number of ways. If I say ‘six’ or ‘half a dozen,’ they mean exactly the same thing — and don’t get us started on “six of one and half a dozen of the other.” There is work being done trying to map every possible paraphrasing of the same meaning, including Paraphrase.org — the Paraphrase Database.

Sparsity

The distribution of phrases and tokens in a language is incredibly uneven — common words are used often, but it can be hard to find data for uncommon words and phrases (which make up the vast majority of language). Data becomes very sparse, which makes algorithms much harder to create. A potential way to address this is to group or cluster tokens together. One obvious group is synonyms; another might be countries. This allows you to manually group words in a sort of dictionary, the most famous approach being Brown clustering.
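The unevenness is easy to demonstrate: count the tokens in any text and a handful of words dominates while most appear only once. The tiny sentence below stands in for a real corpus, where this long tail is far more extreme.

```python
from collections import Counter

text = ("the cat sat on the mat and the dog sat near the cat "
        "while a rare pangolin ambled past")
counts = Counter(text.split())

# A handful of words account for most occurrences...
print(counts.most_common(1))  # [('the', 4)]

# ...while most word types appear exactly once (the long tail).
singletons = [w for w, c in counts.items() if c == 1]
print(len(singletons), "of", len(counts), "distinct words occur once")
```

Words like “pangolin” are exactly the ones a model will have almost no data for, which is why grouping them with better-attested neighbors helps.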

Manual work of this kind is very expensive and time consuming — databases like these need to be constantly updated. New words and words with new meanings are always being created, as well as slang versions of existing words, like 2morrow. Deep learning allows you to group words and find the distance between them (neighborhood words) without a human updating the database. For example, we can tell Barack Obama and Hillary Clinton are connected because the text surrounding them mentions the United States, the presidency, and Democrats.

Does this kind of neighborhood representation help with semantic understanding? Consider orange and grapefruit: their meanings are not the same, but they’ll end up in the same group because both are citrus, both are fruit, and both appear at breakfast. Grouping helps with syntactic understanding, but unfortunately doesn’t advance semantic understanding.

Multilinguality

Most NLP resources are still only available in English, or a few European and Asian languages. The problem isn’t only that much of the research is being done in English, but that many languages are “resource poor,” and there simply isn’t enough data to build from. But if all of the research is done in and on English, and those tools are then applied to other languages, do they translate? Unfortunately, the answer for our current tools is no, but research is being done into changing that answer.

Diction and Dialect

The type of data that we have is usually from a specific domain, genre, or level of formality. If we train and design methods and models with a particular diction, those models become useless in other situations. Wall Street Journal language as a data set is not going to help you process tweets or texts, because Twitter language is essentially a different language from formal English.

Conclusion

Although these problems are significant, we already have many successful applications using NLP. IBM Watson won that famous game of Jeopardy, launching artificial intelligence once again into the public eye; Google’s Inbox can write pesky email replies for you; and Apple’s Siri, Google Assistant, and Amazon’s Alexa are only getting better every year. We believe these are the starting steps towards building accurate multilingual tools for semantic and pragmatic understanding.


Written by Ramtin Seraj and Wren Handman.

Want to do more than read about AI and Machine Learning?
Axiom Zen is always hiring!



Axiom Zen is a venture studio. We build startups both independently and in partnership with industry leaders. Follow our publication at medium.com/axiom-zen