Getting NLP Ready for Business

Haixun Wang · Published in AI Graduate · Feb 28, 2018 · 16 min read

(Image: The Chimp Who Learned Signed Language)

Artificial Intelligence and machine learning are making big strides in many areas. For some tasks, AI has already surpassed human levels of performance. Still, the most impressive breakthroughs in this new wave of AI come from image recognition and speech processing, and there is a feeling that Natural Language Processing (NLP) is lagging behind.

One thing that stands out in NLP is machine translation (MT): Recent neural network based approaches significantly outperform traditional MT methods. But some argue that end-to-end neural network approaches do not really “understand” the natural languages they manipulate. While we may debate what constitutes understanding, the quality of machine translation, especially for long sentences, does have a lot of room for improvement.

Meanwhile, there is a lot of passion for turning NLP into a driving force for new and established businesses. A friend of mine, a talented stock trader, wants to know whether NLP can help him read financial news and extract trading insights so that he can scale up his trading practice. Another friend is surveying approaches to building conversational bots knowledgeable enough to talk with patients and make medical diagnoses. Yet another friend is building a personal assistant that you can trust with every thought you have, so that it can offer advice that makes you feel happier, more fulfilled, and more positive about your life.

How far away are we from realizing these visions?

NLP: The State-of-the-Art

Before NLP was hit by the deep learning tsunami, it had been making slow but steady progress in traditional tasks such as POS tagging, syntactic parsing, entity linking, semantic parsing, etc. In general, these tasks are about text annotation, pretty much what the picture below tries to convey.

Deep learning solutions to these tasks do not necessarily have better performance, but they do make things simpler. For example, previously, to train a parser, we needed to construct millions of features, and now we start with word embeddings and leave the rest to a neural network.

What makes NLP different and difficult, and why hasn't deep learning helped it as much as it has image recognition and speech processing? Two things are fundamental to understanding natural languages: priors and structures.

In 2011, Tenenbaum et al asked a very interesting question: How do our minds get so much from so little? Our minds build rich models of the world and make strong generalizations from input data that is sparse, noisy, and ambiguous — in many ways far too limited to support the inferences we make. How do we do it?

Tenenbaum et al argued, quite convincingly, that the answer is Bayesian inference. Bayesian inference allows a three-year-old to learn the concept of horse after seeing merely three pictures of horses. But such inference may rely on innate priors that are hardwired into our brain through eons of evolution.

It is difficult for machines to obtain correct priors for Bayesian inference. Here is a simple example (although not entirely relevant): Given the search query "jordan 7 day weather forecast," how do we decide what "jordan" refers to? Humans know instantly that it refers to the country Jordan. A naive algorithm oblivious to the structure of the query may mistake "jordan" for Jordan shoes (a Nike brand). This is likely due to the priors it uses in Bayesian inference, which are estimated by counting how frequently people search for Jordan shoes vs. the country Jordan on the web. This estimate is heavily skewed in our case: searches for the shoes are far more frequent than searches for the country. The skewed priors then dominate the inference, which leads to the wrong result. Should we use a more sophisticated way to estimate priors? Certainly. But there is no guarantee that a more sophisticated approach works in all situations.
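To make this concrete, here is a toy sketch of how a skewed prior can dominate Bayesian inference. All numbers (the priors and the likelihoods) are invented for illustration and do not come from any real system.

```python
# A toy illustration of how a skewed prior dominates Bayesian inference.

def posterior(priors, likelihoods):
    """Return P(interpretation | query) via Bayes' rule, normalized."""
    unnorm = {k: priors[k] * likelihoods[k] for k in priors}
    z = sum(unnorm.values())
    return {k: v / z for k, v in unnorm.items()}

# Web-frequency-based priors: people search for Jordan shoes far more often.
priors = {"jordan_shoes": 0.95, "jordan_country": 0.05}

# Likelihood of the words "weather forecast" given each interpretation.
# The query fits the country far better...
likelihoods = {"jordan_shoes": 0.02, "jordan_country": 0.30}

post = posterior(priors, likelihoods)
# ...yet the skewed prior keeps the wrong (shoes) reading on top.
print(post)
```

Even though the likelihood strongly favors the country, the 19-to-1 prior is enough to flip the posterior toward the shoes, which is exactly the failure mode described above.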

Priors are important, but what is more important is that natural languages exhibit recursive structures. The query "jordan 7 day weather forecast" has a structure, which can be mapped to a "weather forecasting" semantic frame with a location parameter (slot) and a time-span parameter (slot). If the algorithm captures this structure, it can eliminate the confusion about Jordan shoes without bothering with priors. This is also one step closer to truly understanding the query. In state-of-the-art web search and QA / conversation applications, handcrafted templates are used to capture structures in natural language input, which significantly reduces inference errors. The problem is that such solutions are hard to generalize and scale up.
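The handcrafted-template idea can be sketched as a single frame with two slots. The pattern and slot names below are illustrative, not taken from any real search system.

```python
import re

# A "weather forecasting" semantic frame with a location slot and a
# time-span slot, expressed as one handcrafted template.
WEATHER_FRAME = re.compile(
    r"^(?P<location>.+?)\s+(?P<span>\d+)\s*day\s+weather\s+forecast$"
)

def parse_weather_query(query):
    m = WEATHER_FRAME.match(query.strip().lower())
    if not m:
        return None
    return {"frame": "weather_forecast",
            "location": m.group("location"),
            "days": int(m.group("span"))}

print(parse_weather_query("jordan 7 day weather forecast"))
```

Once the query fills the frame, "jordan" must occupy the location slot, so the Jordan-shoes reading never arises, regardless of how skewed the priors are.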

Neural networks and deep learning promote the use of distributional methods in natural language processing. With word embeddings such as word2vec and GloVe, discrete words in a natural language are projected into a continuous space. In this space, "cat" is close to "dog," which lets us generalize conclusions we draw about "cat" to "dog." However, distributional approaches do not eliminate the need to understand priors and structures. In fact, we do not have good representations for anything more complicated than words, such as phrases, sentences, and paragraphs, simply because they contain structures that we do not know how to model effectively. Furthermore, we do not have good representations of knowledge and common sense, which are indispensable for reasoning and inference.
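The notion of "close" in embedding space is usually cosine similarity. Here is a minimal sketch with made-up 4-dimensional vectors; real word2vec or GloVe embeddings have hundreds of dimensions and are learned from corpora.

```python
import math

# Toy "embeddings" (invented for illustration).
vectors = {
    "cat":   [0.9, 0.8, 0.1, 0.0],
    "dog":   [0.8, 0.9, 0.2, 0.0],
    "stock": [0.0, 0.1, 0.9, 0.8],
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# "cat" is close to "dog" and far from "stock" in this space.
print(cosine(vectors["cat"], vectors["dog"]))
print(cosine(vectors["cat"], vectors["stock"]))
```

This is what enables generalization across similar words, but note that the technique says nothing about how to compose word vectors into phrase or sentence representations, which is the gap described above.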

Maybe the reason deep learning is more successful in image processing is that "structures" in images are easier to capture: A convolutional neural network (CNN) that allows for translation invariance serves the purpose. It is much harder to do the same for natural languages. As a result, we are not seeing breakthroughs in NLP except in a few isolated cases where we happen to have huge amounts of training data from which priors and structures can be learned implicitly (e.g., Google uses billions of historical searches to train RankBrain to sort through search results).

NLP techniques are weak, and there is a long way to go before machines can handle open-domain communication in natural language. But before we eventually get there, how can existing NLP techniques make business impact?

The power of aggregation

NLP is already playing a critical role in many applications. But there is a trick. Typically, in these applications, we do not rely on NLP to understand the meaning of individual utterances in natural language. Rather, we process a large corpus using NLP techniques, and aggregate their results to support applications.

Sentiment Analysis. Sentiment analysis, in particular aspect-oriented sentiment analysis, is a useful tool for evaluating businesses and products. It performs information extraction on a large corpus of user reviews and outputs aggregated sentiments or opinions toward (aspects of) businesses and products. But if we dive deeper into the technique, we see its weakness: We sometimes fail to gauge sentiments because we do not understand particular expressions of natural language. For example, "the phone fits nicely in my pocket" expresses a positive sentiment toward the size of the phone, but it is not easy to automatically associate "fits nicely in my pocket" with "size."
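The aggregation step can be sketched as follows. The aspect lexicon and sentiment word lists are tiny hand-made examples; a real system would learn them from data, and, as noted above, would still struggle to map "fits nicely in my pocket" to the "size" aspect unless a cue like "pocket" is wired in.

```python
from collections import defaultdict

# Hand-made aspect cues and sentiment lexicons (illustrative only).
ASPECTS = {"screen": ["screen", "display"],
           "battery": ["battery"],
           "size": ["size", "pocket"]}
POSITIVE = {"great", "nicely", "bright", "long"}
NEGATIVE = {"dim", "short", "bulky"}

def aggregate(reviews):
    """Sum crude per-review sentiment into per-aspect scores."""
    scores = defaultdict(int)
    for review in reviews:
        words = review.lower().replace(",", " ").split()
        score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
        for aspect, cues in ASPECTS.items():
            if any(c in words for c in cues):
                scores[aspect] += score
    return dict(scores)

reviews = ["The screen is great and bright",
           "Battery life is short",
           "The phone fits nicely in my pocket"]
print(aggregate(reviews))
```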

Summarization. There are two types of text summarization: extractive and abstractive. To summarize an article, the extractive approach selects a few sentences from the article, while the abstractive approach generates new sentences. The extractive approach uses purely statistical methods; for example, it relates two sentences by looking at their shared words and topics. The abstractive approach did not produce good results until deep learning became available in recent years. But even with deep learning (e.g., recent work that combines sequence-to-sequence models with attention, copy, and coverage mechanisms), the quality of summarization is still not at production level. So, when will the technique be ready to help my friend who wants to use NLP to read financial news and extract trading insights? At the very least, current approaches need to take one more step: taking explicit goals (such as offering trading insights) into account during summarization.
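The extractive idea can be sketched in a few lines: score each sentence by its lexical overlap with the rest of the document and keep the top-scoring ones. Real extractive systems (e.g., TextRank) refine this with graph centrality; the sentences below are invented.

```python
def summarize(sentences, k=1):
    """Pick the k sentences with the most word overlap with the rest."""
    def words(s):
        return set(w.lower().strip(".,") for w in s.split())

    scored = []
    for i, s in enumerate(sentences):
        others = set().union(*(words(t) for j, t in enumerate(sentences) if j != i))
        scored.append((len(words(s) & others), i, s))
    top = sorted(scored, reverse=True)[:k]
    # Restore document order among the selected sentences.
    return [s for _, i, s in sorted(top, key=lambda t: t[1])]

doc = ["Copper prices rose sharply on Monday.",
       "Analysts link the rise in copper prices to supply disruptions.",
       "Meanwhile, the weather in London was mild."]
print(summarize(doc, k=1))
```

The middle sentence wins because it shares the most vocabulary with its neighbors; nothing in the method knows anything about trading, which is why goal-aware summarization is the missing step.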

Knowledge bases. Knowledge base construction is another area that relies on aggregated results of information extraction (IE). It also demonstrates the strength and weakness of aggregation: Efforts to create a more complete knowledge base haven't been very successful because i) most open-domain knowledge obtainable by aggregating IE results from big corpora is already covered by Freebase or other manually curated knowledge bases, and ii) knowledge obtained from individual utterances is often unreliable. Nevertheless, domain-specific knowledge bases may lead to huge commercial impact. Take two important industries, e-commerce and healthcare, for example. On e-commerce web sites, users can search for products by name or feature, but the sites do not support queries such as "how to fight insomnia" or "how to get rid of raccoons," although they sell many products for such situations. What they need is a knowledge base that maps any noun phrase or verb phrase to a list of products. Healthcare is in a similar situation. We need knowledge bases that connect symptoms, conditions, treatments, and medications.
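The phrase-to-product mapping amounts to a lookup once the knowledge base exists; building and curating the entries is where the real work (and IE aggregation) goes. All entries below are made up for illustration.

```python
# A tiny hand-made knowledge base mapping need phrases to products.
NEED_KB = {
    "fight insomnia": ["melatonin", "white-noise machine", "blackout curtains"],
    "get rid of raccoons": ["raccoon trap", "motion-activated sprinkler"],
}

def products_for(query):
    """Return products whose need phrase appears in the query."""
    q = query.lower()
    for phrase, products in NEED_KB.items():
        if phrase in q:
            return products
    return []

print(products_for("how to fight insomnia"))
```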

Search. Many consider the problem of search solved. This is not the case. Search relies on aggregated user behavior data and that means it works mostly for head queries. But in settings other than web search, even head queries are not well served.

Consider the query "travel in Arizona" on Facebook. A friend of mine made a post four hours before my query, and it is a perfect match. But surfacing this post is extremely hard because no user behavior data leads to it. Thus, for social search, email search, e-commerce search, app search, etc., there is still a big role for NLP and semantic matching. Specifically, knowledge graphs, entity linking, and semantic parsing are the keys to better serving queries with limited or no user behavior data.

Education. A very interesting and lucrative business is helping users learn or use a language more effectively. For example, several startups (e.g., Grammarly, DeepGrammar) provide tools that correct users' grammatical errors. At a high level, this is quite doable, since algorithms should be able to acquire sufficient grammatical knowledge through offline learning on large corpora. This should enable them to catch most errors in a text without having to understand its meaning. However, there is still a lot of room for improvement. For example, given "I woke at 4 am in morning," neither Grammarly nor DeepGrammar suggests changing "woke" to "woke up" or "in morning" to "in the morning." DeepGrammar actually suggested changing "woke" to "work," which does not make sense. Of course, identifying certain errors requires semantic knowledge: when will such tools be able to suggest changing "pm" to "am" in the text "I woke up at 4 pm in the morning"?

A little technical breakthrough plus a lot of dirty work

We like envisioning fancy NLP solutions, but many of them are Artificial General Intelligence (AGI), because they need to handle all possible scenarios. AGI isn’t happening anytime soon. Still, small technical breakthroughs occur from time to time. Sometimes, with a lot of dirty work, we can turn them into commercial success.

Question Answering (QA) and chatbots are nothing new: the first chatbot, ELIZA, was developed in 1966. It didn't go very far. What changed 50 years later to make QA and chatbots so hot?

Three things happened:

  1. (Technology) Breakthroughs in speech recognition, which made Alexa, Google Assistant, and Siri possible, and the availability of large knowledge bases, especially open-domain ones such as Google's Knowledge Graph.
  2. (Market) Messengers have become indispensable elements in business and everyday life, and more recently, smart speakers are suddenly ubiquitous.
  3. (Utility) People are ready to switch from keyword search to voice/natural language based interfaces for more specific answers in a more direct way.

But the technology breakthroughs in speech recognition and knowledge bases do not automatically lead to QA. We still need to understand questions, to reason and make inferences, and to generate answers, and in the last 50 years there has been no fundamental improvement in these capabilities.

Nevertheless, QA is hugely successful, as we have all experienced on Google. (It still makes mistakes. In the screenshot below, taken in July 2017, Google mistook the mother-in-law of its founder for his mother.) The success, however, does not come from a new level of natural language understanding; it is made possible by a lot of handcrafted templates.

Here are some observations.

1. Impact is driven to a large extent by technology advances. Thus, it is critical that we know the limitation of technology: After all, not much happened for QA and chatbots for more than half a century.

2. Usually a new technology does not solve 100% of the problem, but that's OK. We are happy to do a lot of dirty work (e.g., handcrafting templates and rules) to make up for the gap. To a large extent, the success of QA and digital assistants such as Siri, Alexa, Google Assistant, and Cortana is driven by handcrafted templates.

But what about the latest conversational AI (e.g., using deep reinforcement learning to build chatbots)? Isn't it one of the driving forces making chatbots so hot? Yes. But it isn't making real impact yet. Here, I focus on goal-oriented dialog systems (Siri, Alexa, Google Assistant), although I acknowledge that aimless smalltalk (Microsoft Tay) can be entertaining. We should constantly look at the intersection of technical advances and application needs, and not shy away from using low-tech dirty work to make things happen.

Narrowing the problem domain

Let us re-examine my friends’ projects that I mentioned earlier:

  1. A conversation bot that talks with patients and makes medical diagnoses.
  2. An algorithm that reads financial news and offers trading insights.
  3. A personal assistant that records your daily activities and offers advice that makes you happier and more fulfilled.

Pizza Hut deployed a chatbot to take orders from customers and it was quite successful. Facebook’s virtual assistant M is dead because Facebook put no bounds on what M could be asked to do. Before discussing the feasibility of my friends’ projects, let us revisit this quote from Microsoft AI chief Harry Shum:

Computers today can perform specific tasks very well, but when it comes to general tasks, AI cannot compete with a human child. — Harry Shum

And this quote from Stanford professor Andrew Ng:

Most of the value of deep learning today is in narrow domains where you can get a lot of data. Here’s one example of something it cannot do: have a meaningful conversation. There are demos, and if you cherry-pick the conversation, it looks like it’s having a meaningful conversation, but if you actually try it yourself, it quickly goes off the rails. — Andrew Ng

When it comes to having a robot make medical diagnoses, people naturally have a lot of doubts and concerns. But technically, it is not impossible. To solve problems in a narrow domain, the first priority is to develop domain-specific knowledge bases that make our robots experts in the domain. In this case, we need knowledge graphs that model relationships between symptoms, conditions, diagnoses, treatments, medications, etc. What about liability? People are getting health advice from non-medical authorities anyway: One in every 20 Google searches is for health-related information. The chatbot merely provides a more direct form of communication than web search. On the other hand, a real challenge for this project might be how to access users' medical records. In fact, several startups (e.g., doc.ai and eHealth First) have invested in using blockchain techniques to address this problem.

The task of reading financial news and offering trading insights lies in a much broader domain: Stock prices are affected by a myriad of factors — natural, political, scientific, technological, psychological, etc. Understanding how certain events may move stock prices is difficult. However, it is very possible to narrow the domain and develop specialized tools. For example, instead of monitoring the broad stock market, we may focus on commodity futures. Then, again, we develop knowledge bases, which may contain rules such as "the price of copper will go up if there is political turmoil or a natural disaster in countries such as Chile." Finally, we develop algorithms that read news and detect events such as political turmoil or natural disasters in certain countries. As machines read news much faster than humans do, the signals they provide may translate into advantages in algorithmic trading.
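The rule-based pipeline just described can be sketched as event detection over headlines, joined against a tiny hand-made knowledge base. All rules and headlines below are illustrative, not real trading signals.

```python
# Each rule: (event keywords, country keywords, commodity, signal).
RULES = [
    ({"turmoil", "unrest", "strike"}, {"chile", "peru"}, "copper", "price_up"),
    ({"earthquake", "flood"}, {"chile", "peru"}, "copper", "price_up"),
    ({"drought"}, {"brazil"}, "coffee", "price_up"),
]

def detect_signals(headline):
    """Match a headline against the rules; return (commodity, signal) pairs."""
    words = set(headline.lower().replace(",", " ").split())
    signals = []
    for events, countries, commodity, signal in RULES:
        if words & events and words & countries:
            signals.append((commodity, signal))
    return signals

print(detect_signals("Political unrest spreads in Chile mining regions"))
```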

It is a fascinating idea to create a personal assistant that records a user's day-to-day thoughts and activities and offers feedback that makes the user happier and more fulfilled. This reminds me of Google Photos. From time to time, Google selects a few old photos to create an album with a title such as "Rediscover this day 4 years ago." It never fails to put a smile on my face. Still, photos capture only a few glimpses of a person's life, while natural language has the potential to preserve our thoughts and activities more comprehensively, and replay them back to us more creatively.

However, this is an open domain task: The personal assistant needs to understand all kinds of thoughts and activities, which makes it Artificial General Intelligence (AGI). Is it possible to narrow the problem domain?

Why don’t we start with 1,000 templates? It’s quite reasonable to think that 1,000 templates will cover surprisingly many human activities (e.g., “I ran 3 miles today on Stanford campus” and “I had coffee with Alon at HanaHaus in downtown Palo Alto,” etc.) The personal assistant will transform pixels of our lives into structured representations, sort through them, aggregate them, and present them back to us on a later day in a new tapestry.
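The template idea can be sketched with two of the hypothetical 1,000 templates: each one turns a free-text utterance into a structured record, and anything unmatched is kept as raw text for later. The patterns and slot names are invented for illustration.

```python
import re

# Two illustrative activity templates (of a hypothetical 1,000).
TEMPLATES = [
    ("run", re.compile(
        r"^i ran (?P<miles>[\d.]+) miles today(?: on (?P<place>.+))?$")),
    ("coffee", re.compile(
        r"^i had coffee with (?P<person>\w+) at (?P<place>.+)$")),
]

def parse_activity(utterance):
    """Map an utterance to a structured record via the first matching template."""
    text = utterance.strip().lower().rstrip(".")
    for name, pattern in TEMPLATES:
        m = pattern.match(text)
        if m:
            return {"activity": name,
                    **{k: v for k, v in m.groupdict().items() if v}}
    return {"activity": "unknown", "raw": utterance}  # keep raw text for later

print(parse_activity("I ran 3 miles today on Stanford campus"))
print(parse_activity("I had coffee with Alon at HanaHaus in downtown Palo Alto"))
```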

Still, there are things the personal assistant cannot understand. For example, "My father-in-law passed away yesterday. My wife and I hugged and talked for the whole night" might not fit any of the 1,000 templates we handcrafted for everyday activities. Yet it is an important event in one's life that the personal assistant should not miss. There are several things the personal assistant can do. First, using pre-trained classifiers, it may classify and file the event as a bereavement. Second, it may use semantic parsing or slot-filling mechanisms to further detect who died. Third, when nothing works, it may still record the raw text and wait for future technical advances to take care of it.

Another big challenge for this project is getting people to talk to their personal assistants on a daily basis. Maybe we can piggyback on something users already do every day, for example, detecting a person's activities and thoughts in their emails. Or we can have the personal assistant constantly listen to our daily conversations. In fact, a few startups are already building tools for this (e.g., otter.ai, Eva).

Pushing technical boundaries

Existing NLP techniques are insufficient in understanding natural language; AGI isn’t happening, at least not anytime soon. Does this mean the only way of making business impact is through narrowing down the problem domain to the extent that we can use labor intensive techniques to cover every scenario?

Certainly not.

There are many ways to push the technical boundaries. Here, I will discuss two directions we are working on.

If the current NLP technology does not allow us to go deep in understanding natural language, how about trying to go broad?

As an example, let us consider QA and chatbots for customer service. Customer service is a promising frontier for NLP and AI. It does not require us to go particularly deep in understanding natural language. If our technology can handle 30% of customer interactions, businesses can save 30% of their human labor, which is significant. Consequently, many companies are deploying their own QA or chatbot solutions, with various levels of success.

There was a time (before the 1970s) when every business had to manage some kind of data store in its own way (e.g., to keep payroll records). Then came relational DBMSs, which could handle payroll and other applications for any business in a declarative way: there was no longer any need to write custom code for data manipulation and retrieval.

Is it possible to build a general purpose conversational AI for customer service? Or in other words, what does it take for a customer service system designed for one business to be used for a different business?

This may sound far-fetched, but it is not entirely impossible. First, we need to unify the model for the backend data used for customer service. This is feasible because most business data is in relational databases already. Second, we convert customers’ natural language questions to SQL queries against the underlying databases.

Does that mean we need to handle natural language questions for all scenarios? Not really. We are only handling a very small set of natural language utterances, namely those that can be converted to SQL statements. Under this constraint, natural language questions in one business domain must be similar to those in a different business domain because they share the same latent structure. In fact, if we treat i) the database schema, ii) database statistics, and iii) equivalent ways of mentioning database attributes and values in natural language as metadata that can be injected into the QA and conversational AI system, then it is possible to create one system for different customer service needs.
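The metadata-injection idea can be sketched as follows: the schema and the equivalent ways of mentioning attributes and values live in a metadata object, so the same conversion code can serve a different business just by swapping the metadata. The table, columns, and question below are invented examples, far simpler than a real seq2seq-based converter.

```python
# Injected metadata: schema plus phrasings of attributes and values.
METADATA = {
    "table": "orders",
    "columns": {"status": ["status", "state"], "customer": ["customer", "buyer"]},
    "values": {"delayed": "status", "shipped": "status"},
}

def question_to_sql(question, meta):
    """Turn a question into SQL using only the injected metadata."""
    words = question.lower().replace("?", "").split()
    conditions = [f"{meta['values'][w]} = '{w}'" for w in words if w in meta["values"]]
    if not conditions:
        return None  # out of scope: fall back to a human agent
    return f"SELECT * FROM {meta['table']} WHERE {' AND '.join(conditions)}"

print(question_to_sql("Which orders are delayed?", METADATA))
```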

If lack of training data is a bottleneck for NLP, why not try harder to inject explicit domain knowledge into machine learning algorithms?

This is nothing new, but the problem is very real. Machine learning converts statistical correlations in huge amounts of training data into implicit knowledge. But sometimes, such knowledge can be injected into machine learning in an explicit way.

As an example, imagine a knowledge base has a parentOf relationship but not a grandparentOf relationship. It takes a lot of training data to learn that grandparentOf is equivalent to parentOf(parentOf). A more efficient method is to pass this domain knowledge as a rule to the machine learning algorithm.
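The rule can be written down directly instead of being learned: grandparentOf(x, z) holds whenever parentOf(x, y) and parentOf(y, z) hold for some y. A minimal sketch, with made-up facts:

```python
# Known facts: parentOf pairs (all invented for illustration).
parent_of = {("ann", "bob"), ("bob", "carl"), ("bob", "dana")}

def grandparent_of(facts):
    """Apply the hand-written rule: join parentOf with itself on the middle person."""
    return {(x, z) for (x, y1) in facts for (y2, z) in facts if y1 == y2}

print(sorted(grandparent_of(parent_of)))
```

One explicit rule replaces the large amount of training data a statistical learner would need to discover the same regularity.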

In the customer service project described above, we use deep learning (a seq2seq-based model) to convert natural language questions to SQL statements. From the training data, the algorithm learns the meaning of natural language questions as well as the syntax of SQL. Still, even with a very large amount of training data, the learned model does not always produce well-formed SQL statements. But there should be no need to learn the syntax of SQL!

Haixun Wang is VP of Engineering and Distinguished Scientist @ Instacart.