Context of Natural Language Processing

Motivations, Disciplines, Approaches, Outlook

Jake Batsuuri
Computronium Blog
11 min read · Jun 12, 2020


Why though?

Moore’s law means different things to different people. It predicted the doubling of transistors on a chip roughly every two years. To business analysts, this translated into a doubling of computing power and a signal that the computing industry was still the pony to bet on. For engineers, it meant that given a hard problem, the solution was often to throw more computing power at it, or to wait a few years until processing power caught up. In recent years there has been worry that the law is slipping, as new chips fail to hit the projected specs and we approach the physical limits of the atom. If this decline continues, we will likely see funding shift toward quantum computing, different and larger chips, or better utilization of the processors we already have.

On the other hand, our data generation has been increasing roughly ten-fold every two years. Sure, the majority of this data is metadata and low-value information, like social media posts about bacon. But buried in it there is still novel research and newly discovered information that has barely been tapped. If data is the new oil, it is also an excellent resource to mine, because unlike oil, information is not a finite resource.

With demand for data mining and information parsing increasing while the growth of computing power slows, I think we will see more need for novel methods of data mining and information processing. Even though the growth rate of computing power is declining, computing will likely keep getting cheaper: cloud infrastructure gives us better utilization through technologies like container orchestration, and newer hardware architectures continue to be developed. That keeps this field attractive enough for small and medium players to enter.

This article will explore how Natural Language Processing, or NLP, may help fill this market gap and provide a solution to the knowledge bottleneck. Natural Language Processing is the set of methods for making human language accessible to computers, thereby bridging the two.

Natural Language Processing and its Neighbors

Computational Linguistics

Although in practice NLP and Computational Linguistics (CL) are often used almost synonymously, there is a difference between the two. Just as computational biology is about biology first and uses computer science to aid biological research, computational linguistics is about the study of language and uses computer science to support that research.

In contrast, Natural Language Processing, or NLP, focuses on the design and analysis of algorithms for processing human language. Some of the applications of NLP are:

  • Extracting information out of texts
  • Translating between languages
  • Answering questions
  • Holding a conversation
  • Taking instructions

Machine Learning

The current approach to NLP relies heavily on Machine Learning, or ML. Machine Learning provides an array of techniques that are useful for parsing information out of natural languages. ML can be applied to almost any data, so it has a wide variety of uses; in NLP, however, we primarily deal with textual data. Text data is discrete, and meaning arises out of the combinatorial arrangement of symbolic units.

Since the distribution of words in a language follows a power law, Zipf’s law, a handful of frequent words account for most of the text, while most other words appear very rarely. We therefore need our natural language processing models to be robust to long-tail words they may never have seen before.

Word rank versus frequency, illustrating Zipf’s law. Image by User:Husky, own work, public domain, https://commons.wikimedia.org/w/index.php?curid=1449504
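
To see this distribution for yourself, here is a minimal Python sketch (not from the original article) that counts word frequencies in any plain-text file and prints rank times frequency, which stays roughly constant when the distribution is Zipfian. The file name is a placeholder.

```python
import re
from collections import Counter

def zipf_table(path, top_n=20):
    """Print rank, word, frequency, and rank * frequency for the most common words."""
    with open(path, encoding="utf-8") as f:
        words = re.findall(r"[a-z']+", f.read().lower())
    counts = Counter(words)
    for rank, (word, freq) in enumerate(counts.most_common(top_n), start=1):
        # Under Zipf's law, rank * freq should be roughly constant across rows.
        print(f"{rank:>4}  {word:<15} {freq:>8} {rank * freq:>10}")

# zipf_table("corpus.txt")  # hypothetical corpus file
```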

Language is compositional: letters combine into words, words into phrases, and phrases into still larger phrases and sentences. Our language models should therefore be able to detect these structures in natural language.
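
As a small illustration (assuming the NLTK library is installed), the bracketed string below encodes one way words group into phrases and phrases into a sentence; the example sentence is my own.

```python
from nltk import Tree

# A hand-written constituency structure for "the journalist wrote an article".
sentence = Tree.fromstring(
    "(S (NP (DT the) (NN journalist)) (VP (VBD wrote) (NP (DT an) (NN article))))"
)

sentence.pretty_print()      # ASCII drawing of the phrase-structure tree
print(sentence.label())      # 'S'  -- the whole sentence
print(sentence[1].label())   # 'VP' -- the verb phrase "wrote an article"
```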

Artificial Intelligence

The goal of the field of artificial intelligence is to build software and robots with the same range of abilities as humans, or better. The capacity for language is one of the fundamental aspects of human intelligence, so natural language processing is one of the more prominent components of modern AI research. Other areas of research include computer vision, robotics, and strategy and problem-solving. There is a famous problem called the “knowledge acquisition bottleneck”: in many domains, expertise already exists but has not been made useful yet. Domain experts may not be able to program, or may lack the means to turn their knowledge into useful code or structured information, and there may simply be a shortage of human time to translate this ever-growing domain expertise into valuable products and services.

The goal of this subset of AI is to solve this by having computers acquire knowledge directly from texts and conversations. This solution requires both the input and the output of knowledge, and may require reasoning on both sides, so “reasoning” is another critical element of artificial general intelligence.

Even decoding simple sentences may require our AI program to reason about space, cause and effect, physical phenomena, emotions and intentions, or even social conventions.

Linguistics

Natural language processing requires a multidisciplinary approach and may demand extensive knowledge of each of the following sub-fields:

  • Math and statistics
  • Linguistics
  • Computer science
  • Machine learning & deep learning

This means many practitioners will be missing expertise in some of these areas. Since NLP is more of an engineering discipline, the majority of practitioners may have little to no background in linguistics.

Linguistics is the scientific study of language and its structure. The phrase “natural language” in NLP comes from the taxonomy of languages that linguists have developed. Languages can be divided into natural and artificial. Natural languages are the familiar English, Italian, and Japanese. Artificial languages are languages designed for a specific purpose; first-order logic, or FOL, is one such example. Programming languages use particular dialects of FOL, so they are also an example of an artificial language.

In general, the properties of artificial and natural languages can be summarized as follows. Natural languages are generic: while useful in many situations, they are also quite ambiguous and rely on context to be understood.

Artificial languages are the opposite; they offer precision and a lack of ambiguity at the price of being tedious.

First-order logic, or FOL, is the logical language that the foundational sciences use to convey information in a precise and rigorous way. First-order logic, first-order language, predicate calculus, and functional calculus are some of the different names for this formal, artificial base language.

FOL is used by researchers and mathematicians when absolute clarity, rigor and lack of ambiguity are needed. It turns out those things are also important when dealing with grammaticality, meaning, truth and proof.
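
For instance, a claim like “every journalist practices some profession” can be written in FOL with no ambiguity about who practices what; the predicate names below are only illustrative.

```latex
% "Every journalist practices some profession" -- illustrative predicate names
\forall x \,\bigl( \mathrm{Journalist}(x) \rightarrow
  \exists y \,\bigl( \mathrm{Profession}(y) \land \mathrm{Practices}(x, y) \bigr) \bigr)
```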

Computer Science

NLP is a large component of computational linguistics, since, as mentioned before, that discipline uses computers to aid its study of language. Expert use of computational resources is therefore another required skill. Modern natural language processing happens on a computer, with programming languages and libraries. Given the large data sets and heavy number crunching, it is not unheard of for practitioners to need cloud resources that ordinary laptops and desktops are not equipped with. It is therefore impossible to avoid concepts like computational complexity, memory, storage, and parallel computing, because the practitioner needs to build and train models in a reasonable amount of time. Furthermore, many natural language processing methods make heavy use of finite-state, push-down, and linear-bounded automata and other computer science (CS) concepts and algorithms.
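
As a toy example of the finite-state idea (a sketch of my own, not from the references), here is a deterministic automaton over the alphabet {a, b} that accepts exactly the strings containing the substring “ab”.

```python
def accepts_ab(s: str) -> bool:
    """DFA over {a, b}: accept strings that contain the substring "ab"."""
    # States: 0 = nothing useful seen, 1 = just saw 'a', 2 = have seen "ab" (accepting).
    transitions = {
        (0, "a"): 1, (0, "b"): 0,
        (1, "a"): 1, (1, "b"): 2,
        (2, "a"): 2, (2, "b"): 2,
    }
    state = 0
    for ch in s:          # input is assumed to use only 'a' and 'b'
        state = transitions[(state, ch)]
    return state == 2

print(accepts_ab("bbab"))  # True
print(accepts_ab("aaa"))   # False
```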

If the practitioner is on the development rather than the research side, the computer science side becomes even more crucial: a proper understanding of software engineering principles is a necessity for delivering high-quality production code.

Speech Processing

The current state of the art considers the conversion of audio signals into words a mostly solved problem. There are many free APIs that let developers turn speech into text.
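
For example, the open-source Python package SpeechRecognition wraps several of these free services; a minimal sketch (with a placeholder file name) looks roughly like this.

```python
import speech_recognition as sr

recognizer = sr.Recognizer()
with sr.AudioFile("recording.wav") as source:   # hypothetical audio file
    audio = recognizer.record(source)           # read the whole file into memory

# Google's free Web Speech API; other engines (CMU Sphinx, Wit.ai, ...) are also supported.
print(recognizer.recognize_google(audio))
```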

Major levels of linguistic structure. From Thomas, James J. & Cook, Kristin A. (eds.), Illuminating the Path: The Research and Development Agenda for Visual Analytics, National Visualization and Analytics Center, 2005, p. 110; derivative work by McSush. Public domain, https://commons.wikimedia.org/w/index.php?curid=8899126

On this diagram, then, we can consider everything up to the purple circle pretty much solved in NLP. In many projects that build or use language models, this part is treated as a preprocessing step.

The next rung on the concentric circles is text analysis, in particular with statistical language models, which quantify the probability of a sequence of text. Companies like Google no longer rely on page-rank algorithms alone; they increasingly use linguistic methods to provide better search results. Where early versions of Google used simple string matching combined with network analysis, modern versions use language models to guess what you are looking for.
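
A statistical language model can be sketched in a few lines. The toy bigram model below (illustrative only, nowhere near a production system) estimates the probability of a word sequence as a product of conditional probabilities learned from a tiny made-up corpus.

```python
from collections import Counter

corpus = "the cat sat on the mat . the dog sat on the rug .".split()

bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)

def sequence_probability(words):
    """Approximate P(w1..wn) as the product of P(w_i | w_{i-1}) over the sequence."""
    prob = 1.0
    for prev, cur in zip(words, words[1:]):
        prob *= bigrams[(prev, cur)] / unigrams[prev]  # assumes prev was seen in training
    return prob

print(sequence_probability("the cat sat".split()))  # small but non-zero
print(sequence_probability("cat the sat".split()))  # 0.0: contains unseen bigrams
```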

If you have used speech recognition apps, such as the dictation mode on mobile keyboards, you might have noticed that they transcribe your speech as one continuous sentence, regardless of meaning. These systems cannot currently punctuate effectively, since they have little to no semantic understanding of the text.

Approaches

Learning vs Knowledge

The learning school of thought approaches this problem by learning to translate a corpus of text directly into any desired format, for example databases, summaries, or translations into another language.

The knowledge school of thought approaches this problem by representing a text as a stack of linguistic structures like those outlined in the diagram above. In the lower layers of the stack there are morphemes and words as objects; in the middle layers there may be tree-structured representations of grammar; and in the higher layers there may be logic-based representations of meaning. This approach is heavily influenced by the domain expertise accumulated through linguistic research.

The “learning” solution, on the other hand, is heavily influenced by the deep learning and data science community, where the consensus is that, given enough data and enough computing power, we can almost brute-force many complicated tasks into submission. Just this week, the OpenAI team released a new general language model, GPT-3, with 175 billion parameters, up from GPT-2’s 1.5 billion. The newer version was able to break many previous records but still failed to answer some logical questions that involve basic physics, such as, “If I put cheese into the fridge, will it melt?”

Both schools of thought have the same goal, though: to provide a general base solution for any of the NLP applications listed earlier.

Learning vs Searching

The general machine learning problem can be posed as an equation of roughly the following form:

ŷ_new = φ(x_new; x_train, y_train, θ)

Here we are looking for the predicted output, ŷ, on new inputs, given that our model φ used inputs, labels, and parameters to train.

Consider the NLP applications listed earlier:

  • Extracting information out of texts — this may be organizing the event timeline from a news article
  • Translating between languages — this may be anything that Google Translate does
  • Answering questions — this may be chatbots answering how-to questions
  • Holding a conversation — this may be chatbots intended to pass the Turing test

Under examination, there are two distinct stages in these projects. There is the learning stage, where we learn all the parameters from large amounts of labelled data. The second is the search stage, where, given an input text x, we search the space of outputs for the y that the trained model scores highest for x.
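
To make the two stages concrete, here is a deliberately tiny sketch with made-up data: the “learning” stage estimates co-occurrence scores from a toy parallel corpus, and the “search” stage picks, from a small set of candidate outputs, the one the learned scores rank highest.

```python
from collections import Counter
from itertools import product

# --- Learning stage: count source/target co-occurrences as a stand-in for training ---
parallel_corpus = [
    (("the", "cat"), ("le", "chat")),
    (("the", "dog"), ("le", "chien")),
]
counts = Counter()
for src, tgt in parallel_corpus:
    for s, t in product(src, tgt):
        counts[(s, t)] += 1

def score(src, tgt):
    """Sum of learned co-occurrence counts between the input and a candidate output."""
    return sum(counts[(s, t)] for s, t in zip(src, tgt))

# --- Search stage: enumerate candidate outputs and pick the argmax ---
candidates = [("le", "chat"), ("le", "chien"), ("la", "chatte")]
best = max(candidates, key=lambda tgt: score(("the", "cat"), tgt))
print(best)  # ('le', 'chat')
```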

While traditional machine learning and deep learning practice does not stress this boundary a whole lot, in NLP it ends up being necessary because of the particular demands of language processing.

For example, consider a situation where the name of something is on the tip of your tongue and you cannot seem to remember it for five minutes, but you do remember the context in which it is used and other details. Now imagine entering these context clues into a model, and the model retrieves the word you are looking for. This type of model is categorized as expressive: it can capture more meaning, if you will.

Now consider another model that can only do regex-like string matching. This model cannot return results based on clues. As you can probably guess, this type of model searches fast, whereas the expressive model above is slower to search. How much slower? Exponentially slower. This trade-off is therefore a significant one to consider during the learning stage, especially for production code.

Relational vs Compositional vs Distributional

The relational perspective considers words in relation to other words. For example, a “TV anchor” is a type of “journalist,” and “journalism” is a type of “profession.” Furthermore, “journalism” is the “profession” that “journalists” practice.
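
WordNet is the classic resource for this relational view. Assuming NLTK and its WordNet data are installed, the hypernym (“is-a”) chain above “journalist” can be walked like this.

```python
# Requires a one-time data download: import nltk; nltk.download("wordnet")
from nltk.corpus import wordnet as wn

synset = wn.synsets("journalist")[0]
print(synset.definition())

# Follow "is-a" links upward until we run out of hypernyms.
while synset.hypernyms():
    synset = synset.hypernyms()[0]
    print("is a kind of:", synset.name())
```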

The compositional perspective considers words as made up of constituent parts. For example, the term “journalist” consists of “journal” and “-ist”, and “journal” can in turn be traced back to the French “jour”. The strength of the compositional perspective is that, once developed, it can analyze a text without much training. Since this method is more rule-based, it generalizes better to those long-tail situations.
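
A bare-bones sketch of this idea (hand-rolled rules of my own, not a real morphological analyzer) simply peels a known suffix off a word.

```python
SUFFIXES = ["-ist", "-ism", "-al", "-er"]  # a tiny, illustrative suffix inventory

def segment(word):
    """Split a word into (stem, suffix) if it ends in a known suffix, else (word, None)."""
    for suffix in SUFFIXES:
        ending = suffix.lstrip("-")
        if word.endswith(ending) and len(word) > len(ending) + 2:
            return word[: -len(ending)], suffix
    return word, None

print(segment("journalist"))  # ('journal', '-ist')
print(segment("journalism"))  # ('journal', '-ism')
```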

The distributional perspective considers words as replaceable by other, similar expressions. Consider, for example, the problem of interpreting idioms. Idioms are phrases such as “kicking the bucket,” which figuratively means dying, but neither the relational nor the compositional approach would parse out this meaning. However, given enough training data, our language model may find other contexts where this exact phrase appears, and from multiple use cases and contexts it can arrive at a decent semantic understanding, so that when generating new text it can use this or equivalent phrases to convey the meaning.
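
A minimal distributional sketch (illustrative only): represent each word by the counts of its neighbours within a small window, then compare words with cosine similarity. On a toy corpus the numbers mean little, but at scale, words used in similar contexts end up with similar vectors.

```python
import math
from collections import Counter, defaultdict

corpus = ("he kicked the bucket yesterday . "
          "she passed away yesterday . "
          "he kicked the ball hard .").split()

WINDOW = 2
vectors = defaultdict(Counter)
for i, word in enumerate(corpus):
    for j in range(max(0, i - WINDOW), min(len(corpus), i + WINDOW + 1)):
        if i != j:
            vectors[word][corpus[j]] += 1   # count neighbouring words

def cosine(u, v):
    dot = sum(u[k] * v[k] for k in u)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

print(vectors["bucket"])                           # the context vector itself
print(cosine(vectors["bucket"], vectors["away"]))  # overlap via shared contexts
```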

Each perspective gives us almost contrasting and independent approaches and algorithms, yet each one solves a distinct problem well enough to warrant its existence. Future developers will need to build better integrations between them.

Conclusion

There’s an interview with Jeff Bezos on YouTube about Amazon and the future of business, in which Bezos remarks on the human brain: it learns with remarkable efficiency, both in terms of power usage and in the number of training examples needed to generalize. Modern machine learning algorithms often need thousands, if not hundreds of thousands, of examples to become superhuman at particular tasks, like playing Dota. People, by contrast, can learn to play the same game reasonably well from fewer than five examples, possibly by reasoning and by transferring what they learned from other games.

Not only that, our brains seem to possess a certain language faculty that degrades with age. Take the example of a child learning their native tongue: kids pick up language from very few examples, and the earlier a child is introduced to a second or third language, the better they internalize it.

If we wish to create similar or better linguistic machines, we must build these same features into our models. Specifically, the machines must be able to transfer learning across tasks, by recognizing that similar tasks should share generalizations, and they must have some internal structure that allows quick generalization from few examples.

Up Next…

In the next article, we will explore some formal language theory, as it is the foundation for computational linguistics.

For the table of contents and more content click here.

