NLP and Linguistics 1 — Just How Much Linguistics Do We Need to Know?

Shailey Dash
19 min read · Dec 2, 2023


Why should we even think about linguistics in NLP? Most NLP today has a technology and algorithm focus. At times we lose track of the beauty and complexity of the task that computers now accomplish so effortlessly (with a lot of training data and huge models, of course!).

Since linguistics is a huge area, I have broken even this introduction into several parts. The aim of this first article is to provide a general overview of NLP and linguistics: to convey, at least superficially, the complexity of linguistics, and to assess how its principles and concepts can help us understand the LLM black box better.

A little bit about what I aimed to write and where I finally ended up. I started with the idea of just understanding a little more about the linguistics side of the story. It was not meant to be a deep dive, just a skim over the surface. However, as I progressed, it became like peeling the layers of an onion. If you think NLP is complex, you can have no idea of the depth and complexity of linguistics, a discipline whose beginnings go back more than 2000 years to the Sanskrit grammarian Panini, whose work is still an important foundation of present-day linguistics. So, as you can see, there's obviously a lot there.

Introducing an area as vast as linguistics is a difficult job. There are a lot of concepts that are new to NLP folk, and going beyond a superficial introduction is not possible here. So, at the risk of seeming like a Wikipedia page, I simply link to the relevant web pages.

In this article I define NLP and Linguistics. Specifically, I look at:

1. Why Linguistics is important for NLP today

2. Overview of NLP and Linguistics

3. Width in Linguistics: the important subfields of linguistics

4. Depth in Linguistics

The depth, or types, of linguistic analysis broadly spans six to seven levels:

- Phonetics

- Phonology

- Morphology

- Syntax

- Semantics

- Pragmatics (and Discourse)

5. Wrap up and next articles

Why Linguistics is important for NLP today

Understanding the black box of LLMs is probably going to be key to deploying them in more critical, sensitive applications where knowing how the model works matters. It may also help in solving problems such as hallucination. Two main areas stand out, though others are also highlighted.

1. Using LLMs such as ChatGPT for more complex linguistic analysis, such as pragmatic and discourse analysis. This approach requires careful and structured prompting, but initial results suggest that LLMs are demonstrating some advanced linguistic capabilities. Given the widespread use of LLM text outputs, it would be good to understand the level of higher-order social and cultural knowledge they are able to demonstrate. This will become even more important as businesses seek to automate customer-facing and service roles, for example. Interesting research papers include here, here and here.

2. Understanding the black box of how LLMs analyze language by using various types of linguistics-based analysis such as morphology, semantics and syntax. Some research papers are trying this approach, for example here and here.

Overview of NLP and Linguistics

Definition of NLP:

No single definition quite fits the bill, so I integrate across several definitions below.

Objectives: NLP has a computer science or engineering focus. Its primary objective is to develop machine learning algorithms that process text or speech inputs efficiently and at scale, giving computers the ability to interpret, manipulate and comprehend human language.

NLP techniques: NLP is multidisciplinary, involving computer science, linguistics, machine learning, statistics and mathematics.

NLP techniques have evolved significantly, going from hard-coded, linguistically inspired rules in the 1950s, to statistical machine learning techniques in the 1990s, and finally to deep learning techniques from 2010 onwards. The deep learning phase began with the shallow neural networks used in word2vec models, evolved to sequence-to-sequence models such as RNNs, and culminated in the hugely successful Transformer architecture with its self-attention mechanism, which is state of the art today.

Outputs

The main output of NLP is enabling computers to understand language, which is then used to automate tasks such as machine translation, natural language understanding (NLU), natural language generation (NLG), text summarization and classification. LLMs such as ChatGPT or Bard have been trained on huge corpora of text and achieve remarkable performance on NLU and NLG tasks.

Key takeaway: The focus is on applications and evolving methodologies and algorithms to carry out NLP tasks.

Definition of Linguistics:

Defining linguistics, even superficially, takes more work. At its core, linguistics is concerned with the rules that languages follow.

Definition:

Linguistics is the scientific study of language, and its focus is the systematic investigation of the properties of particular languages as well as the characteristics of language in general. It encompasses not only the study of sound, grammar and meaning, but also the history of language families, how languages are acquired by children and adults, and how language use is processed in the mind and how it is connected to race and gender.

https://arts-sciences.buffalo.edu/linguistics/about/what-is-linguistics.html

Objectives:

Linguistics is a more philosophical, theoretical discipline. It tries to answer big, fundamental questions about language and communication: What features are common to all human languages? How is human communication different from animal communication? How are different modes of communication, such as speech, writing and sign language, related to each other?

What Linguistics does

What linguistics does (Image credit: Author created)

The image above shows the various types of linguistic analysis. These are not exactly steps; they are more like the way a linguistic analysis evolves into a full-fledged theory. As you can see, the process moves from specific instances to generalization.

Observation and description phase:

As the image shows, linguistics begins with the observation and description of language use: characterizing the language in terms of its structures at every level, from phonology and morphology to syntax, lexicon, semantics and pragmatics.

Generalizations and universal principles

Linguistics then develops generalizations and identifies universal principles across all or most languages. These can be generalizations on topics such as recency and primacy, active and passive voice, punctuation, etc.

Development of theories

All this analysis is ultimately used to develop linguistic theories or approaches, such as structuralism (1900–1950), the generative linguistics stream (1950s) led by Noam Chomsky, functionalism (1920s) and cognitive linguistics (1970s). This is a massive and complex area, so I will stop at just naming the theories and paradigms that have dominated linguistic development.

Width in Linguistics: the important subfields of linguistics

Branches of linguistics are typically multidisciplinary approaches to the study of language. There are many branches; I mention a few to give you a flavour of the sheer width of linguistics.

Computational Linguistics

The most important subfield from the perspective of NLP is computational linguistics (CL).

CL, as a subfield of linguistics, is concerned with the formal or computational description of the rules that languages follow. Its distinction from NLP is a gray area; at times the two seem to be doing the same thing, especially as computational linguistics has also been leveraging deep learning methods in recent years. Many areas are common to CL and NLP: for example, there has been extensive research in computational linguistics on machine translation and NLU. However, the approaches and objectives of the research tend to differ, and the two fields cross-fertilize each other.

The early phases of NLP in the 1950s had a strong focus on machine translation. This was also the period in which symbolic NLP, strongly driven by linguistic concepts, was dominant: NLP in this period essentially hard-coded linguistic rules into computers as 'if-else' rules. Later the fertilization reversed, with the statistical NLP and deep learning approaches that rose in NLP also being used in computational linguistics to better understand language structures.

Basically, this is a case where the computational techniques (statistical NLP or deep learning methods) may be similar but the end goals are different. CL tends to focus on linguistic questions, such as understanding language structure and developing computational models of it. NLP, on the other hand, focuses on developing algorithms and techniques for efficiently performing tasks such as translation, NLU and NLG.

Other fields of Linguistics

Linguistics goes far beyond CL. It intersects with the social sciences, psychology and neuroscience to explore different aspects of language development in humans. The common thread across all linguistic analysis is the rigour it brings to each subfield. Some of these subfields are:

Sociolinguistics: the study of how language is shaped by social factors. Sociolinguists research both style and discourse in language, as well as the theoretical factors that are at play between language and society.

Psycholinguistics is concerned with the cognitive faculties and processes that are necessary to produce the grammatical constructions of language. It is also concerned with the perception of these constructions by a listener.

Neurolinguistics: the study of the structures in the human brain that underlie grammar and communication. It studies the physiological mechanisms by which the brain processes information related to language. The main focus is on how the brain can implement the processes that theoretical linguistics and psycholinguistics propose are necessary for producing and comprehending language.

These are just a few of the many branches of linguistics. As you can see, it's a lot!

Depth in Linguistics — Levels of Linguistic Analysis/ Techniques

Source: https://en.wikipedia.org/wiki/Linguistics

The image above shows the depth of linguistic analysis. Linguistic analysis has many levels, each a massive area in itself with many complex concepts. It moves from the analysis of speech sounds and how they are understood as speech, to the analysis of words and their components, then to how groups of words are linked to form structures, then to the actual meanings of words, and finally to the social context of language understanding.

There are six to seven levels of linguistic analysis. Sometimes a few of them are rolled into each other, so you may see five levels; sometimes pragmatics is split into pragmatics and discourse. These levels take the analysis from the most basic components of speech sounds up to representations of meaning and communication.

The levels are as follows:

1. Phonetics: the study of speech sounds

Phonetics is the most basic level of analysis of any language: it studies individual speech sounds (phones) and how we produce them.

The smallest linguistic unit of phonetics is the phone — a speech sound in a language. Phonetics is not concerned with the meaning of sounds but instead focuses on the production, transmission, and reception of sound.

The study of speech production, for example, looks at how different vocal organs, such as the lips, tongue and teeth, interact to produce particular sounds. These speech sounds are sorted into the categories catalogued in the International Phonetic Alphabet (IPA).

Phonetics is a complex area in itself, with three subareas: articulatory phonetics, acoustic phonetics and auditory phonetics.
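To make this concrete, here is a minimal sketch of phonetic transcription in Python using NLTK's CMU Pronouncing Dictionary (my illustration, not from the article; the dictionary uses ARPAbet symbols rather than the IPA, but the idea is the same):

```python
# Look up the phones that make up a word using the CMU Pronouncing Dictionary.
import nltk

nltk.download("cmudict", quiet=True)  # one-time download of the lexicon
from nltk.corpus import cmudict

pron = cmudict.dict()  # maps each word to one or more phone sequences
for word in ["cat", "language"]:
    print(word, "->", pron[word][0])
# e.g. cat -> ['K', 'AE1', 'T'] (the digits mark vowel stress)
```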

Use in NLP

Earlier approaches to automatic speech recognition (ASR) used phonetic elements. These were not very successful, as syllables, vowels and consonants have too many variants to be modelled effectively this way. ASR is now done successfully using transformer-based models.

Currently, this area is used more in purely linguistic research, such as the study of the sounds of endangered languages. Since it is concerned purely with sound, it may not be relevant for understanding the NLP black box.

2. Phonology: the study of how we organize speech sounds into words to convey meaning

Phonology seeks to understand how phonemes (the smallest units of sound that distinguish meaning) are organized in a language. It is the study of how speech sounds are organized in the mind and used to convey meaning. It also examines the rules a language follows to determine how certain words should be pronounced.

Whereas phonetics is concerned with the physical production and transmission of sounds, phonology is concerned with the ways in which a language maps meaning onto sounds. To understand the distinction better, here is a simple example:

In phonetics we see multiple realizations of a particular sound: every time you say a 'p' it will be slightly different from the other times you've said it. In phonology, however, all these productions count as the same sound within the language's phoneme inventory. Even though each instance of 'p' is produced slightly differently, the encoding of that sound is the same.

https://www.sheffield.ac.uk/linguistics/home/all-about-linguistics/about-website/branches-linguistics/phonology
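As a toy illustration of how phonologists establish a phoneme inventory, the sketch below searches a tiny lexicon for 'minimal pairs': words that differ in exactly one sound, which is evidence that the two differing sounds are distinct phonemes. The transcriptions are simplified assumptions for the example.

```python
# Toy minimal-pair finder over a hand-written, simplified phonemic lexicon.
words = {
    "pat": ("p", "ae", "t"),
    "bat": ("b", "ae", "t"),
    "pit": ("p", "ih", "t"),
}

def minimal_pairs(lexicon):
    items = list(lexicon.items())
    for i, (w1, p1) in enumerate(items):
        for w2, p2 in items[i + 1:]:
            same_length = len(p1) == len(p2)
            if same_length and sum(a != b for a, b in zip(p1, p2)) == 1:
                yield w1, w2

for w1, w2 in minimal_pairs(words):
    print(f"{w1} / {w2}")  # pat / bat -> /p/ and /b/ are distinct phonemes
```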

Uses of Phonology in NLP

The main uses of phonology have tended to be academic and theoretical. There has been some limited use of phonological features in speech recognition systems; however, these systems again performed poorly compared with deep learning approaches.

3. Morphology: The study of word structure

Morphology is the study of words: how they are formed and how they relate to other words in the same language. This is probably the first area with serious connections to NLP.

Morphology splits words into morphemes, the smallest meaning-bearing elements of a word, and focuses on how morphemes combine to make a word.

Morphology varies across languages and depends heavily on what a particular language supports. This type of analysis examines how the components within a word (stems, roots, prefixes, suffixes, etc.) are arranged or modified to create different meanings.

English, for example, often adds “-s” or “-es” to the end of count nouns to indicate plurality, and a “-d” or “-ed” to a verb to indicate past tense.
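A toy segmenter makes the stem-plus-affix idea concrete. This is a deliberately naive sketch of my own: real morphological analyzers handle irregular forms and spelling changes, whereas this only strips a few regular English suffixes.

```python
# Naive morpheme segmentation: split a word into (stem, suffix) if possible.
SUFFIXES = ["es", "ed", "ing", "s"]  # a few regular English suffixes

def segment(word):
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)], suffix
    return word, None

for w in ["cats", "walked", "boxes", "walking"]:
    print(w, "->", segment(w))
# cats -> ('cat', 's'), walked -> ('walk', 'ed'),
# boxes -> ('box', 'es'), walking -> ('walk', 'ing')
```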

Morphology & Dependency Trees | Cloud Natural Language API | Google Cloud

Morphological complexity varies across languages, and languages are in fact classified in terms of their morphology. Isolating languages, such as Chinese, have little or no morphology. Agglutinative languages, such as Turkish, are those whose words can be easily decomposed into constituent morphemes. A third category is inflectional languages, such as Latin, where words have endings fused onto them that encode changes such as tense, gender or plurality. The key point, however, is that these categories are not absolute; they form more of a continuum.

Uses of Morphology in NLP

Morphology is one area of linguistics that has had substantial application in NLP. Stemming and lemmatization are morphology-based text preprocessing techniques that were a popular part of the statistical approach to NLP, the approach that was dominant from the 1980s until deep learning took over around 2010. The most popular stemmer was the Porter stemmer, based on Porter's algorithm for removing word endings; popular lemmatizers included the WordNet lemmatizer.
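Both are available in NLTK. A quick sketch (my example, assuming NLTK is installed) shows how the stemmer simply chops endings while the lemmatizer maps words to a dictionary form:

```python
# Compare Porter stemming with WordNet lemmatization on a few nouns.
import nltk

nltk.download("wordnet", quiet=True)  # lexicon needed by the lemmatizer
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ["studies", "geese"]:
    print(word, "->", stemmer.stem(word), "|", lemmatizer.lemmatize(word, pos="n"))
# studies -> studi | study   (the stemmer chops; the lemmatizer maps to 'study')
# geese   -> gees  | goose   (only the lemmatizer knows the irregular form)
```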

With the rise of deep learning approaches to NLP, both stemming and lemmatization have become subsumed in the way the neural network learns static and contextual embeddings.

4. Syntax: study of grammar

This is where NLP really starts to borrow heavily from linguistic concepts. Syntax is the study of how words and morphemes combine into larger units, such as phrases and sentences, with a meaningful structure.

The syntactic level of linguistic analysis relates to the structure of the sentence, i.e., the categories of words and the order in which they are assembled to form a grammatical sentence.

Why is syntax important?

There are thousands of words in a language, so mathematically there is a combinatorially huge number of ways to string them together into word groups or sentences. However, only a few of these combinations actually make sense to us, because sentences must have a certain structure to communicate meaning. This structure arises because most languages categorize words into groups such as nouns, verbs and adjectives, and these categories follow a relatively fixed order, which is what we call syntax. Many syntactic structures share strong commonalities across languages.

Syntax deals only with structure and is distinct from semantics, which deals with meaning. A famous example of a grammatically correct but meaningless sentence, from Noam Chomsky:

"Colorless green ideas sleep furiously."

Noam Chomsky, Syntactic Structures, 1957

Syntax in linguistics is analyzed using constituents. A constituent can be a single word or a phrase, i.e., a constituent of more than one word.
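To make constituency concrete, here is a small sketch using NLTK with a toy grammar I wrote for one sentence (an illustrative assumption, not a realistic grammar of English):

```python
# Parse a sentence into constituents with a toy context-free grammar.
import nltk

grammar = nltk.CFG.fromstring("""
    S -> NP VP
    NP -> Det N
    VP -> V NP
    Det -> 'the'
    N -> 'dog' | 'cat'
    V -> 'chased'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse("the dog chased the cat".split()):
    print(tree)
# (S (NP (Det the) (N dog)) (VP (V chased) (NP (Det the) (N cat))))
```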

Use in NLP

Syntax-based tools were an important part of statistical NLP in the 1990s. They included treebanks and resources built on top of them, such as POS taggers and parsers. These enabled computational linguistics and NLP to leverage large text corpora.

A treebank is a parsed text corpus in which every sentence is annotated with a tree structure called a parse tree or syntax tree. Treebanks were used in NLP to train and test parsers and part-of-speech (POS) taggers; in theoretical linguistics they are used to test linguistic theories. Treebanks were annotated by humans, and while the level of annotation varies, most include syntactic and POS information, and at times morphological information too. Famous treebanks include the Penn Treebank and the Prague Dependency Treebank.

POS tagging is the process of marking up each word in a text corpus as a particular part of speech, based on both its definition and its context. At the most basic level this means categorizing a word as a noun, verb, adverb, adjective, etc.
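Here is a quick POS-tagging sketch with NLTK's default tagger (my illustration; the exact downloadable resource names can vary slightly between NLTK versions):

```python
# Tag each token in Chomsky's famous sentence with its part of speech.
import nltk

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

tokens = nltk.word_tokenize("Colorless green ideas sleep furiously")
print(nltk.pos_tag(tokens))
# roughly: [('Colorless', 'JJ'), ('green', 'JJ'), ('ideas', 'NNS'),
#           ('sleep', 'VBP'), ('furiously', 'RB')]
```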

A parser is basically an algorithm, typically trained on a treebank corpus, that analyzes a sentence and breaks it down into its component structure. Some important parsing tools are the Stanford parser (The Stanford Natural Language Processing Group) and OpenNLP (Apache OpenNLP Developer Documentation).
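As a modern stand-in for those tools, spaCy bundles a trained POS tagger and dependency parser. The sketch below is my example and assumes the small English model has been installed with `python -m spacy download en_core_web_sm`.

```python
# Dependency-parse a sentence and print each token's syntactic role.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The dog chased the cat")
for token in doc:
    print(f"{token.text:<8}{token.pos_:<7}{token.dep_:<8}head={token.head.text}")
# 'dog' comes out as the nsubj (subject) of 'chased'; 'cat' as its dobj (object)
```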

Would research into parsers and POS taggers help with the black box?

Treebanks, POS taggers and parsers were key elements of the NLP pipeline in the heyday of symbolic and statistical NLP (basically up to 2010). The success of transformer models such as BERT or ChatGPT has been achieved without explicit modelling of hierarchical syntactic structure.

However, this is an active area of research, with many papers exploring the addition of POS information to transformers to see whether their performance improves. The key research question: to what extent can the performance of transformer LLMs be further improved by using syntactic information?

5. Semantics: study of word meanings

Semantic analysis is one area of linguistics on which deep learning approaches have had a huge impact. Earlier approaches to semantics were based on calculating word similarities. These did not work well for word sense disambiguation, where a word has multiple senses in which it can be used: the word 'bank', for example, can mean 'a bank where one deposits money' or 'the bank of a river'.

Before deep learning, semantic analysis of language was dominated by lexical semantics: the study of word meanings, how words act in grammar and compositionality, and the relationships between the distinct senses and uses of a word.

Lexical semantics can be broken into the following types of analysis:

1. How to describe the meanings of words

2. How to account for the variability of meaning from context to context.

An important aspect of lexical semantics is the study of relationships between words that have clear linguistic or ontological connections, such as synonymy (similar meanings), antonymy (opposite meanings) and polysemy (same word, different meanings, as with 'bank'). Those are the obvious ones; there are many others, such as troponymy and hypernymy, that probably only linguists know about.
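WordNet, the lexical database behind the lemmatizer mentioned earlier, encodes exactly these relations, and it is easy to explore from NLTK (my sketch; the sense numbering is WordNet's own):

```python
# Explore lexical-semantic relations (polysemy, antonymy, hypernymy) in WordNet.
import nltk

nltk.download("wordnet", quiet=True)
from nltk.corpus import wordnet as wn

# Polysemy: 'bank' has many senses (financial institution, river bank, ...).
for synset in wn.synsets("bank")[:3]:
    print(synset.name(), "->", synset.definition())

# Antonymy: antonyms hang off lemmas, not synsets.
print(wn.lemma("good.a.01.good").antonyms())  # typically [Lemma('bad.a.01.bad')]

# Hypernymy: a dog is a kind of canine / domestic animal.
print(wn.synset("dog.n.01").hypernyms())
```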

Another important area in semantics is distributional semantics.

Distributional semantics is an approach to semantics based on the contexts of words in large corpora. The underlying idea was proposed by Firth in a 1957 article:

"You shall know a word by the company it keeps."

John Rupert Firth, 1957

A key idea here is that word meanings cannot be dissociated from the contexts in which they are used. According to the distributional hypothesis, the more semantically similar two words are, the more likely they are to occur in similar linguistic contexts. This is termed being distributionally similar.

Distributional similarity is measured using word vectors. Word vectors are not something that just came up in 2013 with word2vec: vectorized representations of words, and methods for learning relationships between them, were developed in the 1990s as part of corpus-based statistical NLP.

The highly effective neural network approach to word vectors started in the early 2000s with neural language models and evolved subsequently with word2vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014). These word embeddings were dense, lower-dimensional vectors that were easier and faster to train.
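Training a word2vec-style model takes a few lines with gensim. The corpus and hyperparameters below are toy assumptions purely for illustration; a useful model needs millions of sentences.

```python
# Train a tiny Word2Vec model on a toy corpus and inspect the vectors.
from gensim.models import Word2Vec

sentences = [
    ["deposit", "money", "in", "the", "bank"],
    ["withdraw", "money", "from", "the", "bank"],
    ["walk", "along", "the", "river", "bank"],
]

model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, epochs=100)
print(model.wv["bank"].shape)            # one dense 50-dimensional vector
print(model.wv.most_similar("money"))    # nearest neighbours in vector space
```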

Of course, the state of the art is learning dynamic, contextual word embeddings via transformer models.
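The key difference from static vectors is that a transformer produces a different vector for 'bank' in each sentence. Here is a hedged sketch with Hugging Face's transformers library and BERT (the model choice is mine, and the first run downloads the weights):

```python
# Show that BERT assigns context-dependent vectors to the same word 'bank'.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def bank_vector(sentence):
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]  # (tokens, 768)
    bank_id = tokenizer.convert_tokens_to_ids("bank")
    position = inputs["input_ids"][0].tolist().index(bank_id)
    return hidden[position]

v1 = bank_vector("I deposited money in the bank.")
v2 = bank_vector("We sat on the river bank.")
print(torch.cosine_similarity(v1, v2, dim=0))  # noticeably below 1.0
```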

Would the study of semantics help the LLM black box?

This is an active area of research for LLMs. The research so far suggests that semantics is one problem LLMs are solving well. Understanding what exactly they are learning in their embeddings could be helped by inputs from lexical semantics.

6. Pragmatics: study of how language is used in social interactions

Pragmatics and discourse analysis are both areas of linguistics that study language in context. Theories of pragmatics work closely with theories of semantics and syntax to analyze and contextualize meaning.

Pragmatics is a field of linguistics concerned with what a speaker implies and a listener infers based on contributing factors like the situational context, the individuals’ mental states, the preceding dialogue, and other elements. This is an area where meaning becomes more nuanced and subtle, going beyond word contexts. It tries to identify the intention of the speaker and the expected response of the responder.

Let’s look at a simple example of implication: if someone asks “What do you want to eat?” and the response is “Ice cream is good this time of year”, the second person didn’t explicitly say what they wanted to eat, but their statement implies that they want ice cream.

This is a vast and complex area. Historically, the study of pragmatics goes back to the 1780s in Europe, when linguists studying the philosophy of language converged on the view that language must be studied in the context of dialogue and life, and that language itself is a kind of human action.

Types of analysis in pragmatics

Pragmatics can be split into four areas of analysis: speech acts, rhetorical structure, conversational implicature, and the management of reference in discourse. To give a flavour of the field, I will briefly review speech acts, one of the more popular types of pragmatic analysis.

Speech acts

Speech acts are ways in which people use language to accomplish certain kinds of tasks or acts. They are distinct from physical acts, such as taking a walk, and mental acts, like thinking about taking a walk.

A speech act would include asking for a glass of water, promising to drink a glass of water, threatening to drink a glass of water, ordering someone to drink a glass of water, etc.

Speech acts can be split into direct and indirect speech acts, and most languages seem to have these categories in one form or another. Common direct acts include assertions, questions and orders, though their syntactic forms differ across languages.

Indirect speech acts can be more complex to interpret. For example, take the question: "Do you know if Jenny got an A on the test?"

A literal response would be "Yes, I do", but this might be considered rude or curt. A more socially acceptable response might be "Yes, she did", which responds to the implied meaning of the question. Requests and orders are other cases where indirect speech is used, and again the literal meaning can be very far from the implied meaning.
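A toy rule-based tagger for direct speech acts, keyed off surface form as described above, shows both the idea and its limits: indirect speech acts like the "Do you know..." question defeat such rules, which is exactly the point. (The cue lists here are my own illustrative assumptions.)

```python
# Classify direct speech acts from surface form alone (deliberately naive).
IMPERATIVE_CUES = {"please", "open", "close", "drink", "stop"}  # toy list

def direct_speech_act(utterance):
    text = utterance.strip()
    if text.endswith("?"):
        return "question"
    if text.split()[0].lower() in IMPERATIVE_CUES:
        return "order/request"
    return "assertion"

for u in ["Do you know if Jenny got an A?", "Please drink the water.", "Jenny got an A."]:
    print(f"{u!r} -> {direct_speech_act(u)}")
# The first is tagged 'question', but as a request for information it is
# really an indirect speech act, which no surface rule can reveal.
```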

How is pragmatics studied computationally?

Despite the complexity of the analysis, pragmatics has been studied computationally, and computational pragmatics is a thriving area. It typically involves corpus data, context models, and algorithms for context-dependent utterance generation and interpretation.

Computational pragmatics focuses mainly on inference. The pragmatic problems that have been key focus areas are reference resolution and the interpretation and generation of speech acts.

Pragmatics and LLMs

As we have seen, LLMs and contextual word embeddings have brought a massive improvement on semantic tasks, giving these models strong semantic understanding. However, understanding meaning based on context captures only the first level of human interaction. A more advanced and subtle level is understanding the intent underlying sentences, particularly when they are indirect. For example, a sentence such as "The window is still open!" carries a complaining tone, and a listener in context would understand it as such.
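One simple way to probe this is to prompt an LLM directly with such an utterance and ask for the speaker's intent. The sketch below uses the OpenAI Python client; the model name and prompt wording are my illustrative assumptions, and any chat-capable LLM would do.

```python
# Probe an LLM for pragmatic interpretation of an indirect complaint.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

prompt = (
    'Someone walks into a cold room and says: "The window is still open!"\n'
    "What is the speaker implying, and what response do they expect?"
)
response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model name
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```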

This is clearly an interesting and subtle direction of research that will help evaluate the higher-order linguistic capabilities of LLMs. Research suggests that though LLMs perform well on the semantic aspects of text, they lag in interpreting the subtler aspects of human conversational speech. Closing this gap could be important for making LLM-based chatbots interact with humans more effectively.

7. Discourse: study of how language is used in social contexts

Discourse analysis studies how meaning is constructed in different contexts, including social, cultural and political ones. It can be applied to a variety of materials: political speeches, newspaper bias, media coverage, etc.

Methods of Discourse Analysis

Discourse analysis can be divided into four main types: critical discourse analysis, conversation analysis, interactional sociolinguistics, and narrative analysis. Each of these types of discourse analysis has its own set of techniques and applications, and each can be used to gain valuable insights into how language is used to shape our understanding of the world.

Discourse analysis requires a complex and nuanced understanding of text. For example, scholars argue that measuring populism requires context-dependent understanding and knowledge of the country and its government. Discourse analysis includes tasks such as topic segmentation, discourse relation recognition and discourse parsing.

Discourse analysis in the social sciences

Text analysis is a key task across social science disciplines — from sociology, psychology, political science, to communication studies.

Before LLMs, computational text analysis was less successful, requiring extensive manually coded training data. Further, traditional computational methods failed to detect subtle nuances in text such as irony, sarcasm, or subjective and contextual interpretation.

LLM capabilities in discourse analysis

The proficiency of large language models (LLMs) like ChatGPT has been explored on traditional NLP tasks. When it comes to advanced linguistic analysis such as discourse analysis, the issue is less about understanding the black box of LLMs and more about whether LLMs actually demonstrate higher-order text capabilities. There is limited research in this complex and subtle area; the few studies so far have shown that ChatGPT was less successful at the more complex tasks of discourse relation recognition and discourse parsing.

A useful direction for future research would be to analyze LLM outputs for discourse and examine the cases where the LLM and human coders disagree, comparing the motivations provided by the model with those of the coders. Such a process can in itself be informative about the way LLMs 'think' at a higher level.

Wrap up and next articles

If you have persevered with this article, you will have seen that the more advanced areas of linguistics, notably pragmatics and discourse analysis, offer considerable scope for researching LLMs.

Next set of articles

As we have seen, this is an active and cutting-edge area of research. In my next few articles I will focus on current research into LLMs and NLP, specifically in the following areas:

1. Research on the extent to which LLMs have captured semantic and syntactic information

2. Research on LLMs in the areas of pragmatics and discourse. I have just touched the tip of the iceberg on these topics.

Happy reading. I hope you start your own NLP- and linguistics-motivated research. This is a fascinating area, and it is important for NLP experts to collaborate with linguists to understand how LLMs work and to take the work further.

If you like my work consider giving it a clap. Also follow me for more such articles.

References

Most references have been hyperlinked for convenience.
