Introduction to Computational Linguistics

Manas Ranjan Kar
Published in NLP Wave, Dec 30, 2016

The content borrows heavily from lectures of Prof. Dipti Misra Sharma at IASNLP 2016 held at LTRC, Hyderabad, in collaboration with Google and IIIT. I was a participant at the advanced summer school, which dealt with varied topics in computational linguistics, QA systems, machine learning and speech recognition.

Language is a unique ability of humans. In some ways, we can say language encodes information that is being communicated. We apply the process of analysis (decoding) for understanding and synthesis (encoding) for expression or speaking. Language is mighty helpful in transferring information and ideas.

Communication involves the transmission of information, intention (the purpose of the communication) and emphasis (the focus or aspect). Communication is by its very nature varied: the same thing may be said or written in multiple ways while conveying the same information. That variation encapsulates the challenge of NLP and CL, which is to handle it with acceptable accuracy.

Communication involves multiple linguistic elements or entities. They are:

  • Words
  • Sentences: arrangements of words in a certain order, which together yield a cohesive, composite meaning.
  • Discourse: an arrangement of sentences. Sentences, or parts of sentences, relate to each other and together provide a cohesive meaning.

The problem facing practitioners who try to decode languages and extract meaning is this: languages differ in the way they organize information across these entities. Adding to the difficulty and ambiguity, all these levels interact in organizing information. Every level (words, sentences, discourse) has its own complexity, so solving one level is often not enough.
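The three levels can be pictured as nested structures. The snippet below is only a rough sketch: the splitting rules are deliberately naive, and the example text and function names are made up for illustration.

```python
import re

def split_sentences(discourse):
    # Naive rule: a sentence ends at '.', '!' or '?' followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", discourse.strip()) if s]

def split_words(sentence):
    # Naive rule: a word is any run of letters, digits or underscores.
    return re.findall(r"\w+", sentence)

# Hypothetical two-sentence discourse.
discourse = "Ravi lost his wallet. He found it later."

# Discourse -> list of sentences -> list of words.
structure = [split_words(s) for s in split_sentences(discourse)]
for words in structure:
    print(words)
```

Even in this tiny example the levels interact: "He" and "it" in the second sentence can only be interpreted by looking back at the first sentence, which is a discourse-level relation that word-level or sentence-level processing alone would miss.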

Technically, Computational Linguistics (CL) is the scientific study of language from a computational perspective. The scientific component provides explanations for linguistic or psycholinguistic phenomena. The computational component develops computational models and techniques for those phenomena. At its core, the subject under study is human language. The field requires a synthesis of knowledge from computer science and linguistics.

A computational linguist needs to understand:

  • How human language works
  • What information is available in the language
  • How languages encode information
  • How this knowledge or information can be represented and engineered for computational processing

As daunting as it may sound, a step-by-step approach is recommended. What is nearly impossible for computers, or for that matter any other species, comes easily to a three- or four-year-old child, who can reason with and use language far more effectively. A much-contested claim from the famed linguist Noam Chomsky is that humans are biologically endowed to use language. That would explain why babies and young children pick up on cues so much faster.

In many languages, there are multiple interpretations of the same sentence. That leads to ambiguity.

For example, in Hindi:

Mujhko tumko dus rupaye dene hain

may mean

  • I have to give you ten rupees.
  • You have to give me ten rupees.
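One way to see where the two readings come from is to enumerate them. The sketch below assumes the ambiguity arises because both pronouns ("mujhko", "tumko") carry the same "-ko" marker, so the marking alone does not say who gives and who receives. That analysis, the role labels and the function name are assumptions made for illustration.

```python
from itertools import permutations

# Rough English glosses for the two ko-marked pronouns (illustrative only).
GLOSS = {"mujhko": "I/me", "tumko": "you"}

def readings(ko_marked_words):
    # Assumption: either ko-marked word can fill either role,
    # so every ordering of the words is a candidate reading.
    roles = ("giver", "recipient")
    return [dict(zip(roles, order)) for order in permutations(ko_marked_words)]

for reading in readings(["mujhko", "tumko"]):
    giver = GLOSS[reading["giver"]]
    recipient = GLOSS[reading["recipient"]]
    print(f"giver = {giver}, recipient = {recipient}")
```

A human listener picks one of the two candidates from context; a program that sees only the sentence has no basis for choosing.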

Languages encode information differently, and sometimes encode it only partially. This creates a tension between brevity and precision. Brevity usually wins, which leads to inherent ambiguity at different levels. Precision requires a more explicit sentence, which speakers rarely produce.

Ambiguity can occur at the lexical level or at the sentence level.

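Lexically, for example:

  1. It was a kind act.
  2. The hero dies in Act four.

Here "act" means a deed in the first sentence and a division of a play in the second. This is called polysemy. Structural ambiguity, as in the Hindi sentence above, is also prevalent.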

Humans use world knowledge, context (linguistic and extra-linguistic), cultural knowledge and language conventions for disambiguation. Computational linguistics aims to provide all this knowledge to a machine. This knowledge is provided through machine-readable lexicons, annotated corpora, knowledge bases, etc.
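To make the idea of a machine-readable lexicon concrete, here is a minimal sketch that disambiguates the word "act" by counting overlaps between the sentence and cue words stored for each sense. It is loosely in the spirit of dictionary-overlap methods such as Lesk. The lexicon, its cue words and the sense labels are all invented for this illustration and do not come from any real resource.

```python
import re

# Hypothetical toy lexicon: each sense of "act" lists a few hand-picked cue words.
TOY_LEXICON = {
    "act": {
        "deed": {"kind", "brave", "cruel", "generous", "selfless"},
        "division_of_play": {"hero", "scene", "play", "stage", "four", "dies"},
    }
}

def tokenize(sentence):
    # Lowercase the sentence and keep alphabetic runs only.
    return re.findall(r"[a-z]+", sentence.lower())

def disambiguate(word, sentence, lexicon=TOY_LEXICON):
    # Context = the other words in the sentence.
    context = set(tokenize(sentence)) - {word}
    # Score each sense by how many of its cue words appear in the context.
    scores = {sense: len(cues & context) for sense, cues in lexicon[word].items()}
    # On a tie, max() returns the sense listed first in the lexicon.
    return max(scores, key=scores.get)

print(disambiguate("act", "It was a kind act."))
print(disambiguate("act", "The hero dies in Act four."))
```

With this toy lexicon the first sentence resolves to the "deed" sense and the second to "division_of_play". Real systems replace the hand-written cue words with much larger lexicons and annotated corpora, but the principle of supplying the machine with the knowledge a human reader already has is the same.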
