Natural language processing (NLP), often referred to as text analytics, has become a pivotal element of digital transformation. Although academic experiments and projects have been running since the 1950s, real-world implementations have mostly appeared in roughly the last decade. Two things made this possible: the sheer volume of text data (particularly user-generated data) arriving through channels such as social media apps, e-commerce websites, blog posts, ITSM systems, and insurance and loan processing workflows, and the significant improvement in the computational power available to run the underlying models.
Each NLP system is designed to take raw, unprocessed data, break the analysis into smaller sequential problems, and solve each of those problems individually. The individual problems range from something as simple as breaking the data into sentences and words to something as complex as understanding what a word means based on the words in its “neighborhood”.
The foundation of any text processing is understanding the encoding used for the raw data. Computers can handle numbers directly and store them in registers, but they cannot store non-numeric characters as is. Letters and special characters must first be converted to numeric values before they can be stored. Hence the concept of encoding came into existence.
One of the earliest widely adopted encoding standards was ASCII (American Standard Code for Information Interchange), first published in 1963. Over time, computers had to represent more and more languages, which brought many new characters. ASCII, a 7-bit code with only 128 code points, could not accommodate them. The Unicode standard, with encodings such as UTF-8 and UTF-16, was introduced to address this. It aims to cover the writing systems of the world’s languages, both modern and historical. For someone working on text processing, knowing how to handle encoding is crucial. It is also important to understand how many bits each encoding uses to store a character, because this directly affects how much storage the data needs.
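As a minimal sketch of these storage differences, the Python snippet below uses only the built-in `ord` and `str.encode`. The sample characters and the helper names `byte_lengths` and `fits_in_ascii` are illustrative choices for this example, not part of any standard.

```python
# Illustrative sample characters (chosen for this sketch): a basic Latin letter,
# an accented letter (U+00E9), the euro sign (U+20AC), and an emoji (U+1F600).
SAMPLES = ["A", "\u00e9", "\u20ac", "\U0001F600"]


def byte_lengths(ch: str) -> dict:
    """Return the code point of `ch` and the bytes it needs in UTF-8 and UTF-16."""
    return {
        "code_point": ord(ch),
        "utf-8": len(ch.encode("utf-8")),
        # "utf-16-le" avoids the 2-byte byte-order mark that plain "utf-16" prepends.
        "utf-16": len(ch.encode("utf-16-le")),
    }


def fits_in_ascii(ch: str) -> bool:
    """Return True if `ch` can be represented in ASCII."""
    try:
        ch.encode("ascii")
        return True
    except UnicodeEncodeError:
        return False


for ch in SAMPLES:
    # ascii() escapes non-ASCII characters so the output is safe on any console.
    print(ascii(ch), byte_lengths(ch), "fits in ASCII:", fits_in_ascii(ch))
```

Running it shows that UTF-8 is a variable-width encoding: the same text can take a different number of bytes depending on which characters it contains and which encoding is chosen.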
Once you know the underlying encoding, you can convert the raw text into another encoding if necessary. This sets the stage for natural language processing or text analytics. It is important to note that NLP is a step-by-step process that applies solutions ranging from low level to high level, where the level reflects how much data complexity the solution can handle. The processing may take a single phase or multiple phases, depending on the complexity of the data and the amount of modification the text needs in terms of format, syntax and semantics.
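Converting between encodings can be sketched as a decode step followed by an encode step. The example below assumes the source encoding is already known (Latin-1 here, chosen only for illustration); `transcode` is a hypothetical helper name, not a library function.

```python
def transcode(raw: bytes, source_encoding: str, target_encoding: str = "utf-8") -> bytes:
    """Decode bytes using the known source encoding, then re-encode them."""
    text = raw.decode(source_encoding)   # bytes -> str (Unicode code points)
    return text.encode(target_encoding)  # str -> bytes in the target encoding


# Assumed input for this sketch: the word "café" stored as Latin-1 bytes.
legacy_bytes = "caf\u00e9".encode("latin-1")
utf8_bytes = transcode(legacy_bytes, "latin-1")

print(legacy_bytes, "->", utf8_bytes)
```

If the declared source encoding is wrong, `decode` may raise `UnicodeDecodeError` or silently produce garbled text, which is why identifying the encoding comes first.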
This journey of getting meaning out of raw data can be divided into three broad steps:
Lexical Processing: This step focuses on converting raw text into words and, depending on your application’s needs, into sentences or paragraphs as well. The idea is to extract the words that represent the central meaning of a sentence or paragraph. For example, if an email contains words such as lottery, prize and luck, then the email can be represented by these words, and it is likely to be spam. This step is also the place for basic normalization, for example treating plural words as equivalent to their singular form.
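The lexical step described above can be sketched with the standard library alone. This is a deliberately naive illustration: the regular-expression tokenizer and the suffix-stripping rule in `naive_singular` are assumptions made for this example, and real systems typically rely on a proper stemmer or lemmatizer.

```python
import re

# Keywords taken from the spam example in the text.
SPAM_KEYWORDS = {"lottery", "prize", "luck"}


def tokenize(text: str) -> list:
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def naive_singular(word: str) -> str:
    """Very rough plural-to-singular rule; it will mishandle irregular words."""
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"          # lotteries -> lottery
    if word.endswith("s") and len(word) > 3 and not word.endswith(("ss", "us", "is")):
        return word[:-1]                # prizes -> prize
    return word


def spam_keyword_hits(text: str) -> set:
    """Return the spam keywords found in the text after normalization."""
    normalized = {naive_singular(token) for token in tokenize(text)}
    return normalized & SPAM_KEYWORDS


# Hypothetical email body used only for this sketch.
email = "You won the LOTTERY! Claim your prizes today and test your luck."
print(sorted(spam_keyword_hits(email)))
```

A classifier built on such features would still need a rule or a model to decide how many hits make an email spam; that decision is outside this sketch.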
But this alone is not enough for more advanced problems. In the real world, extracting the words will not suffice; we also need to differentiate between text structures that look very similar. For example, the phrases “My friend’s daughter” and “My daughter’s friend” have very different meanings. However, lexical processing on its own will treat the two as equal, because the “group of words” in both is the same. A solution based only on lexical processing will therefore fail in conversational UIs or chatbots, where word meaning and parts of speech (POS) also matter. Hence, a more advanced form of analysis is required.
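The limitation can be made concrete with a bag-of-words comparison. The sketch below assumes a simple alphabetic tokenizer (so the possessive “s” becomes its own token); the function names are illustrative.

```python
import re
from collections import Counter


def tokens(text: str) -> list:
    """Lowercase the text and split it into alphabetic tokens, keeping order."""
    return re.findall(r"[a-z]+", text.lower())


def bag_of_words(text: str) -> Counter:
    """Count tokens while discarding their order."""
    return Counter(tokens(text))


a = "My friend's daughter"
b = "My daughter's friend"

same_bag = bag_of_words(a) == bag_of_words(b)  # order-insensitive comparison
same_order = tokens(a) == tokens(b)            # order-sensitive comparison
print("same bag of words:", same_bag)
print("same token sequence:", same_order)
```

The two bags compare equal while the token sequences do not, which is exactly the information a purely lexical representation throws away.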
Syntactic Processing: This is a step beyond lexical processing. Here the focus is on extracting more meaning from the sentence by using its syntax. Instead of looking only at the words, we look at the syntactic structures, i.e., the grammar of the language, to understand the meaning.
One example is differentiating between the subject and the object of a sentence, i.e., identifying who performs the action and who is affected by it. “Raj gifted Rohit” and “Rohit gifted Raj” are sentences with different meanings. A syntactic analysis based on a sentence’s subject and object is able to make this distinction.
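To illustrate the idea only, the toy sketch below assigns roles purely by position in a three-word subject-verb-object sentence. This is an assumption made for the example, not how a real parser works; production systems use a dependency or constituency parser. `Triple` and `naive_svo` are hypothetical names.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str
    verb: str
    obj: str


def naive_svo(sentence: str) -> Triple:
    """Toy rule: in a three-word English sentence, read the words as S, V, O."""
    words = sentence.strip(".!? ").split()
    if len(words) != 3:
        raise ValueError("this toy rule only handles three-word sentences")
    return Triple(*words)


first = naive_svo("Raj gifted Rohit")
second = naive_svo("Rohit gifted Raj")

# Same set of words, but the roles are swapped.
print(first)
print(second)
```

Unlike a bag of words, the two results differ because each word is tied to a grammatical role.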
Syntactic analysis can enhance our understanding in other ways too. For example, consider “book a taxi”, “buy a book” and “movie tickets booked”. Each contains the word “book”, but it is used in a different context and for a different purpose: as a verb in the first and third phrases and as a noun in the second.
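A hand-written rule is enough to separate these three phrases, though it is only a sketch. The rule below (a preceding determiner signals a noun) and the name `role_of_book` are assumptions for illustration; real part-of-speech taggers are statistical or neural models trained on annotated text.

```python
DETERMINERS = {"a", "an", "the"}


def role_of_book(phrase: str) -> str:
    """Toy rule for the word 'book': NOUN after a determiner, otherwise VERB."""
    tokens = phrase.lower().split()
    for i, token in enumerate(tokens):
        if token == "booked":
            return "VERB"  # the past-tense form is treated as a verb here
        if token == "book":
            if i > 0 and tokens[i - 1] in DETERMINERS:
                return "NOUN"
            return "VERB"
    raise ValueError("phrase does not contain 'book' or 'booked'")


for phrase in ["book a taxi", "buy a book", "Movie tickets booked"]:
    print(phrase, "->", role_of_book(phrase))
```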
An important point here is that raw text may come from any of the world’s languages, or from a mix of several, which means that different language constructs apply depending on the language in scope. It is unwise to expect the content to always be in English, and the complexity and challenges increase when handling multilingual text.
While lexical and syntactic processing help with many kinds of text analytics, they are still not sufficient for advanced NLP applications. That brings us to the third step of NLP: semantic processing.
Semantic Processing: In every language, people use many different forms of the same term, for example Database or DB, Chief Minister or CM, HRD or Human Resource Development. NLP systems therefore need the capability to identify synonyms, antonyms and similar relations on their own. It is impractical to enumerate all such equivalent terms by hand and maintain a database of them.
These cases are typically handled by inferring a word’s meaning from the collection of words that usually occur around it. So, if CM and Chief Minister occur very frequently around similar words (for example election, poll, state, assembly), you can assume that their meanings are similar as well. Domain knowledge becomes a key factor here when the raw data is being processed for a specific domain such as retail, insurance or healthcare.
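The “words that occur around it” idea can be sketched by counting neighbors within a small window and comparing the resulting count vectors with cosine similarity. Everything specific below is an assumption for illustration: the six-sentence corpus is invented, “chief minister” is merged into one token by a hard-coded replacement, and the stopword list and window size are arbitrary. Real systems learn such representations from large corpora.

```python
import math
import re
from collections import Counter, defaultdict

STOPWORDS = {"the", "after"}  # tiny illustrative list

# Invented toy corpus for this sketch.
CORPUS = [
    "The CM addressed the state assembly after the election",
    "The Chief Minister addressed the state assembly after the poll",
    "The CM announced the poll results",
    "The Chief Minister announced the election results",
    "The database stores customer records",
    "The DB stores customer orders",
]


def tokenize(sentence: str) -> list:
    """Lowercase, merge the known multi-word term, and drop stopwords."""
    sentence = sentence.lower().replace("chief minister", "chief_minister")
    return [t for t in re.findall(r"[a-z_]+", sentence) if t not in STOPWORDS]


def context_vectors(sentences, window: int = 2) -> dict:
    """Map each word to a Counter of the words seen within `window` positions."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = tokenize(sentence)
        for i, word in enumerate(tokens):
            start = max(0, i - window)
            end = min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


vectors = context_vectors(CORPUS)
sim_cm_minister = cosine(vectors["cm"], vectors["chief_minister"])
sim_cm_database = cosine(vectors["cm"], vectors["database"])
print("cm vs chief_minister:", round(sim_cm_minister, 2))
print("cm vs database:", round(sim_cm_database, 2))
```

On this toy corpus the abbreviation and its expansion share most of their neighbors, so their similarity is higher than that between “cm” and “database”, which share none.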
Once you have the meaning of the words, obtained via semantic analysis, you can use it for a variety of applications.
In summary, machine translation, chatbots and many other applications require a complete understanding of the text, from the lexical level through syntax to meaning. In most of these applications, lexical and syntactic processing simply form the “pre-processing” layer of the overall process, while for some simpler applications lexical processing alone is enough. The final NLP solution will depend on the business problem at hand, the complexity and variation in the raw data, and the available domain expertise.