Understanding Human Language

Karolina Jozefowicz
Nov 4 · 6 min read
Photo by Dmitry Ratushny on Unsplash

Natural Language Processing, also known as NLP, is a branch of computer science and artificial intelligence that focuses on enabling computers to interpret human language. This technology tries to represent human words in a form from which computers can extract meaning and make decisions based on the information provided. NLP algorithms apply natural language rules to unstructured data to convert it into a form the machines can understand. Processing natural language, or text data in general, is a task full of problems that are not easily solved. A sentence can take on a completely different meaning depending on punctuation in written text, or on which part of the sentence is stressed in speech. Therefore, you have to be very careful while working with text not to lose the essence of the message. Below you will find out how the mechanism of processing the data into the appropriate form works and what the current applications of NLP are.


Main tasks in NLP

Punctuation saves lives
  1. Tokenization — one of the first things we need to do with text data is separate the words from one another. This step is generally the first basic cleaning of the data. In the tokenization process we segment sentences into smaller pieces — tokens. During this process we also identify punctuation characters, apostrophes and hyphens and decide whether to get rid of them, or whether they actually change the meaning of the message and should be taken into account later. In this phase of dealing with text data we also get rid of so-called ‘stop words’ — the most common pronouns and prepositions, which carry little information about the sense of the whole message. There are many different stop-word lists that can be used for pre-processing text, but you have to be careful when applying them not to wipe out relevant information.
  2. Stemming — in this process we reduce words to their root forms. Here we focus mainly on removing the endings that indicate a word is not in its base form, such as the ‘s’ at the end of plural nouns. This will not always work (think of the word ‘news’), as slicing words may change their meaning or produce non-existent words. Overstemming the text can increase recall while reducing precision, due to over-generalisation. Nevertheless, it is a good way to improve the performance of an NLP model, as it is very easy to use and can help correct spelling errors in tokenized words.
  3. Lemmatization — this process is very closely related to stemming: we reduce the number of distinct words by grouping different forms of the same word and converting them to their most basic version. However, lemmatization is more complex — it uses the morphological analysis of each word to understand its meaning in a specific context. It is also connected with part-of-speech tagging — a very important step if we want to understand what the whole sentence is about, as each word in a sentence has its own role. Lemmatization requires much more knowledge about the language than stemming, but the simplest way to apply it is a dictionary lookup.
  4. Vectorization — to work further on our dataset we need to convert the text into numerical data. There are many methods for doing this, but the most popular ones are bag of words, TF-IDF and Word2Vec.
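Steps 1–3 above can be sketched in a few lines of plain Python. This is only an illustration: the stop-word list, the suffix rules standing in for a real stemmer such as Porter's, and the lemma dictionary are all invented for this example, and real pipelines would use a library like NLTK or spaCy instead.

```python
import re

# Tiny illustrative stop-word list (real lists are much longer).
STOP_WORDS = {"the", "a", "an", "is", "are", "on", "and"}

# Minimal lemma dictionary for the dictionary-lookup approach
# mentioned above (entries invented for this example).
LEMMA_DICT = {"mice": "mouse", "running": "run", "mats": "mat", "better": "good"}

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]

def naive_stem(token):
    """Strip a few common suffixes -- a crude stand-in for real stemming."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def lemmatize(token):
    """Dictionary-lookup lemmatization; fall back to the token itself."""
    return LEMMA_DICT.get(token, token)

tokens = remove_stop_words(tokenize("The mice are running on the mats"))
print(tokens)                           # ['mice', 'running', 'mats']
print([naive_stem(t) for t in tokens])  # ['mice', 'runn', 'mat'] -- note 'runn'
print([lemmatize(t) for t in tokens])   # ['mouse', 'run', 'mat']
```

Note how the stemmer produces the non-existent word ‘runn’ while the lemmatizer returns the real base form ‘run’ — exactly the trade-off between the two approaches described above.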
PCA projection of the vectors of countries and their capital cities.

Bag of words is exactly what you might think of based on the name — it collects every word from the document into a vector representation without any knowledge of their order. We then perform count vectorization to calculate the frequency of each word in the corpus. The TF-IDF method is more sophisticated than bag of words. It weighs the importance of words based on their frequency: a longer sentence will repeat similar words more times than a short one, and less frequent words are usually more informative than frequent ones. Word2Vec is a model that learns dense vector representations of words; a widely used version was pretrained by Google on the Google News dataset. Similarity of meaning is then measured as the cosine similarity between word vectors in the feature space.

5. Topic modeling — after pre-processing our text data we can try to capture the hidden topics in the document and apply advanced analytics such as forecasting or optimization. If we can discover those unseen structures, we will be able to understand the meaning of the message. Currently the most common technique for topic modeling is Latent Dirichlet Allocation — an unsupervised learning method which finds groups of related words and tries to uncover the underlying sense in a collection of documents. The main assumption of this method is that each document covers a small, finite number of topics, each characterized by specific words that are used most frequently and are therefore associated with it.

Example of topic modeling visualization with the pyLDAvis library. The model was trained on a set of different hotel reviews. Each circle on the chart represents a topic whose size is proportional to its frequency in the documents.
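A Latent Dirichlet Allocation model like the one behind the visualization above can be fitted with scikit-learn. The sketch below uses an invented four-review corpus and two topics purely for illustration; a real model would be trained on many more documents:

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Tiny corpus with two rough themes: rooms and breakfast (invented data).
docs = [
    "room was clean and the bed comfortable",
    "clean room comfortable bed friendly staff",
    "breakfast buffet had great coffee and food",
    "great food and coffee at the breakfast buffet",
]

# LDA works on word counts, so we start from a bag-of-words matrix.
counts = CountVectorizer(stop_words="english").fit_transform(docs)

# LDA assumes each document mixes a small number of latent topics.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)

# Each row is a probability distribution over the two topics.
print(doc_topics.shape)        # (4, 2)
print(doc_topics.sum(axis=1))  # each row sums to ~1.0
```

The resulting `doc_topics` matrix (together with `lda.components_`, the topic–word weights) is exactly what pyLDAvis takes as input to draw charts like the one described above.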

Applications of NLP

Below you can find examples of current applications of Natural Language Processing in real life, so you can see how they are changing our world.

  1. E-mail assistance — grammar and spell checking are functions that are very useful in everyday life. It wouldn’t be possible for computers to correctly suggest auto-correct or auto-complete text if it weren’t for NLP. Natural Language Processing is also helpful in spam detection — algorithms determine which emails users are likely to want to keep and which should be filtered out.
  2. Machine translation tools — you may have noticed that a few years ago online translations sometimes resulted in awkward statements. Over time they have been getting better and better, helping people around the world overcome communication barriers.
  3. Chat bots — 85% of customer interactions are predicted to be handled by chatbots by 2020. They help people resolve personalized inquiries in much less time, as they are available 24 hours a day.
  4. Sentiment Analysis — NLP helps improve sales and marketing strategy, as it allows marketers to gauge the emotions customers associate with a brand and its products.
  5. Health care industry — a number of different NLP applications are also part of the healthcare industry. They can provide better support, for example in Alzheimer’s disease, with a companion chatbot that keeps up routine conversation and tracks the progression of memory loss. Another tool that uses NLP is a therapy chatbot app for dealing with anxiety.
  6. Conversation interfaces — nowadays many people use Siri, Alexa or Google Assistant in everyday life, but what if they could be improved to the point where they manage your appointments and make sure the restaurant you want to visit is actually open on the holiday? Google has introduced Duplex, a system designed to handle those kinds of tasks for you.

Written by Karolina Jozefowicz, Data Scientist with a background in Aeronautical Engineering
