Natural Language Processing

Rishabh Mall
6 min read · Jan 14, 2019

--

Natural Language Processing (NLP) refers to the AI method of communicating with an intelligent system using a natural language such as English. Processing of natural language is required when you want an intelligent system like a robot to perform as per your instructions, when you want to hear the decision from a dialogue-based clinical expert system, and so on. The field of NLP involves making computers perform useful tasks with the natural languages humans use. The input and output of an NLP system can be −

  • Speech
  • Written Text

Components of NLP:

There are two components of NLP, as given below −

1. Natural Language Understanding (NLU)

Understanding involves the following tasks −

  • Mapping the given input in natural language into useful representations (a small sketch follows this list).
  • Analyzing different aspects of the language.
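
As a deliberately tiny illustration of mapping an utterance into a useful representation, here is a minimal Python sketch. The weather domain, the understand function, and the intent/slot names are hypothetical choices for this example, not something prescribed by NLU itself:

```python
import re

def understand(utterance: str) -> dict:
    """Map a natural-language utterance to a structured representation
    (an intent plus slots) using a couple of illustrative rules."""
    text = utterance.lower().strip()
    # Hypothetical rule for a tiny weather domain.
    if "weather" in text or "forecast" in text:
        match = re.search(r"\bin ([a-z ]+)$", text)
        city = match.group(1).title() if match else None
        return {"intent": "get_weather", "slots": {"city": city}}
    return {"intent": "unknown", "slots": {}}

print(understand("What is the weather in New Delhi"))
# {'intent': 'get_weather', 'slots': {'city': 'New Delhi'}}
```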

2. Natural Language Generation (NLG)

It is the process of producing meaningful phrases and sentences in the form of natural language from some internal representation.

It involves −

  • Text planning − It includes retrieving the relevant content from the knowledge base.
  • Sentence planning − It includes choosing the required words, forming meaningful phrases, and setting the tone of the sentence.
  • Text Realization − It is mapping the sentence plan into sentence structure. (A toy pipeline covering all three stages is sketched after this list.)
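
To make the three stages concrete, here is a toy Python sketch of an NLG pipeline. The knowledge base, the function names and the sentence template are invented for illustration; real NLG systems are considerably more sophisticated:

```python
# Toy knowledge base; the content and field names are invented.
KNOWLEDGE_BASE = {"city": "Mumbai", "temperature_c": 31, "sky": "clear"}

def text_planning(kb: dict) -> dict:
    # Text planning: retrieve the relevant content from the knowledge base.
    return {"city": kb["city"], "temp": kb["temperature_c"], "sky": kb["sky"]}

def sentence_planning(content: dict) -> dict:
    # Sentence planning: choose words and phrases and set the tone.
    return {
        "subject": f"the weather in {content['city']}",
        "description": f"{content['sky']} skies at {content['temp']} degrees Celsius",
    }

def text_realization(plan: dict) -> str:
    # Text realization: map the sentence plan into a concrete sentence.
    return f"Currently, {plan['subject']} shows {plan['description']}."

print(text_realization(sentence_planning(text_planning(KNOWLEDGE_BASE))))
# Currently, the weather in Mumbai shows clear skies at 31 degrees Celsius.
```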

NLU is harder than NLG.

Techniques used in analysing natural language:

There are several main techniques used in analysing natural language. Some of them can be briefly described as follows.

  • Pattern matching -
    The idea here is to interpret input utterances as a whole rather than building up their interpretation by combining the structure and meaning of words or other lower-level constituents. That means interpretations are obtained by matching patterns of words against the input utterance. For a deep level of analysis, pattern matching requires a large number of patterns even for a restricted domain. This problem can be ameliorated by hierarchical pattern matching, in which the input is gradually canonicalized through pattern matching against subphrases. Another way to reduce the number of patterns is to match against semantic primitives instead of words. (A small sketch follows this list.)
  • Syntactically driven parsing -
    Syntax refers to the ways words can fit together to form higher-level units such as phrases, clauses and sentences. Syntactically driven parsing therefore means that the interpretations of larger groups of words are built up out of the interpretations of their syntactic constituent words or phrases. In a way this is the opposite of pattern matching, where the interpretation of the input is obtained as a whole. Syntactic analyses are obtained by applying a grammar that determines which sentences are legal in the language being parsed.
  • Semantic grammars -
    Natural language analysis based on semantic grammars is somewhat similar to syntactically driven parsing, except that in a semantic grammar the categories used are defined semantically as well as syntactically, so semantic knowledge is built into the grammar itself.
  • Case frame instantiation -
    Case frame instantiation is one of the major parsing techniques under active research today. It has some very useful computational properties, such as its recursive nature and its ability to combine bottom-up recognition of key constituents with top-down instantiation of less structured constituents.
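
To illustrate the pattern-matching approach from the first bullet above, here is a minimal Python sketch. The banking domain, the PATTERNS table and the interpret function are hypothetical examples, not drawn from any particular system; they simply show whole-utterance patterns mapping directly to interpretations:

```python
import re

# Hypothetical whole-utterance patterns for a tiny restricted domain.
# Each pattern maps the full utterance directly to an interpretation,
# without building up meaning from individual words.
PATTERNS = [
    (re.compile(r"^what is the balance of account (\d+)\??$", re.I),
     lambda m: {"action": "show_balance", "account": m.group(1)}),
    (re.compile(r"^transfer (\d+) from account (\d+) to account (\d+)$", re.I),
     lambda m: {"action": "transfer", "amount": int(m.group(1)),
                "from": m.group(2), "to": m.group(3)}),
]

def interpret(utterance: str):
    for pattern, build in PATTERNS:
        match = pattern.match(utterance.strip())
        if match:
            return build(match)
    return None  # no pattern matched the utterance as a whole

print(interpret("Transfer 500 from account 111 to account 222"))
# {'action': 'transfer', 'amount': 500, 'from': '111', 'to': '222'}
```

Even in this toy domain, every new phrasing ("move 500 to...", "send 500 from...") needs another pattern, which is exactly the scaling problem that hierarchical matching and semantic primitives try to ease.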

Flow of an NLP Project:


Is > 80% accuracy required?

  • Accuracy in this context is the percentage of records where the correct answer is produced.
  • High accuracy systems generally require much more work to handle many more text variations, and the work gets harder and harder, especially above 80%.
  • Lower accuracy systems are often still useful for large-scale analytics and trend analysis.

Is > 80% coverage required?

  • Coverage in this context is the percentage of applicable records for which an answer is provided. (A short sketch after this list illustrates one way to compute both accuracy and coverage.)

- An “applicable record” is a record which contains text that provides the desired understanding.

  • High coverage systems generally require much more work to handle many more text variations, and the work gets harder and harder above 80% coverage.
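
As a small worked example of these two definitions (the record counts are invented, and this is just one way to interpret the percentages):

```python
# Invented record counts, purely for illustration.
total_records = 1000        # all records processed
applicable_records = 800    # records that actually contain the desired information
answered_records = 600      # applicable records for which the system produced an answer
correct_records = 540       # records where the produced answer was correct

# Coverage: percentage of applicable records for which an answer is provided.
coverage = answered_records / applicable_records    # 0.75 -> 75%

# Accuracy: percentage of records where the correct answer is produced.
accuracy = correct_records / total_records          # 0.54 -> 54%

print(f"coverage = {coverage:.0%}, accuracy = {accuracy:.0%}")
```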

Can you afford substantial time and effort?

  • Of course, “substantial” is relative but is generally many months’ worth of work.
  • Note: Search Technologies can evaluate your content and requirements and provide more precise estimates.

Macro or Micro understanding?

  • See a description of the difference between micro and macro understanding here.

Is training data available?

  • Training data is typically required to train statistical models for many types of understanding (a brief sketch follows this list).
  • Training data may already be available if:

- The system is replacing a process which was previously done manually

- The system is filling gaps for manually entered values, for example, by end users filling out forms

- Public training data is available

- Log data from end-user interactions can sometimes be used to infer training data

- Appropriate third party metadata is available for a portion of the content
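
If labelled records do already exist from one of the sources above (for example, values entered manually in the past), they can often be fed straight into a standard statistical model. A minimal sketch, assuming scikit-learn is available; the categories and texts are invented:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny invented training set standing in for historical, manually labelled records.
texts = [
    "Please reset my password",
    "I cannot log in to my account",
    "Invoice 4521 was charged twice",
    "Requesting a refund for last month's bill",
]
labels = ["account", "account", "billing", "billing"]

# Bag-of-words features plus a simple linear classifier.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)

print(model.predict(["Why was I charged twice this month?"]))  # likely ['billing']
```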

Can you afford to manually create training data?

  • If training data is not available and cannot be inferred, then it will need to be created manually for most types of macro understanding.
  • Depending on the scope of the project, this can be done by just a few people, or it could require a larger team perhaps even using a crowd-sourcing model.

Is the text short or long?

  • Generally, short text contains fewer variations and less complex sentence structures and is, therefore, easier to process for micro understanding.

Is the text fairly regular / narrow domain?

  • This question often has more to do with the authors of the text than the text itself.
  • If the authors are similar across the board, then they will typically produce fairly regular text that spans a fairly narrow domain.

- Examples include employees, airline pilots, Java programmers, maintenance engineers, librarians, users trained on a certain product, contract lawyers, etc.

  • On the other hand, if the authors cover a wide range of backgrounds, education levels, and language skills, then typically they will produce a wide range of text variation across a wide domain. This will be difficult to process.

Is the text academic, journalistic, or narrative?

  • Text which is written by professional writers tends to be longer and more varied and to have more complex sentence structure, all of which is harder for a machine to understand.

Is there a human in the loop?

  • In some applications, there will be human review of the results. Such applications will generally be more tolerant of errors produced by the understanding algorithms.
  • For example, a system may extract statements which indicate compliance violations. These would, of necessity, need to be checked by a compliance officer to determine if a rule was actually violated.

Is entity extraction the only requirement?

  • Applications that only extract entities are much easier to create than those that extract more sophisticated understanding, such as facts, sentiment, relationships, or multiple coordinated metadata values.

Do you have known entities?

  • Known entities come from entity lists that have been gathered ahead of time. These can be things like employees (from an employee directory), office locations, public companies, countries, etc. (A small lookup sketch follows this list.)
  • Unknown entities are not previously known to the system. These can include people names, company names, locations, etc. Unknown entities can only be determined by looking at names in context.
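
A rough sketch of known-entity extraction: the system simply matches the text against lists gathered ahead of time. The lists, names and the tag_known_entities function here are invented for illustration:

```python
# Invented entity lists ("gazetteers") gathered ahead of time.
KNOWN_ENTITIES = {
    "office_location": {"London", "Mumbai", "New York"},
    "public_company": {"Infosys", "Siemens"},
}

def tag_known_entities(text: str):
    """Return (name, entity_type) pairs for every known entity found in the text."""
    lowered = text.lower()
    found = []
    for entity_type, names in KNOWN_ENTITIES.items():
        for name in names:
            if name.lower() in lowered:
                found.append((name, entity_type))
    return found

print(tag_known_entities("Siemens is opening a new office in Mumbai"))
# [('Mumbai', 'office_location'), ('Siemens', 'public_company')]
```

Unknown entities, by contrast, cannot be found by lookup; they have to be recognised from the surrounding context, which is where the tagged examples discussed next come in.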

Is entity-tagged text available?

  • Unknown entities will need to be determined based on context (e.g. the words around the entity). This can be done statistically if sufficient tagged examples exist.
  • Sometimes (rarely) entity-tagged text can come from public sources. Other times it may come from tagging in the content (e.g. embedded HTML markup) or by observing how users interact with the content, for example through cut and paste.

Can you afford to manually create entity-tagged examples?

  • If entity-tagged text is not available, then it may need to be created manually. This can be an expensive process, since usually 500+ examples are required to achieve good accuracy, and as many as 2,000 to 5,000 examples may be needed to reach > 80% accuracy. (An example of what a hand-tagged record can look like follows below.)
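
For reference, hand-created entity-tagged examples are often written as token-level IOB (inside / outside / beginning) tags, as in the invented example below; a statistical tagger is then trained on hundreds or thousands of records like this one:

```python
# One invented hand-tagged sentence as (token, tag) pairs in IOB format.
# B- marks the beginning of an entity, I- its continuation, O a non-entity token.
tagged_example = [
    ("Acme", "B-ORG"),
    ("Corporation", "I-ORG"),
    ("hired", "O"),
    ("Priya", "B-PER"),
    ("Sharma", "I-PER"),
    ("in", "O"),
    ("Bangalore", "B-LOC"),
]

for token, tag in tagged_example:
    print(f"{token}\t{tag}")
```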
