How to Approach your NLP-Related Problem: A Structure Guide

Oksana Tkach
Aug 24, 2022


Very often people ask me for an NLP consultation for their business projects but struggle to describe where exactly they need help. This gets even harder when someone has taken one NLP course and knows some terminology but is applying it in the wrong places. To make sense of what people want, over the years I’ve developed the following structure for approaching NLP in business.

1. What is your business problem?

Try to think of your problem in practical terms, not in terms of NLP. In very simplified terms, a business problem is when you are losing value or not creating as much value as you need. In my experience, AI tends to save resources rather than generate value. So people turn to AI to automate or speed up some work they would otherwise pay for.

Let’s look at this example. We have an SEO company. The company employs copywriters who write articles that mention particular keywords. The goal is to fill websites with content so that Google shows them higher in its search rankings. The company decides they can’t afford to pay copywriters and they would like to somehow automate the creation of those SEO-friendly articles.

2. What is the NLP problem?

A business problem can easily be translated into an NLP problem. An NLP problem isn’t defined in terms of saving resources or generating value; it’s defined in linguistic terms. Having something described in linguistic terms makes it much easier to find the NLP task later on.

Let’s say you trade stocks and you want me to build some software that analyzes the news and tells you what some publicly traded company is doing with its business on a particular day. Doing that research manually is time-consuming. That’s your business problem. The NLP problem is to get a computer to identify specific linguistic markers of whether the company is doing well or badly that day. What sort of phrases are we looking for? How would we go about extracting them? What other linguistic markers can be useful (like the tone/mood of the article)? And so on.

In our example, the SEO company needs to figure out how to generate text without human intervention. Not only that, they also need the text to be about a particular topic and contain specific keywords.

3. What is the NLP task?

NLP tasks are very well defined within the field. I like to separate two different groups of NLP tasks: basic and complex.

Basic tasks are rarely the solution to your NLP problem. Normally, they are at the basis of feature extraction. Many of them are also “solved”, meaning that for many languages we already have models that give decent results for the basic tasks. These include:

  • tokenization
  • POS-tagging
  • lemmatization
  • word representation (e.g. embeddings)

etc.

Normally, complex tasks use the basic tasks as a part of pre-processing, and will have multiple parts and pipelines. Some of these are:

  • Named Entity Recognition
  • text mining
  • machine translation
  • question answering
  • Natural Language Understanding

etc.

Coming back to our example, the NLP task the SEO company is trying to solve is Natural Language Generation, or text generation.

4. What is the best NLP method to use?

AI as a field of research, and NLP in particular, includes these methods:

  • Dictionary-based (simple substring matching).
  • Rule-based (regular expressions and various transducers).

A transducer in NLP is a deterministic model that takes in some string and modifies it according to the rules you coded. Unlike a statistical model, a transducer has a predictable, controllable output.
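
As a rough illustration of that predictability, here is a minimal sketch of a rule-based rewriter. It is a simplified stand-in for a real finite-state transducer: the rule list and the `transduce` helper are hypothetical names made up for this example, and the rules only cover a few English contractions.

```python
import re

# Hypothetical rewrite rules as (pattern, replacement) pairs, applied in order.
# More specific rules come first so the generic "n't" rule does not mangle them.
RULES = [
    (re.compile(r"\bcan't\b", re.IGNORECASE), "cannot"),
    (re.compile(r"\bwon't\b", re.IGNORECASE), "will not"),
    (re.compile(r"n't\b", re.IGNORECASE), " not"),
]


def transduce(text: str) -> str:
    """Apply every rule in order. The same input always yields the same output."""
    for pattern, replacement in RULES:
        text = pattern.sub(replacement, text)
    return text


print(transduce("I can't go and she won't, but they don't mind."))
```

Because the output depends only on the rules you wrote, you can inspect and change its behavior rule by rule, which is the kind of control a statistical model does not give you.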

Note that the two methods above aren’t really a part of data science because they are heuristic rather than analytical. But they are still a part of AI. The two methods below are where machine learning starts.

  • Statistical (hidden Markov models, maximum entropy, support vector machines, etc.)

I also refer to these as simpler machine learning methods. They are faster and simpler to train and require less data than neural networks to give some results. These can give workable results when your task has low variability (like very obvious linguistic patterns).

  • Neural networks (LSTMs, though rarely used now, as well as CNNs and transformers).

To get better oriented, you can think of neural networks as the same ideas and concepts as the simpler machine learning methods, but reinforced by tons of computational power and data.

Now, this is the part where people often get confused. Any NLP task may be solved by any one of these methods. The methods above are ranked in ascending order by complexity, performance, and the amount of data you’ll need. The dictionary-based method is easy to code and it doesn’t require any data, but it will have very, very low recall. However, it would still be considered AI.

Regardless of your NLP task, the method you choose depends on:

  • how much data you have;
  • how much time you have;
  • how much money you can pay for the development;
  • how much control over the output you need.

Let me dwell on the last point a little more. For something like a chatbot, you can use a neural network to develop it. But it will have unpredictable outputs (you don’t always know how the chatbot will reply). If you are using a chatbot for sales, you need it to stick to a particular rhetoric, such as trying to sell the user some shoes. Because of this, chatbots are normally developed using simpler methods, most often the rule-based method. Even if you have the data, time, and money, sometimes for your business purposes you need to “dumb down” the NLP solution in order to control it.

Named Entity Recognition is a great example here because a NER application can use all of these methods at once.

Named Entity Recognition is the task of extracting named entities from a string of text. Usually people want the computer to identify company names, people’s names, countries, dates, amounts, etc.

Since there is a limited number of countries in the world, you can just use the dictionary-based method for this. Compile a list of all possible countries and look for them in your input text.
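
A minimal sketch of that dictionary lookup, assuming a tiny hand-written country list (a real one would cover every country plus common aliases). The names `COUNTRIES` and `find_countries` are made up for this example.

```python
import re

# A tiny illustrative list, not a complete one.
COUNTRIES = ["Ukraine", "France", "United States", "Japan"]

# Build one pattern from the list. Longer names go first so that a multi-word
# name is preferred over any shorter name it might overlap with.
_COUNTRY_PATTERN = re.compile(
    r"\b("
    + "|".join(re.escape(name) for name in sorted(COUNTRIES, key=len, reverse=True))
    + r")\b"
)


def find_countries(text: str) -> list:
    """Return every listed country name found in the text, in order of appearance."""
    return _COUNTRY_PATTERN.findall(text)


print(find_countries("She moved from Ukraine to the United States."))
print(find_countries("Made in the USA"))  # nothing found: "USA" is not in the list
```

The second call shows the weakness of the approach: anything that is not spelled exactly as in the list is missed, so recall depends entirely on how complete the dictionary is.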

For dates, there are limited ways in which we write them down. Date formats are highly regular. Because of this, the rule-based method (regular expressions) would perform very well for date extraction.
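
As a rough sketch of the rule-based approach, the pattern below covers three common date formats. The set of formats and the `find_dates` helper are assumptions for illustration; a production pattern would need more formats and validation of day and month ranges.

```python
import re

DATE_PATTERN = re.compile(
    r"\b(?:"
    r"\d{4}-\d{2}-\d{2}"          # ISO style, e.g. 2022-08-24
    r"|\d{1,2}/\d{1,2}/\d{4}"     # slash style, e.g. 8/24/2022
    r"|(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*\.? \d{1,2}, \d{4}"  # e.g. Aug 24, 2022
    r")\b"
)


def find_dates(text: str) -> list:
    """Return every substring that matches one of the supported date formats."""
    return DATE_PATTERN.findall(text)


print(find_dates("Published Aug 24, 2022, updated 2022-09-01 and 9/15/2022."))
```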

Now, company names are much more unpredictable. Extraction of company names in particular is not yet fully solved, but you can often get decent results from a transformer model. You can try extracting companies using NLP rules, and you’ll get decent precision (very little garbage returned), but very low recall (you’ll only extract maybe 20% of company names). So for more complicated entities, machine learning is the better way to go.
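
To make the precision and recall trade-off concrete, here is a small worked example. The company names and the `precision_recall` helper are hypothetical; the numbers are chosen to mirror the 20% recall case described above.

```python
def precision_recall(predicted: set, gold: set) -> tuple:
    """Precision: share of extracted items that are correct.
    Recall: share of the correct items that were actually extracted."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical data: five companies are mentioned in the text,
# but the rules extract only one of them (and nothing wrong).
gold = {"Acme Corp", "Globex", "Initech", "Umbrella", "Hooli"}
predicted = {"Acme Corp"}

precision, recall = precision_recall(predicted, gold)
print(f"precision={precision:.0%}, recall={recall:.0%}")
```

Everything the rules returned is correct, so precision is perfect, but four of the five companies were never found.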

Let’s go back to our SEO example. To generate text according to some keywords, the developers can use multiple different methods.

To have a quick working prototype for text generation, you can hard-code some rules where you glue together various phrases in order to construct sentences.

If you want to generate some fake movie reviews using rules, it would look something like this:
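
A minimal sketch of such a generator, assuming made-up phrase lists and a hypothetical `generate_review` helper (this is an illustration of the idea, not the author's original example):

```python
import random

# Hand-written templates and fillers. All of these are illustrative.
OPENERS = [
    "I watched {title} last night.",
    "{title} had been on my list for a while.",
]
OPINIONS = [
    "The {aspect} was {adjective}.",
    "I found the {aspect} {adjective}.",
]
ASPECTS = ["plot", "acting", "soundtrack"]
ADJECTIVES = ["gripping", "forgettable", "surprisingly good"]
CLOSERS = ["I would recommend it.", "I would not watch it again."]


def generate_review(title: str, rng=None) -> str:
    """Glue one opener, one opinion, and one closer into a short review."""
    rng = rng or random.Random()
    opener = rng.choice(OPENERS).format(title=title)
    opinion = rng.choice(OPINIONS).format(
        aspect=rng.choice(ASPECTS), adjective=rng.choice(ADJECTIVES)
    )
    closer = rng.choice(CLOSERS)
    return " ".join([opener, opinion, closer])


print(generate_review("Dune"))
```

Adding more templates and fillers increases variety, but every sentence the generator can produce is still one you could have predicted from the lists.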

Is this elegant, smart, or impressive? No. But it’s quick, it doesn’t need a dataset, and with some linguistic expertise you might just fool the Google algorithm.

If we have more time, we can collect a small dataset for each set of keywords we need, and train a few statistical language models.

The training is quick. The model generates each next word based on how frequently it appeared in the same context in your dataset (so based on the word’s probability). The output may be less readable than that of the rule-based method, but it has much more variability, so it might perform better in the search ranking.
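
One of the simplest statistical language models is a bigram model, where the context is just the previous word. The sketch below assumes a toy two-sentence corpus; the function names and the `<s>` and `</s>` boundary markers are choices made for this example.

```python
import random
from collections import Counter, defaultdict


def train_bigram_model(corpus):
    """Count how often each word follows each other word in the corpus."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        for previous, following in zip(tokens, tokens[1:]):
            counts[previous][following] += 1
    return counts


def generate(counts, rng=None, max_words=20):
    """Sample words one at a time, weighted by how often each followed the previous word."""
    rng = rng or random.Random()
    word, output = "<s>", []
    while len(output) < max_words:
        followers = counts[word]
        if not followers:  # unseen context (e.g. empty corpus): stop
            break
        word = rng.choices(list(followers), weights=list(followers.values()))[0]
        if word == "</s>":  # the model chose to end the sentence
            break
        output.append(word)
    return " ".join(output)


corpus = ["cheap running shoes online", "buy running shoes today"]
model = train_bigram_model(corpus)
print(generate(model))
```

Even with two training sentences the model can produce combinations that never appeared verbatim, such as "buy running shoes online". With a larger dataset that recombination is where the extra variability comes from.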

As the next step, the SEO company may invest in collecting and labelling a few gigabytes of articles. They can then fine-tune a pre-trained transformer based on their custom dataset, and get a model that generates very human-like text on the topic that they want. Maybe they’d need to train multiple models. Who knows. This needs research. This also needs time and money for collecting the dataset, getting the model to work as intended, and deploying this monstrosity to make it usable by anyone in the company.

Hopefully, this article gives a better understanding of how to apply NLP in business. Very often, you don’t need to go full R&D right off the bat. Use a simpler, more primitive method until your business is mature enough to take a more scientific approach. And yes, you can still claim on your website that you use AI.
