In Defense of Rule-Based NLP

Not all AI is neural networks, and not all neural networks are effective

Oksana Tkach
Jul 23, 2019 · 7 min read

I have found that we humans are, in a way, prejudiced against simplicity. We tend to believe that more complicated things are more worthy of our attention, regardless of their actual performance. A neural network trained with a terabyte of data on a dozen GPUs for a month sounds like it will have superhuman performance.

Contrary to popular opinion, artificial intelligence is not only neural networks. The term neural network sounds awesome because it hints that the algorithm somehow mimics the way we think (which it doesn’t). For many problems, neural networks give results of amazing nuance. For many, but not nearly all of them.

See, a neural network is only as good as the data you train it on. In the absence of large amounts of well-structured, clean, homogeneous data, even your best algorithm trained for days won’t really perform.

This is why there’s never a clear distinction between computational linguistics, Natural Language Processing, Data Science and AI that understands and generates language: some rule-based approaches in NLP are so linguistics-based that I hesitate to count them as part of Data Science. Data Science has a process: you collect clean, relatively homogeneous data, you analyze it and get objective statistical information about that data, you extract features based on that analysis, and you train a model with an algorithm that will extract the most informative bits from your data in the best way. It’s a descriptive, objective process that only works with, and is completely reliant on, good data.

In a rule-based approach, you rely completely on human expertise, with all our biases and blind spots. The rule-based approach is prescriptive and subjective. This is also why it is so much more usable.

Here are four reasons why rule-based approaches in Data Science, NLP in particular, work.

1. It’s a quick working solution

I could bet my beloved MacBook Pro that about 80% of startups that market themselves as “AI-powered” and “data-driven” actually just use a bunch of if-statements in their MVP.

So are they lying about their product? Not at all. Rules are still AI.

Is their product working then? Most likely. It is definitely performing better than a baseline. Their developers definitely have enough expertise in the field to come up with some heuristic solution that does work in the real world.

Why aren’t they using a neural network? Probably because they don’t have enough data yet and collecting and analyzing data is in their business plan at some future stage, when they reach some critical user mass. Most likely though, they are spending their VC money on marketing and sales.

2. Because solving a Data Science problem is like trying to reach the speed of light

The baseline is very easily reached. Something like 80% is harder but doable, depending on the task. But as you approach 100%, every fraction of a percent is exponentially harder to reach, in terms of amounts of data and algorithm complexity. This is because Data Science is trying to explain the real world, and the real world is mostly predictable but irreducibly random. Not even the real-life humans who label data sets can agree with each other 100% of the time. Human results are more like 97–99% accurate.

Take language, for example. When you study a language synchronically, i.e. at a particular moment in time, it has a limited vocabulary: only the words that we use at that moment have meaning. But if you look at a language diachronically, i.e. as a system through time, you’ll see that its vocabulary is endless, because we come up with new words all the time, be it something that has a logical derivation, like lol, or something cultural, like covfefe.

But I digress.

Iterative development is a great way to ride this speed-of-light phenomenon. Say you’re an SEO company, and you’ve decided to ditch the copywriters. They are expensive and not always as effective as you’d like. So you want an NLP solution that will generate texts based on the keywords you provide.

You can, in a respectably short period of time, set up a rule-based system that will generate sentences based on some grammar and insert your keywords as heads of noun or verb phrases. You can add some more rules to exclude weird combinations of words. The result will be a text that’s not particularly readable, but one which could probably fool Google.
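
A first iteration like this can be sketched in a few lines of Python. This is only a toy under stated assumptions: the grammar, the word lists and the `BANNED_PAIRS` rule are all made up for illustration, and a real system would need a far richer grammar.

```python
import random

# Hypothetical toy grammar: each symbol maps to its possible expansions.
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["DET", "ADJ", "KEYWORD"], ["DET", "KEYWORD"]],
    "VP": [["VERB", "NP"], ["VERB", "ADV"]],
    "DET": [["the"], ["every"], ["this"]],
    "ADJ": [["affordable"], ["reliable"], ["popular"]],
    "VERB": [["beats"], ["needs"], ["improves"]],
    "ADV": [["quickly"], ["easily"]],
}

# Hypothetical rule that excludes a weird combination of adjacent words.
BANNED_PAIRS = {("every", "popular")}


def expand(symbol, keywords, rng):
    """Recursively expand a grammar symbol into a list of words."""
    if symbol == "KEYWORD":
        # The keyword becomes the head of the noun phrase.
        return [rng.choice(keywords)]
    if symbol not in GRAMMAR:
        return [symbol]  # a plain word, nothing left to expand
    words = []
    for part in rng.choice(GRAMMAR[symbol]):
        words.extend(expand(part, keywords, rng))
    return words


def generate_sentence(keywords, rng, max_tries=100):
    """Generate sentences until one passes the exclusion rules."""
    for _ in range(max_tries):
        words = expand("S", keywords, rng)
        if not any(pair in BANNED_PAIRS for pair in zip(words, words[1:])):
            text = " ".join(words)
            return text[0].upper() + text[1:] + "."
    raise RuntimeError("no acceptable sentence found")


print(generate_sentence(["running shoes", "sneakers"], random.Random(0)))
```

Every generated sentence contains at least one keyword, because every noun phrase in this grammar ends in one. Each new complaint about the output turns into one more rule or one more banned pair.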

While that first solution is up and working, you can create a system that is a little more sophisticated: for each set of keywords, you can find existing relevant, human-written articles through Google search, then count how often each word and word sequence occurs in them. And voilà: you’ve got a statistical language model for your particular set of keywords. You can then generate brand new sentences based on ngram probabilities. The text still won’t make much sense overall, but now you’ve excluded any weird word combinations.
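
To make the counting idea concrete, here is a minimal bigram sketch. The two-sentence corpus stands in for the articles you would actually collect, and the function names are my own, so treat it as an illustration rather than a production language model.

```python
import random
from collections import Counter, defaultdict


def train_bigrams(texts):
    """Count which word follows which across the reference texts."""
    followers = defaultdict(Counter)
    for text in texts:
        # <s> and </s> are made-up markers for sentence start and end.
        tokens = ["<s>"] + text.lower().split() + ["</s>"]
        for prev, nxt in zip(tokens, tokens[1:]):
            followers[prev][nxt] += 1
    return followers


def generate(followers, rng, max_words=20):
    """Sample a sentence word by word, weighted by bigram counts."""
    word, out = "<s>", []
    while len(out) < max_words:
        candidates = followers[word]
        word = rng.choices(list(candidates), weights=list(candidates.values()))[0]
        if word == "</s>":
            break
        out.append(word)
    return " ".join(out)


# Placeholder corpus; in practice this would be the collected articles.
corpus = ["cheap running shoes for sale", "best running shoes for men"]
model = train_bigrams(corpus)
print(generate(model, random.Random(1)))
```

Because each next word is drawn only from words that actually followed the previous one in the corpus, every adjacent pair in the output has been written by a human at least once. That is exactly why the weird combinations disappear, and also why the sentence as a whole can still wander.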

As the third iteration, you can crawl websites to collect a data set of human-written articles, and use the keywords extracted from the HTML as their labels. Now you can delve into language generation neural networks. They will likely produce texts that are coherent from start to finish. (But still read like my undergraduate thesis, with no real point to it.)

3. The problem is too specific

Some NLP solutions need to be deterministic.

Take Siri, for example. Siri doesn’t generate her answers to your queries based on ngram probabilities. The bot is an information retrieval system, so she has to give you logical, concrete answers. For each of your questions, there is a logical path of if-statements built in. That’s called a knowledge base.

A knowledge base is a structured collection of facts about the world. It can’t, by definition, be probabilistic. This is why information retrieval chatbots trained on neural networks don’t work (yet).
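
As a rough illustration of what "deterministic" means here, consider this toy lookup. It is in no way how Siri is implemented; the facts, the question pattern and the `answer` function are all hypothetical.

```python
import re

# Hypothetical facts stored as (subject, relation) -> object.
KNOWLEDGE_BASE = {
    ("france", "capital"): "Paris",
    ("japan", "capital"): "Tokyo",
    ("france", "currency"): "euro",
}

# One hard-coded question shape: "What is the <relation> of <subject>?"
QUESTION = re.compile(r"what is the (\w+) of (\w+)\??$")


def answer(question):
    """Return the stored fact, or None if the question is not covered."""
    match = QUESTION.match(question.strip().lower())
    if match is None:
        return None
    relation, subject = match.groups()
    return KNOWLEDGE_BASE.get((subject, relation))


print(answer("What is the capital of France?"))
print(answer("What is the capital of Mars?"))
```

The same question always gets the same answer, and a question outside the stored facts gets an honest `None` instead of a plausible-sounding guess.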

If a corporation wants to use a chatbot as an additional sales channel, that’s more of a design problem than a Data Science problem. Sure, I could train a neural network on a bunch of conversations, but the client doesn’t care if the bot can reply “Polo” to “Marco”; they just want it to be easy to sell something through the bot. The task is more to re-create a user interface, but with a conversation. It has to be hard-coded.

A hard-coded niche chatbot is still a linguistics problem though. You have to analyze discourse — how people tend to communicate. You have to model a conversation and all its caveats. You have to predict all users’ possible responses, using linguistic features like sentence structure, part-of-speech tags, etc.

The smartest chatbot I ever built was an enormous project written in ChatScript — a fully rule-based, offline, lightweight engine. The end result was a smart bot because the conversation I was modeling was very niche and couldn’t deviate wildly by design. I just spent A LOT of time thinking about all possible responses the users could give me, testing the conversation on other people to see what they naturally reply that I hadn’t predicted, modeling the chatbot’s words in a way that solicited exactly the response I needed to move the exchange forward, and making sure my algorithm’s dozens of if-statements didn’t conflict with each other. The end result was significantly more impressive than that of a seq2seq neural network trained on actual people’s conversations.
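
The general shape of such a hard-coded conversation can be sketched in Python. This is not ChatScript and not the bot described above, just a made-up coffee-ordering flow where every state, prompt and pattern is an assumption for illustration.

```python
import re

# Hypothetical flow: state -> (bot prompt, [(pattern, next state), ...]).
FLOW = {
    "ask_size": (
        "Would you like a small or a large coffee?",
        [(r"\b(small|large)\b", "ask_milk")],
    ),
    "ask_milk": (
        "Do you want milk with that? (yes/no)",
        [(r"\b(yes|yeah|sure)\b", "done"), (r"\b(no|nope)\b", "done")],
    ),
    "done": ("Great, your order is placed.", []),
}


def next_state(state, user_text):
    """Follow the first matching rule; otherwise stay put so the bot re-asks."""
    for pattern, target in FLOW[state][1]:
        if re.search(pattern, user_text.lower()):
            return target
    return state


state = "ask_size"
for reply in ["hmm, what do you have?", "A LARGE one please", "nope"]:
    print("BOT: ", FLOW[state][0])
    print("USER:", reply)
    state = next_state(state, reply)
print("BOT: ", FLOW[state][0])
```

Note how each prompt is worded to solicit one of the answers the rules expect. Most of the real work is in testing the flow on people, finding the replies you didn't predict, and adding patterns without letting them conflict.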

4. Software Limitations

Sophia the robot uses ChatScript to talk. This is because, I am fairly certain, Sophia is just a Raspberry Pi stuck inside a mannequin with a ginormous marketing team behind it. It is most likely not even connected to the internet. ChatScript is a fully portable, fully rule-based, lightweight engine that doesn’t rely on any external resources. It runs as a local server out of the box and it’s very easy to install and compile. Perfect for when you don’t have enough memory to store and unpack a language model.

One of my clients has an insanely outdated software stack. At one point, we had to write a sentence boundary disambiguation algorithm that could only run on Python 2.6, with no possibility of installing any additional libraries or importing any external files, and an execution time limit of something like a minute or two. No Data Science method, apart from a rule-based one, could possibly work here.
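
To give a feel for what a rule-based sentence splitter looks like under such constraints, here is a minimal sketch that uses nothing but built-ins. It is not the algorithm we actually shipped; the abbreviation list and the four rules are assumptions chosen for illustration.

```python
# Hypothetical abbreviation list; a real one comes from domain knowledge.
ABBREVIATIONS = set(["mr", "mrs", "dr", "prof", "etc", "e.g", "i.e", "vs"])


def split_sentences(text):
    """Split on ., ! or ? unless a rule says the mark is not a boundary."""
    sentences = []
    start = 0
    for i, char in enumerate(text):
        if char not in ".!?":
            continue
        rest = text[i + 1:]
        # Rule 1: a boundary is followed by whitespace or the end of text.
        if rest and not rest[0].isspace():
            continue
        if char == ".":
            words = text[start:i].split()
            last_word = words[-1].lower() if words else ""
            # Rule 2: a period after a known abbreviation is not a boundary.
            if last_word in ABBREVIATIONS:
                continue
            # Rule 3: a single letter is treated as an initial ("J. Smith").
            if len(last_word) == 1 and last_word.isalpha():
                continue
        # Rule 4: the next sentence should not start with a lowercase letter.
        following = rest.lstrip()
        if following and following[0].islower():
            continue
        sentences.append(text[start:i + 1].strip())
        start = i + 1
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


print(split_sentences("Dr. Smith paid $3.50 for coffee. He liked it! Did you?"))
```

Each rule is an assumption about how language behaves, written down as an if-statement. The rules have obvious blind spots (Rule 3 will wrongly glue together "So did I. Then we left."), and finding and patching those is where the linguistic expertise goes.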

All Data Science problems start with identifying the features of your data. When you don’t have the data to plot on a nice pandas chart, you have to turn to a tragically overlooked DS skillset component: domain knowledge. Sometimes you just have to think about how language works and assume what features would come up on a pandas chart if you had data to analyze. Then you take those assumed features and write an if-statement for each. There, you’ve got yourself a heuristic algorithm. This is why, for a serious NLP project where some parts just can’t be solved with statistical classifiers or neural networks, the expertise of an analytical linguist is vital.

In a perfect world, this shouldn’t work, because any heuristic solution is subjective by definition, and science doesn’t like that. But in the real world, with humans and deadlines and business plans, it does work, and we should use it.

Voice Tech Podcast

Voice technology interviews & articles.