NLP Startup: Where to Begin

Aug 24, 2022

So you have an NLP idea. Here’s some advice to avoid making basic mistakes.

1. Identify that your startup is indeed in NLP

If you found this article, you are probably on the right track. More often than not, people don’t realize they are creating something NLP-related and that their software could benefit from NLP research. A good rule of thumb: if your idea has anything to do with processing language (text or speech, social media, chats, or articles), you are probably in NLP. Welcome to the field; it is incredibly frustrating.

You might be somewhere in between. For example, a currently popular area of development is processing text documents such as PDFs. Some projects treat a PDF purely as an image in order to apply all sorts of manipulations to it. In that case, you are likely not in NLP, or at least textual data is not at the forefront of your research. However, if you use the text on the page as well as the pixel position of that text to process the document, you are in both Computer Vision and NLP. Good luck.

2. Define your problems, tasks, and methods

Very often people get overwhelmed and confused by NLP because they lack structure in their definition of the problem. This is also because the NLP field itself lacks structure, but that’s a topic for a different conversation.

This is why I wrote the article below that describes how you can break down your NLP-related problem to see more clearly what direction you should be moving in.

3. Do not start big

I mean, do start big if you have tons of labelled data. Do start big if you have tons of money to collect a dataset and label it. But, if you have either of those things, you are not a startup. What are you doing reading this article?

Here’s the problem I see time and time again.

Many machine learning specialists are math whizzes who would much rather research matrix multiplication than make an NLP demo for you. These people are amazing, but they belong in academia or in a giant company that can give them money and time for research. When you have a person like this on a team, they immediately want to solve every problem by collecting and analyzing data. This is the correct scientific approach, but for a startup, not only won’t this approach produce a viable MVP in a short time, it will likely tank your project very soon.

A startup’s initial purpose is to either attract some investment or start selling a product that produces some income. Otherwise, it’s not a startup, it’s a hobby. Your initial task should be to have some sort of a demo or a usable app as soon as possible.

Before you do any sort of custom machine learning, you must first check whether there are any pre-trained models or labelled datasets you can use. With a little bit (or a lot) of linguistic creativity, you can use what is out there to build a semi-decent pipeline for your specific NLP idea. If all else fails, you can always fall back to rule-based NLP methods. This is a great way to manually overfit your model on some custom demo dataset that showcases your app in the best light. Hey, if it was good enough for Steve Jobs, it is good enough for you.

Of course, never stop there. Rule-based NLP does not scale and does not generalize to new, previously unseen data. Like, at all. The more people use your product, the worse the experience will become. You must have a growth plan. Expand your dataset continuously and iterate on your NLP methods, making them increasingly accurate and complex. The plan should also take into account the integration and debugging efforts, which will likely take more time than the R&D itself.

Just don’t try to do all of the data science at the very beginning.
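To make the rule-based fallback from this section concrete, here is a minimal sketch in Python. The labels and patterns are invented purely for illustration; real rules would come from looking at your own demo data. It is a starting point under those assumptions, not a recommended production design.

```python
import re

# Hypothetical hand-written rules (label -> pattern), invented for this example.
RULES = {
    "refund_request": re.compile(r"\b(refund|money back|chargeback)\b", re.IGNORECASE),
    "greeting": re.compile(r"\b(hi|hello|hey)\b", re.IGNORECASE),
}


def tag_message(text):
    """Return the sorted list of labels whose pattern matches the text."""
    return sorted(label for label, pattern in RULES.items() if pattern.search(text))


if __name__ == "__main__":
    print(tag_message("Hello, I want my money back"))
```

Rules like these are quick to write and easy to demo, and they fail on any phrasing you didn't think of, which is exactly why the growth plan matters.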

3. Modularize

Do not mix NLP with other parts of your code. Always isolate everything NLP-related into a separate module or service. If possible, make the code scalable and universal by moving changeable parameters into separate config or JSON files. This does not need to be perfect. You’ll find the perfect solution as you improve your code base over the years. It can be primitive, as long as it’s scalable and refactorable.

Mixing often happens when the developers didn’t initially realize they were coding an NLP product, so every part of the code was just back-end to them, and it all got mixed together. This becomes a huge, expensive pain point once you realize you need to change your NLP methods or refactor the code.

I once did some work for a startup that had their app coded in Java. They were using some rule-based methods (they of course didn’t know this is what they were doing) to count how many times certain words and phrases occurred in chat conversations. While reviewing the code, I found the word counters in multiple unrelated places on the backend, as well as in their database module. They literally had NLP coded in SQL. This app was impossible to improve in terms of NLP without rewriting the whole backend. They of course positioned themselves as a data-driven intelligent AI something-or-other. The startup ceased to exist a few months later.
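For contrast, here is a rough sketch of what the same kind of phrase counting could look like when it lives in one place and reads its parameters from a config. The class name, the config keys, and the phrases are all hypothetical; in a real project the JSON would sit in its own file rather than in a string.

```python
import json
import re
from collections import Counter

# Hypothetical config, inlined here so the example is self-contained.
# In practice this would be a separate file loaded with json.load().
CONFIG_JSON = '{"phrases": ["thank you", "refund"], "case_sensitive": false}'


class PhraseCounter:
    """All phrase-counting logic lives here, not in the backend or in SQL."""

    def __init__(self, config):
        flags = 0 if config.get("case_sensitive", False) else re.IGNORECASE
        # One compiled whole-word pattern per configured phrase.
        self._patterns = {
            phrase: re.compile(r"\b" + re.escape(phrase) + r"\b", flags)
            for phrase in config["phrases"]
        }

    def count(self, messages):
        """Count occurrences of each configured phrase across all messages."""
        counts = Counter()
        for message in messages:
            for phrase, pattern in self._patterns.items():
                counts[phrase] += len(pattern.findall(message))
        return dict(counts)


if __name__ == "__main__":
    counter = PhraseCounter(json.loads(CONFIG_JSON))
    chat = ["Thank you so much!", "I'd like a refund. A refund, please."]
    print(counter.count(chat))
```

The point is not the regex. The point is that swapping this class for a smarter model later touches one module and one config, not the whole backend.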

4. Do not use old technology

No weird Java apps from the 90s (looking at you, GATE). There are plenty of old NLP services that are often rule-based, and so might be perfect for your demo. But I would not recommend using them seriously. They might do exactly what you want, but the development will be slow and painful, connecting modern libraries to them will be impossible or very hard, and hosting them will be difficult and expensive. Then at the end, they might ask for tens of thousands of dollars for you to use the service in production. No.

Hire a good Python developer who’ll recreate that same behavior in a language that’s modern and easy to integrate.

Yes, Python is slow. But your demo doesn’t need to be fast. It’s just a demo. It’s not processing large amounts of data. Optimizing is important, but it is also time-consuming. Start optimizing when users start complaining, not when you don’t even have users. This is why you are modularizing everything at the beginning: to make optimizing easier later.

5. Hire someone who knows linguistics

Have at least a few language geeks on the team. You want those people who religiously watched all of Jurafsky’s lectures, which they pirated from torrents after Coursera took them all down. You want at least two people so they can talk to each other and not be lonely. I mean so they can talk to each other and come up with new cool ideas for your NLP service. You may not understand them, but you need them. Not just machine learning people, although those are always useful to have on the team, but computational linguists who can analyze your data and make experience-driven decisions. I promise you, this works so much better than endlessly tuning parameters in some complicated model that never works that well in real life.

6. Hire field consultants

Your NLP business must be solving some problem in some business field. Maybe it’s an app that helps diagnose patients, in which case your business field is medicine. If your primary user base is banks, you are in banking (my condolences). Maybe it’s social media, the translation business, the beauty business. Whatever it is, you need to know what your target user base wants and expects, and how they react to whatever you’ve built. You need a trusted consultant, or a few, preferably people who aren’t in tech but are deeply in that business field. They can explain every detail of their workflow and give objective feedback about your software, and even if you don’t listen to everything they say, you’ll still learn a ton.

7. Do not promise magical AI

Magical AI just doesn’t work. AI works when combined with human reasoning. Build your app in a way that helps a person do their work faster and more efficiently, rather than one that replaces the human.

If you promise an app that does everything for the user, you’ll be met with two problems.

The first one is rather prosaic — the technology is just not there. It’s impossible and you won’t be able to deliver on your promise.

The second problem is much more interesting: in some business fields, people don’t want AI to do everything for them. If it’s something very strict, like translating important documents, people don’t trust computers (and rightfully so) to not make mistakes. They want a human to check the AI’s work. If “magical AI that will replace your employees” is your primary selling point, you might actually repel potential clients.
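One way to bake "a human checks the AI’s work" into the app itself is to treat every machine output as a draft until a person signs off. The sketch below is only an illustration of that idea; the `Draft` class, the `review` function, and the translation example are all made up for this article.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Draft:
    """A machine suggestion that is not final until a human signs off."""
    source_text: str
    machine_output: str
    approved_output: Optional[str] = None  # filled in only by a human decision

    @property
    def is_final(self) -> bool:
        return self.approved_output is not None


def review(draft: Draft, correction: Optional[str] = None) -> Draft:
    """Record the human decision: accept the suggestion or replace it."""
    draft.approved_output = draft.machine_output if correction is None else correction
    return draft


if __name__ == "__main__":
    draft = Draft("Bonjour", "Hello")  # hypothetical translation suggestion
    print(draft.is_final)              # nothing is final before review
    review(draft)                      # the human accepts the suggestion as is
    print(draft.is_final, draft.approved_output)
```

The model does the boring first pass, and the person keeps the final say, which is a much easier thing to sell than a replacement for the person.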

Written by Oksana Tkach

Sometimes I procrastinate so hard I write an article
