NLProv: An NLP Toolbox

Martha Bellows
Johnson & Johnson Open Source
Apr 23, 2020

As a programmer, what do you do when you repeat the same code over and over again? You create a function, a library, or package it in some form so you don’t need to duplicate the same code ever again! After working on several Natural Language Processing (NLP) projects in a row and continually copying and pasting the same code from project to project, it made sense to package the code for easier reuse.

Johnson & Johnson is proud to announce our newest open source project, NLProv.

What is NLProv?

NLProv is a new open-source toolbox for NLP tasks such as preprocessing, vectorization, and text similarity. It combines several existing open-source libraries, such as pandas, spaCy, and scikit-learn, into a pipeline that is ready to process text data. Many parameters are user-defined depending on the type of project, such as the choice between stemming and lemmatization, or what to substitute for NaN text fields. Overall, it is a way to get you started on your NLP task, no matter what you need.

Why did we make NLProv?

We made NLProv because, internally, we were tired of copying and pasting the same sections of code for every NLP project we came across. Where did we get the name? The “NLP” part is obviously for Natural Language Processing, and the “Prov” part is because the creators work out of J&J’s Providence office in their Healthcare Technology Center.

Why Open Source?

We initially built it as an internal library for our group and other data science teams within J&J. Since we found it useful, we decided to make it open source and share it with you all! We would also like your help and feedback on how to enhance the tool. We have a few ideas already (check out the roadmap section), but we would love to hear what you would find helpful.

How does it work?

There are currently three main components of the NLP pipeline: text preprocessing, vectorization, and text similarity. Text preprocessing offers many options for getting your text ready for whatever your next step is. The following are just a few of the available preprocessing steps you can use to tailor the pipeline to your specific needs.

  • Stemming and lemmatization
  • Lowercasing
  • NaN handling
  • Removal of non-English text
  • Character replacement
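NLProv’s own function and parameter names aren’t shown here, but several of the options above (NaN handling, lowercasing, character replacement) can be sketched with pandas alone; stemming or lemmatization would be layered on with spaCy. The `preprocess` function and its parameters below are hypothetical, purely for illustration:

```python
import pandas as pd

def preprocess(series, nan_fill="", lowercase=True, replacements=None):
    """Illustrative preprocessing sketch; not NLProv's actual API."""
    # Substitute a user-chosen value for NaN text fields
    out = series.fillna(nan_fill)
    # Lowercase all text
    if lowercase:
        out = out.str.lower()
    # Character replacement, e.g. stripping punctuation via regex
    for pattern, repl in (replacements or {}).items():
        out = out.str.replace(pattern, repl, regex=True)
    return out

texts = pd.Series(["Hello, World!", None, "NLP Rocks!!"])
clean = preprocess(texts, nan_fill="missing", replacements={r"[^\w\s]": ""})
# clean is now: ["hello world", "missing", "nlp rocks"]
```

Exposing each step as a keyword argument like this is what lets one pipeline serve many projects: you pick the combination of steps your data needs.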

The next step in an NLP project is usually turning text into numeric vectors that a computer can understand. Right now, the only options in the library are a Count Vector and a TF-IDF Vector. Check out the roadmap section for additional ideas we have in the works!
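Under the hood, these two options correspond to scikit-learn’s `CountVectorizer` and `TfidfVectorizer` (how NLProv wraps them exactly may differ); a minimal sketch:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["the cat sat", "the dog sat", "the cat ran"]

# Count Vector: each column is a vocabulary term, each cell a raw count
count_vec = CountVectorizer()
counts = count_vec.fit_transform(docs)

# TF-IDF Vector: the same counts, reweighted so terms that appear in
# every document (like "the") contribute less
tfidf_vec = TfidfVectorizer()
tfidf = tfidf_vec.fit_transform(docs)
```

Both return sparse document-term matrices, so the output can feed directly into the similarity metrics described next.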

Finally, since we have worked on several projects involving text similarity, we have a few initial implementations in our library: cosine, Jaccard, Manhattan, Dice, and Hamming.

NLProv Roadmap

The following list gives an idea of what we want to implement for this toolbox in the future.

  • Additional similarity metrics: Word Mover’s Distance and Levenshtein.
  • Ability to add custom stop words. While you could technically do this yourself, we want to make it an input to the preprocessing pipeline.
  • Incorporate other languages for preprocessing.
  • Additional vectorization techniques such as word2vec.
  • Create visuals to enhance understanding of how similarity metrics work.

Closing Thoughts

I hope this article was helpful and gives you a general idea of how NLProv works. Let us know if there is anything we should add to our roadmap or if you are interested in contributing! Thank you for reading!
