Architecture Design For an NLP Library

For my Google Summer of Code project with the Pharo Consortium, I aim to build a Natural Language Processing library for Pharo. The end product should unite existing NLP packages and implement the missing fundamental tools in a single library with a uniform API and good documentation.

The biggest thing to consider now is that the code is no longer for the use of a single individual (the person writing it). It needs an inherent structure. It needs to be readable and understandable, and ideally documented well enough that anyone new can easily learn how to use it.

So, before we delve into the coding, we want to get the structure of the library right. In this blog I plan on covering 3 main things:

  1. How popular existing NLP libraries do things.
  2. How I plan on structuring my NLP library.
  3. Why I plan on doing it like this.

Popular Library — SpaCy

SpaCy Architecture

There are 2 central data structures in SpaCy: Doc and Vocab. The Doc keeps track of the order of the words and their corresponding data, such as annotations. The Vocab centralizes the data that is shared across documents, such as the strings of the tokens. Centralizing this data reduces the number of redundant mappings. For example, the Doc stores a compact hash for each word instead of the string itself, and the string can be looked up from that hash in an associated hash table. By centralizing strings, word vectors, and lexical attributes, SpaCy avoids storing multiple copies of this data and saves memory.
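To make the idea concrete, here is a minimal sketch in plain Python of the hash-and-lookup scheme described above. It is only an illustration of the concept, not SpaCy's actual implementation: the class names mirror the ones in the text, but the methods, the token_keys attribute and the choice of hash function are my own assumptions.

```python
import hashlib


class StringStore:
    """Hypothetical, simplified string table: integer hash -> string."""

    def __init__(self):
        self._by_hash = {}

    def add(self, text):
        # Derive a stable 64-bit key from the string's UTF-8 bytes.
        # (The hash function is an assumption chosen for this sketch.)
        digest = hashlib.blake2b(text.encode("utf-8"), digest_size=8).digest()
        key = int.from_bytes(digest, "big")
        # The string is stored once, however often the word occurs.
        self._by_hash[key] = text
        return key

    def __getitem__(self, key):
        return self._by_hash[key]

    def __len__(self):
        return len(self._by_hash)


class Vocab:
    """Shared storage; here it only holds the string table."""

    def __init__(self):
        self.strings = StringStore()


class Doc:
    """Keeps the order of the words, but stores hashes instead of strings."""

    def __init__(self, vocab, words):
        self.vocab = vocab
        self.token_keys = [vocab.strings.add(word) for word in words]

    def words(self):
        # Resolve each hash back to its string through the shared Vocab.
        return [self.vocab.strings[key] for key in self.token_keys]


vocab = Vocab()
doc_a = Doc(vocab, "the cat saw the dog".split())
doc_b = Doc(vocab, "the dog ran".split())
```

Both documents share one Vocab, so a word such as "the" is stored as a string only once, while each Doc holds just a list of integers that point back to it.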

A more in-depth explanation can be found in the SpaCy documentation here.

My Idea for the Library

Design for Pharo NLP Library

In this diagram, the circles represent objects and the rectangular boxes represent methods.

The structure of this library would be somewhat similar to SpaCy's, as it also takes an object-oriented approach.


Problems with Design

I hope that this blog serves as a means to explain my reasoning for how I plan to structure the NLP library for Pharo. Any criticisms and suggestions that will help improve this design and the library overall are greatly appreciated.

4th Year Undergraduate Student at the International Institute of Information Technology, Hyderabad
