For my Google Summer of Code project with the Pharo Consortium, I aim to build a Natural Language Processing library for Pharo. The end product should unite existing NLP packages and implement the missing fundamental tools in a single library with a uniform API and good documentation.
So, the biggest consideration now is that the code is no longer for the use of a single individual (its author). It needs an inherent structure: the code should be readable and understandable, and ideally well documented, so that anyone new can easily learn how to use it.
So, before we delve into the coding aspect, we want to get the structure of the library right. In this blog I plan on covering three main things:
- How popular existing NLP libraries do things.
- How I plan on structuring my NLP library.
- Why I plan on doing it like this.
Popular Library — SpaCy
The architecture diagram for the SpaCy library is below.
There are two central data structures in SpaCy: Doc and Vocab. The Doc keeps track of the order of the words and their corresponding data, such as annotations. The Vocab centralizes all the tokens present in the documents, which reduces the number of redundant mappings. For example, instead of the string itself, the Doc object stores a compact hash that can be looked up in an associated hash table. By centralizing strings, word vectors, and lexical attributes, SpaCy avoids storing multiple copies of this data and saves memory.
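To make the hash-centralization idea concrete, here is a minimal Python sketch of it. The class names `StringStore` and `Doc` echo SpaCy's concepts, but this is an illustrative analog, not SpaCy's actual implementation (which is written in Cython and far more sophisticated):

```python
class StringStore:
    """Maps strings to stable hashes and back, so each string is stored once."""
    def __init__(self):
        self._by_hash = {}

    def add(self, text):
        key = hash(text)           # stand-in for SpaCy's 64-bit hash
        self._by_hash[key] = text  # one shared copy per distinct string
        return key

    def __getitem__(self, key):
        return self._by_hash[key]


class Doc:
    """Keeps only the ordered hashes; the store holds the actual strings."""
    def __init__(self, store, words):
        self.store = store
        self.token_ids = [store.add(w) for w in words]

    @property
    def words(self):
        return [self.store[i] for i in self.token_ids]


store = StringStore()
doc = Doc(store, ["I", "like", "coffee", "and", "coffee", "likes", "me"])
print(doc.words)  # the original tokens, reconstructed from the hashes
```

Note how the two occurrences of "coffee" map to the same hash, so the string itself is held only once in the store no matter how many Docs reference it.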
An in-depth explanation can be found in the SpaCy documentation here.
My Idea for the Library
In this diagram, the circles represent the Objects and rectangular boxes represent methods.
The structure of this library would be somewhat similar to the SpaCy approach, as it also follows an object-oriented design.
I believe that structuring the library like this gives it a cleaner and more understandable hierarchy. As more people begin contributing to the project, the package will be easier for newcomers to work with. The package would have classes such as NLObject, NLModel, NLTokenizer, and similar object classes. More models can easily be added under the task they are relevant to.
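The hierarchy could be sketched as follows. The real library will be written in Pharo; Python is used here only to illustrate the class relationships, and the subclass `NLTokenFrequency` is a hypothetical toy task showing how a new model would slot in:

```python
class NLObject:
    """Common superclass: holds the raw text shared by all NLP objects."""
    def __init__(self, text):
        self.text = text


class NLTokenizer(NLObject):
    """Splits text into tokens; a naive whitespace split as a placeholder."""
    def tokens(self):
        return self.text.split()


class NLModel(NLObject):
    """Base class for task-specific models; subclasses implement run()."""
    def run(self, tokens):
        raise NotImplementedError


class NLTokenFrequency(NLModel):
    """Toy model counting token frequencies, showing how a task plugs in."""
    def run(self, tokens):
        counts = {}
        for t in tokens:
            counts[t] = counts.get(t, 0) + 1
        return counts


tokenizer = NLTokenizer("to be or not to be")
model = NLTokenFrequency("")
print(model.run(tokenizer.tokens()))  # {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

Because every model inherits the same `run` interface from NLModel, adding a new task means adding a subclass, without touching the tokenizer or the rest of the pipeline.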
Problems with Design
One of the biggest problems with this design is that we can't directly work with preprocessed data. For example, if I already had tokenized text and wanted to work with it, how would that fit into this pipeline?
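One possible way around this, sketched below, would be to let the pipeline's entry point accept either raw text or pre-tokenized input and skip the tokenization step in the latter case. The function `make_doc` and both parameter names are hypothetical, not part of any existing API:

```python
def make_doc(text=None, tokens=None):
    """Build the token list either from raw text or from pre-tokenized input."""
    if tokens is not None:
        return list(tokens)  # trust the caller's existing tokenization
    return text.split()      # otherwise tokenize (naively) ourselves


print(make_doc(text="hello world"))           # ['hello', 'world']
print(make_doc(tokens=["already", "split"]))  # ['already', 'split']
```

Whether this belongs on the tokenizer class or on a pipeline-level constructor is exactly the kind of design question the feedback below could help settle.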
I hope this blog serves to explain my reasoning for how I plan on structuring the NLP library for Pharo. Any criticisms and suggestions that help improve this design, and the library overall, are greatly appreciated.