Grammaregex library: regex-like patterns for text mining

Krzysztof Fonał
3 min read · Sep 4, 2016

I'm currently working on a project whose core is Natural Language Processing. I use the spaCy Python library (maybe I'll write something about this great NLP lib in another post), and the first need that came up in my tasks was the ability to check whether a given grammar pattern matches a sentence, and to find the words at the end of that pattern. For instance, I'd like to find the subject of the root verb in a sentence, or check whether the sentence tree satisfies a criterion saying that there is a place related to the root verb of the sentence.

For such tasks I wrote a Python library that does this using the sentence tree produced by spaCy.

Get grammaregex library

You can find the library at https://github.com/krzysiekfonal/grammaregex. It's free, open source, and provided under the MIT licence.
It's also published on PyPI, so you can install it by just running:
pip install grammaregex

How does it work?

The idea of grammaregex is to provide operations on the sentence syntax tree produced by spaCy, using a friendly, regex-like format similar to path patterns in an operating system.
Patterns are built in the format: node/edge/node/edge/…/node…
The analogy to path patterns like dir/dir/…/dir/file is that we also travel through a tree, like a directory tree in an OS. You can treat a node as a file and an edge as a directory, with these differences:
- Instead of a directory name we have the name of the dependency (edge) that links a token to its parent token
- Instead of a file we have a token (node)
- The path alternates node/edge/node/edge in sequence (unlike a path pattern, where we have only directories with a file at the end).
A grammar pattern always starts with a node, because the root token doesn't have any dependency, and ends on the final token. A node is matched by one of the token's attributes: pos (e.g. ADV, NOUN), tag (e.g. VBD, NNP), lemma (the base form of the word) or entity type (e.g. PERSON).
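
To make the path analogy concrete, here is a minimal sketch in plain Python. It is not how grammaregex is implemented and it does not use spaCy: `Node` and `follow_path` are hypothetical names introduced only for this illustration. A `Node` stands in for a token with a `tag`, the dependency `dep` that links it to its parent, and its `children`; `follow_path` walks a literal node/edge/…/node path with no wildcards.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical stand-in for a parsed token (not a spaCy object)."""
    text: str
    tag: str            # plays the role of the "file" in the path
    dep: str = "ROOT"   # plays the role of the "directory": dependency to the parent
    children: List["Node"] = field(default_factory=list)


def follow_path(root: Node, pattern: str) -> List[Node]:
    """Return the nodes reached by walking a literal node/edge/.../node path."""
    parts = pattern.split("/")
    if len(parts) % 2 == 0:
        raise ValueError("pattern must start and end with a node")
    # The first segment is compared with the root token itself.
    current = [root] if root.tag == parts[0] else []
    # The remaining segments come in (edge, node) pairs.
    for edge, tag in zip(parts[1::2], parts[2::2]):
        current = [
            child
            for node in current
            for child in node.children
            if child.dep == edge and child.tag == tag
        ]
    return current


# Toy fragment of a sentence like "Robinson graduated from ... School"
school = Node("School", "NNP", "pobj")
from_ = Node("from", "IN", "prep", [school])
robinson = Node("Robinson", "NNP", "nsubj")
graduated = Node("graduated", "VBD", "ROOT", [robinson, from_])

reached = follow_path(graduated, "VBD/prep/IN/pobj/NNP")
print([n.text for n in reached])  # ['School']
```

Each step keeps only the children whose dependency and tag both match, which is the same as descending one directory level and then picking a file by name.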

So, for instance, to express a pattern like:
A past-tense verb (VBD) connected by a prep (prepositional modifier) dependency to an IN (preposition or subordinating conjunction), which is connected by pobj (object of preposition) to an NNP (singular proper noun), we write:
VBD/prep/IN/pobj/NNP

You can use the ‘*’ character to express any edge or token, like:
VBD/*/IN (a verb connected by any dependency to an IN)
or
*/prep/IN (any root node connected by prep to an IN)

You can also use ‘**’ to express any edge at any level, like:
VBD/**/DT (a verb connected through any number of edges, of any kind, to a DT at the end)

It is also possible to use a list to express “one of …”, like:
VBD/prep/IN/pobj/[IN,DT]
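
The wildcard and list syntax can be illustrated with a small matcher over a toy tree. This is only a sketch under stated assumptions, not the library's code: `Node`, `find`, `_matches` and `_descendants` are hypothetical names, nodes are compared by tag only, and ‘**’ is assumed to mean "one or more edges of any kind".

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class Node:
    """Hypothetical stand-in for a parsed token (not a spaCy object)."""
    text: str
    tag: str
    dep: str = "ROOT"
    children: List["Node"] = field(default_factory=list)


def _matches(segment: str, value: str) -> bool:
    """Match one pattern segment: '*', a '[A,B]' list, or a literal."""
    if segment == "*":
        return True
    if segment.startswith("[") and segment.endswith("]"):
        return value in [s.strip() for s in segment[1:-1].split(",")]
    return segment == value


def _descendants(node: Node) -> Iterator[Node]:
    """Yield every node below `node`, depth first."""
    for child in node.children:
        yield child
        yield from _descendants(child)


def find(root: Node, pattern: str) -> List[Node]:
    """Return the nodes at the end of `pattern` (assumed semantics, see above)."""
    parts = pattern.split("/")
    current = [root] if _matches(parts[0], root.tag) else []
    for edge, tag in zip(parts[1::2], parts[2::2]):
        nxt, seen = [], set()
        for node in current:
            if edge == "**":   # assumption: any chain of one or more edges
                candidates = _descendants(node)
            else:              # a single edge: literal, '*' or a list
                candidates = (c for c in node.children if _matches(edge, c.dep))
            for c in candidates:
                if _matches(tag, c.tag) and id(c) not in seen:
                    seen.add(id(c))
                    nxt.append(c)
        current = nxt
    return current


# Toy tree: graduated -> Robinson, from -> School -> the, in -> 1980
the = Node("the", "DT", "det")
school = Node("School", "NNP", "pobj", [the])
from_ = Node("from", "IN", "prep", [school])
in_ = Node("in", "IN", "prep", [Node("1980", "CD", "pobj")])
graduated = Node("graduated", "VBD", "ROOT",
                 [Node("Robinson", "NNP", "nsubj"), from_, in_])

print([n.text for n in find(graduated, "VBD/*/IN")])                   # ['from', 'in']
print([n.text for n in find(graduated, "VBD/**/DT")])                  # ['the']
print([n.text for n in find(graduated, "VBD/prep/IN/pobj/[NNP,CD]")])  # ['School', '1980']
```

Splitting on ‘/’ is safe here because list segments use commas, so a list never breaks the alternation of nodes and edges.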

Examples of usage

Below you will find a few examples of usage. They are all based on this sentence:
“Mrs. Robinson graduated from the Wharton School of the University of Pennsylvania in 1980.”

To prepare this sentence in spaCy, you need to do:

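import spacy
from grammaregex import print_tree, match_tree, find_tokens

# Load the English model ("en" was the shortcut name in spaCy when this was written)
nlp = spacy.load("en")
doc = nlp(u"Mrs. Robinson graduated from the Wharton School of the University of Pennsylvania in 1980.")
# Take the first (and here only) sentence of the parsed document
sent = next(doc.sents)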

Now, based on this ‘sent’ variable, we can do for instance:

>>> print_tree(sent, "tag_")
{ { { Mrs.->compound(NNP) } Robinson->nsubj(NNP) } graduated->ROOT(VBD) { from->prep(IN) { { the->det(DT) } { Wharton->compound(NNP) } School->pobj(NNP) { of->prep(IN) { { the->det(DT) } University->pobj(NNP) { of->prep(IN) { Pennsylvania->pobj(NNP) } } } } } } { in->prep(IN) { 1980->pobj(CD) } } { .->punct(.) } }
>>> match_tree(sent, "VBD/prep/IN/pobj/NNP")
True
>>> match_tree(sent, "VBD/prep/IN/pobj/VBD")
False
>>> match_tree(sent, "VBD/**/DT")
True
>>> find_tokens(sent, "VBD/prep/IN/pobj/*")
[School, 1980]
>>> find_tokens(sent, "VBD/prep/IN/*/[NNP,DT]")
[School]
>>> find_tokens(sent, "VBD/**/DT")
[the, the]

Summary

I encourage everyone with similar needs to use it and send me any feedback, issues or proposals for extensions.
Contributions are also welcome.
One extension I have in mind (and I will do it if there is demand) is to support more NLP libraries and make the current API independent of the library provider.

You can reach me on GitHub (to open pull requests or report issues) or via email: krzysiekfonal@gmail.com

Krzysztof Fonał

Software engineer, Data Science and NLP enthusiast, problem solver, sports lover