Compromise in 2019
One of the weirdest things to happen in my life is the unexpected popularity of this small nlp-compromise library in 2016. I was floundering a bit in my life, and having such positive reinforcement, and such engagement with strangers, meant a lot to me. I’m in my mid-30s, and I know well that success in the software world is a rare thing.
Because version numbers and changelogs don’t tell a story, it’s useful to mention there has been considerable development in the identity and mandate of this tool:
In February 2017 we released v7 which was a humongous re-think. Previous versions simply tagged words, and users found themselves writing gnarly for-loops around these terms:
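For illustration, a loop of that kind might have looked like the sketch below. It uses a plain array of tagged terms rather than the real pre-v7 output, so the `terms` shape and the `pastTenseVerbs` helper are assumptions, not the library's API.

```javascript
// Hypothetical shape for tagged terms; not the real pre-v7 output format
const terms = [
  { text: 'she', tag: 'Pronoun' },
  { text: 'walked', tag: 'PastTense' },
  { text: 'and', tag: 'Conjunction' },
  { text: 'sang', tag: 'PastTense' }
]

// Collect the past-tense verbs by hand, one term at a time
function pastTenseVerbs(list) {
  const found = []
  for (let i = 0; i < list.length; i++) {
    if (list[i].tag === 'PastTense') {
      found.push(list[i].text)
    }
  }
  return found
}

console.log(pastTenseVerbs(terms)) // [ 'walked', 'sang' ]
```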
V7 brought in sub-classing, where the results of .verbs() get their own methods, like .conjugate(), and so forth. It was about 4 months of work, and it was during a winter in Canada.
I’ve considered the few years since to be incremental work. v8 through v11 involved hardening, improving speed, reducing file size, and piling-on some features. We merged 200 PRs and resolved 400 issues.
Things that are strong (imo):
Verb conjugation:
I can’t believe this works as well as it does. There are some soft parts in niche participle forms, but for the most part it is a solved problem. It’s used in thousands of products and startups. This is wild.
Number parsing:
You have to contort a sentence pretty hard to fool our .value() parser. It’s strong as hell by now, and is a part of the project I’m very proud of.
.people()
This method is surprisingly usable, and somehow competitive with the intelligent cloud-AI services. The first & last stuff breaks-down in some cultures, and edge-case nicknames must be hard-coded, but being able to do nlp(myHugeText).people().out('array') is amazing to me.
Things that are weak (imo):
Phrase & clause interpretation:
people often want ‘the verb’ of the sentence. Detailed grammar results feel overly pedantic. We have weak support for ‘nounPhrase’ and ‘verbPhrase’, but could do way better. Countless people are repeatedly writing this layer themselves.
Normalization is opaque:
where’d the commas go? When is Titlecase affected? .normalize() and .out('root') seem to be riddled with arbitrary decisions, and opinions. There are even some gross order-of-operation issues. Writing a maze of documentation for this does not seem to be a good solution.
.out('format') is awkward:
getting data out of compromise still involves this frustrating api-layer, and we should just torch it.
plugins failed:
the library itself doesn’t even use them. Few people caught the dream of having highly varied, and highly curated personalities for the library. People want configuration, but don’t need this collaborative-configuration. I’ve always felt uneasy having one-truth solutions in English, but that’s part of it, it seems. This just didn’t happen.
Whitespace and punctuation are muddled:
too many traps like commas within parentheses, and periods after quotes. If there’s punctuation between two words, I always forget if a hyphen gets scooped-up by the preceding, or following word. What if there is no following word? It sucks right now.
Worse is when punctuation creates a pos-tag, like #Comma, then we change the punctuation, and forget to change the tag. This sort of thing shouldn’t happen.
so V12 is really happening:
I’m taking some time off from my day-job to do it.
Here’s my plan, so far:
.toJSON() and .fromJSON()
Producing a non-circular JSON state should be a perfect and clean process, at any point. Workers are becoming a big thing, and compromise should support a cleaner I/O model.
Also, sometimes you just want to touch the data yourself, then stick it back in. Reaching in to compromise internals right now is a fraught thing, and many have resorted to it.
JSON is beautiful, and the compromise API is frequently just a wrapper for changing it.
Also, having a simple JSON state will reduce the complexity of .clone(), which is currently an afterthought.
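A minimal sketch of the round-trip this implies, assuming a hypothetical flat state where each term is a plain object. The `toJSON` and `fromJSON` functions here are standalone illustrations, not the v12 API.

```javascript
// Hypothetical in-memory terms, linked to each other (so, circular)
function buildTerms(words) {
  const terms = words.map(text => ({ text, tags: [], prev: null, next: null }))
  terms.forEach((t, i) => {
    t.prev = terms[i - 1] || null
    t.next = terms[i + 1] || null
  })
  return terms
}

// Drop the object links, so the state can be serialized safely
function toJSON(terms) {
  return terms.map(t => ({ text: t.text, tags: t.tags.slice() }))
}

// Rebuild the links from the flat state
function fromJSON(json) {
  const terms = buildTerms(json.map(o => o.text))
  terms.forEach((t, i) => {
    t.tags = json[i].tags.slice()
  })
  return terms
}

const terms = buildTerms(['it', 'is', 'weird'])
terms[2].tags.push('Adjective')

// JSON.stringify(terms) would throw here, because of the prev/next links.
// The flat state can be stringified, sent to a worker, edited, and loaded back.
const state = JSON.parse(JSON.stringify(toJSON(terms)))
state[2].text = 'strange'
const restored = fromJSON(state)

console.log(restored[2].text, restored[2].tags, restored[2].prev.text)
```

With a state like this, a clone is just a `fromJSON(toJSON(terms))` call.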
Move some bloat out of core
We have this extend() plugin scheme, and our state is being simplified, so now is the perfect time to move some internal methods to separate libs.
Tentatively:
- .values() to compromise-number
- .people() to compromise-people
- .dates() to compromise-date
- .sentences() to compromise-sentence
If including these extensions is trivial, I don’t expect too much blowback. D3 did it! We can off-load parsing of hashtags, urls, and phone-numbers.
compromise core should still parse and tag things fully. It will still tag January as a #Month, without using compromise-date.
Our compression scheme benefits from having our data in one big bucket, so plugins should not have their own separate lexical data either.
building-up the lexicon still involves conjugation and inflection, so verbs() and nouns() will stay in core.
of course, this will get us closer to a re-usable compromise core for other-language libraries. I’m not making promises, but of course I want to do this, and have always wanted to do this, and will someday do this, and get asked to do this weekly.
Draft implementation:
World:
all configurable linguistic data.
I was going to call this ‘Personality’ but that felt too weird.
Pool:
all the given words, and all their metadata.
this is just a key-value object to dump our words.
the words in the object form a doubly linked-list, to the next and previous words.
Phrase:
a pointer to a starting word (in the pool), and a length.
Doc:
a list of phrases. A ton of methods.
the Doc object must also remember its parent Doc object. This is to allow ‘zoom-in/zoom-out’ navigation for various nested docs.
The ‘root’ doc will be a list of sentence phrases.
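A rough sketch of how these pieces could fit together, under some loud assumptions: the ids, the `makePool`, `phraseTerms` and `makeDoc` helpers, and the object shapes are all made up for illustration, and World (the linguistic data) is left out entirely.

```javascript
// Pool: a key-value object of words, each linked to its neighbours by id
function makePool(words) {
  const pool = {}
  words.forEach((text, i) => {
    pool['t' + i] = {
      id: 't' + i,
      text: text,
      prev: i > 0 ? 't' + (i - 1) : null,
      next: i < words.length - 1 ? 't' + (i + 1) : null
    }
  })
  return pool
}

// Phrase: a pointer to a starting word in the pool, and a length
function phraseTerms(pool, phrase) {
  const out = []
  let id = phrase.start
  for (let i = 0; i < phrase.length && id !== null; i++) {
    out.push(pool[id])
    id = pool[id].next
  }
  return out
}

// Doc: a list of phrases, plus a link to its parent Doc
function makeDoc(pool, phrases, parent) {
  return {
    pool: pool,
    phrases: phrases,
    parent: parent || null,
    text: function () {
      return this.phrases
        .map(p => phraseTerms(this.pool, p).map(t => t.text).join(' '))
        .join(' ')
    }
  }
}

const pool = makePool(['i', 'have', 'no', 'idea', 'it', 'is', 'weird'])
// The root doc is a list of sentence phrases
const root = makeDoc(pool, [{ start: 't0', length: 4 }, { start: 't4', length: 3 }])
// 'Zooming in' makes a child doc that remembers where it came from
const child = makeDoc(pool, [{ start: 't2', length: 2 }], root)

console.log(child.text()) // 'no idea'
console.log(child.parent.text()) // 'i have no idea it is weird'
```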
What’s still unclear:
#1 .cache() idea:
We’ve often considered a way to freeze, or cache a document, once all transformations have stopped, so that it can be wicked-fast. This way we could do .match() statements without a single for-loop. This would be amazing.
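One way to picture this, as a rough sketch only: build a lookup from tag to term positions once, then answer simple tag queries from that lookup. The `buildCache` and `findByTag` names are hypothetical, and this covers single-tag lookups only, not the full .match() syntax.

```javascript
// Hypothetical frozen document: terms with their tags
const terms = [
  { text: 'larry', tags: ['Person', 'Noun'] },
  { text: 'walked', tags: ['Verb', 'PastTense'] },
  { text: 'home', tags: ['Noun'] }
]

// Done once, after all transformations have stopped
function buildCache(list) {
  const byTag = {}
  list.forEach((term, index) => {
    term.tags.forEach(tag => {
      byTag[tag] = byTag[tag] || []
      byTag[tag].push(index)
    })
  })
  return Object.freeze(byTag)
}

// Later lookups read the index, instead of scanning every term
function findByTag(cache, list, tag) {
  return (cache[tag] || []).map(i => list[i].text)
}

const cache = buildCache(terms)
console.log(findByTag(cache, terms, 'Noun')) // [ 'larry', 'home' ]
console.log(findByTag(cache, terms, 'Adverb')) // []
```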
#2 .split() & .insert()
this has been a terror in compromise, and is the source of most of its bugs.
The linked-list Pool idea is meant to improve this situation, but I still have no clear understanding of what happens when the word-ordering gets messed with. Consider this example:
let doc = nlp('i have no idea. It is weird.')
let child = doc.sentences(0).prepend('so literally')
child.parent().terms(0)
The parent document has a phrase that begins with i, but now must be updated to begin with so. In a gross way, all of these word-ordering methods must change every parent document — that means a Doc object can change, even though you’ve done nothing to it. 😬
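The problem can be reproduced in a stripped-down sketch, assuming the Phrase-as-start-pointer idea from the draft implementation. The `prepend` function here is hypothetical, and deliberately naive: it only updates the phrase it was called on.

```javascript
// Pool of words as a linked list, keyed by id (hypothetical shape)
const pool = {
  a: { text: 'i', prev: null, next: 'b' },
  b: { text: 'have', prev: 'a', next: 'c' },
  c: { text: 'no', prev: 'b', next: 'd' },
  d: { text: 'idea', prev: 'c', next: null }
}

// Parent and child both point at the same first word
const parentPhrase = { start: 'a', length: 4 }
const childPhrase = { start: 'a', length: 4 }

// Naive prepend: link a new word in front, and update this phrase only
function prepend(phrase, id, text) {
  const first = pool[phrase.start]
  const before = first.prev
  pool[id] = { text: text, prev: before, next: phrase.start }
  if (before !== null) {
    pool[before].next = id
  }
  first.prev = id
  phrase.start = id
  phrase.length += 1
}

function firstWord(phrase) {
  return pool[phrase.start].text
}

prepend(childPhrase, 'z', 'so')

console.log(firstWord(childPhrase)) // 'so'
console.log(firstWord(parentPhrase)) // 'i' (stale: the parent was never told)
```

Keeping the parent correct would mean walking up the chain of parent docs on every insert, which is the gross part.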
#3 partial-text outputs
when you’re using words in the middle of a sentence, it’s never clear if you want to include the punctuation and whitespace before and after these words:
nlp('larry, curly, and moe').match('curly').text()
// 'curly, '
I don’t know what to do here.
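To make the choice concrete, here is a small sketch that assumes each term keeps its own leading and trailing characters. The `pre`/`post` fields and the `trim` flag are hypothetical names for illustration, not a decided design.

```javascript
// Hypothetical term shape: the word, plus what comes before and after it
const terms = [
  { pre: '', text: 'larry', post: ', ' },
  { pre: '', text: 'curly', post: ', ' },
  { pre: '', text: 'and', post: ' ' },
  { pre: '', text: 'moe', post: '' }
]

// Join terms, optionally dropping the outer punctuation and whitespace
function text(list, trim) {
  return list
    .map((t, i) => {
      const isFirst = i === 0
      const isLast = i === list.length - 1
      const pre = trim && isFirst ? '' : t.pre
      const post = trim && isLast ? '' : t.post
      return pre + t.text + post
    })
    .join('')
}

const match = terms.slice(1, 2) // just 'curly'
console.log(JSON.stringify(text(match, false))) // "curly, "
console.log(JSON.stringify(text(match, true))) // "curly"
console.log(JSON.stringify(text(terms.slice(1, 4), true))) // "curly, and moe"
```

Either answer is defensible, which is the problem: the flag just moves the decision to the caller.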
#4 .out() query-language
It’s always been clear that there will never be enough named output formats for everyone. I’d like to have a way to specify what data you’d like it to return, something like:
doc.out({text:true, terms:{normal:true, tags:true, index:true}})
// [{text:'Larry, curly, and moe', terms:[{normal:'larry', index:0, tags:['Person']}, ... ] }]
HTML, XML, and LaTeX outputs are a frequent frustration, and I think most people eventually bolt-on a custom method themselves.
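A sketch of how a selector object like that might be applied, assuming a hypothetical document state of phrases and terms. The standalone `out` and `pickTerm` functions are illustrations only, and the field names are taken from the example above.

```javascript
// Hypothetical document state: one phrase with its terms
const phrases = [
  {
    text: 'Larry, curly, and moe',
    terms: [
      { text: 'Larry', normal: 'larry', tags: ['Person'] },
      { text: 'curly', normal: 'curly', tags: ['Person'] },
      { text: 'and', normal: 'and', tags: ['Conjunction'] },
      { text: 'moe', normal: 'moe', tags: ['Person'] }
    ]
  }
]

// Copy only the requested fields from a term
function pickTerm(term, index, fields) {
  const result = {}
  if (fields.normal) result.normal = term.normal
  if (fields.tags) result.tags = term.tags.slice()
  if (fields.index) result.index = index
  return result
}

// Apply the selector object to every phrase
function out(list, query) {
  return list.map(phrase => {
    const result = {}
    if (query.text) result.text = phrase.text
    if (query.terms) {
      result.terms = phrase.terms.map((t, i) => pickTerm(t, i, query.terms))
    }
    return result
  })
}

const json = out(phrases, { text: true, terms: { normal: true, tags: true, index: true } })
console.log(JSON.stringify(json[0].terms[0]))
// {"normal":"larry","tags":["Person"],"index":0}
```

Leaving a key out of the query leaves it out of the result, so `out(phrases, {text: true})` returns just the text of each phrase.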
Other things:
I’m gonna drop the slack group in favour of a google-friendly chat solution.
I’ve put thought towards it, and am not ready for es-modules. Real support in node is absent still (as of June 2019), and proactive support involves tradeoffs that I don’t wanna do. Tree-shaking is now being done on common-js, which is awesome.
Typescript types though - this is happening.
The .match() syntax has some regrettable design-decisions. I considered making breaking changes to this part of the library, but it feels too scary. Gonna keep it unchanged for v12.
People really love having the docs in ObservableHQ. I’m happy with it as well. Runkit is great for debugging issues. Unpkg and npm have been flawless. TravisCI has caught a lot of otherwise-released issues.
Having the library on my personal github has helped bring-in freelance work, which has been nice. No one seems to mind me doing that.
