AcceleratedText: A short guide
In this article I want to show that good ol’ templates are not the only option for generating text. The product itself is under intense development and many things can change along the way, but the core principle will remain the same: the ability to visually define the structure of a text and have the machine build fluent text for you.
Intro
AcceleratedText (https://github.com/tokenmill/accelerated-text) is an open-source tool for generating text.
We define how text is to be generated in what we call a Document Plan:
The Document Plan defines the structure, and the data is usually filled in from a CSV file (there is also an option to pass raw data directly through an API call).
What are the benefits versus templates?
Templates have a static structure; the only modifications possible are changes in parameters or synonyms. If you build many texts (e.g. product descriptions for a shop), they will end up looking nearly identical.
AcceleratedText uses AMR (Abstract Meaning Representation) blocks. Depending on the context and the AMR itself, a block can have multiple representations. Having multiple AMRs in a single description creates even more variation.
In general, it is quite easy to get around 20 unique sentences for a single description. On top of that, the end result is modified by our Machine Learning model, introducing even more variants.
But the most vital difference: all of this can be done in multiple languages.
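To make the contrast with templates concrete, here is a minimal Python sketch. It is not AcceleratedText code: the row, the frames and the function names are invented for illustration. It only shows why combining several independent realizations multiplies the number of distinct texts, while a single template always yields one.

```python
from itertools import product

# Hypothetical data row; the field names are illustrative only.
row = {"name": "Alimentum", "area": "city centre"}

# A classic template: one fixed structure, only the parameters change.
TEMPLATE = "{name} is located in the {area}."

# Toy stand-ins for two meanings, each with several surface forms.
# These frames are made up for this sketch, not produced by AcceleratedText.
LOCATION_FRAMES = [
    "{name} is located in the {area}.",
    "In the {area} there is a place called {name}.",
    "You can find {name} in the {area}.",
]
FAMILY_FRAMES = [
    "It is family-friendly.",
    "It is a place to bring the whole family.",
]


def template_outputs(row):
    """A template produces exactly one text per data row."""
    return {TEMPLATE.format(**row)}


def variant_outputs(row):
    """Every combination of a location frame and a family frame."""
    return {
        f"{loc.format(**row)} {fam}"
        for loc, fam in product(LOCATION_FRAMES, FAMILY_FRAMES)
    }


print(len(template_outputs(row)), len(variant_outputs(row)))
```

Three location frames times two family frames already give six different descriptions for the same row, which is the effect the AMR blocks aim for on a larger scale.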
Abstract Meaning Representation
Since it is the driving force behind all of AcceleratedText, let’s dive a bit deeper into it.
You can read more about AMR in this specification: https://github.com/amrisi/amr-guidelines/blob/master/amr.md
Our implementation
We took the following key points as our base:
- AMR captures “who is doing what to whom” in a sentence
- AMR is abstract: it may represent any number of natural language sentences
In essence, it says what to write, not how to write.
In reality this is easier said than done. A common approach is simply to have several variations that represent the same meaning but have different grammatical structures.
A more detailed overview and the key differences in our approach can be found in the project wiki: https://github.com/tokenmill/accelerated-text/wiki/Abstract-Meaning-Representations
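The “what to write, not how to write” idea can be sketched in a few lines of Python. This is a toy record, not a real AMR graph and not how AcceleratedText stores meanings; the class and field names are my own assumptions. One meaning ("who did what to whom") is rendered with two different grammatical structures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Meaning:
    """Toy 'who is doing what to whom' record (hypothetical, not a real AMR)."""
    agent: str
    verb_past: str        # simple past form, e.g. "invited"
    verb_participle: str  # past participle, e.g. "invited"
    patient: str


def realize(meaning):
    """Return two surface forms of the same meaning: active and passive."""
    return [
        f"{meaning.agent} {meaning.verb_past} {meaning.patient}.",
        f"{meaning.patient} was {meaning.verb_participle} by {meaning.agent}.",
    ]


sentences = realize(Meaning("Alice", "invited", "invited", "Bob"))
for sentence in sentences:
    print(sentence)
```

Both sentences say the same thing; only the grammar differs. The meaning record never changes, which is what lets the realization side be improved or swapped later.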
Example — restaurant description
For this example, I will be using a restaurant data set of the kind used to evaluate generated text with the BLEU (Bilingual Evaluation Understudy) score.
We want to create sentences like these:
- There is a place in the city centre, Alimentum, that is not family-friendly.
- In the city centre there is a family-friendly place called Alimentum.
- Located in riverside area, Alimentum restaurant is a place to bring the whole family.
- If you want to take the children for a meal then try Alimentum they provide a child friendly service in their riverside setting .
- Near the riverside is a kids friendly eatery called Alimentum. There is a Burger King close to it as well.
Only a few variables are given: the restaurant name, whether it is family-friendly (True or False), and the location. The output needs to read like the sentences above.
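For orientation, this is roughly what such input could look like when read from CSV. The rows below are made up, modelled on the example sentences above. The column names “familyFriendly” and “near” appear later in this article, while “name” and “area” are names I assume here for illustration.

```python
import csv
import io

# Illustrative rows only; "name" and "area" are assumed column names.
SAMPLE_CSV = """name,familyFriendly,area,near
Alimentum,False,city centre,
Alimentum,True,riverside,Burger King
"""


def load_rows(text):
    """Parse CSV text into a list of dicts, one per restaurant row."""
    return list(csv.DictReader(io.StringIO(text)))


rows = load_rows(SAMPLE_CSV)
for row in rows:
    print(row)
```

Note that the first row has an empty “near” field, which becomes an empty string after parsing.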
Main components
The full solution consists of three components:
- DGL: Domain Grammar Library. A very low-level grammatical structure, which is compiled into a concrete grammar. These are best left to developers.
- AMR: Abstract Meaning Representation blocks. These are built using logic and grammatical pattern blocks.
- Document Plan: a high-level document structure, which is built using logic and AMR blocks.
In most cases, the user cares only about the Document Plan; most AMR and DGL blocks are already premade. For this example, we will create custom AMR blocks and build a Document Plan, and will ignore DGL blocks for now.
To make it a bit clearer, we will do this example in two iterations.
- In the first iteration, we use nothing but preexisting AMRs
- In the second, we design those AMRs ourselves
1st Iteration
Let’s start with a simple “Alimentum is (not) family-friendly”.
First, we need to select a proper AMR:
This set of AMRs is called ConceptNet (refer to https://github.com/tokenmill/accelerated-text/wiki/ConceptNet for more information). In short, these are very low-level AMRs which (at least in theory) can cover most possible meanings. They are not very pleasant to work with, and it is usually assumed that custom AMRs will be created for a specific domain (e.g. restaurants), but they should get the job done.
We’ve chosen the “is-a” AMR. Green blocks are data (coming from CSV or some other external source); the gray block is called a “quote”, which is simply a hardcoded string. This is certainly not ideal, since the preferred way is a dictionary item, but it will do its job for the moment.
Another thing to note is “(polarity)”. Arguments in parentheses are optional. Polarity gives either a positive (the default) or a negative meaning. In this case “familyFriendly” is a boolean (true, false, 0, 1, T and F are accepted as boolean values) and can switch the meaning.
Voilà, we have our first result! Currently it is only in English and only a single variant.
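The polarity switch can be imitated in plain Python. This is only a sketch under my own assumptions: the function names are hypothetical, case-insensitive matching is my guess, and the output wording is simply the target sentence from this section, not necessarily what the tool prints.

```python
# Accepted boolean spellings, as listed in the text (case-insensitivity assumed).
TRUE_VALUES = {"true", "1", "t"}
FALSE_VALUES = {"false", "0", "f"}


def parse_bool(value):
    """Turn a raw data cell into a boolean, or fail loudly if it is not one."""
    text = str(value).strip().lower()
    if text in TRUE_VALUES:
        return True
    if text in FALSE_VALUES:
        return False
    raise ValueError(f"not a boolean value: {value!r}")


def is_a(subject, attribute, polarity=True):
    """Toy 'is-a' realization: polarity is optional and positive by default."""
    verb = "is" if polarity else "is not"
    return f"{subject} {verb} {attribute}."


print(is_a("Alimentum", "family-friendly", parse_bool("F")))
```

Leaving the polarity argument out gives the positive sentence, which mirrors how the optional argument behaves in the block.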
Now we can talk about location. The most basic way would look like this:
And we get the result:
We got two short sentences. Now we can add some variation. In this case, the order in which we talk about family-friendliness and location does not matter, because these sentences are independent. This can be expressed by adding the list element “in random order”.
And now we generate two variants:
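What “in random order” does to two independent sentences can be sketched like this. The two sentences and the function names are placeholders of mine, not actual tool output.

```python
import random
from itertools import permutations

# Placeholder sentences, standing in for the output of two independent AMRs.
parts = ["Alimentum is family-friendly.", "Alimentum is in the city centre."]


def all_orders(sentences):
    """Enumerate every ordering; for two sentences that is two variants."""
    return [" ".join(order) for order in permutations(sentences)]


def in_random_order(sentences, rng=random):
    """Pick one ordering at random, leaving the input list untouched."""
    shuffled = list(sentences)
    rng.shuffle(shuffled)
    return " ".join(shuffled)


print(all_orders(parts))
print(in_random_order(parts))
```

With two sentences the list yields exactly two variants; with three independent sentences it would already be six.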
Going through the data rows, we see that sometimes the location needs to be specified in relation to some other location.
Sometimes we have a value in “near”, but usually we don’t. A simple way is to branch on these cases and use different AMRs.
Note: if a data field is not a boolean, it can still be used as one. An empty field means false; a field holding any value is true.
We made the condition: if the data field “near” has data, use the “located-near” AMR; if it doesn’t, stay with “at-location”.
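The same branching, written as a small Python sketch. The helper names are hypothetical, and treating a whitespace-only or missing field as empty is my assumption; the AMR names are the ones used in this article.

```python
def is_truthy(field):
    """Empty (or missing) means false; a field with any value means true."""
    return field is not None and str(field).strip() != ""


def choose_location_amr(row):
    """Pick the AMR name depending on whether 'near' holds a value."""
    return "located-near" if is_truthy(row.get("near")) else "at-location"


print(choose_location_amr({"area": "riverside", "near": "Burger King"}))
print(choose_location_amr({"area": "city centre", "near": ""}))
```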
Results with a row having “near”:
At this point we’re generating text, but it’s not ideal. First of all, it should be “near the Burger King”, and second, the full sentence should look like “It is located in the city center near the Burger King”.
2nd Iteration: crafting AMRs
Again, we will start with “Alimentum is a family-friendly”, but with a little twist. There’s no point in building a new AMR that does exactly what the old ConceptNet one was already capable of.
Looking further at the data set, we can see examples like these:
- Alimentum is not a family-friendly place
- Alimentum is not a family-friendly arena
- this is not a family-friendly venue
Digging deeper, we see that sometimes the “eatType” item is filled in; it specifies the type of restaurant (e.g. Coffee Shop, Restaurant, etc.) and should be used instead of a generic “venue” or “place”.
We can implement all of this inside the AMR itself, hiding the complexity from other users.
Let’s start by looking at an existing AMR, using “is-a” as an example:
First of all, “N -> a N”: this is DGL, which operates on the low-level grammatical structure. This one says that it takes a Noun and adds another Noun.
Second, note the red blocks. These are variables (hence “variable” at the end of the block). Importantly, if they are not implemented internally, they will be rendered as sockets in the Document Plan. That’s how we get “Subject” and “Attribute”.
OK, now back to our original intention. We start by defining the generic values for describing a place:
Blue blocks are dictionary items (hence the “Dict.:” in front). The “_1_N” part is a bit more complicated. The number specifies the sense: the same word can have different meanings depending on context, and a different number means a different meaning. The “N” at the end specifies the part of speech.
Usually we will have:
- N: noun
- V: verb
- A: adjective
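To show how such a key breaks down, here is a small parser. I am assuming the full key has the shape word, sense number, part of speech (for example “place_1_N”); only the “_1_N” suffix is described above, so the rest is a guess, and the names in the code are hypothetical.

```python
from typing import NamedTuple


class DictKey(NamedTuple):
    word: str   # the dictionary word itself (assumed to be the prefix)
    sense: int  # which meaning of the word is intended
    pos: str    # part-of-speech tag, e.g. "N" or "V"


def parse_dict_key(key):
    """Split from the right so words containing underscores stay intact."""
    word, sense, pos = key.rsplit("_", 2)
    return DictKey(word, int(sense), pos)


print(parse_dict_key("place_1_N"))
```

Splitting from the right is a deliberate choice: a multi-word entry such as “coffee_shop_2_N” would still parse into the word “coffee_shop”, sense 2, tag “N”.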
Going further, we basically copy-paste the “is-a” AMR, but add some bells and whistles. It can be a bit overwhelming, but don’t be afraid.
The first part looks identical. In the second part we use “mkN” with two positions instead of a single one. The first position is given the plain string “family friendly”, and the second position is either:
- the “eatType” value from the data, if it is set
- a random item from the “places” variable
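The choice for the second position can be sketched in Python. This only imitates the selection logic, not “mkN” itself; the function names are hypothetical, and the list holds just the two generic nouns named above, since the real contents of “places” are not shown here.

```python
import random

# Assumed contents of the "places" variable; the real block may hold more items.
PLACES = ["place", "venue"]


def head_noun(row, rng=random):
    """Use 'eatType' when it is set, otherwise a random generic noun."""
    eat_type = (row.get("eatType") or "").strip()
    return eat_type if eat_type else rng.choice(PLACES)


def family_friendly_phrase(row, rng=random):
    """Combine the fixed string with the chosen noun."""
    return f"family friendly {head_noun(row, rng)}"


print(family_friendly_phrase({"eatType": "Coffee Shop"}))
print(family_friendly_phrase({"eatType": ""}))
```

Rows with “eatType” filled in always produce the same noun, while rows without it pick up extra variation from the generic list.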
Our Document Plan now looks like this:
And we get far more results:
Another part is the Location AMR.
Essentially, this is a combination of “at-location” and “located-near”. It always does the “at-location” part, and if “(near)” is given, it also does the “located-near” part.
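In plain Python the combined behaviour would look roughly like this. The function is hypothetical and the wording follows the target sentence from the first iteration; the text actually produced by the AMR may differ.

```python
def location_sentence(area, near=None):
    """Always state the area; mention the neighbour only when 'near' is given."""
    sentence = f"It is located in the {area}"  # the "at-location" part
    if near:                                   # optional "(near)" argument
        sentence += f" near the {near}"        # the "located-near" part
    return sentence + "."


print(location_sentence("city centre"))
print(location_sentence("city centre", "Burger King"))
```

Because the optional part lives inside one function, the caller no longer needs its own branch, which is the same reason the Document Plan gets simpler.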
The final Document Plan is a lot cleaner and nicer to look at.
Our results, while still not perfect, look like this:
Conclusion
Using the same techniques, more variations of the text can be created, which is always good.
The big topic I haven’t touched on is translation. For this we need to create proper Dictionary Items instead of quotes (those gray blocks). At the moment this step is not possible via the user interface; manual editing of a config file is required.
I don’t want to bother you with these details, but once these kinds of problems are fixed, the output looks like this:
(Not quite sure if this is a correct translation, but at least Google Translate approves it 🙂)
Building the initial Document Plan is more complicated than writing a simple template, but the effort starts bearing fruit when you want more text variations and even different languages.
At the moment, these AMRs are still short sentences. With the next release, they will be joined into a single fluent sentence automatically. That’s the good part about this approach: the user cares only about telling what to say using blocks, and under the hood the backend can be improved to figure out better ways to make sentences.