Highlighting keywords in context, with dependency parsing

Using SpaCy and a tiny training set to build CoachBot’s “Smart Tips”

Sandy Rogers
5 min read · Mar 18, 2020


The initial challenge

Show users a few pieces of content, from a library, that are relevant to the short text they’ve just written.

In my context, this was building the “Smart Tips” feature for CoachBot. CoachBot helps managers engage and develop their teams. One headline feature is its support of 1:1 meetings, and Smart Tips prepares busy managers and employees to get the most from their 1:1 meeting times.

The constraint

CoachBot helps users to quickly co-create agendas for 1:1s, so we set out to provide smart tips based on those agendas and nothing more, offering guidance with no extra input needed. 🚀️

So a classifier?

The obvious approach to this would be to train a text-classifier model to decide what type of subject each piece of text (here, an agenda item for a meeting) refers to. Given a category, present library material tagged with the same category and job done. Training that was pretty simple, and worked reasonably well (as in, better than random) even with only tens of examples per category.

An agenda item of “I would like your thoughts on my widget-production” is classified as “Feedback” by an algorithm

But this got difficult for three reasons.

  • Real-world agenda items sometimes touch on multiple topics. “Work-life balance since I took over the AT-AT responsibilities”: a manager and employee preparing to discuss that might want tips on both Wellbeing issues and Role Development.
  • People use jargon, especially in teams! To a non-team-member, “is AT-AT ok?” might be asking for feedback on a project, a personal issue, a conflict issue, an update on goals or something else entirely…
  • Bad (and opaque) categorisations are confusing. No prediction model will be perfect, especially given those two challenges. But failure is worst when it is confusing. Without any indication as to why the classifier picked the category it did, mistakes appear not just silly but also annoying.

Smarter keywords

This is where a keyword-based approach comes in. Instead of classifying the entire text, just look for words or phrases we definitely want to categorise.

By inspecting a small sample of data (~1000 agenda items) and using some domain knowledge, we manually created several topics that we wanted to identify, and curated a small list of seed phrases that we knew people might use for each. These are like keywords for a topic:

{"wellbeing": ["stress", "engagement", "burnout", ...], ...}

Of course exact matching will usually fail given free-text input, where people write naturally.

So, our predictor instead looks for any words or phrases that are semantically similar to the seed phrases. In other words, in other words 😉️.

Embedding vectors for 70000 English words (projected into 3 dimensions, there are actually 200 dimensions in the model). The 10 words most similar in meaning to “goal” are highlighted. Using http://projector.tensorflow.org

This works using Word Embedding vectors. The model is trained on a huge corpus of text from the web, so it understands how words relate to each other. Words with similar meaning (say, “goal” and “objective”) are close to each other in the model’s vector space, so at some level it understands the meaning of what people write.

In practice, we use a couple of methods to find keywords. Here are the two main ones, one simple and one more complex:

Lemmatized POS-based matching

This fast approach uses SpaCy’s token-based matching to look for phrases consisting of an optional pronoun and a word which shares the same lemma as a seed phrase (so “your burnouts” would match “burnout” from the “wellbeing” category).
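Here is a minimal sketch of this kind of pattern using SpaCy’s Matcher. To keep it runnable without downloading a model, the Doc is annotated by hand with the lemmas and part-of-speech tags that a trained pipeline would normally supply. The seed lists, the example sentence and the build_matcher helper are illustrative assumptions, not CoachBot’s actual code.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc
from spacy.util import filter_spans

# Hypothetical seed phrases, one list per topic.
SEEDS = {"wellbeing": ["stress", "burnout"], "goals": ["goal", "objective"]}


def build_matcher(vocab, seeds):
    """One pattern per seed: an optional pronoun, then a token whose lemma is the seed."""
    matcher = Matcher(vocab)
    for topic, lemmas in seeds.items():
        patterns = [[{"POS": "PRON", "OP": "?"}, {"LEMMA": lemma}] for lemma in lemmas]
        matcher.add(topic, patterns)
    return matcher


nlp = spacy.blank("en")  # vocab and tokenizer only, no trained components

# Hand-annotated stand-in for the output of a trained pipeline.
doc = Doc(
    nlp.vocab,
    words=["Let", "us", "discuss", "your", "burnouts"],
    lemmas=["let", "we", "discuss", "your", "burnout"],
    pos=["VERB", "PRON", "VERB", "PRON", "NOUN"],
)

matcher = build_matcher(nlp.vocab, SEEDS)
# The optional pronoun produces overlapping candidates ("burnouts" and
# "your burnouts"); filter_spans keeps the longest of each overlapping group.
spans = filter_spans(matcher(doc, as_spans=True))
matches = [(span.label_, span.text) for span in spans]
print(matches)
```

With a real pipeline loaded, the same matcher would run on nlp(text) directly, since the tagger and lemmatizer fill in the attributes set by hand here.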

Semantically-similar noun chunks

This finds looser matches, using SpaCy’s noun chunks to isolate parts of the sentence and then similarity matching to compare them to seed phrases. For example, “I think the team’s goals are going to be hit” has a noun chunk of “the team’s goals”, which matches the seed word “goal” with fairly high vector similarity.
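The similarity step can be sketched in plain Python. The vectors below are invented 3-dimensional toys, the 0.7 threshold is an arbitrary assumption, and the chunk is passed in as a ready-made token list (in SpaCy it would come from doc.noun_chunks). Averaging only the tokens that have a vector is a simplification chosen for this sketch.

```python
import math

# Toy 3-dimensional vectors, invented for illustration only.
TOY_VECTORS = {
    "goal": [0.9, 0.1, 0.0],
    "goals": [0.8, 0.2, 0.1],
    "objective": [0.85, 0.15, 0.05],
    "team": [0.3, 0.7, 0.1],
    "stress": [0.0, 0.2, 0.9],
    "burnout": [0.1, 0.1, 0.95],
    "lunch": [0.1, 0.9, 0.3],
}

# Hypothetical seed phrases per topic.
SEEDS = {"goals": ["goal", "objective"], "wellbeing": ["stress", "burnout"]}


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def chunk_vector(tokens):
    """Average the vectors of the tokens we know; None if none are known."""
    known = [TOY_VECTORS[t] for t in tokens if t in TOY_VECTORS]
    if not known:
        return None
    return [sum(dim) / len(known) for dim in zip(*known)]


def best_topic(tokens, threshold=0.7):
    """Return the topic of the most similar seed, or None if below the threshold."""
    vec = chunk_vector(tokens)
    if vec is None:
        return None
    scored = [
        (cosine(vec, TOY_VECTORS[seed]), topic)
        for topic, seeds in SEEDS.items()
        for seed in seeds
    ]
    score, topic = max(scored)
    return topic if score >= threshold else None


print(best_topic(["the", "team", "'s", "goals"]))
print(best_topic(["lunch"]))
```

The threshold is the main tuning knob: too low and unrelated chunks get tagged, too high and the matcher collapses back towards exact matching.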

The context: dependency-parsing around the matches

Our approach here is to find the part of the sentence that relates to the matching category. In our experience, the most natural-feeling collection of words to mark as categorised is the subtree surrounding the match. (What if the subtrees of different categories overlap? Anecdotally, whichever category turned in the longest subtree usually “felt” right.)
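The “longest subtree wins” rule of thumb might look something like this. It is a sketch under assumptions: candidates are represented as hypothetical (start, end, topic) token ranges with an exclusive end, and ties go to whichever candidate was listed first.

```python
def resolve_overlaps(candidates):
    """Keep the longest candidates, dropping any that overlap one already kept.

    candidates: list of (start, end, topic) token ranges, end exclusive.
    """
    chosen = []
    # Longest first; sorted() is stable, so ties keep their input order.
    by_length = sorted(candidates, key=lambda c: c[1] - c[0], reverse=True)
    for start, end, topic in by_length:
        if all(end <= s or start >= e for s, e, _ in chosen):
            chosen.append((start, end, topic))
    return sorted(chosen)  # back into document order


candidates = [(2, 7, "feedback"), (5, 9, "wellbeing"), (10, 12, "goals")]
resolved = resolve_overlaps(candidates)
print(resolved)
```

Here the “wellbeing” range overlaps the longer “feedback” range and is dropped, while the non-overlapping “goals” range survives.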

For example, take the sentence: “Can I have some feedback on my presentation yesterday?”.

We’ve recognised that “some feedback” is a noun chunk relevant to the Feedback topic, but since the whole sentence has been parsed by the language model, we know that “some feedback” lives in the subtree “some feedback on my presentation”. In other words we know the object of the sentence that the category refers to.

This is useful because users care more about their specific subject than about feedback as an abstract thing.

This understanding helps give the tips context. And thanks to SpaCy’s contextual dependency parsing, this works even when jargon is used. “Can I have some feedback on AT-AT proj?” would be understood just fine, even though “AT-AT proj” has no meaning to outsiders, and is out-of-vocabulary to the language model.
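The subtree lookup for the “feedback on my presentation” example can be sketched with SpaCy’s Token.left_edge and Token.right_edge. So that it runs without a trained model, the dependency parse below is written out by hand; the heads and labels are my assumption of a plausible parse, not output captured from CoachBot’s pipeline.

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")  # vocab only; the parse is supplied by hand below

words = ["Can", "I", "have", "some", "feedback", "on", "my", "presentation", "yesterday", "?"]
# Absolute index of each token's head; the root ("have") points at itself.
heads = [2, 2, 2, 4, 2, 4, 7, 5, 2, 2]
deps = ["aux", "nsubj", "ROOT", "det", "dobj", "prep", "poss", "pobj", "npadvmod", "punct"]
doc = Doc(nlp.vocab, words=words, heads=heads, deps=deps)

# Pretend a keyword matcher flagged the noun chunk "some feedback".
match = doc[3:5]

# The chunk's syntactic root is "feedback"; its subtree is everything that
# hangs off it in the parse, taken here as one contiguous span.
root = match.root
context = doc[root.left_edge.i : root.right_edge.i + 1]
print(context.text)
```

Because the span is derived from the parse rather than from the vocabulary, an unfamiliar token such as “AT-AT” would be included in the same way, provided the parser attaches it under the matched word.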

In principle, knowing the objects that certain categories refer to might also be useful, for example if we wanted to report to business leaders what the most common topics of Feedback-related agenda items were. Perhaps somebody in charge would want to know that everybody’s concerned about that AT-AT proj.



Sandy Rogers

Reformed astrophysics researcher, recovering marathon runner, and recalcitrant data wrangler @SaberrUK.
