Simple datetime disambiguation

How we quickly spin up new machine learning models at Clara

Jason Laska
Aug 10, 2016 · 9 min read

One benefit of Clara’s human-in-the-loop platform is that we can ship machine learning results that are too nascent to drive the customer experience directly. Surfacing “good enough” predictions leads to simpler (or implicit) annotation schemes. In this post, we explore one such application.

Key to scheduling is understanding datetimes (dates, times, or a combination) and whether setting up a meeting at a given datetime is admissible. In messages sent to us, datetimes occur most often in the context of confirming a time, suggesting a time, or expressing some form of unavailability (travel, another meeting, etc.). There are also often spurious uses of dates, such as "See you tomorrow!", that are usually unimportant for scheduling.

Very broadly, we could bucket the interesting datetimes into two opposite meanings: admissible (positive) or inadmissible (negative). The following image depicts negative-sense datetimes in red and positive-sense datetimes in green.

Given a large corpus of data with these binary labels already annotated, there are several good supervised techniques for sequences, such as conditional random fields or recurrent neural nets, that would be effective at predicting these labels.

Our main objective in this exploration is to quickly bootstrap an unsupervised classifier that can be used to help our Clara Remote Assistants more easily annotate whether datetimes are admissible.

Learning datetime senses

Our problem is similar to word-sense disambiguation, where we want to identify different meanings of words in context. Certain neighboring words such as "traveling" or "doesn't" often provide a good proxy for a datetime's admissibility, so it seems reasonable to apply tools that take advantage of the distributional hypothesis: words in similar contexts likely have similar meanings. The setup that follows is inspired by (Schütze, 1997).

Training data

Forming (word, context) pairs around “later this week” and “week of the 9th” datetimes.

Message text is first passed through our base NLP pipeline, which, among other things, breaks messages down into logical sections, tokenizes the text, separates out special tokens (such as datetimes), and classifies the intents of phrases and messages. We next compile a corpus of all "interesting-for-scheduling" phrases from the text via our intent classifiers. We further break down each of these phrases into a window of up to 5 context words on either side of each word w in the phrase, giving us the (word, context) pair (w, q), where q is a list of context words. All datetimes are treated as the single word DATETIME.
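As a concrete illustration, here is a minimal sketch of the windowing step. It assumes phrases have already been tokenized and datetimes collapsed into the DATETIME token by the pipeline; the function name and example phrase are illustrative, not Clara's actual code.

```python
WINDOW = 5  # context words kept on either side of each word

def context_pairs(tokens, window=WINDOW):
    """Yield (word, context) pairs, where context is a list of neighboring words."""
    for i, word in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield word, left + right

phrase = "I am traveling until DATETIME but DATETIME works well".split()
for w, q in context_pairs(phrase):
    print(w, q)
```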

Word and context vectors

Steps for mapping a set of context words to a context vector.

We learn a vector representation for each word by forming a shifted positive pointwise mutual information (SPPMI) matrix over the context windows (Levy and Goldberg, 2014). An element of this matrix is defined as

SPPMI(w, q_n) = max(PMI(w, q_n) - log k, 0),

where q_n is a context word in q and the PMI between two words can be estimated as

PMI(w, q_n) = log( #(w, q_n) · |D| / (#(w) · #(q_n)) ),

with #(·) denoting the number of (co-)occurrences in the training set and |D| the total number of (word, context word) pairs. When k = 1 the matrix remains positive but unshifted, while larger k penalizes infrequent word-context co-occurrences. We use k = 1.5 in the experiments below (we found larger k to matter more as less common words are introduced into the vocabulary).
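For concreteness, here is a minimal sketch of building such a matrix from the (word, context) pairs; the counting and vocabulary handling are simplified and purely illustrative.

```python
import numpy as np
from collections import Counter

def sppmi_matrix(pairs, vocab, k=1.5):
    """Build a |V| x |V| SPPMI matrix from (word, context_words) pairs."""
    V = len(vocab)
    word_counts = Counter()   # #(w): occurrences of w as the center word
    ctx_counts = Counter()    # #(q_n): occurrences of q_n as a context word
    pair_counts = Counter()   # #(w, q_n): co-occurrence counts
    total = 0                 # |D|: total number of (word, context word) pairs
    for w, q in pairs:
        for c in q:
            if w in vocab and c in vocab:
                word_counts[w] += 1
                ctx_counts[c] += 1
                pair_counts[(w, c)] += 1
                total += 1

    M = np.zeros((V, V))
    for (w, c), n_wc in pair_counts.items():
        pmi = np.log(n_wc * total / (word_counts[w] * ctx_counts[c]))
        M[vocab[w], vocab[c]] = max(pmi - np.log(k), 0.0)
    return M
```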

Each word vector here roughly describes how likely that word is to co-occur with each other vocabulary word in the same neighborhood. While this representation is very intuitive, one drawback of the PMI approach is that each dimension of the resulting square matrix is the size of the vocabulary, which can be quite large. Given the limited diversity of the data in this exploration, we cap the vocabulary at the 8000 most frequently occurring words over the phrases.

The word vectors in a context window are aggregated by averaging them, weighting each word vector by its inverse document frequency (idf). The idf weights encode the importance of a word across the observed documents: words that occur very frequently (e.g., in every document) should have reduced influence over our context representation. We found that computing idf weights over the original phrases (not the context windows) performed best.
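A minimal sketch of this aggregation step, assuming the SPPMI matrix M and vocab from the previous sketch and a precomputed idf dictionary (all names illustrative):

```python
import numpy as np

def context_vector(context_words, M, vocab, idf):
    """idf-weighted average of the word vectors (rows of M) for one context window."""
    vecs, weights = [], []
    for w in context_words:
        if w in vocab and w in idf:
            vecs.append(M[vocab[w]])
            weights.append(idf[w])
    if not vecs:
        return np.zeros(M.shape[1])
    return np.average(np.vstack(vecs), axis=0, weights=weights)
```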

We measure the similarity between context vectors using cosine similarity, which depends only on the angle between the vectors and ignores their magnitudes.
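Concretely, for two context vectors u and v, cos(u, v) = u^T v / (||u||_2 ||v||_2).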

The process discussed above is just one of several ways to learn vector representations for words and their contexts:

  • We could choose a one-hot encoding for the word vectors, where each word is represented by a 1-sparse vector with its non-zero element at a unique word index. Aggregating these with idf weights as above would be equivalent to forming tf-idf vectors of unigrams for each context window. In this case, it might be wise to learn and apply a latent semantic analysis (LSA) mapping so that correlations between like terms are represented (a minimal sketch follows this list).
  • Another popular choice would be Word2Vec vectors (Mikolov et al. 2013). Techniques for learning these vectors have been shown to be related to the factorization of an SPPMI matrix into word and context matrices (Levy and Goldberg, 2014). Jointly learning word and aggregated context vectors has also been explored in (Le and Mikolov, 2014, Gensim’s Doc2Vec).
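Here is a rough sketch of that first alternative: tf-idf unigram vectors for each context window followed by an LSA (truncated SVD) mapping. The scikit-learn pipeline and parameter values are illustrative, not what we used.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import make_pipeline

# context_windows: a list with one whitespace-joined string per context window
lsa = make_pipeline(
    TfidfVectorizer(max_features=8000),
    TruncatedSVD(n_components=100),
)
context_vectors = lsa.fit_transform(context_windows)
```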

Sanity checks

Given how simple it is to learn the context vectors above, let’s check that the information we want to encode is somehow embedded there.

Sanity Check 1: Datetimes similar to “unavailable”

To test our context representation, we list the datetimes whose context vectors are closest to the word vector for "unavailable." The top matches (on a random subset of documents) are shown in order, starting from the closest. The blue datetime indicates which context is being matched in the ordering.
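A minimal sketch of this query, assuming M and vocab from the SPPMI sketch and a list datetime_contexts of (phrase, context_vector) pairs for each DATETIME occurrence (all names illustrative):

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

unavailable_vec = M[vocab["unavailable"]]

ranked = sorted(datetime_contexts,
                key=lambda item: cosine(item[1], unavailable_vec),
                reverse=True)

for phrase, _ in ranked[:20]:   # closest DATETIME contexts first
    print(phrase)
```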

Datetimes in blue have context vectors close to “unavailable.”

As you can see, the closest matches include the word “unavailable” but other phrases use novel words to describe unavailability such as “booked,” “tied up,” “OOO,” “starting on,” and “traveling” among others.

This is quite remarkable, as the word "unavailable" occurs in only 0.1% of the phrases used (for reference, datetimes occur in 63% of them). This implies that co-occurrence has a strong correlation with meaning in our task, even over small context windows within the same phrases.

Sanity Check 2: Datetimes with different contexts

We also collect phrases that contain two datetimes. We then choose the phrases that have at least one datetime whose context is close to "unavailable." Of these phrases, we sort by the similarity between the datetimes within each phrase, showing dissimilar datetime pairs first. In the results shown below, the red datetimes denote the token that was closest to "unavailable" and the green datetimes are then automatically inferred.
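A sketch of this filtering-and-sorting step, reusing the cosine helper and unavailable_vec from the previous sketch; phrases_with_two_datetimes and the closeness threshold are illustrative assumptions.

```python
THRESHOLD = 0.5  # hypothetical cutoff for "close to 'unavailable'"

candidates = []
for phrase, dt_vecs in phrases_with_two_datetimes:   # dt_vecs: the two datetime context vectors
    if max(cosine(v, unavailable_vec) for v in dt_vecs) >= THRESHOLD:
        intra = cosine(dt_vecs[0], dt_vecs[1])       # similarity between the two datetimes
        candidates.append((intra, phrase))

# dissimilar datetime pairs first
for intra, phrase in sorted(candidates):
    print(round(intra, 2), phrase)
```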

Datetimes in red have context vectors close to “unavailable”; datetimes in green are dissimilar from the unavailable red datetime of the same phrase.

This produces some nice examples of the kinds of datetime meanings we discussed at the outset. In general, phrases in this dataset with two or more datetimes do not contain datetimes with opposite meanings, but this technique surfaced typical cases that do.

The examples in the image are not the top-scoring results in the whole dataset, but rather a subset chosen to highlight diverse and interesting cases. Some of the examples presented above are relatively low scoring, leading to incorrect color assignments.

Sense vectors

Clustering on the unit sphere takes a large set of normalized context vectors and learns a smaller set of sense vectors.

Finally, we cluster the context vectors to learn sense vectors for datetimes. The sense vectors are the cluster centers and represent the average contexts in which datetimes are found. This can also be thought of as a simple parametric topic modeling of token contexts.

Since we're comparing vectors using cosine similarity, it makes sense to use this measure in the clustering algorithm as well. Cosine similarity ignores vector magnitudes; by normalizing the context vectors, we're dealing with points on the surface of the unit sphere. Taking a page from (Reisinger and Mooney, 2010, Batmanghelich et al., 2016), we cluster the contexts using both a mixture of von Mises Fisher distributions and spherical k-means. While both algorithms are tailored to data on the sphere, we found the former performed better since it takes advantage of the different cluster concentrations. We're making these clustering routines available (see the last section for more details).
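A minimal sketch of this clustering step using the spherecluster package described in the last section; it assumes the package's scikit-learn-style interface, and context_vectors stands in for the per-datetime context vectors produced with the aggregation helper sketched earlier.

```python
import numpy as np
from spherecluster import VonMisesFisherMixture

X = np.asarray(context_vectors)
X = X / np.linalg.norm(X, axis=1, keepdims=True)   # project contexts onto the unit sphere

movmf = VonMisesFisherMixture(n_clusters=200, posterior_type='soft')
movmf.fit(X)

sense_vectors = movmf.cluster_centers_   # one sense vector (mean direction) per cluster
```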

Results of the simple unsupervised classifier. Datetimes are labelled red if they are closest to one of the 15 “negative” clusters and green otherwise.

The results shown here were obtained with 200 clusters. We eyeballed the clusters corresponding to negative senses; there were roughly 15 of them. The datetimes were then automatically labelled by assigning the color red if a datetime's context is closest to a negative cluster (sense) and green otherwise.
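A sketch of this labeling rule, building on the previous sketch; the negative cluster indices below are placeholders for the roughly 15 clusters we identified by hand.

```python
import numpy as np

negative_clusters = {3, 17, 42}   # placeholder indices; roughly 15 in practice

def label_datetime(context_vec, sense_vectors, negative_clusters):
    """Return 'red' if the context's nearest sense vector is a negative sense."""
    v = context_vec / (np.linalg.norm(context_vec) + 1e-12)
    sims = sense_vectors @ v              # cosine similarity (sense vectors are unit-norm)
    nearest = int(np.argmax(sims))
    return "red" if nearest in negative_clusters else "green"
```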

Discussion

The technique above enabled us to very quickly get a simple classifier up and running (with a relatively small dataset).

There are a few drawbacks to this approach. First, while the SPPMI matrix is very fast to train, it can be quite memory-intensive (as mentioned above). Second, because we don't consider the sequential structure of the text, the context representation performs poorly on conditional expressions like "If this week doesn't work." Third, the clustering approach is parametric in that the number of senses must be chosen in advance. We point the interested reader to other work in (Moody, 2016, Batmanghelich et al., 2016) that jointly estimates word vectors and topic models, as well as (Arora et al., 2016), which learns "atoms of discourse" to expose word vector senses.

Digging a little deeper, it’s clear that certain clauses require outside information to be interpreted correctly. For example, the admissibility of “I will be in LA tomorrow” cannot be resolved without knowing where the meeting is meant to take place or how (is it a call?). Carefully tracking and building logic around these kinds of meeting parameters and integrating them with customer preferences make up the bulk of Clara’s perceived “intelligence.”

Clara is growing

We’re excited to grow our platform and team. Find out more about open roles here: https://jobs.lever.co/claralabs

Learn more about our CEO Maran Nelson here: https://www.forbes.com/sites/clareoconnor/2017/04/18/clara-labs-wants-to-save-your-from-your-inbox-with-cyborg-assistants/

and read about our recent $7M series A announcement here: https://techcrunch.com/2017/07/19/claraseriesa/

A: Clustering on the unit hypersphere

It is quite common to work with normalized vectors in machine learning tasks. I was surprised to learn that there are not many out-of-the-box clustering packages available in scikit-learn that take this constraint into consideration, so we're open-sourcing a set of tools to do this here: https://github.com/clara-labs/spherecluster.

The package provides the algorithms found in (Banerjee et al., 2005):

Spherical K-means

Spherical K-means differs from conventional K-means in that it projects the estimated cluster centroids onto the unit sphere at the end of each maximization step (i.e., normalizes the centroids).

This trivial modification to K-means can sometimes lead to significant performance gains. We recommend trying this along with your standard clustering tools.
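A minimal usage sketch, assuming the package's scikit-learn-style interface; the data here is random and only for illustration:

```python
import numpy as np
from spherecluster import SphericalKMeans

X = np.random.randn(1000, 50)
X = X / np.linalg.norm(X, axis=1, keepdims=True)   # points on the unit sphere

skm = SphericalKMeans(n_clusters=5)
skm.fit(X)

print(skm.cluster_centers_.shape)   # (5, 50); centroids are unit-norm
print(skm.labels_[:10])             # hard cluster assignments
```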

Mixture of von Mises Fisher distributions (‘soft’ and ‘hard’ variations)

Much like the Gaussian distribution is parameterized by mean and variance, the von Mises Fisher distribution has a mean direction mu and a concentration parameter kappa. Each point x_i drawn from the vMF distribution lives on the surface of the unit hypersphere S^{N-1} (i.e., ||x_i||_2 = 1) as does the mean direction ||mu||_2 = 1. Larger kappa leads to a more concentrated cluster of points.
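For reference, the vMF density is f(x; mu, kappa) = C_N(kappa) exp(kappa mu^T x), where the normalizing constant C_N(kappa) involves a modified Bessel function of the first kind.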

If we model our data as a mixture of von Mises Fisher distributions, we have an additional weight parameter alpha for each distribution in the mixture. The movMF algorithms estimate the mixture parameters via expectation-maximization (EM) enabling us to cluster data accordingly.
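A usage sketch for the mixture model, with X as in the previous snippet; attribute names assume the package's scikit-learn-style conventions.

```python
from spherecluster import VonMisesFisherMixture

# 'soft' uses full EM posteriors; 'hard' uses hard assignments in the E-step
vmf_soft = VonMisesFisherMixture(n_clusters=5, posterior_type='soft')
vmf_soft.fit(X)

vmf_hard = VonMisesFisherMixture(n_clusters=5, posterior_type='hard')
vmf_hard.fit(X)

print(vmf_soft.cluster_centers_)    # mean directions mu (unit-norm rows)
print(vmf_soft.concentrations_)     # per-cluster concentrations kappa
print(vmf_soft.weights_)            # mixture weights alpha
```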

We'd love to hear how you're using this in your applications, and please feel free to contribute to the project!


Jason Laska

Machine learning at @ClaraLabs. Editor at Rejecta Mathematica.