Cutting ourselves some Slack

Jacopo Tagliabue
Tooso
Jan 20, 2017

Playing around with Slack, threads and search engines

Intro

At Tooso, we love Slack so much that we rarely exchange e-mails these days. Having seen different types of companies from the inside, the huge benefits of Slack have been clear to us from the beginning: it allows constant information sharing within the team, simplifies planning, enables quick decisions and, last but not least, it is the perfect tool to foster our company culture and make everybody feel part of the bigger whole to which we all contribute. For a small but distributed startup like ours, that is both indispensable and challenging.

There is something that has always felt a bit lacking, though: Slack's search functionality. With so much back-and-forth and sharing, it is often important to retrieve information from the search bar, but unfortunately the results are rarely Google-like. So, when we got the chance to play around with a Slack dataset, we used some NLP tricks to build a proof of concept (POC) for a better Slack search engine.

In particular, our idea was to exploit the conversational nature of the medium — i.e. the division of long discussions into threads — to improve the quality of Slack’s search bar.

While we had to do our project with some manual effort (see below), this week's news is that threads are finally an official Slack feature, making our preliminary results even more relevant (and easier to generalize). If you would like to know more about our NLP experiments in "the wild" of a real Slack dataset, read on (if you lack the time, patience or romantic appreciation for a good story, ping us for the more scholarly version or check out our slides for AI With the Best 2016).

Disclaimer

This research work was originally completed at the end of 2016 with Tooso co-founder Ciro Greco and Katherine Yoshida (Foursquare), while I was still Lead Data Scientist at Axon Vibe (a super cool company you should definitely check out). We warmly thank Axon Vibe for the support provided. Finally, Ciro and Kathy share all the merits and none of the errors on this page.

Information Retrieval 101

“In theory there is no difference between theory and practice. In practice there is.”, Yogi Berra

Before digging into our little experiment, let us provide a bit of background for readers not familiar with information retrieval (IR) in general: there are very few, very simple formulas, and we strive for simplicity, so all readers should be able to get the main points of the section. The NLP-savvy can safely skip the next two sections and just know this (spoiler alert): we will be using threads (i.e. clusters of logically related messages) to improve smoothing in a language-modeling-based IR system. And it works.

Search in (vector) space

We all know what IR is: you likely used an IR system, Google, to reach this page. But of course, Google is just one case among countless others: where there is a collection of documents (lyrics, web pages, chats, emails), there is room for IR. Let’s start with some basic definitions:

  • D is a collection of documents (lyrics, web pages, chats, emails, whatever); we will refer to any document in the collection as d. Any document d is made of words, designated w1, w2, …, wn.
  • q is any query we make to the system to retrieve information; q is made of one or more terms, designated t1, t2, …, tn.

So if D is the collection of Shakespeare’s works, each d will be a play. When we search for famous phrases, we’ll have that (for example):

q = “to be or not to be”

t1= “to”, t2= “be”, t3= “or”, t4= “not”, t5= “to”, t6= “be”

Ideally, any IR system that deserves the name should retrieve Hamlet first. How would we build a system capable of such a thing? A quick way to frame the problem (in fact, the classic way to do so) is to think of IR as a similarity problem: given all the documents in D and a query q, we rank the documents according to how similar each of them is to q. Since documents and queries have natural representations as vectors in a space of words, we can just measure the distance between the d-vectors and the q-vector! There are several little tricks needed to make this really work (TF-IDF transformations, exploring new metrics if you are sick of hearing 'cosine similarity', etc.), but in the end it is all pretty straightforward: if you are interested in learning some of these tricks, gensim's tutorial on text similarity exploits exactly this intuition.
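To make the vector space intuition concrete, here is a minimal sketch (using scikit-learn as a stand-in for the gensim tutorial above; the toy Shakespeare corpus is our own illustrative choice): documents and the query become TF-IDF vectors, and ranking is just cosine similarity.

```python
# Minimal vector space retrieval: TF-IDF vectors + cosine similarity.
# Toy sketch only; scikit-learn here is a stand-in for gensim.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "to be or not to be that is the question",       # Hamlet
    "now is the winter of our discontent",           # Richard III
    "all the world is a stage and the men and women merely players",
]

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(documents)    # one TF-IDF vector per d in D
query_vector = vectorizer.transform(["to be or not to be"])

# Rank documents by cosine similarity between the q-vector and each d-vector.
scores = cosine_similarity(query_vector, doc_vectors).ravel()
for i in scores.argsort()[::-1]:
    print(f"{scores[i]:.3f}  {documents[i]}")
```

Run it and, as hoped, the Hamlet line comes out on top.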

Interlude: in case you wonder if this simple idea is actually worth your time, consider that any website using the open-source engine Lucene in any of its enterprise incarnations (like Solr, or Elasticsearch) is still based on the vector space model. Yeah, that may explain why nothing ever works outside of Google (and why we founded Tooso, but that’s another story).

IR: a new hope

In more recent years, different approaches to IR have been put forward to overcome the limitations of the vector space model. In the language modeling approach, we treat the relation between each d in D and q as a probability. Before the boring details, let's start with an analogy. Suppose our D is made up of lyrics by famous artists: when we receive a query q, we could rank our lyrics by how likely the artist of each lyric would be to write q. So, for example, "you do the kind of stuff / that only Prince would sing about" is much more likely to come from the Bloodhound Gang than from Justin Bieber, "My heart is drenched in wine / But you'll be on my mind / Forever" is classic Norah Jones, not Eminem, and so forth. Now back to the general case: we think of each document as being generated by a different probabilistic model; when we receive the user query q, we ask ourselves: "which document is most likely to have generated q?". Given a document d, its model will be M_d, and the probability of q given M_d is the following:

P(q | M_d) = P(t_1 | M_d) × P(t_2 | M_d) × … × P(t_n | M_d), with P(t | M_d) = count(t, d) / |d|

Naive language model (without smoothing)

In other words, the probability of q is just the product of the individual probabilities of the terms that make up q. In turn, the probability of a term in d is just a frequency: how many times t occurs in d, divided by how many words d contains. All done? Not quite, as the product above has a big problem: whenever a term in the query is not in the document, the final probability becomes zero. In our music example, if you misheard Jimi Hendrix singing

“Excuse me while I kiss this guy”

and search for that, nothing will come up, as no document contains all those words. In particular, Purple Haze does not contain "guy", so it will receive a total probability of zero according to the formula above. So how do we fix this? We need a plan B for our calculations when there is no available frequency: this is called "smoothing", and it looks like the second term in the formula below:

P(q | M_d) = ∏_{i=1..n} [ λ · P(t_i | M_d) + (1 − λ) · P(t_i | C) ]

Language model with corpus-based smoothing

What the formula is now saying is that the probability that d generates our query is the sum of two distinct probabilities (weighted by the lambda parameter): the probability given the document and the general probability over the entire collection C. In this way, even if "guy" is not in Purple Haze, some other document will contain it, thus providing a non-zero probability for it.
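For the curious, here is a rough Python sketch of the smoothed scoring above; this is illustrative toy code, not our actual implementation, with lam playing the role of the lambda parameter (we work in log space to avoid numerical underflow):

```python
# Query likelihood with corpus-based smoothing: a toy sketch, not our real code.
import math
from collections import Counter

def score(query_terms, doc_terms, corpus_counts, corpus_size, lam=0.5):
    """Log of P(q | M_d) with corpus-based smoothing (logs avoid underflow)."""
    doc_counts = Counter(doc_terms)
    doc_len = len(doc_terms)
    log_prob = 0.0
    for t in query_terms:
        p_doc = doc_counts[t] / doc_len if doc_len else 0.0   # P(t | M_d)
        p_corpus = corpus_counts[t] / corpus_size             # P(t | C)
        log_prob += math.log(lam * p_doc + (1 - lam) * p_corpus)
    return log_prob

# Toy data: every query term occurs somewhere in the corpus, so P(t | C) > 0.
docs = {
    "purple_haze": "excuse me while i kiss the sky".split(),
    "other_song":  "this guy kissed me and waved goodbye".split(),
}
corpus = [t for terms in docs.values() for t in terms]
corpus_counts, corpus_size = Counter(corpus), len(corpus)

query = "excuse me while i kiss this guy".split()
for name, terms in docs.items():
    print(name, round(score(query, terms, corpus_counts, corpus_size), 3))
```

Thanks to smoothing, Purple Haze now gets a finite (and winning) score even though it never mentions a "guy".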

Congratulations, you’ve just survived 30 years of information retrieval!

Interlude: if you want to know more about IR, this is an excellent book. On NLP in general, this classic is a bit outdated on some topics, but we’re just too sentimental to ignore it.

Language modeling with document expansion

A different way to improve smoothing is document expansion: intuitively, we get a better estimate of the "true" language model of d by using information from a cluster of documents that are, in some sense, similar or relevant to d. We manually annotated a subset of our Slack data dump, clustering the uninterrupted stream of messages into threads based on their content and the flow of conversation; the result is a dataset of messages, each with a corresponding thread id tag.
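Concretely, the tagged dump can be grouped by thread id along these lines (a toy sketch; the field names are our own illustrative choice, not the actual schema of our dump):

```python
# Group tagged messages into threads; field names are illustrative only.
from collections import defaultdict

messages = [
    {"id": "m1", "thread_id": "t1", "text": "should we ship on friday"},
    {"id": "m2", "thread_id": "t1", "text": "friday is risky lets ship monday"},
    {"id": "m3", "thread_id": "t2", "text": "lunch at noon anyone"},
]

# Collect all the terms of each thread: this is the T we smooth against.
threads = defaultdict(list)
for m in messages:
    threads[m["thread_id"]].extend(m["text"].split())

print(threads["t1"])  # all the words uttered in thread t1
```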

So, now that we have threads, we can plug them into our language model formula!

In particular, messages in the same thread will likely share a lot of semantic similarity and relevant context and so they can provide effective smoothing:

P(q | M_d) = ∏_{i=1..n} [ λ · P(t_i | M_d) + (1 − λ) · P(t_i | T) ]

Language model with thread-based smoothing

What this final formula says is that the probability that d generates our query is the sum of two distinct probabilities (weighted by lambda, as usual): the probability given the document and the general probability over all the messages in the thread T to which d belongs.
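Code-wise, the change from corpus-based smoothing is tiny: under the same toy assumptions as the earlier sketch, the background counts now come from the thread of d rather than from the whole collection:

```python
# Thread-based smoothing: same toy scoring as before, but the background
# distribution is built from d's thread T instead of the whole corpus C.
import math
from collections import Counter

def score_with_thread(query_terms, doc_terms, thread_terms, lam=0.5):
    """Log of P(q | M_d), smoothing with the thread T to which d belongs."""
    doc_counts, thread_counts = Counter(doc_terms), Counter(thread_terms)
    doc_len, thread_len = len(doc_terms), len(thread_terms)
    log_prob = 0.0
    for t in query_terms:
        p_doc = doc_counts[t] / doc_len if doc_len else 0.0              # P(t | M_d)
        p_thread = thread_counts[t] / thread_len if thread_len else 0.0  # P(t | T)
        # Note: if t appears in neither d nor its thread, the mixture is zero;
        # a real system would also fall back on the whole corpus.
        log_prob += math.log(lam * p_doc + (1 - lam) * p_thread)
    return log_prob
```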

Improving Slack search with threads

“98% of statistics are made up.”, Anonymous Data Scientist

Our tagged dataset contains a total of 6236 tagged messages (almost 5 times bigger than the industry standard), 45 users and 654 threads; threads have a median length of 5 messages (min: 1, max: 129, mean: 9.54). To test our intuition, we implemented three different search algorithms:

  1. a baseline frequency-based search model
  2. a language model search with the entire corpus as background model
  3. a language model search with thread-based document expansion.

Our tech stack is MongoDB, Python, NLTK (for quick implementations of NLP functions), Flask and Redis. Since there is no search history API available in Slack (yet), we filtered n-grams in the tagged dataset and manually selected reasonable queries (N=30) among those with frequencies above a preset threshold, based on domain-specific knowledge. As our metric we chose "Precision at K", which ranges from 0.0 to 1.0: how many of the top K items in the result set are relevant given the query?
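For completeness, Precision at K fits in a couple of lines; a sketch, assuming relevance judgments are available as a set of relevant message ids per query:

```python
# Precision@K: fraction of the top-K retrieved items that are judged relevant.
def precision_at_k(ranked_ids, relevant_ids, k):
    """ranked_ids: results in rank order; relevant_ids: judged-relevant set."""
    top_k = ranked_ids[:k]
    return sum(1 for doc_id in top_k if doc_id in relevant_ids) / k

# e.g. two of the top three results are relevant -> P@3 = 0.667
print(round(precision_at_k(["m1", "m7", "m3"], {"m1", "m3"}, k=3), 3))
```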

We report our results in the table for K = 3, 5, 10:

Precision@K for three search algorithms

Notably, the thread model always performs best and the baseline model always worst. So, problem solved?

Not really. While we think these are very encouraging numbers, warranting some deeper analysis, our experiment is far from being conclusive.

Our test sample is very small (N=30), but this is partly due to the time-consuming nature of evaluation (worsened by the confidential nature of the data: only certain people are both allowed and qualified to evaluate search results). Luckily for us, thanks to the new Slack threads feature, everybody will soon be able to replicate our experiments with human-made threads, without any manual tagging (or fancy clustering effort).

Search like no man searched before

“We can only see a short distance ahead, but we can see plenty there that needs to be done.”, Alan Turing (not necessarily about this work)

As for us NLP nerds, what are the next steps? Surely we would like to run some more comparisons: the same intuition could be used to improve other IR paradigms, for example the vector space model we explained above. Furthermore, to the curious eyes of language lovers, Slack contains lots of interesting linguistic data: for example, Slack allows emoji reactions to messages. What can be learned from how people use emojis? Finally, Slack naturally embeds a social network through its chat-like nature: by combining NLP and network analysis, we could study emerging behavioral patterns and shed light on personal dynamics within an organization (update: we kinda started doing that here).

Our own experience, as well as abundant if anecdotal Internet evidence, suggests that Slack is playing a crucial role in the life of the companies adopting it. If we add Slack's growth rate to the picture, it is clear that any NLP advance in this (or a related) setting could have a massive real-life impact.

See you, space cowboys

For comments and feedback, please feel free to reach out directly at jacopo.tagliabue@tooso.ai.

For updates on Tooso, follow us on Twitter.
