The Suggested Reply Feature

Tyler Lanigan
Hootsuite Engineering
9 min read · Apr 27, 2020

The Machine Learning team at Hootsuite released the Suggested Reply feature to a small number of our customers (~200) on April 8th, 2020. This post details our design process for the system, as well as the final system architecture and preliminary usage metrics.

The Suggested Reply feature is used in Inbox in Hootsuite. When our customers receive messages from a social network, they can respond to them in Inbox. Often they use the same responses for many messages, and a major pain point was storing and accessing these responses. Customers would use an external tool like Google Docs for this, which is an awkward workflow, so our Engage team built a feature that allows them to save and reuse those messages. But what happens if an organization has hundreds of these saved messages? It is inconvenient to scroll through the list to find the message you are looking for each time. Our team decided to augment the Saved Reply feature by using machine learning to sort the list so that the most relevant messages are shown first. Figure 1 below shows what this looks like:

Figure 1: Suggested Reply UI Design

Final System Architecture

The following diagram in Figure 2 shows the flow from start to finish for our Suggested Reply system:

Figure 2: The Suggested Reply System Architecture

Let’s walk through what is happening:

  • Users select a saved reply on the front end. This causes a CommonResponseSelected event to be emitted onto our local Kafka-based event bus.
  • An ingestion service picks this event up, and dumps it into an S3 bucket. This allows us to persist this data and use it for multiple projects. It also allows us to combine it with other data that is present in the data lake.
  • A processing service built using Spark’s Structured Streaming engine detects that new data has been added to the data lake. It processes that data, filtering it and turning it into a form that is more suitable for ingestion into a machine learning algorithm, then writes it to a processed bucket (a minimal sketch of this step follows the list). Having a streaming service means that we can retrain the model at any time, as new data is constantly being added to the production folder. It also means that we save a lot of compute time, since we never reprocess large batches and duplicate work already done. Using Spark allows this processing operation to scale. As this feature is available to all of our customers using Inbox, we expect traffic to increase significantly when Saved Replies is released more broadly.
  • The processed folder is then ingested into Sagemaker to train the machine learning algorithm.
  • The trained models are saved to S3, making it easy to revert to a previous version if something goes wrong in production.
  • We load the saved models into an endpoint hosted on Sagemaker infrastructure. This endpoint is exposed to the customer through our GraphQL layer. The GraphQL layer also calls a few other services to fill in the data requirements for our client on the front end.
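As a rough illustration of the processing step, a Structured Streaming job of the shape below could pick up newly landed events and write cleaned records to the processed bucket. The event schema, bucket paths, and transformations here are illustrative placeholders, not our production code.

```python
# Minimal sketch of the streaming processing step, assuming a hypothetical
# CommonResponseSelected event schema and example S3 paths.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("suggested-reply-processing").getOrCreate()

# Assumed shape of an event landed in the data lake by the ingestion service.
event_schema = StructType([
    StructField("organizationId", StringType()),
    StructField("incomingMessageText", StringType()),
    StructField("selectedReplyId", StringType()),
    StructField("timestamp", TimestampType()),
])

raw = (
    spark.readStream
    .schema(event_schema)
    .json("s3://example-data-lake/raw/common-response-selected/")  # hypothetical path
)

# Light filtering and cleaning so the data is ready for model training downstream.
processed = (
    raw.filter(F.col("incomingMessageText").isNotNull())
       .withColumn("text", F.lower(F.trim(F.col("incomingMessageText"))))
       .select("organizationId", "text", "selectedReplyId", "timestamp")
)

query = (
    processed.writeStream
    .format("parquet")
    .option("path", "s3://example-data-lake/processed/suggested-reply/")          # hypothetical
    .option("checkpointLocation", "s3://example-data-lake/checkpoints/suggested-reply/")
    .start()
)
```

Because the job streams rather than batch-processes, only newly arrived events are transformed, which is where the compute savings mentioned above come from.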

Creating a Labeled Dataset out of Thin Air

When we first started this project, we didn’t have any training data coming in because the system had not yet been built. Having an initial set of data is important for experimentation so we can determine whether the machine learning algorithms can make the predictions we are asking of them. It also helps with the design of the system: for example, do we want a two-model system, such as Google’s Smart Reply, or would one model be sufficient? It also helps us answer important experimental questions, like how we want to threshold the model or how we are going to handle imbalanced classes.

Our theory was that we could create our own experimental dataset from our existing data. At this point, we had stored the incoming and outgoing messages for a given organization in our data lake. We just needed a way to figure out which outgoing messages were common responses used all the time, and which were infrequent one-off answers. We theorized that we could find this using clustering: if we ran a clustering algorithm (e.g., DBSCAN) on the data, answers that are very similar would be grouped into larger clusters. We could then inspect these clusters and judge whether they represented template answers. If they did, we could assign a label to any point within a certain distance of the cluster’s centroid. We took this approach because answers were often templates with small details changed in the final version. For example, the response:

Thanks Tyler, I hope my service today was helpful and that all your needs were met.

is the same response as:

Thanks Sam, I hope my service today was helpful and that all your needs were met.

with just the name changed.

So the whole process we used to create our labeled dataset was as follows:

  1. After performing some initial cleaning steps, we ran the data through a clustering algorithm. We used DBSCAN, but others, such as KMeans, would also work. We used the Levenshtein distance between points when running DBSCAN.
  2. Evaluate each cluster using the average fuzz ratio to the cluster centroid as a distance measure. Clusters that scored above a certain threshold were deemed acceptable for our evaluation.
  3. Look at each passing cluster and assign it a template. If a cluster contained names, tokenize them (e.g., replace them with a <name> token).
  4. Run through all the data points in a labelled cluster: if they had a high enough fuzz score against the template (above 90), automatically assign them the label. If they were below a lower threshold (e.g., 70), don’t assign them a label.
  5. Data points that fell between these thresholds were hand labelled using AWS Ground Truth.

In this manner we only needed to hand label ~5000 data points and ended up with a dataset of over 100,000!
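A stripped-down sketch of this labelling pipeline is shown below. The sample replies, distance thresholds, and library choices (python-Levenshtein for distances, thefuzz for the fuzz ratio, scikit-learn’s DBSCAN) are illustrative; in our process a human reviews each passing cluster before it becomes a template.

```python
# Rough sketch of the weak-labelling approach described above; data and
# thresholds are illustrative, not the production values.
import numpy as np
import Levenshtein                      # python-Levenshtein
from sklearn.cluster import DBSCAN
from thefuzz import fuzz                # formerly fuzzywuzzy

replies = [
    "Thanks Tyler, I hope my service today was helpful.",
    "Thanks Sam, I hope my service today was helpful.",
    "Thanks Ana, I hope my service today was helpful.",
    "Your order #1234 has shipped and should arrive in 5 business days.",
    "Your order #9876 has shipped and should arrive in 5 business days.",
    "Sure, let me look into that for you right away!",
]

# Pairwise Levenshtein distances, fed to DBSCAN as a precomputed metric.
n = len(replies)
dist = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        dist[i, j] = dist[j, i] = Levenshtein.distance(replies[i], replies[j])

clusters = DBSCAN(eps=10, min_samples=2, metric="precomputed").fit_predict(dist)

# For each cluster, pick a representative template (reviewed by a human in our
# process) and label members by their fuzz ratio against it.
labels = {}
for cluster_id in set(clusters) - {-1}:            # -1 is DBSCAN's noise label
    members = [i for i in range(n) if clusters[i] == cluster_id]
    template = replies[members[0]]
    for i in members:
        score = fuzz.ratio(replies[i], template)
        if score > 90:
            labels[i] = cluster_id                 # confident: auto-label
        elif score < 70:
            continue                               # too different: leave unlabelled
        else:
            labels[i] = "send_to_ground_truth"     # middle band: hand label

print(clusters, labels)
```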

BERT vs TFIDF Embeddings

With our experimental dataset in hand, we set out to figure out which machine learning techniques would produce the highest accuracy predictions.

We knew that we wanted our system to have a single model trained for each organization; it wouldn’t be desirable to have one organization’s saved replies recommended for a different organization. With that in mind, we knew our resulting models needed to be small, so training a large network for each organization would not have been feasible. Instead, we could take advantage of a large pre-trained model by using it as a general embedding model for everything, then training smaller classification layers for each organization.

We started by building a baseline model using a simple approach: TFIDF features and a logistic regression classifier. We then set out to beat it with a more state-of-the-art approach. We chose BERT because it was getting amazing results and there was already wide support from the machine learning community, which meant tooling was quickly becoming available.
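The baseline amounts to a very small scikit-learn pipeline trained per organization, along the lines of the sketch below. The toy messages and reply IDs are made up, and in the real experiments accuracy was measured on a held-out test set rather than on the training data.

```python
# Minimal sketch of the baseline: TFIDF features + logistic regression,
# trained on one organization's data. The examples are illustrative.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Incoming messages and the saved reply each was answered with (the class label).
incoming_messages = [
    "Where is my order?",
    "My package still hasn't arrived",
    "How do I reset my password?",
    "I can't log in to my account",
]
reply_ids = ["shipping_reply", "shipping_reply", "account_reply", "account_reply"]

baseline = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
    ("clf", LogisticRegression(max_iter=1000)),
])

baseline.fit(incoming_messages, reply_ids)
print(baseline.predict(["my order never showed up"]))
```

Swapping in BERT only changes the featurization step: the classifier on top stays just as small, which is what keeps the per-organization models cheap.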

The results surprised us. Table 1 details an experiment in which we measured the accuracy of each model. A score of 0 means the model got none of the answers correct on a held-out test set, and a score of 1 means it correctly labeled all the data points. The models compared are:

  • our baseline model — ta_baseline_model
  • BERT embeddings with a logistic regression classifier — ta_bert_model
  • BERT embeddings with a multi-layer perceptron (neural net) classifier — ta_bert_mlp

Table 1: Experimental Results for Baseline Model and BERT

For our dataset, we did not see a large improvement from BERT; in some cases we actually saw lower accuracy. This does not mean that BERT would not perform better on real production data. However, given this evidence and the fact that implementing BERT in production would add another layer of complexity, we decided to go with the simpler model for our first version.

Class Imbalance

Another issue that we ran into in our first experiments was a severe class imbalance. Most NLP problems suffer from this affliction, where a single class is represented far more often than the other classes in the dataset; in our case, one template response is used far more than the others. To see why this is undesirable, suppose a dataset has 90% of its labels in one class, 7% in another, and 3% in the last. A classifier could learn to only ever predict the over-represented class and still get a fairly high accuracy score on a held-out test set, even though it isn’t really making predictions based on its input. It’s only ever predicting one class!

There are many ways to get around this; see this excellent blog post for more details. Our team ended up downsampling the majority class and evaluating with an F1 metric to overcome the problem.
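The sketch below illustrates both sides of this: why accuracy is misleading on the 90/7/3 split described above, and one simple downsampling variant (here every class is shrunk to the size of the smallest one). The column names and labels are illustrative.

```python
# Minimal sketch of the imbalance problem and a simple mitigation.
import pandas as pd
from sklearn.metrics import accuracy_score, f1_score
from sklearn.utils import resample

df = pd.DataFrame({
    "text": ["some incoming message"] * 100,
    "label": ["reply_a"] * 90 + ["reply_b"] * 7 + ["reply_c"] * 3,
})

# A classifier that only ever predicts the majority class looks good on
# accuracy but terrible on macro F1, which is why we evaluate with F1.
always_majority = ["reply_a"] * len(df)
print("accuracy:", accuracy_score(df["label"], always_majority))                 # 0.90
print("macro F1:", f1_score(df["label"], always_majority, average="macro"))      # ~0.32

# Downsampling: shrink every class to the size of the smallest one before training.
smallest = df["label"].value_counts().min()
balanced = pd.concat(
    resample(group, replace=False, n_samples=smallest, random_state=0)
    for _, group in df.groupby("label")
)
print(balanced["label"].value_counts())
```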

We also looked into using the SMOTE algorithm; however, its performance cost would not have been acceptable in production.

Thresholding

When we deployed this system, we wanted it to automatically determine whether the model should be active for a given organization. My colleague wrote a more detailed blog post on what we ended up implementing; if you are curious, check it out here.

Metrics

Metrics for our service are displayed in Interana. We track the acceptance ratio (the number of times our suggestion is accepted divided by the number of times it is shown) daily, and get alerted if it ever falls below a certain value. As of now the ratio is hovering between 50% and 70%, a number we hope to improve in the future. Figure 3 shows how this metric is tracking daily.
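For clarity, a daily acceptance ratio of this sort could be computed from raw shown/accepted events roughly as in the snippet below; the event and column names are illustrative, and the real metric lives in Interana.

```python
# Minimal sketch of the daily acceptance-ratio calculation on toy events.
import pandas as pd

events = pd.DataFrame({
    "date": ["2020-04-20", "2020-04-20", "2020-04-20", "2020-04-21", "2020-04-21"],
    "event": ["shown", "shown", "accepted", "shown", "accepted"],
})

# Count shown/accepted per day, then divide to get the acceptance ratio.
daily = events.groupby("date")["event"].value_counts().unstack(fill_value=0)
daily["acceptance_ratio"] = daily["accepted"] / daily["shown"]
print(daily)
```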

Figure 3: Acceptance Likelihood for Suggested Reply Predictions

Retraining

As we have a streaming service constantly feeding new training data into our processed S3 bucket, we can schedule this training operation as often as necessary for our customers. We felt it was important for the models to be able to predict new saved responses no later than the day after they are created, so we set the training cadence to once per day. However, because the input data arrives through a streaming architecture, we could theoretically retrain as frequently as the training job can run (currently ~10 minutes). This is a cost-benefit analysis our team will perform as the feature rolls out to a larger audience.
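As an illustration, kicking off a training run on SageMaker might look roughly like the sketch below; the image URI, role ARN, instance type, and bucket names are placeholders, and the actual scheduling and job configuration in our pipeline differ.

```python
# Rough sketch of launching a SageMaker training job from the processed bucket.
# All resource names below are placeholders.
import time
import boto3

sagemaker = boto3.client("sagemaker")

sagemaker.create_training_job(
    TrainingJobName=f"suggested-reply-{int(time.time())}",
    AlgorithmSpecification={
        "TrainingImage": "<account>.dkr.ecr.<region>.amazonaws.com/suggested-reply:latest",
        "TrainingInputMode": "File",
    },
    RoleArn="arn:aws:iam::<account>:role/suggested-reply-training",
    InputDataConfig=[{
        "ChannelName": "training",
        "DataSource": {"S3DataSource": {
            "S3DataType": "S3Prefix",
            "S3Uri": "s3://example-data-lake/processed/suggested-reply/",
        }},
    }],
    OutputDataConfig={"S3OutputPath": "s3://example-models/suggested-reply/"},
    ResourceConfig={"InstanceType": "ml.m5.xlarge", "InstanceCount": 1,
                    "VolumeSizeInGB": 30},
    StoppingCondition={"MaxRuntimeInSeconds": 3600},
)
```

Running this on a daily schedule gives the once-per-day cadence described above, and tightening the schedule is what would let retraining approach the ~10 minute lower bound.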

Future Improvements

Even though our initial experiments were not promising for BERT, we have not given up on it. In theory it should work well, so we would like to run some experiments using the production data we have accumulated instead of the labelled dataset that we created. There are some great open source projects for using BERT in production, such as bert-as-service, that would make productionizing BERT easy. We envision adding a BERT embedding server (or whatever the current state of the art is) into our architecture as shown in Figure 4:

Figure 4: The Suggested Reply System Architecture with a BERT Embedding Server

The BERT server would embed all the data being streamed into the production bucket as well as new data coming in from the front end.
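As a rough illustration from the model-training side, the snippet below uses the bert-as-service client to fetch sentence embeddings and train a per-organization classifier on top of them; the host name and example data are made up.

```python
# Minimal sketch of calling an embedding server (bert-as-service) and training
# a small per-organization classifier on the resulting vectors.
from bert_serving.client import BertClient
from sklearn.linear_model import LogisticRegression

bc = BertClient(ip="bert-embedding-server")  # hypothetical host name

messages = ["Where is my order?", "How do I reset my password?"]
labels = ["shipping_reply", "account_reply"]

embeddings = bc.encode(messages)             # one fixed-size vector per message

# The per-organization classifier trains on BERT vectors instead of TFIDF.
clf = LogisticRegression(max_iter=1000).fit(embeddings, labels)
print(clf.predict(bc.encode(["my package never arrived"])))
```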

We would want to test these model differences using Sagemaker’s A/B testing capabilities in production.

Conclusion

I hope this blog post helps people in the design process for their own machine-learning-based systems. The process of building predictive features is far from standardized, and I know our team has benefited greatly from the community sharing details of what they have implemented.

With that said, I would like to thank my teammates on the Machine Learning team for working hard to deliver the Suggested Reply project (Honto Ming, Sam Reh, Andrew Gormley, MoHan Zhang, Aman Bhatia, Imtiaz Jadavji, Celine Liu, and Anthony Donohoe), as well as our Production Operations and Delivery team for helping us set up our infrastructure.

For more information on our deployment process please refer to our previous blog post on the Suggested Tag Feature.

Tyler Lanigan is a Senior Developer working on machine learning at Hootsuite.