The Suggested Tags Service
In 2018 Hootsuite took steps to increase the machine learning capabilities of our product by forming a team to deliver the Suggested Tags Service. In addition to this feature, our team, along with teams from New York and Bucharest, was tasked with creating a GDPR-compliant data lake for machine learning data and a deployment pipeline template for future ML services. The following is an account of the final system design for this project.
As part of its offering, Hootsuite amalgamates all the messages directed through social channels such as Twitter, Facebook, and Instagram into one platform. All message types (replies, posts, and private messages) can be seen and responded to, and Hootsuite also offers the ability to tag messages. Our customers use tags for a variety of purposes. For example, reports are generated on the semantics of incoming messages (complaints, praise, enquiries). Our team thought it would be possible to suggest these tags given the message content, type, and social network. A suggested tags feature would save our customers time and let them use tags more effectively, since we would suggest tags they had used before.
In addition to providing customer value, our suggested tag feature would serve as a springboard to help us develop the process for deploying ML models at scale for Hootsuite.
An example of the UX for the Suggested Tags feature is shown in Figure 1:
The project began with the following considerations in mind:
- Performance — The system must be performant in both speed and accuracy of tag predictions. It is serving predictions in real time and must not cause annoying delays in returning predictions. It must also be accurate.
- Scalability — The system must be able to scale to all our enterprise-level tagging users. This means that we are processing and training on hundreds of GBs of data and serving tens of thousands of suggestions per day.
- Re-training — Initial analysis of the tagging behavior of our customers showed that new tags are created / archived daily. Our system must be able to pick up these new tags and stop suggesting ones that have been archived.
- Experimentation — In order to iterate on the design, the system must allow for us to perform experiments without affecting users.
- Model Continuous Integration / Auditability — Any changes to the code used in the service must be trackable. New models must be able to be swapped in and out of production without affecting traffic.
- GDPR Compliance — Any data stored must be able to comply with GDPR.
- Roll-back capabilities — If a new model in production has a bug, there must be the ability to revert back to a previous version quickly.
- Different Environments — Having local, staging, and production environments allows for development to take place with the least chance of affecting production traffic. Code that is developed in the local environment should work on staging, as well as on production in the same manner.
An additional criterion for the project was to make it as similar as possible to the current workflows for developers at Hootsuite. This way the project could easily be maintained and used as a template.
Final System Architecture
It is said that you should never reinvent the wheel. We took this proverb to heart as much as possible by using a host of existing technologies in our deployment pipeline. The full list is: GitHub, Jenkins, Docker, AWS SageMaker, AWS ECR, AWS EMR, AWS Elasticsearch, AWS S3, Spark, SKLearn and Kubernetes. The basic system architecture is shown in Figure 2:
I’ll leave detailed discussion of each of the services until later sections of the post but for now, let’s walk through the main flow:
- Raw data flows from various sources and is deposited into S3 using a service written in Scala and hosted in Kubernetes.
- Metadata (such as date and data owner) is stored in AWS Elasticsearch.
- A service written in Spark and hosted on AWS EMR is used to process data into a form suitable to be ingested by a machine learning algorithm.
- The processed data is saved in S3.
- At this point, AWS SageMaker spins up a training instance and uses the processed data to produce training artifacts (trained models) and saves them in S3.
- The artifacts are checked for quality then loaded into a serving instance.
- The serving instance receives requests from an inference service that acts as an endpoint for all incoming traffic. Services such as the Hootsuite dashboard call this service, and it forwards requests to our inference server. The inference service acts as a request converter and also stores incoming requests and outgoing predictions for future analysis.
Although the trials and tribulations of storing and processing large amounts of data for this project could fill up an entire post in itself, I’ll just briefly describe our process for preparing data for the machine learning portion of our suggested tags feature.
Data is stored in our S3-based data lake. We ingest data from various sources (such as our event bus powered by Apache Kafka) and store it in Parquet format. We initially looked into using ORC. While loading and writing performance was slightly better with ORC, we decided to use Parquet because it has better support in Python libraries, schemas can be defined using Protobuf, and nested data structures are supported.
In order to comply with GDPR, we store metadata in Elasticsearch. This allows us to easily find data and delete it if we get a request to do so from any of our customers. We keep data in our raw bucket for 6 months as this was deemed enough for training.
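To make the role of the metadata concrete, here is a minimal sketch of the two lookups described above: finding every stored object that belongs to one data owner, and finding objects that have aged out of the retention window. It is only an illustration. An in-memory list stands in for the Elasticsearch index, and the `ObjectMetadata` record, the function names, and the 183-day approximation of six months are all assumptions made for this example rather than details of our actual service.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class ObjectMetadata:
    """Hypothetical metadata record for one object in the raw bucket."""
    s3_key: str
    owner_id: str
    created: date


def keys_for_owner(index, owner_id):
    """Return the S3 keys of every object that belongs to one data owner."""
    return [m.s3_key for m in index if m.owner_id == owner_id]


def expired_keys(index, today, retention_days=183):
    """Return the S3 keys of objects older than the retention window.

    183 days is an assumed approximation of six months.
    """
    cutoff = today - timedelta(days=retention_days)
    return [m.s3_key for m in index if m.created < cutoff]


# A toy index standing in for the real metadata store.
index = [
    ObjectMetadata("raw/2018/01/a.parquet", "org-1", date(2018, 1, 5)),
    ObjectMetadata("raw/2018/06/b.parquet", "org-2", date(2018, 6, 20)),
    ObjectMetadata("raw/2018/07/c.parquet", "org-1", date(2018, 7, 1)),
]

to_delete_for_request = keys_for_owner(index, "org-1")
to_delete_for_retention = expired_keys(index, today=date(2018, 8, 1))
```

The keys returned by either function would then be handed to whatever job actually deletes the objects from S3.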
We host and run a Spark processing job on AWS EMR which converts data from our raw bucket into the required form needed for ingestion by the model training service. For a new project, we write new processors that utilize the same raw data. At the beginning of the suggested tag project we wrote a custom Scala job and hosted it in Kubernetes; however, we found that due to the size of the dataset (~500GB) parallel computation was necessary and thus we moved to Spark.
The Machine Learning Algorithm
Suggesting tags for social messages is a multi-label machine learning problem. The algorithm we use to predict tags from message content is based on the paper Baselines and Bigrams: Simple, Good Sentiment and Topic Classification. Our modification uses TF-IDF vectorization instead of bigram count-based features, and logistic regression instead of SVMs. Jeremy Howard discusses this technique in his fast.ai course for machine learning. We train one of these models for each tag (a one-vs-rest strategy), then package each customer’s tag models in a Predictor class, as shown in Figure 3.
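For readers who prefer code, the following is a rough sketch of this family of models using scikit-learn. It is not our production code. The class `NBLogisticTagModel`, the hyperparameters `alpha` and `C`, and the toy messages are assumptions for illustration, and the way the naive Bayes log-count ratio is multiplied into the TF-IDF features follows my reading of the paper and the fast.ai variant.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression


class NBLogisticTagModel:
    """Binary model for one tag: TF-IDF features scaled by a naive Bayes
    log-count ratio, followed by logistic regression."""

    def __init__(self, alpha=1.0, C=4.0):
        self.alpha = alpha  # smoothing term for the log-count ratio (assumed value)
        self.clf = LogisticRegression(C=C, max_iter=1000)
        self.r = None

    def fit(self, X, y):
        y = np.asarray(y)
        # Smoothed per-feature mass in messages with and without the tag.
        p = self.alpha + np.asarray(X[y == 1].sum(axis=0)).ravel()
        q = self.alpha + np.asarray(X[y == 0].sum(axis=0)).ravel()
        self.r = np.log((p / p.sum()) / (q / q.sum()))
        self.clf.fit(X.multiply(self.r).tocsr(), y)
        return self

    def predict_proba(self, X):
        # Probability that each message carries this tag.
        return self.clf.predict_proba(X.multiply(self.r).tocsr())[:, 1]


class Predictor:
    """Holds one binary model per tag for a single customer (one-vs-rest)."""

    def __init__(self):
        self.vectorizer = TfidfVectorizer()
        self.models = {}

    def fit(self, messages, tag_sets):
        X = self.vectorizer.fit_transform(messages)
        for tag in sorted(set().union(*tag_sets)):
            y = [int(tag in tags) for tags in tag_sets]
            self.models[tag] = NBLogisticTagModel().fit(X, y)
        return self

    def predict(self, message):
        X = self.vectorizer.transform([message])
        return {tag: float(m.predict_proba(X)[0]) for tag, m in self.models.items()}


# Toy data, invented for this example.
messages = [
    "my order arrived broken and late",
    "terrible service, still waiting for a refund",
    "love the new dashboard, great work",
    "great support team, thank you",
    "the app is broken again, terrible",
    "thank you, love it",
]
tag_sets = [{"complaint"}, {"complaint"}, {"praise"},
            {"praise"}, {"complaint"}, {"praise"}]

predictor = Predictor().fit(messages, tag_sets)
scores = predictor.predict("broken order, terrible")
```

Each tag needs both positive and negative examples in the training data for its logistic regression to fit, so a real pipeline would have to skip or special-case tags that appear on every message or on none.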
Packaging up the models in this manner allows us to write some handy functions around the models. For example, each model is trained, then tested on a held out dataset. It is then scored using the AUC value. If the lower bound of a confidence interval on this AUC value (see Confidence Intervals for the Area Under an ROC Curve) falls below acceptable levels, then we turn off the model. This way, only the tag models that are actually good at prediction are allowed to return predictions in the final system. This threshold was tuned using SageMaker’s handy hyper-parameter tuning capabilities.
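The gating idea can be sketched in a few lines. One caveat up front: the example below uses the classic Hanley and McNeil standard-error approximation for the AUC, swapped in because it is compact, and not the method from the paper cited above. The 0.7 cut-off, the 95% z-value, and the function names are likewise placeholders chosen for the example.

```python
import math


def auc(labels, scores):
    """Area under the ROC curve by pairwise comparison (ties count as half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def auc_lower_bound(a, n_pos, n_neg, z=1.96):
    """Lower end of an approximate confidence interval for an AUC value,
    using the Hanley-McNeil standard error (z=1.96 gives roughly 95%)."""
    q1 = a / (2 - a)
    q2 = 2 * a * a / (1 + a)
    variance = (
        a * (1 - a)
        + (n_pos - 1) * (q1 - a * a)
        + (n_neg - 1) * (q2 - a * a)
    ) / (n_pos * n_neg)
    return a - z * math.sqrt(max(variance, 0.0))


def tag_model_enabled(labels, scores, min_auc=0.7):
    """Keep a tag model switched on only if its AUC lower bound clears min_auc.

    min_auc=0.7 is a hypothetical threshold, not the tuned production value.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        return False  # AUC is undefined without both classes in the held-out set
    return auc_lower_bound(auc(labels, scores), n_pos, n_neg) >= min_auc
```

Using the lower bound rather than the point estimate means that a tag with only a handful of held-out examples has to score noticeably better than a well-sampled tag before it is allowed to make suggestions.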
Training machine learning models takes quite a bit of computation power, and this is what SageMaker is made for! Our current training process uses an ml.m5.4xlarge instance. To train a model we simply package the code in a Docker image, save it to AWS ECR, point SageMaker to the image and the training data, and specify which resources to use. As we want this process to be accessible to all of our developers, but also auditable and subject to code review, we require that all related code is persisted in GitHub. After the code is reviewed and pushed to the master branch, our Jenkins server kicks off the training and deployment jobs. The whole deployment process is shown below in Figure 4:
Our Jenkins server is able to run any of the SageMaker commands that are needed using the AWS CLI commands, or the Python SDK.
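As an illustration of what such a call might look like from the Python SDK, the sketch below builds the request for SageMaker's `CreateTrainingJob` API: an image in ECR, training data in S3, and the resources to use. The function name, the `train` channel name, the volume size, the time limit, and every ARN and URI in the usage example are placeholders invented for this post. Only the ml.m5.4xlarge instance type comes from our actual setup.

```python
def build_training_job_request(job_name, image_uri, role_arn,
                               train_s3_uri, output_s3_uri,
                               instance_type="ml.m5.4xlarge"):
    """Assemble the keyword arguments for SageMaker's CreateTrainingJob call."""
    return {
        "TrainingJobName": job_name,
        "AlgorithmSpecification": {
            "TrainingImage": image_uri,       # Docker image stored in ECR
            "TrainingInputMode": "File",
        },
        "RoleArn": role_arn,                  # role SageMaker assumes for the job
        "InputDataConfig": [{
            "ChannelName": "train",
            "DataSource": {"S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": train_s3_uri,        # processed data written by the Spark job
                "S3DataDistributionType": "FullyReplicated",
            }},
        }],
        "OutputDataConfig": {"S3OutputPath": output_s3_uri},  # where artifacts land
        "ResourceConfig": {
            "InstanceType": instance_type,
            "InstanceCount": 1,
            "VolumeSizeInGB": 50,             # assumed value
        },
        "StoppingCondition": {"MaxRuntimeInSeconds": 4 * 60 * 60},  # assumed limit
    }


# Placeholder values only; none of these identifiers are real.
request = build_training_job_request(
    job_name="suggested-tags-example",
    image_uri="123456789012.dkr.ecr.us-east-1.amazonaws.com/example-training:latest",
    role_arn="arn:aws:iam::123456789012:role/example-sagemaker-role",
    train_s3_uri="s3://example-processed-bucket/suggested-tags/",
    output_s3_uri="s3://example-artifacts-bucket/suggested-tags/",
)

# On a machine with AWS credentials, the job would be started with:
#   import boto3
#   boto3.client("sagemaker").create_training_job(**request)
```

Keeping the request construction separate from the call itself makes the job definition easy to review and unit test without touching AWS.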
There are many advantages to using this pipeline to perform these training operations:
- Our developers don’t need IAM permissions to spin EC2 instances up and down (as required by SageMaker). Thus any of our developers can deploy a machine learning service. This is a major goal at Hootsuite, as we want to enable all our teams to use machine learning as a tool to solve their problems.
- As training code is stored in images, it is easy to revert back to previous training jobs.
- The pipeline can be run autonomously each day for retraining.
- The developers never have access to PII data and only look at the results of training operations through aggregate statistics. This is important for compliance with GDPR.
In production we rely on a single guiding metric to tell us how our service is doing: the acceptance rate of our predictions (i.e. the number of times our predictions are accepted divided by the number of times they are shown). When it comes down to it, this is the most accurate way that we can approximate the true value to our customers!
However, we also report other metrics at the end of each training run to understand how the models behave. As I mentioned before, each tagging model is evaluated on a held-out validation set, on which we calculate F2, AUC, and Hamming loss, as well as a host of other metrics. This gives us a good picture of the behaviour of the system as a whole.
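For reference, two of these metrics are simple enough to write out by hand. The sketch below uses the standard definitions of the F-beta score (with beta = 2 for F2, which weights recall more heavily than precision) and of the Hamming loss for multi-label predictions. The function names and the toy inputs are made up for the example, and in practice a library such as scikit-learn provides equivalent functions.

```python
def fbeta(y_true, y_pred, beta=2.0):
    """F-beta score for one tag, given 0/1 labels and 0/1 predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    b2 = beta * beta
    denominator = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denominator if denominator else 0.0


def hamming_loss(true_sets, pred_sets, all_tags):
    """Fraction of (message, tag) decisions that are wrong.

    Assumes every tag in true_sets and pred_sets is also in all_tags.
    """
    wrong = sum(len(t ^ p) for t, p in zip(true_sets, pred_sets))
    return wrong / (len(true_sets) * len(all_tags))


f2 = fbeta([1, 1, 0, 0], [1, 0, 1, 0])
loss = hamming_loss([{"a"}, {"b"}], [{"a", "b"}, {"b"}], all_tags={"a", "b"})
```

In the toy inputs there is one true positive, one false positive, and one false negative, so `f2` is 5 / 10 = 0.5. One of the four (message, tag) decisions is wrong, so `loss` is 0.25.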
System Monitoring and Detecting Crashes
In order to detect when performance is dropping, or when our inference service is not returning predictions, we built dashboards using Interana. If our service begins returning errors at an anomalous rate, we are notified in our Slack channel. These steps are important to ensure that we have as little downtime as possible.
Our customers make new tags daily, so our system must be able to retrain daily as well. As running each of the necessary jobs (processing data, training models, deploying a new inference server) by hand every day would be tedious, we created a cron job in Jenkins that runs each step for us. To ensure that we don’t deploy poor models, we wrote model quality and integration tests that stop the deployment if something is wrong. This way we can be hands-off, and only get notified if there is an error.
The Final Balancing Act and Performance
After an extensive beta trial with a select group of customers, we were able to decide on a final design for our suggested tagging system. As each prediction model for our customers returns a probability for each tag that they use, we only suggest tags to our customers that are above a certain threshold. Thus, sometimes our service does not return any tags for a given message, or returns fewer than our default amount of three. In these cases, we fill in the blanks with the most recently used tag, creating a hybrid system.
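The hybrid logic is small enough to sketch directly. In the example below, the function name, the 0.5 threshold, and the representation of recently used tags as a list ordered from most to least recent are assumptions for illustration. The default of three suggestions is the only value taken from the real system.

```python
def suggest_tags(probabilities, recent_tags, threshold=0.5, max_suggestions=3):
    """Return up to max_suggestions tags: model predictions above the
    threshold first, then recently used tags to fill any remaining slots.

    probabilities: dict mapping tag -> predicted probability.
    recent_tags: list of tags, most recently used first (assumed format).
    """
    ranked = sorted(probabilities.items(), key=lambda kv: kv[1], reverse=True)
    suggestions = [tag for tag, p in ranked if p > threshold][:max_suggestions]
    for tag in recent_tags:
        if len(suggestions) >= max_suggestions:
            break
        if tag not in suggestions:  # avoid suggesting the same tag twice
            suggestions.append(tag)
    return suggestions


hybrid = suggest_tags(
    {"complaint": 0.9, "praise": 0.2, "billing": 0.6},
    recent_tags=["praise", "billing", "shipping"],
)
fallback_only = suggest_tags({}, recent_tags=["a", "b", "c", "d"])
```

Here `hybrid` is `["complaint", "billing", "praise"]`: two tags clear the threshold and the most recently used tag fills the third slot. When the model returns nothing, `fallback_only` is `["a", "b", "c"]`.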
In order to validate our machine learning service, we conducted an A/B test in which we presented half of our beta cohort with the most recently used tag as a suggestion, and half with our machine learning prediction. While the machine learning system performed better than the recent tag system, we found that the hybrid system performed better than either the machine learning or the recently used tag system alone.
Our performance varies slightly from day to day, but we consistently see an acceptance rate of 60–65%.
Building production-level machine learning systems is not easy. Luckily, the tools available to help are becoming more prevalent day by day. Keep a lookout in the future for more projects from Hootsuite’s Machine Learning Team. If you have any questions about this project, please feel free to reach out to me on LinkedIn: https://www.linkedin.com/in/tylerlanigan/.
I’d like to thank Jonathan Meyers from our New York office, Ethan Aubuchon from our Toronto office, and the Bucharest and Vancouver Ops teams for their help with designing and building the data lake. Shoutouts to my teammates Honto Ming, Sam Reh, and Andrew Gormley for being generally awesome and working hard to deliver the suggested tags project with me.