Building an Async Event-driven Slack Bot on GCP for Engineering Support

drew
Google Cloud - Community
10 min read · May 16, 2020

An overview of going from idea to published in the Slack App Directory in just a few steps.

This assumes reasonable familiarity with Bolt (Slack’s JavaScript Framework) and Engineering Support Channels in Slack.

What We Will Build

Architecture Summarized

Background

One of the common challenges in managing Infrastructure teams at any sort of scale is keeping your customers happy and their questions acknowledged and answered. No matter how amazing your platform is, customers will always be pushing the boundaries, there will always be some edge case the documentation doesn’t explain, and you will need a place to provide reliable, expert support.

Unlike IT or HR teams, which can script many common support tasks and interactions, in Infrastructure the complexity of a conversation can quickly go from, “Hey, I am seeing this weird error from my service” to, “Site is down, starting an incident.”

Having to solve this challenge multiple times at multiple companies led me to build Ackly.io.

Force Threads

This was a tough migration for me, having grown up on BBSs (threads!), then IRC (no threads), then Campfire (no threads), then HipChat (no threads), and then Slack (threads!), but you must train your customers to use threads.

Don’t tolerate customers (or team members) who spam your support channel with one-line verbal garbage. Take them aside in a 1:1 DM and explain that we don’t do that in this channel, and that a follow-up to a question belongs in a thread.

If you train them on the importance of threads, not only will they adopt threads in your channel, they will carry that learning into other channels, resulting in company-wide improvement.

When not using threads — sanity is lost.
When using threads — sanity is maintained.

Threads are the only reasonable method to keep your customer support channel(s) clean and readable.

Choosing a Framework

There are a lot of different Slack Bot frameworks with a variety of pros/cons to each. For us over at Ackly, Slack’s Bolt framework fit the problem space nicely.

We reviewed Botkit and found a pretty large gap between the documentation and reality. Most of the plugins have not been upgraded for the 4.0 refactor, and working with Azure’s Bot Service was clunky at best.

We reviewed omnibot, which was very cool but unnecessarily complex and extremely AWS-focused, with minimal paths to other public cloud providers.

We reviewed many other purpose-built bots, many deep in the natural language understanding (NLU) space (Lex, Dialogflow, Botpress, Watson Assistant), and found them all to have oddities that made them extremely hard to work with, extend, or simply get up and running and submitted to the Slack App Directory in a timely manner.

Building a Bot to Track Unanswered Questions

On any given day there can be a flurry of customer questions in engineering support channels. It is highly likely that some questions go unanswered and are lost forever, because free-flowing text offers no summary or tracking mechanism.

Luckily, Slack has Event Subscriptions. You can subscribe a bot to listen for specific event types.

For the bot that we have built at Ackly, we subscribe to the message.channels and reaction_added events. This gives Ackly the ability to know when someone posts a message and when someone adds a reaction.

Subscribing to these events is quite easy in Bolt — for messages:

// Match every message posted in channels the bot is in
app.message(/^().*/, async ({body, message, say}) => {
  //
});

And for reactions:

// Fire whenever someone adds an emoji reaction
app.event('reaction_added', async ({body, event}) => {
  //
});

Bolt’s documentation, and the ability it gives you to read and see what is happening, is fantastic.

Filtering Messages & Reactions

Defining what data you actually need to store is important, both for privacy reasons and because data storage is not free.

With JavaScript you can specify a regular expression and then match for that content:

const regexDaily = /Ackly Daily Summary/ig;
if (message.text.match(regexDaily)) {
  //
}

You can also check if a message is a thread by checking to see if the message object has a particular property:

const isThread = message.hasOwnProperty('thread_ts');

And if the message is a bot or is a bot sub-type:

const isBot = body.event.hasOwnProperty('bot_id');
const isBotSubtype = message.subtype === 'bot_message'; // bot_message is a subtype value

For reactions, you could choose to filter just to a particular set:

const validActions = ['ack', 'heavy_check_mark'];
const isValidAction = validActions.includes(event.reaction);

With this information you can now filter messages in the ways this bot needs: ignore further messages in threads, detect whether a message was posted by a human or another bot (or the bot itself!), and narrow reactions to the specific set that matters to your application.
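Put together, those checks let the message listener bail out early (a sketch; the catch-all regex mirrors the earlier example and the storage step is elided):

// Match every message, then discard the ones we don't track
app.message(/^().*/, async ({body, message}) => {
  // Skip replies inside threads; only top-level questions are tracked
  if (message.hasOwnProperty('thread_ts')) return;
  // Skip messages from bots (including Ackly itself) and bot subtypes
  if (body.event.hasOwnProperty('bot_id') || message.subtype === 'bot_message') return;
  // ... store the message for unanswered-question tracking ...
});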

Store only what you need and when you need it. This will save you headaches further down the road when you try to onboard customers.

Choosing a Platform

We chose Google Cloud Platform (GCP) to expand our technical knowledge as a team: we already had experience with several other large public cloud providers, and GCP offered a few services we hadn’t used before.

The core services we use, each covered below:

  1. Cloud Run (the bot itself)
  2. Cloud Build (container builds from CI)
  3. Cloud Functions (asynchronous processing)
  4. Cloud Pub/Sub (event hand-off between the two)
  5. Cloud KMS and Cloud Storage (secrets, via Berglas)

These services vary in documentation quality and in supported/unsupported libraries, which gave us some of the hurdles we needed and wanted to climb.

Platform Must Reads

There were a couple of must-reads that were invaluable in getting started with Cloud Run:

  1. https://github.com/steren/awesome-cloudrun
  2. https://github.com/ahmetb/cloud-run-faq

Bot Must Reads

There were a couple of must-reads that were invaluable in getting started with writing Ackly:

  1. https://github.com/robbytaylor/slack-bot-template
  2. https://github.com/seratch/serverless-slack-bolt-aws
  3. https://github.com/IBM/slack-wrench

Getting Up and Running

A couple of tools and techniques got us up and running quickly.

The first was ngrok. With ngrok, a local instance of the bot can run and receive events from Slack, which enables quick testing. We opted for a paid license, which gives you a dedicated sub-domain.
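For example, with Bolt listening on its default port and events path, the tunnel and the Slack Request URL look roughly like this (the sub-domain is illustrative):

ngrok http 3000
# Point Slack's Event Subscriptions Request URL at the tunnel, e.g.:
#   https://your-subdomain.ngrok.io/slack/events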

The second was configuring Ackly to run one way locally (with a token) and another way in production (with OAuth).

This was done as follows:

const {App, ExpressReceiver} = require('@slack/bolt');

const expressReceiver = new ExpressReceiver({
  signingSecret: process.env.SLACK_SIGNING_SECRET,
});

// Local development: a single-workspace bot token
const makeAppWithToken = (token, expressReceiver) => {
  return new App({
    token: token,
    receiver: expressReceiver,
    logger: logger,     // logger and logLevel are defined elsewhere (trimmed)
    logLevel: logLevel,
  });
};

// Production: multi-workspace OAuth with a custom authorize function
const makeAppWithOauth = (expressReceiver) => {
  const oauth = require('./lib/oauth');
  const app = new App({
    authorize: oauth.auth,
    receiver: expressReceiver,
    logger: logger,
    logLevel: logLevel,
  });
  oauth.install(expressReceiver.app, app.client);
  return app;
};

const app = (process.env.USE_OAUTH === 'true') ?
  makeAppWithOauth(expressReceiver) :
  makeAppWithToken(process.env.SLACK_BOT_TOKEN, expressReceiver);

More on OAuth below.

OAuth v2 and Slack: Not a Happy Dance

One of the more confusing elements of building a Slack bot for us was the circular and unclear documentation from Slack on OAuth, redirect flow for requests, and permission scopes.

In addition, the Bolt framework doesn’t have any example of how to configure this flow for a production application.

I’d really love for someone at Slack to see this post someday and reach out to discuss.

Starting with Using OAuth 2.0:

  • The page’s URL has /legacy/oauth which raises the question, is this the correct page?
  • At the top it states, “New Slack apps can act independently of a user token. Build a bot user powered by only the specific permissions it needs.” Following that takes you to /authentication/basics, which leads to Installing with OAuth, which links right back to the page you started on.
  • Part of the confusion is that “new apps” use a V2 of Slack’s OAuth 2.0 implementation.
  • Furthermore, you really do need to read both pages, as the first covers information that the second does not. Maybe it was the step-by-step nature of the legacy page, or that it doesn’t try to be cheeky (“Extra credit” + “A little motivation”) like the newer page. Developer documentation needs to be direct, concise, and extremely easy to parse. We don’t need cheeky.

I could find no reasonable code snippets of working Slack OAuth v2 implementations. Every one I found was incorrect in multiple ways or didn’t work in production.

Here is an example of what we came up with after much trial and testing:

oauth.js

The gist is trimmed for clarity, removing DB operations and other meta pieces that weren’t necessary to show the flow.
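For reference, this is the rough shape of that file: a minimal sketch assuming an in-memory install store (Ackly’s real version persists installs to a database, and the routes, scopes, and field mapping here are illustrative, not Ackly’s exact values):

// oauth.js (sketch)
const installs = new Map(); // team_id -> {botToken, botId, botUserId}

// authorize() callback handed to Bolt: look up the install for this team
module.exports.auth = async ({teamId}) => {
  const install = installs.get(teamId);
  if (!install) throw new Error(`No installation found for team ${teamId}`);
  return install;
};

// Mount the install and redirect routes on Bolt's underlying Express app
module.exports.install = (expressApp, client) => {
  expressApp.get('/slack/install', (req, res) => {
    res.redirect('https://slack.com/oauth/v2/authorize' +
        `?client_id=${process.env.SLACK_CLIENT_ID}` +
        '&scope=channels:history,chat:write,reactions:read');
  });

  expressApp.get('/slack/oauth_redirect', async (req, res) => {
    // Exchange the temporary code for a bot token (the OAuth v2 endpoint)
    const result = await client.oauth.v2.access({
      client_id: process.env.SLACK_CLIENT_ID,
      client_secret: process.env.SLACK_CLIENT_SECRET,
      code: req.query.code,
    });
    installs.set(result.team.id, {
      botToken: result.access_token,
      botId: result.bot_user_id, // strictly, botId is the B... id from auth.test
      botUserId: result.bot_user_id,
    });
    res.send('Installed!');
  });
};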

Scopes were also interesting: there was conflicting documentation on chat:write vs. chat:write:bot, which we eventually sorted out, through trial and error, to using the former.

Secret Storage

Berglas has become our go-to tool for working with secrets on GCP’s managed offerings.

Set up a couple of variables:

export PROJECT_ID=ack....
export BUCKET_ID=ack....
export KMS_KEY=projects/ack..../locations/global/keyRings/berglas/cryptoKeys/berglas-key

Create your secrets:

berglas create ${BUCKET_ID}/some-key "some-value" --key ${KMS_KEY}

Grant access to them:

PROJECT_NUMBER=$(gcloud projects describe ${PROJECT_ID} --format 'value(projectNumber)')
export SA_EMAIL=${PROJECT_NUMBER}-compute@developer.gserviceaccount.com

berglas grant ${BUCKET_ID}/some-key --member serviceAccount:${SA_EMAIL}

Update your env vars to reference them:

gcloud run services update ack.... --platform managed --update-env-vars "SOME_KEY=berglas://${BUCKET_ID}/some-key,ANOTHER_KEY=berglas://${BUCKET_ID}/another-key"

Use file references if required by tools/libraries:

gcloud run services update ack.... --platform managed --update-env-vars "SOME_FILE_REF=berglas://ack..../some-file-ref-json?destination=tempfile"

If you need to revoke:

PROJECT_NUMBER=$(gcloud projects describe ${PROJECT_ID} --format 'value(projectNumber)')
export SA_EMAIL=${PROJECT_NUMBER}-compute@developer.gserviceaccount.com

berglas revoke ${BUCKET_ID}/some-key --member serviceAccount:${SA_EMAIL}

Removing and deleting:

gcloud run services update ack.... --platform managed --remove-env-vars FIREBASE_ADMIN_JSON,SOME_KEY

berglas delete ${BUCKET_ID}/some-key

Pub/Sub

Configuring Cloud Pub/Sub took one command to create each topic and one command to trigger a Cloud Function on that topic.

Create:

gcloud pubsub topics create ack....

Trigger:

gcloud functions deploy ack.... --trigger-topic ack.... --entry-point=subscribe --runtime nodejs8

Publishing events was extremely easy to do:

const {PubSub} = require('@google-cloud/pubsub');

const pubSubClient = new PubSub();

module.exports.publish = async function (obj) {
  const topicName = process.env.PUBSUB_TOPICNAME;
  const dataBuffer = Buffer.from(JSON.stringify(obj), 'utf8');
  const messageId = await pubSubClient.topic(topicName).publish(dataBuffer);
  console.log(`Message ${messageId} published.`);
};

Subscribing to events in a function was also easy to do:

// Cloud Functions Pub/Sub trigger: the payload arrives base64-encoded
exports.subscribe = (pubsubMessage) => {
  const messageObject = JSON.parse(Buffer.from(pubsubMessage.data, 'base64').toString());
  //
};
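Tying the two halves together, the Cloud Run bot can publish filtered events and let a function do the slower work asynchronously (a sketch; ./lib/pubsub is an assumed path for the publish helper above):

const pubsub = require('./lib/pubsub'); // the publish helper shown above

app.event('reaction_added', async ({body, event}) => {
  // Hand off to a Cloud Function so the HTTP handler responds quickly
  await pubsub.publish({type: 'reaction_added', team: body.team_id, event: event});
});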

Packaging

Choosing Docker made it easy to package and test locally and to debug failures within Cloud Run. The only complexity was wrapping the entrypoint with berglas for secret fetching and decryption.

Our Dockerfile is fairly simple (and trimmed):

FROM node:12-alpine
RUN apk add ca-certificates
WORKDIR /usr/src/app
COPY package*.json ./
RUN npm install --only=production
COPY index.js ./
COPY lib ./lib/
COPY --from=gcr.io/berglas/berglas:latest /bin/berglas /bin/berglas
ENTRYPOINT exec /bin/berglas exec -- npm start
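The same image can be built and smoke-tested locally before pushing (the tag is illustrative; plain env var values pass through berglas exec untouched):

docker build -t ackly-local .
docker run --rm -p 8080:8080 -e PORT=8080 \
  -e SLACK_SIGNING_SECRET=... -e SLACK_BOT_TOKEN=... ackly-local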

Deployment

Scripting deployment was pretty easy. We wrote a small script that handled the operations with Cloud Build, Cloud Run, and Cloud Functions. This is triggered by our CI build and can also be done manually:

export PROJECT_ID=ack....
export PROJECT_NAME=ack....
# set configuration
gcloud config set project ${PROJECT_ID}
# deploy bot
gcloud builds submit --tag gcr.io/${PROJECT_ID}/${PROJECT_NAME}
gcloud run deploy ${PROJECT_NAME} --image gcr.io/${PROJECT_ID}/${PROJECT_NAME} --platform managed
# deploy functions
cd functions
gcloud functions deploy ack.... --trigger-topic ack.... --entry-point=subscribe --runtime nodejs8
cd ..

What about those Unanswered Questions?

You first have to define business logic for what it means for a question to be unanswered. If you are storing some subset of messages from a public channel, and also the reactions in that channel, you can discern what is unanswered.

If you assume that an unanswered question is a message without a particular reaction added, and you are storing all of the relevant messages and reactions, then you can confirm whether any given message has one of those reactions.

The reaction_added event carries two timestamps: one for the item and one for the event. The item’s timestamp (item.ts) is when the associated message was posted; the event’s timestamp (event_ts) is when the reaction was added.

This data allows you to reason about answered and unanswered questions, assuming you have culturally established that reactions are used for questions in your support channels.
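In practice, that means a reaction_added handler can mark the underlying question answered using those two timestamps (a sketch; markAnswered is a hypothetical storage helper):

app.event('reaction_added', async ({event}) => {
  if (!['ack', 'heavy_check_mark'].includes(event.reaction)) return;
  await markAnswered({            // hypothetical storage helper
    channel: event.item.channel,
    questionTs: event.item.ts,    // when the question was posted
    answeredTs: event.event_ts,   // when the reaction was added
    answeredBy: event.user,       // who acknowledged it
  });
});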

Ackly’s Daily Answered/Unanswered Tracking

Build a Bot to Recognize The Front-line Support

Whether you are supporting one customer or a hundred, there is someone on the other side of the screen providing answers to all of those questions. Timely (daily) recognition matters in an age of constant expectations around gratification and recognition.

As part of the reaction_added event there is a user id (user) identifying who added the reaction (item_user identifies who posted the original message). With this, you can easily keep a summary of each person on the team who acknowledges and then answers questions.

This lets you celebrate a hero each day, providing the gratification and recognition that they deserve.
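A daily tally can be as simple as a counter per acknowledging user (a sketch with an in-memory map standing in for real storage):

const heroCounts = new Map(); // user id -> acknowledgements today

app.event('reaction_added', async ({event}) => {
  if (!['ack', 'heavy_check_mark'].includes(event.reaction)) return;
  heroCounts.set(event.user, (heroCounts.get(event.user) || 0) + 1);
});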

Ackly’s Daily Hero Recognition

Build a Bot to Celebrate

As mentioned earlier in this post, defining what data you actually need to store is important, both for privacy reasons and because data storage is not free.

At Ackly, we firmly believe that stored data you don’t need should be deleted. That is why we don’t store message data beyond a day (plus one day for cleanup processing).

Metadata is kept longer to enable celebration over longer time frames. This is particularly important for recognition beyond a day (weekly).

Performing this summarization no longer requires any message data, only other metadata such as event timestamps.
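One way to bucket that metadata by week without touching any message text (a sketch; the event row shape is hypothetical and timezone handling is elided):

// Slack timestamps are 'seconds.micros' strings; bucket them by week start
const weekStart = (ts) => {
  const d = new Date(parseFloat(ts) * 1000);
  d.setDate(d.getDate() - d.getDay());  // back up to Sunday
  return d.toISOString().slice(0, 10);  // YYYY-MM-DD bucket key
};

const weeklyCounts = (events) => {
  const counts = {};
  for (const {answeredBy, answeredTs} of events) {
    const key = `${weekStart(answeredTs)}:${answeredBy}`;
    counts[key] = (counts[key] || 0) + 1;
  }
  return counts;
};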

Ackly’s Weekly Summary

Conclusion

This post covered the business problem we set out to solve and the product we built around it, detailed the technologies we chose and how we configured them (with code snippets!), and ended with some of what this enabled in the product.

If you are interested in learning more about Ackly, check us out over on ackly.io.

If you found parts of this post interesting or would like to discuss further, feel free to comment/reply or reach out on Twitter at @HeyAckly.
