How I Built an Unscalable Slack Bot (and How I Fixed it with R̵̵̵e̵̵̵l̵̵̵a̵̵̵x̵̵̵ Async Events)

This is a retrospective on building a Slack bot that worked, broke, and finally scaled. I don’t dive too deeply into either the analysis or the implementation of the solutions, but just recount my path towards an understanding of the problem and the solution. If you have questions of your own around these things, I’m happy to address anything I can help with in the comments.

UPDATE: I (somewhat) recently made some changes to how Callie works (including *not* using the Relax library) that will follow the main article. The rest of the article still feels relevant, hence the addendum in lieu of a rewrite.

In The Beginning

The way my first Slack bot started is probably the way 90% of them start. I wanted some functionality for my team, so I went and wrote a little node app that afforded it. I followed along with one of the many Slack bot tutorials and within a day or two…voila! The Count was live.

“Ahhh…TWO days until your fantasy football draft!” (Also, real talk: the Count should probably be counting in Base 8, no?)

The Count was a pretty basic piece of software. You could tell him about an upcoming event and the date of that event, and he would keep a countdown of days until your event. He was immediately pretty popular with my friends (yes, my friends and I use Slack for fun), so I thought…hey, I could probably put this on the Slack app directory and have something out there that people actually use! Finally, one of those opportunities for a side-project to get real-world exposure in an area and language I don’t usually work in. This will be fun!

First things first: I had to get rid of The Count, because he’s extremely…er, copyright protected, and Slack is not about that game. So, Callie the Calendar Corgi was born. Arf!

Callie was built with node on express, using this tutorial as a starting point. I started out with the same slackbots package as a base, but used MongoDB for the database, mainly because I wanted an excuse to work with a NoSQL DB for a project. For a couple non-bot related Slack API requirements, I used slack-node. I deployed Callie on Heroku, on a hobby-level dyno for 7 bucks/mo.

And now that I was extending Callie to the world at large, I had to start thinking about actual programming challenges like persistence — since I could no longer just assume that every countdown was relevant to a given team (which was a safe assumption when it was only my team using the bot), I had to restructure the data model for a multi-tenant bot.
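To make the multi-tenant change concrete, here is a minimal sketch of countdowns scoped by team. The field names (teamId, event, date) and both helper functions are hypothetical and are not Callie's actual schema; the only point is that every read takes the installing team's ID.

```javascript
// Hypothetical countdown records; in a real app these would be MongoDB documents.
const countdowns = [
  { teamId: "T111", event: "fantasy draft", date: "2017-09-01" },
  { teamId: "T222", event: "product launch", date: "2017-10-15" },
  { teamId: "T111", event: "halloween", date: "2017-10-31" },
];

// Every lookup is scoped by team ID so one team never sees another team's events.
function countdownsForTeam(allCountdowns, teamId) {
  return allCountdowns.filter((c) => c.teamId === teamId);
}

// Whole days from `now` until the event date, with both treated as UTC dates.
function daysUntil(isoDate, now) {
  const msPerDay = 24 * 60 * 60 * 1000;
  const target = Date.parse(isoDate + "T00:00:00Z");
  const today = Date.UTC(now.getUTCFullYear(), now.getUTCMonth(), now.getUTCDate());
  return Math.round((target - today) / msPerDay);
}
```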

There were other problems I had to solve, which I’ll note briefly in the next two sections before focusing on my main issue: memory consumption as my app’s user base scaled.


Scheduled Reminders

Callie would automatically send a message with the countdown to your events. The default was 10am on Monday, but you could make it daily/weekly and pick the hour and day for your reminder.

Heroku (where I run my app) doesn’t have cron, instead opting for an add-on that I didn’t like, so I ended up using node-schedule to accomplish this…but made a mistake in that I basically created a new Scheduled Job object (essentially an EventEmitter) for each and every countdown. So, if there were a thousand countdowns scheduled for an automated reminder at 10am on Mondays, there would be a thousand separate scheduled jobs running to handle that. This is something I solved later in my scalability push.

User Inputs

To schedule something for Callie to remember/count down for you, you let her know by saying:

Yes, I am aware that halloween is actually October 32nd.

When you first booted her up, she’d tell you how to do this in a (slightly verbose) welcome message that instructed you to issue your countdowns as such:

Upon going live, I noticed within HOURS that users were actually saying things like @callie start date: <2017-10-30>, event: <halloween eve>, which resulted in goofy “&lt;” strings littering my database.

There are other instances where user input could result in garbage data as well. This was primarily because I was parsing inputs myself rather than using (or trying to implement) a more natural language processing system. There was some flexibility in input handling, but not enough!

This is still the case with Callie, unfortunately, but there are at least some safeguards in place. Callie can recognize when things are blatantly wrong in the most common variations and gently chastise you towards a less incorrect command.
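As a rough illustration of that kind of safeguard, here is a sketch of a parser that tolerates copied placeholder brackets. Slack escapes <, > and & in message text as &lt;, &gt; and &amp;, which is where the stray entities come from. The command grammar and all three function names below are assumptions made for the example, not Callie's real parser.

```javascript
// Undo the three HTML entities Slack uses to escape message text.
// &amp; goes last so "&amp;lt;" becomes "&lt;" instead of "<".
function unescapeSlackText(text) {
  return text
    .replace(/&lt;/g, "<")
    .replace(/&gt;/g, ">")
    .replace(/&amp;/g, "&");
}

// Remove one pair of wrapping angle brackets: "<halloween eve>" -> "halloween eve".
function stripPlaceholderBrackets(value) {
  const trimmed = value.trim();
  const match = /^<(.*)>$/.exec(trimmed);
  return match ? match[1].trim() : trimmed;
}

// Hypothetical grammar: "start date: YYYY-MM-DD, event: some name".
// Returns { date, event } on success, or { error } with a hint for the user.
function parseCountdownCommand(rawText) {
  const text = unescapeSlackText(rawText);
  const match = /start date:\s*(.+?)\s*,\s*event:\s*(.+)$/i.exec(text);
  if (!match) {
    return { error: "Try: start date: 2017-10-31, event: halloween" };
  }
  const date = stripPlaceholderBrackets(match[1]);
  const event = stripPlaceholderBrackets(match[2]);
  // Require the YYYY-MM-DD shape and reject values the engine cannot parse.
  if (!/^\d{4}-\d{2}-\d{2}$/.test(date) || Number.isNaN(Date.parse(date))) {
    return { error: `"${date}" does not look like a YYYY-MM-DD date.` };
  }
  if (event.length === 0) {
    return { error: "The event name is empty." };
  }
  return { date, event };
}
```

Returning an error message instead of throwing keeps the "gently chastise" step simple: the bot can post the hint straight back to the channel.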

The lesson here is that when you have users for a Slack bot that aren’t going to be extensively coached by you (i.e. any bot that other teams will use), expect them to make dumb, repetitive, catastrophic inputs. Yes, we’re all aware to some extent that user input can’t be trusted, but it seems like it’s another level when the user believes they can talk with your application like “a person”.

Memory Issues and App Failure

Callie first started experiencing issues when she had around 1,400 team installations. I was using Papertrail for logs and receiving R14: memory quota exceeded a handful of times a day, particularly around 6am (when my dynos would restart) and around 10am (when a bunch of event emitters for scheduled reminders would fire off).

This is not my actual app (I did not take a screenshot at the time), but it is close enough. Taken from some other poor memory issue sufferer on Stack Overflow.

At the time, I let it ride, since my app was still experiencing 100% uptime. But as the bot count continued to grow, memory issues became rampant, and my memory metrics chart looked a lot like this example: just constantly at LEAST a little above quota, which meant I had plenty of mornings with emails from Papertrail saying [Papertrail] “Platform errors” alert: 3361 matches. Yikes.

Finally, the app started crashing after vastly exceeding memory quotas. Even though it was a very niche, non-critical, free Slack bot, users cared when it was down. It was time to make a change.

What Went Wrong

First of all, I thought about garbage collection. I didn’t think this would ultimately solve my problem, but I wanted to give it a try. Node lets you configure the old space size, so I thought maybe something like the below would get me some breathing room.

node --max-old-space-size=460 app.js

Alas, the problem was not solved that way. The excessive memory usage was persistent and apparently not just a bunch of old references that I could get rid of more aggressively.
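For anyone in a similar spot, it can help to log what the process is actually holding before tuning flags. This is a minimal sketch, not something taken from Callie's code: process.memoryUsage and v8.getHeapStatistics are Node built-ins, while toMb and memorySnapshot are names made up for the example.

```javascript
const v8 = require("v8");

// Convert bytes to megabytes, rounded to one decimal place.
function toMb(bytes) {
  return Math.round((bytes / 1024 / 1024) * 10) / 10;
}

// Snapshot of the process's current memory figures, in MB.
function memorySnapshot() {
  const usage = process.memoryUsage();
  return {
    rssMb: toMb(usage.rss),             // resident set size of the whole process
    heapUsedMb: toMb(usage.heapUsed),   // live JavaScript objects
    heapTotalMb: toMb(usage.heapTotal), // heap currently reserved by V8
    externalMb: toMb(usage.external),   // memory of C++ objects bound to JS objects
    // Ceiling V8 will grow the heap to; --max-old-space-size changes this value.
    heapLimitMb: toMb(v8.getHeapStatistics().heap_size_limit),
  };
}
```

Logging JSON.stringify(memorySnapshot()) once a minute from a setInterval callback is enough to see whether growth sits in the JavaScript heap or outside it. A large gap between rssMb and heapUsedMb suggests the pressure is not just old references waiting to be collected.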

I considered doing some more advanced investigation into the state of the heap in my app, but I felt like at this point, the primary cause was probably the slackbots library I was using.

Each team that installed Callie opened up a real-time messaging (RTM) websocket connection with Slack. Given the relatively small size of my dyno (512 MB of available memory), concurrency in my app via the Cluster API probably wasn’t going to help. I needed a way to manage all of these websocket connections without blowing through my memory ceiling.

Relax

This is when I discovered Relax, a Go program for handling exactly this scenario. The key is that Relax uses goroutines to handle these connections in conjunction with a set of Redis hashes to manage state. In short, Relax leverages a strength of Go to take the burden of managing high numbers of websocket connections off the developers who are primarily interested in the actual business logic of their bots.

From ChatBotsMagazine:

The Go language has the powerful concept of Go routines which are lightweight threads managed by the Go runtime. A Go program can spawn thousands of Go routines with very little cost and take advantage of multi-core server architectures.
With Relax and Go, we achieve the following design goals of running a scalable Slack Bot:
Being able to spawn 1000’s of Go Routines (one for each team that connects to your Slack Bot) enables us to scale cost efficiently.

Pretty straightforward. In practice, it was also pretty easy to implement. Rather than instantiating a new instance of slackbots for each new team installation (which was likely a source of high memory usage), I simplified the slackbots module to just provide a very basic wrapper for posting messages to slack channels without any inherent knowledge around bot access tokens. This meant I could pass in the message to be posted, the destination, and authentication information to the wrapper methods as needed.
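A stateless wrapper of that shape might look something like the sketch below. It calls Slack's chat.postMessage Web API method directly rather than going through the slackbots module, so treat it as an illustration of the idea and not the code Callie runs. The function names and the injectable fetchImpl parameter are assumptions for the example; the default relies on the global fetch in Node 18+.

```javascript
// Build the HTTP request for Slack's chat.postMessage Web API method.
function buildPostMessageRequest(botToken, channel, text) {
  return {
    url: "https://slack.com/api/chat.postMessage",
    options: {
      method: "POST",
      headers: {
        "Content-Type": "application/json; charset=utf-8",
        Authorization: `Bearer ${botToken}`,
      },
      body: JSON.stringify({ channel, text }),
    },
  };
}

// Stateless send: the token is passed in per call, so nothing about a team
// has to live in memory between messages.
async function postMessage(botToken, channel, text, fetchImpl = fetch) {
  const { url, options } = buildPostMessageRequest(botToken, channel, text);
  const response = await fetchImpl(url, options);
  const result = await response.json();
  // Slack reports failures in the JSON body via ok: false plus an error code.
  if (!result.ok) {
    throw new Error(`Slack API error: ${result.error}`);
  }
  return result;
}
```

With this shape, the per-team cost is one stored token in the database instead of one long-lived object per installation.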

I used Ben Whittle’s fantastic package relax-js to make incorporating my existing app with Relax very simple. I took the opportunity to also revamp how my node-schedule implementation worked. Rather than creating a new job for every scheduled countdown, I just schedule a single job that checks for qualified countdowns each hour. This was a super obvious optimization in retrospect and freed up space that was previously used by thousands of Job objects.
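The single-job approach can be sketched roughly as follows. The schedule shape and function names are hypothetical, and time zones are ignored to keep it short. The node-schedule wiring is left as a comment so the snippet runs without dependencies.

```javascript
// Hypothetical schedule shape:
//   { frequency: "daily" | "weekly", hour: 0-23, dayOfWeek: 0-6 }
// dayOfWeek follows Date#getDay (0 = Sunday) and is ignored for daily reminders.
function isDue(schedule, now) {
  if (schedule.hour !== now.getHours()) return false;
  if (schedule.frequency === "daily") return true;
  return schedule.dayOfWeek === now.getDay();
}

// One pass over all countdowns replaces thousands of per-countdown Job objects.
function findDueCountdowns(countdowns, now) {
  return countdowns.filter((c) => isDue(c.schedule, now));
}

// Wiring with node-schedule (not executed here):
//   const schedule = require("node-schedule");
//   schedule.scheduleJob("0 * * * *", async () => {
//     const due = findDueCountdowns(await loadCountdowns(), new Date());
//     // ...post a reminder for each due countdown...
//   });
// loadCountdowns is a hypothetical database helper.
```

The cron expression "0 * * * *" fires at the top of every hour, so memory use stays flat no matter how many countdowns exist. The trade-off is one database query per hour instead of none.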

Overall, the results of these changes were immediate and significant:

This IS my app and it looks very nice from this angle!

Memory usage plummeted. Some of that was taken up by the Relax app instance I am running on the same heroku pipeline now:

The spike in the beginning was mostly a bunch of debugging garbage as I got things up and running.

Other than the spikes during development as I ironed out some issues, it’s been super smooth running. After pruning disabled/disqualified bots (via Relax’s disable_bot event), I still have nearly 2,000 teams with installed bots, and don’t anticipate needing to make any changes moving forward unless many, many people read this and decide they need a countdown bot.

Update: No More Relax

I enjoyed learning to use Relax, and it did solve my problem for a while, but over time the development experience (particularly debugging) became enough of a problem that I wanted to move away from it.

I started seeing message drop-offs that seemed to fail silently at some point after hitting the redis cache that Relax uses to manage messages. Despite (IIRC) unique hashes of the message + timestamp to identify messages, Relax was erroneously labeling messages as duplicates and declining to pass them along to my node app, so Callie got very quiet for her users.

I have spent many, many hours messing around with Callie and did not want to start debugging the third party library I was using, particularly since the team that “owns” it did not seem to be maintaining the library anymore. The original binaries that they hosted were not even available any longer, and I felt like I was in danger of diving into some unnecessarily deep and uncanny rabbit holes.

The Solution (For Now)

I ended up moving away from using the Real-Time Messaging framework for managing Slack events, and opted for an Events API-based framework instead. Rather than maintaining a persistent, stateful websocket connection wrapped in a slackbot library for each team that installed Callie, I did the simple thing that I arguably could have done in the first place: registered some API endpoints to receive messages asynchronously. Any lag attributable to non-real-time messaging is virtually impossible to notice, and my memory consumption actually plummeted *again* from what I was already happy with using Relax.
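For a sense of how little is involved, here is a minimal sketch of an Events API endpoint using only Node's built-in http module. The url_verification and event_callback payload types come from Slack's Events API; the function names are made up for the example. Verification of Slack's request signature is deliberately left out, so this is a starting point and not a production handler.

```javascript
const http = require("http");

// Decide how to answer one Events API payload. Kept pure so it is easy to test.
function handleSlackPayload(payload, onEvent) {
  if (!payload || typeof payload !== "object") {
    return { status: 400, body: "invalid payload" };
  }
  // Slack sends this once when the endpoint URL is registered.
  if (payload.type === "url_verification") {
    return { status: 200, body: String(payload.challenge) };
  }
  // Regular events arrive wrapped in an event_callback envelope.
  if (payload.type === "event_callback") {
    // onEvent should return quickly; Slack expects a prompt 2xx response.
    onEvent(payload.team_id, payload.event);
    return { status: 200, body: "" };
  }
  return { status: 400, body: "unsupported payload type" };
}

// Minimal HTTP server around the handler. No per-team connection is kept open.
function createEventsServer(onEvent) {
  return http.createServer((req, res) => {
    let raw = "";
    req.on("data", (chunk) => { raw += chunk; });
    req.on("end", () => {
      let payload;
      try {
        payload = JSON.parse(raw);
      } catch (err) {
        res.writeHead(400, { "Content-Type": "text/plain" });
        res.end("invalid JSON");
        return;
      }
      const result = handleSlackPayload(payload, onEvent);
      res.writeHead(result.status, { "Content-Type": "text/plain" });
      res.end(result.body);
    });
  });
}
```

Starting it is one line: createEventsServer(myHandler).listen(process.env.PORT || 3000), where myHandler is whatever function routes the event into the bot's logic. Memory then scales with request volume instead of with the number of installed teams.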

Ultimately, I think asynchronous event handling was the obvious choice in terms of solving for scalability, but I’m not unhappy or regretful that I tried to solve my problem with Relax instead. It was an interesting (if potentially over-engineered) solution that introduced me to some neat third-party code and worked well! Until it didn’t.

Takeaways

I still have a lot to learn about memory management in Node. Things I am particularly interested in:

  • More detail on Websocket and EventEmitter memory usage in Node programs.
  • Investigating the Slack Events API as a viable alternative to any of this (at the cost of real-time messaging). ** this is what I did! I am glad that I did.
  • Heap analysis tools for Node

Overall, this was a relatively superficial (but still challenging) introduction to the problem of scaling an application. The issues were pretty apparent from the start, and the solutions existed for those issues in a way that I could implement without requiring an entire-app rewrite.

I found it super valuable to get exposure to problems of scale in a one-man project before the application became a mission-critical product for users (versus a pretty lightweight/niche Slack bot primarily used for fun). I believe this is probably a danger/opportunity for many Slack bots that get listed on the App Directory, and there weren’t a ton of stories out there re: solutions. Heroku has some resources for memory optimization, but they don’t fully apply when your problem is managing something like 2,000 persistent websockets.

If you have questions or suggestions, let ’em fly. And if you need a cute corgi to count down to upcoming events in Slack, there’s a newly reliable option for you here.

Like what you read? Give Brian Gerson a round of applause.
