Enabling flexible work habits with chatbots

Gavin Perrie · Published in DataReply · Feb 15, 2019
Image by Peyri Herrera via Flickr, Creative Commons License

When building a new platform, there comes a point — hopefully — where usage scales and your creation jumps into life, becoming an important part of people’s working toolset. What happens when that growth comes quicker than anticipated and the support team can’t be scaled at a similar speed? In many cases there is a scramble to keep up, with interim support processes put in place that often dampen the initial excitement around the new tool. Here is how we solved this issue by exposing platform operations to end users through a chatbot and APIs.

‘Why can’t Slack just answer these requests itself?’

That was the question from one of our Platform Engineers as we sat in a meeting one Tuesday morning and, for the third time in the last hour, a user of our platform pinged him asking to have their cluster resized. Resizing a cluster wasn’t a difficult task; we’d created a collection of management Lambdas that automated the majority of our daily tasks. But it was still a distraction, and there were better things the team could be doing to extend the platform.

So how did we get to a situation where users were resizing their clusters so often?

Our project was started to replace an existing Hadoop cluster. There was nothing particularly wrong with it; the customer just wanted a more flexible way of developing their use-cases. The existing platform had fixed resources, and schedules for new workloads had to be carefully thought out so as not to negatively impact the current ones. It’s a problem we commonly see in on-premises solutions, and there are various techniques to manage resources and the execution times of scheduled tasks.

From the beginning we had some targets that we wanted to achieve with the replacement:

  • flexible, elastic architecture
  • no technology lock-in
  • DevOps style support

By no technology lock-in, we meant that use-case developers should have the freedom to choose the technology that best supports their needs. Many of us have seen mature projects where we’ve questioned a particular technology choice, only to find it was made because of constraints set by other factors: whether the technology was already in use at the company, the knowledge level in the team, or even pricing.

A big focus was also put on the DevOps style of support: basically, you write it, you run it. We wanted to give the application developers as much autonomy as possible and remove their dependencies on the platform team. If we could keep the team free from support tasks, they could be further developing the platform instead of trying to understand multiple applications and their subtleties.

So we went and built the new Big Data platform.

And made it flexible and elastic.

We decoupled the storage layer from the compute layer by using AWS S3 for our data lake. Then we gave every data scientist their own AWS EMR cluster (AWS’s hosted Hadoop offering), independent from all other clusters, allowing them to take all the resources they needed for their jobs whenever they needed them. We built a set of scaling parameters that let each user define the size of cluster they needed, as well as the type of EC2 instance they wanted for both the master nodes and the worker nodes. On top we added a set of automation scripts to control and manage everything, and all was good in our project world.
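
To make that concrete, here is a rough sketch of what spinning up one of those per-user clusters could look like with boto3. The parameter names and values below are illustrative, not the ones our management Lambdas actually use.

```python
import boto3

# Illustrative only: per-user cluster parameters chosen by the data scientist.
user_params = {
    "master_instance_type": "m5.xlarge",
    "worker_instance_type": "r5.2xlarge",
    "worker_count": 4,
}

emr = boto3.client("emr")

response = emr.run_job_flow(
    Name="cluster-alice",                      # one cluster per data scientist
    ReleaseLabel="emr-5.20.0",
    Instances={
        "InstanceGroups": [
            {
                "Name": "Master",
                "InstanceRole": "MASTER",
                "InstanceType": user_params["master_instance_type"],
                "InstanceCount": 1,
            },
            {
                "Name": "Workers",
                "InstanceRole": "CORE",
                "InstanceType": user_params["worker_instance_type"],
                "InstanceCount": user_params["worker_count"],
            },
        ],
        "KeepJobFlowAliveWhenNoSteps": True,   # keep the cluster up for interactive use
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print("Started cluster:", response["JobFlowId"])
```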

Then came our first users, and they loved the flexibility to scale when needed. They also made regular use of the ability to change instance types. And the number of nodes. And the cluster run-times. All of these changes involved our team, either through support tickets or messages sent directly to us on Slack. As the number of users grew, so did the amount of time we spent supporting them and their new-found love of scaling their clusters in, out, up and down, until it was impacting the delivery of new features on the platform.

Image by Alex via Flickr, Creative Commons License

‘Why can’t Slack just answer these requests itself?’

The Eureka moment came when a fed-up engineer muttered the words above, having scaled a cluster for the umpteenth time that week. We all sat back, looked at him, and realised that having Slack answer was a genius idea: it would not only free the team from the responsibility of scaling, but also allow the users to work independently and flexibly.

We decided to build a chatbot that could handle the majority of the standard requests, but before we could do that we had to make a few decisions:

  • Which actions can the users perform themselves?
  • How do we expose those actions through an API?
  • Which bot framework do we use?
  • How do we authenticate the users?
  • Which frontend do we use?

The last two were the easiest and were decided quickly. We already used Google Authenticator to generate one-time codes for logging in to other services, so we reused that. And the requests were already being sent to us on Slack, so to make the switch as seamless as possible for our users we kept Slack in place as the frontend.
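
Purely as an illustration (the post doesn’t cover the integration details), checking one of those one-time codes inside a Lambda could look something like this with the pyotp library:

```python
import pyotp

def verify_one_time_code(user_secret: str, submitted_code: str) -> bool:
    """Check a Google Authenticator one-time code against the user's shared secret."""
    totp = pyotp.TOTP(user_secret)
    # valid_window=1 tolerates one 30-second step of clock drift either way
    return totp.verify(submitted_code, valid_window=1)

# Hypothetical usage inside the fulfilment Lambda:
# if not verify_one_time_code(lookup_secret(user), slots["AuthCode"]):
#     refuse the request
```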

The choice of bot framework was also an easy decision. AWS offers the Lex service, which provides Natural Language Understanding (NLU) and integrates nicely with AWS Lambda to fulfil the users’ requests, or intents in chatbot speak.
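
A minimal sketch of a Lex fulfilment Lambda (using the Lex V1 event format, which was current at the time) looks like the following. The intent and slot names are made up, and scale_cluster / stop_cluster are stand-ins for the management Lambdas mentioned above.

```python
def lambda_handler(event, context):
    intent = event["currentIntent"]["name"]
    slots = event["currentIntent"]["slots"]

    if intent == "ScaleCluster":
        reply = scale_cluster(slots["ClusterId"], int(slots["NodeCount"]))
    elif intent == "StopCluster":
        reply = stop_cluster(slots["ClusterId"])
    else:
        reply = "Sorry, I can't handle that request yet."

    # Close the conversation and send the reply back to the user in Slack
    return {
        "dialogAction": {
            "type": "Close",
            "fulfillmentState": "Fulfilled",
            "message": {"contentType": "PlainText", "content": reply},
        }
    }
```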

For the choice of user actions we followed the ‘think big, start small’ mantra and offered only the following core actions to begin with (there’s a sketch of the scale action just after the list):

  • scale up/down
  • scale in/out
  • stop/start
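
For example, scale in/out boils down to resizing the cluster’s worker instance group. A rough sketch with boto3, assuming the cluster ID and target node count have already been pulled from the user’s request, could look like this:

```python
import boto3

emr = boto3.client("emr")

def scale_cluster(cluster_id: str, target_count: int) -> str:
    """Scale a running EMR cluster in or out by resizing its CORE instance group.

    Illustrative sketch; error handling and validation are left out.
    """
    groups = emr.list_instance_groups(ClusterId=cluster_id)["InstanceGroups"]
    core = next(g for g in groups if g["InstanceGroupType"] == "CORE")

    emr.modify_instance_groups(
        ClusterId=cluster_id,
        InstanceGroups=[{"InstanceGroupId": core["Id"], "InstanceCount": target_count}],
    )
    return f"Resizing cluster {cluster_id} to {target_count} worker nodes."
```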

With all this in place we simply had to expose each of the relevant Lambdas through an API. For this we used AWS API Gateway; you might be spotting an AWS theme to our solutions by now :)
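
Once the Lambdas sit behind API Gateway, any script or scheduled job can drive the same actions the bot uses. A hypothetical call (the endpoint, path and payload below are invented for illustration; the real API shape isn’t described here) could be as simple as:

```python
import requests

# Hypothetical endpoint and payload, for illustration only.
API_BASE = "https://example.execute-api.eu-west-1.amazonaws.com/prod"

resp = requests.post(
    f"{API_BASE}/clusters/j-2AXXXXXXGAPLF/scale",
    json={"target_count": 8},
    headers={"Authorization": "Bearer <token>"},
)
resp.raise_for_status()
print(resp.json())
```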

Now we had all the pieces we needed and only had to stitch them together into a working solution. Simple.

Architecture Overview

With our bot working and reliably answering the users’ standard requests, we could concentrate on extending the platform’s functionality, the Data Scientists had the freedom to manage their own clusters, and as a nice side effect we could stop unused clusters from being started unnecessarily, reducing our daily costs by 75%!

We learned a lot quickly while building the bot; some of our main takeaways were:

  • Keep the bot task-focused
    Build multiple bots if necessary
  • Expose management services through APIs
    It makes integration with other systems and processes easier while also allowing you to switch out the backend logic if needed without affecting downstream processes
  • Build in security from the start
    It’s much easier than going back later and trying to squeeze it in
  • Cost shouldn’t be the main driver
    The goal was to free up the team to perform more value-adding tasks, not to replace or remove them.

So now everybody is happy, and the bot has become an integral part of the project, with some cool feature extensions planned over the coming weeks. Reach out and let us know if you’ve built, or are planning to build, anything similar; we’d love to hear how other projects are tackling similar problems.

Data Reply is a Reply Group Company specialising in applying Advanced Analytics tools and techniques to address client needs in three key areas: Speed, Efficiency and Insight. We are an agile network of Data Scientists and Engineers in the UK, Germany and Italy with a very practical focus on technical solution delivery.
