MLOps for Conversational AI with Rasa, DVC, and CML (Part I)

Matthew Upson
Published in MantisNLP
Nov 29, 2021 · 3 min read


This is the first part of a series of blog posts that describe how to use Data Version Control (DVC) and Continuous Machine Learning (CML) when developing conversational AI assistants with the Rasa framework. This post is mostly an introduction to these three components. In the next post I’ll delve into the code, and how to get everything connected for Rasa MLOps bliss.

What is DVC?

If you’ve not heard of Data Version Control (DVC), you’ve been missing out. DVC is an exciting tool from iterative.ai that brings the benefits of source code version control to the management of your data. DVC extends Git’s functionality to cover your data wherever you want to store it, whether that is locally, on a cloud platform like AWS S3, or on a Hadoop Distributed File System (HDFS).

Like Git, DVC is language agnostic. It doesn’t matter what language you are developing in; DVC has you covered. We could extol the benefits of DVC for a whole post, but instead you should check out the DVC blog or YouTube channel.

DVC also works a bit like GNU Make, allowing you to define workflows as Directed Acyclic Graphs (DAGs). As well as tracking the dependencies and outputs of each stage, these pipelines can track the parameters used in a particular model run.
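To make the DAG idea concrete, here is a minimal Python sketch of how a Make-like tool can work out the order in which to run stages from their declared dependencies and outputs. This is only an illustration, not how DVC itself is implemented, and the stage names and file paths are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical stages: each lists the files it reads (deps) and writes (outs).
STAGES = {
    "test": {
        "deps": ["models/model.tar.gz", "tests/test_stories.yml"],
        "outs": ["results/metrics.json"],
    },
    "train": {
        "deps": ["data/nlu.yml", "config.yml", "reports/validation.txt"],
        "outs": ["models/model.tar.gz"],
    },
    "validate": {
        "deps": ["data/nlu.yml"],
        "outs": ["reports/validation.txt"],
    },
}


def execution_order(stages):
    """Return stage names ordered so that every stage runs after its inputs exist."""
    # Map each output file to the stage that produces it.
    producer = {
        out: name for name, stage in stages.items() for out in stage["outs"]
    }
    # A stage depends on another stage when it consumes one of that stage's outputs.
    graph = {
        name: {producer[dep] for dep in stage["deps"] if dep in producer}
        for name, stage in stages.items()
    }
    # static_order() raises graphlib.CycleError if the stages form a cycle.
    return list(TopologicalSorter(graph).static_order())


if __name__ == "__main__":
    print(execution_order(STAGES))
```

Because each stage only declares what it reads and writes, the order falls out of the graph rather than being written down by hand, which is the property that makes this style of pipeline easy to extend.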

What is CML?

Continuous Machine Learning, or CML, is another offering from iterative.ai, the team behind DVC. CML is essentially a GitHub Action and a series of Docker containers that help you run jobs on runners (instances) hosted by GitHub, on a cloud platform of your choice, or on machines in your local network. A workflow using EC2 instances triggered by GitHub would look like:

  1. You push a commit to GitHub with a code change or a hyperparameter tweak.
  2. GitHub launches the CML action, which uses Terraform to provision a cloud instance to your specification (the runner).
  3. CML runs a Docker container from pre-prepared CML images (or your own custom image).
  4. Your arbitrary commands are executed in the Docker container using the source code from your repo (this can be a DVC pipeline).
  5. Any artefacts produced by your code can be stored in DVC in the usual way, and metrics are reported automatically back into the PR.
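As a rough sketch of the last step, the commands run on the runner might include a small script that turns a metrics dictionary into a Markdown report, which a later CML step could then post to the PR. The function name, metric names, and values below are placeholders for illustration only, not real results or part of any CML API.

```python
def metrics_to_markdown(metrics, title="Test results"):
    """Render a flat {name: number} mapping as a Markdown table."""
    lines = [f"## {title}", "", "| Metric | Value |", "| --- | --- |"]
    for name in sorted(metrics):
        lines.append(f"| {name} | {metrics[name]:.3f} |")
    return "\n".join(lines) + "\n"


# Placeholder values, made up purely to show the output format.
example_metrics = {"intent_accuracy": 0.9312, "story_accuracy": 0.875}
report = metrics_to_markdown(example_metrics)

if __name__ == "__main__":
    print(report)
```

Writing the report to a file instead of printing it keeps the reporting step separate from whatever tool eventually attaches it to the PR.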

What is Rasa?

Rasa is an open source framework for developing Conversational AI (no, not chat bots; let them explain why). Rasa can leverage spaCy, but also has a lot of its own cool innovation going on under the hood. If your Conversational AI needs to cater for complicated use cases beyond customer support and the like, then Rasa is an excellent choice.

Why DVC and CML with Rasa?

With Rasa, it’s common practice to store all of your training data within a Git repo. This is typically because the training data is not very large, and because Rasa’s UI tool, Rasa X, can connect directly to your Git repo and sync training data as it is received and annotated. It therefore doesn’t make much sense to use DVC to version the training data itself, but DVC is still helpful for creating a DAG to manage training.

The people behind Rasa extol the virtues of Conversation-Driven Development (CDD), an iterative approach that favours interactions between users and AI to drive the development cycle. Two of the key tenets of this approach are:

  • Test: that your assistant always behaves as you expect; and
  • Track: measure its performance over time.

Rasa have created a GitHub Action to run its native testing functionality as part of a CI/CD pipeline, which is good, but it doesn’t natively track the metrics produced by the tests. Essentially you get the Test part, but have to do something else to get the Track part. With DVC we can do both, and with CML we can easily add it to our CI/CD pipelines.
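The Track part boils down to comparing the metrics from one run with those from a previous one. The sketch below shows that comparison in plain Python; it is an illustration of the idea rather than DVC’s own implementation, and the function name and metric names are hypothetical.

```python
def metrics_diff(baseline, current):
    """Compare two flat {name: number} mappings, metric by metric."""
    diff = {}
    for name in sorted(set(baseline) | set(current)):
        old = baseline.get(name)
        new = current.get(name)
        # A change is only defined when the metric exists in both runs.
        change = None if old is None or new is None else round(new - old, 6)
        diff[name] = {"old": old, "new": new, "change": change}
    return diff


if __name__ == "__main__":
    # Placeholder numbers, made up purely to show the shape of the result.
    previous_run = {"accuracy": 0.90, "f1": 0.80}
    latest_run = {"accuracy": 0.92, "precision": 0.85}
    for metric, row in metrics_diff(previous_run, latest_run).items():
        print(metric, row)
```

Metrics that appear in only one of the two runs are kept with a change of `None`, so a newly added or dropped test metric shows up in the comparison instead of silently disappearing.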

This post is just setting the scene for what is to come. In the next blog post I’ll get right into the details of how to get started.
