Lessons Learnt From Shipping An AI Application

Daniel Tan
AI2 Labs
Nov 28, 2020 · 5 min read

This is part of the MLOps series.

We finally launched a limited public beta for our Autopaint tool, https://creart.innovrepublic.com/, and these are lessons learnt during the past few months of development. Some of them might seem weird, but bear with me.

TLDR:

  1. Don’t skip design sprints.
  2. Treat AI deployments as independent deployments.
  3. Enable non-developers to edit static configuration of your application.
  4. Mirror the deployment setup as closely as possible to production.

Don’t Skip Design Sprints

Typically developers like to take a bunch of requirements and begin hacking away. My advice is to stop and think more about the application before you begin, especially for an AI application. Most products probably don’t need “machine learning” or “artificial intelligence” as their core technology, because all they really need is some good statistical algorithm (discounting how AI/ML can bring some good marketing, but that’s a business decision). However, if your application uses AI/ML as its core, then you probably want to take a step back and understand what your users want from AI/ML, and how they would be using it.

An unused product feature is as good as dead.

Users typically want to use AI/ML for any of the following reasons (check out this report):

  1. The product can learn. (57% of people surveyed think AI can do this)
  2. The product can replicate human interaction and interpret speech (35% and 37% respectively)

So, if your product claims to have AI, and it misses the baseline of what people think AI can do, you’re going to disappoint your users. By then, you might as well say your product has a proprietary algorithm and users will be none the wiser (looking at you, Uber).

As such, you probably want to reflect that in your product design, including UI/UX. You probably want to provide and highlight some of the following in your application:

  1. The ability to replicate human results (approximated results).
  2. The ability to learn (error correction).

Non-AI applications typically don’t provide this because their results typically carry a low risk factor. When was the last time you expected a calculator to fail? And yet AI applications will fail, because they only provide a good approximation of the results the users want. As such, AI applications should be designed as a loop, instead of a straight line.
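
To make that loop concrete, here is a minimal sketch, assuming a hypothetical `predict` call and a simple feedback log (none of this is our actual implementation): the application serves an approximate result, captures the user’s correction, and records it so the next training cycle can learn from it.

```python
import json
import time

def predict(image_bytes: bytes) -> dict:
    """Hypothetical model call: returns an approximate result for the user."""
    return {"style": "watercolour", "confidence": 0.82}

def log_feedback(request_id: str, prediction: dict, correction: dict) -> None:
    """Append the user's correction so the next training cycle can learn from it."""
    record = {
        "request_id": request_id,
        "prediction": prediction,
        "correction": correction,
        "timestamp": time.time(),
    }
    with open("feedback.jsonl", "a") as f:
        f.write(json.dumps(record) + "\n")

# The loop: approximated result -> user correction -> feedback -> next training cycle.
prediction = predict(b"...")                     # 1. the AI produces an approximation
correction = {"style": "oil"}                    # 2. the user fixes what the AI got wrong
log_feedback("req-001", prediction, correction)  # 3. error correction feeds the next cycle
```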

This is where design sprints come in: they provide some buffer for the development team (including designers, artists and techies) to think about how their program works.

Treat AI Deployments As Independent Deployments

Now the typical SDLC goes like this: planning→analysis→design→implementation→maintenance.

Even though it’s supposedly a cycle, maintenance doesn’t always go back to planning. Once a program hits maintenance, it more or less stays there, and a new version of the program restarts the cycle as a new project.

As for AI deployments, that’s usually not the case. During the development of AI applications I drew from my experience in game development, where a few things differ from the usual SDLC:

  1. Artists and game designers have their own development cycle, just like AI scientists, and it is separate from the tech cycle. This means that you actually have to plan your time accordingly to stagger between artists and game designers. This also means there are constant mini-cycles of iteration and communication going on instead of one big cycle with clear, designed stages.
  2. The maintenance step is a misnomer because you have to constantly monitor user behaviour to ensure that your application is doing what you think it should. Just like how you can’t expect players to play according to what you think the game is capable of, you should never expect the AI to produce the results the users expect. This means your deployment must be capable of being patched outside of the tech cycle, and at arbitrarily short notice (remember the racist AI bot?). For games, it’s stuff like asset bundles to update game mechanics and art; for AI, it is splitting model deployment from code deployment (see the sketch after this list).
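
As a rough illustration of that split (the paths and file names here are hypothetical, not our actual setup): the code deployment only knows where to look for a model, so a new model can be published to that location and picked up without rebuilding or redeploying the service.

```python
import os

# The code deployment only knows where to look; the model deployment decides what is there.
MODEL_DIR = os.environ.get("MODEL_DIR", "/models/autopaint")
MODEL_VERSION_FILE = os.path.join(MODEL_DIR, "CURRENT_VERSION")

def load_current_model() -> dict:
    """Load whichever model version is currently published under MODEL_DIR.

    Swapping the files under MODEL_DIR (a model deployment) changes behaviour
    without touching the application image (a code deployment).
    """
    with open(MODEL_VERSION_FILE) as f:
        version = f.read().strip()
    model_path = os.path.join(MODEL_DIR, version, "model.pt")
    # Hypothetical loader; in practice this would be torch.load, onnxruntime, etc.
    return {"version": version, "path": model_path}
```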

Enable Others To Edit Static Configuration Of The Application

AI engineers are usually focused on making sure their models run well, and AI development is never-ending. They are always in development mode, and the line between production and development is pretty thin. In addition, most AI engineers work primarily in Python, so you can’t expect them to understand Golang code or whatever.

This is fine and dandy because most AI applications are micro-service based architectures. So teams don’t usually come into conflict… until they do, because the AI engineer is likely the domain expert, and they will need to tweak some variables of your application. During development of our application, we began to understand just how much needs to be tweaked, especially since AI is the core of our application.

As such, we abstracted out configuration so that our AI engineers could tweak JSON files to what they needed, without asking the developers for help. This helped us immensely, especially when the developers were busy fixing bugs during a tight sprint while the models were being trained for deployment, because the AI engineers could fix some of the smaller items just by changing the configuration.
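
To give a feel for it (the file name and keys here are made up for illustration, not our actual config), the tunables live in a plain JSON file owned by the AI engineers, and the service simply reads them at startup:

```python
import json

# config.json is owned by the AI engineers; developers never hard-code these values.
# Illustrative contents:
# {
#   "score_threshold": 0.7,
#   "max_canvas_size": 2048,
#   "default_style": "watercolour"
# }
with open("config.json") as f:
    config = json.load(f)

score_threshold = config["score_threshold"]
max_canvas_size = config["max_canvas_size"]
```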

The same applied when it came to deploying the applications to different environments. The AI engineer abstracted the file paths for models and other micro-services into a JSON file, and the operations person just needed to maintain a different configuration per environment without touching the core code.
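
Sketched out, the idea looks something like this (file names and keys are hypothetical): one JSON file per environment holds the model paths and service endpoints, and the code only asks which environment it is running in.

```python
import json
import os

# One file per environment, maintained by ops:
# config.dev.json, config.staging.json, config.prod.json
env = os.environ.get("APP_ENV", "dev")

with open(f"config.{env}.json") as f:
    env_config = json.load(f)

model_path = env_config["model_path"]        # e.g. a mounted disk in prod, a local path in dev
inference_url = env_config["inference_url"]  # endpoint of the model micro-service
```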

Mirror The Deployment Setup As Closely As Possible To Production

We chose to use Kubernetes via Microk8s as our core method of deployment, and it paid off greatly when we moved from Aliyun to Azure. It took us a total of a week to get everything set up, and by the end of the week everything was working without a hitch, save for a gotcha where Azure doesn’t actually mount your new disk for you (https://docs.microsoft.com/en-us/azure/virtual-machines/linux/attach-disk-portal).

I created my own templating tool for managing environments (https://github.com/danieltanfh95/clojube), but you’re also welcome to use Kustomize or Helm. In general, this is what we did to achieve it:

  1. Containerization. This allowed us to prevent “it works on my machine” sort of problems. You especially don’t want to deal with issues like this during a tight deployment schedule.
  2. Decentralized, versioned configuration for K8s. Kubernetes allows you to save your deployment as YAML files, and it’s very powerful, because now you can tie your deployment to your code version! This means that, given some Kubernetes cluster, you can ensure that for a particular version of your application the cluster will be set up exactly as you intended for that version (see the sketch after this list).
  3. Decentralized CI/CD configuration. We used Gitlab for this. This allowed us to build and maintain separate Docker containers of our application, specific to some environment. This is especially important because you may have multiple production environments and they might need some change in configuration in some way. Doing this at the CI/CD level allows your developers to focus on getting features out instead of trying to fit their code to the environment. Also, gitlab-runners are decentralized, so we could set up a configured GPU machine to build our models, which the AI engineers could also change.
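
As a minimal sketch of point 2 (the template file name and image registry here are hypothetical, and in practice a tool like Kustomize, Helm, or clojube handles this): the deployment manifest lives in the repository with a placeholder image tag, and CI stamps it with the current commit, so the cluster state is reproducible for any code version.

```python
import subprocess
from string import Template

# deployment.yaml.tmpl lives in the repo next to the code, e.g.:
#   containers:
#     - name: autopaint
#       image: registry.example.com/autopaint:$IMAGE_TAG
tag = subprocess.run(
    ["git", "rev-parse", "--short", "HEAD"],
    capture_output=True, text=True, check=True,
).stdout.strip()

with open("deployment.yaml.tmpl") as f:
    manifest = Template(f.read()).substitute(IMAGE_TAG=tag)

with open("deployment.yaml", "w") as f:
    f.write(manifest)

# `kubectl apply -f deployment.yaml` now deploys exactly the image built from this commit.
```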

So there you go, some unconventional learnings from developing and deploying an AI application for public use.
