This post continues the previous story. If you haven’t read it yet, please do read it here.
In the previous post, we discussed why we chose Kubernetes and touched upon the various methods of installation. In this post, we will discuss how we use Kubernetes for both engineering and data science purposes.
As you may know, omni:us is a leading company in the ML/AI space that caters to the insurance sector and serves many insurance companies around the world. To be a market leader in ML/AI, we need to do various jobs such as annotating data, training ML models, and evaluating those models by running prediction pipelines multiple times a day for each customer. Different teams handle these tasks. Running infrastructure at this scale poses a very practical problem: how do we give every team the tools they need to run these tasks without requiring them to operate their own infrastructure?
We will discuss at a high level how we use Kubernetes for all the tasks mentioned above.
What is annotation? Annotation in machine learning / artificial intelligence is the process of labelling data, such as text and images, so that its contents become recognisable to machines. We require annotation because text-based data mining and information extraction systems that use machine learning techniques need annotated data sets to train their algorithms. At omni:us, annotation is considered the most important task, as it is the stepping stone for all the steps that follow.
We have developed our own product for annotating data according to our requirements at omni:us. It consists of multiple services operating in tandem that enable data scientists and data science interns to annotate data through a UI.
Here are the infrastructure requirements for our annotation system:
- Customers upload data to the annotation system through an API. Our workflow system then triggers a pipeline that performs OCR and other pre-processing operations. The data is then persisted to the filesystem.
- Users of the annotation system (in our case, data engineers and data scientists) interact with it over the Git protocol. We chose Git because it provides revision control natively. This data must be persisted.
- The system should scale as the number of annotation operations increases.
- Our data interns and data scientists access the system from the office or over VPN, so the system must be available over the network. Security is paramount.
- The entire system should be easy to set up and manage.
The Kubernetes cluster is set up as per best practices. We have Ansible playbooks and Terraform modules for this (depending on where the cluster is being provisioned). We use Helm to install our services, with Helm templates for creating persistent volumes and for the services themselves. All our Helm templates are as generic as possible, meaning the same charts can be used across cloud providers and on-premise installations. We have a script that takes a list of services as arguments and installs all the required services into the cluster of the current Kubernetes context.
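A script of this kind could look roughly like the sketch below. It is not our actual script: the `charts/<service>` directory layout, the function names and the rule that a release is named after its service are all assumptions made for illustration. Only `helm upgrade --install` and the `--kube-context` flag are standard Helm CLI.

```python
import subprocess
import sys


def build_helm_commands(services, chart_dir="charts", kube_context=None):
    """Return one `helm upgrade --install` command per requested service."""
    commands = []
    for service in services:
        # Assumption: release name and chart folder both match the service name.
        cmd = ["helm", "upgrade", "--install", service, f"{chart_dir}/{service}"]
        if kube_context:
            # Without this flag Helm targets the current kubectl context.
            cmd += ["--kube-context", kube_context]
        commands.append(cmd)
    return commands


def install(services, dry_run=True):
    """Print each command, and execute it when dry_run is False."""
    for cmd in build_helm_commands(services):
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Example: python install.py annotation-api ocr
    install(sys.argv[1:], dry_run=True)
```

Because `helm upgrade --install` creates a release when it is missing and upgrades it otherwise, the same invocation works for first-time installs and for later updates.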
What is training in machine learning? It is the process of providing an ML algorithm (that is, the learning algorithm, for example one used for NLP or CV) with training data to learn from. The output of this process is an ML model: the model artifact created by training.
Most data scientists divide their data (which comes from the annotation phase and, for incremental training, from prediction) into two portions: training data and testing data. The training data is used so that the machine learns to recognize patterns in the data, and the test data is used to see how well the machine can predict answers for data it has not seen, based on its training.
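As a minimal illustration of such a split, here is a sketch in plain Python. The function name, the 80/20 ratio and the fixed seed are assumptions chosen for the example, not a description of how our own tooling does it.

```python
import random


def split_dataset(samples, test_fraction=0.2, seed=42):
    """Shuffle samples reproducibly and split them into (train, test)."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(samples)
    # A seeded generator makes the split repeatable across runs.
    random.Random(seed).shuffle(shuffled)
    test_size = int(len(shuffled) * test_fraction)
    return shuffled[test_size:], shuffled[:test_size]


documents = [f"doc-{i}" for i in range(10)]
train, test = split_dataset(documents)
print(len(train), len(test))  # 8 2
```

Fixing the seed keeps the test set stable between experiments, so scores from different training runs remain comparable.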
The training process may require GPUs, depending on the algorithm used and the computational time required to run the experiment. The tighter the time constraints, the greater the need to run on GPUs.
At omni:us, we believe in code reusability. We decided to use the same code for both training and prediction (we will come to prediction shortly). Each algorithm gets its own microservice for training and prediction. The ML code is fronted by Flask, with custom logging built in. This code is containerised and Helm charts are created for it.
- The training ML service needs access to the data it should be trained on (the data from the annotation step). As mentioned above, the data is split into two datasets (training and testing). We have another microservice, called splitter, that takes care of this operation.
- The training service generates an ML model, which should be persisted to disk (for long-term needs and so that we can revert if a new model does not perform well). Another important step is sending the ML model to the prediction system, which should happen without any manual intervention.
- Since training involves GPUs and GPU servers are expensive to run, we don’t want the training service to be online all the time. We designed a system in which users submit training jobs; each job provisions a GPU node in the cloud, attaches it to Kubernetes, runs the training and persists the model to disk, after which the GPU node is de-provisioned.
- We use MLflow, which lets users (in this case, data scientists or whoever runs the training) version models and check experiment statistics at any point in time. It also provides a UI for inspecting training runs and an API for tech-savvy operators to interact with.
ML training runs as a Kubernetes Job. Our script uses Terraform templates to create a new GPU node pool, uses a Helm chart to deploy the Kubernetes Job to the cluster, and then deletes the node pool. The models generated in this step are persisted to disk, and we have a service that lets users choose which model to move to the prediction system.
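The provision, train and tear-down sequence can be sketched as follows. This is an illustration under stated assumptions rather than our actual script: the function name, the directory and chart paths, the convention that the Job is named after the Helm release, and the six-hour timeout are all hypothetical. The CLI calls themselves (`terraform -chdir=... apply/destroy -auto-approve`, `helm upgrade --install`, `kubectl wait --for=condition=complete`) are standard.

```python
import subprocess


def run_training_job(job_name, chart_path, terraform_dir, run=None, timeout="6h"):
    """Provision a GPU node pool, run one training Job, then remove the pool."""
    if run is None:
        # Default runner executes the real CLI tools and raises on failure.
        def run(cmd):
            subprocess.run(cmd, check=True)

    try:
        # Assumption: terraform_dir holds a module that defines the GPU node pool.
        run(["terraform", f"-chdir={terraform_dir}", "apply", "-auto-approve"])
        # Assumption: the chart creates a Kubernetes Job named after the release.
        run(["helm", "upgrade", "--install", job_name, chart_path])
        # Block until the Job reports the Complete condition or the timeout hits.
        run(["kubectl", "wait", "--for=condition=complete",
             f"job/{job_name}", f"--timeout={timeout}"])
    finally:
        # Always remove the expensive GPU nodes, even if a step above failed.
        run(["terraform", f"-chdir={terraform_dir}", "destroy", "-auto-approve"])
```

Putting the tear-down in a `finally` block is the key design choice here: a failed deployment or a timed-out training run still releases the GPU nodes instead of leaving them running. The `run` parameter exists only so the sequence can be exercised without touching real infrastructure.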
“Prediction” refers to the output of an algorithm after it has been trained on a historical dataset and applied to new data, typically to forecast the likelihood of a particular outcome. The word “prediction” can be misleading. Sometimes we predict a future outcome, as in weather prediction or “items you may like” on an e-commerce site; at other times we predict on events that have already completed or are about to complete, as in fraud detection.
Here at omni:us, we cater to insurance companies and provide software that supports the automation of claims handling with AI. As part of the initial, complex data ingestion that supports claims handling, we have a deep learning prediction service that extracts data from complex unstructured documents, and sometimes even from handwritten forms such as European Accident Statements. We run this ML service alongside other engineering services in a pipeline. Ours is a microservice architecture with base services such as OCR, pre-processing services, Redis, a Postgres DB, Airflow, and other analytics and monitoring services. Airflow is the workflow engine that orchestrates all these services. Here are our infrastructure requirements:
- Predicted data and Postgres metadata need to be persisted.
- Models pushed from the training system should be stored on a disk that the prediction service can access.
- Models used by the prediction service should be upgradable at will.
- Since we deal with the insurance sector, which is strictly regulated, data from the prediction service should not be kept long-term on the prediction cluster and should be moved to the training system for re-training.
Keeping all the above requirements in mind, we designed the best deployment model we could for running the prediction service on Kubernetes. This cluster also uses the same persistent disk model as annotation, and data is moved to the annotation system after prediction, either directly or through an intermediate service. This can be manual or automated depending on customer needs.
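To make the requirement that models be upgradable at will more concrete, here is one possible way to promote a model onto a disk shared with the prediction service. It is a hypothetical sketch, not our implementation: the `model-<version>` naming, the `CURRENT` pointer file and both function names are assumptions.

```python
import os
import shutil
import tempfile
from pathlib import Path


def promote_model(model_file, prediction_dir, version):
    """Copy a trained model to the prediction disk and mark it as current."""
    prediction_dir = Path(prediction_dir)
    prediction_dir.mkdir(parents=True, exist_ok=True)
    target = prediction_dir / f"model-{version}{Path(model_file).suffix}"
    shutil.copy2(model_file, target)

    # Write the pointer to a temporary file first, then swap it in with
    # os.replace so a reader never sees a half-written pointer.
    fd, tmp_name = tempfile.mkstemp(dir=prediction_dir)
    with os.fdopen(fd, "w") as tmp:
        tmp.write(target.name)
    os.replace(tmp_name, prediction_dir / "CURRENT")
    return target


def current_model(prediction_dir):
    """Return the path of the model the prediction service should load."""
    prediction_dir = Path(prediction_dir)
    return prediction_dir / (prediction_dir / "CURRENT").read_text().strip()
```

Because older versions stay on disk, reverting to a previous model is just another call to `promote_model` with the earlier version.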
One thing to note here is that we took a modular approach when designing and building these microservices, since an installation could follow a SaaS model, in which we manage the services for the customer, or an on-premise model, in which customers install and manage the services themselves. With on-premise installations, different companies use different software for the same purpose. For example, one company may use ELK for logging while another uses Splunk. We recommend and ship our solution with, say, ELK for logging. At the same time, thanks to our modular approach, customers can integrate these services with Splunk without any re-architecture. The same applies to monitoring, data transfer between the annotation and prediction systems, and so on. Needless to say, we use Helm charts for deploying the prediction services too.
Given that we use Kubernetes and Helm so extensively for shipping services, we will discuss how we use them, along with their best practices, in the following posts.