A Practical Guide to AI Product Management: Part 2
ML team management, product planning, development process
In Part 1 we covered the groundwork an AI PM needs to do. In Part 2 let’s start with how to manage your ML team.
In addition to your typical application development team structure, you will need engineers dedicated to ML development and deployment. The exact number, skill set and experience depend on the project specifics. You can either have ML engineers and application developers as 2 units within a single team or as 2 separate teams. In either case, both should work in parallel but certainly not in silos. A single PM should own the outcome of the entire ML feature and act as a bridge between the two units/teams to ensure everyone is on the same page and able to integrate their code easily.
Ensure everyone from both units is clear about the ML feature’s job-to-be-done and constraints around implementation, preferably by including these in the product’s first principles. Does your feature need real-time inference? Does the model have size constraints? Is there an accuracy benchmark? How often and at what scale will inference run? This clarity will drive essential architecture and development decisions. For example, the ML unit will understand input data and inference result formats, how to tackle the accuracy vs size vs inference time tradeoff (more on that in Part 3), how frequently the inference pipeline will run, etc. Similarly, the application developers will be able to decide on things like optimal data communication (sockets, REST API, pub/sub, etc) while UI/UX designers will know how to best communicate results to the user.
As with any software product, an architecture diagram is extremely helpful for the team to understand data flow and processes. If your use case is quite complex, consider creating separate architecture diagrams for ML and application development.
- Kanban for machine learning development
- Two-week sprints for application development
- Run both in parallel but in sync with each other
Machine learning software development is an experimental process. While you may have an idea of which models will work for your use case, it’s difficult to pinpoint the optimal one without experimentation. Generally, teams will choose some models to train depending on the problem statement and then shortlist a few based on initial inference results. During this process, they may have to iterate several times by changing parameters and training techniques. The shortlisted models may be trained on a larger dataset to improve accuracy or, depending on the application, optimized for inference time, model size or the target computing device. Training time can vary widely according to the selected model and the ML technique. For example, neural networks can take anything from hours to days to finish training depending on the model architecture and data volume. Based on the experiment results and the previously defined constraints, a single model will then be finalized for deployment.
Because of these factors, the entire development roadmap often cannot be clearly defined beforehand. Rather, every stage depends on the results of experiments in the previous one. Time-bound sprints are not a great option in this case. A Kanban board to track current tasks + a regularly updated backlog works best for managing ML teams. The backlog should ideally have a single story for each experimental approach. The recommended Kanban stages are — On hold, To-do, In progress, Review, Done.
On the other hand, for application development, it is relatively easier to create a roadmap and organize development into 2-week sprints containing user stories with clearly defined outcomes. The application team can work efficiently using this tried and tested approach that PMs should already be familiar with.
Now that you know how to set up the team, let’s break down each step in the development process.
Data ingestion pipeline
The very first step in building an ML model is ensuring you have access to the right data. A data ingestion pipeline involves collecting data and storing it in a usable format in an easily accessible location for ML model training and inference. Here’s a (non-exhaustive) list of questions you should ask at this stage:
- What is the format of the data?
Data can be in the form of images, video, audio, text, CSV, JSON, etc.
- What is the data source?
The data can originate from cloud storage, database, data warehouse, logging tool, user analytics tool, camera, microphone, etc.
- Is the data generated digitally or in the real world?
An example of data generated in the real world is a video of cars moving through a traffic intersection captured by a camera.
An example of data generated digitally can be a user’s purchase history or browsing patterns on an e-commerce website.
- Where will you store the data?
This is the location from where the data will be accessed for pre-processing and training.
- Is it continuously generated or in batches?
Identify whether your data needs to be constantly streamed or whether it can be moved in large batches. Both will require different cloud architectures. In either case, make sure you know how much data you will be receiving per unit of time (example — 5 GB per day).
Based on answers to the above questions, your team will build the pipeline to send data from source to destination. Depending on your problem statement some options are using a REST API, a cloud function, a pub/sub service, a data streaming service, an offline to online service like AWS Snowball or whatever else works best. Once your data is where you want it and in the appropriate format, move on to pre-processing.
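As a toy illustration of the “source to destination, in the appropriate format” idea, here is a minimal batch ingestion sketch. It converts CSV files from a source directory into JSON lines at a destination; the function name, paths and formats are all hypothetical, and a real pipeline would instead use your cloud provider’s SDK (cloud functions, pub/sub consumers, etc.).

```python
import csv
import json
from pathlib import Path

def ingest_batch(source_dir: str, dest_dir: str) -> int:
    """Convert every CSV file in source_dir to JSON lines in dest_dir.

    A stand-in for one stage of a real ingestion pipeline; returns the
    number of files processed.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    count = 0
    for src in Path(source_dir).glob("*.csv"):
        with src.open(newline="") as f:
            rows = list(csv.DictReader(f))
        out = dest / (src.stem + ".jsonl")
        with out.open("w") as f:
            for row in rows:
                f.write(json.dumps(row) + "\n")
        count += 1
    return count
```

The same shape applies whether the trigger is a schedule, an upload event or a message queue; only the source and sink change.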
Real-world data is messy. It needs to be scrubbed clean to make it work for ML models. Depending on your problem statement, the pre-processing steps vary. For a regression model, your team will do an exploratory data analysis to visualize the data and its correlations, finalize which features (columns) are relevant to the model and whether they need to be scaled. Categorical (text-based) features will need to be encoded appropriately. Identify the independent variables — the features input to the model, and the dependent variable — the value to be predicted based on the input features.
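To make the scaling and encoding steps concrete, here are toy, dependency-free versions of two common transforms. In practice your team would reach for pandas and scikit-learn; these helper names are made up for illustration.

```python
def min_max_scale(values):
    """Scale a numeric feature into the [0, 1] range."""
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1  # avoid dividing by zero on constant columns
    return [(v - lo) / span for v in values]

def one_hot_encode(values):
    """Encode a categorical (text-based) feature as one-hot vectors."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]
```

For example, `min_max_scale([10, 20, 30])` gives `[0.0, 0.5, 1.0]`, and each category becomes its own binary column so the model never treats “dog” as numerically greater than “cat”.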
For a computer vision application, you may need to extract frames from videos, work on denoising and identify classes for classifying the images or objects in them. In a data set with images of animals, each animal (dog, cow, cat) will be a separate class.
At this stage, you may also want to mask any personal data like people’s faces if required for compliance reasons. Pre-processing techniques vary widely depending on the problem statement and can be learned as a part of online ML courses. If your team doesn’t pay careful attention to get this right, the model output will be flawed.
Finally, the data also needs to be split into a training set (used to train your ML model) and a test set (used to evaluate the performance of your ML model). A rule of thumb is 80% of data goes into the training set and 20% into the test set.
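The 80/20 rule of thumb can be sketched in a few lines. The fixed random seed is an assumption for illustration, not a requirement; it simply keeps the split reproducible across experiments.

```python
import random

def train_test_split(data, test_ratio=0.2, seed=42):
    """Shuffle records and split them into training and test sets."""
    rng = random.Random(seed)  # fixed seed -> reproducible split
    shuffled = data[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_ratio))
    return shuffled[:cut], shuffled[cut:]
```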
Data annotation, or tagging, is the process of telling your model what the data you feed it contains. This step is necessary for deep learning ML applications like computer vision or natural language processing (NLP). Let’s say you want to train a computer vision model to count the number of cars and bikes that cross a traffic intersection. In every frame of video captured by the camera, you will need to draw a bounding box around the vehicle and label it with the appropriate class — car or bike in this case.
For a self-driving car or cancer detection application, object detection using bounding boxes is not sufficient because the models need to detect exact shapes. Instead, the annotation involves creating polygon masks that trace the shapes of objects, for use in semantic segmentation models.
The result of the image annotation process is typically an XML or JSON file that contains the coordinates of the annotations along with a class tag that identifies the object. The images along with the annotation data file are used to train the model. During the training process we are teaching the model that in an image, a box with the coordinates (x1,y1), (x2,y1), (x1,y2) and (x2,y2) contains object A.
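Schemas differ between tools (Pascal VOC uses XML, COCO uses JSON), but a JSON annotation record for one frame of the traffic example might look roughly like the sketch below. All field names and coordinates here are illustrative, not any particular tool’s format.

```python
import json

# Hypothetical annotation record for one video frame.
annotation = {
    "image": "frame_0001.jpg",
    "objects": [
        {"class": "car",
         "bbox": {"xmin": 120, "ymin": 80, "xmax": 310, "ymax": 220}},
        {"class": "bike",
         "bbox": {"xmin": 400, "ymin": 150, "xmax": 470, "ymax": 260}},
    ],
}

def to_json(record):
    """Serialize an annotation record for storage alongside the image."""
    return json.dumps(record, indent=2)
```

During training, the image file and this record are read together so the model learns that the pixels inside each box belong to the tagged class.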
Ensure that your team keeps the two sets separate: annotations for the test data set are never fed to the model during training and are used only to evaluate its performance afterwards. Annotation is a time-consuming process. For applications with huge data sets, it is usually done by a dedicated team of annotators which may be in-house or outsourced to an external agency.
As an AI PM one of the most important things you need to do is set up streamlined annotation workflows and ensure that there is a rigorous QA process before the annotated data is used for training. If possible, have one annotation QA reviewer for every 10 annotators. This reviewer’s only job is inspecting and approving annotated datasets one sample at a time. Even then, before starting model training ensure that the ML engineers take another final look at the annotations to make sure everything is good to go.
In machine learning, garbage data in = garbage results out
In most cases, you will not need to build your own data annotation tool and can go with one of the many open source options available. An external agency might also have its own annotation platform.
Once your pre-processed and annotated data is ready, it’s time to train selected models on the training dataset. Depending on which models you’re using and the size of the dataset, your ML team may decide to train on their local systems or in a cloud instance.
To train massive models, especially deep learning neural networks for computer vision or NLP, you will usually use a GPU. GPUs have an advantage for deep learning training because of their parallel computing architecture. Deep learning training consists of millions of matrix multiplication operations (recall your linear algebra class if you have an engineering background). Instead of performing these operations sequentially as on a CPU, a GPU can perform them in parallel because of its distinct architecture. Nvidia GPUs are pretty much the industry standard for deep learning. Nvidia has also developed tools like CUDA, a parallel computing platform and programming model for working with GPUs.
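A naive matrix multiply makes the parallelism argument concrete: every output cell is an independent dot product, so a GPU can compute thousands of them simultaneously while a CPU works through these loops largely one at a time.

```python
def matmul(A, B):
    """Naive matrix multiply of A (m x n) and B (n x p).

    Each output cell C[i][j] is an independent dot product of row i of A
    with column j of B -- exactly the kind of work a GPU parallelizes.
    """
    rows, inner, cols = len(A), len(B), len(B[0])
    return [[sum(A[i][k] * B[k][j] for k in range(inner))
             for j in range(cols)] for i in range(rows)]
```

In a deep learning framework, the same operation is dispatched to a CUDA kernel that computes all the cells at once rather than looping over them.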
Training time varies widely and can span from minutes to hours to days depending on the model, data volume and hardware used. Work closely with your ML team to define a training strategy that covers which models they are starting with, the number of experiments to be conducted, and the hardware to be used. Based on this they will be able to estimate training time. Log all experiments in the backlog and move the ones actively being worked on to the Kanban board. Iterate through the framework described in the team management section.
Once a model completes training, feed it the test dataset but without the feature that you want it to predict (only the independent variables). If you’re predicting car prices, leave out the price column which is the dependent variable. For annotated datasets, just feed the images and not the annotation files. The model should return predictions based on the training process, aka inference results. Compare the inference results with the actual data that you didn’t feed it. The closer the match, the more accurate the model is. This, of course, is a simplified explanation of the process. There are mathematical methods to evaluate model accuracy but you can pick those up from any ML course.
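For a classification task, the simplest version of this comparison is exact-match accuracy over the held-out labels. This is a sketch, not a substitute for the fuller evaluation metrics (precision, recall, and so on) covered in ML courses.

```python
def accuracy(predictions, actuals):
    """Fraction of predictions that exactly match the held-out labels."""
    if len(predictions) != len(actuals):
        raise ValueError("prediction and label counts must match")
    correct = sum(p == a for p, a in zip(predictions, actuals))
    return correct / len(actuals)
```

For instance, if the model labels four test images and gets three right, accuracy is 0.75.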
Always ensure that your team is documenting training and inference metrics for all your experiments because this data is invaluable. Meticulous documentation can save you precious time and money while finalizing the production model. Some metrics your team should be recording are training & inference time, number of training epochs, training loss, inference accuracy, hardware used, model size, cost incurred, and engineers’ qualitative assessment comments.
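Even a shared CSV file beats no record at all. A minimal logging helper might look like the sketch below; the column names mirror the metrics listed above and are otherwise arbitrary, and a real team might use a spreadsheet or an experiment-tracking tool instead.

```python
import csv
from pathlib import Path

# Illustrative column set based on the metrics discussed above.
FIELDS = ["experiment_id", "model", "epochs", "training_loss",
          "inference_accuracy", "training_time_min", "hardware",
          "model_size_mb", "cost_usd", "notes"]

def log_experiment(path, record):
    """Append one experiment's metrics to a shared CSV log."""
    file = Path(path)
    is_new = not file.exists()
    with file.open("a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if is_new:
            writer.writeheader()  # write the header once, on first use
        writer.writerow(record)
```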
That wraps up Part 2! Part 3 covers how to pick the final model, integrate it with your application and deploy it in production.