MLOps at Edge Analytics | Model Development

Part Three of Five

Connor Davis
Edge Analytics
7 min read · Apr 25, 2023


Image created with DALL-E 2.

As machine learning models become more widely deployed, ML practitioners have shown increasing interest in MLOps. In our introductory blog, we give a brief background on how we think about MLOps at Edge Analytics.

In previous posts, we examined tools for data storage and I/O and data processing; here, we look at model development. During this step, machine learning models are trained and evaluated on the processed data. This is part three of our five-part series on building an MLOps pipeline.


We first need to decide on a framework for training and evaluating models. Choosing one means answering the following questions:

  • What library do we want to use for modeling (TensorFlow, PyTorch, JAX, etc.)?
  • What modeling architecture is appropriate for our data format and size?
  • How do we define a successful model?
  • What are appropriate evaluation metrics?

Recall that we are using the Blood Images Dataset from Kaggle as our example problem, in which we predict cell type from a microscope image. We use convolutional neural networks (CNNs) for this task since they are the standard architecture for image classification. To build the model, we can choose from a number of deep learning platforms like PyTorch, JAX, or TensorFlow. In this example, we use TensorFlow.

Model configuration

CNNs, even simple ones, have many configurable hyperparameters. We can build limitless flavors of CNNs by varying the number of convolutional layers, max pooling layers, dropout layers, etc. Furthermore, each of those layers has its own hyperparameters, including convolutional kernel size and stride, pooling kernel size and stride, dropout rate, and more.

We believe that a good pipeline for model training should have a fair amount of structure while giving an ML engineer access to the knobs necessary to succeed on a project. For that reason, we typically wrap the code to create and compile the model into a main script that accepts an easily edited configuration file. We generally use a YAML file for configuration because its structure is human-readable, and its contents translate into intuitive Python base types. The configuration for our simple CNN is written in the YAML file like this:

cnn_config:
  # The first three lines specify the input data shape.
  input_x_pixels: 480
  input_y_pixels: 480
  input_channels: 1
  # The convolutional layers.
  num_convolutional_layers: 3
  num_convolutional_filters: 5
  convolutional_kernel_size: 3
  convolutional_kernel_stride: 1
  convolutional_activation_fxn: "relu"
  # The max pooling layers.
  num_max_pool_layers: 1
  max_pool_size: 2
  max_pool_stride: 2
  # The output activation and number of labels.
  output_activation_fxn: "softmax"
  output_len: 6
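To make the mapping from configuration to model concrete, here is a minimal sketch of how a dictionary loaded from this YAML could be turned into a Keras model. The build_cnn helper name and the exact layer ordering are illustrative assumptions rather than our production code:

import tensorflow as tf

def build_cnn(cfg: dict) -> tf.keras.Model:
    """Build a simple CNN from a cnn_config-style dictionary (illustrative only)."""
    inputs = tf.keras.Input(
        shape=(cfg["input_x_pixels"], cfg["input_y_pixels"], cfg["input_channels"])
    )
    x = inputs
    # Stack the convolutional layers described in the config.
    for _ in range(cfg["num_convolutional_layers"]):
        x = tf.keras.layers.Conv2D(
            filters=cfg["num_convolutional_filters"],
            kernel_size=cfg["convolutional_kernel_size"],
            strides=cfg["convolutional_kernel_stride"],
            activation=cfg["convolutional_activation_fxn"],
        )(x)
    # Add the max pooling layers.
    for _ in range(cfg["num_max_pool_layers"]):
        x = tf.keras.layers.MaxPool2D(
            pool_size=cfg["max_pool_size"], strides=cfg["max_pool_stride"]
        )(x)
    x = tf.keras.layers.Flatten()(x)
    outputs = tf.keras.layers.Dense(
        cfg["output_len"], activation=cfg["output_activation_fxn"]
    )(x)
    return tf.keras.Model(inputs=inputs, outputs=outputs)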

We can also use this YAML configuration file to specify other model-related information such as batch size, number of epochs, loss functions, and evaluation metrics. We can produce an entirely new model by simply editing the config file and passing it to the training script from the command line, like this:

python train.py --config model_config.yaml

Having all this information accessible in a single file makes editing the model structure convenient and reproducible since we log the config file with the other outputs of the training run.
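Putting these pieces together, a minimal train.py entry point might look like the sketch below. The compile settings and the load_dataset helper are illustrative placeholders, not our actual pipeline code:

import argparse
import yaml

def main():
    parser = argparse.ArgumentParser(description="Train a CNN from a YAML config.")
    parser.add_argument("--config", required=True, help="Path to the YAML config file.")
    args = parser.parse_args()

    # Load the human-readable config into plain Python types.
    with open(args.config) as f:
        config = yaml.safe_load(f)

    # Build and compile the model from the config (see the build_cnn sketch above).
    model = build_cnn(config["cnn_config"])
    model.compile(
        optimizer="adam",
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"],
    )

    # load_dataset is a hypothetical placeholder for the data processing step
    # covered in the previous post.
    train_ds, val_ds = load_dataset(config)
    model.fit(train_ds, validation_data=val_ds, epochs=config.get("num_epochs", 10))

if __name__ == "__main__":
    main()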

The training process

Model training has been simplified significantly in recent years by packages like TensorFlow, PyTorch, and JAX. High-level interfaces in these packages abstract away many of the pain points of deep learning, so that we can more rapidly iterate on new ideas rather than reinventing the wheel each time. We start with our model configuration file and the processed data (split into training and testing sets):

Overview of the model training and evaluation process.

When we run cross-validation, we repeatedly split the data into train and validation sets and repeat the cycle of building, training, and evaluating the model on each split. Note that most projects will also have an additional holdout test dataset that is used only for evaluating the final model prior to deployment.
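As a sketch of what that cross-validation loop could look like, assuming the processed images and integer-encoded labels are available as NumPy arrays and reusing the hypothetical build_cnn helper from above:

import numpy as np
from sklearn.model_selection import StratifiedKFold

def cross_validate(x: np.ndarray, y: np.ndarray, cfg: dict, n_splits: int = 5):
    """Train one model per fold and return the per-fold validation accuracy."""
    fold_scores = []
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)
    for train_idx, val_idx in skf.split(x, y):
        # Rebuild the model from scratch for each fold.
        model = build_cnn(cfg["cnn_config"])
        model.compile(optimizer="adam",
                      loss="sparse_categorical_crossentropy",
                      metrics=["accuracy"])
        model.fit(x[train_idx], y[train_idx],
                  epochs=cfg.get("num_epochs", 10),
                  batch_size=cfg.get("batch_size", 32),
                  verbose=0)
        _, accuracy = model.evaluate(x[val_idx], y[val_idx], verbose=0)
        fold_scores.append(accuracy)
    return fold_scores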

For each training run, we log relevant information to help us reason about the model in post-hoc analysis. In this basic example, these outputs include the training log, model file, and model evaluation results. Altogether, they help us keep track of the models we have built and how well they perform. We’ll dive deeper into model logging and tracking in the next blog.
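In a Keras-based pipeline, much of this logging can be handled with built-in callbacks; the file names below are illustrative:

import tensorflow as tf

callbacks = [
    # Write per-epoch metrics to a CSV training log.
    tf.keras.callbacks.CSVLogger("training_log.csv"),
    # Save the best model (by validation loss) seen during training.
    tf.keras.callbacks.ModelCheckpoint(
        "model.h5", monitor="val_loss", save_best_only=True
    ),
]

# Passed to model.fit alongside the training data, e.g.:
# model.fit(train_ds, validation_data=val_ds, epochs=10, callbacks=callbacks)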

Optimizing model hyperparameters

All models have hyperparameters that affect model performance. In the case of our simple CNN, we can modify the model layers until we are happy with how it performs on the validation dataset (note: not the final holdout test set!). Manual tuning of hyperparameters is often not feasible, so we turn to solutions for scaled hyperparameter searching. These solutions will train and evaluate many architectures to find those with the best performance.

The KerasTuner package is built on Keras functionality and provides simple and efficient methods for hyperparameter searching. With it, we can specify ranges of hyperparameter values over which to search, including discrete values and continuous distributions for any given hyperparameter. We often configure our pipelines to run searches using the same configuration file as a typical training run, where each hyperparameter of interest holds a list of values rather than a single value. Downstream, the code contains the logic to treat these lists as axes of a search space.

Configuration settings for a single model build versus hyperparameter search.

KerasTuner allows the user to specify how many trials to run, which metrics to use for model evaluation, and which search method to use for selecting hyperparameters (e.g., random search, Bayesian optimization, Hyperband, etc.). And conveniently, if the tuner chooses an impossible model configuration, it will skip that trial without interrupting the entire search. The KerasTuner package also includes a class for hyperparameter searches on scikit-learn models!
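Here is a rough sketch of how the list-valued config entries might become a KerasTuner search space. The config keys, the reuse of the hypothetical build_cnn helper, and the tuner settings are illustrative assumptions:

import keras_tuner as kt
import tensorflow as tf
import yaml

# Assume the tuning config has been written to tune_config.yaml.
with open("tune_config.yaml") as f:
    tune_config = yaml.safe_load(f)

def build_model(hp: kt.HyperParameters) -> tf.keras.Model:
    """Map list-valued config entries onto hyperparameter choices."""
    cfg = dict(tune_config["cnn_config"])
    for name, value in cfg.items():
        if isinstance(value, list):
            cfg[name] = hp.Choice(name, value)  # each list becomes a search axis
    model = build_cnn(cfg)  # hypothetical config-to-model helper sketched earlier
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

tuner = kt.RandomSearch(
    build_model,
    objective="val_accuracy",
    max_trials=20,
    directory="tuning_results",
    project_name="cnn_tuning",
)
# tuner.search(train_ds, validation_data=val_ds, epochs=10)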

As with running the simple train.py script above, all values needed for the hyperparameter sweep are saved in a single tune_config.yaml file. So if we need to change our sweep values, we can do so in the config file and run:

python tune.py --config tune_config.yaml

Remote training with Ray

Local development is excellent for writing code and unit testing models, but it is typically resource-limited for training and tuning purposes. Ray is a library that offers the ability to scale your Python applications with ease, spinning up remote instances on AWS, GCP, Azure, and more to handle workloads of any size. Ray strikes an impressive balance between being easy to use and configurable. At Edge Analytics, we’re big fans of Ray, and we’re in good company.

Getting started with Ray is easy. We recommend this tutorial to launch your own Ray cluster on AWS. Using the tutorial will give the Ray workers access to your S3 buckets auto-magically. Once the cluster is properly configured, you can send your local Python training script to be run on a remote instance.

Ray offers both a Python API and a command-line interface. With our goal of maintaining as much independence from any individual platform as possible, we use the CLI here. Since our data I/O for this example is cloud-based, a simple command can kick off a training job on the Ray cluster without any modifications to the local training code:

ray job submit --working-dir . \
--runtime-env runtime.yaml \
--entrypoint-num-gpus 1 \
-- python train.py --config model_config.yaml

Here are a few helpful tips when working with Ray:

  1. Spin up a cluster with a small head node (e.g., m5.large) to keep running costs low. This instance runs constantly and waits for new requests, and you can stop it when not in use.
  2. Your AWS vCPU quota needs to be high enough to accommodate the number of workers you want, and you will likely have to request an increase.
  3. S3 permissions for your Ray workers on AWS can sometimes create issues. This thread is extremely helpful for troubleshooting.
  4. Setting up a runtime environment YAML file can help clean up your job submission command while adding environment variables and pip/conda environment specifications (see the example after this list).
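For reference, a small runtime.yaml along these lines might look like the following; the package pins and environment variable are illustrative, not required values:

# runtime.yaml: runtime environment passed via `ray job submit --runtime-env runtime.yaml`
pip:
  - tensorflow==2.11.0
  - keras-tuner==1.3.5
  - pyyaml
env_vars:
  AWS_DEFAULT_REGION: "us-west-2"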

Many other great resources for training models on remote instances exist, such as Vertex AI at Google and SageMaker at AWS. We have experience running model training scripts on AWS SageMaker Script Mode instances and have found it to be an involved and restrictive process. Comparatively, Ray is easy to set up and flexible, making it a great resource for our example pipeline.

Up next

In order to get our model just right, we might have to train countless variations of it. When we have hundreds or thousands of model results to sort through, how do we find the best ones? And once we’ve identified those models of interest, how can we ensure reproducibility? These questions are addressed by our next step in the pipeline: model tracking.

Machine learning at Edge Analytics

Edge Analytics helps companies build MLOps solutions for their specific use cases. More broadly, we specialize in data science, machine learning, and algorithm development both on the edge and in the cloud. We provide end-to-end support throughout a product’s lifecycle, from quick exploratory prototypes to production-level AI/ML algorithms. We partner with our clients, who range from Fortune 500 companies to innovative startups, to turn their ideas into reality. Have a hard problem in mind? Get in touch at info@edgeanalytics.io.
