MLOps at Edge Analytics | A Data Processing Pipeline

Part Two of Five

Connor Davis
Edge Analytics
6 min read · Apr 24, 2023


Image created with DALL-E 2.

As machine learning models become more widely deployed, ML practitioners have shown increasing interest in MLOps. In our introductory blog, we give a brief background on how we think about MLOps at Edge Analytics.

In this post, we’ll focus on approaches for transforming raw data into a form we can feed into a model. Cleaning, standardizing, and appropriately formatting raw data are all critically important in the machine learning development cycle. Building data processing pipelines that are flexible, robust, easy-to-understand, and scalable is one of the best investments an ML team can make. This post is part two of our five part series on building an MLOps pipeline.

You can find the other blogs in the series by following the links below:

In the first part of this series, we discussed how we can store our raw data with AWS S3, a widely used object storage service. Now, we turn our focus to preparing that data for modeling. Some key considerations for choosing data processing tools include:

  • What format is our raw data currently in?
  • What format does the data need to be in for modeling?
  • How fast does our processing pipeline need to be?
  • How will we version our processed dataset so that we can re-use it in the future?

Of all the steps in an MLOps pipeline, data processing is the least likely to be plug-and-play and often requires a bespoke pipeline. The answers to the questions above will likely depend heavily on the data type (images, time series, tabular, etc.), file format, and size. As such, there is no single best tool for all of your processing needs. Still, we’ll highlight a few of our favorite data wrangling tools in Python — native dataclasses, abstract base classes, and pyserde.

Dataclasses

One of the first considerations when building a data processing pipeline is how the raw, intermediate, and fully processed data will be represented in memory. We find that structuring the data into Python dataclasses results in highly expressive data structures with clean syntax and minimal engineering overhead. Recall from the prior blog that we’re demonstrating components of our MLOps pipeline on the Blood Images Dataset from Kaggle. Representing white blood cell images and associated metadata can look like this:

Example dataclass for images.
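A minimal sketch of such a dataclass follows; the specific fields (pixels, cell_type, source_path, split) are illustrative assumptions rather than our exact schema.

from dataclasses import dataclass
from pathlib import Path
from typing import Optional

import numpy as np


@dataclass
class Image:
    """A single white blood cell image and its associated metadata."""

    pixels: np.ndarray           # raw pixel data, e.g. with shape (height, width, 3)
    cell_type: str               # label from the dataset, e.g. "NEUTROPHIL"
    source_path: Path            # where the raw file lives locally or in S3
    split: Optional[str] = None  # optional "train"/"test" assignment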

For each image, we create an Image instance as an information-rich format for storing data. The full dataset can then be represented as a list of Images. This approach has slightly more overhead than dealing with raw arrays, but is much less likely to introduce bugs. Dataclasses also have lower engineering overhead than other solutions, such as protocol buffers. Flatter data structures and protocol buffers are effective tools that we use when appropriate, but for quick and effective pipelines we prefer native dataclasses.

Abstract base classes

Although data manipulation methods will vary from project to project, certain overarching steps are standard across data pipelines. For example, we expect to use methods for loading raw data, standardizing it, and dividing the processed data into train and test sets. Python abstract base classes help in such cases where code is project specific, but we want a broadly applicable pipeline architecture.

An abstract base class acts as a roadmap for all child classes that inherit from it. An example abstract base class for a data processing pipeline might look as follows:

Example abstract base class for a data pipeline.
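A sketch of what that base class could look like; the three method names (load_raw_data, process_data, train_test_split) are assumptions chosen to mirror the steps described above.

from abc import ABC, abstractmethod


class BasePipeline(ABC):
    """Roadmap that every project-specific data pipeline must follow."""

    @abstractmethod
    def load_raw_data(self):
        """Pull raw data from storage (e.g. an S3 bucket) into memory."""

    @abstractmethod
    def process_data(self):
        """Clean and standardize the raw data."""

    @abstractmethod
    def train_test_split(self):
        """Divide the processed data into train and test sets."""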

Note that the BasePipeline class inherits from the ABC class in the abc module, and every major method we will define for a specific data processing pipeline is decorated with @abstractmethod.

Now say we want to write an example data processing pipeline for our white blood cell images and call it BloodImagesPipeline. We can write this class as a child of BasePipeline. Doing so will enforce that the three abstract methods above are always implemented. The BloodImagesPipeline class might look like this:

Example data pipeline.
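A sketch of the child class, building on the BasePipeline and Image sketches above; the constructor argument and method bodies are hypothetical placeholders.

from typing import List


class BloodImagesPipeline(BasePipeline):
    """Data processing pipeline for the Kaggle Blood Images Dataset."""

    def __init__(self, bucket: str):
        self.bucket = bucket            # hypothetical S3 bucket holding the raw images
        self.images: List[Image] = []

    def load_raw_data(self):
        # Download raw images from S3 and wrap each one in an Image dataclass.
        ...

    def process_data(self):
        # Resize and normalize the images, and encode the cell type labels.
        ...

    def train_test_split(self):
        # Split the processed Images into train and test sets.
        ...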

We can now instantiate this BloodImagesPipeline and use it to download, process, and split our dataset for modeling. If we had failed to implement any of the three abstract methods defined in the parent, or had given any of them a different name, BloodImagesPipeline would fail to instantiate. In this way, abstract base classes can guide the development of data processing pipelines across projects, no matter how different those projects are.
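For example, a hypothetical end-to-end run could look like this (the bucket name is made up):

pipeline = BloodImagesPipeline(bucket="blood-images-raw")
pipeline.load_raw_data()
pipeline.process_data()
pipeline.train_test_split()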

At the end of the data processing pipeline, we may find it necessary to store the resulting dataset prior to model training. It is a common practice for us to store processed datasets to ensure future modeling experiments are trained on exactly the same data. We may also need to capture additional information including the one-hot label encoder (if one is used), the random state used to generate the train/test splits, and other metadata associated with processing steps. Once again, we can create a dataclass for structuring this information.

Serialization with pyserde

Because we invest heavily in dataclasses to give our projects structure, we also need methods for serializing those dataclasses to disk. The pyserde library provides easy-to-use functions for serializing/deserializing Python objects into common formats including JSON, YAML, and MessagePack. To use these functions, we add the @serde decorator to our dataclass. Then, we can easily write the serialized object to the appropriate AWS S3 bucket or local storage. Pulling objects from S3 and deserializing is similarly straightforward.

Example dataset dataclass.
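A simplified sketch of such a dataclass, with a JSON round trip through pyserde; the fields shown here are illustrative, and a real version would also carry the processed train and test data.

from dataclasses import dataclass
from typing import List

from serde import serde
from serde.json import from_json, to_json


@serde
@dataclass
class ProcessedDataset:
    """Processed data plus the metadata needed to reproduce and reuse it."""

    train_labels: List[str]  # labels for the training split
    test_labels: List[str]   # labels for the test split
    label_names: List[str]   # ordering used for one-hot encoding
    random_state: int        # seed used for the train/test split


dataset = ProcessedDataset(
    train_labels=["NEUTROPHIL", "LYMPHOCYTE"],
    test_labels=["MONOCYTE"],
    label_names=["NEUTROPHIL", "LYMPHOCYTE", "MONOCYTE", "EOSINOPHIL"],
    random_state=42,
)

# Serialize to JSON; the resulting string can be written to local disk or an S3 bucket.
serialized = to_json(dataset)

# Later, deserialize back into the dataclass to reuse the exact same dataset.
restored = from_json(ProcessedDataset, serialized)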

There are a couple of caveats to using pyserde:

  • Certain package-specific objects, such as the OneHotEncoder from the scikit-learn package, cannot be serialized. When using pyserde, it is best to stick to primitives, containers, standard library objects, and user-defined dataclasses. You can find supported data types in their documentation.
  • Serializing with pyserde is a good option assuming the entire dataset can be loaded into memory all at once. With large datasets, this may not always be the case. Under those circumstances, other storage solutions, such as sharded TensorFlow datasets, will be more appropriate.

With pyserde and dataclasses, we can turn entire train and test sets into serialized objects that can be uploaded to and downloaded from cloud storage. In this way, we avoid reprocessing raw data with each new model training. This saves compute time and increases confidence in our results: versioned datasets guarantee that repeated model experiments are fed exactly the same data.

In this post, we looked at our solutions for setting up data pipelines using dataclasses, enforcing structure with abstract base classes, and serializing/deserializing with pyserde. In the next blog, we’ll dive into model training and development.

Machine learning at Edge Analytics

Edge Analytics helps companies build MLOps solutions for their specific use cases. More broadly, we specialize in data science, machine learning, and algorithm development both on the edge and in the cloud. We provide end-to-end support throughout a product’s lifecycle, from quick exploratory prototypes to production-level AI/ML algorithms. We partner with our clients, who range from Fortune 500 companies to innovative startups, to turn their ideas into reality. Have a hard problem in mind? Get in touch at info@edgeanalytics.io.
