MLOps at Edge Analytics | Data Storage with AWS S3 and Boto3

Part One of Five

Connor Davis
Edge Analytics
7 min read · Apr 14, 2023


Image created with DALL-E 2.

As machine learning models become more widely deployed, ML practitioners have shown increasing interest in MLOps. In our introductory blog, we give a brief background on how we think about MLOps at Edge Analytics.

Here we look at the first pillar of an MLOps pipeline: data storage. This post is part one of a five-part series. For the example dataset below, we store raw data for model training and evaluation in an AWS S3 bucket and access it using the Boto3 SDK in Python.

You can find the other blogs in the series by following the links below:

All ML projects start with data. This data, whatever file type it is, however clean it is, must be stored somewhere. When deciding where to house the data, you might consider the following questions:

  1. What is the scale (MB / GB / TB) of our dataset?
  2. What format is our data in (tabular, images, time series, text, etc.)?
  3. What security features do we need for our dataset?
  4. Do we want the dataset to be publicly available or restricted in any way?
  5. What functionality exists to access the dataset as quickly as possible (I/O)?

The answers to these questions can begin to provide guidance on how data should be stored. At Edge Analytics, we have experience using both data warehouses (e.g. Google BigQuery) and data lakes (e.g. AWS S3). For this example pipeline, we’ll lean towards data lakes for their ease of use and flexibility.

AWS S3

Object cloud storage services like AWS S3, Microsoft Azure Blob, and Google Cloud Storage are widely used and very flexible as data lakes. Although all three provide similar features at similar prices, AWS has led the cloud market in recent years. For this example pipeline, we’ll store all data in AWS S3 buckets. Regarding the questions above:

  1. What is the scale (MB / GB / TB) of our dataset? AWS S3 storage is scalable. Although there are limits to how many buckets can be created per account (up to 100, unless a service limit increase is requested), there is no upper limit to the size of a bucket. Additionally, storage is relatively inexpensive and billed based on usage, so there is no need to provision capacity up front.
  2. What format is our data in (tabular, images, time series, text, etc.)? S3 can be used to store data for countless use cases and arbitrary data types. Setting up a data lake of raw or processed data files for your project is very straightforward.
  3. What security features do we need for our dataset? S3 offers many security features to protect data, such as automatic server-side encryption and Amazon Macie, which discovers and reports on sensitive data in your buckets.
  4. Do we want the dataset to be publicly available or restricted in any way? Access to S3 buckets is easily managed through IAM and bucket policies that dictate who can see what data and when. Credentials with varying levels of access can be granted by your AWS account administrator. Conversely, if you want to make the data public, S3 allows that too!
  5. What functionality exists to access the dataset as quickly as possible (I/O)? There are several ways of accessing files stored in S3 buckets, including the AWS command line interface (CLI), the S3 GUI console, and the Boto3 software development kit (SDK) in Python.

The product offerings for S3 are expansive. We touch very briefly on a few features we use frequently. AWS offers more comprehensive documentation on the S3 landing page.

An example storage solution

Prior to interacting with data in S3, you’ll first need to set up AWS credentials and install the AWS CLI. Once you have credentials for accessing S3, run `aws configure` in your local terminal. This creates `config` and `credentials` files, and you’ll be free to use the S3 CLI. If you’re working in Python, you can install the Boto3 SDK instead; it picks up the same credentials.
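As a minimal sketch of that setup, assuming credentials are already configured with `aws configure`, instantiating a client and checking access from Python might look like this:

```python
import boto3

# Boto3 reads the credentials created by `aws configure`
# (~/.aws/credentials and ~/.aws/config) or environment variables.
s3 = boto3.client("s3")

# Quick sanity check: list the buckets this account can see.
response = s3.list_buckets()
for bucket in response["Buckets"]:
    print(bucket["Name"])
```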

The data for our blood image classifier example was borrowed from this Blood Cell Images dataset on Kaggle. It consists of 410 JPEG microscope images of white blood cells, 410 XML files with image annotations (one per image), and a single CSV file that gives the cell type label for each image. These files were downloaded to a local machine, then uploaded as-is to an AWS S3 bucket via the CLI. The file structure is shown here:

File structure of the blood images dataset.
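The upload itself was done with the AWS CLI (e.g. `aws s3 cp ... --recursive`), but the same thing can be done from Python. Here is a rough sketch, where the bucket name and local folder are placeholder assumptions:

```python
from pathlib import Path

import boto3

s3 = boto3.client("s3")
bucket = "blood-images-bucket"          # placeholder bucket name
local_root = Path("blood_cell_images")  # placeholder local folder

# Upload every file, using its relative path as the S3 key so the
# "folder" structure is preserved as key prefixes in the bucket.
for path in local_root.rglob("*"):
    if path.is_file():
        key = path.relative_to(local_root).as_posix()
        s3.upload_file(str(path), bucket, key)
```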

Helpful tip!

S3 buckets have a flat structure; they are object stores, not hierarchical file systems. So although it is easier to think in terms of a hierarchical file structure, like the one above, the reality is a bit different. Rather than treating files as located within folders, S3 treats what look like cascading directories as prefixes on object key names. So the file structure in S3 actually looks like this:

File structure of the blood images dataset in an S3 bucket.

Because of this flat structure, there is a tradeoff between finding files easily in the AWS GUI console and accessing files quickly via code. Using a hierarchical folder naming convention can help a user manually find files but slows programmatic access if many files must be read. Alternatively, giving files unique names without folder prefixes speeds up recursive file reading but makes it harder for a user to identify a file based on name.
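To see what the prefix behavior means in practice, here is a minimal sketch of listing the keys under one “folder”; the bucket name and prefix are placeholder assumptions based on the structure above:

```python
import boto3

s3 = boto3.client("s3")
bucket = "blood-images-bucket"  # placeholder bucket name

# "Folders" are just key prefixes, so listing a folder means filtering keys.
# A paginator handles responses larger than 1,000 objects.
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=bucket, Prefix="annotations/"):
    for obj in page.get("Contents", []):
        print(obj["Key"])
```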

Although the AWS S3 CLI and GUI console are great for mobilizing data files, the Boto3 SDK in Python works best for loading them into working memory of a Python application, which is our use case here. The functionality of Boto3 is extensive, so much so that we decided to distill and simplify only the functions we needed into a separate class called S3Operations. This class includes ergonomic methods for:

Data I/O methods in the S3Operations class.

As mentioned, the methods in the S3Operations class are wrappers over core Boto3 functionality. We find these convenience methods speed up development time by abstracting hard-to-remember Boto3 syntax into method names we can easily reference. It turns out, for most data I/O we do, we typically use a fairly small subset of Boto3’s total functionality.
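The class itself isn’t reproduced in this post, but to give a sense of the pattern, a thin wrapper along these lines might look like the sketch below. The method names and behavior here are illustrative assumptions, not the actual S3Operations implementation:

```python
import json

import boto3


class S3Operations:
    """Illustrative sketch of a thin convenience wrapper around Boto3."""

    def __init__(self, bucket: str):
        self.bucket = bucket
        self.client = boto3.client("s3")

    def list_keys(self, prefix: str = "") -> list[str]:
        """Return all object keys under a prefix."""
        paginator = self.client.get_paginator("list_objects_v2")
        keys: list[str] = []
        for page in paginator.paginate(Bucket=self.bucket, Prefix=prefix):
            keys.extend(obj["Key"] for obj in page.get("Contents", []))
        return keys

    def read_bytes(self, key: str) -> bytes:
        """Load an object (e.g. a JPEG image) directly into memory."""
        response = self.client.get_object(Bucket=self.bucket, Key=key)
        return response["Body"].read()

    def write_json(self, key: str, data: dict) -> None:
        """Serialize a dict to JSON and upload it to the bucket."""
        body = json.dumps(data).encode("utf-8")
        self.client.put_object(Bucket=self.bucket, Key=key, Body=body)
```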

Here is a short snippet of how we use the S3Operations class.

Example usage of S3Operations class.
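The actual snippet is shown in the image above; using the illustrative sketch from earlier, comparable usage (with placeholder bucket and prefix names) might look like:

```python
# Placeholder bucket and prefix names, using the illustrative class above.
s3_ops = S3Operations(bucket="blood-images-bucket")

# List the image keys and pull the first image into memory.
image_keys = s3_ops.list_keys(prefix="images/")
first_image = s3_ops.read_bytes(image_keys[0])
print(f"Loaded {len(first_image)} bytes from {image_keys[0]}")
```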

Helpful tip!

The Boto3 SDK requires either a “client” or a “resource” to be instantiated in your Python program (e.g. as client = boto3.client("s3")) for making S3 service requests. We recommend always using “client”. The Boto3 “client” is a low-level interface that has broad utility and returns metadata for your S3 request along with the data. The “resource” is a high-level interface built on top of the “client” that makes accessing files simple at the expense of some flexibility. As of January 2023, AWS does not intend to add new features to “resource.”
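As a small illustration of that metadata, a `get_object` call through the client returns the payload alongside details about the request; the bucket and key below are placeholders:

```python
import boto3

s3 = boto3.client("s3")

# The client returns the file contents plus metadata about the request.
response = s3.get_object(Bucket="blood-images-bucket", Key="labels.csv")
print(response["ContentLength"])                        # object size in bytes
print(response["ResponseMetadata"]["HTTPStatusCode"])   # 200 on success
data = response["Body"].read()                          # the raw file bytes
```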

Up next

Now that we have a place to store the raw data and a way to access it, we will move to the next step in our MLOps pipeline: data processing. Stay tuned for the next blog, which we expect to publish the week of April 24.

Machine learning at Edge Analytics

Edge Analytics helps companies build MLOps solutions for their specific use cases. More broadly, we specialize in data science, machine learning, and algorithm development both on the edge and in the cloud. We provide end-to-end support throughout a product’s lifecycle, from quick exploratory prototypes to production-level AI/ML algorithms. We partner with our clients, who range from Fortune 500 companies to innovative startups, to turn their ideas into reality. Have a hard problem in mind? Get in touch at info@edgeanalytics.io.
