I built a Machine Learning Platform on AWS after passing the SAP-C01 exam

After getting my AWS Certified Solutions Architect — Professional certification, I was wondering: was it worth it? And does it give me enough knowledge to architect a platform?

Salah REKIK
The Startup
6 min read · Jul 2, 2020


Photo by Ben White on Unsplash

So, I decided to put it to the test: I tried to build a Machine Learning Platform on AWS.

The result? This certification is definitely worth it: properly preparing for and passing it gave me the tools to look at the big picture and see all the little moving parts. But it is not enough. To fill the gaps, I drew inspiration from companies with deep expertise in this domain: Uber with their Michelangelo platform, Netflix, Comcast, and many others. AWS customer case studies also gave me a good vision of how these companies harnessed AWS's power to face this challenge.

This article is the first in a series describing my journey in building a Machine Learning platform on AWS.

In this first article, I am going to give a high-level overview of the platform I am trying to build.

1 | Why a Machine Learning platform?

Nowadays, companies are starting to take machine learning models more seriously. These models are seen as a potential, if not the best, solution to many business use cases. As a result, many data scientists have emerged to tackle these use cases: all it takes is a Jupyter notebook and some data, and we have a functioning model with a low error rate and a set of optimized hyperparameters.

However, the moment companies decide to actually use such a model in a production environment, they realize that a functioning model is just the first step. A bigger challenge remains to be solved: how to take this functioning code from a Jupyter notebook to a highly available, scalable, and secure production environment? To face this challenge, building a Machine Learning platform is mandatory.

2 | What is a platform anyway?

A platform is a set of ordered layers communicating with each other to produce a result. Each layer has one and only one dedicated role. These roles define the boundaries of each layer: the clearer and better defined these boundaries are, the more solid, scalable, and easy to maintain the platform is.

Four layers can form a platform:

Layers of the Machine Learning Platform, by the author
  • Infrastructure layer: the layer that powers the platform: it contains the bare-metal servers, virtual machines, storage systems, etc. Architects & DevOps Engineers need to design the right architecture for this layer so that the platform can survive disasters and guarantee continuity of service.
  • Software layer: the layer containing the software solutions: the operating system (Linux, Windows), load-balancing software, security management software, database solutions, etc. Architects & Software Engineers are responsible for choosing the right software stack to cover the platform's features.
  • Framework layer: the brain of the platform: this layer contains the framework capable of bringing a machine learning model to production. The framework is a set of well-architected patterns and well-developed features that answer the use cases landing on the platform. Architects & Software Engineers are responsible for building a reliable design for the framework and designing strong data patterns.
  • Use cases layer: this layer contains the descriptions of new use cases landing on the platform and the Machine Learning models solving them. Business Analysts & Data Scientists work together to understand use cases and build strong machine learning models that produce convincing results.
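As a rough sketch, the four layers and their owners can be written down as plain data. This is purely illustrative, assuming nothing beyond the list above: the `Layer` type and its fields are my own naming, not part of any AWS service.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Layer:
    """One layer of the platform, with its single dedicated role and its owners."""
    name: str
    role: str
    owners: tuple

# The four layers, bottom to top, mirroring the list above.
PLATFORM_LAYERS = (
    Layer("infrastructure", "servers, VMs, storage", ("Architects", "DevOps Engineers")),
    Layer("software", "OS, load balancing, security, databases", ("Architects", "Software Engineers")),
    Layer("framework", "patterns and features that bring models to production", ("Architects", "Software Engineers")),
    Layer("use cases", "use case descriptions and ML models", ("Business Analysts", "Data Scientists")),
)
```

Keeping each layer's role as a single short sentence is a good forcing function: if a role needs two sentences, the boundary is probably not well defined.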

3 | Now, what are the key steps when dealing with a Machine Learning platform? And what challenges arise in each step?

Four key steps are needed to land a use case on a Machine Learning Platform, and each step involves its own challenges:

Key steps to land a use case on the Machine Learning Platform, by the author
  1. Use case study: before a use case lands on the platform, it needs to be studied well. Some of the questions to answer: What data does the use case need? What encryption requirements apply? Is there real business value in working on this use case?
  2. Data & Model preparation: this is the part where we bring the raw data defined in the previous step to the platform, extract the features that best describe the data, and benchmark models to choose the right one. Some of the challenging questions to answer: How to reuse a model's resources? How to reliably isolate different teams' and projects' resources? How to accelerate the model preparation phase?
  3. Model operationalization: after getting the right model, an even more challenging, and rarely well-done, step emerges: designing the pipelines capable of taking the model to production. Among the challenges to solve: How to debug a failed production run? How to run models at scale? Which automations to consider? What about versioning?
  4. Model serving: in the Machine Learning world, even after getting a model with the best possible performance, exposing it to the real world is not an easy process. Some of the challenges faced: How to monitor the model in production? How to deploy a new version without downtime? How to improve serving performance?
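The four steps above can be sketched end to end as plain functions. This is a hedged, toy illustration: the model, the version tag, and the whole flow are in-memory stand-ins, not real platform services or AWS calls.

```python
from typing import Callable

def study_use_case(description: str) -> dict:
    """Step 1: capture data needs and business value for the use case."""
    return {"use_case": description, "data_sources": ["events"], "approved": True}

def prepare_data_and_model(spec: dict) -> Callable[[float], float]:
    """Step 2: feature extraction and model selection (here: a toy model)."""
    return lambda x: 2.0 * x  # placeholder for a trained model

def operationalize(model: Callable[[float], float]) -> Callable[[float], float]:
    """Step 3: wrap the model in a pipeline that could be versioned and monitored."""
    def production_pipeline(x: float) -> float:
        return model(x)
    production_pipeline.version = "v1"  # versioning is one of the challenges above
    return production_pipeline

def serve(pipeline: Callable[[float], float], x: float) -> float:
    """Step 4: expose the pipeline to callers."""
    return pipeline(x)

spec = study_use_case("predict demand")
model = prepare_data_and_model(spec)
pipeline = operationalize(model)
print(serve(pipeline, 3.0))  # 6.0
```

The point of the sketch is the boundaries: each step takes the previous step's output as its only input, which is what makes the steps independently testable and improvable.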

Getting a model into production is not a one-time task: it is a continuous process. As new data appears, the model may underperform and must be updated to keep up. So, iterations must be planned and scheduled to continuously improve the machine learning model.
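One simple way to make "must be updated to keep up" actionable is a drift check: retrain when recent error exceeds a threshold. This is a hedged sketch; the error values, the threshold, and the idea of a monitoring job feeding this function are all assumptions for illustration.

```python
def should_retrain(recent_errors: list, threshold: float = 0.2) -> bool:
    """Schedule a new training iteration when the average recent error drifts past the threshold."""
    return sum(recent_errors) / len(recent_errors) > threshold

print(should_retrain([0.05, 0.1, 0.5]))   # drifting model  -> True
print(should_retrain([0.05, 0.1, 0.08]))  # healthy model   -> False
```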

4 | Assumptions

To build this Machine Learning (ML) platform, I made one important assumption: a Data lake already exists alongside this platform. The ML platform will not be responsible for bringing in the data; it will consume it, using another Data Platform's capacities to transform it. I believe that bringing in the data is a huge challenge in itself and deserves a complete second platform. For the last couple of years, I have been working on a data platform whose goal is to effectively collect and centralize data from different sources and, believe me, it is really challenging. Among the challenges we faced:

  • Guaranteeing the data quality
  • Cataloging the data and managing schema evolution
  • Defining and managing data ownership
  • Securing the data

So here is how I see things:

Separation between a Data Platform and a Machine Learning Platform, by the author
  • A Data Platform: responsible for bringing in the data and providing advanced data processing capabilities: batch processing and stream processing. These capabilities will be useful for online and offline feature extraction jobs. The data will be well cataloged and well governed.
  • A Machine Learning Platform: the main subject of this series of articles. This platform will solve the different Machine Learning use cases and take a Machine Learning model into production.
  • Integration protocols: the rules to respect when integrating the two platforms.

It goes without saying that bringing in the data could be integrated as a capability of the ML platform, but that would be far more challenging.
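To make the integration-protocols idea concrete, here is a hedged sketch of what such a contract could look like in code: the ML platform programs against an abstract interface, and any Data Platform implements it. All names here (`DataPlatform`, `read_batch`, the `clicks` dataset) are hypothetical, and a real implementation might read from an S3-based data lake instead of memory.

```python
from abc import ABC, abstractmethod

class DataPlatform(ABC):
    """The contract the ML platform depends on, independent of storage details."""

    @abstractmethod
    def read_batch(self, dataset: str) -> list:
        """Return cataloged records for a named dataset."""

class InMemoryDataPlatform(DataPlatform):
    """Local stand-in used here only to exercise the contract."""

    def __init__(self, catalog: dict):
        self._catalog = catalog

    def read_batch(self, dataset: str) -> list:
        return self._catalog[dataset]

lake = InMemoryDataPlatform({"clicks": [{"user": "a", "n": 3}]})
rows = lake.read_batch("clicks")
```

The design choice is the usual one behind integration protocols: the ML platform never learns how the data is stored, so either platform can evolve without breaking the other.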

In the next few articles, I will concentrate on building the ML Platform according to this vision.

Conclusion

In this article, I gave a definition of what a platform is and a high-level overview of the platform's layers. Then, I presented the key steps involved in solving a use case landing on the ML platform. Finally, I stated the assumptions that will guide my journey in building this ML platform.

In the next article, I will talk about the first two layers of this ML platform: Infrastructure and Software layers.

If you have any questions, please reach out to me on LinkedIn.

Note: The illustration diagrams are inspired by PresentationGO.com templates.
