Enterprise MLOps Platforms for Insurance Companies on AWS
Over the past few years, we’ve been working extensively on building MLOps platforms on AWS, refining our approach with every project. As machine learning becomes more widely adopted across industries, investing in MLOps is essential. MLOps, or Machine Learning Operations, ensures that machine learning models are not only developed but also deployed and maintained efficiently in production environments.
In this article, we want to share key insights we’ve gained, focusing on how we’ve developed strategies to streamline the development, management, and deployment of machine learning models specifically for insurance companies. In industries like insurance, companies often operate in multiple countries or regions, each with its own set of regulatory requirements. This adds complexity to their operations, requiring MLOps solutions that can accommodate diverse regulatory standards and ensure compliance across all markets.
The Challenge: Building an Enterprise MLOps Platform Across Multiple Regions
The main challenge in building a scalable MLOps platform for international operations lies in balancing flexibility with standardization. For instance, when deploying across regions like Spain, Italy, and France, each country may have distinct teams of data scientists, stakeholders, service level agreements (SLAs), and specific requirements. Moreover, these regions often contend with unique regulatory demands, diverse user demographics, and varying data structures. The objective is to develop a centralized, standardized approach that not only accommodates these differences but also ensures consistent and efficient workflows across all teams.
AWS as the MLOps Backbone
Given our need for scalability and efficiency across multiple regions, AWS emerged as the ideal foundation for our MLOps platform. Its flexibility and comprehensive toolset enabled us to deploy models in a reliable, automated, and cost-effective manner, allowing us to meet the diverse demands of each region without compromising on consistency. With AWS as our backbone, we could ensure that our platform remained adaptable yet uniform across all operations.
Now, let’s break down the two core components of our platform: model training and model deployment.
🧠 Model Training
The first priority was to create a standardized development environment that could be shared across the company’s different branches. Our proposal provides a consistent yet customizable development environment for each branch, leveraging the power of cloud computing while maintaining order and consistency through a model registry hosted in a central account. This approach standardizes model development and approval: each branch remains responsible for its own models, while the organization keeps a single inventory of approved models and can track the products they support.
On AWS, SageMaker Studio serves as the core development platform:
• Ease with Jupyter Notebooks: Data scientists are usually already familiar with Jupyter Notebooks and Python, which makes SageMaker Studio easy to adopt. Leveraging this familiarity accelerates integration into existing workflows, reducing the learning curve and increasing productivity.
• User Management: Integration with identity federation systems, like OpenID Connect, simplifies user onboarding and management. Configuring SageMaker Studio to work seamlessly with these systems ensures that permissions are correctly assigned and managed by the central governance team, in line with the organization’s other applications.
• Custom Library Support: SageMaker Studio supports Python kernels built on custom images, allowing teams to work with the specific library versions they require. To manage this effectively, we generally implement a tailored CI/CD process for building custom images and attaching them to SageMaker Studio environments (a sketch of this attachment step follows the list). This ensures that each team has access to the precise tools needed for their projects, fostering consistency and efficiency.
• AWS Tools for Scalable Model Training: SageMaker Studio enables data scientists to leverage cloud resources for training large and complex models. They can utilize tools such as SageMaker Pipelines, SageMaker Processing Jobs, and SageMaker Feature Store, with all their experiments tracked and easily reproducible (a minimal pipeline sketch also follows the list).
• Centralized Model Registry: For companies operating across multiple regions, maintaining a centralized SageMaker Model Registry is critical. This registry keeps the production-ready versions of the models deployed in the various regions. We typically propose an event-driven architecture, where models that have been manually approved and validated for production in local accounts are replicated to a central account (sketched after the list). This setup supports model versioning, manual approval, and event-driven centralization, enabling a standardized and efficient deployment process across all regions.
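As a concrete illustration of the custom-image flow, the sketch below registers a container image (already built and pushed to ECR by the CI/CD pipeline) as a SageMaker image and attaches it to a Studio domain as a selectable kernel. The image URI, role ARN, and domain ID are placeholders; a production pipeline would wrap these calls in its own templating and error handling.

```python
import boto3

# Hypothetical identifiers -- replace with the values produced by your CI/CD pipeline.
ECR_IMAGE_URI = "123456789012.dkr.ecr.eu-west-1.amazonaws.com/ds-custom:1.0"
SAGEMAKER_ROLE = "arn:aws:iam::123456789012:role/SageMakerStudioExecutionRole"
DOMAIN_ID = "d-xxxxxxxxxxxx"

sm = boto3.client("sagemaker")

# Register the ECR image as a SageMaker image and create its first version.
sm.create_image(ImageName="ds-custom", RoleArn=SAGEMAKER_ROLE)
sm.create_image_version(ImageName="ds-custom", BaseImage=ECR_IMAGE_URI)

# Describe how the image should run as a Studio kernel.
sm.create_app_image_config(
    AppImageConfigName="ds-custom-config",
    KernelGatewayImageConfig={
        "KernelSpecs": [{"Name": "python3", "DisplayName": "Team custom kernel"}],
    },
)

# Attach the image to the Studio domain so users can pick it in the launcher.
sm.update_domain(
    DomainId=DOMAIN_ID,
    DefaultUserSettings={
        "KernelGatewayAppSettings": {
            "CustomImages": [
                {"ImageName": "ds-custom", "AppImageConfigName": "ds-custom-config"}
            ]
        }
    },
)
```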
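To give a feel for how training workloads are expressed, here is a deliberately minimal SageMaker Pipelines sketch with one processing step and one training step. The script name, S3 locations, and role are assumptions; real pipelines in our projects add evaluation, registration, and approval steps.

```python
import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput
from sagemaker.processing import ProcessingOutput
from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.steps import ProcessingStep, TrainingStep

session = sagemaker.Session()
role = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"  # placeholder
bucket = session.default_bucket()

# Preprocessing step: runs a feature-engineering script on a managed cluster.
processor = SKLearnProcessor(framework_version="1.2-1", role=role,
                             instance_type="ml.m5.xlarge", instance_count=1)
step_process = ProcessingStep(
    name="PrepareFeatures",
    processor=processor,
    code="preprocessing.py",  # hypothetical script
    outputs=[ProcessingOutput(output_name="train", source="/opt/ml/processing/train")],
)

# Training step: consumes the processed data and produces a model artifact.
xgb_image = image_uris.retrieve("xgboost", session.boto_region_name, version="1.7-1")
estimator = Estimator(image_uri=xgb_image, role=role,
                      instance_type="ml.m5.xlarge", instance_count=1,
                      output_path=f"s3://{bucket}/models")
step_train = TrainingStep(
    name="TrainRiskModel",
    estimator=estimator,
    inputs={"train": TrainingInput(
        step_process.properties.ProcessingOutputConfig.Outputs["train"].S3Output.S3Uri)},
)

pipeline = Pipeline(name="risk-model-pipeline", steps=[step_process, step_train])
pipeline.upsert(role_arn=role)   # creates or updates the pipeline definition
pipeline.start()                 # every execution is tracked and reproducible
```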
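And here is a sketch of the event-driven replication into the central registry: an EventBridge rule in each local account matches model-package approval events and invokes a Lambda that re-registers the approved version in the central model package group. The group ARN is a placeholder, and the sketch assumes the central account’s registry policy already grants this account `sagemaker:CreateModelPackage`.

```python
import boto3

# Hypothetical ARN of the central registry's model package group.
CENTRAL_GROUP_ARN = "arn:aws:sagemaker:eu-west-1:111111111111:model-package-group/central-registry"

sm = boto3.client("sagemaker")

def handler(event, context):
    """Triggered by an EventBridge rule matching source = aws.sagemaker,
    detail-type = 'SageMaker Model Package State Change',
    with detail.ModelApprovalStatus = 'Approved'."""
    local_arn = event["detail"]["ModelPackageArn"]

    # Read the approved package from the local registry.
    pkg = sm.describe_model_package(ModelPackageName=local_arn)

    # Re-register the same containers in the central, cross-account group.
    # Note: the central account must be able to reach the referenced ECR image
    # and S3 artifact, and DescribeModelPackage may return extra fields that
    # CreateModelPackage does not accept, so the spec may need trimming.
    sm.create_model_package(
        ModelPackageGroupName=CENTRAL_GROUP_ARN,
        ModelApprovalStatus="Approved",
        InferenceSpecification=pkg["InferenceSpecification"],
        ModelPackageDescription=f"Replicated from {local_arn}",
    )
```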
🚀 Model Deployment
Deploying machine learning models in a production environment presents unique challenges, especially within the insurance industry where stringent requirements and diverse regional needs must be met. Our deployment process is designed to address these challenges, ensuring that models are not only deployed efficiently but also adhere to strict performance standards and regulatory requirements. As usual, we utilize a CI/CD pipeline, enhanced by AWS SageMaker’s capabilities and Infrastructure as Code, to manage the entire deployment lifecycle. Below, we introduce the key features and challenges of our model deployment strategy:
• CI/CD Pipeline Integration: Our deployment process is fully automated through a CI/CD pipeline that uses infrastructure as code (IaC) to create SageMaker endpoints. These endpoints serve the models through an API Gateway and Lambda functions (a minimal handler sketch follows the list). The models are pulled from the centralized model registry, ensuring that only approved and validated models are deployed. This centralized registry is crucial for governance, preventing any local branch from deploying models without going through a formal approval process.
• Meeting Real-Time Requirements: A significant challenge in the insurance industry is meeting strict real-time requirements, as models are used for critical tasks such as calculating risk scores, pricing, and underwriting. These tasks often demand response times within a few hundred milliseconds, with slight variations across regions. To achieve this, we perform rigorous performance testing with tools like Apache JMeter and fine-tune the endpoint configuration, adjusting the number of instances and the autoscaling settings so the infrastructure meets the specific SLAs required by each branch (an example autoscaling configuration follows the list).
• Inference Pipelines and Multi-Model Servers: One of the more complex challenges we faced was managing the deployment of both legacy and latest models while keeping costs down and performance high. Our solution combines the use of SageMaker’s inference pipelines and multi-model servers, an uncommon but highly effective approach in this niche use case for insurance companies.
- Inference Pipelines: An inference pipeline allows multiple containers to operate behind a single endpoint, each handling a different step in the inference process, such as preprocessing and model inference (sketched after the list). This setup is particularly useful in our scenario, where we need to apply different business logic depending on whether the request is handled by a legacy model or the latest model version.
- Multi-Model Servers: Multi-model servers enable us to host multiple models on the same infrastructure within a single container (also sketched after the list). These servers dynamically load models from S3 into the container as needed, keeping them cached for future use. This allows us to maintain as many legacy models as required without significantly impacting performance. One limitation is that all models, both legacy and current, must share the same framework (e.g., XGBoost), which is generally reasonable since we are dealing with different versions of the same model family.
• Cost-Effective Infrastructure Sharing: By leveraging inference pipelines and multi-model servers, we can efficiently manage both legacy and latest models on the same infrastructure. This strategy ensures that the models meet their SLAs while minimizing costs. The legacy models, though less frequently used, share the infrastructure with the latest models, ensuring they remain available without degrading the performance of more actively used models.
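To make the serving path concrete, below is a minimal sketch of the Lambda function that sits behind API Gateway and forwards requests to the SageMaker endpoint. The endpoint name, content type, and payload handling are assumptions; in our projects they are injected by the IaC templates.

```python
import json
import os
import boto3

# The endpoint name is injected by the IaC template (placeholder default here).
ENDPOINT_NAME = os.environ.get("ENDPOINT_NAME", "risk-scoring-endpoint")

runtime = boto3.client("sagemaker-runtime")

def handler(event, context):
    # API Gateway (proxy integration) passes the request body as a string.
    payload = event["body"]

    response = runtime.invoke_endpoint(
        EndpointName=ENDPOINT_NAME,
        ContentType="text/csv",   # must match what the model container expects
        Body=payload,
    )
    prediction = response["Body"].read().decode("utf-8")

    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps({"prediction": prediction}),
    }
```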
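The following sketch shows the kind of autoscaling configuration we converge on after load testing; the endpoint and variant names, capacity bounds, and the target of 200 invocations per instance are illustrative values, not the thresholds of any specific branch.

```python
import boto3

autoscaling = boto3.client("application-autoscaling")

# The scalable resource is the production variant of a SageMaker endpoint (placeholder names).
resource_id = "endpoint/risk-scoring-endpoint/variant/AllTraffic"

autoscaling.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=2,   # keep at least two instances for availability
    MaxCapacity=8,   # upper bound derived from load tests and the branch SLA
)

autoscaling.put_scaling_policy(
    PolicyName="risk-scoring-target-tracking",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        # Add instances when a variant instance exceeds ~200 invocations per minute.
        "TargetValue": 200.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
        "ScaleInCooldown": 300,
        "ScaleOutCooldown": 60,
    },
)
```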
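For the inference-pipeline side, here is a minimal sketch with the SageMaker Python SDK: a preprocessing container and an XGBoost container are chained behind a single endpoint with `PipelineModel`. Artifact paths, the entry-point script, and framework versions are assumptions.

```python
import sagemaker
from sagemaker import image_uris
from sagemaker.model import Model
from sagemaker.pipeline import PipelineModel
from sagemaker.sklearn.model import SKLearnModel

session = sagemaker.Session()
role = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"  # placeholder

# Container 1: preprocessing / business logic (hypothetical artifact and script).
preprocess = SKLearnModel(
    model_data="s3://my-bucket/preprocessing/model.tar.gz",
    entry_point="inference.py",
    framework_version="1.2-1",
    role=role,
)

# Container 2: the actual XGBoost model.
xgb_image = image_uris.retrieve("xgboost", session.boto_region_name, version="1.7-1")
model = Model(
    image_uri=xgb_image,
    model_data="s3://my-bucket/models/xgb/model.tar.gz",
    role=role,
)

# Both containers sit behind a single real-time endpoint and run in sequence.
pipeline_model = PipelineModel(name="pricing-inference-pipeline", role=role,
                               models=[preprocess, model])
pipeline_model.deploy(initial_instance_count=2, instance_type="ml.m5.large",
                      endpoint_name="pricing-endpoint")
```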
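And for the multi-model side, a sketch of `MultiDataModel`: every archived model version sits under one S3 prefix, is served by the same XGBoost container, and is loaded lazily on first request. Again, names, paths, and the sample payload are placeholders.

```python
import sagemaker
from sagemaker import image_uris
from sagemaker.model import Model
from sagemaker.multidatamodel import MultiDataModel
from sagemaker.predictor import Predictor
from sagemaker.serializers import CSVSerializer

session = sagemaker.Session()
role = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"  # placeholder

# All model versions (legacy and current) live under one S3 prefix, one .tar.gz each.
model_prefix = "s3://my-bucket/pricing-models/"

xgb_image = image_uris.retrieve("xgboost", session.boto_region_name, version="1.7-1")
container = Model(image_uri=xgb_image, role=role,
                  predictor_cls=Predictor, sagemaker_session=session)

mme = MultiDataModel(
    name="pricing-multi-model",
    model_data_prefix=model_prefix,
    model=container,   # all versions must share this framework/container
    sagemaker_session=session,
)
predictor = mme.deploy(initial_instance_count=2, instance_type="ml.m5.large",
                       endpoint_name="pricing-mme-endpoint",
                       serializer=CSVSerializer())

# The caller picks a version per request; it is fetched from S3 and cached on first use.
predictor.predict("0.3,12,4500", target_model="model-v3.tar.gz")
```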
Conclusions
We’ve explored the intricacies of building a robust MLOps platform on AWS, focusing on the critical components of model training and deployment. We’ve highlighted how a centralized development environment, combined with a well-designed deployment strategy, can address the unique challenges faced by insurance companies operating across multiple regions. From creating standardized development environments to deploying models with strict real-time requirements, our approach ensures consistency, efficiency, and compliance.
However, model training and deployment are just the beginning. To create a truly end-to-end MLOps platform, these capabilities can be further enhanced with additional layers of functionality. Monitoring and logging are essential for maintaining the health and performance of deployed models, enabling proactive detection of issues and ensuring models continue to meet SLAs. Automatic retraining can be implemented to keep models up-to-date with new data, ensuring their relevance and accuracy over time. Furthermore, advanced features like A/B testing, continuous integration, and integration with business intelligence tools can drive even greater value from the platform.
By integrating these elements, organizations can build a comprehensive MLOps platform that not only meets today’s demands but also evolves to handle future challenges. The combination of a solid foundation with ongoing enhancements ensures that the platform remains scalable, reliable, and capable of supporting the dynamic needs of the insurance industry. As we continue to refine and expand our approach, the possibilities for innovation and efficiency in MLOps are limitless.