A Journey of Machine Learning Practices

Hamza Mushtaq Mian · Published in If Technology · May 5, 2021

An overview of evolving ML practices at If


As the largest property & casualty insurer in the Nordics, If has a long history of applying advanced analytics and predictive modeling to insurance products and their pricing. In the past few years, the practice of applying predictive models and machine learning has expanded into other areas such as customer acquisition and sales, customer service operations, and claims handling.

Starting from a handful of data scientists scattered across the organization, today we have several dedicated squads applying and iterating on use cases in these non-pricing domains. This journey has been accompanied by a continuously maturing set of practices and a platform to support these teams.

How it all began
Around 2017 we found that data, platform and machine learning (ML) engineering workflows were a key bottleneck in progressing with ML use cases. With multiple data scientists working on different models, a lot of one-off engineering was needed to develop and maintain models and the relevant pipelines in production. Moreover, with use cases accessed via APIs, concerns around authentication, performance, monitoring and privacy all became very relevant to us.

While data scientists already had data engineering support, the engineers and data scientists were reorganized into a single team to work towards a common goal, share practices and concerns, and avoid coordination overhead across team silos. At this point, work began on the first building blocks of an ML platform at If.

The ML platform development was based on four key principles:

1. Enable data scientists to work with their preferred tools and languages, e.g. Python.

2. Approach the development and deployment of models as self-service features for data scientists. Deploying a new model and data pipeline should not require a data or ML engineer. Rather, the engineering focus should be on self-service enablers, with an emphasis on automated and standardized workflows.

3. Set up a resilient and performant platform that can scale for real-time serving and batch scoring, and that follows best engineering practices for APIs and data integration (a minimal sketch of such a scoring endpoint follows this list).

4. Adopt DevOps practices and adapt them to ML workflows, an approach now commonly referred to as MLOps.
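
To make principles 2 and 3 concrete, here is a minimal sketch of what a standardized real-time scoring endpoint could look like. FastAPI, the serialized model file and the request schema are illustrative assumptions for the example, not If's actual platform code.

```python
# A minimal sketch of a standardized real-time scoring endpoint.
# The framework choice, model file and feature schema are assumptions.
from typing import List

import joblib
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
model = joblib.load("model.pkl")  # hypothetical serialized model


class ScoringRequest(BaseModel):
    features: List[float]  # hypothetical flat feature vector


@app.post("/score")
def score(request: ScoringRequest):
    # Every deployed model exposes the same request/response contract,
    # which is what makes automated, self-service deployment feasible.
    prediction = model.predict([request.features])[0]
    return {"prediction": float(prediction)}
```

With a shared contract like this, deploying a new model can be reduced to registering an artifact and its configuration, rather than writing bespoke serving code for each model.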

Overview of the platform and services developed for machine learning

Current focus: ML with image, audio and language data
Advancing into image, audio and language modeling has required our data scientists and engineers to adapt how they work across use cases, changing both the ways of working and the technology in use.

Areas of focus for ML development

With the broad availability of external models and services, there is now a wider set of approaches to consider when utilizing ML. On the one hand, cloud providers and several vendors offer high-quality APIs that can be used out of the box, as well as end-to-end products for specific purposes. On the other hand, the practice of transfer learning and self-supervised learning has opened up the possibility of utilizing high-quality open-source models. We have found that models and frameworks maintained or curated by leading companies are often very competitive when pre-trained on very large quantities of data and then fine-tuned on our datasets. Overall, we take a pragmatic approach to this area and have been incorporating external products, models and services where it makes sense to do so.
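
As a concrete illustration of this fine-tuning pattern, the sketch below fine-tunes an open-source pre-trained language model on an in-house labeled dataset using the Hugging Face libraries. The base model, number of labels and file paths are assumptions made for the example, not details of If's actual setup.

```python
# A minimal sketch of fine-tuning a pre-trained open-source model on
# our own labeled data; model name, labels and paths are illustrative.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification,
                          AutoTokenizer, Trainer, TrainingArguments)

model_name = "bert-base-multilingual-cased"  # assumed base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name, num_labels=3)  # hypothetical number of classes

# Hypothetical CSV files with "text" and "label" columns
dataset = load_dataset("csv", data_files={"train": "train.csv",
                                          "validation": "valid.csv"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True,
                     padding="max_length", max_length=128)

dataset = dataset.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=3,
                           per_device_train_batch_size=16,
                           evaluation_strategy="epoch"),
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
)
trainer.train()  # pre-trained weights, fine-tuned on our dataset
```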

Adopting the above has also had an impact on our ML engineering practices and platforms. Storing unstructured data has led us to set up data lake storage in Azure. For training models, we have had to adopt GPUs. To this end, utilizing Databricks has worked well for multiple teams, especially with the possibility to readily provision GPUs and scale them up and down as needed for training. At the same time, we have looked at various methods to set up the scoring process in an efficient and scalable way. This has ranged from making the models more compute- and memory-efficient using quantization and the ONNX Runtime, to exploring other related techniques. We are also looking to iterate further on the use of serverless options to serve the models at scale.
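
To sketch what the quantization and ONNX Runtime optimizations can look like in practice, the example below exports a fine-tuned PyTorch model (such as the one from the previous sketch) to ONNX, applies dynamic quantization and scores with ONNX Runtime. The input shape and file names are illustrative assumptions.

```python
# A minimal sketch of compute/memory-optimizing a model for scoring.
# Assumes `model` is a fine-tuned Hugging Face PyTorch model, e.g. from
# the sketch above; shapes and file names are illustrative.
import torch
import onnxruntime as ort
from onnxruntime.quantization import QuantType, quantize_dynamic

model.eval()
model.config.return_dict = False  # Hugging Face models: tuple outputs
dummy_input = torch.randint(0, 1000, (1, 128))  # hypothetical token IDs

torch.onnx.export(model, (dummy_input,), "model.onnx",
                  input_names=["input_ids"], output_names=["logits"],
                  dynamic_axes={"input_ids": {0: "batch", 1: "seq"}})

# Dynamic quantization: weights stored as int8, which shrinks the model
# and typically speeds up CPU inference.
quantize_dynamic("model.onnx", "model-int8.onnx",
                 weight_type=QuantType.QInt8)

# Scoring through an ONNX Runtime session
session = ort.InferenceSession("model-int8.onnx")
logits = session.run(["logits"], {"input_ids": dummy_input.numpy()})[0]
```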

Collecting data for modeling with image, audio and language also goes beyond the typical case of aligning on business definitions for tabular data sets. Training and validation with unstructured data typically requires involving the consumers of our use cases to gather labeled data. The complexity of this varies case by case, but it involves agreeing on common definitions and annotation guidelines with the different annotators and domain experts involved in the labeling process. This requires a strong degree of collaboration and alignment right at the start of the process.
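
As a small, hypothetical example of what such agreed definitions can boil down to, the sketch below shows a shared annotation record with a fixed label set; the field names and labels are illustrative only.

```python
# A minimal sketch of a shared annotation record for a text
# classification task; fields and labels are hypothetical examples.
from dataclasses import dataclass

# Label set agreed in the annotation guidelines (hypothetical)
ALLOWED_LABELS = {"water_damage", "fire_damage", "other"}

@dataclass
class AnnotationRecord:
    document_id: str   # reference to the source document
    text: str          # pre-processed text shown to the annotator
    label: str         # must be one of the agreed labels
    annotator_id: str  # enables inter-annotator agreement checks

record = AnnotationRecord(document_id="claim-123",
                          text="Pipe burst in the kitchen ...",
                          label="water_damage",
                          annotator_id="annotator-7")
assert record.label in ALLOWED_LABELS  # enforce the common definitions
```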

This kind of collaboration is not only necessary from a training and validation perspective; the transparency and understanding it brings also help to demystify the process. Further, we have created a set of AI guidelines at If that we adhere to in our process. Here the focus goes beyond agreeing on labels to giving a pedagogical explanation of how the models are used, as well as the metrics, potential bias and other factors that come into play. With a strong emphasis on privacy-by-design, we have also adopted methodologies for each class of data to pre-process and store it in a compliant way before the annotation and modeling process even begins.
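
To illustrate the pre-processing side of privacy-by-design, the sketch below masks obvious personal identifiers in text before it reaches annotators. The patterns are deliberately simple examples; a production pipeline would be far more thorough, for instance also detecting names with NER-based tools.

```python
# A minimal sketch of masking personal identifiers before annotation.
# The regex patterns are illustrative; real pipelines are more thorough
# and would also detect names with NER-based tools.
import re

PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "PHONE": re.compile(r"\+?\d(?:[\s-]?\d){7,12}"),
}

def redact(text: str) -> str:
    for tag, pattern in PATTERNS.items():
        text = pattern.sub(f"[{tag}]", text)
    return text

print(redact("Reach me at jane.doe@example.com or +46 70 123 4567"))
# -> "Reach me at [EMAIL] or [PHONE]"
```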

Now what?
The focus areas for ML practices and related engineering efforts have naturally followed the maturity of ML development, as well as the use cases being pursued at If. As time goes on, however, we see several paths to improve further.

The introduction of standardized and capable analytics tools and services is an opportunity both to make the platforms more widely accessible within If and to further standardize across teams. Similarly, we see a plethora of tools being developed externally, and products maturing around feature stores, model management and more.

We continue to be on the lookout for best practices, tools and services that we can learn from or adopt to improve the methodology and tooling of our data scientists.

Note: this article is related to machine learning in If’s B2C business.
