Why we have a dedicated data platform @ Getsafe

Kiran Vasudev
Getsafe

--

At Getsafe, data is at the core of our decision making. The amount of data we collect and incorporate into our products grows every day. As an insurance company, we collect raw data from our core product, the app, and process it into structured data that powers services like Business Intelligence (BI), CRM, and marketing, all of which ultimately contribute to a better user experience. It also helps us make sure that the features we launch and bring to life serve the best interests of our users.

The problem

In the early days, we could not make the best use of our data. Our workflows involved going back and forth with a lot of teams. We were constantly writing SQL in our BI tool, or modeling and analyzing data in Python on manually exported CSV files. While this workflow got the job done, it was neither efficient nor user-friendly. We knew things had to change if we wanted to become a more data-focused team!

Over time, the need for a data platform grew and the use cases that relied on data became more demanding, so we decided it was finally time to move from having the BI tool at the center of our data to having a dedicated data platform at its heart.

Our main concerns when looking for a solution to this were:

  • We wanted to make minimal changes to our existing infrastructure to accommodate the platform.
  • The platform should, in the long run, facilitate building and deploying Machine Learning models in production.
  • The platform should support Python development out of the box and be extremely user-friendly.
  • The platform should have the ability to restrict employees’ access to specific data.

The solution

After putting some time into researching and testing different solutions, Databricks proved to be the best fit for the following reasons:

  • Databricks is a fully managed platform that makes big data and Machine Learning simpler. It uses Apache Spark at its core, which includes higher-level libraries for Machine Learning, SQL, and streaming data.
  • The platform provides a collaborative workspace where you can build data pipelines, analyses, and models in a variety of languages, including Python, Scala, SQL, and R. It also lets you train and prototype Machine Learning models.
  • To protect data at all levels, Databricks employs a Defense in Depth (DiD) security model. Auditing, support for regulatory requirements, data protection, identity management, and role-based access controls are among the security features.

Current setup

Now that you are caught up, here is what our current setup looks like:

To keep frequent querying away from the production database, we created follower (read-only replica) databases and use them to access data on Databricks. This adds an extra layer of safety: raw data in its original form can never be modified from the platform; it only changes when our customers interact with the product.
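To make this concrete, here is a minimal sketch of what reading from a follower database can look like in a Databricks notebook. The host, secret scope, and table name are placeholders rather than our actual configuration (`spark` and `dbutils` are provided by the notebook environment):

    # A minimal sketch, not our exact setup: reading a table from a read-only
    # follower database via Spark's JDBC source. All names are placeholders.
    follower_url = "jdbc:postgresql://follower.example.internal:5432/core"

    policies_df = (
        spark.read.format("jdbc")
        .option("url", follower_url)
        .option("dbtable", "policies")  # hypothetical table name
        .option("user", dbutils.secrets.get(scope="db", key="follower-user"))
        .option("password", dbutils.secrets.get(scope="db", key="follower-password"))
        .option("driver", "org.postgresql.Driver")
        .load()
    )

Because the follower is read-only, a query like this can never write back to production, which is exactly the isolation we wanted.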

Different kinds of data are generated at Getsafe, based on our customers’ interactions with us, and are collected from different sources:

  • Databases/backend: This data pertains mostly to insurance policies, users, and claims, and is the core of our business. All of it is stored in relational databases, which in our case means Postgres.
  • Event-related data: All our events are collected and cleaned in Segment and then dumped as files into our object storage on AWS S3 (see the sketch after this list).
  • Claims-related attachments: Whenever an insurance claim is filed, our customers can upload attachments that support their claim; these are also stored in object storage.
  • ETL pipelines: Even though there are many well-known ETL tools, we had to build our very own pipelines, mainly because of how unique Getsafe’s use case and partner integrations are.
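As a rough illustration of how those Segment dumps get picked up on the platform (the bucket, path, and event name below are made up for the example):

    # Illustrative sketch: Segment dumps events as JSON files into S3, and
    # Spark on Databricks can read them directly. Paths are hypothetical.
    events_df = spark.read.json("s3://example-segment-dumps/events/*/*.json")

    # A light filtering step before events flow into downstream tables.
    app_opens_df = events_df.filter(events_df.event == "Application Opened")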

Processed data is stored in our analytics database as well as in S3 buckets as Delta tables, which are then used on Databricks for Data Science.
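For a feel of that last step, here is a hedged sketch, with a toy DataFrame and invented paths and table names, of persisting processed data as a Delta table on S3 and registering it so it can also be queried with SQL:

    # Hedged sketch: write a processed DataFrame out as a Delta table on S3
    # and register it in the metastore. Bucket, path, and table names are
    # invented for illustration.
    processed_df = spark.createDataFrame(
        [(1, "Application Opened"), (2, "Claim Filed")],  # toy stand-in rows
        ["user_id", "event"],
    )

    (
        processed_df.write.format("delta")
        .mode("overwrite")
        .save("s3://example-analytics/delta/events_cleaned")
    )

    spark.sql("""
        CREATE TABLE IF NOT EXISTS analytics.events_cleaned
        USING DELTA
        LOCATION 's3://example-analytics/delta/events_cleaned'
    """)

Registering the table this way means analysts can reach the same data from SQL that Data Scientists reach from Python.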

Databricks acts as a one-stop shop for working with data from these various sources. We have configured Databricks to run within our AWS ecosystem, and since the platform uses cloud resources, our setup scales easily as the use cases and the volume of data grow.

Here are some of the most important uses for the Getsafe data platform:

  • Ensuring privacy and security in how data is accessed
  • Providing a reliable, scalable path from quickly prototyped analyses to models in production
  • Continuous Integration/Continuous Delivery (CI/CD) of Data Science projects
  • Automating the reporting process
  • Building up custom ETL pipelines
  • Assisting CRM with custom, data-based logic
  • Gathering market intel and serving Business Intelligence

Since we set up our data platform, we can perform analyses in half the time, which means the business makes data-driven decisions a lot faster. On top of this, we can rapidly prototype a Machine Learning model and deploy it to a REST endpoint within a couple of hours, something that took a couple of weeks before we built the platform.
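To hint at why a couple of hours is realistic: on Databricks, a model logged and registered with MLflow can be enabled for serving behind a REST endpoint from the workspace UI. Here is a minimal sketch with toy data and a hypothetical model name, not our actual model:

    # Minimal sketch: train a toy model, then log and register it with MLflow.
    # Once registered, Databricks can serve it behind a REST endpoint that
    # accepts JSON feature payloads. The registered model name is hypothetical.
    import mlflow
    import mlflow.sklearn
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    X, y = make_classification(n_samples=500, n_features=8, random_state=42)
    model = LogisticRegression(max_iter=1000).fit(X, y)

    with mlflow.start_run():
        mlflow.sklearn.log_model(
            model,
            artifact_path="model",
            registered_model_name="example_claims_model",
        )

Everything after the registration step is a few clicks rather than weeks of bespoke deployment work, which is where most of the speed-up comes from.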

We’re really excited about what we’ve achieved with this project so far! We understand that building infrastructure is a mammoth task, especially when such sensitive information is stored and processed. We are not done just yet and have a lot more interesting things to tackle in the future. We are excited to keep the whole system concise, precise, and easily manageable. If you want to be part of that journey, check out our job openings and apply!
