Launching Databricks at If
We use Databricks as an analytics accelerator and a status quo challenger. Here’s why, and how we did it.
At If Insurance, we make sure our customers are correctly insured and feel confident enough to face their risks. We specialize in risk management and data & analytics. To take our data & analytics to the next level, we made an addition to our technology portfolio: Databricks. Here’s why, and the stepwise approach we took.
Databricks is an evolving analytics platform. It started as a hosted Spark environment and later expanded with Delta Lake, MLflow, and SQL Analytics. Its collaborative environment allows all data personas to run analytic processes in one place. Databricks is an impressive tool used across industries and by companies of all sizes.
Transformation is far more than just deploying new technologies for the sake of it. A genuine competitive advantage can only be gained through the combination of an organization’s culture, its strategic choices and way of operating. — Henrico Dolfing
For us, Databricks is not only a set of technological capabilities. We see it as an analytics accelerator and a status quo challenger.
The goal is to put data at analysts’ fingertips and enable new ways of working. To succeed, we need to review existing processes, data integrations, data management, and development practices.
Before embarking on a journey with Databricks, we already had SAS, Tableau, Power BI, and Excel. Why did we need yet another tool?
If is, arguably, a data driven company. We are the largest property & casualty insurer in the Nordics. The key ingredient in our products and their development is data. It doesn’t get much more data intensive than that. We are also heavy users of data in sales, marketing and customer service operations, including our claims handling. — Jaakko Mikkonen
Over 150 analysts and data scientists across all the business lines kept our engineering and platform teams on their toes.
“Analytic teams will always have a variety of roles, skills, favorite tools, and titles. A diversity of backgrounds and opinions increases innovation and productivity.” — The DataOps Manifesto
Here are the main reasons for selecting Databricks:
1. Support the most popular analytics programming languages and libraries
One of the key missing capabilities was support for Python — one of the most popular analytics languages. Local Python installations were not recommended from a security, compliance, and privacy standpoint.
Databricks emerged as a solution with baseline security measures in place by default. It also allowed open-source extensions on top of prebuilt images.
2. Drive cooperation between IT and business
Python is a general-purpose language, very popular among both engineers and analysts. A common code base and shared frameworks decrease isolation between teams.
Databricks provides further benefits. Data engineers build batch and streaming data pipelines using Spark. Data analysts explore data and produce dashboards. Data scientists build feature marts, train models, and manage the ML lifecycle using MLflow. All within the same environment.
3. Streamline ML and AI use cases
Databricks doesn’t replace the BI tools used across our organization; rather, it complements the existing toolset. Use cases like natural language processing or risk analysis are hardly achievable with Power BI or Tableau alone.
Machine learning is not unknown territory for us. Our teams already build descriptive and predictive models, e.g. for pricing, customer acquisition, and fraud detection. There is also in-house experience with more advanced scenarios related to image and speech analytics.
We expect ML and AI use cases to grow in volume and importance, so it is essential to have the technical capabilities in place. These advanced scenarios will also become more deeply integrated into business processes.
With Databricks, use cases earlier described as too ambitious or too complex are within our reach.
4. Use cloud flexibility
The cloud is a de facto prerequisite for all modern data solutions. Managed infrastructure unties engineers’ hands to do more and execute activities that impact the business.
At the same time, managers need less meticulous upfront planning and can switch to agile execution. Yet the shift from traditional cost allocation to usage-based pricing is not straightforward.
We found it very simple to get started with Databricks on Azure. There was no need to host virtual machines or Kubernetes clusters. Instead, one gets Azure Active Directory integration by default, on-demand scalable clusters, and networking to connect to the data.
Phase 1 — Validate Databricks
Just as NASA didn’t fly its rocket to the Moon on the first try, we didn’t deploy Databricks to all users from day one.
To confirm our assumptions, we executed a Databricks MVP (Minimum Viable Product) project on Azure Cloud.
Throughout 2019, the team delivered Renewal Churn, Sales Optimization, Abandoned Offer use cases.
The outcomes of the first phase gave us more confidence about the tooling. We introduced the setup to Security, Network and Cloud teams, which put us in a good position for future implementations.
We also identified organizational challenges that could hinder doing the same in the rest of our organization.
Phase 2 — Provide Databricks blueprint to other teams
It takes minutes to deploy a single Databricks workspace in Azure. But it might take weeks to get a compliant and approved instance ready for analytics.
Based on Phase 1 learnings, we created a Databricks blueprint — a set of baseline standards, principles, and templates. It defines deployment, networking, security, storage, and user & secret management.
As the interest in Databricks grows within different units, the blueprint is a recommended place to start.
Nodes are typically split by department and instantiated from the blueprint. Each node can be customized to serve its respective business requirements. Teams expose their data products via methods of their choosing. Based on the feedback we collect, we keep improving the blueprint.
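To make the idea concrete, a blueprint can be thought of as a set of shared defaults that each department instantiates and selectively overrides. The structure, names, and values below are a hypothetical illustration of the pattern, not our actual standards.

```python
# Hypothetical sketch of a workspace blueprint: shared defaults that
# each department overrides. All names and values are illustrative.
BLUEPRINT_DEFAULTS = {
    "deployment": {"cloud": "azure", "region": "westeurope", "sku": "premium"},
    "networking": {"vnet_injection": True, "no_public_ip": True},
    "security":   {"aad_sso": True, "secret_scope": "key-vault-backed"},
    "storage":    {"format": "delta", "mount_point": "/mnt/datalake"},
}

def workspace_config(department: str, **overrides):
    """Merge a department's overrides onto the blueprint defaults."""
    # Copy each section so the shared defaults are never mutated.
    config = {section: dict(values) for section, values in BLUEPRINT_DEFAULTS.items()}
    for section, values in overrides.items():
        config.setdefault(section, {}).update(values)
    config["workspace_name"] = f"dbw-{department}"
    return config

# A department instantiates its node, deviating only where needed.
customer = workspace_config("customer", networking={"no_public_ip": False})
```

The point of the pattern is that deviations from the baseline are explicit and reviewable, while the defaults stay centrally maintained.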
We have a few Databricks instances already:
- Customer — churn, interactions modelling, recommendations
- Product — pricing and tariff analysis, risk rating
- Integrations — Salesforce, partner integrations, etc.
- Claims — soon to be included
Phase 2 is a workable option for a small number of users and teams; it avoids premature scaling and unnecessary complexity. The blueprint accelerates the start, but teams still have to adapt it and take full ownership of it. Some of the activities can be executed or automated centrally.
Phase 3 — Databricks as an enterprise analytics platform
Our goal is to enable many groups within If to operate their own analytics whilst adhering to common policies and standards. The established blueprint is not enough to enable Databricks at enterprise level.
First, Databricks needs data. We need a solid, cloud-hosted data platform. A subset of our data is already available on Azure and Google Cloud Platform, but on-premises solutions are still in use.
Second, we need to provide common services such as a data catalog, auditing, monitoring, access control, and integrations.
Third, we need to reduce the entry threshold with a dedicated team of engineers. They should onboard all users and guide them through the technical challenges.
The target architecture above is based on James Serra’s and John Mallinder’s Harmonized Mesh, which in turn was heavily inspired by the Data Mesh article by Zhamak Dehghani.
Building knowledge while modernizing analytics
To sum up, we have delivered a few use cases and gained a lot of experience with Databricks. The outcomes make us confident that we are on the right path towards enabling new ways of working.
To my surprise, Databricks sparked interest not only among data analysts and data scientists. Engineers and managers not involved in “data projects” reached out to learn more.
Software Is about Developing Knowledge More than Writing Code —
The more employees develop knowledge about our data, the faster we build smarter and more innovative data products. Databricks seems to be the right enabler to do that.