How we analyzed car insurance retention in Qatar

Islambek Temirbek
QIC digital hub blog
9 min readApr 4, 2024

Hello! I’m Islambek Temirbek, a Data Analyst at QIC digital hub. Our team was faced with questions: how car insurance policies are renewed from year to year, which sources the incoming policies are more loyal and which ones we should work on. To answer these questions, we decided to conduct a customer retention analysis. In this material, I will tell you how we did it and visualized the results.

How the purchase and renewal of policies occur

The purchase and renewal of insurance policies are carried out from sources such as:

  • Agents
  • Online
  • Offline

This can be observed from the following two graphs of policy purchases and renewals in 2023. The purchase of new policies was mostly done online and through agents (52% and 36%, respectively), while renewals occurred through online and offline channels (66% and 29%, respectively). We have such data from 2017 to 2023 with three main sales sources, so it is necessary to conduct a detailed retention analysis to understand how car insurance policies are renewed from year to year from different sources.

The data is deliberately skewed.

Choosing a Retention Analysis Method

Retention rate is a key metric indicating user loyalty to a product. According to the Pareto principle, 20% of repeat customers generate 80% of revenue. For our company, as well as for many others, acquiring new users often costs more than retaining existing ones.

There are two main methods for calculating retention:

  • N-Day Retention: This method evaluates how many users remain active through N days after their first interaction with the application. For example, when analyzing 7-Day Retention, we look at how many users continue using the application on the seventh day after installation.
  • Rolling Retention: Unlike N-Day Retention, measures the percentage of users who remain active over a specific time interval after their initial app login. For example, in a 7-Day Rolling Retention analysis, we assess how many users stay active during any seven-day period after their initial login, not just within the first seven days.

In our case, we will use N-Day Retention in cohorts, using a Sankey diagram. This way, we will be able to see multiple sources of policy purchases and renewals over the years.

Cohort analysis allowed us to segment users into groups identical by a certain criterion, in our case, by the date of their first policy purchase. We then applied N-Day Retention methods to each cohort to assess how long policies from different groups remained active.

We used a Sankey diagram to illustrate the flow of resources, reflecting both their sources and destinations. We chose to use the Sankey diagram in our work due to its ability to support multiple levels of display. This allowed us to effectively present an overall picture of the data and analyze details. If you’re interested in learning more about the capabilities of the Sankey chart, I recommend reading an article by Allison Stafford, an analytics manager at ThriftBooks.

Using our graph as an example, you can observe how policies purchased in different years are renewed in subsequent years through different purchasing sources. The graph indicates that only 26% of new policies sold by agents are consistently renewed, whereas online and offline point-of-sale purchases show over 40% renewal rates.

Preparing Data for Retention Analysis

Now let me delve deeper into how we wrote SQL queries and processed data to obtain such visualization.

Here’s what the original table we need looks like:

The data is deliberately skewed.

The query to obtain the raw data is as follows:

with motor_policy as (
SELECT
date
, policy_id
, transaction_type
, product_name
, business_source
FROM `db.datamart`
where date >= "2017-01-01"
and lob_code = "01"
and transaction_type <> "Endorsement"
)

select * from motor_policy

Let’s take a closer look.

The first table, `motor_policy`, contains information about policies: purchase date (`date`), policy ID (`policy_id`), transaction type (`transaction_type`), product name (`product_name`) and business source (`business_source`). It contains information about all transactions with policies, including both new purchases and renewals.

However, for cohort behavior analysis, we need to transform the data so that for each policy, its initial purchase and any subsequent renewals are represented. Thus, we create a new table cohort_data.

The data is deliberately skewed.

In the cohort_data table, each row represents a specific policy in a particular year (date_year). The future_year column indicates the year of the next renewal or extension of the policy; if it’s empty, the policy was not renewed. The period column displays the difference in years between the initial purchase of the policy and its renewal, helping us understand the stage of the policy (e.g., “year 1” signifies the first renewal after purchase). Therefore, for each row in cohort_data, we can see the purchase and renewal history of a specific policy.

Thus, the new cohort_data table helps us aggregate data on policies and their renewals for further cohort behavior analysis. To create this table, the following query is needed:

, cohort_data as
(SELECT
date_trunc(curr_data.date, year) as date_year
, curr_data.policy_id as policy_id
, curr_data.transaction_type as transaction_type
, curr_data.product_name
, curr_data.business_source
, date_trunc(future_data.date, year) as future_year
, 'year '|| DATE_DIFF(future_data.date, curr_data.date, year) as period
, future_data.transaction_type as future_transaction_type
, future_data.product_name as future_product_name
, future_data.business_source as future_business_source
FROM motor_policy as curr_data
left join motor_policy as future_data
on curr_data.policy_id = future_data.policy_id
and DATE_DIFF(future_data.date, curr_data.date, year) = 1
order by 1
)


select * from cohort_data

The concept of LEFT JOIN is used in the query on the same table with keys policy_id and DATE_DIFF (future_data.date, curr_data.date, year) = 1. This means that policies of the current year are merged with their renewals in the following year.

Next, we will be using a concept called GROUP BY GROUPING SETS, which allows us to group data by different combinations of columns, which is useful for aggregating data at various levels of detail.

For more information on what the SQL GROUPING SETS Clause is and how to use it, you can read Andrew Bone’s article.

, pre_final_data as (
SELECT
date_year
, transaction_type
, business_source
, future_year
, future_transaction_type
, future_business_source
, count(distinct policy_id) as policy_count
FROM cohort_data
where date_year <= "2023-01-01"
GROUP BY GROUPING SETS((date_year, transaction_type, business_source),
(date_year, transaction_type, business_source, future_year, future_transaction_type, future_business_source))
order by 1,2,3,4,5,6,7 desc
)


select * from pre_final_data

We use GROUPING SETS to group data by different combinations of columns: date_year, transaction_type, business_source, as well as their combinations with future_year, future_transaction_type, future_business_source. GROUPING SETS groups the data into two groups with and without the future prefix.

Another important point: if you look at the data in the table, you can notice that rows without “future” have values of policy_count = 13150 and 19003. A policy_count of 19003 indicates all new policies sold by agents in 2022, while a policy_count of 13150 represents churn, policies that were not renewed in 2023. If we sum up the policies renewed in 2022 and subtract from all new policies sold by agents in 2022, we get: 19003 — (17+1130 + 1650 + 2500) = 13150 churn.

The data is deliberately skewed.

Although the table can be used for analysis in its current state, we can improve its appearance and exclude churn. For example, we can exclude policies with policy_count = 12047 (not renewed in 2022) by using a LEFT JOIN with the previous table but only with data containing the future prefix.

, curr_data as (
SELECT date_year
, transaction_type
, business_source
, MAX(policy_count) as curr_policy_count
FROM pre_final_data
WHERE future_year is null
GROUP BY 1,2,3
ORDER BY 1,2,3
)


SELECT *
, round(policy_count / curr_policy_count*100) as retention
FROM curr_data
LEFT JOIN pre_final_data USING(date_year, transaction_type, business_source)
WHERE pre_final_data.future_year is not null and future_business_source is not null
ORDER BY 1,2,3,5,6,7

If we combine all the previous steps, the complete query to get the final table looks like this:

with motor_policy as (
SELECT
date
, policy_id
, transaction_type
, product_name
, business_source
FROM `db.datamart`
where date >= "2017-01-01"
and lob_code = "01"
and transaction_type <> "Endorsement"
)


, cohort_data as
(SELECT
date_trunc(curr_data.date, year) as date_year
, curr_data.policy_id as policy_id
, curr_data.transaction_type as transaction_type
, curr_data.product_name
, curr_data.business_source
, date_trunc(future_data.date, year) as future_year
, 'year '|| DATE_DIFF(future_data.date, curr_data.date, year) as period
, future_data.transaction_type as future_transaction_type
, future_data.product_name as future_product_name
, future_data.business_source as future_business_source
FROM motor_policy as curr_data
left join motor_policy as future_data
on curr_data.policy_id = future_data.policy_id
and DATE_DIFF(future_data.date, curr_data.date, year) = 1
order by 1
)


, pre_final_data as (
SELECT
date_year
, transaction_type
, business_source
, future_year
, future_transaction_type
, future_business_source
, count(distinct policy_id) as policy_count
FROM cohort_data
where date_year <= "2023-01-01"
GROUP BY GROUPING SETS((date_year, transaction_type, business_source),
(date_year, transaction_type, business_source, future_year, future_transaction_type, future_business_source))
order by 1,2,3,4,5,6
)


, curr_data as (
SELECT date_year
, transaction_type
, business_source
, MAX(policy_count) as curr_policy_count
FROM pre_final_data
WHERE future_year is null
GROUP BY 1,2,3
ORDER BY 1,2,3
)


SELECT *
, round(policy_count / curr_policy_count*100) as retention
FROM curr_data
LEFT JOIN pre_final_data USING(date_year, transaction_type, business_source)
WHERE pre_final_data.future_year is not null and future_business_source is not null
ORDER BY 1,2,3,5,6,7

The result is the following table with curr_policy_count and policy_count. Here we immediately calculate the retention rate in percentage terms to understand how many of the 16903 new agent policies or 35215 previously renewed online policies were renewed in the following year.

Voilà, our final table is ready.

The data is deliberately skewed.

For example, only 14% of policies purchased through agents in 2022 were renewed online in the following year, compared to 65% of policies purchased through online channels. To see this granularity, the right visualization is crucial.

Now the important thing is to get the desired visualization, and to do that, I used Sankeymatic. There’s no magic here; I entered all the figures manually. Perhaps you have a better suggestion on how to automate this process, I would appreciate your comments.

As a result, I got a diagram like this. On it, you can see where policies from different years were directed and, most importantly, how many percent of them were not renewed and left. Now the product managers are taking steps on how to retain new customers after attracting them, as acquiring a new customer is more expensive than retaining an existing one.

Pros, Cons, and Conclusions

Pros of this retention analysis:

It’s visually understandable, especially when it comes to contracts or customers from different sources and acquisition years, showing whether their subscription is annual or monthly.

Cons:

At this stage, the visualization requires a lot of manual work that could be automated.

Conclusion:

  • It is important to understand how contracts are renewed or how users return to the product.
  • If you have multiple sources of purchase, using a Sankey diagram with different cohorts is very convenient.
  • Writing a query for retention analysis uses join to combine policies by ID and by the first date of purchase with subsequent renewals. Additionally, the query has a grouping sets function to group data by different combinations of columns.
  • The simplest online tool, Sankeymatic, is used for visualization, which requires manual work.

--

--