CAR INSURANCE POLICY ANALYSIS

18 min readJul 15, 2024

TOOLS USED: MySQL, Power BI

INTRODUCTION

An insurance policy is a type of financial agreement whereby a person or entity (referred to as the insured) pays an insurance firm (referred to as the insurer) a premium in exchange for protection against certain risks or possible financial losses. The insurance provider consents to pay the policyholder for losses, damages, or obligations that are covered, according to the terms and circumstances outlined in the policy.

CAR INSURANCE POLICY

Car insurance, sometimes referred to as car insurance, is a legal agreement that offers financial security against liabilities and losses associated with owning and operating a vehicle. It is a contract between an insurance firm (the insurer) and a vehicle owner (the insured). A variety of hazards are usually covered by auto insurance policies, such as third-party liability, vehicle damage, and injuries to the driver and passengers.

LEARNING ABOUT THE STRUCTURE OF THE DATA

Since the dataset talks about insurance policies, it should be noted that the company covers the damages or losses related to vehicles. The dataset contains 37,542 rows and 16 columns. It should also be noted that the dataset is what we will be using throughout the analysis to give proper analysis and insights.

APPROACH

DATABASE CREATION: The first approach was to create a database where the dataset is to be stored and queried properly. The database name for the above analysis is InsurancePoliciesDataDB.

-- CREATE A DATABASE InsurancePoliciesDataDB 

   CREATE DATABASE InsurancePoliciesDataDB

DATA CLEANING: Noting that data cleaning is one of the most important aspect of Data Analysis, there is need to understand the dataset properly before data cleaning takes place.

STEPS TAKEN:

Datatype change: Changed the datatype of claim_freq to decimal(10,2).

-- Changing the claim_freq datatype to decimal datatype 
ALTER TABLE [Insurance Policies Data]
ALTER COLUMN claim_freq DECIMAL(10,2);

Age column and datatype: Created an age column to know their precise ages and also the datatype.

-- FOR AGE

ALTER TABLE [Insurance Policies Data]
ADD age INT;

DATA VALIDATION: Ensuring data consistencies in a dataset tells you how valid and reliable the data is. Proper assessment of data must be done to ensure how unique and accurate the data is.

DATA DICTIONARY

ID: Unique identifier for the policyholder.

Birthdate: Birthdate of the policyholder.

Marital_status: Marital status of the policyholder (Single, Married, etc).

Car_use: Purpose of car use (Private, Commercial).

Gender: Gender of the policyholder (Male, Female).

Kids_driving: Number of kids of the policyholder who are driving.

Parent: Whether the policyholder is a parent (Yes, No).

Education: Education level of the policyholder (High School, Bachelors, etc).

Car_make: Make of the policyholder’s car.

Car_model: Model of the policyholder’s car.

Car_color: Color of the policyholder’s car.

Car_year: Year of the policyholder’s car.

Claim_freq: Frequency of claims made by the policyholder.

Coverage_zone: Area of coverage (Highly Urban, Urban, Suburban, Rural, Highly Rural).

Claim_amt: Amount claimed by the policyholder.

Household_income: Household income of the policyholder.

Overview of the dataset after database creation and importing the CSV file.

SELECT * FROM [Insurance Policies Data]

ANALYSIS ON THE DATASET

Claim Frequency and Amount Analysis

What are the average claim frequencies and amounts for different demographic groups (e.g., age, gender, marital status)?

For Age: The main reason for updating the table, was to create an age column that displays the exact age of every ID on the dataset.

UPDATE [Insurance Policies Data]
SET age = DATEDIFF(YEAR, birthdate, GETDATE());


SELECT AVG(claim_amt) AS AVG_claim_amt,
       AVG(claim_freq) AS AVG_claim_freq,
   CASE
        WHEN DATEDIFF(YEAR, birthdate, GETDATE()) < 30  THEN 'Young'
        WHEN DATEDIFF(YEAR, birthdate, GETDATE()) BETWEEN 30 AND 45 THEN 'Adult'
  WHEN DATEDIFF(YEAR, birthdate, GETDATE()) BETWEEN 46 AND 60 THEN 'Mid_aged'
        ELSE 'Aged'
    END  AS age_group
FROM [Insurance Policies Data]
GROUP BY
        CASE
        WHEN DATEDIFF(YEAR, birthdate, GETDATE()) < 30  THEN 'Young'
        WHEN DATEDIFF(YEAR, birthdate, GETDATE()) BETWEEN 30 AND 45 THEN 'Adult'
  WHEN DATEDIFF(YEAR, birthdate, GETDATE()) BETWEEN 46 AND 60 THEN 'Mid_aged'
        ELSE 'Aged'
    END
ORDER BY AVG_claim_freq

Output:

From the output, it shows that the ‘Aged’ has the lowest Average claim frequency (0.506115), followed by ‘Mid-aged’ with (0.508192), ‘Adult’ with (0.510247) and lastly ‘Young’ with the highest Average claim frequency of (0.522091).

For Gender:

 -- FOR GENDER
 SELECT gender, 
 ROUND(AVG(claim_amt),2) AS AVG_claim_amt, 
 ROUND(AVG(claim_freq),2) AS AVG_claim_freq
 FROM [Insurance Policies Data]
 GROUP BY gender;

Output:

From the output, it clearly shows that ‘Male’ has a total claim frequency of (0.510000) and the ‘Female’ has a total claim frequency of (0.510000).

For Marital status:

 -- FOR MARITAL STATUS
 
 SELECT marital_status, 
  ROUND(AVG(claim_amt),2) AS AVG_claim_amt,
  ROUND(AVG(claim_freq),2) AS AVG_claim_freq
  FROM [Insurance Policies Data]
 GROUP BY marital_status;

Output:

From the output, it shows that the ‘Separated’ has the highest Average claim frequency (0.530000), followed by ‘Married’ with (0.520000), then ‘Single’ with (0.500000) and lastly ‘Divorced’ with the least Average claim frequency (0.490000).

Are there any specific vehicle characteristics (e.g., make, model, year) that correlate with higher claim frequencies or amounts?

For car_make: Firstly, we have to select the car make, average frequency claim, and the average claim amount to ensure we get the desired output. The next step is to group all by the car make in order to get the various types of car.

-- FOR CAR MAKE

SELECT car_make,
       AVG(claim_freq) AS AVG_claim_freq,
       AVG(claim_amt) AS AVG_claim_amt
FROM [Insurance Policies Data]
GROUP BY car_make
ORDER BY AVG_claim_amt DESC;

Output: The output shows the various types of car make with their respective average claim frequency and average claim amount.

From the analysis, it shows that ‘Hillman’ has the highest average claim amount of ‘$79,049.4266' and an average claim frequency of ‘0.000000’ while ‘Rambler’ has the lowest average claim amount of ‘$16,603.768’ and average claim frequency of ‘0.200000’.

For car model: We have to select the car model, average frequency claim, and the average claim amount to ensure we get the desired output. The next step is to group all by the car model in order to get the various models.

-- FOR CAR MODEL

SELECT car_model,
       AVG(claim_freq) AS AVG_claim_freq,
       AVG(claim_amt) AS AVG_claim_amt
FROM [Insurance Policies Data]
GROUP BY car_model
ORDER BY AVG_claim_amt DESC;

Output: The output shows the various types of car model with their respective average claim frequency and average claim amount.

From the analysis, it shows that ‘Crossfire Roadster’ has the highest average claim amount of ‘$99,403.06’ and an average claim frequency of ‘0.000000’ while ‘W123’ has the lowest average claim amount of ‘$17,006.755’ and average claim frequency of ‘0.500000’.

For car year: We select the car year, average frequency claim, and the average claim amount to ensure we get an accurate output. The next step is to group all by the car year in order to get the various years.

-- FOR CAR YEAR

SELECT car_year,
       AVG(claim_freq) AS AVG_claim_freq,
       AVG(claim_amt) AS AVG_claim_amt
FROM [Insurance Policies Data]
GROUP BY car_year
ORDER BY AVG_claim_amt DESC;

Output: The output shows the various types of car over the years with their respective average claim frequency and average claim amount.

From the analysis, it shows that year ‘1950’ has the highest average claim amount of ‘$79,049.4266’ and an average claim frequency of ‘0.000000’ while year ‘1926’ has the lowest average claim amount of ‘$27,555.42’ and average claim frequency of ‘1.000000’.

Risk Assessment

Which factors (e.g., household income, education level, coverage zone) are most indicative of high-risk policyholders?

For household income: This analysis shows the various IDs with their respective household income, average claim amount and average claim frequency.

SELECT ID,
       household_income,
       AVG(claim_freq) AS AVG_claim_freq,
       AVG(claim_amt) AS AVG_claim_amt
FROM [Insurance Policies Data]
GROUP BY ID,household_income
ORDER BY AVG_claim_amt DESC;

Output: From the analysis, it shows that ID ‘86–0476340’ has the highest average claim amount of ‘$99,997.70’, average claim frequency of ‘1.000000’ and household income of ‘$150,024.89’ followed by ID ‘79–3515001’ with average claim amount ‘$99,994.03’, average claim frequency ‘1.000000’ and household income ‘$230,848.15’ while ID ‘80–5374656’ has the lowest average claim amount of ‘$0.04’ and average claim frequency of ‘0.000000’ household income ‘$95,355.59’.

If we decide to analyze using the highest household income for the above analysis, the result will be:

SELECT ID,
       household_income,
       AVG(claim_freq) AS AVG_claim_freq,
       AVG(claim_amt) AS AVG_claim_amt
FROM [Insurance Policies Data]
GROUP BY ID,household_income
ORDER BY household_income DESC;

Output: The analysis shows that ID ‘79–6192187’ has the highest household income of ‘$249,991.11’, average claim frequency of ‘0.000000’ and average claim amount of ‘$89,160.54’ followed by ‘06–1578927’ with household income of ‘249990.42’, average claim amount ‘$40,064.00’, and average claim frequency ‘0.000000’ while ID ‘37–2490879’ has the lowest average household income of ‘$45,004.91’ average claim amount of ‘$64,395.96’ and average claim frequency of ‘0.000000’.

For educational level: This analysis shows the various respective Educational levels, average claim amount and average claim frequency.

-- FOR EDUCATION

SELECT education,
       AVG(claim_freq) AS AVG_claim_freq,
       AVG(claim_amt) AS AVG_claim_amt
FROM [Insurance Policies Data]
GROUP BY education
ORDER BY AVG_claim_freq DESC, AVG_claim_amt DESC;

Output: Looking at the analysis, we can say that ‘Bachelors’ has the highest average claim amount ‘$50,275.5356 with an average claim frequency of ‘0.507994’, followed by ‘Masters’ with an average claim amount of ‘$50,193.9721’ and average claim frequency of ‘0.514539’, then ‘High School’ with an average claim amount of ‘$49,621.5388’ and average claim frequency of ‘0.514800’ and lastly ‘PhD’ with the lowest claim amount of ‘$49,556.2771’ and average claim amount of ‘0.500362’.

For coverage zone: This analysis shows the various respective Coverage zones, average claim amount and average claim frequency.

-- FOR COVERAGE ZONE

SELECT coverage_zone,
       AVG(claim_freq) AS AVG_claim_freq,
       AVG(claim_amt) AS AVG_claim_amt
FROM [Insurance Policies Data]
GROUP BY coverage_zone
ORDER BY AVG_claim_freq DESC, AVG_claim_amt DESC;

Output: From proper analysis, we can say that ‘Urban coverage zone’ has the highest average claim amount ‘$50377,7303’ with an average claim frequency of ‘0.508170’, followed by ‘Suburban’ with average claim amount of ‘$50,124.8431’ and average claim frequency ‘0.520091’, then ‘Highly Rural’ with an average claim amount of ‘$49,998.1321’ and an average claim frequency of ‘0.500402’, then ‘Highly Urban’ with an average claim amount of ‘$49,861.0366’ with an average claim frequency of ‘0.516502’ and lastly, ’Rural’ with the lowest claim amount of ‘$49,778.0202’ and an average claim frequency of ‘0.506381’.

Can we identify any common characteristics among policyholders who make frequent claims?

With proper analysis, we can identify any common characteristics among policyholders who make frequent claims. Therefore, there is a need to query the data in order to get the desired output.

-- Policyholders with frequent claims

SELECT ID,
       marital_status,
       education,
       gender,
       AVG(claim_freq) AS AVG_claim_freq,
       AVG(claim_amt) AS AVG_claim_amt
FROM [Insurance Policies Data] 
GROUP BY ID,
         marital_status,
         education,
         gender
ORDER BY education DESC

Output: From analysis, we can tell that IDs with educational level ‘PhD’ have the highest number of frequent claims while IDs with educational level ‘Bachelors’ has the lowest number of frequent claims.

Premium Optimization

How do current premium amounts relate to the risk profiles of policyholders?

Since there is no specific column for premium amount in the entire dataset, it is therefore mandatory to create a column where we see the various amounts for different IDs.

-- Premium amount column

ALTER TABLE [Insurance Policies Data]
ADD premium_amount DECIMAL(10, 2);

-- Lets assume $50 to be the rate of interest set by the Central Bank (Base Rate)
-- no discount and surcharge
-- With these codes, we get the actual values for Premium_ amount_column using the $50 rate of interest

UPDATE [Insurance Policies Data]
SET premium_amount = 50 *
    CASE
        WHEN DATEDIFF(YEAR, birthdate, GETDATE()) < 30 THEN 5.0
        WHEN DATEDIFF(YEAR, birthdate, GETDATE()) BETWEEN 30 AND 45 THEN 3.0
  WHEN DATEDIFF(YEAR, birthdate, GETDATE()) BETWEEN 46 AND 60 THEN 1.5
        ELSE 1.0
    END *
    CASE
        WHEN gender = 'Male' THEN 1.1
        ELSE 1.0
    END *
    CASE
        WHEN marital_status = 'Single' THEN 1.5
  WHEN marital_status = 'Divorced' THEN 1.4
  WHEN marital_status = 'Married' THEN 1.2
        ELSE 1.0
    END

Output: The premium column shows the various premium amount for all the IDs in the entire dataset. With the help of the premium amount, we can relate to the risk profile of policyholders.

SELECT ID,
       gender,
       marital_status,
    education,
    coverage_zone,
       household_income,
       AVG(premium_amount) AS AVG_premium_amt,
       AVG(claim_freq) AS AVG_claim_freq,
       AVG(claim_amt) AS AVG_claim_amt,
       COUNT(ID) AS Total_policyholders
FROM [Insurance Policies Data]
GROUP BY ID,
         coverage_zone,
         household_income,
         education,
         gender,
         marital_status
ORDER BY AVG_premium_amt DESC;

The analysis shows that premium amount looks constant on all record. It clearly shows that Male gender that are Single possess the highest average premium amount ‘$412.50’, followed by Female gender that are also Singles with an average premium amount ‘$375.00’, then Female gender that are Divorced with average premium amount ‘$350.00’ and lastly Female gender that are Separated has the least average premium amount of ‘$50.00’.

Can we recommend adjustments to the premium pricing model to better reflect risk levels and improve profitability?

For proper adjustment to be made to the premium pricing model to be reflect risk levels and improve profitability, we must analyze the relationship between various factors to ensure we moving in the right direction.

SELECT ID,
       gender,
       marital_status,
    education,
    coverage_zone,
       household_income,
       car_make,
       car_model,
       car_year,
    AVG(claim_freq) AS AVG_claim_freq,
       AVG(claim_amt) AS AVG_claim_amt
FROM [Insurance Policies Data]
GROUP BY  ID,
       gender,
       marital_status,
    education,
    coverage_zone,
       household_income,
       car_make,
       car_model,
       car_year
 ORDER BY AVG_claim_amt DESC

Output: After creating the age column, we must query the data in order to get the precise ages for all IDs in the dataset. With the age data set in place, we can group all IDs to know their precise age groups for proper analysis.

-- Declaring the base rate

DECLARE @base_rate DECIMAL(10, 2) = 50.0;

-- Updating premium amount

UPDATE [Insurance Policies Data]
SET premium_amount = 
    50.0 *
    CASE
        WHEN DATEDIFF(YEAR, birthdate, GETDATE()) < 30 THEN 5
        WHEN DATEDIFF(YEAR, birthdate, GETDATE()) BETWEEN 30 AND 45 THEN 3.0
  WHEN DATEDIFF(YEAR, birthdate, GETDATE()) BETWEEN 46 AND 60 THEN 1.5
        ELSE 1.0
    END *
    CASE
        WHEN gender = 'Male' THEN 1.1
        ELSE 1.0
    END *
    CASE
        WHEN marital_status = 'Single' THEN 1.5
  WHEN marital_status = 'Divorced' THEN 1.4
     WHEN marital_status = 'Married' THEN 1.2
        ELSE 1.0
    END *
    CASE
        WHEN coverage_zone = 'Highly Urban' THEN 5.0
  WHEN coverage_zone = 'Urban' THEN 4.0
  WHEN coverage_zone = 'Suburban' THEN 2.0
  WHEN coverage_zone = 'Highly Rural' THEN 1.0
        ELSE 0.5
    END

To check the Age column:

SELECT * FROM [Insurance Policies Data]

Customer Segmentation and Marketing

What are the key characteristics of policyholders with low claim frequencies and high household incomes?

From the dataset, to get the key characteristics of policyholders with low claim frequencies and high household incomes, we must first declare the variables to clarify entry.

-- Clarify various Entry

DECLARE @Low_claim_freq_entry INT = 1;
DECLARE @High_income_entry DECIMAL(10, 2) = 85000.00;

The low claim frequency is set to INT=1 while the high income entry is set to $85,000.00.

-- Query for low claim frequencies and high household incomes

SELECT gender,
       marital_status,
    education,  
       coverage_zone,
       AVG(DATEDIFF(YEAR, birthdate, GETDATE())) AS AVG_age,
       COUNT(ID) AS Total_policyholders,
       AVG(claim_freq) AS AVG_claim_freq,
       AVG(claim_amt) AS AVG_claim_amt,
       AVG(household_income) AS AVG_household_income
FROM [Insurance Policies Data]
WHERE claim_freq <= @Low_claim_freq_entry
    AND household_income >= @High_income_entry
GROUP BY gender,
         marital_status,
         education,
         coverage_zone
ORDER BY AVG_household_incomeDESC;

Output: From the analysis, using the top three by Gender, Marital status, Coverage zone and Education, we can say that the ‘Female’ gender with Marital status ‘Separated’, coverage zone ‘Rural’ and Education ‘PhD’ has the highest average household income, followed by ‘Male’ gender with Marital status ‘Separated’, coverage zone ‘Suburban’ and Education ‘PhD’ and lastly ‘Female’ gender with Marital status ‘Separated’, coverage zone ‘Highly Rural’ and Education ‘High School’.

SELECT gender,
       marital_status,
    education,  
       coverage_zone,
       AVG(DATEDIFF(YEAR, birthdate, GETDATE())) AS AVG_age,
       COUNT(ID) AS Total_policyholders,
       AVG(claim_freq) AS AVG_claim_freq,
       AVG(claim_amt) AS AVG_claim_amt,
       AVG(household_income) AS AVG_household_income
FROM [Insurance Policies Data]
WHERE claim_freq <= @Low_claim_freq_entry
    AND household_income >= @High_income_entry
GROUP BY gender,
         marital_status,
         education,
         coverage_zone
ORDER BY AVG_claim_freq ASC;

From the analysis, using the top three by Gender, Marital status, Coverage zone and Education, we can say that the ‘Female’ gender with Marital status ‘Separated’, coverage zone ‘Urban’ and Education ‘Masters’ has the lowest claim frequency, followed by ‘Female’ gender with Marital status ‘Separated’, coverage zone ‘Rural’ and Education ‘PhD’ and lastly ‘Male’ gender with Marital status ‘Married’, coverage zone ‘Highly Urban’ and Education ‘PhD’.

How can we segment the customer base to identify high-value customers for targeted marketing campaigns?

Since we are only interested in High value customers for targeted campaigns, then, we declare some variables and include some query.

DECLARE @High_income_entry DECIMAL(10, 2) = 85000.00;
DECLARE @Low_claim_freq_entry INT = 0;
DECLARE @High_premium_entry DECIMAL(10, 2) = 90.00;  

SELECT customer_segment,
       total_customers,
       AVG_household_income,
       AVG_claim_freq,
       AVG_premium_amount,
       coverage_zone 
FROM   (SELECT
     CASE
        WHEN household_income >= @High_income_entry AND claim_freq <= @Low_claim_freq_entry AND premium_amount >= @High_premium_entry THEN 'High-value_customers'
        WHEN household_income >= @High_income_entry AND claim_freq <= @Low_claim_freq_entry THEN 'Mid_value_customers'
        WHEN household_income < @High_income_entry AND claim_freq > @Low_claim_freq_entry THEN 'Low_value_customers'
    ELSE 'Standard_value_customers'
    END AS customer_segment,
    COUNT(ID) AS total_customers,
 AVG(household_income) AS AVG_household_income,
    AVG(claim_freq) AS AVG_claim_freq,
    AVG(premium_amount) AS AVG_premium_amount,
 coverage_zone
FROM [Insurance Policies Data]
    GROUP BY
        CASE
        WHEN household_income >= @High_income_entry AND claim_freq <= @Low_claim_freq_entry AND premium_amount >= @High_premium_entry THEN 'High-value_customers'
        WHEN household_income >= @High_income_entry AND claim_freq <= @Low_claim_freq_entry THEN 'Mid_value_customers'
        WHEN household_income < @High_income_entry AND claim_freq > @Low_claim_freq_entry THEN 'Low_value_customers'
    ELSE 'Standard_value_customers'  
 END,
 coverage_zone
) AS total_customer_segmentation
WHERE customer_segment = 'High-value_customers'
ORDER BY coverage_zone;

Output: The analysis is based on high value customers on different segments. For high value customers in ‘Highly Rural’ coverage zone, the total customers are ‘3137’ with average household income of ‘$167,890.6231’ average claim frequency of ‘0’ and average premium amount of ‘$200.58’ then ‘Highly Urban’ coverage zone, the total customers are ‘4377’ with average household income of ‘$166,661.041’ average claim frequency of ‘0’ and average premium amount of ‘$806.94’.

‘Rural’ coverage zone, has total customers of ‘1912’ with average household income of ‘$165,934.1632’, average claim frequency of ‘0’ and average premium amount of ‘$131.98’, also ‘Suburban’ coverage zone, has total customers of ‘4389’ with average household income of ‘$167,887.2972’, average claim frequency of ‘0’ and average premium amount of ‘$327.94’ and lastly ‘Urban’ coverage zone, has total customers of ‘4378’ with average household income of ‘$166,976.6059’, average claim frequency of ‘0’ and average premium amount of ‘$652.94’.

Demographic Analysis

How does the distribution of policyholders vary across different demographic factors (age, gender, marital status)?

The distribution of policyholders varies across different demographic factors so I queried the data in order to get the distribution for proper analysis.

SELECT gender,
       marital_status,
       DATEDIFF(YEAR, birthdate, GETDATE()) AS age,
       COUNT(ID) AS total_policyholders
FROM [Insurance Policies Data]
GROUP BY gender,
         marital_status,
         DATEDIFF(YEAR, birthdate, GETDATE())  
ORDER BY  gender, marital_status,age ASC;

Output: Categorizing the distribution of policyholders into groups was best for the desired analysis. Age group ‘Young’(<30) has the highest number of policyholders across all marital statuses and genders, followed by age groups (30–45), while age group (>60) has the least amount of policyholders.

Policyholders reduced drastically from ages 65 and above. Both genders showed similar movement in terms of the number of policyholders across different ages and marital statuses. There is no significant difference in the number of male and female policyholders in most age and marital status categories.

Are there any noticeable trends in car usage and ownership among different demographic groups?

From analysis on the data, we can notice that there is a trend.

-- For noticable trends

SELECT gender,
       marital_status,
       AVG(CAST(kids_driving AS DECIMAL)) AS AVG_kids_driving,
       AVG(CAST(parent AS DECIMAL)) AS AVG_parent,
       AVG(car_year) AS AVG_car_year,
       COUNT(ID) AS Total_policyholders
FROM [Insurance Policies Data]
GROUP BY gender,
         marital_status
ORDER BY Total_policyholders DESC;

Output: The analysis shows that the average car year for all gender falls under ‘2000’. The ‘Female’ gender that are ‘Single’ has the highest number of policyholders while ‘Male’ gender that are ‘Separated’ has the least amount of policyholders.

It also shows that ‘Female’ gender that are ‘Separated’ has the highest number of kids driving ‘ 0.469104’ while ‘Female’ gender that are ‘Divorced’ has the least number of kids driving ‘0.401205’. Also ‘Female’ gender that are ‘Separated’ has the highest number of average parent ‘0.466582’ while ‘Female’ gender that are ‘Divorced’ has the least number of average parent ‘0.428480’.

Geographical Analysis

How do claim frequencies and amounts vary across different coverage zones?

To know how the claim frequencies and amounts vary across different coverage zones, we must query the data properly in order to get a desired output.

For Average claim frequency

-- For claim frequencies

SELECT coverage_zone,
       AVG(claim_freq) AS AVG_claim_freq,
       AVG(claim_amt) AS AVG_claim_amt,
       COUNT(ID) AS Total_policyholders
FROM [Insurance Policies Data]
GROUP BY coverage_zone
ORDER BY AVG_claim_freq DESC;

Output: The analysis shows that ‘Suburban’ has the highest number of average claim frequency ‘0.520091’ while ‘Highly Rural’ has the least number of average claim frequency ‘0.500402’.

For Average claim amount

-- For average claim amount

SELECT coverage_zone,
       AVG(claim_freq) AS AVG_claim_freq,
       AVG(claim_amt) AS AVG_claim_amt,
       COUNT(ID) AS Total_policyholders
FROM [Insurance Policies Data]
GROUP BY coverage_zone
ORDER BY AVG_claim_amt DESC;

Output: From the analysis, we can see that ‘Urban’ has the highest average claim amount ‘$50,377.7303’ while ‘Rural’ has the least average claim amount ‘$49,778.0202’.

Are there any regional trends or patterns that should be taken into consideration for marketing or risk assessment?

Since we are told to analyze the regional trend, we must first of all create a table called Regional-trend, then use INNER JOIN for joining the Regional_trend table and the Insurance_Policies_Data table to get the desired output.

WITH Regional_trend AS (
    SELECT coverage_zone,
           AVG(claim_freq) AS AVG_claim_freq,
           AVG(claim_amt) AS AVG_claim_amt,
           AVG(premium_amount) AS AVG_premium_amt,
           COUNT(ID) AS Total_policyholders
    FROM [Insurance Policies Data]
    GROUP BY coverage_zone
)
SELECT rt.coverage_zone,
       rt.AVG_claim_freq,
       rt.AVG_claim_amt,
       rt.AVG_premium_amt,
       rt.Total_policyholders,
       AVG(DATEDIFF(YEAR, ip.birthdate, GETDATE())) AS AVG_age,
       AVG(ip.household_income) AS AVG_household_income,
       COUNT(DISTINCT ip.gender) AS Total_genders,
       COUNT(DISTINCT ip.marital_status) AS Total_marital_statuses
FROM Regional_trend rt
    JOIN [Insurance Policies Data] ip ON rt.coverage_zone = ip.coverage_zone
GROUP BY rt.coverage_zone,
         rt.avg_claim_freq,
         rt.avg_claim_amt,
         rt.avg_premium_amt,
         rt.Total_policyholders
ORDER BY rt.coverage_zone DESC;

Output: From analysis, ‘Urban’ has the highest average claim amount ‘$50,377.7303’, followed by ‘Suburban’ with average claim amount ‘$50,124.8431 and ‘Rural’ with the least average claim amount ‘$49,778.0202’.

‘Urban’ has the highest number of policyholders ‘7,588’, followed by ‘Rural’ with ‘7,522’ and lastly ‘Highly Rural’ with the least number of policyholders ‘7,452’. All coverage zone ha the same number of Gender ‘2’ and Marital status ‘4’.

‘Highly Urban’ has the highest average premium amount ‘$817.07’ and ‘Rural’ has the least average premium amount ‘83.28’.

Customer Behavior Insights

Are there any trends or patterns in the behavior of policyholders who have children driving?

To find any trends or patterns in the behavior of policyholders who have children driving, we must ensure the data is queried properly.

WITH All_policy_holders_with_children_driving AS (
    SELECT *, 
 DATEDIFF(YEAR, birthdate, GETDATE()) AS age_in_years,
    CASE 
  WHEN kids_driving < 0 THEN 'No' ELSE 'Yes' 
 END AS having_kids_driving_tag
    FROM [Insurance Policies Data]
)
SELECT having_kids_driving_tag AS having_kids_driving,
       AVG(claim_freq) AS AVG_claim_freq,
       AVG(claim_amt) AS AVG_claim_amt,
       AVG(premium_amount) AS AVG_premium_amt,
       AVG(age_in_years) AS AVG_age,
       AVG(household_income) AS AVG_household_income,
       COUNT(ID) AS Total_policyholders
FROM All_policy_holders_with_children_driving
GROUP BY having_kids_driving_tag
ORDER BY having_kids_driving_tag;

Output: From the analysis, it shows that the average age of parents who have kids driving is ‘48’ with the total number of policyholder to be ‘37,542’, average claim amount ‘$50,028.514’, average claim frequency of ‘0.510308’, average household income of ‘$147247.40’ and average premium amount of ‘$410.03’.

How does the presence of children driving affect the frequency and number of claims?

WITH All_policy_holders_with_children_driving AS (
    SELECT *,
        CASE 
     WHEN kids_driving < 0 THEN 'No' ELSE 'Yes' 
 END AS having_kids_driving_tag
        FROM [Insurance Policies Data]
)
SELECT having_kids_driving_tag AS having_kids_driving,
       AVG(claim_freq) AS avg_claim_freq,
       AVG(claim_amt) AS avg_claim_amt,
       COUNT(ID) AS Total_policyholders
FROM All_policy_holders_with_children_driving
GROUP BY having_kids_driving_tag
ORDER BY Total_policyholders;

Output: ‘0.510000’ is the average frequency of claims for children driving, according to the data, and ‘$50,028.514’ is the average amount of claims both of which are less than the case of no children driving. and there are ‘10,857’ policyholders but a total of ‘37,542’. Since there is little variation in claim frequency across families with and without children who drive, the presence of children driving does not significantly affect the frequency of claims. Similarly, households with drivers had an average claim amount that is marginally lower, but not significantly lower. The higher number of policyholders in non-child-driving families may indicate a higher prevalence of such households within the dataset.

DATA VISUALIZATION FOR THE DATASET

CONCLUSION

The business may maximize risk management, boost consumer happiness, and improve marketing efficacy by utilizing these information. High-value consumers can be attracted and retained with the aid of tailored tactics based on this data, which will ultimately spur business expansion and profitability. In order to attract customers in rural and highly rural areas, emphasize pricing and packaged products.

I am appreciative of your reading. More to come.

CAR INSURANCE POLICY ANALYSIS

Written by IGIRIGIRI JOSEPH UCHECHUKWU