SQL ANALYSIS OF AN INSURANCE COMPANY DATA : Trends and Discoveries

Edohamhen Lucky
12 min readJun 30, 2024

--

PROJECT OVERVIEW

Splendor Analytics Insurance is a company specialising in offering a variety of insurance products to its customers. The company’s portfolio includes auto insurance, which covers losses or damages related to vehicles. The provided dataset contains anonymised information about the policyholders, their vehicles, and their claim histories. The goal of this project is to analyse this dataset to uncover insights that can help Splendor Analytics Insurance improve its services, optimise premiums, and reduce claim costs.

Splendor Analytics Insurance aims to enhance its auto insurance division by leveraging data analysis to:

1. Identify patterns in claim frequencies and amounts.

2. Understand the demographics and characteristics of high-risk and low-risk policyholders.

3. Optimise premium pricing models based on risk factors.

4. Develop targeted marketing strategies to attract and retain customers.

PROJECT BRIEF

You are giving a dataset and you are tasked with analysing the provided dataset to help Splendor Analytics Insurance achieve the above objectives. Your analysis should be thorough and provide actionable insights. Your analysis should be thorough and provide actionable insights. Use any data analysis tool of your choice (e.g., Python, R, Excel, Tableau, SQL, Power BI etc) to conduct this analysis.

Deliverables

1. Cleaned dataset.

2. Dashboard/Report

3. Detailed report answering the key questions.

4. Presentation slides summarising key insights and recommendations.

Data Dictionary

Column Name : description

ID:Unique identifier for the policyholder

birthdate:Birthdate of the policyholder

marital_status:Marital status of the policyholder (Single, Married, etc.)

car_use: Purpose of car use (Private, Commercial)

gender: Gender of the policyholder (Male, Female)

kids_driving: Number of kids of the policyholder who are driving

parent: Whether the policyholder is a parent (Yes, No)

education: Education level of the policyholder (High School, Bachelors, etc.)

car_make: Make of the policyholder’s car

car_model: Model of the policyholder’s car

car_color: Color of the policyholder’s car

car_year: Year of the policyholder’s car

claim_freq: Frequency of claims made by the policyholder

coverage_zone: Area of coverage (Highly Urban, Urban, Suburban, Rural, Highly Rural)

claim_amt: Amount claimed by the policyholder

household_income: Household income of the policyholder

DATA CLEANING

Data cleaning took place after creating a database for this project “InsuranceDB”, then we import our data

The first step is to remove the”$” from the claim_amt and household_income column, then convert the string to “varchar”to “decimal” so we can perform numeric calculations.

- Data cleaning
UPDATE Insurance
SET claim_amt = REPLACE(claim_amt, '$', '');
-
UPDATE Insurance
SET household_income = REPLACE(household_income, '$', '');
- Alter table to change column type to DECIMAL
ALTER TABLE Insurance
ALTER COLUMN claim_amt DECIMAL(20, 2);
-
ALTER TABLE Insurance
ALTER COLUMN household_income DECIMAL(20, 2);

ADDING THE “AgeRange” COLOMN

--Adding a new date column
ALTER TABLE Insurance
ADD DateOfBirth date;
-- Update the new date column with converted values
UPDATE Insurance
SET DateOfBirth = TRY_CONVERT(date, birthdate, 101);
-- Drop the old varchar column if no longer needed
ALTER TABLE Insurance
DROP COLUMN birthdate;

The addition of this column will facilitate age group segmentation in our analysis, enhancing the depth and thoroughness of our investigation.

--Adding age range
ALTER TABLE Insurance
ADD AgeGroup AS (CASE
WHEN DATEDIFF(YEAR, DateOfBirth, GETDATE()) BETWEEN 0 AND 9 THEN 'Children'
WHEN DATEDIFF(YEAR, DateOfBirth, GETDATE()) BETWEEN 10 AND 17 THEN 'Adolescents'
WHEN DATEDIFF(YEAR, DateOfBirth, GETDATE()) BETWEEN 18 AND 34 THEN 'Youths'
WHEN DATEDIFF(YEAR, DateOfBirth, GETDATE()) BETWEEN 35 AND 64 THEN 'Middle-aged'
WHEN DATEDIFF(YEAR, DateOfBirth, GETDATE()) >= 65 THEN 'Seniors'
ELSE 'Unknown'
END);

KEY QUESTIONS

1. Claim Frequency and Amount Analysis

— -What are the average claim frequencies and amounts for different demographic groups (e.g., age, gender, marital status)?

GENDER

--GENDER
SELECT
gender,
AVG(claim_freq) AS AverageClaimFrequency,
ROUND(AVG(claim_amt), 0) AS AverageClaimAmount
FROM Insurance
GROUP BY gender
ORDER BY gender;
While the average claim frequency is evenly distributed between genders, the male gender exhibits the highest average claim amount

MARITAL STATUS


--Marital Status
SELECT marital_status,
AVG(claim_freq) AS AverageClaimFrequency,
Round (AVG(claim_amt),0) AvgClaimAmount
FROM Insurance
GROUP BY marital_status
ORDER BY AvgClaimAmount DESC;
While the average claim frequency is evenly distributed between all marital status while married people exhibits the highest average claim amount

AGE RANGE

--AgeRange
SELECT AgeGroup,
AVG(claim_freq) AS AverageClaimFrequency,
Round (AVG(claim_amt),0) AvgClaimFreq
FROM Insurance
GROUP BY AgeGroup
ORDER BY AvgClaimFreq DESC;
Across all demographic groups, claim frequency is evenly distributed, with seniors, married individuals, and males standing out in each demographic for their higher claim amounts.

— -Are there any specific vehicle characteristics (e.g., make, model, year) that correlate with higher claim frequencies or amounts?

-- Focusing on highest claim frequency (4)
SELECT TOP 1 car_make,car_model,car_year,
COUNT(claim_freq) AS ClaimFrequency,
AVG(claim_amt) AS ClaimAmount
FROM Insurance
GROUP BY car_make, car_model, car_year
HAVING COUNT(claim_freq) IN (4)
ORDER BY ClaimAmount DESC;
The vehicle characteristic most correlated with higher claim frequencies or amounts is predominantly driving a Lexus SC 1994, with an average claim amount of $91,487

2. Risk Assessment

— -Which factors (e.g., household income, education level, coverage zone) are most indicative of high-risk policyholders?

— — The most indicative factors are household income earners earning less than $50,000 ; Lower income households may face challenges in accessing healthcare or maintaining property (cars), potentially leading to higher insurance claims.

— — Claim frequency of 3 and 4 (highest claim frequency)

— — Age Group of youths and seniors ; younger drivers are statistically more likely to be involved in accidents due to less driving experience and riskier behaviour.

— — Living in highly urban and urban areas; areas with dense population and higher incidence of crime or accidents.

SELECT coverage_zone,
SUM(CASE WHEN household_income < 100000 AND claim_freq IN (3, 4) AND AgeGroup IN ('Youths', 'Seniors') THEN 1 ELSE 0 END) AS HighRisk_PolicyHolder
FROM Insurance
WHERE coverage_zone IN ('Highly Urban', 'Urban')
GROUP BY coverage_zone
ORDER BY HighRisk_PolicyHolder DESC;

— -Can we identify any common characteristics among policyholders who make frequent claims?

ADDING THE “policy_holder” COLOMN

The addition of this column will facilitate the grouping of our policyholders into high-risk and low-risk categories, thereby enhancing the depth and thoroughness of our analysis.

-- Adding a new column 'policy_holders' 
ALTER TABLE Insurance
ADD policy_holders VARCHAR(20);

-- Update the 'policy_holders' column based on indicative factors
UPDATE Insurance
SET policy_holders = CASE
WHEN (coverage_zone = 'Urban' OR coverage_zone = 'Highly urban') AND household_income < 100000 AND AgeGroup IN ('Youths' , 'Seniors') THEN 'High'
ELSE 'Low'
END

--common characteristics
SELECT TOP 1 car_make,policy_holders,marital_status,AgeGroup,car_use,car_make,coverage_zone,
COUNT(policy_holders) AS COUNTS
FROM Insurance
WHERE claim_freq IN (3,4)
GROUP BY car_make,policy_holders,marital_status,AgeGroup,car_use,car_make,coverage_zone
ORDER BY COUNTS DESC
The most common characteristics from policy holders who make frequent claims (3,4) are policy holders who are single within the middle-aged group driving a Chevrolet and lives in highly rural areas. There are 16 of them in that category.

3. Customer Segmentation and Marketing

— -What are the key characteristics of policyholders with low claim frequencies and high household incomes?

ADDING THE “income_earners” COLOMN

The addition of this column will facilitate the grouping of our policyholders based on their household income. Policyholders with incomes greater than $100,000 will be categorised as high-income earners, while those earning less than $100,000 will be categorised as low-income earners.

- Adding a new column 'income_earners'
ALTER TABLE Insurance
ADD income_earners VARCHAR(20);
- Update the 'income_earners' column based on household income
UPDATE Insurance
SET income_earners = CASE
WHEN household_income > 100000 THEN 'High'
ELSE 'Low'
END;
- Characteristics
SELECT TOP 1 car_use, marital_status,gender,parent, AgeGroup,
COUNT(*) AS policy_holders
FROM Insurance
WHERE income_earners = 'high' AND claim_freq IN (0,1,2)
GROUP BY car_use, marital_status,gender,parent, AgeGroup
ORDER BY policy_holders DESC;
Key characteristics with low claim frequency(0,1,2) and high household income (>$100,000) are low female policy holders within the middle aged group who drives Chevrolet living in the suburban area. This category consist of 36 policy holders

— -How can we segment the customer base to identify high-value customers for targeted marketing campaigns?

High value customers in this context are particularly those owning private vehicles, residing in urban/highly urban areas and families with kids driving

SELECT COUNT(*) AS HighValueCustomers
FROM (SELECT ID, household_income, coverage_zone,
CASE
WHEN claim_freq IN (0,1,2) AND car_use ='Private' AND kids_driving IN (1,2,3) AND household_income >= 150000 THEN 'High-Value'
ELSE 'Low'
END AS CustomerSegment
FROM Insurance
WHERE AgeGroup IN ('Middle-aged')
AND coverage_zone IN ('Rural','Highly Rural') ) AS HighValueSegments
WHERE CustomerSegment = 'High-Value';
We currently have 871 high-value customers

4. Demographic Analysis

— -How does the distribution of policyholders vary across different demographic factors (age, gender, marital status)?

AGE

- Age
SELECT AgeGroup,
COUNT(CASE WHEN policy_holders = 'High' THEN 1 END) AS High_PolicyHolders,
COUNT(CASE WHEN policy_holders = 'Low' THEN 1 END) AS Low_PolicyHolders
FROM Insurance
GROUP BY AgeGroup
ORDER BY High_PolicyHolders DESC;
The distribution of policyholders by age group shows a larger proportion of low-risk policyholders, particularly in the middle-aged group. Meanwhile, there are no middle-aged individuals classified as high-risk policyholders. The majority of high-risk policyholders are youths, totalling over 900 persons

GENDER

        --Gender
SELECT gender,
COUNT(CASE WHEN policy_holders = 'High' THEN 1 END) AS High_PolicyHolders,
COUNT(CASE WHEN policy_holders = 'Low' THEN 1 END) AS Low_PolicyHolders
FROM Insurance
GROUP BY gender;
We have a higher number of female high-risk policyholders compared to males. Meanwhile, the distribution of low-risk policyholders is evenly balanced between genders

MARITAL STATUS

--marital status
SELECT marital_status,
COUNT(CASE WHEN policy_holders='High'THEN 1 END) AS High_PolicyHolders,
COUNT(CASE WHEN policy_holders = 'Low' THEN 1 END) AS Low_PolicyHolders
FROM Insurance
GROUP BY marital_status
Singles constitute the majority in both high-risk and low-risk policyholder categories.

— -Are there any noticeable trends in car usage and ownership among different demographic groups?

AGE GROUP

--Age group
SELECT car_use,
COUNT(CASE WHEN AgeGroup = 'Youths' THEN 1 END) AS Youth,
COUNT(CASE WHEN AgeGroup = 'Middle-aged' THEN 1 END) AS MiddleAged,
COUNT(CASE WHEN AgeGroup = 'Seniors' THEN 1 END) AS Seniors
FROM Insurance
GROUP BY car_use;

GENDER

--Gender
SELECT car_use,
COUNT(CASE WHEN gender= 'Male' THEN 1 END) AS Male,
COUNT(CASE WHEN gender = 'Female' THEN 1 END) AS Female
FROM Insurance
GROUP BY car_use;

MARITAL STATUS

--marital status 
SELECT car_use,
COUNT(CASE WHEN marital_status = 'Single' THEN 1 END) AS Single,
COUNT(CASE WHEN marital_status= 'Married' THEN 1 END) AS Married,
COUNT(CASE WHEN marital_status = 'Divorced' THEN 1 END) AS Divorced,
COUNT(CASE WHEN marital_status= 'Seperated' THEN 1 END) AS Seperated
FROM Insurance
GROUP BY car_use;

5. Geographical Analysis

— - How do claim frequencies and amounts vary across different coverage zones?

SELECT coverage_zone,
ROUND (AVG(claim_freq),3) AS ClaimFreq,
AVG(claim_amt) AS ClaimAmount
FROM Insurance
GROUP BY coverage_zone;
Suburban areas, despite having higher claim frequencies, might warrant higher premiums due to their higher average claim amounts.

— -Are there any regional trends or patterns that should be taken into consideration for marketing or risk assessment?

COVERAGE ZONE

-- Number of policiy holders and average claim amount by coverage area
SELECT coverage_zone,
COUNT(*) AS policy_holders,
AVG(claim_amt) AS AvgClaimAmount
FROM Insurance
GROUP BY coverage_zone
ORDER BY AvgClaimAmount DESC;
Urban areas have the highest number of policyholders (7,588), with an average claim amount of $50,377.74. Suburban regions follow closely with 7,466 policyholders and an average claim amount of $50,124.85. Highly Rural areas exhibit slightly lower numbers of policyholders (7,452) but with a comparable average claim amount of $49,998.14. Highly Urban areas have the highest average claim amount at $49,861.04, despite having slightly fewer policyholders (7,514). Rural regions show similar trends to Highly Urban areas, with 7,522 policyholders and an average claim amount of $49,778.03

CAR USE

- - Car type
SELECT coverage_zone,
COUNT(CASE WHEN car_use='Commercial' THEN 1 END) AS Commercial,
COUNT(CASE WHEN car_use='Private' THEN 1 END) AS Private
FROM Insurance
GROUP BY coverage_zone;
Highly Urban areas show the highest number of policyholders for both commercial and private car types, with 1,026 and 1,030 policyholders respectively, while Highly Rural areas have the lowest number of policyholders for both types, with 973 and 998 policyholders respectively.

AGE GROUP

--Age group
SELECT coverage_zone,
COUNT(CASE WHEN AgeGroup = 'Youths' THEN 1 END) AS Youth,
COUNT(CASE WHEN AgeGroup = 'Middle-aged' THEN 1 END) AS MiddleAged,
COUNT(CASE WHEN AgeGroup = 'Seniors' THEN 1 END) AS Seniors
FROM Insurance
GROUP BY coverage_zone;
Relatively balanced distribution across age groups, although middle-aged policy holders are prominently represented across all regions. Target industries or businesses with specialized insurance products to commercial vehicle usage, focusing on fleet management and operational risks.

6. Customer Behaviour Insights

— -Are there any trends or patterns in the behavior of policyholders who have children driving?

- policyholders with children driving
SELECT
CASE WHEN kids_driving IN (1, 2, 3) THEN 'Yes' ELSE 'No' END AS ChildrenDriving,
SUM(CASE WHEN car_use = 'Commercial' THEN 1 ELSE 0 END) AS Commercial,
SUM(CASE WHEN car_use = 'Private' THEN 1 ELSE 0 END) AS Private,
SUM(CASE WHEN coverage_zone = 'Highly Urban' THEN 1 ELSE 0 END) AS HighlyUrban,
SUM(CASE WHEN coverage_zone = 'Urban' THEN 1 ELSE 0 END) AS Urban,
SUM(CASE WHEN coverage_zone = 'Rural' THEN 1 ELSE 0 END) AS Rural,
SUM(CASE WHEN coverage_zone = 'Highly Rural' THEN 1 ELSE 0 END) AS HighlyRural,
SUM(CASE WHEN coverage_zone = 'Suburban' THEN 1 ELSE 0 END) AS Suburban
FROM Insurance
GROUP BY CASE WHEN kids_driving IN (1, 2, 3) THEN 'Yes' ELSE 'No' END;
Families with children driving are more likely to opt for private vehicles (higher counts in Private category), possibly due to perceived safety, convenience, and flexibility associated with private transportation for family needs. Urban areas show higher counts in both categories with or without children, indicating a strong presence of policy holders residing in urban settings irrespective of having children driving.

— -How does the presence of children driving affect the frequency and number of claims?

SELECT
CASE WHEN kids_driving IN (1, 2, 3) THEN 'Yes' ELSE 'No' END AS ChildrenDriving,
ROUND(AVG(claim_freq),3) AS AvgClaimFrequency,
SUM(claim_freq) AS NoOfClaims,
COUNT(*) AS policy_holders
FROM Insurance
GROUP BY CASE WHEN kids_driving IN (1, 2, 3) THEN 'Yes' ELSE 'No' END;
The presence of children appears to have a slight impact on claim frequency, with those without children showing a slightly higher average frequency of 0.512 compared to 0.506 for those with children. In terms of claim numbers, policyholders without children filed 13,662 claims out of 26,685 policyholders, whereas policyholders with children filed 5,496 claims out of 10,857 policyholders

INSIGHTS

Splendor Analytics Insurance company has a total of 37,542 Policy Holders and provides insurance coverage for over a thousand car models.

High-Risk Policyholders features

- Youths and Senior drivers

- Drivers in highly urban and urban areas

- Lower- income households

  • Frequent claim history

High-risk policyholders typically include youths and senior drivers, residents of highly urban and urban areas, individuals from lower-income households, and those with a history of frequent claims. These characteristics suggest increased risk due to factors such as inexperience (youths), potential health issues (seniors), higher traffic density (urban areas), financial constraints (lower-income households), and a pattern of past incidents (frequent claims). Insurance companies use this understanding to customize their strategies for risk assessment, pricing, and targeted interventions aimed at improving safety and reducing claims among these high-risk groups.

Low-Risk Policyholders features

- Middle aged drivers

- Drivers in suburban and rural areas

- High- income households

  • Zero to low claim history

Low-risk policyholders are typically middle-aged drivers residing in suburban and rural areas, often from high-income households, and with a history of zero to low insurance claims. These characteristics indicate a lower risk profile due to factors such as experience and maturity in driving (middle-aged), lower traffic density and safer driving conditions (suburban and rural areas), higher financial resources for vehicle maintenance and safety (high-income households), and a history of safe driving behaviors (low claim history). Insurance companies leverage this understanding to offer tailored products with lower premiums, incentives for maintaining safe driving habits, and strategies to maintain customer satisfaction through fewer claims and smoother insurance experiences for these low-risk groups.

In terms of car usage, higher preference or ownership of private cars among policy holders .This demographic likely values the flexibility, style, and affordability associated with private vehicles.

There’s a consistent distribution across age groups, reflecting stable policy holder demographics. The distribution of youth policy holders across different regions (Highly Rural, Urban, Suburban, Highly Urban, Rural) varies slightly, but generally shows a balanced representation across all regions.Middle-aged policy holders are prominently represented across all regions, indicating a significant market presence.

Families with children driving are more likely to opt for private vehicles (higher counts in Private category), possibly due to perceived safety, convenience, and flexibility associated with private transportation for family needs. Urban areas show higher counts in both categories with or without children, indicating a strong presence of policy holders residing in urban settings irrespective of having children driving.

RECOMMENDATIONS

Given the higher preference for and ownership of private cars among policy holders, insurers should develop tailored insurance products that emphasise flexibility, style, and affordability. This can include customisable coverage options, discounts for safe driving, and features that appeal to the lifestyle preferences of private car owners.

While there is a consistent distribution across age groups, insurers can enhance regional targeting strategies based on slight variations in youth distribution across different regions (Highly Rural, Urban, Suburban, Highly Urban, Rural). This involves tailoring marketing efforts and insurance packages to address specific regional needs and preferences.

Recognising that families with children driving prefer private vehicles due to perceived safety and convenience. Insurers should offer specialized family insurance packages. These could include coverage for student drivers, comprehensive safety features, and discounts for multi-car policies to attract and retain this demographic.

Given the strong presence of policy holders in urban areas, regardless of children driving, insurers should focus on developing urban-specific insurance solutions. This could involve addressing urban driving challenges such as congestion, theft prevention, and parking issues through innovative coverage options and customer-centric services

CONCLUSION

Insurance providers should leverage insights into private vehicle ownership, regional demographics, and family-oriented preferences to tailor their product offerings effectively.

By understanding and addressing these preferences, insurers can enhance customer satisfaction, improve market penetration in specific demographic segments, and maintain competitiveness in the evolving insurance landscape.

Continuous monitoring of demographic trends and customer behaviours will be essential for adapting insurance strategies to meet the evolving needs and expectations of policy holders across different age groups and regions.

Thank you for reading
Kindly interact with my slides and SQL codes here and tell me what you think about it.

Do well to share.
THANK YOU.

Please Kindly Follow me on LinkedIn
Website: https://lawkeytheengine.github.io/

--

--