How to: Manage your spatial data to make analytics easy

Daniel van der Maas
6 min read · May 18, 2023

Extracting value from spatial data is very time-consuming. As spatial data is growing faster than legacy systems can handle, accessibility and integration challenges result in frustration and lengthy manual data wrangling for anyone looking to leverage location data in their business processes.

The burden of finding a way to deal with this mess and create spatially aware intelligence and stochastics (in spite of all the trouble this implies) often falls to Data Scientists, Analysts, and Engineers.

Seeing as this is a professional group for whom I have vowed to make using spatial data easy, allow me to present a way to automate the part of your work that you didn’t actually sign up for!

This article aims to showcase how Data Scientists and Analysts can fast-track their work by relying on some off-the-shelf solutions in Python and Ellipsis Drive to automate the spatial data transformation and unification that is required for analysis. For this, I will present a concrete use case with some small but realistic sample data.

The use case

Exposure management for insurers
In the P&C insurance industry I have observed many day-to-day data management challenges for professionals working in risk & portfolio management, modelling, underwriting, and claims, mainly because data in this context tends to be distributed and large in both scale and variety. Decades of IT legacy also make (modern) spatial data hard to search through, query, and integrate into old workflows and systems.

From the perspective of the re-/insurer, gaining detailed knowledge about the specifics of their policies, accounts, and portfolios is crucial. This allows them to conduct a more precise evaluation of the risks they carry, to determine the appropriate premiums for insured parties, or to balance out their portfolio. However, they can only conduct such precise risk evaluations when 1) they have comprehensive information about the policies, accounts, and portfolios under their management and 2) this data is searchable and accessible for simultaneous, scalable querying. Something legacy systems don’t allow!

The goal
The ultimate goal is to optimise exposure management. For this, specific data points are crucial, such as the occupancy type (e.g., residential, commercial, industrial) of insured buildings, the construction year and height, and even more advanced details such as secondary modifiers, which encompass additional factors like square footage, roof type, roof anchors, sprinkler systems, etc.

In our example of how to fast-track your work, we will be using the following data sets:

  1. A re-/insurer’s portfolio of active policies in Pennsylvania, Maryland and Virginia: In this (demo) file we can find all the policies underwritten by the insurer in that area and all the related information that has been gathered by the insurer through their underwriting or claims process.
  2. A Flood Risk dataset for Pennsylvania, Maryland and Virginia: Acquired from an external NATCAT model vendor (such as Verisk, RMS, Fathom, JBA or, in this case, Answr).
  3. A windstorm risk dataset for Pennsylvania, Maryland and Virginia: Also acquired from an external vendor (in this case also Answr).
  4. A dataset with additional data on building attributes: Acquired from a provider of secondary-modifier information (such as Precisely, Lightbox or MBI).

What would we like to know to better manage our exposure?
One of the many aspects we may be interested in is the average age of buildings within our portfolio. This information provides insights into the prevailing construction norms when they were built and enables us to assess if proper maintenance has been carried out. Understanding the age distribution is particularly significant when evaluating the portfolio’s resilience to natural catastrophes, as building techniques have evolved over the years. Having a majority of buildings dating back to 1920, for example, introduces a risk factor that must be taken into account. By leveraging the data that we activate through Ellipsis Drive, we can thoroughly analyse our portfolio, identify potential vulnerabilities, and make informed decisions to optimise our risk management.

Let’s do it!
Now, let’s see how easy it is to extract key insights when managing our input files in Ellipsis Drive.

Step 1: I create layers in Ellipsis Drive and upload my files to the corresponding layers. Once this is done, the Drive will process my files and publish them as interactive data layers:

Step 2: In order to check whether my data was added correctly, I can click one of the layers and inspect the locations I find inside. I can for example check out my portfolio information:
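If I prefer to do this check from code, I can also pull a small sample through the same Python package that the script in Step 4 relies on. A minimal sketch, reusing the demo login and the call shown later; the path and timestamp IDs below are placeholders you would copy from your own portfolio layer in the Drive:

import ellipsis as el

# Log in with the demo credentials (same as in the full script below)
token = el.account.logIn('demo_user', 'demo_user')

# Placeholder IDs: copy the pathId and timestampId from your portfolio layer
pathId = '<portfolio-path-id>'
timestampId = '<portfolio-timestamp-id>'

# Small bounding box (longitude/latitude) to sample from
extent = {'xMin': -77.163, 'yMin': 38.812, 'xMax': -76.929, 'yMax': 38.968}

# The result comes back as a GeoDataFrame; a quick look at the first rows
sample = el.path.vector.timestamp.getFeaturesByExtent(pathId=pathId, timestampId=timestampId, extent=extent, token=token)['result']
print(sample.head())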

Step 3: I can manually overlay all the remaining data layers (on flood risk, windstorm risk and building attributes) in the viewer. This results in the display of all my input information for Virginia, Maryland and Pennsylvania on a dynamic map:

Step 4: With the visual inspection of my data done, I can now start to create an actual Python script to perform my analysis. Since Ellipsis Drive publishes the layers in the exact same way regardless of the original file type or size, I can easily combine and overlay the data in my script (data standardization and unification are completely automated!).

So let’s make a query! (the fun part)

To understand the distribution of risk in my portfolio in the Washington DC area, I may want to know the portion of my active policies that cover buildings constructed in or before 1950, with a concrete & brick foundation, that are located in a potential flood zone.

My script would look as follows:

import ellipsis as el

token = el.account.logIn('demo_user', 'demo_user')

washington_extent = {'xMin': -77.163, 'yMin': 38.812, 'xMax': -76.929, 'yMax': 38.968}

#Retrieve flood risk around Washington DC
pathId = 'cb3bb4bd-dfc6-4fa0-9455-ab862fca1aaa'
timestampId = '2d643252-34f6-42cb-a547-32cf2a1f47b6'
floodRisk = el.path.vector.timestamp.getFeaturesByExtent(pathId= pathId, timestampId = timestampId, extent = washington_extent, token = token)['result']

#Retrieve properties with a certain policy around Washington DC
pathId = 'f56f474b-bc56-458e-8175-5411b70eb3f3'
timestampId = '242c9f2a-553f-48b3-9fb4-9dd7b2ce71f8'
policy = 'HP 503763014'
portfolio = el.path.vector.timestamp.getFeaturesByExtent(pathId= pathId, timestampId = timestampId, extent = washington_extent, token = token, propertyFilter=[{'key':'policy', 'operator': '=', 'value':policy}])['result']


#Retrieve construction information around Washington DC
pathId = '862fe065-ad45-484d-9cfd-ca9f3107e4de'
timestampId = '0e62c5d9-7b39-4be5-9e8b-c6d605335cac'
FoundationMaterial = 'concrete & brick'
constructionYear = 1950
constructionInfo = el.path.vector.timestamp.getFeaturesByExtent(pathId= pathId, timestampId = timestampId, extent = washington_extent, token = token, propertyFilter=[{'key':'FoundationMaterial', 'operator': '=', 'value':FoundationMaterial}, {'key':'constructionYear', 'operator': '<=', 'value':constructionYear}])['result']

#Intersect to get relevant properties
relevant_construction_area = constructionInfo.unary_union
floodRisk = floodRisk.unary_union

relevant_part_of_portfolio = portfolio[portfolio.intersects(relevant_construction_area)]
relevant_part_of_portfolio = relevant_part_of_portfolio[relevant_part_of_portfolio.intersects(floodRisk)]

relevant_part_of_portfolio['address'].head()

Here we find 77 addresses matching our search criteria (which happens to be 88% of our demo portfolio in the DC area):

2    6139 LEESBURG PIKE, APT 608
5    2407 LESLIE AVE
6    2456 S WALTER REED DR
8    1817 KENWOOD AVE
9    5900 ROSEDALE DR
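From here, the headline numbers take one line each: the matching share of the portfolio comes straight from the two GeoDataFrames in the script, and, coming back to the building-age question from earlier, the construction records that were returned can be summarised as well. A small sketch, assuming constructionYear comes back as a numeric column (the property filter above suggests it does):

# Share of the DC-area portfolio matching the flood and construction criteria
share = len(relevant_part_of_portfolio) / len(portfolio)
print(f'{len(relevant_part_of_portfolio)} of {len(portfolio)} policies match ({share:.0%})')

# Quick look at the age profile of the matching construction records
print(constructionInfo['constructionYear'].describe())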

Pretty quick, right?
So just upload all relevant data, query your data, and be done before your first coffee break!

Step 5: As a last step, we may want to publish the results to a dashboard. You can connect your findings directly to your dashboards or to any other workflow of choice. For example, you might want to create a Power BI dashboard, as was done here.
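If a script-based hand-off is easier than a direct connection, the query result from Step 4 can simply be written to files that a dashboard can pick up. A minimal sketch, assuming the result is a geopandas GeoDataFrame with a 'geometry' column (as returned above); the file names are just examples:

# Plain table for the dashboard: drop the geometry column and write a CSV
relevant_part_of_portfolio.drop(columns='geometry').to_csv('relevant_policies.csv', index=False)

# Spatial export as well, in case the dashboard supports map visuals
relevant_part_of_portfolio.to_file('relevant_policies.geojson', driver='GeoJSON')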

We can build out our analysis and dashboard as far as we’d like by drawing on other data layers. Ellipsis Drive was built to scale, so there is no limit to how many data points you would like to include.
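For instance, the windstorm risk layer from the input list was not used in the query above; folding it in takes one more call and one more intersection. A sketch with placeholder IDs for the windstorm layer:

# Placeholder IDs: copy them from the windstorm risk layer in the Drive
windPathId = '<windstorm-path-id>'
windTimestampId = '<windstorm-timestamp-id>'

# Retrieve windstorm risk zones around Washington DC, using the same extent
windRisk = el.path.vector.timestamp.getFeaturesByExtent(pathId=windPathId, timestampId=windTimestampId, extent=washington_extent, token=token)['result']

# Keep only the policies that also fall inside a windstorm risk zone
windArea = windRisk.unary_union
relevant_part_of_portfolio = relevant_part_of_portfolio[relevant_part_of_portfolio.intersects(windArea)]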

Final thoughts

I hope this scenario shows how Ellipsis Drive empowers insurance professionals (and data scientists in the broadest sense) by making it easy to manage, query, and integrate detailed data, enabling accurate risk assessment and efficient exposure management in a user-friendly and interoperable manner.


Daniel van der Maas

As CTO of Ellipsis Drive it's my mission to make spatial data usable for developers and data scientists. https://ellipsis-drive.com/