Building a Config-based Knowledge Graph
Nabeel Zewail | Senior Software Engineer, Data
As an insurance company, modeling risk is paramount to our success, and since our focus is on venture-backed startups, we're investing in building out a dataset that models startups over time. We're writing this article so that engineers who are tackling data integration across multiple sources, and who are considering knowledge graphs, can learn from our journey.
In this post, we will cover some of the problems that we tried to solve in building out our knowledge graph and how we got set up. In a future post, we will cover some of the tooling that we have built to enable teams to access the dataset.
Before we get into the technical aspects, I want to go through some of the problems we were hoping to solve and the assumptions that we made when building out our knowledge graph:
- Ingesting data from a variety of sources, with different data and different update frequencies. We ingest data from many different places (CRM system, backend databases, 3rd party sources, etc). Each of these datasets contains different pieces of data about a company, so a requisite part of our solution was being able to handle all these differences elegantly.
- Understanding data provenance is essential. A big goal for us was to track where and when data was updated. We needed to know how an entity changed over time, with a paper trail to tell us where we learned about that change.
- Startups, investors, and their partners operate in a highly connected network. By understanding this network of relationships, we can understand the changing nature of each player's risks.
Why not just use a SQL based, relational warehouse?
We really wanted to keep things in our dbt/Snowflake infrastructure because it is a set of technologies that our team was more familiar with and that has a robust set of tooling that makes development easier. However, we found that addressing some of the problems described above would be difficult there. When it came to data provenance, entity resolution, storing and accessing attribute metadata, and understanding the relationships between different entities, a graph-based solution seemed cleaner.
This decision came with a bit of a steep learning curve. We had to reorient ourselves to new technologies to work with graph databases with standards like RDF and SPARQL. This was a set of skills that our team didn’t really have when we began the implementation, and generally not something that most software or data engineers are familiar with.
When beginning to research how to build a knowledge graph we settled on using RDF for a few reasons:
- RDF cleanly handles harmonizing different data sources in a unified way
- RDF graphs allow us to set up rules, build metadata, and track provenance
- Existing ontologies allowed us to get started quickly (shout-out to schema.org!). This can be a bit of a double-edged sword: it is helpful to be able to lean on existing ontologies, but the lack of enforced structure in the graph can make it hard to query with confidence (more on how we address this later!)
Getting Started
This diagram provides a high-level overview of the system we have designed around our knowledge graph, where we ingest data into our data warehouse from a variety of different sources/vendors. Then we take that data and translate it into RDF to ingest into our graph database (AWS Neptune, in our case). We also have some exploratory tooling that we have built on top of our graph directly which we will cover in a future post.
Ingestion
The first problem we had to solve was taking our data from our data warehouse, translating it into RDF, and ingesting it into our graph database.
We designed a config-based system called quadify that allows a developer to generate RDF quads by defining a mapping such as the one below:
{
  "primary_key": "article_id",
  "source_name": "wikipedia",
  "predicate_mapping": {
    "city": "https://schema.org/City",
    "address": "https://schema.org/address",
    "url": {
      "predicate_uri": "https://schema.org/url",
      "obj_datatype": "uri"
    },
    "created_at": {
      "predicate_uri": "https://schema.org/dateCreated",
      "obj_datatype": "date"
    }
  }
}
This config maps specific columns in SQL tables (in this case city, address, url, and created_at) to RDF predicates. Each source/date_modified combination goes into its own named graph, which gives us the provenance of where that data came from and when it was updated (this post from Stardog was very helpful in making sense of how to use named graphs). This setup allows us to easily query for the set of facts on a given date about a specific entity.
The example above shows how we translate a few SQL tables into a series of RDF facts that can be ingested into our graph. Here, the URI for the named graph provides some metadata about where and when that fact came from. The primary_key in this example is the id column, which translates to the subject for each fact. Each of the subsequent columns (name, headcount, funds_raised) is mapped to a defined predicate, and the value of that column becomes the object. The named graph is a URI constructed from the source and the date that fact was learned, in this case 2022-05-01. By applying this "quadifying" logic, we have taken data from a few different sources and normalized it into a defined RDF schema with clear provenance on a per-attribute basis.
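To make the mapping concrete, here is a minimal Python sketch of what "quadifying" a single row could look like. It is not the actual quadify implementation: the function names, the https://example.com base URI, and the layout of the subject and named-graph URIs are all assumptions made for illustration. It emits N-Quads lines, one per mapped column.

```python
from datetime import date

XSD = "http://www.w3.org/2001/XMLSchema#"


def escape_literal(value: str) -> str:
    """Escape characters that cannot appear raw inside an N-Quads literal."""
    return (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def format_object(value, obj_datatype=None) -> str:
    """Render a column value as an IRI, a typed date literal, or a plain literal."""
    if obj_datatype == "uri":
        return f"<{value}>"
    literal = '"' + escape_literal(str(value)) + '"'
    if obj_datatype == "date":
        return f"{literal}^^<{XSD}date>"
    return literal


def quadify_row(row: dict, config: dict, learned_on: date,
                base_uri: str = "https://example.com") -> list[str]:
    """Turn one SQL row into N-Quads lines (hypothetical URI scheme)."""
    source = config["source_name"]
    key = row[config["primary_key"]]
    subject = f"<{base_uri}/{source}/{key}>"
    # One named graph per source and date, so every fact carries its provenance.
    graph = f"<{base_uri}/graph/{source}/{learned_on.isoformat()}>"

    quads = []
    for column, mapping in config["predicate_mapping"].items():
        value = row.get(column)
        if value is None:
            continue  # no fact to assert for a missing value
        if isinstance(mapping, str):
            # Shorthand form: the mapping is just the predicate URI.
            predicate, datatype = mapping, None
        else:
            predicate, datatype = mapping["predicate_uri"], mapping.get("obj_datatype")
        quads.append(f"{subject} <{predicate}> {format_object(value, datatype)} {graph} .")
    return quads


# Illustrative config and row; the column names here are assumptions.
config = {
    "primary_key": "id",
    "source_name": "wikipedia",
    "predicate_mapping": {
        "name": "https://schema.org/name",
        "url": {"predicate_uri": "https://schema.org/url", "obj_datatype": "uri"},
    },
}
row = {"id": 42, "name": "Acme", "url": "https://acme.example"}

for quad in quadify_row(row, config, date(2022, 5, 1)):
    print(quad)
```

Under these assumptions, each row yields one quad per non-null mapped column, and every quad shares the same subject and the same source-and-date named graph.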
Because Vouch's engineering team uses AWS widely, we opted to build on top of AWS' managed graph database, Neptune, which has a very useful bulk loader that makes it easy to load RDF structured data in S3 into the database directly.
Our basic workflow is to iterate through all the sources that we have defined, quadify them, load the output RDF data into S3, and then use the Neptune bulk loader endpoint to load that data into Neptune.
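The last step of that workflow can be sketched as follows. This is a hedged illustration rather than our production code: the function name and every argument value are placeholders. It only builds the HTTP request for the Neptune loader endpoint, using the loader's source, format, iamRoleArn, and region parameters, and does not send it.

```python
import json
import urllib.request


def build_loader_request(neptune_endpoint: str, s3_uri: str, iam_role_arn: str,
                         region: str, port: int = 8182) -> urllib.request.Request:
    """Build (but do not send) a bulk-load request for N-Quads files in S3."""
    payload = {
        "source": s3_uri,            # S3 prefix holding the quadified output
        "format": "nquads",          # quads carry the named graph for provenance
        "iamRoleArn": iam_role_arn,  # role Neptune assumes to read from S3
        "region": region,
    }
    return urllib.request.Request(
        url=f"https://{neptune_endpoint}:{port}/loader",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder values for illustration only.
request = build_loader_request(
    neptune_endpoint="my-cluster.example",
    s3_uri="s3://my-bucket/quads/wikipedia/2022-05-01/",
    iam_role_arn="arn:aws:iam::123456789012:role/NeptuneLoadFromS3",
    region="us-east-1",
)
print(request.get_method(), request.full_url)
```

Sending the request, for example with urllib.request.urlopen(request), has to happen from inside the cluster's network. If IAM database authentication is enabled on the cluster, the request would also need to be signed, which this sketch leaves out.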
With this system, we have a way to easily structure new data sources into RDF and load them into our knowledge graph.
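Once the quads are loaded, the per-source, per-date named graphs make the "facts on a given date about a specific entity" question a small SPARQL query. The sketch below builds such a query in Python. The named-graph URI layout and the example.com URIs are assumptions for illustration, not the scheme we use in production.

```python
from datetime import date


def named_graph_uri(source: str, modified: date,
                    base_uri: str = "https://example.com/graph") -> str:
    """Hypothetical named-graph URI: one graph per source and date."""
    return f"{base_uri}/{source}/{modified.isoformat()}"


def facts_on_date_query(entity_uri: str, source: str, modified: date) -> str:
    """Build a SPARQL query for every fact one source asserted about an entity on a date."""
    graph = named_graph_uri(source, modified)
    return (
        "SELECT ?predicate ?object\n"
        "WHERE {\n"
        f"  GRAPH <{graph}> {{\n"
        f"    <{entity_uri}> ?predicate ?object .\n"
        "  }\n"
        "}"
    )


print(facts_on_date_query("https://example.com/wikipedia/42", "wikipedia", date(2022, 5, 1)))
```

Replacing the fixed graph URI with a ?graph variable would return the same facts across every source and date, which is how a per-attribute paper trail could be read back out.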
In a future blog post, we will cover how we make sense of all that data and how we operationalize insights from it here at Vouch.
For a deeper dive into this system and the graph technologies we're developing, you can check out our talk at KGC'22 on YouTube: Modeling the startup ecosystem using a config based knowledge graph.