Building Semantic Data Catalogs

Kurt Cagle
7 min readJun 30, 2018

I had a recent conversation with a senior data manager for a large corporate enterprise who is in the process of setting a long term data strategy. We had been talking semantics and RDF, and he raised an issue that I’ve run into as a consultant before, running something like this:

“We have a huge amount of siloed data, and would prefer to not have to move or duplicate that data but still need to get at it.”

This is a reasonable assumption. Databases represent a major sunk cost for any company, especially global ones, and any solution that requires relocating data (and potentially impacting thousands of existing applications) is a non-starter. At the same time, most data systems, whether relational databases, NoSQL databases, spreadsheets, documents or similar content, are not in fact designed out of the box to be used globally, and often there are complex ETL costs associated with making that data available through a queryable interface.

Putting everything into a triple store by itself is not an ideal solution, any more than dumping everything into a relational database or a NoSQL solution is. The fundamental challenge that you face in all of these cases is that the more content you bring into the database, the more that content needs to be indexed, to the extent that even with an efficient index, performance degrades significantly. (This…

--

--