What is the Linked Open Data Cloud, and why is it important?

Kingsley Uyi Idehen
OpenLink Virtuoso Weblog
7 min read · Jun 11, 2019

Situation Analysis

It is the year 2019, and “Data rules the world” has become a global understanding shared by every individual equipped with a computing device.

Today we have an 80 Trillion Dollar Global Economy that is fundamentally driven by Data.

Global Economy Visualization from howmuch.net

Data is generally understood to be crucial to the creation of Information en route to the discovery of Knowledge.

Unfortunately, there is also a lingering misconception that challenges such as Data’s Volume, Velocity, Variety, Veracity, and Vulnerability can only be solved by some mythic Database Management System that slurps up all of mankind’s data into a single system from which Knowledge is doled out, subject to one’s ability to navigate a morass of obtrusive ads.

This post is an attempt to bring clarity to what the Linked Open Data Cloud (a/k/a the LOD Cloud) actually exemplifies, and how it demonstrates an unrivaled solution to the Data Access problems that we all face today.

What is the LOD Cloud?

The LOD Cloud is a Knowledge Graph that manifests as a Semantic Web of Linked Data. It is the natural product of several ingredients:

  • Open Standards — such as URI, URL, HTTP, HTML, RDF, RDF-Turtle (and other RDF Notations), the SPARQL Query Language, the SPARQL Protocol, and SPARQL Query Solution Document Types
  • Myriad amateur and professional Data Curators across industry and academia
  • A modern DBMS platform — Virtuoso from OpenLink Software
  • Seed Databases — DBpedia and Bio2RDF initially formed the core, providing master data from which other clouds (and sub-clouds) have been spawned; more recently, significant contributions have come from the Wikidata project and the Schema.org-dominated SEO and SSEO axis supported by Search Engine vendors (Google, Bing, Yandex, and others)

The core tapestry of the LOD Cloud arises from adherence to the “deceptively simple” notion that hyperlinks should be used to identify any thing, while entity-attribute-value or subject-predicate-object structured sentences should be used to describe every thing.

The practices above constitute what are now commonly known as the principles of Linked Data — a deployment method for representations of structured data that adds the use of hyperlinks (specifically, HTTP URIs) to the EAV (Entity Attribute Value) and RDF (Resource Description Framework) models.
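The principle above can be sketched with nothing more than plain tuples: every “thing” is named by an HTTP URI, and every statement about a thing is a (subject, predicate, object) sentence. The URIs below follow real DBpedia and FOAF naming conventions, but the tiny graph itself is purely illustrative.

```python
# A minimal sketch of the Linked Data idea: HTTP URIs identify things,
# and subject-predicate-object sentences describe them.
triples = [
    ("http://dbpedia.org/resource/Tim_Berners-Lee",
     "http://xmlns.com/foaf/0.1/name",
     "Tim Berners-Lee"),
    ("http://dbpedia.org/resource/Tim_Berners-Lee",
     "http://dbpedia.org/ontology/birthPlace",
     "http://dbpedia.org/resource/London"),
    ("http://dbpedia.org/resource/London",
     "http://xmlns.com/foaf/0.1/name",
     "London"),
]

def describe(subject_uri, graph):
    """Collect every (predicate, object) sentence about one subject."""
    return [(p, o) for (s, p, o) in graph if s == subject_uri]

# Everything the graph says about one entity, keyed by its hyperlink:
print(describe("http://dbpedia.org/resource/Tim_Berners-Lee", triples))
```

Because identifiers are hyperlinks rather than local keys, a description in one dataset (here, a birthplace pointing at the London URI) can be followed into any other dataset that describes the same URI — which is precisely how the LOD Cloud's mesh forms.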

Bio2RDF and DBpedia — two inter-connected projects that seeded the LOD Cloud

DBpedia and Bio2RDF

Circa 2006, the DBpedia project created a General Knowledge seed for the germination of the LOD Cloud by repurposing Wikipedia content in Linked Data form.

The DBpedia RDF Data Set is hosted and published using OpenLink Virtuoso. The Virtuoso infrastructure provides access to DBpedia’s RDF data via a SPARQL endpoint, alongside HTTP support for any Web client’s standard GET for HTML or RDF representations of DBpedia resources.

Illustration of Current DBpedia Data Provision Architecture
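The endpoint access described above can be sketched with the Python standard library: a SPARQL query is sent as an HTTP request parameter, and the desired result format is negotiated via the Accept header. The request targets DBpedia's public endpoint at https://dbpedia.org/sparql; it is only constructed here, not actually sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

# A SPARQL query asking for the English abstract describing Berlin.
query = """
SELECT ?abstract WHERE {
  <http://dbpedia.org/resource/Berlin>
      <http://dbpedia.org/ontology/abstract> ?abstract .
  FILTER (lang(?abstract) = "en")
}
"""

# Per the SPARQL Protocol, the query travels as an HTTP parameter, and the
# Accept header selects a SPARQL Query Solution document type (JSON here).
url = "https://dbpedia.org/sparql?" + urlencode({"query": query})
req = Request(url, headers={"Accept": "application/sparql-results+json"})

print(req.get_full_url().split("?")[0])  # prints the endpoint address
print(req.get_header("Accept"))
```

Swapping the Accept header (e.g., to `text/turtle` for a DESCRIBE query, or `text/html`) is the same content-negotiation mechanism that lets any Web client GET an HTML or RDF representation of a DBpedia resource.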

The Bio2RDF project created a similar, but more focused, seed by generating Linked Data from a variety of Life Science, Healthcare, and Pharmaceutical industry data sources.

Thus, from the get-go, there was an extremely rich mesh of master records that made it easy for others to embrace and extend.

Wikidata

Where DBpedia focuses on generating Linked Open Data from Wikipedia documents, Wikidata focuses on creating Linked Open (meta)Data to supplement Wikipedia documents. While these may appear at first glance to be competitive projects, they are better treated as complementary.

Schema.org

The Schema.org vocabulary is a relatively new addition to the LOD Cloud. This collection of terms is increasingly understood by search engines, because it is primarily curated by the operators of those same engines — which provides a compelling incentive for its use by content publishers seeking to improve their Search Engine Results Page (SERP) placement.

Why is the LOD Cloud Important?

The LOD Cloud provides a loosely-coupled collection of Data, Information, and Knowledge that’s accessible by any human or machine with access to the Internet, courtesy of the abstraction layer provided by the Web. It permits both basic and sophisticated lookup-oriented access using either the SPARQL Query Language or SQL.

Current LOD Cloud

Economic Challenge of the LOD Cloud

The current cloud came about largely as a side benefit of other projects, through community collaboration strongly influenced by academia and indirectly funded by a variety of research projects. Thus, for all of its existence, a functional business model has been a mercurial pursuit.

Data storage, processing power, network bandwidth, server administration — all of these have costs that must be borne somehow. Content Quality and Query Service Availability are the key user-visible items that challenge the current cloud. None of these is sufficiently addressed by conventional “Open Source” and “Community Collaboration” patterns, i.e., Services-as-Gifts or “Honorable Contributions” aren’t a sustainable option, as time has demonstrated.

Current LOD Cloud Economy — where the Scheme, Source, and Currency of Compensation are unknowns

Solving the Economic Challenge

Fundamentally, every solution to the LOD Cloud business model challenge boils down to evolving the currently “unknown” elements — Compensation Scheme, Compensation Source, and Compensation Currency — into specifics.

Compensation Scheme

A workable scheme starts with fine-grained Attribute-based Access Controls (e.g., WebACLs) that describe who (i.e., what person or software agent) has access to what data, and under what conditions. These allow data and query service providers to offer broad but shallow access at no cost, while granting paying users deep and/or focused access at prices appropriate to the net benefit of that access.

Rights Tokenization

An X.509 Digital Certificate can be used to tokenize Identity (in the form of a WebID) and Identification (WebID-Profile) to produce credentials that are reconciled to Web ACLs associated with datasets published to the LOD Cloud by various publishers, via Attribute-Based Access Control (ABAC) systems.
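The reconciliation step above can be sketched as follows: the client certificate carries a WebID (an HTTP URI) in its subjectAltName extension, and the server checks that WebID against the ACL attached to a published dataset. The certificate is stubbed as a dict here; a real deployment would obtain it via TLS client authentication and dereference the WebID-Profile.

```python
# Stub of an X.509 certificate whose subjectAltName carries a WebID URI.
# Identifiers are illustrative.
cert = {"subjectAltName": [("URI", "https://id.example.org/people/alice#me")]}

# ACL associated with a published dataset: a set of WebIDs with read access.
dataset_acl = {"agents_with_read_access": {
    "https://id.example.org/people/alice#me",
}}

def webid_from_cert(certificate):
    """Extract the WebID (a URI entry) from the certificate's subjectAltName."""
    for kind, value in certificate.get("subjectAltName", []):
        if kind == "URI":
            return value
    return None

def may_read(certificate, acl):
    """Reconcile the tokenized identity against the dataset's Web ACL."""
    return webid_from_cert(certificate) in acl["agents_with_read_access"]

print(may_read(cert, dataset_acl))
```

Because the credential is just a certificate holding a hyperlink, the same identity works across every publisher in the cloud that recognizes WebIDs — no per-site accounts required.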

Compensation Currency

Payment options include conventional (fiat) currencies, cryptocurrencies like Bitcoin, cryptocurrencies associated with some Rewards Systems, and others.

Purchase and Usage Process

A user purchases a ticket (the X.509 certificate-based credential described above) and stores it in the native key store provided by their operating system (Windows, macOS, Linux, etc.).

Users can also use PKCS#12 files to make their own key stores which reside on their interaction device (laptop, desktop, tablet, phone, etc.) and/or on a detached credentials device (or dongle).

Prototype Solution — URIBurner Service

One example of this system exists today in the form of the LOD Connectivity Licenses that we offer for our URIBurner Service, which can be thought of as a “deceptively simple” conduit to the LOD Cloud.

URIBurner Service and its LOD Connectivity Drivers

In their most basic form, these LOD Connectivity Licenses add SQL access via ODBC, JDBC, ADO.NET, and/or OLE DB to the mix of LOD Cloud data-access protocols (primarily SPARQL and HTTP). URIBurner also gives users the ability to crawl the LOD Cloud as part of the query solution pipeline — using a progressive and intelligent Small Data pattern.

  • author — creates data
  • curator — checks facts
  • publisher — publishes data using Linked Data principles and protects data access using WebACLs, e.g., reserving privileged read-write operations (e.g., sponging to sandboxed named graphs) to authenticated users, while other read-write operations may specifically require a WebID
  • WebACL Analyzer — a component of the Virtuoso Authentication Layer (VAL) that evaluates ACLs prior to invoking protected functionality

Prototype Solution — VIOS Network

In the VIOS Network, data visualization is added to the mix to increase understanding by way of faceted data browsing or exploration, i.e., entity relationship types (relations) and their aggregate memberships are used to deliver exploration and navigational intelligence.

VIOS System
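The faceted exploration idea above reduces to grouping statements by relation type: the relation types and their aggregate membership counts become the navigation facets. A minimal sketch, using illustrative identifiers:

```python
from collections import Counter

# A tiny illustrative graph of (subject, predicate, object) statements.
triples = [
    ("ex:alice", "ex:knows",    "ex:bob"),
    ("ex:alice", "ex:worksFor", "ex:acme"),
    ("ex:bob",   "ex:knows",    "ex:carol"),
    ("ex:carol", "ex:worksFor", "ex:acme"),
]

# Facets = relation types with their aggregate memberships; a browser
# presents these counts so a user can drill into the largest relations.
facets = Counter(p for (_, p, _) in triples)
for relation, members in facets.most_common():
    print(relation, members)
```

In a real deployment these counts would come from aggregate SPARQL queries over the live graph, but the navigational intelligence is the same: relation types ranked by membership drive the exploration.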

Naturally, availability and data accuracy remain important factors in this system, hence the use of Activity Stream Monitors & Analyzers (dApps), Cryptocurrencies, Smart Contracts, and Blockchain-based Distributed Ledgers to create a LOD Cloud dimension with incentives for all contributors — publishers, authors, fact-checkers, etc.

  • author — creates data
  • curator — checks facts
  • publisher — publishes data using Linked Data principles and protects data access using Web ACLs, e.g., reserving privileged read-write operations (e.g., sponging to sandboxed named graphs) to authenticated users, while other read-write operations may specifically require a WebID
  • Log Analyzer — a dApp that uses information gleaned from activity streams in log files and associated dataset metadata (e.g., values of schema:author, schema:contributor, and schema:publisher properties) to allocate rewards in the form of VIOS Tokens (a cryptocurrency).
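The Log Analyzer's reward step described above can be sketched as follows. Activity-stream entries name the dataset that was accessed; dataset metadata names its contributors via schema.org properties; tokens are then split among those contributors. The identifiers and the one-token-per-access rate are assumptions for illustration, not the actual VIOS accounting rules.

```python
from collections import Counter

# Activity stream gleaned from log files: each entry is one dataset access.
access_log = ["ex:datasetA", "ex:datasetA", "ex:datasetB"]

# Dataset metadata, keyed by schema.org contributor-role properties.
metadata = {
    "ex:datasetA": {"schema:author": "ex:alice", "schema:publisher": "ex:pub1"},
    "ex:datasetB": {"schema:author": "ex:bob",   "schema:publisher": "ex:pub1"},
}

# Allocate rewards: one token per access, split evenly among contributors.
rewards = Counter()
for dataset in access_log:
    contributors = list(metadata[dataset].values())
    for agent in contributors:
        rewards[agent] += 1 / len(contributors)

print(dict(rewards))  # ex:alice 1.0, ex:pub1 1.5, ex:bob 0.5
```

However the split rule is tuned, the essential property is that compensation flows from observed usage back to the agents named in the dataset's own metadata — which is what gives publishers, authors, and fact-checkers an incentive to keep content quality and service availability high.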

Conclusion

In the LOD Cloud, we have a live demonstration of a new frontier for data access, integration, and management, in which each aspect presents a Trillion Dollar market opportunity, applicable to a variety of market segments.

Conservative Estimates of the LOD Cloud Market Opportunity


Kingsley Uyi Idehen
CEO, OpenLink Software — High-Performance Data-Centric Technology Providers.