What is the Linked Open Data Cloud, and why is it important?
Situation Analysis
It is the year 2019, and “Data rules the world” has become a global understanding shared by every individual equipped with a computing device.
Today we have an 80 Trillion Dollar Global Economy that is fundamentally driven by Data.
Data is generally understood to be crucial to the creation of Information en route to the discovery of Knowledge.
Unfortunately, there is also a lingering misconception that challenges such as Data’s Volume, Velocity, Variety, Veracity, and Vulnerability can only be solved by some mythic Database Management System that slurps up all of mankind’s data into a single system from which Knowledge is doled out, subject to one’s ability to navigate a morass of obtrusive ads.
This post is an attempt to bring clarity to what the Linked Open Data Cloud (a/k/a the LOD Cloud) actually exemplifies, and how it demonstrates an unrivaled solution to the Data Access problems that we all face today.
What is the LOD Cloud?
The LOD Cloud is a Knowledge Graph that manifests as a Semantic Web of Linked Data. It is the natural product of several ingredients:
- Open Standards — such as URI, URL, HTTP, HTML, RDF, RDF-Turtle (and other RDF Notations), the SPARQL Query Language, the SPARQL Protocol, and SPARQL Query Solution Document Types
- Myriad amateur and professional Data Curators across industry and academia
- A modern DBMS platform — Virtuoso from OpenLink Software
- Seed Databases — initially DBpedia and Bio2RDF, more recently joined by significant contributions from the Wikidata project and the Schema.org-dominated SEO and SSEO axis supported by Search Engine vendors (Google, Bing, Yandex, and others) — that provided the master data from which other clouds (and sub-clouds) have been spawned
The core tapestry of the LOD Cloud arises from adherence to the “deceptively simple” notion that hyperlinks should be used to identify any thing while entity→attribute→value or subject→predicate→object structured sentences should be used to describe every thing.
The practices above constitute what are now commonly known as the principles of Linked Data — a deployment method for representations of structured data that adds the use of hyperlinks (specifically, HTTP URIs) to the EAV (Entity Attribute Value) and RDF (Resource Description Framework) models.
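The “deceptively simple” notion above can be sketched as a minimal in-memory triple store: HTTP URIs identify things, and subject→predicate→object sentences describe them. This is only an illustration of the model, not any particular RDF library, and the specific URIs are illustrative DBpedia-style identifiers:

```python
# Minimal sketch of the Linked Data triple model: HTTP URIs identify
# things; subject -> predicate -> object sentences describe them.
# The URIs below are illustrative, not authoritative.

triples = set()

def add(subject, predicate, obj):
    """Record one subject -> predicate -> object sentence."""
    triples.add((subject, predicate, obj))

def describe(subject):
    """Return every predicate -> object pair known about a subject."""
    return {(p, o) for (s, p, o) in triples if s == subject}

add("http://dbpedia.org/resource/Tim_Berners-Lee",
    "http://xmlns.com/foaf/0.1/name",
    "Tim Berners-Lee")
add("http://dbpedia.org/resource/Tim_Berners-Lee",
    "http://dbpedia.org/ontology/knownFor",
    "http://dbpedia.org/resource/World_Wide_Web")

print(describe("http://dbpedia.org/resource/Tim_Berners-Lee"))
```

Because identifiers are hyperlinks rather than local keys, any publisher anywhere can add sentences about the same subject — which is exactly how the cloud’s mesh forms.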
DBpedia and Bio2RDF
Circa 2006, the DBpedia project created a General Knowledge seed for the germination of the LOD Cloud by repurposing Wikipedia content in Linked Data form.
The DBpedia RDF Data Set is hosted and published using OpenLink Virtuoso. The Virtuoso infrastructure provides access to DBpedia’s RDF data via a SPARQL endpoint, alongside HTTP support for any Web client’s standard GET requests for HTML or RDF representations of DBpedia resources.
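Content negotiation is what lets a single DBpedia resource URI serve both audiences: the same GET yields HTML or an RDF notation such as Turtle, depending on the Accept header. A minimal sketch using only the Python standard library (the requests are constructed here but not sent, so no network access is assumed):

```python
import urllib.request

# The same resource URI can be asked for HTML or an RDF notation
# (e.g., Turtle) via the HTTP Accept header: content negotiation.
resource = "http://dbpedia.org/resource/Berlin"

html_request = urllib.request.Request(
    resource, headers={"Accept": "text/html"})
turtle_request = urllib.request.Request(
    resource, headers={"Accept": "text/turtle"})

# Sending either request, e.g. urllib.request.urlopen(turtle_request),
# would return the representation matching its Accept header.
print(turtle_request.get_header("Accept"))
```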
The Bio2RDF project created a similar, but more focused, seed by generating Linked Data from a variety of Life Science, Healthcare, and Pharmaceutical industry data sources.
Thus, from the get-go, there was an extremely rich mesh of master records that made it easy for others to embrace and extend.
Wikidata
Where DBpedia focuses on generating Linked Open Data from Wikipedia documents, Wikidata focuses on creating Linked Open (meta)Data to supplement Wikipedia documents. So while these may appear at first glance to be competitive projects, they are better treated as complementary.
Schema.org
The Schema.org vocabulary is a relatively new addition to the LOD Cloud. This collection of terms is increasingly understood by search engines — because it is primarily curated by the operators of those same engines, which provides a compelling incentive for its use by content publishers seeking to optimize their Search Engine Results Placement (SERP).
Why is the LOD Cloud Important?
The LOD Cloud provides a loosely-coupled collection of Data, Information, and Knowledge that’s accessible by any human or machine with access to the Internet, courtesy of the abstraction layer provided by the Web. It permits both basic and sophisticated lookup-oriented access using either the SPARQL Query Language or SQL.
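As a sketch of the basic lookup-oriented access just described, the SPARQL Protocol lets a query travel as a plain HTTP GET. The snippet below only assembles the request URL against DBpedia’s public endpoint; actually executing it requires network access, and the query itself is illustrative:

```python
import urllib.parse

# An illustrative SPARQL lookup: labels for the Berlin resource.
query = """
SELECT ?name WHERE {
  <http://dbpedia.org/resource/Berlin> rdfs:label ?name .
} LIMIT 5
"""

endpoint = "https://dbpedia.org/sparql"
params = urllib.parse.urlencode({
    "query": query,
    "format": "application/sparql-results+json",
})
url = f"{endpoint}?{params}"

# urllib.request.urlopen(url) would return a SPARQL Query Solution
# document in JSON form.
print(url[:60])
```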
Economic Challenge of the LOD Cloud
The current cloud came about largely as a side benefit of other projects, through community collaboration strongly influenced by academia and indirectly funded by a variety of research projects. Thus, for all of its existence, a functional business model has been a mercurial pursuit.
Data storage, processing power, network bandwidth, server administration — all of these have costs that must be borne somehow. Content Quality and Query Service Availability are the key user-visible items that challenge the current cloud. None of these is sufficiently addressed by conventional “Open Source” and “Community Collaboration” patterns; as time has demonstrated, Services-as-Gifts or “Honorable Contributions” are not a sustainable option.
Solving the Economic Challenge
Fundamentally, every solution to the LOD Cloud business model challenge boils down to evolving the currently “unknown” elements — Compensation Scheme, Compensation Source, and Compensation Currency — into specifics.
Compensation Scheme
Fine-grained Attribute-based Access Controls (e.g., Web ACLs) describe who (i.e., what person or software agent) has access to what data, and under what conditions. This allows data and query service providers to offer broad but shallow access at no cost, while granting paying users deep and/or focused access at prices appropriate to the net benefit of that access.
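A minimal sketch of such an attribute-based check, with entirely hypothetical ACL entries and field names (this is not the Web ACL vocabulary itself, just the shape of the decision):

```python
# Hypothetical ABAC sketch: broad-but-shallow access for anonymous
# agents, deep access for a specific authenticated (paying) WebID.
ACLS = [
    {"agent": "*",
     "graph": "public-summary", "max_rows": 100},
    {"agent": "https://example.com/id/alice#this",
     "graph": "full-dataset", "max_rows": None},  # None = unlimited
]

def allowed(agent_webid, graph):
    """Return the matching ACL entry, preferring an exact agent match."""
    exact = [a for a in ACLS
             if a["agent"] == agent_webid and a["graph"] == graph]
    broad = [a for a in ACLS
             if a["agent"] == "*" and a["graph"] == graph]
    return (exact or broad or [None])[0]

print(allowed(None, "public-summary"))
print(allowed("https://example.com/id/alice#this", "full-dataset"))
```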
Rights Tokenization
An X.509 Digital Certificate can be used to tokenize Identity (in the form of a WebID) and Identification (WebID-Profile) to produce credentials that are reconciled to Web ACLs associated with datasets published to the LOD Cloud by various publishers, via Attribute-Based Access Control (ABAC) systems.
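The reconciliation step above hinges on extracting the WebID from the presented certificate. A minimal sketch, using a dict stand-in shaped like the value returned by Python’s ssl.SSLSocket.getpeercert(); the WebID URI itself is hypothetical:

```python
# Stand-in for a parsed client certificate, in the shape returned by
# ssl.SSLSocket.getpeercert(). The WebID URI here is hypothetical.
peer_cert = {
    "subjectAltName": (("URI", "https://example.com/id/alice#this"),),
}

def webid_from_cert(cert):
    """Extract the first URI entry from the certificate's subjectAltName."""
    for kind, value in cert.get("subjectAltName", ()):
        if kind == "URI":
            return value
    return None

print(webid_from_cert(peer_cert))
```

The extracted WebID is then looked up against the publisher’s Web ACLs, as in the access-control scheme described earlier.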
Compensation Currency
Payment options include conventional (fiat) currencies, cryptocurrencies like Bitcoin, cryptocurrencies associated with some Rewards Systems, and others.
Purchase and Usage Process
A user purchases a ticket and stores it in the native key store provided by their operating system (Windows, macOS, Linux, etc.).
Users can also create their own key stores from PKCS#12 files; these reside on their interaction device (laptop, desktop, tablet, phone, etc.) and/or on a detached credentials device (or dongle).
Prototype Solution — URIBurner Service
One example of this system exists today in the form of the LOD Connectivity Licenses that we offer for our URIBurner Service, which can be thought of as a “deceptively simple” conduit to the LOD Cloud.
In their most basic form, these LOD Connectivity Licenses add SQL access via ODBC, JDBC, ADO.NET, and/or OLE DB to the mix of LOD Cloud data-access protocols (primarily SPARQL and HTTP). URIBurner also brings users an ability to crawl the LOD Cloud as part of the query solution pipeline — using a progressive and intelligent Small Data pattern. The roles and components in this system include:
- author — creates data
- curator — checks facts
- publisher — publishes data using Linked Data principles and protects data access using WebACLs, e.g., reserving privileged read-write operations (e.g., sponging to sandboxed named graphs) to authenticated users, while other read-write operations may specifically require a WebID
- WebACL Analyzer — a component of the Virtuoso Authentication Layer (VAL), evaluates ACLs prior to invoking functionality
Prototype Solution — VIOS Network
In the VIOS Network, data visualization is added to the mix to increase understanding by way of faceted data browsing or exploration, i.e., entity relationship types (relations) and their aggregate memberships are used to deliver exploration and navigational intelligence.
Naturally, availability and data accuracy remain important factors in this system, hence the use of Activity Stream Monitors & Analyzers (dApps), Cryptocurrencies, Smart Contracts, and Blockchain-based Distributed Ledgers to create a LOD Cloud dimension with incentives for all contributors — publishers, authors, fact-checkers, etc. The roles and components in this system include:
- author — creates data
- curator — checks facts
- publisher — publishes data using Linked Data principles and protects data access using Web ACLs, e.g., reserving privileged read-write operations (e.g., sponging to sandboxed named graphs) to authenticated users, while other read-write operations may specifically require a WebID
- Log Analyzer — a dApp that uses information gleaned from activity streams in log files and associated dataset metadata (e.g., values of the schema:author, schema:contributor, and schema:publisher properties) to allocate rewards in the form of VIOS Tokens (a cryptocurrency).
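The Log Analyzer’s allocation step can be sketched as a simple tally over activity-stream records. The record shapes and the per-role reward weights below are hypothetical:

```python
from collections import Counter

# Hypothetical activity-stream records, each carrying the schema.org
# role metadata of the dataset that was accessed.
log_records = [
    {"schema:author": "alice", "schema:publisher": "acme"},
    {"schema:author": "alice", "schema:contributor": "bob",
     "schema:publisher": "acme"},
]

# Hypothetical reward weights, in VIOS Tokens per access.
WEIGHTS = {"schema:author": 3, "schema:contributor": 2,
           "schema:publisher": 1}

def allocate(records):
    """Tally token rewards per contributor from activity-stream records."""
    rewards = Counter()
    for record in records:
        for role, weight in WEIGHTS.items():
            if role in record:
                rewards[record[role]] += weight
    return rewards

print(allocate(log_records))
```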
Conclusion
In the LOD Cloud, we have a live demonstration of a new frontier for data access, integration, and management, in which each aspect presents a Trillion Dollar market opportunity, applicable to a variety of market segments.
Related
- Virtuoso LOD Connectivity Licenses for ODBC or JDBC access to the LOD Cloud
- What is DBpedia, and why is it Important?
- What is Small Data, and why is it important?
- 80 Trillion Dollar Global Economy — Visual Capitalist Chart
- Trillion Dollar Semantic Web Market Opportunity Spreadsheet
- URIBurner Service
- About VIOS
- Bio2RDF Presentation
- Identifiers.org
- The Linking Open Data project home page and news archive
- On the Mutually Beneficial Nature of DBpedia and Wikidata
- Statistics about the Web Data Commons RDFa, Microdata and Microformats data sets, extracted from the November 2018 release of the Common Crawl
- The Birth of DBpedia and the LOD Cloud
- LDOW 2008 Workshop about LOD Cloud Growth