What is the Virtuoso Sponger Middleware about, and why is it important?

Published in OpenLink Virtuoso Weblog, Apr 3, 2019

Situation Analysis

The Web continues to grow exponentially across multiple axes. Each axis presents a new set of challenges to unsuspecting users, from the propagation of data silos, to compromises of privacy, to orientation away from the “literary machine” that is this unique global space.

Further, the challenges of the public Web are increasingly seeping into the private domains of organizations on the backs of various computing devices (such as phones, watches, and other IoT components) that make up the emerging Hybrid Cloud Infrastructure.

One solution, applicable to all of these emerging challenges, may be found in the generation of machine- and human-readable document metadata that provides insights into the nature of the content of those documents. Naturally, creating such metadata requires a combination of manual and automated operations performed by both humans and machines.

What is the Virtuoso Sponger?

The Virtuoso Sponger (or simply, the Sponger) is an Extract, Transform, and Load (ETL) Middleware Layer, built into all Virtuoso instances, that treats a wide variety of Document Types and APIs as Structured Data Sources usable in SQL, SPARQL, and Free Text queries. This is all achieved through innovative implementation and exploitation of existing open standards.

Structured Data generated by the Sponger always manifests as a Knowledge Graph woven together by hyperlinks (specifically, HTTP URIs), i.e., a collection of RDF statements deployed using Linked Data principles.

Cartridge (Connector) Installation

The Sponger is built into all Virtuoso instances, but to take full advantage of its broad collection of transformation drivers, you need to install the Linked Data Cartridges VAD — via the HTTP browser-based Virtuoso Conductor interface or the Operating System-native iSQL command-line interface.

What document types are supported?

The Sponger (or more accurately, its Cartridges, i.e., the transformation drivers or connectors) operates on HTML, Plain Old Semantic HTML (POSH), (X)HTML+RDFa, HTML5+Microdata, HTML5+JSON-LD, RDF-Turtle, RDF-N-Triples, RDF-N-Quads, RDF-XML, CSV, Atom, RSS, iCal/iCalendar, vCard/vCalendar, Plain Text (with or without Nanotations), and XML and JSON content-types returned by a broad spectrum of APIs.

Benefits?

Simplified Knowledge Graph exploitation: the Sponger takes the tedium and confusion out of Linked Data deployment by generating proxy-hyperlinks that function as Super Keys in conformance with Linked Data (“webby structured data”) principles
Ease of Use: hyperlinks are the sole control mechanism for invoking its powerful document description and content transformation functionality
Ease of Extensibility: cartridges (drivers or connectors) for specific document types are the delivery mechanism for content transformation functionality
Broad Integration with Third Party Services (client and server ends): more than 70 API and Document Type combinations are currently supported, with customization APIs available for additional enhancements
Powerful Meshing (rather than mashing) of content from disparate data sources: a powerful solution for Data Virtualization exploitable by SQL, SPARQL, or Free Text queries

How do I use it?

Our URIBurner Service is a live instance of the Virtuoso Sponger that’s been available for free use since 2007, around the time of the DBpedia project launch and commencement of the Linked Open Data (LOD) Cloud.

Given a document of interest that’s available on the Web, at a location identified by a URI, here’s how you would obtain metadata describing said document that manifests as an exploration-friendly Knowledge Graph:

Go to http://linkeddata.uriburner.com
Place the URL that identifies the document of interest into the input field labeled “Enter the URL to sponge:”
Click on the “Sponge” button
View the doc returned to your browser

You can shorten this experience by installing our OpenLink Structured Data Sniffer and OpenLink Data Explorer Browser Extensions, both of which reduce the four-step process above to a single-click action whenever you seek additional information about the document currently shown in your browser.

You can also manually invoke this functionality with the following URL pattern:

http://linkeddata.uriburner.com/about/html/{document-url}
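As a minimal sketch, the pattern above can be filled in programmatically. The helper name below is hypothetical, and the scheme/host/path layout of the target (for example, https/www.fastcompany.com/...) is an assumption inferred from the live Fast Company example given in this post; documents whose URLs carry query strings or unusual characters may need additional encoding that this sketch does not attempt.

```python
def sponger_description_url(document_url: str,
                            base: str = "https://linkeddata.uriburner.com") -> str:
    """Build a Sponger description URL for a Web document.

    Assumption: the target is written as scheme/host/path after
    /about/html/, mirroring the live example in this post.
    """
    scheme, separator, rest = document_url.partition("://")
    if not separator or not scheme or not rest:
        raise ValueError(f"expected an absolute URL, got {document_url!r}")
    return f"{base.rstrip('/')}/about/html/{scheme}/{rest}"


# Example: describe the Fast Company article used later in this post.
print(sponger_description_url(
    "https://www.fastcompany.com/90324660/"
    "how-disney-grew-its-3-billion-mickey-mouse-business-by-selling-to-adults"
))
```

The base argument is a parameter so the same helper can point at any other Virtuoso instance that has the Sponger enabled.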

Finally, you can invoke the Sponger’s services via the SPARQL Query Service endpoint associated with any Virtuoso instance (including URIBurner) that has this module enabled.

How does it work?

The invisible workflow behind this “deceptively simple” content transformation middleware is as follows:

Negotiate a preferred content-type with the publisher of the target document using HTTP content negotiation
Scan through the list of configured Extractor Cartridges (matched by path or content-type settings) and apply the transformations offered by each relevant cartridge; at this point, phase 1 completes with transformed content ready for the next processing cycle
Scan through the list of configured Meta Cartridges, which perform tasks like Natural Language Processing (NLP)-based Entity Extraction and Linked Open Data (LOD) Cloud Lookups against the content produced by phase 1
Present the transformed document to the User Agent (e.g., Browser) that invoked the Sponger
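The two-phase workflow above can be modeled as a toy pipeline. This is only an illustrative sketch, not how Virtuoso implements cartridges: every class, function, and property name here is hypothetical, the content_type argument stands in for the outcome of content negotiation, and the keyword matcher is a crude placeholder for real NLP-based Entity Extraction.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Tuple

Triple = Tuple[str, str, str]  # (subject, property, value)


@dataclass
class ExtractorCartridge:
    name: str
    content_types: Tuple[str, ...]                # content-types this cartridge handles
    extract: Callable[[str, str], List[Triple]]   # (url, body) -> triples


@dataclass
class MetaCartridge:
    name: str
    # (url, body, triples from phase 1) -> additional triples
    enrich: Callable[[str, str, List[Triple]], List[Triple]]


def sponge(url: str, content_type: str, body: str,
           extractors: List[ExtractorCartridge],
           meta_cartridges: List[MetaCartridge]) -> List[Triple]:
    # Phase 1: apply every extractor cartridge that matches the content-type.
    triples: List[Triple] = []
    for cartridge in extractors:
        if content_type in cartridge.content_types:
            triples.extend(cartridge.extract(url, body))
    # Phase 2: let each meta cartridge add statements based on phase 1 output.
    phase_one = list(triples)
    for cartridge in meta_cartridges:
        triples.extend(cartridge.enrich(url, body, phase_one))
    return triples


def html_title(url: str, body: str) -> List[Triple]:
    """Toy extractor: pull the <title> element out of an HTML document."""
    match = re.search(r"<title>(.*?)</title>", body, re.IGNORECASE | re.DOTALL)
    return [(url, "title", match.group(1).strip())] if match else []


def keyword_mentions(url: str, body: str, triples: List[Triple]) -> List[Triple]:
    """Toy meta cartridge: a fixed keyword list standing in for entity extraction."""
    known_names = ("Disney", "Intel")
    return [(url, "mentions", name) for name in known_names if name in body]


extractors = [ExtractorCartridge("html-title", ("text/html",), html_title)]
meta_cartridges = [MetaCartridge("keywords", keyword_mentions)]
```

Calling sponge() with a small HTML body and the two lists above yields a title statement from phase 1 followed by any "mentions" statements from phase 2; presenting the result to the User Agent is left out of the sketch.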

Live Examples?

URL of Original Web Page: https://www.fastcompany.com/90324660/how-disney-grew-its-3-billion-mickey-mouse-business-by-selling-to-adults

URL of Metadata Document: https://linkeddata.uriburner.com/about/html/https/www.fastcompany.com/90324660/how-disney-grew-its-3-billion-mickey-mouse-business-by-selling-to-adults

The screenshots that follow demonstrate how the Sponger generates a description of the target document in the form of metadata that manifests as a Knowledge Graph that includes:

Document Type — via the property labeled “Type”
Document Title and Description — via the properties labeled “title” and “description”
Document Focus (a/k/a Primary Topic) — via the property labeled “about”
Related Topics — via the properties labeled “seeAlso” and “has related”
Entities Mentioned — via the property labeled “mentions”
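To make the shape of such a description concrete, here is a small sketch that serializes a few statements as N-Triples, one of the RDF formats listed earlier. All subject and value IRIs are placeholders, and the schema.org property IRIs are an assumption made for illustration; the properties actually emitted by a given Sponger instance may differ.

```python
from typing import List, Tuple

# (subject IRI, property IRI, value, value_is_iri)
Statement = Tuple[str, str, str, bool]


def escape_literal(text: str) -> str:
    """Escape backslashes, quotes, and line breaks for an N-Triples literal."""
    return (text.replace("\\", "\\\\")
                .replace('"', '\\"')
                .replace("\n", "\\n")
                .replace("\r", "\\r"))


def to_ntriples(statements: List[Statement]) -> str:
    """Serialize statements as N-Triples, one statement per line."""
    lines = []
    for subject, prop, value, value_is_iri in statements:
        obj = f"<{value}>" if value_is_iri else f'"{escape_literal(value)}"'
        lines.append(f"<{subject}> <{prop}> {obj} .")
    return "\n".join(lines)


# Placeholder description of a hypothetical document.
description = [
    ("http://example.org/doc", "http://schema.org/name", "An example title", False),
    ("http://example.org/doc", "http://schema.org/about", "http://example.org/topic#main", True),
    ("http://example.org/doc", "http://schema.org/mentions", "http://example.org/entity#acme", True),
]
print(to_ntriples(description))
```

Because every subject and most values are HTTP URIs, each line in the output is itself a set of followable links, which is what lets the description be explored as a graph.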

Metadata Segment 1

Top-right “Meta-cartridge” drop-down presenting a list of Meta Cartridges used for NLP-based Entity Extraction contributions to the document description production pipeline

Metadata Segment 2

Property values generated through NLP-based Entity Extraction

Metadata Segment 3

Other property values generated through NLP-based Entity Extraction

In addition to the document presented above, each Sponger-generated page includes a link to an alternative document that presents the same metadata in a Faceted-Browsing-oriented form, i.e., a presentation style where filtering and navigation are driven by deeper exploitation of entity relationship type semantics. For instance, you can click on the value of the property labeled “type” to discover related entities from the underlying Virtuoso database associated with the invoked Sponger instance.

Faceted Browsing Segment 1

Faceted Browsing Segment 2

Description of one of the companies mentioned in the source document

Faceted Browsing Segment 3

Description of another company mentioned in the source document

SPARQL Integration

The Sponger’s services are also available for use as part of Virtuoso’s SPARQL Query Services functionality. For instance, a Document URL functions as an external Data Source Name against which Query Language operations may be performed, declaratively.

Here’s an example of a SPARQL Query that automatically treats a Google Spreadsheet about Intel CPUs as just another structured data source:

DEFINE get:soft "soft"

PREFIX cpu: <https://docs.google.com/spreadsheets/d/1NmrGjc8pcgh1S_0mFNABiQpSNjY6Jxm1lAOmcxHaldg/export?format=csv#>
PREFIX dsn: <https://docs.google.com/spreadsheets/d/1NmrGjc8pcgh1S_0mFNABiQpSNjY6Jxm1lAOmcxHaldg/export?format=csv>

SELECT DISTINCT ?s AS ?processorID xsd:string(?model) AS ?modelName ?cores IRI(?amazonUrl)
FROM dsn:
WHERE {
       ?s cpu:Model ?model ;
          cpu:Cores ?cores ;
          cpu:Amazon_Link ?amazonUrl .
       FILTER (CONTAINS(STR(?amazonUrl),"https:"))
      }

The end product is an HTML document (by default; other formats may be requested by various means) equipped with hyperlinks functioning as Super Keys for deeper data exploration and navigation, enabling serendipitous discovery of other related data (locally or across an HTTP-based network like the Web).
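A query like this can also be submitted over HTTP from a script. The sketch below only builds the request URL and does not contact any server. It assumes the instance exposes its SPARQL endpoint at /sparql and accepts "query" and "format" request parameters; the short query and the example.org data source are placeholders rather than the spreadsheet query shown above.

```python
from urllib.parse import urlencode

# Assumption: the SPARQL endpoint of the target Virtuoso instance lives at /sparql.
ENDPOINT = "https://linkeddata.uriburner.com/sparql"

# Placeholder query; the get:soft pragma asks the Sponger to fetch the
# document named in the FROM clause, as in the spreadsheet example.
QUERY = """DEFINE get:soft "soft"
SELECT * FROM <https://example.org/data.csv>
WHERE { ?s ?p ?o }
LIMIT 10"""


def sparql_request_url(query: str,
                       result_format: str = "text/html",
                       endpoint: str = ENDPOINT) -> str:
    """Return a GET URL that carries the query and the desired result format."""
    params = {"query": query, "format": result_format}
    return f"{endpoint}?{urlencode(params)}"


print(sparql_request_url(QUERY))
```

Passing a different result_format value (for instance, application/sparql-results+json) is one way to request a format other than the default HTML document, assuming the endpoint honors that parameter.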

Here are some sample live links:

Other Live Examples

Looking at a StackOverflow post about the DBpedia SPARQL Endpoint: Basic Metadata Document or Faceted-Browsing-oriented Metadata Document (which provides pathways to related questions and answers held in the underlying Virtuoso RDBMS instance)
Looking at a Forbes article about venture capital firm Andreessen Horowitz: Basic Metadata Document or Faceted-Browsing-oriented Metadata Document
Looking at an HTML document about Unix Philosophy and its impact on computing today: Basic Metadata Document or Faceted-Browsing-oriented Metadata Document
Looking at a Google Spreadsheet about the Intel CPU portfolio: Basic Metadata Document or Faceted-Browsing-oriented Metadata Document
Looking at an Ontology (Data Dictionary) generated from Google Spreadsheet content: Faceted-Browsing-oriented Metadata Document
Looking at a Description of an Individual (Instance) associated with one of the Entity Types (Classes) from the generated Ontology

Conclusion

The ubiquity of the Web and its profound impact on Internet usability and accessibility are not accidental. They are the product of a well-designed solution to data access and integration based on the principle of deceptive simplicity, courtesy of hyperlinks functioning as powerful enablers of data access and data representation.

Our Sponger middleware brings all of the magic of webby structured data representation and access to your existing HTTP, ODBC, JDBC, ADO.NET, OLE-DB, and XMLA compliant applications and services. Basically, every HTTP-accessible document has become a usable structured data source, lying in wait for full exploitation by current and future digital transformation initiatives aimed at optimizing personal and organization-wide agility.

Existing tools for producing Analytics Dashboards, Performance Indicator Reports, and other personal productivity enhancers immediately morph into launch-points for exploring intelligent Knowledge Graphs automatically derived from your own document repositories and databases.