DBpedia Live, via OpenLink Virtuoso, in the Amazon Web Services Cloud

DBpedia 2016–04 Live Edition Product Page in Amazon’s Web Services Marketplace

Situation Analysis

Wikipedia is an important focal point for content curation, especially as events occur worldwide. Notably in this era of Fake News, Wikipedia’s editorial policies and curation controls serve as distinguishing features for those seeking to research and verify information across a broad range of subject matter.

DBpedia provides a subset of content from Wikipedia that has been transformed into RDF-based Linked Open Data (LOD) and deployed to the Web for public access. The primary DBpedia service has been updated periodically in batch fashion, typically with each update posted approximately 6 months after its transformation was begun — so data seen on DBpedia has generally been at least 6 months, and sometimes 12 months or more, behind then-current Wikipedia content.

Live Linked Open Data Hub

DBpedia-Live is a DBpedia database enhanced by the addition of live update capabilities. These incorporate changes made to Wikipedia in near real-time, working directly against Wikipedia’s "Firehose" of changes.

An edition of DBpedia that near-immediately reflects changes made to Wikipedia further enhances the fight against Fake News and other misinformation when provided via a live RDBMS that supports:

  • Entity Description Browsing
  • Entity Lookups by Identifier
  • Ad-hoc Structured Queries using the SPARQL (over HTTP) and/or SQL Query Languages (via ODBC, JDBC, ADO.NET, and similar connections)

That RDBMS is, of course, OpenLink Virtuoso, and an obvious place to put such an always-available database is the Amazon Web Services (AWS) Cloud.
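To make the SPARQL-over-HTTP option concrete, here is a minimal sketch of an ad-hoc query issued with curl. It assumes the instance exposes Virtuoso's default SPARQL endpoint path (/sparql); "{dns-name-of-your-ami}" is the same placeholder used throughout this post and must be substituted with your instance's DNS name.

```shell
# Sketch: an ad-hoc SPARQL query over HTTP using curl.
# "{dns-name-of-your-ami}" is a placeholder; /sparql is Virtuoso's default endpoint path.
ENDPOINT="http://{dns-name-of-your-ami}/sparql"
QUERY='SELECT ?abstract WHERE { <http://dbpedia.org/resource/DBpedia> <http://dbpedia.org/ontology/abstract> ?abstract } LIMIT 1'

# -G sends the query as a URL parameter; the Accept header requests JSON results.
curl -sG "$ENDPOINT" \
     --data-urlencode "query=$QUERY" \
     -H 'Accept: application/sparql-results+json' \
  || echo 'Endpoint not reachable (placeholder host)'
```

The same query could equally be issued over a SQL connection (ODBC, JDBC, etc.) once the instance described below is running.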

DBpedia-Live Setup & Usage

The steps illustrated below will lead you through creating and utilizing your own personal or service-specific DBpedia Live instance, via the AWS Cloud.

Virtual Machine Creation

a. Go to the AWS Marketplace page for the DBpedia Live AMI (Amazon Machine Image).

b. Follow the one-click route for AMI instantiation, choosing the virtual machine configuration that best suits your needs.

c. Wait a few minutes for AWS to prepare the requested instance.

Using your new Virtual Machine

You now have a virtual machine in the cloud that holds a pre-configured Virtuoso RDBMS instance and a preloaded DBpedia database (split over several disk segments, courtesy of disk striping).

Bring the DBpedia instance online with these two steps:

a. Log in to your instance using the command:

ssh -i {credentials-doc} ec2-user@{dns-name-of-your-ami}

b. Execute the following commands to start the Virtuoso RDBMS server, navigate to the DBpedia installation directory, and review the final lines of the Virtuoso server’s log file:

sudo /etc/init.d/virtuoso start
cd /dbpedia
sudo tail *.log

When the instance is ready, the output of the last ("tail") command will clearly indicate that the server is listening on port 80 for HTTP connections and port 1111 for SQL connections. (Note — on first launch, this may take an hour or more for the initial data population; later launches will be much faster. Repeat the “tail” command periodically until you see both HTTP and SQL "listening" log entries.)
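That periodic check can also be scripted. Below is a minimal sketch, assuming the log files live under /dbpedia as in step (b); the grep patterns reflect typical Virtuoso "online" log lines, but the exact wording can vary between Virtuoso versions, so adjust them if needed (and prefix the grep commands with sudo if the logs are not world-readable).

```shell
# Sketch: check whether Virtuoso has logged its listener-ready messages.
# Patterns are typical Virtuoso log lines; adjust for your version if needed.
LOG_GLOB='/dbpedia/*.log'
if grep -qs 'Server online at 1111' $LOG_GLOB &&
   grep -qs 'server online at 80' $LOG_GLOB; then
  STATUS='ready'     # both SQL (1111) and HTTP (80) listeners are up
else
  STATUS='starting'  # keep waiting; first launch can take an hour or more
fi
echo "Virtuoso status: $STATUS"
```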

Testing DBpedia Functionality

Once the Virtuoso server has completed its startup process, you can verify the functionality of your DBpedia instance with the following few steps:

a. Open a Web Browser window.

b. Enter the following into the Browser’s address input field, substituting as appropriate for the segment wrapped in curly-braces ("{ }"): 
http://{dns-name-of-your-ami}/resource/DBpedia

c. Follow-Your-Nose to wherever you like, by clicking on hyperlinks that interest you.
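The browser shows the HTML rendering of that entity description; the same description can be fetched in machine-readable form via HTTP content negotiation. A sketch, using the same DNS-name placeholder as above and requesting Turtle (one of several RDF serializations Virtuoso can return):

```shell
# Sketch: request the entity description as RDF (Turtle) rather than HTML,
# via HTTP content negotiation. "{dns-name-of-your-ami}" is a placeholder.
URL="http://{dns-name-of-your-ami}/resource/DBpedia"

# -L follows the redirect to the negotiated representation; the Accept
# header asks for Turtle instead of the default HTML rendering.
curl -sL -H 'Accept: text/turtle' "$URL" \
  || echo 'Placeholder host not reachable'
```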

Since you now have a fully configured Virtuoso RDBMS instance at your disposal, its other Web Service Endpoints, such as the SPARQL query endpoint at http://{dns-name-of-your-ami}/sparql, will also be functional.

Enabling Updates with DBpedia Live

To this point, you’ve been setting up a conventional DBpedia instance. Live updates will be enabled by the next few steps, and will keep your instance in sync with Wikipedia changes as they occur.

a. Navigate to the dbpintegrator (data sync program) directory with the command:

cd /dbpedia/dbpintegrator

b. Execute the following command, to enable data change tracking:

sudo sh update_changesets.sh 

c. The first time you execute this command, you will see a message indicating that a configuration file has been generated for you and needs some minor edits before updates will work. In practice, if you haven’t changed the initial ‘dba’ password, no edits are needed: the generated INI file works as-is, and you only need to run the “sh update_changesets.sh” command again for live updates to begin.

d. If you did change the default ‘dba’ password, edit the configuration file (dbpedia_updates_downloader.ini) as advised. Then repeat the previous command to kick off the live updates process:

sudo sh update_changesets.sh

e. You may also want to enable change tracking on the DBpedia ontologies (data dictionaries), which is done with the following command:

sudo sh update_ontology.sh

f. Observe the changing size of the dbpintegrator log file (dbp.log), indicating progress on the task, with the command:

ls -lth

g. A peek into that log file (e.g., with “sudo tail dbp.log”) will show how many worker threads have been deployed to retrieve changeset data.

h. You can visit http://{dns-name-of-your-ami}/live for a page that presents change activity on your instance, as it occurs.

Automating Updates

For additional resilience and automation, you can easily set up cron jobs to periodically run both change tracking scripts.

a. Open up the cron editor by issuing the command:

crontab -e

b. Add the following entries:

@hourly /dbpedia/dbpintegrator/update_changesets.sh
@daily /dbpedia/dbpintegrator/update_ontology.sh
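If you also want a record of each run, the same entries can redirect their output to a log file. A sketch (paths as above; the cron.log filename is an arbitrary choice, and the explicit sh invocation guards against the scripts lacking an execute bit):

```shell
@hourly /bin/sh /dbpedia/dbpintegrator/update_changesets.sh >> /dbpedia/dbpintegrator/cron.log 2>&1
@daily  /bin/sh /dbpedia/dbpintegrator/update_ontology.sh  >> /dbpedia/dbpintegrator/cron.log 2>&1
```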

Conclusion

That’s it! You now have your own personal or service-specific instance of DBpedia that will also track and apply changes as they occur in Wikipedia (the data source from which DBpedia’s Linked Open Data is generated).

This post demonstrates how easily a powerful RDBMS can be deployed in the cloud, providing modern data access, integration, and management functionality to a plethora of open-standards-compliant applications and services.

You are a few clicks and a few minutes away from experiencing and exploiting the promise of a Semantic Web of Linked Data combined with the virtues of a Multi-Model RDBMS that supports Relations (Data) represented as RDF Property Graphs and/or SQL Tables.
