DBpedia Usage Report (as of 2016–09–06)

A periodic report on the DBpedia SPARQL endpoint and associated Linked Data deployment.

Introduction

This document shows some of the statistics collected between September 2015 and August 2016, spanning a full year of logs from the DBpedia web service operated by OpenLink Software at http://dbpedia.org/sparql/.

The log files used to prepare this document include data for the DBpedia 2015–04 release, which was in service from the beginning of July 2015 until the beginning of March 2016, as well as the DBpedia 2015–10 release, which is the current version of the DBpedia dataset.

Infrastructure

The DBpedia service consists of:

  • one or more Virtuoso Universal Server instance(s) — handling the SPARQL endpoint and Linked Data deployment, with support for content-negotiable RDF and other document formats
  • a Reverse Proxy Server — which forwards client requests to an available Virtuoso instance and caches the results in case another client makes the exact same request within a specified timeframe
  • a Physical Computer — hosted in OpenLink Software’s datacenter

At present, the DBpedia service is provided by the Virtuoso 7.x Column Store Engine on a virtual machine running CentOS 6.x, using eight (8) Intel Xeon E5–2630 2.30 GHz cores, a 200 GB SSD, and 64 GB of memory.

As of February 2016, OpenLink added a secondary Virtuoso instance as a fallback in case the primary instance becomes unavailable.

Rate limits and Connection limits

To ensure the DBpedia service remains available to everyone, OpenLink applies both rate limits and concurrent-connection limits, preventing badly written or badly behaved services from monopolizing the available connections.

At this time, the following limits are used:

  • Connection limit of 50 parallel connections per IP address. While this may seem rather large, many sites deploy some form of Network Address Translation (NAT) due to the ongoing IPv4 address shortage. Without tracking cookies it is impossible to distinguish between machines inside a NAT network, and because of all kinds of privacy concerns and EU regulations, OpenLink has decided not to use such cookies at this point in time.
  • Rate limit of 100 requests per second per IP address, with an initial burst of 120 requests. A sketch of client-side pacing that respects this limit follows this list.
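
The sketch below shows one way a client could pace its requests to stay within these limits. It is a minimal example only, written in Python with the requests library; the endpoint URL comes from this document, while the query list, pacing logic, and function name are illustrative assumptions rather than anything prescribed by OpenLink.

    import time
    import requests

    ENDPOINT = "http://dbpedia.org/sparql"   # public DBpedia SPARQL endpoint
    MAX_REQUESTS_PER_SECOND = 100            # stay at or below the published rate limit

    def run_queries(queries):
        """Issue queries sequentially, pacing them to respect the rate limit."""
        min_interval = 1.0 / MAX_REQUESTS_PER_SECOND
        results = []
        for query in queries:
            started = time.monotonic()
            response = requests.get(
                ENDPOINT,
                params={"query": query, "format": "application/sparql-results+json"},
                timeout=130,
            )
            results.append(response)
            # Sleep off the remainder of the per-request budget, if any.
            elapsed = time.monotonic() - started
            if elapsed < min_interval:
                time.sleep(min_interval - elapsed)
        return results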

As part of monitoring the DBpedia service, OpenLink performs frequent traffic analysis to make sure the service is running smoothly.

Ideally, applications should check the HTTP status code of each request and, on a 503 status code, sleep for 1–2 seconds before retrying the request.
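
A minimal sketch of that retry pattern is shown below, assuming Python and the requests library; the query, retry count, and function name are illustrative, not part of the service itself.

    import time
    import requests

    ENDPOINT = "http://dbpedia.org/sparql"

    def query_with_retry(query, max_retries=5):
        """Send a SPARQL query; on HTTP 503, sleep 1-2 seconds and retry."""
        for attempt in range(max_retries):
            response = requests.get(
                ENDPOINT,
                params={"query": query, "format": "application/sparql-results+json"},
                timeout=130,
            )
            if response.status_code == 503:
                time.sleep(1 + attempt % 2)   # back off 1-2 seconds as suggested above
                continue
            response.raise_for_status()
            return response.json()
        raise RuntimeError("Gave up after repeated 503 responses")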

OpenLink reserves the right to alter these parameters at any time to make sure the service remains reachable to the general public.

In case of misuse, OpenLink may temporarily block an offender’s IP address from accessing the DBpedia service. This temporary ban is automatically lifted once such a blocked IP address refrains from making any request to the DBpedia service for at least 5 minutes.

Virtuoso “Anytime Query” Functionality

Anytime Query is a core Virtuoso feature that addresses the challenges inherent in providing a publicly accessible interface for ad-hoc querying at Web scale. It allows any SPARQL- and HTTP-protocol-savvy user agent (aka client) to issue long-running and/or large-solution queries whose complete solution would exceed the configured query timeout and/or result set limits. Rather than being rebuffed with no solution, such clients receive partial solutions conforming to those thresholds. The feature also enables the use of LIMIT and OFFSET (typically combined with ORDER BY and/or GROUP BY) to create windows (also known as cursors) for sliding through the set of data that constitutes the query's complete solution.

Note: Even while paging through a partial query solution, Virtuoso continues to work towards a complete solution in the background.
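
The sketch below illustrates this kind of windowing by sliding a LIMIT/OFFSET cursor over a query solution. It assumes Python with the requests library; the page size, query, and stopping condition are illustrative assumptions, and, as noted above, an ORDER BY is needed for successive windows to be stable.

    import requests

    ENDPOINT = "http://dbpedia.org/sparql"
    PAGE_SIZE = 10000   # matches the endpoint's maximum result set size

    def page_through(base_query):
        """Slide a LIMIT/OFFSET window over the full solution of base_query."""
        offset = 0
        while True:
            windowed = f"{base_query} LIMIT {PAGE_SIZE} OFFSET {offset}"
            response = requests.get(
                ENDPOINT,
                params={"query": windowed, "format": "application/sparql-results+json"},
                timeout=130,
            )
            response.raise_for_status()
            rows = response.json()["results"]["bindings"]
            if not rows:
                break            # an empty window means the solution is exhausted
            yield from rows
            offset += PAGE_SIZE

    # Example use, with an ORDER BY so successive windows are stable:
    # for row in page_through(
    #     "SELECT ?s WHERE { ?s a <http://dbpedia.org/ontology/Place> } ORDER BY ?s"
    # ):
    #     ...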

Configured Virtuoso limits on the DBpedia endpoint

The Virtuoso configuration for the DBpedia endpoint includes:

  • Query Cost Estimation Timeout of 240 seconds. This is the query-plan optimization threshold that comes into play during the early stages of solution construction.
  • Query Execution Timeout of 120 seconds. This is the query solution preparation threshold. If the timeout stops execution before the solution is complete — i.e., if the solution is partial — this is signified to the query client via HTTP response headers (see the sketch after this list).
  • Maximum SPARQL query solution (aka result set) size of 10,000 rows. This is the maximum number of solution rows (for SELECT queries) or triple/quad statements (for CONSTRUCT or DESCRIBE queries) returned per query-solution-retrieval round-trip.
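
As a hedged sketch of how a client might detect such a partial solution, the Python fragment below inspects the response headers after a query. The X-SQL-State and X-SQL-Message header names are assumptions about how the deployed Virtuoso version reports an interrupted (anytime) solution and may differ from what the endpoint actually returns.

    import requests

    ENDPOINT = "http://dbpedia.org/sparql"

    def fetch(query):
        """Run a query and report whether the solution appears to be partial."""
        response = requests.get(
            ENDPOINT,
            params={"query": query, "format": "application/sparql-results+json"},
            timeout=130,
        )
        response.raise_for_status()
        # Assumption: Virtuoso reports an interrupted (anytime) solution through
        # response headers such as the ones below; names may vary by version.
        partial = "X-SQL-State" in response.headers
        if partial:
            print("Partial solution:", response.headers.get("X-SQL-Message", ""))
        return response.json(), partial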

Create your own DBpedia instance

Users who frequently hit these restrictions, which mainly affect very complex analytical queries, are advised to set up their own DBpedia instance.

HTTP Statistics

HTTP logs

The HTTP server log files used in this report exclude traffic generated by:

  • IP addresses that were temporarily rate-limited after their burst period
  • IP addresses that were banned after misuse
  • Applications, spiders, and other crawlers that were blocked after frequently hitting the rate-limiter or which generally claimed too many resources.

The system uses a combination of firewall rules and Access Control Lists (ACLs) to quickly drop such connections, so legitimate users of the DBpedia service can connect and perform their lookups.

To save time, these dropped connections are not recorded in the log files.

The data for this document was extracted from reports generated by Webalizer v2.21.

HTTP Usage Overview

Notes:

[1] There is a small bias when taking an average of a set of averages; however, the directly calculated average, (sum of all hits) ÷ (number of days in the dataset), or 1,546,813,738 ÷ 365 ≈ 4,237,845 hits per day, is close enough to the figure above that we consider the difference negligible.

[2] As with note #1, the directly calculated average number of visits is 17,002,123 ÷ 365, or roughly 46,581 visits per day.

Hits

The following two graphs show the average number of hits (or requests) per day, as well as the total number of hits (or requests) per month that were made to the DBpedia service.

The sudden spike in February was attributed to a small number of research institutes making a huge number of requests to the DBpedia service. We used this situation to test our new dual-instance setup to make sure we had ample room for future growth.

Apart from this spike, the graphs continue to show a general month-on-month increase in usage, continuing a trend that goes back all the way to the DBpedia 3.3 dataset.

Visits

The following two graphs show the average number of unique visits per day, as well as the total number of visits per month that clients made to the DBpedia service.

We consider multiple hits (or requests) from the same client to comprise a single “visit” as long as less than 30 minutes pass between consecutive requests; a gap of more than 30 minutes between requests starts a new “visit.”
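
As an illustration of this sessionization rule, the sketch below groups a log of (client, timestamp) pairs into visits using the 30-minute gap; the input format and function name are assumptions made for the example, not the actual Webalizer implementation used for this report.

    from datetime import timedelta

    VISIT_GAP = timedelta(minutes=30)

    def count_visits(requests_log):
        """Count visits in a log of (client_id, timestamp) pairs.

        A new visit starts whenever a client has been silent for more than
        30 minutes; otherwise a request folds into that client's current visit.
        """
        last_seen = {}   # client_id -> timestamp of that client's previous request
        visits = 0
        for client, ts in sorted(requests_log, key=lambda pair: pair[1]):
            previous = last_seen.get(client)
            if previous is None or ts - previous > VISIT_GAP:
                visits += 1
            last_seen[client] = ts
        return visits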

The sudden increase in the number of visits since June 2016 can be partly attributed to a number of new web services, such as iSpecies, LodLive, and LODmilla, which use DBpedia for look-ups. Whether these web services run in the client’s browser or on an aggregating server, interactive services will always result in more distinct visits to the DBpedia service.

While web crawlers (like Google, DuckDuckGo, Bing, and Yahoo) normally do not make many distinct visits, hosted projects on Google’s UserContents, Amazon, and other cloud services do contribute to the number of visits.

OpenLink may analyze the visitor data from these various services separately in a future analytics document.

Again we see a month-on-month increase in visits, continuing the trend that began with the public availability of the initial DBpedia 3.3 dataset.

Sites

This last graph shows the number of unique IP addresses that made requests to the DBpedia service.

Since it is not possible to distinguish individual machines behind a NAT firewall, these figures should not be taken as absolute.

Links

Previous Reports

Some of the statistics in this document were previously published as part of:

Link to spreadsheet