In many use cases that we at bakdata see at our customers’ sites, relevant information is hidden in the connections between entities, e.g., when analyzing relationships between users, dependencies between items, or connections between sensors. Such use cases are usually modeled as a graph. Earlier this year, Amazon released its new graph database, Neptune. In this post, we want to share our first insights: the good experiences, and the ones that might improve over time.
Graph databases promise to handle highly connected datasets better than their relational equivalents. In such datasets, relevant information is usually stored in the links between entities. For the purpose of testing Neptune, we used the amazing open data project MusicBrainz. MusicBrainz collects any imaginable kind of metadata about music, e.g., information about artists, songs, album releases, or concerts, but also which artist collaborated with whom for a song, who produced a song, or when an album was released in which country. MusicBrainz can be seen as a huge network of entities that are somehow related to the music industry.
The MusicBrainz dataset is provided as a CSV dump of a relational database. In total, the dump contains about 93 million rows across 157 tables. Some of these tables hold master data about, for example, artists, events, recordings, releases, or tracks. Others, the link tables, store relationships between artists and other artists, recordings, or releases, among others. This shows the inherent graph structure of the dataset. Transforming the dataset into RDF yielded roughly 500 million triples.
Based on experiences and impressions from project partners that we work with, we imagine a setting in which this knowledge base is used to derive new information. Additionally, we envision it being updated regularly, e.g., by adding new releases or updating members of bands.
As expected, the setup of Amazon Neptune is straightforward. It is documented in quite some detail. With only a few clicks you have a graph database up and running. However, when it comes to more detailed configuration, it is hard to find the correct information. Hence, we want to point out one configuration parameter.
Amazon states that Neptune focuses on low-latency, transactional workloads, which is why there is a default query timeout of 120 seconds. We, however, tested more analytical use cases, in which we regularly hit this limit. The timeout can be modified by creating a new parameter group for Neptune and setting neptune_query_timeout to an appropriate limit.
Below, we discuss in detail how we loaded the MusicBrainz data into Neptune.
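The following is a minimal sketch of how this parameter could be set programmatically instead of through the console. It assumes that neptune_query_timeout is specified in milliseconds, that a custom parameter group already exists, and that `neptune_client` is a `boto3.client("neptune")`. The group name is hypothetical, and depending on the Neptune version the parameter may live in a DB cluster parameter group, in which case `modify_db_cluster_parameter_group` would be the call to use.

```python
def query_timeout_parameter(timeout_seconds: int) -> dict:
    """Build the parameter entry for neptune_query_timeout.

    Assumption: Neptune expects the value in milliseconds, as a string.
    """
    if timeout_seconds <= 0:
        raise ValueError("timeout must be positive")
    return {
        "ParameterName": "neptune_query_timeout",
        "ParameterValue": str(timeout_seconds * 1000),
        # "pending-reboot" is accepted for static and dynamic parameters.
        "ApplyMethod": "pending-reboot",
    }


def apply_query_timeout(neptune_client, group_name: str, timeout_seconds: int):
    """Write the timeout into an existing custom parameter group.

    `neptune_client` is expected to be boto3.client("neptune");
    `group_name` is a hypothetical name of a group you created before.
    """
    return neptune_client.modify_db_parameter_group(
        DBParameterGroupName=group_name,
        Parameters=[query_timeout_parameter(timeout_seconds)],
    )
```

For example, `apply_query_timeout(client, "my-neptune-params", 1800)` would request a 30-minute limit. The instance has to use that parameter group for the change to take effect.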
Relations to Triples
First, we transformed the MusicBrainz data into RDF triples. For each table, we defined a template that specifies how each column is represented as a triple. For the artist table, for example, each row is mapped to twelve RDF triples.
The suggested way to load large amounts of data into Neptune is the bulk loading process via S3. After uploading your triple files to S3, you initiate the load via a POST request. In our case, this took about 24 hours for 500 million triples. We expected this to be faster.
To avoid this long process every time we start Neptune, we chose to restore the instance from a snapshot in which these triples are already loaded. Starting from a snapshot is significantly faster, but it still takes about an hour until Neptune is available for querying.
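To illustrate the idea of such a template, here is a minimal sketch that maps one row of a table to N-Triples lines. The predicate URIs, the `http://example.org/mb/` namespace, and the three columns are hypothetical placeholders, not the mapping we actually used. A real template for the artist table would cover all twelve columns.

```python
# Hypothetical template: column name -> predicate URI.
ARTIST_TEMPLATE = {
    "name": "http://example.org/mb/name",
    "sort_name": "http://example.org/mb/sortName",
    "begin_date_year": "http://example.org/mb/beginYear",
}


def escape_literal(value: str) -> str:
    """Escape characters that are not allowed unescaped in N-Triples literals."""
    return (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def row_to_triples(row: dict, template: dict, subject_prefix: str, key: str = "id") -> list:
    """Turn one relational row into one triple per non-empty templated column."""
    subject = f"<{subject_prefix}{row[key]}>"
    triples = []
    for column, predicate in template.items():
        value = row.get(column)
        if value is None or value == "":
            continue  # NULL columns produce no triple
        triples.append(f'{subject} <{predicate}> "{escape_literal(str(value))}" .')
    return triples
```

Calling `row_to_triples(row, ARTIST_TEMPLATE, "http://example.org/mb/artist/")` for every row of the CSV dump yields the lines of the triple file. Empty columns are skipped, so a row can produce fewer triples than the template has entries.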
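As a rough sketch, the POST request that starts a bulk load could be built as follows. The endpoint host, bucket, role ARN, and region are placeholders you would replace with your own values. The `ntriples` format and `failOnError` set to `FALSE` are assumptions about a setup like ours, where a small share of faulty triples is tolerated.

```python
import json
import urllib.request


def build_load_request(endpoint: str, s3_uri: str, iam_role_arn: str, region: str):
    """Build (but do not send) the request for Neptune's bulk loader."""
    payload = {
        "source": s3_uri,            # S3 prefix that holds the triple files
        "format": "ntriples",        # assumption: files are N-Triples
        "iamRoleArn": iam_role_arn,  # role that allows Neptune to read from S3
        "region": region,
        "failOnError": "FALSE",      # keep loading when single triples fail
    }
    return urllib.request.Request(
        url=f"https://{endpoint}:8182/loader",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the result to `urllib.request.urlopen` from a machine inside the same VPC submits the job. The response contains a load id that can be used to poll the loader endpoint for the job status.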
When initially loading triples into Neptune, we faced various errors.
Some of these were parsing errors. To date, we have not figured out what exactly went wrong at this point, and more detailed error messages would definitely help here. The errors affected about 1% of the inserted triples. For the purpose of testing Neptune, however, we accepted that we only work on 99% of MusicBrainz’ information.
Even though this might be a no-brainer for people familiar with SPARQL, keep in mind that literal values in RDF triples need to be annotated with explicit data types if they are to be treated as numbers or dates, which is again error-prone.
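One way to make the typing less error-prone is to derive the annotation from the value's type when generating the triples. The helper below is a sketch under that assumption. It covers only a handful of XSD types and is not the code we used for the transformation.

```python
import datetime

XSD = "http://www.w3.org/2001/XMLSchema#"


def typed_literal(value) -> str:
    """Render a Python value as an N-Triples literal with an explicit datatype."""
    # bool must be checked before int, because bool is a subclass of int.
    if isinstance(value, bool):
        return f'"{str(value).lower()}"^^<{XSD}boolean>'
    if isinstance(value, int):
        return f'"{value}"^^<{XSD}integer>'
    # datetime must be checked before date, because it is a subclass of date.
    if isinstance(value, datetime.datetime):
        return f'"{value.isoformat()}"^^<{XSD}dateTime>'
    if isinstance(value, datetime.date):
        return f'"{value.isoformat()}"^^<{XSD}date>'
    # Everything else is emitted as a plain string literal.
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'
```

Without the `^^<...>` suffix, a value such as `"1947"` is a plain string, so numeric comparisons and aggregations in SPARQL would not behave as intended.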
As mentioned above, we don’t want to use Neptune as a static data store, but rather as a flexible and evolving knowledge base. Therefore, we needed to figure out ways to insert new triples when the knowledge base changes, e.g., when a new album is published, or when we want to materialize derived knowledge.
Neptune supports insert statements via SPARQL queries, both with raw data and based on subselects. Below we will discuss both approaches.
One of our goals was to insert data in a streaming fashion. Consider an album release in a new country. In terms of MusicBrainz, this means that for a release — which subsumes albums, singles, EPs, etc. — a new entry is appended to the release-country table. In RDF, we map this information to two new triples.
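Such a streaming insert could look like the sketch below, which builds a SPARQL `INSERT DATA` statement for one new release-country entry. The namespace, the two predicates, and the choice of which two facts are stored are hypothetical and only illustrate the shape of the statement.

```python
MB = "http://example.org/mb/"  # hypothetical namespace
XSD_INTEGER = "http://www.w3.org/2001/XMLSchema#integer"


def release_country_insert(release_id: int, country_code: str, year: int) -> str:
    """Build an INSERT DATA statement with two triples for one release-country row."""
    if not country_code.isalpha():
        raise ValueError("country code must consist of letters only")
    release = f"<{MB}release/{int(release_id)}>"
    country = f"<{MB}country/{country_code.upper()}>"
    return (
        "INSERT DATA {\n"
        f"  {release} <{MB}releasedIn> {country} .\n"
        f'  {release} <{MB}releaseYear> "{int(year)}"^^<{XSD_INTEGER}> .\n'
        "}"
    )
```

The resulting string is sent to the SPARQL endpoint as an update. Validating the inputs before interpolating them keeps malformed rows from producing invalid statements.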
Another goal was to derive new knowledge from the graph. Suppose we want to retrieve the number of releases that each artist published in their career. Such a query is rather complex and takes more than 20 minutes in Neptune, which is why we need to materialize the result in order to reuse this new knowledge in some other query. We therefore add triples with this information back to the graph by inserting the result of a subquery.
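A materializing insert of this kind can be sketched as an `INSERT ... WHERE` update whose `WHERE` clause wraps the aggregating subselect. The prefix and the predicates `mb:creditedTo` and `mb:releaseCount` are assumptions for illustration. Our actual query over the MusicBrainz mapping is more involved.

```python
def materialize_release_counts(namespace: str = "http://example.org/mb/") -> str:
    """Build an update that stores the number of releases per artist as new triples."""
    return f"""PREFIX mb: <{namespace}>
INSERT {{
  ?artist mb:releaseCount ?releases .
}}
WHERE {{
  {{
    SELECT ?artist (COUNT(DISTINCT ?release) AS ?releases)
    WHERE {{
      ?release mb:creditedTo ?artist .
    }}
    GROUP BY ?artist
  }}
}}"""
```

Once the update has run, later queries can read `mb:releaseCount` directly instead of repeating the expensive aggregation.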
Adding single triples to the graph takes a few milliseconds, while the execution time for inserting the result of a subquery is dominated by the execution time of the subquery itself.
Even though we did not actively use it, Neptune also allows deleting triples based on subselects or explicit data, which can be used to update information.
With the subselect that returns the number of releases for each artist, we already introduced a first kind of query that we want to answer using Neptune. Querying Neptune is straightforward: you send a POST request to the SPARQL endpoint.
Besides that, we implemented a query that returns a profile for artists containing information about their name, age, or country of origin. Bear in mind that artists may be human beings, bands, or orchestras. Additionally, we enrich this data with information about the number of releases an artist published in and up to a given year. For solo artists, we also add information about the bands this artist was a member of in each year.
Due to the complexity of such a query, we were only able to run point queries for a specific artist, e.g., Elton John, but not for all artists. Neptune does not seem to optimize this query by pushing down filters into the subselects. Therefore, one should filter each subselect manually by the artist name.
Neptune is charged both on an hourly basis and per I/O operation. For our testing, we used the smallest Neptune instance, which costs $0.384/h. In the case of the query above that calculates the profile for a single artist, Amazon charged us for a few tens of thousands of I/O operations, which amounts to about $0.02.
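A minimal sketch of such a request, using only the Python standard library, is given here. The endpoint host is a placeholder, and the port 8182 with the `/sparql` path reflects the default Neptune setup as we understand it. Read queries go into the `query` form field and updates into the `update` field.

```python
import urllib.parse
import urllib.request


def build_sparql_request(endpoint: str, sparql: str, update: bool = False, scheme: str = "https"):
    """Build (but do not send) a POST request for Neptune's SPARQL endpoint."""
    field = "update" if update else "query"
    body = urllib.parse.urlencode({field: sparql}).encode("utf-8")
    return urllib.request.Request(
        url=f"{scheme}://{endpoint}:8182/sparql",
        data=body,
        headers={
            "Content-Type": "application/x-www-form-urlencoded",
            "Accept": "application/sparql-results+json",
        },
        method="POST",
    )
```

Sending the request with `urllib.request.urlopen(request, timeout=...)` returns the result bindings as JSON. The client-side timeout should be at least as long as the configured neptune_query_timeout for analytical queries.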
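The manual optimization can be sketched as follows: the pattern that binds the artist by name is repeated inside every subselect instead of appearing only once in the outer query. The vocabulary (`mb:name`, `mb:creditedTo`, `mb:releaseYear`, `mb:memberOf`) is hypothetical, and the query is a heavily simplified stand-in for our profile query.

```python
def artist_profile_query(artist_name: str) -> str:
    """Build a profile query that restricts each subselect to one artist."""
    name = artist_name.replace("\\", "\\\\").replace('"', '\\"')
    # The same restricting pattern is pushed into every subselect by hand.
    artist_filter = f'?artist mb:name "{name}" .'
    return f"""PREFIX mb: <http://example.org/mb/>
SELECT ?artist ?year ?releases ?band
WHERE {{
  {{
    SELECT ?artist ?year (COUNT(DISTINCT ?release) AS ?releases)
    WHERE {{
      {artist_filter}
      ?release mb:creditedTo ?artist ;
               mb:releaseYear ?year .
    }}
    GROUP BY ?artist ?year
  }}
  OPTIONAL {{
    {{
      SELECT ?artist ?band
      WHERE {{
        {artist_filter}
        ?artist mb:memberOf ?band .
      }}
    }}
  }}
}}"""
```

With the restriction in place, each subselect only touches the triples of the requested artist, rather than aggregating over all artists before the outer query discards most of the result.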
Overall, Amazon Neptune fulfills most of its promises. Being a managed service, it is a graph database that is extremely easy to set up and can be run without lots of configuration. Here are our five key take-aways:
- Bulk loading is simple but slow, and it can get complicated due to error messages that are not very helpful
- Stream loading supports everything we expected and was sufficiently fast
- Querying is simple, but not interactive enough for running analytical queries
- SPARQL queries should be optimized manually
- Amazon fees are difficult to determine because it is hard to estimate the amount of data scanned by a SPARQL query