Elasticsearch as a Graph Database

As a CTO, it’s your continual job to be in control of the technologies used in your organization, and decide when it is time to move forward —abandon a current technology in a search for a better, more suitable one.

My case is about a project that includes a very large property Graph Database — containing about 10 billion vertices (entities) and 250 billion edges
At that point, this data was stored on an instance of Titan DB, with a HBase backend.

So, why move forward? What’s wrong with Titan? 
It turns out that Titan functions pretty well as a Graph Database — meaning that if you have an ID of an entity and you want to retrieve data about it from the DB — that’s a pretty simple task that Titan does quickly, efficiently and simply. Also, given an ID — an anchor — a Graph Traversal is easy (such as, ‘give me the out-edges of the out-vertices of vertex X’).
But our case started to become more complex — until that point, all we needed was this very task — we have built a UI that enables data exploration over the Graph. Given an entity ID, our users explored other entities on the Graph using our UI. That simple. Titan was working fine.

But then, we wanted to expand our functionality — to enable free-text search of entities residing in the Graph.
Here, we encountered a limitation. Titan’s built-in indexing mechanism is limited to exact-match searches, and we wanted to allow our users more than that. We wanted our users to be able to search for entities that includes, starts-with or ends-with user-provided strings.

In a search for a better solution, we found out that Titan also enables integration with third-party indexing engines, such as Elasticsearch.
That sounded like a pretty good direction. But wait a minute… If all of our data has to be searchable, and we are using an external indexing engine, that means that the entire data has to be stored twice — one time on Titan backend (which is HBase), and another time on Elasticsearch.

So, we though to ourselves… Why not use Elasticsearch as a Graph Database?

Why Elasticsearch?

Elasticsearch is a search engine that operates on JSON documents.
It is a mature open-source project, and it does not require any special hardware or setup — it works out-of-the-box.

Elasticsearch can be easily scaled horizontally without any downtime.
It is easy to use using an intuitive REST API.

Advantages

  • Search: As mentioned earlier, using Elasticsearch for storing our data will make data searchable. We will be able to leverage ES features such as string contains, regular-expressions, splitting strings into words etc.
  • Deletion: ES enables deletion of massive amount of documents, a feature that is missing in Titan — Titan does not enable you to remove hundreds of millions of edges or documents using a query.
  • Partition: ES stores documents on Indices — as an abstraction, theses are containers of documents. This feature can be leveraged to partition the edges/vertices on the graph — for example, we can create 3 different indices, each of them will store different types of edges. Then you can make ES search (or delete) only data residing in specific indices, and not on others. This is something you cannot do with Titan.
  • OLAP: If you store your data on ES, you immediately enjoy the ability to process your data using Spark or MapReduce (using Elasticsearch-Hadoop). To be precise — Spark is also supported by Titan (using SparkGraphComputer), but this functionality operates very slowly using Titan.

Disadvantages

  • Graph query language: Since its originally not a Graph Database, ES does not support any graph query language, while Titan supports Gremlin for queries. That means that each query has to be modeled as a ES query, something that can be sometimes difficult. It also possible to implement or use a library that converts a graph-query into an ES query, such as Unipop.
  • Maximal sizes: As mentioned earlier, data on ES is stored on indices. Each ES index consists of one or more shards.
    As a rule of a thumb — a shards with more documents will operate slower. That means that you want to limit the number of documents stored in each shard, making each of your indices have an upper limit of documents.
    With that in mind, you must plan your indices so they will be suitable for the volumes of documents that you have to store. If you don’t plan it ahead, your system might suffer from slowness.
    The bottom line is — using ES makes you plan volumes of data ahead, while (Titan over) HBase is flexible and does not require that.

Modeling a graph into documents

While Titan has special objects for representing graph elements — Vertex, Edge, etc— ES supports a unified JSONs — no special types at all.

Modeling graph elements as JSONs is a bit challenging — you must find a way to link edges to vertices.

Ways we found out:

  1. Nested types: these are documents containing other documents. 
    It can be leveraged for storing edges as nested documents inside vertex documents.
    While seems suitable at first, it has pitfalls — if a vertex is a supernode — containing lots of edges, each retrieval of the vertex will require lots of I/O.
    In addition, it will require an implementation of a locking mechanism — think of a case of two parallel processes that wish to add an edge of a single vertex — one of them has to wait for the other to finish.
    And as ES is not transactional — this has to be externally implemented.
  2. Parent-child relationship: similar to nested types.
  3. Independently: Storing each vertex and each edge as a separate document, linking edges to their in/out vertex by specifying their IDs on the edge document itself — for example, ‘inid’ and ‘outid’.
    No locking mechanism is required — you can add or delete an edge without worrying about concurrency issues. Each edge is an independent document.

We eventually picked the last choice, as it required no further implementation, and made out documents as lightweight as possible.

Each vertex document consisted of an id field and additional primitive-type properties.

Each edge document consisted of an id, outid (id of the outgoing edge), inid (id of the ingoing edge) and additional primitive-type properties.

As we wanted all of the properties to be searchable — each property was configured as ‘index’: ‘not_analyzed’.

We set ‘docvalues’ and ‘_all’ to false.

Planning the indices

As mentioned before, our graph consists about 250 billion edges. Each edge includes about 10 properties of a primitive type.

Our common query is — given a vertex ID — get all of the out edges exists.
Meaning that each of the queries has to search through all of the 250 billion edges — no limit of time — you always query everything.

Our focus here was on queries. We wanted our system to operate as quick as Titan operates — we did not want our users to feel the change.
And in order to achieve that — we needed to plan our ES cluster to contain as less shards as possible (without overloading the shards) — so that each query will require as less I/O operations as possible.

We set a shard size — 120 million documents, all fields indexed as ‘not_analyzed’. Such a shard weighs about 22 GB.

Our initial direction was creating a time-based indices —for example, an index per day, per month or per year. But doing so we still had to query each index per each query, which is lots of I/O.

So we took another approach — use ES indices as HBase regions:
We’ll have the following indices: ‘edges-10’, ‘edges-11’, ‘edges-12’, … ‘edges-20’… ‘edges-98’, ‘edges-99’.
Each index will contain only documents that the value of their ‘outid’ property starts with the digits of the index name — for example, ‘edges-12’ will only contain documents with ‘outid’ that starts with 12.

Doing so, we minimize the number of indices that participate in a query to one.

We also added the ‘out’ field as ES routing field, so that only one shard will participate in a query.

Performance

The current Titan cluster was running on 12 physical servers, 16 cores and 100 GB of RAM. HBase was weighing about 20TB.

We wanted to examine which environment — virtual or physical — is better for the ES cluster.

We tested our system on 1 index with 1 shard. Our purpose was to measure performance on a single shard, as we planned our indices so that only a single shard will participate.

We had the following configurations for our tests:

  1. Physical + SATA: a single 32-CPUs, 200 GBs of RAM server, connected to a 7200-RPM SATA physical disk.
    (32 GBs were used as the ES JVM heap-size)
  2. Virtual + SAN: a single 10-CPUs, 16 GBs of RAM server, connected to SAN standard storage (not SSD).
    (10 GBs were used as the ES JVM heap-size)
  3. Virtual + SAN SSD: a single 10-CPUs, 16 GBs of RAM server, connected to SAN standard storage with SSD.
    (10 GBs were used as the ES JVM heap-size)

Write performance

We streamed data from Titan to ES using Spark with 45 workers.
The bulk size was set to 5 MB.

We got the following paces:

  • Physical: 45k docs per sec
  • Virtual + SAN: 30k docs per sec
  • Virtual + SAN SSD: 26k docs per sec

The physical instance writes 50% more documents per sec than the virtual instance.
In addition, the writing pace using the SSD SAN is smaller than the one on the standard SAN — probably due to a CPU bottleneck.

Query performance

When benchmarking query performance, you must be 100% sure that the results you are getting are not influenced by a caching mechanism.
The caching mechanisms that can influence our benchmarking results are: 
ES heap cache, Linux page-cache, and SAN cache.

In order to eliminate the three of them, we created the following experiment flow:

  1. Clear the page cache (by running ‘echo 3 > /proc/sys/vm/drop_caches’)
  2. Restart ES service
  3. Wait for the cluster to be green
  4. Wait that I/O is settled down (using iostat on linux)
  5. Run dummy query
  6. Clear the page cache (by running ‘echo 3 > /proc/sys/vm/drop_caches’)
  7. Immediately run query

The dummy query was:

URL: experiment-index/_search?routing=12345
{
"size": 10,
"query": {
"constant_score": {
"filter": {
"terms": {
"outid": [
"XXX"
]
}
}
}
}
}

The query itself was the same — replacing ‘XXX’ with a real vertex identifier.

We picked new vertex identifier on each test — but selected only vertices with exactly 5 outgoing edges. By doing that we made sure that ES has to averagely perform the same number of I/O reads per each query.

We got the following times for not-optimized shards with 30–40 segments each:

  • Physical: 132 millis
  • Virtual + SAN: 191 millis
  • Virtual + SAN SSD: 63 millis

And after optimizing the shard to 10-segments maximum:

  • Virtual + SAN: 119 millis
  • Virtual + SAN SSD: 49 millis

Average read latencies (as seen on iostat):

  • Physical: 5.2 millis
  • Virtual + SAN: 6.1 millis
  • Virtual + SAN SSD: 1.3 millis

Average number of read operations was 27 on non-SSD and 12 on SSD.

Screenshot from iostat on the virtual non-SSD SAN:

iostat results — virtual non-SSD SAN

Screenshot from iostat on the SSD SAN:

iostat results — virtual SSD SAN

In addition, we wanted to measure the total time that ES spends on non-I/O operations (we called it the applicative time)— therefore we performed the following experiment on a single vertex ID:

  1. Restart ES service
  2. Wait for the cluster to be green
  3. Wait that I/O is settled down (using iostat on linux)
  4. Run dummy query
  5. Immediately run query

After the second run of the experiment, the pages that ES needs for its query reside in the page-cache, and there is no need to retrieve it from the SAN.
We found out that the applicative time of ES is 17 millis.

Conclusion

Generally, ES seems like a decent choice for storing Graph elements.

More specifically, it seems like ES performs better on virtual environment using distant storage than on a physical environment using a standard SATA.

More tests should be performed in order to get more accurate conclusions.