Neo4j-OGM and Spring Data Neo4j

How to choose a unique identifier for your database entities

Published in Neo4j Developer Blog, May 8, 2019 (7 min read)

I work for Neo4j, where I’m mainly responsible for maintaining Neo4j-OGM, our object graph mapper, and Spring Data Neo4j, our implementation of the Spring Data repository abstraction. While my current work is mostly graph related, I have also worked, and still work, with relational databases.

Over the last months I noticed that some issues I encountered in the context of object-relational mapping keep reappearing in object-graph mapping, most of the time with similar consequences.

Today I’ll focus on one of those things: identifiers.

I’ll use the term database entities when I speak about things that have an identity inside the database. I’ll use application entities when I speak about identifiable objects inside your application. When using Neo4j-OGM, those are classes annotated with @NodeEntity; when using JPA or Hibernate with a relational database, they are annotated with @Entity. Usually the two kinds correspond to each other. Neither is an entity in the pure sense of Domain-Driven Design.

All database entities stored in Neo4j, nodes as well as relationships, need an identifier that allows them to be retrieved. Neo4j internally creates or reuses unique values for those, representing record offsets in the datastore. The internal id of a database entity can be retrieved by calling id() with the node or relationship in question. This internal id is very similar to the row id of a relational database.

Most of the time your business should not be concerned with the physical ids of things. However, your application will want to identify things as well. If you’re lucky, you’ll find a business key that is unique and won’t change (hint: people’s names are not good candidates, and neither are most other things you’d think of first).

Congratulations, you found a natural key. If this is not possible, then you need a surrogate key.

The Neo4j internal id could serve as a surrogate key, as it can be easily extracted and used. However, it is problematic to share it with other systems for long-term storage or to give long-term guarantees (i.e. that the same resource will be reachable under the given id): the graph database might reuse the record, and with it the id, once a database entity has been deleted.
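To make the reuse problem tangible, here is a minimal sketch in plain Java. ToyRecordStore is a hypothetical toy model that hands out record offsets as ids and recycles freed ones. It is not how Neo4j's storage engine is implemented; it only illustrates why a stored id can end up pointing at a different entity.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Toy model of a store that uses record offsets as ids.
// This is NOT Neo4j's storage engine, only an illustration of id reuse.
class ToyRecordStore {
    private final Map<Long, String> records = new HashMap<>();
    private final Deque<Long> freedIds = new ArrayDeque<>();
    private long nextId = 0;

    long create(String payload) {
        // Prefer a freed slot over growing the store.
        long id = freedIds.isEmpty() ? nextId++ : freedIds.pop();
        records.put(id, payload);
        return id;
    }

    void delete(long id) {
        // Only ids that were actually in use become available again.
        if (records.remove(id) != null) {
            freedIds.push(id);
        }
    }

    String load(long id) {
        return records.get(id);
    }
}

class IdReuseDemo {
    public static void main(String[] args) {
        ToyRecordStore store = new ToyRecordStore();
        long homerId = store.create("Homer");
        store.delete(homerId);
        long margeId = store.create("Marge");
        // An external system that kept homerId now silently resolves it to Marge.
        System.out.println("id reused: " + (homerId == margeId));
        System.out.println("homerId now loads: " + store.load(homerId));
    }
}
```

Any bookmark, URL or foreign system that stored the first id would now reach the wrong entity without noticing.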

So what are your options?

Use the internal database entity id as id for your application entity

Your mapping code would look like this (except for the class name, which is used here to tell the different approaches apart):

@NodeEntity
public class PersonWithInternalSurrogateKey {

   @Id @GeneratedValue
   private Long id;

   private String name;

   public Long getId() {
      return id;
   }

   public String getName() {
      return name;
   }

   public void setName(String name) {
      this.name = name;
   }
}

Notice that you don’t need a setter for the id as it is immutable. You can also add an equals/hashCode pair based on the id. I’m personally not a fan of that, as the hash code will return different values before and after persisting the entity (the id attribute is set by Neo4j-OGM on save). It does, however, comply with the contracts of Java’s Object#equals and Object#hashCode:

"Whenever it is invoked on the same object more than once during an execution of a Java application, the hashCode method must consistently return the same integer, provided no information used in equals comparisons on the object is modified."

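The practical consequence of a hash code that changes on save can be shown with a small plain-Java sketch. PersonWithIdBasedEquals and its simulatePersist method are hypothetical stand-ins: the method only mimics a mapper assigning the generated id, no Neo4j-OGM is involved.

```java
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

// Hypothetical entity whose equals/hashCode rely on a generated id.
class PersonWithIdBasedEquals {
    private Long id; // null until the entity has been "persisted"
    private final String name;

    PersonWithIdBasedEquals(String name) {
        this.name = name;
    }

    String getName() {
        return name;
    }

    // Stands in for a mapper assigning the generated id during save.
    void simulatePersist(long generatedId) {
        this.id = generatedId;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof PersonWithIdBasedEquals)) return false;
        return Objects.equals(id, ((PersonWithIdBasedEquals) o).id);
    }

    @Override
    public int hashCode() {
        return Objects.hash(id);
    }
}

class HashCodeDemo {
    static boolean stillFoundAfterPersist() {
        Set<PersonWithIdBasedEquals> people = new HashSet<>();
        PersonWithIdBasedEquals homer = new PersonWithIdBasedEquals("Homer");
        people.add(homer);          // stored under the hash of a null id
        homer.simulatePersist(42L); // hash code changes while inside the set
        return people.contains(homer);
    }

    public static void main(String[] args) {
        System.out.println("found after persist: " + stillFoundAfterPersist());
    }
}
```

The set still holds the object, but a lookup no longer finds it, because the lookup uses the new hash code while the entry was stored under the old one. That is legal under the contract quoted above, yet easy to trip over.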
Your Spring Data repository will then look like this:

import org.springframework.data.repository.CrudRepository;

public interface PersonWithInternalSurrogateKeyRepository 
   extends CrudRepository<PersonWithInternalSurrogateKey, Long> {
}

The ID used in the repository refers to the attribute annotated with @Id. The flow of using it is roughly like this:

@Test
public void flowWithInternalSurrogateKeys() {
   PersonWithInternalSurrogateKey newPerson 
     = new PersonWithInternalSurrogateKey();
   newPerson.setName("Homer");

   PersonWithInternalSurrogateKey savedPerson
      = surrogateKeyRepository.save(newPerson);
   assertThat(savedPerson.getId()).isNotNull();

   Optional<PersonWithInternalSurrogateKey> loadedPerson
      = surrogateKeyRepository.findById(savedPerson.getId());
   assertThat(loadedPerson).isPresent();
}

Advantages: It is pretty clear that the id attribute is the surrogate business key, and it takes no further effort or configuration to use it.
Disadvantage: It is tied to Neo4j’s internal database id, which is not guaranteed to stay unique to our application entity over the lifetime of the database.

Use externally provided surrogate keys

The @GeneratedValue annotation can take a class implementing org.neo4j.ogm.id.IdStrategy as parameter. Neo4j-OGM provides InternalIdStrategy (the default) and UuidStrategy out of the box. The latter generates a new UUID for each entity and returns it as java.util.UUID. An application entity using that would look like this (getters and setters remain the same, apart from the result type):

public class PersonWithExternalSurrogateKey {
   @Id @GeneratedValue(strategy = UuidStrategy.class)
   private UUID id;

   private String name;
}

The repository doesn’t change much either:

import java.util.UUID;

import org.springframework.data.repository.CrudRepository;

public interface PersonWithExternalSurrogateKeyRepository
   extends CrudRepository<PersonWithExternalSurrogateKey, UUID> {
}

Nor does the flow:

@Test
public void flowWithExternalSurrogateKeys() {
   PersonWithExternalSurrogateKey newPerson 
    = new PersonWithExternalSurrogateKey();
   newPerson.setName("Homer");

   PersonWithExternalSurrogateKey savedPerson
    = externalSurrogateKeyRepository.save(newPerson);
   assertThat(savedPerson.getId()).isNotNull();

   Optional<PersonWithExternalSurrogateKey> loadedPerson
    = externalSurrogateKeyRepository.findById(savedPerson.getId());
   assertThat(loadedPerson).isPresent();
}

We have to discuss two separate things regarding advantages and disadvantages: the assignment itself and the UUID strategy. A universally unique identifier is meant to be unique for practical purposes. To quote Wikipedia: “Thus, anyone can create a UUID and use it to identify something with near certainty that the identifier does not duplicate one that has already been, or will be, created to identify something else.”
Our strategy uses Java’s built-in UUID mechanism, employing a cryptographically strong pseudo random number generator. In most cases that should work fine, but your mileage may vary.
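As a rough illustration of that mechanism: in plain Java, java.util.UUID.randomUUID() produces such random (version 4) UUIDs. The sketch below only shows the JDK call. Whether UuidStrategy delegates to exactly this method is an assumption on my side, and UuidDemo is a made-up class name.

```java
import java.util.UUID;

class UuidDemo {
    // Creates a random (version 4) UUID; the JDK documents that it is
    // generated with a cryptographically strong pseudo random number generator.
    static UUID newId() {
        return UUID.randomUUID();
    }

    public static void main(String[] args) {
        UUID id = newId();
        System.out.println(id + " (version " + id.version() + ")");
    }
}
```

Of the 128 bits, 122 are random, which is why collisions are unlikely for practical purposes but not strictly impossible.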
That leaves the assignment itself:

Advantage: The application is in full control and can generate a key that is just unique enough for the purpose of the application. The generated value will be stable and there won’t be a need to change it later on.
Disadvantage: The generation strategy is applied on the application side of things. These days most applications are deployed in more than one instance to scale nicely. If your strategy is prone to generating duplicates, then inserts will fail, as the uniqueness of the primary key will be violated. So while you don’t have to think about a unique business key in this scenario, you have to think harder about what to generate. A solution could be a strategy that uses a hi/lo algorithm similar to the one Hibernate uses, or solutions like Snowflake (used by Twitter) or Sonyflake.
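To give an idea of what a Snowflake-style generator does, here is a minimal sketch in plain Java. The bit layout (41 bits of milliseconds, 10 bits node id, 12 bits sequence), the custom epoch and the class name are assumptions chosen for illustration; this is neither Twitter's nor Sonyflake's actual implementation. Each application instance must be configured with its own node id, and that is what keeps instances from colliding.

```java
// Snowflake-style sketch: timestamp | node id | per-millisecond sequence.
// Layout and epoch are illustrative assumptions, not a reference implementation.
class SnowflakeLikeIdGenerator {
    private static final long CUSTOM_EPOCH = 1546300800000L; // 2019-01-01T00:00:00Z
    private static final int NODE_BITS = 10;
    private static final int SEQUENCE_BITS = 12;
    private static final long MAX_SEQUENCE = (1L << SEQUENCE_BITS) - 1;

    private final long nodeId;
    private long lastTimestamp = -1L;
    private long sequence = 0L;

    SnowflakeLikeIdGenerator(long nodeId) {
        if (nodeId < 0 || nodeId >= (1L << NODE_BITS)) {
            throw new IllegalArgumentException("nodeId out of range: " + nodeId);
        }
        this.nodeId = nodeId;
    }

    synchronized long nextId() {
        long now = System.currentTimeMillis();
        if (now < lastTimestamp) {
            // Clock moved backwards: keep using the last timestamp so ids stay increasing.
            now = lastTimestamp;
        }
        if (now == lastTimestamp) {
            sequence = (sequence + 1) & MAX_SEQUENCE;
            if (sequence == 0) {
                // Sequence exhausted for this millisecond: wait for the next one.
                while ((now = System.currentTimeMillis()) <= lastTimestamp) {
                    Thread.yield();
                }
            }
        } else {
            sequence = 0;
        }
        lastTimestamp = now;
        return ((now - CUSTOM_EPOCH) << (NODE_BITS + SEQUENCE_BITS))
                | (nodeId << SEQUENCE_BITS)
                | sequence;
    }

    public static void main(String[] args) {
        SnowflakeLikeIdGenerator generator = new SnowflakeLikeIdGenerator(1);
        System.out.println(generator.nextId());
        System.out.println(generator.nextId());
    }
}
```

A generator like this could be wrapped in a class implementing org.neo4j.ogm.id.IdStrategy and passed to @GeneratedValue. The ids are then plain long values that sort roughly by creation time.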

Using business or natural keys as primary keys

In my book, this is the optimal case, but it is very rare and hard to decide on. Things like names tend to change, and even social security numbers do. German taxpayers get assigned a unique id right after birth, given they have a permanent address. This might be a candidate:

@NodeEntity
public class TaxPayer {
   @Id
   private String taxId;

   private String name;

   public TaxPayer(String taxId, String name) {
      this.taxId = taxId;
      this.name = name;
   }

   public String getTaxId() {
      return taxId;
   }

   public String getName() {
      return name;
   }

   public void setName(String name) {
      this.name = name;
   }

   @Override
   public boolean equals(Object o) {
      if (this == o)
         return true;
      if (!(o instanceof TaxPayer))
         return false;
      TaxPayer taxPayer = (TaxPayer) o;
      return Objects.equals(taxId, taxPayer.taxId);
   }

   @Override
   public int hashCode() {
      return Objects.hash(taxId);
   }
}

You’ll notice that the whole entity makes much more sense from an application point of view. The tax id doesn’t change, so I can provide a sane equals/hashCode pair. Due to restrictions in Neo4j-OGM, the field must not be final, though.

While the repository changes only with regard to the identifier type used, the flow can now be different:

@Test
public void flowWithBusinessKeys() {
   TaxPayer taxPayer = new TaxPayer("4711", "Michael");

   TaxPayer savedTaxPayer = taxPayerRepository.save(taxPayer);
   assertThat(savedTaxPayer).isEqualTo(taxPayer);

   Optional<TaxPayer> loadedPayer 
     = taxPayerRepository.findById("4711");
   assertThat(loadedPayer).isPresent();
}

With the assigned business key it is not necessary to use the returned object instead of the one initially created. I would nevertheless stick to the flow using the returned object, as you might use our auditing support, which allows attributes like updatedAt or updatedBy to be filled with the current timestamp or user.

Advantages: Using a business or natural key as primary key is natural. The entity in question is clearly identified, and most of the time it feels just right in the further modelling of your domain.
Disadvantages: Business keys as primary keys will be hard to update once you realise that the key you found is not as stable as you thought. Often it turns out that it can change, even when promised otherwise. Apart from that, finding identifiers that are truly unique for a thing is hard.

Don’ts

While some other combinations are possible, I do not recommend them.
Look at this bad person together with their repository:

public class BadPerson {
   @Id @GeneratedValue
   private Long internalId;

   private Long id;

   public Long getInternalId() {
      return internalId;
   }

   public Long getId() {
      return id;
   }

   public void setId(Long id) {
      this.id = id;
   }
}

public interface BadPersonRepository extends CrudRepository<BadPerson, Long> {
}

Which id would the findById method take? Well, I had to try it out for myself:

@Test
public void flowWithBadPersons() {
   BadPerson person1 =  new BadPerson();
   person1.setId(42L);

   BadPerson savedPerson = badPersonRepository.save(person1);

   assertThat(
     badPersonRepository.findById(savedPerson.getInternalId())
   ).isPresent();
   assertThat(badPersonRepository.findById(42L)).isEmpty();
}

So the internal id (internalId) it is, because the findById method comes from the repository abstraction and uses the attribute annotated with @Id. You would need to override it with a custom query to fulfil your implicit expectation. So please just avoid such setups; they are confusing and error-prone.
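The underlying rule can be modelled in a few lines of plain Java. Everything in this sketch is hypothetical: ToyId stands in for the real @Id annotation, and ToyRepository is a naive in-memory list, not the Spring Data implementation. It only shows why a lookup by id follows the annotation and not the attribute name.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

// Hypothetical stand-in for a mapper's @Id annotation.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.FIELD)
@interface ToyId {
}

class ToyBadPerson {
    @ToyId
    Long internalId; // what the repository treats as the identifier

    Long id;         // just another property, despite its name

    ToyBadPerson(Long internalId, Long id) {
        this.internalId = internalId;
        this.id = id;
    }
}

// Toy repository: findById only ever looks at the annotated field.
class ToyRepository<T> {
    private final List<T> entities = new ArrayList<>();

    void save(T entity) {
        entities.add(entity);
    }

    Optional<T> findById(Object wanted) {
        for (T entity : entities) {
            for (Field field : entity.getClass().getDeclaredFields()) {
                if (!field.isAnnotationPresent(ToyId.class)) {
                    continue; // fields without the annotation are never consulted
                }
                try {
                    field.setAccessible(true);
                    if (wanted.equals(field.get(entity))) {
                        return Optional.of(entity);
                    }
                } catch (IllegalAccessException e) {
                    throw new IllegalStateException(e);
                }
            }
        }
        return Optional.empty();
    }
}

class BadPersonDemo {
    public static void main(String[] args) {
        ToyRepository<ToyBadPerson> repository = new ToyRepository<>();
        repository.save(new ToyBadPerson(7L, 42L));
        System.out.println(repository.findById(7L).isPresent());
        System.out.println(repository.findById(42L).isPresent());
    }
}
```

An attribute that merely happens to be called id plays no role in the lookup, which matches what the test above observed.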

Conclusion

Neo4j-OGM can deal with internal and external surrogate keys. From an application perspective, my preferred solution would be finding natural business keys. As this is sometimes really hard, there are easy ways to generate surrogate keys. Be aware that those need to be unique as well. That can be difficult even for some UUID implementations.

If you don’t expose nodes as external resources in your application and therefore don’t rely on long-term guarantees about the ids, internal keys should work fine for you.

Note: All examples used are available here:

michael-simons/neo4j-sdn-ogm-tips (github.com): A curated list of Neo4j SDN and OGM tips developed while answering questions on SO or for customers. …

Photo by Aaron Burden on Unsplash.