Try. And then retry. There can be failure.

Michael Simons
Neo4j Developer Blog
17 min read · Aug 24, 2020

With persistent network connections between things, the exceptional case should be expected and not considered to be a surprise. There are dozens of reasons why a remote connection may be closed or be rendered unusable. In such scenarios precautions must be taken so that your logic succeeds eventually.

Introduction

General considerations

Most database management systems (DBMS) these days provide client libraries, aka drivers, that provide stateful connections to the DBMS. Establishing such a connection includes, among other things, acquiring a transport link, a network session on both the client and the server and, of course, user authentication. These days connections are usually encrypted — or at least they should be — so the whole TLS ceremony, including certificate verification, comes on top of that. All of that makes creating connections expensive. Therefore, once created, these network connections are kept intact as long as it's reasonable.

Having established and verified a connection once gives no guarantee that this connection will remain healthy until it is closed. There are plenty of reasons that a connection can become useless:

  • The physical connection fails
  • The server goes away
  • Parts of the server being in use go away (i.e. the members of a cluster)
  • The server figures that the connection has not been used and terminates it
  • An intermediate component (firewall, load balancer) decides that the connection should be terminated

In the Java world JDBC is an established standard for connecting to relational databases. JDBC connections are rarely handled in isolation directly from within an application but most often through connection pools. With connection pooling, a connection is placed in the pool after it has been created. After it has been "closed", the connection is put back into the pool so it can be used again and a new connection does not have to be re-established. If all the connections are in use, a new connection is established and added to the pool. Connection pools can also be configured to verify or test connections before they hand out a lease, or to keep them alive.

Depending on the pool, this might or might not save you from the pain when a connection goes away. The pool can create a new one, hand it out and you’re good to go.

This is all fine when you execute small operations that complete fast or at least way before a connection may fail. It won't help you with longer running transactions involving multiple calls on a connection: the first calls may succeed, the next one fails. A pool cannot do anything here; when it handed out the connection, it worked.
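The lease-and-validate behaviour described above can be sketched in a few lines of plain Java. This toy pool is entirely hypothetical and only illustrates the idea; real pools (HikariCP, or the driver's internal one) do far more:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.Predicate;
import java.util.function.Supplier;

// A deliberately naive pool: real implementations add locking,
// timeouts, maximum sizes, keep-alive probes and much more.
class ToyPool<C> {

    private final Deque<C> idle = new ArrayDeque<>();
    private final Supplier<C> factory;
    private final Predicate<C> isHealthy;

    ToyPool(Supplier<C> factory, Predicate<C> isHealthy) {
        this.factory = factory;
        this.isHealthy = isHealthy;
    }

    C lease() {
        // Reuse an idle connection if it still tests healthy...
        C connection;
        while ((connection = idle.poll()) != null) {
            if (isHealthy.test(connection)) {
                return connection;
            }
        }
        // ...otherwise establish an (expensive) new one
        return factory.get();
    }

    // "Closing" a pooled connection just returns it to the pool
    void release(C connection) {
        idle.push(connection);
    }
}
```

Note that the health check runs only at lease time, which is exactly why a pool cannot help with a connection that dies mid-transaction.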

To sum this up: Stateful connections are costly to acquire and there are many good reasons to keep them around once created. Connection failures however are not something extraordinary, they happen frequently.

As a developer you have to create your application in such a way that it is able to mitigate connection failures. There are various tools that help you with the basic connection management in pools. When things fail in mid flight, it is up to you and your requirements to decide whether you want to fail hard or retry a transaction.
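Deciding to retry can be as simple as a loop around the unit of work. The sketch below is illustrative only: it assumes a fixed wait and naively retries on every RuntimeException, whereas production code should retry only errors known to be transient:

```java
import java.util.function.Supplier;

public class SimpleRetry {

    // Minimal application-level retry; names and policy are illustrative.
    static <T> T retry(Supplier<T> work, int maxAttempts, long waitMillis) {
        RuntimeException lastFailure = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return work.get();
            } catch (RuntimeException e) {
                // In real code, rethrow immediately unless the error is transient
                lastFailure = e;
            }
            try {
                Thread.sleep(waitMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
        }
        throw lastFailure;
    }
}
```

Libraries like Resilience4j, discussed later in this post, provide this pattern together with backoff, metrics and more.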

Neo4j

Why is this topic important?

Neo4j can be run as a database cluster, not only to make it very scalable but also make it very resilient against the occasional loss of a server or changes in infrastructure. Clients are usually routed to one of the members of a cluster that fulfills the needs of that client (either read-only operations or read-and-write operations). If the member to which the client is currently connected goes away, the cluster itself takes care of the cluster’s reorganization, but the client needs to handle the exceptional state: A stateful connection that has become stale.

That becomes even more relevant in cloud deployments of the Neo4j database, such as Neo4j Aura, in which the members of a cluster are rotated quite often as part of normal operations (software upgrades, resiliency etc…).

The Neo4j Driver

I am writing this post from a Java perspective, thus I mainly focus on the Neo4j Java driver, but the core aspects apply to each supported driver.

Each Neo4j driver instance does connection pooling for you. You point it to a server or a cluster and it creates an internal connection pool for you. With .session() (or .asyncSession() and .rxSession(), respectively) on the driver object, a session with an underlying connection is acquired and handed to you.

If there are no more connections in the pool or if all connections are closed, a new connection is created.

All the work in those sessions happens inside transactions. All the time.

Transactions are atomic units of work executing one or more Cypher statements. Transactions may contain read or write operations, and will generally be routed to an appropriate server for execution, where they will be carried out in their entirety. In case of a transaction failure, the whole transaction needs to be retried from the beginning.

The driver offers two modes in which the transaction management is done for you and one mode where you are responsible.

You will find the following information in the driver’s manual under “Transactions”.

Automatic commits: This is the simplest form: You run one query directly on the session. Opening a transaction before and committing it afterwards is done for you. This is probably also the most short-lived form of a transaction, depending only on how long your single statement takes to run.

Either way: Not only is the transaction managed for you, but making sure the session and its underlying physical connection are usable is also done for you. This all happens in one go. The statement will however not be retried if the connection breaks down during execution, which is what you would expect from this kind of operation; in this way it resembles the JDBC auto-commit mode.

Transactional functions: All the drivers provide an API that takes in a unit of work callback, defined in terms of the programming language. In the Java world that is a functional interface called TransactionWork<T> with a single method called T execute(Transaction tx). T represents a type parameter (the type of the object returned by the unit of work) and the one and only parameter is the ongoing transaction. That way it can be easily implemented using a lambda.


Those functions can be passed around inside the driver and they can be retried (re-executed) when things fail during execution. Failure can happen as described before (the acquisition of the physical connection fails) or in-flight (connection is lost) or on transient errors (like deadlocks or timeouts).

To allow the driver to safely execute those functions multiple times, there are some hard requirements:

  1. The functions must be idempotent and are not allowed to return an unconsumed result set from any ongoing query.
    Idempotency should be an obvious requirement: The function may be applied multiple times until success and it shouldn't have a different outcome on the second run than on the first try.
  2. The second requirement may not be as obvious. The function will use the passed Transaction object to execute queries. The result of those queries is tied to that transaction. The transaction will be closed after the function ran, in both failure and success states. At that very moment, the behaviour of the result becomes undefined, as it is tied to the transaction. If the function returns anything from the result, then it must be mapped into a stable object before the end of the function.
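The second requirement can be illustrated without any driver at all. Everything below is a hypothetical stand-in that merely mimics the behaviour: the lazy result becomes unusable once the transaction is closed, so the unit of work has to map rows into stable objects before it returns:

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class MaterializeDemo {

    // Purely hypothetical stand-in for a transaction whose results
    // are only valid while it is open.
    static class FakeTransaction implements AutoCloseable {

        private boolean open = true;

        Iterator<String> run() {
            Iterator<String> rows = List.of("The Matrix", "Cloud Atlas").iterator();
            return new Iterator<>() {
                @Override
                public boolean hasNext() {
                    if (!open) throw new IllegalStateException("Transaction closed");
                    return rows.hasNext();
                }

                @Override
                public String next() {
                    if (!open) throw new IllegalStateException("Transaction closed");
                    return rows.next();
                }
            };
        }

        @Override
        public void close() {
            this.open = false;
        }
    }

    static List<String> readTitles() {
        List<String> titles = new ArrayList<>();
        try (FakeTransaction tx = new FakeTransaction()) {
            // Mapping into stable Strings happens *inside* the unit of work;
            // returning the raw iterator instead would fail after close()
            tx.run().forEachRemaining(titles::add);
        }
        return titles;
    }
}
```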

You might wonder why two of the following examples require additional libraries or additional work instead of relying on the built-in retry mechanism:

Many application frameworks, Spring being one of them, run their own transaction management. Those application-level transactions have a bigger scope than the driver's: a larger unit of work, orchestrated by the application. They may even integrate with another transaction manager by chaining multiple transactions. In such scenarios we cannot just force the outer transaction into something the driver can understand. Even if we could, many users enjoy the ease of declarative transactions. They can annotate their business logic with @Transactional and the framework does the heavy lifting of wrapping the annotated functions inside a transaction.

Unmanaged transactions: Unmanaged transactions work by opening a session (which will ensure that at this very moment a working connection can be established), and then explicitly opening a transaction with beginTransaction(), working on it and committing as you see fit.

This is the API which many higher-level abstractions/libraries will use with the driver. Such abstractions — like Spring Data Neo4j — use transactions managed by the application (or framework) itself and translate them into database transactions accordingly. Mark Paluch and I spoke at Devoxx 2019 about those topics, find our reactive transaction master class here.

Unmanaged transactions provide the freedom to interact with database transactions as necessary — for example, keeping one open while a large result set is streamed and processed further — but delegate the responsibility of handling failures back to the client code.

Examples

The following examples are all Java based and use Spring Boot as an application runtime.

The reason for that: I work in the team that is responsible for our Spring integration and have a lot of Java experience, so in the end I can write idiomatic Java much better than, say, idiomatic Python or Go.

The main ideas should be portable to other languages too.

The examples all work on the Neo4j Movie Graph. For your convenience I have added Neo4j migrations to the setup of each project. They create the dataset for you.

Each of the different services offer a read-only REST service under http://localhost:8080/api/movies, giving you a list of movies.

A second endpoint, http://localhost:8080/api/movies/watched/, takes a movie title and records a "watch/view" of the movie. This endpoint requires authentication as couchpotato with password secret.

All three example services use the same MovieController to orchestrate a MovieService looking like this:

public interface MovieService {

    Collection<Movie> getAllMovies();

    Integer watchMovie(String userName, String title);
}

The implementations of the movie service, especially watchMovie, are bloated and complicated on purpose.

The general flow is

  1. getting the movie, then
  2. getting the person that is authenticated and then
  3. updating the number of times the movie is watched.

I know how to write this in one Cypher statement, but the idea is to have a slight window of time between operations on the database in which I can kill the connection or introduce arbitrary failure.

All the following examples are available on Github: Neo4j Java Driver retry examples.

Shared configuration

The examples share the following configuration

Config.builder()
    .withMaxConnectionLifetime(5, TimeUnit.MINUTES)
    .withMaxConnectionPoolSize(1)
    .withLeakedSessionsLogging();

or expressed as properties in Spring Boot 2.3 with our starter on the classpath in application.properties.

org.neo4j.driver.pool.max-connection-lifetime=5m
org.neo4j.driver.pool.metrics-enabled=true
org.neo4j.driver.pool.log-leaked-sessions=true
org.neo4j.driver.pool.max-connection-pool-size=1

or with Spring Boot 2.4 upwards as

spring.neo4j.pool.max-connection-lifetime=5m
spring.neo4j.pool.metrics-enabled=true
spring.neo4j.pool.log-leaked-sessions=true
spring.neo4j.pool.max-connection-pool-size=1

This is NOT a configuration I recommend in any form for production. Especially the pool size of one effectively disables the pool, but it allows for easily testing our retries via Neo4j's dbms.listConnections() and dbms.killConnection() functions.

Application using the Java driver

This describes the application named driver_with_tx_function in the GitHub repository. Not relevant for our example, but it uses spring-boot-starter-web, spring-boot-starter-security and neo4j-java-driver-spring-boot-starter, which gives you the Neo4j Java driver.

Given the service holds an instance of org.neo4j.driver.Driver like this:

@Service
public class MovieService {

    private final Driver driver;

    MovieService(Driver driver) {
        this.driver = driver;
    }
}

the function reading all the movies can be implemented like this:

public Collection<Movie> getAllMovies() {

    // This is a transactional function, a unit of work
    TransactionWork<List<Movie>> readAllMovies = tx -> {
        // A mapping function, extracted for readability
        Function<Record, Movie> recordToMovie =
            r -> new Movie(r.get("m").get("title").asString());

        // The only interaction with the database
        return tx.run(
            "MATCH (m:Movie) RETURN m ORDER BY m.title ASC"
        ).list(recordToMovie);
    };

    try (Session session = driver.session()) {
        // The actual moment the unit of work
        // is passed to the driver
        return session.readTransaction(readAllMovies);
    }
}

The whole unit of work is basically atomic. It doesn't modify state, so it is safe to retry. The result set is consumed before the unit of work is left (via the list(transformer) method). When passed to readTransaction, the driver tries to execute it for a maximum of 30s by default.

The ceremony looks very similar for a write scenario:

public Integer watchMovie(String userName, String title) {

    // The unit of work
    TransactionWork<Integer> watchMovie = tx -> {
        // Split onto multiple queries to have
        // some window for disaster
        var userId = tx.run(
            "MERGE (u:Person {name: $name}) RETURN id(u)",
            Map.of("name", userName)
        ).single().get(0).asLong();

        var movieId = tx.run(
            "MERGE (m:Movie {title: $title}) RETURN id(m)",
            Map.of("title", title)
        ).single().get(0).asLong();

        // With some random delay added as well
        InsertRandom.delay();

        var args = Map.of("movieId", movieId, "userId", userId);
        return tx.run(""
            + "MATCH (m:Movie), (u:Person) "
            + "WHERE id(m) = $movieId AND id(u) = $userId "
            + "WITH m, u "
            + "MERGE (u) - [w:WATCHED] -> (m) "
            + "SET w.number_of_times = COALESCE(w.number_of_times,0)+1 "
            + "RETURN w.number_of_times AS numberOfTimes", args)
            .single().get("numberOfTimes").asInt();
    };

    try (Session session = driver.session()) {
        // The actual call, this time
        // in a `writeTransaction`
        return session.writeTransaction(watchMovie);
    }
}

Either all the merge operations (i.e. graph updates) in those statements will be committed, or none at all. Care must be taken not to call a stored procedure that does internal commits, use a statement with PERIODIC COMMIT, or create other side effects.

The execution of the watchMovie unit of work will be retried for 30 seconds by default.
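That 30 second retry window is itself configurable when the driver is built. A sketch, assuming the 4.x Java driver's ConfigBuilder (verify the method name against the driver version you use):

```java
Config config = Config.builder()
    .withMaxConnectionLifetime(5, TimeUnit.MINUTES)
    .withMaxConnectionPoolSize(1)
    // Let transactional functions retry for up to one minute
    // instead of the default 30 seconds
    .withMaxTransactionRetryTime(1, TimeUnit.MINUTES)
    .build();
```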

Now let’s look at Spring’s @Transactional, Object-Graph-Mappers like Neo4j-OGM or for more value-add Spring Data Neo4j.

Application using Neo4j-OGM and Spring Data inside Spring transactions

The following example can be found in the sdn_ogm application of the example repository.

Spring offers a declarative way of defining transactional boundaries in the service layer of an application via the @Transactional annotation. This depends on Spring's TransactionManager, which is responsible for the scope and propagation of a transaction and also for deciding on which types of exceptions the transaction should be rolled back.

Spring’s transaction manager has no built-in understanding of retries. That is treated as an application level concern.

In addition to @Transactional, Spring transactions can also be used with the TransactionTemplate, which is a similar unit-of-work callback, but the restrictions mentioned just above stay valid.

Assume an OGM based service like this:

@Service
public class MovieServiceBasedOnPureOGM implements MovieService {

    private final org.neo4j.ogm.session.Session session;

    public MovieServiceBasedOnPureOGM(Session session) {
        this.session = session;
    }
}

The session is not a Driver, but an OGM session!

Looking at the read method above implemented with OGM, we find:

@Transactional(readOnly = true)
public Collection<Movie> getAllMovies() {
    return session.loadAll(Movie.class);
}

There's no way to use or inject the Neo4j driver's built-in retries, as the operation is just passed through the driver's APIs. The same is true for the write case.

Again, please note that this is of course implemented badly to test out retries:

@Transactional
public Integer watchMovie(String userName, String title) {
    var user = Optional.ofNullable(
        session.queryForObject(
            User.class,
            "MATCH (u:Person) WHERE u.name = $name " +
            "OPTIONAL MATCH (u) -[w:WATCHED] -> (m:Movie) " +
            "RETURN u, w, m",
            Map.of("name", userName)
        )).orElseGet(() -> new User(userName));

    var movie = Optional.ofNullable(
        session.queryForObject(
            Movie.class,
            "MATCH (m:Movie) WHERE m.title = $title RETURN m",
            Map.of("title", title))
        ).orElseGet(() -> new Movie(title));

    InsertRandom.delay();

    int numberOfTimes = user.watch(movie);
    session.save(user);
    return numberOfTimes;
}

getAllMovies and watchMovie now define our transactional units of work, as the lambdas in the previous section did before.

To avoid defining custom queries completely, we can replace the interaction with the session with Spring Data repositories like this:

@Service
public class MovieServiceBasedOnSDN implements MovieService {

    interface MovieRepository extends Neo4jRepository<Movie, Long> {
        Optional<Movie> findOneByTitle(String title);
    }

    interface UserRepository extends Neo4jRepository<User, Long> {
        Optional<User> findOneByName(String name);
    }

    private final MovieRepository movieRepository;

    private final UserRepository userRepository;

    public MovieServiceBasedOnSDN(
        MovieRepository movieRepository,
        UserRepository userRepository) {
        this.movieRepository = movieRepository;
        this.userRepository = userRepository;
    }

    @Override
    @Transactional(readOnly = true)
    public Collection<Movie> getAllMovies() {
        return (Collection<Movie>) movieRepository.findAll();
    }

    @Override
    @Transactional
    public Integer watchMovie(String userName, String title) {
        var user = userRepository.findOneByName(userName)
            .orElseGet(() -> new User(userName));
        var movie = movieRepository.findOneByTitle(title)
            .orElseGet(() -> new Movie(title));

        InsertRandom.delay();

        int numberOfTimes = user.watch(movie);
        userRepository.save(user);
        return numberOfTimes;
    }
}

The transactional units of work stay the same and the code reads better, but there's still no way to use the driver's built-in retry mechanism.

As explained earlier: Expect those things to fail! With the code in place, you can handle this on the calling side like this:

@PostMapping("/watched")
public Integer watched(
    Principal principal, @RequestBody String title
) {
    try {
        return this.movieService
            .watchMovie(principal.getName(), title);
    } catch (Exception e) {
        throw new ResponseStatusException(HttpStatus.I_AM_A_TEAPOT);
    }
}

Or do retries on your own in that catch block. Or leave it to the calling client, which can decide to auto-retry or ask the end user to do so.

Regardless of what you do: it is the application's responsibility to handle these errors!

One way of doing this is to use a library named Resilience4j. Resilience4j is a lightweight fault tolerance library inspired by Netflix Hystrix, but designed for functional programming.

The library offers not only retries, but also circuit breakers, bulkheads and more. In general, it offers several ways to make your application more resilient against the inevitable exceptional states and behaviors of other services it depends upon.

The easiest way to add Resilience4j to your Spring project is via a starter: io.github.resilience4j:resilience4j-spring-boot2:1.5.0. In addition, you have to add org.springframework.boot:spring-boot-starter-aop to enable declarative usage via the @Retry annotation.

Those dependencies give you support for configuring Resilience4j via application properties and provide all beans necessary in the Spring context.

Resilience4j can be configured programmatically but we are using the provided configuration properties:

# This represents the default config
resilience4j.retry.configs.default.max-retry-attempts=10
resilience4j.retry.configs.default.wait-duration=1s
# Those are the same exceptions the driver itself would retry on
resilience4j.retry.configs.default.retry-exceptions=\
org.neo4j.driver.exceptions.SessionExpiredException,\
org.neo4j.driver.exceptions.ServiceUnavailableException
# Only to make log entries appear immediate
resilience4j.retry.configs.default.event-consumer-buffer-size=1
resilience4j.retry.instances.neo4j.base-config=default

This creates a retry object named neo4j, which makes up to 10 attempts and waits for a second in between. It only retries on exceptions of the given types.

An exponential backoff interval can be enabled by setting resilience4j.retry.configs.default.enable-exponential-backoff=true.
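To see what that backoff does to the one-second wait configured above, here is the plain arithmetic. The helper below is not part of any API, and the multiplier of 1.5 is, to my knowledge, Resilience4j's default:

```java
public class BackoffDemo {

    // wait(n) = initialInterval * multiplier^(n - 1), n being the 1-based retry
    static long waitMillis(long initialMillis, double multiplier, int retry) {
        return (long) (initialMillis * Math.pow(multiplier, retry - 1));
    }

    public static void main(String[] args) {
        // 1s initial interval, assumed default multiplier of 1.5
        for (int retry = 1; retry <= 4; retry++) {
            System.out.printf("retry %d waits %dms%n",
                retry, waitMillis(1000, 1.5, retry));
        }
    }
}
```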

How to use this?


If you want to stick with the declarative approach, all you have to do is annotate the service class as a whole or individual methods with
@Retry(name = "neo4j") like this:

import io.github.resilience4j.retry.annotation.Retry;
import org.springframework.stereotype.Service;
import org.springframework.transaction.annotation.Transactional;

@Service
@Retry(name = "neo4j")
public class MovieServiceBasedOnPureOGM implements MovieService {

    private final org.neo4j.ogm.session.Session session;

    public MovieServiceBasedOnPureOGM(Session session) {
        this.session = session;
    }

    @Transactional(readOnly = true)
    public Collection<Movie> getAllMovies() {
        // Implementation see above
    }

    @Transactional
    public Integer watchMovie(String userName, String title) {
        // Implementation see above
    }
}

And that’s effectively all there is.

If you prefer doing it in a programmatic approach without using annotations, you can inject the registry of Retry objects into the calling side and run your transactional unit of work like this:

import io.github.resilience4j.retry.RetryRegistry;

import java.security.Principal;
import java.util.Collection;

import org.neo4j.tips.cluster.sdn_ogm.domain.Movie;
import org.neo4j.tips.cluster.sdn_ogm.domain.MovieService;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestBody;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RestController;

@RestController
@RequestMapping("/api/movies")
public class MovieController {

    private final MovieService movieService;

    private final RetryRegistry retryRegistry;

    public MovieController(
        MovieService movieService,
        RetryRegistry retryRegistry) {
        this.movieService = movieService;
        this.retryRegistry = retryRegistry;
    }

    @GetMapping({ "", "/" })
    public Collection<Movie> getMovies() {
        // Get the configured retry
        return retryRegistry.retry("neo4j")
            // Choose one of the fitting methods and
            // execute your service
            .executeSupplier(this.movieService::getAllMovies);
    }

    @PostMapping("/watched")
    public Integer watched(
        Principal principal, @RequestBody String title) {
        return retryRegistry.retry("neo4j")
            .executeSupplier(() ->
                this.movieService.watchMovie(principal.getName(), title));
    }
}

Please note that you cannot do this inside the service method annotated with @Transactional. If you did, you would get the boundaries exactly the wrong way around: the retry would happen inside the transaction, but you want the transaction itself to be retried.

The Neo4j driver itself also retries when it receives a transient exception from the server, except for two well-defined error codes. This is rather easy to replicate with a Java Predicate:

public class RetryOGMSDNExceptionPredicate implements Predicate<Throwable> {

    @Override
    public boolean test(Throwable throwable) {
        Throwable ex = throwable;
        if (throwable instanceof CypherException) {
            ex = throwable.getCause();
        }
        if (ex instanceof TransientException) {
            String code = ((TransientException) ex).code();
            return !"Neo.TransientError.Transaction.Terminated".equals(code)
                && !"Neo.TransientError.Transaction.LockClientStopped".equals(code);
        } else {
            return ex instanceof SessionExpiredException
                || ex instanceof ServiceUnavailableException;
        }
    }
}

As OGM happens to wrap exceptions it catches into a CypherException, we can unwrap those as well.

To add this predicate to your Resilience4j config, add this to your configuration:

resilience4j.retry.configs.default.retry-exception-predicate=\
your.package.RetryOGMSDNExceptionPredicate

Note: We will be adding a pre-built predicate to OGM that you can use for your convenience.

Application using Spring Data Neo4j 6 inside Spring transactions

The upcoming version 2.4 of Spring Boot will contain a completely revamped Spring Data Neo4j, without Neo4j-OGM but still with all the mapping features. The same application using a milestone of SDN 6 (formerly known as SDN/RX) is available as sdn6.

The predicate looks a bit different, but all the rest applies.

Running the examples

The examples require Java 11. I provided a simple client for the application. Build and run it like this:

./mvnw clean compile
./mvnw exec:java -Dexec.mainClass="org.neo4j.tips.cluster.client.Application"

It will keep on calling localhost:8080 and expects one of the services running.

To run the pure driver based server or the SDN/OGM examples, use

./mvnw spring-boot:run -Dspring-boot.run.arguments="--org.neo4j.driver.uri=neo4j://YOUR_DATABASE:7687 --org.neo4j.driver.authentication.password=YOURPASSWORD"

To run the SDN 6 example, the properties are a bit different

./mvnw spring-boot:run -Dspring-boot.run.arguments="--spring.neo4j.uri=neo4j://YOUR_DATABASE:7687 --spring.neo4j.authentication.password=YOURPASSWORD"

To make the SDN/OGM or the SDN 6 example use the Spring Data repositories, add --spring.profiles.active=use-sdn to the run arguments.

All applications provide metrics for the driver (how many connections have been created) under http://localhost:8080/actuator/metrics/neo4j.driver.connections.created.

The SDN/OGM and the SDN 6 applications that use Resilience4j provide additional metrics about the retries themselves.

Summary

The Neo4j Java Driver and libraries such as Neo4j-OGM and Spring Data Neo4j work just fine against Neo4j clusters and cloud solutions like Aura. All three transaction modes (auto-commit, managed and unmanaged transactions) can be used.

A library using unmanaged transactions works just fine in that dynamic environment.

However, applications must plan and prepare for connection failures — regardless of whether the database is deployed standalone or as a cluster. This is to be expected, just as with other databases. Connection failures can be mitigated by using the built-in retry mechanisms of our drivers or by using external solutions.

In the Java world, you have two options to deal with this for Neo4j:
Using the built-in mechanisms or a tool like Resilience4j.

Resilience4j allows shaping those retries in a very fine-grained way. We haven't discussed what happens after the n-th retry: either the operation fails completely or an alternative is called. Such a last resort would keep services available for the users, with retries enabled later on.
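Such a last resort can be sketched as a plain fallback wrapper. The helper below is only an illustration; Resilience4j ships its own decorators for this:

```java
import java.util.function.Function;
import java.util.function.Supplier;

public class Fallback {

    // Run the work; if it ultimately fails, answer with a fallback
    // (e.g. a cached or empty result) instead of propagating the error.
    static <T> T withFallback(Supplier<T> work, Function<RuntimeException, T> fallback) {
        try {
            return work.get();
        } catch (RuntimeException e) {
            return fallback.apply(e);
        }
    }
}
```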

If you can use neither transactional functions nor Resilience4j, make sure you keep the session object as short-lived as possible. The Neo4j drivers are smart enough to validate the physical connection before handing out a session. When you open a fresh one, use it and then close it immediately, the chances are rather low that you suffer from an in-flight loss of connectivity.

Then you still have the option of handling the error from the calling client’s side which might have more context available or can just ask the user what to do.

Happy Coding, Michael

Many thanks to Michael Hunger and Gerrit Meier for proofreading the post and fixing my errors.
