Event Store Migration and Conversion

Bert Laverman
AxonIQ
Dec 22, 2020

How to move or transform events in bulk

There are several situations where you might want to copy or move the contents of your Event Store, not just to build a view model, but to either move events to a new system or perform large-scale conversions on the contents. Quite often an Axon Server based Event Store can be moved using a simple file copy, but what if this isn’t possible, we’re moving between different event store implementations, or we only want a selection of the contents?

In this blog, we’ll look at some of the scenarios and the different approaches available.

Scenario 1: Moving from Axon Server to Axon Server

Suppose you started with Axon Server Standard Edition and want to move to Axon Server Enterprise Edition to take advantage of the enhanced availability or security options; the path for transferring the Event Store may then not be obvious, especially if you want to use the multi-context feature to consolidate several SE installations. Although the simple path is based on file copies, you may have a problem if the current installation does not give you easy access to the underlying file system. This situation may also arise when moving from a Kubernetes cluster using local disks to one using network or cloud storage, or when moving to a VM-based installation. Let’s start with the simplest scenarios, and look at how events and snapshots are stored.

Note: Most of what is discussed here is relevant for recovery scenarios as well, but it should not be applied in those situations without careful consideration. Getting Axon Server back up and running after you have lost part or all of its event store should include steps to minimize the actual changes applied, to prevent loss of data. In a migration scenario we can assume not only clean shutdowns, but also an empty target event store.

The Axon Server Event Store

Events and snapshots in Axon Server are stored in files that are pre-allocated on disk, which means they are administratively created at a certain size, even though no actual contents have been stored yet. On the filesystem this uses a facility called “sparse files”, where an index is kept of actual blocks of data that have been written, while the parts that are untouched are left absent. In practice this means you could have a file announced as being 256MiB in size, while in actuality using no data other than that of the directory entry. Write some bits and bobs to a few random locations within those claimed megabytes, and the size remains unchanged, while the disk usage has increased to cover those bits and bobs, rounded up to account for the actual blocks of data, often in multiples of 512 bytes. Copy the file however, and you’ll get the full size both in shown and used size, because it’ll behave as if all the data is there, just mostly zeroes, which luckily compresses nicely. For Axon Server the situation is a bit simpler, as it does no random walks through the file, but adds data from the start until the next bit no longer fits the limit of the configured segment size.

Axon Server by default creates its Event Store in the “data” subdirectory of where it’s living, together with the Control Database (a file named “axonserver-controldb.mv.db”), using one subdirectory per context. Axon Server SE has only a single context and it is named “default”, while EE has a directory per context, and in those directories you’ll see numbered files. Files ending in “.events” are used to store events, while those ending in “.snapshots” contain, you guessed it, the aggregate snapshots. These files are referred to as “segment files”, and new segments are created automatically when the old ones fill up. To enable Axon Server to quickly find data in the older segments, they will have indexes and bloom filters. If, on startup, these additional files do not exist, they will be created, so if we want to copy the contents of the entire event store, the event and snapshot files are the most important. However, going over a few gigabytes or terabytes of data to recreate indices is a costly activity, so unless we expressly want them recreated, copying them along is best.

Axon Server has a REST API endpoint at “/v1/backup/filenames” which returns the names of the segment files that have been “closed”, meaning they do not belong to the current segment. You can also add a parameter to pass the last segment already backed up, so the list will only contain files added since then. The store being append-only, we can be assured no changes have been made to the older segments. Also note that the endpoint only returns the names of the files, not their contents, so we still need to access the underlying disk to get them. The obvious alternative is to use an incremental backup service provided by your storage platform.

Please note that if your requirements for disaster recovery cannot be met with only the closed segments, you can copy the current segment as well, provided you also back up the Control DB to indicate how far Axon Server has progressed. If, in a full-cluster recovery situation, all three nodes have different ideas about this level of progress, the one with the most confirmed data will be elected leader and update the others. To correctly implement such strict requirements, however, you’ll also want to back up the Replication Logs, which (in an EE cluster) store the data ready for distribution. They contain the most recent changes consolidated per replication group, so if the group contains more than one context, you will not be able to single out a particular context. The part of the log which has been confirmed by all members is regularly cleaned, so the log will generally stay pretty small.
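
As an illustration, the endpoint can be called with plain Java. Note that the “lastSegmentBackedUp” query parameter name and the “AxonIQ-Access-Token” header are assumptions here; check the REST API documentation of your Axon Server version before relying on them:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class BackupFileList {

    // Builds the request URI; the "lastSegmentBackedUp" parameter name is an
    // assumption -- verify it against your Axon Server version's REST API docs.
    static URI backupUri(String baseUrl, String context, long lastSegment) {
        return URI.create(baseUrl + "/v1/backup/filenames"
                + "?context=" + context
                + "&lastSegmentBackedUp=" + lastSegment);
    }

    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(
                        backupUri("http://localhost:8024", "default", 0))
                .header("AxonIQ-Access-Token", "my-token") // only if access control is on
                .GET()
                .build();
        // The response body lists the closed segment files; actually copying
        // them from disk is up to you.
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());
    }
}
```

The list it returns only grows, so storing the last segment you passed makes the next call incremental.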

Summarizing, if your goal is to provide for a backup, you should at the very least copy all closed segments, or else be complete and include current segment, logs and Control DB. For SE there are no logs, but for the rest the same choice exists. For an EE-to-EE migration scenario, where you want to use the data to seed a new (different) context, you cannot reuse logs and Control DB, so another tactic must be applied. We’ll get back to that after the next section, which discusses how to restore the backup.

Seeding a Context with a Backup

To use the files we have just backed up, we need to copy them to the new server and place them in the correct directory. If this is an SE-to-SE, or even a straight EE-to-EE copy, you can include the directory itself in the backup and unpack the archive on the other side, but if you want the files to end up in a different context you only want the files themselves. When Axon Server starts up and the context is already known, it will notice the files and ensure that its understanding of the last segment matches what it sees. If the context is (as yet) unknown, it will ignore the files. When you then create the context, Axon Server will attempt to create a directory for it, and instead find the directory already populated.

Assuming default directories, the steps are:

1. Copy the backed-up event and snapshot files into the “data” subdirectory on the new server, in a directory named after the target context.
2. For SE, or for a context that already exists, do this while Axon Server is down, so it picks up the files at startup.
3. For a new context on EE, you can copy the files with Axon Server running (they will be ignored), and then create the context, which will find the directory already populated.

As mentioned before, during the initial check, if no index files are present, Axon Server will create them. It scans backwards from the before-last segment until a segment with an index is found. You should therefore check that there are no holes, and remove all index files after such a hole if you do find one. Another step in the check concerns the structural integrity of the last few segments. If Axon Server was shut down cleanly, it will only check the last two segments; if the shutdown was not clean, a configurable number of segments is checked. The property for this is “axoniq.axonserver.snapshot.validation-segments”, and it defaults to 10 segments.

After all checks have finished, Axon Server will determine the last segment and last position in that segment, and record it in the ControlDB. When the next event or snapshot comes in, or a heartbeat message is sent from the leader, it compares the new entry’s index with the expected location for the next entry. In case of a mismatch, most likely because it has an incomplete store, it communicates this to the current leader, who will provide the node with any missing data. If “axoniq.axonserver.replication.force-snapshot-on-join” is true, which is the default, the backlog is sent in a more efficient snapshot format (not to be confused with the aggregate snapshots) instead of the “normal” transaction-based packaging. Only when the node is up-to-date will it start partaking in the confirmation of transactions. Note that, if you start out with only the leader having a complete store, for example because they all had different opinions about the current state, the cluster will not be functional until a majority of the nodes are up-to-date.
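
For reference, the two properties mentioned above would appear in “axonserver.properties” like this (both values shown are the defaults):

```properties
# Number of segments to verify after an unclean shutdown
axoniq.axonserver.snapshot.validation-segments=10
# Send the backlog to a (re)joining node in the efficient snapshot format
axoniq.axonserver.replication.force-snapshot-on-join=true
```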

Programmatically Transferring Events

If, for whatever reason, you are unable to obtain copies of the Event Store files, but you have a working server (or cluster), you have two alternatives: you can use a “normal” TrackingEventProcessor and stream (a specific kind of) events normally, or you can plug into the (lower-level) event stream directly and get them that way. Both methods will actually let us inspect and optionally transform the events, but neither will include a stream of the snapshots. Note that if you do decide to modify the event payload or metadata, do not build a completely new EventMessage object, but only set the changed data on the retrieved one, or you’ll get new message identifiers and timestamps. With a programmatic approach you also gain the possibility to split a context, or even merge two or more, but then you need to generate new identifiers to ensure there are no gaps or overlapping segments.

Sometimes the simplest approach is the right one, and writing an event handler to copy events is a simple and easily controlled method. You have full control over what you do with the events, including the option to make changes to payload or meta-data. However, if you then also publish the events in the regular manner, you may not get an exact copy, because other events may be published to the target context as well. Additionally, if the context is not empty when you start, your global sequence numbers (or event tokens) will not match, although you can ensure the sequencing is preserved. The alternative is to use the Axon Server Connector and plug directly into a stream from Axon Server. On the writing side you can do the same, which allows for more efficient batching of groups of events. You still have the option of transforming events and meta-data, but you can now enforce that the sequence numbering remains unchanged. Given that Axon Server will complain if reuse of a sequence number is attempted, you should ensure no other applications are publishing to the same context.
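
A minimal sketch of such a copying handler could look like this, assuming Axon Framework and a separately configured EventBus that publishes to the target context; the class name and metadata key are our own invention:

```java
import java.util.Map;
import org.axonframework.eventhandling.EventBus;
import org.axonframework.eventhandling.EventHandler;
import org.axonframework.eventhandling.EventMessage;

// Register this as a tracking event processor against the source context,
// with `targetEventBus` configured to publish to the destination context.
public class CopyingEventHandler {

    private final EventBus targetEventBus;

    public CopyingEventHandler(EventBus targetEventBus) {
        this.targetEventBus = targetEventBus;
    }

    // Metadata we add during the copy; the key name is illustrative.
    static Map<String, Object> migrationMetaData(String sourceContext) {
        return Map.of("migration.sourceContext", sourceContext);
    }

    @EventHandler
    public void on(Object payload, EventMessage<?> message) {
        // andMetaData() returns a copy that keeps the original message
        // identifier and timestamp, as recommended above.
        targetEventBus.publish(message.andMetaData(migrationMetaData("old-context")));
    }
}
```

Because the handler receives the full EventMessage, transforming the payload or meta-data stays in one place, and the original identifiers survive the trip.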

To use the (new) Axon Server connector, all you need to do is to include the coordinates for it in your project:

<dependency>
    <groupId>io.axoniq</groupId>
    <artifactId>axonserver-connector-java</artifactId>
    <version>4.4.5</version>
</dependency>

Now you can obtain an event channel to Axon Server:

private EventChannel axonServerConnection(String servers, String context,
                                          String token, boolean useTls) {
    final String clientId = UUID.randomUUID().toString();
    AxonServerConnectionFactory.Builder connectionFactoryBuilder =
            AxonServerConnectionFactory
                    .forClient("migrationApplication", clientId)
                    .routingServers(servers)
                    .token(token);
    if (useTls) {
        connectionFactoryBuilder.useTransportSecurity();
    }
    return connectionFactoryBuilder
            .build()
            .connect(context)
            .eventChannel();
}

On this channel you can then either call “openStream()” to read events, or “appendEvents()” to store them. The first lets you pass the global sequence token you want to start at, while “getLastToken()” tells you how far the store has progressed.
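
Putting these together, a bare-bones copy loop could look as follows. This is a sketch only: it is single-threaded, skips error handling, and the exact token semantics of “openStream()” (inclusive or exclusive of the given token) should be verified against the connector’s documentation:

```java
import io.axoniq.axonserver.connector.event.EventChannel;
import io.axoniq.axonserver.connector.event.EventStream;
import io.axoniq.axonserver.grpc.event.Event;
import io.axoniq.axonserver.grpc.event.EventWithToken;

public class EventCopyLoop {

    // Pure helper: should we copy the event at this token, given an exclusive
    // upper bound? Pass Long.MAX_VALUE to copy everything.
    static boolean shouldCopy(long token, long until) {
        return token < until;
    }

    static void copy(EventChannel source, EventChannel target, long until)
            throws Exception {
        long from = target.getLastToken().get(); // resume where the target left off
        EventStream stream = source.openStream(from, 64);
        try {
            EventWithToken next;
            while (shouldCopy((next = stream.next()).getToken(), until)) {
                Event event = next.getEvent();    // transform here if needed
                target.appendEvents(event).get(); // wait for the confirmation
            }
        } finally {
            stream.close();
        }
    }
}
```

For larger volumes the connector also offers a transaction-style append API that batches multiple events per confirmation, which is where the efficiency gain mentioned above comes from.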

Scenario 2: Moving to Axon Server

If you started your application when Axon Server wasn’t available yet, or your company’s standard solutions didn’t include it, you may have an event store in an RDBMS using the Framework’s JpaEventStorageEngine or JdbcEventStorageEngine, or one in MongoDB using the MongoEventStorageEngine. As is discussed in an earlier blog-post, these storage engines work reasonably well but are far from ideal given the specific requirements for an “append-only” event store. Migration to an Axon Server cluster would solve that, and a handy migration tool is available for the job. However, if you’re combining it with an upgrade of the application, you could use the programmatic approach as well. It may also be that you’re not able to bring the application down for the migration, in which case you’ll have to use a two-stage process.

The first stage uses the migration tool to move the bulk of the events and snapshots; the second continues with a siphoning process that uses a tracking event processor to replay events to the new environment. Note that the migration tool keeps track of its progress, so it can be run several times to catch up before switching to the siphoning approach.

Moving to Axon Server using the Programmatic Approach

If the Migration Tool is not an option, you can also decide to “roll your own” migration utility, which may even be required if your starting point does not use the standard EventStorageEngine implementations. In that case, you can use the programmatic approach as discussed before, using the Axon Server Connector to store events efficiently, or publishing through the Framework. Note this also allows you to connect directly to the current Event Store, which gives you access to all necessary meta-data it may provide. Please do note that the global sequence number must follow a continuous range, so if the source has gaps, you’ll have to store that number in the meta-data and generate a fresh sequence instead.
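
To illustrate that last point, here is a self-contained sketch of such a renumbering step. The MigratedEvent record and the metadata key are purely illustrative stand-ins for whatever event representation your migration tool uses:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class SequenceRewriter {

    // Minimal stand-in for an event during migration: the new global sequence
    // number plus a metadata map (both names are illustrative).
    record MigratedEvent(long globalSequence, Map<String, Object> metaData) {}

    // Assign a fresh, gapless sequence starting at `firstToken`, keeping the
    // original number under a (hypothetical) metadata key for traceability.
    static List<MigratedEvent> renumber(List<Long> sourceSequences, long firstToken) {
        List<MigratedEvent> result = new ArrayList<>();
        long next = firstToken;
        for (Long original : sourceSequences) {
            Map<String, Object> meta = new LinkedHashMap<>();
            meta.put("migration.originalSequence", original);
            result.add(new MigratedEvent(next++, meta));
        }
        return result;
    }
}
```

A source with gaps (say 3, 7, 8) then comes out as the contiguous range 0, 1, 2, with each event still carrying its original number in the metadata.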

Scenario 3: Moving from Axon Server

As should be obvious by now, moving data from Axon Server to another event store will require the programmatic approach, as there are no ready-made tools available. Just as before, you can stream events using a regular handler, or else use the Axon Server Connector to read events in a more efficient manner. In a sense this method can also be used to create human-readable backup copies of the event store, for example in a JSON format. Combined with a corresponding JSON-to-Axon-Server tool, this would even allow for a customized backup implementation, where the storage format is vendor-neutral.
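
As a sketch of what such a vendor-neutral dump might look like, the following writes one event per line in a JSON Lines format. The field names are our own choice, and a real implementation should use a proper JSON library to handle string escaping:

```java
import java.util.Base64;

public class JsonEventExporter {

    // Serializes one event to a single JSON line. The payload is Base64-encoded,
    // since event payloads are arbitrary bytes. Note: this naive concatenation
    // assumes identifiers and type names contain no characters needing escaping.
    static String toJsonLine(long token, String identifier, String payloadType,
                             long timestamp, byte[] payload) {
        return "{\"token\":" + token
                + ",\"identifier\":\"" + identifier + "\""
                + ",\"payloadType\":\"" + payloadType + "\""
                + ",\"timestamp\":" + timestamp
                + ",\"payload\":\"" + Base64.getEncoder().encodeToString(payload) + "\"}";
    }
}
```

Reading such a file back and replaying it through the connector gives you the JSON-to-Axon-Server half of the round trip.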

Concluding Remarks

Migrating event data to and from Axon Server in bulk is an activity you should not take lightly, as it generally indicates a large change in your infrastructure. Not discussed in this article is of course the simplest approach for creating an off-site backup, which is using a Passive Backup Axon Server node, or a pair of Active ones. This will do fine for most requirements involving disaster recovery, where smaller disasters can simply be taken care of by having a cluster rather than a single node. If, however, your goal is to migrate data from one event store to another, such as when you first start using an Axon Server cluster, then I hope you will now see there are several options available, and for some, we have a ready-made solution available. Please let us know if you think we missed a scenario that needs covering or any other feedback you may have. As always, questions are welcome on our Discuss site, which is frequented by both the AxonIQ team and many users.
